Abstract
This paper reviews models and tools that have emerged in recent years in the author’s work on the discovery of interesting or anomalous patterns in sequences. Whereas customary approaches to pattern discovery proceed from either a statistical or a syntactic characterization alone, the approaches described here share the unifying feature of combining these two descriptors in a tightly intertwined, composite paradigm, whereby syntactic structure and occurrence lists jointly define and identify a pattern in a subject. In turn, this supports a natural notion of pattern saturation, which enables one to partition patterns into equivalence classes over intervals of monotonicity of commonly adopted scores, in such a way that the subset of class representatives, consisting solely of saturated patterns, suffices to account for all patterns in the subject. The benefits consist not only of increased descriptive power, but especially of a mitigation of the often unmanageable roster of candidates unearthed in a discovery attempt, and of the daunting computational burden that goes with it.
The applications of this paradigm highlighted here are believed to point to a largely unexpressed potential. The specific pattern structures and configurations described include solid character strings, strings with errors, consensus sequences consisting of intermixed solid and wild characters, co-occurrences and multiple occurrences, and association rules thereof. It is also outlined how, from a dual perspective, these constructs support novel paradigms of data compression, which in turn lead to succinct descriptors, kernels, and classification and clustering methods of possibly broader interest. Although largely inspired by biological sequence analysis, the ideas presented here apply to sequences of general origin, and mostly generalize to higher aggregates such as arrays, trees, and special types of graphs.
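The notion of saturation mentioned above can be illustrated, for the simplest case of solid strings only, by a brute-force sketch. It is not the method of the chapter: it assumes that two substrings are equivalent when they share the same list of starting positions, and it takes the longest member of each class as the saturated representative. The function name `saturated_substrings` is hypothetical, and the quadratic enumeration is for illustration only.

```python
from collections import defaultdict


def saturated_substrings(text):
    """Map each occurrence list (tuple of start positions) to the longest
    substring having exactly that list.

    Assumption for illustration: substrings are equivalent when their
    start-position lists coincide. Brute force, not an efficient algorithm.
    """
    # Occurrence list of every distinct substring, in increasing position order.
    occurrences = defaultdict(list)
    n = len(text)
    for i in range(n):
        for j in range(i + 1, n + 1):
            occurrences[text[i:j]].append(i)

    # Group substrings that share the same occurrence list.
    classes = defaultdict(list)
    for word, positions in occurrences.items():
        classes[tuple(positions)].append(word)

    # Members of a class are prefixes of one another, so the longest one
    # cannot be extended to the right without losing an occurrence.
    return {positions: max(words, key=len) for positions, words in classes.items()}


if __name__ == "__main__":
    for positions, word in sorted(saturated_substrings("abab").items()):
        print(positions, word)
```

For the string `abab`, the seven distinct substrings fall into four classes, represented by `abab`, `ab`, `bab`, and `b`. Since every member of a class has the same number of occurrences, a score that is monotone in pattern length within a class needs to be evaluated only at the representative.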
Work supported in part by the Italian Ministry of University and Research under the Bi-National Project FIRB RBIN04BYZ7, and by the Research Program of Georgia Tech.
Copyright information
© 2009 Springer-Verlag Berlin Heidelberg
About this chapter
Cite this chapter
Apostolico, A. (2009). Monotony and Surprise. In: Condon, A., Harel, D., Kok, J., Salomaa, A., Winfree, E. (eds) Algorithmic Bioprocesses. Natural Computing Series. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-540-88869-7_2
DOI: https://doi.org/10.1007/978-3-540-88869-7_2
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-88868-0
Online ISBN: 978-3-540-88869-7
eBook Packages: Computer Science, Computer Science (R0)