Monotony and Surprise

Apostolico, Alberto

doi:10.1007/978-3-540-88869-7_2

Part of the book series: Natural Computing Series (NCS)

Abstract

This paper reviews models and tools that have emerged in recent years from the author’s work on the discovery of interesting or anomalous patterns in sequences. Whereas customary approaches to pattern discovery proceed from either a statistical or a syntactic characterization alone, the approaches described here share the unifying feature of combining these two descriptors in a tightly intertwined, composite paradigm, in which both syntactic structure and occurrence lists concur to define and identify a pattern in a subject. In turn, this supports a natural notion of pattern saturation, which makes it possible to partition patterns into equivalence classes over intervals of monotonicity of commonly adopted scores, in such a way that the subset of class representatives, consisting solely of saturated patterns, suffices to account for all patterns in the subject. The benefits consist not only of increased descriptive power, but especially of a mitigation of the often unmanageable roster of candidates unearthed in a discovery attempt, and of the daunting computational burden that goes with it.
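
As a rough illustration of the saturation idea, restricted to solid character strings, the sketch below groups the substrings of a text by their occurrence lists and keeps the longest member of each group as its representative. It assumes that two substrings are equivalent when they share the same set of starting positions; this is one simple reading of the abstract, not necessarily the chapter's own definition. The function names `saturated_classes` and `representatives` are hypothetical, and the brute-force enumeration is meant for small inputs only.

```python
from collections import defaultdict


def saturated_classes(text):
    """Group the distinct substrings of `text` by their start positions.

    Assumption (for illustration only): substrings with identical lists of
    starting positions are treated as equivalent. Each class is returned
    as a list of words ordered from shortest to longest.
    """
    starts = defaultdict(list)
    n = len(text)
    # Record every start position of every substring (quadratic brute force).
    for i in range(n):
        for j in range(i + 1, n + 1):
            starts[text[i:j]].append(i)

    # Substrings with the same occurrence list fall into the same class.
    classes = defaultdict(list)
    for word, positions in starts.items():
        classes[tuple(positions)].append(word)

    return {pos: sorted(words, key=len) for pos, words in classes.items()}


def representatives(text):
    """Return the longest word of each class, sorted alphabetically."""
    return sorted(words[-1] for words in saturated_classes(text).values())


if __name__ == "__main__":
    for positions, words in sorted(saturated_classes("abcabc").items()):
        print(positions, words)
    print(representatives("abcabc"))
```

Under this assumption every member of a class occurs the same number of times, so any score that depends on the count alone is constant within the class, and reporting one representative per class shortens the list of candidates without losing occurrence information.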

The applications of this paradigm as highlighted here are believed to point to a largely unexpressed potential. The specific pattern structures and configurations described include solid character strings, strings with errors, consensus sequences consisting of intermixed solid and wild characters, co- and multiple occurrences, and association rules thereof, etc. It is also outlined how, from a dual perspective, these constructs support novel paradigms of data compression, which leads to succinct descriptors, kernels, classification, and clustering methods of possible broader interest. Although largely inspired by biological sequence analysis, the ideas presented here apply to sequences of general origin, and mostly generalize to higher aggregates such as arrays, trees, and special types of graphs.

Work supported in part by the Italian Ministry of University and Research under the Bi-National Project FIRB RBIN04BYZ7, and by the Research Program of Georgia Tech.

Author information

Authors and Affiliations

Dipartimento di Ingegneria dell’Informazione, Università di Padova, Padova, Italy
Alberto Apostolico
College of Computing, Georgia Institute of Technology, 801 Atlantic Drive, Atlanta, GA, 30318, USA
Alberto Apostolico

Authors

Alberto Apostolico

Corresponding author

Correspondence to Alberto Apostolico.

Editor information

Editors and Affiliations

Dept. Computer Science, University of British Columbia, Main Mall 201-2366, Vancouver, V6T 1Z4, Canada
Anne Condon
Dept. Applied Mathematics, Weizmann Institute of Science, Rehovot, 76100, Israel
David Harel
Leiden Inst. Advanced Computer Science, Leiden University, Niels Bohrweg 1, Leiden, 2333 CA, Netherlands
Joost N. Kok
Turku Centre for Computer Science, Lemminkaisenkatu 14 A, Turku, 20520, Finland
Arto Salomaa
Computer Science, Computation, California Inst. of Technology, Pasadena, 91125, U.S.A.
Erik Winfree

About this chapter

Cite this chapter

Apostolico, A. (2009). Monotony and Surprise. In: Condon, A., Harel, D., Kok, J., Salomaa, A., Winfree, E. (eds) Algorithmic Bioprocesses. Natural Computing Series. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-540-88869-7_2

DOI: https://doi.org/10.1007/978-3-540-88869-7_2
Published: 13 August 2009
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-88868-0
Online ISBN: 978-3-540-88869-7
eBook Packages: Computer Science, Computer Science (R0)
