Monotony and Surprise

Algorithmic Bioprocesses

Part of the book series: Natural Computing Series (NCS)

Abstract

This paper reviews models and tools that have emerged in recent years in the author’s work on the discovery of interesting or anomalous patterns in sequences. Whereas customary approaches to pattern discovery proceed from either a statistical or a syntactic characterization alone, the approaches described here share the unifying feature of combining these two descriptors in a solidly intertwined, composite paradigm, whereby both syntactic structure and occurrence lists concur to define and identify a pattern in a subject. In turn, this supports a natural notion of pattern saturation, which enables one to partition patterns into equivalence classes over intervals of monotonicity of commonly adopted scores, in such a way that the subset of class representatives, consisting solely of saturated patterns, suffices to account for all patterns in the subject. The benefits consist not only of increased descriptive power, but especially of a mitigation of the often unmanageable roster of candidates unearthed in a discovery attempt, and of the daunting computational burden that goes with it.

The applications of this paradigm as highlighted here are believed to point to a largely unexpressed potential. The specific pattern structures and configurations described include solid character strings, strings with errors, consensus sequences consisting of intermixed solid and wild characters, co- and multiple occurrences, and association rules thereof, etc. It is also outlined how, from a dual perspective, these constructs support novel paradigms of data compression, which leads to succinct descriptors, kernels, classification, and clustering methods of possible broader interest. Although largely inspired by biological sequence analysis, the ideas presented here apply to sequences of general origin, and mostly generalize to higher aggregates such as arrays, trees, and special types of graphs.
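As a rough illustration of the notion of pattern saturation mentioned in the abstract, the following Python sketch groups the solid substrings of a string by their occurrence lists and keeps the longest member of each class as its representative. It is a brute-force toy under stated assumptions, not the chapter's method: the function names `occurrence_classes` and `saturated_patterns` are hypothetical, occurrence lists are taken to be sets of start positions, and only extension to the right is considered. The chapter's own definitions and its efficient constructions may differ.

```python
from collections import defaultdict


def occurrence_classes(text):
    """Group every substring of `text` by its list of start positions.

    Assumption: two patterns are equivalent when they start at exactly
    the same positions. Enumerating all substrings takes quadratic time
    and is meant for illustration only.
    """
    starts = defaultdict(list)
    n = len(text)
    for i in range(n):
        for j in range(i + 1, n + 1):
            # Positions are appended in increasing order of i.
            starts[text[i:j]].append(i)

    classes = defaultdict(list)
    for word, positions in starts.items():
        classes[tuple(positions)].append(word)
    return classes


def saturated_patterns(text):
    """Return one representative per class: its longest member.

    Words sharing a start-position list are prefixes of one another,
    so the longest member of each class is unique.
    """
    return {
        positions: max(words, key=len)
        for positions, words in occurrence_classes(text).items()
    }


if __name__ == "__main__":
    for positions, word in sorted(saturated_patterns("abab").items()):
        print(positions, word)
```

In this toy setting the 7 distinct substrings of "abab" collapse into 4 classes, which hints at how class representatives can shrink the roster of candidates that a discovery procedure has to score.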

Work supported in part by the Italian Ministry of University and Research under the Bi-National Project FIRB RBIN04BYZ7, and by the Research Program of Georgia Tech.

Correspondence to Alberto Apostolico.

Copyright information

© 2009 Springer-Verlag Berlin Heidelberg

About this chapter

Cite this chapter

Apostolico, A. (2009). Monotony and Surprise. In: Condon, A., Harel, D., Kok, J., Salomaa, A., Winfree, E. (eds) Algorithmic Bioprocesses. Natural Computing Series. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-540-88869-7_2

  • DOI: https://doi.org/10.1007/978-3-540-88869-7_2

  • Publisher Name: Springer, Berlin, Heidelberg

  • Print ISBN: 978-3-540-88868-0

  • Online ISBN: 978-3-540-88869-7

  • eBook Packages: Computer Science, Computer Science (R0)
