Skip to main content
Log in

Verbumculus and the discovery of unusual words

  • Published:
Journal of Computer Science and Technology Aims and scope Submit manuscript

Abstract

Measures relating word frequencies and expectations have been constantly of interest in Bioinformatics studies. With sequence data becoming massively available, exhaustive enumeration of such measures have become conceivalbe, and yet pose significant computational burdeneven when limited to words of bounded maximum length. In addition, the display of the huge tables possibly resulting from these counts poses practical problems of visualization and inference.

Verbumculus is a suite of software tools for the efficient and fast detection of over- or under-represented words in nucleotide sequences. The inner core ofVerbumculus rests on subtly interwoven properties of statistics, pattern matching and combinatories on words, that enable one to limit drastically anda priori the set of over-or under-represented candidate words of all lengths in a given sequence, thereby rendering it more feasible both to detect and visualize such words in a fast and practically useful way. This paper is devoted to the description of the facility at the outset and to report experimental results, ranging from simulations on synthetic data to the discovery of regulatory elements on the upstream regions of a set of genes of the yeast.

The softwareVerbumculus is accessible at http://www.cs.ucr.edu/~stelo/Verbumculus/or http://wwwdbl. dei.unipd.it/Verbumculus/

This is a preview of subscription content, log in via an institution to check access.

Access this article

Price excludes VAT (USA)
Tax calculation will be finalised during checkout.

Instant access to the full article PDF.

Similar content being viewed by others

References

  1. Guyer M S, Collins F S. How is the human genome project doing, and what have we learned so far? InProc. Natl. Acad. Sci. U.S.A., 1995, 92: 10841–10848.

  2. Collins F S, Patrinos A, Jordan Eet al. New goals for the U.S. human genome project: 1998–2003.Science, 1998, 282: 682–689.

    Article  Google Scholar 

  3. Fleischmann R D, Adams M Det al. Whole-genome random sequencing and assembly of Haemophilus influenzae Rd.Science, 1995, 269: 496–512.

    Article  Google Scholar 

  4. Schena M, Shalom D, Davis R Wet al. Quantitative monitoring of gene expression patterns with a complementary DNA microarray.Science, 1995, 270: 467–470.

    Article  Google Scholar 

  5. Lockhart D J, Dong Het al. Expression monitoring by hybridization to high-density oligonucleotide arrays.Nature Biotechnology, 1996, 14: 1675–1680.

    Article  Google Scholar 

  6. DeRisi J L, Iyer V R, Brown P O. Exploring the metabolblic and genetic control of gene expression on a genomic scale.Science, 1997, 278: 680–686.

    Article  Google Scholar 

  7. Chu S, DeRisi J L, Eisen Michael Bet al. The transcriptional program of sporulation in budding yeast.Science, October 1998, 282: 699–705.

    Article  Google Scholar 

  8. Apostolico A, Bock M E, Lonardi Set al. Efficient detection of unusual words.J. Comput. Bio., January 2000, 7(1/2): 71–94.

    Article  Google Scholar 

  9. Apostolico A, Bock M E, Lonardi S. Monotony of surprise and large-scale quest for unusual words (extended abstract). InProc. Research in Computational Molecular Biology (RECOMB), Myers G, Hannenhalli Set al. (Eds.), Washington DC, April 2002, pp.283–311. Also inJ. Comput. Bio., July 2003, 10: 3–4.

  10. Pesole G, Prunella N, Liuni Set al. Wordup: An efficient algorithm for discovering statistically significant patterns in DNA sequences.Nucleic Acids Res., 1992, 20(11):2871–2875.

    Article  Google Scholar 

  11. van Helden J, André B, Collado-Vides J. Extracting regulatory sites from the upstream region of the yeast genes by computational analysis of oligonucleotides.J. Mol. Biol., 1998, 281: 827–842.

    Article  Google Scholar 

  12. Schbath S, Prum Bet al. Exceptional motifs in different Markov chain models for a statistical analysis of DNA sequences.J. Comput. Bio., 1995, 2: 417–437.

    Google Scholar 

  13. Schbath S. An efficient statistic to detect over- and under-represented words in DNA sequences.J. Comput. Bio., 1997, 4: 189–192.

    Google Scholar 

  14. Bräzma A, Jonassen I, Eidhammer Iet al. Approaches to the automatic discovery of patterns in biosequences.J. Comput Bio., 1998, 5(2): 277–304.

    Google Scholar 

  15. Bräzma A, Jonassen I, Ukkonen Eet al. Predicting gene regulatory elements in silico on a genomic scale.Genome Research, 1998, 8(11): 1202–1215.

    Google Scholar 

  16. Bailey T L, Elkan C. Unsupervised learning of multiple motifs in biopolymers using expectation maximization.Machine Learning, 1995, 21(1/2): 51–80.

    Article  Google Scholar 

  17. Jonassen I, Collins J F, Higgins D G Finding flexible patterns in unaligned protein sequences.Protein Science, 1995, 4: 1587–1595.

    Google Scholar 

  18. Jonassen I. Efficient discovery of conserved patterns using a pattern graph.Comput. Appl. Biosci., 1997, 13: 509–522.

    Google Scholar 

  19. Yada T, Totoki Y, Ishikawa Met al. Automatic extraction of motifs represented in the hidden Markov model from a number of DNA sequences.Bioinformatics, 1998, 14: 317–325.

    Article  Google Scholar 

  20. Califano A. SPLASH: Structural pattern localization analysis by seqeuntial histograming.Bioinformatics, 2000, 15: 341–357.

    Article  Google Scholar 

  21. Rigoutsos I, Floratos A. Combinatorial pattern discovery in biological sequences: TheTeiresias algorithm.Bioinformatics, 1998, 14(1): 55–67.

    Article  Google Scholar 

  22. Hertz G Z, Stormo G D Identifying DNA and protein patterns with statistically significant alignments of multiple sequences.Bioinformatics, 1999, 15: 563–577.

    Article  Google Scholar 

  23. Lawrence C E, Altschul S Fet al. Detecting subtle sequence signals: A Gibbs sampling strategy for multiple alignment.Science, October 1993, 262: 208–214.

    Article  Google Scholar 

  24. Neuwald A F, Liu J S, Lawrence C E. Gibbs motif sampling: Detecting bacterial outer membrane protein repeats.Protein Science, 1995, 4: 1618–1632.

    Google Scholar 

  25. Pevzner P A, Sze S H Combinatorial approaches to finding subtle signals in DNA sequences. InProc. the Int. Conf. Intelligent Systems for Molecular Biology, AAAI Press, Menlo Park, CA, 2000, pp.269–278.

    Google Scholar 

  26. Keich U, Pevzner P A. Finding motifs in the twilight zone. InAnnual Int. Conf. Computational Molecular Biology, Washington DC, April 2002, pp.195–204.

  27. Buhler J, Tompa M. Finding motifs using random projections.J. Comput. Bio., 2002, 9(2): 225–242.

    Article  Google Scholar 

  28. Pavesi G, Mauri G, Pesole G An algorithm for finding signals of unknown length in DNA sequences. InProc. the Int. Conf. Intelligent Systems for Molecular Biology, AAAI Press, Menlo Park, CA, 2001, pp.S207-S214.

    Google Scholar 

  29. Eskin E, Pevzner P A. Finding composite regulatory patterns in DNA sequences. InProc. the Int. Conf. Intelligent Systems for Molecular Biology, Bioinformatics AAAI Press, Menlo Park, CA, 2002, pp.S181-S188.

    Google Scholar 

  30. Apostolico A, Galil Z (Eds.). Pattern Matching Algorithms. Oxford University Press. 1997.

  31. Brendel V, Beckmann J Set al. Linguistics of nucleotide sequences: Morphology and comparison of vocabularies.J. Biomol. Struct. Dynamics, 1986, 4(1): 11–21.

    Google Scholar 

  32. Stückle E E, Emmrich C, Grob U, Nielsen P J. Statistical analysis of nucleotide sequences.Nucleic Acids Res., 1990, 18(22): 6641–6647.

    Article  Google Scholar 

  33. Apostolico A. Pattern discovery and the algorithmics of surprise. InArtificial Intelligence and Heuristic, Methods for Bioinformatics, Frasconi P, Shamir R (Eds.), IOS Press, 2003, pp.111–127.

  34. McCreight E M. A space-economical suffix tree construction algorithm.J. Assoc. Comput. Mach., April 1976, 23(2): 262–272.

    MATH  MathSciNet  Google Scholar 

  35. Apostolico A. The myriad virtues of suffix trees. InCombinatorial Algorithms on Words, Vol. 12 ofNATO Advanced Science Institutes, Series F, Apostolico A, Galil Z (Eds), Berlin: Springer-Verlag, 1985, pp.85–96.

    Google Scholar 

  36. Gusfield D. Algorithms on Strings, Trees, and Sequences: Computer Science and Computational Biology Cambridge University Press, 1997.

  37. Hui L C K. Color set size problem with applications to string matching. InProc. the 3rd Annual Symp. Combinatorial Pattern Matching.Lecture Notes in Computer Science 644, Apostolico A, Crochemore Met al. (Eds.), Berlin: Springer-Verlag, 1992, pp.230–243.

    Google Scholar 

  38. Gansner E R, Koutsofios E, North S, Vo K-P. A technique for drawing directed graphs.IEEE Trans. Software Eng., 1993, 19(3): 214–230.

    Article  Google Scholar 

  39. Leung M Y, Marsh G M, Speed T P. Over and underrepresentation of short DNA words in herpesvirus genomes.J. Comput. Bio., 1996, 3: 345–360.

    Google Scholar 

  40. Apostolico A, Giancarlo R. Sequence alignment in molecular biology.J. Comput. Bio., 1998, 5(2): 173–196.

    Article  Google Scholar 

  41. Wingender E, Dietze P, Karas Het al. Transfac: A database on transcription factors and their DNA binding sites.Nucleic Acids Res., 1996, 24: 238–241. http://transfac.gbf-braunschweig.de/TRANSFAC/.

    Article  Google Scholar 

  42. Wingender E, Chen X, Hehl Ret al. Transfac: An integrated system for gene expression regulation.Nucleic Acids Res., 2000 28: 316–319. http://transfac.gbf-braunschweig.de/TRANSFAC/.

    Article  Google Scholar 

  43. Luche R M, Sumrada R, Cooper T G. A cis-acting element present in multiple genes serves as a repressor protein binding site for the yeast CAR1 gene.Mol. Cell. Biol. 1990, 10: 3884–3895.

    Google Scholar 

  44. Strich R, Surosky R T, Steber Cet al. UME6 is a key regulator of nitrogen repression and meiotic development.Genes Dev., 1994, 8: 796–810.

    Article  Google Scholar 

  45. Amati B, Gasser S M. Drosophila scaffold-attached regions bind nuclear scaffolds and can function as MARS elements in both budding and fission yeast.Mol. Cell. Biol., 1990, 10: 5442–5454.

    Google Scholar 

  46. Strissel P L, Dann H Aet al. Scaffold-associated regions in the human type I interferon gene cluster on the short arm of chromosome 9.Genomics, 1998, 47: 217–229.

    Article  Google Scholar 

  47. Gasser S M. Nuclear scaffold and high-order folding of eukaryotic DNA. InArchitecture of Eukaryotic Genes, Kahl G (Ed.), VCH Verlagsgeselschaft, Wienheim, Germary, 1988, pp.461–471.

    Google Scholar 

  48. Boulikas T. Chromatin domains and prediction of MAR sequences.Int. Rev. Cytol., 1995, 162A: 279–388.

    Article  Google Scholar 

  49. Stief A, Winter D Met al. A nuclear DNA attachment element mediates elevated and position-independent gene activity.Nature, 1989, 341: 343–345.

    Article  Google Scholar 

  50. McKnight R A, Shamay A, Sankaran Let al. Matrixattachment regions can impart position-independent regulation of a tissue-specific gene in transgenic mice. InProc. Natl. Acad. Sci., 1992, 89: 6943–6947.

  51. Nussinov R. Strong adenine clustering in nucleotide sequences.J. Theor. Biol., 1980, 85: 285–291.

    Article  Google Scholar 

  52. Gasser S M, Laemmli U K. Cohabitation of scaffold binding regions with upstream/enhancer elements of three developmentally regulated genes ofD. melanogaster.Cell, 1986, 46: 521–530.

    Article  Google Scholar 

Download references

Author information

Authors and Affiliations

Authors

Corresponding author

Correspondence to Alberto Apostolico.

Additional information

Supported in part by the NSF of U.S.A. (Grant No. CCR-9700276), Purdue Research Foundation (Grant No. 690-1398-3145), the Italian Ministry of University and Research, and the Research Program of the University of Padova.

Supported by Purdue Research Foundation (Grant No.690-1398-3145), the Italian Ministry of University and Research, the Research Program of the University of Padova and Bourns College of Engineering, University of California, Riverside.

Alberto Apostolico (Dr. Eng., 1973, Univ. Naples) is a professor of computer engineering at Univ. Padova and professor of computer sciences at Purdue University. He is a fulbright scholar in 1974–75 at CMU, held visiting and permanent positions in the U.S. (UIUC, Rensselaer, Purdue, IBM) and Europe (U. of Salerno, U. of L' Aquila, LASI, U. of Paris, U. of London, King's Zif-Bielefeld, Renyi-Hungarian Acad of Science), and a full prof. in Italy since 1987, at DEI since 1992. His research interests are algorithmic analysis and design, with emphasis on pattern matching, on which subject he has authored more than 100 papers, and co-authored/edited 7 volumes. He serves on the Editorial Boards of Theor. Comp. Sci. Par. Proc. Let., J. of Comp. Biol., Chaos Th. and Appl., Springer Lecture Notes in Bioinformatics, Algorithmica (g.e.), Keynote at over 60, PC Member for over 50 international conferences He has been a reviewer for NSF, Canadian SERC, NATO, HSFP, Finland Acad Sci, Hong Kong and Israel Science Councils. He is a current or past member of ACM AICA, EATCS, IEEE. He has been (co-) recipient of U.S. (NSF, AFOSR, NIH) French, British, Italian (CNR, MURST, MIUR) and international (Fulbright, NATO, ESPRIT) grants, and of au IBM Faculty Award in 2002.

Fang-Cheng Gong (Ph.D 1995, Dept. Plant Sciences, Univ. Arizona) has held positions of graduate research associate at Dept. Plant Sciences, Univ. Arizona, Postdoc Fellow at the Dept. Plant Pathology of Univ. Arizona, Postdoc Fellow at, Dept. Biological Sciences, Purdue University, before joining the research staff of Celera Genomics. His research interests include plant molecular biology, microbial molecular genetics, and plant cell biology.

Stefano Lonardi (Ph.D., 2001, Purdue University) is an assistant professor of computer science & engineering at the Univ. California, Riverside. His research is currently focused on bioinformatics, data compression, information hiding, and data mining. He received his “Laurea” degree from Univ. Pisa in 1994. and his Ph.D. degree in computer science, from Purdue University. He also holds a Research Doctorate from the Univ. Padua (1999). He is a member of ACM, IEEE, Upsilon Pi Upsilon and Phi Kappa Phi honor societies, and the International Society for Computational Biology.

Rights and permissions

Reprints and permissions

About this article

Cite this article

Apostolico, A., Gong, FC. & Lonardi, S. Verbumculus and the discovery of unusual words. J. Comput. Sci. & Technol. 19, 22–41 (2004). https://doi.org/10.1007/BF02944783

Download citation

  • Received:

  • Revised:

  • Issue Date:

  • DOI: https://doi.org/10.1007/BF02944783

Keywords

Navigation