Verbumculus and the discovery of unusual words

Apostolico, Alberto; Gong, Fang-Cheng; Lonardi, Stefano

doi:10.1007/BF02944783

Verbumculus and the discovery of unusual words

Published: January 2004

Volume 19, pages 22–41, (2004)
Cite this article

Journal of Computer Science and Technology Aims and scope Submit manuscript

Alberto Apostolico^1,2,
Fang-Cheng Gong³ &
Stefano Lonardi⁴

55 Accesses
15 Citations
Explore all metrics

Abstract

Measures relating word frequencies and expectations have been constantly of interest in Bioinformatics studies. With sequence data becoming massively available, exhaustive enumeration of such measures have become conceivalbe, and yet pose significant computational burdeneven when limited to words of bounded maximum length. In addition, the display of the huge tables possibly resulting from these counts poses practical problems of visualization and inference.

Verbumculus is a suite of software tools for the efficient and fast detection of over- or under-represented words in nucleotide sequences. The inner core ofVerbumculus rests on subtly interwoven properties of statistics, pattern matching and combinatories on words, that enable one to limit drastically anda priori the set of over-or under-represented candidate words of all lengths in a given sequence, thereby rendering it more feasible both to detect and visualize such words in a fast and practically useful way. This paper is devoted to the description of the facility at the outset and to report experimental results, ranging from simulations on synthetic data to the discovery of regulatory elements on the upstream regions of a set of genes of the yeast.

The softwareVerbumculus is accessible at http://www.cs.ucr.edu/~stelo/Verbumculus/or http://wwwdbl. dei.unipd.it/Verbumculus/

This is a preview of subscription content, log in via an institution to check access.

Access this article

Log in via an institution

Price excludes VAT (USA)
Tax calculation will be finalised during checkout.

Instant access to the full article PDF.

Institutional subscriptions

Space-Efficient Detection of Unusual Words

Fast Indexes for Gapped Pattern Matching

Quantiprot - a Python package for quantitative analysis of protein sequences

Article Open access 17 July 2017

References

Guyer M S, Collins F S. How is the human genome project doing, and what have we learned so far? InProc. Natl. Acad. Sci. U.S.A., 1995, 92: 10841–10848.
Collins F S, Patrinos A, Jordan Eet al. New goals for the U.S. human genome project: 1998–2003.Science, 1998, 282: 682–689.
Article Google Scholar
Fleischmann R D, Adams M Det al. Whole-genome random sequencing and assembly of Haemophilus influenzae Rd.Science, 1995, 269: 496–512.
Article Google Scholar
Schena M, Shalom D, Davis R Wet al. Quantitative monitoring of gene expression patterns with a complementary DNA microarray.Science, 1995, 270: 467–470.
Article Google Scholar
Lockhart D J, Dong Het al. Expression monitoring by hybridization to high-density oligonucleotide arrays.Nature Biotechnology, 1996, 14: 1675–1680.
Article Google Scholar
DeRisi J L, Iyer V R, Brown P O. Exploring the metabolblic and genetic control of gene expression on a genomic scale.Science, 1997, 278: 680–686.
Article Google Scholar
Chu S, DeRisi J L, Eisen Michael Bet al. The transcriptional program of sporulation in budding yeast.Science, October 1998, 282: 699–705.
Article Google Scholar
Apostolico A, Bock M E, Lonardi Set al. Efficient detection of unusual words.J. Comput. Bio., January 2000, 7(1/2): 71–94.
Article Google Scholar
Apostolico A, Bock M E, Lonardi S. Monotony of surprise and large-scale quest for unusual words (extended abstract). InProc. Research in Computational Molecular Biology (RECOMB), Myers G, Hannenhalli Set al. (Eds.), Washington DC, April 2002, pp.283–311. Also inJ. Comput. Bio., July 2003, 10: 3–4.
Pesole G, Prunella N, Liuni Set al. Wordup: An efficient algorithm for discovering statistically significant patterns in DNA sequences.Nucleic Acids Res., 1992, 20(11):2871–2875.
Article Google Scholar
van Helden J, André B, Collado-Vides J. Extracting regulatory sites from the upstream region of the yeast genes by computational analysis of oligonucleotides.J. Mol. Biol., 1998, 281: 827–842.
Article Google Scholar
Schbath S, Prum Bet al. Exceptional motifs in different Markov chain models for a statistical analysis of DNA sequences.J. Comput. Bio., 1995, 2: 417–437.
Google Scholar
Schbath S. An efficient statistic to detect over- and under-represented words in DNA sequences.J. Comput. Bio., 1997, 4: 189–192.
Google Scholar
Bräzma A, Jonassen I, Eidhammer Iet al. Approaches to the automatic discovery of patterns in biosequences.J. Comput Bio., 1998, 5(2): 277–304.
Google Scholar
Bräzma A, Jonassen I, Ukkonen Eet al. Predicting gene regulatory elements in silico on a genomic scale.Genome Research, 1998, 8(11): 1202–1215.
Google Scholar
Bailey T L, Elkan C. Unsupervised learning of multiple motifs in biopolymers using expectation maximization.Machine Learning, 1995, 21(1/2): 51–80.
Article Google Scholar
Jonassen I, Collins J F, Higgins D G Finding flexible patterns in unaligned protein sequences.Protein Science, 1995, 4: 1587–1595.
Google Scholar
Jonassen I. Efficient discovery of conserved patterns using a pattern graph.Comput. Appl. Biosci., 1997, 13: 509–522.
Google Scholar
Yada T, Totoki Y, Ishikawa Met al. Automatic extraction of motifs represented in the hidden Markov model from a number of DNA sequences.Bioinformatics, 1998, 14: 317–325.
Article Google Scholar
Califano A. SPLASH: Structural pattern localization analysis by seqeuntial histograming.Bioinformatics, 2000, 15: 341–357.
Article Google Scholar
Rigoutsos I, Floratos A. Combinatorial pattern discovery in biological sequences: TheTeiresias algorithm.Bioinformatics, 1998, 14(1): 55–67.
Article Google Scholar
Hertz G Z, Stormo G D Identifying DNA and protein patterns with statistically significant alignments of multiple sequences.Bioinformatics, 1999, 15: 563–577.
Article Google Scholar
Lawrence C E, Altschul S Fet al. Detecting subtle sequence signals: A Gibbs sampling strategy for multiple alignment.Science, October 1993, 262: 208–214.
Article Google Scholar
Neuwald A F, Liu J S, Lawrence C E. Gibbs motif sampling: Detecting bacterial outer membrane protein repeats.Protein Science, 1995, 4: 1618–1632.
Google Scholar
Pevzner P A, Sze S H Combinatorial approaches to finding subtle signals in DNA sequences. InProc. the Int. Conf. Intelligent Systems for Molecular Biology, AAAI Press, Menlo Park, CA, 2000, pp.269–278.
Google Scholar
Keich U, Pevzner P A. Finding motifs in the twilight zone. InAnnual Int. Conf. Computational Molecular Biology, Washington DC, April 2002, pp.195–204.
Buhler J, Tompa M. Finding motifs using random projections.J. Comput. Bio., 2002, 9(2): 225–242.
Article Google Scholar
Pavesi G, Mauri G, Pesole G An algorithm for finding signals of unknown length in DNA sequences. InProc. the Int. Conf. Intelligent Systems for Molecular Biology, AAAI Press, Menlo Park, CA, 2001, pp.S207-S214.
Google Scholar
Eskin E, Pevzner P A. Finding composite regulatory patterns in DNA sequences. InProc. the Int. Conf. Intelligent Systems for Molecular Biology, Bioinformatics AAAI Press, Menlo Park, CA, 2002, pp.S181-S188.
Google Scholar
Apostolico A, Galil Z (Eds.). Pattern Matching Algorithms. Oxford University Press. 1997.
Brendel V, Beckmann J Set al. Linguistics of nucleotide sequences: Morphology and comparison of vocabularies.J. Biomol. Struct. Dynamics, 1986, 4(1): 11–21.
Google Scholar
Stückle E E, Emmrich C, Grob U, Nielsen P J. Statistical analysis of nucleotide sequences.Nucleic Acids Res., 1990, 18(22): 6641–6647.
Article Google Scholar
Apostolico A. Pattern discovery and the algorithmics of surprise. InArtificial Intelligence and Heuristic, Methods for Bioinformatics, Frasconi P, Shamir R (Eds.), IOS Press, 2003, pp.111–127.
McCreight E M. A space-economical suffix tree construction algorithm.J. Assoc. Comput. Mach., April 1976, 23(2): 262–272.
MATH MathSciNet Google Scholar
Apostolico A. The myriad virtues of suffix trees. InCombinatorial Algorithms on Words, Vol. 12 ofNATO Advanced Science Institutes, Series F, Apostolico A, Galil Z (Eds), Berlin: Springer-Verlag, 1985, pp.85–96.
Google Scholar
Gusfield D. Algorithms on Strings, Trees, and Sequences: Computer Science and Computational Biology Cambridge University Press, 1997.
Hui L C K. Color set size problem with applications to string matching. InProc. the 3rd Annual Symp. Combinatorial Pattern Matching.Lecture Notes in Computer Science 644, Apostolico A, Crochemore Met al. (Eds.), Berlin: Springer-Verlag, 1992, pp.230–243.
Google Scholar
Gansner E R, Koutsofios E, North S, Vo K-P. A technique for drawing directed graphs.IEEE Trans. Software Eng., 1993, 19(3): 214–230.
Article Google Scholar
Leung M Y, Marsh G M, Speed T P. Over and underrepresentation of short DNA words in herpesvirus genomes.J. Comput. Bio., 1996, 3: 345–360.
Google Scholar
Apostolico A, Giancarlo R. Sequence alignment in molecular biology.J. Comput. Bio., 1998, 5(2): 173–196.
Article Google Scholar
Wingender E, Dietze P, Karas Het al. Transfac: A database on transcription factors and their DNA binding sites.Nucleic Acids Res., 1996, 24: 238–241. http://transfac.gbf-braunschweig.de/TRANSFAC/.
Article Google Scholar
Wingender E, Chen X, Hehl Ret al. Transfac: An integrated system for gene expression regulation.Nucleic Acids Res., 2000 28: 316–319. http://transfac.gbf-braunschweig.de/TRANSFAC/.
Article Google Scholar
Luche R M, Sumrada R, Cooper T G. A cis-acting element present in multiple genes serves as a repressor protein binding site for the yeast CAR1 gene.Mol. Cell. Biol. 1990, 10: 3884–3895.
Google Scholar
Strich R, Surosky R T, Steber Cet al. UME6 is a key regulator of nitrogen repression and meiotic development.Genes Dev., 1994, 8: 796–810.
Article Google Scholar
Amati B, Gasser S M. Drosophila scaffold-attached regions bind nuclear scaffolds and can function as MARS elements in both budding and fission yeast.Mol. Cell. Biol., 1990, 10: 5442–5454.
Google Scholar
Strissel P L, Dann H Aet al. Scaffold-associated regions in the human type I interferon gene cluster on the short arm of chromosome 9.Genomics, 1998, 47: 217–229.
Article Google Scholar
Gasser S M. Nuclear scaffold and high-order folding of eukaryotic DNA. InArchitecture of Eukaryotic Genes, Kahl G (Ed.), VCH Verlagsgeselschaft, Wienheim, Germary, 1988, pp.461–471.
Google Scholar
Boulikas T. Chromatin domains and prediction of MAR sequences.Int. Rev. Cytol., 1995, 162A: 279–388.
Article Google Scholar
Stief A, Winter D Met al. A nuclear DNA attachment element mediates elevated and position-independent gene activity.Nature, 1989, 341: 343–345.
Article Google Scholar
McKnight R A, Shamay A, Sankaran Let al. Matrixattachment regions can impart position-independent regulation of a tissue-specific gene in transgenic mice. InProc. Natl. Acad. Sci., 1992, 89: 6943–6947.
Nussinov R. Strong adenine clustering in nucleotide sequences.J. Theor. Biol., 1980, 85: 285–291.
Article Google Scholar
Gasser S M, Laemmli U K. Cohabitation of scaffold binding regions with upstream/enhancer elements of three developmentally regulated genes ofD. melanogaster.Cell, 1986, 46: 521–530.
Article Google Scholar

Download references

Author information

Authors and Affiliations

Dipartimento di Ingegneria dell' Informazione, Università di Padova, Padova, Italy
Alberto Apostolico
Department of Computer Sciences, Purdue University, Computer Sciences Building, 47907, West Lafayette, IN, U.S.A.
Alberto Apostolico
Celera Genomics, 45 W. Gude Drive, 20850, Rockville, MD, U.S.A.
Fang-Cheng Gong
Department of Computer Science and Engineering, University of California, 92521, Riverside, CA, U.S.A.
Stefano Lonardi

Authors

Alberto Apostolico
View author publications
You can also search for this author in PubMed Google Scholar
Fang-Cheng Gong
View author publications
You can also search for this author in PubMed Google Scholar
Stefano Lonardi
View author publications
You can also search for this author in PubMed Google Scholar

Corresponding author

Correspondence to Alberto Apostolico.

Additional information

Supported in part by the NSF of U.S.A. (Grant No. CCR-9700276), Purdue Research Foundation (Grant No. 690-1398-3145), the Italian Ministry of University and Research, and the Research Program of the University of Padova.

Supported by Purdue Research Foundation (Grant No.690-1398-3145), the Italian Ministry of University and Research, the Research Program of the University of Padova and Bourns College of Engineering, University of California, Riverside.

Alberto Apostolico (Dr. Eng., 1973, Univ. Naples) is a professor of computer engineering at Univ. Padova and professor of computer sciences at Purdue University. He is a fulbright scholar in 1974–75 at CMU, held visiting and permanent positions in the U.S. (UIUC, Rensselaer, Purdue, IBM) and Europe (U. of Salerno, U. of L' Aquila, LASI, U. of Paris, U. of London, King's Zif-Bielefeld, Renyi-Hungarian Acad of Science), and a full prof. in Italy since 1987, at DEI since 1992. His research interests are algorithmic analysis and design, with emphasis on pattern matching, on which subject he has authored more than 100 papers, and co-authored/edited 7 volumes. He serves on the Editorial Boards of Theor. Comp. Sci. Par. Proc. Let., J. of Comp. Biol., Chaos Th. and Appl., Springer Lecture Notes in Bioinformatics, Algorithmica (g.e.), Keynote at over 60, PC Member for over 50 international conferences He has been a reviewer for NSF, Canadian SERC, NATO, HSFP, Finland Acad Sci, Hong Kong and Israel Science Councils. He is a current or past member of ACM AICA, EATCS, IEEE. He has been (co-) recipient of U.S. (NSF, AFOSR, NIH) French, British, Italian (CNR, MURST, MIUR) and international (Fulbright, NATO, ESPRIT) grants, and of au IBM Faculty Award in 2002.

Fang-Cheng Gong (Ph.D 1995, Dept. Plant Sciences, Univ. Arizona) has held positions of graduate research associate at Dept. Plant Sciences, Univ. Arizona, Postdoc Fellow at the Dept. Plant Pathology of Univ. Arizona, Postdoc Fellow at, Dept. Biological Sciences, Purdue University, before joining the research staff of Celera Genomics. His research interests include plant molecular biology, microbial molecular genetics, and plant cell biology.

Stefano Lonardi (Ph.D., 2001, Purdue University) is an assistant professor of computer science & engineering at the Univ. California, Riverside. His research is currently focused on bioinformatics, data compression, information hiding, and data mining. He received his “Laurea” degree from Univ. Pisa in 1994. and his Ph.D. degree in computer science, from Purdue University. He also holds a Research Doctorate from the Univ. Padua (1999). He is a member of ACM, IEEE, Upsilon Pi Upsilon and Phi Kappa Phi honor societies, and the International Society for Computational Biology.

Rights and permissions

Reprints and permissions

About this article

Cite this article

Apostolico, A., Gong, FC. & Lonardi, S. Verbumculus and the discovery of unusual words. J. Comput. Sci. & Technol. 19, 22–41 (2004). https://doi.org/10.1007/BF02944783

Download citation

Received: 26 May 2003
Revised: 01 September 2003
Issue Date: January 2004
DOI: https://doi.org/10.1007/BF02944783

Keywords

Access this article

Log in via an institution

Price excludes VAT (USA)
Tax calculation will be finalised during checkout.

Instant access to the full article PDF.

Institutional subscriptions

Verbumculus and the discovery of unusual words

Abstract

Access this article

Similar content being viewed by others

Space-Efficient Detection of Unusual Words

Fast Indexes for Gapped Pattern Matching

Quantiprot - a Python package for quantitative analysis of protein sequences

References

Author information

Authors and Affiliations

Corresponding author

Additional information

Rights and permissions

About this article

Cite this article

Keywords

Navigation

Verbumculus and the discovery of unusual words

Abstract

Access this article

Similar content being viewed by others

Space-Efficient Detection of Unusual Words

Fast Indexes for Gapped Pattern Matching

Quantiprot - a Python package for quantitative analysis of protein sequences

References

Author information

Authors and Affiliations

Corresponding author

Additional information

Rights and permissions

About this article

Cite this article

Share this article

Keywords

Search

Navigation