Verbumculus and the discovery of unusual words

  • Alberto Apostolico
  • Fang-Cheng Gong
  • Stefano Lonardi
Article

Abstract

Measures relating word frequencies and expectations have been constantly of interest in Bioinformatics studies. With sequence data becoming massively available, exhaustive enumeration of such measures have become conceivalbe, and yet pose significant computational burdeneven when limited to words of bounded maximum length. In addition, the display of the huge tables possibly resulting from these counts poses practical problems of visualization and inference.

Verbumculus is a suite of software tools for the efficient and fast detection of over- or under-represented words in nucleotide sequences. The inner core ofVerbumculus rests on subtly interwoven properties of statistics, pattern matching and combinatories on words, that enable one to limit drastically anda priori the set of over-or under-represented candidate words of all lengths in a given sequence, thereby rendering it more feasible both to detect and visualize such words in a fast and practically useful way. This paper is devoted to the description of the facility at the outset and to report experimental results, ranging from simulations on synthetic data to the discovery of regulatory elements on the upstream regions of a set of genes of the yeast.

The softwareVerbumculus is accessible at http://www.cs.ucr.edu/~stelo/Verbumculus/or http://wwwdbl. dei.unipd.it/Verbumculus/

Keywords

Verbumeulus unusual words subword statistics pattern discovery regulatory elements suffix trees 

References

  1. [1]
    Guyer M S, Collins F S. How is the human genome project doing, and what have we learned so far? InProc. Natl. Acad. Sci. U.S.A., 1995, 92: 10841–10848.Google Scholar
  2. [2]
    Collins F S, Patrinos A, Jordan Eet al. New goals for the U.S. human genome project: 1998–2003.Science, 1998, 282: 682–689.CrossRefGoogle Scholar
  3. [3]
    Fleischmann R D, Adams M Det al. Whole-genome random sequencing and assembly of Haemophilus influenzae Rd.Science, 1995, 269: 496–512.CrossRefGoogle Scholar
  4. [4]
    Schena M, Shalom D, Davis R Wet al. Quantitative monitoring of gene expression patterns with a complementary DNA microarray.Science, 1995, 270: 467–470.CrossRefGoogle Scholar
  5. [5]
    Lockhart D J, Dong Het al. Expression monitoring by hybridization to high-density oligonucleotide arrays.Nature Biotechnology, 1996, 14: 1675–1680.CrossRefGoogle Scholar
  6. [6]
    DeRisi J L, Iyer V R, Brown P O. Exploring the metabolblic and genetic control of gene expression on a genomic scale.Science, 1997, 278: 680–686.CrossRefGoogle Scholar
  7. [7]
    Chu S, DeRisi J L, Eisen Michael Bet al. The transcriptional program of sporulation in budding yeast.Science, October 1998, 282: 699–705.CrossRefGoogle Scholar
  8. [8]
    Apostolico A, Bock M E, Lonardi Set al. Efficient detection of unusual words.J. Comput. Bio., January 2000, 7(1/2): 71–94.CrossRefGoogle Scholar
  9. [9]
    Apostolico A, Bock M E, Lonardi S. Monotony of surprise and large-scale quest for unusual words (extended abstract). InProc. Research in Computational Molecular Biology (RECOMB), Myers G, Hannenhalli Set al. (Eds.), Washington DC, April 2002, pp.283–311. Also inJ. Comput. Bio., July 2003, 10: 3–4.Google Scholar
  10. [10]
    Pesole G, Prunella N, Liuni Set al. Wordup: An efficient algorithm for discovering statistically significant patterns in DNA sequences.Nucleic Acids Res., 1992, 20(11):2871–2875.CrossRefGoogle Scholar
  11. [11]
    van Helden J, André B, Collado-Vides J. Extracting regulatory sites from the upstream region of the yeast genes by computational analysis of oligonucleotides.J. Mol. Biol., 1998, 281: 827–842.CrossRefGoogle Scholar
  12. [12]
    Schbath S, Prum Bet al. Exceptional motifs in different Markov chain models for a statistical analysis of DNA sequences.J. Comput. Bio., 1995, 2: 417–437.Google Scholar
  13. [13]
    Schbath S. An efficient statistic to detect over- and under-represented words in DNA sequences.J. Comput. Bio., 1997, 4: 189–192.Google Scholar
  14. [14]
    Bräzma A, Jonassen I, Eidhammer Iet al. Approaches to the automatic discovery of patterns in biosequences.J. Comput Bio., 1998, 5(2): 277–304.Google Scholar
  15. [15]
    Bräzma A, Jonassen I, Ukkonen Eet al. Predicting gene regulatory elements in silico on a genomic scale.Genome Research, 1998, 8(11): 1202–1215.Google Scholar
  16. [16]
    Bailey T L, Elkan C. Unsupervised learning of multiple motifs in biopolymers using expectation maximization.Machine Learning, 1995, 21(1/2): 51–80.CrossRefGoogle Scholar
  17. [17]
    Jonassen I, Collins J F, Higgins D G Finding flexible patterns in unaligned protein sequences.Protein Science, 1995, 4: 1587–1595.Google Scholar
  18. [18]
    Jonassen I. Efficient discovery of conserved patterns using a pattern graph.Comput. Appl. Biosci., 1997, 13: 509–522.Google Scholar
  19. [19]
    Yada T, Totoki Y, Ishikawa Met al. Automatic extraction of motifs represented in the hidden Markov model from a number of DNA sequences.Bioinformatics, 1998, 14: 317–325.CrossRefGoogle Scholar
  20. [20]
    Califano A. SPLASH: Structural pattern localization analysis by seqeuntial histograming.Bioinformatics, 2000, 15: 341–357.CrossRefGoogle Scholar
  21. [21]
    Rigoutsos I, Floratos A. Combinatorial pattern discovery in biological sequences: TheTeiresias algorithm.Bioinformatics, 1998, 14(1): 55–67.CrossRefGoogle Scholar
  22. [22]
    Hertz G Z, Stormo G D Identifying DNA and protein patterns with statistically significant alignments of multiple sequences.Bioinformatics, 1999, 15: 563–577.CrossRefGoogle Scholar
  23. [23]
    Lawrence C E, Altschul S Fet al. Detecting subtle sequence signals: A Gibbs sampling strategy for multiple alignment.Science, October 1993, 262: 208–214.CrossRefGoogle Scholar
  24. [24]
    Neuwald A F, Liu J S, Lawrence C E. Gibbs motif sampling: Detecting bacterial outer membrane protein repeats.Protein Science, 1995, 4: 1618–1632.Google Scholar
  25. [25]
    Pevzner P A, Sze S H Combinatorial approaches to finding subtle signals in DNA sequences. InProc. the Int. Conf. Intelligent Systems for Molecular Biology, AAAI Press, Menlo Park, CA, 2000, pp.269–278.Google Scholar
  26. [26]
    Keich U, Pevzner P A. Finding motifs in the twilight zone. InAnnual Int. Conf. Computational Molecular Biology, Washington DC, April 2002, pp.195–204.Google Scholar
  27. [27]
    Buhler J, Tompa M. Finding motifs using random projections.J. Comput. Bio., 2002, 9(2): 225–242.CrossRefGoogle Scholar
  28. [28]
    Pavesi G, Mauri G, Pesole G An algorithm for finding signals of unknown length in DNA sequences. InProc. the Int. Conf. Intelligent Systems for Molecular Biology, AAAI Press, Menlo Park, CA, 2001, pp.S207-S214.Google Scholar
  29. [29]
    Eskin E, Pevzner P A. Finding composite regulatory patterns in DNA sequences. InProc. the Int. Conf. Intelligent Systems for Molecular Biology, Bioinformatics AAAI Press, Menlo Park, CA, 2002, pp.S181-S188.Google Scholar
  30. [30]
    Apostolico A, Galil Z (Eds.). Pattern Matching Algorithms. Oxford University Press. 1997.Google Scholar
  31. [31]
    Brendel V, Beckmann J Set al. Linguistics of nucleotide sequences: Morphology and comparison of vocabularies.J. Biomol. Struct. Dynamics, 1986, 4(1): 11–21.Google Scholar
  32. [32]
    Stückle E E, Emmrich C, Grob U, Nielsen P J. Statistical analysis of nucleotide sequences.Nucleic Acids Res., 1990, 18(22): 6641–6647.CrossRefGoogle Scholar
  33. [33]
    Apostolico A. Pattern discovery and the algorithmics of surprise. InArtificial Intelligence and Heuristic, Methods for Bioinformatics, Frasconi P, Shamir R (Eds.), IOS Press, 2003, pp.111–127.Google Scholar
  34. [34]
    McCreight E M. A space-economical suffix tree construction algorithm.J. Assoc. Comput. Mach., April 1976, 23(2): 262–272.MATHMathSciNetGoogle Scholar
  35. [35]
    Apostolico A. The myriad virtues of suffix trees. InCombinatorial Algorithms on Words, Vol. 12 ofNATO Advanced Science Institutes, Series F, Apostolico A, Galil Z (Eds), Berlin: Springer-Verlag, 1985, pp.85–96.Google Scholar
  36. [36]
    Gusfield D. Algorithms on Strings, Trees, and Sequences: Computer Science and Computational Biology Cambridge University Press, 1997.Google Scholar
  37. [37]
    Hui L C K. Color set size problem with applications to string matching. InProc. the 3rd Annual Symp. Combinatorial Pattern Matching.Lecture Notes in Computer Science 644, Apostolico A, Crochemore Met al. (Eds.), Berlin: Springer-Verlag, 1992, pp.230–243.Google Scholar
  38. [38]
    Gansner E R, Koutsofios E, North S, Vo K-P. A technique for drawing directed graphs.IEEE Trans. Software Eng., 1993, 19(3): 214–230.CrossRefGoogle Scholar
  39. [39]
    Leung M Y, Marsh G M, Speed T P. Over and underrepresentation of short DNA words in herpesvirus genomes.J. Comput. Bio., 1996, 3: 345–360.Google Scholar
  40. [40]
    Apostolico A, Giancarlo R. Sequence alignment in molecular biology.J. Comput. Bio., 1998, 5(2): 173–196.CrossRefGoogle Scholar
  41. [41]
    Wingender E, Dietze P, Karas Het al. Transfac: A database on transcription factors and their DNA binding sites.Nucleic Acids Res., 1996, 24: 238–241. http://transfac.gbf-braunschweig.de/TRANSFAC/.CrossRefGoogle Scholar
  42. [42]
    Wingender E, Chen X, Hehl Ret al. Transfac: An integrated system for gene expression regulation.Nucleic Acids Res., 2000 28: 316–319. http://transfac.gbf-braunschweig.de/TRANSFAC/.CrossRefGoogle Scholar
  43. [43]
    Luche R M, Sumrada R, Cooper T G. A cis-acting element present in multiple genes serves as a repressor protein binding site for the yeast CAR1 gene.Mol. Cell. Biol. 1990, 10: 3884–3895.Google Scholar
  44. [44]
    Strich R, Surosky R T, Steber Cet al. UME6 is a key regulator of nitrogen repression and meiotic development.Genes Dev., 1994, 8: 796–810.CrossRefGoogle Scholar
  45. [45]
    Amati B, Gasser S M. Drosophila scaffold-attached regions bind nuclear scaffolds and can function as MARS elements in both budding and fission yeast.Mol. Cell. Biol., 1990, 10: 5442–5454.Google Scholar
  46. [46]
    Strissel P L, Dann H Aet al. Scaffold-associated regions in the human type I interferon gene cluster on the short arm of chromosome 9.Genomics, 1998, 47: 217–229.CrossRefGoogle Scholar
  47. [47]
    Gasser S M. Nuclear scaffold and high-order folding of eukaryotic DNA. InArchitecture of Eukaryotic Genes, Kahl G (Ed.), VCH Verlagsgeselschaft, Wienheim, Germary, 1988, pp.461–471.Google Scholar
  48. [48]
    Boulikas T. Chromatin domains and prediction of MAR sequences.Int. Rev. Cytol., 1995, 162A: 279–388.CrossRefGoogle Scholar
  49. [49]
    Stief A, Winter D Met al. A nuclear DNA attachment element mediates elevated and position-independent gene activity.Nature, 1989, 341: 343–345.CrossRefGoogle Scholar
  50. [50]
    McKnight R A, Shamay A, Sankaran Let al. Matrixattachment regions can impart position-independent regulation of a tissue-specific gene in transgenic mice. InProc. Natl. Acad. Sci., 1992, 89: 6943–6947.Google Scholar
  51. [51]
    Nussinov R. Strong adenine clustering in nucleotide sequences.J. Theor. Biol., 1980, 85: 285–291.CrossRefGoogle Scholar
  52. [52]
    Gasser S M, Laemmli U K. Cohabitation of scaffold binding regions with upstream/enhancer elements of three developmentally regulated genes ofD. melanogaster.Cell, 1986, 46: 521–530.CrossRefGoogle Scholar

Copyright information

© Science Press, Beijing China and Allerton Press Inc. 2004

Authors and Affiliations

  • Alberto Apostolico
    • 1
    • 2
  • Fang-Cheng Gong
    • 3
  • Stefano Lonardi
    • 4
  1. 1.Dipartimento di Ingegneria dell' InformazioneUniversità di PadovaPadovaItaly
  2. 2.Department of Computer SciencesPurdue UniversityWest LafayetteU.S.A.
  3. 3.Celera GenomicsRockvilleU.S.A.
  4. 4.Department of Computer Science and EngineeringUniversity of CaliforniaRiversideU.S.A.

Personalised recommendations