Fast Computation of Entropic Profiles for the Detection of Conservation in Genomes

  • Matteo Comin
  • Morris Antonello
Part of the Lecture Notes in Computer Science book series (LNCS, volume 7986)

Abstract

The information theory has been used for quite some time in the area of computational biology. In this paper we discuss and improve the function Entropic Profile, introduced by Vinga and Almeida in [23]. The Entropic Profiler is a function of the genomic location that captures the importance of that region with respect to the whole genome. We provide a linear time linear space algorithm called Fast Entropic Profile, as opposed to the original quadratic implementation. Moreover we propose an alternative normalization that can be also efficiently implemented. We show that Fast EP is suitable for large genomes and for the discovery of motifs with unbounded length.

Keywords

pattern discovery information theory computational biology 

Preview

Unable to display preview. Download preview PDF.

Unable to display preview. Download preview PDF.

References

  1. 1.
    Apostolico, A., Comin, M., Parida, L.: Varun: Discovering Extensible Motifs under Saturation Constraints. IEEE/ACM Transactions on Computational Biology and Bioinformatics 7(4), 752–762 (2010)CrossRefGoogle Scholar
  2. 2.
    Apostolico, A., Comin, M., Parida, L.: Mining, compressing and classifying with extensible motifs. Algorithms for Molecular Biology 1, 4 (2006)CrossRefGoogle Scholar
  3. 3.
    Apostolico, A., Comin, M., Parida, L.: Bridging Lossy and Lossless Compression by Motif Pattern Discovery. In: Ahlswede, R., Bäumer, L., Cai, N., Aydinian, H., Blinovsky, V., Deppe, C., Mashurian, H. (eds.) General Theory of Information Transfer and Combinatorics. LNCS, vol. 4123, pp. 793–813. Springer, Heidelberg (2006)CrossRefGoogle Scholar
  4. 4.
    Apostolico, A., Comin, M., Parida, L.: Motifs in Ziv-Lempel-Welch Clef. In: Proceedings of IEEE DCC Data Compression Conference, pp. 72–81. Computer Society Press (2004)Google Scholar
  5. 5.
    Bernaola-Galván, P., Grosse, I., Carpena, P., Oliver, J., Román-Roldán, R., Stanley, H.: Finding Borders between Coding and Noncoding DNA Regions by an Entropic Segmentation Method. Physical Review Letters 85(6), 1342–1345Google Scholar
  6. 6.
    Comin, M., Parida, L.: Subtle motif discovery for the detection of DNA regulatory sites. In: Proceeding of Asia-Pacific Bioinformatics Conference, pp. 27–36 (2007)Google Scholar
  7. 7.
    Comin, M., Parida, L.: Detection of Subtle Variations as Consensus Motifs. Theoretical Computer Science 395(2-3), 158–170 (2008)MathSciNetMATHCrossRefGoogle Scholar
  8. 8.
    Comin, M., Verzotto, D.: Alignment-Free Phylogeny of Whole Genomes using Underlying Subwords. BMC Algorithms for Molecular Biology 7, 34 (2012)CrossRefGoogle Scholar
  9. 9.
    Comin, M., Verzotto, D.: Whole-Genome Phylogeny by Virtue of Unic Subwords. In: Proceedings of 23rd International Workshop on Database and Expert Systems Applications, BIOKDD, pp. 190–194 (2012)Google Scholar
  10. 10.
    Comin, M., Verzotto, D.: The Irredundant Class Method for Remote Homology Detection of Protein Sequences. Journal of Computational Biology 18(12), 1819–1829 (2011)CrossRefGoogle Scholar
  11. 11.
    Gene, Y., Burge, C.: Maximum Entropy Modeling of Short Sequence Motifs with Applications to RNA Splicing Signals. Journal of Computional Biology 11(2-3), 377–394 (2004)CrossRefGoogle Scholar
  12. 12.
    Hagenauer, J., Dawy, Z., Gobel, B., Hanus, P., Mueller, J.: Genomic Analysis using Methods from Information Theory. In: Information Theory Workshop, pp. 55–59 (2004)Google Scholar
  13. 13.
    Karlin, S., Mrazek, J., Campbell, A.: Frequent oligonucleotides and peptides of the Haemophilus influenzae genome. Nucleic Acids Res. 24, 4263–4272 (1996)CrossRefGoogle Scholar
  14. 14.
    Kurtz, S., Choudhuri, J., Ohlebusch, E., Schleiermacher, C., Stoye, J., Giegerich, R.: Reputer: The manifold applications of repeat analysis on a genome scale. Nucleic Acids Res. 29(22), 4633–4642 (2001)CrossRefGoogle Scholar
  15. 15.
    McCreight, E.M.: A space-economical suffix tree construction algorithm. Journal of ACM 23, 262–272 (1976)MathSciNetMATHCrossRefGoogle Scholar
  16. 16.
    Meek, C., Patel, J., Kasetty, S.: Oasis: An online and accurate technique for local-alignment searches on biological sequences. In: Proceedings of 29th International Conference on Very Large Databases, pp. 910–921 (2003)Google Scholar
  17. 17.
    Menconi, G., Marangoni, R.: A compression-based approach for coding sequences identification. I. Application to prokaryotic genomes. J. Comput Biol. 13(8), 1477–1488 (2006)Google Scholar
  18. 18.
    Nalla, V., Rogan, P.: Automated Splicing Mutation Analysis by Information Theory. Human Mutaion 25, 334–342 (2005)CrossRefGoogle Scholar
  19. 19.
    Schneider, T., Stormo, G., Gold, L., Ehrenfeucht, A.: Information content of binding sites on nucleotide sequences. Journal of Molecular Biology 188, 415–431 (1986)CrossRefGoogle Scholar
  20. 20.
    Shannon, C.: A Mathematical Theory of Communication. Bell System Technical Journal 27(3), 379–423 (1948)MathSciNetMATHGoogle Scholar
  21. 21.
    Sourice, S., Biaudet, V., El Karoui, M., Ehrlich, S.D., Gruss, A.: Identification of the Chi site of Haemophilus influenzae as several sequences related to the Escherichia coli Chi site. Mol. Microbiol. 27, 1021–1029 (1998)CrossRefGoogle Scholar
  22. 22.
    Ukkonen, E.: On-line construction of suffix trees. Algorithmica 14(3), 249–260 (1995)MathSciNetMATHCrossRefGoogle Scholar
  23. 23.
    Vinga, S., Almeida, J.S.: Local Rényi entropic profiles of DNA sequences. BMC Bioinformatics 8, 393 (2007)CrossRefGoogle Scholar
  24. 24.
    Yockey, H.: Origin of life on earth and Shannon’s theory of communication. Comput. Chem. 24(1), 105–123 (2000)MATHGoogle Scholar
  25. 25.
    Waterman, M.S.: An Introduction to Computational Biology: Maps, Sequences and Genomes. Chapman Hall (1995)Google Scholar
  26. 26.
    Ziv, J., Lempel, A.: A Universal Algorithm for Sequential Data Compression. IEEE Transactions on Information Theory 23(3), 337–343 (1977)MathSciNetMATHCrossRefGoogle Scholar

Copyright information

© Springer-Verlag Berlin Heidelberg 2013

Authors and Affiliations

  • Matteo Comin
    • 1
  • Morris Antonello
    • 1
  1. 1.Department of Information EngineeringUniversity of PadovaPadovaItaly

Personalised recommendations