Efficient Enumeration of Phylogenetically Informative Substrings

  • Stanislav Angelov
  • Boulos Harb
  • Sampath Kannan
  • Sanjeev Khanna
  • Junhyong Kim
Part of the Lecture Notes in Computer Science book series (LNCS, volume 3909)


We study the problem of enumerating substrings that are common among genomes that share evolutionary descent. For example, one might want to enumerate all identical (and therefore conserved) substrings that are shared by all mammals and not found in non-mammals. Such a collection of substrings may be used to identify conserved subsequences or to construct sets of identifying substrings for branches of a phylogenetic tree. For two disjoint sets of genomes on a phylogenetic tree, a substring is called a discriminating substring, or a tag, if it is found in all of the genomes of one set and in none of the genomes of the other set. Given a phylogeny for a set of m species, each with a genome of length at most n, we develop a suffix-tree-based algorithm to find all tags in O(nm log² m) time. We also develop a sublinear-space algorithm (at the expense of running time) that is better suited to very large data sets. We next consider a stochastic model of evolution to understand how tags arise. We show that in this setting, a simple process of tag generation essentially captures all possible ways of generating tags. We use this insight to develop a faster tag discovery algorithm with a small chance of error. However, tags are not guaranteed to exist in a given data set. We thus generalize the notion of a tag from a single substring to a set of substrings, whereby each species in one set contains a large fraction of the substrings while each species in the other set contains only a small fraction of them. We study the complexity of this problem and give a simple linear-programming-based approach for finding approximate generalized tag sets. Finally, we use our tag enumeration algorithm to analyze a phylogeny containing 57 whole microbial genomes. We find tags for all nodes in the phylogeny except the root, for which we find generalized tag sets.
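
To make the two definitions in the abstract concrete, here is a minimal brute-force sketch in Python. It is not the suffix-tree algorithm described above and does not achieve its running time; it only illustrates what counts as a tag and as a generalized tag set. The function names, the restriction to substrings of a fixed length k, and the thresholds `hi` and `lo` are assumptions made for this illustration and do not come from the paper.

```python
def substrings(genome, k):
    """Return the set of all length-k substrings of a genome string."""
    return {genome[i:i + k] for i in range(len(genome) - k + 1)}


def find_tags(inside, outside, k):
    """Brute-force tag enumeration (illustrative only).

    A tag is a substring found in every genome of `inside` and in no
    genome of `outside`. Only substrings of length k are considered,
    which is a simplification assumed for this sketch.
    """
    if not inside:
        return []  # no genomes to share a substring
    # Substrings common to all genomes of the first set.
    common = set.intersection(*(substrings(g, k) for g in inside))
    # Discard anything that also occurs in a genome of the second set.
    for g in outside:
        common -= substrings(g, k)
    return sorted(common)


def is_generalized_tag_set(candidates, inside, outside, hi, lo):
    """Check the generalized-tag condition for a candidate substring set.

    Every genome in `inside` must contain at least a fraction `hi` of the
    candidates, and every genome in `outside` at most a fraction `lo`.
    The thresholds are hypothetical parameters of this sketch.
    """
    def fraction(genome):
        return sum(c in genome for c in candidates) / len(candidates)

    return (all(fraction(g) >= hi for g in inside)
            and all(fraction(g) <= lo for g in outside))
```

For the toy genomes `["ACGTAC", "TTACGT"]` and `["ACCTAG", "GGTACC"]` with k = 3, `find_tags` returns `["ACG", "CGT"]`: the substring `TAC` is shared by both genomes of the first set but also occurs in `GGTACC`, so it is not a tag.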


Keywords (added by machine, not by the authors): Internal Node; Substring Problem; Common Substrings; Input String; Clostridium Acetobutylicum





Copyright information

© Springer-Verlag Berlin Heidelberg 2006

Authors and Affiliations

  • Stanislav Angelov (1)
  • Boulos Harb (1)
  • Sampath Kannan (1)
  • Sanjeev Khanna (1)
  • Junhyong Kim (2)

  1. Department of Computer and Information Sciences, University of Pennsylvania
  2. Department of Biology, University of Pennsylvania
