Research in Computational Molecular Biology
Volume 3909 of the series Lecture Notes in Computer Science pp 248-264
Efficient Enumeration of Phylogenetically Informative Substrings
- Stanislav Angelov, Department of Computer and Information Sciences, University of Pennsylvania
- Boulos Harb, Department of Computer and Information Sciences, University of Pennsylvania
- Sampath Kannan, Department of Computer and Information Sciences, University of Pennsylvania
- Sanjeev Khanna, Department of Computer and Information Sciences, University of Pennsylvania
- Junhyong Kim, Department of Biology, University of Pennsylvania
Abstract
We study the problem of enumerating substrings that are common amongst genomes that share evolutionary descent. For example, one might want to enumerate all identical (therefore conserved) substrings that are shared between all mammals and not found in non-mammals. Such a collection of substrings may be used to identify conserved subsequences or to construct sets of identifying substrings for branches of a phylogenetic tree. For two disjoint sets of genomes on a phylogenetic tree, a substring is called a discriminating substring or a tag if it is found in all of the genomes of one set and none of the genomes of the other set. Given a phylogeny for a set of m species, each with a genome of length at most n, we develop a suffix-tree based algorithm to find all tags in O(nm log^2 m) time. We also develop a sublinear space algorithm (at the expense of running time) that is more suited for very large data sets. We next consider a stochastic model of evolution to understand how tags arise. We show that in this setting, a simple process of tag generation essentially captures all possible ways of generating tags. We use this insight to develop a faster tag discovery algorithm with a small chance of error. However, tags are not guaranteed to exist in a given data set. We thus generalize the notion of a tag from a single substring to a set of substrings whereby each species in one set contains a large fraction of the substrings while each species in the other set contains only a small fraction of the substrings. We study the complexity of this problem and give a simple linear programming based approach for finding approximate generalized tag sets. Finally, we use our tag enumeration algorithm to analyze a phylogeny containing 57 whole microbial genomes. We find tags for all nodes in the phylogeny except the root, for which we find generalized tag sets.
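The definition of a tag given in the abstract can be illustrated with a minimal brute-force sketch. This is not the suffix-tree algorithm of the paper and makes no attempt to match its running time; it only checks the definition directly on toy strings. The function names `substrings` and `find_tags`, the `max_len` cutoff, and the example sequences are all assumptions introduced here for illustration.

```python
def substrings(s, max_len):
    """All distinct substrings of s with length between 1 and max_len."""
    return {
        s[i:j]
        for i in range(len(s))
        for j in range(i + 1, min(i + max_len, len(s)) + 1)
    }


def find_tags(in_group, out_group, max_len=8):
    """Naive tag enumeration (illustrative only, not the paper's method).

    Returns every substring of length <= max_len that occurs in all
    genomes of in_group and in none of the genomes of out_group.
    """
    if not in_group:
        return set()
    # A tag must occur in every in-group genome, so the substrings of
    # the shortest in-group genome are a sufficient candidate set.
    seed = min(in_group, key=len)
    tags = set()
    for cand in substrings(seed, max_len):
        in_all = all(cand in g for g in in_group)
        in_none = not any(cand in g for g in out_group)
        if in_all and in_none:
            tags.add(cand)
    return tags


if __name__ == "__main__":
    # Hypothetical toy "genomes", not data from the paper.
    inside = ["ACGTAC", "TTACGT"]
    outside = ["ACCTAG", "GGTACC"]
    print(sorted(find_tags(inside, outside)))
```

On these toy inputs every tag contains the dinucleotide CG, which appears in both in-group strings and in neither out-group string. The sketch tests each candidate against every genome by direct string search, so it is only practical for very small inputs.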
- Title
- Efficient Enumeration of Phylogenetically Informative Substrings
- Book Title
- Research in Computational Molecular Biology
- Book Subtitle
- 10th Annual International Conference, RECOMB 2006, Venice, Italy, April 2-5, 2006. Proceedings
- Pages
- pp 248-264
- Copyright
- 2006
- DOI
- 10.1007/11732990_22
- Print ISBN
- 978-3-540-33295-4
- Online ISBN
- 978-3-540-33296-1
- Series Title
- Lecture Notes in Computer Science
- Series Volume
- 3909
- Series ISSN
- 0302-9743
- Publisher
- Springer Berlin Heidelberg
- Copyright Holder
- Springer-Verlag Berlin Heidelberg
- Editors
- Alberto Apostolico (19)
- Concettina Guerra (20)
- Sorin Istrail (21)
- Pavel A. Pevzner (22)
- Michael Waterman (23)
- Editor Affiliations
- 19. Georgia Institute of Technology and Università di Padova
- 20. Topic Chairs
- 21. Center for Molecular Biology and Computer Science Department, Brown University
- 22. University of California
- 23. Department of Molecular and Computational Biology, University of Southern California
- Authors
- Stanislav Angelov (24)
- Boulos Harb (24)
- Sampath Kannan (24)
- Sanjeev Khanna (24)
- Junhyong Kim (25)
- Author Affiliations
- 24. Department of Computer and Information Sciences, University of Pennsylvania
- 25. Department of Biology, University of Pennsylvania