Knowledge and Information Systems, Volume 41, Issue 3, pp 559–590

EvoMiner: frequent subtree mining in phylogenetic databases

  • Akshay Deepak
  • David Fernández-Baca
  • Srikanta Tirthapura
  • Michael J. Sanderson
  • Michelle M. McMahon
Regular Paper

Abstract

The problem of mining collections of trees to identify common patterns, called frequent subtrees (FSTs), arises often when trying to interpret the results of phylogenetic analysis. FST mining generalizes the well-known maximum agreement subtree problem. Here we present EvoMiner, a new algorithm for mining frequent subtrees in collections of phylogenetic trees. EvoMiner is an Apriori-like levelwise method, which uses a novel phylogeny-specific constant-time candidate generation scheme, an efficient fingerprinting-based technique for downward closure, and a lowest-common-ancestor-based support counting step that requires neither costly subtree operations nor database traversal. Our algorithm achieves speedups of up to 100 times or more over Phylominer, the current state-of-the-art algorithm for mining phylogenetic trees. EvoMiner can also work in depth-first enumeration mode to use less memory at the expense of speed. We demonstrate the utility of FST mining as a way to extract meaningful phylogenetic information from collections of trees when compared to maximum agreement subtrees and majority-rule trees—two commonly used approaches in phylogenetic analysis for extracting consensus information from a collection of trees over a common leaf set.
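The abstract does not spell out the support counting procedure, so the following is only a minimal sketch of the general idea, not EvoMiner itself. It assumes rooted trees given as nested tuples with string leaf labels, restricts patterns to three leaves (rooted triples), and finds lowest common ancestors by naive root-path comparison rather than a constant-time LCA structure. All function names are hypothetical.

```python
from itertools import combinations

def leaf_paths(tree):
    """Map each leaf label to the tuple of child indices on its root-to-leaf path."""
    paths = {}

    def walk(node, path):
        if isinstance(node, str):
            paths[node] = path
        else:
            for i, child in enumerate(node):
                walk(child, path + (i,))

    walk(tree, ())
    return paths

def lca_depth(paths, a, b):
    """Depth of the lowest common ancestor = length of the shared path prefix."""
    depth = 0
    for x, y in zip(paths[a], paths[b]):
        if x != y:
            break
        depth += 1
    return depth

def triple_topology(paths, a, b, c):
    """Return the pair that groups together (e.g. {a, b} for ab|c), or None if unresolved."""
    dab = lca_depth(paths, a, b)
    dac = lca_depth(paths, a, c)
    dbc = lca_depth(paths, b, c)
    if dab > dac:
        return frozenset((a, b))
    if dac > dab:
        return frozenset((a, c))
    if dbc > dab:
        return frozenset((b, c))
    return None  # all three LCAs coincide: no resolved triple

def frequent_triples(trees, min_support):
    """Return {(leaf set, grouped pair): support} for triples meeting min_support."""
    counts = {}
    for tree in trees:
        paths = leaf_paths(tree)
        for a, b, c in combinations(sorted(paths), 3):
            pair = triple_topology(paths, a, b, c)
            if pair is not None:
                key = (frozenset((a, b, c)), pair)
                counts[key] = counts.get(key, 0) + 1
    n = len(trees)
    return {k: v / n for k, v in counts.items() if v / n >= min_support}

# Illustrative input only; these trees are not from the paper's datasets.
EXAMPLE_TREES = [
    ((("A", "B"), "C"), "D"),
    ((("A", "B"), "D"), "C"),
    ((("A", "C"), "B"), "D"),
]
```

Calling `frequent_triples(EXAMPLE_TREES, 1.0)` on these three illustrative trees keeps only the triple on A, B, D in which A and B group together, since that is the one topology all three trees agree on. Lowering the threshold to 0.6 admits three more triples, each supported by two of the three trees.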

Keywords

Data mining, Pattern discovery, Maximum agreement subtree, Phylogenetics, Evolutionary bioinformatics

Notes

Acknowledgments

This work was supported in part by National Science Foundation Grant DEB-0829674. The authors thank Drs. Sen Zhang and Jason T. L. Wang for sharing the source code of Phylominer and for discussions of their work. They also thank Drs. Seung-Jin Sul and Tiffani L. Williams for sharing the datasets from Bayesian analyses, and Dr. Nicholas D. Pattengale for sharing the datasets consisting of bootstrapped trees. Special thanks go to the anonymous reviewers at KAIS, whose detailed comments greatly improved the paper.

Copyright information

© Springer-Verlag London 2013

Authors and Affiliations

  • Akshay Deepak (1)
  • David Fernández-Baca (1)
  • Srikanta Tirthapura (2)
  • Michael J. Sanderson (3)
  • Michelle M. McMahon (4)
  1. Department of Computer Science, Iowa State University, Ames, USA
  2. Department of Electrical and Computer Engineering, Iowa State University, Ames, USA
  3. Department of Ecology and Evolutionary Biology, University of Arizona, Tucson, USA
  4. Department of Plant Sciences, University of Arizona, Tucson, USA
