The phylogenomic revolution and its conceptual innovations: a text mining approach

  • Forum Paper
Organisms Diversity & Evolution

Abstract

Natural language processing provides a quantitative framework with which to explore human cultural dynamics. Although this approach is less commonly used in the natural sciences, the text employed in scientific publications preserves a historical record of the development of disciplines as they evolve and mature. A high-throughput text mining study was performed here to investigate patterns of word use in publications dealing with phylogenomics. Over 2000 research articles in the field were surveyed, revealing the words whose frequency of use has shown the strongest positive correlation with time. Notably, concepts such as gene tree discordance and the susceptibility and discriminatory power of phylogenomic datasets were found to be among the strongest trending topics in the field. As systematics transitioned into a big data science, such obstacles to phylogenetic reconstruction were not left behind. On the contrary, phylogenomics opened a new door to explore these phenomena and their biological significance, becoming the focus of new theoretical and practical developments.
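
The core measurement described in the abstract, ranking words by how strongly their frequency of use correlates with time, can be illustrated with a minimal sketch. This is not the author's pipeline: the toy corpus, the function names `pearson` and `trending_words`, the use of per-year relative frequencies, and the choice of Pearson correlation are all assumptions made for illustration, and no stemming or stop-word removal is applied.

```python
import re
from collections import Counter, defaultdict
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation; returns 0.0 if either series is constant."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    if sxx == 0 or syy == 0:
        return 0.0
    return sxy / sqrt(sxx * syy)


def trending_words(docs, top=5):
    """Rank words by the correlation between publication year and the
    word's relative frequency in that year (hypothetical helper).

    docs: iterable of (year, text) pairs.
    """
    counts = defaultdict(Counter)  # year -> word counts pooled over documents
    for year, text in docs:
        counts[year].update(re.findall(r"[a-z]+", text.lower()))

    # Ignore years in which no words were found, to avoid dividing by zero.
    totals = {y: sum(c.values()) for y, c in counts.items()}
    years = sorted(y for y, t in totals.items() if t > 0)

    vocab = set().union(*(counts[y] for y in years))
    scores = {}
    for word in vocab:
        freqs = [counts[y][word] / totals[y] for y in years]
        scores[word] = pearson(years, freqs)

    # Strongest positive correlation first; ties broken alphabetically.
    return sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))[:top]


# Toy corpus (assumed data, not taken from the study).
docs = [
    (2010, "gene tree concatenation concatenation"),
    (2011, "gene tree discordance concatenation"),
    (2012, "gene tree discordance discordance"),
]
ranked = trending_words(docs, top=4)
for word, r in ranked:
    print(f"{word}: {r:+.2f}")
```

In this toy corpus "discordance" rises steadily across years and ranks first, while "concatenation" declines and ranks last. A real analysis would also need to handle word stems, stop words, and the uneven number of articles per year.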



Author information

Corresponding author

Correspondence to Nicolás Mongiardino Koch.

Electronic supplementary material

ESM 1

(DOCX 216 kb)

About this article

Cite this article

Mongiardino Koch, N. The phylogenomic revolution and its conceptual innovations: a text mining approach. Org Divers Evol 19, 99–103 (2019). https://doi.org/10.1007/s13127-019-00397-0

  • DOI: https://doi.org/10.1007/s13127-019-00397-0
