The phylogenomic revolution and its conceptual innovations: a text mining approach

Mongiardino Koch, Nicolás

doi:10.1007/s13127-019-00397-0

Forum Paper
Published: 07 March 2019

Volume 19, pages 99–103, (2019)
Organisms Diversity & Evolution

Nicolás Mongiardino Koch ORCID: orcid.org/0000-0001-6317-5869¹

Abstract

Natural language processing provides a quantitative framework with which to explore human cultural dynamics. Although this approach is less commonly used in the natural sciences, the text employed in scientific publications preserves a historical record of the development of disciplines as they evolve and mature. A high-throughput text mining study was performed here to investigate patterns of word use in publications dealing with phylogenomics. Over 2000 research articles in the field were surveyed, revealing the words whose frequency of use has shown the strongest positive correlation with time. Notably, concepts such as gene tree discordance and the susceptibility and discriminatory power of phylogenomic datasets were found to be among the strongest trending topics in the field. As systematics transitioned into a big data science, such obstacles to phylogenetic reconstruction were not left behind. On the contrary, phylogenomics opened a new door to explore these phenomena and their biological significance, becoming the focus of new theoretical and practical developments.
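The abstract does not spell out the analysis pipeline, so the following is only a minimal sketch of the general idea it describes: compute each word's relative frequency per publication year and rank words by how strongly that frequency correlates with time. The function names (`pearson`, `trending_words`), the regular-expression tokenizer, the use of Pearson correlation, and the absence of stemming or stop-word removal are all assumptions made for illustration, not the author's method.

```python
import re
from collections import Counter, defaultdict
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient; returns 0.0 if either series is constant."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0
    return cov / sqrt(var_x * var_y)


def trending_words(docs, top=10):
    """Rank words by the correlation of their yearly relative frequency with time.

    docs: iterable of (year, text) pairs (a hypothetical input format).
    Returns a list of (word, correlation) pairs, strongest positive trend first.
    """
    counts = defaultdict(Counter)  # year -> word counts
    totals = Counter()             # year -> total number of tokens
    for year, text in docs:
        tokens = re.findall(r"[a-z]+", text.lower())  # naive tokenizer (assumption)
        counts[year].update(tokens)
        totals[year] += len(tokens)

    # Ignore years that contributed no tokens, to avoid dividing by zero.
    years = sorted(y for y in counts if totals[y] > 0)
    vocab = set().union(*(counts[y] for y in years))

    scores = {}
    for word in vocab:
        freqs = [counts[y][word] / totals[y] for y in years]
        scores[word] = pearson(years, freqs)

    # Sort by descending correlation, breaking ties alphabetically.
    return sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))[:top]


if __name__ == "__main__":
    # Toy corpus, invented purely to exercise the function.
    toy = [
        (2010, "gene tree gene tree"),
        (2011, "gene tree discordance tree"),
        (2012, "gene discordance discordance tree"),
    ]
    for word, r in trending_words(toy, top=3):
        print(f"{word}\t{r:.2f}")
```

Relative rather than raw frequencies are used so that a growing number of publications per year does not by itself make every word appear to trend upward. A real analysis would likely also need stemming, stop-word filtering, and a minimum-occurrence threshold.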


Author information

Authors and Affiliations

Department of Geology & Geophysics, Yale University, 210 Whitney Avenue, New Haven, CT, 06511, USA
Nicolás Mongiardino Koch

Corresponding author

Correspondence to Nicolás Mongiardino Koch.

Additional information

Publisher’s note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Electronic supplementary material

ESM 1

(DOCX 216 kb)

About this article

Cite this article

Mongiardino Koch, N. The phylogenomic revolution and its conceptual innovations: a text mining approach. Org Divers Evol 19, 99–103 (2019). https://doi.org/10.1007/s13127-019-00397-0

Received: 19 November 2018
Accepted: 27 February 2019
Published: 07 March 2019
Issue Date: 01 June 2019
DOI: https://doi.org/10.1007/s13127-019-00397-0
