LSX: automated reduction of gene-specific lineage evolutionary rate heterogeneity for multi-gene phylogeny inference
Lineage rate heterogeneity can be a major source of bias, especially in multi-gene phylogeny inference. We had previously tackled this issue by developing LS3, a data subselection algorithm that, by removing fast-evolving sequences in a gene-specific manner, identifies subsets of sequences that evolve at a relatively homogeneous rate. However, this algorithm had two major shortcomings: (i) it was automated and published as a set of bash scripts, and hence was Linux-specific and not user-friendly, and (ii) it could result in very stringent sequence subselection when extremely slow-evolving sequences were present.
We address these challenges and produce a new, platform-independent program, LSX, written in R, which includes a reprogrammed version of the original LS3 algorithm and has added features to make better lineage rate calculations. In addition, we developed and included an alternative version of the algorithm, LS4, which reduces lineage rate heterogeneity by detecting sequences that evolve too fast and sequences that evolve too slowly, resulting in less stringent data subselection when extremely slow-evolving sequences are present. The efficiency of LSX, and of LS4 on datasets with extremely slow-evolving sequences, is demonstrated with simulated data and by the resolution of a contentious node in the catfish phylogeny that was affected by an unusually high lineage rate heterogeneity in the dataset.
LSX is a new bioinformatic tool with accessible code, with which the effect of lineage rate heterogeneity can be explored in gene sequence datasets of virtually any size. In addition, the two modalities of the sequence subsampling algorithm included, LS3 and LS4, allow the user to optimize the amount of non-phylogenetic signal removed while keeping a maximum of phylogenetic signal.
Keywords: Long branch attraction; Lineage rate heterogeneity; Phylogenomics; Phylogenetic methods; Sequence subsampling
LBA: Long branch attraction
LOI: Lineages of interest
LRT: Likelihood ratio test
LS3: Locus specific sequence subsampling
SBL: Sum of branch lengths
We recently showed that biases emerging from evolutionary rate heterogeneity among lineages in multi-gene phylogenies can be reduced with a sequence data-subselection algorithm to the point of uncovering the true phylogenetic signal. In that study, we presented an algorithm called Locus Specific Sequence Subsampling (LS3), which reduces lineage evolutionary rate heterogeneity gene-by-gene in multi-gene datasets. LS3 implements a likelihood ratio test (LRT) between a model that assumes equal rates of evolution among all ingroup lineages (single rate model) and another that allows three user-defined ingroup lineages to have independent rates of evolution (multiple rates model). If the multiple rates model fits the data significantly better than the single rate model, the fastest-evolving sequence, as determined by its sum of branch lengths from root to tip (SBL), is removed, and the reduced dataset is tested again with the LRT. This is iterated until a set of sequences is found whose lineage evolutionary rates can be explained equally well by the single rate or the multiple rates model. Gene datasets that never reach this point, as well as the fast-evolving sequences removed from other gene alignments, are flagged as potentially problematic. LS3 effectively reduced long branch attraction (LBA) artifacts in simulated and biological multi-gene datasets, and its utility to reduce phylogenetic biases has been recognized by several authors [3, 4].
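The iteration described above can be sketched in a few lines of Python. This is an illustrative sketch only, not the LSX or LS3-bash code: the function names, the `fit_models` callback (a stand-in for the two model fits), the choice of two extra parameters for the test, the 0.05 threshold, and the decision to keep SBL values fixed between rounds are all assumptions made for the example.

```python
import math


def lrt_p_value(lnl_single, lnl_multiple, extra_params=2):
    """P-value of the LRT via the chi-square survival function (even df only)."""
    if extra_params <= 0 or extra_params % 2:
        raise ValueError("this sketch only supports even, positive df")
    half = max(0.0, lnl_multiple - lnl_single)  # LRT statistic divided by 2
    terms = (half ** i / math.factorial(i) for i in range(extra_params // 2))
    return math.exp(-half) * sum(terms)


def ls3_subselect(sbl, loi_of, fit_models, alpha=0.05):
    """Remove the fastest sequence until the single rate model is not rejected.

    sbl:        dict, sequence name -> root-to-tip sum of branch lengths
    loi_of:     dict, sequence name -> lineage of interest (LOI) label
    fit_models: hypothetical callback, list of names -> (lnL_single, lnL_multiple)
    Returns (kept names, removed names, flagged).
    """
    kept = dict(sbl)
    removed = []
    while True:
        lnl_single, lnl_multiple = fit_models(sorted(kept))
        if lrt_p_value(lnl_single, lnl_multiple) >= alpha:
            return sorted(kept), removed, False  # rates look homogeneous
        fastest = max(kept, key=kept.get)
        same_loi = [s for s in kept if loi_of[s] == loi_of[fastest]]
        if len(same_loi) == 1:
            # Assumption: stop and flag the gene rather than empty an LOI.
            return sorted(kept), removed, True
        del kept[fastest]
        removed.append(fastest)


# Toy data: the first letter of each name is its LOI.
sbl = {"a1": 0.30, "a2": 0.35, "b1": 0.32, "b2": 0.90, "c1": 0.40, "c2": 0.65}
loi_of = {name: name[0] for name in sbl}


def toy_fit(names):
    """Fake model fit: the single rate model gets worse as SBL spread grows."""
    values = [sbl[n] for n in names]
    spread = max(values) - min(values)
    return -1000.0 - 10.0 * spread, -1000.0


kept, removed, flagged = ls3_subselect(sbl, loi_of, toy_fit)
print(kept, removed, flagged)
```

With these toy numbers the loop removes `b2` and then `c2` before the test stops rejecting the single rate model. In a real analysis the two log-likelihoods would come from a phylogenetic model-fitting program rather than from `toy_fit`.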
The published LS3 algorithm is executed by a set of Linux-specific bash scripts (“LS3-bash”). Here we present a new, re-written program which is much faster, more user-friendly, contains important new features, and can be used across all platforms. We also developed and included a new data subselection algorithm based on LS3, called “LS3 supplement” or LS4, which leads to lineage evolutionary rate homogeneity by removing sequences that evolve too fast and also those that evolve too slowly.
The new program, LSX, is written entirely in R, and uses PAML and the R packages ape [7, 8] and adephylo. If PAML, R, and the R packages ape and adephylo are installed and functional, LSX runs regardless of the platform, with all parameters given in a single plain-text control file. LSX reads sequence alignments in PHYLIP format and produces, for each gene, a version of the alignment with homogenized lineage evolutionary rates. In LSX, the best model of sequence evolution can be specified for each gene, which improves branch length estimation, and users can select more than three lineages of interest (LOIs) for the lineage evolutionary rate heterogeneity test (Additional file 1: Figure S1a,b).
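Because sequences are ranked by their SBL, it may help to see what that quantity is on a concrete tree. The following Python sketch computes root-to-tip sums of branch lengths from a Newick string. It is independent of how LSX obtains these values; the function names are hypothetical, and the parser is deliberately minimal (no quoted labels, comments, or whitespace inside the tree string).

```python
def _read_token(text, pos):
    """Read characters up to the next Newick delimiter; return (token, new_pos)."""
    start = pos
    while pos < len(text) and text[pos] not in "(),:;":
        pos += 1
    return text[start:pos].strip(), pos


def _parse_subtree(text, pos):
    """Parse one subtree; return ({tip: distance to the subtree's parent}, new_pos)."""
    tips = {}
    if text[pos] == "(":
        pos += 1
        while True:
            child_tips, pos = _parse_subtree(text, pos)
            tips.update(child_tips)
            if text[pos] == ",":
                pos += 1
            elif text[pos] == ")":
                pos += 1
                break
            else:
                raise ValueError(f"unexpected character at position {pos}")
        _, pos = _read_token(text, pos)  # internal node label, ignored here
    else:
        name, pos = _read_token(text, pos)
        tips[name] = 0.0
    length = 0.0
    if pos < len(text) and text[pos] == ":":
        token, pos = _read_token(text, pos + 1)
        length = float(token)
    # Every tip below this node is one branch further from the parent.
    return {name: dist + length for name, dist in tips.items()}, pos


def root_to_tip_sums(newick):
    """Return {tip name: sum of branch lengths from the root to that tip}."""
    text = newick.strip().rstrip(";")
    tips, _ = _parse_subtree(text, 0)
    return tips


tree = "((A:0.1,B:0.2):0.05,(C:0.3,D:0.4):0.1,Out:0.5);"
sums = root_to_tip_sums(tree)
for tip in sorted(sums):
    print(tip, round(sums[tip], 3))
```

In this toy tree, `D` and `Out` have the longest root-to-tip paths. Only ingroup tips would be ranked for removal, so a real workflow would drop the outgroup entries before comparing values.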
Within LSX we also implemented LS4, a new data subselection algorithm optimized for datasets in which both sequences that evolve too fast and sequences that evolve too slowly disrupt lineage rate homogeneity. In such cases, the approach of LS3, which removes only fast-evolving sequences, can lead to the excessive flagging of data (Additional file 1: Table S1). This is because LS3 will flag and remove sequences with intermediate evolutionary rates, since they are still evolving “too fast” relative to the extremely slow-evolving ones (Additional file 1: Figure S2).
LS4 employs a different criterion to homogenize lineage evolutionary rates, which considers both markedly fast- and slow-evolving sequences for removal. Under LS4, once the SBLs for all ingroup sequences of a given gene are calculated, they are grouped by the user-defined LOI to which they belong. The slowest-evolving sequence of each LOI is identified, and then the fastest-evolving among these slowest sequences is picked as a benchmark (i.e. “the fastest of the slowest”, see Additional file 1: Figure S1c). Because in both LS3 and LS4 each LOI has to be represented by at least one sequence, this “fastest (longest) of the slowest (shortest)” sequence represents the slowest evolutionary rate at which all lineages could converge. Then, LS4 removes the ingroup sequence that produces the tip furthest from the benchmark, be it faster- or slower-evolving (Additional file 1: Figure S1d).
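The benchmark rule and the removal step can be illustrated with a short Python sketch. It is not the LSX implementation: the function names are hypothetical, SBL values are given directly as numbers, and the way the sketch protects the last remaining sequence of an LOI (by excluding it from the candidates) is an assumption, since the text only states that each LOI must stay represented.

```python
from collections import Counter


def ls4_benchmark(sbl, loi_of):
    """Return the 'fastest of the slowest': the slowest sequence of each LOI
    is found first, and the one with the largest SBL among them is returned."""
    slowest_per_loi = {}
    for seq, length in sbl.items():
        loi = loi_of[seq]
        if loi not in slowest_per_loi or length < sbl[slowest_per_loi[loi]]:
            slowest_per_loi[loi] = seq
    return max(slowest_per_loi.values(), key=sbl.get)


def ls4_next_removal(sbl, loi_of):
    """Return the sequence whose SBL is furthest from the benchmark, or None.

    Assumption: a sequence that is the only representative of its LOI is
    never offered for removal, and neither is the benchmark itself.
    """
    benchmark = ls4_benchmark(sbl, loi_of)
    per_loi = Counter(loi_of[seq] for seq in sbl)
    candidates = [
        seq for seq in sbl
        if seq != benchmark and per_loi[loi_of[seq]] > 1
    ]
    if not candidates:
        return None
    return max(candidates, key=lambda seq: abs(sbl[seq] - sbl[benchmark]))


# Toy gene with one very slow (a1) and one very fast (b2) sequence.
sbl = {"a1": 0.05, "a2": 0.40, "b1": 0.35, "b2": 0.90, "c1": 0.42, "c2": 0.60}
loi_of = {name: name[0] for name in sbl}

benchmark = ls4_benchmark(sbl, loi_of)
first = ls4_next_removal(sbl, loi_of)
remaining = {seq: length for seq, length in sbl.items() if seq != first}
second = ls4_next_removal(remaining, loi_of)
print(benchmark, first, second)
```

In this toy gene the benchmark is `c1`, the first removal is the very fast `b2`, and the second is the very slow `a1`. A fastest-first rule would instead remove `c2` after `b2`, then `c1` and `a2`, discarding intermediate-rate sequences while the slow outlier stays in place.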
We compared the efficiency of LSX relative to our previous script LS3-bash with simulated data (Additional file 1: Supplementary Methods), and found LSX to perform the LS3 algorithm 7× faster than LS3-bash with a 100-gene dataset, and 8× faster with a 500-gene dataset (Additional file 1: Table S1). We then compared the relative effectiveness of LS4 and LS3 when analyzing datasets in which there were mainly average- and fast-evolving sequences, and datasets in which there were very slow-, average-, and very fast-evolving sequences (Additional file 1: Supplementary Methods). In the former case, both LS3 and LS4 gave similar results (Additional file 1: Table S1). In the latter case, which includes very slow- and very fast-evolving sequences, the data subsampling under LS3 was too stringent and substantially reduced the phylogenetic signal, and only the data remaining after LS4 were able to clearly resolve the phylogeny (Additional file 1: Table S1). In addition, we applied both algorithms, as implemented in LSX, to a biological case study: a 10-gene dataset of the catfish order Siluriformes. There are two conflicting hypotheses for the most basal splits of this phylogeny: one proposed by morphological phylogenetics, and one proposed by molecular phylogenetics (e.g. [11, 12]). The point of conflict is the positioning of the fast-evolving lineage Loricarioidei, which is closer to the root in molecular phylogenies than in morphological phylogenies. The attraction of the fast-evolving Loricarioidei lineage towards the root may be an artifact due to strong lineage rate heterogeneity, which allowed us to explicitly test the different approaches of LS3 and LS4 on this dataset.
The results presented in  show that LS3 was able to find taxa subsets with lineage rate homogeneity in six out of the ten genes, and flagged four complete genes as unsuitable for analysis. Analyzing the LS3-processed dataset showed that the basal split of Siluriformes is indeed affected by lineage rate heterogeneity, and that there was a strong signal supporting the morphological hypothesis of the root. However, these results were not entirely satisfactory because one ingroup species was incorrectly placed among the outgroups, and one of the well-established clades of the phylogeny was not recovered. In contrast, LS4 found lineage rate homogeneity in seven out of the ten genes (only three genes were flagged), the final phylogeny showed the morphological hypothesis of the root, and all the ingroup taxa plus the well-established clades were recovered. In this case study, both LS3 and LS4 successfully mitigated the effect of lineage rate heterogeneity, but the data subselection criterion of LS4 allowed the inclusion of more data for the final analysis, and resulted in a phylogeny with better resolution.
The new program presented here, LSX, represents a substantial improvement over our initial scripts in LS3-bash. LSX is faster and platform-independent, its code is accessible, and it includes a new version of the algorithm, LS4. We show here and in a recent publication that this new version is more effective than LS3 in increasing the phylogenetic to non-phylogenetic signal ratio when extremely slow-evolving sequences are present in addition to very fast-evolving ones, and that it helped to resolve a long-standing controversy of catfish phylogenetics. We also see potential in both algorithms for scanning genome-wide datasets and using the gene flagging data to identify regions in which a single lineage shows markedly accelerated evolution (such as human accelerated regions [13, 14]). Alternatively, the same data could be used to identify genomic regions that are highly conserved (and thus slow-evolving) among some lineages but not others (e.g., conserved non-coding elements). As research in phylogenetics progresses in the wake of the genomic era, we must begin to resolve the most contentious nodes of the tree of life, where the usual methods may not be as effective. To undertake these challenges, we believe that accessible data subselection programs with clear criteria are necessary tools, and should be made available whenever possible.
Availability and requirements
Project name: LSX v1.1.
Project homepage: https://github.com/carlosj-rr/LSx
Operating systems: Platform independent.
Programming language: R.
Other requirements: R 3.3.x or higher, R package ape 5.1 or higher (and dependencies), R package adephylo 1.1 or higher (and dependencies), PAML 4.
License: GNU GPL 3.0.
Any restrictions to use by non-academics: license needed.
We thank Jose Nunes for his suggestions during the programming of LSX in R, and Joe Felsenstein for discussions about the criterion used in the LS4 algorithm.
CJRR and JIMB developed the algorithms, CJRR did the initial code drafts, and finalized it with inputs from JIMB, and both authors wrote the manuscript. Both authors read and approved the final manuscript.
This work was supported by the Swiss National Science Foundation (grant 31003A_141233 to JIMB) and the Institute for Genetics and Genomics in Geneva (iGE3). The funding bodies had no role in the design of this study, its data collection and analysis, the interpretation of its data, nor in the writing of the manuscript.
Ethics approval and consent to participate
Consent for publication
The authors declare that they have no competing interests.
- 5. R Core Team. R: A language and environment for statistical computing. Vienna: R Found Stat Comput; 2016. https://www.r-project.org/.
- 10. Rivera-Rivera CJ, Montoya-Burgos JI. Back to the roots: reducing evolutionary rate heterogeneity among sequences gives support for the early morphological hypothesis of the root of Siluriformes (Teleostei: Ostariophysi). Mol Phylogenet Evol. 2018;127:272–9. https://doi.org/10.1016/j.ympev.2018.06.004.
- 12. Diogo R. The Origin of Higher Taxa. 2007. https://doi.org/10.1093/acprof:oso/9780199691883.001.0001.
Open Access This article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated.