Diversification Rate is Associated with Rate of Molecular Evolution in Ray-Finned Fish (Actinopterygii)

Understanding the factors that drive diversification of taxa across the tree of life is a key focus of macroevolutionary research. While the effects of life history, ecology, climate and geography on diversity have been studied for many taxa, the relationship between molecular evolution and diversification has received less attention. However, correlations between rates of molecular evolution and diversification rate have been detected in a range of taxa, including reptiles, plants and birds. A correlation between rates of molecular evolution and diversification rate is a prediction of several evolutionary theories, including the evolutionary speed hypothesis which links variation in mutation rates to differences in speciation rates. If it is widespread, such correlations could also have significant practical impacts, if they are not adequately accounted for in phylogenetic inference of evolutionary rates and timescales. Ray-finned fish (Actinopterygii) offer a prime target to test for this relationship due to their extreme variation in clade size suggesting a wide range of diversification rates. We employ both a sister-pairs approach and a whole-tree approach to test for correlations between substitution rate and net diversification. We also collect life history and ecological trait data and account for potential confounding factors including body size, latitude, max depth and reef association. We find evidence to support a relationship between diversification and synonymous rates of nuclear evolution across two published backbone phylogenies, as well as weak evidence for a relationship between mitochondrial nonsynonymous rates and diversification at the genus level. Supplementary Information The online version contains supplementary material available at 10.1007/s00239-022-10052-6.


Regression results with validation checks and restandardisation
Supplemental Methods S1. Validation checking and standardisation of substitution rate and clade size data.
To ensure that sister pair contrasts in diversification rates (estimated from clade sizes) and substitution rates (estimated from substitution rates) are independent of their ancestral trait values and do not violate the assumption of homoscedasticity, we re-ran our main results after testing for and removing any relationship between the absolute contrast in the dependent variable for each sister pair and the mean of the two clades in each sister pair, which may be taken as an estimate of the ancestral diversification or substitution rate (see Freckleton 2000;Lanfear et al. 2010a). We therefore plotted the contrasts in clade sizes N against their means, and performed Kendall rank-correlation tests to search for a strong relationship under different transformations of the clade sizes. Family-level contrasts showed no significant correlation with the standard log transformation, but a relationship remained in the four sets of Genus-level clade size contrasts, necessitating a different transformation of the data. By experiment we chose the power transformation * = − 1 0.3 ⁄ , for which no correlation remained (Figures S16-S17). We conduct a similar test for the relevant branch length estimates (Figure S18-S19).
Following the recommendations of Garland et al. (1992), we standardised the variance of all contrasts by dividing through by the square root of the inferred age of each pair. This is recommended because the variance in trait values among lineages should increase with the amount of time since they diverged. We used the time estimates from Rabosky et al. (2018) as our estimate for this normalisation. We checked for a positive relationship between the normalised contrasts in transformed clade sizes and the square root of time (figures S20-S23). In all but one data set none was found, indicating that this normalisation factor was adequate. A weak negative correlation occurred in the mitochondrial dS data set at the genus level. Therefore we used a slightly weaker standardising factor of (pair age) 0.3 for this data set alone, which removed the correlation ( Figure S9-S10 available as Supplementary Information). To test for an effect of life history or environment in our dataset, we generated four sets of comparisons. We generated them as described for the molecular-only data sets above, except thatwe used only families or genera for which at least one measurement of each life history or ecological trait was available (not necessarily for the same tips). For each sister clade, a taxonomic average of all available life history and ecological traits associated with its descendent species in each all-taxonassembled tree was produced by averaging trait values successively across each species, genus, and sister clade, similar to the procedure for substitutions. These values were then averaged across the 100 all-taxon-assembled posterior trees. For habitat, we found that the proportion of reef-associated species in each genus and sister clade was almost universally close to 1 or 0, so we recorded a 1 for each sister clade in which >= 50% of genera had a 1 for a majority of descendant species. These procedures resulted in data sets with complete trait and taxonomic data. The numbers of remaining comparisons are displayed in Table 3.
We selected the best fitting linear model regressing diversification rate contrasts against trait contrasts by minimising the Akaike Information Criterion (AIC) via stepwise regression using the R 'step' function. This was done for each trait data set separately. We then performed multiple regression with substitution rate contrasts to determine whether adding the contrast in substitution rates significantly improved prediction of the diversification rate once life history and ecology had been incorporated.
This test asks whether all the differences in net diversification rate can be explained by life history variation and chance, or whether there is a significant signal for differences in diversification rate that can be attributed to variation in rate of molecular evolution.

Supplementary results and discussion
Results of the life history regressions are shown in Supplementary Table S2. Stepwise regression universally selected either a null model with no predictors, or a model with Length only as a predictor.
In all datasets but one, the relationship between substitution rates and clade sizes was not significant in the life history data even when it was significant in our Full and Reduced datasets, indicating that the additional filtering step for life history availability reduced the power of the dataset. The only life history dataset for which this relationship remained was for Family-level contrasts of nuclear synonymous substitutions (Nuc.Rag1, dS). For this dataset, we detected a significant improvement for a model including both Length and nuclear dS as predictors of clade size over a model only including Length (F-test, p < 0.05). However, given the low power of these datasets as well as the lack of a confirming signal in the Total nuclear substitution dataset, this result is low confidence and may change with additional data.  Figure S1. Validation of our clade size measure for Total substitution rate (baseml) data. In order to examine the relationship between the diversity of sister clades and their relative substitution rates, we need a measure of the species richness or size of the clade. This is complicated by uncertainty in the positioning of species among genera, and sometimes genera among closely related families. This uncertainty is summarised in 100 all-taxon-assembled phylogenies by Rabosk et al. (2018), in which species without molecular data are distributed randomly within applied taxonomic constraints according to a birth-death process model of diversification. To accommodate this taxonomic uncertainty, we chose to use the average size of a clade across these trees as our diversity measure (Horizontal axis). Since this is a difference from previous sister-pair diversification analyses, which typically use the number of described species in a given taxonomic group as a measure of species richness/clade size (Vertical axis), we use this figure to see whether our method produces substantial differences from this standard method. The formal taxonomic count is the sum of FishBase described species for each genus descending from the common ancestor of that sister clade in the phylogeny of Rabosky et al. (2018). Some genera are not represented in the backbone tree, meaning that relying on formal taxonomy misses these genera, leading to very small clade sizes. This is remedied by our methodology Otherwise the two measures of species richness/clade size are highly correlated. Figure S2. Validation of our clade size measure for dN and dS substitution rate (codeml) data. Our measure was taken to be the average size of the clade across 100 all-taxon assembled trees, accounting for possible uncertainty in the placement of species without molecular data. These averages are plotted against a count derived from the FishBase taxonomy (horizontal axis). This count is the sum of FishBase described species for each genus descending from the common ancestor of that sister clade in the phylogeny of Rabosky et al. (2018). Some genera are not represented in the backbone tree, leading to a greatly increased clade size when averages are taken over the all-taxonassembled phylogeny. Otherwise clade size are highly correlated between the two methods. Figure S3. Histogram of quartet alignment lengths in kilobases (kb) used to infer Total substitution rates (baseml) for the Full data sets. Figure S4. Histogram of quartet alignment lengths in kilobases (kb) used to infer dS or dN (codeml) substitution rates for Full data sets.      Contrasts are standardised by the square root of pair age. Data are shown for mitochondrial and nuclear sequences and family-and genus-level data sets. The dotted line shows the least squares regression line forced through the origin. Bold text indicates p < 0.05. Figure S16. Full dataset, Total substitution rates (baseml) -test for heteroscedasticity of sister pair log clade size contrasts. Data is shown for mitochondrial and nuclear sequences at the family and genus taxonomic levels. Absolute contrasts (vertical axis) are plotted against the sister pair mean of the clade sizes in each sister clade within the pair (horizontal axis). Clade sizes are then transformed so that positive relationship remains (Kendall rank-correlation test, p > 0.05). For family-level data a log transformation is used, while for genus-level data the transformation applied is * = −1 0.3 ⁄ . Figure S17. Full data sets, dS or dN (codeml) substitution rates -test for heteroscedasticity of sister pair log clade size. Data is shown for mitochondrial and nuclear sequences at the family and genus taxonomic levels. Absolute contrasts (vertical axis) are plotted against the sister pair mean of the clade sizes in each sister clade within the pair (horizontal axis). Clade sizes are then transformed so that positive relationship remains (Kendall rank-correlation test, p > 0.05). For family-level data a log transformation is used, while for genus-level data the transformation applied is * = −1 0.3 ⁄ . Figure S18. Full data sets, Total substitution rates (baseml) -test for heteroscedasticity of sister pair log branch length contrasts. Data is shown for mitochondrial and nuclear sequences at the family and genus taxonomic levels. Absolute contrasts (vertical axis) are plotted against the sister pair mean of the log substitution rates in each sister clade within the pair (horizontal axis). Substitution rates are then transformed so that no positive relationship remains (Kendall rank-correlation test, p > 0.05). The log transformation is found to be adequate for all data sets. Figure S19. Full data sets, dS or dN (codeml) substitution rates -test for heteroscedasticity of sister pair log branch length contrasts. Data is shown for mitochondrial and nuclear sequences at the family and genus taxonomic levels. Substitution rates are for non-synonymous (dN) or synonymous (dS) substitutions. Absolute contrasts (vertical axis) are plotted against the sister pair mean of the log substitution rates in each sister clade within the pair (horizontal axis). Substitution rates are then transformed so that no positive relationship remains (Kendall rank-correlation test, p > 0.05). The log transformation is found to be adequate for all data sets. Figure S20. Full data sets, Total substitution rates (baseml) -tests for nonhomogeneity of variance in transformed sister pair species contrasts due to the age of sister pairs. Data is shown for mitochondrial and nuclear sequences at the family and genus taxonomic levels. Vertical axis is the absolute values of the contrasts standardised by the square root of pair age in millions of years (Myr) and plotted against the standardisation factor (Sqrt pair age). For family-level data a log transformation is used, while for genus-level data the transformation applied is * = −1 0.3 ⁄ . We then test for any remaining correlation (Kendall rank-correlation test, p > 0.05). This test showed that standardisation by the square root of sister pair age is adequate. Figure S21. Full data sets, dS or dN (codeml) substitution rates -tests for nonhomogeneity of variance in transformed sister pair species contrasts due to the age of sister pairs. Data is shown for mitochondrial and nuclear sequences at the family and genus taxonomic levels. Vertical axis is the absolute values of the contrasts standardised by the square root of pair age in millions of years (Myr) and plotted against the standardisation factor (Sqrt pair age). For family-level data a log transformation is used, while for genus-level data the transformation applied is * = −1 0.3 ⁄ . We then test for any remaining correlation (Kendall rank-correlation test, p > 0.05). This test showed that standardisation by the square root of sister pair age is adequate. Figure S22. Full data sets, Total branch length (baseml) substitution rates -Tests for an effect of pair age on the variance of branch length estimates. Vertical axis is the absolute values of the contrasts standardised by the square root of pair age in millions of years (Myr) and plotted against the standardisation factor (Sqrt pair age). Shallow pairs with small substitution rates have higher variance because they are estimated from small numbers of observed substitutions. The shallowest pairs are successively removed until a linear model shows no evidence of a negative relationship (Wald test, p > 0.05). The data sets shown have undergone this process and have no detectable remaining relationship. Figure S23. Tests for an effect of pair age on the variance of substitution rate estimates. Vertical axis is the absolute values of the contrasts standardised by the square root of pair age in millions of years (Myr) and plotted against the standardisation factor (Sqrt pair age). Non-synonymous (dN) or synonymous (dS) substitutions are inferred. Shallow pairs have higher variance because they are estimated from small numbers of observed substitutions. The shallowest pairs are successively removed until a linear model shows no evidence of a negative relationship (Wald test, p > 0.05). The data sets shown have undergone this process and have no detectable remaining relationship. Figure S24. Tree-based analysis of the relationship between molecular substitutions and diversification by the method of Webster et al. (2003). The scatter plot shows the total estimated substitution rates on each root-to-tip path (the pathlength). The analysis is shown for the substitutions inferred from the full 11,638-sequence supermatrix from Rabosky et al. (2018), and the substitutions inferred from mitochondrial (1130 sequences) and nuclear (rag1 only, 3035 sequences) alignments assembled in this study. These pathlengths are plotted against the number of nodes on the root-to-tip path. The trend line shows the result of a phylogenetic linear regression of pathlengths against numbers of nodes on the path. The dashed line in the mitochondrial plot shows the fit of a threeparameter model allowing for non-linearity caused by the node density effect. The best-fitting model is concave with exponent approximately 1, indicating minimal contribution from a node-density artefact. Figure S25. Axis -inverted tree-based analysis of the relationship between molecular substitutions and diversification by the method of Webster et al. (2003). Using the number of nodes as the response variable is the preferred method of Venditti et el. (2006) for diagnosing the node density effect, but was not usable for all of our tree-based analyses because the curve was fit to outlying points, producing trends with unreasonably large slopes. The scatter plot shows the total estimated total substitutions using mitochondrial data on each root-to-tip path in a 1133-tip tree from Rabosky et al.