Large-scale analysis of microRNA evolution
- First Online:
- Cite this article as:
- Guerra-Assunção, J.A. & Enright, A.J. BMC Genomics (2012) 13: 218. doi:10.1186/1471-2164-13-218
In animals, microRNAs (miRNA) are important genetic regulators. Animal miRNAs appear to have expanded in conjunction with an escalation in complexity during early bilaterian evolution. Their small size and high-degree of similarity makes them challenging for phylogenetic approaches. Furthermore, genomic locations encoding miRNAs are not clearly defined in many species. A number of studies have looked at the evolution of individual miRNA families. However, we currently lack resources for large-scale analysis of miRNA evolution.
We addressed some of these issues in order to analyse the evolution of miRNAs. We perform syntenic and phylogenetic analysis for miRNAs from 80 animal species. We present synteny maps, phylogenies and functional data for miRNAs across these species. These data represent the basis of our analyses and also act as a resource for the community.
We use these data to explore the distribution of miRNAs across phylogenetic space, characterise their birth and death, and examine functional relationships between miRNAs and other genes. These data confirm a number of previously reported findings on a larger scale and also offer novel insights into the evolution of the miRNA repertoire in animals, and it’s genomic organization.
MiRNAs are small (19-23nt) molecules that regulate mRNAs through binding to their 3’ UTR, mediated by the RNA induced silencing (RISC) complex. This binding event causes translational repression[2, 3] and mRNA destabilization. The effect of binding is significant down-regulation of the target, which can be readily detected at both the protein and mRNA levels[5, 6]. The function of miRNAs in general appears to be as a fine-tuner of gene expression.
The origin of small interfering RNAs appears to pre-date the emergence of eukaryotes. The miRNA repertoires seem to be independent between animals and plants, being absent in fungi. Fungi possess elements of the processing machinery but not functional miRNAs. Furthermore, although both animals and plants possess miRNAs, they operate through different mechanisms. Expansions in morphological complexity in metazoans have previously been shown to correlate with expansions in miRNA repertoire. This seems to indicate that miRNAs are particularly advantageous for defining cell and tissue types. In this study we focus exclusively on animal miRNAs. In recent years, animal miRNAs have been implicated in many areas of biology such as: tissue specificity, cell-fate, pluripotency, development, cancer, disease and stress response.
One of the first features observed for mature miRNAs was their high degree of similarity across species. Many miRNA families have identical mature sequences across a wide range of species, e.g. let-7. This high-degree of similarity can hamper phylogenetic approaches. Functional constraints surrounding the seed region (6-8nt) of the miRNA represent an important fraction of their length, which is less amenable to mutational changes. While many miRNAs are present in multiple species and are highly conserved, there are a growing number of miRNAs restricted to specific lineages.
The primary transcript of a miRNA (pri-miRNA) contains stem-loop structures that are recognised and excised by the enzyme Drosha, giving rise to precursor miRNAs (pre-miRNAs). Comparison of pre-miRNA sequences illustrates that they are less highly conserved and hence more amenable to phylogenetic approaches than the mature sequences alone.
The primary repository for miRNA sequence data is miRBase. The information in miRBase is based on primary experimental data within specific species. A miRNA discovered in one species is likely to also be present in other closely related species, but this is not always captured by miRBase. This presents a significant challenge for phylogenetic analysis, as one requires information about the presence, absence and sequence of miRNA families in many species in order to perform evolutionary analysis. The rapid growth of next-generation sequencing has made it easier to predict miRNAs but it is clear that some predicted miRNAs do not validate experimentally and as such are flagged and removed from miRBase. Previously, a number of miRNA sequences were shown to be likely false-positives and have been removed from the database.
Different miRNAs usually belong to the same family if they share the same seed sequence (i.e. nucleotides 2–8 of the mature miRNA). It is believed that these miRNAs have similar targets and thus similar cellular function although they may have very different spatial and temporal expression profiles.
Recently, we developed MapMi, a resource for cross-species mapping and identification of homologous miRNAs across genomes. This approach overcomes many of the issues described and provides a solid foundation from which to explore syntenic and phylogenetic relationships between miRNAs across species.
In our dataset, many miRNAs (48%) are encoded as independent non-coding transcripts while the rest (52%) are encoded within the introns of protein-coding genes. Some miRNAs exist as individual molecules encoded by a single locus while others occur in transcripts encoding multiple copies of the same miRNA or multiple transcripts at different genomic loci. It has been postulated that in some cases multiple loci are required to increase copy-number of specific miRNA molecules in certain circumstances (e.g. miR-430 in early development of the Zebrafish embryo).
Even with the rapid expansion of sequencing data available, we are still lacking a global overview of the genomic organization of miRNAs across a broad range of species, and an overview of their evolutionary relationships. Most previous studies (reviewed in), focused on specific clusters in a small set of species.
Each miRNA is potentially capable of regulating hundreds (or even thousands) of mRNA targets simultaneously. It is therefore important that their regulation be tightly controlled. Moreover, it has been postulated that intronic miRNAs may regulate the same biological pathway as their host genes. Several examples of this have been found, namely in the regulation of Myosin expression and cholesterol biosynthesis. This suggests that miRNAs that are consistently co-localised with proteins might be involved in the same biological processes.
In this study, we performed for the first time, an automated, large-scale analysis of miRNA synteny and evolutionary associations. We use these data to explore both the arrangement and significance of miRNA loci throughout evolution. We also aim to identify those miRNA families, which have expanded or contracted in specific lineages. ly, we have performed phylogenetic profile analysis to identify miRNA:miRNA and miRNA:protein pairs which appear to be significantly associated at a functional level.
We employ Dollo parsimony to detect instances of miRNA family gains throughout evolution. Using these data we explore the genomic organization, evolution and functional associations of miRNAs. This data forms part of a larger and more detailed resource that can be accessed at www.ebi.ac.uk/enright-srv/Sintra. We will continue to update this resource, as more genomes become available.
Large-scale analysis of miRNA evolution and syntenic arrangement requires accurate information about the presence or absence of miRNA loci across many species. We addressed that by expanding the miRBase loci annotation using our MapMi approach. The 80 species considered for these analyses are shown in Additional file1: Table S1. One factor hampering analysis can arise from low-coverage genomes[21, 22] which makes mapping and identification of miRNAs difficult. Even though the methods used for the analyses described herein are robust to gene loss, we look at all available genomes for completeness, specifying where results are likely due to a genome being low-coverage (Additional file1: Table S1).
Our dataset is based on Ensembl and Ensembl Metazoa genomic sequences and protein family annotations (Ensembl Families). Annotations for miRNAs were obtained by mapping all metazoan sequences in miRBase using MapMi (see Methods). The dataset contains 52 species containing both protein coding annotation and miRNA annotation, and 28 species where just miRNA annotation is present. This corresponds to 774,002 protein coding loci and 31,237 miRNA coding loci across all species under analysis. Given that many miRNAs are present in multiple related copies it is essential that we can accurately place them into families. Hence, we have defined 3,053 miRNA families based on all miRNAs in our dataset (see Methods).
Evolution of the microRNA repertoire
One drawback of this approach is that, while we seek to detect miRNA orthologues across species, we cannot detect novel miRNAs present in species that have been poorly characterised at the miRNA level. This creates an issue for analysis of gains and losses due to these sampling biases. Some species are extremely well profiled for small RNAs while for others there exists little or no validated data. However for those sets of species which are well profiled, such analyses can still provide useful information about the evolutionary dynamics of miRNA families.
The results of this analysis are striking and show a large number of miRNA expansions across the phylogenetic tree (Figure1). As previously reported, we observe a significant increase in miRNA number as morphological complexity increases with significant growth starting for metazoans and in particular across eutheria. The largest growth is observed for rodents and primates with a significant gain observed for great apes (see Figure1). Globally the tree highlights sampling biases between clades. Some clades (e.g. Mammals) are well profiled while others (e.g. Insectivora, Bilateria) are poorly profiled. Individual species (e.g. Tarsius syrichta) although they are in a well-profiled clade may have poor assemblies that hamper miRNA identification. Hence care must be taken in the interpretation of miRNA repertoire and the prediction of large gains and losses.
Additionally, we observe gains within Insects and Nematodes; this is particularly striking due to the absence of many species in these groups in the phylogenetic tree. A small number of clades exhibit significant losses, such as frog, marsupials, squirrel and hedgehog. Some of these perceived losses are most likely due to poor miRNA characterization within these species that, possibly due to assembly problems, cannot be recovered by the MapMi pipeline.
Evolutionary comparison of miRNA genomic context
Recently expanded miRNA families
Primate specific miRNA family expansions
mir-1186,mir-1186b,mir-130,mir-1303,mir-130a,mir-130b,mir-130c, mir-1972,mir-301,mir-301a,mir-301b,mir-301c,mir-3090,mir-3590, mir-4452,mir-5095,mir-5096,mir-544,mir-544a,mir-544b,mir-619
ES Cell Expressed
mir-1283,mir-1283a,mir-1283b,mir-290,mir-291a,mir-291b, mir-292,mir-293,mir-294,mir-371,mir-371b,mir-373,mir-512, mir-515,mir-516,mir-516a,mir-516b,mir-517,mir-517a,mir-517b, mir-517c,mir-518a,mir-518b,mir-518c,mir-518d,mir-518e,mir-518f, mir-519a,mir-519b,mir-519c,mir-519d,mir-519e,mir-519f,mir-520a, mir-520b,mir-520c,mir-520d,mir-520e,mir-520f,mir-520g,mir-520h, mir-521,mir-522,mir-523,mir-523a,mir-523b,mir-524, mir-525,mir-526a,mir-526b,mir-527
ES Cell Expressed Maternal Zygotic transition
mir-1254,mir-1268,mir-1273,mir-1273c,mir-1273d,mir-1273e, mir-1273f,mir-1273g,mir-1304,mir-297,mir-297a,mir-297b,mir-297c, mir-4419b,mir-4459,mir-4478,mir-466,mir-466a,mir-466b,mir-466c, mir-466d,mir-466e,mir-466f,mir-466g,mir-466h,mir-466i,mir-466j, mir-466k,mir-466l,mir-466m,mir-466n,mir-466o,mir-466p,mir-467a,mir-467b,mir-467c,mir-467d,mir-467e,mir-467g,mir-467h,mir-566,mir-669a,mir-669b,mir-669c,mir-669d,mir-669e,mir-669f,mir-669g,mir-669h,mir-669i,mir-669j,mir-669k,mir-669l,mir-669m,mir-669o,mir-669p
Repeat Associated miRNAs (simple repeats, SINE, LTR)
Repeat Associated miRNAs (MADE1 elements)
MER 63 Repeat Associated miRNAs
mir-3585,mir-463,mir-465,mir-465a,mir-465b,mir-465c,mir-470, mir-506,mir-507,mir-508,mir-509,mir-509a,mir-509b,mir-510, mir-513a,mir-513b,mir-513c,mir-514,mir-514b,mir-547,mir-652, mir-742,mir-743a,mir-743b,mir-871,mir-878,mir-880,mir-881, mir-883,mir-883a,mir-883b,mir-888,mir-890,mir-892,mir-892a,mir-892b
X-linked miRNA cluster
MiRNA family expansions in Amphibians, Fish and Insects
Maternal Zygotic Switch
Maternal Zygotic Switch
Unknown expansion in Culex
Two large families of miRNAs appear to have expanded rapidly in primates. The first cluster (Table1) contains miR-130 and miR-301 miRNAs which have been previously reported as ancient miRNAs arising from tandem repeat duplications and which have been remodeled in animals. Members of this primate expanded family have been shown to have ES cell expression[27, 28]. The second cluster is also linked to ES cell expression and contains members such as miR-290 – miR-294. Interestingly, not only is the miR-290-294 set of miRNAs expressed in ES cells but it has been postulated to be a putative maternal zygotic switching mechanism in mouse oocytes.
It is intriguing that such families of miRNAs involved in pluripotency and early embryonic development have expanded in primates, and it mirrors expansions seen for other maternal zygotic switches described below for Insects and Fish. The increase in both morphological complexity and longevity in primates possibly requires increasingly complex control of gene-expression in stem cells. These results suggest that miRNAs are expanding in unison.
Aside from these two groups of ES cell related miRNAs we observe significant expansion of two large families of repeat associated miRNAs. It has previously been shown that Alu elements were expanded in the ancestor of Old and New World monkeys and that this facilitated expansion of segmental duplications. Other studies have shown that such Alu expansion might also support frequent duplication of short units such as miRNAs.
The first cluster contains a number of miRNAs derived from simple repeats, (LINE and LTR elements), which have previously been shown to have expanded in primates, again likely through segmental duplication. The second family contains miRNAs likely derived from MADE1 elements, while the third family contains MER63 derived miRNAs. These data further support the hypothesis that many primate expanded miRNA families are derived from repetitive elements and formed through rounds of segmental duplication. The relevance and function of such miRNAs is difficult to establish. One possibility that has been suggested before is that such repeats may act as generators of novel miRNA sequences which have yet to find functional relevance.
Another interesting expansion involves a family of X-linked miRNAs including miR-465 and miR-509. A large number of expansions are also listed for miRNAs whose function and expression are not well characterised yet (Tables1 and2). A number of other expansions are observed for other miRNA families, however in many cases little is known about the family members involved.
For fish, amphibians and insects, few expansions are detected (Table2). However, two out of the four detected expansions involve miRNA families implicated in the Maternal-Zygotic transition, a process in early development that is regulated by miRNAs. In particular miR-430 has been reported to have rapidly expanded in Danio rerio. We also detect a similar expansion for the equivalent MZ-switch miRNA in Xenopus tropicalis (miR-427). An expansion is also detected for miR-2185 in Danio rerio, however this miRNA has been poorly characterised with limited expression information pointing to a possible role in heart development. For insects a single expansion is detected within Aedes for miR-2951, however this miRNA is also poorly characterised.
Analysis of linkage and synteny is a useful tool for establishing both orthology relationships and also functional linkages between genes. The application of synteny analysis to miRNA genes (both intronic and intergenic) has not been applied previously on a large scale. We used the Enredo algorithm to segment genomes into homologous collinear regions that include both protein-coding and miRNA genes. Enredo is a graph-based system for detecting collinear segments in genome sequences that handles large-scale genome rearrangements such as duplications and deletions. Enredo does not compute the likely history of genome-rearrangements but forms a solid basis for such analyses by providing a stable set of co-linear segment blocks.
We explored the question of whether synteny blocks containing miRNAs showed differences compared to those blocks that contain solely protein-coding genes. Moreover, we wanted to assess whether particular species illustrated unexpected arrangements for miRNA genes when compared to other species.
Syntenic blocks containing microRNAs
A large fraction (59%) of the miRNA loci in our dataset are found to be encoded on the genome by transcripts containing several miRNA loci. As expected, a large fraction (63%) of these are found in conserved synteny blocks across two or more species. A small fraction (3%) of non-clustered miRNA loci are found to be in conserved synteny, albeit with protein coding genes.
We also found clusters that duplicated within the genome, but to different chromosomes (Additional file3: Figure S1). The organization of miRNAs between species seems to be more constrained than that of the nearby protein coding genes. Due to the diversity of possible scenarios, it is challenging to accurately reconstruct the series of events that lead to the current organization of genes. In general, our data is coherent with the hypothesis that miRNA genomic organization is more conserved than expected compared to both random models and protein-coding genes.
Associations between microRNAs
A number of approaches have been successfully used to predict functional associations between protein-coding genes based on both their sequence and their genomic context[41, 42, 43]. We used phylogenetic profiles to apply functional association analysis to miRNAs for the first time. In the context of protein-coding genes, these approaches have usually been applied to detect possible protein-protein interactions. In our case, we sought to determine whether miRNAs from different families and different syntenic blocks had any significant and unexpected functional associations. Phylogenetic profile analysis detects functional associations between genes based on their shared presence or absence across many genomes. We applied this technique to miRNA presence and absence profiles using the BayesTraits approach. Surprisingly, those miRNAs within the same syntenic block, in general, do not exhibit significant functional associations. This is likely because of the extensive conservation of miRNAs, in a way that is consistent with species phylogenies. It is therefore more interesting to look at co-evolution of miRNAs in different genomic regions, as this is not affected by strong linkage between loci.
Phylogenetic associations among miRNAs and proteins
Significant Associations between protein-coding genes and miRNAs
interleukin 1 alpha
Asialoglycoprotein receptor 1
Macrophage galactose N-acetyl-galactosamine specific lectin 2
Preferentially Expressed Antigen in Melanoma
TLC domain containing 2
The associations detected are for three independent miRNA families (miR-876, miR-1251 and miR-1788). The associations for miR-876 are particularly interesting as there are four detected and all the protein-coding genes involved play a role in immune response. Two of the proteins, IL1A and CD86 have well established roles in immune response (Cytokine signaling and T-cell receptor signaling). The ASGR1 protein appears to be involved in endocytosis of glycoproteins and is a target of the Hepatitis virus. MGL2 is a C-type lectin active in Macrophages. Finally MEFV is a protein producing Pyrin in white blood cells (eosinophils and monocytes) and appears to play a role in inflammation. Mutations in the MEFV gene cause the Mediterranean fever an inflammatory disease. While the miR-876 associations appear to have strong connections to immune response, little is known about the expression or activity of miR-876. The only experimentally validated target so far for this miRNA in human is MCL1 (Induced myeloid leukemia cell differentiation), while predicted regulatory targets of this miRNA from both MicroCosm and TargetScan[13, 46] indicate a preference for receptor proteins.
Similarly, the miR-1251 familiy is poorly characterised but shows an interesting association with PRAME, a protein that normally is found exclusively in testis, but that is also highly expressed in melanoma. Finally, we detected a strong association between the fish specific miRNA miR-1788 and the TLCD2 protein family. Again in this instance little is known about the miRNA and the co-evolving protein. These associations represent interesting cases for further analysis both computational and experimental.
We also searched for significant phylogenetic associations between different miRNA families. Nevertheless, after filtering of associations found based on small numbers of species, there were no significant miRNA:miRNA associations.
We have constructed a global synteny map and phylogenetic analysis for miRNAs across 80 animal species. The dataset used not only forms the basis of our analyses but is also, we believe, interesting and useful resource for the community. The full dataset is available at http://www.ebi.ac.uk/enright-srv/Sintra. We will continue to update this resource as new genomes and miRNAs become available and as their annotation improves.
Using these data we have undertaken a large-scale analysis of miRNA synteny, genomic organization and evolution. Our results recapitulate a number of earlier findings, in a fully automated fashion, with many more genomes and miRNAs. Our work revisits previous studies on the evolution of the miRNA repertoire and its correlation with morphological complexity, whilst also highlighting the fact that few miRNA families are shared between different clades. We show that miRNAs have atypical patterns of synteny with preferences for longer clustered regions, which do not appear to be affected by genome compaction.
We have also discovered several new features of miRNA evolution and additionally reconfirm using automated methods, the recent growth of miRNA loci in a number of animal lineages including rodents and primates and an apparent loss of miRNA families in a smaller number species such as Xenopus tropicalis. We find that the largest miRNA expansions detected frequently involve miRNAs involved in both pluripotency and switching from maternal to zygotic gene expression in the early embryo. Furthermore, we have performed for the first time a large-scale phylogenetic profile analysis of miRNA and proteins, discovering a number of novel associations between miRNAs and protein coding genes with implications for the roles of miRNAs in immune response. Our data also identifies quite clearly those genomes whose low-coverage or poor assembly makes them difficult to work with. Many challenges are presented by low sequence coverage of certain genomes and biases towards model species. However we believe the current results shed new light on miRNA evolution and it will be interesting to explore the effect of new genomes and better sequence assemblies over time. Additionally, further sequencing and validation of miRNA families will be useful to remove erroneously predicted miRNA families and to mitigate biases. We hope these results and our dataset will prove useful to the community.
Materials and methods
We retrieved genomic sequences from all species in Ensembl (version 62) and Ensembl Metazoa (version 9). We used MapMi (version 1.0.4) to map all the metazoan miRNAs in miRBase[13, 47] (release 17) against all genomes, using the default MapMi score threshold of 35. This dataset was merged with miRBase annotations, to retain the full miRNA annotation and increase sensitivity. The protein coding data was obtained using the Ensembl API to retrieve coordinates, ID and family information for all proteins. Proteins with no family information or with ambiguous family attribution were removed from the dataset to ensure coherence of the homology attributions across species.
The phylogenetic trees shown are based on the tree provided by Ensembl on http://tinyurl.com/ensembltree. This is a rooted, binary branching phylogram built from molecular data. All format conversions and node sorting necessary for compatibility with the programs used in this research were performed using the Mesquite framework for phylogenetic analysis.
miRNA family attribution
To classify miRNAs in a comparable fashion, we grouped them into homologous families. All miRNA stem-loop sequences were compared using the Needleman-Wunsch algorithm (global-global alignment), as implemented in ggsearch (FASTA package), using a scoring matrix that gives double weight to in-seed matching. This differentiation was performed using an expanded set of nucleotide codes in the seed region. Families are then defined by single-linkage clustering of the scores. Single-linkage clustering was chosen for its computational simplicity, and ease of interpretation of the results. The appropriate threshold was determined by minimizing the split-join distance between the clustering and miRBase families. The families used in this analysis are enumerated in Additional file4: Table S2.
Synteny block detection and visualization
The syntenic anchor dataset was built by combining the miRNA and protein coding datasets, where each anchor is identified by its family name. The file was sorted and duplicates were eliminated according to the Enredo documentation. We detected conserved collinear segments using Enredo (version 0.5) using the following options: max-gap-length=10000, max-path-dissimilarity=10, min-regions=2, min-anchors=2, simplify-graph=7. Blocks sharing a terminal anchor were chained together, according to standard operating procedures (J. Herrero, personal communication). To visualise synteny blocks, we developed a set of scripts to align the conserved synteny blocks by miRNA family using a Perl implementation of the Needleman-Wunsch algorithm producing plots using PostScript. Each anchor is coloured based on its family (e.g. see Figure4 and Additional file3: Figure S1).
Phylogenetic profiles, as defined herein, are vectors containing, for each species, the presence or absence status per miRNA family. It has been shown that gene families that are gained and lost in a correlated fashion, are often involved in the same biological processes. We studied correlated miRNA gene gains and losses by using the BayesTraits package in a sequential fashion as implemented in the bms_runner script (version 1.4). This approach performs a Maximum Likelihood based analysis taking into account the phylogenetic distribution of the species under analysis, removing potential biases caused by uneven sampling of the phylogenetic space.
Birth and death of miRNA families
It is important, not only to look at the presence of miRNAs in present day species, but also to reconstruct the most likely state of the presence or absence of miRNAs in their ancestors. There are several models to infer the most parsimonious scenario. The major difference between them concerns the assumptions of the model in regard to the relative birth and death rate for each gene family.
In the case of miRNA families, current data indicates a low probability of convergent evolution. Based on this, we have selected Dollo parsimony, an approach that allows each gene family to be gained once, with no restrictions on the number of times it suffers secondary loss. It is thus robust to losses due to genome assembly issues. We used this approach, as implemented in the PHYLIP package (version 3.69). Binary presence/absence data for each of the miRNA families were used allowing us to obtain an estimate of the evolutionary time of birth for each of the miRNA families in our dataset. This was used to explore miRNA evolution from different perspectives, as shown in Figures1 and2.
While some miRNA families are present in a single copy in each genome, some families have rapidly expanded in some clades. To assess these fast expansions or unexpectedly fast deletions we use CAFE (Version 2.2). This approach uses quantitative data for the number of elements of each family at each species, and requires that the gene families being studied are present at the root node of the provided phylogenetic tree. To accommodate this requirement, we performed this analysis in a selected set of sub-trees.
AJE conceived the experiment. J.A.G-A performed the analyses and contributed to the design of the experiment. J.A.G-A wrote and maintains the computer programs used for the analysis. AJE and J.A.G-A wrote the manuscript and produced the figures. All authors read and approved the final manuscript.
We thank members of the Enright Laboratory for useful discussions and feedback. J.A.G-A thanks Albert Vilella, Javier Herrero and Catarina Bourgard for interesting comments and general feedback. J.A.G-A is a member of Clare Hall College, Cambridge and was supported by fellowships SFRH/BI/33193/2007 and SFRH/BD/33527/2008 from the Fundação para a Ciência e Tecnologia as part of the Ph.D. Program in Computational Biology of the Instituto Gulbenkian de Ciência, Oeiras, Portugal.
This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.