Introduction

Viruses are the most diverse and abundant biological entity, infecting species from all of life’s domains, regularly jumping to new hosts, and occasionally causing serious disease1,2. Although the diseases that we now know are caused by viruses have been documented for millennia, viruses were not formally identified until the late 1800s3. The first viruses were discovered in the context of strong disease phenotypes, and for much of its history virology was heavily biased towards research on viruses associated with overt disease, particularly from plants and animals of direct human relevance4. This has changed with advances in metagenomic next-generation sequencing (mNGS), which has enabled a broader characterization of virus diversity5,6,7,8,9. Yet despite these technological developments, our understanding of animal viruses remains strongly skewed towards those infecting a relatively small number of taxa (Figs 1,2). In addition, as metagenomic datasets continue to grow in both size and complexity, so does the challenge of their analysis10.

Fig. 1: Phylogenetic diversity of animal viruses.

Schematic phylogenies showing each phylum within the kingdom Animalia (part a) and each animal class within the Chordata (part b), as well as the major events and traits acquired during chordate evolutionary history. In both part a and part b, the virus families and clades associated with each animal group are shown as identified from US National Center for Biotechnology Information (NCBI) GenBank nucleotide accession numbers. The animal phyla are those used for virus host taxonomy assignment within GenBank and the phylogeny is based on refs12,13. The figure is reliant on the host species assigned to a given virus sequence in the NCBI GenBank sequence database, such that these associations may not have been experimentally verified.

Fig. 2: Virome sequencing by animal phylum.

a | Graphical representation of the number of unique virus nucleotide entries in the US National Center for Biotechnology Information (NCBI) GenBank nucleotide sequence database sorted by virus species and host species showing that viruses associated with chordates far outnumber those from all other animal phyla. The proportions of these entries assigned to hosts of note are shown in different colours. Duplicate entries were excluded. b | Graphical representation of the rapid increase in vertebrate-associated virus entries in the NCBI GenBank sequence database over the past two decades and the comparatively low numbers of invertebrate-associated viruses identified over the same period.

The development of increasingly large-scale and affordable mNGS technologies has ushered in a new age in our understanding of the diversity of the viral universe (the so-called virosphere) and the evolutionary and ecological processes that give rise to it. Paradoxically, however, the more animal viruses are sequenced, the clearer it becomes that most of this immense virosphere remains uncharacterized7,11. Few of the more than 1.5 million species within the kingdom Animalia have been surveyed for viruses, and most of those characterized come from a single phylum, the Chordata. Similarly, because mosquitoes and ticks are common disease vectors, most virological studies of invertebrates have focused on the Arthropoda, although this is just 1 of 21 invertebrate phyla12,13,14 (Fig. 2). In addition, many metagenomic studies of animal viromes are largely limited to cataloguing the viral diversity present in the species in question. Although such cataloguing is an important first step, metagenomic data collected under appropriately designed sampling schemes can also be used to address specific hypotheses on the evolutionary and ecological factors that shape the structure of viromes15,16.

In this Review, we explore our current knowledge of the structure, diversity and evolution of the animal virome, particularly since the advent of mNGS. As most recent data have been generated by total RNA sequencing (also called ‘metatranscriptomics’), we necessarily devote the greatest attention to the diversity and evolution of RNA viruses, although in many cases similar conclusions can be drawn for viruses with DNA genomes. A key message is that profound sampling biases have restricted our understanding not only of virus biodiversity but also of fundamental aspects of virus evolution. We argue that placing those viruses that cause zoonotic disease in humans in the context of a wider sampling of animal viromes provides a more nuanced view of the frequency of host-jumping and emergence events, and hence assessments of zoonotic risk. We also give special emphasis to a central but rarely addressed question: whether major events in animal evolution — moments of evolutionary ‘transition’ such as the origin of the vertebrates or of adaptive immunity — also changed the phylogenetic diversity of the viruses that infect these species.

Diversity, composition and evolution of the animal virome

Metagenomics has widened the aperture through which we can view the diversity of the animal virome. Total RNA sequencing has enabled the rapid and comprehensive identification of viruses without the use of time-consuming and restrictive steps of cell culture or microscopy5,17,18,19,20 (Box 1). These studies have shown that animals are infected by viruses spanning the full range of genome types (that is, single-stranded RNA, in both positive-sense and negative-sense orientations, double-stranded RNA, retroviruses, single-stranded DNA and double-stranded DNA) as well as viruses with both segmented and unsegmented genomes. According to a recent (July 2021) classification by the International Committee on Taxonomy of Viruses, animal viruses can be placed into 5 (of 6) realms, 5 (of 10) kingdoms, 11 (of 17) phyla, 26 (of 39) classes, 36 (of 59) orders and 99 (of 189) families21.
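
The coverage figures quoted above are easier to compare across ranks when expressed as proportions. The following is a minimal Python sketch of that arithmetic; the names `ICTV_COUNTS` and `animal_fraction` are hypothetical labels introduced here only for illustration, and the counts are copied directly from the preceding paragraph.

```python
# Counts quoted in the text (ICTV classification, July 2021):
# rank -> (taxa that contain animal viruses, total taxa at that rank).
# ICTV_COUNTS and animal_fraction are illustrative names, not part of any ICTV tool.
ICTV_COUNTS = {
    "realms": (5, 6),
    "kingdoms": (5, 10),
    "phyla": (11, 17),
    "classes": (26, 39),
    "orders": (36, 59),
    "families": (99, 189),
}


def animal_fraction(rank):
    """Return the fraction of taxa at `rank` that contain animal viruses."""
    with_animal, total = ICTV_COUNTS[rank]
    return with_animal / total


if __name__ == "__main__":
    # Print one percentage per rank, from realm down to family.
    for rank in ICTV_COUNTS:
        print(f"{rank:>8}: {animal_fraction(rank):.0%}")
```

On these counts, at least half of the recognized taxa at every rank contain animal viruses, which underlines how broadly animal hosts are distributed across the current taxonomy.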

However, despite the broadening of species sampling through mNGS, our knowledge of the animal virome is still dominated by viruses associated with humans or human activities. As an illustration, ~75% of animal virus entries in the US National Center for Biotechnology Information nucleotide sequence database derive from humans, and most of the animal entries are from species of anthropogenic significance, either as disease hosts or vectors, or those of economic or social importance (Fig. 2). Major sampling biases mean that there are also marked differences in the extent and pattern of the diversity of viruses associated with different animal groups, such as different phyla or vertebrate classes (Fig. 1). The greatest diversity of known viruses resides within the vertebrates, closely followed by arthropods, with the phylum Mollusca a distant third. It is no coincidence that these phyla contain anthropogenically significant species, such as vectors of disease in the case of arthropods and farmed shellfish in the case of molluscs. Other phyla have evidently been sampled far less frequently. For example, as viruses are ubiquitous within the environment, it is unlikely that there is truly a lack of viruses infecting phyla such as the Placozoa (Fig. 1). Similarly, recent explorations of the fish virome have revealed a multitude of novel DNA and RNA viruses, with virus families previously only described in mammals or birds now also found in fish, indicative of their antiquity22,23,24,25,26,27,28,29 (Fig. 3). Of the 37 families and clades of viruses found in mammals, 27 are also found in ray-finned fish (the Actinopterygii; Fig. 1). That these virus families and clades are seemingly absent from phylogenetic ‘intermediate’ taxa (such as Amphibia and Sarcopterygii) is again likely a signature of inadequate sampling (Fig. 1).

Fig. 3: Recent phylogenetic and genomic expansion of the coronaviruses.

The figure illustrates how a combination of virus–host co-divergence and sporadic host-jumping has shaped the evolutionary history of the family Coronaviridae. The phylogenetic history of the major host taxa (part a) is broadly reflected in the phylogeny of the subfamilies Coronavirinae and Letovirinae (part b), with the former largely associated with mammals and the latter with fish and other aquatic animals. The major host taxon is indicated in part b by the branch colour corresponding to the host group shown in part a, and the host species is indicated by the animal silhouette at the tree tip. An expanded maximum likelihood phylogeny of the genus Betacoronavirus containing severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2) (part c) with animal silhouettes at the tree tips showing that most of these viruses are associated with bats, which are important reservoir hosts for these viruses. The phylogeny was estimated using ORF1ab protein using IQ-TREE137 and was midpoint rooted for clarity. The scale bars depict the number of amino acid substitutions per site. Parts a and b adapted from ref.28, CC BY 4.0 (https://creativecommons.org/licenses/by/4.0/).

Our limited knowledge of virus biodiversity has been put into sharp focus by the emergence of severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2), the causative agent of COVID-19, in late 2019 (refs30,31). Ongoing metagenomic studies are beginning to identify a wealth of animal coronaviruses. Although their hosts include rodents32, the most notable are arguably bats of the genus Rhinolophus (horseshoe bats), which are commonplace in China and parts of South-East Asia33,34 and sometimes carry viruses closely related to SARS-CoV-2 (Fig. 3). However, while it is probable that both bats and rodents harbour the greatest diversity of coronaviruses, this picture is very likely distorted by major sampling biases, as these two mammalian groups are also popular subjects of metagenomic studies due to their known role as reservoirs for a range of human infectious diseases. Indeed, as SARS-CoV-2 can infect and be transmitted among many animal species, resulting in large outbreaks in farmed mink35 with transmission back to humans36, and even reports of high virus prevalence in white-tailed deer in the USA37, it is unlikely that the natural ecology of viruses closely related to SARS-CoV-2 involves only bats and pangolins38,39.

Recent studies of other coronaviruses (that is, members of the family Coronaviridae of positive-sense RNA viruses) similarly provide informative examples of how metagenomic sequencing is leading to a new perspective on the diversity and antiquity of animal viruses. Historically, most attention has been directed towards those coronaviruses associated with mammals as these are most likely to emerge in humans40. However, a combination of mNGS and transcriptome database mining has led to the identification of divergent coronaviruses in a broader range of vertebrates, including amphibians and fish28,41 (Fig. 3). Perhaps most surprising was the discovery of coronaviruses in a jawless vertebrate — the pouched lamprey (Geotria australis) from New Zealand28. Rather than falling basal to other vertebrate coronaviruses on a phylogenetic tree, as might be expected if they had co-diverged with their vertebrate hosts, the pouched lamprey viruses fell within the diversity of fish coronaviruses, highlighting the occurrence of host-jumping in aquatic environments28 (Fig. 3). As appears to be true of many virus families, the evolutionary history of the coronaviruses reflects a combination of virus–host co-divergence that likely covers the entire evolutionary history of vertebrates over hundreds of millions of years and relatively frequent cross-species virus transmission among animals that inhabit the same environment and that can sometimes result in disease emergence.

An even more dramatic story can be told for hepatitis D virus (HDV). Until recently, HDV was described only in humans and in close association with human hepatitis B virus (HBV), which performs an essential ‘helper’ role in HDV replication. The intimate relationship between HDV and HBV led to theories that HDV evolved in humans, perhaps as an escaped host gene42. However, recent metatranscriptomic studies have revealed that viruses closely related to HDV infect other vertebrates (mammals, birds, fish, snakes and amphibians) as well as a number of invertebrates43,44,45,46, and do so in the absence of HBV-like viruses, such that other viruses may act as helpers46. Similarly, it has traditionally been assumed that influenza viruses (family Orthomyxoviridae) are largely restricted to water birds of the orders Anseriformes and Charadriiformes, which act as reservoirs for their occasional emergence in mammals47,48. However, recent metagenomic studies have identified influenza virus-like viruses in fish, amphibians and even jawless vertebrates (that is, hagfish), and these viruses share common ancestry with a diverse set of invertebrate viruses6,9. Hence, as is true of many virus groups, the influenza viruses have a far older and more complex evolutionary history than previously envisaged25 (Fig. 4). Indeed, the broader viral order Articulavirales of negative-sense viruses also contains divergent viruses sampled in fish as well as those from a variety of invertebrate species5.

Fig. 4: The complexity of host associations in virus evolution.

Phylogeny of the order Articulavirales (negative-sense RNA viruses that include the influenza viruses from the family Orthomyxoviridae) showing the diverse set of animal hosts infected and the complex virus–host associations. As with many virus groups, this phylogenetic pattern is indicative of a history of cross-species transmission set on a background of ancient virus–host co-divergence. The animal host group is indicated by the colour of the terminal branch and the host is indicated by an animal silhouette. The maximum likelihood phylogeny (IQ-TREE137) was inferred using amino acid sequences of the protein PB1 (or equivalent) and was midpoint rooted for clarity. The scale bar depicts the number of amino acid substitutions per site. Virus genome structures, with segment lengths drawn to scale, are indicated where available, illustrating the variation in genome structure and segment number.

One fascinating insight from mNGS studies of animal viromes has been the recognition that invertebrates commonly carry a far greater diversity and abundance of viruses than vertebrates, in accord with their huge species numbers. In particular, large-scale metagenomic studies of invertebrates have uncovered novel virus families and genera, as well as viral lineages previously thought to be restricted to vertebrates5,17,49,50,51. These studies have similarly identified a wide diversity of novel genome structures in invertebrate viruses, in turn revealing that viral genome evolution is more fluid and dynamic than previously envisaged5,17 (see later).

The first glimpse of the true breadth of the invertebrate virome came from a study of negative-sense RNA viruses in arthropods17. This was extended to cover other types of RNA virus in a broader range of invertebrate taxa5, eventually leading to a myriad of metagenomic studies52,53,54,55,56. More recently, metagenomic studies have begun to focus on individual invertebrate species, such as flies of the genus Drosophila57,58 and various species of mosquito54,59,60,61. Although these studies still reflect a limited sample of animals from the commonplace, easy to obtain and sometimes scientifically important arthropods, it is evident that viruses are copious in many invertebrate taxa. Indeed, some invertebrate RNA viruses reach abundance values as high as 87% of the non-ribosomal RNA reads in a single sequencing dataset5. That invertebrate species can possess such high virus abundance with no clear signs of disease (although these may be difficult to identify in such short-lived animals) further suggests that many of these viruses may be commensal and tolerated by their invertebrate hosts. Finally, not only are invertebrate viruses diverse but they often fall as basal lineages on phylogenetic trees of animal viruses, implying that they have ancient associations with animals62,63. Indeed, it is likely that many virus families will have an evolutionary ancestry that dates at least to the origin of vertebrates and perhaps even to the origin of animals.

Genome plasticity of animal viromes

The genome structures of animal viruses are characterized by a remarkable plasticity, reflected in major differences in genome length, genome organization (for example, the number and orientation of genes) and the number of genome segments present in specific virus families (Fig. 5). Traditionally, individual families of RNA viruses were thought to possess characteristic patterns of segmentation, with those containing multiple segments (such as members of the Orthomyxoviridae) generally considered as constituting phylogenetic groups distinct from those characterized by a single segment. Metagenomic data have drastically changed this picture. It is now clear that genome segmentation has been gained and lost multiple times in evolutionary history, with the RNA virus order Nodamuvirales and class Monjiviricetes providing important examples5,17 (Fig. 5). Similarly, the number of segments in the Articulavirales ranges from 4 to 10 (Fig. 4).

Fig. 5: The evolutionary flexibility of RNA virus genomes.

To illustrate the genome flexibility in RNA virus evolution in animals, phylogenies of the class Monjiviricetes and the families Nodaviridae and Flaviviridae are labelled with representative genome structures. Genome structures differ in size, organization and number of segments. Key genes are indicated by different colours, and the relative length of the coding regions is indicated by size. Boxes positioned below the centreline of the genome indicate overlapping open reading frames and black triangles at the ends of a structure indicate circularization. In the case of genomes within the Flaviviridae, boxes with rounded corners indicate individual proteins within the single polyprotein that characterizes many members of this family. Notably, genome segmentation has been gained and lost multiple times during the evolution of the Monjiviricetes and Nodaviridae, and has evolved once within the family Flaviviridae, specifically in the jingmenviruses associated with invertebrates. In each case, maximum likelihood phylogenetic trees (IQ-TREE137) were estimated using the RNA-dependent RNA polymerase (RdRP; NS5 or NS5-like protein for the Flaviviridae). All trees were midpoint rooted for clarity only. The scale bars depict the number of amino acid substitutions per site.

Of particular importance is that invertebrate viruses often have more complex genome structures than their vertebrate counterparts. A good example is presented by the Flaviviridae, a family of single-stranded, positive-sense RNA viruses that includes dengue virus, Zika virus and hepatitis C virus. All these familiar human pathogens are characterized by an unsegmented genome encoding a single polyprotein. Although this simple genome structure was once considered archetypal, the discovery of ‘flavi-like’ viruses with far more complex genome structures in a range of invertebrate taxa, such as Jingmenvirus from ticks, presents a very different picture6,64 (Fig. 5). The jingmenviruses comprise four or five segments, two of which show sequence similarity to the non-structural proteins NS5 and NS2B–NS3 of the Flaviviridae64. The two remaining segments exhibit no sequence similarity to known virus genes but likely encode structural proteins. Remarkably, these different segments may sometimes be associated with different virus particles, such that these viruses can be considered multicomponent viruses — a pattern of genome organization commonly seen in positive-sense RNA viruses of plants65. More dramatically, the recently discovered Chuviridae family of negative-sense RNA viruses contains viruses with unsegmented, bisegmented and even circular RNA genomes22 (Fig. 5). To date, this fascinating group of viruses has been described in arthropods, nematodes and reptiles5,17,66.

To evaluate whether any reduction in genome complexity is associated with the evolution of vertebrates will require a broader sampling of animals. One attractive, although untested, theory is that shorter genomes are selectively advantageous in vertebrates because fewer potential immune targets would be presented to hosts with more advanced adaptive immune responses. Testing this hypothesis will first require more detailed knowledge of the viromes of animal lineages that diverged close to the evolution of adaptive immunity.

Has host evolution shaped virus evolution?

As genome sequence data from animal viruses continue to accumulate, they can be used to address broader evolutionary questions. Viruses, by definition, have obligate associations with their hosts. Accordingly, changes in the number and diversity of host species through time are also expected to impact the number and diversity of the viruses they carry, albeit likely in a complex manner. A central issue is therefore whether and how the structure and phylogenetic diversity of animal viromes have been impacted by major events in the evolutionary history of their animal hosts. Although there has been some interest in documenting the generation, or ‘birth’, of lineages within individual virus species as this is central to the process of disease emergence2,67, aside from a limited number of phylogenetic studies68 and those examining local populations69, far less is known about the rates and mechanisms of virus birth and death (that is, lineage extinction) on evolutionary timescales. We hypothesize that major events in the evolution of animals (key evolutionary transitions) are likely to have had a major impact on the evolution of the viruses they harbour. To the best of our knowledge, no studies directly addressing this question have been undertaken to date, although similar work has been performed on other systems. For example, the diversification of pathogenic Bartonella bacteria has been proposed to reflect the expansion of the mammals70.

The evolution of the Metazoa more than 600 million years ago resulted in a huge increase in phenotypic diversity, eventually leading to the myriad of animal phyla that we see today. Similarly, there was a massive increase in the phenotypic diversity of animals concurrent with the origin of the Chordata more than 500 million years ago71, while the evolution of jawed vertebrates (Gnathostomata) approximately 450 million years ago was associated with multiple rounds of whole-genome duplication and the evolution of adaptive immunity72 (Fig. 1). It seems inevitable that these major events in host evolution will have had a profound impact on the extent, diversity and composition of the viruses the hosts carry. Major questions in this context include whether the evolution of new types of host cell led to a rise in virus diversity, and whether the evolution of adaptive immunity led to the extinction of many viral lineages and hence a marked reduction in diversity. It is tempting to speculate that the apparent reduction in virus abundance levels in vertebrates compared with invertebrates7 (see earlier) in part reflects the evolution of adaptive immunity (Fig. 1). Similarly, the earlier evolutionary transition to multicellularity would have greatly increased the number and diversity of host cells, and their receptors, for viruses to infect.

Other events in host evolution may also have led to major reductions in virus diversity. Probable examples include mass extinction events73, such as those that occurred at the Permian–Triassic boundary approximately 250 million years ago, resulting in the loss of more than 80% of all marine species and ~70% of terrestrial vertebrate species74, and the Cretaceous–Paleogene extinction event approximately 66 million years ago, which massively reduced the number of tetrapods and resulted in the extinction of non-avian dinosaurs75. Similarly, an overall decline in host population size and density coincident with the evolution of the vertebrates would have increased the impact of stochastic effects on virus populations, which would in turn have been subject to weaker natural selection76: with fewer potential hosts to infect, genetic drift would be stronger and viral lineages would be expected to be lost more frequently.
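
The drift argument above can be illustrated with a toy simulation. The following minimal sketch assumes a neutral Wright–Fisher model, a standard population-genetic idealization that is swapped in here purely for illustration and is not an analysis from the cited work. The function names, the starting frequency, the number of generations and the host population sizes are all hypothetical choices.

```python
import random


def lineage_lost(n_hosts, start_freq, generations, rng):
    """Neutral Wright-Fisher sketch (an assumption, not the authors' method).

    A viral lineage starts in a fraction `start_freq` of `n_hosts` hosts.
    Each generation, every host is infected by the lineage with probability
    equal to its current frequency. Returns True if the lineage reaches
    frequency 0 within `generations`.
    """
    count = round(start_freq * n_hosts)
    for _ in range(generations):
        if count == 0:
            return True   # lineage extinct
        if count == n_hosts:
            return False  # lineage fixed, cannot be lost by drift alone
        p = count / n_hosts
        # Binomial resampling of the next generation of hosts.
        count = sum(1 for _ in range(n_hosts) if rng.random() < p)
    return count == 0


def loss_fraction(n_hosts, start_freq=0.2, generations=40, replicates=100, seed=1):
    """Fraction of replicate simulations in which the lineage is lost."""
    rng = random.Random(seed)  # fixed seed so the sketch is reproducible
    lost = sum(
        lineage_lost(n_hosts, start_freq, generations, rng)
        for _ in range(replicates)
    )
    return lost / replicates


if __name__ == "__main__":
    # Hypothetical host population sizes chosen only to contrast small and large.
    for n in (20, 1000):
        print(f"hosts={n:>5}  fraction of lineages lost={loss_fraction(n):.2f}")
```

Under these assumptions, a lineage at the same starting frequency is lost far more often within a fixed number of generations when the host population is small, which is the qualitative pattern the paragraph describes. Real virus populations are structured and subject to selection, so the sketch shows the direction of the effect rather than its magnitude.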

When sufficient data become available, a detailed phylogenetic analysis of animal viruses will provide meaningful insights into how host evolutionary transitions might have influenced the long-term macroevolution of viruses. The drastic reduction in the number of animal species associated with mass extinction events should be visible in the species distribution of viral lineages on phylogenetic trees. The first insights may come from comparisons of vertebrate and invertebrate viruses, particularly whether some viruses are restricted to either host type, or whether there is a marked phylogenetic gap between vertebrate and invertebrate viruses on phylogenetic trees of individual virus families that signifies a major transition in virus diversity. A provisional analysis of the limited and highly biased data currently available reveals that 16 of the 66 family or multifamily ‘superclades’ of viruses9,17 are associated with vertebrates alone, whereas 17 are found in invertebrates with no vertebrate counterpart (Fig. 1). Broader investigations of this type should be a research priority.

Linking virus emergence to virus evolution

The phylogenetic analysis of virus orders, families and genera sits at the heart of studies of the diversity of viromes and their evolution77. On the one hand, there is often a broad congruence between the phylogenies of viruses and their animal hosts, with, for example, viruses sampled from fish and jawless vertebrates tending to fall in more basal phylogenetic positions than those sampled from mammals and birds (Figs 3,4). Hence, these phylogenetic trees generally depict evolutionary events, particularly virus–host co-divergence, that have taken place on timescales of millions of years. Conversely, these phylogenetic analyses also reveal that cross-species virus transmission to new hosts has been commonplace throughout animal evolution78. In the short term, this same process of host-jumping is responsible for the emergence of novel pathogens such as SARS-CoV-2 (refs79,80,81), with the vast majority of human viruses appearing in this way2. Indeed, disease emergence events occur over observable human history, and on timescales that are far shorter than depicted in most phylogenetic studies82. Hence, there is necessarily a marked temporal disconnect between evolutionary studies of animal viromes, such as those described in the preceding sections, and the timescale of disease emergence11. This in part explains why we still know little about the frequency with which host-jumping occurs in nature, or the rate at which cross-species transmission events are successful compared with those that die out83.

Understanding the drivers of disease emergence on short timescales provides a means to link virus microevolution, as happens within populations, with virus macroevolution as reflected in broad-scale phylogenetic analyses. The historical domestication of animals and the development of animal husbandry provided many opportunities for viruses to jump to humans, with the emergence of measles virus from relatives (that is, rinderpest virus-like viruses) in cattle a likely case in point84. More recently, increased interactions with wildlife, following such factors as climate change, alterations in land use, the flourishing of live animal markets and the farming and trafficking of wild animals, have exposed the human population to novel pathogens, with urbanization, population growth and globalization allowing these emerging viruses to spread rapidly and far. Human immunodeficiency virus 1 (HIV-1) spread across Africa from its zoonotic origin in the Congo River basin region67, and then to other continents, in part reflecting changes in colonial administration. By moving humans, animals and cargo great distances, air travel aided the spread of diseases and disease vectors into new environments. This includes the translocation of the Aedes aegypti mosquito from Africa to Asia and South America, enabling chikungunya virus, yellow fever virus, Zika virus and West Nile virus to establish animal transmission cycles in immunologically naive localities85,86,87, and fuelling increasingly widespread outbreaks of Ebola virus infection in mammalian hosts88. Similarly, environmental changes such as increasing urbanization and climate change are leading to an increased prevalence of existing human pathogens such as yellow fever virus and dengue virus85,86,89.

Deforestation forces wildlife into smaller, overlapping habitats, leading to new and greater interactions between and within species, fuelling disease spread90,91. Urbanization alters the way in which animals behave, changing their diets and interspecies and intraspecies interactions. Intensive farming creates opportunities for virus interspecies transmission and provides an environment in which a virus can spread rapidly through a population92,93, with viruses moving from wildlife to domestic species as well as between domestic animals. This is of special concern in poultry production, in which farmed birds regularly interact with wild birds, with virus transmission between them an occupational hazard. A powerful example is provided by the emergence of H5N1 avian influenza A virus in poultry and its subsequent zoonotic transmission to humans94. Backyard poultry populations within urban environments are of increasing concern as poultry-associated viruses such as Marek disease virus, infectious bursal disease virus and Newcastle disease virus (Avian orthoavulavirus 1) are being introduced into wild bird populations91,95, and these poultry populations also harbour multiple picornaviruses96. The reverse process is also possible, with viruses jumping from domestic animals to wildlife. The migration of humans and wildlife has similarly acted as a driver of disease emergence97,98,99,100, with metagenomic studies revealing that very closely related animal viruses can be found in very diverse geographical regions101. A telling example is provided by viruses associated with seabird ticks (Ixodes uriae) sampled as far apart as northern Sweden and the Antarctic peninsula, demonstrating that migratory birds and their ectoparasites can facilitate a global movement of viruses without human assistance102.

It has often been proposed that RNA viruses have a higher rate of cross-species transmission and hence experience less frequent virus–host co-divergence than their DNA counterparts2. Although this is supported by large-scale comparative analyses, it is also the case that both DNA viruses and RNA viruses jump species boundaries more readily over evolutionary time, as reflected in phylogenetic comparisons, than might have been assumed78. Although most cross-species transmission events likely occur between animals that are relatively close in taxonomic space, such as among different species of mammals77,82,103, some jumps may cover wide phylogenetic distances, including the possible transmission of hepadnaviruses from fish to mammals22,104. Again, sampling biases and data limitations make it difficult to draw precise conclusions on the frequency of cross-species transmission events in nature, although the more sampling that is done, the more examples are inevitably documented.

Metagenomics and zoonotic risk assessment

Determining the rate at which cross-species transmission events occur on epidemiological timescales of decades is of central importance in understanding disease emergence103. These data impact how we quantify zoonotic risk; that is, identifying those viruses with the potential ability to infect humans105,106. Before the metagenomic revolution, virus discovery studies in animals were focused on outbreaks with visible death and/or morbidity. As disease outbreaks in wildlife with low levels of death would generally not have been identified, a relatively high proportion of viruses appeared to be pathogenic107. However, the rebalancing of virome studies towards the sampling of seemingly healthy animals has shown that potentially pathogenic viruses may be the exception rather than the rule, with studies of birds and bats important exemplars107. The broadening of animal sampling away from overt disease also changes the proportion of viruses that appear as potentially zoonotic, altering the denominator of emergence risk. Metagenomic studies have revealed that bats harbour a large and complex virome18,20,33,108,109,110,111, with considerable discussion of the reasons why this might be so, particularly whether these animals possess immune systems that can tolerate a heavy burden of viral infection73,112,113. Although bats are implicated in the ultimate evolutionary origins of some important human viruses, only a tiny proportion of the huge number of bat viruses have ever successfully spread in humans, often entering our species via ‘intermediate hosts’, as appears to be true of some coronaviruses40 (Fig. 3). The more bat viruses that are identified through metagenomic sequencing, the lower the relative frequency of those that are pathogenic and/or zoonotic.
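
The effect of a growing denominator can be made concrete with a short calculation. The sketch below is illustrative only: the counts are hypothetical and are not drawn from any dataset or from the studies cited above, and the function names are introduced here purely for the example.

```python
def zoonotic_fraction(n_zoonotic, n_total):
    """Proportion of described viruses that are known to be zoonotic."""
    if n_total <= 0 or not 0 <= n_zoonotic <= n_total:
        raise ValueError("need 0 <= n_zoonotic <= n_total and n_total > 0")
    return n_zoonotic / n_total


# Hypothetical counts for illustration only (not real data).
KNOWN_ZOONOTIC = 10          # viruses from a host group known to infect humans
SURVEYS = [50, 500, 5000]    # total viruses described as sampling broadens


def fractions_over_sampling(n_zoonotic=KNOWN_ZOONOTIC, totals=SURVEYS):
    """Zoonotic fraction at each stage of sampling, numerator held fixed."""
    return [zoonotic_fraction(n_zoonotic, total) for total in totals]


if __name__ == "__main__":
    for total, frac in zip(SURVEYS, fractions_over_sampling()):
        print(f"{total:>5} viruses described -> {frac:.1%} zoonotic")
```

With the numerator held fixed, each tenfold increase in the number of described viruses reduces the apparent zoonotic fraction tenfold. In practice the numerator may also grow as sampling broadens, so this is a bounding illustration of the denominator effect rather than a forecast.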

The vast number of animal viruses described by metagenomics also complicates attempts to assess which of these will eventually emerge in humans107,114. There is no simple way to translate the long-term rates of virus evolution depicted in phylogenetic trees into short-term zoonotic risk assessments or pandemic predictions. Although revealing the diversity of the animal virome places newly emerged viruses into their true evolutionary context, it is arguably of less value for predicting whether some viruses have pandemic potential. There are many thousands of uncharacterized animal viruses that will differ in their natural propensity to infect humans. Large-scale metagenomic studies necessarily document virome composition in host species in a specific place at a particular point in time, often with little background ecological context. They should not be interpreted as exact descriptions of complete virome compositions in a species, particularly for hosts that occupy large geographical ranges, and do not necessarily inform on which viruses are able to emerge in humans. The snapshot of virus genetic diversity provided by metagenomics is also a static one in the face of the very rapid evolution of RNA viruses, which experience rates of nucleotide substitution approximately six orders of magnitude greater than those in their animal hosts115. The large-scale metagenomic sequencing of wildlife species will usually not identify the full spectrum of intrahost virus genetic variation, potentially missing low-frequency mutations that may facilitate host adaptation.

Most animal viruses sampled will lack some of the mutations they need to successfully replicate in and be transmitted among humans, with evolutionary optimization a necessity in the new host116. Hence, the vast majority of the viruses identified by metagenomic screening alone will have little chance of successfully spreading through human populations. As a topical case in point, although bat viruses that are closely related to SARS-CoV-2 have been identified, at the time of writing all those characterized lack an intact polybasic (furin) cleavage site at the S1–S2 junction in the virus spike protein that enhances human infectivity117,118. Similarly, although broad-scale screens have suggested that one of the closest relatives of SARS-CoV-2, virus RaTG13 sampled from Rhinolophus affinis bats in Yunnan province, China, had ‘high zoonotic potential’106, detailed virological studies revealed that this virus was unable to bind to the human ACE2 receptor119. Hence, although potentially informative as a provisional screen, computational analyses of this kind may lack the precision necessary for actionable risk assessments. In addition, the identification of a virus sequence through metagenomics does not provide prima facie evidence that the virus can replicate in human cells, and evaluation of this key trait will require detailed experimental data, hugely increasing the associated costs and person hours.

Despite these limitations, the capacity of mNGS to detect the full range of microorganisms within a sample in a single run signifies a new age in clinical diagnostics120,121. In the same way, even if it is not an exact prediction tool, mNGS will surely become a key component of future efforts in the surveillance of zoonotic pathogens at the human–animal interface. For example, to fully understand the emergence of SARS-CoV-2 and help prevent future epidemics, mNGS can be used to document the full host range of pathogens such as coronaviruses that seem best able to jump host species, and simultaneously reveal the barriers to cross-species virus transmission. As a case in point, a single study of a 1,100-hectare tropical botanical garden in Yunnan province, China, identified 24 novel bat coronaviruses, including close relatives of SARS-CoV-2 and of the animal pathogen porcine epidemic diarrhoea virus39. Which other mammalian species within this single botanical garden carry coronaviruses is unknown, but a broader sampling of all the species in such an ecosystem will do much to reveal the patterns, rates and determinants of cross-species virus transmission at local scales.

The factors currently limiting the use of mNGS in studies of zoonotic risk assessment and disease emergence are that the technology detects only actively replicating viruses, is relatively expensive and generates a huge amount of data that require considerable computing power for detailed analysis. The deployment of metagenomics in resource-poor settings may therefore be challenging, even though these are the locations where humans likely interact most with wildlife species (as well as biting arthropods) and hence where the risk of virus spillover is perhaps greatest, and where approaches to reduce the exposure of humans to wildlife would likely have the greatest impact. In these instances, pathogen surveillance approaches based on immunological techniques, such as VirScan, which can be designed to detect past and present infection by hundreds of potential zoonotic pathogens with a single assay, represent a more practical solution122. Rather than recognizing only already known pathogens, approaches such as VirScan can in theory be extended to recognize peptides from those groups of viruses that are most likely to emerge in humans107. Given their past behaviour, the coronaviruses fall into this ‘high-risk’ category, as do the influenza viruses and the paramyxoviruses (within which the henipaviruses are an important example of an emerging threat123), and all of these groups could be incorporated into broad-scale screening assays. Although such an approach will not capture all zoonotic viruses, it does provide some ability to detect potential threats.

Challenges and new research avenues

Although mNGS is transforming our understanding of animal viromes and their evolution, additional work is required on several fronts. We suggest that the priority for future sampling and sequencing should be those animal taxa that have been only poorly studied to date, particularly those that occupy key positions on the animal phylogeny, including those that mark evolutionary transitions. It will also be important to sample animals across their full range of habitats to determine whether virome structures differ substantially within individual host species. Similarly, given the rapidity of RNA virus evolution, a priority should be to determine how virome structures within individual animal species change over time, for instance by annually sampling the same species at the same locations. More broadly, it is essential that future metagenomic studies of virus populations test explicit ecological and/or evolutionary hypotheses, such as exploring the impact of changing land use on virome structures, rather than simply presenting descriptive lists of the viruses present.

Host associations cannot always be relied upon in metagenomic studies, as viruses infecting symbionts, components of host diet, and contaminant microorganisms and laboratory reagents are also sequenced as part of the metagenome. For example, RNA virus families associated with plants, such as the Tombusviridae and Luteoviridae, are often detected in animal metagenomes as they are probably a dietary component, while the Leviviridae, a family of RNA bacteriophages, are likely associated with the microbial communities within animal hosts124,125. Clearly, erroneous host assignments may lead to erroneous conclusions on virus ecology and evolution. As a consequence, new bioinformatic tools are required that can accurately assign virus sequences to the true hosts, perhaps using statistical approaches that jointly consider levels of virus abundance and phylogenetic relationships. Although the analysis of dinucleotide frequencies provides a potential way to distinguish viruses infecting different host phyla, it is unable to provide a fine-scale host discrimination126.
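To illustrate the dinucleotide frequency analysis mentioned above, the following is a minimal sketch of one common formulation: the odds ratio of each observed dinucleotide frequency to the frequency expected from the single-base composition. The function name and the handling of ambiguous bases are our own assumptions, and this is not necessarily the exact method used in the cited work.

```python
from collections import Counter
from itertools import product
from typing import Dict

BASES = "ACGT"


def dinucleotide_odds_ratios(seq: str) -> Dict[str, float]:
    """Return observed/expected frequency ratios for all 16 dinucleotides.

    The expected frequency of dinucleotide XY is f(X) * f(Y), where f is the
    single-base frequency. RNA input is accepted (U is treated as T).
    Positions holding ambiguous bases (e.g. N) are skipped.
    """
    seq = seq.upper().replace("U", "T")
    mono = Counter(b for b in seq if b in BASES)
    di = Counter(
        seq[i:i + 2]
        for i in range(len(seq) - 1)
        if seq[i] in BASES and seq[i + 1] in BASES
    )
    n_mono = sum(mono.values())
    n_di = sum(di.values())

    ratios: Dict[str, float] = {}
    for x, y in product(BASES, repeat=2):
        expected = (mono[x] / n_mono) * (mono[y] / n_mono) if n_mono else 0.0
        if n_di == 0 or expected == 0.0:
            ratios[x + y] = float("nan")  # undefined for this sequence
        else:
            ratios[x + y] = (di[x + y] / n_di) / expected
    return ratios


# Toy sequence used only to show the call; real genomes are far longer.
ratios = dinucleotide_odds_ratios("ACGUACGUACGU")
print(round(ratios["CG"], 3), ratios["AA"])
```

Ratios well below 1 indicate under-represented dinucleotides and ratios well above 1 indicate over-represented ones. As the text notes, such a 16-value profile is coarse and cannot by itself provide fine-scale host discrimination.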

Future virome analyses will similarly be enabled by the development of methods that can identify highly divergent viral sequences, as it is clear that a large proportion of the virosphere comprises sequences that are so divergent from the sequences of known viruses that they are currently ‘invisible’ to discovery strategies based on sequence similarity alone7. Although this problem is particularly acute for host taxa that are the most divergent from the animal species usually considered in virus metagenomics studies, such as archaea, bacteria and basal eukaryotes, many animal taxa likely carry RNA viruses that are hidden within the ‘dark matter’ of uncharacterized sequences127. Arguably the simplest way to shed light on this hidden and likely diverse virosphere is through the detection and characterization of conserved protein structures, as these retain the signal of homology and hence evolutionary relatedness for longer than primary sequences128,129. An informative example is provided by enveloped viruses, which require a protein capable of inducing the fusion of viral and cellular membranes for entry. Structural studies of multiple virus families have revealed that these fusion proteins fold into only three structural classes130. The amino acid sequences of these virus proteins show no detectable conservation among classes, and their relatedness is made apparent only through structural studies131. Fortunately, the ‘resolution revolution’ that has accompanied the development of cryo-electron microscopy has enabled the determination of the structures of more proteins that are difficult to crystallize132. Hence, an important area for future research will be to use these structures to guide the identification of highly diverse viruses in metagenomic data, perhaps by determining the ‘profiles’ of physicochemical and structural features that distinguish virus proteins133. Detecting highly divergent viruses may also provide answers to some of the most profound questions in virus evolution, such as whether the absence of RNA viruses in archaea and their low frequency in bacteria is simply because they are too divergent in sequence to be detected134.

Although the analysis of protein structure provides a potential means to reveal more of the diversity of the virosphere, it also presents a fundamental problem: that any novel viruses identified are so divergent in sequence that they cannot be incorporated into phylogenetic or other evolutionary analyses. This is even true in the case of the canonical RNA-dependent RNA polymerase, which is routinely used to infer multifamily phylogenies of RNA viruses (a variety of genes are used as phylogenetic markers in the DNA viruses). Even with currently available data, attempts to infer the evolutionary relationships among all extant RNA viruses are unconvincing, with pairwise identities in amino acid sequence alignments that are often less than expected by chance135. This raises the vexing question of how viable it is to infer a ‘global’ phylogeny of RNA viruses using sequence data alone. The most profitable approach may again involve methods that are able to accurately infer distant evolutionary relationships on the basis of shared features of protein structure. Although these are not insurmountable challenges, and the foundations of this approach have been laid136, little productive work has been done in this area.
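To make the comparison between observed pairwise identity and chance expectation concrete, the sketch below computes both for two aligned amino acid sequences. The chance baseline used here (the sum over residues of the product of their frequencies in each sequence) is a simple assumption of ours; the cited analysis may define chance differently. The function names and the toy sequences are hypothetical.

```python
from collections import Counter


def pairwise_identity(a: str, b: str) -> float:
    """Fraction of aligned columns, excluding any column with a gap, that match."""
    if len(a) != len(b):
        raise ValueError("aligned sequences must have equal length")
    columns = [
        (x, y) for x, y in zip(a.upper(), b.upper()) if x != "-" and y != "-"
    ]
    if not columns:
        return float("nan")
    return sum(x == y for x, y in columns) / len(columns)


def chance_identity(a: str, b: str) -> float:
    """Expected identity if residues were paired at random.

    Computed as sum_r p_a(r) * p_b(r), where p_a and p_b are the residue
    frequencies of each sequence with gaps removed. For uniform usage of
    the 20 amino acids this equals 1/20 = 0.05.
    """
    res_a = [x for x in a.upper() if x != "-"]
    res_b = [x for x in b.upper() if x != "-"]
    if not res_a or not res_b:
        return float("nan")
    count_a, count_b = Counter(res_a), Counter(res_b)
    return sum(
        (count_a[r] / len(res_a)) * (count_b[r] / len(res_b)) for r in count_a
    )


# Toy aligned fragments, invented for illustration only.
seq_a = "MKV-LA"
seq_b = "MRVQLG"
print(pairwise_identity(seq_a, seq_b), chance_identity(seq_a, seq_b))
```

When the observed identity of an alignment is no higher than this baseline, the alignment carries little or no detectable signal of homology, which is why phylogenies built on such alignments are difficult to trust.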

Conclusions

Metagenomic sequencing has radically changed our understanding of the diversity, structure and evolution of the animal virome, particularly in the case of RNA viruses. Yet it has also made the gaps in our knowledge more apparent than ever. As stressed throughout this Review, relatively little is known about the factors that shape virome structure outside anthropocentrically important species. Large-scale studies of a wider range of animal taxa are needed to provide a better understanding of the biological and phylogenetic diversity of viruses and the evolutionary and ecological processes that have given rise to it. Not only do we need to explain the large-scale patterns of virus diversity on evolutionary timescales, but to understand disease emergence and zoonotic risk it is essential to determine the factors that shape the ecology and evolution of viruses on shorter and more relevant timescales of years or decades, rather than millennia. Human activity is already leading to shifts in the diversity of the animal virome, although we usually see these effects only after they lead to a novel zoonotic event. Although metagenomics is shedding new light on the diversity of the virosphere, greater emphasis should be given to revealing the processes that determine cross-species transmission events among animals and hence that underpin disease outbreaks.