New strategy for virus discovery: viruses identified in human feces in the last decade

Emerging and re-emerging viruses continue to surface all over the world. Some of these viruses have the potential for rapid and global spread with high morbidity and mortality, such as the SARS coronavirus outbreak. It is extremely urgent and important to identify a novel virus near-instantaneously to develop an active preventive and/or control strategy. As a culture-independent approach, viral metagenomics has been widely used to investigate highly divergent and completely new viruses in humans, animals, and even environmental samples in the past decade. A new model of Koch’s postulates, named the metagenomic Koch’s postulates, has provided guidance for the study of the pathogenicity of novel viruses. This review explains the viral metagenomics strategy for virus discovery and describes viruses discovered in human feces in the past 10 years using this approach. This review also addresses issues related to the metagenomic Koch’s postulates and the challenges for virus discovery in the future.

In the past decade, several serious emerging viral infectious diseases have left a deep impression on many people. The world witnessed the first pandemic of the new millennium in 2003. Instead of the influenza virus, severe acute respiratory syndrome coronavirus (SARS-CoV) caused the first serious and widespread zoonotic disease, having a huge global impact on health, travel, and the economy [1][2][3][4][5]. The global impact of the SARS outbreak was in some ways intensified by the delay in identifying the causative agent of the disease. Within a few months, SARS-CoV caused approximately 916 deaths and affected around 30 countries. By the time the novel SARS-CoV was isolated through a joint effort of the World Health Organization (WHO) SARS Collaborative Network, six months had passed. Hence, identifying the causative agent of a new epidemic is one of the most important steps for active control and prevention of viral disease outbreaks.
With the development of modern molecular biology technologies, especially sequence-independent single-primer amplification (SISPA) [6], next-generation sequencing (NGS) such as 454-pyrosequencing [7,8], and viral metagenomics [9][10][11], multiple viruses can be detected simultaneously, and novel and highly divergent viruses can be discovered and genetically characterized quickly. A novel phlebovirus of the Bunyaviridae family, known as severe fever with thrombocytopenia syndrome bunyavirus, was discovered using SISPA technology [12,13]. Human bocavirus was also identified using the above-mentioned methods in 2005 [14]. Recently, a novel coronavirus, HCoV-EMC, was isolated in Jeddah, Saudi Arabia on June 13, 2012 [15]. Subsequently, the complete genome of MERS-hCoV was determined using an unbiased virus discovery approach involving NGS techniques, which determined that it was closely related to bat coronavirus but was distant from SARS-CoV [16]. This novel coronavirus, MERS-hCoV, reminds us to pay attention to animal coronaviruses, which may be the cause of severe disease in humans, and to develop strategies to rapidly determine unknown viral agents [17]. In fact, only limited data are available regarding the diversity of viruses present in humans and animals. A great number of human and animal viruses are still unknown.

Traditional methods of virus discovery
Before the advent of modern molecular methods, traditional virus discovery included filtration, tissue culture, electron microscopy, serology, and vaccination; cell culture for virus propagation was the virus discovery gold standard. However, many viruses cannot be easily propagated in cell culture, thus limiting our understanding of virology. Two milestone findings solved these difficulties by allowing those interested in understanding new viruses first to amplify and then to sequence viral nucleic acids, namely DNA amplification by polymerase chain reaction (PCR) [18] and DNA sequencing with chain-terminating inhibitors (Sanger sequencing or first-generation sequencing) [19]. The sequences deposited in GenBank between August 2009 and August 2010 totaled 970 million and 43 million bases of viral and phage origin, respectively, representing an annual growth of 20%-24% [20]. After that, several serious emerging viruses, such as Hendra virus [21], Nipah virus [22], Menangle virus [23], Melaka virus [24], and Reston Ebola virus [25], were discovered in different countries. The trend in virology research has shown gradual substitution of the traditional virus discovery methods with modern molecular biology technology. Nevertheless, traditional methods to isolate, identify, and characterize viruses play complementary roles in the virus discovery effort.

Novel modern molecular techniques of virus discovery
There are two major types of molecular methods for virus discovery: sequence-dependent and sequence-independent methods. Sequence-dependent methods require knowledge of the nucleic acid sequence. Such methods include consensus primer PCR and microarrays. Several viruses, such as human immunodeficiency virus [26], simian retroviruses [27][28][29], and hepatitis E virus [30], have been identified using consensus sequences of known viruses; however, these methods have little or no value for characterizing completely novel viruses. Microarrays use probes that can hybridize to known viral sequences and potentially novel viruses with sufficient sequence similarity [31]. Microarrays have been applied in the detection of swine respiratory viruses in clinical samples [32] and for dengue virus [33]. Unlike sequence-dependent methods, the sequence-independent approaches do not rely on any knowledge of the virus. SISPA circumvents the viral load limitation of suppression subtractive hybridization (SSH). There are now several variations to the original protocol created by Reyes et al. [6]. The main strategy of SISPA is to exploit the sensitivity and the specificity of PCR amplification using primers that bind oligonucleotide fragments ligated to any putative viral DNA material in the sample. SISPA has been modified to allow the detection of both DNA and RNA viruses after the removal of genomic and contaminating nucleic acids [34]. Human astrovirus [35] and parvoviruses 2 and 3 [36] were discovered using SISPA methods. Another sequence-independent and culture-independent approach, viral metagenomics, provides superior capability for detecting known and unknown viruses and will be described in detail below.

Viral metagenomics
The term 'metagenome', proposed by Handelsman, first appeared in PubMed in 1998 in relation to classifying unculturable bacteria from soil samples [37]. Metagenomics can be defined as the characterization of genetic information directly from samples. It is a culture- and sequence-independent approach that does not rely on the presence of any particular gene in all subject entities [7,10]. This approach was originally developed as a tool for 'functional and sequence-based analysis of collective microbial genomes contained in environmental samples' [38,39]. The first application of metagenomics to the field of virology was in the analysis of viral communities sampled at two near-shore marine locations in San Diego, California [40]. Since then, viruses in numerous environments, including freshwater, marine sediment, soil, and the human gut, have been surveyed. More information on published environmental viral metagenomes, DNA viral families, and RNA viral families in environmental samples and organisms found using the viral metagenomic approach can be found in Rosario's review [9]. The availability of NGS has allowed viral metagenomes to be evaluated by unbiased means at previously unforeseen resolution and is providing a wealth of new opportunities in two major areas: viral candidate pathogen discovery and viral ecology [7].
Currently, several commercially available high-throughput sequencing platforms exist that vary in their sequencing principle, sequencing speed, expense, and read length. Figure 1 shows a schematic flow diagram summarizing the viral metagenomic studies that used sequence-independent amplification and high-throughput sequencing for virus discovery in 2012. A metagenomic analysis essentially entails the following steps: sample preparation, sequence-independent amplification, high-throughput sequencing, and bioinformatics analysis (Figure 2). Below, we briefly discuss each step of this viral metagenomic process. More detailed descriptions can be obtained in a previously published paper [41].
(i) Sample preparation. Theoretically, any type of sample can be analyzed using the viral metagenomic approach; however, viral genomes are relatively short and of low concentration in many samples, and bacterial and eukaryotic nucleic acids can interfere with the isolation and detection of viral DNA or RNA. Thus, one of the most important, yet difficult, tasks of viral metagenomics is removal of non-viral nucleic acid while preserving viral nucleic acids [11,41,42]. Toward this aim, several measures have been taken, such as SYBR-gold staining, filtration, sucrose or cesium chloride density centrifugation [41], DNase digestion [43], and chloroform treatment [44]. Total RNA content is usually estimated by measuring ribosomal RNA (rRNA), because the abundance of rRNA and its association with ribosomes makes the detection of contaminating RNA viruses difficult. However, RNases are ineffective for degradation of rRNA. Therefore, other strategies must be employed. One rRNA depletion strategy is to use random primers during cDNA synthesis that do not target rRNA sequences [45,46]. Another strategy is to use biotin-labeled probes that target ribosomal sequences for depletion [47][48][49].
(ii) Sequence-independent amplification. Sequence-independent amplification is an important step in viral metagenomics: it reflects the true genetic composition of the sample and simultaneously amplifies several viral genomes, including those of highly divergent and completely novel viruses, making them available for characterization [11,34]. SISPA, random PCR amplification, displacement amplification, arbitrarily primed PCR, and rolling circle amplification are typically used for sequence-independent amplification. Detailed descriptions of these methods are available in previously published papers [11,50].
(iii) High-throughput sequencing. To identify known, highly divergent, or new viruses, sequencing is often utilized. The Sanger sequencing method can create high-quality sequence reads up to nearly 1000 nt; however, this highly laborious process limits its usage [43]. High-throughput sequencing, or NGS, is used more and more often in viral metagenomics. The 454 sequencing platform (Roche Diagnostics) [8,51], SOLiD sequencing (Life Technologies) [7,52], Illumina sequencing [7,52], Helicos sequencing (Helicos Biosciences), PacBio sequencing (Pacific Biosciences), and Ion Torrent (Life Technologies) [7,52] are all commercial instruments in the marketplace, and all use slightly different methodologies to achieve clonal amplification and sequencing. The advantages and disadvantages of each of these instruments have been detailed in other papers [7,52-54].
(iv) Bioinformatics. Post-sequencing analysis of the vast amount of data produced is the challenging part of viral metagenomics. The datasets from metagenomic studies are complicated, and they typically contain a mixture of different species. The genomes in the datasets are usually incomplete, in some cases with only a small number of short fragments belonging to each genome. Two approaches can be used for analysis of read data: de novo assembly using software such as Velvet [55], and mapping strategies using a mapper such as BWA [56]. Other programs and platforms have also been developed, the details of which can be found in the NGS review by Blomstrom et al. [50].
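The difference between the two read-analysis approaches mentioned above can be illustrated with a toy sketch. It is not how Velvet or BWA work internally (Velvet builds de Bruijn graphs, BWA uses a Burrows-Wheeler index); the function names, the exact-match seeding, and the greedy overlap merging are simplifying assumptions chosen only to contrast mapping against a known reference with assembly that needs no reference.

```python
from collections import defaultdict


def build_kmer_index(reference: str, k: int) -> dict:
    """Map every k-mer in the reference to the positions where it starts."""
    index = defaultdict(list)
    for pos in range(len(reference) - k + 1):
        index[reference[pos:pos + k]].append(pos)
    return index


def map_read(read: str, reference: str, index: dict, k: int):
    """Return the first reference position where the read matches exactly, else None."""
    if len(read) < k:
        return None
    for pos in index.get(read[:k], []):  # seed with the read's first k-mer
        if reference[pos:pos + len(read)] == read:  # extend and verify
            return pos
    return None


def overlap(a: str, b: str, min_len: int) -> int:
    """Length of the longest suffix of a that equals a prefix of b (0 if below min_len)."""
    for length in range(min(len(a), len(b)), min_len - 1, -1):
        if a.endswith(b[:length]):
            return length
    return 0


def greedy_assemble(reads, min_len: int = 3):
    """Toy de novo assembly: repeatedly merge the two contigs with the longest overlap."""
    contigs = list(dict.fromkeys(reads))  # drop duplicate reads, keep order
    while True:
        best = (0, None, None)
        for i, a in enumerate(contigs):
            for j, b in enumerate(contigs):
                if i != j:
                    olen = overlap(a, b, min_len)
                    if olen > best[0]:
                        best = (olen, i, j)
        olen, i, j = best
        if olen == 0:  # nothing left to merge
            return contigs
        merged = contigs[i] + contigs[j][olen:]
        contigs = [c for n, c in enumerate(contigs) if n not in (i, j)] + [merged]


# Hypothetical 14-nt "genome" and three overlapping reads, for illustration only.
reference = "ATGCGTACGTTAGC"
reads = ["ATGCGTAC", "GTACGTTA", "GTTAGC"]
index = build_kmer_index(reference, k=4)
positions = [map_read(r, reference, index, k=4) for r in reads]
contigs = greedy_assemble(reads, min_len=3)
```

Mapping places each read only if a related reference is already known, which is why a read from a highly divergent virus would return `None` here; assembly rebuilds longer contigs from the reads alone, which is the route by which completely novel genomes are usually recovered before similarity searches.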

Viruses discovered in human feces
Feces comprise a community of many different microorganisms, including bacteria and viruses. Many viral diseases are transmitted by the oral-fecal route, and the viral agents are excreted in the feces. With the development of modern molecular methods and viral metagenomics, several new viruses have been discovered in feces in the past decade.
(i) Picornaviruses. Enteroviruses constitute the largest genus within the family Picornaviridae, which have nonenveloped, positive-sense, single-stranded RNA (+ssRNA) genomes surrounded by four structural proteins, VP1-VP4. A majority of these viruses infect humans and cause diseases ranging from minor respiratory illness to severe neurological disorders like meningitis, encephalitis, and poliomyelitis [57]. Metagenomic analyses of viruses in stool samples from children with acute flaccid paralysis (AFP) found numerous viruses, particularly HEV-C, including one from a potentially novel enterovirus genotype [58].
Human cosavirus (HCoSV) is a proposed new genus in the family Picornaviridae originally identified in 2008 in feces [59]. Since then, HCoSVs have been detected in feces from Australian [60] and Chinese [61] children. HCoSV infection with accompanying diarrhea in Thailand [62], the prevalence of a new species (F) of HCoSV, and another 26 new HCoSV genotypes in human feces of healthy children and children with AFP [63] were also reported. HCoSV pathogenicity in humans has remained unknown because detection rates in patients and healthy controls were similar in the only available cohort studies of patients with AFP in Asia [59] and with gastroenteritis in China [61]. In Brazil, the 3.6% detection rate in children with gastroenteritis [64] was comparable to the 1.8% rate in a cohort study of gastroenteritis in China [61]. The large number of HCoSV genotypes/serotypes reported suggests that HCoSV infection may be associated with a range of diseases, such as unexplained diarrhea, AFP, or others. Future work will need to define whether HCoSV is a true human pathogen.
Two other novel picornaviruses have also been described since 2008 owing to the advent of metagenomics: klassevirus [65] and salivirus [66]. The reported association between salivirus shedding and diarrhea indicates that such infections may account for a significant fraction of the unexplained cases of diarrhea occurring worldwide. Future studies to determine a possible link to disease in humans and any unique characteristics of the viral life cycle will be required. Successful viral culture in human cell lines, especially those from the gastrointestinal tract, would suggest that the virus is competent to replicate in human cells and that humans could be a bona fide host of klassevirus [65]. Further epidemiological screening and serological assays will be necessary to understand the diversity within this possible genus, the prevalence of these novel viruses, and the age range of those susceptible to infection.
(ii) Parvoviruses. Parvoviruses are, as their name suggests, small viruses, with a single-stranded DNA genome and are widespread pathogens that cause a wide range of diseases in humans and animals. Allander et al. [14] first reported the discovery of a previously unidentified human parvovirus in 2005. From 2009 to 2010, the bocavirus genus was expanded to include three additional species of human bocaviruses, HBoV2 [67], HBoV3 [68], and HBoV4 [69]. HBoV2-4 seems to be found primarily in human stool [67][68][69][70][71]. HBoV1 is predominantly a respiratory pathogen, whereas HBoV2 and possibly HBoV3 are associated with gastroenteritis [67,71]. A variety of signs and symptoms have been described in patients with HBoV infection, including rhinitis, pharyngitis, cough, dyspnea, and diarrhea, among others [72].
Human parvovirus 4 (PARV4) was identified in 2005 in a plasma sample [73]. Another two novel parvoviruses closely related to PARV4, porcine hokovirus and bovine hokovirus, were discovered in feces specimens from Hong Kong [74]. Furthermore, a proposed new Parvoviridae genus associated with acute diarrhea has been identified; however, wider geographic sampling of human and animal fecal samples will provide a better understanding of the genetic diversity of this proposed genus. Serological assays and case-control studies will also help determine whether members of this viral clade are associated with diarrhea or other symptoms [75].
(iii) Astroviruses. The family Astroviridae consists of small, non-lipid enveloped, single-stranded, positive-sense RNA viruses with genomes that contain three open reading frames (ORFs), designated ORF1a, ORF1b, and ORF2. Astroviruses are known to infect a variety of human and animal hosts. Recently, two highly divergent members of the astrovirus family, MLB1 [76][77][78] and VA1 [79], were identified and associated with human viral diarrhea. Astroviruses VA2, MLB2, and VA3 were also reported as novel astroviruses in human stool [80], as was HMOAstV A-C [81]. With the rising number of human astrovirus species detected in infections associated with unexplained AFP and/or diarrhea, increased understanding of the genetic diversity within viral families infecting humans will assist in future studies of their pathogenicity and allow the design of specific diagnostic assays [81].
(iv) Polyomaviruses. Viruses in the Polyomaviridae family typically possess ~5000 bp of circular, double-stranded DNA. The genome can be divided into three parts: the regulatory region, the early region, and the late region. Over the past five years, seven novel polyomaviruses have been discovered in humans, namely KI polyomavirus (KIPyV) [82], WU polyomavirus (WUPyV) [83], Merkel cell polyomavirus (MCPyV) [84], human polyomavirus 6 (HPyV-6), human polyomavirus 7 (HPyV-7) [85], human polyomavirus 9 (HPyV-9) [86], and trichodysplasia spinulosa-associated polyomavirus (TSPyV) [87]. In 2012, another two novel polyomaviruses were discovered in human stool and provisionally named MW polyomavirus (MWPyV) [88] and MX polyomavirus (MXPyV) [89]. The number of polyomaviruses found in the human body continues to grow, raising the question of how many more species have yet to be identified and what roles they play in humans with and without manifest disease. Taking MWPyV as an example, one critical question is whether MWPyV is a bona fide infectious agent of humans and, if so, which disease(s), if any, might be associated with MWPyV infection. The detection of MWPyV in feces of children with diarrhea, many cases of which are unexplained, raises the possibility that MWPyV might play a role in human diarrhea. It is also possible that MWPyV is a dietary contaminant and does not actively infect humans. Serological studies of antibody-based immune responses to MWPyV, together with wider sampling and screening, are approaches to determine whether MWPyV is a true infectious agent [88].
(v) Circoviruses. Members of the family Circoviridae are non-enveloped, spherical viruses with a single-stranded circular DNA genome of approximately 2 kb, the smallest known autonomously replicating viral genome [90], and include the genera circovirus, gyrovirus, and cyclovirus (proposed) [91,92]. Until recently, chicken anemia virus (CAV) was the only known representative of the gyrovirus genus. In the last two years, three novel gyroviruses have been reported: avian gyrovirus 2 (AGV2) in chickens [93], a human gyrovirus (HGV) detected on human skin [94], and gyrovirus 3 (GyV3) in human feces and chicken meat [95]. Recently, a novel gyrovirus 4 (GyV4) was also identified in human stool and in chicken meat prepared for human consumption [96]. In addition, a new strain of CAV, designated GD-1-12, was isolated from fecal samples of a 12-day-old commercial broiler in Guangdong province, China [97]. Multiple diverse circoviruses were found in human and chimpanzee feces, such as CyCV-PK5006 and Chimp 17 [92]. Other novel single-stranded, circular DNA viruses were also detected in porcine [98] and bovine [99] feces.
Diarrhea is the third leading infectious cause of death worldwide, and an estimated 1.4 billion nonfatal episodes occur yearly [100,101]. Importantly, it is estimated that 40% of diarrhea cases are of unknown etiology [60,102]. Viral metagenomics has largely propelled the progress in virus discovery; however, the association of these newly discovered viruses with specific diseases is largely unknown. An understanding of the disease associations and causal roles of these newly discovered viruses is urgently needed.

Metagenomic Koch's postulates
Viral metagenomics has provided a powerful tool for discovering new viruses. However, virus discovery is only the first step to determining the etiology behind a disease. The detection of nucleic acid is not sufficient to prove causality. For that, more comprehensive studies have to be performed [103]. Evidence for causality is not always conclusive, even when the suspect virus is found "at the scene of the crime". This means that finding a virus in a sample from a patient with an illness of unknown etiology, and even demonstrating an association, does not always prove causation. Therefore, strict guidelines proposed by Robert Koch and later modified by Rivers [104] have been used to assign causality to infectious agents [10]. One of Koch's postulates requires that the candidate etiological agent be isolated from a diseased organism and grown in pure culture. However, not all viruses can be propagated by current culture methods [103].
In many cases, Koch's postulates will not be satisfied if current culture standards are used to prove causality. Koch's postulates have since been modified to acknowledge molecular methods used to monitor the role played by genes in bacterial virulence [105]. The revised so-called molecular Koch's postulates can be applied to pathogenic members of a genus or pathogenic strains of a species as well as nonpathogenic strains. Because genes can be expressed at different time points during infection, however, new molecular methods do not always distinctively characterize virulence genes and make a clear association with the disease of unknown etiology. Mokili et al. [10] proposed the so-called metagenomic Koch's postulates, which focus on the identification of metagenomic traits in disease subjects (Figure 3). Molecular markers such as sequence reads, assembled contigs, genes, or full genomes that can uniquely distinguish disease-associated metagenomes from those obtained from matched healthy control subjects can all be used to define metagenomic traits. Thus, satisfying metagenomic Koch's postulates is possible when one or multiple viral agents are involved in disease causation.

Figure 3
Flow chart of virus discovery, pathogenicity determination, intervention, and metagenomic Koch's postulates. Traits A, D, E, and J found in the diseased animal (e.g., a mouse) are not present in the healthy control (represented in blue). The acquisition or increase of new metagenomic traits, A, P, and E, are represented in orange. Inoculation of the suspected purified traits into a healthy mouse will induce disease if the traits encompass the etiology of the disease (represented in purple); pathogenicity as a trait is represented in red. The figure is modified from models proposed by Mokili et al. [10] and Delwart [106].

Figure 4
The global aviation network. Lines show direct links between airports, and the color of the line indicates passenger capacity in people per day (thousands (red), hundreds (yellow), and tens (blue)) [110].
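The idea of a metagenomic trait that distinguishes diseased subjects from matched healthy controls can be sketched as a simple set comparison. This is only a minimal illustration under stated assumptions: the function name, the `min_case_fraction` threshold, and the contig labels are hypothetical and do not come from Mokili et al. [10], and a real analysis would work with abundances and statistical tests rather than presence or absence.

```python
def candidate_traits(diseased, controls, min_case_fraction=1.0):
    """Return traits present in at least min_case_fraction of diseased samples
    and absent from every healthy control sample."""
    if not diseased:
        return set()
    seen_in_controls = set().union(*controls)
    counts = {}
    for sample in diseased:
        for trait in set(sample):
            counts[trait] = counts.get(trait, 0) + 1
    needed = min_case_fraction * len(diseased)
    return {
        trait
        for trait, n in counts.items()
        if n >= needed and trait not in seen_in_controls
    }


# Hypothetical traits (e.g., assembled contigs) per subject, for illustration only.
diseased = [
    {"contig_A", "contig_B", "contig_E"},
    {"contig_A", "contig_E", "contig_Q"},
]
controls = [
    {"contig_B"},
    {"contig_M"},
]
strict = candidate_traits(diseased, controls)
relaxed = candidate_traits(diseased, controls, min_case_fraction=0.5)
```

Traits selected this way show association only. Under the metagenomic Koch's postulates, the inoculation step summarized in Figure 3 is still needed before causation can be claimed.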

Challenges in virus discovery
Novel viruses are still emerging. In November 2011, a novel orthobunyavirus named Schmallenberg virus (SBV) was detected in plasma samples from a farm near the German town of Schmallenberg by using a metagenomic approach with NGS [107]; this novel virus is on the rise in Europe [108,109]. The SBV epidemic showed once again that the novel technology of metagenomics is very useful for early detection of novel pathogens in livestock. In fact, veterinary diagnostics in Europe has proved to be a very effective network of institutions studying epizootic diseases. Before any strategy is decided for prediction, precaution, and prevention of emerging infectious diseases (EIDs) in humans, three questions should always be asked and answered: what to expect, what to be prepared for, and what to do.

Figure 5
The discovery curve for human virus species. The cumulative number of species reported to infect humans (black circles and line). Statistically significant upward breakpoints are shown (vertical lines). The best-fit curve (solid line) and lower and upper 95% posterior intervals (dashed lines) for extrapolation to 2020 are shown [115].

Another important issue is that of emerging vector-borne diseases in global health. Many vector-borne pathogens have appeared in new regions in the past two decades. Human trade and travel, together with hosts, vectors, and climate conditions, all contribute to emerging vector-borne disease. Figure 4 shows the direct links between airports across the world. Knowledge of the drivers, dynamics, and ecology of zoonoses is helpful for prediction and prevention of the next pandemic zoonosis [110][111][112].
The Virus Pathogen Database and Analysis Resource (ViPR; www.ViPRbrc.org) is an integrated repository of data and analysis tools for multiple virus families and is supported by the National Institute of Allergy and Infectious Diseases (NIAID) Bioinformatics Resource Centers (BRC) program. This program has the advantage of a powerful suite of resources provided by the ViPR BRC with which virology researchers can streamline and expedite experimental discovery for the ultimate goal of developing improved diagnostics, prophylactics, and/or therapeutics for pathogenic viruses [113,114]. With globally coordinated activity and effort, temporal trends in the discovery of human viruses [115] (Figure 5) will enable the control of forthcoming pandemic zoonoses and lead to greater achievements.