Simultaneous virus identification and characterization of severe unexplained pneumonia cases using a metagenomics sequencing technique.

Many viruses can cause respiratory diseases in humans. Although great advances have been achieved in methods of diagnosis, it remains challenging to identify pathogens in unexplained pneumonia (UP) cases. In this study, we applied next-generation sequencing (NGS) technology and a metagenomic approach to detect and characterize respiratory viruses in UP cases from Guizhou Province, China. A total of 33 oropharyngeal swabs were obtained from hospitalized UP patients and subjected to NGS. An unbiased metagenomic analysis pipeline identified 13 virus species in 16 samples. Human rhinovirus C was the virus most frequently detected and was identified in seven samples. Human measles virus, adenovirus B 55 and coxsackievirus A10 were also identified. Metagenomic sequencing also provided virus genomic sequences, which enabled genotype characterization and phylogenetic analysis. For cases of multiple infection, metagenomic sequencing afforded information regarding the quantity of each virus in the sample, which could be used to evaluate each viruses' role in the disease. Our study highlights the potential of metagenomic sequencing for pathogen identification in UP cases.


INTRODUCTION
Viruses are responsible for the majority of human respiratory infections (Rudan et al., 2013), and various respiratory pathogens such as influenza virus, parainfluenza virus, coro- †Contributed equally to this work *Corresponding author (email: cfswdm@gzscdc.org) **Corresponding author (email: yshu@cnic.org.cn) navirus, and rhinovirus are frequently identified in clinical samples. However, pathogens remain undetected by routine clinical testing in approximately 30% of respiratory cases (Heikkinen and Järvinen, 2003). Moreover, many novel viral pathogens, such as the SARS-CorV and MERS viruses (Ding et al., 2003;Ishikane et al., 2015), can infect the human respiratory tract and lead to severe respiratory distress. Therefore, active surveillance of the causative agents involved in respiratory cases is also critical for early detection of emerging diseases.
Owing to its high specificity and sensitivity in clinical diagnosis, sequence-based PCR is currently the routine approach for virus identification. However, PCR-based methods require sequence information for the targeted agents, and a specific assay needs to be developed for each individual pathogen. Furthermore, as only the pathogen that such an assay is designed to detect will be identified, atypical or emerging pathogens may evade detection by PCR-based methods. In contrast, recently developed next-generation sequencing technology (NGS) combined with a highly sensitive random amplification method can provide a comprehensive view of the pathogenic agents in a given sample, making it suitable for identifying novel and unexpected pathogens Zoll et al., 2015). Moreover, metagenomic sequencing could provide detailed genomic information of the pathogens present in samples, which could be used for genotype characterization and phylogenetic analysis. Based on the taxonomic assignment of output sequences, clinicians can further evaluate the individual influence of each pathogen to disease development (Graf et al., 2016). Thus, compared to the positive/negative result provided by routine PCR, such sequence-based data are more informative to clinicians with regard to determination of the real causative agent of a disease.
Since 2004, the Chinese government has implemented a nationwide surveillance network to monitor unexplained pneumonia (UP) cases (Qian et al., 2011). However, the pathogens causing UP are typically difficult to identify using traditional diagnostic tools. Thus, to examine the potential of NGS and a metagenomic approach for pathogen identification and characterization of UP cases, we collected 33 respiratory samples from severe UP cases in Guizhou Province, China. These samples were subjected to linear amplification and deep sequencing for detection of respiratory viruses. Our study shows that a metagenomic approach could be a promising tool for the clinical diagnosis of UP cases.

Virus detection by metagenomic sequencing
Of the 33 samples processed for NGS and metagenomic analysis, 16 were positive for respiratory viruses and comprised 13 virus types (Table 1). HRV was the most frequently detected virus and was found in nine samples, seven of which were the HRV-C genotype. Two samples indicated dual infections, with co-infection by PIV3/HRV-C and PIV2/HCoV-229E in CN28 and CN32, respectively. A triple infection of HCoV-OC43, human coxsackievirus (HCkV), and RSVB was identified in a 1-year-old child (CN12). All of the detected viruses could be genotyped to subtype or serotype based on the genome sequence assembly. A long genome sequence was assembled for most of the viruses, with 10 having ≥80% genome coverage and a complete genome sequence directly obtained for three. Common respiratory viruses, such as PIV, HCoV, and HRV, were the primary pathogens detected in the UP cases examined. Of interest, the measles virus, which is not included in the routine PCR panel for respiratory pathogens, was also detected by blind analysis in a 1-month-old patient. Other viruses, such as human endogenous retrovirus and human papillomavirus, were also detected in some samples.

Taxonomic assignment of NGS data
Viruses could be detected from metagenomic sequencing data by blind analysis. Take CN31 as an example, we obtained 5,056,875 reads and assembled 5,149 contigs, 98 of which were aligned to the measles virus, most of which targeted the measles virus strain MVi/Pennsylvania.USA/20.09. Thus, the complete genome of this strain (GenBank: JN635411.1) was downloaded and mapped back to the clean reads; in which 1,047,220 were assigned, corresponding to 38.92% of the total clean reads.
In addition to the reads belonging to measles virus, there are still a large number of clean reads with no information regarding origin. Thus, we aligned all clean reads to the NCBI nt database and then imported the BLAST results into MEGAN software to detail taxonomic assignment. According to the taxon tree, the measles virus received the most reads assigned, followed by Streptococcus and Homo sapiens ( Figure 1A). After reads assigned to "Homo sapiens," "Not assigned," and "Low complexity" were filtered out, the data assigned to the measles virus accounted for 68.60% of the total clean reads with a taxon match, followed by 17.46% and 3.74% reads matching to Streptococcus and unclassified Siphoviridae, respectively ( Figure 1B).

Virus weight evaluation in multiple infections
Metagenomic sequencing provided an opportunity to evaluate the individual influence of each virus in cases of multiple infection. In the triple-infection case CN12, we obtained 5,370,692 reads and assembled 2,416 contigs, of which three, Figure 1 (Color online) Taxon assignment of metagenomic data from sample CN31. Clean reads were aligned to the NCBI nt database with BLASTn, and the results were imported into Megan software to assign reads to each taxon. A, Read assignment to the phylogenetic tree using Megan. Each circle represents a taxon in the NCBI taxonomy and is scaled logarithmically to indicate the number of reads assigned. Taxa with few reads assigned were collapsed to senior taxa. B, The pie charts correspond to the read distribution among the agents identified at the genus level. Reads matching "Homo sapiens," "Not assigned," and "Low complexity" were removed before analysis. The majority of attributable reads were for the measles virus, which belongs to the Morbillivirus genus. Genera with <1% reads assigned are not indicated in the pie charts. 25, and 12 were aligned to HCkV, HCoV-OC43, and RSVB, respectively (Table 2). Mapping analysis indicated that 1,519,427, 6,917, and 5,127 reads could be assigned to these three viruses, achieving 100%, 69.5%, and 66.7% virus genome coverage, respectively ( Figure 2). Therefore, the reads assigned to HCkV outnumbered those for HCoV-OC43 and RSVB by 200-fold. Virus quantification using digital PCR (dPCR) indicated 1,275 copy μL -1 for HCkV in the clinical sample, with 12 copy μL -1 for HCoV-OC43 and 6 copy μL -1 for RSVB ( Table 2). The dPCR results correlated Figure 2 Genome coverage of the three viruses in triple-infection sample CN12. Three viruses were detected in sample CN12. The reference genome of each virus was identified based on contigs BLAST searches and downloaded for mapping back of the clean reads. The Human coxsackievirus (HCkV) got complete genome covered with high depth, while the human coronavirus OC43 (CorV-OC43) and respiratory syncytial virus B (RSVB), obtained 69.5% and 66.7% of genome covered, respectively. Uncovered regions have no line assignment. well with the NGS read counts for each virus.

Phylogenetic analysis of the detected viruses
Phylogenetic information is valuable for determining virus origin, which is critical for management of endemics and outbreaks. For the measles virus in sample CN31, the H gene showed the highest homology to an eastern Chinese strain (MVi/Zhejiang.CHN/10.11/2[H1]) isolated in 2011 ( Figure  3A). Phylogenetic analysis including WHO standard strains revealed this measles virus belonged to the H1 genotype ( Figure 3A), which predominated the genotype of measles virus in China for 16 years (Zhang et al., 2012). Moreover, this measles virus clustered with the H1a subgroup, which overwhelmed the H1b genotype in China after 2005 (Xu et al., 2014). Sequence alignment including the Chinese vaccine strains Shanghai191 and Changchun47 indicated that the four N-linked glycosylation sites between position 168 and 268 of HA are conserved in our measles virus (Jianzhi and Zhihui, 1983). These glycosylation sites are critical for proper HA protein folding, antigenicity, dimerization, and export from the Golgi . The Guizhou measles virus strain was also consistent with vaccine strains in the five antigenic epitopes and seven cysteines of HA (Hu et al., 1993); these residues are important for accurate disulfide bonding of monomeric HA glycoproteins Plemper et al., 2000).
The complete genome of HCkV in CN05 was directly obtained by metagenomic sequencing, revealing that this virus belongs to genotype A10. The complete genome of this virus showed 99% sequence identity to coxsackievirus A10 strain FY01/AH/CHN/2013 in the NCBI nucleotide database. Phylogenetic analysis based on the partial VP1 sequence showed that this HCkV sequence groups with strains isolated in China after 2009 and that it is distinct from strains isolated in Japan or strains previously isolated in China ( Figure 3B). These results indicate that it is a local circulating strain.

DISCUSSION
The respiratory tract is one of the major interfaces for human contact with external microbes and is thus the target of many viruses, especially RNA viruses (Assiri et al., 2013;Xie et al., 2015). Current molecular tests for respiratory infection are largely pathogen specific, requiring test selection based on the patient's symptoms, which necessitates large panels of expected pathogens and has limited yield. In contrast, NGSbased metagenomic approaches are target independent and provide unbiased identification of common and unexpected pathogens in a given sample. Moreover, metagenomic sequencing assembles genome information directly from clinical samples, allowing assessment of antigenic epitope variation, strain-level typing, and phylogenetic relatedness. In addition, the increasing advances in bench-top sequencers such Figure 3 Phylogenetic analysis of the measles and HCkV viruses. Sequences were aligned with MUSCLE; phylogenetic analysis was performed using Mega 5.1. Bootstrap values <70% are hidden in the trees. A, The complete coding sequence of the hemagglutinin gene was used for phylogenetic analysis of the measles virus in sample CN02. Groups corresponding to cluster H1a and H1b are indicated. The phylogenetic trees are rooted to genotypes G3 and H2, respectively. WHO reference strains are labeled with black squares, and the measles virus we detected is labeled with a black circle. B, Phylogenetic analysis of HCoV-A10 strains based on the partial VP1 gene (269nt). HCoV detected in this study is labeled with a black circle.The results presented here prove that metagenomic sequencing is a promising tool for virus detection and characterization in cases of complex infection. In contrast to routine diagnostic tools, which are designed to detect predefined pathogens, metagenomic sequencing not only enables unbiased pathogen detection but also yields epidemiologically and clinically valuable sequence information. In addition to the decreasing cost of NGS and improved data processing methods, metagenomic sequencing offers unprecedented possibilities for characterizing, diagnosing, and evaluating clinical pathogens in UP cases. as the Illumina MiSeq and Ion Torrent PGM are associated with an ongoing reduction in cost, making metagenomic sequencing of clinical samples more affordable and available in clinical diagnostic laboratories.
Challenges in pathogen identification often occur for the respiratory infection cases forwarded to CHUPSN, at least with regard to local clinical diagnostic laboratories. Based on the definition of UP cases, viruses should be the main causative agents. Our metagenomic sequencing identified 13 virus types in 16 samples among a total 33 cases tested. All the viruses were identified to the strain or genotype level, as demonstrated by human adenovirus B 55 and HCkV A10. HRV-C was the virus most frequently detected. Although we could not conclude that HRV-C was the causative pathogen of the UP cases because of its high prevalence in non-symptomatic individuals , attention should be given to HRV-C in young patients, such as the 1-month-old infant case in our study (CN24), as there is evidence that HRV-C may be more virulent in children with pneumonia (Khetsuriani et al., 2008).
Laboratory diagnosis of infectious diseases has historically taken a syndrome-based approach, and the current molecular tests are usually target specific, resulting in challenges when novel or unexpected viruses emerge. Measles is a highly contagious disease characterized by a high fever, cough, and maculopapular rash. Its clinical symptoms are so characteristic that the measles virus is generally not included in common multiple PCR panels, such as the FDA-cleared GenMark eSensor RVP (Pierce and Hodinka, 2012) and CE-IVD-labeled RespiFinder SMART 22 FAST PCR panel (Dabisch-Ruthe et al., 2012), which cover 14 and 22 respiratory pathogens, respectively. The scenario in our study of an infant as young as one month of age infected with measles is a rarity, as protection owing to maternal antibodies usually exists during this early period of life (Kiliç et al., 2003). Thus, identification of the causative agent in such cases would be relatively difficult using routine PCR-based assays. Conversely, our unbiased metagenomic approach directly identified this virus, and we were able to assemble the complete genome sequence, allowing a comprehensive phylogenetic analysis and sequence characterization. Moreover, taxonomic assignment of the total reads indicated that the measles virus accounted for the greatest proportion of NGS output data. Given that no other pathogens were detected in the sample, we are substantially confident that the measles virus was the pathogen causing UP in this infant.
Metagenomics-based pathogen detection is particularly powerful when multiple viruses are detected in a single sample, which could provide a read count-based evaluation of viral load in the sample. For the triple-infection sample CN12, 200-fold more reads were found for HCkV compared to HCoV-OC43 and RSVB. Indeed, the read count ratio of the three virus correlated with the virus burden determined by dPCR (Table 2). A previous study also indicated that viral read counts based on untargeted metagenomics correlated with the viral burden determined by quantitative PCR or the threshold cycle (Ct) value provided by Taqman PCR (Graf et al., 2016). Therefore, HCkV appears to have played a predominant role in this patient's UP disease.
Another advantage of metagenomic pathogen identification is the ability to perform pathogen genome sequence assembly from NGS reads, which could be used for phylogenetic analysis and antigenic epitope characterization. Based on the complete genome sequence of the measles virus assembled for CN31, we found that the critical sites in HA responsible for protein folding and antigenic epitopes are conserved between our case and the vaccine strain. The phylogenetic analysis also indicated that this virus belongs to a local circulating lineage. Metagenomic sequencing also assembled the complete genome sequence of HCkV in CN12, and we further determined that it is of the A10 genotype. The HCkV sequence we detected clustered with A10 strains isolated in China in 2013 ( Figure 3B), a time when CV-A10 infections were increasing in some regions of the country (Yang et al., 2015). CV-A10 was demonstrated to be associated with severe complications in hand-foot-mouth disease (Lu et al., 2012), which suggests a critical role of this virus in the severe pneumonia of this infant. This molecular epidemiological information based on genome sequence assembly from NGS data can provide valuable characterization of the origin and transmission of infectious diseases (Du et al., 2016).

Sample collection
Samples were collected based on the Chinese Human Unexplained Pneumonia Surveillance Network (CHUPSN), which was established in China in 2004 to address the challenges of the early discovery of novel viruses, such as SARS-CorV. Sentinel hospitals in CHUPSN performed UP case identification and clinical sample collection. According to CHUPSN guidelines, a UP case should be accompanied by fever (>38°C) and pneumonia radiographic results as well as a normal or slightly decreased number of total white blood cells or a decreased lymphocyte differential count in the early stage of the disease. Moreover, no improvement is observed in UP cases after regular antibiotic treatment. After a UP case was defined, an oropharyngeal swab was sampled from the patient and stored in viral transport medium (VTM) at −80°C until testing. The VTM was transferred to the local center for disease control and prevention (CDC) for influenza A (FluA) virus testing by real-time PCR. This study focused on UP patients who developed severe clinical symptoms and required hospitalization in Guizhou Province from 2013 to 2014. Only FluA-negative samples were processed by metagenomic sequencing. A total of 33 patients, 24 males and nine females aged from one month to 82 years, were ultimately enrolled (Table S1 in Supporting Information). All the participants (or guardian for children) provided written informed consent. The collection of clinical samples and implementation of the study were approved by the ethics committee of the Guizhou CDC and National Institute of Viral Disease Control and Prevention, China CDC.

Sample processing and next-generation sequencing
Clinical specimens for NGS were subjected to a series of pre-treatments before nucleic acid extraction based on the protocol described by Claudia (Kohl et al., 2015). Briefly, 500 μL of VTM was centrifuged at 2,000×g for 10 min, and the supernatant was filtered through a 0.22-μm filter (Ultrafree MC, USA) at 2,000×g in a microcentrifuge. The filtered VTM was treated with an enzyme cocktail containing 10 U of DNaseI (NEB, USA), 20 U of RNase If (NEB), and 10 U of benzonase (Merck, USA) for 2 h to digest naked host nucleic acid. RNA was extracted from the treated sample using an RNeasy mini kit (QIAGEN, Germany) and subjected to linear reverse amplification using the NuGEN Ovation RNA-seq V2 kit (NuGEN, USA). Approximately 100 ng cDNA per sample was used for sequencing with the Ion PGM™ sequencing platform (Thermo Fisher, USA) at a read-length of 200 bp. For samples with scant cDNA produced owing to low viral copies, we used 1 ng of cDNA for library construction with Nextera XT DNA Library Preparation Kit (Illumina, USA) and sequenced the products using the MiSeq (Illumina) platform for 2× 150 cycles with version 2 reagents and the "FASTQ only" workflow.

NGS data processing
NGS data analysis was primarily conducted using CLC genomic workbench 7.0.5 package. At the filter step, reads with an error rate above 0.05 or a length below 100 bp were trimmed out. Reads passing quality filtering were mapped to human reference genome hg18 with stringent criteria that set the length fraction at 0.9 and the similarity fraction at 0.9 using the CLC mapper software. Unmapped reads were collected as clean reads for downstream analysis.

Virus identification
Clean reads were assembled de novo with the "de novo assembly" tool in the CLC package under the parameter "automatic word size=20, automatic bubble size=50". After assembly and scaffolding, contigs with a length greater than 300 bp were aligned to the NCBI nt database using local BLASTn. The pathogen sequence/genome identified in the BLAST hits was selected as the reference for mapping clean reads under the default parameters in CLC mapper. Consensus sequences were extracted from the mapping results to obtain the targeted virus genome. All clean reads were mapped to the Complete RefSeq release of viral and viroid sequences downloaded from NCBI, which to date contains 6,024 virus reference genome.

Taxonomic assignment
Sequence similarity-based taxonomic assignments were applied to non-host data sets. Clean reads were first aligned to the NCBI nt database using local BLASTn with the "-max_target_seqs 5" specification. The BLAST results were then imported into MEGAN software (version 5.10.2) to assign each read to the lowest common ancestor (LCA) of the set of taxa that it hit (Huson et al., 2007). The "min support" (minimum support) parameter was changed to 100 (<0.1% of the total clean reads) to filter out taxa to which only a small number of reads were assigned.

Phylogenetic analysis
Phylogenetic analysis of the measles virus was performed based on the complete coding sequence of the surface hemagglutinin (H) protein, and that for enterovirus was based on a partial VP1 sequence. Sequences included in the analysis were aligned using MUSCLE, and a phylogenetic tree was constructed using the neighbor-joining method. The reliability of the phylogenetic inference at each branch node was estimated using the bootstrap method with 1000 replications.

Digital PCR quantification
Virus quantification was implemented in QuantStudio™ 3D Digital PCR System (Thermo Fisher, USA). Briefly, 5 μL of sample RNA was reverse transcribed to first-strand cDNA using SuperScript VILO cDNA Synthesis Kit (Thermo Fisher). A digital PCR reaction mixture was then prepared with 3.5 μL cDNA product added as the template and loaded onto digital PCR 20K Chips for PCR amplification with GeneAmp ® PCR System 9700. The number of virus copies was assessed by the QuantStudio 3D Digital PCR instrument, which imaged the 20K chip and performed a preliminary analysis. Those data in the result meeting all quality thresholds were considered to be valid.

Compliance and ethics
The author(s) declare that they have no conflict of interest.
Open Access This article is distributed under the terms of the Creative Commons Attribution License which permits any use, distribution, and reproduction in any medium, provided the original author(s) and source are credited. The supporting information is available online at life.scichina.com and www.springerlink.com. The supporting materials are published as submitted, without typesetting or editing. The responsibility for scientific accuracy and content remains entirely with the authors.