Objective

Next generation sequencing (NGS) is an attractive approach to diagnosis of infection, with the potential to offer a single diagnostic pipeline to identify viruses, bacteria and fungi from a range of clinical samples [1,2,3,4]. However, there are multiple challenges in implementing such systems, and ongoing efforts are required to develop in vitro methods for handling diverse types of clinical samples, evaluate and improve sensitivity, reduce the high burden of human reads, distinguish contaminants or commensals from pathogenic organisms, and optimise positive and negative controls.

In this small pilot study, we focused on detecting viruses in cerebrospinal fluid (CSF) and respiratory samples submitted to a routine diagnostic microbiology laboratory. Our aims were to evaluate a methods protocol and to provide a preliminary dataset for analysis, with a view to optimising our laboratory approach and providing a foundation for improving bioinformatic algorithms.

A summary of this work was presented at the UK National Federation of Infection Societies (FIS) meeting, Birmingham, November 2017 [5]. We have subsequently focused specific attention on analysis of human herpesvirus 6 (HHV-6) reads from within these samples, as this provides an interesting example of an organism that is widespread and potentially pathogenic [6,7,8], but that may also be a bystander in clinical samples [9]. These results and analysis are presented in a separate manuscript [10].

Data description

Sample cohort

We randomly selected 10 cerebrospinal fluid (CSF) and 10 respiratory samples (20 different patients represented). CSF samples were submitted to the clinical diagnostic laboratory at Oxford University Hospitals NHS Foundation Trust between November 2012 and May 2014, and respiratory samples between May and December 2014. Prior to use for this research, samples had undergone routine clinical laboratory testing and were then stored at −80 °C.

In vitro methods

A full description of laboratory methods is provided in linked data files (see ‘OSCAR protocol’ listed in Table 1). In brief, samples were filtered through 0.45 μm spin column filters (Merck Millipore) to remove large cellular debris and bacterial contaminants. To increase the ratio of encapsidated viral nucleic acids to host nucleic acids, we pre-treated each sample with DNase and RNase. Nucleic acids were extracted using the QIAamp MinElute Virus Spin Kit (Qiagen) and recovered in nuclease-free water. Reverse transcription was primed with random hexamers and performed using SuperScript III reagents. Sequence-independent amplification of cDNA (and of DNA carried over during extraction) was initiated by adding primers containing a random octamer sequence. Subsequent PCR amplification used a single primer. Illumina Nextera XT libraries were made from the amplified cDNA according to the manufacturer’s protocol and sequenced on the HiSeq 4000 platform with 150-base paired-end reads at the Centre for Genomic Research (CGR), University of Liverpool, UK.

Table 1 Overview of data files/data sets

Bioinformatic analysis

A full description of the methods is provided in linked data files (OSCAR protocol; see description and link in Table 1). In brief, the raw FASTQ files were trimmed to remove adapter sequences and low-quality bases. After trimming, reads shorter than 20 bases were removed. The remaining reads were classified using Kraken v0.10.5-beta [11] against a reference database comprising the human genome in combination with all RefSeq genomes for viruses, bacteria and archaea. Human-tagged reads were discarded and the remainder were taken forward for analysis.
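
The length filter and human-read removal described above can be illustrated with a minimal Python sketch. This is not the pipeline code used in the study. It assumes a single uncompressed FASTQ stream and the standard tab-delimited per-read output of Kraken 1 (classification status, read ID, taxon ID, read length, LCA mapping). The function names are hypothetical, and matching only on NCBI taxon ID 9606 (Homo sapiens) is a simplifying assumption: reads assigned to higher ranks of the human lineage would not be caught.

```python
import io

HUMAN_TAXID = "9606"   # NCBI taxonomy ID for Homo sapiens
MIN_READ_LENGTH = 20   # reads shorter than this are dropped, as in the text


def read_fastq(handle):
    """Yield (header, sequence, quality) from a 4-line-per-record FASTQ stream."""
    while True:
        header = handle.readline().rstrip("\n")
        if not header:
            return
        sequence = handle.readline().rstrip("\n")
        handle.readline()  # skip the '+' separator line
        quality = handle.readline().rstrip("\n")
        yield header, sequence, quality


def human_read_ids(kraken_handle, human_taxid=HUMAN_TAXID):
    """Collect IDs of reads that Kraken classified directly as human."""
    ids = set()
    for line in kraken_handle:
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 3:
            continue  # skip malformed or empty lines
        status, read_id, taxid = fields[0], fields[1], fields[2]
        if status == "C" and taxid == human_taxid:
            ids.add(read_id)
    return ids


def filter_reads(fastq_handle, excluded_ids, min_length=MIN_READ_LENGTH):
    """Keep reads that meet the length threshold and are not excluded."""
    kept = []
    for header, sequence, quality in read_fastq(fastq_handle):
        read_id = header[1:].split()[0]  # drop '@' and any description
        if len(sequence) >= min_length and read_id not in excluded_ids:
            kept.append((header, sequence, quality))
    return kept


if __name__ == "__main__":
    # Toy input: one human-tagged read and one non-human read.
    toy_fastq = io.StringIO(
        "@r1\n" + "A" * 30 + "\n+\n" + "I" * 30 + "\n"
        "@r2\n" + "C" * 30 + "\n+\n" + "I" * 30 + "\n"
    )
    toy_kraken = io.StringIO("C\tr1\t9606\t30\t9606:1\nU\tr2\t0\t30\t0:1\n")
    for header, _, _ in filter_reads(toy_fastq, human_read_ids(toy_kraken)):
        print(header)
```

In this sketch the length filter is applied at the same time as human-read removal for brevity, whereas in the workflow described above short reads are removed before classification.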

We used Kaiju [12] to confirm that the Kraken analysis was complete, using the full GenBank non-redundant protein database for viruses, bacteria and archaea. Reads were assembled de novo using metaSPAdes v3.10 [13,14]. Assembled contigs were classified with Kraken [11], and results were visualised with Krona [15].
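
One simple way to cross-check two classifiers is to list the taxa that the second tool assigns to reads the first tool left unclassified. The Python sketch below illustrates that idea only. It is not the comparison performed in the study, and it assumes that both tools write tab-delimited per-read output whose first three columns are classification status (C or U), read ID and NCBI taxon ID. The function names are hypothetical, and the taxon IDs in the toy input are placeholders.

```python
import io
from collections import Counter


def parse_calls(handle):
    """Map read ID -> taxon ID for every read marked as classified ('C')."""
    calls = {}
    for line in handle:
        fields = line.rstrip("\n").split("\t")
        if len(fields) >= 3 and fields[0] == "C":
            calls[fields[1]] = fields[2]
    return calls


def taxa_missed_by_primary(primary_calls, secondary_calls):
    """Count taxa assigned by the secondary tool to reads that the
    primary tool did not classify."""
    return Counter(
        taxid
        for read_id, taxid in secondary_calls.items()
        if read_id not in primary_calls
    )


if __name__ == "__main__":
    # Toy per-read outputs; taxon IDs are placeholders for illustration.
    primary = io.StringIO("C\tr1\t111\t100\t111:1\nU\tr2\t0\t100\t0:1\n")
    secondary = io.StringIO("C\tr1\t111\nC\tr2\t222\nU\tr3\t0\n")
    missed = taxa_missed_by_primary(parse_calls(primary), parse_calls(secondary))
    for taxid, count in missed.most_common():
        print(taxid, count)
```

An empty result would suggest that the second classifier found nothing the first had missed at the read level. Because Kaiju classifies at the protein level and Kraken at the nucleotide level, a read-by-read comparison of this kind is only a coarse check.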

Limitations

This study was undertaken as a pilot exercise to underpin refinement of both laboratory methods and analysis of metagenomic data from clinical samples. On the grounds of cost, we were restricted to analysis of a small number of samples. We did not set out to derive definitive clinical diagnoses, and the data should not be used for this purpose. There are inherent difficulties with using residual clinical samples, including bias introduced into sample selection (e.g. samples from patients with a high pre-test probability of infection tend to be used up in primary clinical testing and are not available for research). In archived samples, the quality of nucleic acid may deteriorate over time (this may be especially pertinent for RNA viruses).

Our methods did not include positive and negative controls. For this reason, it is difficult to assess the sensitivity with which we detected any specific virus; it is possible that the in vitro methods may have enriched or depleted particular organisms or groups of organisms. In future, positive controls can be added by spiking samples with an organism that we anticipate would not be present in human samples, or running a parallel multiplex control panel [16].

Future studies, and an accumulation of practical experience, will be required to increase the certainty with which results of NGS platforms can be interpreted. While we anticipate instances in which a specific organism can be identified from a metagenomic dataset as the cause of a clinical syndrome, there are many instances in which ambiguity may arise because of the difficulty of discriminating between pathogenic organisms and contaminants or bystanders. Detailed prospective studies enrolling large numbers of study subjects are the ultimate aspiration, with the aim of collecting high-resolution data that include medical history, other laboratory results, imaging, treatment and follow-up data.