Background

Metagenomics studies, where total genetic material is recovered directly from samples and sequenced, is rapidly becoming the method of choice for evaluating the taxonomic diversity of a sample. The Genomes OnLine Database (GOLD) provides information regarding genome and metagenome sequencing projects and currently lists 392 ongoing metagenomic studies that involve sequencing of 3028 samples [1]. This is an under-estimation of the actual number of ongoing projects because there are thousands of additional projects being initiated every month. A simple search in PubMed [2] of the word “metagenome” retrieves 3422 articles, of which 2205 were published within the last two years.

Metagenomics studies can range from analysis of microbial communities of marine [3] or soil [4] to microbiomes of various organs in the human body [57], and even to cross-biome studies [8]. Other metagenomic studies involve analysis of gene markers and genes from different species to better understand the functional components of the entire community and systems-level interactions [810]. Disease related studies that involve microbiome analysis using metagenomics include obesity [1113], Crohn’s disease [14, 15], type 2 diabetes [5] and many others [16]. Such studies have vastly extended the currently available sequences in databases and will likely lead to the discovery of new genes that have useful applications in biotechnology and medicine [17, 18].

Culture-dependent survey approaches to study microbial diversity were replaced by targeted sequencing of bacterial 16S rRNA two decades ago [19]. Since then, sequencing techniques have improved and high throughput methods such as whole-genome shotgun sequencing have been used to sequence the environmental samples [20]. The cloning-dependent ‘Sanger sequencing’ was gradually replaced by a “sequencing-by-synthesis” strategy [2123]. The advantage of this next-generation (next-gen) sequencing approach for metagenomics is that it avoids biases resulting from the cloning process used in traditional sample treatment. Thanks to the availability of these new sequencing technologies, metagenomics studies can either focus on targeted rRNA gene sequencing or whole-metagenome shotgun (WMS) sequencing. Both methods provide researchers informative data that either can be used to efficiently assess the diversity of a community or identify disease causing pathogens with increased taxonomic resolution [5, 24].

Analysis software plays an indispensable role in analyzing the components of the metagenomic data and further predicting functional attributes. In order to flexibly accommodate different scenarios and achieve different purposes, distinct approaches have been adopted by these tools. These tools utilize statistical models, like Support Vector Machines, interpolated Markov models, naïve Bayesian classifiers, etc., to train a taxonomic classifier from reference genomes and then use this classifier to cluster novel input metagenomic reads (PhyloPythiaS [25], Phymm [26], NBC: Naïve Bayes Classification tool [27]). These methods are good for analysis where sufficient prior genomic information is available. Where no prior knowledge of expected taxonomic diversity is available, running these tools becomes computationally expensive (if all reads are scanned against all known sequences) or less accurate (if reads are scanned against a signature/marker-based sequence database).

Because metagenomic reads can have multiple components, the application of universal markers can reduce the complexity and thus shrink the run time of the entire mapping process. But there are some drawbacks to this approach. This approach relies on informative slow evolving sequence features such as 16S rRNA in microorganisms or Internal Tanscribed Spacer (ITS) regions in plants to identify the community taxonomy. Many of these markers fail to distinguish between evolutionarily close yet distinct organisms. Software like AMPHORA [28] and MetaPhyler [29] were developed to analyze reads falling into certain known taxonomic clades, while MetaPhlAn concentrated on robustness of taxonomic abundance estimation based on clade specific markers present in metagenomic samples [30]. One promising tool in the area of identification of known pathogens from an environmental or clinical sample is Pathoscope [31]. Pathoscope uses a statistical framework to analyze raw next-generation sequence (NGS) reads from metagenomic samples and performs species and strain level identification based on hits to a curated signature database of known biological agents. Another strain-, species- and genus-specific genes database, CUPID, developed by our group [32] has been used by us and others to identify pathogens and related organisms from NGS data. Another approach involves construction of customized sub-databases from the NCBI-nr database [33]. The limitations of these methods are either the dependence on prior knowledge of taxonomic groups that the user is expecting or is interested in or the workflow involves iterative steps of hierarchical databases which are difficult to maintain and update on a regular basis. Additionally, the dependence on signatures which may not get amplified and sequenced for a variety of reasons [34] can result in lower accuracy.

A whole-genome analysis method serves as an emerging alternative approach that complements the marker-based methods. Unlike other previous methods, whole-genome studies require unbiased high-throughput sequencing technology and heavy computing, which in turn can provide comprehensive insights in the biodiversity and functional breadth of the organisms in the sample. This is an ideal and straightforward approach depicting the panoramic view of metagenomic data. One limitation of this method is the potential for ambiguous results due to the interruption of conserved regions by horizontally transferred sequences. However, this issue has been addressed by the lowest common ancestor (LCA) approach used by MEGAN [9] and other maximum parsimony based approaches [35]. The remaining unsolved issue is the extreme computational requirements of mapping millions of sequence reads to all known reference sequences in a reasonable amount of time. This issue becomes increasingly critical as the size of both reference databases and metagenomic datasets keep growing.

It is clear that the challenges in metagenomic analyses are many and varied [3639]. In this paper, we focus on one of the major challenges, that is, rapidly detecting the taxonomic composition in a metagenomic short read sample with minimal computing resources. Thus, we present here an algorithm and tool which allows assessment of the biodiversity of a metagenomic sample, as while also providing a means for quality control of NGS analysis pipelines to detect contamination present in sequence reads.

Materials and methods

Databases and tools

The Basic Local Alignment Search Tool (Blast) formatted NCBI nt database downloaded from ftp://ftp.ncbi.nlm.nih.gov/blast/db/ is used as reference database [2]. Blast with following parameters is used: -task megablast -evalue 1e-6 -max_target_seqs 1 -best_hit_score_edge 0.05 -best_hit_overhang 0.05 -window_size 0 -perc_identity 90 [2]. Synthetic reads were generated using ART which adds plausible sequencing errors [40].

Implementation

The graphical user interface for CensuScope is implemented using PHP and JavaScript (Additional file 1: Figure S1).

CensuScope algorithm description and justification

Subsampling, or as it is often called subagging (subsample aggregating), has been previously established as an efficient yet accurate method for estimating statistical quantities in complex high dimensional datasets [41]. The large-sample asymptotic properties of subsampling approaches indicate that subsampling is an efficient way to obtain highly accurate and stable estimates of statistical sampling distributions in contexts where estimates derived from the entire dataset are computationally difficult to obtain [42]. Such is the case with metagenomic samples, where the massive dataset size, combined with the ever-increasing set of reference sequence available, produce the need for highly efficient profiling tools for aligning and characterizing the population-level composition of the sample. The CensuScope subsampling approach randomly selects (with replacement) i samples (henceforth denoted as ‘iterations’), of size m < <n from the overall read dataset containing n total reads. From the i iterations, the algorithm derives point estimates as well as confidence interval estimates of abundance for each taxonomic clade at a user-defined depth. Furthermore, to remove low-proportion clades that appear due to random reads selection, we filter taxonomic groups that are not present in a majority of the subsamples based on the user defined threshold p (e.g. default is p = 80%). A high value for p will result in more stable results and will filter out false positive taxa that come up due to misaligned reads.

Figure 1 provides an overview of the CensuScope algorithm. Reads are randomly picked from the metagenomic dataset. The user can choose how many reads to pick per iteration. The number of reads picked along with the number of iterations determines the amount of time required to perform the analysis. The software also extracts taxonomic information about the hits from the NCBI taxonomy database using E-utilities (http://www.ncbi.nlm.nih.gov/books/NBK25500/). Table 1 provides details on time taken to analyze a metagenomic sample with 7.5 million reads on a personal computer (3 gigahertz (GHz) 64-bit processor; 4 gigabyte (GB) RAM). More computing power, if available, reduces run times significantly, therefore allowing selection of larger number of random reads and iterations if needed.

Figure 1
figure 1

Flowchart describing the CensuScope algorithm. For clinical samples an additional step is required.

Table 1 Time taken to analyze a metagenomic sample with one 7.5 million sequence reads on a personal computer using CensuScope

The random read picker function creates an array of pointers from locations in the dataset with a specific size relating directly to the size of metagenome. By picking m number of elements of this array randomly (each time picking one whole array element from a shuffle process), a dataset of randomly selected chunks of data is created. Next, the user specified number of reads are selected from the newly created dataset based on the Mersenne Twister Function (a very fast random number generator [43]).

Definition of terms

Organism

The leaf-most taxonomic node in the taxonomy tree. The node can be at the species, strain, or sub-strain taxonomic level.

CensuScope

Census-based NGS metagenome taxonomic analyzer.

Best hit

Query that is listed at the top of any local alignment program. In this paper the local alignment program used is Blast.

Tax Slims

Conceptually similar to GO Slims (http://geneontology.org/GO.slims.shtml). Tax Slims consist of a subset of the terms in the whole taxonomy tree. Users can choose the depth of the taxonomy tree thereby creating their own Tax Slims or can choose our default Tax Slim which is set at the taxonomic depth of 3.

Taxonomic depth

Level of nodes within a given taxonomy tree/system.

Sampling reads for other metagenomic pipelines

CensuScope allows users to generate a randomly picked reads set from the entire metegenome short reads file by specifying the number of short reads and iterations upon input. The short reads set can be then used by mapping/alignment algorithms. Though CensuScope is designed to analyze NGS standard output (FASTQ), the output of random picker can be either FASTA or FASTQ for user convenience as some existing algorithms can only accept FASTA files.

Results and discussions

The two major bottlenecks in deciphering the taxonomic composition of a metagenomic sample are: a) mapping of millions of sequence reads to an exponentially growing NCBI nt; and b) creating an up-to-date index of NCBI nt for use by mapping algorithms. For the first challenge, we chose to randomly pick a limited set of reads from a metagenomic sample and map them to nt. We ran this process multiple times to identify taxonomic nodes that are hit in the majority of the iterations. For the second challenge, we chose to use a pre-computed Blast nt database that is updated regularly and distributed by NCBI.

Statistical considerations

It is obvious that the sensitivity of the method is dependent on both the number of random reads (m) and the number of subsamples or iterations (i) that are chosen. For example, if m = 100 random reads and i = 50 subsamples are chosen, then any organism specific read which has a frequency of less than one percent will most likely be missed based on our threshold criteria: the hits to a specific organism/taxon node has to be more than p = 80% of the iterations. This can be problematic for clinical samples as they tend to have an over-abundance of human sequences. To avoid missing reads which are present in low frequency an additional step is included which involves mapping of the reads to the human genome first and then running CensuScope on the unaligned reads (see Figure 1 and section below on Clinical metagenome for details). This additional step can be used with any metagenomic sample where there is an abundance of reads from one specific organism and if the genome for that organism is available.

We used a simulation approach to find optimal or recommended values for the number of random reads (m) and the number of iterations (i) for multiple scenarios (see Additional file 2: Table S1a-c). We note that the estimation errors for the taxonomy proportions are directly associated with the total number of reads drawn across all iterations (i.e. m × i), whereas the margin of error for the confidence interval depends on the number of random reads. Additionally, choosing too few iterations can result in severely biased confidence intervals, usually erring by making the intervals too small. Overall, we noticed that increasing the number of random reads leads to the greatest increases in the power of detecting lower proportion taxa compared to increasing the number of intervals. Therefore, increasing the number of iterations leads to more stable and robust results that are less likely to include incorrect taxonomic groups due to misaligned reads.

Based on our simulation results, we recommend the number of iterations to be at least i = 50 along with p = 0.8. If the user wants to accurately quantify taxonomic groups present in the sample at a level of 10% or greater, we recommend a subsampling size of 25–100 random reads (see Additional file 2: Table S1a). These parameters yield a statistical power of >99% for detecting taxa present at 10%, absolute estimation error ranging from 0.003 to 0.007 and confidence interval margins of error ranging from 0.05 to 0.09. For identifying taxa present at 1% or greater, we recommend a subsampling size of 250 random reads and a sampling size of 2,500 reads for identifying taxa present at 0.1% (see Additional file 2: Table S1b-c). R code used to calculate these power results are given in Additional file 3 (censuscope.R). However, we note one limitation of this simulation study: it optimistically assumes that all reads are properly aligned to the correct source organism sequence. If there are multiple clades with closely related genomes that share sequence identity with members of the source clade, then either a follow-up step that involves mapping all reads to a select few related genomes (obtainable from genome clusters Additional file 4: Table S2) or use of a signature database that consists of species specific markers [30, 32]. The genome clusters are calculated for genomes which have a complete proteome tag in UniProt and are also tagged as complete in NCBI databases [44]. Another view of taxonomy that is rapidly gaining popularity is the concept of pan-genomes as opposed to an individual genome. For some organisms there are a set of core genes and a cloud of additional ones (found in closely related organisms/isolates) which together form the pan-genome. Such pan-genome databases can be queried by CensuScope for taxonomic profiling in addition to nt and signature databases to identify the taxonomic groups represented within a metagenomic sample.

Test analysis results

Synthetic metagenome and gut microbiome

In order to test our algorithm, synthetic reads were generated using bacterial and viral genomes (Table 2) and the reads were mixed to create a metagenome-like sample containing 7.5 million reads. Table 2 shows that the actual content of the sample can be closely estimated using CensuScope with a fraction of the reads and minimal iterations. If the number of randomly picked reads is increased then the variation observed between each iteration decreases as can be seen in Figure 2. As expected, the number of reads that match to the correct taxonomy node decreases as we move closer to the leaf nodes. This happens because the reads hit closely related genomes. Identification of genomic neighbors is an active area of research and our group, with collaborators, has developed a method to identify genomic neighbors through the analysis of their proteomes [44]. The method is a refinement of a previous method we have used to determine the closest relative of an organism [32]. The genome clusters are calculated at different thresholds and the taxonomic ids are mapped to NCBI bioproject ids every month thereby enabling users to retrieve the closely related genomes. The results obtained from this Representative Genome and Representative Proteome project show that if we look at genomic neighbors calculated based on 75 CMT threshold (Additional file 4: Table S2), we immediately see that for organisms such as Escherichia coli there are other closely related genera such as Shigella to which the reads can also match. Therefore, at lower taxonomic nodes, the hits get split over multiple closely related organisms. For situations such as this, it is best to consider hits to all closely related organisms to make an assessment of the actual composition of the metagenome. Matching the comprehensive set of metagenomic reads to a smaller number of genomes is computationally far less challenging than matching to the entire NCBI nt. Such matching can be performed by HIVE tools (https://hive.biochemistry.gwu.edu/dna.cgi?cmd=main) or by other NGS analysis platforms such as Galaxy [45], CLC Bio (http://www.clcbio.com), Geneious (http://www.geneious.com) and others.

Table 2 Synthetic metagenome original content and detected taxonomic composition
Figure 2
figure 2

Relationship between the numbers of reads analyzed and results obtained. Effect of read numbers analyzed on the standard deviation of the percent of reads that are mapped to a specific taxonomic node. The numbers in parenthesis after the taxonomy node name represents the percent of reads from that taxonomy in the synthetic metagenomic sample. As expected not all taxonomy nodes are detected when the number of reads picked is low.

In a recent review, Segata et al. describe the current methods used to characterize microbial communities [10]. In their article they discuss methods that use intrinsic sequence properties for identification using information from sequenced microbial genome databases (extrinsic information). A popular method MetaPhlAn [30] relies on mapping of reads to a signature database that is created from 3000 sequenced microbial genomes. This means that reads not of microbial origin will not be detected. Additionally, if the signatures themselves are not sequenced then those organisms will be missed. In the following paragraphs we show why NCBI nt should be queried with CensuScope, albeit with a low number of reads to confirm MetaPhlAn results.

CensuScope uses Blast as the default mapping algorithm and NCBI nt database as the default reference database. CensuScope algorithm can also use other alignment algorithms like NGS specific alignment algorithms Bowtie2 [46] and BWA [47], and can be used to search against a curated signature database such as the MetaPhlAn db [30]. Below we provide a demonstration of how users can apply CensuScope to speed up their metagenome analysis with other pipelines using the same synthetic reads that were used for CensuScope testing as shown in Table 2. First we applied CensuScope to pick 50,000 reads randomly from the total 7.5 million synthetic reads to resize data to facilitate the comparison among different algorithms. Based on the tags in the definition line of each synthetic short read we confirmed that the composition of this randomly picked 50,000 reads is of high similarity (listed in the text below) to the total short reads pool of 7.5 million. This can be considered as a proof of the effectiveness of CensuScope’s random picker. Next the 50,000 reads were analyzed using different pipelines: A) MetaPhlAn analysis. The analysis was performed on the Galaxy version of MetaPhlAn with Bowtie2 algorithm using the ‘very sensitive local’ option. It took 8.3 s to sample the reads and 13 seconds to analyze using the Galaxy platform. The result (Additional file 5: Table S3) shows that MetaPhlAn detected 41.8% Escherichia coli (actual percentage 50.22%), 35.06% of the Bacillus subtilis (actual percentage 19.94%), 22.8% Archaeoglobus fulgidus (actual percentage 6.98%). The reads from eukaryota and viruses were not detected by MetaPhlAn [30] because MetaPhlAn db does not have signatures for them in their database. B) AMPHORA2 analysis. The exact same test was performed using AMPHORA [28] on their webserver AmphoraNet which uses the AMPHORA2 algorithm with the up-to-date AMPHORA markers. Due to the limited number of marker genes, the analysis resulted in no matches. C) CensuScope analysis using 50 iterations with 250 reads sampled per iteration, on the other hand, gives a result that is much closer to the actual composition of the 50,000 reads: 47.85+/-0.86% Escherichia coli (actual percentage 50.22%), 20.00+/-0.72% of the Bacillus subtilis (actual percentage 19.94%), 6.85+/-0.46% Archaeoglobus fulgidus (actual percentage 6.98%), and successfully identifies all the viruses and the eukaryote missed by MetaphlAn - 20.29+/-0.74% Saccharomyces cerevisiae (actual percentage 19.87%), 0.50+/-0.15% HAdV-C clade (actual percentage 2.05%) and 0.91+/-0.18% Vaccina virus (actual percentage 0.93%). These results show that the CensuScope default method provides a comprehensive and accurate view of the taxonomic composition of a metagenomic sample.

Next, we compared our analysis method using real metagenomic data (NCBI SRA ID: SRS015989) that has been previously analyzed by the MetaPhlAn group. The results obtained from MetaPhlAn website shows that SRS015989 according to their analysis consists of Actinomycetales 43.83%, Neisseriales 20.37%, Lactobacillales 12.81%, Flavobacteriales 10.47%, Pasteurellales 6.44% and Selenomonadales 2.21%. Our analysis using CensuScope with suggested parameters (250 randomly picked reads and 50 iterations) provided the following results: Actinomycetales 42.58+/-1.73%, Neisseriales 3.95+/-0.57%, Lactobacillales 30.14+/-1.57%, Flavobacteriales 3.20+/-0.47%, Pasteurellales 9.31+/-0.90%, Selenomonadales 2.54+/-0.47% and Fusobacteriales 0.77+/-0.27%. Additionally artificial sequence vectors (2.03+/-0.43%) and eukaryotic sequences (1.79+/- 0.39%) were also reported. To obtain a reference composition of this real metagenome for our comparison, we randomly picked 10,000 reads by using the same method described in the previous section. It is interesting to note that MetaPhlAn analysis (http://huttenhower.sph.harvard.edu/content/metaphlan-galaxy) of randomly picked 10,000 reads from SRS015989 identifies Actinomycetales 84.26% and Neisseriales 15.73% and nothing else. Blast against NCBI nt analysis of these 10,000 reads provides the following results: Actinomycetales 41.25%, Neisseriales 3.87%, Flavobacteriales 2.45%, Lactobacillales 28.49%, Pasteurellales 8.98%, Selenomonadales 2.62% which closely resembles CensuScope results using default parameters analyzing the same 10,000 reads (Actinomycetales 40.97+/-1.32%, Neisseriales 4.35+/-0.53%, Flavobacteriales 4.45+/-0.53%, Lactobacillales 30.64+/-1.68%, Pasteurellales 9.62+/-0.84%, Selenomonadales 2.86+/-0.49%).

Motivated by the significant difference of the detected taxonomic abundance between the results generated by MetaphlAn and CensuScope using the same reads set, we performed an additional analysis on a relatively small but complete real metagenome dataset (SRS011236). As this metagenomic data is only 109 MB (filtered and trimmed by Human Microbiome Project (HMP) group), Blast against NCBI nt results using all the reads can be easily used to provide us with a reference composition. Based on the HMP paper, the exact same short reads file has been directly analyzed by MetaphlAn to produce the result which contains only three clades: 69.87% Actinomycetales, 1.28% Bacillales and 28.85% Clostridiales (Additional file 6: Table S4). The Blast analysis using the entire dataset (called ‘Blast result’ from here onwards) and CensuScope analysis of the same metagenome provides an overall similar taxonomic composition that is distinct from what is presented by MetaphlAn. CensuScope (250 random reads/50 iterations) found 17.83+/-0.94% Actinomycetales (Blast result 17.99%), 0.48+/-0.16% Bacillales (Blast result 0.57%) and 0.60+/-0.17% Clostridiales (Blast result 0.56%). In addition CensuScope and Blast results identified several other clades that are not detected by MetaPhlAn. The novel detected clades include 2.97+/- 0.36% Lactobacillales (Blast result 3.03%), 35.07+/-1.43% Homininae (Blast result 35.59%), 1.16+/-0.25% Siphoviridae (Blast result 1.05%), etc. (Additional file 6: Table S4).

Fungal metagenome

Identification and characterization of organisms that degrade wood are active areas of research [48, 49]. It is well known that many fungal species are involved in the degradation of wood. More recently it has been recognized that bacteria might collaboratively work with fungal species to accelerate wood decay [50]. Emerging molecular techniques are expanding our knowledge of the diversity of wood inhabiting organisms [51, 52]. To better understand the biodiversity of wood decomposing organisms, environmental DNA was extracted from two decaying logs and used to generate 50 base pair single-end reads on the Illumina HiSeq. Given the accepted prevalence of fungi as wood decayers, we hypothesized that the majority of the 1.1 million reads in the metagenomic dataset would belong to fungi with a smaller fraction of the reads belonging to bacteria. CensuScope analysis, using Blast with E-value threshold of 1e-6, number of random reads selected per iteration set to 10,000 and number of iterations set to 10, found that the majority of reads are from proteobacteria and only 4.37% of the reads are from fungi (Figure 3). Proteobacteria have been found to be highly abundant in decaying wood [51] and are known to be capable of lignin decomposition [53]. The low percentage of fungi and prevalence of bacteria in this community could be due to either few fungal species truly present in the examined samples or lack of appropriate fungal sequences in the NCBI nt database. It is clear that more samples would need to be collected and analyzed to better understand the role of bacteria and fungi in wood degradation.

Figure 3
figure 3

Metagenomic analysis results of sample collected from decaying wood. One thousand reads were randomly picked and analyzed using CensuScope with the number of iterations set at 10. The number above the dots represents standard deviation. The pie chart shows the average distribution of hits in the different taxonomic nodes with mean and confidence intervals in parenthesis.

Clinical metagenome (respiratory tract infection)

Characterizing the microbiome in hard to treat respiratory tract infections is critical for disease treatment and management [54]. CensuScope analysis of a respiratory tract metagenome containing 3.2 million reads, using Blast with E-value threshold of 1e-6, number of random reads selected per iteration set to 10,000 and the number of iteration set to 10 reveals that 99.7% of reads are from Homo sapiens and ~0.08% of the reads are from Bacilli (data not shown). An overabundance of reads from the host is expected in clinical samples. Using an additional step that includes mapping the metagenomic reads to the human genome and then aligning the unaligned reads to NCBI nt gives a much better resolution as can be seen in Figure 4 for twelve patient samples. The initial reads were mapped to the human genome (minimum match length 80%; mismatch allowed 10%) using HIVE hexagon [55] and the unaligned reads were downloaded (Additional file 1: Figures S1 and S2) and CensuScope was run to obtain the taxonomic composition of the non-human reads.

Figure 4
figure 4

Metagenomic analysis of samples isolated from patients suffering from upper respiratory tract infection. Analysis results after filtering out Human specific reads from 12 patients (1000 reads/iteration; 10 iterations). Only hits above one percent are shown.

It is always possible that the best hit of a read can be to organisms from which the reads did not originate. This is inevitable if there are identical or closely related sequences in multiple organisms. Therefore, to attain strain level identification one can envision a step where the reads are run through a database of sequence signatures similar to a methodology that we have described previously [32]. One caveat is that it is possible that unique signatures may not have been sequenced in the metagenomic sample. Therefore, to obtain strain or isolate level identification, standard widely used follow-up experiments can be performed [56, 57] or alternatively, the entire metagenome can be mapped to all known sequences under the specific taxonomy node. Algorithms to create such pan-genomes are already available [5860] and can be implemented into the diagnostic workflow for closely related genomes.

Usage instruction

UNIX and Windows versions are available for download from the HIVE website (http://hive.biochemistry.gwu.edu). The download file is a zip folder of approximately 10GB size. The download package contains the CensuScope program, the indexed concatenated NCBI nt Blast database (reference database) and two sample metagenome datasets (virus.fastq and prok.fastq).

Specific instructions are as follows: For Windows: Run CensuScope.exe; choose the path to the metagnome file (e.g. c:\CensuScope\source\virus.fastq); set the number random reads; set the number of iterations; click submit. The output will be generated in \CensuScope\sample\. For UNIX: Install PHP (if not already installed) from http://php.net/downloads.php; provide arguments -s: number of random reads, -i: number of iterations, -t: depth of taxonomy tree to classify the reads into (default setting is 3), -d: metagenome file name, -p: CensuScope package path. Example command:

$php CensuScope.php -i 10 -s 1000 -t 3 -d ‘/user/desktop/fungi.fastq’ -p ‘/user/package’. Internet access is required to run the program. Additional instruction on how to use CensuScope is available in file Additional file 7.

Explanation of results

CensuScope is designed and optimized for rapid analyses of NGS metagenomic data and provids users with standard formatted reports of hits classified into species or higher taxonomic nodes. There are four output files: 1) log.txt - this text file provides all parameters which was used to run and generate the result. 2) gi_centric_table.csv – NCBI gi numbers sorted result. 3) tax_centric_table.csv - taxonomy id sorted result. 4) taxslim_centric_table.csv - user defined taxonomy nodes sorted result. The table also includes information on the number of times a specific taxnode was hit. This information provides an estimate of the reliability of taxonomic distribution that is reported.

CensuScope on High-performance Integrated Virtual Environment (HIVE)

CensuScope has been parallelized in HIVE. The HIVE CensuScope tool determines the taxonomic composition of a metagenomic sample by analyzing the sequence data through rapid iterative mapping to all known sequences using Blast.. Once the data is loaded into HIVE using HIVE’s dmDownloader utility (see “HIVE Downloader Tutorial” available in main pages of HIVE website) it is available for analysis. CensuScope can be used for quick sample origin discovery, to study metagenomic samples or mixed viral populations, or to evaluate the possible contamination of samples. CensuScope result is downloadable in Scalable Vector Graphics (SVG) format for visualization and Comma-separated Values (CSV) format for the mapping results. Detailed step-by-step instruction is available in Additional file 7.

Conclusion

The three key elements required in the analysis of metagenomic samples to obtain the taxonomic composition are: a) a comprehensive non-redundant reference sequence database coupled with b) a generally accepted taxonomy of known organisms, and c) the alignment software for sequence comparison. The first element consists of public sequence databases, which are maintained by NCBI, ENA and DDBJ through the International Nucleotide Sequence Database Collaboration (INSDC) [61]. The content of nt is heavily biased towards organisms that can be cultivated and organisms that have socio-economic benefits. It is expected that over time this bias will decrease as more organisms are cultured and sequencing techniques that do not require culturing gets easier and cheaper to use. The second component, the sequence alignment tool, is another critical step that determines the computational cost of analysis. Fast mapping algorithms, longer reads, and availability of mapping ready indexed reference sequence databases is expected to lower the barrier for this step. The third component is the taxonomic classification of species used which is also evolving over time. Our approach is based on the NCBI taxonomic system, which is maintained and updated by a team of taxonomy experts [2] and, importantly, linked directly to the molecular sequence data. Complementary advances which cluster genomes to create pan-genome sequences is expected to provide better resolution where the taxonomy tree does not correctly reflect the ‘sequence space cloud’ for groups of closely related organisms. CensuScope being highly flexible can be easily modified to use any indexed sequence database or taxonomic classification or mapping algorithm. For example, users can download a signature database such as MetaPhlAn or AMPHORA and use CensuScope to query that database resulting in even faster performance. It is possible that some organisms can be missed using signature databases such as MetaPhlAn or AMPHORA. MetaPhlAn consists of unique clade-specific marker genes identified from 3,000 reference genomes [30] and AMPHORA2 uses 31 bacterial markers and additional markers from 50 archaeal genomes [62]. Additionally, it is also possible that the metagenomic sequencing process may not result in the sequencing of the markers. Therefore, it is desirable to first use CensuScope with NCBI nt to get a comprehensive overview prior to using signature databases. Finally, to identify organisms which are in low abundance in the metagenomic sample users should filter out reads from highly represented organisms using NGS alignment tools such as HIVE Hexagon [55], Bowtie [46] etc. and then use CensuScope.

Based on user requests, we will continue to expand CensuScope capabilities based on the above mentioned considerations.

Availability

CensuScope is freely available for download from: https://hive.biochemistry.gwu.edu/dna.cgi?cmd=censuscope. To run CensuScope on the HIVE Cloud servers users are required to register.

Authors’ information

Amirhossein Shamsaddini and Yang Pan should be considered as first authors.