Background

Next-generation sequencing (NGS) technologies have dramatically changed genomic research. NGS instruments, the so-called second-generation sequencers, generate large volumes of data compared with conventional Sanger sequencers. Before 2010, although the cost of reading a whole genome was rapidly decreasing, the use of NGS technologies was still limited to large genome sequencing centers because of technical and logistical difficulties associated with the operation of the instruments and requirements for computer hardware and data analysis. The advent of benchtop sequencers has accelerated sequencing efforts in small centers and laboratories. For example, the 454 GS Junior (GS Jr), released by Roche in early 2010 as the first benchtop sequencer, uses the same emulsion PCR technology [1] as the Roche GS FLX. The Life Technologies Ion PGM (Ion PGM) benchtop sequencer, which was launched at the beginning of 2011, utilizes semiconductor technology [2]. The Illumina MiSeq (MiSeq) benchtop sequencer became available at the end of 2011 and employs the same sequencing-by-synthesis technology [3, 4] as the Illumina GAII and HiSeq sequencers. With the annual emergence of new NGS instruments, experimental procedures such as library preparation and analysis methods require continual improvement.

Second-generation sequencers generate massive amounts of short reads, which differ in throughput and length from reads produced by Sanger sequencers. To assemble massive amounts of short reads, a new type of algorithm using de Bruijn graphs has flourished, as illustrated by a series of genome assemblers including ABySS [5], ALLPATHS-LG [6], Velvet [7, 8], and SOAPdenovo [9]. Although these algorithms [5–9] have been developed to produce high-quality finished-grade genomes, it remains a challenge to assemble long contigs spanning an entire genome. One of the important factors in successfully obtaining finished genomes is resolving repetitive regions scattered across the genome. It is problematic to reconstruct long repetitive regions by assembling reads shorter than the repetitive regions. Paired ends and mate pairs have been used to tackle this problem. Mate pairs improved scaffold length, but the results using mate-pair assembly have usually been far from finished grade [10, 11].

Reads longer than the repetitive regions may offer a solution to this assembly problem. The recently launched third-generation Pacific Biosciences RS sequencer (PacBio) system [12] generates long reads with a mean length of 4.5 kbp and with randomly distributed sequencing errors. This evolutionary technology demands a new algorithm to process sequence reads because of the different nature of its reads, whose nucleotide-level accuracy is only 85% [12]. Therefore, several algorithms first correct sequencing errors in reads and then assemble the error-corrected reads [13–15]. PacBio has the advantage of generating long reads but at a throughput lower than that of the second-generation sequencers. One of the disadvantages of PacBio is that the initial installation is more expensive than that of benchtop second-generation sequencers (Additional file 1: Table S1). Combining second- and third-generation sequencing data may be an option [13, 16]; however, these hybrid methods offer limited efficiency because they require additional labor and consumables for the extra library preparation.

Given that various sequencing instruments and software are available for genome sequencing and are evolving, selecting the best one or the best combination is difficult. Performance comparisons of NGS instruments, including that of a third-generation sequencer, have been previously published [17–21]; however, considering the rapid improvement of NGS technologies, frequent comparisons are valuable for selecting the platform providing the best results. Therefore, we performed an updated comparison study of second- and third-generation sequencers using the bacterial genome of Vibrio parahaemolyticus, consisting of two chromosomes. Because of the presence of two chromosomes with higher copy numbers of rRNA operons than found in other bacteria, it was difficult to finish the genome sequence [21]. In this study, we demonstrated the reconstruction of the V. parahaemolyticus genome using current sequencers.

Results and Discussion

A summary of sequence run data and their assembly results is shown in Table 1, and the distribution of the sequence read quality of each sequencer is shown in Additional file 2: Figure S1. The assembler for each sequencer was selected on the basis of a previous study [22] and our own experience. To evaluate the accuracy of the generated contigs, we compared them with the V. parahaemolyticus reference genome [21] using QUAST v2.3 [23]. Table 2 shows the results of the accuracy evaluation.

Table 1 Data statistics for sequence run and assemblies
Table 2 Accuracy of assembled contigs with respect to the reference genome

Genome assembly using GS Junior

A single sequencing run of GS Jr yielded 48 Mbp with 115,611 reads, corresponding to 9× coverage of the V. parahaemolyticus genome. The mean length of the GS Jr reads was 418 bp. We selected the Newbler assembler [24], which is optimized for Roche 454 chemistry [22, 24]. The Newbler assembly consisted of 309 contigs with a maximum length of 164,926 bp. The total length of the contigs was 5,053,921 bp. Long reads are usually superior to short reads for the reconstruction of long contigs; however, this fragmented assembly suggested that low-coverage reads are insufficient for building a small number of long contigs.
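The coverage figures quoted for each run follow from dividing the sequenced bases by the genome length. The sketch below is only an illustration of that arithmetic: it assumes the genome length is the sum of the two reference chromosome lengths reported later in this article (3,288,558 bp and 1,877,221 bp), and the function name `fold_coverage` is ours rather than part of any tool used in the study.

```python
# Reference chromosome lengths as quoted in this article (assumed here to
# define the genome length used for coverage).
CHR1_BP = 3_288_558
CHR2_BP = 1_877_221
GENOME_BP = CHR1_BP + CHR2_BP  # 5,165,779 bp


def fold_coverage(total_bases: float, genome_bp: int = GENOME_BP) -> float:
    """Return mean fold coverage: sequenced bases divided by genome length."""
    if genome_bp <= 0:
        raise ValueError("genome_bp must be positive")
    return total_bases / genome_bp


# Yields taken from the text: GS Jr 48 Mbp, Ion PGM 1.44 Gbp, PacBio 375 Mbp.
for label, bases in [("GS Jr", 48e6), ("Ion PGM", 1.44e9), ("PacBio", 375e6)]:
    print(f"{label}: {fold_coverage(bases):.1f}x")
```

Because the yields in the text are rounded, values computed this way can differ from the published coverage by about one unit.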

The generated contigs were evaluated by comparison with the V. parahaemolyticus genome. The contig coverage of the V. parahaemolyticus genome was 97.844%. The total number of mismatches was 133, and the number of mismatches per 100 kbp was 2.6. The total number of insertions and deletions (indels) was 824, and the number of indels per 100 kbp was 16.3. These higher rates of errors compared with the other sequencers were largely because of the homopolymer error of 454 chemistry [22].
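As a worked check of the normalized error rates, the following sketch divides the error counts by a base count and scales to 100 kbp. The evaluation software reports these rates itself; using the total contig length as the denominator is our assumption for illustration, and `per_100_kbp` is a hypothetical helper name.

```python
def per_100_kbp(event_count: int, bases: int) -> float:
    """Return the number of events per 100,000 bases."""
    if bases <= 0:
        raise ValueError("bases must be positive")
    return event_count / bases * 100_000


# GS Jr values from the text: 133 mismatches and 824 indels over an
# assembly of 5,053,921 bp (denominator assumed, see lead-in).
gs_jr_total_bp = 5_053_921
print(round(per_100_kbp(133, gs_jr_total_bp), 1))  # mismatches per 100 kbp
print(round(per_100_kbp(824, gs_jr_total_bp), 1))  # indels per 100 kbp
```

Under this assumption the computed rates agree with the reported 2.6 mismatches and 16.3 indels per 100 kbp.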

Genome assembly using Ion PGM

A single run from Ion PGM using the Ion 318 chip generated 1.44 Gbp with 4,982,888 reads. The mean length of the reads was 290 bp. The read coverage of the genome was 279×. We selected Newbler for Ion PGM as well because it is known to produce longer contigs for Ion PGM data [22], owing to the similarity of the Ion PGM sequencing chemistry to that of Roche 454.

We employed random sampling to reduce the number of input reads [20] and to find the optimal amount of input data for assembly [9]. Six sets of 100 random samples each were prepared, with sample sizes of 100, 200, 300, 400, 500, and 600 Mbp, corresponding to 19×, 39×, 58×, 77×, 96×, and 116× coverage, respectively. The maximum contig length and N50 contig length of all results are shown in Additional file 3: Figure S2. The best subset contained 61 contigs with a maximum contig length of 895,358 bp in the 400 Mbp data set (Additional file 3: Figure S2). The number of reads used for the assembly was 1,380,757, corresponding to 77× genome coverage. The N50 contig length was 392,606 bp, and the total length of the contigs was 5,075,085 bp.
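The sampling procedure can be pictured with a short sketch: reads are drawn at random, without replacement, until a target number of bases is reached, and the draw is repeated with different seeds to obtain replicate subsets. This is a plain-Python stand-in written for illustration and is not the tool used in this study; the function names, the in-memory list of read strings, and the seeding scheme are all assumptions.

```python
import random
from typing import List, Optional, Sequence


def subsample_to_bases(reads: Sequence[str], target_bases: int,
                       seed: Optional[int] = None) -> List[str]:
    """Draw reads at random, without replacement, until target_bases is reached."""
    rng = random.Random(seed)
    order = list(range(len(reads)))
    rng.shuffle(order)  # random order; each read is visited at most once
    picked: List[str] = []
    total = 0
    for idx in order:
        if total >= target_bases:
            break
        picked.append(reads[idx])
        total += len(reads[idx])
    return picked


def make_replicates(reads: Sequence[str], target_bases: int,
                    n_replicates: int, base_seed: int = 0) -> List[List[str]]:
    """Build n_replicates independent subsamples of roughly target_bases each."""
    return [subsample_to_bases(reads, target_bases, seed=base_seed + i)
            for i in range(n_replicates)]
```

Each replicate would then be assembled separately and the assemblies compared. For real data sets a streaming approach over the read file would be preferable to holding every read in memory.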

Subsequently, the accuracy was evaluated in the same way as for the GS Jr contigs. The contig coverage of the genome was 98.290%. The total number of mismatches was 108, and the number of mismatches per 100 kbp was 2.1. The total number of indels was 2,853, and the number of indels per 100 kbp was 56.2. Homopolymer error has often been reported for Ion PGM [18, 22], and we could confirm this effect in the assembled contigs, as exemplified in Additional file 4: Figure S3(A).

Genome assembly using MiSeq

A single run of the MiSeq sequencer generated 9.95 Gbp with 39,656,630 reads in pairs. The read coverage of the genome was 1,927×. The mean length of the reads was 251 bp. We used CLC Assembly Cell as the assembler, which is known as a short-read assembler and has been used for a benchmark sequence comparison [22]. We performed random sampling to find the best subset of reads for assembly. The best subset yielded 34 contigs with a maximum contig length of 732,626 bp. The number of reads used for the assembly was 1,194,460, corresponding to 58× genome coverage. The total length of the contigs was 5,103,771 bp and N50 contig length was 431,440 bp.

The contigs contained 230 mismatches in total and 4.5 mismatches per 100 kbp. There were 184 indels in total and 3.6 indels per 100 kbp. MiSeq has a different error profile than Ion PGM. MiSeq errors are known to occur in GGC motifs [25], and we confirmed this error in the generated contigs. Examples of these errors are shown in Additional file 4: Figure S3(B).

Evaluation of random sampling

We used random sampling for the assembly of Ion PGM and MiSeq data and selected the best subset. For comparison, Additional file 5: Table S2 shows a summary of assemblies generated from the complete set of reads. Assembly using all 279× coverage reads generated by Ion PGM resulted in 502 contigs, which were much more fragmented than the 61 contigs obtained using the sampled reads. Likewise, the N50 contig length using all reads was 110,578 bp, much smaller than the 392,606 bp obtained with randomly sampled reads. MiSeq generated coverage of 1,927× in a single run, and 42 contigs were generated using all reads, whereas the number of contigs assembled from the sampled reads was 34. These results suggest that an excessive number of reads does not help and can even harm genome assembly. Widely used assemblers do not assume excess coverage, suggesting that the number of reads fed to assemblers should be optimized by random sampling. The optimized sequencing coverage was reported to be <100× [9, 20].

To determine the factors that improve assembly by random sampling, we compared the best subset with the worst. The subset yielding the fewest contigs was considered the best. The best and worst sampled reads were mapped to the reference V. parahaemolyticus genome. On a closer examination of the junction regions, where reads from the worst sampled reads were unable to connect contigs (i.e., gaps), we found that the high-quality reads perfectly matching the reference genome were uniformly distributed in the gap regions of the best sampled reads (Additional file 6: Figure S4). In contrast, the distribution of the high-quality reads from the worst sampled reads was not uniform, suggesting that nonuniform coverage causes a disconnection of contigs. Random sampling enables us to generate different combinations of read sets, some of which contain high-quality reads that uniformly span the genome and aid in constructing long contigs. This finding indicates that random sampling would be a simple and effective procedure for finding the optimum coverage and best combination of reads for de novo assembly when excess reads are available.
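One way to quantify the uniformity described above is to compute per-base read depth across a gap region and summarize it by its minimum and its coefficient of variation. The sketch below is a simplified illustration and not the analysis performed in this study: it assumes the mapped reads are already available as half-open (start, end) intervals on the reference, and the function names are hypothetical.

```python
from statistics import mean, pstdev
from typing import Iterable, List, Tuple


def depth_profile(intervals: Iterable[Tuple[int, int]],
                  region_start: int, region_end: int) -> List[int]:
    """Per-base read depth over [region_start, region_end)."""
    depth = [0] * (region_end - region_start)
    for start, end in intervals:
        lo = max(start, region_start)  # clip each read to the region
        hi = min(end, region_end)
        for pos in range(lo, hi):
            depth[pos - region_start] += 1
    return depth


def coverage_summary(depth: List[int]) -> Tuple[int, float]:
    """Return (minimum depth, coefficient of variation) for a non-empty profile."""
    avg = mean(depth)
    cv = pstdev(depth) / avg if avg > 0 else float("inf")
    return min(depth), cv


# Toy example: three overlapping reads versus reads that leave a hole.
even = depth_profile([(0, 60), (20, 80), (40, 100)], 0, 100)
uneven = depth_profile([(0, 40), (0, 40), (60, 100)], 0, 100)
print(coverage_summary(even), coverage_summary(uneven))
```

In the toy example the uneven read set has a minimum depth of zero inside the region, which is the kind of break that would prevent contigs from being joined.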

Genome assembly using PacBio

Three cells of PacBio data yielded 120,230 subreads longer than 500 bp, amounting to 375 Mbp in total and corresponding to 73× coverage of the V. parahaemolyticus genome. Several assemblers have been developed for PacBio data. pacBioToCA is a program that corrects sequencing errors using other sequencers’ reads [13] or using PacBio reads themselves. HGAP does not require other sequencers’ reads to correct errors [14]. We employed Sprai [26], a new tool that corrects PacBio sequencing errors without other sequencers’ reads by using multiple alignments of raw PacBio reads. The Sprai algorithm and its performance are shown in Additional file 7. The assembly by Sprai generated 31 contigs using three-cell data, showing better assembly performance than that by HGAP. The results are shown in Additional file 8: Table S3 and Additional file 9: Figure S5. The maximum length of the contigs was 3,288,561 bp, and the second longest contig was 1,875,537 bp. The lengths of these two contigs are almost equal to those of the V. parahaemolyticus genome chromosomes 1 and 2 (3,288,558 and 1,877,221 bp, respectively). The other 29 contigs were all <21 kbp. The contig length distribution is shown in Additional file 9: Figure S5. The two chromosomes of V. parahaemolyticus were reconstructed without gaps by PacBio reads alone, without using reads from other sequencing platforms or jumping libraries.

To further validate these two contigs, we evaluated their accuracy along with all 31 contigs (Table 2). The coverage of all 31 contigs was 99.999%, whereas that of the longest two contigs was 99.848%. The 31 contigs contained a total of 389 mismatches, whereas the longest two contigs contained 157. The number of mismatches per 100 kbp was 7.5 for the 31 contigs and 3.0 for the longest two contigs. The numbers of indels were 715 and 698, and the numbers of indels per 100 kbp were 13.8 and 13.5, respectively. The majority of PacBio sequencing errors were indels, a characteristic known to be a shortcoming of PacBio [27].

Comparison of assembled contigs

All contigs from GS Jr, Ion PGM, MiSeq, and PacBio were aligned to the V. parahaemolyticus genome, as summarized in Figure 1. The contig length distributions are shown in Additional file 10: Figure S6. The sequence assembled using the PacBio sequencer was the highest in quality and genome coverage (Table 2). The Sprai assembler corrected the sequencing errors of PacBio and successfully assembled the reads into two contigs corresponding to the two chromosomes. MiSeq, Ion PGM, and GS Jr all left gaps between contigs. We found that these gaps often fell into rRNA tracts in the genome.

Figure 1

Contig alignment against the V. parahaemolyticus genome. (A) Alignment of contigs to V. parahaemolyticus chromosome 1. PacBio, MiSeq, Ion PGM, and GS Jr contigs are aligned to chromosome 1 and visualized with Circos [28]. From outer to inner rings: forward CDS, reverse CDS, tRNA, rRNA, PacBio contigs, MiSeq contigs, Ion PGM contigs, GS Jr contigs, %GC plot, and GC skews. (B) Alignment of contigs to V. parahaemolyticus chromosome 2. PacBio, MiSeq, Ion PGM, and GS Jr contigs are aligned to chromosome 2 and visualized using a Circos plot. From outer to inner rings: forward CDS, reverse CDS, tRNA, rRNA, PacBio contigs, MiSeq contigs, Ion PGM contigs, GS Jr contigs, %GC plot, and GC skews.

The power of PacBio to generate long reads shows great promise for the assembly of bacterial sequences without hybrid assembly [15, 20]. Previous studies concluded that the accuracy and length of the contigs using PacBio alone surpassed those using second-generation sequencers. However, these studies analyzed bacterial genomes with a single chromosome. In contrast, the present study examined a more complex genome comprising two chromosomes containing 11 copies of rRNA operons. The lengths of 23S rRNA and 16S rRNA sequences are approximately 3.0 kbp and 1.4 kbp, respectively, and the mean read length obtained using PacBio was 3.1 kbp, making it possible to correctly determine the absolute positions of multiple rRNA coding regions (Figure 1). The difficulty of the V. parahaemolyticus genome assembly is caused by these rRNA repetitive regions and by similar regions between chromosomes 1 and 2, which may be the cause of misassembly (Additional file 11: Figure S7). These complications made assembly difficult for the second-generation sequencers.

Previously, the V. parahaemolyticus genome was sequenced by the Sanger method using multiple libraries with different insert sizes [21]. Libraries with a long insert size (4–5 kbp) were used to construct the scaffolds. However, repetitive regions such as rRNA operons had to be sequenced independently to identify their absolute positions. From this experience, we know that jumping libraries would not be useful for accurate reconstruction of the repetitive regions. Long reads that cover not only an entire repeat region but also both of its ends are necessary to determine the absolute positions of the repeats.

Conclusions

We compared the abilities of currently available sequencers to assemble a bacterial genome. The use of random sampling improved the assembly of the sequence data from the second-generation sequencers. As the performance of second-generation sequencers continues to improve, selecting the best subset of sequencing data will become more important for obtaining a good assembly of a bacterial genome. As described in previous reports [17–21], PacBio achieved a long, continuous, finished-grade assembly of a complex bacterial genome. Sequencing technology and chemistry are evolving at a dramatic speed. Future chemistry and instrument updates will bring further improvements, such as support for the sequencing and assembly of higher organisms with multiple chromosomes and the coexistence of multiple genomes in symbiotic organisms. Several attempts to assemble the genomes of higher organisms using PacBio have been published [29–31], although hybrid assembly is still required because of the limitations of current PacBio technology, including low throughput, high cost, and the amount of DNA required. Our study and these recent efforts reinforce the importance of performing frequent evaluations of the rapidly improving hardware and software for determining genomic sequences.

Methods

DNA preparation of the V. parahaemolyticus genome

A single colony of V. parahaemolyticus (RIMD2210633) was isolated from TCBS agar plates and transferred to LB medium containing 3% NaCl. Cells were harvested after overnight culture, and DNA was extracted using a PowerSoil DNA Isolation Kit (MO BIO Laboratories). Purified DNA was quantified with a Qubit dsDNA HS Assay kit (Life Technologies). DNA degradation was evaluated by 1% agarose gel electrophoresis using an E-Gel Electrophoresis System (Life Technologies).

Library preparation, sequencing, and data analysis

GS Junior

Genomic DNA (500 ng) was sheared using a GS Rapid Library Prep Nebulizer (Roche) and a library was prepared using a GS Rapid Library Rgt/Adaptors Kit (Roche), according to the manufacturer’s instructions. Sequencing was performed using a GS Junior Titanium Sequencing Kit. The software Newbler v2.5 (Roche) [24] was employed to assemble the 454 GS Junior data with default parameters.

Ion PGM

Genomic DNA (2 μg) was sheared using the Covaris S220 (Covaris) and a library was prepared using an Ion Fragment Library Kit (Life Technologies), according to the manufacturer’s instructions. Sequencing was performed using a 318 chip and an Ion PGM Sequencing 400 Kit (Life Technologies). The Ion PGM data were randomly sampled with the sfffile tool v2.5 (Roche) and then assembled with the software Newbler v2.5 (Roche) [24] with default parameters.

MiSeq

Genomic DNA (500 ng) was sheared using the Covaris S220 (Covaris) and a library was prepared using ligation-based Illumina multiplex library preparation (LIMprep). Paired-end sequencing (250 bp) was performed using a MiSeq v2 500 cycle kit (Illumina). Random sampling and assembly were performed with CLC Assembly Cell v4.10 (CLC bio). The assembly parameters were a bubble size of 600 and a word size of 41.

PacBio

Genomic DNA (3 μg) was sheared using the HydroShear Plus (Digilab) and a library was prepared using a DNA Template Prep Kit 2.0 (Pacific Biosciences), according to the manufacturer’s instructions. Sequencing was performed with XL polymerase and a DNA Sequencing Kit C2 (Pacific Biosciences) and three SMRT cells (120 min movies). De novo assembly was performed with Sprai v0.9.5 [26] and HGAP v2.1.0 [14] with default parameters. The contigs from Sprai were circularized with a script in the Sprai package when the script detected a significant overlap between the beginning and end of contigs.
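The idea behind circularization can be sketched as follows: if the end of a contig repeats its beginning, the redundant copy is trimmed so that the sequence closes on itself. The code below is a deliberately simplified illustration and is not the script shipped with Sprai; it assumes an exact, error-free overlap, and the function names and the `min_overlap` threshold are hypothetical.

```python
def terminal_overlap(seq: str, min_overlap: int = 20) -> int:
    """Length of the longest suffix of seq equal to a prefix of seq.

    Only overlaps of at least min_overlap and at most half the contig
    length are considered; returns 0 if none is found.
    """
    max_len = len(seq) // 2
    for k in range(max_len, min_overlap - 1, -1):
        if seq[:k] == seq[-k:]:
            return k
    return 0


def circularize(seq: str, min_overlap: int = 20) -> str:
    """Trim the repeated end of a contig whose two ends overlap exactly."""
    k = terminal_overlap(seq, min_overlap)
    return seq[:-k] if k else seq


# Toy contig: an 8 bp head repeated at the end of the sequence.
head = "ATGCCGTA"
middle = "TTTTTGGGGGCCCCCAAAAATTTGGG"
print(circularize(head + middle + head, min_overlap=5))
```

A practical implementation would need to tolerate residual sequencing errors, for example by aligning the two ends rather than comparing them exactly, and would avoid this quadratic scan on megabase-scale contigs.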

Evaluation criteria

Contig statistics were used to evaluate the performance. The number of contigs, maximum length of contigs, total length, and N50 contig length were used as general metrics for contig assessment. Contig statistics were calculated with QUAST v2.3 [23].
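For readers unfamiliar with these metrics, the sketch below shows how they can be computed from a list of contig lengths. It uses the common definition of N50 (the length of the contig at which the cumulative length of contigs, sorted from longest to shortest, first reaches half of the total assembly length) and is an independent illustration with hypothetical function names, not the evaluation software's own code.

```python
from typing import Dict, List


def n50(lengths: List[int]) -> int:
    """Return the N50 of a set of contig lengths (0 for an empty set)."""
    if not lengths:
        return 0
    ordered = sorted(lengths, reverse=True)
    half_total = sum(ordered) / 2
    running = 0
    for length in ordered:
        running += length
        if running >= half_total:
            return length
    return ordered[-1]  # not reached for positive lengths


def contig_stats(lengths: List[int]) -> Dict[str, int]:
    """Collect the general contig metrics listed above."""
    return {
        "num_contigs": len(lengths),
        "max_length": max(lengths) if lengths else 0,
        "total_length": sum(lengths),
        "n50": n50(lengths),
    }


# Toy example with nine contigs of 2 to 10 bp.
print(contig_stats([2, 3, 4, 5, 6, 7, 8, 9, 10]))
```

In the toy example the total length is 54, so the cumulative sum 10 + 9 + 8 = 27 reaches half of the total at the contig of length 8.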

Availability of supporting data

The raw sequencing data have been deposited in the DDBJ Sequence Read Archive (DRA) under the accession code DRA002157.