Objective

Control data sequenced on multiple platforms are essential for validating and comparing those platforms. A sample that has been used extensively for these purposes is the MG1655 strain of E. coli [1]. However, the MG1655 genome is smaller and less complex than those of some pathogenic E. coli strains [2, 3]. As part of control experiments, we have sequenced UTI89, a uropathogenic E. coli (UPEC) strain originally isolated from a patient suffering from an acute bladder infection [4], using several sequencing technologies: ABI SOLiD, Ion Torrent, PacBio, Oxford Nanopore, and Illumina. These new data supplement previously published sequencing data generated using the Roche 454 [4], Illumina HiSeq [5], and the original Oxford Nanopore Technologies MinION [6]. With the inclusion of these new data sets, E. coli strain UTI89 now has a nearly complete set of raw sequence data generated using most second- and third-generation sequencers. For some technologies we have multiple data sets. For PacBio, for example, the data span the first iteration of the RSII sequencing chemistry (XL/C2) in 2012 through the P6-C4 chemistry (current in 2018), a period over which mean read length increased more than fivefold.

Data description

The new data sets are summarized in Table 1. Details of library preparation and sequencing methods for each new data set are presented below.

Table 1 Overview of data sets

SOLiD

Library preparation

Genomic DNA was extracted from UTI89 grown overnight in Lysogeny Broth (LB) and used to generate Long Mate Pair (LMP) libraries. LMP libraries were generated using an insert size of 3–4 kb according to the manufacturer’s instructions to produce a 375 bp library.

Sequencing

A 2 × 35 bp LMP sequencing run was performed on two spots of an 8-spot slide using the Applied Biosystems SOLiD3 platform [7, 8, 9].

Ion Torrent

Library preparation

Genomic DNA was extracted from UTI89 harbouring the pBAD33 plasmid [10] grown overnight in LB. Sequencing libraries were then generated using the Ion Xpress™ Plus gDNA library preparation protocol according to the manufacturer’s instructions.

Sequencing

A 200 bp sequencing run was performed on the Ion Personal Genome Machine (PGM) system using the Ion PGM™ 200 Sequencing Kit with a 316 chip [11, 12].

PacBio, RSII, XL/C2 Chemistry

Library preparation

Genomic DNA was extracted from SLC-66 (UTI89 with a kanamycin cassette integrated into the phage HK022 integration site) grown overnight in LB. Large-insert (15 kb) native SMRTbell sequencing libraries were generated according to the manufacturer’s protocols.

Sequencing

Sequencing was performed on 6 SMRT Cells using XL/C2 sequencing chemistry [13, 14, 15].

Illumina

Library preparation

Genomic DNA was extracted from UTI89 grown overnight in LB. Sequencing libraries were built using the Illumina TruSeq Nano DNA LT kit according to the manufacturer’s instructions, with shearing to 350 bp.

Sequencing

A 2 × 150 bp sequencing run was performed using the Illumina NextSeq 500 and a NextSeq Mid Output flow cell and reagents [16, 17].

Oxford Nanopore, MinION Mk1B Device, R9.4, 1D Ligation sequencing

Library preparation

Genomic DNA was extracted from UTI89 grown overnight in LB. Sequencing libraries were prepared from 1 μg of unsheared DNA using the Ligation Sequencing Kit 1D, R9 version (SQK-LSK108), according to the manufacturer’s instructions.

Sequencing

The prepared sequencing library was loaded onto a FLO-MIN106 (R9.4) SpotON flow cell and a 24 h sequencing run was performed. Base calling was subsequently performed using Oxford Nanopore’s Albacore Sequencing Pipeline Software (version 1.2.1) [18, 19].

PacBio, RSII, P6-C4 Chemistry

Library preparation

Genomic DNA was extracted from UTI89 grown overnight in LB. Large-insert (20 kb) native SMRTbell sequencing libraries were generated according to the manufacturer’s instructions.

Sequencing

Sequencing was performed on 2 SMRT Cells using P6-C4 sequencing chemistry [20, 21, 22, 23].

Previously published data sets

Three previously published data sets were generated using other sequencing platforms or sequencer versions: Roche 454 [4, 24, 25, 26, 27, 28, 29, 30], Illumina HiSeq 2000 [5, 31, 32, 33, 34], and the original Oxford Nanopore MinION with an R7 flow cell [6, 35, 36]. The data presented in this manuscript complement these published data sets, which are also included in Table 1.
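Because these data sets are intended for platform comparison, a reader may want to start by summarizing read lengths for each one. The following is a minimal sketch in Python, not part of the original analysis. It assumes the reads have already been exported to plain or gzip-compressed FASTQ with exactly four lines per record, which is not the native format for every platform listed here. The function names (`read_lengths`, `n50`, `summarize`, `summarize_file`) and the toy records are hypothetical and used only for illustration.

```python
import gzip
import io
from statistics import mean


def read_lengths(handle):
    """Yield the sequence length of each record in a 4-line-per-record FASTQ stream."""
    for i, line in enumerate(handle):
        if i % 4 == 1:  # the second line of each record holds the bases
            yield len(line.strip())


def n50(lengths):
    """Return the length L such that reads of length >= L hold at least half of all bases."""
    ordered = sorted(lengths, reverse=True)
    half = sum(ordered) / 2
    running = 0
    for length in ordered:
        running += length
        if running >= half:
            return length
    return 0  # no reads


def summarize(handle):
    """Return read count, total bases, mean read length, and N50 for a FASTQ stream."""
    lengths = list(read_lengths(handle))
    if not lengths:
        return {"reads": 0, "bases": 0, "mean_length": 0.0, "n50": 0}
    return {
        "reads": len(lengths),
        "bases": sum(lengths),
        "mean_length": mean(lengths),
        "n50": n50(lengths),
    }


def summarize_file(path):
    """Summarize a FASTQ file on disk; a .gz suffix is assumed to mean gzip compression."""
    opener = gzip.open if str(path).endswith(".gz") else open
    with opener(path, "rt") as handle:
        return summarize(handle)


# Toy input (three made-up reads of 4, 10, and 2 bases), not real UTI89 data.
toy = io.StringIO(
    "@r1\nACGT\n+\nIIII\n"
    "@r2\nACGTACGTAC\n+\nIIIIIIIIII\n"
    "@r3\nAC\n+\nII\n"
)
stats = summarize(toy)
print(stats)
```

Dividing the `mean_length` values of two data sets gives the fold change in mean read length between them, and dividing `bases` by an assumed genome size gives a rough estimate of sequencing depth.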

Limitations

The following are limitations of these data:

  1. The data were collected over a period of several years, so the experimental steps for different data sets were performed by different people.

  2. Some strains contain plasmids or other markers (see details above).

  3. Not every generation of sequencing machine or library preparation method was used.