A hybrid reference-guided de novo assembly approach for generating Cyclospora mitochondrion genomes
Cyclospora cayetanensis is a coccidian parasite associated with large and complex foodborne outbreaks worldwide. Linking samples from cyclosporiasis patients during foodborne outbreaks with suspected contaminated food sources, using conventional epidemiological methods, has been a persistent challenge. To address this issue, the development of new methods based on potential genomically-derived markers for strain-level identification has been a priority for the food safety research community. The absence of reference genomes with which to identify nucleotide and structural variants with a high degree of confidence has limited the use of sequencing data for source tracking during outbreak investigations. In this work, we applied bioinformatic analyses to assess the quality of a high-resolution, curated, public mitochondrial genome assembly for use as a reference genome. Using this reference genome, three new mitochondrial genome assemblies were built from metagenomic reads generated by sequencing DNA extracted from oocysts present in stool samples from cyclosporiasis patients. Nucleotide variants were identified in the new and in other publicly available genomes by comparison with the mitochondrial reference genome. The consolidated workflow presented here, which generates new mitochondrion genomes using our reference-guided de novo assembly approach, could facilitate the generation of other mitochondrion sequences and their application in subtyping C. cayetanensis strains during foodborne outbreak investigations.
Keywords: Genome sequencing, Mitochondrion, De novo assembly, Reference genome, Single nucleotide polymorphisms, Cyclosporiasis, Subtyping
NGS: next generation sequencing
WGS: whole genome sequencing
SNPs: single nucleotide polymorphisms
Cyclospora cayetanensis is an important apicomplexan parasite that causes cyclosporiasis, a common foodborne illness worldwide. Due to the globalization of the food supply, this parasite is prevalent both in endemic regions that produce food and in non-endemic areas into which food is imported [2, 3]. The lack of animal models or cell culture systems for Cyclospora and the limited availability of its oocysts have hampered genomic studies and the development of efficient genotyping tools. With the advent of new sequencing methods, the number of genome sequences from C. cayetanensis has grown over the past 4 years; however, this has not yet provided a solution to the subtyping problem. Our group [4, 5] and others [6, 7] have recently published genomes for the C. cayetanensis organelles (apicoplast and mitochondrion), as well as whole genome sequences [8, 9] that provide a glimpse into its biology. A few PCR targets amplified from geographically distinct strains have been tested for subtyping. Unlike the case with foodborne bacteria (https://www.fda.gov/Food/FoodScienceResearch/WholeGenomeSequencingProgramWGS/), the impact of genomics on the development of molecular epidemiological methods has not yet been fully realized for C. cayetanensis, owing to the difficulty of obtaining high-quality genomic information from NGS datasets.
Growing amounts of genomic data from mixed DNA samples recovered from clinical, animal or environmental samples often result in assemblies and variant determinations with varying levels of confidence. High-quality reference genomes are critical for quality assurance and reproducibility, and they support assembly, annotation and accurate variant determination with NGS metagenomic datasets from uncultivable microorganisms. We previously published a reference genome for the C. cayetanensis apicoplast, which involved manual curation and bioinformatic analysis of in-house metagenomic sequence datasets from stool samples to reconstruct 11 new apicoplast assemblies. This reference genome was used to identify 25 genomically variable regions, or hotspots, in the apicoplast genomes of these 11 different clinical strains. In the current work, we (a) evaluated a high-resolution, annotated mitochondrial genome, KP231180, and propose that it be adopted as the mitochondrial reference genome; (b) developed a hybrid, reference-guided NGS approach to build new assemblies de novo; (c) demonstrated the usefulness of this hybrid approach by generating three mitochondrial genome assemblies from metagenomic sequence datasets from cyclosporiasis clinical samples; and (d) identified alleles in six new and public mitochondrion assemblies relative to the reference genome sequence. This workflow is anticipated to generate mitochondrial genomes of comparable quality from different samples for genotyping purposes and possible source attribution.
Table 1. Results of mapping, trimming and assembly of sequencing reads from three C. cayetanensis strains
C. cayetanensis assemblies | C5 | C8 | C10
Total reads (metagenomic) | 34 × 10⁶ | 33 × 10⁶ | 7.3 × 10⁶
Average read length (bp) | | |
Reads used in the assembly | | |
Reads mapped back² | | |
Coverage | > 11,000× | > 8000× | > 1400×
Results and discussion
Determination of the reference genome for C. cayetanensis mitochondrion
Generation of three new genome assemblies from metagenomic sequence datasets
Three new mitochondrial genomes were assembled using the mitochondrial reference genome KP231180 as outlined in the workflow (Fig. 1). The reference genome-based approach described in this study was developed over several years specifically to address the limited amount and metagenomic nature of C. cayetanensis DNA, and it resembles a reference-guided assembly method recently described for single genomes. The mitochondrion assembly KP231180 and a dozen apicoplast assemblies from C. cayetanensis were generated using this reference-guided workflow. A total of 298,217, 205,265 and 38,206 source-reads (less than 1% of the reads in each of the three samples) were first obtained from the large metagenomic sequence datasets of strains C5, C8 and C10, respectively (Table 1), by mapping to the mitochondrial reference genome. The source-reads were trimmed and assembled using CLC Genomics Workbench as described (Table 1). For each strain, the trimmed source-reads were used in the CLC Workbench de novo assembly tool to generate a single contig.
When compared with the reference genome in the Geneious suite (data not shown), the linear contigs of the three initial assemblies started at different base positions because the assembly process does not fix the starting point of a contig. C. cayetanensis mitochondrial genomes are known to be concatemeric [4, 6]: the tail region of one mitochondrial genome fuses with the head region of the next molecule, creating a native tail:head junction. Sequencing reads overlapping this junction may lead to mis-assembly. The lack of collinearity between the reference genome and these new assemblies also impaired the ability to align them for base-level comparisons. Alignment programs like Mugsy and Mauve allow artificial rearrangement of sequences to create locally collinear blocks (LCBs) for alignment.
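The "less than 1%" figure quoted for the source-reads can be checked with simple arithmetic. The sketch below is illustrative only: it uses the source-read counts given in the text and the rounded total read counts from Table 1, so the percentages are approximate, and the variable and function names are our own.

```python
# Total metagenomic reads per strain (rounded values from Table 1).
TOTAL_READS = {"C5": 34e6, "C8": 33e6, "C10": 7.3e6}
# Source-reads recovered by mapping to the mitochondrial reference genome.
SOURCE_READS = {"C5": 298_217, "C8": 205_265, "C10": 38_206}


def mapped_percent(source_reads: int, total_reads: float) -> float:
    """Percentage of the metagenomic reads that mapped to the reference."""
    return 100.0 * source_reads / total_reads


percentages = {
    strain: mapped_percent(SOURCE_READS[strain], TOTAL_READS[strain])
    for strain in TOTAL_READS
}

for strain, pct in percentages.items():
    print(f"{strain}: {pct:.2f}% of metagenomic reads mapped to the reference")
```

With these inputs, each strain falls below 1%, consistent with the statement in the text.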
In our reference-guided approach, we added a manual curation step in which the reference genome KP231180 was used to correct any randomly-oriented contig to form a collinear assembly that could be readily aligned with other genomes end-to-end. In this step, blocks of sequences in each genome were manually rearranged to achieve synteny with the reference genome. This corrected assembly could be used in multiple alignments for detecting any relevant single nucleotide polymorphisms (SNPs) and/or InDels.
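The curation step above was performed manually. For the simplest case, in which a contig represents one full unit of the concatemer but starts at a different base position, the rearrangement amounts to a rotation. The following is a minimal sketch under stated assumptions, not the authors' procedure: the function name `rotate_to_reference` and the `anchor_len` parameter are hypothetical, the contig is assumed to be on the same strand as the reference, and the first bases of the reference are assumed to occur exactly, and only once, in the contig.

```python
def rotate_to_reference(contig: str, reference: str, anchor_len: int = 20) -> str:
    """Rotate a contig so that it begins at the same base as the reference.

    Assumptions (hypothetical, for illustration): the contig is one complete
    unit of a concatemeric genome, is on the same strand as the reference,
    and contains the first `anchor_len` reference bases exactly once.
    """
    anchor = reference[:anchor_len].upper()
    # Doubling the contig lets the anchor be found even when it spans
    # the arbitrary start/end point of the contig.
    doubled = (contig + contig).upper()
    start = doubled.find(anchor)
    if start == -1:
        raise ValueError("reference anchor not found; manual curation needed")
    return doubled[start:start + len(contig)]


# Usage with a toy sequence: the contig is the reference shifted by 10 bases.
toy_reference = "ACGTTGCAAGGCTTAACCGGTTAGCTAGGATCC"
toy_contig = toy_reference[10:] + toy_reference[:10]
print(rotate_to_reference(toy_contig, toy_reference, anchor_len=8))
```

Real contigs can also differ from the reference by strand, by indels near the anchor, or by partial duplication at the junction, which is why a manual check against the reference remains part of the workflow described here.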
Comparison with, and correction against, the reference genome resulted in a 6274 bp-long assembly for each of the three strains, C5, C8 and C10. Very high coverage was achieved for each genome, suggesting the utility of NGS in obtaining a good-quality genome for this small organelle from different patient stool samples (Table 1). The corrected assemblies (Fig. 4, tracks 1–3) of the three strains aligned to the reference genome without any shift in sequence or mis-alignment. When the source-reads for each sample were mapped back to the respective mitochondrion genome to detect and eliminate any spurious insertions/deletions and misassembled sequence regions, no valid gaps were detected. Up to 99.97% of the source-reads from each sample mapped back to the respective assembly or to the reference genome (Table 1), suggesting a very high degree of sequence similarity between the reference and the new genomes. DNA repeat sequences are known to create errors in alignment and assembly, particularly with NGS datasets. To examine whether the known repeats of the C. cayetanensis mitochondrial genome interfered with the mapping efficiency, the 1085 base-long fragment of the reference genome sequence (described earlier in the “Methods” section) was challenged with NGS reads from the three samples. As illustrated in Fig. 3 with the C5 dataset as a representative sample, source-reads from the three samples mapped without any gaps to the repeat-rich region of the target shown (from 6114 to 6200, spanning four repeat blocks), highlighting the quality of the reads recovered using this reference-guided approach. Annotation based on the reference genome identified three protein-coding genes (cytB, cox1 and cox3), in addition to 14 LSU and nine SSU fragmented rRNA genes, in each of the genomes, as expected, suggesting that an intact assembly structure was obtained.
When the 1085 bp template containing only the concatemeric junction (tail:head) of KP231180 was interrogated with source-reads from the sample datasets, numerous read-throughs across this region were observed (data not shown). The tail:head junction, originally reported by Cinar et al. in the reference genome and by Tang et al. in the strain HEN01, was also seen in each of the three strains used in this study, confirming the native, concatemeric structure of the Cyclospora mitochondrion genome. Together, these results point to a high rate of specificity and accuracy in the mapping and assembly processes, resulting in de novo mitochondrial assemblies that retain a structural integrity similar to that of the reference genome. It should be noted that reference-based assembly methods that circumvent the de novo step used in this study are available. For example, Cinar et al. described a modification of this workflow in which, after reads were mapped to a reference genome, the Geneious tool allowed the extraction of a linear consensus sequence formed by assembling overlapping reads. A contiguous new assembly was generated as a result of the significantly greater depth of sequencing, thus avoiding gaps and the formation of multiple contigs. Tools like AlignGraph extend this idea by mapping paired-end sequencing reads to a closely related reference genome to create a contiguous or scaffolded assembly. Schneeberger et al. used a hybrid approach similar to our method, combining reference-guided assembly with de novo assembly. Such a hybrid approach was efficient in identifying new sequence and structural differences among strains, as was also observed in the case of C. cayetanensis apicoplasts.
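The junction test described at the start of this passage can be pictured as follows. This is a hedged sketch only: how the 1085 bp template was actually built is described in the Methods and is not reproduced here, so the construction below (tail of one genome unit joined to the head of the next) and the names `junction_template`, `spans_junction`, `flank` and `min_overhang` are assumptions made for illustration.

```python
def junction_template(genome: str, flank: int) -> str:
    """Join the last `flank` bases of a genome unit to its first `flank` bases,
    mimicking a tail:head junction between two units of a concatemer."""
    if not 0 < flank <= len(genome):
        raise ValueError("flank must be between 1 and the genome length")
    return genome[-flank:] + genome[:flank]


def spans_junction(read_start: int, read_len: int, flank: int,
                   min_overhang: int = 10) -> bool:
    """Return True if a read mapped to the template crosses the junction.

    `read_start` is the 0-based position of the read on the template.
    The junction lies between template positions flank - 1 and flank, and
    the read must cover at least `min_overhang` bases on each side of it.
    """
    read_end = read_start + read_len  # exclusive end coordinate
    return (read_start <= flank - min_overhang
            and read_end >= flank + min_overhang)


# Usage with a toy genome unit and a 4-base flank on each side.
template = junction_template("AAAACCCCGGGGTTTT", flank=4)
print(template)                                   # tail "TTTT" followed by head "AAAA"
print(spans_junction(0, 8, flank=4, min_overhang=2))
```

Reads that satisfy a test of this kind are the "read-throughs" that support a native tail:head junction rather than an assembly artifact.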
Sequence comparison and variant calling based on the reference genome
The mitochondrial genomes from the three strains C5, C8 and C10, and three other publicly available genomes, were aligned with the reference genome to identify alleles among them. Based on this comparison, a synonymous C > A transversion was found at position 4415 (cox3 gene) of the reference genome KP231180 in all six strains used in the comparison (red oval, enlarged in the inset box, Fig. 4). Specific reads mapping to this region confirmed the accuracy of this allele identification (data not shown). KP796149 contained seven unique alleles and shared three more alleles with KP658101 (Fig. 4, track 7, marked by ‘*’); these need to be independently verified in a larger number of C. cayetanensis strains. It is interesting to note that in the Nepal samples C5, C8 and C10, the 34 kb apicoplast genomes were indistinguishable from each other, as were their 6.3 kb mitochondrion genomes (Fig. 4; tracks 1–3). It has been observed in foodborne bacteria [19, 20] and in other apicomplexans [21, 22, 23] that strains from the same geographical location display different levels of differentiation in their variant markers, a genomic feature that could be applied in creating identification/barcoding schemes for subtyping or strain-level identification.
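The comparison described above relies on collinear, end-to-end alignments. As a minimal illustration (not the software used in this study), the sketch below lists substitutions between a reference and one sample that are already aligned to equal length, with `-` marking gaps, and classifies each as a transition or transversion. The function names and the toy sequences are hypothetical.

```python
PURINES = {"A", "G"}


def call_snps(reference: str, sample: str) -> list[tuple[int, str, str]]:
    """List substitutions as (1-based reference position, ref base, sample base).

    Assumes the two sequences are already aligned to equal length, with '-'
    for gaps. Gap columns and ambiguous bases are skipped, not reported.
    """
    if len(reference) != len(sample):
        raise ValueError("aligned sequences must have equal length")
    snps = []
    ref_pos = 0
    for ref_base, alt_base in zip(reference.upper(), sample.upper()):
        if ref_base != "-":
            ref_pos += 1  # advance only on columns that hold a reference base
        if ref_base in "ACGT" and alt_base in "ACGT" and ref_base != alt_base:
            snps.append((ref_pos, ref_base, alt_base))
    return snps


def substitution_type(ref_base: str, alt_base: str) -> str:
    """Transition: purine<->purine or pyrimidine<->pyrimidine; else transversion."""
    same_class = (ref_base.upper() in PURINES) == (alt_base.upper() in PURINES)
    return "transition" if same_class else "transversion"


# Usage with toy aligned sequences (not real Cyclospora data).
for pos, ref_base, alt_base in call_snps("ACGTCAC", "ACGTAAC"):
    print(pos, f"{ref_base}>{alt_base}", substitution_type(ref_base, alt_base))
```

Reporting positions in reference coordinates, as done here, is what makes alleles from different assemblies comparable once all of them are collinear with the same reference.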
Significance of the current work
We present a hybrid, reference-guided de novo genome assembly approach for Cyclospora mitochondrion genomes. As part of this study, we describe a C. cayetanensis mitochondrial reference genome that could be routinely used for building new mitochondrial assemblies and for genome comparisons. This approach yielded three new mitochondrial assemblies derived from metagenomic sequencing data, which allowed variants to be determined with high confidence when aligned with the mitochondrial reference genome. This study complements a similar reference genome-based workflow for obtaining high-quality Cyclospora apicoplast genomes, first reported by our group in Cinar et al. Our reference-guided workflow, derived from our work on NGS datasets from stool samples, should be instrumental in adding more C. cayetanensis mitochondrial genome data from food or environmental samples. A standard and routine workflow for the extraction and recovery of mitochondrial sequences from contaminated clinical, food and environmental samples would foster the identification of potential subtyping markers for source tracking and the development of molecular diagnostic detection tools for C. cayetanensis to assist with outbreak investigations.
The newly assembled mitochondrial genomes from C. cayetanensis strains C5, C8 and C10 are available (Accessions MG831586, MG831587 and MG831588 respectively) from NCBI Bioproject: PRJNA357478 C. cayetanensis Mitochondrial genome sequencing for Molecular Serotyping, a component of FDA Cyclospora GenomeTrakr (Bioproject PRJNA357477).
HNC and HRM were involved in sample handling, oocyst purification and DNA extraction; HNC and GG carried out NuGen library preparation and the NGS experiments; GG carried out the bioinformatic analyses and wrote the initial draft of the manuscript; all authors contributed to the manuscript and worked in the laboratory with the samples; AJD is the subject matter expert on foodborne parasites for CFSAN, FDA. All authors read and approved the final manuscript.
The authors thank Jeevan Sherchand and Ynes Ortega for providing clinical stool samples containing C. cayetanensis oocysts. The authors thank Mark Mammel from OARSA, CFSAN for critically reading the manuscript, and Yvonne Qvarnstrom, Mike Arrowood and their team members from CDC for general collaboration on Cyclospora genomics projects. We acknowledge the programmatic support by the OARSA and CFSAN managements for the Cyclospora cayetanensis genomic projects.
The authors declare that they have no competing interests.
Availability of data and materials
New genome sequences are submitted to GenBank.
Consent for publication
All authors have consented for publication.
Ethics approval and consent to participate
This study was reviewed and approved by the Institutional Review Board of the FDA, identified with the file name RIHSC-ID#10-095F, and followed the CDC Human Subjects Research Protocol # 6756, titled “Use of residual diagnostic specimens from humans for laboratory methods research”.
This study is part of the Foodborne Parasitology Program of CFSAN, FDA, and funding was obtained internally through U.S. FDA appropriations.
- 2. Chacin-Bonilla L. 2017. Cyclospora cayetanensis. In: JB Rose and B Jiménez-Cisneros, editors. Global water pathogens project. http://www.waterpathogens.org (R. Fayer and W. Jakubowski, editors, Part 3 Protists). http://www.waterpathogens.org/book/cyclospora-cayetanensis. E. Lansing: Michigan State University, UNESCO.
- 5. Cinar HN, Qvarnstrom Y, Wei-Pridgeon Y, Li W, Nascimento FS, Arrowood MJ, Murphy HR, Jang A, Kim E, Kim R, da Silva A, Gopinath GR. Comparative sequence analysis of Cyclospora cayetanensis apicoplast genomes originating from diverse geographical regions. Parasit Vectors. 2016;9(1):611.
- 7. Ogedengbe ME, Qvarnstrom Y, da Silva AJ, Arrowood MJ, Barta JR. A linear mitochondrial genome of Cyclospora cayetanensis (Eimeriidae, Eucoccidiorida, Coccidiasina, Apicomplexa) suggests the ancestral start position within mitochondrial genomes of eimeriid coccidia. Int J Parasitol. 2015;45(6):361–5.
- 8. Liu S, Wang L, Zheng H, Xu Z, Roellig DM, Li N, Frace MA, Tang K, Arrowood MJ, Moss DM, Zhang L, Feng Y, Xiao L. Comparative genomics reveals Cyclospora cayetanensis possesses coccidian-like metabolism and invasion components but unique surface antigens. BMC Genom. 2016;30(17):316.
- 9. Qvarnstrom Y, Wei-Pridgeon Y, Li W, Nascimento FS, Bishop HS, Herwaldt BL, Moss DM, Nayak V, Srinivasamoorthy G, Sheth M, Arrowood MJ. Draft genome sequences from Cyclospora cayetanensis oocysts purified from a human stool sample. Genome Announc. 2015;3(6):e01324-15.
- 19. Gopinath G, Hari K, Jain R, Mammel MK, Kothary MH, et al. The pathogen-annotated tracking resource network (PATRN) system: a web-based resource to aid food safety, regulatory science, and investigations of foodborne pathogens and disease. Food Microbiol. 2013;34(2):303–18.
Open Access. This article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated.