Preprocessing and Quality Control for Whole-Genome Sequences from the Illumina HiSeq X Platform

Wright, Marvin N.; Gola, Damian; Ziegler, Andreas

doi:10.1007/978-1-4939-7274-6_30

Part of the book series: Methods in Molecular Biology (MIMB, volume 1666)

Abstract

Advances in high-throughput sequencing technologies enable the sequencing of human genomes at steadily decreasing cost and increasing quality. Before variants can be analyzed, e.g., in association studies, the raw data obtained from the sequencer must be preprocessed. Preprocessing includes the removal of adapters, duplicates, and contaminants, alignment to a reference genome, and postprocessing of the alignment. All later steps, such as variant discovery, rely on high data quality and proper preprocessing, which makes quality control essential. This chapter presents a workflow for preprocessing Illumina HiSeq X sequencing data. Code snippets illustrate all necessary steps, along with a brief description of the tools and underlying methods.

Acknowledgments

The work presented in this chapter was supported by the German Centre for Cardiovascular Research (DZHK; Deutsches Zentrum für Herz-Kreislauf-Forschung) and the DZHK OMICs Resource Project (grant: 81X1700104).

Author information

Marvin N. Wright and Damian Gola contributed equally to this work.

Authors and Affiliations

Institut für Medizinische Biometrie und Statistik, Universität zu Lübeck, Universitätsklinikum Schleswig-Holstein - Campus Lübeck, Lübeck, Germany
Marvin N. Wright, Damian Gola & Andreas Ziegler

Corresponding author

Correspondence to Marvin N. Wright.

Editor information

Editors and Affiliations

Case Western Reserve University, Cleveland, OH, USA
Robert C. Elston

About this protocol

Cite this protocol

Wright, M.N., Gola, D., Ziegler, A. (2017). Preprocessing and Quality Control for Whole-Genome Sequences from the Illumina HiSeq X Platform. In: Elston, R. (ed) Statistical Human Genetics. Methods in Molecular Biology, vol 1666. Humana Press, New York, NY. https://doi.org/10.1007/978-1-4939-7274-6_30

DOI: https://doi.org/10.1007/978-1-4939-7274-6_30
Published: 05 October 2017
Publisher Name: Humana Press, New York, NY
Print ISBN: 978-1-4939-7273-9
Online ISBN: 978-1-4939-7274-6
eBook Packages: Springer Protocols
