Preprocessing and Quality Control for Whole-Genome Sequences from the Illumina HiSeq X Platform

  • Protocol
Statistical Human Genetics

Part of the book series: Methods in Molecular Biology (MIMB, volume 1666)

Abstract

Advances in high-throughput sequencing technologies enable the sequencing of human genomes at steadily decreasing cost and increasing quality. Before variants can be analyzed, e.g., in association studies, the raw data obtained from the sequencer need to be preprocessed. Preprocessing includes removing adapters, duplicates, and contamination; aligning the reads to a reference genome; and postprocessing the alignment. All later steps, such as variant discovery, rely on high data quality and proper preprocessing, which makes quality control essential. This chapter presents a workflow for preprocessing Illumina HiSeq X sequencing data. Code snippets illustrate all necessary steps, along with a brief description of the tools and underlying methods.

Acknowledgments

The work presented in this chapter was supported by the German Centre for Cardiovascular Research (DZHK; Deutsches Zentrum für Herz-Kreislauf-Forschung) and the DZHK OMICs Resource Project (grant: 81X1700104).

Author information

Corresponding author

Correspondence to Marvin N. Wright.

Copyright information

© 2017 Springer Science+Business Media LLC

About this protocol

Cite this protocol

Wright, M.N., Gola, D., Ziegler, A. (2017). Preprocessing and Quality Control for Whole-Genome Sequences from the Illumina HiSeq X Platform. In: Elston, R. (ed.) Statistical Human Genetics. Methods in Molecular Biology, vol 1666. Humana Press, New York, NY. https://doi.org/10.1007/978-1-4939-7274-6_30

  • DOI: https://doi.org/10.1007/978-1-4939-7274-6_30

  • Publisher Name: Humana Press, New York, NY

  • Print ISBN: 978-1-4939-7273-9

  • Online ISBN: 978-1-4939-7274-6

  • eBook Packages: Springer Protocols
