Preprocessing and Quality Control for Whole-Genome Sequences from the Illumina HiSeq X Platform

  • Marvin N. Wright
  • Damian Gola
  • Andreas Ziegler
Part of the Methods in Molecular Biology book series (MIMB, volume 1666)


The advancement of high-throughput sequencing technologies enables sequencing of human genomes at steadily decreasing cost and increasing quality. Before variants can be analyzed, e.g., in association studies, the raw data obtained from the sequencer need to be preprocessed. The preprocessing steps include the removal of adapters, duplicates, and contamination; alignment to a reference genome; and postprocessing of the alignment. All later steps, such as variant discovery, rely on high data quality and proper preprocessing, which underlines the importance of quality control. This chapter presents a workflow for preprocessing Illumina HiSeq X sequencing data. Code snippets illustrate all necessary steps, along with brief descriptions of the tools and underlying methods.
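
As a rough illustration of the read-level cleaning steps named above (adapter removal, contamination filtering, duplicate removal), the following toy sketch operates on plain Python strings. It is not the chapter's workflow: real pipelines use dedicated tools on FASTQ and BAM files, and duplicates are usually identified after alignment from mapping positions rather than from identical sequences. All function names, the adapter string, and the control sequence are hypothetical placeholders chosen for this example.

```python
# Toy illustration only. All names and sequences below are hypothetical
# placeholders; real workflows rely on dedicated tools and file formats.

def trim_adapter(read, adapter):
    """Cut the read at the first full occurrence of the adapter, if any."""
    pos = read.find(adapter)
    return read if pos == -1 else read[:pos]


def is_contaminant(read, control):
    """Flag a non-empty read that occurs exactly within the control sequence."""
    return len(read) > 0 and read in control


def remove_duplicates(reads):
    """Keep the first occurrence of each sequence, preserving input order."""
    seen = set()
    unique = []
    for read in reads:
        if read not in seen:
            seen.add(read)
            unique.append(read)
    return unique


def preprocess(reads, adapter, control):
    """Trim adapters, drop empty and contaminant reads, then deduplicate."""
    trimmed = [trim_adapter(r, adapter) for r in reads]
    kept = [r for r in trimmed if r and not is_contaminant(r, control)]
    return remove_duplicates(kept)


ADAPTER = "AGATCGGAAGAGC"          # placeholder adapter sequence
CONTROL = "GAGTTTTATCGCTTCCATGAC"  # placeholder control (spike-in) snippet

reads = [
    "ACGTACGTTGCA" + ADAPTER + "TT",  # carries an adapter at the 3' end
    "ACGTACGTTGCA",                   # duplicate of the first read once trimmed
    "TTTTATCGCTTCC",                  # matches the control sequence
    "GGGCCCAAATTT",                   # clean, unique read
]

result = preprocess(reads, ADAPTER, CONTROL)
print(result)
```

In this sketch the order of operations matters: trimming comes first, so the first two reads become identical and the later duplicate is dropped. Exact string matching is used only to keep the example short; it ignores sequencing errors, partial adapters, and reverse complements.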

Key words

Whole-genome sequencing; Sequencing; High-throughput sequencing; Illumina HiSeq X; HTS; NGS; Quality control; Preprocessing; Alignment; Mapping



The work presented in this chapter was supported by the German Centre for Cardiovascular Research (DZHK; Deutsches Zentrum für Herz-Kreislauf-Forschung) and the DZHK OMICs Resource Project (grant: 81X1700104).



Copyright information

© Springer Science+Business Media LLC 2017

Authors and Affiliations

  • Marvin N. Wright (1), corresponding author
  • Damian Gola (1)
  • Andreas Ziegler (1)
  1. Institut für Medizinische Biometrie und Statistik, Universität zu Lübeck, Universitätsklinikum Schleswig-Holstein, Campus Lübeck, Lübeck, Germany
