Overview of Sequence Data Formats

  • Hongen Zhang
Part of the Methods in Molecular Biology book series (MIMB, volume 1418)


A next-generation sequencing experiment can generate billions of short reads for each sample, and processing the raw reads adds further information. Various file formats have been developed to store and manipulate this information. This chapter presents an overview of the file formats commonly used in the analysis of next-generation sequencing data, including FASTQ, FASTA, SAM/BAM, GFF/GTF, BED, and VCF.

Key words

Sequencing data file format; Next-generation sequencing; Sequencing data; FASTQ; FASTA; SAM/BAM; GFF/GTF; BED; VCF


Copyright information

© Springer Science+Business Media New York 2016

Authors and Affiliations

  1. Center for Cancer Research, National Cancer Institute, National Institutes of Health, Bethesda, USA
