Statistical Analyses of Next Generation Sequencing Data: An Overview

Statistical Analysis of Next Generation Sequencing Data

Part of the book series: Frontiers in Probability and the Statistical Sciences ((FROPROSTAS))

Abstract

Next generation sequencing (NGS) is a significant technological advance in the biomedical sciences. Sequencing platforms have advanced rapidly, to the point that several genomes can now be sequenced simultaneously in a single instrument run in under two weeks. Applications of NGS range from detecting transcription factor binding sites and quantifying gene expression to discovering methylation patterns and comparing genomes. We review some of the major NGS platforms currently in use. Some of these platforms, such as Illumina, represent the fastest evolving genomic technologies in terms of cost, throughput, and speed. However, although NGS overcomes the limitations of first-generation platforms and microarray-based studies, the data it generates are not free of noise. The sources of noise are diverse and complex, and they depend on the generating platform and sequencing chemistry. For example, errors can creep in at any intermediate sequencing step, such as adapter ligation, fragmentation, polymerase chain reaction (PCR) amplification, and nucleotide removal. In methods such as chromatin immunoprecipitation sequencing (ChIP-Seq), non-specific binding is a major source of noise. All of this raises novel statistical and computational challenges, e.g., in base calling and differential profiling. In this chapter, we point out the critical challenges that arise in NGS data analysis and provide an objective overview of the existing literature. As we shall see, NGS is not only transforming genomics but also driving new methodological development in several branches of quantitative science.

Corresponding author

Correspondence to Somnath Datta.

Copyright information

© 2014 Springer International Publishing Switzerland

About this chapter

Cite this chapter

Mitra, R., Gill, R., Datta, S., Datta, S. (2014). Statistical Analyses of Next Generation Sequencing Data: An Overview. In: Datta, S., Nettleton, D. (eds) Statistical Analysis of Next Generation Sequencing Data. Frontiers in Probability and the Statistical Sciences. Springer, Cham. https://doi.org/10.1007/978-3-319-07212-8_1
