Data Management Challenges in Next Generation Sequencing

Abstract

Since the early days of the Human Genome Project, data management has been recognized as a key challenge for modern molecular biology research. By the end of the nineties, technologies had been established that adequately supported most ongoing projects, typically built upon relational database management systems. However, recent years have seen a dramatic increase in the amount of data produced by typical projects in this domain. While it took more than ten years, approximately three billion USD, and more than 200 groups worldwide to assemble the first human genome, today’s sequencing machines produce the same amount of raw data within a week, at a cost of approximately 2000 USD, and on a single device. Several national and international projects now deal with (tens of) thousands of genomes, and trends like personalized medicine call for efforts to sequence entire populations. In this paper, we highlight challenges that emerge from this flood of data, such as parallelization of algorithms, compression of genomic sequences, and cloud-based execution of complex scientific workflows. We also point to a number of further challenges that lie ahead due to the increasing demand for translational medicine, i.e., the accelerated transition of biomedical research results into medical practice.

Notes

  1. http://aws.amazon.com/ec2/.
  2. http://hadoop.apache.org/.
  3. SNP calling attempts to predict which of the disagreements between reference and query sequences are due to Single Nucleotide Polymorphisms.
  4. Insertions and deletions.

Acknowledgements

Astrid Rheinländer is funded by the Deutsche Forschungsgemeinschaft through the Stratosphere project. Marc Bux is funded by the Deutsche Forschungsgemeinschaft through the SOAMED research unit. Berit Haldemann is funded by the Bundesministerium für Bildung und Forschung through the project Prositu.

Author information

Corresponding author

Correspondence to Sebastian Wandelt.

About this article

Cite this article

Wandelt, S., Rheinländer, A., Bux, M. et al. Data Management Challenges in Next Generation Sequencing. Datenbank Spektrum 12, 161–171 (2012). https://doi.org/10.1007/s13222-012-0098-2

Keywords

  • Compression Rate
  • Analysis Pipeline
  • Compression Scheme
  • Public Cloud
  • Read Mapping