Datenbank-Spektrum, Volume 12, Issue 3, pp 161–171

Data Management Challenges in Next Generation Sequencing

  • Sebastian Wandelt
  • Astrid Rheinländer
  • Marc Bux
  • Lisa Thalheim
  • Berit Haldemann
  • Ulf Leser
Focus Article

Abstract

Since the early days of the Human Genome Project, data management has been recognized as a key challenge for modern molecular biology research. By the end of the nineties, technologies had been established that adequately supported most ongoing projects, typically built upon relational database management systems. However, recent years have seen a dramatic increase in the amount of data produced by typical projects in this domain. While it took more than ten years, approximately three billion USD, and more than 200 groups worldwide to assemble the first human genome, today’s sequencing machines produce the same amount of raw data within a week, at a cost of approximately 2000 USD, and on a single device. Several national and international projects now deal with (tens of) thousands of genomes, and trends like personalized medicine call for efforts to sequence entire populations. In this paper, we highlight challenges that emerge from this flood of data, such as parallelization of algorithms, compression of genomic sequences, and cloud-based execution of complex scientific workflows. We also point to a number of further challenges that lie ahead due to the increasing demand for translational medicine, i.e., the accelerated transition of biomedical research results into medical practice.

Copyright information

© Springer-Verlag 2012

Authors and Affiliations

  • Sebastian Wandelt (1)
  • Astrid Rheinländer (1)
  • Marc Bux (1)
  • Lisa Thalheim (1)
  • Berit Haldemann (1)
  • Ulf Leser (1)

  1. Department of Computer Science, Humboldt-Universität zu Berlin, Berlin, Germany
