Personal Genomes: A New Frontier in Database Research

  • Taro L. Saito
Part of the Lecture Notes in Computer Science book series (LNCS, volume 7108)

Abstract

Owing to recent improvements in next-generation sequencing technology, reading the genome sequences of individuals has become common in biological and medical research. The amount of data produced by next-generation sequencers is enormous. Today, the DNA of more than 10,000 people has been sequenced worldwide, and terabytes of data are being produced on a daily basis. The types of genome information also vary according to the biological experiments used to prepare the DNA samples. Biologists and medical scientists now face the task of managing these huge volumes of data of many types. Existing DBMSs, whose major targets are business applications, are not suited to managing these biological data, because loading such large data into a DBMS is time-consuming and current database queries cannot accommodate the various bioinformatics tools written in various programming languages. Processing bioinformatics workflows in a parallel and distributed manner is also a challenging problem. In this paper, in the hope of recruiting database researchers into this rapidly progressing area of biological and medical research, we introduce several challenges in genome informatics from the viewpoint of using existing DBMSs to process next-generation sequencer data.

Keywords

Personal genomes, bioinformatics, parallel computing, workflow management

Copyright information

© Springer-Verlag Berlin Heidelberg 2011

Authors and Affiliations

  • Taro L. Saito
    Department of Computational Biology, The University of Tokyo, Japan
