Scalable Gene Sequence Analysis on Spark

  • Conference paper
Big Data and Visual Analytics

Abstract

Scientific and technological advances have made it possible to digitize genetic information, producing an enormous volume of genetic sequences; analyzing sequencing data at this scale is now a primary concern. This chapter introduces a scalable genome sequence analysis system that uses the parallel computing features of Apache Spark and its relational processing module, Spark SQL (Spark Structured Query Language). Spark supports efficient data reuse by holding data in memory, which improves performance substantially. The system also provides a web-based interface through which users specify search criteria, and Spark SQL performs the search on the data held in memory. The experiments detailed in this chapter use publicly available 1000 Genomes Project data in Variant Call Format (VCF), 1.2 TB in size, as input. The input data are analyzed using Spark, and the results are evaluated to measure the scalability and performance of the system.
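
The search flow described in the abstract (load VCF records, keep them in memory, answer user-specified searches through Spark SQL) might be sketched in PySpark as follows. This is a minimal illustration under stated assumptions, not the chapter's implementation: the function names parse_vcf_line and search_variants, the five columns retained, and the region-style search criteria (chromosome plus position range) are all hypothetical choices made for this example.

```python
from typing import Optional, Tuple

# Hypothetical schema: only the first five fixed VCF columns are kept.
VARIANT_SCHEMA = "chrom string, pos int, var_id string, ref string, alt string"


def parse_vcf_line(line: str) -> Optional[Tuple[str, int, str, str, str]]:
    """Parse one tab-separated VCF data line.

    Returns None for blank lines, meta lines ("##...") and the header ("#CHROM...").
    """
    if not line.strip() or line.startswith("#"):
        return None
    fields = line.rstrip("\n").split("\t")
    if len(fields) < 5:
        return None
    chrom, pos, var_id, ref, alt = fields[:5]
    return (chrom, int(pos), var_id, ref, alt)


def search_variants(vcf_path: str, chrom: str, start: int, end: int):
    """Load a VCF file, cache it in memory, and return variants in a region.

    Assumption: a single search criterion (chromosome and position range)
    stands in for whatever a web front end would submit.
    """
    # Imported lazily so the parser above can be used without Spark installed.
    from pyspark.sql import SparkSession
    from pyspark.sql.functions import col

    spark = SparkSession.builder.appName("vcf-search-sketch").getOrCreate()

    rows = (
        spark.sparkContext.textFile(vcf_path)
        .map(parse_vcf_line)
        .filter(lambda record: record is not None)
    )
    variants = spark.createDataFrame(rows, schema=VARIANT_SCHEMA)

    # cache() marks the DataFrame for in-memory reuse across later queries.
    variants.cache()

    # The DataFrame API is part of Spark SQL; using it here avoids
    # interpolating user input into a SQL string.
    return variants.filter(
        (col("chrom") == chrom) & col("pos").between(start, end)
    )
```

A call such as search_variants("sample.vcf", "1", 10000, 20000).show() would print the matching rows; the path and coordinates here are placeholders. Because the DataFrame is cached, repeated searches over the same session can reuse the in-memory data instead of re-reading the file, which is the data reuse property the abstract attributes to Spark.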

Author information

Corresponding author

Correspondence to Jinoh Kim.

Copyright information

© 2017 Springer International Publishing AG

About this paper

Cite this paper

Syed, M., Hwang, T., Kim, J. (2017). Scalable Gene Sequence Analysis on Spark. In: Suh, S., Anthony, T. (eds) Big Data and Visual Analytics. Springer, Cham. https://doi.org/10.1007/978-3-319-63917-8_6

  • DOI: https://doi.org/10.1007/978-3-319-63917-8_6

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-319-63915-4

  • Online ISBN: 978-3-319-63917-8

  • eBook Packages: Computer Science (R0)