Scalable Gene Sequence Analysis on Spark

Syed, Muthahar; Hwang, Taehyun; Kim, Jinoh

doi:10.1007/978-3-319-63917-8_6

Abstract

Scientific and technological advances have digitized genetic information and generated an enormous volume of genetic sequences, and analyzing such large-scale sequencing data has become a primary concern. This chapter introduces a scalable genome sequence analysis system that uses the parallel computing features of Apache Spark and its relational processing module, Spark SQL (Spark Structured Query Language). The Spark framework supports efficient data reuse by holding data in memory, which substantially improves performance. The system also provides a web-based interface through which users specify search criteria; Spark SQL then performs the search on the data held in memory. The experiments detailed in this chapter use publicly available 1000 Genomes data in Variant Call Format (VCF), 1.2 TB in size, as input. The input data are analyzed using Spark, and the results are evaluated to measure the scalability and performance of the system.
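
The search flow described in the abstract (load VCF records, keep them in memory, answer user-specified criteria with SQL) can be illustrated with a minimal single-machine sketch. It is not the chapter's implementation: Python's built-in sqlite3 in-memory database stands in for Spark SQL here, and the sample records, the table name `variants`, and the helper functions `parse_vcf`, `load_variants`, and `search` are all hypothetical.

```python
import sqlite3

# Hypothetical records in VCF column layout (tab-separated).
# These are illustrative values, not taken from the 1000 Genomes data.
SAMPLE_VCF = """\
##fileformat=VCFv4.1
#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO
1\t10177\trs0000001\tA\tAC\t100\tPASS\tAF=0.42
1\t10352\trs0000002\tT\tTA\t100\tPASS\tAF=0.44
2\t10583\trs0000003\tG\tA\t100\tPASS\tAF=0.14
"""


def parse_vcf(text):
    """Yield (chrom, pos, id, ref, alt) per data line, skipping '#' header lines."""
    for line in text.splitlines():
        if not line or line.startswith("#"):
            continue
        chrom, pos, vid, ref, alt = line.split("\t")[:5]
        yield chrom, int(pos), vid, ref, alt


def load_variants(text):
    """Load parsed records into an in-memory table (stand-in for cached Spark data)."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE variants (chrom TEXT, pos INTEGER, id TEXT, ref TEXT, alt TEXT)"
    )
    conn.executemany("INSERT INTO variants VALUES (?, ?, ?, ?, ?)", parse_vcf(text))
    return conn


def search(conn, chrom, start, end):
    """Return variants on one chromosome within a position range, ordered by position."""
    cur = conn.execute(
        "SELECT id, pos, ref, alt FROM variants "
        "WHERE chrom = ? AND pos BETWEEN ? AND ? ORDER BY pos",
        (chrom, start, end),
    )
    return cur.fetchall()


conn = load_variants(SAMPLE_VCF)
hits = search(conn, "1", 10000, 10200)
print(hits)
```

The table is built once and queried repeatedly, which mirrors the data-reuse idea in the abstract: the cost of parsing is paid a single time, and each later search touches only in-memory data. In a Spark deployment the same query shape would run in parallel across a cluster rather than in one process.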

Author information

Authors and Affiliations

Texas A&M University-Commerce, Commerce, TX, USA
Muthahar Syed & Jinoh Kim
University of Texas Southwestern Medical Center, Dallas, TX, USA
Taehyun Hwang

Corresponding author

Correspondence to Jinoh Kim.

Editor information

Editors and Affiliations

Department of Computer Science, Texas A&M University-Commerce, Commerce, Texas, USA
Sang C. Suh
Department of Electrical and Computer Engineering, The University of Alabama at Birmingham, Birmingham, Alabama, USA
Thomas Anthony

About this paper

Cite this paper

Syed, M., Hwang, T., Kim, J. (2017). Scalable Gene Sequence Analysis on Spark. In: Suh, S., Anthony, T. (eds) Big Data and Visual Analytics. Springer, Cham. https://doi.org/10.1007/978-3-319-63917-8_6

DOI: https://doi.org/10.1007/978-3-319-63917-8_6
Published: 17 January 2018
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-63915-4
Online ISBN: 978-3-319-63917-8
eBook Packages: Computer Science, Computer Science (R0)
