Cloud Computing for Genome-Wide Association Analysis
As genome-wide genotyping and sequencing technologies become more available and affordable, biomedical researchers face growing computational challenges in managing and analyzing large quantities of genetic data. Previously, this data-intensive research required computing and personnel resources accessible only to large institutions. Cloud computing allows researchers to analyze their data without a local computing infrastructure. We evaluated the feasibility of cloud computing for association analysis of genome-wide data. Our approach used the MapReduce model, which divides the analysis into independent units and distributes the work across a computing cloud. We evaluated our approach by modeling the relationships between genetic variants and disease in a simulated genome-wide association study. We generated several data sets of 100,000 subjects and varying numbers of genetic variants, and demonstrated that our analysis approach is scalable and provides an attractive alternative to establishing and maintaining a local computing cluster.
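To illustrate how an association analysis can be divided into independent units under the MapReduce model, the following is a minimal, single-process sketch. It is not the implementation evaluated here: the record layout, the function names (`map_record`, `reduce_variant`, `run_mapreduce`), and the choice of a simple allelic chi-square statistic are all assumptions made for illustration. The point it shows is that each variant can be tested independently once records are grouped by variant, which is what makes the work easy to distribute.

```python
from collections import defaultdict


def map_record(record):
    """Map step: key one genotype record by its variant.

    Assumed record layout (hypothetical):
    (subject_id, variant_id, minor_allele_count in {0, 1, 2}, is_case).
    """
    _subject_id, variant_id, allele_count, is_case = record
    return variant_id, (allele_count, is_case)


def reduce_variant(variant_id, values):
    """Reduce step: allelic chi-square statistic for a single variant.

    Builds a 2x2 table of allele counts (rows: case/control,
    columns: minor/major allele) and compares it with the counts
    expected under no association.
    """
    table = [[0, 0], [0, 0]]
    for allele_count, is_case in values:
        row = 0 if is_case else 1
        table[row][0] += allele_count        # minor alleles carried
        table[row][1] += 2 - allele_count    # major alleles carried

    total = sum(sum(row) for row in table)
    row_totals = [sum(row) for row in table]
    col_totals = [table[0][j] + table[1][j] for j in range(2)]

    chi2 = 0.0
    for i in range(2):
        for j in range(2):
            expected = row_totals[i] * col_totals[j] / total
            if expected > 0:
                chi2 += (table[i][j] - expected) ** 2 / expected
    return variant_id, chi2


def run_mapreduce(records):
    """Local stand-in for the framework: map, group by key, then reduce."""
    grouped = defaultdict(list)
    for record in records:
        key, value = map_record(record)
        grouped[key].append(value)
    # Each key is reduced independently, so this loop could run in parallel.
    return dict(reduce_variant(key, values) for key, values in grouped.items())
```

In a real deployment the grouping and scheduling done by `run_mapreduce` would be handled by the MapReduce framework, and only the map and reduce functions would be supplied by the analyst.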
Keywords: Cloud Computing; Cloud Resource; MapReduce Model; MapReduce Programming Model; Virtual Core