Abstract
Recent progress in genomics and experimental biology has brought exponential growth in the biological information available for computational analysis in public genomics databases. However, realizing the potentially enormous scientific value of this information for the understanding of biological systems requires computing and data storage technology on an unprecedented scale. The Grid, with its aggregated and distributed computational and storage infrastructure, offers an ideal platform for high-throughput bioinformatics analysis. To take advantage of it, we have developed the Genome Analysis Research Environment (GNARE), a scalable computational system for the high-throughput analysis of genomes, which provides an integrated database and computational backend for data-driven bioinformatics applications. GNARE efficiently automates the major steps of genome analysis: acquisition of data from multiple genomic databases, data analysis by a diverse set of bioinformatics tools, and storage of results and annotations.
High-throughput computations in GNARE are performed on distributed, heterogeneous Grid computing resources such as Grid2003, TeraGrid, and the DOE Science Grid. Multi-step genome analysis workflows, which involve massive data processing, the use of application-specific tools and algorithms, and the updating of an integrated database that provides interactive web access to results, are all expressed and controlled by a “virtual data” model that transparently maps computational workflows to distributed Grid resources. This paper describes how Grid technologies such as Globus, Condor, and the GriPhyN Virtual Data System were applied in the development of GNARE. It focuses on our approach to Grid resource allocation and on the use of GNARE as a computational framework for the development of bioinformatics applications.
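As a rough illustration of the idea described above, the following minimal sketch declares a small genome analysis workflow as steps with data dependencies, orders the steps so that each runs only after its inputs exist, and assigns each step to a Grid site. It is not GNARE's implementation and does not use the Virtual Data System's own language. The step names, file names, and the `plan` function are hypothetical, and the round-robin site assignment is a placeholder for real resource allocation.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+
from itertools import cycle

# Hypothetical workflow: each step declares the data it consumes and produces.
STEPS = {
    "acquire": {"inputs": [], "outputs": ["genome.fasta"]},
    "blast": {"inputs": ["genome.fasta"], "outputs": ["blast.out"]},
    "pfam": {"inputs": ["genome.fasta"], "outputs": ["pfam.out"]},
    "store": {"inputs": ["blast.out", "pfam.out"], "outputs": ["db.updated"]},
}

# Grid resources named in the abstract; used here only as labels.
SITES = ["Grid2003", "TeraGrid", "DOE Science Grid"]


def plan(steps, sites):
    """Return (step, site) pairs in an order that respects data dependencies."""
    # Map each data product to the step that produces it.
    producer = {out: name for name, s in steps.items() for out in s["outputs"]}
    # A step depends on the producers of all of its inputs.
    graph = {name: {producer[i] for i in s["inputs"]} for name, s in steps.items()}
    # Topological order guarantees predecessors come before their dependents.
    order = TopologicalSorter(graph).static_order()
    # Placeholder allocation: hand out sites round-robin (an assumption).
    return list(zip(order, cycle(sites)))


if __name__ == "__main__":
    for step, site in plan(STEPS, SITES):
        print(f"{step} -> {site}")
```

The user describes only what each step consumes and produces; the execution order and the placement on resources are derived from that description, which is the sense in which the mapping is transparent.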
Additional information
Based on “GNARE: An Environment for Grid-Based High-Throughput Genome Analysis” by D. Sulakhe, A. Rodriguez, M. D'Souza, M. Wilde, V. Nefedova, I. Foster, and N. Maltsev, which appeared in Proceedings of CCGRID 2005. © 2005 IEEE.
Cite this article
Sulakhe, D., Rodriguez, A., D'Souza, M. et al. Gnare: Automated System For High-Throughput Genome Analysis With Grid Computational Backend. J Clin Monit Comput 19, 361–369 (2005). https://doi.org/10.1007/s10877-005-3463-y