Gnare: Automated System For High-Throughput Genome Analysis With Grid Computational Backend

Sulakhe, Dinanath; Rodriguez, Alex; D'Souza, Mark; Wilde, Michael; Nefedova, Veronika; Foster, Ian; Maltsev, Natalia

doi:10.1007/s10877-005-3463-y

Published: October 2005

Volume 19, pages 361–369 (2005)

Journal of Clinical Monitoring and Computing

Dinanath Sulakhe¹,
Alex Rodriguez¹,
Mark D'Souza¹,
Michael Wilde¹,
Veronika Nefedova¹,
Ian Foster¹,² &
…
Natalia Maltsev¹

Abstract

Recent progress in genomics and experimental biology has brought exponential growth of the biological information available for computational analysis in public genomics databases. However, applying the potentially enormous scientific value of this information to the understanding of biological systems requires computing and data storage technology of an unprecedented scale. The Grid, with its aggregated and distributed computational and storage infrastructure, offers an ideal platform for high-throughput bioinformatics analysis. To leverage this platform, we have developed the Genome Analysis Research Environment (GNARE), a scalable computational system for the high-throughput analysis of genomes that provides an integrated database and computational backend for data-driven bioinformatics applications. GNARE efficiently automates the major steps of genome analysis: acquisition of data from multiple genomic databases, data analysis by a diverse set of bioinformatics tools, and storage of results and annotations.

High-throughput computations in GNARE are performed using distributed heterogeneous Grid computing resources such as Grid2003, TeraGrid, and the DOE Science Grid. Multi-step genome analysis workflows, which involve massive data processing, the use of application-specific tools and algorithms, and the updating of an integrated database to provide interactive web access to results, are all expressed and controlled by a “virtual data” model that transparently maps computational workflows to distributed Grid resources. This paper describes how Grid technologies such as Globus, Condor, and the GriPhyN Virtual Data System were applied in the development of GNARE. It focuses on our approach to Grid resource allocation and to the use of GNARE as a computational framework for the development of bioinformatics applications.
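
As a rough illustration of the idea of expressing a multi-step analysis as a workflow and mapping its steps onto distributed resources, the following minimal sketch orders a small dependency graph and assigns each step to a site. It is not GNARE's implementation and does not use the Virtual Data System, Globus, or Condor: the step names, the `plan` function, and the round-robin assignment policy are all assumptions made for illustration. Only the three stages (acquisition, analysis, storage) and the resource names come from the abstract.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+
from itertools import cycle

# Hypothetical workflow: each step maps to the set of steps it depends on.
# The step names are illustrative; only the three stages come from the abstract.
WORKFLOW = {
    "acquire_data": set(),
    "similarity_search": {"acquire_data"},
    "domain_search": {"acquire_data"},
    "store_results": {"similarity_search", "domain_search"},
}

# Resource names as listed in the abstract.
SITES = ["Grid2003", "TeraGrid", "DOE Science Grid"]


def plan(workflow, sites):
    """Return (step, site) pairs in an order that respects dependencies.

    Sites are assigned round-robin, which is an assumed placeholder policy
    and not the resource allocation approach described in the paper.
    """
    order = TopologicalSorter(workflow).static_order()
    return list(zip(order, cycle(sites)))


if __name__ == "__main__":
    for step, site in plan(WORKFLOW, SITES):
        print(f"{step} -> {site}")
```

The point of the sketch is the separation between the abstract description of the workflow (the dependency graph) and the decision about where each step runs; a real system would replace the round-robin policy with one informed by resource availability and data location.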

Author information

Authors and Affiliations

Mathematics and Computer Science Division, Argonne National Laboratory, Argonne, IL, 60439, USA
Dinanath Sulakhe, Alex Rodriguez, Mark D'Souza, Michael Wilde, Veronika Nefedova, Ian Foster & Natalia Maltsev
Department of Computer Science, University of Chicago, Chicago, IL, 60637, USA
Ian Foster

Additional information

Based on “GNARE: An Environment for Grid-Based High-Throughput Genome Analysis” by D. Sulakhe, A. Rodriguez, M. D'Souza, M. Wilde, V. Nefedova, I. Foster, and N. Maltsev, which appeared in Proceedings of CCGRID 2005. © 2005 IEEE.

About this article

Cite this article

Sulakhe, D., Rodriguez, A., D'Souza, M. et al. Gnare: Automated System For High-Throughput Genome Analysis With Grid Computational Backend. J Clin Monit Comput 19, 361–369 (2005). https://doi.org/10.1007/s10877-005-3463-y

Received: 30 June 2005
Accepted: 30 June 2005
Issue Date: October 2005
DOI: https://doi.org/10.1007/s10877-005-3463-y
