Gnare: Automated System For High-Throughput Genome Analysis With Grid Computational Backend

Journal of Clinical Monitoring and Computing

Abstract

Recent progress in genomics and experimental biology has brought exponential growth of the biological information available for computational analysis in public genomics databases. However, applying the potentially enormous scientific value of this information to the understanding of biological systems requires computing and data storage technology of an unprecedented scale. The Grid, with its aggregated and distributed computational and storage infrastructure, offers an ideal platform for high-throughput bioinformatics analysis. To leverage this platform, we have developed the Genome Analysis Research Environment (GNARE), a scalable computational system for the high-throughput analysis of genomes, which provides an integrated database and computational backend for data-driven bioinformatics applications. GNARE efficiently automates the major steps of genome analysis, including acquisition of data from multiple genomic databases, data analysis by a diverse set of bioinformatics tools, and storage of results and annotations.

High-throughput computations in GNARE are performed using distributed heterogeneous Grid computing resources such as Grid2003, TeraGrid, and the DOE Science Grid. Multi-step genome analysis workflows involve massive data processing, the use of application-specific tools and algorithms, and the updating of an integrated database that provides interactive web access to results. These workflows are all expressed and controlled by a “virtual data” model, which transparently maps computational workflows to distributed Grid resources. This paper describes how Grid technologies such as Globus, Condor, and the GriPhyN Virtual Data System were applied in the development of GNARE. It focuses on our approach to Grid resource allocation and to the use of GNARE as a computational framework for the development of bioinformatics applications.
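
As a rough illustration of the idea of expressing a multi-step workflow declaratively and letting a planner map it onto distributed resources, the following minimal Python sketch derives an execution order from declared inputs and outputs and assigns each step to a site. It is not GNARE's implementation and does not use the Virtual Data System's actual language or API: the step names, file names, site names, and the `plan` function are hypothetical, and the round-robin assignment is only a stand-in for a real resource allocation policy.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+
from itertools import cycle

# Hypothetical workflow: each step declares the datasets it consumes and produces.
STEPS = {
    "fetch_sequences": {"inputs": [], "outputs": ["sequences.fasta"]},
    "run_blast": {"inputs": ["sequences.fasta"], "outputs": ["blast_hits.tsv"]},
    "run_pfam": {"inputs": ["sequences.fasta"], "outputs": ["pfam_hits.tsv"]},
    "load_database": {
        "inputs": ["blast_hits.tsv", "pfam_hits.tsv"],
        "outputs": ["annotations.db"],
    },
}


def plan(steps, sites):
    """Return (step, site) pairs in an order that respects data dependencies."""
    # Map every dataset to the step that produces it.
    producer = {out: name for name, step in steps.items() for out in step["outputs"]}
    # A step depends on the producers of each of its inputs.
    deps = {
        name: {producer[dataset] for dataset in step["inputs"]}
        for name, step in steps.items()
    }
    order = TopologicalSorter(deps).static_order()
    # Assumption: round-robin placement, purely for illustration.
    site_cycle = cycle(sites)
    return [(name, next(site_cycle)) for name in order]


if __name__ == "__main__":
    for step, site in plan(STEPS, ["site_a", "site_b"]):
        print(f"{step} -> {site}")
```

The point of the sketch is that the user only states what each step needs and produces; ordering and placement are derived from those declarations, so the same workflow description can be replanned when the set of available sites changes.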

Additional information

Based on “GNARE: An Environment for Grid-Based High-Throughput Genome Analysis” by D. Sulakhe, A. Rodriguez, M. D'Souza, M. Wilde, V. Nefedova, I. Foster, and N. Maltsev, which appeared in Proceedings of CCGRID 2005. © 2005 IEEE.

About this article

Cite this article

Sulakhe, D., Rodriguez, A., D'Souza, M. et al. Gnare: Automated System For High-Throughput Genome Analysis With Grid Computational Backend. J Clin Monit Comput 19, 361–369 (2005). https://doi.org/10.1007/s10877-005-3463-y

  • DOI: https://doi.org/10.1007/s10877-005-3463-y
