Computational Infrastructures for Data and Knowledge Management in Systems Biology

  • Fotis Georgatos
  • Stéphane Ballereau
  • Johann Pellet
  • Moustafa Ghanem
  • Nathan Price
  • Leroy Hood
  • Yi-Ke Guo
  • Dominique Boutigny
  • Charles Auffray
  • Rudi Balling
  • Reinhard Schneider

Abstract

The volume, complexity and heterogeneity of data originating from high-throughput functional genomics technologies have created both challenges and opportunities for information technology (IT) departments. These demands have also increased the cost of IT infrastructure, such as computing power and storage devices, and of the personnel required to maintain it. This chapter describes some of the challenges facing computational analysis infrastructure, including the bottlenecks and most pressing needs that must be addressed to effectively support the development of systems biology and its application in medicine.

Keywords

Bioinformatics; Infrastructures; Information technology; Knowledge management; Systems biology; Translational research; Data centers; Storage; Computational science; Computational infrastructures; Computing; Scientific computing; e-science; Data management; High performance computing; Cluster; Grid; Desktop grid; Cloud

Acronyms

3D

Three dimensional

4D

Four dimensional; typically 3D plus a time dimension

BOINC

Berkeley Open Infrastructure for Network Computing

CERN

European Council for Nuclear Research

CPU

Central Processing Unit

EBI

European Bioinformatics Institute

EGEE

Enabling Grids for E-sciencE

EGI

European Grid Infrastructure

ELIXIR

European Life Sciences Infrastructure for Biological Information

EMBL

European Molecular Biology Laboratory

Flops

FLoating-point Operations Per Second

GPU

Graphics Processing Unit

HPC

High Performance Computing

HTC

High Throughput Computing

IaaS

Infrastructure as a Service

IT

Information Technology

I/O

Input/Output—typically used in the context of software processing of data

LHC

Large Hadron Collider

MPI

Message Passing Interface

Omics

A collective term to refer to -omics keywords like metabolomics, genomics, proteomics etc.

PaaS

Platform as a Service

PRACE

Partnership for Advanced Computing in Europe

ROI

Return On Investment

SaaS

Software as a Service

SBML

Systems Biology Markup Language

WLCG

Worldwide LHC Computing Grid

Copyright information

© Springer Science+Business Media Dordrecht 2013

Authors and Affiliations

  • Fotis Georgatos (1)
  • Stéphane Ballereau (2)
  • Johann Pellet (2)
  • Moustafa Ghanem (3)
  • Nathan Price (4)
  • Leroy Hood (4)
  • Yi-Ke Guo (3)
  • Dominique Boutigny (5)
  • Charles Auffray (2)
  • Rudi Balling (1)
  • Reinhard Schneider (1)

  1. Luxembourg Centre for Systems Biomedicine, University of Luxembourg, Esch-sur-Alzette, Luxembourg
  2. European Institute for Systems Biology and Medicine, CNRS-UCBL-ENS, Université de Lyon, Lyon, France
  3. Department of Computing, Imperial College London, London, UK
  4. Institute for Systems Biology, Seattle, USA
  5. Centre de Calcul de l’IN2P3, USR6402 CNRS/IN2P3, Villeurbanne Cedex, France
