Enabling Data and Compute Intensive Workflows in Bioinformatics

  • Gaurang Mehta
  • Ewa Deelman
  • James A. Knowles
  • Ting Chen
  • Ying Wang
  • Jens Vöckler
  • Steven Buyske
  • Tara Matise
Part of the Lecture Notes in Computer Science book series (LNCS, volume 7156)


Accelerated growth in the field of bioinformatics has resulted in large data sets being produced and analyzed. With this rapid growth has come the need to analyze these data in a quick, easy, scalable, and reliable manner on a variety of computing infrastructures, including desktops, clusters, grids, and clouds. This paper presents the application of workflow technologies, specifically Pegasus WMS, a robust scientific workflow management system, to a variety of bioinformatics projects, ranging from RNA sequencing and proteomics to data quality control in population studies using GWAS data.


Keywords: workflows, bioinformatics, sequencing, epigenetics, proteomics





Copyright information

© Springer-Verlag Berlin Heidelberg 2012

Authors and Affiliations

  • Gaurang Mehta (1)
  • Ewa Deelman (1)
  • James A. Knowles (2)
  • Ting Chen (3)
  • Ying Wang (3, 5)
  • Jens Vöckler (1)
  • Steven Buyske (4)
  • Tara Matise (4)
  1. USC Information Sciences Institute, USA
  2. Keck School of Medicine of USC, USA
  3. University of Southern California, USA
  4. Rutgers University, USA
  5. Xiamen University, P.R. China
