Collection-Oriented Scientific Workflows for Integrating and Analyzing Biological Data

  • Conference paper
Data Integration in the Life Sciences (DILS 2006)

Part of the book series: Lecture Notes in Computer Science (LNBI, volume 4075)

Abstract

Steps in scientific workflows often generate collections of results, causing the data flowing through workflows to become increasingly nested. Because conventional workflow components (or actors) typically operate on simple or application-specific data types, additional actors are often required to manage these nested data collections. As a result, conventional workflows become increasingly complex as data becomes more nested. This paper describes a new paradigm for developing scientific workflows that transparently manages nested data collections. Collection-oriented workflows have a number of advantages over conventional approaches, including simpler workflow designs (e.g., requiring fewer actors and control-flow constructs) that are invariant under changes in data nesting. Our implementation within the Kepler scientific workflow system enables the explicit representation of collections and collection schemas, supports concurrent operation over collection contents via multi-level pipeline parallelism, and allows collection-aware actors to be composed readily from conventional actors.
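The idea of composing a collection-aware actor from a conventional one can be illustrated with a minimal Python sketch. This is not the Kepler implementation described in the paper: the token classes `Open`, `Close`, and `Data`, the helpers `flatten` and `collection_aware`, and the label `"sequences"` are all hypothetical names chosen for illustration. The sketch assumes nested collections are serialized as a flat stream with opening and closing delimiters, and that an actor handles only the collections it targets while passing everything else through unchanged.

```python
from dataclasses import dataclass
from typing import Any, Callable, Iterable, Iterator, List, Optional


@dataclass(frozen=True)
class Open:
    label: str  # marks the start of a collection (hypothetical token type)


@dataclass(frozen=True)
class Close:
    label: str  # marks the end of a collection (hypothetical token type)


@dataclass(frozen=True)
class Data:
    value: Any  # a single data item inside some collection


def flatten(label: str, items: list) -> Iterator[Any]:
    """Serialize a nested (label, [children]) structure into a flat token stream."""
    yield Open(label)
    for item in items:
        if isinstance(item, tuple):
            yield from flatten(item[0], item[1])
        else:
            yield Data(item)
    yield Close(label)


def collection_aware(func: Callable[[List[Any]], Any], target: str):
    """Wrap a conventional function (list -> value) as a stream actor.

    The actor applies func to the direct data items of every collection
    labeled `target`, at any nesting depth, and appends the result to that
    collection. All other tokens pass through untouched.
    """
    def actor(tokens: Iterable[Any]) -> Iterator[Any]:
        # One entry per open collection: a buffer for targets, None otherwise.
        buffers: List[Optional[List[Any]]] = []
        for tok in tokens:
            if isinstance(tok, Open):
                buffers.append([] if tok.label == target else None)
            elif isinstance(tok, Data):
                if buffers and buffers[-1] is not None:
                    buffers[-1].append(tok.value)
            elif isinstance(tok, Close):
                buf = buffers.pop()
                if buf is not None:
                    yield Data(func(buf))  # emit the result before the delimiter
            yield tok
    return actor


# The same actor works whether "sequences" sits at depth 1 or depth 2.
nested = ("project", [("sequences", [3, 5]), ("trial", [("sequences", [10])])])
summed = collection_aware(sum, "sequences")
out = list(summed(flatten(*nested)))
```

Because the wrapper reacts only to collection labels and not to their depth, moving `"sequences"` deeper into the hierarchy would require no change to the actor, which is the kind of invariance under data nesting the abstract refers to.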

Work supported in part by SciDAC/SDM (DE-FC02-01ER25486), NSF/SEEK (DBI-0533368), and NSF/GEON (EAR-0225673).

Copyright information

© 2006 Springer-Verlag Berlin Heidelberg

About this paper

Cite this paper

McPhillips, T., Bowers, S., Ludäscher, B. (2006). Collection-Oriented Scientific Workflows for Integrating and Analyzing Biological Data. In: Leser, U., Naumann, F., Eckman, B. (eds) Data Integration in the Life Sciences. DILS 2006. Lecture Notes in Computer Science, vol 4075. Springer, Berlin, Heidelberg. https://doi.org/10.1007/11799511_23

  • DOI: https://doi.org/10.1007/11799511_23

  • Publisher Name: Springer, Berlin, Heidelberg

  • Print ISBN: 978-3-540-36593-8

  • Online ISBN: 978-3-540-36595-2

  • eBook Packages: Computer Science, Computer Science (R0)
