Collection-Oriented Scientific Workflows for Integrating and Analyzing Biological Data

  • Timothy McPhillips
  • Shawn Bowers
  • Bertram Ludäscher
Part of the Lecture Notes in Computer Science book series (LNCS, volume 4075)


Steps in scientific workflows often generate collections of results, causing the data flowing through workflows to become increasingly nested. Because conventional workflow components (or actors) typically operate on simple or application-specific data types, additional actors are often required to manage these nested data collections. As a result, conventional workflows become increasingly complex as data becomes more nested. This paper describes a new paradigm for developing scientific workflows that transparently manages nested data collections. Collection-oriented workflows have a number of advantages over conventional approaches, including simpler workflow designs (e.g., requiring fewer actors and control-flow constructs) that are invariant under changes in data nesting. Our implementation within the Kepler scientific workflow system enables the explicit representation of collections and collection schemas, supports concurrent operation over collection contents via multi-level pipeline parallelism, and allows collection-aware actors to be composed readily from conventional actors.
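To make the idea of composing a collection-aware actor from a conventional one more concrete, here is a minimal Python sketch. It is not the Kepler API and is not taken from the paper: the `Collection` class, the `lift` helper, and the `reverse_complement` actor are hypothetical names chosen for illustration, and the sketch ignores collection schemas and pipeline parallelism entirely. It only shows how a wrapper might apply an actor written for a simple data type to matching items at any nesting depth while passing everything else through unchanged.

```python
from dataclasses import dataclass, field
from typing import Any, Callable, List


@dataclass
class Collection:
    """Hypothetical nested collection: a label plus ordered contents.

    Contents may be plain data items or further Collection objects.
    """
    label: str
    items: List[Any] = field(default_factory=list)


def lift(actor: Callable[[Any], Any], item_type: type) -> Callable[[Collection], Collection]:
    """Wrap a conventional actor so that it accepts nested collections.

    The wrapped actor is applied to every data item of `item_type`, at any
    nesting depth. Other items and the collection structure are copied
    through unchanged. (Assumed design, for illustration only.)
    """
    def collection_aware(coll: Collection) -> Collection:
        out = []
        for item in coll.items:
            if isinstance(item, Collection):
                out.append(collection_aware(item))  # recurse into sub-collection
            elif isinstance(item, item_type):
                out.append(actor(item))             # item the actor understands
            else:
                out.append(item)                    # pass through untouched
        return Collection(coll.label, out)

    return collection_aware


# A conventional actor: it knows nothing about collections.
_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq: str) -> str:
    """Return the reverse complement of a DNA sequence string."""
    return seq.translate(_COMPLEMENT)[::-1]


rc = lift(reverse_complement, str)

# The same lifted actor works on flat and on more deeply nested input.
flat = Collection("project", ["ATGC", 42])
nested = Collection("project", [Collection("sample", ["ATGC"]), 42])

flat_result = rc(flat)
nested_result = rc(nested)
```

In this sketch the workflow step `rc` does not change when the sequence moves from the top-level collection into a nested `sample` collection, which is one way to read the invariance under changes in data nesting described above.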


Keywords: Data Item, Collection Type, Collection Schema, Composite Actor, Read Scope
These keywords were added by machine and not by the authors. This process is experimental and the keywords may be updated as the learning algorithm improves.





Copyright information

© Springer-Verlag Berlin Heidelberg 2006

Authors and Affiliations

  • Timothy McPhillips (1)
  • Shawn Bowers (1)
  • Bertram Ludäscher (1, 2)
  1. UC Davis Genome Center, University of California, Davis
  2. Department of Computer Science, University of California, Davis
