StarFlow: A Script-Centric Data Analysis Environment

  • Elaine Angelino
  • Daniel Yamins
  • Margo Seltzer
Part of the Lecture Notes in Computer Science book series (LNCS, volume 6378)

Abstract

We introduce StarFlow, a script-centric environment for data analysis. StarFlow has four main features: (1) extraction of control and data-flow dependencies through a novel combination of static analysis, dynamic runtime analysis, and user annotations, (2) command-line tools for exploring and propagating changes through the resulting dependency network, (3) support for workflow abstractions enabling robust parallel executions of complex analysis pipelines, and (4) a seamless interface with the Python scripting language. We describe real applications of StarFlow, including automatic parallelization of complex workflows in the cloud.
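As an illustration of feature (2), the sketch below shows one way that propagating a change through a dependency network can work in principle. It is not StarFlow's implementation or API: the function name `downstream_in_order`, the dictionary-based graph, and the node names are all hypothetical, and the graph is assumed to be acyclic.

```python
from collections import defaultdict, deque


def downstream_in_order(depends_on, changed):
    """Return the nodes to re-run after `changed` is modified, in a valid order.

    `depends_on` maps each node to the nodes it reads from (hypothetical
    representation; the graph is assumed to be a DAG).
    """
    # Invert the edges: for every node, record which nodes consume it.
    consumers = defaultdict(set)
    for node, deps in depends_on.items():
        for dep in deps:
            consumers[dep].add(node)

    # Breadth-first search to collect everything downstream of `changed`.
    affected = set()
    queue = deque([changed])
    while queue:
        current = queue.popleft()
        for nxt in consumers[current]:
            if nxt not in affected:
                affected.add(nxt)
                queue.append(nxt)

    # Kahn's algorithm restricted to the affected subgraph, so that every
    # node is re-run only after all of its affected inputs are fresh.
    indegree = {
        n: sum(1 for d in depends_on.get(n, ()) if d in affected)
        for n in affected
    }
    ready = deque(sorted(n for n, k in indegree.items() if k == 0))
    order = []
    while ready:
        current = ready.popleft()
        order.append(current)
        for nxt in sorted(consumers[current]):
            if nxt in affected:
                indegree[nxt] -= 1
                if indegree[nxt] == 0:
                    ready.append(nxt)
    return order


# Hypothetical pipeline: each step lists the data or steps it reads from.
pipeline = {
    "clean": ["raw.csv"],
    "stats": ["clean"],
    "plot": ["clean", "stats"],
    "report": ["plot"],
    "other": ["other.csv"],
}
print(downstream_in_order(pipeline, "raw.csv"))
```

Steps that do not depend on the modified node, such as `other` here, are left untouched, which is the point of tracking dependencies rather than re-running an entire script.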

Keywords

automatic parallelization; automatic updating; computational workflows; control flow; data-flow; data analysis; dependency tracking; provenance; Python; workflow abstraction

Copyright information

© Springer-Verlag Berlin Heidelberg 2010

Authors and Affiliations

  • Elaine Angelino (1)
  • Daniel Yamins (1)
  • Margo Seltzer (1)
  1. School of Engineering and Applied Sciences, Harvard University, Cambridge, USA
