Article

Cluster Computing, Volume 13, Issue 3, pp 315-333

Parameterized specification, configuration and execution of data-intensive scientific workflows

  • Vijay S. Kumar, Dept. of Computer Science and Engineering, Ohio State University
  • Tahsin Kurc, Center for Comprehensive Informatics, Emory University
  • Varun Ratnakar, Information Sciences Institute, University of Southern California
  • Jihie Kim, Information Sciences Institute, University of Southern California
  • Gaurang Mehta, Information Sciences Institute, University of Southern California
  • Karan Vahi, Information Sciences Institute, University of Southern California
  • Yoonju Lee Nelson, Information Sciences Institute, University of Southern California
  • P. Sadayappan, Dept. of Computer Science and Engineering, Ohio State University
  • Ewa Deelman, Information Sciences Institute, University of Southern California
  • Yolanda Gil, Information Sciences Institute, University of Southern California
  • Mary Hall, School of Computing, University of Utah
  • Joel Saltz, Center for Comprehensive Informatics, Emory University

Abstract

Data analysis processes in scientific applications can be expressed as coarse-grain workflows of complex data processing operations with data flow dependencies between them. Performance optimization of these workflows can be viewed as a search for a set of optimal values in a multidimensional parameter space consisting of input performance parameters to the applications that are known to affect their execution times. While some performance parameters such as grouping of workflow components and their mapping to machines do not affect the accuracy of the analysis, others may dictate trading the output quality of individual components (and of the whole workflow) for performance. This paper describes an integrated framework which is capable of supporting performance optimizations along multiple such parameters. Using two real-world applications in the spatial, multidimensional data analysis domain, we present an experimental evaluation of the proposed framework.

Keywords

Scientific workflow; Performance parameters; Semantic representations; Grid; Application QoS