A Data Locality Aware Online Scheduling Approach for I/O-Intensive Jobs with File Sharing

  • Gaurav Khanna
  • Umit Catalyurek
  • Tahsin Kurc
  • P. Sadayappan
  • Joel Saltz
Part of the Lecture Notes in Computer Science book series (LNCS, volume 4376)


Many scientific investigations have to deal with large amounts of data from simulations and experiments. Data analysis in such investigations typically involves extracting subsets of the data and then performing computations on the extracted data. Scheduling in this context requires efficient use of computational, storage, and network resources to optimize response time. The data-intensive nature of such applications necessitates data-locality-aware job scheduling algorithms. This paper proposes a hypergraph-based dynamic scheduling heuristic for a stream of independent I/O-intensive jobs with file-sharing behavior. The proposed heuristic is based on an event-driven, run-time hypergraph model of the file-sharing characteristics among jobs. Our experiments on a coupled compute/storage cluster show that the heuristic outperforms previously proposed strategies under a varying set of parameters, for workloads from the application domain of biomedical image analysis.
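The idea of modeling file sharing among jobs as a hypergraph can be illustrated with a minimal sketch. It assumes that jobs are the vertices and that each file is a hyperedge connecting the jobs that read it; the abstract does not spell out this mapping. The greedy placement rule in `schedule` is a simple locality-aware stand-in and is not the heuristic proposed in the paper. All names (`build_hypergraph`, `schedule`, the job, file, and node labels) are hypothetical.

```python
from collections import defaultdict


def build_hypergraph(job_files):
    """Map each file to the set of jobs that read it.

    Assumption: jobs are vertices and each file is one hyperedge.
    """
    nets = defaultdict(set)
    for job, files in job_files.items():
        for f in files:
            nets[f].add(job)
    return dict(nets)


def schedule(job_files, file_sizes, nodes):
    """Greedy online placement (illustrative only, not the paper's heuristic).

    Each job, taken in arrival order, goes to the node that would have to
    fetch the fewest bytes; ties are broken by the lower accumulated load.
    """
    cache = {n: set() for n in nodes}   # files already staged on each node
    load = {n: 0 for n in nodes}        # bytes of input assigned so far
    placement = {}
    for job, files in job_files.items():
        def cost(n):
            missing = sum(file_sizes[f] for f in files if f not in cache[n])
            return (missing, load[n])
        best = min(nodes, key=cost)
        placement[job] = best
        cache[best].update(files)
        load[best] += sum(file_sizes[f] for f in files)
    return placement


# Hypothetical workload: j1 and j2 share file "b"; j3 shares nothing.
job_files = {"j1": ["a", "b"], "j2": ["b", "c"], "j3": ["d"]}
file_sizes = {"a": 10, "b": 20, "c": 5, "d": 8}
nets = build_hypergraph(job_files)
placement = schedule(job_files, file_sizes, ["n0", "n1"])
```

In this toy workload the two jobs that share a file are placed on the same node, so the shared file is staged only once, while the unrelated job goes to the idle node.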


Execution Time; Completion Time; Average Response Time; Storage Node; Gantt Chart
These keywords were added by machine and not by the authors. This process is experimental and the keywords may be updated as the learning algorithm improves.





Copyright information

© Springer Berlin Heidelberg 2007

Authors and Affiliations

  • Gaurav Khanna (1)
  • Umit Catalyurek (2)
  • Tahsin Kurc (2)
  • P. Sadayappan (1)
  • Joel Saltz (2)

  1. Dept. of Computer Science and Engineering
  2. Dept. of Biomedical Informatics, The Ohio State University
