Failure Recovery Alternatives in Grid-Based Distributed Query Processing: A Case Study

  • Jim Smith
  • Paul Watson
Conference paper


Fault-tolerance has long been a feature of database systems: transactions let applications be structured so that those performing updates can continue in spite of machine failures. For read-only queries, the received wisdom has been that support for fault-tolerance is too expensive to be worthwhile. Distributed query processing (DQP) is coming to be seen as a promising way of implementing applications that combine structured data and analysis operations in dynamic distributed settings such as computational grids. Accordingly, a number of protocols have been described that tolerate the failure of intermediate machines, so that a query can continue from surviving intermediate state. However, a distributed query can have a non-trivial mapping onto hardware resources, and because of this it is often possible to choose among several recovery strategies when a failure occurs. The work described here makes an initial investigation of this choice, using an example query expressed over distributed resources in a Grid, and shows that it can be worthwhile to choose between recovery alternatives dynamically, at the point a failure is detected, rather than statically beforehand.


Keywords: distributed query processing, fault-tolerance, parallel query processing, rollback-recovery





Copyright information

© Springer Science+Business Media, LLC 2007

Authors and Affiliations

  • Jim Smith (1)
  • Paul Watson (1)
  1. Newcastle University, Newcastle upon Tyne, UK
