Journal of the Brazilian Computer Society, Volume 16, Issue 3, pp 177–190

Reliable management of checkpointing and application data in opportunistic grids

  • Raphael Y. de Camargo
  • Fernando Castor
  • Fabio Kon
Open Access
Original Paper

Abstract

Opportunistic computational grids use idle processor cycles from shared machines to execute long-running parallel applications. Besides requiring computational power, these applications may also consume and generate large amounts of data, which calls for an efficient data storage and management infrastructure. In this article, we present an integrated middleware infrastructure that uses not only the idle processor cycles but also the unused disk space of shared machines. Our middleware stores application data on the shared machines in a distributed, redundant, and fault-tolerant way. A checkpointing-based mechanism monitors the execution of parallel applications, saves periodic checkpoints on the shared machines, and, when nodes fail, supports application migration across heterogeneous grid nodes. We evaluate the feasibility of our middleware using experiments and simulations. Our evaluation shows that the proposed middleware improves the reliability of grid data management while imposing a low performance overhead.
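
The interplay between periodic checkpoints and redundant storage described above can be illustrated with a minimal sketch. It is not the middleware presented in the article: the names StorageNode, save_checkpoint, load_checkpoint, and run are hypothetical, storage nodes are simulated in memory, and redundancy is obtained by plain replication, an assumption chosen for brevity rather than taken from the abstract.

```python
import pickle


class StorageNode:
    """Hypothetical stand-in for the unused disk space of one shared machine."""

    def __init__(self, name):
        self.name = name
        self.alive = True
        self.blobs = {}

    def put(self, key, data):
        if not self.alive:
            raise ConnectionError(f"{self.name} is unavailable")
        self.blobs[key] = data

    def get(self, key):
        if not self.alive:
            raise ConnectionError(f"{self.name} is unavailable")
        return self.blobs[key]


def save_checkpoint(nodes, key, state, replicas=3):
    """Serialize the state and store copies on up to `replicas` live nodes."""
    data = pickle.dumps(state)
    stored = 0
    for node in nodes:
        if stored == replicas:
            break
        try:
            node.put(key, data)
            stored += 1
        except ConnectionError:
            continue  # skip failed nodes and try the next one
    if stored == 0:
        raise RuntimeError("no storage node accepted the checkpoint")
    return stored


def load_checkpoint(nodes, key):
    """Return the state from the first live node that still holds the key."""
    for node in nodes:
        try:
            return pickle.loads(node.get(key))
        except (ConnectionError, KeyError):
            continue
    raise RuntimeError("checkpoint lost: no surviving replica")


def run(nodes, total_steps, checkpoint_every, state=None):
    """Toy long-running job: sums step indices and checkpoints periodically."""
    state = state or {"step": 0, "acc": 0}
    while state["step"] < total_steps:
        state["acc"] += state["step"]
        state["step"] += 1
        if state["step"] % checkpoint_every == 0:
            save_checkpoint(nodes, "job-1", state)
    return state


if __name__ == "__main__":
    nodes = [StorageNode(f"n{i}") for i in range(4)]
    run(nodes, total_steps=7, checkpoint_every=3)   # last checkpoint at step 6
    nodes[0].alive = nodes[1].alive = False         # two machines leave the grid
    restored = load_checkpoint(nodes, "job-1")      # served by a surviving replica
    print(run(nodes, 10, 3, state=restored))
```

In this sketch the job resumes from the last saved step rather than from the beginning, and the checkpoint survives as long as at least one replica remains reachable. A real deployment would also have to handle portability of the saved state across heterogeneous nodes, which the sketch ignores.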

Keywords

Grid computing, Distributed data storage, Opportunistic grid, Grid middleware

Copyright information

© The Brazilian Computer Society 2010

Authors and Affiliations

  • Raphael Y. de Camargo (1)
  • Fernando Castor (2)
  • Fabio Kon (3)

  1. Center for Mathematics, Computation and Cognition, Federal University of ABC (UFABC), Santo André/SP, Brazil
  2. Informatics Center, Federal University of Pernambuco (UFPE), Recife/PE, Brazil
  3. Department of Computer Science, University of São Paulo (USP), São Paulo/SP, Brazil
