An implementation of parallel file distribution in an agent hierarchy
Abstract
PC grid is a cost-effective grid-computing platform that attracts users by allocating as many desktop computers as their massively parallel applications request. The challenge, however, is how to distribute the necessary files to remote computing nodes that may not share a network file system, may lack the disk space to hold entire files, and may even be powered off asynchronously.
Targeting PC grid, the AgentTeamwork grid-computing middleware deploys a hierarchy of mobile agents to remote desktops to launch, monitor, check-point, and resume a parallel and distributed computing job. To achieve high-speed file distribution, AgentTeamwork takes advantage of this agent hierarchy. The system partitions files into stripes at the tree root if they are random-access files, duplicates them at each tree level if they are shared among all remote nodes, fragments them into smaller messages if they are too large to relay to a lower tree level, aggregates such messages into a larger fragment if they are in transit to the same subtree, and returns output files to the user along multiple paths established within the tree. To achieve fault-tolerant file delivery, each agent periodically takes a snapshot of in-transit and in-memory file messages together with its user job, and resumes them from the latest snapshot after a crash.
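The partitioning, fragmentation, and aggregation steps named in the abstract can be illustrated with a minimal Python sketch. It is not AgentTeamwork's implementation: the function names, the round-robin stripe assignment, and the `subtree_of` routing table are all assumptions made for illustration, since the abstract does not specify them.

```python
from collections import defaultdict


def stripe(data: bytes, num_nodes: int, stripe_size: int) -> dict[int, list[bytes]]:
    """Cut a random-access file into fixed-size stripes and deal them to
    nodes round-robin (the round-robin policy is an assumption)."""
    stripes = defaultdict(list)
    for i, offset in enumerate(range(0, len(data), stripe_size)):
        stripes[i % num_nodes].append(data[offset:offset + stripe_size])
    return dict(stripes)


def fragment(payload: bytes, max_message: int) -> list[bytes]:
    """Split a payload that is too large to relay into smaller messages."""
    return [payload[i:i + max_message] for i in range(0, len(payload), max_message)]


def aggregate(
    messages: list[tuple[int, bytes]], subtree_of: dict[int, int]
) -> dict[int, list[tuple[int, bytes]]]:
    """Group (destination node, message) pairs by the child subtree they
    must pass through, so each subtree receives one combined batch."""
    batches = defaultdict(list)
    for dest, msg in messages:
        batches[subtree_of[dest]].append((dest, msg))
    return dict(batches)
```

In this sketch a root agent would call `stripe` once per random-access file, while each intermediate agent would call `fragment` or `aggregate` depending on whether a message is larger or smaller than what it can relay to the next tree level.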
This paper presents an implementation of AgentTeamwork’s file-distribution algorithm, including file partitioning, transfer, check-pointing, and consistency maintenance, and reports its competitive performance.
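The check-pointing idea, in which an agent snapshots its buffered file messages together with its job and later resumes from that snapshot, might look roughly like the following Python sketch. The `RelayAgent` class, its fields, and the use of `pickle` for serialization are hypothetical stand-ins; the abstract does not describe how AgentTeamwork serializes or stores snapshots.

```python
import pickle


class RelayAgent:
    """Hypothetical agent that buffers in-transit file messages."""

    def __init__(self):
        self.in_transit = []  # messages received but not yet forwarded
        self.job_state = {}   # stand-in for the user job's state

    def snapshot(self) -> bytes:
        """Serialize the buffered messages together with the job state."""
        return pickle.dumps(
            {"in_transit": self.in_transit, "job_state": self.job_state}
        )

    @classmethod
    def resume(cls, blob: bytes) -> "RelayAgent":
        """Rebuild an agent from the latest snapshot after a crash."""
        saved = pickle.loads(blob)
        agent = cls()
        agent.in_transit = saved["in_transit"]
        agent.job_state = saved["job_state"]
        return agent
```

Anything received after the last snapshot is absent from the resumed agent, so a real system would also need the sender to retransmit such messages; that part is outside this sketch.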
Keywords
Parallel file distribution, Fault tolerance, Grid middleware, Job deployment, Mobile agents