Distributed Data Mining using a Public Resource Computing Framework
The public resource computing paradigm is often used as a successful, low-cost mechanism for managing several classes of scientific and commercial applications that require the execution of a large number of independent tasks. Public computing frameworks, also known as “Desktop Grids”, exploit the computational power and storage facilities of private computers, or “workers”. Despite the inherently decentralized nature of the applications they are designed for, these systems often adopt a centralized mechanism for assigning jobs and distributing input data, as is the case for BOINC, the most popular framework in this realm. We present a decentralized framework that aims to increase the flexibility and robustness of public computing applications through two basic features: (i) a P2P protocol that dynamically matches job specifications with worker characteristics without relying on centralized resources; (ii) distributed cache servers for efficient dissemination and reuse of data files. The framework can be applied to a wide set of applications. In this work, we describe how a Java prototype of the framework was used to tackle the problem of mining frequent itemsets from a transactional dataset, and we report preliminary performance results that indicate the efficiency improvements that can derive from the presented architecture.
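To make the notion of "a large number of independent tasks" concrete for frequent itemset mining, the following is a minimal sketch, not the algorithm or the Java prototype described in this work. It assumes the transactional dataset is split into partitions, each partition is counted by one worker as an independent job, and the partial counts are merged afterwards. The function names `count_itemsets` and `merge_frequent`, the toy data, and the brute-force enumeration of itemsets are all illustrative assumptions.

```python
from collections import Counter
from itertools import combinations


def count_itemsets(transactions, max_size):
    """Hypothetical worker job: count every itemset of up to max_size
    items that occurs in this worker's partition of the dataset."""
    counts = Counter()
    for transaction in transactions:
        # Sort and de-duplicate so each itemset has one canonical tuple form.
        items = sorted(set(transaction))
        for size in range(1, max_size + 1):
            for itemset in combinations(items, size):
                counts[itemset] += 1
    return counts


def merge_frequent(partial_counts, min_support):
    """Hypothetical merge step: sum the per-partition counts and keep the
    itemsets whose global support reaches min_support."""
    total = Counter()
    for partial in partial_counts:
        total.update(partial)
    return {itemset: n for itemset, n in total.items() if n >= min_support}


# Toy dataset split into two partitions, one per (simulated) worker.
partitions = [
    [["bread", "milk"], ["bread", "beer", "eggs"]],
    [["milk", "beer", "bread"], ["bread", "milk", "beer"]],
]

# Each job depends only on its own partition, so jobs can run in any order.
partials = [count_itemsets(p, max_size=2) for p in partitions]
frequent = merge_frequent(partials, min_support=3)
```

Because each job reads only its own partition, the jobs can be assigned to any available worker. Enumerating every itemset up to a fixed size is only practical for toy inputs; a real miner would prune candidates.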
Keywords: Execution Time, Frequent Itemsets, Data Cachers, Cache Strategy, Desktop Grid