IC3 2011: Contemporary Computing, pp 347-358
Data Mining on Grids
Abstract
Data mining algorithms are widely used to analyze large corporate and scientific datasets stored in databases and data archives. Industry, science, and commerce often need to analyze very large datasets maintained across geographically distributed sites, using the computational power of distributed and parallel systems. Grid computing has emerged as an important field of distributed computing that can support such distributed knowledge discovery applications. In this paper, we propose a method for performing data mining on Grids. The Grid was set up using Foster and Kesselman’s Globus Toolkit, the most widely used middleware in scientific and data-intensive Grid applications. Data mining applications on the Grid were developed with Weka4WS, an open source framework that extends the Weka toolkit for distributed data mining on Grids and deploys many of the machine learning algorithms provided by Weka. To evaluate the efficiency of the proposed system, we analyze the performance of Weka4WS when executing distributed data mining tasks, namely clustering and classification, in a Grid scenario. Finally, we study the speedup obtained by performing data mining on Grids.
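The speedup study mentioned in the abstract is not detailed on this page. As a rough illustration only, the sketch below computes the conventional speedup (single-node time divided by Grid time) and parallel efficiency (speedup divided by node count). The function names and all timing values are hypothetical and are not results from the paper.

```python
def speedup(t_single: float, t_grid: float) -> float:
    """Conventional speedup: single-node run time divided by Grid run time."""
    if t_single <= 0 or t_grid <= 0:
        raise ValueError("execution times must be positive")
    return t_single / t_grid


def efficiency(t_single: float, t_grid: float, nodes: int) -> float:
    """Parallel efficiency: speedup divided by the number of Grid nodes used."""
    if nodes < 1:
        raise ValueError("nodes must be at least 1")
    return speedup(t_single, t_grid) / nodes


if __name__ == "__main__":
    # Hypothetical execution times in seconds, keyed by node count.
    # These are placeholders for illustration, not measurements from the paper.
    timings = {1: 1200.0, 2: 640.0, 4: 350.0}
    t1 = timings[1]
    for nodes, t in sorted(timings.items()):
        s = speedup(t1, t)
        e = efficiency(t1, t, nodes)
        print(f"nodes={nodes} speedup={s:.2f} efficiency={e:.2f}")
```

An efficiency close to 1 would mean that adding nodes pays off almost linearly. On a real Grid, data transfer and service invocation overhead typically keep it below that.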
Keywords
Weka4WS; Distributed Data Mining; Grid Computing; Classification; Clustering; Grid