IC3 2011: Contemporary Computing, pp. 347-358

Data Mining on Grids

  • Shampa Chakraverty
  • Ankuj Gupta
  • Akhil Goyal
  • Ashish Singal
Part of the Communications in Computer and Information Science book series (CCIS, volume 168)

Abstract

Data mining algorithms are widely used for the analysis of large corporate and scientific datasets stored in databases and data archives. Industry, science, and commerce often need to analyze very large datasets maintained across geographically distributed sites, using the computational power of distributed and parallel systems. Grid computing has emerged as an important field of distributed computing that can support distributed knowledge discovery applications. In this paper, we propose a method to perform data mining on grids. The grid is set up using Foster and Kesselman’s Globus Toolkit, which is the most widely used middleware in scientific and data-intensive grid applications. To develop data mining applications on the grid, we use Weka4WS, an open source framework that extends the Weka toolkit for distributed data mining on grids and deploys many of the machine learning algorithms provided by Weka. To evaluate the efficiency of the proposed system, we analyze the performance of Weka4WS by executing distributed data mining tasks, namely clustering and classification, in a grid scenario. Finally, we study the speedup obtained by performing data mining on grids.
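The speedup study mentioned at the end of the abstract can be illustrated with the conventional definitions: speedup as the ratio of single-node execution time to grid execution time, and efficiency as speedup divided by the number of nodes. The abstract does not state which definitions the authors use, so the sketch below is an assumption, and the function names and timings are hypothetical placeholders rather than results from the paper.

```python
def speedup(t_single: float, t_grid: float) -> float:
    """Ratio of single-node time to grid time for the same task."""
    if t_single <= 0 or t_grid <= 0:
        raise ValueError("execution times must be positive")
    return t_single / t_grid


def efficiency(t_single: float, t_grid: float, nodes: int) -> float:
    """Speedup per node; 1.0 would correspond to ideal linear scaling."""
    if nodes < 1:
        raise ValueError("nodes must be at least 1")
    return speedup(t_single, t_grid) / nodes


# Placeholder timings in seconds, invented for illustration only.
# They are NOT measurements reported in the paper.
example_times = {1: 120.0, 2: 66.0, 4: 40.0}

baseline = example_times[1]
for nodes, elapsed in example_times.items():
    s = speedup(baseline, elapsed)
    e = efficiency(baseline, elapsed, nodes)
    print(f"nodes={nodes} speedup={s:.2f} efficiency={e:.2f}")
```

Efficiency below 1.0 in such a table would typically reflect overheads such as data transfer and job submission, which is why both quantities are usually reported together.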

Keywords

Weka4WS, Distributed Data Mining, Grid Computing, Classification, Clustering, Grid

Copyright information

© Springer-Verlag Berlin Heidelberg 2011

Authors and Affiliations

  • Shampa Chakraverty (1)
  • Ankuj Gupta (1)
  • Akhil Goyal (1)
  • Ashish Singal (1)
  1. Department of Computer Engineering, Netaji Subhas Institute of Technology, New Delhi, India
