Improving Distributed Data Mining Techniques by Means of a Grid Infrastructure

  • Conference paper
On the Move to Meaningful Internet Systems 2004: OTM 2004 Workshops (OTM 2004)

Abstract

Data mining is now one of the most important topics in scientific and business problem solving. A huge amount of data is available that could help to solve many of these problems. However, this data is geographically distributed across various locations and belongs to several organizations. Furthermore, it is stored in different kinds of systems and represented in many formats. In this paper, we study different techniques for easing the data mining process in a distributed environment. Our approach proposes the use of grid technology to improve the data mining process, taking advantage of the features of this kind of system. In addition, we present a flexible architecture that allows data mining applications to be dynamically configured according to their needs. This architecture is made up of generic, data grid, and specific data mining grid services.

Copyright information

© 2004 Springer-Verlag Berlin Heidelberg

About this paper

Cite this paper

Sánchez, A., Peña, J.M., Pérez, M.S., Robles, V., Herrero, P. (2004). Improving Distributed Data Mining Techniques by Means of a Grid Infrastructure. In: Meersman, R., Tari, Z., Corsaro, A. (eds) On the Move to Meaningful Internet Systems 2004: OTM 2004 Workshops. OTM 2004. Lecture Notes in Computer Science, vol 3292. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-540-30470-8_29

  • DOI: https://doi.org/10.1007/978-3-540-30470-8_29

  • Publisher Name: Springer, Berlin, Heidelberg

  • Print ISBN: 978-3-540-23664-1

  • Online ISBN: 978-3-540-30470-8

  • eBook Packages: Springer Book Archive
