Improving Distributed Data Mining Techniques by Means of a Grid Infrastructure

Sánchez, Alberto; Peña, José M.; Pérez, María S.; Robles, Víctor; Herrero, Pilar

doi:10.1007/978-3-540-30470-8_29

Part of the book series: Lecture Notes in Computer Science (LNCS, volume 3292)

Included in the following conference series:

OTM Confederated International Conferences "On the Move to Meaningful Internet Systems"

Abstract

Data mining is now one of the most important topics in scientific and business problem solving. A huge amount of data is available that can help to solve many of these problems. However, this data is geographically distributed across various locations and belongs to several organizations. Furthermore, it is stored in different kinds of systems and represented in many formats. In this paper, we study different techniques for making the data mining process easier in a distributed environment. Our approach proposes the use of a grid to improve the data mining process, taking advantage of the features of this kind of system. In addition, we present a flexible architecture that allows data mining applications to be dynamically configured according to their needs. This architecture is made up of generic grid services, data grid services, and specific data mining grid services.
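
The layered, dynamically configurable architecture described above might be pictured roughly as follows. This is only an illustrative sketch and not the authors' implementation: the `Layer` enum, the `ServiceRegistry` class, and every service name (`scheduling`, `replica_access`, `association_rules`, `clustering`) are hypothetical names introduced here. The abstract only states that the architecture has three kinds of services and that applications are configured according to their needs.

```python
from enum import Enum
from typing import Dict, List


class Layer(Enum):
    """The three kinds of services named in the abstract."""
    GENERIC = "generic grid services"
    DATA_GRID = "data grid services"
    DATA_MINING = "specific data mining grid services"


class ServiceRegistry:
    """Hypothetical registry that records which layer each service belongs to."""

    def __init__(self) -> None:
        self._layer_of: Dict[str, Layer] = {}

    def register(self, name: str, layer: Layer) -> None:
        self._layer_of[name] = layer

    def configure(self, needs: List[str]) -> Dict[Layer, List[str]]:
        """Select only the services an application asks for, grouped by layer."""
        plan: Dict[Layer, List[str]] = {layer: [] for layer in Layer}
        for name in needs:
            if name not in self._layer_of:
                raise KeyError(f"no service registered under {name!r}")
            plan[self._layer_of[name]].append(name)
        return plan


def build_example_registry() -> ServiceRegistry:
    """Populate a registry with made-up services, one or two per layer."""
    registry = ServiceRegistry()
    registry.register("scheduling", Layer.GENERIC)
    registry.register("replica_access", Layer.DATA_GRID)
    registry.register("association_rules", Layer.DATA_MINING)
    registry.register("clustering", Layer.DATA_MINING)
    return registry


if __name__ == "__main__":
    # An application that needs clustering but not association rules.
    plan = build_example_registry().configure(
        ["scheduling", "replica_access", "clustering"]
    )
    for layer, names in plan.items():
        print(f"{layer.value}: {names}")
```

In this sketch, "dynamic configuration" is reduced to selecting a subset of registered services at run time. A real grid deployment would also involve service discovery, security, and resource scheduling, none of which the abstract specifies.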

Author information

Authors and Affiliations

Facultad de Informática, Universidad Politécnica de Madrid, Madrid, Spain
Alberto Sánchez, José M. Peña, María S. Pérez, Víctor Robles & Pilar Herrero

Editor information

Editors and Affiliations

Vrije Universiteit Brussel (VUB), STARLab, Bldg G/10, Pleinlaan 2, 1050, Brussels, Belgium
Robert Meersman
School of Computer Science and Information Technology, RMIT University, Bld 10.10, 376-392 Swanston Street, 3001, Melbourne, VIC, Australia
Zahir Tari
PrismTech, 4, Rue Angiboust, 91460, Marcoussis, France
Angelo Corsaro

About this paper

Cite this paper

Sánchez, A., Peña, J.M., Pérez, M.S., Robles, V., Herrero, P. (2004). Improving Distributed Data Mining Techniques by Means of a Grid Infrastructure. In: Meersman, R., Tari, Z., Corsaro, A. (eds) On the Move to Meaningful Internet Systems 2004: OTM 2004 Workshops. OTM 2004. Lecture Notes in Computer Science, vol 3292. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-540-30470-8_29

DOI: https://doi.org/10.1007/978-3-540-30470-8_29
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-23664-1
Online ISBN: 978-3-540-30470-8
eBook Packages: Springer Book Archive
