Abstract
When data mining first appeared, several disciplines related to data analysis, such as statistics and artificial intelligence, were combined into a new topic: extracting significant patterns from data. The original data sources were small datasets, so traditional machine learning techniques were the most common tools for these tasks. As the volume of data grew, these traditional methods were revisited and extended with knowledge from experts in the field of data management and databases. Today's problems are larger still and, once again, a new discipline allows researchers to scale up to these data: distributed and parallel processing. To use parallel processing techniques, specific factors about the mining algorithms and the data must be considered. There are now several new parallel algorithms, most of which are extensions of a traditional centralized algorithm. Many of these algorithms share common core parts and differ only in their distribution schema, parallel coordination, or load/task balancing methods. We call these groups algorithm families. In this paper we introduce a methodology for implementing algorithm families. This methodology is founded on the MOIRAE distributed control architecture. We show how this architecture allows researchers to design parallel processing components that can dynamically change their behavior according to control policies.
This research project is funded under the Universidad Politécnica de Madrid grant program.
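The idea of an algorithm family in the abstract (a shared core with interchangeable distribution schemas, selected by a control policy) can be illustrated with a minimal Python sketch. This is not the MOIRAE architecture or its API: every name here (`FamilyMember`, `block_partition`, `round_robin_partition`, `imbalance`, the `max_imbalance` threshold) is hypothetical, the core step is a simple item count chosen only for illustration, and threads stand in for distributed workers.

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, List, Sequence

Transaction = Sequence[str]
Partitioner = Callable[[List[Transaction], int], List[List[Transaction]]]


def block_partition(data: List[Transaction], workers: int) -> List[List[Transaction]]:
    """Contiguous blocks: worker i gets the i-th slice of the data."""
    size = -(-len(data) // workers)  # ceiling division
    return [data[i * size:(i + 1) * size] for i in range(workers)]


def round_robin_partition(data: List[Transaction], workers: int) -> List[List[Transaction]]:
    """Interleaved assignment: transaction j goes to worker j % workers."""
    return [data[i::workers] for i in range(workers)]


def count_items(partition: List[Transaction]) -> Counter:
    """Shared core step: count, per item, the transactions that contain it."""
    counts: Counter = Counter()
    for transaction in partition:
        counts.update(set(transaction))
    return counts


def imbalance(partitions: List[List[Transaction]]) -> float:
    """Ratio between the heaviest partition workload and the average one."""
    loads = [sum(len(t) for t in p) for p in partitions]
    mean = sum(loads) / len(loads)
    return max(loads) / mean if mean else 1.0


class FamilyMember:
    """One member of the family: the shared core plus a swappable schema."""

    def __init__(self, partitioner: Partitioner, workers: int = 2,
                 max_imbalance: float = 1.25) -> None:
        self.partitioner = partitioner
        self.workers = workers
        self.max_imbalance = max_imbalance  # assumed threshold, illustrative only

    def run(self, data: List[Transaction]) -> Counter:
        partitions = self.partitioner(data, self.workers)
        # Hypothetical control policy: switch the distribution schema when
        # the current one leaves the workers too unevenly loaded.
        if imbalance(partitions) > self.max_imbalance:
            self.partitioner = round_robin_partition
            partitions = self.partitioner(data, self.workers)
        with ThreadPoolExecutor(max_workers=self.workers) as pool:
            local_counts = list(pool.map(count_items, partitions))
        # Merge the local results into the global item counts.
        return sum(local_counts, Counter())


if __name__ == "__main__":
    data = [["a", "b", "c", "d"], ["a", "b", "c", "d"], ["a"], ["b"]]
    member = FamilyMember(block_partition, workers=2)
    print(member.run(data))
    print(member.partitioner.__name__)
```

In this sketch the core (`count_items` and the merge) never changes; only the partitioner does, which is the sense in which two configurations belong to the same family. A real deployment would replace the thread pool with distributed workers and the threshold test with whatever control policies the architecture provides.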
Copyright information
© 2002 Springer-Verlag Berlin Heidelberg
About this paper
Cite this paper
Peña, J.M., Javier Crespo, F., Menasalvas, E., Robles, V. (2002). Parallel Data Mining Experimentation Using Flexible Configurations. In: Alpigini, J.J., Peters, J.F., Skowron, A., Zhong, N. (eds) Rough Sets and Current Trends in Computing. RSCTC 2002. Lecture Notes in Computer Science, vol 2475. Springer, Berlin, Heidelberg. https://doi.org/10.1007/3-540-45813-1_58
DOI: https://doi.org/10.1007/3-540-45813-1_58
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-44274-5
Online ISBN: 978-3-540-45813-5
eBook Packages: Springer Book Archive