Abstract
The current generation of data mining tools has limited capacity and performance because these tools tend to be sequential. This paper explores a migration path out of this bottleneck by considering an integrated hardware and software approach to parallelizing data mining. Our analysis shows that parallel data mining solutions require the following components: parallel data mining algorithms, parallel and distributed databases, parallel file systems, parallel I/O, tertiary storage, management of online data, support for heterogeneous data representations, security, quality of service, and pricing metrics. State-of-the-art technology in these areas is surveyed with an eye towards an integration strategy leading to a complete solution.
© 2000 Springer-Verlag Berlin Heidelberg
About this paper
Cite this paper
Maniatty, W.A., Zaki, M.J. (2000). A Requirements Analysis for Parallel KDD Systems. In: Rolim, J. (eds) Parallel and Distributed Processing. IPDPS 2000. Lecture Notes in Computer Science, vol 1800. Springer, Berlin, Heidelberg. https://doi.org/10.1007/3-540-45591-4_47
DOI: https://doi.org/10.1007/3-540-45591-4_47
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-67442-9
Online ISBN: 978-3-540-45591-2
eBook Packages: Springer Book Archive