A Requirements Analysis for Parallel KDD Systems

  • William A. Maniatty
  • Mohammed J. Zaki
Conference paper
Part of the Lecture Notes in Computer Science book series (LNCS, volume 1800)

Abstract

The current generation of data mining tools has limited capacity and performance, since these tools tend to be sequential. This paper explores a migration path out of this bottleneck by considering an integrated hardware and software approach to parallelizing data mining. Our analysis shows that parallel data mining solutions require the following components: parallel data mining algorithms, parallel and distributed databases, parallel file systems, parallel I/O, tertiary storage, management of online data, support for heterogeneous data representations, security, quality of service, and pricing metrics. State-of-the-art technology in these areas is surveyed with an eye towards an integration strategy leading to a complete solution.

Copyright information

© Springer-Verlag Berlin Heidelberg 2000

Authors and Affiliations

  • William A. Maniatty (1)
  • Mohammed J. Zaki (2)
  1. Computer Science Dept., University at Albany, Albany
  2. Computer Science Dept., Rensselaer Polytechnic Institute, Troy
