
High Performance Data Mining

  • Vipin Kumar
  • Mahesh V. Joshi
  • Eui-Hong (Sam) Han
  • Pang-Ning Tan
  • Michael Steinbach
Conference paper
Part of the Lecture Notes in Computer Science book series (LNCS, volume 2565)

Abstract

Recent years have seen explosive growth in the availability of many kinds of data, creating an unprecedented opportunity to develop automated, data-driven techniques for extracting useful knowledge. Data mining, an important step in this process of knowledge discovery, consists of methods that discover interesting, non-trivial, and useful patterns hidden in the data [SAD+93, CHY96]. The field builds upon ideas from diverse areas such as machine learning, pattern recognition, statistics, database systems, and data visualization. However, techniques developed in these traditional disciplines are often unsuitable because of unique characteristics of today’s data sets, such as their enormous size, high dimensionality, and heterogeneity. Effective parallel algorithms are therefore needed for various data mining techniques. Designing such algorithms is challenging, and the main focus of this paper is a description of parallel formulations of two important data mining algorithms: discovery of association rules and induction of decision trees for classification. We also briefly discuss an application of data mining to the analysis of large data sets collected by Earth-observing satellites, which must be processed to better understand global-scale changes in biosphere processes and patterns.
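To give a rough sense of what a parallel formulation of association rule discovery can involve, the following minimal sketch partitions the transactions, counts candidate itemsets independently on each partition, and then sums the local counts before applying a support threshold. This is an illustrative assumption in the spirit of count-based data partitioning, not the formulation described in the paper. The function names, the toy transactions, and the threshold are all hypothetical, and the loop over partitions merely stands in for separate processors.

```python
from collections import Counter
from itertools import combinations


def local_counts(transactions, k):
    """Count every k-item subset that occurs in one partition of the data."""
    counts = Counter()
    for transaction in transactions:
        # Sort so that the same itemset always maps to the same tuple key.
        for itemset in combinations(sorted(set(transaction)), k):
            counts[itemset] += 1
    return counts


def frequent_itemsets(partitions, k, min_support):
    """Sum the per-partition counts and keep itemsets meeting min_support."""
    total = Counter()
    for part in partitions:  # each iteration stands in for one processor
        total.update(local_counts(part, k))  # plays the role of a global reduction
    return {s: c for s, c in total.items() if c >= min_support}


# Hypothetical toy data: four transactions split across two partitions.
partitions = [
    [["bread", "milk"], ["bread", "diaper", "beer"]],
    [["milk", "diaper", "beer"], ["bread", "milk", "diaper"]],
]
result = frequent_itemsets(partitions, k=2, min_support=2)
```

Because the local counts are simply added, the result does not depend on how the transactions are split across partitions; only the cost of exchanging the counts does.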

Keywords

Data Mining, Association Rule, Parallel Algorithm, Hash Table, Parallel Formulation

References

  1. R. Agrawal, T. Imielinski, and A. Swami. Database mining: A performance perspective. IEEE Transactions on Knowledge and Data Eng., 5(6):914–925, December 1993.
  2. R. Agrawal, T. Imielinski, and A. Swami. Mining association rules between sets of items in large databases. In Proc. of 1993 ACM-SIGMOD Int. Conf. on Management of Data, Washington, D.C., 1993.
  3. R. Agrawal and J. C. Shafer. Parallel mining of association rules. IEEE Transactions on Knowledge and Data Eng., 8(6):962–969, December 1996.
  4. R. Agrawal and R. Srikant. Fast algorithms for mining association rules. In Proc. of the 20th VLDB Conference, pages 487–499, Santiago, Chile, 1994.
  5. J. Chattratichat, J. Darlington, M. Ghanem, Y. Guo, H. Huning, M. Kohler, J. Sutiwaraphun, H. W. To, and D. Yang. Large scale data mining: Challenges and responses. In Proc. of the Third Int’l Conference on Knowledge Discovery and Data Mining, 1997.
  6. M. S. Chen, J. Han, and P. S. Yu. Data mining: An overview from database perspective. IEEE Transactions on Knowledge and Data Eng., 8(6):866–883, December 1996.
  7. D. Michie, D. J. Spiegelhalter, and C. C. Taylor. Machine Learning, Neural and Statistical Classification. Ellis Horwood, 1994.
  8. S. Goil, S. Aluru, and S. Ranka. Concatenated parallelism: A technique for efficient parallel divide and conquer. In Proc. of the Symposium of Parallel and Distributed Computing (SPDP’96), 1996.
  9. D. E. Goldberg. Genetic Algorithms in Search, Optimization, and Machine Learning. Morgan Kaufmann, 1989.
  10. R. Grossman, C. Kamath, P. Kegelmeyer, V. Kumar, and R. Namburu. Data Mining for Scientific and Engineering Applications. Kluwer Academic Publishers, 2001.
  11. E.-H. Han, G. Karypis, and V. Kumar. Scalable parallel data mining for association rules. In Proc. of 1997 ACM-SIGMOD Int. Conf. on Management of Data, Tucson, Arizona, 1997.
  12. E.-H. Han, G. Karypis, and V. Kumar. Scalable parallel data mining for association rules. IEEE Transactions on Knowledge and Data Eng., 12(3), May/June 2000.
  13. J. Han and M. Kamber. Data Mining: Concepts and Techniques. Morgan Kaufmann, 2000.
  14. D. Hand, H. Mannila, and P. Smyth. Principles of Data Mining. MIT Press, 2001.
  15. M. V. Joshi, E.-H. Han, G. Karypis, and V. Kumar. Efficient parallel algorithms for mining associations. In M. J. Zaki and C.-T. Ho, editors, Lecture Notes in Computer Science: Lecture Notes in Artificial Intelligence (LNCS/LNAI), volume 1759. Springer-Verlag, 2000.
  16. M. V. Joshi, G. Karypis, and V. Kumar. ScalParC: A new scalable and efficient parallel classification algorithm for mining large datasets. In Proc. of the International Parallel Processing Symposium, 1998.
  17. M. V. Joshi, G. Karypis, and V. Kumar. Universal formulation of sequential patterns. Technical Report TR 99-021, Department of Computer Science, University of Minnesota, Minneapolis, 1999.
  18. R. Kufrin. Decision trees on parallel processors. In J. Geller, H. Kitano, and C. B. Suttner, editors, Parallel Processing for Artificial Intelligence 3. Elsevier Science, 1997.
  19. R. Lippmann. An introduction to computing with neural nets. IEEE ASSP Magazine, 4(22), April 1987.
  20. M. Mehta, R. Agrawal, and J. Rissanen. SLIQ: A fast scalable classifier for data mining. In Proc. of the Fifth Int’l Conference on Extending Database Technology, Avignon, France, 1996.
  21. R. A. Pearson. A coarse grained parallel induction heuristic. In H. Kitano, V. Kumar, and C. B. Suttner, editors, Parallel Processing for Artificial Intelligence 2, pages 207–226. Elsevier Science, 1994.
  22. J. Ross Quinlan. C4.5: Programs for Machine Learning. Morgan Kaufmann, San Mateo, CA, 1993.
  23. J. Shafer, R. Agrawal, and M. Mehta. SPRINT: A scalable parallel classifier for data mining. In Proc. of the 22nd VLDB Conference, 1996.
  24. A. Srivastava, E.-H. Han, V. Kumar, and V. Singh. Parallel formulations of decision-tree classification algorithms. Data Mining and Knowledge Discovery: An International Journal, 3(3):237–261, September 1999.
  25. M. Steinbach, P. Tan, V. Kumar, S. Klooster, and C. Potter. Temporal data mining for the discovery and analysis of ocean climate indices. In KDD Workshop on Temporal Data Mining (KDD’2002), Edmonton, Alberta, Canada, 2001.
  26. M. Stonebraker, R. Agrawal, U. Dayal, E. J. Neuhold, and A. Reuter. DBMS research at a crossroads: The Vienna update. In Proc. of the 19th VLDB Conference, pages 688–692, Dublin, Ireland, 1993.
  27. P. Tan, M. Steinbach, V. Kumar, S. Klooster, C. Potter, and A. Torregrosa. Finding spatio-temporal patterns in earth science data. In KDD Workshop on Temporal Data Mining (KDD’2001), San Francisco, California, 2001.
  28. M. J. Zaki. Parallel and distributed association mining: A survey. IEEE Concurrency (Special Issue on Data Mining), December 1999.

Copyright information

© Springer-Verlag Berlin Heidelberg 2003

Authors and Affiliations

  • Vipin Kumar
    • 1
  • Mahesh V. Joshi
    • 1
  • Eui-Hong (Sam) Han
    • 1
  • Pang-Ning Tan
    • 1
  • Michael Steinbach
    • 1
  1. University of Minnesota, 4-192 EE/CSci Building, Minneapolis, USA
