Parallel and Grid-Based Data Mining – Algorithms, Models and Systems for High-Performance KDD

Chapter

Summary

Data Mining is often a compute-intensive and time-consuming process. For this reason, several Data Mining systems have been implemented on parallel computing platforms to achieve high performance in the analysis of large data sets. Moreover, when large data repositories are coupled with the geographical distribution of data, users, and systems, more sophisticated technologies are needed to implement high-performance distributed KDD systems. Since computational Grids have emerged as privileged platforms for distributed computing, a growing number of Grid-based KDD systems have been proposed. In this chapter we first discuss different ways to exploit parallelism in the main Data Mining techniques and algorithms, and then we discuss Grid-based KDD systems. Finally, we introduce the Knowledge Grid, an environment that uses standard Grid middleware to support the development of parallel and distributed knowledge discovery applications.

Key words

Parallel Data Mining; Grid-based Data Mining; Knowledge Grid; Distributed Knowledge Discovery

Copyright information

© Springer Science+Business Media, LLC 2009

Authors and Affiliations

  • Antonio Congiusta (1)
  • Domenico Talia (1)
  • Paolo Trunfio (1)
  1. DEIS – University of Calabria, Cosenza, Italy
