Data Mining Paradigms

  • T. Ravindra Babu
  • M. Narasimha Murty
  • S. V. Subrahmanya
Part of the Advances in Computer Vision and Pattern Recognition book series (ACVPR)

Abstract

Algorithms that mine large datasets for novel patterns face a number of issues, and this chapter focuses on those related to efficiency. We describe the data mining tasks most relevant to the book: clustering, classification, and association rule mining. For clustering, we discuss popular and representative partitional and hierarchical algorithms. For classification, we discuss the nearest-neighbor classifier and the support vector machine, both of which are used extensively in the book. We then examine the issues that arise in mining large datasets and discuss each possible direction for addressing them in detail. The discussion on clustering covers incremental clustering (with a focus on the leader and BIRCH algorithms), divide-and-conquer clustering algorithms, and clustering based on intermediate representation. The discussion on classification covers incremental classification and classification based on intermediate abstraction. For frequent-itemset mining, we discuss two directions: divide-and-conquer itemset mining and intermediate abstraction. The bibliographic notes briefly discuss the significant research contributions in each of these directions and point to literature for further study.

Keywords

  • Association Rule
  • Main Memory
  • Test Pattern
  • Frequent Itemsets
  • Association Rule Mining

These keywords were added by machine and not by the authors. This process is experimental and the keywords may be updated as the learning algorithm improves.

Copyright information

© Springer-Verlag London 2013

Authors and Affiliations

  • T. Ravindra Babu (1)
  • M. Narasimha Murty (2)
  • S. V. Subrahmanya (1)

  1. Infosys Technologies Ltd., Bangalore, India
  2. Indian Institute of Science, Bangalore, India
