Polar Classification of Nominal Data

  • Guy Wolf
  • Shachar Harussi
  • Yaniv Shmueli
  • Amir Averbuch
Part of the Computational Methods in Applied Sciences book series (COMPUTMETHODS, volume 27)

Abstract

Many modern systems record various types of parameter values. Numerical values are relatively convenient for data analysis tools because there are many methods to measure distances and similarities between them. Applying dimensionality reduction techniques to data sets with such values is also a well-known practice. Nominal (i.e., categorical) values, on the other hand, pose problems for current methods. Above all, there is no meaningful distance between possible nominal values, which are either equal or unequal to each other. Since many dimensionality reduction methods rely on preserving some form of similarity or distance measure, their application to such data sets is not straightforward. We propose a method to cluster such data sets by applying the diffusion maps methodology to them. Our method is based on a distance metric that exploits the effect that the Boolean nature of similarities between nominal values (i.e., equal or unequal) has on the diffusion kernel and, in turn, on the embedded space resulting from its principal components. We use a multi-view approach, analyzing small sets of closely related parameters one at a time instead of the whole data set. In this way, we achieve a comprehensive understanding of the data set from many points of view.
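The general idea of feeding an equal/unequal comparison into a diffusion kernel can be sketched as follows. This is only a minimal illustration and not the chapter's actual metric: it assumes a simple matching (Hamming) dissimilarity, a Gaussian-type kernel with a hypothetical scale parameter eps, and a single view containing all columns. The function name nominal_diffusion_map is likewise an assumption made for this example.

```python
import numpy as np


def nominal_diffusion_map(X, eps=0.5, n_components=2):
    """Embed rows of a nominal data matrix X (n_samples x n_features).

    Illustrative sketch: uses the fraction of mismatching attributes as
    the distance, which is an assumption and not the chapter's metric.
    """
    X = np.asarray(X)
    # Encode each column as integer codes so any hashable labels work.
    codes = np.stack(
        [np.unique(col, return_inverse=True)[1] for col in X.T], axis=1
    )
    # Boolean comparison: attributes are either equal or unequal.
    mismatch = (codes[:, None, :] != codes[None, :, :]).mean(axis=2)
    # Kernel on the mismatch fraction (eps is a hypothetical scale).
    K = np.exp(-mismatch / eps)
    # Symmetric normalization; it shares eigenvalues with the
    # row-stochastic Markov matrix P = D^{-1} K.
    d = K.sum(axis=1)
    A = K / np.sqrt(np.outer(d, d))
    vals, vecs = np.linalg.eigh(A)
    order = np.argsort(vals)[::-1]
    vals, vecs = vals[order], vecs[:, order]
    # Convert to right eigenvectors of P and drop the trivial first one.
    psi = vecs / np.sqrt(d)[:, None]
    return psi[:, 1:n_components + 1] * vals[1:n_components + 1]


if __name__ == "__main__":
    data = [["a", "x"], ["a", "x"], ["b", "y"], ["b", "y"]]
    print(nominal_diffusion_map(data, n_components=1))
```

In this toy input, rows with identical attribute values receive identical coordinates, and the two groups fall on opposite sides of the origin along the first diffusion coordinate. A multi-view variant would, under the same assumptions, call the function on small subsets of columns rather than on the full matrix.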

Keywords

Clustering, Unsupervised learning, Diffusion maps, Nominal data

Copyright information

© Springer Science+Business Media Dordrecht 2013

Authors and Affiliations

  • Guy Wolf (1, 2)
  • Shachar Harussi (1, 2)
  • Yaniv Shmueli (1)
  • Amir Averbuch (1, 2)
  1. School of Computer Science, Tel Aviv University, Tel Aviv, Israel
  2. Department of Mathematical Information Technology, University of Jyväskylä, Jyväskylä, Finland
