PROFIT: A Projected Clustering Technique

  • Dharmveer Singh Rajput
  • Pramod Kumar Singh
  • Mahua Bhattacharya
Chapter
Part of the Annals of Information Systems book series (AOIS, volume 17)

Abstract

Clustering high dimensional datasets is a major area of research because of its widespread applications in many domains. However, meaningful clustering of high dimensional data is challenging because (i) such data usually contain many irrelevant dimensions that hide the clusters, (ii) distance, the most common similarity measure, loses its meaning in high dimensions, and (iii) different clusters may exist in different subsets of dimensions. Feature selection based clustering methods address the problem of clustering high dimensional data; however, searching for all the clusters in a single subset of a few selected relevant dimensions is not justified, as different clusters may exist in different subsets of dimensions. In this article, we propose PROFIT (PROjective clustering algorithm based on FIsher score and Trimmed mean), an algorithm which extends the idea of feature selection based clustering to projective clustering and works well on high dimensional datasets whose attributes lie in the continuous domain. It works in four phases: a sampling phase, an initialization phase, a dimension selection phase, and a refinement phase. We experiment on five real datasets with different input parameters and compare PROFIT with three well-known top-down subspace clustering methods, PROCLUS, ORCLUS, and PCKA, as well as with our feature selection based non-subspace clustering method FAMCA. The results are evaluated with two well-known subspace clustering quality measures, the Jagota index and the sum of squared error, and Student's t-test is applied to determine whether the differences between clustering results are significant. The results and quality measures show the effectiveness and superiority of the proposed method PROFIT over its competitors.
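The chapter's full algorithm is not reproduced on this page, but the two ingredients named in the title, Fisher-score ranking of dimensions and trimmed-mean centers, are standard and easy to illustrate, as is the sum-of-squared-error quality measure. Below is a minimal Python sketch assuming the usual Fisher-score formulation and a symmetric trimmed mean; the helper names, the trimming proportion, and the top-m selection rule are illustrative assumptions, not the chapter's exact per-cluster procedure.

```python
# Illustrative sketch only: standard Fisher score, symmetric trimmed mean,
# and SSE. PROFIT's exact per-cluster dimension-selection criteria are
# described in the chapter, not here.
import numpy as np

def trimmed_mean(x, proportion=0.1):
    """Symmetric trimmed mean: drop the lowest and highest fraction of values."""
    x = np.sort(np.asarray(x, dtype=float))
    k = int(len(x) * proportion)
    return (x[k:len(x) - k] if len(x) > 2 * k else x).mean()

def fisher_scores(X, labels):
    """Fisher score per dimension j:
    F(j) = sum_k n_k (mu_kj - mu_j)^2 / sum_k n_k s_kj^2.
    High scores mark dimensions along which the clusters separate well."""
    mu = X.mean(axis=0)
    num = np.zeros(X.shape[1])
    den = np.zeros(X.shape[1])
    for k in np.unique(labels):
        Xk = X[labels == k]
        num += len(Xk) * (Xk.mean(axis=0) - mu) ** 2
        den += len(Xk) * Xk.var(axis=0)
    return num / np.maximum(den, 1e-12)

def robust_centers(X, labels):
    """Cluster centers as per-dimension trimmed means (robust to outliers)."""
    return np.array([[trimmed_mean(X[labels == k][:, j])
                      for j in range(X.shape[1])]
                     for k in np.unique(labels)])

def sse(X, labels, centers):
    """Sum of squared errors of points to their cluster centers."""
    return sum(((X[labels == k] - centers[i]) ** 2).sum()
               for i, k in enumerate(np.unique(labels)))

# Usage: rank dimensions for a tentative 2-cluster partition, keep the top 2.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 10))
X[:100, 0] += 4.0               # dimension 0 is relevant: it separates clusters
labels = np.repeat([0, 1], 100)
top2 = np.argsort(fisher_scores(X, labels))[::-1][:2]
print("selected dimensions:", top2)
print("SSE:", sse(X, labels, robust_centers(X, labels)))
```

In this toy run the score of dimension 0 dominates, so it is selected as relevant; the trimmed-mean centers make the SSE evaluation less sensitive to outlying points, which is the motivation for preferring a trimmed mean over the ordinary mean when estimating cluster centers.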

Keywords

Feature Selection · Cluster Center · Relevant Dimension · Irrelevant Dimension · Subspace Cluster

References

  1. Aggarwal, C., Yu, P.: Finding generalized projected clusters in high dimensional spaces. In: ACM SIGMOD International Conference on Management of Data, pp. 70–81. ACM (2000)
  2. Aggarwal, C., Wolf, J., Yu, P., Procopiuc, C., Park, J.: Fast algorithms for projected clustering. ACM SIGMOD Record 28(2), 61–72 (1999)
  3. Agrawal, R., Gehrke, J., Gunopulos, D., Raghavan, P.: Automatic subspace clustering of high dimensional data for data mining applications. In: ACM SIGMOD International Conference on Management of Data, pp. 94–105. ACM Press (1998)
  4. Andrews, H., Patterson, C.: Singular value decompositions and digital image processing. IEEE Trans. Acoust. Speech Signal Process. 24(1), 26–53 (1976)
  5. Apolloni, B., Bassis, S., Brega, A.: Feature selection via Boolean independent component analysis. Inf. Sci. 179(22), 3815–3831 (2009)
  6. Arai, K., Barakbah, A.: Hierarchical k-means: an algorithm for centroids initialization for k-means. Rep. Fac. Sci. Eng. 36(1), 25–31 (2007)
  7. Barakbah, A., Kiyoki, Y.: A pillar algorithm for k-means optimization by distance maximization for initial centroid designation. In: IEEE Symposium on Computational Intelligence and Data Mining (CIDM 2009), pp. 61–68. IEEE (2009)
  8. Berkhin, P.: A survey of clustering data mining techniques. Technical Report (2002)
  9. Bouguessa, M., Wang, S.: Mining projected clusters in high-dimensional spaces. IEEE Trans. Knowl. Data Eng. 21(4), 507–522 (2009)
  10. Celebi, M.: Effective initialization of k-means for color quantization. In: 16th IEEE International Conference on Image Processing (ICIP 2009), pp. 1649–1652. IEEE (2009)
  11. Cheng, C., Fu, A., Zhang, Y.: Entropy-based subspace clustering for mining numerical data. In: Proceedings of the Fifth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp. 84–93. ACM (1999)
  12. Chu, Y., Huang, J., Chuang, K., Yang, D., Chen, M.: Density conscious subspace clustering for high-dimensional data. IEEE Trans. Knowl. Data Eng. 22(1), 16–30 (2010)
  13. Ding, C., He, X.: K-means clustering via principal component analysis. In: Proceedings of the Twenty-First International Conference on Machine Learning, pp. 225–232. ACM (2004)
  14. Gheyas, I., Smith, L.: Feature subset selection in large dimensionality domains. Pattern Recognit. 43(1), 5–13 (2010)
  15. Goil, S., Nagesh, H., Choudhary, A.: MAFIA: efficient and scalable subspace clustering for very large data sets. In: Proceedings of the 5th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp. 443–452 (1999)
  16. Günnemann, S., Färber, I., Müller, E., Seidl, T.: ASCLU: alternative subspace clustering. In: MultiClust Workshop at KDD (2010)
  17. Han, J., Kamber, M.: Data Mining: Concepts and Techniques, 2nd edn. Morgan Kaufmann (2001)
  18. Hu, Q., Che, X., Zhang, L., Yu, D.: Feature evaluation and selection based on neighborhood soft margin. Neurocomputing 73(10), 2114–2124 (2010)
  19. Jagota, A.: Novelty detection on a very large number of memories stored in a Hopfield-style network. In: IJCNN-91-Seattle International Joint Conference on Neural Networks, vol. 2, pp. 905–. IEEE (1991)
  20. Jain, A., Dubes, R.: Algorithms for Clustering Data. Prentice-Hall (1988)
  21. Jain, A., Murty, M., Flynn, P.: Data clustering: a review. ACM Computing Surveys 31(3), 264–323 (1999)
  22. Kabir, M., Islam, M., et al.: A new wrapper feature selection approach using neural network. Neurocomputing 73(16), 3273–3283 (2010)
  23. Khan, S., Ahmad, A.: Cluster center initialization algorithm for k-means clustering. Pattern Recognit. Lett. 25(11), 1293–1302 (2004)
  24. Kriegel, H., Kröger, P., Zimek, A.: Clustering high-dimensional data: a survey on subspace clustering, pattern-based clustering, and correlation clustering. ACM Trans. Knowl. Discov. Data 3(1), 1–58 (2009)
  25. Kruskal, J., Wish, M.: Multidimensional Scaling. Quantitative Applications in the Social Sciences. Beverly Hills (1978)
  26. Liu, Y., Liu, Y., Chan, K.: Dimensionality reduction for heterogeneous dataset in rushes editing. Pattern Recognit. 42(2), 229–242 (2009)
  27. Moise, G., Zimek, A., Kröger, P., Kriegel, H., Sander, J.: Subspace and projected clustering: experimental evaluation and analysis. Knowl. Inf. Syst. 21(3), 299–326 (2009)
  28. Ng, R., Han, J.: CLARANS: a method for clustering objects for spatial data mining. IEEE Trans. Knowl. Data Eng. 14(5), 1003–1016 (2002)
  29. Parsons, L., Haque, E., Liu, H.: Subspace clustering for high dimensional data: a review. ACM SIGKDD Explorations Newsletter 6(1), 90–105 (2004)
  30. Parsons, L., Haque, E., Liu, H., et al.: Evaluating subspace clustering algorithms. In: Workshop on Clustering High Dimensional Data and its Applications, SIAM International Conference on Data Mining, pp. 48–56 (2004)
  31. Pearson, E.: Studies in the history of probability and statistics. XX: Some early correspondence between W.S. Gosset, R.A. Fisher and Karl Pearson, with notes and comments. Biometrika 55(3), 445–457 (1968)
  32. Puri, C., Kumar, N.: Projected Gustafson-Kessel clustering algorithm and its convergence. Trans. Rough Sets XIV, 159–182 (2011)
  33. Rajput, D., Singh, P., Bhattacharya, M.: An efficient technique for clustering high dimensional data set. In: 10th International Conference on Information and Knowledge Engineering, pp. 434–440. WASET, USA (July 2011)
  34. Roweis, S., Saul, L.: Nonlinear dimensionality reduction by locally linear embedding. Science 290(5500), 2323–2326 (2000)
  35. Sugiyama, M., Kawanabe, M., Chui, P.: Dimensionality reduction for density ratio estimation in high-dimensional spaces. Neural Netw. 23(1), 44–59 (2010)
  36. Tenenbaum, J., De Silva, V., Langford, J.: A global geometric framework for nonlinear dimensionality reduction. Science 290(5500), 2319–2323 (2000)
  37. Veenman, C., Reinders, M., Backer, E.: A maximum variance cluster algorithm. IEEE Trans. Pattern Anal. Mach. Intell. 24(9), 1273–1280 (2002)
  38. Wang, D., Ding, C., Li, T.: K-subspace clustering. In: Machine Learning and Knowledge Discovery in Databases, pp. 506–521 (2009)

Copyright information

© Springer International Publishing Switzerland 2015

Authors and Affiliations

  • Dharmveer Singh Rajput (1)
  • Pramod Kumar Singh (1)
  • Mahua Bhattacharya (1)
  1. ABV – Indian Institute of Information Technology and Management, Gwalior, India
