
Machine Learning, Volume 82, Issue 1, pp 43–70

Particle swarm optimizer for variable weighting in clustering high-dimensional data

  • Yanping Lu
  • Shengrui Wang
  • Shaozi Li
  • Changle Zhou

Abstract

We present a particle swarm optimizer (PSO) for the variable weighting problem in projected clustering of high-dimensional data. Many subspace clustering algorithms fail to yield good cluster quality because they do not employ an efficient search strategy. We focus on soft projected clustering. We design a suitable k-means weighting objective function in which a change in variable weights is reflected exponentially. We also transform the original constrained variable weighting problem into a problem with bound constraints by using a normalized representation of variable weights, and we use a particle swarm optimizer to minimize the objective function, searching for global optima of the variable weighting problem in clustering. Our experimental results on both synthetic and real data show that the proposed algorithm greatly improves cluster quality. In addition, its results are much less dependent on the initial cluster centroids. In an application to text clustering, we show that the algorithm can be easily adapted to other similarity measures, such as the extended Jaccard coefficient for text data, and can be very effective.
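The idea of searching over bound-constrained weights that are normalized before use can be illustrated with a minimal sketch. It is not the authors' implementation: the objective below (normalized weights raised to an assumed exponent `beta`, multiplied by squared per-dimension distances), the plain global-best PSO update, the coefficient values, and the function names `weighted_kmeans_objective` and `pso_variable_weights` are all assumptions made for illustration. Centroids are held fixed for brevity.

```python
import numpy as np

def weighted_kmeans_objective(pos, X, centroids, beta=8.0):
    """Assumed objective: weighted within-cluster dispersion (illustrative only)."""
    # Normalize each cluster's bound-constrained position so its weights sum to 1.
    w = pos / pos.sum(axis=1, keepdims=True)
    # Squared per-dimension differences, shape (n_points, k, d).
    diff2 = (X[:, None, :] - centroids[None, :, :]) ** 2
    # Weighted distance of every point to every centroid, shape (n_points, k).
    dist = (diff2 * (w ** beta)[None, :, :]).sum(axis=2)
    # Each point contributes its distance to the nearest centroid.
    return dist.min(axis=1).sum()

def pso_variable_weights(X, centroids, n_particles=20, n_iter=100, beta=8.0, seed=0):
    """Global-best PSO over weight positions in [lo, hi] (a generic variant)."""
    rng = np.random.default_rng(seed)
    k, d = centroids.shape
    lo, hi = 1e-6, 1.0                      # bound constraints on each position entry
    pos = rng.uniform(lo, hi, size=(n_particles, k, d))
    vel = np.zeros_like(pos)
    pbest = pos.copy()
    pbest_val = np.array([weighted_kmeans_objective(p, X, centroids, beta) for p in pos])
    g = pbest_val.argmin()
    gbest, gbest_val = pbest[g].copy(), pbest_val[g]
    inertia, c1, c2 = 0.72, 1.49, 1.49      # commonly used PSO coefficients (assumed)
    for _ in range(n_iter):
        r1 = rng.random(pos.shape)
        r2 = rng.random(pos.shape)
        vel = inertia * vel + c1 * r1 * (pbest - pos) + c2 * r2 * (gbest - pos)
        pos = np.clip(pos + vel, lo, hi)    # keep particles inside the bounds
        vals = np.array([weighted_kmeans_objective(p, X, centroids, beta) for p in pos])
        improved = vals < pbest_val
        pbest[improved] = pos[improved]
        pbest_val[improved] = vals[improved]
        g = pbest_val.argmin()
        if pbest_val[g] < gbest_val:
            gbest, gbest_val = pbest[g].copy(), pbest_val[g]
    # Report the normalized weights of the best particle and its objective value.
    return gbest / gbest.sum(axis=1, keepdims=True), gbest_val
```

Calling `pso_variable_weights(X, centroids)` with an `(n, d)` data array and a `(k, d)` centroid array returns a `(k, d)` weight matrix whose rows sum to 1, together with the best objective value found. Because the search runs over positions in a box and normalization happens inside the objective, no equality constraint has to be enforced during the velocity update.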

Keywords: High-dimensional data, Projected clustering, Variable weighting, Particle swarm optimization, Text clustering

Copyright information

© The Author(s) 2009

Authors and Affiliations

  • Yanping Lu (1, 2)
  • Shengrui Wang (1)
  • Shaozi Li (2)
  • Changle Zhou (2)

  1. Department of Computer Science, University of Sherbrooke, Sherbrooke, Canada
  2. Department of Cognitive Science, Xiamen University, Xiamen, China
