
Analysis and Application of Normalization Methods with Supervised Feature Weighting to Improve K-means Accuracy

  • Iratxe Niño-Adan
  • Itziar Landa-Torres
  • Eva Portillo
  • Diana Manjarres
Conference paper
Part of the Advances in Intelligent Systems and Computing book series (AISC, volume 950)

Abstract

Normalization methods are widely employed to transform the variables, or features, of a given dataset. In this paper, three classical feature normalization methods, Standardization (St), Min-Max (MM) and Median Absolute Deviation (MAD), are studied on different synthetic datasets from the UCI repository. An exhaustive analysis of the transformed features’ ranges and their influence on the Euclidean distance is performed, concluding that knowledge about the group structure captured by each feature is needed to select the best normalization method for a given dataset. In order to effectively capture the features’ importance and adjust their contribution accordingly, this paper proposes a two-stage methodology for normalization and supervised feature weighting based on the Pearson correlation coefficient and on a Random Forest feature importance estimation method. Simulations on five different datasets reveal that, in terms of accuracy, the proposed two-stage methodology outperforms, or at least maintains, the K-means performance obtained when only normalization is applied.
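The pipeline the abstract describes can be sketched in a few lines. The snippet below is an illustrative reconstruction, not the authors’ code: the normalization functions follow the standard definitions of St, MM and MAD; the weights come either from the absolute Pearson correlation of each feature with the class labels or from scikit-learn’s impurity-based Random Forest importances; and the weighted Euclidean distance is realised by rescaling each normalized feature by the square root of its weight before running ordinary K-means. All function names and parameter values are assumptions made for illustration.

```python
# Minimal sketch of normalization + supervised feature weighting + K-means.
# Not the authors' implementation; definitions and defaults are assumptions.
import numpy as np
from sklearn.cluster import KMeans
from sklearn.ensemble import RandomForestClassifier


def standardize(X):
    """Standardization (St): zero mean, unit variance per feature."""
    return (X - X.mean(axis=0)) / X.std(axis=0)


def min_max(X):
    """Min-Max (MM): rescale each feature to the [0, 1] range."""
    return (X - X.min(axis=0)) / (X.max(axis=0) - X.min(axis=0))


def mad_scale(X):
    """Median Absolute Deviation (MAD): robust centring and scaling.
    The 1.4826 consistency constant is omitted here for simplicity."""
    med = np.median(X, axis=0)
    mad = np.median(np.abs(X - med), axis=0)
    return (X - med) / mad


def pearson_weights(X, y):
    """Weight each feature by |Pearson correlation| with the labels
    (labels treated as numeric; weights normalised to sum to one)."""
    r = np.array([np.corrcoef(X[:, j], y)[0, 1] for j in range(X.shape[1])])
    w = np.abs(r)
    return w / w.sum()


def rf_weights(X, y, seed=0):
    """Weight each feature by Random Forest impurity-based importance
    (already normalised to sum to one by scikit-learn)."""
    rf = RandomForestClassifier(n_estimators=200, random_state=seed).fit(X, y)
    return rf.feature_importances_


def weighted_kmeans(X, weights, k, seed=0):
    """K-means under a weighted Euclidean distance: scaling feature j by
    sqrt(w_j) makes the ordinary squared Euclidean distance on the scaled
    data equal to the weighted squared distance on the original data."""
    Xw = X * np.sqrt(weights)
    return KMeans(n_clusters=k, n_init=10, random_state=seed).fit_predict(Xw)


# Example on Iris (a UCI dataset): Min-Max normalization, Pearson weighting.
from sklearn.datasets import load_iris
X, y = load_iris(return_X_y=True)
Xn = min_max(X)
labels = weighted_kmeans(Xn, pearson_weights(Xn, y), k=3)
```

The square-root rescaling works because sum_j w_j (x_j - c_j)^2 equals sum_j (sqrt(w_j) x_j - sqrt(w_j) c_j)^2, so a standard K-means on the rescaled data minimises exactly the weighted objective. Computing clustering accuracy against the true classes would additionally require matching cluster labels to classes (e.g. via the Hungarian method), a step omitted here.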

Keywords

Normalization · Standardization · Weighted Euclidean distance · Pearson correlation · Random Forest · K-means


Acknowledgement

This work has been supported in part by the ELKARTEK program (SeNDANEU KK-2018/00032) and the HAZITEK program (DATALYSE ZL-2018/00765) of the Basque Government, and by a TECNALIA Research and Innovation PhD Scholarship.


Copyright information

© Springer Nature Switzerland AG 2020

Authors and Affiliations

  • Iratxe Niño-Adan¹
  • Itziar Landa-Torres²
  • Eva Portillo³
  • Diana Manjarres¹
  1. Tecnalia Research and Innovation, Derio, Spain
  2. Petronor Innovación S.L., Muskiz, Spain
  3. Department of Automatic Control and System Engineering, School of Engineering, University of the Basque Country (UPV/EHU), Bilbao, Spain
