Analysis and Application of Normalization Methods with Supervised Feature Weighting to Improve K-means Accuracy

Part of the Advances in Intelligent Systems and Computing book series (AISC, volume 950)

Abstract

Normalization methods are widely employed to transform the variables, or features, of a given dataset. In this paper, three classical feature normalization methods, Standardization (St), Min-Max (MM), and Median Absolute Deviation (MAD), are studied on different datasets from the UCI repository. An exhaustive analysis of the transformed features' ranges and their influence on the Euclidean distance is performed, concluding that knowledge about the group structure captured by each feature is needed to select the best normalization method for a given dataset. To effectively capture each feature's importance and adjust its contribution accordingly, this paper proposes a two-stage methodology combining normalization with supervised feature weighting based on the Pearson correlation coefficient and on a Random Forest feature-importance estimation method. Simulations on five datasets reveal that, in terms of accuracy, the proposed two-stage methodology outperforms, or at least matches, the K-means performance obtained when only normalization is applied.
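
To make the two-stage pipeline described above concrete, the following is a minimal sketch, assuming scikit-learn and using the Iris dataset as a stand-in for the UCI datasets. The choice of Min-Max normalization, the way the weights are derived from the class labels, and the square-root feature scaling are illustrative assumptions, not the authors' exact procedure; evaluating accuracy would additionally require matching clusters to classes (e.g. with the Hungarian method).

```python
# Sketch of the two-stage methodology: (1) normalize features,
# (2) derive supervised weights, then run K-means on the weighted features.
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.preprocessing import MinMaxScaler

X, y = load_iris(return_X_y=True)

# Stage 1: normalization (Min-Max here; Standardization or MAD scaling
# are the other two methods studied in the paper).
X_norm = MinMaxScaler().fit_transform(X)

# Stage 2, option A: weights from the absolute Pearson correlation between
# each feature and the class labels (one simple aggregation choice; an
# assumption for illustration).
w_pearson = np.array([abs(np.corrcoef(X_norm[:, j], y)[0, 1])
                      for j in range(X_norm.shape[1])])
w_pearson /= w_pearson.sum()

# Stage 2, option B: weights from Random Forest feature importances
# (scikit-learn normalizes these to sum to 1).
rf = RandomForestClassifier(n_estimators=200, random_state=0).fit(X_norm, y)
w_rf = rf.feature_importances_

# Multiplying feature j by sqrt(w_j) makes the ordinary Euclidean distance
# used by K-means equal to the weighted Euclidean distance with weights w_j.
X_weighted = X_norm * np.sqrt(w_rf)
labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X_weighted)
```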

Keywords

  • Normalization
  • Standardization
  • Weighted Euclidean Distance
  • Pearson correlation
  • Random Forest
  • K-means
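
For reference, the normalization methods and the weighted distance named in the keywords above are conventionally defined as follows; these are the standard definitions, and the paper's exact formulations may differ. For a feature \(x\) with mean \(\mu\) and standard deviation \(\sigma\):

\[
\text{St:}\quad x' = \frac{x - \mu}{\sigma},
\qquad
\text{MM:}\quad x' = \frac{x - \min(x)}{\max(x) - \min(x)},
\qquad
\text{MAD:}\quad x' = \frac{x - \operatorname{median}(x)}{\operatorname{MAD}(x)},
\]

with \(\operatorname{MAD}(x) = \operatorname{median}\bigl(\lvert x - \operatorname{median}(x)\rvert\bigr)\). The weighted Euclidean distance between points \(\mathbf{a}\) and \(\mathbf{b}\) with feature weights \(w_j\) over \(p\) features is

\[
d_w(\mathbf{a}, \mathbf{b}) = \sqrt{\sum_{j=1}^{p} w_j\,(a_j - b_j)^2}.
\]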



Acknowledgement

This work has been supported in part by the ELKARTEK program (SeNDANEU KK-2018/00032) and the HAZITEK program (DATALYSE ZL-2018/00765) of the Basque Government, and by a TECNALIA Research and Innovation PhD Scholarship.

Author information

Corresponding author

Correspondence to Iratxe Niño-Adan.


Copyright information

© 2020 Springer Nature Switzerland AG

About this paper

Cite this paper

Niño-Adan, I., Landa-Torres, I., Portillo, E., Manjarres, D. (2020). Analysis and Application of Normalization Methods with Supervised Feature Weighting to Improve K-means Accuracy. In: Martínez Álvarez, F., Troncoso Lora, A., Sáez Muñoz, J., Quintián, H., Corchado, E. (eds) 14th International Conference on Soft Computing Models in Industrial and Environmental Applications (SOCO 2019). SOCO 2019. Advances in Intelligent Systems and Computing, vol 950. Springer, Cham. https://doi.org/10.1007/978-3-030-20055-8_2
