Abstract
Normalization of feature vectors is often used as a data preprocessing step for clustering. A unified statistical approach to feature vector normalization was recently proposed by the authors. After the proposed normalization, the contributions of numerical and categorical attributes to a specified objective function are statistically the same. Although consistency is an important property of estimators, the consistency of the sample estimators used for normalization has never been considered. A mathematical justification of the statistical normalization procedure is given here. The sample estimators proposed for normalizing the attributes of feature vectors are proven to have desirable properties: they are consistent and unbiased. Some other mathematical questions related to clustering are also treated rigorously. In particular, the statistical normalization procedure is discussed in detail for objective functions based on the Chebyshev, attribute mismatch categorical, and Minkowski mixed \(p\)-metrics. As an application of the normalization procedure, several benchmark datasets are clustered with the non-normalized and the introduced normalized mixed metrics, using either the \(k\)-prototypes algorithm (for \(p=2\)) or another algorithm (for \(p \ne 2\)).
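To make the kinds of metric named in the abstract concrete, the following is a minimal Python sketch of a Minkowski-type mixed distance and a Chebyshev-type mixed distance over feature vectors with numerical and categorical attributes. It is an illustration under stated assumptions, not the authors' procedure: the function names, the simple 0/1 mismatch for categorical attributes, and the optional per-attribute scale factors `num_scale` and `cat_scale` are all hypothetical, and the scale factors merely stand in for whatever normalization constants the paper's estimators would supply.

```python
from typing import Hashable, Optional, Sequence


def _mixed_terms(x_num, y_num, x_cat, y_cat, num_scale, cat_scale):
    """Return the per-attribute scaled differences of two mixed feature vectors."""
    num_scale = num_scale if num_scale is not None else [1.0] * len(x_num)
    cat_scale = cat_scale if cat_scale is not None else [1.0] * len(x_cat)
    # Numerical attributes: absolute difference divided by an assumed scale.
    terms = [abs(a - b) / s for a, b, s in zip(x_num, y_num, num_scale)]
    # Categorical attributes: 0/1 mismatch divided by an assumed scale.
    terms += [(0.0 if u == v else 1.0) / c
              for u, v, c in zip(x_cat, y_cat, cat_scale)]
    return terms


def mixed_minkowski_distance(
    x_num: Sequence[float],
    y_num: Sequence[float],
    x_cat: Sequence[Hashable],
    y_cat: Sequence[Hashable],
    p: float = 2.0,
    num_scale: Optional[Sequence[float]] = None,
    cat_scale: Optional[Sequence[float]] = None,
) -> float:
    """Minkowski-type distance: (sum of p-th powers of the terms) ** (1 / p)."""
    if p < 1:
        raise ValueError("p must be at least 1")
    terms = _mixed_terms(x_num, y_num, x_cat, y_cat, num_scale, cat_scale)
    return sum(t ** p for t in terms) ** (1.0 / p)


def mixed_chebyshev_distance(x_num, y_num, x_cat, y_cat,
                             num_scale=None, cat_scale=None) -> float:
    """Chebyshev-type distance: the largest per-attribute term."""
    terms = _mixed_terms(x_num, y_num, x_cat, y_cat, num_scale, cat_scale)
    return max(terms) if terms else 0.0


# Two objects with two numerical and two categorical attributes each.
d2 = mixed_minkowski_distance([0.0, 3.0], [4.0, 3.0], ["a", "x"], ["b", "x"], p=2)
d1 = mixed_minkowski_distance([0.0, 3.0], [4.0, 3.0], ["a", "x"], ["b", "x"], p=1)
dinf = mixed_chebyshev_distance([0.0, 3.0], [4.0, 3.0], ["a", "x"], ["b", "x"])
```

With unit scales, the numerical difference of 4 dominates the single categorical mismatch. This illustrates why unnormalized mixed metrics can let one attribute type outweigh the other, which is the imbalance a normalization step is meant to address.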
Acknowledgments
Thanks are due to Professor Feodor M. Borodich (Cardiff University) for his valuable comments on the paper.
Additional information
Responsible editor: Chih-Jen Lin.
About this article
Cite this article
Prostov, M.Y., Suarez-Alvarez, M.M. & Prostov, Y.I. Properties of the sample estimators used for statistical normalization of feature vectors. Data Min Knowl Disc 29, 1815–1837 (2015). https://doi.org/10.1007/s10618-014-0395-5
DOI: https://doi.org/10.1007/s10618-014-0395-5