Properties of the sample estimators used for statistical normalization of feature vectors

Prostov, Mikhail Y.; Suarez-Alvarez, Maria M.; Prostov, Yuriy I.

doi:10.1007/s10618-014-0395-5

Published: 30 November 2014

Volume 29, pages 1815–1837 (2015)

Data Mining and Knowledge Discovery

Mikhail Y. Prostov¹,
Maria M. Suarez-Alvarez² &
Yuriy I. Prostov³

Abstract

Normalization of feature vectors is often used as a data preprocessing step for clustering. A unified statistical approach to feature vector normalization was recently proposed by the authors. After the proposed normalization, the contributions of numerical and categorical attributes to a specified objective function are statistically the same. Although consistency is an important property of estimators, the consistency of the sample estimators used for normalization has never been considered. A mathematical justification of the statistical normalization procedure is given here. The sample estimators proposed for normalization of attributes of feature vectors are proven to have desirable properties, namely consistency and unbiasedness. Some other mathematical questions related to clustering are also treated rigorously. In particular, the statistical normalization procedure is discussed in detail for objective functions based on the Chebyshev, attribute mismatch categorical, and Minkowski mixed \(p\)-metrics. As an application of the normalization procedure, several benchmark datasets are clustered with non-normalized and the newly introduced normalized mixed metrics, using either the \(k\)-prototypes algorithm (for \(p=2\)) or another algorithm (for \(p \neq 2\)).
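The three distances named in the abstract can be illustrated with a minimal Python sketch. This preview does not give the paper's exact definitions or its normalization coefficients, so the forms below are common textbook versions and should be read as assumptions: the function names, the weight `gamma` on the categorical part (in the spirit of the k-prototypes cost), and the way the numerical and categorical parts are combined under a single \(p\)-th root are all hypothetical choices made for illustration.

```python
from typing import Hashable, Sequence


def chebyshev(x: Sequence[float], y: Sequence[float]) -> float:
    """Chebyshev (L-infinity) distance: the largest coordinate-wise gap."""
    return max(abs(a - b) for a, b in zip(x, y))


def mismatch(u: Sequence[Hashable], v: Sequence[Hashable]) -> int:
    """Attribute mismatch distance: count of categorical attributes that differ."""
    return sum(a != b for a, b in zip(u, v))


def mixed_minkowski(
    x_num: Sequence[float],
    y_num: Sequence[float],
    x_cat: Sequence[Hashable],
    y_cat: Sequence[Hashable],
    p: float = 2.0,
    gamma: float = 1.0,
) -> float:
    """Assumed mixed p-metric: numerical Minkowski part plus weighted mismatches.

    gamma is a hypothetical weight on the categorical part; it is not the
    paper's normalization coefficient, which this preview does not state.
    """
    numeric = sum(abs(a - b) ** p for a, b in zip(x_num, y_num))
    categorical = gamma * mismatch(x_cat, y_cat)
    return (numeric + categorical) ** (1.0 / p)


# Two mixed feature vectors: two numerical and two categorical attributes.
d = mixed_minkowski([0.0, 0.0], [3.0, 0.0], ["red", "a"], ["blue", "a"], p=2.0, gamma=7.0)
```

With `gamma` left at 1, a numerical attribute measured on a large scale would dominate the sum, which is the kind of imbalance between attribute types that the normalization discussed in the abstract is meant to remove.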

Acknowledgments

Thanks are due to Professor Feodor M. Borodich (Cardiff University) for his valuable comments on the paper.

Author information

Authors and Affiliations

Faculty of Mechanics and Mathematics, Moscow State University, Moscow, 119991, Russia
Mikhail Y. Prostov
School of Engineering, Cardiff University, Cardiff, CF24 3AA, UK
Maria M. Suarez-Alvarez
Department of Higher Mathematics, Moscow State Technical University of Radioengineering, Electronics and Automation, 78 Vernadskogo pr., Moscow, 119454, Russia
Yuriy I. Prostov

Corresponding author

Correspondence to Maria M. Suarez-Alvarez.

Additional information

Responsible editor: Chih-Jen Lin.

About this article

Cite this article

Prostov, M.Y., Suarez-Alvarez, M.M. & Prostov, Y.I. Properties of the sample estimators used for statistical normalization of feature vectors. Data Min Knowl Disc 29, 1815–1837 (2015). https://doi.org/10.1007/s10618-014-0395-5

Received: 10 February 2014
Accepted: 17 November 2014
Published: 30 November 2014
Issue Date: November 2015
DOI: https://doi.org/10.1007/s10618-014-0395-5
