Properties of the sample estimators used for statistical normalization of feature vectors

Data Mining and Knowledge Discovery

Abstract

Normalization of feature vectors is often used as a data preprocessing step for clustering. The authors recently proposed a unified statistical approach to feature vector normalization. After the proposed normalization, the contributions of both numerical and categorical attributes to a specified objective function are statistically the same. Although consistency is an important property of estimators, the consistency of the sample estimators used for normalization has not previously been considered. A mathematical justification of the statistical normalization procedure is given here. The sample estimators proposed for normalization of attributes of feature vectors are proven to have desirable properties, namely they are consistent and unbiased. Some other mathematical questions related to clustering are also treated rigorously here. In particular, the statistical normalization procedure is discussed in detail for objective functions based on the Chebyshev, attribute mismatch categorical and Minkowski mixed \(p\)-metrics. As an application of the normalization procedure, several benchmark datasets are clustered with the non-normalized and the proposed normalized mixed metrics, using either the \(k\)-prototypes algorithm (for \(p=2\)) or another algorithm (for \(p \neq 2\)).
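The idea of making numerical and categorical attributes contribute equally can be illustrated with a minimal Python sketch. It is not taken from the paper: the function names, the toy data and the choice of estimator are assumptions made for illustration only. Each attribute's contribution to the \(p\)-th power of a mixed Minkowski distance is divided by the mean of that contribution over all pairs of distinct sample items. This pairwise mean is a U-statistic, so for an i.i.d. sample it is an unbiased and consistent estimator of the expected contribution between two independent feature vectors. Categorical attributes are assumed to be stored as strings and compared by attribute mismatch (0 if equal, 1 otherwise).

```python
from itertools import combinations

def contribution(a, b, p):
    """Contribution of one attribute to the p-th power of a mixed distance."""
    if isinstance(a, str):           # categorical: attribute mismatch, 0 or 1
        return 0.0 if a == b else 1.0
    return abs(a - b) ** p           # numerical: |a - b|^p

def mean_pair_contribution(column, p):
    """Mean contribution over all unordered pairs of distinct sample items.

    Assumes at least two items and an attribute that is not constant,
    otherwise the resulting scale would be undefined or zero.
    """
    pairs = list(combinations(column, 2))
    return sum(contribution(a, b, p) for a, b in pairs) / len(pairs)

def mixed_distance(x, y, scales, p):
    """Mixed Minkowski p-distance with each attribute divided by its scale."""
    total = sum(contribution(a, b, p) / s for a, b, s in zip(x, y, scales))
    return total ** (1.0 / p)

# Hypothetical toy sample: one numerical and one categorical attribute.
data = [
    (1.0, "red"),
    (2.0, "blue"),
    (4.0, "red"),
    (7.0, "green"),
]
p = 2
columns = list(zip(*data))
scales = [mean_pair_contribution(col, p) for col in columns]

# After scaling, every attribute has mean pairwise contribution 1.
normalized_means = [mean_pair_contribution(col, p) / s
                    for col, s in zip(columns, scales)]

print(scales)
print(normalized_means)
print(mixed_distance(data[0], data[1], scales, p))
```

For \(p=2\) the numerical scale equals twice the unbiased sample variance, and the categorical scale equals the fraction of mismatching pairs in the sample. The estimators analysed in the paper may differ in form from this stand-in.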

Acknowledgments

Thanks are due to Professor Feodor M. Borodich (Cardiff University) for his valuable comments on the paper.

Author information

Correspondence to Maria M. Suarez-Alvarez.

Additional information

Responsible editor: Chih-Jen Lin.

Cite this article

Prostov, M.Y., Suarez-Alvarez, M.M. & Prostov, Y.I. Properties of the sample estimators used for statistical normalization of feature vectors. Data Min Knowl Disc 29, 1815–1837 (2015). https://doi.org/10.1007/s10618-014-0395-5

  • DOI: https://doi.org/10.1007/s10618-014-0395-5
