Abstract
To address high-dimensional problems, this paper introduces a new ‘direction-aware’ distance. It combines two components: (1) the traditional Euclidean distance and (2) an angular/directional divergence derived from the cosine similarity. The new distance therefore combines the advantages of the Euclidean metric and the cosine similarity while remaining defined over the Euclidean space domain. It has a wide range of applicability and can serve as an alternative distance measure in various traditional clustering approaches to improve their handling of high-dimensional problems. The paper also proposes a new evolving clustering algorithm that uses this distance. Numerical examples on benchmark datasets show that the direction-aware distance can effectively improve the clustering quality of the k-means algorithm on high-dimensional problems, and that the proposed evolving clustering algorithm is an effective tool for processing high-dimensional data streams.
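The abstract does not state the formula, so the following is only a minimal sketch of how a Euclidean term and a cosine-derived angular term might be combined into one distance. The quadratic combination, the angular term sqrt(1 - cos θ), and the weights `lam_e` and `lam_a` are assumptions made for illustration and may differ from the definition used in the paper.

```python
import math


def euclidean(x, y):
    """Traditional Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))


def angular(x, y):
    """Angular term derived from cosine similarity: sqrt(1 - cos(theta)).

    Assumed form for illustration; undefined when either vector is zero.
    """
    dot = sum(a * b for a, b in zip(x, y))
    norm_x = math.sqrt(sum(a * a for a in x))
    norm_y = math.sqrt(sum(b * b for b in y))
    if norm_x == 0.0 or norm_y == 0.0:
        raise ValueError("angular term is undefined for a zero vector")
    # Clamp to [-1, 1] to guard against floating-point rounding.
    cos_theta = max(-1.0, min(1.0, dot / (norm_x * norm_y)))
    return math.sqrt(1.0 - cos_theta)


def direction_aware(x, y, lam_e=1.0, lam_a=1.0):
    """Hypothetical combined distance: weighted quadratic sum of both terms.

    lam_e and lam_a are illustrative weights, not values from the paper.
    """
    return math.sqrt((lam_e * euclidean(x, y)) ** 2
                     + (lam_a * angular(x, y)) ** 2)


if __name__ == "__main__":
    # Orthogonal unit vectors: both terms contribute.
    print(direction_aware([1.0, 0.0], [0.0, 1.0]))
    # Same direction, different magnitude: only the Euclidean term contributes.
    print(direction_aware([1.0, 0.0], [2.0, 0.0]))
```

Setting `lam_a=0` recovers the plain Euclidean distance. In this sketch the angular term stays bounded while the Euclidean term does not, so the two weights decide how much direction matters relative to magnitude.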
Cite this article
Gu, X., Angelov, P.P., Kangin, D. et al. A new type of distance metric and its use for clustering. Evolving Systems 8, 167–177 (2017). https://doi.org/10.1007/s12530-017-9195-7
DOI: https://doi.org/10.1007/s12530-017-9195-7