Evolving Systems

, Volume 8, Issue 3, pp 167–177 | Cite as

A new type of distance metric and its use for clustering

  • Xiaowei Gu
  • Plamen P. Angelov
  • Dmitry Kangin
  • Jose C. Principe
Original Paper

Abstract

In order to address high dimensional problems, a new ‘direction-aware’ metric is introduced in this paper. This new distance is a combination of two components: (1) the traditional Euclidean distance and (2) an angular/directional divergence, derived from the cosine similarity. The newly introduced metric combines the advantages of the Euclidean metric and cosine similarity, and is defined over the Euclidean space domain. Thus, it is able to take the advantage from both spaces, while preserving the Euclidean space domain. The direction-aware distance has wide range of applicability and can be used as an alternative distance measure for various traditional clustering approaches to enhance their ability of handling high dimensional problems. A new evolving clustering algorithm using the proposed distance is also proposed in this paper. Numerical examples with benchmark datasets reveal that the direction-aware distance can effectively improve the clustering quality of the k-means algorithm for high dimensional problems and demonstrate the proposed evolving clustering algorithm to be an effective tool for high dimensional data streams processing.

Keywords

Cosine similarity Distance metric Metric space Clustering High dimensional data streams processing 

References

  1. Aggarwal CC, Hinneburg A, Keim DA (2001) On the surprising behavior of distance metrics in high dimensional space. In: International conference on database theory, pp 420–434Google Scholar
  2. Allah FA, Grosky WI, Aboutajdine D (2008) Document clustering based on diffusion maps and a comparison of the k-means performances in various spaces. In: IEEE symposium on computers and communications, pp 579–584Google Scholar
  3. Angelov P, Sadeghi-Tehran P, Ramezani R (2014) An approach to automatic real-time novelty detection, object identification, and tracking in video streams based on recursive density estimation and evolving Takagi–Sugeno fuzzy systems. Int J Intell Syst 29(2):1–23MATHGoogle Scholar
  4. Angelov P, Gu X, Kangin D (2017) Empirical data analytics. Int J Intell Syst. doi:10.1002/int.21899 Google Scholar
  5. Beyer K, Goldstein J, Ramakrishnan R, Shaft U (1999) When is ‘nearest neighbors’ meaningful? In: International conference on database theory, pp 217–235Google Scholar
  6. Caliński T, Harabasz J (1974) A dendrite method for cluster analysis. Commun Stat Methods 3(1):1–27MathSciNetCrossRefMATHGoogle Scholar
  7. Callebaut DK (1965) Generalization of the Cauchy–Schwarz inequality. J Math Anal Appl 12(3):491–494MathSciNetCrossRefMATHGoogle Scholar
  8. Cardiotocography Dataset. https://archive.ics.uci.edu/ml/datasets/Cardiotocography. Accessed 19 July 2017
  9. Chiu SL (1994) Fuzzy model identification based on cluster estimation. J Intell Fuzzy Syst 2(3):267–278Google Scholar
  10. Clustering datasets. http://cs.joensuu.fi/sipu/datasets/. Accessed 19 July 2017
  11. Comaniciu D, Meer P (2002) Mean shift: a robust approach toward feature space analysis. IEEE Trans Pattern Anal Mach Intell 24(5):603–619CrossRefGoogle Scholar
  12. Davies DL, Bouldin DW (1979) A cluster separation measure. IEEE Trans Pattern Anal Mach Intell 2:224–227CrossRefGoogle Scholar
  13. Dehak N, Dehak R, Glass J, Reynolds D, Kenny P (2010) Cosine similarity scoring without score normalization techniques. In: Proceeding Odyssey 2010—Speaker Language Recognition Work (Odyssey 2010), pp 71–75Google Scholar
  14. Dehak N, Kenny P, Dehak R, Dumouchel P, Ouellet P (2011) Front end factor analysis for speaker verification. IEEE Trans Audio Speech Lang Process 19(4):788–798CrossRefGoogle Scholar
  15. Domingos P (2012) A few useful things to know about machine learning. Commun ACM 55(10):78–87CrossRefGoogle Scholar
  16. Dutta Baruah R, Angelov P (2012) Evolving local means method for clustering of streaming data. In: IEEE international conference fuzzy system, pp 10–15Google Scholar
  17. Ester M, Kriegel HP, Sander J, Xu X (1996) A density-based algorithm for discovering clusters in large spatial databases with noise. Int Conf Knowl Discov Data Min 96:226–231Google Scholar
  18. Franti P, Virmajoki O, Hautamaki V (2008) Probabilistic clustering by random swap algorithm. In: IEEE international conference on pattern recognition, pp 1–4Google Scholar
  19. Fukunaga K, Hostetler L (1975) The estimation of the gradient of a density function, with applications in pattern recognition. IEEE Trans Inf Theory 21(1):32–40MathSciNetCrossRefMATHGoogle Scholar
  20. Keller JM, Gray MR (1985) A fuzzy k-nearest neighbor algorithm. IEEE Trans Syst Man Cybern 15(4):580–585CrossRefGoogle Scholar
  21. Li J, Ray S, Lindsay BG (2007) A nonparametric statistical approach to clustering via mode identification. J Mach Learn Res 8(8):1687–1723MathSciNetMATHGoogle Scholar
  22. Lughofer E, Cernuda C, Kindermann S, Pratama M (2015) Generalized smart evolving fuzzy systems. Evol Syst 6(4):269–292CrossRefGoogle Scholar
  23. MacQueen JB (1967) Some methods for classification and analysis of multivariate observations. In: 5th Berkeley symposium mathematical statistics and probability 1967, vol 1, no 233, pp 281–297Google Scholar
  24. McCune B, Grace JB, Urban DL (2002) Analysis of ecological communities, vol 28. MJM Software Design, Gleneden BeachGoogle Scholar
  25. McLachlan GJ (1999) Mahalanobis distance. Resonance 4(6):20–26CrossRefGoogle Scholar
  26. Optical Recognition of Handwritten Digits Dataset. https://archive.ics.uci.edu/ml/datasets/Optical+Recognition+of+Handwritten+Digits. Accessed 19 July 2017
  27. Pen-Based Recognition of Handwritten Digits Dataset. http://archive.ics.uci.edu/ml/datasets/Pen-Based+Recognition+of+Handwritten+Digits. Accessed 19 July 2017
  28. Precup RE, Filip HI, Rədac MB, Petriu EM, Preitl S, Dragoş CA (2014) Online identification of evolving Takagi-Sugeno-Kang fuzzy models for crane systems. Appl Soft Comput J 24:1155–1163CrossRefGoogle Scholar
  29. Rodriguez A, Laio A (2014) Clustering by fast search and find of density peaks. Science (80-) 344(6191):1493–1496CrossRefGoogle Scholar
  30. Rong HJ, Sundararajan N, Bin Huang G, Saratchandran P (2006) Sequential adaptive fuzzy inference system (SAFIS) for nonlinear system identification and prediction. Fuzzy Sets Syst 157(9):1260–1275MathSciNetCrossRefMATHGoogle Scholar
  31. Rong HJ, Sundararajan N, Bin Huang G, Zhao GS (2011) Extended sequential adaptive fuzzy inference system for classification problems. Evol Syst 2(2):71–82CrossRefGoogle Scholar
  32. Senoussaoui M, Kenny P, Dumouchel P, Stafylakis T (2013) Efficient iterative mean shift based cosine dissimilarity for multi-recording speaker clustering. In: IEEE international conference acoustics speech and signal processing, pp 7712–7715Google Scholar
  33. Setlur V, Stone MC (2016) A linguistic approach to categorical color assignment for data visualization. IEEE Trans Vis Comput Graph 22(1):698–707CrossRefGoogle Scholar
  34. Steel Plates Faults Dataset. https://archive.ics.uci.edu/ml/datasets/Steel+Plates+Faults. Accessed 19 July 2017

Copyright information

© Springer-Verlag GmbH Germany 2017

Authors and Affiliations

  1. 1.School of Computing and CommunicationsLancaster University LancasterLancasterUK
  2. 2.Computational NeuroEngineering Laboratory, Department of Electrical and Computer EngineeringUniversity of FloridaGainesvilleUSA

Personalised recommendations