Knowledge and Information Systems, Volume 53, Issue 2, pp 479–506

Data-dependent dissimilarity measure: an effective alternative to geometric distance measures

  • Sunil Aryal
  • Kai Ming Ting
  • Takashi Washio
  • Gholamreza Haffari
Regular Paper

Abstract

Nearest neighbor search is a core process in many data mining algorithms. Finding reliable closest matches of a test instance is still a challenging task because the effectiveness of many general-purpose distance measures such as the \(\ell _p\)-norm decreases as the number of dimensions increases, and their performance varies significantly across data distributions. This is mainly because they compute the distance between two instances solely from their geometric positions in the feature space, so the data distribution has no influence on the distance measure. This paper presents a simple data-dependent general-purpose dissimilarity measure called ‘\(m_p\)-dissimilarity’. Rather than relying on geometric distance, it measures the dissimilarity between two instances as a probability mass in a region that encloses the two instances in every dimension. It deems two instances in a sparse region to be more similar than two instances of equal inter-point geometric distance in a dense region. Our empirical results in k-NN classification and content-based multimedia information retrieval tasks show that the proposed \(m_p\)-dissimilarity measure produces better task-specific performance than existing widely used general-purpose distance measures such as the \(\ell _p\)-norm and cosine distance across a wide range of moderate- to high-dimensional data sets with continuous-only, discrete-only, and mixed attributes.
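
The idea of scoring a pair by the data mass around it, rather than by its geometric separation, can be illustrated with a small sketch. The abstract does not state the formula, so everything below is an assumption made for illustration: the region in each dimension is taken to be the closed interval between the two values, its probability mass is estimated as the fraction of data points falling inside it, and the per-dimension masses are combined by a power mean with exponent p. The function name `m_p_dissimilarity` and its parameters are hypothetical.

```python
from bisect import bisect_left, bisect_right


def m_p_dissimilarity(x, y, data, p=2.0):
    """Illustrative data-dependent dissimilarity (assumed form, not the paper's code).

    For each dimension i, count the data points whose i-th value lies in
    [min(x_i, y_i), max(x_i, y_i)], turn the count into a fraction of the
    data set, and combine the fractions with a power mean of order p.
    """
    n = len(data)
    d = len(x)
    total = 0.0
    for i in range(d):
        # Sorting on every call keeps the sketch short; a real implementation
        # would sort (or bin) each dimension once up front.
        column = sorted(row[i] for row in data)
        lo, hi = min(x[i], y[i]), max(x[i], y[i])
        # Number of data points inside the closed interval [lo, hi].
        count = bisect_right(column, hi) - bisect_left(column, lo)
        total += (count / n) ** p
    return (total / d) ** (1.0 / p)


# One-dimensional toy data: a dense cluster near 0 and a sparse pair near 50.
data = [[0.0], [1.0], [2.0], [3.0], [4.0], [5.0], [50.0], [55.0]]

dense_pair = m_p_dissimilarity([0.0], [5.0], data)
sparse_pair = m_p_dissimilarity([50.0], [55.0], data)

# Both pairs are 5 units apart geometrically, but the pair in the sparse
# region encloses less data mass and is therefore scored as more similar.
print(dense_pair, sparse_pair, sparse_pair < dense_pair)
```

Under these assumptions the dissimilarity of an instance to itself is generally non-zero, because the enclosing region still contains that instance; the quantity is a dissimilarity score rather than a metric.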

Keywords

Distance measure, \(\ell _p\)-norm, Cosine distance, \(m_p\)-dissimilarity

Notes

Acknowledgements

A preliminary version of this paper was published in the Proceedings of the IEEE International Conference on Data Mining (ICDM) 2014 [3]. We would like to thank the anonymous reviewers for their useful comments. Kai Ming Ting is partially supported by the Air Force Office of Scientific Research (AFOSR), Asian Office of Aerospace Research and Development (AOARD) under Award Number FA2386-13-1-4043. Takashi Washio is partially supported by the AFOSR AOARD Award Number 15IOA008-154005 and JSPS KAKENHI Grant Number 2524003.

References

  1. Aggarwal CC, Hinneburg A, Keim DA (2001) On the surprising behavior of distance metrics in high dimensional space. In: Proceedings of the 8th international conference on database theory. Springer, Berlin, pp. 420–434
  2. Ariyaratne HB, Zhang D (2012) A novel automatic hierachical approach to music genre classification. In: Proceedings of the 2012 IEEE international conference on multimedia and expo workshops, IEEE Computer Society, Washington DC, USA, pp. 564–569
  3. Aryal S, Ting KM, Haffari G, Washio T (2014) Mp-dissimilarity: a data dependent dissimilarity measure. In: Proceedings of the IEEE international conference on data mining. IEEE, pp. 707–712
  4. Bache K, Lichman M (2013) UCI machine learning repository, http://archive.ics.uci.edu/ml. University of California, Irvine, School of Information and Computer Sciences
  5. Bellet A, Habrard A, Sebban M (2013) A survey on metric learning for feature vectors and structured data, Technical Report, arXiv:1306.6709
  6. Beyer KS, Goldstein J, Ramakrishnan R, Shaft U (1999) When Is “Nearest Neighbor” meaningful? In: Proceedings of the 7th international conference on database theory. Springer, London, pp. 217–235
  7. Boriah S, Chandola V, Kumar V (2008) Similarity measures for categorical data: a comparative evaluation. In: Proceedings of the eighth SIAM international conference on data mining, pp. 243–254
  8. Cardoso-Cachopo A (2007) Improving methods for single-label text categorization, PhD thesis, Instituto Superior Tecnico, Technical University of Lisbon, Lisbon, Portugal
  9. Conover WJ, Iman RL (1981) Rank transformations as a bridge between parametric and nonparametric statistics. Am Stat 35(3):124–129
  10. Deza MM, Deza E (2009) Encyclopedia of distances. Springer, Berlin
  11. Fodor I (2002) A survey of dimension reduction techniques. Technical Report UCRL-ID-148494, Lawrence Livermore National Laboratory
  12. François D, Wertz V, Verleysen M (2007) The concentration of fractional distances. IEEE Trans Knowl Data Eng 19(7):873–886
  13. Hall M, Frank E, Holmes G, Pfahringer B, Reutemann P, Witten IH (2009) The weka data mining software: an update. SIGKDD Explor Newsl 11(1):10–18
  14. Han E-H, Karypis G (2000) Centroid-based document classification: Analysis and experimental results. In: Proceedings of the 4th European conference on principles of data mining and knowledge discovery. Springer, London, pp. 424–431
  15. Indyk P, Motwani R (1998) Approximate nearest neighbors: towards removing the curse of dimensionality. In: Proceedings of the thirtieth annual ACM symposium on theory of computing, STOC ’98, ACM, New York, pp. 604–613
  16. Jolliffe I (2005) Principal component analysis. Wiley Online Library, Hoboken
  17. Krumhansl CL (1978) Concerning the applicability of geometric models to similarity data: the interrelationship between similarity and spatial density. Psychol Rev 85(5):445–463
  18. Kulis B (2013) Metric learning: a survey. Found Trends Mach Learn 5(4):287–364
  19. Lan M, Tan CL, Su J, Lu Y (2009) Supervised and traditional term weighting methods for automatic text categorization. IEEE Trans Pattern Anal Mach Intell 31(4):721–735
  20. Lin D (1998) An information-theoretic definition of similarity. In: Proceedings of the fifteenth international conference on machine learning. Morgan Kaufmann Publishers Inc., San Francisco, pp. 296–304
  21. Lundell J, Ventura D (2007) A data-dependent distance measure for transductive instance-based learning. In: Proceedings of the IEEE international conference on systems, man and cybernetics, pp. 2825–2830
  22. Mahalanobis PC (1936) On the generalized distance in statistics. Proc Natl Inst Sci India 2:49–55
  23. Minka TP (2003) The ‘summation hack’ as an outlier model, http://research.microsoft.com/en-us/um/people/minka/papers/minka-summation.pdf. Microsoft Research
  24. Radovanović M, Nanopoulos A, Ivanović M (2010) Hubs in space: popular nearest neighbors in high-dimensional data. J Mach Learn Res 11:2487–2531
  25. Salton G, Buckley C (1988) Term-weighting approaches in automatic text retrieval. Inf Process Manage 24(5):513–523
  26. Salton G, McGill MJ (1986) Introduction to modern information retrieval. McGraw-Hill Inc, New York
  27. Schleif F-M, Tino P (2015) Indefinite proximity learning: a review. Neural Comput 27(10):2039–2096
  28. Schneider P, Bunte K, Stiekema H, Hammer B, Villmann T, Biehl M (2010) Regularization in matrix relevance learning. IEEE Trans Neural Netw 21(5):831–840
  29. Tanimoto TT (1958) Mathematical theory of classification and prediction, International Business Machines Corp
  30. Tuytelaars T, Lampert C, Blaschko MB, Buntine W (2010) Unsupervised object discovery: a comparison. Int J Comput Vision 88(2):284–302
  31. Tversky A (1977) Features of similarity. Psychol Rev 84(4):327–352
  32. Wang F, Sun J (2015) Survey on distance metric learning and dimensionality reduction in data mining. Data Min Knowl Disc 29(2):534–564
  33. Yang L (2006) Distance metric learning: a comprehensive survey, Technical report, Michigan State University
  34. Zhou G-T, Ting KM, Liu FT, Yin Y (2012) Relevance feature mapping for content-based multimedia information retrieval. Pattern Recogn 45(4):1707–1720

Copyright information

© Springer-Verlag London 2017

Authors and Affiliations

  • Sunil Aryal (1, 2)
  • Kai Ming Ting (3)
  • Takashi Washio (4)
  • Gholamreza Haffari (2)

  1. School of Engineering and Information Technology, Faculty of Science and Technology, Federation University, Mount Helen, Australia
  2. Clayton School of Information Technology, Monash University, Clayton, Australia
  3. School of Engineering and Information Technology, Federation University, Churchill, Australia
  4. The Institute of Scientific and Industrial Research, Osaka University, Ibaraki, Japan
