Data Mining and Knowledge Discovery

Volume 31, Issue 1, pp 264–286

SimUSF: an efficient and effective similarity measure that is invariant to violations of the interval scale assumption

Article

Abstract

Similarity measures are central to many machine learning algorithms. There are many different similarity measures, each catering for different applications and data requirements. Most similarity measures used with numerical data assume that the attributes are on an interval scale, in which a unit difference has the same meaning irrespective of the magnitudes of the values it separates. When this assumption is violated, accuracy may be reduced. Our experiments show that removing the interval scale assumption by transforming data to ranks can improve the accuracy of distance-based similarity measures on some tasks. However, the rank transform has high time and storage overheads. In this paper, we introduce an efficient similarity measure that does not consider the magnitudes of inter-instance distances. We compare the new similarity measure with popular similarity measures in two applications, DBSCAN clustering and content-based multimedia information retrieval (CBMIR), using real-world datasets and different transform functions. The results show that the proposed similarity measure provides good performance on a range of tasks and is invariant to violations of the interval scale assumption.
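The rank transform mentioned in the abstract can be illustrated with a minimal sketch. It is not the paper's proposed measure, only the baseline idea of replacing each attribute value with its rank before computing a distance. The function names (`rank_transform`, `rank_dataset`, `euclidean`) and the toy data are assumptions made for illustration, as is the choice of average ranks for ties.

```python
import math


def rank_transform(column):
    """Replace each value with its 1-based rank; tied values share the average rank."""
    order = sorted(range(len(column)), key=lambda i: column[i])
    ranks = [0.0] * len(column)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with the value at position i.
        while j + 1 < len(order) and column[order[j + 1]] == column[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1  # average of the 1-based positions i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def rank_dataset(rows):
    """Apply the rank transform to every attribute (column) independently."""
    columns = list(zip(*rows))
    ranked_columns = [rank_transform(list(c)) for c in columns]
    return [list(r) for r in zip(*ranked_columns)]


def euclidean(a, b):
    """Plain Euclidean distance between two instances."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


# Hypothetical toy data: the same instances on a raw scale and a log scale.
data = [[1.0, 10.0], [2.0, 100.0], [4.0, 1000.0], [8.0, 10000.0]]
log_data = [[math.log(v) for v in row] for row in data]

ranked = rank_dataset(data)
ranked_log = rank_dataset(log_data)

# Distances on the original values change under the log transform ...
print(euclidean(data[0], data[1]), euclidean(log_data[0], log_data[1]))
# ... but distances on ranks do not, because the log preserves order.
print(euclidean(ranked[0], ranked[1]), euclidean(ranked_log[0], ranked_log[1]))
```

In this sketch, each attribute has to be sorted and a full rank table stored alongside the data, which gives a rough sense of where the time and storage overheads noted in the abstract could come from.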

Keywords

Similarity measure, Interval scale, Clustering, CBMIR


Copyright information

© The Author(s) 2016

Authors and Affiliations

  1. Monash University, Melbourne, Australia
