Skip to main content

Similarity of Objects and the Meaning of Words

  • Conference paper
Theory and Applications of Models of Computation (TAMC 2006)

Part of the book series: Lecture Notes in Computer Science ((LNTCS,volume 3959))

Abstract

We survey the emerging area of compression-based, parameter-free, similarity distance measures useful in data-mining, pattern recognition, learning and automatic semantics extraction. Given a family of distances on a set of objects, a distance is universal up to a certain precision for that family if it minorizes every distance in the family between every two objects in the set, up to the stated precision (we do not require the universal distance to be an element of the family). We consider similarity distances for two types of objects: literal objects that as such contain all of their meaning, like genomes or books, and names for objects. The latter may have literal embodyments like the first type, but may also be abstract like “red” or “christianity.” For the first type we consider a family of computable distance measures corresponding to parameters expressing similarity according to particular features between pairs of literal objects. For the second type we consider similarity distances generated by web users corresponding to particular semantic relations between the (names for) the designated objects. For both families we give universal similarity distance measures, incorporating all particular distance measures in the family. In the first case the universal distance is based on compression and in the second case it is based on Google page counts related to search terms. In both cases experiments on a massive scale give evidence of the viability of the approaches.

This work supported in part by the EU sixth framework project RESQ, IST–1999–11234, the NoE QUIPROCONE IST–1999–29064, the ESF QiT Programmme, and the EU NoE PASCAL, and by the Netherlands Organization for Scientific Research (NWO) under Grant 612.052.004.

This is a preview of subscription content, log in via an institution to check access.

Access this chapter

Chapter
USD 29.95
Price excludes VAT (USA)
  • Available as PDF
  • Read on any device
  • Instant download
  • Own it forever
eBook
USD 84.99
Price excludes VAT (USA)
  • Available as PDF
  • Read on any device
  • Instant download
  • Own it forever
Softcover Book
USD 109.99
Price excludes VAT (USA)
  • Compact, lightweight edition
  • Dispatched in 3 to 5 business days
  • Free shipping worldwide - see info

Tax calculation will be finalised at checkout

Purchases are for personal use only

Institutional subscriptions

Preview

Unable to display preview. Download preview PDF.

Unable to display preview. Download preview PDF.

References

  1. Bagrow, J.P., ben-Avraham, D.: On the Google-fame of scientists and other populations. In: AIP Conference Proceedings, vol. 779(1), pp. 81–89 (2005)

    Google Scholar 

  2. Benedetto, D., Caglioti, E., Loreto, V.: Language trees and zipping. Phys. Review Lett. 88(4), 48702 (2002)

    Article  Google Scholar 

  3. Bennett, C.H., Gács, P., Li, M., Vitányi, P.M.B., Zurek, W.: Information Distance. IEEE Trans. Information Theory 44(4), 1407–1423 (1998); Conference version: Thermodynamics of Computation and Information Distance, In: Proc. 25th ACM Symp. Theory of Comput. pp. 21–30 (1993)

    Article  MATH  Google Scholar 

  4. Bennett, C.H., Li, M., Ma, B.: Chain letters and evolutionary histories. Scientific American, 76–81 (June 2003)

    Google Scholar 

  5. Burges, C.J.C.: A tutorial on support vector machines for pattern recognition. Data Mining and Knowledge Discovery 2(2), 121–167 (1998)

    Article  Google Scholar 

  6. Chen, X., Francia, B., Li, M., McKinnon, B., Seker, A.: Shared information and program plagiarism detection. IEEE Trans. Inform. Th. 50(7), 1545–1551 (2004)

    Article  MathSciNet  Google Scholar 

  7. Cilibrasi, R.: The CompLearn Toolkit, CWI, (2003), http://www.complearn.org/

  8. Cimiano, P., Staab, S.: Learning by Googling. SIGKDD Explorations 6(2), 24–33 (2004)

    Article  Google Scholar 

  9. Chai, W., Vercoe, B.: Folk music classification using hidden Markov models. In: Proc. of International Conference on Artificial Intelligence (2001)

    Google Scholar 

  10. Cilibrasi, R., Vitanyi, P.: Automatic Meaning Discovery Using Google: 100 Experiments in LearningWordNet Categories (2004), http://www.cwi.nl/~cilibrar/googlepaper/appendix.pdf

  11. Cilibrasi, R., de Wolf, R., Vitanyi, P.: Algorithmic clustering of music based on string compression. Computer Music J. 28(4), 49–67 (2004), Web version http://xxx.lanl.gov/abs/cs.SD/0303025

    Google Scholar 

  12. Cilibrasi, R., Vitanyi, P.M.B.: Clustering by compression. IEEE Trans. Information Theory 51(4), 1523–1545 (2005), Web version http://xxx.lanl.gov/abs/cs.CV/0312044

    Google Scholar 

  13. Cilibrasi, R., Vitanyi, P.: Automatic meaning discovery using Google, Manuscript, CWI (2004), http://arxiv.org/abs/cs.CL/0412098

  14. Cilibrasi, R., Vitanyi, P.M.B.: A New Quartet Tree Heuristic for Hierarchical Clustering. In: EUPASCAL Statistics and Optimization of Clustering Workshop, London, UK, July 5-6(2005), http://homepages.cwi.nl/paulv/papers/quartet.pdf

  15. Dannenberg, R., Thom, B., Watson, D.: A machine learning approach to musical style recognition. In: Proc. International Computer Music Conference, pp. 344–347 (1997)

    Google Scholar 

  16. Duda, R., Hart, P., Stork, D.: Pattern Classification. John Wiley and Sons, Chichester (2001)

    MATH  Google Scholar 

  17. The basics of Google search, http://www.google.com/help/basics.html

  18. Grimaldi, M., Kokaram, A., Cunningham, P.: Classifying music by genre using the wavelet packet transform and a round-robin ensemble. Technical report TCD-CS-2002-64, Trinity College Dublin (2002), http://www.cs.tcd.ie/publications/tech-reports/reports.02/TCDCS-2002-64.pdf

  19. Keogh, E., Lonardi, S., Rtanamahatana, C.A.: Toward parameter-free data mining. In: Proc. 10th ACM SIGKDD Intn’l Conf. Knowledge Discovery and Data Mining, Seattle, Washington, USA, August 22–25, pp. 206–215 (2004)

    Google Scholar 

  20. Kolmogorov, A.N.: Three approaches to the quantitative definition of information. Problems Inform. Transmission 1(1), 1–7 (1965)

    MathSciNet  Google Scholar 

  21. Kolmogorov, A.N.: Combinatorial foundations of information theory and the calculus of probabilities. Russian Math. Surveys 38(4), 29–40 (1983)

    Article  MATH  Google Scholar 

  22. Landauer, T., Dumais, S.: A solution to Plato’s problem: The latent semantic analysis theory of acquisition, induction and representation of knowledge. Psychol. Rev. 104, 211–240 (1997)

    Article  Google Scholar 

  23. Lenat, D.B.: Cyc: A large-scale investment in knowledge infrastructure. Comm. ACM 38(11), 33–38 (1995)

    Article  Google Scholar 

  24. Lesk, M.E.: Word-word associations in document retrieval systems. American Documentation 20(1), 27–38 (1969)

    Article  Google Scholar 

  25. Li, M., Vitányi, P.M.B.: Theory of thermodynamics of computation. In: Proc. IEEE Physics of Computation Workshop, Dallas (Texas), October 4-6, pp. 42–46 (1992), A full version (basically the here relevant part of [26]) appeared in the Preliminary Proceedings handed out at the Workshop

    Google Scholar 

  26. Li, M., Vitányi, P.M.B.: Reversibility and adiabatic computation: trading time and space for energy. Proc. Royal Society of London, Series A 452, 769–789 (1996)

    Article  MATH  Google Scholar 

  27. Li, M., Vitányi, P.M.B.: An Introduction to Kolmogorov Complexity and its Applications, 2nd edn. Springer, New York (1997)

    MATH  Google Scholar 

  28. Chen, X., Kwong, S., Li, M.: A compression algorithm for DNA sequences based on approximate matching. In: Proc. 10th Workshop on Genome Informatics (GIW), Tokyo, December 14-15. Genome Informatics Series, vol. 10 (1999); Also in Proc. 4th ACM RECOMB, p. 107 (2000)

    Google Scholar 

  29. Li, M., Badger, J.H., Chen, X., Kwong, S., Kearney, P., Zhang, H.: An information-based sequence distance and its application to whole mitochondrial genome phylogeny. Bioinformatics 17(2), 149–154 (2001)

    Article  Google Scholar 

  30. Li, M., Vitányi, P.M.B.: Algorithmic Complexity. In: Smelser, N.J., Baltes, P.B. (eds.) International Encyclopedia of the Social & Behavioral Sciences, pp. 376–382. Pergamon, Oxford (2001/2002)

    Google Scholar 

  31. Li, M., Chen, X., Li, X., Ma, B., Vitanyi, P.: The similarity metric. IEEE Trans. Information Theory 50(12), 3250–3264 (2004); Conference version in: Proc. 14th ACM-SIAM Symposium on Discrete Algorithms, Baltimore, USA, pp 863–872 (2003) Web version: http://xxx.lanl.gov/abs/cs.CC/0111054

    Article  MathSciNet  Google Scholar 

  32. Li, M., Vitanyi, P.M.B.: An Introduction to Kolmogorov Complexity and Its Applications, 2nd edn. Springer, New York (1997)

    MATH  Google Scholar 

  33. Reed, S.L., Lenat, D.B.: Mapping ontologies into cyc. In: Proc. AAAI Conference 2002 Workshop on Ontologies for the Semantic Web, Edmonton, Canada, http://citeseer.nj.nec.com/509238.html

  34. Scott, P.: Music classification using neural networks (2001), http://www.stanford.edu/class/ee373a/musicclassification.pdf

  35. Strimmer, K., von Haeseler, A.: Quartet puzzling: A quartet maximum likelihood method for reconstructing tree topologies. Mol. Biol. Evol. 13, 964–969 (1996)

    Google Scholar 

  36. Miller, G.A., et al.: WordNet, A Lexical Database for the English Language, Cognitive Science Lab. Princeton University, http://www.cogsci.princeton.edu/~wn

  37. Terra, E., Clarke, C.L.A.: Frequency Estimates for Statistical Word Similarity Measures. In: HLT/NAACL 2003, Edmonton, Alberta, 37/162 (May 2003)

    Google Scholar 

  38. Tan, P.-N., Kumar, V., Srivastava, J.: Selecting the right interestingness measure for associating patterns. In: Proc. ACM-SIGKDD Conf. Knowledge Discovery and Data Mining, pp. 491–502 (2002)

    Google Scholar 

  39. Tzanetakis, G., Cook, P.: Music genre classification of audio signals. IEEE Transactions on Speech and Audio Processing 10(5), 293–302 (2002)

    Article  Google Scholar 

  40. Wehner, S.: Analyzing network traffic and worms using compression, http://arxiv.org/abs/cs.CR/0504045

  41. Corpus collosal: How well does the world wide web represent human language? The Economist, January 20 (2005), http://www.economist.com/science/displayStory.cfm?story-id=3576374

Download references

Author information

Authors and Affiliations

Authors

Editor information

Editors and Affiliations

Rights and permissions

Reprints and permissions

Copyright information

© 2006 Springer-Verlag Berlin Heidelberg

About this paper

Cite this paper

Cilibrasi, R., Vitanyi, P. (2006). Similarity of Objects and the Meaning of Words. In: Cai, JY., Cooper, S.B., Li, A. (eds) Theory and Applications of Models of Computation. TAMC 2006. Lecture Notes in Computer Science, vol 3959. Springer, Berlin, Heidelberg. https://doi.org/10.1007/11750321_2

Download citation

  • DOI: https://doi.org/10.1007/11750321_2

  • Publisher Name: Springer, Berlin, Heidelberg

  • Print ISBN: 978-3-540-34021-8

  • Online ISBN: 978-3-540-34022-5

  • eBook Packages: Computer ScienceComputer Science (R0)

Publish with us

Policies and ethics