Informational content of cosine and other similarities calculated from high-dimensional Conceptual Property Norm data

Abstract

To study concepts that are coded in language, researchers often collect lists of conceptual properties produced by human subjects. From these data, different measures can be computed. In particular, inter-concept similarity is an important variable used in experimental studies. Among possible similarity measures, the cosine of conceptual property frequency vectors seems to be a de facto standard. However, there is a lack of comparative studies that test the merit of different similarity measures when computed from property frequency data. The current work compares four different similarity measures (cosine, correlation, Euclidean and Chebyshev) and five different types of data structures. To that end, we compared the informational content (i.e., entropy) delivered by each of those 4 × 5 = 20 combinations, and used a clustering procedure as a concrete example of how informational content affects statistical analyses. Our results lead us to conclude that similarity measures computed from lower-dimensional data fare better than those calculated from higher-dimensional data, and suggest that researchers should be more aware of data sparseness and dimensionality, and their consequences for statistical analyses.
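As a rough illustration of the comparison described above, the following Python sketch computes the four measures on a small invented concept-by-property frequency matrix and then estimates the entropy of each measure's pairwise values. It is a minimal sketch under stated assumptions, not the authors' procedure: the toy matrix, the function names (`pairwise_measures`, `shannon_entropy`, `upper_triangle`), and the choice to estimate entropy from a 10-bin histogram are all assumptions made here for illustration. Note that Euclidean and Chebyshev are distances, so larger values mean less similar concepts.

```python
import numpy as np


def pairwise_measures(X):
    """Return the four pairwise measures between the rows (concepts) of X."""
    X = np.asarray(X, dtype=float)
    norms = np.linalg.norm(X, axis=1)
    # Cosine: dot product of two rows divided by the product of their norms.
    cosine = (X @ X.T) / np.outer(norms, norms)
    # Pearson correlation between rows (np.corrcoef treats rows as variables).
    correlation = np.corrcoef(X)
    # All pairwise row differences, shape (n_concepts, n_concepts, n_properties).
    diff = X[:, None, :] - X[None, :, :]
    euclidean = np.sqrt((diff ** 2).sum(axis=2))
    chebyshev = np.abs(diff).max(axis=2)
    return {
        "cosine": cosine,
        "correlation": correlation,
        "euclidean": euclidean,
        "chebyshev": chebyshev,
    }


def shannon_entropy(values, bins=10):
    """Shannon entropy (in bits) of a histogram of the values.

    Assumption: equal-width binning; the binning scheme is illustrative only.
    """
    counts, _ = np.histogram(values, bins=bins)
    p = counts[counts > 0] / counts.sum()
    return float(-(p * np.log2(p)).sum())


def upper_triangle(M):
    """Values above the diagonal, so that each concept pair is counted once."""
    i, j = np.triu_indices(M.shape[0], k=1)
    return M[i, j]


# Hypothetical data: 4 concepts (rows) by 6 properties (columns),
# each cell a property production frequency.
toy = np.array([
    [5, 3, 0, 0, 1, 0],
    [4, 2, 0, 1, 0, 0],
    [0, 0, 6, 2, 0, 1],
    [0, 1, 5, 3, 0, 0],
])

measures = pairwise_measures(toy)
entropies = {
    name: shannon_entropy(upper_triangle(M)) for name, M in measures.items()
}
for name, h in entropies.items():
    print(f"{name}: {h:.3f} bits")
```

In a sketch like this, a measure whose pairwise values pile up in a few bins yields low entropy, which is one way to picture how sparse, high-dimensional data could leave a similarity measure with little information for a later clustering step.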

Acknowledgements

We thank two anonymous reviewers for their useful comments on a previous version of this manuscript. We also thank Eyal Sagi for his valuable input regarding the ideas discussed here. This research was carried out with funds provided by grant 1200139 from the Fondo Nacional de Desarrollo Científico y Tecnológico (FONDECYT) of the Chilean government to the first and second authors.

Author information

Corresponding author

Correspondence to Enrique Canessa.

Ethics declarations

Conflict of interest

The authors declare that they have no conflict of interest.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

This article is part of the special topic ‘Eliciting Semantic Properties: Methods and Applications’ guest-edited by Barry Devereux and Alessandro Lenci.

Handling editor: Alessandro Lenci; Reviewers: David Vinson (University College London), Cai Wingfield (Lancaster University).

Cite this article

Canessa, E., Chaigneau, S.E., Moreno, S. et al. Informational content of cosine and other similarities calculated from high-dimensional Conceptual Property Norm data. Cogn Process 21, 601–614 (2020). https://doi.org/10.1007/s10339-020-00985-5

Keywords

  • Cosine similarity
  • Euclidean distance
  • Chebyshev distance
  • Clustering
  • Conceptual properties