To study concepts that are coded in language, researchers often collect lists of conceptual properties produced by human subjects. From these data, different measures can be computed. In particular, inter-concept similarity is an important variable used in experimental studies. Among possible similarity measures, the cosine of conceptual property frequency vectors seems to be a de facto standard. However, there is a lack of comparative studies that test the merit of different similarity measures when computed from property frequency data. The current work compares four different similarity measures (cosine, correlation, Euclidean and Chebyshev) and five different types of data structures. To that end, we compared the informational content (i.e., entropy) delivered by each of those 4 × 5 = 20 combinations, and used a clustering procedure as a concrete example of how informational content affects statistical analyses. Our results lead us to conclude that similarity measures computed from lower-dimensional data fare better than those calculated from higher-dimensional data, and suggest that researchers should be more aware of data sparseness and dimensionality, and their consequences for statistical analyses.
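As a rough illustration of the quantities named in the abstract, the sketch below computes the four measures between rows of a small concept-by-property frequency matrix and then estimates the Shannon entropy of the resulting pairwise values. The toy matrix, the function names, and the histogram-based entropy estimate with 10 bins are assumptions made for illustration only. The abstract does not specify how entropy was computed, so this should not be read as the authors' procedure.

```python
import numpy as np

def pairwise_measures(freq):
    """Pairwise measures between rows of a concept-by-property frequency matrix.

    Assumes every row has at least one nonzero entry and is not constant
    (otherwise cosine or correlation is undefined).
    """
    freq = np.asarray(freq, dtype=float)
    # Cosine: dot product of length-normalised rows.
    unit = freq / np.linalg.norm(freq, axis=1, keepdims=True)
    cosine = unit @ unit.T
    # Pearson correlation between rows.
    correlation = np.corrcoef(freq)
    # Element-wise differences for every pair of rows.
    diff = freq[:, None, :] - freq[None, :, :]
    euclidean = np.sqrt((diff ** 2).sum(axis=2))
    chebyshev = np.abs(diff).max(axis=2)
    return {
        "cosine": cosine,
        "correlation": correlation,
        "euclidean": euclidean,
        "chebyshev": chebyshev,
    }

def upper_triangle(matrix):
    """Values for distinct concept pairs only (strict upper triangle)."""
    i, j = np.triu_indices_from(matrix, k=1)
    return matrix[i, j]

def histogram_entropy(values, bins=10):
    """Shannon entropy (bits) of a histogram of the values.

    The binning scheme is an assumption of this sketch.
    """
    counts, _ = np.histogram(values, bins=bins)
    p = counts[counts > 0] / counts.sum()
    return float(-(p * np.log2(p)).sum())

# Hypothetical toy data: 3 concepts (rows) by 4 properties (columns).
toy_freq = np.array([
    [10, 0, 5, 0],
    [8, 0, 4, 0],
    [0, 7, 0, 3],
])

measures = pairwise_measures(toy_freq)
entropies = {
    name: histogram_entropy(upper_triangle(matrix))
    for name, matrix in measures.items()
}
```

In this toy matrix most cells are zero, which mimics the sparseness the abstract warns about: concepts that share no listed property get a cosine of exactly 0, so many pairs collapse onto the same value and the distribution of similarities carries little information.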
We thank two anonymous reviewers for their useful comments on a previous version of this manuscript. We also thank Eyal Sagi for his valuable input regarding the ideas discussed here. This research was carried out with funds provided by grant 1200139 from the Fondo Nacional de Desarrollo Científico y Tecnológico (FONDECYT) of the Chilean government to the first and second authors.
Conflict of interest
The authors declare that they have no conflict of interest.
This article is part of the special topic ‘Eliciting Semantic Properties: Methods and Applications’, guest-edited by Barry Devereux and Alessandro Lenci.
Handling editor: Alessandro Lenci; Reviewers: David Vinson (University College London), Cai Wingfield (Lancaster University).
Cite this article
Canessa, E., Chaigneau, S.E., Moreno, S. et al. Informational content of cosine and other similarities calculated from high-dimensional Conceptual Property Norm data. Cogn Process 21, 601–614 (2020). https://doi.org/10.1007/s10339-020-00985-5
- Cosine similarity
- Euclidean distance
- Chebyshev distance
- Conceptual properties