Introduction

In the current work, we discuss a measure of conceptual homogeneity and illustrate its potential by using it to analyze differences between two sets of concepts and two populations. Our data was collected using a semantic Property Listing Task (PLT, Lenci et al., 2013), where people freely produce featural descriptions of a given concept. Consequently, our measure of homogeneity quantifies, across participants, the average correspondence between the descriptions (i.e., properties/features) that were produced for a given concept. The more similar across participants the descriptions are, the greater the homogeneity.

Concepts are probably variable and not homogeneous across a population, and differences may exist even when people conceptualize a situation similarly. Philosophers have pointed out the theoretical difficulties in asserting that different people share strictly the same concepts (Frege, 1893; Glock, 2009; Russell, 1997). Empirical evidence also suggests that people instantiate a given concept non-homogeneously, and that even a single person may instantiate the same concept differently on two different occasions (Barsalou, 1987, 1993). In the current work, we take this non-homogeneity to be a fundamental characteristic of naturally occurring concepts. A simple source of non-homogeneity is learning. When concepts are learned in natural environments (in contrast to experimental environments), people will most likely be exposed to different training sets, and so they will develop different versions of putatively the same concept.

In the current work, we will use the idea that concepts have variable instantiations as a guiding principle and offer a quantitative probabilistic measure of this non-homogeneity (to be explained shortly). To show one example of our measure’s usefulness, we will use it to characterize the concrete versus abstract concept difference and also the differences between congenitally blind and sighted people when conceptualizing the same set of concepts. Both issues have been highly researched topics.

Differences between concrete and abstract concepts

There is a large literature on the differences between abstract and concrete concepts. Our reading of the literature leads us to conclude that an essential difference between these types of concepts is that, while concrete concepts depend more on perceptual information than abstract concepts, abstract concepts rely to a large extent on social and linguistic input (for a good review of the evidence, see Borghi et al., 2019; for a critical view, see Willems & Casasanto, 2011). Importantly, in our analysis, perceptual information is predicted to introduce greater homogeneity in the semantic properties produced, leading to concrete concepts being more homogeneous than abstract concepts.

Compared to abstract concepts, concrete concepts are easier to learn and process (e.g., Jones, 1985; Walker & Hulme, 1999), are characterized by a larger number of conceptual features (Plaut & Shallice, 1991, 1993), and are more closely related to specific contexts (Schwanenflugel et al., 1988; Schwanenflugel & Shoben, 1983). A summary of all this research might be that semantic memory (SM) is more densely structured for concrete than for abstract concepts (Jones, 1985; Plaut & Shallice, 1993; Recchia & Jones, 2012; Yap & Pexman, 2016). A richer semantic structure would make concrete concepts easier to access. In contrast, having a less densely structured representation in memory is coherent with abstract concepts having more different senses (Hoffman et al., 2013).

In addition to these differences in semantic richness and context dependence, and as foreshadowed at the beginning of this section, several authors have proposed that the difference between concrete and abstract concepts hinges on the type of features that correspond to each type of concept. In this view, concrete concepts depend on perceptual content, while abstract concepts depend on linguistic information (e.g., while dog may be described by “barks,” “has four legs,” and “is hairy,” justice may be described by “fairness” and “law;” Barsalou et al., 2008; Breedin et al., 1994; Paivio, 1986; Wiemer-Hastings & Xu, 2005). This view is in line with the proposal that conceptual processing involves reactivating perceptual representations (Barsalou, 1999; Feldman, 2010; Gallese & Lakoff, 2005; Prinz, 2002; Pulvermüller, 2005).

Previous studies provide evidence consistent with the idea that people reactivate perceptual features during language comprehension (Lupyan & Ward, 2013; Ostarek & Huettig, 2017), during property verification (i.e., Is y a property of concept x? Kan et al., 2003; Solomon & Barsalou, 2004), and during semantic property listing (Santos et al., 2011). Consequently, in the current work we hypothesize that concrete concepts are characterized more by perceptual information than abstract concepts, and that this perceptual information introduces a greater homogeneity in conceptualization for concrete versus abstract concepts. For expository purposes, we will call these our characterizing concreteness hypotheses.

Differences in semantic representations between congenitally blind and sighted individuals

As previously discussed, it is possible that concrete concepts are characterized by having more perceptual content than abstract concepts, and that this perceptual content may introduce a greater homogeneity in conceptual representations and processing. If these hypotheses are correct, then they suggest that we should find that congenitally blind individuals, because they lack visual perceptual information, should show differences when processing concrete concepts, but not when processing abstract concepts, which seem to depend more on linguistic and social input (Borghi et al., 2017; Borghi & Cimatti, 2009). For expository purposes, we will call this our role of vision hypothesis.

There is in fact previous evidence consistent with our hypothesis that congenitally blind subjects should process concrete concepts differently. Blind individuals show differences in performance relative to sighted individuals when visual information (e.g., color) is critical for judgments (Connolly et al., 2007). Similarly, Kim et al. (2019) found that though blind subjects used general-purpose inferential mechanisms to acquire knowledge about appearances (e.g., that all birds have feathers), they showed systematic differences relative to sighted people when judging similarity based on shape, knowledge that is highly dependent on vision (i.e., choosing the dissimilar item in a triad odd-one-out paradigm, for example, choosing the different animal in the wolf, gorilla, and bear triad).

However, our hypothesis might not be correct. There is a fair amount of evidence suggesting that conceptual representations are strikingly similar in sighted and blind subjects (Landau & Gleitman, 1985; Marmor, 1978; Zimler & Keenan, 1983) and that though there are some detectable differences in early development, language acquisition and use are remarkably resilient to the lack of visual input (Pérez-Pereira, 2006). It is likely that congenitally blind subjects can use statistical regularities in the language experienced in their communities (Erickson & Thiessen, 2015; Steyvers & Tenenbaum, 2005) to acquire knowledge of semantic relations, even when they do not have direct access to the perceptual information that underlies those statistical regularities (e.g., they know that zebra and penguin are similar in that they are “black” and “white,” even if they have never had the corresponding visual experiences). Thus, it is an open question whether our hypothesis about processing differences between blind and sighted subjects should hold or not.

Agreement probability as a measure of homogeneity

The semantic PLT is a procedure widely used in psychology to obtain property-based descriptions of concepts coded in language (Cree & McRae, 2003; Hampton, 1979; McRae et al., 2005; Rosch et al., 1976). Though there are slight differences in the way the task is implemented by different researchers, the general procedure is to ask subjects to produce properties that are typically true of a given concept. Once lists are obtained, they are generally coded into property types (i.e., responses with only superficial differences across subjects are coded as a single property) and accumulated across participants to obtain property frequency distributions. When the PLT is used to collect properties across whole semantic fields, the resulting data can be organized in Conceptual Property Norms (CPNs, e.g., Devereux et al., 2014; Kremer & Baroni, 2011; Lenci et al., 2013; McRae et al., 2005; Montefinese et al., 2013; Vivas et al., 2017).

As a way of measuring homogeneity in the PLT, here we compute agreement probability (p(a); Chaigneau et al., 2012), which will be explained in detail in the next section. Conceptually, agreement probability (p(a)) is defined as the probability that one property taken randomly from one list produced by an average subject in a PLT is also found in another list produced by a different average subject for the same PLT. By average subject, here we mean a hypothetical participant who on average represents the lists generated across participants, and thus, not a specific individual who produced a particular list. Lists may come from the same concept (the two lists were produced for the same concept C1) or from different concepts (the two lists were produced for two different concepts C1 and C2). It is called agreement probability because it is a measure of the agreement in the properties being listed. The maximum agreement will be produced when all subjects produce the same list (same properties, same length). In that case, p(a) = 1. The minimum agreement will be produced when all subjects produce different lists (different properties, not necessarily different lengths).

Quantifying homogeneity by using p(a) is important given that the instantiation of a concept, and thus the properties with which people describe it, depends on multiple factors. Hence, p(a) may be used as a measure of how sensitive a concept is to those multiple factors, whichever they are. On different occasions, concepts can be instantiated differently (e.g., the concept to jump may be instantiated differently in the context of “extreme sports” from the context of “children”). It is likely that concepts are sensitive to the contexts in which they occur in terms of how frequently a given context is associated with a given concept (e.g., bill occurs more frequently at a restaurant and less frequently at a beach). It is also likely that concepts are sensitive to context in terms of different senses being associated with different contexts (e.g., bill adopts a different sense in the context of restaurant than in the context of government). These factors are likely to introduce non-homogeneity in concepts (i.e., lack of agreement in lists being produced) because people may adopt different points of view when producing property lists after having been cued with a given concept. Other individual factors may also introduce lack of agreement (e.g., subjects being influenced by recent events in memory, or by idiosyncrasies in how a concept was learned or is processed). Therefore, agreement can be interpreted as the degree to which a concept is independent from all those factors, where the higher the p(a) for a concept, the more independent the concept is from all those factors.

Note that, because homogeneity in conceptualization might be influenced by multiple factors (as discussed immediately above), different homogeneity estimates could be obtained if it were measured in a different task. For example, in conversation, it is possible that people will progress to higher agreements due to their history of interactions (Fay et al., 2018). However, to the best of our knowledge, there is no similar measure in the literature, and we hypothesize that, though other measures could produce different estimates, the results we report here should hold, at least in relative terms. Additionally, p(a) has the advantage that it summarizes, in a single indicator, information that is routinely obtained in PLTs, such as the average length of the property lists produced, the total number of unique properties produced for a given concept by a group of subjects, and the property frequency distributions found in a CPN. These advantages will be better appreciated when we present the mathematical properties of p(a).

As will be discussed below, computing p(a) from frequency distributions of conceptual properties involves a combinatorial problem, which makes it impractical to use combinatorial formulae to compute it. Instead, we resort to computational simulations that deliver a close estimate for p(a).

Computing and interpreting the meaning of agreement probability

To understand agreement probability, consider the following simple example. By asking people to produce conceptual properties for two related concepts or two versions of the same concept (C1, C2), two property frequency distributions are obtained. For concept C1, subjects produced properties a, b, c. For concept C2, subjects produced properties c, d, e. This situation is shown in Fig. 1, where for example C1 = dog and C2 = cat. Note that to simplify the example, these are equiprobable distributions (i.e., properties in each distribution occur the same number of times). Assume now that subjects produced samples of average size = 2 for C1 (the average number of properties mentioned by people for C1 is 2, s1 = 2) and also 2 for C2 (s2 = 2). Imagine that for concept C1, subjects produced the following properties: a = it barks, b = wags its tail, c = has four legs and for C2: c = has four legs, d = it meows, and e = catches mice.

Fig. 1

Two concepts C1 and C2 with their corresponding sets of properties obtained in a PLT (C1 = {a, b, c} and C2 = {c, d, e}) and their intersection (C1 ∩ C2 = {c}), with k1 (number of C1’s properties) = 3, k2 (number of C2’s properties) = 3, and u (number of properties in the intersection) = 1

According to conceptual agreement theory (CAT, Chaigneau et al., 2012) agreement probability p(a) is the probability that one property randomly chosen from a sample of size s2 of properties extracted from the set of all k2 properties that are listed in a PLT for a concept C2, is contained in a sample of size s1 randomly obtained from the set of all k1 properties that are listed in a PLT for a concept C1. CAT’s mathematical formulation allows calculating p(a) using expression (1), where Table 1 defines each of the variables:

$$p(a)=\frac{1}{s_2}\ {\sum}_{i=1}^{n_1}{\sum}_{j=1}^{n_2}\#\left({S}_i^1\cap {S}_j^2\right)\ {p}_i\ {q}_j$$
(1)
Table 1 Definition of variables used in expressions for calculating p(a)

Equation (1) is the summation of the expected value of the number of common elements between samples \({S}_i^1\) of properties listed for C1 and independent samples \({S}_j^2\) of properties listed for C2 (i.e., the \(\#\left({S}_i^1\cap {S}_j^2\right)\) term), taking into account the probabilities of each sample (i.e., the pi and qj). Because one property is randomly chosen from the comparison sample of size s2, the summation in Eq. (1) is divided by s2. The interested reader may find the complete mathematical and theoretical development of p(a) in Chaigneau et al. (2012); here we present only the details necessary to understand the present work. To aid the reader in comprehending Eq. (1) and the definitions in Table 1, we next present a simple example that illustrates the application of these expressions.

Following the previous example, we have that:

$$\begin{array}{ll}\text{Properties listed for } C1=\left\{a,b,c\right\} & \text{Properties listed for } C2=\left\{c,d,e\right\}\\ k_1=3\ \text{and assuming } s_1=2 & k_2=3\ \text{and assuming } s_2=2\\ u=1\ \left(\text{one common element, i.e., } c\right)\text{, and thus,} & \\ n_1=n_2=\dfrac{3!}{\left(3-2\right)!\,2!}=3 & \end{array}$$

the n1 samples from the properties listed for C1 are {ab, ac, bc} (\({S}_i^1\in\) {ab, ac, bc}) and the n2 samples from the properties listed for C2 are {cd, ce, de} (\({S}_j^2\in\) {cd, ce, de})

For simplicity, assume that each sample in C1 and in C2 has an equal probability of being selected and thus, pi = 1/n1 = 1/3 and qj=1/n2 = 1/3. Then, using Eq. (1):

$$p(a)=\frac{1}{2}\sum\limits_{i=1}^3\sum\limits_{j=1}^3\#\left({S}_i^1\cap {S}_j^2\right)\frac{1}{3}\frac{1}{3}=\frac{1}{18}\ \sum\limits_{i=1}^3\sum\limits_{j=1}^3\#\left({S}_i^1\cap {S}_j^2\right)$$
(2)

In Eq. (2), the double summation corresponds to the sum of counts of coincidences between each sample \({S}_i^1\) and \({S}_j^2\), for example:

#( \({S}_1^1\) ∩ \({S}_1^2\) ) = #(ab ∩ cd) = 0

#( \({S}_1^1\) ∩ \({S}_2^2\) ) = #(ab ∩ ce) = 0, and so on, until

#( \({S}_3^1\) ∩ \({S}_3^2\) ) = #(bc ∩ de) = 0

For this example, each term of the double summation is

#( \({S}_i^1\cap {S}_j^2\)) = {0,0,0,1,1,0,1,1,0} and hence p(a) = 4/18 = 2/9
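The enumeration above can be verified computationally. The following sketch (ours; the function name is illustrative, not from Chaigneau et al., 2012) evaluates Eq. (1) exhaustively for the uniform case, where every sample pair has probability 1/(n1 · n2):

```python
from itertools import combinations

def agreement_probability(props1, s1, props2, s2):
    """Exhaustive evaluation of Eq. (1) for equiprobable samples.

    A minimal sketch for the uniform toy example: with nonuniform
    property frequencies, each sample pair would instead be weighted
    by its own probabilities p_i and q_j.
    """
    samples1 = list(combinations(props1, s1))  # the n1 samples S_i^1
    samples2 = list(combinations(props2, s2))  # the n2 samples S_j^2
    n1, n2 = len(samples1), len(samples2)
    # sum of #(S_i^1 ∩ S_j^2) over all sample pairs
    total = sum(len(set(a) & set(b)) for a in samples1 for b in samples2)
    # each pair has probability (1/n1)(1/n2); divide by s2 as in Eq. (1)
    return total / (n1 * n2 * s2)

# Toy example from the text: C1 = {a, b, c}, C2 = {c, d, e}, s1 = s2 = 2
pa = agreement_probability(['a', 'b', 'c'], 2, ['c', 'd', 'e'], 2)  # 4/18 = 2/9
```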

Probability p(a) tells us that, for individuals who have listed properties for concepts C1 (e.g., dog) and C2 (e.g., cat), there is a 2/9 probability that if one average participant listed a given property for C2 (e.g., cat), that same property will appear in the list produced by a different average participant for C1 (e.g., dog). Several things are noteworthy. First, p(a) is a measure of homogeneity because maximal homogeneity will be achieved when all participants in a PLT produce the same list, and minimal homogeneity will be obtained when all participants produce different lists.

Second, the reader may have noted that we assumed that the frequency distribution of the properties is uniform (i.e., properties in the distribution occur the same number of times, and thus pi, qj are the same for all i and j, see Eq. (1) and definitions in Table 1). This is an idealized case, and it is highly unlikely that real data would ever produce it. However, idealized models may have the virtue of reducing a problem to its essential characteristics. For such a case, we can demonstrate that (see Appendix A):

$$p(a)=\frac{s_1}{k_1}\frac{u}{k_2}$$
(3)

where s1 is the average number of properties in a group member’s sample of conceptual content for concept C1, k1 is the total number of properties listed at least once for C1, u is the number of properties common to the two concepts’ (C1 and C2) property distributions, and k2 is the total number of properties listed at least once for concept C2. Thus, p(a) indexes how well separated two distributions are: the fewer the common properties (u), the lower the probability. Note that for our simple example above, Eq. (3) necessarily gives the same result as Eq. (2) (p(a) = s1/k1 × u/k2 = 2/3 × 1/3 = 2/9).

Third, note that p(a) is not symmetric with respect to concepts C1 and C2; i.e., calculating p(a) for concept C1 relative to C2 does not necessarily give the same value as computing it for concept C2 relative to C1. Equations (1) and (3) give p(a) when C1 is the reference concept and C2 the comparison concept (i.e., p(a) for C1 relative to C2). When C2 is instead the reference and C1 the comparison concept, the expressions are analogous but with the subscripts exchanged; in (3), p(a) = s2/k2 × u/k1. This asymmetry tells us that the probability that a property contained in a sample for concept C2 is also obtained in another sample of properties for concept C1 is not necessarily the same as the probability that a property contained in a sample for concept C1 is also obtained in another sample of properties for concept C2. This asymmetry is important to keep in mind when analyzing p(a) for concrete and abstract concepts, as well as for blind versus sighted subjects, and we will return to it in the corresponding analyses.
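The uniform-case asymmetry is easy to see with Eq. (3) in code. The sketch below is ours; the numbers in the second example are hypothetical, chosen only to make the two directions differ (they do not come from the text):

```python
def pa_uniform(s_ref, k_ref, k_cmp, u):
    """Eq. (3) for uniform property frequency distributions:
    p(a) = (s_ref / k_ref) * (u / k_cmp),
    where the reference concept contributes s_ref and k_ref,
    and the comparison concept contributes k_cmp."""
    return (s_ref / k_ref) * (u / k_cmp)

# Text example (s1 = s2 = 2, k1 = k2 = 3, u = 1): both directions give 2/9
pa_c1_vs_c2 = pa_uniform(2, 3, 3, 1)  # 2/9

# Hypothetical unequal case: s1 = 2, k1 = 4 versus s2 = 3, k2 = 5, with u = 2
forward = pa_uniform(2, 4, 5, 2)   # C1 as reference: (2/4)(2/5) = 0.2
backward = pa_uniform(3, 5, 4, 2)  # C2 as reference: (3/5)(2/4) = 0.3
```

Because s1 = s2 and k1 = k2 in the text’s toy example, the asymmetry only becomes visible once the list lengths or property counts differ, as in the hypothetical case.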

Finally, and as already stated, p(a) may be also computed for the same concept. In this case, there is only one concept C1 and p(a) is the probability that one property randomly chosen from a sample of size s1 of properties extracted from the set of all k1 properties that are listed for a concept C1, is contained in a different sample of size s1 randomly obtained from the set of all k1 properties that are listed for the same concept C1.

Thus, the same expression (1) and definitions in Table 1 apply, but s1 = s2, k1 = k2, n1 = n2, pi = qj, and samples \({S}_i^1\) and \({S}_j^2\) are drawn from the same distribution of properties of concept C1. Hence, based on Eq. (1) and taking into account that now we are computing p(a) for the same concept C1, we can write:

$$p(a)=\frac{1}{s_1}\ \sum\limits_{i=1}^{n_1}\sum\limits_{j=1}^{n_1}\#\left({S}_i^1\cap {S}_j^1\right)\ {p}_i\ {p}_j$$
(4)

Note that in Eq. (4) the samples \({S}_i^1\) and \({S}_j^1\) both have superscript 1, which indicates that they are independently drawn from the same distribution of properties of concept C1. Additionally, note that we replaced qj by pj so that it is clearer that those probabilities correspond to samples drawn from the same distribution of properties.

With regard to computing p(a) for the same concept and for uniform property frequency distributions, Eq. (4) becomes:

$$p(a)=\frac{s_1}{k_1}$$
(5)

because in Eq. (3) and for the same list of properties for concept C1, it will always happen that u = k2, i.e., for the same list of properties obtained for a concept, the number of common elements will be the same as the number of properties obtained for the concept (see Appendix A for a more formal demonstration). That fact also tells us that for concepts with uniform property frequency distributions, p(a) for two different concepts calculated using Eq. (3) will always be lower than p(a) computed for one of those concepts using Eq. (4), i.e., agreement probability for two different concepts will always be lower than p(a) for one of the concepts with itself (see Appendix A for a demonstration).

To help understand the computation of p(a) for the same concept, let’s use the same example shown in Fig. 1 and calculate p(a) for C1. Then we have that:

  • Properties listed for C1 = {a, b, c}

  • k1 = 3 and assuming s1 = 2

  • n1 = 3! / ((3 − 2)! 2!) = 3

the n1 samples from the properties listed for C1 are {ab, ac, bc} (\({S}_i^1\in\) {ab, ac, bc} and \({S}_j^1\in\) {ab, ac, bc})

For simplicity, assume that each sample in the properties listed for C1 has an equal probability of being selected and thus, pi = pj =1/n1 = 1/3.

And thus applying those values to Eq. (4):

$$p(a)=\frac{1}{2}\sum\limits_{i=1}^3\sum\limits_{j=1}^3\#\left({S}_i^1\cap {S}_j^1\right)\frac{1}{3}\frac{1}{3}=\frac{1}{18}\kern0.5em \sum\limits_{i=1}^3\sum\limits_{j=1}^3\#\left({S}_i^1\cap {S}_j^1\right)\kern0.5em$$
(6)

In Eq. (6), the double summation corresponds to the sum of counts of coincidences between each sample \({S}_i^1\) and \({S}_j^1\), for example:

#( \({S}_1^1\) ∩ \({S}_1^1\) ) = #(ab ∩ ab) = 2

#( \({S}_1^1\) ∩ \({S}_2^1\) ) = #(ab ∩ ac) = 1, and so on, until

#( \({S}_3^1\) ∩ \({S}_3^1\) ) = #(bc ∩ bc) = 2

For this example, each term of the double summation is

#( \({S}_i^1\) ∩ \({S}_j^1\) ) = {2,1,1,1,2,1,1,1,2} and hence p(a) = 12/18 = 2/3

Note that p(a) = 2/3 is the same value computed using Eq. (5), i.e., p(a) for the same concept = s1/k1 = 2/3. As is easily seen from Eq. (5), p(a) is a measure of conceptual homogeneity in the group of subjects that produced the lists. Its maximal value is reached only when all group members produce the same set of properties for the concept in question (i.e., when s1 = k1). In contrast, its minimal value is approached when each group member produces unique properties. Note also that p(a) for concept C1 relative to C2 (2/9) is lower than p(a) for the same concept C1 (2/3), or for the same concept C2 = s2/k2 = 2/3.
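As with the two-concept case, this self-agreement computation can be checked by exhaustive enumeration. A minimal sketch (ours; the function name is illustrative), specializing Eq. (4) to equiprobable samples drawn twice from the same distribution:

```python
from itertools import combinations

def pa_same_concept(props, s):
    """Exhaustive evaluation of Eq. (4) for equiprobable samples:
    both samples are drawn independently from the same property set."""
    samples = list(combinations(props, s))  # the n1 samples S_i^1
    n = len(samples)
    total = sum(len(set(a) & set(b)) for a in samples for b in samples)
    return total / (n * n * s)

# Toy example: C1 = {a, b, c}, s1 = 2 -> 12/18 = 2/3, matching Eq. (5): s1/k1
pa_c1 = pa_same_concept(['a', 'b', 'c'], 2)
```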

One last feature that is interesting to note is that p(a) for the same concept computed with Eq. (5) provides a lower bound for this probability regardless of a distribution’s statistical structure (see Appendix A). In other words, p(a) for the same concept cannot reach a value lower than s1/k1. A direct consequence of this is that statistical structure in property frequency distributions (i.e., nonuniformity) will in general increase homogeneity, which is intuitively correct.

As discussed in Canessa and Chaigneau (2016), computing agreement probability from frequency distributions of conceptual properties involves a combinatorial problem. As shown in Eq. (1), it requires counting coincidences among pairs of samples weighted by their respective probabilities, where a sample is the conceptual content (i.e., properties) provided by an average individual contributing data to the distribution. That equation has the advantage of being formulated for the general case of nonuniform property frequency distributions, but the number of samples (combinations) that must be taken into account grows rapidly as s1, s2 and/or k1, k2 increase. For example, for a realistic PLT, a concept may have s1 = 7 and k1 = 30, and thus n1 = 2,035,800 (see Table 1 for the expression that calculates n1). This makes expression (1) impractical for real PLTs. Therefore, in Canessa and Chaigneau (2016) we presented a simulator that emulates the property comparison process underlying p(a) (i.e., counting the number of times a property found in a randomly selected sample is also found in a second randomly selected sample, over the total number of selected samples) and that estimates that probability with no statistically significant difference from the exact values that would be computed using Eq. (1). For the simulator’s detailed algorithm, the interested reader may consult Canessa and Chaigneau (2016). Here we briefly describe it, so that the parameters that must be input to the simulator, and that will be used in this paper, are understood.

The simulator receives the property frequency distributions of the properties listed for concepts C1 and C2 and their corresponding s1 and s2 values. First, the simulator probabilistically draws one sample of s1 properties without replacement from the properties listed for C1 and another sample of s2 properties without replacement from the properties listed for C2. We label the first sample the reference sample and the second the comparison sample. Note that the sampling probability of each property corresponds to the frequency of that property relative to the summed frequencies of all properties. In our simple example, if the frequencies in concept C1 were a = 15, b = 20, and c = 10, then the probability of sampling a would be 15 / (15 + 20 + 10) = 1/3, and similarly b = 20/45 = 4/9 and c = 10/45 = 2/9. The simulator then randomly selects one property from the comparison sample and, if that property is contained in the reference sample, increments a pa_counter. This is done max_iterations times, and p(a) is then approximated by pa_counter / max_iterations. Additionally, the simulator has two more inputs. Given that the approximation gets closer to the true value of p(a) as the simulator iterates (as max_iterations tends to infinity, the approximation converges to the true p(a) value), a moving average of p(a) can be computed over the last nr_points_moving_avg iterations. Finally, the simulation can be repeated nr_repetitions times to calculate a mean and standard deviation of p(a) from the values computed in the individual repetitions. This simulator was implemented in NetLogo v. 6.2.1 (Wilensky, 1999) and is available at https://osf.io/xhfmz/?view_only=31c08caa642f42c694425a4f2b46a8b4, along with data files and instructions on how to use the simulator.
For this work, the simulator’s parameters were set as follows: max_iterations = 5000, nr_points_moving_avg = 1000 and nr_repetitions = 50. The property frequency distributions for each concept may be found in the abovementioned URL and were obtained from Lenci et al. (2013) norms for concrete and abstract concepts, and for sighted and blind individuals, as described in the next section.
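A minimal Python re-implementation of the sampling loop may clarify the algorithm. This is our own sketch of the procedure described above, not the NetLogo implementation itself, and it omits the moving-average and repetitions machinery:

```python
import random

def simulate_pa(freqs1, s1, freqs2, s2, max_iterations=5000, seed=0):
    """Monte Carlo approximation of p(a): a sketch of the sampling
    procedure described in the text (hypothetical function name).
    freqs1/freqs2 map each property to its listing frequency."""
    rng = random.Random(seed)

    def weighted_sample(freqs, size):
        # Draw `size` properties without replacement, each draw
        # proportional to the remaining properties' frequencies.
        pool = dict(freqs)
        chosen = []
        for _ in range(size):
            r = rng.uniform(0, sum(pool.values()))
            acc = 0.0
            for prop, f in pool.items():
                acc += f
                if r <= acc:
                    chosen.append(prop)
                    del pool[prop]
                    break
        return chosen

    hits = 0  # the pa_counter of the text
    for _ in range(max_iterations):
        reference = set(weighted_sample(freqs1, s1))
        comparison = weighted_sample(freqs2, s2)
        # pick one property from the comparison sample; count a hit
        # if it also appears in the reference sample
        if rng.choice(comparison) in reference:
            hits += 1
    return hits / max_iterations

# Uniform toy example from Fig. 1: the estimate should approach 2/9
est = simulate_pa({'a': 1, 'b': 1, 'c': 1}, 2, {'c': 1, 'd': 1, 'e': 1}, 2)
```

With 5,000 iterations the standard error of the estimate for this example is roughly 0.006, so the estimate lands close to the exact 2/9.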

Difference in agreement probability between concrete and abstract concepts, and between sighted and blind individuals

Participants and data collection procedures

To show one example of the application of agreement probability as a measure of homogeneity in lists produced by subjects in a PLT, we resorted to data collected in the Lenci et al. (2013) norms, which report properties for 70 concepts (50 abstract and concrete nouns, and 20 verbs). The Lenci et al. data are freely available on the web. In this work, we use the concrete (NC = 40) and abstract (NA = 10) nouns, which were classified as such in Lenci et al. (2013). Here we provide just the most important details of the Lenci et al. (2013) norms; for more particulars see the corresponding paper. Appendix B shows the 40 concrete and 10 abstract nouns. Concrete nouns cover living and nonliving things, most of which were already used in previous norms (Kremer & Baroni, 2011; McRae et al., 2005) or in experiments with blind subjects (Connolly et al., 2007). These concrete concepts included things with salient visual features (e.g., “stripes” for zebra; “yellow” for banana). Abstract concepts included emotions (e.g., jealousy) and ideals (e.g., freedom). Forty-eight Italian subjects (N = 48), 22 congenitally blind (NB = 22) and 26 sighted (NS = 26), were included in the study, all of them native Italian speakers. The blind participants were 10 females and 12 males with an average age of 47.2 years (s.d. = 16.5) and with education ranging from junior high school to a master’s degree. The 26 sighted participants were selected to match the blind subjects as closely as possible regarding age, gender, residence, education, and profession. Sighted subjects’ average age was 45.1 years (s.d. = 16.8). Subjects were instructed to orally describe the concepts with short phrases and listened to the concepts in random order. To avoid excessive fatigue, the 70 concepts were split into two separate sessions, each containing a 5-minute break in the middle. The entire procedure was administered on a laptop, and the oral responses were recorded in digital audio.
The oral responses were transcribed to text using an automated software program. The text was then coded by a trained coder using standard coding procedures (Kremer & Baroni, 2011; McRae et al., 2005).

Relating visual perceptual strength to agreement probability

According to our characterizing concreteness hypotheses, concrete concepts should be characterized by more perceptual information than abstract concepts, and this perceptual information should introduce a greater homogeneity in conceptualization for concrete versus abstract concepts. To test these hypotheses, we resorted to the perceptual modality norms in Vergallito et al. (2020). In those norms, 57 sighted Italian participants rated concepts for their perceptual strength in each of five sensory modalities (i.e., vision, touch, smell, hearing, taste). Subjects were asked to rate, on a scale of 1 to 5, to what extent a given concept was experienced through each of these senses (e.g., the concept sweet may receive a high rating for taste and lower ratings for the other modalities). A total of 20 concepts (15 concrete and 5 abstract) in the Vergallito et al. (2020) norms were also present in the Lenci et al. (2013) norms, so we used them in this analysis (see those concepts in Appendix B). Given our emphasis on the visual modality in the current work, we used only the visual ratings. As predicted, those 15 concrete concepts showed significantly higher visual strength ratings (M = 4.8, s.d. = .08) than the five abstract concepts (M = 3.3, s.d. = .23) (t(4.32) = 14.249, p < .001; adjusted for unequal variances, F = 8.55, p = .009). The observed statistical power for this test at α = 0.05 is above 0.99, and thus, despite the rather small sample size, this result is reliable.
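The unequal-variances adjustment can be reproduced from the summary statistics alone. The sketch below is ours (the function name is illustrative, and it is not the original analysis script); it computes Welch’s t and degrees of freedom, and plugging in the rounded summary statistics above recovers values very close to the reported t(4.32) = 14.249:

```python
import math

def welch_t_from_summary(m1, sd1, n1, m2, sd2, n2):
    """Welch's t statistic and degrees of freedom (Welch-Satterthwaite)
    computed from group summary statistics."""
    v1, v2 = sd1**2 / n1, sd2**2 / n2          # squared standard errors
    t = (m1 - m2) / math.sqrt(v1 + v2)
    df = (v1 + v2)**2 / (v1**2 / (n1 - 1) + v2**2 / (n2 - 1))
    return t, df

# Visual strength: concrete (M = 4.8, sd = .08, n = 15)
# versus abstract (M = 3.3, sd = .23, n = 5)
t, df = welch_t_from_summary(4.8, 0.08, 15, 3.3, 0.23, 5)
# t ≈ 14.3 with df ≈ 4.3, consistent with the reported t(4.32) = 14.249
```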

Regarding our use of only visual perceptual strength, we must note that it is also possible that other perceptual information (e.g., olfactive, haptic, etc.) would also introduce homogeneity in property listing. Though we believe this is an interesting problem that could be tackled by our measure, it is well beyond the scope of the current work and we defer it for future work.

The data also supported our second characterizing concreteness hypothesis. Using the p(a) simulator described in Computing and interpreting the meaning of agreement probability and the concepts’ property frequency distributions obtained from the Lenci et al. (2013) norms, we computed agreement probability for the 15 concrete and 5 abstract concepts for which the Vergallito et al. (2020) norms provided perceptual strength ratings. As predicted, our p(a) measure correlated positively with visual strength ratings. For sighted subjects, the correlation is r(20) = .59 (t(18) = 3.100, p = .006; observed statistical power at α = 0.05 of 0.87), and for blind subjects it is r(20) = .62 (t(18) = 3.353, p = .004; observed statistical power of 0.92). These positive and statistically significant correlations show that the higher/lower p(a)s exhibited by concrete/abstract concepts are associated with higher/lower visual strength ratings, which is consistent with the hypothesized homogenizing effect of visual perceptual information on property listing for concrete concepts relative to abstract ones. Note also that the high statistical power attained by these tests suggests that the results are reliable rather than spurious findings due to underpowered comparisons. Finally, somewhat surprisingly, the correlation between visual perceptual strength and p(a) is statistically significant for blind subjects, which suggests that visual properties have a homogenizing effect on the lists produced by those participants, even though they cannot directly perceive them. We will elaborate further on this issue in the Discussion section.
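This correlation analysis can be sketched as follows. All 20 paired values are hypothetical placeholders chosen only to illustrate the computation, not the reported data.

```python
# Sketch of the Pearson correlation between visual strength ratings and
# p(a) over 20 concepts (15 "concrete", 5 "abstract"). All values are
# hypothetical placeholders, not the values analyzed in the paper.
from scipy import stats

visual = [4.9, 4.8, 4.7, 4.9, 4.8, 4.7, 4.8, 4.9, 4.8, 4.7,
          4.9, 4.8, 4.7, 4.8, 4.9, 3.1, 3.5, 3.2, 3.4, 3.3]
pa     = [0.19, 0.17, 0.15, 0.18, 0.16, 0.14, 0.17, 0.20, 0.18, 0.15,
          0.19, 0.16, 0.14, 0.17, 0.18, 0.09, 0.12, 0.10, 0.12, 0.11]

r, p = stats.pearsonr(visual, pa)   # two-sided test with n - 2 = 18 df
print(f"r = {r:.2f}, p = {p:.4f}")
```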

Additionally, for the 15 concrete and 5 abstract concepts used here, for sighted subjects, p(a) is higher for concrete concepts (M = 0.17, s.d. = .03) than for abstract ones (M = 0.11, s.d. = .03) (t(18) = 3.352, p = .004; observed statistical power at α = 0.05 of 0.91). The same holds for blind subjects, for whom p(a) is higher for concrete concepts (M = 0.15, s.d. = .03) than for abstract concepts (M = 0.11, s.d. = .02) (t(18) = 3.565, p = .002; observed statistical power at α = 0.05 of 0.85). As we will show in the next subsection, this result agrees with the more general conclusion for the 50 concrete and abstract concepts in the Lenci et al. (2013) norms.

We acknowledge that, because our results are based on subjective ratings of perceptual strength, other explanations are possible. However, we believe that the results we report next provide converging evidence in support of our explanation, so we defer discussing alternative accounts to our Discussion and conclusions. We now proceed to test our role of vision hypothesis. Recall that this hypothesis predicts that lacking visual perceptual information makes concrete concepts less homogeneous for blind subjects than for sighted participants, reducing the difference in homogeneity between concrete and abstract concepts in a blind population.

Comparing agreement probability between concrete and abstract concepts for sighted and blind subjects

Using the p(a) simulator and the concepts’ property frequency distributions in Lenci et al.’s (2013) norms, we computed agreement probability for concrete and abstract concepts, within sighted and blind participants. Additionally, recall from our discussion in Computing and interpreting the meaning of agreement probability that agreement probability can be calculated for the same concept (i.e., a single property frequency distribution) or for two different concepts or versions of the same concept (i.e., two different distributions obtained from different concepts or from two different samples or populations). Given that sighted and blind individuals separately listed properties for the same set of concepts, we have two different property frequency distributions: one for sighted (S) and another for blind (B) participants. Hence, p(a) may be computed separately using the S distribution and the B distribution, i.e., by separately inputting S and then B to the simulator. These p(a)s quantify the agreement probabilities within the sighted group (labeled S → S) and within the blind group (B → B). We can also compute two other p(a)s: an intergroup (between-groups) agreement probability from sighted to blind (S → B) and from blind to sighted (B → S). Table 2 presents the results of a two-way analysis of variance (ANOVA) (Type of concept × Condition: S → S, B → B, S → B, B → S).
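As a rough illustration of these within- and between-group computations, the sketch below estimates an agreement probability by Monte Carlo: two property lists of length s are sampled from (possibly different) property frequency distributions, and we estimate the chance that a property drawn from the first list also occurs in the second. This is one plausible reading of the general idea, not the authors’ actual simulator, and the frequency distributions and property names are hypothetical.

```python
# Monte Carlo sketch of agreement probability between two property
# frequency distributions. NOT the authors' exact simulator; an
# illustration of the within- vs. between-group logic only.
import random

def simulate_pa(freqs_from, freqs_to, s, n_trials=10_000, seed=42):
    """Estimate p(a) from two property-frequency dicts
    (pass the same dict twice for a within-group computation)."""
    rng = random.Random(seed)

    def sample_list(freqs):
        props, weights = zip(*freqs.items())
        listed = set()
        while len(listed) < s:                 # draw until s unique properties
            listed.add(rng.choices(props, weights)[0])
        return listed

    hits = 0
    for _ in range(n_trials):
        list_a = sample_list(freqs_from)       # one simulated respondent per group
        list_b = sample_list(freqs_to)
        probe = rng.choice(sorted(list_a))     # a random property from list A
        hits += probe in list_b
    return hits / n_trials

# Hypothetical frequency distributions for one concept in two groups:
S = {"has_stripes": 20, "is_animal": 15, "runs_fast": 8, "eats_grass": 3}
B = {"is_animal": 18, "runs_fast": 10, "makes_sounds": 9, "has_stripes": 4}

within = simulate_pa(S, S, s=2)                # S -> S
between = simulate_pa(S, B, s=2)               # S -> B
```

In this toy example the within-group value exceeds the between-group one because the two distributions emphasize different properties, mirroring the pattern the analyses below report for sighted and blind participants.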

Table 2 ANOVA for agreement probabilities for concrete (C) and abstract (A) concepts, and for conditions S → S, B → B, S → B, B → S

From Table 2 we can see that the model as a whole is statistically significant, and that p(a) may differ for some comparisons between concrete and abstract concepts and between sighted and blind subjects. There is also a significant interaction between those two factors. Note that the observed statistical power at α = 0.05 is high for all the ANOVA results; hence, the corresponding results are reliable. We may thus compare and analyze the means of the eight treatments or cells of the ANOVA. To help assess these comparisons visually, Fig. 2 shows the mean p(a) and a 95% CI for each of the eight treatments.

Fig. 2
figure 2

Agreement probability, p(a), for concrete and abstract concepts and for conditions S → S, B → B, S → B, B → S. Bars are 95% CIs. Note that we introduced jitter so that overlapping CIs are better visualized.

From Fig. 2 we can see that p(a) is higher for concrete than for abstract concepts for the S → S (t(48) = 5.612, p < .001) and B → B conditions (t(48) = 5.114, p = .001) (i.e., within groups). Our results for visual perceptual strength lead us to interpret this as showing that concrete concepts show more homogeneity due to the influence of visual/perceptual information, while abstract concepts are in general less homogeneous due to the influence of social and linguistic information.

Also, from Fig. 2 we can see that p(a) for concrete concepts and for condition S → S is higher than for conditions B → B (t(78) = 2.655, p = .01), S → B (t(78) = 27.790, p < .001), and B → S (t(78) = 27.403, p < .001). However, the difference in p(a) between conditions S → B and B → S is not statistically significant (t(78) = 0.321, p = .749). Similarly, p(a) for concrete concepts and for condition B → B is higher than for conditions S → B (t(78) = 31.607, p < .001) and B → S (t(78) = 30.956, p < .001). These results are again consistent with our hypotheses in Differences between concrete and abstract concepts and Differences in semantic representations between congenitally blind and sighted individuals. Perceptual information is probably dominant and imposes homogeneity on the sighted subjects’ sample. Lacking this information in the blind subjects’ sample presumably introduces differences in the lists of properties being produced, which in turn is reflected in the comparisons reported above.

An interesting result that Fig. 2 illustrates is that, for abstract concepts, the difference in p(a) between the S → S and B → B conditions (t(18) = 0.138, p = .892), as well as between S → B and B → S (t(18) = 0.261, p = .797), is not statistically significant. Though this is a null result, and should be considered with care, it is expected by our theoretical analysis. Because abstract concepts should be learned by paying attention to the same social and linguistic input in both blind and sighted populations, there is no reason to expect that the respective list of properties should differ in these comparisons.

A final noteworthy result is that, as shown in Fig. 2, p(a) for abstract concepts is higher for the S → S condition than for the S → B (t(18) = 12.791, p < .001) and B → S (t(18) = 12.408, p < .001) conditions. The same happens for B → B with respect to conditions S → B (t(18) = 14.666, p < .001) and B → S (t(18) = 14.167, p < .001). Interestingly, lists are more homogeneous within groups than between groups, suggesting that factors operate differently in each group to produce these results (e.g., different learning experiences). This is a surprising result, because it suggests that differences in the property lists that characterize the two groups extend beyond concrete concepts. This was expected for concrete concepts, but we currently have no explanation for why it would happen for abstract concepts. This result awaits replication before further discussion.

Classification of concrete versus abstract concepts using several machine learning tools and inputs

Given that we found evidence that agreement probability values differ between abstract and concrete concepts, both for sighted and blind subjects, we used several machine learning (ML) techniques to assess whether agreement probability is able to discriminate abstract from concrete concepts not only at the aggregate level of analysis, but also at the level of individual concepts. To foreshadow, our results show that p(a) can be used to classify abstract versus concrete concepts with a good level of certainty. To better generalize and understand our findings, we used the ML tools k-nearest neighbors (KNN), Gaussian naïve Bayes (NB), decision trees (DT), and support vector machines (SVM). Additionally, and as a baseline, we also employed logistic regression (LR), a simpler regression tool. The inputs to all those tools were: s&k, i.e., s1 (mean list length) and k1 (number of unique properties listed for each concept); equiprobable p(a)eq for each concept (i.e., p(a)eq = s1 / k1, agreement probability without taking into account the property frequency distribution; see Eq. (5)); and non-equiprobable p(a) (i.e., p(a) computed using the simulator, which takes the property frequency distribution into consideration; see Eq. (4) and the description of the simulator). The idea behind using these three variables was to assess whether more parsimonious variables achieve a better classification than more elaborate ones (i.e., s&k is the most parsimonious variable and p(a) the least parsimonious one). The classification performance measure used was the F1 score, which is given by Eq. (7):

$${F}_1\ \mathrm{score}=\frac{2\,TP}{2\,TP+FP+FN}$$
(7)

where TP, FP, FN, and TN are the values of the confusion matrix (TP: true positives; FP: false positives; FN: false negatives; TN: true negatives). The F1 score is the harmonic mean of precision (TP / (TP + FP)) and recall (TP / (TP + FN)), and thus it balances two objectives: that most of the points belonging to the positive class are correctly classified (i.e., recall), and that most of the points classified as positive indeed belong to the positive class (i.e., precision). The F1 score varies between 0 and 1; a high value implies that the model appropriately classifies the positive class, generating few false negatives and false positives. Here, the positive class is taken to be the class with fewer labels. We could also have used accuracy, one of the most typical classification performance measures in machine learning, which indicates the percentage of correctly classified points over the total number of data points. However, this measure behaves poorly when classes are imbalanced (i.e., when one class has substantially more data points than the rest), because high accuracy is achievable by labeling all data points as members of the majority class. This is exactly the situation we face here, with 40 concrete and 10 abstract concepts.
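Eq. (7) is straightforward to compute from confusion-matrix counts. The counts below are made-up numbers for illustration, with the minority (abstract) class as positive.

```python
# F1 score from confusion-matrix counts, as in Eq. (7).
def f1_score(tp, fp, fn):
    return 2 * tp / (2 * tp + fp + fn)

tp, fp, fn = 8, 3, 2          # hypothetical outcome for 10 abstract concepts
precision = tp / (tp + fp)    # 8/11
recall = tp / (tp + fn)       # 8/10

f1 = f1_score(tp, fp, fn)
# Eq. (7) equals the harmonic mean of precision and recall:
assert abs(f1 - 2 * precision * recall / (precision + recall)) < 1e-12
print(round(f1, 3))           # 16/21, i.e., 0.762
```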

Even though there are several classification models in the literature, some of them (e.g., neural networks) need a large number of data points to learn the model’s parameters. For this reason, in this paper we use the following classic models:

  • k-nearest neighbors (KNN): The KNN model is one of the simplest and most basic classification models, often called a lazy learner (Cover & Hart, 1967). There is no training process; classification is based on the distances to, and classes of, the k closest neighbors of a test point. Specifically, when a new data point is presented, its distance to all the training data points is calculated, and the closest k points, with their respective labels, are selected. Based on these k points, the probability of belonging to a class is estimated as the number of those points belonging to that class over k. For this work, we chose k = 3 to avoid overfitting (i.e., memorizing the training data and obtaining a high test error).

  • Gaussian naïve Bayes (NB): The NB model is based on Bayes’ theorem and conditional independence. For a given set of known inputs or variables, the model quantifies the conditional probability that the analyzed record belongs to a specific category of the class label (Langley et al., 1992). Given the difficulty of estimating the conditional probability of the data for a specific class label, the model assumes independence between variables given the class. Once the parameters are learned, for a new data point the model calculates the probability of belonging to each class (standardizing the proportional probabilities yields the corresponding probabilities).

  • Decision trees (DT): A DT is a structure composed of nodes, leaves, and branches, where each node corresponds to a decision (or a test applied to some attribute), and each branch represents a possible outcome of that decision or test. When a data point is entered into the model, the tree is traversed until a leaf is reached. Each leaf determines the probability that the data point belongs to one of the two possible classes (Quinlan, 1986). For this work, we restricted the depth of the tree to two levels to avoid overfitting.

  • Support vector machines (SVM): SVM uses a hyperplane to separate the classes. The training algorithm searches for the hyperplane with the largest “margin,” i.e., the hyperplane such that the distance to the support vectors (the points of each class closest to the hyperplane) is maximized (Boser et al., 1992). For complex problems, the dimensionality of the data points can be increased artificially by a kernel function, and the separating hyperplane found in this new space.

  • Logistic regression (LR): LR is a regression model for predicting binary variables. The model calculates the probability that a data point belongs to one of the two possible classes using a logistic function (Fang, 2013).
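In scikit-learn terms, the five models could be instantiated as below. The k = 3 and depth-2 restrictions come from the text; every other hyperparameter is a library default and thus an assumption on our part.

```python
# The five classifiers, mirroring the restrictions stated above
# (k = 3 for KNN, depth 2 for the decision tree); all other settings
# are scikit-learn defaults, not choices reported in the text.
from sklearn.neighbors import KNeighborsClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.tree import DecisionTreeClassifier
from sklearn.svm import SVC
from sklearn.linear_model import LogisticRegression

models = {
    "KNN": KNeighborsClassifier(n_neighbors=3),
    "NB": GaussianNB(),
    "DT": DecisionTreeClassifier(max_depth=2, random_state=0),
    "SVM": SVC(),
    "LR": LogisticRegression(),
}
```

All five expose the same `fit`/`predict` interface, so the same cross-validation loop can be run over the dictionary.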

The F1 score results for abstract versus concrete concept classification can be seen in Table 3, where abstract concepts are the positive class. We evaluated three different datasets: sighted (26 participants), blind (22 participants), and both (sighted and blind combined, 48 participants in total). The classification results for each dataset correspond to the average of the test folds using a five-fold stratified cross-validation approach. This approach separates the selected dataset into five folds, using four folds for training and the remaining fold for testing (the stratification forces each fold to contain two abstract concepts). The process is repeated five times, using each fold once as the test set. All models were also checked for overfitting, obtaining test errors similar to the training errors.
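The stratified scheme can be sketched as follows. The feature values are hypothetical placeholders for per-concept p(a)s; the point of the sketch is that every test fold holds exactly two abstract and eight concrete concepts.

```python
# Sketch of five-fold stratified cross-validation over the 40 concrete /
# 10 abstract concepts. Feature values are hypothetical stand-ins for p(a).
import numpy as np
from sklearn.model_selection import StratifiedKFold

rng = np.random.default_rng(0)
X = np.concatenate([rng.normal(0.16, 0.03, 40),    # "concrete" p(a)s
                    rng.normal(0.11, 0.02, 10)]).reshape(-1, 1)
y = np.array([0] * 40 + [1] * 10)                  # 1 = abstract (positive class)

skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
for train_idx, test_idx in skf.split(X, y):
    # Stratification forces each test fold to hold exactly 2 abstract concepts.
    assert len(test_idx) == 10 and y[test_idx].sum() == 2
```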

Table 3 F1 score results for abstract versus concrete concept classification using fivefold stratified cross-validation (mean, std. dev. in parentheses)

† marks cases where the average F1 score achieved by p(a) or p(a)eq is higher than that of s&k, with the difference statistically significant at least at the 0.05 level. An * indicates that the F1 score achieved by p(a) is higher than those of p(a)eq and s&k, with the differences statistically significant at least at the 0.05 level.

As can be seen from Table 3, most of the models using s&k are unable to achieve good performance, the lowest score being 0.07. In contrast, comparing s&k with p(a)eq and p(a), the F1 scores for the agreement probability measures are higher on 12 and 13 occasions, respectively. Of those comparisons, the F1 scores achieved by p(a)eq and p(a) are higher and statistically significant at least at the 0.05 level in 7 and 12 cases, respectively. This shows that p(a)eq and p(a) differentiate concrete from abstract concepts better than the more parsimonious s&k combination. Additionally, comparing the F1 scores achieved by p(a) and p(a)eq, we can see that they are equal or higher for p(a) in all cases, and the differences reach statistical significance at the 0.05 level in three comparisons. All in all, we may say that the classification performance attained by p(a) is the best, followed by p(a)eq, with both trailed by s&k.

Finally, note from Table 3 that the F1 scores for p(a) suggest that the discrimination between concrete and abstract concepts is better in the sighted population than in the blind population. All F1 scores for the five classification tools are statistically significantly higher for sighted than for blind subjects, except for KNN (t(8) and p value in parentheses; respectively, 0.746 (0.477), 3.751 (0.006), 4.399 (0.002), 3.651 (0.006), and 2.800 (0.023); the values for these comparisons come from executing the test fold for each tool five times). This is consistent with our hypothesis that the difference between concrete and abstract concepts is more conspicuous in sighted than in blind subjects, because the blind population tends to learn abstract concepts in much the same way it learns concrete concepts, due to a lack of visual perceptual properties. This similarity in learning blurs the distinction between the two types of concepts in the blind population.

Discussion and conclusions

In the current work, we have discussed agreement probability, a measure of the homogeneity of concept instantiations in the Property Listing Task. Being a probability, the measure has the positive characteristic of being naturally bounded in the 0 to 1 range, and the 0 and 1 values have clear and straightforward interpretations (i.e., total heterogeneity and total homogeneity, respectively). Additionally, agreement probability naturally integrates the information produced when property listing data are collected into a single value that depends on the average list length produced by subjects (s), the total number of unique properties produced by the subject sample (k), and the frequency distribution of those properties. Finally, agreement probability also has the nice feature of directly implying that nonuniform property probability distributions reflect greater homogeneity in property lists (see the lower-bound demonstration in Appendix A), and hence that frequency distributions should be considered in a homogeneity index.

We assume that heterogeneity is an inherent characteristic of naturally occurring concepts coded in language. Many factors could influence this heterogeneity in the real world. Consequently, p(a) could be used to gauge these factors’ relative influence when comparing types of concepts or types of conceptualizers. To show that this is the case, we compared conceptual agreement values between two types of concepts and two types of conceptualizers.

What have we learned from the concrete/abstract and blind/sighted comparisons

A large literature strongly suggests that concrete concepts differ from abstract concepts. That literature discusses evidence that, when conceptualizing, people routinely reenact perceptual content associated with the corresponding concepts (Kan et al., 2003; Lupyan & Ward, 2013; Ostarek & Huettig, 2017; Santos et al., 2011; Solomon & Barsalou, 2004), which is characteristic of concrete concepts. In contrast, abstract concepts appear to be characterized not so much by perceptual content as by social and linguistic associations (Barsalou et al., 2008; Borghi et al., 2017; Borghi & Cimatti, 2009; Breedin et al., 1994; Paivio, 1986; Wiemer-Hastings & Xu, 2005).

From this literature, we posited our characterizing concreteness hypotheses, which hold that concrete concepts are characterized by more perceptual information than abstract concepts and, importantly, that this perceptual information introduces greater homogeneity in conceptualization for concrete than for abstract concepts. Consistent with these hypotheses, we found that visual strength subjective ratings obtained from Vergallito et al. (2020) were higher for concrete than for abstract concepts, and that visual strength ratings correlated positively with our p(a) measure, confirming that visual information is associated with increased homogeneity across participants.

A somewhat surprising result is that the positive correlation between visual strength ratings and p(a) also holds when blind subjects’ data are analyzed. This suggests that blind participants not only have information about visual properties (e.g., that “black” and “white” can be used to describe zebras), likely obtained from their interactions with the sighted community (cf. Louwerse, 2018), but also that this linguistic source introduces homogeneity into their lists, similarly to what occurs with sighted subjects. In fact, recent evidence is consistent with this. The concreteness advantage effect consists of faster processing for concrete than for abstract words, presumably due to the effect of perceptual information. Notably, Bottini et al. (2022) report that early blind subjects show this effect even when a word’s concreteness depends mostly on its reliance on visual information (e.g., “blue”).

Because discrimination tasks that rely on visual information can detect differences between sighted and blind subjects (Connolly et al., 2007; Kim et al., 2019), in our role of vision hypothesis, we posited that visual reenactments should introduce greater homogeneity for concrete concepts in sighted compared to blind participants. This seems at odds with our finding discussed in the immediately preceding paragraph, which suggests that blind participants do have information about visual perceptual information, presumably acquired through regularities experienced in language, and that this information does introduce relative homogeneity in the lists they produce. However, as discussed next, we did find evidence consistent with our role of vision hypothesis.

As shown in Fig. 2, concrete concepts are less homogeneous for blind than for sighted participants. To explain this apparent contradiction, here we further hypothesize that when visual reenactments occur, they capture attention and guide property listing. Thus, even if blind subjects have the linguistically represented perceptual information, their lists rely on linguistic associations and not on the highly salient visual reenactments. In contrast, for sighted subjects, visual reenactments capture attention and guide listing, thus introducing homogeneity to a larger extent than would be expected only from linguistic regularities.

An additional and interesting finding is that, as shown in Fig. 2, p(a) computations indicate that property lists differ across groups of conceptualizers, suggesting that perhaps different learning experiences lead to different category memory representations. When comparisons were made across groups (i.e., between blind and sighted participants), p(a) values were consistently lower than when those comparisons were made within the same groups. Evidently, property frequency distributions were not the same across our groups.

Though being able to use p(a) to make group-level comparisons (i.e., groups of concepts and groups of conceptualizers) is already interesting, we also showed that p(a) can be used to discriminate between individual concepts. If abstract concepts produce more variable instantiations in the PLT than concrete concepts, then p(a) might also allow discriminating between concepts at the individual level (i.e., showing that a particular concept can be classified as concrete or abstract based on its agreement probability value). To this effect, we introduced a simple measure consisting of s (the average number of properties produced by subjects) and k (the total number of unique properties produced by the whole subject sample) and contrasted it with p(a)eq and p(a) in their capacity to discriminate concrete from abstract concepts. These three variables were submitted to machine learning algorithms and their classification performances contrasted. Overall, our data showed that the best classification performance was achieved by p(a). Three consequences ensue: agreement probability carries more useful information about concepts than its s and k constituents considered in isolation; information about the property frequency distribution needs to be considered in the computation of agreement probability; and abstract concepts are effectively more heterogeneous than concrete concepts, not only as a group but also at the level of individual concepts. Additionally, classification results for p(a) indicated that a better discrimination between concrete and abstract concepts is achieved among sighted than among blind individuals. This lends further support to the theory that blind people learn abstract concepts in much the same way as concrete ones, due to the lack of visual perceptual information.
Hence, this similarity in learning blurs the distinction between concrete and abstract concepts in the blind population.

What more might p(a) enable

As discussed above, the work we report here shows that variability in the PLT is not necessarily noise. Rather, variability in the PLT contains information that can be meaningfully related to the literature on the abstract concept versus concrete concept distinction, and to the literature on the effect that lack of sight has on conceptual representations. That a simple task like the PLT contains such a wealth of information is surprising. In what follows, we want to suggest other issues to which p(a) could be applied to gain theoretical insights.

Studying conceptualizations in social groups

The PLT and ensuing CPNs have been used to characterize shared semantic memory in social groups, either using them in isolation or in combination with other techniques (e.g., Hood, 2020; Mazzuca et al., 2020; Sunohara et al., 2022; Weiler & Jacobsen, 2021). The aim of these researchers has been to characterize shared semantic concepts in a particular social group (e.g., to characterize knowledge of foods in children; to characterize the meaning of tattoos in older adults). However, it is not trivial to claim that a certain semantic structure (1) is shared across members of a social group, and (2) is also specific to that social group, in contrast to being relatively invariant across different social groups. Following our comparison between blind and sighted subjects, we envision that, by using p(a), it should be possible to compare linguistically coded concepts in different social groups. A group would have a shared and distinctive conceptualization if the within-group p(a) is greater than the between-group p(a), just as our analyses illustrate.

Analyzing the effect of coding on CPN results

Because the PLT is highly productive and properties can be expressed in numerous ways (e.g., people cued with the concept democracy may use “a president is elected” and “there are presidential elections” to refer to essentially the same property), PLT data needs to be coded. This coding process typically involves several coders, and inter-coder reliability is always a concern. Only recently have there been attempts to develop methods that promote highly reliable codings (Buchanan et al., 2020; Reid & Katz, 2022). Note that low or even moderate reliabilities make it difficult to produce replicable studies.

The problems introduced by coding are partly responsible for the fact that one can hardly find studies in the literature that reuse coding procedures developed by other researchers, and that CPN studies are seldom replicated. A closely related problem is the following. Coders in different CPN studies could code highly related sets of raw properties with slightly different labels, and their codes could produce somewhat different partitions of the raw properties, such that between-study comparisons are difficult to carry out (i.e., how do we know whether the coded properties yield data structures similar enough for both studies to be considered replications?). These problems only increase when the concepts of interest are abstract, because people then tend to produce more unique properties. We hope that computing p(a) could help solve these issues, given that different coding systems applied to essentially the same raw property data should produce comparable agreement values.

Testing the effect of context on the instantiation of a concept

It has long been argued that contextual knowledge plays a central role in categorization and cognition (Chaigneau et al., 2009; Kiefer & Pulvermüller, 2012; Lin & Murphy, 2001; Roth & Shoben, 1983; Wenchi & Barsalou, 2006). Furthermore, evidence supports the idea that conceptual properties can be meaningfully divided into those that are context dependent (i.e., those that become active only in specific contexts, e.g., that a basketball “can float”) and those that are relatively context independent (i.e., those that become active across different contexts, e.g., that dogs “bark”) (Barsalou, 1982). If, as we hypothesize (see our Agreement probability as a measure of homogeneity), concepts with lower p(a) are those for which people may adopt different points of view when conceptualizing them, then manipulating contexts should change a concept’s p(a). To test this hypothesis, we envision experiments in which property lists are obtained after subjects have been primed with specific contexts. We would predict that p(a) should increase when specific relevant contexts are introduced, and that abstract concepts should perhaps be more influenced by this manipulation. However, these experiments are beyond the scope of the current work, and we defer them to future work.

In closing, we want to highlight that our p(a) measure is consistent with views that see an intimate link between cognition and culture (Atran, 2003; Berntsen & Rubin, 2004; DiMaggio, 1997; Lehman et al., 2004; McCauley et al., 2022; ojalehto & Medin, 2015; Patterson, 2014; Roberson et al., 2000; Talmy, 2000; Waxman et al., 2007), where cognition is thought to reflect objective cultural practices in the subjective domain (Kashima, 2016; Nisbett et al., 2001; Nisbett & Masuda, 2003; Nisbett & Miyamoto, 2005; Romney & Moore, 1998). Thus, we believe that p(a) has a wide range of applications, and we will be pleased if it indeed lives up to this standard.