Statistical and methodological problems with concreteness and other semantic variables: A list memory experiment case study

Pollock, Lewis

doi:10.3758/s13428-017-0938-y

Statistical and methodological problems with concreteness and other semantic variables: A list memory experiment case study

Open access
Published: 13 July 2017

Volume 50, pages 1198–1216, (2018)
Cite this article

Download PDF

You have full access to this open access article

Behavior Research Methods Aims and scope Submit manuscript

Statistical and methodological problems with concreteness and other semantic variables: A list memory experiment case study

Download PDF

Lewis Pollock ORCID: orcid.org/0000-0003-1488-7645¹

5289 Accesses
41 Citations
2 Altmetric
Explore all metrics

Abstract

The purpose of this article is to highlight problems with a range of semantic psycholinguistic variables (concreteness, imageability, individual modality norms, and emotional valence) and to provide a way of avoiding these problems. Focusing on concreteness, I show that for a large class of words in the Brysbaert, Warriner, and Kuperman (Behavior Research Methods 46: 904–911, 2013) concreteness norms, the mean concreteness values do not reflect the judgments that actual participants made. This problem applies to nearly every word in the middle of the concreteness scale. Using list memory experiments as a case study, I show that many of the “abstract” stimuli in concreteness experiments are not unequivocally abstract. Instead, they are simply those words about which participants tend to disagree. I report three replications of list memory experiments in which the contrast between concrete and abstract stimuli was maximized, so that the mean concreteness values were accurate reflections of participants’ judgments. The first two experiments did not produce a concreteness effect. After I introduced an additional control, the third experiment did produce a concreteness effect. The article closes with a discussion of the implications of these results, as well as a consideration of variables other than concreteness. The sensorimotor experience variables (imageability and individual modality norms) show the same distribution as concreteness. The distribution of emotional valence scores is healthier, but variability in ratings takes on a special significance for this measure because of how the scale is constructed. I recommend that researchers using these variables keep the standard deviations of the ratings of their stimuli as low as possible.

Deconstructing the effect of self-directed study on episodic memory

Article 19 June 2014

Decision Making: a Theoretical Review

Article 15 November 2021

Recognize the Value of the Sum Score, Psychometrics’ Greatest Accomplishment

Article Open access 01 March 2024

Word concreteness has become one of the most studied variables in the psycholinguistic literature. Since Paivio, Yuille, and Madigan (1968) published one of the first large-scale databases of word concreteness norms, “concreteness effects” have emerged in a variety of investigations of various cognitive processes, and a range of theories have been proposed in an attempt to explain these effects. Independent teams of researchers operating over a period of decades have repeatedly shown that concrete words show a processing advantage over abstract words in certain experimental paradigms. For example, concrete words are easier to remember than abstract words (Allen & Hulme, 2006; Miller & Roodenrys, 2009; Romani, McAlpine, & Martin, 2008; Walker & Hulme, 1999), are easier to make associations with (de Groot, 1989), and are more easily and more thoroughly defined in dictionary definition tasks (Sadoski, Kealy, Goetz, & Paivio, 1997). Historically, it was claimed that concrete words are responded to more quickly than abstract words in lexical decision tasks (Bleasdale, 1987; James, 1975; Kroll & Merves, 1985), although more recent experiments have shown no difference (Brysbaert, Stevens, Mandera, & Keuleers, 2016), or even that abstract words might have an advantage after various other variables have been accounted for (Kousta, Vigliocco, Vinson, Andrews, & Del Campo, 2011). However, even an abstractness advantage in lexical decision points to the utility of word concreteness as a psycholinguistic variable.

Brain-imaging techniques have also been employed to determine whether the neural systems underpinning concrete words and abstract words are distinct (Binder, Westbury, McKiernan, Possing, & Medler, 2005; Dhond, Witzel, Dale, & Halgren, 2007; Kounios & Holcomb, 1994; Pexman, Hargreaves, Edwards, Henry, & Goodyear, 2007; Sabsevitz, Medler, Seidenberg, & Binder, 2005). The general consensus from these brain-imaging studies is that there is evidence of a neuroanatomical difference in the processing of concrete versus abstract words.

Psychologists are clearly heavily invested in the investigation of word concreteness, and for good reasons. If there are properties that define a cognitively relevant ontology of concepts, concreteness seems like a good candidate: Something about what constitutes the concept of “elephants” (highly concrete) is probably different from what constitutes the concept of “paradoxes” (highly abstract). However, in this article I will highlight a problem with the concreteness measure, based on a simple statistical summary of the Brysbaert, Warriner, and Kuperman (2013) concreteness norms. I report three replication experiments that together suggest that this problem is not fatal to concreteness research, but also that it should be acknowledged when researchers design their stimuli. I also show that the same problem applies to other variables in semantic databases, such as imageability (Cortese & Fugett, 2004; Schock, Cortese, & Khanna, 2012) and individual modality norms (Lynott & Connell, 2012).

Word concreteness

A word’s concreteness rating is derived by asking a group of participants to rate that word for concreteness on a Likert scale. A low score indicates that a word is highly “abstract,” whereas a high rating indicates that a word is highly “concrete.” The mean value of all participants’ ratings is taken to be an approximation of a word’s position on an abstract-concrete continuum. I will now develop some theoretical concerns about the validity of traditional concreteness norms before turning to a statistical analysis of the Brysbaert et al. (2013) database. Consider the job a participant is being asked to do when she is told to rate a word between, say, 1 and 5 on a scale of concreteness. She is told that “concrete words are experienced by the senses,” whereas abstract words are not (Paivio et al., 1968). For some words, the interpretation of traditional concreteness norming instructions is relatively straightforward. A participant who is presented with the word “apple” is likely to have seen, touched, smelled, and tasted apples throughout the course of their life, and will unproblematically assign “apple” a high concreteness rating. Similarly, a participant that is presented with the word “serendipity” is likely to reason that since serendipity is a loose association between some coincidental, nonspecified events, and is not something that affords direct sensory experience, the word “serendipity” should be assigned a low concreteness rating. However, what are the properties that a word/concept should have in order for it to be assigned a mid-scale rating? It is difficult to formulate a coherent approach to this task: Can an entity or idea be “half-seen” or “half-touched”? What does it mean to have intermediate sensory experience of an entity or idea? That is to ask: What is a participant telling us about a word when they rate it a 3 out of 5? They could mean any one of the following:

1.
Adding up all of my sensory experience of this object across all five of the sensory modalities, I realize that I have seen and heard it, but never touched, smelled, or tasted it. So I suppose I’ll rate it a 3.
2.
One interpretation of this word brings to mind something that cannot be directly experienced, whereas a different interpretation of this word brings to mind something that can be directly experienced. So I suppose I’ll rate it a 3.
3.
Sometimes I associate sensory experience with this word, but sometimes I don’t. So I suppose I’ll rate it a 3.

It is certainly possible to imagine more potential approaches, and there is no empirical basis for selecting one of these approaches over another. Furthermore, it is likely that different participants will generate different interpretations for many of the words in any list of words to be normed. When a participant sees the letter string < deed > presented in isolation, there is no way that a researcher can control for the fact that half of the participants may interpret < deed > as referring to a document associated with proof of property ownership (high concreteness value?), and the other half may interpret it as referring to some unspecified action, perhaps involving some element of heroism (low concreteness value?). Consequently, for a number of words it is just not clear what word/concept the mean concreteness rating is supposed to reflect.

This point on its own might be enough to motivate the avoidance of words with a mean value in the middle of a concreteness–abstractness scale. Given that it is not clear what it is that participants are even telling us when they rate a word a 3, we might also wonder how often participants actually use values from the middle of the concreteness scale when making their judgments. Recently, Brysbaert et al. (2013) provided a concreteness norm database of 40,000 English words, which dwarfs the previously popular MRC database used in most studies (Coltheart, 1981). This new, larger database allows a statistical analysis of the distributions of concreteness norms across a much larger section of the English lexicon. I now present this analysis and use it to develop the concerns raised in this section.

Brysbaert et al. (2013) concreteness norms

Brysbaert et al. (2013) collected a new set of concreteness norms for 40,000 English words. Groups of approximately 25 participants rated subsets of the whole list of 40,000 words on a concreteness scale of 1 (very abstract) to 5 (very concrete). The participants (n = 4,237) came from a range of ages, with approximately one third between 17 and 25 years old, and two thirds between 26 and 65. The mean value of a group of participants’ judgments about the concreteness of a stimulus word was assumed to be a useful approximation of that word’s position on a hypothesized concrete–abstract continuum. I shall now argue that this is not necessarily the case. The standard deviation of a dataset is a measure of the average distance between all data points in that dataset and the mean value of all data points in the dataset. If every participant rates a word as a 1 (highly abstract), then that word’s concreteness rating will have a standard deviation of 0. However, if half of the participants rated a word as a 1, but the other half rated the word as a 5 (highly concrete), that word would have a mean concreteness rating of 3 but a standard deviation of 2. In Likert scale norming tasks, the standard deviation of a set of ratings is therefore a blunt index of the extent to which participants agreed with each other about how a word should be rated.

If a dataset contains 25 numbers (in our case, 25 individual concreteness judgments), all of which are integers between 1 and 5, then there are a finite number of possible combinations of means and standard deviations for that dataset. Figure 1 below plots all of these possible combinations:

Note how, at the extreme ends of the x-axis, only a standard deviation of 0 is possible, because for a mean value to be 1 or 5, all 25 participants must have rated a word as 1 or 5, respectively. However, in the middle of the scale the disagreement that is theoretically possible increases, reaching a peak at mean value ~3, standard deviation ~2. Crucially, it is still theoretically possible for a data point to occur with a mean value located in the middle of the scale, but with a relatively low standard deviation. That is, it is still clearly theoretically possible for participants to more or less consistently agree that a word is of intermediate concreteness.

Now, consider Fig. 2, which plots the actual mean concreteness value and the standard deviation of every noun in the Brysbaert et al. (2013) concreteness norm dataset (n = 14,592) over the top of the theoretically possible combinations depicted in Fig. 1.

The pattern is striking. At the extreme concrete end of the scale, many items have high concreteness ratings and relatively low standard deviations, indicating that participants more or less agreed in their judgments about how to rate these words. At the extreme abstract end of the scale, there are likewise words with low concreteness ratings and relatively low standard deviations, although not to the same extent as at the extreme concrete end. However, in the middle of the scale there is an obvious rise in the standard deviation. Only a handful of words have a mean value near 3 and a standard deviation even slightly below 1. Indeed, a large class of words have a standard deviation well over 1, ranging from mean values of 1.5 to 4.5.

This indicates that for a great number of items, participants were not agreeing in their judgments of how concrete a stimulus word was. At mean values of 2 and 4 there are many cases of standard deviations above 1. Remember that ratings on this scale can only take integer values between 1 and 5. This means that for many of the words with a mean value of 2 or 4, some participants must have judged these words as belonging at the opposite end of the concreteness scale from the position where the mean value suggests the word belongs. This phenomenon is problematic for the assumption that concreteness should be treated as a continuous variable. This is because in a vast number of cases, participants’ judgments tended not to be continuous; instead, they tended to be binary: Participants were using values of 1, 2, 4, and 5 in producing these concreteness norms, and avoided using 3. Furthermore, in many cases participants were judging a word as a 1 (totally abstract), whereas others were judging that same word as a 4 (somewhat concrete).

Given these methodological issues, it might seem surprising that concreteness effects are so widely reported. If measurements for a large section of the hypothesized concreteness spectrum are actually procedural artifacts, it is then unclear what phenomenon it is that concreteness effects are actually indexing. One potential explanation is that generally, when investigating the effect of a variable, researchers try to choose stimuli that maximize a change in this variable, in order to generate the maximum possible effect. It is therefore possible that empirical concreteness research might not suffer too badly from the problem of binary disagreements concerning midscale items, because researchers will have aimed to pick stimuli from the extreme ends of the scale, and these polar items are less subject to disagreement.

However, if it turns out that many experimental stimuli do suffer from the disagreement phenomenon, this poses an explanatory problem concerning the evidence in favor of processing differences between abstract and concrete items. The typical finding is that there are processing advantages for concrete items relative to abstract items, and the typical explanation of this finding is that concrete and abstract items have different neurologically instantiated formats and/or structural relationships. If a significant number of the stimuli included in an abstract or concrete experimental condition actually come from the middle of the concreteness scale, then the typical claim that there are processing differences between concrete and abstract items is no longer supported by the data. This is because words from the middle of the scale must have high standard deviations. This means that only half of the participants who produced the concreteness measure for that word judged it to be abstract, and the other half judged it to be concrete. Therefore, there are no empirical grounds for calling these words “concrete” or “abstract” in the first place.

Stimuli in concreteness experiments: A case study of list memory paradigms

In this section I plot the stimuli featured in four list memory experimental studies against the entire Brysbaert et al. (2013) database. These studies are Allen and Hulme (2006), Walker and Hulme (1999), Romani et al. (2008), and Miller and Roodenrys (2009). We should note a few things. First, although the replication experiments that I report below feature noun stimuli, and most studies under discussion here also featured nouns, occasionally their stimulus sets featured other word classes alongside nouns. In the case of Allen and Hulme, many of the stimuli in the abstract condition were not nominal. Therefore, to display the maximum number of stimuli for all experiments, I have plotted the entire Brysbaert et al. (2013) database (n = 40,000) instead of just the nominal subsection of it. Not all of the stimuli featured in all experiments appeared in the Brysbaert et al. norms, and these stimuli have been omitted from the analysis. Second, the pattern of means and standard deviations is absolutely unchanged when we compare the entire Brysbaert et al. database with the noun subsection of it.

Now, consider Fig. 3. The stimuli featured in Romani et al. (2008) best exemplify the problem, although the intention here is not to single out Romani et al. or any of the other authors under discussion for criticism. The analysis I present here would have been almost impossible to carry out at the time that these experiments were conducted, given that the Brysbaert et al. concreteness database was only published in 2013. In brief, the problem is that the concrete words tend to have low standard deviations, whereas the abstract stimuli tend to have high standard deviations and to be drawn from the middle of the scale, rather than the unequivocally abstract part of the scale. This is potentially problematic for the validity of Romani et al.’s conclusions regarding concreteness effects, because many of the stimuli that made up their abstract stimuli were not unequivocally abstract. For the standard deviations of many of the “abstract” stimuli to be as high as they are—in many cases, well above 1—many participants must have been judging those words to be concrete during the Brysbaert et al. (2013) norming process. Some of the abstract stimuli have standard deviations approaching the theoretical maximum of 2, indicating maximum disagreement among participants about whether that word is concrete or abstract. To reiterate: Participants could only apply integer values in making their judgments. Therefore, even if a word has a mean concreteness rating of approximately 2, but also a standard deviation of the rating above 1, that means that some participants must have been crossing scale halves in making their judgments. Ultimately, it is not clear what comparison is actually being made here. The concrete stimulus lists were more or less unproblematically concrete. However, the abstract stimulus lists contained words drawn from nearly the entire length of the concreteness scale, and also tended to feature words that participants disagreed about how to rate.

Figure 4 depicts the abstract and concrete stimuli featured in Allen and Hulme (2006). Again, many “abstract” stimuli here have standard deviations well above 1, indicating that people disagreed about whether the words were abstract in the first place. The range of mean ratings of concreteness for the abstract condition is also clearly much higher than in the concrete condition. Once again, a relatively homogeneous group of concrete words has been compared to a heterogeneous group of words about which participants tended to disagree.

Figure 5 plots the stimuli featured in Miller and Roodenrys (2009). Again, there is a marked difference in standard deviations between the concrete and the abstract stimuli. Furthermore, the standard deviations of the abstract stimuli are so high (well above 1 in the majority of cases) that the mean value does not reflect the judgments that participants were actually making.

Finally, consider Fig. 6, which depicts the stimuli featured in Walker and Hulme (1999). The midscale criticism applies least to this set of stimuli, although it is still clearly the case that the concrete stimuli tended to have lower standard deviations than the abstract stimuli. The reasons for this have already been expounded. The upshot is that a skeptic could reasonably argue that these experiments do not actually provide evidence for concreteness effects. The reason is that the comparison being made was meant to be between concrete and abstract items, but the comparison that was actually made was between concrete items, on the one hand, and a group of stimuli about which participants disagree, on the other. It could be the case that words that engender disagreement are those that are hard to remember, and that this explains processing differences that have previously been attributed to concreteness/abstractness. The experiments that I report below were designed to test this possibility.

Before moving on to a report of these replication attempts, I wish to point out that list memory paradigms are not a special case when it comes to the properties of “abstract” stimuli. Table 1 presents a number of experimental concreteness studies from a wide variety of paradigms, as well as a summary of the concreteness values and standard deviations of the stimuli featured in their experiments. The abstract–midscale stimulus pattern applies to every single experiment.

Table 1 Concreteness statistics in various experimental paradigms

Statistical and methodological problems with concreteness and other semantic variables: A list memory experiment case study

Abstract

Similar content being viewed by others

Deconstructing the effect of self-directed study on episodic memory

Decision Making: a Theoretical Review

Recognize the Value of the Sum Score, Psychometrics’ Greatest Accomplishment

Word concreteness

Brysbaert et al. (2013) concreteness norms

Stimuli in concreteness experiments: A case study of list memory paradigms

Experiment 1

Method

Participants

Materials

Procedure

Results

Experiment 2

Method

Participants

Materials

Procedure

Results

Interim summary

Experiment 3

Method

Participants

Materials

Procedure

Results

General discussion

Conclusion

Notes

References

Author note

Author information

Authors and Affiliations

Corresponding author

Appendices

Appendices

Rights and permissions

About this article

Cite this article

Share this article

Keywords

Search

Navigation