Traditional lab-based studies provide a great degree of control. This control enables experimental designs that can be used to explore subtle effects. However, that level of control also means that some lab results do not readily replicate in other, less controlled circumstances, most notably in real-world situations. This article supports the proposal that some of the approaches and methods that have proven so useful in the lab can be applied to more naturalistic data sets gathered from external sources, primarily the internet and other collections of big data. By applying such methods to more naturalistic data, researchers can strike a new balance between internal and external validity in their pursuit of furthering our understanding of cognition and behavior.

This study was designed to establish the efficacy of these methods by applying them to investigate two related questions regarding the representation of word meaning—whether verb representations are more relational than those of nouns, and whether a word's form is related to its meaning. Both of these cases involve hypotheses regarding the variability of meaning across words and uses, whether grouped by grammatical category or phonetic similarity. Moreover, the dependent measures in both studies involve the textual context in which the words appear and its variability. As a result, the same overall methodological approach can be applied in both cases, with some modifications.

The first study demonstrates how big data can allow researchers to develop new approaches for testing existing questions by examining patterns that unfold over long periods of time. In particular, this study tests the hypothesis that the meaning of verbs changes more quickly than the meaning of nouns. The second study shows how, by replacing participants with text, researchers can test hypotheses that are larger in scope and replace some of the reliance on participants’ intuitions and judgment with objective statistical measures. Specifically, the second study explores proposed relationships between phonetic clusters and the meaning of words incorporating them. Both of these studies illustrate how the combination of large corpora and traditional hypothesis-testing designs enables researchers to conduct naturalistic studies with external validity and high statistical power. In particular, using large datasets enables researchers to approach problems from a different perspective, answering questions that are difficult, or perhaps even impossible, to explore in the lab. These difficulties can arise out of the limitations of the lab (Study 1), or because collecting a similar quantity and quality of data from participants is difficult and expensive (Study 2).

The experimental method and the study of cognition

Psychological researchers have customarily focused on lab-based experiments to test their theories and hypotheses. The lab provides many advantages for research in psychology, and especially for investigations of cognition. Primary among these is the important role control plays in experiments. By controlling the environment, researchers can eliminate many possible confounds and other threats to the validity of their conclusions. This results in studies with a high degree of internal validity and provides a dramatic increase in the statistical power available for testing hypotheses, at the cost of reduced external validity. Additionally, the degree of control available in the lab means that studies are also easier to replicate, although the success of such efforts at replication has recently come under scrutiny (Aarts et al., 2015).

Nevertheless, conducting research in the lab has its disadvantages. In particular, questions often arise regarding the external validity of lab-based results. That same level of control and care that researchers exercise in the lab can result in studies whose results depend on the particular conditions of the study. Small variations in those conditions, such as the addition of noise or ambiguity in language, might cause the effects observed in the lab to be greatly reduced, or even disappear.

To assuage these concerns, researchers also conduct studies in more natural settings. This can be achieved either by endeavoring to recreate such settings within the lab or by venturing outside of the lab to conduct studies in less controlled environments. The advantage of the former is that it allows the researcher to maintain a high degree of control over the study. Its disadvantage is that it is nigh impossible to faithfully recreate a natural setting within a controlled environment, and such recreated settings tend to present a compromise between a fully controlled lab study and a study conducted in a natural setting.

In contrast, although studying behavior in a natural setting seems like an ideal avenue for conducting studies in psychology, it complicates the study designs and limits the possible manipulations an experimenter can employ, as well as the types and precision of the quantitative measurements that can be collected. Therefore, although the inspiration for theories and hypotheses is often found outside of the lab, researchers frequently start their scientific investigation by conducting rigorous, precise, and controlled lab studies. Once the phenomenon is better understood in such controlled settings, researchers then turn to supporting the results by providing convincing, if less conclusive, evidence that their theories also predict behavior outside the lab.

Using big data to conduct studies

The vast amounts of data now available to researchers can be a valuable resource. By incorporating this new realm of data and translating it into traditional laboratory methods, we can expand the reach of the lab into the wilderness of human society. This can allow researchers to conduct research that has more external validity than traditional lab studies, while maintaining, or even improving, the available statistical power. The first step toward such translations is the realization that data from outside the lab, although less controlled, can also be analyzed using the same methods employed for analyzing data gathered in the lab.

Studies conducted in the lab are often concerned with the effect of one or more independent variables (IVs) on the outcome as measured by a dependent variable (DV). Commonly, such studies are designed as experiments, in which the IVs are intentionally manipulated by the researcher. The effects of such manipulations on the measured DV are then explored using inferential statistics such as t tests and ANOVAs. In most studies, several different manipulations are used, each giving rise to a different experimental condition, and differences in the DV due to the condition in which it is measured provide evidence of the effect of the manipulation (hence, the IVs) on the DV. The extensive control that researchers have in a lab setting manifests itself as a reduction in variability caused by extraneous factors and therefore increases statistical power.

In contrast with lab studies, studies using big data (large amounts of data obtained from outside the lab in forms that defy traditional methods of analysis) do not include a direct manipulation of the IVs, because no new data are being collected. Nevertheless, in practice, the quantity of data helps offset issues of control and manipulation by providing alternative means of increasing statistical power. Instead of minimizing variance due to random error to increase the likelihood that trends and regularities can be identified, a larger number of samples helps separate them from randomness without sacrificing external validity.

Structured versus unstructured sources of data

When using big data for research purposes, it is important to note that there are two large classes of data—structured and unstructured. Most lab studies carefully collect structured data, in which each measurement is classified and categorized according to the conditions under which it is collected. The data are therefore annotated and structured on the basis of relevant variables and conditions. Likewise, many existing datasets, for example those that are often used for marketing and business purposes, are structured. Each datum is provided with contextual information that relates it to the dataset in relevant, and often important, ways.

However, plenty of data are also available that are unstructured—that is, data that are provided on their own, with very little relevant contextual information. This lack of relevant contextual information means that the researcher needs to supply a structure in which to contextualize the provided data and facilitate analysis.

Text is perhaps the best known of these unstructured data sets. The context provided by text is often included in the text itself. Nevertheless, even texts are often accompanied by some structural information. For instance, the date the text was written, as well as the identity of its authors, is often available. However, for most uses text is largely unstructured, because it is difficult to convert the provided textual information into quantifiable measurements. This makes textual data difficult to analyze using standard statistical methods.

Quantifying language

Multidimensional spaces and vector arithmetic

One common approach to producing quantified information out of text focuses on the analysis of the contexts in which words appear. These approaches ignore the structures of language and follow the premise that the distribution of words in a text is primarily governed by its content. This premise, succinctly identified by Firth (1957) when he postulated “You shall know a word by the company it keeps,” has proven resilient and useful in many studies. It forms the basis for some of the most frequently employed methods used to quantify textual data, such as latent semantic analysis (LSA; Landauer & Dumais, 1997; Landauer, Foltz, & Laham, 1998), topic models (Griffiths, Steyvers, & Tenenbaum, 2007; Steyvers & Griffiths, 2007), and machine-learning-based approaches such as Word2Vec (Mikolov, Chen, Corrado, & Dean, 2013a), skip-gram (Mikolov, Chen, Corrado, & Dean, 2013b), and GloVe (Pennington, Socher, & Manning, 2014). These approaches extract patterns of word co-occurrence as a proxy for their semantic content.

In all cases, the methods attempt to estimate the similarity of word meanings based on their proximity of appearance within a text. Figure 1 shows a two-dimensional depiction of the space for the words representing some mammals (“dog” and “cat”) and birds (“dove” and “eagle”), as well as two associated motion verbs (“walking” and “flying”). As the figure illustrates, related terms (e.g., mammals) appear in relative proximity, and distinct terms are separated by space. It is also relatively straightforward to represent phrases and sentences by combining the representations of the words of which they are composed, via methods such as vector addition.

Fig. 1. Sample two-dimensional representation of the relative positions of the words eagle, dove, cat, dog, walking, and flying. The positions of the words were generated using multidimensional scaling, to reduce a 100-dimensional space based on co-occurrence patterns in the British National Corpus (BNC Consortium, 2007). The terms cluster by their category (bird, mammal, verb) and are also related by semantic properties (e.g., flying is closer to the birds, while walking is closer to the mammals)
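To make the vector arithmetic concrete, the following Python sketch composes a phrase vector by addition and compares it to a word vector. It assumes a lookup table of pretrained word vectors; the dictionary and its random values are purely illustrative placeholders, not the actual space described above.

```python
import numpy as np

# Hypothetical 100-dimensional word vectors; in practice these would
# come from LSA, Word2Vec, GloVe, or a similar model.
rng = np.random.default_rng(0)
vectors = {w: rng.standard_normal(100) for w in ["dove", "flying", "cat"]}

def normalize(v):
    """Scale a vector to unit length."""
    return v / np.linalg.norm(v)

def cosine(u, v):
    """Cosine similarity between two vectors (1 = same direction)."""
    return float(np.dot(normalize(u), normalize(v)))

# A phrase is represented by adding its word vectors and renormalizing.
phrase = normalize(vectors["dove"] + vectors["flying"])
print(cosine(phrase, vectors["cat"]))
```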

Researchers have combined these techniques with other methods from natural language processing to explore a variety of applications, including answering questions (Mohler & Mihalcea, 2009), summarizing texts (Yeh, Ke, Yang, & Meng, 2005), automatic grading (Foltz, Laham, & Landauer, 1999; Graesser et al., 2000), and translating between languages (Tam, Lane, & Schultz, 2007). More importantly, in the context of psychological research, measurements of word similarity based on these methods also correlate with human performance in related tasks, such as judgments of similarity and semantic priming (Günther, Dudschig, & Kaup, 2016; Landauer & Dumais, 1997). Iliev, Dehghani, and Sagi (2015) have reviewed some of the methods of textual analysis used in psychology and related disciplines.

Conducting studies on quantified language data

A measure of textual similarity is surprisingly useful when it comes to testing psychological theories using texts. It provides a basic quantified measurement of difference that is amenable to statistical analyses and designs that are common in psychology. Even more importantly, the underlying representations used to generate this measure are already quantities, although they involve vector representations rather than scalars. Specifically, we can calculate the central tendency and variability of the vectors representing a group of related texts. The distances between pairs of vectors, whether they be representations of individual texts or central tendencies, are scalars (i.e., single numbers). The similarity measure mentioned above is an example of such a measure of distance. Consequently, we can use these vector representations as a basis for conducting a variety of statistical tests, such as t tests, analyses of variance (ANOVAs), and regression models.
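As a minimal sketch of this idea (with randomly generated stand-ins for real text vectors), the code below computes each text's similarity to its group's centroid and submits the resulting scalars to an ordinary t test; none of the variable names correspond to actual data from this article.

```python
import numpy as np
from scipy import stats

def centroid(vecs):
    """Sum a set of vectors and renormalize; the vector analogue of a mean."""
    c = np.sum(vecs, axis=0)
    return c / np.linalg.norm(c)

# Placeholder unit-length "text vectors" for two groups of texts.
rng = np.random.default_rng(1)
group_a = [v / np.linalg.norm(v) for v in rng.standard_normal((50, 100))]
group_b = [v / np.linalg.norm(v) for v in rng.standard_normal((50, 100))]

# Scalar DV: each text's similarity (dot product) to its group centroid.
sims_a = [float(np.dot(v, centroid(group_a))) for v in group_a]
sims_b = [float(np.dot(v, centroid(group_b))) for v in group_b]

# The scalars can now feed a standard independent-samples t test.
t, p = stats.ttest_ind(sims_a, sims_b)
print(t, p)
```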

This becomes of particular interest for psychological research when we consider that texts are produced by people. As such, texts can be considered as representing the individuals who created them. When comparing texts created by individuals who differ on specific attributes, such as gender, culture, or moral values, we are essentially comparing how these individuals use language. If a theory predicts some differences between individuals based on those attributes, we might be able to predict how the texts would differ, and consequently look for such differences as a test of the theory. Essentially, these attributes could play the role of the IVs in our studies, while textual similarity would be the DV, and the method itself would follow the same pattern as more common approaches to hypothesis testing in psychology (see also Sagi, 2018).

It is relatively straightforward to consider how such studies can be used in conjunction with lab-based studies. For instance, after observing an effect in the lab, we might be able to test for a similar effect on texts collected from online sources. This can provide researchers with an accessible approach for examining the external validity of their results.

However, there are also cases in which it is better to test a hypothesis using collected texts first. This often occurs when bringing the participants of interest to the lab is particularly difficult. For instance, we might predict that the outlawing of slavery following the Civil War in the United States changed the reasoning that individuals apply toward issues such as freedom and racial differences. However, it is difficult to test this theory, because all the individuals currently alive were born after those changes took place. Nevertheless, various textual artifacts have been left over from the Civil War period, such as books, letters, and journals. If we have theory-based predictions regarding how such individuals would consider slavery, we can collect textual evidence generated by individuals prior to the Civil War as well as similar evidence generated after the Civil War, and compare how individuals in these different periods treated slavery. One theory that can provide us with such predictions is Haidt and Joseph’s (2004) moral foundations theory, and Sagi and Dehghani (2014) have described how it can easily be applied to comparing the style of moral reasoning about particular concepts in texts.

Moreover, in some cases psychological theories make predictions that are difficult to test because they require the analysis of trends that are difficult to generate in the lab. Below, corpora of texts are used to analyze two such cases related to the representation of meaning and its link to the variability in how words are used: The first case tests a long-standing prediction of Gentner’s (1982) natural partition hypothesis: that verb meaning is more subject to change due to the textual context in which it appears than is the meaning of nouns (e.g., Gentner & France, 1988; Gillette, Gleitman, Gleitman, & Lederer, 1999). Drawing on theories of semantic change, we can predict that such variability should lead to a higher rate of semantic change for verbs than for concrete nouns. Although testing the context specificity of verbs and nouns can be reasonably achieved via lab experiments, semantic change takes shape over decades, and consequently is difficult to re-create in the lab. Using a diachronic corpus, we can demonstrate that relational words, such as verbs, show more evidence of semantic change than do concrete nouns.

Two additional studies were conducted, demonstrating that similar language-based analyses can be used to empirically support phonesthemes—nonmorphemic units of sound that are associated with aspects of meaning (e.g., the English prefix gl- is associated with the visual modality, as in glimpse, glow). A large number of phonesthemes have been proposed (see Hutchins, 1998), but they are difficult to support empirically. Corpus statistics were employed to gauge the likelihood that each proposed phonestheme is indeed associated with meaning. The resulting measure was further validated by demonstrating that it corresponds with participants’ performance in a lab experiment.

Study 1: Semantic change and relational representations

Relations in the representation of word meaning

Dedre Gentner and her colleagues (Asmuth & Gentner, 2005, 2017; Gentner, 1982; Gentner & France, 1988) proposed that, whereas many nouns denote specific entities (e.g., dog, lion, man), the meaning of verbs is inherently relational. For example, the notion of buying can only be exemplified using an entity such as a woman, who is performing the action on a different entity, such as a computer. That is, the verb buy can only be used in reference to other entities, frequently denoted using nouns. The denoted action can therefore be thought of as identifying a relationship between the entities involved. More generally, verbs denote relations between entities. For instance, Gentner (2006) argued that concrete nouns are easier to learn because they are inherently individuated and more easily separable from the environment. In contrast, the relational nature of verbs makes their meaning more dependent on the context in which they appear. Likewise, Gentner and France proposed that this contextual sensitivity explains why participants in their studies preferred to adjust the meaning of the verbs rather than that of the concrete nouns when paraphrasing sentences such as “The lizard worshipped.” Asmuth and Gentner (2005, 2017) demonstrated that such results can be seen not only when contrasting nouns and verbs, but also when contrasting concrete nouns (such as lion) with nouns that denote relational meanings (such as threat).

Interestingly, this hypothesis has implications for the structure of language and for our expectations regarding the uses of nouns and verbs more generally. In particular, such adjustments are an essential aspect of metaphors. The hypothesis that verbs are more relational than nouns can therefore be used to predict that it is easier to use verbs metaphorically than it is to use concrete nouns. Moreover, linguistic theories on semantic change have long argued that metaphorical uses are one of the primary avenues through which the meaning of words is changed and extended (Traugott & Dasher, 2001).

Consequently, we can hypothesize that a word that is more contextually sensitive should appear in a greater variety of contexts, and, more importantly, change its meaning more over time. However, such changes take place over long periods of time, and are likely to be infrequent. It is therefore difficult to observe and measure such changes in the context of lab studies. In contrast, we have access to a variety of textual sources that were created over long periods of time. Statistical methods can thus be applied in order to test the hypotheses that particular classes of words, such as verbs, vary more across these texts than do words that arguably are less relational, such as concrete nouns. We could then examine the variability of context within a period, as well as how it varies across periods.

Measuring semantic change in a diachronic corpus

Semantic change has traditionally been measured on a word-by-word basis. Researchers identify a word whose meaning they are interested in tracing. They collect the contexts in which it appears over a period of time (often centuries) and record its use in each case. The hypothesis of semantic change can then be tested by examining trends in its uses over time. One famous example is the rise of periphrastic do, which was traced by Ellegård (1953). In this case, the word do used to have a specific verb meaning in Old and Middle English—it denoted a causative relation (e.g., “did him gyuen up,” the Peterborough Chronicle, ca. 1154). In modern English, do is more frequently used as a grammatical function word (e.g., “Do you like it?”).

Although do started out as a verb with a meaning that was quite relational, its meaning was still less context-sensitive in Middle English than it is in Modern English. By measuring the variety of contexts in which do appears, Sagi, Kaufmann, and Clark (2012) demonstrated this shift using corpus statistics, with results that correspond to Ellegård’s hand-coded measures. The measure they used is essentially a measure of the variability of the contexts in which the word appears within each period. The contextual variability in the uses of do exhibits a marked increase between the 15th and 16th centuries.

However, not all changes in meaning necessarily result in broadening, as was the case for periphrastic do. In many cases, such broadening is limited to the addition of a handful of new uses, or the loss of an old use. It might therefore be useful to also examine the change in use as a shift in the contexts rather than simply an increase in their variability. One possible source of such shifts, through metaphoric extension, might arise out of the effects of conceptual framing, such as the framing of terror as an act of war instead of as a crime following the events of September 11, 2001 (see Lakoff, 2009; for a related computational method, see Sagi, Diermeier, & Kaufmann, 2013).

The similarity measure used in LSA and other methods of corpus statistics provides one possible method for tracing these changes. In particular, the more similar the uses of a word in one period are to its uses in another, the less likely it is that semantic change has occurred. Conversely, if a word has undergone a shift in its meaning or how it is used, we might expect it to appear in a different set of contexts in the new period than it did in the old. For example, the word computer used to mean a person who computes. This meaning was largely replaced by its current use, referring to a class of machines. Therefore, the vectors representing the new contexts should be farther away from the vectors representing the old contexts than for a word whose meaning did not change (or changed to a lesser degree). By examining these measures across a large number of words, we can use statistics to identify trends in semantic change.

It is important to distinguish between two distinct sources of variability in contexts of use over time. In the first case, words with broader meanings (periphrastic do presents an extreme case of such words), are likely to be used in a wide variety of ways. Consequently, their uses will vary greatly within a time period, as well as across time periods. Semantic change presents a second source of variability of contexts over time. In this case, a drift, or change, in the meaning of a word, such as computer, or the addition of new meanings, result in a change in the contexts that a word is used in over time. Importantly, without semantic change, the broadness of application of a word remains constant over time, and therefore can be expected to be largely constant regardless of the time span involved. In contrast, drifts in meaning can be expected to accumulate over time and therefore show an increase in contextual variability as the time span examined increases. That is, whereas the effect of broadness of meaning on contextual variability does not depend on the length of time between uses, semantic change should show an increase in variability for longer time periods. For example, variability in contexts due to the broadness of applicability of a word should be the same whether measured over 25 years or 50 years, whereas semantic change would be expected to result in higher variability when measured over 50 years than when measured over 25 years.

Finally, an additional source of variability in the context of use of words over time comes from changes in the use of other words. That is, a shift in which the word man appears frequently with the word silly at one time period, but more frequently with blessed in another, might occur not because the meaning of man has changed, but because of pejoration in the meaning of the word silly, whose uses are then replaced by blessed. For the purposes of the present study, since each word was analyzed in isolation by comparing its uses in one period to its later uses, these changes will be treated as statistical noise, under the assumption that they are uniformly distributed across the corpus and do not vary by grammatical category.

Method

Materials

Corpus

To identify changes in word meaning in modern English, a corpus of 19th-century texts was collected from Project Gutenberg (www.gutenberg.org; Lebert, 2011), using the bulk of the English-language literary works available through the project’s website. This resulted in a corpus of 4,034 separate documents, consisting of over 240 million words. The Project Gutenberg preamble was removed from the books prior to analysis. Infomap (2007; Takayama, Flournoy, Kaufmann, & Peters, 1998) was used to generate a semantic space based on this corpus, using its default settings (the 20,000 most frequent content words, a co-occurrence window of ±15 words, and a 100-dimension space) and its default stop list.

Dating texts from the 19th century is difficult, because publication dates are often not readily available. Moreover, when considering language change, the publication date might not be the relevant date to use, because the manuscript might have been written years earlier. The analysis was therefore based on the birth dates of the authors, because they were easily obtained and are relevant from a linguistic perspective, since much of language learning occurs within the first few years of life. For the purposes of the analysis below, texts were also aggregated into 25-year periods. Consequently, the texts from 3,490 books written by authors born in the 19th century were used in the present analysis (1800–1824, 887 books; 1825–1849, 1,020 books; 1850–1874, 1,243 books; 1875–1899, 340 books).

Nouns and verbs

The nouns and verbs used in the study came from two sources: First, high-frequency nouns and verbs were collected from the 500 most frequent words in the corpus. This procedure contrasts words that are in frequent use and have relatively stable meanings. The grammatical categories of the high-frequency words were determined on the basis of the MRC2 database (Wilson, 1988). Nouns that were only rarely used as verbs were counted as nouns, and vice versa. Out of the 500 words, 168 nouns and 95 verbs met this selection criterion (52.6% of the high-frequency words examined). Although this list includes more nouns than verbs, this is to be expected when examining high-frequency words. Nevertheless, these nouns and verbs are relatively equally interspersed among the list of the 500 most frequent words in the corpus. Specifically, the mean frequency of the nouns was 46,486 (SD = 2,884.78), and the mean frequency of the verbs was 43,077 (SD = 3,289.40). The two conditions did not significantly differ in frequency, t(258) = 0.744, p = .46.

Second, the study employed a list of frequency-matched relational nouns, entity nouns, and verbs obtained from Dedre Gentner, which was based on the lists used in previous studies (primarily from Asmuth & Gentner, 2005). This list comprised 70 entity nouns (e.g., emotion, fruit), 81 relational nouns (e.g., game, marriage), and 76 frequency-matched verbs (e.g., buy, explain).

Procedure

Calculating context vectors

This study was designed to compare the variability of different word classes across uses and time. The analysis was based on the precomputed semantic space generated from the corpus of 19th-century texts described above. This space provides vector representations for 20,000 words. However, these vectors are computed as aggregates over the entire corpus.

We can employ vector arithmetic to compute the vectors representing the use of a word, such as man, in a particular subset of a corpus (such as a particular book, an author, or a time period). This is done by aggregating the contextual representations of the word and essentially averaging them together. Specifically, we can calculate the context vector of each occurrence of the target word by summing up the vectors of the words that appear in its context and normalizing the resulting vector to unit length. Following the convention used in Infomap, the contexts used here comprised the 15 words that preceded the target word and the 15 words that followed it, for a total of 30 words. After computing a context vector for each appearance of the target word (e.g., man) in the selected subset, we can average the resulting context vectors together using the same vector addition and normalization process. The resulting vector represents the centroid of the vectors it aggregates and is functionally equivalent to the mean in a scalar context.
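A rough Python rendering of this procedure appears below. It assumes `tokens` is a tokenized subcorpus (e.g., one 25-year period) and `vectors` maps words to 100-dimensional arrays; both names are illustrative, and details such as stop-list handling are omitted.

```python
import numpy as np

def context_vector(tokens, i, vectors, window=15):
    """Sum the vectors of up to `window` words on either side of
    position i, then normalize the result to unit length."""
    lo, hi = max(0, i - window), min(len(tokens), i + window + 1)
    ctx = [vectors[w] for j, w in enumerate(tokens[lo:hi], start=lo)
           if j != i and w in vectors]
    if not ctx:
        return None                      # no known context words
    v = np.sum(ctx, axis=0)
    return v / np.linalg.norm(v)

def word_centroid(tokens, target, vectors):
    """Average the context vectors of every occurrence of `target`
    in a subcorpus; the result is that subset's centroid for the word."""
    occs = [context_vector(tokens, i, vectors)
            for i, w in enumerate(tokens) if w == target]
    occs = [v for v in occs if v is not None]
    c = np.sum(occs, axis=0)
    return c / np.linalg.norm(c)
```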

Measuring vector similarity

We can gauge the similarity of two vector representations by examining the angle between them—similar vectors will point in similar directions and will have a small angle, whereas differing vectors will point in different directions and therefore exhibit a larger angle. For vectors of unit length, the cosine of the angle is equivalent to the Pearson correlation between the components of each vector, which is the basic measure of similarity used in this article.

Computing contextual variability

The variability associated with an aggregate vector (such as a vector that represents a word in a subset of the corpus, as described above) can be conceptualized as the variability of the vectors constituting it. We can therefore measure such variability by examining the similarity of each constituent vector to the centroid, in a similar fashion to how the variance of individual data points is measured with relation to the mean. Importantly, since the correlation measure is higher for context vectors that are closer to the centroid (with a maximal value of 1 for vectors that are identical to the centroid), higher numbers indicate less variability. To aid with interpretation, below this correlation will be referred to as a measure of uniformity. Finally, its distribution is asymmetric, with maximal uniformity at one end and maximal variability at the other. As a result, there is no need to use the absolute value of the differences or to square them in order to get an accurate estimate of uniformity for comparisons.
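In code, the uniformity measure might be sketched as follows, reusing the unit-length context vectors computed above; this is an illustration of the logic, not the original implementation.

```python
import numpy as np

def uniformity(context_vecs):
    """Mean similarity (dot product of unit vectors) of each context
    vector to their centroid: 1 = identical contexts; lower values
    indicate greater contextual variability."""
    c = np.sum(context_vecs, axis=0)
    c = c / np.linalg.norm(c)
    return float(np.mean([np.dot(v, c) for v in context_vecs]))
```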

Results

Nouns and verbs

The first analysis compared the variability in context and the change in word use over time between high-frequency nouns and verbs. For each word, the vectors representing the contexts in which the word appeared were examined in groups of texts spanning 25-year periods, based on the birth dates of the authors. Two hypotheses were tested: first, that verbs are used in more varied textual contexts than nouns, and second, that the contexts in which verbs appear change more rapidly over time than do the contexts in which nouns appear. To demonstrate this, changes in the mean textual contexts were compared over two time scales—25 and 50 years. Importantly, if we were to observe higher variability over periods of 50 years than over periods of 25 years, we could demonstrate that change in use accumulates over time. Since the number of authors whose works are out of copyright (and therefore can be provided by Project Gutenberg) drops sharply at the beginning of the 20th century, the starting periods in this analysis were limited to authors born from 1800 to 1825 and from 1825 to 1850. The means and standard deviations of the correlations between context vectors across time can be found in Fig. 2.

Fig. 2. Measures of semantic change in nouns and verbs over time. Higher values indicate more similarity in meaning over time. Error bars represent standard errors of the means

Variability within a time period was measured by averaging the correlation of the contexts of each term to the centroid representing it for that particular time period. That is, first the average vector of all of the contexts was calculated for a particular word (e.g., man), and then the correlation of each context to this centroid. The resulting measure reflects the uniformity of contexts—if all contexts were identical, the average of these correlations would be 1. The more variability there was in the contexts, the lower the average correlation of the individual vectors to the centroid would be. This measure of uniformity was then averaged across all the 25-year time periods, to calculate the overall uniformity of context for each word. The overall uniformity of use of nouns and verbs could then be compared using a one-way ANOVA. As predicted, nouns (M = .433, SD = .037) were more uniform than verbs (M = .399, SD = .033), F(1, 258) = 55.32, MSE = 0.0013, p < .001, ηp2 = .18.
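For illustration, given one mean uniformity score per word, the reported comparison reduces to a standard one-way ANOVA, which could be run along these lines (the value lists are placeholders, not the study's data):

```python
from scipy import stats

# One mean uniformity score per word (placeholder values, not real data).
uniformity_nouns = [0.44, 0.41, 0.46, 0.43]
uniformity_verbs = [0.40, 0.38, 0.42, 0.39]

# One-way ANOVA comparing the two grammatical categories.
f, p = stats.f_oneway(uniformity_nouns, uniformity_verbs)
print(f, p)
```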

A two-way ANOVA was used to analyze change over time. In this analysis, the basic dependent measure was the correlation of the centroids between periods. That is, the centroid of each word from one time period (e.g., 1800–1825) was correlated with its centroid in a second time period (e.g., 1825–1850 for a 25-year span, or 1850–1875 for a 50-year span). Grammatical category (noun vs. verb) and length of time elapsed (25 vs. 50 years) were the IVs, and the similarity of meaning was the DV (measured as the correlation between the centroids of a word for two time periods that began either 25 or 50 years apart). Whereas grammatical category was a between-subjects variable (since different words counted as subjects in this study), the elapsed length of time was a within-subjects variable.
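A design like this can be expressed as a mixed ANOVA in standard statistics packages. The sketch below uses the pingouin library on a toy long-format table, with words as subjects, grammatical category as the between-subjects factor, and time span as the within-subjects factor; the words and values are invented for illustration.

```python
import pandas as pd
import pingouin as pg  # assumed available; any mixed-ANOVA routine would do

# Toy long-format data: one row per word per time span (values invented).
df = pd.DataFrame({
    "word":       ["man", "man", "dog", "dog", "run", "run", "buy", "buy"],
    "category":   ["noun"] * 4 + ["verb"] * 4,
    "span":       ["25y", "50y"] * 4,
    "similarity": [0.62, 0.55, 0.64, 0.58, 0.58, 0.45, 0.56, 0.44],
})

# Mixed ANOVA: category varies between words, span varies within words.
aov = pg.mixed_anova(data=df, dv="similarity", within="span",
                     subject="word", between="category")
print(aov)
```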

As predicted, a significant main effect of grammatical category emerged, in which the meaning of nouns was more similar over time than was the meaning of verbs, F(1, 258) = 50.59, MSE = 0.00032, p < .001, ηp2 = .16. Unsurprisingly, there was also a main effect of time, in which the correlation between the centroids was lower for the 50-year than for the 25-year periods, F(1, 258) = 2,523.75, MSE = 0.000037, p < .001, ηp2 = .91. More importantly, the predicted interaction was also observed—verbs showed more change in their centroids over time than did nouns, F(1, 258) = 37.28, MSE = 0.000037, p < .001, ηp2 = .13.

It is important to consider that grammatical classes differ not only in relationality, but also in qualities such as concreteness and familiarity. In particular, nouns tend to denote more concrete entities than verbs do. To test whether concreteness and familiarity accounted for the differences observed, all the words in the high-frequency study that had MRC2 concreteness and familiarity ratings were collected, and a median split was used to identify low- and high-rating words. Concreteness significantly correlated with context similarity for both the 25-year span (r = .165, p < .05) and the 50-year span (r = .266, p < .01). Similarly, familiarity also correlated with context similarity for the 25-year span (r = .195, p < .01) and the 50-year span (r = .211, p < .01). The analysis above was repeated on this reduced set of words (135 nouns and 60 verbs), with concreteness and familiarity as covariates, and replicated the above effect. Importantly, the interaction observed earlier was still significant, even after controlling for the effects of concreteness and familiarity, F(1, 191) = 5.29, MSE = 0.000033, p < .05, ηp2 = .03. Nevertheless, the effect size was reduced, suggesting that the effects of grammatical class might be partially, but not completely, explained as differences in concreteness and familiarity between the two classes.

Entity nouns and relational nouns

Next, we turn to a comparison of entity nouns and relational nouns. As was mentioned earlier, if the likelihood of semantic change is higher for relational words, we should expect the higher rate of change to be evident not only for verbs, but for other relational words, such as relational nouns. The means and standard deviations of the correlations between context vectors across time for the entity nouns, relational nouns, and frequency-matched verbs used in the analysis can be found in Fig. 3.

Fig. 3. Measures of semantic change in entity nouns, relational nouns, and frequency-matched verbs over time. Higher values indicate more similarity in meaning over time. Error bars represent standard errors of the means

As before, first I computed the average uniformity of use for each word. A one-way ANOVA was used to test whether relational nouns and verbs showed more variability in use than entity nouns. As predicted, entity nouns (M = .36, SD = .056) exhibited more contextual uniformity than either relational nouns (M = .33, SD = .043) or verbs (M = .29, SD = .048), F(2, 224) = 45.10, MSE = .002, p < .001, ηp2 = .29. Tukey’s HSD test showed that all three classes of words were different from each other in their uniformity. That is, relational nouns were more variable than entity nouns, and verbs exhibited less uniformity than either class of nouns.

For analyzing change in context over time, the same overall procedure was followed that had been used previously. As before, a small but significant main effect of grammatical category emerged, F(2, 224) = 6.33, MSE = .001, p < .01, ηp2 = .053. The difference between the centroids also increased over time, F(2, 224) = 1,004.21, MSE = .0001, p < .001, ηp2 = .818. More importantly, the expected interaction was observed, in which this increase over time was greater for relational nouns and for verbs than for entity nouns, F(2, 224) = 11.08, MSE = .0001, p < .01, ηp2 = .09.

Because this effect might have been driven primarily by the change in verbs, a planned analysis was also conducted that did not include the verbs. This analysis resulted in a similar pattern, with entity nouns showing less overall evidence of change in their centroid than relational nouns, F(2, 149) = 4.80, MSE = .0008, p < .05, ηp2 = .031. The rate of change also increased over time, F(2, 149) = 760.66, MSE = .00001, p < .001, ηp2 = .836. Most importantly, the observed interaction, in which relational nouns showed an increased rate of change over time as compared to entity nouns, was also preserved, F(2, 149) = 6.98, MSE = .00001, p < .01, ηp2 = .045.

Discussion

In this study, patterns of language change were compared for English nouns and verbs. This analysis revealed that nouns showed less contextual variability within each time period than verbs. Likewise, the centroids representing nouns changed more slowly over time than those representing verbs, and entity nouns changed more slowly than relational nouns. These results are in line with theories arguing that verbs, and relational nouns, are represented using relations, whereas entity nouns are represented as direct denotations.

These results also demonstrate the utility and efficacy of corpus statistics as a tool for observing large-scale trends in language use. Whereas in the lab we observe and record the behavior of an individual or a small number of individuals at a time, focusing on the details of their behavior, corpora provide us with an overview of the behavior of large groups of people. Converging evidence from both methodologies is likely to provide researchers with more confidence in the validity and reliability of their results.

Study 2: Phonesthemes in text

The case for phonological correlates of meaning

It is a popular intuition that words with similar sounds also mean similar things. There is a long tradition of belief in the association between phonetic clusters and semantic clusters, going back at least as far as Wallis’s grammar of English (Wallis, 1699). Morphemes form one such well-known cluster, but other submorphemic phonetic clusters that contribute to the meaning of the word as a whole have also been hypothesized (Firth, 1957; Jakobson & Waugh, 1979). Anthropologists have documented sound symbolism in many languages (Blust, 2003; Nuckolls, 1999; Ramachandran & Hubbard, 2001), but its role as a purely linguistic phenomenon is still unclear. Moreover, the Saussurean notion of the arbitrary relationship between the sign’s form and its referent is a matter of dogma for most linguists (de Saussure, 1916/2011; Hockett & Hockett, 1960). This makes the study of words that do participate in predictable sound–meaning mappings all the more important, since, under the framework of contemporary linguistics, it is difficult to explain how these patterns come to be, or why they might survive despite the obvious benefits of arbitrary sound–meaning mappings. What is meant by “sound–meaning mapping” here is not purely sound symbolism, however, nor is it morphology. The present study offers a statistical, corpus-based approach to phonesthemes, or submorphemic units that have a predictable effect on the meaning of a word as a whole. These nonmorphological relationships between sound and meaning have not been explored thoroughly in behavioral or computational research, with some notable exceptions (e.g., Bergen, 2004; Hutchins, 1998).

Monaghan, Chater, and Christiansen (2005) and Farmer, Christiansen, and Monaghan (2006) studied the diagnosticity of phonological cues for lexical category membership. They performed a regression analysis on over 3,000 monosyllabic English words and demonstrated that certain phonological features are associated with an unambiguous interpretation as either a noun or a verb. An associated series of experiments demonstrated reaction time, reading time, and sentence comprehension advantages for phonologically “noun-like nouns” and “verb-like verbs.”

Bergen (2004) used a morphological-priming paradigm to test whether there was a processing advantage for words containing phonesthemes over words that shared only semantic or only formal features, or that contained “pseudo-phonesthemes.” He found a difference in reaction times between the phonestheme condition and the other three conditions by comparing primed reaction times to RTs to the same words in isolation, drawn from Washington University’s English Lexicon Project. He demonstrated both a facilitation effect for word pairs containing a phonestheme and an inhibitory effect for word pairs in which the prime contained a pseudo-phonestheme. His use of corpus-based methods (in this case, latent semantic analysis; Landauer et al., 1998) was limited to ensuring that the list of words used in meaning-only priming pairs did not have any higher semantic coherence than the list of words used in phonestheme priming pairs.

Finally, Hutchins (1998, Studies 1 and 2) examined participants’ intuitions about 46 phonesthemes, drawn from nearly 70 years of speculation about sound–meaning links in the literature. In her studies, participants ranked phonestheme-bearing words’ perceived coherence with a proposed gloss or definition meant to represent the meaning uniquely contributed by the phonestheme. Participants also assigned candidate definitions to nonsense words containing phonesthemes at rates significantly above chance, whereas words without phonesthemes were assigned particular definitions at rates not significantly different from chance. She also examined patterns internal to phonesthemes: strength of sound–meaning association, regularity of this association, and “productivity,” defined as likelihood that a nonword containing that phonestheme will be associated with the definition of a real word containing that phonestheme.

A big-data approach to studying phonesthemes

Previous studies of phonesthemes relied on the intuitions of participants to verify the sound–meaning relationships of interest (e.g., Bergen, 2004; Hutchins, 1998). These methods are at their best when testing only a limited number of phonesthemes. As a result, such studies have often constrained their examination to only a handful of phonesthemes. Even in the most extensive of these works, Hutchins, who identified over 100 phonesthemes previously indicated in the literature, used only 46 of them in her experiments. Big data, and in particular textual data in the form of corpora, provides an alternative source of information on the meanings of words. As was described earlier, statistical approaches such as LSA, topic models, and Word2Vec can be used to extract measures that correlate with participants’ performance on a variety of semantic similarity measures. Consequently, we can use a corpus to examine the hypothesis that words sharing a particular phonestheme also share a similarity in meaning.

Because the phonestheme as a construct necessarily involves a partial overlap in meaning beyond that generally found in language, the hypothesis here was that words sharing a phonestheme would exhibit greater semantic relatedness than words chosen at random from the entire corpus. This computational approach to the problem has two distinct advantages over the experimental methods commonly found in the literature. First, this method is objective and does not rely on intuition on either the part of the experimenter (e.g., in choosing particular examples and glosses for a phonestheme) or the participants (e.g., in Study 1, Hutchins, 1998, asked participants to rate the fit between glosses and words). Second, it is possible to use the method to test a large number of candidate phonesthemes without requiring us to probe each participant for hundreds of linguistic intuitions at a time.

Method

Materials

Proposed phonesthemes in English

The bulk of the candidate phonesthemes used were taken from the list used by Hutchins (1998), with the addition of two possible orthography-based clusters that seemed interesting. The materials also included two letter combinations that seemed unlikely to be phonesthemes, in order to test the method’s capacity to discriminate between phonesthemes and nonphonesthemes. The study was based on 149 possible phonesthemes, most of which were collected by Hutchins. Of these, 46 were taken from the list Hutchins used in her first study, two were candidates that were considered plausible orthographic clusters (kn- and -ign), and two were chosen as phonemic sequences that seemed unlikely to be phonesthemes (br- and z-). After examining the corpus, I decided to drop 43 of the 149 possible phonesthemes, because those candidates had six or fewer types in the corpus and therefore were not suitable for statistical analysis (e.g., the prefixes str_p-, sp_t-, spl-, and the suffixes -asp, -awl, and -inge). Therefore, a final list of 106 candidate phonesthemes was tested (33 of which had also been used in Hutchins’s first study).

For each phonestheme, all instances of that phonestheme were collected from the 20,000 most frequent content words, based on orthographic match. For each individual word stem, all but one occurrence of the stem were removed from the list (e.g., in the list for the phonestheme -ash, the words dashed and dashes were removed, while the word dash itself was retained). Likewise, morphemic uses of particular phonesthemes, such as -er, were also eliminated (e.g., bigger, thinner). Preference was given to retaining the stem itself whenever it was available in the list. Finally, I verified that all the words within a particular phonesthemic cluster shared the same phonetic pronunciation of the phonestheme. A sample list of words is given in Fig. 4.

Fig. 4. List of words ending with the phonestheme -oop
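A simplified version of this collection step might look like the following Python sketch, which matches a candidate phonestheme pattern against a word list and keeps one word per stem; the lexicon and the crude suffix-stripping rule are illustrative stand-ins for the manual procedure described above.

```python
import re

def cluster_words(lexicon, pattern):
    """Return words matching a phonestheme pattern (e.g., r'^gl' or
    r'oop$'), keeping only one word per stem via crude suffix stripping."""
    matches = [w for w in lexicon if re.search(pattern, w)]
    seen, kept = set(), []
    for w in sorted(matches, key=len):          # prefer the bare stem
        stem = re.sub(r"(ing|ed|es|s)$", "", w)
        if stem not in seen:
            seen.add(stem)
            kept.append(w)
    return kept

lexicon = ["glow", "glowed", "gleam", "glimpse", "globe", "dash", "dashes"]
print(cluster_words(lexicon, r"^gl"))   # ['glow', 'gleam', 'globe', 'glimpse']
```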

Corpus

This analysis was based on the same set of texts, publicly available from Project Gutenberg, used in the previous analysis.

Procedure

One of the primary results from studies of semantic vector space representations is that the distance between words in such spaces correlates well with the performance of participants in semantic similarity tasks. This property of semantic spaces was used to test the hypothesis that pairs of words sharing a phonestheme are more likely to share some aspect of their meaning than are pairs of words chosen at random.

Measures of the semantic relatedness of each cluster were taken by randomly sampling 1,000 pairs of words from the cluster and averaging the cosine similarity of these pairs. A single-sample t test, based on the above average and its variability, tested whether the word cluster representing each candidate phonestheme exhibited a level of semantic relationship significantly higher than that demonstrated among pairs of words selected at random from the entire corpus. Because this analysis involved 106 comparisons, a Bonferroni correction was used, and alpha was adjusted to .000485. As an estimate of the degrees of freedom, the number of types identified was adopted as the effective sample size of each phonestheme. The relevant critical t scores ranged from 3.34 (-er, with 229 types) to 5.99 (e.g., -oom, with seven types).
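The per-phonestheme test can be sketched as follows, assuming each cluster is a list of unit-length word vectors so that a dot product gives the cosine similarity. Note one simplification: scipy's ttest_1samp derives its degrees of freedom from the 1,000 sampled pairs, whereas the analysis reported here adopted the number of word types as the effective sample size.

```python
import random
import numpy as np
from scipy import stats

BASELINE = 0.021   # corpus-wide mean similarity of random word pairs
ALPHA = .000485    # Bonferroni-adjusted threshold reported above

def phonestheme_test(word_vecs, n_pairs=1000):
    """Mean cosine similarity of randomly sampled word pairs from one
    phonestheme cluster, tested against the random-pair baseline."""
    sims = [float(np.dot(u, v))          # dot = cosine for unit vectors
            for u, v in (random.sample(word_vecs, 2) for _ in range(n_pairs))]
    t, p = stats.ttest_1samp(sims, popmean=BASELINE)
    return float(np.mean(sims)), t, p
```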

Results

First, the baseline of the semantic relationship between randomly selected words in the corpus was calculated, using 1,000 randomly chosen word pairs. This provided a baseline estimate of the expected similarity distribution for unrelated terms (M = .021, SD = .11). As was described above, the strength of each phonestheme was calculated as the average of the pair-wise correlations of 1,000 randomly selected pairs of words that share the phonestheme. It is possible to interpret this strength measure as an effect size measure. In particular, using Cohen’s d, any phonestheme exhibiting a strength measurement greater than 0.043 (i.e., the baseline mean of .021 plus 0.2 baseline standard deviations) could be argued to have a small relationship to the meaning of the words including it (d' > 0.2), whereas a measured strength greater than 0.076 (d' > 0.5) would indicate a phonestheme with a medium-strength relationship to the meaning of the words including it. A list of the results for each of the tested phonesthemes can be found in the Appendix.

Next, single-sample t tests, with a population mean of .021 as measured above, were used to test whether each candidate phonestheme exhibited more semantic cohesiveness than pairs of words chosen at random from the corpus (the t scores are also provided in the Appendix). Among the 106 potential phonesthemes tested, the evidence provided statistical support for 61 (57%). Among Hutchins’s original list of 33 possible phonesthemes, 24 proved to be statistically reliable candidates (73%). Overall, the results were in line with the empirical data collected by Hutchins (1998). By way of comparing the two data sets, the present measure of phonestheme strength correlated well with Hutchins’s average rating measure (r = .51, p < .01). Neither of the unlikely phonestheme candidates was statistically supported by the test (tbr- = 2.03, tz- = –2.47), whereas both of the newly hypothesized orthographic clusters were statistically supported (tkn- = 9.22, t-ign = 9.54).

Interestingly, a negative correlation (r = –.32, p < .001) was apparent between the number of tokens for a given phonestheme and its significance. However, it is important to note that this correlation is not unique to the present method, since it is also evident in the results reported by Hutchins (1998; e.g., r = –.44, p < .05, between the number of types in the present study and the average rating in Hutchins’s Study 1).

Discussion

This study provided statistical evidence for over 50% of the proposed phonesthemes. Given the wide range of phonetic clusters proposed and their overall relatively high level of support, it seems likely that some aspects of meaning might be related to sound after all. Although this might appear at first to be a significant blow to the hypothesis that the assignment of meaning to words is arbitrary, it is important to remember that much of this relationship might have historical roots and be a result of the nonarbitrariness of semantic change, like the change discussed in the previous study. In particular, as Boussidan, Sagi, and Ploux (2009) demonstrated, it is possible to connect phonesthemes such as gl- to specific phonetic clusters that are hypothesized to be part of the reconstructed Proto-Indo-European language.

Given such extensive historical roots for at least some phonesthemes, it is possible that there are some perceptual links between specific phonesthemes and their meanings. This possibility is akin to suggesting that phonesthemes might originate from onomatopoeias. On the other hand, it is possible that, over time, semantic change might result in clusters of words that share both phonetic and semantic aspects (e.g., through borrowing a set of words from a different language). Importantly, these two hypotheses are not contradictory, and it is likely that even phonesthemes whose origin is onomatopoetic would exhibit some change and drift over time.

One possibility that is particularly intriguing in this regard is that phonesthemes might provide individuals with clues to the meanings of unfamiliar words. For instance, when a child encounters a word such as glamorous for the first time, the child might try to understand the word’s meaning from the context in which it occurs (e.g., “the actress was glamorous”). However, it is possible that in such cases the child would not limit him- or herself to the immediate context, but would also consider other possible sources of information. Phonesthemes might provide such a source. In particular, if that child already knew the words glisten, gleam, and glow, this regularity in sound might influence the child toward interpretations of glamorous that involve visual aspects of the actress rather than her behavior. The following study was designed to further explore this hypothesis, by examining whether phonesthemes affect participants’ interpretations of nonce words in context.

Such a process might itself help give rise to phonesthemes. The hypothesis of the next study was that phonesthemes would influence participants’ guesses as to the meaning of unknown words. This hypothesis was tested by presenting participants with a fill-in-the-blank task and asking them to choose the best-fitting option among three nonce words. In line with the previous results, the prediction was that participants would prefer an option with a phonestheme that fit the context over ones that did not. For example, when asked to complete the sentence “The stone’s ______ flashed from under the leaves,” which provides a largely visual context, participants should choose completions that involved the vision-related gl- phonestheme more frequently than words that included other phonesthemes, such as -oop.

Study 3: Phonesthemes in the lab

Method

Participants

Nineteen native English-speaking participants from a major Midwestern university participated in the study in exchange for course credit.

Materials

Six phonesthemes were selected that had been well supported in Study 2: gl-, sn-, kn-, -ign, -oop, and -ump. Six nonce words were created using each phonestheme (e.g., for gl-, the stimuli were glaim, glandor, glatt, glay, glunst, and glybe), as well as 18 additional nonce words that did not involve any phonestheme (e.g., coffle, fane, and argol). Importantly, the nonce words representing a phonestheme did not exhibit any other phonestheme.

Also, 36 sentences were generated. Each phonestheme was congruent with the blank in six sentences, based on its associated meaning (e.g., for gl-, the blanks were best fit by words with visual meanings). In addition, a different phonestheme was identified that was not congruent with the blank in each sentence. These matches were further verified by comparing the vector representing the sentence to the aggregate vector for the phonestheme in the same corpus that had been used in Study 2. The overall correlation between the congruent phonesthemes and their relevant context sentences was r = .32. The overall correlation between the incongruent phonesthemes and their matched context sentences was r = .004. These correlations were significantly different from each other, according to a paired-samples t test, t(35) = 9.3, p < .0001. Each phonestheme was matched with three other phonesthemes, resulting in 18 phonestheme pair matches. Within each pairing, each phonestheme was congruent in two sentences and incongruent in another two sentences. The nonce words were randomly assigned to each of these 18 pairs.

Procedure

As described above, each sentence was associated with a congruent phonestheme, an incongruent phonestheme, and a nonphonesthemic nonce word. The order in which the three words were presented for each sentence was counterbalanced across four orders (two random orders and their inverses); a sketch of this counterbalancing appears below. The study was administered in a pen-and-paper format, and a sample sentence is provided in Fig. 5. Participants were asked to circle the word that made the most sense to them as the missing word in each sentence.
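For concreteness, the four presentation orders could be generated as follows; this is an illustrative reconstruction of the counterbalancing, not the original materials script.

```python
import random

def presentation_orders(words):
    """Two random orders of the three response options, plus their inverses."""
    orders = []
    for _ in range(2):
        order = random.sample(words, len(words))  # one random permutation
        orders.extend([order, order[::-1]])       # the order and its inverse
    return orders

# e.g., for the item shown in Fig. 5
print(presentation_orders(["glybe", "noop", "drell"]))
```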

Fig. 5 A sample item from Study 3. The congruent word is glybe (gl-), the incongruent word is noop (-oop), and the neutral nonce word is drell

Results

The mean rate of each choice is presented in Fig. 6. Participants chose congruent phonesthemes 43% of the time (M = 15.4, SD = 2.69) and incongruent phonesthemes 23% of the time (M = 8.26, SD = 1.76). Single-sample t tests compared these rates with the base rate expected under random choice (33%, or 12 out of 36). As predicted, participants chose words incorporating the congruent phonestheme more frequently than would be expected by chance, t(18) = 3.42, p < .001, d = 1.27. Likewise, participants chose words incorporating the incongruent phonestheme less frequently than would be expected by chance, t(18) = −3.74, p < .001, d = 2.12.
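In terms of the analysis, each participant contributes a count of choices out of 36 trials, which is compared against the chance baseline of 12. The following sketch shows the form of the test using hypothetical per-participant counts (the real data are not reproduced here).

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
# hypothetical counts of congruent choices (out of 36) for 19 participants,
# generated to have roughly the reported mean rate of 43%
congruent_counts = rng.binomial(n=36, p=0.43, size=19)

# one-sample t test against the chance baseline of 12 of 36 trials (33%)
t_stat, p_value = stats.ttest_1samp(congruent_counts, popmean=12)
print(t_stat, p_value)
```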

Fig. 6 Percentages of congruent, incongruent, and neutral responses in Study 3. The dashed line represents chance responding. Error bars represent standard errors of the means

Discussion

What is the role of phonesthemes in language?

At first blush, phonesthemes are at odds with the widely held Saussurean position that the relationship between words and their meanings is arbitrary. However, as these results demonstrate, there are limits to this arbitrariness. In Study 2, 61 phonesthemes were identified as predicting some aspect of the meaning of the words that incorporate them. Moreover, Study 3 demonstrated that phonesthemes affect our predictions as to the meanings of unknown words. Taken together, these results suggest that although phonesthemes do not encapsulate meaning in the same manner that words and morphemes do, they do affect some of the cognitive processes involved in associating words with meaning.

Nevertheless, whereas individuals are frequently explicitly aware of the meanings of words and the functions of morphemes such as un- and -ing, this does not seem to be the case with phonesthemes. It is therefore more reasonable to hypothesize that phonesthemes are only implicitly associated with meaning, possibly through our inherent sensitivity to the statistical cues embedded in language (e.g., Hutchinson & Louwerse, 2014; Saffran, 2003; Saffran, Aslin, & Newport, 1996). This hypothesis is further strengthened by the fact that the present findings supported phonesthemes by examining one source of such cues: the statistical method employed exploited the nonrandom distribution of words and their patterns of co-occurrence.

Interestingly, this suggests that linguistic processing might also be influenced by nonphonesthemic parts of words. For instance, it is possible, and perhaps likely, that in processing unknown words we attempt to draw on other words that sound similar. On this view, the role that phonesthemes played in Study 3 was not qualitatively different from the role that other similar-sounding words might play. However, phonesthemes are quantitatively more likely to be associated with meaning than are arbitrarily chosen phonetic clusters that are neither morphemic nor phonesthemic.

The historical roots of phonesthemes

It is also useful to consider that these distributional cues might, at least in part, be due to gradual shifts in the meaning of words over years and generations. Such shifts can result in phonesthemes in two ways. First, one historical root can be responsible for multiple related, but distinct, words. That is the case for many words that can be traced back to Proto-Indo-European (e.g., Boussidan et al., 2009; Watkins, 2000). For example, as Boussidan et al. noted, the phonestheme gl- is related to the Proto-Indo-European root *ghel (“to shine”). Many words in English that begin with gl- directly relate to the visual modality, and some others can be demonstrated to have historically been derived from terms associated with vision (e.g., global, which is derived from globe). It is important to note that this explanation presupposes the existence of particular roots, and it is possible that these roots themselves were not arbitrarily associated with their respective meanings. For example, it is possible that the sound gl is cognitively associated with particular experiences, along the lines investigated by Ramachandran and Hubbard (2001), who demonstrated that participants have particular expectations for the meanings of the nonce words Bouba and Kiki. In such cases, from a historical perspective, the meaning associated with a particular phonestheme might also not be determined arbitrarily.

Second, it is possible that an existing phonestheme can influence the interpretation of words that are unfamiliar, either because of their low frequency or because they were recently borrowed from a different language. A possible example of this is the word glance. The modern definition of the word generally involves some visual aspect (e.g., “a brief or hurried look”; retrieved April 4, 2018, from http://en.oxforddictionaries.com/definition/glance). However, in Middle English, glacen meant “to graze,” a meaning that is still maintained in English uses such as a glancing blow. Etymologically, it is likely that the word was borrowed from the Old French glacier (“to slip”). Since the phonestheme gl- traces back to Old English and earlier (to Proto-Indo-European), it seems reasonable to hypothesize that English speakers, upon first hearing the word glacen when it was newly borrowed, understood its intended meaning of slipping from context, but also connected it to the existing cluster of words starting with gl-. As a result, its meaning quickly shifted toward its modern equivalent, which is essentially “a look that slips.”

General discussion

This article has demonstrated how the methodology of hypothesis testing can be applied to measurements acquired from corpora, in much the same fashion that laboratory studies apply it to data collected in lab experiments. For testing psychological hypotheses, this application essentially uses texts as a proxy for the individuals who produced them. By coding and quantifying these texts, we can analyze them much as we would analyze lab-generated data.

However, quantifying texts is not a trivial endeavor. When there are large bodies of such texts, as is frequently the case when texts are collected from the internet or other sources of big data, it is feasible to explore patterns of co-occurrence within them as a means of quantitatively measuring the overall similarity of the words and phrases they contain. These measurements then form the backbone of analyses such as those carried out in this article.
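To make the idea concrete, the sketch below builds a raw co-occurrence matrix over a window of neighboring words and compares the resulting vectors by cosine similarity. This is a deliberately minimal illustration: real distributional models typically also weight the counts (e.g., with pointwise mutual information) and reduce their dimensionality (e.g., with singular value decomposition), and the function names here are assumptions for exposition.

```python
import numpy as np

def cooccurrence_matrix(sentences, window=5):
    """Count, for each word, how often every other word appears within
    `window` tokens of it; each row is that word's context vector."""
    vocab = sorted({w for sent in sentences for w in sent})
    index = {w: i for i, w in enumerate(vocab)}
    counts = np.zeros((len(vocab), len(vocab)))
    for sent in sentences:
        for i, word in enumerate(sent):
            for context in sent[max(0, i - window): i + window + 1]:
                if context != word:
                    counts[index[word], index[context]] += 1
    return vocab, counts

def cosine(u, v):
    """Cosine similarity between two context vectors."""
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))
```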

When conducting such studies, it is important to keep in mind that the data, although produced by individuals, do not constitute a direct measurement. In particular, all data collected and analyzed in this fashion were mediated through linguistic expressions. This mediation might affect the data collected, and the possible effect of linguistic processing on the content needs to be considered as part of the design. However, such considerations are also important in lab studies in which participants produce written or spoken responses. More generally, it is difficult to conduct studies of higher-level cognition without some language-based interaction with the participants during data collection. Although controlling for such influences in existing corpora is more difficult, the greater quantity of available data often provides greater statistical power that can be employed to overcome some of these issues.

Experiments inside and outside the lab

Lab-based studies have always struck a balance between the need to maximize the internal validity of a study and the need to produce results that have external validity and apply in a wide range of circumstances. In the lab, researchers can exercise a great degree of control and achieve high levels of internal validity. However, this level of control can lead to results that do not replicate well outside of the lab.

In contrast, in studying data collected outside of the lab, as is frequently the case with big-data and corpus studies, researchers elect to greatly limit the degree of control they have over the data and their collection. This results in more variability in the data, which makes statistical analysis more difficult. At the same time, the larger quantity of data often compensates for this and can provide greater statistical power than would be possible in lab-based studies. Nevertheless, not all threats to internal validity can be mitigated by the mere quantity of data. In particular, it is rare for existing data to include a manipulation that is relevant to the researcher’s hypothesis. Consequently, the results from such studies are quasi-experimental at best, and care needs to be taken in their interpretation.

The value of big data in supporting lab research

It is probably best to consider studies based on big data and corpora as complementary to lab studies, and a complete research program can often benefit from including both components. In many cases, researchers might want to begin by exploring their hypothesis in a tightly controlled lab study, and then extend this result by examining its manifestation in larger datasets collected outside of the lab. However, as demonstrated in Study 3, it is not uncommon for a study of corpora to provide insights and predictions that can be further refined in a lab experiment (e.g., Dehghani et al., 2016). More generally, big data can be an important addition to the arsenal of psychological research. Such data can provide an avenue for studies that will synergize with lab research, leading to better theories and, ultimately, to a deeper understanding of cognition and human behavior.