Bartlett (1928, 1932) conducted a classic experiment in cognitive psychology to examine how people remember. In the experiment, participants read an Indigenous American story entitled “The War of the Ghosts.” When prompted to recall the story, people inserted their own knowledge. For example, participants used the word boat (i.e., a word that they had experience with) in place of canoe (i.e., a word they likely did not know). Although Bartlett’s demonstration is remembered as a foundational example in the theory of reconstructive memory, it also makes another point.

When people read or hear language, they comprehend that language through the lens of their own experience. For example, when asked to play a game of football, a person’s interpretation of that request might change depending on the side of the Atlantic where the person was raised. Similarly, if one is asked to play a game of “roque,” only people familiar with croquet variants from the late 1800s would understand the rules.

But do the subtler differences in language experience exert a meaningful and distinguishable influence on people’s behavior and cognition?

By traditional methods of scholarship, a complete examination of language experience is intractable. However, recent advances in natural language processing, coupled with the availability of sizeable text corpora, have changed the game (e.g., Brysbaert, Mandera, & Keuleers, 2018; Chubala, Johns, Jamieson, & Mewhort, 2016; Green, Feinerer, & Burman, 2013, 2015; Johns, 2019; Johns, Jones, & Mewhort, 2019; Johns, Mewhort, & Jones, 2019; Johns & Jones, 2015; Jones, 2017; Jones, Dye, & Johns, 2017; Landauer & Dumais, 1997). For example, Johns and Jamieson (2018) applied theories of natural language processing to analyzing language use in published fiction. On the basis of the analysis, they reported that language differs in meaningful and measurable ways between genres, books, and authors. But, from a theoretical perspective, their results set up a more interesting problem: Can theories of natural language processing track differences in language behavior conditional on language experience?

The question has been analyzed previously in several fields, including sociolinguistics, corpus linguistics, and psycholinguistics (e.g., Baker, 2010; Biber, 1993; Brysbaert, Keuleers, & New, 2011). For example, Brysbaert and New (2009; SUBTLEX) recorded modern word frequency norms based on language from American film and television, whereas van Heuven, Mandera, Keuleers, and Brysbaert (2014; SUBTLEX-UK) recorded modern word frequency norms based on language in British film and television. Although the details vary, the authors of those projects arrived at a general conclusion germane to the present investigation: Norms developed from local sources (i.e., the same country) do a better job of tracking people’s language behavior than do norms developed from a different country. Given the success of the SUBTLEX norms, this approach to building word frequency datasets has been extended to a number of other languages, including Chinese (Cai & Brysbaert, 2010), Dutch (Keuleers, Brysbaert, & New, 2010), and Polish (Mandera, Keuleers, Wodniecka, & Brysbaert, 2015).

Our aim was to expand on this line of investigation, and we pursued three goals. Our first goal was to use fiction books written by authors of known nationality and era to measure and characterize differences in language use as a function of geography and time (although, as Brysbaert & New, 2009, point out, subtitles may provide a more naturalistic type of language; see Herdağdelen & Marelli, 2017, for similar arguments about using word frequency values from social media sources). Our second goal was to determine whether lexical models trained on those place- and time-specific subcorpora better explain people's lexical behavior when the nationality and era of the subcorpus match the nationality of the participants and the era in which the data were collected. We call this proposition the selective-reading hypothesis. Our third goal was to use the assembled subcorpora and behavioral norms to evaluate the technique of experiential optimization (EO; Johns, Jones, & Mewhort, 2019), a new machine-learning method that uses lexical data to infer and identify sections of a corpus that match the experimental participants' language experience.

Corpus development and analysis

In previous work, we reported that language varies in meaningful ways conditional on authors and genres (Johns & Jamieson, 2018). However, we did not study differences as a function of when authors were born or the places in which they lived. Given that our goal was to show that language behavior reflects the language environment, we first repeated and confirmed that analysis using a larger corpus of books written by American and British authors in different years.

To support the analysis, we assembled a corpus of 26,000 books from the internet. We then used the corresponding metadata to tag each book by its author, the author’s place of birth, the author’s date of birth, and the book’s genre. We also collected information from those sources on book length (i.e., number of words). All information was obtained from the websites Goodreads, Amazon, and Wikipedia.

Table 1 provides information about the books in our corpus, which totals over two billion words. The corpus includes over 26,000 books written by over 3,000 different authors: over 1.3 billion words written by 1,999 American authors, and about 500 million words written by 738 British authors. Table 1 also differentiates the books by author place of birth and genre, where an author's genre was determined by the most frequent genre in which that author published (Footnote 1).

Table 1 Characteristics of the book collections for authors born in America and Britain

We also recorded dates of birth for 2,088 of the 3,209 authors in the corpus, yielding a range of birth dates between 1801 and 1998 and an overall size of approximately 1.5 billion words for the time-delineated corpus. Figure 1 shows the numbers of words in the sample as a function of author birth date. As the figure shows, most words in the sample were written by authors who were born between 1925 and 1975. Therefore, we split the sample so that authors born before 1942 were included in an old-generation subcorpus of 762 million words, whereas authors who were born from 1942 onward constituted a new-generation subcorpus of 792 million words. Those subcorpora will be used to characterize language use as a function of time (i.e., old vs. new; Footnote 2).

Fig. 1 Numbers of words in the book sample, organized by author date of birth

For the first analysis, we used Johns and Jamieson's (2018) methods to measure the similarity (and thus the differences) in language use between books written by authors in the United States (USA) and United Kingdom (UK) subcorpora and books written by authors in the old-generation versus new-generation subcorpora. The method involved, first, identifying the 100,000 highest-frequency words in both of the relevant subcorpora (i.e., the USA and UK subcorpora for the place analysis, and the old- and new-generation subcorpora for the time analysis). Second, we constructed an Author (i.e., 3,209 authors in the place comparison and 2,088 in the time comparison) × Word (100,000 highest-frequency words) matrix that recorded the number of times that each of the 100,000 highest-frequency words appeared in each author's books. Third, we converted all word frequencies to their log equivalents (i.e., n' = ln[n + 1], where n is an author's raw count for a word). Finally, we computed the cosine similarity of each author's vocabulary vector to every other author's vocabulary vector. If two authors had dates of birth within a 30-year window (a common method of calculating generational differences; see Tremblay & Vézina, 2000), they were classified as belonging to the same generation; otherwise, they were classified as belonging to different generations.
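
To make the pipeline concrete, here is a minimal Python sketch of the log-frequency matrix and the pairwise cosine computation. The data structures and function names are our own illustration, not the authors' code:

```python
import numpy as np

def log_frequency_matrix(author_counts, vocab):
    """Author x Word matrix of log frequencies, n' = ln(n + 1).

    author_counts: one dict per author mapping word -> raw count
    vocab: the 100,000 highest-frequency words in the subcorpora
    """
    index = {word: j for j, word in enumerate(vocab)}
    matrix = np.zeros((len(author_counts), len(vocab)))
    for i, counts in enumerate(author_counts):
        for word, n in counts.items():
            j = index.get(word)
            if j is not None:
                matrix[i, j] = np.log(n + 1)
    return matrix

def pairwise_cosines(matrix):
    """Cosine similarity between every pair of author vocabulary vectors."""
    norms = np.linalg.norm(matrix, axis=1, keepdims=True)
    unit = matrix / np.clip(norms, 1e-12, None)
    return unit @ unit.T
```

Normalizing each row once makes the full pairwise comparison a single matrix product, which matters at the scale of several thousand authors.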

The middle panel in Fig. 2 shows vocabulary similarity between authors with the same versus different nationalities (i.e., both USA or both UK vs. one USA and one UK). As is shown, the vocabularies of authors born in the same country were more similar than the vocabularies of authors born in different countries.

Fig. 2 Histograms of similarity distributions for authors who wrote in the same or different genres (top panel), had the same or different places of birth (middle panel), and who were born in the same or in different generations (bottom panel). The genre comparison serves as an intuitive comparison for the differences seen in the bottom two panels

The bottom panel in Fig. 2 shows vocabulary similarity between authors from the same versus different generations (i.e., less than vs. more than 30 years’ difference). As the figure shows, the vocabularies of authors belonging to the same generation were more similar to one another than the vocabularies of authors belonging to different generations.

To provide an intuitive point for comparison, the top panel of Fig. 2 shows vocabulary similarity for authors writing in the same versus different genres, independent of their nationalities and ages. As is shown, the vocabularies of authors writing in the same genre were more similar to one another than the vocabularies of authors writing in different genres, an expected result (Johns & Jamieson, 2018). More importantly, the scale of that difference provides an intuitive picture for assessing the difference in vocabulary similarity distributions as a function of place and time.

Naturally, there are trends in publishing, so differences in vocabulary use conditional on place and time might be conflated with those trends (e.g., the genres popular at a given time). Therefore, we recomputed the time and place analyses depending on whether authors wrote in the same versus different genres. As is shown in Fig. 3, the pattern of similarities in Fig. 2 was preserved: Authors of the same nationality and authors belonging to the same generation used more similar vocabularies, even when they were writing in different genres.

Fig. 3 Histograms of authors who wrote in the same or different genres and had the same or different places of birth (top) or who were born in the same or different generations (bottom)

On the basis of the analysis, we concluded that vocabulary varies systematically as a function of both time and place, even when authors write in different genres. The results corroborate Johns and Jamieson’s (2018) results, using a different and substantially larger corpus.

Method

Although we have shown differences in vocabulary as a function of author nationality and date of birth, it remained an open question whether those differences would be reflected in the lexical behavior of experimental participants. To answer the question, we examined the correspondence (or lack thereof) between time- and place-based variation in the written language of our subcorpora and lexical data collected from participants in the USA and the UK in different years.

Empirical databases

Both lexical organization and lexical semantic data were tested. To examine lexical semantics, semantic category production data were used. In a semantic category production task, participants were cued with a category label (e.g., vegetable) and had to produce as many examples from that category as possible (e.g., carrot, lettuce, cucumber). By tradition, the frequency of production is interpreted as a snapshot of the psychological structure of the mental category (e.g., Battig & Montague, 1969; Rosch & Mervis, 1975). Three datasets were examined: the classic Battig and Montague (1969) norms, the updated norms of Van Overschelde, Rawson, and Dunlosky (2004), and the British norms of Hampton and Gardiner (1983). The main time comparison was between the Battig and Montague and Van Overschelde et al. norms. The main place comparison involved all three datasets, since both the Battig and Montague and Van Overschelde et al. norms were collected in the USA, whereas the Hampton and Gardiner data were collected in the UK.

The words included in the analysis were reduced to only those that appeared across the datasets, to ensure that differences in production values, and not differences in the words included in the norms, were driving any differences in fit. For the time comparison between the Battig and Montague and the Van Overschelde et al. norms, this reduction resulted in 40 categories and 803 exemplars. The same reduction was applied to compare the Hampton and Gardiner norms to the Battig and Montague and Van Overschelde et al. norms, resulting in 178 words across 11 categories for the place comparison.

To explore lexical organization, word familiarity and lexical decision data were used. In a word familiarity task, participants are simply asked to rate, on a scale, how familiar they are with a given word. Three datasets were collected in North America: the norms of Paivio, Yuille, and Madigan (1968; familiarity data later released by Paivio, 1974); the norms of Stratton, Jacobus, and Brinley (1975; n = 543); and the extended Paivio et al. (1968) norms of Clark and Paivio (2004). The Paivio (1974) and Clark and Paivio norms were collected in Canada, but given the cultural overlap between Canada and the USA, these three datasets can be considered representative of the USA. Paivio et al. contained 925 words, and Clark and Paivio collected data on these same words, so these words were used to directly compare the datasets. These were contrasted with two datasets collected in the UK: the norms of Gilhooly and Logie (1980; n = 1,944) and the norms of Stadthagen-Gonzalez and Davis (2006; n = 1,526). Given the publication dates of these datasets and the locations where they were collected, they provided a powerful basis for testing our hypothesis.

Additionally, two mega-datasets of lexical decision data were used to examine the influence of place on lexical behavior: the English Lexicon Project of Balota et al. (2007) and the British Lexicon Project of Keuleers, Lacey, Rastle, and Brysbaert (2012). The data from the English Lexicon Project were collected at sites around the USA, whereas the British Lexicon Project data were collected from participants in Great Britain. The English Lexicon Project database contains 40,481 words, whereas the British Lexicon Project contains 28,515 words. To design a maximally inclusive but balanced comparison, we computed correlations over the 16,214 words that appeared in both the American and British lexical decision databases (a sample that also excluded any words recognized at worse-than-chance accuracy, i.e., below 50%). The data used were z-transformed lexical decision times.
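
The preparation described above amounts to intersecting the two word lists, filtering out poorly recognized words, and standardizing the response times. A sketch under assumed file and column names (the actual ELP and BLP releases use their own schemas):

```python
import pandas as pd

# Hypothetical files, each assumed to have word, rt, and accuracy columns.
elp = pd.read_csv("elp_lexical_decision.csv")  # USA data
blp = pd.read_csv("blp_lexical_decision.csv")  # UK data

# Keep only words that appear in both databases and were recognized
# above chance (50% accuracy) in each.
both = elp.merge(blp, on="word", suffixes=("_usa", "_uk"))
both = both[(both["accuracy_usa"] > 0.5) & (both["accuracy_uk"] > 0.5)]

# z-transform the lexical decision times within each database.
for col in ("rt_usa", "rt_uk"):
    both[col + "_z"] = (both[col] - both[col].mean()) / both[col].std()
```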

Models

We used a simple word frequency model to explore the word familiarity and lexical decision data (for a recent review of word frequency, see Brysbaert et al., 2018). Although more complex models of word frequency have been developed (e.g., semantic diversity counts; Jones, Johns, & Recchia, 2012), using simple word frequency allowed for an uncomplicated examination of whether the word frequencies observed in the different subcorpora would map onto word familiarity ratings and lexical decision times as a function of time and place.
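
Because the model is just a frequency count, fitting it reduces to correlating log-transformed counts with the behavioral measure. A minimal sketch, assuming the conventional ln(n + 1) transform used elsewhere in this article; the function name is our own:

```python
import numpy as np
from scipy.stats import pearsonr

def frequency_fit(counts, behavior):
    """Correlate log word frequency with a behavioral measure, such as
    familiarity ratings or z-scored lexical decision times.

    counts: raw word counts from one subcorpus, aligned with `behavior`
    """
    return pearsonr(np.log(np.asarray(counts) + 1), behavior)
```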

To account for the category production data, we used the BEAGLE model of semantics (Jones & Mewhort, 2007). Broadly, BEAGLE works by “reading” a text corpus and, en route, encoding each word’s meaning into a set of corresponding vectors. The theory is one in a larger class, labeled distributional models of language (e.g., Griffiths, Steyvers, & Tenenbaum, 2007; Landauer & Dumais, 1997).

To train BEAGLE, each unique word i in a corpus is represented by an n-dimensional environment vector, ei, with each element being a random deviate drawn from a normal distribution with mean zero and variance 1/n. In the simulations that follow, the dimensionality was set to n = 1,024. Environment vectors are stable over a simulation and serve as unique identifiers for the words in the corpus.

Next, the model “reads” the corpus one sentence at a time, to build a semantic memory vector, mi, for each word. The memory vector for each word is composed of two kinds of information: context information and order information.

Context information is computed by summing the environmental vectors for all other words in the same sentence (i.e., excluding the word of interest) into the representation for that word. The summing of environmental vectors in this manner causes the memory vectors for all words in the same sentence to grow more similar to one another.

Order information encodes how a word is used within a sentence and is computed by encoding all of the n-grams (up to a specified size) that a word is part of within a sentence (Footnote 3). The computation of order information relies on noncommutative circular convolution (Plate, 1995) to bind the environmental vectors into unique n-gram vectors, which are then summed into the target word's order representation. The order representation encodes how a word is used in relation to the words that surround it within a sentence.

In sum, a word’s context vector represents pure co-occurrence information, whereas order information encodes a simplified representation of syntactic relations. The representation used here was the sum of the context and order vectors. Despite its simplicity, BEAGLE explains a broad range of semantic and language behavior (e.g., Jones & Mewhort, 2007; Recchia, Sahlgren, Kanerva, & Jones, 2015).

Results

Word familiarity

Our first analysis tested whether word frequencies tabulated from the new-book corpus (i.e., authors born in or after 1942) accounted for the word familiarity data in the newer experimental norms (i.e., Clark & Paivio, 2004; Stadthagen-Gonzalez & Davis, 2006) better than for the older experimental norms (i.e., Gilhooly & Logie, 1980; Paivio et al., 1968; Stratton et al., 1975), and vice versa. The top panel of Fig. 4 shows the correlations of the old-book and new-book subcorpora with the word familiarity norms in the old and new empirical sets of norms. As is shown, the word frequency values computed from the old- and new-book subcorpora provide better fits to their time-appropriate empirical norms: The old-book corpus matches the word familiarity ratings in the old sets of norms better, and the new-book corpus matches the word familiarity ratings in the new sets of norms better.

Fig. 4 The top panel displays the correlations of word frequency values derived from the old and new subcorpora to older versus more recent familiarity norms. The bottom panel displays the amounts of unique variance accounted for by the word frequency counts

Of course, there is a great deal of shared variance between the word familiarity ratings in the old and new experimental norms, and that shared variance might work against identifying the differences between the two for predicting performance. Thus, to measure the predictive power of the differences in word familiarity in the old and new norms, we applied regression to quantify the unique variance in the word familiarity norms accounted for by word frequency in the old- and new-book subcorpora. The analysis is standard and provides a measure of the predictive gain (i.e., measured as the percentage ΔR² improvement) for one predictor over a competing predictor (see Adelman, Brown, & Quesada, 2006; Johns, Gruenenfelder, Pisoni, & Jones, 2012; Johns, Sheppard, Jones, & Taler, 2016; Jones et al., 2012; Footnote 4).
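
The measure can be computed as a two-step hierarchical regression: fit the norms with one predictor, add the competing predictor, and take the gain in R². A minimal sketch, with helper functions of our own:

```python
import numpy as np

def r_squared(predictors, y):
    """R^2 of an ordinary least-squares fit with an intercept."""
    X = np.column_stack([np.ones(len(y))] + list(predictors))
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    residuals = y - X @ beta
    return 1 - residuals.var() / y.var()

def unique_variance(baseline, candidate, y):
    """Percentage R^2 gained by adding `candidate` over `baseline` alone."""
    return 100 * (r_squared([baseline, candidate], y) - r_squared([baseline], y))
```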

The bottom panel of Fig. 4 shows the results of the regression. As can be seen there, word frequencies tabulated from the old-book corpus (i.e., books written by authors born before 1942) account for nearly all of the unique variance in the old empirical norms, whereas word frequencies tabulated from the new-book corpus (i.e., books written by authors born in or after 1942) account for nearly all of the unique variance in the more recent empirical norms. These results provide a first, positive validation of the selective-reading hypothesis (and parallel the results of Taler, Johns, & Jones, 2019, in a large-scale examination of verbal fluency performance).

A second test of the selective-reading hypothesis was to determine whether place also provides a unique signature in lexical behavior. The top panel of Fig. 5 shows the correlations of word frequencies tabulated from our USA and UK subcorpora with word familiarity norms collected in North America (Clark & Paivio, 2004; Paivio et al., 1968; Stratton et al., 1975) and the UK (Gilhooly & Logie, 1980; Stadthagen-Gonzalez & Davis, 2006). The bottom panel shows regression results for the amount of unique variance accounted for by each corpus. As can be seen, the same pattern emerged as a function of place that was observed as a function of time: The USA book corpus matches the word familiarity ratings in the North American norms better, and the UK book corpus matches the word ratings in the UK norms better. This result corroborates our time-based analysis and provides additional support for the selective-reading hypothesis.

Fig. 5 The top panel displays the correlations of word frequency values derived from the USA and UK subcorpora to word familiarity norms collected in the USA versus the UK. The bottom panel displays the amounts of unique variance accounted for by the word frequency counts across the different sets of word familiarity norms

Lexical decision data

The correlations of word frequency values to the data from the English and the British Lexicon Projects are shown in the top panel of Fig. 6. The bottom panel shows the unique variance that the two frequency sets explain. This figure shows a result very similar to that from the word familiarity dataset. Word frequency in the USA books corpus accounted for more variance in lexical decision times collected in the USA than in the lexical decision times collected in the UK, and the opposite was true for the UK data.

Fig. 6 The top panel displays the correlations of word frequency values derived from the USA and UK subcorpora to lexical decision data collected in the USA (English Lexicon Project) versus the UK (British Lexicon Project). The bottom panel displays the amounts of unique variance accounted for by the two word frequency models across the different sets of norms

The advantage of the corpus written by UK-born authors replicates the finding of van Heuven, Mandera, Keuleers, and Brysbaert (2014), who found a similar increase in variance accounted for when using subtitle files from UK-based television and films to account for lexical decision data collected in the UK. However, the advantage of the USA book corpus for the English Lexicon Project data is substantially larger than the advantages reported in other studies attempting to model these data (e.g., Adelman et al., 2006; Jones et al., 2012). It is worth pointing out that this is not because our book corpus provides an overwhelmingly better fit to those data than other corpora (the raw correlations are consistent with the other corpora used). Rather, our UK subcorpus provides a relatively poor fit to those data, a difference that enables a bigger advantage for the corpus of books by USA-born authors (this point also applies to the familiarity and category norm data). Finally, we speculate that the disparity in fits between the English and British Lexicon Projects in Fig. 6 might be due to an asymmetry in reading experience: American students likely have less experience with British authors than British students have with American authors. However, this hypothesis needs to be tested in future research.

Category production norms

To generate predictions for category production from the BEAGLE model, we computed the cosine similarity of the vector that corresponded to a given category label (e.g., vegetable) to the vectors that corresponded to each of the relevant category exemplars (e.g., carrot, lettuce, cucumber). Then we converted the cosines to ranks and computed the Spearman rank correlation between the exemplar ranks generated from the model and the corresponding exemplar ranks in the data (i.e., Battig & Montague, 1969; Rosch & Mervis, 1975).
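
Concretely, for each category the procedure reduces to a cosine computation followed by a rank correlation. A sketch, assuming memory vectors like those produced by the BEAGLE sketch above (scipy's spearmanr converts the values to ranks internally):

```python
import numpy as np
from scipy.stats import spearmanr

def category_fit(mem, label, exemplars, production_freq):
    """Spearman correlation between model-derived exemplar ranks and
    human production frequencies for one category.

    mem: dict mapping word -> memory vector (e.g., from BEAGLE)
    production_freq: production counts aligned with `exemplars`
    """
    def cosine(a, b):
        return (a @ b) / (np.linalg.norm(a) * np.linalg.norm(b))

    similarities = [cosine(mem[label], mem[e]) for e in exemplars]
    return spearmanr(similarities, production_freq)
```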

The first analysis focused on the effect of training the model on time-specific subcorpora. Figure 7 shows the correlations between the two models and category production frequency (top panel) and the amount of unique variance each model accounted for (bottom panel). As is shown there, the results for the category production data as a function of time mirror the results for the word familiarity data: Semantic vectors derived from the old-book corpus explain the older Battig and Montague (1969) production norms better, whereas semantic vectors derived from the new-book corpus explain the newer Van Overschelde et al. (2004) norms better.

Fig. 7 The top panel displays the fit of the BEAGLE model of semantics derived from the old and new subcorpora to older and more recent category production data. The bottom panel displays the amounts of unique variance accounted for by the two semantic models

The results from our final analysis, presented in Fig. 8, tested the influence of place on semantic category production. As expected, the results as a function of place mirrored the results as a function of time in Fig. 7: Semantic vectors derived from the USA book corpus predicted the American Battig and Montague (1969) and Van Overschelde et al. (2004) production norms better, whereas semantic vectors derived from the UK book corpus predicted the British Hampton and Gardiner (1983) norms better. The results provide additional support for the selective-reading hypothesis: Training a semantic memory model on a country-appropriate corpus offers the best fit to category production norms collected in that country.

Fig. 8 The top panel displays the fit of the BEAGLE model of semantics trained on the USA and UK book subcorpora to category production norms collected in the USA versus the UK. The bottom panel displays the amounts of unique variance accounted for by the two semantic models across the different sets of norms

Discussion

Our goal was to assess the selective-reading hypothesis across several classes of lexical behavior. We found that the time- and place-matched subcorpora yielded a large and systematic advantage over the time- and place-unmatched subcorpora. The advantage was consistent for word familiarity, lexical decision, and category production data, and suggests that differential language experience exerts a measurable influence on language processing. It also demonstrates a subtler message: The language sample that a language model is trained on impacts its ability to account for lexical behavior. Specifically, if the materials that a corpus-based model is trained on mismatch the lexical experiences that a group of participants have had, that model might be rejected not because it is a poor model, but because it has a mismatched language background. In the next section of this article, we address the issue of matching language experience through a new machine-learning method called experiential optimization.

A validation of experiential optimization

The analyses above show that participants' lexical behavior across several tasks was sensitive to the time and place in which the data were collected. The combined results suggest that one way to construct better, more powerful models of lexical behavior is to train them with linguistic materials that accurately represent the language experience of the experimental participants.

Recently, Johns, Jones, and Mewhort (2019) presented a new machine-learning framework designed to accomplish just this. Their method, called experiential optimization (EO), uses the inherent variability in language (see Johns & Jamieson, 2018) to optimize models of natural language against sets of human behavior. Johns, Jones, and Mewhort showed that EO produces benchmark fits across a diverse set of areas in the study of language and memory, including lexical organization, lexical semantics, sentence processing, and episodic recognition. In doing so, they found that EO provides a framework in which experience-based models of cognition can be embedded, which in turn allows the behavior of the models to be optimized, in a fashion similar to the more standard parameter-fitting algorithms used in cognitive modeling. Furthermore, the methodology was tested using numerous cross-validation procedures and was shown to fit data at both the group and individual levels. The goal of the following simulations was to demonstrate that the information that EO uses when optimizing to a set of data is based on experiential factors (e.g., the time and place of data collection).

The basis of EO is that people’s differential experience with language should be reflected in their lexical behavior (as we demonstrated above). Following on that premise, EO aims to optimize a model’s fit to experimental measurements of lexical behavior by selecting the types of language materials that best reflect that experience. For example, Johns, Jones, and Mewhort (2019) showed that when optimizing the fit to lexical decision data from young versus old adults, EO selected young adult fiction books to account for the young adult lexical decision data, but selected more advanced fiction books to account for older adults’ lexical decision data. Johns, Jones, and Mewhort demonstrated that models optimized in this fashion can achieve benchmark fits to data across a range of language behaviors.

EO is implemented by first assembling a large corpus of different text sources (in Johns, Jones, & Mewhort, 2019, the text sources were fiction books, nonfiction books, young adult books, Wikipedia articles, and product descriptions from Amazon). These text sources are then split into smaller sections, typically of 50,000 sentences (section size is a free parameter; however, using sections smaller than 50,000 sentences can lead to overfitting). Then a hill-climbing algorithm is used to iteratively select the best-fitting section or sections.

For example, consider optimizing to a word familiarity dataset. The sections of text are first preprocessed into word frequency distributions. The first iteration of EO selects the single section whose frequency distribution has the best overall fit (e.g., assessed with a Pearson correlation coefficient) to the familiarity data. This section is added into the optimized frequency distribution and removed from the search set (i.e., sampling without replacement). The second iteration takes each of the remaining sections and adds its frequency distribution to the optimized frequency distribution; the section that offers the best fit is then added into the optimized frequency distribution and removed from the search set. This process iterates until no section offers a meaningful improvement in fit. At the end of the search, the selected corpus presents an optimized record of language experience in relation to the target data.
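
The search loop itself is a small greedy algorithm. The sketch below covers the word frequency case; whether fits are scored on raw or log-transformed counts is our assumption, and the names are our own:

```python
import numpy as np

def eo_select(sections, target_words, target_values, max_iter=100, tol=1e-4):
    """Greedy (hill-climbing) sketch of experiential optimization for
    word frequency data.

    sections: dict mapping a section id to a word -> count dict
    target_words / target_values: the behavioral norms to fit
    """
    def as_vector(counts):
        return np.array([counts.get(w, 0) for w in target_words], dtype=float)

    def score(freq):
        # Pearson correlation between log frequency and the norms.
        return np.corrcoef(np.log(freq + 1), target_values)[0, 1]

    pool = {sid: as_vector(c) for sid, c in sections.items()}
    optimized = np.zeros(len(target_words))
    chosen, best = [], -np.inf
    for _ in range(max_iter):
        if not pool:
            break
        # Try adding each remaining section; keep the best candidate.
        sid, fit = max(((s, score(optimized + v)) for s, v in pool.items()),
                       key=lambda t: t[1])
        if fit - best < tol:  # no meaningful improvement: stop
            break
        best = fit
        optimized += pool.pop(sid)  # sampling without replacement
        chosen.append(sid)
    return chosen, optimized, best
```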

In the simulations that follow, each section given to EO was based on the books written by an individual author whose place and time of birth were available. However, as the corpus descriptives in Table 1 show, there is considerable variability in the word counts for individual authors. To equate the sections, we (a) excluded authors born before 1800, (b) excluded authors with fewer than 50,000 words in their assembled writings, and (c) represented the writing of each remaining author by a random sample of one million words from all of the words in their books. The sampling procedure in step (c) ensured that each section contained the same amount of lexical material. In total, 2,043 different sections (i.e., one section per author) were included in the search set. The result is a set of frequency distributions that contain an equivalent amount of lexical information across authors but differ in their distributional properties. In the following analyses, only familiarity rating and lexical decision norms were examined, because those data are available in much larger sample sizes than the category production data.
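
Constructing the equal-sized sections then amounts to drawing a fixed-size token sample per author and counting. A small sketch; whether the original sampling was with or without replacement is not specified, so we sample with replacement, which also handles authors whose books contain fewer than one million words:

```python
import random
from collections import Counter

def author_section(tokens, sample_size=1_000_000, seed=0):
    """One search-set section per author: a word frequency distribution
    over a fixed-size random sample of the author's tokens (authors with
    fewer than 50,000 words are excluded upstream)."""
    rng = random.Random(seed)
    return Counter(rng.choices(tokens, k=sample_size))
```

Sections built this way can be fed directly into the eo_select sketch above.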

As an example of how EO operates, Fig. 9 shows the results of EO optimized to the familiarity data of Clark and Paivio (2004) and Gilhooly and Logie (1980). The figure shows that as new sections are selected, there is a corresponding increase in the fit of the optimized model to the data, with the Gilhooly and Logie dataset achieving a higher overall correlation. One reason for this discrepancy is that the language sources, as visualized in Fig. 1, are skewed toward older authors, so older norms may benefit more from these language materials; however, item-level factors may be at play as well. As is also shown, the most substantial increase in fit occurs at the outset of the search, with smaller and diminishing improvements at later iterations. However, the final result is a fit to the data that is substantially better than the fit from any entire corpus. For example, the best corpus correlation for the Clark and Paivio dataset, r = .67, was obtained with the new corpus, whereas the EO-optimized frequency distribution reached r = .83. For the Gilhooly and Logie dataset, the best corpus fit, r = .78, was given by the UK corpus, whereas the EO-optimized fit was r = .87.

Fig. 9 Example of experiential optimization (EO) being applied to the word familiarity norms of Clark and Paivio (2004) and Gilhooly and Logie (1980). At each iteration, EO selects a language section, which causes an optimized frequency distribution to maximally increase its fit to a set of norms

For the purposes of this article, the critical test of EO was whether it picked up on the differences in lexical behavior that are inherent in the empirical norms as a function of time and place. In the runs included in Fig. 9, the average date of birth of the authors sampled into the optimized corpus was greater for the more recent Clark and Paivio (2004) norms (M = 1948) than for the older Gilhooly and Logie (1980) norms (M = 1943). That difference is in the expected direction and suggests that the optimization method can differentiate the influence of time on lexical behavior. In terms of author nationality, the method selected 57.5% UK authors when fitting the British Gilhooly and Logie norms, but only 14.84% UK authors when fitting the North American Clark and Paivio norms. This suggests that EO is also sensitive to the place where lexical norms were collected.

Johns, Jones, and Mewhort (2019) demonstrated that EO is sensitive to its starting point, with different starting points causing different language sources to be selected. Thus, to ensure that the above findings are not anecdotal, 50 runs of EO were conducted for each set of word familiarity and lexical decision norms, with each run beginning from a different random starting section. The average date of birth of the selected authors and the percentage of selected authors born in the UK were then recorded. If EO is picking up on the time and place of lexical behavior and matching the language corpus to them, then the average author date of birth and author place of birth selected by the method should reflect when and where the lexical data were collected.

An important consideration for the following simulations is that splitting language sources by the time and place of an author's birth is not the only way to characterize the likelihood of a group of participants' experience with a certain author. Time and place are broad and inclusive metrics, which might not map onto any single individual's language exposure particularly well. For example, Applebee (1992) found that 54% of sampled high school English courses in the USA taught Lord of the Flies by William Golding, a British author born in 1911 (considerably older than the median date of birth in the author sample used here). Thus, the Golding section might be selected by EO more often than other sections because of the ubiquity of his writing and the importance of his work to English literature; that is, his contribution would not be captured well by the time and place splits used in this article. This does not mean that the time and place splits are meaningless; indeed, the simulations above demonstrate that time and place have a measurable impact on lexical behavior. Rather, it must be acknowledged that many factors shape any single individual's language exposure, and those factors should be a focus for future research.

Figure 10 shows the fit of EO along with the best single-corpus fit (i.e., from the USA, UK, old, or new corpus) for the five sets of word familiarity norms and the two sets of lexical decision norms. As is shown, EO yields an increased fit to lexical behavior for all seven datasets, over and above the otherwise best corpus fit. The results show the advantage of using EO to fit a model. However, the critical test is whether the optimization procedure chose sections that corresponded to the experience participants likely had.

Fig. 10 Fits of the best targeted subcorpus (from the old, new, UK, or USA subcorpora; black bars) and the average optimized fits for 50 runs of experiential optimization (EO). Error bars show standard deviations of the means and demonstrate that there was little deviation in terms of fit when using EO from different starting points

The top panel of Fig. 11 shows the average author date of birth for the sections selected to fit the older data (Gilhooly & Logie, 1980; Paivio et al., 1968; Stratton et al., 1975) and the newer data (Balota et al., 2007; Clark & Paivio, 2004; Keuleers et al., 2012; Stadthagen-Gonzalez & Davis, 2006). As can be seen, EO selected older authors to account for the older data and newer authors to account for the newer data. Even with this small sample size, the average date of birth selected for the new data was significantly greater than that selected for the older data [t(5) = 5.61, p = .002]. Indeed, the correlation between publication date and the average date of birth of the selected sections was r(5) = .98, p < .001. This demonstrates that EO is very sensitive to the effects of time on lexical behavior.

Fig. 11 Average dates of birth of the author sections selected (top panel) and the percentages of UK authors selected (bottom panel) when experiential optimization (EO) was applied to the assembled word familiarity and lexical decision norms. Error bars show standard deviations of the means. DOB = date of birth

The bottom panel of Fig. 11 shows the percentage of selected authors who were born in the UK, for norms collected in the USA (Balota et al., 2007; Clark & Paivio, 2004; Paivio et al., 1968; Stratton et al., 1975) versus the UK (Gilhooly & Logie, 1980; Keuleers et al., 2012; Stadthagen-Gonzalez & Davis, 2006). The figure shows that EO tends to select language that corresponds to the place where the empirical data were collected, with the fits to data collected in the UK including many more sections written by UK-born authors [t(5) = 5.194, p = .003]. This is a particularly impressive result, considering that more than twice as many sections were contributed by authors born in the USA (n = 1,191) as by authors born in the UK (n = 565), and it suggests that EO is quite capable of identifying the place where lexical behavior was collected.

One question left unanswered by the simulation in Fig. 11 is how much additional variance time- or place-matched language sections provide to EO. To answer this question, the EO process was split into two runs of 20 iterations each (for a total of 40 iterations). On the first run, EO was constrained to choose sections from authors born in either the USA or the UK. On the second run, the method was forced to select sections only from the other country. If place-appropriate sections allow EO to account for more variance, then there should be a larger increase in fit when going from incongruent to congruent sections (e.g., fitting with USA and then UK sections when accounting for norms from the British Lexicon Project) than when going from congruent to incongruent sections (e.g., fitting with UK and then USA sections when accounting for norms from the British Lexicon Project). As Fig. 9 shows, the expected increase in fit after 20 iterations is relatively small, but comparing these two conditions should at least provide a better understanding of the amount of unique variance that place-congruent sections of language provide to EO. This was done for each set of word familiarity and lexical decision norms, and the average increase in correlation was assessed across 50 simulations. The results are displayed in Fig. 12. As is shown there, when the second run of EO uses place-congruent sections, there is a consistent improvement in the method's ability to account for the experimental data. The result suggests that there is a unique place signature in language that the method is capable of identifying and leveraging.

Fig. 12 Increases in correlations when experiential optimization (EO) sampled from a different place (incongruent condition) or the correct place (congruent condition) on the second run of the algorithm. As is shown, providing EO with sections of language consistent with the likely experience that a group of participants had yields a consistent benefit in the algorithm's ability to fit data

Combined, these simulations demonstrate that EO optimizes a model to lexical data by tapping into the underlying lexical experience that groups of participants likely have, and not just by optimizing to noise in a dataset.

Supplementary material

To aid other researchers interested in the effects of experience on lexical behavior, this article has associated supplementary materials containing the various frequency distributions used in the simulations reported here (Footnote 5). The word list for the frequency distributions was composed of the words from the familiarity datasets used here, the word list from the English Lexicon Project (Balota et al., 2007), and the word list from the recently released word prevalence data of Brysbaert, Mandera, McCormick, and Keuleers (2019). This resulted in a final list of 81,186 words. The frequency distributions from the old, new, USA, and UK corpora are included in these materials, along with the frequency distribution from the overall book set described in Table 1. Additionally, the frequency distribution for each individual author, along with the demographic characteristics recorded for each author, is also included, to allow other researchers to construct their own targeted or optimized frequency distributions.

General discussion

Our goal was to determine whether experiential differences in the language environment are reflected in people's lexical behavior as a function of time and place; we call this the selective-reading hypothesis. To test the hypothesis, we collated a very large sample of books and tagged each book with its author's place and date of birth. An initial analysis of word use confirmed that books written by authors who share a place of birth (i.e., USA or UK) or a generation are more similar than books written by authors from different places of birth or different generations. We then evaluated the hypothesis against empirical norms for word familiarity, lexical decision, and semantic category production. Across the different sets of norms, models trained on time- and place-matched subcorpora offered the best fits to the experimental data. Taken together, the results support the conclusion that language models trained on time- and place-matched subcorpora can track the influence of language experience on people's lexical behavior and knowledge.

An additional goal was to use the assembled materials to test a new machine-learning framework for optimizing language models, called experiential optimization (Johns, Jones, & Mewhort, 2019). When optimizing a model to fit word familiarity and lexical decision data, EO chose sections of language that corresponded to the time and place in which the data were collected. This finding supports the conclusion that EO is capable of picking up on differences in language use as a function of place and time.

The underlying conceptualization and purpose of EO is to optimize language models by manipulating the experience that those models receive, instead of by varying internal cognitive parameters. That is, EO assumes that people from different places and times have different language experience and exhibit differences in lexical behavior as a consequence of that differential language experience, not of differences in language processing. The results of this article strengthen the argument for using EO when accounting for lexical behavior with cognitive models.

The work presented here shows that differences in language experience predict (and thus presumably influence) lexical organization and category knowledge. On those grounds, the work demonstrates how analyses of written text might be used to investigate the influence of experience on language processing. However, it points to a more fundamental issue for cognitive modeling and understanding the nested relationship between language and knowledge.

Typically, models of lexical organization and semantics are trained on a general (often convenient) corpus, and the capabilities of different learning schemes are assessed by relation to one another. This strategy is perfectly sensible—if you want to compare models, all other factors should be held constant. However, that strategy fails to appreciate the influence of language experience on the models. We have turned the typical strategy around, to ask how a single model behaves, depending on a change in the corpus. Just as changing models while holding the corpus constant produces different patterns of behavior, our analysis shows that changing the corpus while holding the model constant produces different patterns of behavior. We chose time and place as a basis for the subcorpus distinction because this made for a broad and inclusive cut associated with variation in word meaning (e.g., football).

In his 1956 treatise, Herbert Simon pointed out that understanding cognition requires an examination and understanding of the organism, its environment, and the interaction of the two (Simon, 1956, 1969). Since that time, this mantra has been echoed and taken up in the discipline of ecological cognition (Todd & Gigerenzer, 2001, 2007) and elsewhere (e.g., Hills, Jones, & Todd, 2012; Johns, Jones, & Mewhort, 2019; Jones & Mewhort, 2007). Models of natural language processing have taken Simon’s insight to heart and explained semantics as an emergent outcome arising from the operation of defined processing mechanisms (i.e., a model) and the language environment (i.e., a representative subcorpus of language experience). But, in large part, language experience has been treated agnostically, in deference to conducting a careful examination of the processing mechanisms. The work presented here shows that there is value in considering the language subcorpus just as carefully as the models themselves, and it offers a roadmap for engaging in a systematic and programmatic analysis of the influence that a person’s language experience has on their lexical behavior.