Advances in computational ability and the Internet have propelled research into an era of “big data” that has interesting implications for the field of psycholinguistics, as well as other experimental areas that use normed stimuli for their research. Traditionally, stimuli used for experimental psycholinguistics research were first normed through small in-house pilot studies, which were then used in many subsequent projects. While economical, the results from these studies could be potentially misleading, as the results may be due to the stimuli, rather than experimental manipulation. Small individual lab norming projects may be tied to a lack of funding, time, computational power, or even interest in studying phenomena at the stimuli level. Now, we have the capability to collect, analyze, and publish large datasets for research into memory models (Cree, McRae, & McNorgan, 1999; Moss, Tyler, Devlin, & Devlin, 2002; Rogers & McClelland, 2004; Vigliocco, Vinson, Lewis, & Garrett, 2004), aphasias (Vinson, Vigliocco, Cappa, & Siri, 2003), featural probability (Cree & McRae, 2003; McRae, Sa, & Seidenberg, 1997; Pexman, Holyk, & Monfils, 2003), valence (Dodds, Harris, Kloumann, Bliss, & Danforth, 2011; Vo, Conrad, Kuchinke, Urton, Hofmann, & Jacobs, 2009; Warriner, Kuperman, & Brysbaert, 2013), and reading speeds and priming (Balota, Yap, Hutchison, Cortese, Kessler, Loftis, & Treiman, 2007; Cohen-Shikora, Balota, Kapuria, & Yap, 2013; Hutchison, Balota, Neely, Cortese, Cohen-Shikora, Tse, & Buchanan, 2013; Keuleers, Lacey, Rastle, & Brysbaert, 2012) to name a small subset of research avenues.

Big data has manifested in psycholinguistics over the last decade in the form of grant-funded megastudies to collect and analyze large text corpora (i.e., the SUBTLEX projects) or to examine numerous word properties (i.e., the Lexicon projects). The SUBTLEX projects were designed to analyze frequency counts for concepts across large corpora using subtitles as a substitute for natural speech. The investigation of these measures was first spurred by the realization that word frequency is an important predictor of naming and lexical decision times (Balota, Cortese, Sergent-Marshall, Spieler, & Yap, 2004; Rayner & Duffy, 1986). While previous measures of frequency (e.g., Baayen, Piepenbrock, Gulikers, & Linguistic Data Consortium, 1995; Burgess & Livesay, 1998; Kučera & Francis, 1967) were based on large corpora of 1 million or more words, they were poor predictors of response latencies (Balota et al., 2004; Brysbaert & New, 2009; Zevin & Seidenberg, 2002). Further, Brysbaert and New (2009) indicate the importance of a corpus’s characteristics for psycholinguistic studies, as the underlying source of the text data matters (Internet versus subtitles), as well as the contextual diversity of the data (i.e., number of occurrences across sources, Adelman, Brown, & Quesada, 2006). Not only has Brysbaert and New’s (2009) work been included in newer lexical studies (Hutchison et al., 2013; Yap, Tan, Pexman, & Hargreaves, 2011), but SUBTLEX projects have been published in Dutch (Keuleers, Brysbaert, & New, 2010), Greek (Dimitropoulou, Duñabeitia, Avilés, Corral, & Carreiras, 2010), Spanish (Cuetos, Glez-Nosti, Barbon, & Brysbaert, 2011), Chinese (Cai & Brysbaert, 2010), French (New, Brysbaert, Veronis, & Pallier, 2007), British English (van Heuven, Mandera, Keuleers, & Brysbaert, 2014), Polish (Mandera, Keuleers, Wodniecka, & Brysbaert, 2015), and German (Brysbaert, Buchmeier, Conrad, Jacobs, Bölte, & Böhl, 2011).

The Lexicon projects involved creating large databases of mono- and multisyllabic words to assist in the creation of controlled experimental stimuli sets for future experiments. These databases contain lexical decision and naming response latencies, as well as typical word confound variables such as orthographic neighborhood, phonological, and morphological characteristics. While the English Lexicon Project (Balota et al., 2007) is the most cited of the lexicons, other languages include Chinese (Sze, Rickard Liow, & Yap, 2014; Tse, Yap, Chan, Sze, Shaoul, & Lin, 2017), Malay (Yap, Rickard Liow, Jalil, & Faizal, 2010), Dutch (Keuleers et al., 2010), and British English (Keuleers, Lacey, Rastle, & Brysbaert, 2012). Similar lexical database publications can be found in the literature covering French (Lété, Sprenger-Charolles, & Colé, 2004), Italian (Barca, Burani, & Arduino, 2002), Arabic (Boudelaa & Marslen-Wilson, 2010), and Portuguese (Soares, Medeiros, Simões, Machado, Costa, Iriarte, & Scomesaña, 2014).

The availability of big data has augmented the psycholinguistic literature, but these projects are certainly time consuming due to the amount of participant data required to achieve reliable and stable norms. A solution to large data collection lies in several avenues of easily obtainable data. First, Amazon’s Mechanical Turk, an online crowdsourcing avenue that allows researchers to pay users to complete questionnaires, can be a reliable, diverse participant pool made available at very low cost (Buhrmester, Kwang, & Gosling, 2011; Mason & Suri, 2012). Researchers can pre-screen for specific populations, as well as post-screen surveys for incomplete or inappropriate responses (Buchanan & Scofield, 2018), thus saving time and money with the elimination of poor data. Because of the popularity of Mechanical Turk, large amounts of data can be collected in shorter time periods than traditional experiments. Mechanical Turk has been used to collect data for semantic word pair norms (Buchanan, Holmes, Teasley, & Hutchison, 2013), age of acquisition ratings (Kuperman, Stadthagen-Gonzalez, & Brysbaert, 2012), concreteness ratings (Brysbaert, Warriner, & Kuperman, 2014), past tense information (Cohen-Shikora et al., 2013), and valence and arousal ratings (Dodds et al., 2011; Warriner et al., 2013). Additionally, in a similar vein to the SUBTLEX projects, linguistic data have been mined from open-source data, such as the New York Times, music lyrics, and Twitter (Dodds et al., 2011; Kloumann, Danforth, Harris, Bliss, & Dodds, 2012). Finally, De Deyne et al. (2013) have seen success in setting up a special Web site (https://www.smallworldofwords.com) to gamify the collection of word pair association norms.

The evolution of big data provides exciting opportunities for exploration into psycholinguistics, and this article features the trends in publications of normed datasets across the literature, allowing for a large-scale picture of the developments of trends in psychological stimuli. Historically, these norms have been published in journals connected to the Psychonomic Society, such as Behavior Research Methods, Psychonomic Monograph Supplements, and Perception and Psychophysics. The Psychonomic Society once hosted an electronic database that contained the links to these norms, as well as a search tool to find information about previously published works (Vaughan, 2004). The sale of the society journals to Springer publications has improved journal visibility and user-friendly access, but has also left a need for an indexed list of database publications that span multiple keywords and journal Web sites. Other researchers have started a similar task, publishing the Language Goldmine, an online searchable database of linguistic resources (List, Winter, & Wedel, n.d.). Within the Language Goldmine, users can find over 200 citations for linguistic resources, which are mostly corpora. This article extends that resource by: (1) presenting a searchable, cataloged database of normed stimuli and related materials for a wide range of experimental research, and (2) examining trends in the publications of these articles to assess the big data movement within cognitive psychology.

Website

This manuscript was written with R markdown and papaja (Aust & Barth, 2017) and can be found at https://osf.io/9bcws/. Readers can find the LAB’s Web site by going to http://www.wordnorms.com, and the source files for the Web site can be found at https://github.com/doomlab/wordnorms. From the Web page, the top navigation bar includes a link to direct the reader to the LAB page. On the LAB page, we have included a purpose statement and several summary options. First, the two variable tables include summary descriptions about the stimuli and keyword (tag) variables in this study using an embedded Shiny application. Shiny is an open-source graphical user interface R package that allows researchers to build interactive web applications (Chang, Cheng, Allaire, Xie, & McPherson, 2017). These apps connect to the LAB database and display the current sample size N, minimum, maximum, mean, standard deviation, and correlation across years for each variable, when appropriate. The advantage to using Shiny apps is dynamic updating of the database, so as new information is added, the app will display the most current statistics, while this paper represents a static point in the database development. The entire dataset can be viewed and filtered based on keyword, language, and stimuli type. This search app allows for multiple filter options, so a person may drill down into very specific search criteria. Underneath the search functions, yearly trend visualization and descriptive statistics may be found, including frequency tables of stimuli and keywords. Finally, the complete database in .csv format can be downloaded. Specific features will be outlined below in relation to the database creation.
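As a rough illustration of the kind of combined filtering the search app offers, the following Python sketch applies keyword, language, and stimuli-type filters to a few in-memory records. The field names (`tags`, `language`, `stimuli`), the semicolon-separated tag format, and the sample rows are assumptions made for illustration only; they are not taken from the LAB download, whose actual column layout should be checked before adapting this.

```python
from typing import Optional

# Hypothetical records: field names and values are illustrative assumptions,
# not the actual columns or contents of the LAB .csv download.
RECORDS = [
    {"title": "Study A", "language": "English", "stimuli": "words",
     "tags": "frequency;valence"},
    {"title": "Study B", "language": "Spanish", "stimuli": "words",
     "tags": "age of acquisition"},
    {"title": "Study C", "language": "English", "stimuli": "pictures",
     "tags": "familiarity;frequency"},
]


def filter_records(records, tag: Optional[str] = None,
                   language: Optional[str] = None,
                   stimuli: Optional[str] = None):
    """Return the records that satisfy every filter that is not None."""
    matches = []
    for rec in records:
        # Split the tag string so that a tag must match exactly, not as a substring.
        rec_tags = {t.strip().lower() for t in rec["tags"].split(";")}
        if tag is not None and tag.lower() not in rec_tags:
            continue
        if language is not None and rec["language"].lower() != language.lower():
            continue
        if stimuli is not None and rec["stimuli"].lower() != stimuli.lower():
            continue
        matches.append(rec)
    return matches


hits = filter_records(RECORDS, tag="frequency", language="English")
print([rec["title"] for rec in hits])
```

Each filter narrows the previous result, which mirrors how several criteria can be stacked to reach a very specific subset of the bibliography.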

The Web site includes more information on versioning of the dataset for users to reference, along with instructions on how and what others can contribute to the LAB. Viewers can suggest articles that should be included in the dataset by using the online Mendeley group (requires login and account) at https://www.mendeley.com/community/the-lab-linguistic-annotated-bibliography/ or using the email link included in the top right corner of the Web site. Mendeley is free reference software that allows for open-source groups to collaborate on curating reference lists. Additionally, we have provided a BibTeX reference file linked on the Web site that can be imported into most reference software programs.

Database methods

Materials

Bradshaw’s (1984) and Proctor and Vu’s (1999) lists of database information were used as starting points for collection of research articles. We searched Academic Search Premier, PsycInfo, and ERIC through the EBSCO host system, as well as Google Scholar and PLoS One to find other relevant articles using the following keywords: corpus, linguistic database, linguistic norms, norms, and database. Additionally, since a large number of the original articles were hosted by the Psychonomic Society, the Springer Web site was searched with these terms that covered the newer editions of Behavior Research Methods and Memory & Cognition. We then filtered for articles that met the following criteria: (1) contained database information as supplemental material, (2) demonstrated programs related to building research stimuli using normed databases, or (3) generated new calculations of lexical variables. Research articles that used normed databases in experimental design or tested those variables’ validity/reliability were excluded if they did not include new database information. Additional articles were found while coding initial publications by searching citations for stimuli selection. For example, the Snodgrass and Vanderwart (1980) norms were cited in multiple newer articles on line drawings, and therefore this article was subsequently entered into the database. Last, we consulted the Language Goldmine and included all citations from this resource that could still be accessed (List et al., n.d.). At the time of writing, 884 articles, books, Web sites, and technical reports were included in the following analyses.

Coding procedure

The tables with summaries from Bradshaw (1984) and Proctor and Vu (1999) were consulted for a starting point for data coding. Next, the first round of articles found (approximately 100) were analyzed to determine information that would be pertinent to a user who wished to search for normed stimuli. Based on these reviews and lab discussions, we coded the following information from each article: (1) journal information, (2) stimuli types, (3) stimuli language, (4) program or corpus name, (5) keywords, which we refer to as tags, (6) special populations, and (7) other notes that did not fit into those categories. Each piece of information is detailed below. In some instances, codes were not used as frequently as expected based on these initial discussions, but were included to allow more specificity in searching, as well as the flexibility to include those options for articles subsequently added to the database.

Journal information

Each article was coded with the citation information, and a complete list of citations can be found on the Web site portal by going to the search data section. All author last names are listed, along with publication year, article title, journal title, volume, page numbers, and digital object identifier (DOI) when available. This information is listed in citation format in the Shiny app and separated into columns in the downloadable data for easier sorting and searching. A complete list of publication sources and percentages can be found online by using the frequency statistics link.

Stimuli types

While this publication was originally intended for traditional linguistic database norms, other types of experimental stimuli used in concept studies were apparent after background review. Therefore, stimuli were coded based on the dominant description from the article (e.g., although heteronyms are words and word pairs, they were coded specifically as heteronyms). The number of stimuli presented in the appendix or database was coded with the stimuli if it was available. Generally, programs, corpora, and experimental creation tools, which make up the majority of the “other” stimuli category, did not include this information. Because many articles included two types of stimuli, or references to different articles from which stimuli were selected, two options for stimuli were included.

Therefore, the total values for number of stimuli do not add up to the number of articles in the database because of multiple instances in articles or no stimuli for program descriptions. Table 1 includes a stimuli list, the number of times that each stimulus type was used, the percentage of the total stimuli codes, and the minimum, maximum, mean, and standard deviation of the number of those stimuli. Brief variable descriptions are provided online in the stimuli table. Researchers often cited specific previous works from which stimuli were selected, and these references were included and can be found in the downloaded data. Table 1 is available dynamically online under the stimuli table and through the view frequency statistics link.

Table 1 Stimuli descriptive statistics

Stimuli language

The language of the stimuli set was coded by starting with the most common languages from the first articles surveyed, and others were added as it was apparent that several norms were present for that language (such as Japanese, Dutch, and Greek). A multiple category was created for datasets with more than one set of language norms, with more information about the languages available provided in the notes column. If the stimuli were non-linguistic selections, like pictures and line drawings, the language of the participants used to norm the set was used, which was commonly English. In order to help distinguish these norms, a column was added to the downloadable data that denoted non-linguistic norms (coded as 0 for linguistic, 1 for non-linguistic). For each language, the Glottolog codes were added in a separate column to help identify them (Hammarstrom, n.d.). One potential limitation of the LAB was that English is the first language for the authors; however, translation tools were used to code sources found in other languages. Table 2 indicates language frequencies and percentages, and the online version can be found by clicking the view frequency statistics link.

Table 2 Language descriptive statistics

Program/corpus name

Megastudies are often named, such as the English Lexicon Project (Balota et al., 2007), for easier reference. This information was included in the dataset, which will also help researchers with the stimuli references as described above. For example, a newer study may reference using the BOSS database (Brodeur, Dionne-Dostie, Montreuil, & Lepage, 2010), and having that information would make searching for the original article easier by using the corpus name column (especially in instances where the dataset name is not listed in the article title). The names of programs or tools were also entered, such as NIM (Guasch, Boada, Ferré, & Sánchez-Casas, 2013), a newer stimuli selection tool for psycholinguistic studies.

Keyword tags

Keyword tags are the majority of the database, as they allow for the best understanding of trends and availability of stimuli. Tables 3 and 4 portray a list of tags, frequencies, percentages, and correlations (described below) for tags with sample sizes greater than 10. Tag descriptions are provided online under tag table. Each article was coded with tags based on the description of the accessible data, and a single article may have multiple tags. However, due to the cumulative nature of database research, this tagging system does not mean that each article collected that particular type of data. The most common example of this distinction occurs when data was combined across sources, but presented in a new article. The Maki, McKinley, and Thompson (2004) semantic distance norms also included values from the South Florida Free Association norms (Nelson, McEvoy, & Schreiber, 2004), and Latent Semantic Analysis (Landauer & Dumais, 1997). Therefore, this article was coded with association and semantics, even though the association norms were not collected in that paper. As described above, some small frequency tags were used because of the initial pass through newer articles, but these were left in the database because of their specificity, and they can be used in future additions.
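The multi-tag coding described above can be illustrated with a small Python sketch that tallies how often each tag occurs when a single article may carry several tags. The article labels, tag assignments, and the choice to express percentages relative to all tag codes are assumptions for illustration; they do not reproduce the LAB data or necessarily the exact computation behind Tables 3 and 4.

```python
from collections import Counter

# Hypothetical article-to-tag coding: labels and tags are illustrative only.
ARTICLE_TAGS = {
    "Article 1": ["association", "semantics"],
    "Article 2": ["frequency"],
    "Article 3": ["frequency", "valence", "arousal"],
    "Article 4": ["association"],
}


def tag_table(article_tags, min_count=1):
    """Count each tag once per article and report (count, percent of all tag codes)."""
    counts = Counter(
        tag for tags in article_tags.values() for tag in set(tags)
    )
    total = sum(counts.values())
    return {
        tag: (n, round(100 * n / total, 1))
        for tag, n in counts.most_common()
        if n >= min_count  # optional cutoff, analogous to reporting only frequent tags
    }


print(tag_table(ARTICLE_TAGS, min_count=2))
```

Because an article contributes to every tag it carries, the tag counts sum to more than the number of articles, which is why tag totals and article totals should not be expected to match.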

Table 3 Tag descriptive statistics
Table 4 Tag descriptive statistics continued

Special populations

While coding articles, it became apparent that a subset of the normed data was tested on specific special populations. Consequently, demographic data such as gender, age, ethnicity, and grade school year were listed as described in the article (i.e., if ages were used, age was listed, but if grade year was used, it was listed rather than translating to specific ages).

Other/Notes

Lastly, places for more description were included for tags or variables not frequently used, which was especially useful for program descriptions, as well as descriptions of specific types of stimuli (i.e., CVC trigrams). In several instances, notes that appeared frequently were moved to tags (such as similarity) after the database had several hundred articles sampled. All information described above without a specific table (special populations, other, program/corpus names, and journal information) can be found by downloading the complete dataset.

Results and discussion

Journals

Journal results, unsurprisingly, show that the wealth of data was published in Behavior Research Methods (57.6% combined across name changes). The next largest publishers of articles were Psychonomic Monograph Supplements (2.1%), Journal of Verbal Learning and Verbal Behavior (1.7%), Psychonomic Science (1.7%), Journal of Experimental Psychology (combined across subjournals, 2.5%), Perception & Psychophysics (1.5%), Memory & Cognition (1.4%), Bulletin of the Psychonomic Society (0.8%), and Norms of Word Association (0.9%; Postman & Keppel, 1970). The complete list can be found in the frequency statistics online, as there were many different entries for journals, books, and Web sites of publications. While some of these sources were not published with peer review, they were generally found through citations of other peer-reviewed work or through the Language Goldmine. Although Behavior Research Methods has dominated the field for publications, the large array of options for publishing indicates a growth in the available avenues for researchers in this field (for example, open-source journals such as PLoS ONE and Web sites).

Figure 1 portrays the number of publications across years, and there has been a clear expansion of database and program papers, as part of the growth in big data. Interestingly, a first growth of publications tracks with the 1950s cognitive revolution (Miller, 2003), but an odd decline in publications occurred from the 1970s to the 1990s. The last 20 years have shown remarkable progress in this area, with over 359 publications since 2010 alone. This chart can be found in greater detail online, under Papers per Year by clicking on the view the yearly statistics link, showing ups and downs of publications by year in a larger format with the ability to control year range. For example, 2004 and 2008-2018 were big years for linguistic publications, each with 30 or more publications. Even with these fluctuations, a clear growth curve in publications can be found since the 1990s.

Fig. 1 Overall publication frequency across years

Stimuli

Stimuli are presented in Table 1, and a review of this table indicated that the publication of word stimuli was the largest category (38.2%), followed by corpora (11.9%). Other types of word stimuli also appear commonly in the LAB data such as categories, letters, and word pairs. Because linguistic data were of particular interest, we selected publications based on words and word pairs, and plotted the number of stimuli presented in the paper to examine big data trends. These data were broken down by set size in Fig. 2. The upper left-hand quadrant shows all stimuli across years, and the big data publications stand out in the last 15 years of publications. We excluded two data points that included over 1 million words to show the increase in publication of larger datasets across years. These data were then further broken down into smaller datasets (< 10,000 stimuli; upper right quadrant), and larger datasets (10,000+ stimuli; bottom left quadrant). The smaller dataset graph shows that these publications are common across time, while the bottom quadrant was more telling for the megastudies trend investigation. As with languages and tags (below), we see an increase in the number of larger datasets across the years.
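The set-size breakdown described above can be sketched in Python as follows. The thresholds come from the text (fewer than 10,000 stimuli versus 10,000 or more, with datasets over 1 million words left out of the plots), while the (year, size) pairs are invented for illustration and do not correspond to entries in the LAB.

```python
# Hypothetical (publication_year, number_of_stimuli) pairs: illustrative values only.
DATASETS = [(1968, 925), (1999, 2500), (2007, 40481), (2012, 30000), (2014, 1500000)]

SMALL_LIMIT = 10_000       # boundary between smaller and larger datasets (from the text)
OUTLIER_LIMIT = 1_000_000  # datasets above this size are excluded from the plots


def split_by_size(datasets):
    """Drop outliers, then split the remaining datasets into smaller and larger sets."""
    kept = [(year, n) for year, n in datasets if n <= OUTLIER_LIMIT]
    small = [(year, n) for year, n in kept if n < SMALL_LIMIT]
    large = [(year, n) for year, n in kept if n >= SMALL_LIMIT]
    return small, large


small, large = split_by_size(DATASETS)
print(len(small), "smaller datasets;", len(large), "larger datasets")
```

Plotting the two groups on separate y-axis scales is what lets the larger datasets stand out, since a shared scale would compress the smaller sets toward zero.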

Fig. 2 Number of word stimuli plotted across years. The top left quadrant includes all word stimuli, minus two outliers. The top right quadrant includes word stimuli ranging up to 10,000 words, and the bottom left quadrant portrays stimuli counts exceeding 10,000. The x-axis is consistent across graphs; however, the y-axis is scaled for the range of stimuli targeted in that graph

Languages

The variety and number of languages for stimuli provided a picture of the growth and diversity of psycholinguistic stimuli, as seen in Table 2. A growing number of articles include non-English languages, such as Spanish (6.9%), French (5.2%), and German (4.2%), and some even include multiple languages (9.7%). To examine trends, the English-only articles were filtered out of the dataset since they were the majority of publications (53.2%) and were published across all years present in these data. Of the 389 non-English publications, 86 included multiple languages, and 45 of these were published after 2010. Additionally, the last 10 years (2008 and later) have seen an explosion of publications in non-English languages: 256, with 32 in 2017 alone. The publication of varied languages is still largely from WEIRD cultures (Western Educated Industrialized Rich Democratic; Henrich, Heine, & Norenzayan, 2010) and Indo-European languages, thus indicating room for cross-linguistic improvement.

Tags

Tables 3 and 4 display the number, percentages, and correlations with year for tags with sample sizes greater than 10. Undoubtedly, these tags represent changes in terminology over time, and some could be combined or recoined. However, even if low-frequency (N <= 10; 11 tags in our dataset) tags were excluded, 38 different tags were used to describe the types of psycholinguistic data. Many of these tags can be considered individual research areas, and the sizeable number of different options indicates how complex and diverse the field has become since the publication of free association norms in 1910 (Kent & Rosanoff, 1910).

The total number of tags for each publication was then tallied, and this data was plotted in Fig. 3 to visualize if the number of variables included in a study has grown over time (M = 2.45, SD = 2.30). The correlation between total tags and year was r = .17, 95% CI [.10, .23], t(843) = 4.90, p < .001, indicating a small increase in total tags used over time. Even considering the larger number of publications in the 2000s versus 1950s to 1970s, it appeared that the number of keywords for articles was also slowly growing over time. This trend may indicate the evolution in computing possibilities to be able to publish large amounts of data, but also may indicate a desire to combine datasets so that even more stimuli may be considered at once for modeling or experiment creation.
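For readers who want to see how the quantities in this kind of report relate to one another, the Python sketch below computes a Pearson correlation, the corresponding t statistic, and a confidence interval based on the Fisher z transformation. The Fisher approach is an assumption about how such an interval might be obtained; the article does not state which method was used, and the function names are hypothetical.

```python
import math


def pearson_r(x, y):
    """Pearson correlation between two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def t_from_r(r, n):
    """t statistic and degrees of freedom for testing r against zero."""
    df = n - 2
    return r * math.sqrt(df / (1 - r ** 2)), df


def fisher_ci(r, n, z_crit=1.959964):
    """Approximate 95% confidence interval for r via the Fisher z transformation."""
    z = math.atanh(r)
    se = 1 / math.sqrt(n - 3)
    return math.tanh(z - z_crit * se), math.tanh(z + z_crit * se)


# Reported values: r = .17 with 843 degrees of freedom, so n = 845 publications.
t_value, df = t_from_r(0.17, 845)
low, high = fisher_ci(0.17, 845)
print(round(t_value, 2), df, round(low, 2), round(high, 2))
```

Entering the rounded r = .17 gives a t value of roughly 5.0 rather than the reported 4.90; a difference of this size is consistent with r having been rounded to two decimals before reporting. The interval from this sketch rounds to the reported [.10, .23].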

Fig. 3 Number of tags included in each publication across years

Next, tags with at least 30 publications were investigated individually for trends across time (correlations presented in Tables 3 and 4). Individual histograms can be created by using the Tags Per Year area online by clicking on the view the yearly statistics link, which shows the total frequency of the selected tag by year. Some small positive trends were found, such as the increase in arousal, age of acquisition, syllables, familiarity, and valence norms. Intriguingly, meaningfulness and association both showed negative correlations, but these correlations can be understood as an artifact of the publication of a book on association norms in the 1970s (Postman & Keppel, 1970), as well as a recent drop-off in the small but steady use of meaningfulness. These small correlations may partially be explained by the sheer number and variation of data available in the LAB portal, as one would expect the number of frequency tags to increase with the recent SUBTLEX publications. Indeed, if the frequency tags were plotted by year, an increase across the last decade (18 in 2010, 15 in 2013, and 22 in 2014) can be found. Readers are encouraged to view the individual graphs for tags to investigate the change of keyword publication over time, including the rise and demise of several research areas. For example, confusion matrices’ heyday appeared to range from the early 1970s to the mid-1980s, while arousal norms do not make a consistent appearance until the late 1990s.

Conclusions

This article had two main purposes: (1) to present the LAB dataset and portal as an annotated bibliography and searchable tool for researchers, and (2) to view trends in psycholinguistic research with an eye toward big data. We believe the LAB Web site will be a useful channel for all levels of researchers, from graduate students looking for experimental stimuli to design their experiments, to the familiar investigator who wishes to dig deeper into the diverse choices offered. The Language Goldmine presents a similar resource, but the advantage to the LAB is the breadth of publications coded, as well as the coding schema that allowed for investigation of individual trends in publication. While the majority of publications occur in one particular journal, the LAB allows someone to find articles they may have missed in other areas with the advantage of being collected into one location. User-friendly search tools are provided to aid in searching for specific languages, stimuli, or keywords, as well as multiple outputs for easy copying into Excel or SPSS. While this article’s statistics will become dated with the updates to the LAB, dynamic tables and graphs are provided online to see the current status of the field. Lastly, we encourage users to actively report errors and suggest updates for the LAB dataset as a way to crowdsource information that is surely missing, especially in non-English languages.

In the introduction, we provided two examples of current megastudies (SUBTLEX and the Lexicon projects), in addition to how researchers might collect big data through Mechanical Turk or Twitter. This article focused on the breadth of the field to use the information provided by publications as a window into the fluctuations of interest in areas. Megastudies have become a prevalent topic, but data could have revealed that this popularity was due to recent publication of a small subset of articles. Instead, analyses showed that not only are the numbers of publications accumulating, but the sizes of datasets are also growing in tandem. Megastudies specifically focus on large datasets, but big data can also be indicated here by the divergence in languages available, number of places to publish such data, and the increasing number of keywords for articles across years. Time will tell if these trends can and will continue or if certain areas will see a confusion matrix type decline after several large datasets are published. With the move of traditional lab experiments to smartphone and tablet technology (Dufau et al., 2011), it seems likely that researchers in psycholinguistics will continue to find new and creative ways to modernize the field.