Introduction

Distributional models of semantic memory provide a powerful computational approach to understanding how people represent knowledge about real-world objects, individuals, and events. These models describe knowledge representations using high-dimensional vectors trained on natural language word co-occurrence data, and subsequently specify the association between any two words using the distance between their corresponding vectors (Dhillon, Foster, & Ungar, 2011; Griffiths, Steyvers, & Tenenbaum, 2007; Jones & Mewhort, 2007; Landauer & Dumais, 1997; Mikolov et al., 2013; Pennington, Socher, & Manning, 2014).

The idea that knowledge representations are derived from the distribution of words in natural language has a long history in psychology, linguistics, and other areas of cognitive science (Firth, 1957; Harris, 1954). However, with advances in computer technology, as well as the availability of large online datasets of natural language corpora, this insight has been translated into the development of tools and techniques for uncovering the actual knowledge representations possessed by individuals. Such representations have been shown to successfully predict behavior in a wide range of cognitive tasks, including similarity judgment, categorization, cued recall, and free association (for reviews, see Bullinaria & Levy, 2007, or Jones, Willits, & Dennis, 2015). These representations are also highly successful at modeling language use in humans, and for this reason are also commonly applied to problems involving the automated understanding of language in computational linguistics (Turney & Pantel, 2010).

Although most of the above work is focused on relatively low-level cognition, recently Bhatia (2017) has shown how this approach can be extended to model high-level judgment. Many such judgments are associative (Kahneman, 2003; Sloman, 1996), and distributional models can provide a quantitative measure of the strength of association between questions and feasible responses. For example, Bhatia (2017) finds that measures of association derived from distributional semantic models accurately predict participant responses to probability judgment and factual judgment questions, with participants being most likely to select responses that are highly associated with the content of the question. This relationship holds both when the associative response is correct and when it is incorrect, showing that distributional semantic models accurately describe both adaptive and fallacious judgment.

All of the results discussed above have been documented in a controlled lab setting. However, the recent computational and societal developments that have made large natural language datasets available for model training have also made similarly large datasets of human behavior available for model testing (for a discussion of such datasets and the need to use these datasets in cognitive research, see Griffiths, 2015 or Jones, 2017). Thus, it is now possible to apply distributional models of semantic memory to predict high-level cognitive phenomena observed in a variety of real-world circumstances.

In this paper, we attempt such a test, using a dataset of questions from the “Jeopardy!” game show. We apply existing distributional models to obtain vector-based knowledge representations for each of the words in the questions in our dataset. Subsequently, we are able to compute a measure of the associative strength between the clue in each question and the correct response to the question. We use this measure to predict whether contestants are able to successfully provide the correct response. If associations are at play in high-level judgment, and if distributional models accurately quantify these associations, we should expect higher contestant accuracy in questions where correct responses are strongly associated with their clues. This would be the case despite the fact that the “Jeopardy!” game show involves highly skilled contestants in real-world environments with complex stimuli and large monetary and social incentives. Thus, our goal is not to build a question-answering system capable of providing correct responses (e.g., Ferrucci, 2012), but rather to study contestant accuracy and error in the wild, using a theoretically grounded model of knowledge and association.

Methods

Overview of data

The “Jeopardy!” game show presents contestants with clue-based questions. Contestants must respond to these questions with the correct response (typically a single word or phrase) to the clue. The questions have varying monetary values, and contestants earn money or lose money based on the accuracy of their responses. There are three contestants in each game show, and these contestants typically compete to respond to the clue as quickly as possible after it has been read. Thus, responses are made under considerable time pressure.

The clues, as well as the correct responses to the clues, have been compiled by “Jeopardy!” fans on www.j-archive.com. This website contains transcripts for the game shows from 1984 to the present, and we scraped this website to obtain 298,820 questions across 5,082 different games. For each of these questions we had both the clue text and the correct response text. We also had various question- and game-level data, including the monetary value of the question, whether the question was in the first round (Jeopardy!), the second round (Double Jeopardy!), or the third round (Final Jeopardy!), whether the question was a Daily Double question, and when the game was played. Importantly, we also obtained data on whether contestants were able to respond to the question correctly. Note that there are some differences between the structure of regular questions in the first two rounds, Daily Double questions in the first two rounds, and Final Jeopardy! questions, and so we analyzed each of these three sets separately. The Online Supplemental Materials describe the “Jeopardy!” game structure and our dataset in detail.
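To illustrate the data-collection step, a minimal scraping sketch is shown below. The URL pattern, the "clue_text" class name, and the function name are assumptions introduced for illustration, not a description of our actual scraping pipeline.

```python
# Hypothetical sketch of collecting clue text from one j-archive.com game page.
import requests
from bs4 import BeautifulSoup

def scrape_game(game_id):
    """Return the clue texts found on a single game page (illustrative only)."""
    # The URL pattern and the "clue_text" class are assumptions for illustration.
    url = f"http://www.j-archive.com/showgame.php?game_id={game_id}"
    html = requests.get(url, timeout=30).text
    soup = BeautifulSoup(html, "html.parser")
    return [el.get_text(" ", strip=True) for el in soup.find_all(class_="clue_text")]

if __name__ == "__main__":
    clues = scrape_game(1)
    print(len(clues), "clues scraped from game 1")
```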

Overview of analysis

We used a prominent prebuilt set of vector representations to examine the relationship between contestant accuracy and the association between the words in the questions’ clues and the words in the corresponding responses. The representations we used were generated by the Global Vectors for Word Representation (GloVe) model (Pennington et al., 2014), which performs a dimensionality reduction on word co-occurrence matrices, emphasizing the use of the ratios of word-word co-occurrence probabilities. We obtained publicly available GloVe vectors from Pennington et al.’s online repository (http://nlp.stanford.edu/projects/glove/). These vectors were trained on a six-billion-word corpus combining English-language Wikipedia with the English Gigaword corpus, and have a vocabulary of 400,000 words. Bhatia (2017) found that these vectors described participant responses in high-level judgment tasks with considerable accuracy, and so we restrict the analysis in the main text of this paper to the GloVe vectors. In the Online Supplemental Materials we replicate the results of our analysis using the Word2Vec and Eigenwords vector representations (Dhillon et al., 2011; Mikolov et al., 2013), also considered in Bhatia (2017).
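For concreteness, the pretrained vectors are distributed as a plain-text file in which each line contains a word followed by its 300 coordinates. A minimal loading sketch, assuming the publicly distributed 300-dimensional, 6B-token file name, is as follows.

```python
import numpy as np

def load_glove(path="glove.6B.300d.txt"):
    """Load pretrained GloVe vectors into a {word: vector} dictionary."""
    vectors = {}
    with open(path, encoding="utf-8") as f:
        for line in f:
            parts = line.rstrip().split(" ")       # word followed by 300 floats
            vectors[parts[0]] = np.asarray(parts[1:], dtype=np.float32)
    return vectors

glove = load_glove()          # roughly 400,000 words, 300 dimensions each
print(glove["question"][:5])  # first five coordinates of one vector
```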

We computed the association between each clue and response in our dataset, as assessed by the vector representations. These representations specify a word i as a 300-dimensional vector \( \boldsymbol{w}_i \). For a given question, we first generated an aggregate representation of the question clue by taking the average of its words’ vectors, weighted by the frequency of the words in the clue (excluding highly common “stop words” and words that were not present in GloVe’s vocabulary). The vector for a clue, c, can be written as \( \boldsymbol{c}=\frac{\sum_i n_i\,\boldsymbol{w}_i}{\sum_i n_i} \), where \( n_i \) is the number of times word i occurs in the clue. We used the same method to build a vector representation of the correct response, r, and in turn specified the association between the clue and the response based on the proximity of c and r. As in prior work, we quantified this proximity using cosine similarity, so that the association between c and r is \( A(\boldsymbol{c},\boldsymbol{r})=\frac{\boldsymbol{c}\cdot \boldsymbol{r}}{\left\Vert \boldsymbol{c}\right\Vert \,\left\Vert \boldsymbol{r}\right\Vert} \). A(c,r) ranges between -1 and +1, with higher values corresponding to clues and responses that are more closely associated. The Online Supplemental Materials in Bhatia (2017) provide additional details about the computational techniques used in our analysis.
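A minimal sketch of this computation is shown below, assuming the `glove` dictionary loaded above; the small stop-word set and the whitespace tokenization are simplifications introduced for illustration.

```python
import numpy as np
from collections import Counter

STOP_WORDS = {"the", "a", "an", "of", "in", "to", "and", "is", "this"}  # illustrative subset

def text_vector(text, vectors):
    """Frequency-weighted average of the vectors for the words in `text`,
    skipping stop words and out-of-vocabulary words; None if nothing remains."""
    counts = Counter(w for w in text.lower().split()
                     if w not in STOP_WORDS and w in vectors)
    if not counts:
        return None
    total = sum(counts.values())
    return sum(n * vectors[w] for w, n in counts.items()) / total

def association(clue, response, vectors):
    """Cosine similarity A(c, r) between the clue and response vectors."""
    c, r = text_vector(clue, vectors), text_vector(response, vectors)
    if c is None or r is None:
        return None
    return float(np.dot(c, r) / (np.linalg.norm(c) * np.linalg.norm(r)))
```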

Results

Summary statistics

Table 1 presents the summary statistics for the first round (Jeopardy!) and second round (Double Jeopardy!) questions in our dataset, separated by question value. For each type of question, it presents the total number of such questions in the dataset, and the mean and standard deviation of contestant accuracy on these questions. This table also presents the total number of such questions in the dataset for which we were able to compute the association between the question clue and the correct response with the GloVe representations, as well as the mean and standard deviation of these association scores. Here contestant accuracy is a binary variable that, for each question, indicates whether or not at least one of the contestants managed to provide the correct response. Association, in contrast, is a continuous variable ranging from -1 to +1, calculated as the cosine similarity between the question clue and its corresponding correct response. We were unable to calculate associations for some questions because either their clues or their responses are composed entirely of words absent from the GloVe vocabulary.

Table 1 Summary statistics for different types of Jeopardy! questions. Here “Total # Quest.”, “Con. Acc. Mean”, and “Con. Acc. Std.” describe the total number of each type of question, as well as the mean and standard deviation of contestant accuracy on the questions. “Assoc. # Quest.”, “Assoc. Mean”, and “Assoc. Std.” describe the total number of each type of question for which we were able to compute associations, as well as the mean and standard deviation of the associations for these questions

Table 1 illustrates a number of regularities in our data. First, contestant accuracy is fairly high, averaging between 66% and 97% depending on the type of question under consideration. Likewise, the association measure is also relatively high: unsurprisingly, the correct response for a question tends to be associated with the content of the question clue. More importantly, however, we see that both contestant accuracy and association vary systematically with question value. Question value reflects question difficulty, and we find that contestants tend to answer low-valued easy questions more accurately than high-valued difficult questions. The low-valued questions are also the ones for which the association of the clue and correct response is particularly high. This suggests that there may be a systematic relationship between association and the ability of contestants to give correct responses.
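As an illustration of how such a summary table can be assembled from question-level data, consider the following sketch; the DataFrame and its column names ("round", "value", "answered_correctly", "association") are hypothetical stand-ins for our actual data structures.

```python
import pandas as pd

# `questions` is assumed to be a DataFrame with one row per question and
# (hypothetical) columns: "round", "value", "answered_correctly" (0/1), and
# "association" (NaN where no association could be computed).
def summary_table(questions):
    """Build a Table-1-style summary, grouped by round and question value."""
    grouped = questions.groupby(["round", "value"])
    return pd.DataFrame({
        "Total # Quest.": grouped.size(),
        "Con. Acc. Mean": grouped["answered_correctly"].mean(),
        "Con. Acc. Std.": grouped["answered_correctly"].std(),
        "Assoc. # Quest.": grouped["association"].count(),   # non-missing only
        "Assoc. Mean": grouped["association"].mean(),
        "Assoc. Std.": grouped["association"].std(),
    })
```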

Contestant accuracy in regular questions

The goal of this section is to rigorously test this relationship. More specifically, we examine the correlation between the association of a question clue and its correct response, and contestant accuracy for the question (whether or not one of the contestants managed to provide the correct response). Overall, we find a very strong positive relationship between association and contestant accuracy. This is illustrated in Fig. 1, which plots the average contestant accuracy as a function of association, as assessed by the GloVe vectors. Here we have divided all our questions into ten equally sized groups (deciles) based on the strength of the association measure for the questions, and pooled contestant accuracy for each of these groups. For the reasons discussed above we exclude Daily Double and Final Jeopardy! questions, as well as questions for which we were unable to compute association (those whose component words are not in the GloVe vocabulary). This leaves us with N = 272,412 regular questions for the analysis in this section. The histogram nested within Fig. 1 shows the distribution of associations for all questions. As can be seen, these association scores are approximately normally distributed.

Fig. 1

Average contestant accuracy for questions with different strengths of association between clues and correct responses. The x-axis indicates the association decile (ranging from weakest association to strongest association) for each group of questions, whereas the y-axis indicates the proportion of the questions that are answered correctly by some contestant. The nested histogram shows the distribution of associative strength across all our questions. Error bars indicate standard error
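The decile pooling underlying Fig. 1 can be sketched as follows, continuing the hypothetical `questions` DataFrame from the earlier sketch; column names remain illustrative.

```python
import pandas as pd

# Restrict to regular first- and second-round questions with a computable association.
regular = questions.dropna(subset=["association"]).copy()
regular["decile"] = pd.qcut(regular["association"], 10, labels=list(range(1, 11)))

# Proportion of questions answered correctly, and its standard error, per decile.
accuracy_by_decile = regular.groupby("decile")["answered_correctly"].agg(["mean", "sem"])
print(accuracy_by_decile)
```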

Figure 1 shows that contestant accuracy increases, on average, with the association between the question clue and the correct response. Overall, contestant accuracy is at its lowest (around 82%) for the questions whose correct responses are unassociated with the question clues (the first decile), and at its highest (around 87%) for the questions whose correct responses are highly associated with the clues (the ninth decile). Contestant accuracy does appear to drop for the last decile of questions. This could be due to a ceiling on the effect of associative strength on accuracy (beyond a certain level, further increases in association no longer facilitate recall, and accuracy regresses toward the mean). Alternatively, this may capture the effect of questions with multiple highly compelling intuitive answers (of which only one is correct). In the Online Supplemental Materials we provide exploratory analysis suggesting that the latter explanation may be correct.

We first examined this relationship statistically using a simple logistic regression. In this regression, our dependent variable was the contestant accuracy for a given question (1 if it was answered correctly by at least one of the three contestants; 0 otherwise), and our primary independent variable was the association between the question clue and correct response, as measured by cosine similarity on our GloVe vectors. This regression revealed a strong positive effect of association on contestant accuracy (β = 0.78, z = 23.46, p < 0.001, 95% CI = [0.72–0.85], OR = 2.18). We also ran a more rigorous variant of this analysis. This second regression included controls for the monetary value of the question (a dollar amount ranging from $100 to $2,000), in order to ensure that the relationship observed in the regression and in Fig. 1 is not confounded by question difficulty. It also included controls for whether the question was part of the first or second round (1 if in Double Jeopardy!; 0 otherwise) and for the year in which the game was played (between 1984 and 2016), and it allowed random intercepts for the game under consideration, in order to accommodate game-level effects on contestant accuracy. Finally, as we suspected that the effect of association on contestant accuracy varies across easy and difficult problems, we also included an interaction term between association and question value.
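A sketch of these two regressions using the statsmodels formula interface is shown below; the column names continue the hypothetical DataFrame from the earlier sketches, and the game-level random intercepts are omitted here (fitting them would require a mixed-effects logistic model, e.g., lme4::glmer in R).

```python
import statsmodels.formula.api as smf

# Hypothetical columns: "answered_correctly" (0/1), "association", "value" (dollars),
# "double_jeopardy" (0/1), and "year".
simple = smf.logit("answered_correctly ~ association", data=regular).fit()

# With controls and the association x value interaction; game-level random
# intercepts (reported in the text) are omitted from this fixed-effects sketch.
controlled = smf.logit(
    "answered_correctly ~ association * value + double_jeopardy + year",
    data=regular,
).fit()
print(controlled.summary())
```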

Our second regression again found a strong positive relationship between association and contestant accuracy (β = 0.70, z = 10.66, p < 0.001, 95% CI = [0.57–0.82], OR = 2.01). In addition, we found a strong negative effect of question value, showing that contestant accuracy drops for harder questions (β = -0.14×10^-2, z = -61.47, p < 0.001, 95% CI = [-0.15×10^-2 – -0.14×10^-2], OR = 0.9986). Our analysis also revealed positive effects for both Double Jeopardy! (β = 0.38, z = 26.20, p < 0.001, 95% CI = [0.36–0.42], OR = 1.46) and year (β = 0.31×10^-1, z = 26.69, p < 0.001, 95% CI = [0.29×10^-1–0.34×10^-1], OR = 1.03), indicating that contestants are more accurate in the second round of the game show (once question value has been controlled for) and in more recent game shows. Finally, we noted a negative interaction effect between question value and association (β = -0.26×10^-3, z = -4.29, p < 0.001, 95% CI = [-0.37×10^-3 – -0.14×10^-3], OR = 0.9997), indicating that the positive effect of association on accuracy drops as the questions get harder.

The effect of association on contestant accuracy for different types of questions is shown in Fig. 2. As in Fig. 1, questions are pooled based on association (this time using quartiles rather than deciles), and the average contestant accuracy for each set of questions is calculated and plotted separately based on the monetary value of the question and whether the question was in the first or second round of the game show. We repeat the analysis in this section with the Word2Vec and Eigenwords representations in our Online Supplemental Materials. There, we also repeat our analysis after excluding questions in which the correct answer is actually present in the clue text (to ensure that such questions are not driving our results).

Fig. 2

Average contestant accuracy for questions with different strengths of association between clues and correct responses, for different question types (here “DJ” corresponds to the Double Jeopardy! round). The x-axis indicates the association quartile (ranging from weakest association to strongest association) for each group of questions, whereas the y-axis indicates the proportion of the questions that are answered correctly by some contestant. Error bars indicate standard error

Contestant accuracy in daily double questions

We also tested the above effects for the Daily Double questions. Note again that these questions have a different format from the regular questions, in that contestants do not have to compete to provide the response first, and can additionally specify the amount of money they wish to wager on the question. For the Daily Double questions (N = 14,584) we again ran a logistic regression with contestant accuracy as the main dependent variable and the association between the clue and the correct response as the main independent variable. We found a significant positive relationship between these two variables, both with a simple logistic regression (β = 0.30, z = 2.88, p < 0.01, 95% CI = [0.10–0.51], OR = 1.35) and with a more extensive regression with the multiple controls and random intercepts used in the prior section (β = 0.35, z = 2.03, p < 0.05, 95% CI = [0.01–0.70], OR = 1.42). Unlike in our previous analysis, however, question value had a positive relationship with contestant accuracy (β = 0.12×10^-3, z = 4.58, p < 0.001, 95% CI = [0.07×10^-3–0.17×10^-3], OR = 1.0001). This likely reflects the contestants’ confidence, which correlates positively with both wagered amounts and accuracy for Daily Double questions. This may also explain why we fail to find an interaction effect between question value and association (p > 0.10).

It is useful to note that the magnitude of the effect of associative strength on contestant accuracy is much smaller for the Daily Double questions than for the regular questions in the prior section. This may reflect the fact that contestants do not have to compete to provide responses, and thus need not rely as strongly on associative cues (which are likely to be disproportionately used under time pressure). We tested this formally by combining our Daily Double questions with the regular questions from the previous section, and performing a logistic regression to predict contestant accuracy. This regression included main effects for association and Daily Double, as well as an interaction between these two variables. Like our previous regressions, it also included controls for the year and the value of the question, and random effects for the game. As expected, this regression showed a positive effect of association on accuracy (β = 0.61, z = 17.77, p < 0.001, 95% CI = [0.54–0.67], OR = 1.84). More interestingly, however, we obtained a negative interaction effect between association and Daily Double, indicating that contestants are less likely to use association for such questions (β = -0.41, z = -3.39, p < 0.001, 95% CI = [-0.65– -0.17], OR = 0.66).
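A sketch of this combined regression, continuing the hypothetical DataFrames from the earlier sketches (and again omitting the game-level random effects), might look as follows.

```python
import pandas as pd
import statsmodels.formula.api as smf

# Hypothetical continuation: `regular` and `daily_double_questions` are DataFrames
# of regular and Daily Double questions, each carrying a 0/1 "daily_double" column.
combined = pd.concat([regular, daily_double_questions], ignore_index=True)

model = smf.logit(
    "answered_correctly ~ association * daily_double + value + year",
    data=combined,
).fit()

# A negative coefficient on the interaction term would indicate a weaker
# association-accuracy relationship for Daily Double questions.
print(model.params["association:daily_double"])
```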

The Online Supplemental Materials report a similar analysis for Final Jeopardy! questions. This round differs from the others in that all three contestants must provide an answer to the question. Here we found no significant correlation between clue–response association and contestant accuracy. This could reflect the fact that Final Jeopardy! questions are some of the hardest questions in the game show, and associations are less useful for these types of questions (as evidenced by the negative interaction effect between question value and association, shown previously). It could also be due to the fact that contestants do not have to compete to provide responses, and thus need not rely as strongly on associative cues; indeed, this was the case for the Daily Double questions analyzed above. Both these issues are likely compounded by the relatively small sample sizes in our dataset for Final Jeopardy! questions (there is only one such question per game).

Discussion

We used distributional models of semantic memory to specify the strength of association between clues in the Jeopardy! game show and their corresponding correct responses. We found that contestants are more likely to provide the correct response if this response is strongly associated with the clue. This relationship weakens when questions increase in difficulty (as with high monetary value Jeopardy! questions) and when contestants are not under time pressure to respond first (as with Daily Double questions).

Our results provide strong support for the predictive power of distributional models of semantic memory (Dhillon et al., 2011; Griffiths et al., 2007; Jones & Mewhort, 2007; Landauer & Dumais, 1997; Mikolov et al., 2013; Pennington et al., 2014), showing that such models can be successful even in the context of high-level associative judgment (Kahneman, 2003; Sloman, 1996; also see Bhatia, 2017). In addition, they showcase a novel method for analyzing high-level cognition in the real world. Such analyses ensure the robustness and generalizability of existing theories in settings with much more data, complexity, and realism than those achievable in the laboratory. They are also valuable for understanding the ways in which cognitive mechanisms (such as those involving associative judgment) manifest in everyday life, thereby facilitating the development of richer theories of human cognition and behavior (Griffiths, 2015; Jones, 2017).

Some readers may note a similarity between the dataset used in this paper and that used to train IBM Watson’s groundbreaking Jeopardy!-playing computer (see Ferrucci, 2012). Note, however, that unlike IBM, we do not aim to answer Jeopardy! questions accurately, but rather to study the psychological determinants of human Jeopardy! responses (both correct and incorrect). Of course, future work could adopt some of the computational advancements developed for question-answering systems such as Watson.

Such work could also attempt to integrate the proposed approach with more sophisticated psychological theories of question-answering (e.g., Anderson et al., 2004; Reder, 1987), which are able to process complex relations between the clues and the responses, while also specifying metacognitive processes for controlling memory search and response generation. We look forward to research that exploits these new and exciting data sources and techniques, to further integrate the analysis of large-scale human data into the study of cognition and behavior.