Over three decades ago, Nelson and Narens (1980) published normative information for a set of questions over multiple domains of knowledge. They created 300 questions and assessed several cognitive and metacognitive measures, including the probability of recall and the latency to generate a response. Their norms have been used widely in research investigating the processes involved in long-term memory (e.g., Fazio & Marsh, 2008; Hanley, 2011; Marsh, Meade, & Roediger, 2003; Murayama & Kuhbandner, 2011; Woltz, Gardner, & Gyll, 2000), those involved in metacognition (e.g., Barber, Rajaram, & Marsh, 2008; Metcalfe & Finn, 2011; Nelson, Leonesio, Shimamura, Landwehr, & Narens, 1982; Schwartz, 2001; Singer & Tiede, 2008; Weinstein & Roediger, 2010; Winne & Muis, 2011), and how all of these processes change over the lifespan (e.g., Cosentino, Metcalfe, Butterfield, & Stern, 2007; Dodson, Bawa, & Krueger, 2007; MacDonald, Dixon, Cohen, & Hazlitt, 2004; Mansueti, de Frias, Bub, & Dixon, 2008; Marsh, Balota, & Roediger, 2005; Metcalfe & Finn, 2012). Given the continued reliance on these measures, our primary goals were to (a) update the 1980 norms and assess shifts in knowledge that have occurred over the 32 years since they were collected, and (b) expand the 1980 norms by providing new measures for the questions.

The three-decade gap warrants new data collection in part because knowledge is likely to have shifted since 1980. For example, in 1980 most people knew the answer to the question “What is the name of the Lone Ranger’s Indian sidekick?” (Tonto; 1980 probability of recall = .87). Given that the Lone Ranger was most popular in the early to mid-1900s, it is reasonable to expect that people in 2012 would be less likely to know that Tonto was the Lone Ranger’s sidekick. Conversely, for other information, generational differences may lead to greater knowledge in 2012 than in 1980. For instance, in 1980 very few people knew the answer to the question “Of which country is Baghdad the capital?” (1980 probability of recall = .07). Iraq, the correct response, has been in the news a great deal since the 1990s, most notably in reference to the Iraq war, which began in 2003. Given such generational shifts, researchers should not rely on the 1980 norms.

We also identified several questions from the 1980 norms that were either outdated or incorrect. For instance, consider the question “What is the name of the company that produces Baby Ruth candy bars?” In 1980, the correct response was Curtiss, but Nestle now produces this candy bar. As an example of incorrect information, consider the question “What is the last name of the first American author to win the Nobel Prize for literature?” The 1980 norms listed the answer as “Henry,” but the correct answer is “Lewis.”

In addition to updating the 1980 norms and correcting such errors, we also wished to provide three new measures: confidence judgments, peer judgments, and commission errors. We will discuss each in turn. After answering each question, participants made a confidence judgment about the likelihood (0 %–100 %) that the response was correct. Confidence judgments provide an index of how certain people are that their knowledge is correct. To foreshadow, in some cases, people’s confidence was in line with their knowledge: that is, high confidence for correct responses and low confidence for incorrect responses. However, for other questions, people’s confidence was a poor indicator of knowledge. For decades, research has investigated illusions of memory such as confidently held false memories. Such research has commonly investigated false memories that have been created from word lists (e.g., Anastasi, Rhodes, & Burns, 2000; Gallo, 2010; Roediger & McDermott, 1995), produced by misinformation (e.g., Frenda, Nichols, & Loftus, 2011; Loftus & Hoffman, 1989), or implanted into one’s past (e.g., Loftus, 1997). The expanded measures reported here provide researchers with information that will facilitate research on long-term false memories for general knowledge information by identifying questions with low probabilities of recall and high levels of confidence in errors.

We also created a new metacognitive measure, peer judgments, by having people predict how many peers would correctly answer each question. This measure provides normative information about people’s sense of how their own knowledge compares with that of others, which will be useful within a number of research areas. For instance, researchers have increasingly focused on metacognitive judgments made for oneself, in comparison with judgments for another person (e.g., Kelley & Jacoby, 1996; Koriat & Ackerman, 2010; Nickerson, 1999). Such research has focused on the influence of one’s subjective experience on metacognitive processes, and on the relationship between predictions made for oneself as compared with predictions made for someone else. The peer judgments reported here provide an index of the latter relationship. To preview, for some questions the confidence and peer judgments were strongly associated (e.g., Pearson’s r > .80), whereas for other questions the measures were weakly associated (e.g., Pearson’s r < .30). Thus, researchers can identify subsets of questions for which confidence and peer judgments are highly correlated, and other subsets for which the two measures are uncorrelated or weakly correlated.

Perhaps most importantly, we have provided a detailed assessment of the kinds of errors that people made when answering each question. In particular, we focused on commission errors by providing the most commonly reported incorrect responses and confidence in these errors. Information about commissions and normative confidence in these errors may support inquiry in numerous areas, such as research investigating false memories and error correction. Concerning the latter area, researchers are currently investigating how people correct errors (e.g., Butler, Fazio, & Marsh, 2011; Butterfield & Metcalfe, 2001). To do so, the researchers use items for which people generate incorrect responses that are held with high confidence, as compared with errors held with low confidence. The commission errors reported here provide normative values for this kind of information by identifying questions for which commission errors are both frequent and associated with differing levels of confidence.

Method

Participants, materials, and procedure

Six hundred seventy-one students (age: M = 20.1, SE = 0.14, n = 639 who reported age; 443 females, 207 males, n = 650 who reported gender) from two large state universities participated in exchange for course credit. The participants were recruited from introductory psychology courses at Kent State University (KSU; n = 636, who provided 51,245 responses included in the probability of recall), as well as from introduction to psychology and introduction to research methods courses at Colorado State University (CSU; n = 35, who provided 1,628 responses included in the probability of recall). This sample was selected for comparison purposes with the student sample involved in the original 1980 norms. However, future research may profit by targeting a broader sample of participants (e.g., community samples) to evaluate the degree to which the 2012 norms extend to more diverse samples.

The materials were identical to those used in the Nelson and Narens (1980) norms. These questions tapped general knowledge in several domains, including art, body and health, entertainment, games, geography, history, literature, science and nature, and sports (see Table A1). The procedure was similar to that employed for the 1980 norms, with a few exceptions. The odd-numbered questions from the 1980 norms were assigned to Set A, and the even-numbered questions were assigned to Set B (150 questions per set). Participants were randomly assigned to Set A or B. A small subset of participants (n = 56) completed one set and at least a portion of the second set of questions. For both sets, the entire experiment was computerized, and the questions were randomly ordered anew for each participant. Questions were presented in uppercase letters, one at a time in the center of the screen. Participants were given the following instructions:

This experiment will test your knowledge about general kinds of information. You will be asked a question, and you will type your best guess in the space provided below the question. The answer will always consist of exactly one word and will never be longer than one word. As each question is presented search your memory hard in an attempt to find the answer to the question. When you have determined the answer there is no need to waste extra time before typing in your response. However, if you do not locate the answer immediately, give yourself a chance to find it by searching your memory a bit more. Some questions may take a while before you locate an answer. The questions vary greatly in difficulty such that you will probably be able to answer some easily, while others will be harder, and still others you may not know at all. If, after searching your memory, you are sure you don’t know the answer, then type the word “NEXT” and proceed to the next question. Please never leave an answer space blank; for every answer space, either fill in “NEXT” or fill in your best guess. However, there is no penalty for guessing, and incorrect answers will not affect your score, so be sure to take a guess, if you can, whenever you are unsure. We are only concerned that you get as many questions correct as you possibly can.

For each response, the total amount of time to read the question, search memory, and type a response was measured as an index of response latency. Following each question, participants made a confidence judgment on a scale from 0 to 100, indicating their level of confidence that their answer was correct, such that 0 % indicated not confident and 100 % indicated absolutely confident. After the confidence judgment, participants made a peer judgment, with the following instructions:

Make a judgment about how well you think other people would do with that question. Specifically, if we gave the question to 100 students who were about your age, how many of them do you think would be able to answer the question correctly?

Participants made their judgments on a scale from 0 to 100 people, where 0 indicated that no other students would get the question correct, and 100 indicated that all 100 students would get the question correct.

The questions in most cases were administered after participants had finished an unrelated experimental protocol. Given that participants took different amounts of time to finish the initial protocol, they had different amounts of time remaining (given that they participated for a fixed amount of time overall) to answer questions and make their judgments. Also, with the time remaining, a participant might not have completed all of the questions in his or her assigned set. Thus, the number of observations varied per question. However, participants completed all responses (i.e., recall attempt, confidence judgment, and peer judgment) for each question that they attempted, so no attempted question contributed partial data. Confidence judgments were not measured for omission errors (i.e., typing NEXT), because confidence judgments for omissions should be 0, and hence are uninteresting. However, for omission errors participants did provide a peer judgment, so more observations were typically available for peer judgments than for confidence judgments. Thus, given that the numbers of observations per question and per measure could vary for principled reasons, Table A1 includes the number of responses that contributed to each measure.

Data scoring

As with the 1980 norms, the data were scored so that answers were considered correct if the first four letters of the response matched the correct answer. Given our interest in commission errors, the data were then also hand scored by the first author to adjust for spelling. The vast majority of the hand scoring was objective (e.g., WHODINI for HOUDINI, TERANCHALA for TARANTULA, and OSTERAGE for OSTRICH were all considered correct), but in the rare cases in which scoring was uncertain, the responses were discussed and resolved by the first three authors. Hand scoring also allowed us to adjust for cases in which the first four letters matched both a correct and an incorrect response (e.g., for Question 83, METEORITES shares its first four letters with the correct response, METEORS, but was scored as incorrect).
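To make the automated rule concrete, the following minimal Python sketch implements four-letter prefix scoring (the function name and test cases are illustrative; the published norms do not specify an implementation):

def prefix_match(response: str, answer: str, n_letters: int = 4) -> bool:
    # Score a response as correct if its first four letters match
    # those of the correct answer (the automated scoring rule).
    response = response.strip().upper()
    answer = answer.strip().upper()
    return response[:n_letters] == answer[:n_letters]

# The rule catches exact and near-exact answers...
assert prefix_match("houdini", "HOUDINI")
# ...but misspellings such as WHODINI fail it and required hand scoring,
assert not prefix_match("WHODINI", "HOUDINI")
# and METEORITES shares its first four letters with METEORS
# (Question 83), so hand scoring had to override the automated rule.
assert prefix_match("METEORITES", "METEORS")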

Results and discussion

The results were collapsed across testing locations (i.e., KSU and CSU) because the Spearman correlation (ρ) between the rank-ordered probabilities of correct recall across questions was high (.94, p < .001), indicating that the rank ordering of question difficulty was largely consistent across the two locations. On average, participants answered 79.0 (SE = 2.3) questions (range: 1–300), provided confidence judgments for 34.5 (SE = 1.1) questions (range: 1–214), and provided peer judgments for 80.6 (SE = 2.3) questions (range: 1–300). The total number of responses per question ranged from 126 to 232. Although this is somewhat lower than the total number of responses per item reported in the original norms (N = 270), the number of responses per item in the present norms is nonetheless sufficient for establishing normative patterns of performance.
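As an illustration of this cross-site consistency check, the following sketch computes Spearman’s ρ over per-question recall proportions, assuming the proportions have been aligned by question (the arrays here are hypothetical; the actual analysis used the full set of questions):

from scipy.stats import spearmanr

# Hypothetical probabilities of correct recall per question at each
# site, aligned so that index i refers to the same question.
ksu_recall = [0.93, 0.87, 0.55, 0.12, 0.02]
csu_recall = [0.95, 0.82, 0.60, 0.10, 0.04]

# spearmanr rank-transforms both vectors internally.
rho, p = spearmanr(ksu_recall, csu_recall)
print(f"Spearman rho = {rho:.2f}, p = {p:.3f}")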

The results for each question are presented in Table A1. This table contains normative data for 299 of the original 300 questions (one question—“What is the capital of Czechoslovakia?”—was dropped because it was outdated and had no correct answer). Questions are reported in rank order from the highest to the lowest probability of correct recall. For purposes of comparison, the 1980 rank order is provided as well. Most importantly, for each question Table A1 provides (a) the probability of recall, (b) the mean response latencies (in seconds) for correct and incorrect responses, (c) the proportions of commission and omission errors, (d) the mean confidence judgment across all nonomission responses, as well as the mean confidence judgment computed separately for correct responses and commission errors, (e) the mean peer judgment, and (f) the correlation between the confidence and peer judgments. With reference to point (b), response latencies for incorrect responses are provided only for commission errors, and not for omission errors.

As is detailed in the Method section, the number of responses varied for each measure. For example, when participants omitted an answer, they did not make a confidence judgment, but they did make a peer judgment. The number of responses also varied slightly due to participants’ typing errors. For example, given that participants were instructed (and reminded) that answers were one-word responses, if a participant typed a number (e.g., “5,” “7,” or “12”), the response was treated as a typing error (as if the participant had accidentally advanced past the question without responding) and did not contribute to the probability of recall. Accordingly, in Table A1, we report the number of responses that contributed to each measure.
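The following sketch shows how raw responses might be partitioned into the categories that determine each measure’s number of observations (the function and category labels are ours; the norms report only the resulting counts):

def classify_response(response: str, answer: str) -> str:
    # Typing errors are excluded from all measures; omissions contribute
    # peer judgments only; correct and commission responses contribute
    # to every measure.
    text = response.strip().upper()
    if not text or text.isdigit():   # e.g., "5" typed by mistake
        return "typing_error"
    if text == "NEXT":               # participant gave up on the question
        return "omission"
    if text[:4] == answer.strip().upper()[:4]:
        return "correct"
    return "commission"

for r in ["Iraq", "NEXT", "7", "Iran"]:
    print(r, "->", classify_response(r, "IRAQ"))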

As a coarse-grained measure of generational stability, we computed the rank-order correlation between the new norms and the original ones. The Spearman correlation (ρ) was .83 (p < .001), which suggests stability from 1980 to 2012. Even so, this value may overestimate stability, and it does not indicate that the new norms simply duplicated the original ones. For instance, consider that in 1980 the most challenging quartile of questions (the items ranked 225–300) contained only four questions that were impossible for participants to answer (i.e., the probability of correct recall was zero). By contrast, in 2012, the majority of the items (68 %, or 51 questions) in this last quartile were impossible. Thus, if researchers today wanted to select relatively difficult questions and used the original norms, they might instead be selecting questions with which many, if not all, of their college participants had no experience.
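The quartile comparison amounts to counting zero-recall items among the most difficult ranks. A sketch under an assumed data layout (a mapping from difficulty rank to probability of recall, as in Table A1):

def count_impossible(p_recall_by_rank: dict, rank_range: range) -> int:
    # Count questions in a difficulty-rank range that no participant
    # answered correctly (probability of correct recall = 0).
    return sum(1 for r in rank_range if p_recall_by_rank[r] == 0.0)

# Toy norms: 12 questions in which the 3 hardest ranks are impossible
# (the real analysis examined the most difficult quartile of Table A1).
toy_norms = {r: max(0.0, 1.0 - 0.1 * r) for r in range(1, 13)}
print(count_impossible(toy_norms, range(10, 13)))  # -> 3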

For a fine-grained measure of generational stability, we conducted an independent-samples t test for each question, comparing the current probability of recall against the normative value from the 1980 norms. The significance level was set at p < .0002 to adjust for the 299 tests. The results revealed nonsignificant differences, indicating generational stability, for 157 questions, and significant differences, indicating generational instability, for 142 questions. For the latter set, participants in 2012 had a higher probability of recall for only three questions, and a lower probability of recall for 139 questions (see Table A1). The shifts in knowledge for some questions were likely due to the content being dated, such that people in 2012 had less exposure to the information. This is most likely the case for many of the game questions, for example, “Which game uses a doubling cube?” (backgammon). In 2012, backgammon is apparently not a popular game among college-aged students.
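Because the published 1980 norms supply only a proportion and a sample size per question, one way to reproduce such a comparison is a t test from summary statistics, treating each recall attempt as a binary (0/1) score. The sketch below uses scipy’s ttest_ind_from_stats with Bernoulli standard deviations and illustrative values; this is our assumption about the computation, not the authors’ documented procedure:

import math
from scipy.stats import ttest_ind_from_stats

def bernoulli_sd(p: float, n: int) -> float:
    # Sample standard deviation of n binary recall scores with mean p.
    return math.sqrt(p * (1 - p) * n / (n - 1))

# Illustrative values for one question: 2012 sample vs. the 1980 norm.
p_2012, n_2012 = 0.10, 180
p_1980, n_1980 = 0.34, 270

t, p_val = ttest_ind_from_stats(
    mean1=p_2012, std1=bernoulli_sd(p_2012, n_2012), nobs1=n_2012,
    mean2=p_1980, std2=bernoulli_sd(p_1980, n_1980), nobs2=n_1980,
)
# Bonferroni-style criterion across the 299 tests (p < .0002).
print(f"t = {t:.2f}, p = {p_val:.5f}, significant: {p_val < .0002}")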

Table A2 provides details about commission errors, including (a) the most common commission errors, (b) how often each commission error occurred out of all of the commission responses for that question, (c) how often the commission error occurred out of all responses for that question, and (d) the mean confidence in the commission error. Some commission errors were made with high confidence. For example, participants were highly confident (M = 86.5 %) that the correct answer to “What is the name of a dried grape?” is “prune.” For other commission errors, participants were not confident. For example, they were not confident (M = 3.1 %) that “Radium” is the correct answer to “What is the last name of the scientist who discovered radium?” In this particular case, participants may have been guessing.

As noted previously, we identified questions from the 1980 norms that were either outdated or incorrect. Table A3 provides details about these questions, including (a) the 1980 correct answer and (b) the 2012 correct answer. For a final subset of questions, the answer from the 1980 norms was correct, but participants provided alternative responses that were correct as well. For example, the 1980 correct answer to the question “What is the name of an airplane without an engine?” was “glider.” This answer is still correct, of course, but participants provided two alternative responses that were also correct (“paper” and “model”).

To conclude, the update to this set of norms is essential, because much has changed since the original norms were collected. As a final example, in 1980 the question “What was the last name of the man who was the voice of Mr. Magoo?” (answer: “Backus”) was considered moderately difficult (1980 probability of recall = .341), but in 2012 this question was impossible, because none of our participants answered it correctly. Even so, the updated norms maintained variability in the question difficulty (range: .933–.000), and hence will remain a valuable research tool.