Over the past several years, there has been increasing interest in identifying techniques to improve learning and long-term recall of novel information (see Dunlosky, Rawson, Marsh, Mitchell, & Willingham, 2013, for a review), with a particular focus on the benefits of distributed practice (Cepeda, Pashler, Vul, Wixted, & Rohrer, 2006) and retrieval practice (Carpenter, Pashler, Wixted, & Vul, 2008; Karpicke & Roediger, 2008). Within this realm of inquiry, researchers have begun to study how these techniques fare in educationally valid contexts (Roediger, Agarwal, McDaniel, & McDermott, 2011; Sobel, Cepeda, & Kapler, 2011), and how they impact learning and recall across the lifespan (Balota, Duchek, & Logan, 2007; Balota, Duchek, Sergent-Marshall, & Roediger, 2006; Logan & Balota, 2008). Though different types of materials, such as novel facts, categorized items, or written passages, have been used as learning stimuli (Barber, Rajaram, & Marsh, 2008; Bäuml, Holterman, & Abel, 2014; Carpenter et al., 2008; Little, Storm, & Bjork, 2011; Rawson & Kintsch, 2005), perhaps the most popular type of stimulus has involved paired associates (Grimaldi, Pyc, & Rawson, 2010). In paired-associate learning, individuals learn a pair of items, such as two words, and then must recall one item when cued with the other.

Although many empirical studies of learning have investigated only one type of stimulus, in real-world situations people are often required to learn multiple forms of related information in the same learning context. For example, a student in a history class might need to associate information like the name of a famous battle with a date, but would also need to understand the importance of that battle and its outcome. Similarly, when visiting a foreign country, individuals would want to learn not only the primary language spoken, but also details about the country’s history, political structure, and places of interest. In both of these scenarios, different types of stimuli that are thematically or conceptually linked must be learned. In the rare case in which learning of multiple types of materials has been empirically assessed, the materials often have not been linked in any meaningful way (Bäuml et al., 2014; Carpenter et al., 2008). Therefore, in this study we aimed to provide a set of recall norms obtained from a demographically diverse sample for stimuli that are thematically linked—namely, foreign-language vocabulary (English–Swahili word pairs) and facts about the history, geography, and civics of a country (Kenya) in which Swahili is one of the official languages spoken. These norms will be useful to researchers who want to assess whether the same learning context or technique leads to memory benefits for different, but related, materials.

Often, paired-associate stimuli involve weakly associated English noun pairs, such as horse–table. An advantage to using these stimuli is that the pairs developed for a particular study are unique and unfamiliar to participants. This allows researchers to assess learning from “scratch” (Nelson & Dunlosky, 1994). Some studies have developed paired associates that involve foreign-language vocabulary learning (Grimaldi et al., 2010; Nelson & Dunlosky, 1994). There is evidence that learning proceeds differently for paired associates that involve arbitrary English noun pairs (semantic focus) than for paired associates involving foreign-language learning (phonological focus) (Papagno, Valentine, & Baddeley, 1991). Specifically, for English speakers, two English words carry different meanings that individuals must try to associate with one another. On the other hand, an English–foreign-language word pair would not benefit from trying to make semantic connections, because the words are equivalent in meaning. Instead, in order to remember the foreign-language word and how it links to the English word, people must engage in phonological encoding (Papagno et al., 1991). Some research has suggested that people may be able to use a combination of these techniques through the use of a mediator word or keyword semantically linked to the English word but phonologically linked to the foreign-language word (Pyc & Rawson, 2010; Raugh & Atkinson, 1975). For example, when trying to connect the Swahili word wingu with its English translation “cloud,” one might think of a wing being something that allows birds to fly in the clouds. In this way, “wing” is semantically linked to “cloud” but shares phonology with the Swahili translation, wingu.

Of foreign-language paired associates, one commonly used stimulus set involves Swahili–English word pairs for which recall norms were published 20 years ago by Nelson and Dunlosky (1994). They collected data from a group of college-aged participants after three study and recall trials. During the recall phase, the participants were cued with the Swahili word and asked to recall the English translation. These norms have been used extensively to investigate the influences of a variety of factors on learning and memory, including retrieval effects (Kang & Pashler, 2014; Pyc & Rawson, 2012b), spacing effects (Karpicke & Bauernschmidt, 2011; Pyc & Rawson, 2009b), feedback (Hays, Kornell, & Bjork, 2010), affect (Finn & Roediger, 2011, 2012), interference processes (Miyake, 2007), cognitive and physical exercise (Kayes, 2013), and mindfulness exercises (Bonamo, Legerski, & Thomas, 2015).

These norms have also been used to examine metacognitive judgments of learning (Jang & Nelson, 2005; Keleman, Winningham, & Weaver, 2007; Keleman, Frost, & Weaver, 2000; Krueger & Sifuentes, 2014; Pyc & Rawson, 2012a; Pyc, Rawson, & Aschenbrenner, 2014; Scheck & Nelson, 2005), including how people make decisions about when to stop studying particular items (Karpicke, 2009; Kornell & Bjork, 2008; Pyc & Rawson, 2007, 2009a) and how people choose to allocate their study time (Ariel, 2012; Dunlosky & Thiede, 1998; Krueger, 2012).

Because of the pervasive use of these stimuli, Grimaldi and colleagues (2010) recently published a set of recall norms for Lithuanian–English word pairs, in order to provide an alternative stimulus set for researchers. Again, the norms were determined using a sample of college undergraduates who were cued with the Lithuanian word and had to recall the paired English translation. In fact, most studies using these types of stimuli have assessed learning when recall was cued by the foreign-language word and the English equivalent had to be recalled (see Grimaldi et al., 2010, for a review). Although a few researchers have attempted to evaluate Swahili recall when cued with the English word (Carpenter & Olson, 2012; Kornell & Bjork, 2008), they have done so without any normative information about the recall difficulty of the Swahili words. This is an important point, because it is unclear whether recall performance for Swahili words would mirror the difficulty of English word recall. Therefore, there is a need to establish norms for these associates when the English words serve as the cues and the Swahili words as the targets.

The associative-symmetry hypothesis proposes that when individuals must form associations between two items (e.g., X and Y), as is the case with paired associates, the representation formed in memory is a holistic conjunction of the two items (see Kahana, 2002). Moreover, the hypothesis predicts that when individuals are asked to recall one item of the pair when cued with the other, retrieval performance should be strongly associated across the different cueing directions (e.g., X → Y or Y → X). Although a number of studies have shown evidence of symmetric memory for stimulus pairs (Kahana, 2002; Madan, Glaholt, & Caplan, 2010), others have revealed asymmetric performance, especially for item pairs that have been well-learned (Vaughn & Rawson, 2014). This asymmetry in recall performance has also been found with the Lithuanian–English paired associates learned to a criterion level, where recall was tested using both forward and backward cueing; better recall was found when individuals were cued with the Lithuanian word and asked to recall the English target (Vaughn & Rawson, 2011).

Research with bilingual individuals has supported the idea of asymmetric recall; early in second-language (L2) learning, the L2 is thought to be linked to the first language (L1) through lexical routes before the links with conceptual representations are established. With increased L2 proficiency, conceptual representations are developed, but the lexical links remain. This leads to stronger links from L1 than from L2 to the conceptual representations in memory, and drives asymmetric effects during translation, which is typically faster in the L2-to-L1 than in the L1-to-L2 direction (Kroll & Stewart, 1994). Likewise, Prior, MacWhinney, and Kroll (2007) published a set of translation norms for English and Spanish from highly proficient bilinguals and also found evidence of asymmetric translation effects across languages. Given these patterns of asymmetrical translation in bilingual individuals, paired-associate learning involving cueing from the familiar language (English) to the novel language (Swahili) should be more difficult than that involving cueing in the opposite direction.

We are unaware of any formally published norms in monolinguals assessing recall when individuals are cued with the English word and must recall an associated foreign-language equivalent, despite the fact that this ability is an important part of learning a new language. Therefore, we report a set of recall norms for Swahili words cued by their English equivalents, which can be used to gain a more complete picture of performance akin to early foreign-language learning, especially as it pertains to the development of translation skill from L1 to L2. Knowledge of normative recall difficulty involving cueing in this direction would also be important for studies seeking to better understand the structure of associative memory and the issues of symmetrical versus asymmetrical memory representations.

As we indicated above, learning often involves a host of different types of stimuli, such as facts, strategies, or even skills. These stimuli differ in their levels of recall difficulty and in the degrees of information that may be available at recall to retrieve associated information. This differential ease of recall can be explained by the Search of Associative Memory (SAM) model (Raaijmakers & Shiffrin, 1980, 1981, 2002), which posits that information is encoded in long-term memory in the form of memory images combining contextual information, associations with other items in memory, and information regarding various target features, such as meaning, part of speech, letters, and other relevant features. Retrieval of the target item from memory is cue-dependent—that is, the more strongly a cue or set of cues is associated with the memory image of the target item, the higher the probability of retrieving the target item from memory. When considering ease of recall for facts versus foreign-language paired associates, one would expect fact recall to be easier, because the question used as the cue would likely contain more information to strengthen activation of the target memory image and constrain memory search than the individual word cue used for paired-associate learning would contain. For example, the question “How many years constitute a single term as president?” includes a host of information to activate links to the memory image containing the answer, including its format (number), its meaning, the letters used to represent the verbal form of the answer, the numeric digit used to represent the answer, an association with the term lengths of other political figures, and so forth. On the other hand, the English cue “doctor” includes less information to activate the Swahili translation tabibu, especially given that in early learning, tabibu might not include a strong link to the concept it represents in memory (Kroll & Stewart, 1994).

Norms collected from college undergraduate students have been published for the recall of various general-knowledge facts (Nelson & Narens, 1980; Tauber, Dunlosky, Rawson, Rhodes, & Sitzman, 2013). However, many studies assessing the learning of facts have developed a novel set of items that were either piloted using a sample of college students prior to study administration (Barber et al., 2008) or administered without collecting prior normative data on recall performance (Carpenter et al., 2008). The fact that normed and pilot data have been collected using samples of college-aged individuals makes it difficult to generalize these norms to other age groups. Learning, however, is a lifelong process, and as we indicated earlier, many researchers have begun to focus on how learning changes across the lifespan (see Balota et al., 2007a, for a review; Logan & Balota, 2008).

One example of a real-world learning scenario experienced by individuals of varied backgrounds and ages is the naturalization exam to obtain U.S. citizenship. The exam requires the learning of multiple types of information, such as facts about U.S. history, government, and civics, as well as English-language vocabulary (see www.uscis.gov/citizenship). For researchers interested in studying various learning techniques for materials like these, it would be useful to develop stimulus sets involving different types of materials that are thematically related. Another critical point is that many of the published norms in the learning literature were established using college-aged participants, rather than a more varied sample in terms of age and ethnicity (Grimaldi et al., 2010; Nelson & Dunlosky, 1994; Nelson & Narens, 1980; Tauber et al., 2013).

One increasingly popular method for recruiting participants in research studies has been the use of Amazon’s Mechanical Turk (MTurk). MTurk is an online service that enables individuals to participate in experiments and surveys for reimbursement. The advantages of MTurk and other online research deployment methods include not only the ability to quickly and efficiently gather experimental data, but also the acquisition of more socio-economically and ethnically diverse samples of participants than are typically reported in empirical studies conducted in laboratory settings (see Birnbaum, 2004, for a review; Mason & Suri, 2012). A number of studies have demonstrated that the data collected online are of good quality and are comparable to empirical data collected in a laboratory, as long as expectations for the participants are made transparent and manipulation checks are included to ensure that they have complied with the instructions (Buhrmester, Kwang, & Gosling, 2011; Casler, Bickel, & Hackett, 2013; Crump, McDonnell, & Gureckis, 2013; Mason & Suri, 2012). With regard to issues concerning the collection of response time data online through the use of a platform such as Adobe Flash, administered via MTurk, research has suggested that although there may be some variability across systems, reliable response time data can be obtained. This is especially true for within-subjects designs, and there is evidence that even small condition differences that are detected in laboratory settings can be replicated online in this way (Reimers & Stewart, 2014; Simcox & Fiez, 2014). However, this technique is not recommended for studies that require brief, millisecond-level stimulus presentations.

In the present study, we gathered recall norms for the classic Nelson and Dunlosky (1994) word pairs when people were cued with the English word and had to recall its Swahili equivalent. These norms will be of interest to researchers who study learning, memory, and metacognition, as well as to individuals interested in L2 vocabulary learning. Moreover, we established additional recall norms for a set of facts about a Swahili-speaking country (Kenya) in order to provide related, but different, conceptual materials that could be used in future studies to assess learning across different types of stimuli. Furthermore, we used MTurk to recruit a demographically diverse sample of participants.

Experiment 1

In this experiment, we gathered recall norms for the Swahili–English word pairs published by Nelson and Dunlosky (1994) when the cue-and-target direction was reversed. In their original study, norms were reported for English word recall when cued with the Swahili equivalent. However, when learning a new language, individuals need to learn translation in both directions until they develop strong conceptual links to the words in the L2. Evidence from bilingual research has suggested that differential processes are engaged in translation in the different cueing directions. Thus, in assessing learning of foreign-language paired associates, it is important to gain an understanding of how learning may differ when individuals must report the foreign-language word when cued with its native-language equivalent. Therefore, in this experiment we report norms for Swahili word recall when individuals were cued with the associated English word from the pair.

Method

Participants

A total of 250 people from all over the United States were recruited for this study through MTurk. These individuals were able to read a description of the experiment and its eligibility requirements on the MTurk website. Only people who were at least 18 years of age and were currently living in the United States were eligible to participate. Those interested in completing the study then clicked a link that opened up a separate window with an Adobe Flash movie that presented the consent sheet and experimental task. Prior to completing the task, participants read and provided consent (via clicking a button) by means of an electronic informed-consent form that was approved by the Institutional Review Board at the University of Texas at El Paso. When providing consent, participants also verified that they were proficient in English. One participant who experienced computer difficulties and 31 participants who reported writing down information during the study or who were familiar with Swahili were eliminated from the analyses. Participants were asked to complete the testing session in one sitting, so we set a criterion completion time of 2 h to eliminate any participants who failed to attend to this instruction. We eliminated seven additional participants who took 2 h or more to complete the task. Thus, 211 participants (mean age = 33.06 ± 11.47 years, range = 18–67 years) contributed data that were included in the norms reported below. Of these, 202 of the participants reported that English was their first language. For the remaining nine participants, who reported a different language as their first, the languages reported included Spanish, Hindi, Vietnamese, Nepali, and Laotian. This sample was composed mainly of females, with 147 women (70 % of the sample) and 64 men. However, the participants were diverse in terms of age and ethnicity. The sample was composed of the following groups: 73 % White, 9.5 % Black, 7 % Asian, 0.5 % American Indian or native Alaskan, 1 % native Hawaiian or Pacific Islander, 7 % more than one race, and 2 % who preferred not to respond. Each participant was paid $2.50 for participation.

Stimuli

The stimuli used in the present experiment were 100 English–Swahili word pairs published by Nelson and Dunlosky (1994). We used their reported recall accuracy norms after the first recall attempt for English targets cued with their Swahili equivalents, and rank-ordered the items from least to most difficult. We then divided them evenly into two stimulus sets of 50 word pairs each; every item in an even-ranked position on the list was assigned to List A, and every odd-ranked item in the list was assigned to List B. This roughly equated difficulty between the two lists, on the basis of the cueing procedure and data obtained by Nelson and Dunlosky (1994). Participants were only assigned 50 words in order to approximate the learning-list lengths used in prior norming studies (Grimaldi et al., 2010; Nelson & Dunlosky, 1994) and to avoid fatigue. We also created six additional English–Swahili word pairs that were not published in the original set of norms, to serve as buffer items in this experiment. These buffers included the following items: soil–udongo, fish–samaki, flag–bendera, apple–tufaha, potato–kiazi, and pants–suruali.
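The list-construction step described above can be sketched in a few lines of Python. This is a minimal illustration, not the procedure's actual implementation: the function name `split_by_rank` and the demonstration items and accuracy values are hypothetical, and the sketch assumes that higher prior recall accuracy indicates an easier item.

```python
def split_by_rank(items_with_accuracy):
    """Rank items from least to most difficult and alternate them into two lists."""
    # Assumption: higher prior recall accuracy means an easier item,
    # so sorting by accuracy in descending order runs from least to most difficult.
    ranked = sorted(items_with_accuracy, key=lambda pair: pair[1], reverse=True)
    list_a, list_b = [], []
    for rank, (item, _accuracy) in enumerate(ranked, start=1):
        # Even-ranked positions go to List A, odd-ranked positions to List B.
        (list_a if rank % 2 == 0 else list_b).append(item)
    return list_a, list_b


# Hypothetical items and accuracies, for illustration only (not the published norms).
demo = [("w1", 0.40), ("w2", 0.10), ("w3", 0.30), ("w4", 0.20)]
list_a, list_b = split_by_rank(demo)
```

Because adjacent ranks always land in different lists, the two lists end up with closely matched difficulty distributions, which is the sense in which the alternation "roughly equates" difficulty.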

Procedure

The experiment and the associated questionnaires and consent option were programmed using Adobe Flash Professional CS5. After participants had provided online consent, they were randomly assigned to study one of the two stimulus lists. Those who studied List A were designated as Group 1, and those who studied List B were designated as Group 2. With random assignment, 103 participants were assigned to Group 1, and the remaining 108 participants were assigned to Group 2.

The procedure was similar to that reported in Nelson and Dunlosky (1994). Namely, individuals were asked to study the items and then to recall them, with this process repeated three times. Participants were instructed that they would see a series of English words along with their Swahili translations on the computer screen. They were asked to learn the pairs and to pay special attention to learning the Swahili translations and spellings, because they would be asked to recall the words later. They were told they would engage in three rounds of study and recall and that they would need to complete all of the study and recall attempts within a single testing session. They were also told not to write down any information during the study. Upon completion of the experiment, participants filled out a short questionnaire asking them about demographic information, what strategy they had used during the study, and their native language. They entered their MTurk Worker ID and were given a unique alphanumeric code that they entered into a form, on the original experiment page hosted in MTurk, to verify their participation. Once verified, they received their participant payment.

During each study phase, the groups first studied the six buffer items in a fixed order to help orient them to the task. Performance for these buffer items was not included in the analyses. Participants then studied the 50 experimental items from their assigned list. Each English–Swahili word pair appeared at the center of the computer screen, one at a time, and remained on the screen for 12 s, followed by a 2-s interstimulus interval before presentation of the next stimulus pair. In the present study, the participants were given longer to learn the word pairs than in the original Nelson and Dunlosky (1994) study (in which participants were given 10 s to respond), since having to recall the Swahili word was expected to be more difficult than recalling the English word from the pair. The order of the presented pairs within each study block was randomized, but the presentation order was tracked for crafting the order of items presented in the cued-recall portion of the task.

After the last study trial, the first 25 items from the study block were rerandomized and presented as the first 25 items during cued recall. Likewise, the final 25 items from the study block were rerandomized and presented during the last half of the cued-recall phase. This ensured at least a 25-item lag between the study of each pair and its subsequent cued recall. On each cued-recall trial, participants were presented with the English word and prompted to recall its Swahili translation by a question mark (e.g., doctor–?). Participants typed their responses into an answer box provided on the screen. Once participants began typing their response, a button labeled “next” appeared at the bottom of the screen. The button was hidden until the response was initiated. Participants were given 15 s to type their response before advancing to the next cued-recall trial. Once participants had completed typing in their answer, they clicked the “next” button beneath the answer box to move to the next trial. Participants could correct typing errors within the 15-s timeframe. If they failed to click the “next” button within the 15-s timeframe, they were automatically advanced to the next trial. Any information typed in the response box when the 15 s had elapsed was recorded as the answer for that trial, even if participants failed to click the “next” button. These data were tagged as “timeout” trials. After the final cued-recall trial, the 50 word pairs were randomized anew and presented, using the identical procedures, for two additional bouts of study and cued recall. The average time it took participants to complete the entire experiment was 1 h 8 min.
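The ordering constraint described above can be illustrated with a short Python sketch. The function names and the use of integer item labels are hypothetical, and the original task was programmed in Adobe Flash rather than Python; the sketch only shows why rerandomizing each half of the study order separately guarantees a minimum lag of 25 intervening items for a 50-item list.

```python
import random


def build_recall_order(study_order, rng=random):
    """Rerandomize each half of the study order separately, then concatenate."""
    half = len(study_order) // 2
    first_half = list(study_order[:half])
    second_half = list(study_order[half:])
    rng.shuffle(first_half)
    rng.shuffle(second_half)
    return first_half + second_half


def min_lag(study_order, recall_order):
    """Smallest number of intervening trials between studying a pair and its cued recall."""
    n = len(study_order)
    study_pos = {item: i for i, item in enumerate(study_order)}
    lags = []
    for j, item in enumerate(recall_order):
        i = study_pos[item]
        # Study trials remaining after this item, plus recall trials that precede it.
        lags.append((n - 1 - i) + j)
    return min(lags)


rng = random.Random(0)           # fixed seed so the illustration is reproducible
study_order = list(range(50))    # hypothetical item labels 0..49
rng.shuffle(study_order)
recall_order = build_recall_order(study_order, rng)
```

Under this scheme, an item from the first half of the study block is followed by at least the 25 second-half study trials, and an item from the second half is preceded at recall by the 25 first-half recall trials, so `min_lag` cannot fall below 25.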

Coding scheme

Recall accuracy and the mean response time from each cued-recall phase were recorded as the dependent variables in this experiment. Swahili words were considered correct only if the entire Swahili word was typed correctly with no spelling errors. The response time was recorded as the time from initial presentation of the cue during a cued-recall trial to the time when the participant pressed the “next” button or the trial advanced on its own. For timeout trials, the response time was automatically recorded as 15 s. Normed response times are only reported for correct trials; any of these trials flagged as timeout trials were also excluded from the response time analyses.
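A minimal Python sketch of this coding scheme follows. The function and field names are hypothetical, and two details are assumptions that the text does not specify: responses are compared after trimming surrounding whitespace and ignoring letter case.

```python
TIMEOUT_MS = 15_000  # response window of 15 s, as described in the procedure


def score_trial(response, target, rt_ms, timed_out):
    """Score one cued-recall trial under a strict exact-spelling criterion."""
    # Assumption: whitespace trimming and case-insensitive comparison.
    correct = response.strip().lower() == target.strip().lower()
    # Timeout trials are assigned the full 15-s window as their response time.
    recorded_rt = TIMEOUT_MS if timed_out else rt_ms
    # Response time norms use correct trials only, excluding timeout trials.
    include_rt = correct and not timed_out
    return {"correct": correct, "rt_ms": recorded_rt, "include_rt": include_rt}


def mean_correct_rt(trials):
    """Mean response time over trials that qualify for the response time norms."""
    rts = [t["rt_ms"] for t in trials if t["include_rt"]]
    return sum(rts) / len(rts) if rts else None


# Hypothetical trials, for illustration only.
trials = [
    score_trial("tabibu", "tabibu", 4200, False),   # correct, counted in RT norms
    score_trial("tabib", "tabibu", 5100, False),    # one missing letter, scored incorrect
    score_trial("tabibu", "tabibu", 15000, True),   # correct but timed out, RT excluded
]
```

Note that a correct response on a timeout trial still counts toward accuracy in this sketch; only its response time is dropped.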

Results

Recall performance

Recall accuracy norms for each item over the three recall attempts are reported in Table 1 in the Appendix. The mean response times for each item are reported in Table 2. We also list the corresponding standard deviations and standard errors of the means in each table. Information about English word length, Swahili word length, and English word frequency norms was retrieved from the English Lexicon Project Database (Balota et al., 2007b), calculated from the Hyperspace Analogue to Language (HAL) corpus (Lund & Burgess, 1996). The HAL corpus includes roughly 131 million words. We report the log-transformed HAL frequency norms for the English words in the table and used these log-transformed frequency norms in all subsequent analyses involving word frequency. Note that the items in the tables are listed in alphabetical order by English word. One may question whether nonnative English speakers would engage different processes than native speakers to form associations between English and Swahili words. We conducted paired-samples t tests to compare the average recall accuracies by items for Swahili when the nonnative English speakers were included versus excluded from the data set. We found no differences in recall accuracy on any of the recall attempts (Recall 1 p = .911, Recall 2 p = .075, Recall 3 p = .371), suggesting that the norms acquired in the present study did not markedly change as a result of the inclusion of nonnative English speakers.

We conducted repeated measures analyses of variance (ANOVAs) on the mean accuracies and response times by items to determine whether these measures significantly changed for each item over the three recall attempts. In cases in which the sphericity assumption was violated, we also report the Huynh–Feldt correction for degrees of freedom. We found a main effect of recall attempt for recall accuracy, F(2, 167) = 1,180.82, p < .001, $\eta_p^2 = .923$, with accuracy improving over each new recall attempt. The mean accuracy scores and the corresponding standard deviations were as follows: $M_{\text{Recall 1}} = .24$ (SD = .10), $M_{\text{Recall 2}} = .42$ (SD = .13), and $M_{\text{Recall 3}} = .53$ (SD = .13). Response times also became significantly faster across recall attempts, F(2, 165) = 459.02, p < .001, $\eta_p^2 = .823$, with $M_{\text{Recall 1}}$ = 8,205 ms (SD = 1,062), $M_{\text{Recall 2}}$ = 7,020 ms (SD = 918), and $M_{\text{Recall 3}}$ = 6,215 ms (SD = 858). These results demonstrate performance improvements for the items in both accuracy and response time across recall attempts. Importantly, recall accuracy was clearly not at ceiling even after the third recall.

Akin to Nelson and Dunlosky (1994) and Grimaldi and colleagues (2010), we conducted a set of correlations by items to determine whether the accuracy performance on subsequent recall attempts for each item was correlated with accuracy on earlier attempts, which would suggest that the distributions of difficulty by items were similar across recall attempts. Additionally, we evaluated whether similar correlations were found for response times. For both accuracy and response time, we found a significant correlation between performance at Recall 1 and Recall 2 (accuracy: r = .89, p < .001, N = 100, 95 % confidence interval [CI] = [.84, .93]; response time: r = .76, p < .001, N = 100, 95 % CI = [.66, .83]). Significant relationships were also found between these measures when comparing Recall 2 to Recall 3 performance (accuracy: r = .94, p < .001, N = 100, 95 % CI = [.91, .96]; response time: r = .85, p < .001, N = 100, 95 % CI = [.79, .90]). The correlations we achieved for accuracy are similar to those reported by Nelson and Dunlosky (1994: r = .91 for Recall 1 vs. Recall 2; r = .95 for Recall 2 vs. Recall 3). These consistently significant relationships indicate the stability of item difficulties across the three recall attempts.
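The text does not state how the confidence intervals for these correlations were computed. One common approach, assumed here purely for illustration, is the Fisher z transformation; the function name `pearson_ci` is hypothetical.

```python
import math
from statistics import NormalDist


def pearson_ci(r, n, level=0.95):
    """Approximate confidence interval for Pearson's r via the Fisher z transformation."""
    z = math.atanh(r)                                # Fisher z transform of r
    se = 1.0 / math.sqrt(n - 3)                      # standard error of z
    crit = NormalDist().inv_cdf(0.5 + level / 2.0)   # about 1.96 for a 95 % interval
    return math.tanh(z - crit * se), math.tanh(z + crit * se)


lo, hi = pearson_ci(0.76, 100)
print(round(lo, 2), round(hi, 2))
```

For r = .76 and N = 100, this yields an interval of approximately [.66, .83], in line with the interval reported above for response time at Recall 1 versus Recall 2. Because the reported correlations are rounded to two decimals, bounds recomputed this way can differ from the published ones in the second decimal place.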

Relationship to prior norms

We have argued that the ease of recall for a Swahili word when cued by its English translation may not be equivalent to the ease of recall of the English word when cueing is in the opposite direction. One coarse method to explore whether the recall difficulty of stimulus items is distributed similarly across the two cueing directions is to evaluate whether there is a correlation between the vocabulary norms reported in the Nelson and Dunlosky (1994) study and those found in the present study. We performed correlations on the mean accuracy scores across the two studies for each recall attempt. Although the recall norms were not perfectly correlated, we did find positive relationships between the by-item accuracy scores at each recall attempt (Recall 1: r = .56, p < .001, N = 100, 95 % CI = [.41, .68], Recall 2: r = .63, p < .001, N = 100, 95 % CI = [.50, .74], Recall 3: r = .58, p < .001, N = 100, 95 % CI = [.44, .70]). This suggests that although the difficulty levels across the different cueing directions in the two studies are related, there are also differences in recall when individuals are cued in different directions. We have included scatterplots of these correlations in Fig. 1 to illustrate these relationships.

Fig. 1. Scatterplots showing the relationship between the recall accuracy norms for English members of Swahili–English word pairs reported by Nelson and Dunlosky (1994) and for the Swahili members of the word pairs reported in the present experiment, at (A) Recall Attempt 1, (B) Recall Attempt 2, and (C) Recall Attempt 3. Pearson’s r is shown for each correlation on the associated scatterplot.

In order to determine whether recall in the present study was generally better or worse than that reported in Nelson and Dunlosky (1994), we also conducted a set of t tests comparing the accuracy norms averaged across items for each recall attempt in the two studies. For the first recall attempt, performance was better in our study when individuals were asked to recall the Swahili word (M = .24) than in the prior norming study, when they were asked to recall the English word (M = .14), $t_{\text{Recall 1}}(99) = -10.48$, p < .001. We found no difference in overall mean accuracies at the second recall attempt across the two studies: Mean recall accuracy was .42 in both, $t_{\text{Recall 2}}(99) = .41$, p = .683. The pattern for Recall 3 was opposite that of Recall 1; here the performance was better in the Nelson and Dunlosky study (M = .63) than in the present study (M = .53), $t_{\text{Recall 3}}(99) = 8.41$, p < .001. This suggests that although some distribution of difficulty may be similar across the different cueing directions, learning may actually progress at different rates in these two situations for foreign-language paired associates.
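The degrees of freedom reported above (99 for 100 items) are consistent with a paired comparison by items. A minimal sketch of such a test is given below; the function name `paired_t` and the accuracy values are hypothetical and are not taken from either set of norms.

```python
import math
from statistics import mean, stdev


def paired_t(x, y):
    """Paired-samples t statistic and degrees of freedom for item-level scores x and y."""
    if len(x) != len(y):
        raise ValueError("x and y must contain one score per item, in the same order")
    diffs = [a - b for a, b in zip(x, y)]
    n = len(diffs)
    # t is the mean difference divided by its standard error.
    t = mean(diffs) / (stdev(diffs) / math.sqrt(n))
    return t, n - 1


# Hypothetical by-item accuracies, for illustration only (not the published norms).
prior = [0.10, 0.20, 0.30, 0.40]
present = [0.20, 0.40, 0.50, 0.80]
t_stat, df = paired_t(prior, present)
```

The sign of t depends on the order of subtraction: with prior-minus-present differences, as in this sketch, a negative t indicates higher accuracy in the present data.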

Factor structure

We conducted exploratory factor analyses (EFAs) on recall accuracy at the first recall attempt to determine whether any latent constructs were driving the relationships between the different English–Swahili word pairs. Because we gathered recall data for each half of our set of word pairs from two different participant groups, we had to perform a separate EFA for each group (i.e., Group 1 and Group 2). The EFAs did not reveal any meaningful factor structures (as evidenced by the large numbers of factors generated and items loading in nonmeaningful ways). For those who are interested in exploring the correlational structure between the vocabulary items, we have included the correlation matrices for each of these groups in Tables 7 and 8 of the Appendix.

Influence of item characteristics

Though the EFAs did not convey any meaningful information, the words comprising the English–Swahili pairs differed on potentially meaningful characteristics that might have influenced recall accuracy. These characteristics are reported in Table 1, and include (a) English word length, (b) Swahili word length, and (c) English word frequency, based on the log HAL word frequency norms (Lund & Burgess, 1996). We evaluated the relationships between these characteristics and recall accuracy and response time at each recall attempt using Pearson correlations, adopting a criterion of p < .005 to correct for the nine comparisons evaluated for each dependent measure. Swahili word length was significantly negatively correlated with recall accuracy, and positively correlated with response time (Recall 1: r_Acc = –.62, p < .001, N = 100, 95 % CI = [–.72, –.48], r_RT = .68, p < .001, N = 100, 95 % CI = [.56, .78]; Recall 2: r_Acc = –.69, p < .001, N = 100, 95 % CI = [–.78, –.57], r_RT = .77, p < .001, N = 100, 95 % CI = [.67, .84]; Recall 3: r_Acc = –.71, p < .001, N = 100, 95 % CI = [–.79, –.60], r_RT = .785, p < .001, N = 100, 95 % CI = [.70, .85]). This suggests that longer Swahili words were associated with slower, less accurate responses. No correlations with English word length or English word frequency emerged after the correction for multiple comparisons.

To determine the driving elements behind the correlations of our norms and the norms published by Nelson and Dunlosky (1994), we conducted a two-step hierarchical regression analysis for each of the three recall attempts. In each analysis, we regressed our norms on the previously published norms (entered in Step 1) and the item characteristics (English word length, Swahili word length, and log HAL word frequency, entered in Step 2). We found that for all three recall attempts, the addition of the item characteristics in Step 2 of the hierarchy explained significant additional variance in our norms, over and above what was captured by the previously published norms [Recall 1: F change(3, 95) = 19.443, p < .001, R² change = .261; Recall 2: F change(3, 95) = 38.977, p < .001, R² change = .330; Recall 3: F change(3, 95) = 39.349, p < .001, R² change = .366]. The total model was also significant for all three recall attempts [Recall 1: F(4, 95) = 32.192, p < .001, total R² = .575; Recall 2: F(4, 95) = 64.749, p < .001, total R² = .732; Recall 3: F(4, 95) = 56.945, p < .001, total R² = .706]. For all three recall attempts, Swahili word length was a significant predictor of recall accuracy in the present norms. Each 1-standardized-unit increase in Swahili word length was associated with a decrease of about 0.5 to 0.6 standardized units in recall accuracy (the beta coefficient varied slightly across recall attempts): Recall 1: t = –6.908, p < .001, beta = –.504; Recall 2: t = –9.755, p < .001, beta = –.565; Recall 3: t = –9.925, p < .001, beta = –.599. These findings suggest that the longer the Swahili word to be learned, the less accurate participants were when they recalled that Swahili word.
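The F-change values in a hierarchical regression can be recovered from the reported R² values with the standard formula, F change = (R² change / k) / ((1 − total R²) / residual df), where k is the number of predictors added. The sketch below applies it to the Recall 1 values; the function name `f_change` is illustrative.

```python
def f_change(r2_change, r2_full, k_added, df_resid):
    """F statistic for the R-squared increase when k_added predictors enter."""
    return (r2_change / k_added) / ((1.0 - r2_full) / df_resid)


# Recall 1 values reported above: R-squared change = .261, total R-squared = .575,
# 3 predictors added in Step 2, 95 residual degrees of freedom
print(round(f_change(0.261, 0.575, 3, 95), 2))
```

This gives approximately 19.45, close to the reported 19.443; a gap of this size is what one would expect when working from R² values rounded to three decimals.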

Strategy effects

We asked participants to report any strategies they used to try to remember the associations, since recalling the Swahili words was expected to be challenging. The strategies, as well as the percentages of participants who reported using these strategies, were as follows: (a) mental imagery/pictures = 7.6 %, (b) repetition = 17.1 %, (c) word association/mediator use = 28.4 %, (d) crafting a sentence = 6.2 %, (e) other/nonclassifiable = 16.6 %, (f) use of multiple strategies = 17.5 %, and (g) no strategy used or reported = 6.6 %. Figure 2 shows the mean proportions of accurate recall for participants who used each strategy at each recall attempt. We conducted a mixed-effects ANOVA with Strategy as the between-subjects factor and Recall as the within-subjects factor, to examine the effects of strategy use on recall accuracy. In particular, we were interested in whether we would find a main effect of strategy type or an interaction between strategy type and recall attempt. Sphericity was violated, so the Huynh–Feldt-corrected p values and degrees of freedom are reported. The expected main effect of recall attempt emerged, F(1, 300) = 188.14, p < .001, η_p² = .480, due to improved accuracy over the course of the three attempts, but there was no main effect of strategy, F(6, 204) = 0.544, p = .774. However, we did find a significant interaction between recall attempt and strategy type, F(9, 300) = 2.59, p = .007, η_p² = .074; the impact of each strategy differed depending on the recall attempt. We conducted a set of follow-up ANOVAs within each recall attempt to assess which strategy differences might have driven the interaction. None of these follow-up tests revealed a significant effect of strategy (Recall 1 p = .983, Recall 2 p = .794, Recall 3 p = .129). Figure 2 suggests that strategy differences started to emerge at the final recall attempt, and on the basis of the ordering of mean recall accuracy, crafting a sentence appeared to lead to the best recall during this attempt. Post-hoc pairwise independent t tests (evaluated at a criterion of p < .002, to correct for the 21 comparisons) were conducted between all of the reported strategies at the final recall attempt, to determine whether any performance differences emerged. However, none of the t tests between different strategies reached significance.
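The count of 21 comparisons follows from pairing the seven strategy categories, and the criterion is in line with a Bonferroni-style correction. The sketch below assumes a family-wise alpha of .05 (the text does not state the correction method by name); the function name is illustrative.

```python
from math import comb


def bonferroni_alpha(n_comparisons, alpha=0.05):
    """Per-test significance threshold under a Bonferroni correction."""
    return alpha / n_comparisons


n_strategies = 7                      # strategy categories (a) through (g)
n_pairs = comb(n_strategies, 2)       # number of unique pairwise comparisons
threshold = bonferroni_alpha(n_pairs)
print(n_pairs, round(threshold, 4))
```

Under these assumptions the threshold is about .0024, so the reported criterion of p < .002 is slightly more conservative, presumably the result of rounding down.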

Fig. 2. Graph showing mean proportions of Swahili words recalled at each recall attempt during the vocabulary task as a function of participants’ reported strategies.

Age differences

Our sample of participants included individuals across a wide age range. In an effort to determine whether age led to any differences in recall accuracy for these paired-associate items, we conducted a mixed analysis of covariance (ANCOVA) within each norming group (Group 1 and Group 2) using Recall Attempt and Item as our within-subjects factors and age as a continuous variable. In particular, we were looking for main effects of age or any interactions with age. For Group 1, neither the main effect of age, p = .383, nor any of the interactions with age (all ps > .10) reached significance. For Group 2, we found a significant interaction between age and item, F(43, 4031) = 1.56, p = .011, η_p² = .016, but the main effect of age and all other interactions with age did not reach significance (all ps > .18). An independent-samples t test revealed no difference in average age for the two norming groups (M_Group1 = 34.25, M_Group2 = 31.92). To further explore the age differences in item recall, we compared recall accuracies for the oldest third (at and above the 66th age percentile) and the youngest third (at and below the 33rd age percentile) of our sample. The youngest third comprised participants 25 years of age and younger (range = 18–25 years), whereas the oldest third comprised participants 36 years of age and older (range = 36–67 years). We conducted omnibus independent-samples t tests comparing the overall mean accuracies across items for each age group at each recall attempt and found significant differences between these groups. For Recall 1, the overall accuracy mean collapsed across items was better for the youngest (M = .298) than for the oldest (M = .181) group, t_Recall1(115) = 2.62, p = .010. For Recall 2, the overall accuracy mean was also better for the youngest (M = .470) than for the oldest (M = .349) group, t_Recall2(124) = 2.49, p = .014. The comparison for Recall 3 was not significant, p = .123. We then conducted a series of independent t tests comparing the accuracy norms for each item pair for these age groups using a corrected p value of .0005, to adjust for the 100 tests within each recall attempt. Five pairs were significantly different across age groups at Recall 1, three pairs at Recall 2, and none at Recall 3. In many other cases, the performance for the youngest age group appeared clearly better than that for the oldest group, but the difference did not survive the correction for multiple comparisons. We have included supplementary tables in the Appendix reporting the recall accuracy norms for the youngest third (Table 3) and the oldest third (Table 4) of participants. Although we recognize that the age cutoff for our oldest group does not conform to a typical cutoff for older adults, we do think these norms may be of interest to individuals working with samples consisting of adults other than college students. We do not include a table reporting response times separated by age groups, because for some items within a group there were too few correct responses (in some cases, no correct responses) to calculate stable estimates of response time.
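The extreme-thirds split described above can be sketched as follows. The nearest-rank percentile rule and the function names are assumptions made for illustration, because the text does not specify how the percentile values were computed; the ages are hypothetical.

```python
import math


def nearest_rank_percentile(values, pct):
    """Value at the given percentile (0-100) using the nearest-rank method."""
    ordered = sorted(values)
    rank = max(1, math.ceil(pct / 100.0 * len(ordered)))
    return ordered[rank - 1]


def extreme_thirds(ages):
    """Return (youngest, oldest) subsets split at the 33rd and 66th percentiles."""
    low_cut = nearest_rank_percentile(ages, 33)
    high_cut = nearest_rank_percentile(ages, 66)
    youngest = [a for a in ages if a <= low_cut]   # at and below the 33rd percentile
    oldest = [a for a in ages if a >= high_cut]    # at and above the 66th percentile
    return youngest, oldest


# Hypothetical ages, not the study sample
ages = [18, 19, 20, 21, 22, 23, 24, 25, 26]
youngest, oldest = extreme_thirds(ages)
print(youngest, oldest)
```

Because the cutoffs are inclusive, the two groups need not be exactly equal in size when several participants share a cutoff age.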

Discussion

These norms for Swahili word recall, along with those of Nelson and Dunlosky (1994), offer useful tools for researchers interested in a variety of topics related to learning, metacognition, associative memory, and the associative mechanisms of foreign-language learning in both directions of translation. Although the majority of studies have used the stimuli in this experiment to assess English recall when cued by associated Swahili translations, the norms presented in this experiment will enable researchers to examine learning scenarios in which the less familiar foreign-language item must be recalled. We demonstrated that participants improved their performance for these items across different recall attempts and that recall difficulty was stable across recall attempts within our set of norms. A comparison of our norms to those of Nelson and Dunlosky (1994) suggests some similarity in the distributions of recall difficulty, but not a one-to-one relationship.

Interestingly, we found no meaningful underlying latent constructs organizing the word pairs. However, we were unable to examine the factor structure across all items simultaneously, since the items were split between two separate groups. Thus, one important future direction will be to examine recall for all of these items within a single set of participants, to better assess the factor structures among all 100 items. When assessing the a priori item characteristics of the word pairs, we failed to find a relationship between English word frequency and Swahili recall, despite evidence that English word frequency has been related to English word recall performance in prior paired-associate studies using English words for both the cue and the target (Criss, Aue, & Smith, 2011; Madan et al., 2010) and studies using foreign-language paired associates with English words as the targets (Grimaldi et al., 2010; Nelson & Dunlosky, 1994). We did, however, find that Swahili word length was related to Swahili recall, in that the longer Swahili words were more difficult to recall than shorter words and that Swahili word length accounted for a significant amount of variance in our norms beyond that explained by Nelson and Dunlosky’s (1994) prior norms. This suggests that for these types of stimuli, recall in the different cueing directions may be asymmetric and mediated by different influences. For the present set of norms, since individuals knew that they had a limited amount of time to study each pair and that they would be required to recall the Swahili word, they may have focused more attention on this target item during study than on the cue word. This is further supported by the lack of English word length effects on recall accuracy. Given that the present set of norms required participants to spell items correctly, it may be interesting for future studies to classify the Swahili words on the basis of the ease with which they can be spelled by native English speakers. These data would then allow for a more comprehensive assessment of the processes contributing to recall difficulty for Swahili vocabulary words.

When evaluating the influence of the strategies used by participants to recall items, we found that the benefit of each strategy depended on the recall attempt. Although no single strategy led to significantly better performance than the others within the individual attempts, it does appear that at the last recall attempt there was some separation of performance due to different strategies that might emerge more clearly in the face of additional recall attempts. Despite the inability to identify a specific strategy that was more beneficial than the others with the present data set, it is useful to be aware of the types and number of self-selected strategies that participants reported during paired-associate learning. It would be helpful to evaluate strategy selection more systematically, perhaps on an item-by-item basis, in future work, to gain a better understanding of which strategies ultimately lead to the most efficient and stable learning.

We found some age-related differences in recall accuracy and have provided tables of recall norms from the oldest third and youngest third of our sample that will be useful for researchers who are interested in samples consisting of individuals older than typical college students. However, we caution that researchers should be aware of the arbitrary age cutoffs that distinguish these two groups in our study and be aware that our older group does not constitute a typical “older adult” sample. In an effort to develop a set of items that could be used to evaluate learning for materials different from but related to English–Swahili word pairs, we also evaluated recall norms for a set of facts about Kenya.

Experiment 2

In real-world learning situations, individuals often must learn a variety of information types. For example, when an individual moves to a foreign country he or she will likely need to learn not only the language of that country, but also important laws and customs. In some cases, this information is critical for obtaining employment and succeeding in a job, navigating interpersonal situations, and, as is the case in the United States, obtaining citizenship. For the present experiment, we crafted a set of facts about a foreign country (Kenya) in which Swahili is one of the official languages. The facts may serve as companion stimuli to the Swahili–English word pairs, but they could also be used independently in studies of learning. The facts include information about the history, civics, and government of Kenya. Notably, recall performance for these items was obtained from a sample with a diverse range of ages and ethnicities, to ensure that these norms would be more generalizable than those reported in prior studies in which the norms had been obtained from a sample of undergraduate students (Barber et al., 2008; Nelson & Narens, 1980; Tauber et al., 2013).

Method

Participants

A total of 230 individuals were recruited from MTurk. As with Experiment 1, individuals could read a description of the experiment hosted on MTurk; they were eligible if they were at least 18 years of age, currently lived in the United States, and were proficient in English. Interested participants clicked a link that opened a separate window showing the Adobe Flash movie of the experiment. They had to read and agree to the electronic consent form before starting the experimental task. By clicking on the button on the consent page, they indicated that they were both proficient in English and willing to participate in the study. The instructions for the study asked participants to complete the experiment in one sitting. The study and consent procedures were approved by the Institutional Review Board at the University of Texas at El Paso.

Of the individuals who completed the experiment, ten were eliminated from the analyses for not following instructions. Either these participants never pressed the button to advance to the next trial after typing their answers, or their recorded start and end times for the experiment indicated that they failed to complete the experiment in a single sitting (i.e., the completion time was longer than 2 h). Eight additional people were eliminated due to computer error and corruption of their data files.

Recall norms were calculated from the data of the remaining 212 participants (M_age = 31.59 ± 11.06 years, range = 18–61 years). Of these participants, 202 reported that English was their first language. The native languages reported by the remaining ten participants included Chinese, Tagalog, Spanish, Russian, Dutch, and Azeri. The participant pool comprised mostly females, with 149 women (70 % of the sample) and 63 men. Our sample represented relatively diverse ethnic backgrounds: 77 % White, 8 % Black, 6 % Asian, 0.5 % American Indian or native Alaskan, 0.5 % native Hawaiian or Pacific Islander, 5 % more than one race, and 3 % who preferred not to respond. Participants were paid $2.50 for their involvement in the study.

Stimuli

The stimuli developed for this experiment were question–answer pairs about the history, government, geography, and civics of Kenya. These questions were developed on the basis of Web research about Kenya, evaluation of Kenya’s 2010 constitution, and consultations with individuals who had experience living and working in Kenya. Kenya was selected as the country of origin because Swahili is one of the official languages spoken in that country. Moreover, we expected that these facts would be largely unfamiliar to a typical sample of participants recruited from the United States. We created 100 question–answer pairs and then divided them evenly into two stimulus sets of 50 items each for the present study. Individuals who were assigned the first stimulus set were designated as Group 1, and individuals who were assigned the second set were Group 2. Participants were randomly assigned to either Group 1 or 2 after acknowledging their consent on the electronic consent form. With random assignment, 100 people were assigned to Group 1, and 112 people were assigned to Group 2.

We created an additional six question-and-answer pairs to serve as buffer items in this experiment. These items were as follows:

  1. On what continent is Kenya located? Africa
  2. Name one of Kenya’s most famous dishes. a) nyama choma, b) ugali
  3. Name 2 of the 3 possible classifications of land in Kenya, according to Kenya’s Constitution. a) Public, b) Community, c) Private
  4. Which judicial body rules on Constitutional matters? The High Court
  5. As of 2011 how many Constitutions has Kenya had? 2
  6. What large lake is located to Kenya’s west? Lake Victoria

Procedure

The experiment and associated questionnaire and electronic consent form were programmed using Adobe Flash Professional CS5. This experiment followed a procedure identical to the one described in Experiment 1, but with the question–answer pairs about Kenya presented instead of English–Swahili word pairs. Again, participants were given three cycles of study and cued recall. The six buffer question–answer items were included at the beginning of each study list to orient participants to the task. The processes for randomizing and presenting items during study and recall were the same as those described for Experiment 1. The timings of item presentation during study and recall were also equivalent to those aspects of the first experiment, as were the dependent measures recorded. We adopted the same timings as in Experiment 1 for several reasons. First, we wanted to limit study and recall time to prevent participants from writing down or looking up answers during the task. We also wanted to limit the overall duration of the experiment to combat fatigue. Finally, even though there was more information on the screen for participants to read than in Experiment 1, we expected that the process of associating the answer to the question for facts would be easier than associating the Swahili word with its English translation for later Swahili recall.

Participants were informed that they would see a series of facts about Kenya and would be asked to study questions paired with their answers. They would then see the questions again and would have to type in the appropriate answer into a text box on the screen. They learned that they would repeat this cycle three times and were asked not to write down any information when studying the question–answer stimuli and to complete the entire experiment in a single sitting. At the conclusion of the third cued-recall phase, participants filled out a short demographics questionnaire. They also entered their MTurk Worker ID and were given a unique alphanumeric code that they had to enter into the MTurk site to verify their participation. Once verified, they received their participant payment. The average time taken to complete the task was 1 h 12 min.
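The list structure used in each cycle (buffer items first, followed by the randomized target items) might be sketched as follows. This is an illustration only: the actual experiment was programmed in Adobe Flash, and the fixed buffer order, the fresh shuffle on every cycle, the seed, and all names here are assumptions rather than details given in the text.

```python
import random


def build_study_list(buffer_items, target_items, rng):
    """Place buffer items first, followed by the target items in random order."""
    shuffled = list(target_items)
    rng.shuffle(shuffled)              # shuffle a copy; the input stays intact
    return list(buffer_items) + shuffled


buffers = [f"buffer_{i}" for i in range(1, 7)]     # six buffer pairs
targets = [f"fact_{i}" for i in range(1, 51)]      # one 50-item stimulus set
rng = random.Random(2024)                          # arbitrary seed for the demo
cycles = [build_study_list(buffers, targets, rng) for _ in range(3)]
print(len(cycles), len(cycles[0]))
```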

Coding scheme

Once again, recall accuracy and response time (i.e., the time from presentation of the question to pressing of the “next” button) were recorded. Answers were considered correct if all words critical to the meaning of the response were included and spelled correctly. For instance, if the original answer shown during study was “the Equator” and the participant wrote “Equator” during the cued-recall phase, we considered the answer correct. We also considered correct any answers with synonym substitutions for words presented as part of the answer during the study phase, or any cases in which the phrase format was altered but the original meaning was preserved. For example, if the original answer was “presidential appointment,” we would accept “appointed by the president” as a correct answer. Likewise, we accepted number substitutions in the case of answers in which numbers had initially been written out as words. Finally, in cases in which a portion of the answer was part of the question, participants did not need to include that detail as part of their answer. For instance, participants were often asked to report a particular chapter of the constitution, but they did not need to include the word “chapter” in their answer; the relevant number was sufficient to be deemed correct. Participants did not have to worry about correct capitalization format in their answers. In cases in which participants failed to click the “next” button after making their response or the time limit of 15 s passed, the portion of the response currently contained in the response box was recorded as the answer, the trial was labeled as a “timeout” trial, and the next cued-recall trial was initiated. The response time was coded as 15 s for these timeout trials. Normed response times are only reported for correct trials, and response times for trials that were flagged as timeouts were excluded from the analyses.
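Parts of this coding scheme lend themselves to a simple automated check. The sketch below is a hypothetical illustration, not the procedure used in the study: it ignores capitalization, drops a leading article (treating "a" and "an" like "the" is an assumption), requires exact spelling, and relies on a list of accepted variants that a human coder would supply for synonyms and rephrasings.

```python
TIME_LIMIT_MS = 15_000  # 15-s response window described above


def normalize(text):
    """Lowercase, trim, and drop a leading article ("Equator" matches "the Equator")."""
    words = text.lower().split()
    if words and words[0] in {"the", "a", "an"}:   # assumption: articles are noncritical
        words = words[1:]
    return " ".join(words)


def code_trial(response, accepted_answers, rt_ms, pressed_next):
    """Return accuracy, coded response time, and timeout flag for one trial."""
    timeout = (not pressed_next) or rt_ms >= TIME_LIMIT_MS
    accepted = {normalize(a) for a in accepted_answers}
    return {
        "correct": normalize(response) in accepted,   # exact spelling required
        "rt_ms": TIME_LIMIT_MS if timeout else rt_ms,  # timeouts are coded as 15 s
        "timeout": timeout,
    }


# Accepted variants (synonyms, rephrasings) would be listed by a human coder
trial = code_trial("Equator", ["the Equator"], rt_ms=6200, pressed_next=True)
print(trial)
```

Keeping the timeout flag separate from the coded response time makes it straightforward to exclude timeout trials from response time summaries later.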

Results

Recall performance

Table 5 in the Appendix shows the question-and-answer pairs, as well as the normed recall accuracy and associated standard deviation and standard error of the mean for each item across participants. The study time for each item was limited to 12 s, and the time allowed for recall during each cued-recall phase was limited to 15 s. This encouraged participants to make speeded responses before the time limit elapsed. We recognize that under normal circumstances some participants may have required more time to finish typing lengthier responses. Therefore, we have included the mean response times from this experiment in a separate table (Table 4) and caution researchers using the response time norms that they reflect values that emerged when participants were given a limiting upper bound of acceptable response time. Note that the items listed in these tables are shown in alphabetical order by question.
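The per-item summaries described here (accuracy, standard deviation, standard error of the mean, and mean response time) can be sketched as below. The trial format and function name are hypothetical, and the use of the sample standard deviation is an assumption; response times are averaged over correct, non-timeout trials only, following the coding scheme.

```python
import math
from statistics import mean, stdev


def item_norms(trials):
    """Summarize one item's trials across participants.

    Each trial is a dict with "correct" (bool), "rt_ms" (number), "timeout" (bool).
    """
    scores = [1.0 if t["correct"] else 0.0 for t in trials]
    sd = stdev(scores)                               # sample SD (an assumption)
    valid_rts = [t["rt_ms"] for t in trials if t["correct"] and not t["timeout"]]
    return {
        "accuracy": mean(scores),
        "sd": sd,
        "sem": sd / math.sqrt(len(scores)),
        # None signals that no correct responses were available for an estimate
        "mean_rt_ms": mean(valid_rts) if valid_rts else None,
    }


# Hypothetical trials for a single item, not data from the study
trials = [
    {"correct": True, "rt_ms": 6000, "timeout": False},
    {"correct": True, "rt_ms": 8000, "timeout": False},
    {"correct": False, "rt_ms": 15000, "timeout": True},
    {"correct": False, "rt_ms": 9000, "timeout": False},
]
norms = item_norms(trials)
print(norms)
```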

We conducted repeated measures ANOVAs on the mean accuracies and response times for each item to determine whether these measures significantly changed over the three recall attempts. In cases in which the sphericity assumption was violated, we report the Huynh–Feldt correction for degrees of freedom. We found that recall accuracy for items improved with each new recall attempt, F(1, 119) = 219.46, p < .001, η_p² = .689. The mean accuracy scores and corresponding standard deviations were as follows: M_Recall1 = .59 (SD = .26), M_Recall2 = .76 (SD = .21), and M_Recall3 = .81 (SD = .18). The response times for items also became significantly faster across recall attempts, F(1, 148) = 191.42, p < .001, η_p² = .659, with M_Recall1 = 7,771 ms (SD = 2,094), M_Recall2 = 7,083 ms (SD = 2,111), and M_Recall3 = 6,823 ms (SD = 2,191).

By-item correlations were calculated to determine whether the accuracy and response time performance on subsequent recall attempts were related, indicating the relative stability of item difficulties across recall attempts. For both accuracy and response time, we found a significant correlation between performance at Recall 1 and performance at Recall 2 (accuracy: r = .92, p < .001, N = 100, 95 % CI = [.88, .95]; response time: r = .97, p < .001, N = 100, 95 % CI = [.95, .98]). Significant relationships were also found between these measures when comparing Recall 2 to Recall 3 performance (accuracy: r = .97, p < .001, N = 100, 95 % CI = [.96, .98]; response time: r = .99, p < .001, N = 100, 95 % CI = [.99, .99]). These results suggest that the distributions of item difficulties for facts were very similar across recall attempts.

Factor structure

We conducted EFAs on accuracy for each norming group (Group 1 and Group 2) at the first recall attempt to determine whether any meaningful factor structure would emerge within each set of 50 items. As in Experiment 1, the EFAs did not reveal any meaningful factor structures. However, we have included the matrix of correlations between all of the items within each group in Tables 9 and 10, for any researchers interested in exploring the correlational structure between these facts.

Influence of item characteristics

We developed a priori classifications of the facts based on specific characteristics that could influence recall performance. Each question-and-answer pair was classified according to the following eight characteristics (which were not mutually exclusive): (a) the number of words in the answer, (b) whether the answer was a word or a number, (c) whether the answer was a Swahili word, (d) whether the question was about the Kenyan constitution, (e) whether the question was about Kenya’s history, (f) whether the question was about civics in Kenya, (g) whether the question was about Kenya’s government, or (h) whether the question was about Kenya’s geography. Whereas the classification of the number of words in the answer to the question was a continuous variable, the remaining variables involved binary codes based on whether the pair did or did not have that characteristic or whether the answer should be classified as a word or a number. Accordingly, Pearson and point-biserial correlations were conducted to determine the relationships between these by-item characteristics and recall accuracy and response time at each recall attempt, using an adjusted p value of .002 for each set of 24 comparisons performed for each dependent measure. A meaningful relationship between recall accuracy and the number of words contained in the answer emerged: As the number of words in an answer increased, accuracy decreased. This pattern was consistent across the three recall attempts, all ps < .001 (Recall 1: r = –.43, N = 100, 95 % CI = [–.57, –.25], Recall 2: r = –.41, N = 100, 95 % CI = [–.56, –.24], Recall 3: r = –.39, N = 100, 95 % CI = [–.55, –.21]). Recall accuracy also decreased for answers involving Swahili words, all ps < .001 (Recall 1: r_pb = –.51, N = 100, 95 % CI = [–.64, –.35], Recall 2: r_pb = –.49, N = 100, 95 % CI = [–.63, –.33], Recall 3: r_pb = –.451, N = 100, 95 % CI = [–.59, –.28]). No other significant relationships between the dependent measures and fact characteristics were found (all ps > .003).
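A point-biserial correlation is numerically the same as a Pearson correlation in which one variable is coded 0/1. The sketch below illustrates this with hypothetical item codes and accuracies (not values from the norms); the function name is illustrative.

```python
import math


def pearson_r(x, y):
    """Pearson correlation between two equal-length numeric sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


# Hypothetical items: 1 = answer is a Swahili word, 0 = it is not
is_swahili = [0, 0, 0, 1, 1]
accuracy = [0.80, 0.70, 0.90, 0.40, 0.50]
r_pb = pearson_r(is_swahili, accuracy)   # point-biserial correlation
print(round(r_pb, 3))
```

A negative value, as in this toy example, indicates lower accuracy for items coded 1, which is the direction reported for answers involving Swahili words.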

Age differences

As with Experiment 1, this study included participants across a broad range of ages. Therefore, we conducted a mixed ANCOVA within each norming group (Group 1 and Group 2) using two within-subjects factors (Recall Attempt and Item) and age as a continuous variable. In particular, we were looking for main effects of age or any interactions with age. For Group 1, we found a significant interaction between age and item, F(31, 3026) = 1.72, p = .008, η_p² = .017. However, none of the other interactions with age reached significance (all ps > .17). The main effect of age also failed to reach significance, p = .858. For Group 2, neither the main effect of age, p = .540, nor the interactions with age (all ps > .22) reached significance. An independent-samples t test revealed no difference in average age for the two norming groups (M_Group1 = 31.29, M_Group2 = 31.85). To further explore the age effects found in these data, we examined whether any age differences between accuracy recall norms for the oldest third and the youngest third of our sample would emerge across all participants and items. The youngest third of participants comprised individuals 24 years of age and younger (range = 18–24 years), whereas the oldest third of participants included participants 34 years of age and older (range = 34–61 years). These cutoffs were determined by finding the age values that marked the 33rd and 66th percentiles when considering age frequencies. We conducted omnibus independent-samples t tests comparing the average recall accuracies across all items for each age group at each recall attempt and found no significant differences. We also performed a finer-grained analysis involving independent t tests comparing the accuracy norms for every item at each recall attempt for the oldest and youngest thirds of participants, using an adjusted p value of .0005 due to the 100 multiple comparisons within each recall attempt. No significant age differences in recall emerged for any specific facts at any recall attempt.

Discussion

The Kenya facts from this experiment are a set of different, but related, learning materials that could be used in conjunction with the English–Swahili paired associates from Experiment 1 in studies of learning. Although the recall norms for the present stimuli were not collected from the same participants who completed the first experiment, the samples across the two experiments were demographically similar. The recall accuracy means for the facts were consistently higher across recall attempts than those found for Swahili recall in the first experiment. This supports the notion that more information was available in the fact cues (i.e., the questions) to support target retrieval than was in the English word cues from the paired-associate stimuli used in Experiment 1. This is consistent with the SAM model of associative memory (Raaijmakers & Shiffrin, 1980, 1981, 2002) discussed earlier.

We failed to discover any meaningful underlying structure to the facts included in our data set using EFA. However, as with Experiment 1, we were unable to examine the factor structure of all items simultaneously, since the items were split across two groups. Future studies that test all items within one sample will be better able to examine factor structures among all 100 items. Using a priori classifications of the items, we found that the recall accuracy for question–answer pairs that involved Swahili words in the answer was worse than that for questions that did not involve Swahili words. For these stimuli, the cues contained in the questions may not have activated links to the memory image containing the Swahili word required for the answer.

A higher number of words required in the answer was associated with poorer recall performance. Thus, the time limit imposed for responding may have impacted recall for these longer answers. Overall, performance appeared stable and showed improvements across the three recall attempts. A review of the accuracy norms for these items in Table 5 demonstrates that performance was certainly not at floor for many of the questions that required longer answers. However, researchers interested in using this set of norms may wish to include number of words as a covariate when evaluating performance in their own studies, in order to address the relationship between the number of words in the answer and recall accuracy. Notably, the mean recall accuracy across items for the first recall attempt (M Recall1 = .59) was in the middle of the range of possible values, supporting the idea that this set of norms may have good sensitivity for use in studies of learning.

Although we found an age-by-item interaction for recall accuracy within Group 1, these differences did not remain when accuracy across items for the two most extreme age groups was compared. Therefore, we do not report separate norms for these two age groups. Overall, our sample of participants was diverse in both age and ethnicity, and we were able to achieve a broad range of recall accuracy values, reflecting varying levels of recall difficulty across the set of items. These norms for Kenya facts can be used as companion stimuli to the English–Swahili word pairs for researchers who wish to study learning across a variety of item types, or they may be used independently to evaluate learning for novel facts.

General discussion

Given the burgeoning interest in studies of learning, metacognition, and memory, there is a need to develop a broad array of normed stimulus sets that researchers can draw from to address similar questions across different learning materials or to better assess how associations between different types of items are formed in memory. The present study provides normative information about recall accuracy and response time for English–Swahili word pairs and facts about Kenya obtained from diverse participant samples. We also detail the influence of item characteristics on recall for the two stimulus sets.

We found that our English-to-Swahili norms were moderately, but not perfectly, associated with Nelson and Dunlosky’s (1994) original Swahili-to-English norms, suggesting some similarities in the distributions of recall difficulty across items, but potential differences in the processes engaged for successful recall. Swahili word length accounted for additional variance in the present norms, above and beyond the variance accounted for by the Nelson and Dunlosky norms. Interestingly, we failed to find a relationship between Swahili target recall accuracy and English word frequency. This differs from the results of earlier norming studies of foreign-language paired associates (Grimaldi et al., 2010; Nelson & Dunlosky, 1994), which found evidence that preexperimental familiarity with English may aid in the recall of English target words. However, prior studies of paired-associate learning involving purely English cue–target pairs have provided evidence that low-frequency targets lead to poorer recall than high-frequency targets (Criss et al., 2011; Madan et al., 2010). In the present study, the Swahili words were novel targets for the participants, so they might all have operated as very-low-frequency items in relation to the participants’ experience. It is possible that the focus on learning these novel items and attempting to attend to the association between them and their English cues overwhelmed any potential word frequency effects of the cues (Criss et al., 2011). The fact that English word frequency was not associated with recall performance in the present study may point to asymmetric associations between English and foreign-language words in paired-associate learning, as has been suggested by translation differences in bilingual research (Kroll & Stewart, 1994). However, recall differences as a result of differing cue and target properties can emerge even in the face of holistic or symmetric associations between the cue and the target (Criss et al., 2011; Kahana, 2002; Madan et al., 2010). In future work, it will be important to assess recall of these items in both the forward and backward cueing directions within the same sample of participants and to evaluate whether performance is highly correlated across these cueing directions. Strong correlations would suggest that even foreign-language paired associates are learned and stored as holistic representations (Kahana, 2002). This type of analysis would also be intriguing to apply to the Kenya facts, to determine whether the associations formed between question-and-answer pairs are symmetric.

In the present set of norms, recall accuracy after the first recall attempt was typically higher for facts than for the English–Swahili word pairs. Recall for facts involving Swahili words in the answer was also poorer than recall of facts that only required English words. Both findings support the notion that the English–Swahili items were more difficult to learn. We reiterate the point that for English–Swahili paired associates, the English cue provides little (e.g., phonological) information to activate links to the representation of the Swahili target in associative memory, whereas the question cues for facts contain many more semantic cues to activate the associated answer in memory. These differences are predicted by associative-memory models, such as SAM, that argue that recall performance is dependent on the quality of the retrieval cue (Raaijmakers & Shiffrin, 1980, 1981, 2002). It is possible that quality in these cases may be related to how strongly the cue activates conceptual links to the target in memory.

When considering the use of the fact norms in the present study, it is important to recognize that there was a relationship between the number of words in the answer and recall accuracy. This may have been due to the 15-s time limit that we applied for recall. We used this limit to combat fatigue and prevent participants from looking up answers or failing to complete the task in one sitting. However, researchers who are interested in using some of the questions that required longer responses may want to consider this issue when deciding which items to select for their own studies and how much time to allocate for recall. An additional option would be for researchers to use the number of words in the answer as a covariate when analyzing the results of any studies making use of these norms.
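
One simple way to take answer length into account is sketched below in Python. It regresses item-level recall accuracy on the number of words in the answer and returns the residuals as length-adjusted scores. This is a simplified stand-in for a full covariate analysis, not the analysis reported in this article, and the function name and any values passed to it are hypothetical.

```python
from statistics import mean


def length_adjusted_accuracy(word_counts, accuracies):
    """Regress accuracy on answer word count; return (slope, residuals).

    The residuals are the part of each item's accuracy that is not
    linearly predicted by the number of words in its answer.
    """
    mean_x, mean_y = mean(word_counts), mean(accuracies)
    sum_xx = sum((x - mean_x) ** 2 for x in word_counts)
    sum_xy = sum(
        (x - mean_x) * (y - mean_y) for x, y in zip(word_counts, accuracies)
    )
    slope = sum_xy / sum_xx  # change in accuracy per additional word
    intercept = mean_y - slope * mean_x
    residuals = [
        y - (intercept + slope * x) for x, y in zip(word_counts, accuracies)
    ]
    return slope, residuals
```

A negative slope would mirror the relationship reported here, in which longer answers were associated with poorer recall. Researchers fitting a full model would instead enter the number of words as a covariate alongside their manipulations of interest.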

We found some age differences in recall accuracy for certain English–Swahili word pairs when comparing the oldest and youngest thirds of participants in our sample. Although we did find evidence of an interaction between items and age for our fact stimuli within one of the norming groups, these differences were not retained in statistical comparisons of by-item recall accuracy for the youngest and oldest age groups. This absence of an age difference for fact recall between the youngest and oldest groups in the sample may have been due to the greater ease of retrieval that these materials afforded through their more effective cues. The sample of participants who learned the facts also included fewer individuals over age 60 (one participant) than the sample of participants who learned the English–Swahili pairs (five participants). It is important to note that the age cutoffs used to define the youngest and oldest thirds of participants in each experiment were determined entirely by our sample and do not match the cutoffs typically used to define age groups. Future work should investigate performance for more traditional older and younger adult samples to determine which items may be most impacted by age.

We adopted a norming technique that mirrored that of Nelson and Dunlosky (1994), in which different participants provided norms for each half of the stimulus set. This leads to constraints on the possible analyses that can be attempted with, and conclusions that can be drawn from, these data. For example, we were unable to assess factor structures across all items simultaneously when conducting our EFAs, and we were unable to assess item and participant characteristics within a single model. This type of analysis would be extremely useful to incorporate into future studies evaluating these norms, to gain a better understanding of the interrelationships of these items and participants’ characteristics. We also did not acquire our English–Swahili and Kenya fact recall norms from the same sample of participants, even though the samples were demographically similar. An interesting direction for future norming studies would be to acquire these normative data from the same sample of participants to have estimates of recall difficulty that could be directly compared across the stimulus sets.

The present norms were collected from users of MTurk, which allowed us to acquire samples with a wide range of ages and ethnicities from all over the United States; therefore, the present norms are more applicable to the general public than are prior published norms acquired from purely undergraduate student samples (Grimaldi et al., 2010; Nelson & Dunlosky, 1994; Nelson & Narens, 1980; Tauber et al., 2013). One could argue that by hosting our experiment online we biased our sample toward individuals with high levels of technological and computer skill. Although this is a reasonable point, it is likely that many college students also have high proficiency in these domains, so this does not limit our present normed data when compared to prior published norms.

It is interesting to note that prior norming studies of foreign-language paired associates and of general-knowledge facts did not assess native-language status or, at the very least, did not report information about the native-language status of their participants (Grimaldi et al., 2010; Nelson & Dunlosky, 1994; Nelson & Narens, 1980; Tauber et al., 2013). We report these data for the experiments discussed in this article in order to provide a comprehensive picture of the demographic characteristics of our samples. Although English was not the first language learned for all participants, they all claimed to be proficient in English, and the percentage of nonnative English speakers was relatively small (between 4% and 5% in each study). Language experience may be important to address in future norming experiments, and should certainly be considered by researchers when deciding what norms are most appropriate for their needs.

We are unaware of any published norms that report a companion set of different but related materials that may be used in tandem in studies of learning and memory. There is also an absence of any published norms for foreign-language paired associates in which monolingual individuals must recall the foreign-language word after receiving a cue word in their native language. In general, the provided normed stimuli expand the options currently available to learning researchers and will appeal to those who study learning and memory in diverse participant samples. These norms will be of high utility for researchers who have an interest in investigating learning and the formation of long-term and associative memories for different, but related, items. The English–Swahili recall norms will also be useful to researchers interested in how associative-memory representations develop for difficult materials with limited early links to semantic or conceptual representations, and to investigators who want to understand the mechanisms of association in foreign-language learning and translation.