Introduction

Episodic memory, a necessary function in everyday life, is defined as memory for an event or item and associated contextual details including time and place (Tulving, 2002). Although there are multiple tests to measure episodic memory in the laboratory, the predominant measures are recall and recognition tests (Otani & Schwartz, 2019; Renoult & Rugg, 2019). After a series of items is presented (pictures, words, or other stimuli), participants can be asked to list the items they remember (recall) or indicate which items were presented from a set of items that includes new items as well as the presented ones (recognition).

Reliability is defined as an estimate of the true measurement of a construct that accounts for both variability and error across multiple independent measurements (Brennan, 2010; Cronbach, 1947). A reliable test is one that can estimate true scores between individuals over time and minimizes error due to the measurement itself. For example, the interaction between the experimenter and the participant, or different instructions for completing the test, can bias measurement, and scores can differ between people because of greater measurement error in certain cases rather than actual differences in the construct. Common measures of reliability include test–retest reliability, which is simply a correlation between test scores at different time points, and inter-rater reliability, a measure of the agreement between evaluators collecting observational data independently. It is important to ensure that a test is reliable and as accurate as possible across people, sessions, and time by identifying potential sources of error, including interactions between different factors, and minimizing them. In this way, we can estimate true differences between people and accurately measure effects of experimental manipulations, especially in memory tests.

Previous attempts to measure the reliability of episodic memory tests have largely been limited to neuropsychological tests. For example, Cheke and Clayton (2015) found no positive correlations between any of the episodic measures they implemented, including free recall, and a non-linear relationship emerged between only two of the tests. However, they did not include the recognition test among the measures of interest. Other studies have examined the reliability and validity of neuropsychological tests involving recognition, and these tests are typically used with patients. While these neuropsychological tests, including the California Verbal Learning Test, the Auditory Verbal Learning Test, and the Continuous Recognition test, have shown satisfactory reliability and validity (Larrabee & Curtiss, 1995), they are not frequently used in laboratory experimental studies. This lack of use is partly due to the large overlap with short-term memory (Hasselmo, 2011, pp. 13–20) and partly due to inflexibility: stimuli within the tests cannot easily be changed, reduced, or expanded to meet the needs of an experiment without compromising validity or reliability. In contrast, experimental studies aiming to examine or manipulate a specific aspect of episodic memory change the content, length, and characteristics of the stimulus set while keeping the general recognition test structure. Moreover, such studies can also implement varying delays between stimulus presentation and test, which is not always possible with neuropsychological tests that have fixed instructions and structure. Thus, a major question that remains to be answered is how the reliability of the “flexible” laboratory recognition test changes across different stimuli, numbers of test trials, populations, and languages. Will the test remain reliable, and will measurements using the test be robust and comparable across studies?

Currently, the reliability of psychological and cognitive tests is predominantly examined using Classical Test Theory (CTT; Livingston, 1972; Miller, 1995), including Cronbach’s alpha, which is essentially an overall index of the correlations among test items. CTT assumes that any test score comprises a true score (e.g., true differences between people) and an error, but it cannot differentiate between multiple sources of error, such as individuals, items, language, and the interactions between them, that impact reliability. Generalizability theory (G theory) is an extension of CTT that goes beyond CTT’s limiting assumption of a single, undifferentiated error variance (Allen & Yen, 1979) and can identify and evaluate specific sources of measurement error in a test (Cronbach et al., 1963). It is named generalizability theory because it estimates the extent to which the influence of any specific source of error variance can be generalized to all possible situations and contexts, as opposed to only the specific testing situation from which the data were obtained (Cronbach et al., 1963). G theory assesses numerous sources of variance contributing to the measurement error associated with the main variable of interest (Cardinet et al., 1976), such as a test score. In naturally occurring environments, there are additional factors, including personal (e.g., personality), methodological (e.g., psychometric characteristics of the measure used), and situational (e.g., time of day) ones, that might each independently contribute to measurement error. G theory provides an advanced method for assessing these factors and their interactions, thus contributing to the improvement of methodology and the precision of an assessment (for review, see Cardinet et al., 2010). G theory is superior to CTT methods of reliability assessment, such as test–retest and inter-rater reliability, because it can estimate the unique impact of multiple external (e.g., language) and internal (e.g., number of items) factors on measurement accuracy (Bloch & Norman, 2012). Thus, G theory can be applied to identify flaws in the recognition test, particularly those related to its flexible structure (different languages and numbers of items). For example, G theory can examine whether the recognition test is biased by the use of different words in different languages (e.g., when each person receives a different set of words from a large pool in his or her native language, Russian or English).
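As a schematic contrast (our notation, for a generic persons-by-items design rather than the design used later in this study), CTT models an observed score as a true score plus a single undifferentiated error, whereas G theory partitions the observed-score variance into separate components:

$$X_{pi} = \tau_p + E_{pi} \;\;\text{(CTT)} \qquad\qquad \sigma^2(X_{pi}) = \sigma^2_{p} + \sigma^2_{i} + \sigma^2_{pi,e} \;\;\text{(G theory)}$$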

It is important to establish the reliability of episodic memory measures both to evaluate memory function and to study memory enhancement. Specifically, a growing number of studies in episodic memory have examined the possibility of enhancing memory function with non-invasive brain stimulation (NIBS), and many of these studies measured changes in memory performance using laboratory recognition tests. We pooled trial-by-trial data from the control groups of multiple NIBS experiments in our lab (Medvedeva et al., 2019; Petrovskaya et al., 2020, in preparation) to accumulate a large sample of Russian and English participants who completed the same memory test (identical stimuli) in Russian and English, respectively, for reliability testing.

The aim of the current study was to apply G theory to establish the reliability of the recognition test and to evaluate specific sources of measurement error associated with the different languages and words used to assess memory performance. G theory was applied in two parts using a cross-sectional design with the individual as the object of measurement. The generalizability study (G-study) estimated sources of variance across individuals and trials in the current measurement design. The decision study (D-study) examined the generalizability of recognition performance scores for different numbers of trials to establish reliability across trials and languages.

Method

Participants

The sample for this study was extracted from three experiments: an English sample from Medvedeva et al. (2019, Experiments 1–2; English Control 1), supplemented by a second experiment with an identical method (Medvedeva, 2019, Chapter 5, Experiment 4; English Control 2), and a Russian sample from a third experiment (Petrovskaya et al., 2020; Russian Control 1 and 2). The demographics for each sample are shown in Table 1; there were 80 individuals in total, 40 for each language (Russian and English), and there was no significant difference in age, p = 0.07.

Table 1 Demographics of samples included in the analysis

Procedure

The method and stimuli for all experiments followed the control group procedure detailed in Medvedeva et al. (2019, Experiment 1), with the exception that, for the Russian groups, stimuli were translated into Russian for the third experiment and participants from an additional Russian group were included (an active stimulation group that showed no difference in performance compared with the control group; see Petrovskaya et al., 2020). Together, the two English experiments and the two Russian groups in the third experiment met the sample size requirement (n ≥ 76) for the reliability study with (1 − β) = 0.85 and f ≥ 0.25 at α = 0.05 (Atilgan, 2013). In each two-part experiment, participants judged the pleasantness (pleasant or not) of individually presented words (MRC database; Coltheart, 1981) and memorized them on the first day (160 words); 120 of these words were presented a day later in an old/new recognition task (indicate whether the item is old, i.e., previously presented, or new) along with 81 new words. See Appendix 1 for the pool of 240 words in Russian and English. For the current work, we only considered trial-by-trial performance on the 120 items that were presented at both study and test. All experiments were approved by the authors’ institutional ethics committees. The results of all experiments confirmed that the performance of the control group was not affected by the NIBS-specific procedure (application of electrodes, short duration of stimulation, etc.).

Data Analyses

Prior to the main analyses, normality of the distributions was confirmed for both the Russian (kurtosis = -0.28, skewness = -0.68) and English (kurtosis = -0.47, skewness = -0.59) groups. The two groups had significantly different average memory performance, t(78) = -3.18, p = 0.002, with higher performance for the English sample (M = 0.75, SD = 0.13) than the Russian sample (M = 0.66, SD = 0.14) and a medium effect size, d = 0.66. G theory analyses followed the recommendations of Cardinet et al. (2010) and employed EduG 6.1-e software (Swiss Society for Research in Education Working Group, 2006). Accuracy on each trial reflected whether an “old” word was correctly recognized as previously presented or incorrectly judged to be “new” (1 = correct, 0 = incorrect or no response), and each participant had complete data for 120 trials. G theory was applied by defining the measurement design and estimating variance components in the G-study, and then conducting a D-study. The D-study involved calculating generalizability coefficients (G-coefficients) and variance estimates for different numbers of trials to examine the reliability of the test.
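For illustration, the variance-component estimation can be sketched in a few lines of code. This is a simplified stand-in for the EduG computation, assuming a balanced, fully random person-nested-in-language by trial (P:L x T) design; the array layout and function name are our own, and Whimbey’s correction (applied by EduG) is omitted.

    import numpy as np

    def g_study(X):
        """X: accuracy array, shape (languages, persons per language, trials);
        1 = correct, 0 = incorrect or no response."""
        n_l, n_p, n_t = X.shape
        grand = X.mean()
        m_l  = X.mean(axis=(1, 2))   # language means
        m_t  = X.mean(axis=(0, 1))   # trial means
        m_pl = X.mean(axis=2)        # person-within-language means
        m_lt = X.mean(axis=1)        # language-by-trial means

        # Sums of squares for each source of variance
        ss_l   = n_p * n_t * np.sum((m_l - grand) ** 2)
        ss_t   = n_l * n_p * np.sum((m_t - grand) ** 2)
        ss_pl  = n_t * np.sum((m_pl - m_l[:, None]) ** 2)
        ss_lt  = n_p * np.sum((m_lt - m_l[:, None] - m_t[None, :] + grand) ** 2)
        ss_res = np.sum((X - grand) ** 2) - ss_l - ss_t - ss_pl - ss_lt

        # Mean squares
        ms_l   = ss_l / (n_l - 1)
        ms_t   = ss_t / (n_t - 1)
        ms_pl  = ss_pl / (n_l * (n_p - 1))
        ms_lt  = ss_lt / ((n_l - 1) * (n_t - 1))
        ms_res = ss_res / (n_l * (n_p - 1) * (n_t - 1))

        # Variance components from the expected mean squares
        # (negative estimates are truncated at zero)
        return {
            "l":      max((ms_l - ms_pl - ms_lt + ms_res) / (n_p * n_t), 0.0),
            "p:l":    max((ms_pl - ms_res) / n_t, 0.0),
            "t":      max((ms_t - ms_lt) / (n_l * n_p), 0.0),
            "lt":     max((ms_lt - ms_res) / n_p, 0.0),
            "pt:l,e": ms_res,
        }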

A person (P) nested within language (L) by trial (T) design, expressed as P:L x T, was implemented, in which an equal number of participants (n = 40) was nested within each language (L; Russian or English). Person nested in language (P:L) was the object of measurement and was not considered a source of error, while trial (T) was the instrumentation facet, because the object of measurement (P) cannot be nested within an instrumentation facet (L). Therefore, a P:L/T 40 × 2 × 120 measurement design was specified in EduG. Interactions between person and language may reflect an effect of language. A two-way ANOVA with a two-level random-effects measurement design, defined as person nested in language by trial, was conducted to calculate variance components due to person, language, trial, and the interactions between them. EduG estimates variance components with application of Whimbey’s correction coefficient (Cardinet et al., 2010), expressed as (N(f) − 1)/N(f), where N(f) represents the population size of facet f in the G-study design. The G-study estimated variance components using the standard ANOVA approach. The D-study computed G-coefficients using the equations of Brennan (2010). G-coefficients reflect the overall reliability and generalizability of the test scores across persons and trials (Bloch & Norman, 2012; Medvedev et al., 2017). While the relative G-coefficient (Gr) only considers variance directly related to the object of measurement, the absolute G-coefficient (Ga) also accounts for other sources of variance, such as trial, that could affect absolute measurement indirectly, making it a more conservative measure of reliability. The D-study also involved removing one of the two languages as a level from the model to compare the reliability and generalizability of the test in each language using G-coefficients. In addition, different numbers of trials (20, 40, 60, 80, and 100) were examined to evaluate the reliability of the test.
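As a sketch of the general form of these coefficients under the design above (with the differentiation variance taken as the language and person-within-language components, and n'_T denoting the number of trials; the exact EduG values may differ slightly because of the corrections noted above):

$$
G_{rel} = \frac{\sigma^2_{L} + \sigma^2_{P:L}}
               {\sigma^2_{L} + \sigma^2_{P:L} + \frac{\sigma^2_{LT} + \sigma^2_{PT:L,e}}{n'_{T}}},
\qquad
G_{abs} = \frac{\sigma^2_{L} + \sigma^2_{P:L}}
               {\sigma^2_{L} + \sigma^2_{P:L} + \frac{\sigma^2_{T} + \sigma^2_{LT} + \sigma^2_{PT:L,e}}{n'_{T}}}
$$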

Results

G study

Table 2 includes the conventional ANOVA estimates computed for the G-study, including person (P) nested in language (L), trial (T), and their interactions. The results of the G-study are interpreted with reference to the generalizability of a single trial. The P:L x T variance component (0.191) accounts for most of the variance (91.2% of the total). There is a modest amount of variability between persons within language groups (0.016, or 7.5%). The G-study results for English and Russian presented in Table 2 are very similar, suggesting only negligible differences due to language. Therefore, it was unnecessary to conduct the D-study separately for the English and Russian sub-samples, and the full sample was used in the D-study.

Table 2 ANOVA estimates for individual nested within language by trial for the full sample (above) and English and Russian groups (below)

Table 3 includes the D-study results and shows that, after accounting for all sources of error, the recognition test has good generalizability across English and Russian speakers, with a high absolute G-coefficient (0.92). Variance components in the D-study were computed for 120 trials, with all error components from the G-study divided by 120. The interaction between individual and trial nested within language reflects individual differences in remembering different words and explains most of the error variance in memory performance. Language alone explains only 10% of the differentiation variance. Similarly, trial and the interaction between trial and language accounted for a negligible proportion of the absolute error variance (Table 3). Standard error values were negligible, indicating the robustness of both the relative and absolute G-coefficients and, in turn, reflecting the good reliability of the test.

Table 3 D-study estimates for the full sample

D-study variance components were produced by dividing the estimated variance components from the G-study that include trial (T) by 20, 40, 60, 80, and 100, respectively (Table 4). We found insufficient reliability at 20 and 40 trials, with relative and absolute G-coefficients of 0.65 and 0.79, respectively (see Table 4). However, starting from 60 trials, reliability reached an acceptable level (Gr and Ga = 0.85). Furthermore, a non-linear increase in reliability was observed with an increasing number of trials: Gr and Ga of 0.88 for 80 trials and 0.90 for 100 trials, comparable to the overall reliability of the test with all 120 trials (Gr and Ga = 0.92). Moreover, the grand mean was not significantly different between trial designs (20–100 trials), ranging from 0.69 for 20 trials to 0.70 for 40, 60, 80, and 100 trials. For all trial groups except 20 trials, standard error values were relatively low, reflecting the robustness of both relative and absolute G-coefficients and acceptable reliability of the test. We also estimated G-coefficients for each language to evaluate robustness when the number of trials is reduced toward the critical cut-off point (i.e., 60, 40, and 20 trials). For the English version, both G-coefficients were 0.61 and 0.76 with 20 and 40 trials, respectively, and reached an acceptable level of 0.83 with 60 trials. Similarly, the Russian test version reached good reliability with 60 trials (both G = 0.82), while with 20 and 40 trials reliability was below the acceptable cut-off point of 0.80.
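A minimal sketch of this D-study sweep, continuing the illustrative g_study() function given in the Data Analyses section (the variance-component labels are those returned by that sketch, not EduG output labels):

    # Prospective D-study: error components that involve trials are divided by
    # the hypothetical number of trials before computing the G-coefficients.
    def d_study(var, trial_counts=(20, 40, 60, 80, 100, 120)):
        tau = var["l"] + var["p:l"]                    # differentiation variance
        for n in trial_counts:
            rel_err = (var["lt"] + var["pt:l,e"]) / n  # relative error variance
            abs_err = rel_err + var["t"] / n           # absolute error adds trial
            print(f"trials = {n:3d}  Gr = {tau / (tau + rel_err):.2f}  "
                  f"Ga = {tau / (tau + abs_err):.2f}")

    # Example usage with an accuracy array of shape (2, 40, 120):
    # d_study(g_study(accuracy))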

Table 4 D-study estimates for the full sample and different number of trials

Discussion

The aim of the current study was to establish the reliability of a standard laboratory recognition test and to identify possible sources of error due to language and number of trials that could bias recognition measurement. The interaction between person and trial within language produced the largest amount of error, perhaps because different people remember different words to different degrees; after accounting for this error, the measurement error due to language was negligible. The results demonstrated that the reliability of the test and the generalizability of test scores across sample populations depend on the number of trials used in the test, irrespective of the language used. The highest reliability and generalizability of the test was achieved with 120 trials. Moreover, there was little measurement error due to different words (as represented by trial), so reliability in recognition measurement is not contingent upon using the same set of words. Thus, conducting the test in different languages is unlikely to significantly bias measurement of recognition, although the results can only be generalized to English and Russian younger adults.

The D-study showed that reducing the number of trials gradually decreased reliability, but reliability was still at an acceptable level with at least 60 trials, both overall and in each language. This indicates that test users should consider including higher numbers of trials (≥ 60) to ensure the reliability of the test, because with 80 trials the test was as reliable as one with 100 or 120 trials. Although we examined only old items, i.e., items that are shown at both study and test, the recognition test should have a minimum of 40 old items presented at study and test, in addition to the same number of items presented at study but not test and new items presented at test but not study. Although results can vary between studies based on differences in semantic relatedness and distinctiveness within the word set (e.g., Hunt & Smith, 2014; Roediger et al., 2001), reliability should be equally high for these stimulus sets as for words that are randomly selected and semantically unrelated (as in our stimulus set). Other considerations for memory test design include blocked versus intermixed presentation of conditions and stimulus characteristics such as concreteness and frequency. Based on our results, we suggest that while these factors are important for experimental manipulation and control within the memory test, reliability would remain consistently high across such changes. Given the structure of a recognition test, and to ensure a sufficient level of difficulty, at least as many new items and items presented at study but not test should be used as items presented at both study and test (≥ 60; e.g., 60 new items, 60 items presented at study but not test, and 60 items presented at both study and test). We could not demonstrate the effect of changing the proportion of new items because new items are not learned and accordingly are not presented at both study and test.

The examination of different trial numbers is novel and valuable because, until now, the minimum number of items in a reliable memory test has remained unclear, particularly given that neuropsychological tests have shown differences between patients and healthy controls with as few as ten items (e.g., the Selective Reminding Test; Buschke & Fuld, 1974). Generalizability theory is a well-established but underused method for determining how well a test discriminates between individuals on the intended construct, and the method is especially valuable in identifying factors that can increase reliability (such as trial number). The unique value of G theory for evaluating the reliability of a long-term memory test is apparent when one considers that such an investigation would not be possible using traditional CTT methods, because they would require repeated completion of the test with the same set of words and would, in turn, make the test invalid due to learning effects.

The results suggest that, because recognition measurements are equally reliable in both languages, the test can be confidently used to discriminate between groups with different memory performance and to explore individual differences or reveal effects of cognitive manipulations. The finding that individual differences were the greatest source of error suggests that many other factors can influence memory in addition to age, education, and NIBS. The presence of significant differences between groups and the existence of individual differences in memory that cannot be accounted for by age, education, NIBS, or a cognitive manipulation suggest that further examination of neurobiological and cognitive factors is warranted. Future studies can examine individual differences in depth by applying within-subjects designs and measuring memory performance over time; multiple measurements are more likely to be accurate. Accordingly, future studies should examine the reliability of the memory test across time by measuring performance at three different time points and applying G theory. The current study only examined memory performance at one time point and used samples from three different experiments to increase the sample size of the English and Russian groups. Future studies can also apply G theory when taking recognition measurements in other languages to confirm that any differences in results cannot be attributed to differences in the reliability of the test between languages. The recognition test results with this set of words cannot be generalized to every language because language is a fixed facet (i.e., limited to the languages included in the analysis), meaning that the reliability of the test is known only for Russian and English speakers. Since the reliability results only apply to English- and Russian-speaking younger adults, future studies should also measure the test’s reliability in other populations, including older adults and patients with memory disorders. Although these populations are more likely to be tested with neuropsychological tests, there is still a need for tests that are sensitive to memory changes, particularly deficits and decline (Busch et al., 2019; Elkana et al., 2016).

The recognition test has the advantages over recall tests of a larger number of trials (Murdock, 1968), better performance (Allan & Rugg, 1997), and decreased individual variability (Gomulicki, 1956; Mandler, 1968; Ozier, 1980). While recall tests may be more difficult and may index recollection rather than familiarity, recognition tests may be more practical. The laboratory recognition test can be flexibly applied with different languages, items, and list lengths, according to the needs of the experiment and the research question. Specifically, the reliability of the recognition test has implications for measuring neuroenhancement induced by transcranial direct current stimulation (tDCS): studies have been conducted in different languages and countries, including in Russian and English, and a reliable memory test can demonstrate when tDCS has led to changes in memory in cross-sectional measurements and over time. In conclusion, the widely used method of applying the recognition test in the laboratory, with a different set of words in different languages for each experiment, may be as reliable as recognition tests with a fixed format. Although the reliability of the test remains to be established for other languages and populations, the reliability and flexibility shown here for Russian and English in younger adults may make it an optimal assessment for cognitive research.