Using response time modeling to distinguish memory and decision processes in recognition and source tasks
Cite this article as: Starns, J. J. Mem Cogn (2014) 42: 1357. doi:10.3758/s13421-014-0432-z
Receiver operating characteristic (ROC) functions are often used to make inferences about memory processes, such as claiming that memory strength is more variable for studied versus nonstudied items. However, decision processes can produce the ROC patterns that are usually attributed to memory, so independent forms of data are needed to support strong conclusions. The present experiments tested ROC-based claims about the variability of memory evidence by modeling response time (RT) data with the diffusion model. To ensure that the model can correctly discriminate equal- and unequal-variance distributions, Experiment 1 used a numerosity discrimination task with a direct manipulation of evidence variability. Fits of the model produced correct conclusions about evidence variability in all cases. Experiments 2 and 3 explored the effect of repeated learning trials on evidence variability in recognition and source memory tasks, respectively. Fits of the diffusion model supported the same conclusions about variability as the ROC literature. For recognition, evidence variability was higher for targets than for lures, but it did not differ on the basis of the number of learning trials for target items. For source memory, evidence variability was roughly equal for source 1 and source 2 items, and variability increased for items with additional learning attempts. These results demonstrate that RT modeling can help resolve ambiguities regarding the processes that produce different patterns in ROC data. The results strengthen the evidence that memory strength distributions have unequal variability across item types in recognition and source memory tasks.
Keywords: Diffusion model · Recognition memory · Source memory · Unequal-variance assumption
When we look back on an event, what kind of information do we get from memory? Subjectively, we reexperience a somewhat degraded version of the perceptions, actions, thoughts, and emotions that characterized the event, but how is this information translated into explicit decisions about what we have and have not experienced in the past? Theorists from a variety of perspectives propose that all of the different types of remembered information vary on a continuum of strength, and the total evidence that an event was experienced is defined by combining these strength values (Johnson, Hashtroudi, & Lindsay, 1993; Ratcliff, 1978; Wixted, 2007).
Researchers have often attempted to test hypotheses about memory evidence using receiver operating characteristic (ROC) functions, which plot the proportion of times a response is made correctly against the proportion of times it is made incorrectly across different levels of bias (Egan, 1958; Wixted, 2007). The different bias levels are almost always created by having participants respond on a confidence scale. Conclusions about memory rely on the assumption that properties of the ROC function—such as asymmetry in the function or the degree of curvature—are a consequence of underlying memory processes. However, ROC properties can also be influenced by decision processes. For example, ROC asymmetry and curvature can be affected by differences in decision criteria for the different response alternatives in sequential sampling models (Ratcliff & Starns, 2009, 2013; Van Zandt, 2000), changes in the decision criteria on one evidence dimension across different levels of another dimension (Starns, Pazzaglia, Rotello, Hautus, & Macmillan, 2013; Starns, Rotello, & Hautus, 2014), and variability in the position of decision criteria across trials (Benjamin, Diaz, & Wee, 2009; Mueller & Weidemann, 2008).
A critical goal for advancing memory research is distinguishing which ROC results are properly interpreted in terms of memory evidence and which are produced by decision mechanisms. My primary goal was to model response time (RT) distributions from memory tasks to determine whether RT data support the same conclusions about memory as ROC analyses. The RT model was applied to two-choice responding, so decision biases that affect confidence ratings cannot distort the results. Another goal was to model results from a task with known evidence distributions to ensure that the RT model supported correct conclusions.
Evidence variability and ROC functions
The view that memory evidence is a continuous strength value can be formalized using the standard univariate signal detection model (Egan, 1958; Lockhart & Murdock, 1970). Signal detection models have been broadly applied in studies of both recognition memory, in which participants are asked to discriminate items that were studied on a previous list (targets) from items that were not (lures), and source memory, in which participants are asked to specify the context or presentation format that was paired with an item at study. In the recognition memory version of the model, memory evidence is higher for targets than lures on average, but varies from trial to trial. For example, targets can sometimes produce weak evidence (perhaps because they were not well attended at encoding), and lures can sometimes produce strong evidence (perhaps because they are similar to one or more of the studied words). Some theorists contend that strength is more variable for targets than for lures, given that encoding and retrieval processes are more successful for some items than for others (Wixted, 2007). Indeed, a number of computational models of memory produce unequal-variance distributions (Gillund & Shiffrin, 1984; Hintzman, 1986; McClelland & Chappell, 1998; Ratcliff, Sheu, & Gronlund, 1992; Shiffrin & Steyvers, 1997). However, some theorists challenge the claim that variability in encoding and retrieval success produces unequal-variance distributions (e.g., Koen & Yonelinas, 2010), and some models do produce equal-variance strength distributions (e.g., Murdock, 1982).
ROC data have played a key role in testing assumptions about the variability of memory evidence distributions. Recognition ROCs are almost always poorly fit by an equal-variance model and closely fit by a model in which evidence is more variable for targets than for lures (e.g., Egan, 1958; Wixted, 2007). The unequal-variance model performs better because it can match the asymmetry in the empirical functions. Providing extra learning trials generally has little or no influence on the asymmetry of the recognition ROC, suggesting that increasing the average memory strength does not affect the across-trial variability in strength (Dube, Starns, Rotello, & Ratcliff, 2012; Glanzer, Kim, Hilford, & Adams, 1999; Ratcliff, McKoon, & Tindall, 1994; Ratcliff et al., 1992; but see Heathcote, 2003). For source memory tasks, the ROC function is typically symmetrical when the two alternative sources are equal in memory strength (Hilford, Glanzer, Kim, & DeCarlo, 2002). If one source is learned more effectively, the function becomes asymmetrical and indicates higher evidence variability for the strong source than the weak source (Starns et al., 2013; Yonelinas & Parks, 2007).
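The link between evidence variability and ROC asymmetry can be illustrated with a short simulation (a Python sketch; the function name and parameter values are illustrative and are not taken from the studies cited). In the unequal-variance signal detection model the zROC is linear with slope 1/σ for the target distribution, so any σ above 1 produces a slope below 1 and an asymmetric ROC:

```python
import numpy as np
from scipy.stats import norm

def uvsd_roc(d_prime, sigma_target, criteria):
    """Hit and false-alarm rates for an unequal-variance signal detection
    model: lures ~ N(0, 1), targets ~ N(d_prime, sigma_target).  A
    "studied" response is given when evidence exceeds the criterion."""
    c = np.asarray(criteria, dtype=float)
    hits = norm.sf(c, loc=d_prime, scale=sigma_target)
    false_alarms = norm.sf(c, loc=0.0, scale=1.0)
    return hits, false_alarms

criteria = np.linspace(-0.5, 2.5, 5)          # confidence criteria
h_eq, f_eq = uvsd_roc(1.5, 1.00, criteria)    # equal-variance model
h_uv, f_uv = uvsd_roc(1.5, 1.25, criteria)    # unequal-variance model
zslope_eq = np.polyfit(norm.ppf(f_eq), norm.ppf(h_eq), 1)[0]  # exactly 1
zslope_uv = np.polyfit(norm.ppf(f_uv), norm.ppf(h_uv), 1)[0]  # 1/1.25 = .8
```

With σ = 1.25 for targets, the zROC slope is .8, in the vicinity of the slopes commonly reported for recognition memory; the point of the simulation is only that slope and asymmetry follow directly from the variance ratio in this model.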
Starns et al. (2013) reported evidence that the effect of strength on the asymmetry of source memory ROC functions (or equivalently, the slope of zROC functions) is produced by decision processes. By evaluating source confidence ratings for nonstudied items incorrectly called “old,” these researchers demonstrated that participants were more willing to make high-confidence source ratings when they were more confident that the item was studied. This pattern was observed in 14 data sets, and the decision bias was often quite dramatic. For example, participants in one data set made high-confidence source ratings for fewer than 1 % of nonstudied items that they rated as “maybe old,” as compared with 64 % of nonstudied items that they rated as “definitely old.” Critically, when the confidence ratings were used to construct ROC functions, this decision bias had the same influence on ROC asymmetry as increasing the evidence variability for the stronger source. Therefore, Starns et al. (2013) suggested that learning strength might have little or no effect on evidence variability for source memory, with the apparent effect produced by decision processes.
Experiments 2 and 3 herein explored the effect of additional learning on evidence variability in recognition and source tasks by analyzing accuracy and RT data with the diffusion model (Ratcliff, 1978). These experiments used similar methods, and the primary goal was to determine whether RT modeling—like ROC modeling—shows that additional learning increases evidence variability for source tasks but not for recognition tasks. Starns and Ratcliff (2014) applied the diffusion model to nine recognition memory data sets, and the results consistently showed that memory evidence was more variable for studied items than for nonstudied items. Learning strength did not affect evidence variability for studied words, consistent with the findings of ROC studies. I expect to replicate these results in Experiment 2, and Experiment 3 will reveal whether RT and ROC modeling also support the same conclusions for source memory.
RT modeling can inform whether the strength effect on the asymmetry of source memory ROC functions is based on decision processes alone or whether changes in memory evidence also play a role. The diffusion model can estimate evidence variability from a two-choice task without confidence ratings. If the strength effect is produced by confidence-scale decision biases, then the diffusion model estimates should show that additional learning has little or no effect on evidence variability for source memory. If increasing source performance does affect the overall variability of source evidence, then the diffusion model variability estimates should be higher for stronger items, matching the ROC results.
Unequal variance in RT modeling
The diffusion model (Ratcliff, 1978) describes two-choice decisions as the accumulation of noisy evidence between two response boundaries. The critical model parameters are the separation between the decision boundaries (a), the starting point of evidence accumulation (z), the duration of nondecision processes (Ter), and the drift rate toward a response boundary (v). The amount of within-trial variability in the accumulation process (s) is treated as a scaling parameter, and I follow the standard practice of setting it to .1 (e.g., Ratcliff & Rouder, 1998).
The model accommodates between-trial variability in many of the parameters, including a uniform distribution of starting points with range sZ, a uniform distribution of nondecision times with range sT, and a normal distribution of drift rates with standard deviation η (Ratcliff & McKoon, 2008; Ratcliff, Van Zandt, & McKoon, 1999). The between-trial variation in drift is similar to the variation in strength assumed by signal detection models. Panel 2 of Fig. 1 shows between-trial drift distributions for a male/female source task such as the one used in Experiment 3. Female items usually have drift rates that approach the bottom boundary (i.e., the distribution means are below zero), and male items usually have drift rates that approach the top boundary (i.e., the distribution means are above zero). However, some items have drift rates that approach the incorrect boundary.
The displayed distributions produce an accuracy of about .65 for weak items and .8 for strong items, and they represent two possible mechanisms for this increase in accuracy. In the set of distributions on the top, strengthening items shifts the means of the drift distributions farther from zero with no effect on variability (η = .1 for all distributions). In the set on the bottom, strengthening items both shifts the means of the distributions and increases the variability (η = .1 for weak items and .2 for strong items). The drift distribution means were set to match the desired accuracy levels, and the other model parameters were similar to the average values from Experiment 3 below.
Figure 1 also shows model predictions to demonstrate how the equal- and unequal-variance alternatives can be discriminated with RT distributions (Panel 3). Predictions are shown for the proportion of “male” and “female” responses, as well as the .1, .5, and .9 quantiles of the RT distribution associated with each response. These quantiles show the leading edge, the median, and the tail of the RT distributions, respectively. All of the drift distributions are the same for weak items, so the predictions are identical as well. For strong items, the RT quantiles show that the equal- and unequal-variance versions of the model produce slightly different RT distributions for the same accuracy level. Specifically, the unequal-variance version of the model produces faster correct responses, with a very small effect on the leading edge, a slightly larger effect on the median, and a still larger effect on the tail. The error RTs show a hint of the same pattern, but the differences are smaller.
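The mechanism just described can be mimicked with a small simulation (an illustrative Python sketch, not the fitting software used for the reported analyses; the parameter values are loosely based on those given above, with s = .1 as the scaling parameter). Each trial draws a drift rate from a normal distribution with standard deviation η and then accumulates noisy evidence between the boundaries, so the effect of η on accuracy and the RT quantiles can be inspected directly:

```python
import numpy as np

def simulate_diffusion(v_mean, eta, a=0.12, z=0.06, s=0.1,
                       ter=0.45, dt=0.001, n_trials=20000, seed=7):
    """Simulate two-choice diffusion trials.  Each trial draws a drift
    rate from N(v_mean, eta); evidence then random-walks from z until it
    crosses 0 (bottom boundary) or a (top boundary).  Returns a response
    array (1 = top boundary) and RTs including nondecision time."""
    rng = np.random.default_rng(seed)
    v = rng.normal(v_mean, eta, n_trials)   # between-trial drift variability
    x = np.full(n_trials, z)
    t = np.zeros(n_trials)
    active = np.ones(n_trials, dtype=bool)
    step_sd = s * np.sqrt(dt)               # within-trial noise per time step
    while active.any():
        n = int(active.sum())
        x[active] += v[active] * dt + rng.normal(0.0, step_sd, n)
        t[active] += dt
        active &= (x > 0.0) & (x < a)
    return (x >= a).astype(int), ter + t

# Same mean drift, different between-trial variability (eta):
resp_lo, rt_lo = simulate_diffusion(v_mean=0.06, eta=0.10)
resp_hi, rt_hi = simulate_diffusion(v_mean=0.06, eta=0.20)
acc_lo, acc_hi = resp_lo.mean(), resp_hi.mean()
q_lo = np.quantile(rt_lo[resp_lo == 1], [0.1, 0.5, 0.9])
q_hi = np.quantile(rt_hi[resp_hi == 1], [0.1, 0.5, 0.9])
```

With the mean drift held constant, raising η pulls accuracy toward chance and reshapes the correct-RT distribution; recovering η from exactly this kind of joint accuracy-and-quantile information is the problem the fits below must solve.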
Parameter validation experiments
Figure 1 shows that changing evidence variability in the diffusion model has only a small effect on predictions. The goal of Experiment 1 was to ensure that model fits can discriminate equal- and unequal-variance distributions despite this small effect. Participants saw arrays of varying numbers of asterisks, and they decided whether a low (under 50) or high (over 50) number of asterisks was displayed. The across-trial variability in the number of asterisks was directly manipulated to produce equal- and unequal-variance evidence distributions. Different blocks of trials were designed to produce evidence distributions similar to those assumed in both the recognition task (Experiment 2) and the source task (Experiment 3). Thus, Experiment 1 will show whether or not the conclusions for the memory experiments rest on sound methods. Previous investigators have performed similar validation studies for the diffusion model, and the model shows appropriate parameter estimates for response caution, response bias, evidence strength, and nondecision processes (e.g., Ratcliff & McKoon, 2008; Voss, Rothermund, & Voss, 2004). Experiment 1 is the first to extend this validation approach to the drift-rate variability parameter.
Experiment 1: Numerosity discrimination
Thirty-seven University of Massachusetts undergraduates participated to earn extra credit in their psychology courses.
Each stimulus had 100 locations arranged in 10 rows and 10 columns. Some of the locations displayed an asterisk, and some were blank. The stimuli were created by randomly selecting a value from a uniform distribution between 0 and 1 for each location and displaying an asterisk for any location with a value less than p. Thus, the number of asterisks displayed across trials followed a binomial distribution with a probability parameter equal to p and an N of 100. The asterisk distributions were designed to simulate evidence distributions in either a recognition or source memory task. Throughout the “Method” and “Results” sections, the conditions will be labeled with the corresponding conditions in the memory task that they represent.
For the unequal-variance recognition blocks, the trial structure and the distributions for lures and weak targets were the same as those in the equal-variance blocks. To introduce additional variability for strong targets, the probability parameter of the binomial distribution was sampled from a beta distribution (instead of being fixed across trials). The beta distribution had shape parameters α = 12 and β = 5, which produces a mean probability of .706 that a space will be filled with an asterisk. The combination of the beta and binomial distributions produces a distribution for the number of asterisks with higher variability and a slight negative skew (Fig. 2, Panel 2).
The remaining blocks were based on the source task in Experiment 3, in which participants studied words paired with a picture of either a male or a female face. Here, the number-of-asterisks dimension is analogous to a dimension ranging from evidence that is strongly characteristic of the female face to evidence that is strongly characteristic of the male face. The number of asterisks for the equal-variance source blocks followed binomial distributions with probability parameters of .42, .46, .54, and .58 for strong female, weak female, weak male, and strong male items, respectively (Fig. 2, Panel 3). The unequal variance source blocks matched the equal-variance source blocks, except that the probability parameters for strong male and strong female items were sampled from a beta distribution, with α = 5 and β = 12 (mean = .294) for female items and α = 12 and β = 5 (mean = .706) for male items (Fig. 2, Panel 4).
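The two sampling schemes can be written out in a few lines (a Python sketch; the helper name, seed, and trial count are arbitrary). Fixing the fill probability p yields a binomial count of asterisks, whereas redrawing p from a beta distribution on every trial yields a beta-binomial count with the same mean but inflated across-trial variance:

```python
import numpy as np

def asterisk_counts(p=None, ab=None, n_trials=100_000, n_cells=100, seed=42):
    """Number of filled cells per trial.  With a fixed fill probability p
    the counts are binomial; with ab = (alpha, beta) the probability is
    redrawn from a Beta(alpha, beta) distribution each trial, yielding a
    beta-binomial with the same mean but greater across-trial variance."""
    rng = np.random.default_rng(seed)
    probs = rng.beta(*ab, n_trials) if ab is not None else p
    return rng.binomial(n_cells, probs, n_trials)

fixed = asterisk_counts(p=12 / 17)    # equal-variance strong items
mixed = asterisk_counts(ab=(12, 5))   # unequal-variance strong items
# Beta(12, 5) has mean 12/17 ~ .706, so both conditions average about
# 70.6 asterisks, but the beta-binomial spread is much larger.
```

The variance inflation is exact: the binomial variance np(1 − p) ≈ 20.8 (SD ≈ 4.6) becomes np(1 − p)[1 + (n − 1)/(α + β + 1)] ≈ 135 (SD ≈ 11.6) under the beta-binomial, which is what produces the wider, slightly skewed distributions in Fig. 2.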
Design and procedures
Block type and item type were both manipulated within subjects. At the beginning of the experiment, participants were informed that they would complete multiple blocks of a numerosity task. They were told that they would see displays with varying numbers of asterisks on the screen and that they should hit the “z” key to respond “low” (below 50 asterisks) or the “/” key to respond “high” (above 50 asterisks). Participants were asked to balance speed and accuracy in their responding. Any trial with an RT below 200 ms or above 1,400 ms was immediately followed by a “too fast” or “too slow” message, respectively. The RT feedback messages were required for fewer than 1 % of the responses. Participants also saw the proportion of correct responses that they achieved on each block after all the trials were completed.
Participants first completed a practice block with the equal-variance source design, and then they completed 16 blocks that contributed data for the analyses below (four from each block condition). The order of the blocks was randomized uniquely for each participant. Recognition blocks comprised 50 lure trials, 25 weak-target trials, and 25 strong-target trials. Source blocks comprised 17 trials each for the strong-male, strong-female, weak-male, and weak-female stimuli. The order of trials within each block was random. Participants were allowed to take brief breaks between blocks if they wished.
[Table 1. Response proportions and correct and error response time (RT) medians from all experiments. Columns: Experiment and Condition, Correct RT Median, Error RT Median; values not reproduced here.]
The diffusion model was fit to the data from each participant using the χ2 method described by Ratcliff and Tuerlinckx (2002) and the same parameter ranges reported by Starns and Ratcliff (2014). Separate model fits were performed for the four block types. The data comprised 12 response frequencies for each condition: 6 frequencies each for the number of “high” and “low” responses in the six RT bins segregated by the .1, .3, .5, .7, and .9 quantiles of the RT distribution. Thus, to fit the data, the model had to match both the proportion of each response and the RT distributions within each category. One degree of freedom (df) is lost for each item type because the proportions in the bins must sum to 1.
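The structure of the fitted data can be illustrated as follows (a sketch; this is not the fitting code of Ratcliff and Tuerlinckx, 2002, and the function names and toy data are invented). For one condition, the RTs for each response are cut at the five quantiles, and a candidate parameter set is scored by a Pearson chi-square comparing the twelve observed bin counts with the bin probabilities the model predicts, which sum to 1 across both responses:

```python
import numpy as np

QUANTILES = (0.1, 0.3, 0.5, 0.7, 0.9)

def quantile_bin_counts(rts, responses, label):
    """Observed frequencies for one response category: counts of trials
    in the six bins bounded by that response's RT quantiles."""
    r = rts[responses == label]
    edges = np.quantile(r, QUANTILES)
    return np.histogram(r, bins=np.concatenate(([0.0], edges, [np.inf])))[0]

def chi_square(observed, model_probs, n_total):
    """Pearson chi-square between observed bin counts and the bin
    probabilities predicted by the model for this condition."""
    expected = np.asarray(model_probs) * n_total
    return float(np.sum((observed - expected) ** 2 / expected))

# Toy data for one condition: 12 frequencies = 6 bins x 2 responses.
rng = np.random.default_rng(1)
rts = 0.45 + rng.gamma(2.0, 0.12, size=1000)
responses = np.where(rng.random(1000) < 0.8, "high", "low")
obs_high = quantile_bin_counts(rts, responses, "high")
obs_low = quantile_bin_counts(rts, responses, "low")
```

Because the twelve proportions for a condition must sum to 1, one degree of freedom is lost per item type, which is the bookkeeping used in the df counts below.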
For all of the reported fits, parameters for decision criteria (a, z, and sZ) and parameters for nondecision components (Ter and sT) were held constant across different stimulus types, such as displays with many or few asterisks in Experiment 1 or studied versus nonstudied items in Experiment 2. This is a nearly universal practice in diffusion research, and fixing these parameters has a strong theoretical justification (e.g., Ratcliff & McKoon, 2008). Fixing the decision parameters is analogous to using the same response criterion across stimulus types in signal detection theory. If the only thing identifying the stimulus type is the evidence available from the stimulus, then participants have no basis for changing decision standards across stimulus types (e.g., they can’t decide to be more liberal for targets, because once they have identified a stimulus as a target, the decision has already been made). Fixing the nondecision parameters is justified by the fact that all of the stimulus types are randomly mixed into the same test with the same basic stimulus processing and response output requirements; thus, the factors affecting nondecision time are controlled.1
For the blocks simulating equal- and unequal-variance recognition distributions, the data for each fit had 33 df (11 each for strong targets, weak targets, and lures). The recognition model had 12 free parameters: the separation between the response boundaries (a), the mean starting point of evidence accumulation (z), the range of variation in starting point across trials (sZ), the mean nondecision time (Ter), the range of variation in nondecision time across trials (sT), the proportion of trials affected by RT contaminants (pO; see Ratcliff & Tuerlinckx, 2002), three drift rates (v) for strong targets, weak targets, and lures, and three between-trial evidence variability parameters (η) for strong targets, weak targets, and lures. Thus, the fits for the recognition blocks were associated with 21 df (33 df in the data minus 12 free parameters).
For the blocks designed to simulate a source memory task, the data had 44 df (11 each from the four stimulus types). The source model had 14 free parameters. The parameters were the same as the fits to the recognition blocks, except that there were four drift rates (v) and four evidence variability (η) parameters for strong-male, strong-female, weak-male, and weak-female items. Thus, fits to the source blocks had 30 df (44 df in the data minus 14 free parameters).
[Table 2. Drift rate (v) and drift distribution standard deviation (η) by experiment and condition; values not reproduced here.]
[Table 3. Drift rate (v) and drift distribution standard deviation (η) by experiment and condition; values not reproduced here.]
[Table: Average parameter values for parameters fixed across item types in the individual-participant fits, by experiment and condition; values not reproduced here.]
Figure 3 shows the average η estimates from each of the four conditions in Experiment 1. The variability estimates had similar values across the different item types in the equal-variance conditions (Panels 1 and 3) but showed clear differences in the unequal-variance conditions (Panels 2 and 4). Data from the recognition blocks were submitted to a 3 (item type) × 2 (equal- vs. unequal-variance blocks) ANOVA. If the diffusion model successfully discriminated the equal- and unequal-variance distributions, the different blocks should have different variability estimates for strong targets but no differences for lures and weak targets. The ANOVA revealed a significant interaction consistent with this pattern, F(2, 72) = 7.90, p < .001, MSE = .009. Follow-up analyses showed a significant difference between equal- and unequal-variance blocks for strong targets, t(36) = 3.47, p < .001, but not weak targets or lures (lowest p = .352).
The source-block data were submitted to a 2 (gender) × 2 (strength) × 2 (variance block) ANOVA. On the basis of the distributions in Fig. 2, the estimates should show no effect of gender on evidence variability. Consistent with this expectation, η values were very similar for male (.20) and female (.18) items. The ANOVA found no main effect of gender, F(1, 36) = 1.83, n.s., MSE = .012, and no significant interactions involving the gender variable (lowest p = .583). The distributions in Fig. 2 also suggest that η estimates should be higher for strong than for weak items, but only in the unequal-variance blocks. Validating this pattern, the results showed a significant interaction between strength and block type, F(1, 36) = 18.63, p < .001, MSE = .011. Follow-up 2 (gender) × 2 (strength) ANOVAs showed a difference between strong (.26) and weak (.15) items in the unequal-variance blocks, F(1, 36) = 34.17, p < .001, MSE = .012, but very similar values for strong (.18) and weak (.18) items in the equal-variance blocks, F(1, 36) = 0.02, n.s., MSE = .008.
Experiment 1 conclusions
Experiment 1 showed that the diffusion model can accurately discriminate equal- and unequal-variance distributions based on response proportion and RT data. Thus, RT modeling provides an opportunity to double-check conclusions about evidence variability based on ROC analyses. The following experiments address this goal.
Experiment 2: Recognition memory
Experiment 2 examines the effect of learning strength on evidence variability in a recognition memory task. Results are expected to replicate Starns and Ratcliff (2014); that is, evidence should be more variable for targets than for lures, but additional learning should have little or no effect on the evidence variability parameter. This experiment will provide a comparison data set that uses very similar materials and methods to those of the source memory experiment below.
Thirty-three University of Massachusetts undergraduates participated to earn extra credit in their psychology courses.
The stimuli for each participant were randomly drawn from a pool of 680 low-frequency nouns, verbs, and adjectives (Kučera & Francis, 1967). For the practice cycle, the study list contained 20 words, with half presented once and half presented three times. The test comprised these 20 targets, along with 20 lure words that were not on the study list. For the true experiment cycles, the study list had 25 targets studied once and 25 targets studied three times. The tests contained 50 targets and 50 lures. No words appeared in more than one study/test cycle. The order of all lists was randomized anew for each participant, with the constraint that at least two other items intervened before the same strong target was repeated at study.
Design and procedures
Item type was manipulated within subjects. At the beginning of the experiment, participants were informed that they would study lists of words and take a memory test following each list. They were told that some of the words on each study list would be repeated and that they should try to remember all of the words as best they could. They were also correctly informed that words from previous study/test cycles would never appear in the current cycle, so they only had to remember the most recent study list. For each test, they were asked to press the “/” key if they thought the test word was studied or the “z” key if they thought the word was new. They were asked to balance speed and accuracy in their decisions. After the instructions, participants completed the practice cycle, and then they began the true experiment cycles.
Each studied word remained on the screen for 1,400 ms, followed by 100 ms of blank screen. Each test word remained on the screen until a response was made. Participants saw a “too slow” message following any RT longer than 1,400 ms and a “too fast” message following any RT shorter than 300 ms. Only 1 % of the trials required these messages. Accuracy feedback was not provided on a trial-by-trial basis, but participants saw a message reporting their proportion of correct responses after each test was completed.
Each participant completed three study/test cycles after the practice cycle. Participants initiated each cycle by pressing the space bar, and they were informed that they could take short breaks before starting a cycle.
Results and discussion
Responses made faster than 300 ms or slower than 3,500 ms were excluded from analyses, which eliminated less than 1 % of the data. Table 1 shows the proportion of “old” responses and the median RT for correct and incorrect responses in Experiment 2. As was expected, participants made more “old” responses for strong (.84) than for weak (.61) targets, with relatively few “old” responses for lure items (.15). Participants also made “old” responses more quickly for strong versus weak targets (673 vs. 713 ms, respectively). Error RTs were longer than correct RTs for all three item types.
Table 2 reports the average evidence-variability parameters across participants for lures, weak targets, and strong targets. Evidence was substantially more variable for weak and strong targets (.281 and .315) than for lures (.173) but was relatively similar across the two target strengths. An ANOVA showed a significant effect of item type, F(2, 64) = 26.66, p < .001, MSE = .007. Bonferroni-corrected follow-up analyses revealed a significant difference between lures and both types of targets (corrected p < .001 for both comparisons), but no difference between strong and weak targets (corrected p = .48). Even an uncorrected comparison of strong and weak targets failed to reach significance (p = .16). Therefore, the results matched the RT estimates reported by Starns and Ratcliff (2014), as well as past work with ROC functions (e.g., Ratcliff et al., 1994; Ratcliff et al., 1992).
Experiment 3: Source memory
Experiment 3 explored the effect of learning strength on evidence variability in a source memory task. If the strength effect on the asymmetry of source ROCs is based on decision strategies for using the confidence scale (Starns et al., 2013), then the fits to the two-response source task should show no differences in evidence variability based on strength. If additional learning does affect the overall variability of source evidence, then the variability estimates from the diffusion model should be higher for strong than for weak items.
Twenty-six University of Massachusetts undergraduates participated to earn extra credit in their psychology courses.
Stimuli for each participant were sampled from the same word pool that was used in Experiment 2. In addition, four faces were randomly selected from a pool of eight alternatives within each gender. All face pictures were taken from Maner et al. (2003). A different pair of male and female faces was used for each study/test cycle (including the practice cycle) to minimize interference. The practice study list contained 20 items, half studied with a male picture and half studied with a female picture. Half of the items within each gender were studied once, and half were studied three times. Study lists for the true experiment cycles had the same organization, but the total number of items was increased to 48. The order of the study lists was independently randomized for each participant, with the constraint that at least two words intervened before the same word was repeated. No words were repeated across study/test cycles. Each test list contained all of the studied items in a random order.
Design and procedure
The independent variables were source (male versus female) and strength (one versus three learning trials), and both were manipulated within subjects. The initial instructions were similar to those in Experiment 2, except that participants were informed that they would have to remember whether each word was studied with a male or a female face. At study, each word–picture combination remained on the screen for 1,900 ms, followed by 100 ms of blank screen. At test, participants were asked to press the “/” key or the “z” key to indicate that the word was studied with a male or a female face, respectively. They were again asked to balance speed and accuracy. “Too fast” and “too slow” messages were displayed following trials with RTs <300 ms and >1,600 ms, respectively. Only 1 % of the trials had RTs outside of these boundaries. Participants received feedback on their proportion of correct responses following each test list. Each participant completed the practice cycle and three regular cycles.
Trials with RTs shorter than 300 ms or longer than 3,500 ms were excluded from analyses, which eliminated less than 1 % of the data. Table 1 reports the proportion of “male” responses and the RT medians for the four conditions. As was expected, extra learning trials increased the proportion of “male” responses for male items (.68 vs. .82) and decreased the proportion of “male” responses for female items (.33 vs. .19). Correct responses were faster for strong than for weak items, and error responses were consistently slower than correct responses.
Table 3 reports the average evidence variability (η) parameters for the four conditions. A 2 (source) × 2 (strength) ANOVA showed that η estimates were higher for strong (.29) than for weak (.22) items, F(1, 25) = 8.77, p < .01, MSE = .015. There was no main effect of source, F(1, 25) = 2.04, n.s., MSE = .005, and no interaction F(1, 25) = 0.31, n.s., MSE = .006.
The variability estimates from the RT model showed the same pattern as the ROC literature: Source evidence was more variable for items with strong learning than for items with weak learning (Starns et al., 2013; Yonelinas & Parks, 2007). Although decision strategies for using a confidence rating scale do play a role in producing the ROC strength effect (Starns et al., 2013), the present results suggest that confidence-rating artifacts do not provide a complete account.
Comparing recognition and source memory
Considering Experiments 2 and 3 together, RT modeling supports the same conclusion as ROC modeling; that is, additional learning increases the variability of source evidence, but not recognition evidence. However, this conclusion requires accepting a nonsignificant hypothesis test as evidence for the lack of an effect in recognition memory. In this section, I use a Bayesian approach to evaluate the possibility of a selective effect on source memory. Bayesian statistics are particularly well-suited to the question because they can quantify support for the null hypothesis and because they provide a principled method for combining the present recognition results with the previous diffusion model fits to recognition data. I applied the Bayesian t-test outlined by Rouder, Speckman, Sun, Morey, and Iverson (2009). The likelihood for a given effect size was the probability density of the observed t value on a t distribution with a noncentrality parameter determined by the effect size (all tests were paired-samples tests, so the df was N – 1). When multiple experiments were considered, the likelihoods were multiplied across observed t values. The null model assumed an effect size of zero, and the alternative model had a standard normal distribution as the prior on effect size. Reported Bayes factors give the probability of the observed data under the null divided by the probability under the alternative.
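The test just described can be sketched in a few lines. The sketch below uses the standard normal prior on effect size specified above (note that the default JZS prior in Rouder et al., 2009, is a Cauchy; the normal variant is the one used here), and the function name and example t values are hypothetical:

```python
import numpy as np
from scipy import stats
from scipy.integrate import quad

def bayes_factor_null_over_alt(t_values, n_values):
    """Bayes factor (null over alternative) for paired-samples t values.

    Null: effect size delta = 0, so each t follows a central t with n - 1 df.
    Alternative: delta has a standard normal prior; each t then follows a
    noncentral t with noncentrality delta * sqrt(n). Likelihoods are
    multiplied across experiments.
    """
    def likelihood(delta):
        p = 1.0
        for t, n in zip(t_values, n_values):
            p *= stats.nct.pdf(t, n - 1, delta * np.sqrt(n))
        return p

    lik_null = likelihood(0.0)
    # Marginal likelihood under the alternative: integrate over the prior.
    lik_alt, _ = quad(lambda d: likelihood(d) * stats.norm.pdf(d), -8, 8)
    return lik_null / lik_alt

# A large t favors the alternative (BF < 1); a small t favors the null (BF > 1).
print(bayes_factor_null_over_alt([3.5], [26]))
print(bayes_factor_null_over_alt([0.3], [26]))
```

Combining experiments simply means passing multiple t values and sample sizes, so that the per-experiment likelihoods multiply inside the integral.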
Echoing the standard analyses, the Bayesian test showed a strong effect of additional learning on η parameters for source memory. The probability of the data was nearly 12 times higher for the alternative than for the null hypothesis (BF = .084). In contrast, the recognition data in Experiment 2 were more consistent with the null than with the alternative hypothesis, although the results were more equivocal (BF = 2.797). A single study will almost never provide strong support for the null hypothesis, so I considered the present recognition results together with five similar data sets that Starns and Ratcliff (2014) identified as having sufficient observations to estimate separate η parameters for each item type.4 With all of the available studies combined, the probability of the data was over 15 times higher for the null hypothesis than for the alternative (BF = 15.309). Thus, the currently available diffusion model results strongly support the contention that additional learning increases evidence variability for source memory but not recognition.
My primary goal was to approach the variability question as a parameter estimation issue, but the tests for variability differences could also be framed in terms of model selection. In this section, I evaluate whether differences in model fit support the same conclusions as the analyses on parameter values. For Experiment 2, I tested the difference in η values between targets and lures by comparing the fit of a model with a single η parameter across all item types (one-η) with that of a model with one η for lures and another for both types of target (two-η). To test for a difference between strong and weak targets, I compared the fit of the two-η model with that of the three-η model originally fit to the data. For Experiment 3, I tested for strength effects on η by comparing the fit of a model with a single η value for all item types with that of a model with one η parameter for weak male and female items and another for strong male and female items. Under the null hypothesis of equal η parameters, the difference in fit for each participant should theoretically follow a χ2 distribution with one df, and the summed χ2 across the N participants should follow a χ2 distribution with N df. For each comparison, I performed a single χ2 test on the summed difference in fit across participants.5
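The summed fit-difference test amounts to a single χ2 computation. A minimal sketch (the function name is illustrative):

```python
from scipy import stats

def summed_fit_difference_test(fit_diffs):
    """Chi-square test on per-participant fit differences.

    Under the null of equal eta parameters, each participant's difference
    in fit between the constrained and free models follows chi2(1), so the
    sum over N participants follows chi2(N).
    """
    total = sum(fit_diffs)
    df = len(fit_diffs)          # one df per participant
    p = stats.chi2.sf(total, df)
    return total, df, p
```

For instance, a summed difference of 113.72 over 33 participants yields p < .001, whereas 45.34 over 33 participants falls short of the .05 criterion (critical value ≈ 47.40).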
For Experiment 2, the fit comparisons indicated a significant difference between a model with a single η value for all item types and a model with different η values for targets and lures, χ2 (33) = 113.72, p < .001. In contrast, results showed no evidence of a difference between the two-η model and a model with separate η parameters for strong targets, weak targets, and lures, χ2 (33) = 45.34, n.s. For Experiment 3, fit statistics showed a significant difference between a single-η model and a model with different η parameters for strong and weak items, χ2 (26) = 56.71, p < .001. Thus, the fit comparisons supported the same conclusions as the analyses on η estimates across participants: Evidence variability differed between targets and lures in recognition memory but did not change on the basis of learning strength, whereas evidence variability increased with learning strength for source memory.
The diffusion model accurately distinguished equal- and unequal-variance evidence distributions in fits to the numerousity discrimination task (Experiment 1). These results extend the parameter validation literature to a new parameter (η) and provide further confirmation that the model appropriately measures the psychological processes involved in decision making. For the recognition memory data (Experiment 2), applying the model revealed that memory evidence was more variable for targets than for lures, but additional learning had little or no effect on evidence variability. For source memory (Experiment 3), variability estimates did not differ between male and female items within a strength class, but they were higher for strong items than for weak items. Overall, the diffusion results support the same conclusions about evidence variability as previous experiments using ROC analyses.
Starns et al. (2013) showed that participants are more willing to make high-confidence source responses when they are more confident that an item was studied, and this decision strategy has the same influence on source memory ROC functions as an increase in evidence variability with stronger learning. As a result, Starns et al. (2013) suggested that additional learning might not actually affect the variability of source evidence. The present results provide evidence against this suggestion. Even when confidence ratings were eliminated and evidence variability was estimated from RT distributions, results showed an increase in variability from weak to strong items. As such, the present study provides a good example of how RT modeling can be used to disambiguate memory and decision processes.
Alternatives to unequal variance
In ROC research, one well-known competitor for the unequal-variance approach is Yonelinas’s (1994) dual-process model. This model assumes that recollection succeeds for a proportion of the target items equal to R, and for these items, decisions are based on retrieving qualitative contextual information from the learning event. Decisions for the remaining targets (and all of the lures) are based on familiarity, a continuous strength signal that tends to be higher for items that were recently encountered. Familiarity values are assumed to be equally variable for targets and lures, so the model predicts symmetrical ROC functions if none of the studied items are recollected. When recollection succeeds for some items, the predicted ROC function becomes asymmetrical. Thus, mixing decisions based on recollection and familiarity produces an ROC shape that is very similar to the shape predicted by the unequal-variance signal detection model (Wixted, 2007; Yonelinas & Parks, 2007).
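The ROC predictions just described follow from a simple mixture formula: the false-alarm rate is the familiarity mass above the response criterion, and the hit rate adds the recollected proportion R. A minimal sketch under equal-variance Gaussian familiarity (the parameter values are illustrative, not estimates):

```python
import numpy as np
from scipy.stats import norm

def dual_process_roc(R, d_prime, criteria):
    """ROC points for the Yonelinas (1994) dual-process model.

    Lures are judged on familiarity alone; targets are recollected with
    probability R and otherwise judged on familiarity with mean d_prime.
    """
    far = norm.sf(criteria)                        # familiarity above criterion
    hr = R + (1 - R) * norm.sf(criteria - d_prime) # recollection + familiarity
    return far, hr

criteria = np.linspace(-1, 2.5, 5)   # illustrative criterion placements
far, hr = dual_process_roc(R=.3, d_prime=1.0, criteria=criteria)
# With R > 0 the ROC is asymmetrical: its y-intercept approaches R as the
# criterion becomes very strict (far -> 0). With R = 0 the function is the
# symmetrical equal-variance ROC.
```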
Although the models show nearly complete mimicry for ROC data, the present results highlight one unique advantage of the unequal-variance approach. Namely, this approach has been successfully implemented in models that accommodate RT data in addition to response proportions. A number of studies have simultaneously modeled both RT and ROC data, and the successful models in all of these studies used unequal-variance evidence distributions (Dube et al., 2012; Ratcliff & Starns, 2009, 2013; Starns, Ratcliff, & McKoon, 2012). RT modeling also offers a chance to validate conclusions from ROC analyses with a completely independent form of data. Together with Starns and Ratcliff (2014), the present results show an impressive level of agreement in conclusions about evidence variability from RT and ROC modeling.
At this point, there is no version of the dual-process model that is capable of estimating recollection and familiarity using RT distributions. Therefore, the dual-process account cannot be tested by evaluating the consistency of RT and ROC modeling. Although the present results do not directly refute the dual-process approach, they do offer a form of support for the unequal-variance account that is currently not available for the dual-process account. The same can be said for pure threshold models of ROC data (Batchelder & Riefer, 1999; Bröder & Schütz, 2009), since this approach has not been systematically applied to modeling RT distributions (Dube et al., 2012).
A variety of mixture models have been proposed to accommodate ROC data (e.g., DeCarlo, 2002; Onyper, Zhang, & Howard, 2010). These models all assume that studied items fall into two or more latent classes that differ in strength, such as attended versus unattended items. Mixing these latent classes functionally increases the variability of memory evidence for studied items. Both ROC analyses and diffusion model fits are relatively insensitive to the functional form of the underlying distributions (Ratcliff, 2013), so the currently available data cannot discriminate models in which the evidence distributions are Gaussian distributions versus mixtures of Gaussian distributions. Although I chose to implement the unequal-variance assumption using Gaussian distributions, the present results are fully consistent with the predictions of mixture models.
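The variance-inflating effect of mixing latent classes is easy to verify: a mixture of two unit-variance Gaussians with mixing proportion p and means μ1 and μ2 has variance 1 + p(1 − p)(μ1 − μ2)2, which exceeds the within-class variance whenever the means differ. A quick simulation (the class means and mixing proportion are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
p = .5                                  # proportion of "attended" items (illustrative)
mu_attended, mu_unattended = 1.5, .5    # latent-class means (illustrative)
n = 500_000

# Each studied item is drawn from one of two unit-variance evidence distributions.
attended = rng.random(n) < p
evidence = np.where(attended,
                    rng.normal(mu_attended, 1, n),
                    rng.normal(mu_unattended, 1, n))

# Mixture variance = 1 + p*(1 - p)*(mu1 - mu2)**2 = 1.25 here,
# larger than the within-class variance of 1.
print(round(evidence.var(), 2))
```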
Hierarchical Bayesian modeling
Recently, theorists have implemented the diffusion model in a hierarchical Bayesian format (e.g., Vandekerckhove, Tuerlinckx, & Lee, 2011). This technique offers many advantages that are especially relevant when estimating parameters that can have subtle influences on the data, such as η. For example, parameter estimates for each participant can be influenced by the data from other participants via a participant-level distribution that is simultaneously estimated by the model. Future research should address whether hierarchical modeling also shows differences in η values for memory tasks. Although the traditional modeling procedures used herein are not ideal, Experiment 1 provides strong evidence that they appropriately discriminate equal- and unequal-variance evidence distributions.
Researchers often draw conclusions about the nature of memory evidence on the basis of the properties of ROC functions, but these properties can also be influenced by decision mechanisms (Benjamin et al., 2009; Mueller & Weidemann, 2008; Ratcliff & Starns, 2009, 2013; Starns et al., 2013; Van Zandt, 2000). Thus, conclusions about memory must be corroborated with independent forms of evidence. The present experiments show that ROC and RT modeling support the same conclusions about the effect of additional learning on evidence variability in recognition and source tasks, providing further evidence that the unequal-variance assumption is critical in accounting for memory decisions. Together with past work, the present results demonstrate that RT modeling provides a unique and rigorous test of claims about memory processes (Criss, 2010; Dube et al., 2012; Ratcliff & Starns, 2009, 2013; Ratcliff, Thapar, & McKoon, 2004; Starns, Ratcliff, & McKoon, 2012; Starns, Ratcliff, & White, 2012; Van Zandt, 2000).
Acting alone, no other model parameter can produce the same effects on the data as changes in the across-trial drift variability parameter (η), not even the other across-trial variability parameters (sZ and sT). However, allowing all of the parameters to vary across stimulus levels produces a very flexible model that can closely mimic the effects of changes in the η parameter even if η is constant across stimulus types. For this reason, η values cannot be accurately recovered in a completely unconstrained model. Given that fixing decision criteria and nondecision parameters across stimulus levels is an extremely common practice with strong theoretical justifications, my goal was to explore the effects of variables on η estimates under these constraints.
The .3 and .7 quantiles were also fit but are not displayed, to reduce clutter.
As was mentioned, the model actually fits response frequencies in the 6 RT bins separated by the .1, .3, .5, .7, and .9 quantiles. The .9 quantiles are in the low-density tail of the distribution, so large shifts in this quantile have relatively small effects on the number of counts in the 5th and 6th RT bins. As a result, the model can make what appears to be a large miss in the .9 quantile, but this actually translates into a more subtle miss in the bin counts that are used to calculate the fit statistic.
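The point can be illustrated with any right-skewed RT distribution; the lognormal below is a stand-in with purely illustrative parameters, not fitted values. Even a 100-ms miss in the .9 quantile moves only a few percent of the response mass between the last two bins:

```python
from scipy import stats

# Illustrative right-skewed RT distribution (in seconds); not fitted values.
rt_dist = stats.lognorm(s=.4, scale=.6)

q90 = rt_dist.ppf(.9)        # the distribution's .9 quantile
q90_missed = q90 + .100      # a visually large 100-ms miss in that quantile
# Probability mass shifted between the 5th and 6th RT bins by the miss:
mass_moved = rt_dist.cdf(q90_missed) - rt_dist.cdf(q90)
print(round(mass_moved, 3))  # only a few percent of responses
```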
Results are from data sets 2, 6, 7, 8, and 9 in Starns and Ratcliff (2014). Tests were conducted on η values from fits with separate η parameters for lures, weak (nonrepeated) targets, and strong (repeated) targets within each word frequency condition. The present experiments used low-frequency words as stimuli, so I used only the low-frequency conditions from previous studies. Results were very similar if both high- and low-frequency words were considered.
To ensure that the fit differences truly followed a χ2 distribution (or at least conformed closely enough to avoid incorrect conclusions), I performed Monte Carlo simulations in which I simulated data sets from the diffusion model with each participant’s best-fitting equal-η parameters and fit the simulated data with both equal-η and free-η models. I performed 50 replications including all participants for each of the χ2 tests reported in the text, and I summed the fit differences across participants for each replication. The .95 quantile of the distribution of fit differences across replications was always close to the theoretical χ2 critical value with α = .05, so there is no reason to expect that deviations from the χ2 distribution affected the outcome of any of the reported χ2 tests.