The ability to recognize previously experienced information or events is one of the most fundamental faculties of human memory. Not surprisingly, recognition memory is a central topic in memory research, with several models assuming different underlying processes having been proposed in the literature (for a review, see Malmberg, 2008). In the present work, we focus on members of the prominent class of signal detection models (Green & Swets, 1966) that have been at the center of the major debates.

One of the models proposed is the unequal-variance signal detection model (UVSD; Green & Swets, 1966; Lockhart & Murdock, 1970), which assumes a continuous memory process, often termed familiarity, to describe individuals’ memory-based judgments. A depiction of the model is provided in Fig. 1. Both old and new items evoke some degree of familiarity, with separate familiarity distributions for old and new items. The difficulty in discriminating between the two types of items is determined by the degree of overlap between the two distributions. Recognition-memory judgments (e.g., using a confidence scale) are produced by comparing the familiarity of the test item with one or several criteria placed along the familiarity axis (see Fig. 1). The familiarity distributions are usually assumed to be Gaussian, with parameters \(\{\mu_o, \sigma_o\}\) and \(\{\mu_n = 0, \sigma_n = 1\}\) denoting the means and standard deviations of the old-item and new-item distributions, respectively. Parameter \(\sigma_o\) is commonly found to be larger than \(\sigma_n\), a difference that is interpreted as the result of encoding variability during the study phase (e.g., Wixted, 2007).
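To make these assumptions concrete, the following base-R sketch computes the cumulative hit and false-alarm rates implied by the UVSD for a set of confidence criteria. The function name, parameter values, and criteria are illustrative assumptions, not estimates from any dataset analyzed here.

```r
## Cumulative UVSD predictions: probability of exceeding each confidence criterion.
## Parameter values and criteria are illustrative only.
uvsd_roc <- function(mu_o, sigma_o, criteria) {
  data.frame(
    criterion = criteria,
    fa  = pnorm(criteria, mean = 0,    sd = 1,       lower.tail = FALSE),  # new items
    hit = pnorm(criteria, mean = mu_o, sd = sigma_o, lower.tail = FALSE)   # old items
  )
}

## Five criteria yield the five interior points of a six-point ROC.
uvsd_roc(mu_o = 1.0, sigma_o = 1.3, criteria = c(-1, -0.5, 0, 0.5, 1.5))
```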

Fig. 1 Depiction of the UVSD, DPSD, and MSD models and an ROC function

The dual-process signal detection model (DPSD; Yonelinas & Parks, 2007) assumes the combination of a vague, continuous familiarity process (assumed to be equivalent to the one in the UVSD, only with \(\sigma_o = 1\)) and a threshold-based episodic-retrieval component, termed recollection. When judging an old item, an individual can recollect the item with probability \(R\). It is usually assumed that recollected items are always recognized with maximum confidence (Yonelinas & Parks, 2007). When recollection fails (with probability \(1-R\)), the recognition judgment is based on the item’s familiarity, with discriminability determined by \(\mu_o\) (and \(\sigma_o = \sigma_n = 1\)). When judging a new item, recollection cannot occur and the item is evaluated solely in terms of its familiarity (see Fig. 1). Recollection and familiarity are assumed to be independent processes. These processes can be selectively influenced, and in some cases one of them is expected to be the sole source of above-chance performance (e.g., tasks in which recollection alone drives above-chance performance; see Yonelinas & Parks, 2007).
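Under the same caveats (illustrative values, hypothetical function name), a minimal sketch of the cumulative response probabilities implied by the DPSD:

```r
## Cumulative DPSD predictions: old items are recollected with probability R and then
## mapped onto the highest-confidence "old" response; otherwise familiarity (sd = 1) decides.
dpsd_roc <- function(R, mu_o, criteria) {
  data.frame(
    criterion = criteria,
    fa  = pnorm(criteria, lower.tail = FALSE),                            # new items
    hit = R + (1 - R) * pnorm(criteria, mean = mu_o, lower.tail = FALSE)  # old items
  )
}

dpsd_roc(R = 0.25, mu_o = 0.8, criteria = c(-1, -0.5, 0, 0.5, 1.5))
```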

The mixture signal detection model (MSD; DeCarlo, 2002) is similar to the UVSD but assumes that the familiarity of studied items is described not by a single distribution but by a mixture of two familiarity distributions (see Fig. 1). One distribution (with mean \(\mu_o\)) corresponds to the items that were attended to during study, and the other (with mean \(\mu^{*}_{o}\)) to items that were not attended. Parameter \(\lambda\) denotes the probability of a studied item being attended to during study. The most common implementation of the MSD (DeCarlo, 2002) assumes that all familiarity distributions have the same unit standard deviation (\(\sigma^{*}_{o} = \sigma_{o} = \sigma_{n} = 1\)). A restricted version of the MSD, which we will refer to as MSD0, additionally assumes that performance for unattended items is at chance level (\(\mu^{*}_{o} = 0\); DeCarlo, 2002). Also, note that the MSD reduces to the DPSD when \(\mu_o\) takes on extremely large values.
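Analogously, a minimal sketch of the MSD's cumulative response probabilities (again with illustrative values only, and with msd_roc a hypothetical helper); setting mu_o_star to zero gives the MSD0 case:

```r
## Cumulative MSD predictions: old-item familiarity is a mixture of an attended
## (probability lambda, mean mu_o) and an unattended (mean mu_o_star) distribution.
msd_roc <- function(lambda, mu_o, mu_o_star, criteria) {
  data.frame(
    criterion = criteria,
    fa  = pnorm(criteria, lower.tail = FALSE),
    hit = lambda * pnorm(criteria, mean = mu_o, lower.tail = FALSE) +
          (1 - lambda) * pnorm(criteria, mean = mu_o_star, lower.tail = FALSE)
  )
}

## mu_o_star = 0 corresponds to the MSD0.
msd_roc(lambda = 0.7, mu_o = 1.5, mu_o_star = 0, criteria = c(-1, -0.5, 0, 0.5, 1.5))
```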

These signal-detection models are often compared by means of Receiver Operating Characteristic (ROC) functions, which plot individuals’ cumulative confidence responses (from “sure old” to “sure new”) for new and old items on the abscissa and ordinate, respectively (see the bottom panel of Fig. 1). ROCs are widely used in the psychological literature as a means to test different theories (Yonelinas & Parks, 2007). The basic familiarity process assumed by the models accounts for the ROC curvature, while the observed ROC asymmetry is accounted for by encoding variability (\(\sigma_o > \sigma_n\)) in the case of the UVSD, by recollection (\(R > 0\)) in the case of the DPSD, and by attentional shifts (\(0 < \lambda < 1\)) in the case of the MSD/MSD0. Note that all of these models have the equal-variance signal detection model (EVSD) as a special case (UVSD: \(\sigma_o = 1\); DPSD: \(R = 0\); MSD/MSD0: \(\lambda = 1\)). In fact, all of these models can be seen as different ways of extending the EVSD to account for ROC asymmetry.
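For readers less familiar with ROC construction, the following sketch shows how cumulative ROC points are obtained from confidence-rating frequencies; the counts are hypothetical.

```r
## Hypothetical response counts, ordered from "sure new" (category 1) to "sure old" (category 6).
new_counts <- c(45, 25, 15, 8, 4, 3)
old_counts <- c(5, 8, 12, 15, 25, 35)

## Cumulate from the "sure old" end: point k gives P(response in category k or higher).
roc_points <- function(counts) rev(cumsum(rev(counts)))[-1] / sum(counts)
data.frame(fa = roc_points(new_counts), hit = roc_points(old_counts))
```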

Despite an intensive debate already spanning decades, the discussion of which model provides the best characterization of the data is still ongoing (for recent reviews, see Yonelinas & Parks, 2007; Wixted, 2007). This unsatisfactory state of affairs has led researchers to search for alternative ways to compare the models. In the present manuscript we discuss one particular approach recently proposed by Dede et al. (2014), which relies on the residuals produced by the models’ fits to ROC data.

ROC residuals

Instead of relying on ROC-fit statistics or related model-selection indices, Dede et al. (2014) focused on the pattern of residuals produced by the UVSD’s and DPSD’s best-fitting predictions (the MSD and MSD0 were not considered). The logic underlying Dede et al.’s work is as follows: If one of the models successfully characterizes the underlying processes, then the residuals produced when fitting ROC data should not be systematic; the average residuals should not differ systematically from zero for any response category, as they reflect nothing more than sampling variability. On the other hand, if a model does not provide a suitable characterization of the underlying processes, then one should observe systematic residuals. A reanalysis of previously published data showed the presence of systematic residuals for the DPSD but not for the UVSD.

One limitation of Dede et al.’s (2014) work is that their analyses relied on ROC data from very few independent sources. General claims regarding the relative performance of models based on a new method should ideally rest on a larger and richer set of ROC data. In order to overcome this limitation we analyzed Old-New ROC mimicry and residuals using an extended set of individual ROC data obtained from several different sources (Benjamin et al., 2013; Dube & Rotello, 2012; Heathcote et al., 2006; Jaeger et al., 2012; Jang et al., 2009; Koen et al., 2013; Koen & Yonelinas, 2010, 2011; Onyper et al., 2010; Pratte et al., 2010; Smith & Duncan, 2004; Van Zandt, 2000), for a total of 883 individual Old-New ROCs (492 six-point ROCs and 391 eight-point ROCs). These individual ROCs are depicted in Fig. 2. Although the composition of this extended set of data is not exhaustive and corresponds to a convenience sample, it is nevertheless appropriate for testing the suitability of using residual analyses for purposes of model selection.

Fig. 2 Individual and mean Old-New ROC data. A more detailed reference to each dataset is provided in Table 1. Individual ROCs are plotted with 80 % transparency in the background so that overlapping ROCs are displayed darker

Another limitation of Dede et al. (2014) concerns the criterion used when evaluating residuals: The pattern of residuals was evaluated using independent t-tests for each response category and dataset separately. Residuals were only considered to be systematically different from zero for a response category when statistically significant differences (in the same direction; \(p < .05\)) were found in every analyzed dataset. Such an approach is questionable given that it assumes that an effect only truly exists when the null hypothesis is rejected in all studies individually, ignoring the well-known relationship between statistical power and the frequency of statistically significant effects (Cohen, 1988). In particular, their approach guarantees that the probability of detecting an effect decreases as more datasets are included in the analysis (e.g., with 80 % power the probability of always finding a significant effect across 5, 10, and 20 datasets is 33 %, 11 %, and 1 %, respectively). A more reasonable approach is employed in the analysis reported below, which consists of a meta-analytic estimation of effects using a linear mixed model (LMM) analysis that treats the source of the data as a random effect (Barr et al., 2013).
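Because the datasets are treated as independent tests, the probability of an effect detected with 80 % power reaching significance in every one of k datasets is simply \(.8^k\), which reproduces the figures above:

```r
## Probability that an effect detected with 80 % power is significant in all k datasets.
round(0.8^c(5, 10, 20), 3)
#> [1] 0.328 0.107 0.012
```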

Model fits and residual analysis

We first report an analysis of the residuals produced by the four models when fitting the above-described set of 883 individual ROCs. Old-New ROCs were fitted with MPTinR (Singmann & Kellen, 2013) via the maximum-likelihood method. Goodness-of-fit results are provided in Table 1. Details on the specification of the models, the data, and the analysis scripts can be found at https://osf.io/p2eq8/. The goodness-of-fit results reported in Table 1 show that, among the models with the smaller number of parameters (i.e., excluding the MSD), the best-fitting model was the MSD0, followed by the UVSD and the DPSD. However, in terms of overall fit performance the MSD was significantly better than the DPSD and MSD0 (smallest summed \(\Delta G^2 = 1238.30\), largest \(p < .0001\); see Footnote 1).
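The actual fits were obtained with MPTinR (see the OSF repository above for the scripts); purely for illustration, the following self-contained base-R sketch fits the UVSD to one hypothetical six-point ROC by maximum likelihood and computes the corresponding \(G^2\) value. Counts, starting values, and the parameterization of the criteria are our own illustrative assumptions.

```r
## Hypothetical counts, ordered from "sure new" to "sure old".
new_counts <- c(45, 25, 15, 8, 4, 3)
old_counts <- c(5, 8, 12, 15, 25, 35)

## UVSD response-category probabilities as differences of the familiarity CDF at the criteria.
uvsd_probs <- function(mu_o, sigma_o, criteria) {
  cuts <- c(-Inf, criteria, Inf)
  list(new = diff(pnorm(cuts)),
       old = diff(pnorm(cuts, mean = mu_o, sd = sigma_o)))
}

## Multinomial negative log-likelihood (criteria kept ordered via log-increments).
neg_ll <- function(par) {
  p <- uvsd_probs(mu_o = par[1], sigma_o = exp(par[2]),
                  criteria = cumsum(c(par[3], exp(par[4:7]))))
  -sum(new_counts * log(p$new)) - sum(old_counts * log(p$old))
}
fit <- optim(c(1, 0, -1, rep(log(0.5), 4)), neg_ll, control = list(maxit = 5000))

## G^2: twice the log-likelihood distance to the saturated model.
sat_ll <- sum(new_counts * log(new_counts / sum(new_counts))) +
          sum(old_counts * log(old_counts / sum(old_counts)))
G2 <- 2 * (sat_ll + fit$value)
G2
```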

Table 1 Summary of Fitted Data Sets

Although these results could be interpreted as a victory for the MSD0 and a clear rejection of the DPSD (when looking only at models with the same number of parameters), such a conclusion is premature at this point given that the goodness-of-fit performance of these models is not corrected for their respective flexibilities. According to model-selection statistics from the Minimum Description Length (MDL) framework, the DPSD is less flexible than the UVSD and MSD0 in the case of ROC data, despite the fact that all three models have the same number of parameters (Kellen et al., 2013; Klauer & Kellen, 2015). These differences in flexibility, which are due to the functional form of the models, are not captured by common model-selection indices such as the Akaike and Bayesian information criteria and can have a large impact on model-comparison results. In fact, an MDL-based meta-analysis conducted by Klauer and Kellen (2015) shows that, when flexibility due to functional form is taken into account, the DPSD tends to outperform models like the UVSD or MSD0.

The old-item and new-item residuals (predicted minus observed response proportions) of the UVSD, DPSD, MSD0, and MSD (depicted in Figs. 3 and 4) were analyzed with LMMs (Barr et al., 2013) using “Experiment” as a random effect. We chose this analysis in order to estimate the overall residual pattern across studies while taking the idiosyncrasies of each study into account (e.g., Singmann et al., 2014). To evaluate whether the residuals systematically deviate from zero we fitted separate LMMs to the residuals of the four models using the R package lme4 (Bates et al., 2014). Each LMM had a fixed effect for the response categories of both old and new items (i.e., twelve levels in the case of six-point ROCs and sixteen levels in the case of eight-point ROCs). Additionally, each LMM was set up in a way that controls for the sample size of each study (ensuring that studies were not weighted equally), as is traditionally done in meta-analytic studies (e.g., Hedges & Olkin, 1985). This was achieved by adding a fixed effect for the sample size of each study (centered at the weighted mean sample size) and its interaction with “Response Category” (see Footnote 2). Furthermore, we allowed the effects to vary across experiments by estimating random slopes for the factor “Response Category” (adding the corresponding random slopes for participants would lead to an oversaturated model). We estimated neither an overall intercept nor random intercepts for “Experiment” or “Participant”, given that the mean of the residuals is zero a priori (as observed and predicted proportions sum to one per item type). To avoid local minima, each LMM was estimated with all available optimization algorithms using the function allFit from the package afex (Singmann et al., 2015).
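The complete analysis code is available in the OSF repository linked above. The following is only a schematic sketch of the model specification just described, using simulated toy data and assumed variable names (residual, category, n_c, exp) rather than the actual residuals; the real analysis used 12 or 16 response-category levels.

```r
library(lme4)

## Toy long-format residual data (hypothetical; fewer categories than the real analysis).
set.seed(1)
d <- expand.grid(exp = factor(paste0("E", 1:10)),
                 ppt = 1:30,
                 category = factor(paste0("cat", 1:4)))
d$n_c      <- (as.numeric(d$exp) - 5.5) * 10   # study sample size, centered (toy values)
d$residual <- rnorm(nrow(d), sd = 0.02)        # toy residuals (predicted minus observed)

## No intercepts, fixed effects for response category and its interaction with
## sample size, and random slopes for response category across experiments.
m <- lmer(residual ~ 0 + category + category:n_c + (0 + category | exp), data = d)
summary(m)

## A Wald chi-square test of the category fixed effect could then be obtained,
## e.g., with car::Anova(m, type = 3).
```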

Fig. 3 Model residuals for six-point ROCs. Residuals correspond to the difference between predicted and observed response proportions; for values above zero, the model overestimates the response proportions. The left panels depict observed mean residuals per experiment. Response categories whose residuals systematically deviate from zero in the LMM analysis are marked with asterisks (. = \(p < .10\), * = \(p < .05\), ** = \(p < .01\), and *** = \(p < .001\)). The right panels depict estimated marginal mean residuals and (more conservative) confidence intervals with a simultaneous coverage probability of 95 %. We restricted the Type I error probability to .05 for each model. \(\Sigma G^2\) is the summed \(G^2\) of the model fits to the depicted data, RMSE the root-mean-squared error, and MAD the median absolute deviation of the estimated marginal mean residuals from the LMM (i.e., estimates of the amount of deviation)

Fig. 4 Model residuals for eight-point ROCs. See Fig. 3 for more details

For each of the eight LMMs (two per model) the fixed effect for “Response Category” was significant according to a Wald test, with the smallest effects occurring for the MSD0 residuals in the case of the six-point ROCs (\(\chi^2(12) = 82.10\), \(p < .0001\)) and for the MSD residuals in the case of the eight-point ROCs (\(\chi^2(16) = 167.73\), \(p < .0001\)). These results indicate that every model produced residuals that systematically deviated from zero. Had we used Dede et al.’s (2014) approach instead, the presence of systematic residuals would not have been detected for any response category in any of the models in the case of six-point ROCs, but it would have been detected for at least three different response categories in all models in the case of eight-point ROCs.

We used the LMMs to estimate the marginal effects of each response category. To evaluate whether each of these categories significantly differed from zero we used z-tests. In order to control the probability of Type I errors we used a generalization of the Bonferroni-Holm method that takes the correlation of the LMM’s parameter estimates into account (Bretz et al., 2010; see Footnote 3), restricting the overall Type I error probability to .05 for all tests conducted within each LMM. To quantify the magnitude of the residuals we used the marginal LMM estimates to calculate the root-mean-squared error (RMSE) and the median absolute deviation (MAD). The results are depicted in Figs. 3 and 4.
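As a rough illustration of these summary statistics, the sketch below computes RMSE, MAD, and corrected p-values from a toy vector of marginal estimates; note that it uses a plain Holm correction, whereas the reported analysis used the correlation-aware generalization of Bretz et al. (2010).

```r
## Toy marginal mean residual estimates and standard errors (hypothetical values).
est <- c(-0.012, 0.008, 0.015, -0.006, 0.002, -0.007)
se  <- c( 0.004, 0.003, 0.004,  0.003, 0.002,  0.003)

rmse    <- sqrt(mean(est^2))   # root-mean-squared error of the marginal estimates
mad_est <- median(abs(est))    # median absolute deviation from zero
c(RMSE = rmse, MAD = mad_est)

## z-tests against zero with a plain Holm correction (a simplification of the
## correlation-aware procedure used in the reported analysis).
p <- 2 * pnorm(abs(est / se), lower.tail = FALSE)
round(p.adjust(p, method = "holm"), 3)
```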

As can be seen in Figs. 3 and 4, significant deviations emerged for all models across several response categories. Furthermore, the residuals for old items were virtually a mirror image of the new-item residuals. This symmetry contrasts with Dede et al.’s (2014) claim that no systematic residuals could be found in the case of new items. In addition, the magnitude of the residuals reflected the models’ misfits as quantified by the \(G^2\) statistic, a correspondence that is expected given that all of these statistics are based on the divergence between observed and expected values.

For six-point ROCs the DPSD exhibited the largest misfit (\(\Delta G^2 > 400\)). The DPSD also showed the most pronounced residuals (\(\Delta \mathrm{RMSE} \approx .003\)), clearly mispredicting response categories 3 and 5 and, to a lesser degree, category 1. The other models showed somewhat smaller residuals, albeit also systematically mispredicting at least two or three response categories. In the case of eight-point ROCs, the largest misfit was observed for the UVSD (\(\Delta G^2 > 200\)). Here the UVSD showed the most pronounced residuals (\(\Delta \mathrm{RMSE} \approx .001\)), clearly mispredicting response category 2 and systematically mispredicting almost all other categories. In contrast, the other models showed less extreme and less systematic residuals, mispredicting only two response categories per item type.

Taken together, the LMM results are not consistent with Dede et al.’s (2014) findings. We found evidence for systematic residuals in all models, not only in the DPSD. Additionally, while the DPSD residuals were the most pronounced for the six-point ROCs, this was not the case for the eight-point ROCs, which suggests that relative model performance depends to some extent on features of the experimental design, such as the length of the confidence-rating scale used. Note, however, that the residual patterns for “new” judgments to both old and new items (i.e., for the three/four leftmost response categories for both old and new items in Figs. 3 and 4) are quite similar for six- and eight-point ROCs. Also, the systematic residuals of the UVSD, MSD0, and MSD tend to be more prevalent among these “new” judgments.

Residual analysis of model-generated data

We followed Dede et al. (2014) and checked how the residuals of each model relate to the predictions of the other models. Consider a scenario in which one of the models (e.g., the DPSD), when fitting data generated by another model (e.g., the UVSD), produces residuals that are similar to the ones obtained with real data, but this similarity is not found when the models exchange roles (e.g., when the UVSD fits DPSD-generated data). Under these circumstances, one could argue that one of the models is closer to the true data-generating processes than the other (e.g., that the UVSD is closer).

In order to investigate this possibility we fitted each model to the predicted frequencies of the other models. These predictions were obtained from the model fits to the real data. We restricted this analysis to the models having the same number of parameters: the UVSD, DPSD, and MSD0. The residuals produced by fitting these models’ predictions are shown in Fig. 5 for the six-point ROC data and Fig. 6 for the eight-point ROC data. The LMMs on the residuals revealed significant effects of “Response Category” for all twelve fits to model predictions (i.e., the six sets of residuals for the six-point ROCs in Fig. 5 plus the six sets for the eight-point ROCs in Fig. 6); smallest \(\chi^2(12) = 34.31\), \(p = .0006\), and \(\chi^2(16) = 382.44\), \(p < .0001\), for six- and eight-point ROCs, respectively. These results indicate that, as in the case of the real data, every model produced residuals that systematically deviated from zero when fitting the predicted values of the other models. The pairwise comparisons show an almost perfect mirror pattern of residuals. This result simply reflects the generating model’s residuals with respect to the original data. For example, take the case of the UVSD and the DPSD: The UVSD systematically underestimates response category 2 for both six-point and eight-point ROCs, while the DPSD makes no systematic misprediction for this category (see Figs. 5 and 6). This difference leads to the DPSD overestimating category 2 when fitting UVSD-generated data. The residuals from the UVSD fits to the MSD0 predictions (and vice versa) were more moderate, given the considerable similarity between the two models’ mispredictions of the original data. Again, our results are not consistent with Dede et al. (2014), as the residuals obtained when fitting model-generated data did not resemble the residuals obtained with real data; instead, they merely reflected the differences between the models’ (mis)predictions.
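To make the cross-fitting logic concrete, the following self-contained sketch generates the expected category frequencies of a UVSD with illustrative parameter values and refits them with the DPSD by maximum likelihood; all values and helper functions are hypothetical, whereas the reported analyses used the best-fitting predictions for each individual ROC obtained with MPTinR.

```r
n_new <- 100; n_old <- 100
criteria <- c(-1, -0.5, 0, 0.5, 1.5)   # five criteria -> six response categories
cuts <- c(-Inf, criteria, Inf)

## Step 1: expected ("predicted") category frequencies under a UVSD with illustrative values.
uvsd_new <- n_new * diff(pnorm(cuts))                        # new items
uvsd_old <- n_old * diff(pnorm(cuts, mean = 1.2, sd = 1.3))  # old items

## Step 2: fit the DPSD to these expected frequencies by maximum likelihood.
dpsd_probs <- function(R, mu_o, criteria) {
  cuts <- c(-Inf, criteria, Inf)
  old <- (1 - R) * diff(pnorm(cuts, mean = mu_o))
  old[length(old)] <- old[length(old)] + R   # recollection mapped to highest confidence
  list(new = diff(pnorm(cuts)), old = old)
}
neg_ll <- function(par) {
  R <- plogis(par[1]); mu_o <- par[2]
  crit <- cumsum(c(par[3], exp(par[4:7])))   # ordered criteria
  p <- dpsd_probs(R, mu_o, crit)
  -sum(uvsd_new * log(p$new)) - sum(uvsd_old * log(p$old))
}
fit <- optim(c(0, 1, -1, rep(log(0.5), 4)), neg_ll, control = list(maxit = 5000))

## Step 3: residuals = DPSD predicted minus UVSD-generated proportions (old items).
p_hat <- dpsd_probs(plogis(fit$par[1]), fit$par[2],
                    cumsum(c(fit$par[3], exp(fit$par[4:7]))))
round(p_hat$old - uvsd_old / n_old, 3)
```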

Fig. 5 Model residuals of fits to predicted values for six-point ROCs. In each plot, the residuals of the pairwise comparison in both directions are plotted in different colors (e.g., the topmost panels show residuals of DPSD fits to predicted values of the UVSD in black and residuals of UVSD fits to predicted values of the DPSD in gray). Significant deviations from zero are indicated by symbols in the corresponding color on either the top or bottom of the plot. As before, the probability of Type I errors is restricted to .05 for each fit. See Fig. 3 for more details

Fig. 6 Model residuals of fits to predicted values for eight-point ROCs. See Figs. 3 and 5 for more details

Discussion

The present analysis showed that systematic biases can be found in all of the models’ residuals when fitting ROC data, contradicting Dede et al.’s (2014) claim that only the DPSD produces systematic residuals. This result shows that the mere presence of systematic residuals does not constitute a suitable criterion for selecting among the present candidate models. At this point the following question should be posed: Is the systematicity of ROC residuals important at all? The answer is unequivocally “yes,” given that the systematic patterns found across studies clearly indicate that the models are consistently failing to characterize some of the behavioral regularities present in the data. The critical issue here is not that all models fail at some point given sufficient data, but that each model fails in a systematic fashion across a diverse set of studies. The importance of these results cannot be overstated given that the debate surrounding the merits of these models has been largely driven by their ability to account for ROC data.

The main challenge now is to understand whether these systematic residuals reflect a violation of the models’ core principles (e.g., independent recollection and familiarity processes) or of auxiliary distributional assumptions (e.g., Gaussian familiarity distributions, the response mapping of recollection, and mixtures of distributions). In order to test these possibilities it is necessary to consider modified or extended versions of these models, which can be developed in several ways. Let us first consider the familiarity distributions assumed by all four models. One possible explanation is that the Gaussian assumption adopted in all four models does not constitute a suitable representation and should be replaced by other distributional assumptions. The use of alternative distributional assumptions has been discussed since the introduction of SDT (see DeCarlo, 1998; Green & Swets, 1966; Killeen & Taylor, 2004), and the need to compare them was pointed out long ago (e.g., Lockhart & Murdock, 1970). One important feature of non-Gaussian distributions is that many of them are able to account for ROC asymmetry without invoking additional processes such as encoding variability (as done by the UVSD), recollection (DPSD), or attention failure (MSD/MSD0; see DeCarlo, 1998, for an example using extreme-value distributions). This means that exploring alternative distributional assumptions might lead not only to superior accounts of the data but also to accounts that provide distinct (perhaps even more parsimonious) characterizations of the underlying processes.
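As a rough illustration of this point (a sketch under our own illustrative assumptions, not a reproduction of DeCarlo's 1998 model), the following snippet shows that two extreme-value distributions differing only in location already imply an asymmetric ROC:

```r
## Survival function of a minimum extreme-value (Gumbel-type) variable with unit scale.
sev_surv <- function(q, mu = 0) exp(-exp(q - mu))

criteria <- seq(-2, 2, by = 0.5)
fa  <- sev_surv(criteria)            # new items, location 0
hit <- sev_surv(criteria, mu = 1)    # old items, location shifted by 1

## Unlike the equal-variance Gaussian case, these ROC points are not symmetric
## about the minor diagonal, even though only a single location parameter differs.
round(data.frame(fa = fa, hit = hit), 3)
```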

Moreover, the exploration of different assumptions should take into account exactly where the systematic residuals are found. For instance, most of the UVSD’s, MSD0’s, and MSD’s systematic residuals are found among the “new” responses. One possible cause for these mispredictions is that the familiarity of new items comes from a mixture of distributions (Chechile, 2013), which, when unaccounted for, can lead to distorted predictions, especially at the level of “new” judgments (which mostly occur for new items). Chechile (2013) recently proposed a test for detecting the presence of such mixtures for new items and found evidence consistent with them (see also DeCarlo, 2007).

In the case of the DPSD one possibility is to relax the assumption concerning how recollection is mapped onto the confidence scale. Recollection is expected to produce recognition judgments with high confidence. This assumption is usually implemented by enforcing the prediction that all recollected items are mapped onto the maximum-confidence “old” response (Wixted, 2007; Yonelinas & Parks, 2007). This implementation is unreasonably restrictive given that it completely excludes the possibility of other confidence levels being used. Different confidence levels can be used for several reasons, ranging from the mere occurrence of random errors to individuals’ idiosyncratic response styles. A perhaps more reasonable assumption is that recollection is preferentially, rather than exclusively, mapped onto high-confidence responses (for a review, see Klauer & Kellen, 2010). One important aspect of this DPSD extension is that it releases the model from confidence-rating ROC predictions that have been taken for granted in the literature at large, namely the prediction of ROC linearity when recollection is assumed to be the only process contributing to above-chance performance (Wixted, 2007; Yonelinas & Parks, 2007).
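Purely for illustration, one way to formalize such a relaxed mapping is to distribute recollection over the response categories according to a vector of mapping probabilities; the parameterization below is our own simplified sketch, not the specific proposal reviewed by Klauer and Kellen (2010).

```r
## Old-item category probabilities for a relaxed-recollection DPSD variant:
## recollected items are spread over the categories according to m (which sums to 1);
## the standard DPSD is the special case m = c(0, ..., 0, 1).
dpsd_relaxed_old <- function(R, mu_o, criteria, m) {
  stopifnot(length(m) == length(criteria) + 1, abs(sum(m) - 1) < 1e-8)
  cuts <- c(-Inf, criteria, Inf)
  R * m + (1 - R) * diff(pnorm(cuts, mean = mu_o))
}

## Most recollected items land in the two highest-confidence "old" categories (toy values).
dpsd_relaxed_old(R = 0.3, mu_o = 0.8, criteria = c(-1, -0.5, 0, 0.5, 1.5),
                 m = c(0, 0, 0, 0, 0.2, 0.8))
```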

However, the evaluation of these different possibilities is far from trivial: First, one needs to take their relative flexibility into account in a sensible manner, something that is not accomplished by model-selection statistics that use the number of free parameters as a proxy for model flexibility (Kellen et al., 2013; Klauer & Kellen, 2015). Second, some of the proposed model extensions or modifications might require focused validation tests. For instance, a relaxed recollection process in the DPSD can be validated by testing the conditional independence of recollection’s response-mapping probabilities (Province & Rouder, 2012). Irrespective of which model turns out to be the most successful one, sensible model comparisons should rely on a set of diverse criteria that go beyond overall fit and model-selection statistics and incorporate information on how exactly the models are failing.