Disentangling Different Aspects of Between-Item Similarity Unveils Evidence Against the Ensemble Model of Lineup Memory

For modeling recognition decisions in a typical eyewitness identification lineup task with multiple simultaneously presented test stimuli (also known as simultaneous detection and identification), essentially two different models based on signal detection theory are currently under consideration. These two models mainly differ with respect to their assumptions regarding the interplay between the memory signals of different stimuli presented in the same lineup. The independent observations model (IOM), on the one hand, assumes that the memory signal of each simultaneously presented test stimulus is separately assessed by the decision-maker, whereas the ensemble model (EM), on the other hand, assumes that each of these memory signals is first compared with and then assessed relative to its respective context (i.e., the memory signals of the other stimuli within the same lineup). Here, we discuss some reasons why comparing confidence ratings between trials with and without a dud (i.e., a lure with no systematic resemblance to the target) in an otherwise fair lineup, results of which have been interpreted as evidence in favor of the EM, is in fact inconclusive for differentiating between the EM and the IOM. Crucially, this lack of diagnostic value hinges on the fact that in these experiments two aspects of between-item similarity (viz., old-new and within-lineup similarity) are perfectly confounded. Indeed, we demonstrate that when old-new similarity is manipulated separately, the EM and the IOM make distinct predictions. Following this, we show that previously published data are inconsistent with the predictions made by the EM.


Introduction
In the past, basic recognition memory research largely relied on single-item recognition paradigms (e.g., Bröder & Schütz, 2009) in which participants are asked to decide for each individually presented test stimulus whether it is a previously studied old item (i.e., a target) or a nonstudied new item (i.e., a lure). Over the last few decades, memory researchers have proposed a plethora of different cognitive models to explore the latent processes underlying decisions in these situations (see, e.g., Malmberg, 2008; Parks & Yonelinas, 2008; Rotello, 2017; Jang et al., 2009).
A prominent assumption made by many such models is that a continuous memory-strength signal (the so-called familiarity; Morrell et al., 2002; cf. Glanzer et al., 2009) is elicited by each test stimulus presented in a recognition memory task. These familiarity values are assumed to be realizations of a random variable (RV) and are thus stochastic in nature. The decision-maker can then compare each familiarity value to a response criterion (λ) on the memory-strength dimension (Kellen & Klauer, 2018). Whenever a familiarity value exceeds the response criterion, the corresponding test stimulus will be categorized as being old (i.e., previously encountered during study). Otherwise, the test stimulus will be categorized as being new (i.e., not previously encountered). Importantly, old-item familiarities (X_o) are assumed to be distributed differently from new-item familiarities (X_n), enabling above-chance discrimination between these two types of stimuli (Macmillan & Creelman, 2005). Models adhering to these functional principles are usually referred to as signal detection theory (SDT; Green & Swets, 1966; Macmillan & Creelman, 2005; Wickens, 2002; Kellen & Klauer, 2018) models of recognition memory (Rotello, 2017; Kellen et al., 2021). Often, both X_o and X_n are assumed to follow a normal distribution (i.e., X_o ∼ N(μ_Xo, σ_Xo) and X_n ∼ N(μ_Xn, σ_Xn)), leading to the so-called Gaussian SDT model sub-class.¹

In recent years, however, more complex experimental paradigms have come into focus, as they allow for novel and more refined tests of specific model assumptions (e.g., Kellen et al., 2021; Kellen & Klauer, 2014; Meyer-Grant & Klauer, 2021; Voormann et al., 2021). Consider, for example, the situation in which participants are presented with a set of m > 1 stimuli at the same time in each test trial.
Furthermore, assume that each of these sets can either contain an old stimulus (a target trial) or not (a non-target trial). Participants can now be asked to indicate whether they believe a target to be present or not (1-out-of-m detection sub-task), and they can also be asked to identify the stimulus they think is most likely to be old (m-alternative forced-choice sub-task). Together, these two sub-tasks are referred to as simultaneous detection and identification (SDAI; Macmillan & Creelman, 2005).
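The basic SDT building block introduced above (a Gaussian familiarity value compared against a response criterion λ) can be sketched in a few lines. This is a minimal illustration only; the parameter values are assumptions for demonstration, not estimates from the literature:

```python
# Single-item Gaussian SDT: an item is called "old" iff its familiarity
# exceeds the response criterion lam (illustrative parameter values).
from math import erf, sqrt

def Phi(x):
    """CDF of the standard normal distribution."""
    return 0.5 * (1.0 + erf(x / sqrt(2.0)))

mu_old, mu_new, sigma = 1.0, 0.0, 1.0   # equal-variance case
lam = 0.5                               # response criterion

p_hit = 1.0 - Phi((lam - mu_old) / sigma)   # P(X_o > lam)
p_fa = 1.0 - Phi((lam - mu_new) / sigma)    # P(X_n > lam)

print(round(p_hit, 3), round(p_fa, 3))  # → 0.691 0.309
```

As long as μ_Xo > μ_Xn, the hit rate exceeds the false-alarm rate for any criterion, which is what "above-chance discrimination" means in this framework.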
Notably, this paradigm is of great importance for applied memory research, especially for the investigation of eyewitness identification, since SDAI is mostly equivalent to the widely used simultaneous lineup procedure. In such a task, witnesses to a crime are presented with the suspect along with some innocent fillers. The suspect can either be guilty (i.e., a target-present lineup) or innocent (i.e., a target-absent lineup). Witnesses are then asked to indicate whether they believe the perpetrator is present in the current lineup, and if so, to identify the perpetrator.² However, this research tradition is only just beginning to explore the potential of mathematical modeling, with Gaussian SDT models in particular rapidly gaining popularity (e.g., Cohen et al., 2021; Wixted & Mickes, 2014; Lee & Penrod, 2019; Colloff et al., 2017).

¹ The normal distribution is chosen to model familiarity more for practical than substantive reasons (but see Footnote 8) and is therefore usually considered to be merely an auxiliary assumption (see, e.g., Kellen & Klauer, 2018; Kellen et al., 2021; Rouder et al., 2014). However, due to its almost ubiquitous use in modeling recognition memory, we will mainly focus on Gaussian SDT models in the following. Moreover, we will later demonstrate that, under quite general conditions, the main argument of the present work is not necessarily jeopardized by using alternative distributional assumptions.

² Note that for forensic purposes, witnesses should not be asked to always identify the person they believe is most likely to be the perpetrator, as it is not intended to suggest that the perpetrator must always be present in any given lineup. Instead, it is recommended that witnesses only give an identification response (i.e., a response to the m-alternative forced-choice sub-task) if they believe they recognize the perpetrator (Wells et al., 2020); that is, for lineups in which they believe the perpetrator to be present. In other words, whether or not witnesses are asked to provide an identification response depends on their decision in the 1-out-of-m detection sub-task.
Interestingly, SDAI has until recently been largely ignored by basic memory researchers concerned with the evaluation and validation of such formal recognition memory models (but see Meyer-Grant & Klauer, 2021). However, a critical assessment of the core conceptual assumptions of said models seems advisable because applied researchers might otherwise draw erroneous conclusions based on improper models. This problem is further aggravated by the fact that the classical SDT framework can be extended in various ways to account for the responses in the 1-out-of-m detection sub-task. Two of these competing approaches, namely the independent observations model (IOM) and the ensemble model (EM), are of particular interest because both of them are considered to be reasonably well supported by previous empirical studies, despite proposing substantially different cognitive mechanisms (Akan et al., 2021).³ As we will outline below, this is mostly due to the fact that the experimental designs of past empirical investigations are inconclusive with respect to differentiating between these two models. For the purpose of addressing these issues, we will first briefly recapitulate and formalize the characteristics of both modeling accounts.

Independent Observations Model
Arguably the simplest SDT model of SDAI is the IOM, which has been known for quite some time in the literature on SDT (Starr et al., 1975; Green et al., 1977; Green & Birdsall, 1978; Macmillan & Creelman, 2005). This model assumes that each item's familiarity value will be separately compared to the response criterion λ ∈ ℝ. If at least one of the m familiarity values exceeds λ, participants will give a "target present" response. Therefore, the model prediction for the probability of a hit (H) in the 1-out-of-m detection sub-task (i.e., a "target present" response if an old item is indeed present⁴) is given by the probability that the maximum of all familiarity values exceeds λ, that is,

$$P_{\mathrm{IOM}}(H) = P\left(\max\{X_o, (X_n)_1, \ldots, (X_n)_r\} > \lambda\right),$$

where r is the number of simultaneously presented lures.

³ A third variant (viz., the integration model) has also been discussed in the literature. However, this model could not be reconciled with the data and was therefore decisively rejected, which is why we omit it here.

⁴ Note that the meaning of the term "hit" in the context of SDAI refers to a target trial being correctly detected as such, regardless of whether that decision is followed by the selection of a target or a lure in the m-alternative forced-choice sub-task. In other words, the term "hit" refers exclusively to the responses in the 1-out-of-m detection sub-task.
The IOM further assumes that the identification response (i.e., the response in the m-alternative forced-choice sub-task) will be made according to which item elicited the highest familiarity value (i.e., a maximum decision rule; see, e.g., Norman & Wickelgren, 1969). Thus, the model prediction for the probability of correctly identifying the old item (I) and a hit is given by

$$P_{\mathrm{IOM}}(I, H) = P\left(X_o > \max\{(X_n)_1, \ldots, (X_n)_r\} \;\wedge\; X_o > \lambda\right).$$

Figure 1 shows examples of probability density functions (PDFs) of an equal-variance Gaussian SDT model (i.e., σ_Xo = σ_Xn = 1) for both target and lure familiarities as well as the corresponding PDFs of the IOM's decision variables for a target (i.e., max{X_o, X_n}) and a non-target (i.e., max{(X_n)_1, (X_n)_2}) trial, both with m = 2. Moreover, Fig. 1 also depicts the respective model predictions for the so-called receiver operating characteristic (ROC) as well as the identification operating characteristic (IOC; Macmillan & Creelman, 2005). By plotting the model predictions for P(H) and P(I, H), respectively, against the model predictions for the probability of a so-called false alarm (FA; i.e., a "target present" response in the absence of an old item⁵), these curves show how changes in the response criterion affect the relationship between the predicted response behavior in target and non-target trials (Macmillan & Creelman, 2005).
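The IOM's two decision rules can be checked numerically. The following sketch (with assumed illustrative parameters, m = 2, equal-variance Gaussian case) compares a Monte Carlo estimate of the hit rate against the closed-form expression that follows from independence, 1 − F_Xo(λ) F_Xn(λ):

```python
# Monte Carlo check of the IOM predictions for m = 2 (one target, one lure),
# equal-variance Gaussian case; parameter values are illustrative.
from math import erf, sqrt
import numpy as np

def Phi(x):
    """CDF of the standard normal distribution."""
    return 0.5 * (1.0 + erf(x / sqrt(2.0)))

rng = np.random.default_rng(1)
mu_o, mu_n, lam, n = 1.0, 0.0, 0.5, 200_000

x_o = rng.normal(mu_o, 1.0, n)   # target familiarities
x_n = rng.normal(mu_n, 1.0, n)   # lure familiarities

# Detection sub-task: "target present" iff the maximum familiarity exceeds lam.
hit_mc = float(np.mean(np.maximum(x_o, x_n) > lam))
hit_cf = 1.0 - Phi(lam - mu_o) * Phi(lam - mu_n)   # 1 - F_Xo(lam) * F_Xn(lam)

# Identification sub-task: pick the item with the larger familiarity.
ident_and_hit = float(np.mean((x_o > x_n) & (x_o > lam)))

print(round(hit_cf, 3), round(hit_mc, 3), round(ident_and_hit, 3))
```

Note that P(I, H) is necessarily smaller than P(H), since some hits are produced by a lure exceeding both the target familiarity and the criterion.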

Ensemble Model
The EM likewise assumes that each simultaneously presented test stimulus elicits an individual familiarity value. However, unlike the IOM, it assumes that for each set of test items, the decision-maker first computes the mean familiarity of all simultaneously elicited item familiarity values, that is,

$$M := \frac{1}{m}\left(X_o + \sum_{i=1}^{m-1} (X_n)_i\right)$$

given a target trial and

$$M := \frac{1}{m}\sum_{i=1}^{m} (X_n)_i$$

given a non-target trial. Then, the difference between M and each individual item familiarity is computed. This difference is itself a RV with a distribution that depends on the true status of the respective stimulus (i.e., whether it is old or new) and on the present context of stimuli (i.e., whether it is a target or a non-target trial). We denote this RV as D_o := X_o − M for old items and as (D_n)_i := (X_n)_i − M for new items. If at least one of these decision variables exceeds the response criterion λ ≥ 0, a "target present" response will be given. The model prediction for the probability of a hit is consequently given by

$$P_{\mathrm{EM}}(H) = P\left(\max\{D_o, (D_n)_1, \ldots, (D_n)_{m-1}\} > \lambda\right).$$

In other words, the EM assumes that the decision in the 1-out-of-m detection sub-task depends on how much an individual item "stands out" from the rest of the lineup (i.e., a "target present" response is given if a single item appears sufficiently more familiar than the rest of the lineup). In contrast, the IOM assumes that each stimulus within a lineup is assessed separately by the decision-maker and a "target present" response is given if at least one item appears sufficiently familiar. From a theoretical perspective, these two mechanisms therefore correspond to what are known as relative and absolute decision strategies, respectively (e.g., Dunning & Stern, 1994; Charman & Wells, 2007; Clark et al., 2011).

The identification response, on the other hand, is, analogously to the IOM, determined by which item elicited the maximal familiarity value. Thus, the model prediction for the probability of a correct identification and a hit is given by

$$P_{\mathrm{EM}}(I, H) = P\left(X_o > \max\{(X_n)_1, \ldots, (X_n)_{m-1}\} \;\wedge\; \max\{D_o, (D_n)_1, \ldots, (D_n)_{m-1}\} > \lambda\right).$$

⁵ Like the term "hit," the term "false alarm" refers exclusively to the responses in the 1-out-of-m detection sub-task. That is, in the context of SDAI, the term "false alarm" denotes incorrectly reporting the presence of a target in a non-target trial.
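The EM's decision variables are easy to simulate. In the following sketch (illustrative parameters, target trial, m = 2), each familiarity is re-expressed relative to the lineup mean M before being compared to the criterion; note that for m = 2 the two decision variables are mirror images of each other:

```python
# Sketch of the EM decision rule for m = 2 on target trials
# (illustrative parameter values).
import numpy as np

rng = np.random.default_rng(2)
mu_o, mu_n, lam, n = 1.0, 0.0, 0.3, 200_000

x_o = rng.normal(mu_o, 1.0, n)   # target familiarities
x_n = rng.normal(mu_n, 1.0, n)   # lure familiarities

m_bar = (x_o + x_n) / 2.0        # ensemble mean M
d_o = x_o - m_bar                # D_o = X_o - M
d_n = x_n - m_bar                # D_n = X_n - M; equals -D_o when m = 2

hit = float(np.mean(np.maximum(d_o, d_n) > lam))            # detection sub-task
ident_and_hit = float(np.mean((x_o > x_n) & (d_o > lam)))   # correct ID and hit

print(round(hit, 3), round(ident_and_hit, 3))
```

Because D_o and D_n share the common term M, only the difference between the item familiarities matters for the EM's detection decision when m = 2; this fact is exploited in the formal derivations below.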

Similar Lures and the Dud-Alternative Effect
One piece of evidence that, at first glance, seems to favor the EM over the IOM is what Windschitl and Chambers (2004; see also Charman et al., 2011) referred to as the dud-alternative effect. This effect describes the phenomenon that reported confidence (in the target being present⁶) tends to increase when an implausible alternative (a so-called dud, i.e., a regular lure which exhibits no systematic similarities to the target) is included in an otherwise fair lineup (i.e., a lineup comprised of stimuli systematically resembling each other; Charman et al., 2011; Horry & Brewer, 2016; Fitzgerald et al., 2013). By the same token, adding a lure that resembles the target (i.e., a similar lure) to an unfair lineup consisting of randomly assembled stimuli (i.e., stimuli that do not systematically resemble each other) would tend to reduce confidence.

⁶ In the past, most experiments investigating the dud-alternative effect have measured confidence ratings associated with the subsequent identification response (i.e., the response regarding the m-alternative forced-choice sub-task; Horry & Brewer, 2016; Charman et al., 2011). However, the confidence in the identification response is, strictly speaking, not what is actually modeled, but rather the confidence in the 1-out-of-m detection response. This is the case because the decision variable of the 1-out-of-m detection sub-task determines the confidence level in both models. From a practical perspective, it seems appropriate to assume that both kinds of confidence ratings are based on the same underlying decision variable and hence are identical. We therefore adopt this view in the following. Nevertheless, testing whether this assumption in fact holds might be a worthwhile endeavor in its own right, which is, however, beyond the scope of the present work.
In the context of eyewitness identification, for example, similar lures are fillers whose visual characteristics match the verbal descriptions of the perpetrator and/or the general appearance of the suspect. The use of similar lures in real-life lineups (i.e., constructing fair lineups) is generally recommended since it has been shown to reduce the number of falsely identified innocent suspects (Wells et al., 2020; Fitzgerald et al., 2013; Smith et al., 2017). This is the case because the rate of "target present" responses was found to increase for unfair lineups compared to fair lineups, regardless of whether the suspect is guilty or innocent (Fitzgerald et al., 2013). Indeed, the dud-alternative effect is closely related to this observation, as it describes an increase in confidence when the lineup becomes less fair (Horry & Brewer, 2016; Fitzgerald et al., 2013).
Importantly, confidence is intricately tied to the magnitude of the decision variable in SDT models of recognition memory; that is, the larger the decision variable, the more confident decision-makers are that a target was present (Kellen & Klauer, 2018; Dubé & Rotello, 2012; cf. Mickes et al., 2017). Modeling a confidence rating (i.e., a graded response) is then simply accomplished by assuming multiple staggered response criteria that partition the memory-strength dimension into multiple confidence regions. Furthermore, this entails that an increase in confidence is tantamount to an increase in hit (or false alarm) rate over different confidence levels. To explain the dud-alternative and related effects, it has thus been proposed that the more salient the target familiarity is compared to its context (facilitated by the inclusion of a dud in place of a similar lure), the larger the decision variable and, as a consequence, the higher the associated confidence. The EM is in essence an instantiation of this idea, which is why the dud-alternative effect lends some credence to this account.
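The mapping from decision variable to graded confidence via staggered criteria can be sketched as follows; the criterion values are illustrative assumptions, and the confidence level is simply the number of criteria the decision variable exceeds:

```python
# Minimal sketch of graded confidence in SDT: staggered response criteria
# partition the decision-variable axis into confidence regions
# (criterion values are illustrative assumptions).
import numpy as np

criteria = np.array([0.0, 0.7, 1.4])   # lam_1 < lam_2 < lam_3

def confidence(decision_variable):
    """Return 0 ("sure target absent") .. 3 ("sure target present"):
    the number of criteria the decision variable exceeds."""
    return int(np.sum(decision_variable > criteria))

print(confidence(-0.5), confidence(0.3), confidence(2.0))  # → 0 1 3
```

Under this mapping, anything that shifts the decision variable upward (such as the mechanism the EM proposes for dud trials) necessarily shifts responses toward higher confidence categories.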
At this point, however, it is necessary to draw attention to one way in which between-item similarity has previously been modeled: by introducing a positive within-lineup correlation of memory signals while changing neither the mean of the lure (μ_Xn) nor of the target (μ_Xo) familiarity distribution. A positive within-lineup correlation (ρ(X_o, X_n) > 0) between target and lure familiarity implies that if the target appears to be relatively familiar (i.e., more familiar than an average target) to the decision-maker, the lure tends to appear relatively familiar (i.e., more familiar than an average lure) as well, and vice versa. This modeling decision does address one key aspect of between-item similarity, but overlooks another crucial aspect that must not be ignored in this context.
In order to see why this is the case, consider a situation in which target and lure are de facto indistinguishable by a human observer (e.g., through only minimally changing the color of a single pixel in a picture that was presented during study, as discussed by Meyer-Grant & Klauer, 2021). According to the modeling approach described above, the correlation between the target and lure familiarity values ρ(X_o, X_n) would approach one in such a case. Thus, the memory-strength "signals generated by the target and the lure on any given trial fall at precisely the same point on their respective distributions" (Wixted et al., 2018, p. 84; see also Hintzman, 2001). If, for example, the familiarity value elicited by the target falls one standard deviation below μ_Xo, then the familiarity value elicited by a lure will fall one standard deviation below μ_Xn. However, as long as the means of both the target (μ_Xo) and the lure (μ_Xn) familiarity distributions remain unaffected by this manipulation (and also μ_Xo > μ_Xn), this implies that the more similar the stimuli are, the better the identification performance becomes, until eventually the model will predict the identification responses to be always correct.⁷ Common sense, on the other hand, strongly suggests that the opposite must be the case. That is, if a decision-maker cannot distinguish between the stimuli, then (assuming no additional biases) the response must be based solely on guessing (see also Luus & Wells, 1991; Wells et al., 2020). Thus, in addition to an increase in the within-lineup correlation between familiarities, an increase in similarity should be accompanied by a decrease in the mean difference between target and lure familiarity distributions (i.e., μ_Xo − μ_Xn). More precisely, as ρ(X_o, X_n) approaches one, μ_Xo − μ_Xn must approach zero.⁸

This illustrates that in the dud-alternative paradigm we are dealing with two different but perfectly confounded aspects of between-item similarity: On the one hand, the stimuli of a lineup can resemble each other (i.e., the within-lineup similarity, as modeled by the within-lineup correlation), and, on the other hand, a new stimulus can resemble an old stimulus (i.e., the old-new similarity, as modeled by the mean difference between target and lure familiarity distributions).

⁷ Note that this behavior does not depend on whether we adopt the IOM or the EM framework, as both share the same mechanism for modeling the identification response.

⁸ Another argument that could be made here is that one rationale for using the normal distribution to model latent memory strength is to assume that these values are formed by summation over partial memory information (see, e.g., Hintzman, 1984). The normality assumption then follows from the central limit theorem under quite general regularity conditions (Kellen & Klauer, 2018). Given that a similar lure systematically shares a certain proportion of target characteristics, the corresponding partial memory information will likewise coincide. This not only results in both values being correlated, but also brings the expected value of the similar lure familiarities closer to the expected value of the target familiarities.
Consider, for example, a forensic lineup of size m = 2 which contains the suspect and a filler that closely resembles the description of the perpetrator. Since the suspect naturally also matches this description, both within-lineup similarity and old-new similarity are high in such a scenario. For an illustration of how to disentangle the two kinds of similarities, suppose that the crime was committed by two perpetrators who are not systematically similar to each other. This allows for the construction of a lineup that contains a suspect who resembles one of the perpetrators and a filler that matches the description of the other perpetrator. In this case, within-lineup similarity will be relatively low, whereas old-new similarity will be high.
Fortunately, SDT models of SDAI can easily be modified to account for both aspects of similarity present in the dud-alternative paradigm. Let us therefore denote only the familiarity of a dud (i.e., a regular lure) as X_n and, in distinction to this, the familiarity of a similar lure as X_s ∼ N(μ_Xs, σ_Xs). Now, we can specify that X_o and X_s are positively correlated (denoted by the parameter ρ(X_o, X_s) > 0) while at the same time X_o and X_n are uncorrelated (i.e., ρ(X_o, X_n) = 0). Furthermore, we assume that μ_Xo > μ_Xs > μ_Xn and that the more closely a similar lure resembles the target, the larger ρ(X_o, X_s) becomes and the closer μ_Xs approaches μ_Xo.
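This extended parametrization can be instantiated directly as a multivariate normal distribution. The following sketch (all numerical values are illustrative assumptions) draws correlated target and similar-lure familiarities alongside an uncorrelated dud familiarity:

```python
# One way to instantiate the extended model (illustrative values):
# target and similar-lure familiarities are positively correlated,
# the dud familiarity is not, and mu_Xo > mu_Xs > mu_Xn.
import numpy as np

rng = np.random.default_rng(3)
mu = [1.0, 0.6, 0.0]                # mu_Xo, mu_Xs, mu_Xn
rho_os = 0.5                        # rho(X_o, X_s) > 0
cov = [[1.0, rho_os, 0.0],          # rho(X_o, X_n) = 0
       [rho_os, 1.0, 0.0],
       [0.0, 0.0, 1.0]]

x_o, x_s, x_n = rng.multivariate_normal(mu, cov, size=100_000).T
r_os = float(np.corrcoef(x_o, x_s)[0, 1])
r_on = float(np.corrcoef(x_o, x_n)[0, 1])
print(round(r_os, 2), round(r_on, 2))
```

Increasing similarity between target and similar lure then corresponds to jointly raising rho_os and moving mu[1] toward mu[0], exactly as specified in the text.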
Expanding the SDT model framework in the above-mentioned way does not compromise the conceptual appeal of the EM in terms of predicting a dud-alternative effect. But the presence of the dud-alternative effect is at the same time not yet sufficient to rule out the IOM. The reasons for this are twofold: First, the presence or absence of a similar lure in the dud-alternative paradigm is clearly apparent to the decision-maker and thus not opaque to conscious understanding.⁹ Thus, as was noted by Hanczakowski et al. (2014), participants could alter their response criterion between both conditions; that is, they might adopt a more conservative response criterion when a similar lure is added to the lineup. Second, the IOM in fact also predicts the dud-alternative effect to occur for many reasonable parameter specifications, even without assuming a shift in the response criterion. This is due to the fact that the lower the correlation between target and lure familiarities within the same lineup, the higher the chance that a relatively high lure familiarity will be elicited together with a relatively low target familiarity.¹⁰ One can imagine that if a target and a lure do not resemble each other, sometimes the decision-maker will erroneously believe to recognize the lure while not recognizing the target. If, on the other hand, target and lure are highly similar, the decision-maker will most likely not mistake the lure for being old if the target already appears very unfamiliar. In essence, the inclusion of a dud adds another independent attempt to surpass the response criterion, which is why this event tends to occur more often (i.e., a hit becomes more likely; see also Fig. 2, top row of panels).¹¹ One might object to this that the dud-alternative effect is also present when only considering the confidence ratings associated with a correct identification (i.e., a target identification; e.g., Horry & Brewer, 2016).
Initially, the previous argument seems to be invalidated by this observation; however, this is actually not the case: The predicted probability of instances in which the lure familiarity value surpasses the target familiarity value, assuming no correlation between them and given their maximum exceeds the response criterion λ (i.e., the predicted probability of an incorrect identification given a hit), is monotonically linked to λ itself in the IOM. That is, the smaller λ, the more likely a lure will be identified over a target, given a hit (see Meyer-Grant & Klauer, 2021, who provided a proof of this dependency for m = 2 and different parametrizations of the IOM, including for normally distributed familiarity values¹²). In other words, the smaller the maximum familiarity value, the more likely it is that this value was elicited by a lure rather than a target. Hence, these instances are selectively removed when conditioning on a correct identification response, which in turn results in an observable increase in confidence in situations in which target and lure familiarities are uncorrelated compared to situations in which they are correlated (i.e., the dud-alternative effect). In fact, this can lead to a stronger dud-alternative effect than if lure identifications are also taken into account (see also Fig. 2, bottom row of panels).

¹⁰ Note that the dud-alternative effect has sometimes been investigated in the past by simply adding a dud to the lineup (Charman et al., 2011). This resulted in an increase in lineup size (m) between both conditions. In such a situation, there is yet another mechanism of the IOM that predicts the dud-alternative effect to occur: The maximum of the lineup with a dud will tend to be greater than that of the lineup without one, simply because there are more items available to generate the maximum. However, other studies found an analogous effect even if m is constant between both conditions (Horry & Brewer, 2016), which is why this mechanism alone is not sufficient to account for past observations.

¹¹ Importantly, this causal mechanism does not depend on the number of stimuli (m) presented in a single lineup. Thus, the IOM's ability to predict the dud-alternative effect does not depend on lineup size. However, analogous to the EM, the magnitude of the effect predicted by the IOM decreases as m increases.

¹² To our knowledge, there is no rigorous proof of this dependency when m > 2, but simulations suggest that it holds for larger lineups as well.
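The correlation mechanism by which the IOM alone can produce a dud-alternative effect is easy to verify numerically. In the following sketch (illustrative parameters, m = 2), the means of the familiarity distributions are deliberately held fixed across conditions to isolate the effect of the within-lineup correlation on the hit rate:

```python
# Sketch of the IOM's correlation mechanism for the dud-alternative effect
# (illustrative parameters; means held fixed to isolate the correlation).
import numpy as np

rng = np.random.default_rng(4)
mu, lam, n = np.array([1.0, 0.0]), 1.0, 300_000

def iom_hit_rate(rho):
    # IOM detection rule: "target present" iff the max familiarity exceeds lam.
    cov = np.array([[1.0, rho], [rho, 1.0]])
    x = rng.multivariate_normal(mu, cov, size=n)
    return float(np.mean(x.max(axis=1) > lam))

hit_dud = iom_hit_rate(0.0)       # uncorrelated dud in the lineup
hit_similar = iom_hit_rate(0.8)   # highly correlated similar lure in the lineup

print(round(hit_dud, 3), round(hit_similar, 3))
```

With zero correlation, the lure provides an additional independent attempt to surpass the criterion, so the hit rate in dud trials exceeds that in correlated-lure trials even though no ensemble comparison is involved.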

A Critical Test
Whether or not a dud-alternative effect occurs in an SDAI task is therefore not diagnostic for deciding between the IOM and the EM. However, the task can be slightly modified in order to be much more informative in that regard. While (for target trials) the within-lineup similarity cannot be manipulated without affecting old-new similarity, separately manipulating old-new similarity is in fact possible. This simply requires that a similar lure (i.e., a new stimulus which resembles an old one) is not paired with its similar old-item sibling during test, but instead is presented together with another old item (i.e., the similar lure resembles an old stimulus, but not the target presented in the same lineup). This raises the old-new similarity in such lineups while leaving within-lineup similarity unaffected.

[Fig. 2, right column of panels: The model predictions for the probability of a hit (top right panel) in target trials with (ordinate) and without (abscissa) a similar lure, as well as the model predictions for the probability of a hit given a target identification (bottom right panel) in target trials with (ordinate) and without (abscissa) a similar lure (black: (X_s)_1, blue: (X_s)_2). The dud-alternative effect corresponds to a curve below the diagonal identity line (dotted gray line). The stronger the effect, the lower the curve. Squares indicate the predicted probabilities for the response criterion λ.]
Such an experiment (with m = 2¹³) was recently conducted by Meyer-Grant and Klauer (2021). They first presented a sequence of several portrait pictures to the participants, who were asked to memorize them. In a subsequent test phase, they then presented participants with lineups with no systematic within-lineup similarity that either contained a target and a regular lure, a target and a similar lure, a regular lure and a similar lure, or two regular lures.¹⁴ Participants were then asked to provide a four-level confidence rating as to whether or not they believed a target to be present (i.e., the 1-out-of-m detection sub-task), and to subsequently identify the stimulus they believed most likely to be the target (i.e., the m-alternative forced-choice sub-task).
Their results revealed that, first, people were more likely to correctly identify the target when they were more confident that a target was present; second, people were more likely to falsely identify a similar lure (a so-called pseudo-identification) over a regular lure when they were more confident that a target was present; and third, people were more likely to think a target was present in a target trial when the target was accompanied by a similar lure than when the target was accompanied by a regular lure instead (Meyer-Grant & Klauer, 2021). They further showed that the IOM is consistent with these observations. Here, we will derive the respective predictions of the EM and show that, unlike the IOM, it cannot account for this pattern of effects.
Let us therefore initially consider only the first two critical predictions identified by Meyer-Grant and Klauer (2021). It can be shown that both predictions follow from certain rank order probabilities (e.g., P(D_o > D_n | max{D_o, D_n} > λ) in the case of the EM) exhibiting monotonicity under changes in the response criterion (Meyer-Grant & Klauer, 2021). As we will demonstrate next, this property indeed also holds for the Gaussian EM.

¹³ Henceforth we will confine ourselves to an SDAI task with m = 2, as we want to focus on the data by Meyer-Grant and Klauer (2021) and choosing m = 2 conforms to their experimental design. Furthermore, it will suffice for the central research objective of the present work, that is, to assess the tenability of the EM. An interesting side note is that for m = 2, the so-called best vs. rest model (Clark et al., 2011), a variant of the EM in which the maximum familiarity is compared to the mean familiarity value of the remaining items, is equivalent to the so-called best vs. next model (Clark et al., 2011), in which the maximum familiarity is compared to the maximum familiarity of the remaining items. Hence, the critical test presented here applies to these models as well.

¹⁴ Note that in order to construct these different types of lineups, it was essential to present multiple stimuli during study. Only in this way was it possible to construct lineups with a target and a similar lure, that is, a lure which resembles a studied stimulus other than the target of the current lineup.
In order to provide a rigorous proof of this claim, we first note that for a target trial and m = 2,

$$M = \frac{X_o + X_n}{2}, \qquad D_o = X_o - M = \frac{X_o - X_n}{2}, \qquad D_n = X_n - M = -D_o.$$

Because the normal distribution is stable under convolution, we further find that

$$D_o \sim \mathcal{N}\left(\mu_{D_o}, \sigma_{D_o}\right) \quad \text{with} \quad \mu_{D_o} = \frac{\mu_{X_o} - \mu_{X_n}}{2} \quad \text{and} \quad \sigma_{D_o}^2 = \frac{\sigma_{X_o}^2 + \sigma_{X_n}^2 - 2\,\mathrm{Cov}(X_o, X_n)}{4}.$$

Without loss of generality (except for the two degenerate cases with perfect correlation of either ρ(X_o, X_n) = 1 or ρ(X_o, X_n) = −1), we assume σ_Do = 1 in the following and define μ_Do := E(D_o). As previously mentioned, the EM assumes that the decision in the 1-out-of-m detection sub-task is determined by whether or not max{D_o, D_n} exceeds the response criterion λ. We can therefore express the predicted probability of a so-called miss (M; i.e., a "target absent" response if an old item is present¹⁵) as

$$P_{\mathrm{EM}}(M \mid \lambda) = P\left(\max\{D_o, D_n\} \le \lambda\right) = P\left(|D_o| \le \lambda\right).$$

However, in the Gaussian case this expression is equivalent to the cumulative distribution function (CDF) of a folded normal distribution (more specifically, of a folded normal distribution with scale parameter σ_Do = 1 and location parameter μ_Do). Thus, it holds that

$$P_{\mathrm{EM}}(M \mid \lambda) = \Phi(\lambda - \mu_{D_o}) - \Phi(-\lambda - \mu_{D_o}),$$

where Φ is the CDF of the standard normal distribution.

¹⁵ Like the terms "hit" and "false alarm," the term "miss" refers exclusively to the responses in the 1-out-of-m detection sub-task. That is, in the context of SDAI, the term "miss" denotes incorrectly rejecting the presence of a target in a target trial, regardless of whether that decision is followed by the selection of a target or a lure in the m-alternative forced-choice sub-task.
Furthermore, we can express the predicted probability of a miss and a subsequent correct identification as

P_EM(I, M|λ) = P(0 < D_o ≤ λ) = ∫_0^λ ϕ(z − μ_Do) dz,

where ϕ is the PDF of the standard normal distribution. Thus, it holds that

P_EM(I, M|λ) = Φ(λ − μ_Do) − Φ(−μ_Do).

The predicted probability of a correct identification given a miss can therefore be expressed as

P_EM(I|M, λ) = [Φ(λ − μ_Do) − Φ(−μ_Do)] / [Φ(λ − μ_Do) + Φ(λ + μ_Do) − 1]

for all λ > 0 according to the Gaussian EM. Furthermore, one finds by means of elementary probability theory that

P_EM(H|λ) = 1 − P_EM(M|λ)

and

P_EM(I, H|λ) = P_EM(I) − P_EM(I, M|λ).

Thus, it follows immediately that P_EM(I|H, λ) = P_EM(I, H|λ) / P_EM(H|λ) for all λ > 0.
Consequently, both the increase in the correct-identification rate and, as can be seen by exchanging X_o with X_s in the above derivations and assuming μ_Xs > μ_Xn, the increase in the pseudo-identification rate with confidence are consistent with the EM.
Interestingly, however, the model predictions regarding the last critical effect discussed by Meyer-Grant and Klauer (2021) differ between the EM and the IOM. As we established above, if similar lures are included in the design, we must assume that μ_Xs > μ_Xn in order for the Gaussian EM to predict an increase of the pseudo-identification rate with confidence, a direct consequence of Theorem 1.16 But if this is indeed the case, one can show the following: According to an equal-variance Gaussian EM (i.e., σ_Xo = σ_Xs = σ_Xn = 1) in which the correlation between the familiarity values of a lure and a target does not depend on whether the lure is a similar or a regular lure (i.e., Cov(X_o, X_s) = Cov(X_o, X_n), as is to be assumed for the experimental data under consideration here, where the similar lure resembled an old item other than the target; see Meyer-Grant & Klauer, 2021), predicted hit rates must be lower in target trials with a similar lure than in target trials with a regular lure (if m = 2). Furthermore, this prediction is independent of the value of the response criterion λ and thus holds regardless of the confidence level.
Proof See the Appendix.
The qualitative prediction of the equal-variance Gaussian EM that is entailed by Proposition 2 is the exact opposite of the corresponding prediction made by any IOM with a monotonic likelihood ratio of similar and regular lure familiarities (see Proposition 9 in Meyer-Grant & Klauer, 2021). This can also be interpreted as an immediate consequence of the central distinguishing aspect between the EM and the IOM framework: In the EM, it is not the absolute familiarity values (as in the IOM) but the relative familiarity values (i.e., how much a single stimulus' familiarity "stands out" compared to the rest of the set) that are relevant for the decision. In the presence of a similar lure (instead of a regular one), the difference between target and lure familiarity values tends to be reduced if μ_Xs > μ_Xn. Thus, the predicted hit rate must decrease in such circumstances according to the EM.
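In the equal-variance Gaussian case with independent familiarities (i.e., no within-lineup correlation), this reduction can be written in closed form: the EM detects a target whenever |X_o − X_lure| > 2λ, and X_o − X_lure is normal with mean μ_Xo − μ_lure and standard deviation √2. A minimal sketch with hypothetical means (not fitted to any data) shows that the similar-lure hit rate is lower at every criterion:

```python
import math

SQRT2 = math.sqrt(2.0)

def phi(z: float) -> float:
    """Standard normal CDF."""
    return 0.5 * (1.0 + math.erf(z / SQRT2))

def em_hit_rate(mu_target: float, mu_lure: float, lam: float) -> float:
    """Equal-variance Gaussian EM hit rate for m = 2 with independent items:
    "target present" iff |X_o - X_lure| > 2*lam, and
    X_o - X_lure ~ N(mu_target - mu_lure, sqrt(2))."""
    mu_diff = mu_target - mu_lure
    # P(|Z| > 2*lam) = 1 - [Phi((2*lam - mu)/s) - Phi((-2*lam - mu)/s)], s = sqrt(2)
    return 1.0 - (phi((2 * lam - mu_diff) / SQRT2) - phi((-2 * lam - mu_diff) / SQRT2))

# hypothetical parameter values, chosen only to illustrate the direction
mu_o, mu_s, mu_n = 1.5, 0.75, 0.0
for lam in (0.25, 0.5, 1.0):
    hit_similar = em_hit_rate(mu_o, mu_s, lam)
    hit_regular = em_hit_rate(mu_o, mu_n, lam)
    assert hit_similar < hit_regular  # Proposition 2: holds for every lam > 0
    print(f"lam={lam}: similar={hit_similar:.3f} < regular={hit_regular:.3f}")
```

The inequality holds because moving the lure mean toward the target mean shrinks |μ_Xo − μ_lure|, pulling probability mass into the EM's non-detection band.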
This discrepancy between the EM and the IOM can also be visualized by depicting the two-dimensional decision space of both models (Fig. 3), where each dimension corresponds to one of the m = 2 test-item familiarity values. Suppose that for a target trial, values on the ordinate correspond to the old-item familiarity (X_o), whereas values on the abscissa correspond to the new-item familiarity (X_n). Let us first consider the identity line X_o = X_n that passes through the origin. Points above this line correspond to a correct identification for both the EM and the IOM, as they imply that the target familiarity value exceeds the lure familiarity value.

[Fig. 3 caption fragment: Below the lower dashed black line, the lure familiarity value is larger than X_o + 2λ. Realizations between these two lines correspond to a "target absent" response in the EM.]
The key difference between the EM and the IOM in this representation is how the two models partition this decision space into different response categories for the 1-out-of-m detection sub-task. The IOM assumes that the boundary which separates "target present" from "target absent" decisions is defined by the two line segments X_o = λ and X_n = λ that terminate at (λ, λ), as depicted in the left panel of Fig. 3. Any pair {X_o, X_n} such that X_o > λ or X_n > λ is detected as a trial containing a target; the larger the value of λ, the greater the confidence level. In contrast, according to the EM, the relevant boundaries are defined by the two lines X_o = X_n + 2λ and X_n = X_o + 2λ, as depicted in the right panel of Fig. 3. Any pair {X_o, X_n} such that X_o > X_n + 2λ or X_n > X_o + 2λ is detected as a trial containing a target. There is thus a band centered on the identity line X_o = X_n in which item pairs are not detected as containing a target.
To reiterate, the effect of interest concerns changes in the hit rate if a target is presented alongside a similar lure (with familiarity value X_s) instead of a regular lure (with familiarity value X_n). As long as μ_Xs > μ_Xn, the bivariate distribution corresponding to target trials with a similar lure is simply the distribution for a target trial with a regular lure shifted by μ_Xs − μ_Xn to the right along the abscissa, as can easily be seen in Fig. 3. This has the effect of increasing the proportion of realizations that are detected in the IOM, given that λ remains unchanged (i.e., an increase in the predicted hit rate; see left panel of Fig. 3). For the EM, on the other hand, the same rightward shift entails that more probability mass crosses into the non-detection band surrounding the identity line, thereby decreasing the proportion of realizations that lead to detection (i.e., a decrease in the predicted hit rate; see right panel of Fig. 3).
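A brief Monte Carlo sketch makes the opposing shifts concrete. It assumes independent unit-variance Gaussian familiarities with hypothetical means (μ_Xo = 1.5, μ_Xn = 0, μ_Xs = 0.75; illustrative only) and implements the two detection rules described above:

```python
import random

def simulate_hit_rates(mu_target, mu_lure, lam, n=200_000, seed=7):
    """Monte Carlo hit rates for the IOM and the EM (m = 2, independent
    unit-variance Gaussian familiarities).
    IOM: "target present" iff max(X_o, X_lure) > lam.
    EM:  "target present" iff |X_o - X_lure| > 2 * lam."""
    rng = random.Random(seed)
    iom_hits = em_hits = 0
    for _ in range(n):
        x_o = rng.gauss(mu_target, 1.0)
        x_l = rng.gauss(mu_lure, 1.0)
        iom_hits += max(x_o, x_l) > lam
        em_hits += abs(x_o - x_l) > 2 * lam
    return iom_hits / n, em_hits / n

# hypothetical means: target 1.5, regular lure 0, similar lure 0.75
iom_reg, em_reg = simulate_hit_rates(1.5, 0.0, 1.0)
iom_sim, em_sim = simulate_hit_rates(1.5, 0.75, 1.0)
print(f"IOM: regular={iom_reg:.3f} -> similar={iom_sim:.3f} (increase)")
print(f"EM:  regular={em_reg:.3f} -> similar={em_sim:.3f} (decrease)")
```

Replacing the regular lure with a similar one raises the simulated hit rate under the IOM rule but lowers it under the EM rule, exactly the divergence visible in Fig. 3.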
In their empirical investigation of this situation, Meyer-Grant and Klauer (2021) found clear evidence for a higher hit rate when including a similar lure instead of a regular one, regardless of the confidence level.17 This raises serious doubts regarding the validity of the central cognitive mechanism proposed by the equal-variance Gaussian EM in particular.

Reanalysis of Previously Published Data
However, the EM generally not being able to predict this qualitative effect hinges on certain model assumptions. More specifically, if the assumptions regarding variances and/or covariances (as specified in Proposition 2) are relaxed, the model can no longer be decisively ruled out based on the occurrence of the critical effect alone. Moreover, an unequal-variance assumption, in particular, is rather prominent in Gaussian SDT models of recognition memory (Jang et al., 2009; Mickes et al., 2007; Wixted, 2007; Starns et al., 2012; Spanton & Berry, 2021; but see ; ) and should therefore be considered in the present work. Hence, it seems worthwhile to complement our argument with an additional quantitative model comparison between the unequal-variance Gaussian IOM and the unequal-variance Gaussian EM in order to ensure that our criticism of the EM framework is indeed warranted. We therefore conducted a reanalysis of the data by Meyer-Grant and Klauer (2021) to address this particular issue, mirroring their model comparison strategy. Thus, we first fitted the parameters of both the unequal-variance Gaussian IOM and the unequal-variance Gaussian EM to the data aggregated across participants using a maximum likelihood approach (with σ_Xo, σ_Xs, and σ_Xn allowed to differ from each other). Calculating Pearson's χ² test statistics for both models reveals that the EM (χ²(17) = 392.71) provides a clearly inferior goodness of fit compared to the IOM (χ²(17) = 149.17). The top and middle rows of panels in Fig. 4 depict the respective model predictions (for the ROC and the IOC as well as the hit probabilities for target trials with and without a similar lure, respectively) together with the corresponding experimental data (Meyer-Grant & Klauer, 2021).
However, analyses of data aggregated across participants neglect possible variability in parameters between participants and are thus susceptible to aggregation biases (Morey et al., 2008; Juola et al., 2019; Smith et al., 2017). We therefore additionally conducted a Bayesian hierarchical model analysis (Rouder & Lu, 2005; Rouder et al., 2017), in which the parameter vector for each participant is separately drawn from a population distribution. Again following the procedure outlined by Meyer-Grant and Klauer (2021), we used a latent-trait approach (Klauer, 2010) and assessed model fit by means of a cross-validation index (CVI; Browne, 2000; Gelman et al., 2013; Vehtari et al., 2017). The selection of (hyper-)prior parameters, the cross-validation procedure, and the calculation of the CVI adhere exactly to the specifications in Meyer-Grant and Klauer (2021). We also provide model weights (W) calculated through stacking of predictive distributions (Yao et al., 2018). All calculations were repeated twice with different seeds of the random number generator (Meyer-Grant & Klauer, 2021). The results clearly suggest that the IOM (M_CVI = −43 749.34, SD_CVI = 1.38, M_W = 0.992, SD_W = 0.011) provides a better account of the data than the EM (M_CVI = −43 985.90, SD_CVI = 1.64, M_W = 0.008, SD_W = 0.011). This corroborates the preliminary conclusion drawn from the analysis of the aggregated data.
Additionally, the bottom row of panels in Fig. 4 depicts the model predictions for the hit probabilities given a correct identification for different confidence levels together with the corresponding experimental data (Meyer-Grant & Klauer, 2021). This reveals yet another critical effect pattern that distinguishes the two model frameworks: Including a similar lure that does not resemble the target of the current lineup appears to increase the hit rate even when conditioning on a correct identification response (i.e., the confidence in a correct identification increases), as opposed to just increasing the overall hit rate regardless of the identification response. 18 The IOM predicts such an effect (see bottom left panel of Fig. 4) since targets selected over similar lures should tend to be associated with a stronger absolute memory-strength signal than targets selected over regular lures, whereas the EM is once more unable to adequately account for this pattern (see bottom right panel of Fig. 4).
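This diverging conditional prediction can also be illustrated numerically. The sketch below uses hypothetical means (μ_Xo = 1.5, μ_Xs = 0.75, μ_Xn = 0), independent unit-variance familiarities, and λ = 1; none of these values are fitted to the data. It estimates P(hit | correct identification) under both decision rules:

```python
import random

def cond_hit_given_correct_id(mu_target, mu_lure, lam, model, n=400_000, seed=11):
    """P(hit | correct identification) under the IOM or the EM (m = 2,
    independent unit-variance Gaussian familiarities).
    Correct ID: X_o > X_lure. IOM hit: max(X_o, X_lure) > lam;
    EM hit: X_o - X_lure > 2 * lam (given the correct ID, D_o is the maximum)."""
    rng = random.Random(seed)
    correct = hits = 0
    for _ in range(n):
        x_o = rng.gauss(mu_target, 1.0)
        x_l = rng.gauss(mu_lure, 1.0)
        if x_o > x_l:  # target selected in the forced-choice sub-task
            correct += 1
            detected = max(x_o, x_l) > lam if model == "IOM" else x_o - x_l > 2 * lam
            hits += detected
    return hits / correct

for model in ("IOM", "EM"):
    reg = cond_hit_given_correct_id(1.5, 0.0, 1.0, model)
    sim = cond_hit_given_correct_id(1.5, 0.75, 1.0, model)
    print(f"{model}: regular={reg:.3f}, similar={sim:.3f}")
```

Under these assumptions the IOM's conditional hit rate rises when the regular lure is replaced by a similar lure (targets that beat a stronger lure tend to have higher absolute familiarity), whereas the EM's falls, mirroring the pattern described for the bottom row of Fig. 4.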
In order to further test this effect, we conducted three generalized linear mixed model analyses (Singmann & Kellen, 2019) with a logistic link function comparing the relative frequencies of hits given a correct identification response between target trials with and without a similar lure. To this end, we included the fixed-effect within-subject factor "similar lure" (present vs. absent) as well as crossed random-effects factors for participants and target materials (Judd et al., 2012). The random-effects structure was determined by backwards selection (for details, see Meyer-Grant & Klauer, 2021).19 Each of the three analyses addressed one of three different confidence levels: The first analysis only treated responses with the highest confidence as hits, the second one additionally treated responses with the second-highest confidence as hits, and the third one treated all but the lowest confidence rating as hits. In all three analyses, the hit rate was significantly larger in trials with a similar lure (results are reported in Table 1; for a pictorial representation, see Fig. 5).

Discussion
In the present work, we have demonstrated that two aspects of between-item similarity, namely old–new similarity and within-lineup similarity, must be treated as distinct concepts and should therefore be separately modeled. By deriving conflicting predictions between the equal-variance Gaussian EM and many IOMs for situations in which old–new similarity is selectively manipulated and m = 2, we have, furthermore, shown that past empirical results (Meyer-Grant & Klauer, 2021) are irreconcilable with the equal-variance Gaussian EM.

[Fig. 4 caption: Experimental data and the respective model predictions (solid lines and squares) of the best-fitting unequal-variance Gaussian IOM (UVG-IOM; left column of panels) and the best-fitting unequal-variance Gaussian EM (UVG-EM; right column of panels). The upper row of panels depicts ROCs and IOCs for trials with (blue) and without (black) a similar lure; dotted gray lines represent guessing level. The middle row of panels depicts the probability of a hit in target trials with a similar lure on the ordinate and in target trials with a regular lure on the abscissa. The bottom row of panels depicts the probability of a hit conditional on a correct identification in target trials with a similar lure on the ordinate and in target trials with a regular lure on the abscissa. The dotted gray line is the identity line.]
Although we have not elaborated on it thus far, this critical test applies not only to a model parametrization in terms of equal-variance normal distributions, but also to other EMs in cases where the likelihood ratio of similar and regular lure familiarities is a monotonically increasing function (see the alternative proof of Proposition 2 in the Appendix). This is arguably a quite reasonable and defensible assumption to make, as it is equivalent to a low familiarity value being "more likely under the familiarity distribution of regular than under the familiarity distribution of similar lures, whereas for high familiarity values it is the other way around" (Meyer-Grant & Klauer, 2021, p. 9; see also Kellen et al., 2021; Rouder et al., 2021; Green & Swets, 1966).
A reanalysis of the data from Meyer-Grant and Klauer (2021) additionally revealed that even when relaxing the equal-variance assumption, 20 the EM is still clearly outperformed by the IOM in terms of quantitative model fit.
It is also noteworthy that the key message of the present work extends beyond the class of SDT models of recognition memory. One could, for example, imagine a similar mechanism within so-called threshold models (see, e.g., Kellen & Klauer, 2018; Coombs et al., 1970; Bernbach, 1967). In contrast to SDT models, threshold models assume that the latent memory signal is not directly accessible to the decision-maker. Instead, these models propose mediating mental states of a discrete nature (such as "correctly detecting a target to be old" or "correctly detecting a lure to be new"), which are reached with a certain probability that depends on whether the stimulus is old or new.
Recently, it has been shown that threshold models, in particular the so-called two-high-threshold model (Snodgrass & Corwin, 1988), cannot explain certain response patterns when manipulating memory strength and presenting multiple stimuli simultaneously during test (Kellen et al., 2021; Kellen & Klauer, 2014). This is the case because, according to the standard interpretation of the two-high-threshold model, a memory-strength manipulation only affects the probability of correctly detecting a target. However, the model predictions for the response frequencies of interest do not depend on this very probability, but only on the probability of correctly detecting a lure. To avoid these problems, one could easily extend the two-high-threshold model to operate in a similar way as the EM, that is, by allowing detection probabilities to depend on the specific context of the presented stimuli. This results in altering the probability of a correct lure detection when the memory strength of the target is manipulated (e.g., the stronger the target's memory signal, the easier it is to detect a lure).
However, this idea falls victim to the same observation as the SDT version of the EM, since increasing the old–new similarity would then decrease the probability of correctly detecting a target, which in turn would tend to lower the hit rate in such a situation. But this again directly contradicts the observations reported by Meyer-Grant and Klauer (2021). Therefore, our results also indirectly strengthen the arguments by Kellen and Klauer (2014) and Kellen et al. (2021).
There is, however, a caveat that concerns the applicability of our findings to research primarily focused on eyewitness identification: The experimental paradigm used to obtain the data on which our argument is based (Meyer-Grant & Klauer, 2021) differs from the paradigm commonly used in that field. More precisely, the data under consideration here were obtained via a multiple-trial experiment (i.e., testing each participant on multiple lineups), whereas most research on eyewitness identification relies on single-trial experiments (i.e., each participant is tested on only one lineup; Mansour et al., 2017).
On the one hand, a multiple-trial design permits analyzing the data by means of Bayesian hierarchical modeling, which helps to avoid aggregation biases that may be present but remain unnoticed in single-trial data. On the other hand, in the task presented here, certain pairs of items that participants encounter will tend to elicit two relatively high familiarity values (i.e., target trials with a similar lure). One could argue that in such a situation a comparative strategy (as instantiated by the EM) is not advisable, as it would lead to many misses in this scenario. Thus, the test environment may encourage an absolute strategy (as instantiated by the IOM). This would imply that the results may not generalize to single-trial situations, which might encourage other strategies. Future research may address this question by examining these effects in single-trial experiments.

[Fig. 5 caption: Mean relative frequencies (black dots) of a hit given a correct identification in the absence and presence of a similar lure. In the left panel, only the highest confidence level is treated as a hit (hit conservative); in the middle panel, the highest and second-highest confidence levels are treated as a hit (hit medium); and in the right panel, all but the lowest confidence level are treated as a hit (hit liberal). Error bars depict ±1 SE (model based), and the violin plots depict kernel densities of individual relative frequencies.]
Another aspect of the experiment conducted by Meyer-Grant and Klauer (2021) that differs from most lineups used in forensic practice, is the number of simultaneously presented items (typically, m > 2 for real-life lineups). Although it seems at least reasonable to assume that the basic underlying memory mechanisms do not depend on lineup size, this conjecture may be verified in the future by quantitative model comparisons for larger lineups.
Note also that the focus of the present work was on SDAI, that is, on simultaneous lineup procedures. However, the general theoretical ideas underlying the modeling framework of the EM and the IOM can be extended to sequential lineups as well (see, e.g., Dunn et al., 2022;Kaesler et al., 2020).
Despite these limitations, we are inclined to conclude that the central mechanism of the EM can be questioned on the basis of our results, especially in light of the fact that other pieces of evidence seemingly favoring the EM over the IOM are inconclusive upon closer inspection (Akan et al., 2021). In fact, the IOM appears to be the only SDAI extension of the SDT model framework which can account for all observations discussed in the literature on lineup memory to date.
However, this does not necessarily imply that this model is uncontested. It has, for example, been argued that a low-threshold model (Kellen et al., 2016; Luce, 1963; Starns, 2020) of SDAI could be a viable competitor to IOMs based on SDT (Meyer-Grant & Klauer, 2021). Moreover, unequal-variance Gaussian SDT models are themselves known to have certain properties that are at least controversial. For instance, they do not possess monotone likelihood ratios, which is, as already mentioned above, often considered to be implausible (e.g., Green & Swets, 1966; Kellen & Klauer, 2018; Kellen et al., 2021). Thus, while the present work contributes to the debate on what qualifies as a reasonable modeling approach for SDAI by questioning the validity of one prominent candidate model (viz., the EM), further research is clearly necessary to validate and/or refine the models of recognition memory used to study lineup memory.

Appendix

(5) for all λ ∈ (0, ∞). Since it holds that exp(−(λ² + z² + 2(μ_Do)²)/2) ≥ 0 and λ − z < z − λ for all z > λ, the inequality in Eq. 5 must hold if μ_Do > 0. Thus, τ_H(λ) is monotonically increasing in ℝ⁺. To show that τ_M is a monotonically increasing function, it suffices to show that T_2(λ) is monotonically decreasing. Taking the derivative of T_2(λ) with respect to λ yields Eq. 6. The denominator in Eq. 6 is always positive and will therefore be ignored in the following. Thus, T_2(λ) is strictly monotonically decreasing if and only if (7) holds for all λ ∈ (0, ∞). Since again exp(−(λ² + z² + 2(μ_Do)²)/2) ≥ 0 and z − λ < λ − z for all z < λ, the inequality in Eq. 7 must hold if μ_Do > 0. Therefore, τ_M(λ) is also monotonically increasing in ℝ⁺.
Proof of Proposition 2 First, we note that under the assumptions of Proposition 2, σ²_Do = (σ²_Xo + σ²_Xn − 2 Cov(X_o, X_n))/4 = (σ²_Xo + σ²_Xs − 2 Cov(X_o, X_s))/4, and that we can again set σ_Do = 1 without loss of generality. It then follows from Eq. 1 that Eq. 3 is equivalent to

Φ(λ − (μ_Xo − μ_Xs)/2) + Φ(λ + (μ_Xo − μ_Xs)/2) ≥ Φ(λ − (μ_Xo − μ_Xn)/2) + Φ(λ + (μ_Xo − μ_Xn)/2).

Because μ_Xo > μ_Xs > μ_Xn = 0, it must hold that 0 < (μ_Xo − μ_Xs)/2 < (μ_Xo − μ_Xn)/2. Thus, it suffices to show that Φ(λ − μ) + Φ(λ + μ) is strictly decreasing in μ for μ > 0, which is clearly the case since λμ > −λμ and hence ϕ(λ + μ) < ϕ(λ − μ) if λ > 0 as well as μ > 0.
Alternative proof of Proposition 2 First, let f_Xo, f_Xs, and f_Xn denote the PDFs of target, similar lure, and regular lure familiarities (i.e., of the RVs X_o, X_s, and X_n with the same support Ω = ℝ), respectively. Furthermore, let F_Xs and F_Xn denote the CDFs of X_s and X_n, respectively. If we assume independence, then the complementary CDF (i.e., the tail distribution function) of the difference of two independent RVs, for instance, X_o and X_n (i.e., the RV X_o − X_n), evaluated at 2λ can be written as

P(X_o − X_n > 2λ) = ∫_ℝ ∫_ℝ 1{x − y > 2λ} f_Xo(x) f_Xn(y) dy dx = ∫_ℝ f_Xo(x) F_Xn(x − 2λ) dx, (8)

where 1{·} denotes the indicator function. If Ω ⊂ ℝ, then we apply a strictly monotonic transformation G : Ω → ℝ to X_o, X_s, and X_n (i.e., a change of variable for all familiarity distributions, such that the support of the transformed RVs becomes ℝ). One immediately sees that Eq. 8 is the probability of a hit as predicted by a nonparametric EM in cases where m = 2, both item familiarities are independent, and a regular lure is presented alongside the target.
By exchanging f_Xn and F_Xn with f_Xs and F_Xs, respectively, in Eq. 8, we also find that

P(X_o − X_s > 2λ) = ∫_ℝ f_Xo(x) F_Xs(x − 2λ) dx

is the predicted probability analogous to Eq. 8 when, instead of a regular lure, a similar lure is presented (still assuming that its familiarity is independent from the target familiarity).
If we further assume first order stochastic dominance of similar over regular lure familiarities (i.e., F_Xs(x) ≤ F_Xn(x) for all x ∈ ℝ), we easily see that

∫_ℝ f_Xo(x) F_Xs(x − 2λ) dx ≤ ∫_ℝ f_Xo(x) F_Xn(x − 2λ) dx ⇔ P_EM(H|λ, Similar Lure) ≤ P_EM(H|λ, Regular Lure) (9)

indeed holds for all λ ∈ ℝ⁺.
It is well known that a monotone likelihood ratio entails first order stochastic dominance (Bapat & Kochar, 1994). Consequently, all distributions exhibiting a monotone likelihood ratio of similar and regular lure familiarity distributions (i.e., f X s (x)/f X n (x) is a monotonically increasing function for all x ∈ R) also exhibit first order stochastic dominance of the similar and regular lure familiarity distributions (i.e., F X s (x) ≤ F X n (x) for all x ∈ R).
Equal-variance Gaussian SDT models indeed exhibit this monotone likelihood ratio and hence also first order stochastic dominance of similar and regular lure familiarity distributions if μ X s > μ X n . Furthermore, for a bivariate Gaussian distribution, the absence of a within-lineup correlation is equivalent to both RVs being independent. Therefore, the inequality in Eq. 9 holds for the equal-variance Gaussian EM, if there is no within-lineup correlation.
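To make the alternative proof concrete, the integral form of Eq. 8 under independence, ∫ f_Xo(x) F_lure(x − 2λ) dx, can be evaluated numerically for Gaussian familiarities; with μ_Xs > μ_Xn, the pointwise ordering F_Xs ≤ F_Xn then yields inequality (9) at every criterion. A minimal sketch with hypothetical means and unit variances:

```python
import math

def phi_pdf(z: float, mu: float = 0.0) -> float:
    """Unit-variance normal PDF with mean mu."""
    return math.exp(-0.5 * (z - mu) ** 2) / math.sqrt(2 * math.pi)

def Phi(z: float, mu: float = 0.0) -> float:
    """Unit-variance normal CDF with mean mu."""
    return 0.5 * (1.0 + math.erf((z - mu) / math.sqrt(2.0)))

def em_hit(mu_o, mu_lure, lam, lo=-10.0, hi=10.0, steps=4000):
    """Trapezoidal evaluation of Eq. 8: integral of f_Xo(x) * F_lure(x - 2*lam) dx
    for independent unit-variance Gaussian familiarities."""
    h = (hi - lo) / steps
    total = 0.0
    for i in range(steps + 1):
        x = lo + i * h
        w = 0.5 if i in (0, steps) else 1.0  # trapezoidal endpoint weights
        total += w * phi_pdf(x, mu_o) * Phi(x - 2 * lam, mu_lure)
    return total * h

# hypothetical means: target 1.5, similar lure 0.75, regular lure 0
mu_o, mu_s, mu_n = 1.5, 0.75, 0.0
for lam in (0.25, 0.5, 1.0):
    assert em_hit(mu_o, mu_s, lam) <= em_hit(mu_o, mu_n, lam)  # inequality (9)
    print(f"lam={lam}: similar={em_hit(mu_o, mu_s, lam):.3f} "
          f"<= regular={em_hit(mu_o, mu_n, lam):.3f}")
```

Since X_o − X_lure is here N(μ_Xo − μ_lure, √2), the integral can also be checked against the closed form 1 − Φ((2λ − (μ_Xo − μ_lure))/√2), which the numerical values reproduce.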