Eyewitness identification research has contributed significantly to judicial and policing practices over the past few decades and this momentum continues (Innocence Project, n.d.; State v. Henderson, 2011; Technical Working Group for Eyewitness Evidence, 1999; National Research Council, 2014). A major challenge to eyewitness researchers is the balance between ecological validity and methodological rigor: Researchers must decide which aspects of their experimental design are critical for internal validity and which ought to closely resemble those encountered by real-world eyewitnesses to ensure external validity. In the real world, eyewitnesses typically see one crime and (may) participate in an identification procedure for a suspect. Translated into an experimental design, a participant is assigned to one experimental condition and views one mock crime and an accompanying lineup. This between-subjects design has the advantage of closely mirroring the real eyewitness experience but the disadvantage of producing only one recognition and one confidence data point per participant (cf. Brewer, Weber, Clark, & Wells, 2008).

Researchers obtaining a single data point per participant (per measure) require a large number of participants to obtain sufficient power to detect reliable differences. Concerns about power are heightened when the measure of interest is dichotomous (Tabachnick & Fidell, 2007), as is often the case in eyewitness research. Obviously, collecting data from larger samples requires more time and resources. Using a repeated-measures design to obtain multiple data points from each participant allows researchers to obtain greater power with smaller sample sizes. In addition, within-subjects designs in general have higher internal validity because each participant acts as their own control group (Glass & Hopkins, 1996). Thus, eyewitness researchers must decide between maximizing external validity by collecting one data point per participant from a large number of participants and maximizing power, internal validity, and resources by collecting multiple data points per participant. We considered how such repeated-measures designs affect eyewitness accuracy, choosing, and confidence.

Potential effects of multiple trials

Researchers may be suspicious of using multiple lineup trials because such a procedure may change how participant-eyewitnesses approach the lineup task and/or may produce practice effects (VanLehn, 1996). Participants in multiple-trial eyewitness identification experiments may become more or less accurate as they complete more trials because they become more aware of the task demands, develop beliefs about the researchers’ hypotheses (Rosenthal, 1966), or adopt more or less stringent selection (or rejection) criteria.

Researchers interested in increasing children’s identification accuracy have explored presenting practice trials before the “real” lineup. When presented with traditional identification procedures, children are less likely to reject lineups than adults—leading to a higher probability of false identifications when the perpetrator is absent (Fitzgerald & Price, 2015; Pozzulo & Lindsay, 1998). Pozzulo and Lindsay (1997) hypothesized that practice would decrease choosing by demonstrating to children that saying “no” to a lineup can be an appropriate response. Contrary to this intuitive hypothesis, explicit practice does not tend to improve children’s accuracy (e.g., Goodman, Bottoms, Schwartz-Kenney, & Rudy, 1991; Parker & Myers, 2001; Parker & Ryan, 1993; Pozzulo & Lindsay, 1997). On the basis of their meta-analysis, Pozzulo and Lindsay (1998) concluded that practice likely has no effect on children’s correct rejections although it may increase their correct identifications. If multiple-trial experiments produce higher correct identification rates, consumers of this research (police, lawyers, judges, researchers, etc.) may overestimate the accuracy of real-world child witnesses because they are unaware that this level of performance is an artifact of the research paradigm.

Research examining practice effects with adults is inconsistent. Shapiro and Penrod’s (1986) meta-analysis of facial recognition and eyewitness identification studies found no consistent positive or negative effects of practice (which they referred to as training) on facial recognition and lineup decisions. Practice reduced accuracy in some studies (e.g., Platz & Hosch, 1988), but had no effect in others (e.g., Malpass, Lavigueur, & Weldon, 1973). Although only a few studies have explicitly examined the issue of practice (eight studies with target-present lineups; five with target-absent), Shapiro and Penrod concluded that extensive practice (e.g., describing, recognizing, or comparing 90+ faces) is ineffective, but that short (20-minute) training programs may increase facial recognition accuracy. This pattern suggests that the “practice” inherent in a multiple-trial design may influence performance on early trials. However, one eyewitness study using multiple trials reported no learning effects across eight trials (Mansour, Lindsay, Brewer, & Munhall, 2009). Given the literature, we expected correct identifications to remain stable across multiple trials or, at most, to slightly increase in early trials and level off in later trials (i.e., a quadratic effect of trial number).

Possible interactions with multiple trials

Of particular concern with the use of multiple-trial experiments is whether the manipulated variables would interact with multiple trials to produce systematic changes in accuracy, choosing, or confidence. One such variable is lineup type. Sequential lineups involve presenting one lineup member at a time, combined with other procedural components designed to reduce false identifications relative to presenting lineup members simultaneously (simultaneous lineups; Lindsay, Mansour, Beaudry, Leach, & Bertrand, 2009). Normally, sequential lineups are backloaded, which implies to the eyewitness that the lineup contains more people than it actually does, in order to reduce pressure to choose someone (Horry, Palmer, & Brewer, 2012; Lindsay & Wells, 1985). In contrast, the number of lineup members in a simultaneous lineup is immediately obvious. Thus, eyewitnesses viewing simultaneous lineups may maintain a constant decision criterion (Meissner, Tredoux, Parker, & MacLin, 2005), whereas eyewitnesses viewing sequential lineups may adopt a more lenient criterion as they become familiar with the size of the lineups. If this is the case, a multiple-trial approach would be inadvisable with sequential lineups.

Prior research examining how knowledge of the nominal size of the sequential lineup affects identification decisions speaks to this possibility. Lindsay, Lea, and Fulford (1991) and Horry et al. (2012) found correct rejections (but not correct identifications) were lower when participants were aware of how many members comprised their sequential lineup. A participant who rejected a sequential lineup in a multiple-trial experiment would discover how many lineup members comprised the lineups, which could lead them to adopt a more lenient criterion for identification as they approached the end of subsequent lineups. If selecting a lineup member terminated the trial (as in the present research), selections would not provide information about lineup size. Thus, participants who view sequential lineups may make more correct rejections in early trials than in later trials, when size may become apparent and, thus, perceived pressure to choose increases. We expected no significant difference in correct rejections across trials with simultaneous lineups given the stability of apparent lineup size with these lineups.

A second factor that could interact with trial number is the strength of the eyewitness’ memory for the perpetrator. The difficulty of a lineup decision is at least partially a function of memory trace strength. People with a weak memory trace—due to poor encoding conditions or delay—tend to perform worse when they view multiple lineups for the same suspect (Godfrey & Clark, 2010; Lindsay, Mansour, Kalmet, Bertrand, & Melsom, 2011). Palmer, Brewer, and Weber (2010) suggested that viewing multiple, non-independent lineups for the same target negatively affects metacognitions about memory strength. When they led participants to believe their memory strength was poor (on the basis of feedback or presentation of a second lineup), performance on the second lineup suffered (fewer correct identifications and correct rejections). Palmer et al.’s conclusion is consistent with research showing that people use perceived prior success or failure to inform judgments of future success (Feather, 1966). Thus, a participant’s experience with prior lineups and/or their perceived memory strength may inform subsequent lineup decisions.

For multiple-trial designs, an important consideration is therefore whether confidence interacts with willingness to choose from lineups. When one’s memory for a face is weak, confidence should be low. In contrast, when one’s memory is strong, selections are likely to be made with confidence. A feeling of confidence in a particular decision (arising from the decision being “easy”) may lead participants to have more confidence in their ability to remember, potentially leading them to become more willing to identify someone from the lineup. That is, previous (apparent) success may increase perceptions of one’s general ability to make correct memory-based decisions. Generally, confidence in identification decisions has been associated with conditions that influence memory strength: the relation between confidence and accuracy deteriorates as viewing conditions deteriorate (Lindsay, Read, & Sharma, 1998). Using a multiple-trial design in which some conditions make it difficult to form a good memory trace (e.g., disguise) may eliminate systematic changes in confidence in identifications.

Given that the relationship between confidence and accuracy is weak or nonexistent for lineup rejections (Leippe & Eisenstadt, 2007) we would logically expect no systematic effects of these variables on confidence in rejections. Accordingly, a repeated-measures design that provides a randomized mix of target-present and target-absent lineup trials may not lead participants to become more (or less) confident in their ability; thus, choosing, accuracy, and confidence may be unaffected by trial.

Present study

Whether multiple eyewitness identification trials for independent targets influence the validity of conclusions is an empirical question that has not been addressed directly. We examined whether correct identifications, correct rejections, choosing, and decision confidence changed over 24 trials, and examined possible interactions with lineup type and memory strength variations. The supplemental materials also present analyses of overall accuracy—defined as the proportion of all lineup decisions that were correct (i.e., correct identifications and correct rejections)—and of mean overall confidence. Critically, an absence of effects of trial number and interactions of trial number with other variables (e.g., memory strength) can be taken as evidence that multiple-trial experiments do not compromise the validity of identification data.

Method

We reanalyzed data reported in Mansour et al. (2012) and included newly collected data using an almost identical methodology (total N = 8,376 lineup decisions). We summarize only the key factors of the earlier work; readers are encouraged to refer to Mansour et al. for a more detailed methodology. Participants completed 24 trials in which they watched a mock-crime video, made a lineup decision after no delay or a short delay (see below), and reported their confidence in that decision on a scale from 0% (not at all confident) to 100% (extremely confident). All participants received fair lineup instructions (Malpass & Devine, 1981) in conjunction with each lineup.

With the exception of memory strength (detailed later), all data sets included the same manipulations so we report them together in this section. Participants were randomly assigned to all between-subjects manipulations and counterbalancing was employed for within-subjects manipulations. First, we manipulated between subjects lineup type. Lineup type refers to whether the lineups viewed by participants were simultaneous (i.e., all lineup members presented at once) or sequential (i.e., lineup members presented one at a time, a response for each lineup member required before viewing the next lineup member, no indication given as to the total number of lineup members; Lindsay et al., 2009). Second, we manipulated between subjects type of disguise (toque and sunglasses versus stocking mask). Disguised targets wore a toque (i.e., knitted hat or beanie) and/or sunglasses or they wore a stocking mask covering variable portions of their face. Third, we manipulated within subjects degree of disguise. Toque and sunglasses participants viewed targets that were undisguised, wore a toque, wore sunglasses, or wore both a toque and sunglasses. Stocking participants viewed targets who were undisguised, wore a stocking covering their hair and forehead (1/3 covered), wore a stocking covering to just below their nose (2/3 covered), or wore a stocking completely covering their head (fully covered). Fourth, we manipulated within subjects target presence—that is, whether a particular lineup contained the target (target-present) or not (target-absent).

The mock-crime videos were designed to elicit different levels of memory trace strength for the target. The mock-crime videos intended to elicit good memory strength were approximately 30 s in length and filled a 19-in. monitor. To produce mock-crime videos that would elicit a moderate strength memory, we shortened the 30-s videos to 3 s and resized them to fill one-third of a 19-in. monitor. To produce a poor strength memory, we modified the moderate memory strength condition by including a 30-s delay in which participants completed a visual search task between viewing each 3-s mock-crime video and its associated lineup. We added this delay to further weaken participants’ memory for each target relative to the other conditions by allowing an opportunity for forgetting to occur. With the exception of these changes, the mock-crime videos across memory strength conditions were identical. Our correct identification rates support our good, moderate, and poor memory strength categorization (see the Results section).

Each mock-crime video depicted one of four mock crimes: discussion of a bank robbery, a plot to murder someone, planning of a burglary with an off-screen accomplice, or questioning by an off-screen police officer after a robbery. All videos displayed one target from the shoulders up and the targets followed the same script for the respective mock crimes. Thus, the mock-crime videos depicting the same mock crime were identical except for the target. The videos chosen were selected from a larger pool of videos (approximately 35) based on ease of producing a lineup for the target (e.g., we opted not to include targets for which the pool of filler photographs was small) and to ensure equal numbers of male and female targets.

All lineups included six facial pictures (neck up and thus no clothing cues) of people looking straight into the camera. Fillers were selected by using an iterative match to description procedure (Turtle, Lindsay, & Wells, 2003) within the limits of the pictures available within the lab. Five members of the target-absent lineups were used as fillers in the target-present lineups. No photo appeared in any lineup for more than one target. All lineup members were undisguised.

Mansour et al. (2012) data set

The participants in the two experiments from Mansour et al. (2012) were students at an Eastern Canadian University. The participants in Experiment 1 (N = 98) were randomly assigned to the toque and/or sunglasses condition of our disguise type manipulation, whereas the participants in Experiment 2 (N = 102) were randomly assigned to the stocking condition of our disguise type manipulation (see Footnote 1). In addition to the manipulations described above, participants were expected to have different levels of memory strength.

Good memory strength

Approximately two-thirds of the participants in Experiment 1 (n = 56) and Experiment 2 (n = 58) participated in the good memory strength condition. The quality of their exposure to the target presumably resulted in a good opportunity to encode his or her face. Lineups were presented immediately after the videos, providing little to no opportunity for forgetting.

Moderate memory strength

The other participants in Experiment 1 (n = 38) and Experiment 2 (n = 39) participated in the moderate memory strength condition. This exposure presumably resulted in a moderate opportunity to encode the perpetrator, with little or no opportunity for forgetting.

Additional data set (poor memory strength)

We later collected additional data from 158 participants (randomly assigned to the toque and/or sunglasses disguise [n = 78] or stocking disguise [n = 80] conditions) at a Western Canadian university using nearly the same procedures as Mansour et al. (2012).

Participants

In this additional dataset, the participants were predominantly female (.68) and categorized themselves as Asian (.52), White (.25), or other (.23). Most participants were of college age (M = 20.36, SD = 2.75, range = 17–38). All participants received introductory psychology course credit in exchange for participating.

Design

Approximately equal proportions of participants were randomly assigned to the between-subject manipulations of lineup type (simultaneous, sequential) and disguise type (toque/sunglasses, stocking). As in Mansour et al. (2012), we manipulated degree of disguise and target presence within subjects and employed counterbalancing for these variables.

Materials and procedure

The materials (including videos and lineups) and procedure were the same as in Mansour et al. (2012), with the exception of the 30-s delay between videos and lineups, during which participants viewed a Where’s Waldo image (see Footnote 2) and answered related questions (e.g., “How many people are sunburned in this picture?”).

Measures

We focused our analyses on six measures (three identification decisions, three confidence decisions); additional analyses are available in the supplemental materials. For target-present lineups, we coded target identifications as accurate (correct identifications) and coded responses of “not there” and selections of lineup fillers as inaccurate (inaccurate target-present decisions). When reporting descriptive statistics alongside our model results, we provide the proportion of correct identifications (number of correct identifications divided by the total number of target-present trials). For target-absent lineups, we avoided the issue of designating an innocent suspect because the discipline lacks a consistent method. We coded “not there” responses as accurate (correct rejections) and all selections as inaccurate (target-absent selections). The relevant descriptive statistic for this measure is the proportion of correct rejections calculated as the number of correct rejections divided by the total number of target-absent trials. Overall accuracy was also calculated (see the supplemental materials) as the number of correct responses (i.e., correct identifications plus correct rejections) divided by the total number of trials. Identification decisions were also coded as selections (correct identifications, filler selections, and target-absent selections) or rejections (incorrect rejections and correct rejections) so we could analyze choosing. We report the proportion of choices with our inferential results (i.e., the number of selections divided by the total number of trials). Finally, we examined confidence in correct identifications, confidence in correct rejections, and confidence in target-absent selections. The supplemental materials report analyses of overall confidence, defined as mean confidence across all lineup decisions. Intervals reported after proportions and means are 95% confidence intervals.
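The coding scheme above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the response labels ('target', 'filler', 'reject'), the function names, and the example trials are hypothetical and are not taken from the original materials.

```python
from collections import Counter


def code_decision(target_present, response):
    """Classify one lineup decision.

    `response` uses hypothetical labels: 'target', 'filler', or 'reject'.
    """
    if target_present:
        if response == "target":
            return "correct_id"
        if response == "filler":
            return "filler_selection"
        return "incorrect_rejection"
    # Target-absent lineups have no designated innocent suspect,
    # so every selection is coded as inaccurate.
    if response == "target":
        raise ValueError("a target-absent lineup cannot contain the target")
    return "correct_rejection" if response == "reject" else "ta_selection"


def summarize(trials):
    """Compute the descriptive proportions from (target_present, response) pairs.

    Assumes at least one target-present and one target-absent trial.
    """
    counts = Counter(code_decision(tp, resp) for tp, resp in trials)
    n_total = len(trials)
    n_tp = sum(1 for tp, _ in trials if tp)
    n_ta = n_total - n_tp
    selections = (counts["correct_id"] + counts["filler_selection"]
                  + counts["ta_selection"])
    return {
        "correct_id_rate": counts["correct_id"] / n_tp,
        "correct_rejection_rate": counts["correct_rejection"] / n_ta,
        "overall_accuracy": (counts["correct_id"]
                             + counts["correct_rejection"]) / n_total,
        "choosing_rate": selections / n_total,
    }


# Hypothetical data: four target-present and four target-absent trials.
example_trials = [
    (True, "target"), (True, "filler"), (True, "reject"), (True, "target"),
    (False, "reject"), (False, "filler"), (False, "reject"), (False, "filler"),
]
print(summarize(example_trials))
```

Note that the denominators differ by measure: correct identifications use target-present trials only, correct rejections use target-absent trials only, and overall accuracy and choosing use all trials.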

Analytic approach

Each participant made multiple lineup decisions and we did not randomly assign participants to the good, moderate, or poor memory strength conditions. As such, lineup decisions were nested within participants, which were nested within memory strength. Thus, our data were nested in three levels with trials at Level 1, participants at Level 2, and memory strength at Level 3. Participants were randomly assigned to the between-subject conditions of lineup type and disguise type, while degree of disguise was manipulated within subjects; therefore, these manipulations were incorporated at Level 1 (Field, 2009).

We used multilevel mixed-effects modeling to evaluate models for the six measures described above. First we modeled how participants responded on each trial: one set of models aimed to predict correct identifications, one set aimed to predict correct rejections, and one set aimed to predict choosing. The remaining models aimed to predict participants’ confidence on a particular trial given their specific decision. That is, we modeled confidence in correct identifications, confidence in correct rejections, and confidence in target-absent selections.

Appendix A provides the mathematical formulas for each model and Appendix B provides a flowchart of our modeling process, but we outline the general logic here. We first determined whether a multilevel model was necessary by comparing a one-level model with no predictors to two- and then three-level models with no predictors (i.e., different null models). The three-level model was superior to a one- or two-level model in all but two cases (noted in the Results); therefore, we do not discuss the comparisons of different null models in the results.
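As a rough illustration of what a three-level null model for a dichotomous outcome might look like, consider the random-intercept logistic form below. The notation is an assumption made for this sketch and is not taken from Appendix A: t indexes trials, i participants, and j memory strength conditions.

```latex
\begin{aligned}
\operatorname{logit}\,\Pr(Y_{tij} = 1) &= \gamma_{000} + u_{0ij} + v_{00j},\\
u_{0ij} &\sim \mathcal{N}(0, \sigma^2_{u}), \qquad
v_{00j} \sim \mathcal{N}(0, \sigma^2_{v}).
\end{aligned}
```

Under this sketch, dropping $v_{00j}$ gives the two-level null model and dropping both random intercepts gives the one-level model, which is the sequence of comparisons described above. For the confidence measures, the left-hand side would be the confidence rating itself with a trial-level residual added.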

Next, and in keeping with standard modeling procedures, we examined which fixed effects should be included in the multilevel model (Field, 2009). Each fixed effect was added to the null model (Model 1) individually, and the fit was compared to the null model (Models 2–5). Thus, we compared the null model to four separate fixed-effect models; each model included one of lineup type, disguise type, degree of disguise, or trial number. Fixed effects resulting in superior model fit were next included together in a model to test whether a model with multiple predictors (Model 6) improved fit relative to the null or to models with each individual predictor. If one or none of the fixed effects improved the fit, then we could not construct a Model 6, and so proceeded to the next step. Once the best-fitting fixed-effects model was determined, it was compared to models including interactions of trial number with the fixed effects (Models 7–9) and with the nesting variables (memory strength and participant; Models 10 and 11, respectively). For models with interactions, relevant fixed effects were always included.
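Each step in this sequence is a comparison of two nested models. A minimal sketch of the likelihood ratio test used for those comparisons is given below. It relies only on the Python standard library, and the function names and the example log-likelihoods are hypothetical.

```python
import math


def chi2_sf(x, df):
    """Upper-tail probability of a chi-square variate with integer df >= 1."""
    if x < 0 or df < 1:
        raise ValueError("x must be non-negative and df a positive integer")
    half = x / 2.0
    if df % 2 == 0:
        # Closed form for even df: exp(-x/2) * sum_{k < df/2} (x/2)^k / k!
        term = math.exp(-half)
        total = term
        for k in range(1, df // 2):
            term *= half / k
            total += term
        return total
    # Odd df: start from the df = 1 case and add the recurrence terms.
    total = math.erfc(math.sqrt(half))
    for k in range(1, (df - 1) // 2 + 1):
        total += half ** (k - 0.5) * math.exp(-half) / math.gamma(k + 0.5)
    return total


def likelihood_ratio_test(loglik_simple, loglik_complex, df_diff):
    """Return (chi-square statistic, p value) for two nested models."""
    stat = 2.0 * (loglik_complex - loglik_simple)
    return stat, chi2_sf(stat, df_diff)


# Hypothetical log-likelihoods for a simpler and a more complex model.
stat, p = likelihood_ratio_test(-100.0, -99.8, df_diff=2)
print(round(stat, 2), round(p, 2))
```

A non-significant p value indicates that the added term (e.g., trial number or one of its interactions) does not improve fit over the simpler model.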

We interpreted our results using the likelihood ratio test because this is the most liberal test and the most widely reported one in the eyewitness field (e.g., Horry, Halford, Brewer, Milne, & Bull, 2014; Wright & London, 2009). This test compares the log-likelihood of nested models on a chi-square distribution (Hox, 2010). In addition, we report the Akaike information criterion (AIC), Akaike weights (wᵢ), and Bayes factor (BF) for interested readers. Wagenmakers and Farrell (2004) recommend converting raw AIC values to Akaike weights to obtain an approximate measure of the probability that the model at hand is the best of the various models considered. Akaike weights vary between 0 and 1; therefore, an Akaike weight of .56 indicates that the specific model has a 56% chance of being the best model out of the set of models considered for predicting the variance in a given dataset. Within the text of the results, we calculated a ratio of the Akaike weights for the models being compared. The more complex model was always in the numerator and the simpler model (null or the best-fitting to that point) in the denominator; therefore, ratios greater than 1 indicate evidence for the more complex model. Finally, the BF, the most conservative test, was estimated from the Bayesian information criterion (BIC; Jarosz & Wiley, 2014). Critically for our interest in potential null effects, the BF allows one to test for evidence of null results (Jarosz & Wiley, 2014). For all BF calculations, as we did with the Akaike weight ratios, we placed the more complex model in the numerator and the simpler model in the denominator. Thus, a BF greater than 1 provides evidence in favor of the more complex model. When the BIC is used to calculate the BF, the BF is more likely to favor the null hypothesis (i.e., the simpler model, in this case) over the alternative hypothesis (i.e., the more complex model, in this case); thus, models should be evaluated on the basis of both AIC and BF (Weakliem, 1999).
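The fit indices described above follow from simple transformations of AIC and BIC values. The sketch below shows the standard formulas for Akaike weights, the ratio of two Akaike weights, and the BIC approximation to the BF, with the more complex model in the numerator. The function names and the AIC and BIC values are hypothetical.

```python
import math


def akaike_weights(aics):
    """Convert raw AIC values to Akaike weights that sum to 1."""
    best = min(aics)
    relative = [math.exp(-0.5 * (aic - best)) for aic in aics]
    total = sum(relative)
    return [r / total for r in relative]


def akaike_weight_ratio(aic_complex, aic_simple):
    """Ratio of Akaike weights, complex over simple.

    The normalizing constant cancels, so only the AIC difference matters.
    """
    return math.exp((aic_simple - aic_complex) / 2.0)


def bic_bayes_factor(bic_complex, bic_simple):
    """BIC approximation to the Bayes factor, complex over simple."""
    return math.exp((bic_simple - bic_complex) / 2.0)


# Hypothetical AIC and BIC values for a simpler and a more complex model.
print(akaike_weights([100.0, 102.0]))
print(akaike_weight_ratio(aic_complex=100.0, aic_simple=102.0))
print(bic_bayes_factor(bic_complex=110.0, bic_simple=100.0))
```

For nested models, the AIC difference equals the likelihood ratio statistic minus twice the number of added parameters. A statistic of 20.20 for one added parameter therefore corresponds to an AIC difference of 18.20 and a weight ratio of about 8,955. The BIC applies a larger, sample-size-dependent penalty, which is why the BF is the more conservative index.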

Results

Correct identifications

We created a model to examine which of our experimental manipulations best predicted the accuracy of decisions from target-present lineups (correct identifications versus inaccurate target-present decisions). This analysis, however, revealed that the data were not best modeled using the nested analysis. In fact, when examining models without predictors, the model with an intercept at Level 1 only was better than the models with random intercepts at Levels 1 and 2, or at Levels 1, 2, and 3. Thus, we used hierarchical logistic regression to analyze these data.

In the first step, we entered all of the fixed effects: memory strength, lineup type, disguise type, degree of disguise, and trial number. In Step 2, we entered the interactions of interest: the interactions of trial number with each of memory strength, lineup type, disguise type, and degree of disguise. The overall model was significant, χ²(9) = 136.57, p < .001, Nagelkerke R² = .04; with Step 1 significant, χ²(5) = 130.10, p < .001, Nagelkerke R² = .04; but not Step 2, χ²(4) = 6.47, p = .17, Nagelkerke R² = .002 (none of the predictors in Step 2 were significant, with all ps > .08).

Memory strength, lineup type, and disguise type were significant predictors in Step 1. Participants made more correct identifications in the good memory strength condition (M = .72 [.70, .74]) than in the moderate (M = .65 [.62, .68]) and poor (M = .58 [.55, .60]) memory strength conditions, B = .34, χ²(1) = 79.32, p < .001, OR = 1.40. All three levels differed significantly from each other using a Bonferroni correction (ps ≤ .001). Simultaneous lineups (M = .68 [.66, .70]) led to more correct identifications than did sequential lineups (M = .59 [.57, .62]), B = .42, χ²(1) = 41.56, p < .001, OR = 1.53. Finally, participants made more correct identifications in the toque and/or sunglasses conditions (M = .66 [.64, .68]) than in the stocking conditions (M = .62 [.60, .64]), B = .19, χ²(1) = 8.63, p < .001, OR = 1.21. Neither degree of disguise, B = .05, χ²(1) = 3.02, p = .08, OR = 1.05, nor trial number, B = –.003, χ²(1) = 0.45, p = .50, OR = 1.00 (see Fig. 1A), was a significant predictor. Thus, the hierarchical logistic regression revealed that memory strength, lineup type, and disguise type influenced correct identifications, but that degree of disguise and trial number did not.
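The odds ratios reported with each coefficient are the exponentiated logistic regression weights, OR = exp(B). A short sketch of that conversion follows; the function name is hypothetical, and the inputs are the rounded B values from the text.

```python
import math


def odds_ratio(b):
    """Convert a logistic regression coefficient B to an odds ratio."""
    return math.exp(b)


# Rounded coefficients from the Step 1 results above.
for label, b in [("memory strength", 0.34), ("lineup type", 0.42),
                 ("disguise type", 0.19), ("degree of disguise", 0.05),
                 ("trial number", -0.003)]:
    print(label, round(odds_ratio(b), 2))
```

Because the inputs here are rounded, a recomputed value can differ from the reported one in the last digit (e.g., exp(.42) is about 1.52 against the reported 1.53), presumably because the reported odds ratios were computed from unrounded coefficients.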

Fig. 1 (A) Proportions of actual overall accuracy (i.e., proportion of lineup decisions resulting in either correct identifications or correct rejections), correct identifications, correct rejections, and choosing, by trial number, and (B) proportion of correct identifications by trial number squared (i.e., a quadratic effect)

Because we anticipated there might be an increase in correct identifications early on that would stabilize in later trials, we mean-centered and then squared trial number to create the quadratic effect of trial number. The overall model was significant, χ²(10) = 201.02, p < .001, Nagelkerke R² = .06, with Step 1 being significant, χ²(6) = 193.74, p < .001, Nagelkerke R² = .06, but not Step 2, χ²(4) = 7.28, p = .12, Nagelkerke R² = .002. The conclusions were similar to those for the model with the linear effect of trial number, with two exceptions. First, in Step 1, the quadratic effect of trial number, B = –.006, χ²(1) = 63.41, p < .001, OR = 0.99, was significant (see Fig. 1B). Second, although Step 2 did not account for significant variance, the interaction of the quadratic effect of trial number with disguise type was a significant predictor in Step 2, B = –.004, χ²(1) = 6.67, p = .01, OR = 1.00 (see Fig. 1B).
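The construction of the quadratic term can be illustrated as follows. The sketch assumes that trial number was coded 1 through 24 and centered on the mean of those codes (12.5); the exact coding is not stated in the text, and the function name is hypothetical.

```python
def quadratic_trial_terms(n_trials=24):
    """Return mean-centered trial numbers and their squares.

    Assumes trials are coded 1..n_trials (an assumption of this sketch).
    """
    trials = list(range(1, n_trials + 1))
    mean_trial = sum(trials) / n_trials
    centered = [t - mean_trial for t in trials]
    squared = [c * c for c in centered]
    return centered, squared


centered, squared = quadratic_trial_terms()
print(centered[0], squared[0])    # first trial
print(centered[11], squared[11])  # trial 12, just below the mean
```

Centering before squaring makes the squared term symmetric around the middle of the session, so the first and last trials receive the same value. This reduces the correlation between the linear and quadratic predictors.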

Finally, we ran a multilevel model to examine the effects of trial number (linear and quadratic) on correct identifications. Correct identifications in a three-level nested model were best modeled by including lineup type, disguise type, the quadratic effect of trial number, and the interaction of the quadratic effect of trial number and disguise type (see Table 1), consistent with the results of the logistic regression. Thus, the results indicate a quadratic relationship between trial number and correct identifications, although the effect size is small. We consider the implications of these trial effects in the Discussion section.

Table 1 Parameter estimates for predictors in models of correct identifications (4,188 observations)

Correct rejections

In this analysis, we predicted the accuracy of lineup decisions from target-absent lineups (correct rejections versus target-absent selections). Accuracy was highest when memory strength was good (M = .74 [.72, .76]), followed by moderate (M = .66 [.63, .69]), and poor (M = .48 [.46, .50]).

Including lineup type improved the fit relative to the null model (see Table 2), χ²(1) = 20.20, p < .001, wᵢ ratio = 8,955.29, BF = 365.04, such that more correct rejections were made when participants viewed sequential (M = .66 [.64, .68]) than when they viewed simultaneous lineups (M = .55 [.53, .57]). The fit relative to the null model was not improved by including disguise type, χ²(1) = 1.40, p = .24, wᵢ ratio = 0.74, BF = 0.03; degree of disguise, χ²(1) = 0.20, p = .65, wᵢ ratio = 0.39, BF = 0.02; or trial number, χ²(1) = 0.20, p = .65, wᵢ ratio = 0.41, BF = 3.11 × 10⁻¹⁴ (see Fig. 1A) on their own.

Table 2 Parameter estimates for predictors in models of correct rejections (4,188 observations)

Next we considered whether including interactions with trial number (separately) improved fit relative to the model with lineup type only. Neither the interaction of trial number and lineup type, χ²(2) = 0.40, p = .82, wᵢ ratio = 0.17, BF = 2.89 × 10⁻⁴; nor the interaction of trial number and disguise type, χ²(3) = 1.80, p = .61, wᵢ ratio = 0.12, BF = 8.72 × 10⁻⁶; nor the interaction of trial number and degree of disguise, χ²(3) = 0.60, p = .90, wᵢ ratio = 0.07, BF = 5.29 × 10⁻⁶, improved model fit.

Allowing the slopes of correct rejections across trial numbers to vary across memory strength conditions did not improve model fit, χ²(4) = 0.20, p = 1.00, wᵢ ratio = 0.02, BF = 6.49 × 10⁻⁸. Allowing each participant to have a different slope of correct rejections across trial numbers also did not improve the model fit, χ²(5) = 1.20, p = .94, wᵢ ratio = 0.01, BF = 1.61 × 10⁻⁹, and this model did not converge. Thus, the best-fitting model for correct rejections was a three-level model with the fixed effect of lineup type only. Neither trial number nor any interactions with trial number predicted participants’ correct rejections.

Choosing

In this analysis, we considered whether the manipulated or nested variables predicted participants’ selections from the lineup, correctly or incorrectly, regardless of whether or not the lineup contained the target. Table 3 depicts the model parameter estimates and fit indices. A three-level model was preferable to a two- or one-level model. Participants were least likely to choose someone from the lineup when their memory strength was good (M = .54 [.52, .56]), more likely when it was moderate (M = .56 [.54, .59]), and most likely when it was poor (M = .67 [.66, .69]).

Table 3 Parameter estimates for predictors in models of choosing (8,376 observations)

In comparison to the null model, separately adding lineup type, χ²(1) = 23.80, p < .001, wᵢ ratio = 5.42 × 10⁴, BF = 1.64 × 10³, or disguise type, χ²(1) = 5.80, p = .02, wᵢ ratio = 6.36, BF = 0.19, significantly improved the fit, whereas separately adding degree of disguise, χ²(1) = 1.20, p = .27, wᵢ ratio = 0.67, BF = 0.02, or trial number, χ²(1) = 2.00, p = .16, wᵢ ratio = 0.95, BF = 0.03 (see Fig. 1A), did not. The model with both disguise type and lineup type significantly improved fit relative to the models with only lineup type, χ²(1) = 5.60, p = .02, wᵢ ratio = 5.75, BF = 0.10, and only disguise type, χ²(1) = 23.60, p < .001, wᵢ ratio = 4.90 × 10⁴, BF = 544.57. Thus, the best-fitting fixed-effects model included disguise type and lineup type. Choosing was higher when participants viewed simultaneous (M = .65 [.63, .66]) rather than sequential (M = .56 [.54, .57]) lineups. Likewise, choosing was higher in the toque and/or sunglasses disguise condition (M = .62 [.61, .64]) than in the stocking disguise condition (M = .58 [.57, .60]).
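
This section does not state how the Bayes factors were obtained. One common approach, assumed here purely for illustration, is the BIC approximation BF ≈ exp(ΔBIC / 2), with the number of observations used as the sample size. Under that assumption the sketch below gives values close to, though not always identical to, those reported above (for example, roughly 1.6 × 10³ for lineup type). That would be consistent with rounding of the χ² statistics, but it does not confirm the method. The function name and the choice of sample size are assumptions.

```python
import math

def bic_bayes_factor(chi_sq, added_params, n_obs):
    """Approximate Bayes factor (larger model : smaller model) from BIC.

    Assumes BIC = deviance + k * ln(n), so the BIC difference in favour of
    the larger model is chi_sq - added_params * ln(n_obs).
    """
    delta_bic = chi_sq - added_params * math.log(n_obs)
    return math.exp(delta_bic / 2)

# Choosing models: 8,376 observations (Table 3 caption).
print(bic_bayes_factor(23.80, 1, 8376))  # lineup type vs. null
print(bic_bayes_factor(1.20, 1, 8376))   # degree of disguise vs. null
```

Because the BIC penalty grows with ln(n), a predictor can pass the likelihood ratio test and still receive a Bayes factor below 1 in a data set this large, as happens for disguise type above.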

We next examined whether separately including the three two-way interactions of trial number with lineup type, disguise type, and degree of disguise improved the fit. In sequence, we compared each of these predictors to the best-fitting fixed-effects model, containing lineup type and disguise type. None of the interactions improved the model fit. Specifically, relative to the best-fitting fixed-effects model, the interactions of trial number with lineup type, χ²(2) = 2.00, p = .37, wᵢ ratio = 0.37, BF = 3.35 × 10⁻⁴; disguise type, χ²(2) = 2.20, p = .33, wᵢ ratio = 0.39, BF = 3.53 × 10⁻⁴; and degree of disguise, χ²(3) = 4.20, p = .24, wᵢ ratio = 0.41, BF = 1.06 × 10⁻⁵, did not improve the fit.

Our final step was to examine whether allowing the slope of trial number to vary over levels of memory strength or participants improved the fit. Neither modification to the best-fitting fixed-effects model improved the fit: χ²(4) = 2.00, p = .74, wᵢ ratio = 0.05, BF = 3.94 × 10⁻⁸ (memory strength), and χ²(5) = 5.00, p = .42, wᵢ ratio = 0.08, BF = 5.56 × 10⁻⁶ (participant). Thus, choosing was best predicted by a three-level model including the fixed effects of disguise type and lineup type. Neither trial number nor any interactions with trial number predicted participants’ choosing.

Confidence in correct identifications

The three-level null model predicted confidence in correct identifications better than a one-level null model, but it did not differ from the two-level model. Since a three-level model most accurately reflects our data’s structure, we continued with the three-level model as our null model. Confidence was highest for participants in the good memory strength condition (M = 78.94% [77.76, 80.12]), followed by the moderate (M = 76.48% [74.87, 78.09]), and poor conditions (M = 75.45% [74.03, 76.87]).

Lineup type, χ²(1) = 12.20, p < .001, wᵢ ratio = 164.02, BF = 8.58, but not disguise type, χ²(1) = 1.20, p = .27, wᵢ ratio = 0.61, BF = 0.03; degree of disguise, χ²(1) = 1.20, p = .27, wᵢ ratio = 0.67, BF = 0.04; or trial number, χ²(1) = 1.60, p = .21, wᵢ ratio = 0.74, BF = 0.04 (see Fig. 2), improved the model fit when entered separately and compared to the null model. Confidence in correct identifications was higher when participants viewed sequential (M = 79.41% [78.25, 80.57]) rather than simultaneous (M = 75.02% [73.91, 76.13]) lineups.

Fig. 2 Actual mean confidence (%) in correct identifications, correct rejections, and target-absent selections (i.e., any selection from a target-absent lineup) by trial number

Including the interaction of lineup type and trial number did not further improve fit, relative to the model with just lineup type (see Table 4), χ²(2) = 1.80, p = .41, wᵢ ratio = 0.29, BF = 0.001; neither did including the interaction of disguise type and trial number, χ²(3) = 3.40, p = .33, wᵢ ratio = 0.22, BF = 3.20 × 10⁻⁵; nor did including the interaction of degree of disguise and trial number, χ²(3) = 3.80, p = .28, wᵢ ratio = 0.29, BF = 4.32 × 10⁻⁵. Model fit was also not improved by allowing the slope of trial number to vary with memory strength, χ²(4) = 1.80, p = .77, wᵢ ratio = 0.04, BF = 2.91 × 10⁻⁷; note that this model failed to converge. Nor did fit improve when the slope of trial number was allowed to vary by participant, χ²(5) = 4.20, p = .52, wᵢ ratio = 0.05, BF = 1.96 × 10⁻⁸. Thus, confidence in correct identifications was best predicted by a three-level model that included the fixed effect of lineup type. Neither trial number nor any interactions with trial number predicted participants’ confidence in correct identifications.

Table 4 Parameter estimates for predictors in models of confidence in correct identifications (2,683 observations)

Confidence in correct rejections

The three-level null model was preferred over a one- or two-level null model. Confidence in correct rejections was highest when participants had a good memory strength (M = 72.41% [71.07, 73.75]), followed by a moderate memory strength (M = 67.27% [65.18, 69.36]), and a poor memory strength (M = 63.36% [61.50, 65.21]).

The fit was better than the null model when we included lineup type only, χ²(1) = 10.60, p = .001, wᵢ ratio = 77.48, BF = 4.26 (see Table 5). Confidence in correct rejections was higher when participants viewed simultaneous lineups (M = 70.42% [69.14, 71.70]), as compared to sequential lineups (M = 65.68% [64.17, 67.20]). Neither disguise type, χ²(1) = 0.00, p = 1.00, wᵢ ratio = 0.37, BF = 0.02; nor degree of disguise, χ²(1) = 0.00, p = 1.00, wᵢ ratio = 0.37, BF = 0.02; nor trial number (see Fig. 2), χ²(1) = 0.00, p = 1.00, wᵢ ratio = 0.37, BF = 0.02, improved the model fit.

Table 5 Parameter estimates for predictors in models of confidence in correct rejections (2,531 observations)

Fit was not improved relative to the model with lineup type when we separately included the interactions of trial number with lineup type, χ²(2) = 1.40, p = .50, wᵢ ratio = 0.26, BF = 0.001; with disguise type, χ²(3) = 0.60, p = .90, wᵢ ratio = 0.07, BF = 1.01 × 10⁻⁵; and with degree of disguise, χ²(3) = 0.00, p = 1.00, wᵢ ratio = 0.05, BF = 7.89 × 10⁻⁶. Fit was also not improved by allowing different slopes for each memory strength condition, χ²(4) = 0.20, p = 1.00, wᵢ ratio = 0.02, BF = 1.68 × 10⁻⁷, or for each participant, χ²(5) = 0.40, p = 1.00, wᵢ ratio = 0.01, BF = 3.40 × 10⁻⁹. In summary, the best-fitting model for predicting confidence in correct rejections was a three-level model with the fixed effect of lineup type. Neither trial number nor any interactions with trial number predicted participants’ confidence in correct rejections.

Confidence in target-absent selections

A three-level model was appropriate for these data. We found that confidence in target-absent selections was highest when participants had a good memory strength (M = 64.67% [62.62, 66.72]), followed by a moderate memory strength (M = 55.19% [52.83, 57.55]), and a poor memory strength (M = 53.26% [51.80, 54.71]).

Lineup type alone, χ²(1) = 16.00, p < .001, wᵢ ratio = 1,152.86, BF = 77.48, led to a better fit than the null model (see Table 6). The fit was not improved by adding disguise type, χ²(1) = 1.40, p = .24, wᵢ ratio = 0.78, BF = 0.05; degree of disguise, χ²(1) = 2.20, p = .14, wᵢ ratio = 1.22, BF = 0.08; or trial number (see Fig. 2), χ²(1) = 2.60, p = .11, wᵢ ratio = 1.49, BF = 0.10. Confidence in target-absent selections was higher when participants viewed sequential (M = 59.91% [58.27, 61.55]) rather than simultaneous (M = 53.45% [52.02, 54.89]) lineups.

Table 6 Parameter estimates for predictors in models of confidence in target-absent selections (1,657 observations)

The model with the interaction of lineup type and trial number was not significantly better than the model with lineup type alone, χ²(2) = 2.80, p = .25, wᵢ ratio = 0.58, BF = 0.002. The fit also did not improve when we included the interaction of trial number with disguise type, χ²(3) = 4.20, p = .24, wᵢ ratio = 0.39, BF = 0.0001, or the interaction of trial number with degree of disguise, χ²(3) = 7.80, p = .0503, wᵢ ratio = 2.46, BF = 0.001. Nor did the model fit improve through allowing the slope of trial number to vary with memory strength, χ²(4) = 2.80, p = .59, wᵢ ratio = 0.07, BF = 1.37 × 10⁻⁶, or by participant, χ²(5) = 2.80, p = .73, wᵢ ratio = 0.03, BF = 3.39 × 10⁻⁸. Thus, the best-fitting model for confidence in target-absent selections had three levels and included the fixed effect of lineup type. Neither trial number nor any interactions with trial number predicted participants’ confidence in target-absent selections.

Discussion

The purpose of this article was to determine whether it is appropriate to study lineup decisions and confidence in lineup decisions with a multiple-trial method. The typical eyewitness paradigm involving a single lineup decision and confidence rating per participant is resource-intensive and costly. However, using a multiple-trial approach would be ill advised if the effects of key variables of interest on eyewitness decisions are obscured in the data collected using this approach. Our results are good news for eyewitness researchers. That is, our results indicate that there is essentially no downside to using multiple lineup trials for different targets to examine accuracy, choosing, and confidence across manipulations of memory strength, disguise type, degree of disguise, and lineup type. Our most important results in this regard are that, with one exception, trial number did not interact with other variables of interest such as lineup type and memory strength. In the case of the exception, the effect size was negligible and therefore unlikely to influence the conclusions researchers draw using a multiple-trial approach. Main effect variations across trials should not be critical if other manipulated variables randomly vary or are counterbalanced across trials—but note that we found only one significant main effect and its effect size was trivial.

On the basis of the literature examining practice lineups (e.g., Shapiro & Penrod, 1986), we considered whether an increase in correct identifications would appear during the early trials but disappear in later trials, or whether there would be no effect overall. We did find a statistically significant quadratic effect of trial number on correct identifications, but the size of the effect was negligible. That is, our results indicate that, although trial number was a significant predictor, a correct identification was essentially as likely on one trial as another (OR = 0.99). Likewise, the interaction of the quadratic effect of trial number and disguise type was a significant predictor of correct identifications, but again the effect size was negligible (OR = 1.00). Visual inspection of the main effect and interaction (see Fig. 1B) shows that the nature of these effects is almost impossible to discern. Given our highly powerful data set (our analysis of correct identifications included 4,188 data points) and the small effect sizes, we feel confident in concluding that multiple trials will not obscure the effects of other variables on correct identification rates. Thus, although we found practice effects on correct identifications, the effects are small and are highly unlikely to influence the conclusions researchers draw from multiple-trial experiments.
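
To give a sense of scale for an odds ratio this close to 1, the sketch below converts an odds ratio into a change in probability. The baseline probability of .50 is a hypothetical value chosen for illustration, not a figure from the study, and the function name is likewise an assumption. Treating the odds ratio as a simple one-step multiplier also ignores the quadratic form of the reported effect, so this is only an order-of-magnitude illustration.

```python
def apply_odds_ratio(baseline_prob, odds_ratio, steps=1):
    """Probability after multiplying the baseline odds by odds_ratio, steps times."""
    odds = baseline_prob / (1 - baseline_prob)
    new_odds = odds * odds_ratio ** steps
    return new_odds / (1 + new_odds)

# Hypothetical baseline of .50 and the reported OR of 0.99.
print(round(apply_odds_ratio(0.50, 0.99), 4))  # 0.4975
```

Under these assumptions a single step moves the probability by about a quarter of a percentage point.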

Contrary to our expectations, lineup type did not interact with trial number to influence correct rejections. Thus, researchers wanting to use a multiple-trial design with sequential lineups may do so despite the fact that participants may discern the size of the lineup. One potential explanation for the lack of the predicted effect may be that we terminated the sequential lineups whenever a selection was made. As a result, participants may not have deduced the number of lineup members in the sequential conditions because the apparent lineup size varied with their choosing behavior. We did not draw participants’ attention to the number of lineup members in our sequential lineups and participants completed a randomized assortment of target-present and target-absent trials. An effect of lineup size may emerge if lineup size is more salient (e.g., Horry et al., 2012; Lindsay et al., 1991), but this could be countered by varying lineup size across trials.

Although it was beyond the scope of our research question, we found an interesting result with correct rejections from sequential and simultaneous lineups across our memory strength conditions. Correct rejections declined somewhat for sequential lineups from the good to poor memory strength conditions (.78, .69, .56, respectively); however, the decline for simultaneous lineups was much more dramatic (.70, .63, .41, respectively). We did not test for an interaction of memory strength and lineup type—and chose not to because participants were not randomly assigned to memory strength conditions—thus, we can only raise this as a possible avenue for future research. Evidence regarding how memory strength influences sequential and simultaneous lineup performance is likely to be highly relevant to the current debate over which procedure is superior (Wells, Smalarz, & Smith, 2015; Wells, Smith, & Smalarz, 2015; Wixted & Mickes, 2015a, 2015b).

We found no indication that participants with a good (versus poor) memory trace were more willing to identify someone as trials progressed. Trial number was not included in the best-fitting model of choosing either as a fixed effect or in interaction with any of our manipulations related to memory trace (i.e., memory strength, disguise type, or degree of disguise). These data are consistent with our expectation that exposing participants to a random ordering of target-present and -absent trials limits the opportunity to inflate their perceived ability to identify targets. This finding further supports the use of multiple-trial experiments for studying eyewitness identification.

We found similarly encouraging results with confidence as with accuracy and choosing. Neither trial number nor any interactions with trial number significantly predicted confidence ratings, regardless of decision type (i.e., correct identification, correct rejection, or selection from a target-absent lineup). The alternative, more conservative, fit indices (Akaike weights, BF) are consistent with the likelihood ratio test, on which we based our model selections, with one exception. The exception was confidence in target-absent selections, for which the model with the highest Akaike weight (and therefore the most likely best-fitting model, according to this approach; Wagenmakers & Farrell, 2004) was the three-level model with lineup type, degree of disguise, trial number, and the interaction of degree of disguise and trial number as predictors. Supplemental Fig. 1 provides a visual representation of the interaction of degree of disguise and trial number for confidence in target-absent selections. No systematic effect of trial number is readily apparent across or within memory strength conditions.

Despite the high level of power we had in this experiment (349 participants and within-subjects manipulations of degree of disguise and trial number, resulting in 1,657 to 8,376 data points, depending on the dependent variable), we detected only three possible effects of trial number. The first two effects were on correct identifications: a significant direct quadratic effect of trial number, with a negligible effect size, and an interaction of the quadratic effect of trial number with disguise type, also with a negligible effect size. We encourage readers to carefully consider the practical relevance of small but significant effects detected by using a liberal test given to a large sample. The third effect, on confidence in target-absent selections, emerged on only one of the three model-fitting criteria we reported (Akaike weights), with no easily discernible systematic pattern. Thus, the preponderance of evidence suggests that multiple-trial designs are appropriate for eyewitness identification experiments.

This study has two potential limitations. First, our manipulation of memory strength was confounded with data collection location and dates. That is, we collected the good and moderate memory strength data in Eastern Canada between 2005 and 2007, and the poor memory strength data in Western Canada between 2009 and 2010. Despite this, there is no logical reason to expect that location or date would systematically affect either lineup decisions or confidence. Indeed, the results with regard to memory strength are in line with traditional expectations (e.g., correct identifications, correct rejections, and confidence in correct identifications were higher when memory strength was better); therefore, we think that the variability in our sample probably enhances our generalizability more than this confound negates it. Importantly, the other independent variables produced similar patterns of results in the three data sets, suggesting that participants behaved alike regardless of data collection date or location.

Second, this research examined only one type of multiple-trial identification design—presenting alternating mock-crime videos followed by the yoked lineups (i.e., Crime-Lineup-Crime-Lineup). Meissner and colleagues (Evans, Marcon, & Meissner, 2009; Lane & Meissner, 2008; Meissner et al., 2005) have used a different multiple-trial method in which participants view all targets before completing the accompanying lineups (i.e., Crime-Crime-Lineup-Lineup). It remains to be seen whether Meissner’s paradigm is comparable to our multiple-trial experiments and the standard, single-trial method. Indeed, basic memory research indicates that interference may build up across to-be-remembered lists, which is alleviated by testing (i.e., list-before-last paradigm; Jang & Huber, 2008; Klein, Shiffrin, & Criss, 2007; Shiffrin, 1970). Presenting multiple to-be-remembered target faces or mock-crimes before presenting any lineups may lead to a buildup of interference. In our paradigm, participants had to maintain a memory for only one target at a time, which may have prevented interference. Further research will be necessary before results from our paradigm can be generalized to Meissner and colleagues’ multiple-trial method.

In conclusion, researchers should consider using our multiple-trial paradigm (Crime–Lineup–Crime–Lineup) to obtain more data from fewer participants. Future research into multiple-trial designs should examine whether any systematic effects emerge beyond 24 trials or with any other paradigms (e.g., that of Meissner et al., 2005). Overall, a multiple-trial design for independent lineup trials can be an effective way of obtaining powerful datasets in lineup experiments, allowing researchers to examine more complex interactions than are typically tested and that could significantly contribute to our understanding of eyewitness decision making.