Memory plays a pivotal role in the justice system. But eyewitness memories are easily distorted (Loftus, 2005). Moreover, eyewitnesses who confidently report their distorted memories are persuasive to jurors (Cutler, Penrod, & Dexter, 1990; Douglass, Neuschatz, Imrich, & Wilkinson, 2010). In many cases, these distortions are the product of suggestive questioning (Loftus & Palmer, 1974; Loftus & Zanni, 1975). But what if questions cause trouble even without being suggestive? Could simply changing the order in which eyewitnesses answer questions affect how they appraise their memory, and in turn how jurors appraise those eyewitnesses? You might think changing the order in which eyewitnesses answer questions would have little effect—after all, the questions are still the same overall. But across the experiments presented here, we show that the order of questions matters.

In fact, we already know some seemingly trivial features of questions create problems for eyewitnesses. Take the striking results from changing just one word: In one study, witnesses reported cars in an accident traveled faster when a question suggested the cars “smashed into” rather than “hit” each other (Loftus & Palmer, 1974). In another study, witnesses were more likely to report seeing a nonexistent broken headlight when a question suggested its presence using the word “the” rather than the more ambiguous “a” (Loftus & Zanni, 1975). More than three decades of research now shows that questions can transmit misleading suggestions that distort memory (see Loftus, 2005, for a review).

But questions can distort more than the details of memories; they can exert equally interesting influences on metacognition. For instance, eyewitnesses incorrectly answer misleading questions quickly and confidently (Loftus, Donders, Hoffman, & Schooler, 1989), and people generally provide more information—but monitor less for accuracy—when forced to answer questions compared with when they decide themselves what to report (Koriat & Goldsmith, 1996). These studies show that questions can change not only the content of eyewitnesses’ memories—but also what eyewitnesses think about their memory.

Research in educational psychology has revealed another important property of questions: The order in which they are asked. Across a series of experiments, people who answered trivia questions from the easiest to most difficult believed they answered more questions correctly than people who answered questions the other way around, even though everyone actually got about the same number correct (Jackson & Greene, 2014; Weinstein & Roediger, 2010, 2012).

We were intrigued by these findings and wondered to what extent the order of questions would influence eyewitnesses’ beliefs about the accuracy and quality of their memory. But questions asked of eyewitnesses are different from trivia questions. Eyewitness interviewers use questions to gather information, often not knowing whether answers are correct. In that context—where accuracy is rarely known—the eyewitness’s subjective experience becomes especially interesting. We were therefore particularly interested to examine what happens when questions are arranged to produce subjective experiences of increasing or decreasing eyewitness confidence.

Of course, it is not obvious that the order of questions should influence eyewitnesses at all. Whereas trivia questions can be drawn from a virtually infinite pool, questions put to eyewitnesses are typically from a more limited set, addressing a specific and recent event. This relative constraint should provide fewer opportunities for uncertainty, reducing eyewitnesses’ reliance on heuristic processing—cognitive shortcuts that can result in biased judgments from seemingly innocuous manipulations (Tversky & Kahneman, 1974). It would be surprising and worrying if a simple change to the order of questions put to eyewitnesses could change how they appraise their memories.

But across six experiments, we show that the order of questions matters. In Experiments 1a, 1b, and 1c, we asked people to watch a video of a crime, then take an eyewitness memory test. We arranged test questions to produce one of two experiences, based on previously normed confidence ratings. In one version, the test began with the question that elicited the greatest confidence and ended with the question that elicited the least, so that people became progressively less confident. In the other version, we reversed this ordering. These different experiences—produced by an identical set of questions—affected how many questions eyewitnesses thought they answered correctly, and how confident they were about the accuracy of their memory. In Experiments 2a, 2b, and 2c, we show that these biases have consequences beyond the eyewitnesses: Jurors’ impressions of eyewitnesses matched the eyewitnesses’ own impressions, a finding in line with research showing that jurors find confident eyewitnesses to be credible eyewitnesses (Penrod & Cutler, 1995).

Experiment 1a

Method

Subjects

Through pilot work, we determined a sample size of 100 (50 per between subjects cell). We ultimately recruited a total of 102 Amazon Mechanical Turk workers (www.mturk.com), because Mechanical Turk and Qualtrics—our experimental software—interact such that it is possible to unintentionally collect more data points than requested.

Design

We used a simple two-groups design with Question Order (low-to-high confidence, high-to-low confidence) manipulated between subjects.

Procedure

The experiment had four phases. First, we told subjects the study was examining learning styles. They watched one of two similar videos of a tradesman who stole items from the unoccupied house he was working in (Takarangi, Parker, & Garry, 2006). We counterbalanced versions across subjects and conditions.

The second phase began when the video ended. To mirror real-life memory decay, subjects solved Sudoku puzzles for 10 minutes.

In the third phase, subjects took a surprise memory test consisting of 30 two-alternative forced-choice (2AFC) questions about the video. As we noted earlier, we were particularly interested in examining what happens when questions are arranged to produce experiences of increasing or decreasing eyewitness confidence. Accordingly, we ordered the test questions by confidence rather than accuracy, because accuracy is typically unknown in an eyewitness context.

We constructed the order of test items using data from an earlier, separate group of 107 subjects who followed the same procedure, except the 30 questions were ordered randomly. Then, on the basis of mean confidence, we ordered the 30 questions from the lowest confidence (M = 1.73, SD = 1.06) to highest (M = 4.79, SD = 0.63) to produce the low-to-high confidence test. We reversed this order to create the high-to-low confidence version. Subjects in the current experiment were randomly assigned one of these versions.
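The ordering step described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the question identifiers and norming ratings below are hypothetical placeholders, not the materials or data used in the experiment, and the function name `order_questions` is ours.

```python
from statistics import mean


def order_questions(norming_ratings, ascending=True):
    """Return question ids sorted by their mean normed confidence.

    norming_ratings maps a question id to the list of confidence
    ratings (1-5) that the norming group gave for that question.
    """
    mean_conf = {qid: mean(ratings) for qid, ratings in norming_ratings.items()}
    return sorted(mean_conf, key=mean_conf.get, reverse=not ascending)


# Hypothetical norming data for three questions (illustration only).
norming = {
    "q1": [5, 4, 5, 5],
    "q2": [2, 1, 2, 3],
    "q3": [4, 3, 4, 4],
}

# Low-to-high test: lowest mean confidence first.
low_to_high = order_questions(norming, ascending=True)

# High-to-low test: the same questions in exactly the reverse order,
# so position k in one version is position (n + 1 - k) in the other.
high_to_low = list(reversed(low_to_high))
```

Reversing one list to obtain the other, rather than sorting twice, keeps the two versions exact mirror images even if two questions happen to share the same mean confidence.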

For each question, subjects used a scale from 1 (“Not at all confident”) to 5 (“Very confident”) to report their confidence that they had selected the correct answer.

The fourth phase followed the test. Subjects answered two randomly ordered questions: [1] “The memory test about Eric the Electrician consisted of 30 questions. How many of those questions do you think you answered correctly?” Subjects responded with a number between 0 and 30; [2] “Suppose that you were asked to testify as an eyewitness. How confident would you be in your memory of the events you saw in the video of Eric the Electrician?” Subjects responded on a scale from 1 (“Not at all confident”) to 5 (“Very confident”).

Results and discussion

We first performed a manipulation check by examining mean confidence ratings for individual test questions. These data appear in the top panel of Fig. 1 and show that our manipulation worked: “low-to-high” subjects were increasingly confident, and “high-to-low” subjects were the opposite. The middle panel of Fig. 1 displays accuracy for individual test questions and shows a similar pattern—although less cleanly, as a consequence of 2AFC scoring. The bottom panel of Fig. 1 displays confidence-accuracy relationships for individual test questions and suggests that the order of questions did not affect subjects’ insight into their own accuracy. We also found that the order of questions had little effect on overall test performance, M diff = 0.65 (2.17 %), 95 % confidence interval (CI) [−0.46, 1.76]; t(100) = 1.15, p = .252.

Fig. 1

Top panel: Mean confidence of a correct answer for each test question, ordered by position on test. Middle panel: Proportion of subjects who answered each test question correctly, ordered by position on test. Bottom panel: Pearson correlations between confidence and accuracy ratings for each test question, ordered by position on test. Note that the test versions are symmetric, i.e., question 1 in one condition is the same as question 30 in the other condition. Data are from Experiment 1a

We now address our primary questions: To what extent did the order of questions [1] bias subjects’ retrospective estimates of their test performance, and [2] affect their confidence in their memory? To answer [1], we subtracted subjects’ test scores from their retrospective estimates to produce bias scores. Positive bias scores represent subjects who thought they performed better on the test than they truly did, and negative bias scores represent the opposite. We present actual and retrospective estimates of test scores in the top panel of Fig. 2 and bias scores in the middle panel. These data show that low-to-high confidence subjects were more pessimistic than high-to-low confidence subjects, M diff = 2.32 (7.73 %), 95 % CI [0.33, 4.32]; t(100) = 2.31, p = .023. To answer [2], we examined subjects’ post-test reports of memory confidence. These data appear in the bottom panel of Fig. 2 and show that low-to-high confidence subjects were less confident about the accuracy of their memory: M diff = 0.46 (11.50 %), 95 % CI [0.08, 0.84]; t(100) = 2.41, p = .018 (for all experiments, we report cell means and SDs in Tables 1 and 2).
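As a rough illustration of the bias-score calculation and the between-groups comparison, the following Python sketch may help. Several things here are assumptions rather than the reported analysis: the toy scores are invented, the pooled-variance t statistic is inferred only from the reported degrees of freedom (t(100) with 102 subjects), and the reading of the bracketed percentages as the mean difference divided by the response range (30 questions, or 4 points on the 1 to 5 scale) is inferred from the reported values.

```python
from math import sqrt
from statistics import mean, variance


def bias_scores(estimates, actual_scores):
    """Bias = retrospective estimate minus actual test score, per subject."""
    return [est - act for est, act in zip(estimates, actual_scores)]


def pooled_t(group_a, group_b):
    """Independent-samples t statistic with pooled variance (assumed test)."""
    n_a, n_b = len(group_a), len(group_b)
    diff = mean(group_a) - mean(group_b)
    df = n_a + n_b - 2
    pooled_var = ((n_a - 1) * variance(group_a) + (n_b - 1) * variance(group_b)) / df
    se = sqrt(pooled_var * (1 / n_a + 1 / n_b))
    return diff, diff / se, df


def percent_of_range(mean_diff, scale_range):
    """Express a mean difference as a percentage of the response range."""
    return mean_diff / scale_range * 100


# Invented toy data (four subjects per condition), for illustration only.
high_to_low_bias = bias_scores([22, 25, 20, 24], [21, 23, 20, 22])
low_to_high_bias = bias_scores([18, 20, 17, 21], [20, 22, 19, 22])

diff, t_stat, df = pooled_t(high_to_low_bias, low_to_high_bias)

# Values reported in the text, re-expressed as percentages of the range.
bias_pct = round(percent_of_range(2.32, 30), 2)       # 30 test questions
confidence_pct = round(percent_of_range(0.46, 4), 2)  # 1-5 scale spans 4 points
```

A positive `diff` here means the high-to-low group overestimated more (or underestimated less) than the low-to-high group, which is the direction of the effect reported above.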

Fig. 2

Top panel: Mean actual and estimated test scores by condition. Middle panel: Mean bias (estimated test score - actual test score) by condition. Positive bias scores represent subjects who thought they performed better than they truly did; negative bias scores represent the opposite. Bottom panel: Mean post-test memory confidence by condition. Error bars represent 95 % confidence intervals of cell means. Data are from Experiment 1a

Table 1 Experiment 1 mean scores for Bias and Confidence by condition
Table 2 Experiment 2 mean scores for Estimate and Confidence by condition

To determine the extent to which these effects would generalize to the more real-world situation of open-ended questions, we conducted Experiment 1b.

Experiment 1b

Method

Subjects

To boost precision, we recruited a larger sample of 220 Mechanical Turk workers.

Design and procedure

Experiment 1b followed the design and procedure of Experiment 1a, except we converted each 2AFC question into a cued-recall question.

Results and discussion

We scored responses by a keyword search. A blind rater also hand-scored a random 20 % of responses; electronic and hand scores were highly correlated, r = 0.96, p < .001.

This new format replicated the earlier results: low-to-high confidence subjects were more pessimistic, M diff = 3.65 (12.17 %), 95 % CI [2.33, 4.98]; t(218) = 5.43, p < .001; and were less confident about the accuracy of their memory, M diff = 0.37 (9.25 %), 95 % CI [0.10, 0.64]; t(218) = 2.73, p = .007. We next ran Experiment 1c to ensure these effects were not tied to specific materials.

Experiment 1c

Method

Subjects

We recruited a new sample of 205 Mechanical Turk workers.

Design and procedure

The design and procedure were the same as in Experiment 1a, except subjects viewed a different video and answered a different set of twenty 2AFC questions (French, Garry, & Mori, 2011). We again arranged questions in the two orders, based on data from an earlier group of 106 subjects who rated each randomly ordered question for its difficulty.

Results and discussion

As before, we found that low-to-high confidence subjects were more pessimistic, M diff = 1.88 (9.40 %), 95 % CI [0.86, 2.90]; t(203) = 3.62, p < .001; they also were less confident about the accuracy of their memory, M diff = 0.28 (7.00 %), 95 % CI [0.01, 0.55]; t(203) = 2.05, p = .042. These data show that the influence of the order of questions generalizes to novel materials.

In line with Cumming’s (2012) recommendations, we obtained more precise estimates of these effect sizes by meta-analysing the results of Experiments 1a, 1b, and 1c, using ESCI software to run two random effects model meta-analyses. These analyses estimate that “low-to-high” eyewitnesses would be 10.33 % more pessimistic about their performance than “high-to-low” eyewitnesses, M diff = 10.33 %, 95 % CI [7.36, 13.30], z = 6.82, p < .001. These “low-to-high” eyewitnesses also would be 0.36 points, or 9.00 %, less confident about what they remember, M diff = 0.36, 95 % CI [0.19, 0.52], z = 4.10, p < .001.
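For readers who want to see how such a pooled estimate arises, here is a minimal sketch of a random effects meta-analysis of the three bias effects. It is not the ESCI implementation: it assumes a DerSimonian-Laird estimator, and it approximates each standard error from the reported 95 % CI using a critical value of 1.96 rather than the exact t critical value. The inputs are the mean differences and CIs reported for Experiments 1a, 1b, and 1c, converted to percentages of the number of test questions.

```python
from math import sqrt


def random_effects(effects, ci_lows, ci_highs, z_crit=1.96):
    """DerSimonian-Laird random effects pooling (assumed estimator).

    Standard errors are approximated as CI half-width / z_crit.
    Returns (pooled estimate, CI lower, CI upper, tau squared).
    """
    ses = [(hi - lo) / (2 * z_crit) for lo, hi in zip(ci_lows, ci_highs)]
    w = [1 / se ** 2 for se in ses]
    fixed = sum(wi * yi for wi, yi in zip(w, effects)) / sum(w)
    # Heterogeneity statistic Q and the between-study variance tau^2.
    q = sum(wi * (yi - fixed) ** 2 for wi, yi in zip(w, effects))
    k_minus_1 = len(effects) - 1
    c = sum(w) - sum(wi ** 2 for wi in w) / sum(w)
    tau2 = max(0.0, (q - k_minus_1) / c)
    w_star = [1 / (se ** 2 + tau2) for se in ses]
    pooled = sum(wi * yi for wi, yi in zip(w_star, effects)) / sum(w_star)
    se_pooled = sqrt(1 / sum(w_star))
    return pooled, pooled - z_crit * se_pooled, pooled + z_crit * se_pooled, tau2


# (mean difference, CI low, CI high, number of test questions) as reported.
reported = [
    (2.32, 0.33, 4.32, 30),  # Experiment 1a
    (3.65, 2.33, 4.98, 30),  # Experiment 1b
    (1.88, 0.86, 2.90, 20),  # Experiment 1c
]
effects = [d / n * 100 for d, _, _, n in reported]
lows = [lo / n * 100 for _, lo, _, n in reported]
highs = [hi / n * 100 for _, _, hi, n in reported]

pooled, lo, hi, tau2 = random_effects(effects, lows, highs)
print(f"pooled = {pooled:.2f} %, 95 % CI [{lo:.2f}, {hi:.2f}], tau^2 = {tau2:.2f}")
```

Under these approximations the pooled estimate should land close to the reported 10.33 %; small discrepancies in the interval are expected because the standard errors here are back-calculated from rounded CIs.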

The results of Experiments 1a, 1b, and 1c show that the order of questions shapes what eyewitnesses believe. Specifically, when people answered questions that initially seemed difficult and then became easy, they were more pessimistic and less confident about their memory compared with others who answered questions that initially seemed easy and then became difficult.

If the order of questions changes how eyewitnesses appraise their memories, one possible consequence is that jurors will appraise the eyewitness's credibility in the same direction (Douglass et al., 2010). Such a result would have disturbing implications for the justice system. Because jurors tend to rely on eyewitness confidence as a signal of accuracy (Penrod & Cutler, 1995), we asked subjects in Experiments 2a, 2b, and 2c to take on the role of a juror, evaluating an eyewitness whose confidence systematically changed over the course of questioning.

Experiment 2a

Method

Subjects

We aimed to collect data from 200 people but ultimately recruited 261 Mechanical Turk workers.

Design

We used a two groups design with Question Order (low-to-high confidence, high-to-low confidence) manipulated between subjects.

Procedure

We asked people to take on the role of a juror and answer questions about an eyewitness who had been in a previous study. We told these “jurors” that in the previous study, the eyewitness had taken a memory test after watching the video of Eric the Electrician. The juror's task was not to watch the video but to carefully read the eyewitness's memory test and then answer some questions.

To mirror the real-world scenario where a group of jurors evaluate one eyewitness, all jurors within a group actually read a single eyewitness’s test that we secretly created. In the high-to-low confidence version, the eyewitness's answers were initially confident but became less confident over the test. In the low-to-high confidence version, this pattern reversed. We created these two versions using data from Experiment 1a. We calculated mean confidence ratings for each of the 30 questions, rounding each mean to an integer so it could be represented on the Likert scale of confidence the eyewitness had ostensibly used. We also randomly selected, for each test question, which answer the eyewitness had ostensibly chosen.

Subjects randomly received either the low-to-high confidence or high-to-low confidence eyewitness test, formatted exactly like the test in Experiment 1a. Immediately after reading, subjects answered two randomly ordered questions: [1] “The memory test about Eric the Electrician consisted of 30 questions. How many of those questions do you think the eyewitness answered correctly?” Subjects responded with a number between 0 and 30; [2] “How confident are you about the accuracy of the eyewitness's memory?” Subjects responded on a scale from 1 (“Not at all confident”) to 5 (“Very confident”).

Results and discussion

Jurors believed that an initially confident eyewitness was more accurate, estimating that these eyewitnesses answered more questions correctly, M diff = 3.23 (10.77 %), 95 % CI [2.12, 4.34]; t(259) = 5.73, p < .001. Jurors also reported more confidence in these eyewitnesses’ memories, M diff = 0.47 (11.75 %), 95 % CI [0.27, 0.68]; t(259) = 4.54, p < .001.

Note, however, that each of the 30 test questions always appeared with the same confidence rating. This confound leaves open the possibility that jurors were influenced not by the eyewitness’s confidence, but by the content of the questions. We ran Experiment 2b to address this counterexplanation.

Experiment 2b

Method

Subjects

We aimed to boost precision by increasing observations to 150 per between subjects cell, ultimately recruiting 305 Mechanical Turk workers.

Design and procedure

The design and procedure were the same as in Experiment 2a, except that we decoupled questions from their associated confidence ratings while maintaining the ascending or descending pattern of confidence, by randomly assigning questions to each confidence rating.
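One way to picture this decoupling is the sketch below. It is an assumption-laden illustration, not the procedure actually used to build the materials: the function name `decouple`, the question labels, and the confidence pattern are all hypothetical. The idea is to hold the sequence of confidence ratings fixed by test position and shuffle which question occupies each position.

```python
import random


def decouple(questions, confidence_pattern, seed=None):
    """Pair a fixed confidence pattern with randomly ordered questions.

    The confidence ratings keep their test positions, so the ascending
    or descending pattern is preserved, while question content is
    assigned to positions at random.
    """
    if len(questions) != len(confidence_pattern):
        raise ValueError("need exactly one confidence rating per question")
    rng = random.Random(seed)  # seeded generator for reproducible materials
    shuffled = list(questions)
    rng.shuffle(shuffled)
    return list(zip(shuffled, confidence_pattern))


# Hypothetical materials: six questions and a descending confidence pattern.
questions = ["q1", "q2", "q3", "q4", "q5", "q6"]
descending = [5, 5, 4, 3, 2, 1]

high_to_low_test = decouple(questions, descending, seed=1)

# The low-to-high version reverses the confidence pattern, again with
# questions randomly assigned to positions.
low_to_high_test = decouple(questions, descending[::-1], seed=2)
```

Because any question can now appear with any confidence rating, a difference between conditions can no longer be attributed to the content of particular questions.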

Results and discussion

We found again that subjects believed high-to-low confidence eyewitnesses answered more questions correctly, M diff = 4.18 (13.93 %), 95 % CI [3.09, 5.27]; t(303) = 7.54, p < .001, and were more confident about the accuracy of these eyewitnesses’ memories, M diff = 0.72 (18.00 %), 95 % CI [0.52, 0.92]; t(303) = 7.10, p < .001.

Finally, we ran Experiment 2c to demonstrate that these effects were not tied to specific materials.

Experiment 2c

Method

Subjects

We aimed to collect 150 observations per between subjects cell, and ultimately recruited 316 Mechanical Turk workers.

Design and procedure

The design and procedure were the same as in Experiment 2b but used the materials from Experiment 1c.

Results and discussion

We found again that jurors believed high-to-low confidence eyewitnesses answered more questions correctly, M diff = 1.88 (9.40 %), 95 % CI [1.14, 2.62]; t(314) = 4.99, p < .001, and jurors were also more confident about the accuracy of these eyewitnesses’ memories, M diff = 0.42 (10.50 %), 95 % CI [0.22, 0.62]; t(314) = 4.14, p < .001.

The findings from Experiments 2a, 2b, and 2c fit with those of Experiments 1a, 1b, and 1c, in which eyewitnesses thought they answered more questions correctly and reported higher confidence in their memory if their initial experience was one of high confidence. We meta-analysed the results of Experiments 2a, 2b, and 2c (Cumming, 2012) and estimated that jurors believe “high-to-low” eyewitnesses answer 11.38 % more questions correctly, M diff = 11.38 %, 95 % CI [8.77, 14.00], z = 8.53, p < .001. Moreover, jurors are 0.54 points—or 13.50 %—more confident about the accuracy of a “high-to-low” eyewitness’s memory, M diff = 0.54, 95 % CI [0.36, 0.72], z = 5.75, p < .001.

General discussion

Across six experiments, we found that the order in which eyewitnesses answered questions mattered in two key ways. First, the order changed how eyewitnesses appraised themselves. When questions produced an initial experience of high confidence rather than low confidence, eyewitnesses believed that they were more accurate and were more confident about their memory. Second, the order changed how jurors appraised eyewitnesses. Jurors believed eyewitnesses who initially displayed high confidence were more accurate, and jurors were more confident about those eyewitnesses’ memories. This collection of results paints a worrying picture of the malleability of beliefs about memory accuracy.

It is surprising that questions produce different beliefs in witnesses when all that changes is the order in which those questions are asked. Ultimately, everyone answers the same questions, so it seems reasonable to expect no differences in beliefs. But the influence of order shows that beliefs about memory are shaped not only by the content or phrasing of questions, but also by factors that, on the face of it, are trivial.

In fact, our seemingly trivial manipulation produced effects similar in size to more blatant manipulations affecting eyewitness credibility. An eyewitness who claims to be absolutely certain, for example, is rated more credible than an eyewitness who does not (Tenney, MacCoun, Spellman, & Hastie, 2007), and prosecution eyewitnesses who elaborate their testimony with extra details are more credible, and get more guilty verdicts, than eyewitnesses who do not (Bell & Loftus, 1988, 1989). It is worrying that our subtle manipulation produces effects similar in magnitude to these relatively heavy-handed approaches.

How can we explain our effects? One possibility is that people’s attention wanes over the test, resulting in impressions influenced most by early experience (Crano, 1977). If this “attention decrement” hypothesis is true, then the same question should be answered with higher accuracy when it appears early rather than late. To address this possibility, we ran a random effects model meta-analysis comprising all three datasets from Experiment 1. This meta-analysis compared accuracy between groups for the subjectively easiest and most difficult test questions, because each appears first for one group and last for the other. We found no support for this attention-based explanation: Accuracy is not notably different when a question appears first rather than last, M diff = −0.01, 95 % CI [−0.04, 0.02], z = −0.40, p = .686.

An alternative explanation is that the effects are driven by early experience and insufficient adjustments: The subjective ease or difficulty of early questions sets an anchor, and to save effort, people adjust from this anchor only until reaching a plausible impression (Epley & Gilovich, 2006). This explanation is consistent with recent research in which subjects held biased impressions of performance throughout a trivia test, and not merely at the end (Weinstein & Roediger, 2010, 2012). Relatedly, Experiments 2a–2c suggest that jurors used early information to create a story about the eyewitness’s credibility and were slow to revise that story in the face of new information. This explanation fits with the Story Model of juror decision-making, a model in which jurors’ verdicts are influenced by the stories they construct to make sense of events (Pennington & Hastie, 1992).

Our findings have implications for eyewitnesses’ metacognition, because they suggest that the order of questions influences eyewitnesses’ ability to evaluate what they know about an event. Similarly, our findings are reminiscent of other suggestive techniques that manipulate eyewitness beliefs, such as subtle changes to the wording of questions, or direct feedback about lineup identifications (Douglass & Steblay, 2006; Loftus & Palmer, 1974; Loftus & Zanni, 1975). But in contrast, we have manipulated what eyewitnesses and jurors believe about memory without using suggestive techniques.

Our findings also raise interesting questions. For instance, does the order of questions influence other related judgments, such as eyewitnesses’ estimates of how well they saw the perpetrator? We know that positive post-identification feedback enhances eyewitnesses’ beliefs about their memory for a crime, including how well they could see a suspect’s face and how much attention they paid (Wells & Bradfield, 1998). Perhaps an initial experience of subjectively easy questions causes similar enhancements. It would also be useful to know if the order of questions produces lasting changes in beliefs or if the influence is fleeting. Finally, it is worth considering that we ordered questions in our experiments either by subjective confidence or subjective difficulty. Earlier work has ordered questions by objective difficulty, calculated as the mean proportion of people who answer a question correctly (Jackson & Greene, 2014; Weinstein & Roediger, 2010, 2012). Our results suggest that the subjective experience of difficulty may underpin the influence of question order—but a future experiment teasing apart subjective and objective difficulty could provide information about their relative contributions.

Eyewitnesses play an undeniably important role in the justice system. But justice requires that we protect the integrity of eyewitness memory as much as possible. That integrity is called into question when eyewitnesses and jurors are swayed by something as trivial as the order in which they answer questions.