Cue quality and criterion setting in recognition memory

Kent, Christopher; Lamberts, Koen; Patton, Richard

doi:10.3758/s13421-018-0796-6

Published: 02 February 2018

Volume 46, pages 757–769, (2018)
Memory & Cognition

Christopher Kent¹,
Koen Lamberts² &
Richard Patton¹

Abstract

Previous studies on how people set and modify decision criteria in old-new recognition tasks (in which they have to decide whether or not a stimulus was seen in a study phase) have almost exclusively focused on properties of the study items, such as presentation frequency or study list length. In contrast, in the three studies reported here, we manipulated the quality of the test cues in a scene-recognition task, either by degrading them through Gaussian blurring (Experiment 1) or by limiting presentation duration (Experiments 2 and 3). In Experiments 1 and 2, degradation of the test cue led to worse old-new discrimination. Most importantly, however, participants were more liberal in their responses to degraded cues (i.e., more likely to call the cue “old”), demonstrating strong within-list, item-by-item criterion shifts. This liberal response bias toward degraded stimuli came at the cost of increasing the false alarm rate while maintaining a constant hit rate. Experiment 3 replicated Experiment 2 with additional stimulus types (words and faces) but did not provide accuracy feedback to participants. The criterion shifts in Experiment 3 were smaller in magnitude than those in Experiments 1 and 2 and varied in consistency across stimulus types, suggesting, in line with previous studies, that feedback is important for participants to shift their criteria.

Introduction

People often make old-new recognition decisions about stimuli that differ perceptually from the original visual experience. In addition to changes in viewpoint, occlusion, or illumination, the stimulus material itself can be degraded in a number of ways. In this article, we report three experiments in which we studied the effects of stimulus degradation on old-new recognition judgments for visual scenes, words, and faces. In particular, we were interested in whether and how people adjust their decision criterion in response to a degraded test presentation. Criterion setting is an important area for understanding how people make recognition judgments and continues to provide a rich test bed for models of recognition memory (e.g., Cox & Shiffrin, 2012; Hicks & Starns, 2014; Starns, Ratcliff, & White, 2012; Starns, White, & Ratcliff, 2010, 2012). However, very few studies have looked at changes in criterion setting due to item-specific properties at test; most have instead manipulated properties at study (e.g., item or list strength).

One study to look at the impact of test stimulus degradation on recognition performance is Wolfe and Kuzmova (2011), who demonstrated that a significant reduction in stimulus resolution (from 256 × 256 pixels at study to 32 × 32 pixels at test) still allowed for efficient old-new recognition, confirming results previously obtained by Uttl, Graf, and Siegenthaler (2007). Still, despite the robust nature of recognition, it is clear that there must be a level of test item degradation so severe that it leads to a significant decline in recognition performance, and we set out to explore what form that decline takes.

Many current accounts of recognition memory are based on some version of signal detection theory (SDT; see Malmberg, 2008). In such accounts, it is usually assumed that the stimulus generates a familiarity signal (corresponding to the value of a random variable with a particular distribution), and if this signal exceeds a criterion value, an “old” decision is made. Recognition models differ in their characterization of the familiarity variable (some models assume that there are other variables at play as well, e.g., Mandler, 1991), but the nature of the task lends itself exceptionally well to a characterization in terms of a signal-criterion comparison.

Within this framework, degradation of a test stimulus can have several possible effects. Degradation can lead to weaker familiarity signals for old test items, which would result in poorer discriminability of old and new test items. Test item degradation could also affect the variance of the signal distribution. Finally, degradation of the test stimulus could induce a change in the criterion that underlies old-new decisions. It is well known that criterion setting can depend on various characteristics of stimulus items and on procedural variables (see Hockley, 2011, for a review). For example, it has been demonstrated that more memorable items are judged against a more conservative criterion than less memorable items (e.g., Hirshman, 1995), and that more liberal criteria are applied to delayed test items compared to immediate test items (Singer & Wixted, 2006). In some circumstances, criterion shifts can occur trial by trial (e.g., Heit, Brockdorff, & Lamberts, 2003; Hockley & Niewiadomski, 2007), although these shifts may lag considerably behind changes in the decision environment (Brown & Steyvers, 2005). Participants' subjective perception of task difficulty (related to perceived memorability of study lists) has also been shown to affect criterion placement (Bruno, Higham, & Perfect, 2009), and there appear to be reliable individual differences (e.g., Aminoff et al., 2012; Kanter & Lindsay, 2012, 2014). Together, these results led us to expect that, if a criterion shift occurs in response to variation in test item quality, participants will use a more conservative criterion for high-quality test items (i.e., they would need a stronger familiarity signal before declaring a test item “old”) than for low-quality test items, where a more liberal criterion would apply. The shift would reflect participants’ anticipation of stronger familiarity signals from high-quality old test items than from low-quality old test items (e.g., Brown, Lewis, & Monk, 1977). Such a criterion shift would be compatible with the results of a relevant study by Hockley, Hemsworth, and Consoli (1999). When participants studied normal face stimuli and then carried out a recognition task with normal faces and with degraded faces (wearing sunglasses), a mirror effect occurred (see Glanzer & Adams, 1985), with degraded test stimuli producing lower hit rates and higher false-alarm rates (Hockley et al., 1999).

We carried out three old-new recognition experiments. In the study phase of the experiments, the participants observed a number of images (scenes in all three experiments, and also faces and words in Experiment 3). In the subsequent test phase, images that had been presented at study (Old items) were intermixed with unseen images (New items). The participants were asked to decide for each test item whether it was old or new. In all experiments, some of the test images were degraded. Unlike the experiments in Hockley et al. (1999), the whole stimulus was degraded, similar to the study by Wolfe and Kuzmova (2011). In Experiment 1, the degradation was done through low-pass Gaussian filtering, blurring the images. In Experiments 2 and 3, short exposure durations were used to reduce perceptual quality. At short exposure durations, coarse stimulus information is likely to be more available for further processing than fine-grained information (e.g., Fabre-Thorpe, 2011), and so we expected to find similar degradation effects across experiments. In addition, Experiments 1 and 2 gave trial-by-trial feedback about performance (correct/incorrect) at test, whereas Experiment 3 did not provide participants with feedback.

Experiment 1

Method

Participants

Thirty-nine (29 female) students and research staff from the University of Bristol and the University of Warwick participated, either in return for course credit or as volunteers. Mean age was 21.3 years, and all reported normal or corrected-to-normal vision.

Materials

Stimuli were presented on a cathode ray tube monitor set to 1,152 × 864 pixels, controlled via a Pentium-class PC running custom-written software. Responses were made via a mouse connected to the Universal Serial Bus controller of the PC. Stimuli consisted of 128 digital photographs of real-world scenes from four categories (32 images of each; two from each were reserved for presentation at the beginning and end of the study list to control primacy and recency effects): traffic scenes, woodland scenes, buildings, and rivers. For the blurred images we applied a low-pass Gaussian filter with a standard deviation of 25 pixels. The complete set of stimuli is available on request from the first author.
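The blurring step can be illustrated with a minimal sketch. It is not the software used in the experiment: the function names `gaussian_kernel` and `blur_row`, the truncation of the kernel at three standard deviations, and the edge clamping are all assumptions made for illustration. A full image blur would apply the same one-dimensional pass along rows and then along columns.

```python
import math

def gaussian_kernel(sigma, radius=None):
    """Normalized 1-D Gaussian weights (truncated at 3 sigma by assumption)."""
    if radius is None:
        radius = int(3 * sigma)
    weights = [math.exp(-(x * x) / (2 * sigma * sigma))
               for x in range(-radius, radius + 1)]
    total = sum(weights)
    return [w / total for w in weights]

def blur_row(row, kernel):
    """Low-pass filter one row of pixel intensities, clamping at the edges."""
    radius = len(kernel) // 2
    last = len(row) - 1
    out = []
    for i in range(len(row)):
        acc = 0.0
        for k, w in enumerate(kernel):
            j = min(max(i + k - radius, 0), last)  # repeat the border pixel
            acc += w * row[j]
        out.append(acc)
    return out

# Hypothetical row with a sharp dark-to-light edge, blurred with SD = 25 pixels.
kernel = gaussian_kernel(25)
row = [0] * 100 + [255] * 100
blurred = blur_row(row, kernel)
```

After filtering, the sharp edge in `row` becomes a gradual ramp, which is the loss of fine-grained detail that the manipulation relies on.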

Design and procedure

Test cue quality was manipulated within participants. Sixty old and 60 new stimuli were randomly selected from the 120 images for each participant. Of each of the 60 old and 60 new stimuli, 30 were randomly selected to be blurred.
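The random assignment described above can be sketched as follows. This is an illustrative reconstruction, not the authors' software; the function name `assign_conditions`, its parameters, and the integer image identifiers are hypothetical.

```python
import random

def assign_conditions(image_ids, n_old=60, n_blurred=30, seed=None):
    """Randomly split images into old/new, then blur n_blurred of each group."""
    rng = random.Random(seed)
    ids = list(image_ids)
    rng.shuffle(ids)
    groups = {"old": ids[:n_old], "new": ids[n_old:]}
    plan = {}
    for status, group in groups.items():
        blurred = set(rng.sample(group, n_blurred))
        for img in group:
            plan[img] = (status, "blurred" if img in blurred else "clear")
    return plan

# One hypothetical participant: 120 images, identified here by integers.
plan = assign_conditions(range(120), seed=1)
cells = list(plan.values())
```

Each of the four cells (old or new, crossed with clear or blurred) then contains 30 images, with a fresh random assignment for each participant.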

Participants sat alone in a quiet room at a distance of 100 cm from the monitor. The study phase consisted of 68 stimulus presentations. Four stimuli that were not later tested were presented at the start of the list and four at the end of the list; these were used to reduce the impact of primacy and recency effects. Each study stimulus was presented for 2,000 ms, with an inter-stimulus interval of 500 ms consisting of a neutral gray screen. After the study phase, participants were asked to select a mouse button for their “old” responses (the other button being used for “new” responses). The test phase then started. Each trial started with the presentation of a black central fixation cross on a gray background for 500 ms, followed by a blank gray screen for 100 ms. The test stimulus then appeared and was displayed until the response was given. Participants were informed that they should respond as quickly and as accurately as possible. Once participants had made an old/new response, they were presented with a confidence rating screen, in which they clicked on one of four text boxes to indicate how confident they were in the correctness of their response: “Guess”, “Maybe”, “Probably”, and “Definitely”. “Correct”/“Wrong” feedback was then provided centrally for 750 ms. Blurred and Clear stimuli were randomly intermixed. The experiment lasted approximately 15 minutes per participant.

Results and discussion

We first analyzed the old/new decision data, without taking into account the confidence ratings. Table 1 summarizes the response proportions in the Clear and Blurred conditions across all participants. Unlike Hockley et al. (1999), we did not observe a strong mirror effect across non-degraded and degraded test stimuli. The hit rates for clear and blurred old items were very similar, with only slightly more errors in the blurred condition, t(38) = 1.00, p = .33. However, for new items, the false alarm rate was higher in the blurred condition (.442) than in the clear condition (.227), t(38) = 8.39, p < .001, SEM = 0.03, d = 1.35. Sensitivity (d_a) and criterion (c_a) values under a standard Gaussian SDT model were calculated (we used RscorePlus, Harvey, 2010, for all signal detection analyses). As expected, d_a was significantly higher in the clear condition (0.92, 95% CI ± 0.09; see Footnote 1) than in the blurred condition (0.31 ± 0.08; difference = 0.61 ± 0.09), showing that blurring was effective in reducing cue quality. In addition, there was a significant difference in bias between the conditions (difference = 0.27 ± 0.07), with participants using a more conservative criterion in the clear condition (c_a = 0.23) than in the blurred condition (c_a = -0.03).
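To make the logic of these measures concrete, the sketch below computes sensitivity and criterion under the simpler equal-variance Gaussian model (d' and c), not the d_a and c_a estimates that RscorePlus produced for the paper. The false alarm rates are those reported above. The hit rate of .70 is a placeholder assumption, used only to show what a constant hit rate implies; it is not a value from Table 1.

```python
from statistics import NormalDist

def dprime_and_c(hit_rate, fa_rate):
    """Equal-variance SDT: d' = z(H) - z(F), c = -(z(H) + z(F)) / 2."""
    z = NormalDist().inv_cdf
    d_prime = z(hit_rate) - z(fa_rate)
    c = -(z(hit_rate) + z(fa_rate)) / 2
    return d_prime, c

ASSUMED_HIT_RATE = 0.70  # hypothetical placeholder, not taken from Table 1

d_clear, c_clear = dprime_and_c(ASSUMED_HIT_RATE, 0.227)    # clear FA rate
d_blurred, c_blurred = dprime_and_c(ASSUMED_HIT_RATE, 0.442)  # blurred FA rate
```

With the hit rate held fixed, a higher false alarm rate necessarily yields both a smaller d' and a lower (more liberal) c, which is the qualitative pattern described above.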

Table 1 Hit and false alarm rates to clear and blurred stimuli in Experiment 1 and Long and Short stimuli in Experiment 2

To better understand the nature of the criterion shift, we extended the analysis to include the confidence rating data. The confidence ratings for old and new responses were combined to construct a single 8-point scale, with 1 meaning “definitely new” and 8 meaning “definitely old.” On this scale, a value of 4 corresponded to a "guess" rating following a new response, and a value of 5 represented a "guess" rating following an old response. Figure 1 shows the proportions of responses in the eight confidence categories, as a function of stimulus type (clear vs. blurred, and old vs. new). Not surprisingly, the observers were more reluctant to express high confidence in correct responses in the blurred condition than in the clear condition. The data on the 8-point scale were then used to construct z-ROC curves, on the basis of transformed hit and false alarm rates across different levels of confidence (see Macmillan & Creelman, 2005). Figure 2 shows the z-ROCs for the clear and blurred test items. As expected, the z-ROC for the Clear condition shows greater overall discriminability than that for the Blurred condition.
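The construction of z-ROC points from the 8-point scale can be sketched as follows. The rating counts are hypothetical and serve only to make the example runnable; `zroc_points` is an illustrative name, not a function from RscorePlus.

```python
from statistics import NormalDist

def zroc_points(old_counts, new_counts):
    """Return (z(FA), z(hit)) pairs, one per boundary between adjacent ratings.

    Counts are ordered from rating 1 ("definitely new") to rating 8
    ("definitely old"). Points run from the strictest criterion to the laxest.
    """
    z = NormalDist().inv_cdf
    n_old, n_new = sum(old_counts), sum(new_counts)
    points = []
    for k in range(len(old_counts) - 1, 0, -1):
        hit = sum(old_counts[k:]) / n_old   # P(rating > k | old)
        fa = sum(new_counts[k:]) / n_new    # P(rating > k | new)
        points.append((z(fa), z(hit)))
    return points

# Hypothetical rating counts for 60 old and 60 new test items.
old_counts = [5, 6, 7, 6, 6, 8, 10, 12]
new_counts = [12, 10, 9, 8, 6, 6, 5, 4]
points = zroc_points(old_counts, new_counts)
```

An 8-point scale gives seven points. Under Gaussian assumptions they fall approximately on a straight line whose height reflects discriminability.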

To obtain criterion estimates, a conventional approach is to estimate a separate decision criterion for each pair of adjacent scores on the scale (Macmillan & Creelman, 2005). For an 8-point scale, this implies that seven criteria have to be estimated. The criterion estimates were obtained using a variation of the Marquardt method to find maximum-likelihood parameter estimates (Harvey, 2010). The psychophysical model assumed Gaussian distributions and allowed for unequal variances of the “old” and “new” signal distributions. In each condition (Clear or Blurred), nine parameters were estimated (seven criteria, and the mean and variance of the “old” signal distribution, assuming without loss of generality that the “new” distribution is standard normal). Figure 3 shows an overview of the estimated distributions and criteria for Clear and Blurred test items. The estimated criterion values differ between Clear and Blurred test items (the horizontal bars at the top of each criterion line show the 95 % CI around the estimated value), with generally more conservative settings in the Clear condition than in the Blurred condition.
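The nine-parameter model can be written out as a likelihood function. The sketch below shows only how seven criteria plus the mean and standard deviation of the “old” distribution determine the predicted rating probabilities and the log-likelihood. It does not implement the Marquardt-style optimizer, and the parameter values and counts are hypothetical.

```python
from math import log
from statistics import NormalDist

def category_probs(criteria, mean, sd):
    """Probability of each rating category given ordered criteria."""
    dist = NormalDist(mean, sd)
    cdfs = [0.0] + [dist.cdf(c) for c in criteria] + [1.0]
    return [hi - lo for lo, hi in zip(cdfs, cdfs[1:])]

def log_likelihood(old_counts, new_counts, criteria, mu_old, sd_old):
    """Multinomial log-likelihood (up to a constant); "new" is N(0, 1)."""
    p_new = category_probs(criteria, 0.0, 1.0)
    p_old = category_probs(criteria, mu_old, sd_old)
    ll_old = sum(n * log(p) for n, p in zip(old_counts, p_old))
    ll_new = sum(n * log(p) for n, p in zip(new_counts, p_new))
    return ll_old + ll_new

# Hypothetical parameter values and rating counts, for illustration only.
criteria = [-1.0, -0.5, 0.0, 0.3, 0.6, 1.0, 1.5]
old_counts = [5, 6, 7, 6, 6, 8, 10, 12]
new_counts = [12, 10, 9, 8, 6, 6, 5, 4]
ll = log_likelihood(old_counts, new_counts, criteria, mu_old=0.9, sd_old=1.2)
```

A fitting routine would search over the nine parameters for the values that maximize this quantity, separately for the Clear and Blurred conditions.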

A crucial question is why the observed criterion values were chosen. Confidence criteria can be set according to different principles (see Stretch & Wixted, 1998). A pattern in which the criteria are spread further apart in the condition with the smaller discriminability is qualitatively compatible with a likelihood-ratio principle (Stretch & Wixted, 1998), according to which participants maintain a constant ratio between the likelihood of a test item being "old" versus "new" for each confidence criterion, across all test conditions. We computed log-likelihood ratios for each of the criteria in the two conditions in Experiment 1, using the equivalent of Equation A4 in Stretch and Wixted (1998). As shown in Fig. 4, the likelihood ratios differ between the Clear and Blurred conditions, and the results therefore do not support the idea that criteria are set to maintain constant likelihood ratios (note that this conclusion rests on the assumption that the standard deviation of the “new” distribution is the same between conditions). Instead, the criteria seem to reflect the observers' desire to maintain a constant hit rate (as shown in Table 1), combined with a general reduction in confidence for the blurred stimuli. It is remarkable that the observers were willing to tolerate a high false-alarm rate in the blurred condition to maintain a steady hit rate. This suggests that, in the blurred condition, false alarms (saying "old" to new stimuli) were seen as less problematic than misses (i.e., saying "new" to old stimuli). We will consider the reasons for this in the General Discussion.
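The likelihood-ratio principle can be illustrated with the generic log ratio of two Gaussian densities evaluated at a criterion. We have not checked this against Equation A4 of Stretch and Wixted (1998); it is a sketch of the general idea, with the “new” distribution fixed as standard normal and all parameter values hypothetical.

```python
from math import log
from statistics import NormalDist

def log_likelihood_ratio(criterion, mu_old, sd_old):
    """ln[f_old(x) / f_new(x)] at x = criterion, with "new" ~ N(0, 1)."""
    old = NormalDist(mu_old, sd_old)
    new = NormalDist(0.0, 1.0)
    return log(old.pdf(criterion) / new.pdf(criterion))

# Equal-variance example: the neutral point (ln LR = 0) lies at mu_old / 2.
neutral_strong = log_likelihood_ratio(0.5, mu_old=1.0, sd_old=1.0)
neutral_weak = log_likelihood_ratio(0.2, mu_old=0.4, sd_old=1.0)

# The same criterion location implies different ratios in the two conditions.
same_place_strong = log_likelihood_ratio(1.0, mu_old=1.0, sd_old=1.0)
same_place_weak = log_likelihood_ratio(1.0, mu_old=0.4, sd_old=1.0)
```

If criteria were placed to hold the likelihood ratio constant, a given confidence criterion would yield the same value of this function in the Clear and Blurred conditions, even though its location on the familiarity axis would differ.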

Experiment 2

Experiment 2 was designed as a replication of Experiment 1. However, instead of blurring, short exposure duration was used to degrade the stimulus percept.