Do target detection and target localization always go together? Extracting information from briefly presented displays

Carrigan, Ann J.; Wardle, Susan G.; Rich, Anina N.

doi:10.3758/s13414-019-01782-9

Published: 19 June 2019

Volume 81, pages 2685–2699, (2019)

Attention, Perception, & Psychophysics

Ann J. Carrigan^1,2,3,
Susan G. Wardle^1,2 &
Anina N. Rich^1,2,3

Abstract

The human visual system is capable of processing an enormous amount of information in a short time. Although rapid target detection has been explored extensively, less is known about target localization. Here we used natural scenes and explored the relationship between being able to detect a target (present vs. absent) and being able to localize it. Across four presentation durations (~ 33–199 ms), participants viewed scenes taken from two superordinate categories (natural and manmade), each containing exemplars from four basic scene categories. In a two-interval forced choice task, observers were asked to detect a Gabor target inserted in one of the two scenes. This was followed by one of two different localization tasks. Participants were asked either to discriminate whether the target was on the left or the right side of the display or to click on the exact location where they had seen the target. Targets could be detected and localized at our shortest exposure duration (~ 33 ms), with a predictable improvement in performance with increasing exposure duration. We saw some evidence at this shortest duration of detection without localization, but further analyses demonstrated that these trials typically reflected coarse or imprecise localization information, rather than its complete absence. Experiment 2 replicated our main findings while exploring the effect of the level of “openness” in the scene. Our results are consistent with the notion that when we are able to extract what objects are present in a scene, we also have information about where each object is, which provides crucial guidance for our goal-directed actions.

Vision is fast: As soon as our eyes open, we get an impression that we can see everything around us. Early findings suggested that the basic meaning of natural scenes (e.g., classification as outdoor vs. indoor scenes) can be extracted after an exposure of only 100 ms (Potter, 1976; Potter & Faulconer, 1975). Further studies using backward masking to precisely control display duration showed that observers are above chance at categorizing scenes at the superordinate (e.g., natural vs. manmade) and basic (e.g., coast vs. city) levels after exposure durations as short as 20 ms (Greene & Oliva, 2009; Joubert, Rousselet, Fize, & Fabre-Thorpe, 2007). In addition, when primed with an object category (e.g., animal or truck), these objects can be detected when observers are shown scenes for only 20–25 ms, albeit with no backward mask (Thorpe, Fize, & Marlot, 1996; VanRullen & Thorpe, 2001). It has been proposed that this remarkable ability is due to extensive experience with domain-specific types of scenes (Drew, Evans, Võ, Jacobson, & Wolfe, 2013).

Although there is evidence that considerable information, such as scene category and target/object detection, is extracted in the initial glance at an environment, goal-directed actions also require information about where target objects are within the environment. Despite the importance of location information for our successful interactions with the world, we know much less about the early stages of localization than of detection in brief displays. In particular, there is debate about the extent to which detection and localization are separable. This is important because localization information is thought to require selective processing, which has severe capacity limits, whereas, at least in studies involving medical images as stimuli, detection may be possible on the basis of “gist” or nonselective processing (e.g., Evans, Georgian-Smith, Tambouret, Birdwell, & Wolfe, 2013; Evans, Haygood, Cooper, Culpan, & Wolfe, 2016; Wolfe, Võ, Evans, & Greene, 2011). Thus, determining the relative time courses of information processing that lead to detection versus localization is an important empirical question with implications for major theories of visual search.

The term “gist” is not entirely clear. The general sense in the natural-scenes and medical-imaging literatures, however, is that gist is something that is extracted rapidly, in a global fashion, without requiring selection (Brennan, Gandomkar, Ekpo, Tapia, Trieu, et al., 2018; Oliva, 2005). If an object can be detected but not localized, this implies that initial early detection might be based on global, nonselective processing (Evans et al., 2016). In contrast, it seems unlikely that specific location information is available in this global signal. Indeed, decades of visual search research have demonstrated that localizing (and detecting) targets that do not have unique features (i.e., are not “pop-out” targets) requires attentive processing (Wolfe, 1994; Wolfe & Horowitz, 2017). Instead, it has been suggested that “gist” may guide the way observers view the scene (Loftus & Mackworth, 1978; Oliva, Torralba, Castelhano, & Henderson, 2003) and therefore facilitate efficient object recognition by guiding selective attention to target locations (Davenport & Potter, 2004).

Although there has been little study of localization specifically in natural scenes, claims of a dissociation between detection and localization information have been made in the medical-imaging literature. Some studies have reported that radiologists can detect, but not locate, a cancer in a mammogram in 250 ms (Brennan et al., 2018; Evans et al., 2013; Evans et al., 2016). The basis of these findings, however, is a lack of significant performance on a localization task with small but significant “above-chance” performance on a detection task. This might be problematic for at least two reasons: First, frequentist statistics do not allow interpretation of a null finding (p > .05) as support for the null hypothesis; second, there is no image-level analysis to determine the degree to which a few images might contribute to the apparent dissociation.

We previously have addressed these issues in a study in which we tested radiologists with mammograms presented for 250 ms (Carrigan, Wardle, & Rich, 2018). We found that radiologists could both detect and localize a mass “above chance,” but, more importantly, we demonstrated that on trials in which it appeared detection was successful and localization was unsuccessful, there were alternative explanations for the apparent dissociation: Either there was coarse information about location (responses clustered around the border of the mass) or a distractor area of the image had been mistakenly identified as a target (responses clustered within another, nontarget area of the image). Although these results might provide an alternative explanation for some of the medical-imaging results, in other studies there was no specific mass to be located, yet radiologists were above chance in classifying the associated images as “abnormal” (e.g., Evans et al., 2016). In these cases, only global, non-location-specific signals could be driving the effect, as there was nothing to localize. Overall, within the medical imaging literature, considerable debate has surrounded the question of whether, when there is sufficient information to detect a target, there is also information about its location, or whether these tasks can be dissociated.

A study on change detection in faces has provided evidence of a dissociation between detection and localization (Howe & Webb, 2014). Howe and Webb showed observers a photograph of a face for 1.5 s, followed by a 1-s blank, and then another version of the same photograph with a single changed feature (e.g., removal of glasses). Observers were asked to indicate whether a change had occurred and, if so, to select the change from a list of nine possible options. The results showed that observers could sometimes detect that a change had occurred without identifying the specific change, even when taking into account potential correct guesses (an important innovation of this study). The authors suggested that the apparent lack of information about the identity of the change might reflect low precision in the location. In contrast, other change detection studies have shown that the detection of a change is accompanied by knowledge of the change location, and that this performance is driven by feature salience (Mitroff & Simons, 2002; Rensink, O’Regan, & Clark, 1997).

One challenge with using either medical images or faces to study detection and localization is that the images themselves carry information that could guide attention to particular locations. For example, in a mammogram, an expert might be more likely to attend to specific regions that are more likely to contain a mass. In the natural-scenes literature, this type of guidance has been called “scene-based guidance” and suggests that our search for objects is influenced by their expected locations (e.g., a toaster on the kitchen counter; Davenport & Potter, 2004; Wolfe et al., 2011). Thus, in these paradigms, it is hard to study the initial brief processing of a stimulus separate from rapid guidance mechanisms that we automatically use to increase our efficiency at perceiving visual displays. Here we addressed this question by using an artificial target that was not related to the image.

The overarching goal of the present study was to investigate the time course of target detection and localization in brief presentations. We used natural scenes in order to retain the expectation that the information could be extracted rapidly from a brief display, and so that we could independently verify that the “gist” had been processed, but we used a Gabor as a target, to avoid any scene-based guidance. Our first aim was to evaluate whether the localization information about a target is accessible at short durations. We then used another feature of natural scenes, usually defined as “openness,” to test whether this localization information was only extracted when the Gabor was not embedded within complex visual information.

To validate our range of durations, we first confirmed that participants could do scene categorization (natural vs. manmade) for the background scenes at the shortest experimental duration (Exp. 1A). We then compared detection and localization performance for a Gabor target embedded in a range of natural scenes at brief exposure durations between 33 and 199 ms (Exp. 1B). To test the idea that location information might be present but less precise at brief durations (Howe & Webb, 2014), we included a left (L)-versus-right (R) localization task as well as the instruction to “click on the location with the mouse.” Finally, in Experiment 2, we explored the effect of openness by testing target detection and localization in “open” versus “closed” scenes.

Experiment 1

Experiment 1A was a scene categorization task (natural vs. manmade), designed to verify that the overall “gist” of the background scenes could be extracted at the shortest experimental duration (~ 33 ms) used in our paradigm. The aim of Experiment 1B was to test whether durations between 33 and 199 ms resulted in sufficient processing to support detection along with localization of a Gabor target embedded in the natural scenes. In a two-interval forced choice (2IFC) paradigm, natural scenes were presented at one of four durations (33–199 ms) on each trial, with a Gabor target randomly located within one of the two scenes. The participants were asked to report which scene contained the target and then where the target was located within the scene.

Method

Participants

Thirty participants (22 females, 8 males; age range 19–55 years, M = 31.47 years, SD = 8.81) were recruited from Macquarie University. All participants gave informed consent, reported normal or corrected-to-normal vision, and were financially reimbursed for their time. The study was approved by the Macquarie University Human Research Ethics Committee (Medical Sciences). The data for two observers were excluded due to technical issues, leaving 28 datasets for analysis.

Stimuli and apparatus

Natural-scene stimuli were classified into the scene categories defined by Oliva and Torralba (2001), which are available at http://cvcl.mit.edu/database.htm (see Fig. 1). A total of 160 photographic images of natural scenes comprising two superordinate categories (natural and manmade) were selected from an internet search using Google Images. The natural and manmade categories comprised four basic-level categories (20 images in each): coast, mountain, open country, and forest for the natural superordinate category, and tall building, highway, city center, and street for the manmade superordinate category. The images were converted to grayscale and downsized to subtend 23° × 15° of visual angle.

The target was a Gabor patch with the following parameters: orientation 45°, spatial frequency 0.5 cycles/deg, diameter 3.8°, Michelson contrast 0.2. The target image appeared in a different random location within a scene (constrained to fully appear within the borders of the display) and was present on all trials in either Scene 1 or Scene 2 (see Fig. 2).
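For readers who want to reproduce a comparable target, the following is a minimal Python sketch of a Gabor patch with the stated orientation, spatial frequency, diameter, and Michelson contrast. It is not the authors' MATLAB code: the function name `gabor_patch`, the raised-cosine window shape, the mean luminance of 0.5, and the value of `PX_PER_DEG` are all assumptions made for illustration.

```python
import math

def gabor_patch(size_px, cycles_per_px, orientation_deg=45.0, contrast=0.2):
    """Return a size_px x size_px list of luminance values in [0, 1].

    A sine grating around a mean luminance of 0.5 is multiplied by a
    raised-cosine window that falls to zero at the patch radius.
    The window shape is an assumption, not taken from the article.
    """
    theta = math.radians(orientation_deg)
    half = (size_px - 1) / 2.0
    radius = size_px / 2.0
    patch = []
    for row in range(size_px):
        y = row - half
        line = []
        for col in range(size_px):
            x = col - half
            # Position along the axis perpendicular to the grating bars.
            u = x * math.cos(theta) + y * math.sin(theta)
            carrier = math.sin(2.0 * math.pi * cycles_per_px * u)
            r = math.hypot(x, y)
            # Envelope: 1 at the centre, 0 at and beyond the radius.
            window = 0.5 * (1.0 + math.cos(math.pi * r / radius)) if r < radius else 0.0
            # Amplitude 0.5 * contrast around a mean of 0.5 gives the
            # requested Michelson contrast where the window equals 1.
            line.append(0.5 + 0.5 * contrast * carrier * window)
        patch.append(line)
    return patch

PX_PER_DEG = 39.0  # assumed display resolution in pixels per degree
size = round(3.8 * PX_PER_DEG)  # 3.8 deg diameter
patch = gabor_patch(size, cycles_per_px=0.5 / PX_PER_DEG)
```

With a mean luminance of 0.5 and a contrast of 0.2, luminance stays between 0.4 and 0.6, and the window leaves the corners of the square matrix at the mean.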

The participants sat at a viewing distance of approximately 70 cm in a dimly lit, windowless laboratory at Macquarie University, Sydney. The stimuli were presented with MATLAB 8.2 using Psychtoolbox 3 (Kleiner, Brainard, & Pelli, 2007) and were displayed on a 27-in. Samsung SyncMaster SA950 LCD monitor (1,920 × 1,080, 120 Hz).
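The stimulus sizes are reported in degrees of visual angle, so converting them to pixels requires the viewing geometry. The sketch below shows one way to do this in Python. It assumes that the 27-in. figure is the diagonal of the visible area, that the panel is 16:9 with square pixels, and that sizes are centred on the line of sight; the helper names are hypothetical.

```python
import math

def screen_width_cm(diagonal_in, aspect_w=16, aspect_h=9):
    """Physical width of a screen from its diagonal (inches) and aspect ratio."""
    diagonal_cm = diagonal_in * 2.54
    return diagonal_cm * aspect_w / math.hypot(aspect_w, aspect_h)

def deg_to_px(angle_deg, distance_cm, px_per_cm):
    """Size in pixels of an extent subtending angle_deg at distance_cm."""
    size_cm = 2.0 * distance_cm * math.tan(math.radians(angle_deg) / 2.0)
    return size_cm * px_per_cm

# Values from the text: 1,920 horizontal pixels, 27-in. monitor, ~70 cm viewing distance.
px_per_cm = 1920 / screen_width_cm(27)
image_w_px = deg_to_px(23, 70, px_per_cm)   # scene width, 23 deg
image_h_px = deg_to_px(15, 70, px_per_cm)   # scene height, 15 deg
gabor_px = deg_to_px(3.8, 70, px_per_cm)    # Gabor diameter, 3.8 deg
```

Under these assumptions the scenes occupy roughly half of the horizontal screen resolution, which is in line with the scenes being shown in the centre of the screen.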

Procedure

Experiment 1A: Scene categorization task

This task was to verify that scene categorization (natural vs. manmade) was possible for the background scenes at the shortest experimental duration. We used a single-factor (superordinate category: natural, manmade) within-subjects design. Each trial began with a fixation point for 498 ms, followed by a scene from one of the superordinate categories, displayed in the center of the screen for 33 ms. This was followed by a backward 1/f noise mask for 249 ms. Participants categorized the scene by its superordinate category (manmade vs. natural) with a key press as accurately and quickly as possible (see Fig. 3). Participants were given ten practice trials at a longer scene presentation duration of 398 ms, to familiarize them with the task, before completing the experimental trials.
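On a 120-Hz display, stimulus durations are realized as whole numbers of video frames. The Python sketch below maps the durations quoted above onto frame counts. Treating each reported duration as the nearest whole number of frames is an assumption: the reported values fall slightly below exact frame multiples, and the reason is not stated in this section.

```python
def ms_to_frames(duration_ms, refresh_hz=120):
    """Nearest whole number of frames for a duration in milliseconds."""
    return round(duration_ms * refresh_hz / 1000)

def frames_to_ms(n_frames, refresh_hz=120):
    """Nominal duration in milliseconds of a whole number of frames."""
    return n_frames * 1000 / refresh_hz

# Fixation, scene, mask, and practice-scene durations from the text.
for ms in (498, 33, 249, 398):
    n = ms_to_frames(ms)
    print(f"{ms} ms -> {n} frames ({frames_to_ms(n):.1f} ms nominal)")
```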

Experiment 1B: Target detection and localization task

For the main experimental task, we used a 2 (location task: exact click, L vs. R) × 4 (duration: 33, 58, 116, 199 ms) within-subjects design. Initially the participants were shown a picture of the target, to familiarize them with a Gabor, and then were given eight practice trials (two per duration) with feedback. Each trial began with a fixation point for 498 ms, followed by Scene 1 (33–199 ms, constant within a block) and a backward 1/f noise mask (249 ms), and then by Scene 2 (same duration as Scene 1) and a 1/f noise mask (249 ms). Observers made a 2IFC decision with a key press regarding whether the target had been present in Scene 1 or Scene 2. Following this detection response, they were presented with a blank screen and asked one of two localization questions (in separate blocks, order counterbalanced across participants). The same images were presented for both localization tasks, and target location was independent of image identity (randomly shuffled). In one block of trials, participants were asked to click on the exact location of the target on the blank screen using the mouse. The location of the Gabor was random and was only constrained to appear within the borders of the display. In the other localization task, they were instead asked whether the target had appeared on the left or the right side of the screen, and they responded using a key press. Here the Gabor was constrained not only to appear within the borders of the display but also to appear fully within one of two invisible bounding boxes, one on the L and one on the R of the screen. This localization task required a coarser judgment of the target’s location in order to answer correctly, as compared to the more difficult exact-click task. The response keys for the L/R localization task were the same as the keys used for the detection task (the “z” and “m” keys; see Fig. 4).

On each trial, both scenes were selected from the same superordinate category (e.g., natural or manmade), but the basic category was random (e.g., both could be from the same category or from different categories within the superordinate category). Fifty percent of trials had the Gabor in Scene 1, and 50% in Scene 2, randomly interleaved within a block. The target location was randomized, with the restriction that it was not clipped by the screen edge and that it appeared in the left half for 50% of trials, and in the right for the other 50% of trials. Duration order was blocked and counterbalanced across participants. The participants performed 160 experimental trials for each localization task. The experiment was self-paced, and the participants initiated each trial with a key press. The observers saw the same images in each task, but in a different randomized order (80 natural and 80 manmade scenes in each version of the task), giving a total of 320 trials across the experiment. They were instructed to respond as accurately as possible, and there was a minimum 15-s rest period every 40 trials. Participants were not provided with any feedback during the experimental tasks (see Fig. 4).
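The balancing constraints described above can be sketched as a trial-list builder. This Python example is illustrative only: the text states that target interval and target side were each split 50/50, whereas fully crossing the two factors, as done here, is an assumption, and the function and key names are hypothetical.

```python
import random

def build_trials(n_trials=160, seed=None):
    """Return a shuffled trial list balanced for target interval and side."""
    rng = random.Random(seed)
    # Assumption: interval (Scene 1 vs. Scene 2) and side (left vs. right)
    # are fully crossed, so each combination occurs equally often.
    cells = [(interval, side) for interval in (1, 2) for side in ("left", "right")]
    trials = cells * (n_trials // len(cells))
    rng.shuffle(trials)
    return [{"target_interval": i, "target_side": s} for i, s in trials]

trials = build_trials(160, seed=1)
```

Shuffling a pre-balanced list, rather than sampling each trial independently, guarantees the exact 50/50 split within a block of 160 trials.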

Results and discussion

Analysis

All analyses were performed using the Statistical Package for the Social Sciences (IBM SPSS version 25). The 95% confidence intervals (CIs) were calculated as X̄ ± 1.960(σ/√n).
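The confidence-interval formula above can be written as a short Python function. This is a sketch rather than the SPSS procedure itself; in particular, using the sample standard deviation for σ is an assumption.

```python
import math
import statistics

def ci95(values):
    """95% CI as mean +/- 1.960 * (sd / sqrt(n)), following the formula in the text."""
    mean = statistics.fmean(values)
    sd = statistics.stdev(values)  # assumption: sample SD stands in for sigma
    half_width = 1.960 * sd / math.sqrt(len(values))
    return mean - half_width, mean + half_width

low, high = ci95([1, 2, 3, 4, 5])
```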

Experiment 1A: Scene categorization

The purpose of this experiment was to independently verify that sufficient information (gist or scene statistics) about scene categories (natural vs. manmade) could be extracted using our parameters at our shortest duration (33 ms), as had been demonstrated by others (Greene & Oliva, 2009; Joubert et al., 2007). We used a measure of sensitivity, d′, as our dependent measure. The mean d′ for the categorization task was 2.29 (SD = 0.56, range = 1.36–3.65). A single-sample t test on d′ relative to chance (d′ = 0) demonstrated that performance was better than chance in categorizing the scenes as manmade vs. natural at the shortest experimental duration, t(27) = 21.8, p < .0001. This replicates previous findings that sufficient visual information to categorize scenes is available from < 50 ms presentations.
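As an illustration of the dependent measure, the Python sketch below computes d′ from response counts and a one-sample t statistic from summary values. The log-linear correction (adding 0.5 to each cell) is an assumption that keeps rates away from 0 and 1; the article does not say whether or how extreme rates were corrected. Which category counts as "signal" is likewise an arbitrary labeling choice.

```python
import math
from statistics import NormalDist

def d_prime(hits, misses, false_alarms, correct_rejections):
    """Sensitivity d' = z(hit rate) - z(false-alarm rate)."""
    # Assumed log-linear correction so that rates are never exactly 0 or 1.
    hit_rate = (hits + 0.5) / (hits + misses + 1.0)
    fa_rate = (false_alarms + 0.5) / (false_alarms + correct_rejections + 1.0)
    z = NormalDist().inv_cdf
    return z(hit_rate) - z(fa_rate)

def one_sample_t(mean, sd, n, mu0=0.0):
    """t statistic for a sample mean against a fixed value mu0."""
    return (mean - mu0) / (sd / math.sqrt(n))

# Consistency check using the rounded summary values reported in the text.
t_check = one_sample_t(2.29, 0.56, 28)
```

Using the rounded mean and SD gives a t value of about 21.6, close to the reported t(27) = 21.8; the small difference is what one would expect from rounding of the summary statistics.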

Experiment 1B: Detection performance

Figure 5 shows detection performance for the Gabor target at exposure durations from 33 to 199 ms. We calculated d′ as a function of target presence in Scene 1 or Scene 2. A two-way repeated measures analysis of variance (ANOVA) on d′ with the factors localization task (exact click, L vs. R) and duration (33, 58, 116, 199 ms) revealed no main effect of localization task, F(1, 27) = 0.38, p = .541; a significant main effect of duration, F(3, 81) = 16.25, p < .0001, η_p² = .38 (Greenhouse–Geisser-corrected); and no significant localization task × duration interaction, F(3, 81) = 1.29, p = .27, η_p² = .046 (Greenhouse–Geisser-corrected). The detection task was identical for the two location tasks, and detection was performed prior to the localization task. It is therefore not surprising that we see only an effect of improved performance as duration increased.

Our primary question for target detection was whether at each duration there would be sufficient information to support detection. We therefore collapsed the detection data across localization tasks and evaluated detection performance using single-sample t tests at each duration relative to a chance level of d′ = 0. Figure 5 shows the data for detection performance (d′) collapsed across localization tasks. To maintain an overall Type I error rate of .05, a Bonferroni correction was used (testwise alpha was set at .0125). Detection performance was significantly above chance at each exposure duration [33 ms, t(27) = 3.98, p < .0001; 58 ms, t(27) = 16.04, p < .0001; 116 ms, t(27) = 27.51, p < .0001; 199 ms, t(27) = 34.23, p < .0001].
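The testwise alpha follows from dividing the overall Type I error rate by the number of tests, here the four durations. The subscripted symbol names are introduced only for this illustration:

```latex
\alpha_{\text{testwise}} = \frac{\alpha_{\text{overall}}}{m} = \frac{.05}{4} = .0125
```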

Localization performance

The results above show that the target Gabor could be detected in complex natural scenes with presentations even as brief as 33 ms. These results are consistent with the previous literature, which has shown accurate object detection within scenes at exposure durations between 20 and 25 ms (Thorpe et al., 1996; VanRullen & Thorpe, 2001). Next, we investigated whether the target could also be located at these very brief presentation durations. Our dependent variable was percentage of localizations correct. We analyzed the total percentage of localizations correct across all trials (regardless of whether detection was correct or incorrect), splitting the analysis by the two localization tasks, L vs. R (coarse localization) and exact click (fine localization), which were presented in separate blocks.

Localization in the coarse L vs. R task

This localization task was a 2AFC: left or right. Thus, chance is 50%. Figure 6 shows a clear pattern of increasing localization performance with increasing duration for both localization tasks. Our key question related to whether there was sufficient information at each duration to support localization. We therefore used single-sample t tests on percentages of correct localization responses (Bonferroni-corrected, testwise alpha set at .0125). For the L vs. R task (chance = 50%), this showed that performance was above chance for all durations [Fig. 6, black line; 33 ms, t(27) = 5.07, p < .0001; 58 ms, t(27) = 11.10, p < .0001; 116 ms, t(27) = 27.49, p < .0001; 199 ms, t(27) = 15.87, p < .0001].

Localization in the exact-click localization task

This localization task required a precise mouse click on the target. We calculated chance on the basis of the number of possible nonoverlapping locations of the target Gabor within the image (chance = 16.67%). To allow for some imprecision in reporting the remembered target location, we defined a region of acceptance (ROA) for scoring a mouse click as a correct localization of twice the Gabor diameter, or 7.6° centered on the Gabor location, defined as a square boundary around the Gabor (i.e., the original matrix size of the 2-D sine wave prior to applying the cosine window in MATLAB). This method has been utilized in other applied perception studies to account for slight imprecision errors (e.g., Carrigan et al., 2018; Evans et al., 2013), although the exact ROA chosen is essentially arbitrary. Importantly, we prespecified what this ROA would be and used the same one for all analyses. Our dependent variable was the percentage of localizations correct. Again, our key question related to whether there was sufficient information at each duration to support localization, using single-sample t tests on percentages of correct localization responses (Bonferroni-corrected, testwise alpha was set at .0125). For the exact click task (chance = 16.67%), this showed that performance was above chance for all durations [Fig. 6; 33 ms, t(27) = 6.3, p < .0001; 58 ms, t(27) = 14.78, p < .0001; 116 ms, t(27) = 26.75, p < .0001; 199 ms, t(27) = 51.21, p < .0001].
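The ROA scoring rule can be sketched in a few lines of Python. The example assumes that click and target positions are expressed in degrees of visual angle in a common coordinate frame and that the ROA is a square of side 7.6° centered on the Gabor, inclusive of its border; the function names are hypothetical.

```python
def click_in_roa(click_xy, target_xy, roa_side_deg=7.6):
    """True if the click falls inside a square ROA centered on the target."""
    half = roa_side_deg / 2.0
    dx = abs(click_xy[0] - target_xy[0])
    dy = abs(click_xy[1] - target_xy[1])
    return dx <= half and dy <= half

def percent_correct(clicks, targets, roa_side_deg=7.6):
    """Percentage of clicks scored as correct localizations."""
    hits = sum(click_in_roa(c, t, roa_side_deg) for c, t in zip(clicks, targets))
    return 100.0 * hits / len(clicks)

# Two illustrative trials: the first click is inside the ROA, the second is not.
example = percent_correct([(0.0, 0.0), (0.0, 0.0)], [(3.0, 3.0), (4.0, 0.0)])
```

Because the ROA is square rather than circular, a click displaced diagonally by up to 3.8° on each axis still counts as correct.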

The localization results show that the participants could accurately localize a Gabor target on some trials for presentation durations as brief as 33 ms. Specifically, participants performed significantly better than chance for all durations, 33–199 ms, even for the precise-localization task (exact click). In Fig. 6 and the corresponding analysis, all localization trials were included, regardless of whether detection was correct. Figure 7 represents the relative proportions of trials across all four durations when detection and localization were both correct, when detection was correct and localization was incorrect, when localization was correct and detection was incorrect, and when detection and localization were both incorrect. Each experimental trial is represented once in the graph. The proportion of trials on which both detection and localization are correct clearly increases as a function of duration, as one would expect. There also seems to be an effect of duration on the proportion of trials on which detection appears to be correct but localization is incorrect (Fig. 7). This pattern is statistically reliable: One-way repeated measures ANOVAs for each localization task showed significant effects of duration (33, 58, 116, 199 ms) on the proportion of detection-only trials [exact click: main effect of duration, F(3, 81) = 30.39, p < .0001, η_p² = .53; L vs. R: main effect of duration, F(3, 81) = 20.05, p < .0001, η_p² = .43]. As duration increased from 33 to 199 ms, there was a decrease in the proportion of trials on which observers were correct on detection but not location.
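The four-way breakdown of trials described above can be computed by classifying each trial once on its detection and localization outcomes. The Python sketch below illustrates that tally; the category labels and function names are chosen for illustration and are not taken from the authors' analysis scripts.

```python
from collections import Counter

LABELS = ("both correct", "detection only", "localization only", "both incorrect")

def outcome(detect_correct, localize_correct):
    """Assign a trial to exactly one of the four outcome categories."""
    if detect_correct and localize_correct:
        return "both correct"
    if detect_correct:
        return "detection only"
    if localize_correct:
        return "localization only"
    return "both incorrect"

def outcome_proportions(trials):
    """Proportion of trials per category; trials are (detect, localize) pairs."""
    counts = Counter(outcome(d, l) for d, l in trials)
    return {label: counts[label] / len(trials) for label in LABELS}

props = outcome_proportions([(True, True), (True, False), (False, True), (False, False)])
```

Computing these proportions separately for each duration and participant yields the values that would enter a repeated measures analysis of the detection-only trials.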

At the shortest durations, accuracy on both detection and localization appears lower in absolute performance for the more precise localization task (exact click) than for the coarse localization task (L vs. R); however, note that differences in chance baseline between the two localization tasks (50% for L/R, 16.67% for exact click) limit a direct comparison of between-task performance. Figure 7a shows that, in the L-versus-R task, observers were correct on localization only on ~ 20% of trials at the 33-ms duration. This could be due to a keyboard assignment issue, since the keys for the detection response matched those for the localization response (e.g., “z” = Scene 1 and left; “m” = Scene 2 and right). This seems likely to have caused occasional response conflict in the L vs. R task, in which participants might have accidentally pressed the button corresponding to the target location first, instead of reporting the detection decision (Scene 1 vs. Scene 2). We therefore do not interpret these localization-without-detection trials for the L/R task further.

Returning to the summary statistics, the overall finding that observers performed above chance on both detection and localization shows that even at brief durations, a target embedded in a natural scene can often be spatially localized as well as detected. Because localization did not succeed on every trial, however, the localization of targets at brief presentation durations may depend on how salient the target is within a particular scene or on features of particular types of scenes, such as the level of a scene’s openness. Previous research has demonstrated that target detection is difficult for target letters (e.g., Henderson, Chanceaux, & Smith, 2009) and for target Gabors located in geographical maps (Rosenholtz, Li, & Nakano, 2007). Here, we explored how a scene’s openness affects target localization at brief durations, by performing an exploratory post-hoc scene analysis on the effect of openness on localization performance within our diverse natural-scene image set.

Models within the computational literature emphasize the importance of global properties, or the distribution of basic features of a scene along with the scene’s spatial layout, for scene recognition. For example, the spatial envelope model (SEM; Oliva & Torralba, 2001) describes the “degree of openness” of a scene. Scenes vary with regard to their degree of openness, ranging from low, where a scene comprises many visual characteristics, to high, where scenes often contain a horizon and are vast, containing minimal visual items. For the predetermined categories from Oliva and Torralba, we grouped three categories with a low degree of openness as “closed” categories (mountain, forest, and city), and three categories with a high degree of openness as “open” categories (coast, open country, and highway). Since this was a post-hoc analysis, there were unequal numbers of open versus closed scenes.
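The post-hoc grouping can be expressed as a simple lookup from basic-level category to openness label. The Python sketch below uses only the six category names listed in the grouping above; categories not named there are left unassigned, because this section does not state how they were treated. The dictionary, function, and key names are hypothetical.

```python
# Grouping as listed in the text; any category not named there maps to None.
OPENNESS = {
    "coast": "open",
    "open country": "open",
    "highway": "open",
    "mountain": "closed",
    "forest": "closed",
    "city": "closed",
}

def openness_of(category):
    """Return 'open', 'closed', or None for an unlisted category."""
    return OPENNESS.get(category.lower())

def split_by_openness(trials):
    """Group trial records (dicts with a 'category' key) by openness label."""
    groups = {"open": [], "closed": []}
    for trial in trials:
        label = openness_of(trial["category"])
        if label is not None:
            groups[label].append(trial)
    return groups

groups = split_by_openness([{"category": "Coast"}, {"category": "forest"}, {"category": "street"}])
```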

If localization is affected by a scene’s degree of openness, there should be a difference between localization performance on correct-detection trials in open versus closed scenes. We divided the scenes into categories according to their level of openness (open, closed) and conducted a two-way repeated measures ANOVA on the two localization tasks separately, with the factors scene (open, closed) and duration (33, 58, 116, 199 ms). See Fig. 8.

For the L vs. R task, we found a significant main effect of scene (open vs. closed), F(1, 27) = 70.16, p < .0001, η_p² = .72; a significant main effect of duration, F(3, 81) = 176.13, p < .0001, η_p² = .87 (Huynh–Feldt-corrected); and a significant scene × duration interaction, F(3, 81) = 5.99, p = .001, η_p² = .18 (Greenhouse–Geisser-corrected). Similarly, for the exact-click task, there was a significant main effect of scene (open vs. closed), F(1, 27) = 87.49, p < .0001, η_p² = .76; a significant main effect of duration, F(3, 81) = 259.69, p < .0001, η_p² = .91; and a significant scene × duration interaction, F(3, 81) = 7.96, p < .0001, η_p² = .23. The interactions suggest that openness does influence the degree to which location information is available. Experiment 2 was designed to follow up this initial analysis by experimentally manipulating the degree of openness in natural scenes, to systematically examine its effects on target detection and localization at brief durations.

Experiment 2

Experiment 2 was designed to systematically investigate the influences of openness in natural scenes on target detection and localization. We manipulated openness using the computational definition of the degree of openness for natural scenes, with the following superordinate categories: open (coast, open country, highway) and closed (forest, mountain, city) (Oliva & Torralba, 2001), as we outlined in Experiment 1. The post-hoc analysis of openness for Experiment 1 had been limited, due to the small and unbalanced set of open and closed natural scenes across the durations, since the experiment had not been designed for this purpose. Thus, in Experiment 2 we increased the number of scenes in each category to be equal across open/closed scene types in each duration, and we examined detection and localization performance as a function of duration and scene type.