To sustain visually directed attention in real-world contexts (e.g., driving at night), a complex stream of incoming visual information must be continuously monitored for task- or goal-relevant features, locations, or objects (e.g., deer). An important component of this process is constructing and maintaining a task set whereby select features stored in an attentional template are mapped to the appropriate behavioral responses. For example, when driving on a highway through the forest at dusk, the driver should maintain a task set that potentiates deer or deer-like features and links these features with an avoidance response (e.g., brake, accelerate, swerve). If a driver fails to develop an efficient attentional template—potentiating features that are either too specific (e.g., Bambi) or too broad (e.g., brown)—this failure will hinder the driver’s ability to rapidly identify and avoid deer. Even if an optimal attentional template is constructed, it must be maintained, and failure to do so may result in a sub-optimal attentional state in which the presence of deer-like features fails to elicit a rapid avoidance response. In the sustained attention literature, such failures to respond to target features are labeled attentional lapses and, as described above, the underlying causes of such lapses can be variable and multifaceted. Individual differences in the ability to sustain attention are stable (trait-like), and the factors driving these individual differences are likely multifaceted for similar reasons. In the present paper, we seek to better characterize the relationship between individual differences in sustained attention ability and the content and maintenance of an attentional template during a sustained attention task.
Probing attentional templates using T-N similarity
In order to assess the strength and content of individual attentional templates we rely on the notion that the similarity of a target stimulus to its nontarget distractors (called T-N similarity by Duncan & Humphreys, 1989) and vice versa can influence performance. This influence has been thoroughly explored and validated in static visual search paradigms (for a recent review see Hout et al., 2015), where participants must give a response indicating whether a target stimulus exists amongst a complex scene of simultaneously presented nontarget stimuli. Specifically, as visual T-N similarity increases (the targets and nontargets are more similar to one another), participants become slower and more error prone (Duncan & Humphreys, 1989). Additionally, RSVP and masked priming experiments have demonstrated that T-N similarity can reduce the likelihood of target detection when nontargets are temporally (within 400 ms) adjacent (Raymond, Shapiro, & Arnell, 1995).
Because T-N similarity has a consistent effect on visual search, T-N similarity effects have been used to probe the content and dynamics of attentional templates during visual search tasks (Geng, DiQuattro, & Helm, 2017; Lee & Geng, 2017). It is assumed that target detection is accomplished by matching an incoming stimulus representation with the features stored in the attentional template (Carlisle, Arita, Pardo, & Woodman, 2011; Desimone & Duncan, 1995; Hout & Goldinger, 2015; Lee & Geng, 2017). Therefore, differences in the influence of T-N similarity on performance could reflect differences in the attentional template. In one study, Lee and Geng (2017) demonstrated that individual differences in the representational similarity for face-morph stimuli predicted behavioral differences in an identity search task. Specifically, individuals who found a given face-pair to be more similar were more likely to be distracted when one item in the pair was a target and the other was a distractor. The differences in the perceptual similarity space of faces were shown to influence individual attentional templates and thus modulate the degree of behavioral interference due to T-N similarity on the visual search task. Aside from this study (see also Charest, Kievit, Schmitz, Deca, & Kriegeskorte, 2014), little work has been done exploring individual differences in the representation of stimulus similarity and how this relates to attentional templates and performance on a given task. Accordingly, we sought to determine the relationship between individual differences in sensitivity to T-N similarity and more standard measures of performance during a sustained attention task. On the one hand, heightened sensitivity to T-N similarity could be a sign of poor filtering/greater interference and thus would lead to conflicting or incorrect signals sent from perceptual and decision processes to motor responses. 
On the other hand, it could be that those with more robust and efficient attentional templates are more accurate overall but—because both target and non-target features are strongly potentiated—more susceptible to stimulus-specific differences in T-N similarity.
Quantifying T-N similarity in naturalistic scenes
Experimentally measuring the influence of T-N similarity on behavior depends on correctly identifying (or at least approximating) the behaviorally relevant features involved in the matching process. In a visual search task where a single target is precisely cued, exemplar-specific features are used to identify the target from the distractors. However, as the precision of the cue deviates from the target (e.g., the cue deviates by a size or orientation transformation, or the cue is a category label), the features used to discriminate the target from nontargets become increasingly abstract (Hout & Goldinger, 2015; Maxfield & Zelinsky, 2012; Peelen & Kastner, 2011; Schmidt & Zelinsky, 2009a; Vickery, King, & Jiang, 2005; Yang & Zelinsky, 2009). In the experiments presented below, participants needed to discriminate photographs of city scenes from photographs of mountain scenes. While the discrimination falls across a basic category boundary, there were only ten exemplars in each category and each exemplar was repeated many times. This experimental design allowed both category-level and exemplar-level features to mediate the target-matching process. Therefore, we quantified the strength and content of attentional templates as the degree of observed sensitivity to T-N similarity, and we considered either exemplar-level or category-level feature spaces when computing T-N similarity. As a consequence of making explicit the feature spaces underlying T-N similarity, we also explored how task demands shape the representational content of attentional templates. It could be that category-level features were optimal, as they mediate our ability to rapidly identify novel scenes more generally; however, because this task enabled the use of exemplar-level features, leveraging such features may have been optimal for the matching process.
Relating sustained attention to attentional templates
By using T-N similarity to measure the contents and strength of an individual’s attentional template during a sustained attention task, we sought to determine if individual differences in the influence of T-N similarity on performance were associated with more standard measures of sustained attention ability. This required a sustained attention task that uses a complex enough stimulus set and target/non-target category boundary to allow for measurable individual differences in attentional templates while providing a reliable measure of sustained attention ability. Conventionally, studies of sustained attention use variants of go/no-go tasks called continuous performance tasks (CPTs) where stimuli are shown in succession and participants are instructed to respond to non-target stimuli while withholding their response to target stimuli (Conners, 2000; Esterman, Noonan, Rosenberg, & Degutis, 2013; Robertson, Manly, Andrade, Baddeley, & Yiend, 1997; Rosenberg, Noonan, Degutis, & Esterman, 2013). Performance is generally evaluated using response errors (commission errors for target stimuli and omission errors for nontargets) and variability in response times (RTs) with greater variability indicative of worse attentional ability. When these tasks are used to measure dynamic changes in attentional states (e.g., Cheyne, Carriere, & Smilek, 2009; Christoff, Gordon, Smallwood, Smith, & Schooler, 2009; Esterman et al., 2013; Smallwood et al., 2004; Weissman et al., 2006), it is generally assumed that the fluctuations in performance are driven by internal cognitive processes falling under the broad category of cognitive control. On the other hand, stimulus-driven properties such as stimulus quality, complexity or T-N similarity are either rarely considered or controlled for with the use of very simple stimuli (but see Head & Helton, 2012; Helton & Russell, 2011; Parasuraman et al., 2009). 
However, given a sufficiently complex stimulus set, the stimulus-driven influences on performance are those that would be most likely to be modulated by the strength and efficiency of a given attentional template. Accordingly, we tested participants on the gradual onset continuous performance task (gradCPT), a CPT variant that removes phasic onsets/offsets between stimuli (Esterman et al., 2013; Fortenbaugh et al., 2015; Rosenberg et al., 2013). Participants were shown a series of images of natural scenes selected from ten photographs of city scenes (nontargets, 90 % of trials) and ten photographs of mountain scenes (targets, 10 % of trials) and made a category decision on each stimulus image. Critically, because the stimulus set involved complex natural images with immense numbers of visual features in each stimulus, the diagnostic features selected in the attentional template, and the degree to which these features were potentiated and maintained, could vary considerably across participants.
In the current study, we report two experiments that examined the relationship between T-N similarity and sustained attention performance using the gradCPT. First, using data collected from a massive online dataset (TestMyBrain.org, N > 20,000), we established that T-N similarity influenced performance (Experiment 1)—evident in both RTs for go trials (nontarget city scenes) and the probability of an erroneous button press (commission errors) on no-go trials (target mountain scenes). Critically, we examined the degree to which individual differences in sensitivity to visual T-N similarity predicted overall performance (as measured by accuracy and RT variability) on the task, showing that greater sensitivity was associated with better performance. In Experiment 2, we replicated the primary findings from Experiment 1 in a lab-based sample and found that individual differences in sensitivity to similarity derived from performance on the gradCPT predicted performance on a clinically used CPT that used simple stimuli instead of complex scenes.
Experiment 1: Examining the effect of stimulus similarity on gradCPT performance using a large online dataset
In Experiment 1, we set out to determine if and how visual T-N similarity influenced performance on the gradCPT. Data in this dataset have been published in previous studies (Fortenbaugh et al., 2015; Riley et al., 2016; Riley, Esterman, Fortenbaugh, & DeGutis, 2017). Thus, we include a very similar description of the methods below with the addition of new analytic procedures.
Methods
Participants
We included 21,406 unpaid volunteers between the ages of 10 and 70 years in the analyses. A total of 25,519 participants performed the gradCPT by visiting TestMyBrain.org, a website containing various cognitive tests, but 3,739 participants were removed due to the participant exclusion criteria reported in Fortenbaugh et al. (2015). An additional 374 participants were removed because either their accuracy (d' or criterion), RT variance, or mean RT fell outside 3 standard deviations of the sample mean. There was a nearly equal ratio of males and females (46 % female).
Procedure and stimuli
Participants visited TestMyBrain.org—a website that allows users to participate in experimental tasks and provides feedback on performance relative to other users. The gradCPT was presented on its home page as a “Continuous Concentration Task.” TestMyBrain.org receives traffic mostly from social-networking sites and search engines (Germine et al., 2012). The gradCPT was one test out of several on TestMyBrain.org (others included face recognition and working memory). Participants were free to complete one or more of these tasks; for some participants, the gradCPT was the first or only test they completed, while others may have completed other tasks first. Single-experiment studies on TestMyBrain.org are kept brief (< 10 min) in order to balance task completion, participation, and test length. Given the demanding nature of continuous performance tasks, the web-based gradCPT was a shortened version (4 min) of the continuous go/no-go task originally reported in Esterman et al. (2013). This test length was chosen because it was sufficiently short that participant attrition rates would be comparable with those of other experiments on the site; the concern was that participants dropping out during testing would introduce a selection bias whereby individuals with poorer sustained attention ability would be less likely to complete the experiment. This design led to a total experiment time of approximately 7 min from consent to debriefing.
Before starting the gradCPT, participants gave informed consent according to the guidelines set by the Committee on the Use of Human Participants at Harvard University and the Wellesley College Institutional Review Board. Participants were then asked to complete a voluntary demographic survey asking about their age, gender, native language, and ethnicity. Comprehensive instructions as well as three practice sessions (30 s each) were then provided to familiarize participants with the scenes and the task so that it could be completed without assistance from an experimenter. Across these practice sessions, each stimulus exemplar was presented at least twice. After completing 4 min of the gradCPT, participants were asked if they had cheated or if any problems occurred during the task and then they were provided with performance feedback. This feedback consisted of a personal score, which was the percentage of correct withholdings to mountains, as well as how this score compared with the average participant.
The current version of the gradCPT contained 20 round, gray-scale photographs; half depicted mountain scenes and half depicted city scenes (Esterman et al., 2013). The stimulus images were resized to 256 × 256 pixels and then cropped to appear within a circular frame (radius ~ 126 pixels). In the gradCPT, the scene images were presented in a pseudo-random order, with 10 % of trials displaying mountain scenes and 90 % city scenes, and no identical scene was allowed to repeat on consecutive trials. Scene images gradually transitioned from one to the next, in a linear pixel-by-pixel interpolation, with each transition occurring over 800 ms. The new image increased in clarity at the same rate as the old image decreased in clarity. Participants were instructed to press the space bar for city scenes (267 trials) and withhold a response for mountain scenes (32 trials). Trial order was fixed across participants to ensure that stimulus-driven difficulty was matched across participants.
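The ordering constraint described above (roughly 10 % targets, no identical exemplar on consecutive trials) can be sketched as follows. The function name, seed, and sampling scheme are illustrative only; the actual trial order was generated once and then fixed across participants:

```python
import random

def make_trial_sequence(n_trials=299, p_target=0.1, n_exemplars=10, seed=0):
    """Draw a pseudo-random gradCPT-style sequence of (category, exemplar)
    pairs: ~10% mountain (no-go) trials, and no identical exemplar on
    consecutive trials."""
    rng = random.Random(seed)
    seq, prev = [], None
    for _ in range(n_trials):
        category = "mountain" if rng.random() < p_target else "city"
        exemplar = rng.randrange(n_exemplars)
        while (category, exemplar) == prev:  # forbid immediate exact repeats
            exemplar = rng.randrange(n_exemplars)
        prev = (category, exemplar)
        seq.append(prev)
    return seq

sequence = make_trial_sequence()  # 267 city + 32 mountain = 299 trials in the task
```

Note that sampling targets with probability 0.1, as here, only approximates the exact 267/32 split used in the experiment.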
Computing stimulus similarity
In order to quantify T-N similarity, we considered the degree to which the task-relevant features were exemplar-based or category-based. According to the exemplar-based hypothesis, participants memorized information about each of the ten city and ten mountain image stimuli and these exemplar-specific features were used to match an incoming stimulus to one of the exemplars stored in memory. Responses could be based entirely on exemplar image identification instead of a broader category label assignment, or the exemplar-specific features could facilitate the proper categorization. Alternatively, the category-based hypothesis predicted that participants used the same general features that enable them to categorize any photograph upon its first viewing. In assessing T-N similarity, the exemplar-based hypothesis predicted that the relevant similarity would be between a given city image and the most similar (nearest neighbor) mountain image in the sample. Specifically, if a given city exemplar was visually similar to a mountain exemplar, even if the city exemplar was not particularly “mountain-like,” the stored visual representation of that mountain exemplar would interfere with the response to the city, resulting in a slower response. Alternatively, if T-N similarity was category-based, we predicted the relevant similarity would be between a given city image and the set of stored mountain features that were abstract enough to classify scene images despite the large degree of possible physical image variability within its basic category. As such, cities that shared more “mountain-like” attributes would elicit slower responses.
Along with considering how abstract the task-relevant features were, we also examined how to sensibly characterize the features that composed the scene representations at the exemplar and basic-category levels. The exemplar-based hypothesis would rely on image features that could reliably distinguish the exemplars within a category. Here we considered two types of features—pixel intensity and the GIST descriptor. Pixel-based feature matching assumed the stimulus images were represented as a set of position-specific pixel intensity values. Indeed, pixel intensity correlations have been shown to be a neurally relevant (but not necessarily behaviorally relevant) measure of image similarity for objects (e.g., Peelen & Caramazza, 2012) and scenes (Walther, Caddigan, Fei-Fei, & Beck, 2009). Additionally, pixel-based similarity measures can be used to approximate visual similarity judgments better than chance (Sinha & Russell, 2011). It is worth noting that we do not suggest that pixel intensity correlations were the best approximation of exemplar-level perceptual or representational similarity. However, pixel intensity correlations were easy to compute and sensitive to position-specific, high-frequency commonalities and variations across image pairs. Further, pixel intensity features provided a computational maximum for the fidelity of exemplar representations: because the stimulus images were composed of gray-scaled pixel intensity values, identical exemplars were necessarily identical in this feature space and different exemplars were necessarily different. Like any image feature descriptor, the success of pixel intensity values for this purpose rested on the assumption that these features bear a second-order isomorphic relationship to the representations of the actual task-relevant features. The second type of image features consisted of the GIST descriptor values for each image (Oliva & Torralba, 2001).
The GIST descriptor is a feature vector that is thought to provide an important summary of the spatial layout of a scene that helps mediate the rapid identification of scene images (Oliva & Torralba, 2001, 2006). This measure is less sensitive to high frequency local variations and more sensitive to spectral features distributed throughout the image.
Accordingly, we computed two exemplar-based T-N similarity values for each stimulus, assuming either pixel image features or GIST descriptor feature vectors. The procedures are depicted in Fig. 1. To construct the exemplar-pixel similarity values, each of the 20 stimulus images was converted into a vector of pixel intensity values. To quantify the influence of exemplar-specific similarity, we reasoned that the most similar cross-category stimulus would lead to the greatest interference and consequently would best reflect the influence of exemplar-specific feature similarity. Therefore, for each stimulus, we computed ten pairwise Pearson correlation values by correlating the pixel image vector of a given stimulus with the pixel vectors of the ten cross-category stimuli. We then selected the largest r (the nearest cross-category neighbor) to be the exemplar-pixel T-N similarity value for that stimulus. For example, the pixel vector for a given city stimulus was correlated with each of the ten mountain pixel vectors and the largest correlation coefficient was selected. To construct the exemplar-GIST similarity values, the same procedure was employed, but instead of using pixel intensity values, each stimulus image was represented by a GIST descriptor feature vector with a length of 512. To construct the GIST feature vector, a 4 × 4 grid was used to divide each stimulus into 16 equal subsets. Within each image subset, Gabor-like filters tuned to eight different orientations and four different sizes were applied, resulting in 32 values for each of the 16 subsets, or 512 values in total (Oliva & Torralba, 2001).
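The nearest cross-category neighbor computation is the same whether the feature vectors hold raw pixel intensities or 512-element GIST descriptors. A minimal sketch, with random arrays standing in for the actual stimulus images:

```python
import numpy as np

def exemplar_tn_similarity(stimuli, cross_category):
    """For each stimulus feature vector, return the largest Pearson r with
    any cross-category feature vector (the nearest cross-category neighbor)."""
    return np.array([
        max(np.corrcoef(s, c)[0, 1] for c in cross_category)
        for s in stimuli
    ])

# toy data: 10 "city" and 10 "mountain" images flattened to pixel vectors
rng = np.random.default_rng(0)
cities = rng.random((10, 256 * 256))
mountains = rng.random((10, 256 * 256))
city_tn = exemplar_tn_similarity(cities, mountains)  # one value per city
```

The same call with 512-element GIST vectors in place of pixel vectors would yield the exemplar-GIST T-N similarity values.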
In order to quantify each stimulus image’s T-N similarity assuming more abstract, category-level features, we used a pre-trained support vector regression model to estimate parameter weights for 102 scene attributes (Patterson & Hays, 2012; Patterson, Xu, Su, & Hays, 2014; Xiao et al., 2013). Briefly, Patterson and Hays (2012) used Mechanical Turk to obtain binary scene attribute judgments for 102 attributes and 14,340 scene images taken from the Sun database (Xiao, Hays, Ehinger, Oliva, & Torralba, 2010). A kernel-based support vector regression model was trained to predict the attribute judgments using as input a set of four image descriptors: the GIST descriptor, Histogram of Gradients (HOG) 2×2, self-similarity, and color histograms (for more details about each descriptor, see Xiao et al., 2010). This pre-trained model allowed us to take any scene image and compute a vector of 102 scene attribute values for that image. Using images from the Places database (Zhou, Lapedriza, Xiao, Torralba, & Oliva, 2014), these scene-attribute vectors were computed for a sample of 1,200 city images taken at random from the street and the downtown categories as well as 1,200 mountain images taken from the mountains and the snowy mountains categories. Images were resized and gray-scaled in order to match the gradCPT stimuli. For each category, the 1,200 attribute vectors were averaged to compute a template attribute vector. The resulting mountain and city template attribute vectors were tested for accuracy using a novel sample of 1,200 images from each category. Finally, attribute vectors were computed for each of the gradCPT stimuli and attribute-based T-N similarity was quantified as the Pearson correlation between a given stimulus’s attribute vector and the template attribute vector from the opposing category.
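The template-matching step reduces to averaging the sample attribute vectors into one template per category and correlating each stimulus's attribute vector with the opposing template. A sketch with random vectors standing in for the SVR-derived attribute values:

```python
import numpy as np

def category_tn_similarity(stimulus_attrs, opposing_sample_attrs):
    """Pearson r between each stimulus's attribute vector (length 102 in
    the paper) and the mean attribute vector of the opposing category."""
    template = opposing_sample_attrs.mean(axis=0)
    return np.array([np.corrcoef(a, template)[0, 1] for a in stimulus_attrs])

rng = np.random.default_rng(1)
mountain_sample = rng.random((1200, 102))  # stand-in for real attribute vectors
city_stimuli = rng.random((10, 102))
city_attr_tn = category_tn_similarity(city_stimuli, mountain_sample)
```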
When evaluating the feasibility of our category-attribute feature templates on the selected 2,400 novel images (1,200 from each category), we found that a simple correlation classifier achieved 93 % accuracy for both the city and mountain images (chance = 50 %), suggesting that the templates contained information about city and mountain scenes that generalized to novel images. The templates also correctly classified all ten of the gradCPT city stimuli and nine out of the ten mountain stimuli (Mountain 2, shown in Table 2, was incorrectly classified as a city). To further motivate the use of these attribute vectors, we determined the most and least informative attributes. By taking the absolute difference between each attribute in the two category template vectors, we found that the ten most informative attributes (greatest absolute difference) were natural, man-made, rugged scene, mostly vertical components, far-away horizon, hiking, climbing, snow, camping, and enclosed area. Further, the ten least informative attributes were cleaning, rubber/plastic, carpet, stressful, shingles, smoke, exercise, playing, spectating/being in an audience, and reading. That the most and least informative attributes made intuitive sense as features that should (or should not) distinguish images of city and mountain scenes suggested that the templates successfully approximated components of this abstract feature space.
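A "simple correlation classifier" of this kind amounts to assigning an image to whichever category template its attribute vector correlates with more strongly. The synthetic, clearly separable templates below are for illustration only; the real templates are averaged SVR attribute vectors:

```python
import numpy as np

def classify(attr_vec, city_template, mountain_template):
    """Label an image by its stronger Pearson correlation with the two
    category attribute templates."""
    r_city = np.corrcoef(attr_vec, city_template)[0, 1]
    r_mountain = np.corrcoef(attr_vec, mountain_template)[0, 1]
    return "city" if r_city > r_mountain else "mountain"

# synthetic 102-element templates (each half "on" for one category)
city_template = np.r_[np.ones(51), np.zeros(51)]
mountain_template = np.r_[np.zeros(51), np.ones(51)]
```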
Reaction time
Throughout the task, response times (RTs) were recorded. RTs were calculated relative to the beginning of each image transition such that an RT of 800 ms indicated a button press at the moment the image was 100 % coherent and not mixed with other images. A shorter RT indicated that the current scene was still in the process of transitioning from the previous scene, and a longer RT indicated that the current scene was in the process of transitioning to the subsequent scene. For example, an RT of 720 ms would indicate a button press at the moment of 90 % image n and 10 % image n-1, and so forth. On rare trials with highly deviant RTs (before 70 % coherence of image n or after 40 % coherence of image n+1) or multiple button presses, an iterative algorithm maximized correct responses. The algorithm first assigned unambiguous correct responses, leaving few ambiguous button presses. Ambiguous presses were then assigned to an adjacent trial if one of the two trials had no response. If both adjacent trials had no response, the press was assigned to the closest trial unless one trial was a no-go target, in which case participants were given the benefit of the doubt that they correctly omitted. If there were multiple presses that could be assigned to any one trial, the fastest response was selected. Slight variations to this algorithm yielded highly similar results, as most button presses showed a 1:1 correspondence with presented images.
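The mapping from a raw RT to the image mixture on screen at the moment of the press is a linear interpolation over the 800-ms transition; a minimal sketch of that arithmetic:

```python
def image_mixture(rt_ms, transition_ms=800):
    """Return (% image n, % image n-1) visible at a press made rt_ms after
    the start of image n's transition (valid for 0 <= rt_ms <= transition_ms)."""
    pct_current = 100.0 * rt_ms / transition_ms
    return pct_current, 100.0 - pct_current
```

For example, `image_mixture(720)` gives the 90 %/10 % split described above.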
Measures of gradCPT performance
Taking a signal detection approach, we defined hits as correct omissions to mountains and false alarms as incorrect omissions to cities. Following these definitions, we computed discrimination ability (d') and criterion (a measure of strategy, or bias to respond in the case of uncertainty). We used standard procedures to correct for cases in which hit rates were 100 % or false alarm rates were 0 %, with one-half error deducted or added on the basis of the number of target or nontarget trials presented, respectively. In addition to d' and criterion, we computed, for each participant, a measure of RT variability, namely the coefficient of variation of their RTs (CV); CV is a measure of ability, with lower CV corresponding to greater ability. Finally, we computed the mean of their RTs, which is a measure of strategy, with slower RTs corresponding to a more conservative strategy. This ability/strategy distinction stems from Fortenbaugh et al. (2015), who found, through a factor analysis, that the ability (d' and CV) and strategy (criterion and mean RT) factors dissociate across the lifespan.
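A sketch of these measures, using the task's 32 target and 267 nontarget trials as an example; the half-error correction follows the description above, and the standard-library NormalDist supplies the inverse normal CDF:

```python
from statistics import NormalDist
import numpy as np

def sdt_measures(hits, n_targets, false_alarms, n_nontargets):
    """d' and criterion, with one-half error deducted/added when the hit
    rate reaches 100% or the false alarm rate reaches 0% (and vice versa)."""
    hit_rate = hits / n_targets
    fa_rate = false_alarms / n_nontargets
    if hit_rate == 1.0:
        hit_rate = (n_targets - 0.5) / n_targets
    elif hit_rate == 0.0:
        hit_rate = 0.5 / n_targets
    if fa_rate == 0.0:
        fa_rate = 0.5 / n_nontargets
    elif fa_rate == 1.0:
        fa_rate = (n_nontargets - 0.5) / n_nontargets
    z = NormalDist().inv_cdf
    return z(hit_rate) - z(fa_rate), -0.5 * (z(hit_rate) + z(fa_rate))

def rt_cv(rts):
    """Coefficient of variation of RTs (SD / mean); lower = better ability."""
    rts = np.asarray(rts, float)
    return rts.std(ddof=1) / rts.mean()
```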
Quantifying the influence of T-N similarity on RTs
Following Duncan and Humphreys (1989), we predicted that city stimuli with a greater degree of T-N similarity would result in slower RTs. To test this (see Fig. 2), an estimate of the population RT time course for the 267 city trials was constructed by averaging together, at each trial, the RTs across the 21,406 participants. The trial order was identical across participants, so averaging together the RTs preserved trial-to-trial variance in RT due to stimulus-driven factors like T-N similarity. Furthermore, the N was large enough for us to treat the mean RTs as approximations of the population values. Because we wanted to see which type of T-N similarity best explained trial-to-trial variance in RTs, the RT time course was tested for the influence of T-N similarity by running a simultaneous regression analysis including four predictors—the pixel-based exemplar T-N similarity measure, the GIST-based exemplar T-N similarity measure, the attribute-based T-N similarity measure, and a linear trend. The linear trend predicted RT slowing over the course of the experiment. Each of the three T-N similarity predictors assigned, for each of the 267 trials, the T-N similarity value corresponding to the city stimulus shown on that trial. The R2 computed from this model reflected the amount of trial-to-trial RT variance the similarity measures and the linear trend explained, and the standardized beta coefficients reflected the unique relative contributions of each of the T-N similarity measures. In order to compute standardized beta coefficients, a constant was added to each model and all variables were normalized to have a mean of 0 and a standard deviation of 1.
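This analysis can be sketched as ordinary least squares on z-scored variables. The synthetic RT time course and predictor values below are illustrative stand-ins, not the actual stimulus measures:

```python
import numpy as np

def standardized_fit(y, predictors):
    """Fit y ~ intercept + predictors after z-scoring everything;
    returns the standardized betas and the model R^2."""
    z = lambda v: (np.asarray(v, float) - np.mean(v)) / np.std(v)
    zy = z(y)
    X = np.column_stack([np.ones(len(zy))] + [z(p) for p in predictors])
    coefs, *_ = np.linalg.lstsq(X, zy, rcond=None)
    resid = zy - X @ coefs
    r2 = 1.0 - resid @ resid / (zy @ zy)
    return coefs[1:], r2  # drop the intercept

# synthetic 267-trial RT time course driven by a "pixel" predictor and a trend
rng = np.random.default_rng(3)
pixel_tn, gist_tn, attr_tn = rng.random((3, 267))
trend = np.arange(267.0)
rts = 800 + 40 * pixel_tn - 0.1 * trend + rng.normal(0, 5, 267)
betas, r2 = standardized_fit(rts, [pixel_tn, gist_tn, attr_tn, trend])
```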
Quantifying the influence of T-N similarity on commission errors
We additionally predicted that greater T-N similarity for mountain stimuli (i.e., mountain images that looked more “city-like”) should result in more errors. To test this, we ran the same type of group regression analysis as described in the previous section, adapted for mountain trials. First, for each of the 32 mountain trials, the rate of erroneous button presses, or commission error rate (CE rate), was computed by dividing the number of participants that made an error at a given trial by the total number of participants. This CE rate time course was the dependent variable for a regression analysis that included the three T-N similarity predictors for the mountain stimuli and a linear trend that predicted an increase in error likelihood over the experimental time course.
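Computed this way, the CE rate time course is simply the column mean of a participants × mountain-trials error matrix; a toy sketch:

```python
import numpy as np

def ce_rate(pressed):
    """Per-trial commission error rate. pressed is a participants x
    mountain-trials array, True/1 where a participant erroneously pressed."""
    return np.asarray(pressed, float).mean(axis=0)

# toy example: 4 participants x 3 mountain trials
errors = np.array([[1, 0, 1],
                   [0, 0, 1],
                   [1, 0, 0],
                   [0, 0, 1]])
rates = ce_rate(errors)  # fraction of participants erring on each trial
```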
Individual differences
To examine whether individual differences in sensitivity to similarity were related to performance on the gradCPT, we computed individual sensitivity to T-N similarity by running the regressions described above on each participant’s data separately. Therefore, for each participant, we ran two regression analyses, one predicting mountain-trial accuracy and one predicting city-trial RTs. Each model produced four beta coefficients, and sensitivity to T-N similarity was quantified as the t-values of those beta coefficients. The set of beta t-values from each participant resulted in four individual-difference indices—three quantifying individual differences in sensitivity to T-N similarity (pixel-based, GIST-based, and attribute-based) and one for the linear trend. First, each of the t-values was correlated (Pearson) with the four measures of gradCPT performance (d', criterion, RT variance, and mean RT). Next, to determine the unique contribution of the sensitivity-to-similarity measures, they were entered into second-level regression models—one for each of the measures of gradCPT performance.
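Quantifying sensitivity as the t-value of each beta requires the coefficient standard errors; a minimal OLS sketch (the design and data are illustrative, not the authors' exact implementation):

```python
import numpy as np

def coef_t_values(y, X):
    """OLS t-values (beta / standard error) for each column of the design
    matrix X, which should already contain an intercept column."""
    y = np.asarray(y, float)
    n, k = X.shape
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ beta
    sigma2 = resid @ resid / (n - k)                 # residual variance
    se = np.sqrt(sigma2 * np.diag(np.linalg.inv(X.T @ X)))
    return beta / se

# toy single-participant data with one strong predictor
rng = np.random.default_rng(4)
x = rng.random(100)
y = 3.0 * x + rng.normal(0.0, 0.5, 100)
X = np.column_stack([np.ones(100), x])
t_vals = coef_t_values(y, X)  # [intercept t, slope t]
```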
Results
We set out to determine (1) if T-N similarity influenced performance on the gradCPT and (2) if individual differences in sensitivity to T-N similarity were related to more conventional measures of task performance. To do this, we computed three measures of T-N similarity—one assuming a pixel-based exemplar matching procedure, another assuming a GIST-based exemplar matching procedure, and a third assuming an attribute-based category template matching procedure. Tables 1 and 2 show each of the stimuli along with their three T-N similarity values as well as mean RT for city stimuli (Table 1) and CE rate for mountain stimuli (Table 2). Tables 3 and 4 show the degree to which these measures covary.
Table 1 The three predicted T-N similarity measures and the mean RT in ms (from Exp. 1) for the ten non-target (city) stimuli. Intervals reported below the mean RTs are their 95% confidence intervals
Table 2 The three predicted T-N similarity measures and the commission error rate (from Exp. 1) for the ten target (mountain) stimuli
Table 3 Correlation (Pearson) matrix of item-level (ten cities) variance in predicted T-N similarity measures and mean RT for the non-target (city) stimuli
Table 4 Correlation (Pearson) matrix of item-level (ten mountains) variance in predicted T-N similarity measures and the item-level commission error rate for the target (mountain) stimuli
The influence of T-N similarity on city RTs
To determine whether T-N similarity influenced performance on the gradCPT (and, if so, which measure), a regression analysis simultaneously tested for the influence of exemplar-pixel, exemplar-GIST, and category-attribute T-N similarity (as well as a linear trend) on the trial-to-trial group-average RTs to city stimuli. As reported in Fig. 3, the overall model was significant (R2 = 0.26, F(4, 262) = 22.78, p < .001), with the coefficients for the exemplar-pixel (β = 0.34, t(262) = 6.13, p < .001) and exemplar-GIST (β = 0.21, t(262) = 3.45, p < .001) T-N similarity measures explaining unique variance, while the attribute-based measure did not (β = 0.02, t(262) = 0.38, p = .70). The significant linear trend (β = -0.31, t(262) = -5.86, p < .001) indicated that participants sped up over time.
The influence of T-N similarity on mountain CEs
We additionally ran an analogous regression analysis predicting the trial-to-trial commission error rates for mountain trials (see Figs. 3 and 4). This model was quite successful, explaining over 70% of the variance in trial-to-trial error rates (R2 = 0.72, F(4, 27) = 17.48, p < .001). As in the model predicting RTs to city stimuli, the exemplar-pixel measure of T-N similarity and the linear trend had significant coefficients while the category-attribute predictor did not (exemplar-pixel: β = 0.69, t(27) = 6.12, p < .001; linear trend: β = 0.57, t(27) = 5.22, p < .001; category-attribute: β = -0.05, t(27) = -0.36, p = .72). Unlike the city RT model, the exemplar-GIST predictor was not significant (β = 0.02, t(27) = 0.14, p = .89). Overall, the exemplar-pixel and, to a lesser degree, the exemplar-GIST measures of T-N similarity explained some of the trial-to-trial variance in performance, such that more “mountain-like” city stimuli and more “city-like” mountain stimuli according to these measures led to slower and more error-prone responses, respectively, while the category-attribute measure was not significant for either. The significant linear trends revealed that, while accounting for T-N similarity, participants responded more quickly and made more commission errors as the experiment progressed.
Relating sensitivity to T-N similarity to individual differences in gradCPT performance
We ran a set of eight second-level regression analyses in order to evaluate how individual differences in sensitivity to each of the three measures of T-N similarity, as well as the linear trend, predicted gradCPT ability (d prime and CV) and measures of strategy (criterion and mean RT). Because the N was so large, even a model with a small R2 was significant. We therefore followed Cohen’s (1988) convention that R2 values of 0.02, 0.13, and 0.26 mark small, medium, and large effects, respectively. We found that sensitivity to similarity, as measured by the influence of T-N similarity on RTs to city stimuli, predicted both measures of gradCPT ability—specifically, discrimination ability/accuracy (d prime) with a medium effect (R2 = 0.16) and RT variability (CV) with a large effect (R2 = 0.27). Examination of the coefficients revealed that the success of the models was largely driven by sensitivity to the exemplar-pixel (d prime: β = 0.25; CV: β = -0.36) and exemplar-GIST (d prime: β = 0.28; CV: β = -0.30) measures of T-N similarity (see Fig. 2). Importantly, the signs of these coefficients revealed that individuals who were more sensitive to the exemplar-based measures of T-N similarity were more accurate (greater d prime) and had more consistent RTs (lower RT variability). Neither sensitivity to the category-attribute measure of T-N similarity (d prime: β = -0.03; CV: β = -0.03) nor the linear trend (d prime: β = 0.01; CV: β = -0.09) substantially contributed to either model. These measures did not predict either measure of strategy beyond a small effect (criterion: R2 = 0.04; RT mean: R2 = 0.03; Footnote 3). Furthermore, none of the models using sensitivity to T-N similarity derived from accuracy to mountain stimuli predicted any of the four measures of gradCPT performance beyond a small effect.
Experiment 1 discussion
Experiment 1 provided, to our knowledge, the first analysis of how visual image similarity impacts sustained attention and how individual differences in sensitivity to visual similarity relate to sustained attention ability. In particular, using a sample of more than 20,000 participants, we found compelling evidence that increased T-N similarity was related to performance on the gradCPT—both in slowing RTs on city trials and in increasing the likelihood of commission errors on mountain trials—and that individual differences in sensitivity to certain measures of T-N similarity were associated with more optimal performance on the task.
First, we showed that, at the group level, the exemplar-pixel measure of T-N similarity explained variance in both the average RT time course for the non-target city stimuli and the error-rate time course for the target mountain stimuli, while the exemplar-GIST measure explained variance only in the RT time course. The category-attribute measure did not significantly explain variance in either. As expected, stimuli with greater T-N similarity showed slower RTs and greater error rates. We posit that the cognitive processes underlying the success of the exemplar-based T-N similarity measures are computationally analogous to how those measures were constructed—namely, that the performance costs were a consequence of interference from image features of the most confusable cross-category exemplar (the nearest cross-category neighbor) stored or potentiated in the attentional template, which sent a conflicting response signal (withhold for city stimuli and press for mountain stimuli). The success of the exemplar-based measures of T-N similarity further suggests that participants performed the task, at least in part, by adopting an exemplar-based matching procedure whereby the perceptual decision (city vs. mountain) was accomplished by storing stimulus-specific image features from each stimulus and leveraging all of these features when making the decision. The failure of the attribute-based category T-N similarity measure could mean that exemplar features were used to the exclusion of attribute-based category features. However, this interpretation presupposes that the attribute-based T-N similarity sufficiently approximates the category-level features used to classify novel scenes. Since this was not independently verified, any interpretation of this null effect remains ambiguous.
Inclusion of the linear trend in the regression models revealed a vigilance decrement: error rates increased over the time course of the experiment, accompanied by RT speeding, which has been linked with more automated (less vigilant) attentional states (deBettencourt, Norman, & Turk-Browne, 2017). While not the focus of this study, Fig. 4 (bottom-right plot) illustrates that T-N similarity can help reveal performance variation due to intrinsic factors, such as executive or cognitive control, by controlling for trial-to-trial variance due to stimulus-driven factors.
We further found that both the pixel-based and the GIST-based image features contributed unique variance to explaining reaction times. This suggests that the exemplar representations stored in memory were representing multiple types of image properties and that similarity, as derived through pixel intensity and the GIST descriptor features, approximated the similarity structure at separate levels of representation involved in the categorization of scene images. Alternatively, the pixel and GIST measures could capture different facets of the similarity structure at a single level of representation.
Finally, individuals who were more sensitive to T-N similarity (as indexed by the influence of both the pixel- and GIST-based exemplar measures on city RTs) were better at the task overall, with greater discrimination ability (higher d prime) and less variable RTs (lower CV). This result is novel, even in the broader context of visual similarity across the visual search and RSVP literatures. While other studies have found that individual differences in the perceived similarity of target and non-target stimuli can predict individual differences in sensitivity to T-N similarity (Lee & Geng, 2017), sensitivity to T-N similarity has not, to our knowledge, been used as an index to predict overall performance on a task.
One potential concern is that, because sensitivity to T-N similarity was based on city RTs, it might really be an indirect measure of RT variability, making the individual-difference correlations between sensitivity to T-N similarity and CV statistically circular. The results were not consistent with this interpretation. Increased sensitivity to T-N similarity was measured from an increase in a particular form of RT variability, namely slower responses to cities that have a similar mountain exemplar. Therefore, all else being equal, increased sensitivity to T-N similarity should lead to an increase in RT variance. In fact, the opposite was true: greater sensitivity to T-N similarity was associated with lower overall RT variability. This suggests that greater sensitivity to T-N similarity was associated with a large enough reduction in other sources of RT variability (e.g., the ability to sustain attention) to counteract the increase in RT variability that directly resulted from increased sensitivity to T-N similarity.
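The logic of this argument can be illustrated with a toy simulation (all quantities below are hypothetical): holding other sources of RT noise fixed, a larger sensitivity to similarity mechanically inflates overall RT variance, so the observed negative association requires a compensating reduction elsewhere.

```python
# Toy simulation of the circularity argument (all quantities hypothetical):
# with other RT noise held constant, greater sensitivity to T-N similarity
# mechanically increases overall RT variance.
import numpy as np

def simulated_rt_variance(sensitivity: float, n: int = 10_000, seed: int = 0) -> float:
    rng = np.random.default_rng(seed)
    similarity = rng.normal(size=n)               # trial-wise T-N similarity
    other_noise = rng.normal(scale=50.0, size=n)  # attention-related and other variability
    rts = 800.0 + sensitivity * similarity + other_noise
    return float(np.var(rts))
```

Because variance here is approximately sensitivity squared plus the other-noise variance, `simulated_rt_variance(40.0)` exceeds `simulated_rt_variance(0.0)`; the lower CV observed among high-sensitivity individuals therefore implies a smaller other-noise term.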
While sensitivity to T-N similarity was associated with overall performance on the task, the nature of this association was unclear. Specifically, because complex stimuli were used, it was unclear whether the individual differences in both sensitivity to T-N similarity and gradCPT performance were due to underlying differences in perceptual or attentional abilities. In Experiment 2, in addition to replicating the findings of Experiment 1 in a diverse, lab-based sample of Veterans, we explored the underlying causes of the individual differences in sensitivity to exemplar T-N similarity. In particular, we examined whether the individual differences in sensitivity to T-N similarity were driven by differences in perceptual processing ability, such that the perceptual decision component of the gradCPT was easier for some participants, or by differences in the ability to sustain attention more generally. By testing participants on both the gradCPT and the TOVA—a CPT that uses simple visual stimuli—we assessed attentional ability on a perceptually simple task, where individual differences in perceptual processing ability should have minimal effect, and examined whether it related to sensitivity to T-N similarity as measured on the gradCPT. Any such relationship would be unlikely to reflect differences in perceptual processing ability, leaving sustained attention ability as the likely cause.
Experiment 2: Examining how gradCPT sensitivity to similarity relates to performance on a neuropsychological test of sustained attention
In Experiment 1 we demonstrated that exemplar-based T-N similarity influenced RTs on city (non-target) trials and the likelihood of a commission error on mountain (target) trials. Critically, increased individual sensitivity to similarity, as assessed by the influence of exemplar-based T-N similarity on RTs, was associated with increased overall ability on the task. The meaning of this association, however, is open to interpretation. One possibility is that sensitivity to image similarity solely reflects enhanced bottom-up perceptual processing or visual memory, independent of an individual’s ability to sustain attention. Alternatively, sensitivity to similarity could reflect individual differences in the ability to maintain the task set. In other words, those who are more capable of sustaining attention in general are consequently more capable of selecting and maintaining task-relevant features and responding appropriately. To adjudicate between these interpretations, we analyzed data from participants who performed the gradCPT along with a battery of surveys and other tests. One such test was the Test of Variables of Attention (TOVA), a CPT that is widely used to help diagnose attention-deficit hyperactivity disorder (Leark, Greenberg, Kindschi, Dupuy, & Hughes, 2007). The TOVA was the only other CPT in the battery and was selected for this experiment because it assesses sustained attention in a manner similar to the gradCPT but without the same perceptual demands: for the target/non-target category distinction, it employs the highly discriminable position (top vs. bottom) of a simple square. No other task in the battery was examined for the purposes of this study.
If increased sensitivity to similarity arises from enhanced bottom-up perceptual processing or visual memory capacity, the TOVA—using simple stimuli with low perceptual discrimination demands—would largely eliminate any advantage held by individuals with enhanced perceptual processing. We would therefore predict no association between sensitivity to T-N similarity on the gradCPT and attentional ability on the TOVA. If the latter hypothesis is true, individual differences in sensitivity to T-N similarity on the gradCPT should predict performance on the TOVA as well, because the attentional demands of the two tasks are similar and the sustained attention ability that determines the degree of sensitivity to T-N similarity would also be measured during the TOVA.
Methods
Participants
The participant sample included 152 Veterans (male: 141; age: 21–56 years) from Operation Enduring Freedom/Operation Iraqi Freedom/Operation New Dawn (OEF/OIF/OND) who were recruited through the Translational Research Center for Traumatic Brain Injury and Stress Disorders (TRACTS) at the Veterans Affairs Boston Healthcare System. Details regarding the larger TRACTS cohort have been described elsewhere (Lippa et al., 2015). Participants in the present sample completed the current tasks as part of a larger battery that included clinical interviews and neuropsychological testing, administered over one or more sessions. This Veteran sample was used because data were available from participants who completed both the gradCPT and the TOVA. The Institutional Review Board of Human Studies Research at the VA Boston Healthcare System approved all research procedures. All participants provided informed consent and were reimbursed for their time and travel expenses.
Initially, 169 participants completed the gradCPT, but 17 were removed under the exclusion criteria described below. The overall exclusion criteria for TRACTS include: (a) history of neurological illness (other than traumatic brain injury (TBI)); (b) history of seizures; (c) current diagnosis of schizophrenia spectrum or other psychotic disorders (not related to PTSD); (d) current active suicidal and/or homicidal ideation, intent, or plan requiring crisis intervention; or (e) cognitive disorder due to a general medical condition other than TBI (exclusions based on these criteria: N = 2). For the present study, Veterans with a history of moderate or severe TBI or who failed the Green’s Medical Symptom Validity Test were excluded (N = 13). For gradCPT-specific exclusion criteria, as in the previous experiments, we excluded participants who showed significant periods (30 s or greater) with no response (N = 2).
Participants completed both the gradCPT and the TOVA along with a battery of additional tasks. A majority of participants (N = 94) returned for an additional testing session after an average of 762 days (range 363–1,743 days). For the participants who completed the gradCPT in both sessions (N = 10), only the gradCPT results from the first session were analyzed. Additionally, of the 152 participants who completed the gradCPT, 136 performed the TOVA—133 participants had gradCPT and TOVA data collected during the same session and 90 participants had data collected in different sessions.
Stimuli and procedure
GradCPT
The stimuli and procedure were identical to Experiment 1 with a few exceptions. First, participants performed the gradCPT during an fMRI scan; however, the fMRI data are beyond the scope of this paper and are not discussed further. The experimental task was projected onto a wall at the rear of the scanner, and participants viewed the task through a mirror. Second, this version of the gradCPT was longer, consisting of 600 trials (540 city non-targets and 60 mountain targets) and lasting 8 min. Finally, unlike Experiment 1, in which the trial order was fixed across participants, each participant received a unique randomized trial order.
Tests of Variables of Attention (TOVA)
The TOVA is a simple continuous performance task—widely used to diagnose ADHD—in which participants were instructed to respond to the target stimulus and ignore the non-target stimulus (Leark et al., 2007). Both the target and non-target stimuli were composed of a white square with a smaller black square horizontally centered within it. For the target stimulus, the black square was positioned near the top of the white square; for the non-target stimulus, it was positioned near the bottom. Responses were made by pressing a button on a switch box. Each trial lasted 2,000 ms and consisted of a stimulus presented for 100 ms followed by a 1,900-ms response period. In total, the TOVA consisted of 648 trials, 50% of which were target trials. However, in the first half of the TOVA, target trials were infrequent (22.5% of first-half trials), while in the second half they were frequent (77.5% of second-half trials). As with the gradCPT, we obtained each participant’s RT variability (here, the RT standard deviation, or RT SD), mean RT, d prime, and criterion.
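The d prime and criterion values can be computed with the standard signal-detection formulas. The sketch below shows the conventional computation (not code from the study); the log-linear correction is one common way to handle hit or false-alarm rates of exactly 0 or 1.

```python
# Standard signal-detection measures for a CPT (conventional formulation,
# not the study's original code); the log-linear correction avoids
# infinite z-scores at rates of 0 or 1.
from scipy.stats import norm

def sdt_measures(hits, misses, false_alarms, correct_rejections):
    """Return (d_prime, criterion) from response counts."""
    hit_rate = (hits + 0.5) / (hits + misses + 1.0)                      # log-linear correction
    fa_rate = (false_alarms + 0.5) / (false_alarms + correct_rejections + 1.0)
    z_hit, z_fa = norm.ppf(hit_rate), norm.ppf(fa_rate)
    d_prime = z_hit - z_fa          # discrimination ability
    criterion = -0.5 * (z_hit + z_fa)  # response bias (positive = conservative)
    return d_prime, criterion
```

The same function applies to both tasks: for the TOVA, hits and misses come from target trials and false alarms and correct rejections from non-target trials.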
Data analysis and results
Replicating Experiment 1
First, we set out to replicate the association, reported in Experiment 1, between individual differences in sensitivity to T-N similarity and variables measuring strategy (criterion and mean RT) and ability (d prime and CV) on the gradCPT. Using the same analytical methods as in Experiment 1, we ran a regression model for each participant fitting their RT time course for non-target city stimuli with the exemplar-pixel, exemplar-GIST, and category-attribute measures of T-N similarity, as well as a linear trend, as predictors. The resulting t-values for the four predictor coefficients (a set of four t-values per participant) quantified each participant’s sensitivity to the different measures of T-N similarity and the linear trend, and were entered as predictors into four second-level regression analyses predicting each participant’s d prime, RT variability (CV), criterion, and mean RT on the gradCPT. Consistent with Experiment 1, we found that sensitivity to T-N similarity predicted discrimination ability/accuracy (d prime) with a medium effect (R2 = 0.20) and RT variability (CV) with a medium effect (R2 = 0.24). Moreover, examination of the coefficients revealed that the success of the models was largely driven by sensitivity to the exemplar-pixel (d prime: β = 0.18, t(147) = 2.53, p = .013; CV: β = -0.22, t(147) = -3.22, p = .002) and exemplar-GIST (d prime: β = 0.25, t(147) = 3.22, p = .002; CV: β = -0.28, t(147) = -3.76, p < .001) measures of T-N similarity. The coefficients for sensitivity to category-attribute T-N similarity were not significant (d prime: β = -0.09, t(147) = -1.04, p = .30; CV: β = 0.12, t(147) = 1.38, p = .17). However, unlike Experiment 1, sensitivity to similarity also predicted the measures of strategy with medium effect sizes (criterion: R2 = 0.11; RT mean: R2 = 0.18), such that greater sensitivity to T-N similarity was associated with a more liberal response strategy (greater bias to respond) and faster RTs.
Sensitivity to similarity index and the TOVA
Given that individual differences in sensitivity to exemplar-based T-N similarity predicted measures of ability on the gradCPT, the primary goal of this experiment was to test whether the sensitivity-to-similarity index could also predict performance on an attention task with minimal perceptual processing demands. As with the gradCPT, measures of task ability on the TOVA (d prime and RT SD) were computed for each participant. First, for participants who completed the gradCPT and the TOVA during the same testing session (N = 138), two second-level regression analyses were run with gradCPT-derived sensitivity to exemplar-pixel and exemplar-GIST T-N similarity as regressors predicting either TOVA-derived d prime or RT variability (RT SD). Both models were significant, albeit with small effect sizes (d prime: R2 = 0.07, F(2, 135) = 5.39, p = .006; RT variability: R2 = 0.07, F(2, 135) = 4.75, p = .01). In other words, individuals who were more sensitive to exemplar-based similarity were less variable in their response times and more accurate overall on the TOVA. To determine whether this association was attributable to a stable cognitive trait or a more temporary cognitive state, we identified participants who had completed the TOVA on a different day from the gradCPT. If the association reflected a consistent trait, it should remain when the data were collected on different days. For these participants (n = 90), the association was nearly identical, with small effect sizes for the model predicting d prime (R2 = 0.05, F(2, 88) = 2.49, p = .088) and the model predicting RT variability (R2 = 0.09, F(2, 88) = 4.26, p = .017). Finally, to determine whether the measures of gradCPT ability related to the analogous measures on the TOVA, each participant’s d prime and RT variability from the gradCPT were correlated with the corresponding measures from the TOVA.
This revealed reliable correlations between the analogous task measures (same-session d prime: r = 0.42, p < .001; cross-session d prime: r = 0.29, p < .005; same-session RT variability: r = 0.53, p < .001; cross-session RT variability: r = 0.42, p < .001).
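The cross-task correlation check can be sketched as follows, restricting each comparison to participants with both measures (column names are hypothetical):

```python
# Sketch of the cross-task correlation check (hypothetical column names;
# illustrative only).
import pandas as pd
from scipy.stats import pearsonr

def cross_task_corr(df: pd.DataFrame, grad_col: str, tova_col: str):
    """Pearson correlation between an analogous gradCPT measure and TOVA
    measure, computed over participants with both values available."""
    paired = df[[grad_col, tova_col]].dropna()
    return pearsonr(paired[grad_col], paired[tova_col])
```

Running this once per measure pair (d prime and RT variability, same-session and cross-session samples) reproduces the four correlations reported above.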
Experiment 2 discussion
In Experiment 2, the gradCPT was administered as part of a larger battery of tasks, including the TOVA—a clinically used CPT with simple visual stimuli. This experiment furthered the exploration of the relationship between sensitivity to T-N similarity and sustained attention by examining whether high sensitivity to exemplar-based similarity reflected (1) enhanced, attention-independent perceptual processing ability that reduced the attentional demands of the gradCPT, or (2) enhanced perceptual processing and encoding that resulted from maintaining an optimal attentional template, for tasks with and without complex perceptual demands. To this end, we related sensitivity to T-N similarity derived from the gradCPT to measures of sustained attention from the same participants on a perceptually simple continuous performance task (the TOVA) and found that individuals who were more sensitive to exemplar-based T-N similarity were better at both the gradCPT and the TOVA, with higher d primes and lower RT variability (CV for the gradCPT and RT SD for the TOVA). Because the TOVA, unlike the gradCPT, used two simple stimuli, the demands on perceptual processing and visual working memory were minimal. Therefore, any attention-independent perceptual processing advantage a participant may have had on the gradCPT would be minimized when the perceptual demands were trivial. Because bottom-up perceptual processing ability can thus be ruled out as the sole source of the relationship between sensitivity to exemplar-based similarity on the gradCPT and performance on the TOVA, the cognitive processes driving this relationship are likely to be, at least in part, attentional in nature.
Additionally, we set out to replicate, in this lab-based study (Footnote 4), the previously observed relationship between individual differences in sensitivity to T-N similarity and more standard measures of gradCPT performance, using a substantial sample from a distinct cohort. As in Experiment 1, we found that participants who were more sensitive to exemplar-based T-N similarity were more accurate (higher d prime) and had less variable RTs (lower CV). Replicating this association extends Experiment 1 in two important ways. First, these participants performed the gradCPT in the lab, providing a replication of the results in a more controlled setting. Second, because the participants were drawn from the Veteran population, the effect evidently generalizes across populations. The association between sensitivity to similarity and measures of gradCPT strategy, however, is novel and may be related to sample-specific attributes (e.g., a greater prevalence of PTSD or mild TBI).
Finally, the TOVA, while simple, has been assessed for reliability and validity and is strongly predictive of attentional dysfunction (e.g., ADHD). Therefore, the fact that variables from the gradCPT predict results on the TOVA further validates the link between measures of gradCPT performance and the ability to sustain attention more generally.