All experimental procedures were approved by the Johns Hopkins Medical Institutional Review Board. All participants passed an fMRI safety screening prior to the scan and provided written informed consent.
Twelve neurologically intact, right-handed, healthy adults (seven females, five males; mean age = 26, SE = 1.9 years) were recruited from the Johns Hopkins University community. All participants had normal or corrected-to-normal vision. Each participant completed one behavioral training session and one 2-h scanning session, on separate days. They were paid $10/h for the behavioral training session and $25/h for the scanning session.
In the training session, visual stimuli were presented on an 18-in. CRT monitor located 79 cm in front of a chinrest, used to equate visual angles across participants, and buttonpress responses were made on a computer keyboard. In the fMRI session, the visual stimuli were projected onto a screen placed at the end of the magnet bore and viewed with a mirror mounted above the head coil. Each participant was fitted with a custom-molded dental impression block clamped to the head coil cage, to minimize head motion; buttonpress responses were made on a custom-built MR-compatible response box. Stimulus presentation and behavioral data collection were controlled by custom MATLAB (The MathWorks, Inc.) code using the Psychophysics Toolbox (Brainard, 1997). Eye position was monitored with a closed-circuit video system during the training session, and with a custom MR-compatible infrared camera (MRA, Inc.) and ViewPoint 2.8.3 eyetracking software (Arrington Research, Inc.) during the fMRI session.
Stimuli and procedures
Participants were instructed to fix their gaze on a white central fixation dot 0.2° in diameter while performing a multistream RSVP task (Fig. 1). Task-relevant alphanumeric RSVP streams were located 3.5° to the left and right of the fixation dot along the horizontal meridian. Each of these two relevant streams was flanked 3.3° (center to center) above, below, and laterally by three irrelevant distractor streams to maximize the demand for selective attention. The alphanumeric characters subtended 1.4° in height and 1.0° in width and were presented in fixed-width Monaco font (letters in uppercase). Participants were instructed to make four-alternative buttonpress responses to infrequent target digits embedded within the task-relevant RSVP streams. Targets appeared simultaneously in both task-relevant streams: a digit from 2 to 5 in one stream and a different digit from 2 to 5 in the other. Participants pressed the right index-, middle-, ring-, or little-finger button to indicate the identity of the digit (2, 3, 4, or 5, respectively) presented in the currently attended RSVP stream. The filler (nontarget) items consisted of the letters A through Z, except for L and R (see below). All visual stimuli were presented on a gray background, and each target and filler item was rendered in one of eight randomly chosen colors (excluding red), with the constraint that every item within the same RSVP frame was rendered in a different color. The stimulus duration (i.e., RSVP frame duration) was 133.3 ms, and targets were infrequent, appearing on average two or three times per minute. The stimuli requiring motor responses were rare and were included only to ensure that the participants remained vigilant.
Each fMRI scanning run consisted of one trial block of the cued condition and one trial block of the uncued condition (see below). Block order was counterbalanced across runs and across participants, and printed instructions (“red cues” or “self-paced”) indicating the relevant condition were presented immediately prior to each block. Each run lasted 410.2 s (189.1 s per block), including an initial 12-s fixation period, as well as a 20-s fixation period inserted between the two blocks. Each participant completed ten runs in the fMRI session, conducted at the F. M. Kirby Research Center for Functional Brain Imaging in Baltimore, Maryland.
In the cued condition, participants were occasionally instructed either to shift attention to the other task-relevant RSVP stream or to maintain attention (“hold”) on the current stream. Shift and hold instructions were conveyed by the letters L and R rendered in bright red, which appeared unpredictably within the currently relevant stream. When an R appeared in the left stream, participants were to shift attention to the right stream; in contrast, when an L appeared in the left stream, participants were to maintain attention on the left stream. L and R signaled the reverse instructions when presented in the right stream. Only the shift and hold cues were rendered in bright red, so that participants could easily discriminate cues from the other stimuli; this discriminability was verified during the training session. The onset asynchrony between critical events (i.e., the presentation of a cue or a target) in the cued condition varied randomly among 5.067, 6.000, 7.067, 8.000, 9.067, and 10.000 s. Approximately half of the cues were shift cues, and approximately half were hold cues. A variable number of hold cues were presented between successive shift cues (resulting in unpredictable cue sequences), and, of particular importance, the mean onset asynchrony between shift cues was 19.8 s (observed asynchrony range = 5.067–44.267 s, between-participants SD = 1.5 s). Targets were always separated from cues by a minimum onset asynchrony of 5.067 s.
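The cue rule described above reduces to a simple mapping: the red letter names the side to attend, so a cue is a shift cue only when it names the stream opposite to the one currently attended. The following is a minimal illustrative sketch in Python, not the MATLAB code used in the study; the function names are hypothetical, and uniform sampling of the six onset asynchronies is an assumption, because the text says only that the asynchrony varied randomly.

```python
import random
from typing import Tuple

# Onset asynchronies between critical events, in seconds (values from the text).
ONSET_ASYNCHRONIES_S = (5.067, 6.000, 7.067, 8.000, 9.067, 10.000)


def interpret_cue(cue_letter: str, attended_side: str) -> Tuple[str, str]:
    """Return (instruction, side to attend next) for a red cue letter.

    cue_letter is "L" or "R"; attended_side is "left" or "right".
    The letter names the side to attend: it is a hold cue when it matches
    the currently attended side and a shift cue otherwise.
    """
    cued_side = {"L": "left", "R": "right"}[cue_letter]
    instruction = "hold" if cued_side == attended_side else "shift"
    return instruction, cued_side


def sample_onset_asynchrony(rng: random.Random) -> float:
    """Draw one onset asynchrony (assumption: uniform over the six values)."""
    return rng.choice(ONSET_ASYNCHRONIES_S)
```

For example, `interpret_cue("R", "left")` yields a shift to the right stream, whereas `interpret_cue("R", "right")` yields a hold on the right stream.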
In the uncued condition, the shift and hold cues were omitted (replaced by filler letters). Instead, participants were instructed to shift attention voluntarily from one task-relevant stream to the other a few (roughly three or four) times each minute, and to respond to targets appearing within the currently attended (relevant) stream just as they had responded in the cued condition. Participants were not required to shift attention at any particular times in the uncued condition. A previous study (Gmeindl, Gao, Yantis, & Courtney, 2008) employing a very similar design, but one in which buttonpress responses were used to verify the accurate timing of the indexed attention shifts, indicated that with these instructions participants shifted attention between the left and right RSVP streams every 21.1 s on average (SE = 2.3 s).
Prior to the fMRI session, each participant completed a training session in our laboratory at Johns Hopkins University. Throughout the training session, the RSVP stimulus frame duration was incrementally decreased from 400 to 133.3 ms, a rate at which participants were able to maintain an accuracy in the cued condition of at least 80 % correct across two successive blocks. In the training session, accuracy feedback was provided at the end of each block. In the fMRI session, the RSVP stimulus frame duration was fixed at 133.3 ms and accuracy feedback was omitted.
Functional MRI data were acquired with a Philips Intera 3-T scanner and an eight-channel SENSE head coil (MRI Devices). High-resolution, whole-brain anatomical volumes were acquired with an MPRAGE T1-weighted sequence yielding 200 1-mm coronal slices (1 × 1 mm in-plane resolution, matrix = 256 × 256, TE = 3.7 ms, TR = 8.1 ms, flip angle = 8°). Whole-brain functional volumes were acquired with a T2*-weighted echoplanar imaging sequence yielding 30 2.5-mm axial slices (1-mm gap, 2.5 × 2.5 mm in-plane resolution, matrix = 76 × 76, TE = 30 ms, TR = 1.5 s, flip angle = 70°). Eight subsequently discarded volumes were collected at the beginning of each run to allow magnetization to reach a steady state prior to task presentation.
Imaging data preprocessing
The functional MRI data were preprocessed using the BrainVoyager QX software, version 1.10 (Brain Innovation). The data from each run were corrected for slice-time acquisition and motion, and then temporally high-pass filtered (three cycles per run). To correct for between-run motion, each participant’s functional volumes were all coregistered to his or her high-resolution anatomical volume. Voxels were resampled to 3 × 3 × 3 mm. No other spatial smoothing or normalization was performed. After preprocessing, the blood oxygenation level dependent (BOLD) time course was extracted from each voxel for each run. The BOLD amplitude at each time point (i.e., the functional volume, or TR) was transformed into a z score with respect to the mean and standard deviation of the voxel’s time course for that run, and then entered into MATLAB for the MVPA (see below). To conduct the Cued-Shift > Cued-Hold contrast reported below, each participant’s anatomical and functional volumes were transformed into Talairach space using a rigid-body transformation. Then, for each participant, a general linear model of the cued-attention data was formed. This model included regressors for shift cues, hold cues, and targets that were each created by convolving a single-gamma hemodynamic response function with Kronecker delta (stick) functions that marked the onsets of the corresponding events; head movements in the x, y, and z dimensions were included as regressors of no interest. A standard group-level analysis (one-sample t test) was then performed on the results of a Cued-Shift > Cued-Hold contrast, with statistical maps being corrected for multiple comparisons by applying a cluster-size threshold [voxel-wise p < .001, t(11) = 3.5, corrected α = .05].
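The per-run z-scoring of each voxel's time course can be illustrated with a short NumPy sketch. This is not the study's code. The array layout (time points by voxels), the use of the population standard deviation, and the handling of constant voxels are all assumptions made for illustration.

```python
import numpy as np


def zscore_run(bold: np.ndarray) -> np.ndarray:
    """Z-score each voxel's time course within a single run.

    bold has shape (n_timepoints, n_voxels), one row per functional
    volume (TR). Each column is standardized with respect to its own
    mean and standard deviation for that run.
    """
    bold = np.asarray(bold, dtype=float)
    mean = bold.mean(axis=0, keepdims=True)
    sd = bold.std(axis=0, keepdims=True)  # assumption: population SD (ddof=0)
    sd[sd == 0] = 1.0  # assumption: constant voxels map to 0 rather than NaN
    return (bold - mean) / sd
```

Applying the function separately to each run, rather than to the concatenated session, removes between-run differences in baseline and scale before the data are passed to the classifier.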
Multivoxel pattern analysis
For each participant, we first used data from only the cued condition to train an MVPA classifier (linear support vector machine implemented in LIBSVM; Chang & Lin, 2011; www.csie.ntu.edu.tw/~cjlin/libsvm) to distinguish between the patterns of activity associated with sustained attention to the left versus the right RSVP stream. These were relatively long epochs (up to 44.3 s) during which participants had been cued to attend to the RSVP stream on the left (Attend Left) or the right (Attend Right) of fixation. We trained the classifier on these Attend Left and Attend Right epochs using multivoxel patterns recorded from 7.5 s after the onset of the corresponding epoch (to account for the hemodynamic response lag) to 1.5 s after the offset of the corresponding epoch, with each constituent time point (i.e., TR) treated as a training sample. The MVPA was conducted first using all voxels within a whole-brain mask (ventricles excluded) created separately for each participant in the native anatomical space. The number of voxels in the mask ranged from 41,608 to 50,314 across participants (M = 47,525). A standard leave-one-run-out cross-validation procedure was used to evaluate classification accuracy. This initial whole-brain MVPA provided a set of weights (one per voxel), for each participant, that indicated how much information each voxel contributed to the correct classification. To select those voxels that were most informative, we ranked all of the voxels according to the absolute values of their weights and then repeated the cross-validation procedure with increasingly large subsets (1, 50, 100, 500, 1,000, etc.) of the most informative voxels. Classification accuracy, averaged across participants, varied approximately as an inverted-U function of the number of voxels included in the MVPA, with a peak mean accuracy at 3,000 voxels. Therefore, we selected for each participant the 3,000 most informative voxels and trained a new, optimized classifier on the cued-condition data from these 3,000 voxels (no run left out).
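Two bookkeeping steps in this procedure, ranking voxels by absolute classifier weight and generating leave-one-run-out splits, can be sketched as follows. This is a hedged illustration only: the function names are hypothetical, the classifier itself (trained with LIBSVM in the study) is deliberately left out, and how the per-voxel weights were combined across cross-validation folds is not specified in the text.

```python
import numpy as np


def top_k_voxels(weights, k):
    """Indices of the k voxels with the largest absolute weights.

    weights is a 1-D array with one linear-classifier weight per voxel.
    """
    order = np.argsort(-np.abs(np.asarray(weights, dtype=float)), kind="stable")
    return order[:k]


def leave_one_run_out(run_ids):
    """Yield (held_out_run, train_indices, test_indices) for each run.

    run_ids gives the run number of every training sample (TR).
    """
    run_ids = np.asarray(run_ids)
    for held_out in np.unique(run_ids):
        test_mask = run_ids == held_out
        yield held_out, np.flatnonzero(~test_mask), np.flatnonzero(test_mask)
```

In a full pipeline, each candidate subset size (1, 50, 100, 500, 1,000, and so on) would be passed to `top_k_voxels`, and the resulting columns of the data matrix would be used to train and test a linear classifier within each split produced by `leave_one_run_out`.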
This optimized classifier was then applied to the data from the uncued condition, resulting in a multivoxel pattern time course (MVPTC; Chiu, Esterman, Gmeindl, & Yantis, 2012; see also Greenberg et al., 2010) that indicated, for each time point, the degree to which the pattern of activity across the 3,000 voxels corresponded to the patterns associated with Attend Left versus Attend Right. The MVPTC was temporally smoothed by averaging the MVPTC value at each time point with the MVPTC values for the subsequent two time points, and then the MVPTC was binarized. The time points at which the classification reversed (i.e., shifting from left to right or vice versa) were demarcated as attention-shift points.
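The smoothing, binarization, and shift-demarcation steps can be sketched in a few lines of NumPy. The sketch rests on several assumptions that the text does not settle: the MVPTC is taken to be a signed classifier output with positive values indicating Attend Right, the binarization threshold is zero, and the averaging window simply shrinks at the end of a run where fewer than two subsequent time points exist.

```python
import numpy as np


def smooth_forward(mvptc, n_following=2):
    """Average each time point with the following n_following time points."""
    mvptc = np.asarray(mvptc, dtype=float)
    smoothed = np.empty(len(mvptc))
    for t in range(len(mvptc)):
        # Assumption: the window is truncated at the end of the run.
        smoothed[t] = mvptc[t:t + n_following + 1].mean()
    return smoothed


def binarize(mvptc, threshold=0.0):
    """Label each time point +1 (assumed Attend Right) or -1 (Attend Left)."""
    return np.where(np.asarray(mvptc, dtype=float) > threshold, 1, -1)


def shift_points(labels):
    """Indices of time points at which the binary classification reverses."""
    labels = np.asarray(labels)
    return np.flatnonzero(labels[1:] != labels[:-1]) + 1


# Toy time course: left, then right, then left again.
toy = [-1, -1, -1, 1, 1, 1, -1, -1]
toy_shifts = shift_points(binarize(smooth_forward(toy)))
```

Because the window looks forward in time, the demarcated reversals in this toy series fall one TR earlier than the sign changes in the raw values.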
To verify that the MVPTC could be used to reliably index attention shifts, we used a leave-one-run-out cross-validation procedure in which, for each participant, we iteratively left out the data from one run of the cued condition (e.g., Run 1) and trained the classifier using the rest of the data from the cued condition (e.g., Runs 2–10). For each run left out, we then compared the onsets of attention shifts indexed by the MVPTC to the actual onsets of the shift cues. If the onset of an indexed attention shift fell within three TRs (i.e., 4.5 s, to account for hemodynamic response lag) of the onset of a shift cue, we considered this a hit. Across participants, the mean hit rate was 87.6 % (SE = 2.3 %). The false-alarm rate, as defined by the three-TR threshold, was comparatively low (M = 34.3 %, SE = 5.1 %) and may reflect that attention likely did fluctuate occasionally during performance (resulting, e.g., in missed targets). The hit rate was significantly higher than the false-alarm rate [t(11) = 7.29, p < .001], indicating that the MVPTC reliably indexed attention shifts across participants and across runs within each scanning session.
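The scoring of indexed shifts against actual shift cues can be illustrated as follows. The text does not give exact definitions, so this sketch makes two explicit assumptions: an indexed shift counts as matching a cue when it occurs zero to three TRs after the cue onset, and the false-alarm rate is the proportion of indexed shifts that match no shift cue. The function name is hypothetical.

```python
import numpy as np


def hit_and_false_alarm_rates(indexed_shifts, cue_onsets, max_lag_trs=3):
    """Score classifier-indexed shifts against shift-cue onsets (in TRs).

    Assumptions (not stated in the text):
      * a match requires 0 <= indexed shift - cue onset <= max_lag_trs;
      * hit rate = proportion of cues with at least one matching shift;
      * false-alarm rate = proportion of indexed shifts matching no cue.
    """
    indexed = np.asarray(indexed_shifts, dtype=float)
    cues = np.asarray(cue_onsets, dtype=float)
    if len(indexed) == 0 or len(cues) == 0:
        raise ValueError("need at least one indexed shift and one cue")
    lag = indexed[None, :] - cues[:, None]  # rows: cues, columns: indexed shifts
    match = (lag >= 0) & (lag <= max_lag_trs)
    hit_rate = match.any(axis=1).mean()
    false_alarm_rate = (~match.any(axis=0)).mean()
    return hit_rate, false_alarm_rate
```

With cues at TRs 10, 30, and 50 and indexed shifts at TRs 12, 31, 40, and 58, two of the three cues are matched and two of the four indexed shifts are unmatched.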
A priori regions of interest
A recent study (Chiu et al., 2012) using a novel MVPTC analysis revealed that two cortical regions, the right middle frontal gyrus (rMFG) and the dorsal anterior cingulate cortex (dACC), and one subcortical cluster in the basal ganglia (BG) demonstrated functional connectivity with the mSPL during cued shifts of attention. Furthermore, rMFG and dACC had also been implicated in our preliminary study (Gmeindl et al., 2008), in which participants engaged in uncued attention shifts during a similar task, but one in which buttonpress responses were used to verify the timing of the demarcated attention shifts. Of particular interest, that study indicated that rMFG and dACC were activated reliably more for self-generated than for cue-driven shifts. On the basis of these findings, we included rMFG, dACC, and BG in the present study as a priori regions of interest (ROIs; Table 1) that were functionally defined on the basis of the data from Chiu et al. (2012), and we tested for increased preparatory processing in these regions prior to self-generated shifts of attention.
Event-related average time-course analysis
To test the directional hypothesis that self-generated attention shifts are associated with earlier rises in activity within the a priori ROIs (Table 1) than are cued attention shifts, we performed an event-related time-course analysis (Serences, 2004).
We first extracted time courses from the mSPL, rMFG, dACC, and BG regions (see Tables 1 and 2 and Fig. 2) by calculating the mean BOLD amplitude across all voxels within the region for each time point, covering 12 s centered on the uncued-shift time point identified by the MVPA classifier (i.e., from 6.0 s before to 6.0 s after the uncued-shift time point). BOLD amplitudes were then transformed to percent signal changes, relative to the mean BOLD amplitude calculated within the region across the run.
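The extraction of event-related averages can be sketched as follows, assuming a TR of 1.5 s so that the 12-s window spans four TRs on either side of each shift point (nine samples in total). The function names are hypothetical, and dropping shift points whose window would extend past the start or end of the run is an assumption; the text does not say how such cases were handled.

```python
import numpy as np

TR_S = 1.5  # repetition time in seconds (from the acquisition parameters)


def roi_percent_signal_change(bold, roi_mask):
    """Mean ROI time course as percent signal change for one run.

    bold has shape (n_timepoints, n_voxels); roi_mask is a boolean
    vector selecting the voxels in the region. The baseline is the mean
    ROI amplitude across the whole run.
    """
    roi_tc = np.asarray(bold, dtype=float)[:, np.asarray(roi_mask, dtype=bool)].mean(axis=1)
    baseline = roi_tc.mean()
    return 100.0 * (roi_tc - baseline) / baseline


def event_related_average(psc, shift_trs, half_window_s=6.0, tr_s=TR_S):
    """Average the time course in a window centered on each shift point."""
    psc = np.asarray(psc, dtype=float)
    half = int(round(half_window_s / tr_s))  # 4 TRs for a 6-s half window
    epochs = [
        psc[t - half:t + half + 1]
        for t in shift_trs
        if t - half >= 0 and t + half < len(psc)  # assumption: drop edge events
    ]
    if not epochs:
        raise ValueError("no shift point has a complete window")
    return np.mean(epochs, axis=0)
```

The same two functions apply unchanged to shift points demarcated in the cued and the uncued conditions, which keeps the two event-related averages directly comparable.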
The MVPA classifier, although optimized, was associated with some degree of classification error, which smoothed the distribution of the demarcated shift points. We therefore also computed event-related average BOLD time courses for cued shifts of attention using the same algorithm. This was achieved by time-locking the BOLD signal to demarcated cued-shift points based on the output from the classifier when it was applied to the data from the cued-attention condition (using a leave-one-run-out procedure), rather than by time-locking to the actual onsets of the shift cues. Importantly, this method incorporates classifier error into the demarcation of both types of shifts (and avoids the need to correct for hemodynamic response lag at this stage), thereby allowing for a more appropriate and direct comparison between the uncued-shift and cued-shift event-related averages.
Finally, we performed a single planned contrast (i.e., the interaction between shift condition and time, one-tailed, α = .05, N = 12) on the event-related averages for each of these ROIs. Post-hoc tests of simple main effects were conducted following evidence for a reliable shift condition × time interaction; the statistical threshold for these post-hoc tests was Bonferroni-corrected (α = .025).