Introduction

Gaze behavior can speak volumes about an observer’s goals in the present moment (Henderson, 2017; Henderson, Shinkareva, Wang, Luke, & Olejarczyk, 2013) and how one may act on their environment in the immediate future (David-John et al., 2021; Hayhoe & Ballard, 2005; Hayhoe & Matthis, 2018; Hayhoe, Shrivastava, Mruczek, & Pelz, 2003; Pelz & Canosa, 2001; Sullivan, Ludwig, Damen, Mayol-Cuevas, & Gilchrist, 2021). When planning physical actions with visual guidance, observers look at objects they intend to interact with (Hayhoe & Ballard, 2005; Hayhoe & Matthis, 2018; Hayhoe et al., 2003), and look ahead to objects involved in later segments of the action sequence (Pelz & Canosa, 2001; Sullivan et al., 2021). Beyond what fixations on objects reveal, gaze dynamics can be used to predict when an observer is about to interact with an object (David-John et al., 2021). This evidence suggests that visual attention is systematically deployed to objects in the environment in the moments leading up to an agent interacting with an object.

The interactions one could perform with an object influence visual attention even when observers are not actively planning to interact with the object. Gomez and Snow (2017) found that object affordances guide overt attention during a visual search task; furthermore, the influence of affordances on attention is stronger for physically present objects that are within reach as opposed to 2- or 3-D object representations displayed on a screen (Gomez, Skiba, & Snow, 2018). When observers learned the function or features of novel objects (e.g., the pulling affordance of a soap container mounted on the ceiling), successfully learning those affordances facilitated subsequent search for the objects (Castelhano & Witherspoon, 2016). Taken together, the findings suggest a strong influence of object affordances on visual attention in scenes. In the current study, we investigated whether the aforementioned influence of affordances is driven by specific object affordances (the ability to be grasped or manipulated) or by affordances broadly defined (an object’s ability to be interacted with in any way).

The finding that visual attention orients to objects that afford interaction is consistent with cognitive guidance theory (Henderson, Brockmole, Castelhano, & Mack, 2007). According to cognitive guidance theory, visual attention is not passively pulled to regions of the scene that stand out against their surroundings (as asserted by Itti & Koch, 2000; Parkhurst, Law, & Niebur, 2002), but instead cognitive systems push visual attention to information-rich regions of the scene. Visual attention is allocated to informative objects in scenes more so than to regions that contrast with surrounding areas in luminance, orientation, and other physical properties of the scene, as captured by image-computable saliency maps (Einhäuser, Spain, & Perona, 2008; Nuthmann & Henderson, 2010), even when the information is not task-relevant (Hayes & Henderson, 2019b; Shomstein, Malcolm, & Nah, 2019) and when contrasts in physical salience are task-relevant (Peacock, Hayes, & Henderson, 2019b). In recent work, Henderson and Hayes (2017) developed a method to capture the spatial distribution of local semantic information in a scene using meaning maps, which were designed to be comparable to saliency maps. To construct the maps, raters were prompted to rate small patches taken from a real-world scene on the degree to which each patch was informative or recognizable. General informativeness as captured by meaning maps has been shown to account for variance in attention better than Graph-Based Visual Saliency maps (GBVS; Harel, Koch, & Perona, 2006) while observers engaged in aesthetic judgment and memorization tasks (Henderson & Hayes, 2017, 2018; Rehrig, Hayes, Henderson, & Ferreira, 2020a), action and scene description tasks (Henderson, Hayes, Rehrig, & Ferreira, 2018; Rehrig, Peacock, Hayes, Henderson, & Ferreira, 2020b), and free-viewing tasks (Peacock, Hayes, & Henderson, 2019a). Furthermore, in a recognition memory task, observers were more likely to resample previously fixated regions in a scene when those regions were informative, as captured by meaning maps (Ramey, Yonelinas, & Henderson, 2020). These findings indicate that semantic information in scenes guides visual attention, consistent with the cognitive guidance theory of overt visual attention.

Meaning maps have shown that the distribution of local semantic information—broadly defined—guides visual attention in a scene (Henderson & Hayes, 2017, 2018; Henderson et al., 2018; Peacock et al., 2019a, 2019b; Rehrig et al., 2020a). While meaning maps have proven useful in demonstrating the relationship between scene semantics and visual attention, they were not intended to be a complete representation of semantic information in scenes, but instead were meant to serve as a starting point to quantify semantic information in scenes in a new way. As such, the rating instruction used to construct the original meaning maps was intentionally quite broad—to indicate how informative or recognizable the patch appeared to be, following Antes (1974) and Mackworth and Morandi (1967)—with the understanding that the features queried do not capture all of scene semantics. However, the mapping procedure is flexible in that the instructions can be modified to tap into raters’ conceptions of different types of information in the scene. In our previous work (Rehrig et al., 2020b), we altered the rating instruction to investigate whether grasping affordances—the possible grasping interactions that could be performed with objects in the scene—predict visual attention when speakers describe actions that could be carried out in a scene. Our goal was not to develop a computational model that predicts viewer fixations perfectly, but rather to determine what kind of information in scenes is most relevant for the cognitive processes that give rise to overt attention during the action description task. To isolate what information is cognitively relevant, we measured different kinds of information to see what type of information predicted visual behavior best. We constructed physical saliency, meaning, and grasp maps, and correlated each map with an attention map derived from viewer fixations in three action description experiments. For typical real-world scenes, we found that meaning maps explained variance in attention maps best (consistent with Hayes & Henderson, 2019b; Henderson & Hayes, 2017; Henderson et al., 2018; Peacock et al., 2019b), but meaning and grasp maps explained comparable variance in attention when scenes instead depicted reachable spaces. The results suggest that general informativeness guides attention best overall, but grasping object affordances can guide attention as well as general semantic information does when graspable objects are shown within reach of the camera’s viewpoint—in other words, when the scene itself is conducive to grasping objects.

In Rehrig et al. (2020b), we found that both grasping object affordances and general informativeness guide visual attention in scenes that depict reachable spaces, contributing to the evidence that object affordances influence attention. However, it remains puzzling that the influence of object affordances on attention was weak for scenes that were not optimized for grasping, given that object affordances predicted attention well in other studies (Castelhano & Witherspoon, 2016; Gomez et al., 2018; Gomez & Snow, 2017). We suspect object affordances underperformed in Rehrig et al. (2020b) due to the narrow way in which we defined them. Because we operationalized object affordances as grasping affordances specifically—a very narrow type of object affordance—grasp maps were likely unable to capture the influence of object affordances on attention broadly, and therefore our prior work may have underestimated the degree to which object affordances guide visual attention. Rather than mapping scenes to capture another specific variety of object affordance, in the current study we constructed interact maps to capture the degree to which any type of object interaction (e.g., grasping, sitting, watching) was possible in a scene. Once we constructed this broader measure of object affordances, we re-analyzed the fixation data using a hierarchical logistic regression model that compared meaning, grasp, and interact map values to determine which of the three types of semantic information best predicted attended scene locations.

The current study expands on Rehrig et al. (2020b) in several ways. First, we constructed interact maps for scenes in the Rehrig et al. (2020b) data set to capture broadly-defined object affordances in a scene. Second, we constructed meaning, grasp, and interact maps for the 15 scenes that were not originally included in the Rehrig et al. (2020b) analysis, doubling the number of scenes included in the Experiment 1 data set (N = 30 scenes). Third, to explore whether task goals mediate the influence of semantic information in the scene on attention, we analyzed eye movements in two additional data sets for which the task was not to describe the actions possible in a scene: an open-ended scene description task (Henderson et al., 2018; Experiment 4) and a scene memorization task (Rehrig et al., 2020a; Experiment 5). The two additional data sets included the same scenes and the same number of subjects as Experiment 1 in Rehrig et al. (2020b). Finally, we analyzed the data using a new approach inspired by Nuthmann, Einhäuser, and Schütz (2017) and developed by Hayes and Henderson (2021), which enabled us to examine fine-grained differences between regions that were selected for attention over other parts of the scene that were not fixated.

Nuthmann et al. (2017) developed a novel analysis approach that exploits two key assumptions about overt visual attention: 1) the regions of a scene that are prioritized for attention differ from regions that were not attended in ways that are quantifiable (e.g., using saliency maps) and 2) measurable differences between attended regions and unattended regions may explain why those regions were prioritized for attention over others (e.g., the presence of interesting objects). To that end, Nuthmann et al. (2017) divided scenes into a pre-defined grid and assigned each square in the grid a value of 1 if any fixations fell within the square, or 0 if the square was not fixated, and conducted a logistic mixed-effects regression analysis with the average values for various saliency models and Euclidean distance from the center of the screen for each square in the grid as predictors in the model. Hayes and Henderson (2021) expanded on Nuthmann et al.’s (2017) approach to obviate the need for a grid, instead measuring center proximity (inverted from Euclidean center distance) and feature values (in this case, semantic values for objects computed from ConceptNet) in a 3° diameter window approximating the size of the fovea around each fixation coordinate, as well as from randomly sampled locations that were not fixated.

In the current study, we implemented Hayes and Henderson’s (2021) analysis approach on fixation data from 5 eyetracking experiments previously reported in Henderson et al. (2018) and Rehrig et al. (2020a, 2020b). Our goal was to determine whether overt visual attention is guided by semantic information broadly construed, or by object affordances. To that end, we assessed whether general informativeness, graspability, or interactability predicted visual attention across 3 different types of tasks: 1) action description, in which speakers describe the actions that could be carried out in a scene, 2) scene description, in which speakers describe a scene however they like, and 3) scene memorization, in which observers study a scene in preparation for a later recognition memory task. Because image salience did not predict attention well in our previous work (Henderson et al., 2018; Rehrig et al., 2020a, 2020b), we instead focused only on three different operationalizations of semantic information in the new analysis: general informativeness, graspability, and interactability, as captured by meaning, grasp, and interact maps, respectively. Based on our previous work (Henderson et al., 2018; Rehrig et al., 2020a, 2020b), we expected meaning to predict fixated locations well overall across tasks. With respect to the action description experiments (1–3) specifically, we expected meaning map values to perform better than grasp map values in Experiments 1 and 2, and we expected both meaning and grasp map values to predict fixated locations well in Experiment 3 because the scenes depicted reachable spaces. If general object affordances as captured by interact maps predict attention better than the narrowly-defined grasping affordances, we expected interact map values to predict fixated locations better than grasp map values in all three action description experiments, and to perhaps rival general informativeness. If object affordances guide attention even when they are less task-relevant, but might still be mentioned in a description, we expected interact map values—and possibly grasp map values—to predict fixated locations when observers described scenes however they liked (Experiment 4). Likewise, if object affordances guide attention generally (as suggested by Gomez et al., 2018; Gomez & Snow, 2017), not just when the task is explicitly linguistic in nature, then we similarly expected interact map and grasp map values to predict fixated locations well in a scene memorization task (Experiment 5).

Methods

Eyetracking data collection

Subjects

All subjects were undergraduate students enrolled at the University of California, Davis who participated in exchange for course credit. They spoke English as a first language, were at least 18 years old, and had normal or corrected-to-normal vision. They were naive to the purpose of the experiment and provided informed consent as approved by the University of California, Davis Institutional Review Board. Thirty-two subjects participated in Experiment 1 (2 excluded from analysis), 48 participated in Experiment 2 (8 excluded), 49 participated in Experiment 3 (9 excluded), 38 participated in Experiment 4 (8 excluded), and 68 participated in Experiment 5 (8 excluded). Across experiments, subjects were excluded from analysis either because their eyes could not be tracked accurately or because of software, hardware, or experimenter error. In Experiment 5 only, 30 of the subjects completed a secondary task in addition to memorizing scenes. The secondary task was an articulatory suppression task in which subjects repeated a sequence of digits aloud while viewing the scene, which was intended to prevent subjects from using internal language to facilitate memorization of the scene. The original study showed no effect of articulatory suppression on the relationship between scene informativeness and attention (see Rehrig et al., 2020a for details). For the purpose of the current analysis, we chose to examine the control condition only (the scene memorization task with no secondary task) in order to draw a clean comparison with the description experiments, one that involved fewer changes in experimental parameters; data from the 30 subjects in the control condition were analyzed.

Stimuli

In all experiments, digitized and luminance-matched photographs of real-world scenes depicting indoor and outdoor environments were presented at 1024 × 768 resolution. There were 30 scenes presented in Experiments 1, 4, and 5, and 20 in Experiment 2 (15 of which were also presented in Experiment 1). In Experiment 3, 20 scenes were presented, 15 of which were photographed by the first and third authors to depict reachable spaces. For those 15 scenes, the authors confirmed that objects in the foreground of the scene were within reach of the scene’s viewpoint. The remaining scenes were drawn from other studies: four from Xu, Jiang, Wang, Kankanhalli, and Zhao (2014) and one from Rehrig, Cullimore, Henderson, and Ferreira (2021). Text was removed from each scene presented in Experiment 3 using the clone stamp and patch tools in Adobe Photoshop CS4. One scene in Experiment 2 showed people in the background of the image; faces were not present in the other 54 scenes. See Appendix A for all 55 scenes and feature maps.

Apparatus

In all experiments, eye movements were recorded with an SR Research EyeLink 1000+ tower-mount eyetracker (spatial resolution 0.01°) at a sampling rate of 1000 Hz. Head movements were minimized using a chin and forehead rest integrated with the eyetracker’s tower mount. Although viewing was binocular, eye movements were recorded from the right eye only. The experiment was controlled using SR Research Experiment Builder software. Audio was recorded digitally at a rate of 48 kHz using a Shure SM86 cardioid condenser microphone.

In Experiments 1, 4, and 5, subjects sat 85 cm away from a 21” monitor such that scenes subtended approximately 27° × 20.5° of visual angle, and audio was recorded digitally at a rate of 48 kHz using a Roland Rubix 22 USB audio interface and a Shure SM86 cardioid condenser microphone. In Experiments 2 and 3, subjects sat 83 cm away from a 24.5” monitor such that scenes subtended approximately 27° × 20.5° of visual angle at a resolution of 1024 × 768 pixels, presented in 4:3 aspect ratio. For both Experiments 2 and 3, data were collected on two separate systems that were identical except that the operating system for the subject computer in one system was Windows 10, and Windows 7 on the other.

Procedure

A calibration procedure was conducted at the beginning of each session to map eye position to screen coordinates. Successful calibration required an average error of less than 0.49° and a maximum error below 0.99°. Fixations and saccades were parsed with EyeLink’s standard algorithm using velocity and acceleration thresholds (30°/s and 9500°/s²; SR Research, 2017).

After successful calibration, subjects received task instructions. In Experiments 1 and 2, the instructions were as follows: “In this experiment, you will see a series of scenes. In each scene, think of the average person. Describe what the average person would be inclined to do in the scene. You will have 30 s to respond.” In Experiment 3, subjects were instead instructed as follows: “In this experiment, you will see a series of scenes. For each scene, describe what you would do in the scene. You will have 30 s to respond.” In Experiment 4, subjects were instructed to describe scenes as follows: “In this experiment, you will see a series of scenes. You will have 30 s to describe the scene out loud.” In Experiment 5, subjects were instructed to study a series of scenes for a later memory test. In each experiment, the instruction was followed by three practice trials that allowed subjects to familiarize themselves with the task and the duration of the response window. Subjects pressed any button on a button box to advance throughout the task.

The task instruction was repeated before subjects began the experimental block (Fig. 1a). Within the block, each subject received a unique pseudo-random trial order that prevented two scenes of the same type (e.g., living room) from occurring consecutively. A trial proceeded as follows. First, a five-point fixation array was displayed to check calibration (Fig. 1b). The subject fixated the center cross and the experimenter pressed a key to begin the trial if the fixation was stable; otherwise the experimenter reran the calibration procedure. The scene was then shown for a period of 30 s (Experiments 1–4) or 12 s (Experiment 5), during which time eye movements were recorded (Fig. 1c). In Experiments 1–4, audio was also recorded during scene viewing. After the scene viewing period ended, subjects were instructed to press a button to proceed to the next trial (Fig. 1d). The trial procedure repeated until all trials were complete (Experiments 1, 4, and 5 = 30 trials; Experiments 2 and 3 = 20 trials). In Experiment 5 only, subjects completed a recognition memory test consisting of the 30 scenes presented in the experiment and 30 image foils depicting similar scenes.

Fig. 1

Visualization of the trial procedure for the eyetracking experiments. First, (a) task instructions were reiterated to subjects following the practice trials. (b) A five-point fixation array was used to gauge calibration quality. (c) A real-world scene was shown for 30 s (12 s in Experiment 5). Eye movements were recorded for the duration of the viewing period in all experiments; audio was additionally recorded in Experiments 1–4. (d) Subjects pressed a button to initiate the next trial. After pressing the button, the trial procedure repeated (from b)

Eye movement data were imported offline into MATLAB using the Visual EDF2ASC tool packaged with SR Research DataViewer software. The first fixation was excluded from analysis, as were saccade outliers (amplitude > 20°).

Meaning, grasp, and interact map generation

We used the same meaning and grasp maps that were generated for all three experiments in Rehrig et al. (2020b). We additionally mapped the 15 scenes that were not included in the prior analysis for informativeness and graspability, and mapped all 55 scenes for interactability. The mapping procedure was identical between the current study and Rehrig et al. (2020b). In the interest of brevity, we describe details of the mapping procedure only for maps introduced in the current study.

Meaning maps

Meaning maps were generated using a contextualized rating procedure in which subjects viewed small circular patches drawn from the scene alongside a thumbnail image of the full scene, with a green circle indicating the region the patch came from (Peacock et al., 2019a). Each of the 15 scenes (1024 × 768 pixel) was decomposed into a series of partially overlapping circular patches at fine and coarse spatial scales (Fig. 2b&c), resulting in 4,500 unique fine-scale patches (93 pixel diameter) and 1,620 unique coarse-scale patches (217 pixel diameter), 6,120 patches in total.

Fig. 2

(a–d) Feature map generation schematic. (a) Real-world scene. Raters saw the real-world scene and either a fine (inner) or coarse (outer) green circle indicating the origin of the scene patch under consideration. (b–c) Fine-scale (b) and coarse-scale (c) spatial grids used to create scene patches. (d) Examples of scene patches that were rated as low or high with respect to meaning, grasp, and interact. (e–g) Examples of meaning (e), grasp (f), and interact (g) maps for the scene shown in (a)

Raters were 97 undergraduates enrolled at UC Davis who participated through Sona. Students received credit toward a course requirement for participating. Subjects were at least 18 years old, had normal or corrected-to-normal vision, and had normal color vision.

Each subject rated 300 random patches extracted from the 15 scenes, presented alongside a small (256 × 192 pixel) image of the scene for context. Subjects were instructed to rate how informative or recognizable each patch was using a 6-point Likert scale (‘very low’, ‘low’, ‘somewhat low’, ‘somewhat high’, ‘high’, ‘very high’). Prior to rating patches, subjects were given two examples of low-meaning and two examples of high-meaning scene patches in the instructions to ensure that they understood the task. Scene-patch pairs were presented in random order.

Ten catch trials, which were easy for a human completing the task in good faith to answer correctly, were included in each survey to serve as an attention check. Each catch trial presented a unique catch patch for the subject to rate, which showed a blank surface drawn from the scene (usually a wall or ceiling; see Fig. 3). As in the test trials, catch patches were presented alongside an image showing where in the scene the patch was drawn from so that subjects were not aware the trial was an attention check. If subjects completed the task in accordance with the examples provided in the task instructions, catch patches should have been rated as low in meaning (a value of 1 or 2 on the Likert scale). To score catch trial performance, ratings of 2 or lower were considered correct responses, and ratings of 3 or higher were scored as incorrect. Ratings from 34 subjects who scored below 80% on the catch patches were excluded. Each unique patch was rated at least 3 times by 3 independent raters for a total of 18,360 ratings.

Fig. 3

Examples of fine- (a) and coarse-scale (b) catch patches that were included as attention checks

Meaning maps were generated from the ratings by averaging, combining, and smoothing the fine- and coarse-scale rating maps. The ratings for each pixel at each scale in each scene were averaged, producing an average fine and an average coarse rating map for each scene. The fine and coarse maps were then averaged [(fine map + coarse map)/2]. This procedure was used for each scene. The final map was blurred using a Gaussian filter via the MATLAB function ‘imgaussfilt’ with a sigma of 10 (see Fig. 2e for an example meaning map).
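
The combination and smoothing steps can be made concrete with a short sketch. The code below is a minimal illustration in R rather than the original implementation (which used MATLAB’s ‘imgaussfilt’); fine_map and coarse_map are placeholder names for 768 × 1024 matrices of per-pixel average ratings, and the imager package is assumed for the Gaussian blur.

```r
library(imager)  # assumed here for Gaussian smoothing; the original used MATLAB's imgaussfilt

# fine_map and coarse_map: 768 x 1024 matrices of per-pixel average ratings (placeholder names)
combined <- (fine_map + coarse_map) / 2               # average the fine- and coarse-scale maps
blurred  <- isoblur(as.cimg(combined), sigma = 10)    # Gaussian blur with sigma = 10
meaning_map <- blurred[, , 1, 1]                      # back to a plain matrix
```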

Grasp maps

Grasp maps were constructed from ratings in the same manner as meaning maps, with the critical exception that subjects rated each patch on how ‘graspable’ the region of the scene shown in the patch was. In the instructions, we defined ‘graspability’ as how easily an object depicted in the patch could be picked up or manipulated by hand. If a patch contained more than one object or only part of an object, raters were instructed to use the object or entity that occupied the most space in the patch as the basis for their rating. The remainder of the procedure was identical to the one used to generate meaning maps.

Raters were 83 undergraduates enrolled at UC Davis who participated through Sona. Students received credit toward a course requirement for participating. Subjects were at least 18 years old, had normal or corrected-to-normal vision, and had normal color vision.

Each subject rated 300 random patches extracted from the 15 scenes, presented alongside a scene thumbnail for context. Subjects were instructed to rate how graspable each patch was using a 6-point Likert scale (‘very low’, ‘low’, ‘somewhat low’, ‘somewhat high’, ‘high’, ‘very high’). Prior to rating patches, subjects were given two examples each of low-graspability and high-graspability scene patches in the instructions to ensure that they understood the task. Scene-patch pair presentation order was random. Ratings from 20 subjects that scored below 80% on the catch patches were excluded. Each unique patch was rated at least 3 times by 3 independent raters for a total of 18,360 ratings.

Grasp maps were generated in the same manner as the meaning maps. Ratings were averaged, smoothed, and combined across scales. The ratings for each pixel at each scale in each scene were averaged, producing an average fine and coarse rating map for each scene. This procedure was used for each scene. An example grasp map can be seen in Fig. 2f.

Interact maps

Interact maps were constructed in the same manner as meaning and grasp maps, except that subjects were asked to rate the region of the scene visible in each patch based on how ‘interactable’ it was. We defined ‘interactability’ as the extent to which the rater considered what was shown to be an object with which a human might interact. As in the grasp map generation procedure, subjects were again instructed to rate the object that occupied the majority of the patch.

Each of the 55 scenes (1024 × 768 pixel) was decomposed into a series of partially overlapping circular patches at fine and coarse spatial scales (Fig. 2b&c), resulting in 16,500 unique fine-scale patches (93 pixel diameter) and 5,940 unique coarse-scale patches (217 pixel diameter), 22,440 patches in total.

Raters were 328 undergraduates enrolled at UC Davis who participated through Sona. Students received credit toward a course requirement for participating. Subjects were at least 18 years old, had normal or corrected-to-normal vision, and had normal color vision.

Each subject rated 300 random patches extracted from the 55 scenes, presented alongside a scene thumbnail for context. Subjects were instructed to rate how interactable each patch was using a 6-point Likert scale (‘very low’, ‘low’, ‘somewhat low’, ‘somewhat high’, ‘high’, ‘very high’). Prior to rating patches, subjects were given two examples each of low-interactability and high-interactability scene patches in the instructions to ensure that they understood the task. Scene-patch pair presentation order was random. Ratings from 103 subjects that scored below 80% on the catch patches were excluded. Each unique patch was rated at least 3 times by 3 independent raters (at least 67,320 ratings in total).

Interact maps were generated in the same manner as the meaning and grasp maps. Ratings were averaged, smoothed, and combined across scales. This procedure was used for each scene. An example interact map is shown in Fig. 2g.

Overall, the resulting meaning, grasp, and interact maps were correlated with one another; the correlation was particularly high between grasp and interact maps in Experiment 3 (mean R² = 0.72, SD = 0.10; Table 1).

Table 1 Correlations (R²) between feature maps

Analysis

Following Nuthmann et al. (2017), we examined which features influenced visual attention by comparing the feature map values at locations in the scene that were fixated to those for locations that were not, operating on the assumption that differences between regions of the scene that were and were not fixated speak to what information is prioritized for attention. Rather than dividing the scene into a grid (as Nuthmann et al., 2017 did), we elected to use the procedure developed by Hayes and Henderson (2021) to measure meaning, grasp, and interact map values in a window around each location, and compared the values for fixated locations to those of sampled locations in the scene that were not fixated.

Specifically, we conducted a logistic mixed-effects regression analysis in which the dependent variable was whether subjects fixated a location (1) or not (0). The dependent variable was defined as follows. For each subject and each trial, the x,y coordinates corresponding to the subject’s fixations were assigned a value of 1 (fixated). An equal number of locations that were not fixated were then randomly sampled from all possible coordinates in the 1024 × 768 image using the ‘sample’ function from the ‘random’ module in Python 3. Locations that the subject fixated during that trial, or locations that fell within a 1.5° visual angle (56 pixel) radius of any fixated location, were excluded from the sample space. The randomly sampled coordinates were assigned a value of 0 (not fixated).
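
To illustrate this step, the sketch below implements the per-trial sampling logic. The original sampling was carried out with Python’s ‘random’ module; this is an equivalent sketch in R, and the function and variable names are placeholders rather than part of the original pipeline.

```r
# Sample non-fixated coordinates for one subject and trial (sketch; names are placeholders).
# fix_x, fix_y: vectors of fixation coordinates for that trial.
sample_unfixated <- function(fix_x, fix_y, width = 1024, height = 768, radius = 56) {
  grid <- expand.grid(x = seq_len(width), y = seq_len(height))
  # minimum distance from every pixel to the nearest fixation
  min_d <- rep(Inf, nrow(grid))
  for (i in seq_along(fix_x)) {
    min_d <- pmin(min_d, sqrt((grid$x - fix_x[i])^2 + (grid$y - fix_y[i])^2))
  }
  eligible <- grid[min_d > radius, ]                  # drop pixels within 56 px of any fixation
  eligible[sample(nrow(eligible), length(fix_x)), ]   # one sampled coordinate per fixation
}
```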

We accounted for center bias in our model (Tatler, 2007; Hayes & Henderson, 2019a) using the center proximity measure developed by Hayes and Henderson (2021). We calculated the Euclidean distance between the center of the scene and every other pixel in the image and stored the value for each pixel in a 1024 × 768 matrix. The Euclidean distances were then z-scored and inverted for ease of interpretation, such that higher values indicate closer proximity to the center of the scene.
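
The center proximity map can be computed directly from pixel coordinates. The following is a minimal R sketch under the assumption of 1024 × 768 images, with the map stored as a 768-row × 1024-column matrix (row = y, column = x).

```r
# Center proximity: Euclidean distance from the image center, z-scored and inverted
# so that larger values mean closer to the center (sketch; assumes 1024 x 768 images).
width <- 1024; height <- 768
xs <- matrix(rep(seq_len(width), each = height), nrow = height)   # x (column) index of each pixel
ys <- matrix(rep(seq_len(height), times = width), nrow = height)  # y (row) index of each pixel
dist_center <- sqrt((xs - width / 2)^2 + (ys - height / 2)^2)
center_proximity <- matrix(-scale(as.vector(dist_center)), nrow = height)
```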

For each x,y coordinate pair, we then computed the mean feature and center proximity map values corresponding to a 3° visual angle (113 pixel) diameter window around the coordinate. We defined a mask for the region around the coordinate using a 56 pixel radius. The mask was then used to extract an array of map values from the meaning, grasp, interact, and center proximity maps, and the mean of each array was stored as the average feature map value corresponding to the x,y coordinate under consideration (Fig. 4).
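
A sketch of the window-averaging step is shown below: given any feature map stored as a matrix and an x,y coordinate, it averages the map values inside a 56-pixel-radius circular mask. The function and argument names are ours, not part of the original code.

```r
# Mean map value within a ~3 degree (113 pixel diameter) window around a coordinate (sketch).
window_mean <- function(feature_map, x, y, radius = 56) {
  h <- nrow(feature_map); w <- ncol(feature_map)
  xs <- matrix(rep(seq_len(w), each = h), nrow = h)   # x (column) index of each pixel
  ys <- matrix(rep(seq_len(h), times = w), nrow = h)  # y (row) index of each pixel
  mask <- (xs - x)^2 + (ys - y)^2 <= radius^2         # circular mask centered on (x, y)
  mean(feature_map[mask])
}

# e.g., average meaning map value around a coordinate at (512, 384):
# window_mean(meaning_map, x = 512, y = 384)
```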

Fig. 4

Visualization of analysis approach. (a) Real-world scene. (b) Scene overlaid with fixated (yellow) and randomly sampled (cyan) location coordinates. Circles illustrate the mask radius used to compute average feature map values around each fixated (yellow) or sampled (cyan) coordinate. (c) Center proximity map. (d–f) Meaning (d), grasp (e), and interact (f) maps for the scene shown in (a)

A logistic mixed-effects model was constructed for each experiment’s data using the ‘glmer’ function of the ‘lme4’ package in R (Bates, Mächler, Bolker, & Walker, 2015; R Core Team, 2021). Each model was maximally specified to include fixed effects of center proximity, meaning, grasp, and interact, as well as all interactions among them. Random intercepts and random slopes corresponding to the fixed effects and their interactions were included in both random effect structures. To facilitate model convergence, all predictors were centered and scaled using the ‘scale’ function in base R prior to analysis, and random slopes and intercepts were uncorrelated. All models used the default optimizer (bobyqa). Random effects were included for subjects and items (scenes). Because the data sets are large, the maximum number of model iterations was increased to 100,000.
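
For concreteness, the R sketch below shows a model specification of the kind described above. It is a sketch rather than the exact analysis script: ‘d’ and its column names (fixated, cprox, meaning, grasp, interact, subject, scene) are placeholders for the analysis data frame, and the double-bar syntax requests uncorrelated random intercepts and slopes.

```r
library(lme4)

# Center and scale the predictors, then fit the maximal logistic mixed-effects model
# (sketch; 'd' and its column names are placeholders for the analysis data frame).
d[, c("cprox", "meaning", "grasp", "interact")] <-
  scale(d[, c("cprox", "meaning", "grasp", "interact")])

m <- glmer(
  fixated ~ cprox * meaning * grasp * interact +
    (cprox * meaning * grasp * interact || subject) +  # uncorrelated random slopes by subject
    (cprox * meaning * grasp * interact || scene),     # uncorrelated random slopes by scene (item)
  data = d, family = binomial,
  control = glmerControl(optimizer = "bobyqa", optCtrl = list(maxfun = 1e5))
)
summary(m)
```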

Results

Experiment 1

In Experiment 1, 30 subjects were asked to describe actions the average person could carry out in each of 30 real-world scenes. We predicted that object affordances as captured by interact and grasp maps would predict regions in the scene that were selected for attention as subjects described possible actions, because objects that can be interacted with are task-relevant.

Locations in the scene that were fixated were more informative on average (M = 3.05, SD = 0.76) than randomly sampled locations (M = 2.62, SD = 0.77) (Fig. 5a). Locations that were fixated also had higher grasp map values on average (M = 3.26, SD = 0.86) than randomly sampled locations that were not (M = 2.91, SD = 0.88). Interact map values were also higher on average for locations that were fixated (M = 3.16, SD = 0.89) than those that were not fixated (M = 2.74, SD = 0.90). Consistent with center bias, fixated locations had higher center proximity on average (M = 0.60, SD = 0.95) than randomly sampled locations that were not fixated (M = − 0.28, SD = 0.89).

Fig. 5

Hybrid violin and box plots. Data for each of the three experiments is shown on separate rows. In each row, the left panel shows center proximity (green) both for sampled coordinates that were not fixated and for fixated locations (x-axis). The right panel shows the average grasp map values around the image coordinate (yellow-orange), average interact map values (violet), and average meaning map values (red), shown separately for locations that were randomly sampled and fixated locations (x-axis). Center proximity values and map values reflect z-values and Likert ratings (1–6), respectively. White points superimposed over the violins indicate the grand mean. On the box plots to the left of each violin, black horizontal lines correspond to the median, colored boxes indicate the 25% and 75% quartile boundaries, and black vertical lines show ± 1.5 IQR (the interquartile range)

Consistent with the hypothesis that object affordances broadly influence visual attention, there was a simple main effect of interact such that subjects were more likely to fixate locations that had higher interact map values (β = 0.48, z = 4.40, p < .0001) (Table 2). Counter to our predictions, there was no simple main effect of meaning (β = 0.10, z = 1.14, p = 0.26). There was a reliable interaction between grasp and interact such that locations in the scene that had low interact map values were more likely to be fixated if they had high grasp map values (β = − 0.17, z = − 2.29, p = .02). The model revealed a simple main effect of center proximity such that subjects were more likely to fixate locations near the center of the image (β = 0.82, z = 13.67, p < .0001), consistent with center bias (Tatler, 2007). There was a reliable interaction between center proximity and meaning such that locations further from the screen center were more likely to be fixated if they had higher meaning map values (β = − 0.15, z = − 2.88, p = .004), and an opposite reliable interaction between center proximity and grasp such that locations further from the center of the scene were less likely to be fixated if they had high grasp map values (β = 0.17, z = 2.69, p = .007) (Fig. 6). Finally, there was a marginal interaction between interact and center proximity such that regions of the scene in the periphery were marginally more likely to be fixated if they had high interact map values (β = − 0.13, z = − 1.91, p = .06). No other predictors were significant.

Table 2 Experiment 1 logistic mixed-effects model output
Fig. 6

Estimated fixation probability (y-axis) for marginal and significant interactions for z-scored predictors in Experiment 1. Shaded regions indicate 95% confidence intervals. The top row shows the interactions of grasp (x-axis) with interact (lines) and meaning (x-axis) with center proximity (lines). The bottom row shows interactions of grasp (x-axis) and center proximity (lines) and interact (x-axis) and center proximity (lines)

In sum, interact map values predicted fixated locations in the scene better than meaning or grasp map values did; the latter were influential only in interactions, which revealed that observers deviated from the center of the image to fixate locations that were more informative or interactable, but not highly graspable. As expected, fixated locations were closer to the center of the image, reflecting center bias.

Experiment 2

In Experiment 2, we again asked 40 subjects to describe actions the average person could carry out in each of 20 real-world scenes, and once again we anticipated object affordances (as captured by grasp and interact maps) would predict the regions that were fixated in the scene.

As in Experiment 1, fixated locations had higher average meaning map values (M = 2.87, SD = 0.70) than randomly sampled locations that were not fixated (M = 2.51, SD = 0.73), and higher grasp map values (M = 3.52, SD = 0.79) than randomly sampled locations (M = 3.17, SD = 0.84). Consistent with our hypothesis and the results of Experiment 1, fixated locations in the scene had higher interact map values (M = 3.23, SD = 0.81) than those that were not fixated (M = 2.86, SD = 0.83). Finally, fixated locations were closer to the center of the image on average (M = 0.54, SD = 0.94) than locations that were sampled from parts of the scene that were not fixated (M = − 0.27, SD = 0.90).

Consistent with Experiment 1, there was a simple main effect of interact such that subjects were more likely to fixate locations that had higher interact map values (β = 0.38, z = 2.67, p = 0.008) (Table 3). There was a reliable interaction between meaning and grasp such that locations with high meaning values were more likely to be fixated if they also had high grasp map values (β = 0.23, z = 2.63, p = 0.009) (Fig. 7). The model revealed a simple main effect of center proximity reflecting center bias (β = 1.15, z = 8.47, p < .0001). There was a marginal interaction between meaning and center proximity such that locations further from the center of the image were marginally more likely to be fixated if they were informative (β = − 0.19, z = − 1.80, p = 0.07), and a marginal three-way interaction between grasp, interact, and center proximity such that regions close to the center of the image were marginally more likely to be fixated when they had low interact map values if they had high grasp map values (β = − 0.21, z = − 1.88, p = 0.06). No other predictors were significant.

Table 3 Experiment 2 logistic mixed-effects model output
Fig. 7

Estimated fixation probability (y-axis) for marginal and significant interactions for z-scored predictors in Experiment 2. Shaded regions indicate 95% confidence intervals. The top row shows 2-way interactions: meaning (x-axis) with center proximity (lines) and grasp (x-axis) with meaning (lines). The bottom row shows the 3-way interaction between grasp (x-axis), interact (lines), and center proximity (facets)

Consistent with Experiment 1, interact map values predicted fixated locations in the scene, and meaning and grasp did not predict fixated locations well independently, though each had some influence in interactions. There was a significant effect of center bias such that fixated locations were closer to the center of the image, and there was a marginal interaction between center proximity, interact, and grasp such that regions in the center of the image that were low in interactability were more likely to be fixated if they were high in graspability.

Experiment 3

In Experiment 3, we asked 40 subjects to describe actions that they personally would carry out in each of 20 real-world scenes, which depicted reachable spaces (Josephs & Konkle, 2020). We anticipated object affordances (as captured by grasp and interact maps) might predict the regions that were fixated in the scene more strongly than in the first two experiments because the task instruction was personalized and the scenes depicted spaces that afford object interactions particularly well.

Fixated locations again had higher average meaning map values (M = 2.90, SD = 0.71) than sampled locations did (M = 2.51, SD = 0.75). Grasp map values were also higher on average for fixated locations (M = 3.40, SD = 0.77) than sampled locations (M = 2.93, SD = 0.88). Finally, fixated locations in the scene again had higher interact map values (M = 3.53, SD = 0.74) than randomly sampled locations that were not fixated did (M = 3.07, SD = 0.85). Once again, fixated locations were closer to the center of the image (M = 0.57, SD = 0.91) on average than randomly sampled locations were (M = − 0.31, SD = 0.90).

Consistent with the previous two experiments, there was a simple main effect of interact such that subjects were more likely to fixate locations that had higher interact map values (β = 0.22, z = 2.53, p = 0.01) (Table 4). There was a marginal effect of grasp such that regions were marginally more likely to be fixated when they had higher grasp map values (β = 0.15, z = 1.67, p = 0.095). The model revealed a simple main effect of center proximity reflecting center bias (β = 1.02, z = 14.52, p < .0001). There was a reliable interaction between center proximity, meaning, and interact such that locations further from the center of the image were more likely to be fixated when both interact and meaning map values were high (β = − 0.15, z = − 2.36, p = 0.02), and a reliable interaction between center proximity, grasp, and interact such that locations in the periphery with low interact map values were more likely to be fixated if they had high grasp map values (Fig. 8). No other predictors were significant.

Table 4 Experiment 3 logistic mixed-effects model output
Fig. 8

Estimated fixation probability (y-axis) for significant 3-way interaction between z-scored predictors in Experiment 3: in the top row, grasp (x-axis), interact (lines), and center proximity (facets); in the bottom row, meaning (x-axis), interact (lines), and center proximity (facets). Shaded regions indicate 95% confidence intervals

In Experiment 3, interact map values again predicted fixated locations in the scene better than meaning or grasp, though grasp was a marginal independent predictor. There was again a reliable center bias on fixated locations, and there were reliable interactions between center proximity, meaning, and interact map values and center proximity, grasp, and interact map values.

Experiment 4

To determine whether the finding that interactability predicts fixated locations generalizes to a description task for which object interactions are less task-relevant, we applied the analysis performed on the action description tasks (Experiments 1–3) to fixation data from an open-ended description task (Henderson et al., 2018) that used the same 30 scenes presented in Experiment 1.

If object interactions guide attention in scenes even when actions are less task-relevant, we anticipate that the analysis will show a strong predictive relationship between interact map values and fixated locations; however, if interact map values predicted well in Experiments 1–3 because object interactions were highly task-relevant—but not generally more important than general informativeness for visual attention—we expect meaning map values to predict fixated locations better than interact map values.

The analysis was identical to that of Experiments 1–3, with the following exception: the maximal model produced a singular fit, therefore the random slope that accounted for negligible variance (an interaction between center proximity, meaning, and grasp in the subject random effect) was pruned from the model (following Barr, Levy, Scheepers, & Tily, 2013). The resulting model converged without error.

When subjects described scenes however they liked, the average meaning map values were higher for fixated locations (M = 3.30, SD = 0.68) than sampled locations (M = 2.51, SD = 0.73) (Fig. 9). Grasp map values were also higher on average for fixated (M = 3.48, SD = 0.83) as opposed to sampled locations (M = 2.81, SD = 0.85). Consistent with the action description experiments, fixated locations in the scene also had higher interact map values (M = 3.35, SD = 0.88) than randomly sampled locations that were not fixated (M = 2.66, SD = 0.86). Finally, fixated locations were, on average, closer to the center of the image (M = 0.53, SD = 0.93) than randomly sampled locations were (M = − 0.29, SD = 0.90).

Fig. 9

Hybrid violin and box plots for predictors in Experiment 4. The left panel shows center proximity (green) both for sampled coordinates that were not fixated and for fixated locations (x-axis). The right panel shows the average grasp map values around the image coordinate (yellow-orange), average interact map values (violet), and average meaning map values (red), shown separately for locations that were randomly sampled and fixated locations (x-axis). Center proximity values and map values reflect z-values and Likert ratings (1–6), respectively. White points superimposed over the violins indicate the grand mean. On the box plots to the left of each violin, black horizontal lines correspond to the median, colored boxes indicate the 25% and 75% quartile boundaries, and black vertical lines show ± 1.5 IQR (the interquartile range)

As in the action description experiments, in the open-ended scene description task there was a simple main effect of interact such that subjects were more likely to fixate locations that had higher interact map values (β = 0.41, z = 2.66, p = 0.008) (Table 5). Counter to the action description tasks, there was a simple main effect of meaning such that subjects were more likely to fixate locations with higher meaning map values (β = 0.90, z = 7.70, p < .0001). As expected, the model revealed a simple main effect of center proximity reflecting center bias (β = 0.44, z = 4.08, p < .0001). There was a marginal interaction between center proximity, meaning, and grasp such that locations near the center of the scene were marginally more likely to be fixated if they had high grasp and meaning map values (β = .21, z = 1.93, p = 0.05), and there was a reliable 4-way interaction between center proximity, meaning, grasp, and interact such that locations in the periphery of the scene that had high meaning map values, but lower grasp and interact map values, were more likely to be fixated (β = − 0.20, z = − 2.63, p = 0.009) (Fig. 10). No other predictors were significant.

Table 5 Experiment 4 logistic mixed-effects model output
Fig. 10

Estimated fixation probability (y-axis) for interactions between z-scored predictors in Experiment 4. The top figure illustrates a marginal three-way interaction between meaning (x-axis), grasp (lines) and center proximity (columns). The bottom figure visualizes a reliable four-way interaction between meaning (x-axis), grasp (lines), interact (rows) and center proximity (columns). Shaded regions indicate 95% confidence intervals

In Experiment 4, both interact and meaning map values predicted fixated locations in the scene, whereas grasp map values did not, and there was again a reliable center bias on fixated locations.

Experiment 5

To determine whether interactability predicts fixated locations well in a task that does not encourage the viewer to think about objects in the scenes and how they would interact with those objects, we applied the analysis performed in Experiments 1–4 to fixation data from a scene memorization task (Rehrig et al., 2020a) that used the same 30 scenes presented in Experiments 1 and 4. In Experiment 5, 30 subjects memorized 30 real-world scenes for a period of 12 s each in preparation for a later recognition memory task. Following Rehrig et al. (2020a), we expect general informativeness to predict fixated locations well. If the strong predictive relationship between object interactability and attention observed in Experiments 1–4 generalizes beyond language tasks, we additionally expect interact map values to predict fixated locations.

When subjects studied scenes for a later memory test, the average meaning map values were higher for fixated locations (M = 3.34, SD = 0.69) than for randomly sampled locations that had not been fixated (M = 2.60, SD = 0.75) (Fig. 11). Grasp map values were also higher on average for fixated (M = 3.50, SD = 0.83) as opposed to sampled locations (M = 2.92, SD = 0.87). Consistent with the scene description experiments, fixated locations in the scene also had higher interact map values (M = 3.31, SD = 0.91) than randomly sampled locations that were not fixated (M = 2.78, SD = 0.89). Finally, fixated locations were, on average, closer to the center of the image (M = 0.49, SD = 1.02) than randomly sampled locations were (M = − 0.14, SD = 0.94).

Fig. 11

Hybrid violin and box plots for predictors in Experiment 5. The left panel shows center proximity (green) both for sampled coordinates that were not fixated and for fixated locations (x-axis). The right panel shows the average grasp map values around the image coordinate (yellow-orange), average interact map values (violet), and average meaning map values (red), shown separately for locations that were randomly sampled and fixated locations (x-axis). Center proximity values and map values reflect z-values and Likert ratings (1–6), respectively. White points superimposed over the violins indicate the grand mean. On the box plots to the left of each violin, black horizontal lines correspond to the median, colored boxes indicate the 25% and 75% quartile boundaries, and black vertical lines show ± 1.5 IQR (the interquartile range)

Unlike Experiments 1–3, but consistent with Experiment 4, in the scene memorization task there was a simple main effect of meaning: Subjects were more likely to fixate locations that had higher meaning map values (β = 1.49, z = 11.70, p < .0001) (Table 6). There was a reliable interaction between meaning and grasp such that locations that had low meaning map values were more likely to be fixated when the corresponding grasp map values were high (β = − 0.23, z = − 2.05, p = 0.04) (Fig. 12). Consistent with all other experiments, the model revealed a simple main effect of center proximity reflecting center bias (β = 0.25, z = 2.23, p = 0.03). No other predictors were significant.

Table 6 Experiment 5 logistic mixed-effects model output
Fig. 12

Estimated fixation probability (y-axis) for significant interaction between z-scored predictors in Experiment 5: grasp (x-axis) and meaning (lines). Shaded regions indicate 95% confidence intervals

In stark contrast to the previous experiments, of the three feature maps used, meaning map values were the only reliable independent predictor of fixated locations in Experiment 5, though there was a reliable interaction between meaning and grasp map values. Consistent with all of the previous experiments, there was a reliable effect of center bias on fixated locations.

General discussion

In four of the five data sets used in the current analysis, fixated locations were predicted by interact map values such that locations that were highly interactable were more likely to be fixated, consistent with the prediction that interactability could rival general informativeness in predicting overt visual attention, which follows from the hypothesis that object affordances influence attention in scenes. However, interact map values predicted fixated locations only for the description tasks (Experiments 1–4), and failed to predict fixated locations when the task did not have an explicit language component (Experiment 5). When the task was not to describe the scene (scene memorization), only meaning map values predicted which locations in the scene were fixated. Partially consistent with Rehrig et al. (2020b) and with our predictions, higher grasp map values marginally predicted fixated locations only when scenes depicted reachable spaces (Experiment 3); otherwise, grasp contributed to reliable interactions in all experiments. Counter to our predictions, meaning map values were not a significant predictor as a simple main effect in any of the action description experiments; however, general informativeness was influential in tasks for which object interactions were less task-relevant (Experiments 4 and 5).

Our findings for Experiments 1–4 suggest that object affordances broadly defined (as captured by interact maps) predict locations prioritized for visual attention in scenes during description tasks; however, in a task that did not encourage the viewer to think about objects in the scene or their interactions (Experiment 5), affordances as operationalized in the current study did not predict fixated locations. The aforementioned findings are difficult to reconcile with those in the literature that show an influence of object affordances on attention in visual search tasks (Castelhano & Witherspoon, 2016; Gomez et al., 2018; Gomez & Snow, 2017). One possible explanation put forth by Rehrig et al. (2020b) is that prior work demonstrating a role of object affordances on attention in more traditional visual attention experiments (such as visual search; Castelhano & Witherspoon, 2016; Gomez et al., 2018; Gomez & Snow, 2017) may have been driven by other object-related information (such as recognizability or informativeness) that was better captured by informativeness than by general affordances in the current study. However, it might also be the case that the 2-dimensional nature of the task used in the current study was unable to speak to the role of object affordances in guiding attention to physically present objects, as demonstrated by Gomez et al. (2018). We leave the challenging task of investigating whether attentional guidance is better explained by general informativeness or by object affordances for 3-dimensional or physically present objects to future work.

The influence of affordances on attention in the description tasks (Experiments 1–4) is consistent with literature implicating object affordances in language processing broadly (Borghi, 2012; Borghi & Riggio, 2009; Feven-Parsons & Goslin, 2018; Glenberg et al., 2009; Glenberg & Kaschak, 2002; Grafton, Fadiga, Arbib, & Rizzolatti, 1997; Harpaintner, Sim, Trumpp, Ulrich, & Kiefer, 2020; Kaschak & Glenberg, 2000; Martin, 2007). Neuroimaging studies have revealed motor activation associated with object-related cognitive processes (Martin, 2007), and specifically with language processes such as silently naming an object (Grafton et al., 1997) or making lexical decisions about action words (Harpaintner et al., 2020). A priming study showed that an object’s semantics are not prioritized over its affordances when processing object names (Feven-Parsons & Goslin, 2018). Evidence from language comprehension suggests that we interpret sentences through human action (Glenberg et al., 2009; Glenberg & Kaschak, 2002; Kaschak & Glenberg, 2000): for example, an object’s affordances can facilitate detection of the object’s name in sentences (Borghi, 2012; Borghi & Riggio, 2009). Studies of language-mediated visual attention suggest that, while listening to speech presented concurrently with scene viewing, observers attend to objects in a scene with affordances that are compatible with those of the events or objects mentioned (Altmann & Kamide, 1999; Chambers, Tanenhaus, Eberhard, Filip, & Carlson, 2002; Chambers, Tanenhaus, & Magnuson, 2004; Kako & Trueswell, 2000; Kamide, Altmann, & Haywood, 2003), particularly when object affordances are task-relevant (Salverda, Brown, & Tanenhaus, 2011). Altmann and Kamide (2007) argued that processes of language comprehension activate conceptual representations associated with a referent (such as an object or event), and, in turn, visual attention seeks out objects in the scene that have compatible object affordances. The results of the current study are compatible with the idea that the mediating effects of language on visual attention described by Altmann and Kamide (2007) may extend to language production tasks, and to the allocation of attention in real-world scenes. However, the current study cannot differentiate between the possibility that object affordances influenced attention in Experiments 1–4 because description tasks engage the language system, or because description tasks encourage the observer to think about objects in the scene and the interactions they afford more carefully than other tasks would. Future work will be needed to determine which of the two possibilities best explains the observed relationship between object affordances and visual attention.

It is worth noting that Experiments 1, 4, and 5 used the same stimuli and maps, and tested the same number of subjects, yet the influence of the types of semantic information we quantified in the current study (informativeness, graspability, and interactability) on attention differed across them. We attribute the difference to the observer’s task, which changed across the three experiments. When object affordances were most task-relevant—as observers described the potential actions available to them in a scene—interactability predicted attended locations better than meaning or graspability (Experiment 1), but when observers simply described what they saw, informativeness and interactability both guided attention (Experiment 4). Finally, when object affordances were least task-relevant, informativeness influenced visual attention and interactability did not (Experiment 5). The dependence of the findings on task instruction supports the idea that object affordances exert a greater influence on cognition when they are task-relevant (Ostarek & Huettig, 2019).

In the previous analysis using much of the same data, grasping affordances explained variation in fixation density well only when the scenes depicted reachable spaces, which led us to conclude that affordances guide attention for stimuli that are especially conducive to acting on the environment (Rehrig et al., 2020b). In contrast, the present analysis revealed that object affordances broadly, as captured by interact maps, predicted fixated locations during the same action description experiments even in scenes that were less clearly conducive to interaction, including the scenes for which graspability did not explain attention well in the prior study. Consistent with Rehrig et al. (2020b), grasping affordances—as captured by grasp maps—marginally predicted fixated locations when scenes depicted reachable spaces. Comparing the results of the current analysis with those reported in Rehrig et al. (2020b), we conclude that grasping affordances influence attention only when objects would be within reach (and thus conducive to grasping), even though possible grasping interactions were task-relevant in all scenes, whereas object affordances more broadly exert a strong influence on attention whenever the possible actions in the scene are relevant to the speaker’s goals. These findings suggest that the possible grasping actions in an environment are relevant to observers only when the object is within reach and thus could readily be acted upon; an alternative explanation, however, is that any highly constrained, specific affordance—be it grasping, lifting, sitting, etc.—would underperform in our model relative to a representation that captures a wide range of possible interactions with the environment. It is further possible that graspability would perform as well as, or perhaps better than, interactability in a task for which grasping interactions specifically were highly task-relevant—for example, if observers were asked to study a scene in order to plan to sanitize the objects in it, or to pack the items in the room for a move. We leave investigation of the latter possibility to future work.

Our findings further illustrate the flexibility and utility of the mapping procedure, originally developed to construct meaning maps, for capturing different types of semantic information in scenes (Henderson, Hayes, Peacock, & Rehrig, 2021). The meaning, grasp, and interact maps used in the current study are all derived primarily from stored semantic representations of objects and scene categories that comprise semantic knowledge for scenes. Although each map taps semantic representations in a similar way, and the maps are correlated with one another, the fact that the maps differed in their ability to predict fixated locations across tasks indicates that they tapped dissociable forms of semantic information, which in turn suggests that raters in the crowd-sourcing task were sensitive to variation in the rating instructions and followed them diligently.
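To illustrate the mapping procedure schematically, the sketch below shows one way crowd-sourced patch ratings could, in principle, be aggregated into a smooth rating map and related to a fixation density map. This is a minimal illustration rather than the pipeline used in the current study: the function names (build_rating_map, map_fixation_correlation), the uniform square patches, the simple averaging with Gaussian smoothing, the 1–6 rating scale, and the toy data are all assumptions made for exposition.

import numpy as np
from scipy.ndimage import gaussian_filter
from scipy.stats import pearsonr

def build_rating_map(ratings, centers, patch_radius, scene_shape):
    # Average per-patch ratings into a pixel-level map; each rating covers a
    # square region of +/- patch_radius pixels around its patch center.
    value_sum = np.zeros(scene_shape, dtype=float)
    count = np.zeros(scene_shape, dtype=float)
    for rating, (row, col) in zip(ratings, centers):
        r0, r1 = max(row - patch_radius, 0), min(row + patch_radius, scene_shape[0])
        c0, c1 = max(col - patch_radius, 0), min(col + patch_radius, scene_shape[1])
        value_sum[r0:r1, c0:c1] += rating
        count[r0:r1, c0:c1] += 1.0
    rating_map = np.divide(value_sum, count,
                           out=np.zeros_like(value_sum), where=count > 0)
    # Smooth so that overlapping patches blend into a continuous map.
    return gaussian_filter(rating_map, sigma=patch_radius / 2)

def map_fixation_correlation(rating_map, fixation_density):
    # Correlate the rating map with an observed fixation density map.
    r, _ = pearsonr(rating_map.ravel(), fixation_density.ravel())
    return r

# Toy usage with random data standing in for real ratings and fixations.
rng = np.random.default_rng(0)
centers = [(row, col) for row in range(40, 600, 80) for col in range(40, 800, 80)]
ratings = rng.uniform(1, 6, size=len(centers))  # hypothetical 1-6 ratings
interact_map = build_rating_map(ratings, centers, patch_radius=40,
                                scene_shape=(600, 800))
fixation_density = gaussian_filter(rng.random((600, 800)), sigma=30)
print(map_fixation_correlation(interact_map, fixation_density))

Under a scheme like this, meaning, grasp, and interact maps would differ only in the rating question posed to raters; the aggregation itself is identical, which is what makes the procedure flexible across types of semantic information.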

Conclusion

The current analysis investigated what type of semantic information guides attention in a scene. We conducted a novel analysis of existing data sets (Henderson et al., 2018; Rehrig et al., 2020a, 2020b) and determined which of three forms of semantic information best accounted for overt visual attention: (1) general informativeness, the informativeness or recognizability of scene regions; (2) graspability, the degree to which what is shown in a region can be grasped; and (3) interactability, the degree to which a scene region depicts objects that can be interacted with in any way. Of the forms of semantic information tested, interactability was the strongest predictor of the locations speakers fixated across three action description experiments, suggesting that the actions objects in a scene afford exert a strong influence on attention during action descriptions, to a greater degree than the results originally reported in Rehrig et al. (2020b) suggested. When speakers described scenes however they liked (Henderson et al., 2018), both interactability and informativeness predicted fixated locations; however, only informativeness predicted fixated locations when the task had no explicit language component (scene memorization; Rehrig et al., 2020a). Consistent with Altmann and Kamide (2007), the results suggest that object affordances guide attention when the language system is engaged—to a greater degree than informativeness does, at least when affordances are especially task-relevant (Experiments 1–3; consistent with Salverda et al., 2011)—whereas informativeness guides attention when the task does not encourage observers to carefully consider the objects in the scene and, by extension, the interactions those objects afford. More generally, the finding that different semantic aspects of a scene influence the allocation of visual attention differently depending on the viewer’s task offers additional, compelling evidence for the cognitive guidance theory of eye movement control (Henderson et al., 2007).