In this article, our general concern is with the time-honored distinction between the perception of form and material that was so elegantly revealed phenomenologically in Goldmeier’s classic studies (1936/1972; see also Kimchi & Palmer, 1982; Klein & Barresi, 1985). Our general objective is to explore the relative roles of central and peripheral vision in the processing of form and material. Conventional wisdom tells us that global information about forms and their spatial arrangement is preferentially processed via peripheral vision, while local information about the material making up forms is preferentially processed via central vision. As is illustrated in this quotation from Livingstone, much of the thinking about this division of labor is rooted in our knowledge of the density and nature of visual receptors in the retina and their projection pathways in the central nervous system:

The fact that our vision has the highest acuity in the center of gaze does not mean that vision in the rest of the visual field is inferior—it’s just used for different things. Foveal vision is used for scrutinizing highly detailed objects or surfaces, whereas peripheral vision is used for organizing the spatial scene, for seeing large objects, and for detecting areas to which we should direct our foveal vision. Our foveal vision is optimized for fine details, and our peripheral vision is optimized for coarser information. (Livingstone, 2002, pp. 68–69)

Our specific plan for achieving this objective was suggested by Rayner and Bertera’s (1979) article entitled “Reading Without a Fovea.” The method uses eye-monitoring equipment that is fast and accurate enough to block out either central vision or everything but central vision while an observer visually explores a scene. The scene is presented on a display controlled by a computer with rapid access to information about eye position. For such an experiment, we believe that conventional wisdom would predict that eliminating peripheral vision should impair the perception of form more than the perception of material, whereas eliminating central vision should impair the perception of material more than the perception of form.

Before we describe our project, it is useful to briefly compare it to projects that have posed a similar question or used a similar method. The two most common techniques have been to control whether a briefly presented stimulus is presented to central or peripheral vision or to use some form of gaze-contingent control (as in the Rayner & Bertera, 1979, study), which overcomes the need to limit exposure duration in order to prevent eye movements.

Following the lead of Rayner and Bertera (1979), several studies have explored the effect on perception (see Footnote 1) of completely masking (or eliminating) central or peripheral vision, using fast and accurate eye-monitoring equipment to implement gaze-contingent control of the display. In these studies, a variety of terms have been used to label the conditions in which the display was limited to central or peripheral vision. We present these in Table 1, in order to alert readers to the equivalent meanings of different terms. Here we will simply use the term “mask,” with the adjectives “central” and “peripheral” referring to the region that was not displayed.

Table 1 Studies cited in this article that have used central or peripheral masks, or both, in visual perception experiments, with the terms they used to label these conditions

Most of these studies have been about visual search for prespecified targets in multielement displays (Bertera & Rayner, 2000; Cornelissen, Bruin, & Kooijman, 2005) or in real-world scenes (Miellet, Zhou, He, Rodger, & Caldara, 2010; Nuthmann, 2014). Also exploring search, Loschky and McConkie (2002) allowed for clear central vision of real-world scenes while using different degrees of filtering to reduce the resolution of information in peripheral vision. Most of these search studies have manipulated the size of the masked region, and their results are relatively clear, consistent, and not particularly surprising: Performance with very small masked regions was very similar to that with full vision, and the larger the masked region, the worse the performance.

Larson and Loschky (2009) and Thorpe, Gegenfurtner, Fabre-Thorpe, and Bülthoff (2001) used the limited-exposure-duration method to control whether the image contents were presented exclusively to central or peripheral vision. Thorpe et al. explored the ability of their observers to indicate whether photographs of briefly presented natural scenes did or did not contain an animal. Not surprisingly, accuracy on this task decreased monotonically with increases in the eccentricity of the animal’s position, but accuracy remained above chance even at the largest eccentricity (~ 60%, with 50% being chance). Larson and Loschky presented real-world scenes for 106 ms and then asked their observers to indicate whether or not a one-word gist description of the picture was correct. They manipulated the radius of a region centered on fixation that was used to define whether central or peripheral vision was masked, by being presented as a uniform gray. When the central region was masked, the accuracy of gist decisions was as good as with full vision, so long as the radius of the masked region was 5 deg or less. When the periphery was masked, performance was worse than with full vision for all radii less than 13 deg. The findings from these two studies suggest that with exposures too brief to allow an eye movement, there is some, but not very good, processing of small objects in peripheral vision, and that the gist of a scene may be appreciated better with peripheral than with central vision.

None of these studies have generated the kind of data that we believe are needed to confirm or disconfirm conventional ideas about the division of labor between central and peripheral vision. Our approach was to apply the gaze-contingent masking of either central or peripheral vision, which has been used most frequently in studies of visual search, to “scenes” consisting of hierarchically constructed stimuli in which a global form is composed of a local material. Such hierarchically composed stimuli were perhaps first used by Asch, Ceraso, and Heimer (1960), and later by Navon (1977) and by Pomerantz and colleagues (Pomerantz & Sager, 1975; Pomerantz, Sager, & Stoever, 1977). Although their levels are often referred to as global and local, we prefer the terms form and material, described by the Gestalt psychologist Goldmeier (1936/1972). Our combination of method (gaze-contingent displays) with stimulus material would allow us to explore, for the first time, how rapidly and how accurately an observer can discern either the global form or the local material composing it with only central or only peripheral vision.

The stimuli we used (see Fig. 1) were modeled on those developed and used by Christie et al. (2012). The participant’s task was to make a speeded button press to indicate whether the stimulus contained a circle or a square. The stimulus never contained both a circle and a square. Because all four stimuli were randomly intermixed, the observer could not determine until the stimulus was presented whether its form or its material contained the target (circle/square). Christie et al. selected D as the “neutral” stimulus because (as rendered) the D is composed of half a square combined with half a circle. There were three viewing conditions that varied randomly from trial to trial: full vision, central vision masked, and peripheral vision masked (see Fig. 2). In addition, three different mask sizes were used. Finally, to minimize the possibility that participants would plan a sequence of saccades in advance in order to perform the task more efficiently, we randomly varied the location of the starting fixation.

Fig. 1

Matrix of images depicting the four types of form–material combinations that were randomly intermixed in a block of trials. (top left) a square composed of Ds; (top right) a circle composed of Ds; (lower left) a D composed of circles; (lower right) a D composed of squares (the example in Fig. 2). Not illustrated here are the four orientations of the D (normal, ± 90, 180 deg), randomly selected for each trial. When D was the material, all of the Ds had the same orientation

Fig. 2

Six of the possible displays with the same scene (a D composed of squares with a 12-deg mask). The left column shows central masks, and the right column shows peripheral masks. The rows show the three start positions. Note that in the actual display, the background was gray and the stimuli were white

Method

Participants

A total of 21 adults (nine female, 12 male; average 29 years old, ranging from 18 to 61, with two over 40) with normal or corrected-to-normal vision and varying degrees of drawing expertise were recruited through advertising materials posted on the campus of the Nova Scotia College of Art and Design University. Participants received $6/half hour for their participation.

Apparatus, stimuli, and procedure

Stimuli were presented on a 24-in. iMac. Participants sat approximately 70 cm from the screen, with head stabilization via a chin rest. Eye position was monitored using SR Research’s EyeLink 1000, a desk-mounted system with high spatial accuracy operating at 1000 Hz. At the beginning of each trial, a “drift correction” was performed. A small dot served as the fixation stimulus in one of three positions: bottom (6.5° below center), center, or top (6.5° above center). When participants were comfortably fixating the dot and ready for the trial to proceed, they pressed the [space] key. If a steady fixation was detected by the eye-monitoring software, the trial proceeded. Otherwise, the participant was alerted with a beep and repeated the key press response when ready. Immediately after a successful drift correction, the fixation dot was removed, and one of the four visual stimuli illustrated in Fig. 1 was presented until the participant had made a square/circle decision, using the “z” key for circles and the “/” key for squares. Circle and square stickers were placed on the corresponding keys. The square form subtended 10.2°. Depending on the display condition (see Fig. 2), a portion of the display centered on the participant’s point of fixation was masked. Either central vision (left column of Fig. 2) or peripheral vision (right column of Fig. 2) was masked. The masked region was updated on the basis of the x, y coordinates of the region in the scene that, according to the EyeLink 1000, was fixated. To avoid sharp contours, the border separating the visible from the invisible region of the scene was Gaussian blurred. On masked trials, three different sizes of the masked (central mask) or the visible (peripheral mask) region were used: 8, 12, and 16 deg in diameter, respectively, for the small, medium, and large circles (or 50, 113, and 201 deg² in area).
The sequence of trials, sequence of events on a trial, real-time masking (when employed), and recording of the participant’s responses were controlled by a program written in Python.
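
The display geometry described above can be checked with a few lines of Python. The following is a minimal sketch, not the authors' program: it recomputes the circle areas from the stated diameters, converts a visual angle to a physical extent at the reported 70-cm viewing distance, and applies a simplified visibility rule. The function names are hypothetical, and the visibility rule assumes a hard edge, whereas the actual display used a Gaussian-blurred border.

```python
import math

VIEWING_DISTANCE_CM = 70.0  # approximate viewing distance reported in the text


def circle_area_deg2(diameter_deg):
    """Area (in squared degrees) of a circular region of the given diameter."""
    return math.pi * (diameter_deg / 2.0) ** 2


def visual_angle_to_cm(angle_deg, distance_cm=VIEWING_DISTANCE_CM):
    """Extent on a flat screen that subtends angle_deg, centered on the line of sight."""
    return 2.0 * distance_cm * math.tan(math.radians(angle_deg) / 2.0)


def is_visible(point_deg, gaze_deg, diameter_deg, mask_type):
    """Hard-edged visibility rule (assumption: the Gaussian blur is ignored).

    point_deg and gaze_deg are (x, y) positions in degrees of visual angle.
    A central mask hides the circle around gaze; a peripheral mask hides
    everything outside that circle.
    """
    distance = math.hypot(point_deg[0] - gaze_deg[0], point_deg[1] - gaze_deg[1])
    inside_circle = distance <= diameter_deg / 2.0
    if mask_type == "central":
        return not inside_circle
    if mask_type == "peripheral":
        return inside_circle
    if mask_type == "none":
        return True
    raise ValueError("unknown mask type: " + repr(mask_type))


# Rounded areas for the small, medium, and large circles (8, 12, 16 deg).
areas = {d: round(circle_area_deg2(d)) for d in (8, 12, 16)}
square_width_cm = visual_angle_to_cm(10.2)  # physical width of the square form
```

Under these assumptions, the rounded areas reproduce the 50, 113, and 201 deg² given in the text, and the 10.2° square form corresponds to roughly 12.5 cm on the screen.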

Design

Each participant began a block of 288 trials in which the following variables were factorially combined: (1) display condition: no mask, central mask, or peripheral mask; (2) size of masked/visible region: small, medium, or large (this was an unanalyzed dummy variable for the no-mask condition); (3) target shape: circle or square; and (4) target level: global form or local material. Note that Variables 3 and 4 combined are exemplified in Fig. 1. This yielded (3×3×2×2) = 36 “cells.” There were eight repetitions of each of these possibilities, creating a block of 288 trials. For each trial in this block, two features were selected randomly: starting fixation (central, lower, upper) and orientation of the neutral D (whether in form or material). Most participants completed the entire block; a few did not, but none of them completed fewer than 198 trials, which we considered a sufficient minimum to be included in the analyses.
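
The factorial structure of the block can be made concrete with a short sketch. This is an illustration under stated assumptions, not the authors' code: the variable names, the dictionary layout, the random seed, and the shuffling of trial order are all hypothetical, and the four D orientations are encoded here as 0, 90, -90, and 180 deg.

```python
import itertools
import random

DISPLAY_CONDITIONS = ("no mask", "central mask", "peripheral mask")
REGION_SIZES = ("small", "medium", "large")  # dummy variable on no-mask trials
TARGET_SHAPES = ("circle", "square")
TARGET_LEVELS = ("form", "material")
REPETITIONS = 8

# Features drawn at random on every trial rather than crossed factorially.
START_POSITIONS = ("central", "lower", "upper")
D_ORIENTATIONS_DEG = (0, 90, -90, 180)


def build_block(rng):
    """Return a shuffled list of trial dictionaries for one block."""
    cells = itertools.product(
        DISPLAY_CONDITIONS, REGION_SIZES, TARGET_SHAPES, TARGET_LEVELS
    )
    trials = []
    for display, size, shape, level in cells:
        for _ in range(REPETITIONS):
            trials.append({
                "display": display,
                "size": size,
                "shape": shape,
                "level": level,
                "start": rng.choice(START_POSITIONS),
                "d_orientation_deg": rng.choice(D_ORIENTATIONS_DEG),
            })
    rng.shuffle(trials)  # assumption: trial order fully randomized
    return trials


block = build_block(random.Random(0))  # fixed seed only for reproducibility
```

Crossing the four factors gives 3 × 3 × 2 × 2 = 36 cells, and eight repetitions of each cell give the 288 trials described above.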

Methods of analysis

Trials that timed out without a response (0.9%) were excluded. On the basis of accuracy-versus-reaction-time (RT) plots, trials with RTs less than 400 ms (two trials) and trials with RTs greater than 2.2 s (1.7%) were excluded from the analysis, because they reflected failures to follow instructions, through either anticipation or inattention to the task. RTs and accuracies were subjected to likelihood ratio (LR) tests of multilevel models in a Type II analysis of variance (ANOVA) fashion. These tests for a specific factor in our factorial design can be interpreted similarly to how a common Type II ANOVA would be interpreted. More importantly, using multilevel models allowed us to use logistic regression for the error responses. This type of modeling substantially ameliorates the issues with scaling as error rates approach 0 and, commensurately, variances fall at an accelerating rate. Multilevel modeling consequently allows for more trustworthy interactions than do analyses of proportions correct (Dixon, 2008; Jaeger, 2008). The factors in our analyses were level of the target (form or material), mask size, and mask type. The LR value represents how much more likely a model containing the factor is than one that does not. For example, if a model comparison that differs by one factor yielded an LR of 25, then that would mean the model containing the factor was 25 times more likely to be the best model in the comparison. For those who insist on interpreting only by means of significance tests, the p values for all LRs in a statistical test will be provided and can be interpreted like the p values in an ANOVA. Note that the mask size was treated categorically in the analyses and reflected the categories in Figs. 3 and 4.
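
The exclusion rule and the interpretation of an LR can be sketched as follows. This is a hedged illustration, not the analysis code used in the study: the function names are hypothetical, timed-out trials are represented here by `None`, and the likelihood ratio is computed as the plain ratio of two maximized likelihoods. The text does not say whether the reported LRs were adjusted for model complexity, so no such adjustment is applied.

```python
import math

RT_MIN_S = 0.400  # trials faster than this were treated as anticipations
RT_MAX_S = 2.2    # trials slower than this were treated as inattention


def keep_trial(rt_s):
    """Return True if a trial survives the exclusion criteria.

    rt_s is the reaction time in seconds, or None for a timed-out trial
    (assumption: this encoding of timeouts is ours, not the authors').
    """
    return rt_s is not None and RT_MIN_S <= rt_s <= RT_MAX_S


def likelihood_ratio(loglik_with_factor, loglik_without_factor):
    """Unadjusted likelihood ratio of two models from their log-likelihoods."""
    return math.exp(loglik_with_factor - loglik_without_factor)


# Hypothetical reaction times in seconds, for illustration only.
example_rts = [0.35, 0.82, None, 1.40, 2.50]
retained = [rt for rt in example_rts if keep_trial(rt)]
```

With these made-up values, only the 0.82-s and 1.40-s trials are retained. A difference of ln(25) between the two log-likelihoods corresponds to the LR of 25 used as an example in the text.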

Fig. 3

Reaction times as a function of mask area and target level (solid lines for form and dashed lines for material) for central (left) and peripheral (right) masks. Flat lines represent performance from the full-vision (no-mask) trials. Note that in the right panel, a decreasing area of the circle at fixation created a larger peripheral mask

Fig. 4

Proportions of errors as a function of circle area and target level for central (left) and peripheral (right) masks. All other conventions are as in Fig. 3

Results

All means and SDs, as well as the covariance matrices of all conditions, are in the Appendix. Using these values, one can reconstruct a simulation of all our results that were analyzed with LRs. The mean RTs are presented in Fig. 3, and the statistical analysis of RT as a function of mask size, type of mask, and target level is presented in Table 2. A model containing the obvious three-way interaction was 39 times more likely to be the best model than one without it, p < .001. All two-way interactions, save one, were also statistically decisive, as can be seen in Table 2, and there was a main effect of target level. As expected, whether the mask was central or peripheral, RTs steadily increased as the size of the masked region increased. The noteworthy three-way interaction can be attributed to a small difference in how form and material were affected by changes in mask size with a central mask, whereas with the peripheral mask, discrimination RTs for form targets increased dramatically as mask size increased, relative to material targets (see Footnote 2).

Table 2 Statistical analysis of RT

The error rates are presented in Fig. 4, and the statistical analysis of error rates is presented in Table 3. We found a main effect of mask type, LR(1) = 6.7, p = .01, with central-mask trials being more accurate overall than peripheral-mask trials. There was again a noteworthy three-way interaction among all predictors, with the model containing the interaction being 14.2 times more likely than one without it, p < .001. Contributing to this three-way interaction, there was a deleterious effect on material targets relative to form in the condition with the largest central mask, whereas with the largest peripheral invisible region, form targets were selectively disrupted.

Table 3 Statistical analysis of accuracy (error rate)

In recognition of how it is sometimes difficult to interpret interactions from plots of means, and to show estimates of the magnitudes of effects, Fig. 5 shows plots of the form – material effects for all conditions, in a format that mirrors Figs. 3 and 4. Here the two critical three-way interactions really stand out, since the pattern of the effects in the central condition is very different from that in the peripheral condition for both RTs and errors.

Fig. 5

Data from Figs. 3 and 4, plotted as form – material effects with 95% confidence intervals (CIs). The first row shows reaction times, and the second row shows proportions correct. The no-mask effect CIs are the dashed lines

Discussion

The pattern of results from the peripheral-mask condition, in which vision was confined to the region near fixation, was as we expected. When the visible region was large, the processing of both form and material was similar and only slightly less efficient than in the no-mask condition. As the mask size increased (and, therefore, more of the periphery was occluded), the efficiency of processing form and material was disrupted, but the disruptions (in both RTs and errors) were much greater for form than for material. If the form in our scenes is viewed as akin to the gist of the scenes used by Larson and Loschky (2009), then this pattern of results converges with theirs.

The pattern of results from the central-mask condition, in which vision was confined to the periphery, both matched and mismatched our expectations. Performance on material targets was as expected: The larger the central mask, the slower and less accurate were decisions about the target’s identity. Performance on form targets was mixed. Accuracy of responding showed the expected pattern of no effect of central mask size on the accuracy of target identification. RTs for form targets, however, were unexpectedly delayed as the size of the central mask was increased.

We suggest two related explanations for this unexpected pattern of RTs. It is a ubiquitous belief among perception researchers that in everyday perception, we move our eyes so as to place new information to be processed in the sensitive foveal region of vision, and that the information so acquired has preferential access to awareness. Regardless of the level that contains the target, central masking makes this normal behavior impossible (it was, of course, our goal to do so). Perhaps this disruption causes a delay in RTs, regardless of the level at which the target shape is presented. Relatedly, the participant can use central vision to determine at what level the target is being presented: If Ds are picked up in central vision, the target is at the level of form; if circles or squares are picked up, the target is that detected shape. When central vision is masked, this heuristic is disabled, and this might delay decisions about both form and material. A converse of this heuristic (detecting a D in the form, processed in peripheral vision) can be used to focus on material when central vision is available. This heuristic might explain the small deleterious effect of increasing peripheral mask size upon the extraction of material.

Conclusions

Conventional wisdom regarding the roles of central and peripheral vision with respect to material and form information is difficult to test directly. The development of modern real-time eyetracking, however, has enabled progress on this front. Indeed, our use of gaze-contingent masking of central and peripheral vision has generated direct evidence in support of the conventional views that global form is preferentially processed in peripheral vision, whereas local material is preferentially processed in central vision. Further research will be required in order to fully understand why the results with central masks of increasing size generated different patterns of performance for RTs and accuracy.