Incremental grouping of image elements in vision
One important task for the visual system is to group image elements that belong to an object and to segregate them from other objects and the background. We here present an incremental grouping theory (IGT) that addresses the role of object-based attention in perceptual grouping at a psychological level and, at the same time, outlines the mechanisms for grouping at the neurophysiological level. The IGT proposes that there are two processes for perceptual grouping. The first process is base grouping and relies on neurons that are tuned to feature conjunctions. Base grouping is fast and occurs in parallel across the visual scene, but not all possible feature conjunctions can be coded as base groupings. If there are no neurons tuned to the relevant feature conjunctions, a second process called incremental grouping comes into play. Incremental grouping is a time-consuming and capacity-limited process that requires the gradual spread of enhanced neuronal activity across the representation of an object in the visual cortex. The spread of enhanced neuronal activity corresponds to the labeling of image elements with object-based attention.
Keywords: Gestalt grouping, Image parsing, Attention, Object-based attention, Object recognition, Perceptual organization, Synchronization, Visual cortex, Binding problem
Vision starts with a fragmentation of the visual scene. Neurons in low-level areas of the visual cortex extract the low-level features present in their small receptive fields, in parallel across the visual field. The representation of the visual scene in early visual areas therefore consists of a set of image fragments, like short contour elements and small texture patches. This is not how we perceive the visual world. The world with which we interact consists of coherent and unitary objects that comprise many features, rather than an unstructured collection of localized image fragments (Wertheimer, 1923). Thus, our visual system must have powerful mechanisms for grouping all the elements of an object together and for segregating them from other objects and the background. This process of perceptual grouping is important for object recognition and for the interaction with the objects that surround us. If we want to grasp an object, we have to know which parts belong to it and which parts do not. Somehow, our visual system synthesizes chairs, tables, trees, and animals from all the image fragments that are represented in early visual areas and assigns colors, motions, and depth structure to these objects. Previous theoreticians convincingly pointed out that this is no small achievement, especially because we can even perceive new objects that consist of feature constellations that we never saw before (see also Treisman, 1996; von der Malsburg, 1999; cf. Singer & Gray, 1995).
However, there are other conditions where the grouping of image elements on the basis of connectedness and good continuation is associated with substantial delays (Crundall, Dewhurst, & Underwood, 2008; Jolicoeur, Ullman, & MacKay, 1986, 1991; Pringle & Egeth, 1988; Roelfsema, Scholte, & Spekreijse, 1999). This occurs, for example, if stimuli consist of two curves and subjects have to judge whether contour elements belong to the same curve. The fishermen in Fig. 2b have to perceptually group the contour elements of their line to see who will catch the big fish. All the elements of the lines are related to each other by Gestalt grouping cues since they are locally collinear and connected—that is, they are in each other’s good continuation. By studying a laboratory version of this task, Jolicoeur et al. (1986, 1991) demonstrated that the processing time increases linearly with the length of the lines. Grouping short lines requires tens of milliseconds (Crundall, Cole, & Underwood, 2008; Pringle & Egeth, 1988), but the delays add up to hundreds of milliseconds for longer curves (Jolicoeur et al., 1986), implying that Gestalt criteria are not invariably evaluated by an unlimited capacity mechanism.
In what follows, we will use the term grouping for the processes that delineate the features that belong to the same object, such as its color and shape, and also for the processes that identify the various image elements that belong to an object. Previous authors have used the term binding for the same processes. Because the IGT links levels of description, we will have to switch back and forth between neurophysiology and perceptual psychology. The IGT also proposes a number of computational principles that underlie the specificity of perceptual grouping, which could inspire more specific implementations as neural network models or in machine vision. However, in its present form, the theory is not sufficiently detailed to address the complex computational problem of finding the best way to parse a visual image given a set of local consistency constraints (for computational approaches, see, e.g., Barbu & Zhu, 2003; Borenstein & Ullman, 2008; Sharon, Galun, Sharon, Basri, & Brandt, 2006).
The second section of the article will describe the core assumptions of the IGT that are inspired by neurophysiological findings. In a previous article (Roelfsema, 2006), we outlined some of these neurophysiological mechanisms, but the present article is the first to present the IGT comprehensively, as a number of conjectures. In the third section, we then use the theory, for the first time, to reconcile apparently conflicting findings on the efficiency of grouping, thereby establishing new connections between the neurophysiology of perceptual grouping and the large psychological literature on this topic. We will discuss tasks that probe binding and object perception, but we will initially stay away from visual search and texture segregation, tasks that have been used to measure the efficiency of feature binding in previous work (e.g., Treisman & Gelade, 1980; Wolfe, 1994). We will postpone a discussion of these tasks to the fourth section, which specifies similarities and differences with previous theories of perceptual grouping. We will suggest that the neuronal mechanisms for search and texture segregation are more complex than may have been anticipated in these earlier studies. Finally, the fifth section makes a number of new predictions that could be exploited to test the theory.
Neuronal mechanisms for perceptual grouping
The IGT proposes that there are two distinct mechanisms for perceptual grouping. The first is base grouping. In the visual cortex, base groupings are coded by specialized neurons. The computation of these groupings relies on the selectivity of feedforward connections that propagate activity from lower to higher areas of the visual cortex (Fig. 3a). The second type of grouping is called incremental grouping. It is required for groupings that are not coded as base groupings. Incremental grouping relies on feedback connections that run from higher to lower visual areas, as well as on lateral connections between neurons in the same area. These recurrent connections propagate a neuronal response enhancement in order to label all the neurons that code image elements that belong to a single object (Fig. 3b). This labeling operation is associated with processing delays, and incremental grouping is, therefore, a serial process.
Conjecture 1: Two mechanisms for grouping There are two forms of grouping: (1) base grouping, the rapid activation of neurons tuned to feature conjunctions, and (2) incremental grouping, the time-consuming spread of an enhanced neuronal response.
Base grouping: activation of tuned neurons
Base grouping depends on the tuning of individual neurons to feature conjunctions. Neurons in early visual areas respond selectively to relatively simple features, such as the orientation of a line element (Fig. 3a). Orientation is usually considered to be a basic feature that is extracted before grouping operations come into play. However, a line element is also a grouping of simpler elements (e.g., pixels) that are aligned in a specific configuration. It is not easy to draw a line between what is a shape feature and what is a conjunction of shapes. If neurons are tuned to shapes, we call these shapes base groupings. In addition to their tuning to shapes, many neurons in early visual areas are also tuned to other features, such as colors and movement directions (Leventhal, Thompson, Liu, Zhou, & Ault, 1995). A neuron tuned to a red horizontal line represents a conjunction between these two features, and this is another example of base grouping. An important implication is that the representation of some feature conjunctions is not fundamentally different from the representation of single features.
Neurons in higher areas are tuned to more complex feature conjunctions (Fukushima, 1980; Riesenhuber & Poggio, 1999b). Neurons in the inferotemporal cortex, for example, code specific configurations of contour elements that form a shape (Brincat & Connor, 2004, 2006; Kayaert, Biederman, Op de Beeck, & Vogels, 2005; Tanaka, 1993). Consider the neurons in this brain region that are tuned to the shape of a face: Some of these cells are activated only if a number of face components, such as the mouth, eyes, and nose, are seen in their correct relative positions (Kobatake & Tanaka, 1994; Tsao, Freiwald, Tootell, & Livingstone, 2006). The neuron’s activation implies detection of the face components as a perceptual group (analogous to a group of aligned pixels that are detected as a line element). Barlow (1972) called these highly selective neurons “cardinal cells.” They are also known as grandmother cells, and these studies, together with recent findings of neurons tuned to specific individuals like Jennifer Aniston and Bill Clinton (Kreiman, Fried, & Koch, 2002; Quiroga, Reddy, Kreiman, Koch, & Fried, 2005), provide compelling evidence that cardinal cells exist.
Base groupings are extracted rapidly after the presentation of a visual image, in early visual areas (Celebrini, Thorpe, Trotter, & Imbert, 1993), as well as in higher visual areas (Hung, Kreiman, Poggio, & DiCarlo, 2005; Oram & Perrett, 1992; Sugase, Yamane, Ueno, & Kawano, 1999). The tuning to these feature conjunctions emerges as soon as the neurons are activated by a newly presented stimulus. The implication is that base grouping mainly reflects the selectivity of feedforward connections, because these connections provide the shortest route from the retina to any particular area of the visual cortex (Lamme & Roelfsema, 2000; Oram & Perrett, 1992; Thorpe, Fize, & Marlot, 1996). The fast emergence of tuning is incompatible with a major role for recurrent pathways that involve lateral connections and feedback connections, since these are associated with additional synaptic and axonal conduction delays. We will use the term base representation when we refer to the pattern of neuronal activity that is evoked by the selectivity of feedforward connections (Roelfsema, Lamme, & Spekreijse, 2000; Ullman, 1984). The set of features that are coded as base groupings is large because it corresponds to the features for which tuned neurons are found. It includes neurons tuned to properties of contours such as orientation (Livingstone & Hubel, 1987) and curvature (Brincat & Connor, 2004), surface properties such as color (Zeki, 1983) and texture (Komatsu & Ideura, 1993), and other features such as motion (Albright & Stoner, 2002) and shape (Tanaka, 1993).
And yet, there are limits to the number of groupings that can be coded in the base representation. In higher areas, receptive fields are larger, so that multiple objects fall into one receptive field. The representations of these objects compete with each other through mutual inhibition (Desimone & Duncan, 1995), and hence the depth of processing—that is, the number of computed base groupings—depends on the distance between an object and other objects in the surround. These inhibitory interactions take place on such a fast time scale that they curtail the initial wave of feedforward processing (Knierim & Van Essen, 1992; Miller, Gochin, & Gross, 1993; Sheinberg & Logothetis, 2001). Moreover, if multiple objects fall into these larger receptive fields, their features can become mingled, which is presumably the cause of crowding (Pelli, Palomares, & Majaj, 2004).
There also is a more fundamental limit on the number of base groupings, because there are more objects possible than there are neurons available in the visual cortex (Engel, König, Kreiter, Schillen, & Singer, 1992; von der Malsburg, 1999; von der Malsburg & Schneider, 1986). This problem can be solved by coding objects as patterns of activity across a number of neurons. An elongated curve, for example, can be coded as a collection of contour elements, even if there is no neuron that is tuned to the overall shape of the curve. Such a distributed representation has a number of virtues. First, it is efficient, because neurons can participate in the representation of many objects that share a particular feature. Second, objects that were not encountered previously can be coded as a new pattern of activity across the existing neurons (see also Singer & Gray, 1995). However, there is a disadvantage associated with distributed representations, called the binding problem: In the presence of multiple objects, information is lost about whether features belong to the same object or to different objects (Engel et al., 1992; Treisman & Gelade, 1980; von der Malsburg, 1999; von der Malsburg & Schneider, 1986). If there are multiple elongated curves, for example, neurons that code the contour elements will all be active, but this pattern of activity does not reveal which contour elements belong to the same curve (as in Fig. 2b). These situations therefore require incremental grouping.
Incremental grouping: grouping by labeling
Incremental grouping is necessary if feature conjunctions have to be established that are not coded as base groupings. The central idea is that neurons that respond to features that are grouped incrementally are labeled in the visual cortex with an enhanced neuronal response. This idea is illustrated in Fig. 3b, where a collection of contour elements that belong to one of the lines has been highlighted. The central hypothesis is that these contour elements are grouped together in perception when the neurons in the visual cortex that represent them enhance their firing rate.
Conjecture 2: Identity of the label Neurons coding features that are grouped incrementally enhance their response.
Conjecture 3: Enabling of connections The base representation enables a set of connections: the connections between neurons that are activated by feedforward connections. Enabled connections can propagate the enhanced neuronal response.
An extra processing step is necessary to make incremental groupings explicit and determine which image elements belong to the same overall shape, because neurons do not have access to the overall shape of the interaction skeleton. Neurons receive information only about the activity of other cells, but not about a (transitive) pattern of enabled connections. This distinction is important. The neurons that respond to connected pixels are initially only locally linked, but these pixels are not yet transitively grouped—that is, available for report. To make these latent groupings accessible to other neurons, an enhanced firing rate has to spread through the network of enabled connections. Figure 4b illustrates how, during incremental grouping, the enhanced firing rate starts to spread from one of the activated neurons so that it eventually highlights the representation of an entire connected image component. It follows that incremental grouping is a serial process. The buildup of processing delays during the spread of the rate enhancement produces a linear increase in reaction time with increases in the length of the curve and corresponds to the serial spread of attention at the psychological level (Fig. 4c), as is indeed observed experimentally (Houtkamp, Spekreijse, & Roelfsema, 2003; Jolicoeur et al., 1986).
The elements that are labeled by the enhanced neuronal response are segregated from the elements that are not labeled, so that grouping and segregation are two sides of the same coin. In some cases, incremental grouping can even override local cues for segregation. This can be seen in Fig. 4a, where squares 3 and 4 are separated by white squares that provide (local) evidence for segregation. However, in the lower panel of Fig. 4a, these squares are linked through a detour, so that incremental grouping could make their linkage explicit.
Conjecture 4: Linking and grouping The enabling of connections is called linking and occurs in parallel across the visual scene. An enhanced response has to be propagated across the enabled links to make incremental grouping explicit so that it is available for report. This propagation is a serial, time-consuming process.
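The distinction between linking and grouping in Conjectures 3 and 4 can be illustrated with a toy simulation. The following is a minimal sketch and not the model proposed here: it assumes a binary pixel grid, links between 4-connected active pixels, and a label that advances by one link per time step. The function names `enable_links` and `spread_label` are hypothetical.

```python
from typing import Dict, List, Tuple

Pixel = Tuple[int, int]


def enable_links(image: List[str]) -> Dict[Pixel, List[Pixel]]:
    """Linking: enable connections between neighboring active pixels.

    Every link depends only on one local pair of pixels, so all links
    could in principle be established in parallel.
    """
    active = {(r, c) for r, row in enumerate(image)
              for c, ch in enumerate(row) if ch == "#"}
    links: Dict[Pixel, List[Pixel]] = {p: [] for p in active}
    for (r, c) in active:
        for dr, dc in ((0, 1), (1, 0), (0, -1), (-1, 0)):
            q = (r + dr, c + dc)
            if q in active:
                links[(r, c)].append(q)
    return links


def spread_label(links: Dict[Pixel, List[Pixel]],
                 seed: Pixel) -> Dict[Pixel, int]:
    """Incremental grouping: spread a label from `seed` over enabled links.

    Returns the time step at which each reached pixel receives the label.
    """
    labeled = {seed: 0}
    frontier = [seed]
    t = 0
    while frontier:
        t += 1
        next_frontier = []
        for p in frontier:
            for q in links[p]:
                if q not in labeled:
                    labeled[q] = t
                    next_frontier.append(q)
        frontier = next_frontier
    return labeled


# Two separate horizontal "curves"; the label starts on the upper one.
image = [
    "#####.",
    "......",
    "###...",
]
links = enable_links(image)
times = spread_label(links, seed=(0, 0))
```

In this sketch the far end of the upper curve is labeled after a number of steps equal to its distance from the seed along the curve, while the lower curve is never labeled. This mirrors the claim that the time needed to make a grouping explicit grows with the length of the curve, even though all links were enabled at the outset.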
The neurophysiology of incremental grouping
As predicted by the IGT, the neuronal activity in the primary visual cortex (area V1) during the contour grouping task is characterized by two distinct processing phases. The initial responses caused by the feedforward connections code the contour elements and are selective for their orientation (Celebrini et al., 1993; Lamme, Rodriguez-Rodriguez, & Spekreijse, 1999) (blue bar in Fig. 5c). This is followed by a phase where lateral and feedback connections propagate an enhanced response along the target curve in order to compute the incremental groupings (yellow bar in Fig. 5c). The contour grouping task is solved when this enhanced response reaches the circle at the end of the target curve so that it can be selected for an eye movement.
A veridical and a labeling network
The division of labor between A- and N-sites would also permit the incremental grouping of curves with varying contrast, as is illustrated in Fig. 6b. In this figure, the upper curve is attended so that the responses of A-neurons are stronger than the responses of N-neurons. Thus, in such a scheme, labeling with an enhanced response can even occur for curves with varying contrast. It is also possible to propagate a difference in activity between A- and N-neurons, if A-neurons excite neighboring A-neurons with N-neurons providing inhibition. This idea is illustrated in Fig. 6c, where A-neurons in the lower row receive an equal amount of excitation and inhibition. In the upper row, however, the A-neurons propagate enhanced activity. When an A-neuron enhances its response, the neighboring A-neuron receives more excitation than inhibition, so that it will also enhance its response. Thus, the network can propagate the response enhancement in spite of the differences in activity caused by variation in stimulus contrast. The network uses two codes: The difference in response between A- and N-neurons labels contour elements for incremental grouping, while the response of N-neurons codes the contrast of the contour elements. In a recent study, we found that it is indeed possible to decode the contrast of a stimulus as well as the contour elements that have been labeled for grouping from the activity of a population of neurons in area V1 (Pooresmaeili et al., 2010).
Conjecture 5: A veridical and a labeling network There are N-neurons and A-neurons in the visual cortex. N-neurons form a veridical network that provides a reliable representation of the stimulus features, while A-neurons form a labeling network where features can be grouped incrementally.
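The two-code scheme of Conjecture 5 can be made concrete with a small simulation. This is only a sketch under strong assumptions that are not part of the theory: a one-dimensional chain of contour elements, N-neuron responses equal to stimulus contrast, a fixed additive enhancement `gain`, and an A-neuron that becomes enhanced as soon as a neighboring A-neuron delivers more excitation than the matching N-neuron delivers inhibition. The function name `label_curve` is hypothetical.

```python
from typing import List, Optional, Tuple


def label_curve(contrasts: List[float], seed: Optional[int],
                gain: float = 0.5) -> Tuple[List[float], List[float]]:
    """Return (N responses, A responses) for one curve after label spreading.

    contrasts: stimulus contrast of each contour element along the curve.
    seed: index of the element where the enhancement starts, or None if
          the curve is not attended.
    gain: size of the response enhancement (an arbitrary, assumed value).
    """
    n = list(contrasts)   # N-neurons: veridical code, never modified
    a = list(contrasts)   # A-neurons start out equal to their N partners
    if seed is not None:
        a[seed] = n[seed] + gain
    # One pass per element is enough for the label to reach both ends.
    for _ in range(len(n)):
        updated = list(a)
        for i in range(len(n)):
            for j in (i - 1, i + 1):
                if 0 <= j < len(n):
                    # Excitation from A-neuron j minus inhibition from
                    # N-neuron j: zero unless neighbor j is labeled.
                    net_drive = a[j] - n[j]
                    if net_drive > 0:
                        updated[i] = n[i] + gain
        a = updated
    return n, a


contrasts = [0.25, 1.0, 0.5, 0.75]          # varying contrast along a curve
n_att, a_att = label_curve(contrasts, seed=0)
n_un, a_un = label_curve(contrasts, seed=None)

label_att = [x - y for x, y in zip(a_att, n_att)]   # A minus N, attended
label_un = [x - y for x, y in zip(a_un, n_un)]      # A minus N, unattended
```

Under these assumptions the difference between A- and N-responses is the same at every element of the attended curve and zero on the unattended curve, whereas the N-responses still report the contrast of each element. Because propagation is driven by the difference rather than by the raw A-response, a low-contrast element does not interrupt the spread.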
Interactions between lower and higher visual areas
In the first place, shape representations in higher areas could facilitate the grouping of low-level features. Experimental support for such a facilitation has come from a study by Vecera and Farah (1997), who asked subjects to decide whether two markers were on the same or on different letters, a task that requires the grouping of contour elements of the same letter. This study compared the performance for letters in their upright position, which are presumably strongly represented in higher visual areas, with the performance for upside-down letters with weaker representations (Fig. 7b). Response times were shorter for the upright letter, implying that shape representations indeed facilitate the grouping of low-level features. The benefit for upright letters is explained if feedforward processing activates shape representations in higher areas (base grouping). Neurons in the higher areas that represent the cued shape (the letter F) would enhance their activity over the noncued shape and would feed back to lower areas to enhance the representation of the individual contour elements of the F in lower areas. Inverted letters are presumably not represented or are more weakly represented in higher areas, so that the grouping of their contour elements has to rely on low-level grouping cues in lower areas like good continuation only.
Conversely, if a set of low-level features that form a perceptual group is labeled with enhanced activity in early visual areas, this enhanced activity can spread to higher areas to activate neurons that code the overall shape (see, e.g., Sáry, Vogels, & Orban, 1993). For example, subjects can group image elements that move coherently in one direction and can segregate them from background elements moving in different directions to identify the shape of this perceptual group (Large, Aldcroft, & Vilis, 2005).
Figure 7c gives another demonstration. In the left panel, some of the image elements are linked by good continuation and can be seen to form the letter F. Without the low-level grouping, it is more difficult to perceive the F (right panel in Fig. 7c). Thus, the interactions between lower and higher areas permit grouping of elementary features into a shape. These interactions between areas are important for many tasks; for example, it is crucial to know which parts do and which parts do not belong to an object if you plan to grasp it.
It is likely that the spread of enhanced activity in lower areas influences the competitive interactions between representations in higher areas. If objects are nearby or overlapping (as they are in Fig. 7b), they typically fall into the same receptive field of neurons in higher areas. In these situations, the neuronal activity is the average of the activity evoked by the individual objects (Armstrong, Fitzgerald, & Moore, 2006; Reynolds, Chelazzi, & Desimone, 1999). Thus, a neuron that responds well to the F in Fig. 7b but poorly to the A will have an intermediate response if both letters fall in its receptive field. When attention is directed to the strokes of the F, the neuron’s response increases to the level evoked by the F (if presented alone). Vice versa, if attention is directed to the A, the activity of the neuron decreases to the level evoked by an individual A. Modeling studies have shown that the spread of enhanced neuronal activity in lower visual areas can account for these influences on neuronal activity in higher areas (Grossberg & Raizada, 2000).
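The averaging rule and its modulation by attention can be written as a weighted mean. The sketch below is one simple way to express the description above, not a fitted model: the single attention weight, the function name `pooled_response`, and the firing rates are illustrative assumptions.

```python
def pooled_response(r_first: float, r_second: float,
                    attention_to_first: float = 0.5) -> float:
    """Response of a neuron with two objects in its receptive field.

    r_first, r_second: responses evoked by each object presented alone.
    attention_to_first: 0.5 means attention favors neither object;
        1.0 means attention is fully on the first object, 0.0 fully on
        the second.
    """
    w = attention_to_first
    return w * r_first + (1.0 - w) * r_second


# Hypothetical firing rates (spikes/s) for a neuron that prefers the F.
rate_f_alone = 40.0
rate_a_alone = 10.0

neutral = pooled_response(rate_f_alone, rate_a_alone)          # plain average
attend_f = pooled_response(rate_f_alone, rate_a_alone, 1.0)    # as if F alone
attend_a = pooled_response(rate_f_alone, rate_a_alone, 0.0)    # as if A alone
```

With equal weights the response lies midway between the two single-object responses; shifting the weight toward one letter moves the response toward the level evoked by that letter alone.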
Scale invariance of perceptual grouping
Influence of the enhanced activity on areas involved in response selection
Incremental grouping is apparently associated with the enhancement of neuronal activity in the visual and frontal cortices. It is likely that the enhanced activity of V1 neurons also influences neuronal responses in higher visual areas (see the Interactions between lower and higher visual areas section). However, we do not know whether the enhanced activity in the frontal eye fields (FEF) originates from area V1. It is also possible that the response enhancement in V1 is caused by feedback from higher visual areas, including area FEF. Studies that have used microstimulation to artificially raise the activity of small populations of neurons have demonstrated that feedback from FEF enhances neuronal activity in the visual cortex (Moore & Armstrong, 2003), even in area V1 (Ekstrom, Roelfsema, Arsenault, Bonmassar, & Vanduffel, 2008). Our working hypothesis is that neurons in the two areas engage in reciprocal interactions, causing the same curve to be selected in the visual and the frontal cortices (Duncan, Humphreys, & Ward, 1997). Interactions between area V1 and FEF involve intermediate processing stages as well (Ekstrom et al., 2008), since there are only a few direct connections between area V1 and FEF (Felleman & Van Essen, 1991). A recent study showed that the widespread neuronal correlates of the attentional selection of the target curve can also be measured as a sustained negativity in the human EEG (Lefebvre, Jolicoeur, & Dell'Acqua, 2010).
Conjecture 6: Consequences of incremental grouping Neurons that enhance their activity during incremental grouping have increased impact on other cortical areas and can thus provide the input to object recognition and response selection.
The role of attention
Incremental grouping requires the time-consuming spread of an enhanced firing rate. Many neurophysiological studies have indicated that these firing-rate modulations in the visual cortex are responsible for shifts of visual attention at a psychological level of description (reviewed by Desimone & Duncan, 1995; Roelfsema, 2006). Thus, while base grouping maps onto preattentive processing, incremental grouping maps onto attentive processing, and the spread of an enhanced neuronal activity in the visual cortex corresponds to the spread of object-based attention in psychology (Fig. 4c).
We obtained support for this idea by investigating the distribution of visual attention during contour grouping (Houtkamp et al., 2003). Subjects saw a target curve that started at a fixation point and a distractor curve. Their primary task was to indicate the location of a marker at the other end of the target curve. To probe the distribution of attention, colors were presented on different segments of the curves at various intervals during a trial, and the secondary task was to report one of these colors. The performance in the secondary task showed that, at the start of the trial, attention was directed to the initial contour elements of the target curve and that it subsequently spread across the entire curve until all contour elements were labeled by attention (schematically indicated in Fig. 4c). Thus, during incremental grouping, attention gradually adds elements to the evolving perceptual group by spreading from attended image elements to other elements that are related to them by Gestalt criteria, until the entire object has been labeled by attention.
Conjecture 7: Spread of attention The propagation of an enhanced neuronal response through the network of enabled connections corresponds to the spread of attention on the basis of Gestalt cues at a psychological level of description.
Incremental grouping on the basis of good continuation, similarity, and proximity
The connectivity scheme shown in Fig. 4 works well for the detection of connectedness but must be generalized to accommodate other Gestalt grouping laws, such as good continuation. For good continuation, we assume that there are connections that spread the enhanced neuronal activity between neurons tuned to well-aligned contour elements (Field et al., 1993; Grossberg & Raizada, 2000; Li, 1999). This assumption is in accordance with the anatomy of horizontal connections in the visual cortex, which interconnect neurons that code contour elements that are well aligned—that is, in each other’s good continuation (Bosking, Zhang, Schofield, & Fitzpatrick, 1997; Schmidt, Goebel, Löwel, & Singer, 1997). Likewise, grouping by similarity can be implemented by connections between neurons tuned to similar features (e.g., Grossberg & Mingolla, 1985; Roelfsema, Lamme, Spekreijse, & Bosch, 2002), and proximity grouping by connections between neurons with nearby receptive fields.
Conjecture 8: Implementation of Gestalt grouping Gestalt grouping cues are implemented by connecting neurons tuned to image features that are likely to belong to the same perceptual object so that they spread enhanced activity—for example, similar features (similarity grouping), well-aligned features (good continuation), or nearby features (proximity grouping).
The IGT predicts that perceptual grouping becomes time consuming whenever groupings have to be formed transitively because they are not extracted as base groupings. In a recent study, we investigated whether it is possible to observe delays during perceptual grouping on the basis of good continuation, using displays similar to those shown in Fig. 10a (Houtkamp & Roelfsema, 2010). We found that the reaction time of subjects increased linearly with the number of elements that had to be grouped together (i.e., with the distance between the two arrows in Fig. 10a), just as had been observed for continuous contours by Jolicoeur et al. (1986, 1991).
Figure 10b shows that it is also straightforward to create stimuli where transitive grouping occurs on the basis of proximity. The circles in the left panel can be seen to form two strings. Circle 1 is close to a nearby circle, which is close to another one, and we eventually reach circle 2 through a chain of local groupings. Transitivity dictates that the entire chain is seen as a perceptual group, and circle 1 therefore groups with circle 2, although it is actually closer to circle 3. Thus, also in this example, the transitivity implies that grouping is sensitive to the context set by other elements in the display. In the middle panel of Fig. 10b, the distances between circles 1, 2, and 3 are the same, but other circles are displaced so that circle 1 groups with circle 3. According to the IGT, attention spreads from one circle to the next until the whole string is attended, and we observed that the reaction times of subjects indeed increased linearly with the number of items that had to be grouped using displays comparable to those shown in Fig. 10b (Houtkamp & Roelfsema, 2010). The IGT would model the spread of the enhanced response by assuming that neurons coding nearby items are linked by recurrent connections. It is well known that attention tends to spread from target elements to other items in their proximity (e.g., Eriksen & Eriksen, 1974). The IGT assigns a functional role to this effect: It promotes grouping of nearby items.
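The context dependence of transitive proximity grouping can be reproduced with a few lines of code. This is an illustrative sketch only: the coordinates are invented rather than taken from Fig. 10b, a single distance threshold `max_dist` stands in for the recurrent connections between neurons coding nearby items, and the names `proximity_links` and `spread_attention` are hypothetical.

```python
import math
from collections import deque
from typing import Dict, List, Tuple

Point = Tuple[float, float]


def proximity_links(points: Dict[str, Point],
                    max_dist: float) -> Dict[str, List[str]]:
    """Link every pair of items whose distance is at most max_dist."""
    names = list(points)
    links: Dict[str, List[str]] = {name: [] for name in names}
    for i, a in enumerate(names):
        for b in names[i + 1:]:
            if math.dist(points[a], points[b]) <= max_dist:
                links[a].append(b)
                links[b].append(a)
    return links


def spread_attention(links: Dict[str, List[str]],
                     start: str) -> Dict[str, int]:
    """Spread attention item by item; return steps needed to reach each."""
    steps = {start: 0}
    queue = deque([start])
    while queue:
        current = queue.popleft()
        for neighbor in links[current]:
            if neighbor not in steps:
                steps[neighbor] = steps[current] + 1
                queue.append(neighbor)
    return steps


# Invented layout: c1 and c2 are the ends of one string; c3 starts another.
points = {
    "c1": (0.0, 0.0), "a1": (1.0, 0.0), "a2": (2.0, 0.0),
    "a3": (3.0, 0.0), "c2": (4.0, 0.0),
    "c3": (0.0, 2.5), "b1": (1.0, 2.5), "b2": (2.0, 2.5),
}
links = proximity_links(points, max_dist=1.5)
steps = spread_attention(links, start="c1")
```

In this layout circle `c1` is closer to `c3` than to `c2`, yet attention starting at `c1` reaches `c2` through the chain of intermediate items and never reaches `c3`. The number of steps needed to reach `c2` equals the number of links in the chain, which is the toy counterpart of a reaction time that increases linearly with the number of items to be grouped.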
In the right panel of Fig. 10b, we added five elements to the left picture, and circle 1 now also groups with circle 3. This is remarkable because all circles that promoted grouping between circles 1 and 2 in the left panel have kept their position. How can a scheme that uses enabled connections to support grouping in the left panel fail to do so in the right panel? We propose that proximity grouping is implemented at multiple spatial scales. Horizontal connections in early visual areas link nearby locations in the visual field, while horizontal connections in higher areas link locations that are farther apart. Higher areas could propagate an enhanced response between neurons that code image elements that are far apart, but propagation in the higher areas should be blocked if there are also image elements that are nearer, just as was proposed for scale-invariant contour grouping above. A psychophysical study demonstrated that proximity grouping is indeed largely scale invariant (Kubovy et al., 1998), so that the perceptual organization of the displays of Fig. 10b does not depend on viewing distance. The implementation of proximity grouping at multiple hierarchical levels of the visual cortex might account for this scale invariance, a hypothesis that could be tested by neural network studies.
Figure 10c illustrates transitive grouping by similarity. Nearby image regions with a similar color are grouped together. Boundaries form at locations where neighboring circles have a categorically different color so that elements on one side of a boundary do not group with elements on the other side (e.g., blue and gray circles). The left panel of Fig. 10c shows that a gradual change in color within a region permits transitive grouping between elements with a dissimilar color (e.g., green and yellow circles). Mumford, Kosslyn, Hillger, and Herrnstein (1987) and Wolfson and Landy (1998) demonstrated that subjects exploit gradual changes in feature values for the grouping of image regions, but they also evaluate abrupt feature changes for their segregation, in support of the idea that grouping and segregation are complementary processes. Note, however, that while an abrupt change in color may provide local evidence for the segregation of image regions, these regions may nevertheless be linked through a detour, as is illustrated for the yellow and green circles in the left panel of Fig. 10c. If subjects are tested with displays where image elements have to be grouped on the basis of their similarity, the reaction time increases linearly with the number of items that need to be grouped, indicating that incremental grouping also occurs when similarity defines the perceptual groups (Houtkamp & Roelfsema, 2010).
The middle and right panels of Fig. 10c illustrate that grouping on the basis of similarity can, in some situations, rely on base grouping, while it requires incremental grouping in others. Image elements with a color that differs from other elements can be detected by neurons in higher areas that are color selective (middle panel), rapidly and in parallel. However, incremental grouping has to come into play when there are multiple strings with the same color (right panel). In this situation, color selectivity does not suffice for grouping. To group the elements of one of the red strings and to segregate them from the other string, an enhanced response has to be propagated along a chain of more local groupings, based on both color and proximity relationships. We showed recently that grouping is serial in this situation, since subjects’ response times increase linearly with the length of the string (Houtkamp & Roelfsema, 2010).
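The serial character of incremental grouping in the right panel can be sketched as the stepwise spread of a label along enabled links. The following is a minimal illustration, not the authors' model: the linking criterion (same color and a distance of at most `max_dist`), the function names, and the assumption that the label advances one link per time step are all hypothetical.

```python
from collections import deque

def enabled_links(elements, max_dist=1.5):
    """Linking: enable a connection between two elements that share a
    color and lie within max_dist of each other (hypothetical criterion)."""
    links = {i: [] for i in range(len(elements))}
    for i, (xi, yi, ci) in enumerate(elements):
        for j, (xj, yj, cj) in enumerate(elements):
            close = (xi - xj) ** 2 + (yi - yj) ** 2 <= max_dist ** 2
            if i != j and ci == cj and close:
                links[i].append(j)
    return links

def incremental_grouping(links, seed):
    """Spread the enhanced-response label from the seed along enabled
    links; return the time step at which each element is labeled."""
    step = {seed: 0}
    queue = deque([seed])
    while queue:
        cur = queue.popleft()
        for nxt in links[cur]:
            if nxt not in step:
                step[nxt] = step[cur] + 1
                queue.append(nxt)
    return step

# Two red strings of five elements each, on rows y = 0 and y = 3.
string_a = [(x, 0.0, "red") for x in range(5)]
string_b = [(x, 3.0, "red") for x in range(5)]
links = enabled_links(string_a + string_b)

labels = incremental_grouping(links, seed=0)
print(sorted(labels))        # [0, 1, 2, 3, 4]: only the seeded string is labeled
print(max(labels.values()))  # 4: steps grow with the length of the string
```

Color alone cannot separate the two strings here, because both are red; the label stays within one string only because it follows links that also respect proximity. The number of steps equals the string length minus one, which mirrors the linear increase in response time described above.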
Previous experiments on perceptual grouping
The key assumptions of the IGT have been summarized in conjectures 1–8. In summary, the presentation of an image triggers a parallel and resource-unlimited base-grouping process that depends on cascades of neurons tuned to features and feature conjunctions. This pattern of activity enables some of the links: activity-spreading connections between activated neurons. For conjunctions not coded as base groupings, there is a later serial and resource-limited incremental grouping process that relies on attention that is propagated along the enabled links to enhance the representation of a coherent, unified group. This pattern of enhanced activity can then be read out by other processes for object recognition (Fig. 7c) or for the programming of actions.
We will now explore whether and how the conjectures above, which are largely based on neurophysiology, can account for previous results in perceptual psychology. Let us, for the sake of the argument, step into the shoes of an outsider who tries to become familiar with the literature on grouping in perception. This outsider is likely to be confused at first. Some workers have argued that perceptual grouping takes place in parallel across the visual scene, while others state that it requires serial processing. Some maintain that grouping depends on visual attention, while others claim that it largely happens at a preattentive stage. Many of these viewpoints have been supported by substantial experimental evidence. The IGT aims to provide a framework that resolves some of these discrepancies.
When grouping takes time and when it does not
These considerations do not exclude that incremental grouping may, in some cases, also be required to establish conjunctions between features at the same location. To take an arbitrary example, imagine a rotating and shrinking elephant. Feature conjunctions between complex motion patterns and shapes are presumably not coded as base groupings and would require labeling the shape in one visual area and the motion pattern in another area with enhanced activity. Mechanisms to ensure that the enhanced activity spreads from one visual area to another one with sufficient specificity have been reviewed by Roelfsema (2006).
Grouping with and without attention
There are many discrepancies in the literature about the involvement of attention in perceptual grouping. Some studies have suggested that forms of grouping do not occur without attention, while other studies have suggested that Gestalt grouping takes place at a preattentive stage. Here, we will suggest how some of these discrepancies can be resolved by distinguishing between base and incremental grouping.
One example of attention-demanding grouping is the contour-grouping task, where object-based attention spreads over contour elements that have to be grouped into elongated curves (see Fig. 4c; Houtkamp et al., 2003; Roelfsema, Houtkamp, & Korjoukov, 2010; Scholte, Spekreijse, & Roelfsema, 2001). Other studies have found that some forms of grouping do not occur when attention is directed elsewhere. Ben-Av, Sagi, and Braun (1992), for example, investigated grouping of image elements surrounding a centrally displayed letter. When the subjects directed their attention to the letter, they were unable to report the perceptual organization of the other image elements, which were arranged in rows or columns on the basis of proximity or similarity, as if proximity and similarity groups do not form without attention.
Another line of evidence that seems to imply a role for attention in perceptual grouping comes from the inattentional blindness paradigm introduced by Mack, Tang, Tuma, Kahn, and Rock (1992; see also Mack & Rock, 1998). Their subjects had to report about the relative length of two arms of a central cross that was surrounded by a pattern of small elements. After a few trials with this task, the background elements were organized in columns or rows on the basis of proximity or similarity cues. The observers received a surprise question about the perceptual organization of the background elements and were usually unable to report about the grouping into rows or columns. Mack et al. concluded that Gestalt grouping does not take place without attention. However, their methodology was later criticized. First, it is conceivable that grouping took place outside awareness and that subjects therefore were not able to report about it (Driver, Davis, Russell, Turatto, & Freeman, 2001). Second, the observers may have forgotten their percept by the time of questioning (Wolfe, 1999).
Subsequent studies with similar arrays, but using more sensitive and implicit measures of grouping, have substantiated these criticisms and, instead, have obtained evidence for grouping without attention. C. M. Moore and Egeth (1997), for example, asked subjects to carry out a line length discrimination task on a background of black and white dots. On some of the trials, the black dots were configured to induce a line length illusion in the case of grouping. Remarkably, the dots influenced the line length judgments, even though subjects could not report about the groupings when asked (see also Chan & Chua, 2003). Studies by Kimchi and Razpurker-Apfeld (2004) and Russell and Driver (2005) extended these findings. They investigated the influence of perceptual grouping of background elements while subjects carried out a change detection task. Subjects had to compare stimuli in central vision to detect a change. Unbeknownst to the subjects, the image elements in the surround formed columns or rows on the basis of color similarity in some of the images. If both the central pattern and the grouping of background elements changed across displays, the subjects were more likely to report the change than when only the central stimulus changed. Again, the observers were unaware of the grouping of the background elements (see also Kimchi & Peterson, 2008).
These studies demonstrate that perceptual grouping can occur without attention and outside awareness. According to the IGT, this is possible only for feature constellations that are coded as base groupings. The arrays of black dots of C. M. Moore and Egeth (1997) looked like lines if observed through a low spatial frequency filter. It is therefore plausible that neurons in the visual cortex could detect these dot arrays as base groupings during feedforward processing, and this may have influenced perceived line length just as the normal line inducers that are commonly used to produce line length illusions do. A similar explanation can be given for the isoluminant dot arrays used in the change detection task of Kimchi and Razpurker-Apfeld (2004) and Russell and Driver (2005). These patterns are likely to activate orientation-selective cells (Gegenfurtner, Kiper, & Levitt, 1997) and can be registered as base groupings without attention. Thus, although these results are indicative of grouping without attention, they are consistent with what is known about the tuning of visual cortical neurons to feature conjunctions.
Another line of evidence that, at least at first sight, seems to imply that Gestalt grouping occurs without attention has come from studies in patients with hemineglect. These patients often fail to perceive objects in the hemifield that is contralateral to a brain lesion that is often located in the parietal cortex (Halligan & Marshall, 1993; reviewed by Driver, 1995). Many of these patients suffer from extinction: If presented with two visual objects, one in each hemifield, they see only the object in the good hemifield and fail to see the one in the bad hemifield. This deficit occurs even though patients are able to see the same stimulus in the impaired hemifield if presented alone. The remarkable finding is that an item in the bad hemifield can be rescued from extinction if it forms a perceptual group with an item in the good hemifield; that is, if the patients see two items that form a perceptual group, they perceive both. This relief from extinction has been observed for objects that are grouped on the basis of luminance similarity (Gilchrist, Humphreys, & Riddoch, 1996), connectedness (Driver, 1995; Humphreys & Riddoch, 1993), and good continuation (Gilchrist et al., 1996; Mattingly, Davis, & Driver, 1997; Pavlovskaya, Sagi, Soroker, & Ring, 1997).
Because the main impairment of neglect patients is a difficulty in shifting their attention to the bad hemifield, the results have been interpreted as evidence for grouping without attention (e.g., Driver, 1995). The IGT provides an alternative explanation that is based on linking, that is, the enabling of connections. The items in the good and bad hemifields of the patients are related to each other by Gestalt grouping cues, and neurons that represent them should, therefore, be linked by recurrent, attention-spreading connections in early visual areas that are usually spared by the lesion. The enabling of these connections occurs in parallel across the visual field but has no effect during the preattentive processing stage. However, when the patient attends to the item in the good hemifield, the enabled connections cause attention to spread to the item in the impaired hemifield and facilitate detection.
Linked but not grouped
The distinction between linking and grouping also sheds light on intriguing results on the influence of grouping in the motion-induced blindness (MIB) paradigm (Bonneh, Cooperman, & Sagi, 2001). In this paradigm, a number of high-contrast stationary (or slowly moving) items are superimposed on a background of rapidly moving dots. Under these conditions, the stationary items disappear spontaneously from perception for a period of several seconds and then reappear. Bonneh et al. suggested that the moving dots increase the level of competition between the representations of visual objects, so that the stationary ones can completely disappear from perception. Importantly, Gestalt cues influence MIB. Visual objects disappear and reappear together if they form a perceptual group on the basis of collinearity and proximity cues. Ungrouped objects, on the other hand, disappear and reappear independently. Mitroff and Scholl (2005) extended these findings to other grouping cues, including connectedness. In their study, Gestalt grouping cues were added or removed between objects while they were invisible. Remarkably, the changes in the grouping cues occurring outside awareness nevertheless influenced the simultaneity of reappearance.
If grouping cues were removed outside awareness, the items tended to reappear independently; conversely, if grouping cues were added, the items tended to reappear together.
The enabling of attention-spreading connections between items on the basis of Gestalt cues accounts for their simultaneous reappearance in MIB. Recall that the enabling process is the direct consequence of the pattern of activity in the base representation, and modifications in the base representation are therefore associated with changes in the set of enabled connections (as was discussed in relation to Fig. 4a). Thus, enabling and disabling of connections (i.e., linking) is independent of attention and can occur outside awareness. However, once attention is directed to one item, it will spread to other items that are linked so that they become visible at the same time.
An additional explanation is required to account for the simultaneous disappearance of grouped items. One possibility is that the mutual facilitation between grouped items causes them to remain visible for a longer time than they would if presented alone. Bonneh et al. (2001) indeed demonstrated that grouped items were less prone to disappear from awareness. However, as soon as one of the grouped items disappears from awareness, this decreases the facilitation of linked items, increasing the probability that they also become invisible.
The influence of grouping cues on the spread of attention
The IGT requires that attention flows within perceptual groups, a requirement that has been supported by many studies. Kahneman and Henik (1981) may have been the first to study the effect of perceptual grouping cues on the spread of attention. They investigated the effect of proximity and similarity cues in partial report tasks and found that perceptual groups act as units, because grouped items tend to be jointly reported or jointly missed, suggesting that they are coselected by attention. Later studies extended these findings with a great variety of techniques.
The flanker task is another powerful method to probe the influence of Gestalt grouping cues on the spread of attention. In this task, subjects map target objects onto arbitrary responses. The target is flanked by distractors that are response incompatible (they map onto the opposite response), neutral (not associated with a response), or compatible (associated with the same response as the target). The general finding is that response-incompatible flankers increase response time, while compatible flankers tend to reduce response time. Importantly, flankers that are linked to the target by Gestalt grouping cues cause more interference than do unlinked flankers. These effects can be explained if attention spreads from the target to flankers that are linked to it by Gestalt grouping cues. Eriksen and Eriksen (1974), for example, showed that nearby flankers generate more interference than do flankers that are farther away, supporting the hypothesis that attention flows among items linked by proximity (see also Baylis & Driver, 1992). Similar results have been obtained for similarity and motion. Flankers with a similar color or motion cause more interference than do flankers with a different color or motion (Baylis & Driver, 1992; Driver & Baylis, 1989; Harms & Bundesen, 1983; Kramer & Jacobson, 1991). Moreover, connectedness and good continuation modulate the flanker effect similarly (Baylis & Driver, 1992; Kramer & Jacobson, 1991; Richard, Lee, & Vecera, 2008). These results, taken together, provide strong support for the hypothesis that attention spreads among items linked by Gestalt grouping cues, enhancing the impact of flankers linked to a target.
Cuing tasks provide a second line of evidence in support of the hypothesis that attention flows among linked items. In an elegant study, Egly, Driver, and Rafal (1994) presented two elongated objects and asked subjects to detect a probe item that was presented on one of these objects. The probe was preceded by a cue that could appear at another location on the same object or on a different object. Remarkably, a cue on the same object gave rise to shorter response times than did a cue on the other object, even if the distances between cues and probes were the same. This result is explained if attention spreads across the entire representation of the object when it is cued on one end; that is, it spreads within a linked array of locations (Vecera, 1994). This result was extended by Haimson and Behrmann (2001), who showed that attention spreads across the entire cued object even if parts of it are occluded, and by He and Nakayama (1995), who demonstrated that attention spreads among neighboring items that are linked by a disparity gradient defining a plane in the image. We recently discovered a neuronal correlate of this object-based cuing effect by showing that the enhanced neuronal activity evoked by a cue spreads to the representation of other image elements in the visual cortex that are linked to the cued element (see Fig. 11; Wannig et al., 2011).
Visual search experiments provide yet another line of evidence in support of the conjecture that perceptual groups act as units that can be selected by attention. Duncan and Humphreys (1989) demonstrated an important role for similarity grouping in visual search. They showed that visual search for a particular target item is most efficient if the distractors are similar to one another but dissimilar in shape from the target, and suggested that a set of similar distractors can be rejected efficiently as a perceptual group (see also Bundesen & Pedersen, 1983). Less interference during search has also been observed for distractors grouped by proximity (Banks & Prinzmetal, 1976), good continuation (Donnelly, Humphreys, & Riddoch, 1991), and connectedness (Wolfe & Bennett, 1997) and for those located on an image plane tilted in depth (He & Nakayama, 1995).
Finally, there are indirect effects of perceptual grouping on performance in other tasks, such as the repetition detection task. In this task, subjects see a string of image elements, and they have to detect the repetition of one element; that is, they have to detect whether there are two adjacent elements that are identical. Repetitions are easier to detect if they are part of the same perceptual group than if they are part of different groups. It is likely that attention has to be directed to the repeating elements for accurate detection. The spread of attention according to the grouping cues can therefore account for the better performance if the repeating elements are part of the same group.
Thus, there is converging evidence for the role of Gestalt grouping cues in attentional processing. Attention spreads among related items to form perceptual groups, and linked items are thereby either jointly selected or blocked from further processing. Previous studies have remarked that image segmentation influences attention (e.g., Driver et al., 2001; Kahneman & Henik, 1981), but by distinguishing between linking and grouping, the IGT introduces a notion that is more precise: The representation of the image in the visual cortex enables a set of recurrent connections (linking), but these connections do not come into effect before attention starts to flow among the linked items (grouping).
If a particular grouping of features is critical for performance and is required often, the visual brain may reserve a feedforward cascade of feature detectors to detect the relevant conjunction. It can use a base-grouping strategy, although this requires more dedicated and specialized neurons than does the labeling of a distributed representation with an enhanced response. There is substantial evidence that perceptual experience induces new base groupings. The experiment of Vecera and Farah (1997) described above (Fig. 7b) indicates that our lifelong experience with upright letters makes them easier to segregate from each other, and base groupings for letters presumably form during childhood. Figure 13b illustrates how new base groupings would increase grouping speed in the example with the vacuum cleaner. Grouping of the floor brush and plug can occur on the basis of low-level features (circles with solid lines), but this would require the spread of attention across a large number of local links. The addition of more complex base groupings (dashed lines) would reduce the number of links that need to be traversed, and it can thereby speed up incremental grouping. Effects of training have indeed been observed in contour grouping (DeSchepper & Treisman, 1996; Kourtzi, Betts, Sarkheil, & Welchman, 2005), where a few hours of training have long-lasting effects.
Neurophysiological experiments have also witnessed the emergence of new base groupings. Baker, Behrmann, and Olson (2002) trained monkeys to discriminate between “batons,” elongated objects consisting of two distinct shapes joined by a straight line. Initially, many neurons in the monkey’s inferotemporal cortex were tuned to the local shapes, but not to the overall configuration. The situation changed after training, because neurons became selective for conjunctions between the shapes. Thus, new base groupings were formed after experience with behaviorally relevant feature conjunctions.
These results, taken together, suggest that new base groupings can be formed as the result of perceptual experience. Tasks that initially require incremental grouping may later be solved by base grouping if sufficient training has taken place. Training thereby makes perceptual grouping more efficient, since the groupings can now be extracted rapidly and in parallel, replacing the slower, serial incremental-grouping process.
Introspective phenomenology and availability for report
Many studies on Gestalt grouping cues have investigated the phenomenology of visual perception. Observers looked at stimuli and reported whether some of the image elements appeared to be grouped or not (Koffka, 1935; Wertheimer, 1923), an approach that has also yielded many valuable insights in more recent years (e.g., Kellman & Shipley, 1991), although there are new methods that permit probing the effects of Gestalt grouping cues through other tasks, such as the detection of a repeating element in a row of elements (Palmer & Beck, 2007), the estimation of numerosity (Franconeri, Bemis, & Alvarez, 2009) or distance (Coren & Girgus, 1980; Vickery & Chun, 2010). Here, we have approached the problem of perceptual grouping from a neurobiological perspective, and we have not put emphasis on introspective phenomenology. However, we believe that the labeling process sets the stage for the ability to report about the groupings (see also Avrahami, 1999). In the curve-tracing task in Fig. 5a, for example, the monkeys reported their percept by making an eye movement to the larger red circle at the end of the relevant curve. In this task, the early visual areas propagate the enhanced response along the target curve, and when the enhanced neuronal activity reaches the circle at the end of this curve, other areas involved in the planning of eye movements can read out the enhanced activity to initiate a saccade to the appropriate location in the visual field. Global workspace theories of awareness propose that the main difference between sensory features that do and do not enter awareness is determined by their influence on processing in other brain areas (Baars, 2002; Dehaene, Sergent, & Changeux, 2003). In the IGT, features that are labeled with enhanced neuronal activity have more influence on higher areas (Fig. 7), which is consistent with global workspace theories.
This correspondence is further supported by the results of Supèr, Spekreijse, and Lamme (2001), who showed that the enhancement of V1 responses is strong on trials where a monkey perceives a stimulus and is weaker on trials where it does not.
As was discussed above, the image elements coded by a set of active neurons (coding base groupings) that are linked by enabled connections are not yet available for report because neurons in other areas are insensitive to the enabling of connections (Fig. 13a, lower panel). The incremental-grouping process first has to make these linkages explicit by enhancing the responses of a subset of the neurons so that the groupings can be read out. In other words, unconsciously established links can only set the stage for incremental grouping, while the propagation of the enhanced neuronal response causes the formation of perceptual groups that are reportable.
Base groupings, in contrast, can exist outside awareness. Dehaene et al. (1998), for example, demonstrated that digits that are followed by a mask can exert unconscious priming effects, implying that the activation of neurons that code digits is insufficient for awareness. It has been proposed that reciprocal interactions between higher and lower visual areas are necessary for awareness in these situations (Dehaene et al., 2003; Lamme, 2003; Lamme & Roelfsema, 2000).
Comparison with previous theories
The IGT aims to link neurophysiology to psychology. This section will compare the IGT with previous theories that were specified at the psychological level (the FIT) and the neurophysiological level (binding-by-synchrony). We will start by mentioning theoretical developments and experimental findings that inspired the IGT. An important inspiration was the work of Ullman (1984), who suggested that vision starts with an early base representation that is driven by the visual stimulus and a later incremental representation that is modified by elemental operators like, for example, visual search and contour grouping that, when applied sequentially, can form visual routines. The implementation of these elemental operators in the visual cortex has been discussed elsewhere (Roelfsema, 2005; Roelfsema et al., 2000).
A number of previous studies have demonstrated that image segmentation and perceptual grouping do not always precede object recognition and attentive object selection. Driver et al. (2001), for example, described how grouping cues determine the spread of attention but also showed that the opposite relation holds: Attentional processes can influence segmentation. This article clearly indicated that theories with a strict succession of preattentive processes responsible for segmentation followed by an attentive processing stage must be incomplete. Studies by Peterson, Harvey, and Weidenbacher (1991) and Vecera and Farah (1997) also provided evidence for “late” grouping processes by demonstrating, for the first time, that image segmentation sometimes depends on the results of object recognition, contrasting with the more popular view that image segmentation precedes object recognition. Moreover, Vecera and O'Reilly (1998) demonstrated the plausibility of reciprocal interactions between object recognition and perceptual grouping processes in artificial neural networks (see also Behrmann, Zemel, & Mozer, 1998; van der Velde & de Kamps, 2001), and these results inspired the proposed interactions between higher and lower visual areas illustrated in Fig. 7. An interesting precursor of the distinction between base and incremental grouping can be found in the work of Zucker and Davis (1988), who noted that the grouping of nearby dots can be mediated by visual cortical neurons tuned to orientation, while grouping of dots with a larger spacing requires another mechanism.
The feature integration theory and related theories
The FIT (Treisman & Gelade, 1980) has been a very influential theory of the role of attention in perceptual grouping. This theory holds that features such as colors, motions, and shapes are initially registered in separate feature maps. A spotlight of attention has to be directed to the location of an image element to highlight all its features in the various feature maps so that they are bound in perception. The FIT was the first to propose that attention can be used to represent feature conjunctions that are not coded by dedicated neurons (Treisman & Gelade, 1980; Treisman & Schmidt, 1982), and this insight plays an important role in the IGT.
The distinction between a preattentive and an attentive processing stage also features in related theories of visual perception (Neisser, 1967). Many theories of visual search have adopted a preattentive processing stage that accounts for parallel search, followed by a serial stage that compares individual display items with the representation of the search target in memory (Egeth, Virzi, & Garbart, 1984; Hoffman, 1979; Wolfe, 1994). In the IGT, base and incremental grouping map onto preattentive and attentive processing, respectively, and yet there are important differences between the IGT and the FIT and these other, previous theories.
First, the FIT considers only spatial attention: A spotlight or zoom lens is directed to the spatial location of a target item to bind its features into a coherent representation. The grouping of information at a single location is a relatively easy problem if compared with the grouping of features at different locations of a spatially extended object (see Shadlen & Movshon, 1999). The incremental-grouping theory therefore holds that conjunctions between features at a single spatial location are often coded as base groupings that are extracted in parallel. Many examples of feature conjunctions that are coded by single neurons have been found in neurophysiology (Kobatake & Tanaka, 1994; Leventhal et al., 1995), and human observers extract these conjunctions rapidly (see Fig. 12; Holcombe & Cavanagh, 2001) and in parallel (McLeod, Driver, & Crisp, 1988), which is inconsistent with the FIT.
The hard problem that requires attention is the grouping of features of spatially extended objects. In this situation, attention does not act as a spotlight, however, but, rather, adopts the shape of the relevant object (Fig. 4c). There is compelling evidence that attention can be object based, which means that it can be directed selectively to an object that overlaps with another object (Behrmann et al., 1998; Blaser, Pylyshyn, & Holcombe, 2000; Duncan, 1984; O'Craven, Downing, & Kanwisher, 1999; Scholl, 2001; Watson & Kramer, 1999).
Another important difference between the IGT, on the one hand, and the FIT and other object-based theories of attention, on the other hand, is that most of the previous theories suggest that Gestalt grouping takes place preattentively (but see Driver et al., 2001), whereas the IGT claims that this is true only for the detection of local base groupings and linking (i.e., enabling of connections), while the transitive combination of local groupings on the basis of Gestalt criteria requires incremental grouping (i.e., the spread of object-based attention). Thus, while the FIT suggests that processing delays are caused by shifts of the attentional spotlight from one object to the next, the IGT holds that delays are also caused by the time-consuming spread of attention within the object representations (Houtkamp et al., 2003). Finally, the IGT proposes that the spread of attention at the psychological level corresponds to the spread of enhanced activity at the neurophysiological level and explains mechanistically why attention coselects image elements linked by Gestalt grouping cues.
The FIT and related theories have proposed that texture segregation can be used to distinguish between basic features and feature conjunctions (Bergen & Julesz, 1983; Treisman & Gelade, 1980). These theories hold that if a texture can be subdivided into regions with different features, these regions segregate effortlessly, while feature conjunctions do not support effortless texture segregation. If the base groupings of the IGT had the same status as the features of the FIT, image regions with different base groupings should segregate effortlessly (Bergen & Julesz, 1983; Julesz, 1981; Malik & Perona, 1990; see also Beck, Graham, & Sutter, 1991; Sutter, Beck, & Graham, 1989). However, the IGT does not propose that a difference between base groupings is sufficient for segregation, because theories that use texture segregation to distinguish between basic features and feature conjunctions may have oversimplified the texture segregation process. Texture segregation is actually a very complex process that relies on two complementary processes, boundary detection and region growing, which require different types of interactions between neurons and, hence, different types of connections (Grossberg & Mingolla, 1985; Mumford et al., 1987; Roelfsema et al., 2002).
The first process, boundary detection, is sensitive to abrupt changes in feature value. The processes for boundary detection are related to pop-out, the effortless detection of image elements that differ from their neighbors. Neurophysiologically plausible algorithms for boundary detection and pop-out hold that neurons tuned to similar features with adjacent receptive fields inhibit each other (Grossberg & Mingolla, 1985; Itti & Koch, 2001; Li, 1999), unlike in incremental grouping, where these cells excite each other (conjecture 8). Neurons coding elements of a homogeneous region receive maximal inhibition from their neighbors, while neurons coding elements at a boundary receive less inhibition. Boundaries therefore evoke more activity than do homogeneous image regions, and this is why boundaries are more salient.
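The boundary-detection principle described above can be illustrated with a minimal sketch. It is a toy one-dimensional example and not an implementation of any of the cited models: the function name `boundary_response`, the parameters `drive` and `w_inhib`, and the replication of the end units to avoid artificial edge effects are all assumptions made for illustration. Each unit receives a fixed feedforward drive and is inhibited by each immediate neighbor that codes the same feature, so that units at a feature boundary retain more activity.

```python
def boundary_response(features, drive=1.0, w_inhib=0.4):
    """Toy iso-feature suppression in one dimension.

    features: list of feature labels, one per receptive field.
    drive:    feedforward drive given to every unit (assumed constant).
    w_inhib:  inhibition received per same-feature neighbor (assumed value).
    """
    n = len(features)
    responses = []
    for i, feature in enumerate(features):
        same = 0
        for j in (i - 1, i + 1):
            # Replicate the end units so that the array edges do not
            # masquerade as texture boundaries.
            j = min(max(j, 0), n - 1)
            if features[j] == feature:
                same += 1
        # More same-feature neighbors means more inhibition.
        responses.append(drive - w_inhib * same)
    return responses


texture = ["A"] * 4 + ["B"] * 4
responses = boundary_response(texture)
print([round(r, 2) for r in responses])
# -> [0.2, 0.2, 0.2, 0.6, 0.6, 0.2, 0.2, 0.2]
```

In this sketch only the two units flanking the A/B border escape part of the inhibition, which corresponds to the higher activity, and hence higher salience, of boundaries in the verbal account.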
The complexity of the mechanisms for texture segregation implies that there are several reasons why some base groupings may not permit effortless texture segregation. The IGT therefore cannot separate conditions with effortless texture segregation from conditions where segregation is effortful. For example, the inhibition generated by dense displays during feedforward processing may curtail the computation of complex base groupings in higher areas (as was discussed in the Base grouping: activation of tuned neurons section). Moreover, neurons in higher visual areas that code these more complex base groupings have large receptive fields that may preclude the precise localization of boundaries. It is also conceivable that the boundary signal is too weak if neurons coding a particular base grouping are not strongly linked with inhibitory connections. Similar arguments can be made for the process of region growing that depends on feedback. Neurons in higher areas that code the more complex base groupings may not uniquely identify the image locations that belong to a homogeneous region because of their large receptive fields, and these cells may not provide feedback to the appropriate neurons in lower areas.
The FIT and related theories (e.g., Wolfe, 1994) have proposed that visual search can be used to distinguish between elementary features and feature conjunctions, but again, the complexity of search may have been underestimated. Search for features should be parallel, while search for feature conjunctions should be serial (Treisman & Gelade, 1980). However, the neuronal implementation of visual search is complex, and the absence of base groupings is only one of a number of possible reasons for search to become serial. First, dense displays may curtail feedforward processing and, hence, the computation of complex base groupings in higher areas. Moreover, models of search use inhibitory connections between neurons with similar tuning for pop-out (Bravo & Nakayama, 1992; Wolfe, 1994) (just as for texture segregation; see Fig. 14a), and not all base groupings may support pop-out. Models of search also use feedback connections from higher to lower areas to “guide” search (Bravo & Nakayama, 1992; Egeth et al., 1984). These connections propagate activity from higher areas coding the target of search to enhance the activity of neurons in lower areas that respond to target features (van der Velde & de Kamps, 2001; Wolfe, 1994). These feedback connections may not have sufficient selectivity if the base groupings are complex and not represented in early visual areas. Finally, most models of search require an additional matching process that compares candidate display items with a representation of the search target stored in memory (“examination” in Wolfe, 1994). This putative matching process is only partially understood. We therefore conclude that there are many reasons why a feature conjunction coded as a base grouping cannot be searched for in parallel and that the IGT does not predict when search becomes serial. One illustrative example is the search for items with a particular orientation. There are combinations of target and distractor orientations that do not permit parallel search (Wolfe, 1994). This finding is remarkable because there are many neurons in early visual areas that are highly selective for stimulus orientation. Differences between the processes required for visual search and texture segregation may also explain why combinations of image elements that permit effortless texture segregation do not always give rise to parallel search and, vice versa, why image elements that can be searched in parallel do not always permit texture segregation (Wolfe, 1992). Another illustrative example is the demonstration that learning can change a condition where search is serial into one where it becomes parallel (Sireteanu & Rettenbach, 1995). One exciting possibility is that this learning relies on new feature conjunctions coded as base groupings, but it is also possible that other changes in the interactions between neurons are responsible, because the presence of base groupings is not sufficient for parallel search.
In an influential study, Treisman and Schmidt (1982) demonstrated that subjects make conjunction errors: They sometimes perceive features in erroneous combinations. In their study, the subjects had to direct their attention to a large region in the display because their primary task was to report two briefly presented digits appearing at peripheral locations to the left and right of fixation. A number of colored letters were presented between the digits, and it was the subject’s secondary task to report the identity and color of the letters. An illusory conjunction occurred if the subject reported a letter of the secondary task with a color that actually belonged to a different letter. Thus, the subject might, for example, report a brown T, although a blue T and a brown R had been presented. Treisman and Schmidt interpreted this result in the context of the FIT. They suggested that features within a large attentional focus are floating freely, because attention has to constrict around one of the letters before the letter identity is correctly bound with the corresponding color.
Ashby, Prinzmetal, Ivry, and Maddox (1996) noted, however, that the subjects in this study (and other studies on illusory conjunctions) were much better at detecting genuine conjunctions than would be expected if features were completely free-floating and recombined randomly. This is consistent with the incremental grouping theory, which proposes that base groupings provide hardwired conjunctions between shapes and colors at the same location. But why would subjects sometimes erroneously recombine features that are coded as base groupings? The answer may be that Treisman and Schmidt (1982) had to present the letters so briefly to observe conjunction errors that the subjects also made feature errors: That is, they reported a letter or a color not present in the display. The occurrence of conjunction errors is not incompatible with the coding of these conjunctions as base groupings if the display duration is so short that even simple features are sometimes misperceived. It is conceivable that, at these short display durations, the representations of the identity of one letter and the color of another are occasionally preserved while the feature conjunctions are lost, so that the subjects report the features that they did perceive and remember in erroneous combinations.
Conjunction errors occur more frequently between items that belong to the same perceptual group than between items of different groups (Prinzmetal, 1981). Illusory conjunctions are, for example, more abundant between nearby items than between items at a larger separation (Cohen & Ivry, 1989). They also occur more frequently between items with a similar color, shape, or motion (Baylis, Driver, & McLeod, 1992; Ivry & Prinzmetal, 1991; Prinzmetal, 1981) and between items that are connected (Scholte et al., 2001). To explain the effects of grouping cues on illusory conjunctions, we note that the subjects may sometimes direct their attention to these items, even if they are part of a secondary task (see also Ashby et al., 1996; Treisman & Schmidt, 1982). The influence of grouping cues on the spread of attention section above reviewed the results of Kahneman and Henik (1981), who demonstrated that Gestalt grouping cues determine the features that are jointly reported in partial report tasks, presumably because these cues determine the spread of attention (for a similar view, see Prinzmetal, 1981). Thus, linked image elements are more likely to be coselected by attention, and their features are therefore more likely to be extracted together. It follows naturally that subjects are also more likely to report the features of different elements of a group in an erroneous combination than the features of items that belong to different groups.
A second theory of binding that has been specified at the neurophysiological level is binding-by-synchrony, first proposed by von der Malsburg (1999; von der Malsburg & Schneider, 1986). This theory holds that neurons that code features of the same object are active synchronously; that is, they fire their action potentials at approximately the same time with a temporal resolution of a few milliseconds, while neurons coding features of different objects fire independently (Engel et al., 1992; Phillips & Singer, 1997; Singer & Gray, 1995; von der Malsburg, 1999; von der Malsburg & Schneider, 1986; Watt & Phillips, 2000). Thus, here a temporal tag, instead of an enhanced response, labels image elements to be grouped in perception. One of the hypothesized advantages of binding-by-synchrony is that multiple incremental groups may coexist in perception, because neurons coding the image elements of different groups would fire at different phases of an oscillation (Behrmann et al., 1998; Cowan, 2001; Singer & Gray, 1995). Neurons that participate in the representation of the same perceptual group can be synchronized, while there is no synchronization between groups. The IGT, on the other hand, has only a single label. If two groups of neurons are simultaneously labeled with the enhanced response, these two groups run the risk of merging into one large group. To test this difference between the theories, we recently devised a task where subjects had to group contour elements into two curves and found that only one incremental group could form at a time (Houtkamp & Roelfsema, 2010; in accordance with preliminary data by Jolicoeur, 1988). These results provide evidence compatible with the IGT and against binding-by-synchrony.
Clearly, the largest differences between binding-by-synchrony and the IGT are at the neurophysiological level. Experimental support for binding-by-synchrony was obtained in a series of neurophysiological studies demonstrating that the responses of visual cortical neurons evoked by features of the same object were better synchronized than the responses of neurons responding to features of different objects (Singer & Gray, 1995; reviewed by Eckhorn et al., 2001). Later studies did not support these findings, however, and cast doubt on the generality of binding-by-synchrony (Shadlen & Movshon, 1999; Thiele & Stoner, 2003). Some of the discrepancies between studies are presumably related to the use of anesthetized animals in some of them and the use of awake animals in others. Only a handful of studies have measured neuronal synchrony while the animals reported about their percepts, and these studies generally have not supported the binding-by-synchrony theory (Lamme & Spekreijse, 1998; Palanca & DeAngelis, 2005; Roelfsema et al., 2004; Thiele & Stoner, 2003). Roelfsema et al. (2004) even succeeded in dissociating grouping from synchrony by using the contour grouping task described above and created stimuli where the strength of synchrony between neuronal responses evoked by the same curve was weaker than the strength of synchrony evoked by different curves. These results imply that synchrony is not a universal code for binding. Instead, the contour elements that had to be grouped in perception were invariably labeled with enhanced neuronal activity, in accordance with the IGT.
Testing the incremental-grouping theory
We will close with a number of predictions of the incremental-grouping theory that can be tested experimentally. The first and major prediction is that grouping becomes serial whenever the following two conditions are met: (1) The task requires the transitive combination of local groupings, and (2) the overall configuration is unfamiliar, so that base groupings cannot have formed (Fig. 2b). We note that this prediction distinguishes the IGT from previous theories that did not envision serial grouping. We conjecture that serial grouping is required in many everyday scenes, which usually contain objects where multiple grouping cues have to be combined in a transitive manner. Figure 13b gives one example. The components of the vacuum cleaner are all indirectly connected, and the IGT predicts that grouping of all its components depends on a time-consuming incremental grouping process.
A third prediction is the coexistence of two networks, a veridical network consisting of N-neurons and a labeling network composed of A-neurons. These networks should have ramifications in many visual areas that code different visual features such as motions, colors, shapes, and so forth. Previous neurophysiological studies on the effects of attention in these areas invariably observed neurons that were modulated by attention shifts and other neurons that were not, although the properties of the modulating neurons usually received the most emphasis (Luck, Chelazzi, Hillyard, & Desimone, 1997; Roelfsema et al., 1998; Treue & Maunsell, 1996). Future studies could examine whether the propagation of an enhanced response indeed depends on the specific interaction between A- and N-neurons proposed in Fig. 6c. A related aspect open to experimental testing is the proposed multiplicative interaction between the feedforward connections and recurrent connections. This specific interaction is responsible for the linking process—that is, the enabling of connections between neurons activated by feedforward connections and the disabling of connections between neurons that are not activated by the stimulus. Previous studies have shown that attention mainly amplifies the response of well-driven neurons in the visual cortex and has comparatively little effect on neurons that are not well driven (McAdams & Maunsell, 1999; Treue & Martínez Trujillo, 1999), in support of such a multiplicative interaction. Future studies could test whether comparable interactions are responsible for the spread of the enhanced response among image elements related to each other by Gestalt grouping rules.
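The proposed multiplicative interaction can be made concrete with a small sketch. It is a toy model under stated assumptions and not the circuit of Fig. 6c: the function `spread_label`, the binary `drive` values standing in for feedforward activation, and the `links` dictionary standing in for recurrent connections between elements related by a grouping cue are all hypothetical. A connection is treated as enabled only if the units at both ends are driven by the stimulus, and the enhanced-response label advances one enabled connection per time step.

```python
def spread_label(drive, links, seed, max_steps=100):
    """Spread an enhanced-response label from `seed` across enabled links.

    drive: list of 0/1 feedforward activations (assumed binary for simplicity).
    links: dict mapping a unit index to the indices of its neighbors.
    Returns, for every unit, the time step at which it was labeled,
    or None if the label never reached it.
    """
    labeled_at = [None] * len(drive)
    if drive[seed]:
        labeled_at[seed] = 0
    for step in range(1, max_steps + 1):
        newly_labeled = []
        for i in range(len(drive)):
            if labeled_at[i] is not None:
                continue
            # Multiplicative gating: input from an already labeled neighbor j
            # counts only if both i and j are driven by the stimulus.
            gated_input = sum(
                drive[i] * drive[j]
                for j in links.get(i, [])
                if labeled_at[j] is not None
            )
            if gated_input > 0:
                newly_labeled.append(i)
        if not newly_labeled:
            break
        # Update after the sweep so the label moves one link per step.
        for i in newly_labeled:
            labeled_at[i] = step
    return labeled_at


# A chain of five elements; element 3 is not driven by the stimulus.
chain = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2, 4], 4: [3]}
print(spread_label([1, 1, 1, 0, 1], chain, seed=0))
# -> [0, 1, 2, None, None]
```

In this sketch the label stops at the undriven element, so element 4 is never reached although it is active, and the labeling time grows with the number of links separating an element from the seed, in line with the time-consuming character of incremental grouping.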
A fourth prediction derives from the availability of one label for incremental grouping. A strong prediction of the IGT is that it is not possible to simultaneously group two sets of image elements that are linked only transitively by a chain of local grouping cues. Our recent study (Houtkamp & Roelfsema, 2010) supported this prediction during a contour grouping task. We note for clarity that the theory does permit the coexistence of an incremental group with a number of base groupings that are extracted during feedforward processing.
The topic of perceptual grouping or binding has been a controversial issue for many years, with viewpoints ranging from the idea that binding is crucial for perception (Engel et al., 1992; Singer & Gray, 1995) to the view that most binding problems are solved during feedforward processing (Ghose & Maunsell, 1999; Riesenhuber & Poggio, 1999a). We believe that our consideration of a wide range of experimental findings in neurophysiology and psychology has provided a new conceptual framework for perceptual grouping that is able to reconcile many of the discrepancies. The predictions above are only a few of those made by the IGT, and it is exciting to anticipate the experiments that will put these predictions to the test.
The terms base grouping and incremental grouping are based on a distinction initially made by Ullman (1984) between an early visual representation and a representation that can be modified by visual routines.
Vecera (1994) actually referred to a set of linked locations as a grouped array. Here, we avoided this terminology to prevent confusion. We use the word linked for a set of image elements that are related to each other by Gestalt grouping cues. At a neurophysiological level of description, these elements are linked by a chain of enabled connections. We use incrementally grouped for the set of elements that are labeled by an enhanced response—that is, attention. These groupings have been made explicit by the labeling process and, thereby, become available for the subject’s report.
We thank Jochen Braun for helpful comments on the manuscript. The work was supported by a grant from the HFSP, a grant from the European Union (EU IST Cognitive Systems, project 027198 "Decisions in Motion"), an NWO-MaGW grant, and an NWO VICI grant awarded to P.R.R.
This article is distributed under the terms of the Creative Commons Attribution Noncommercial License which permits any noncommercial use, distribution, and reproduction in any medium, provided the original author(s) and source are credited.