How do object reference frames and motion vector decomposition emerge in laminar cortical circuits?
How do spatially disjoint and ambiguous local motion signals in multiple directions generate coherent and unambiguous representations of object motion? Various motion percepts, starting with those of Duncker (Induced motion, 1929/1938) and Johansson (Configurations in event perception, 1950), obey a rule of vector decomposition, in which global motion appears to be subtracted from the true motion path of localized stimulus components, so that objects and their parts are seen as moving relative to a common reference frame. A neural model predicts how vector decomposition results from multiple-scale and multiple-depth interactions within and between the form- and motion-processing streams in V1–V2 and V1–MST, which include form grouping, form-to-motion capture, figure–ground separation, and object motion capture mechanisms. Particular advantages of the model are that these mechanisms solve the aperture problem, group spatially disjoint moving objects via illusory contours, capture object motion direction signals on real and illusory contours, and use interdepth directional inhibition to cause a vector decomposition, whereby the motion directions of a moving frame at a nearer depth suppress those directions at a farther depth, and thereby cause a peak shift in the perceived directions of object parts moving with respect to the frame.
Keywords: Motion perception · Vector decomposition · Frames of reference · Peak shift · Complementary computing · V2 · MT · MST
How do we make sense of the complex motions of multiple interacting objects and their parts? One required computational step is to represent the various motion paths in an appropriate reference frame. Various ways of defining a reference frame have been proposed, ranging from retinocentric, in which an object is coded relative to the location of the activity it induces on the retina, to geocentric, in which objects are represented independent of the observer’s viewpoint (Wade & Swanston, 1987). According to an object-centered reference frame (Bremner, Bryant, & Mareschal, 2005; Wade & Swanston, 1996), objects are perceived relative to other objects. For example, on a cloudy night, the moon may appear to be moving in a direction opposite to that of the clouds. In a laboratory setting, this concept is well-illustrated by induced-motion experiments, wherein the motion of one object appears to cause opponent motion in another, otherwise static, object (Duncker, 1929/1938).
Frames of reference
From a functional perspective, the creation of perceptual relative frames of reference may be one mechanism evolved by the brain to represent the motion of individual objects in a scene. This ability appears especially important when considering that the meaningfulness of the motion of a particular object can often be compromised by the motion of another object. For example, when looking at a person waving a hand from a moving train, the motion components of the hand and the train become mixed together. By representing the motion of the hand relative to that of the train, the motion component of the train can be removed and the motion of the hand itself recovered (Rock, 1990). Relative reference frames may also be more sensitive to subtle variations in the visual scene, as suggested by the lower thresholds for motion detection in the presence of a neighboring stationary reference than in completely dark environments (Sokolov & Pavlova, 2006).
Another evolutionary advantage may be that information represented in an object-centered reference frame is partly invariant to changes in viewpoint (Wade & Swanston, 2001). Furthermore, as exemplified by the model presented here, computing an object-centered reference frame does not necessitate a viewer-centered representation (Sedgwick, 1983; Wade & Swanston, 1987), making it an efficient substitute for the latter.
How does the laminar organization of visual cortex create such a reference frame? The neural model proposed in this article predicts how the form and motion pathways in cortical areas V1, V2, MT, and MST accomplish this task using multiple-scale and multiple-depth interactions within and between form- and motion-processing streams in V1–V2 and V1–MT. These mechanisms have been developed elsewhere to explain data about motion perception by proposing how the brain solves the aperture problem. Wallach (1935/1996) first showed that the motion of a featureless line seen behind a circular aperture is perceptually ambiguous: No matter what may be the real direction of motion, the perceived direction is perpendicular to the orientation of the line—that is, the normal component of motion. The aperture problem is faced by any localized neural motion sensor, such as a neuron in the early visual pathway, which responds to a local contour moving through an aperture-like receptive field. In contrast, a moving dot, line end, or corner provides unambiguous information about an object’s true motion direction (Shimojo, Silverman, & Nakayama, 1989). The barber pole illusion demonstrates how the motion of a line is determined by unambiguous signals formed at its terminators and how these unambiguous signals capture the motion of nearby ambiguous motion regions (Ramachandran & Inada, 1985; Wallach, 1935/1996). The model proposes how such moving visual features activate cells in the brain that compute feature-tracking signals that can disambiguate an object’s true direction of motion. Our model does not rely on local pooling across motion directions, which has been shown not to be able to account for various data on motion perception (Amano, Edwards, Badcock, & Nishida, 2009). 
Instead, a dominant motion direction is determined over successive competitive stages with increasing receptive-field sizes, while preserving various candidate motion directions at each spatial position up to the highest model stages, where motion-grouping processes determine the perceived directions of object motion.
The model is here further developed to simulate key psychophysical percepts, including those of classical motion perception experiments (Johansson, 1950), the Duncker wheel (Duncker, 1929/1938), and variants thereof, and to cast new light on various related experimental findings. In particular, the model makes sense of psychophysical evidence suggesting that properties shared by groups of objects determine a common coordinate frame relative to which the features particular to individual objects are perceived. This process is well summarized in the classical concept of vector decomposition (Johansson, 1950).
Figure 2a shows vector components into which downward and leftward motions of the individual dots can be decomposed. If the moving frame captures the diagonal direction down-and-left, as in Fig. 2b, the individual dots are left with components that oscillate toward and away from each other, as in Fig. 2c. A complete account of vector decomposition requires simultaneously representing common- and part-motion components. In our model, simultaneous representation of both types of motion is made possible by having cells from different depth planes represent the different motion components. Subtraction of the common-motion component is due to inhibition from cells coding for the nearer depth to cells coding for the farther depth. We show below how interdepth directional inhibition causes a peak shift (Grossberg & Levine, 1976) in directional selectivity that behaves like a vector decomposition.
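The arithmetic of this decomposition can be sketched in a few lines of NumPy (the unit velocities and the use of the mean as the common component are illustrative assumptions, not the model's neural mechanism, which is developed in later sections):

```python
import numpy as np

# Velocities of the two Johansson dots at one instant (hypothetical units):
# one dot moves purely leftward, the other purely downward.
v_left_dot = np.array([-1.0, 0.0])
v_down_dot = np.array([0.0, -1.0])

# The common (frame) motion is the component shared by both dots,
# taken here as the mean velocity: the down-and-left diagonal.
v_common = (v_left_dot + v_down_dot) / 2.0

# Vector decomposition: subtracting the frame motion leaves residual
# part-motions that point in opposite directions along the other diagonal,
# so the dots appear to move toward and away from each other.
r_left = v_left_dot - v_common
r_down = v_down_dot - v_common

print(v_common)  # [-0.5 -0.5]
print(r_left)    # [-0.5  0.5]
print(r_down)    # [ 0.5 -0.5]
```

The residuals are equal and opposite, which is why the part motions are seen as a symmetric oscillation relative to the moving frame.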
Following Johansson (1950), vector decomposition has been invoked to explain motion perception in multiple experiments employing a variety of stimulus configurations (e.g., Börjesson & von Hofsten, 1972, 1973, 1975, 1977; Cutting & Proffitt, 1982; Di Vita & Rock, 1997; Gogel & MacCracken, 1979; Gogel & Tietz, 1976; Johansson, 1974; Post, Chi, Heckmann, & Chaderjian, 1989). The bulk of this work supports the view that vector decomposition is a useful concept in characterizing object-centric frames of reference in motion perception. However, no model has so far attempted to explain how vector decomposition results from the perceptual mechanisms embedded in the neural circuits of the visual system.
The present article fills this gap by further developing the 3D FORMOTION model (Baloch & Grossberg, 1997; Berzhanskaya, Grossberg, & Mingolla, 2007; Chey, Grossberg, & Mingolla, 1997, 1998; Francis & Grossberg, 1996a, 1996b; Grossberg, Mingolla, & Viswanathan, 2001; Grossberg & Pilly, 2008). As the model’s name suggests, it proposes how form and motion processes interact to form coherent percepts of object motion in depth and already proposes a unified mechanistic explanation of many perceptual facts, including the barber pole illusion, plaid motion, and transparent motion. Form and motion processes, such as those in V2/V4 and MT/MST, occur in the ventral “what” and dorsal “where” cortical processing streams, respectively. Key mechanisms within the ventral “what” and dorsal “where” streams seem to obey computationally complementary laws (Grossberg, 1991, 2000): The ability of each process to compute some properties prevents it from computing other, complementary, properties. Examples of such complementary properties include boundary completion versus surface filling-in—within the (V1 interblob)–(V2 interstripe) and (V1 blob)–(V2 thin stripe) streams, respectively—and, more relevant to the results herein, boundary orientation and precise depth versus motion direction and coarse depth—within the V1–V2 and V1–MT streams, respectively. The present article clarifies some of the interactions between form and motion processes that enable them to overcome their complementary deficiencies and to thereby compute more informative representations of unambiguous object motion.
3D FORMOTION model
Figure–ground separation mechanisms play a key role in explaining vector decomposition data. Many data about figure–ground perception have been modeled as part of the form-and-color-and-depth (FACADE) theory of 3-D vision (e.g., Cao & Grossberg, 2005, 2011; Fang & Grossberg, 2009; Grossberg, 1994, 1997; Grossberg & Kelly, 1999; Grossberg & McLoughlin, 1997; Grossberg & Pessoa, 1998; Grossberg & Yazdanbakhsh, 2005; Kelly & Grossberg, 2001). FACADE theory describes how 3-D boundary and surface representations are generated within the blob and interblob cortical processing streams from cortical areas V1 to V4. Figure–ground separation processes that are needed for the present analysis are predicted to be completed within the pale stripes of cortical area V2. These figure–ground processes help to segregate occluding and occluded objects, along with their terminators, onto different depth planes.
In response to the dot displays of Fig. 1, the model clarifies how an illusory contour forms between the pair of moving dots within cortical area V2 and captures motion direction signals in cortical area MT via a form-to-motion, or formotion, interaction from V2 to MT. The captured motion direction of this illusory contour causes vector decomposition of the motion directions of the individual dots. Indeed, at the intersection of an illusory contour and a dot, contour curvature is greater in the dot’s real boundary than in the illusory contour-completed boundary, since the illusory contour is tangent to the dot boundary. This greater curvature initially results in a weaker representation of the dots’ boundaries in area V2. These boundaries are then pushed farther in depth than the grouped illusory contour-completed shape due to interacting processes of orientational competition, boundary pruning, and boundary enrichment, which are described and simulated in the FACADE theory.
Level 1: Input from LGN
In the 3D FORMOTION model of Berzhanskaya et al. (2007), as in the present model, the boundary input is not depth-specific. Rather, the boundary input models signals that come from the retina and LGN into V1 (Xu, Bonds, & Casagrande, 2002). This boundary is represented in both ON and OFF channels. After V1 motion processing, described below, the motion signal then goes on to MT and MST. The 3-D figure–ground-separated boundary inputs in the present model come from V2 to MT and select bottom-up motion inputs from V1 in a depth-selective way. This process clarifies how the visual system uses occlusion cues to segregate moving boundaries into different depth planes, even though the inputs themselves occur within the same depth plane.
Berzhanskaya et al. (2007) showed how a combination of habituative (Appendix Eqs. 7, 8 and 9) and depth selection (Appendix Eq. 20) mechanisms accomplish the required depth segregation of motion signals in stimuli containing both static and moving components, such as chopstick displays (Lorenceau & Alais, 2001). In particular, habituative preprocessing enables motion cues to trigger the activation of transient cells (model Level 2 in Fig. 3), whereas signals due to static elements in the display habituate and become weak over time. As simulated by Berzhanskaya et al. (2007), this mechanism can explain why visible occluders in a chopstick display generate weaker motion signals at all depth planes. Although not necessary in the present simulations due to the absence of static elements in the displays, habituative mechanisms in the early stages of the model are included to enable a unified explanation of the data.
The motion selection mechanism separates motion signals in depth by using depth-separated boundary signals from V2 to MT. The model of Berzhanskaya et al. (2007) simulated in greater detail the formation of these depth-separated boundaries. The present model uses algorithmically defined boundaries to simplify the simulations. The model shows how these boundaries can capture only the appropriate motion signals onto their respective depth planes in MT. Although the question of how the time course of boundary formation impacts vector decomposition is not analyzed in detail in the present article, in part because there do not seem to be empirical data on this matter, some of our results nevertheless begin to address this issue, such as the persistence of the perceived motion until a large fraction of the boundary is pruned (see Fig. 15).
Level 2: Transient cells
The second stage of the motion processing system (Fig. 3) consists of nondirectional transient cells, inhibitory directional interneurons, and directional transient cells. The nondirectional transient cells respond briefly to a change in the image luminance, irrespective of the direction of movement (Appendix Eqs. 7, 8 and 9). Such cells respond well to moving boundaries and poorly to static objects because of the habituation that creates the transient response. The type of adaptation that leads to these transient cell responses is known to occur at several stages in the visual system, ranging from retinal Y cells (Enroth-Cugell & Robson, 1966; Hochstein & Shapley, 1976a, 1976b) to cells in V1 and V2 (Abbott, Sen, Varela, & Nelson, 1997; Carandini & Ferster, 1997; Chance, Nelson, & Abbott, 1998; Francis & Grossberg, 1996a, 1996b; Francis, Grossberg, & Mingolla, 1994; Varela, Sen, Gibson, Fost, Abbott, & Nelson, 1997) and beyond. The nondirectional transient cells send signals to inhibitory directional interneurons and directional transient cells, and the inhibitory interneurons interact with each other and with the directional transient cells (Eqs. 10–12). A directional transient cell fires vigorously when a stimulus is moved through its receptive field in one direction (called the preferred direction), while motion in the reverse direction (called the null direction) evokes little response (Barlow & Levick, 1965).
The directional inhibitory interneuronal interaction enables the directional transient cells to realize directional selectivity at a wide range of speeds (Chey et al., 1997; Grossberg et al., 2001). Although in the present model directional interneurons and transient cells correspond to cells in V1, this predicted interaction is consistent with retinal data concerning how bipolar cells interact with inhibitory starburst amacrine cells and direction-selective ganglion cells and how starburst cells interact with each other and with ganglion cells (Fried, Münch, & Werblin, 2002). The possible role of starburst cell inhibitory interneurons in ensuring directional selectivity at a wide range of speeds has not yet been tested. The model is also consistent with physiological data from cat and macaque species showing that directional selectivity first occurs in V1 and that it is due, at least in part, to inhibition that reduces the response to the null direction of motion (Livingstone, 1998; Livingstone & Conway, 2003; Murthy & Humphrey, 1999).
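A minimal 1-D sketch of these two stages, assuming a Barlow–Levick-style delayed veto and illustrative habituation rates (the model's actual dynamics are given by Appendix Eqs. 7–12):

```python
import numpy as np

def habituative_transient(stim, recovery=0.05, depletion=0.5):
    # Nondirectional transient cells (sketch of Appendix Eqs. 7-9): input is
    # gated by a transmitter z that depletes under sustained stimulation, so
    # static contours fade while moving contours keep responding strongly.
    n_t, n_x = stim.shape
    z = np.ones(n_x)
    resp = np.zeros_like(stim)
    for t in range(n_t):
        resp[t] = stim[t] * z
        z = np.clip(z + recovery * (1.0 - z) - depletion * stim[t] * z, 0.0, 1.0)
    return resp

def directional_transient(resp, prefer_right=True):
    # Directional transient cells (sketch of Eqs. 10-12): delayed inhibition
    # from the neighbor on the null side vetoes responses to null-direction
    # motion, while preferred-direction motion escapes the veto.
    out = np.zeros_like(resp)
    shift = -1 if prefer_right else 1   # null side is x+1 for rightward cells
    for t in range(1, resp.shape[0]):
        null_side = np.roll(resp[t - 1], shift)
        out[t] = np.maximum(resp[t] - null_side, 0.0)
    return out

# A single bright dot stepping rightward across 10 positions.
stim = np.zeros((6, 10))
for t in range(6):
    stim[t, t + 2] = 1.0

resp = habituative_transient(stim)
right = directional_transient(resp, prefer_right=True)
left = directional_transient(resp, prefer_right=False)
print(right.sum(), left.sum())  # rightward cells fire; leftward cells are vetoed
```

For the rightward-moving dot, rightward-preferring cells respond at every step, whereas leftward-preferring cells are silenced by the delayed inhibition, capturing the preferred/null asymmetry of Barlow and Levick (1965).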
Level 3: Short-range filter
A key step in solving the aperture problem is to strengthen unambiguous feature-tracking signals relative to ambiguous motion signals. Feature-tracking signals are often generated by a relatively small number of moving features in a scene, yet can have a very large effect on motion perception. One process that strengthens feature-tracking signals relative to ambiguous aperture signals is the short-range directional filter (Fig. 3). Cells in this filter accumulate evidence from directional transient cells of similar directional preference within a spatially anisotropic region that is oriented along the preferred direction of the cell. This computation selectively strengthens the responses of short-range filter cells to feature-tracking signals at unoccluded line endings, object corners, and other scenic features (Appendix Eq. 13). The use of a short-range filter followed by competition at Level 4 eliminates the need for an explicit solution of the feature correspondence problem that various other models posit and attempt to solve (Reichardt, 1961; Ullman, 1979; van Santen & Sperling, 1985).
The short-range filter uses multiple spatial scales (Appendix Eq. 15). Each scale responds preferentially to a specific speed range. Larger scales respond better to faster speeds due to thresholding of short-range filter outputs with a self-similar threshold; that is, a threshold that increases with filter size (Appendix Eq. 16). Larger scales thus require “more evidence” to fire (Chey et al., 1998).
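A 1-D caricature of the self-similar threshold, assuming a rightward-preferring filter and treating speed as the length of the motion trail inside the integration window (all parameters are hypothetical):

```python
import numpy as np

def short_range_filter(dir_signal, scale, theta_per_unit=0.8):
    # Accumulate same-direction evidence over a window elongated along the
    # preferred direction (sketch of Appendix Eq. 13), then apply a
    # self-similar threshold that grows with filter size (sketch of Eq. 16):
    # larger scales require "more evidence" to fire.
    summed = np.convolve(dir_signal, np.ones(scale), mode="same")
    return np.maximum(summed - theta_per_unit * scale, 0.0)

# A slowly moving feature leaves a short trail of directional evidence;
# a fast one leaves a long trail within the same integration window.
slow_trail = np.zeros(20); slow_trail[10] = 1.0
fast_trail = np.zeros(20); fast_trail[8:12] = 1.0

print(short_range_filter(slow_trail, 1).max())  # > 0: small scale fires
print(short_range_filter(slow_trail, 4).max())  # 0: large scale stays silent
print(short_range_filter(fast_trail, 4).max())  # > 0: fast motion drives the large scale
```

The large scale only crosses its (larger) threshold for the long trail, illustrating why each scale responds preferentially to a specific speed range.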
Level 4: Spatial competition and opponent direction competition
Two kinds of competition further enhance the relative advantage of feature-tracking signals (Fig. 3 and Appendix Eqs. 17, 18 and 19). These competing cells are proposed to occur in Layer 4B of V1 (Fig. 3). Spatial competition among cells of the same spatial scale that prefer the same motion direction boosts the amplitude of feature-tracking signals relative to those of ambiguous signals. Feature-tracking signals are contrast-enhanced by such competition because they are often found at motion discontinuities, and thus get less inhibition than ambiguous motion signals that lie within an object’s interior. Opponent-direction competition also occurs at this processing stage (Albright, 1984; Albright, Desimone, & Gross, 1984) and ensures that cells tuned to opposite motion directions are not simultaneously active.
The activity pattern at this model stage is consistent with the data of Pack, Gartland, and Born (2004). In their experiments, V1 cells demonstrated an apparent suppression of responses to motion along visible occluders. A similar suppression occurs in the model, due to the adaptation of transient inputs to static boundaries. Also, cells in the middle of a grating respond more weakly than cells at the edge of the grating. Spatial competition in the model between motion signals performs divisive normalization and end-stopping, which together amplify the strength of directionally unambiguous feature-tracking signals at line ends relative to the strength of aperture-ambiguous signals along line interiors.
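The relative advantage of line-end signals under divisive surround competition can be sketched as follows (the Gaussian surround and its width are illustrative assumptions, not the Appendix kernels):

```python
import numpy as np

def spatial_competition(signal, sigma_s=4.0):
    # Divisive off-surround competition among same-direction cells (sketch
    # of Appendix Eqs. 17-19): each cell is normalized by a Gaussian-weighted
    # pool of its neighbors' activity.
    x = np.arange(len(signal))
    d = x[:, None] - x[None, :]
    surround = np.exp(-d**2 / (2 * sigma_s**2))
    return signal / (1.0 + surround @ signal)

# Uniform motion signals along a line: interior cells sit inside a dense
# surround and are suppressed more, so the line-end (feature-tracking)
# signals gain a relative advantage.
line = np.zeros(21); line[5:16] = 1.0
out = spatial_competition(line)
print(out[5] > out[10])  # True: end signal exceeds interior signal
```

Even though every input along the line is identical, the ends emerge stronger after normalization, mimicking the end-stopped enhancement of feature-tracking signals.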
Level 5: Long-range filter and formotion selection
Motion signals from model Layer 4B of V1 input to model area MT. Area MT also receives a projection from V2 (Anderson & Martin, 2002; Rockland, 1995) that carries depth-specific figure–ground-separated boundary signals whose predicted properties were supported by Ponce, Lomber, and Born (2008). These V2 form boundaries select the motion signals (formotion selection) by selectively capturing at different depths the motion signals coming into MT from Layer 4B of V1 (Appendix Eq. 20).
Formotion selection from V2 to MT is depth-specific. At the nearer depth D1, V2 boundary signals that correspond to the illusory contour grouping select the larger-scale motion signals (Fig. 5a) and suppress motion signals at other locations in that same depth. At the farther depth D2, V2 boundary signals that correspond to the individual dots (Fig. 5b) select motion signals that represent the motion of individual parts of the stimulus.
Boundary-gated signals from Layer 4 of MT are proposed to input to the upper layers of MT (Fig. 3; Appendix Eq. 22), where they activate directionally selective, spatially anisotropic filters via long-range horizontal connections (Appendix Eq. 25). In this long-range directional filter, motion signals coding the same directional preference are pooled from object contours with multiple orientations and opposite contrast polarities. This pooling process creates a true directional cell response (Chey et al., 1997; Grossberg et al., 2001; Grossberg & Rudd, 1989, 1992).
The long-range filter accumulates evidence of a given motion direction using a kernel that is elongated in the direction of that motion, much as in the case of the short-range filter. This hypothesis is consistent with data showing that approximately 30% of the cells in MT show a preferred direction of motion that is aligned with the main axis of their receptive fields (Xiao, Raiguel, Marcar, & Orban, 1997). Long-range filtering is performed at multiple scales according to the size–distance invariance hypothesis (Chey et al., 1997; Hershenson, 1999): Signals in the nearer depth are filtered at a larger scale, and signals in the farther depth are filtered at a smaller scale.
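An anisotropic kernel of this kind, and its preference for evidence aligned with the motion axis, can be sketched as follows (sizes and widths are hypothetical; the kernel grid is centered on the point midway between two dots):

```python
import numpy as np

def elongated_kernel(size, direction_deg, sigma_par=4.0, sigma_perp=1.0):
    # Anisotropic Gaussian elongated along the cell's preferred motion
    # direction (a sketch of the long-range kernel of Appendix Eq. 25).
    ax = np.arange(size) - size // 2
    X, Y = np.meshgrid(ax, ax)
    th = np.deg2rad(direction_deg)
    u = X * np.cos(th) + Y * np.sin(th)    # along the motion axis
    v = -X * np.sin(th) + Y * np.cos(th)   # across the motion axis
    k = np.exp(-(u**2 / (2 * sigma_par**2) + v**2 / (2 * sigma_perp**2)))
    return k / k.sum()

# Two dots whose common motion lies along the 45-deg diagonal. The 21x21
# kernel grid coincides with the map, so the elementwise product sums give
# the response of a cell centered midway between the dots.
motion_map = np.zeros((21, 21))
motion_map[6, 6] = motion_map[14, 14] = 1.0

aligned = (motion_map * elongated_kernel(21, 45.0)).sum()
crossed = (motion_map * elongated_kernel(21, 135.0)).sum()
print(aligned > 10 * crossed)  # True: alignment with the motion axis wins
```

A cell whose kernel is elongated along the dots' common diagonal pools evidence from both dots, whereas the cross-oriented kernel collects almost nothing, consistent with the receptive-field alignment reported by Xiao et al. (1997).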
The model hereby predicts that common and part motions are simultaneously represented by different cell populations in MT due to form selection. This type of effect may be compared with the report that some MT neurons are responsive to the global motion of a plaid stimulus, whereas others respond to the motion of its individual sinusoidal grating components (Rust, Mante, Simoncelli, & Movshon, 2006; Smith, Majaj & Movshon, 2005).
The long-range filter cells in Layer 2/3 of model MT are proposed to play a role in binding together directional information that is homologous to the coaxial and collinear accumulation of orientational evidence within Layer 2/3 of the pale stripes of cortical area V2 for perceptual grouping of form (Grossberg, 1999; Grossberg & Raizada, 2000; Hirsch & Gilbert, 1991). This anisotropic long-range motion filter allows directional motion signals to be integrated across the illusory contours in Fig. 5a that link the pair of dots.
Level 6: Directional grouping, near-to-far inhibition, and directional peak shift
The model processing stages up to now have not fully solved the aperture problem. Although they can amplify feature-tracking signals and assign motion signals to the correct depths, they cannot yet explain how feature-tracking signals can propagate across space to select consistent motion directions from ambiguous motion directions, without distorting their speed estimates, and at the same time suppress inconsistent motion directions. They also cannot explain how motion integration can compute a vector average of ambiguous motion signals across space to determine the perceived motion direction when feature-tracking signals are not present at that depth. The final stage of the model accomplishes this goal by using a motion grouping network (Appendix Eq. 28), interpreted to occur in ventral MST (MSTv), both because MSTv has been shown to encode object motion (Tanaka, Sugita, Moriya & Saito, 1993) and because it is a natural anatomical marker, given the model processes that precede and succeed it. We predict that feedback between MT and MST determines the coherent motion direction of discrete moving objects.
The motion grouping network works as follows: Cells that code similar directions in MT send convergent inputs to cells in MSTv. Unlike the previous 3D FORMOTION model, in which MST cells received input only from MT cells of the same direction, here a weighted sum over directions provides the input to the motion grouping cells (Appendix Eq. 29). Thus, for example, cells tuned to the southwest direction receive excitatory input not only from cells coding for that direction but also, to a lesser extent, from cells tuned to either the south or west direction, enabling a stronger representation of the common motion of the two dots.
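This weighted pooling over directions can be sketched with an eight-channel discretization (the Gaussian direction kernel and its width are illustrative assumptions):

```python
import numpy as np

# Eight direction channels, 45 deg apart: 0=E, 45=NE, ..., 180=W, 225=SW, 270=S.
dirs = np.arange(8) * 45.0

def direction_weights(sigma=45.0):
    # Weighted sum over directions (sketch of Appendix Eq. 29): an MSTv cell
    # tuned to one direction pools MT input from nearby directions with a
    # Gaussian falloff on the circular direction difference.
    diff = np.abs(dirs[:, None] - dirs[None, :])
    diff = np.minimum(diff, 360.0 - diff)
    return np.exp(-diff**2 / (2 * sigma**2))

# Pooled MT activity for the two-dot display: one dot moves west (180 deg),
# the other south (270 deg); no cell signals southwest directly.
mt_pooled = np.zeros(8)
mt_pooled[4] = 1.0   # westward dot
mt_pooled[6] = 1.0   # southward dot

mst = direction_weights() @ mt_pooled
winner = dirs[mst.argmax()]
print(winner)  # 225.0: the common south-west direction wins the competition
```

Although neither dot signals southwest on its own, the southwest channel receives partial input from both the south and west channels and therefore wins the directional competition, yielding the common frame motion.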
Directional competition at each position then determines a winning motion direction. This winning directional cell then feeds back to its source cells in MT. This feedback supports the activity of MT cells that code the winning direction, while suppressing the activities of cells that code all other directions. This motion grouping network enables feature-tracking signals to select similar directions at nearby ambiguous motion positions, while suppressing other directions there. These competitive processes take place in each depth plane, consistent with the fact that direction-tuned cells in MST are also disparity-selective (Eifuku & Wurtz, 1999). On the next cycle of the feedback process, these newly unambiguous motion directions in MT select consistent MSTv grouping cells at positions near them. The grouping process hereby propagates across space as the feedback signals cycle through time between MT and MSTv.
Berzhanskaya et al. (2007), Chey et al. (1997), and Grossberg et al. (2001) have used this motion-grouping process to simulate data showing how the present model solves the aperture problem. Pack and Born (2001) provided supportive neurophysiological data, wherein the responses of MT cells over time to the motion of the interior of an extended line dynamically modulate away from the local direction that is perpendicular to the line and toward the direction of line terminator motion.
Both the V2-to-MT and the MSTv-to-MT signals carry out selection processes using modulatory on-center, off-surround interactions. The V2-to-MT signals select motion signals at the locations and depth of a moving boundary. The MST-to-MT signals select motion signals in the direction and depth of a motion grouping. Such a modulatory on-center, off-surround network was predicted by Adaptive Resonance Theory to carry out attentive selection processes in a manner that enables fast and stable learning of appropriate features to occur. See Raizada and Grossberg (2003) for a review of behavioral and neurobiological data that support this prediction in several brain systems. Direct experiments to test it in the above cases still remain to be done.
Near-to-far inhibition and peak shift are the processes whereby MST cells that code nearer depth inhibit MST cells that code similar directions and positions at farther depths. In previous 3D FORMOTION models, this near-to-far inhibition only involved MST cells of the same direction. Depth suppression in the present model is done via a Gaussian kernel in direction space (Appendix Eq. 31). When this near-to-far inhibition acts, it causes a peak shift in the maximally activated direction at the farther depth. This peak shift causes vector decomposition.
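How a Gaussian kernel in direction space yields a peak shift can be sketched as follows (tuning widths, the inhibitory gain, and the 210°/225° directions are all hypothetical):

```python
import numpy as np

theta = np.arange(0.0, 360.0, 5.0)   # direction preferences, in degrees

def tuning(center, sigma=30.0):
    # Gaussian activity profile over the circular direction variable.
    d = np.abs(theta - center)
    d = np.minimum(d, 360.0 - d)
    return np.exp(-d**2 / (2 * sigma**2))

# Far-depth (small-scale) activity peaks at the dot's physical
# down-and-left direction, taken here as 210 deg.
far = tuning(210.0)

# Near-depth (large-scale) activity codes the frame's common southwest
# motion (225 deg); near-to-far inhibition subtracts a Gaussian-weighted
# copy of it in direction space (a sketch of Appendix Eq. 31).
near = tuning(225.0)
far_after = np.maximum(far - 0.7 * near, 0.0)

print(theta[far.argmax()])        # 210.0: the physical direction
print(theta[far_after.argmax()])  # 195.0: peak shifted away from the frame
```

The maximally active direction at the farther depth moves away from the frame's direction, which is the signature of vector decomposition in the model.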
Model circuit | Functional role | Selected reference
LGN → V1 Layer 4 | Strong LGN input; ON and OFF center–surround | —
V1 Layer 4 nondirectional transient cells → directional transient cells | — | De Valois, Cottaris, Mahon, Elfar, and Wilson (2000)
V1 Layer 4B → MT Layers 4 and 6 | Feedforward local motion input to MT | Anderson, Binzegger, Martin, and Rockland (1998)
V2 → MT Layers 4, 5/6 | Boundary selection of motion in depth | —
MT Layer 2/3 large receptive fields | Long-range spatial summation of motion | Born and Tootell (1992)
MT Layer 2/3 → MST | Directional motion grouping | Maunsell and van Essen (1983)
MST → MT Layer 2/3 | Selection of consistent motion direction | Maunsell and van Essen (1983)
Simulation of psychophysical experiments
Symmetrically moving inducers
For ease of viewing, network activity is overlaid on top of the V2 boundary input, which is depicted in gray. Motion signals selected by V2 boundaries in MT Layers 4 and 5/6 are displayed in the top row. The larger scale (left) selects motion signals corresponding to the grouped boundary, whereas the smaller scale (right) selects motion signals corresponding to individual dots. Long-range filtering in MT Layer 2/3 (middle row) groups motion signals at each scale. Thus, in the larger scale, the coherent southwest direction is enhanced with respect to its activity level at the previous layer. In comparison, the smaller scale maintains the physical motion directions corresponding to each dot. Directional competition in MSTv (bottom row) results in an enhanced diagonal direction of motion in the large scale, which is then subtracted from the corresponding activity in the small scale, resulting in an inward peak shift. Note that the magnitude of the shift reported in Fig. 7 is less than the 45° initially reported in Johansson (1950), which is compatible with results from a more recent instantiation of this paradigm, where angles of 30°–40° were reported (Wallach, Becklen, & Nitzberg, 1985).
Rolling wheel experiment
Johansson (1974) provided a mathematical explanation of the wheel experiment in terms of vector analysis. As before, if the motion common to both dots is subtracted from the cycloid dot’s physical motion, the cycloid dot is seen to move in a circle around the center dot.
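This subtraction can be checked directly on ideal rolling-wheel kinematics (unit radius and angular speed are arbitrary choices):

```python
import numpy as np

# Ideal rolling wheel: the hub dot translates at speed R * omega while the
# rim dot traces a cycloid.
R, omega = 1.0, 1.0
t = np.linspace(0.0, 4 * np.pi, 200)
hub = np.stack([R * omega * t, np.full_like(t, R)], axis=1)
cycloid = np.stack([R * (omega * t - np.sin(omega * t)),
                    R * (1 - np.cos(omega * t))], axis=1)

# Subtracting the common (hub) translation leaves pure rotation: the
# residual stays at a constant distance R from the hub.
residual = cycloid - hub
radii = np.linalg.norm(residual, axis=1)
print(radii.min(), radii.max())  # both approximately R: a circle about the hub
```

The residual path is exactly (-R sin ωt, -R cos ωt), a circle of radius R about the hub, which is the perceived rotation once the common translation is removed.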
Note the early appearance of the rightward motion direction over the hub as compared to the cycloid. This is made explicit in Fig. 13 by a small vertical bar on the horizontal axis of each graph, which marks the time at which corresponding levels of activity are reached for both dots. The rightward motion signal propagates to the cycloid dot over the illusory contours that join them through time. The rightward direction of motion is retained at the position of the cycloid dot, even though its position on the y-axis changes throughout the simulation.
The 3D FORMOTION model predicts that elements of a visual display with constant velocity are more likely to govern the emergence of a frame of reference, due to the accumulation of motion signals in the direction of motion. A related prediction is that stimuli designed to prevent such accumulation of evidence will not develop a strong object-centered frame of reference. Partial support for this prediction can be found in an experiment by Kolers (1972, cf. Array 17 on p. 69) using stroboscopic motion on a display otherwise qualitatively the same as that in Johansson’s (1950) Experiment 19. Subjects’ percepts here seemed to reflect the independent motion of the dots rather than motion of a common frame of reference. A related case is that of Ternus–Pikler displays, in which one of the moving disks contains a rotating dot. Here, vector decomposition occurs only at the high ISIs that are also necessary to perceive grouped disk motion (Boi, Öğmen, Krummenacher, Otto, & Herzog, 2009).
The 3D FORMOTION model predicts that the creation of an object-centric frame of reference is driven by interacting stages of the form and motion streams of visual cortex: Form selection of motion-in-depth signals in area MT and interdepth directional inhibition in MST cause a vector decomposition whereby the motion directions of a moving frame at a nearer depth suppress these directions at a farther depth, and thereby cause a peak shift in the perceived directions of object parts moving with respect to the frame. In particular, motion signals predominant in the larger scale, or nearer depth, induce a peak shift of activity in smaller scales, or farther depths. The model qualitatively clarifies relative motion properties as manifestations of how the brain groups motion signals into percepts of moving objects, and quantitatively explains and simulates data about vector decomposition and relative frames of reference.
The model also qualitatively explains other data about frame-dependent motion coherence. Tadin, Lappin, Blake, and Grossman (2002) presented observers with a display consisting of an illusory pentagon circularly translating behind fixed apertures, with each side of the pentagon defined by an oscillating Gabor patch. The locations of the apertures and of the corners of the pentagon never overlapped, such that the latter were kept hidden during the entire stimulus presentation. Subjects had to judge the coherence of motion of the Gabor patches belonging to the different sides of the pentagon. Crucially, when the apertures were present, subjects reported seeing the patches as forming the shape of a pentagon, whereas when the apertures were absent, the patches did not seem to belong to the same shape. Results showed that motion coherence estimates were much better when apertures were present than when they were not. According to the FACADE mechanisms in the form stream, the presence of apertures triggers the formation of illusory contours linking the contours of the Gabor patches into a single pentagon behind the apertures (see Berzhanskaya et al., 2007). Subsequent form selection and long-range filtering in MT lead to a representation of the pentagon’s motion at a particular scale. This global motion direction is then subtracted from local motion signals of individual patches, thereby leading to better coherence judgments. In the absence of the apertures, form selection followed by long-range filtering of motion signals did not occur, such that the motion of individual patches mixed the common- and part-motion vectors, making coherence judgments difficult.
In displays where the speeds of the moving reference frame and of a smaller moving target can be decoupled, the perceived amount of vector decomposition has been shown to be proportional to the speed of the frame (Gogel, 1979; Post et al., 1989). This can be interpreted by noting that the firing rate of a speed-tuned MT cell increases with stimulus speed over a broad range (Raiguel, Xiao, Marcar, & Orban, 1999). A frame of reference moving at a higher speed should therefore produce higher cortical activation in the larger scales of MT and MST, and thus a more pronounced peak shift in motion direction, reflecting the stronger percept of vector decomposition (Gogel, 1979; Post et al., 1989). For the same reason, the model also predicts that the amount of shift in the perceived direction of the moving target is inversely related to target speed: A stronger peak in the motion direction distribution in the smaller scale (before subtraction) will be shifted less by inhibition from the larger scale. A further prediction is that vector decomposition mechanisms arise mainly through MT–MST interactions.
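The dependence of the peak shift on frame activation can be sketched with a simple population model. The Gaussian direction tuning, the subtractive form of the inhibition, and the gain values below are illustrative assumptions standing in for the model’s recurrent MT–MST dynamics:

```python
import math

def tuning(theta, mu, sigma=40.0):
    """Hypothetical Gaussian direction tuning with angular wrap-around."""
    d = (theta - mu + 180.0) % 360.0 - 180.0
    return math.exp(-0.5 * (d / sigma) ** 2)

def perceived_direction(target_dir, frame_dir, frame_gain):
    """Peak of the target's direction distribution after subtractive
    inhibition from the frame's direction (interdepth suppression).

    frame_gain stands in for the larger-scale MT/MST activation level,
    which the model ties to the frame's speed."""
    dirs = range(360)
    resp = [tuning(th, target_dir) - frame_gain * tuning(th, frame_dir)
            for th in dirs]
    return max(dirs, key=lambda th: resp[th])

# A faster frame (higher gain in the larger scale) pushes the perceived
# target direction farther away from the frame's direction.
weak = perceived_direction(90, 0, frame_gain=0.2)
strong = perceived_direction(90, 0, frame_gain=0.6)
```

With these values, both perceived directions are repelled away from the frame’s 0° direction, and the higher-gain frame produces the larger repulsion, qualitatively matching the speed dependence reported by Gogel (1979) and Post et al. (1989).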
The simulations shown here were conducted using the minimum number of scales needed to explain the experimental results. However, the model can be generalized to a finer sampling of scale space, perhaps with depth suppression acting transitively across scales. Such an arrangement of scales could then account for experimental cases in which vector decomposition must be applied hierarchically, as in biological motion displays (Johansson, 1973). Accordingly, residual motion of the knee is obtained after subtracting the common motion component of the hip and knee, whereas residual motion of the ankle is obtained after subtracting the common motion component of the knee and ankle. Similar decompositions occur for the upper limbs. Such vector decompositions would require spatial scales roughly matched to the lengths of the limbs, with depth suppression acting from larger scales coding limb motion to smaller scales coding joint motion.
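The hierarchical subtraction just described can be sketched with 2-D velocity vectors. The numbers below are hypothetical instantaneous velocities for point lights on a walker, chosen for illustration rather than taken from any simulation:

```python
# Hypothetical instantaneous velocities (vx, vy) of point lights on a walker.
# Hierarchical vector decomposition: each joint's residual motion is its
# retinal motion minus the common motion it shares with the joint above it.

def subtract(v, common):
    """Residual motion vector after removing the common component."""
    return (v[0] - common[0], v[1] - common[1])

hip   = (1.0, 0.0)    # common translation of the body
knee  = (1.4, 0.3)    # body translation plus pendular swing about the hip
ankle = (1.7, -0.1)   # knee motion plus swing about the knee

knee_residual  = subtract(knee, hip)    # swing of the knee relative to the hip
ankle_residual = subtract(ankle, knee)  # swing of the ankle relative to the knee
```

In the model’s terms, each subtraction would be carried out by depth suppression from the scale coding a limb segment onto the next smaller scale coding the joint below it.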
The present model explains cases of vector decomposition in which retinal motion is imparted to all display elements, as opposed to some being static. The model would need to be refined to account for induced-motion displays in which an oscillating rectangle induces an opposite perceived motion direction in a static dot (Duncker, 1929/1938). The suggestion that additional mechanisms are needed to explain induced motion is supported by experimental evidence highlighting differences between induced motion and vector decomposition, as summarized by Di Vita and Rock (1997). For example, induced motion is typically not observed when the reference frame’s physical speed is above the threshold for motion detection, whereas the vector decomposition stimuli analyzed here are robust to variations in speed. Also, in induced motion, the motion of the frame is underestimated or not perceived at all, whereas in vector decomposition stimuli the common motion component is perceived simultaneously with that of the parts.
This work was partially supported by CELEST, an NSF Science of Learning Center (Grant SBE-0354378), and by the DARPA SyNAPSE program (Grant HR0011-09-C-0001).
- Amano, K., Edwards, M., Badcock, D. R., & Nishida, S. (2009). Adaptive pooling of visual motion signals by the human visual system revealed with a novel multi-element stimulus. Journal of Vision, 9(3), 4:1–25.
- Boi, M., Öğmen, H., Krummenacher, J., Otto, T. U., & Herzog, M. H. (2009). A (fascinating) litmus test for human retino- vs. non-retinotopic processing. Journal of Vision, 9(13), 5:1–11.
- Börjesson, E., & von Hofsten, C. (1972). Spatial determinants of depth perception in two-dot motion patterns. Perception & Psychophysics, 11, 263–268.
- Börjesson, E., & von Hofsten, C. (1973). Visual perception of motion in depth: Application of a vector model to three-dot motion patterns. Perception & Psychophysics, 2, 169–179.
- Bremner, A. J., Bryant, P. E., & Mareschal, D. (2005). Object-centred spatial reference in 4-month-old infants. Infant Behaviour and Development, 29, 1–10.
- Browning, N. A., Grossberg, S., & Mingolla, E. (2009). Cortical dynamics of navigation and steering in natural scenes: Motion-based object segmentation, heading, and obstacle avoidance. Neural Networks, 22, 1383–1398.
- Cao, Y., & Grossberg, S. (2011). Stereopsis and 3D surface perception by spiking neurons in laminar cortical circuits: A method for converting neural rate models into spiking models. Manuscript submitted for publication.
- Chapman, B., Jost, G., & van der Pas, R. (2007). Using OpenMP: Portable shared memory parallel programming. Cambridge: MIT Press.
- Chey, J., Grossberg, S., & Mingolla, E. (1997). Neural dynamics of motion grouping: From aperture ambiguity to object speed and direction. Journal of the Optical Society of America A, 14, 2570–2594.
- Di Vita, J. C., & Rock, I. (1997). A belongingness principle of motion perception. Journal of Experimental Psychology: Human Perception and Performance, 23, 1343–1352.
- Duncker, K. (1938). Induced motion. In W. D. Ellis (Ed.), A sourcebook of Gestalt psychology. London: Routledge & Kegan Paul. Original work published in German, 1929.
- Francis, G., & Grossberg, S. (1996a). Cortical dynamics of boundary segmentation and reset: Persistence, afterimages, and residual traces. Perception, 35, 543–567.
- Frigo, M., & Johnson, S. G. (2005). The design and implementation of FFTW3. Proceedings of the IEEE, 93, 216–231.
- Gattass, R., Sousa, A. P. B., Mishkin, M., & Ungerleider, L. G. (1997). Cortical projections of area V2 in the macaque. Cerebral Cortex, 7, 110–129.
- Grossberg, S. (1968). Some physiological and biochemical consequences of psychological postulates. Proceedings of the National Academy of Sciences, 60, 758–765.
- Grossberg, S. (1972). A neural theory of punishment and avoidance. II: Quantitative theory. Mathematical Biosciences, 15, 253–285.
- Grossberg, S. (1973). Contour enhancement, short-term memory, and constancies in reverberating neural networks. Studies in Applied Mathematics, 52, 213–257.
- Grossberg, S. (1988). Nonlinear neural networks: Principles, mechanisms, and architectures. Neural Networks, 1, 12–61.
- Grossberg, S. (1991). Why do parallel cortical systems exist for the perception of static form and moving form? Perception & Psychophysics, 49, 117–141.
- Grossberg, S. (1994). 3-D vision and figure–ground separation by visual cortex. Perception & Psychophysics, 55, 48–121.
- Grossberg, S. (2000). The complementary brain: Unifying brain dynamics and modularity. Trends in Cognitive Science, 4, 233–246.
- Grossberg, S., & McLoughlin, N. P. (1997). Cortical dynamics of three-dimensional surface perception: Binocular and half-occluded scenic images. Neural Networks, 10, 1583–1605.
- Grossberg, S., & Mingolla, E. (1985). Neural dynamics of perceptual grouping: Textures, boundaries, and emergent segmentations. Perception & Psychophysics, 38, 141–171.
- Grossberg, S., Mingolla, E., & Viswanathan, L. (2001). Neural dynamics of motion integration and segmentation within and across apertures. Vision Research, 41, 2351–2553.
- Grossberg, S., & Rudd, M. (1989). A neural architecture for visual motion perception: Group and element apparent motion. Neural Networks, 2, 421–450.
- Haralick, R. M., & Shapiro, L. G. (1992). Computer and robot vision (Vol. 1). Boston: Addison-Wesley.
- Hershenson, M. (1999). Visual space perception. Cambridge: MIT Press.
- Johansson, G. (1950). Configurations in event perception. Uppsala: Almqvist & Wiksell.
- Johansson, G. (1973). Visual perception of biological motion and a model for its analysis. Perception & Psychophysics, 14, 201–211.
- Joly, T. J., & Bender, D. B. (1997). Loss of relative-motion sensitivity in the monkey superior colliculus after lesions of cortical area MT. Experimental Brain Research, 117, 43–58.
- Kelly, F., & Grossberg, S. (2001). Neural dynamics of 3-D surface perception: Figure–ground separation and lightness perception. Perception & Psychophysics, 62, 1596–1618.
- Kolers, P. A. (1972). Aspects of motion perception. Oxford: Pergamon Press.
- Muller, J. R., Metha, A. B., Krauskopf, J., & Lennie, P. (2001). Information conveyed by onset transients in responses of striate cortical neurons. The Journal of Neuroscience, 21, 6987–6990.
- Post, R. B., Chi, D., Heckmann, T., & Chaderjian, M. (1989). A reevaluation of the effect of velocity on induced motion. Perception & Psychophysics, 45, 411–416.
- Reichardt, W. (1961). Autocorrelation, a principle for the evaluation of sensory information by the central nervous system. In W. A. Rosenblith (Ed.), Sensory communication (pp. 303–317). New York: Wiley.
- Rock, I. (1990). The frame of reference. In I. Rock (Ed.), The legacy of Solomon Asch (pp. 243–268). Hillsdale: Erlbaum.
- Rubin, J., & Richards, W. A. (1988). Visual perception of moving parts. Journal of the Optical Society of America A, 5, 2045–2049.
- Sedgwick, H. A. (1983). Environment-centered representation of spatial layout: Available visual information from texture and perspective. In J. Beck, B. Hope, & A. Rosenfeld (Eds.), Human and machine vision. Amsterdam: Elsevier.
- Sokolov, A., & Pavlova, M. (2006). Visual motion detection in hierarchical spatial frames of reference. Experimental Brain Research, 174, 477–486.
- Thiele, A., Distler, C., Korbmacher, H., & Hoffmann, K.-P. (2004). Contribution of inhibitory mechanisms to direction selectivity and response normalization in macaque middle temporal area. Proceedings of the National Academy of Sciences, 101, 9810–9815.
- Ullman, S. (1979). The interpretation of visual motion. Cambridge: MIT Press.
- van Santen, J. P., & Sperling, G. (1985). Elaborated Reichardt detectors. Journal of the Optical Society of America A, 2, 300–321.
- Wade, N. J., & Swanston, M. T. (2001). Visual perception: An introduction (2nd ed.). Hove: Psychology Press.
- Wallach, H. (1996). On the visually perceived direction of motion. Psychologische Forschung, 20, 325–380. Original work published 1935.