How do we make sense of the complex motions of multiple interacting objects and their parts? One required computational step is to represent the various motion paths in an appropriate reference frame. Various ways of defining a reference frame have been proposed, ranging from retinocentric, in which an object is coded relative to the location of the activity it induces on the retina, to geocentric, in which objects are represented independent of the observer’s viewpoint (Wade & Swanston, 1987). In an object-centered reference frame (Bremner, Bryant, & Mareschal, 2005; Wade & Swanston, 1996), objects are perceived relative to other objects. For example, on a cloudy night, the moon may appear to be moving in a direction opposite to that of the clouds. In a laboratory setting, this concept is well illustrated by induced-motion experiments, wherein the motion of one object appears to cause opponent motion in another, otherwise static, object (Duncker, 1929/1938).

Frames of reference

From a functional perspective, the creation of perceptual relative frames of reference may be one mechanism evolved by the brain to represent the motion of individual objects in a scene. This ability appears especially important when considering that the meaningfulness of the motion of a particular object can often be compromised by the motion of another object. For example, when looking at a person waving a hand from a moving train, the motion components of the hand and the train become mixed together. By representing the motion of the hand relative to that of the train, the motion component of the train can be removed and the motion of the hand itself recovered (Rock, 1990). Relative reference frames may also afford greater sensitivity to subtle variations in the visual scene, as suggested by the lower thresholds for motion detection in the presence of a neighboring stationary reference than in completely dark environments (Sokolov & Pavlova, 2006).

Another evolutionary advantage may be that information represented in an object-centered reference frame is partly invariant to changes in viewpoint (Wade & Swanston, 2001). Furthermore, as exemplified by the model presented here, computing an object-centered reference frame does not necessitate a viewer-centered representation (Sedgwick, 1983; Wade & Swanston, 1987), making it an efficient substitute for the latter.

Aperture problem

How does the laminar organization of visual cortex create such a reference frame? The neural model proposed in this article predicts how the form and motion pathways in cortical areas V1, V2, MT, and MST accomplish this task using multiple-scale and multiple-depth interactions within and between form- and motion-processing streams in V1–V2 and V1–MT. These mechanisms have been developed elsewhere to explain data about motion perception by proposing how the brain solves the aperture problem. Wallach (1935/1996) first showed that the motion of a featureless line seen behind a circular aperture is perceptually ambiguous: No matter what may be the real direction of motion, the perceived direction is perpendicular to the orientation of the line—that is, the normal component of motion. The aperture problem is faced by any localized neural motion sensor, such as a neuron in the early visual pathway, which responds to a local contour moving through an aperture-like receptive field. In contrast, a moving dot, line end, or corner provides unambiguous information about an object’s true motion direction (Shimojo, Silverman, & Nakayama, 1989). The barber pole illusion demonstrates how the motion of a line is determined by unambiguous signals formed at its terminators and how these unambiguous signals capture the motion of nearby ambiguous motion regions (Ramachandran & Inada, 1985; Wallach, 1935/1996). The model proposes how such moving visual features activate cells in the brain that compute feature-tracking signals that can disambiguate an object’s true direction of motion. Our model does not rely on local pooling across motion directions, which has been shown to be unable to account for various data on motion perception (Amano, Edwards, Badcock, & Nishida, 2009). Instead, a dominant motion direction is determined over successive competitive stages with increasing receptive-field sizes, while preserving various candidate motion directions at each spatial position up to the highest model stages, where motion-grouping processes determine the perceived directions of object motion.

The model is here further developed to simulate key psychophysical percepts, such as classical motion perception experiments (Johansson, 1950), the Duncker wheel (Duncker, 1929/1938), and variants thereof, and casts new light on various related experimental findings. In particular, the model makes sense of psychophysical evidence that suggests that properties shared by groups of objects determine a common coordinate frame relative to which the features particular to individual objects are perceived. This process is well-summarized in the classical concept of vector decomposition (Johansson, 1950).

Vector decomposition

Johansson (1950) showed that the perceived motion of a stimulus can be characterized as a linear combination of motion vectors corresponding to different stimulus parts. Accordingly, the true motion vectors (i.e., the vectors generated by the true motion path of the stimulus) are dissociated into orthogonal components. One component represents the motion of the grouped stimulus, or, in some cases, of a large stimulus element that appears to encompass smaller ones (e.g., the rectangular frame in induced motion experiments). The other component corresponds to the motion of individual objects from which the first component has been subtracted. An example of this vector decomposition process is shown in Fig. 1.

Fig. 1

Johansson’s Experiment 19. (a) The stimulus consists of two dots oscillating in orthogonal directions and meeting periodically at point ab. (b) The emergent percept is that of two dots oscillating on a common diagonal axis (represented as an ellipse), which itself oscillates in the orthogonal direction

Figure 1a depicts the visual stimulus presented to the subject. Here, two dots oscillate in orthogonal directions and meet at one endpoint (point ab) of their trajectories. Observers report viewing either the nonrigid motion shown in Fig. 1b or the rigid motion of a bar rotating in depth. The former percept is that of two dots oscillating along a common diagonal axis, denoted by the ellipse, which itself oscillates along the orthogonal direction. In other words, the dots are seen as moving relative to a common reference frame, the diagonal axis. The pertinence of vector decomposition to the stimulus of Fig. 1 is shown in greater detail in Fig. 2.

Fig. 2

Vector decomposition analysis. (a) The true motion vectors (solid vectors) are cast into an orthogonal basis (dashed vectors). (b) In this basis, the component common to both dots is directed toward the southwest corner. (c) The remaining component, specific to each dot, moves along a common axis

Figure 2a shows vector components into which downward and leftward motions of the individual dots can be decomposed. If the moving frame captures the diagonal direction down-and-left, as in Fig. 2b, the individual dots are left with components that oscillate toward and away from each other, as in Fig. 2c. A complete account of vector decomposition requires simultaneously representing common- and part-motion components. In our model, simultaneous representation of both types of motion is made possible by having cells from different depth planes represent the different motion components. Subtraction of the common-motion component is due to inhibition from cells coding for the nearer depth to cells coding for the farther depth. We show below how interdepth directional inhibition causes a peak shift (Grossberg & Levine, 1976) in directional selectivity that behaves like a vector decomposition.
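
For concreteness, the decomposition of Fig. 2 can be expressed numerically. The sketch below (Python; the unit dot speeds are an illustrative assumption, and the computation captures only the geometry of the decomposition, not the model's neural dynamics) projects the dot velocities onto the common southwest diagonal and subtracts that common component:

```python
import numpy as np

# One instant of Johansson's Experiment 19 (Fig. 2a): one dot moves
# left, the other down (unit speeds assumed for illustration).
v_horizontal_dot = np.array([-1.0, 0.0])
v_vertical_dot = np.array([0.0, -1.0])

# Common (frame) component: the projection of the mean velocity onto
# the shared southwest diagonal (Fig. 2b).
sw = np.array([-1.0, -1.0]) / np.sqrt(2.0)
v_mean = (v_horizontal_dot + v_vertical_dot) / 2.0
common = (v_mean @ sw) * sw

# Part components: what remains after subtracting the common motion;
# the residuals point oppositely along one diagonal axis (Fig. 2c).
print(common)                     # [-0.5 -0.5]: frame drifts southwest
print(v_horizontal_dot - common)  # [-0.5  0.5]: northwest residual
print(v_vertical_dot - common)    # [ 0.5 -0.5]: southeast residual
```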

Following Johansson (1950), vector decomposition has been invoked to explain motion perception in multiple experiments employing a variety of stimulus configurations (e.g., Börjesson & von Hofsten, 1972, 1973, 1975, 1977; Cutting & Proffitt, 1982; Di Vita & Rock, 1997; Gogel & MacCracken, 1979; Gogel & Tietz, 1976; Johansson, 1974; Post, Chi, Heckmann, & Chaderjian, 1989). The bulk of this work supports the view that vector decomposition is a useful concept in characterizing object-centric frames of reference in motion perception. However, no model has so far attempted to explain how vector decomposition results from the perceptual mechanisms embedded in the neural circuits of the visual system.

The present article fills this gap by further developing the 3D FORMOTION model (Baloch & Grossberg, 1997; Berzhanskaya, Grossberg, & Mingolla, 2007; Chey, Grossberg, & Mingolla, 1997, 1998; Francis & Grossberg, 1996a, 1996b; Grossberg, Mingolla, & Viswanathan, 2001; Grossberg & Pilly, 2008). As the model’s name suggests, it proposes how form and motion processes interact to form coherent percepts of object motion in depth and already proposes a unified mechanistic explanation of many perceptual facts, including the barber pole illusion, plaid motion, and transparent motion. Form and motion processes, such as those in V2/V4 and MT/MST, occur in the ventral “what” and dorsal “where” cortical processing streams, respectively. Key mechanisms within these two streams seem to obey computationally complementary laws (Grossberg, 1991, 2000): The ability of each process to compute some properties prevents it from computing other, complementary, properties. Examples of such complementary properties include boundary completion versus surface filling-in—within the (V1 interblob)–(V2 interstripe) and (V1 blob)–(V2 thin stripe) streams, respectively—and, more relevant to the results herein, boundary orientation and precise depth versus motion direction and coarse depth—within the V1–V2 and V1–MT streams, respectively. The present article clarifies some of the interactions between form and motion processes that enable them to overcome their complementary deficiencies and to thereby compute more informative representations of unambiguous object motion.

3D FORMOTION model

Figure–ground separation mechanisms play a key role in explaining vector decomposition data. Many data about figure–ground perception have been modeled as part of the form-and-color-and-depth (FACADE) theory of 3-D vision (e.g., Cao & Grossberg, 2005, 2011; Fang & Grossberg, 2009; Grossberg, 1994, 1997; Grossberg & Kelly, 1999; Grossberg & McLoughlin, 1997; Grossberg & Pessoa, 1998; Grossberg & Yazdanbakhsh, 2005; Kelly & Grossberg, 2001). FACADE theory describes how 3-D boundary and surface representations are generated within the blob and interblob cortical processing streams from cortical areas V1 to V4. Figure–ground separation processes that are needed for the present analysis are predicted to be completed within the pale stripes of cortical area V2. These figure–ground processes help to segregate occluding and occluded objects, along with their terminators, onto different depth planes.

In response to the dot displays of Fig. 1, the model clarifies how an illusory contour forms between the pair of moving dots within cortical area V2 and captures motion direction signals in cortical area MT via a form-to-motion, or formotion, interaction from V2 to MT. The captured motion direction of this illusory contour causes vector decomposition of the motion directions of the individual dots. Indeed, at the intersection of an illusory contour and a dot, contour curvature is greater in the dot’s real boundary than in the illusory contour-completed boundary, since the illusory contour is tangent to the dot boundary. This greater curvature initially results in a weaker representation of the dots’ boundaries in area V2. These boundaries are then pushed farther in depth than the grouped illusory contour-completed shape due to interacting processes of orientational competition, boundary pruning, and boundary enrichment, which are described and simulated in the FACADE theory.

Motion processing is performed in the “where” stream, whose six levels model dynamics homologous to those of LGN, V1, MT, and MST (Fig. 3). These stages are mathematically defined in the Appendix.

Fig. 3

Motion-processing stream of the 3D FORMOTION model. Level 1: ON and OFF input cells. Level 2: Transient nondirectional and directional cells. Level 3: Short-range filter. Level 4: Spatial competition and opponent direction inhibition. Level 5A: Boundary selection of motion signals at multiple depth planes. Level 5B: Long-range spatial filter. Level 6: Directional grouping and depth suppression

Level 1: Input from LGN

In the 3D FORMOTION model of Berzhanskaya et al. (2007), as in the present model, the boundary input is not depth-specific. Rather, the boundary input models signals that come from the retina and LGN into V1 (Xu, Bonds, & Casagrande, 2002). This boundary is represented in both ON and OFF channels. After V1 motion processing, described below, the motion signal then goes on to MT and MST. The 3-D figure–ground-separated boundary inputs in the present model come from V2 to MT and select bottom-up motion inputs from V1 in a depth-selective way. This process clarifies how the visual system uses occlusion cues to segregate moving boundaries into different depth planes, even though the inputs themselves occur within the same depth plane.

Berzhanskaya et al. (2007) showed how a combination of habituative (Appendix Eqs. 7, 8 and 9) and depth selection (Appendix Eq. 20) mechanisms accomplishes the required depth segregation of motion signals in stimuli containing both static and moving components, such as chopstick displays (Lorenceau & Alais, 2001). In particular, habituative preprocessing enables motion cues to trigger the activation of transient cells (model Level 2 in Fig. 3), whereas signals due to static elements in the display habituate and become weak over time. As simulated by Berzhanskaya et al. (2007), this mechanism can explain why visible occluders in a chopstick display generate weaker motion signals at all depth planes. Although not necessary in the present simulations due to the absence of static elements in the displays, habituative mechanisms in the early stages of the model are included to enable a unified explanation of the data.

The motion selection mechanism separates motion signals in depth by using depth-separated boundary signals from V2 to MT. The model of Berzhanskaya et al. (2007) simulated in greater detail the formation of these depth-separated boundaries. The present model uses algorithmically defined boundaries to simplify the simulations. The model shows how these boundaries can capture only the appropriate motion signals onto their respective depth planes in MT. Although the question of how the time course of boundary formation impacts vector decomposition is not analyzed in detail in the present article, in part because there do not seem to be empirical data on this matter, some of our results nevertheless begin to address this issue, such as the persistence of the perceived motion until a large fraction of the boundary is pruned (see Fig. 15).

Both ON and OFF input cells are needed. For example, when a bright dot moves downward on a dark background (Fig. 4a), ON cells respond to its lower edge (Fig. 4b), but OFF cells respond to its upper edge (Fig. 4c). Likewise, when the dot reverses direction and starts to move upward, the upper edge now activates ON cells and the lower edge activates OFF cells. By differentially activating ON and OFF cells in different parts of this motion cycle, these cells have more time to recover from habituation, so that the system remains more sensitive to repetitive motion signals. Model ON and OFF responses are thus relevant to the role played by habituative mechanisms in generating transient-cell responses.
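
One way to picture this input stage is as half-wave rectified luminance increments (ON) and decrements (OFF) between successive frames. The sketch below uses this frame-difference approximation as an illustrative stand-in for the retina/LGN front end, not as the model's actual equations:

```python
import numpy as np

def on_off_channels(prev_frame, frame):
    """ON = rectified luminance increments; OFF = rectified decrements.
    For a bright dot moving down (Fig. 4), increments trace its leading
    (lower) edge and decrements its trailing (upper) edge."""
    diff = frame - prev_frame
    return np.maximum(diff, 0.0), np.maximum(-diff, 0.0)

# A bright dot on a dark background moves down by one pixel.
prev = np.zeros((5, 5)); prev[1, 2] = 1.0
curr = np.zeros((5, 5)); curr[2, 2] = 1.0
on, off = on_off_channels(prev, curr)
assert on[2, 2] == 1.0 and off[1, 2] == 1.0  # leading vs. trailing edge
```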

Fig. 4

Input to the motion pathway. (a) Motion path of the dots directed toward the lower left corner. The input to the ON input cells corresponds to the leading edge of the dot (b), whereas the input to the OFF input cells corresponds to the trailing edge (c)

Level 2: Transient cells

The second stage of the motion processing system (Fig. 3) consists of nondirectional transient cells, inhibitory directional interneurons, and directional transient cells. The nondirectional transient cells respond briefly to a change in the image luminance, irrespective of the direction of movement (Appendix Eqs. 7, 8 and 9). Such cells respond well to moving boundaries and poorly to static objects because of the habituation that creates the transient response. The type of adaptation that leads to these transient cell responses is known to occur at several stages in the visual system, ranging from retinal Y cells (Enroth-Cugell & Robson, 1966; Hochstein & Shapley, 1976a, 1976b) to cells in V1 and V2 (Abbott, Sen, Varela, & Nelson, 1997; Carandini & Ferster, 1997; Chance, Nelson, & Abbott, 1998; Francis & Grossberg, 1996a, 1996b; Francis, Grossberg, & Mingolla, 1994; Varela, Sen, Gibson, Fost, Abbott, & Nelson, 1997) and beyond. The nondirectional transient cells send signals to inhibitory directional interneurons and directional transient cells, and the inhibitory interneurons interact with each other and with the directional transient cells (Appendix Eqs. 10–12). A directional transient cell fires vigorously when a stimulus is moved through its receptive field in one direction (called the preferred direction), while motion in the reverse direction (called the null direction) evokes little response (Barlow & Levick, 1965).
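
The gist of these two mechanisms can be sketched as follows; the update rule, the parameter values, and the single-interneuron veto are illustrative simplifications, not Appendix Eqs. 7–12:

```python
import numpy as np

def transient_response(stimulus, dt=0.1, eps=0.01, lam=0.5):
    """Nondirectional transient cell: the input is gated by a
    habituating transmitter z that recovers toward 1 and is depleted
    by sustained input, so the gated output x*z is strong at onset
    and fades for a static stimulus."""
    z, out = 1.0, []
    for x in stimulus:
        out.append(x * z)
        z += dt * (eps * (1.0 - z) - lam * x * z)  # recovery vs. depletion
    return np.array(out)

def directional_transient(transient_here, transient_null_side):
    """Directional transient cell: its response survives for motion in
    the preferred direction but is vetoed by an inhibitory interneuron
    driven from the null-direction side of its receptive field."""
    return max(0.0, transient_here - transient_null_side)

static_boundary = np.ones(100)   # a boundary that never moves
r = transient_response(static_boundary)
print(r[0], r[-1])               # strong onset, weak sustained response
```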

The directional inhibitory interneuronal interaction enables the directional transient cells to realize directional selectivity at a wide range of speeds (Chey et al., 1997; Grossberg et al., 2001). Although in the present model directional interneurons and transient cells correspond to cells in V1, this predicted interaction is consistent with retinal data concerning how bipolar cells interact with inhibitory starburst amacrine cells and direction-selective ganglion cells and how starburst cells interact with each other and with ganglion cells (Fried, Münch, & Werblin, 2002). The possible role of starburst cell inhibitory interneurons in ensuring directional selectivity at a wide range of speeds has not yet been tested. The model is also consistent with physiological data from cat and macaque species showing that directional selectivity first occurs in V1 and that it is due, at least in part, to inhibition that reduces the response to the null direction of motion (Livingstone, 1998; Livingstone & Conway, 2003; Murthy & Humphrey, 1999).

Level 3: Short-range filter

A key step in solving the aperture problem is to strengthen unambiguous feature-tracking signals relative to ambiguous motion signals. Feature-tracking signals are often generated by a relatively small number of moving features in a scene, yet can have a very large effect on motion perception. One process that strengthens feature-tracking signals relative to ambiguous aperture signals is the short-range directional filter (Fig. 3). Cells in this filter accumulate evidence from directional transient cells of similar directional preference within a spatially anisotropic region that is oriented along the preferred direction of the cell. This computation selectively strengthens the responses of short-range filter cells to feature-tracking signals at unoccluded line endings, object corners, and other scenic features (Appendix Eq. 13). The use of a short-range filter followed by competition at Level 4 eliminates the need for an explicit solution of the feature correspondence problem that various other models posit and attempt to solve (Reichardt, 1961; Ullman, 1979; van Santen & Sperling, 1985).

The short-range filter uses multiple spatial scales (Appendix Eq. 15). Each scale responds preferentially to a specific speed range. Larger scales respond better to faster speeds due to thresholding of short-range filter outputs with a self-similar threshold; that is, a threshold that increases with filter size (Appendix Eq. 16). Larger scales thus require “more evidence” to fire (Chey et al., 1998).
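
A sketch of this multiple-scale logic for a rightward-preferring filter (the kernel shape, threshold rule, and parameter values are illustrative): within one integration window, a faster feature activates a longer trail of directional transient cells, so a large scale, despite its higher self-similar threshold, responds vigorously only to faster motion.

```python
import numpy as np

def short_range_filter(dir_cells, scale, thresh_per_unit=0.3):
    """Sum rightward directional transient activity over a kernel
    elongated along the preferred (horizontal) direction, then apply
    a self-similar threshold that grows with the filter size."""
    h, w = dir_cells.shape
    out = np.zeros_like(dir_cells)
    for y in range(h):
        for x in range(scale, w):
            evidence = dir_cells[y, x - scale:x + 1].sum()
            out[y, x] = max(0.0, evidence - thresh_per_unit * scale)
    return out

slow = np.zeros((1, 10)); slow[0, 4] = 1.0    # short activity trail
fast = np.zeros((1, 10)); fast[0, 3:6] = 1.0  # longer trail at high speed
print(short_range_filter(slow, scale=3).max())  # ~0.1: barely responds
print(short_range_filter(fast, scale=3).max())  # ~2.1: prefers fast motion
```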

Level 4: Spatial competition and opponent direction competition

Two kinds of competition further enhance the relative advantage of feature-tracking signals (Fig. 3 and Appendix Eqs. 17, 18 and 19). These competing cells are proposed to occur in Layer 4B of V1 (Fig. 3). Spatial competition among cells of the same spatial scale that prefer the same motion direction boosts the amplitude of feature-tracking signals relative to those of ambiguous signals. Feature-tracking signals are contrast-enhanced by such competition because they are often found at motion discontinuities, and thus get less inhibition than ambiguous motion signals that lie within an object’s interior. Opponent-direction competition also occurs at this processing stage (Albright, 1984; Albright, Desimone, & Gross, 1984) and ensures that cells tuned to opposite motion directions are not simultaneously active.

The activity pattern at this model stage is consistent with the data of Pack, Gartland, and Born (2004). In their experiments, V1 cells demonstrated an apparent suppression of responses to motion along visible occluders. A similar suppression occurs in the model, due to the adaptation of transient inputs to static boundaries. Also, cells in the middle of a grating respond more weakly than cells at the edge of the grating. Spatial competition in the model between motion signals performs divisive normalization and end-stopping, which together amplify the strength of directionally unambiguous feature-tracking signals at line ends relative to the strength of aperture-ambiguous signals along line interiors.
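
The resulting end-stopping can be illustrated with a one-dimensional sketch; the divisive form, Gaussian surround, and parameter values below are illustrative choices, not Appendix Eqs. 17, 18 and 19:

```python
import numpy as np

def spatial_competition(signals, sigma=2.0, k=0.5):
    """Divisive surround normalization among same-direction cells of
    one scale: each cell is suppressed by a Gaussian-weighted sum of
    its neighbors, so cells at line ends (motion discontinuities)
    receive less inhibition than cells along the line interior."""
    idx = np.arange(len(signals))
    out = np.empty_like(signals)
    for i in idx:
        surround = np.exp(-((idx - i) ** 2) / (2 * sigma ** 2))
        surround[i] = 0.0                        # exclude the center
        out[i] = signals[i] / (1.0 + k * surround @ signals)
    return out

line = np.ones(11)                # equal motion signals along a line
print(spatial_competition(line))  # the two ends emerge strongest
```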

Level 5: Long-range filter and formotion selection

Motion signals from model Layer 4B of V1 input to model area MT. Area MT also receives a projection from V2 (Anderson & Martin, 2002; Rockland, 1995) that carries depth-specific figure–ground-separated boundary signals whose predicted properties were supported by Ponce, Lomber, and Born (2008). These V2 form boundaries select the motion signals (formotion selection) by selectively capturing at different depths the motion signals coming into MT from Layer 4B of V1 (Appendix Eq. 20).

Formotion selection, or selection of motion signals in depth by corresponding boundaries, is proposed to occur via a narrow excitatory center and a broad inhibitory surround projection from V2 to Layer 4 of MT. First, in response to the oscillating dot pair, the larger spatial scale at the nearer depth (D1) in V2 allows illusory contours to bridge the two dots. At the same time, on–center, off–surround spatial competition inhibits boundaries within the enclosing shape at that depth (Fig. 5a). In the smaller spatial scale at the farther depth (D2) in V2, no illusory contours bridge the dots. In addition, boundaries at the farther depth are inhibited by corresponding boundaries at the nearer depth at the same positions. The resulting pruned boundaries are shown in gray in Fig. 5b.

Fig. 5

V2 input to MT for the dot configuration of Fig. 1. Strong boundaries are represented in black, whereas weaker boundaries are represented in gray. (a) Nearer-depth (larger-scale) input contains FACADE boundaries corresponding to the dots and illusory contours linking the pair. The parts of the dot boundaries that would be located inside the enclosing shape are inhibited due to spatial competition. (b) Farther-depth (smaller-scale) input contains the boundaries of individual dots. Dot boundaries in that depth and at the same spatial locations as the boundaries in the nearer depth are inhibited by the latter, due to near-to-far suppression (see Eq. 28), and are thus shown as being weaker

Formotion selection from V2 to MT is depth-specific. At the nearer depth D1, V2 boundary signals that correspond to the illusory contour grouping select the larger-scale motion signals (Fig. 5a) and suppress motion signals at other locations in that same depth. At the farther depth D2, V2 boundary signals that correspond to the individual dots (Fig. 5b) select motion signals that represent the motion of individual parts of the stimulus.
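
A one-dimensional sketch of this gating logic (the gain form, kernel width, and parameter values are illustrative stand-ins for Appendix Eq. 20): the boundary amplifies motion signals at its own locations, suppresses them within a broad surround at the same depth, and leaves remote locations untouched.

```python
import numpy as np

def formotion_select(mt_motion, v2_boundary, k_c=2.0, k_s=1.0, sigma=4.0):
    """V2 boundaries at one depth gate V1 motion inputs into MT at that
    depth: a narrow modulatory on-center amplifies (but cannot create)
    motion activity on the boundary, while a broad off-surround
    suppresses motion signals near, but off, the boundary."""
    i = np.arange(len(mt_motion))
    g = np.exp(-((i[:, None] - i[None, :]) ** 2) / (2 * sigma ** 2))
    surround = g @ v2_boundary
    gain = np.maximum(0.0, 1.0 + k_c * v2_boundary - k_s * surround)
    return mt_motion * gain

motion = np.ones(21)                         # ambiguous motion everywhere
boundary = np.zeros(21); boundary[10] = 1.0  # one boundary location
out = formotion_select(motion, boundary)
print(out[10], out[12], out[20])             # boosted, suppressed, spared
```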

Boundary-gated signals from Layer 4 of MT are proposed to input to the upper layers of MT (Fig. 3; Appendix Eq. 22), where they activate directionally selective, spatially anisotropic filters via long-range horizontal connections (Appendix Eq. 25). In this long-range directional filter, motion signals coding the same directional preference are pooled from object contours with multiple orientations and opposite contrast polarities. This pooling process creates a true directional cell response (Chey et al., 1997; Grossberg et al., 2001; Grossberg & Rudd, 1989, 1992).

The long-range filter accumulates evidence of a given motion direction using a kernel that is elongated in the direction of that motion, much as in the case of the short-range filter. This hypothesis is consistent with data showing that approximately 30% of the cells in MT show a preferred direction of motion that is aligned with the main axis of their receptive fields (Xiao, Raiguel, Marcar, & Orban, 1997). Long-range filtering is performed at multiple scales according to the size–distance invariance hypothesis (Chey et al., 1997; Hershenson, 1999): Signals in the nearer depth are filtered at a larger scale, and signals in the farther depth are filtered at a smaller scale.
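
A sketch of such a direction-aligned kernel (the Gaussian form, aspect ratio, and size are illustrative choices):

```python
import numpy as np

def long_range_kernel(direction_deg, scale, size=15, aspect=3.0):
    """Anisotropic Gaussian elongated along the preferred motion
    direction; a larger `scale` serves the nearer depth, per the
    size-distance invariance hypothesis."""
    theta = np.deg2rad(direction_deg)
    half = size // 2
    y, x = np.mgrid[-half:half + 1, -half:half + 1]
    u = x * np.cos(theta) + y * np.sin(theta)   # along the direction
    v = -x * np.sin(theta) + y * np.cos(theta)  # across the direction
    kern = np.exp(-(u / (aspect * scale)) ** 2 - (v / scale) ** 2)
    return kern / kern.sum()

# Convolving southwest-tuned MT activity with this kernel pools
# evidence along the southwest axis, e.g., across the illusory
# contour that links the two dots in Fig. 5a.
k_sw = long_range_kernel(225.0, scale=2.0)
```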

The model hereby predicts that common and part motions are simultaneously represented by different cell populations in MT due to form selection. This type of effect may be compared with the report that some MT neurons are responsive to the global motion of a plaid stimulus, whereas others respond to the motion of its individual sinusoidal grating components (Rust, Mante, Simoncelli, & Movshon, 2006; Smith, Majaj, & Movshon, 2005).

The long-range filter cells in Layer 2/3 of model MT are proposed to play a role in binding together directional information that is homologous to the coaxial and collinear accumulation of orientational evidence within Layer 2/3 of the pale stripes of cortical area V2 for perceptual grouping of form (Grossberg, 1999; Grossberg & Raizada, 2000; Hirsch & Gilbert, 1991). This anisotropic long-range motion filter allows directional motion signals to be integrated across the illusory contours in Fig. 5a that link the pair of dots.

Level 6: Directional grouping, near-to-far inhibition, and directional peak shift

The model processing stages up to now have not fully solved the aperture problem. Although they can amplify feature-tracking signals and assign motion signals to the correct depths, they cannot yet explain how feature-tracking signals can propagate across space to select consistent motion directions from ambiguous motion directions, without distorting their speed estimates, and at the same time suppress inconsistent motion directions. They also cannot explain how motion integration can compute a vector average of ambiguous motion signals across space to determine the perceived motion direction when feature-tracking signals are not present at that depth. The final stage of the model accomplishes this goal by using a motion grouping network (Appendix Eq. 28), interpreted to occur in ventral MST (MSTv), both because MSTv has been shown to encode object motion (Tanaka, Sugita, Moriya, & Saito, 1993) and because it is a natural anatomical marker, given the model processes that precede and succeed it. We predict that feedback between MT and MST determines the coherent motion direction of discrete moving objects.

The motion grouping network works as follows: Cells that code similar directions in MT send convergent inputs to cells in MSTv. Unlike in the previous 3D FORMOTION model, in which MST cells received input only from MT cells of the same direction, here each motion-grouping cell receives a weighted sum of inputs over directions (Appendix Eq. 29). Thus, for example, cells tuned to the southwest direction receive excitatory input not only from cells coding for that direction but also, to a lesser extent, from cells tuned to either the south or west direction, enabling a stronger representation of the common motion of the two dots.

Directional competition at each position then determines a winning motion direction. This winning directional cell then feeds back to its source cells in MT. This feedback supports the activity of MT cells that code the winning direction, while suppressing the activities of cells that code all other directions. This motion grouping network enables feature-tracking signals to select similar directions at nearby ambiguous motion positions, while suppressing other directions there. These competitive processes take place in each depth plane, consistent with the fact that direction-tuned cells in MST are also disparity-selective (Eifuku & Wurtz, 1999). On the next cycle of the feedback process, these newly unambiguous motion directions in MT select consistent MSTv grouping cells at positions near them. The grouping process hereby propagates across space as the feedback signals cycle through time between MT and MSTv.
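
The logic of this grouping loop can be sketched as follows; the winner-take-all competition, circular Gaussian direction kernel, and gain values are illustrative simplifications of Appendix Eqs. 28 and 29:

```python
import numpy as np

def mt_mst_grouping(mt, dir_sigma=1.0, n_iters=10, feedback=0.8):
    """mt: (n_positions, n_directions) activity at one depth.
    MSTv pools MT activity over space with a Gaussian weighting over
    neighboring directions; directional competition picks a winner;
    the winner feeds back to MT, boosting its own direction and
    suppressing the others at every position."""
    n_pos, n_dir = mt.shape
    d = np.arange(n_dir)
    ddist = np.minimum(np.abs(d[:, None] - d[None, :]),
                       n_dir - np.abs(d[:, None] - d[None, :]))
    w = np.exp(-ddist ** 2 / (2 * dir_sigma ** 2))  # direction kernel
    for _ in range(n_iters):
        mst = mt.mean(axis=0) @ w       # pool over space and direction
        winner = int(np.argmax(mst))    # directional competition
        gain = np.full(n_dir, 1.0 - 0.5 * feedback)
        gain[winner] = 1.0 + feedback   # modulatory on-center feedback
        mt = mt * gain
        mt /= mt.max() + 1e-12          # keep activities bounded
    return mt, winner

rng = np.random.default_rng(0)
mt = 0.1 * rng.random((20, 8))   # ambiguous signals at 20 positions
mt[:4, 5] = 1.0                  # feature-tracking evidence: direction 5
_, winner = mt_mst_grouping(mt)
print(winner)                    # 5: the tracked direction wins everywhere
```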

Berzhanskaya et al. (2007), Chey et al. (1997), and Grossberg et al. (2001) have used this motion-grouping process to simulate data showing how the present model solves the aperture problem. Pack and Born (2001) provided supportive neurophysiological data, wherein the responses of MT cells over time to the motion of the interior of an extended line dynamically modulate away from the local direction that is perpendicular to the line and toward the direction of line terminator motion.

Both the V2-to-MT and the MSTv-to-MT signals carry out selection processes using modulatory on–center, off–surround interactions. The V2-to-MT signals select motion signals at the locations and depth of a moving boundary. The MST-to-MT signals select motion signals in the direction and depth of a motion grouping. Such a modulatory on–center, off–surround network was predicted by Adaptive Resonance Theory to carry out attentive selection processes in a manner that enables fast and stable learning of appropriate features to occur. See Raizada and Grossberg (2003) for a review of behavioral and neurobiological data that support this prediction in several brain systems. Direct experiments to test it in the above cases still remain to be done.

Near-to-far inhibition and peak shift are the processes whereby MST cells that code nearer depth inhibit MST cells that code similar directions and positions at farther depths. In previous 3D FORMOTION models, this near-to-far inhibition only involved MST cells of the same direction. Depth suppression in the present model is done via a Gaussian kernel in direction space (Appendix Eq. 31). When this near-to-far inhibition acts, it causes a peak shift in the maximally activated direction at the farther depth. This peak shift causes vector decomposition.
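
The following sketch illustrates the resulting peak shift with eight directions; the activity profiles, inhibitory gain, and kernel width are illustrative values patterned after Fig. 6, not the Appendix parameters:

```python
import numpy as np

DIRS = ['E', 'NE', 'N', 'NW', 'W', 'SW', 'S', 'SE']  # 45-deg bins

def near_to_far_inhibition(far, near, k=0.85, sigma=0.8):
    """Each far-depth direction is inhibited by near-depth activity in
    similar directions, via a circular Gaussian kernel in direction
    space (cf. Appendix Eq. 31)."""
    n = len(far)
    i = np.arange(n)
    ddist = np.minimum(np.abs(i[:, None] - i[None, :]),
                       n - np.abs(i[:, None] - i[None, :]))
    w = np.exp(-ddist ** 2 / (2 * sigma ** 2))
    return np.maximum(0.0, far - k * (w @ near))

# Far depth: the horizontally oscillating dot peaks in the W direction.
far = np.array([0.0, 0.0, 0.1, 0.5, 1.0, 0.5, 0.1, 0.0])
# Near depth: the grouped frame moves strongly to the SW.
near = np.array([0.0, 0.0, 0.0, 0.2, 0.5, 1.0, 0.5, 0.2])

shifted = near_to_far_inhibition(far, near)
print(DIRS[int(np.argmax(far))])      # W : before suppression (Fig. 6b)
print(DIRS[int(np.argmax(shifted))])  # NW: after suppression (Fig. 6c)
```

Applying the same subtraction to the vertically oscillating dot, whose peak lies in the S direction, shifts its peak toward SE, the opposite residual direction.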

In particular, consider the stimulus in Fig. 1. First, note that large-scale MST cells in the near plane inherit the dominant southwest motion direction of the grouped stimulus from MT Layer 2/3 cells in the same plane (Fig. 6a). For the same reason, MST cells in the far plane inherit the motion direction of single dots from MT Layer 2/3 cells in the corresponding depth plane (Fig. 6b). Figure 6c illustrates the effect of depth suppression from the direction in Fig. 6a on the distribution of directionally specific activity of an MST cell that responds to the dot moving to the left.

Fig. 6

Peak shift in motion perception via depth suppression over the leftward-moving dot in Johansson’s Experiment 19. (a) MST cell activity in nearer depth (larger scale). (b) MST cell activity at the same spatial location but in the farther depth (smaller scale), without depth suppression. (c) MST cell activity in the farther depth (smaller scale), with depth suppression. Each bin represents the activity in one of eight directions

If near-to-far depth suppression were disabled, the peak of motion activity would be in the left direction of motion (Fig. 6b). With depth suppression, however, motion directions close to the southwest direction are strongly inhibited, resulting in a peak shift to the northwest direction of motion (Fig. 6c). The same scenario occurs, but in the opposite direction, for the vertically oscillating dot. Thus, vector decomposition occurs because of a peak shift in motion direction, which is in turn due to depth suppression and the representation of stimulus motion at various scales and corresponding depths. Empirical evidence supporting the predicted model connections is summarized in Table 1.

Table 1 Anatomical connections

Simulation of psychophysical experiments

Symmetrically moving inducers

Johansson (1950, Exp. 19) used a stimulus display (Fig. 1) wherein each stimulus element contributed equally to the common reference frame, because of the symmetry of the display. Each frame in the simulation summarized by Fig. 7 represents the activity of a different model level at two scales at a single time as the dots move toward the lower left corner.

Fig. 7

Simulation of Johansson’s Experiment 19. Each frame represents the activity of one level at the two scales considered, for a single time frame as the dots are moving toward the lower left corner (t = 20). V2 form boundaries select signals in MT Layers 4 and 5/6 (Fig. 3, Level 5A), which enhances the diagonal motion direction in the large scale and the horizontal/vertical motion directions in the small scale. Long-range filtering in MT Layer 2/3 (Fig. 3, Level 5B) groups motion signals over the area subtended by the stimulus. Directional competition in MST (Fig. 3, Level 6) results in an enhanced diagonal direction of motion in the large scale, which is then subtracted from the corresponding activity in the small scale, resulting in an inward peak shift

For ease of viewing, network activity is overlaid on top of the V2 boundary input, which is depicted in gray. Motion signals selected by V2 boundaries in MT Layers 4 and 5/6 are displayed in the top row. The larger scale (left) selects motion signals corresponding to the grouped boundary, whereas the smaller scale (right) selects motion signals corresponding to individual dots. Long-range filtering in MT Layer 2/3 (middle row) groups motion signals at each scale. Thus, in the larger scale, the coherent southwest direction is enhanced with respect to its activity level at the previous layer. In comparison, the smaller scale maintains the physical motion directions corresponding to each dot. Directional competition in MSTv (bottom row) results in an enhanced diagonal direction of motion in the large scale, which is then subtracted from the corresponding activity in the small scale, resulting in an inward peak shift. Note that the magnitude of the shift reported in Fig. 7 is less than the 45° initially reported in Johansson (1950), which is compatible with results from a more recent instantiation of this paradigm, where angles of 30–40° were reported (Wallach, Becklen, & Nitzberg, 1985).

Wallach et al. (1985) explained this result by noting that it corresponds to the average direction that combines the true motion paths and the paths formed by the dots moving relative to each other, a mechanism they called process combination. In the model, the magnitude of the shift can be controlled by varying the strength of suppression in depth, which balances the contributions of the real and relative motion paths. Process combination can therefore be interpreted as resulting from the interaction of feedforward mechanisms representing true motion paths and feedback mechanisms responsible for the peak shift in motion direction. Figure 8 shows the MST cell activity in the two scales at the two critical moments of the stimulus cycle: when the dots move toward the left corner (top) and when they move in the reverse direction toward their respective origins (bottom). Note the reversal of motion directions in the small scale, which is again consistent with the percept and obeys the principle of vector decomposition.

Fig. 8

Two different phases of Johansson’s Experiment 19. (Top) Motion toward the lower left corner causes the dots to be perceived as moving inwardly. (Bottom) Motion toward the outer corners results in an outward motion percept. Insets indicate which phase of the stimulus corresponds to the activity shown

In his description of this experiment, Johansson (1950, p. 89) reported that this motion configuration was not the only one that subjects experienced. The physical motion path of one of the two dots could be recovered with overt attention directed to that dot, in which case the unattended dot was seen as on a sloping path—or even 3-D rigid motion of a rotating rod could be perceived. The simulation of Fig. 9 was obtained by attending in the nearer depth to the motion direction of the horizontally oscillating dot. As observed by Johansson (1950), attending to the horizontally oscillating dot in the westward direction results in the perception of its real direction of motion in the nearer depth, while the motion of the unattended dot is on a sloped path in the farther depth. Previous explanations of how top-down attention can bias form and motion percepts can also be applied here (Berzhanskaya et al., 2007; Grossberg & Swaminathan, 2004; Grossberg & Yazdanbakhsh, 2005). In Fig. 9, the slanted motion direction of the vertical dot results from a peak shift induced by the strong westward motion direction induced in the larger scale by the attended horizontal dot. In the model, top-down attention operates in the motion stream at the level of MST cells (Appendix Eqs. 28 and 30).

Fig. 9

Simulation of Johansson’s Experiment 19 with top-down attention to the horizontally moving dot. The true motion direction of this dot is perceived in the nearer plane, while the nonattended dot is seen as moving on a sloping path, as described in Johansson’s (1950) original experiment. (Top) Motion perceived as the dots move toward the lower left corner. (Bottom) Motion perceived as the dots move toward the outer corners

The robustness of the results in Figs. 7, 8 and 9 can be assessed by noting that the network, with the same parameters, also simulates a related experiment in which the dot paths intersect at their midpoints rather than at one end (Johansson, 1950, Exp. 20). Observers report a percept similar to that of the previous experiment, with the difference that four phases can be distinguished: the dots move to the lower left toward the center; away from the center; to the upper right toward the center; and away from the center once more. Figure 10 shows the peak-shifted activity obtained in the small scale at these four phases.

Fig. 10

Simulation of Johansson’s Experiment 20 with top-down attention to the grouped diagonal motion in the nearer plane at the level of MST. Insets indicate which phase of the stimulus corresponds to the activity shown

Rolling wheel experiment

The rolling wheel experiment of Duncker (1929/1938) demonstrates that not all elements in a display need contribute equally to the emergence of a relative reference frame. The experiment can be described as follows (Fig. 11; see Appendix Eq. 4): A single dot moving on the rim of a rolling illusory wheel is perceived to move according to its physical trajectory, in this case a cycloid curve (Fig. 11a). If a second dot is added that moves as if on the hub of the same illusory wheel (Fig. 11b), the cycloid dot is then seen as orbiting on a circular path with the hub at its center while translating to the right (Fig. 11c).
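
For reference, the two dot trajectories can be generated from the standard cycloid parameterization; the sketch below assumes that parameterization, whereas the article's stimulus is defined by Appendix Eq. 4:

```python
import numpy as np

a, omega = 1.0, 2.0 * np.pi          # wheel radius; one revolution/s
t = np.linspace(0.0, 2.0, 400)       # two revolutions

rim_x = a * (omega * t - np.sin(omega * t))  # cycloid: dot on the rim
rim_y = a * (1.0 - np.cos(omega * t))
hub_x = a * omega * t                        # hub dot: constant velocity
hub_y = np.full_like(t, a)

# Subtracting the hub (common) motion leaves the rim dot on a circle
# of radius a around the hub: the rotational percept of Fig. 11c.
assert np.allclose(np.hypot(rim_x - hub_x, rim_y - hub_y), a)
```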

Fig. 11

Rolling wheel experiment (Duncker, 1929/1938). (a) When a single dot is seen moving on a cycloid path, which describes the motion of a dot on the rim of a wheel, cycloid motion is seen. (b) When an additional dot moves on the hub of an illusory wheel of radius a, the cycloid path is then perceived as rotating on a circular path around the hub (c), and the total stimulus is seen as moving globally to the right

A proper account of the Duncker wheel experiment must explain the percept of true motion in the case of the single cycloid dot and the rotational motion perceived over the cycloid dot in the two-dot configuration, as well as the global rightward motion in the latter configuration. Figure 12 shows that the network is able to represent the true cycloid motion path at the level of MST cells in the single-cycloid-dot case. Here, each polar histogram shows the distribution of motion directions in MST cells over the area subtended by the cycloid dot at a particular phase of the revolution.

Fig. 12

Simulations of a single cycloid dot. Each polar histogram shows the motion activity over the cycloid dot at different phases of one rotation cycle. The presence of multiple bins in a given histogram denotes activation in multiple directions

Johansson (1974) provided a mathematical explanation of the wheel experiment in terms of vector analysis. As before, if the motion common to both dots is subtracted from the cycloid dot’s physical motion, the cycloid dot is seen to move in a circle around the center dot.

It has been suggested that the visual system treats the dot moving with constant velocity as the center of a configuration relative to which the motion of the other dots is perceived (Cutting & Proffitt, 1982; Rubin & Richards, 1988). The successive short-range and long-range directional filtering stages in the 3D FORMOTION model implement this constraint by accumulating directional evidence in the constant rightward motion direction of the hub dot. A strong rightward motion direction in the large scale hereby emerges at the hub and captures the motion of the cycloid dot. Figure 13 shows the activity observed at the level of MST (large scale) over time for the dot located at the center of the hub (Fig. 13a) and the cycloid dot (Fig. 13b).

Fig. 13

Horizontal motion capture in the large (near) scale over both stimulus dots in the Duncker wheel experiment. Activity in the rightward direction is shown by the curve denoted E. In each graph, a small vertical line on the x-axis indicates the time at which rightward motion activity reaches 0.18. (a) Activity observed over the hub dot’s location. (b) Activity observed over the cycloid dot’s location. Notice how the rightward direction quickly becomes dominant and stable over time

Note the early appearance of the rightward motion direction over the hub as compared to the cycloid. This is made explicit in Fig. 13 by a small vertical bar on the horizontal axis of each graph, which marks the time at which corresponding levels of activity are reached for both dots. The rightward motion signal propagates to the cycloid dot over the illusory contours that join them through time. The rightward direction of motion is retained at the position of the cycloid dot, even though its position on the y-axis changes throughout the simulation.

The 3D FORMOTION model predicts that elements of a visual display with constant velocity are more likely to govern the emergence of a frame of reference, due to the accumulation of motion signals in the direction of motion. A related prediction is that stimuli designed to prevent such accumulation of evidence will not develop a strong object-centered frame of reference. Partial support for this prediction can be found in an experiment by Kolers (1972, cf. Array 17 on p. 69) using stroboscopic motion on a display otherwise qualitatively the same as that in Johansson’s (1950) Experiment 19. Subjects’ percepts here seemed to reflect the independent motion of the dots rather than motion of a common frame of reference. A related case is that of Ternus–Pikler displays, in which one of the moving disks contains a rotating dot. Here, vector decomposition occurs only at the long interstimulus intervals (ISIs) that are also necessary to perceive grouped disk motion (Boi, Öğmen, Krummenacher, Otto, & Herzog, 2009).

As noted previously, the common motion direction is subtracted from the part motion via near-to-far suppression in depth, giving rise to a wheel-like percept over the cycloid dot. The simulations in Fig. 14 show this result, using the same polar histogram representation as in Fig. 12, for various levels of pruning (indicated as percentages) of the farther V2 boundaries. Although these results could be improved with a finer sampling of the direction space, they are sufficient to demonstrate a predicted role of MSTv interactions in generating a peak shift in motion direction that leads to the observed vector decomposition.

Fig. 14

Simulations of the Duncker wheel stimulus, at various levels of boundary pruning. Each polar histogram shows the motion activity in the small (farther) scale over the cycloid dot at different phases of one rotation cycle. The presence of multiple bins in a given histogram denotes activation in multiple directions. For each wheel plotted, the amount of pruning completed is shown as a percentage

To quantify the degradation of the percept as a function of the amount of boundary pruning, the motion directions obtained at each time step over the cycloid dot were correlated with those of an ideal rotating wheel according to the measure R defined in Appendix Eqs. 33, 34 and 35. The measure is defined so as to be bounded in [–1, 1], where R = –1 corresponds to a wheel rotating in the opposite direction, and R = 1 corresponds to a perfectly represented wheel. Figure 15a shows the results obtained using this similarity measure for Duncker wheel simulations with increasing amounts of pruning completed. Figure 15b shows the result obtained for the simulation of the cycloid dot only, in which there is no boundary pruning. Comparing Fig. 15a and b is sufficient to see that Duncker wheel simulations yielded more wheel-like activation in MST than did the cycloid simulation, at all levels of boundary pruning.
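
A schematic stand-in for such a measure (the exact definition is given in Appendix Eqs. 33, 34 and 35) is the mean cosine of the angular error between the observed MST direction over the cycloid dot and the tangential direction of an ideal rolling wheel at the same phase:

```python
import numpy as np

def wheel_similarity(observed_dirs, ideal_dirs):
    """Mean cosine of the angular error, bounded in [-1, 1]: 1 for a
    perfectly represented wheel, -1 for rotation in the opposite
    direction (the velocity is reversed at every phase)."""
    return float(np.mean(np.cos(observed_dirs - ideal_dirs)))

phases = np.linspace(0.0, 2.0 * np.pi, 50, endpoint=False)
ideal = phases + np.pi / 2.0                   # tangential wheel motion
print(wheel_similarity(ideal, ideal))          # 1.0
print(wheel_similarity(ideal + np.pi, ideal))  # -1.0
```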

Fig. 15

Effect of boundary pruning on MST activity evaluated using similarity measure R. Directional activity in MST perfectly consistent with wheel-like rotation in the rightward direction should yield R = 1, whereas nonrotational motion should lead to a smaller value of R. (a) As the percentage of pruning completed increases, MST activity observed in the Duncker wheel simulations becomes less wheel-like. Nevertheless, motion is always more circular than that observed in cycloid dot simulations (b)

Discussion

The 3D FORMOTION model predicts that the creation of an object-centric frame of reference is driven by interacting stages of the form and motion streams of visual cortex: Form selection of motion-in-depth signals in area MT and interdepth directional inhibition in MST cause a vector decomposition whereby the motion directions of a moving frame at a nearer depth suppress these directions at a farther depth, and thereby cause a peak shift in the perceived directions of object parts moving with respect to the frame. In particular, motion signals predominant in the larger scale, or nearer depth, induce a peak shift of activity in smaller scales, or farther depths. The model qualitatively clarifies relative motion properties as manifestations of how the brain groups motion signals into percepts of moving objects, and quantitatively explains and simulates data about vector decomposition and relative frames of reference.

The model also qualitatively explains other data about frame-dependent motion coherence. Tadin, Lappin, Blake, and Grossman (2002) presented observers with a display consisting of an illusory pentagon circularly translating behind fixed apertures, with each side of the pentagon defined by an oscillating Gabor patch. The locations of the apertures and of the corners of the pentagon never overlapped, such that the latter were kept hidden during the entire stimulus presentation. Subjects had to judge the coherence of motion of the Gabor patches belonging to the different sides of the pentagon. Crucially, when the apertures were present, subjects reported seeing the patches as forming the shape of a pentagon, whereas when the apertures were absent, the patches did not seem to belong to the same shape. Results showed that motion coherence estimates were much better when apertures were present than when they were not. According to the FACADE mechanisms in the form stream, the presence of apertures triggers the formation of illusory contours linking the contours of the Gabor patches into a single pentagon behind the apertures (see Berzhanskaya et al., 2007). Subsequent form selection and long-range filtering in MT lead to a representation of the pentagon’s motion at a particular scale. This global motion direction is then subtracted from local motion signals of individual patches, thereby leading to better coherence judgments. In the absence of the apertures, form selection followed by long-range filtering of motion signals did not occur, such that the motion of individual patches mixed the common- and part-motion vectors, making coherence judgments difficult.

In displays where the speeds of the moving reference frame and of a smaller moving target can be decoupled, the perceived amount of vector decomposition has been shown to be proportional to the speed of the frame (Gogel, 1979; Post et al., 1989). This can be interpreted by noting that the firing rate of an MT cell in response to a moving stimulus increases with stimulus speed according to the cell’s speed tuning (Raiguel, Xiao, Marcar, & Orban, 1999). A frame of reference moving at a higher speed should therefore lead to higher cortical activation in the larger scales of MT and MST, and thus to a more pronounced motion direction peak shift, reflecting the stronger percept of vector decomposition (Gogel, 1979; Post et al., 1989). For the same reason, the model also predicts that the amount of shift in the perceived direction of the moving target is inversely proportional to target speed: A stronger peak in the motion direction distribution in the smaller scale (before subtraction) will be shifted less by subtraction from the large scale. Another prediction is that vector decomposition mechanisms occur mainly through MT–MST interactions.

The simulations shown here were conducted using a minimal number of scales in order to explain the experimental results. However, the model can be generalized to include a finer sampling of scale space, perhaps with depth suppression occurring as a transitive relation across scale. Such an arrangement of scales would then be able to account for experimental cases in which vector decomposition must be applied in a hierarchical manner, such as in biological motion displays (Johansson, 1973). Accordingly, residual motion of the knee is obtained after subtraction of the common motion component of the hip and knee, whereas residual motion of the ankle is obtained after subtraction of the common motion component of the knee and ankle. Similar decompositions occur for upper limb parts. Such vector decompositions would require the use of spatial scales roughly matched to the lengths of the limbs, with depth suppression occurring from larger scales coding for limb motion to smaller scales coding for joint motion.

The present model explains cases of vector analysis in which retinal motion is imparted to all display elements, as opposed to some being static. The model would need to be refined to account for induced motion displays using an oscillating rectangle to induce an opposite perceived motion direction in a static dot (Duncker, 1929/1938). The suggestion that additional mechanisms are needed to explain induced motion is supported by experimental evidence highlighting differences between induced motion and vector decomposition, as summarized by Di Vita and Rock (1997). For example, induced motion is typically not observed when the reference frame’s physical speed is above the threshold for motion detection, whereas the vector decomposition stimuli analyzed here are robust to variations in speed. Also, in induced motion, the motion of the frame is underestimated or not perceived at all, whereas the common motion component in vector decomposition stimuli is perceived simultaneously with that of the parts.