Field-like interactions between motion-based reference frames
A reference frame is required to specify how motion is perceived. For example, the motion of part of an object is usually perceived relative to the motion of the object itself. Johansson (Psychological Research, 38, 379–393, 1976) proposed that the perceptual system carries out a vector decomposition, which results in common and relative motion percepts. Because vector decomposition is an ill-posed problem, several studies have introduced constraints by means of which the number of solutions can be substantially reduced. Here, we have adopted an alternative approach and studied how, rather than why, a subset of solutions is selected by the visual system. We propose that each retinotopic motion vector creates a reference-frame field in the retinotopic space, and that the fields created by different motion vectors interact in order to determine a motion vector that will serve as the reference frame at a given point and time in space. To test this theory, we performed a set of psychophysical experiments. The field-like influence of motion-based reference frames was manifested by increased nonspatiotopic percepts of the backward motion of a target square with decreasing distance from a drifting grating. We then sought to determine whether these field-like effects of motion-based reference frames can also be extended to stationary landmarks. The results suggest that reference-field interactions occur only between motion-generated fields. Finally, we investigated whether and how different reference fields interact with each other, and found that different reference-field interactions are nonlinear and depend on how the motion vectors are grouped. These findings are discussed from the perspective of the reference-frame metric field (RFMF) theory, according to which perceptual grouping operations play a central and essential role in determining the prevailing reference frames.
Keywords: 2-D motion · Motion integration · Temporal processing
The relativity of perceived motion
By definition, motion is a change of position over time. Hence, to determine motion, one needs references (or coordinate systems) for position and time, and motion becomes relative to these coordinate systems. The choice of coordinate systems and their scales depends on the phenomena of interest. For example, astronomical scales are used to characterize planetary motions, and the orbital speed of the earth in a solar reference frame is about 30 km/s. However, in our ecological environment, for all practical purposes the earth appears stationary, and an earth-based (geocentric) reference frame prevails. As a result, explicitly or implicitly, geocentric (also called spatiotopic) motion is generally regarded as the “real,” “physical,” or “absolute” motion, whereas motion relative to other reference frames is considered “relative” or “illusory” motion (Swanston, Wade, & Day, 1987; Wade & Swanston, 1987). However, in analyzing its inputs, our visual system is faced with the complexity that arises at the level of individual stimuli, and perceptual organizational principles, such as Gestalt grouping and figure–ground segregation, have been proposed as fundamental requirements for processing visual inputs (Koffka, 1935; Wagemans, Elder, et al., 2012; Wagemans, Feldman, et al., 2012). The perception of a stimulus depends not only on its own individual properties but also on the properties of other spatiotemporally neighboring stimuli (Koffka, 1935). It is not surprising, then, that relative motion was a central topic in Gestalt psychology (Duncker, 1929; Ellis, 1938). In one of his experiments, Karl Duncker used displays generated by point-lights attached to an otherwise invisible rotating and translating circular piece of cardboard (Duncker, 1929, p. 240). He found that when a single point-light was attached to the rim of the cardboard, observers reported seeing the point-light moving along a cycloidal trajectory.
On the other hand, when he added a second point-light to the hub of the cardboard, observers often reported perceiving the point-light attached to the rim undergoing circular motion around the point-light on the hub, which itself was perceived to move horizontally. These results can be understood in terms of the reference frame against which the motion of the point-light is perceived. Cycloidal motion corresponds to a trajectory relative to a geocentric reference, whereas the rotation corresponds to a trajectory relative to a moving reference frame positioned on the point-light at the hub.
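Duncker's observation can be made concrete with a little arithmetic: the geocentric (cycloidal) trajectory of a rim point is the sum of the hub's horizontal translation and a circular motion about the hub. Below is a minimal sketch of this decomposition; the radius and angular speed are illustrative values, not parameters from Duncker's display.

```python
import math

def rim_point(t, r=1.0, omega=1.0):
    """Geocentric (cycloidal) position of a point on the rim of a wheel
    of radius r rolling to the right at angular speed omega."""
    hub = (r * omega * t, r)                 # hub translates horizontally
    rel = (-r * math.sin(omega * t),         # circular motion about the hub
           -r * math.cos(omega * t))
    return (hub[0] + rel[0], hub[1] + rel[1])

# At t = 0 the two components cancel vertically: the rim point sits on
# the ground, at the cusp of the cycloid.
x, y = rim_point(0.0)
```

Relative to the hub, the same point simply traces a circle; the cycloid appears only when the hub's translation is added back in, which is exactly the relation between the geocentric and the hub-centered reference frames.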
The theory of perceptual vector decomposition
Vector decomposition: An ill-posed problem
Although the processes illustrated in Fig. 1 seem straightforward, in mathematical terms, vector decomposition is an ill-posed problem: Infinitely many pairs of common and relative motions can produce exactly the same absolute motion. Figure 1D shows an alternative set of common and relative motion components corresponding to the same physical motion as in Fig. 1A. Johansson recognized this ambiguity in some of his studies (Johansson, 1950, 1958; Johansson & Jansson, 1968). For example, in his Exp. 19 (Johansson, 1950, p. 89), in which he presented two-dot displays in which one of the dots oscillated horizontally while the other oscillated vertically (see Fig. 1E), he reported that subjects did not always experience the same motion configuration. If they attended to one of the dots, they perceived the geocentric motion of that dot, while perceiving the other dot as moving along a slanted trajectory. There were even reports of 3-D rigid motion of a rotating rod. Similarly, with Duncker’s (1929) wheel stimulus, some observers reported a rotating wheel, whereas others reported that the motion of two point-lights more resembled a tumbling stick (Cutting & Proffitt, 1982; Proffitt, Cutting, & Stier, 1979).
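The ill-posedness is easy to state formally: if a dot's retinotopic (absolute) motion is v_abs, then any choice of common motion c yields a valid decomposition with relative motion r = v_abs - c. A short sketch, using arbitrary illustrative vectors:

```python
import numpy as np

v_abs = np.array([3.0, 2.0])   # retinotopic (absolute) motion of a part

# Any choice of common motion c implies a relative motion r = v_abs - c,
# and every such pair reproduces exactly the same absolute motion.
for c in (np.array([3.0, 0.0]),    # purely horizontal common motion
          np.array([1.5, 1.0]),    # oblique common motion
          np.array([0.0, 0.0])):   # no common motion at all
    r = v_abs - c
    assert np.allclose(c + r, v_abs)
```

Since the constraint c + r = v_abs leaves c completely free, infinitely many decompositions exist, which is why additional constraints (regularization) or, as proposed here, field interactions are needed to single one out.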
Since observers do not perceive all possible solutions, but instead a rather small subset (see, e.g., Johansson, 1950; Proffitt et al., 1979; Wallach, Becklen, & Nitzberg, 1985), the fundamental questions are to determine which subset of solutions is selected by the visual system, and why. Mathematically, the number of solutions of an ill-posed problem can be reduced by introducing additional information or constraints, an approach known as regularization (Marr & Ullman, 1981).
Which subset and why: Regularization approach in vector decomposition
A variety of constraints have been proposed to explain how the visual system regularizes vector decomposition. Hochberg and McAlister (1953) argued that the perceptual system chooses the simplest solution in terms of the information required to define the pattern when it encounters an ambiguous stimulus with multiple potential interpretations. In other words, the alternative in Fig. 1D will be rejected because it necessitates additional motion vectors rather than just one, as in the case of Fig. 1C. Börjesson and von Hofsten (1972) proposed as a constraint that residual motion vectors should sum to zero. Gogel proposed the “adjacency principle,” according to which the relative motion determination is restricted only to nearby objects (Gogel, 1974; Gogel & Koslow, 1972). Proffitt and colleagues proposed that the common motion is determined by the motion of the center of gravity of the dots (Proffitt et al., 1979). Restle (1979) proposed “information load” as the constraint to be minimized in determining the prevailing solution. A hybrid system that minimizes either the common or relative component, depending on which process (assuming that the common and relative motion calculations are done via independent processes) is completed first, has been shown to account for some of the classical findings in dot motion experiments (Cutting & Proffitt, 1982). More recently, building upon Johansson’s (1950) original study of vector analysis, a Bayesian framework with a set of probabilistic constraints has also been introduced (Gershman, Jäkel, & Tenenbaum, 2013). In sum, the constraints introduced in regularization approaches to vector decomposition provide heuristics to explain, at least partially, why the human visual system selects a particular vector decomposition in motion perception.
To put this approach in perspective, consider its use in physics. In physics, the principle of minimum total potential energy (which states that particles move so as to minimize the total potential energy) is formulated to explain why things move the way they do. Hence, on the basis of this constraint, a global energy function can be minimized to determine the motion of particles in a medium. An alternative perspective is to express how this particular solution emerges in real time through interactions between particles. In this case, one uses forces and fields applied to particles. In mathematical terms, these two approaches are related, and can be expressed as the energy (or Lyapunov) function of a system and the differential equations governing the system.
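The equivalence of the two perspectives can be illustrated with a toy system: minimizing a quadratic energy function is the same as following the force field given by its negative gradient. The energy function and step size below are arbitrary illustrations, not part of the RFMF model:

```python
import numpy as np

def energy(x):
    return 0.5 * np.dot(x, x)        # simple quadratic potential E(x)

def grad(x):
    return x                         # gradient of the quadratic above

# Euler-integrate dx/dt = -grad E(x): the "force/field" description
x = np.array([2.0, -1.0])
trace = [energy(x)]
for _ in range(100):
    x = x - 0.1 * grad(x)            # follow the force (negative gradient)
    trace.append(energy(x))

# E decreases monotonically along the trajectory (Lyapunov property),
# so the dynamics arrive at the same minimum the "why" description names.
assert all(a >= b for a, b in zip(trace, trace[1:]))
```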
Reference-frame metric field (RFMF) theory
The goal of the present study was to investigate whether and how different reference-frame fields interact. We sought to determine (i) whether reference-frame fields are actually generated by local motion vectors (i.e., to replicate previous findings with a variant of the stimuli previously used), (ii) whether stationary landmarks generate reference fields as well, (iii) whether the fields created by different motion-based reference frames actually interact, and if so, (iv) how they interact. To this end, we probed the perceived direction of a moving dot with and without drifting gratings (which produce motion-based reference frames) at various distances from the dot, and we quantified the field effect and relativity of the perceived motion.
Three naive observers and one of the authors (M.N.A.) participated in the study. The age of the participants ranged from 26 to 29 years, and all participants had normal or corrected-to-normal vision. All experiments followed a protocol approved by the University of Houston Committee for the Protection of Human Subjects, and each observer gave written informed consent before the experiments.
All visual stimuli were created via the VSG2/5 visual stimulus generator card (Cambridge Research Systems) and displayed at a resolution of 800 × 600 pixels, with a refresh rate of 100 Hz. Gaze position monitoring for both eyes was performed by means of an EyeLink II eyetracker at a 250-Hz sampling rate. The distance between observers’ eyes and the display was 1 m, and the dimensions of the display at this distance were 22.7 × 17.0 deg. A head/chin rest was used to help stabilize fixation, and observers reported their responses via a joystick. All experiments were done in a normally illuminated room.
Experiment 1: The reference-field effect
The purpose of the experiment was to demonstrate the distance-dependent effects of a motion-based reference frame—that is, the reference-field effect. In Experiment 1a, we established a baseline for the judgments of motion direction in the absence of any dynamic reference frames. In Experiment 1b, we added a motion-based reference frame to the display and varied the distance of the target from this reference.
Stimuli and procedures
In Experiment 1a (Fig. 4A), the target square was presented alone at various vertical eccentricities across blocks (0, ±2.75, and ±5.5 deg; positive and negative values indicate the upper and lower visual fields, respectively). In Experiment 1b, a square wave grating (dimensions: 23 × 1 deg, spatial frequency: 0.25 cpd, duty cycle: 50 %, Michelson contrast: 0.98) was also presented, always at 6.5-deg vertical eccentricity (in either the upper or the lower part of the display). The distances between the target square and the grating were, therefore, 1, 3.75, 6.5, 9.25, and 12 deg. The vertical eccentricities of the target (i.e., target–grating distances) were blocked, and the order of blocks was randomized. The drift speed of the grating was equal to the average speed of the target square (9°/s), and the drift was always in the same direction as the target square’s motion (see the thick arrow in Fig. 4B and the thick top line in Fig. 4F).
Observers were asked to spread their attention to the entire display and, as soon as the target (and the grating) completed their motion and disappeared, to report via a joystick whether the target had ever moved backward during the trial (yes/no). The amplitude of the sine modulation in the target’s velocity profile was varied across trials by an adaptive staircase algorithm (see the various thin curves showing different modulation amplitudes in Fig. 4F). For each reversal in observers’ responses, the step size in the staircase was halved. Four independent staircases with randomly chosen initial amplitudes (within the range 0°–19°/s) were interleaved in a block of trials. A single staircase was completed in 15–25 trials. A staircase was considered “converged” when it had encountered ten reversals in an observer’s responses and the last eight reversals were used to calculate the threshold for perceiving backward motion. The minimum velocity of the target corresponding to this threshold amplitude was taken as the reference-field effect. For instance, if the staircase converged to 9°/s amplitude for sine modulation, it corresponded to the minimum target velocity of (9 – 9 =) 0°/s (reference-field effect = 0°/s). This would mean that backward motion is perceived only when the target velocity goes below 0°/s (veridical percept). On the other hand, if, for instance, the staircase converges to 6°/s, corresponding to the minimum target velocity of 3°/s, it would mean that as soon as the target velocity fell below 3°/s (reference-field effect = 3°/s), backward motion would be perceived (illusory percept), although it might never spatiotopically move backward. For each vertical position of the target, each observer ran one block of trials (four staircases).
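The staircase logic described above can be sketched in code. This is an illustrative simulation, not the authors' implementation: it assumes a deterministic toy observer who reports backward motion whenever the modulation amplitude exceeds a fixed internal threshold.

```python
import random

def run_staircase(threshold, avg_speed=9.0, start=None,
                  step=2.0, n_reversals=10):
    """Simulate one adaptive staircase on the sine-modulation amplitude.
    'threshold' is the toy observer's true amplitude threshold for
    reporting backward motion; a 'yes' response lowers the amplitude,
    a 'no' response raises it. The step size is halved at each response
    reversal, and the mean amplitude over the last eight reversals is
    taken as the converged amplitude."""
    amp = start if start is not None else random.uniform(0.0, 19.0)
    reversals, last = [], None
    while len(reversals) < n_reversals:
        saw_backward = amp > threshold           # deterministic toy observer
        direction = -1 if saw_backward else +1   # yes -> decrease amplitude
        if last is not None and direction != last:
            reversals.append(amp)                # record reversal amplitude
            step /= 2.0                          # halve step at each reversal
        amp = max(0.0, amp + direction * step)
        last = direction
    est_amp = sum(reversals[-8:]) / 8.0
    return avg_speed - est_amp                   # reference-field effect (deg/s)
```

For a simulated threshold amplitude of 6 deg/s, the procedure converges near a reference-field effect of 9 - 6 = 3 deg/s, matching the worked example in the text.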
Experiment 1a was designed to determine quantitatively for each observer the ability to detect a reversal in the direction of motion of a simple stimulus as a function of eccentricity. Ideally, the thresholds for detecting backward motion should be zero or close to zero. However, since we used a yes/no task, bias might occur. In Experiment 1b, the reference field generated around the drifting grating should cause illusory percepts. To be more specific, the RFMF theory predicts that backward motion should be perceived even at positive minimum target velocities (field effects > 0°/s). In order to show that percepts were not simply due to subtraction of a common motion component and that a motion-based reference frame is effective only within a limited spatial region, we analyzed the field effects as a function of the grating-to-target distance.
Results and discussion
The presence of the drifting grating significantly changed the pattern of results. Figure 5B shows the reference-field effects (defined as the minimum speed at which the target was perceived to be moving backward; see the Stimuli and Procedures section above) as a function of the target–grating distance. Although we found a small but significant perceptual bias in Experiment 1a in the absence of the drifting grating, we did not take this bias into account for the results of Experiment 1b, since the bias was not eccentricity dependent. Factoring out the bias would thus only shift all values upward and would not affect the statistical analysis. The drop in the effect size with increasing target–grating distance, shown in Fig. 5B, is significant [F(4, 12) = 13.550, p < .001, ηp2 = .819]. The results replicate the distance-dependent influence of motion-based reference frames on the perceived motion of nearby objects that has been shown by several studies (DiVita & Rock, 1997; Hochberg & Fallon, 1976; Mori, 1979; Shum & Wolford, 1983). All of these studies lend support to the claim that common motion might serve as a reference frame, but that its effectiveness is limited to a spatial region.
Experiment 2: No motion, no interaction
The RFMF theory predicts that reference-field interactions will occur only when there is motion. In this experiment, we tested this prediction by presenting a static grating in addition to the dynamic grating. The static grating was presented in the opposite half of the visual field from the dynamic grating. It has been shown that the presence of a stationary landmark (such as a fixation point, a surrounding frame, etc.) substantially influences perceived motion (e.g., Wallach et al., 1985). The static grating used in Experiment 2 provided an additional reference frame (along with the display borders) for motion computations; however, whether it could generate interactions with a reference field needed to be investigated. If the presence of the static grating modulated the strength of the motion-based reference field, we should see distance-dependent drops in the reference-field effect, as compared to the case in which only the dynamic grating was presented.
Stimuli and procedures
The stimuli and procedures were identical to those used in Experiment 1b, with the following exceptions. In addition to the dynamic grating, which always moved in the same direction as the target, a second grating was presented at the same vertical eccentricity as the dynamic one, but in the opposite half of the screen (Fig. 4C). The second grating was stationary (drift velocity = 0°/s). Which one of the gratings was presented in the upper visual field was randomized across trials.
Results and discussion
One might also argue that the failure to find a significant interaction between distance and experiment here was due to a floor effect. At 9–12 deg of distance from the dynamic grating, the effect size is already very close to zero, so any potential drop in the effect size due to the presence of the static grating might have been masked. To address this issue, we took the three closest distances to the dynamic grating and repeated the statistical analysis. The main effect of distance was again significant [F(2, 6) = 10.240, p = .012, ηp2 = .773], whereas the main effect of experiment was not [F(1, 3) = 4.775, p = .117, ηp2 = .614]. The interaction of the main factors was, once again, not significant [F(2, 6) = 0.358, p = .713, ηp2 = .106]. Moreover, within-observer differences between the two experiments did not show any trends whatsoever (see Fig. 6B). If a field were indeed associated with static references, we would expect to see positive slopes—that is, an effect-size difference that increases with distance.
Experiment 3: Interacting dynamic reference fields
In the first experiment, we replicated the basic finding that the perceived motion of a stimulus can be influenced by the motions of nearby objects, and that this effect spreads over space in a field-like manner. In the second experiment, we showed that the reference fields are generated only when motion is present. In ecological viewing conditions, a multitude of objects might move in various directions. According to the RFMF theory, in order to perceive sharp and clear forms of these objects, each object needs to be processed in a proper reference frame (determined by the local motion vectors). Furthermore, the selection of a certain reference frame is not done in an all-or-none manner. Instead, each reference frame has a reference field associated with it and exerts its effect within this spatiotemporally limited field. A question, then, arises: What if several reference fields come into close proximity with each other? The RFMF theory suggests that the reference fields would interact to reach an equilibrium in the retinotopic space. In this experiment, we tested this prediction.
Stimuli and procedures
The stimuli and procedures were identical to those used in Experiment 2, with the following exceptions. The static grating was replaced by a drifting grating having the same spatial characteristics as the dynamic grating in Experiments 1b and 2. In Experiment 3a, the drift velocities of the two gratings were identical and equal to the average velocity of the target (Fig. 4D). Since these gratings were identical in all respects, the target–grating distances of 1 and 12 deg and of 3.75 and 9.25 deg were essentially the same. Therefore, we effectively had only three target–grating distances in Experiment 3a (1, 3.75, and 6.5 deg). In Experiment 3b, one of the gratings (primary) always drifted in the target’s motion direction, while the other (secondary) drifted in the opposite direction, but with the same speed (Fig. 4E). As in Experiment 2, the target was presented at five different vertical eccentricities (the corresponding target–primary-grating distances were 1, 3.75, 6.5, 9.25, and 12 deg). Which one of the gratings was presented in the upper visual field was randomized across trials.
Results and discussion
When we compute the motion of an object in everyday life, we generally use the static environment as a reference frame. However, the perceived motion of an object corresponds to its motion with respect to a static reference frame only in special, simple cases. When we see a friend waving his or her hand on a moving bicycle, the hand undergoes a complex motion trajectory with respect to the static background. But, in fact, we perceive the hand moving vertically up and down, discounting the motion of the bicycle. Likewise, we see the wheels of the bicycle rotating around their axles, with the horizontal motion of the axles discounted. Hence, the circular motion of the wheel is perceived with respect to the translatory motion of the bicycle. The inadequacy of using the static environment as the single reference frame and the roles of moving reference frames have been systematically investigated by Johansson (1950, 1958, 1973, 1976, 1986) and many others (Duncker, 1929; Gogel, 1974; Hochberg & Fallon, 1976; Mori, 1979; Wallach et al., 1985; Wallach, O’Leary, & McMahon, 1982).
Johansson claimed that the rotary motion of any point on a wheel can be deduced by perceptual subtraction of the translatory motion vector, common to both the hub and the wheel, from its “real” cycloidal motion (Johansson, 1950, 1976). The theory of perceptual vector analysis can explain rapid perception of highly complex motion patterns, such as biological motion displays, by a hierarchy of moving reference frames, thus simplifying the motions of the knees and feet as the simple harmonic motion of a pendulum (Johansson, 1973, 1976). The gist of Johansson’s theory lies in the extraction of common and relative motion components. However, many studies have demonstrated that the extraction of common motion is not always perfect (DiVita & Rock, 1997; Gogel, 1974; Hochberg & McAlister, 1953; Johansson, 1974; Mori, 1979, 1984; Wallach, 1959). More importantly, in mathematical terms, vector decomposition is an ill-posed problem and needs additional information to be solved. Several constraints to limit the number of solutions have been introduced (Börjesson & von Hofsten, 1972; Cutting & Proffitt, 1982; Gershman et al., 2013; Gogel, 1974; Gogel & Koslow, 1972; Hochberg & McAlister, 1953; Proffitt et al., 1979; Restle, 1979). In short, these constraints provide heuristics to explain why the human visual system selects a particular solution. In the present study, we have taken the alternative perspective, and looked at how a particular solution emerges through interactions between motion vectors.
Previously, we have shown that the perceived motion of a target stimulus can be influenced by the nearby motion of another object, and that this object need not surround the target stimulus, as in the induced-motion paradigm (Agaoglu et al., 2015; Noory et al., 2015). In the present study, we started off by replicating our previous finding that each local motion vector has a reference field associated with it, manifested by increased illusory percepts of backward motion with decreasing distance to the moving reference frame. We then sought to determine whether these field-like effects of motion-based reference frames can also be extended to stationary landmarks. We presented a highly salient stationary grating along with a drifting one to examine whether the effect of the latter would be in any way modulated by the presence of the former. Although there was a consistent trend of reduced effect sizes in the presence of the stationary grating, this reduction did not reach significance. More importantly, we did not find any significant interaction between distance and the presence/absence of the stationary grating, suggesting that reference fields interact only when there is motion.
In order to investigate whether and how different reference fields interact with each other, we presented two drifting gratings at various distances from the target square. In different experiments, we manipulated the drift direction of the secondary grating while the primary grating always drifted in the same direction as the average target velocity. We found that when both gratings drifted in the same direction, their effects combined and strengthened the illusory backward motion percepts. When the secondary grating drifted in a direction opposite to the direction of both the target and the primary grating, we found a significant drop in the illusory percept—that is, in the reference-field effect. These drops, however, cannot be explained by linear summation of the reference fields. Taken together, these findings suggest that reference fields do interact, and the way that their effects combine is nonlinear and depends on how the motion vectors are grouped.
These results clarify the details of the interactions posited in the RFMF theory. The RFMF theory was developed to explain how the visual system computes the attributes of stimuli under ecological conditions—that is, when the observer and the objects in the environment are in motion. Due to visible persistence, moving objects should appear extensively smeared, but under normal viewing conditions they do not (Hammett, 1997; Ramachandran et al., 1974). In addition, a moving object activates a retinotopically anchored receptive field only briefly, which does not allow sufficient time for computation of the stimulus attributes. As a result, one would expect moving objects to have a featureless “ghost-like” appearance (Öğmen, 2007; Öğmen & Herzog, 2010); however, under normal viewing conditions, moving objects appear sharp and clear. RFMF suggests that the visual system avoids these problems by computing the attributes of moving objects, not based on retinotopic coordinates, but instead according to a reference frame that moves with the object. To achieve this, as is depicted in Fig. 3, a first stage of processing groups motion information and extracts reference frames that are used to compute the attributes of the moving objects. The use of nonretinotopic, motion-based reference frames has been supported by several studies (Agaoglu et al., 2012; Boi et al., 2009; Hisakata, Terao, & Murakami, 2013; Kawabe, 2008; Nishida, Watanabe, Kuriki, & Tokimoto, 2007; Öğmen, Otto, & Herzog, 2006; Yamada & Kawabe, 2013). On the basis of our results, we can summarize the reference-frame rules as follows: First, individual motion vectors are grouped according to their similarities (law of common fate). Each individual motion vector creates a field whose effect decays with distance. The fields of vectors that are grouped together reinforce each other. 
The field of a vector can weaken the effects of the fields of vectors that form a different group; however, without being grouped with those motion vectors, it cannot reverse their effects.
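These grouping rules can be summarized in a toy computation. The Gaussian fall-off, its width, and the simple additive combination below are illustrative assumptions, not the fitted form of the reference fields:

```python
import math

def field_strength(distance, sigma=4.0):
    """Hypothetical decay of a single motion vector's reference field
    with retinotopic distance (Gaussian fall-off; sigma is illustrative)."""
    return math.exp(-(distance ** 2) / (2 * sigma ** 2))

def combined_effect(distances, same_group):
    """Sketch of the rules: fields of vectors grouped with the reference
    reinforce (add), fields from a different group only attenuate, and
    the net effect cannot be reversed (floored at zero)."""
    total = 0.0
    for d, grouped in zip(distances, same_group):
        s = field_strength(d)
        total += s if grouped else -s
    return max(total, 0.0)       # opposing fields weaken but never revert

# Two grouped vectors reinforce; an opposing one weakens the net field
reinforced = combined_effect([2.0, 3.0], [True, True])
opposed = combined_effect([2.0, 3.0], [True, False])
```

Under these assumptions the combination is nonlinear by construction (the floor at zero), echoing the finding that reference-field interactions depend on how the motion vectors are grouped.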
Tynan and Sekuler (1975) argued that theories of motion perception based on reference frames have two limitations: (i) They do not state unambiguously which objects will form the reference frame in a complex scene, and (ii) they do not account for the dependence of suppressive interactions on stimulus velocities. Regarding the first point, we have recently implemented RFMF theory as a computational model that specifies mechanistically how reference frames are determined (Clarke, Öğmen, & Herzog, 2015). Regarding the second point, additional experiments will be needed to determine whether the dependence of suppression on stimulus velocities arises from the center–surround organization of basic motion detectors, motion grouping, or other factors. RFMF theory takes into account basic motion detectors’ receptive-field properties at the retinotopic level, and the grouping operations at the level of retinotopic representations are mapped into nonretinotopic representations (Fig. 3).
Other studies have also focused more on how, rather than why, a particular solution emerges as a result of vector decomposition. Wallach and colleagues (1985) rejected the idea of perceptual vector analysis and interpreted the percepts that supposedly result from imperfect vector decomposition as a consequence of “process combination” (see also Johansson, 1985, and Wallach & Becklen, 1985). They claimed that there is no need for the extraction of common motion, and that the perceived motion patterns are nothing but incidental results of the sensory apparatus. In other words, component motions activate different kinds of motion processes, and sometimes these processes can combine, resulting in motion percepts that deviate from what vector decomposition theory would predict. Take, for example, the two-dot display shown in Fig. 1E. According to process combination theory, the individual motions of the dots, the displacements of the group as a whole, and motion within the group activate different motion sensors. When a stationary landmark (a fixation point or a rectangular frame) is presented along with the two-dot display shown in Fig. 1E, observers mostly perceive the absolute motion paths—that is, one dot oscillating vertically and the other horizontally—because motions with respect to the stationary reference are enhanced and grouping of the dots is weakened (Wallach et al., 1985), which is also in line with RFMF theory.
Although it is indirect, some partial evidence supports this position. The existence of motion sensors that are tuned to various types of moving patterns has been shown: The dorsal portion of the medial superior temporal area is known to be sensitive to global motion patterns, whereas the ventrolateral portion is more sensitive to within-configuration or object motions in the scene (Duffy & Wurtz, 1995; Eifuku & Wurtz, 1998), and some medial temporal (area V5) neurons are responsive to the global motion of a plaid, whereas others respond to the motion of its individual sinusoidal components (Rust, Mante, Simoncelli, & Movshon, 2006). If a given type of motion sensor is activated more than others, perceived motion can be mostly determined by the outcome of this process as a result of process combination. For instance, during steady fixation, relative motion determines the percepts at slow speeds, whereas absolute motion takes over at high speeds (Baker & Braddick, 1982; Snowden, 1992); this might be due to different levels of activation of the corresponding motion sensors at different speeds. However, in contrast to this perspective built on hardwired motion mechanisms, we suggest that the formation of reference frames is a dynamic process that is adaptable in real time. The rationale for this is that under ecological viewing conditions, trajectories can be arbitrarily complex, and it is not possible for the visual system to build hardwired motion sensors for all possible trajectories. Hence, real-time grouping operations and field interactions between the activities generated by a small set of canonical motion mechanisms can represent a neural-network state solution that prevails under the given specific stimulus conditions.
A constraint that can play an important role in disambiguating vector decomposition is the observers' prior knowledge. For example, observers can readily recognize biological motion when the stimulus is presented in its correct orientation, but they fail to do so when it is inverted (Pavlova & Sokolov, 2000). This suggests that templates from memory can also help resolve ambiguity and prime the grouping process. Part of the reason why stimuli like the simple dots shown in Fig. 1 generate multiple percepts may be that they are not rich enough to engage specific memory patterns and thus to form unambiguous groups. Hence, in general, learned figural configurations also need to be taken into account. Recently, Grossberg, Léveillé, and Versace (2011) proposed a neural-network model to explain how vector decomposition might occur by taking figural factors into account. According to their model, figure–ground separation and inhibition between neural populations, which represent motion at different depths, play the critical role; near-to-far inhibition and the resultant peak shift in the population activity lead to vector decomposition. Consistent with this model, it has been shown that surface decomposition leads to velocity decomposition (Watanabe, 1997). An important, but rather implicit, assumption of the model, which has not been tested formally, is that common (or coherent) motion is perceived at a different depth or scale than relative (or incoherent) motion: The former resides at a nearer depth or a larger scale. Moreover, as the authors acknowledged, their model in its current form cannot account for induced motion, in which a stationary object is seen to be moving in the direction opposite to a surrounding (or neighboring) moving object. This is partly due to the claim that vector decomposition and induced motion arise from different neural mechanisms (DiVita & Rock, 1997).
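The near-to-far inhibition and peak-shift mechanism can be caricatured with a toy one-dimensional population code (our simplification, with arbitrary parameter values; the full model achieves subtraction through richer network dynamics, and this sketch shows only the repulsive peak shift qualitatively). Excitation tuned to the target's retinal velocity is combined with broad inhibition centered on the velocity represented at the nearer depth, so the decoded peak is pushed away from the common velocity.

```python
import numpy as np

# Toy caricature (ours, arbitrary parameters) of near-to-far inhibition
# and the resulting peak shift in a velocity-tuned population.
prefs = np.linspace(-10, 10, 401)  # preferred horizontal velocities (deg/s)

def tuning(center, sigma):
    """Gaussian population response profile of velocity-tuned units."""
    return np.exp(-0.5 * ((prefs - center) / sigma) ** 2)

retinal_v = 5.0   # target's retinal velocity (common 4 + relative 1)
common_v = 4.0    # velocity represented at the nearer depth

# Excitation driven by the target, minus broad inhibition centered on
# the near surface's velocity (near-to-far inhibition).
activity = tuning(retinal_v, 2.0) - 0.6 * tuning(common_v, 4.0)

# The population peak is repelled away from the common velocity, so the
# decoded velocity shifts toward the side of the relative component.
decoded = prefs[np.argmax(activity)]
```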
For instance, induced motion is no longer perceived once the frame's speed exceeds the motion threshold (up to 3°/s). Moreover, in induced motion the inducer is perceived to be either moving at a lower speed or not moving at all, whereas in vector-decomposition stimuli the common and relative parts are perceived simultaneously. However, we obtained a strong illusory percept of backward motion in the presence of a grating with a drift velocity of 9°/s. One can therefore argue that the vector-analysis effect and induced motion stem from the same neural mechanism, and that the reference-field effect reported here constitutes a special case of induced motion. Last but not least, figure–ground segregation via perceptual grouping operations requires at least some form of computation. Grossberg and colleagues' model extracts the form of the group of two dots via illusory contours; only then can common motion be calculated. It would be interesting to see how their model would respond to a “formless” motion stimulus. The RFMF theory predicts that retinotopic motion (without any form) is sufficient to generate a reference field.
As we have mentioned before, theories of motion perception based on center–surround antagonisms within receptive fields have proposed inhibitory interactions between motions across space (e.g., Kim & Wilson, 1997; Murakami & Shimojo, 1996; Nawrot & Sekuler, 1990; Tadin et al., 2003; Tynan & Sekuler, 1975). Field-like effects of motion-based reference frames and inhibitory interactions can be predicted from these theories to some extent. The RFMF theory differs from these approaches by treating perceptual-grouping operations as an essential component of reference-frame extraction; in this context, the nonlinearity in the antagonistic interactions in Experiment 3b can be explained as an effect of perceptual grouping. Moreover, most theories of vision, including the aforementioned approaches to motion perception, are based on retinotopic representations, although mounting evidence shows that perception is highly nonretinotopic (Agaoglu et al., 2012; Boi et al., 2009; Hisakata et al., 2013; Kawabe, 2008; Nishida et al., 2007; Öğmen et al., 2006; Shimozaki, Eckstein, & Thomas, 1999; Yamada & Kawabe, 2013). Hence, a fundamental gap exists between retinotopic theories and the nonretinotopic percepts that these theories attempt to explain. The RFMF theory offers a unified solution by constructing nonretinotopic representations according to motion-based reference frames. For instance, anorthoscopic perception (i.e., perceiving an object as a whole when it moves behind a narrow slit) had no viable explanation based on retinotopic theories, whereas it can readily be explained by the nonretinotopic representations of RFMF (see Öğmen, 2007). We recently tested this theory and showed how a motion-based reference frame can construct space and enable form perception (Agaoglu et al., 2012; Aydın et al., 2008, 2009; see also Nishida, 2004).
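The field-like falloff at the heart of the RFMF account can be sketched with a toy function (our assumption: a Gaussian falloff with an arbitrary spatial constant; the theory itself does not commit to this particular functional form). The field strength of a drifting grating decreases with retinotopic distance from the target, mirroring the decline in nonspatiotopic backward-motion percepts with distance in Experiment 1.

```python
import numpy as np

def field_strength(distance_deg, sigma=3.0):
    """Assumed Gaussian falloff of a motion-generated reference field
    with retinotopic distance; sigma is a hypothetical spatial constant."""
    return np.exp(-0.5 * (distance_deg / sigma) ** 2)

# A drifting grating's reference field probed at increasing distances
# from the target square: the nearer the target, the stronger the
# grating's claim to serve as the target's reference frame, and hence
# the more frequent the nonspatiotopic backward-motion percepts.
for d in (1.0, 4.0, 7.0):
    print(f"distance {d} deg -> field strength {field_strength(d):.3f}")
```

On this sketch, the interaction of multiple fields at a point could be modeled as a competition or weighted combination of the candidate motion vectors, with the grouping-dependent nonlinearities of Experiment 3b constraining how the combination departs from simple linear summation.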
Furthermore, all of the theories employing center–surround antagonism listed above are tailored specifically to motion, whereas the RFMF theory accounts for all attributes of the stimulus. The selection and construction of motion-based reference frames are needed not only to perceive motion, but also to perceive the complete stimulus; hence, RFMF theory can make predictions about perceived form. For instance, in slit viewing, objects appear compressed along the axis of motion. The RFMF theory predicts that this apparent compression results from perceived-speed differences between the different parts of the moving object. These predictions have been tested formally, and the results lend further support to the RFMF theory (Aydın et al., 2008). Moreover, why and how attention is allocated to moving stimuli (Boi et al., 2009; Boi, Vergeer, Öğmen, & Herzog, 2011), and why masking is retinotopic whereas form perception can escape masking when motion is predictable, are other predictions of the RFMF theory that we have recently tested experimentally (Noory, Herzog, & Öğmen, in press). We have also demonstrated that RFMF can be implemented computationally to explain data at the quantitative level (Clarke et al., 2015).
M.H.H. is supported by the Swiss National Science Foundation (SNF) project “Basics of Visual Processing: From Retinotopic Encoding to Non-Retinotopic Representations.” We thank the reviewers for helpful comments and suggestions.