# Modelling human visual navigation using multi-view scene reconstruction

DOI: 10.1007/s00422-013-0558-2

- Cite this article as:
- Pickup, L.C., Fitzgibbon, A.W. & Glennerster, A. Biol Cybern (2013) 107: 449. doi:10.1007/s00422-013-0558-2


## Abstract

It is often assumed that humans generate a 3D reconstruction of the environment, either in egocentric or world-based coordinates, but the steps involved are unknown. Here, we propose two reconstruction-based models, evaluated using data from two tasks in immersive virtual reality. We model the observer’s prediction of landmark location based on standard photogrammetric methods and then combine location predictions to compute *likelihood maps* of navigation behaviour. In one model, each scene point is treated independently in the reconstruction; in the other, the pertinent variable is the spatial relationship between pairs of points. Participants viewed a simple environment from one location, were transported (virtually) to another part of the scene and were asked to navigate back. Error distributions varied substantially with changes in scene layout; we compared these directly with the likelihood maps to quantify the success of the models. We also measured error distributions when participants manipulated the location of a landmark to match the preceding interval, providing a direct test of the landmark-location stage of the navigation models. Models such as this, which start with scenes and end with a probabilistic prediction of behaviour, are likely to be increasingly useful for understanding 3D vision.

### Keywords

Navigation · 3D perception · Virtual reality · Stereopsis · Motion parallax · Computational modelling

## 1 Introduction

Many studies on 3D representation assume that the parietal cortex generates representations of the scene in an egocentric frame, the hippocampus does so in a world-centred frame, and coordinate transformations account for the passage of information from one frame to another (Andersen et al. 1997; Burgess et al. 1999; Snyder et al. 1998; Mou et al. 2006; Burgess 2006; O’Keefe and Nadel 1978; McNaughton et al. 2006). However, there is little evidence for a well-ordered 3D representation in cortex underlying each of these putative representations. In striate cortex, retinotopic location and disparity tuning provide an anatomical basis for encoding the visual direction and depth of objects relative to the fixation point, but this anatomical regularity is not found in other parts of the cortex representing egocentric and world-centred relationships (DeAngelis and Newsome 1999; Cumming and DeAngelis 2001). And in relation to psychophysical data, there have been few attempts to model and test the processes assumed to underlie the generation of a 3D reconstruction from images, including the distortions that would be predicted to arise from such processing, as we do here.

Of course, 3D reconstruction is not the only way that a scene could be represented (Gillner and Mallot 1998; Glennerster et al. 2001; Warren 2012) and more generally there are many ways to guide actions and navigate within a 3D environment that do not involve scene reconstruction (Gibson 1979; Franz et al. 1998; Möller and Vardy 2006; Stürzl et al. 2008). Together, these come under the category of “view-based” methods of carrying out tasks. By contrast, in the current paper, we focus on reconstruction-based hypotheses for a scene-matching task and the extent to which these are able to account for the pattern of errors displayed by humans faced with the same task. We have examined view-based predictions for the same task in a previous paper (Pickup et al. 2011), and we will present a detailed comparison of the two approaches in a subsequent paper. Here, we focus on the hypothesis that the visual system generates a reconstruction of the scene. If this is what the visual system does when the observer is asked to remember their location in a scene, then we can model the pattern of errors that we would expect observers to make when they try to return to that location.

Using a similar “homing” task, it has often been shown that changing or removing landmarks can bias or disrupt accurate navigation of bees (Cartwright and Collett 1983), ants (Graham and Collett 2002) and humans (Mallot and Gillner 2000; Waller et al. 2001; Foo et al. 2005). By contrast, in our study the structure of the scene remains constant between the reference and the homing interval, but we nevertheless find that the pattern of errors varies systematically with the structure of the scene. It is these systematic variations that are informative about the nature of the representation the visual system uses. In this paper, we attempt to reproduce a similar pattern of errors using two variants of a reconstruction-based algorithm.

### 1.1 Paper overview

In Sect. 2, we describe the psychophysical experiment measuring navigation errors in a simple homing task in a virtual environment. Sections 3 and 4 describe how a reconstruction algorithm can be used to recover an estimate of the positions of scene landmarks in an egocentric coordinate system and how these estimates, measured in two intervals (“reference” and “homing”), can be combined to form a probabilistic map of navigation end-point locations. We call this the “basic” reconstruction model. Section 5 describes an alternative way of combining the distributions of position estimates that emphasizes the *relative* location of landmarks, so we refer to this as the “shape-based” model.

Section 6 introduces a different type of experiment that allows us to obtain an estimate of the distribution of errors on participants’ representation of *landmark* location (rather than their own location). We compare this to the equivalent distribution that is inferred as part of the modelling of the first experiment. Section 7 compares the ability of the “basic” and “shape-based” models to account for the data, and Sect. 8 discusses our results in the context of models of spatial representation.

## 2 Experiment 1: navigation to a previously viewed location

Participants viewed a simple scene in immersive virtual reality and were then teleported to a different location in the scene, from where they had to return to the original location. The paradigm is similar to a previous experiment by Waller et al. (2001), who tried to distinguish different components of the information that participants might be using in a homing task. In their experiment, they identified two candidate locations predicted by two simple heuristics: first, to keep all the landmarks at the same distance from the observer in the two intervals or, second, to keep all the angles between landmarks constant. They found evidence in favour of distance information being important, although they admit that the type of virtual environment they used may have contributed to this outcome. Most of the time, only one landmark was visible at a time in their experiment, so angles between landmarks were rarely available visually, forcing participants to rely more heavily on distance information. Unlike Waller et al., we kept the environment the same between the learning and test phases, so there was always a correct location to which participants could return. Naturally, this location is the most likely one, as is confirmed by our modelling, but the distribution of navigation errors that participants make around this point and, in particular, the variation in this distribution with the location of the landmarks in the scene, is something we attempt to predict using a reconstruction model. The experiment and data have been presented by Pickup et al. (2011), but are reproduced here for clarity before introducing the modelling.

### 2.1 Methods

Five participants took part in the experiment, all with normal or corrected-to-normal visual acuity. Participants viewed the virtual scene using an NVIS SX111 head-mounted display with a horizontal field of view of 102\(^{\circ }\), vertical FOV 64\(^{\circ }\) and binocular overlap of 50\(^{\circ }\). The location and orientation of the head-mounted display were tracked at 240 Hz using a Vicon MX3/T20S nine-camera tracking system that was used to update the binocular visual display (1,280 by 1,024 pixels in each eye) at 60 Hz with a latency of two frames. The calibration procedure that allows the left and right eye’s viewing frustums to be calculated from the 6 degrees of freedom tracking data is described by Gilson and Glennerster (2012). The size of the physical room in which the participants could walk was 3.5 by 3.5 m. The stimuli consisted of three very long poles coloured red, green and blue so that they could be easily distinguished. Other than the poles, the image was black. The poles were designed so that the only information about their 3D layout was the angles subtended at the eye between pairs of poles and the change in these angles with changes in viewpoint (either by the participant walking or from binocular viewing). The poles were always one pixel wide (anti-aliased) for all viewing distances. The poles extended far above and far below the participant, and when the participant looked up or down by 35\(^{\circ }\), the image went black. This prevented participants from ever seeing anything close to a “plan view”. The purpose of this minimalist display was to restrict the number of parameters necessary to model the participant’s navigation errors and to allow different types of model to be distinguished.

A trial would start when the participant was within a 20 cm \(\times \) 80 cm viewing zone, which was always in the same physical location within the room. It allowed the participant to move laterally to view the stimulus with motion parallax but without the freedom to explore further. The long axis of the viewing zone was always at right angles to a line joining the centre of the viewing zone and the midpoint between the red and blue poles. The participant was instructed to remember their location with respect to the poles. This first “reference” interval ended when the participant pressed a button, and after a 500 ms blank interval, the poles reappeared, but the participant had been transported virtually (i.e. without physically moving) to a new location in the scene, shown by the magenta cross in Fig. 1. The task was to navigate back to the location in the scene at which they pressed the button ending interval one, i.e. the “goal point”. When participants were satisfied that they had reached the goal point, they pressed a button on a hand-held device, which recorded the location of their cyclopean point at that moment, and the trial ended. An image then appeared showing a plan view with a schematic head indicating their location in the physical room and an outline of the viewing zone to which they had to return to start the next trial.

### 2.2 Results

In order to gather data that could be plotted in the clear way shown here, i.e. with many trials repeated using exactly the same goal point, we adapted the protocol slightly. Instead of defining the goal point based on the participant’s location in the viewing zone of interval one when they pressed the button, we inserted an “interval 1a” during which the participant saw a static, stereo image of the scene from a fixed viewpoint, and this defined the goal point to which they should try to return in the “homing” interval.

The real data (i.e. the points we analysed and which are shown in Fig. 11) were gathered using only the original two-interval paradigm, where the goal position was never exactly the same for different repetitions of a given condition. This does not present any difficulty for modelling, since the goal point was always known, but distributions of errors are more difficult to make out “by eye” when plotted on a unified coordinate frame (see Fig. 12), motivating the use of illustrative “visualization” data as shown in Fig. 2.

## 3 3D pole position model

If participants are to use a 3D reconstruction in order to recognize a location, there are two steps that must be involved. The first of these, which we consider in this section, is to describe how a reconstruction from *one* location can be generated, including the associated errors. The second, which applies to interval two, is to find a location in the room for which the 3D model of the poles generated in that location best matches the 3D model obtained in the first interval. Errors might then arise if the reconstruction from one location is similar to the reconstruction generated from a different location. In this section, we describe our reconstruction model, and in Sect. 4, we will combine multiple reconstructions from different viewpoints in order to build probabilistic maps representing the likely end-points in homing tasks.

Our starting assumption for a reconstruction-based model is that at each point in the virtual reality space, a participant has access to a reconstruction of the scene which they have built up using stereopsis and motion parallax. Both provide information about the 3D layout of the scene from multiple viewpoints. In the derivation below, we assume that the observer is able to move from side to side, i.e. in a direction perpendicular to the line of sight. This is a good approximation to their movements in the first interval, since the viewing zone was narrow and oriented in this direction, but, of course, we had no control over the participant’s movement in the second interval. We discuss reasons why the model is likely to be robust to a fairly wide range of paths taken by the observer. The following section derives an expression for the expected mean and covariance of the distribution of errors for the three poles for each interval.

### 3.1 Deriving the distribution over pole positions

The reconstruction we carry out is in an “egocentric” coordinate frame centred on the middle of the start zone, with the line drawn from there to the central pole taken to define the depth axis. This defines the coordinate frame within which the position of each hypothetical camera is specified, as described below. The pole position distributions we obtain as \(\left( {{\boldsymbol{M}}}_j, {{\boldsymbol{C}}}_j\right) \) are therefore defined within this egocentric coordinate system.

Assume the poles are \({{\boldsymbol{X}}}_j\) (3-vectors representing 2D points \({{\boldsymbol{x}}}_j\)), and \(m_{ij}\) is the image of the \(j\)th point in the \(i\)th camera. Let the projection matrix for the \(i\)th camera be \({{\boldsymbol{P}}}_i\); this is a \(2\times 3\) matrix with focal length one unit, aligned on the viewing strip facing the central pole. This operates on the 2D homogeneous coordinates of the egocentric coordinate system (i.e. 3-vectors representing a 2D pole location) and transforms them into 1D homogeneous coordinates (2-vectors) representing image coordinates. An excellent introduction to multi-view geometry and working with homogeneous coordinates can be found in Hartley and Zisserman (2004).
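As a minimal sketch of this projection step (function names and numerical values are our own, not taken from the paper), a \(2\times 3\) camera matrix maps a homogeneous 2D scene point (a 3-vector) to a homogeneous 1D image coordinate (a 2-vector), which is then dehomogenized:

```python
import numpy as np

def camera_matrix(cam_x, f=1.0):
    """2x3 projection matrix for a camera at (cam_x, 0) facing the +y (depth) axis."""
    return np.array([[f, 0.0, -f * cam_x],
                     [0.0, 1.0, 0.0]])

def project(P, X):
    """Project a homogeneous 2D scene point X (3-vector) to a scalar image coordinate."""
    u = P @ X          # homogeneous 1D image coordinate (2-vector)
    return u[0] / u[1]  # dehomogenize

# A pole 0.1 m to the right of the midline at 2 m depth:
X = np.array([0.1, 2.0, 1.0])
print(project(camera_matrix(0.0), X))   # 0.05 = f * x / y, from the strip centre
print(project(camera_matrix(0.4), X))   # -0.15: the pole appears further left
```

Shifting the camera along the viewing strip changes the image coordinate, which is precisely the motion-parallax signal the reconstruction exploits.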

To obtain a basic estimate of the distribution of the poles given a set of images, we assume there are no informative priors on the pole positions and obtain a maximum likelihood estimate, i.e. find values of \({{\boldsymbol{X}}}_j\) for each \(j\) so as to minimize the negative log likelihood.
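With i.i.d. Gaussian image noise, minimizing the negative log likelihood reduces to a nonlinear least-squares problem over the reprojection errors. The following is a hedged sketch of that step with our own minimal setup (camera positions and pole coordinates are illustrative, not the paper's):

```python
import numpy as np
from scipy.optimize import least_squares

def project(cam_x, pole, f=1.0):
    """Image coordinate of a pole at (x, y) in a camera at (cam_x, 0) facing +y."""
    x, y = pole
    return f * (x - cam_x) / y

def triangulate(cam_xs, images, f=1.0):
    """ML estimate of one pole's (x, y) from its image in several cameras."""
    def residuals(pole):
        # reprojection errors; least squares on these = ML under Gaussian noise
        return [project(cx, pole, f) - m for cx, m in zip(cam_xs, images)]
    return least_squares(residuals, x0=[0.0, 1.0]).x

cam_xs = [-0.4, 0.0, 0.4]                      # cameras along the viewing strip
true_pole = np.array([0.2, 2.5])
images = [project(cx, true_pole) for cx in cam_xs]
print(triangulate(cam_xs, images))             # recovers ~[0.2, 2.5] with noise-free images
```

Adding Gaussian noise to `images` and repeating this estimate many times yields the elongated, depth-stretched error ellipses of the kind shown in Fig. 3.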

An example of the types of models this gives for the pole position uncertainty is given in Fig. 3. In Sect. 6, we will compare this to pole position uncertainty data from human subjects, but first we will consider how to combine these pole-position-reconstruction models to form likelihood maps predicting navigation errors.

## 4 Combining models to form maps

We now have the foundation of a reconstruction-based model, but still need additional steps in order to explain the homing behaviour of participants. The problem of a human recognizing an exact location in interval two can be viewed as the task of finding a location in the room for which the 3D model of the poles best matches that obtained in the first interval. Navigation errors then arise when the current pole position model is sufficiently similar to the “template” or “goal-point” model generated in interval one.

Using the Gaussian model described in Sect. 3.1, we compute an egocentric pole-position model for every location (putative end-point) in a wide region around the poles. We then compare each model to the one computed at the goal point. End-points for which the model agrees well with the goal-point model should be assigned a higher likelihood in our map than end-points at which the appearance of the poles is less similar. Overall, high likelihoods in this map mean that we expect participants to press the button more often at this location. A map like this is desirable because the probabilities can be compared directly with observed data for any number of configurations and with other similar models (e.g. Pickup et al. 2011).

A likelihood map over the 2D plane is built up one point at a time by considering the distances between two probability distributions: the model is built with a coordinate frame based around the centre of the viewing strip in interval one, and a second model is built using the current point under consideration at the centre of the viewing strip. The \(y\)-axis is aligned with the green pole and the \(x\)-axis is perpendicular to this (see Fig. 3). This means that at each point, two egocentric maps of the world are compared. The comparison is made using the probability distributions on the pole positions, as follows.

Taking the set of poles as a single six-dimensional Gaussian distribution with a block-diagonal covariance matrix (i.e. by stacking the three mean vectors and arranging the three \(2\times 2\) covariances along the diagonal of a larger \(6\times 6\) covariance matrix), we get a single Gaussian representing the three pole locations as seen from a single point. The likelihood \(\mathcal L \) for any point \({{\boldsymbol{X}}}\) on the ground plane can then be found using the Bhattacharyya distance between the 6D Gaussian for the view centred on the goal point, and the Gaussian centred on the point \({{\boldsymbol{X}}}\).
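The stacking and comparison steps can be sketched as follows (helper names are ours). The Bhattacharyya distance between two multivariate Gaussians has a closed form, and identical views give a distance of zero:

```python
import numpy as np
from scipy.linalg import block_diag

def stack_view(means, covs):
    """Three 2D (mean, cov) pairs -> one 6D mean and block-diagonal 6x6 covariance."""
    return np.concatenate(means), block_diag(*covs)

def bhattacharyya(m1, C1, m2, C2):
    """Bhattacharyya distance between two multivariate Gaussians."""
    C = 0.5 * (C1 + C2)
    d = m1 - m2
    term1 = 0.125 * d @ np.linalg.solve(C, d)
    term2 = 0.5 * np.log(np.linalg.det(C) /
                         np.sqrt(np.linalg.det(C1) * np.linalg.det(C2)))
    return term1 + term2

# Identical views give zero distance; shifted means give a positive one.
m, C = stack_view([np.zeros(2)] * 3, [np.eye(2)] * 3)
print(bhattacharyya(m, C, m, C))        # 0.0
print(bhattacharyya(m + 1.0, C, m, C))  # > 0: views disagree
```

The likelihood at each candidate end-point is then a decreasing function of this distance, so points whose pole model closely matches the goal-point model receive high likelihood.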

### 4.1 Example maps

### 4.2 Normalization

The value of \(Z\) is calculated numerically out to a distance of several metres (e.g. 10 m) from the poles in the \(x\) and \(y\) directions, beyond which point it is assumed to be virtually zero. The integral is performed using four calls to the dblquad function in Matlab: since the best match is expected to be at the location where the point \({{\boldsymbol{X}}}\) coincides exactly with the goal point, this point is included explicitly in the integral by splitting the region into four rectangles such that the central corner shared by all four regions is the goal location. This prevents the numerical integration routine from missing particularly narrow, peaky distributions.
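A hedged sketch of this splitting scheme, translated to Python with SciPy's dblquad (the bounds and the integrand below are illustrative, not the paper's): the four rectangles share the goal point as a corner, so the adaptive quadrature cannot step over a narrow peak located exactly there.

```python
import numpy as np
from scipy.integrate import dblquad

def normalizer(likelihood, goal, lo=-10.0, hi=10.0):
    """Integrate likelihood(x, y) over [lo, hi]^2, split into four rectangles at the goal."""
    gx, gy = goal
    Z = 0.0
    for x0, x1 in [(lo, gx), (gx, hi)]:
        for y0, y1 in [(lo, gy), (gy, hi)]:
            # dblquad expects f(y, x) and integrates over y first, then x
            val, _ = dblquad(lambda y, x: likelihood(x, y),
                             x0, x1, lambda x: y0, lambda x: y1)
            Z += val
    return Z

# A narrow unit-mass Gaussian centred on the goal integrates to ~1:
goal = (1.3, -0.7)
sig = 0.5
peak = lambda x, y: (np.exp(-((x - goal[0])**2 + (y - goal[1])**2)
                            / (2 * sig**2)) / (2 * np.pi * sig**2))
print(normalizer(peak, goal))
```

Without the split, a single quadrature call over the whole square could assign nearly zero mass to a sufficiently sharp peak, badly corrupting the normalization.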

### 4.3 Assumptions used in the interval-two model

In the computations described above, for every point on the end-point likelihood map (such as Fig. 4) the pole location probabilities are calculated in exactly the same way as they are at the goal location, i.e. using a viewing strip. In the experiments, the participants were free to walk around the virtual reality area in interval two, so no such restriction was made on the space of views of the poles available to them in this interval. In particular, all the views leading up to a candidate end-point could have been integrated together, potentially, into a single pole-position likelihood distribution in the current egocentric coordinates. Assuming, instead, that participants restricted their movement to a narrow viewing strip similar to interval one is clearly an approximation.

## 5 Shape-based map

The model described above forms an account of a “basic” photogrammetric reconstruction followed by comparison of two reconstructions from separate intervals. In this section, we explore a variation of the model that incorporates an element of sensitivity to relative positions, since this is known to be important in human vision (e.g. Westheimer 1979).

A model in which the *relative* positions of the poles are the pertinent piece of information remembered from interval one might fare better. We explored a model that computed a distribution over e.g. the red-to-green vector recorded in egocentric coordinates (and the same for the other two possible pole pairs). In this formulation, for a given pole configuration, the red-to-green vector will then be identical for any position of the viewing point along a line from the green pole, since this line defines the orientation of the coordinate frame. Figure 6 illustrates this, showing how two viewing points along one such line give rise to similar means but different covariances in the estimate of relative pole positions, while unrelated viewing positions give rise to quite different estimates of relative pole position.

The shape-based description of a view is computed as follows:

1. Find a description of landmarks from the goal point in egocentric coordinates: \(\left( {{\boldsymbol{M}}}_R,{{\boldsymbol{C}}}_R\right) \), \(\left( {{\boldsymbol{M}}}_G,{{\boldsymbol{C}}}_G\right) \) and \(\left( {{\boldsymbol{M}}}_B,{{\boldsymbol{C}}}_B\right) \) (see Sect. 3).

2. Transform these means into relative-location means by taking pairwise differences (red-to-green, blue-to-green, red-to-blue), i.e.

   $$\begin{aligned} {{\boldsymbol{M}}}_\alpha&= {{\boldsymbol{M}}}_G - {{\boldsymbol{M}}}_R,\end{aligned}$$ (17)

   $$\begin{aligned} {{\boldsymbol{M}}}_\beta&= {{\boldsymbol{M}}}_G - {{\boldsymbol{M}}}_B,\end{aligned}$$ (18)

   $$\begin{aligned} {{\boldsymbol{M}}}_\gamma&= {{\boldsymbol{M}}}_B - {{\boldsymbol{M}}}_R. \end{aligned}$$ (19)

3. Transform the associated covariance for each mean, remembering that the uncertainty *adds*, e.g.

   $$\begin{aligned} {{\boldsymbol{C}}}_\alpha&= {{\boldsymbol{C}}}_G + {{\boldsymbol{C}}}_R. \end{aligned}$$ (20)

4. Stack the three 2D Gaussians to give a single 6D shape-based description of the view of the landmarks from this goal location.
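These steps can be sketched directly (variable names and the example numbers are ours, assuming the per-pole estimates are independent as the block-diagonal construction implies):

```python
import numpy as np
from scipy.linalg import block_diag

def shape_description(M_R, C_R, M_G, C_G, M_B, C_B):
    """Turn three absolute pole Gaussians into one 6D relative-position Gaussian."""
    # Step 2: pairwise differences of the means (Eqs. 17-19)
    M_alpha = M_G - M_R
    M_beta  = M_G - M_B
    M_gamma = M_B - M_R
    # Step 3: covariances add under differencing of independent estimates (Eq. 20)
    C_alpha = C_G + C_R
    C_beta  = C_G + C_B
    C_gamma = C_B + C_R
    # Step 4: stack into a single 6D Gaussian
    mean = np.concatenate([M_alpha, M_beta, M_gamma])
    cov = block_diag(C_alpha, C_beta, C_gamma)
    return mean, cov

mean, cov = shape_description(
    np.array([-0.5, 2.0]), 0.01 * np.eye(2),   # red
    np.array([0.0, 3.0]),  0.02 * np.eye(2),   # green
    np.array([0.5, 2.0]),  0.01 * np.eye(2))   # blue
print(mean)       # entries: 0.5 1.0 -0.5 1.0 1.0 0.0
print(cov[0, 0])  # 0.03 = C_G + C_R on the first diagonal block
```

The resulting 6D Gaussians can then be compared across viewpoints with the same Bhattacharyya machinery used for the basic model.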

In some ways, the “relative” or “shape-based” model described here is a minor extra step added to the basic model and we are deliberately treating it as such in this paper. This means that the uncertainty that arises, for example, in estimating the location of a pole in the basic model will propagate through and affect the predictions of the shape-based model. That is what makes this a type of reconstruction-based model. However, in another sense, because it is starting to use relative rather than absolute position information, this model is a step down a quite different road, ultimately leading to the abandonment of any type of reconstruction. For example, one could use the relative image locations of pairs of poles as input features to the model and consider independent noise on each of these input measurements. That would be an entirely different, view-based approach, as raised in the Introduction and discussed in a previous paper (Pickup et al. 2011).

The reconstruction model suggests that observers are substantially less sensitive to variations in the depth of a pole than they are to variations in lateral position (Fig. 3). This would, at first sight, seem to run counter to evidence from stereoscopic experiments (e.g. Westheimer and McKee 1979) which suggest the reverse ratio. However, the more relevant data for this experiment are probably those using stimuli with a large disparity pedestal between the reference and the target (McKee et al. 1990) where stereo thresholds can be substantially *poorer* than those for lateral deviations. Here, we designed a method to measure the sensitivity of observers to variations in the position of a given pole in our experiment and so provide a direct empirical test of the distribution of uncertainties over pole position calculated using the reconstruction algorithm, as shown in Fig. 3. The results allowed us to modify the reconstruction stage of the model, as described in the next section.

## 6 Experiment 2: verifying one component of the reconstruction model

In the model presented so far, we set all the parameters in one go; that is, we chose the “decay” parameter, \(\lambda \) (Sect. 4) for the model-comparison step at the same time as “internal” parameters, \({\boldsymbol{\phi }}\), for the reconstruction part of the model. In this section, we describe a new experiment that allowed us to separate out the reconstruction parameters and fit them separately, leaving \(\lambda \) as a free parameter to be learnt in a subsequent step.

There are two arguments for doing this. First, the reconstruction step itself can be validated in isolation. Second, learning fewer parameters at once reduces the danger of over-fitting and leads to better generalization for the model as a whole. Briefly, the experiment allowed us to probe the underlying shape of the human uncertainty function over pole location. We used the data to fit the standard deviation, \(\sigma \), of the noise assumed on the images of the poles, the focal length of the cameras, the number of cameras used, and the width of the strip of cameras (see Sect. 6.3).

### 6.1 Methods

Participants were shown the three poles from a viewing zone, exactly as in interval one of the trials in Experiment 1, and were asked to remember the layout of the poles. Once participants had memorized the layout, they pressed a button, which led to a 0.5 s blank inter-stimulus interval.

In the second interval, the participant remained in the same location in the virtual scene (unlike Experiment 1) and two of the poles remained in the same place while the third pole was displaced. It was always the same pole that was displaced throughout a whole run although the displacement varied from trial to trial. Participants were told in advance which pole would be displaced. The participant’s task was to move the shifted pole back to the location it had occupied during the first interval. They did this using a hand-held pointing device with which they could “push” or “drag” the pole in two dimensions while pressing a button on the device. Participants indicated that they were satisfied that the location of the pole matched that in the first interval by pressing a different button on the device, advancing them to the next trial. The moving pole always remained vertical, so participants could only manipulate its \((x,y)\) coordinate and, like the other poles, its width in the image was always one pixel (anti-aliased).

### 6.2 Results

In general, this pattern of position uncertainty fits the predictions of the reconstruction model described in Sect. 3 and illustrated in Fig. 3, i.e. elongated in the depth direction. More than this, however, the data allow us to revise the basic and shape models using parameters derived from this uncertainty distribution, as described in the next section.

### 6.3 Fixing the free parameters, \({\boldsymbol{\phi }}\), using data from Experiment 2

The free parameters, \({\boldsymbol{\phi }}\), in the reconstruction model are: the number of assumed cameras, \(N\), image noise standard deviation, \(\sigma \), and camera strip half-width, \(w\) (see Sect. 3). The model predicts a Gaussian distribution of position errors for each pole, \(\left\{ {{\boldsymbol{M}}},{{\boldsymbol{C}}}\right\} \), for which we now have a direct estimate. Hence, we were able to learn values for each of these parameters.

In optimizing the data likelihood with respect to these parameters, we found slightly higher likelihoods for the observed data when \(w\) was allowed to be a little larger than its veridical value of 0.4 m, which was the half-width of the starting box in the actual navigation experiment. This may be the result of people paying more attention to views at the edges of the viewing space than intermediate views. In our modelling, we limited the width to \(\pm \)0.4 m in order to reflect the ground truth width of the viewing box.

With the viewing-strip width fixed, \(\sigma \) and \(N\) were optimized. The latter is a discrete value greater than one, so optimal likelihoods were found for each \(N\) as \(\sigma \) varied, and the results were then compared against each other to find the \((N,\sigma )\) pair maximizing the overall likelihood across the data from Experiment 2. This led to a model using just two cameras and a noise standard deviation of 0.0128 m, when a focal length of 1 m is assumed for the purposes of building up the imaging parameters of the hypothetical cameras. Together with the strip half-width (\(w=0.4\) m), these create the reconstruction model which best described human uncertainty in the pole locations in our experiment. Using these same parameters, \({\boldsymbol{\phi }}\), for this stage of the model and for all participants, we can now return to the second layer of the 3D-based navigation models.
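The mixed discrete-continuous search just described can be sketched as follows. The objective below is a stand-in (the real one scores the Experiment 2 data); the loop structure, optimizing \(\sigma \) for each candidate \(N\) and keeping the best pair, is the point:

```python
import numpy as np
from scipy.optimize import minimize_scalar

def neg_log_likelihood(N, sigma):
    """Placeholder objective with its minimum at N=2, sigma=0.0128 (illustrative only)."""
    return (N - 2) ** 2 + (np.log(sigma) - np.log(0.0128)) ** 2

best = None
for N in range(2, 9):                       # N is discrete, so enumerate it
    res = minimize_scalar(lambda s: neg_log_likelihood(N, s),
                          bounds=(1e-4, 1.0), method='bounded')
    if best is None or res.fun < best[2]:   # keep the best (N, sigma) pair overall
        best = (N, res.x, res.fun)

N_opt, sigma_opt, _ = best
print(N_opt, round(sigma_opt, 4))           # 2 and ~0.0128 for this toy objective
```

Enumerating \(N\) avoids any need for a relaxation of the discrete parameter, at the cost of one 1D optimization per candidate value.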

### 6.4 Revised navigation predictions

Weight values, given as \(\log _{10}(\lambda )\), for the five participants (P1–P5) on each of the two types of reconstruction-based model

| Type | P1 | P2 | P3 | P4 | P5 |
|---|---|---|---|---|---|
| Basic | \(-\)0.10 | \(-\)0.22 | \(-\)0.31 | \(-\)0.17 | \(-\)0.03 |
| Shape | 0.94 | 0.46 | 0.39 | 0.49 | 0.80 |

## 7 “Basic” and “shape” reconstruction models compared

What is clear from Fig. 11 is that while *both* reconstruction-based models provide an explanation for a good deal of the variation observed in the human navigation error dataset, neither model is able to outperform the other consistently, and overall they have a tendency to complement one another.

## 8 Discussion

We have demonstrated the extent to which a reconstruction algorithm can account for participants’ performance in a simple navigation task. Any algorithm that is to predict human behaviour successfully in this case must vary its output according to changes in the visual scene and make explicit the way that noise at various stages in the reconstruction process will affect the predicted spatial distribution of errors in the task. We are not aware of algorithms that fulfil these criteria other than those based on the principles of photogrammetry, as we have used here.

Many papers have assumed that the brain generates a 3D reconstruction of the scene without providing a model of the process underlying its construction in the way that we have done here (Luneburg 1950; Blank 1958; Indow 1995; Tolman 1948; Mou et al. 2006; Burgess 2006; Maguire et al. 1999; Gallistel 1989). While often being quite mathematical in their description, these models are nonetheless descriptions of empirical results fitted *post-hoc* rather than describing a reconstruction process and the noise associated with its different stages. For example, Foley (1991) presents a description of distortions in perceived distance and direction based on psychophysical experiments. However, he provides only a minimal hypothesis about the processes that might underlie these distortions. In particular, he suggests that the compression of visual space may be explained by vergence adaptation occurring over many seconds or minutes in his experiments. By compression of visual space he means that “effective binocular parallax”, a value derived from psychophysical judgements, changes over a small range relative to actual binocular parallax (vergence angle). This hypothesis turns out to be contradicted by more recent data: visual space “compression” measured using a related paradigm has been shown to be very similar for long and short periods of fixation, e.g. 2 s periods of fixation interspersed with large changes in vergence so that vergence adaptation could not occur (Glennerster et al. 1996). A more important criticism, however, is that Foley’s hypothesis about the cause of a compression in visual space relies on changes in vergence to different targets. It is mainly an account that explains the distance estimate of fixated targets rather than being designed to explain distortions across a whole scene at once (without vergence changes). 
If it is true that information is passed from V1 to parietal cortex to hippocampus and that these representations underlie our perception of space, then the modelling of such transformations should refer to more than a single point at the fovea. In that sense, there is a large gap between descriptions of visual space such as Foley’s and current physiological hypotheses about spatial representation.

The distortions of space that these models describe (Luneburg 1950; Blank 1958; Indow 1995; Foley 1991) do not predict any shift in the peak of the distribution of errors in our task: it remains the case that the most likely location for participants to choose in interval two would be the correct one because the same distortion would apply in both intervals. Others have discussed whether the notion of a distorted visual space remains tenable in the face of increasing psychophysical evidence against the hypothesis (Glennerster et al. 1996; Koenderink et al. 2002; Smeets et al. 2002; Cuijpers et al. 2003; Svarverud et al. 2010). Independent of that debate, the important point here is that for our task, no distortions of the type described by Luneburg and others would be expected.

Navigation often involves proprioception and vestibular cues in addition to vision (Foo et al. 2005; Campos et al. 2010; Tcheang 2010), but in our experiments, these cues on their own were of no value in carrying out the task. The reconstruction model we have applied does assume that some nonvisual information is available, but only for the purpose of fixing the scale of the visual reconstruction, for example via vergence or proprioception. These cues provide information about the length of the baseline (the distance between the optic centres of a pair of cameras), but otherwise proprioception does not contribute to the process of comparing the stimuli in intervals one and two. Any model that tried to integrate proprioceptive information in this matching process would need to be quite complicated, involving a subtraction of two coordinates from visual reconstructions generated at the start of the first and second intervals to get a “homing vector” across the two intervals, and a conversion of this visual vector into proprioceptive coordinates. It is not easy to see how a component derived in this way would add to the explanatory power of the model.
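For illustration only, the coordinate subtraction and frame conversion described above can be sketched in a few lines; the frame labels, the rotation matrix and the example coordinates are hypothetical, not part of the model we tested:

```python
import numpy as np

def homing_vector(p_interval1, p_interval2, R_visual_to_proprio):
    """Subtract two reconstructed observer positions (expressed in a
    visual-reconstruction frame) to obtain a homing vector across the
    two intervals, then rotate it into a proprioceptive (body-centred)
    frame. All frame names here are illustrative labels."""
    v_visual = np.asarray(p_interval1) - np.asarray(p_interval2)
    return R_visual_to_proprio @ v_visual

# Example: observer reconstructed at (0, 0) in interval one and at
# (2, 1) in interval two; the two frames are assumed aligned, so the
# conversion is the identity rotation.
v = homing_vector([0.0, 0.0], [2.0, 1.0], np.eye(2))
```

The sketch makes the complication visible: each of the two reconstructed positions and the frame conversion would carry its own noise, which is why it is hard to see what such a component would add.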

Instead, our model relies on matching representations generated from visual data. In the end, a full description of human navigation will have to account for multiple sources of sensory information and show how these are integrated. This process will almost certainly incorporate a mechanism for weighting different cues according to their reliability (Landy et al. 1995; Ernst and Banks 2002; Svarverud et al. 2010; Butler et al. 2010), but this does not mean that the optimal coordinate frame in which to carry out such integration is necessarily a 3D one, as we have discussed elsewhere (Svarverud et al. 2010). Indeed, in relation to the data we have presented here, some of the conditions were best explained by a “shape” model which concentrates on the 3D relationship between pairs of features. This approach no longer uses a full reconstruction of the scene in a single coordinate frame and could be regarded as one step towards abandoning 3D frames altogether.
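The reliability-weighted integration referred to here is, in its standard form (e.g. Ernst and Banks 2002), inverse-variance weighting: each cue is weighted in proportion to its reliability $1/\sigma^2$. A minimal sketch, with purely illustrative numbers:

```python
import numpy as np

def combine_cues(estimates, variances):
    """Reliability-weighted (inverse-variance) cue combination.
    Each cue's weight is proportional to 1/sigma^2; the fused
    estimate has variance 1 / sum(1/sigma_i^2)."""
    variances = np.asarray(variances, dtype=float)
    w = 1.0 / variances
    w /= w.sum()                       # normalise weights to sum to 1
    fused = float(np.dot(w, estimates))
    fused_var = 1.0 / np.sum(1.0 / variances)
    return fused, fused_var

# A reliable cue (variance 1) dominates a noisier one (variance 4):
# weights 0.8 and 0.2, so the fused estimate sits near the first cue.
d, var = combine_cues([10.0, 14.0], [1.0, 4.0])
```

Note that this scheme is agnostic about the coordinate frame in which the estimates are expressed, which is the point made above: optimal weighting does not by itself require a 3D frame.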

As we noted in the Introduction, reconstruction models are not the only approach to explaining 3D representation and performance in our scene-matching task. In a subsequent paper, we will compare directly the performance of a reconstruction algorithm with that of a quite different, view-based approach.

## Acknowledgments

This work was supported by the Wellcome Trust [086526], [078655]. We thank Stuart Gilson for his invaluable assistance in setting up the virtual reality experiments, and Tom Collett for insightful conversations that led to Experiment 2.

## Copyright information

**Open Access** This article is distributed under the terms of the Creative Commons Attribution License which permits any use, distribution, and reproduction in any medium, provided the original author(s) and the source are credited.