Integration of visual landmark cues in spatial memory

Newman, Phillip M.; McNamara, Timothy P.

doi:10.1007/s00426-021-01581-8

Original Article
Published: 21 August 2021

Volume 86, pages 1636–1654, (2022)

Psychological Research

Abstract

Over the past two decades, much research has been conducted to investigate whether humans are optimal when integrating sensory cues during spatial memory and navigational tasks. Although this work has consistently demonstrated optimal integration of visual cues (e.g., landmarks) with body-based cues (e.g., path integration) during human navigation, little work has investigated how cues of the same sensory type are integrated in spatial memory. A few recent studies have reported mixed results, with some showing very little benefit to having access to more than one landmark, and others showing that multiple landmarks can be optimally integrated in spatial memory. In the current study, we employed a combination of immersive and non-immersive virtual reality spatial memory tasks to test adult humans’ ability to integrate multiple landmark cues across six experiments. Our results showed that optimal integration of multiple landmark cues depends on the difficulty of the task, and that the presence of multiple landmarks can elicit an additional latent cue when estimating locations from a ground-level perspective, but not an aerial perspective.

Introduction

Successful navigation is a critical function of any mobile organism as faulty navigation can lead to injury or even death. Thus, organisms require mechanisms by which they can remain oriented in an environment. One such mechanism is to utilize spatial cues (e.g., landmarks) that inform the organism of its location with respect to an internal or external reference frame. For example, a shopper might attempt to locate his or her car by recalling that it was parked near an oak tree. The oak tree serves as an environmental cue that provides relative information about the location of the car. The shopper may also recall that he or she walked diagonally to the left from the car to the entrance of the store. The person’s internal sense of direction serves as a body-based cue. One strategy for successful navigation is to combine the information from different cues to obtain a more precise estimate of the car’s true location. However, if the cues provide conflicting estimates, it may be better to choose one cue over the other.

Many studies have examined the ways by which humans privilege and integrate spatial cues during navigation (Bates & Wolbers, 2014; Butler et al., 2010; Chen et al., 2017; Cheng et al., 2007; Frissen et al., 2011; Kalia et al., 2013; McNamara & Chen, 2020; Nardini et al., 2008; Newman & McNamara, 2021; Petrini et al., 2016; Philbeck & O’Leary, 2005; Ratliff & Newcombe, 2008; Sjolund et al., 2018; Tcheang, Bulthoff, & Burgess, 2011; Twyman, Holden, & Newcombe, 2018; Wang & Mou, 2020; Wang, Mou, & Dixon, 2018; Xu, Regier, & Newcombe, 2017; Zhao & Warren, 2015a, b). Cheng et al. (2007) proposed that navigators weight and integrate spatial cues according to models of Maximum Likelihood Estimation (MLE). According to the MLE model, each cue provides a probability distribution for a target location, with less variable distributions representing more reliable cues. Weights are assigned to cues based on their relative reliabilities (i.e., more reliable cues receive more weight) and are inversely proportional to the response variance associated with a given cue. Single-cue estimates are linearly combined to obtain a statistically optimal (in the sense of minimizing variance) estimate of the target’s location. The distribution of such optimal estimates is known as the optimal or combined distribution (in a Bayesian analysis, this distribution is referred to as the posterior distribution). Thus, MLE predicts that navigators optimally weight and integrate spatial cues during navigation according to cue reliability.

In a typical cue integration experiment, participants attempt a spatio-perceptual task, with the number of available cues being manipulated (usually within subjects; Alais & Burr, 2004; Battaglia, Jacobs, & Aslin, 2003; Ernst & Banks, 2002; Friedmann, Ludvig, & Legge, 2013; Girshick & Banks, 2009; Hillis et al., 2004; Jacobs, 1999; Oruç et al., 2003; Rohde et al., 2016). On some trials, both cues are available and are consistent (both-cue condition). On other trials, both cues are available but in conflict with one another, each indicating a different estimate of the target (conflict condition). Critically, there are also trials for each of the single cues (single-cue conditions). Single-cue trials provide response distributions for each of the cues, which are used to compute cue reliabilities and predicted weights. The reliability of a given cue is equal to the inverse of its variance:

$$r = \frac{1}{{\sigma^{2} }}.$$

(1)

The optimal weights ($W$) for cues ($A$ and $B$) are,

$$W_{A} = \frac{{r_{A} }}{{\left( {r_{A} + r_{B} } \right)}},$$

(2)

$$W_{B} = \frac{{r_{B} }}{{\left( {r_{A} + r_{B} } \right)}}.$$

(3)

Note that ${W}_{A}$ and ${W}_{B}$ sum to 1. The optimal combination of the two cues is,

$$\mu_{O} = W_{A} \mu_{A} + W_{B} \mu_{B} .$$

(4)

The variance of the combined distribution is,

$$\sigma_{O}^{2} = \frac{{\sigma_{A}^{2} \sigma_{B}^{2} }}{{\sigma_{A}^{2} + \sigma_{B}^{2} }}.$$

(5)

Note that the predicted optimal variance is never greater than the variance of either single cue (i.e., having more cues available allows greater precision). If navigators combine the cues optimally, response variance on both-cue trials will equal this predicted optimal variance.
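As an illustration of Eqs. (1)–(5), the following minimal Python sketch computes reliabilities, optimal weights, and the predicted combined mean and variance for two cues. The names `Cue`, `reliability`, `optimal_weights`, and `combine` are hypothetical and chosen only for this example, and the numerical values are made up rather than taken from any experiment.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass
class Cue:
    """A single-cue response distribution (illustrative container)."""
    mean: float      # mean estimate of the target location along one axis
    variance: float  # response variance for this cue (must be > 0)


def reliability(variance: float) -> float:
    # Eq. (1): reliability is the inverse of the variance.
    return 1.0 / variance


def optimal_weights(cue_a: Cue, cue_b: Cue) -> Tuple[float, float]:
    # Eqs. (2) and (3): weights are proportional to relative reliability.
    r_a = reliability(cue_a.variance)
    r_b = reliability(cue_b.variance)
    return r_a / (r_a + r_b), r_b / (r_a + r_b)


def combine(cue_a: Cue, cue_b: Cue) -> Cue:
    w_a, w_b = optimal_weights(cue_a, cue_b)
    # Eq. (4): linear combination of the single-cue means.
    mean = w_a * cue_a.mean + w_b * cue_b.mean
    # Eq. (5): variance of the combined (optimal) distribution.
    variance = (cue_a.variance * cue_b.variance) / (cue_a.variance + cue_b.variance)
    return Cue(mean=mean, variance=variance)


if __name__ == "__main__":
    # Made-up example: cue A is four times as reliable as cue B.
    cue_a = Cue(mean=0.0, variance=1.0)
    cue_b = Cue(mean=1.0, variance=4.0)
    print(optimal_weights(cue_a, cue_b))
    print(combine(cue_a, cue_b))
```

With these made-up inputs, the more reliable cue receives a weight of 0.8, and the predicted combined variance (0.8) is smaller than either single-cue variance, as Eq. (5) requires.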

In a recent study, Sjolund et al. (2018; Experiment 1) showed that human navigators optimally integrated environmental (room geometry) and body-based cues during a homing task. The homing task required participants to follow a two-legged path marked by waypoints before attempting to return directly to the path origin using memory. The trials varied in the number of cues available to the participant (i.e., environmental or body based, or both), and whether cues were in conflict when both were presented. When both cues were presented and non-conflicting, response variability was reduced relative to the two single-cue conditions and was consistent with optimal integration. Furthermore, observed cue weights from the conflict condition were consistent with predicted weights based on cue-relative reliability. Other studies using similar methods have shown that navigators can optimally combine body-based information with other visual cues, such as landmarks (e.g., Bates & Wolbers, 2014; Butler et al., 2010; Chen et al., 2017; Kalia et al., 2013; Nardini et al., 2008; Petrini et al., 2016; Tcheang et al., 2011; Zhao & Warren, 2015b) and optic flow (e.g., Fetsch, DeAngelis, & Angelaki, 2010; Fetsch et al., 2009, 2012).

However, cue combination studies in navigation have primarily focused on integration of cues between sensory modalities (i.e., visual and body-based cues). Some studies of cue competition suggest that visual and body-based cues are independent and do not compete for computational resources (Mou & Spetch, 2013; Shettleworth & Sutton, 2005). On the other hand, many cue competition studies examining the interplay of visual cues alone have demonstrated interference (e.g., blocking and overshadowing) between cues (Biegler & Morris, 1999; Chamizo, 2003; Chamizo et al., 1985; Hamilton & Sutherland, 1999; Hardt, Hupbach, & Nadel, 2009; Jacobs et al., 1997, 1998; Prados, 2011; Rodrigo et al., 2005; Sánchez-Moreno et al., 1999).

Previous investigations into the use of separate spatial representations from two landmarks when recalling the location of a target have demonstrated a lack of cue integration (Baguley et al., 2006; Clark et al., 2013). For instance, Baguley et al. (2006) had participants learn the location of a target along a horizontal line with two individually presented landmarks. Participants in their study did not recall the location of the target any better when both landmarks were presented at test than when only one was presented, suggesting that they were unable to integrate the information provided by both landmarks (Experiments 1 and 2). This pattern held even when participants learned the location of the target in the presence of both landmarks (Experiment 3). However, Du et al. (2017), using a similar task, found that participants optimally combined two landmark cues when estimating the location of a target on both horizontal and vertical axes.

The discrepancy in findings between Baguley et al. (2006) and Du et al. (2017) might be attributable to some key methodological differences. For example, Baguley et al. (2006) did not vary the absolute location of the landmarks and horizontal line on the computer screen, which may have allowed participants to encode the target relative to the edges of the screen, while Du et al. (2017) varied the absolute location of the landmarks and horizontal (or vertical) line while keeping their relative distances constant. Baguley et al. (2006) also had participants learn many stimulus–target pairs during learning, requiring participants to encode more information than might have been possible. Du et al. (2017) instead trained participants to learn a single stimulus–target pair. Although Du et al. (2017) demonstrated optimal combination of two landmarks, this result was only observed when participants learned the location of the target with both cues presented simultaneously.

Other evidence suggests that the use of multiple visual cues can lead to supra-optimal performance, that is, performance better than that predicted by optimally combining performance with each cue alone. Mou and Spetch (2013; Experiment 5) examined how humans combined visual cues during a spatial memory task. During a learning phase, participants studied an array of five objects arranged as a pentagon from an aerial perspective. The test involved a two-alternative forced-choice task in which participants responded to whether a target object had moved relative to the initial learning array. Humans can encode object locations as distance vectors between the target object and other objects (inter-object vectors), as well as between the target object and the viewer’s body (body-object vectors; Klatzky, 1998; McNamara, 1986; Mou & McNamara, 2002; Mou & Spetch, 2013; Stevens & Coupe, 1978; Xiao et al., 2009). On some trials, participants had access to the entire array of objects during the test (both-cue trials). On other trials, participants either had access to the two closest objects or the two farthest objects to the target (close- and far-cue trials). These trials are analogous to single-cue trials in that their response distributions are combined to predict optimal cue integration. Mou and Spetch found that performance for the both-cue trials was better than optimal integration based on the close- and far-cue trials. They argued that the observed supra-optimal performance stemmed from an additional configural cue when all objects were present during the test. That is, when all objects were present during the test, participants had access to the inter-object vectors between the close and far objects and the target object, but they could also judge the location of the target object relative to the overall configuration. This supra-optimal effect was not observed for any of the other experiments investigating the integration of inter-object and body-object vectors, suggesting that these representations are likely governed by separate systems, and are akin to environmental and body-based cues (Burgess, 2008).

Previous work has shown that human navigators can use configural information during spatial search tasks (Jacobs et al., 1998; Spetch, Cheng, & MacDonald, 1996; Spetch et al., 1997). Spetch et al. (1997) had participants search for a target object in a grassy field. A 6 × 6 m square area was defined by four identical posts which served as landmarks. During training, the goal was always present and located in the center of the array of landmarks. During testing, participants were told that the goal would be present on some trials but not others, and that if they could not find the goal in a reasonable amount of time, they were to place a marker where the goal should be. Participants completed three test trials. One trial served as a control in which the landmarks were still arranged as vertices of a 6 × 6 m square area. Another trial was a left–right expansion test in which landmarks were placed 12 m apart in the left–right dimension only, maintaining a distance of 6 m apart in the up–down dimension. Lastly, one trial was a diagonal expansion test in which landmarks were placed 12 m apart along both dimensions. On all three tests, participants searched in the center of the landmark arrays as opposed to using distance vectors from any of the individual landmarks. Thus, humans appeared to use configural information of landmarks as a spatial cue during navigation.

However, it remains unclear whether humans combine configural information with individual landmark vectors according to the MLE framework during navigation. That is, do navigators show supra-optimal performance when the entire landmark array is present during navigation relative to the optimal combination of subsets of the array? Or will navigators use only the most reliable subset of cues, being unable or unwilling to integrate subsets? Experiments 1 and 2 were designed to address three hypotheses regarding this question. The optimal integration hypothesis predicts that navigators represent target locations relative to individual landmark vectors and combine these representations during retrieval. Previous work (e.g., Spetch et al., 1996, 1997) investigating the use of configural information has used arrays of identical landmarks, making individual landmark vectors unreliable. If the array is made up of unique landmarks, navigators may disregard configural information. Thus, the optimal integration hypothesis predicts optimal combination of subsets of the array. The supra-optimal hypothesis predicts that navigators combine individual landmark vectors with configural information. This hypothesis is consistent with work by Mou and Spetch (2013) showing that humans combine inter-object and configural cues during a two-alternative forced-choice spatial perception task. Importantly, the supra-optimal hypothesis posits that the configural information (the latent cue) is integrated with the landmarks in the manner specified by the MLE model. However, it is also possible that this latent cue might dominate, leading to supra-optimal performance by way of greater reliability (see General Discussion). The supra-optimal hypothesis predicts that response variability is reduced beyond the optimal combination of the subsets of the array. The hierarchical hypothesis predicts that navigators will choose to use the most reliable subset of cues during retrieval and predicts that response variability during both-cue trials will be equal to the response variability of the most reliable cue (Du et al., 2017).

Experiment 1

Participants completed a spatial memory task in immersive virtual reality. Participants first learned a target location by walking to a post in the presence of four unique landmarks arranged as vertices of a square. Participants then attempted to walk back to the location of the post from a different starting position. On some trials, the entire array of landmarks was present during the test. On other trials, only a subset of the landmarks was present. Response accuracy and response variability were assessed for each trial type, and optimal precision was predicted from response variability from the subset trials. If participants integrate configural information, response variability should be lower than predicted by optimal integration, consistent with the supra-optimal hypothesis. If participants represent the target location with respect to individual landmarks, response variability should be consistent with optimal integration, as predicted by the optimal integration hypothesis.

Methods

Participants

Undergraduate students (N = 25; age M = 19.36, SD = 1.04; 13 females) from Vanderbilt University participated in exchange for credit in a psychology course. Previous cue combination studies in navigation (e.g., Bates & Wolbers, 2014; Chen et al., 2017; Sjolund et al., 2018) have used similar sample sizes, finding medium effect sizes (η_G²s = 0.11–0.18) of cue condition on response variability. A G*Power analysis for repeated-measures ANOVA (α = 0.05, power = 0.95, groups = 1, measurements = 4; Faul et al., 2009) showed that a sample size of 26 is sufficient to achieve f = 0.30 (medium effect = 0.25, large = 0.40). Data for eight additional participants were excluded due to simulator sickness (n = 1), failure to correctly follow experimental procedures (n = 3), recognizing which landmarks belonged to a subset (n = 1), response variability in at least one condition that exceeded the third quartile by more than three times the interquartile range (n = 1), or equipment malfunction (n = 2). A trial was considered an outlier if its response error exceeded the third quartile by more than three times the interquartile range for a given cue condition. Less than 0.01% of trials were removed using this criterion.
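The trial-level outlier rule described above can be sketched in a few lines of Python. This is an illustration only: the function names are hypothetical, the response errors are made up, and the quartile method (`method="inclusive"` in the standard `statistics` module) is an assumption, because the quartile definition used in the original analysis is not stated.

```python
import statistics
from typing import List


def outlier_cutoff(errors: List[float], k: float = 3.0) -> float:
    """Return Q3 + k * IQR for one cue condition's response errors."""
    # Assumption: "inclusive" quartiles; other quartile definitions
    # would shift the cutoff slightly.
    q1, _median, q3 = statistics.quantiles(errors, n=4, method="inclusive")
    return q3 + k * (q3 - q1)


def drop_outliers(errors: List[float], k: float = 3.0) -> List[float]:
    """Keep trials whose error does not exceed the cutoff."""
    cutoff = outlier_cutoff(errors, k)
    return [e for e in errors if e <= cutoff]


if __name__ == "__main__":
    # Made-up response errors (in meters) for a single cue condition.
    errors = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0, 100.0]
    print(outlier_cutoff(errors))
    print(drop_outliers(errors))
```

The rule is applied separately within each cue condition, so a trial is judged only against other trials of the same type.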

Materials and procedure

The immersive virtual environment was rendered in Unity, a multiplatform game engine (https://unity.com/). The environment was displayed in the HTC Vive head-mounted display (HMD) with a resolution of 1080 × 1200 per eye, refreshed at 90 Hz. The field-of-view of the HMD is approximately 110 degrees diagonally. Participants used HTC Vive’s wireless controller to progress throughout the experiment. Position and orientation tracking were supported by HTC Vive’s Lighthouse tracking system, with a 4 × 4 m tracking space. The size of the room was 7.3 × 8.5 m. The TPCast (https://www.tpcastvr.com/) supported wireless tracking of the HMD. With this approach, participants were able to physically rotate and walk throughout the virtual environment. The experiment was implemented on a computer with an Intel Core i7-6700K processor, 32 GB of RAM, and an NVIDIA GTX 1080 graphics card.

Numerous studies have demonstrated that experience and training with video games can enhance spatial abilities (see Uttal et al., 2013). To control for prior experience with video games, we administered a video game history and habits questionnaire to participants (originally developed by Boot et al., 2008). The survey asked participants about demographics, weekly time spent playing video games, when they first started playing video games, and what video game consoles they own. Only six participants reported playing video games at least 5 h a week, and only five participants reported being an active gamer. Therefore, we do not consider this metric any further. The survey also asked participants to describe any strategies used to complete the experimental task, and whether they noticed any patterns in the landmarks that were present during the test phase.

The virtual environment consisted of an infinite ground plane and four landmarks: a tree, rock, tower, and house. Landmarks were arranged as vertices of a square (Fig. 1), with adjacent landmarks 12 m apart. Yellow target posts (Fig. 2) 0.05 m in diameter appeared randomly within a 3.6 × 3.6 m area centrally superimposed between the landmarks (see Fig. 1).

Every trial comprised a learning phase, test phase, and resetting procedure before each phase, which kept participants within the VR tracking space. During the resetting procedure, a blue post and a red post were the only visible objects in the environment. Participants were instructed to walk to the blue post and turn to face the red post, and then press a button on the controller to begin the next phase. Participants began the learning phase at a randomly chosen starting location, each of which was half-way between and aligned with the two closest landmarks (see Fig. 1). During the learning phase, all landmarks were visible, and a yellow post marked the target location.^{Footnote 1} Participants were instructed to walk to the yellow post and take time to learn its location by looking around at the surrounding landmarks. Participants were told that some landmarks might or might not disappear during the test phase, so it was important to learn the location of the post relative to all the landmarks. Also, because of particular interest in the use of landmark cues, but not body-based cues, participants were told that they would never start at the same location during the test phase as they did during the learning phase. When participants thought they had memorized the location of the yellow post, they pressed a button on the controller to complete another resetting procedure before starting the test phase.

Participants started the test phase at one of the remaining three starting locations (i.e., if participants started at the southern starting location during the learning phase, they could only start at the east, west, or north locations during the test phase), which was chosen randomly. During the test phase, the yellow post was no longer visible, and participants were instructed to walk to the remembered location of the yellow post. In the both-cue condition, all landmarks remained visible. In the subset-A condition, the tree and tower were no longer visible, leaving only the house and rock available. In the subset-B condition, the house and rock were no longer visible, leaving only the tree and tower available. Once participants were confident that they were standing at the location of the yellow post, they pressed a button on the controller to confirm their response and move on to the next trial. Participants completed a practice block with one of each trial type presented in a random order, followed by ten test blocks of three trials each, with one trial for each cue condition.^{Footnote 2}
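The trial structure described above (one practice block followed by ten test blocks, each containing one trial per cue condition, with the test-phase starting location always differing from the learning-phase starting location) can be sketched as follows. All names are hypothetical. Randomizing trial order within the test blocks is an assumption: random ordering is stated only for the practice block.

```python
import random
from typing import Dict, List, Optional, Tuple

CUE_CONDITIONS = ("both-cue", "subset-A", "subset-B")
START_LOCATIONS = ("north", "east", "south", "west")

Trial = Dict[str, str]


def make_block(rng: random.Random) -> List[Trial]:
    """One block: one trial per cue condition, in shuffled order."""
    conditions = list(CUE_CONDITIONS)
    rng.shuffle(conditions)  # assumption: order randomized in every block
    block = []
    for condition in conditions:
        learn_start = rng.choice(START_LOCATIONS)
        # The test phase starts at one of the three remaining locations.
        remaining = [s for s in START_LOCATIONS if s != learn_start]
        block.append({
            "condition": condition,
            "learn_start": learn_start,
            "test_start": rng.choice(remaining),
        })
    return block


def make_session(seed: Optional[int] = None,
                 n_test_blocks: int = 10) -> Tuple[List[Trial], List[List[Trial]]]:
    """Return (practice block, list of test blocks)."""
    rng = random.Random(seed)
    practice = make_block(rng)
    test_blocks = [make_block(rng) for _ in range(n_test_blocks)]
    return practice, test_blocks


if __name__ == "__main__":
    practice, test_blocks = make_session(seed=0)
    print(len(test_blocks), "test blocks of", len(test_blocks[0]), "trials")
```

Under this scheme each participant contributes ten test trials per cue condition, which is the number of responses available for estimating each condition's variability.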

Analyses

Because the target could take on random locations, the target location for each trial was treated as the origin and responses were aligned accordingly. We first analyzed response accuracy, defined as the mean Euclidean distance between each response location and the target location (origin). Following previous work (e.g., Chen et al., 2017; Nardini et al., 2008; Sjolund et al., 2018), the standard deviation was calculated for each condition, using the absolute distance of each response relative to the mean response location (see Appendix).^{Footnote 3} Using Eq. (5), optimal integration was calculated by combining the variances from the two subset conditions. We did not correct for multiple comparisons when conducting tests comparing model predictions to combined-cue performance as higher cost is assigned to falsely accepting the model (cf. Chen et al., 2017). Mauchly’s test revealed that the assumption of sphericity was met for all repeated-measures ANOVAs reported. However, the Greenhouse–Geisser correction for departure from sphericity was still used as even non-significant departures from sphericity can influence within-subject effects. GG epsilon is reported for all repeated-measures ANOVAs and Cohen’s d is reported for each comparison:

$$d = \frac{{M_{2} - M_{1} }}{{SD_{Pooled} }}.$$

(6)
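The analysis steps above can be illustrated with a short Python sketch. It rests on several labeled assumptions: responses are two-dimensional ground-plane coordinates; response variability is taken as the square root of the summed squared distances from the mean response location divided by n − 1 (the exact formula is given in the Appendix, which is not reproduced here); and the pooled standard deviation in Eq. (6) is taken as the root mean square of the two condition SDs. All function names are hypothetical.

```python
import math
from typing import List, Tuple

Point = Tuple[float, float]


def align(response: Point, target: Point) -> Point:
    """Express a response relative to its target (target becomes the origin)."""
    return (response[0] - target[0], response[1] - target[1])


def mean_error(aligned: List[Point]) -> float:
    """Response accuracy: mean Euclidean distance from the origin (target)."""
    return sum(math.hypot(x, y) for x, y in aligned) / len(aligned)


def response_sd(aligned: List[Point]) -> float:
    """Response variability around the mean response location.

    Assumption: sqrt(sum of squared distances to the centroid / (n - 1)).
    """
    n = len(aligned)
    cx = sum(x for x, _ in aligned) / n
    cy = sum(y for _, y in aligned) / n
    squared = sum((x - cx) ** 2 + (y - cy) ** 2 for x, y in aligned)
    return math.sqrt(squared / (n - 1))


def optimal_sd(sd_a: float, sd_b: float) -> float:
    """Predicted optimal SD from two subset SDs, via Eq. (5)."""
    var_a, var_b = sd_a ** 2, sd_b ** 2
    return math.sqrt(var_a * var_b / (var_a + var_b))


def cohens_d(m1: float, sd1: float, m2: float, sd2: float) -> float:
    """Eq. (6), assuming SD_pooled = sqrt((sd1^2 + sd2^2) / 2)."""
    sd_pooled = math.sqrt((sd1 ** 2 + sd2 ** 2) / 2.0)
    return (m2 - m1) / sd_pooled


if __name__ == "__main__":
    # Made-up responses for one participant in one condition.
    targets = [(1.0, 1.0), (0.5, -0.5), (-1.0, 0.0), (0.0, 1.5)]
    responses = [(2.0, 1.0), (-0.5, -0.5), (-1.0, 1.0), (0.0, 0.5)]
    aligned = [align(r, t) for r, t in zip(responses, targets)]
    print(mean_error(aligned), response_sd(aligned))
    print(optimal_sd(1.0, 1.0))
```

In this sketch the optimal prediction would be computed separately for each participant from that participant's two subset SDs, and the resulting values would then be compared with the observed both-cue SDs.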

In addition to traditional inferential tests, the Bayes factor (BF) was computed for comparisons of response variability for the both-cue condition and optimal integration (Jarosz & Wiley, 2014). We considered a Bayes factor (null/alternative) greater than 3 as adequate evidence that performance in the both-cue condition did not differ from the optimal MLE prediction and conversely, a Bayes factor less than 1/3 as adequate evidence that observed and predicted performance differed. If the p-value did not reach significance and the Bayes factor was between 1 and 3, cues were considered to be combined near-optimally (cf. Chen et al., 2017). Following suggestions by Rouder et al. (2009), we used a central Cauchy distribution as the prior with scale r on effect size set to 0.707. This prior is the default setting in many current statistical packages for calculating the Bayes factor (e.g., BayesFactor package for the R Environment; Morey & Rouder, 2015; R Core Team, 2019). As shown by Rouder et al. (2009), changes in scale r seldom result in changes in interpreting the Bayes factor.
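The decision rule described above can be written as a small Python function. This is a sketch only: the function name and labels are hypothetical, the handling of values exactly at the boundaries (1, 3, and 1/3) is an assumption, and the "inconclusive" fallback for cases the text does not cover (e.g., a Bayes factor between 1/3 and 1) is also an assumption.

```python
def classify_integration(p_value: float, bf_null_over_alt: float,
                         alpha: float = 0.05) -> str:
    """Classify both-cue performance relative to the optimal MLE prediction.

    bf_null_over_alt is the Bayes factor in favor of the null hypothesis
    (observed variability equals the optimal prediction).
    """
    if bf_null_over_alt > 3.0:
        # Adequate evidence that performance does not differ from optimal.
        return "optimal"
    if bf_null_over_alt < 1.0 / 3.0:
        # Adequate evidence that observed and predicted performance differ.
        return "differs from optimal"
    if p_value >= alpha and 1.0 <= bf_null_over_alt <= 3.0:
        # Non-significant test with a Bayes factor between 1 and 3.
        return "near-optimal"
    # Assumption: anything not covered by the stated criteria.
    return "inconclusive"


if __name__ == "__main__":
    # Illustrative inputs only.
    print(classify_integration(p_value=0.40, bf_null_over_alt=4.2))
    print(classify_integration(p_value=0.0005, bf_null_over_alt=0.005))
    print(classify_integration(p_value=0.20, bf_null_over_alt=2.0))
```

Checking the Bayes factor thresholds before the p-value mirrors the order in which the criteria are stated in the text; a different ordering would give the same result for every case the text defines.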

Results and discussion

Response accuracy

Response accuracy was examined using a repeated-measures ANOVA, with cue condition as a within-subjects factor. The main effect of cue condition was significant, F(2,48) = 8.60, GG epsilon = 0.93, p < 0.001, η_p² = 0.39 (BF = 0.85). Planned comparisons showed that participants were more accurate in the both-cue condition (M = 0.89, SD = 0.27) than in the subset-B condition (M = 1.09, SD = 0.28), t(24) = 3.74, p = 0.001, d = 0.72 (BF = 0.04). Accuracy did not differ significantly between the both-cue and subset-A (M = 0.95, SD = 0.21) conditions, t(24) = 1.31, p = 0.203, d = 0.26 (BF = 2.21).

Response variability

Response variability was examined using a repeated-measures ANOVA, with cue condition as a within-subjects factor. The main effect of cue condition was significant, F(2,48) = 7.03, GG epsilon = 0.99, p = 0.002, η_p² = 0.23 (BF = 0.55) (Fig. 3). Planned comparisons revealed lower response variability in the both-cue condition (M = 1.03, SD = 0.31) than in the subset-B condition (M = 1.20, SD = 0.28), t(24) = 3.12, p = 0.005, d = 0.62 (BF = 0.11). Response variability in the both-cue condition was not significantly different from that in the subset-A condition (M = 1.03, SD = 0.25), t(24) = 0.05, p = 0.959, d = 0.01 (BF = 4.74). Response variability in the both-cue condition was significantly greater than optimal integration (M = 0.77, SD = 0.17), t(24) = 5.59, p < 0.001, d = 1.12 (BF < 0.01).
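For readers who wish to relate the reported effect size to the F statistic, partial eta squared for a single within-subjects factor can be recovered from F and its uncorrected degrees of freedom. The worked example below uses the values reported for response variability and assumes the standard identity between F and η_p²; it is offered as a consistency check, not as a description of how the value was originally computed.

$$\eta_{p}^{2} = \frac{{F \cdot df_{effect} }}{{F \cdot df_{effect} + df_{error} }} = \frac{7.03 \times 2}{7.03 \times 2 + 48} = \frac{14.06}{62.06} \approx 0.23.$$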

The results of Experiment 1 support the hierarchical hypothesis; that is, participants chose to use the most reliable of the two subsets (i.e., subset-A) when both subsets were presented during testing. Although we did not predict that either subset would be more reliable than the other, it was observed that participants tended to spend more time viewing the location of the target relative to the house than any other landmark. The house might have been a more reliable cue given its size and shape; the sharp edges of the house provide a salient reference point to which the location of the target can be encoded. Cue salience has been shown to be a critical component of cue reliability and weighting and can be determined by a multitude of factors such as a landmark’s physical properties (Chen et al., 2017). Thus, it is not surprising that the landmarks composing each subset were not equal in this regard.

Although participants’ response variabilities across cue conditions suggest a lack of cue integration, over a third of our participants (n = 9) reported, after the experimental session was over, having used a configural cue during encoding (Fig. 4). Thus, the results of Experiment 1 do not rule out the possibility that participants can integrate configural cues with individual landmark vectors to remember a target location. In Experiment 2, we drew inspiration from Spetch et al. (1996, 1997) and encouraged participants to utilize a configural cue strategy by equalizing cue salience across subsets and providing verbal instruction about the configural nature of the landmarks.

Experiment 2

Because only a minority of participants reported using configural strategies in Experiment 1, we attempted to prime participants to use a configural approach in Experiment 2 by reducing the number of distinguishing features across landmarks and by instructing participants to consider the configural structure of the landmark array. We also randomized the landmarks composing a subset across trials, with the only constraint being that the two landmarks in a subset must be adjacent to one another.