1 Introduction

One benefit of having two eyes with overlapping fields of view is stereopsis, or depth perception. Stereopsis is essential for locating the position of an object relative to its surroundings, especially at near distances. When an object is seen from two different points of view, its image undergoes a slight relative horizontal position shift. The visual system encodes this relative positional difference (called binocular disparity) and perceives the object at a certain z-coordinate in visual space. This 3D location information must be obtained, computed, and conveyed for visually guided actions such as avoiding obstacles, grasping a mug, or descending staircases.

Julesz proposed that one of the most important functions of stereopsis is to break camouflage, i.e., to support the ability to distinguish a target from the background even when the two share a matching pattern (Julesz 1971). With stereopsis, one can extract global depth information from autostereograms (single 2-dimensional images with repeating patterns that reveal 3-dimensional content when viewed with the appropriate eye convergence) or random dot stereograms. This information is derived purely from binocular disparity, based on the corresponding regions (or points) matched between the two eyes. While stereopsis is based solely on relative position information between the two eyes' images, the resulting depth percept can be enriched or degraded by monocular cues. Artists have long been able to render a flat canvas into an impression of 3-dimensional depth by tricking our brains with pictorial cues. Pictorial cues in the visual scene such as texture (Johnston et al. 1993; Hornsey and Hibbard 2021), texture gradient (Hillis et al. 2004; Stevens 1981; Tozawa 2012; Tsutsui et al. 2002), blur (Held et al. 2012; Mather 1997), relative size, lighting direction (Langer and Bülthoff 2000), occlusion (Marshall et al. 1996), tilt (Fiorentini and Maffei 1971; Oluk et al. 2022), distance to the horizon (Gardner et al. 2010), and shading (Chen and Tyler 2015) provide additional information for depth perception and stereopsis. For 3D shapes, differences in material properties, local luminance contrast changes at borders, shadow, reflectance, and motion can help us identify a figure against a background with matching texture (Troscianko et al. 2009). Beyond these cues, the extraction of 3D visual information about an object from its surroundings also depends on the complexity of the background scene and on the contrast and color differences between the target and its background (O’Shea et al. 1994; Scaccia and Langer 2018).

Extraction of 3D visual information involves integrating information from both the object and its surroundings (Salmela and Laurinen 2007). Background textures of different spatial patterns (Sutter et al. 1989), contrasts (Frisby and Mayhew 1978), reference planes (Andrews et al. 2001), nearby features (Mitchison and Westheimer 1984), or colors may improve or worsen stereopsis. For example, neighboring surfaces with similar texture profiles may camouflage one another, making them harder to tell apart and potentially degrading stereopsis. As a simple demonstration of how background texture affects depth perception, try cross-fusing the left and right images in Fig. 1A; specific instructions are given in the figure caption. One will see a square target appearing in depth, in front of the non-textured gray background. With matching target-background texture (Fig. 1B), one will have difficulty separating the 3D target from the background. In contrast, with a lower spatial frequency (SF) texture in the background (Fig. 1C), the background appears blurred, strengthening the depth percept of the target.

Fig. 1
figure 1

A simple demonstration of the effect of background texture on depth perception. A helpful trick is to place a finger slightly in front of the screen and fixate on it. Move the finger closer to your face, which forces your eyes to converge. While fixating on the finger, the two images behind it will create a third image in the middle. Adjust the finger closer or farther until the black dots above the images overlap. A A textured square will appear in depth, in front of the gray background. B With matching target-background texture, the target blends in with the background, making it harder to detect the center square monocularly; a center square nevertheless appears in front of the background despite the similar texture pattern. C A lower SF background gives an illusion of blur when fixating on the target and enhances depth perception

The human visual system comprises multiple spatially tuned filters that extract luminance information and give rise to the bandpass-shaped contrast sensitivity function. Different SFs carry different information: high SFs convey details and edges, while low SFs convey coarse global information such as shape (Bex et al. 2009; Westheimer 2001). Stereopsis exhibits similar behavior. Many psychophysical studies have investigated the spatial characteristics of the target in stereopsis using Gabor patches (Schor and Wood 1983), sinusoidal gratings, or depth corrugations (Reynaud and Hess 2017). While these studies provided a fundamental understanding of spatial frequency effects in stereopsis, in real-world scenes spatial frequency content is also carried by surface textures: some luminance variation across an object's texture is needed to carry disparity information.

It remains uncertain how the surroundings, especially the spatial content of the background texture, affect the integration of figure-background depth information (Sawayama and Kimura 2015). Spatial content here refers to spatial frequency (SF), the periodic distribution of dark and light imaged within a given area on the retina. In visual perception studies, sinusoidal, square-wave, and noise patterns defined by spatial frequency (cycles per degree, cpd) are typically used. Only a few studies have examined the impact of texture spatial properties on stereopsis. Brown and Weisstein, using sinusoidal gratings, found a bias whereby higher SF gratings are perceived as being in front when surrounded by gratings 2 octaves lower in SF (Brown and Weisstein 1988). In contrast, another study found no bias as a function of spatial frequency (Siderov and Harwerth 1993). Another example of how background texture affects depth perception is disparity averaging, the combination of disparity information across space (Cammack and Harris 2016; Li and Motter 2007). For example, the disparity information extracted from a single target in isolation may differ from that of the same target surrounded by other objects or a cluttered background. The SF and contrast of the surround affect disparity averaging: generally, when two targets whose SF patterns are within 3.5 octaves of each other are placed side by side, the perceived depth is biased toward the higher SF (Rohaly and Wilson 1994).

In virtual reality (VR), textures are often applied to a surface to provide visual detail. The SF of a texture can easily be manipulated in the virtual environment. For example, a texture from an asset library typically comes in a fixed size: it can be stretched to wrap an entire area or be tiled repeatedly, so different tiling and wrap modes yield contrasting SF content, as shown in Fig. 2. Well-designed virtual content can improve visual comfort and the immersive experience, especially distance perception (Creem-Regehr et al. 2022; Diaz et al. 2017; El Jamiy and Marsh 2019; Ellis et al. n.d.; Kline and Witmer 1996). The underestimation of, and bias in, distance perception in VR environments has been studied extensively (Bruder et al. 2015; Creem-Regehr et al. 2015; Kunz et al. 2015; Renner et al. 2013; Scarfe and Glennerster 2021). Hornsey and Hibbard found that texture reduced this bias (Hornsey and Hibbard 2021). In addition, realistic texture has been reported to improve spatial perception in virtual reality compared with “paper model/non-textured” scenes (Lucaci et al. 2022). Therefore, beyond its aesthetic role, texture can also be used to improve depth judgment in virtual reality. This study specifically explores the effect of texture (on both the target and the background) as a pictorial cue for depth perception in a virtual environment.
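The relationship between tiling and SF content is simple arithmetic: repeating a tile packs more luminance cycles into the same visual angle. A minimal Python sketch makes the relation explicit (the function name and the example values are illustrative, not taken from the study):

```python
def effective_cpd(cycles_per_tile: float, repeats: int, surface_deg: float) -> float:
    """Effective spatial frequency (cpd) of a tiled texture.

    Tiling a texture `repeats` times across a surface packs
    cycles_per_tile * repeats cycles into the same visual angle,
    so on-screen spatial frequency scales with the repeat count.
    """
    return cycles_per_tile * repeats / surface_deg

# One tile holding 9 cycles, stretched over a 3-degree surface: 3 cpd.
print(effective_cpd(9, 1, 3.0))   # 3.0
# The same tile repeated 3 times triples the spatial frequency: 9 cpd.
print(effective_cpd(9, 3, 3.0))   # 9.0
```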

Fig. 2
figure 2

Example of how different tiling and wrap modes affect the SF content of a texture. A Non-repeating wrap mode. B Low repeat. C Excessive tiling and repeat

In our experiment, bandpass noise patterns at different SFs were used to systematically investigate the role of target and background texture in relative distance judgments in virtual reality. Complete randomization of pixel values yields a white noise pattern (Fig. 3A) containing energy at all spatial frequencies. Bandpass noise patterns (Fig. 3B–F), in contrast, are filtered to contain a narrow band of spatial frequencies, allowing the study of fundamental mechanisms of visual perception (Allard et al. 2015). Unlike gratings, they have no orientation bias and more closely resemble natural textures; they also allow disparity to be assessed as a function of texture SF. We found that the texture of the target is critical for depth perception: participants performed best with mid-SF bandpass noise texture on the target and significantly worse with no texture on the target. Background texture also affected performance, which was best with a mid-SF target against a mid-SF background. We propose several models to investigate the influence of SF tuning and texture integration in depth perception.

Fig. 3
figure 3

Example of white noise pattern and bandpass noise pattern. The spatial frequency is defined by the number of black-and-white cycles in each square, assuming each square subtends a visual angle of 3°. A White noise pattern, B Bandpass noise pattern 0.33 cycles per degree (cpd), C 0.66 cpd, D 1 cpd, E 3 cpd, F 4.5 cpd

2 Methodology

2.1 Subjects and apparatus

Eleven subjects (mean age = 29.3 years, three female) volunteered for this pilot study; one of them was an author of the manuscript. None had binocular vision abnormalities; all had stereoacuity of 32 arcsec or better on the Random Dot 2-S Stereotest (Vision Assessment Corporation) and monocular visual acuity of at least 20/25, and all wore their prescribed correction during the experiment. Written informed consent was obtained from all participants before the experiment, and the experimental procedures were approved by the Institutional Review Board of the University of Houston.

Custom code was written in JavaScript using the Three.js 3D rendering library to assess distance discrimination in a virtual reality environment. The program ran in a web browser with the WebXR emulator, with the rendered scene shown on each display of an HTC Vive Pro headset. This headset has a resolution of 1080 × 1200 pixels per eye, a 90 Hz refresh rate, and an approximately 110° field of view. The scene was stable on the display, with all head-tracking features disabled. Luminance was gamma corrected (gamma = 2.2) in software using a custom WebGL shader, and the noisy-bit method was used to provide finer resolution of luminance steps (Allard and Faubert 2008). The mean luminance of the stimuli was ~ 50 cd/m2, equivalent to a typical indoor brightness level and comfortable for prolonged viewing. The HTC Vive Pro can reach luminances as high as ~ 110 cd/m2 at full white. Subjects responded using a standard computer keyboard.
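The correction itself ran as a WebGL shader; the NumPy sketch below illustrates the same two steps under stated assumptions (an 8-bit display and the standard formulation of the noisy-bit method; function and variable names are ours):

```python
import numpy as np

rng = np.random.default_rng()

def to_display(luminance: np.ndarray, gamma: float = 2.2) -> np.ndarray:
    """Map desired relative luminance (0-1) to 8-bit display values.

    Raising to 1/gamma linearizes the display's gamma response.  The
    noisy-bit step (Allard & Faubert 2008) adds uniform noise before
    rounding, so that averaged over pixels the quantization error no
    longer limits luminance resolution to 256 discrete steps.
    """
    v = luminance ** (1.0 / gamma)               # inverse of display gamma
    levels = v * 255.0
    dithered = np.round(levels + rng.uniform(-0.5, 0.5, levels.shape))
    return np.clip(dithered, 0, 255).astype(np.uint8)
```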

2.2 Stimulus

The virtual scene consisted of a room with four walls (width and length of 12 m virtual distance) surrounding a perspective camera positioned at the center. Wearing the headset, observers saw a background wall located 6 m ahead. On a keypress, three squares appeared halfway to the wall, at a 3 m distance, for 1 s. Each square tile subtended 3 × 3°, and the tiles were separated by 4.7° center-to-center. The middle square was positioned straight ahead along the line of sight, with one square to its left and one to its right. The targets and background were rendered with textures made of bandpass filtered noise patterns (0.33, 0.66, 1, 3, and 4.5 cycles per degree (cpd)); higher SF textures have finer detail and a grainier appearance. The bandpass filtered noise patterns were pre-generated offline, loaded as PNG files, and wrapped around the objects. The materials were non-shiny, non-specular MeshLambertMaterial surfaces as defined in Three.js. The spatial filter has a bandwidth (full width at half height) of 1 octave and is radially symmetric in the log2 frequency domain. The filter equation, modified from Peli (1990), is given by:

$$\text{Amplitude} = 0.5 + \frac{\cos \left( \pi \left( \log_{2} \left( vals \right) - \log_{2} \left( ctr \right) \right) \right)}{2}$$

where vals is the radial position in pixels from the center of the frequency spectrum and ctr corresponds to the center spatial frequency. First, random values between 0 and 1 were generated for every pixel within the dimensions of the image. The image was then Fourier transformed and multiplied by the log2 filter centered on that SF in cycles per image. The entire image was scaled to the visual angle of each object, yielding textures defined in cycles per degree (cpd). A large texture was generated to match the size needed for the background; a snippet of this texture was cropped for use on the square tiles and randomly rotated for each target on each trial. An ambient light source (intensity = 1) that illuminated all objects in the scene equally was placed at the origin, with no cast shadows.
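A NumPy sketch of this generation procedure is given below. It is a reconstruction from the description above, not the authors' offline script; we assume the filter is clipped to zero beyond ± 1 octave from the center, where the raised cosine naturally reaches zero:

```python
import numpy as np

def bandpass_noise(size_px: int, ctr_cpi: float, rng=None) -> np.ndarray:
    """Generate a 1-octave band-pass noise texture (values 0-1).

    ctr_cpi is the center frequency in cycles per image; dividing by
    the visual angle the image will subtend (deg) gives its cpd.
    """
    rng = rng or np.random.default_rng()
    noise = rng.random((size_px, size_px))            # white noise in 0-1

    # Radial frequency (cycles per image) of every FFT coefficient.
    fx = np.fft.fftfreq(size_px) * size_px
    radius = np.sqrt(fx[None, :] ** 2 + fx[:, None] ** 2)
    radius[0, 0] = 1e-9                               # avoid log2(0) at DC

    # Raised-cosine filter, radially symmetric on a log2 axis (Peli 1990):
    # amplitude = 0.5 + cos(pi * (log2(f) - log2(ctr))) / 2,
    # assumed to be zero beyond +/- 1 octave from the center frequency.
    octaves = np.log2(radius) - np.log2(ctr_cpi)
    amp = np.where(np.abs(octaves) <= 1.0,
                   0.5 + np.cos(np.pi * octaves) / 2.0, 0.0)

    filtered = np.real(np.fft.ifft2(np.fft.fft2(noise) * amp))
    # Rescale symmetrically around mean luminance for display.
    filtered -= filtered.mean()
    return 0.5 + 0.5 * filtered / np.abs(filtered).max()
```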

During each trial, one of the targets appeared closer than the other two, i.e., at a smaller z-coordinate than the fixation plane at 3 m. For example, as illustrated in Fig. 4A, the target on the left is positioned 2.7 m from the observer while the middle and right non-targets are at 3 m. The task was to identify the target's position and respond with the left, down, or right arrow key. To eliminate size cues when the target is displaced forward, a scaling factor compensated for the change in retinal size so that the three squares subtended a fixed angular size regardless of z-position; a short sketch of this compensation follows. A black grid was drawn on the background surface to support binocular fusion throughout the experiment. None of the subjects reported double vision during the experiment.
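The compensation follows from simple trigonometry: the physical edge length needed to subtend a fixed angle grows linearly with distance, so a square displaced from z0 to z is scaled by z/z0. A minimal sketch using the values from the setup above (the helper name is ours):

```python
import math

def metric_size(angular_deg: float, distance_m: float) -> float:
    """Physical edge length (m) subtending `angular_deg` at `distance_m`."""
    return 2.0 * distance_m * math.tan(math.radians(angular_deg) / 2.0)

# A 3-degree square at the 3 m fixation plane vs. displaced to 2.7 m:
print(metric_size(3.0, 3.0))   # ~0.157 m
print(metric_size(3.0, 2.7))   # ~0.141 m -> scale factor z/z0 = 0.9
```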

Fig. 4
figure 4

A An illustration of the experimental setup. Through the HTC Vive Pro headset, a textured background wall is located 6 m ahead. The targets, three square tiles, were presented for 1 s at 3 m on each trial, and one of them was closer to the observer, indicated by the red arrow. The task was to identify the closer one with a keypress. The target and background were tested with different spatially filtered noise textures (SF = 0.33, 0.66, 1, 3, 4.5 cpd, or no texture: isoluminant gray). B The distance thresholds between the target and the two non-targets were obtained through a staircase method with seven reversal points, indicated by the black ‘O’ symbols. The distance thresholds were calculated as the mean of the last 5 reversal points. Green dots indicate correct responses, while red ‘X’ symbols indicate erroneous responses

A 3-down 1-up adaptive staircase method with seven reversals was implemented to measure the distance threshold (in meters), the smallest distance separation needed to distinguish the target from the two non-targets. The separation between the target and the two non-targets decreased after three correct responses in a row and increased after a mistake, as shown in Fig. 4B. The mean of the last five reversals was used to estimate the distance threshold. The maximum value of the staircase was capped at 1.5 m from the fixation plane because left and right squares presented beyond that limit had a significant horizontal displacement that could be used as a cue. Each staircase was repeated for each condition. The dependent variable of the study was the distance threshold, while the independent variables were the different combinations of texture. Three sets of conditions were tested for each subject: textured targets against a gray background (the effect of target texture alone), non-textured gray targets against a textured background (the effect of background texture alone), and different combinations of target and background texture (the combined effect with matching and non-matching textures). To test the effect of spatial frequency, each texture condition was tested at 5 spatial frequencies: 0.33, 0.66, 1, 3, and 4.5 cpd. A minimal sketch of the staircase logic is given below.
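In this sketch, the step size and the simulated observer are illustrative assumptions; only the 3-down 1-up rule, the 1.5 m cap, the seven reversals, and the mean of the last five reversals come from the text:

```python
import numpy as np

def staircase_3down_1up(respond, start=1.5, step=0.05, n_reversals=7):
    """3-down 1-up adaptive staircase (the step size is an assumed value).

    `respond(sep)` runs one trial at target/non-target separation `sep`
    (m) and returns True when the response is correct.  The separation
    shrinks after 3 consecutive correct trials, grows after any error,
    and is capped at the 1.5 m limit; the threshold estimate is the
    mean of the last 5 of 7 reversals.
    """
    sep, run, direction, reversals = start, 0, 0, []
    while len(reversals) < n_reversals:
        if respond(sep):
            run += 1
            if run == 3:                      # 3 correct in a row -> harder
                run = 0
                if direction == +1:
                    reversals.append(sep)     # turn from ascending = reversal
                direction = -1
                sep = max(sep - step, step)
        else:                                 # any error -> easier
            run = 0
            if direction == -1:
                reversals.append(sep)         # turn from descending = reversal
            direction = +1
            sep = min(sep + step, start)      # capped at 1.5 m
    return float(np.mean(reversals[-5:]))

# Simulated observer: always correct above a 0.25 m separation,
# guessing (1 in 3) below it.
rng = np.random.default_rng(1)
print(staircase_3down_1up(lambda s: s > 0.25 or rng.random() < 1 / 3))
```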

The subjects each completed a total of 70 blocks (35 blocks × 2 repeats), and each block took about 2–5 min. All subjects ran a few practice blocks before the experiment to become familiar with the task. The total duration of the study was approximately 3–4 h, divided into 3 or 4 visits, with short breaks every ~ 15–20 min. We encouraged breaks and stopped each session after an hour to reduce fatigue and accommodation-vergence effects. The details for each condition are given below:

Condition A (5 blocks)

Textured targets (0.33, 0.66, 1, 3, 4.5 cpd) against isoluminant gray background (no texture)

Condition B (5 blocks)

Gray target (no texture) against textured background (0.33, 0.66, 1, 3, 4.5 cpd)

Condition C (25 blocks)

Combinations of textured target and background (0.33, 0.66, 1, 3, 4.5 cpd). In the “Matched Condition,” targets and background had the same SF texture; in the “Unmatched Condition,” the targets had a different SF texture from the background

The distance thresholds in meters were converted to disparity thresholds in arc min, assuming a 65 mm inter-pupillary distance and fixation at 3 m, using the formula below (Howard 1919), where d is the threshold distance and D = 3 m is the fixation distance. Disparity sensitivity is calculated as the reciprocal of the disparity threshold.

$$\text{Disparity}\;(\text{arc min}) = \frac{0.065 \times d \times 57.29 \times 60}{D^{2} - dD}$$
$$\text{Disparity sensitivity}\;\left( \min^{-1} \right) = 1/\text{disparity threshold}$$
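As a worked example, the conversion can be written as a small helper (a sketch under the stated 65 mm IPD assumption; the sample value is the Condition A mean threshold at 3 cpd reported in the Results):

```python
import math

IPD = 0.065                    # assumed inter-pupillary distance (m)
RAD_TO_ARCMIN = 57.29 * 60     # radians -> degrees -> arc minutes

def disparity_arcmin(d: float, D: float = 3.0) -> float:
    """Relative disparity (arc min) between the fixation plane at D
    and a target displaced to D - d, for the assumed 65 mm IPD."""
    return IPD * d * RAD_TO_ARCMIN / (D ** 2 - d * D)

def log_disparity_sensitivity(d: float, D: float = 3.0) -> float:
    """Log10 of disparity sensitivity (min^-1), as used in the modeling."""
    return math.log10(1.0 / disparity_arcmin(d, D))

# Condition A mean threshold at 3 cpd: 21.18 cm -> ~5.7 arc min.
print(round(disparity_arcmin(0.2118), 2))   # 5.66
```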

2.3 Data analysis

We performed data analyses in Python using the NumPy/SciPy scientific libraries (Oliphant 2007). A Shapiro–Wilk test for normality of the distance thresholds and distance sensitivities indicated that the sample data do not come from a normal distribution, and other transformations also failed to normalize the data. We therefore used the non-parametric Wilcoxon signed-rank test to compare results between conditions, with p < 0.05 as the threshold for statistical significance.
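A minimal SciPy sketch of this decision procedure (the helper and its parametric fallback branch are illustrative; the study used the Wilcoxon branch throughout):

```python
import numpy as np
from scipy import stats

def compare_conditions(a: np.ndarray, b: np.ndarray, alpha: float = 0.05):
    """Paired comparison of matched threshold arrays (one value per
    subject x SF).  Falls back to the Wilcoxon signed-rank test when
    Shapiro-Wilk rejects normality of the paired differences."""
    _, p_norm = stats.shapiro(a - b)
    if p_norm < alpha:
        return stats.wilcoxon(a, b)      # non-parametric paired test
    return stats.ttest_rel(a, b)         # parametric alternative
```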

3 Results

For each subject, there were 70 outcome measures of the distance threshold, the smallest distance at which the target could be distinguished as being in front of the other two non-targets, across all conditions. The mean distance threshold across all subjects was computed for each condition. Distance sensitivity is the reciprocal of the distance threshold (m−1). Since the goal of the study is to characterize the trend of depth perception with different combinations of texture defined by spatial frequency, descriptive analysis was performed.

3.1 Condition A (textured target against a gray background)

Figure 5A shows the mean distance threshold for textured targets against a gray background. If the spatial frequency of the target texture did not contribute to depth perception, the expected result would be a flat horizontal line; instead, the results showed a clear trend. The mean distance thresholds were 34.55, 29.98, 25.42, 21.18, and 33.4 cm for 0.33, 0.66, 1, 3, and 4.5 cpd, respectively (right y-axis). The distance sensitivities (m−1) were 3.13, 3.66, 4.24, 6.12, and 3.27, respectively (left y-axis). Higher distance sensitivity, i.e., lower distance threshold, indicates that subjects could distinguish a smaller distance separation between the target and non-targets. On average, subjects had the lowest distance thresholds, or peak sensitivity, when the targets were presented at 3 cpd. Each color and symbol represents an individual subject. Most subjects could not give reliable responses when the targets were presented at 4.5 cpd; they reported that the 4.5 cpd target resembled a “fuzzy” gray texture, making it hard to distinguish from the gray background. Based on the resolution and field of view of the headset, 1° of visual angle subtends about 10 pixels, while a 4.5 cpd texture requires a minimum of 9 pixels per degree with alternating black and white pixels; 4.5 cpd is therefore at the verge of the display's resolution limit. Staircases were terminated when the distance threshold exceeded 1.5 m, and those data points were excluded from the analysis.

Fig. 5
figure 5

Distance thresholds for different spatial frequency textures. A Condition A (textured target against gray background). The distance thresholds obtained from the response are converted to distance sensitivity (1/thresholds). From the mean across subjects, the sensitivity curve had a peak performance at the mid-SF range, 3 cpd. B Condition B (gray target against textured background). The overall sensitivity is lower with non-textured targets. The peak sensitivity occurs in the background with low spatial frequency texture. Different points in the plots represent the results of each subject. The ribbon diagram on the top of each plot is an illustration of the target and background in each condition. The vertical scale on the right side of each plot is the distance threshold in virtual distance units (cm)

3.2 Condition B (gray target against a textured background)

Figure 5B shows the results for non-textured gray targets against a textured background. The mean distance thresholds were 42.70, 39.97, 48.67, 50.41, and 71.83 cm for 0.33, 0.66, 1, 3, and 4.5 cpd, respectively (right y-axis). The mean distance sensitivities were 2.60, 2.74, 2.16, 2.07, and 1.41, respectively (left y-axis). If the spatial frequency of the background texture did not contribute to depth perception, the expected result would be a flat horizontal line; instead, subjects performed best with lower SF textures in the background. Subjects again had difficulty with the task when the background was presented at 4.5 cpd. Despite using the same combinations of SFs as Condition A, with target and background reversed, the distributions of distance thresholds pooled across all SFs differed significantly between the two conditions (Wilcoxon signed-rank test, p < 0.001). The orange curve (Fig. 5B) lies below the red curve (Fig. 5A), revealing poorer performance with the non-textured target.

3.3 Condition C (textured target + textured background)

Figure 6A shows the mean distance sensitivity and distance thresholds across all observers for different background textures (background SF increasing from the left to the right column). The peak sensitivity remained robust at 3 cpd across the different background textures. Note that the “inverted U-shape” of each curve also changed with the background texture. A heat map was generated to illustrate the pattern across the twenty-five target-background texture combinations (see the red square in Fig. 6B). We treated the gray texture as the maximum SF and extended the axes accordingly: an extremely high SF (dense) texture would exceed the resolution of the display and render as the mean luminance, which is gray. The result from Condition A is added as the rightmost column and the result from Condition B as the topmost row. Overall, subjects performed best with target and background textures within 1–3 cpd, indicated by the brighter region in the middle, and performed poorly when the target was presented at 4.5 cpd or gray. In each column, the color changes from bottom to top, showing how the SF of the target affected depth perception for a given background. To determine whether background or target texture plays the larger role, the standard deviations of the distributions for each row and column within the red box were computed. The standard deviations for each row (bottom to top) were 3.31, 2.31, 3.57, 4.19, and 6.03 for target SFs of 0.33–4.5 cpd, while those for each column (left to right) were 4.19, 7.40, 8.84, 9.74, and 9.61 for background SFs of 0.33–4.5 cpd. The larger standard deviations within columns suggest that changing the SF of the target texture had a larger impact than changing that of the background.
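The row/column comparison reduces to standard deviations along the two axes of the 5 × 5 sensitivity grid; a sketch of the computation (the function name and axis convention are ours):

```python
import numpy as np

def texture_influence(sens: np.ndarray):
    """sens[i, j]: mean distance sensitivity for target SF i (row, bottom
    to top) against background SF j (column, left to right), i.e. the
    5 x 5 grid inside the red box of Fig. 6B."""
    row_sd = sens.std(axis=1)  # fix target SF, vary the background
    col_sd = sens.std(axis=0)  # fix background SF, vary the target
    # Larger column SDs mean that varying the target texture moves
    # performance more than varying the background texture.
    return row_sd, col_sd
```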

Fig. 6
figure 6

Results for Condition C. A Mean distance sensitivity and distance thresholds for different target textures against different background SFs (background SF increasing from left to right). The pattern of peak sensitivity in the mid-SF range remains robust across backgrounds, while the curvature of the “inverted U-shape” plot changes with the background. B The left panel replotted as a heatmap for better visualization. The red squares outline the results for Condition C with the different target-background pairings. The results from Condition A (gray background) and Condition B (gray target) were added as the rightmost column and topmost row, respectively. It was impossible to discriminate the target from the background in the gray target-gray background pairing; therefore, we did not test it and inserted a black square. The ‘*’ symbol indicates incomplete data across subjects due to difficulty performing the task under those conditions. Overall, the colors are brighter in the middle region; a localized brighter region indicates better depth judgment performance, or higher sensitivity, with mid-SF textures

3.4 Modelling the effect of target and background texture based on spatial frequency processing in human vision

All modeling is based on transforming distance sensitivity to log units, in line with other studies. To fit these data, we assumed that performance for a given background condition (a column of Fig. 6) would resemble results from prior experiments measuring disparity sensitivity. Specifically, with a gray background (Fig. 5A), the paradigm is comparable to that of Reynaud et al. (2015), who measured sensitivity to depth corrugations and modeled their data with a truncated log parabola. This function captures a band-pass shape that falls off at high spatial frequencies and plateaus at low spatial frequencies, like the spatial contrast sensitivity function (CSF) (Robson 1966; Watson and Ahumada 2005). The high-spatial-frequency limit of the disparity sensitivity function (DSF) has been reported in the depth corrugation literature (Bradshaw and Rogers 1999; Ding and Levi 2011; Peterzell et al. 2017; Serrano-Pedraza and Read 2010; Tyler 1973). This limit, between 3 and 5 cpd, is proposed to be a byproduct of the cross-correlation used in computing disparity information (Filippini and Banks 2009).

This model, adapted from Reynaud et al. (2015), has four parameters: peak spatial frequency fmax, peak disparity sensitivity γmax, bandwidth β, and γlow, which flattens sensitivity at low spatial frequencies. The truncated log parabola fits our results from the gray background condition well, as shown in Fig. 7F. However, with a textured background the function can no longer adequately describe the data: with low or middle SF backgrounds, performance at high target SFs plateaus, suggesting a dependence on the background. For example, lowering the SF of the background texture increases the saliency of the target, as shown in Fig. 5B. To capture this effect, we introduce a double-truncated log parabola, which adds a fifth parameter, γhigh, to flatten sensitivity at high target SFs. The importance of γhigh is discussed further in the Discussion section.

$$S^{\prime}\left( f \right) = \log_{10} \left( \gamma_{max} \right) - k\left( \frac{\log_{10} \left( f \right) - \log_{10} \left( f_{max} \right)}{\beta^{\prime}/2} \right)^{2}, \quad \text{where}\;k = \log_{10} \left( 2 \right)\;\text{and}\;\beta^{\prime} = \log_{10} \left( 2\beta \right)$$
$$S\left( f \right) = \log_{10} \left( \gamma_{low} \right)\quad\text{if}\;f < f_{max}\;\text{and}\;S^{\prime}\left( f \right) < \log_{10} \left( \gamma_{low} \right)$$
$$S\left( f \right) = \log_{10} \left( \gamma_{high} \right)\quad\text{if}\;f > f_{max}\;\text{and}\;S^{\prime}\left( f \right) < \log_{10} \left( \gamma_{high} \right)$$
$$S\left( f \right) = S^{\prime}\left( f \right)\quad\text{otherwise}$$
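A compact implementation of this five-parameter function, together with an illustrative fit to the gray background means from Condition A (the initial guesses and bounds are our assumptions; the study fit each subject and background condition separately):

```python
import numpy as np
from scipy.optimize import curve_fit

def dtlp(f, f_max, g_max, beta, g_low, g_high):
    """Double-truncated log parabola: log10 sensitivity vs. target SF."""
    k = np.log10(2.0)
    beta_p = np.log10(2.0 * beta)
    s = np.log10(g_max) - k * ((np.log10(f) - np.log10(f_max)) / (beta_p / 2)) ** 2
    s = np.where((f < f_max) & (s < np.log10(g_low)), np.log10(g_low), s)    # low-SF plateau
    s = np.where((f > f_max) & (s < np.log10(g_high)), np.log10(g_high), s)  # high-SF plateau
    return s

# Mean sensitivities for the gray background (Condition A, Fig. 5A).
sf = np.array([0.33, 0.66, 1.0, 3.0, 4.5])
log_sens = np.log10([3.13, 3.66, 4.24, 6.12, 3.27])

p0 = [2.0, 5.0, 2.0, 2.0, 1.0]            # fmax, gmax, beta, glow, ghigh
params, _ = curve_fit(dtlp, sf, log_sens, p0=p0, bounds=(1e-6, np.inf))
```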
Fig. 7
figure 7

Double-truncated log parabola fit for mean result. Each column shows a different background texture (from 0.33 to 4.5 cpd and a gray background). The horizontal red dashed line indicates conditions where the subjects were not able to do the task. The change in the peak of the fit (x-coordinate, y-coordinate) is illustrated in Fig. 8

The optimal parameters for each subject and each background condition were determined independently using nonlinear least-squares fitting. Figure 7 shows the fit to the mean sensitivity across subjects, with each column representing a different background; the red horizontal lines mark texture combinations at which subjects were unable to do the task. As in Reynaud et al. (2015), the γlow and β parameters showed no meaningful trend with spatial frequency and were excluded from further analysis. The remaining three parameters (fmax, γmax, γhigh) showed several characteristic effects of the background on the DSF, as shown in Fig. 8. While Fig. 7 plots one fit per background condition (one for each column in Fig. 6A), Fig. 8 summarizes the trend of each parameter across conditions. The fmax parameter decreases from approximately 2 to 1.5 cpd as background spatial frequency increases, producing a slight leftward shift of the peak of each function in Fig. 7 from the leftmost to the rightmost column. γmax reflects the relative height of each function in Fig. 7; specifically, most subjects had the highest sensitivity with a 1 cpd background. Lastly, γhigh decreases as background SF increases, with a slight peak at 0.66 cpd; its effect can be seen in the shrinking of the rightmost flat portion of each curve in Fig. 7 from the leftmost to the rightmost column.

Fig. 8
figure 8

Summary of the best fits for the three relevant parameters (fmax, γmax, γhigh). fmax is the spatial frequency of peak sensitivity for each plot in Fig. 7, γmax is the peak of each plot in Fig. 7, and γhigh is the plateau sensitivity at high SFs. A The fmax parameter changes with background SF, peaking at 0.66 cpd before declining at high SF backgrounds. B The γmax parameter has a bandpass shape and peaks at the 1.00 cpd background. C The γhigh parameter decreases as background SF increases, with a slight peak at 0.66 cpd. The individual fits for each subject are represented by different colored points. The solid lines indicate the average parameter values across subjects at each background SF, with the shaded regions indicating ± one standard deviation. The horizontal red dashed line indicates where the subjects were not able to do the task

4 Discussion

The results show that texture of different spatial frequencies affects depth perception in a virtual environment, whether on the surface of the target or in its surroundings. First, against a gray background, distance judgments were better for targets presented with mid-spatial-frequency texture, and performance was significantly worse with a non-textured target. Across different combinations of target and background textures, the SF characteristics of the target modulated the thresholds more strongly than those of the background. We fitted the results with a double-truncated log parabola based on spatial frequency characteristics of human perception. The additional parameter, γhigh, was introduced to account for the contribution of the background texture, which can aid discrimination when information from the target texture is limited.

Are simple models based on known SF properties sufficient to describe our results? Several simple models of the interaction between target and background SF properties are described below. The heatmap in Fig. 6B is replotted on a log-log scale with interpolation in Fig. 9G.

1. Model 1 (Fig. 9A): A background with lower SF gives an illusion of being far away. Naturally, far-away objects are perceived as blurred and less detailed, so a high SF target texture against a low SF background texture may produce a blur-mediated bias, as reported by another study (Brown and Weisstein 1988). The hypothetical result is a heatmap with a bright gradient region on the left side (Fig. 9A), where performance is better with lower SF background texture; the brighter region represents better depth judgment, or lower distance thresholds. Refer to the Supplementary Material for the computation.

2. Model 2 (Fig. 9B): Blur-mediated bias is present only when the background has a lower SF than the target. The hypothetical result is a heatmap in which performance is better in the conditions above the diagonal line (background SF lower than target SF), as in Fig. 9B.

3. Model 3 (Fig. 9C): The camouflage effect (matching background and target textures) causes poorer performance. Studies have shown that segregating a figure from its background is easier when the textures differ in luminance (Julesz 1962), color (Saarela and Landy 2012), orientation (Peterzell et al. 2017), or size. When the target and background are similar, visual search errors increase and reaction times lengthen (Neider and Zelinsky 2006; Nothdurft 1992). Figure-background textures with a similar pattern may appear as a continuous surface (Caputo 1996) or as camouflage (Song and Geng 2010), which blends the target with the background and makes it harder to segment the figure. Since the texture patterns comprise band-limited noise, a dissimilarity index can be defined as the arithmetic difference between the target and background spatial frequencies. The hypothetical result is a heatmap with the worst performance along the identity diagonal and better performance when the two textures are at opposite ends of the spectrum (low SF target texture against high SF background texture and vice versa), as shown by the brighter corners in Fig. 9C.

4. Model 4 (Fig. 9D-F): Depth judgment is tuned to the spatial frequencies of the target and background textures. The SF tuning to the target can be modeled as a Gaussian with peak sensitivity at the target's middle SFs, shown by the bright horizontal band in Fig. 9D; the SF tuning to the background is illustrated by the bright vertical band in Fig. 9E. If subjects were tuned to both SFs (target and background) during the task, the resultant peak sensitivity would be a localized bright spot, as shown in Fig. 9F. A sketch generating these hypothetical predictions follows this list.
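The four predictions can be generated in a few lines of NumPy (a sketch on a log2 SF axis; the Gaussian width and the ~2 cpd peak of Model 4 are illustrative assumptions, and each map is only defined up to a monotonic scaling):

```python
import numpy as np

sfs = np.array([0.33, 0.66, 1.0, 3.0, 4.5])        # texture SFs (cpd)
bg, tg = np.meshgrid(np.log2(sfs), np.log2(sfs))   # columns: background; rows: target

model1 = -bg                              # Fig. 9A: better with lower SF background
model2 = np.where(bg < tg, 1.0, 0.0)      # Fig. 9B: blur bias only when bg SF < target SF
model3 = np.abs(tg - bg)                  # Fig. 9C: worst when SFs match (dissimilarity index)

def gaussian(x, mu, sigma=1.0):
    return np.exp(-0.5 * ((x - mu) / sigma) ** 2)

peak = np.log2(2.0)                       # assumed mid-SF peak (~2 cpd)
model4 = gaussian(tg, peak) * gaussian(bg, peak)   # Fig. 9F: separable target x background tuning
```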

Fig. 9
figure 9

Comparison between a few simple models based on SF properties and our results. A Model 1: A background with lower SF gives a blur-mediated cue, resulting in better performance (brighter region) with a low SF background. B Model 2: Blur-mediated bias is present only when the background has a lower SF than the target. C Model 3: The camouflage effect (matching background and target textures) causes poorer performance. Performance, or sensitivity, worsens when the target and background have similar SF textures, along the diagonal dashed line, and is highest when the target and background have opposite SFs, in the two corners. D Depth judgment tuned to the spatial frequency of the target texture only, resulting in a bright horizontal band. E Depth judgment tuned to the spatial frequency of the background texture only, resulting in a bright vertical band. F Model 4: Depth judgment tuned to the spatial frequencies of both the target and background textures. G Our results from Fig. 6B replotted as an interpolated surface plot on a log-log scale. Our results bear partial resemblance to the tuning model (F)

Based on the behavioral results (Fig. 9G), the contributions of target and background texture are not as straightforward as any of these simple models. When there is no background texture, the SF property of the target is the sole cue and shows a tuning characteristic, as in Condition A and Fig. 9D. However, as shown in Fig. 8A, the spatial frequency of the peak, fmax, shifts with the background; the tuning is therefore not as simple as the naïve tuning model (Fig. 9D). At high SFs, the target texture becomes a less effective cue, and the background becomes informative, as shown by the trend in γhigh (Fig. 8C). Indeed, Fig. 8C shares the same trend as Fig. 5B, reaffirming that γhigh captures the contribution of the background texture.

Textured patterns trigger vergence movements that facilitate depth perception (Kidd et al. 1979). We speculate that a target without spatial modulation (no texture) cannot elicit responses from disparity-selective neurons, so the visual system must rely on information from the local contrast gradient at the edge of the background texture. This could involve a different mechanism of depth perception, possibly independent of SF channels. It has been suggested that depth perception depends linearly on the available cues, with the weight of each cue adjusted according to its reliability (Young et al. 1993). In this scheme, when the target texture cue is unavailable, the background texture and edge cues would be upweighted. In addition, when the target and background have similar texture patterns, the abrupt change in luminance contrast at the edge may nullify the camouflage effect (Troscianko et al. 2009), especially for low SF textures.

While we have shown that textures improve depth judgment in VR, an extremely high SF texture may not be beneficial unless paired with a lower SF background. The finest usable texture is also limited by the resolution, or pixel density, of the display, and current VR headsets are not dense enough to render very fine detail at high SFs. Mid-SF textures may therefore be the best choice for a good stereo experience.

5 Conclusion

Manipulating the detail of textures in virtual reality affects depth judgments. Disparity information derived from texture, especially at middle spatial frequencies, optimally allows users to gauge depth and distance in a virtual environment. Since stereo mechanisms integrate spatial frequency information across the visual scene, the background texture may improve or worsen the ability to discriminate depth. Beyond its aesthetic value, texture can thus be used by VR content creators and graphic designers, in different target-background combinations, to support more accurate depth judgments.