Estimating lighting direction in scenes with multiple objects

Peterson, Lindsay M.; Kersten, Daniel J.; Mannion, Damien J.

doi:10.3758/s13414-023-02718-0


Open access
Published: 01 August 2023

Volume 86, pages 186–212, (2024)

Attention, Perception, & Psychophysics


Lindsay M. Peterson ORCID: orcid.org/0000-0003-1486-9834¹,
Daniel J. Kersten² &
Damien J. Mannion¹


Abstract

To recover the reflectance and shape of an object in a scene, the human visual system must account for the properties of the light illuminating the object. Here, we examine the extent to which multiple objects within a scene are utilised to estimate the direction of lighting in a scene. In Experiment 1, we presented participants with rendered scenes that contained 1, 9, or 25 unfamiliar blob-like objects and measured their capacity to discriminate whether a directional light source was left or right of the participants’ vantage point. Trends reported for ensemble perception suggest that the number of utilised objects—and, consequently, discrimination sensitivity—would increase with set size. However, we find little indication that increasing the number of objects in a scene increased discrimination sensitivity. In Experiment 2, an equivalent noise analysis was used to measure participants’ internal noise and the number of objects used to judge the average light source direction in a scene, finding that participants relied on 1 or 2 objects to make their judgement regardless of whether 9 or 25 objects were present. In Experiment 3, participants completed a shape identification task that required an implicit judgement of light source direction, rather than an explicit judgement as in Experiments 1 and 2. We find that sensitivity for identifying surface shape was comparable for scenes containing 1, 9, and 25 objects. Our results suggest that the visual system relied on a small number of objects to estimate the direction of lighting in our rendered scenes.


Introduction

Recovering the intrinsic properties of an object in a scene, such as surface reflectance and shape, requires accounting for the prevailing lighting conditions. Although our perception of illumination has received insufficient psychophysical examination (Gilchrist, 2006; Schirillo, 2013), there is evidence that the visual system infers the lighting conditions in a scene. Observers can estimate the properties of a light source based on an object’s appearance (Kartashova, Sekulovski, de Ridder, te Pas, & Pont, 2016; Kartashova, de Ridder, te Pas, & Pont, 2018; Koenderink, Pont, van Doorn, Kappers, & Todd, 2007), and these estimates are reflected in observers’ perception of surface reflectance (Boyaci, Maloney, & Hersh, 2003; Boyaci, Doerschner, & Maloney, 2004). Furthermore, the lighting conditions at different spatial locations in a scene can be accounted for when judging surface reflectance (Gilchrist, 1977, 1980; Mizokami, Ikeda, & Shinoda, 1998), with observers relying on information from multiple lighting cues (e.g., specular and non-specular objects) within a scene to make their judgements (Boyaci, Doerschner, & Maloney, 2006; Snyder, Doerschner, & Maloney, 2005).

This apparent reliance on multiple objects indicates that some form of spatial integration may occur when observers estimate the lighting conditions in a scene, similar to an ensemble or summary statistic (Haberman and Whitney, 2012; Pont, 2019; Sanders, Haberman, & Whitney, 2008; Whitney and Yamanashi Leib, 2018). Potential evidence for this suggestion comes from a conference abstract by Sanders et al. (2008), described further by Haberman and Whitney (2012), who investigated whether the visual system has an ensemble representation of cast shadows (which is analogous to light source direction). Observers were presented with images of rendered geometric objects that were illuminated from a particular direction and were asked to judge the average orientation of the cast shadows for a set of objects. The accuracy with which observers completed the task implied that the shadow orientations estimated for individual objects within an image were integrated to estimate the mean shadow orientation. However, similar levels of accuracy were reported for control images in which the cast shadows were perceived as surface paint. As such, observers may have made their judgements without an explicit representation of light source direction; the average “shadow” orientation could have been computed without actually integrating multiple local estimates of shadow orientation. Therefore, the results from Sanders, Haberman, & Whitney (2008) are inconclusive with regard to whether the visual system uses multiple objects within a scene to estimate lighting direction.

Here, we focus on examining the extent to which estimates of the direction of lighting in a scene are informed by multiple objects within that scene. If estimates of lighting direction are informed by multiple objects, we would expect to see a positive relationship between the number of objects in a scene and the number of objects that are relied upon to estimate lighting direction. This expectation is based on a trend reported for ensemble perception in which the number of integrated samples (i.e., the number of stimulus elements used to make a judgement about the stimulus) tends to increase as the number of stimulus elements increases (Whitney and Yamanashi Leib, 2018). If the visual system does combine multiple local samples to estimate the lighting conditions in a scene, an observer should, for example, use more objects to judge light source direction in Fig. 1E than in Figs. 1A and 1C. Because each local estimate is subject to some noise, relying on multiple local estimates should yield greater precision than relying on a single local estimate. As a consequence, precision at estimating the direction of a light source should improve as more objects are added to a scene. Such an effect of set size has been found for ensemble perception; for example, Robitaille and Harris (2011) reported that larger set sizes were associated with improved accuracy when observers judged the mean size and the mean orientation of a set of circles and tilted bars, respectively.
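The precision argument above can be illustrated with a small simulation. This is a sketch under stated assumptions, not the authors' model: each object is assumed to provide an independent local estimate of light azimuth corrupted by Gaussian noise with a hypothetical standard deviation of 10°, and the observer is assumed to average all local estimates with equal weight. Under these assumptions, the spread of the pooled estimate falls roughly as $1/\sqrt{N}$.

```python
import numpy as np


def pooled_estimate_sd(n_objects, local_noise_sd=10.0, n_sims=100_000, seed=0):
    """Simulated SD (deg) of the mean of n_objects noisy local azimuth estimates.

    Assumptions (illustrative only): independent Gaussian noise per object,
    equal weighting of all objects, and a hypothetical true azimuth of 20 deg.
    """
    rng = np.random.default_rng(seed)
    true_azimuth = 20.0  # hypothetical light source azimuth (deg)
    local_estimates = true_azimuth + rng.normal(
        0.0, local_noise_sd, size=(n_sims, n_objects)
    )
    pooled = local_estimates.mean(axis=1)  # equal-weight average across objects
    return float(pooled.std())


for n in (1, 9, 25):
    # Simulated spread next to the analytic prediction local_noise_sd / sqrt(N)
    print(n, round(pooled_estimate_sd(n), 2), round(10.0 / np.sqrt(n), 2))
```

If internal noise per object were not constant, or if only a subset of objects were pooled, this idealised improvement would not hold; the Discussion of Experiment 1 returns to that possibility.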

In three experiments, we measured participants’ ability to estimate the direction of lighting in scenes that contained a varying number of objects. In Experiment 1, participants were presented with scenes containing 1, 9, or 25 objects and indicated whether they perceived the scene as illuminated from the left or right, relative to their viewpoint. To further understand our results in Experiment 1, we used an equivalent noise paradigm in Experiment 2 to measure the number of objects used to estimate the average light source direction in a scene. To probe any potential differences between implicit and explicit judgements of lighting direction, participants completed a shape identification task in Experiment 3 that required an implicit judgement of lighting direction (rather than explicit, as in Experiments 1 and 2).

Experiment 1

The aim of Experiment 1 was to examine the extent to which light source direction discrimination depends on the number of visible objects within a scene. Participants were presented with images of scenes that contained 1, 9, or 25 objects and asked to indicate whether the scenes were illuminated from the left or right, relative to their viewpoint. We also manipulated whether the scenes were rendered with or without cast shadows to examine the contribution of cast shadow information to any changes in discrimination sensitivity associated with variations in set size. If the visual system does use multiple samples to estimate the direction of lighting in a scene, performance on the light source direction discrimination task should benefit from an increasing number of objects in a scene.

Methods

Participants

Thirty-two participants (18 female and 14 male; median age of 19 years) with self-reported normal or corrected-to-normal vision completed the experiment. The majority of participants (27/32) were 21 years old or younger. Participants were recruited from a database of undergraduate students enrolled in a first-year psychology course at UNSW Sydney. Participants gave informed written consent prior to beginning the experiment, and experimental procedures were approved by the Human Research Ethics Advisory Panel at the School of Psychology, UNSW Sydney. All participants were naïve to the purpose of the experiment.

Apparatus

The experiment was run in three similar testing booths. In each booth, the experiment was presented on a Display++ LCD monitor (Cambridge Research Systems, Kent, UK). Each monitor had a spatial resolution of $1920 \times 1080$ pixels, a refresh rate of 120 Hz, a mean luminance of 60 cd/m$^2$, a linear relationship between graphics card signal and luminance, and a 10-bit-per-pixel luminance output resolution. Participants viewed the monitor in an otherwise darkened booth from a distance of approximately 60 cm, giving a total visual angular subtense of approximately $66^\circ \times 37^\circ$. The stimuli for the experiment were created with POV-Ray (Version 3.7; https://www.povray.org/). The experiment was implemented using PsychoPy (Peirce, 2007, 2008), and analyses were performed using NumPy (Harris et al., 2020), SciPy (Virtanen et al., 2020), and PyMC (Salvatier, Wiecki, & Fonnesbeck, 2016).

Stimuli

The stimuli were rendered images of scenes containing 1, 9, or 25 objects (see Fig. 1 for examples). The geometry of the objects was created by applying a “bumpy” texture (f_bumps from POV-Ray’s library) to the surface of a sphere. This created blob-like objects with random surface curvature. The size of the objects was selected so that the objects did not occlude one another. The objects were situated on a flat checkerboard surface (with checks of 15% and 25% reflectance), illuminated by a single directional light source.

We rendered 100 instances of each combination of set size, cast shadows, and light source azimuth. The scene was captured with an orthographic camera that had an elevation of $23.2^\circ$ and was pointed towards the centre of the scene. The scale and amplitude of the bump surface texture were randomised for each object in a scene. All of the objects had a diffuse (Lambertian) reflectance that was randomly drawn from a Beta distribution ($\alpha = 1.5$, $\beta = 6.5$), limited to the range of 5% to 80% reflectance, to mimic the distribution of reflectances found in natural scenes (Attewell and Baddeley, 2007). The elevation of the light source was fixed at $40^\circ$ and its azimuth varied across renderings. For the no-cast-shadows condition, the light source was artificially prevented from casting shadows.
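As an illustration of the reflectance sampling described above, the sketch below draws per-object reflectances from a Beta(1.5, 6.5) distribution limited to 5% to 80%. The text does not say how the limits were enforced; rejection sampling (discarding and redrawing out-of-range values) is an assumption here, and the function name `sample_reflectances` is hypothetical.

```python
import numpy as np


def sample_reflectances(n, alpha=1.5, beta=6.5, lo=0.05, hi=0.80, rng=None):
    """Draw n reflectances from Beta(alpha, beta) restricted to [lo, hi].

    Out-of-range draws are discarded and redrawn (rejection sampling); this
    truncation method is an assumption, as the text does not specify it.
    """
    rng = np.random.default_rng() if rng is None else rng
    accepted = np.empty(0)
    while accepted.size < n:
        draws = rng.beta(alpha, beta, size=n)
        in_range = draws[(draws >= lo) & (draws <= hi)]
        accepted = np.concatenate([accepted, in_range])
    return accepted[:n]


rng = np.random.default_rng(1)
scene_reflectances = sample_reflectances(25, rng=rng)  # one value per object
```

Clipping out-of-range values to the limits would be an alternative, but it would pile probability mass at exactly 5% and 80%, which is why this sketch uses rejection instead.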

We chose to present scenes that contained 1, 9, and 25 objects as these set sizes allowed us to adjust the number of objects on the checkerboard surface while maintaining a regular spatial arrangement. The locations of the objects in the scene were specified by a $5 \times 5$ grid. An object was placed in the centre of the grid for all set size conditions. In the 9-object condition, additional objects filled the inner ring of grid positions surrounding the central object; in the 25-object condition, additional objects filled both the inner and the outer rings. A small amount of jitter was added to the position of each object on the checkerboard surface, so that objects did not appear in exactly the same locations throughout the experiment.
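The grid layout can be sketched as follows: the central cell is always filled, the 8 cells of the inner ring are added for 9 objects, and the 16 cells of the outer ring are added for 25 objects. The grid spacing, the jitter magnitude, and the use of Gaussian jitter are hypothetical choices for illustration; the text does not report them.

```python
import numpy as np

# Largest ring (Chebyshev distance from the central cell) filled for each set size
RING_FOR_SET_SIZE = {1: 0, 9: 1, 25: 2}


def grid_cells(set_size):
    """(row, col) offsets from the centre of a 5 x 5 grid that hold an object."""
    max_ring = RING_FOR_SET_SIZE[set_size]
    return [
        (row, col)
        for row in range(-2, 3)
        for col in range(-2, 3)
        if max(abs(row), abs(col)) <= max_ring
    ]


def object_positions(set_size, spacing=1.0, jitter_sd=0.05, rng=None):
    """Jittered (x, y) positions on the ground plane, in arbitrary units.

    spacing and jitter_sd are hypothetical values; Gaussian jitter is assumed.
    """
    rng = np.random.default_rng() if rng is None else rng
    cells = np.array(grid_cells(set_size), dtype=float) * spacing
    return cells + rng.normal(0.0, jitter_sd, size=cells.shape)
```

Because the set sizes are nested, every object position in the 9-object layout also appears (up to jitter) in the 25-object layout.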

Design and procedure

The experiment had a within-subjects design with factors of cast shadows (scenes rendered with or without cast shadows) and set size (1, 9, or 25 objects). Participants completed the experiment in a single 45-minute session. Before beginning the experiment, participants were introduced to the task via a set of instructions that included a written explanation of the experiment and a short practice task. The practice task consisted of twelve trials (scenes with light source azimuths of $-35.25^\circ$ and $+35.25^\circ$ for each of the six conditions) on which we expected participants to respond correctly unless they had misunderstood the task. Participants were required to respond correctly on all of these trials before beginning the experiment.

The experiment consisted of 10 runs of 66 trials each, with a short rest break between runs. On each trial, the stimulus was presented for 600 ms at full visibility, with 100 ms ramps in and out from the black background, followed by the response prompt: “Was the scene lit from the left or right? Press the ‘left’ arrow key for left or the ‘right’ arrow key for right”. Participants received feedback on the correctness of their response in the form of a tick or cross that appeared briefly on the screen before the next trial began.

The experiment had 660 trials in total. Each condition had 100 trials, randomly interleaved throughout the experiment. For each condition, there were an additional 10 ‘catch’ trials in which the illumination angle was randomly chosen to be either $-70.25^\circ$ or $+70.25^\circ$. The angle of illumination on each trial was selected using a Psi-marginal adaptive staircase procedure (Kontsevich and Tyler, 1999; Prins, 2013), with separate staircases for each condition. Participants’ responses were modelled using a cumulative Normal function, which describes the probability of a participant judging a scene as lit from the right for a given illumination angle. Possible stimulus levels ranged from $-70.25^\circ$ to $+70.25^\circ$ in steps of $0.5^\circ$, with $0^\circ$ corresponding to the observer’s vantage point. The Psi procedure targeted the spread (inverse slope) of the psychometric function and marginalised over the function’s midpoint to optimise the estimation of the spread, which was central to assessing the effect of our manipulations on discrimination sensitivity.
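The stimulus grid, the trial bookkeeping, and the cumulative Normal psychometric function described above can be written down compactly. This is a minimal sketch: the Psi-marginal staircase itself is not implemented, the function name `p_right` is hypothetical, and the sketch omits any lapse-rate term.

```python
import numpy as np
from scipy.stats import norm

# Stimulus levels: -70.25 to +70.25 deg in 0.5 deg steps (282 levels);
# 0 deg is the observer's vantage point.
STIMULUS_LEVELS = -70.25 + 0.5 * np.arange(282)

# Trial bookkeeping from the text: 6 conditions x (100 staircase + 10 catch) trials
N_CONDITIONS, N_STAIRCASE, N_CATCH = 6, 100, 10
TOTAL_TRIALS = N_CONDITIONS * (N_STAIRCASE + N_CATCH)  # equals 10 runs x 66 trials


def p_right(azimuth_deg, midpoint, spread):
    """Cumulative Normal psychometric function: P(respond 'right' | azimuth).

    midpoint shifts the function; spread (inverse slope) sets its steepness,
    so a larger spread means lower discrimination sensitivity.
    No lapse-rate term is included in this sketch.
    """
    return norm.cdf((np.asarray(azimuth_deg) - midpoint) / spread)
```

One consequence of this grid is that $0^\circ$ itself is never presented: the levels closest to the vantage point are $\pm 0.25^\circ$, whereas the practice azimuths ($\pm 35.25^\circ$) and catch azimuths ($\pm 70.25^\circ$) fall exactly on it.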

Data analysis

Participants were excluded from the analysis based on their catch-trial performance: four participants whose accuracy on the catch trials was below $90\%$ were excluded. The analysis described below was carried out with the remaining 28 participants.
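With six conditions and 10 catch trials per condition, each participant completed 60 catch trials, so the $90\%$ criterion corresponds to at least 54 correct catch responses. The sketch below applies that rule; the function name and the example data are hypothetical.

```python
import numpy as np


def split_by_catch_accuracy(catch_outcomes, criterion=0.90):
    """Split participants into kept / excluded lists by catch-trial accuracy.

    catch_outcomes maps a participant ID to a boolean array (True = correct).
    Participants with accuracy below `criterion` are excluded.
    """
    kept, excluded = [], []
    for pid, outcomes in catch_outcomes.items():
        accuracy = float(np.mean(outcomes))
        if accuracy >= criterion:
            kept.append(pid)
        else:
            excluded.append(pid)
    return kept, excluded


# Hypothetical data: 6 conditions x 10 catch trials = 60 catch trials each
example = {
    "p01": np.array([True] * 54 + [False] * 6),  # 90% correct: kept
    "p02": np.array([True] * 53 + [False] * 7),  # about 88% correct: excluded
}
kept, excluded = split_by_catch_accuracy(example)
```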

We used Bayesian statistical modelling to analyse the experimental data, with a focus on parameter estimation rather than hypothesis testing or model comparison (see Calin-Jageman and Cumming, 2019, for an overview of this approach). Our general approach to building the statistical models was motivated by Lee (2018), Wagenmakers et al. (2018), and Betancourt (2020). Participant performance in each experimental condition was modelled using a psychometric function, the parameters of which were estimated with a Bayesian generalised linear mixed model. The key parameter of interest was the spread of the psychometric function. The model for the psychometric function spread included fixed effects for the intercept, the main effect of cast shadows, the linear and quadratic main effects of set size rank, and the interactions between cast shadows and the set size rank effects. The quadratic main effect of set size rank was included to allow for a non-monotonic effect of set size rank on the spread. We also included participant random effects so that the fixed effects could vary across participants. The model for the midpoint of the psychometric function followed the same structure. A complete description of the statistical model can be found in Appendix A. Consistent with the Bayesian approach, we incorporated prior probabilities into the models that reflected what we expected to be reasonable magnitudes for the effects of our experimental manipulations (see Appendix A for an explanation of the specific priors placed on the model parameters). Figure 2 shows a summary of the observed data and the fitted psychometric function for a single experimental condition from a representative participant.
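To make the structure of the spread model concrete, the sketch below computes a condition's spread from a vector of coefficients. It does not fit anything, and its details are assumptions rather than the model in Appendix A: a log link (suggested by the multiplicative effects reported in the Results), set size rank coded as $-1, 0, +1$ with a quadratic contrast of $1, -2, 1$, cast shadows effect-coded as $\pm 0.5$, and participant random effects represented as additive deviations from the fixed effects. All names are hypothetical.

```python
import numpy as np

# Assumed codings (hypothetical; the actual model is described in Appendix A):
SET_SIZE_LINEAR = {1: -1.0, 9: 0.0, 25: 1.0}     # linear set size rank
SET_SIZE_QUADRATIC = {1: 1.0, 9: -2.0, 25: 1.0}  # quadratic set size rank
SHADOW_CODE = {True: -0.5, False: 0.5}           # cast shadows present / absent


def design_row(shadows_present, set_size):
    """Fixed-effect predictors: intercept, shadows, linear, quadratic, interactions."""
    s = SHADOW_CODE[shadows_present]
    lin, quad = SET_SIZE_LINEAR[set_size], SET_SIZE_QUADRATIC[set_size]
    return np.array([1.0, s, lin, quad, s * lin, s * quad])


def spread(fixed_effects, shadows_present, set_size, participant_effects=None):
    """Spread on the natural scale, assuming a log link on the linear predictor.

    participant_effects (optional) are per-participant deviations added to the
    fixed effects, standing in for the random effects described in the text.
    """
    coefs = np.asarray(fixed_effects, dtype=float)
    if participant_effects is not None:
        coefs = coefs + np.asarray(participant_effects, dtype=float)
    return float(np.exp(design_row(shadows_present, set_size) @ coefs))
```

Under these codings, the exponentiated linear coefficient is the multiplicative change in spread per set size step, and the exponentiated shadow coefficient is the ratio of spreads between shadow-absent and shadow-present scenes, which is how the effects are reported below.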

Results

In this experiment, participants were shown an image of a scene and indicated whether they perceived the scene as lit from the left or right. We used a Bayesian generalised linear mixed model to estimate the parameters of each observer’s psychometric function for each experimental condition. We compared the aggregated observed data with posterior retrodictive samples from the fitted model and found that the model reproduced the patterns in the observed data with no major discrepancies (see Fig. 15 in Appendix A).

Our primary parameter of interest is the spread (inverse slope) of the psychometric function: lower values correspond to steeper slopes and greater sensitivity, and higher values correspond to shallower slopes and lower sensitivity. Figure 3 summarises the estimated posterior distributions for the spread parameter for each of the six experimental conditions (cast shadows present or absent for scenes with 1, 9, or 25 visible objects). If the presence of multiple objects increased the sensitivity with which an observer could discriminate the direction from which the scene was illuminated, we would expect the spread parameter to decrease with increasing numbers of visible objects. However, Fig. 3 shows that, if anything, the estimated spread increased with increasing numbers of visible objects. The parameter capturing the linear component of the trend indicated that the spread increased by a factor of about 1.07 (posterior median; 95% credible interval: $[1.00, 1.14]$) with each increment in set size condition (i.e., from 1 to 9 and from 9 to 25 visible objects) when cast shadows were present, and by a greater factor of about 1.17 (posterior median; 95% credible interval: $[1.09, 1.25]$) when cast shadows were absent. The quadratic trend components were less influential, although Fig. 3 suggests that the increase in spread with set size condition may have saturated when cast shadows were absent.
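To put the per-increment factors in perspective, they can be compounded over the two steps from 1 to 25 visible objects. This back-of-the-envelope calculation is not reported in the analysis: it uses only the posterior medians and ignores the quadratic component. Because squaring is monotonic for positive factors, the squared posterior median is the posterior median of the squared factor.

```python
# Reported posterior medians for the per-increment change in spread
per_step_factor = {"shadows present": 1.07, "shadows absent": 1.17}
n_steps = 2  # 1 -> 9 -> 25 visible objects

# Compounded change from 1 to 25 objects, ignoring the quadratic component
overall_factor = {k: v ** n_steps for k, v in per_step_factor.items()}
print(overall_factor)  # roughly 1.14 (shadows present) and 1.37 (shadows absent)
```

On this reading, the spread for 25-object scenes was roughly 14% larger than for 1-object scenes when cast shadows were present, and roughly 37% larger when they were absent.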

Removing cast shadows from the rendered scenes reduced sensitivity for illumination direction discrimination. The average spread across the set size conditions increased by a factor of about 3.36 (posterior median; 95% credible interval: $[2.86, 3.94]$) when cast shadows were absent compared to when they were present.

Discussion

In this experiment, we aimed to measure light source direction discrimination for scenes that contained 1, 9, and 25 objects. We found a small decrease in sensitivity as more objects were added to a scene, and this decrease across set sizes was greater for scenes rendered without cast shadows. We also found that discrimination sensitivity was lower for scenes rendered without cast shadows, which is perhaps unsurprising given that previous research has identified cast shadows as an important cue for estimating light source properties (Boyaci et al., 2006; te Pas, Pont, Dalmaijer, & Hooge, 2017).

The potential decrease in discrimination sensitivity with set size could be due to a dependency of internal noise on set size—an effect that has been reported previously for ensemble coding. Dakin (2001) examined how varying the orientation of a group of textures affected judgements of the mean orientation of the textures, finding that increasing the number of texture elements in the stimulus led to an increase in internal noise. Dakin, Mareschal, and Bex (2005a) also reported an increase in internal noise with more stimulus elements when participants were asked to estimate the average motion direction of a group of dots.

It is possible that a similar relationship between set size and internal noise existed in the current experiment, confounding our interpretation of discrimination sensitivity as an indicator of whether participants relied on multiple objects to judge the direction of the light source. We had assumed that internal noise remains constant across set sizes and, therefore, that precision should increase with more objects in a scene (as depicted by the diamond markers in Fig. 4). However, increases in internal noise with set size could outweigh any benefit of integrating multiple local estimates (as depicted by the circle and square markers in Fig. 4). In the current experiment, participants may have been using multiple objects to judge light source direction, with this use masked by increases in internal noise. That is, the decrease in discrimination sensitivity with increasing set size suggests a possible trade-off in which any benefit of using multiple local estimates is outweighed by decreases in precision. The subsequent experiment is therefore based on an equivalent noise paradigm, in which internal noise and the number of integrated objects can be estimated by presenting stimuli with varying levels of external noise (Barlow, 1956; Dakin, 2001; Dakin et al., 2005a; Pelli, 1990).
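The trade-off described above can be illustrated numerically using the same form as Eq. 1 in Experiment 2, with no external noise. In this sketch, every object is used in both scenarios; the single-object internal noise (10°) and the power-law growth of internal noise with set size (exponent 0.75) are hypothetical values chosen only to show that integration and rising internal noise can pull in opposite directions. They are not estimates from the data, and the scenarios are not a reproduction of Fig. 4.

```python
import numpy as np


def pooled_spread(sigma_int, n_used, sigma_ext=0.0):
    """Spread when n_used local estimates, each with internal noise sigma_int
    and external noise sigma_ext, are averaged (same form as Eq. 1)."""
    return np.sqrt((np.square(sigma_int) + sigma_ext ** 2) / n_used)


set_sizes = np.array([1.0, 9.0, 25.0])
base_noise = 10.0  # hypothetical internal noise for a single object (deg)

# Scenario A: internal noise constant, every object used -> spread falls
constant_noise = pooled_spread(base_noise, set_sizes)

# Scenario B: every object still used, but internal noise grows with set size
# (hypothetical power law, exponent 0.75) -> spread rises despite integration
growing_noise = pooled_spread(base_noise * set_sizes ** 0.75, set_sizes)
```

In Scenario B, the spread grows as the 0.25 power of set size even though all objects are integrated, so discrimination sensitivity alone cannot reveal how many objects were used.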

Experiment 2

In this experiment, we use an equivalent noise paradigm to measure the number of objects used to judge the average light source direction in a scene as well as participants’ internal noise. Within this paradigm, the spread of each participant’s psychometric function is defined as:

$$\sigma = \sqrt{\dfrac{\sigma_{\text{int}}^2 + \sigma_{\text{ext}}^2}{N}} \tag{1}$$

In Eq. 1, $\sigma_{\text{int}}$ is the internal noise, $\sigma_{\text{ext}}$ is the external noise, and $N$ is the number of integrated objects. In an equivalent noise task, presenting stimuli with varying levels of external noise allows internal noise and the number of integrated samples to be estimated (Barlow, 1956; Dakin, 2001; Dakin et al., 2005a; Pelli, 1990). Performance at low levels of external noise is determined by both internal noise and integration, whereas performance becomes increasingly determined by the number of integrated samples as external noise increases (Dakin, 2001), as depicted in Fig. 5.
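The logic of Eq. 1 can be checked by simulating an observer with known parameters and recovering them from spreads measured at several external noise levels. This is a sketch under stated assumptions: the external noise levels, the simulated observer ($\sigma_{\text{int}} = 12^\circ$, $N = 2$), and the 3% measurement noise are hypothetical, and the least-squares fit with SciPy's `curve_fit` is a swapped-in illustration rather than necessarily how these parameters were estimated in this study.

```python
import numpy as np
from scipy.optimize import curve_fit


def equivalent_noise(sigma_ext, sigma_int, n_integrated):
    """Eq. 1: predicted spread (deg) for a given external noise level."""
    return np.sqrt((sigma_int ** 2 + np.square(sigma_ext)) / n_integrated)


# Hypothetical external noise levels and a simulated observer (sigma_int = 12, N = 2)
sigma_ext = np.array([0.0, 8.0, 16.0, 32.0, 64.0])
rng = np.random.default_rng(3)
observed_spread = equivalent_noise(sigma_ext, 12.0, 2.0) * np.exp(
    rng.normal(0.0, 0.03, size=sigma_ext.size)  # 3% multiplicative measurement noise
)

# Least-squares fit; both parameters are constrained to be positive
(fit_int, fit_n), _ = curve_fit(
    equivalent_noise,
    sigma_ext,
    observed_spread,
    p0=[5.0, 1.0],
    bounds=([0.0, 0.1], [np.inf, np.inf]),
)
print(f"internal noise ~ {fit_int:.1f} deg, objects integrated ~ {fit_n:.2f}")
```

The high external noise levels constrain $N$ most strongly, because there the spread approaches $\sigma_{\text{ext}}/\sqrt{N}$; the low levels then constrain $\sigma_{\text{int}}$. This is why a range of external noise levels is needed to separate the two parameters.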

Participants were presented with images of scenes that contained 9 or 25 objects and asked to judge whether the average light source direction in a scene was leftwards or rightwards. We chose these set sizes to allow for comparisons with the results from Experiment 1; the 1-object condition from Experiment 1 was excluded from the current experiment because of constraints on the number of trials and because it is relatively uninformative within the equivalent noise paradigm. To add external noise to the stimuli, the light source azimuth for each object in a scene was drawn from a wrapped Normal distribution (Dakin et al., 2005a) with a particular standard deviation. A given object in a scene could be illuminated by a light source with an azimuth anywhere in the $-180^\circ$ to $+180^\circ$ range (examples of an object illuminated from different azimuths are shown in Fig. 6B). For example, in Fig. 6A, the light source azimuth for each object in the scene was drawn from a wrapped Normal distribution with a mean of $+50^\circ$ and a standard deviation of $64^\circ$.
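Sampling per-object azimuths from a wrapped Normal distribution amounts to drawing from an ordinary Normal and wrapping the result onto the circle. The sketch below does this for the Fig. 6A parameters (mean $+50^\circ$, standard deviation $64^\circ$) and also computes the circular mean, one natural definition of the scene's "average" light source direction. The function names and the choice of 25 objects are hypothetical, and reading the stated standard deviation as that of the underlying unwrapped Normal is an assumption.

```python
import numpy as np


def wrap_deg(angle_deg):
    """Wrap angles (deg) into the interval [-180, 180)."""
    return (np.asarray(angle_deg, dtype=float) + 180.0) % 360.0 - 180.0


def sample_object_azimuths(n_objects, mean_deg, sd_deg, rng=None):
    """Per-object light azimuths from a wrapped Normal distribution.

    Drawing from a Normal and wrapping the result onto the circle yields a
    wrapped Normal; sd_deg is the SD of the underlying (unwrapped) Normal.
    """
    rng = np.random.default_rng() if rng is None else rng
    return wrap_deg(rng.normal(mean_deg, sd_deg, size=n_objects))


def circular_mean_deg(angles_deg):
    """Circular mean direction (deg) of a set of azimuths."""
    radians = np.deg2rad(angles_deg)
    mean_sin, mean_cos = np.sin(radians).mean(), np.cos(radians).mean()
    return float(np.rad2deg(np.arctan2(mean_sin, mean_cos)))


# Fig. 6A parameters (mean +50 deg, SD 64 deg); 25 objects chosen for illustration
azimuths = sample_object_azimuths(25, mean_deg=50.0, sd_deg=64.0,
                                  rng=np.random.default_rng(0))
```

The circular mean is used here rather than the arithmetic mean because, with a standard deviation as large as $64^\circ$, some azimuths wrap past $\pm 180^\circ$, and an arithmetic mean of wrapped angles would be biased.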