3D computational modeling and perceptual analysis of kinetic depth effects

Humans have the ability to perceive kinetic depth effects, i.e., to perceive 3D shapes from 2D projections of rotating 3D objects. This process is based on a variety of visual cues such as lighting and shading effects. However, when such cues are weak or missing, perception can become faulty, as demonstrated by the famous silhouette illusion example of the spinning dancer. Inspired by this, we establish objective and subjective evaluation models of rotated 3D objects by taking their projected 2D images as input. We investigate five different cues: ambient luminance, shading, rotation speed, perspective, and color difference between the objects and background. In the objective evaluation model, we first apply 3D reconstruction algorithms to obtain an objective reconstruction quality metric, and then use quadratic stepwise regression analysis to determine weights of depth cues to represent the reconstruction quality. In the subjective evaluation model, we use a comprehensive user study to reveal how reaction time and accuracy correlate with cues such as rotation speed and perspective. The two evaluation models are generally consistent, and potentially of benefit to inter-disciplinary research into visual perception and 3D reconstruction.


Introduction
Mechanisms of human perception of the 3D world have long been studied. In the early 17th century, artists developed a whole system of stimuli of monocular depth perception based especially on shading and transparency [1]. Loss of stimuli related to depth perception leads to a variety of visual illusions, such as the Pulfrich effect [2]. Here, with a dark filter on the right eye, dots moving to the right seem to be closer to participants than dots moving to the left, even though all the dots are actually at the same distance. This is caused by slower human perception of darker objects.
When a 3D object is rotating around a fixed axis, humans are capable of perceiving the shape of the object from its 2D projections. This is called the kinetic depth effect [4]. However, when there is no light above the object, humans can only perceive partial 3D information from the varying silhouette of the kinetic object over time, which easily leads to ambiguous understanding of the 3D object. One typical example of this phenomenon is the spinning dancer [3,5] (see Fig. 1 for sample frames). The dancer is observed to be spinning clockwise by some viewers and counterclockwise by others. Such ambiguity implies that more cues are needed for humans to make accurate depth judgements for 3D objects. Visual cues such as occlusion [6], frame timing [7], and speed and axis of rotation [8] have been widely studied by researchers. In addition, perspective effects also affect the accuracy of direction judgements [9].
Fig. 1 The spinning dancer [3], courtesy of Nobuyuki Kayahara. Due to a lack of visual cues, humans are confused as to whether the dancer is rotating clockwise or counterclockwise.
In this paper, we make in-depth investigations into how visual cues influence the perception of kinetic depth effects, using both objective computational modeling and subjective perceptual analysis. We formulate and quantify visual cues from both 3D objects and their surrounding environment. On one hand, we make a comprehensive subjective evaluation to correlate subjective depth judgement for a 3D object with its visual conditions. On the other hand, as depth perception largely depends on the quality of mental shape reconstruction, we also propose an objective evaluation method based on 3D computational modeling [10]. This allows us to quantify the impacts of the involved visual cues. The impact factors are determined by solving a multivariate quadratic regression problem. Finally, we analyze the interrelations between the proposed subjective and objective evaluation models, and consider the consistency of impacts of visual cues on these models.
In summary, our work makes the following major contributions:
• a novel objective evaluation of kinetic depth effects based on multi-view stereo reconstruction,
• a novel subjective evaluation of kinetic depth effects based on a carefully designed user study, and
• a detailed analysis of how visual cues affect depth perception based on these subjective and objective evaluations.

Related work
Our work focuses on objective computational modeling and subjective analysis of 3D perception of kinetic depth effects under different visual conditions. We first discuss related work on visual perception using psychological and computational approaches, and then briefly describe the relevant reconstruction techniques employed in this work.

Psychological research on shape perception
For monocular vision, shading effects contain rich information [1]. Compared with diffuse shading, specular shading helps more to reduce underestimates of cylinder depth by subjects [11]. However, the shading effect can be ambiguous in some cases. For example, when the illumination direction is unknown, it is hard to disambiguate shape convexities and concavities; humans tend to assume that illumination comes from above [12]. In addition, when the level of overall illumination is low, the effect of shadows is generally assumed to be determined by the overall illumination [13]. Motion information also benefits shape perception. The inherent ambiguity of depth order in projected images of 3D objects can be resolved by dynamic occlusion [14]. Perspective also gives rich information about 3D objects during this process [15]. The human visual system can infer 3D shapes from 2D projections of rotated objects [4], interpolating the intervening smooth motion from pairs of images of rotated objects [16].
Color information is very important not only in immersive scene representation [17-20] but also in psychological depth perception. Isono and Yasuda [21] found that chromatic channels can contribute to depth perception using a prototype flicker-free field-sequential stereoscopic television system. Guibal and Dresp [22] realized that color effects are largely influenced by luminance contrast and stimulus geometry. When shape stimuli are not strong, color can give an illusion of closeness [23].

Computational visual perception
Computational visual perception has been extensively studied in the computer graphics community. Here we briefly describe the most relevant work on perception-based 2D image processing and 3D modeling.
In terms of 2D images, Chu et al. [24] presented a computational framework to synthesize camouflage images that can hide one or more temporally unnoticed figures in the primary image. Tong et al. [25] proposed a hidden image framework that can embed secondary objects within a primary image as a form of artistic expression. The edges of the object to be hidden are first detected, and then an image blending based optimization is applied to perform image transformation as well as object embedding. Studies of kinetic depth effects often use subjective responses [26], and some researchers also use the judgement of the rotation direction as the response [9].
As with image-based content embedding and hiding, 3D objects can be embedded into 2D images [27]; the objects can be easily detected by humans, but not by an automatic method. Researchers have also generated various mosaic effects on both images [28] and 3D surfaces [29]. A computational model for the psychological phenomenon of change blindness is investigated in Ref. [30]. As change blindness is caused by failing to store visual information in short-term memory, the authors model the influence of long-range contextual complexity, allowing them to synthesize images with a given degree of blindness. Illusory motion is also studied as self-animating images in Ref. [31]. In order to computationally model the human motion perception of a static image, repeated asymmetric patterns are optimally generated on streamlines of a specified vector field. Tong et al. [32] created self-moving 3D objects using the hollow-face illusion from input character animation, where the surface's gradient is manipulated to fit the motion illusion. There is also research into rendering, designing, and navigating impossible 3D models [33-35]. In contrast to investigating those seemingly impossible models, our work focuses on evaluating the 3D perception of rotated objects.

Multi-view stereo reconstruction
Multi-view 3D reconstruction and 3D point cloud registration are fundamental in computer graphics and computer vision. Comprehensive surveys on these topics can be found in Refs. [36,37]. The well-known structure-from-motion [10] can effectively recover camera poses and further generate a sparse 3D point cloud by making use of multiple images of a scene or objects. Moreover, multi-view stereo algorithms [38] can reconstruct a fully textured surface of the scene. We employ such computational techniques to evaluate 3D reconstruction quality under various environmental conditions.

Approach
Our goal is to evaluate the influence of various visual conditions on kinetic depth effects, including the ambient luminance, shading, perspective, rotation speed, and color difference between the object and background.
For both the human visual system and image-based 3D reconstruction techniques, the input visual information usually takes the form of projected 2D images. Thus, by using a set of projected 2D images of the 3D objects under the chosen conditions, we investigate the shape perceived by human participants and the shape produced by multi-view stereo 3D reconstruction. As well as measuring the perception of kinetic depth effects using our objective and subjective evaluation models, we also investigate the correlations between these two different methods. Our approach is outlined in Fig. 2.
Fig. 2 Our approach. We project the input 3D objects onto 2D image planes under specified conditions (lighting, projection mode, rotation speed, etc.), which are fed into constructed objective and subjective evaluation models. Analysis reveals interesting correlations between depth perception of rotated 3D objects and the visual conditions.

Dataset
Each 3D object is rotated around a fixed vertical axis passing through the geometric center of the object, and we sample projected 2D images at an angular interval of θ. As the frame rate when displaying projected images is fixed, changing the sampling angle interval causes a changing rotation speed of the object. We can obtain such datasets under different visual conditions as the images are explicitly rendered. Specifically, we manipulate the ambient luminance by adjusting ambient lights, and control shading by changing diffuse lights. We control perspective by selecting either orthogonal or perspective projection mode, which affects perception. We also control the color difference between the object and the background; predefined color pairs are used to generate the colors of the background and the 3D object (see Section 4.1). Table 1 summarizes the parameters used. Using such a dataset generated under controlled conditions, we can then assess the 3D perception of the rotated object using the following two evaluation models.
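The relation between the sampling interval and the rotation speed can be illustrated with a minimal sketch. It assumes that θ is expressed in radians per frame (the units are not stated explicitly) and uses the 24 FPS display rate reported for the user study; the function names are illustrative only.

```python
import math

def angular_speed(theta, fps=24.0):
    """Angular speed in radians per second for a per-frame step theta
    (theta is assumed to be in radians)."""
    return theta * fps

def frames_per_revolution(theta):
    """Number of projected images needed for one full turn of the object."""
    return 2.0 * math.pi / theta

def seconds_per_revolution(theta, fps=24.0):
    """Time taken by one full turn at a fixed display frame rate."""
    return frames_per_revolution(theta) / fps

if __name__ == "__main__":
    # Sampling intervals used in the objective evaluation (Section 4.1).
    for theta in (0.209, 0.157, 0.126, 0.105):
        print(theta,
              round(frames_per_revolution(theta), 1),
              round(seconds_per_revolution(theta), 2))
```

Under these assumptions, the four sampling intervals used in Section 4.1 correspond to roughly 30, 40, 50, and 60 images per revolution, so a larger θ means a faster apparent rotation.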

Objective evaluation model
This model utilizes the reconstruction quality of the input 3D object as the basis for evaluation. First, using the projected 2D images of the 3D object under specified visual conditions, we reconstruct a point cloud using multi-view stereo reconstruction algorithms. Then, we develop a method to measure the quality of reconstruction of the original 3D object by the point cloud (see Section 4.2). Finally, we analyze the effects of different visual conditions in detail (see Section 4.3).

Subjective evaluation model
Directly measuring 3D reconstruction in the brains of human subjects is difficult. Our study is instead based on the observation that if humans successfully mentally reconstruct a rotated object, it is easy for them to tell the direction of rotation. We therefore use the time and accuracy of direction judgements as proxies to measure the quality of depth perception. We first display rotating objects with the same set of projected images as used for 3D reconstruction in the objective evaluation, and ask participants to judge the rotation direction of the object. Then we consider extreme situations in which image sequences could not be reconstructed well, including overexposure, low lighting levels, and overly fast rotation. We analyze the results in terms of the accuracy and the reaction time of direction judgements (see Section 5.5).

Objective evaluation
Our objective evaluation process includes four steps: generating 2D images of 3D objects under various conditions, reconstructing 3D shapes of objects based on the generated images, quantifying the reconstruction quality of the 3D objects, and fitting weighting factors for depth cues by multivariate quadratic regression optimization.

Parameter selection and image set generation
In order to generate images of 3D objects under various controlled conditions, we need to select the parameter values that determine the depth cues. First, we normalize the size of each 3D object to have a unit bounding box centered at the origin. Then, we import the object into a virtual scene, display it under orthogonal projection, and set a fixed point light. The line between the light and the geometric center of the object is perpendicular to the rotation axis, and the distance between the light and the geometric center of the object is ten units. Since the focal length is a given parameter in openMVG, considering perspective projection in our objective evaluation is of limited value, so in this study we restrict ourselves to orthogonal projection. We control the brightness of diffuse and ambient lights as follows. We set the HSL value of the diffuse light to (0, 0, α), with seven set values for α, corresponding to different luminance levels. We set the HSL value of the ambient light to (0, 0, β), with six set values for β (see Table 2). We sample projected 2D images as the 3D object is rotated. We use four different sampling intervals: θ = 0.209, 0.157, 0.126, and 0.105. For simplicity, we use five fixed pairs of RGB values for the 3D object and the background (see Table 3).
We calculate the difference between the colors in each chosen pair as a weighted difference between the color of the object and the color of the background, where w_r, w_g, w_b are weighting factors, empirically set to (3, 4, 2). We generated image sets for 15 different 3D objects under various conditions, and selected three objects with a high reconstruction success rate (30%). Finally, for each test object, we generated a separate image set for each of 7 (shading) × 6 (ambient luminance) × 4 (rotation speed) × 5 (color difference) conditions, at an image resolution of 800 × 600 pixels. Some examples are shown in Fig. 3.
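As an illustration of a weighted color difference of this kind, the sketch below uses a weighted Euclidean distance in RGB space. The exact functional form is an assumption (only the weights (3, 4, 2) are given), and the function name and the sample colors are hypothetical.

```python
import math

# Weights (w_r, w_g, w_b) = (3, 4, 2), as stated in the text.
WEIGHTS = (3.0, 4.0, 2.0)

def color_difference(obj_rgb, bg_rgb, weights=WEIGHTS):
    """Weighted Euclidean distance between two RGB colors.

    Assumption: the difference is the square root of the weighted sum of
    squared per-channel differences; the original equation may differ.
    """
    return math.sqrt(sum(w * (o - b) ** 2
                         for w, o, b in zip(weights, obj_rgb, bg_rgb)))

if __name__ == "__main__":
    # Hypothetical object/background pair, not taken from Table 3.
    print(color_difference((200, 200, 200), (30, 30, 30)))
```

With this form, a difference in the green channel contributes more than the same difference in the red or blue channel, which roughly follows the sensitivity of human vision.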

3D reconstruction and quality assessment
We employ openMVG [39] and openMVS [38] to process image sequences, and take the reconstructed point clouds as input. We normalize the size of each point cloud to the same bounding box as used for normalizing 3D objects. Then we match each reconstructed point cloud to the original object. Specifically, we use the sample consensus initial alignment (SAC-IA) method [40] to provide an initial alignment, and the iterative closest point (ICP) method [41] to refine the alignment. Finally, we compute the Euclidean fitness score μ between the reconstructed point cloud and the original object.
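The normalization and scoring steps can be sketched as follows. This is a minimal pure-Python illustration, not the pipeline used in the experiments: it assumes that the two point sets are already aligned (the SAC-IA and ICP steps are omitted), that "unit bounding box" means the longest side has length 1, and that the fitness score is the mean squared distance from each reconstructed point to its nearest point on the original object. The function names are hypothetical.

```python
def normalize_to_unit_box(points):
    """Translate and uniformly scale 3D points so that their bounding box is
    centered at the origin and its longest side has length 1 (assumed meaning
    of "unit bounding box")."""
    mins = [min(p[i] for p in points) for i in range(3)]
    maxs = [max(p[i] for p in points) for i in range(3)]
    center = [(lo + hi) / 2.0 for lo, hi in zip(mins, maxs)]
    scale = max(hi - lo for lo, hi in zip(mins, maxs)) or 1.0
    return [tuple((p[i] - center[i]) / scale for i in range(3)) for p in points]

def fitness_score(source, target):
    """Mean squared distance from each source point to its nearest target
    point. Brute force for clarity; a k-d tree would be used in practice."""
    total = 0.0
    for s in source:
        total += min(sum((s[i] - t[i]) ** 2 for i in range(3)) for t in target)
    return total / len(source)

if __name__ == "__main__":
    original = normalize_to_unit_box([(0, 0, 0), (2, 0, 0), (0, 2, 0), (0, 0, 2)])
    # A reconstruction that is offset slightly along x.
    reconstructed = [(x + 0.1, y, z) for x, y, z in original]
    print(fitness_score(reconstructed, original))
```

A lower score indicates that the reconstructed point cloud lies closer to the original object.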

Objective evaluation results
Figure caption: Given a 3D object (top-left) and specified visual conditions, we generate corresponding projected 2D images, and reconstruct 3D shapes (other views) using multi-view stereo algorithms. We quantitatively measure the reconstruction quality for shape perception analysis. Each view above gives the reconstruction quality and the corresponding rendering parameters.
We generated 2520 image sets for 3D reconstruction, 929 of which provided point clouds, while reconstruction failed for the other 1591. We use a measure s of reconstruction quality based on the point cloud distance μ between a reconstructed point cloud and the corresponding original point cloud. Logarithmic processing is used to make the residuals of our model normally distributed. Reconstruction quality values are linearly normalized to the range [0, 1]. Given a set of reconstruction quality samples S = {s_1, ..., s_n}, we formulate a factor analysis model using quadratic stepwise regression, where λ = {λ_1, λ_2, λ_3, λ_4} are weighting coefficients to balance the impact of the various control parameters, and b is a constant. We determined the coefficients in the model using the standard least squares method. Results are given in Table 4.
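The scoring and fitting steps can be illustrated with a small sketch under stated assumptions. The sign of the logarithmic transform (so that smaller distances give higher quality) and the particular quadratic terms in `design_row` are assumptions made for illustration; the terms actually selected by the stepwise procedure may differ, and the stepwise selection itself is not shown. Only the final least squares fit is sketched, using the normal equations.

```python
import math

def quality_scores(mu_values):
    """Map fitness scores mu to quality values in [0, 1].
    Assumption: quality is -log(mu), then min-max normalized."""
    raw = [-math.log(mu) for mu in mu_values]
    lo, hi = min(raw), max(raw)
    if hi == lo:
        return [0.0 for _ in raw]
    return [(r - lo) / (hi - lo) for r in raw]

def design_row(alpha, beta, theta):
    """Hypothetical regressors: alpha^2, theta, alpha, alpha*beta, constant."""
    return [alpha * alpha, theta, alpha, alpha * beta, 1.0]

def solve_linear(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            f = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= f * m[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        tail = sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (m[i][n] - tail) / m[i][i]
    return x

def fit_least_squares(rows, targets):
    """Ordinary least squares via the normal equations (X^T X) w = X^T y."""
    k = len(rows[0])
    xtx = [[sum(r[i] * r[j] for r in rows) for j in range(k)] for i in range(k)]
    xty = [sum(r[i] * y for r, y in zip(rows, targets)) for i in range(k)]
    return solve_linear(xtx, xty)
```

The returned list holds the four weighting coefficients followed by the constant, in the order defined by `design_row`.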
The model accounts for 10.3% of the variation in reconstruction quality. Since the reconstruction algorithm used here is not always stable, the explanatory power of the model is limited. The impact of each individual visual cue is now analyzed in turn.

Ambient luminance
The (ambient luminance × shading) interaction is significant with λ 4 = −0.0278 (p<0.01). High (ambient luminance × shading) levels contribute to poor quality reconstruction. As shown in Fig. 6, the fitted lines at different ambient light levels are not parallel.

Rotation speed
High rotation speed significantly contributes to poor reconstruction quality (λ 2 = −0.6361, p < 0.01). When θ = 0.105, the mean value of S is 0.489; when θ = 0.209, the mean value of S is 0.400.

Color difference
Color difference does not significantly affect the reconstruction quality.

Subjective evaluation
As noted, humans can recover rotated 3D objects from their 2D projections. The rotation direction of 3D objects can be an important clue to judging the quality of their mental shape reconstruction. Based on this, the following multi-factor experiment was designed.

Participants
We recruited 35 participants; 34 participants (19 males and 15 females) successfully finished the test.

Procedure and materials
In the experiment, a set of images was continuously displayed in full screen mode. The experiment was conducted on a laptop with an Intel i5 8250U CPU and 8 GB memory. We designed two types of study as follows.

Study A
Here we explored the depth cue effect in general situations. The range of cues was the same as for the objective evaluation model, but we chose fewer values for each cue (see Table 5) to ensure participants could concentrate during the study. The projection could be either orthogonal or perspective. Overall we considered 144 conditions consisting of 3 (shading) × 3 (ambient luminance) × 2 (rotation speed) × 4 (color difference) × 2 (projection mode). For each set of conditions, we displayed three different objects.

Study B
Here we considered more extreme situations, including low lighting levels, overexposure, and high speed rotation, where we varied each condition while keeping other cues fixed (see Table 6). The variables used to represent each situation are shown in Table 7.
The test image sets generated are illustrated in Figs. 7 and 8. To simplify the problem, we only considered orthogonal projection situations for Study B. Every participant was asked to judge the rotation direction for each image set generated in Studies A and B; each image set was judged only once. The display order of the image sets was random, as was the rotation direction of each 3D object. To exclude viewing-from-above bias [1], we defined rotation direction as Left and Right. From the participant's point of view, the rotation direction is right if the closer part of a 3D object is moving to the right, and vice versa. The images were displayed at 24 FPS. The maximum display time for each image set did not exceed 5 s. The participants were given time to practise before the formal experiment. The entire experiment took about 15-20 min.
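The trial structure of Study A can be sketched as below. This is only an illustration: the level indices stand in for the cue values of Table 5, the object names are placeholders, and the use of the same three objects in every condition is an assumption.

```python
import itertools
import random

# Hypothetical level indices; the actual cue values are those of Table 5.
LEVELS = {
    "shading": range(3),
    "ambient": range(3),
    "speed": range(2),
    "color": range(4),
    "projection": ("orthogonal", "perspective"),
}
OBJECTS = ("object_1", "object_2", "object_3")  # placeholder names

def build_trials(seed=0):
    """Enumerate every (condition, object) pair once, assign a random
    rotation direction to each, and shuffle the display order."""
    rng = random.Random(seed)
    trials = []
    for combo in itertools.product(*LEVELS.values()):
        for obj in OBJECTS:
            trial = dict(zip(LEVELS.keys(), combo))
            trial["object"] = obj
            trial["direction"] = rng.choice(("left", "right"))
            trials.append(trial)
    rng.shuffle(trials)
    return trials

if __name__ == "__main__":
    trials = build_trials()
    print(len(trials))  # 144 conditions x 3 objects
```

Fixing the seed makes a participant's trial order reproducible, while different seeds give different random orders.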

Subjective evaluation results
We recorded the judgements and reaction times of all participants. We ranked all reaction times in ascending order and calculated the standard scores (denoted τ), which correspond to the estimated cumulative proportion of the reaction time. We used repeated measures analysis of variance (ANOVA) to determine the effect of cues on τ under different conditions. We also calculated each participant's accuracy of judgement under each condition. Since three objects were tested in each condition, the participant's judgement accuracy can take only four values, and does not follow a normal distribution. Therefore, we used ordinal logistic regression models to test the effect of cues on the participant's judgement accuracy.
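A rank-based standard score of this kind can be sketched as follows. The proportion estimate (rank − 0.5)/n and the use of average ranks for ties are assumptions, since the exact estimator is not specified; the function names are illustrative.

```python
from statistics import NormalDist

def rank_standard_scores(times):
    """Convert reaction times to standard scores via their ranks.

    Each time is ranked in ascending order (ties share the average rank),
    the rank is turned into a cumulative proportion (rank - 0.5) / n, and the
    proportion is mapped through the inverse standard normal CDF.
    """
    n = len(times)
    order = sorted(range(n), key=lambda i: times[i])
    ranks = [0.0] * n
    i = 0
    while i < n:
        j = i
        while j + 1 < n and times[order[j + 1]] == times[order[i]]:
            j += 1
        mean_rank = (i + j) / 2.0 + 1.0  # 1-based average rank of the tie group
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    normal = NormalDist()
    return [normal.inv_cdf((r - 0.5) / n) for r in ranks]

def judgement_accuracy(correct_flags):
    """Fraction of correct direction judgements within one condition."""
    return sum(correct_flags) / len(correct_flags)

if __name__ == "__main__":
    print(rank_standard_scores([1.2, 0.8, 2.5]))
    print(judgement_accuracy([True, True, False]))
```

With three objects per condition, `judgement_accuracy` can only return 0, 1/3, 2/3, or 1, which is why an ordinal model is a natural choice for accuracy.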
In particular, we chose the complementary log-log link function, since the participant's judgement accuracy mostly lay in 0.67-1.00 [42]. We establish ordinal logistic regression models for all situations, but only show those models with significant results and skip the remainder.

Analysis of Study A
A five-way ANOVA revealed a main effect of rotation speed on τ (F(1, 5028) = 38.11, p < 0.01). Participants react faster under high rotation speed conditions (M = −0.08, SD = 0.97) than at low rotation speed (M = 0.05, SD = 1.00). We also found a significant (perspective × rotation speed) interaction on τ (F(1, 5028) = 6.19, p < 0.05). At high rotation speed, τ is significantly lower with perspective projection (M = −0.109, SD = 0.95) than with orthogonal projection (M = −0.06, SD = 1.00). Because rotation speed and perspective only have two levels, there is no need for Mauchly's test of sphericity. Other than the above phenomena, we found no significant effect on the participant's judgement accuracy for the five cues considered in our experiments.

Analysis of Study B
Low lighting levels. A one-way ANOVA reveals a main effect of shading on τ (F(9, 340) = 2.668, p < 0.01), and Mauchly's test of sphericity is not significant (p = 0.579). Here higher α leads to lower reaction time. We established an ordinal logistic regression model in which p = {p_1, p_2, p_3, p_4} are the probabilities of each value of the participant's judgement accuracy (from low to high), η is the weighting coefficient, and ε = {ε_1, ε_2, ε_3} are constant values. For each case, the value with the highest probability is the predicted value of the participant's judgement accuracy. We fitted the coefficients in the model, with results shown in Table 8. High α significantly contributes to high judgement accuracy (η = 4.229, p < 0.01). Hence, using strong shading in low lighting conditions improves accuracy and accelerates reactions (see Fig. 9).
Overexposure. A one-way ANOVA reveals a main effect of ambient luminance on τ (F(9, 340) = 2.661, p < 0.01), and Mauchly's test of sphericity is not significant (p = 0.350). This means that the covariance matrix assumption is met, and the result of repeated measures ANOVA is robust. Participants react faster when β = 2.7 (M = −0.17, SD = 1.07) than when β = 4.5 (M = 0.24, SD = 0.94), which implies that higher ambient luminance in overexposed conditions delays reactions (see Fig. 9). We found no significant effect of ambient luminance on judgement accuracy.
High speed rotation. Rotation speed has a significant effect on τ (F(9, 340) = 7.627, p < 0.01), and Mauchly's test of sphericity is not significant (p = 0.162). In high rotation speed conditions, lower rotation speed leads to faster reactions. We again established an ordinal logistic regression model, in which p = {p_1, p_2, p_3, p_4} are the probabilities of each value of the participant's judgement accuracy (from low to high), κ is a weighting coefficient, and ε = {ε_1, ε_2, ε_3} are constant values. For each case, the value with the highest probability is the predicted value of judgement accuracy. We fitted the coefficients in the model, with results shown in Table 9. High θ significantly contributes to low judgement accuracy (κ = −1.353, p < 0.01). In high speed rotation conditions, increasing the rotation speed reduces judgement accuracy and delays reactions (see Fig. 9).
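How an ordinal model with a complementary log-log link turns a cue value into a predicted accuracy can be sketched as below. The parameterization, log(−log(1 − P(Y ≤ j))) = ε_j − coefficient × x, is a common convention and is an assumption here, as are the threshold values in the example (the fitted constants are those of Tables 8 and 9, which are not reproduced). Only the coefficient 4.229 is taken from the text.

```python
import math

ACCURACY_LEVELS = (0.0, 1 / 3, 2 / 3, 1.0)  # possible accuracies with 3 objects

def category_probabilities(x, coef, thresholds):
    """Category probabilities under a cumulative complementary log-log model.

    Assumed form: log(-log(1 - P(Y <= j))) = thresholds[j] - coef * x,
    with thresholds given in increasing order.
    """
    cumulative = [1.0 - math.exp(-math.exp(t - coef * x)) for t in thresholds]
    bounds = [0.0] + cumulative + [1.0]
    return [bounds[j + 1] - bounds[j] for j in range(len(bounds) - 1)]

def predict_accuracy(x, coef, thresholds, levels=ACCURACY_LEVELS):
    """Return the accuracy level with the highest predicted probability."""
    probs = category_probabilities(x, coef, thresholds)
    best = max(range(len(probs)), key=probs.__getitem__)
    return levels[best]

if __name__ == "__main__":
    hypothetical_thresholds = (-2.0, -1.0, 0.5)  # placeholders, not fitted values
    for alpha in (0.0, 0.5, 1.0):
        print(alpha, predict_accuracy(alpha, 4.229, hypothetical_thresholds))
```

Under this convention, a positive coefficient shifts probability towards the highest accuracy level as the cue value grows, and a negative coefficient shifts it towards the lowest.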

Joint objective and subjective analysis
Based on objective computational modeling and subjective perceptual evaluations, we next performed a joint analysis on the 3D perception of rotated objects.

Shading
For both objective and subjective evaluations, shading has a significant effect on depth perception. In the objective evaluation, shading and reconstruction quality are related by a quadratic function: as shading increases, the reconstruction quality first improves and then declines. This agrees with the subjective evaluation, in which greater shading under low lighting levels improves judgement accuracy and accelerates the observer's reactions.

Ambient luminance
The depth cue from ambient luminance is also effective in both objective and subjective evaluations. In objective evaluation, the interaction of shading and ambient luminance is significant. High (shading × ambient luminance) levels contribute to poor reconstruction. In the subjective evaluation, high ambient luminance in overexposed cases can increase observers' reaction time.

Rotation speed
Rotation speed plays an important role in both objective and subjective evaluations. In the objective evaluation, increasing rotation speed decreases reconstruction quality, which agrees with the result of the subjective evaluation that, in high speed conditions, higher rotation speed decreases judgement accuracy. However, in the subjective evaluation under general conditions, increasing the rotation speed shortens participants' reaction time. A possible reason is that, with higher rotation speeds, participants receive more information within the same time interval, stimulating them to make a decision faster. In our experiments under general situations, this acceleration is stronger than the delay caused by uncertainty.

Perspective
In the subjective evaluation, the (perspective × rotation speed) interaction is significant. At high rotation speed, participants react faster under perspective projection than under orthogonal projection.

Color difference
We found no significant effects caused by color difference between objects and background in either objective evaluation or subjective evaluation models. As future work, we will test more color combinations to further explore possible effects of color differences.

Discussion
We have analyzed the effects of different depth cues on 3D perception of rotated 3D objects, broadening the scope of previous studies. We also designed an objective evaluation and a subjective evaluation to make a thorough analysis.
However, our design also has some shortcomings. In our objective evaluation, when the depth cues in images were extremely weak, 3D reconstruction based on structure-from-motion became unstable because of unexpected feature matches. This common challenge limits the explanatory power of our analysis model (R² = 10.3%). Moreover, the subjective evaluation only uses judgement of the rotation direction of objects as the response. In future, we could use more 3D information. In our experiments, reconstruction quality is closely related to the kind of 3D object. This specific type of influence on shape perception could also be further analysed.
The analysis of the effects of depth cues indicates how good reconstruction results can be achieved for both humans and computers, for example by rendering under certain lighting conditions. The objective evaluation also reveals the limitations of existing algorithms.
Our approach could benefit from more accurate depth prediction and 3D reconstruction in various challenging environments, which could potentially be provided by recent deep learning-based techniques, such as CNN-SLAM [43] and deep stereo matching [44].

Conclusions and future work
We have proposed two approaches to measuring the quality of depth perception of kinetic depth effects, with a detailed analysis of how visual cues affect depth perception. Firstly, we generated a dataset of images of rotating objects considering five depth cues: ambient luminance, shading, rotation speed, perspective, and color difference between objects and background. In the objective evaluation, we applied 3D reconstruction and measured reconstruction quality via distances between reconstructed and original objects. In the subjective evaluation, we invited participants to judge the rotation direction of 3D objects by showing them projected 2D images, and inferred perception quality from their reaction time and accuracy. In our experiments, we found that both strong and dim lighting significantly undermined the perception of depth. High (ambient luminance × shading) levels, high rotation speed, and orthogonal projection can also reduce depth perception quality. Yet it is also interesting that color difference does not have a significant effect on depth perception in our experiments. In future, we will take more depth cues into consideration and develop a more precise quantitative model for more complex situations. Using our new observations to guide other 3D computational modeling would also be an interesting avenue of future work. We hope our study will inspire more inter-disciplinary research into robust 3D reconstruction and human visual perception.
Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made.
The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder.
Other papers from this open access journal are available free of charge from http://www.springer.com/journal/41095. To submit a manuscript, please go to https://www.editorialmanager.com/cvmj.