DeepLight: light source estimation for augmented reality using deep learning
This paper presents a novel method for illumination estimation from RGB-D images. The main focus of the proposed method is to enhance visual coherence in augmented reality applications by providing accurate and temporally coherent estimates of real illumination. For this purpose, we designed and trained a deep neural network which calculates a dominant light direction from a single RGB-D image. Additionally, we propose a novel method for real-time outlier detection to achieve temporally coherent estimates. Our method for light source estimation in augmented reality was evaluated on a set of real scenes. Our results demonstrate that the neural network can successfully estimate light sources even in scenes which were not seen by the network during training. Moreover, we compared our results with illumination estimates calculated by a state-of-the-art method for illumination estimation. Finally, we demonstrate the applicability of our method on numerous augmented reality scenes.
Keywords: Light source estimation, Augmented reality, Photometric registration, Deep learning
Visual coherence plays an important role in augmented reality (AR) applications. One of the key factors for achieving visual coherence between virtual and real objects is consistent illumination. In order to render virtual objects with consistent lighting, we need to have information about real-world light sources. Therefore, the estimation of real-world illumination is of high importance for AR.
Typically, real-world illumination can be estimated using a passive or active light probe positioned in a scene [19, 22]. Ideally, light sources would be estimated without the light probe to avoid the necessity of undesirable objects in the scene [9, 11]. However, the estimation of real illumination from one image of the scene is a challenging problem, especially if the light sources are not directly visible in the image. Past research showed that if no priors are used in light source estimation from a single image, it is an ill-conditioned problem [6, 25].
Our method is based on the assumption that prior information about lighting can be learned from a large dataset of images with known light sources. We show that this learned information can be encoded in a neural network. Such a trained network can then be used at runtime to estimate light sources in an AR scene which was not seen during training. We demonstrate that the neural network can achieve sufficient generality to estimate light in various scenes. This generality can be achieved by increasing the complexity of the network. In order to maintain the convergence of training with increasing network depth and to avoid the vanishing/exploding gradients problem, we design our network using residual blocks of convolutional layers [12]. Previous research showed that it is possible to calculate diffuse lighting in the form of an omnidirectional image by a neural network [9, 22]. In the previous work, a neural network was used to calculate the image-to-image relationship between an input image and the estimated illumination. In contrast to that, we demonstrate that a neural network can be trained to directly regress a dominant light direction from an input RGB-D image.
The varying camera poses in an AR scenario cause problems for light estimation by a neural network. These problems are caused by the high dimensionality and complexity of the input if a network has to handle distinct camera poses in world space. We address this problem by regressing a dominant light direction in the form of relative Euler angles \(\phi \) and \(\theta \). These angles are always relative to the camera pose and are therefore independent of the camera view angle. A dominant light direction in world space can then be calculated by adding the relative Euler angles of the light source to the Euler angles of the camera.
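The conversion from relative angles to a world-space direction can be sketched as follows. This is only an illustration under assumed conventions that the text does not specify: \(\phi \) is treated as yaw about the vertical y axis, \(\theta \) as pitch (elevation) above the horizontal plane, and the camera looks along +z at zero yaw and pitch. The function name `light_direction_world` is hypothetical.

```python
import math


def light_direction_world(cam_yaw, cam_pitch, rel_phi, rel_theta):
    """Add relative light angles to camera angles and return a unit vector.

    Assumed convention (not given in the paper): yaw rotates about the
    vertical y axis, pitch is the elevation above the horizontal plane,
    and zero yaw/pitch points along +z. All angles are in radians.
    """
    yaw = cam_yaw + rel_phi
    pitch = cam_pitch + rel_theta
    # Wrap the summed yaw back into [-pi, pi].
    yaw = math.atan2(math.sin(yaw), math.cos(yaw))
    x = math.cos(pitch) * math.sin(yaw)
    y = math.sin(pitch)
    z = math.cos(pitch) * math.cos(yaw)
    return (x, y, z)
```

For example, a camera yawed by 45 degrees combined with a relative light yaw of 45 degrees gives a world-space light direction along +x under these conventions.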
Once light sources are estimated from each image of an AR video stream, discontinuities in the temporal domain may appear. In order to address this problem, filtering or outlier removal needs to be applied in the temporal domain. In this paper, we propose an efficient method for outlier removal from a small number of consecutive samples in the temporal domain. This method is based on previous research on outlier removal in the spatial domain [4], and it is adapted to the problem of outlier removal from light source estimation data.
We demonstrate the capabilities of deep learning for light source estimation in AR by integrating the presented method into a real-time AR rendering system based on ray tracing. We also evaluated the results of our method and compared them to the results of a state-of-the-art method for illumination estimation. Our results indicate that a deep neural network can be used to estimate light sources in scenes which were not seen during training.
The main contributions of this paper are:
- A novel method for probe-less light source estimation in AR scenes,
- A novel method for outlier removal in the temporal domain,
- Evaluation of the proposed methods on multiple real-world scenes,
- Integration of the proposed methods in an AR rendering system based on ray tracing.
2 Related work
Light source estimation has been a challenging problem for researchers in computer graphics for decades. Knowing light position in the 3D world is required for many fields of research including computer vision, image processing and augmented reality. In augmented reality, we can see two main approaches for obtaining information about the real illumination in order to achieve consistent light: (1) inserting active or passive light probes into a scene and (2) estimating the illumination from the image of the main AR camera.
Methods based on light probes use either an active camera with a fish-eye lens or a passive object with known reflectance properties to capture environmental illumination in real time. The hemispherical image from the camera with the fish-eye lens can be used to reconstruct an HDR panorama. This image can be utilized directly for image-based lighting in AR [16, 19, 27, 29]. The image can also be processed by image processing methods to identify dominant light sources [8, 34]. In the case of passive light probes, illumination is captured by the main camera from an object of known geometry and reflectance which is inserted into the scene. The most common passive light probe is a mirror sphere [1, 5]. A human face can also be used as a light probe to capture illumination from the front-facing camera of a mobile phone [20]. Recently, Mandl et al. showed that it is possible to utilize an arbitrary object as a light probe [22]. In their method, a series of neural networks are trained for a given light probe object. These networks are then employed to estimate light from a scene which contains the given light probe object.
The second category of methods (probe-less methods) can estimate illumination from a main AR camera image without the need for a known object in the scene. These methods typically use image features which are known to be directly affected by illumination. Examples of such features are shadows, the gradient of image brightness [2, 3, 18] and shading [10, 11, 14, 21, 26, 31, 32]. Real-world illumination can also be reconstructed from RGB-D images by utilizing the estimation of surface normals and albedo. Recent research introduced data-driven approaches to the problem of light source estimation. These methods typically use large datasets of panoramas to train an illumination predictor. The predictor estimates the surrounding lighting (also represented as a panorama) from a single input image. The predictor is typically based on finding similarity between an input image and one of the projections of individual panoramas. The predictor can also be automatically learned from a large dataset and encoded into a neural network [9, 13]. In our method, we also use a deep neural network to encode a relation between the input image and a dominant light direction. In contrast to prior work, we focus on delta directional light sources which cause hard shadows and strong directionality of the light in the scene. Additionally, we demonstrate direct applicability of our method in an AR scenario and we also focus on temporal coherence of the estimated light.
Light source estimation by neural networks can also be posed as a classification problem. In this case, the space of light directions is discretized into a set of N classes and the network classifies an image as one of these classes. Previous research also showed evidence that a dominant light direction can be directly regressed from an input image by a neural network. Our research is based on a similar methodology, while we aim at higher scene complexity, temporal coherence and direct application of the network to an augmented reality scenario.
3 Light source estimation using deep learning
Our method for light source estimation uses a deep neural network to learn a functional relationship between the input RGB-D image of the scene and a dominant light direction. This network needs to be trained only once on a variety of scenes, and it can then be applied in a new scene with arbitrary geometry. We trained our network with the assumption of one dominant light direction in a scene. Therefore, it works best on scenes with delta light sources. The presented method for light source estimation was integrated into an AR rendering framework and evaluated on several real scenes (which were not used during training).
Training data. Deep neural networks require a large amount of data to be able to accurately regress a target function. During our research, we trained our network on a synthetic dataset which was rendered using Monte Carlo path tracing. The synthetic data contain five simple scenes which were rendered with a random light source position and a random camera position. The camera viewing direction was rotated toward the center of the scene. The synthetic dataset consists of 23,111 images. The 3D objects used for the creation of the synthetic dataset are shown in Fig. 3.
In addition to synthetic data, we experimented with a real-world dataset which was captured in multiple indoor spaces using multiple measured light source positions and a tracked RGB-D camera. This real dataset contains 5650 images from six real scenes. During our experiments, we found that the network converged much better on the synthetic dataset than on the real one. Moreover, a very interesting finding was that the network trained on the synthetic dataset also performs better in a real-world AR scenario than the network trained on the real dataset. The network trained on the real dataset did not converge properly and performed poorly in AR. We hypothesize that the amount of noise present in the depth images was too high for the training process. Due to the poor performance of real data in the training and test scenarios, we decided to use only the synthetic dataset for training our neural network. Moreover, we performed an experiment in which the network was trained with RGB data only, omitting depth data. In this experiment, the network performed poorly in an AR test scenario. Therefore, we decided to use RGB-D data for all subsequent experiments.
4 Temporal coherence
After each light direction from the last N frames is classified as an inlier or an outlier, we calculate the resulting light source direction as the average of all inliers. This averaging enables temporal smoothing of the estimated light direction. We empirically set N to 6 in our experiments.
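The windowed averaging of inliers can be sketched as below. The inlier test used here is a stand-in chosen only for illustration and is not the outlier classification of the paper: a sample counts as an inlier if it lies within an assumed angular threshold of the normalized component-wise median of the window. The class name `TemporalLightFilter`, the parameter `max_angle_deg` and its default of 30 degrees are hypothetical; only the window size N = 6 comes from the text.

```python
import math
import statistics
from collections import deque


def _unit(v):
    """Return v scaled to unit length, or None for a (near-)zero vector."""
    n = math.sqrt(sum(c * c for c in v))
    return tuple(c / n for c in v) if n > 1e-12 else None


def _angle(a, b):
    """Angle in radians between two unit vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return math.acos(max(-1.0, min(1.0, dot)))


class TemporalLightFilter:
    """Average the inlier light directions among the last n estimates."""

    def __init__(self, n=6, max_angle_deg=30.0):
        self.window = deque(maxlen=n)  # keeps only the last n samples
        self.max_angle = math.radians(max_angle_deg)  # assumed threshold

    def update(self, direction):
        d = _unit(direction)
        if d is None:
            raise ValueError("direction must be non-zero")
        self.window.append(d)
        samples = list(self.window)
        # Stand-in reference direction: normalized component-wise median.
        ref = _unit(tuple(statistics.median(axis) for axis in zip(*samples)))
        inliers = samples
        if ref is not None:
            close = [s for s in samples if _angle(s, ref) <= self.max_angle]
            if close:
                inliers = close
        # The result is the normalized average of all inliers.
        mean = tuple(sum(axis) / len(inliers) for axis in zip(*inliers))
        return _unit(mean) or d
```

With this sketch, a single estimate that deviates strongly from the other samples in the window is excluded from the average, so the direction passed to the renderer does not jump for one frame.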
Once the light source of a real scene can be estimated for each frame, this algorithm needs to be integrated into an AR scenario. In our experiments, we used a Microsoft Kinect RGB-D camera and we integrated it into an AR rendering system based on real-time ray tracing. ARToolKitPlus [30] marker-based tracking is used to track the RGB-D camera in a real scene. The light source estimation runs asynchronously in a separate thread, and it always uses the last frame from the RGB-D camera to estimate a dominant light direction. This estimated light direction is then used in the ray-tracing system to illuminate virtual objects. As both rendering and light source estimation run in interactive time, the light reflections, shadows and caustics are always adapted to the light direction in the real world. Therefore, consistent illumination between real and virtual objects is achieved (Fig. 4).
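The asynchronous scheme, in which the estimator always works on the most recent frame, can be illustrated with a single-slot mailbox. This is a generic sketch and not the implementation used in the system described here; the names `LatestFrame` and `estimation_worker`, and the `estimate` and `publish` callables, are hypothetical.

```python
import threading


class LatestFrame:
    """Single-slot mailbox: a new frame overwrites any unprocessed one."""

    def __init__(self):
        self._cond = threading.Condition()
        self._frame = None  # frames themselves must not be None
        self._closed = False

    def put(self, frame):
        with self._cond:
            self._frame = frame  # drop the older frame, keep the newest
            self._cond.notify()

    def close(self):
        with self._cond:
            self._closed = True
            self._cond.notify()

    def take(self):
        """Block until a frame is available; return None once closed and empty."""
        with self._cond:
            while self._frame is None and not self._closed:
                self._cond.wait()
            frame, self._frame = self._frame, None
            return frame


def estimation_worker(slot, estimate, publish):
    """Run in a separate thread: estimate light from the newest frame only."""
    while True:
        frame = slot.take()
        if frame is None:
            return
        publish(estimate(frame))
```

The camera thread would call `put` for every captured frame, while the renderer reads whatever direction was published last, so a slow estimator skips frames instead of building up a backlog.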
6 Evaluation and results
[Table: Comparison of our method to the algorithm of Gardner et al.]
In our evaluation, we also measured the performance of our method on a synthetic dataset. For this purpose, we rendered a new dataset which consists of 7097 images. The scene with a cone model was used in this case. A similar scene was also used in training, but the test data were rendered with new viewpoints and light source positions which were not used during training. We calculated the average angular error of the estimated light direction over all images of the newly rendered dataset. The resulting error is \(20.4^\circ \). This result indicates that the trained neural network also performs well on synthetic data under new viewpoints.
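An average angular error of this kind can be computed as in the following sketch, assuming that the error of one image is the angle between the estimated and the reference direction vectors (the text does not spell out the exact definition). The function names are hypothetical.

```python
import math


def angular_error_deg(estimated, reference):
    """Angle in degrees between two non-zero 3D direction vectors."""
    dot = sum(e * r for e, r in zip(estimated, reference))
    norm_e = math.sqrt(sum(e * e for e in estimated))
    norm_r = math.sqrt(sum(r * r for r in reference))
    # Clamp to [-1, 1] to guard against floating-point rounding.
    cos_angle = max(-1.0, min(1.0, dot / (norm_e * norm_r)))
    return math.degrees(math.acos(cos_angle))


def mean_angular_error_deg(pairs):
    """Average angular error over (estimated, reference) direction pairs."""
    errors = [angular_error_deg(e, r) for e, r in pairs]
    return sum(errors) / len(errors)
```

The vectors do not need to be normalized beforehand, since the dot product is divided by both lengths.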
The evaluations discussed above were performed in an environment with controlled light. We were also interested in investigating the performance of our neural network in a scene with natural (uncontrolled) illumination. For this purpose, we ran an experiment in an office scene lit by sunlight through a window. In this scenario, we measured the position of the window with respect to an AR marker to represent a reference light direction. We compared our result with the method of Gardner et al. The results of this experiment are shown in Fig. 7. The rendering with the estimated light direction indicates correct estimation of light by our method (virtual shadows are consistent with the real ones). This positive result is also supported by the projection of the estimated light direction into the captured light probe (Fig. 7, red dot in the ground-truth environment). The light probe image was captured by a camera with a fish-eye lens (\(185^\circ \) field of view), and it represents the upper hemisphere of incoming light. Our method calculated the light direction with an angular error of \(21.7^\circ \). The compared method performed better in this scene (angular error of \(12^\circ \)). Nevertheless, the result of this experiment suggests that our method also performs well in scenes with arbitrary uncontrolled lighting.
In addition to single-image light source estimation, we evaluated our method for temporal filtering and outlier removal on a live AR stream. We captured a video of a scene with a moving light source, and we evaluated our light source estimation with and without temporal filtering in comparison with ground-truth data. In this evaluation, the average angular error was \(41^\circ \) without temporal filtering and \(38.3^\circ \) with temporal filtering. The results suggest that our method achieves higher temporal coherence (and lower average error) when temporal filtering is used. Additionally, an example of our method for light source estimation with temporal filtering is given in the supplementary video.
Finally, we measured the calculation time of our method. The neural network processing and the whole light source estimation with AR integration were measured separately to provide a detailed analysis. Calculation times were computed as averages of multiple measurements. A computer with a hexa-core 3.2 GHz processor and an NVIDIA Titan Xp graphics card was used for the time measurements. Light source estimation by the neural network was executed on the CPU because the GPU was fully utilized by ray tracing. The average time of light source estimation by the neural network was 380 ms. In our implementation, light source estimation and AR rendering were represented by two services. Therefore, the communication overhead between them also influences the update rate of the estimated light. The average time for communication between the rendering and illumination estimation was 50 ms. In the future, this overhead can be reduced by integrating both algorithms into one stand-alone system. The AR rendering runs asynchronously and is therefore independent of the light estimation speed. With ray-tracing-based rendering, we achieved an average rendering time of 58 ms. The results indicate that processing by the neural network achieves interactive speed suitable for AR applications.
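As a rough reading of these timings, and assuming that network inference and inter-service communication occur one after the other for each update (the text does not state this explicitly), the light estimate and the rendered image would refresh at approximately

\[ f_{\text{light}} \approx \frac{1000\ \text{ms/s}}{380\ \text{ms} + 50\ \text{ms}} \approx 2.3\ \text{updates/s}, \qquad f_{\text{render}} \approx \frac{1000\ \text{ms/s}}{58\ \text{ms}} \approx 17\ \text{frames/s}. \]

Here \(f_{\text{light}}\) and \(f_{\text{render}}\) are names introduced only for this illustration.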
The results of our evaluation show that the trained neural network is capable of estimating illumination from real scenes which were not used during training. Moreover, an interesting finding was that training on a synthetic dataset leads to better convergence than on real-world data and to better performance in AR. We hypothesize that bad performance of training with real-world data was caused by insufficient variability of light positions and by a high level of noise in depth data.
Limitations and future work. Our method works well in many tested AR scenes. However, in some special cases the network does not estimate the light direction correctly. This is often the case if a light source is positioned opposite to the camera. In this case, the uncertainty of the network can be observed (Fig. 8). We hypothesize that this uncertainty is caused by a discontinuity of the yaw angle in the direction opposite to the camera. This angle can be represented as both \(\pi \) and \(-\pi \). Therefore, the network cannot find a continuous transition from one side to the other. In the future, this problem can be addressed by using a different representation of the light direction. For example, the relative direction (x, y, z) in the camera coordinate space can be used.
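The discontinuity can be made concrete with a small numeric example. It assumes a yaw angle measured about the vertical axis with zero pointing along +z, which is a convention chosen for this illustration only; the helper `yaw_to_xz` is hypothetical. Two directions that lie just on either side of the point opposite the camera have yaw values that differ by almost \(2\pi \), while their vector representations are nearly identical.

```python
import math


def yaw_to_xz(yaw):
    """Horizontal unit direction for a yaw angle (assumed: 0 points along +z)."""
    return (math.sin(yaw), math.cos(yaw))


eps = 0.01
yaw_a = math.pi - eps   # just left of the direction opposite the camera
yaw_b = -math.pi + eps  # just right of it

# In the angle representation the two targets are far apart ...
angle_gap = abs(yaw_a - yaw_b)

# ... while in the vector representation they almost coincide.
vector_gap = math.dist(yaw_to_xz(yaw_a), yaw_to_xz(yaw_b))

print(f"angle gap:  {angle_gap:.4f} rad")
print(f"vector gap: {vector_gap:.4f}")
```

A regression target expressed as (x, y, z) therefore changes smoothly across this direction, whereas a yaw target jumps between its two extremes.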
Another limitation of the network trained on synthetic data may arise from the domain gap between real-world and simplistic synthetic data. As a consequence, the network might not operate properly in complex real scenes. This problem can be addressed in the future in two ways: The first is to create complex synthetic scenes which mimic the real world as closely as possible. The second is to improve the quality of the capturing process and use high-quality real-world data for additional training of the network.
An interesting direction for future work will be training of a network which will operate in both the spatial (2D image) and temporal domain. Such a network might calculate light estimates which are already temporally coherent and therefore additional filtering in the temporal domain would not be needed. Additionally, as real scenes often contain more than one light source, we also aim in future work at training a deep neural network which can estimate multiple light sources.
Finally, the exploration of a wide space of network designs would be vital for finding the best design for a given problem. We explored many possible network designs during this research, and we found the network in Fig. 2 to work best in our experiments. Nevertheless, automatic exploration of the design space of neural networks would be beneficial for finding the most appropriate model for various problems.
This paper presents a novel method for delta light source estimation in AR. An end-to-end AR system is presented which estimates a directional light source from a single RGB-D camera and integrates this light estimate into AR rendering. The rendering system superimposes virtual objects into a real image with consistent illumination using the estimated light direction. Moreover, temporal coherence of light source estimation is achieved by applying outlier removal and temporal filtering. We evaluated the proposed methods on various AR scenes. The results indicate that the proposed neural network can estimate a dominant light direction even on scenes which were not seen by the network during training. Finally, our evaluation shows that our method can be a beneficial complement to the methods estimating diffuse lighting to faithfully estimate all frequencies of illumination in AR.
Open access funding provided by TU Wien (TUW). This research was funded by the Austrian research project WWTF ICT15-015. We thank Marc-André Gardner for providing us with the results of his algorithm through a Web service and for the kind explanations of details of the algorithm. We are also thankful to Alexander Pacha for advice about deep learning. We would like to thank NVIDIA Corporation for the donation of a Titan Xp graphics card and the Center for Geometry and Computational Design for access to a multi-GPU PC for training our neural networks. We also thank Min Kyung Lee, Iana Podkosova and Khrystyna Vasylevska for their support with controlled light experiments.
Compliance with ethical standards
Conflict of interest
The authors declare that they have no conflict of interest.
This research did not involve human participants.
- 1. Agusanto, K., Li, L., Chuangui, Z., Sing, N.W.: Photorealistic rendering for augmented reality using environment illumination. In: IEEE ISMAR, pp. 208–218 (2003)
- 2. Boom, B., Orts-Escolano, S., Ning, X., McDonagh, S., Sandilands, P., Fisher, R.B.: Point light source estimation based on scenes recorded by a RGB-D camera. In: British Machine Vision Conference (2013)
- 3. Boom, B.J., Orts-Escolano, S., Ning, X.X., McDonagh, S., Sandilands, P., Fisher, R.B.: Interactive light source position estimation for augmented reality with an RGB-D camera. Comput. Animat. Virtual Worlds 28(1), 5149–5159 (2015)
- 4. Dante, A., Brookes, M.: Precise real-time outlier removal from motion vector fields for 3D reconstruction. In: International Conference on Image Processing, vol. 1 (2003)
- 5. Debevec, P.: Rendering synthetic objects into real scenes: bridging traditional and image-based graphics with global illumination and high dynamic range photography. In: SIGGRAPH, pp. 189–198. ACM, New York (1998)
- 8. Frahm, J.M., Koeser, K., Grest, D., Koch, R.: Markerless augmented reality with light source estimation for direct illumination. In: CVMP, pp. 211–220 (2005)
- 10. Gruber, L., Richter-Trummer, T., Schmalstieg, D.: Real-time photometric registration from arbitrary geometry. In: ISMAR, pp. 119–128 (2012)
- 11. Gruber, L., Ventura, J., Schmalstieg, D.: Image-space illumination for augmented reality in dynamic environments. In: 2015 IEEE VR, pp. 127–134 (2015)
- 12. He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. CoRR arXiv:1512.03385 (2015)
- 13. Hold-Geoffroy, Y., Sunkavalli, K., Hadap, S., Gambaretto, E., Lalonde, J.: Deep outdoor illumination estimation. CoRR arXiv:1611.06403 (2016)
- 14. Jiddi, S., Robert, P., Marchand, E.: Reflectance and illumination estimation for realistic augmentations of real scenes. In: IEEE ISMAR, pp. 244–249 (2016)
- 15. Kán, P.: High-quality real-time global illumination in augmented reality. Ph.D. thesis, TU Wien (2014)
- 16. Kán, P., Unterguggenberger, J., Kaufmann, H.: High-quality consistent illumination in mobile augmented reality by radiance convolution on the GPU. In: ISVC 2015, Part I, LNCS 9474, pp. 574–585. Springer (2015)
- 18. Kasper, M., Keivan, N., Sibley, G., Heckman, C.R.: Light source estimation with analytical path-tracing. CoRR arXiv:1701.04101 (2017)
- 19. Knecht, M., Traxler, C., Mattausch, O., Purgathofer, W., Wimmer, M.: Differential instant radiosity for mixed reality. In: IEEE ISMAR, pp. 99–107 (2010)
- 20. Knorr, S.B., Kurz, D.: Real-time illumination estimation from faces for coherent rendering. In: IEEE ISMAR, pp. 113–122 (2014)
- 22. Mandl, D., Yi, K.M., Mohr, P., Roth, P.M., Fua, P., Lepetit, V., Schmalstieg, D., Kalkofen, D.: Learning lightprobes for mixed reality illumination. In: IEEE ISMAR, pp. 82–89 (2017)
- 23. Marques, B.A.D., Drumond, R.R., Vasconcelos, C.N., Clua, E.: Deep light source estimation for mixed reality. In: VISIGRAPP, pp. 303–311 (2018)
- 24. Nair, V., Hinton, G.E.: Rectified linear units improve restricted Boltzmann machines. In: ICML, pp. 807–814. Omnipress, USA (2010)
- 25. Ramamoorthi, R., Hanrahan, P.: A signal-processing framework for inverse rendering. In: SIGGRAPH, pp. 117–128. ACM, New York, NY, USA (2001)
- 26. Richter-Trummer, T., Kalkofen, D., Park, J., Schmalstieg, D.: Instant mixed reality lighting from casual scanning. In: IEEE ISMAR, pp. 27–36 (2016)
- 27. Rohmer, K., Büschel, W., Dachselt, R., Grosch, T.: Interactive near-field illumination for photorealistic augmented reality on mobile devices. In: IEEE ISMAR, pp. 29–38 (2014)
- 29. Supan, P., Stuppacher, I., Haller, M.J.: Image based shadowing in real-time augmented reality. IJVR 5, 1–7 (2006)
- 30. Wagner, D., Schmalstieg, D.: ARToolKitPlus for pose tracking on mobile devices. Technical Report, TU Graz (2007)
- 31. Weber, M., Cipolla, R.: A practical method for estimation of point light-sources. BMVC 2001(2), 471–480 (2001)
- 33. Yu, L., Yeung, S., Tai, Y., Lin, S.: Shading-based shape refinement of RGB-D images. In: IEEE CVPR, pp. 1415–1422 (2013)
- 34. Zhou, W., Kambhamettu, C.: Estimation of illuminant direction and intensity of multiple light sources. In: ECCV, pp. 206–220. Springer (2002)
Open Access. This article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made.