Abstract
Monocular depth estimation (MDE), the task of predicting scene depths from a single image, has attracted considerable interest, in large part owing to the popularity of deep learning methods for solving computer vision problems. Monocular cues provide sufficient data for humans to instantaneously understand scene geometries and relative depths, which is evidence of both the processing power of the human visual system and the predictive power of monocular data. However, developing computational models that predict depth from monocular images remains challenging. Hand-designed MDE features do not perform particularly well, and even current “deep” models are still evolving. Here we propose a novel approach that uses perceptually relevant natural scene statistics (NSS) features to predict depths from monocular images in a simple, scale-agnostic way that is competitive with state-of-the-art systems. While the statistics of natural photographic images have been used successfully in a variety of image and video processing, analysis, and quality assessment tasks, they have never been applied in a predictive end-to-end deep learning model for monocular depth. Correspondingly, no previous work has explicitly incorporated perceptual features in a monocular depth-prediction approach. We accomplish this by developing a new closed-form bivariate model of image luminances and using features extracted from this model and from other NSS models to drive a novel deep learning framework for predicting depth given a single image.
Cite this article
Pan, J., Bovik, A.C. Perceptual Monocular Depth Estimation. Neural Process Lett 53, 1205–1228 (2021). https://doi.org/10.1007/s11063-021-10437-6
DOI: https://doi.org/10.1007/s11063-021-10437-6