
Single Image Based Three-Dimensional Scene Reconstruction Using Semantic and Geometric Priors


Abstract

Single image based three-dimensional (3D) scene reconstruction has become an important research topic in computer vision and computer graphics because it can provide machine vision systems with near-human visual perception. Previous approaches to 3D scene reconstruction and depth estimation from a single image relied on many factors, including motion parallax, stereoscopic parallax, and various monocular depth cues derived from known geometric priors. Deep learning based techniques have advanced single image depth estimation by aggregating information of varying complexity from RGB-depth image datasets used for training. This paper proposes an effective 3D scene estimation methodology that automatically extracts the vanishing point and semantic information, including 3D geometric characteristics, without prior assumptions. The vanishing point is extracted from line segments using minimum spanning tree clustering to remove spurious, noisy edges. Geometric and semantic information is extracted from a given image by a generative adversarial network trained on the constructed training set. We verified the proposed approach’s efficiency and effectiveness experimentally on a large database by directly recovering 3D scenes from single input images.






Acknowledgements

G.-J. Yoon is supported by the National Institute for Mathematical Sciences grant funded by the Korean government (No. NIMS-B21810000). J. Song is supported by the National Research Foundation of Korea (No. 2021R1F1A1059202). S.M. Yoon is supported by Institute of Information communications Technology Planning and Evaluation (IITP) (2020-0-00457) and by the National Research Foundation of Korea (No. NRF-2021R1A2C1008555).

Author information


Corresponding author

Correspondence to Sang Min Yoon.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Appendix: Geometric Quantities and Image Transformations

In this appendix, we derive various geometric quantities, including a vanishing point representation, and several transformations related to the camera settings. As seen in Fig. 4, we take a 2D image from an optical center \(C_h(0,c_h, 0)\) with angle \(\theta \), focusing at \(F_c(0,b,c)\) in 3D space. We then identify the focusing point \(F_c\) with the origin \(O_I\) of the image plane \(I_{2D}\), which carries an orthogonal coordinate system. We can see that \(\overrightarrow{C_hF_c}=(0,b-c_h,c)\), so the normal vector \(\mathbf {n}\) to the image plane \(I_{2D}\) in 3D space is given as

$$\begin{aligned} \mathbf {n}=(n_1,n_2,n_3)=\frac{\overrightarrow{C_hF_c}}{\Vert \overrightarrow{C_hF_c}\Vert } =\frac{(0,b-c_h,c)}{\sqrt{(b-c_h)^2+c^2}}. \end{aligned}$$
(4)

1.1 Geometric Quantities and Vanishing Point Representation

We now calculate the 3D coordinates of several points related to the 2D image obtained from the projection. Since the vanishing point lies at \((0, v_h)\) in the image plane, we can calculate the camera focusing angle \(\theta \) using the focal and focusing points as

$$\begin{aligned} \theta =\arctan \frac{v_h}{\sqrt{(b-c_h)^2+c^2}}=\arctan \frac{c_h-b}{c}. \end{aligned}$$
(5)

From the projection given in Fig. 8, we obtain the angle relation

$$\begin{aligned} \sin \theta =\frac{c_h-b}{\sqrt{(c_h-b)^2+c^2}} \end{aligned}$$
(6)

and

$$\begin{aligned} \cos \theta =\frac{c}{\sqrt{(c_h-b)^2+c^2}}=\frac{\sqrt{(c_h-b)^2+c^2}}{v_3}. \end{aligned}$$
(7)
Fig. 8

Vanishing point estimation in three-dimensional (3D) space from Eqs. 4 and 6

Applying these trigonometric relations, we obtain the z coordinate of the vanishing point \(v=(0,c_h, v_3)\) in 3D space as

$$\begin{aligned} v_3=\frac{(c_h-b)^2+c^2}{c} =c+\frac{v_h(c_h-b)}{\sqrt{(c_h-b)^2+c^2}}. \end{aligned}$$
(8)
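
To make these quantities concrete, the following NumPy sketch evaluates Eqs. (5)-(8) for a camera described by the optical-center height \(c_h\) and the focusing point \((0,b,c)\). It is only an illustrative implementation of the formulas above; the specific parameter values in the example are hypothetical.

```python
import numpy as np

def vanishing_point_geometry(c_h, b, c):
    """Evaluate Eqs. (5)-(8) for an optical center C_h = (0, c_h, 0)
    and a focusing point F_c = (0, b, c)."""
    d = np.sqrt((c_h - b) ** 2 + c ** 2)   # distance |C_h F_c|
    theta = np.arctan2(c_h - b, c)         # focusing angle, Eq. (5)
    sin_t = (c_h - b) / d                  # Eq. (6)
    cos_t = c / d                          # Eq. (7)
    v_h = d * np.tan(theta)                # image-plane coordinate (0, v_h) of the vanishing point
    v3 = ((c_h - b) ** 2 + c ** 2) / c     # Eq. (8): z coordinate of v = (0, c_h, v3)
    return theta, sin_t, cos_t, v_h, v3

# Hypothetical camera: optical center at height 1.5, focusing point at (0, 1.0, 4.0)
theta, sin_t, cos_t, v_h, v3 = vanishing_point_geometry(c_h=1.5, b=1.0, c=4.0)
```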

1.2 3D Representation of Points in the Two-Dimensional Image Plane \(I_{2D}\)

In Sect. 2.4, we proposed depth estimation from the 2D image \(I_{2D}\) using the vanishing point. For the depth estimation, it is convenient to have a transform of the 2D image onto the plane containing the image plane in 3D. The plane \(I_{3D}\) containing the image is given as \(I_{3D}=\{\mathbf{x}\in \mathbb {R}^3: \mathbf {n}\cdot (\mathbf{x}-F_c)=n_1 x +n_2(y-b)+n_3(z-c)=0\}\) with the normal vector \(\mathbf {n}\) in (4). Although the proof given for the projection onto the 2D image plane already hints at this transform, we give it a concrete form here.

Fig. 9

Mapping the two-dimensional (2D) image plane onto the three-dimensional (3D) plane

Let \(Q(u,v)\) be a point in \(I_{2D}.\) From the orthogonal coordinate system given to \(I_{2D},\) this means \(\overrightarrow{O_IQ}=u\mathbf {e}_1+ v\mathbf {e}_2.\)

In 3D, referring to Fig. 9, the point Q is represented as the vector sum

$$\begin{aligned} \overrightarrow{OQ}&=\overrightarrow{OF_c}+\overrightarrow{F_cQ}\nonumber \\&=b\mathbf {e}_y+c\mathbf {e}_z+u\mathbf {e}_1+ v\mathbf {e}_2 \end{aligned}$$
(9)

On the other hand, we have found the relations between \(\{\mathbf {e}_1, \mathbf {e}_2\}\) and \(\{\mathbf {e}_x, \mathbf {e}_y, \mathbf {e}_z\}\) in (11) as \( \mathbf {e}_1=\mathbf {e}_x\quad \text{ and }\quad \mathbf {e}_2=\cos \theta \mathbf {e}_y+\sin \theta \mathbf {e}_z. \) Applying these relations to (9) and using the fact that \( x=\overrightarrow{OQ}\cdot \mathbf {e}_x,\quad y=\overrightarrow{OQ}\cdot \mathbf {e}_y, \quad z=\overrightarrow{OQ}\cdot \mathbf {e}_z, \) we find the 3D coordinates \((x,y,z)\) of the point Q as

$$\begin{aligned} (x,y,z)&=(0,b,c)+(u,0,0)+(0,v\cos \theta ,v\sin \theta )\\ {}&=(u, b+v\cos \theta ,c+v\sin \theta ) \end{aligned}$$

Here \(\cos \theta \) and \(\sin \theta \) are given in (7) and (6), respectively. Thus, the transform \(T_{I_{2D}\rightarrow I_{3D}}: I_{2D}\rightarrow I_{3D}\) is found to be

$$\begin{aligned} \left( \begin{array}{c} x\\ y\\ z\end{array}\right)&=T_{I_{2D}\rightarrow I_{3D}}\left( \begin{array}{c} u\\ v\end{array}\right) =\left( \begin{array}{c} u\\ b+v\cos \theta \\ c+v\sin \theta \end{array}\right) \\ {}&=\left( \begin{array}{cc} 1 &{} 0\\ 0 &{}\cos \theta \\ 0&{} \sin \theta \end{array}\right) \left( \begin{array}{c} u\\ v\end{array}\right) +\left( \begin{array}{c} 0\\ b\\ c\end{array}\right) . \end{aligned}$$

Conversely, suppose we are given the 3D coordinates \((x,y,z)\) of the point Q. From Fig. 9, the vector sum in (9) gives the relation

$$\begin{aligned} \overrightarrow{O_IQ}&=u\mathbf {e}_1+v\mathbf {e}_2\\ {}&=\overrightarrow{OQ}-\overrightarrow{OF_c}\\&=(x,y,z)-(0,b,c)=(x, y-b,z-c)\\&=x\mathbf {e}_x+(y-b)\mathbf {e}_y+(z-c)\mathbf {e}_z. \end{aligned}$$

Applying relations (10) and (11) together with the orthogonality of the basis vectors gives the inverse transform \(T_{I_{3D}\rightarrow I_{2D}}: I_{3D}\rightarrow I_{2D}\) as

$$\begin{aligned} \left( \begin{array}{c} u\\ v\end{array}\right)&=T_{I_{3D}\rightarrow I_{2D}}\left( \begin{array}{c} x\\ y\\ z\end{array}\right) \\&=\left( \begin{array}{c} x\\ (y-b)\cos \theta +(z-c)\sin \theta \end{array}\right) \\&=\left( \begin{array}{ccc} 1 &{} 0 &{} 0\\ 0 &{}\cos \theta &{}\sin \theta \end{array}\right) \left( \begin{array}{c} x\\ y\\ z\end{array}\right) +\left( \begin{array}{c} 0\\ -b\cos \theta -c\sin \theta \end{array}\right) . \end{aligned}$$

We note that it is not difficult to show that

$$\begin{aligned} T_{I_{3D}\rightarrow I_{2D}}\circ T_{I_{2D}\rightarrow I_{3D}}=id_{I_{2D}} \end{aligned}$$

and

$$\begin{aligned} T_{I_{2D}\rightarrow I_{3D}}\circ T_{I_{3D}\rightarrow I_{2D}}=id_{I_{3D}} \end{aligned}$$

by using the fact that \((x, y-b, z-c)\cdot \mathbf {n}=0\) for \((x,y,z)\) in \(I_{3D}\) with the normal vector \(\mathbf {n}\) given in (4).
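
The two transforms and the composition identities above can also be checked numerically. The sketch below is a minimal NumPy implementation of \(T_{I_{2D}\rightarrow I_{3D}}\) and \(T_{I_{3D}\rightarrow I_{2D}}\) under the camera model of Fig. 4; the camera parameters and test point are hypothetical.

```python
import numpy as np

def make_transforms(c_h, b, c):
    """Build T_{I2D->I3D} and T_{I3D->I2D} for an optical center
    C_h = (0, c_h, 0) and a focusing point F_c = (0, b, c)."""
    d = np.sqrt((c_h - b) ** 2 + c ** 2)
    sin_t, cos_t = (c_h - b) / d, c / d            # Eqs. (6) and (7)

    A = np.array([[1.0, 0.0],
                  [0.0, cos_t],
                  [0.0, sin_t]])                   # 3x2 matrix of T_{I2D->I3D}
    t = np.array([0.0, b, c])                      # translation by the focusing point F_c

    def to_3d(uv):                                 # (u, v) -> (x, y, z) on the plane I_3D
        return A @ np.asarray(uv, dtype=float) + t

    def to_2d(xyz):                                # (x, y, z) -> (u, v)
        x, y, z = np.asarray(xyz, dtype=float)
        return np.array([x, (y - b) * cos_t + (z - c) * sin_t])

    return to_3d, to_2d

# Round-trip check: T_{3D->2D} o T_{2D->3D} is the identity on I_2D.
# (The reverse composition is the identity for points lying on the plane I_3D.)
to_3d, to_2d = make_transforms(c_h=1.5, b=1.0, c=4.0)
uv = np.array([0.3, -0.2])
assert np.allclose(to_2d(to_3d(uv)), uv)
```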

1.3 Projection Onto a Two-Dimensional Image \(I_{2D}\)

The 3D plane \(I_{3D}\) corresponding to the image plane \(I_{2D}\) is the set of all points \(\mathbf{x}=(x,y,z)\) such that \( \mathbf {n}\cdot (\mathbf{x}-F_c)=n_1 x +n_2(y-b)+n_3(z-c)=0.\) In this last section, we find the projection of 3D points in \(\mathbb {R}^3\) onto the image plane \(I_{2D}.\) Let \(P(x,y,z)\) be a 3D point in \(\{\mathbf{x}=(x,y,z): \mathbf {n}\cdot (\mathbf{x}-C_h)\ne 0\}\); we then seek the coordinates \((u,v)\) of the corresponding point Q in the image plane \(I_{2D}\). We find u and v using the relation

$$\begin{aligned} u=\overrightarrow{O_IQ}\cdot \mathbf {e}_1=\overrightarrow{F_cQ}\cdot \mathbf {e}_1, ~v=\overrightarrow{O_IQ}\cdot \mathbf {e}_2=\overrightarrow{F_cQ}\cdot \mathbf {e}_2. \end{aligned}$$
(10)

To do this, we need to represent the two unit vectors \(\mathbf {e}_1\) and \(\mathbf {e}_2\) in terms of \(\mathbf {e}_x, \mathbf {e}_y, \mathbf {e}_z.\) Using the angle \(\theta ,\) we see that

$$\begin{aligned} \mathbf {e}_1=\mathbf {e}_x \quad \text{ and } \quad \mathbf {e}_2=\cos \theta \mathbf {e}_y+\sin \theta \mathbf {e}_z. \end{aligned}$$
(11)

On the other hand, the two vectors \(\overrightarrow{C_hQ}\) and \(\overrightarrow{C_hP}\) are parallel, so there exists a constant \(\alpha \) such that \(\overrightarrow{C_hQ}=\alpha \overrightarrow{C_hP}.\) Since Q lies in the image plane, \(\overrightarrow{OQ}\) satisfies \(\mathbf {n}\cdot (\overrightarrow{OQ}-\overrightarrow{OF_c})=0.\) We also have \(\overrightarrow{OQ}=\overrightarrow{OC_h}+\alpha \overrightarrow{C_hP}\) and \(\overrightarrow{C_hP}=\overrightarrow{OP}-\overrightarrow{OC_h}=(x,y-c_h,z).\) Combining these relations, we get

$$\begin{aligned} \alpha&=\frac{(\overrightarrow{OF_c}-\overrightarrow{OC_h})\cdot \mathbf {n}}{(\overrightarrow{OP}-\overrightarrow{OC_h})\cdot \mathbf {n}}=\frac{(0,b-c_h,c)\cdot (n_1,n_2,n_3)}{(x,y-c_h,z)\cdot (n_1,n_2,n_3)}\nonumber \\&=\frac{(b-c_h)^2+c^2}{(b-c_h)(y-c_h)+cz}. \end{aligned}$$
(12)

Also, \(\overrightarrow{O_IQ}=\overrightarrow{F_cQ}\) is represented as the vector sum \( \overrightarrow{F_cQ}=\overrightarrow{C_hQ}+\overrightarrow{F_cC_h}=\alpha (x,y-c_h,z)+(0, c_h-b,-c) =(\alpha x, \alpha (y-c_h)+c_h-b, \alpha z-c). \) Applying this vector sum and (11) to (10), we obtain \( u=\alpha x\) and \(v=(\alpha (y-c_h)+c_h-b)\cos \theta +(\alpha z-c)\sin \theta .\) Finally, we calculate the projection mapping \(T_{{3D}\rightarrow I_{2D}}: \mathbb {R}^3\rightarrow I_{2D}\) as

$$\begin{aligned} \left( \begin{array}{c} u\\ v\end{array}\right)&=T_{3D \rightarrow I_{2D}}\left( \begin{array}{c} x\\ y\\ z\end{array}\right) \\&=\left( \begin{array}{c} \alpha x\\ (\alpha (y-c_h)+c_h-b)\cos \theta +(\alpha z-c)\sin \theta \end{array}\right) \end{aligned}$$

where the parameters \(\alpha \), \(\sin \theta ,\) and \(\cos \theta \) are given in (12), (6) and (7), respectively.
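
For completeness, the following sketch implements the projection \(T_{3D\rightarrow I_{2D}}\) given above, computing \(\alpha \) from (12) and then \((u,v)\). Again, this is only an illustrative NumPy version of the derived formulas; the example point and camera parameters are hypothetical.

```python
import numpy as np

def project_to_image(p, c_h, b, c):
    """Project a 3D point P = (x, y, z) onto the image plane I_2D,
    following Eqs. (10)-(12); returns the image coordinates (u, v)."""
    x, y, z = np.asarray(p, dtype=float)
    d = np.sqrt((c_h - b) ** 2 + c ** 2)
    sin_t, cos_t = (c_h - b) / d, c / d            # Eqs. (6) and (7)

    denom = (b - c_h) * (y - c_h) + c * z          # proportional to n . (P - C_h)
    if np.isclose(denom, 0.0):
        raise ValueError("P lies in the plane through C_h parallel to I_2D")
    alpha = ((b - c_h) ** 2 + c ** 2) / denom      # Eq. (12)

    u = alpha * x
    v = (alpha * (y - c_h) + c_h - b) * cos_t + (alpha * z - c) * sin_t
    return np.array([u, v])

# Hypothetical scene point projected with the same camera as above
uv = project_to_image(p=(0.5, 0.8, 6.0), c_h=1.5, b=1.0, c=4.0)
```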

About this article

Cite this article

Yoon, GJ., Song, J., Hong, YJ. et al. Single Image Based Three-Dimensional Scene Reconstruction Using Semantic and Geometric Priors. Neural Process Lett 54, 3679–3694 (2022). https://doi.org/10.1007/s11063-022-10780-2

