Structural Deep Metric Learning for Room Layout Estimation

Conference paper
Part of the Lecture Notes in Computer Science book series (LNCS, volume 12363)

Abstract

In this paper, we propose a structural deep metric learning (SDML) method for room layout estimation, which aims to recover the 3D spatial layout of a cluttered indoor scene from a monocular RGB image. Unlike existing room layout estimation methods, which solve a regression or per-pixel classification problem, we formulate room layout estimation from a metric learning perspective and explicitly model the structural relations across different images. We propose to learn a latent embedding space in which the Euclidean distance characterizes the actual structural difference between the layouts of two rooms. We then minimize the discrepancy between an image and its ground-truth layout in the learned embedding space. We employ a metric model and a layout encoder to map the RGB images and the ground-truth layouts, respectively, to the embedding space, and a layout decoder to map the embeddings back to the corresponding layouts; the whole framework is trained end to end. We perform experiments on the widely used Hedau and LSUN datasets and achieve state-of-the-art performance.
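
The two objectives described in the abstract can be illustrated with a small sketch. It is not the formulation used in the paper: the pixel-disagreement measure in `layout_difference`, the squared-error form of both losses, and all function names are assumptions made here for illustration, and the embeddings are plain lists standing in for the outputs of the metric model and the layout encoder.

```python
import math


def euclidean(u, v):
    """Euclidean distance between two equal-length embedding vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def layout_difference(layout_a, layout_b):
    """Fraction of pixels whose surface label differs between two layouts.

    Assumed stand-in for the 'structural difference' between two rooms;
    each layout is a 2D grid (list of rows) of integer surface labels.
    """
    flat_a = [label for row in layout_a for label in row]
    flat_b = [label for row in layout_b for label in row]
    return sum(a != b for a, b in zip(flat_a, flat_b)) / len(flat_a)


def structural_loss(emb_a, emb_b, layout_a, layout_b):
    """Penalize embedding distances that do not match the layout difference."""
    gap = euclidean(emb_a, emb_b) - layout_difference(layout_a, layout_b)
    return gap ** 2


def discrepancy_loss(image_emb, layout_emb):
    """Squared distance between an image embedding and its ground-truth layout embedding."""
    return euclidean(image_emb, layout_emb) ** 2


if __name__ == "__main__":
    # Two toy 2x2 layouts that disagree on a single pixel.
    room_a = [[0, 0], [1, 1]]
    room_b = [[0, 1], [1, 1]]
    # Hypothetical 2D embeddings of the two layouts.
    emb_a, emb_b = [0.0, 0.0], [0.3, 0.4]
    print("layout difference:", layout_difference(room_a, room_b))
    print("structural loss:", structural_loss(emb_a, emb_b, room_a, room_b))
    print("discrepancy loss:", discrepancy_loss([1.0, 2.0], [1.0, 2.0]))
```

In a trained system of this kind, both terms would be minimized jointly over batches of images and layouts, with a decoder mapping the image embedding back to a layout at test time; the sketch only shows how the distances are compared.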

Keywords

Deep metric learning; Room layout estimation; Structured prediction

Notes

Acknowledgements

The authors would like to thank Yangyang Song for his kind support and helpful discussions. This work was supported in part by the National Key Research and Development Program of China under Grant 2017YFA0700802, in part by the National Natural Science Foundation of China under Grant 61822603, Grant U1813218, Grant U1713214, and Grant 61672306, in part by Beijing Natural Science Foundation under Grant No. L172051, in part by Beijing Academy of Artificial Intelligence (BAAI), in part by a grant from the Institute for Guo Qiang, Tsinghua University, in part by the Shenzhen Fundamental Research Fund (Subject Arrangement) under Grant JCYJ20170412170602564, and in part by Tsinghua University Initiative Scientific Research Program.

Copyright information

© Springer Nature Switzerland AG 2020

Authors and Affiliations

  1. Department of Automation, Tsinghua University, Beijing, China
  2. State Key Lab of Intelligent Technologies and Systems, Beijing, China
  3. Beijing National Research Center for Information Science and Technology, Beijing, China
  4. Tsinghua Shenzhen International Graduate School, Tsinghua University, Beijing, China