Alignment of Deep Features in 3D Models for Camera Pose Estimation

  • Jui-Yuan Su
  • Shyi-Chyi Cheng
  • Chin-Chun Chang
  • Jun-Wei Hsieh
Conference paper
Part of the Lecture Notes in Computer Science book series (LNCS, volume 11296)


Given a set of semantically annotated RGB-D images with known camera poses, many existing 3D reconstruction algorithms can integrate the images into a single 3D model of the scene. Such a semantically annotated scene model facilitates the construction of a video surveillance system that uses a moving camera, provided that we can efficiently compute the depth maps of the captured images and estimate the camera poses. The proposed model-based video surveillance consists of two phases: the modeling phase and the inspection phase. In the modeling phase, we carefully calibrate the parameters of the camera that captures the multi-view video used to model the target 3D scene. In the inspection phase, however, the camera pose parameters and the depth maps of the captured RGB images are often unknown or noisy when a moving camera is used to inspect the completeness of the object. In this paper, the 3D model is first transformed into a colored point cloud, which is then indexed by clustering, with each cluster representing a surface fragment of the scene. The clustering results are then used to train a model-specific convolutional neural network (CNN) that annotates each pixel of an input RGB image with its fragment class. The prestored camera parameters and depth information of the fragment classes are then fused to estimate the depth map and the camera pose of the current input RGB image. The experimental results show that the proposed approach outperforms the compared methods in terms of the accuracy of camera pose estimation.
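The point cloud indexing step can be illustrated with a minimal sketch. The abstract does not say which clustering algorithm is used, so the code below assumes plain k-means over concatenated position and color features with a deterministic farthest-point initialization. The function names and the `color_weight` parameter are hypothetical and serve only as illustration; they are not the authors' implementation.

```python
def sq_dist(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def farthest_point_init(feats, k):
    """Pick k initial centers: the first point, then repeatedly the point
    farthest from all centers chosen so far (an assumed initialization)."""
    centers = [feats[0]]
    while len(centers) < k:
        centers.append(max(feats, key=lambda f: min(sq_dist(f, c) for c in centers)))
    return [list(c) for c in centers]


def assign_fragments(feats, centers):
    """Label each feature vector with the index of its nearest center."""
    return [min(range(len(centers)), key=lambda j: sq_dist(f, centers[j]))
            for f in feats]


def cluster_colored_points(points, colors, k, color_weight=1.0, iters=20):
    """Group colored 3D points into k fragments with k-means.

    points: sequence of (x, y, z); colors: sequence of (r, g, b).
    color_weight (hypothetical) balances color against geometry.
    """
    feats = [list(p) + [color_weight * c for c in col]
             for p, col in zip(points, colors)]
    centers = farthest_point_init(feats, k)
    for _ in range(iters):
        labels = assign_fragments(feats, centers)
        for j in range(k):
            members = [f for f, lab in zip(feats, labels) if lab == j]
            if members:  # keep the old center if a cluster becomes empty
                centers[j] = [sum(v) / len(members) for v in zip(*members)]
    return assign_fragments(feats, centers), centers
```

In such a sketch, each returned label would play the role of a fragment class, that is, the per-pixel target that a segmentation CNN could be trained to predict once the labeled points are projected into the calibrated training views.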


Unsupervised fragment classification; 3D model; Deep learning; Camera pose estimation; 3D point cloud clustering



This work was supported in part by the Ministry of Science and Technology, Taiwan, under Grant Numbers MOST 107-2221-E-019-033-MY2 and 107-2634-F-019-001.



Copyright information

© Springer Nature Switzerland AG 2019

Authors and Affiliations

  • Jui-Yuan Su (1, 2)
  • Shyi-Chyi Cheng (2)
  • Chin-Chun Chang (2)
  • Jun-Wei Hsieh (2)
  1. Department of New Media and Communications Administration, Ming Chuan University, Taipei, Taiwan
  2. Department of Computer Science and Information Engineering, National Taiwan Ocean University, Keelung, Taiwan
