Abstract
Unsupervised monocular depth learning generally relies on the photometric relation among temporally adjacent images. Most previous works use both the mean absolute error (MAE) and the structural similarity index measure (SSIM) in its conventional form as the training loss. However, they ignore how the different components of the SSIM function, and the corresponding hyperparameters, affect training. To address this issue, this work proposes a new form of SSIM. Compared with the original SSIM function, the proposed form uses addition rather than multiplication to combine the luminance, contrast, and structural similarity components. A loss function constructed in this way yields smoother gradients and higher performance on unsupervised depth estimation. We conduct extensive experiments to determine a relatively optimal combination of parameters for the new SSIM. Built on the popular MonoDepth approach, the optimized SSIM loss function remarkably outperforms the baseline on the KITTI-2015 outdoor dataset.
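The difference between the multiplicative and additive combinations can be illustrated with a minimal sketch. It computes the three SSIM components of Wang et al. (2004) from global statistics of two small patches, then combines them once as a product and once as a weighted sum. The abstract does not give the exact additive formula or its tuned parameters, so the equal weights, the function names, and the blending factor `alpha = 0.85` (a value commonly used in MonoDepth-style losses) are assumptions made for illustration only.

```python
# Illustrative sketch only: the weights and the additive form below are
# assumptions, not the formula or parameters reported in the paper.

C1 = 0.01 ** 2   # stabilizing constants from Wang et al. (2004), intensities in [0, 1]
C2 = 0.03 ** 2
C3 = C2 / 2


def _mean(values):
    values = list(values)
    return sum(values) / len(values)


def ssim_components(x, y):
    """Return (luminance, contrast, structure) for two equal-length patches."""
    mu_x, mu_y = _mean(x), _mean(y)
    var_x = _mean((a - mu_x) ** 2 for a in x)
    var_y = _mean((b - mu_y) ** 2 for b in y)
    cov = _mean((a - mu_x) * (b - mu_y) for a, b in zip(x, y))
    sd_x, sd_y = var_x ** 0.5, var_y ** 0.5
    lum = (2 * mu_x * mu_y + C1) / (mu_x ** 2 + mu_y ** 2 + C1)
    con = (2 * sd_x * sd_y + C2) / (var_x + var_y + C2)
    struct = (cov + C3) / (sd_x * sd_y + C3)
    return lum, con, struct


def ssim_product(x, y):
    """Conventional SSIM: the three components are multiplied."""
    lum, con, struct = ssim_components(x, y)
    return lum * con * struct


def ssim_additive(x, y, weights=(1 / 3, 1 / 3, 1 / 3)):
    """Hypothetical additive variant: a weighted sum of the components."""
    lum, con, struct = ssim_components(x, y)
    w_l, w_c, w_s = weights
    return w_l * lum + w_c * con + w_s * struct


def photometric_loss(x, y, ssim_fn, alpha=0.85):
    """Blend an SSIM dissimilarity term with MAE (alpha is an assumed value)."""
    mae = _mean(abs(a - b) for a, b in zip(x, y))
    return alpha * (1 - ssim_fn(x, y)) / 2 + (1 - alpha) * mae


target = [0.1, 0.4, 0.8, 0.3]
warped = [0.2, 0.5, 0.6, 0.3]
print(photometric_loss(target, warped, ssim_product))
print(photometric_loss(target, warped, ssim_additive))
```

In the product form, the gradient of each component is scaled by the other two, whereas in a sum each component contributes independently. This is one plausible reading of why an additive combination might behave more smoothly during training, though the paper's own analysis should be consulted for the actual argument.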
Acknowledgements
This work was supported by the National Natural Science Foundation of China (62076055) and the Sichuan Science and Technology Program (2022ZYD0112).
Copyright information
© 2023 The Author(s), under exclusive license to Springer Nature Switzerland AG
About this paper
Cite this paper
Cao, Y., Luo, F., Li, Y. (2023). Toward Better SSIM Loss for Unsupervised Monocular Depth Estimation. In: Lu, H., et al. Image and Graphics. ICIG 2023. Lecture Notes in Computer Science, vol 14355. Springer, Cham. https://doi.org/10.1007/978-3-031-46305-1_7
DOI: https://doi.org/10.1007/978-3-031-46305-1_7
Publisher Name: Springer, Cham
Print ISBN: 978-3-031-46304-4
Online ISBN: 978-3-031-46305-1
eBook Packages: Computer Science, Computer Science (R0)