Abstract
Video object segmentation, i.e., the separation of a target object from background in video, has made significant progress on real and challenging videos in recent years. To leverage this progress in 3D applications, this paper addresses the problem of learning to estimate the depth of segmented objects given some measurement of camera motion (e.g., from robot kinematics or vehicle odometry). We achieve this by, first, introducing a diverse, extensible dataset and, second, designing a novel deep network that estimates the depth of objects using only segmentation masks and uncalibrated camera movement. Our data-generation framework creates artificial object segmentations that are scaled for changes in distance between the camera and object, and our network learns to estimate object depth even with segmentation errors. We demonstrate our approach across domains using a robot camera to locate objects from the YCB dataset and a vehicle camera to locate obstacles while driving.
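The geometric cue behind this problem setting can be illustrated with a minimal sketch. It is an analytic pinhole-camera baseline, not the learned network described above. It assumes an ideal pinhole camera, a rigid object, camera motion purely along the optical axis, and error-free masks. Under those assumptions, mask area scales with the inverse square of depth, so the square root of the area scales with inverse depth, and two mask areas plus the known camera advance determine the depth. The function name and its parameters are hypothetical and do not come from the ODMS code.

```python
import math


def depth_from_two_masks(area_first, area_second, camera_advance):
    """Estimate object depth at the second observation (hypothetical helper).

    Assumes a pinhole camera moving along its optical axis toward (positive
    camera_advance) or away from (negative) a rigid object, with exact masks.

    area_first, area_second: mask areas in pixels at the two observations.
    camera_advance: distance the camera moved toward the object, in the same
        unit as the returned depth.
    """
    if area_first <= 0 or area_second <= 0:
        raise ValueError("mask areas must be positive")

    # Linear image size is proportional to 1 / depth, and area to 1 / depth^2.
    size_first = math.sqrt(area_first)
    size_second = math.sqrt(area_second)

    if math.isclose(size_first, size_second):
        # No scale change means the depth is unobservable from this cue.
        raise ValueError("mask scale did not change; depth is unobservable")

    # From size_first * d1 == size_second * d2 and d2 == d1 - camera_advance:
    #   d2 = size_first * camera_advance / (size_second - size_first)
    return size_first * camera_advance / (size_second - size_first)


if __name__ == "__main__":
    # Object first seen at 1.0 m with a 400 px mask; the camera then advances
    # 0.5 m, doubling the linear size (area 1600 px).
    print(depth_from_two_masks(400.0, 1600.0, 0.5))
```

With real segmentations the mask areas are noisy, and the division by a small scale difference amplifies that noise, which is one reason to learn the estimate from many observations instead of relying on this closed form.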
Notes
1. Dataset and source code website: https://github.com/griffbr/ODMS.
Acknowledgements
We thank Madan Ravi Ganesh, Parker Koch, and Luowei Zhou for various discussions throughout this work. Toyota Research Institute (“TRI”) provided funds to assist the authors with their research but this article solely reflects the opinions and conclusions of its authors and not TRI or any other Toyota entity.
Copyright information
© 2020 Springer Nature Switzerland AG
About this paper
Cite this paper
Griffin, B.A., Corso, J.J. (2020). Learning Object Depth from Camera Motion and Video Object Segmentation. In: Vedaldi, A., Bischof, H., Brox, T., Frahm, JM. (eds) Computer Vision – ECCV 2020. ECCV 2020. Lecture Notes in Computer Science(), vol 12352. Springer, Cham. https://doi.org/10.1007/978-3-030-58571-6_18
DOI: https://doi.org/10.1007/978-3-030-58571-6_18
Publisher Name: Springer, Cham
Print ISBN: 978-3-030-58570-9
Online ISBN: 978-3-030-58571-6
eBook Packages: Computer Science (R0)