Abstract
Video analysis has been moving towards more detailed interpretation (e.g., segmentation) with encouraging progress. These tasks, however, increasingly rely on training data that is densely annotated in both space and time. Since such annotation is labor-intensive, few densely annotated video datasets with detailed region boundaries exist. This work aims to resolve this dilemma by learning to automatically generate region boundaries for all frames of a video from sparsely annotated bounding boxes of target regions. We achieve this with a Volumetric Graph Convolutional Network (VGCN), which learns to iteratively find keypoints on the region boundaries using the spatio-temporal volume of surrounding appearance and motion. We show that the global optimization of VGCN leads to more accurate annotation that generalizes better. Experimental results on three recent datasets (two real and one synthetic), including ablation studies, demonstrate the effectiveness and superiority of our method.
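To make the core idea concrete, the following is a minimal sketch of iterative boundary-keypoint refinement with graph convolutions, in the spirit of the VGCN described above. All names, shapes, and weights here are illustrative assumptions, not the authors' actual architecture: keypoints are modeled as nodes of a cycle graph along the region contour, and each graph-convolution step aggregates neighbor features and predicts a 2-D offset per keypoint. The real method additionally draws node features from a spatio-temporal volume of appearance and motion, which is replaced here by random stand-in features.

```python
import numpy as np

def normalized_adjacency(n):
    # Cycle graph over n boundary keypoints: each keypoint is
    # connected to its two neighbours along the region contour.
    A = np.zeros((n, n))
    for i in range(n):
        A[i, (i - 1) % n] = A[i, (i + 1) % n] = 1.0
    A += np.eye(n)               # self-loops
    return A / A.sum(1, keepdims=True)  # row-normalize

def gcn_step(points, features, A_hat, W_feat, W_off):
    # One graph-convolution step: aggregate neighbour features,
    # then predict a 2-D coordinate offset per keypoint.
    h = np.tanh(A_hat @ features @ W_feat)  # message passing
    return points + h @ W_off, h            # updated (n, 2) keypoints

rng = np.random.default_rng(0)
n, f = 8, 4
points = rng.uniform(0, 1, (n, 2))    # initial keypoints (e.g. sampled from a box)
features = rng.normal(size=(n, f))    # stand-in for appearance/motion features
A_hat = normalized_adjacency(n)
W_feat = rng.normal(scale=0.1, size=(f, f))
W_off = rng.normal(scale=0.1, size=(f, 2))

for _ in range(3):                    # iterative refinement
    points, features = gcn_step(points, features, A_hat, W_feat, W_off)

print(points.shape)  # (8, 2)
```

In the paper's setting, the weights would be learned so that the predicted offsets move the keypoints onto the true region boundary, and the graph would span multiple frames rather than a single contour.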
Acknowledgements
This work was in part supported by JSPS KAKENHI 17K20143 and SenseTime Japan.
Additional information
Communicated by Moi Hoon Yap.
About this article
Cite this article
Xu, Y., Wu, Y., binti Zuraimi, N.S. et al. Video Region Annotation with Sparse Bounding Boxes. Int J Comput Vis 131, 717–731 (2023). https://doi.org/10.1007/s11263-022-01719-0