
Synthesizing Training Images for Semantic Segmentation

  • Conference paper

Part of the book series: Communications in Computer and Information Science (CCIS, volume 875)

Abstract

Semantic segmentation is one of the key problems in computer vision. Recently, Convolutional Neural Networks (CNNs) have achieved strong performance on the semantic segmentation task. However, CNNs require a large number of annotated training images, which are costly to obtain because pixel-level annotation demands massive human labour. In this paper, we propose to use 3D models to automatically generate synthetic images with pixel-level annotations. We take advantage of 3D models to generate synthetic images with high diversity in object appearance and background clutter, by randomly sampling rendering parameters and adding random background patterns. We then use the synthetic images, combined with publicly available real-world images, to augment the training set for semantic segmentation. Experimental results demonstrate that CNNs trained with our synthetic images achieve improved performance on the semantic segmentation task on the PASCAL VOC 2012 dataset.
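The pipeline described in the abstract (sample rendering parameters at random, render a 3D model, composite it onto a random background pattern, and derive the pixel-level label from the object silhouette) can be illustrated with a short script. The sketch below is not the authors' released code: render_model is a hypothetical stand-in for an actual rendering backend (e.g. a Blender or OpenGL script), and the sampling ranges are illustrative assumptions.

```python
import random

import numpy as np
from PIL import Image


def render_model(model_path, azimuth, elevation, distance, size):
    """Hypothetical renderer: returns an RGBA image of `size` whose alpha
    channel is the object silhouette. Replace with a real backend."""
    raise NotImplementedError


def synthesize(model_path, background_paths, class_id, size=(500, 375)):
    # Randomly sample rendering parameters for diversity in appearance.
    azimuth = random.uniform(0.0, 360.0)     # illustrative range
    elevation = random.uniform(-10.0, 40.0)  # illustrative range
    distance = random.uniform(1.5, 3.0)      # illustrative range
    rendered = render_model(model_path, azimuth, elevation, distance, size)

    # Add a random background pattern for clutter.
    background = Image.open(random.choice(background_paths))
    background = background.convert("RGB").resize(size)

    # Composite the rendered object using its alpha channel.
    alpha = rendered.split()[-1]
    background.paste(rendered, (0, 0), mask=alpha)

    # The pixel-level annotation comes for free: class_id on the
    # object silhouette, 0 (background) everywhere else.
    label = np.where(np.array(alpha) > 0, class_id, 0).astype(np.uint8)
    return background, Image.fromarray(label)
```

Because the label mask is derived from the same silhouette used for compositing, the annotation is pixel-exact at no extra labelling cost, which is the central appeal of rendering-based data augmentation.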



Acknowledgments

This work was partially supported by the National Natural Science Foundation of China (No. 61602139) and the Zhejiang Province Science and Technology Planning Project (2018C01030).

Author information

Corresponding author

Correspondence to Yunhui Zhang.


Copyright information

© 2018 Springer Nature Singapore Pte Ltd.

About this paper


Cite this paper

Zhang, Y., Wu, Z., Zhou, Z., Wang, Y. (2018). Synthesizing Training Images for Semantic Segmentation. In: Wang, Y., Jiang, Z., Peng, Y. (eds) Image and Graphics Technologies and Applications. IGTA 2018. Communications in Computer and Information Science, vol 875. Springer, Singapore. https://doi.org/10.1007/978-981-13-1702-6_22

  • DOI: https://doi.org/10.1007/978-981-13-1702-6_22

  • Publisher Name: Springer, Singapore

  • Print ISBN: 978-981-13-1701-9

  • Online ISBN: 978-981-13-1702-6

  • eBook Packages: Computer Science, Computer Science (R0)
