Abstract
Semantic image parsing, which refers to the process of decomposing images into semantic regions and constructing the structure representation of the input, has recently aroused widespread interest in the field of computer vision. The recent application of deep representation learning has driven this field into a new stage of development. In this paper, we summarize three aspects of the progress of research on semantic image parsing, i.e., category-level semantic segmentation, instance-level semantic segmentation, and beyond segmentation. Specifically, we first review the general frameworks for each task and introduce the relevant variants. The advantages and limitations of each method are also discussed. Moreover, we present a comprehensive comparison of different benchmark datasets and evaluation metrics. Finally, we explore the future trends and challenges of semantic image parsing.
Similar content being viewed by others
References
Zhao H S, Shi J P, Qi X J, Wang X G, Jia J Y. Pyramid scene parsing network. In: Proceedings of International Conference on Computer Vision and Pattern Recognition. 2017, 2881–2890
He K, Gkioxari G, Dollár P, Girshick R. Mask R-CNN. In: Proceedings of IEEE International Conference on Computation Vision. 2017, 2980–2988
Tu Z, Chen X, Yuille A L, Zhu S C. Image parsing: unifying segmentation, detection, and recognition. International Journal of Computer Vision, 2005, 63(2): 113–140
Tu Z, Zhu S C. Parsing images into region and curve processes. In: Proceedings of European Conference on Computer Vision. 2002, 393–407
Han F, Zhu S C. Bottom-up/top-down image parsing with attribute grammar. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2009, 31(1): 59–73
Lin L, Wang G, Zhang R, Zhang R, Liang X, Zuo W. Deep structured scene parsing by learning with image descriptions. In: Proceedings of IEEE Conference on Computer Vision and Pattern Recognition. 2016, 2276–2284
Lowe D G. Distinctive image features from scale-invariant keypoints. International Journal of Computer Vision, 2004, 60(2): 91–110
Dalal N, Triggs B. Histograms of oriented gradients for human detection. In: Proceedings of IEEE Computer Society Conference on Computer Vision and Pattern Recognition. 2005, 886–893
Ahonen T, Hadid A, Pietikäinen M. Face recognition with local binary patterns. In: Proceedings of European Conference on Computer Vision. 2004, 469–481
Liu Z, Li X, Luo P, Loy C C, Tang X. Semantic image segmentation via deep parsing network. In: Proceedings of IEEE International Conference on Computer Vision. 2015, 1377–1385
Chen L C, Papandreou G, Kokkinos I, Murphy K, Yuille A L. Deeplab: semantic image segmentation with deep convolutional nets, atrous convolution, and fully connected crfs. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2018, 40(4): 834–848
Long J, Shelhamer E, Darrell T. Fully convolutional networks for semantic segmentation. In: Proceedings of IEEE Conference on Computer Vision and Pattern Recognition. 2015, 3431–3440
Chen L C, Papandreou G, Kokkinos I, Murphy K, Yuille A L. Semantic image segmentation with deep convolutional nets and fully connected crfs. 2014, arXiv preprint arXiv:14127062
Peng C, Zhang X, Yu G, Luo G, Sun J. Large kernel matters-improve semantic segmentation by global convolutional network. 2017, arXiv preprint arXiv:170302719
Krizhevsky A, Sutskever I, Hinton G E. Imagenet classification with deep convolutional neural networks. In: Proceedings of Advances in Neural Information Processing Systems. 2012, 1097–1105
Socher R, Manning C D, Ng A Y. Learning continuous phrase representations and syntactic parsing with recursive neural networks. In: Proceedings of NIPS-2010 Deep Learning and Unsupervised Feature Learning Workshop. 2010, 1–9
Li Y, Qi H, Dai J, Ji X, Wei Y. Fully convolutional instance-aware semantic segmentation. 2016, arXiv preprint arXiv:161107709
Bengio Y. Deep learning of representations: looking forward. In: Proceedings of International Conference on Statistical Language and Speech Processing. 2013, 1–37
Bengio Y. Deep learning of representations for unsupervised and transfer learning. In: Proceedings of ICML Workshop on Unsupervised and Transfer Learning. 2012, 17–36
Bengio Y, Courville A, Vincent P. Representation learning: a review and new perspectives. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2013, 35(8): 1798–1828
LeCun Y, Boser B, Denker J S, Henderson D, Howard R E, Hubbard W, Jackel L D. Backpropagation applied to handwritten zip code recognition. Neural Computation, 1989, 1(4): 541–551
Dai J, He K, Li Y, Ren S, Sun J. Instance-sensitive fully convolutional networks. In: Proceedings of European Conference on Computer Vision. 2016, 534–549
Islam M A, Naha S, Rochan M, Bruce N, Wang Y. Label refinement network for coarse-to-fine semantic segmentation. 2017, arXiv preprint arXiv:170300551
Lipton Z C, Berkowitz J, Elkan C. A critical review of recurrent neural networks for sequence learning. 2015, arXiv preprint arXiv:150600019
LeCun Y, Bengio Y, Hinton G. Deep learning. Nature, 2015, 521(7553): 436–444
Liang X, Shen X, Xiang D, Feng J, Lin L, Yan S. Semantic object parsing with local-global long short-term memory. In: Proceedings of IEEE Conference on Computer Vision and Pattern Recognition. 2016, 3185–3193
Karpathy A, Li F F. Deep visual-semantic alignments for generating image descriptions. In: Proceedings of IEEE Conference on Computer Vision and Pattern Recognition. 2015, 3128–3137
Sutskever I, Vinyals O, Le Q V. Sequence to sequence learning with neural networks. In: Proceedings of Advances in Neural Information Processing Systems. 2014, 3104–3112
Li Z, Gan Y, Liang X, Yu Y, Cheng H, Lin L. LSTM-CF: unifying context modeling and fusion with LSTMS for RGB-D scene labeling. In: Proceedings of European Conference on Computer Vision. 2016, 541–557
Peng Z, Zhang R, Liang X, Liu X, Lin L. Geometric scene parsing with hierarchical LSTM. 2016, arXiv preprint arXiv:160401931
Byeon W, Breuel TM, Raue F, Liwicki M. Scene labeling with LSTM recurrent neural networks. In: Proceedings of IEEE Conference on Computer Vision and Pattern Recognition. 2015, 3547–3555
Liang X, Shen X, Feng J, Lin L, Yan S. Semantic object parsing with graph LSTM. In: Proceedings of European Conference on Computer Vision. 2016, 125–143
Liang X, Lin L, Shen X, Feng J, Yan S, Xing E P. Interpretable structure-evolving LSTM. In: Proceedings of IEEE Conference on Computer Vision and Pattern Recognition. 2017, 2175–2184
Zhang R, Yang W, Peng Z, Wang X, Lin L. Progressively diffused networks for semantic image segmentation. 2017, arXiv preprint arXiv:170205839
Elman J L. Distributed representations, simple recurrent networks, and grammatical structure. Machine Learning, 1991, 7(2-3): 195–225
Liu W, Rabinovich A, Berg A C. Parsenet: looking wider to see better. 2015, arXiv preprint arXiv:150604579
Szegedy C, Liu W, Jia Y, Sermanet P, Reed S, Anguelov D, Erhan D, Vanhoucke V, Rabinovich A. Going deeper with convolutions. In: Proceedings of IEEE Conference on Computer Vision and Pattern Recognition. 2015, 1–9
He K, Zhang X, Ren S, Sun J. Deep residual learning for image recognition. In: Proceedings of IEEE Conference on Computer Vision and Pattern Recognition. 2016, 770–778
Simonyan K, Zisserman A. Very deep convolutional networks for large-scale image recognition. 2014, arXiv preprint arXiv:14091556
Pinheiro P H O, Collobert R. Recurrent convolutional neural networks for scene labeling. In: Proceedings of International Conference on Machine Learning. 2014, 82–90
Graves A, Fernández S, Schmidhuber J. Multi-dimensional recurrent neural networks. In: Proceedings of the International Conference on Artificial Neural Networks. 2007, 549–558
Lin L, Huang L, Chen T, Gan Y, Cheng H. Knowledge-guided recurrent neural network learning for task-oriented action prediction. In: Proceedings of IEEE International Conference on Multimedia and Expo. 2017, 625–630
Farabet C, Couprie C, Najman L, LeCun Y. Learning hierarchical features for scene labeling. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2013, 35(8): 1915–1929
Gupta S, Girshick R, Arbeláez P, Malik J. Learning rich features from RGB-D images for object detection and segmentation. In: Proceedings of European Conference on Computer Vision. 2014, 345–360
Ning F, Delhomme D, LeCun Y, Piano F, Bottou L, Barbano P E. Toward automatic phenotyping of developing embryos from videos. IEEE Transactions on Image Processing, 2005, 14(9): 1360–1371
Liang X, Liu S, Shen X, Yang J, Liu L, Dong J, Lin L, Yan S. Deep human parsing with active template regression. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2015, 37(12): 2402–2414
Liang X, Xu C, Shen X, Yang J, Liu S, Tang J, Lin L, Yan S. Human parsing with contextualized convolutional neural network. In: Proceedings of IEEE International Conference on Computer Vision. 2015, 1386–1394
Krähenbühl P, Koltun V. Efficientcient inference in fully connected CRFS with gaussian edge potentials. In: Proceedings of Advances in Neural Information Processing Systems. 2011, 109–117
Noh H, Hong S, Han B. Learning deconvolution network for semantic segmentation. In: Proceedings of IEEE International Conference on Computer Vision. 2015, 1520–1528
Badrinarayanan V, Handa A, Cipolla R. Segnet: a deep convolutional encoder-decoder architecture for robust semantic pixel-wise labelling. 2015, arXiv preprint arXiv:150507293
Ronneberger O, Fischer P, Brox T. U-net: convolutional networks for biomedical image segmentation. In: Proceedings of International Conference on Medical Image Computing and Computer-Assisted Intervention. 2015, 234–241
Lin G, Milan A, Shen C, Reid I. Refinenet: multi-path refinement networks with identity mappings for high-resolution semantic segmentation. 2016, arXiv preprint arXiv:161106612
Chen L C, Papandreou G, Schroff F, Adam H. Rethinking atrous convolution for semantic image segmentation. 2017, arXiv preprint arXiv:170605587
Yu F, Koltun V. Multi-scale context aggregation by dilated convolutions. 2015, arXiv preprint arXiv:151107122
Li X, Liu Z, Luo P, Loy C C, Tang X. Not all pixels are equal: difficulty-aware semantic segmentation via deep layer cascade. 2017, arXiv preprint arXiv:170401344
Zhou Y, Xie L, Shen W, Wang Y, Fishman E K, Yuille A L. A fixedpoint model for pancreas segmentation in abdominal ct scans. In: Proceedings of International Conference on Medical Image Computing and Computer-Assisted Intervention. 2017, 693–701
Li Q, Wang J, Wipf D, Tu Z. Fixed-point model for structured labeling. In: Proceedings of International Conference on Machine Learning. 2013, 214–221
Wang G, Luo P, Lin L, Wang X. Learning object interactions and descriptions for semantic image segmentation. In: Proceedings of IEEE Conference on Computer Vision and Pattern Recognition. 2017, 5859–5867
Luo P, Wang G, Lin L, Wang X. Deep dual learning for semantic image segmentation. In: Proceedings of IEEE Conference on Computer Vision and Pattern Recognition. 2017, 2718–2726
Schwing A G, Urtasun R. Fully connected deep structured networks. 2015, arXiv preprint arXiv:150302351
Yang W, Luo P, Lin L. Clothing co-parsing by joint image segmentation and labeling. In: Proceedings of IEEE Conference on Computer Vision and Pattern Recognition. 2014, 3182–3189
Byeon W, Liwicki M, Breuel T M. Texture classification using 2D LSTM networks. In: Proceedings of International Conference on Pattern Recognition. 2014, 1144–1149
Eigen D, Fergus R. Predicting depth, surface normals and semantic labels with a common multi-scale convolutional architecture. In: Proceedings of IEEE International Conference on Computer Vision. 2015, 2650–2658
Girshick R, Donahue J, Darrell T, Malik J. Rich feature hierarchies for accurate object detection and semantic segmentation. In: Proceedings of IEEE Conference on Computer Vision and Pattern Recognition. 2014, 580–587
Reza M, Kosecka J. Reinforcement learning for semantic segmentation in indoor scenes. 2016, arXiv preprint arXiv:160601178
Van Oord A, Kalchbrenner N, Kavukcuoglu K. Pixel recurrent neural networks. In: Proceedings of International Conference on Machine Learning. 2016, 1747–1756
Kalchbrenner N, Danihelka I, Graves A. Grid long short-term memory. 2015, arXiv preprint arXiv:150701526
Hariharan B, Arbeláez P, Girshick R, Malik J. Simultaneous detection and segmentation. In: Proceedings of European Conference on Computer Vision. 2014, 297–312
Liang X, Wei Y, Shen X, Jie Z, Feng J, Lin L, Yan S. Reversible recursive instance-level object segmentation. In: Proceedings of IEEE Conference on Computer Vision and Pattern Recognition. 2016, 633–641
Liang X, Wei Y, Shen X, Yang J, Lin L, Yan S. Proposal-free network for instance-level object segmentation. 2015, arXiv preprint arXiv:150902636
Abtahi F, Zhu Z, Burry AM. A deep reinforcement learning approach to character segmentation of license plate images. In: Proceedings of International Conference on Machine Vision Applications. 2015, 539–542
Lin L, Wang K, Zuo W, Wang M, Luo J, Zhang L. A deep structured model with radius–margin bound for 3D human activity recognition. International Journal of Computer Vision, 2016, 118(2): 256–273
Hariharan B, Arbeláez P, Girshick R, Malik J. Hypercolumns for object segmentation and fine-grained localization. In: Proceedings of IEEE Conference on Computer Vision and Pattern Recognition. 2015, 447–456
Chen Y T, Liu X, Yang M H. Multi-instance object segmentation with occlusion handling. In: Proceedings of IEEE Conference on Computer Vision and Pattern Recognition. 2015, 3470–3478
Arbeláez P, Pont-Tuset J, Barron J T, Marques F, Malik J. Multiscale combinatorial grouping. In: Proceedings of IEEE Conference on Computer Vision and Pattern Recognition. 2014, 328–335
Li G, Xie Y, Lin L, Yu Y. Instance-level salient object segmentation. In: Proceedings of IEEE Conference on Computer Vision and Pattern Recognition. 2017, 247–256
Dai J, He K, Sun J. Instance-aware semantic segmentation via multitask network cascades. In: Proceedings of IEEE Conference on Computer Vision and Pattern Recognition. 2016, 3150–3158
Girshick R. Fast r-cnn. In: Proceedings of IEEE International Conference on Computer Vision. 2015, 1440–1448
He K, Zhang X, Ren S, Sun J. Spatial pyramid pooling in deep convolutional networks for visual recognition. In: Proceedings of European Conference on Computer Vision. 2014, 346–361
Ren S, He K, Girshick R, Sun J. Faster R-CNN: towards real-time object detection with region proposal networks. In: Proceedings of Advances in Neural Information Processing Systems. 2015, 91–99
Newell A, Huang Z, Deng J. Associative embedding: end-to-end learning for joint detection and grouping. In: Proceedings of Advances in Neural Information Processing Systems. 2017, 2274–2284
Harley A W, Derpanis K G, Kokkinos I. Learning dense convolutional embeddings for semantic segmentation. 2015, arXiv preprint arXiv:151104377
Fathi A, Wojna Z, Rathod V, Wang P, Song H O, Guadarrama S, Murphy K P. Semantic instance segmentation via deep metric learning. 2017, arXiv preprint arXiv:170310277
Yang L, Jin R. Distance metric learning: a comprehensive survey. Michigan State Universiy, 2006, 2(2): 1–51
Xu J, Schwing A G, Urtasun R. Tell me what you see and I will show you where it is. In: Proceedings of IEEE Conference on Computer Vision and Pattern Recognition. 2014, 3190–3197
Miller G A, Beckwith R, Fellbaum C, Gross D, Miller K J. Introduction to wordnet: an on-line lexical database. International Journal of Lexicography, 1990, 3(4): 235–244
Socher R, Bauer J, Manning C D, Ng A Y. Parsing with compositional vector grammars. In: Proceedings of Annual Meeting of the Association for Computational Linguistics. 2013, 455–465
Everingham M, Van Gool L, Williams C K, Winn J, Zisserman A. The pascal visual object classes (VOC) challenge. International Journal of Computer Vision, 2010, 88(2): 303–338
Chen X, Mottaghi R, Liu X, Fidler S, Urtasun R, Yuille A L. Detect what you can: detecting and representing objects using holistic models and body parts. In: Proceedings of IEEE Conference on Computer Vision and Pattern Recognition. 2014, 1971–1978
Russakovsky O, Deng J, Su H, Krause J, Satheesh S, Ma S, Huang Z, Karpathy A, Khosla A, Bernstein M. Imagenet large scale visual recognition challenge. International Journal of Computer Vision, 2015, 115(3): 211–252
Zhou B, Zhao H, Puig X, Fidler S, Barriuso A, Torralba A. Semantic understanding of scenes through the ADE20K dataset. 2016, arXiv preprint arXiv:160805442
Lin T Y, Maire M, Belongie S, Bourdev L, Girshick R, Hays J, Perona P, Ramanan D, Zitnick C L, Dollár P. Microsoft COCO: common objects in context. 2015, arXiv preprint arXiv:14050312v3
Lin T Y, Maire M, Belongie S, Hays J, Perona P, Ramanan D, Dollár P, Zitnick C L. Microsoft COCO: common objects in context. In: Proceedings of European Conference on Computer Vision. 2014, 740–755
Liu C, Yuen J, Torralba A. Nonparametric scene parsing: label transfer via dense scene alignment. In: Proceedings of IEEE Conference on Computer Vision and Pattern Recognition. 2009, 1972–1979
Silberman N, Hoiem D, Kohli P, Fergus R. Indoor segmentation and support inference from RGBD images. In: Proceedings of European Conference on Computer Vision. 2012, 746–760
Gupta S, Arbelaez P, Malik J. Perceptual organization and recognition of indoor scenes from RGB-D images. In: Proceedings of IEEE Conference on Computer Vision and Pattern Recognition. 2013, 564–571
Song S, Lichtenberg S P, Xiao J. Sun RGB-D: a RGB-D scene understanding benchmark suite. In: Proceedings of IEEE Conference on Computer Vision and Pattern Recognition. 2015, 567–576
Janoch A, Karayev S, Jia Y, Barron J T, Fritz M, Saenko K, Darrell T. A category-level 3D object dataset: putting the kinect to work. Consumer Depth Cameras for Computer Vision. London: Springer, 2013
Xiao J, Owens A, Torralba A. SUN3D: a database of big spaces reconstructed using SFM and object labels. In: Proceedings of IEEE International Conference on Computer Vision. 2013, 1625–1632
Yamaguchi K, Kiapour M H, Ortiz L E, Berg T L. Parsing clothing in fashion photographs. In: Proceedings of IEEE Conference on Computer Vision and Pattern Recognition. 2012, 3570–3577
Liu S, Feng J, Domokos C, Xu H, Huang J, Hu Z, Yan S. Fashion parsing with weak color-category labels. IEEE Transactions on Multimedia, 2014, 16(1): 253–265
Dong J, Chen Q, XiaW, Huang Z, Yan S. A deformable mixture parsing model with parselets. In: Proceedings of IEEE International Conference on Computer Vision. 2013, 3408–3415
Cordts M, Omran M, Ramos S, Rehfeld T, Enzweiler M, Benenson R, Franke U, Roth S, Schiele B. The cityscapes dataset for semantic urban scene understanding. In: Proceedings of IEEE Conference on Computer Vision and Pattern Recognition. 2016, 3213–3223
Acknowledgements
This work was supported by the National Science Fund for Excellent Young Scholars (61622214), the National Natural Science Foundation of China (Grant Nos. 61702565 and 61622214), Guangdong Natural Science Foundation Project for Research Teams (2017A030312006), and was also sponsored by CCF-Tencent Open Research Fund.
Author information
Authors and Affiliations
Corresponding author
Additional information
Lili Huang is a PhD student in Computer Science with the School of Data and Computer Science, Sun Yat-sen University, China. Before that, she received her BE and master degrees from School of Computer Science and Technology, Anhui University, China. Her current research interests lie in the fields of computer vision, machine learning, and cognitive science.
Jiefeng Peng received the BS degree from the school of mathematics, Sun Yat-sen University, China in 2016, where he is currently pursuing the Master degree in computer science at the School of data and computer science. His research interests include deep learning and computer vision.
Ruimao Zhang received the BE and PhD degrees from Sun Yat-sen University (SYSU), China in 2011 and 2016, respectively. From 2013 to 2014, he was a visiting PhD student with the Department of Computing, Hong Kong Polytechnic University (PolyU), Hong Kong, China. His research interests include computer vision, deep learning and related multimedia applications. He currently serves as a reviewer of several academic journals, including IEEE Trans. on Image Processing, IEEE Trans. on Circuits and Systems for Video Technology, IEEE Trans. on Information Forensics and Security, Pattern Recognition and Neurocomputing.
Ruimao Zhang received the BE and PhD degrees from Sun Yat-sen University (SYSU), China in 2011 and 2016, respectively. From 2013 to 2014, he was a visiting PhD student with the Department of Computing, Hong Kong Polytechnic University (PolyU), Hong Kong, China. His research interests include computer vision, deep learning and related multimedia applications. He currently serves as a reviewer of several academic journals, including IEEE Trans. on Image Processing, IEEE Trans. on Circuits and Systems for Video Technology, IEEE Trans. on Information Forensics and Security, Pattern Recognition and Neurocomputing.
Liang Lin is a full Professor of Sun Yatsen University. He is the Excellent Young Scientist of the National Natural Science Foundation of China. He received his BS and PhD degrees from the Beijing Institute of Technology (BIT), China in 2003 and 2008, respectively, and was a joint PhD student with the Department of Statistics, University of California, Los Angeles (UCLA). From 2008 to 2010, he was a post-doctoral fellow at UCLA. From 2014 to 2015, as a senior visiting scholar he was with The Hong Kong Polytechnic University, China and The Chinese University of Hong Kong, China. His research interests include Computer Vision, Data Analysis and Mining, and Intelligent Robotic Systems, etc. Dr. Lin has authorized and co-authorized on more than 100 papers in top-tier academic journals and conferences. He has been serving as an associate editor of IEEE Trans. Human-Machine Systems. He was the recipient of the Best Paper Runners-Up Award in ACM NPAR 2010, Google Faculty Award in 2012, Best Student Paper Award in IEEE ICME 2014, Hong Kong Scholars Award in 2014 and Best Paper Award in IEEE ICME 2017.
Electronic supplementary material
Rights and permissions
About this article
Cite this article
Huang, L., Peng, J., Zhang, R. et al. Learning deep representations for semantic image parsing: a comprehensive overview. Front. Comput. Sci. 12, 840–857 (2018). https://doi.org/10.1007/s11704-018-7195-8
Received:
Accepted:
Published:
Issue Date:
DOI: https://doi.org/10.1007/s11704-018-7195-8