Abstract
Despite notable progress in perceptual tasks such as detection, instance segmentation and human parsing, computers still perform unsatisfactorily in visually understanding humans in crowded scenes, which underpins applications such as group behavior analysis, person re-identification, e-commerce, media editing, video surveillance, autonomous driving and virtual reality. To perform well, models need to comprehensively perceive the semantic information and the differences between instances in a multi-human image, which is recently defined as the multi-human parsing task. In this paper, we first present a new large-scale database, “Multi-human Parsing (MHP v2.0)”, for algorithm development and evaluation, to advance research on understanding humans in crowded scenes. MHP v2.0 contains 25,403 elaborately annotated images with 58 fine-grained semantic category labels and 16 dense pose key-point labels, involving 2–26 persons per image captured in real-world scenes with various viewpoints, poses, occlusions, interactions and backgrounds. We further propose a novel deep Nested Adversarial Network (NAN) model for multi-human parsing. NAN consists of three Generative Adversarial Network-like sub-nets that respectively perform semantic saliency prediction, instance-agnostic parsing and instance-aware clustering. These sub-nets form a nested structure and are carefully designed to learn jointly in an end-to-end way. NAN consistently outperforms existing state-of-the-art solutions on our MHP dataset and several others, including MHP v1.0, PASCAL-Person-Part and Buffy. NAN serves as a strong baseline to shed light on generic instance-level semantic part prediction and to drive future research on multi-human parsing. With these innovations and contributions, we organized the CVPR 2018 Workshop on Visual Understanding of Humans in Crowd Scene (VUHCS 2018) and the Fine-Grained Multi-human Parsing and Pose Estimation Challenge, which together significantly benefit the community.
Code and pre-trained models are available at https://github.com/ZhaoJ9014/Multi-Human-Parsing_MHP.
Notes
The trainable parameters of each stage (each sub-net) are mainly learned through the losses of the corresponding stage. However, due to the nested structure, they can still be adjusted to some degree by the losses of subsequent stages during gradient back-propagation.
As existing instance segmentation methods only offer silhouettes of different person instances, for comparison, we combine them with our instance-agnostic parsing prediction to generate the final multi-human parsing results.
We adopt CRF as a post-processing step to refine the instance-agnostic parsing map by associating each pixel in the image with one of the semantic categories.
For each testing image, we calculate the pair-wise IoU of the instance bounding boxes and use the mean value as the interaction intensity of that image.
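A minimal sketch of this interaction-intensity measure: mean IoU over all unordered pairs of instance bounding boxes in an image. The (x1, y1, x2, y2) box convention and the helper names are our own illustration; the paper does not specify them.

```python
from itertools import combinations

def box_iou(a, b):
    """Intersection-over-union of two axis-aligned (x1, y1, x2, y2) boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0

def interaction_intensity(boxes):
    """Mean IoU over all unordered pairs of instance boxes in one image."""
    pairs = list(combinations(boxes, 2))
    if not pairs:  # fewer than two instances: no interaction to measure
        return 0.0
    return sum(box_iou(a, b) for a, b in pairs) / len(pairs)
```

Images whose instances overlap heavily thus receive a higher intensity, which allows test images to be grouped by how crowded their person instances are.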
Since Mask R-CNN only offers silhouettes of different person instances, we did not compare multi-human parsing speed with it.
The dataset is available at http://lv-mhp.github.io/.
The dataset is available at http://www.stat.ucla.edu/~xianjie.chen/pascal_part_dataset/pascal_part.html.
The dataset is available at https://www.inf.ethz.ch/personal/ladickyl/Buffy.zip.
References
Abadi, M., Barham, P., Chen, J., Chen, Z., Davis, A., Dean, J., Devin, M., Ghemawat, S., Irving, G., Isard, M. et al. (2016). Tensorflow: A system for large-scale machine learning. In OSDI (pp. 265–283).
Arbelaez, P., Maire, M., Fowlkes, C., & Malik, J. (2011). Contour detection and hierarchical image segmentation. T-PAMI, 33(5), 898–916.
Chen, X., Mottaghi, R., Liu, X., Fidler, S., Urtasun, R., & Yuille, A. (2014). Detect what you can: Detecting and representing objects using holistic models and body parts. In CVPR (pp. 1971–1978).
Chen, L.-C., Yang, Y., Wang, J., Xu, W., & Yuille, A. L. (2016). Attention to scale: Scale-aware semantic image segmentation. In CVPR (pp. 3640–3649).
Chu, X., Ouyang, W., Yang, W., & Wang, X. (2015). Multi-task recurrent neural network for immediacy prediction. In ICCV (pp. 3352–3360).
Collins, R. T., Lipton, A. J., Kanade, T., Fujiyoshi, H., Duggins, D., Tsin, Y., Tolliver, D., Enomoto, N., Hasegawa, O., Burt, P. et al. (2000). A system for video surveillance and monitoring. VSAM final report (pp. 1–68).
Cordts, M., Omran, M., Ramos, S., Rehfeld, T., Enzweiler, M., Benenson, R., Franke, U., Roth, S., & Schiele, B. (2016). The cityscapes dataset for semantic urban scene understanding. In CVPR (pp. 3213–3223).
Dai, J., He, K., & Sun, J. (2016). Instance-aware semantic segmentation via multi-task network cascades. In CVPR (pp. 3150–3158).
De Brabandere, B., Neven, D., & Van Gool, L. (2017). Semantic instance segmentation with a discriminative loss function. arXiv preprint arXiv:1708.02551.
Dollar, P., Wojek, C., Schiele, B., & Perona, P. (2012). Pedestrian detection: An evaluation of the state of the art. T-PAMI, 34(4), 743–761.
Everingham, M., Van Gool, L., Williams, C. K. I., Winn, J., & Zisserman, A. (2011). The PASCAL visual object classes challenge 2011 (VOC2011) results. Retrieved May 25, 2011 from http://www.pascal-network.org/challenges/VOC/voc2011/workshop/index.html.
Everingham, M., Eslami, S. A., Van Gool, L., Williams, C. K., Winn, J., & Zisserman, A. (2015). The PASCAL visual object classes challenge: A retrospective. International Journal of Computer Vision, 111(1), 98–136.
Everingham, M., Van Gool, L., Williams, C. K., Winn, J., & Zisserman, A. (2010). The PASCAL visual object classes (VOC) challenge. International Journal of Computer Vision, 88(2), 303–338.
Ferrari, V., Marin-Jimenez, M., & Zisserman, A. (2008). Progressive search space reduction for human pose estimation. In CVPR (pp. 1–8).
Gan, C., Lin, M., Yang, Y., de Melo, G., & Hauptmann, A. G. (2016). Concepts not alone: Exploring pairwise relationships for zero-shot video activity recognition. In AAAI (p. 3487).
Girshick, R. (2015). Fast R-CNN. arXiv preprint arXiv:1504.08083.
Gong, K., Liang, X., Shen, X., & Lin, L. (2017). Look into person: Self-supervised structure-sensitive learning and a new benchmark for human parsing. arXiv preprint arXiv:1703.05446.
Hariharan, B., Arbeláez, P., Girshick, R., & Malik, J. (2014). Simultaneous detection and segmentation. In ECCV (pp. 297–312).
He, K., Gkioxari, G., Dollár, P., & Girshick, R. (2017). Mask R-CNN. In ICCV (pp. 2980–2988).
He, K., Zhang, X., Ren, S., & Sun, J. (2016). Deep residual learning for image recognition. In CVPR (pp. 770–778).
Jiang, H., & Grauman, K. (2016). Detangling people: Individuating multiple close people and their body parts via region assembly. arXiv preprint arXiv:1604.03880.
Klare, B. F., Klein, B., Taborsky, E., Blanton, A., Cheney, J., Allen, K., Grother, P., Mah, A., & Jain, A. K. (2015). Pushing the frontiers of unconstrained face detection and recognition: IARPA Janus Benchmark A. In CVPR (pp. 1931–1939).
Lafferty, J., McCallum, A., & Pereira, F. C. (2001). Conditional random fields: Probabilistic models for segmenting and labeling sequence data. In ICML.
Li, Q., Arnab, A., & Torr, P. H. (2017a). Holistic, instance-level human parsing. arXiv preprint arXiv:1709.03612.
Li, G., Xie, Y., Lin, L., & Yu, Y. (2017b). Instance-level salient object segmentation. In CVPR (pp. 247–256).
Li, J., Zhao, J., Wei, Y., Lang, C., Li, Y., Sim, T., Yan, S., & Feng, J. (2017c). Multi-human parsing in the wild. arXiv preprint arXiv:1705.07206.
Liang, X., Wei, Y., Shen, X., Yang, J., Lin, L., & Yan, S. (2015a). Proposal-free network for instance-level object segmentation. arXiv preprint arXiv:1509.02636.
Liang, X., Xu, C., Shen, X., Yang, J., Liu, S., Tang, J., Lin, L., & Yan, S. (2015b). Human parsing with contextualized convolutional neural network. In ICCV (pp. 1386–1394).
Lin, J., Guo, X., Shao, J., Jiang, C., Zhu, Y., & Zhu, S.-C. (2016). A virtual reality platform for dynamic human-scene interaction. In SIGGRAPH (p. 11).
Lin, T.-Y., Maire, M., Belongie, S., Hays, J., Perona, P., Ramanan, D., Dollár, P., & Zitnick, C. L. (2014). Microsoft COCO: Common objects in context. In ECCV (pp. 740–755).
Liu, S., Wang, C., Qian, R., Yu, H., Bao, R., & Sun, Y. (2017). Surveillance video parsing with single frame supervision. In CVPRW (pp. 1–9).
Long, J., Shelhamer, E., & Darrell, T. (2015). Fully convolutional networks for semantic segmentation. In CVPR (pp. 3431–3440).
Ng, A. Y., Jordan, M. I., & Weiss, Y. (2002). On spectral clustering: Analysis and an algorithm. In NIPS (pp. 849–856).
Ren, S., He, K., Girshick, R., & Sun, J. (2015). Faster R-CNN: Towards real-time object detection with region proposal networks. In NIPS (pp. 91–99).
Russakovsky, O., Deng, J., Su, H., Krause, J., Satheesh, S., Ma, S., et al. (2015). Imagenet large scale visual recognition challenge. International Journal of Computer Vision, 115(3), 211–252.
Sapp, B., & Taskar, B. (2013). Modec: Multimodal decomposable models for human pose estimation. In CVPR (pp. 3674–3681).
Simonyan, K., & Zisserman, A. (2014). Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556.
Turban, E., King, D., Lee, J., & Viehland, D. (2002). Electronic commerce: A managerial perspective. Englewood Cliffs: Prentice Hall.
Vineet, V., Warrell, J., Ladicky, L., & Torr, P. H. (2011). Human instance segmentation from video using detector-based conditional random fields. In BMVC (Vol. 2, pp. 12–15).
Wu, Z., Shen, C., & Van Den Hengel, A. (2016). Wider or deeper: Revisiting the resnet model for visual recognition. arXiv preprint arXiv:1611.10080.
Xia, F., Wang, P., Chen, L.-C., & Yuille, A. L. (2016). Zoom better to see clearer: Human and object parsing with hierarchical auto-zoom net. In ECCV (pp. 648–663).
Xu, N., Price, B., Cohen, S., Yang, J., & Huang, T. S. (2016). Deep interactive object selection. In CVPR (pp. 373–381).
Yamaguchi, K., Kiapour, M. H., Ortiz, L. E., & Berg, T. L. (2012). Parsing clothing in fashion photographs. In CVPR (pp. 3570–3577).
Zhang, Z., Luo, P., Loy, C. C., & Tang, X. (2018). From facial expression recognition to interpersonal relation prediction. International Journal of Computer Vision, 126(5), 550–569.
Zhang, N., Paluri, M., Taigman, Y., Fergus, R., & Bourdev, L. (2015). Beyond frontal faces: Improving person recognition using multiple cues. In CVPR (pp. 4804–4813).
Zhao, J., Li, J., Cheng, Y., Sim, T., Yan, S., & Feng, J. (2018). Understanding humans in crowded scenes: Deep nested adversarial learning and a new benchmark for multi-human parsing. In ACM Multimedia Conference (pp. 792–800). ACM.
Zhao, J., Li, J., Nie, X., Zhao, F., Chen, Y., Wang, Z., Feng, J., & Yan, S. (2017). Self-supervised neural aggregation networks for human parsing. In CVPRW (pp. 7–15).
Zhao, R., Ouyang, W., & Wang, X. (2013). Unsupervised salience learning for person re-identification. In CVPR (pp. 3586–3593).
Acknowledgements
The work of Jian Zhao was partially supported by China Scholarship Council (CSC) Grant 201503170248. The work of Jiashi Feng was partially supported by NUS IDS R-263-000-C67-646, ECRA R-263-000-C87-133 and MOE Tier-II R-263-000-D17-112.
Additional information
Communicated by Li Liu, Matti Pietikäinen, Jie Qin, Jie Chen, Wanli Ouyang, Luc Van Gool.
Publisher's Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Cite this article
Zhao, J., Li, J., Liu, H. et al. Fine-Grained Multi-human Parsing. Int J Comput Vis 128, 2185–2203 (2020). https://doi.org/10.1007/s11263-019-01181-5