Abstract
Despite notable progress in perceptual tasks such as detection, instance segmentation and human parsing, computers still perform unsatisfactorily in visually understanding humans in crowded scenes, which underpins applications such as group behavior analysis, person re-identification, e-commerce, media editing, video surveillance, autonomous driving and virtual reality. To perform well, models need to comprehensively perceive the semantic information and the differences between instances in a multi-human image, which is recently defined as the multi-human parsing task. In this paper, we first present a new large-scale database, “Multi-human Parsing (MHP v2.0)”, for algorithm development and evaluation, to advance research on understanding humans in crowded scenes. MHP v2.0 contains 25,403 elaborately annotated images with 58 fine-grained semantic category labels and 16 dense pose key-point labels, involving 2–26 persons per image captured in real-world scenes with various viewpoints, poses, occlusions, interactions and backgrounds. We further propose a novel deep Nested Adversarial Network (NAN) model for multi-human parsing. NAN consists of three Generative Adversarial Network-like sub-nets that respectively perform semantic saliency prediction, instance-agnostic parsing and instance-aware clustering. These sub-nets form a nested structure and are carefully designed to learn jointly in an end-to-end way. NAN consistently outperforms existing state-of-the-art solutions on our MHP dataset and several others, including MHP v1.0, PASCAL-Person-Part and Buffy. NAN serves as a strong baseline to shed light on generic instance-level semantic part prediction and to drive future research on multi-human parsing. With these innovations and contributions, we organized the CVPR 2018 Workshop on Visual Understanding of Humans in Crowd Scene (VUHCS 2018) and the Fine-Grained Multi-human Parsing and Pose Estimation Challenge, which together significantly benefit the community.
Code and pre-trained models are available at https://github.com/ZhaoJ9014/Multi-Human-Parsing_MHP.
Notes
The trainable parameters of each stage (each sub-net) are mainly learned through the losses of the corresponding stage. However, due to the nested structure, they can still be adjusted to some degree by the losses of subsequent stages during gradient back-propagation.
As existing instance segmentation methods only offer silhouettes of different person instances, for comparison, we combine them with our instance-agnostic parsing prediction to generate the final multi-human parsing results.
We adopt CRF as a post-processing step to refine the instance-agnostic parsing map by associating each pixel in the image with one of the semantic categories.
For each testing image, we calculate the pair-wise IoU of the instance bounding boxes and use the mean value as the interaction intensity of that image.
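A minimal sketch of this interaction-intensity measure: mean IoU over all unordered pairs of instance bounding boxes in an image. The (x1, y1, x2, y2) box convention and the helper names are our own illustration; the paper does not specify them.

```python
from itertools import combinations

def box_iou(a, b):
    """Intersection-over-union of two axis-aligned (x1, y1, x2, y2) boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0

def interaction_intensity(boxes):
    """Mean IoU over all unordered pairs of instance boxes in one image."""
    pairs = list(combinations(boxes, 2))
    if not pairs:  # fewer than two instances: no interaction to measure
        return 0.0
    return sum(box_iou(a, b) for a, b in pairs) / len(pairs)
```

Images whose instances overlap heavily thus receive a higher intensity, which allows test images to be grouped by how crowded their person instances are.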
Since Mask R-CNN only offers silhouettes of different person instances, we did not compare multi-human parsing speed with it.
The dataset is available at http://lv-mhp.github.io/.
The dataset is available at http://www.stat.ucla.edu/~xianjie.chen/pascal_part_dataset/pascal_part.html.
The dataset is available at https://www.inf.ethz.ch/personal/ladickyl/Buffy.zip.
References
Abadi, M., Barham, P., Chen, J., Chen, Z., Davis, A., Dean, J., Devin, M., Ghemawat, S., Irving, G., Isard, M. et al. (2016). Tensorflow: A system for large-scale machine learning. In OSDI (pp. 265–283).
Arbelaez, P., Maire, M., Fowlkes, C., & Malik, J. (2011). Contour detection and hierarchical image segmentation. T-PAMI, 33(5), 898–916.
Chen, X., Mottaghi, R., Liu, X., Fidler, S., Urtasun, R., & Yuille, A. (2014). Detect what you can: Detecting and representing objects using holistic models and body parts. In CVPR (pp. 1971–1978).
Chen, L.-C., Yang, Y., Wang, J., Xu, W., & Yuille, A. L. (2016). Attention to scale: Scale-aware semantic image segmentation. In CVPR (pp. 3640–3649).
Chu, X., Ouyang, W., Yang, W., & Wang, X. (2015). Multi-task recurrent neural network for immediacy prediction. In ICCV (pp. 3352–3360).
Collins, R. T., Lipton, A. J., Kanade, T., Fujiyoshi, H., Duggins, D., Tsin, Y., Tolliver, D., Enomoto, N., Hasegawa, O., Burt, P. et al. (2000). A system for video surveillance and monitoring. VSAM final report (pp. 1–68).
Cordts, M., Omran, M., Ramos, S., Rehfeld, T., Enzweiler, M., Benenson, R., Franke, U., Roth, S., & Schiele, B. (2016). The cityscapes dataset for semantic urban scene understanding. In CVPR (pp. 3213–3223).
Dai, J., He, K., & Sun, J. (2016). Instance-aware semantic segmentation via multi-task network cascades. In CVPR (pp. 3150–3158).
De Brabandere, B., Neven, D., & Van Gool, L. (2017). Semantic instance segmentation with a discriminative loss function. arXiv preprint arXiv:1708.02551.
Dollar, P., Wojek, C., Schiele, B., & Perona, P. (2012). Pedestrian detection: An evaluation of the state of the art. T-PAMI, 34(4), 743–761.
Everingham, M., Van Gool, L., Williams, C. K. I., Winn, J., & Zisserman, A. (2011). The PASCAL visual object classes challenge 2011 (VOC2011) results. Retrieved May 25, 2011 from http://www.pascal-network.org/challenges/VOC/voc2011/workshop/index.html.
Everingham, M., Eslami, S. A., Van Gool, L., Williams, C. K., Winn, J., & Zisserman, A. (2015). The PASCAL visual object classes challenge: A retrospective. International Journal of Computer Vision, 111(1), 98–136.
Everingham, M., Van Gool, L., Williams, C. K., Winn, J., & Zisserman, A. (2010). The PASCAL visual object classes (VOC) challenge. International Journal of Computer Vision, 88(2), 303–338.
Ferrari, V., Marin-Jimenez, M., & Zisserman, A. (2008). Progressive search space reduction for human pose estimation. In CVPR (pp. 1–8).
Gan, C., Lin, M., Yang, Y., de Melo, G., & Hauptmann, A. G. (2016). Concepts not alone: Exploring pairwise relationships for zero-shot video activity recognition. In AAAI (p. 3487).
Girshick, R. (2015). Fast R-CNN. arXiv preprint arXiv:1504.08083.
Gong, K., Liang, X., Shen, X., & Lin, L. (2017). Look into person: Self-supervised structure-sensitive learning and a new benchmark for human parsing. arXiv preprint arXiv:1703.05446.
Hariharan, B., Arbeláez, P., Girshick, R., & Malik, J. (2014). Simultaneous detection and segmentation. In ECCV (pp. 297–312).
He, K., Gkioxari, G., Dollár, P., & Girshick, R. (2017). Mask R-CNN. In ICCV (pp. 2980–2988).
He, K., Zhang, X., Ren, S., & Sun, J. (2016). Deep residual learning for image recognition. In CVPR (pp. 770–778).
Jiang, H., & Grauman, K. (2016). Detangling people: Individuating multiple close people and their body parts via region assembly. arXiv preprint arXiv:1604.03880.
Klare, B. F., Klein, B., Taborsky, E., Blanton, A., Cheney, J., Allen, K., Grother, P., Mah, A., & Jain, A. K. (2015). Pushing the frontiers of unconstrained face detection and recognition: IARPA Janus Benchmark A. In CVPR (pp. 1931–1939).
Lafferty, J., McCallum, A., & Pereira, F. C. (2001). Conditional random fields: Probabilistic models for segmenting and labeling sequence data. In ICML.
Li, Q., Arnab, A., & Torr, P. H. (2017a). Holistic, instance-level human parsing. arXiv preprint arXiv:1709.03612.
Li, G., Xie, Y., Lin, L., & Yu, Y. (2017b). Instance-level salient object segmentation. In CVPR (pp. 247–256).
Li, J., Zhao, J., Wei, Y., Lang, C., Li, Y., Sim, T., Yan, S., & Feng, J. (2017c). Multi-human parsing in the wild. arXiv preprint arXiv:1705.07206.
Liang, X., Wei, Y., Shen, X., Yang, J., Lin, L., & Yan, S. (2015a). Proposal-free network for instance-level object segmentation. arXiv preprint arXiv:1509.02636.
Liang, X., Xu, C., Shen, X., Yang, J., Liu, S., Tang, J., Lin, L., & Yan, S. (2015b). Human parsing with contextualized convolutional neural network. In ICCV (pp. 1386–1394).
Lin, J., Guo, X., Shao, J., Jiang, C., Zhu, Y., & Zhu, S.-C. (2016). A virtual reality platform for dynamic human-scene interaction. In SIGGRAPH (p. 11).
Lin, T.-Y., Maire, M., Belongie, S., Hays, J., Perona, P., Ramanan, D., Dollár, P., & Zitnick, C. L. (2014). Microsoft COCO: Common objects in context. In ECCV (pp. 740–755).
Liu, S., Wang, C., Qian, R., Yu, H., Bao, R., & Sun, Y. (2017). Surveillance video parsing with single frame supervision. In CVPRW (pp. 1–9).
Long, J., Shelhamer, E., & Darrell, T. (2015). Fully convolutional networks for semantic segmentation. In CVPR (pp. 3431–3440).
Ng, A. Y., Jordan, M. I., & Weiss, Y. (2002). On spectral clustering: Analysis and an algorithm. In NIPS (pp. 849–856).
Ren, S., He, K., Girshick, R., & Sun, J. (2015). Faster R-CNN: Towards real-time object detection with region proposal networks. In NIPS (pp. 91–99).
Russakovsky, O., Deng, J., Su, H., Krause, J., Satheesh, S., Ma, S., et al. (2015). Imagenet large scale visual recognition challenge. International Journal of Computer Vision, 115(3), 211–252.
Sapp, B., & Taskar, B. (2013). Modec: Multimodal decomposable models for human pose estimation. In CVPR (pp. 3674–3681).
Simonyan, K., & Zisserman, A. (2014). Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556.
Turban, E., King, D., Lee, J., & Viehland, D. (2002). Electronic commerce: A managerial perspective. Englewood Cliffs: Prentice Hall.
Vineet, V., Warrell, J., Ladicky, L., & Torr, P. H. (2011). Human instance segmentation from video using detector-based conditional random fields. In BMVC (Vol. 2, pp. 12–15).
Wu, Z., Shen, C., & Van Den Hengel, A. (2016). Wider or deeper: Revisiting the resnet model for visual recognition. arXiv preprint arXiv:1611.10080.
Xia, F., Wang, P., Chen, L.-C., & Yuille, A. L. (2016). Zoom better to see clearer: Human and object parsing with hierarchical auto-zoom net. In ECCV (pp. 648–663).
Xu, N., Price, B., Cohen, S., Yang, J., & Huang, T. S. (2016). Deep interactive object selection. In CVPR (pp. 373–381).
Yamaguchi, K., Kiapour, M. H., Ortiz, L. E., & Berg, T. L. (2012). Parsing clothing in fashion photographs. In CVPR (pp. 3570–3577).
Zhang, Z., Luo, P., Loy, C. C., & Tang, X. (2018). From facial expression recognition to interpersonal relation prediction. International Journal of Computer Vision, 126(5), 550–569.
Zhang, N., Paluri, M., Taigman, Y., Fergus, R., & Bourdev, L. (2015). Beyond frontal faces: Improving person recognition using multiple cues. In CVPR (pp. 4804–4813).
Zhao, J., Li, J., Cheng, Y., Sim, T., Yan, S., & Feng, J. (2018). Understanding humans in crowded scenes: Deep nested adversarial learning and a new benchmark for multi-human parsing. In ACM Multimedia Conference (pp. 792–800). ACM.
Zhao, J., Li, J., Nie, X., Zhao, F., Chen, Y., Wang, Z., Feng, J., & Yan, S. (2017). Self-supervised neural aggregation networks for human parsing. In CVPRW (pp. 7–15).
Zhao, R., Ouyang, W., & Wang, X. (2013). Unsupervised salience learning for person re-identification. In CVPR (pp. 3586–3593).
Acknowledgements
The work of Jian Zhao was partially supported by China Scholarship Council (CSC) Grant 201503170248. The work of Jiashi Feng was partially supported by NUS IDS R-263-000-C67-646, ECRA R-263-000-C87-133 and MOE Tier-II R-263-000-D17-112.
Additional information
Communicated by Li Liu, Matti Pietikäinen, Jie Qin, Jie Chen, Wanli Ouyang, Luc Van Gool.
Publisher's Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Cite this article
Zhao, J., Li, J., Liu, H. et al. Fine-Grained Multi-human Parsing. Int J Comput Vis 128, 2185–2203 (2020). https://doi.org/10.1007/s11263-019-01181-5