Learning Semantic Neural Tree for Human Parsing

Conference paper
Part of the Lecture Notes in Computer Science book series (LNCS, volume 12358)

Abstract

In this paper, we design a novel semantic neural tree for human parsing, which uses a tree architecture to encode the physiological structure of the human body and generates accurate results through a coarse-to-fine process in a cascade manner. Specifically, the semantic neural tree segments human regions into multiple semantic sub-regions (e.g., face, arms, and legs) in a hierarchical way using a newly designed attention routing module. Meanwhile, we introduce a semantic aggregation module that combines multiple hierarchical features to exploit more context information for better performance. Our semantic neural tree can be trained in an end-to-end fashion by standard stochastic gradient descent (SGD) with back-propagation. Experiments conducted on four challenging datasets for both single and multiple human parsing, i.e., LIP, PASCAL-Person-Part, CIHP, and MHP-v2, demonstrate the effectiveness of the proposed method. Code can be found at https://isrc.iscas.ac.cn/gitlab/research/sematree.
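
To make the coarse-to-fine pipeline concrete, the following is a minimal PyTorch sketch of the two components named above: an attention routing module that softly routes a parent node's features to its child sub-region branches, and a semantic aggregation module that fuses features collected across tree levels before the per-pixel prediction. The module names, the 1x1-convolution gating, and all shapes are our own illustrative assumptions, not the authors' implementation; refer to the linked repository for the actual code.

    # Illustrative sketch only; hypothetical module design, not the paper's code.
    import torch
    import torch.nn as nn

    class AttentionRouting(nn.Module):
        """Predicts soft spatial attention maps that route a parent node's
        features to its child sub-region branches (e.g., upper/lower body)."""

        def __init__(self, channels: int, num_children: int):
            super().__init__()
            # 1x1 convolution producing one gating map per child branch.
            self.gate = nn.Conv2d(channels, num_children, kernel_size=1)

        def forward(self, x):
            # Softmax over the child dimension makes the routing soft,
            # so every pixel distributes its feature across children.
            attn = torch.softmax(self.gate(x), dim=1)          # (B, K, H, W)
            return [x * attn[:, k:k + 1] for k in range(attn.size(1))]

    class SemanticAggregation(nn.Module):
        """Fuses hierarchical features from all tree levels to exploit
        more context before the final per-pixel classification."""

        def __init__(self, channels: int, num_levels: int, num_classes: int):
            super().__init__()
            self.fuse = nn.Conv2d(channels * num_levels, channels, kernel_size=1)
            self.classify = nn.Conv2d(channels, num_classes, kernel_size=1)

        def forward(self, feats):
            # Concatenate features gathered across tree levels, then predict.
            fused = torch.relu(self.fuse(torch.cat(feats, dim=1)))
            return self.classify(fused)                        # (B, C, H, W) logits

    # Usage: route backbone features into two child branches, then aggregate.
    feat = torch.randn(2, 64, 32, 32)                          # backbone features
    upper, lower = AttentionRouting(channels=64, num_children=2)(feat)
    agg = SemanticAggregation(channels=64, num_levels=3, num_classes=20)
    logits = agg([feat, upper, lower])                         # (2, 20, 32, 32)

Since every operation above is differentiable, a tree assembled from such nodes can be trained end-to-end with standard SGD and a per-pixel cross-entropy loss, consistent with the training procedure described in the abstract.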

Keywords

Human parsing · Semantic neural tree · Semantic segmentation

Acknowledgements

This work was supported by the Key Research Program of Frontier Sciences, CAS, Grant No. ZDBS-LY-JSC038, and the National Natural Science Foundation of China, Grant No. 61807033. Libo Zhang was supported by the Youth Innovation Promotion Association, CAS (2020111), and the Outstanding Youth Scientist Project of ISCAS.

Supplementary material

504454_1_En_13_MOESM1_ESM.pdf — Supplementary material 1 (PDF, 2.4 MB)

Copyright information

© Springer Nature Switzerland AG 2020

Authors and Affiliations

1. Institute of Software, Chinese Academy of Sciences, Beijing, China
2. University of Chinese Academy of Sciences, Beijing, China
3. University at Albany, State University of New York, Albany, USA
4. JD Finance America Corporation, Mountain View, USA
5. Tencent Youtu Lab, Shanghai, China
