ViTAA: Visual-Textual Attributes Alignment in Person Search by Natural Language

Conference paper
Part of the Lecture Notes in Computer Science book series (LNCS, volume 12357)

Abstract

Person search by natural language aims to retrieve a specific person from a large-scale image pool given a textual description. Whereas most current methods treat the task as holistic visual-textual feature matching, we approach it from an attribute-alignment perspective that grounds specific attribute phrases in their corresponding visual regions. This yields more robust feature learning and a performance boost, since the referred identity can be accurately pinned down by multiple attribute cues. Concretely, our Visual-Textual Attribute Alignment model (dubbed ViTAA) learns to disentangle the feature space of a person into attribute-specific sub-spaces using a lightweight auxiliary attribute segmentation layer, and then aligns these visual features with the textual attributes parsed from the sentences via a novel contrastive learning loss. We validate ViTAA through extensive experiments on person search by natural language and by attribute-phrase queries, on which it achieves state-of-the-art performance. Code and models are available at https://github.com/Jarr0d/ViTAA.
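To make the attribute-alignment idea concrete, below is a minimal sketch of what a cross-modal contrastive loss over one attribute sub-space could look like. All names (`attribute_alignment_loss`, `visual_feats`, `text_feats`, the temperature value) are illustrative assumptions, not the paper's actual implementation; the authors' code is in the linked repository.

```python
import torch
import torch.nn.functional as F

def attribute_alignment_loss(visual_feats, text_feats, labels, temperature=0.1):
    """Contrastive loss aligning visual and textual features of one attribute.

    visual_feats: (N, D) features for one attribute sub-space (e.g. upper-body
                  clothing), pooled from the segmented person regions.
    text_feats:   (N, D) embeddings of the attribute phrases parsed from the
                  query sentences.
    labels:       (N,) identity labels; pairs sharing an identity are positives.
    """
    v = F.normalize(visual_feats, dim=1)
    t = F.normalize(text_feats, dim=1)
    sim = v @ t.T / temperature                              # (N, N) cross-modal similarities
    pos_mask = (labels.unsqueeze(0) == labels.unsqueeze(1)).float()

    # For each visual anchor, pull its positive text features closer
    # and push all other (negative) text features away.
    log_prob = F.log_softmax(sim, dim=1)
    loss = -(log_prob * pos_mask).sum(1) / pos_mask.sum(1).clamp(min=1)
    return loss.mean()
```

In the full model, one such loss term would be computed per attribute sub-space (head, upper body, lower body, etc.) and summed, so that each attribute phrase is matched against its own visual region rather than the whole-body feature.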

Keywords

Person search by natural language · Person re-identification · Vision and language · Metric learning

Notes

Acknowledgements

Visiting scholarship support for Z. Wang from the China Scholarship Council (#201806020020) and Amazon AWS Machine Learning Research Award (MLRA) support are greatly appreciated. Any opinions, findings, and conclusions or recommendations expressed in this material are those of the authors and do not necessarily reflect the views of the sponsors.

Supplementary material

Supplementary material 1: 504453_1_En_24_MOESM1_ESM.pdf (867 KB)


Copyright information

© Springer Nature Switzerland AG 2020

Authors and Affiliations

  1. Beihang University, Beijing, China
  2. Arizona State University, Tempe, USA