Abstract
Objects are usually associated with multiple attributes, and these attributes often exhibit high correlations. Modeling the complex relationships between attributes poses a great challenge for multi-attribute learning. This paper proposes a simple yet generic framework named Label2Label to exploit the complex attribute correlations. Label2Label is the first attempt to approach multi-attribute prediction from the perspective of language modeling. Specifically, it treats each attribute label as a “word” describing the sample. As each sample is annotated with multiple attribute labels, these “words” naturally form an unordered but meaningful “sentence”, which depicts the semantic information of the corresponding sample. Inspired by the remarkable success of pre-trained language models in NLP, Label2Label introduces an image-conditioned masked language model, which randomly masks some of the “word” tokens in the label “sentence” and aims to recover them based on the masked “sentence” and the context conveyed by image features. Our intuition is that the instance-wise attribute relations are well grasped if the neural net can infer the missing attributes based on the context and the remaining attribute hints. Label2Label is conceptually simple and empirically powerful. Without incorporating task-specific prior knowledge or highly specialized network designs, our approach achieves state-of-the-art results on three different multi-attribute learning tasks, compared to highly customized domain-specific methods. Code is available at https://github.com/Li-Wanhua/Label2Label.
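The masking step described above can be sketched as follows. This is a minimal, hypothetical illustration (not the authors' implementation): each sample's attribute labels form an unordered "sentence", a random subset of the "words" is replaced by a mask token, and the model would then be trained to recover the masked words conditioned on the remaining words and the image features. The attribute names and the `mask_label_sentence` helper are invented here for illustration only.

```python
import random

MASK = "[MASK]"

def mask_label_sentence(labels, mask_ratio=0.3, rng=None):
    """Randomly replace a fraction of attribute 'words' with a mask token.

    Returns the masked 'sentence' and the indices whose original words
    the model must recover.
    """
    rng = rng or random.Random(0)  # fixed seed for a reproducible sketch
    n_mask = max(1, int(len(labels) * mask_ratio))
    masked_idx = sorted(rng.sample(range(len(labels)), n_mask))
    masked = [MASK if i in masked_idx else w for i, w in enumerate(labels)]
    return masked, masked_idx

# Toy face-attribute "sentence": each label is a "word" describing the sample.
sentence = ["smiling", "young", "no_beard", "wearing_hat", "male"]
masked, targets = mask_label_sentence(sentence, mask_ratio=0.4)
# A Label2Label-style model would predict sentence[i] for each i in `targets`,
# conditioned on `masked` and on features extracted from the image.
```

The prediction step itself (conditioning on image features) would require a trained network; the sketch only shows how label "sentences" could be corrupted to create the recovery task.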
Acknowledgments
This work was supported in part by the National Key Research and Development Program of China under Grant 2017YFA0700802, in part by the National Natural Science Foundation of China under Grant 62125603 and Grant U1813218, and in part by a grant from the Beijing Academy of Artificial Intelligence (BAAI). The authors sincerely thank Yongming Rao and Zhiheng Li for their generous help.
Copyright information
© 2022 The Author(s), under exclusive license to Springer Nature Switzerland AG
Cite this paper
Li, W., Cao, Z., Feng, J., Zhou, J., Lu, J. (2022). Label2Label: A Language Modeling Framework for Multi-attribute Learning. In: Avidan, S., Brostow, G., Cissé, M., Farinella, G.M., Hassner, T. (eds) Computer Vision – ECCV 2022. ECCV 2022. Lecture Notes in Computer Science, vol 13672. Springer, Cham. https://doi.org/10.1007/978-3-031-19775-8_33
Print ISBN: 978-3-031-19774-1
Online ISBN: 978-3-031-19775-8