Attributes as Operators: Factorizing Unseen Attribute-Object Compositions

  • Tushar Nagarajan
  • Kristen Grauman
Conference paper
Part of the Lecture Notes in Computer Science book series (LNCS, volume 11205)


We present a new approach to modeling visual attributes. Prior work casts attributes in a similar role as objects, learning a latent representation where properties (e.g., sliced) are recognized by classifiers much in the way objects (e.g., apple) are. However, this common approach fails to separate the attributes observed during training from the objects with which they are composed, making it ineffectual when encountering new attribute-object compositions. Instead, we propose to model attributes as operators. Our approach learns a semantic embedding that explicitly factors out attributes from their accompanying objects, and also benefits from novel regularizers expressing attribute operators’ effects (e.g., blunt should undo the effects of sharp). Not only does our approach align conceptually with the linguistic role of attributes as modifiers, but it also generalizes to recognize unseen compositions of objects and attributes. We validate our approach on two challenging datasets and demonstrate significant improvements over the state of the art. In addition, we show that not only can our model recognize unseen compositions robustly in an open-world setting, it can also generalize to compositions where objects themselves were unseen during training.
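To make the "attributes as operators" idea concrete, here is a minimal NumPy sketch, not the paper's implementation: object vectors and attribute matrices are random stand-ins for parameters the paper learns end-to-end, and the names (`objects`, `attrs`, `compose`, `inverse_penalty`) are ours. An attribute is a matrix that transforms an object's embedding, so unseen attribute-object compositions can be embedded by reusing the two learned factors; the antonym regularizer (blunt undoing sharp) is illustrated by making one operator the exact inverse of the other.

```python
import numpy as np

rng = np.random.default_rng(0)
D = 8  # embedding dimension (illustrative only)

# Toy object vectors and attribute operator matrices (random stand-ins
# for parameters that would be learned from data).
objects = {"apple": rng.standard_normal(D), "knife": rng.standard_normal(D)}
attrs = {"sharp": rng.standard_normal((D, D))}
# For the antonym regularizer below, make "blunt" exactly undo "sharp".
attrs["blunt"] = np.linalg.inv(attrs["sharp"])

def compose(attr, obj):
    # The attribute acts as an operator: a matrix-vector product maps
    # the object embedding to the composition embedding.
    return attrs[attr] @ objects[obj]

def inverse_penalty(a, a_inv, obj):
    # Regularizer term: applying an attribute and then its antonym
    # should return the object vector to (near) its original state.
    v = objects[obj]
    return float(np.linalg.norm(attrs[a_inv] @ (attrs[a] @ v) - v) ** 2)

z = compose("sharp", "apple")   # embedding for a possibly unseen composition
penalty = inverse_penalty("sharp", "blunt", "apple")  # ~0 when blunt inverts sharp
```

In the paper, such composition embeddings are matched against image features in a shared space (trained with a ranking-style objective), so recognition of an unseen pair reduces to comparing its composed embedding with the image embedding.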



This research is supported in part by ONR PECASE N00014-15-1-2291 and an Amazon AWS Machine Learning Research Award. We gratefully acknowledge Facebook for a GPU donation.

Supplementary material

474172_1_En_11_MOESM1_ESM.pdf — Supplementary material 1 (PDF, 3.2 MB)



Copyright information

© Springer Nature Switzerland AG 2018

Authors and Affiliations

  1. The University of Texas at Austin, Austin, USA
  2. Facebook AI Research, Austin, USA
