Zero-Shot Object Detection: Joint Recognition and Localization of Novel Concepts

Abstract

Zero-shot learning (ZSL) identifies unseen objects for which no training images are available. Conventional ZSL approaches are restricted to a recognition setting where each test image is categorized into one of several unseen object classes. We posit that this setting is ill-suited for real-world applications where unseen objects appear only as part of a complete scene, warranting both ‘recognition’ and ‘localization’ of the unseen category. To address this limitation, we introduce a new ‘Zero-Shot Detection’ (ZSD) problem setting, which aims at simultaneously recognizing and locating object instances belonging to novel categories, without any training samples. We introduce an integrated solution to the ZSD problem that jointly models the complex interplay between visual and semantic domain information. Ours is an end-to-end trainable deep network for ZSD that effectively overcomes the noise in the unsupervised semantic descriptions. To this end, we utilize the concept of meta-classes to design an original loss function that achieves synergy between max-margin class separation and semantic domain clustering. To set a benchmark for ZSD, we propose an experimental protocol for the large-scale ILSVRC dataset that reflects practical challenges, e.g., rare classes are more likely to be the unseen ones. Furthermore, we present a baseline approach extended from conventional recognition to the ZSD setting. Our extensive experiments show a significant boost in performance (in terms of mAP and Recall) on the important yet difficult ZSD problem on the ImageNet detection, MSCOCO and FashionZSD datasets.
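The abstract names a loss that combines max-margin class separation with meta-class clustering but does not give its form. The following is only a minimal illustrative sketch of how such a combination could look, not the paper's formulation: the function names, the softplus-style pairwise margin term, and the equal weighting of the two averages are all assumptions made for this example.

```python
import math


def softplus(x):
    # Numerically stable log(1 + exp(x)).
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))


def meta_class_margin_loss(scores, true_class, meta_class_of):
    """Hypothetical loss: pairwise soft-margin terms between the true class
    and every other class, averaged separately over classes outside and
    inside the true class's meta-class, then summed."""
    true_score = scores[true_class]
    true_meta = meta_class_of[true_class]
    inside, outside = [], []
    for name, score in scores.items():
        if name == true_class:
            continue
        # The term grows when a competing class scores above the true class.
        term = softplus(score - true_score)
        if meta_class_of[name] == true_meta:
            inside.append(term)
        else:
            outside.append(term)
    loss = 0.0
    if outside:
        loss += sum(outside) / len(outside)
    if inside:
        loss += sum(inside) / len(inside)
    return loss


# Toy usage: "cat" and "dog" share a meta-class, "car" does not (assumed data).
meta = {"cat": 0, "dog": 0, "car": 1}
confident = meta_class_margin_loss({"cat": 5.0, "dog": 1.0, "car": -2.0}, "cat", meta)
confused = meta_class_margin_loss({"cat": -2.0, "dog": 1.0, "car": 5.0}, "cat", meta)
```

In this sketch the loss is small when the true class outscores its competitors and large otherwise. Averaging the two groups separately keeps a small meta-class from being swamped by the much larger set of classes outside it.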

Notes

  1. Meta-classes are obtained by clustering semantically similar classes.

  2. We acknowledge, however, that Recall@100 remains an appropriate measure for large-scale datasets that are not fully labeled (such as Visual Genome; see Sect. 5.5).
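Note 1 says that meta-classes come from clustering semantically similar classes. As a rough sketch of that idea, the snippet below groups class names by running a small spherical k-means over their semantic vectors. The choice of algorithm, the farthest-point initialization, the helper names and the toy three-dimensional vectors are all assumptions for illustration; the page does not state which clustering method or embeddings the authors use.

```python
import math


def normalize(v):
    # Scale a vector to unit length; leave an all-zero vector unchanged.
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v] if norm > 0 else list(v)


def cosine(a, b):
    # Dot product; equals cosine similarity for unit-length inputs.
    return sum(x * y for x, y in zip(a, b))


def cluster_meta_classes(class_vectors, k, iterations=10):
    """Assign each class name to one of k meta-classes (hypothetical helper)."""
    names = sorted(class_vectors)
    vectors = {n: normalize(class_vectors[n]) for n in names}
    # Deterministic farthest-point initialization of the k centers.
    centers = [vectors[names[0]]]
    while len(centers) < k:
        farthest = min(
            names, key=lambda n: max(cosine(vectors[n], c) for c in centers)
        )
        centers.append(vectors[farthest])
    assignment = {}
    for _ in range(iterations):
        # Assign every class to its most similar center.
        assignment = {
            n: max(range(k), key=lambda j: cosine(vectors[n], centers[j]))
            for n in names
        }
        # Move each center to the normalized mean of its members.
        for j in range(k):
            members = [vectors[n] for n in names if assignment[n] == j]
            if members:
                mean = [sum(col) / len(members) for col in zip(*members)]
                centers[j] = normalize(mean)
    return assignment


# Toy semantic vectors (made up for this example, not real word embeddings).
toy_vectors = {
    "cat": [0.9, 0.1, 0.0],
    "dog": [0.8, 0.2, 0.0],
    "car": [0.0, 0.1, 0.9],
    "bus": [0.1, 0.0, 0.8],
}
meta_class_of = cluster_meta_classes(toy_vectors, k=2)
```

With these toy vectors the two animals end up in one meta-class and the two vehicles in the other. In practice the inputs would be learned word embeddings, and the number of meta-classes would be a design choice.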

Author information

Corresponding author

Correspondence to Shafin Rahman.

Additional information

Communicated by Tinne Tuytelaars.

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

The codes and dataset split are available at: https://github.com/salman-h-khan/ZSD_Release

About this article

Cite this article

Rahman, S., Khan, S.H. & Porikli, F. Zero-Shot Object Detection: Joint Recognition and Localization of Novel Concepts. Int J Comput Vis 128, 2979–2999 (2020). https://doi.org/10.1007/s11263-020-01355-6

  • DOI: https://doi.org/10.1007/s11263-020-01355-6

Keywords

  • Zero-shot learning
  • Zero-shot object detection
  • Deep learning
  • Loss function