Zero-Shot Object Detection: Joint Recognition and Localization of Novel Concepts

Rahman, Shafin; Khan, Salman H.; Porikli, Fatih

doi:10.1007/s11263-020-01355-6

Published: 24 July 2020

Volume 128, pages 2979–2999 (2020)

International Journal of Computer Vision

2223 Accesses
35 Citations

Abstract

Zero-shot learning (ZSL) identifies unseen objects for which no training images are available. Conventional ZSL approaches are restricted to a recognition setting where each test image is categorized into one of several unseen object classes. We posit that this setting is ill-suited for real-world applications where unseen objects appear only as a part of a complete scene, warranting both ‘recognition’ and ‘localization’ of the unseen category. To address this limitation, we introduce a new ‘Zero-Shot Detection’ (ZSD) problem setting, which aims at simultaneously recognizing and locating object instances belonging to novel categories, without any training samples. We introduce an integrated solution to the ZSD problem that jointly models the complex interplay between visual and semantic domain information. Ours is an end-to-end trainable deep network for ZSD that effectively overcomes the noise in the unsupervised semantic descriptions. To this end, we utilize the concept of meta-classes to design an original loss function that achieves synergy between max-margin class separation and semantic domain clustering. To set a benchmark for ZSD, we propose an experimental protocol for the large-scale ILSVRC dataset that reflects practical challenges, e.g., rare classes are more likely to be the unseen ones. Furthermore, we present a baseline approach extended from conventional recognition to the ZSD setting. Our extensive experiments show a significant boost in performance (in terms of mAP and Recall) on the important yet difficult ZSD problem on the ImageNet detection, MSCOCO and FashionZSD datasets.
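The abstract relies on meta-classes, which the Notes define as clusters of semantically similar classes. The sketch below illustrates that idea only; it is not the authors' procedure, which this page does not specify. It assumes simple leader clustering with a cosine-similarity threshold, and the function name `build_meta_classes`, the threshold value, and the 3-dimensional toy vectors are all hypothetical.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm

def build_meta_classes(class_vectors, threshold=0.8):
    """Group class names by leader clustering (an assumed stand-in technique).

    A class joins the first group whose first member (the leader) has
    cosine similarity >= threshold with it; otherwise it starts a new group.
    """
    meta_classes = []  # each entry is a list of class names
    for name, vec in class_vectors.items():
        for group in meta_classes:
            leader_vec = class_vectors[group[0]]
            if cosine(vec, leader_vec) >= threshold:
                group.append(name)
                break
        else:
            # No existing group is similar enough: open a new meta-class.
            meta_classes.append([name])
    return meta_classes

# Toy "word vectors", made up for illustration (real ones would come from
# an embedding model such as word2vec or GloVe).
toy_vectors = {
    "dog": [0.9, 0.1, 0.0],
    "cat": [0.8, 0.2, 0.1],
    "car": [0.0, 0.9, 0.3],
    "bus": [0.1, 0.8, 0.4],
}
meta = build_meta_classes(toy_vectors)
print(meta)
```

With these toy vectors the animal classes and the vehicle classes fall into separate groups. In a real setting the result would depend on the embedding model and on the clustering method chosen.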

Notes

Meta-classes are obtained by clustering semantically similar classes.
We acknowledge, however, that Recall@100 remains an appropriate measure for large-scale datasets that are not fully labeled (such as Visual Genome; see Sect. 5.5).
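To make the Recall@100 measure concrete, here is a minimal per-image sketch. It assumes the common convention that a ground-truth box counts as recalled when one of the top-K scored detections has the same label and an intersection-over-union (IoU) of at least 0.5. The function names, the box format `(x1, y1, x2, y2)`, and the threshold are assumptions for illustration, not the paper's evaluation code.

```python
def iou(box_a, box_b):
    """Intersection over union of two (x1, y1, x2, y2) boxes."""
    ix1, iy1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    ix2, iy2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0

def recall_at_k(detections, ground_truth, k=100, iou_threshold=0.5):
    """Fraction of ground-truth boxes hit by the top-k detections of one image.

    detections:   list of (score, label, box)
    ground_truth: list of (label, box)
    Simplification: one detection may match several ground-truth boxes.
    """
    top_k = sorted(detections, key=lambda d: d[0], reverse=True)[:k]
    matched = 0
    for gt_label, gt_box in ground_truth:
        if any(label == gt_label and iou(box, gt_box) >= iou_threshold
               for _, label, box in top_k):
            matched += 1
    return matched / len(ground_truth) if ground_truth else 0.0

# Toy example with made-up boxes and scores.
gt = [("zebra", (0, 0, 10, 10)), ("tram", (20, 20, 40, 40))]
dets = [
    (0.9, "zebra", (1, 1, 10, 10)),    # overlaps the zebra box well
    (0.8, "tram", (30, 30, 50, 50)),   # poor overlap with the tram box
    (0.1, "tram", (20, 20, 40, 38)),   # good overlap, but low score
]
print(recall_at_k(dets, gt, k=2), recall_at_k(dets, gt, k=3))
```

Because recall only asks whether each annotated object was found, unannotated objects that the detector also finds are not penalized, which is why the measure suits datasets that are not fully labeled.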

Author information

Authors and Affiliations

North South University, Dhaka, Bangladesh
Shafin Rahman
Data61, CSIRO, Canberra, ACT, 2601, Australia
Shafin Rahman
Australian National University, Canberra, ACT, 0200, Australia
Shafin Rahman & Fatih Porikli
Mohamed bin Zayed University of Artificial Intelligence, Abu Dhabi, UAE
Salman H. Khan
Huawei, San Diego, CA, USA
Fatih Porikli

Corresponding author

Correspondence to Shafin Rahman.

Additional information

Communicated by Tinne Tuytelaars.

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

The code and dataset split are available at: https://github.com/salman-h-khan/ZSD_Release

About this article

Cite this article

Rahman, S., Khan, S.H. & Porikli, F. Zero-Shot Object Detection: Joint Recognition and Localization of Novel Concepts. Int J Comput Vis 128, 2979–2999 (2020). https://doi.org/10.1007/s11263-020-01355-6

Received: 31 January 2019
Accepted: 08 July 2020
Published: 24 July 2020
Issue Date: December 2020
DOI: https://doi.org/10.1007/s11263-020-01355-6
