
FSODv2: A Deep Calibrated Few-Shot Object Detection Network

Published in: International Journal of Computer Vision

Abstract

Traditional methods for object detection typically necessitate a substantial amount of training data, and creating high-quality training data is time-consuming. In this paper we propose a novel Few-Shot Object Detection network (FSODv2) that aims to detect objects of previously unseen categories using only a few annotated examples. Central to our method are the Attention RPN, Multi-Relation Detector, and Contrastive Training strategy (Fan et al., in: CVPR, 2020), which exploit the similarity between the few-shot support set and the query set to detect novel objects while suppressing false detections in the background. We also contribute a new dataset, FSOD-1k, which contains 1000 categories of various objects with high-quality annotations for training our network. To the best of our knowledge, this is one of the first datasets designed for few-shot object detection. This paper improves our FSOD model through well-designed model calibration in three areas: (1) we propose an improved FPN with multi-scale support inputs to calibrate multi-scale support-query feature matching by exploiting multi-scale features from the same support image at different input scales; (2) we introduce a support classification supervision branch to calibrate the support feature supervision, aligning it with the query feature training supervision; (3) we propose backbone calibration to preserve prior knowledge while alleviating backbone bias toward base classes by employing a classification dataset in our model calibration procedure, where such a dataset has previously only been used for pre-training in related work. In addition, we propose a Fast Attention RPN to improve evaluation speed and save computational memory during inference. Once trained, our few-shot network can detect objects of previously unseen categories without further training or fine-tuning, achieving new state-of-the-art performance on different datasets in the few-shot setting. Our method is general in scope and has numerous potential applications. The dataset link is https://github.com/fanq15/Few-Shot-Object-Detection-Dataset.
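For readers who want a concrete picture of the Attention RPN idea, the following is a minimal, self-contained PyTorch sketch. It is an illustrative re-implementation of the general mechanism (a depth-wise cross-correlation between a globally pooled support feature and the query feature map, used to re-weight the query features before a standard RPN head), not the authors' released code; the 1x1 kernel size, sigmoid gating, and tensor shapes are assumptions made for brevity.

```python
# Minimal sketch (assumed PyTorch re-implementation, not the authors' release)
# of attention-RPN-style feature re-weighting: the support feature is
# globally average-pooled to a 1x1 depth-wise kernel and cross-correlated
# with the query feature map; the resulting attention gates the query
# features that would then be passed to a standard RPN head.
import torch
import torch.nn.functional as F


def attention_rpn_features(query_feat: torch.Tensor,
                           support_feat: torch.Tensor) -> torch.Tensor:
    """query_feat: (B, C, Hq, Wq); support_feat: (B, C, Hs, Ws)."""
    B, C, Hq, Wq = query_feat.shape
    # Pool the support feature into one 1x1 kernel per channel (per image).
    kernel = support_feat.mean(dim=(2, 3), keepdim=True)   # (B, C, 1, 1)
    kernel = kernel.reshape(B * C, 1, 1, 1)                 # depth-wise kernels
    # Depth-wise cross-correlation: each query channel is correlated with the
    # corresponding channel of its own support kernel.
    q = query_feat.reshape(1, B * C, Hq, Wq)
    attn = F.conv2d(q, kernel, groups=B * C)                # (1, B*C, Hq, Wq)
    attn = attn.reshape(B, C, Hq, Wq).sigmoid()
    # Class-aware query features to feed into the RPN / detection head.
    return query_feat * attn


if __name__ == "__main__":
    # Toy tensors standing in for backbone outputs of a query and a support image.
    q = torch.randn(2, 256, 38, 50)
    s = torch.randn(2, 256, 20, 20)
    print(attention_rpn_features(q, s).shape)  # torch.Size([2, 256, 38, 50])
```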


Notes

  1. The fine-tuning stage benefits from using more ways in multi-way training, so we use as many ways as the GPU memory allows.

  2. Since Feature Reweighting and Meta R-CNN are evaluated on MS COCO, in this subsection we discard pre-training on Lin et al. (2014) and follow the same experimental setting for a fair comparison.

  3. We also discard the MS COCO pretraining in this experiment.

References

  • Alayrac, J.-B., Donahue, J., Luc, P., Miech, A., Barr, I., Hasson, Y., Lenc, K., Mensch, A., Millican, K., Reynolds, M., & Ring, R. (2022). Flamingo: A visual language model for few-shot learning. In: NeurIPS.

  • Arteta, C., Lempitsky, V., & Zisserman, A. (2016). Counting in the wild. In: ECCV.

  • Bertinetto, L., Valmadre, J., Henriques, J. F., Vedaldi, A., & Torr, P. H. (2016). Fully-convolutional siamese networks for object tracking. In: ECCV.

  • Bodla, N., Singh, B., Chellappa, R., & Davis, L. S. (2017). Soft-NMS—improving object detection with one line of code. In: ICCV.

  • Buda, M., Maki, A., & Mazurowski, M. A. (2018). A systematic study of the class imbalance problem in convolutional neural networks. Neural Networks.

  • Bulat, A., Guerrero, R., Martinez, B., & Tzimiropoulos, G. (2023). FS-DETR: Few-shot detection transformer with prompting and without re-training. In: ICCV.

  • Cai, Q., Pan, Y., Yao, T., Yan, C., & Mei, T. (2018). Memory matching networks for one-shot image recognition. In: CVPR.

  • Cao, Y., Wang, J., Jin, Y., Wu, T., Chen, K., Liu, Z., & Lin, D. (2021). Few-shot object detection via association and discrimination. In: NeurIPS.

  • Cao, Y., Wang, J., Lin, Y., & Lin, D. (2022). Mini: Mining implicit novel instances for few-shot object detection. arXiv:2205.03381.

  • Carion, N., Massa, F., Synnaeve, G., Usunier, N., Kirillov, A., & Zagoruyko, S. (2020). End-to-end object detection with transformers. In: ECCV.

  • Chen, Q., Chen, X., Wang, J., Zhang, S., Yao, K., Feng, H., Han, J., Ding, E., Zeng, G., & Wang, J. (2023). Group detr: Fast detr training with group-wise one-to-many assignment. In: ICCV.

  • Chen, Y., Li, W., Sakaridis, C., Dai, D., & Van Gool, L. (2018). Domain adaptive faster r-cnn for object detection in the wild. In: CVPR.

  • Chen, H., Wang, Y., Wang, G., & Qiao, Y. (2018). Lstd: a low-shot transfer detector for object detection. In: AAAI.

  • Cordts, M., Omran, M., Ramos, S., Rehfeld, T., Enzweiler, M., Benenson, R., Franke, U., Roth, S., & Schiele, B. (2016). The cityscapes dataset for semantic urban scene understanding. In: CVPR.

  • Dai, X., Chen, Y., Yang, J., Zhang, P., Yuan, L., & Zhang, L. (2021). Dynamic detr: End-to-end object detection with dynamic attention. In: ICCV.

  • Dalal, N., & Triggs, B. (2005). Histograms of oriented gradients for human detection. In: CVPR.

  • Demirel, B., Baran, O. B., & Cinbis, R. G. (2023). Meta-tuning loss functions and data augmentation for few-shot object detection. In: CVPR.

  • Deng, J., Dong, W., Socher, R., Li, L.-J., Li, K., & Fei-Fei, L. (2009) Imagenet: a large-scale hierarchical image database. In: CVPR.

  • Devlin, J., Chang, M.-W., Lee, K., & Toutanova, K. (2018). Bert: Pre-training of deep bidirectional transformers for language understanding. arXiv:1810.04805.

  • Dong, N., & Xing, E.P. (2018). Few-shot semantic segmentation with prototype learning. In: BMVC.

  • Dong, X., Zheng, L., Ma, F., Yang, Y., & Meng, D. (2018). Few-example object detection with model communication. IEEE Transactions on Pattern Analysis and Machine Intelligence, 41(7), 1641–1654.

  • Dosovitskiy, A., Beyer, L., Kolesnikov, A., Weissenborn, D., Zhai, X., Unterthiner, T., Dehghani, M., Minderer, M., Heigold, G., Gelly, S., & Uszkoreit, J. (2020). An image is worth 16 × 16 words: Transformers for image recognition at scale. arXiv:2010.11929.

  • Du, J., Zhang, S., Chen, Q., Le, H., Sun, Y., Ni, Y., Wang, J., He, B., & Wang, J. (2023). σ-adaptive decoupled prototype for few-shot object detection. In: ICCV.

  • Everingham, M., Van Gool, L., Williams, C. K., Winn, J., & Zisserman, A. (2010). The pascal visual object classes (VOC) challenge. International Journal of Computer Vision, 88(2), 303–338.

  • Fan, Z., Ma, Y., Li, Z., & Sun, J. (2021). Generalized few-shot object detection without forgetting. In: CVPR.

  • Fan, Q., Pei, W., Tai, Y.-W., & Tang, C.-K. (2022). Self-support few-shot semantic segmentation. In: ECCV.

  • Fan, Q., Segu, M., Tai, Y.-W., Yu, F., Tang, C.-K., Schiele, B., & Dai, D. (2023). Towards robust object detection invariant to real-world domain shifts. In: ICLR.

  • Fan, Q., Tang, C.-K., & Tai, Y.-W. (2021). Few-shot video object detection. arXiv:2104.14805.

  • Fan, Q., Tang, C.-K., & Tai, Y.-W. (2022). Few-shot object detection with model calibration. In: ECCV.

  • Fan, Q., Zhuo, W., Tang, C.-K., & Tai, Y.-W. (2020). Few-shot object detection with attention-RPN and multi-relation detector. In: CVPR.

  • Fei-Fei, L., Fergus, R., & Perona, P. (2006). One-shot learning of object categories. IEEE Transactions on Pattern Analysis and Machine Intelligence, 28(4), 594–611.

  • Felzenszwalb, P. F., Girshick, R. B., McAllester, D., & Ramanan, D. (2010). Object detection with discriminatively trained part-based models. IEEE Transactions on Pattern Analysis and Machine Intelligence, 32(9), 1627–1645.

  • Finn, C., Abbeel, P., & Levine, S. (2017). Model-agnostic meta-learning for fast adaptation of deep networks. In: ICML.

  • Gao, P., Geng, S., Zhang, R., Ma, T., Fang, R., Zhang, Y., Li, H., & Qiao, Y. (2023). Clip-adapter: Better vision-language models with feature adapters. International Journal of Computer Vision.

  • Geiger, A., Lenz, P., & Urtasun, R. (2012). Are we ready for autonomous driving? The Kitti vision benchmark suite. In: CVPR.

  • Gidaris, S., & Komodakis, N. (2019). Generating classification weights with gnn denoising autoencoders for few-shot learning. In: CVPR.

  • Girshick, R. (2015). Fast r-cnn. In: ICCV.

  • Girshick, R., Donahue, J., Darrell, T., & Malik, J. (2014). Rich feature hierarchies for accurate object detection and semantic segmentation. In: CVPR.

  • Gu, X., Lin, T.-Y., Kuo, W., & Cui, Y. (2021). Open-vocabulary object detection via vision and language knowledge distillation. arXiv:2104.13921.

  • Gui, L.-Y., Wang, Y.-X., Ramanan, D., & Moura, J. M. F. (2018). Few-shot human motion prediction via meta-learning. In: ECCV.

  • Guirguis, K., Meier, J., Eskandar, G., Kayser, M., Yang, B., & Beyerer, J. (2023). Niff: Alleviating forgetting in generalized few-shot object detection via neural instance feature forging. In: CVPR.

  • Guo, C., Pleiss, G., Sun, Y., & Weinberger, K. Q. (2017). On calibration of modern neural networks. In: ICML.

  • Gupta, A., Dollar, P., & Girshick, R. (2019). Lvis: A dataset for large vocabulary instance segmentation. In: CVPR.

  • Han, G., He, Y., Huang, S., Ma, J., & Chang, S.-F. (2021). Query adaptive few-shot object detection with heterogeneous graph convolutional networks. In: ICCV.

  • Han, J., Ren, Y., Ding, J., Yan, K., & Xia, G.-S. (2023). Few-shot object detection via variational feature aggregation. In: AAAI.

  • Hariharan, B., & Girshick, R. (2017). Low-shot visual recognition by shrinking and hallucinating features. In: ICCV.

  • He, K., Fan, H., Wu, Y., Xie, S., & Girshick, R. (2020). Momentum contrast for unsupervised visual representation learning. In: CVPR.

  • He, K., Gkioxari, G., Dollár, P., & Girshick, R. (2017). Mask r-cnn. In: ICCV.

  • He, K., Zhang, X., Ren, S., & Sun, J. (2016). Deep residual learning for image recognition. In: CVPR.

  • Hénaff, O.J., Koppula, S., Alayrac, J.-B., Oord, A., Vinyals, O., & Carreira, J. (2021). Efficient visual pretraining with contrastive detection. In: ICCV.

  • Hu, H., Bai, S., Li, A., Cui, J., & Wang, L. (2021). Dense relation distillation with context-aware aggregation for few-shot object detection. In: CVPR.

  • Hu, X., Jiang, Y., Tang, K., Chen, J., Miao, C., & Zhang, H. (2020). Learning to segment the tail. In: CVPR.

  • Hu, T., Yang, P., Zhang, C., Yu, G., Mu, Y., & Snoek, C. G. M. (2019). Attention-based multi-context guiding for few-shot semantic segmentation. In: AAAI.

  • Jia, C., Yang, Y., Xia, Y., Chen, Y.-T., Parekh, Z., Pham, H., Le, Q., Sung, Y.-H., Li, Z., & Duerig, T. (2021). Scaling up visual and vision-language representation learning with noisy text supervision. In: ICML.

  • Jia, D., Yuan, Y., He, H., Wu, X., Yu, H., Lin, W., Sun, L., Zhang, C., & Hu, H. (2023). Detrs with hybrid matching. In: CVPR.

  • Kang, B., Liu, Z., Wang, X., Yu, F., Feng, J., & Darrell, T. (2019). Few-shot object detection via feature reweighting. In: ICCV.

  • Kang, B., Xie, S., Rohrbach, M., Yan, Z., Gordo, A., Feng, J., & Kalantidis, Y. (2019). Decoupling representation and classifier for long-tailed recognition. arXiv:1910.09217.

  • Karlinsky, L., Shtok, J., Harary, S., Schwartz, E., Aides, A., Feris, R., Giryes, R., & Bronstein, A. M. (2019). Repmet: Representative-based metric learning for classification and few-shot object detection. In: CVPR.

  • Kaul, P., Xie, W., & Zisserman, A. (2022). Label, verify, correct: A simple few shot object detection method. In: CVPR.

  • Kim, D., Angelova, A., & Kuo, W. (2023). Contrastive feature masking open-vocabulary vision transformer. In: ICCV.

  • Kim, J., Kim, T., Kim, S., & Yoo, C. D. (2019). Edge-labeling graph neural network for few-shot learning. In: CVPR.

  • Kim, B., & Kim, J. (2020). Adjusting decision boundary for class imbalanced learning. IEEE Access, 8, 81674–81685.

  • Koch, G., Zemel, R., & Salakhutdinov, R. (2015). Siamese neural networks for one-shot image recognition. In: ICML Workshop.

  • Krishna, R., Zhu, Y., Groth, O., Johnson, J., Hata, K., Kravitz, J., Chen, S., Kalantidis, Y., Li, L.-J., Shamma, D. A., & Bernstein, M. S. (2017). Visual genome: Connecting language and vision using crowdsourced dense image annotations. International Journal of Computer Vision, 123(1), 32–73.

  • Krizhevsky, A., Sutskever, I., & Hinton, G. E. (2012). Imagenet classification with deep convolutional neural networks. In: NeurIPS.

  • Kuznetsova, A., Rom, H., Alldrin, N., Uijlings, J., Krasin, I., Pont-Tuset, J., Kamali, S., Popov, S., Malloci, M., Duerig, T., & Ferrari, V. (2018). The open images dataset v4: Unified image classification, object detection, and visual relationship detection at scale. arXiv:1811.00982.

  • Lake, B. M., Salakhutdinov, R. R., & Tenenbaum, J. (2013). One-shot learning by inverting a compositional causal process. In: NeurIPS.

  • Lake, B., Salakhutdinov, R., Gross, J., & Tenenbaum, J. (2011). One shot learning of simple visual concepts. In: Proceedings of the annual meeting of the cognitive science society, 33.

  • Lake, B. M., Salakhutdinov, R., & Tenenbaum, J. B. (2015). Human-level concept learning through probabilistic program induction. Science, 350(6266), 1332–1338.

  • Li, A., & Li, Z. (2021). Transformation invariant few-shot object detection. In: CVPR.

  • Li, H., Eigen, D., Dodge, S., Zeiler, M., & Wang, X. (2019). Finding task-relevant features for few-shot learning by category traversal. In: CVPR.

  • Li, Z., Hoogs, A., & Xu, C. (2022). Discover and mitigate unknown biases with debiasing alternate networks. In: ECCV.

  • Li, J., Li, D., Xiong, C., & Hoi, S. (2022). Blip: Bootstrapping language-image pre-training for unified vision-language understanding and generation. In: ICML.

  • Li, J., Selvaraju, R., Gotmare, A., Joty, S., Xiong, C., & Hoi, S. C. H. (2021). Align before fuse: Vision and language representation learning with momentum distillation. In: NeurIPS.

  • Li, Y., Wang, T., Kang, B., Tang, S., Wang, C., Li, J., & Feng, J. (2020). Overcoming classifier imbalance for long-tail object detection with balanced group softmax. In: CVPR.

  • Li, W., Wang, L., Xu, J., Huo, J., Yang, G., & Luo, J. (2019). Revisiting local descriptor based image-to-class measure for few-shot learning. In: CVPR.

  • Li, B., Wu, W., Wang, Q., Zhang, F., Xing, J., & Yan, J. (2019). SiamRPN++: Evolution of siamese visual tracking with very deep networks. In: CVPR.

  • Li, Y., Xie, S., Chen, X., Dollar, P., He, K., & Girshick, R. (2021). Benchmarking detection transfer learning with vision transformers. arXiv:2111.11429.

  • Li, F., Zhang, H., Liu, S., Guo, J., Ni, L. M., & Zhang, L. (2022). Dn-detr: Accelerate detr training by introducing query denoising. In: CVPR.

  • Li, J., Zhang, Y., Qiang, W., Si, L., Jiao, C., Hu, X., Zheng, C., & Sun, F. (2023). Disentangle and remerge: interventional knowledge distillation for few-shot object detection from a conditional causal perspective. In: AAAI.

  • Li, Y., Zhu, H., Cheng, Y., Wang, W., Teo, C. S., Xiang, C., Vadakkepat, P., & Lee, T. H. (2021). Few-shot object detection via classification refinement and distractor retreatment. In: CVPR.

  • Lifchitz, Y., Avrithis, Y., Picard, S., & Bursuc, A. (2019). Dense classification and implanting for few-shot learning. In: CVPR.

  • Lin, T.-Y., Goyal, P., Girshick, R., He, K., & Dollár, P. (2017). Focal loss for dense object detection. In: ICCV.

  • Lin, T.-Y., Maire, M., Belongie, S., Hays, J., Perona, P., Ramanan, D., Dollár, P., & Zitnick, C. L. (2014). Microsoft coco: Common objects in context. In: ECCV.

  • Liu, S. & Huang, D. (2018). Receptive field block net for accurate and fast object detection. In: ECCV.

  • Liu, W., Anguelov, D., Erhan, D., Szegedy, C., Reed, S., Fu, C.-Y., & Berg, A. C. (2016). Ssd: Single shot multibox detector. In: ECCV.

  • Liu, S., Li, F., Zhang, H., Yang, X., Qi, X., Su, H., Zhu, J., & Zhang, L. (2022). Dab-detr: Dynamic anchor boxes are better queries for detr. arXiv:2201.12329.

  • Lu, X., Diao, W., Mao, Y., Li, J., Wang, P., Sun, X., & Fu, K. (2023). Breaking immutable: Information-coupled prototype elaboration for few-shot object detection. In: AAAI.

  • Lu, E., Xie, W., & Zisserman, A. (2018). Class-agnostic counting. In: ACCV.

  • Ma, C., Jiang, Y., Wen, X., Yuan, Z., & Qi, X. (2023). Codet: Co-occurrence guided region-word alignment for open-vocabulary object detection. arXiv:2310.16667.

  • Ma, J., Niu, Y., Xu, J., Huang, S., Han, G., & Chang, S.-F. (2023). Digeo: Discriminative geometry-aware learning for generalized few-shot object detection. In: CVPR.

  • Michaelis, C., Bethge, M., & Ecker, A. S. (2018). One-shot segmentation in clutter. In: ICML.

  • Miller, G. A. (1995). Wordnet: A lexical database for English. Communications of the ACM, 38(11), 39–41.

  • Munkhdalai, T., & Yu, H. (2017). Meta networks. In: ICML.

  • Munkhdalai, T., Yuan, X., Mehri, S., & Trischler, A. (2018). Rapid adaptation with conditionally shifted neurons. In: ICML.

  • Oreshkin, B., López, P. R., & Lacoste, A. (2018). Tadam: Task dependent adaptive metric for improved few-shot learning. In: NeurIPS.

  • Pei, W., Wu, S., Mei, D., Chen, F., Tian, J., & Lu, G. (2022). Few-shot object detection by knowledge distillation using bag-of-visual-words representations. In: ECCV.

  • Qiao, L., Zhao, Y., Li, Z., Qiu, X., Wu, J., & Zhang, C. (2021). Defrcn: Decoupled faster r-cnn for few-shot object detection. In: ICCV.

  • Qin, X., Zhang, Z., Huang, C., Dehghan, M., Zaiane, O. R., & Jagersand, M. (2020). U2-net: Going deeper with nested u-structure for salient object detection. Pattern Recognition.

  • Radford, A., Kim, J. W., Hallacy, C., Ramesh, A., Goh, G., Agarwal, S., Sastry, G., Askell, A., Mishkin, P., Clark, J., & Krueger, G. (2021). Learning transferable visual models from natural language supervision. In: ICML.

  • Ravi, S., & Larochelle, H. (2017). Optimization as a model for few-shot learning. In: ICLR.

  • Redmon, J., & Farhadi, A. (2017). Yolo9000: Better, faster, stronger. In: CVPR.

  • Redmon, J., Divvala, S., Girshick, R., & Farhadi, A. (2016). You only look once: Unified, real-time object detection. In: CVPR.

  • Ren, S., He, K., Girshick, R., & Sun, J. (2015). Faster R-CNN: Towards real-time object detection with region proposal networks. In: NeurIPS.

  • Santoro, A., Bartunov, S., Botvinick, M., Wierstra, D., & Lillicrap, T. (2016). Meta-learning with memory-augmented neural networks. In: ICML.

  • Schwartz, E., Karlinsky, L., Shtok, J., Harary, S., Marder, M., Pankanti, S., Feris, R., Kumar, A., Giries, R., & Bronstein, A. M. (2019). Repmet: Representative-based metric learning for classification and one-shot object detection. In: CVPR.

  • Shi, C., & Yang, S. (2023). Edadet: Open-vocabulary object detection using early dense alignment. In: ICCV.

  • Shu, M., Nie, W., Huang, D.-A., Yu, Z., Goldstein, T., Anandkumar, A., & Xiao, C. (2022). Test-time prompt tuning for zero-shot generalization in vision-language models. In: NeurIPS.

  • Singh, A., Hu, R., Goswami, V., Couairon, G., Galuba, W., Rohrbach, M., & Kiela, D. (2022). Flava: A foundational language and vision alignment model. In: CVPR.

  • Singh, K. K., Mahajan, D., Grauman, K., Lee, Y. J., Feiszli, M., & Ghadiyaram, D. (2020). Don’t judge an object by its context: Learning to overcome contextual bias. In: CVPR.

  • Singh, B., Najibi, M., & Davis, L. S. (2018). Sniper: Efficient multi-scale training. In: NeurIPS.

  • Snell, J., Swersky, K., & Zemel, R. (2017). Prototypical networks for few-shot learning. In: NeurIPS.

  • Sun, B., Li, B., Cai, S., Yuan, Y., & Zhang, C. (2021). Fsce: Few-shot object detection via contrastive proposal encoding. In: CVPR.

  • Tan, J., Wang, C., Li, B., Li, Q., Ouyang, W., Yin, C., & Yan, J. (2020). Equalization loss for long-tailed object recognition. In: CVPR.

  • Tao, Y., Sun, J., Yang, H., Chen, L., Wang, X., Yang, W., Du, D., & Zheng, M. (2023). Local and global logit adjustments for long-tailed learning. In: ICCV.

  • Thrun, S. (1996). Is learning the n-th thing any easier than learning the first? In: NeurIPS.

  • Touvron, H., Lavril, T., Izacard, G., Martinet, X., Lachaux, M.-A., Lacroix, T., Rozière, B., Goyal, N., Hambro, E., Azhar, F., & Rodriguez, A. (2023). Llama: Open and efficient foundation language models. arXiv:2302.13971.

  • Triantafillou, E., Zemel, R., & Urtasun, R. (2017). Few-shot learning through an information retrieval lens. In: NeurIPS.

  • Uijlings, J. R., Van De Sande, K. E., Gevers, T., & Smeulders, A. W. (2013). Selective search for object recognition. International Journal of Computer Vision, 104(2), 154–171.

  • Vinyals, O., Blundell, C., Lillicrap, T., & Wierstra, D. (2016). Matching networks for one shot learning. In: NeurIPS.

  • Viola, P., & Jones, M. J. (2001). Rapid object detection using a boosted cascade of simple features. In: CVPR.

  • Wang, T. (2023). Learning to detect and segment for open vocabulary object detection. In: CVPR.

  • Wang, W., Bao, H., Dong, L., Bjorck, J., Peng, Z., Liu, Q., Aggarwal, K., Mohammed, O.K., Singhal, S., Som, S., & Wei, F. (2023). Image as a foreign language: Beit pretraining for vision and vision-language tasks. In: CVPR.

  • Wang, Y., Fei, J., Wang, H., Li, W., Bao, T., Wu, L., Zhao, R., & Shen, Y. (2023). Balancing logit variation for long-tailed semantic segmentation. In: CVPR.

  • Wang, Y.-X., Girshick, R., Hebert, M., & Hariharan, B. (2018). Low-shot learning from imaginary data. In: CVPR.

  • Wang, X., Huang, T. E., Darrell, T., Gonzalez, J. E., & Yu, F. (2020). Frustratingly simple few-shot object detection. In: ICML.

  • Wang, T., Li, Y., Kang, B., Li, J., Liew, J., Tang, S., Hoi, S., & Feng, J. (2020). The devil is in classification: A simple framework for long-tail instance segmentation. In: ECCV.

  • Wang, Z., Yu, J., Yu, A. W., Dai, Z., Tsvetkov, Y., & Cao, Y. (2021). Simvlm: Simple visual language model pretraining with weak supervision. arXiv:2108.10904.

  • Wang, J., Zhang, W., Zang, Y., Cao, Y., Pang, J., Gong, T., Chen, K., Liu, Z., Loy, C. C., & Lin, D. (2021). Seesaw loss for long-tailed instance segmentation. In: CVPR.

  • Wong, A., & Yuille, A. L. (2015). One shot learning via compositions of meaningful patches. In: ICCV.

  • Wu, A., Han, Y., Zhu, L., & Yang, Y. (2021). Universal-prototype enhancing for few-shot object detection. In: ICCV.

  • Wu, J., Liu, S., Huang, D., & Wang, Y. (2020). Multi-scale positive sample refinement for few-shot object detection. In: ECCV.

  • Wu, S., Zhang, W., Jin, S., Liu, W., & Loy, C. C. (2023). Aligning bag of regions for open-vocabulary object detection. In: CVPR.

  • Xiao, Y., & Marlet, R. (2020). Few-shot object detection and viewpoint estimation for objects in the wild. In: ECCV.

  • Xu, J., Le, H., & Samaras, D. (2023). Generating features with increased crop-related diversity for few-shot object detection. In: CVPR.

  • Yan, X., Chen, Z., Xu, A., Wang, X., Liang, X., & Lin, L. (2019). Meta r-cnn: Towards general solver for instance-level low-shot learning. In: ICCV.

  • Yang, J., Li, C., Zhang, P., Xiao, B., Liu, C., Yuan, L., & Gao, J. (2022). Unified contrastive learning in image-text-label space. In: CVPR.

  • Yang, Y., Wei, F., Shi, M., & Li, G. (2020). Restoring negative information in few-shot object detection. In: NeurIPS.

  • Yang, F. S. Y., Zhang, L., Xiang, T., Torr, P. H., & Hospedales, T. M. (2018). Learning to compare: Relation network for few-shot learning. In: CVPR.

  • Yao, L., Han, J., Liang, X., Xu, D., Zhang, W., Li, Z., & Xu, H. (2023). Detclipv2: Scalable open-vocabulary object detection pre-training via word-region alignment. In: CVPR.

  • Yao, L., Han, J., Wen, Y., Liang, X., Xu, D., Zhang, W., Li, Z., Xu, C., & Xu, H. (2022). Detclip: Dictionary-enriched visual-concept paralleled pre-training for open-world detection. In: NeurIPS.

  • Yuan, L., Chen, D., Chen, Y.-L., Codella, N., Dai, X., Gao, J., Hu, H., Huang, X., Li, B., Li, C., & Liu, C. (2021). Florence: A new foundation model for computer vision. arXiv:2111.11432.

  • Zang, Y., Li, W., Zhou, K., Huang, C., & Loy, C.C. (2022). Open-vocabulary detr with conditional matching. In: ECCV.

  • Zhang, W., & Wang, Y.-X. (2021). Hallucination improves few-shot object detection. In: CVPR.

  • Zhang, G., Cui, K., Wu, R., Lu, S., & Tian, Y. (2021). PNPDet: efficient few-shot detection without forgetting via plug-and-play sub-networks. In: WACV.

  • Zhang, H., Dana, K., Shi, J., Zhang, Z., Wang, X., Tyagi, A., & Agrawal, A. (2018). Context encoding for semantic segmentation. In: CVPR.

  • Zhang, R., Hu, X., Li, B., Huang, S., Deng, H., Qiao, Y., Gao, P., & Li, H. (2023). Prompt, generate, then cache: Cascade of foundation models makes strong few-shot learners. In: CVPR.

  • Zhang, H., Li, F., Liu, S., Zhang, L., Su, H., Zhu, J., Ni, L. M., & Shum, H.-Y. (2022). Dino: Detr with improved denoising anchor boxes for end-to-end object detection. arXiv:2203.03605.

  • Zhang, R., Zhang, W., Fang, R., Gao, P., Li, K., Dai, J., Qiao, Y., & Li, H. (2022). Tip-adapter: Training-free adaption of clip for few-shot classification. In: ECCV.

  • Zhao, Y., Chen, W., Tan, X., Huang, K., & Zhu, J. (2022). Adaptive logit adjustment loss for long-tailed visual recognition. In: AAAI.

  • Zhao, L., Teng, Y., & Wang, L. (2024). Logit normalization for long-tail object detection. International Journal of Computer Vision.

  • Zhong, Y., Yang, J., Zhang, P., Li, C., Codella, N., Li, L.H., Zhou, L., Dai, X., Yuan, L., Li, Y., & Gao, J. (2022). Regionclip: Region-based language-image pretraining. In: CVPR.

  • Zhou, X., Girdhar, R., Joulin, A., Krähenbühl, P., & Misra, I. (2022). Detecting twenty-thousand classes using image-level supervision. In: ECCV.

  • Zhou, K., Yang, J., Loy, C. C., & Liu, Z. (2022). Conditional prompt learning for vision-language models. In: CVPR.

  • Zhou, K., Yang, J., Loy, C. C., & Liu, Z. (2022). Learning to prompt for vision-language models. International Journal of Computer Vision, 130, 2337–2348.

  • Zhu, C., Chen, F., Ahmed, U., & Savvides, M. (2021). Semantic relation reasoning for shot-stable few-shot object detection. In: CVPR.

  • Zhu, X., Su, W., Lu, L., Li, B., Wang, X., & Dai, J. (2020). Deformable detr: Deformable transformers for end-to-end object detection. arXiv:2010.04159.

  • Zong, Z., Song, G., & Liu, Y. (2023). Detrs with collaborative hybrid assignments training. In: ICCV.

Funding

This work is supported in part by the Research Grant Council of the Hong Kong SAR under Grant 16201420, the National Natural Science Foundation of China under Grant 62306183, and the Guangdong Basic and Applied Basic Research Foundation under Grant 2024A1515010194.

Author information

Corresponding authors

Correspondence to Wei Zhuo or Yu-Wing Tai.

Additional information

Communicated by Jifeng Dai.

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.

About this article

Cite this article

Fan, Q., Zhuo, W., Tang, CK. et al. FSODv2: A Deep Calibrated Few-Shot Object Detection Network. Int J Comput Vis (2024). https://doi.org/10.1007/s11263-024-02049-z

