Simple Open-Vocabulary Object Detection

  • Conference paper
Computer Vision – ECCV 2022 (ECCV 2022)

Abstract

Combining simple architectures with large-scale pre-training has led to massive improvements in image classification. For object detection, pre-training and scaling approaches are less well established, especially in the long-tailed and open-vocabulary setting, where training data is relatively scarce. In this paper, we propose a strong recipe for transferring image-text models to open-vocabulary object detection. We use a standard Vision Transformer architecture with minimal modifications, contrastive image-text pre-training, and end-to-end detection fine-tuning. Our analysis of the scaling properties of this setup shows that increasing image-level pre-training and model size yields consistent improvements on the downstream detection task. We provide the adaptation strategies and regularizations needed to attain very strong performance on zero-shot text-conditioned and one-shot image-conditioned object detection. Code and models are available on GitHub at github.com/google-research/scenic/tree/main/scenic/projects/owl_vit.

M. Minderer and A. Gritsenko—Equal conceptual and technical contribution.

References

  1. Arnab, A., Dehghani, M., Heigold, G., Sun, C., Lučić, M., Schmid, C.: ViViT: a video vision transformer. In: ICCV, pp. 6836–6846 (2021)

  2. Bansal, A., Sikka, K., Sharma, G., Chellappa, R., Divakaran, A.: Zero-shot object detection. In: Ferrari, V., Hebert, M., Sminchisescu, C., Weiss, Y. (eds.) ECCV 2018. LNCS, vol. 11205, pp. 397–414. Springer, Cham (2018). https://doi.org/10.1007/978-3-030-01246-5_24

  3. Bello, I., et al.: Revisiting ResNets: improved training and scaling strategies. In: NeurIPS, vol. 34 (2021)

  4. Biswas, S.K., Milanfar, P.: One shot detection with laplacian object and fast matrix cosine similarity. IEEE Trans. Pattern Anal. Mach. Intell. 38(3), 546–562 (2016)

  5. Bradbury, J., et al.: JAX: composable transformations of Python+NumPy programs (2018). http://github.com/google/jax

  6. Carion, N., Massa, F., Synnaeve, G., Usunier, N., Kirillov, A., Zagoruyko, S.: End-to-end object detection with transformers. In: Vedaldi, A., Bischof, H., Brox, T., Frahm, J.-M. (eds.) ECCV 2020. LNCS, vol. 12346, pp. 213–229. Springer, Cham (2020). https://doi.org/10.1007/978-3-030-58452-8_13

  7. Chen, D.J., Hsieh, H.Y., Liu, T.L.: Adaptive image transformer for one-shot object detection. In: CVPR, pp. 12242–12251 (2021)

  8. Dehghani, M., Gritsenko, A.A., Arnab, A., Minderer, M., Tay, Y.: SCENIC: a JAX library for computer vision research and beyond. arXiv preprint arXiv:2110.11403 (2021)

  9. Fang, Y., et al.: You only look at one sequence: rethinking transformer in vision through object detection. In: NeurIPS, vol. 34 (2021)

  10. Frome, A., et al.: DeViSE: a deep visual-semantic embedding model. In: NeurIPS, vol. 26 (2013)

  11. Ghiasi, G., et al.: Simple copy-paste is a strong data augmentation method for instance segmentation. In: CVPR, pp. 2918–2928 (2021)

  12. Gu, X., Lin, T.Y., Kuo, W., Cui, Y.: Open-vocabulary object detection via vision and language knowledge distillation. arXiv preprint arXiv:2104.13921 (2021)

  13. Gupta, A., Dollar, P., Girshick, R.: LVIS: a dataset for large vocabulary instance segmentation. In: CVPR (2019)

  14. He, K., Gkioxari, G., Dollar, P., Girshick, R.: Mask R-CNN. In: ICCV (2017)

  15. He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In: CVPR (2016)

  16. Hsieh, T.I., Lo, Y.C., Chen, H.T., Liu, T.L.: One-shot object detection with co-attention and co-excitation. In: NeurIPS, vol. 32. Curran Associates, Inc. (2019)

  17. Huang, G., Sun, Yu., Liu, Z., Sedra, D., Weinberger, K.Q.: Deep networks with stochastic depth. In: Leibe, B., Matas, J., Sebe, N., Welling, M. (eds.) ECCV 2016. LNCS, vol. 9908, pp. 646–661. Springer, Cham (2016). https://doi.org/10.1007/978-3-319-46493-0_39

  18. Huang, Z., Zeng, Z., Liu, B., Fu, D., Fu, J.: Pixel-BERT: aligning image pixels with text by deep multi-modal transformers. arXiv preprint arXiv:2004.00849 (2020)

  19. Jia, C., et al.: Scaling up visual and vision-language representation learning with noisy text supervision. In: ICML, vol. 139, pp. 4904–4916. PMLR (2021)

  20. Kamath, A., Singh, M., LeCun, Y., Synnaeve, G., Misra, I., Carion, N.: MDETR - modulated detection for end-to-end multi-modal understanding. In: ICCV, pp. 1780–1790 (2021)

  21. Kolesnikov, A., et al.: Big Transfer (BiT): general visual representation learning. In: Vedaldi, A., Bischof, H., Brox, T., Frahm, J.-M. (eds.) ECCV 2020. LNCS, vol. 12350, pp. 491–507. Springer, Cham (2020). https://doi.org/10.1007/978-3-030-58558-7_29

  22. Dosovitskiy, A., et al.: An image is worth 16\(\times \)16 words: transformers for image recognition at scale. In: ICLR (2021)

  23. Krishna, R., et al.: Visual genome: connecting language and vision using crowdsourced dense image annotations. Int. J. Comput. Vision 123(1), 32–73 (2017)

  24. Kuznetsova, A.: The open images dataset V4. Int. J. Comput. Vision 128(7), 1956–1981 (2020)

  25. Lee, J., Lee, Y., Kim, J., Kosiorek, A.R., Choi, S., Teh, Y.W.: Set transformer: a framework for attention-based permutation-invariant neural networks. In: ICML, Proceedings of Machine Learning Research, vol. 97, pp. 3744–3753. PMLR (2019)

  26. Li, L.H., et al.: Grounded language-image pre-training. arXiv preprint arXiv:2112.03857 (2021)

  27. Lin, T.-Y., et al.: Microsoft COCO: common objects in context. In: Fleet, D., Pajdla, T., Schiele, B., Tuytelaars, T. (eds.) ECCV 2014. LNCS, vol. 8693, pp. 740–755. Springer, Cham (2014). https://doi.org/10.1007/978-3-319-10602-1_48

  28. Liu, W., et al.: SSD: single shot multibox detector. In: Leibe, B., Matas, J., Sebe, N., Welling, M. (eds.) ECCV 2016. LNCS, vol. 9905, pp. 21–37. Springer, Cham (2016). https://doi.org/10.1007/978-3-319-46448-0_2

  29. Mahajan, D.: Exploring the limits of weakly supervised pretraining. In: Ferrari, V., Hebert, M., Sminchisescu, C., Weiss, Y. (eds.) ECCV 2018. LNCS, vol. 11206, pp. 185–201. Springer, Cham (2018). https://doi.org/10.1007/978-3-030-01216-8_12

  30. Michaelis, C., Ustyuzhaninov, I., Bethge, M., Ecker, A.S.: One-shot instance segmentation. arXiv preprint arXiv:1811.11507 (2018)

  31. Osokin, A., Sumin, D., Lomakin, V.: OS2D: one-stage one-shot object detection by matching anchor features. In: Vedaldi, A., Bischof, H., Brox, T., Frahm, J.-M. (eds.) ECCV 2020. LNCS, vol. 12360, pp. 635–652. Springer, Cham (2020). https://doi.org/10.1007/978-3-030-58555-6_38

  32. Pham, H., et al.: Combined scaling for zero-shot transfer learning. arXiv preprint arXiv:2111.10050 (2021)

  33. Radford, A., et al.: Learning transferable visual models from natural language supervision. In: ICML, 18–24 July 2021, vol. 139, pp. 8748–8763. PMLR (2021)

  34. Ren, S., He, K., Girshick, R., Sun, J.: Faster R-CNN: towards real-time object detection with region proposal networks. In: NeurIPS, vol. 28. Curran Associates, Inc. (2015)

  35. Shao, S., et al.: Objects365: a large-scale, high-quality dataset for object detection. In: ICCV, pp. 8429–8438 (2019)

  36. Socher, R., Ganjoo, M., Manning, C.D., Ng, A.: Zero-shot learning through cross-modal transfer. In: NeurIPS, vol. 26 (2013)

  37. Song, H., et al.: ViDT: an efficient and effective fully transformer-based object detector. In: ICLR (2022)

  38. Steiner, A., Kolesnikov, A., Zhai, X., Wightman, R., Uszkoreit, J., Beyer, L.: How to train your ViT? data, augmentation, and regularization in vision transformers. arXiv preprint arXiv:2106.10270 (2021)

  39. Touvron, H., Cord, M., Douze, M., Massa, F., Sablayrolles, A., Jegou, H.: Training data-efficient image transformers and distillation through attention. In: ICML, vol. 139, pp. 10347–10357 (2021)

  40. Xian, Y., Lampert, C.H., Schiele, B., Akata, Z.: Zero-shot learning-a comprehensive evaluation of the good, the bad and the ugly. IEEE Trans. Pattern Anal. Mach. Intell. 41(9), 2251–2265 (2018)

  41. Yao, Z., Ai, J., Li, B., Zhang, C.: Efficient detr: improving end-to-end object detector with dense prior. arXiv preprint arXiv:2104.01318 (2021)

  42. Zareian, A., Rosa, K.D., Hu, D.H., Chang, S.F.: Open-vocabulary object detection using captions. In: CVPR, pp. 14393–14402 (2021)

  43. Zhai, X., Kolesnikov, A., Houlsby, N., Beyer, L.: Scaling vision transformers. arXiv preprint arXiv:2106.04560 (2021)

  44. Zhai, X., et al.: LiT: zero-shot transfer with locked-image text tuning. arXiv preprint arXiv:2111.07991 (2021)

  45. Zhong, Y., et al.: RegionCLIP: region-based language-image pretraining. arXiv preprint arXiv:2112.09106 (2021)

  46. Zhou, X., Girdhar, R., Joulin, A., Krähenbühl, P., Misra, I.: Detecting twenty-thousand classes using image-level supervision. arXiv preprint arXiv:2201.02605 (2021)

  47. Zhou, X., Koltun, V., Krähenbühl, P.: Probabilistic two-stage detection. arXiv preprint arXiv:2103.07461 (2021)

  48. Zhu, X., Su, W., Lu, L., Li, B., Wang, X., Dai, J.: Deformable DETR: deformable transformers for end-to-end object detection. In: ICLR (2021)

Acknowledgements.

We would like to thank Sunayana Rane and Rianne van den Berg for help with the DETR implementation, Lucas Beyer for the data deduplication code, and Yi Tay for useful advice.

Author information

Corresponding author

Correspondence to Matthias Minderer.

A Appendix

The appendix provides additional examples, results and methodological details. For remaining questions, please refer to the code at github.com/google-research/scenic/tree/main/scenic/projects/owl_vit.

A.1 Qualitative Examples

(See Figs. 5 and 6).

Fig. 5. Text conditioning examples. Prompts: "an image of a {}", where {} is replaced with one of bookshelf, desk lamp, computer keyboard, binder, pc computer, computer mouse, computer monitor, chair, drawers, drinking glass, ipod, pink book, yellow book, curtains, red apple, banana, green apple, orange, grapefruit, potato, for sale sign, car wheel, car door, car mirror, gas tank, frog, head lights, license plate, door handle, tail lights.

Fig. 6. Image conditioning examples. The center column shows the query patches and the outer columns show the detections along with the similarity score.

A.2 Detection Datasets

Five datasets with object detection annotations were used for fine-tuning and evaluation in this work. Table 4 shows relevant statistics for each of these datasets:

MS-COCO (COCO) [27]: The Microsoft Common Objects in Context dataset is a medium-scale object detection dataset. It has about 900k bounding box annotations for 80 object categories, with about 7.3 annotations per image. It is one of the most used object detection datasets, and its images are often used within other datasets (including VG and LVIS). This work uses the 2017 train, validation and test splits.

Visual Genome (VG) [23] contains dense annotations for objects, regions, object attributes, and their relationships within each image. VG is based on COCO images, which are re-annotated with free-text annotations for an average of 35 objects per image. All entities are canonicalized to WordNet synsets. We only use object annotations from this dataset, and do not train models using the attribute, relationship or region annotations.

Objects 365 (O365) [35] is a large-scale object detection dataset with 365 object categories. The version we use has over 10M bounding boxes with about 15.8 object annotations per image.

LVIS [13]: The Large Vocabulary Instance Segmentation dataset has over a thousand object categories, following a long-tail distribution with some categories having only a few examples. Similarly to VG, LVIS uses the same images as in COCO, re-annotated with a larger number of object categories. In contrast to COCO and O365, LVIS is a federated dataset, which means that only a subset of categories is annotated in each image. Annotations therefore include positive and negative object labels for objects that are present and categories that are not present, respectively. In addition, LVIS categories are not pairwise disjoint, such that the same object can belong to several categories.

OpenImages V4 (OI) [24] is currently the largest public object detection dataset, with about 14.6M bounding box annotations (about 8 annotations per image). Like LVIS, it is a federated dataset.

Table 4. Statistics of object detection datasets used in this work.

De-duplication. Our detection models are typically fine-tuned on a combination of the OpenImages V4 (OI) and Visual Genome (VG) datasets and evaluated on MS-COCO 2017 (COCO) and LVIS. In several experiments our models are additionally trained on Objects 365 (O365). We never train on the COCO and LVIS datasets, but the public versions of our training datasets contain some of the same images as the COCO and LVIS validation sets. To ensure that our models see no validation images during training, we filter out images from the OI, VG and O365 train splits that also appear in the LVIS and COCO validation and test splits, following a procedure identical to [21]. De-duplication statistics are given in Table 5.

Table 5. Train dataset de-duplication statistics. ‘Examples’ refers to images and ‘instances’ refers to bounding boxes.

A.3 Hyper-parameters

Table 6 provides an exhaustive overview of the hyper-parameter settings used for our main experiments. Beyond this, we

  • used cosine learning rate decay;

  • used focal loss with \(\alpha =0.3\) and \(\gamma =2.0\);

  • set equal weights for the bounding box, gIoU and classification losses [6];

  • used the Adam optimizer with \(\beta _1=0.9\), \(\beta _2=0.999\);

  • used per-example global norm gradient clipping (see Sect. A.9);

  • limited the text encoder input length to 16 tokens for both LiT and CLIP-based models.
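
To make the focal loss setting in the list above concrete, the following is a minimal sketch of a per-logit sigmoid focal loss with \(\alpha =0.3\) and \(\gamma =2.0\). It assumes the standard formulation in which \(\alpha\) weights positive targets and \(1-\alpha\) weights negative targets; the function name and the scalar interface are illustrative and are not taken from the released code.

```python
import math


def _sigmoid(x):
    # Numerically stable logistic function for a single float.
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    e = math.exp(x)
    return e / (1.0 + e)


def sigmoid_focal_loss(logit, target, alpha=0.3, gamma=2.0):
    """Focal loss for one binary classification logit (illustrative sketch).

    Assumption: alpha weights positives and (1 - alpha) weights negatives,
    as in the standard focal loss. `target` is 1.0 (positive) or 0.0.
    """
    p = _sigmoid(logit)
    is_positive = target > 0.5
    p_t = p if is_positive else 1.0 - p          # probability of the true class
    alpha_t = alpha if is_positive else 1.0 - alpha
    p_t = max(p_t, 1e-12)                        # avoid log(0)
    # (1 - p_t)^gamma down-weights examples that are already well classified.
    return -alpha_t * (1.0 - p_t) ** gamma * math.log(p_t)
```

With \(\gamma =2.0\), a confidently correct prediction contributes far less to the loss than a confidently wrong one, which keeps the many easy background predictions from dominating training.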

Table 6. List of hyperparameters used for all models shown in the paper. Asterisks (\(*\)) indicate parameters varied in sweeps. MAP and GAP indicate the use of multihead attention pooling and global average pooling for image-level representation aggregation. Where two numbers are given for the droplayer rate, the first is for the image encoder and the second for the text encoder.

Fig. 7. Effect of image size used during image-level pre-training on zero-shot classification and detection performance shown for the ViT-B/32 architecture.

CLIP-Based Models. The visual encoder of the publicly available CLIP models provides a class token in addition to the image embedding features. To evaluate whether the information in the class token is useful for detection fine-tuning, we explored either dropping this token or merging it into the other feature map tokens by multiplying it with them. We found that multiplying the class token with the feature map tokens, followed by layer norm, worked best for the majority of architectures, so we use this approach throughout. Other hyper-parameters used in the fine-tuning of CLIP models are shown in Table 6.
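
The class token merging described above can be sketched as follows. This is an illustration under assumptions, not the released implementation: the multiplication is taken to be elementwise, the layer norm is shown without learned scale and bias, and tokens are plain Python lists of floats.

```python
import math


def layer_norm(vec, eps=1e-6):
    # Normalize one token to zero mean and (approximately) unit variance.
    # Assumption: no learned scale or bias is shown here.
    mean = sum(vec) / len(vec)
    var = sum((v - mean) ** 2 for v in vec) / len(vec)
    return [(v - mean) / math.sqrt(var + eps) for v in vec]


def merge_class_token(class_token, patch_tokens):
    """Multiply the class token into every patch token, then apply layer norm.

    Assumption: "multiplying" means an elementwise product between the class
    token and each feature map token of the same dimensionality.
    """
    merged = []
    for token in patch_tokens:
        product = [c * t for c, t in zip(class_token, token)]
        merged.append(layer_norm(product))
    return merged
```

The output has one token per input patch token, so the detection heads that operate on the feature map are unaffected by the merge.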

A.4 Pre-Training Image Resolution

We investigated the effect of the image size used during image-text pre-training on zero-shot classification and detection performance (Fig. 7). To reduce clutter, the results are shown for the ViT-B/32 architecture only, but the observed trends extend to other architectures, including Hybrid Transformers. The use of larger images during pre-training consistently benefits zero-shot classification, but makes no significant difference for the detection performance. We thus default to the commonly used \(224\times 224\) resolution for pre-training. We used \(288\times 288\) for some of our experiments with Hybrid Transformer models.

A.5 Random Negatives

Our models are trained on federated datasets. In such datasets, not all categories are exhaustively annotated in every image. Instead, each image comes with a number of labeled bounding boxes (making up the set of positive categories), and a list of categories that are known to be absent from the image (i.e., negative categories). For all other categories, their presence in the image is unknown. Since the number of negative labels can be small, prior work has found it beneficial to randomly sample “pseudo-negative” labels for each image and add them to the annotations [47]. We follow the same approach and add randomly sampled pseudo-negatives to the real negatives of each image until there are at least 50 negative categories. In contrast to [47], we sample categories in proportion to their frequency in the full dataset (i.e., a weighted combination of OI, VG, and potentially O365). We exclude categories from the sample that are among the positives for the given image.
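
The pseudo-negative sampling described above can be sketched in a few lines. The function name, the dictionary of category frequencies, and the choice to sample without replacement are assumptions made for illustration; only the target of at least 50 negatives, the frequency-proportional sampling, and the exclusion of positives come from the text.

```python
import random


def add_pseudo_negatives(positives, negatives, category_freq,
                         min_negatives=50, rng=None):
    """Pad an image's negative categories with sampled pseudo-negatives.

    Assumptions: `category_freq` maps category name to a positive frequency
    in the full training mixture, and sampling is without replacement.
    """
    rng = rng or random.Random()
    positives = set(positives)
    negatives = set(negatives)
    # Positives are never sampled; existing negatives need no re-sampling.
    candidates = [c for c in category_freq
                  if c not in positives and c not in negatives]
    weights = [category_freq[c] for c in candidates]
    while len(negatives) < min_negatives and candidates:
        # Draw one candidate index in proportion to its dataset frequency.
        idx = rng.choices(range(len(candidates)), weights=weights, k=1)[0]
        negatives.add(candidates.pop(idx))
        weights.pop(idx)
    return negatives
```

If the label space has fewer than 50 eligible categories, the sketch simply returns all of them rather than failing.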

A.6 Image Scale Augmentation

To improve invariance of detection models to object size, prior work found it beneficial to use strong random jittering of the image scale during training [11]. We use a similar approach, but follow a two-stage strategy that minimizes image padding.

Fig. 8. Example training images. Ground-truth boxes are indicated in red. From left to right, a single image, a \(2 \times 2\) mosaic, and a \(3 \times 3\) mosaic are shown. Non-square images are padded at the bottom and right (gray color). (Color figure online)

First, we randomly crop each training image. The sampling procedure is constrained to produce crops with an aspect ratio between 0.75 and 1.33, and an area between 33% and 100% of the original image. Bounding box annotations are retained if at least 60% of the box area is within the post-crop image area. After cropping, images are padded to a square aspect ratio by appending gray pixels at the bottom or right edge.
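
The box retention rule for cropped images can be illustrated with a short sketch. It assumes boxes and crops are given as (x_min, y_min, x_max, y_max) in pixel coordinates of the original image, and that retained boxes are clipped to the crop and shifted into crop coordinates; the function names are hypothetical.

```python
def visible_fraction(box, crop):
    """Fraction of the box area that lies inside the crop window."""
    bx0, by0, bx1, by1 = box
    cx0, cy0, cx1, cy1 = crop
    inter_w = max(0.0, min(bx1, cx1) - max(bx0, cx0))
    inter_h = max(0.0, min(by1, cy1) - max(by0, cy0))
    area = (bx1 - bx0) * (by1 - by0)
    return (inter_w * inter_h) / area if area > 0 else 0.0


def filter_and_clip_boxes(boxes, crop, min_visible=0.6):
    """Keep boxes with at least `min_visible` of their area inside the crop.

    Assumption: kept boxes are clipped to the crop and expressed relative to
    the crop's top-left corner.
    """
    cx0, cy0, cx1, cy1 = crop
    kept = []
    for box in boxes:
        if visible_fraction(box, crop) < min_visible:
            continue  # mostly cropped out; the label may no longer fit
        bx0, by0, bx1, by1 = box
        kept.append((max(bx0, cx0) - cx0, max(by0, cy0) - cy0,
                     min(bx1, cx1) - cx0, min(by1, cy1) - cy0))
    return kept
```

For example, with a crop of (20, 10, 120, 110), a box of which only 40% remains visible is dropped, while a box with 75% of its area inside the crop is kept and clipped.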

Second, we assemble multiple images into grids (“mosaics”) of varying sizes to further increase the range of image scales seen by the model. We randomly sample single images, \(2 \times 2\) mosaics, and \(3 \times 3\) mosaics with probabilities 0.5, 0.33, and 0.17, respectively, unless otherwise noted (Fig. 8). This procedure allows us to use widely varying image scales while avoiding excessive padding and/or the need for variable model input size during training.

A.7 One-shot (Image-Conditioned) Detection Details

Extracting Image Embeddings to Use as Queries. We are given a query image patch Q for which we would like to detect similar patches in a new target image, I. We first run inference on the image from which patch Q was selected, and extract an image embedding from our model’s class head in the region of Q. In general, our model predicts many overlapping bounding boxes, some of which will have high overlap with Q. Each predicted bounding box \(b_{i}\) has a corresponding class head feature \(z_{i}\). Due to our DETR-style bipartite matching loss, our model will generally predict a single foreground embedding for the object in Q and many background embeddings adjacent to it which should be ignored. Since all the background embeddings are similar to each other and different from the single foreground embedding, to find the foreground embedding, we search for the most dissimilar class embedding within the group of class embeddings whose corresponding box has IoU \(>0.65\) with Q. We score a class embedding \(z_{i}\)’s similarity to other class embeddings as \( f(z_{i}) = \sum _{j=0}^{N-1} z_{i} \cdot z_{j}^{T}\). Therefore, we use the most dissimilar class embedding \(\textrm{argmin}_{z_{i}} f(z_{i})\) as our query feature when running inference on I. In about 10% of the cases, there are no predicted boxes with IoU \(>0.65\) with Q. In these cases we fall back to using the embedding for the text query "an image of an object".
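
The query selection rule above can be sketched as follows. The sketch assumes that the sum in \(f(z_{i})\) runs over the group of class embeddings whose boxes have IoU \(>0.65\) with Q (including \(z_{i}\) itself), that boxes are (x_min, y_min, x_max, y_max) tuples, and that returning `None` signals the text-query fallback; all names are illustrative.

```python
def iou(a, b):
    """Intersection over union of two (x_min, y_min, x_max, y_max) boxes."""
    inter_w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h
    union = ((a[2] - a[0]) * (a[3] - a[1])
             + (b[2] - b[0]) * (b[3] - b[1]) - inter)
    return inter / union if union > 0 else 0.0


def select_query_embedding(query_box, pred_boxes, class_embeddings,
                           iou_threshold=0.65):
    """Pick the most dissimilar class embedding among boxes overlapping Q.

    Returns None when no predicted box passes the IoU threshold, in which
    case the caller would fall back to a generic text query.
    """
    group = [i for i, box in enumerate(pred_boxes)
             if iou(box, query_box) > iou_threshold]
    if not group:
        return None

    def similarity_to_group(i):
        # f(z_i): sum of dot products with every embedding in the group.
        z_i = class_embeddings[i]
        return sum(sum(a * b for a, b in zip(z_i, class_embeddings[j]))
                   for j in group)

    best = min(group, key=similarity_to_group)  # argmin of f(z_i)
    return class_embeddings[best]
```

Because the background embeddings resemble each other, each of them accumulates a high similarity sum, whereas the single foreground embedding scores lowest and is selected.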

Image-Conditioned Evaluation Protocol. We follow the evaluation protocol of [16]. During evaluation, we present the model with a target image containing at least one instance of a held-out MS-COCO category and a query image patch containing the same held-out category. Both the target image and the query patch are drawn from the validation set. We report the AP50 of the detections in the target image. Note that unlike typical object detection, it is assumed that there is at least one instance of the query image category within the target image. Like prior work, we use Mask-RCNN [14] to filter out query patches which are too small or do not show the query object clearly. During detection training, we took care to hold out all categories related to any category in the held-out split. We removed annotations for any label which matched a held-out label or was a descendant of a held-out label (for example, the label “girl” is a descendant label of “person”). Beyond this we also manually removed any label which was similar to a held-out category. We will publish all held-out labels with the release of our code.

A.8 Detection Results on COCO and O365

We present additional evaluation results on the COCO and O365 datasets in Table 7. These results show the open-vocabulary generalization ability of our approach. Although we do not train these models directly on COCO or O365 (unless otherwise noted), our training datasets contain object categories overlapping with COCO and O365, so these results are not “zero-shot” according to our definition. The breadth of evaluation setups in the literature makes direct comparison to existing methods difficult. We strove to note the differences relevant for a fair comparison in Table 7.

Table 7. Open-vocabulary detection performance on COCO and O365 datasets. The results show the open-vocabulary generalization ability of our models to datasets that were not used for training. Results for models trained on the target dataset are shown in . Most of our models shown here were not trained directly on COCO or O365 (they are different from the models in Table 1). However, we did not remove COCO or O365 object categories from the training data, so these numbers are not “zero-shot”. For our models, we report the mean performance over three fine-tuning runs.

A.9 Extended Ablation Study

Table 8. Additional ablations. VG(obj) and VG(reg) respectively refer to Visual Genome object and region annotations.

Table 8 extends the ablation results provided in Table 3 of the main text. It uses the same training and evaluation protocol as outlined there, but goes further in the range of settings and architectures (ViT-B/32 and ViT-R26+B/32) considered in the study. We discuss the additional ablations below.

Dataset Ratios. In the majority of our experiments we use the OI and VG datasets for training. In the ablation study presented in the main text (Table 3), we showed that having more training data (i.e. training on both VG and OI) improves zero-shot performance. Here, we further explored the optimal ratio in which these datasets should be mixed and found that an OI:VG ratio of 7:3 worked best. Note that this overweights VG significantly compared to the relative size of these datasets. Overweighting VG might be beneficial because VG has a larger label space than OI, such that each VG example provides more valuable semantic supervision than each OI example.

We also tested the relative value of VG “object” and “region” annotations. In VG, “region” annotations provide free-text descriptions of whole image regions, as opposed to the standard single-object annotations. Interestingly, we found that training on the region annotations hurts the generalization ability of our models, so we do not use them for training.

Loss Normalization and Gradient Clipping. In its official implementation, DETR [6] uses local (i.e. per-device) loss normalization and is thus sensitive to the (local) batch size. We found this to be an important detail in practice, which can significantly affect performance. We explored whether normalizing the box, gIoU and classification losses by the number of instances in the image or the number of instances in the entire batch performed better. Our experiments show that per-example normalization performs best, but only when combined with per-example gradient clipping, i.e. when clipping the gradient norm to 1.0 for each example individually, before accumulating gradients across the batch. We found that per-example clipping improves training stability, leads to overall lower losses and allows for training models with larger batch sizes.
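
Per-example gradient clipping, as described above, can be illustrated with a small sketch. Gradients are represented as flat lists of floats, one list per example; averaging the clipped gradients over the batch is an assumption (the text only states that gradients are accumulated after clipping), and the function names are hypothetical.

```python
import math


def clip_by_global_norm(grad, max_norm=1.0):
    """Rescale one example's gradient so its global L2 norm is <= max_norm."""
    norm = math.sqrt(sum(g * g for g in grad))
    if norm == 0.0:
        return list(grad)
    scale = min(1.0, max_norm / norm)
    return [g * scale for g in grad]


def accumulate_clipped(per_example_grads, max_norm=1.0):
    """Clip each example's gradient individually, then combine over the batch.

    Assumption: the batch gradient is the mean of the clipped per-example
    gradients.
    """
    total = [0.0] * len(per_example_grads[0])
    for grad in per_example_grads:
        clipped = clip_by_global_norm(grad, max_norm)
        total = [t + c for t, c in zip(total, clipped)]
    return [t / len(per_example_grads) for t in total]
```

Clipping before accumulation bounds the influence of any single example on the update, whereas clipping the already-summed batch gradient would let one example with a very large gradient dominate the direction.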

Instance Merging. Federated datasets such as OI have non-disjoint label spaces, which means that several labels can apply to the same object, either due to (near-)synonymous labels (e.g. “Jug” and “Mug”), or due to non-disjoint concepts (e.g. “Toy” and “Elephant” labels both apply to a toy elephant). Due to the annotation procedure, in which a single label is considered at a time, one object can therefore be annotated with several similar (but not identical) bounding boxes. We found it helpful to merge such instances into a single multi-label instance. Multi-label annotations are consistent with the non-disjoint nature of federated annotations and we speculate that this provides more efficient supervision to the models, since it trains each token to predict a single box for all appropriate labels. Without this instance merging, the model would be required to predict individual boxes for each label applying to an object, which clearly cannot generalize to the countless possible object labels.

To merge overlapping instances we use a randomized iterative procedure with the following steps for each image:

  1. Pick the two instances with the largest bounding box overlap.

  2. If their intersection over union (IoU) is above a given threshold:

    2.1 Merge their labels.

    2.2 Randomly pick one of the original bounding boxes as the merged instance bounding box.

The picked instances are then removed and the procedure is repeated until no instances with a high enough IoU are left. Having explored multiple IoU thresholds, we note that not merging instances with highly similar bounding boxes is clearly worse than merging them; and that a moderately high threshold of 0.7–0.9 works best in practice.
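
The merging procedure can be sketched as below. Two details are assumptions rather than statements from the text: the merged instance is placed back into the pool so that more than two labels can end up on one box, and the default threshold of 0.8 is simply a value inside the reported 0.7–0.9 range. Instances are represented as (box, set of labels) pairs with boxes as (x_min, y_min, x_max, y_max).

```python
import random


def iou(a, b):
    """Intersection over union of two (x_min, y_min, x_max, y_max) boxes."""
    inter_w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h
    union = ((a[2] - a[0]) * (a[3] - a[1])
             + (b[2] - b[0]) * (b[3] - b[1]) - inter)
    return inter / union if union > 0 else 0.0


def merge_instances(instances, iou_threshold=0.8, rng=None):
    """Iteratively merge the most-overlapping pair of instances in an image."""
    rng = rng or random.Random()
    pool = [(tuple(box), set(labels)) for box, labels in instances]
    while len(pool) >= 2:
        # Step 1: find the pair with the largest bounding box overlap.
        best_iou, best_pair = -1.0, None
        for i in range(len(pool)):
            for j in range(i + 1, len(pool)):
                overlap = iou(pool[i][0], pool[j][0])
                if overlap > best_iou:
                    best_iou, best_pair = overlap, (i, j)
        # Step 2: stop once no pair exceeds the threshold.
        if best_iou <= iou_threshold:
            break
        i, j = best_pair
        (box_i, labels_i), (box_j, labels_j) = pool[i], pool[j]
        # Steps 2.1 and 2.2: union of labels, one original box at random.
        merged = (rng.choice([box_i, box_j]), labels_i | labels_j)
        # Assumption: the merged instance re-enters the pool.
        pool = [inst for k, inst in enumerate(pool) if k not in (i, j)]
        pool.append(merged)
    return pool
```

Each merge shrinks the pool by one instance, so the loop always terminates.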

Learning Rates. In Table 3 we show that using the same learning rate for the image and text encoders is clearly sub-optimal, and that it is necessary to train the text encoder with a lower learning rate. This may help to prevent catastrophic forgetting of the wide knowledge the model acquired during the contrastive pre-training stage. Here we explore a range of text encoder learning rates and demonstrate that the learning rate for the text encoder needs to be much lower (e.g. \(100\times \)) than that of the image encoder to get good zero-shot transfer (\(\text {AP}^\text {LVIS}_\text {rare}\)). However, freezing the text encoder completely (learning rate 0) does not work well either. \(\text {AP}^\text {OI}\), which measures in-distribution performance, behaves in the opposite way. While using the same learning rate for the image and text encoders results in a big drop in \(\text {AP}^\text {LVIS}_\text {rare}\), it increases \(\text {AP}^\text {OI}\). This demonstrates that the optimal recipe for zero-shot transfer (\(\text {AP}^\text {LVIS}_\text {rare}\)) does not necessarily maximize in-distribution performance (\(\text {AP}^\text {OI}\)).

Cropped Bounding Box Filtering. We use random image crop augmentation when training our models. Upon manual inspection of the resulting images and bounding boxes we noticed a frequent occurrence of instances with degenerate bounding boxes that no longer matched their original instance label (e.g. a bounding box around a hand with label “Person” resulting from cropping most of the person out of the image). To reduce the chance of our models overfitting due to having to memorize such instances, we remove object annotations if a large fraction of their box area falls outside of the random crop area. We found that the optimal area threshold lies between 40% and 60%, and that neither keeping all boxes nor keeping only uncropped boxes performs as well (Tables 3 and 8).

Mosaics. As described in Appendix A.6, we perform image scale augmentation by tiling multiple small images into one large “mosaic”. We explored mosaic sizes up to \(4 \times 4\), and found that while using only \(2 \times 2\) mosaics in addition to single images is clearly worse than also including larger mosaics, for the considered resolutions and patch sizes the benefit of using larger mosaics (i.e. smaller mosaic tiles) saturates with the inclusion of \(3 \times 3\) or \(4 \times 4\) mosaics. We have not performed extensive sweeps of the mosaic ratios, and for mosaics with grid sizes from \(1 \times 1\) (i.e. a single image) to \(M \times M\) we use a heuristic of sampling \(k \times k\) grids with probability \(\frac{2\cdot (M - k + 1)}{M\cdot (1 + M)}\), such that smaller mosaics are sampled more frequently than larger mosaics, in proportion to \(M - k + 1\).
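
As a quick check of the sampling heuristic, the sketch below evaluates the probability \(\frac{2\cdot (M - k + 1)}{M\cdot (1 + M)}\) for each grid size. For \(M = 3\) it gives 1/2, 1/3 and 1/6, which matches the probabilities 0.5, 0.33 and 0.17 quoted in Appendix A.6. The function names are illustrative.

```python
import random
from fractions import Fraction


def mosaic_probabilities(max_grid):
    """Probability of sampling a k x k mosaic for k = 1 .. max_grid."""
    m = max_grid
    return {k: Fraction(2 * (m - k + 1), m * (m + 1)) for k in range(1, m + 1)}


def sample_grid_size(max_grid, rng=None):
    """Draw a grid size k according to the heuristic above."""
    rng = rng or random.Random()
    probs = mosaic_probabilities(max_grid)
    sizes = list(probs)
    weights = [float(probs[k]) for k in sizes]
    return rng.choices(sizes, weights=weights, k=1)[0]
```

The probabilities sum to one for any \(M\), because \(\sum _{k=1}^{M} (M - k + 1) = M(M+1)/2\).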

Prompting. For generating text queries, similar to prior work, we augment object category names with prompt templates such as "a photo of a {}" (where {} is replaced by the category name) to reduce the distribution shift between image-level pre-training and detection fine-tuning. We use the prompt templates proposed by CLIP [33]. During training, we randomly sample from the list of 80 CLIP prompt templates such that, within an image, every instance of a category has the same prompt, but prompt templates differ between categories and across images. During testing, we evaluate the model for each of the “7 best” CLIP prompts and ensemble the resulting predicted probabilities by averaging them. The results in Table 8 show that not using any prompting does not perform well, especially on the in-distribution \(\text {AP}^\text {OI}\) metric. Perhaps unsurprisingly, test-time prompt ensembling works better in cases when random prompting was also used during training. In some cases, prompting can have different effects on different model architectures. For example, applying random prompt augmentation to the VG dataset tends to improve performance of the B/32 model, but worsens that of the R26+B/32 model. We speculate that this variability is due to the relatively small number of prompt templates; expanding the list of prompt templates might provide more consistent benefits. We thus only use train-time random prompting for the OI dataset, where it yields consistent benefits.
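
The train-time prompt sampling and test-time prompt ensembling described above can be sketched as follows. The short template list is a stand-in for the 80 CLIP templates (only "a photo of a {}" is quoted in the text), and both function names are hypothetical.

```python
import random

# Stand-in templates; the actual list of 80 CLIP templates is not reproduced.
TEMPLATES = ["a photo of a {}", "a bad photo of a {}", "a photo of the large {}"]


def sample_training_prompts(instance_categories, templates=TEMPLATES, rng=None):
    """Return one prompt per instance of a single image.

    Every instance of the same category within the image gets the same
    template, while different categories may get different templates.
    """
    rng = rng or random.Random()
    template_for = {}
    prompts = []
    for category in instance_categories:
        if category not in template_for:
            template_for[category] = rng.choice(templates)
        prompts.append(template_for[category].format(category))
    return prompts


def ensemble_probabilities(per_prompt_probs):
    """Average predicted probabilities over prompts (test-time ensembling).

    `per_prompt_probs[p][c]` is the probability for category c under prompt p.
    """
    num_prompts = len(per_prompt_probs)
    return [sum(column) / num_prompts for column in zip(*per_prompt_probs)]
```

Calling `sample_training_prompts` once per image gives prompts that are consistent within the image but vary across images, since each call draws templates afresh.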

Location Bias. As discussed in the main text, biasing box predictions to the location of the corresponding image patch improves training speed and final performance. The gain is especially large for the pure Transformer architecture (ViT-B/32 in Table 8), where removing the bias reduces performance by almost 3 points on \(\text {AP}^\text {LVIS}\) and \(\text {AP}^\text {LVIS}_\text {rare}\), whereas the hybrid R26+B/32 drops by only slightly more than 1 point. We therefore speculate that the spatial inductive bias of the convolutional component of the hybrid serves a similar function as the location bias.

Copyright information

© 2022 The Author(s), under exclusive license to Springer Nature Switzerland AG

About this paper

Cite this paper

Minderer, M. et al. (2022). Simple Open-Vocabulary Object Detection. In: Avidan, S., Brostow, G., Cissé, M., Farinella, G.M., Hassner, T. (eds) Computer Vision – ECCV 2022. ECCV 2022. Lecture Notes in Computer Science, vol 13670. Springer, Cham. https://doi.org/10.1007/978-3-031-20080-9_42

  • DOI: https://doi.org/10.1007/978-3-031-20080-9_42

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-031-20079-3

  • Online ISBN: 978-3-031-20080-9

  • eBook Packages: Computer Science (R0)
