1 Introduction

The object detection task aims to find all objects of a given set of object categories shown in an image. In many situations, however, a user looks at a picture with the intent of finding objects of one or more types that can be expressed by any noun and are not restricted to a predefined set of categories. In addition, the user may also know helpful additional information, such as the number of objects she/he is looking for in the image, which could improve object detection. This document will refer to this focused search as the “Find-That” task.

A practical example is given by an image object extraction task where a user aims to automatically extract, from a stream of images, all the occurrences of one or more specific objects (entities), e.g., all the cats and dogs contained in the images. For this task, the user’s intent is known a priori, although it may range across a large set of possible nouns. Because of that, the intent can be used to condition an object detector to obtain a better recognition rate and thus a better final performance. Figure 1 presents an example that highlights the main differences between the standard object detector and the conditioned object detector approach. The task described above differs from visual-textual grounding [3, 14, 42, 45, 59], as the latter has the objective of finding a precise object referred to by a textual phrase, while the “Find-That” task aims to find all the objects related to a set of given intents. More distinctions are underlined in the Related Works Sect. 7.

Fig. 1
figure 1

Main differences between the standard object detector and the conditioned object detector approach. (Left) A scenario involving a standard object detector detecting all the objects in the image. (Right) A scenario involving a concept-conditioned object detector, which, given the image in input jointly with the user’s intent, directly returns only the objects of interest, thus also recognizing the cat on the top right of the image missed by the standard approach

A baseline method to solve the “Find-That” task is by using an object detector that extracts all the objects in the image and then filters the results according to the specified categories. This last step is not trivial as a user can express her/his interest using nouns that are not in the categories supported by the object detector. Hence, such a baseline method should use a filtering procedure that reconciles the noun specified by the user with the set of supported categories. In such a baseline, the object detector is independent of the user’s intent and may return many undesired categories.

This work proposes a method to condition an object detector with the user’s intent, represented by one or more concepts of the WordNet [35] graph, to drive the localization and classification of only the desired objects. Hence, the object detector must be modified so that it also takes a set of concepts in input and focuses its attention only on objects of the categories directly or indirectly specified by those concepts. WordNet allows the reconciliation of the user’s intent with the set of supported categories, handles synonyms, removes the need for prompting, and copes with the problem of multiple meanings associated with the same word (i.e., polysemy) that would be present with textual inputs. Note that providing multiple concepts in input allows the model to grasp any dependencies among the concepts (i.e., among the user intents), so that the detection is not done independently, one at a time, for each user intent. Moreover, the conditioned model implicitly learns the relations between the concepts in input and the classes supported by the object detector, which are nothing but the relations defined by the WordNet graph structure.

Fig. 2
figure 2

Operational setting adopted in this work for finding all the objects contained in an image that represent the user’s intent. (Top) A standard object detector, given an image as the only input, detects all the objects, which are then filtered by the Post-processing Selection algorithm according to the WordNet concepts. (Bottom) A concept-conditioned object detector takes the WordNet concepts in input in addition to the image. The Concept Set Encoding encodes the set of concepts in an embedding space. The Fusion Block fuses the visual features returned by the Backbone with the concept features. Then, the multimodal features are used in the Object Detector Head to locate and classify the objects of interest

Figure 2 highlights the main difference between the baseline described above (top) and the proposed approach involving a concept-conditioned object detector (bottom). Starting from the image, a standard object detector detects all the objects depicted in the image and passes them to an ad hoc post-processing algorithm, which selects only the objects classified with categories that are represented by the WordNet concepts in input. Section 3 elaborates on how WordNet concepts can be matched with the object detector pre-defined categories, which is an important step for the Post-processing Selection component of the model. The proposed concept-conditioned object detector, instead, also takes a set of concepts in input and applies detection and filtering to a combination of the image features and the output of the Concept Set Encoding component. The integration of the multimodal information is implemented by the Fusion Block, which fuses the visual features returned by the model Backbone with the concept features. Afterward, the multimodal features are employed within the Object Detector Head to locate and classify all the objects of interest.

Nonetheless, this proposed approach requires new datasets to train these models with inputs made of WordNet concepts and images. For this reason, this work proposes an effective strategy to generate WordNet concepts from existing object detection datasets, removing the need to create new ad hoc datasets from scratch.

Overall, the contributions of this article can be summarized as follows: (i) it presents a novel approach to focused object search in an image by conditioning existing object detectors with the user’s search intent, represented as a set of WordNet concepts. The proposed approach can be implemented with minor changes to a standard object detector software, e.g., it does not require the modification or addition of any object detector loss; (ii) this is the first work that proposes conditioned object detectors in which the user’s intent is represented as a set of WordNet concepts. The set approach allows the user to search multiple objects at the same time, while the WordNet graph allows the user to express a query using concepts that are not directly associated with the set of pre-defined categories supported by the object detector. Moreover, WordNet handles the problem of multiple meanings associated with the same word (polysemy) that would be present with textual inputs; (iii) it proposes an effective strategy to generate WordNet concepts from already existing object detection datasets, removing the need to create new ad hoc datasets from scratch for training concept-conditioned object detectors. Therefore, concept-conditioned object detectors can be trained starting from existing datasets for object detection taking advantage of the huge amount of images and ground truth annotations available online; (iv) the evaluation highlights that the proposed concept-conditioned object detector approach performs better than the standard baseline on two widely used object detection datasets, COCO and Visual Genome, and several object detection architectures.

2 Problem formulation

Fig. 3
figure 3

A simple example that highlights the main difficulties in retrieving the dataset categories given the concepts in input. Given the user’s intent “dandy,” which refers to the WordNet concept “dandy.n.01” (the yellow node), there are two ancestor concepts, i.e., “man.n.01” (the red node) and “person.n.01” (the green node), that are associated with the dataset pre-defined categories “MAN” and “PERSON,” respectively. The naming convention of a WordNet concept is <meaning>.<type>.<number>, where meaning refers to the concept’s meaning, type refers to the part of speech (e.g., noun or adjective), and number is the sense number that distinguishes different meanings of the same word

Before giving a formal definition of the “Find-That” problem, there is the need to clarify an issue about the “intent” of the user, i.e., the expected output of an object detector that takes in input a set of WordNet concepts. In fact, given a set of input concepts, it is not straightforward to retrieve the categories that are represented by those concepts, even though it can be considered safe to assume that any object detector pre-defined category can be mapped to a corresponding concept in WordNet. Figure 3 illustrates a simple example that highlights the main difficulties: (i) WordNet concepts may have multiple concepts as parents; hence, given a concept, the set of all its ancestors could potentially result in a very large set of concepts; (ii) since the object detector’s pre-defined categories can be related to each other, as the category “PERSON” is related to the category “MAN,” the concepts associated with the pre-defined categories can also be related by parent–child relations in the WordNet graph.

Therefore, given a concept, a first approach could be to select all the pre-defined categories whose concepts are equal to or ancestors of at least one WordNet concept in input. In the example, this means that the selected objects should be classified as “MAN” and “PERSON.” However, the user may be interested in finding only objects belonging to the “MAN” category and not objects also classified as “PERSON.” In that case, the alternative approach would be to select only the category whose WordNet concept is the closest to the concept in input. In the example, this implies the selection of only the objects classified as “MAN,” discarding all the objects classified as “PERSON.” In general, one can think of an “intent” that is defined by an intended concept depth, i.e., how far one travels along the WordNet graph’s structure to retrieve the object detector categories. To cope with this challenge, in the following, the problem is formally defined by also specifying a concept depth parameter.

Let \(\mathcal {L}\) be the set of categories supported by an object detector, \(\mathcal {S}\) the set of concepts in WordNet, \(f:\mathcal {L}\rightarrow \mathcal {S}\) a function that associates to each category of the object detector a unique concept in WordNet. For every \(d\in \mathbb {N}_0\), let \(f^d:\mathcal {L}\rightarrow 2^{\mathcal {S}}\) be the function that maps a label \(l\in \mathcal {L}\) of the object detector into the set of WordNet concepts as:

$$\begin{aligned} f^0(l)&= \{f(l)\} ,\\ f^{d+1}(l)&= f^{d}(l) \cup \left\{ s\in \mathcal {S}\left| \ \exists s'\in f^d(l)\hbox { such that }s'\hbox { is a parent concept of }s\hbox { in WordNet.}\right. \right\} . \end{aligned}$$
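To make the recursion above concrete, the following sketch computes \(f^d(l)\) with the NLTK WordNet interface, assuming each category has already been mapped to a synset name; the function and variable names are illustrative, not the authors’ code.

```python
from nltk.corpus import wordnet as wn  # requires a local WordNet copy: nltk.download('wordnet')

def f_d(synset_name: str, d: int) -> set:
    """Compute f^d(l): the synset mapped to a detector category plus all of its
    WordNet descendants (hyponyms) up to depth d, returned as synset names."""
    frontier = {wn.synset(synset_name)}
    closure = set(frontier)
    for _ in range(d):
        frontier = {child for s in frontier for child in s.hyponyms()}
        closure |= frontier
    return {s.name() for s in closure}

# f_d('cat.n.01', 0) == {'cat.n.01'}; f_d('cat.n.01', 1) also contains
# 'domestic_cat.n.01', and depth 2 reaches breeds such as 'siamese_cat.n.01'.
```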

Let \(G(\varvec{I})=\{(\varvec{r}_i, l_i)\}^{n}_{i=1}\) be the set of all objects of any category in \(\mathcal {L}\) that appear in image \(\varvec{I}\). \(\varvec{r}_i \in \mathbb {R}^4\) and \(l_i \in \mathcal {L}\) are the bounding box coordinates and the category label, respectively, of the i-th object. Then, given a pair \((\varvec{I},S)\) composed of an image and a set S of WordNet concepts, and a concept depth d, the “Find-That” task produces:

$$\begin{aligned} F(\varvec{I},S,d)&= \left\{ (\varvec{r},l) \left| (\varvec{r},l)\in G(\varvec{I}) \wedge S \cap f^d(l) \ne \varnothing \right. \right\} . \end{aligned}$$

Bear in mind that the standard object detector task can be defined in the proposed framework as \(F(\varvec{I},f(\mathcal {L}),0)\).

3 Definition of a baseline

The application of standard object detectors to address the “Find-That” task involves the integration of a post-processing algorithm which filters out bounding boxes unrelated to the user’s intent. Consequently, as a baseline for this task, a standard object detector coupled with an ad hoc post-processing algorithm (i.e., the Post-processing Selection component in Fig. 2) is employed. This component selectively identifies the subset of objects aligned with the user’s intent.

Formally, given an image \(\varvec{I}\), if \(P_B(\varvec{I})=\{(\varvec{r}_i, l_i)\}^{n_p}_{i=1}\) is the set of \(n_p\) objects predicted by an object detector, the baseline approach estimates \(F(\varvec{I},S,d)\) by \(F_B(\varvec{I},S,d)\), as:

$$\begin{aligned} F_B(\varvec{I},S,d)&= \left\{ (\varvec{r},l) \left| (\varvec{r},l)\in P_B(\varvec{I}) \wedge S \cap f^d(l) \ne \varnothing \right. \right\} . \end{aligned}$$

The post-processing algorithm checks whether \(S\cap f^d(l) \ne \varnothing\), i.e., it matches the concepts in input with all the descendants, up to the maximum depth value d, of the concepts associated with the bounding boxes’ categories (\(f^d(l)\)).
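As a minimal illustration, the Post-processing Selection step can be sketched as follows, reusing the f_d helper from Sect. 2; the predictions and the category-to-synset mapping are assumed to be available, and all names are illustrative rather than the authors’ actual code.

```python
def postprocess_selection(predictions, concepts, label_to_synset, d=1):
    """Keep only the detections whose category is represented by the input concepts.

    predictions: iterable of (box, label) pairs returned by a standard detector.
    concepts: set S of WordNet synset names expressing the user's intent.
    label_to_synset: mapping f from detector categories to WordNet synset names.
    """
    kept = []
    for box, label in predictions:
        if concepts & f_d(label_to_synset[label], d):  # S ∩ f^d(l) ≠ ∅
            kept.append((box, label))
    return kept
```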

4 Concept-conditioned object detector

The baseline can be improved by exploiting an object detector conditioned by the input WordNet concepts. During training, given an image with a set of WordNet concepts in input, the object detector learns to detect only the desired objects. Hence, implicitly the model learns a mapping function from the set of WordNet concepts to the categories of the object detector. This improves the quality of proposals in input to the Post-processing Selection component.

Formally, given an image \(\varvec{I}\) and a set S of concepts, if \(P_C(\varvec{I},S)=\{(\varvec{r}_i,l_i)\}^{n_c}_{i=1}\) is the set of \(n_c\) objects predicted by a concept-conditioned object detector, \(F(\varvec{I},S,d)\) is estimated by \(F_C(\varvec{I},S,d)\), as:

$$\begin{aligned} F_C(\varvec{I},S,d)&= \left\{ (\varvec{r},l) \left| (\varvec{r},l)\in P_C(\varvec{I},S) \wedge S \cap f^d(l) \ne \varnothing \right. \right\} . \end{aligned}$$

In the following, more details on the architecture and the training procedure of the concept-conditioned object detector will be presented.

4.1 Model architecture

Fig. 4
figure 4

Overview of the concept-conditioned object detector. The Image together with the set of WordNet Concepts represents the input of the model. The Backbone extracts the visual features from the image, while the Concept Set Encoding encodes the set of WordNet concepts in input in an embedding space. The concepts in the input are highlighted in red in the WordNet block (e.g., the node representing “kitty”). Finally, the Fusion Block fuses the visual and concept features together and provides them as input to the Object Detector Head, which predicts the Boxes Coordinates and the Boxes Categories in output

Figure 4 presents an in-depth view of the Concept-Conditioned Object Detector block introduced in Fig. 2. It illustrates the proposed architecture, which exploits the information given by the set of WordNet concepts during object detection. Both an Image and a set of WordNet Concepts are provided in input to the model. The blocks that are components of a standard object detector, i.e., components defined by a meta-architecture (e.g., Faster R-CNN or RetinaNet) and a backbone (e.g., ResNet-50, ResNet-101, or Swin-Tiny), are depicted in red, while the light-blue background delimits the new blocks added to condition the object detector with concepts. The Backbone extracts the visual features from the image, while the Concept Set Encoding encodes the set of input concepts in an embedding space. Finally, the Fusion Block fuses the visual and concept features together and sends them as input to the Object Detector Head, which predicts the Boxes Coordinates and the Boxes Categories in output.
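The wiring of these blocks can be summarized with the following PyTorch-style sketch, where the concrete backbone, fusion, and head modules are left abstract and multi-scale (FPN) details are omitted; it is a simplified illustration under these assumptions, not the authors’ Detectron2 implementation.

```python
import torch.nn as nn

class ConceptConditionedDetector(nn.Module):
    """High-level composition of the blocks in Fig. 4 (sketch)."""

    def __init__(self, backbone, concept_set_encoding, fusion_block, detector_head):
        super().__init__()
        self.backbone = backbone                          # image -> visual features
        self.concept_set_encoding = concept_set_encoding  # concept set -> embedding
        self.fusion_block = fusion_block                  # (visual, concept) -> fused features
        self.detector_head = detector_head                # fused features -> boxes + categories

    def forward(self, image, concept_ids):
        visual = self.backbone(image)
        concept = self.concept_set_encoding(concept_ids)
        fused = self.fusion_block(visual, concept)
        return self.detector_head(fused)                  # box coordinates and labels
```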

4.2 Model training

A standard end-to-end gradient-based procedure can perform the training of the proposed model. However, the main issue is the lack of datasets compliant with the task definition, i.e., examples in the form \(((\varvec{I},S),F(\varvec{I},S,d))\). For this reason, this section proposes an automatic procedure to derive an ad hoc dataset \(D_F\) starting from an existing dataset D for object detection, which contains ground truth annotations for each object contained in each image \(\varvec{I}\), i.e., \(G(\varvec{I})\).

In order to define \((\varvec{I},S)\) and \(F(\varvec{I},S,d)\), it is necessary to specify the “intent” S at concept depth \(d\in \mathbb {N}_0\). Given an image \(\varvec{I}\) in D, the power set \(\xi _{G}(\varvec{I})\) of \(G(\varvec{I})\), i.e., the set of all the possible combinations of ground truths, can be automatically generated. Then, for each \(\hat{\xi }_{G} \in \xi _{G}(\varvec{I})\) with \(\hat{\xi }_{G} \ne \varnothing\), it is possible to define a new example for \(D_F\). Specifically, the set \(S_d\) of concepts can be defined as:

$$\begin{aligned} S_d = \{ \hat{s}_l \}, \,\,\, \text {with} \,\,\, \hat{s}_l \sim \mathcal {U}\left( f^d(l)\right) , \,\,\, \forall \, (\varvec{r},l)\in \hat{\xi }_{G}; \end{aligned}$$

where \(\mathcal {U}\) is the uniform probability distribution required to sample a concept \(\hat{s}_l\) among all those in \(f^d(l)\), i.e., the concept associated with the class l of the object detector together with all its descendants up to depth d.

It could be disputed that the above procedure is not correct in the case in which a child of a concept does not find a match with a pre-defined object detection category. For example, consider the concept “Siamese cat” and an object detector that only supports the category “CAT.” In this case, since f(“CAT”) returns the concept “cat,” i.e., a parent of “Siamese cat,” one runs the risk of generating an example pairing an image that portrays a cat that is not a Siamese cat with the concept “Siamese cat.” However, such a query could actually be placed by a user who is unaware of the pre-defined object detection categories (as she/he need not be), and returning a bounding box containing a non-Siamese cat is the best approximation that the object detector can do. This is a limitation of the object detector: the more pre-defined categories the object detector can deal with, the better the system’s performance will be.

However, the power set approach \(\xi _{G}(\varvec{I})\) described above generates an exponential number of training examples, making it unsuitable in practice. For this reason, in this work, \(\xi _{G}(\varvec{I})\) is sampled to obtain a reasonable amount of training examples. Specifically, given an image \(\varvec{I}\) with its ground truth annotations \(G(\varvec{I})\), the procedure that synthesizes the input concepts starts by sampling a single element \(\hat{\xi }_{G}\) uniformly from \(\xi _{G}(\varvec{I})\). For example, given the image in Fig. 1, this approach can sample three objects as ground truths and, for each of them, generate a concept to use in input. Section 6 investigates an additional sampling strategy for generating the concepts \(S_d\).
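A possible implementation of this example-synthesis step is sketched below; drawing each ground truth with probability 1/2 and rejecting the empty draw is equivalent to sampling uniformly over the \(2^n-1\) non-empty subsets. The helper names are illustrative, and f_d is the closure function sketched in Sect. 2.

```python
import random

def synthesize_example(ground_truths, label_to_synset, d=1):
    """Build one training example (S_d, targets) from G(I) = [(box, label), ...]."""
    # Uniformly sample a non-empty subset of the ground truths.
    subset = []
    while not subset:
        subset = [gt for gt in ground_truths if random.random() < 0.5]
    # For each sampled object instance, draw one concept uniformly from f^d(l).
    concepts = {random.choice(sorted(f_d(label_to_synset[label], d)))
                for _, label in subset}
    return concepts, subset
```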

The use of a uniform sampling process to acquire \(\hat{\xi }_{G}(\varvec{I})\) ensures that its expected size is smaller than that of \(G(\varvec{I})\), while still allowing every combination of ground truths to be sampled. Specifically, let \(n=|G(\varvec{I})|\) and X be a random variable representing the size of \(\hat{\xi }_{G}(\varvec{I}) \in \xi _{G}(\varvec{I})\) with \(\hat{\xi }_{G}(\varvec{I}) \ne \varnothing\). Since \(|\xi _{G}(\varvec{I})| = 2^n\) and the empty set \(\varnothing\) is not considered, the uniform sampling process is applied to a set of \(2^n-1\) elements. Therefore, the expected value \(\mathbb {E}[X]\) is:

$$\begin{aligned} \mathbb {E}[X]&= \sum _{\hat{\xi }_{G}(\varvec{I}) \in \xi _{G}(\varvec{I})} \mid \hat{\xi }_{G}(\varvec{I})\mid \frac{1}{ 2^n-1 } \\&= \frac{1}{ 2^n-1} \sum _{\hat{\xi }_{G}(\varvec{I}) \in \xi _{G}(\varvec{I})} \mid \hat{\xi }_{G}(\varvec{I})\mid \\&= \frac{n2^{(n-1)}}{2^n-1 } . \end{aligned}$$
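The closed form above can be checked numerically; for instance, for \(n=5\) it gives \(5\cdot 2^4/(2^5-1)=80/31\approx 2.58\), which the following short simulation reproduces.

```python
import itertools
import random

n = 5
# Enumerate all 2^n - 1 non-empty subsets of n objects once.
subsets = [c for r in range(1, n + 1)
           for c in itertools.combinations(range(n), r)]

samples = [len(random.choice(subsets)) for _ in range(200_000)]
print(sum(samples) / len(samples))      # empirical mean, ≈ 2.58
print(n * 2 ** (n - 1) / (2 ** n - 1))  # closed form, 2.5806...
```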

Similarly, the uniform probability distribution adopted to sample \(\hat{s}_l\) provides an equal opportunity for all concepts associated with the category l to be selected. Since the size of \(f^d(l)\) varies according to the label l, the sampling likelihood is inversely proportional to the number of elements forming the set.

5 Experiments and results

This section presents the evaluation performed to assess the effectiveness of the proposed method, namely concept-conditioned object detectors. More precisely, the comparison involves assessing several conditioned object detectors against standard object detectors for the task of identifying all objects in an image. Furthermore, conditioned object detectors are also compared with standard object detectors coupled with an ad hoc post-processing algorithm for addressing the “Find-That” task. The concept-conditioned object detectors are evaluated on datasets generated starting from two widely adopted object detection datasets: COCO and Visual Genome. The evaluation encompasses two object detector meta-architectures (i.e., RetinaNet and DynamicHead) and several backbones (i.e., ResNet-50, ResNet-101, and Swin-Tiny).

5.1 Experimental setting

5.1.1 Datasets

The COCO dataset [26] is an 80-class common object detection dataset. In this work, the 2017 version of the dataset is adopted, which consists of 118,287 training and 5K validation images. Since the COCO test set ground truths are not publicly available online, the models are tested on the COCO validation set, while 5K images are randomly selected from the training set to generate the “holdout” set, which is adopted as the validation set for model selection. The Visual Genome [23] dataset consists of 98,077 training images, 5K validation images, and 5K test images. Each object is classified according to 16K categories. Every data split is available online with its ground truth annotations. Hence, the splits available online for training, validating, and testing the models are adopted on this dataset.

The procedure presented in Sect. 4.2 allows the generation of new datasets to train and evaluate concept-conditioned models when deployed for searching all the objects contained in the images as well as just a subset of objects as specified by the input concepts. More in detail, for each original dataset, two more datasets (with all their splits) are generated. The first dataset aims to evaluate the object detector when searching for all the objects in the images (\(\hat{\xi }_{G} = G(\varvec{I})\)). In other words, for each image \(\varvec{I}\), the set S comprises at least one concept related to each ground truth in \(G(\varvec{I})\). The second dataset, dubbed “Focused,” aims to evaluate the object detector when searching for only a subset of objects in the images. For each example \((\varvec{I}, G(\varvec{I}))\), the procedure presented in Sect. 4.2 generates the example \(((\varvec{I},S),F(\varvec{I},S,d))\), which focuses on just a subset of all the objects \(G(\varvec{I})\). Note that the examples to use in input to the model during training are generated at “run-time,” while during evaluation, the results are computed on a pre-calculated set of examples. Additional details about the dataset statistics and class frequencies can be found in Appendix A.

5.1.2 Evaluation setting and metrics

The following metrics are adopted to evaluate the models’ performance: (i) mean Average Precision (AP): the mean Average Precision per class defined by the COCO datasetFootnote 1; (ii) AP50: the mean Average Precision per class, defined by the COCO dataset, computed only at the Intersection over Union (IoU) threshold of 0.5. These are standard object detection metrics that allow for a fair comparison of the proposed model on the object detection task, to demonstrate the effectiveness of the proposed approach over standard object detectors. In addition, the AP metric is evaluated by considering several bounding box dimensions. The threshold values are defined by the COCO dataset: (i) Small refers to bounding boxes whose area is less than \(32^2\) pixels; (ii) Medium refers to bounding boxes whose area is between \(32^2\) and \(96^2\) pixels; (iii) Large refers to bounding boxes whose area is larger than \(96^2\) pixels; and (iv) All refers to the case in which the evaluation is performed considering all the bounding boxes. In COCO, approximately 41% of the boxes are small, approximately 34% are medium, and approximately 24% are large.
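These metrics correspond to the standard COCO evaluation protocol; as a reference, they can be computed with the official pycocotools evaluator roughly as follows. The file paths are placeholders, and the detections are assumed to be exported in COCO JSON format.

```python
from pycocotools.coco import COCO
from pycocotools.cocoeval import COCOeval

coco_gt = COCO("focused_coco_val_annotations.json")   # ground truth annotations
coco_dt = coco_gt.loadRes("model_detections.json")    # model predictions

coco_eval = COCOeval(coco_gt, coco_dt, iouType="bbox")
coco_eval.evaluate()
coco_eval.accumulate()
coco_eval.summarize()  # prints AP, AP50, AP75 and AP for small/medium/large boxes
```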

5.1.3 Model selection

Given the large computational power required to train the object detectors, the search for the best hyperparameters was performed only on the COCO dataset. Thus, the best hyperparameters selected on COCO are adopted “as-is” for training the object detectors on the Visual Genome dataset. Model hyperparameter tuning is performed on COCO by training on the train set and validating on the “holdout” set, while on Visual Genome it is done by training on the train set and validating on the validation set available online. The evaluation results presented in this work are always obtained on the validation set for COCO and on the test set for Visual Genome.

All models are trained for 90K iterations and are then tested on the validation set. Hyperparameters related to concepts are tuned using RetinaNet [28], with ResNet-50 and a Feature Pyramid Network (FPN) [27], on the COCO dataset. Regarding the Fusion Block component (see Fig. 4), three approaches to fuse the concept embeddings with the visual features are considered: addition, multiplication, and concatenation. The best AP results were obtained with the concatenation approach. The best learning rate to use during training is chosen among the following values: [0.01, 0.001, 0.0001, 0.00005]. With DynamicHead, the best results were obtained with a value of 0.0001, while with RetinaNet, the best results were achieved with a learning rate of 0.01. The addition of more expressiveness to the Concept Set Encoding network was also studied, although the best results were obtained with the configuration outlined in Sect. 5.1.4.

5.1.4 Implementation details

For model training, all ResNet [15] backbones are initialized with pre-trained ImageNet [6] weights. The Swin-Tiny backbone is initialized with the weights provided by the authors.Footnote 2 As concept embeddings, 150-dimensional Holographic [38] embeddingsFootnote 3 trained on WordNet for 500 epochs are used. These weights are frozen during model training. During training, the batch size is set to 16 examples for all the models. The Concept Set Encoding module employs a Deep Sets [58] network. Each 150-dimensional concept embedding is mapped to a new 256-dimensional representation using a multilayer perceptron with two layers and ReLU activation functions; the first layer has 150 neurons, while the second layer has 256 neurons. Finally, all the concepts’ representations are summed and transformed into a new representation with a multilayer perceptron with two 256-dimensional layers and ReLU activation functions. The Fusion Block concatenates the embedding of the concepts, in output from the Concept Set Encoding, with the visual features in output from the model Backbone. Each object detector category is mappedFootnote 4 to its corresponding WordNet synset using the Python NLTKFootnote 5 package. When NLTK failed to find the concept associated with some categories, the linking was done manually with the synset that best represented the category meaning. Where not explicitly indicated, the concept sampling procedures are done at a maximum depth of \(d=1\). The models are implemented using the Detectron2 frameworkFootnote 6, and the experiments were performed on a distributed parallel system with several A100 GPUs.
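Based on the dimensions reported above, the Concept Set Encoding module can be sketched as the following Deep Sets-style network; this is an approximation of the described configuration, not the authors’ exact code (in particular, the placement of the ReLU activations is assumed).

```python
import torch
import torch.nn as nn

class ConceptSetEncoding(nn.Module):
    """Deep Sets encoder: per-concept MLP, permutation-invariant sum, set-level MLP."""

    def __init__(self, concept_embeddings: torch.Tensor):
        super().__init__()
        # Frozen 150-dimensional Holographic WordNet embeddings (vocab_size x 150).
        self.embed = nn.Embedding.from_pretrained(concept_embeddings, freeze=True)
        # Applied to each concept independently: 150 -> 150 -> 256.
        self.phi = nn.Sequential(nn.Linear(150, 150), nn.ReLU(),
                                 nn.Linear(150, 256), nn.ReLU())
        # Applied to the summed set representation: 256 -> 256 -> 256.
        self.rho = nn.Sequential(nn.Linear(256, 256), nn.ReLU(),
                                 nn.Linear(256, 256), nn.ReLU())

    def forward(self, concept_ids: torch.Tensor) -> torch.Tensor:
        # concept_ids: (batch, set_size) indices of the input WordNet concepts.
        x = self.phi(self.embed(concept_ids))  # (batch, set_size, 256)
        return self.rho(x.sum(dim=1))          # (batch, 256) set embedding
```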

5.1.5 Object detector architectures and backbones

Table 1 Number of parameters composing each model

The proposed approach is evaluated considering two object detectors, namely RetinaNet [28] and DynamicHead [5]. To assess the effectiveness of the proposed approach, each model is evaluated considering three backbones: ResNet-50 [15], ResNet-101, and Swin-Tiny [30]. Each model adopts the Feature Pyramid Network [27] to extract image features at different resolutions. Each object detector model is modified to be conditioned with the concepts, i.e., “Concept RetinaNet” and “Concept DynamicHead” are the proposed models that can exploit the user’s intent during object detection. Table 1 presents the number of parameters composing each model adopted in this work. In particular, the table reports the number of parameters forming the backbone, the size of the concept vocabulary, and the number of parameters composing the head of the model. The head of the model is in charge of locating and classifying the objects in the image, and for this reason, its dimension depends on the number of classes to predict. In other words, the size of the model’s head changes according to the dataset. Fusion Strategy refers to the function applied to fuse the visual and concept information, which will be discussed in more detail in Sect. 5.5.

5.2 Standard vs. concept-conditioned object detectors before filtering

Table 2 Object detection results on datasets generated from COCO and Visual Genome with \(d=1\) and considering several bounding box dimensions

This section investigates the benefit of leveraging the user’s intent directly in the object detector architectures. This is done by evaluating the models in detecting all the objects in the images before filtering, i.e., the evaluation is performed before employing the Post-processing Selection component. For the concept-conditioned object detectors, the set of WordNet concepts used to condition the object detection process is built appropriately to include a concept for each object present in the image ground truth annotations.

Table 2 presents the results obtained by the object detectors when deployed for searching all the objects contained in COCO and Visual Genome datasets. AP (%) refers to the object detection mean Average Precision, while AP50 (%) refers to the mean Average Precision with IoU \(\ge 0.5\). Models conditioned with the concepts are highlighted with the dove gray color.

Noticeably, the proposed concept-conditioned models, exploiting the user’s intent, consistently perform better than standard object detectors when deployed for searching all objects depicted in an image. Concept DynamicHead achieves the best outcomes in both datasets with ResNet as the backbone. Overall, the improvements given by Concept DynamicHead over the standard DynamicHead models are higher than the improvements of Concept RetinaNet over standard RetinaNet. On COCO, the largest AP improvement (6.1%) is given by Concept DynamicHead (50.2%) over DynamicHead (44.1%), both with ResNet-101/50. Even on Visual Genome, the same architecture and backbone give the best improvements (3.9% for ResNet-101).

Unexpectedly, the DynamicHead model coupled with the ResNet-50 and ResNet-101 backbones performs similarly. Given the higher expressivity of ResNet-101 over ResNet-50, allowed by its larger number of parameters (57.8M), its outcomes would be expected to be better than those of ResNet-50. This is likely due to the non-exhaustive model selection performed on COCO and Visual Genome, which is detailed in Sect. 5.1.3. Regarding the AP metric evaluated according to the bounding box dimensions (i.e., columns Small, Medium, and Large), it is visible that the concept-conditioned models benefit mostly in detecting small objects in COCO and large objects in Visual Genome.

Overall, whenever the user’s intent is exploited to condition the object detector architectures, their detection performance increases.

Fig. 5
figure 5

Object detection results on COCO varying the concept depth values used to generate the WordNet concepts. Results obtained with Concept RetinaNet and ResNet-50

During model training, the maximum concept depth d considered during the WordNet sampling process plays a fundamental role in the proposed approach. High depth values force the model to learn more relations among categories and WordNet concepts, making the task that the model has to solve more challenging. Conversely, a low depth value makes the learning task easier while constraining the generalization of the proposed approach to only a small set of concepts.

Figure 5 highlights the impact of employing different depth values on the concept-conditioned models. The results were obtained with RetinaNet, using ResNet-50 as the backbone, on the COCO validation set. In this experiment, depth value \(d=0\) refers to examples involving concepts in \(S_0\) (i.e., only concepts associated with the objects’ categories), depth value \(d=1\) refers to examples involving concepts in \(S_1\), and so on. As can be seen from the figure, the best AP result is obtained with a depth value of 0, and there is no abrupt deterioration in the results when increasing the depth value from 0 to 4. More in detail, from \(d=0\) to \(d=1\), the deterioration in the AP metric amounts to 0.4%, and the same drop is observed at \(d=4\). The largest drop in performance is observed for \(d=3\), where it reaches 0.6%. Bear in mind that the number of concepts the user can adopt to express her/his intent grows significantly from 80 at depth \(d=0\) to 7274 at depth \(d=4\), and that these numbers increase further when the Visual Genome dataset is considered.

To conclude, these results suggest that it is possible to generalize the model to use 7274 different WordNet concepts trading off some model effectiveness when considering the COCO dataset.

5.3 Searching for a subset of objects

Table 3 Object detection results on Focused COCO and Focused Visual Genome datasets with \(d=1\) and considering several bounding box dimensions

This section compares concept-conditioned and standard object detectors, coupled with the Post-processing Selection component, to search for just a subset of objects depicted in the images consistent with the input concepts. To this aim, the models are assessed on the datasets generated as explained in Sect. 4.2, which are dubbed “Focused COCO” and “Focused Visual Genome.”

Table 3 presents the obtained results, from which it can be seen that concept-conditioned models outperform standard object detectors in all architecture and backbone combinations. On both datasets, the best AP results are achieved by deploying DynamicHead with ResNet backbones. On Focused COCO, the largest AP improvement (3.1%) is given by Concept DynamicHead (52.1%) over DynamicHead (49.0%), both with ResNet-50, while on Focused Visual Genome, the largest AP improvement (2.8%) is given by Concept DynamicHead (13.7%) over DynamicHead (10.9%), both with ResNet-101. In general, the improvements achieved on the Focused COCO dataset by the conditioned models are higher than those achieved on the Focused Visual Genome dataset. More details on the number of detected objects per image are reported in Appendix D, while an evaluation of the statistical significance of the results is detailed in Appendix E.

Fig. 6
figure 6

Results obtained on the Focused COCO validation set for each category using the Post-processing Selection component. The results are obtained by adopting Swin-Tiny as the backbone. AP refers to the mean Average Precision metric

Figure 6 presents the AP metric values obtained per class in the Focused COCO dataset by DynamicHead model coupled with the Swin-Tiny backbone and with the Post-processing Selection component. In other words, the models search for a subset of objects in the image. The figure shows that the Concept DynamicHead model obtains higher results than the standard DynamicHead model in most classes. However, it presents lower results only in a small number of classes, such as “HAIR DRIER,” “KNIFE” and “MICROWAVE.” Future works will investigate these classes in more detail.

In conclusion, conditioning the object detection with the user’s intent generally improves the detection performance of an object detector that adopts a post-processing procedure for selecting the boxes of interest.

5.4 Qualitative results

Fig. 7
figure 7

Qualitative Results. The ground truth bounding boxes are indicated with red lines, while the bounding boxes predicted by the model are indicated with light blue dashed lines. The column on the left reports the predictions of DynamicHead, while the center and right columns report the predictions of Concept DynamicHead given the concepts highlighted under the images

Figure 7 presents some qualitative examples predicted by standard and concept-conditioned object detectors. The standard object detector focuses its attention on all the objects in the images and sometimes is not able to detect the most important bounding boxes, such as the “KEYBOARD” in the first row. On the contrary, the concept-conditioned object detector focuses its attention only on those bounding boxes that correspond to the concepts in input, improving bounding box detection performance and decreasing the number of detected boxes compared to standard object detectors. The right column highlights the main use case of the proposed model, i.e., when the object detector is used to focus the detection only on a subset of objects depicted in the image. An interesting mismatch between concepts and object detector classes is given by the second row in the right column. Given the concept “male.n.01,” the object detector focuses its detection on the bounding box depicting the woman and classifies it as “PERSON.” Clearly, the concept in input was focusing only on males, but the object detector class that best approximates that concept is “PERSON,” as “MALE” is not a COCO class. In fact, the ground truth bounding box is also classified as “PERSON.”

5.5 Further analysis on the fusion block

Fig. 8
figure 8

Object detection results considering different multimodal fusion strategies. The concept-conditioned object detector searches for all the objects depicted in the image. Results obtained with Concept DynamicHead, ResNet-101, and \(d=1\). VG stands for Visual Genome

The block that fuses information from the visual modality with information from the concept modality (the Fusion Block visible in Fig. 4) plays a crucial role in constructing the concept-conditioned object detector. The best results were obtained with the concatenation of the features, which implies a larger input to the Object Detector Head (i.e., slightly more neurons) and could, by itself, explain the improvement in the object detector capabilities. To discern whether that is the case, this section presents the results of adopting the “Addition” strategy, which sums the visual and concept features without increasing the size of the Object Detector Head, i.e., the head is composed of the same number of neurons.

Mathematically, consider the vectors \(\varvec{e}_v \in \mathbb {R}^{s_e}\) and \(\varvec{e}_c \in \mathbb {R}^{s_e}\), which correspond to the visual and concept features of an input (I, S) produced by the Backbone and the Concept Set Encoding, respectively. The addition function obtains the fused features as \(\varvec{e}_m = \varvec{e}_v + \varvec{e}_c\), while the concatenation operation computes \(\varvec{e}_m = Concat(\varvec{e}_v, \varvec{e}_c)\), where Concat stacks the two vectors consecutively. The addition strategy produces features in \(\mathbb {R}^{s_e}\), whereas the concatenation strategy produces features in \(\mathbb {R}^{2s_e}\).
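The two strategies can be sketched as follows, glossing over the broadcasting of the concept vector across the spatial locations of the feature maps; this is an illustrative sketch, not the actual Detectron2 code.

```python
import torch

def fuse(e_v: torch.Tensor, e_c: torch.Tensor, strategy: str = "concat") -> torch.Tensor:
    """Fuse visual features e_v with concept-set features e_c (both of size s_e)."""
    if strategy == "add":      # keeps the Object Detector Head input at s_e
        return e_v + e_c
    if strategy == "concat":   # doubles the head input size to 2 * s_e
        return torch.cat([e_v, e_c], dim=-1)
    raise ValueError(f"unknown fusion strategy: {strategy}")
```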

Figure 8 presents the AP results. In particular, for COCO and Visual Genome, it uses the evaluation setting adopted in Sect. 5.2, while for Focused COCO and Focused Visual Genome, it uses the setting adopted in Sect. 5.3. The DynamicHead architecture with ResNet-101 as the backbone is used for these experiments. In both settings, it can be observed that the Concept DynamicHead model adopting the “Addition” fusion strategy performs much better than the standard DynamicHead model. On the other hand, the “Addition” strategy is not as competitive as the “Concatenation” strategy.

In conclusion, these results demonstrate that, although more neurons help to improve the object detector performance, the major improvements are obtained using the concepts in input. More details are reported in Appendix B.

6 Concept sampling impact

During the model training and the creation of the new datasets with concepts (see Sect. 4.2), two sampling processes take place. The former aims to reduce the exponential number of examples induced by the power set approach, i.e., \(\hat{\xi }_{G}(\varvec{I})\), while the latter aims to sample a concept \(\hat{s}_l\) associated with each label l. This section reports the results obtained by changing the sampling process applied to \(S_d\).Footnote 7 Instead of sampling one concept for each object to search in the image (as done before), it analyzes the models’ performances when a concept is provided in input for each type of object to search in the image. Mathematically, \(S_d = \{ \hat{s}_l \}\) such that:

$$\begin{aligned} \hat{s}_l \sim \mathcal {U}\left( f^d(l)\right) , \,\, \forall \, l\in \{l\mid (\varvec{r},l) \in \hat{\xi }_{G}\}. \end{aligned}$$

This is a more generic setting than before, as the prior information concerns only the types of objects to search for, not the number of occurrences of the same object type in the image. For example, according to Fig. 1, in this new setting one provides as input two concepts: one concept for the object labeled as “BOWL” and one sampled concept associated with the objects labeled as “CAT.” Following the same example, all the experiments previously performed provided as input three concepts: one for “BOWL” and two sampled concepts for “CAT,” i.e., one for each cat in the image.
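In code, the only change with respect to the earlier sampling sketch is that concepts are drawn per unique category rather than per object instance; again, this is an illustrative sketch with hypothetical helper names, reusing f_d from Sect. 2.

```python
import random

def synthesize_example_per_type(ground_truths, label_to_synset, d=1):
    """Per-type variant: one sampled concept for each distinct category in the subset."""
    subset = []
    while not subset:
        subset = [gt for gt in ground_truths if random.random() < 0.5]
    unique_labels = {label for _, label in subset}          # categories, not instances
    concepts = {random.choice(sorted(f_d(label_to_synset[label], d)))
                for label in unique_labels}
    return concepts, subset
```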

The new sampling strategy is also adopted for generating new datasets. More in detail, the new “Focused” datasets adopted in this section are built starting from the “Focused” datasets adopted in Sect. 5, where only one concept is kept for each type of object. This guarantees that new datasets have the same ground truth as the starting datasets and that the concepts only differ. Going back to the previous example, given the two concepts related to the two cats appearing in the image, only one concept is sampled and adopted as input for searching for both cats in the picture. Statistics of the datasets generated with the new sampling strategy are visible in Appendix C.

6.1 Standard vs. concept-conditioned object detectors before filtering

Table 4 Object detection results on datasets generated from COCO and Visual Genome with the new sampling strategy and \(d=1\)

Table 4 presents the results obtained by the object detectors when they are deployed for searching all the objects contained in the COCO and Visual Genome datasets. The evaluation setting complies with that of Sect. 5.2, i.e., it is performed before filtering. From the results, the same trend highlighted in Table 2 is visible: when the object detector is conditioned with concepts, its ability to localize the objects in the image improves. On COCO, the largest AP improvement (5.2%) is given by Concept DynamicHead (49.3%) over DynamicHead (44.1%), both with ResNet-101. Even on Visual Genome, the same architecture and backbone give the best improvement (3.5%).

Fig. 9
figure 9

Object detection results on COCO varying the concept depth values used to generate the WordNet concepts with the new sampling strategy. Results obtained with Concept RetinaNet and ResNet-50

Figure 9 highlights the impact of employing different depth values on the concept-conditioned models adopting the new sampling strategy. The results were obtained with RetinaNet, using ResNet-50 as the backbone, on the COCO validation set. As can be seen from the figure, the best AP result is obtained with a depth value of 0, and there is no abrupt deterioration in the results when increasing the depth value from 0 to 4. More in detail, from depth 0 to 1, the deterioration in the AP metric amounts to 1% (the largest drop), while from depth 1 to 4, the deterioration amounts to 0.5%.

In conclusion, these results suggest that, also with this new sampling strategy, it is possible on the COCO dataset to generalize the model to the use of 7274 different WordNet concepts while trading off some of the model’s effectiveness.

6.2 Searching for a subset of objects

Table 5 Object detection results obtained with the new sampling strategy on Focused COCO and Focused Visual Genome with \(d=1\)

Table 5 compares concept-conditioned and standard object detectors, coupled with the Post-processing Selection component, to search for just a subset of objects depicted in the images and consistent with the input concepts. This evaluation setting complies with that of Sect. 5.3.

From this table, it is visible that concept-conditioned models outperform standard object detectors in most architecture and backbone combinations, with the only exception of Concept RetinaNet with ResNet-101 on the Focused Visual Genome dataset. On both datasets, the best AP results are achieved by deploying DynamicHead with the Swin-Tiny backbone. On Focused COCO, the largest AP improvement (2%) is given by Concept DynamicHead (51.0%) over DynamicHead (49.0%), both with ResNet-50, while on Focused Visual Genome, the largest AP improvement (2.3%) is given by Concept DynamicHead (13.0%) over DynamicHead (10.7%), both with ResNet-101. However, in this case, the improvements achieved on the Focused Visual Genome dataset by the conditioned models are higher than those achieved on Focused COCO. Note that only on Focused Visual Genome does Concept RetinaNet perform slightly worse than the standard version, which could be explained by the non-exhaustive search of hyperparameters performed during model selection (see Sect. 5.1.3).

To conclude, even when the new sampling strategy is considered, conditioning the object detection with the user’s intent clearly improves the performance of the object detector. More details on the number of detected objects per image are reported in Appendix  D.

7 Related works

This work is mainly related to the research areas reported below.

Object Detection Task The “Find-That” task aims to find all objects depicted in an image that are related to the user intent, and, as presented in Sect. 3, object detectors can be adapted to solve the proposed task. In addition, the proposed approach is also related to the Open-Vocabulary Object Detection [13, 18, 36, 47, 53, 55] area of research, although with significant differences. First, the approach presented in this work is not about augmenting or changing the predefined set of classes supported by the object detector, but about conditioning the search for objects in the image by aligning the classes with the user intent. Second, the approach presented in this work is designed to use knowledge graph nodes rather than textual words (possibly with prompting) to express the user intent, i.e., the multimodal data is composed of graph and vision data, and it does not rely on a multimodal large-scale pre-training setting. Nevertheless, a larger set of supported classes would undoubtedly benefit the proposed approach.

A thorough search of the relevant literature yielded that only Fornoni et al. [10] proposed a similar work. The authors aim to condition object detectors with prior information (as done in this paper), focusing mainly on object detectors with efficiency constraints (mobile). They re-use existing object detector code with minor changes and develop a procedure to generate the user’s prior intent from the ground truths available in existing object detection datasets. However, there is a significant difference in how the user intent is represented. In [10], the object detector is modified to consider an input composed of images and categories augmented with spatial information needed to constrain the object search in the image. The categories are those defined by the dataset, and their model is conditioned with a vector of ones (to search) and zeroes (not to search) for each class (i.e., an 80-dimensional vector for COCO). In this work, the proposed approach is conditioned in input with WordNet concepts instead of categories, and the concepts are not augmented with spatial location information, even if the model can be easily extended to do so. Hence, the approach of Fornoni et al. [10] does not tackle the mismatch problem between the concepts expressed by the user and the classes of the object detector, thus solving an easier problem compared to the task addressed in this work. In addition, since the target label is provided as an input to their proposed conditional model, their approach only aims to localize the objects in the image and does not classify them. For this reason, their evaluation is category-agnostic. Instead, the concept-conditioned models presented in this work aim to locate and correctly classify the objects depicted in the image. Thus, the evaluation adopted in this work is category-aware.

A direct experimental comparison versus the above approach is not possible since: (i) their approach uses different prior information than that adopted in this work, i.e., they adopt category vectors of ones and zeros while this work adopts WordNet embeddings; (ii) their evaluation setting (online style) significantly differs from that adopted in this work (fixed test set); and (iii) their code is not available online, making an evaluation impossible in a setting comparable to that used in this work.

Multimodal Downstream Tasks The proposed approach is also related to multimodal research areas, as object detection is often used as a building block for solving many other downstream tasks, such as Visual Grounding [3, 4, 14, 17, 42, 44, 45, 50, 59], Visual Question Answering [1, 25, 46, 48, 62, 63], Visual-Textual-Knowledge Entity Linking (VTKEL) [7,8,9], and Image-Text Retrieval [12, 20, 21, 31, 33, 54, 56]. Note that the approach proposed in this work could be deployed to solve the VTKEL problem, conditioning the object detector with the knowledge graph entities, and also to solve the Visual Grounding task by conditioning the object detector with the entities extracted from the text through word-sense disambiguation [2, 22, 24, 34, 37, 41, 49, 60, 61].

Visual Grounding Task The proposed “Find-That” task resembles the visual grounding [3, 14, 17, 42, 44, 45, 59] task, although there are substantial differences. First of all, in visual grounding the user intent needs to be represented as a textual phrase, while in the proposed approach the user intent is expressed with one or more WordNet [35] concepts. Secondly, following the current state of the art, visual grounding models predict in output only the bounding box that best matches the textual phrase. For this reason, when the user intent concerns multiple distinct objects depicted in the image, multiple independent queries should be performed to retrieve all objects of interest. In addition, when the user intent concerns multiple objects of the same type, the visual grounding approach is no longer suitable. Lastly, for training, visual grounding models need detailed datasets comprising images, box coordinates, textual phrases, and box-phrase ground truth alignments. These annotations are difficult to collect and, for this reason, visual grounding datasets contain fewer examples than object detection datasets.

Despite these differences, Fornoni et al. [10] performed a comparison between an SSD [29] object detector coupled with ResNet-101 and a One-Stage BERT visual grounding model [57]. In particular, the results of the SSD model were filtered according to the class expressed by the query in input, as done in the baseline proposed in this work. To summarize, Fornoni et al. [10] verified that the visual grounding model has poor generalization ability and underperforms a simple post-processing baseline.

These findings support the idea presented in this work regarding the necessity of conditioning object detectors with prior information.

8 Conclusion and future works

This work presents a novel approach to focused object search in an image by conditioning existing object detectors with the user’s search intent, which is represented as a set of WordNet concepts. It can be implemented with minor changes to standard object detectors and does not require the modification or addition of any loss. This is the first work that conditions object detectors with the user’s intent represented as a set of WordNet concepts. The proposed concept-conditioned object detector can be trained on existing datasets for object detection without the need to add or modify existing annotations to consider the WordNet concepts. The proposed approach was tested for searching all objects on COCO and Visual Genome datasets and also for searching just subsets of objects using the newly defined Focused COCO and Focused Visual Genome datasets. The evaluation, which was performed considering several object detector architectures and backbones, has demonstrated that the proposed concept-conditioned object detector performs better than the standard object detector baseline.

Future works aim to improve the fusion of the multimodal information in the object detector architecture, as well as to investigate the possibility of conditioning the object detector with a WordNet sub-graph, made of concepts and relations representing the intent of the user, leveraging graph neural networks such as [11, 19, 39, 52]. The proposed concept-conditioned model implicitly learns the relations among the concepts in input and the corresponding categories. Those relations are determined by the parent–child relations defined in the WordNet graph. Future works will extend the WordNet relations considered by the model, especially with the intent of introducing new useful information. This is particularly relevant when extending the proposed method to also consider entities that belong to heterogeneous knowledge graphs, such as YAGO [16, 32, 40, 51]. Indeed, YAGO could deliver more useful information, such as morphological details about entities, to help the object detector detect objects in the images. Finally, future works will also integrate this model within a word sense disambiguation system to solve multimodal text-image tasks, such as visual question answering and visual-textual grounding.