Abstract
Object detectors search for all objects belonging to a pre-defined set of categories in a given picture. However, users are often not interested in finding all objects, but only those that pertain to a small set of categories or concepts. Nowadays, the standard approach to solve this task involves initially employing an object detector to identify all objects within the image, followed by refining the outcomes to retain only the ones of interest. Nevertheless, the object detector does not take advantage of the user’s prior intent, which, when used, can potentially improve the detection performance of the model. This work presents a method to condition an existing object detector with the user’s intent, encoded as one or more concepts from the WordNet graph, to find just those objects of interest. The proposed approach takes advantage of existing datasets for object detection without the need for new annotations, and it allows existing object detector models to be adapted with minor changes. The evaluation, performed on the COCO and the Visual Genome datasets considering several object detector architectures, shows that conditioning the search on concepts is actually beneficial. The code and the pre-trained model weights are released at: https://github.com/drigoni/Concept-Conditioned-Object-Detector.
1 Introduction
The object detection task aims to find all objects of a given set of object categories shown in an image. In many situations, however, a user looks at a picture with the intent of finding objects of one or more types, which may be expressed by any noun and are not restricted to a pre-defined set of categories. In addition, the user may also know helpful additional information, like the number of objects they are looking for in the image, which could improve object detection. This document will refer to this focused search as the “Find-That” task.
A practical example is given by an image object extraction task where a user aims to automatically extract, from a stream of images, all the occurrences of one or more specific objects (entities), e.g., all the cats and dogs contained in the images. For this task, the user’s intent is known a priori, although it may range across a large set of possible nouns. Because of that, the intent can be used to condition an object detector to obtain a better recognition rate and thus better final performance. Figure 1 presents an example that highlights the main differences between the standard object detector and the conditioned object detector approach. The task described above differs from visual-textual grounding [3, 14, 42, 45, 59], as the latter has the objective of finding a precise object referred to by a textual phrase, while the “Find-That” task aims to find all the objects related to a set of given intents. More distinctions are underlined in the Related Works (Sect. 7).
A baseline method to solve the “Find-That” task is by using an object detector that extracts all the objects in the image and then filters the results according to the specified categories. This last step is not trivial as a user can express her/his interest using nouns that are not in the categories supported by the object detector. Hence, such a baseline method should use a filtering procedure that reconciles the noun specified by the user with the set of supported categories. In such a baseline, the object detector is independent of the user’s intent and may return many undesired categories.
This work proposes a method to condition an object detector with the user’s intent, represented by one or more concepts of the WordNet [35] graph, to drive the localization and classification of only the desired objects. Hence, there is the need to modify the object detector so that it also takes as input a set of concepts and focuses its attention only on objects of the categories directly or indirectly specified by the concepts. WordNet allows the reconciliation of the user’s intent with the set of supported categories, handles synonyms, removes the need for prompting, and copes with the problem of multiple meanings associated with the same word (i.e., polysemy) that would be present with textual inputs. Note that using more concepts in input allows the model to grasp any dependencies among the concepts (i.e., among the user intents) so that the detection is not done independently, one at a time, for each user intent. Moreover, the conditioned model implicitly learns the relations between the concepts in input and the classes supported by the object detector, which are nothing but the relations defined by the WordNet graph structure.
Figure 2 highlights the main difference between the baseline described above (top) and the proposed approach involving a concept-conditioned object detector (bottom). Starting from the image, a standard object detector detects all the objects depicted in the image and passes them to an ad hoc post-processing algorithm, which selects only the objects classified with categories that are represented by the WordNet concepts in input. Section 3 elaborates on how WordNet concepts can be matched with the object detector pre-defined categories, which is an important component for the Post-processing Selection component of the model. The proposed concept-conditioned object detector also takes as input a set of concepts and applies the object detection and filtering phase to a combination of the image and the output of the Concept Set Encoding component. The integration of the multimodal information is implemented by the Fusion Block, which fuses the visual features returned by the model Backbone with the concept features. Afterward, the multimodal features are employed within the Object Detector Head to locate and classify all the objects of interest.
Nonetheless, this proposed approach requires new datasets to train these models with inputs made of WordNet concepts and images. For this reason, this work proposes an effective strategy to generate WordNet concepts from existing object detection datasets, removing the need to create new ad hoc datasets from scratch.
Overall, the contributions of this article can be summarized as follows: (i) it presents a novel approach to focused object search in an image by conditioning existing object detectors with the user’s search intent, represented as a set of WordNet concepts. The proposed approach can be implemented with minor changes to a standard object detector software, e.g., it does not require the modification or addition of any object detector loss; (ii) this is the first work that proposes conditioned object detectors in which the user’s intent is represented as a set of WordNet concepts. The set approach allows the user to search multiple objects at the same time, while the WordNet graph allows the user to express a query using concepts that are not directly associated with the set of pre-defined categories supported by the object detector. Moreover, WordNet handles the problem of multiple meanings associated with the same word (polysemy) that would be present with textual inputs; (iii) it proposes an effective strategy to generate WordNet concepts from already existing object detection datasets, removing the need to create new ad hoc datasets from scratch for training concept-conditioned object detectors. Therefore, concept-conditioned object detectors can be trained starting from existing datasets for object detection taking advantage of the huge amount of images and ground truth annotations available online; (iv) the evaluation highlights that the proposed concept-conditioned object detector approach performs better than the standard baseline on two widely used object detection datasets, COCO and Visual Genome, and several object detection architectures.
2 Problem formulation
Before giving a formal definition of the “Find-That” problem, there is the need to clarify an issue about the “intent” of the user, i.e., the expected output from an object detector that takes in input a set of WordNet concepts. In fact, given a set of input concepts, it is not straightforward to retrieve the categories that are represented by those concepts, although it can be considered safe to assume that any pre-defined object detector category can be mapped to a corresponding concept in WordNet. Figure 3 illustrates a simple example that highlights the main difficulties: (i) a WordNet concept may have multiple concepts as parents; hence, given a concept, the set of all its ancestors could potentially be very large; (ii) since the object detector’s pre-defined categories can be related to each other, as the category “PERSON” is related to the category “MAN,” the concepts associated with the pre-defined categories can also be related by parent–child relations in the WordNet graph.
Therefore, given a concept, a first approach could be to select all the pre-defined categories whose concepts are equal to or ancestors of at least one WordNet concept in input. In the example, this means that the selected objects should be classified as “MAN” and “PERSON.” However, the user may be interested in finding only objects belonging to the “MAN” category and not objects also classified as “PERSON.” In that case, the alternative approach would be to select only the category whose WordNet concept is the closest to the concept in input. In the example, this implies the selection of only the objects classified as “MAN,” discarding all the objects classified as “PERSON.” In general, one could think of an “intent” that is defined by the intended concept depth, i.e., how far one traverses the WordNet graph’s structure to retrieve the object detector categories. To cope with this challenge, in the following, the problem is formally defined by also specifying a concept depth parameter.
Let \(\mathcal {L}\) be the set of categories supported by an object detector, \(\mathcal {S}\) the set of concepts in WordNet, and \(f:\mathcal {L}\rightarrow \mathcal {S}\) a function that associates to each category of the object detector a unique concept in WordNet. For every \(d\in \mathbb {N}_0\), let \(f^d:\mathcal {L}\rightarrow 2^{\mathcal {S}}\) be the function that maps a label \(l\in \mathcal {L}\) of the object detector into the set of WordNet concepts as:

\(f^d(l) = \{f(l)\} \cup \{s \in \mathcal {S} \mid s \text { is a descendant of } f(l) \text { in WordNet at distance at most } d\}\)

Note that \(f^0(l)=\{f(l)\}\).
Let \(G(\varvec{I})=\{(\varvec{r}_i, l_i)\}^{n}_{i=1}\) be the set of all objects that appear in image \(\varvec{I}\) of any category in \(\mathcal {L}\), where \(\varvec{r}_i \in \mathbb {R}^4\) and \(l_i \in \mathcal {L}\) are the bounding box coordinates and the category label, respectively, of the i-th object. Then, given a pair \((\varvec{I},S)\) composed of an image and a set S of WordNet concepts, and a concept depth d, the “Find-That” task produces:

\(F(\varvec{I},S,d) = \{(\varvec{r}_i, l_i) \in G(\varvec{I}) \mid S\cap f^d(l_i) \ne \varnothing \}\)
Bear in mind that the standard object detector task can be defined in the proposed framework as \(F(\varvec{I},f(\mathcal {L}),0)\).
3 Definition of a baseline
The application of standard object detectors to address the “Find-That” task involves the integration of a post-processing algorithm which filters out bounding boxes unrelated to the user’s intent. Consequently, as a baseline for this task, a standard object detector coupled with an ad hoc post-processing algorithm (i.e., the Post-processing Selection component in Fig. 2) is employed. This component selectively identifies the subset of objects aligned with the user’s intent.
Formally, given an image \(\varvec{I}\), if \(P_B(\varvec{I})=\{(\varvec{r}_i, l_i)\}^{n_p}_{i=1}\) is the set of \(n_p\) objects predicted by an object detector, the baseline approach estimates \(F(\varvec{I},S,d)\) by \(F_B(\varvec{I},S,d)\), as:

\(F_B(\varvec{I},S,d) = \{(\varvec{r}_i, l_i) \in P_B(\varvec{I}) \mid S\cap f^d(l_i) \ne \varnothing \}\)
The post-processing algorithm checks if \(S\cap f^d(l) \ne \varnothing\), i.e., it matches the concepts in input against all the descendants, up to depth d, of the concepts associated with the bounding box categories (\(f^d(l)\)).
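The baseline's selection step can be sketched as follows. This is a minimal illustration, not the authors' implementation: the WordNet hierarchy is replaced by a hypothetical toy parent-to-children map, and `f` is a hand-written category-to-concept mapping.

```python
# Hypothetical stand-in for the WordNet parent -> children structure.
CHILDREN = {
    "person": ["man", "woman"],
    "man": ["male_child"],
    "cat": ["siamese_cat"],
}

def f_d(concept, d):
    """Return the descendants of `concept` up to depth d,
    including the concept itself (so f^0 = {concept})."""
    frontier, result = {concept}, {concept}
    for _ in range(d):
        frontier = {c for p in frontier for c in CHILDREN.get(p, [])}
        result |= frontier
    return result

def baseline_filter(predictions, S, d, f):
    """Keep predicted boxes (r, l) whose concept set f^d(f(l))
    intersects the user's intent S (the S ∩ f^d(l) ≠ ∅ test)."""
    return [(r, l) for (r, l) in predictions if S & f_d(f[l], d)]

# f maps each detector category to its WordNet concept (illustrative).
f = {"PERSON": "person", "CAT": "cat"}
preds = [((0, 0, 10, 10), "PERSON"), ((5, 5, 20, 20), "CAT")]
# The intent "man" matches "PERSON" only when descendants are explored.
print(baseline_filter(preds, {"man"}, 1, f))  # keeps the PERSON box
print(baseline_filter(preds, {"man"}, 0, f))  # [] at depth 0
```

At depth \(d=0\) only exact concept matches survive, which mirrors the depth discussion of Sect. 2.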
4 Concept-conditioned object detector
The baseline can be improved by exploiting an object detector conditioned by the input WordNet concepts. During training, given an image with a set of WordNet concepts in input, the object detector learns to detect only the desired objects. Hence, implicitly the model learns a mapping function from the set of WordNet concepts to the categories of the object detector. This improves the quality of proposals in input to the Post-processing Selection component.
Formally, given an image \(\varvec{I}\) and a set S of concepts, if \(P_C(\varvec{I},S)=\{(\varvec{r}_i, l_i)\}^{n_c}_{i=1}\) is the set of \(n_c\) objects predicted by a concept-conditioned object detector, \(F(\varvec{I},S,d)\) is estimated by \(F_C(\varvec{I},S,d)\), as:

\(F_C(\varvec{I},S,d) = \{(\varvec{r}_i, l_i) \in P_C(\varvec{I},S) \mid S\cap f^d(l_i) \ne \varnothing \}\)
In the following, more details on the architecture and the training procedure of the concept-conditioned object detector will be presented.
4.1 Model architecture
Figure 4 presents an in-depth zoom on the Concept-Conditioned Object Detector block presented in Fig. 2. It illustrates the proposed architecture that exploits the information given by the set of WordNet concepts during object detection. In fact, both an Image and a set of WordNet Concepts are provided in input to the model. The blocks that are components of a standard object detector, i.e., components that are defined by a meta-architecture (e.g., Faster R-CNN or RetinaNet) and a backbone (e.g., ResNet-50, ResNet-101, or Swin-Tiny), are depicted using the red color, while the background in light-blue color delimits the new blocks added to condition the object detector with concepts. The Backbone extracts the visual features from the image, while the Concept Set Encoding encodes in an embedding space the set of concepts in input. Finally, the Fusion Block fuses the visual and concept features together and sends them as input to the Object Detector Head, which predicts the Boxes Coordinates and the Boxes Categories in output.
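The fusion of visual and concept features can be sketched as follows. This is a minimal NumPy illustration with made-up shapes, assuming the concatenation strategy later selected in Sect. 5.1.3, not the actual Detectron2 implementation: the pooled concept embedding is broadcast over the spatial grid and appended to the channel dimension of the backbone's feature map.

```python
import numpy as np

def fuse_concat(visual_feats, concept_emb):
    """visual_feats: (C, H, W) backbone feature map;
    concept_emb: (E,) Concept Set Encoding output.
    Returns a (C + E, H, W) multimodal feature map."""
    _, h, w = visual_feats.shape
    # Tile the concept vector at every spatial location.
    tiled = np.broadcast_to(concept_emb[:, None, None],
                            (concept_emb.shape[0], h, w))
    return np.concatenate([visual_feats, tiled], axis=0)

# Illustrative shapes: 256 visual channels, 256-d concept embedding.
fused = fuse_concat(np.zeros((256, 32, 32)), np.ones(256))
print(fused.shape)  # → (512, 32, 32)
```

The fused map would then feed the Object Detector Head in place of the purely visual features.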
4.2 Model training
A standard end-to-end gradient-based procedure can perform the training of the proposed model. However, the main issue is the lack of datasets compliant with the task definition, i.e., examples in the form \(((\varvec{I},S),F(\varvec{I},S,d))\). For this reason, this section proposes an automatic procedure to derive an ad hoc dataset \(D_F\) starting from an existing dataset D for object detection, which contains ground truth annotations for each object contained in each image \(\varvec{I}\), i.e., \(G(\varvec{I})\).
In order to define \((\varvec{I},S)\) and \(F(\varvec{I},S,d)\), it is necessary to specify the “intent” S at concept depth \(d\in \mathbb {N}_0\). Given an image \(\varvec{I}\) in D, the power set \(\xi _{G}(\varvec{I})\) of \(G(\varvec{I})\), i.e., the set of all the possible combinations of ground truths, can be automatically generated. Then, for each \(\hat{\xi }_{G} \in \xi _{G}(\varvec{I})\) with \(\hat{\xi }_{G} \ne \varnothing\), it is possible to define a new example for \(D_F\). Specifically, the set \(S_d\) of concepts can be defined as:

\(S_d = \{\hat{s}_l \mid (\varvec{r}, l) \in \hat{\xi }_{G},\; \hat{s}_l \sim \mathcal {U}(f^d(l))\}\)
where \(\mathcal {U}\) is the uniform probability distribution required to sample a concept \(\hat{s}_l\) among all those in \(f^d(l)\), which are all the descendants of the concept associated with the class l of the object detector.
It could be disputed that the above procedure is not correct in the case in which a child of a concept does not find a match with a pre-defined object detection category. For example, consider the concept “Siamese cat” and an object detector that only supports the category “CAT.” In this case, since f(“CAT”) returns the concept “cat,” i.e., a parent of “Siamese cat,” one runs the risk of generating an example involving an image that portrays a cat that is not a Siamese cat jointly with the concept “Siamese cat.” However, such a query could actually be posed by a user who is unaware, as she/he need not be, of the pre-defined object detection categories, and returning a bounding box containing a non-Siamese cat is the best approximation that the object detector can offer. This is a limitation of the object detector: the more pre-defined categories the object detector can deal with, the better the system’s performance will be.
However, the power set approach \(\xi _{G}(\varvec{I})\) described above generates an exponential number of training examples, making it unsuitable in practice. For this reason, in this work, \(\xi _{G}(\varvec{I})\) is sampled to obtain a reasonable number of training examples. Specifically, given an image \(\varvec{I}\) with its ground truth annotations \(G(\varvec{I})\), the procedure that synthesizes the input concepts starts by uniformly sampling a single element \(\hat{\xi }_{G}\) from \(\xi _{G}(\varvec{I})\). For example, given the image in Fig. 1, this approach can sample three objects as ground truths and, for each of them, generate a concept to use in input. Section 6 investigates an additional sampling strategy for generating the concepts \(S_d\).
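The run-time example synthesis just described can be sketched as follows. This is an illustration, not the authors' code: the `descendants` map stands in for \(f^d\) and is hypothetical, and uniformity over the non-empty subsets is obtained by resampling whenever the empty set is drawn.

```python
import random

def make_example(ground_truths, descendants, rng):
    """ground_truths: list of (box, label) pairs, i.e., G(I);
    descendants: label -> list of WordNet concepts, i.e., f^d(l).
    Returns (S_d, kept ground truths) for one training example."""
    # Uniform over the 2^n - 1 non-empty subsets of G(I): each element
    # is kept with probability 1/2, resampling if the subset is empty.
    subset = []
    while not subset:
        subset = [gt for gt in ground_truths if rng.random() < 0.5]
    # For each kept object, draw one concept uniformly from f^d(l).
    concepts = {rng.choice(descendants[l]) for _, l in subset}
    return concepts, subset

# Illustrative ground truths and a toy descendants map.
gts = [((0, 0, 1, 1), "CAT"), ((1, 1, 2, 2), "DOG")]
desc = {"CAT": ["cat", "siamese_cat"], "DOG": ["dog"]}
S, kept = make_example(gts, desc, random.Random(0))
print(S, kept)
```

In training, a fresh `(S, kept)` pair would be drawn for every image at every epoch, matching the "run-time" generation noted in Sect. 5.1.1.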
The use of a uniform sampling process to acquire \(\hat{\xi }_{G}(\varvec{I})\) ensures that its expected size is smaller than that of \(G(\varvec{I})\) while allowing the sampling of each combination of ground truths. Specifically, let \(n=|G(\varvec{I})|\) and X be a random variable representing the size of \(\hat{\xi }_{G}(\varvec{I}) \in \xi _{G}(\varvec{I})\) with \(\hat{\xi }_{G}(\varvec{I}) \ne \varnothing\). Since \(|\xi _{G}(\varvec{I})| = 2^n\) and the empty set \(\varnothing\) is not considered, the uniform sampling process is applied to a set of \(2^n-1\) elements. Therefore, the expected value \(\mathbb {E}[X]\) is:

\(\mathbb {E}[X] = \frac{1}{2^n-1}\sum _{k=1}^{n} k\binom{n}{k} = \frac{n\,2^{n-1}}{2^n-1}\)
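The expected subset size can be checked numerically: the sketch below compares a direct enumeration of \(\mathbb {E}[X]\) over all non-empty subset sizes with the closed form \(n\,2^{n-1}/(2^n-1)\), which approaches n/2 for large n and is therefore smaller than n for \(n \ge 2\).

```python
from math import comb

def expected_size(n):
    # E[X] = sum_k k * C(n, k) / (2^n - 1), uniform over non-empty subsets.
    return sum(k * comb(n, k) for k in range(1, n + 1)) / (2**n - 1)

def closed_form(n):
    return n * 2**(n - 1) / (2**n - 1)

for n in (1, 3, 8):
    assert abs(expected_size(n) - closed_form(n)) < 1e-12
print(expected_size(3))  # 12/7 ≈ 1.714, smaller than n = 3
```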
Similarly, the uniform probability distribution adopted to sample \(\hat{s}_l\) provides an equal opportunity for all concepts associated with the category l to be selected. Since the size of \(f^d(l)\) varies according to the label l, the sampling likelihood is inversely proportional to the number of elements forming the set.
5 Experiments and results
This section presents the evaluation performed to assess the effectiveness of the proposed method, namely concept-conditioned object detectors. More precisely, the comparison involves assessing several conditioned object detectors against standard object detectors for the task of identifying all objects in an image. Furthermore, conditioned object detectors are also compared with standard object detectors coupled with an ad hoc post-processing algorithm for addressing the “Find-That” task. The concept-conditioned object detectors are evaluated on datasets generated starting from two widely adopted object detection datasets: COCO and Visual Genome. The evaluation encompasses two object detector meta-architectures (i.e., RetinaNet and DynamicHead) and several backbones (i.e., ResNet-50, ResNet-101, and Swin-Tiny).
5.1 Experimental setting
5.1.1 Datasets
The COCO dataset [26] is an 80-class common object detection dataset. In this work, the 2017 version of the dataset is adopted, which comprises 118,287 training and 5K validation images. Since the COCO test set ground truths are not publicly available online, the models are tested on the COCO validation set, while 5K images are randomly selected from the training set to generate the “holdout” set, which is adopted as the validation set for model selection. The Visual Genome [23] dataset consists of 98,077 training images, 5K validation images, and 5K test images. Each object is classified according to 16K categories. Every data split is available online with its ground truth annotations. Hence, the splits available online for training, validating, and testing the models are adopted for this dataset.
The procedure presented in Sect. 4.2 allows the generation of new datasets to train and evaluate concept-conditioned models when deployed for searching all the objects contained in the images as well as just a subset of objects as specified by the input concepts. More in detail, for each original dataset, two more datasets (with all their splits) are generated. The first dataset aims to evaluate the object detector when searching for all the objects in the images (\(\hat{\xi }_{G} = G(\varvec{I})\)). In other words, for each image \(\varvec{I}\), the set S comprises at least one concept related to each ground truth in \(G(\varvec{I})\). The second dataset, dubbed “Focused,” aims to evaluate the object detector when searching for only a subset of objects in the images. For each example \((\varvec{I}, G(\varvec{I}))\), the procedure presented in Sect. 4.2 generates the example \(((\varvec{I},S),F(\varvec{I},S,d))\), which focuses on just a subset of all the objects \(G(\varvec{I})\). Note that the examples used as input to the model during training are generated at “run-time,” while during evaluation, the results are computed on a pre-calculated set of examples. Additional details about the dataset statistics and class frequencies can be found in Appendix A.
5.1.2 Evaluation setting and metrics
The following metrics are adopted to evaluate the models’ performance: (i) mean Average Precision (AP): the mean Average Precision per class as defined by the COCO dataset; (ii) AP50: the mean Average Precision per class, defined by the COCO dataset, which evaluates the AP metric considering only the Intersection over Union (IoU) threshold of 0.5. These are standard object detection metrics that, in the end, allow for a fair comparison of the proposed model on the object detection task to demonstrate the effectiveness of the proposed approach over standard object detectors. In addition, the AP metric is evaluated by considering several bounding box dimensions. The threshold values are defined by the COCO dataset: (i) Small refers to bounding boxes whose area is less than \(32^2\) pixels; (ii) Medium refers to bounding boxes whose area is between \(32^2\) and \(96^2\) pixels; (iii) Large refers to bounding boxes whose area is larger than \(96^2\) pixels; while (iv) All refers to the case in which the evaluation is performed considering all the bounding boxes. In COCO, approximately 41% of the boxes are small, approximately 34% are medium, and approximately 24% are large.
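As a reminder of how the IoU threshold underlying AP50 operates, a minimal sketch follows; boxes are assumed here to be (x1, y1, x2, y2) corner coordinates, and this is not tied to any specific evaluation library.

```python
def iou(a, b):
    """Intersection over Union of two axis-aligned boxes
    given as (x1, y1, x2, y2) corner coordinates."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter)

# A half-overlapping pair of 10x10 boxes: IoU = 50 / 150 = 1/3,
# so this prediction would not count as correct under AP50.
print(iou((0, 0, 10, 10), (5, 0, 15, 10)))
```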
5.1.3 Model selection
Given the large computational power required to train the object detectors, the search for the best hyperparameters was performed only on the COCO dataset. Thus, the best hyperparameters selected on COCO are adopted “as-is” for training the object detectors on the Visual Genome dataset. Model hyperparameter tuning is performed on COCO by training on the train set and validating on the “holdout” set, while on Visual Genome it is done by training on the train set and validating on the validation set available online. The evaluation results presented in this work are always obtained on the validation set for COCO and the test set for Visual Genome.
All models are trained for 90K iterations and are then tested on the validation set. Hyperparameters related to concepts are tuned using RetinaNet [28], with ResNet-50 and Feature Pyramid Network (FPN) [27], on the COCO dataset. Regarding the Fusion Block component (see Fig. 4), three approaches to fuse the concept embeddings with the visual features are considered: addition, multiplication, and concatenation. The best AP results were obtained by adopting the concatenation approach. The best learning rate to use during training is chosen among the following values: [0.01, 0.001, 0.0001, 0.00005]. With DynamicHead, the best results were obtained with a value of 0.0001, while with RetinaNet, the best results were achieved with a learning rate value of 0.01. The addition of more expressiveness to the Concept Set Encoding network was also studied, although the best results were obtained with the configuration outlined in Sect. 5.1.4.
5.1.4 Implementation details
For model training, all ResNet [15] backbones are initialized with pre-trained ImageNet [6] weights. The Swin-Tiny backbone is initialized with the weights provided by the authors. As concept embeddings, 150-dimensional Holographic [38] embeddings trained on WordNet for 500 epochs are used. These weights are frozen during model training. During training, the batch size is set to 16 examples for all the models. The Concept Set Encoding module employs a Deep Sets [58] network. Each 150-dimensional concept embedding is mapped to a new 256-dimensional representation using a multilayer perceptron with two layers and ReLU activation functions: the first layer has 150 neurons, while the second has 256. Finally, all the concept representations are summed and transformed into a new representation with a multilayer perceptron with two 256-dimensional layers and ReLU activation functions. The Fusion Block concatenates the concept embedding output by the Concept Set Encoding to the visual features output by the model Backbone. Each object detector category is mapped to its corresponding WordNet synset using the Python NLTK package. When NLTK failed to find the concept associated with some categories, the linking was done manually with the synset that best represented the category meaning. Where not explicitly indicated, the concept sampling procedures are performed at a maximum depth of \(d=1\). The models are implemented using the Detectron2 framework, and the experiments were performed on a distributed parallel system with several A100 GPUs.
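The Concept Set Encoding just described can be sketched as follows. Weights are random stand-ins and the exact layer details (e.g., where ReLUs are applied) are assumptions, so this is an illustration of the Deep Sets structure rather than the released implementation.

```python
import numpy as np

def mlp(x, w1, w2):
    """Two-layer MLP with ReLU activations (biases omitted for brevity)."""
    return np.maximum(np.maximum(x @ w1, 0) @ w2, 0)

def encode_concepts(embs, phi, rho):
    """Deep Sets encoder: embs is a (k, 150) matrix of concept
    embeddings; phi maps each to 256-d, the results are summed,
    and rho produces the final (256,) set representation."""
    return mlp(mlp(embs, *phi).sum(axis=0), *rho)

rng = np.random.default_rng(0)
# Per-concept MLP: 150 -> 150 -> 256, as described in the text.
phi = (rng.standard_normal((150, 150)), rng.standard_normal((150, 256)))
# Post-sum MLP: 256 -> 256 -> 256.
rho = (rng.standard_normal((256, 256)), rng.standard_normal((256, 256)))

embs = rng.standard_normal((3, 150))  # a set of three concepts
out = encode_concepts(embs, phi, rho)
print(out.shape)  # → (256,)
```

Because the per-concept representations are summed, the output is invariant to the order of the input concepts, which is what makes a Deep Sets network suitable for encoding a concept *set*.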
5.1.5 Object detector architectures and backbones
The proposed approach is evaluated considering two object detectors, namely RetinaNet [28] and DynamicHead [5]. To assess the effectiveness of the proposed approach, each model is evaluated considering three backbones: ResNet-50 [15], ResNet-101, and Swin-Tiny [30]. Each model adopts the Feature Pyramid Network [27] to extract image features at different resolutions. Each object detector model is modified to be conditioned with the concepts, i.e., “Concept RetinaNet” and “Concept DynamicHead” are the proposed models that can exploit the user’s intent during object detection. Table 1 presents the number of parameters composing each model adopted in this work. In particular, the table reports the number of parameters forming the backbone, the size of the concept vocabulary, and the number of parameters composing the head of the model. The head of the model is in charge of locating and classifying the objects in the image, and for this reason, its dimension depends on the number of classes to predict. In other words, the size of the model’s head changes according to the dataset. Fusion Strategy refers to the function applied to fuse the visual and concept information that will be discussed more in detail in Sect. 5.5.
5.2 Standard vs. concept-conditioned object detectors before filtering
This section investigates the benefit of leveraging the user’s intent directly in the object detector architectures. It is done by evaluating the models in detecting all the objects in the images before filtering, i.e., the evaluation is performed before employing the Post-processing Selection component. For the concept-conditioned object detectors, the set of WordNet concepts used to condition the object detection process is built appropriately to include a concept for each object present in the image ground truth annotations.
Table 2 presents the results obtained by the object detectors when deployed for searching all the objects contained in COCO and Visual Genome datasets. AP (%) refers to the object detection mean Average Precision, while AP50 (%) refers to the mean Average Precision with IoU \(\ge 0.5\). Models conditioned with the concepts are highlighted with the dove gray color.
Noticeably, the proposed concept-conditioned models, which exploit the user’s intent, consistently perform better than standard object detectors when deployed for searching all objects depicted in an image. Concept DynamicHead achieves the best outcomes on both datasets with ResNet as the backbone. Overall, the improvements given by Concept DynamicHead over the standard DynamicHead models are higher than those of Concept RetinaNet over standard RetinaNet. On COCO, the largest AP improvement (6.1%) is given by Concept DynamicHead (50.2%) over DynamicHead (44.1%), both with ResNet-101/50. Even on Visual Genome, the same architecture and backbone give the best improvements (3.9% for ResNet-101).
Unexpectedly, the DynamicHead model coupled with the ResNet-50 and ResNet-101 backbones performs similarly. Given the higher expressivity of ResNet-101 over ResNet-50, due to its larger number of parameters (57.8M), its outcomes would be expected to be better than those of ResNet-50. This is likely due to the non-exhaustive model selection performed on COCO and Visual Genome, which is detailed in Sect. 5.1.3. Regarding the AP metric evaluated according to the bounding box dimensions (i.e., columns Small, Medium, and Large), it is visible that the concept-conditioned models benefit mostly in detecting small objects in COCO and large objects in Visual Genome.
Overall, whenever the user’s intent is exploited to condition the object detector architectures, their detection performance increases.
During model training, the maximum concept depth d value considered during the WordNet sampling process plays a fundamental role in the proposed approach. High-depth values force the model to learn more relations among categories and WordNet concepts, making the task that the model has to solve more challenging. Conversely, a low-depth value makes the learning task easier while constraining the generalization of the proposed approach to only a small set of concepts.
Figure 5 highlights the impact of employing different depth values on the concept-conditioned models. The results were obtained with RetinaNet, using ResNet-50 as the backbone, on the COCO test set. In this experiment, depth value \(d=0\) refers to examples involving concepts in \(S_0\) (i.e., only concepts associated with the object’s categories), depth value \(d=1\) refers to examples involving concepts in \(S_1\), and so on. As can be seen from the figure, the best AP result is obtained with a depth value of 0, and there is no abrupt deterioration in the results when increasing the depth value from 0 to 4. More in detail, from \(d=0\) to \(d=1\), the deterioration in the AP metric amounts to 0.4%, the same value as for \(d=4\). The largest drop in performance is observed for \(d=3\), where it reaches 0.6%. Bear in mind that the number of concepts the user can adopt to express her/his intent grows significantly from 80 at depth \(d=0\) to 7274 at depth \(d=4\), and that these numbers increase when the Visual Genome dataset is considered.
To conclude, these results suggest that it is possible to generalize the model to use 7274 different WordNet concepts trading off some model effectiveness when considering the COCO dataset.
5.3 Searching for a subset of objects
This section compares concept-conditioned and standard object detectors, coupled with the Post-processing Selection component, to search for just a subset of objects depicted in the images consistent with the input concepts. To this aim, the models are assessed on the datasets generated as explained in Sect. 4.2, which are dubbed “Focused COCO” and “Focused Visual Genome.”
Table 3 presents the obtained results, from which it can be seen that concept-conditioned models outperform standard object detectors across all combinations of architectures and backbones. On both datasets, the best AP results are achieved by deploying DynamicHead with ResNet backbones. On Focused COCO, the largest AP improvement (3.1%) is given by Concept DynamicHead (52.1%) over DynamicHead (49.0%), both with ResNet-50, while on Focused Visual Genome, the largest AP improvement (2.8%) is given by Concept DynamicHead (13.7%) over DynamicHead (10.9%), both with ResNet-101. In general, the improvements achieved by the conditioned models on the Focused COCO dataset are higher than those achieved on the Focused Visual Genome dataset. More details on the number of detected objects per image are reported in Appendix D, while an evaluation of the statistical significance of the results is detailed in Appendix E.
Figure 6 presents the per-class AP values obtained on the Focused COCO dataset by the DynamicHead model with the Swin-Tiny backbone and the Post-processing Selection component; in other words, the models search for a subset of the objects in the image. The figure shows that the Concept DynamicHead model obtains higher results than the standard DynamicHead model in most classes, and lower results only in a small number of classes, such as “HAIR DRIER,” “KNIFE,” and “MICROWAVE.” Future works will investigate these classes in more detail.
In conclusion, conditioning the object detection with the user’s intent generally improves the detection performance of an object detector that adopts a post-processing procedure for selecting the boxes of interest.
5.4 Qualitative results
Figure 7 presents some qualitative examples predicted by standard and concept-conditioned object detectors. The standard object detector spreads its attention over all the objects in the images and sometimes fails to detect the most important bounding boxes, such as the “KEYBOARD” in the first row. On the contrary, the concept-conditioned object detector focuses only on those bounding boxes that express the input concepts, improving the detection performance and decreasing the number of detected boxes compared to standard object detectors. The right column highlights the intended use case of the proposed model, namely focusing the detection on a subset of the objects depicted in the image. An interesting mismatch between concepts and object detector classes is given by the second row in the right column. Given the concept “male.n.01,” the object detector focuses its detection on the bounding box depicting the woman and classifies it as “PERSON.” Clearly, the input concept targeted only males, but the object detector class that best approximates that concept is “PERSON,” as “MALE” is not a COCO class. In fact, the ground truth bounding box is also classified as “PERSON.”
5.5 Further analysis on the fusion block
The block that fuses information from the visual modality with information from the concept modality (the Fusion Block visible in Fig. 4) plays a crucial role in constructing the concept-conditioned object detector. The best results were obtained by concatenating the features, which implies a larger input to the Object Detector Head (i.e., slightly more neurons) and could by itself explain the improvement in the object detector capabilities. To discern whether that is the case, this section presents the results of adopting the “Addition” strategy, which sums the visual and concept features without increasing the size of the Object Detector Head, i.e., the head is composed of the same number of neurons.
Mathematically, consider the vectors \(\varvec{e}_v \in \mathbb {R}^{s_e}\) and \(\varvec{e}_c \in \mathbb {R}^{s_e}\), which correspond to the visual and concept features of an input (I, S), provided to the Fusion Block by the visual backbone and by the Concept Set Encoding, respectively. The addition function obtains the fused features as \(\varvec{e}_m = \varvec{e}_v + \varvec{e}_c\), while the concatenation operation obtains them as \(\varvec{e}_m = Concat(\varvec{e}_v, \varvec{e}_c)\), where Concat stacks the two vectors consecutively. The addition strategy produces features in \(\mathbb {R}^{s_e}\), whereas the concatenation strategy produces features in \(\mathbb {R}^{2s_e}\).
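The two fusion strategies can be sketched directly from these definitions, here on plain Python lists standing in for \(\varvec{e}_v\) and \(\varvec{e}_c\):

```python
# The two fusion strategies compared in this section, sketched on plain Python
# lists standing in for the feature vectors e_v, e_c in R^{s_e}.
def fuse_addition(e_v, e_c):
    """'Addition': element-wise sum; the output stays in R^{s_e}."""
    return [v + c for v, c in zip(e_v, e_c)]

def fuse_concatenation(e_v, e_c):
    """'Concatenation': stack the two vectors; the output lives in R^{2 s_e}."""
    return e_v + e_c  # list concatenation, doubling the head's input size
```

Only the concatenation variant changes the number of neurons in the Object Detector Head, which is precisely the confound the “Addition” experiment is designed to rule out.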
Figure 8 presents the AP results. In particular, for COCO and Visual Genome, the evaluation setting of Sect. 5.2 is used, while for Focused COCO and Focused Visual Genome, the setting of Sect. 5.3 is adopted. The DynamicHead architecture with ResNet-101 as the backbone is used for these experiments. In both settings, it can be observed that the Concept DynamicHead model adopting the “Addition” fusion strategy performs much better than the standard DynamicHead model. On the other hand, the “Addition” strategy is not as competitive as the “Concatenation” strategy.
In conclusion, these results demonstrate that, although the additional neurons help to improve the object detector performance, the major improvements are obtained by using the concepts in input. More details are reported in Appendix B.
6 Concept sampling impact
During the model training and the creation of the new datasets with concepts (see Sect. 4.2), two sampling processes take place. The former aims to reduce the exponential number of examples induced by the powerset approach, i.e., \(\hat{\xi }_{G}(\varvec{I})\), while the latter aims to sample a concept \(\hat{s}_l\) associated with each label l. This section reports the results obtained by changing the sampling process applied to \(S_d\) (Footnote 7). Instead of sampling one concept for each object to search for in the image (as done before), this section analyzes the models’ performance when one concept is provided in input for each type of object to search for in the image. Mathematically, \(S_d = \{ \hat{s}_l \}\) such that:
This is a more generic setting than before, as the prior information concerns only the types of objects to search for, not the number of occurrences of each object in the image. For example, with reference to Fig. 1, in this new setting one provides two concepts as input: one sampled concept associated with the object labeled “BOWL” and one sampled concept associated with the objects labeled “CAT.” Following the same example, all the experiments previously performed provided three concepts as input: one for “BOWL” and two sampled concepts for “CAT,” i.e., one for each cat in the image.
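The two sampling strategies can be contrasted with a short sketch. The helper below is hypothetical (the real procedure samples from the WordNet-derived concept sets); the seed only makes the example reproducible.

```python
# Sketch contrasting the two sampling strategies: one concept per object
# *instance* (the setting of Sect. 5) versus one concept per object *type*
# (the setting of this section). Names and data are illustrative.
import random

def sample_concepts(labels, label_to_concepts, per_type, rng):
    """labels: ground-truth labels of the image, e.g. ['cat', 'cat', 'bowl'].
    label_to_concepts: concepts associated with each label.
    per_type=True samples once per distinct label; False, once per instance."""
    targets = sorted(set(labels)) if per_type else labels
    return [rng.choice(sorted(label_to_concepts[l])) for l in targets]

rng = random.Random(0)
l2c = {"cat": {"cat.n.01", "feline.n.01"}, "bowl": {"bowl.n.01"}}
per_instance = sample_concepts(["cat", "cat", "bowl"], l2c, per_type=False, rng=rng)
per_type = sample_concepts(["cat", "cat", "bowl"], l2c, per_type=True, rng=rng)
```

For the two-cats-and-a-bowl image of the example, the per-instance strategy yields three input concepts while the per-type strategy yields two, matching the counts described above.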
The new sampling strategy is also adopted for generating new datasets. More in detail, the new “Focused” datasets adopted in this section are built starting from the “Focused” datasets adopted in Sect. 5, where only one concept is kept for each type of object. This guarantees that the new datasets have the same ground truth as the starting datasets and that only the concepts differ. Going back to the previous example, given the two concepts related to the two cats appearing in the image, only one concept is sampled and adopted as input for searching for both cats in the picture. Statistics of the datasets generated with the new sampling strategy are reported in Appendix C.
6.1 Standard vs. concept-conditioned object detectors before filtering
Table 4 presents the results obtained by the object detectors when they are deployed to search for all the objects contained in the COCO and Visual Genome datasets. The evaluation setting complies with that of Sect. 5.2, i.e., the evaluation is performed before filtering. The results show the same trend highlighted in Table 2: when the object detector is conditioned with concepts, its ability to localize the objects in the image improves. On COCO, the largest AP improvement (5.2%) is given by Concept DynamicHead (49.3%) over DynamicHead (44.1%), both with ResNet-101. On Visual Genome, the same architecture and backbone give the largest improvement (3.5%).
Figure 9 highlights the impact of employing different depth values on the concept-conditioned models adopting the new sampling strategy. The results were obtained with RetinaNet, using ResNet-50 as the backbone, on the COCO validation set. As can be seen from the figure, the best AP result is obtained with a depth value of 0, and there is no abrupt deterioration when increasing the depth value from 0 to 4. More in detail, the biggest deterioration in the AP metric, 1%, occurs from depth 0 to depth 1, while from depth 1 to depth 4 the deterioration amounts to 0.5%.
In conclusion, these results suggest that, also with this new sampling strategy, on the COCO dataset the model can be generalized to 7274 different WordNet concepts at the cost of some of its effectiveness.
6.2 Searching for a subset of objects
Table 5 compares concept-conditioned and standard object detectors, coupled with the Post-processing Selection component, to search for just a subset of objects depicted in the images and consistent with the input concepts. This evaluation setting complies with that of Sect. 5.3.
From this table, it is visible that concept-conditioned models outperform standard object detectors in most architecture and backbone combinations, with the only exception of Concept RetinaNet with ResNet-101 on the Focused Visual Genome dataset. On both datasets, the best AP results are achieved by DynamicHead with the Swin-Tiny backbone. On Focused COCO, the largest AP improvement (2%) is given by Concept DynamicHead (51.0%) over DynamicHead (49.0%), both with ResNet-50, while on Focused Visual Genome, the largest AP improvement (2.3%) is given by Concept DynamicHead (13.0%) over DynamicHead (10.7%), both with ResNet-101. In this case, however, the improvements achieved by the conditioned models on the Focused Visual Genome dataset are higher than those achieved on Focused COCO. Note that only on Focused Visual Genome does Concept RetinaNet perform slightly worse than the standard version, which could be explained by the non-exhaustive search of hyper-parameters performed during model selection (see Sect. 5.1.3).
To conclude, even when the new sampling strategy is considered, conditioning the object detection with the user’s intent clearly improves the performance of the object detector. More details on the number of detected objects per image are reported in Appendix D.
7 Related works
This work is mainly related to the research areas reported below.
Object Detection Task The “Find-That” task aims to find all objects depicted in an image that are related to the user intent, and, as presented in Sect. 3, object detectors can be adapted to solve it. The proposed approach is also related to the Open-Vocabulary Object Detection [13, 18, 36, 47, 53, 55] area of research, although with significant differences. First, the approach presented in this work does not augment or change the predefined set of classes supported by the object detector; rather, it conditions the search for objects in the image, aligning the classes with the user intent. Second, it is designed to express the user intent with knowledge graph nodes rather than with textual words (possibly aided by prompting), i.e., the multimodal data is composed of graph and vision data, and it does not adopt a multimodal large-scale pre-training setting. Nevertheless, a larger set of supported classes would undoubtedly benefit the proposed approach.
A thorough search of the relevant literature revealed that only Fornoni et al. [10] proposed a similar work. The authors aim to condition object detectors with prior information (as done in this paper), focusing mainly on object detectors with efficiency constraints (mobile). They re-use existing object detector code with minor changes and develop a procedure to generate the user’s prior intent from the ground truths available in existing object detection datasets. However, there is a significant difference in how the user intent is represented. In [10], the object detector is modified to consider an input composed of images and categories augmented with the spatial information needed to constrain the object search in the image. The categories are those defined by the dataset, and their model is conditioned with a vector of ones (to search) and zeros (not to search) over the classes (i.e., an 80-dimensional vector for COCO). In this work, the proposed approach is conditioned on WordNet concepts instead of categories, and the concepts are not augmented with spatial location information, even if the model can easily be extended to do so. Hence, the approach of Fornoni et al. [10] does not tackle the mismatch problem between the concepts expressed by the user and the classes of the object detector, thus solving an easier problem than the task addressed in this work. In addition, since the target label is provided as input to their conditional model, their approach only aims to localize the objects in the image and does not classify them; for this reason, their evaluation is category-agnostic. Instead, the concept-conditioned models presented in this work aim to locate and correctly classify the objects depicted in the image, so the evaluation adopted in this work is category-aware.
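The difference between the two priors can be made concrete with a short illustrative sketch (the class list and function names below are hypothetical stand-ins):

```python
# Illustrative contrast: Fornoni et al. [10] condition the detector on a binary
# vector over its own classes, whereas this work conditions it on WordNet
# concepts that need not coincide with any detector class.
CLASSES = ["person", "bicycle", "cat"]  # truncated stand-in for COCO's 80 classes

def binary_intent_vector(wanted, classes=CLASSES):
    """Fornoni-style prior: 1 for classes to search for, 0 otherwise."""
    return [1 if c in wanted else 0 for c in classes]
```

With the full COCO class list this yields the 80-dimensional vector mentioned above; a concept such as “male.n.01,” having no corresponding class, cannot be expressed in this binary encoding, which is exactly the mismatch problem the concept-conditioned approach addresses.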
A direct experimental comparison versus the above approach is not possible since: (i) their approach uses different prior information than that adopted in this work, i.e., they adopt category vectors of ones and zeros while this work adopts WordNet embeddings; (ii) their evaluation setting (online style) significantly differs from that adopted in this work (fixed test set); and (iii) their code is not available online, making an evaluation impossible in a setting comparable to that used in this work.
Multimodal Downstream Tasks The proposed approach is also related to multimodal research areas, as often object detection is used as a building block for solving many other downstream tasks, such as Visual Grounding [3, 4, 14, 17, 42, 44, 45, 50, 59], Visual Question Answering [1, 25, 46, 48, 62, 63], Visual-Textual-Knowledge Entity Linking (VTKEL) [7,8,9] and Image-Text Retrieval [12, 20, 21, 31, 33, 54, 56]. Note that the approach proposed in this work could be deployed to solve the VTKEL problem, conditioning the object detector with the knowledge graph entities, and also to solve the Visual Grounding task by conditioning the object detector with the entity extracted from the text with a Word-sense disambiguation [2, 22, 24, 34, 37, 41, 49, 60, 61].
Visual Grounding Task The proposed “Find-That” task resembles the visual grounding [3, 14, 17, 42, 44, 45, 59] task, although there are substantial differences. First of all, in visual grounding the user intent needs to be represented as a textual phrase, while in the proposed approach it is expressed with one or more WordNet [35] concepts. Secondly, following the current state of the art, visual grounding models predict in output only the bounding box that best matches the textual phrase. For this reason, when the user intent concerns multiple distinct objects depicted in the image, multiple independent queries should be performed to retrieve all objects of interest; moreover, when the user intent concerns multiple objects of the same type, the visual grounding approach is no longer suitable. Lastly, for training, visual grounding models need detailed datasets comprising images, box coordinates, textual phrases, and box-phrase ground truth alignments. These annotations are difficult to collect and, for this reason, visual grounding datasets contain fewer examples than object detection datasets.
Still, given all these differences, Fornoni et al. [10] performed a comparison between an SSD [29] object detector coupled with ResNet-101 and a One-Stage BERT visual grounding model [57]. In particular, the results of the SSD model were filtered according to the class expressed by the input query, as done in the baseline proposed in this work. To summarize, Fornoni et al. [10] verified that the visual grounding model has poor generalization ability and underperforms a simple post-processing baseline.
These findings support the idea presented in this work regarding the necessity of conditioning object detectors with prior information.
8 Conclusion and future works
This work presents a novel approach to focused object search in an image by conditioning existing object detectors with the user’s search intent, which is represented as a set of WordNet concepts. It can be implemented with minor changes to standard object detectors and does not require the modification or addition of any loss. This is the first work that conditions object detectors with the user’s intent represented as a set of WordNet concepts. The proposed concept-conditioned object detector can be trained on existing datasets for object detection without the need to add or modify existing annotations to consider the WordNet concepts. The proposed approach was tested for searching all objects on COCO and Visual Genome datasets and also for searching just subsets of objects using the newly defined Focused COCO and Focused Visual Genome datasets. The evaluation, which was performed considering several object detector architectures and backbones, has demonstrated that the proposed concept-conditioned object detector performs better than the standard object detector baseline.
Future works aim to improve the fusion of the multimodal information in the object detector architecture, as well as to investigate the possibility of conditioning the object detector with a WordNet sub-graph, made of concepts and relations representing the intent of the user, leveraging graph neural networks such as [11, 19, 39, 52]. The proposed concept-conditioned model implicitly learns the relations among the concepts in input and the corresponding categories; those relations are determined by the parent–child relations defined in the WordNet graph. Future works will extend the WordNet relations considered by the model, especially with the intent of introducing new useful information. This is particularly relevant when extending the proposed method to consider entities that belong to heterogeneous knowledge graphs, such as YAGO [16, 32, 40, 51]. Indeed, YAGO could deliver additional useful information, such as morphological details about entities, to help the object detector detect objects in the images. Finally, future works will also integrate this model within a word sense disambiguation system to solve multimodal text-image tasks, such as visual question answering and visual-textual grounding.
Notes
This implements the f function.
Only conditioned object detectors need new training.
Non-Maximum-Suppression (NMS) is applied individually to each category, considering solely the 1000 highest-scoring predicted bounding boxes with a minimum score of 0.05, and using an NMS threshold of 0.5. Following the NMS process, only the top 100 bounding boxes per image are returned. These are the RetinaNet default values defined by Detectron2: https://github.com/facebookresearch/detectron2/blob/main/detectron2/config/defaults.py.
References
Antol S, Agrawal A, Lu J et al (2015) VQA: Visual question answering. In: ICCV, pp 2425–2433
Bevilacqua M, Navigli R (2020) Breaking through the 80% glass ceiling: raising the state of the art in word sense disambiguation by incorporating knowledge graph information. In: Proceedings of the 58th annual meeting of the association for computational linguistics, pp 2854–2864
Chen K, Gao J, Nevatia R (2018) Knowledge aided consistency for weakly supervised phrase grounding. In: Proceedings of the IEEE conference on computer vision and pattern recognition, pp 4042–4050
Cho J, Yoon Y, Kwak S (2022) Collaborative transformers for grounded situation recognition. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition (CVPR), pp 19659–19668
Dai X, Chen Y, Xiao B et al (2021) Dynamic head: unifying object detection heads with attentions. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pp 7373–7382
Deng J, Dong W, Socher R et al (2009) Imagenet: a large-scale hierarchical image database. In: 2009 IEEE computer society conference on computer vision and pattern recognition (CVPR 2009), 20-25 June 2009, Miami, Florida, USA. IEEE computer society, pp 248–255. https://doi.org/10.1109/CVPR.2009.5206848
Dost S, Serafini L, Rospocher M et al (2020a) Jointly linking visual and textual entity mentions with background knowledge. In: International conference on applications of natural language to information systems. Springer, Berlin, pp 264–276
Dost S, Serafini L, Rospocher M et al (2020b) On visual-textual-knowledge entity linking. In: ICSC, IEEE, pp 190–193
Dost S, Serafini L, Rospocher M et al (2020c) Vtkel: a resource for visual-textual-knowledge entity linking. In: ACM, pp 2021–2028
Fornoni M, Yan C, Luo L et al (2021) Bridging the gap between object detection and user intent via query-modulation. arXiv preprint arXiv:2106.10258
Frazzetto P, Pasa L, Navarin N et al (2023) Topology preserving maps as aggregations for graph convolutional neural networks. In: Proceedings of the 38th ACM/SIGAPP symposium on applied computing, pp 536–543
Frome A, Corrado GS, Shlens J et al (2013) Devise: a deep visual-semantic embedding model. In: Burges CJC, Bottou L, Ghahramani Z et al (eds) NeurIPS, pp 2121–2129
Gu X, Lin TY, Kuo W et al (2021) Open-vocabulary object detection via vision and language knowledge distillation. arXiv preprint arXiv:2104.13921
Gupta T, Vahdat A, Chechik G et al (2020) Contrastive learning for weakly supervised phrase grounding. In: Computer vision—ECCV 2020: 16th European conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part III 16. Springer, Berlin, pp 752–768
He K, Zhang X, Ren S et al (2016) Deep residual learning for image recognition. In: 2016 IEEE conference on computer vision and pattern recognition. CVPR 2016, Las Vegas, NV, USA, June 27–30, 2016. IEEE computer society, pp 770–778. https://doi.org/10.1109/CVPR.2016.90
Hoffart J, Suchanek FM, Berberich K et al (2013) Yago2: a spatially and temporally enhanced knowledge base from Wikipedia. Artif. Intell. 194:28–61
Kamath A, Singh M, LeCun Y et al (2021) MDETR—modulated detection for end-to-end multi-modal understanding. In: 2021 IEEE/CVF international conference on computer vision, ICCV 2021, Montreal, QC, Canada, October 10–17, 2021. IEEE, pp 1760–1770. https://doi.org/10.1109/ICCV48922.2021.00180
Kim D, Angelova A, Kuo W (2023) Region-aware pretraining for open-vocabulary object detection with vision transformers. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pp 11144–11154
Kipf TN, Welling M (2016) Semi-supervised classification with graph convolutional networks. arXiv preprint arXiv:1609.02907
Kiros R, Salakhutdinov R, Zemel RS (2014) Unifying visual-semantic embeddings with multimodal neural language models. arXiv preprint arXiv:1411.2539
Klein B, Lev G, Sadeh G et al (2014) Fisher vectors derived from hybrid Gaussian–Laplacian mixture models for image annotation. arXiv preprint arXiv:1411.7399
Kokane CD, Babar SD, Mahalle PN et al (2023) Word sense disambiguation: adaptive word embedding with adaptive-lexical resource. In: International conference on data analytics and insights. Springer, Berlin, pp 421–429
Krishna R, Zhu Y, Groth O et al (2017) Visual genome: connecting language and vision using crowdsourced dense image annotations. Int J Comput Vis 123(1):32–73. https://doi.org/10.1007/s11263-016-0981-7
Kumar S, Jat S, Saxena K et al (2019) Zero-shot word sense disambiguation using sense definition embeddings. In: Korhonen A, Traum DR, Màrquez L (eds) Proceedings of the 57th conference of the association for computational linguistics, ACL 2019, Florence, Italy, July 28–August 2, 2019, Volume 1: long papers. Association for computational linguistics, pp 5670–5681. https://doi.org/10.18653/V1/P19-1568
Lerner P, Ferret O, Guinaudeau C (2023) Multimodal inverse cloze task for knowledge-based visual question answering. In: European conference on information retrieval. Springer, Berlin, pp 569–587
Lin TY, Maire M, Belongie S et al (2014) Microsoft coco: common objects in context. In: European conference on computer vision. Springer, Berlin, pp 740–755
Lin TY, Dollár P, Girshick R et al (2017a) Feature pyramid networks for object detection. In: 2017 IEEE conference on computer vision and pattern recognition (CVPR), IEEE, pp 936–944
Lin TY, Goyal P, Girshick R et al (2017b) Focal loss for dense object detection. In: Proceedings of the IEEE international conference on computer vision, pp 2980–2988
Liu W, Anguelov D, Erhan D et al (2016) SSD: single shot multibox detector. In: European conference on computer vision. Springer, Berlin, pp 21–37
Liu Z, Lin Y, Cao Y et al (2021) Swin transformer: hierarchical vision transformer using shifted windows. In: Proceedings of the IEEE/CVF international conference on computer vision, pp 10012–10022
Luo Z, Zhao P, Xu C et al (2023) Lexlip: lexicon-bottlenecked language-image pre-training for large-scale image-text sparse retrieval. In: Proceedings of the IEEE/CVF international conference on computer vision, pp 11206–11217
Mahdisoltani F, Biega J, Suchanek F (2014) YAGO3: a knowledge base from multilingual wikipedias. In: 7th biennial conference on innovative data systems research, CIDR conference
Mao J, Xu W, Yang Y et al (2015) Deep captioning with multimodal recurrent neural networks (m-RNN). In: Bengio Y, LeCun Y (eds) ICLR
Mao R, He K, Zhang X et al (2024) A survey on semantic processing techniques. Inf Fusion 101:101988
Miller GA (1998) WordNet: an electronic lexical database. MIT Press, Cambridge
Minderer M, Gritsenko A, Stone A et al (2022) Simple open-vocabulary object detection with vision transformers. arXiv preprint arXiv:2205.06230
Navigli R (2009) Word sense disambiguation: a survey. ACM Comput Surv (CSUR) 41(2):1–69
Nickel M, Rosasco L, Poggio TA (2016) Holographic embeddings of knowledge graphs. In: Schuurmans D, Wellman MP (eds) Proceedings of the thirtieth AAAI conference on artificial intelligence, Febr 12–17, 2016, Phoenix, Arizona, USA. AAAI Press, Washington, pp 1955–1961. http://www.aaai.org/ocs/index.php/AAAI/AAAI16/paper/view/12484
Pasa L, Navarin N, Sperduti A (2022) SOM-based aggregation for graph convolutional neural networks. Neural Comput Appl 34:1–20
Pellissier Tanon T, Weikum G, Suchanek F (2020) YAGO 4: a reason-able knowledge base. In: European semantic web conference. Springer, pp 583–596
Raj V, Abbas N (2024) Contextual sense model: word sense disambiguation using sense and sense value of context surrounding the target. Int J Cognit Lang Sci 18(1):43–50
Rigoni D, Serafini L, Sperduti A (2022) A better loss for visual-textual grounding. In: Hong J, Bures M, Park JW et al (eds) SAC’22: the 37th ACM/SIGAPP symposium on applied computing, virtual event, April 25–29, 2022. ACM, pp 49–57. https://doi.org/10.1145/3477314.3507047
Rigoni D, Elliott D, Frank S (2023a) Cleaner categories improve object detection and visual-textual grounding. In: Scandinavian conference on image analysis. Springer, Berlin, pp 412–442
Rigoni D, Parolari L, Serafini L et al (2023b) Weakly-supervised visual-textual grounding with semantic prior refinement. In: 34th British machine vision conference 2023. BMVA Press, Aberdeen, UK. http://proceedings.bmvc2023.org/229/
Rohrbach A, Rohrbach M, Hu R et al (2016) Grounding of textual phrases in images by reconstruction. In: European conference on computer vision. Springer, Berlin, pp 817–834
Salaberria A, Azkune G, de Lacalle OL et al (2023) Image captioning for effective use of language models in knowledge-based visual question answering. Expert Syst Appl 212:118669
Shi C, Yang S (2023) EDADET: open-vocabulary object detection using early dense alignment. In: Proceedings of the IEEE/CVF international conference on computer vision, pp 15724–15734
Shih KJ, Singh S, Hoiem D (2016) Where to look: focus regions for visual question answering. In: CVPR, pp 4613–4621
Stevenson M, Wilks Y (2003) Word sense disambiguation. Oxf Handb Comput Linguist 249:249
Su W, Miao P, Dou H et al (2023) Language adaptive weight generation for multi-task visual grounding. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pp 10857–10866
Suchanek F, Alam M, Bonald T et al (2023) Integrating the wikidata taxonomy into YAGO. arXiv preprint arXiv:2308.11884
Veličković P, Cucurull G, Casanova A et al (2017) Graph attention networks. arXiv preprint arXiv:1710.10903
Wang J, Zhang H, Hong H et al (2023) Open-vocabulary object detection with an open corpus. In: Proceedings of the IEEE/CVF international conference on computer vision, pp 6759–6769
Wu J, Weng W, Fu J et al (2022) Deep semantic hashing with dual attention for cross-modal retrieval. Neural Comput Appl 34(7):5397–5416. https://doi.org/10.1007/S00521-021-06696-Y
Wu S, Zhang W, Jin S et al (2023) Aligning bag of regions for open-vocabulary object detection. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pp 15254–15264
Yang S, Li Q, Li W et al (2023) Semantic completion and filtration for image-text retrieval. ACM Trans Multimedia Comput Commun Appl 19(4):1–20
Yang Z, Gong B, Wang L et al (2019) A fast and accurate one-stage approach to visual grounding. In: Proceedings of the IEEE/CVF international conference on computer vision, pp 4683–4693
Zaheer M, Kottur S, Ravanbakhsh S et al (2017) Deep sets. Advances in neural information processing systems 30
Zhang H, Niu Y, Chang SF (2018) Grounding referring expressions in images by variational context. In: Proceedings of the IEEE conference on computer vision and pattern recognition, pp 4158–4166
Zhang X, Mao R, He K et al (2023) Neuro-symbolic sentiment analysis with dynamic word sense disambiguation. Find Assoc Comput Linguist: EMNLP 2023:8772–8783
Zhang X, Zhen T, Zhang J et al (2023b) SRCB at semeval-2023 task 1: prompt based and cross-modal retrieval enhanced visual word sense disambiguation. In: Proceedings of the 17th international workshop on semantic evaluation (SemEval-2023), pp 439–446
Zhao J, Zhang X, Wang X et al (2022) Overcoming language priors in VQA via adding visual module. Neural Comput Appl 34(11):9015–9023. https://doi.org/10.1007/S00521-022-06923-0
Zhou B, Tian Y, Sukhbaatar S et al (2015) Simple baseline for visual question answering. arXiv preprint arXiv:1512.02167
Acknowledgements
We acknowledge the support of the PNRR project FAIR - Future AI Research (PE00000013), under the NRRP MUR program funded by the NextGenerationEU. Moreover, we acknowledge the EuroHPC Joint Undertaking for awarding us access to Vega at IZUM, Slovenia, and the CINECA award under the ISCRA initiative, for the availability of high-performance computing resources and support.
Funding
Open access funding provided by Università degli Studi di Padova within the CRUI-CARE Agreement.
Author information
Authors and Affiliations
Corresponding author
Ethics declarations
Conflict of interest
The authors declare that there are no conflicts of interest.
Ethics approval
Not applicable.
Consent to participate
The authors have consented to the submission of this manuscript to the journal.
Consent for publication
The authors have consented to the publication of this manuscript to the journal.
Additional information
Publisher's Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Appendices
Appendix A: Dataset statistics
Table 6 reports the statistics of the datasets considered in Sect. 5. Figure 10 reports the frequencies of the classes appearing in both COCO and Focused COCO validation sets. Note that, as presented in Sect. 5.1.1, this work evaluates the models on the COCO validation set as the COCO test set ground truths are not publicly available.
Appendix B: Further results on the fusion block
This appendix reports more detailed results obtained during the analysis of the Fusion Block component presented in Sect. 5.5.
Table 7 presents the results obtained following the evaluation setting of Sect. 5.2, while Table 8 presents the results obtained following the setting of Sect. 5.3. The DynamicHead architecture with ResNet-101 as the backbone is used for these experiments. In both settings, it can be observed that the Concept DynamicHead model adopting the “Addition” fusion strategy performs much better than the standard DynamicHead model. On the other hand, the “Addition” strategy is not as competitive as the “Concatenation” strategy.
In conclusion, these results demonstrate that, although the additional neurons help the object detector's performance, the major contribution to the results comes from the concepts given in input.
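The two fusion strategies compared here can be illustrated with a minimal numpy sketch. The shapes and function names below are hypothetical and not the authors' exact implementation; the sketch only shows the structural difference between the two strategies, and why “Concatenation” introduces more neurons downstream (the channel dimension grows).

```python
# Illustrative sketch (not the paper's actual code): fusing a visual
# feature map with a concept embedding by element-wise addition or by
# channel concatenation. Shapes are hypothetical.
import numpy as np

def fuse_addition(visual, concept):
    """Broadcast the concept embedding over spatial positions and add it.
    visual: (C, H, W), concept: (C,) -> output (C, H, W)."""
    return visual + concept[:, None, None]

def fuse_concatenation(visual, concept):
    """Tile the concept embedding spatially and stack it on the channels.
    visual: (C, H, W), concept: (D,) -> output (C + D, H, W)."""
    _, h, w = visual.shape
    tiled = np.broadcast_to(concept[:, None, None], (concept.shape[0], h, w))
    return np.concatenate([visual, tiled], axis=0)

visual = np.zeros((256, 7, 7))   # hypothetical backbone feature map
concept = np.ones(256)           # hypothetical concept embedding

print(fuse_addition(visual, concept).shape)       # (256, 7, 7)
print(fuse_concatenation(visual, concept).shape)  # (512, 7, 7)
```

Note that concatenation doubles the channel dimension here, so any layer consuming the fused features gains extra parameters; this is the “more neurons” confound that the “Addition” comparison above is designed to control for.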
Appendix C: Dataset statistics—new sampling strategy
Table 9 reports the statistics of the datasets adopted in Sect. 6, which are generated according to the new sampling strategy. Comparing these statistics with those of the datasets presented in Sect. 5.1.1 (Table 6), it is evident that only the statistics about the concepts have changed.
Appendix D: Analysis on the number of detected objects
Table 10 reports the average count of objects predicted per image by object detectors with and without the Post-processing Selection component. The results are obtained from the RetinaNet model with ResNet-50 and \(d=1\) on both the Focused COCO validation set and the Focused Visual Genome test set. The experiments are performed considering a maximum of 100 bounding boxes per image. The term “First strategy” denotes the sampling approach described in Sect. 5 of the article, whereas “Second strategy” pertains to the one outlined in Sect. 6 of the main article. In other words, the first strategy involves a concept for each object to detect in the image, while the second strategy involves only one concept for each type of object to detect.
The table illustrates that, on average, when the Post-processing Selection component is not used, the conditioned object detectors consistently detect fewer objects per image than standard object detectors. However, when the Post-processing Selection component is employed, the conditioned object detectors detect a higher number of objects aligned with the user’s intent. This holds true for both sampling processes.
It is worth noting that the elevated count of predicted objects, from both object detectors, is attributed to the parameters adopted during the evaluation process. Indeed, modifying the parameters of the non-maximum suppression (NMS) algorithm enables the retrieval of more or fewer predictions. Since NMS is performed per category, many predicted objects can overlap when they belong to different categories. This effect is more visible in the Visual Genome dataset, which contains many synonymous categories, such as “microwave” and “microwave oven,” or “hamburger” and “burger.” More details about the Visual Genome categories are reported in [43]. For this reason, many of the predicted bounding boxes refer to the same depicted object in the image.
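The per-category behavior of NMS can be demonstrated with a small self-contained sketch. The boxes, scores, and labels below are made up for illustration; the point is that two nearly identical boxes both survive suppression when they carry different (here, synonymous) category labels, because greedy NMS never compares boxes across categories.

```python
# Minimal greedy per-category NMS sketch (illustrative, not the
# evaluation code used in the paper).

def iou(a, b):
    """Intersection-over-union of two [x1, y1, x2, y2] boxes."""
    x1, y1 = max(a[0], b[0]), max(a[1], b[1])
    x2, y2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    area = lambda t: (t[2] - t[0]) * (t[3] - t[1])
    return inter / (area(a) + area(b) - inter)

def nms_per_category(boxes, scores, labels, iou_thr=0.5):
    """Run greedy NMS independently within each category label."""
    keep = []
    for lab in set(labels):
        idx = [i for i, l in enumerate(labels) if l == lab]
        idx.sort(key=lambda i: -scores[i])
        while idx:
            best = idx.pop(0)
            keep.append(best)
            idx = [i for i in idx if iou(boxes[best], boxes[i]) < iou_thr]
    return sorted(keep)

# Boxes 0 and 1 overlap heavily but have synonymous labels, so both
# survive per-category NMS; with identical labels, box 1 would be dropped.
boxes = [[10, 10, 50, 50], [12, 11, 51, 49], [200, 200, 240, 240]]
scores = [0.9, 0.8, 0.7]
labels = ["microwave", "microwave oven", "dog"]
print(nms_per_category(boxes, scores, labels))  # [0, 1, 2]
```

Running the same example with `labels = ["microwave", "microwave", "dog"]` keeps only boxes 0 and 2, which is exactly why synonymous categories in Visual Genome inflate the per-image prediction counts discussed above.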
Appendix E: Statistical differences in predictions
This appendix presents the outcomes of the statistical tests performed on the predictions made by the standard and conditioned DynamicHead models, with ResNet-101, on both the COCO and Visual Genome datasets. Specifically, the assessment focuses on the per-image mean Average Precision (AP) values (more details about the metrics are reported in Sect. 5.1.2), treating each image as an individual sample. As the population distribution is unknown, the Wilcoxon Signed-Rank Test is an appropriate choice: it is a non-parametric statistical test used to determine whether there is a significant difference between paired samples, here the AP values returned by the standard and the conditioned models, and it serves as an alternative to the paired t-test when the assumptions of normality and homogeneity of variances are not met. The SPSS (Statistical Package for the Social Sciences) software package, also known as IBM SPSS Statistics, was used to perform the test.

The symmetry of the distribution of differences between the two related groups, as required by the Wilcoxon Signed-Rank Test, was verified on both datasets. Notably, COCO exhibits a moderate skewness of approximately \(-1\), whereas Visual Genome shows a skewness of roughly \(-1.5\). The Wilcoxon Signed-Rank Test was performed with a significance level of 0.001 and a 95.0% confidence interval. The tests conducted on both datasets, whose results are reported in Table 11, indicate the rejection of the null hypothesis, confirming statistically significant differences in the predictions. Given the arguably moderate skewness, a second, less stringent statistical test was also conducted: the sign test, with the same significance level and confidence interval adopted for the Wilcoxon Signed-Rank Test.
The results, reported in Table 12, highlight again the rejection of the null hypothesis, emphasizing significant statistical differences in the predictions.
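Both tests can be reproduced in principle with standard statistical tooling. The sketch below uses `scipy` on synthetic paired AP values (the real per-image AP values are not reproduced here, and the shift and noise parameters are invented for illustration): the Wilcoxon signed-rank test compares the paired samples directly, while the sign test reduces to a binomial test on the number of images where one model outperforms the other.

```python
# Illustrative sketch with synthetic data (not the paper's actual AP
# values): Wilcoxon signed-rank test and sign test on paired per-image
# AP scores, mirroring the procedure described in this appendix.
import numpy as np
from scipy.stats import wilcoxon, binomtest

rng = np.random.default_rng(0)
n_images = 200
ap_standard = rng.uniform(0.2, 0.8, size=n_images)            # hypothetical per-image AP
ap_conditioned = ap_standard + rng.normal(0.05, 0.02, n_images)  # hypothetical improvement

# Wilcoxon signed-rank test: non-parametric paired test on the AP differences.
_, p_wilcoxon = wilcoxon(ap_conditioned, ap_standard)

# Sign test: binomial test on how often the conditioned model wins.
n_better = int(np.sum(ap_conditioned > ap_standard))
p_sign = binomtest(n_better, n=n_images, p=0.5).pvalue

alpha = 0.001
print(p_wilcoxon < alpha, p_sign < alpha)  # both True on this synthetic data
```

Note the design choice mirrored here: the sign test discards the magnitude of each difference and keeps only its direction, which is why it is the less stringent of the two tests and remains valid when the difference distribution is skewed.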
Rights and permissions
Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.
About this article
Cite this article
Rigoni, D., Serafini, L. & Sperduti, A. Object search by a concept-conditioned object detector. Neural Comput & Applic (2024). https://doi.org/10.1007/s00521-024-09914-5
Received:
Accepted:
Published:
DOI: https://doi.org/10.1007/s00521-024-09914-5