1 Introduction

Fig. 1

Fine-Grained Image Recognition without Expert Labels. We propose a novel task that enables fine-grained classification without using expert class information (e.g. bird species) during training. We frame the problem as document retrieval from general image descriptions by leveraging existing textual knowledge bases, such as Wikipedia

Deep learning and the availability of large-scale labelled datasets have led to remarkable advances in image recognition tasks, including fine-grained recognition (Wah et al., 2011; Nilsback and Zisserman, 2006; Horn et al., 2017). The problem of fine-grained image recognition amounts to identifying subordinate-level categories, such as different species of birds, dogs or plants. Thus, the supervised learning regime in this case requires annotations provided by domain experts or citizen scientists (Van Horn et al., 2015).

While most people, unless professionally trained or enthusiasts, do not have knowledge in such specific domains, they are generally capable of consulting existing expert resources such as books or online encyclopedias, e.g. Wikipedia. As an example, let us consider bird identification. Amateur bird watchers typically rely on field guides to identify observed species. As a general instruction, one has to answer the question “what is most noticeable about this bird?” before skimming through the guide to find the best match to their observation. The answer to this question is typically a detailed description of the bird’s shape, size, plumage colors and patterns. Indeed, in Fig. 1, the non-expert observer might not be able to directly identify a bird as a “Vermillion Flycatcher”, but they can simply describe the appearance of the bird: “this is a bright red bird with black wings and tail and a pointed beak”. This description can be matched to an expert corpus to obtain the species and other expert-level information.

On the other hand, machines have a much harder time consulting off-the-shelf expert-curated knowledge bases. In particular, most algorithmic solutions are designed to address a specific task with datasets constructed ad-hoc to serve precisely this purpose. Our goal, instead, is to investigate whether it is possible to re-purpose general image and text understanding capabilities to allow machines to consult already existing textual knowledge bases to address a new task, such as recognizing a bird.

We introduce a novel task inspired by the way a layperson would tackle fine-grained recognition from visual input; we name this CLEVER, i.e. Curious Layperson-to-Expert Visual Entity Recognition. Given an image of a subordinate-level object category, the task is to retrieve the relevant document from a large, expertly-curated text corpus; to this end, we only allow non-expert supervision for learning to describe the image. We assume that: (1) the corpus dedicates a separate entry to each category, as is, for example, the case in encyclopedia entries for bird or plant species, etc., (2) there exist no paired data of images and documents or expert labels during training, and (3) to model a layperson’s capabilities, we have access to general image and text understanding tools that do not use expert knowledge, such as image descriptions or language models.

Given this definition, the task can be classified as weakly supervised in the taxonomy of learning problems. We note that there are fundamental differences to related topics, such as image-to-text retrieval and unsupervised image classification. Despite a significant amount of prior work in image-to-text or text-to-image retrieval (Peng et al., 2017; Wang et al., 2017; Zhen et al., 2019; Hu et al., 2019; He et al., 2019), the general assumption is that images and corresponding documents are paired for training a model. The difference to unsupervised image classification is that here we are interested in semantically labelling images using a secondary modality, instead of grouping similar images (Asano et al., 2020; Caron et al., 2020; Van Gansbeke et al., 2020).

To the best of our knowledge, we are the first to tackle the task of fine-grained image recognition without expert supervision. Since the target corpus is not required during training, the search domain is easily extendable to any number of categories/species—an ideal use case when retrieving documents from dynamic knowledge bases, such as Wikipedia. We provide extensive evaluation of our method and also compare to approaches in cross-modal retrieval, despite using significantly reduced supervision.

2 Related Work

In this paper, we address a novel problem (CLEVER). Next we describe in detail how it differs from related problems in the computer vision and natural language processing literature and summarise the differences with respect to how class information is used in Table 1.

Table 1 Overview of related topics (K: known, U: unknown)

2.1 Fine-Grained Recognition

The goal of fine-grained visual recognition (FGVR) is categorising objects at the subordinate level, such as species of animals or plants (Wah et al., 2011; Van Horn et al., 2015, 2018; Nilsback and Zisserman, 2008; Kumar et al., 2012). Large-scale annotated datasets require domain experts and are thus difficult to collect. FGVR is more challenging than coarse-level image classification as it involves categories with fewer discriminative cues and fewer labeled samples. To address this problem, supervised methods exploit side information such as part annotations (Zhang et al., 2014), attributes (Vedaldi et al., 2014), natural language descriptions (He and Peng, 2017), noisy web data (Krause et al., 2016; Xu et al., 2016; Gebru et al., 2017) or humans in the loop (Branson et al., 2010; Deng et al., 2015; Cui et al., 2016). Attempts to reduce supervision in FGVR are mostly targeted towards eliminating auxiliary labels, e.g. part annotations (Zheng et al., 2017; Simon and Rodner, 2015; Ge et al., 2019; Huang and Li, 2020). There have also been efforts to classify out-of-domain data with semi-supervised approaches, where in-domain labeled examples are used alongside unlabeled data (Du et al., 2021; Su et al., 2021). In contrast, our goal is fine-grained recognition without access to categorical labels during training. Our approach only relies on side information (captions) provided by laypeople and is thus unsupervised from the perspective of “expert knowledge”.

2.2 Zero/Few Shot Learning

Zero-shot learning (ZSL) is the task of learning a classifier for unseen classes (Xian et al., 2018). A classifier is generated from a description of an object in a secondary modality, mapping semantic representations to class space in order to recognize said object in images (Socher et al., 2013). Various modalities have been used as auxiliary information: word embeddings (Frome et al., 2013; Xian et al., 2016), hierarchical embeddings (Kampffmeyer et al., 2019), attributes (Farhadi et al., 2009; Akata et al., 2015) or Wikipedia articles (Elhoseiny et al., 2017; Zhu et al., 2018; Elhoseiny et al., 2016; Qiao et al., 2016). Most recent work uses either generative models conditioned on class descriptions to synthesize training examples for unseen categories (Long et al., 2017; Kodirov et al., 2017; Felix et al., 2018; Xian et al., 2019; Vyas et al., 2020; Xian et al., 2018) or attention-enabled feature extractors (Yu et al., 2018; Zhu et al., 2019; Shermin et al., 2022; Chen et al., 2022). The multi-modal and often fine-grained nature of the standard and generalised (G)ZSL task renders it related to our problem. However, different from the (G)ZSL setting, our method uses neither class supervision during training nor image-document pairs as in (Elhoseiny et al., 2017; Zhu et al., 2018; Elhoseiny et al., 2016; Qiao et al., 2016).

2.3 Cross-Modal and Information Retrieval

While information retrieval deals with extracting information from document collections (Manning et al., 2008), cross-modal retrieval aims at retrieving relevant information across modalities, e.g. image-to-text or vice versa. One of the core problems in information retrieval is ranking documents given a query, with a classical example being Okapi BM25 (Robertson et al., 1995). With the advent of transformers (Vaswani et al., 2017) and BERT (Devlin et al., 2019), state-of-the-art document retrieval is achieved in two steps: an initial ranking based on keywords, followed by computationally intensive BERT-based re-ranking (Nogueira and Cho, 2019; Nogueira et al., 2020; Yilmaz et al., 2019; MacAvaney et al., 2019). In cross-modal retrieval, the common approach is to learn a shared representation space for multiple modalities (Peng et al., 2017; Andrew et al., 2013; Wang and Livescu, 2016; Peng et al., 2016, 2017; Wang et al., 2017; Zhen et al., 2019; Hu et al., 2019; He et al., 2019; Zheng et al., 2021; Wang et al., 2022, 2021). In addition to paired data in various domains, some methods also exploit auxiliary semantic labels; for example, the Wikipedia benchmark (Pereira et al., 2013) provides broad category labels such as history, music, sport, etc.

We depart substantially from the typical assumptions made in this area. Notably, with the exception of He et al. (2019) and Wang et al. (2009), this setting has not been explored in fine-grained domains; prior work generally targets higher-level content association between images and documents. Furthermore, one major difference between our approach and cross-modal retrieval, including (He et al., 2019; Wang et al., 2009), is that we do not assume paired data between the input domain (images) and the target domain (documents). We address the lack of such pairs using an intermediary modality (captions) that allows us to perform retrieval directly in the text domain.

2.4 Natural Language Inference (NLI) and Semantic Textual Similarity (STS)

Also related to our work, in natural language processing, the goal of the NLI task is to recognize textual entailment: given a pair of sentences (premise and hypothesis), the hypothesis must be labelled as entailment (true), contradiction (false) or neutral (undetermined) with respect to the premise (Bowman et al., 2015; Williams et al., 2018). STS measures the degree of semantic similarity between two sentences (Agirre et al., 2012, 2013). Both tasks play an important role in semantic search and information retrieval and are currently dominated by the transformer architecture (Vaswani et al., 2017; Devlin et al., 2019; Liu et al., 2019; Reimers and Gurevych, 2019). Inspired by these tasks, we propose a sentence similarity regime that is domain-specific, paying attention to fine-grained semantics.

3 Method

We introduce the problem of layperson-to-expert visual entity recognition (CLEVER), which we address via image-based document retrieval. Formally, we are given a set of images \(x_i \in \mathcal {I}\) to be labelled given a corpus of expert documents \(D_j \in \mathcal {D}\), where each document corresponds to a fine-grained image category and there exist \(K = \vert \mathcal {D}\vert \) categories in total. As a concrete example, \(\mathcal {I}\) can be a set of images of various bird species and \(\mathcal {D}\) a bird identification corpus constructed from specialized websites (with one article per species). Crucially, the pairing of \(x_i\) and \(D_j\) is not known, i.e. no expert task supervision is available during training. Therefore, the mapping from images to documents cannot be learned directly but can be discovered through the use of non-expert image descriptions \(\mathcal {C}_i\) for image \(x_i\).

Fig. 2

Overview. We train a model for fine-grained sentence matching (FGSM) using layperson's annotations, i.e. class-agnostic image descriptions. At test time, we score documents from a relevant corpus and use the top-ranked document to label the image

Our method consists of three distinct parts. First, we learn, using “layperson’s supervision”, an image captioning model that uses simple color, shape and part descriptions. Second, we train a model for Fine-Grained Sentence Matching (FGSM). The FGSM model takes as input a pair of sentences and predicts whether they are descriptions of the same object. Finally, we use the FGSM to score the documents in the expert corpus via voting. As there is one document per class, the species corresponding to the highest-scoring document is returned as the final class prediction for the image. The overall inference process is illustrated in Fig. 2.

3.1 Fine-grained Sentence Matching

The overall goal of our method is to match images to expert documents—however, in absence of paired training data, learning a cross-domain mapping is not possible. On the other hand, describing an image is an easy task for most humans, as it usually does not require domain knowledge. It is therefore possible to leverage image descriptions as an intermediary for learning to map images to an expert corpus.

To that end, the core component of our approach is the FGSM model \(f(c_1, c_2) \in \mathbb {R}\) that scores the visual similarity of two descriptions \(c_1\) and \(c_2\). We propose to train f in a manner similar to the textual entailment (NLI) task in natural language processing. The difference to NLI is that the information that needs to be extracted here is fine-grained and domain-specific e.g.  “a bird with blue wings” vs. “this is a uniformly yellow bird”. Since we do not have annotated sentence pairs for this task, we have to create them synthetically. Instead of the terms entailment and contradiction, here we use positive and negative to emphasize that the goal is to find matches (or mismatches) between image descriptions.

We propose to model f as a sentence encoder, performing the semantic comparison of \(c_1, c_2\) in embedding space. Despite their widespread success in downstream tasks, most transformer-based language models are notoriously bad at producing semantically meaningful sentence embeddings (Reimers and Gurevych, 2019; Li et al., 2020). We thus follow Reimers and Gurevych (2019) in learning an appropriate textual similarity model with a Siamese architecture built on a pre-trained language transformer. This also allows us to leverage the power of large language models while maintaining efficiency, by computing an embedding for each input independently and only comparing embeddings as a last step. To this end, we compute a similarity score for \(c_1\) and \(c_2\) as \(f(c_1,\,c_2)=h\left( \left[ \phi _1;\,\phi _2;\,\vert \phi _1-\phi _2\vert \right] \right) \), where \([\cdot ]\) denotes concatenation, and h and \(\phi \) are lightweight MLPs operating on the average-pooled output of a large language model \(\mathrm {T}(\cdot )\), with the shorthand notation \(\phi _1 = \phi (\mathrm {T}(c_1))\).
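To make this concrete, the following is a minimal PyTorch sketch of the scoring head, using the layer sizes reported in Sect. 4.1.2; the class and method names are illustrative, and the sentence encoder T is assumed to return an average-pooled sentence embedding.

```python
import torch
import torch.nn as nn

class SiameseFGSM(nn.Module):
    """Sketch of the FGSM scoring head: two sentences are embedded independently
    by a shared language model T, projected by a small MLP phi, and compared by
    a classifier h on the concatenation [phi1; phi2; |phi1 - phi2|]."""

    def __init__(self, sentence_encoder, dim_in=1024, dim_hidden=256, dim_out=64, num_classes=2):
        super().__init__()
        self.T = sentence_encoder            # assumed to return an average-pooled embedding of size dim_in
        self.phi = nn.Sequential(            # lightweight projection MLP
            nn.Linear(dim_in, dim_hidden), nn.Tanh(),
            nn.Linear(dim_hidden, dim_out), nn.Tanh(),
        )
        self.h = nn.Linear(3 * dim_out, num_classes)  # classifier on the concatenated features

    def embed(self, sentences):
        # project the sentence embeddings produced by T
        return self.phi(self.T(sentences))

    def forward(self, c1, c2):
        phi1, phi2 = self.embed(c1), self.embed(c2)
        features = torch.cat([phi1, phi2, (phi1 - phi2).abs()], dim=-1)
        return self.h(features)              # logits over positive/negative (or 3 classes later)
```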

Fig. 3

Positive, negative and neutral sentence pairs (CUB-200). We show examples of the automatically generated pairs used to train FGSM. For each pair, the top sentence is a ground-truth caption of the image on the left

3.1.1 Training

One requirement is that the FGSM model should be able to identify fine-grained similarities between pairs of sentences. This is in contrast to the standard STS and NLI tasks in natural language understanding which determine the relationship (or degree of similarity) of a sentence pair on a coarser semantic level. Since our end-goal is visual recognition, we instead train the model to emphasize visual cues and nuanced appearance differences.

Let \(\mathcal {C}_i\) be the set of human-annotated descriptions for a given image \(x_i\). Positive training pairs are generated by exploiting the fact that, commonly, each image has been described by multiple annotators; for example, in CUB-200 (Wah et al., 2011) there are \(\vert \mathcal {C}_i\vert = 10\) captions per image. Thus, each pair (from \(\mathcal {C}_i \times \mathcal {C}_i\)) of descriptions of the same image can be used as a positive pair. The negative counterparts are then sampled from the complement \(\bar{\mathcal {C}}_i = \bigcup _{l \ne i}\mathcal {C}_l\), i.e. among the available descriptions for all other images in the dataset. While not perfect, there is a very high chance that these come from images of different classes. We do not add further rules for constructing negative pairs, other than requiring that they describe different images, as it is not easy to automatically infer reliable noun-attribute combinations from sentences that would allow for further checking (e.g. “the bird is overall yellow, but has dark speckles on its belly”—what color is the belly?). We construct this dataset with an equal number of samples for both classes and train f with a binary cross entropy loss.
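As an illustration, the sketch below builds balanced positive and negative caption pairs as described above; the data layout (a list of caption lists, one per image) and the number of sampled pairs per image are assumptions made for the example.

```python
import random

def build_caption_pairs(captions_per_image, num_pairs_per_image=10, seed=0):
    """captions_per_image: list of lists; captions_per_image[i] holds the human
    captions C_i for image x_i (e.g. 10 captions per image in CUB-200).
    Returns a balanced, shuffled list of (caption_a, caption_b, label) with
    label 1 for positives (same image) and 0 for negatives (different images)."""
    rng = random.Random(seed)
    pairs = []
    num_images = len(captions_per_image)
    for i, captions in enumerate(captions_per_image):
        other_images = [k for k in range(num_images) if k != i]
        for _ in range(num_pairs_per_image):
            # positive pair: two different captions of the same image
            c1, c2 = rng.sample(captions, 2)
            pairs.append((c1, c2, 1))
            # negative pair: a caption of this image and a caption of another image
            j = rng.choice(other_images)
            pairs.append((rng.choice(captions), rng.choice(captions_per_image[j]), 0))
    rng.shuffle(pairs)
    return pairs
```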

3.1.2 Inference

During inference, the sentence embeddings \(\phi \) for each sentence in each document can be precomputed and only h needs to be evaluated dynamically given an image and its corresponding captions, as described in the next section. This greatly reduces the memory and time requirements.

3.2 Document Scoring

Although trained on image descriptions alone, the FGSM model can take any sentence as input. At test time, we therefore use the trained model to score sentences \(s \in D_j\) from the expert corpus against image descriptions \(c \in \mathcal {C}_i\). Specifically, we assign a score \(z_{ij} \in \mathbb {R}\) to each expert document \(D_j\) given the set of descriptions for the i-th image: \( z_{ij} = \frac{1}{\vert \mathcal {C}_i \times D_j\vert } \sum _{(c, s) \in \mathcal {C}_i \times D_j} f(c, s). \) Since there are several descriptions in \(\mathcal {C}_i\) and sentences in \(D_j\), the final document score is the average of the individual predictions (scores) over all description-sentence pairs. Aggregating scores across the whole corpus \(\mathcal {D}\), we can then compute the probability \(p(D_j \,\vert \, x_i) \triangleq \frac{e^{z_{ij}}}{\sum _{k} e^{z_{ik}}}\) of a document \(D_j \in \mathcal {D}\) given image \(x_i\) and assign the document (and consequently the class) with the highest probability to the image.
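A minimal sketch of this scoring and voting step, assuming the FGSM caption-sentence scores have already been computed and stored as one matrix per document; the function name is illustrative.

```python
import torch

def score_documents(fgsm_scores_per_doc):
    """fgsm_scores_per_doc: list of tensors, one per document D_j, each of shape
    (|C_i|, |D_j|) holding f(c, s) for every caption c of image x_i and every
    sentence s of D_j. Returns the document probabilities p(D_j | x_i) and the
    index of the predicted document (i.e. the predicted class)."""
    # z_ij: mean FGSM score over all caption-sentence pairs of document j
    z = torch.stack([scores.mean() for scores in fgsm_scores_per_doc])
    p = torch.softmax(z, dim=0)        # distribution over the corpus
    return p, int(p.argmax())          # highest-probability document = predicted class
```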

3.3 Bridging the Domain Gap

While training the FGSM model, we have so far only used laypersons’ descriptions, disregarding the expert corpus. However, we can expect the documents to contain significantly more information than visual descriptions. In the case of bird species, encyclopedia entries usually also describe behavior, migration, conservation status, etc. In addition, even the descriptions of visual appearance may utilize specialized jargon. This causes a gap between the style of data observed during training and that encountered during the inference phase. We can adapt the model to the new domain by additionally leveraging information (but not labels) from the target corpus during training. In this section, we thus employ two mechanisms to bridge the gap between the image descriptions and the documents.

3.3.1 Neutral Sentences

We introduce a third, neutral class to the classification problem, designed to capture sentences that do not provide relevant (visual) information. We generate neutral training examples by pairing an image description with sentences from the documents (or other descriptions) that do not have any nouns in common. Avoiding common nouns in neutral pairs is based on the rationale that if one sentence describes one part (e.g., “black wings”) while another sentence focuses on another (e.g., “white belly”), there is insufficient information to classify the pair as positive or negative. This additionally allows the model to adapt to the style of sentences in the document, which can be very different from image descriptions. Some examples are shown in Fig. 3.
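A minimal sketch of the neutral-pair test described above, assuming nouns are extracted with an off-the-shelf POS tagger (NLTK here); the specific tagger is an assumption for illustration.

```python
import nltk
# one-time setup: nltk.download("punkt"); nltk.download("averaged_perceptron_tagger")

def nouns(sentence):
    """Return the set of lower-cased nouns occurring in a sentence."""
    tokens = nltk.word_tokenize(sentence)
    return {word.lower() for word, tag in nltk.pos_tag(tokens) if tag.startswith("NN")}

def is_neutral_pair(sentence_a, sentence_b):
    """Two sentences (e.g. an image caption and a corpus sentence) form a neutral
    pair if they share no nouns, i.e. they likely describe different parts or
    non-visual content, so neither a match nor a mismatch can be inferred."""
    return len(nouns(sentence_a) & nouns(sentence_b)) == 0
```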

Instead of binary cross entropy, we train the three-class model (positive/neutral/negative) with softmax cross entropy.

3.3.2 Score Distribution Prior

Another way of leveraging the document pool during training, without requiring paired data, is by imposing priors on document scoring. To this end, we consider the probability distribution \(p(\mathcal {D} \,\vert \, x)\) over the entire corpus \(\mathcal {D}\) given an image x in a training batch \(\mathcal {B}\). We can then derive a regularizer \(R(\mathcal {B})\) that operates at batch-level:

$$\begin{aligned} R(\mathcal {B}) = \sum _{x \in \mathcal {B}} \Big ( -\big \langle p(\mathcal {D} \,\vert \, x),\ p(\mathcal {D} \,\vert \, x) \big \rangle + \sum _{x' \in \mathcal {B} \setminus \{x\}} \big \langle p(\mathcal {D} \,\vert \, x),\ p(\mathcal {D} \,\vert \, x') \big \rangle \Big ) \end{aligned}$$
(1)

where \(\langle \cdot , \cdot \rangle \) denotes the inner product of two vectors. The intuition behind the two terms of the regularizer is as follows. The first term, \(\langle p(\mathcal {D}\,\vert \,x),\, p(\mathcal {D}\,\vert \,x) \rangle \), is maximal when the distribution assigns all its mass to a single document. Since the score \(z_{ij}\) is averaged over all captions of one image, this additionally has the side effect of encouraging all captions of one image to vote for the same document. The second term of \(R(\mathcal {B})\) encourages the distributions of two different images to be orthogonal, favoring the assignment of images uniformly across all documents.
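For reference, a minimal PyTorch sketch of the regularizer in Eq. (1), assuming the per-image document distributions of a batch are stacked into a single matrix; the variable names are illustrative.

```python
import torch

def score_distribution_prior(p):
    """p: tensor of shape (B, K), where row p[i] is p(D | x_i), the document
    distribution of the i-th image in the batch. Implements Eq. (1): a self
    term that sharpens each distribution and a cross term that pushes the
    distributions of different images towards orthogonality."""
    gram = p @ p.t()                                   # gram[i, j] = <p(D|x_i), p(D|x_j)>
    self_term = -torch.diag(gram).sum()                # encourages peaked (near one-hot) distributions
    cross_term = gram.sum() - torch.diag(gram).sum()   # sum of off-diagonal inner products
    return self_term + cross_term
```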

Since \(R(\mathcal {B})\) requires evaluation over the whole document corpus for every image, we first pre-train f, including the large transformer model T (cf. Sect. 3.1). After convergence, we extract sentence features for all documents and image descriptions and train only the MLPs \(\phi \) and h with \(\mathcal {L} + \lambda R\), where \(\lambda \) balances the 3-class cross entropy loss \(\mathcal {L}\) and the regularizer.

4 Experiments

We validate our method empirically for bird and plant identification. To the best of our knowledge, we are the first to consider this task; thus, in the absence of existing methods to compare against, we ablate the different components of our model and compare to several strong baselines.

4.1 Datasets and Experimental Setup

Datasets We evaluate our method on Caltech-UCSD Birds-200-2011 (CUB-200) (Wah et al., 2011) and the Oxford-102 Flowers (FLO) dataset (Nilsback and Zisserman, 2006). For both datasets, Reed et al. (2016) have collected several visual descriptions per image by crowd-sourcing to non-experts on Amazon Mechanical Turk (AMT).

CUB-200 The Caltech-UCSD Birds-200-2011 (CUB-200) dataset (Wah et al., 2011) contains images of 200 different bird species. The train and test sets contain 5,994 and 5,794 images respectively. We have collected expert documents—one document per category—by crawling AllAboutBirds (AAB), a collection of bird identification guides made available by the Cornell Lab of Ornithology. Each document consists of an Overview and an ID info section. From Overview we obtain a basic description; from ID info we use the Identification, Size & Shape, Color Pattern, Behavior and Habitat entries, omitting the relative size table under Size & Shape. For 17 categories that were not found in AAB, we resorted to Wikipedia articles instead, querying the article for each bird class using the MediaWiki API and keeping only the introduction, description and life history sections. In all documents, we replace any mention of the class names with the phrase “a bird”, so that the model cannot cheat by exploiting expert labels.

Oxford-102 Flowers The Oxford-102 Flowers (FLO) dataset (Nilsback and Zisserman, 2006) contains images of 102 categories of flowers. We use the official train and test sets of 1,020 and 6,149 images respectively. Similar to CUB-200, we create an expert document corpus with one document per category by parsing Wikipedia data using the MediaWiki API. We use the summary, cultivation, distribution, description, ecology, flowers and habitat sections and ignore the rest. We replace the expert class names in the corpus with the phrase “a flower”.

Setup We use the image-caption pairs to train two image captioning models: “Show, Attend and Tell” (SAT) (Xu et al., 2015) and AoANet (Huang et al., 2019). Unless otherwise specified, we report the performance of our model based on their ensemble, i.e. combining captions from both models. As the backbone T of our sentence transformer model, we use RoBERTa-large (Liu et al., 2019) fine-tuned on NLI and STS datasets using the setup of (Reimers and Gurevych, 2019).

4.1.1 Image Captioning

We consider the following captioning models.

SAT We train Show, Attend and Tell (SAT) (Xu et al., 2015) for 100 epochs with a batch size of 64, using the implementation of Vedantam et al. (2017). We use a ResNet-34 (He et al., 2015) encoder and an LSTM decoder with an input size of 512 and a hidden state size of 1800. We use the Adam optimizer with a learning rate of 0.002, a dropout rate of 0.5 and a vocabulary size of 5,726.

AoANet For AoANet (Huang et al., 2019), we extract bottom-up features with a Faster-RCNN (Ren et al., 2016) backbone pretrained on ImageNet (http://image-net.org/challenges/LSVRC/2015/results) and Visual Genome (Krishna et al., 2017). The original 2048-dimensional vectors are projected to D=1024, and the decoder LSTM hidden state size is 1024. The vocabulary size is 1,682 for CUB-200 and 1,711 for FLO. We train for 30 epochs with a batch size of 10, using the Adam optimizer (Kingma and Ba, 2015) with an initial learning rate of \(2\cdot 10^{-4}\), annealed by a factor of 0.8 every 5 epochs. We use the implementation from the authors’ repository. During inference, we apply beam search with a beam size of 10 to sample multiple captions from both methods. We train the captioning models on the official data splits, reserving 10% of the training images for validation in all experiments except the zero-shot experiments, where we follow the zero-shot data split.

BLIP2 We finetune the BLIP2 (Li et al., 2023) 2.7B model, starting from the COCO captioning weights, for 5 epochs with a learning rate of \(10^{-5}\), a batch size of 256 and 1,000 warmup steps. The image resolution is set to 364 and the drop path rate to 0. We use the AdamW optimizer with \(\beta =(0.9, 0.999)\) and a weight decay of 0.05; the layer-wise decay rate is set to 0.95.

OFA We train OFA (Wang et al., 2022) separately on the CUB-200 and FLO datasets. We use OFA-base and start from the COCO captioning weights, training for 5 epochs with a learning rate of \(10^{-5}\) and a batch size of 32. We use a cross-entropy loss with label smoothing of 0.1.

For Table 4, we follow the GZSL split proposed in (Xian et al., 2018), using the trainval set to train the captioning models, with 10% of the images again kept aside for validation. We therefore explicitly avoid using “unseen” categories when training the captioning models.

While general image captioning is known to suffer from low diversity, in our fine-grained setting this is less problematic for two reasons. First, the vocabulary used is specific to the domain, e.g. captions describe specific parts (beak, wings, tail) of birds. Second, captions describing similar images, such as images of the same class, should indeed exhibit similarity rather than distinctiveness.

4.1.2 FGSM Implementation Details

T is a sentence transformer with a RoBERTa-large backbone pretrained on the SNLI (Bowman et al., 2015), Multi-Genre NLI (Williams et al., 2018) and STS (Cer et al., 2017) benchmarks. The pretrained model is obtained from the publicly available repository of Reimers and Gurevych (2019). \(\phi \) is implemented as a two-layer MLP with intermediate and output dimensions of 256 and 64 respectively and a \(\tanh \) activation function. For h we use a linear layer with an output dimension of 2 (for the binary classification task). During the first stage of training, we use a constant learning rate of \(0.5 \cdot 10^{-6}\) for T and \(10^{-5}\) for \(\phi \) and h respectively; weight decay is set to zero for \(\phi \) and h. We follow Reimers and Gurevych (2019) for the rest of the hyper-parameters. During the second stage, we aim to reduce the gap between the data that the model is exposed to during training and the target domain. We add the regularizer R and fix T, pre-computing all embeddings for computational efficiency. We retrain \(\phi \) and h from scratch with the Adam optimizer (Kingma and Ba, 2015) and an initial learning rate of \(10^{-5}\). For \(\phi \) we use a three-layer MLP with output dimensions of 256, 64 and 32. For h we use a linear layer with an output dimension of 3 to predict positive, negative and neutral sentence pairs, training with a cross-entropy loss and the regularizer with a weight factor \(\lambda =10\). The neutral sentence pairs are either a pair of captions from two different images that have no nouns in common, or a pair containing an image caption and a random sentence from the target corpus that have no nouns in common; we sample with equal probability from these two pools. The reasoning behind common nouns is that sentences containing the same nouns could potentially describe the same parts, e.g. head, beak, wings, while adjectives are often used as attributes, e.g. red wings, short beak. Pairs of sentences without common nouns contain neither entailing nor contradicting information, i.e. they describe different objects/parts, and can thus be safely considered neutral.

We use three metrics to evaluate the performance on the benchmark datasets. We compute top-1 and top-5 per-class retrieval accuracy and report the overall average. Additionally, we compute the mean rank (MR) of the target document for each class. Here, retrieval accuracy is identical to classification accuracy, since there is only a single relevant article per category.
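For concreteness, a small sketch of how these metrics can be computed from a score matrix, assuming one score per image-document pair and ground-truth document indices; the function and variable names are illustrative.

```python
import numpy as np

def retrieval_metrics(scores, targets):
    """scores: (N, K) array of document scores z_ij for N images over K classes.
    targets: (N,) array holding the index of the correct document per image.
    Returns mean per-class top-1 / top-5 accuracy and the mean rank (MR) of the
    target document (rank 1 = best)."""
    order = np.argsort(-scores, axis=1)                       # documents sorted by score, best first
    ranks = np.argmax(order == targets[:, None], axis=1) + 1  # 1-based rank of the correct document

    def per_class_acc(k):
        # average the top-k accuracy over classes rather than over images
        return float(np.mean([np.mean(ranks[targets == c] <= k) for c in np.unique(targets)]))

    return per_class_acc(1), per_class_acc(5), float(np.mean(ranks))
```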

Table 2 Comparison to baselines

4.2 Baseline Comparisons

Since this work is the first to explore the mapping of images to expert documents without expert supervision, we compare our method to several strong baselines (Table 2).

Since our FGSM model performs text-based retrieval, we also evaluate current text retrieval systems as baselines.

TF-IDF Term frequency-inverse document frequency (TF-IDF) is widely used for unsupervised document retrieval (Jones, 1972). For each image, we use the predicted captions as queries and rank documents using their TF-IDF representation instead of our model. We empirically found the cosine distance and n-grams with \(n \in \{2,3\}\) to perform best for TF-IDF.
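A minimal sketch of this baseline using scikit-learn, assuming the documents and the predicted captions are available as plain strings; the preprocessing details are assumptions.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

def tfidf_rank(documents, captions):
    """documents: list of K expert documents (one string per class).
    captions: list of predicted captions for a single image, used as the query.
    Returns document indices ranked by cosine similarity of TF-IDF vectors."""
    vectorizer = TfidfVectorizer(ngram_range=(2, 3))            # word n-grams with n in {2, 3}
    doc_vectors = vectorizer.fit_transform(documents)
    query_vector = vectorizer.transform([" ".join(captions)])   # concatenate captions into one query
    similarity = cosine_similarity(query_vector, doc_vectors)[0]
    return similarity.argsort()[::-1]                           # best-matching document first
```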

BM25 Similar to TF-IDF, BM25 (Robertson et al., 1995) is another common measure for document ranking based on n-gram frequencies. We use the BM25 Okapi implementation from the Python package rank-bm25 with default settings.
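A corresponding sketch using the rank-bm25 package; the whitespace tokenization is a simplifying assumption.

```python
from rank_bm25 import BM25Okapi

def bm25_rank(documents, captions):
    """documents: list of K expert documents (one string per class).
    captions: list of predicted captions for one image, used as the query."""
    tokenized_docs = [doc.lower().split() for doc in documents]
    bm25 = BM25Okapi(tokenized_docs)                 # default BM25 Okapi parameters
    query = " ".join(captions).lower().split()
    scores = bm25.get_scores(query)                  # one score per document
    return scores.argsort()[::-1]                    # best-matching document first
```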

RoBERTa One advantage of processing caption-sentence pairs with a Siamese architecture, such as SBERT/SRoBERTa (Reimers and Gurevych, 2019), is the reduced complexity. Nonetheless, we have trained a transformer baseline for text classification, using the same backbone (Liu et al., 2019), concatenating each sentence pair with a SEP token and training as a binary classification problem. We apply this model to score documents, instead of FGSM, aggregating scores at sentence-level.

SRoBERTa-NLI/STSb Finally, to evaluate the importance of learning fine-grained sentence similarities, we also measure the performance of the same model trained only on the NLI and STSb benchmarks (Reimers and Gurevych, 2019), without further fine-tuning.

Following Reimers and Gurevych (2019), we rank documents based on the cosine similarity between the caption and sentence embeddings.

Our method outperforms all bag-of-words and learned baselines. Approaches such as TF-IDF and BM25 are very efficient, albeit less performant than learned models. Notably, the closest in performance to our model is the transformer baseline (RoBERTa), which comes at a large computational cost (347 sec vs. 0.55 sec for our model per image on CUB-200).

Class Supervised For completeness, we also report the performance of a class-supervised model in Table 2. Specifically, we train a ResNet50 (He et al., 2016) classifier to predict the class label given an image. We fine-tune the model on each dataset (starting from ImageNet-pretrained weights) for 100 epochs with a learning rate of \(10^{-4}\) and the SGD optimizer.

4.3 Ablation & User Interaction

We ablate the different components of our approach in Table 3. We first investigate the use of a different scoring mechanism, i.e. the cosine similarity between the embeddings of c and s as in (Reimers and Gurevych, 2019); we found this to perform worse (FGSM + cosine).

Next, we evaluate the performance of our model after the final training phase, with the proposed regularizer and the inclusion of neutral pairs (Sect. 3.3). \(R(\mathcal {B})\) imposes prior knowledge about the expected class distribution over the dataset and thus stabilizes the training, resulting in improved performance ([2-cls]). Further, through the regularizer and neutral sentences ([3-cls]), FGSM is exposed to the target corpus during training, which helps reduce the domain shift during inference compared to training on image descriptions alone (FGSM w/ ensemble).

Finally, our method enables user interaction, i.e. allowing a user to directly enter their own descriptions, replacing the automatic description model. In Table 3 we simulate this by evaluating with ground-truth instead of predicted descriptions. Naturally, we find that human descriptions indeed perform better, though the performance gap is small. We attribute this gap to the much higher diversity of the human annotations. Current image captioning models still have diversity issues, which also explains why our ensemble variant improves the results.

Table 3 Ablations and user study

To measure the influence of the captions, in Table 5 we evaluate four captioning methods, SAT (Xu et al., 2015), AoANet (Huang et al., 2019), OFA (Wang et al., 2022) and BLIP2 (Li et al., 2023), and report our model’s performance with each. For this experiment, we train and compare all models without the regularizer \(R(\mathcal {B})\). We observe that captioning models that score higher on captioning metrics, e.g. ROUGE, CIDEr, etc., also perform well with FGSM. We show examples of captions predicted by the models in Fig. 5.

We also evaluate an ensemble of captions obtained from two methods, SAT and AoANet. The ensemble is created by combining the captions of both models and computing the average matching score over all captions. As in almost all tasks, the ensemble improves the performance. The gain, however, is small since (1) captions produced by different models tend to describe similar aspects of the image (Fig. 5), and (2) inaccurate captions still affect performance when averaging scores across captions.

Table 4 Comparison to cross-media retrieval
Table 5 Captioning models
Fig. 4

Qualitative Results (CUB-200). We show examples of input images and their predicted captions, followed by the top-5 retrieved documents (classes). For illustration purposes, we show a random image for each document; the image is not used for matching

Fig. 5

Predictions of Captioning Models (CUB-200). We show examples of captions predicted by the captioning models we use: SAT (Xu et al., 2015), AoANet (Huang et al., 2019), OFA (Wang et al., 2022) and BLIP2 (Li et al., 2023)

4.4 Comparison with Cross-Modal Retrieval

Since the nature of the problem presented here is in fact cross-modal, we adapt a representative method, DSCMR (Zhen et al., 2019), to our data to compare to the state of the art in cross-media retrieval. We note that such an approach requires image-document pairs as training samples, thus using more supervision than our method. Instead of using image descriptions as an intermediary for retrieval, DSCMR thus performs retrieval monolithically, mapping the modalities in a shared representation space. We argue that, although this is the go-to approach in broader category domains, it may be sub-optimal in the context of fine-grained categorization.

Since in our setting each category (species) is represented by a single article, in the scenario that a supervised model sees all available categories during training, the cross-modal retrieval problem degenerates to a classification task. Hence, for a meaningful comparison, we train both our model and DSCMR on the CUB-200 splits for ZSL (Xian et al., 2018) to evaluate on 50 unseen categories. We report the results in Table 4, including a TF-IDF baseline on the same split. Despite using no image-documents pairs for training, our method still performs significantly better.

Additionally, we compare to representative methods from the vision-and-language representation learning space. ViLBERT (Lu et al., 2019) is a multi-modal transformer model capable of learning joint representations of visual content and natural language. It is pre-trained on 3.3M image-caption pairs with two proxy tasks. We use its multi-modal alignment prediction mechanism to compute the alignment of the sentences in a document to a target image, similar to ViLBERT’s zero-shot experiments. The sentence scores are averaged to obtain the document alignment score, and the document with the maximum score is chosen as the class. Finally, we compare to CLIP (Radford et al., 2021), which learns a multimodal embedding space from 400M image-text pairs. CLIP predicts image and sentence embeddings with separate encoders; for a target image, we score each sentence using cosine similarity and average across the document for the final score. CLIP’s training data is not public, but the fact that removing class names from the documents hurts its performance suggests that its training data very likely contains expert labels.
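As an illustration of the CLIP scoring protocol described above, the following sketch uses the openai/CLIP package; the backbone choice (ViT-B/32) and the per-sentence batching are assumptions.

```python
import torch
import clip  # https://github.com/openai/CLIP

@torch.no_grad()
def clip_document_scores(model, preprocess, image, documents, device="cuda"):
    """model, preprocess: as returned by clip.load("ViT-B/32", device=device).
    image: a PIL image; documents: list of documents, each a list of sentences.
    Each document is scored as the mean cosine similarity between the image
    embedding and its sentence embeddings; the argmax gives the predicted class."""
    image_features = model.encode_image(preprocess(image).unsqueeze(0).to(device))
    image_features = image_features / image_features.norm(dim=-1, keepdim=True)

    scores = []
    for sentences in documents:
        tokens = clip.tokenize(sentences, truncate=True).to(device)
        text_features = model.encode_text(tokens)
        text_features = text_features / text_features.norm(dim=-1, keepdim=True)
        scores.append((text_features @ image_features.T).mean().item())
    return scores  # highest score = predicted document/class
```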

4.5 Qualitative Results

4.5.1 Model Performance

In Fig. 4, we show qualitative retrieval results. The input image is shown on the left followed by the predicted descriptions. We then show the top-5 retrieved documents/classes together with an example image for the reader. Note that the example images are not used for matching, as the FGSM module operates on text only. We find that in most cases, even when the retrieved document does not match the ground truth class, the visual appearance is still similar. This is especially noticeable in families of birds for which discriminating among individual species is considered to be particularly difficult even for humans, e.g. warblers (last row).

4.5.2 Sentence Composition

We observe that, in addition to our contrastive training on captions, FGSM also benefits from building on a pretrained large language model (RoBERTa). We show an example in Fig. 6. The first row shows the retrieval result for the caption "this bird has wings that are blue". As we add further criteria, the retrieval becomes more fine-grained, scoring documents that match the additional specification more positively.

Fig. 6

Sentence Composition Results (CUB-200). We show examples of the FGSM model being able to understand compound sentences. We start with a single caption and retrieve the best matching corpus classes in the first row. In the second and third row we add an additional condition to the caption which retrieves even finer-grained classes. For illustration purposes, we show a random image for each document; the image is not used for matching

4.5.3 Effectiveness of FGSM Training

We show how the fine-grained scoring of sentence-transformer embeddings changes when training with our method. For this experiment, we find the subset of pairs that contain one or more color-part combinations (e.g. brown wings, blue tail, etc.). For Fig. 7 we randomly sample a caption, “this bird has a long wide beak, and black wings, belly, and head.”, and calculate the similarity score for a set of captions using (Fig. 7a) FGSM and (Fig. 7b) RoBERTa. The set is created by combining various part and color names. The figure shows the distribution of similarity scores across the set of color-part combinations; we color each pair with the mean score of the captions containing that pair. We observe that only captions containing colors similar to black are scored as positive by FGSM, showing that our model can perform contrastive separation based on visual attributes, whereas RoBERTa scores all captions as positive and cannot discriminate between sentences with different visual attributes. Some other combinations are scored positively by our method, potentially reflecting the expected variance between different human descriptions; for example, blue and black often appear similar in an image, depending on lighting and the visibility of the part.

Fig. 7

Effectiveness of FGSM training (CUB-200). We use a random caption "this bird has a long wide beak, and black wings, belly, and head." and find its similarity with all the ground truth captions using FGSM and RoBERTa extracted features. The figures show the distribution where the size of the radius denotes the relative occurrence of that pair in captions. We color each color-part pair, e.g., {brown bill, black wings}, using the mean similarity score of all captions containing that pair. Red denotes the mean positive score and blue mean negative score. We find that, as a general-purpose text model, RoBERTa matches all captions with a positive score, while FGSM can contrast based on visual attributes and return positive matches only for colors/parts that are actually present in the caption (Color figure online)

Table 6 Captioning performance

4.5.4 Image Description Generalization

As an integral part of our approach, we analyze the performance of the captioning module. In particular, we are interested in whether there is any degradation in the capability of the captioning models to describe images of previously unseen categories. To understand whether the learned image descriptions depend on the training categories, we train the captioning model with the zero-shot learning split and compare the validation performance (in terms of common captioning metrics) between seen and unseen classes in Table 6. We report results using common metrics: BLEU-1-4 (Cho et al., 2014), METEOR (Denkowski and Lavie, 2014), ROUGE-L (Lin et al., 2004) and CIDEr-D (Vedantam et al., 2015). Interestingly, we find no significant difference in performance between seen and unseen classes, indicating that the model generalizes well to the appearance of novel categories. This is in line with our intuition and motivation: a layperson-inspired system should describe the appearance of objects without necessarily being able to recognize or name them, even when it has never previously encountered a given object.

4.5.5 Word Relevance

In Table 7 we show pairs of image descriptions and sentences from the expert corpus, along with the predicted score (after the sigmoid). We highlight the importance of individual words, estimated by masking each word and computing the difference between the new and the initial score. The model has learned to pay attention to colors and body parts, which affect its decision the most. The third example also shows that the model is sensitive to negative evidence, as it correctly identifies the color mismatch between the two sentences.
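A minimal sketch of this word-masking relevance estimate, assuming an fgsm_score function that returns the (sigmoid) matching score of a caption-sentence pair; the mask token and whitespace tokenization are assumptions.

```python
def word_relevance(fgsm_score, caption, sentence, mask_token="[MASK]"):
    """Estimate the importance of each caption word for the match with a given
    corpus sentence: mask one word at a time and record the drop in the FGSM
    matching score (a larger drop means a more important word)."""
    base_score = fgsm_score(caption, sentence)
    words = caption.split()
    relevance = []
    for i, word in enumerate(words):
        masked = " ".join(words[:i] + [mask_token] + words[i + 1:])
        relevance.append((word, base_score - fgsm_score(masked, sentence)))
    return relevance
```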

4.5.6 Sentence Relevance

While sensitivity to individual words is important, the model also needs to identify which parts of the expert document are relevant, as the documents often contain much more information, such as the behavior or history of a species. In Tables 8 and 9 we show matching results between a query and a document. We highlight the sentence with the highest matching score within the document given the query (image description); the highlighted sentences indeed correspond to the visual descriptions within the long documents.

Table 7 Word relevance visualization
Table 8 FGSM qualitative results
Table 9 FGSM qualitative results
Table 10 Contextualization of the CLEVER task

4.6 Comparison with Zero-Shot Learning

CLEVER is loosely related to the zero-shot learning (ZSL) problem, where, during inference, a model is tasked with classifying samples from classes that have not been observed during training. Unlike CLEVER, however, ZSL explicitly makes use of a subset of expert labels during training, and sometimes additional information (attributes, captions, etc.). Consequently, the CLEVER setting uses significantly reduced supervision (i.e., relying only on captions) in contrast to the ZSL setting.

To put our method in context, we compare it against ZSL approaches, even though they employ a higher degree of supervision. Due to the difference in available information during training in ZSL (i.e., some classes are known), it is important to evaluate seen and unseen classes separately. Overcoming this difference is one of the main challenges for generalized zero-shot methods (GZSL). In both settings, training is carried out on a set of 150 seen classes on CUB-200. In GZSL, during testing, the model has to label an image correctly among all 200 classes, including 50 unseen classes.

In Table 10, we evaluate our method on the ZSL and GZSL splits for CUB-200. To be compatible with the splits used for the (G)ZSL setting, we also train the captioning models and the FGSM module only on the “seen” classes (although no labels are observed). We do not use the regularizer \(R(\mathcal {B})\) for this experiment. The lack of expert annotations during training explains the gap in performance between our approach and ZSL/class-supervised methods, as we are tackling a significantly harder problem. However, while many GZSL methods show a large performance gap between seen and unseen classes, our method performs consistently on both sets. This implies that the document pool can be safely expanded to include more classes, if necessary, without the need to re-train for these new classes.

5 Discussion

As with any method that aims to reduce supervision, ours is not perfect. There are multiple avenues along which our approach can be further improved.

First, we observe that models trained for image captioning tend to produce short sentences that lack descriptiveness, focusing on the major features of the object rather than providing detailed, fine-grained descriptions of its unique aspects (Fig. 5). We believe there is scope for improvement if the captioning models could extensively describe each part and attribute of the object. We have tried to mitigate this issue by using an ensemble of two popular captioning networks; however, using multiple models and sampling multiple descriptions may lead to redundancy. Devising image captioning models that produce descriptive, fine-grained image descriptions may improve performance on the CLEVER task; this is an active area of research (Wang et al., 2020a, b).

Second, the proposed approach to scoring a document given an image uses all the sentences in the document, classifying them as positive, negative or neutral with respect to each input caption. Given that the information provided by an expert document might be noisy, i.e. not necessarily related to the visual domain, it is likely worthwhile to develop a filtering mechanism for relevancy, effectively using only a subset of the sentences for scoring.

Third, in-domain regularization results in a significant performance boost (Table 3), which implies that the CLEVER task is susceptible to the domain gap between laypeople’s descriptions and the expert corpus. Language models such as BERT/RoBERTa partially address this problem already by learning general vocabulary, semantics and grammar during pre-training on large text corpora, enabling generalization to a new corpus without explicit training. However, further research in reducing this domain gap seems worthwhile.

Finally, there has recently been an explosion of work on large multi-modal foundation models that are self-supervised on internet-scale datasets. These models have been found to contain strong priors about the world (Radford et al., 2021). Our model is trained on a very small dataset in comparison; it would be an interesting avenue to explore how FGSM scales with data and how existing foundation models could be used as a prior.

6 Conclusion

We have shown that it is possible to address fine-grained image recognition without the use of expert training labels by leveraging existing knowledge bases, such as Wikipedia. This is the first work to tackle this challenging problem, with performance gains over the state of the art in cross-media retrieval, despite the latter being trained with image-document pairs. While humans can easily access and retrieve information from such knowledge bases, CLEVER remains a challenging learning problem that merits future research.