The Curious Layperson: Fine-Grained Image Recognition Without Expert Labels

Most of us are not experts in specific fields, such as ornithology. Nonetheless, we do have general image and language understanding capabilities that we use to match what we see to expert resources. This allows us to expand our knowledge and perform novel tasks without ad-hoc external supervision. On the contrary, machines have a much harder time consulting expert-curated knowledge bases unless trained specifically with that knowledge in mind. Thus, in this paper we consider a new problem: fine-grained image recognition without expert annotations, which we address by leveraging the vast knowledge available in web encyclopedias. First, we learn a model to describe the visual appearance of objects using non-expert image descriptions. We then train a fine-grained textual similarity model that matches image descriptions with documents on a sentence-level basis. We evaluate the method on two datasets (CUB-200 and Oxford-102 Flowers) and compare with several strong baselines and the state of the art in cross-modal retrieval. Code is available at: https://github.com/subhc/clever.


Introduction
Deep learning and the availability of large-scale labelled datasets have led to remarkable advances in image recognition tasks, including fine-grained recognition [21,36,57]. The problem of fine-grained image recognition amounts to identifying subordinate-level categories, such as different species of birds, dogs or plants. Thus, the supervised learning regime in this case requires annotations provided by domain experts or citizen scientists [52].
While most people, unless professionally trained or enthusiasts, do not have knowledge in such specific domains, they are generally capable of consulting existing expert resources such as books or online encyclopedias, e.g. Wikipedia. As an example, let us consider bird identification. Amateur bird watchers typically rely on field guides to identify observed species. As a general instruction, one has to answer the question "what is most noticeable about this bird?" before skimming through the guide to find the best match to their observation.
Figure 1: Fine-Grained Image Recognition without Expert Labels. We propose a novel task that enables fine-grained classification without using expert class information (e.g. bird species) during training. We frame the problem as document retrieval from general image descriptions by leveraging existing textual knowledge bases, such as Wikipedia.
The answer to this question is typically a detailed description of the bird's shape, size, plumage colors and patterns. Indeed, in Fig. 1, the non-expert observer might not be able to directly identify a bird as a "Vermilion Flycatcher", but they can simply describe the appearance of the bird: "this is a bright red bird with black wings and tail and a pointed beak". This description can be matched to an expert corpus to obtain the species and other expert-level information.
On the other hand, machines have a much harder time consulting off-the-shelf expert-curated knowledge bases. In particular, most algorithmic solutions are designed to address a specific task with datasets constructed ad-hoc to serve precisely this purpose. Our goal, instead, is to investigate whether it is possible to re-purpose general image and text understanding capabilities to allow machines to consult already existing textual knowledge bases to address a new task, such as recognizing a bird.
We introduce a novel task inspired by the way a layperson would tackle fine-grained recognition from visual input; we name this CLEVER, i.e. Curious Layperson-to-Expert Visual Entity Recognition. Given an image of a subordinate-level object category, the task is to retrieve the relevant document from a large, expertly-curated text corpus; to this end, we only allow non-expert supervision for learning to describe the image. We assume that: (1) the corpus dedicates a separate entry to each category, as is, for example, the case in encyclopedia entries for bird or plant species, (2) there exist no paired data of images and documents or expert labels during training, and (3) to model a layperson's capabilities, we have access to general image and text understanding tools that do not use expert knowledge, such as image descriptions or language models. Given this definition, the task can be classified as weakly supervised in the taxonomy of learning problems.
We note that there are fundamental differences to related topics, such as image-to-text retrieval and unsupervised image classification. Despite a significant amount of prior work in image-to-text or text-to-image retrieval [20,22,41,58,72], the general assumption is that images and corresponding documents are paired for training a model. In contrast to unsupervised image classification, the difference is that here we are interested in semantically labelling images using a secondary modality, instead of grouping similar images [5,8,51].
To the best of our knowledge, we are the first to tackle the task of fine-grained image recognition without expert supervision. Since the target corpus is not required during training, the search domain is easily extendable to any number of categories/species, which is an ideal use case when retrieving documents from dynamic knowledge bases, such as Wikipedia. We provide extensive evaluation of our method and also compare to approaches in cross-modal retrieval, despite using significantly reduced supervision.

Related Work
In this paper, we address a novel problem (CLEVER). Next, we describe in detail how it differs from related problems in the computer vision and natural language processing literature, and summarise the differences with respect to how class information is used in Table 1.

Table 1: Overview of related topics by task and by class information at train and test time (K: known, U: unknown).
Fine-Grained Recognition. The goal of fine-grained visual recognition (FGVR) is categorising objects at subordinate level, such as species of animals or plants [29,37,52,53,57]. Large-scale annotated datasets require domain experts and are thus difficult to collect. FGVR is more challenging than coarse-level image classification as it involves categories with fewer discriminative cues and fewer labeled samples. To address this problem, supervised methods exploit side information such as part annotations [71], attributes [55], natural language descriptions [19], noisy web data [18,28,69] or humans in the loop [7,9,10]. Attempts to reduce supervision in FGVR are mostly targeted towards eliminating auxiliary labels, e.g. part annotations [17,24,49,73]. In contrast, our goal is fine-grained recognition without access to categorical labels during training. Our approach only relies on side information (captions) provided by laymen and is thus unsupervised from the perspective of "expert knowledge".
Zero/Few Shot Learning. Zero-shot learning (ZSL) is the task of learning a classifier for unseen classes [65]. A classifier is generated from a description of an object in a secondary modality, mapping semantic representations to class space in order to recognize said object in images [50]. Various modalities have been used as auxiliary information: word embeddings [16,64], hierarchical embeddings [26], attributes [3,14] or Wikipedia articles [12,13,44,74]. Most recent work uses generative models conditioned on class descriptions to synthesize training examples for unseen categories [15,27,32,56,66,67]. The multi-modal and often fine-grained nature of the standard and generalised (G)ZSL task renders it related to our problem. However, different from the (G)ZSL settings, our method uses neither class supervision during training nor image-document pairs as in [12,13,44,74].
Cross-Modal and Information Retrieval. While information retrieval deals with extracting information from document collections [35], cross-modal retrieval aims at retrieving relevant information across various modalities, e.g. image-to-text or vice versa. One of the core problems in information retrieval is ranking documents given some query, with a classical example being Okapi BM25 [48]. With the advent of transformers [54] and BERT [11], state-of-the-art document retrieval is achieved in two steps: an initial ranking based on keywords followed by computationally intensive BERT-based re-ranking [34,38,39,70]. In cross-modal retrieval, the common approach is to learn a shared representation space for multiple modalities [4,20,22,40,41,42,58,62,72]. In addition to paired data in various domains, some methods also exploit auxiliary semantic labels; for example, the Wikipedia benchmark [43] provides broad category labels such as history, music, sport, etc. We depart substantially from the typical assumptions made in this area. Notably, with the exception of [20,60], this setting has not been explored in fine-grained domains, but generally targets higher-level content association between images and documents. Furthermore, one major difference between our approach and cross-modal retrieval, including [20,60], is that we do not assume paired data between the input domain (images) and the target domain (documents). We address the lack of such pairs using an intermediary modality (captions) that allows us to perform retrieval directly in the text domain.
Natural Language Inference (NLI) and Semantic Textual Similarity (STS). Also related to our work, in natural language processing, the goal of the NLI task is to recognize textual entailment, i.e. given a pair of sentences (premise and hypothesis), the goal is to label the hypothesis as entailment (true), contradiction (false) or neutral (undetermined) with respect to the premise [6,63]. STS measures the degree of semantic similarity between two sentences [1,2]. Both tasks play an important role in semantic search and information retrieval and are currently dominated by the transformer architecture [11,31,47,54]. Inspired by these tasks, we propose a sentence similarity regime that is domain-specific, paying attention to fine-grained semantics.

Method
We introduce the problem of layperson-to-expert visual entity recognition (CLEVER), which we address via image-based document retrieval. Formally, we are given a set of images $x_i \in I$ to be labelled given a corpus of expert documents $D_j \in D$, where each document corresponds to a fine-grained image category and there exist $K = |D|$ categories in total. As a concrete example, $I$ can be a set of images of various bird species and $D$ a bird identification corpus constructed from specialized websites (with one article per species). Crucially, the pairing of $x_i$ and $D_j$ is not known, i.e. no expert task supervision is available during training. Therefore, the mapping from images to documents cannot be learned directly but can be discovered through the use of non-expert image descriptions $C_i$ for image $x_i$.
Our method consists of three distinct parts. First, we learn, using "layperson's supervision", an image captioning model that uses simple color, shape and part descriptions. Second, we train a model for Fine-Grained Sentence Matching (FGSM). The FGSM model takes as input a pair of sentences and predicts whether they are descriptions of the same object. Finally, we use the FGSM to score the documents in the expert corpus via voting. As there is one document per class, the species corresponding to the highest-scoring document is returned as the final class prediction for the image. The overall inference process is illustrated in Fig. 2.

Fine-grained Sentence Matching
The overall goal of our method is to match images to expert documents; however, in the absence of paired training data, learning a cross-domain mapping directly is not possible. On the other hand, describing an image is an easy task for most humans, as it usually does not require domain knowledge. It is therefore possible to leverage image descriptions as an intermediary for learning to map images to an expert corpus.
To that end, the core component of our approach is the FGSM model $f(c_1, c_2) \in \mathbb{R}$ that scores the visual similarity of two descriptions $c_1$ and $c_2$. We propose to train $f$ in a manner similar to the textual entailment (NLI) task in natural language processing. The difference to NLI is that the information that needs to be extracted here is fine-grained and domain-specific, e.g. "a bird with blue wings" vs. "this is a uniformly yellow bird". Since we do not have annotated sentence pairs for this task, we have to create them synthetically. Instead of the terms entailment and contradiction, here we use positive and negative to emphasize that the goal is to find matches (or mismatches) between image descriptions.
We propose to model $f$ as a sentence encoder, performing the semantic comparison of $c_1$ and $c_2$ in embedding space. Despite their widespread success in downstream tasks, most transformer-based language models are notoriously bad at producing semantically meaningful sentence embeddings [30,47]. We thus follow [47] in learning an appropriate textual similarity model with a Siamese architecture built on a pre-trained language transformer. This also allows us to leverage the power of large language models while maintaining efficiency, by computing an embedding for each input independently and comparing embeddings only as a last step. To this end, we compute the similarity score $f(c_1, c_2)$ from the sentence embeddings of $c_1$ and $c_2$.
Training. One requirement is that the FGSM model should be able to identify fine-grained similarities between pairs of sentences. This is in contrast to the standard STS and NLI tasks in natural language understanding, which determine the relationship (or degree of similarity) of a sentence pair on a coarser semantic level. Since our end-goal is visual recognition, we instead train the model to emphasize visual cues and nuanced appearance differences.
Let $C_i$ be the set of human-annotated descriptions for a given image $x_i$. Positive training pairs are generated by exploiting the fact that, commonly, each image has been described by multiple annotators; for example, in CUB-200 [57] there are $|C_i| = 10$ captions per image. Thus, each pair (from $C_i \times C_i$) of descriptions of the same image can be used as a positive pair. The negative counterparts are then sampled from the complement $\bar{C}_i = \bigcup_{l \neq i} C_l$, i.e. among the available descriptions for all other images in the dataset. We construct this dataset with an equal number of samples for both classes and train $f$ with a binary cross-entropy loss.
Inference. During inference, the sentence embeddings $\phi$ for each sentence in each document can be precomputed, and only $h$ needs to be evaluated dynamically given an image and its corresponding captions, as described in the next section. This greatly reduces the memory and time requirements.
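The construction of training pairs described under Training can be sketched in a few lines of Python. This is a minimal illustration rather than the authors' implementation: the function name `build_pairs`, the toy captions, and the choice to draw exactly one negative per positive (to keep the two classes balanced) are assumptions, and only unordered pairs of distinct captions are used as positives.

```python
import random
from itertools import combinations


def build_pairs(captions, seed=0):
    """Build (c1, c2, label) training pairs from per-image captions.

    captions: dict mapping an image id to its list of caption strings.
    label 1 = positive (same image), label 0 = negative (different images).
    Assumes at least two images, so every image has a non-empty complement.
    """
    rng = random.Random(seed)
    image_ids = sorted(captions)
    pairs = []
    for i in image_ids:
        # Complement set: captions of every image other than i.
        others = [c for l in image_ids if l != i for c in captions[l]]
        # Positive pairs: two different captions of the same image.
        for c1, c2 in combinations(captions[i], 2):
            pairs.append((c1, c2, 1))
            # One sampled negative per positive keeps the classes balanced.
            pairs.append((c1, rng.choice(others), 0))
    return pairs


# Hypothetical toy data, for illustration only.
toy = {
    "img_a": ["a bright red bird with black wings",
              "this bird is red with a pointed beak"],
    "img_b": ["a small yellow bird with grey wings",
              "this bird is yellow with a dark eye stripe"],
}
pairs = build_pairs(toy)
```

With ten captions per image, as in CUB-200, this sketch would yield 45 positives and 45 negatives per image; whether ordered pairs or self-pairs are also used is not specified in the text.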

Document Scoring
Although trained from image descriptions alone, the FGSM model can take any sentence as input and, at test time, we use the trained model to score sentences from an expert corpus against image descriptions. Specifically, we assign a score $z_{ij} \in \mathbb{R}$ to each expert document $D_j$ given a set of descriptions for the $i$-th image:
$$z_{ij} = \frac{1}{|C_i|\,|D_j|} \sum_{c \in C_i} \sum_{s \in D_j} f(c, s).$$
Since there are several descriptions in $C_i$ and sentences in $D_j$, we compute the final document score as an average of individual predictions (scores) of all pairs of descriptions and sentences. Aggregating scores across the whole corpus $D$, we can then compute the probability $p(D_j \mid x_i) = e^{-z_{ij}} / \sum_k e^{-z_{ik}}$ of a document $D_j \in D$ given image $x_i$ and assign the document (and consequently class) with the highest probability to the image.
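The averaging step of document scoring can be illustrated with a short sketch. The function `toy_f` is a hypothetical stand-in for the trained FGSM model (a simple word-overlap ratio), and the captions and corpus are made-up examples; only the averaging over all caption-sentence pairs and the ranking by score follow the description in the text.

```python
def document_score(captions, sentences, f):
    """Average f(c, s) over all caption/sentence pairs (the score z_ij)."""
    total = sum(f(c, s) for c in captions for s in sentences)
    return total / (len(captions) * len(sentences))


def toy_f(caption, sentence):
    # Hypothetical stand-in for the trained FGSM: Jaccard word overlap.
    a, b = set(caption.split()), set(sentence.split())
    return len(a & b) / len(a | b)


# Predicted descriptions for one image (illustrative).
captions = ["a red bird with black wings", "this bird is red"]

# One document per class, split into sentences (illustrative).
corpus = {
    "vermilion flycatcher": ["the male has a bright red crown",
                             "wings are black"],
    "evening grosbeak": ["the male is yellow",
                         "the bill is pale ivory"],
}

z = {name: document_score(captions, sents, toy_f)
     for name, sents in corpus.items()}

# The highest-scoring document gives the class prediction.
ranking = sorted(z, key=z.get, reverse=True)
```

In a real system, `f` would be evaluated on precomputed sentence embeddings, so that only the lightweight comparison runs per image.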

Bridging the Domain Gap
While training the FGSM model, we have so far only used laypersons' descriptions, disregarding the expert corpus. However, we can expect the documents to contain significantly more information than visual descriptions. In the case of bird species, encyclopedia entries usually also describe behavior, migration, conservation status, etc. In this section, we thus employ two mechanisms to bridge the gap between the image descriptions and the documents.
Neutral Sentences. We introduce a third, neutral class to the classification problem, designed to capture sentences that do not provide relevant (visual) information. We generate neutral training examples by pairing an image description with sentences from the documents (or other descriptions) that do not have any nouns in common. Instead of binary cross entropy, we train the three-class model (positive/neutral/negative) with softmax cross entropy.
Score Distribution Prior. Despite the absence of paired training data, we can still impose priors on the document scoring. To this end, we consider the probability distribution $p(D \mid x)$ over the entire corpus $D$ given an image $x$ in a training batch $B$. We can then derive a regularizer $R(B)$ with two terms that operates at batch level, where $\langle \cdot, \cdot \rangle$ denotes the inner product of two vectors. The intuition of the two terms of the regularizer is as follows. $\langle p(D \mid x), p(D \mid x) \rangle$ is maximal when the distribution assigns all mass to a single document. Since the score $z_{ij}$ is averaged over all captions of one image, this additionally has the side effect of encouraging all captions of one image to vote for the same document. The second term of $R(B)$ then encourages the distributions of two different images to be orthogonal, favoring the assignment of images uniformly across all documents.
Since $R(B)$ requires evaluation over the whole document corpus for every image, we first pre-train $f$, including the large transformer model $T$ (cf. Section 3.1). After convergence, we extract sentence features for all documents and image descriptions and train only the MLPs $\phi$ and $h$ with $L + \lambda R$, where $\lambda$ balances the 3-class cross entropy loss $L$ and the regularizer.
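To make the intuition behind the two terms concrete, the sketch below implements one plausible form of such a batch-level regularizer. The exact expression, weighting, and normalization are assumptions chosen to match the verbal description (reward peaked per-image distributions, penalize overlap between different images); they should not be read as the authors' definition of $R(B)$.

```python
def inner(p, q):
    """Inner product of two equally sized probability vectors."""
    return sum(a * b for a, b in zip(p, q))


def batch_regularizer(P):
    """Assumed form: -mean_x <p_x, p_x> + mean_{x != x'} <p_x, p_x'>.

    P: list of probability vectors p(D | x), one per image in the batch.
    Lower values are better under this sign convention.
    """
    n = len(P)
    # First term: largest when each distribution puts all mass on one document.
    peaked = sum(inner(p, p) for p in P) / n
    # Second term: overlap between distributions of different images.
    cross = [inner(P[a], P[b]) for a in range(n) for b in range(n) if a != b]
    overlap = sum(cross) / len(cross) if cross else 0.0
    return -peaked + overlap


# Illustrative batches over a corpus of three documents.
confident_distinct = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]
uniform = [[1 / 3, 1 / 3, 1 / 3], [1 / 3, 1 / 3, 1 / 3]]
collapsed = [[1.0, 0.0, 0.0], [1.0, 0.0, 0.0]]

r_distinct = batch_regularizer(confident_distinct)
r_uniform = batch_regularizer(uniform)
r_collapsed = batch_regularizer(collapsed)
```

Under this assumed form, a batch in which every image confidently selects a different document obtains the lowest value, whereas both uninformative (uniform) predictions and a collapse of all images onto one document are penalized relative to it.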

Experiments
We validate our method empirically for bird and plant identification. To the best of our knowledge, we are the first to consider this task; thus, in the absence of state-of-the-art methods, we ablate the different components of our model and compare to several strong baselines.
Table 2: Comparison to baselines. We report the retrieval performance of our method on CUB-200 and Oxford-102 Flowers (FLO) and compare to various strong baselines.

Datasets and Experimental Setup
Datasets. We evaluate our method on Caltech-UCSD Birds-200-2011 (CUB-200) [57] and the Oxford-102 Flowers (FLO) dataset [36]. For both datasets, Reed et al. [46] have collected several visual descriptions per image by crowd-sourcing to non-experts on Amazon Mechanical Turk (AMT). We further collect for each class a corresponding expert document from specialised websites, such as AllAboutBirds (AAB) and Wikipedia.
Setup. We use the image-caption pairs to train two image captioning models: "Show, Attend and Tell" (SAT) [68] and AoANet [23]. Unless otherwise specified, we report the performance of our model based on their ensemble, i.e. combining predictions from both models. As the backbone $T$ of our sentence transformer model, we use RoBERTa-large [31] fine-tuned on NLI and STS datasets using the setup of [47]. Please see the appendix for further implementation, architecture, dataset and training details.
We use three metrics to evaluate the performance on the benchmark datasets. We compute top-1 and top-5 per-class retrieval accuracy and report the overall average. Additionally, we compute the mean rank (MR) of the target document for each class. Here, retrieval accuracy is identical to classification accuracy, since there is only a single relevant article per category.
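The three metrics can be computed as in the following sketch. The function names, the input format (one score dictionary per test image) and the toy data are assumptions for illustration; the sketch averages accuracy and rank within each class first and then across classes, which is one reading of "per-class" in the text.

```python
from collections import defaultdict


def target_rank(scores, target):
    """1-based rank of the target document, sorting by descending score."""
    order = sorted(scores, key=scores.get, reverse=True)
    return order.index(target) + 1


def evaluate(samples, ks=(1, 5)):
    """samples: list of (scores, target), with scores a dict document -> score.

    Returns per-class top-k accuracy and mean rank, averaged over classes.
    """
    ranks = defaultdict(list)
    for scores, target in samples:
        ranks[target].append(target_rank(scores, target))
    n_classes = len(ranks)
    result = {}
    for k in ks:
        per_class = [sum(r <= k for r in rs) / len(rs) for rs in ranks.values()]
        result[f"top{k}"] = sum(per_class) / n_classes
    per_class_rank = [sum(rs) / len(rs) for rs in ranks.values()]
    result["mean_rank"] = sum(per_class_rank) / n_classes
    return result


# Illustrative scores for four test images over a three-document corpus.
samples = [
    ({"a": 0.9, "b": 0.1, "c": 0.0}, "a"),
    ({"a": 0.5, "b": 0.4, "c": 0.1}, "b"),
    ({"a": 0.2, "b": 0.7, "c": 0.1}, "b"),
    ({"a": 0.6, "b": 0.3, "c": 0.1}, "c"),
]
metrics = evaluate(samples, ks=(1, 2))
```

Ties between documents are broken by dictionary order in this sketch; a real evaluation would need an explicit tie-breaking rule.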

Baseline Comparisons
Since this work is the first to explore the mapping of images to expert documents without expert supervision, we compare our method to several strong baselines (Table 2).
Since our FGSM performs text-based retrieval, we evaluate current text retrieval systems.
TF-IDF: Term frequency-inverse document frequency (TF-IDF) is widely used for unsupervised document retrieval [25]. For each image, we use the predicted captions as queries and use the TF-IDF textual representation for document ranking instead of our model. We empirically found the cosine distance and n-grams with n = 2, 3 to perform best for TF-IDF.
BM25: Similar to TF-IDF, BM25 [48] is another common measure for document ranking based on n-gram frequencies. We use the BM25 Okapi implementation from the Python package rank-bm25 with default settings.
RoBERTa: One advantage of processing caption-sentence pairs with a Siamese architecture, such as SBERT/SRoBERTa [47], is the reduced complexity. Nonetheless, we have trained a transformer baseline for text classification, using the same backbone [31], concatenating each sentence pair with a SEP token and training as a binary classification problem. We apply this model to score documents, instead of FGSM, aggregating scores at sentence-level.
SRoBERTa-NLI/STSb: Finally, to evaluate the importance of learning fine-grained sentence similarities, we also measure the performance of the same model trained only on the NLI and STSb benchmarks [47], without further fine-tuning. Following [47], we rank documents based on the cosine similarity between the caption and sentence embeddings.
Table 3: Ablation and user study. On CUB-200 we evaluate scoring functions, captioning models and the regularizer $R(B)$.
Our method outperforms all bag-of-words and learned baselines. Approaches such as TF-IDF and BM25 are very efficient, albeit less performant than learned models. Notably, the closest in performance to our model is the transformer baseline (RoBERTa), which comes at a large computational cost (347 sec vs. 0.55 sec for our model per image on CUB-200).
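For reference, the TF-IDF baseline described above (captions as queries, 2- and 3-gram features, cosine similarity) can be approximated in plain Python as follows. This is a simplified sketch: the whitespace tokenization, the smoothed IDF formula and the toy documents are assumptions, not the configuration used in the experiments.

```python
import math
from collections import Counter


def ngrams(text, ns=(2, 3)):
    """Word n-grams of a whitespace-tokenized, lower-cased string."""
    tokens = text.lower().split()
    return [" ".join(tokens[i:i + n])
            for n in ns for i in range(len(tokens) - n + 1)]


def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    nu = math.sqrt(sum(w * w for w in u.values()))
    nv = math.sqrt(sum(w * w for w in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0


def rank_documents(docs, captions):
    """docs: dict name -> text. Returns (ranking, similarities)."""
    names = list(docs)
    doc_counts = [Counter(ngrams(docs[n])) for n in names]
    # Document frequency and a smoothed IDF (an assumed, common variant).
    df = Counter(t for counts in doc_counts for t in counts)
    idf = {t: math.log((1 + len(names)) / (1 + c)) + 1.0 for t, c in df.items()}

    def weigh(counts):
        return {t: tf * idf[t] for t, tf in counts.items() if t in idf}

    # The query aggregates n-gram counts over all captions of one image.
    query = Counter()
    for c in captions:
        query.update(ngrams(c))
    q = weigh(query)
    sims = {n: cosine(weigh(dc), q) for n, dc in zip(names, doc_counts)}
    return sorted(sims, key=sims.get, reverse=True), sims


# Illustrative documents and captions.
docs = {
    "vermilion flycatcher": "the male has a bright red crown and black wings",
    "evening grosbeak": "the male has a yellow body and a pale ivory bill",
}
captions = ["a bright red bird with black wings",
            "this bird has a bright red crown"]
ranking, sims = rank_documents(docs, captions)
```

Such lexical matching only rewards exact n-gram overlap, which is one reason it can lag behind learned sentence similarity when captions and expert documents use different vocabulary.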

Ablation & User Interaction
We ablate the different components of our approach in Table 3. We first investigate the use of a different scoring mechanism, i.e. the cosine similarity between the embeddings of $c$ and $s$ as in [47]; we found this to perform worse (FGSM + cosine). We also study the influence of the captioning model on the final performance. We evaluate captions obtained by two methods, SAT [68] and AoANet [23], as well as their ensemble. The ensemble improves performance thanks to higher variability in the image descriptions. Next, we evaluate the performance of our model after the final training phase, with the proposed regularizer and the inclusion of neutral pairs (Section 3.3). $R(B)$ imposes prior knowledge about the expected class distribution over the dataset and thus stabilizes the training, resulting in improved performance ([2-cls]). Further, through the regularizer and neutral sentences ([3-cls]), FGSM is exposed to the target corpus during training, which helps reduce the domain shift during inference compared to training on image descriptions alone (FGSM w/ ensemble).
Finally, our method enables user interaction, i.e. allowing a user to directly enter their own descriptions, replacing the automatic description model. In Table 3 we have simulated this by evaluating with ground-truth instead of predicted descriptions. Naturally, we find that human descriptions indeed perform better, though the performance gap is small. We attribute this gap to a much higher diversity in the human annotations. Current image captioning models still have diversity issues, which also explains why our ensemble variant improves the results.


Comparison with Cross-Modal Retrieval
Since the nature of the problem presented here is in fact cross-modal, we adapt a representative method, DSCMR [72], to our data to compare to the state of the art in cross-media retrieval.
We note that such an approach requires image-document pairs as training samples, thus using more supervision than our method. Instead of using image descriptions as an intermediary for retrieval, DSCMR thus performs retrieval monolithically, mapping the modalities in a shared representation space. We argue that, although this is the go-to approach in broader category domains, it may be sub-optimal in the context of fine-grained categorization. Since in our setting each category (species) is represented by a single article, in the scenario that a supervised model sees all available categories during training, the cross-modal retrieval problem degenerates to a classification task. Hence, for a meaningful comparison, we train both our model and DSCMR on the CUB-200 splits for ZSL [65] to evaluate on 50 unseen categories. We report the results in Table 4, including a TF-IDF baseline on the same split. Despite using no image-document pairs for training, our method still performs significantly better.
Additionally, we compare to representative methods from the vision-and-language representation learning space. ViLBERT [33] is a multi-modal transformer model capable of learning joint representations of visual content and natural language. It is pre-trained on 3.3M image-caption pairs with two proxy tasks. We use their multi-modal alignment prediction mechanism to compute the alignment of the sentences in a document to a target image, similar to ViLBERT's zero-shot experiments. The sentence scores are averaged to get the document alignment score and the document with the maximum score is chosen as the class. Finally, we compare to CLIP [45], which learns a multimodal embedding space from 400M image-text pairs. CLIP predicts image and sentence embeddings with separate encoders. For a target image, we score each sentence using cosine similarity and average across the document for the final score. CLIP's training data is not public, but we find that there is a high possibility it does indeed contain expert labels, as removing class names from documents hurts its performance.

Qualitative Results
In Fig. 3, we show qualitative retrieval results. The input image is shown on the left followed by the predicted descriptions. We then show the top-5 retrieved documents/classes together with an example image for the reader. Note that the example images are not used for matching, as the FGSM module operates on text only. We find that in most cases, even when the retrieved document does not match the ground truth class, the visual appearance is still similar. This is especially noticeable in families of birds for which discriminating among individual species is considered to be particularly difficult even for humans, e.g. warblers (last row).

Discussion
As with any method that aims to reduce supervision, our method is not perfect. There are multiple avenues where our approach can be further optimized.
First, we observe that models trained for image captioning tend to produce short sentences that lack distinctiveness, focusing on the major features of the object rather than providing detailed fine-grained descriptions of the object's unique aspects. We believe there is scope for improvement if the captioning models could extensively describe each different attribute of the object. We have tried to mitigate this issue by using an ensemble of two popular captioning networks. However, using multiple models and sampling multiple descriptions may lead to redundancy. Devising image captioning models that produce diverse and distinct fine-grained image descriptions may provide improved performance on the CLEVER task; this is an active area of research [59,61].
Second, the proposed approach to scoring a document given an image uses all the sentences in the document, classifying them as positive, negative or neutral with respect to each input caption. Given that the information provided by an expert document might be noisy, i.e. not necessarily related to the visual domain, it is likely worthwhile to develop a filtering mechanism for relevancy, effectively using only a subset of the sentences for scoring.
Finally, in-domain regularization results in a significant performance boost (Table 3), which implies that the CLEVER task is susceptible to the domain gap between laypeople's descriptions and the expert corpus. Language models such as BERT/RoBERTa partially address this problem already by learning general vocabulary, semantics and grammar during pre-training on large text corpora, enabling generalization to a new corpus without explicit training. However, further research in reducing this domain gap seems worthwhile.

Conclusion
We have shown that it is possible to address fine-grained image recognition without the use of expert training labels by leveraging existing knowledge bases, such as Wikipedia. This is the first work to tackle this challenging problem, with performance gains over the state of the art on cross-media retrieval, even though those methods are trained with image-document pairs. While humans can easily access and retrieve information from such knowledge bases, CLEVER remains a challenging learning problem that merits future research.

Figure 2: Overview. We train a model for fine-grained sentence matching (FGSM) using layperson's annotations, i.e. class-agnostic image descriptions. At test time, we score documents from a relevant corpus and use the top-ranked document to label the image.
In the FGSM similarity score $f(c_1, c_2)$, $[\cdot]$ denotes concatenation, and $h$ and $\phi$ are lightweight MLPs operating on the average-pooled output of a large language model $T(\cdot)$, with the shorthand notation $\phi_1 = \phi(T(c_1))$.

Figure 3: Qualitative Results (CUB-200). We show examples of input images and their predicted captions, followed by the top-5 retrieved documents (classes). For illustration purposes, we show a random image for each document; the image is not used for matching.