Large Scale Retrieval and Generation of Image Descriptions
- Cite this article as: Ordonez, V., Han, X., Kuznetsova, P. et al. Int J Comput Vis (2016) 119: 46. doi:10.1007/s11263-015-0840-y
What is the story of an image? What is the relationship between pictures, language, and information we can extract using state of the art computational recognition systems? In an attempt to address both of these questions, we explore methods for retrieving and generating natural language descriptions for images. Ideally, we would like our generated textual descriptions (captions) to both sound like a person wrote them, and also remain true to the image content. To do this we develop data-driven approaches for image description generation, using retrieval-based techniques to gather either: (a) whole captions associated with a visually similar image, or (b) relevant bits of text (phrases) from a large collection of image + description pairs. In the case of (b), we develop optimization algorithms to merge the retrieved phrases into valid natural language sentences. The end result is two simple, but effective, methods for harnessing the power of big data to produce image captions that are altogether more general, relevant, and human-like than previous attempts.
Keywords: Retrieval · Image description · Data driven · Big data · Natural language processing
Producing a relevant and accurate caption for an arbitrary image is an extremely challenging problem because a system must not only estimate what image content is depicted, but also predict what a person would choose to describe. However, many images with relevant associated descriptive text already exist in the noisy vastness of the web. The key is to find the right images and make use of them in the right way! In this paper, we present two techniques that skim the top of the image understanding problem, captioning photographs with a data-driven approach. To enable such approaches we have collected a large pool of images with associated visually descriptive text. We develop retrieval algorithms to find good strings of text to describe an image, ultimately allowing us to produce natural-sounding and relevant captions for query images. These data-driven techniques follow in the footsteps of past work on internet vision demonstrating that big data can often make challenging problems amenable to simple non-parametric matching methods, as in image localization (Hays and Efros 2008), retrieving photos with specific content (Torralba et al. 2008), and image parsing (Tighe and Lazebnik 2010).
A key potential advantage to making use of existing human-written image descriptions is that these captions may be more natural than those constructed directly from computer vision outputs using hand written rules. Furthermore we posit that many aspects of natural human-written image descriptions are difficult to produce directly from the output of computer vision systems, leading to unnatural sounding captions (see e.g. Kulkarni et al. 2013). This is one of our main motivations for seeking to sample from existing descriptions of similar visual content. Humans make subtle choices about what to describe in an image, as well as how to form descriptions, based on image information that is not captured in, for instance, a set of object detectors or scene classifiers. In order to mimic some of these human choices, we carefully sample from descriptions people have written for images with some similar visual content, be it the pose of a human figure, the appearance of the sky, the scene layout, etc. In this way, we implicitly make use of human judgments of content importance and of some aspects of human composition during description generation. Another advantage of this type of method is that we can produce subtle and varied natural language for images without having to build models for every word in a vast visual vocabulary—by borrowing language based on visual similarity.
This paper develops and evaluates two methods to automatically map photographs to natural language descriptions. The first uses global image feature representations to retrieve and transfer whole captions from database images to a query image (Ordonez et al. 2011). The second retrieves textual phrases from multiple visually similar database images, providing the building blocks, phrases, from which to construct novel and content-specific captions for a query image.
For the second method, finding descriptive phrases requires us to break the image down into constituent content elements: object detections (e.g., person, car, horse) and coarse regions from image parsing (e.g., grass, buildings, sky). We then retrieve visually similar instances of these objects and regions, as well as similar scenes and whole images, from a very large database of images with descriptions. Depending on what aspect of the image is being compared, we sample appropriate phrases from the descriptions. For example, a visual match to a similar sky might allow us to sample the prepositional phrase, “on a cloudless day.” Once candidate phrases are retrieved based on matching similar image content, we evaluate several collective selection methods to examine and rerank the set of retrieved phrases. This reranking step promotes content that is consistent across the matching results while demoting outliers in both the image and language domains. In addition to intrinsic evaluation, the final set of reranked phrases is evaluated in two applications. One tests the utility of the phrases for generating novel descriptive sentences. The second uses the phrases as features for text-based image search.
Data-driven approaches to generation require a set of captioned photographs. Some small collections of captioned images have been created by hand in the past. The UIUC sentence data sets contain 1k (Rashtchian et al. 2010) and 30k (Young et al. 2014) images respectively, each associated with five human-generated descriptions. The ImageCLEF image retrieval challenge contains 20k images with associated human descriptions. Most of these collections are relatively small for retrieval-based methods, as demonstrated by our experiments on captioning with varying collection size (Sect. 4). Therefore, we have collected and released the SBU Captioned Photo Dataset (Ordonez et al. 2011), containing 1,000,000 Flickr images with natural language descriptions. This dataset was collected by issuing a very large number of search queries to Flickr and then heuristically filtering the results to find visually descriptive captions. The resulting dataset is large and varied, enabling effective matching of whole or local image content. It also facilitates automatic tuning methods and evaluation that would not be possible on a dataset of only a few thousand captioned images. In addition, this is, to our knowledge, the first attempt to mine the internet for general captioned images.
We perform extensive evaluation of our proposed methods, including evaluation of the sentences produced by our baseline and phrase-based composition methods as well as evaluation of collective phrase selection and its application to text based image search. As these are relatively new and potentially subjective tasks, careful evaluation is important. We use a variety of techniques, from direct evaluation by people (using Amazon’s Mechanical Turk) to indirect automatic measures like BLEU (Papineni et al. 2002) and ROUGE (Lin 2004) scores for similarity to ground truth phrases and descriptions. Note that none of these evaluation metrics are perfect for this task (Kulkarni et al. 2013; Hodosh et al. 2013). Hopefully future research will develop better automatic methods for image description evaluation, as well as explore how descriptions should change as a function of task, e.g. to compose a description for image search vs image captioning for the visually impaired.
This paper makes the following contributions:
- A large data set containing images from the web with associated captions written by people, filtered so that the descriptions are likely to refer to visual content (Sect. 3). Previously published as part of Ordonez et al. (2011).
- A description generation method that utilizes global image representations to retrieve and transfer captions from our data set to a query image (Sect. 4). Previously published as part of Ordonez et al. (2011).
- New methods to utilize local image representations and collective selection to retrieve and rerank relevant phrases for images (Sect. 5).
- New evaluations of our proposed image description methods, collective phrase selection algorithms, and image search prototype (Sect. 7).
2 Related Work
Associating natural language with images is an emerging endeavor in computer vision. Some seminal work has looked at the task of mapping from images to text as a translation problem (similar to translating between two languages) (Duygulu et al. 2002). Other work has tried to estimate correspondences between keywords and image regions (Barnard et al. 2003), or faces and names (Berg et al. 2004a, b). In a parallel research goal, recent work has started to move beyond recognition of leaf-level object category terms toward mid-level elements such as attributes (Berg et al. 2010; Farhadi et al. 2009; Ferrari and Zisserman 2007; Kumar et al. 2009; Lampert et al. 2009), or hierarchical representations of objects (Deng et al. 2011a, b, 2012).
Image description generation in particular has been studied in recent papers (Farhadi et al. 2010; Feng and Lapata 2010; Hodosh et al. 2013; Kulkarni et al. 2013; Kuznetsova et al. 2012; Li et al. 2011; Mitchell et al. 2012; Ordonez et al. 2011; Yao et al. 2010; Mason and Charniak 2014; Guadarrama et al. 2013). Some approaches (Kulkarni et al. 2013; Li et al. 2011; Yang et al. 2011) generate descriptive text from scratch based on detected elements such as objects, attributes, and prepositional relationships. This results in descriptions for images that are sometimes closely related to image content, but that are also often quite verbose, non-human-like, or lacking in creativity. Other techniques for producing descriptive image text, e.g. (Yao et al. 2010), require a human in the loop for image parsing (except in specialized circumstances) and various hierarchical knowledge ontologies. The recent work of Hodosh et al. (2013) argues in favor of posing the image-level sentence annotation task as a sentence ranking problem, where performance is measured by the rank of the ground truth caption, but does not allow for composing new language for images.
Other attempts to generate natural language descriptions for images have made use of pre-associated text or other meta-data. For example, Feng and Lapata (2010) generate captions for images using extractive and abstractive generation methods, but assume relevant documents are provided as input. Aker et al. (2010) rely on GPS meta data to access relevant text documents.
The approaches most relevant to this paper make use of existing text for caption generation. In Farhadi et al. (2010), the authors produce image descriptions via a retrieval method, by translating both images and text descriptions to a shared meaning space represented by a single \(<object,action,scene>\) tuple. A description for a query image is produced by retrieving whole image descriptions via this meaning space from a set of image descriptions (the UIUC Pascal Sentence data set; Rashtchian et al. 2010). This results in descriptions that sound very human, since they were written by people, but which may not be relevant to the specific image content. This limited relevance often arises from sparsity, both in the data collection (1000 images is too few to guarantee similar image matches) and in the representation (only a few categories for three types of image content are considered).
In contrast, we attack the caption generation problem for more general images (images found via thousands of paired-word Flickr queries) and a larger set of object categories (89 vs. 20). In addition to extending the object category list considered, we also include a wider variety of image content aspects in our search terms, including: non-part-based region categorization, attributes of objects, activities of people, and a larger number of common scene classes. We also generate our descriptions via an extractive method with access to a much larger and more general set of captioned photographs from the web (1 million vs. 1 thousand).
Compared to past retrieval based generation approaches such as Farhadi et al. (2010) and our work Ordonez et al. (2011), which retrieve whole existing captions to describe a query image, here we develop algorithms to associate bits of text (phrases) with parts of an image (e.g. objects, regions, or scenes). As a product of our phrase retrieval process, we also show how to use our retrieved phrases (retrieved from multiple images) to compose novel captions, and to perform complex query retrieval. Since images are varied, the likelihood of being able to retrieve a complete yet relevant caption is low. Utilizing bits of text (e.g., phrases) allows us to directly associate text with parts of an image, resulting in more relevant and more specific captions when we apply our phrases to caption generation. A key subroutine in the process is reranking the retrieved phrases in order to produce a shortlist for the more computationally expensive optimization for description generation, or for use in complex query retrieval. In this paper we explore two techniques for performing this reranking collectively—taking into account the set of retrieved phrases. Our reranking approaches have close ties to work in information retrieval including PageRank (Jing and Baluja 2008) and TFIDF (Roelleke and Wang 2008).
Producing a relevant and human-like caption for an image is a decidedly subtle task. As previously mentioned, people make distinctive choices about what aspects of an image’s content to include or not include in their description. This link between visual importance and descriptions, studied in (Stratos et al. 2012), leads naturally to the problem of text summarization in natural language processing. In text summarization, the goal is to produce a summary for a document that describes the most important content contained in the text. Some of the most common and effective methods proposed for summarization rely on extractive summarization (Li et al. 2006; Mihalcea 2005; Nenkova et al. 2006; Radev and Allison 2004; Wong et al. 2008) where the most important or relevant text is selected from a document to serve as the document’s summary. Often a variety of features related to document content (Nenkova et al. 2006), surface (Radev and Allison 2004), events (Li et al. 2006) or feature combinations (Wong et al. 2008) are used in the selection process to compose sentences that reflect the most significant concepts in the document. Our retrieval based description generation methods can be seen as instances of extractive summarization because we make use of existing text associated with (visually similar) images.
3 Web-Scale Captioned Image Collection
One key requirement of this work is a web-scale database of photographs with associated descriptive text. To enable effective captioning of novel images, this database must satisfy two general requirements: (1) It must be large so that visual matches to the query are reasonably similar, (2) The captions associated with the database photographs must be visually relevant so that transferring captions between pictures driven by visual similarity is useful. To achieve the first requirement we queried Flickr using a huge number of pairs of query terms (objects, attributes, actions, stuff, and scenes). This produced a very large, but noisy initial set of photographs with associated text (hundreds of millions of images). To achieve our second requirement we filtered this set so that the descriptions attached to a picture are likely to be relevant and visually descriptive. To encourage visual descriptiveness, we select only those images with descriptions of satisfactory length, based on observed lengths in visual descriptions. We also enforce that retained descriptions contain at least two words belonging to our term lists and at least one prepositional word, e.g. “on”, “under” which often indicate visible spatial relationships.
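As a rough illustration, the filtering heuristic above can be sketched as follows. The term list, preposition list, and length bounds here are small hypothetical stand-ins for the much larger lists actually used to build the dataset:

```python
import re

# Hypothetical term and preposition lists; the real lists (objects,
# attributes, actions, stuff, scenes) are far larger.
TERMS = {"dog", "cat", "red", "running", "grass", "beach", "sky"}
PREPOSITIONS = {"on", "under", "in", "at", "near", "by", "with", "behind"}

def is_visually_descriptive(caption, min_len=2, max_len=30):
    """Heuristic filter: keep captions of reasonable length that contain
    at least two term-list words and at least one preposition."""
    words = re.findall(r"[a-z']+", caption.lower())
    if not (min_len <= len(words) <= max_len):
        return False
    n_terms = sum(1 for w in words if w in TERMS)
    has_prep = any(w in PREPOSITIONS for w in words)
    return n_terms >= 2 and has_prep

print(is_visually_descriptive("My dog running on the grass"))  # True
print(is_visually_descriptive("Best day ever!"))               # False
```

Applied to hundreds of millions of crawled captions, even a crude filter of this shape discards most non-visual text while retaining captions that mention concrete content and spatial relationships.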
This resulted in a final collection of over 1 million images with associated text descriptions—the SBU Captioned Photo Dataset. These text descriptions generally function in a similar manner to image captions, and usually directly refer to some aspects of the visual image content (see Fig. 1 for examples).
To evaluate whether the captions produced by our automatic filtering are indeed relevant to their associated images, we performed a forced-choice evaluation task, where a user is presented with two photographs and one caption. The user must assign the caption to the most relevant image (care is taken to remove biases due to temporal or left-right placement in the task). In this case we present the user with the original image associated with the caption and a random image. We perform this evaluation on 100 images from our web-collection using Amazon’s Mechanical Turk service, and find that users are able to select the ground truth image 96 % of the time. This demonstrates that the task is reasonable and that descriptions from our collection tend to be fairly visually specific and relevant. One possible additional pre-processing step for our dataset would be to use sentence compression by eliminating overly specific information as described in our previous work (Kuznetsova et al. 2013).
4 Global Generation of Image Descriptions
Past work has demonstrated that if your data set is large enough, some very challenging problems can be attacked with simple matching methods (Hays and Efros 2008; Torralba et al. 2008; Tighe and Lazebnik 2010). In this spirit, we harness the power of web photo collections in a non-parametric approach. Given a query image, \(I_q\), our goal is to generate a relevant description. In our first baseline approach, we achieve this by computing the global similarity of a query image to our large web-collection of captioned images. We find the closest matching image (or images) and simply transfer over the description from the matching image to the query image.
For measuring visual similarity we utilize two image descriptors. The first is the well-known gist feature, a global descriptor related to perceptual dimensions of scenes such as naturalness, roughness, and ruggedness (Oliva and Torralba 2001). The second is a global color descriptor computed by resizing the image into a “tiny image”, essentially a thumbnail of size 32\(\,\times \,\)32, which helps us match not only scene structure but also the overall color of images. To find visually relevant images we compute the similarity of the query image to images in the captioned photo dataset using an equally weighted sum of gist similarity and tiny image color similarity.
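The whole-caption transfer baseline thus reduces to nearest-neighbor search over global descriptors. A minimal sketch, using random arrays as stand-ins for precomputed gist and tiny-image features:

```python
import numpy as np

def global_similarity(gist_q, tiny_q, gist_db, tiny_db):
    """Equally weighted sum of gist similarity and 32x32 tiny-image color
    similarity (negative L2 distances; features assumed precomputed)."""
    gist_sim = -np.linalg.norm(gist_db - gist_q, axis=1)
    tiny_sim = -np.linalg.norm(tiny_db - tiny_q, axis=1)
    return gist_sim + tiny_sim

def transfer_caption(gist_q, tiny_q, gist_db, tiny_db, captions):
    """Whole-caption transfer: copy the caption of the closest match."""
    scores = global_similarity(gist_q, tiny_q, gist_db, tiny_db)
    return captions[int(np.argmax(scores))]

# Toy database of 5 "images" with hypothetical captions.
rng = np.random.default_rng(0)
gist_db, tiny_db = rng.random((5, 512)), rng.random((5, 32 * 32 * 3))
captions = ["a dog on the beach", "sunset over the lake", "a red barn",
            "kids playing soccer", "snowy mountain trail"]
# Querying with image 2's own features retrieves its own caption.
print(transfer_caption(gist_db[2], tiny_db[2], gist_db, tiny_db, captions))
```

At the scale of one million captioned photos, an exact scan like this is still feasible, though approximate nearest-neighbor indexing would be the natural next step.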
5 Retrieving and Reranking Phrases Describing Local Image Content
In this section we present methods to retrieve natural language phrases describing local and global image content from our large database of captioned photographs. Because we want to retrieve phrases referring to specific objects, relationships between objects and their background, or to the general scene, a large amount of image and text processing is first performed on the collected database (Sect. 5.1). This allows us to extract useful and accurate estimates of local image content as well as the phrases that refer to that content. For a novel query image, we can then use visual similarity measures to retrieve sets of relevant phrases describing image content (Sect. 5.2). Finally, we use collective reranking methods to select the most relevant phrases for the query image (Sect. 5.3).
5.1 Dataset Processing
We perform four types of dataset processing: object detection, rough image parsing to obtain background elements, scene classification, and caption parsing. This provides textual phrases describing both local (e.g. objects and local object context) and global (e.g. general scene context) image content.
5.1.1 Object Detection
5.1.2 Image Parsing
Image parsing is used to estimate regions of background elements in each database image. Six categories are considered: sky, water, grass, road, tree, and building, using detectors (Ordonez et al. 2011) which compute color, texton, HoG (Dalal and Triggs 2005) and Geometric Context (Hoiem et al. 2005) as input features to a sliding window based SVM classifier. These detectors are run on all database images.
5.1.3 Scene Classification
The scene descriptor for each image consists of the outputs of classifiers for 26 common scene categories. The features, classification method, and training data are from the SUN dataset (Xiao et al. 2010). This descriptor is useful for capturing and matching overall global scene appearance for a wide range of scene types. Scene descriptors are computed on 700,000 images from the database to obtain a large pool of scene descriptors for retrieval.
5.1.4 Caption Parsing
The Berkeley PCFG parser (Petrov et al. 2006; Petrov and Klein 2007) is used to obtain a hierarchical parse tree for each caption. From this tree we gather constituent phrases, (e.g., noun phrases, verb phrases, and prepositional phrases) referring to each of the above kinds of image content in the database.
5.2 Retrieving Phrases
For a query image, we retrieve several types of relevant phrases: noun-phrases (NPs), verb-phrases (VPs), and prepositional-phrases (PPs). Five different features are used to measure visual similarity: Color—LAB histogram, Texture—histogram of vector quantized responses to a filter bank (Leung and Malik 1999), SIFT Shape—histogram of vector quantized dense SIFT descriptors (Lowe 2004), HoG Shape—histogram of vector quantized densely computed HoG descriptors (Dalal and Triggs 2005), Scene—vector of classification scores for 26 common scene categories. The first 4 features are computed locally within an (object or stuff) region of interest and the last feature is computed globally.
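The phrase retrieval step can be sketched as an equally weighted combination of per-feature similarities between a query region and database regions, followed by a top-K lookup of the associated phrases. The feature dimensions and phrase strings below are hypothetical:

```python
import numpy as np

def cosine(a, B):
    """Cosine similarity of vector a against each row of matrix B."""
    return (B @ a) / (np.linalg.norm(B, axis=1) * np.linalg.norm(a) + 1e-12)

def retrieve_phrases(query_feats, db_feats, db_phrases, k=3):
    """Sum equally weighted per-feature cosine similarities (e.g. color,
    texture, SIFT, and HoG histograms for a region), then return the
    phrases attached to the k best-matching database regions."""
    score = sum(cosine(query_feats[name], db_feats[name]) for name in query_feats)
    top = np.argsort(-score)[:k]
    return [db_phrases[i] for i in top]

# Toy database: 6 regions, each with 4 feature histograms and a phrase.
rng = np.random.default_rng(1)
db_feats = {name: rng.random((6, 64)) for name in ("color", "texture", "sift", "hog")}
db_phrases = ["the brown dog", "a sleepy cat", "my little pup",
              "a spotted cow", "the old horse", "a brown mutt"]
query = {name: db_feats[name][0] for name in db_feats}  # query = first region
print(retrieve_phrases(query, db_feats, db_phrases))
```

In the full system the same pattern is instantiated separately for NPs, VPs, and the two kinds of PPs, each with its own notion of which region or image the similarity is computed over.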
5.2.1 Retrieving Noun-Phrases (NPs)
5.2.2 Retrieving Verb-Phrases (VPs)
For each proposed object detection in a query image, we retrieve a set of relevant verb-phrases from the database. Here we associate VPs in database captions to object detections in their corresponding database images if the detection category (or a synonym or holonym) is the head word in an NP from the same sentence (e.g. in Fig. 3 bottom right dog picture, “sleeping under my desk” is associated with the dog detection in that picture). Our measure of visual similarity is again based on equally weighted combination of color, texton, SIFT and HoG feature similarities. As demonstrated in Fig. 3 (left), this measure often captures similarity in pose.
5.2.3 Retrieving Image Parsing-Based Prepositional-Phrases (PPStuff)
5.2.4 Retrieving Scene-Based Prepositional-Phrases (PPScene)
5.3 Reranking Phrases
Given a set of retrieved phrases for a query image, we would like to rerank these phrases using collective measures computed on the entire set of retrieved results. Related reranking strategies have been used for other retrieval systems. Sivic and Zisserman (2003) retrieve images using visual words and then rerank them based on a measure of geometry and spatial consistency. Torralba et al. (2008) retrieve a set of images using a reduced representation of their feature space and then perform a second refined reranking phase on top matching images to produce exact neighbors.
In our case, instead of reranking images, our goal is to rerank retrieved phrases such that the relevance of the top retrieved phrases is increased. Because each phrase is retrieved independently in the phrase retrieval step, the results tend to be quite noisy. Spurious image matches can easily produce irrelevant phrases. The wide variety of Flickr users and contexts under which they capture their photos can also produce unusual or irrelevant phrases.
As an intuitive example, if one retrieved phrase describes a dog as “the brown dog” then the dog may be brown. However, if several retrieved phrases describe the dog in similar ways, e.g., “the little brown dog”, “my brownish pup”, “a brown and white mutt”, then it is much more likely that the query dog is brown and the predicted relevance for phrases describing brown attributes should be increased.
In particular, for each type of retrieved phrase (see Sect. 5.2), we gather the top 100 best matches based on visual similarity. Then, we perform phrase reranking to select the best and most relevant phrases for an image (or part of an image in the case of objects or regions). We evaluate two possible methods for reranking: (1) PageRank based reranking using visual and/or text similarity, (2) Phrase-level TFIDF based reranking.
5.3.1 PageRank Reranking
PageRank (Brin and Page 1998) computes a measure for the relative importance of items within a set based on the random walk probability of visiting each item. The algorithm was originally proposed as a measure of importance for web pages using hyperlinks as connections between pages (Brin and Page 1998), but has also been applied to other tasks such as reranking images for product search (Jing and Baluja 2008). For our task, we use PageRank to compute the relative importance of phrases within a retrieved set on the premise that phrases displaying strong similarity to other phrases within the retrieved set are more likely to be relevant to the query image.
We construct four graphs, one for each type of retrieved phrase (NP, VP, PPStuff, or PPScene), from the set of retrieved phrases for that type. Nodes in these graphs correspond to retrieved phrases (and the corresponding object, region, or image each phrase described in the SBU database). Edges between nodes are weighted using visual similarity, textual similarity, or an unweighted combination of the two—denoted as Visual PageRank, Text PageRank, or Visual \(+\) Text PageRank respectively. Text similarity is computed as the cosine similarity between phrases, where phrases are represented as a bag of words with a vocabulary size of approximately 100k words, weighted by term-frequency inverse-document frequency (TFIDF) score (Roelleke and Wang 2008). Here IDF measures are computed for each phrase type independently rather than over the entire corpus of phrases to produce IDF measures that are more type specific. Visual similarity is computed as cosine similarity of the visual representations used for retrieval (Sect. 5.2).
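The PageRank computation over one such phrase graph can be sketched with a few lines of power iteration; the similarity matrix below is a toy stand-in for the cosine similarities described above:

```python
import numpy as np

def pagerank(sim, damping=0.85, iters=50):
    """Power iteration over a phrase-similarity graph: phrases similar to
    many other retrieved phrases receive high importance scores."""
    n = sim.shape[0]
    W = sim.copy()
    np.fill_diagonal(W, 0.0)                                    # no self-edges
    W = W / np.maximum(W.sum(axis=1, keepdims=True), 1e-12)     # row-stochastic
    r = np.full(n, 1.0 / n)
    for _ in range(iters):
        r = (1 - damping) / n + damping * (W.T @ r)
    return r

# Toy graph: phrases 0-2 agree with each other ("brown dog" variants),
# phrase 3 is an outlier from a spurious visual match.
sim = np.array([[1.0, 0.8, 0.7, 0.1],
                [0.8, 1.0, 0.9, 0.1],
                [0.7, 0.9, 1.0, 0.1],
                [0.1, 0.1, 0.1, 1.0]])
scores = pagerank(sim)
print(np.argsort(-scores))  # the outlier phrase ranks last
```

The mutually consistent phrases reinforce one another through the random walk, while the spurious match receives little incoming weight, which is exactly the collective behavior motivating the reranking step.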
For generating complete image descriptions (Sect. 6.1), the PageRank score can be directly used as a unary potential for phrase confidence.
5.3.2 Phrase-level TFIDF Reranking
We would like to produce phrases for an image that are not only relevant, but specific to the particular depicted image content. For example, if we have a picture of a cow a phrase like “the cow” is always going to be relevant to any picture of a cow. However, if the cow is mottled with black and white patches then “the spotted cow” is a much better description for this specific example. If both of these phrases are retrieved for the image, then we would prefer to select the latter over the former.
To produce phrases with high description specificity, we define a phrase-level measure of TFIDF. This measure rewards phrases containing words that occur frequently within the retrieved phrase set, but infrequently within a larger set of phrases—therefore giving higher weight to phrases that are specific to the query image content (e.g., “spotted”). For object and stuff region related phrases (NPs, VPs, PPStuff), IDF is computed over phrases referring to that object or stuff category (e.g., the frequency of words occurring in a noun phrase with “cow” in the example above). For whole image related phrases (PPScene), IDF is computed over all prepositional phrases. To compute TFIDF for a phrase, the TFIDF for each word in the phrase is calculated (after removing stop words) and then averaged. Other work that has used TFIDF for image features (we use it for text associated with an image) include Sivic and Zisserman (2003), Chum et al. (2008), and Ordonez et al. (2011).
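A minimal sketch of the phrase-level TFIDF measure, with TF counted over the retrieved phrase set and IDF over a larger background set of phrases for the same category. The stop-word list and example phrases are illustrative:

```python
import math
from collections import Counter

STOP_WORDS = {"the", "a", "an", "my", "of", "and", "in", "on"}  # illustrative

def phrase_tfidf(phrase, retrieved, background):
    """Average per-word TFIDF over the phrase's non-stop words."""
    words = [w for w in phrase.lower().split() if w not in STOP_WORDS]
    if not words:
        return 0.0
    tf = Counter(w for p in retrieved for w in p.lower().split())
    n = len(background)
    def idf(w):
        df = sum(w in p.lower().split() for p in background)
        return math.log((n + 1) / (df + 1))  # smoothed IDF
    return sum(tf[w] * idf(w) for w in words) / len(words)

retrieved = ["the spotted cow", "a spotted cow grazing", "the cow", "a big cow"]
background = ["the cow", "a cow in a field", "the brown cow", "my cow",
              "a spotted dog"]  # "cow" is common overall, "spotted" is rare
generic = phrase_tfidf("the cow", retrieved, background)
specific = phrase_tfidf("the spotted cow", retrieved, background)
print(specific > generic)  # "spotted" is frequent here but rare overall
```

The measure rewards exactly the behavior described above: "spotted" occurs often in the retrieved set but rarely in the background, so the more specific phrase outscores the generic one.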
For composing image descriptions (Sect. 6.1), we use phrase-level TFIDF to rerank phrases and select the top 10 phrases. The original visual retrieval score (Sect. 5.2) is used as the phrase confidence score, effectively merging ideas of visual relevance with phrase specificity (denoted as Visual + TFIDF).
6 Applications of Phrases
Once we have retrieved (and reranked) phrases related to an image we can use the associated phrases in a number of applications. Here we demonstrate two potential applications: phrasal generation of image descriptions (Sect. 6.1), and complex query image search (Sect. 6.2).
6.1 Phrasal Generation of Image Descriptions
We model caption generation as an optimization problem in order to incorporate two different types of information: the confidence score of each retrieved phrase provided by the original retrieval algorithm (Sect. 5.2) or by our reranking techniques (Sect. 5.3), and additional pairwise compatibility scores across phrases computed using observed language statistics. Our objective is to select a set of phrases that are visually relevant to the image and that together form a reasonable sentence, which we measure by compatibility across phrase boundaries.
6.1.1 Unary Potentials
Unary potentials \(\phi (x)\) are computed as the confidence score of phrase x, determined by the retrieval and reranking techniques discussed in Sect. 5.3. To make scores comparable across different types of phrases, we normalize them using the Z-score (subtract the mean and divide by the standard deviation). We further transform the scores so that they fall in the [0,1] range.
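The normalization can be sketched as below. The final mapping to [0,1] is not fully specified in the text, so this sketch assumes a min-max rescaling of the standardized scores:

```python
import numpy as np

def normalize_scores(scores):
    """Z-score normalize, then rescale to [0,1] (min-max over the
    standardized scores; the exact [0,1] mapping is an assumption)."""
    z = (scores - scores.mean()) / (scores.std() + 1e-12)
    return (z - z.min()) / (z.max() - z.min() + 1e-12)

np_scores = np.array([0.9, 0.7, 0.4])      # e.g. noun-phrase confidences
vp_scores = np.array([120.0, 80.0, 10.0])  # e.g. verb-phrase scores on another scale
print(normalize_scores(np_scores))  # both now comparable in [0,1]
print(normalize_scores(vp_scores))
```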
6.1.2 Binary Potentials
6.1.3 Inference by Viterbi Decoding
6.2 Complex Query Image Search
Image retrieval is beginning to work well. Commercial engines like Google and Bing now produce quite reasonable results for simple image search queries, like “dog” or “red car”. Where image search still has much room for improvement is complex queries involving appearance attributes, actions, multiple objects with spatial relationships, or interactions. This is especially true for more unusual situations that cannot be mined directly from the meta-data and text surrounding an image, e.g., “little boy eating his brussels sprouts”.
We demonstrate a prototype application, showing that our approach for finding descriptive phrases for an image can be used to form features that are useful for complex query image retrieval. We use 1000 test images (described in Sect. 7) as a dataset. For each image, we pick the top selected phrases from the Visual \(+\) Text PageRank algorithm to use as a complex text descriptor for that image; note that the actual human-written caption for the image is not seen by the system. For evaluation we then use the original human caption for an image as a complex query string. We compare it to each of the automatically derived phrases for images in the dataset and score the matches using normalized correlation. For each matching image we average the scores over its retrieved phrases. We then sort the scores and record the rank of the correct image—the one for which the query caption was written. If the retrieved phrases match the actual human caption well, then we expect the query image to be returned first in the retrieved images; otherwise, it will be returned later in the ranking. Note that this is only a demo application performed on a very small dataset of images. A real image retrieval application would have access to billions of images.
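The evaluation loop above can be sketched with a binary bag-of-words representation and cosine (normalized correlation) scoring. The vocabulary and phrases below are toy stand-ins:

```python
import numpy as np

def bow(text, vocab):
    """Binary bag-of-words vector over a fixed vocabulary."""
    words = set(text.lower().split())
    return np.array([1.0 if w in words else 0.0 for w in vocab])

def query_rank(query_caption, image_phrases, vocab):
    """Score each image by the average normalized correlation between the
    query caption and that image's retrieved phrases; return images
    sorted best-first."""
    q = bow(query_caption, vocab)
    q = q / (np.linalg.norm(q) + 1e-12)
    scores = []
    for phrases in image_phrases:
        sims = [float(q @ (v := bow(p, vocab)) / (np.linalg.norm(v) + 1e-12))
                for p in phrases]
        scores.append(np.mean(sims))
    return np.argsort(-np.array(scores))

vocab = ["dog", "beach", "running", "cat", "sofa", "sleeping", "boy", "food"]
image_phrases = [["the dog", "running on the beach"],   # image 0
                 ["a cat", "sleeping on the sofa"],     # image 1
                 ["little boy", "eating his food"]]     # image 2
ranking = query_rank("a dog running on the beach", image_phrases, vocab)
print(ranking)  # image 0 ranks first
```

The rank of the correct image under this scoring is then the quantity reported in the evaluation: rank 1 means the query caption retrieved its own image first.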
We perform experimental evaluations on each aspect of the proposed approaches: global description generation (Sect. 7.1), phrase retrieval and reranking (Sect. 7.2), phrase based description generation (Sect. 7.3), and phrase based complex query image search (Sect. 7.4).
To evaluate global generation, we randomly sample 500 images from our collection. As is usually the case with web photos, this set displays a wide range of difficulty for visual recognition and captioning algorithms, from images that depict scenes (e.g., beaches), to images with relatively simple depictions (e.g., a horse in a field), to images with much more complex depictions (e.g., a boy handing out food to a group of people). For all phrase based evaluations (except where explicitly noted) we use a test set of 1000 query images, selected to have high detector confidence scores. Random test images could also be sampled, but for images with poor detector performance we expect the results to be much the same as for our baseline global generation methods. We therefore focus on evaluating performance for images where detection is more likely to have produced reasonable estimates of local image content.
7.1 Global Generation Evaluation
Global matching performance with respect to data set size (BLEU@1)

  Global description generation (1k)         0.0774 \(\pm \) 0.0059
  Global description generation (10k)        0.0909 \(\pm \) 0.0070
  Global description generation (100k)       0.0917 \(\pm \) 0.0101
  Global description generation (1 million)  0.1177 \(\pm \) 0.0099
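For reference, BLEU@1 for a single candidate/reference pair reduces to clipped unigram precision with a brevity penalty. A minimal sketch follows; the paper's exact implementation (e.g., handling of multiple references or smoothing) may differ:

```python
from collections import Counter
import math

def bleu1(candidate, reference):
    """BLEU@1 for one candidate against one reference caption:
    clipped unigram precision times a brevity penalty."""
    cand, ref = candidate.split(), reference.split()
    cc, rc = Counter(cand), Counter(ref)
    # each candidate unigram is credited at most as often as it appears in the reference
    clipped = sum(min(n, rc[w]) for w, n in cc.items())
    precision = clipped / len(cand)
    # penalize candidates shorter than the reference
    bp = 1.0 if len(cand) > len(ref) else math.exp(1 - len(ref) / len(cand))
    return bp * precision
```

The clipping step prevents a caption from inflating its score by repeating a correct word, and the brevity penalty discourages trivially short outputs.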
7.2 Phrase Retrieval and Ranking Evaluation
Average BLEU@1 score for the top K retrieved phrases against Flickr captions (each cell lists scores for K = 1, 5, 10)

  Method                   Noun phrases       Verb phrases       Prep. phrases (stuff)  Prep. phrases (scenes)
  -                        0.24, 0.24, 0.23   0.15, 0.14, 0.14   0.30, 0.29, 0.27       0.28, 0.26, 0.25
  -                        0.23, 0.23, 0.23   0.13, 0.14, 0.14   0.28, 0.28, 0.27       0.26, 0.25, 0.25
  -                        0.30, 0.29, 0.28   0.20, 0.19, 0.17   0.38, 0.37, 0.36       0.34, 0.30, 0.27
  Visual + Text PageRank   0.28, 0.27, 0.26   0.17, 0.17, 0.16   0.32, 0.30, 0.28       0.27, 0.28, 0.27
  -                        0.29, 0.28, 0.27   0.19, 0.19, 0.18   0.38, 0.37, 0.36       0.40, 0.36, 0.32
Average BLEU@1 score evaluation at K = 10 against MTurk-written descriptions, covering the same phrase types (including prepositional phrases for stuff and for scenes) and reranking methods (including Visual + Text PageRank)
7.3 Application 1: Description Generation Evaluation
BLEU and ROUGE score evaluation of full image captions generated using HMM decoding with our strategies for phrase retrieval and reranking
Human forced-choice evaluation between various methods (preference for the first method / the second)

  Text PageRank versus no reranking                         \(54\,\% / 46\,\%\)
  Visual + Text PageRank versus no reranking                \(57\,\% / 43\,\%\)
  Visual + TFIDF reranking versus no reranking              \(61\,\% / 39\,\%\)
  Text + Visual PageRank versus Visual + TFIDF reranking    \(49\,\% / 51\,\%\)
  Text + Visual PageRank versus global description generation  \(71\,\% / 29\,\%\)
7.4 Application 2: Complex Query Image Retrieval Evaluation
We have described explorations into retrieval based methods for gathering visually relevant natural language for images. Our methods rely on collecting and filtering a large data set of images from the internet to produce a web-scale captioned photo collection. We present two variations on text retrieval from our captioned collection. The first retrieves whole existing image descriptions and the second retrieves bits of text (phrases) based on visual and geometric similarity of objects, stuff, and scenes. We have also evaluated several methods for collective reranking of sets of phrases and demonstrated the results in two applications, phrase based generation of image descriptions and complex query image retrieval. Finally, we have presented a thorough evaluation of each of our presented methods through both automatic and human-judgment based measures.
In future work we hope to extend these methods to a real time system for image description and incorporate state of the art methods for large-scale category recognition (Deng et al. 2010, 2012). We also plan to extend our prototype complex query retrieval algorithm to web-scale. Producing human-like and relevant descriptions will be a key factor for enabling accurate and satisfying image retrieval results.
This work was conducted with the support of the 2011 JHU-CLSP Summer Workshop Program. Tamara L. Berg and Kota Yamaguchi were supported in part by NSF CAREER IIS-1054133; Hal Daumé III and Amit Goyal were partially supported by NSF Award IIS-1139909.