Visual Genome: Connecting Language and Vision Using Crowdsourced Dense Image Annotations

Despite progress in perceptual tasks such as image classification, computers still perform poorly on cognitive tasks such as image description and question answering. Cognition is core to tasks that involve not just recognizing, but reasoning about our visual world. However, models used to tackle the rich content in images for cognitive tasks are still being trained using the same datasets designed for perceptual tasks. To achieve success at cognitive tasks, models need to understand the interactions and relationships between objects in an image. When asked"What vehicle is the person riding?", computers will need to identify the objects in an image as well as the relationships riding(man, carriage) and pulling(horse, carriage) in order to answer correctly that"the person is riding a horse-drawn carriage". In this paper, we present the Visual Genome dataset to enable the modeling of such relationships. We collect dense annotations of objects, attributes, and relationships within each image to learn these models. Specifically, our dataset contains over 100K images where each image has an average of 21 objects, 18 attributes, and 18 pairwise relationships between objects. We canonicalize the objects, attributes, relationships, and noun phrases in region descriptions and questions answer pairs to WordNet synsets. Together, these annotations represent the densest and largest dataset of image descriptions, objects, attributes, relationships, and question answers.


Introduction
A holy grail of computer vision is the complete understanding of visual scenes: a model that is able to name and detect objects, describe their attributes, and recognize their relationships and interactions. Understanding scenes would enable important applications such as image search, question answering, and robotic interactions. Much progress has been made in recent years towards this goal, including image classification (Deng et al., 2009, Perronnin et al., 2010, Simonyan and Zisserman, 2014, Krizhevsky et al., 2012: An overview of the data needed to move from perceptual awareness to cognitive understanding of images. We present a dataset of images densely annotated with numerous region descriptions, objects, attributes, and relationships. Region descriptions (e.g. "girl feeding large elephant" and "a man taking a picture behind girl") are shown (top). The objects (e.g. elephant), attributes (e.g. large) and relationships (e.g. feeding) are shown (bottom). Our dataset also contains image related question answer pairs (not shown). et al., 2014) and object detection (Everingham et al., 2010, Girshick et al., 2014, Sermanet et al., 2013, Girshick, 2015, Ren et al., 2015b). An important contributing factor is the availability of a large amount of data that drives the statistical models that underpin today's advances in computational visual understanding. While the progress is exciting, we are still far from reaching the goal of comprehensive scene understanding. As Figure 1 shows, existing models would be able to detect discreet objects in a photo but would not be able to explain their interactions or the relationships between them. Such explanations tend to be cognitive in nature, integrating perceptual information into conclusions about the relationships between objects in a scene (Bruner, 1990, Firestone andScholl, 2015). A cognitive understanding of our visual world thus requires that we complement computers' ability to detect objects with abilities to describe those objects (Isola et al., 2015) and understand their interactions within a scene (Sadeghi and Farhadi, 2011).
There is an increasing effort to put together the next generation of datasets to serve as training and benchmarking datasets for these deeper, cognitive scene understanding and reasoning tasks, the most notable being MS-COCO (Lin et al., 2014) and VQA (Antol et al., 2015). The MS-COCO dataset consists of 300K real-world photos collected from Flickr. For each image, there is pixel-level segmentation of 91 object classes (when present) and 5 independent, user-generated sentences describing the scene. VQA adds to this a set of 614K question-answer pairs related to the visual contents of each image (see more details in Section 3.1). With this information, MS-COCO and VQA provide a fertile training and testing ground for models aimed at tasks for accurate object detection, segmentation, and summary-level image captioning (Kiros et al., 2014, Mao et al., 2014, Karpathy and Fei-Fei, 2014, Vinyals et al., 2014 as well as basic QA (Ren et al., 2015a, Antol et al., 2015, Malinowski et al., 2015, Gao et al., 2015, Malinowski and Fritz, 2014. For example, a stateof-the-art model (Karpathy and Fei-Fei, 2014) provides a description of one MS-COCO image in Figure 1 as "two men are standing next to an elephant." But what is missing is the further understanding of where each object is, what each person is doing, what the relationship between the person and elephant is, etc. Without such relationships, these models fail to differentiate this image from other images of people next to elephants.
To understand images thoroughly, we believe three key elements need to be added to existing datasets: a grounding of visual concepts to language (Kiros et al., 2014), a more complete set of descriptions and QAs for each image based on multiple image regions (Johnson et al., 2015), and a formalized representation of the components of an image (Hayes, 1978). In the spirit of mapping out this complete information of the visual world, we introduce the Visual Genome dataset. The first release of the Visual Genome dataset uses 108, 249 images from the intersection of the YFCC100M (Thomee et al., 2016) and MS-COCO (Lin et al., 2014). Section 5 provides a more detailed description of the dataset. We highlight below the motivation and contributions of the three key elements that set Visual Genome apart from existing datasets.
The Visual Genome dataset regards relationships and attributes as first-class citizens of the annotation space, in addition to the traditional focus on objects. Recognition of relationships and attributes is an important part of the complete understanding of the visual scene, and in many cases, these elements are key to the story of a scene (e.g., the difference between "a dog chasing a man" versus "a man chasing a dog"). The Visual Genome dataset is among the first to provide a detailed labeling of object interactions and attributes, grounding visual concepts to language 1 .
An image is often a rich scenery that cannot be fully described in one summarizing sentence. The scene in Figure 1 contains multiple "stories": "a man taking a photo of elephants," "a woman feeding an elephant," "a river in the background of lush grounds," etc. Existing datasets such as Flickr 30K (Young et al., 2014) and MS-COCO (Lin et al., 2014) focus on high-level descriptions of an image 2 . Instead, for each image in the Visual Genome dataset, we collect more than 42 descriptions for different regions in the image, providing a much denser and complete set of descriptions of the scene. In addition, inspired by VQA (Antol et al.,1 The Lotus Hill Dataset (Yao et al., 2007) also provides a similar annotation of object relationships, see Sec 3.1.

2015)
, we also collect an average of 17 question-answer pairs based on the descriptions for each image. Regionbased question answers can be used to jointly develop NLP and vision models that can answer questions from either the description or the image, or both of them.
With a set of dense descriptions of an image and the explicit correspondences between visual pixels (i.e. bounding boxes of objects) and textual descriptors (i.e. relationships, attributes), the Visual Genome dataset is poised to be the first image dataset that is capable of providing a structured formalized representation of an image, in the form that is widely used in knowledge base representations in NLP (Zhou et al., 2007, GuoDong et al., 2005, Culotta and Sorensen, 2004, Socher et al., 2012. For example, in Figure 1, we can formally express the relationship holding between the woman and food as holding(woman, food)). Putting together all the objects and relations in a scene, we can represent each image as a scene graph (Johnson et al., 2015). The scene graph representation has been shown to improve semantic image retrieval (Johnson et al., 2015, Schuster et al., 2015 and image captioning (Farhadi et al., 2009, Chang et al., 2014, Gupta and Davis, 2008. Furthermore, all objects, attributes and relationships in each image in the Visual Genome dataset are canonicalized to its corresponding Word-Net (Miller, 1995) ID (called a synset ID). This mapping connects all images in Visual Genome and provides an effective way to consistently query the same concept (object, attribute, or relationship) in the dataset. It can also potentially help train models that can learn from contextual information from multiple images.
In this paper, we introduce the Visual Genome dataset with the aim of training and benchmarking the next generation of computer models for comprehensive scene understanding. The paper proceeds as follows: In Section 2, we provide a detailed description of each component of the dataset. Section 3 provides a literature review of related datasets as well as related recognition tasks. Section 4 discusses the crowdsourcing strategies we deployed in the ongoing effort of collecting this dataset. Section 5 is a collection of data analysis statistics, showcasing the key properties of the Visual Genome dataset. Last but not least, Section 6 provides a set of experimental results that use Visual Genome as a benchmark.
Further visualizations, API, and additional information on the Visual Genome dataset can be found online 3 .

The man is almost bald
Park bench is made of gray weathered wood A man and a woman sit on a park bench along a river. Fig. 2: An example image from the Visual Genome dataset. We show 3 region descriptions and their corresponding region graphs. We also show the connected scene graph collected by combining all of the image's region graphs. The top region description is "a man and a woman sit on a park bench along a river." It contains the objects: man, woman, bench and river. The relationships that connect these objects are: sits on(man, bench), in front of (man, river ), and sits on(woman, bench). : An example image from our dataset along with its scene graph representation. The scene graph contains objects (child, instructor, helmet, etc.) that are localized in the image as bounding boxes (not shown). These objects also have attributes: large, green, behind, etc. Finally, objects are connected to each other through relationships: wears(child, helmet), wears(instructor, jacket), etc. The Visual Genome dataset consists of seven main components: region descriptions, objects, attributes, relationships, region graphs, scene graphs, and questionanswer pairs. Figure 4 shows examples of each component for one image. To enable research on comprehensive understanding of images, we begin by collecting descriptions and question answers. These are raw texts without any restrictions on length or vocabulary. Next, we extract objects, attributes and relationships from our descriptions. Together, objects, attributes and relationships fabricate our scene graphs that represent a formal representation of an image. In this section, we break down Figure 4 and explain each of the seven components. In Section 4, we will describe in more detail how data from each component is collected through a crowdsourcing platform.

Multiple regions and their descriptions
In a real-world image, one simple summary sentence is often insufficient to describe all the contents of and interactions in an image. Instead, one natural way to extend this might be a collection of descriptions based on different regions of a scene. In Visual Genome, we collect human-generated image region descriptions, with each region localized by a bounding box. In Figure 5, we show three examples of region descriptions. Regions are allowed to have a high degree of overlap with each other when the descriptions differ. For example, "yellow fire hydrant" and "woman in shorts is standing behind the man" have very little overlap, while "man jumping over fire hydrant" has a very high overlap with the other two regions. Our dataset contains on average a total of 42 region descriptions per image. Each description is a phrase ranging from 1 to 16 words in length describing that region.

Multiple objects and their bounding boxes
Each image in our dataset consists of an avarege of 21 objects, each delineated by a tight bounding box (Figure 6). Furthermore, each object is canonicalized to a synset ID in WordNet (Miller, 1995). For example, man and person would get mapped to man.n.03 (the generic use of the word to refer to any human being). Similarly, person gets mapped to person.n.01 (a human being). Afterwards, these two concepts can be joined to person.n.01 since this is a hypernym of man.n.03. This is an important standardization step to avoid multiple in an image, the Visual Genome dataset includes multiple human-generated image regions descriptions, with each region localized by a bounding box. Here, we show three regions descriptions: "man jumping over a fire hydrant," "yellow fire hydrant," and "woman in shorts is standing beghind the man." names for one object (e.g. man, person, human), and to connect information across images.

A set of attributes
Each image in Visual Genome has an average of 16 attributes. Objects can have zero or more attributes associated with them. Attributes can be color (yellow), states (standing), etc. (Figure 7). Just like we extract objects from region descriptions, we also extract the attributes attached to these objects. In Figure 7, from the phrase "yellow fire hydrant," we extract the attribute yellow for the fire hydrant.
As with objects, we canonicalize all attributes to Word-Net (Miller, 1995); for example, yellow is mapped to yellow.s.01 (of the color intermediate between green and orange in the color spectrum; of something resembling the color of an egg yolk).

A set of relationships
Relationships connect two objects together. These relationships can be actions (jumping over), spatial (is behind), verbs (wear), prepositions (with), Fig. 6: From all of the region descriptions, we extract all objects mentioned. For example, from the region description "man jumping over a fire hydrant," we extract man and fire hydrant. : Some descriptions also provide attributes for objects. For example, the region description "yellow fire hydrant" adds that the fire hydrant is yellow.
Here we show two attributes: yellow and standing.
comparative (taller than), or prepositional phrases (drive on). For example, from the region description "man jumping over fire hydrant," we extract the relationship jumping over between the objects man and fire hydrant (Figure 8). These relationships are directed from one object, called the subject, to another, called the object. In this case, the subject is the man, who is performing the relationship jumping over on the object fire hydrant. Each relationship is canonicalized to a WordNet (Miller, 1995) synset ID; i.e. jumping is canonicalized to jump.a.1 (move forward by leaps and bounds). On average, each image in our dataset contains 18 relationships.

A set of region graphs
Combining the objects, attributes, and relationships extracted from region descriptions, we create a directed graph representation for each of the 42 regions. Examples of region graphs are shown in Figure 4. Each region graph is a structured representation of a part of the image. The nodes in the graph represent objects, attributes, and relationships. Objects are linked to their respective attributes while relationships link one object to another. The links connecting two objects in Figure 4 point from the subject to the relationship and from the relationship to the other object.

One scene graph
While region graphs are localized representations of an image, we also combine them into a single scene graph representing the entire image ( Figure 3). The scene graph is the union of all region graphs and contains all objects, attributes, and relationships from each region description. By doing so, we are able to combine multiple levels of scene information in a more coherent way. For example in Figure 4, the leftmost region description tells us that the "fire hydrant is yellow," while the middle region description tells us that the "man is jumping over the fire hydrant." Together, the two descriptions tell us that the "man is jumping over a yellow fire hydrant."

A set of question answer pairs
We have two types of QA pairs associated with each image in our dataset: freeform QAs, based on the entire image, and region-based QAs, based on selected regions of the image. We collect 6 different types of questions per image: what, where, how, when, who, and why. In Figure 4, "Q. What is the woman standing next to?; A. Her belongings" is a freeform QA. Each image has at least one question of each type listed above. Regionbased QAs are collected by prompting workers with region descriptions. For example, we use the region "yellow fire hydrant" to collect the region-based QA: "Q. What color is the fire hydrant?; A. Yellow." Region based QAs allow us to independently study methods that use NLP and vision priors to answer questions.

Related Work
We discuss existing datasets that have been released and used by the vision community for classification and object detection. We also mention work that has improved object and attribute detection models. Then, we explore existing work that has utilized representations similar to our relationships between objects. In addition, we dive into literature related to cognitive tasks like image description, question answering, and knowledge representation.

Datasets
Datasets (Table 1) et al., 2008) introduced a dataset with multiple objects per category. They also provided a web interface that experts and novices could use to annotate additional images. This web interface enabled images to be labeled with polygons, helping create datasets for image segmentation. The Lotus Hill dataset (Yao et al., 2007) contains a hierarchical decomposition of objects (vehicles, man-made objects, animals, etc.) along with segmentations. Only a small part of this dataset is freely available. SUN (Xiao et al., 2010), just like LabelMe (Russell et al., 2008) and Lotus Hill (Yao et al., 2007), was curated for object detection. Pushing the size of datasets even further, 80 Million Tiny Images ) created a significantly larger dataset than its predecessors. It contains tiny (i.e. 32 × 32 pixels) images that were collected using Word-Net (Miller, 1995) synsets as queries. However, because the data in 80 Million Images were not human-verified, they contain numerous errors. YFCC100M (Thomee et al., 2016) is another large database of 100 million images that is still largely unexplored. It contains human generated and machine generated tags.
Pascal VOC (Everingham et al., 2010) pushed research from classification to object detection with a dataset containing 20 semantic categories in 11, 000 images. Imagenet (Deng et al., 2009) took WordNet synsets and crowdsourced a large dataset of 14 million images. They started the ILSVRC (Russakovsky et al., 2015) challenge for a variety of computer vision tasks. ILSVRC and PASCAL provide a test bench for object detection, image classification, object segmentation, person layout, and action classification. MS-COCO (Lin et al., 2014) recently released its dataset, with over 328, 000 images with sentence descriptions and segmentations of 91 object categories. The current largest dataset for QA, VQA (Antol et al., 2015), contains 204, 721 images annotated with one or more question answers. They collected a dataset of 614, 163 freeform questions with 6.1M ground truth answers and provided a baseline approach in answering questions using an image and a textual question as the input.
Visual Genome aims to bridge the gap between all these datasets, collecting not just annotations for a large number of objects but also scene graphs, region descriptions, and question answer pairs for image regions. Unlike previous datasets, which were collected for a single task like image classification, the Visual Genome dataset was collected to be a general-purpose representation of the visual world, without bias toward a particular task. Our images contain an average of 21 objects, which is almost an order of magnitude more dense than any existing vision dataset. Similarly, we contain an average of 18 attributes and 18 relationships  Table 1: A comparison of existing datasets with Visual Genome. We show that Visual Genome has an order of magnitude more descriptions and question answers. It also has a more diverse set of object, attribute, and relationship classes. Additionally, Visual Genome contains a higher density of these annotations per image.
per image. We also have an order of magnitude more unique objects, attributes, and relationships than any other dataset. Finally, we have 1.7 million question answer pairs, also larger than any other dataset for visual question answering.

Image Descriptions
One of the core contributions of Visual Genome is its descriptions for multiple regions in an image. As such, we mention other image description datasets and models in this subsection. Most work related to describing images can be divided into two categories: retrieval of human-generated captions and generation of novel captions. Methods in the first category use similarity metrics between image features from predefined models to retrieve similar sentences (Ordonez et al., 2011, Hodosh et al., 2013. Other methods map both sentences and their images to a common vector space (Ordonez et al., 2011) or map them to a space of triples (Farhadi et al., 2010). Among those in the second category, a common theme has been to use recurrent neural networks to produce novel captions (Kiros et al., 2014, Mao et al., 2014, Karpathy and Fei-Fei, 2014, Vinyals et al., 2014. More recently, researchers have also used a visual attention model . One drawback of these approaches is their attention to describing only the most salient aspect of the image. This problem is amplified by datasets like Flickr 30K (Young et al., 2014) and MS-COCO (Lin et al., 2014), whose sentence desriptions tend to focus, somewhat redundantly, on these salient parts. For example, "an elephant is seen wandering around on a sunny day," "a large elephant in a tall grass field," and "a very large elephant standing alone in some brush" are 3 descriptions from the MS-COCO dataset, and all of them focus on the salient elephant in the image and ignore the other regions in the image. Many real-world scenes are complex, with multiple objects and interactions that are best described using multiple descriptions (Karpathy andFei-Fei, 2014, Lebret et al., 2015). Our dataset pushes toward a complete understanding of an image by collecting a dataset in which we capture not just scene-level descriptions but also myriad of low-level descriptions, the "grammar" of the scene.

Objects
Object detection is a fundamental task in computer vision, with applications ranging from identification of faces in photo software to identification of other cars by self-driving cars on the road. It involves classifying an object into a semantic category and localizing the object in the image. Visual Genome uses objects as a core component on which each visual scene is built. Early datasets include the face detectio (Huang et al., 2008) and pedestrian datasets (Dollar et al., 2012). The PASCAL VOC and ILSVRC's detection dataset (Deng et al., 2009) pushed research in object detection. But the images in these datasets are iconic and do not capture the settings in which these objects usually co-occur. To remedy this problem, MS-COCO (Lin et al., 2014) annotated real-world scenes that capture object contexts. However, MS-COCO was unable to describe all the objects in its images, since they annotated only 91 object categories. In the real world, there are many more objects that the ones captured by existing datasets. Visual Genome aims at collecting annotations for all visual elements that occur in images, increasing the number of semantic categories to over 17,000.

Attributes
The inclusion of attributes allows us to describe, compare, and more easily categorize objects. Even if we haven't seen an object before, attributes allow us to infer something about it; for example, "yellow and brown spotted with long neck" likely refers to a giraffe. Initial work in this area involved finding objects with similar features (Malisiewicz et al., 2008) using examplar SVMs. Next, textures were used to study objects (Varma and Zisserman, 2005), while other methods learned to predict colors (Ferrari and Zisserman, 2007). Finally, the study of attributes was explicitly demonstrated to lead to improvements in object classification (Farhadi et al., 2009). Attributes were defined to be paths ("has legs"), shapes ("spherical"), or materials ("furry") and could be used to classify new categories of objects. Attributes have also played a large role in improving fine-grained recognition (Goering et al., 2014) on fine-grained attribute datasets like CUB-2011 (Wah et al., 2011). In Visual Genome, we use a generalized formulation (Johnson et al., 2015), but we extend it such that attributes are not image-specific binaries but rather object-specific for each object in a real-world scene. We also extend the types of attributes to include size ("small"), pose ("bent"), state ("transparent"), emotion ("happy"), and many more.

Relationships
Relationship extraction has been a traditional problem in information extraction and in natural language processing. Syntactic features (Zhou et al., 2007, GuoDong et al., 2005, dependency tree methods (Culotta andSorensen, 2004, Bunescu andMooney, 2005), and deep neural networks (Socher et al., 2012, Zeng et al., 2014 have been employed to extract relationships between two entities in a sentence. However, in computer vision, very little work has gone into learning or predicting relationships. Instead, relationships have been implicitly used to improve other vision tasks. Relative layouts between objects have improved scene categorization (Izadinia et al., 2014), and 3D spatial geometry between objects has helped object detection (Choi et al., 2013). Comparative adjectives and prepositions between pairs of objects have been used to model visual relationships and improved object localization (Gupta and Davis, 2008).
Relationships have already shown their utility in improving cognitive tasks. A meaning space of relationships has improved the mapping of images to sentences (Farhadi et al., 2010). Relationships in a structured representation with objects have been defined as a graph structure called a scene graph, where the nodes are objects with attributes and edges are relationships between objects. This representation can be used to generate indoor images from sentences and also to improve image search (Chang et al., 2014, Johnson et al., 2015. We use a similar scene graph representation of an image that generalizes across all these previous works (Johnson et al., 2015). Recently, relationships have come into focus again in the form of question answering about associations between objects (Sadeghi et al., 2015). These questions ask if a relationship, involving generally two objects, is true, e.g. "do dogs eat ice cream?". We believe that relationships will be necessary for higher-level cognitive tasks (Johnson et al., 2015, Lu et al., 2016, so we collect the largest corpus of them in an attempt to improve tasks by actually understanding relationships between objects.

Question Answering
Visual question answering (QA) has been recently proposed as a proxy task of evaluating a computer vision system's ability to understand an image beyond object recognition (Geman et al., 2015, Malinowski andFritz, 2014). Several visual QA benchmarks have been proposed in the last few months. The DAQUAR (Malinowski and Fritz, 2014) dataset was the first toy-sized QA benchmark built upon indoor scene RGB-D images of NYU Depth v2 (Nathan Silberman and Fergus, 2012). Most new datasets (Yu et al., 2015, Ren et al., 2015a, Antol et al., 2015, Gao et al., 2015 have collected QA pairs on MS-COCO images, either generated automatically by NLP tools (Ren et al., 2015a) or written by human workers (Yu et al., 2015, Antol et al., 2015, Gao et al., 2015.
In previous datasets, most questions concentrated on simple recognition-based questions about the salient objects, and answers were often extremely short. For instance, 90% of DAQUAR answers (Malinowski and Fritz, 2014) and 87% of VQA answers (Antol et al., 2015) consist of single-word object names, attributes, and quantities. This shortness limits their diversity and fails to capture the long-tail details of the images. Given the availability of new datasets, an array of visual QA models have been proposed to tackle QA tasks. The proposed models range from SVM classifiers (Antol et al., 2015) and probabilistic inference (Malinowski and Fritz, 2014) to recurrent neural networks (Gao et al., 2015, Malinowski et al., 2015, Ren et al., 2015a and convolutional networks . Visual Genome aims to capture the details of the images with diverse question types and long answers. These questions should cover a wide range of visual tasks from basic perception to complex reasoning. Our QA dataset of 1.7 million QAs is also larger than any currently existing dataset.

Knowledge Representation
A knowledge representation of the visual world is capable of tackling an array of vision tasks, from action recognition to general question answering. However, it is difficult to answer "what is the minimal viable set of knowledge needed to understand about the physical world?" (Hayes, 1978). It was later proposed that there be a certain plurality to concepts and their related axioms (Hayes, 1985). These efforts have grown to model physical processes (Forbus, 1984) or to model a series of actions as scripts (Schank and Abelson, 2013) for stories-both of which are not depicted in a single static image but which play roles in an image's story. More recently, NELL (Betteridge et al., 2009) learns probabilistic horn clauses by extracting information from the web. DeepQA (Ferrucci et al., 2010) proposes a probabilistic question answering architecture involving over 100 different techniques. Others have used Markov logic networks (Zhu et al., 2009, Niu et al., 2012 as their representation to perform statistical inference for knowledge base construction. Our work is most similar to that of those , Zhu et al., 2015, Sadeghi et al., 2015 who attempt to learn common-sense relationships from images. Visual Genome scene graphs can also be considered a dense knowledge representation for images. It is similar to the format used in knowledge bases in NLP.

Crowdsourcing Strategies
Visual Genome was collected and verified entirely by crowd workers from Amazon Mechanical Turk. In this section, we outline the pipeline employed in creating all the components of the dataset. Each component (region descriptions, objects, attributes, relationships, region graphs, scene graphs, questions and answers) involved multiple task stages. We mention the different strategies used to make our data accurate and to enforce diversity in each component. We also provide background information about the workers who helped make Visual Genome possible.

Crowd Workers
We used Amazon Mechanical Turk (AMT) as our primary source of annotations. Overall, a total of over 33, 000 unique workers contributed to the dataset. The dataset was collected over the course of 6 months after 15 months of experimentation and iteration on the data representation. Approximately 800, 000 Human Intelligence Tasks (HITs) were launched on AMT, where each HIT involved creating descriptions, questions and answers, or region graphs. Each HIT was designed such that workers manage to earn anywhere between $6-$8 per hour if they work continuously, in line with ethical research standards on Mechanical Turk (Salehi et al., 2015). Visual Genome HITs achieved a 94.1% retention rate, meaning that 94.1% of workers who completed one of our tasks went ahead to do more.  Table 2: Geographic distribution of countries from where crowd workers contributed to Visual Genome.
Figures 9 (a) and (b) outline the demographic distribution of our crowd workers. The majority of our workers were between the ages of 25 and 34 years old. Our youngest contributor was 18 years old and the oldest was 68 years old. We also had a near-balanced split of 54.15% male and 45.85% female workers.

Region Descriptions
Visual Genome's main goal is to enable the study of cognitive computer vision tasks. The next step towards understanding images requires studying relationships between objects in scene graph representations of images. However, we observed that collecting scene graphs directly from an image leads to workers annotating easy, frequently-occurring relationships like wearing(man, shirt) instead of focusing on salient parts of the image. This is evident from previous datasets (Johnson et al., 2015, Lu et al., 2016 that contain a large number of such relationships. After experimentation, we observed that when asked to describe an image using natural language, crowd workers naturally start with the most salient part of the image and then move to describing other parts of the image one by one. Inspired by this finding, we focused our attention towards collecting a dataset of region descriptions that is diverse in content. When a new image is added to the crowdsourcing pipeline with no annotations, it is sent to a worker who is asked to draw three bounding boxes and write three descriptions for the region enclosed by each box. Next, the image is sent to another worker along with the previously written descriptions. Workers are explicitly encouraged to write descriptions that have not been written before. This process is repeated until we have collect 50 region descriptions for each image. To prevent workers from having to skim through a long list of previously written descriptions, we only show them the top seven most similar descriptions. We calculate these most similar descriptions using BLEU (Papineni et al., 2002) (n-gram) scores between pairs of sentences. We define the BLEU score between a description d i and a previous description d j to be: where we enforce a brevity penalty using: and p n calculates the percentage of n-grams in d i that match n-grams in d j .
When a worker writes a new description, we programmatically enforce that it has not been repeated by using BLEU score thresholds set to 0.7 to ensure that it is dissimilar to descriptions from both of the following two lists: 1. Image-specific descriptions. A list of all previously written descriptions for that image. 2. Global image descriptions. A list of the top 100 most common written descriptions of all images in the dataset. This prevents very common phrases like "sky is blue" from dominating the set of region descriptions.
Finally, we ask workers to draw bounding boxes that satisfy one requirement: coverage. The bounding box must cover all objects mentioned in the description. Figure 10 shows an example of a good box that covers both the street as well the car mentioned in the description, as well as an example of a bad box.

Objects
Once 50 region descriptions are collected for an image, we extract the visual objects from each description. Each description is sent to one crowd worker, who extracts all the objects from the description and grounds each object as a bounding box in the image. For example, from Figure 4, let's consider the description "woman in shorts is standing behind the man." A worker would extract three objects: woman, shorts, and man. They would then draw a box around each of . the objects. We require each bounding box to be drawn to satisfy two requirements: coverage and quality. Coverage has the same definition as described above in Section 4.2, where we ask workers to make sure that the bounding box covers the object completely (Figure 11). Quality requires that each bounding box be as tight as possible around its object such that if the box's length or height were decreased by one pixel, it would no longer satisfy the coverage requirement. Since a one pixel error can be physically impossible for most workers, we relax the definition of quality to four pixels.
Multiple descriptions for an image might refer to the same object, sometimes with different words. For example, a man in one description might be referred to as person in another description. We can thus use this crowdsourcing stage to build these co-reference chains. With each region description given to a worker to process, we include a list of previously extracted objects as suggestions. This allows a worker to choose a previously drawn box annotated as man instead of redrawing a new box for person.
Finally, to increase the speed with which workers complete this task, we also use Stanford's dependency parser  to extract nouns automatically and send them to the workers as suggestions.
While the parser manages to find most of the nouns, it sometimes misses compound nouns, so we avoided completely depending on this automated method. By combining the parser with crowdsourcing tasks, we were able to speed up our object extraction process without losing accuracy.

Attributes, Relationships, and Region Graphs
Once all objects have been extracted from each region description, we can extract the attributes and relationships described in the region. We present each worker with a region description along with its extracted objects and ask them to add attributes to objects or to connect pairs of objects with relationships, based on the text of the description. From the description "woman in shorts is standing behind the man", workers will extract the attribute standing for the woman and the relationships in(woman, shorts) and behind (woman, man). Together, objects, attributes, and relationships form the region graph for a region description. Some descriptions like "it is a sunny day" do not contain any objects and therefore have no region graphs associated with them. Workers are asked to not generate any graphs for such descriptions. We create scene graphs by combining all the region graphs for an image by combining all the co-referenced objects from different region graphs.

Scene Graphs
The scene graph is the union of all region graphs extracted from region descriptions. We merge nodes from region graphs that correspond to the same object; for example, man and person in two different region graphs might refer to the same object in the image. We say that objects from different graphs refer to the same object if their bounding boxes have an overlap over union of 0.8. However, this heuristic might contain false positives. So, before merging two objects, we ask workers to confirm that a pair of objects with significant overlap are indeed the same object. For example, in Figure 12 (right), the fox might be extracted from two different region descriptions. These boxes are then combined together (Figure 12 (left)) when constructing the scene graph. Two region graphs are combined together by merging objects that are co-referenced by both the graphs.

Questions and Answers
To create question answer (QA) pairs, we ask the AMT workers to write pairs of questions and answers about . an image. To ensure quality, we instruct the workers to follow three rules: 1) start the questions with one of the "seven Ws" (who, what, where, when, why, how and which); 2) avoid ambiguous and speculative questions; 3) be precise and unique, and relate the question to the image such that it is clearly answerable if and only if the image is shown.
We collected two separate types of QAs: freeform QAs and region-based QAs. In freeform QA, we ask a worker to look at an image and write eight QA pairs about it. To encourage diversity, we enforce that workers write at least three different Ws out of the seven in their eight pairs. In region-based QA, we ask the workers to write a pair based on a given region. We select the regions that have large areas (more than 5k pixels) and long phrases (more than 4 words). This enables us to collect around twenty region-based pairs at the same cost of the eight freeform QAs. In general, freeform QA tends to yield more diverse QA pairs that enrich the question distribution; region-based QA tends to produce more factual QA pairs at a lower cost.

Verification
All Visual Genome data go through a verification stage as soon as they are annotated. This stage helps eliminate incorrectly labeled objects, attributes, and relationships. It also helps remove region descriptions and questions and answers that might be correct but are vague ("This person seems to enjoy the sun."), subjective ("room looks dirty"), or opinionated ("Being exposed to hot sun like this may cause cancer").
Verification is conducted using two separate strategies: majority voting (Snow et al., 2008) and rapid judgments . All components of the dataset except objects are verified using majority voting. Majority voting (Snow et al., 2008) involves three unique workers looking at each annotation and vot-ing on whether it is factually correct. An annotation is added to our dataset if at least two (a majority) out of the three workers verify that it is correct.
We only use rapid judgments to speed up the verification of the objects in our dataset. Meanwhile, rapid judgments  use an interface inspired by rapid serial visual processing that enable verification of objects with an order of magnitude increase in speed than majority voting.

Canonicalization
All the descriptions and QAs that we collect are freeform worker-generated texts. They are not constrained by any limitations. For example, we do not force workers to refer to a man in the image as a man. We allow them to choose to refer to the man as person, boy, man, etc. This ambiguity makes it difficult to collect all instances of man from our dataset. In order to reduce the ambiguity in the concepts of our dataset and connect it to other resources used by the research community, we map all objects, attributes, relationships, and noun phrases in region descriptions and QAs to synsets in WordNet (Miller, 1995). In the example above, person, boy, and man would map to the synsets: person.n.01 (a human being), male child.n.01 (a youthful male person) and man.n.03 (the generic use of the word to refer to any human being) respectively. Thanks to the WordNet hierarchy it is now possible to fuse those three expressions of the same concept into person.n.01 (a human being) since this is the lowest common ancestor node of all aforementioned synsets.
We use the Stanford NLP tools  to extract the noun phrases from the region descriptions and QAs. Next, we map them to their most frequent matching synset in WordNet according to WordNet lexeme counts. We then refine this simple heuristic by hand-crafting mapping rules for the 30 most common failure cases. For example according to WordNet's lexeme counts the most common semantic for "table" is table.n.01 (a set of data arranged in rows and columns). However in our data it is more likely to see pieces of furniture and therefore bias the mapping towards table.n.02 (a piece of furniture having a smooth flat top that is usually supported by one or more vertical legs). The objects in our scene graphs are already noun phrases and are mapped to WordNet in the same way.
We normalize each attribute based on morphology (so called "stemming") and map them to the WordNet adjectives. We include 15 hand-crafted rules to address common failure cases, which typically occur when the concrete or spatial sense of the word seen in an image is not the most common overall sense. For example, the synset long.a.02 (of relatively great or greater than average spatial extension) is less common in WordNet than long.a.01 (indicating a relatively great or greater than average duration of time), even though instances of the word "long" in our images are much more likely to refer to that spatial sense.
For relationships, we ignore all prepositions as they are not recognized by WordNet. Since the meanings of verbs are highly dependent upon their morphology and syntactic placement (e.g. passive cases, prepositional phrases), we try to find WordNet synsets whose sentence frames match with the context of the relationship. Sentence frames in WordNet are formalized syntactic frames in which a certain sense of a word might appear; for example, play.v.01: participate in games or sport occurs in the sentence frames "Somebody [play]s" and "Somebody [play]s something." For each verb-synset pair, we then consider the root hypernym of that synset to reduce potential noise from WordNet's fine-grained sense distinctions. The WordNet hierarchy for verbs is segmented and originates from over 100 root verbs. For example, draw.v.01: cause to move by pulling traces back to the root hypernym move.v.02: cause to move or shift into a new position, while draw.v.02: get or derive traces to the root get.v.01: come into the possession of something concrete or abstract. We also include 20 hand-mapped rules, again to correct for WordNet's lower representation of concrete or spatial senses.
These mappings are not perfect and still contain some ambiguity. Therefore, we send all our mappings along with the top four alternative synsets for each term to Amazon Mechanical Turk. We ask workers to verify that our mapping was accurate and change the mapping to an alternative one if it was a better fit. We present workers with the concept we want to canonicalize along with our proposed corresponding synset with 4 additional options. To prevent workers from always defaulting to the our proposed synset, we do not explicitly specify which one of the 5 synsets presented is our proposed synset. Section 5.8 provides experimental precision and recall scores for our canonicalization strategy.

Dataset Statistics and Analysis
In this section, we provide statistical insights and analysis for each component of Visual Genome. Specifically, we examine the distribution of images (Section 5.1) and the collected data for region descriptions (Section 5) and questions and answers (Section 5.7). We analyze region graphs and scene graphs together in one section (Section 5.6), but we also break up these graph structures into their three constituent parts-objects (Section 5.3), attributes (Section 5.4), and relationships (Section 5.5)-and study each part individually. Finally, we describe our canonicalization pipeline and results (Section 5.8).

Image Selection
The Visual Genome dataset consists of all 108, 249 images from the intersection of MS-COCO's (Lin et al., 2014) 328, 000 images and YFCC's (Thomee et al., 2016) 100 million images. These images are real-world, non-iconic images that were uploaded onto Flickr by users. The images range from as small as 72 pixels wide to as large as 1280 pixels wide, with an average width  of 500 pixels. We collected the WordNet synsets into which our 108, 249 images can be categorized using the same method as ImageNet (Deng et al., 2009). Visual Genome images cover 972 synsets. Figure 13 shows the top synsets to which our images belong. "ski" is the most common synset, with 2612 images; it is followed by "ballplayer" and "racket," with all three synsets referring to images of people playing sports. Our dataset is somewhat biased towards images of people, as Figure 13 shows; however, they are quite diverse overall, as the top 25 synsets each have over 800 images, while the top 50 synsets each have over 500 examples.

Region Description Statistics
One of the primary components of Visual Genome is its region descriptions. Every image includes an aver- Fig. 17: The process used to convert a region description into a 300-dimensional vectorized representation.
age of 42 regions with a bounding box and a descriptive phrase. Figure 14 shows an example image from our dataset with its 50 region descriptions. We display bounding boxes for only 6 out of the 50 descriptions in the figure to avoid clutter. These descriptions tend to be highly diverse and can focus on a single object, like in "A bag," or on multiple objects, like in "Man taking a photo of the elephants." They encompass the most salient parts of the image, as in "An elephant taking food from a woman," while also capturing the background, as in "Small buildings surrounded by trees." MS-COCO (Lin et al., 2014) dataset is good at generating variations on a single scene-level descriptor. Consider three sentences from MS-COCO dataset on a similar image: "there is a person petting a very large elephant," "a person touching an elephant in front of a wall," and "a man in white shirt petting the cheek of an elephant." These three sentences are single scenelevel descriptions. In comparison, Visual Genome descriptions emphasize different regions in the image and thus are less semantically similar. To ensure diversity in the descriptions, we use BLEU score (Papineni et al., 2002) thresholds between new descriptions and all previously written descriptions. More information about crowdsourcing can be found in Section 4.
Region descriptions must be specific enough in an image to describe individual objects, like in the description "A bag," but they must also be general enough to describe high-level concepts in an image, like "An man being chased by a bear." Qualitatively, we note that regions that cover large portions of the image tend to be general descriptions of an image, while regions that cover only a small fraction of the image tend to be more specific. In Figure 15 (a), we show the distribution of regions over the width of the region normalized by the width of the image. We see that the majority of our regions tend to be around 10% to 15% of the image width. We also note that there are a large number of regions covering 100% of the image width. These regions usually include elements like "sky," "ocean," "snow," "mountains," etc. that cannot be bounded and thus span the entire image width. In Figure 15 (b), we show a similar distribution over the normalized height of the region. We see a similar overall pattern, as most of our regions tend to be very specific descriptions of about 10% to 15% of the image height. Unlike the distribution over width, however, we do not see a increase in the number of regions that span the entire height of the image, as there are no common visual equivalents that span images vertically. Out of all the descriptions gathered, only one or two of them tend to be global scene descriptions that are similar to MS-COCO (Lin et al., 2014).
After examining the distribution of the size of the regions described, it is also valuable to look at the semantic information captured by these descriptions. In Figure 16, we show the distribution of the length (word count) of these region descriptions. The average word count for a description is 5 words, with a minimum of 1 word and a maximum of 12 words. In Figure 18 (a), we plot the most common phrases occurring in our region descriptions, with stop words removed. Common visual elements like "green grass," "tree [in] distance," and "blue sky" occur much more often than other, more nuanced elements like "fresh strawberry." We also study descriptions with finer precision in Figure 18 (b), where we plot the most common words used in descriptions. Again, we eliminate stop words from our study. Colors like "white" and "black" are the most frequently used words to describe visual concepts; we conduct a similar study on other captioning datasets including MS-COCO (Lin et al., 2014) and Flickr 30K (Young et al., 2014) and find a similar distribution with colors occur-ring most frequently. Besides colors, we also see frequent occurrences of common objects like "man," "tree," and "sign" and of universal visual elements like "sky." Semantic diversity. We also study the actual semantic contents of the descriptions. We use an unsupervised approach to analyze the semantics of these descriptions. Specifically, we use word2vec (Mikolov et al., 2013) to convert each word in a description to a 300-dimensional vector. Next, we remove stop words and average the remaining words to get a vector representation of the whole region description. This pipeline is outlined in Figure 17. We use hierarchical agglomerative clustering on vector representations of each region description and find 71 semantic and syntactic groupings or "clusters." Figure 19 (a) shows four such example clusters. One cluster contains all descriptions related to tennis, like "A man swings the racquet" and "White lines on the ground of the tennis court," while another cluster contains descriptions related to numbers, like "Three dogs on the street" and "Two people inside the tent." To quantitatively measure the diversity of Visual Genome's region descriptions, we calculate the number of clusters represented in a single image's region descriptions. We show the distribution of the variety of descriptions for an image in Figure 19 (b). We find that on average, each image contains descriptions from 17 different clusters. The image with the least diverse descriptions contains descriptions from 4 clusters, while the image with the most diverse descriptions contains descriptions from 26 clusters.
Finally, we also compare the descriptions in Visual Genome to the captions in MS-COCO. First we aggregate all Visual Genome and MS-COCO descriptions and remove all stop words. After removing stop words, the descriptions from both datasets are roughly the same length. We conduct a similar study, in which we vectorize the descriptions for each image and calculate each dataset's cluster diversity per image. We find that on average, 2 clusters are represented in the captions for each image in MS-COCO, with very few images in which 5 clusters are represented. Because each image in MS-COCO only contains 5 captions, it is not a fair comparison to compare the number of clusters represented in all the region descriptions in the Visual Genome dataset. We thus randomly sample 5 Visual Genome region descriptions per image and calculate the number of clusters in an image. We find that Visual Genome descriptions come from 4 or 5 clusters. We show our comparison results in Figure 19 (c). The difference between the semantic diversity between the two datasets is statistically significant (t = −240, p < 0.01).  Fig. 18: (a) A plot of the most common visual concepts or phrases that occur in region descriptions. The most common phrases refer to universal visual concepts like "blue sky," "green grass," etc. (b) A plot of the most frequently used words in region descriptions. Colors occur the most frequently, followed by common objects like "man" and "dog" and universal visual concepts like "sky."

Numbers Cluster
Two people inside the tent.

Many animals crossing the road.
Five ducks almost in a row.

The number four.
Three dogs on the street. Two towels hanging on racks.

Tennis Cluster
White lines on the ground of the tennis court.

A pair of tennis shoes.
Metal fence securing the tennis court. Navy blue shorts on tennis player. The man swings the racquet. Tennis player preparing a backhand swing.

Ocean Cluster
Ocean is blue and calm.

Rows of waves in front of surfer.
A group of men on a boat.

Surfboard on the beach.
Woman is surfing in the ocean. Foam on water's edge.

Transportation Cluster
Ladder folded on fire truck. Dragon design on the motorcycle. Tall windshield on bike.

Front wheels of the airplane.
A bus rear view mirror. The front tire of the police car.  Objects in Visual Genome come from a variety of categories. As shown in Figure 22 (b), objects related to WordNet categories such as humans, animals, sports, and scenery are most common; this is consistent with the general bias in image subject matter in our dataset. Common objects like man, person, and woman occur especially frequently with occurrences of 24K, 17K, and 11K. Other objects that also occur in MS-COCO (Lin et al., 2014) are also well represented with around 5000 instances on average. Figure 22 (a) shows some examples of objects in images. Objects in Visual Genome span a diverse set of Wordnet categories like food, animals, and man-made structures.
It is important to look not only at what types of objects we have but also at the distribution of objects in images and regions. Figure 20 (a) shows, as expected, that we have between 0 and 2 objects in each region on average. It is possible for regions to contain no objects if their descriptions refer to no explicit objects in the image. For example, a region described as "it is dark outside" has no objects to extract. Regions with only one object generally have descriptions that focus on the attributes of a single object. On the other hand, regions with two or more objects generally have descriptions that contain both attributes of specific objects and relationships between pairs of objects.
As shown in Figure 20 (b), each image contains on average around 21 unique objects. Few images have a low number of objects, which we expect since images usually capture more than a few objects. Moreover, few images have an extremely high number of objects (e.g. over 40).   People are the most frequently occurring objects in our dataset, followed by common objects and visual elements like building, shirt, and sky.

Attribute Statistics
Attributes allow for detailed description and disambiguation of objects in our dataset. About 45% of objects in Visual Genome are annotated with at least one attribute; our dataset contains 1.6 million total attributes with 13, 041 unique attributes. Attributes include colors (green), sizes (tall), continuous action verbs (standing), materials (plastic), etc. Each attribute in our scene graphs belongs to one object, while each object can have multiple attributes. We denote attributes as attribute(object).
On average, each image in Visual Genome contains 21 attributes, as shown in Figure 23. Each region contains on average 1 attribute, though about 42% of regions contain no attribute at all; this is primarily because many regions are relationship-focused. Figure 24 (a) shows the distribution of the most common attributes in our dataset. Colors (e.g. white, green) are by far the most frequent attributes. Also common are sizes (e.g. large) and materials (e.g. wooden). Figure 24 (b) shows the distribution of attributes describing people (e.g. man, girls, and person). The most common attributes describing people are intransitive verbs describing their states of motion (e.g.standing and walking). Certain sports (skiing, surfboarding) are overrepresented due to a bias towards these sports in our images.
Attribute Graphs. We also qualitatively analyze the attributes in our dataset by constructing co-occurrence graphs, in which nodes are unique attributes and edges connect those attributes that describe the same object. For example, if an image contained a "large black dog" (large(dog), black(dog)) and another image contained a "large yellow cat" (large(cat), yellow(cat)), its attributes would form an incomplete graph with edges (large, black) and (large, yellow). We create two such graphs: one for both the total set of attributes and a second where we consider only objects that refer to people. A subgraph of the 16 most frequently connected (co-occurring) personrelated attributes is shown in Figure 25 (a).
Cliques in these graphs represent groups of attributes in which at least one co-occurrence exists for each pair of attributes in that group. In the previous example, if a third image contained a "black and yellow taxi" (black(taxi), yellow(taxi)), the resulting third edge would create a clique between the attributes black, large, and yellow. When calculated across the entire Visual Genome dataset, these cliques provide insight into commonly perceived traits of different types of objects. Figure 25 (b) is a selected representation of three example cliques and their overlaps. From just a clique of attributes, we can predict what types of objects are usually referenced. In Figure 25 (b), we see that these cliques describe an animal (left), water body (top right), and human hair (bottom right).
Other cliques (not shown) can also uniquely identify objects. In our set, one clique contains athletic, young, fit, skateboarding, focused, teenager, male, skinny, and happy, capturing some of the common traits of skateboarders in our set. Another such clique has shiny, small, metal, silver, rusty, parked, and empty, most likely describing a subset of cars. From these cliques, we can thus infer distinct objects and object types based solely on their attributes, potentially allowing for highly specific object identification based on selected characteristics.

Relationship Statistics
Relationships are the core components that link objects in our scene graphs. Relationships are directional, i.e. they involve two objects, one acting as the subject and one as the object of a predicate relationship. We denote all relationships in the form relationship(subject, object). For example, if a man is swinging a bat, we write swinging(man, bat). Relationships can be spatial (e.g. inside of), action (e.g. swinging), compositional (e.g. part of), etc. More complex relationships such as standing on, which includes both an action and a spatial aspect, are also represented. Relationships are extracted from region descriptions by crowd workers, similarly to attributes and objects. Visual Genome contains a total of 13, 894 unique relationships, with over 1.8 million total relationships. Figure 26 (a) shows the distribution of relationships per region description. On average, we have 1 relationship per region, with a maximum of 7. We also have some descriptions like "an old, tall man," which have multiple attributes associated with the man but no relationships. Figure 26 (b) is a distribution of relationships per image object. Finally, Figure 26 (c) shows the distribution of relationships per image. Each image has an average of 19 relationships, with a minimum of 1 relationship and with ax maximum of over 60 relationships.
Top relationship distributions. We display the most frequently occurring relationships in Figure 27 (a). on is the most common relationship in our dataset. This is primarily because of the flexibility of the word on, which can refer to spatial configuration (on top of), attachment (hanging on), etc. Other common relationships involve actions like holding and wearing and spatial configurations like behind, next to, and under. Figure 27 (b) shows a similar distribution but for relationships involving people. Here we notice more human-centric relationships or actions such as kissing, chatting with, and talking to. The two distributions follow a Zipf distribution.
Understanding affordances. Relationships allow us to also understand the affordances of objects. We show this using a specific distribution of subjects and objects involved in the relationship riding in Figure 28. Figure 28 (a) shows the distribution for subjects while Figure 28 (b) shows a similar distribution for objects. Comparing the two distributions, we find clear patterns of people-like subject entities such as person, man, policeman, boy, and skateboarder that can ride other objects; the other distribution contains objects that afford riding, such as horse, bike, elephant, motorcycle, and skateboard. We can also learn specific common-sense knowledge, like that skateboarders only ride skateboards and only surfers ride waves or surfboards.
Related work comparison. It is also worth mentioning in this section some prior work on relationships. The concept of visual relationships has already been explored in Visual Phrases (Sadeghi and Farhadi, 2011), who introduced a dataset of 17 such relationships such as next to(person, bike) and riding(person, horse). However, their dataset is limited to just these 17 relationships. Similarly, the MS-COCO-a dataset (Ruggero Ronchi and Perona, 2015) introduced 140 actions that humans performed in MS-COCO's dataset (Lin et al., 2014). However, their dataset is limited to just actions, while our relationships are more general and numerous, with over 13K unique relationships. Finally, VisKE (Sadeghi et al., 2015) introduced 6500 relationships, but in a much smaller dataset of images than Visual Genome.

Region and Scene Graph Statistics
We introduce in this paper the largest dataset of scene graphs to date. We use these graph representations of images as a deeper understanding of the visual world. In this section, we analyze the properties of these representations, both at the region level through region graphs and at the image level through scene graphs. We also briefly explore other datasets with scene graphs and provide aggregate statistics on our entire dataset.
Scene graphs by asking humans to write triples about an image (Johnson et al., 2015). However, unlike them, we collect graphs at a much more fine-grained level, the region graph. We obtained our graphs by asking workers to create them from the descriptions we collected from our regions. Therefore, we end up with multiple graphs for an image, one for every region description. Together, we can combine all the individual region graphs to aggregate a scene graph for an image. This scene graph is made up of all the individual region graphs. In our scene graph representation, we merge all the objects that referenced by multiple region graphs into one node in the scene graph. Each of our images has a distribution between 40 to 50 region graphs per image, with an average of 42. Each image has exactly one scene graph. Note that the number of region descriptions and the number of region graphs for an image are not the same. For example, consider the description "it is a sunny day". Such a description contains no objects, which are the building blocks of a region graph. Therefore, such descriptions have no region graphs associated with them.
Objects, attributes, and relationships occur as a normal distribution in our data. Table 4 shows that in a region graph, there are an average of 0.43 objects, 0.41 attributes, and 0.45 relationships. Each scene graph and consequently each image has average of 21.26 objects, 16.21 attributes, and 18.67 relationships.

Question Answering Statistics
We collected 1, 773, 258 question answering (QA) pairs on the Visual Genome images. Each pair consists of a question and its correct answer regarding the content of an image. On average, every image has 17 QA pairs. Rather than collecting unconstrained QA pairs as previous work has done (Antol et al., 2015, Gao et al., 2015, Malinowski and Fritz, 2014, each question in Visual Genome starts with one of the six Ws -what, where, when, who, why, and how. There are two major benefits to focusing on six types of questions. First, they offer a considerable coverage of question types, ranging from basic perceptual tasks (e.g. recognizing objects and scenes) to complex common sense reasoning (e.g. inferring motivations of people and causality of events). Second, these categories present a natural and consistent stratification of task difficulty, indicated by the baseline performance in Section 6.4. For instance, why questions that involve complex reasoning lead to the poorest performance (3.4% top-100 accuracy) of the six categories. This enables us to obtain a better understanding of the strengths and weaknesses of today's computer vision models, which sheds light on future directions in which to proceed.
We now analyze the diversity and quality of our questions and answers. Our goal is to construct a largescale visual question answering dataset that covers a diverse range of question types, from basic cognition tasks to complex reasoning tasks. We demonstrate the richness and diversity of our QA pairs by examining the distributions of questions and answers in Figure 29.
Question type distributions. The questions naturally fall into the 6W categories via their interrogative words. Inside each of the categories, the second and following words categorize the questions with increasing granularity. Inspired by VQA (Antol et al., 2015), we show the distributions of the questions by their first three words in Figure 30. We can see that "what" is the most common of the six categories. A notable difference between our question distribution and VQA's is that we focus on ensuring that all 7 question categories are adequately represented, while in VQA, 32.37% of the questions are yes/no binary questions. As a result, a trivial model can achieve a reasonable performance by just predicting "yes" or "no" as answers. We encourage more difficult QA pairs by ruling out binary questions.
Question and answer length distributions. We also analyze the question and answer lengths of each 6W category. Figure 31 shows the average question and answer lengths of each category. Overall, the average question and answer lengths are 5.7 and 1.8 words respectively. In contrast to the VQA dataset, where .88%, 8.38%, and 3.25% of the answers consist of one, two, or three words, our answers exhibit a long-tail distribution where 57.3%, 18.1%, and 15.7% of the answers have one, two, or three words respectively. We avoid verbosity by instructing the workers to write answers as concisely as possible. The coverage of long answers means that many answers contain a short description that contains more details than merely an object or an attribute. It shows the richness and complexity of our visual QA tasks beyond object-centric recognition tasks. We foresee that these long-tail questions can mo- Fig. 31: Question and answer lengths by question type. The bars show the average question and answer lengths of each question type. The whiskers show the standard deviations. The factual questions, such as "what" and "how" questions, usually come with short answers of a single object or a number. This is only because "how" questions are disproportionately counting questions that start with "how many". Questions from the "where" and "why" categories usually have phrases and sentences as answers.
tivate future research in common-sense reasoning and high-level image understanding.

Canonicalization Statistics
In order to reduce the ambiguity in the concepts of our dataset and connect it to other resources used by the research community, we canonicalize the semantic meanings of all objects, relationships, and attributes in Visual Genome. By "canonicalization," we refer to word sense disambiguation (WSD) by mapping the components in our dataset to their respective synsets in the WordNet ontology (Miller, 1995). This mapping reduces the noise in the concepts contained in the dataset and also facilitates the linkage between Visual Genome and other data sources such as ImageNet (Deng et al., 2009), which is built on top of the WordNet ontology. Figure 32 shows an example image from the Visual Genome dataset with its components canonicalized. For example, horse is canonicalized as horse.n.01: solid-hoofed herbivorous quadruped domesticated since prehistoric times. Its attribute, clydesdale, is canonicalized as its breed clydesdale.n.01: heavy feathered-legged breed of draft horse originally from Scotland. We also show an example of a QA from We also only show a subset of the scene graph for this image to avoid cluttering the figure.
which we extract the nouns shamrocks, symbol, and St. Patrick's day, all of which we canonicalize to WordNet as well.
Related work. Canonicalization, or WSD (Pal and Saha, 2015), has been used in numerous applications, including machine translation, information retrieval, and information extraction (Rothe andSchütze, 2015, Leacock et al., 1998). In English sentences, sentences like "He scored a goal" and "It was his goal in life" carry different meanings for the word "goal." Understanding these differences is crucial for translating languages and for returning correct results for a query. Similarly, in Visual Genome, we ensure that all our components are canonicalized to understand how different objects are related to each other; for example, "person" is a hypernym of "man" and "woman." Most past canonicalization models use precision, recall, and F1 score to evaluate on the Semeval dataset (Mihalcea et al., 2004). The current state-of-the-art performance on Semeval is an F1 score of 75.8% (Chen et al., 2014). Since our canonicalization setup is different from the Semeval benchmark (we have an open vocabulary and no annotated ground truth for evaluation), our canonicalization method is not directly comparable to these existing methods. We do however, achieve a similar precision and recall score on a held-out test set described below.
Region descriptions and QAs. We canonicalize all objects mentioned in all region descriptions and QA pairs.  Table 5: Precision, recall, and mapping accuracy percentages for object, attribute, and relationship canonicalization.
Because objects need to be extracted from the phrase text, we use Stanford NLP tools  to extract the noun phrases in each region description and QA, resulting in 99% recall of noun phrases from a subset of 200 region descriptions we manually annotated. After obtaining the noun phrases, we map each to its most frequent matching synset (according to Word-Net lexeme counts). This resulted in an overall mapping accuracy of 86% and a recall of 98.5%. The most common synsets extracted from region descriptions, QAs, and objects are shown in Figure 33.
Attributes. We canonicalize attributes from the crowdextracted attributes present in our scene graphs. The "attribute" designation encompasses a wide range of grammatical parts of speech. Because part-of-speech taggers rely on high-level syntax information and thus fail on the disjoint elements of our scene graphs, we normalize each attribute based on morphology alone (so-called "stemming"). Then, as with objects, we map each attribute phrase to the most frequent matching WordNet synset. We include 15 hand-mapped rules to address common failure cases in which WordNet's frequency counts prefer abstract senses of words over the spatial senses present in visual data, e.g. "short.a.01: limited in duration" over short.a.02: lacking in length. For verification, we randomly sample 200 attributes, produce ground-truth mappings by hand, and compare them to the results of our algorithm. This resulted in a recall of 95.9% and a mapping accuracy of 83.5%. The most common attribute synsets are shown in Figure 34 (a).
Relationships. As with attributes, we canonicalize the relationships isolated in our scene graphs. We exclude prepositions, which are not recognized in Word-Net, leaving a set primarily composed of verb relationships. Since the meanings of verbs are highly dependent upon their morphology and syntactic placement (e.g. passive cases, prepositional phrases), we map the structure of each relationship to the appropriate WordNet sentence frame and only consider those WordNet synsets with matching sentence frames.
For each verb-synset pair, we then consider the root hypernym of that synset to reduce potential noise from WordNet's fine-grained sense distinctions. We also include 20 hand-mapped rules, again to correct for WordNet's lower representation of concrete or spatial senses; for example, the concrete hold.v.02: have or hold in one's hand or grip is less frequent in WordNet than the abstract hold.v.01: cause to continue in a certain state. For verification, we again randomly sample 200 relationships and compare the results of our canonicalization against ground-truth mappings. This resulted in a recall of 88.5% and a mapping accuracy of 77.6%. While several datasets, such as VerbNet (Schuler, 2005) and FrameNet (Baker et al., 1998), include semantic restrictions or frames to improve classification, there is no comprehensive method of mapping to those restrictions or frames. The most common relationship synsets are shown in Figure 34 (b).

Experiments
Thus far, we have presented the Visual Genome dataset and analyzed its individual components. With such rich information provided, numerous perceptual and cognitive tasks can be tackled. In this section, we aim to provide baseline experimental results using components of Visual Genome that have not been extensively studied.
Object detection is already a well-studied problem (Everingham et al., 2010, Girshick et al., 2014, Sermanet et al., 2013, Girshick, 2015, Ren et al., 2015b. Similarly, region graphs and scene graphs have been shown to improve semantic image retrieval (Johnson et al., 2015, Schuster et al., 2015. We therefore focus on the remaining components, i.e. attributes, relationships, region descriptions, and question answer pairs. In Section 6.1, we present results for two experiments on attribute prediction. In the first, we treat attributes independently from objects and train a classifier for each attribute, i.e. a classifier for red or a classifier for old, as in (Malisiewicz et al., 2008, Varma and Zisserman, 2005, Ferrari and Zisserman, 2007, Farhadi et al., 2009, Johnson et al., 2015. In the second experiment, we learn object and attribute classifiers jointly and predict object-attribute pairs (e.g. predicting that an apple is red), as in (Sadeghi and Farhadi, 2011).
In Section 6.2, we present two experiments on relationship prediction. In the first, we aim to predict the predicate between two objects, e.g. predicting the predicate kicking or wearing between two objects. This experiment is synonymous with existing work in action recognition (Gupta et al., 2009, Ramanathan et al., 2015. In another experiment, we study relationships by classifying jointly the objects and the predicate (e.g. predicting kicking(man, ball )); we show that this is a very difficult task due to the high variability in the appearance of a relationship (e.g. the ball might be on the ground or in mid-air above the man). These experiment are generalizations of tasks that study spatial relationships between objects and ones that jointly reason about the interaction of humans with objects (Yao andFei-Fei, 2010, Prest et al., 2012).
In Section 6.3 we present results for region captioning. This task is closely related to image captioning (Chen et al., 2015); however, results from the two are not directly comparable, as region descriptions are short, incomplete sentences. We train one of the top 16 state-of-the-art image caption generator (Karpathy and Fei-Fei, 2014) on (1) our dataset to generate region descriptions and on (2) Flickr30K (Young et al., 2014) to generate sentence descriptions. To compare results between the two training approaches, we use simple templates to convert region descriptions into complete sentences. For a more robust evaluation, we validate the descriptions we generate using human judgment.
Finally, in Section 6.4, we experiment on visual question answering, i.e. given an image and a question, we attempt to provide an answer for the question. We report results on the retrieval of the correct answer from a list of existing answers.

Attribute Prediction
Attributes are becoming increasingly important in the field of computer vision, as they offer higher-level semantic cues for various problems and lead to a deeper understanding of images. We can express a wide variety of properties through attributes, such as form (sliced), function (decorative), sentiment (angry), and even intention (helping). Distinguishing between similar objects (Isola et al., 2015) leads to finer-grained classification, while describing a previously unseen class through attributes shared with known classes can enable "zero-shot" learning (Farhadi et al., 2009, Lampert et al., 2009. Visual Genome is the largest dataset of attributes, with 18 attributes per image for a total of 1.8 million attributes. Setup. For both experiments, we focus on the 100 most common attributes in our dataset. We only use objects that occur at least 100 times and are associated with one of the 100 attributes in at least one image. For both experiments, we follow a similar data pre-processing pipeline. First, we lowercase, lemmatize, and strip excess whitespace from all attributes. Since the number of examples per attribute class varies, we randomly sample 500 attributes from each category (if fewer than 500 are in the class, we take all of them).
We end up with around 50, 000 attribute instances and 43, 000 object-attribute pair instances in total. We use 80% of the images for training and 10% each for validation and testing. Because each image has about the same number of examples, this results in an approximately 80%-10%-10% split over the attributes themselves. The input data for this experiment is the cropped bounding box of the object associated with each attribute.
We train an attribute predictor by using features learned from a convolutional neural network. Specifi-"playing" (predicted "grazing") "beautiful" (predicted "concrete") "metal" (predicted "closed") "dark" (predicted "dark") "parked" (predicted "parked") "white" (predicted "stuffed") (a) "green leaves" (predicted "white snow") "flying bird" (predicted "black jacket") "brown grass" (predicted "green grass") "red bus" (predicted "red bus") "skiing person" (predicted "skiing person") "white stripe" (predicted "black and white zebra") (b) Fig. 35: (a) Example predictions from the attribute prediction experiment. Attributes in the first row are predicted correctly, those in the second row differ from the ground truth but still correctly classify an attribute in the image, and those in the third row are classified incorrectly. The model tends to associate objects with attributes (e.g. elephant with grazing). (b) Example predictions from the joint object-attribute prediction experiment.
cally, we fine-tune a 16-layer VGG network (Simonyan and Zisserman, 2014) for both of these experiments using the 50, 000 attribute and 43, 000 object-attribute pair instances respectively. We modify the network so that the learning rate of the final fully-connected layer is 10 times that of the other layers, as this improves convergence time. We use a base learning rate of 0.001, which we scale by 0.1 every 200 iterations, and momentum and weight decays of 0.9 and 0.0005 respectively. We use the fine-tuned features from the network and train 100 individual SVMs (Hearst et al., 1998) to predict each attribute. We output multiple attributes for each bounding box input. For the second experiment, we also output the object class.
Results. 17%. This implies that some attributes occur exclusively with a small number of objects. Additionally, by jointly learning attributes with objects, we increase the inter-class variance, making the classification process an easier task. Figure 35 (a) shows example predictions for the first attribute prediction experiment. In general, the model is good at associating objects with their most salient attributes, for example, animal with stuffed and elephant with grazing. However, there is some difference between the user-provided result and the correct ground truth, so the model incorrectly classifies some correct predictions. For example, the white stuffed animal is correct but evaluated as incorrect. Figure 35 (b) shows example predictions for the second experiment in which we also predict the object. While the results in the second row might be considered correct, to keep a consistent evaluation, we mark them as incorrect. For example, the predicted "green grass" might be considered subjectively correct even though it is annotated as "brown grass". For cases where the objects are not clearly visible but are abstract outlines, our model is unable to predict attributes or objects accurately. For example, it thinks that the "flying bird" is actually a "black jacket".  The attribute clique graphs in Section 5.4 clearly show that learning attributes can help us identify types of objects. This experiment strengthens that insight. We learn that studying attributes together with objects can improve attribute prediction.

Relationship Prediction
While objects are the core building blocks of an image, relationships put them in context. These relationships help distinguish between images that contain the same objects but have different holistic interpretations. For example, an image of "a man riding a bike" and "a man falling off a bike" both contain man and bike, but the relationship (riding vs. falling off) changes how we perceive both situations. Visual Genome is the largest known dataset of relationships, with a total of 1.8 million relationships and an average of 18 relationships per image.
Setup. The setups of both experiments are similar to those of the experiments we performed on attributes. We again focus on the top 100 most frequent relationships. We lowercase, lemmatize, and strip excess whitespace from all relationships. We end up with around 34, 000 relationships and 27, 000 subject-relationshipobject triples for training, validation, and testing. The input data to the experiment is the image region containing the union of the bounding boxes of the subject and object (essentially, the bounding box containing the two object boxes). We fine-tune a 16-layer VGG network (Simonyan and Zisserman, 2014) with the same learning rates mentioned in Section 6.1.
Results. Overall, we find that relationships are not visually distinct enough for our discriminative model to learn effectively. Table 7 shows results for both experiments. For relationship classification, we converge after around 800 iterations with 8.74% top-one accuracy and 29.69% top-five accuracy. Unlike attribute prediction, the accuracy results for relationships are much lower  because of the high intra-class variability of most relationships. For the second experiment jointly predicting the relationship and its two object classes, we converge after around 450 iterations with 25.83% top-one accuracy and 65.57% top-five accuracy. We notice that object classification aids relationship prediction. Some relationships occur with some objects and never others; for example, the relationship drive only occurs with the object person and never with any other objects (dog, chair, etc.). Figure 36 (a) shows example predictions for the relationship classification experiment. In general, the model associates object categories with certain relationships (e.g. animals with eating or drinking, bikes with riding, and kids with playing). Figure 36 (b), structured as in Figure 36 (a), shows example predictions for the joint prediction of relationships with its objects. The model is able to predict the salient features of the image (e.g. "boat in water") but fails to distinguish between different objects (e.g. boy vs. woman and car vs. bus in the bottom row).

Generating Region Descriptions
Generating sentence descriptions of images has gained popularity as a task in computer vision (Kiros et al., 2014, Mao et al., 2014, Karpathy and Fei-Fei, 2014, Vinyals et al., 2014; however, current state-of-the-art models fail to describe all the different events captured in an image and instead provide only a high-level summary of the image. In this section, we test how well state-of-the-art models can caption the details of images. For both experiments, we use the NeuralTalk model (Karpathy and Fei-Fei, 2014), since it not only provides state-of-the-art results but also is shown to be robust enough for predicting short descriptions. We train NeuralTalk on the Visual Genome dataset for region descriptions and on Flickr30K (Young et al., 2014) for full sentence descriptions. As a model trained on other datasets would generate complete sentences and would not be comparable (Chen et al., 2015) to our region descriptions, we convert all region descriptions generated by our model into complete sentences using predefined templates (Hou et al., 2002).
Dog "carrying" frisbee (predicted: "laying on") Boy "playing" soccer (predicted "playing") Sheep "eating" grass (predicted "eating") Bike "attached to" rack (predicted "riding") Bag "inside" rickshaw (predicted "riding") Shadow "from" zebra (predicted "drinking") (a) "glass on table" (predicted "plate on table") "car on road" (predicted "bus on street") "train on tracks" (predicted "train on tracks") "leaf on tree" (predicted "building in background") "boat in water" (predicted "boat in water") "boy has hair" (predicted "woman wearing glasses") Setup. For training, we begin by preprocessing region descriptions; we remove all non-alphanumeric characters and lowercase and strip excess whitespace from them. We have 4, 158, 841 region descriptions in total. We end up with 3, 150, 000 region descriptions for training -504, 420 each for validation and testing. Note that we ensure descriptions of regions from the same image are exclusively in the training, validation, or testing set. We feed the bounding boxes of the regions through the pretrained VGG 16-layer network (Simonyan and Zisserman, 2014) to get the 4096-dimensional feature vectors of each region. We then use the NeuralTalk (Karpathy and Fei-Fei, 2014) model to train a long short-term memory (LSTM) network (Hochreiter and Schmidhuber, 1997) to generate descriptions of regions. We use a learning rate of 0.001 trained with rmsprop (Dauphin et al., 2015). The model converges after four days. For testing, we crop the ground-truth region bounding boxes of images and extract their 4096-dimensional 16-layer VGG network (Simonyan and Zisserman, 2014) features. We then feed these vectors through the pretrained NeuralTalk model to get predictions for region descriptions.
Results. Table 8 shows the results for the experiment. We calculate BLEU, CIDEr, and METEOR scores (Chen et al., 2015) between the generated descriptions and their ground-truth descriptions. In all cases, the model trained on VisualGenome performs better. Moreover, we asked crowd workers to evaluate whether a generated description was correct-we got 1.6% and 43.03% for models trained on Flickr30K and on Visual Genome, respectively. The large increase in accuracy when the model trained on our data is due to the specificity of our dataset. Our region descriptions are shorter and cover a smaller image area. In comparison, the Flickr30K data are generic descriptions of entire images with multiple events happening in different regions of the image. The model trained on our data is able to make predictions that are more likely to concentrate on the specific part of the image it is looking at, instead of generating a summary description. The objectively low accuracy in both cases illustrates that current models are unable to reason about complex images. Figure 37 shows examples of regions and their predicted descriptions. Since many examples have short "a black motorcycle" "trees in background" "train is visible" "the umbrella is red" "black and white cow" "a kite in the sky" descriptions, the predicted descriptions are also short as expected; however, this causes the model to fail to produce more descriptive phrases for regions with multiple objects or with distinctive objects (i.e. objects with many attributes). While we use templates to convert region descriptions into sentences, future work can explore smarter approaches to combine region descriptions and generate a paragraph connecting all the regions into one coherent description.

Question Answering
Visual Genome is currently the largest dataset of visual question answers with 1.7 million question and answer pairs. Each of our 108, 249 images contains an average of 17 question answer pairs. Answering questions requires a deeper understanding of an image than generic image captioning. Question answering can involve fine-grained recognition (e.g. "What is the breed of the dog?"), object detection (e.g. "Where is the kite in the image?"), activity recognition (e.g. "What is this man doing?"), knowledge base reasoning (e.g. "Is this glass full?"), and common-sense reasoning (e.g. "What street will we be on if we turn right?").
By leveraging the detailed annotations in the scene graphs in Visual Genome, we envision building smart  Table 9: Baseline QA performances (in accuracy).
models that can answer a myriad of visual questions. While we encourage the construction of smart models, in this paper, we provide some baseline metrics to help others compare their models.
Setup. We split the QA pairs into a training set (60%) and a test set (40%). We ensure that all images are exclusive to either the training set or the test set. We implement a simple baseline model that relies on answer frequency. The model counts the top k most frequent answers (similar to the ImageNet challenge (Russakovsky et al., 2015)) in the training set as the predictions for all the test questions, where k = 100, 500, and 1000. We let a model make k different predictions. We say the model is correct on a QA if one of the k predictions matches exactly with the groundtruth answer. We report the accuracy over all test questions. This evaluation method works well when the answers are short, especially for single-word answers. However, it causes problems when the answers are long phrases and sentences. Other evaluation methods require word ontologies (Malinowski and Fritz, 2014), multiple choices (Antol et al., 2015, Yu et al., 2015, or human judges (Gao et al., 2015).
Results. Table 9 shows the performance of the openended visual question answering task. These baseline results imply the long-tail distribution of the answers. Long-tail distribution is common in existing QA datasets as well (Antol et al., 2015, Malinowski andFritz, 2014). The top 100, 500, and 1000 most frequent answers only cover 41.1%, 57.3%, and 64.1% of the correct answers. In comparison, the corresponding sets of frequent answers in VQA (Antol et al., 2015) cover 63%, 75%, and 80% of the test set answers. The "where" and "why" questions, which tend to involve spatial and common sense reasoning, tend to have more diverse answers and hence perform poorly, with performances of 0.096% and 0.024% top-100 respectively. The top 1000 frequent answers cover only 41.8% and 18.7% of the correct answers from these two question types respectively.  Table 8: Results for the region description generation experiment. Scores in the first row are for the region descriptions generated from the NeuralTalk model trained on Flickr8K, and those in the second row are for those generated by the model trained on Visual Genome data. BLEU, CIDEr, and METEOR scores all compare the predicted description to a ground truth in different ways.

Future Applications
We have analyzed the individual components of this dataset and presented experiments with baseline results for tasks such as attribute classification, relationship classification, description generation, and question answering. There are, however, more applications and experiments for which our dataset can be used. In this section, we note a few potential applications that our dataset can enable.
Dense image captioning. We have seen numerous image captioning papers (Kiros et al., 2014, Mao et al., 2014, Karpathy and Fei-Fei, 2014, Vinyals et al., 2014 that attempt to describe an entire image with a single caption. However, these captions do not exhaustively describe every part of the scene. An natural extension to this application, which the Visual Genome dataset enables, is the ability to create dense captioning models that describe parts of the scene.
Visual question answering. While visual question answering has been studied as a standalone task (Yu et al., 2015, Ren et al., 2015a, Antol et al., 2015, Gao et al., 2015, we introduce a dataset that combines all of our question answers with descriptions and scene graphs. Future work can build supervised models that utilize various components of Visual Genome to tackle question answering. Image understanding. While we have seen a surge of image captioning (Kiros et al., 2014) and question answering (Antol et al., 2015) models, there has been little work on creating more comprehensive evaluation metrics to measure how well these models are performing. Such models are usually evaluated using BLEU, CIDEr, or METEOR and other similar metrics that do not effectively measure how well these models understand the image (Chen et al., 2015). The Visual Genome scene graphs can be used as a measurement for image understanding. Generated descriptions and answers can be matched against the ground truth scene graph of an image to evaluate its corresponding model.
Relationship extraction. Relationship extraction has been extensively studied in information retrieval and natural language processing (Zhou et al., 2007, GuoDong et al., 2005, Culotta and Sorensen, 2004, Socher et al., 2012. Visual Genome is the first largescale visual relationship dataset. This dataset can be used to study the extraction of visual relationships (Sadeghi et al., 2015) from images, and its interactions between objects can also be used to study action recognition (Yao andFei-Fei, 2010, Ramanathan et al., 2015) and spatial orientation between objects (Gupta et al., 2009, Prest et al., 2012.
Semantic image retrieval. Previous work has already shown that scene graphs can be used to improve semantic image search (Johnson et al., 2015, Schuster et al., 2015. Further methods can be explored using our region descriptions combined with region graphs. Attention-based search methods can also be explored where the area of interest specified by a query is also localized in the retrieved images.

Conclusion
Visual Genome provides a multi-layered understanding of pictures. It allows for a multi-perspective study of an image, from pixel-level information like objects, to relationships that require further inference, and to even deeper cognitive tasks like question answering. It is a comprehensive dataset for training and benchmarking the next generation of computer vision models. With Visual Genome, we expect these models to develop a broader understanding of our visual world, complementing computers' capacities to detect objects with abilities to describe those objects and explain their interactions and relationships. Visual Genome is a large formalized knowledge representation for visual understanding and a more complete set of descriptions and question answers that grounds visual concepts to language.