Understanding Synonymous Referring Expressions via Contrastive Features

Referring expression comprehension aims to localize objects identified by natural language descriptions. This is a challenging task, as it requires understanding both the visual and language domains. One characteristic of the task is that each object can be described by multiple synonymous sentences, i.e., paraphrases, and such variety in language has a critical impact on learning a comprehension model. While prior work usually treats each sentence individually and attends it to an object separately, we focus on learning a referring expression comprehension model that accounts for this property of synonymous sentences. To this end, we develop an end-to-end trainable framework to learn contrastive features on the image and object instance levels, where features extracted from synonymous sentences describing the same object should be closer to each other after mapping to the visual domain. We conduct extensive experiments to evaluate the proposed algorithm on several benchmark datasets, and demonstrate that our method performs favorably against state-of-the-art approaches. Furthermore, since the variety in expressions grows across datasets that describe objects in different ways, we present cross-dataset and transfer learning settings to validate the transferability of our learned features.


Introduction
Referring expression comprehension is a task to localize a particular object within an image guided by a natural language description, e.g., "the man holding a remote standing next to a woman" or "the blue car". Since referring expressions are widely used in our daily conversations, the ability to understand such expressions provides an intuitive way for humans to interact with intelligent agents. One challenge of this task is to jointly comprehend the knowledge from both visual and language domains, where there are multiple ways (i.e., synonymous sentences) to describe and paraphrase the same object. For instance, we can refer to an object by its attribute, location, or interaction with other objects. Referring expressions also vary in length and synonyms. As such, the varieties of sentences that describe the same object cause gaps in the language domain, and affect the model training process.

Figure 1. Overview of the proposed algorithm. An object can be described in different ways, e.g., by its attribute, location, or interaction with other objects. For a referring expression S, there are positive expressions S^+ describing the same object and negative ones S^- for another object. While prior work considers each expression separately, our method encourages features of synonymous sentences for the same object to attend nearby in the language-to-visual embedding space (F and F^+) but far away from negative ones (F^-). Thus, the proposed framework can transfer learned features to unseen data.
In this work, we take this language property, synonymous sentences, into consideration during the training process. This is different from existing referring expression comprehension methods [42,45,22,37] that consider all sentences as individual instances. Here, our main idea is to learn contrastive features when mapping the language features to the visual domain. That is, while the same object can be described by different synonymous sentences, these language features should be close to each other after mapping to the visual domain. On the other hand, for other expressions that do not describe that object, our model should map them farther away from that object (see Figure 1 for an illustration).
To exploit how synonymous sentences can help model training as described above, we integrate feature learning techniques, e.g., contrastive learning [8,33,9], into our framework. The requirement of (multiple) positive/negative samples in contrastive learning is then naturally satisfied by the notion of synonymous sentences. However, it is not trivial to determine where in the model to learn contrastive features; we find that using language-to-visual features is beneficial to optimizing both the image and language modules. To this end, we design an end-to-end learnable framework that enables feature learning with language-to-visual features on two different levels, i.e., the image and object instance levels, which are responsible for global context and relationships between object instances, respectively. Moreover, since there is a large variety of negative samples (i.e., any sentence describing a different object can be a negative), we explore mining negative samples to further facilitate the learning process.
In our framework, one benefit of learning contrastive features from synonymous sentences is to equip the model with the ability to contrast different language meanings. This ability is important when transferring the model to other datasets, as each domain may contain a different variety of sentences to describe objects. To understand whether the learned features are effectively transferred to a new domain, we show that our model performs better in the cross-dataset setting (testing on an unseen dataset), as well as in the transfer learning setting that fine-tunes our pre-trained model on the target dataset. Note that, although a similar concept of synonymous sentences is also adopted in [35] for retrieval tasks, it has not been exploited in referring expression comprehension for feature learning and transfer learning. Specifically, [35] uses off-the-shelf feature extractors and does not consider feature learning using multiple positive/negative samples on both the image and instance levels like our framework.
We conduct extensive experiments on referring expression comprehension benchmarks to demonstrate the merits of learning contrastive features from synonymous sentences. First, we use the RefCOCO benchmarks, including RefCOCO [43], RefCOCO+ [43], and RefCOCOg [26,27], to perform baseline studies with comparisons to state-of-the-art methods. Second, we focus on cross-dataset and transfer learning settings using the ReferItGame [15] and Ref-Reasoning [38] datasets to validate the transferability of features learned on the RefCOCO benchmarks.
The main contributions of this work are summarized as follows: 1) We propose a unified and end-to-end learnable framework for referring expression comprehension that considers various synonymous sentences to improve the training procedure. 2) We integrate feature learning techniques into our framework with a well-designed sampling strategy that learns contrastive features on both the image and instance levels. 3) We demonstrate that our model is able to effectively transfer learned representations in both cross-dataset and transfer learning settings.

Related Work
Referring Expression Comprehension. The task of referring expression comprehension is typically considered as determining an object among several object proposals, given the referring expression. To this end, several methods adopt two-stage frameworks that first generate object proposals with a pre-trained object detection network, and then rank the proposals according to the expression. For example, CNN-LSTM models have been used to generate captions based on the image and proposals [12,25,26], and the one with the maximum posterior probability for generating the query expression is selected. Other approaches [35,29] embed proposals and the query sentence into a common feature space, and choose the object with the minimum distance to the expression. In addition, several strategies are adopted to improve the performance, e.g., analyzing the relationship between an object and its context [27,43,45], or exploring attributes [21] to distinguish similar objects. To jointly consider multiple factors, MAttNet [42] learns a modular network with three components, i.e., subject appearance, location, and relationship to other objects. While these two-stage frameworks achieve promising results, their computational cost is significantly higher due to extensive post-processing steps and individual models. Furthermore, the model performance is largely limited by the pre-trained object detection network.
Some recent approaches adopt one-stage frameworks to tackle referring expression object segmentation [3], zero-shot grounding [30], and visual grounding [41,40,39], where the language features are fused with the object detector. In these methods, the models are end-to-end trainable and more computationally efficient. Compared to the methods mentioned above, which consider each individual dataset separately, we aim to learn contrastive features by considering synonymous sentences during the training process. In this work, we adopt a one-stage framework in which the representations in both the language and visual domains are learned jointly.
Feature Learning. Feature learning aims to represent data in an embedding space, where similar data points are close to each other and dissimilar ones are far apart, based on pairwise [32], triplet [31] or contrastive relationships [8,33,9]. This representation model is then used for downstream tasks such as classification and detection. These loss functions are computed on anchor, positive and negative samples, where the anchor-positive distance is minimized and the anchor-negative distance is maximized. While the triplet loss [31] uses one positive and one negative sample per anchor, the contrastive loss [8,33,9] includes multiple positive and negative samples for each anchor, which makes the learning process more efficient.
For natural language processing tasks, recent studies [28,5] based on the transformer [34] have shown success in transfer learning. Built upon the transformer-based BERT [5] model, learning representations for vision and language tasks by large-scale pre-training has recently attracted much attention. These methods [23,24,46,18,2,7] learn generic representations from a large amount of image-text pairs in a self-supervised manner, and fine-tune the model for the downstream vision and language tasks. Recently, ViLBERT [23] and its multi-tasking version [24] use two parallel BERT-style models to extract features on image regions and text segments, and connect the two streams with co-attentional transformer layers. Moreover, the OSCAR [18] method uses object tags as anchor points to align the vision and language modalities in a shared semantic space. While these approaches aim to learn generic representations for vision and language tasks by training the models on large-scale datasets, we focus on referring expression comprehension, and adopt feature learning techniques to improve the performance by considering synonymous sentences.

Figure 2. Pipeline of the proposed framework. The features of the input image I and referring expression S (with its synonymous sentence S^+ and a negative expression S^-) are first extracted by a visual encoder E_v and language encoder E_l, respectively. Then we adopt two attention modules A_img and A_ins for attending language features to the visual domain on the image and instance levels, where we apply our feature learning losses L_img and L_ins-cl to contrast positive/negative pairs, i.e., {R^+, R^-} and {H^+, H^-}, on the two levels, respectively. On the instance level, a graph convolutional network G is employed to model the relationships between object proposals. Finally, we generate the object bounding box with a detection head D.
In this work, in a similar spirit to feature learning, we integrate the contrastive loss into our model by considering synonymous sentences for referring expression comprehension. While it is natural to use the concept of synonymous expressions in tasks such as image-text retrieval [35] or text-based person retrieval [36], it has not been exploited in referring expression comprehension to improve feature learning and, further, transfer learning. Different from [35,36], which apply a hinge loss on triplets, we consider multiple positive and negative samples for each anchor, and perform contrastive learning on both the image and instance levels. Moreover, we adopt an end-to-end framework where the visual and language features are jointly learned, which is different from [35,36] that use off-the-shelf feature extractors.

Proposed Framework
In this work, we address the problem of referring expression comprehension by using the information in synonymous sentences to learn contrastive features across sentences. Given an input image I and a referring expression S = {w_t}, t = 1, ..., T, consisting of T words w_t, the task is to localize the object identified by the expression. We design a framework composed of a visual encoder E_v and a language encoder E_l for feature extraction in the visual and language domains. Since our goal is to learn contrastive features after mapping the language features to the visual domain, we utilize the attention modules A_img and A_ins for language-to-visual features and a graph convolutional network (GCN) G for aggregating instance-level features, which will be detailed in the following sections. The output of referring expression comprehension is predicted by a detection head D.
Figure 2 shows the pipeline of the proposed framework.
Our method learns contrastive features on two levels to account for both the global context and relationships between object instances. To this end, we enforce that the language-to-visual features on either the image or instance level should be close to each other if the referring expressions are synonymous, and vice versa. Given the image-level attention map R^{l→v} obtained from A_img and the instance-level attention feature H^{l→v} inferred from A_ins followed by the GCN module G, we regularize R^{l→v} and H^{l→v} by leveraging feature learning techniques, guided by the notion of synonymous sentences. As a result, our language-to-visual features contrast the attentions from the language domain to the visual one based on the sentence meanings, which facilitates the comprehension task.

Image-Level Feature Learning
To comprehend the information from the input image and referring expression, features of the two inputs are first extracted by the respective encoders. We then use an attention module A_img that attends the l-dimensional language feature F_l to the visual feature F_v, where h and w denote the spatial dimensions of F_v. A response map R^{l→v} that contains the multimodal information is obtained accordingly, i.e., R^{l→v} = A_img(F_l, F_v) ∈ R^{h×w}. The details of the encoders and attention module A_img are presented later in Section 3.3.
As there are numerous synonymous sentences describing the same object, the language-to-visual features should attend to similar regions regardless of how the object is described. Intuitively, we can apply a triplet loss on the response map to encourage samples with synonymous expressions describing the same object to be mapped closely in the embedding space. Otherwise, they should be mapped far away from each other.
Specifically, for each input image I and a referring expression anchor S, we randomly sample a positive expression S^+ that describes the same object, and a negative expression S^- identifying a different object within the same image. The triplet loss is then computed on the responses generated by attending the three expression samples to the image I:

$$L_{img} = \max\big(d(R, R^+) - d(R, R^-) + \alpha,\; 0\big), \quad (1)$$

where R, R^+ and R^- are the responses of the anchor, positive and negative samples, respectively. In addition, d is the L2 distance between two response maps and α is the margin. After this step, we combine R^{l→v} and F_v via element-wise multiplication ⊗ to produce the attentive feature F^{l→v} = R^{l→v} ⊗ F_v, which is then used as the input to the detection head D (see Figure 2). We note that the triplet loss is applied to the response map R^{l→v} rather than the feature F^{l→v}, since R^{l→v} is easier to optimize with its much lower dimension.
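The image-level triplet loss described above can be sketched as follows. This is a minimal illustration, not the paper's implementation: the function name and the batched (B, h, w) response-map shapes are assumptions.

```python
import torch

def image_level_triplet_loss(resp_anchor, resp_pos, resp_neg, margin=1.0):
    """Triplet loss over (B, h, w) response maps: the anchor-positive L2
    distance should be smaller than the anchor-negative one by the margin."""
    b = resp_anchor.size(0)
    d_pos = torch.norm((resp_anchor - resp_pos).reshape(b, -1), dim=1)  # d(R, R+)
    d_neg = torch.norm((resp_anchor - resp_neg).reshape(b, -1), dim=1)  # d(R, R-)
    return torch.clamp(d_pos - d_neg + margin, min=0.0).mean()
```

When the negative response is farther from the anchor than the positive by at least the margin, the loss vanishes; otherwise the gradient pushes synonymous responses together and negatives apart.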

Instance-Level Feature Learning
In addition to applying the triplet loss on the image level, we also consider features on the instance level to encourage the model to focus more on local cues. However, each proposal (instance) produced by the RoI module only contains the information within its bounding box and does not provide the local context (e.g., interactions with other objects) described in the referring expression. To tackle this problem, we design a graph convolutional network (GCN) [17], in a way similar to [37], to model the relationships between proposals, and then use the features after the GCN as the input to our instance-level feature learning module. More details regarding the implementation are provided in the supplementary material.
Contrastive Feature Learning. As shown in Figure 2, similar to the image level in Section 3.1, we first adopt an instance-level attention module A_ins following [42] to attend the language feature F_l to the RoI proposals, followed by the GCN module G to aggregate the proposal relationships. As a result, we obtain the instance-level language-to-visual features H^{l→v}. Next, we propose to regularize H^{l→v} guided by the concept of synonymous sentences, where the proposal features from referring expressions that describe the same object should be close to each other. Otherwise, they should be apart from each other. To this end, one straightforward way to learn instance-level contrastive features is to apply a triplet loss similar to (1):

$$L_{ins\text{-}tri} = \max\big(d(H, H^+) - d(H, H^-) + \alpha,\; 0\big), \quad (2)$$

where H, H^+, and H^- represent instance-level features of the anchor, positive and negative expressions attending on the visual domain, respectively. Note that each expression may generate different RoI locations. To use the same proposals across samples in the triplet, we select the proposal with the highest IoU score with respect to the ground truth bounding box.
Contrastive Loss with Negative Mining. Although the triplet loss in (2) can be used to learn instance-level contrastive features, it is limited to sampling one positive and one negative at a time, which may not fully exploit multiple synonymous sentences. Therefore, we leverage the contrastive loss [16], which can consider multiple positive/negative samples. Intuitively, we can treat synonymous sentences as positives, but the space of negative samples is large and noisy, as any two sentences describing different objects are negatives. Moreover, it has been shown that finding good negative samples is critical for learning contrastive features effectively [14].
To tackle this issue, we employ the following two strategies to mine useful negative samples: 1) From the perspective of the visual domain, we mine samples that describe the same object category but in different images, and then use the corresponding referring expressions as the negatives. This encourages our model to contrast features that describe similar contents across images, as sentences referring to the same object category usually share common contexts. 2) Considering the language embedding features, we mine the top N samples (i.e., N = 8 in this work) that have language features closest to the anchor sample but come from different images. This helps the model contrast samples that have a similar language structure. Overall, our contrastive loss L_ins-cl can be formulated as:

$$L_{ins\text{-}cl} = -\log \frac{\sum_{H^+ \in Q^+} \exp\big(h(H) \cdot h(H^+)/\tau\big)}{\sum_{H^+ \in Q^+} \exp\big(h(H) \cdot h(H^+)/\tau\big) + \sum_{H^- \in Q^-} \exp\big(h(H) \cdot h(H^-)/\tau\big)}, \quad (3)$$

where Q^+ and Q^- are the sets of positive and negative samples, and τ is the temperature parameter. In practice, we follow the SimCLR [1] method and use h(·) as a linear layer that projects the features H to an embedding space where the contrastive loss is applied.
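A contrastive loss L_ins-cl over multiple positives and negatives, as described above, can be sketched like this. The function name is illustrative, and normalizing the projected embeddings (so the similarity is cosine, as in SimCLR) is an assumption not stated in the text.

```python
import torch
import torch.nn.functional as F

def instance_contrastive_loss(anchor, positives, negatives, proj, tau=0.1):
    """Contrastive loss with multiple positives and negatives.

    anchor:    (d,)  instance-level feature H of the anchor expression
    positives: (P, d) features of synonymous expressions (the set Q+)
    negatives: (N, d) features of mined negatives (the set Q-)
    proj:      projection head h(.) applied before computing similarities
    """
    z = F.normalize(proj(anchor.unsqueeze(0)), dim=1)   # (1, d')
    zp = F.normalize(proj(positives), dim=1)            # (P, d')
    zn = F.normalize(proj(negatives), dim=1)            # (N, d')
    pos = torch.exp(z @ zp.t() / tau).sum()             # positive similarities
    neg = torch.exp(z @ zn.t() / tau).sum()             # negative similarities
    return -torch.log(pos / (pos + neg))
```

The loss is small when the anchor aligns with its positives and is dissimilar to all negatives, and grows as negatives become more similar than positives.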
Discussions. Compared to the instance-level triplet loss in (2), using the contrastive loss in (3) has a few merits. First, given an anchor sample, it can contrast with multiple positive and negative samples at the same time, which is much more efficient than sampling triplets, as shown in [16] and in our ablation study presented later. Second, the projection head h(·) provides a learnable buffer before features are fed into the contrastive loss, which helps the model learn better representations.
In terms of the sampling strategy, different from the negative mining method in [4] that generates negatives with the same object category, we also consider negatives whose language is similar to the anchor's but that contain other object categories. This difference allows us to sample more negatives with similar language structures, which is crucial for our contrastive loss on the language side.
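The second mining strategy (top-N nearest language features from different images) can be sketched as follows. The function name, the plain dot-product score, and the feature-bank layout are assumptions for illustration.

```python
import torch

def mine_language_negatives(anchor_feat, anchor_image_id,
                            bank_feats, bank_image_ids, top_n=8):
    """Return indices of the top-N candidate expressions whose language
    features are most similar to the anchor's, restricted to expressions
    from different images (strategy 2 above)."""
    sims = bank_feats @ anchor_feat                 # (M,) similarity scores
    # Exclude expressions from the anchor's own image from the candidates.
    sims = sims.masked_fill(bank_image_ids == anchor_image_id, float('-inf'))
    n = min(top_n, int((bank_image_ids != anchor_image_id).sum().item()))
    return torch.topk(sims, n).indices
```

In practice the bank would hold the language-encoder features of the training expressions; here it is just a tensor of row vectors.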

Model Training and Implementation Details
In this section, we provide more details on training our framework and the design choices.

Overall Objective. The overall objective of the proposed algorithm consists of the aforementioned loss functions (1) and (3) for learning contrastive features on the image and instance levels, and the detection loss L_det as defined in Mask R-CNN [10], following the MAttNet [42] method:

$$L = L_{det} + L_{img} + L_{ins\text{-}cl}. \quad (4)$$

Implementation Details. For the visual encoder E_v in our framework and the detection head D, we adopt Mask R-CNN [10] as the backbone model, which is pre-trained on COCO training images, excluding those in the validation and testing splits of RefCOCO, RefCOCO+ and RefCOCOg. ResNet-101 [11] is used as the feature extractor, where the output of the final convolutional layer in the fourth block is the feature F_v that serves as the input to the attention module A_img. For the language encoder E_l, we use either the BERT [5] or Bi-LSTM model. In the image-level attention module A_img, we adopt dynamic filters similar to [3] to attend language features to the visual domain and generate R^{l→v}. On the instance level, features after the GCN are duplicated to each spatial location and concatenated with the spatial features after RoI pooling. Then, the concatenated features are fed to the detection head D to generate final results. During testing, the detected object with the largest score is taken as the prediction. The margin α in the triplet losses (1) and (2) is set to 1. In the contrastive loss (3), the temperature τ is set to 0.1, and the projection head h(·) is a 2-layer MLP, projecting the features to a 128-dimensional latent space. We implement the proposed model in PyTorch with the SGD optimizer, and the entire model is trained end-to-end for 10 epochs. The initial learning rate is set to 10^-4 and decreased to 10^-5 after 3 epochs. The source code and trained models are available at https://github.com/wenz116/RefContrast.
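The optimization schedule described above (SGD, 10 epochs, learning rate 10^-4 decayed to 10^-5 after 3 epochs) can be set up in PyTorch as follows; the `Linear` module merely stands in for the full network, and the loss computation is elided.

```python
import torch

# Stand-in for the full network; the real model combines the encoders,
# attention modules, GCN, and detection head.
model = torch.nn.Linear(8, 2)
optimizer = torch.optim.SGD(model.parameters(), lr=1e-4)
# Decay the learning rate by 10x after 3 of the 10 epochs.
scheduler = torch.optim.lr_scheduler.MultiStepLR(optimizer, milestones=[3], gamma=0.1)

for epoch in range(10):
    # ... iterate over batches, compute L = L_det + L_img + L_ins_cl,
    # call loss.backward() and optimizer.step() per batch ...
    optimizer.step()
    scheduler.step()
```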

Experimental Results
We evaluate the proposed framework on three referring expression datasets: RefCOCO [43], RefCOCO+ [43] and RefCOCOg [26,27]. The three datasets are collected on MSCOCO [20] images, but with different ways of generating referring expressions. Extensive experiments are conducted in multiple settings. We first compare the performance of the proposed algorithm with state-of-the-art methods and present an ablation study to show the improvement made by each component. Then we evaluate the models on the unseen Ref-Reasoning [38] dataset to validate the effectiveness on unseen datasets. Furthermore, we conduct experiments in the transfer learning setting, where the pre-trained models are fine-tuned on either the Ref-Reasoning [38] or ReferItGame [15] dataset. More results and analysis are provided in the supplementary material.
Intra- and Inter-Dataset Feature Learning. Since synonymous sentences exist both within one dataset and across datasets, we consider both intra- and inter-dataset feature learning losses. For each input image and referring expression anchor, we sample positive and negative expressions from the same dataset for the intra-dataset case, and expressions from different datasets as inter-dataset samples. In our experiments, we use the intra-dataset loss when training on a single dataset, and both the intra- and inter-dataset losses when jointly training on the three datasets.

Datasets and Evaluation Metric
RefCOCO contains 19,994 images with 142,209 referring expressions for 50,000 objects, while RefCOCO+ is composed of 141,564 expressions for 49,856 objects in 19,992 images. No restrictions are placed on the expressions in RefCOCO, whereas RefCOCO+ forbids location information, making it focus more on the appearance of the target object and its interactions with others. The two testing splits, testA and testB, are generated from images containing multiple people and images containing multiple objects of other categories, respectively. We follow the split of the training, validation and testing images in [43], and there is no overlap across the three sets.
The RefCOCOg dataset consists of 85,474 referring expressions for 54,822 objects in 26,711 images, with longer expressions. There are two splits constructed in different ways. The first split [26] randomly partitions objects into training and validation sets; therefore, the same image can appear in both sets. This validation set is denoted as "val*" in this paper. The second partition [27] randomly splits images into training, validation and testing sets, where we denote the validation and testing splits as "val" and "test", respectively. In our experiments, when jointly training on the three datasets, to avoid overlaps between the training, validation and testing images, we create another split for RefCOCOg, where each set contains the images present in the corresponding set of RefCOCO and RefCOCO+. We denote this split as "RefCOCOg*".

The Ref-Reasoning dataset contains 83,989 images from GQA [13] with 791,956 referring expressions automatically generated from scene graphs. The ReferItGame dataset is collected from ImageCLEF [6] and consists of 130,525 expressions referring to 96,654 objects in 19,894 images.

To evaluate the detection performance, a predicted bounding box is considered correct if the intersection-over-union (IoU) between the prediction and the ground truth bounding box is above 0.5.
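The IoU-based correctness criterion used throughout the experiments can be made concrete with a short helper; the function names and the (x1, y1, x2, y2) box convention are illustrative choices.

```python
def iou(box_a, box_b):
    """Intersection over union of two boxes in (x1, y1, x2, y2) format."""
    ix1, iy1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    ix2, iy2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0

def is_correct(pred_box, gt_box, threshold=0.5):
    """A prediction counts as correct when its IoU with the ground truth
    bounding box exceeds the threshold (0.5 in the evaluations here)."""
    return iou(pred_box, gt_box) > threshold
```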

Evaluation on Seen Datasets
Table 1 shows the results of the proposed algorithm against state-of-the-art methods [27,25,44,21,48,45,42,22,3,37,41,19,40]. All the compared approaches except for the recent methods [3,41,19,40] adopt two-stage frameworks, where the prediction is chosen from a set of proposals. Therefore, their models are not end-to-end trainable, while our one-stage framework is able to learn better feature representations through end-to-end training. Moreover, all of these methods train on each dataset separately and do not consider the varieties of synonymous sentences within/across datasets.
The models in the top and middle groups of Table 1 are trained and evaluated on the same single dataset. We separate the methods using different language encoders for fair comparisons. The results show that the full model trained with our loss functions consistently improves over the baseline model that does not consider synonymous sentences. In addition, our method with either LSTM or BERT as the language encoder performs favorably against other approaches on most dataset splits. Note that the runtime of our unified framework (0.325 seconds per frame) is much faster than the two-stage MAttNet [42] and CM-Att-Erase [22] methods (0.671 and 0.734 seconds, respectively).
The bottom group of Table 1 shows the results of our models jointly trained on the RefCOCO, RefCOCO+ and RefCOCOg* (our split) datasets. Since the three datasets share the same images but contain expressions with very different properties, directly training a single baseline model on all of them causes training difficulties due to the large variety in language. In contrast, by applying the proposed loss terms, the performance improves over the baseline, and the gains are larger than those in the single-dataset setting. We also note that all the training images are the same across these two settings, but the varieties of synonymous sentences are different.

Ablation Study
We present the ablation study results in Table 2 to show the effect of each component in the proposed framework. The models are jointly trained on the RefCOCO, RefCOCO+ and RefCOCOg* datasets. In the top group of Table 2, we first demonstrate that applying either the image-level or instance-level loss improves the performance over the baseline model that does not consider the property of synonymous sentences. In the middle group of the table, the results are further improved in the full models with the loss on both levels, and the one with the contrastive loss on the instance level achieves better performance than the one using the triplet loss. In the bottom group, we provide a detailed ablation study of our model. When removing negative mining or the GCN from the model, the performance is slightly reduced compared to the full model in the middle group. This shows that the proposed feature learning techniques play the main role in the performance improvement, while the strategies regarding negative mining and the GCN also help model training. Furthermore, we present qualitative results in Figure 3, which show that our full model better distinguishes between similar objects and understands relationships across objects.

Evaluation on the Unseen Dataset
To demonstrate that the proposed framework is able to transfer learned features to unseen datasets, we use our trained models to evaluate on the Ref-Reasoning [38] dataset, which contains completely different images and expressions from our training datasets. The results are shown in Table 3, and the performance of our model trained on the Ref-Reasoning dataset in the fully-supervised setting is provided as a reference. We first train our models on the RefCOCOg dataset [27]. When applying the intra-dataset loss in the full model, the performance is improved over the baseline and better than the two-stage MAttNet [42] and CM-Att-Erase [22] approaches, which we evaluate using their official pre-trained models. Then we jointly train our models on the RefCOCO, RefCOCO+ and RefCOCOg* (our split) datasets. When more datasets are used to train our full model, the performance gains over the baseline increase, which demonstrates the effectiveness of our feature learning using synonymous sentences.

Transfer Learning on Unseen Datasets
To validate the feature learning ability of the proposed method, we conduct experiments in the transfer learning setting, where the models are pre-trained on the RefCOCO, RefCOCO+ and RefCOCOg* datasets, and fine-tuned on the Ref-Reasoning [38] or ReferItGame [15] dataset. We also note that the SGMN [38] model in Table 4 uses ground truth proposals from the dataset; it serves as a reference, but is not a direct comparison with our method, which predicts the locations without accessing ground truth bounding boxes.
In Table 4, we first observe that the two "Ours (baseline)" models, with or without the pre-training stage, perform very similarly to each other on Ref-Reasoning [38]. This shows that it is not trivial to learn transferable features by simply pre-training on existing datasets. However, by introducing our feature learning schemes, either during pre-training (Pre) or fine-tuning (FT), our models achieve better performance. When using our method in both the pre-training and fine-tuning stages, the performance is further improved.
Similar comparisons can be observed in Table 5, where the models are fine-tuned on the ReferItGame [15] dataset. In the bottom group, we demonstrate that the feature learning technique consistently improves the performance via the proposed loss terms. Compared to the results in the top group, although our baseline model does not outperform ZSGNet [30] and Darknet-LSTM [41], we achieve significantly better performance by applying our proposed loss functions. This shows the ability of our method to transfer learned representations to other datasets.

Analysis of Learned Features
To demonstrate the effectiveness of the proposed loss, we calculate the similarity (dot product) between the instance-level features of synonymous sentences when jointly training on the RefCOCO, RefCOCO+ and RefCOCOg* datasets. We randomly sample a pair of synonymous sentences for each image, and compute the average value over all samples in the validation and testing splits of each dataset. Table 6 shows the similarity scores on the RefCOCO, RefCOCO+ and RefCOCOg* datasets, while Table 7 provides the similarity computed on the Ref-Reasoning and ReferItGame datasets in the transfer learning setting. From the results, we observe that in all settings, the similarity between synonymous sentences in the embedding space is higher when the proposed loss terms are applied, which shows that our model is able to transfer the learned features to unseen datasets.
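The similarity measurement above (one randomly sampled pair of synonymous-sentence features per image, dot products averaged over the split) can be sketched as follows; the function name and the per-image feature layout are assumptions.

```python
import torch

def average_synonym_similarity(per_image_feats):
    """Average dot-product similarity between a randomly sampled pair of
    synonymous-sentence features per image.

    per_image_feats: list of (n_i, d) tensors, one per image, holding the
    instance-level features of that image's synonymous expressions
    (each n_i >= 2).
    """
    sims = []
    for feats in per_image_feats:
        i, j = torch.randperm(feats.size(0))[:2]   # random pair of expressions
        sims.append(torch.dot(feats[i], feats[j]))
    return torch.stack(sims).mean().item()
```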

Conclusions
In this paper, we focus on the task of referring expression comprehension and tackle the challenge caused by the variety of synonymous sentences. To deal with this problem, we propose an end-to-end trainable framework that accounts for the paraphrasing property of language to learn contrastive features. To this end, we employ feature learning techniques at both the image level and the instance level, encouraging language features describing the same object to lie close together in the visual embedding space while separating expressions that identify different objects. We design two negative mining strategies to further facilitate the learning process. Extensive experiments and ablation studies on multiple referring expression datasets demonstrate the effectiveness of the proposed algorithm. Moreover, in the cross-dataset and transfer learning settings, we show that the proposed method is able to transfer learned representations to other datasets.
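As a sketch of the idea, an instance-level objective of this form can be written as an InfoNCE-style contrastive loss (a hypothetical illustration with our own temperature choice; not the paper's exact formulation):

```python
import numpy as np

def contrastive_loss(f, f_pos, f_negs, tau=0.1):
    """Pull a synonymous pair together, push negative expressions away.

    f, f_pos: (D,) features of two synonymous expressions for the same object;
    f_negs:   (M, D) features of expressions referring to other objects.
    """
    pos = np.dot(f, f_pos) / tau                 # similarity to the positive
    negs = f_negs @ f / tau                      # similarities to negatives
    logits = np.concatenate(([pos], negs))
    m = logits.max()
    lse = m + np.log(np.exp(logits - m).sum())   # stable log-sum-exp
    return float(lse - pos)                      # -log softmax prob of positive
```

The loss is small when the synonymous pair is closer than every negative, and grows when a negative expression dominates, which matches the behavior described above.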
In this appendix, we provide additional analysis and experimental results, including 1) more implementation details and runtime performance, 2) evaluation on the unseen ReferItGame dataset, and 3) more qualitative results for referring expression comprehension.

A. Implementation Details and Runtime Performance
The BERT [5] model in our framework is pre-trained on BookCorpus [47] and English Wikipedia. We use the BASE model, which has 12 layers of Transformer blocks, where each layer has hidden states of size 768 and 12 self-attention heads. For the GCN in our method, the object proposals are generated from Mask R-CNN [10]. We keep the top K detection candidates for each image, where K is set to 20. We then construct a graph G = (V, E) from the set of object proposals P = {p_i}_{i=1}^{K}, where each vertex v_i ∈ V corresponds to an object proposal p_i, and each edge e_ij ∈ E models the pairwise relationship between instances p_i and p_j. In the instance-level attention module A_ins, we compute the word attention on each object proposal to focus on instances that are referred to by the sentence. The word attention a_i on the proposal p_i is defined as the average of the probabilities that each word w_t refers to p_i:

a_i = (1/T) * Σ_{t=1}^{T} exp(s_{i,t}) / Σ_{j=1}^{K} exp(s_{j,t}),   (5)

where T is the number of words in the sentence, and s_{i,t} is the inner product between the feature f_{w_t} of word w_t and the average pooled feature f_{p_i} of proposal p_i. To compute the feature node f_{v_i} at vertex v_i, we first concatenate the average pooled feature f_{p_i} and the 5-dimensional location feature [26] [x_tl/W, y_tl/H, x_br/W, y_br/H, wh/(WH)] of proposal p_i, where (x_tl, y_tl) and (x_br, y_br) are the coordinates of the top-left and bottom-right corners of the proposal, h and w are the height and width of the proposal, and H and W are the height and width of the image. Then, the concatenated feature is multiplied by the word attention a_i in (5) to form f_{v_i}. We use a two-layer GCN to capture second-order interactions.
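A minimal NumPy sketch of this word-attention computation (assuming, as a plausible reading, that the per-word probabilities come from a softmax over the K proposals; shapes and function names are our own):

```python
import numpy as np

def word_attention(f_w, f_p):
    """Average probability that each word refers to each proposal.

    f_w: (T, D) word features; f_p: (K, D) average-pooled proposal features.
    Returns a: (K,) attention weight a_i per proposal.
    """
    s = f_p @ f_w.T                       # (K, T) inner products s_{i,t}
    p = np.exp(s - s.max(axis=0))         # softmax over the K proposals ...
    p = p / p.sum(axis=0, keepdims=True)  # ... for each word w_t
    return p.mean(axis=1)                 # average over the T words

def location_feature(x_tl, y_tl, x_br, y_br, W, H):
    """5-d location feature [x_tl/W, y_tl/H, x_br/W, y_br/H, wh/(WH)]."""
    w, h = x_br - x_tl, y_br - y_tl
    return np.array([x_tl / W, y_tl / H, x_br / W, y_br / H,
                     (w * h) / (W * H)])
```

The returned attention vector is then used to weight each proposal's concatenated appearance-plus-location feature before the GCN layers.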
Our framework is implemented on a machine with an Intel Xeon 2.3 GHz processor and an NVIDIA GTX 1080 Ti GPU with 11 GB memory. We report the runtime performance in Table 8. Our end-to-end model takes 0.325 seconds to process one image in a single forward pass, which is much faster than MAttNet [42] and CM-Att-Erase [22], both of which require multiple steps of inference. All the source code and models will be made available to the public.

B. Ablation Study
In Table 9, we provide a further ablation study on the language encoder in our framework. The models are jointly trained on the RefCOCO, RefCOCO+ and RefCOCOg* datasets. We show the results of the baseline model without our loss and of the full model using either Bi-LSTM or BERT as the language encoder. The results demonstrate that the performance gains over the baseline are mainly contributed by the proposed image- and instance-level loss (row 1 vs. row 3) rather than by the BERT model (row 2 vs. row 3).

Table 8. Runtime comparison per image.

Method              Time (s)
MAttNet [42]        0.671
CM-Att-Erase [22]   0.734
Ours                0.325

C. More Evaluation on the Unseen Dataset
To demonstrate the ability of our model to transfer learned features to unseen datasets, we use the models trained under various settings to evaluate on the ReferItGame [15] dataset, which consists of completely different images and expressions from our training datasets. The results are presented in Table 10, and the performance of our model trained on the ReferItGame dataset under the fully supervised setting is provided for reference. We first show the results of models trained on the RefCOCOg dataset.
When training with the proposed intra-dataset loss, the performance improves over the baseline and surpasses the two-stage MAttNet [42] and CM-Att-Erase [22] approaches. Note that all the models use the same ResNet-101 backbone as the feature extractor. We then jointly train our models on the RefCOCO, RefCOCO+ and RefCOCOg* (our split) datasets. When more datasets are included to train our full model, we observe a larger performance gain over the baseline, which demonstrates the effectiveness of our feature learning technique in handling synonymous sentences.

D. Qualitative Results
In Figures 4, 5, and 6, we present more qualitative results generated by our full model trained on the RefCOCO, RefCOCO+ and RefCOCOg* datasets. The proposed method is able to localize objects accurately given synonymous sentences with paraphrases and can also distinguish between sentences that describe different objects. We also provide some failure cases of our method in Figure 7. While the proposed algorithm is effective for referring expression comprehension, it still suffers from some unfavorable effects, such as objects sharing similar attributes or ambiguous sentences.

Figure 7. Failure cases of our method. The red and yellow boxes represent the ground truth and our results, respectively.

Figure 3. Sample results of jointly training on the RefCOCO, RefCOCO+ and RefCOCOg* datasets. The green and yellow boxes represent the results of the baseline and our full model, respectively.

Figure 4. Sample results of jointly training on the RefCOCO, RefCOCO+ and RefCOCOg* datasets.

Figure 5. Sample results of jointly training on the RefCOCO, RefCOCO+ and RefCOCOg* datasets.

Figure 6. Sample results of jointly training on the RefCOCO, RefCOCO+ and RefCOCOg* datasets.

Table 1. Comparisons with state-of-the-art methods. The models in the top and middle groups are trained on a single dataset, while those in the bottom group are jointly trained on the RefCOCO, RefCOCO+ and RefCOCOg* (our split) datasets, where the same images are shared among the three datasets but with more expressions than in the top and middle groups.

Table 2. Ablation study of jointly training on three datasets. The top and middle groups demonstrate the effectiveness of the proposed feature learning techniques at different levels, and the superiority of the contrastive loss over the triplet loss at the instance level. The bottom group shows the influence of negative mining and the GCN in our model.

Table 3. Evaluation on the Ref-Reasoning dataset with models trained on different datasets. The models are trained on Ref-Reasoning (top group), RefCOCOg (middle group), or the RefCOCO, RefCOCO+ and RefCOCOg* datasets (bottom group).

Table 4. Transfer learning on the Ref-Reasoning dataset with different settings of pre-training (Pre) and fine-tuning (FT). The models in the top group are directly trained on Ref-Reasoning, while those in the bottom group are pre-trained on RefCOCO, RefCOCO+ and RefCOCOg*, and fine-tuned on Ref-Reasoning. Note that * indicates that SGMN [38] uses ground truth proposals to generate final outputs, and serves as a reference here.

Table 5. Transfer learning on the ReferItGame dataset with different settings of pre-training (Pre) and fine-tuning (FT). The models in the top group are directly trained on ReferItGame, while those in the bottom group are pre-trained on RefCOCO, RefCOCO+ and RefCOCOg*, and fine-tuned on ReferItGame.

Table 6. Similarity between instance-level features of synonymous sentences. The models are jointly trained on the RefCOCO, RefCOCO+ and RefCOCOg* datasets.

Table 7. Similarity between instance-level features of synonymous sentences. All models are jointly trained on the RefCOCO, RefCOCO+ and RefCOCOg* datasets. Only the models in the bottom group are fine-tuned on Ref-Reasoning or ReferItGame.

Table 10. Evaluation on the ReferItGame dataset with models trained on different datasets. The models are trained on ReferItGame (top group), RefCOCOg (middle group), or the RefCOCO, RefCOCO+ and RefCOCOg* datasets (bottom group).