Natural language guided object retrieval in images

The abilities to understand the surrounding environment and to communicate with interacting humans are important functionalities for many automated systems in which visual input (e.g., images, video) and natural language input (speech or text) have to be related to each other. Possible applications are automatic image caption generation, interactive surveillance systems, and human-robot interaction. In this paper, we propose algorithms for automatic responses to natural language queries about an image. Our approach uses a predefined neural net for detection of bounding boxes and objects in images. Spatial relations between bounding boxes are modeled with a neural net, the queries are analyzed with a syntactic parser, and algorithms to map natural language to properties in the images are introduced. The algorithms make use of semantic similarity and antonyms. We evaluate the performance of our approach with test users assessing the quality of our system’s generated answers.


Introduction
In human-computer interaction and human-robot interaction, a computer's or robot's ability to communicate in natural language with interacting humans depends mainly on two factors: the system has to understand the surrounding environment and has to be able to communicate in natural language about the environment. An environment might be the real world, a video or an image and consists of visual percepts such as objects, spatial relations, colors, and actions. The process of relating linguistic symbols in natural language input (e.g., nouns, verbs) to visual percepts is called language grounding.
One example of a scenario where language has to be grounded is a surveillance system, where an operator may ask questions about what is happening in the video streams.
In this paper, we propose an architecture and algorithms for language grounding, given an image and a natural language text query. In particular, we describe our approach for analyzing and labeling visual percepts and a method for analyzing linguistic symbols, and we introduce algorithms that map linguistic symbols occurring in the text query to visual percepts.
We consider three types of text queries: 1. Attention queries (type 1), e.g., "find the person to the right of the monitor," 2. Relation queries (type 2), e.g., "where is the person?", 3. Identification queries (type 3), e.g., "what is to the right of the monitor?" For attention queries, our system returns two bounding boxes (which are rectangular boxes around objects in an image). For relation queries, our system returns two bounding boxes, an object category, and a spatial relation. For identification queries, two bounding boxes and an object category are returned.
Our approach builds on usage of a predefined neural net for detection of bounding boxes and objects in images. Spatial relations between bounding boxes are modeled with a neural net, the text queries are analyzed with a syntactic parser, and algorithms using semantic similarity and antonyms output the result of language grounding.
The performance of the developed system was assessed by test users who reported how well the system's generated answers matched their own view for a given set of images and questions.
The paper is structured as follows. In Sect. 2, we discuss existing work and approaches for language grounding in images. In Sect. 3, we give an overall description of the proposed solution, followed by separate sections on image analysis in Sect. 4, spatial relations analysis in Sect. 5, and analysis of queries in Sect. 6. Section 7 describes novel algorithms for language grounding, through which natural language words are mapped to objects and spatial relations in an image. We evaluate our approach in Sect. 8, present results in Sect. 9, and conclude the paper with a discussion of results and suggested future work in Sect. 11.

Related work
Language grounding is the process of connecting linguistic symbols (e.g., nouns, verbs, sentences) to visual percepts in an image or video (e.g., objects, spatial relations between objects, actions). In particular, given an image containing a set of visual percepts $V = \{v_1, v_2, \ldots, v_k\}$ and a text query containing a set of linguistic symbols $T = \{t_1, t_2, \ldots, t_p\}$, the language grounding problem is an assignment problem where linguistic symbols $t_i \in T$ should be assigned to visual percepts $v_j \in V$. In the literature, the language grounding problem is investigated under different names and approaches such as object retrieval using language grounding, relationship detection and Visual Question Answering (VQA).
Methods for object retrieval using language grounding identify objects in an image based on a query text that includes properties of objects such as attributes, categories and spatial relations.
Addressing a problem similar to ours, Guadarrama et al. [11] developed a system that localizes objects in images based on a text query. It generates text as a bag-of-words for each candidate box, using the class labels predicted by a pretrained object classifier, and then compares the text query with the bags. Ronghang et al. [15] developed a Spatial Context Recurrent ConvNet (SCRC) model as a scoring function for measuring the similarity of a text query and candidate boxes by integrating spatial configurations and global scene-level contextual information into the network. This work motivated our approach of using a similarity function to determine the analogy between the text query fragments, detected objects and their spatial relations.
Similarly, in [19], given an image and the text query, words in the query are aligned with image regions by embedding, with a ranking loss, the objects detected by a pretrained object detector and the text fragments from a parser. The work in [21] introduced the Logical Semantics with Perception (LSP) model, which learns to map natural language sequences to related objects in an image by grounding language acquisition. The authors in [33] developed an attention model which learns to ground phrases in images using the regions of an image that best reconstruct the phrase. Other methods generally use visual features that are generated from an input text query, and match them to image regions to find the object of interest [2,27].
A crucial step in these works is generating text captions from the images to describe objects and their relations. These captions are used to recognize the best match to the input query text. Methods based on recurrent neural networks [8,28,35] have been shown to be effective for caption generation. In the proposed method, we also generated image captions and evaluated them, but since this work is focused only on retrieving objects and their relations based on natural language, image captions are generated as a set of tuples including detected objects and their spatial relations. Our method is similar to these works in that we also compare the text query with the detected objects and their spatial relations. It differs in that the other works train neural networks, mostly based on long short-term memory (LSTM), to learn the similarity and alignment between them. In our method, the process is simplified by using a similarity function, which removes the need for training and the associated datasets and also circumvents the non-transparency of approaches that solely use machine learning methods.
VQA is the task of answering a natural language question about images, and is sometimes referred to as a "visual Turing test" [10,25]. A large variety of VQA algorithms have been proposed in recent years. In all methods, features are extracted both from the image and from a corresponding query text, and are then combined and fed to a classifier to predict an answer. Classical approaches are given in [21,31], which both use a semantic parser and, instead of learning compositional operations, depend on fixed logical inferences. Recently proposed approaches differ in how they combine image and question features to infer an answer. In [1,18,42], features are merged by concatenation, element-wise addition, and multiplication, and fed to a neural network or linear classifier to predict the answer. In [37], a CNN- and LSTM-based model is introduced that first recovers a structural scene representation from the images of various block world scenarios and then translates a natural language question into a program, which is then used to obtain an answer considering the given scene representation.
In [16,21], the authors use attention models to achieve better alignment between text and visual features. Similarly, the authors in [38] use language and visual attention mechanisms to map a natural language expression describing parts of an image, into the corresponding visual percepts.
In [34], the authors developed a method that learns to identify image regions that are most relevant to a given text query and uses these regions to answer the question. The work in [36] proposed a spatial inference method, a spatial memory network for VQA, to answer questions about images. The method uses a two-hop model: in the first hop, the attention process extracts the image regions which correspond to individual words in the question, and in the second hop it predicts the answer using collected fine-grained evidence from the first hop and embeddings of the entire question. The authors in [24] introduce ViLBERT (short for Vision-and-Language BERT), which learns joint representations of vision and natural language features and aims to make visual language grounding pretrainable and task-agnostic.
The work on relationship detection is most similar to our work inasmuch as the natural language processing component is kept simple and the focus is on the detection of relationships between two objects in an image. In [7], the authors introduce an approach to detect visual relationships between two objects in an image using deep relational networks (DR-Net). Similar to our approach, bounding boxes of two objects in an image are extracted and an output of the triplet form (s, r, o) is inferred, where r describes the relation (e.g., spatial relation) between two objects s and o. The approach in [7] exploits spatial configurations and statistical evidence among the two objects and their relation via a deep relational network. In [39], the authors use various versions of CNNs to detect undetermined relationships between objects, that is, relationships that are not labeled as such or have false labels (e.g., a guitar being labeled as a lamp). The authors use a similarity measure and frequencies of the two objects and their relationship to transform these into probability distributions. In [40], the model is trained to detect triplets of the form (s, r, o) combining a semantic inference module and visual features. After a feature fusion step, the relationship between the two objects in question is predicted. The authors in [43] utilize Faster R-CNN [32] for object detection, followed by a relationship prediction composed of feature-level and label-level prediction for learning the objects-relationship triplets. In [22], a reinforcement learning framework for detecting visual relationships is proposed. The approach utilizes a directed semantic action graph for a given image for the prediction of the objects-relationship triplets and takes into account global context cues that describe the interactions among different objects in the image.
In all of these works, the focus is on extracting features from the query text and image and developing neural networks, mostly deep CNNs, to learn embeddings of features such that the best answer to the question is selected. In our approach, instead of utilizing CNNs, we use a function for each type of question based on the probability of detected objects, generated captions of the image and semantic similarity analysis of the query text. In this way, the developed system does not require training, and the process of answer selection is less computationally demanding and less complex than with deep neural networks, as fewer parameters are introduced. This enables us to identify the exact shortcomings in the language grounding process and improve its performance. Furthermore, since we do not rely on training data in the process of embedding text and image features for answer selection, the proposed approach has advantages for under-resourced languages, for which training data either do not exist or are difficult to obtain compared with the datasets available for languages such as English.
Undoubtedly, there are many advantages with purely machine learning based approaches; however, it also has become clear that unwanted bias or discrimination is perpetuated due to already biased data or the machine learning approach (see [12] for an overview). In particular, the authors in [41] describe the bias found in visual language grounding. There are many ways to mitigate unwanted bias or discrimination in the learned model (for example, [5,17]), and one approach is to build hybrid systems that utilize the advantages of machine learning methods and algorithmic approaches, resulting in more transparent and interpretable systems.

[...] "woman," "car"). The bounding boxes and object categories are used as input to the spatial relations analysis, which outputs spatial relation words (e.g., "left," "under," "behind") for pairs of bounding boxes. The input to the text query analysis is a text query, and the output is n-tuples of linguistic symbols (e.g., noun) describing objects and spatial relations relevant for the processing of the query.
The output of the three parts is used as input for the algorithms for language grounding, where the n-tuples are mapped to the generated bounding boxes, object categories, and spatial relations. The general approach is to combine probabilistic measures of object categories and spatial relations generated from the image with measures of word similarity, such that the most probable grounding can be done.
In the following sections, image analysis, spatial relations analysis, text query analysis, and the algorithms for language grounding are described in more detail.

Image analysis
Given an image, the purpose of the image analysis part is to generate bounding boxes for identified objects in the image along with associated probabilities for the identified objects being of a certain object category (such as person, monitor, car). We used YOLO V2 [30], a pretrained neural network that recognizes $N_{OBJ} = 80$ object categories. It is very fast, which makes it a good candidate for real-time systems (for more details see reference [30]). We let $B = \{b_1, b_2, \ldots, b_{N_{BB}}\}$ denote the set of detected bounding boxes, with $b_i = (x_1, y_1, x_2, y_2)$, where $(x_1, y_1)$ and $(x_2, y_2)$ represent the upper left and lower right corners, respectively, and we let $op_{i,k}$ denote the estimated probability that $b_i$ contains an object of category $k$. In our algorithms, we consider for each bounding box $b_i$ the object categories with the five highest probabilities $op_{i,k}$, and set the other $op_{i,k}$ to zero.
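The top-five filtering described above can be sketched as follows. This is a minimal illustration, not the system's implementation: the function name `keep_top_k` and the toy probability vector (8 categories instead of 80) are assumptions made for the example.

```python
def keep_top_k(probs, k=5):
    """Zero out all but the k largest probabilities of one bounding box."""
    # Indices of the k largest entries (ties broken by lower index first).
    top = sorted(range(len(probs)), key=lambda idx: (-probs[idx], idx))[:k]
    keep = set(top)
    return [p if idx in keep else 0.0 for idx, p in enumerate(probs)]


# Toy example with 8 categories instead of 80 (values are made up).
op_i = [0.02, 0.61, 0.05, 0.01, 0.20, 0.03, 0.07, 0.01]
filtered = keep_top_k(op_i, k=5)
```

The remaining values are deliberately not renormalized here, since the text only states that the other probabilities are set to zero.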

Spatial relations analysis
We want the system to identify and correctly denote the spatial relation between pairs of bounding boxes in an image. A classifier was constructed to map the coordinates of pairs of boxes to an integer $k \in \{1, 2, 3, 4, 5, 6\}$, corresponding to an element in $SR = \{SR_1, SR_2, \ldots, SR_{N_{SR}}\}$, with spatial relations "left," "right," "top," "under," "front," and "behind," respectively. We used a weighted average probabilities network (WAP), with inputs from three equally weighted classifiers: a multilayer perceptron (MLP) with 4 layers and 10 neurons in each, a K-nearest neighbors classifier with K = 5, and a support vector machine (SVM). The classifiers were trained using 505 images with a total of 1515 manually labeled spatial relations between bounding boxes. The bounding box coordinates were scaled to make them size and location invariant, and the resulting vector Z was used as input to the classifiers. Data were split into training and testing sets, and the WAP classifier achieved 72.7% accuracy in fivefold cross-validation.
Since the WAP classifier outputs probability estimates for all 6 spatial relations $k \in \{1, 2, 3, 4, 5, 6\}$, these values were used as estimates of $sp_{i,j,k}$ (the probability that the spatial relation between the bounding boxes $b_i$ and $b_j$ is $k$).
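To make the averaging step concrete, the sketch below combines the probability vectors of three classifiers with equal weights. It only illustrates the weighted-average idea; the function name and the three input vectors are made up for the example and are not taken from the trained MLP, K-nearest neighbors, and SVM classifiers.

```python
SPATIAL_RELATIONS = ["left", "right", "top", "under", "front", "behind"]


def weighted_average_probabilities(prob_vectors, weights=None):
    """Combine per-classifier probability vectors into one distribution."""
    if weights is None:
        weights = [1.0] * len(prob_vectors)  # equal weights, as in the text
    total = sum(weights)
    n = len(prob_vectors[0])
    return [
        sum(w * vec[k] for w, vec in zip(weights, prob_vectors)) / total
        for k in range(n)
    ]


# Made-up outputs of three classifiers for one ordered pair of boxes.
mlp = [0.10, 0.70, 0.05, 0.05, 0.05, 0.05]
knn = [0.20, 0.60, 0.00, 0.00, 0.20, 0.00]
svm = [0.15, 0.65, 0.05, 0.05, 0.05, 0.05]

sp_ij = weighted_average_probabilities([mlp, knn, svm])
best = SPATIAL_RELATIONS[max(range(len(sp_ij)), key=lambda k: sp_ij[k])]
```

Note that the code uses zero-based indices, whereas the text numbers the spatial relations from 1 to 6.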
Summarizing, for a given image I, we have a set of bounding boxes $B = \{b_1, b_2, \ldots, b_{N_{BB}}\}$. In addition, we have for all pairs $(b_i, b_j)$, $i \neq j$, the probability $sp_{i,j,k}$ of the pair having a spatial relation $k$, where $k \in \{1, 2, 3, 4, 5, 6\}$ (corresponding to the spatial relations {"left," "right," "top," "under," "front," "behind"}, respectively). For example, $sp_{1,2,3}$ denotes the probability that object 1 is on top of object 2. In order to achieve a grounding between text and image, we later compute the similarity between the text labels obtained from the image analysis and the text labels obtained from the query analysis.

Analysis of queries
We consider three types of queries:
1. Attention queries (type 1). Attention queries are commands in imperative form, beginning with a verb, containing two nouns and a spatial relation between the nouns. For example, the attention query "find the person to the right of the monitor" begins with the verb "find," has the two nouns "person" and "monitor," and a spatial relation "to the right." Our system returns two bounding boxes containing objects that correspond to the two nouns in the query and that have the spatial relation described in the query.
2. Relation queries (type 2). Relation queries begin with the word "where" and contain a noun. For example, the relation query "where is the person?" contains the noun "person." Our system returns two bounding boxes: one containing an object equal or similar to the noun in the query (e.g., "person" or "human"), and another containing an object in the image that is not linguistically expressed in the query. The spatial relation between the two bounding boxes (e.g., "to the right of") and the name of the object in the latter box are also returned.
3. Identification queries (type 3). Identification queries begin with the word "what" and contain a spatial relation and a noun. For example, the identification query "what is to the right of the monitor?" contains the spatial relation "to the right of" and the noun "monitor." Our system returns two bounding boxes, one containing the noun in the identification query (e.g., "monitor"), and the other having the given spatial relation to the first box. The name of the object in the latter box is also returned.
Given a query q, we use the Stanford CoreNLP parser [26] to extract syntactic categories in q. The syntactic categories determine the type of the query (i.e., 1, 2, or 3 as described above). We analyze the input query on three levels: the clausal, the phrasal, and the word level. Table 1 shows the extracted syntactic categories for each type of query. In particular, the syntactic categories and their quantities on the word level determine the type of input query. For example, as shown in Table 1, a query q is of type 1 if q has a syntactic category VB and three nouns NN.
The syntactic categories S and SBARQ indicate a declarative clause and a question initiated by a wh-word, respectively. The syntactic categories VP, NP, WHADVP and WHNP stand for verbal phrase, nominal phrase, wh-adverb phrase and wh-noun phrase, respectively. The syntactic categories VB, NN, WRB and WP indicate verbs, nouns, wh-adverbs and wh-pronouns, respectively.
Once we have extracted the syntactic categories and determined the type of query, we create tuples that are later used for the language grounding. In particular, we consider the words that are of syntactic category NN in the order in which they appear in the query:
- Given a type 1 query, we consider the words that are of category NN, that is, $NN_1$, $NN_2$ and $NN_3$. We let $NN_1$ be $entity_1$, $NN_2$ be $sr_1$ and $NN_3$ be $entity_2$ and create the tuple $\langle entity_1, sr_1, entity_2 \rangle$. For example, in the query "find the person to the right of the monitor" we have three words that are NN, namely "person," "right" and "monitor," and we generate the tuple $\langle$person, right, monitor$\rangle$, describing that the person is to the right of the monitor.
- Given a type 2 query, we consider the single word of category NN, let NN be $entity_1$, and create $\langle entity_1 \rangle$. For example, the query "where is the person?" has one NN, namely "person," and we generate $\langle$person$\rangle$.
- Given a type 3 query, we consider the two words that are NN, that is, $NN_1$ and $NN_2$. We let $NN_1$ be $sr_1$ and $NN_2$ be $entity_2$ and create $\langle sr_1, entity_2 \rangle$. For example, in "what is to the right of the monitor?", the words "right" and "monitor" are NNs and we generate the tuple $\langle$right, monitor$\rangle$.
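The tuple construction can be sketched on top of already tagged tokens. This is an illustration under stated assumptions: the function `analyze_query` is hypothetical, the part-of-speech tags are written by hand rather than produced by the Stanford CoreNLP parser, and the rules for types 2 and 3 (a WRB tag with one NN, a WP tag with two NNs) are inferred from the description above, since only the type 1 rule is stated explicitly.

```python
def analyze_query(tagged):
    """Return (query_type, tuple) from a list of (word, tag) pairs."""
    tags = [tag for _, tag in tagged]
    nouns = [word for word, tag in tagged if tag == "NN"]
    if "VB" in tags and len(nouns) == 3:    # attention query
        return 1, (nouns[0], nouns[1], nouns[2])
    if "WRB" in tags and len(nouns) == 1:   # relation query
        return 2, (nouns[0],)
    if "WP" in tags and len(nouns) == 2:    # identification query
        return 3, (nouns[0], nouns[1])
    raise ValueError("query does not match any of the three types")


# Hand-tagged tokens; a real parser may tag some words differently.
tagged_query = [
    ("find", "VB"), ("the", "DT"), ("person", "NN"), ("to", "TO"),
    ("the", "DT"), ("right", "NN"), ("of", "IN"),
    ("the", "DT"), ("monitor", "NN"),
]
query_type, query_tuple = analyze_query(tagged_query)
```

Raising an error for unmatched queries is a design choice of this sketch; the text does not say how such queries are handled.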

Natural language grounding
After processing the input image and text query, the following data are available:
- $N_{BB}$ bounding boxes and estimated probabilities $op_{i,k}$ for bounding box $b_i$ containing an object of category $k$.
- Estimated probabilities $sp_{i,j,k}$ for bounding boxes $i$ and $j$ having a spatial relation $k$.
- Query type and tuples containing object names and spatial relation labels extracted from the query q.
These data are used in algorithms for generation of appropriate responses to the queries. To improve matching of object labels given in the query with labels returned from the image analysis, a method to identify semantic similarity between words was employed. We used the similarity function from spaCy [14], an open-source library for natural language processing. Based on the angle between two input word vectors, it computes a similarity measure between 0 and 1. For example, the words "person" and "person" have a similarity value of 1.0, while "person" and "woman" have a value of 0.84, and "monitor" and "person" a value of 0.79. Similarity is computed between the labels of detected objects and the object names in the query (see the example in Table 4). Similarity is also computed for spatial relations $SR_1, SR_2, \ldots, SR_6$ and words expressing spatial relations in the queries (see the example in Table 5). Antonyms of spatial relations were also considered to improve performance. For example, when trying to find a person to the right of a monitor, the system also considered finding a monitor to the left of a person. Antonyms were extracted using WordNet [29] alongside the NLTK module [4].

Fig. 1 Image with three detected bounding boxes. A list of possible objects in each box, and corresponding probabilities, is given in Table 2

Table 2 Object labels and probabilities for the 5 most probable objects in each one of the 3 bounding boxes shown in Fig. 1
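As a rough illustration of the two ingredients above, the following sketch computes a cosine similarity between word vectors and looks up the inverse of a spatial relation. It is not the spaCy or WordNet code path used in the system: the toy vectors and the hard-coded `ANTONYM` table are assumptions made only so that the example is self-contained. Note also that a plain cosine can be negative, whereas the measure described above lies between 0 and 1.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


# Hypothetical stand-in for the WordNet antonym lookup, restricted to the
# six spatial relations handled by the classifier.
ANTONYM = {
    "left": "right", "right": "left",
    "top": "under", "under": "top",
    "front": "behind", "behind": "front",
}

# Toy 3-dimensional "word vectors" (made up, not real embeddings).
vec = {
    "person": [0.9, 0.3, 0.1],
    "woman": [0.8, 0.5, 0.1],
    "monitor": [0.1, 0.2, 0.9],
}
sim_person_woman = cosine_similarity(vec["person"], vec["woman"])
sim_person_monitor = cosine_similarity(vec["person"], vec["monitor"])
```

With these toy vectors, "person" is closer to "woman" than to "monitor", which mirrors the ordering of the similarity values quoted above but not their magnitudes.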
In the following subsections, the algorithms for each one of the three query types are described in detail.

Attention queries
Algorithm 1
Input: An image I and an attention query q (such as "find the person to the right of the monitor"), containing an object denoted $entity_1$, a spatial relation denoted $sr_1$, and a second object denoted $entity_2$.
Output: A bounding box $\beta_1$ containing an object of type $entity_1$, and a bounding box $\beta_2$, where $sr_1$ describes the spatial relation between $\beta_1$ and $\beta_2$.
Method: Calculate $\beta_1$ and $\beta_2$ as follows:
1. Syntactically analyze q to generate $entity_1$, $sr_1$, and $entity_2$, where $sr_1$ is a spatial relation word, and $entity_1$ and $entity_2$ are nouns referring to objects (see Sect. 6).
2. Generate the inverse relation $sr_2$ as the antonym of $sr_1$.
3. Analyze I to generate a set of $N_{BB}$ bounding boxes $B = \{b_1, b_2, \ldots, b_{N_{BB}}\}$ with associated object category probabilities $op_{i,k}$ (see Sect. 4).
4. Estimate the probabilities $sp_{i,j,k}$ of the spatial relations $k$ for all pairs $(i, j)$ of bounding boxes (see Sect. 5).
5. Compute the most likely bounding boxes $i$ and $j$ by solving the maximization problem in Eqs. 4 and 5.
6. Let $\beta_1 = b_i$ and $\beta_2 = b_j$.
The steps in the algorithm are illustrated in the following example, for the input query "find the person to the right of the monitor" and the input image shown in Fig. 1 (all images are from the Visual Genome dataset [20] and used with permission; see acknowledgments for details):
1. The syntactic analysis of q yields $entity_1$ = "person," $sr_1$ = "right," $entity_2$ = "monitor."
2. The antonym of $sr_1$ is computed as $sr_2$ = "left."
3. In the input image, three bounding boxes are generated. The five most probable objects for each box are listed in Table 2.
4. The neural network estimates probabilities $sp_{i,j,k}$ for the six spatial relations $k$ for all pairs $(i, j)$ of bounding boxes, as shown in Table 3. For example, box 1 is located to the right of box 2 with probability 0.919, which is the value assigned to $sp_{1,2,2}$.
5. The most likely bounding boxes $i$ and $j$ are computed by solving the maximization problem in Eqs. 4 and 5. The optimum is achieved for $i = 1$ and $j = 2$.
6. Let $\beta_1 = b_1$ and $\beta_2 = b_2$.
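The maximization step can be sketched as an exhaustive search over ordered pairs of boxes. Since Eqs. 4 and 5 are not reproduced in this text, the objective below is an assumption: it multiplies the best label match for each box (detection probability times similarity) with a relation term of the form max(sp_ij,k × sim(SR_k, sr_1), sp_ji,k × sim(SR_k, sr_2)). The function name, the toy data, and the exact-match similarity are likewise hypothetical, and indices are zero-based.

```python
from itertools import permutations


def ground_attention_query(op, sp, labels, relations,
                           entity1, sr1, entity2, sr2, sim):
    """Return ((i, j), score) for the best-scoring ordered pair of boxes."""
    best_pair, best_score = None, -1.0
    for i, j in permutations(range(len(op)), 2):
        # Best label match for each box, weighted by detection probability.
        obj1 = max(op[i][m] * sim(labels[m], entity1)
                   for m in range(len(labels)))
        obj2 = max(op[j][n] * sim(labels[n], entity2)
                   for n in range(len(labels)))
        # Relation term: sr1 read from i to j, or its antonym sr2 from j to i.
        rel = max(
            max(sp[i][j][k] * sim(relations[k], sr1),
                sp[j][i][k] * sim(relations[k], sr2))
            for k in range(len(relations))
        )
        score = obj1 * obj2 * rel
        if score > best_score:
            best_pair, best_score = (i, j), score
    return best_pair, best_score


def exact(a, b):
    """Placeholder similarity: 1.0 for identical words, 0.0 otherwise."""
    return 1.0 if a == b else 0.0


# Toy data: two boxes, two categories, two relations (all values made up).
labels = ["person", "monitor"]
relations = ["left", "right"]
op = [[0.9, 0.1], [0.2, 0.8]]        # op[i][m]
sp = [[[0.0, 0.0], [0.1, 0.9]],      # sp[0][1]: box 0 is right of box 1
      [[0.9, 0.1], [0.0, 0.0]]]      # sp[1][0]: box 1 is left of box 0

pair, score = ground_attention_query(
    op, sp, labels, relations, "person", "right", "monitor", "left", exact)
```

In this toy setup the pair (0, 1) is selected, that is, box 0 plays the role of $\beta_1$ and box 1 the role of $\beta_2$.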

Relation queries
Algorithm 2
Input: An image I and a relation query q (such as "where is the person?"), containing an object denoted $entity_1$.
Output: An object name $o$, a bounding box $\beta_1$ containing $entity_1$, a bounding box $\beta_2$ containing an object of category $o$, and $sr$ describing the spatial relation between $\beta_1$ and $\beta_2$.
Method: Calculate $o$, $\beta_1$, $\beta_2$, and $sr$ as follows:
1. Syntactically analyze q to extract an object $entity_1$ (see Sect. 6).
2. Same as step 3 in Algorithm 1.
3. Same as step 4 in Algorithm 1.
4. Compute the most likely object, bounding boxes, and spatial relation by solving the maximization problem in Eq. 6.
5. Let $o = O_n$, $\beta_1 = b_i$, $\beta_2 = b_j$, and $sr = SR_k$.
The steps in the algorithm are illustrated in the following example, for the input query "where is the person?" and the input image shown in Fig. 1:
1. $entity_1$ is extracted from q as "person."
2. Same as step 3 in Algorithm 1.
3. Same as step 4 in Algorithm 1.
4. The most likely object, bounding boxes, and spatial relation are computed by solving the maximization problem in Eq. 6. The optimum is achieved for $n = 1$ (corresponding to the first object category in Table 2), $i = 1$, $j = 2$ and $k = 2$ (spatial relation "right").
5. Let $o = O_n$, $\beta_1 = b_i$, $\beta_2 = b_j$, and $sr = SR_k$.

Identification queries
Algorithm 3
Input: An image I and an identification query q (such as "what is to the right of the monitor?"), containing a spatial relation denoted $sr_1$ and an object denoted $entity_2$.
Output: An object category $o$, a bounding box $\beta_1$ containing an object of category $o$, and a bounding box $\beta_2$ containing $entity_2$.
Method: Calculate $\beta_1$, $\beta_2$, and $o$ as follows:
1. Syntactically analyze q to extract an object $entity_2$ and a spatial relation $sr_1$ (see Sect. 6).
2. Generate the inverse relation $sr_2$ as the antonym of $sr_1$.
3. Same as step 3 in Algorithm 1.
4. Same as step 4 in Algorithm 1.
5. Compute the most likely bounding boxes and object by solving the maximization problem in Eqs. 7 and 8, where
$$z_{i,j,k} = \max\bigl(sp_{i,j,k} \times sim(SR_k, sr_1),\; sp_{j,i,k} \times sim(SR_k, sr_2)\bigr).$$
The steps in the algorithm are illustrated in the following example, for the input query "what is to the right of the monitor?" and the input image shown in Fig. 1:
1. Entities and a spatial relation are extracted from q as $entity_2$ = "monitor," $sr_1$ = "right."
2. The antonym of $sr_1$ is computed as $sr_2$ = "left."
3. Same as step 3 in Algorithm 1.
4. Same as step 4 in Algorithm 1.
5. The most likely bounding boxes and object are computed by solving the maximization problem in Eqs. 7 and 8. The optimum is achieved for $m = 1$, $i = 1$, $j = 2$ and $k = 2$.

Table 4 Similarity value between labels of detected objects from image in Fig. 1
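A possible reading of the identification step is sketched below. The relation term follows the expression for z given above; the rest of the objective (detection probability of the unknown object, times the label match of the second box with the query noun, times z) is an assumption, because the full maximization problem is not reproduced in this text. The function name, toy data, and exact-match similarity are hypothetical, and indices are zero-based.

```python
def ground_identification_query(op, sp, labels, relations,
                                sr1, entity2, sr2, sim):
    """Return (score, category, i, j) for the best-scoring candidate."""
    best = (-1.0, None, None, None)
    for i in range(len(op)):
        for j in range(len(op)):
            if i == j:
                continue
            # How well box j matches the noun given in the query.
            anchor = max(op[j][n] * sim(labels[n], entity2)
                         for n in range(len(labels)))
            # Relation term z, maximized over all spatial relations k.
            z = max(
                max(sp[i][j][k] * sim(relations[k], sr1),
                    sp[j][i][k] * sim(relations[k], sr2))
                for k in range(len(relations))
            )
            # The unknown object: try every category m for box i.
            for m, label in enumerate(labels):
                score = op[i][m] * anchor * z
                if score > best[0]:
                    best = (score, label, i, j)
    return best


def exact(a, b):
    """Placeholder similarity: 1.0 for identical words, 0.0 otherwise."""
    return 1.0 if a == b else 0.0


# Toy data: two boxes, two categories, two relations (all values made up).
labels = ["person", "monitor"]
relations = ["left", "right"]
op = [[0.9, 0.1], [0.2, 0.8]]
sp = [[[0.0, 0.0], [0.1, 0.9]],
      [[0.9, 0.1], [0.0, 0.0]]]

score, category, i, j = ground_identification_query(
    op, sp, labels, relations, "right", "monitor", "left", exact)
```

For the toy query "what is to the right of the monitor?", the sketch returns the category "person" for box 0, with box 1 as the monitor.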

Evaluation
The performance of the developed system was assessed by test users who reported how well the system's generated answers matched their own view for a given set of images and questions. The users used a test program showing a sequence of images (see Fig. 2). At first, the image was shown with detected bounding boxes. The user was then asked to compose a question fitting a given template corresponding to one of the three query types. For type 1, the template was "Find the <object1> <spatial relation> <object2>," for type 2 "Where is the <object>?", and for type 3 "What is <spatial relation> <object>?". After the query was completed, the system produced and displayed an answer in the field "Systems Output Text," and the user clicked on one of the fields "Answer correct," "Answer Not Correct," or "Not Sure." For the second option, the reason could be specified as either "Wrong Spatial Relation," "Wrong Object Detection," or both.
The system also produced a number of image captions (depending on the number of detected objects) describing the relation between objects in the image. The user assessed these captions by entering the number of accepted captions. Each user assessed the system with 12 images, six of which were common to all users. These six images were used to analyze how differently people interpret spatial relations in a scene.
Thirty users were recruited for the evaluation. They were all university students with at least a Master's degree, and all spoke good English. The users analyzed 186 different images, generated 1080 queries, and used 75 different object names and 10 different words describing spatial relations to form queries. The spatial relations include: behind, front, right, left, above, top, on, under, below and bottom. A total of 2005 image captions were generated and assessed.

Fig. 2 The test program used for evaluation of the system. At first an image with unlabeled bounding boxes is displayed. A question is input by the user and the system displays the system output. The user then assesses whether this output is correct or not. Image captions are also generated for more extensive analysis

Results and analysis
According to the test users' responses, the system correctly answered 81.9% of the 1080 posed queries in the evaluation. Of the 18.1% incorrect answers, 62.2% were caused by incorrect detection of spatial relations, and 37.8% by incorrect object classification. Of the generated captions, 68.9% were assessed as correct.
One strength of the presented solution is that several (in the presented results, five) object labels were simultaneously considered for each detected bounding box. To investigate the value of this approach, performance was also computed for systems considering one to four object labels. As shown in Table 6 (left part), performance dropped from 81.9 to 79.8% when considering only one object label. The reason is that the target object in the query sometimes does not match the label with the highest probability, even if the bounding box really contains the target object. In such cases, an incorrect bounding box may be selected if only the object with highest probability is considered. This situation is exemplified in Fig. 4.
The YOLO system used can detect objects of 80 different classes. Without additional precautions, our system would not be able to handle queries related to other nouns. As an example, a query like "find the woman" cannot be correctly processed, since "woman" is not one of these 80 labels. We overcame this limitation by computing similarity values between object categories in the query and the 80 categories that can be detected in images.
In this way, we can handle a larger vocabulary, as demonstrated in Fig. 3. In the shown example, the system detected a "chair," a "monitor," and a "person." For the query "Where is the woman?" the similarity between "woman" and the detected "person" is computed as 0.84, which is sufficiently high for the correct bounding box to be selected by Algorithm 2. The system creates a correct response by labeling the bounding box with "woman," and generating the text "woman right of monitor." The value of the similarity function increases when considering more than one object label. Figure 4 shows four objects, a person, a bowl, a milk box and a cereal box, in red, blue, yellow and green bounding boxes, respectively. YOLO assigns highest probabilities to the object labels "person," "cup" (the bowl), "bottle" (the milk box) and "bottle" (the cereal box), as shown in Table 7. For the query "Where is the bowl?", "bowl" does not match any of these objects. Additionally, since the similarity value between "person" and "bowl" (0.78) is higher than between "bowl" and "cup" (0.73), and between "bowl" and "bottle" (0.76), "person" would be selected in the system's response. On the other hand, if we used more than one object label for each bounding box, "bowl" would be the second most probable object label for the blue bounding box. As a result, the blue bounding box would be returned as the system response.
To investigate the value of the similarity function, we also computed performance with this function disabled, thus affecting matching of both object names and names for spatial relations in the algorithms. As can be seen in Table 6 (right part), discarding the similarity function reduced the system performance by about 20%. In the example illustrated in Fig. 6, the similarity function was disabled, with the result that the object name "armchair" given in the query did not match any of the labels of the objects detected in the image. The system then generated the correct, but irrelevant, answer "person right of monitor." When using the similarity function, "armchair" was matched with the detected object "chair," and the image caption "person behind armchair" was generated. This shows the importance of the similarity function in the proposed solution.
Analysis of the data from the test users' responses gave us new insights into how spatial relations in images are perceived and described. We identify two kinds of scene interpretation and denote them global and local. When a spatial relation between objects in an image is based on the perspective of the camera, it is defined as global. When it is based on the perspective of an object or person in the image, it is defined as local. Figure 5 clarifies the concepts. In this figure, the description "monitor to the left of the woman" is an example of a global relation, while "monitor in front of the woman" is an example of a local relation. The neural network for classification of spatial relations was trained and tested with data labeled with global relations. Hence, the trained network did not model cases where users interpret relations locally. The relatively low accuracy (68.9%) for the generated captions may be caused by this, and indicates that people tend to interpret and express spatial relations locally to a significant extent. Nevertheless, the system often managed to infer the correct bounding boxes and object names, even if the most likely spatial relation predicted by the neural network did not match the test user's assessment. This robustness is due to the probabilistic approach, through which several spatial relations are predicted and considered: a low probability $sp_{i,j,k}$ for a spatial relation may be counterweighted by high probabilities $op_{i,m}$ and $op_{j,n}$ for the detected object classes (see Eqs. 4 and 5) (Fig. 6).
Fig. 3 The designed system is not limited to the number of class labels the object detection system accepts. In the shown example, the word "woman" appears in the query, but is not recognized by YOLO. However, by computing linguistic similarity between "woman" and "person," which is the object label detected by YOLO, the correct bounding box is selected and labeled "woman" (color figure online)
Fig. 4 Example showing the value of considering more than one object type for each bounding box. The image analysis part outputs bounding boxes with assigned probabilities for object classes (see Table 7). The objects with highest probability in the red, blue, yellow and green bounding boxes are person, cup, bottle and bottle, respectively. Using only the most probable object, the query "Where is the bowl?" results in an output highlighting the red bounding box. If the two most probable objects in each bounding box are considered, the system correctly highlights the blue bounding box (color figure online)
Fig. 5 In the answers by test users, we found that some address the monitor as "to the left of the woman" (globally), and some as "in front of the woman" (locally)
Fig. 6 With the similarity function disabled, the query object "armchair" is not grounded to the chair in the image, and the system gives the incorrect output "person right of monitor" (color figure online)
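The counterweighting of a low spatial-relation probability by high object-class probabilities can be made concrete with a small numerical sketch. The actual combination is defined by Eqs. 4 and 5; here we assume, purely for illustration, that the three probabilities are multiplied. The function name, the candidate triplets, and all probability values are hypothetical.

```python
def triplet_score(op_first, op_second, sp_relation):
    """Combine two object-class probabilities with a relation probability.

    Assumption of this sketch: a plain product. Eqs. 4 and 5 of the paper
    define the actual combination, which may differ.
    """
    return op_first * op_second * sp_relation


# Hypothetical candidates: (first object, relation, second object) -> score.
candidates = {
    # Confident object detections, weak spatial-relation probability.
    ("person", "behind", "chair"): triplet_score(0.95, 0.90, 0.30),
    # Uncertain object detections, strong spatial-relation probability.
    ("dog", "left of", "sofa"): triplet_score(0.40, 0.35, 0.80),
}

best_triplet = max(candidates, key=candidates.get)
print(best_triplet, round(candidates[best_triplet], 4))
```

Under this assumed product, the first candidate scores 0.2565 and the second 0.112, so the candidate with confidently detected objects is preferred even though its spatial relation has the lower probability.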

Comparison to other work
There are purely machine learning-based approaches to visual language grounding, and quite a few approaches that can be characterized as hybrid, combining some kind of structural analysis with machine learning (for example, [21,22,37]); our approach falls into the latter category. As outlined in Sect. 2, one of the unwanted outcomes of purely machine learning-based approaches is bias. Debiasing strategies may operate directly on the data or on the language model, whereas other approaches try to find evidence of what, for example, vector representations actually encode (as, for example, in [6,13]). One bias-aware approach is to build hybrid systems that incorporate some structural method or algorithmic approach. In the following, we compare our work in more detail to several approaches that are most similar to ours insofar as they learn triplets of the form $(o_1, r, o_2)$, that is, the relation $r$ between two objects $o_1$ and $o_2$. It is not possible to make a purely quantitative comparison, since all of these approaches use recall as the evaluation metric, whereas we use accuracy. In addition, we evaluate our performance with test users.
In [7], the authors introduce deep relational networks (DR-Net): given the bounding boxes of two objects, a triplet $(s, r, o)$ is predicted, where $r$ describes the relation between the two objects $s$ and $o$. The approach exploits spatial configurations and statistical evidence among the two objects and their relation via a deep relational network. The authors test their approach on two datasets, VRD [23] and sVG, the latter being a larger dataset constructed from the Visual Genome dataset [20]. The training dataset in [7], with 108K images and 998K relationship instances, is considerably larger than ours. In addition, they operate solely over object labels and do not process text queries for visual grounding. The Recall values on the sVG dataset are 88.26 (Recall@50) and 91.26 (Recall@100).
The accuracy of our approach is 81.9%. The presumably lower performance (taking into account some precision measure) may be due to the fact that we do not use the most probable class label if it does not describe the object in the image; instead, we consider the five most probable class labels and take into account only those that actually describe the object in the image. For example, a child and a woman in images illustrated in [7] are both predicted as "man," whereas in our approach they would not be, or at least not without affecting the accuracy. This is an example of how (gender) biased pure machine learning methods can be, for example by not discriminating between the class label "man" and the actual person in the image.
In [39], the authors detect so-called undetermined relationships between objects, that is, relationships that are not labeled or have false labels (e.g., a guitar hanging on the wall being labeled as a lamp). The authors use a similarity measure (e.g., word2vec) and the frequencies of all relation triplets in the training set to obtain a probability distribution. These frequencies might again amplify unwanted bias, or reflect a biased dataset (one that does not contain images with electric guitars hanging on the wall, which probably is a very common way to store and display guitars). Visual Genome is the dataset used for their performance evaluation. Their Recall values for relationship prediction are 14.4 (Recall@50) and 16.5 (Recall@100). In our approach, the probability of a triplet such as (guitar, on, wall) would likely be low as well, but the result would hopefully be more robust due to the algorithmic nature of grounding visual percepts in the image.
The authors in [40] use a semantic inference module based on word vector representations, an attention mechanism over the global image context, and feature fusion to learn relation triplets (object1, predicate, object2). Their model is also validated on the Visual Genome relationship dataset, where per-type predicate classification reaches Recall rates of 65.0 (Recall@50) and 67.1 (Recall@100).
In [22], the authors introduce a reinforcement learning framework for detecting relationships between objects and their attributes. Their approach systematically detects relationship and attribute instances according to a traversal scheme on a directed semantic action graph built for the image. In addition to local analysis, the authors also incorporate global analysis of the image for learning purposes. Their results for relationship detection on Visual Genome are 12.57 (Recall@50) and 13.34 (Recall@100).
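To illustrate why the Recall@K values quoted in this section cannot be compared directly with our accuracy, the following sketch computes both measures on toy data. It assumes the definition of Recall@K commonly used in relationship detection (the fraction of ground-truth triplets found among the K highest-scoring predictions for an image) and treats accuracy as the share of answers judged correct by users. The function names, triplets, scores, and judgments are hypothetical and are not taken from any of the cited systems.

```python
def recall_at_k(predictions, ground_truth, k):
    """Fraction of ground-truth triplets among the k top-scoring predictions.

    `predictions` is a list of (score, triplet) pairs for one image and
    `ground_truth` is a set of annotated triplets for the same image.
    """
    ranked = sorted(predictions, key=lambda pair: pair[0], reverse=True)
    top_k = {triplet for _, triplet in ranked[:k]}
    return len(top_k & ground_truth) / len(ground_truth)


def accuracy(judgments):
    """Share of generated answers that users assessed as correct."""
    return sum(judgments) / len(judgments)


# Hypothetical scored predictions and annotations for a single image.
predictions = [
    (0.9, ("person", "on", "chair")),
    (0.6, ("cup", "on", "table")),
    (0.3, ("person", "behind", "chair")),
]
ground_truth = {("person", "behind", "chair"), ("cup", "on", "table")}

print(recall_at_k(predictions, ground_truth, k=2))
print(recall_at_k(predictions, ground_truth, k=3))

# Hypothetical user assessments of four generated answers.
print(accuracy([True, True, False, True]))
```

Recall@K rewards a system for placing the annotated triplets anywhere among its K best guesses, whereas accuracy judges the single answer the system actually returns, so the two figures answer different questions.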

Conclusion and future work
We presented a system for responding to three types of questions regarding objects and their spatial relations in given images. The answers comprised identification of objects in the image and generation of appropriate text. Of the generated answers, 81.9% were assessed as correct by 30 test users. The system's robustness was demonstrated by the fact that it often correctly answered queries based on a local view of spatial relations, although it was trained on data with a global view (see Sect. 9). This was an effect of the probabilistic approach that combines probabilities for object classes and spatial relations.
By using the semantic similarity function, our model overcame the problems with a limited number of object classes in pretrained network models. Flexibility regarding the varying ways users express spatial relations was also improved. Without the semantic similarity function, accuracy was reduced to 60.7%. Another feature that contributed to the high performance was that several object types were considered for each detected bounding box. The approach in which probabilities for object classes and spatial relations were combined with a measure of semantic similarity contributed to robustness as well as high performance.
In the proposed approach, we used a set of functions instead of training CNNs for measuring the similarity between the text query and the detected objects, and for selecting the most probable answer. This resulted in a less complex and less computationally demanding process, which also removed the need for training data. Therefore, the developed method can also be used, with a minimal number of changes, for under-resourced languages for which the required training data are not available. The system could be further enhanced by training the algorithms with the user responses from the evaluation.
The automatic identification of spatial relations would benefit from a depth perspective, since humans easily perceive depth in images and also denote spatial relations based on it, for example with the relations "behind" and "in front of." Methods for estimating depth from regular 2D images [3,9] could be investigated, and the use of 3D cameras would be another potential approach.
Such an extension, along with incorporating more identifiable objects and relations between objects, could be relevant for Urban Search and Rescue (USAR) robots, where robots and human operators work together to locate humans after natural disasters such as flooding or earthquakes. In USAR situations, where, for example, the robot is in an environment unreachable to the human operator and information about the environment has to be exchanged via the robot's remote cameras, language grounding algorithms would have to take into account the complexity of changing environments as well as the perspective of the robot.