1 Introduction

Human–Machine Interaction (HMI) systems have emerged as an active area of research, paving the way for the development of various real-world robots with widespread application potential. These HMI systems aim to bridge the gap between robots and humans by providing an effective and intuitive way to communicate and collaborate. This is crucial for developing more accessible, user-friendly, and versatile robots capable of addressing a wide range of tasks and applications.

Historically, the interfaces of HMI systems have primarily relied on input methods such as mice and keyboards and, more recently, voice-based commands [1]. However, in their day-to-day lives, humans rely on more modes of expression than speech alone. Hand gestures, specifically pointing gestures, are a major component of such non-verbal communication and are closely linked with language understanding in humans [2]. Humans naturally use pointing gestures, for example, to emphasize a region or object of interest while talking about it. The pointing gesture often serves as a soft signal that guides attention and avoids the need for a very precise and lengthy statement. For example, when talking about one of many similar fruits on a tree, it is natural and convenient to point at it and use an expression like “that fruit”.

To make technology more accessible, machines must allow humans to communicate with them in the same fashion they do with each other in their daily lives. This has led to the development of many HMI systems that incorporate pointing gestures into their communication interface [3,4,5,6,7,8,9,10]. The initial works in this field pioneered techniques for recognizing pointing gestures in 3D space and assembled them with various other components to build complete HMI systems [3,4,5, 7, 10]. More recent works have gone beyond this and proposed systems that integrate pointing gestures more intricately with linguistic utterances. Despite these advancements, previous works in HMI still treat each modality (vision, language, and gestures) largely independently, focusing mainly on integrating the multimodal knowledge [9, 11,12,13]. Hence, executing complex instructions that require not only combining multimodal knowledge but also performing multi-hop logical inference remains a significant challenge.

Meanwhile, with breakthrough successes in visual recognition and linguistic understanding [14, 15], complex multimodal inference tasks involving both linguistic and visual modalities are being actively pursued. These include question answering [16], grounding [17], multimodal machine translation [18], etc. Referring Expression Comprehension (REC) [19, 20] is another challenging task, which requires identifying the object in an image that a natural language query refers to. It is particularly challenging because it requires understanding the language semantics and matching them with visual information such as the color, shape, or name of objects. However, these works only consider images and natural language queries and are hence unsuitable for HMI systems, which expect additional modalities such as pointing gestures.

Fig. 1

Our task setup in VR shows the user wearing a VR headset and hand controllers (left-side image) and looking at the simulated environment (right-side image). The blue star depicts the pointing position on the ground plane. The solid green and dashed red circles and arrows represent the correct and incorrect reasoning hops, respectively, for the instruction “Pick up the ball to the right of this black clipper”

To bridge this gap, in this work, we explore the REC task in the context of such a multimodal HMI system incorporating pointing gestures. This is done in a VR setup consisting of a robot sharing an environment with a user, as shown in Fig. 1. We specifically focus on complex instructions that require the machine to perform intricate and often multi-hop reasoning over the language, vision, and gesture modalities. The user can give a free-form instruction involving an object to the robot using speech and pointing gestures, and the robot needs to identify the object its user refers to in order to fetch it. We deliberately encourage using pointing gestures as an intermediate step toward describing the target object through demonstratives such as “this ball” or “that large object”. An example of such a complex instruction requiring multi-hop reasoning can be seen in Fig. 1 with the instruction “Pick up the ball to the right of this black clipper”. This instruction uses the demonstrative phrase “this black clipper” to describe an intermediate object through which the target object, the “ball”, is referred to. Such an intermediate step might be required for several reasons, including partial/occluded visibility of the target object or, as in this case, the target being very small and therefore difficult to point at directly. It should be noted that the pointing gesture plays an absolutely crucial role in this inference, since there are two such “balls” that are to the right of a “black clipper”. Not utilizing the pointing gesture may lead to the incorrect inference marked with red dashed circles rather than the correct one marked with green solid circles. Moreover, combining the pointing gesture and the linguistic utterance independently can also lead to the same incorrect inference marked with the red dashed circles. From the linguistic expression, it will be inferred that the target object is some “ball” to the right of a “black clipper”. From the pointing gesture as an additional cue, it will be inferred that the target object should be close to the pointed position marked with the blue star. Combining these, we again arrive at the incorrect inference of the target object as the yellow ball marked with a red dashed circle. Hence, such convoluted expressions of natural language and pointing gestures require a deeper understanding of, and inference over, the interplay between the two modalities, which is the main focus of this work.

Tackling the multimodal REC task for HMI systems, involving reasoning over complex instructions of language and gestures, is an enormous challenge encompassing several fields. The current state-of-the-art methods for multimodal inference tasks on language and vision include several deep neural networks, such as joint transformer-based architectures [21,22,23,24]. They have proven successful across diverse downstream multimodal inference tasks [25, 26]. However, they have also been known to exploit dataset biases instead of learning to perform reasoning [27]. Additionally, they sometimes behave in unexpected ways [28] and are susceptible to acquiring spurious correlations during training, impeding their ability to generalize effectively to sufficiently novel instances [29]. Hence, some recent works have argued for the need to explicitly incorporate logical and compositional reasoning in the model structure to perform complex reasoning tasks successfully [30]. This has resulted in the rise of hybrid neuro-symbolic methods that combine deep neural networks with symbolic reasoning techniques [31,32,33]. These methods first leverage deep neural networks to parse the raw visual and linguistic features into symbolic structures, which are then fed to a reasoning and execution module to arrive at the solution. Such a hybrid approach enables the development of a highly modular, interpretable, and compositional model. Thus, we propose a hybrid neuro-symbolic approach to solve our multimodal REC task involving language and gestures.

The contributions of this paper are as follows:

  • We construct a small but challenging dataset for a referring expression comprehension task in HMI systems involving complex linguistic and pointing gesture-based instructions.

  • We propose an interpretable and compositional model using a neuro-symbolic approach to solve the above task.

  • We assess our model’s ability to generalize its reasoning to unseen environments. Moreover, we conduct ablation studies and qualitative analysis to gain deeper insights into its workings.

2 Related Works

2.1 HMI Systems

Our work draws its origins from classical works on Human–Machine Interaction (HMI) systems that incorporate pointing gestures as an additional input modality [3, 5,6,7,8,9,10]. For instance, some of these works propose using pointing gestures as an intuitive interface for interacting with a graphical screen [3] or with an industrial control system [5], while others consider pointing gestures as an additional aid for navigational commands [7, 9]. We extend these works to tackle more complex problems requiring multi-hop reasoning over intricate linguistic and gestural inputs.

2.2 REC Tasks

The Referring Expression Comprehension (REC) task has been studied extensively from the language and vision multimodal perspective with several datasets [19, 20]. The popular datasets for this task, like RefCOCO and RefCOCO+ [20], mainly focus on grounding the objects and attributes in the linguistic query to image regions and often do not require any complex multi-hop reasoning. This was later addressed by the CLEVR-Ref+ dataset [19], which used a synthetic environment to create an REC dataset that specifically requires complex multi-hop reasoning. However, these datasets are purely from a vision and language perspective, do not consider any gesture modality, and hence are not directly applicable to our HMI setup.

Table 1 Comparison of datasets for REC tasks

The datasets more relevant to our work are embodied REC tasks such as CAESAR [11] and YouRefIt [12]. Both consist of a human interacting with objects in a shared environment using multimodal referring expressions, including pointing gestures. However, CAESAR is constructed in a completely simulated environment, with the instructions, environment, and pointing gestures all generated automatically from templates, neglecting the natural variations in human utterance. It also does not provide separate ground truth supervision for pointing gesture recognition, which makes it difficult to fine-tune and use any gesture recognition model on the different visual domain of the simulated scene. YouRefIt, on the other hand, consists of a real-world setting with natural language instructions and pointing gestures. However, as argued in their paper, they aim to use pointing gestures as an additional aid for reducing cognitive load, so the gestures are not necessarily required for the task. A typical example from their dataset is an instruction like “The black phone on the table” with a pointing gesture toward the only black phone present on the table, rendering the pointing gesture a piece of additional rather than necessary information. They also do not consider pointing gestures as an intermediate step in a multi-hop instruction for the target object. This is also evident from their low average instruction length in Table 1.

Thus, moving a step forward, we collect natural free-form linguistic and gestural instructions that require complex reasoning, and construct an original dataset for the REC task in HMI systems. This is done through an object-picking task in a virtual reality environment, as depicted in Fig. 2.

We explicitly focus on complex instructions that do not just use pointing gestures as an additional cue for the target object, but rather as an integral component of a multi-hop reasoning instruction. This is achieved using demonstratives like “this” and “that” to refer to an intermediate object which can then be used to describe the target object. The comparison between all these previous REC datasets is summarized in Table 1.

2.3 Multimodal Systems

The model architectures of existing works incorporating language and pointing gestures often do not fully utilize the pointing gesture for reasoning, but rather use it only as loose guidance, with the textual data carrying most of the weight. For instance, M2Gestic [13] does not truly utilize the pointing gesture in conjunction with the linguistic instruction during inference, but only at the end, to pick the best among the candidates inferred using the other two modalities (vision and text) alone. Hence, it will struggle to perform multi-hop inference over the complex, intertwined gestural and textual instructions present in our task, like the one in Fig. 1.

Fig. 2

Overview of the data collection setup using VR headset and hand controllers. The listed items in the blue boxes at the bottom depict the recorded data. The gray scrolls contain an example output from the random scene generator and HTC Vive’s ROS interface

In the vision and language multimodal community, the state-of-the-art methods for REC tasks mainly include transformer-based architectures [23, 24]. These models, while highly versatile, suffer from severe drawbacks, including difficulty in ensuring transparency in their reasoning, as explained in Sect. 1. Hence, we base our model on the neuro-symbolic approach, which enables us to build a robust and interpretable model that performs complex compositional reasoning. This approach has proven very effective for visual question answering tasks like VQA [34] and CLEVR [35], with works like [30,31,32,33]. In this work, we take inspiration from one of these methods, the Neural State Machine [31], due to its fully differentiable graph-based architecture requiring a minimal injection of knowledge for symbolic execution. We borrow its core ideas of disentangled representations and state machine-based reasoning to develop an architecture for our REC task.

3 Dataset

As discussed in Sect. 2, the existing datasets in this field exhibit several limitations with respect to the purpose of this work. They either do not have any pointing gesture modality [19, 20] or are collected in a very constrained fashion, not allowing the natural variations of gestural and linguistic commands [11]. Hence, we need to gather original data for this task in an HMI setup that allows the user to freely interact with the surrounding environment with free-form language and pointing gestures. This setup should also facilitate recording and capturing all the required data, including the visual scene consisting of the objects around the user, the user’s hand motion for pointing gestures, and the user’s utterance.

3.1 Data Collection Setup

To fulfill these requirements, we designed an object-picking task in a virtual reality (VR) simulated environment and employed an HTC Vive VR headset with hand controllers to facilitate interaction within it. Within this simulated environment, we created a scene containing diverse objects positioned on the floor using the 3D object meshes from [36]. We randomly selected ten objects (with replacement) and assigned them random positions within 10 m of the user's location. We further ensured a minimum distance of 1.5 m between each pair of objects so that objects do not overlap or coincide with each other in a physically implausible way.
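
For concreteness, the scene sampling described above can be sketched as a simple rejection-sampling routine. The function and parameter names below are illustrative assumptions and do not correspond to our actual ROS implementation.

```python
import math
import random

def generate_scene(object_meshes, n_objects=10, max_radius=10.0, min_gap=1.5):
    """Sample a random scene: n_objects drawn with replacement and placed on the
    ground plane within max_radius of the user, at least min_gap apart."""
    chosen = [random.choice(object_meshes) for _ in range(n_objects)]  # with replacement
    positions = []
    while len(positions) < n_objects:
        # Sample uniformly inside a disc of radius max_radius around the user (origin).
        r = max_radius * math.sqrt(random.random())
        theta = 2 * math.pi * random.random()
        candidate = (r * math.cos(theta), r * math.sin(theta))
        # Reject candidates that violate the minimum pairwise distance.
        if all(math.dist(candidate, p) >= min_gap for p in positions):
            positions.append(candidate)
    return list(zip(chosen, positions))
```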

This data collection setup was developed using the Robot Operating System (ROS) [37], a widely used open-source software development kit for robotics applications. To generate object visualizations within a 3D virtual environment, we employed Rviz [38], a graphical interface for ROS. This visualization could then be experienced through the VR headset of the HTC Vive system [39]. The HTC Vive system was also integrated with ROS using existing drivers to track and record the 3D positions and orientations of the headset and the two hand controllers, as seen in Fig. 2. However, once inside the VR environment, the users can only see the simulated artificial scene and nothing from the real world. Thus, the users are unable to see their own hands once the headset is worn, which poses challenges for intuitive pointing gestures. To address this, we visualized the user's hands in the simulated scene with artificial markers driven by the hand controllers' real-time position and orientation data. The two 3D red solid arrows in the simulated scene on the right in Fig. 2 represent this visualization of the user's hands.

3.2 Annotation Process

For the data annotation process, the participants wore the HTC Vive headset and hand controllers and looked at the simulated environment featuring objects positioned on the ground. Participants then spoke instructions to pick up one of the objects on the ground, employing the hand controllers for pointing gestures when necessary. Throughout the annotation procedure, the positions and orientations of the headset and hand controllers were recorded to track the movements of the user's head and hands. Simultaneously, the visual representation of the object scene was captured to save the ground truth positions of the objects in 3D space. An external microphone was used to record the user's verbal instructions. We employed an off-the-shelf speech recognition Python API from [40] to generate textual instructions from the audio recordings, and any errors in the generated text were manually rectified. Finally, participants were asked to provide the ID of the object they had referred to, which was saved as a label along with the other recorded data.

For collecting these annotations, we asked human annotators to interact with our system and recorded their ground truth data and labels as described above for each iteration. The annotators were asked to use free-form natural language for the referring expressions and were not constrained by any specific templates or structure. The guidelines recommended using the pointed object as an intermediate for describing the target object through a multi-hop reasoning instruction; hence, they emphasized the collection of complex multimodal referring expressions. Two of the co-authors (both male), aged 24–25, spent around a week as human annotators collecting data with this setup. Both are also graduate students in the Department of Computer Science. All annotations were done in English, in which both annotators have native-level fluency and which was their primary medium of education.

An example annotation can be seen in Fig. 1. The human annotator sees the simulated scene on the right from the VR headset and utters the instruction—“Pick up the ball to the right of this black clipper” to refer to the ball (as highlighted with the green solid circles and arrow) while performing a pointing gesture using the hand controllers to point at the clipper.

Fig. 3

The objects set used for generating scenes in the training/validation set and the test set, respectively

3.3 Dataset Statistics

In total, we collected 130 samples, divided into 104 samples for training and the remaining 26 for validation purposes. For the test set, we collected an additional 40 samples, this time utilizing an entirely distinct set of objects from [41], comprising various toys, as illustrated in Fig. 3. Using completely different objects for the test set facilitates evaluating the model’s compositional and generalization ability to new environments, as explained in detail later in Sect. 5.

We ensured both the variety of objects and their randomness to minimize any potential biases that might be exploited. The training and validation sets used the 3D objects from [36], consisting of 75 distinct objects commonly found in everyday life, comparable to the 80 objects used by other REC datasets [11, 20]. These objects include kitchen items, tools, geometric shapes, and food items with different colors, shapes, sizes, and textures. We randomly pick ten objects with replacement from this set for each iteration of data collection to prevent any biases. Similarly, for the test set, we picked objects from [41], consisting of 15 distinct real toys, again with varied colors, shapes, and sizes. Sample objects from each set can be seen in Fig. 3.

Table 2 Ten most common words in our dataset, excluding stop words, along with example usages from the dataset

We also deliberately refrained from enforcing any constraints while collecting the users' linguistic and gestural instructions, to allow natural variations in the utterances. The 130 samples in the training and validation data contain a vocabulary of 114 distinct words, with “pick” and “up” being the most frequent, since the instructions are for an object-picking task, followed by the demonstratives “this” and “that”, which are used for pointing at an object. These are followed by the nouns and adjectives summarized in Table 2. Our analysis reveals that the expressions in these samples range from 3 to 11 words, with an average of 6.1 words per expression. In comparison, referring expressions from other datasets involving pointing gestures, such as CAESAR [11], follow templates ranging from 2 to 9 words with an average length of 5.3 words, while those in YouRefIt [12] have a significantly shorter average length of just 3.7 words. The longer expressions in our dataset reflect the emphasis on complex multi-hop instructions.

4 Method

As discussed in Sects. 1 and 2, we employ a neuro-symbolic approach for this task. The core idea behind these approaches is to disentangle reasoning from language and visual understanding [31, 32]. We borrow these core ideas and state machine-based reasoning module from [31] to develop a novel neuro-symbolic method for the REC task. The details of our method are explained in this section.

4.1 Overview

Essentially, the method can be conceptualized as a combination of two stages. In the first stage, the input modalities, including the image pixels, characters composing the uttered sentence, and the temporal information of hand positions, are transformed from raw feature space to an abstract semantic space. The image is transformed into a semantic graph-based structure called a scene graph, as proposed by [42]. This scene graph contains objects and their relationships as the nodes and the edges, respectively. The pointing gesture is then transformed as another attribute for the nodes in this scene graph. Simultaneously, the linguistic instruction is transformed into a sequence of steps required to be performed for the reasoning involved. Modeling the raw inputs as semantic structures, such as scene graphs and reasoning steps, allows for capturing the necessary and relevant information with high expressiveness. Importantly, they are constructed to share the same underlying vocabulary of concepts and properties like color, shape, size, material, etc., allowing joint inferencing over these structures. Moreover, they provide a computationally fast and convenient structure for the symbolic reasoning modules in the next stage.

The next stage involves a reasoning module that uses all the information and determines the target object being referred to. This is done using symbolic program execution of the reasoning steps on the semantic scene graph. This two-stage neuro-symbolic approach allows us to build a highly modular, interpretable, and compositional model for our multimodal REC task.

Figure 4 shows an overview of our model architecture. Each component of this model is explained in detail further in this section.

4.2 Reasoning Steps Generator

First, the input linguistic instruction is transformed into a sequence of unit semantic instructions. One common way of doing this is by treating this task as a machine translation task and employing a deep neural encoder–decoder network to generate the unit semantic instruction steps [30, 32]. However, this requires a huge amount of data to be trained, which is not the case with our dataset.

Fig. 4

Model architecture overview; best viewed in color. The heatmaps next to each reasoning instruction step represent the probability distribution over the instruction types. The attributes and relations in the scene graph are represented with blue and orange colors, respectively. Similarly, the heatmaps next to them represent the probability distribution over the possible vocabulary of classes

Fig. 5

Dependency parser output for a sample instruction, generated from [43], represented using the Universal Dependencies format [44]

Therefore, we parse the input linguistic instruction by exploiting its underlying grammatical structure in the form of dependency parsing. Dependency parsers are trained on vast amounts of natural language data to extract the dependency relations between the semantic tokens in a sentence and output a tree of dependencies, as shown in Fig. 5. First, we filter out unnecessary tokens using stop-words and part-of-speech (POS) tag-based filtering. We use the dependency parser from the Stanford CoreNLP toolkit [43] for our purposes. Then, we traverse this dependency tree recursively, starting from the root node. At each recursive step, we generate the reasoning steps for all the child subtrees of the current node, concatenate them using a heuristically constructed priority order over the dependency tags, and finally append the reasoning step for the current node. In this way, we recursively traverse the dependency tree to generate the desired step-wise reasoning instructions for the linguistic instruction. For instance, the sentence “Pick up the ball to the right of this black clipper” is parsed into a dependency tree, as shown in Fig. 5. We traverse this tree starting from its root, “pick”, and then recursively traverse the two sub-trees originating from “ball” and “right”. This recursive traversal carries on, and the results are concatenated in a depth-first order, with heuristic rules breaking ties at each depth. Finally, the traversal yields the following reasoning steps: [“this”, “black”, “clipper”, “right”, “ball”].
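
The following minimal sketch illustrates this recursive traversal under simplifying assumptions: the dependency parse has already been filtered and converted into plain tree nodes, and the priority table over dependency tags (DEP_PRIORITY) is an illustrative stand-in for our hand-crafted heuristic ordering, not the exact table used.

```python
# Sketch of the recursive reasoning-step generation. Node, DEP_PRIORITY, and the
# hand-built example tree are illustrative; the real parse comes from CoreNLP.
DEP_PRIORITY = {"nmod": 0, "det": 1, "amod": 2, "compound": 3}

class Node:
    def __init__(self, word, dep, children=()):
        self.word, self.dep, self.children = word, dep, list(children)

def reasoning_steps(node):
    steps = []
    # Depth-first: process the child subtrees in heuristic priority order first.
    for child in sorted(node.children, key=lambda c: DEP_PRIORITY.get(c.dep, 99)):
        steps.extend(reasoning_steps(child))
    steps.append(node.word)  # the current head is appended last
    return steps

# "Pick up the ball to the right of this black clipper" (after stop-word/POS filtering)
clipper = Node("clipper", "nmod", [Node("this", "det"), Node("black", "amod")])
ball = Node("ball", "obj", [Node("right", "nmod", [clipper])])
print(reasoning_steps(ball))  # ['this', 'black', 'clipper', 'right', 'ball']
```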

We further classify each reasoning instruction step into one of the following reasoning type categories: [object name, color, shape, size, demonstrative, relation between objects]. These types are necessary to perform the step-by-step reasoning inside the state machine, as explained later in Sect. 4.5. This classification is done heuristically by constructing patterns for each class using the POS and dependency tags. If a reasoning step does not match any of the patterns, the cosine similarity score between the reasoning step and the category embeddings is used for classification.
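
As an illustration of this fallback, a minimal sketch is given below. The category prototype words and the `embed` function (e.g., a FastText lookup) are assumptions, and the pattern-matching rules that run before this fallback are omitted.

```python
import numpy as np

CATEGORIES = {  # illustrative prototype word per reasoning type
    "object name": "object", "color": "color", "shape": "shape",
    "size": "size", "demonstrative": "this", "relation": "beside",
}

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-8))

def classify_step(step, embed):
    """Fallback classifier: pick the reasoning type whose prototype embedding is
    most similar to the embedding of the reasoning step."""
    scores = {cat: cosine(embed(step), embed(word)) for cat, word in CATEGORIES.items()}
    return max(scores, key=scores.get)
```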

4.3 Pointing Position Estimation

The wrist position and orientation in 3D space can be extracted from the HTC Vive's hand controller output. In each experiment, we collect the positions for both hands from the beginning to the end. From these positions, we first extract the duration during which a pointing gesture is performed, if any, and then use the data in that duration to estimate the target pointed position on the ground. To identify a pointing gesture, we use a simple heuristic: if one of the hands is raised in the air and the wrist trajectory (the controller's trajectory) corresponds to a straight arm motion, that hand is considered to be performing a pointing gesture.
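
One possible implementation of this heuristic is sketched below. The height and straightness thresholds are illustrative values, and the straightness test (a line fit on the wrist trajectory via SVD) is our interpretation of the rule rather than the exact one used.

```python
import numpy as np

def is_pointing(wrist_positions, height_thresh=1.0, straightness_thresh=0.05):
    """Return True if the wrist trajectory looks like a pointing motion:
    the hand is raised and the trajectory is close to a straight line.
    Assumes z is the vertical axis and positions are in metres."""
    pts = np.asarray(wrist_positions)        # (T, 3) wrist positions
    if pts[:, 2].mean() < height_thresh:     # hand not raised in the air
        return False
    # Fit a line via SVD; a straight trajectory keeps most variance on one axis.
    centred = pts - pts.mean(axis=0)
    sing_vals = np.linalg.svd(centred, compute_uv=False)
    residual = sing_vals[1:].sum() / (sing_vals.sum() + 1e-8)
    return residual < straightness_thresh
```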

Once we identify the gesture, the position data corresponding to the pointing arm for the pointing duration are extracted. For each wrist position, a gesture ray is constructed by joining it with the head controller’s position. This is in accordance with previous research suggesting that the pointing gesture ray is best approximated by using an eye–wrist pair rather than any other pairs, such as shoulder–wrist [45, 46]. The intersection points of all these gesture rays on the ground plane are calculated and stored. These points on the ground are then used to construct a density estimation plot using the kernel density estimation (KDE) algorithm [47, 48]. The point with maximum density is then estimated as the target pointed position on the ground. This is based on the assumption that the user spends significant time pointing directly at the target and less time searching for the target or in motion. Therefore, the density of points should be highest around the target pointed position. Using this target pointed position and the density plot, each object is allocated a probability score of being the target pointed object. This score is then treated as one of the attributes of that object in the scene graph, constructed as explained in the following subsection.
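
A compact sketch of the eye–wrist ray casting and KDE-based estimation described above is shown below, assuming the ground plane is z = 0 and synchronized head and wrist samples; the helper name and array layout are illustrative.

```python
import numpy as np
from scipy.stats import gaussian_kde

def estimate_pointed_position(head_positions, wrist_positions):
    """Estimate the pointed position on the ground (z = 0) from synchronized
    (T, 3) head and wrist positions recorded during the pointing duration."""
    head = np.asarray(head_positions, dtype=float)
    wrist = np.asarray(wrist_positions, dtype=float)
    direction = wrist - head                     # eye-wrist gesture ray per frame
    valid = direction[:, 2] < 0                  # keep rays pointing towards the ground
    head, direction = head[valid], direction[valid]
    t = -head[:, 2] / direction[:, 2]            # solve head_z + t * dir_z = 0
    hits = head[:, :2] + t[:, None] * direction[:, :2]   # (N, 2) ground intersections
    # Kernel density estimation over the intersection points; the densest point
    # is taken as the target pointed position.
    kde = gaussian_kde(hits.T)
    return hits[np.argmax(kde(hits.T))]
```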

4.4 Scene Graph Generator

The scene graph generator module needs to transform the raw image pixels as perceived by the machine into a structured scene graph. A scene graph consists of nodes representing the objects in the scene and the edges representing the relations between those objects. The nodes also contain the following attribute information—name, color, shape, and size. Due to the limited amount of data available for our task, it is not possible to train a reliable visual recognition model for scene graph generation. Moreover, this work focuses predominantly on the novel semantic reasoning module within a hybrid neuro-symbolic framework designed to tackle our multimodal REC task. In the future, this scene graph can be generated by leveraging the vast research in scene graph generation [49, 50] when enough data are available for training.

Hence, for the purpose of this work, we use an oracle scene graph generator using the ground truth object detections and their annotated attributes. The relations are classified among one of the spatial relations (left, right, front, back, and near) using a simple heuristic on the relative position on the ground between each pair of objects in the scene. Moreover, each attribute and relation in this scene graph is represented as a probability distribution over the vocabulary of each attribute and relation class as done in [31], resulting in what is known as a probabilistic scene graph.
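
The spatial-relation heuristic and the probabilistic (smoothed) edge labels can be sketched as follows; the axis convention, distance threshold, and smoothing value are assumptions made for illustration.

```python
import numpy as np

RELATIONS = ["left", "right", "front", "back", "near"]

def spatial_relation(pos_a, pos_b, near_thresh=2.0):
    """Toy heuristic for the relation of object B with respect to object A from
    their ground-plane positions (x assumed to point right, y to point front)."""
    dx, dy = pos_b[0] - pos_a[0], pos_b[1] - pos_a[1]
    if np.hypot(dx, dy) < near_thresh:
        return "near"
    if abs(dx) > abs(dy):
        return "right" if dx > 0 else "left"
    return "front" if dy > 0 else "back"

def relation_distribution(pos_a, pos_b, smoothing=0.05):
    """One possible way to obtain a probabilistic edge label: a smoothed one-hot
    distribution over the relation vocabulary."""
    probs = np.full(len(RELATIONS), smoothing / (len(RELATIONS) - 1))
    probs[RELATIONS.index(spatial_relation(pos_a, pos_b))] = 1.0 - smoothing
    return probs
```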

4.5 Reasoning Module: Neural State Machine

Once all the input modalities are transformed into their respective semantic structural space, we perform reasoning using a neural state machine as described in [31]. The probabilistic scene graph created from the scene is treated as a state machine on which the reasoning instruction steps are fed one by one to perform sequential reasoning. The state machine redistributes its probability distribution for the target node at each reasoning step. Once all the steps are processed, the probability distribution on the state machine corresponds to that of the target object referred to by the input instruction.

Fig. 6

The reasoning module architecture. The green solid boxes represent a function (corresponding equation number written below it) with learnable parameters, if any, written inside it. This module iteratively updates the probability distribution of the target node from \(p_{i}\) to \(p_{i+1}\) after processing the \(i^{th}\) reasoning step

The redistribution of the target node probabilities at each iterative step i (\(i = 1...n\), with n representing the number of steps) is accomplished by utilizing the reasoning instruction embedding of that step, denoted as \(r_i\) (a tensor of size d, the embedding dimension), the reasoning type distribution \(R_i\) (a tensor of size \(L+1\), where L is the number of attributes, which equals 5), the node representations \(s^j\) for each attribute j from 1 to L, and the edge representations \(e'\). Using these elements, we first calculate a relevance score for each node, \(\gamma _i(s)\), and each edge, \(\gamma _i(e)\), employing learnable weight parameters \(W_j\), \(j = 1...L+1\), each of shape [\(d \times d\)]. These relevance scores depict the significance of each node and edge in the scene graph, respectively, with regard to the current reasoning instruction step:

$$\begin{aligned} \gamma _i(s) = \sigma \left( \sum _{j=1}^L R_i(j)\,(r_i \circ W_j s^j) \right) \end{aligned}$$
(1)
$$\begin{aligned} \gamma _i(e) = \sigma \left( r_i \circ W_{L+1} e' \right) . \end{aligned}$$
(2)

Given these relevance scores, we independently compute the updated probability distribution for the target node using only the node relevance scores and only the edge relevance scores, obtaining \(p_{i+1}^s\) and \(p_{i+1}^r\), respectively. These signify the updated probability distribution after processing this reasoning step considering only the nodes and only the edges in the scene graph, respectively:

$$\begin{aligned} p_{i+1}^s = \text {softmax}_{s \in S}\left( p_i(s) \cdot \gamma _i(s) \right) \end{aligned}$$
(3)
$$\begin{aligned} p_{i+1}^r = \text {softmax}_{s \in S} \left( \sum _{(s',s) \in E} p_i(s') \cdot \gamma _i((s',s)) \right) . \end{aligned}$$
(4)

Finally, we combine them by a weighted sum, with the weight being the probability of this reasoning step being of type relation, denoted as \(R_i(L+1) = r_i'\):

$$\begin{aligned} p_{i+1} = r_i' \cdot p^r_{i+1} + (1 - r_i') \cdot p^s_{i+1}. \end{aligned}$$
(5)

Once the state machine has processed all the reasoning steps, we obtain the final probability distribution of the referenced target object over all nodes in the graph. Overall, this allows for iterative step-wise reasoning over the referring expression, shifting the probability distribution of the target node at each step. It also enables us to peek into the model's inference, enhancing its interpretability, as exemplified later in Sect. 5.4.
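
For clarity, a minimal PyTorch sketch of one reasoning step, Eqs. (1)-(5), is given below. The tensor shapes, the reduction of the relevance scores to one scalar per node/edge, and the message aggregation along edges are our interpretation of the equations; the actual implementation may differ.

```python
import torch
import torch.nn.functional as F

def nsm_step(p_i, r_i, R_i, node_attrs, edge_feats, edge_index, W):
    """One neural state machine update (sketch of Eqs. 1-5).
    Assumed shapes: p_i (N,), r_i (d,), R_i (L+1,), node_attrs (L, N, d),
    edge_feats (E, d), edge_index (2, E) long, W (L+1, d, d)."""
    L = node_attrs.shape[0]
    # Eq. (1): node relevance, reduced to one score per node.
    proj = torch.einsum('jdk,jnk->jnd', W[:L], node_attrs)          # W_j s^j
    gamma_s = torch.sigmoid((R_i[:L, None, None] * (r_i * proj)).sum(dim=(0, 2)))
    # Eq. (2): edge relevance, one score per edge.
    gamma_e = torch.sigmoid((r_i * (edge_feats @ W[L].T)).sum(dim=-1))
    # Eq. (3): update considering node attributes only.
    p_s = F.softmax(p_i * gamma_s, dim=0)
    # Eq. (4): update considering edges only (messages flow from source to target).
    src, dst = edge_index
    p_r = F.softmax(torch.zeros_like(p_i).index_add_(0, dst, p_i[src] * gamma_e), dim=0)
    # Eq. (5): mix the two updates by the probability of the step being a relation.
    r_rel = R_i[L]
    return r_rel * p_r + (1 - r_rel) * p_s
```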

5 Experiments and Results

5.1 Implementation Details

We first preprocess the data using pre-trained FastText word embeddings [51]. To ensure uniformity in instruction length, we set a maximum instruction length of 16 tokens, padding the shorter ones with zero values. Similarly, to maintain consistency in the number of nodes of the scene graph, we set a maximum node count of 12 and applied padding.

We train our complete architecture in a supervised manner by minimizing the cross-entropy loss against the ground truth label. We use the Adam optimizer [52] with a learning rate of 0.0001 and a batch size of 8. The embedding size and hidden size were both set to 30. These hyperparameter values were determined by performing a random search over their domains for maximum validation accuracy. During the model inference phase, we output the object corresponding to the node with the highest probability score after processing all the steps.

5.2 Experiment Settings

We train our model on the complete training set of 104 samples and evaluate it on the validation set of 26 samples. We further test the generalization ability of our model by evaluating it on the test set created using a completely different set of objects, as explained in Sect. 3.3. Finally, we perform an ablation study on the input pointing gesture modality to understand its importance for our task: we evaluate our model in the same setting but without the pointing gesture input, providing instead a uniform probability distribution of pointing gestures over all the objects in the scene.

Table 3 Accuracy on validation and test set for different experiment settings

5.3 Experiment Results

The results for all these experiments are summarized in Table 3. The comparable performance on the validation and test sets demonstrates the generalization capability of our model, even with a very small amount of training data. This is made possible by our method's extensive compositionality and modularity. We break the input modalities down into structured semantic concepts and perform reasoning in this abstract space. This makes it possible to train on a relatively small amount of data with fewer trainable parameters and still generalize the reasoning to a new environment. This contrasts with deep-learning approaches, which often struggle to generalize even with vast amounts of training data and end up learning statistical biases in the raw low-level feature space.

From the ablation study results in the second row of Table 3, we can conclude that the pointing gesture is indeed crucial for this task. The accuracy is significantly worse for the model that does not use the pointing gesture input. This is expected, since humans often convey implicit information through these gestures, which is crucial for understanding the instruction completely.

5.4 Qualitative Analysis

We further perform a qualitative analysis to gain deeper insight into the inner workings of our model. Since our model is based on a step-by-step reasoning approach, we analyze the shift in the probability distribution of each node being the target object after processing each reasoning step. We then assess the reasoning process's logical coherence and human interpretability. An example of such an analysis is visualized in Fig. 7 for the instruction “Grab this yellow object near the fruit”. Each sub-figure from (a) to (e) corresponds to one of the reasoning steps. The markers inside the figure are the objects in the scene viewed from the top, and the plot axes are the X and Y coordinate axes. Each object marker is labeled with a name, its ID in brackets, and its probability of being the target after the current step in square brackets. The estimated pointed position on the ground plane is marked with a blue cross.

Fig. 7

Qualitative analysis of the step-by-step reasoning performed by the model for the referring instruction “Grab this yellow object near the fruit”. Each of a–e corresponds to one of the reasoning steps. The plot axes are the X and Y coordinate axes. The estimated pointed position on the ground plane is marked with a blue cross. The markers inside the figure are the objects in the scene viewed from the top. Each is labeled with a name, its ID in brackets, and its probability of being the target after the current step in square brackets. The objects with a high probability of being the target after each step are highlighted. After processing each reasoning step, we can see that the model iteratively refines its predictions and finally converges to the target object

The referring instruction—“Grab this yellow object near the fruit”—is a particularly challenging example. It consists of five reasoning steps to arrive at the solution, as determined by the reasoning steps generator module— [“fruit”, “near”, “this”, “yellow”, “object”]. It requires the model to perform complex multi-hop reasoning considering both linguistic and gestural features simultaneously.

Analyzing these figures enables us to visualize how the probabilities shift across the different objects in the scene after processing each instruction step. Initially, the model assigns uniform probabilities to each node as the target object. The shift in the probability distribution after processing the first instruction, “fruit”, can be seen in Fig. 7a. We notice that the probabilities of the fruits are higher than those of the other objects, except for some related objects like the tomato soup cans. The next instruction, “near”, then shifts the probabilities to the neighbors of the objects with higher probabilities in the previous step. This shows that the model could infer that this instruction is of type relation, and hence, the probabilities are shifted along the relevant edges. Similarly, after each reasoning step, we can analyze the shift in the probability distribution across the nodes. Finally, after processing all the reasoning steps, the model narrows down the target to the mustard bottle with ID 9, with a probability score of 0.43, the highest among all objects.

5.5 Error Analysis

Furthermore, we manually examined each failed test sample to identify the root cause of the errors, using step-by-step reasoning visualization plots similar to Fig. 7. A few of the incorrect samples were due to inaccurate pointing gesture estimation, which caused some other object to be considered as the pointed one, leading to incorrect reasoning. Incorrect generation of the reasoning steps also contributed to some failed samples. For example, for the instruction “Pick up that tennis ball closer to the cracker box”, the reasoning steps generator incorrectly generated the steps as [“that”, “tennis”, “ball”, “cracker”, “box”, “closer”]. This is because our simplified heuristic algorithm on the dependency graph could not handle the complex grammar of the sentence. Some other errors were due to incorrect reasoning over correctly parsed steps. For example, given the steps [“yellow”, “fruit”], the reasoning module incorrectly inferred “plum” instead of “banana” as the referred object.

6 Conclusion

In this work, we tackled the problem of referring expression comprehension for human–machine interaction systems involving multiple modalities: language, vision, and pointing gestures. To do so, we collected a small but challenging dataset in a simulated VR environment to mimic a real-world application of such a task. We proposed a novel neuro-symbolic method to solve our multimodal REC task and showed its effectiveness in performing complex multi-hop reasoning. We also showed that such a method can generalize its reasoning to unseen environments. Moreover, we performed ablation studies to emphasize the importance of pointing gestures in real-world interaction tasks. Finally, we performed a qualitative analysis using the interpretable framework of our model to gain deeper insight into its inner workings.

Much remains to be done to build robust HMI systems capable of performing this task in the real world. First and foremost, a larger dataset of real-world HMI tasks focusing on referring expressions is required. Collecting such data on a large scale is a tedious task with high monetary costs due to the interactive nature of the task. However, it will reap rich rewards for research in this field, fostering the development of many applications in real-world robotic systems. Second, more sophisticated reasoning modules need to be developed to interpret the various reasoning paradigms involved in daily human communication. These include, but are not limited to, ordinal references, counting, and perspective disambiguation. Among these, perspective disambiguation is a challenge unique to HMI systems: in real-world HMI systems, humans and machines perceive the scene from different viewpoints, which requires the machine to account for this difference in perspective while performing spatial reasoning. We hope to keep working on these future directions in our research.