1 Introduction

The task of human-object interaction (HOI) understanding aims to infer the relationships between humans and objects, such as “riding a bike” or “washing a bike”. Beyond the traditional visual recognition of individual instances, e.g., human pose estimation, action recognition, and object detection, recognizing HOIs requires a deeper semantic understanding of image contents. Recently, deep neural networks (DNNs) have shown impressive progress on the above individual instance recognition tasks, while relatively few methods [1, 2, 14, 38] have been proposed for HOI recognition. This is mainly because the task requires reasoning beyond perception, integrating information from humans, objects, and their complex relationships.

In this paper, we propose a novel model, the Graph Parsing Neural Network (GPNN), for HOI recognition. GPNN offers a general framework that explicitly represents HOI structures with graphs and automatically parses the optimal graph structure in an end-to-end manner. In principle, it is a generalization of the Message Passing Neural Network (MPNN) [12]. An overview of GPNN is shown in Fig. 1. The following two aspects motivate our design.

First, we seek a unified framework that combines the learning capability of neural networks with the power of graphical representations. Recent deep learning based HOI models have shown promising results, but few have addressed how to explicitly represent and leverage the spatial and temporal dependencies and human-object relations in such a structured task. Aiming at this, we introduce GPNN. It inherits the complementary strengths of neural networks and graphical models, forming a coherent HOI representation with strong learning ability. Specifically, with the structured representation of an HOI graph, the rich relations are explicitly utilized, and information from individual elements can be efficiently integrated and broadcast over the structure. The whole model and its message passing operations are well-defined and fully differentiable, so it can be learned efficiently from data in an end-to-end manner.

Fig. 1. Illustration of the proposed GPNN for learning HOI. GPNN offers a generic HOI representation that applies to (a) HOI detection in images and (b) HOI recognition in videos. By integrating graphical models and neural networks, GPNN iteratively learns/infers the graph structure (a.v) and performs message passing (a.vi). The final parse graph explains a given scene with the graph structure (e.g., the link between the person and the knife) and the node labels (e.g., lick). A thicker edge corresponds to stronger information flow between nodes in the graph.

Second, building on this efficient HOI representation and its learning power, GPNN applies to diverse HOI tasks in both static and dynamic scenes. Previous HOI studies achieved good performance in their specific domains (spatial [1, 14] or temporal [20, 34, 35]), but none of them offers a generic framework for representing and learning HOI in both images and videos. The key difficulty lies in the diverse relations between components. Given a set of human and object candidates, there may exist an unknown number of human-object interaction pairs (see Fig. 1 (a.ii) for an example), and the relations become even more complex once temporal factors are taken into consideration. Thus pre-fixed graph structures, as adopted by most previous graphical or structured DNN models [11, 20, 22, 43], are not an optimal choice. Seeking better generalization ability, GPNN incorporates an essential link function that addresses the problem of graph structure learning. It learns to infer the adjacency matrix in an end-to-end manner and can thus produce a parse graph that explicitly explains the HOI relations. With such a learnable graph structure, GPNN can also limit the information flow from irrelevant nodes while encouraging messages to propagate between related nodes, thus improving graph parsing.

We extensively evaluate the proposed GPNN on three HOI datasets, namely HICO-DET [1], V-COCO [17], and CAD-120 [22], covering HOI detection in images (HICO-DET, V-COCO) and HOI recognition and anticipation in spatial-temporal settings (CAD-120). The experimental results verify the generality and scalability of our GPNN based HOI representation and show substantial improvements over state-of-the-art approaches, including pure graphical models and pure neural networks. We also demonstrate that GPNN outperforms its variants and other graph neural networks with pre-fixed structures.

This paper makes three major contributions. First, we propose GPNN, which incorporates structural knowledge into DNNs for learning and inference. Second, with a set of well-defined modular functions, GPNN addresses the HOI problem by jointly performing graph structure inference and message passing. Third, we empirically show that GPNN offers a scalable and generic HOI representation that applies to both static and dynamic settings.

2 Related Work

Human-Object Interaction. Reasoning about human actions with objects (like “playing baseball”, “playing guitar”), rather than recognizing individual actions (“playing”) or object instances (“baseball”, “guitar”), is essential for a more comprehensive understanding of what is happening in a scene. Early work in HOI understanding studied Bayesian models [15, 16], utilized contextual relationships between humans and objects [47,48,49], learned structured representations with spatial interaction and context [8], exploited compositional models [9], or referred to a set of HOI exemplars [19]. These methods were mainly based on handcrafted features (e.g., color, HOG, and SIFT) combined with object and human detectors. More recently, inspired by the notable success of deep learning and the availability of large-scale HOI datasets [1, 2], several deep learning based HOI models have been proposed. Specifically, Mallya et al. [29] modified the Fast RCNN model [13] for HOI recognition, with the assistance of Visual Question Answering (VQA). In [38], zero-shot learning was applied to address the long-tail problem in HOI recognition. In [1], human proposals, object regions, and their combinations were fed into a multi-stream network for HOI detection. Gkioxari et al. [14] estimated an action-type specific density map for identifying interacted object locations, with a modified Faster RCNN architecture [36].

Although promising results have been achieved by the above deep HOI models, two issues remain unsolved. First, they lack a powerful tool to represent the structures in HOI tasks explicitly and to encode them efficiently into modern network architectures. Second, despite their successes on specific tasks, a complete and generic HOI representation is missing; these approaches cannot be easily extended to HOI recognition in videos. To address these issues, we introduce GPNN for imposing high-level relations onto DNNs, leading to a powerful HOI representation that is applicable in both static and dynamic settings.

Neural Networks with Graphs/Graphical Models. In the literature, several approaches have been proposed to combine graphical models and neural networks. The most intuitive approach is to build a graphical model upon a DNN, where the network that generates features is trained first and its output is used to compute the potential functions of the graphical predictor. Typical examples appear in human pose estimation [42], human part parsing [33, 45], and semantic image segmentation [3, 4]. These methods lack a deep integration in the sense that the computation process of the graphical model cannot be learned end-to-end. Other attempts [7, 21, 31, 32, 37, 40, 44, 51] generalize neural network operations (e.g., convolutions) directly from regular grids (e.g., images) to graphs. For the HOI problem, however, a structured representation is needed to capture the high-level spatial-temporal relations between humans and objects. Some other work integrated network architectures with graphical models [12, 20] and gained promising results on applications such as scene understanding [24, 30, 46], object detection and parsing [27, 50], and VQA [41]. However, these methods only apply to problems with pre-fixed graph structures. Liang et al. [26] merged graph nodes using Long Short-Term Memory (LSTM) for the human parsing problem, under the assumption that the nodes are mergeable.

These methods achieved promising results on their specific tasks and demonstrated the benefit of complementing deep architectures with domain-specific structures. However, most of them are based on pre-fixed graph structures, and they have not yet been studied for HOI recognition. In this work, we extend previous graph neural networks with learnable graph structures, which better capture the rich, high-level relations in HOI problems. The proposed GPNN can automatically infer the graph structure and utilize it to enhance information propagation and further inference. It offers a generic HOI representation for both spatial and spatial-temporal settings. To the best of our knowledge, this is the first attempt to integrate graphical models and neural networks in a unified framework achieving state-of-the-art results in HOI recognition.

Fig. 2. Illustration of the forward pass of GPNN. GPNN takes node and edge features as input and outputs a parse graph in a message passing fashion. The structure of the parse graph is given by a soft adjacency matrix, computed by the link function from the features (or hidden node states); the darker the color in the adjacency matrix, the stronger the connectivity. The message functions then compute incoming messages for each node as a weighted sum of the messages from other nodes, where thicker edges indicate larger information flows. The update functions update the hidden internal state of each node. The above process is repeated for several steps, iteratively and jointly learning the graph structure and the message passing. Finally, for each node, the readout functions output HOI action or object labels from the hidden node states. See Sect. 3 for more details.

3 Graph Parsing Neural Network for HOI

3.1 Formulation

For HOI understanding, humans and objects are represented by nodes, and their relations are defined as edges. Given a complete HOI graph that includes all possible relationships among humans and objects, we want to automatically infer a parse graph by keeping the meaningful edges and labeling the nodes.

Formally, let \(\mathcal {G} = (\mathcal {V}, \mathcal {E}, \mathcal {Y})\) denote the complete HOI graph. Nodes \(v \in \mathcal {V}\) take unique values from \(\{1, \cdots , |\mathcal {V}|\}\). Edges \(e \in \mathcal {E}\) are two-tuples \(e = (v, w) \in \mathcal {V} \times \mathcal {V}\). Each node v has an output state \(y_{v} \in \mathcal {Y}\) that takes a value from a set of labels \(\{1, \cdots , Y_{v}\}\) (e.g., actions). A parse graph \(g = (\mathcal {V}_{g}, \mathcal {E}_{g}, \mathcal {Y}_{g})\) is a sub-graph of \(\mathcal {G}\), where \(\mathcal {V}_{g} \subseteq \mathcal {V}\) and \(\mathcal {E}_{g} \subseteq \mathcal {E}\). Given node features \(\Gamma ^{\mathcal {V}}\) and edge features \(\Gamma ^{\mathcal {E}}\), we want to infer the optimal parse graph \(g^{*}\) that best explains the data according to a probability distribution p:

$$\begin{aligned} \begin{aligned} g^{*}&= \mathop {\mathrm {argmax}}\limits _{g}~p(g | \Gamma , \mathcal {G}) = \mathop {\mathrm {argmax}}\limits _{g}~p(\mathcal {V}_{g}, \mathcal {E}_{g}, \mathcal {Y}_{g} | \Gamma , \mathcal {G}) \\&= \mathop {\mathrm {argmax}}\limits _{g}~p(\mathcal {Y}_{g} | \mathcal {V}_{g}, \mathcal {E}_{g}, \Gamma ) p(\mathcal {V}_{g}, \mathcal {E}_{g} | \Gamma , \mathcal {G}) \end{aligned} \end{aligned}$$
(1)

where \(\Gamma = \{\Gamma ^{\mathcal {V}}, \Gamma ^{\mathcal {E}}\}\). Here \(p(\mathcal {V}_{g}, \mathcal {E}_{g} | \Gamma , \mathcal {G})\) evaluates the graph structure, and \(p(\mathcal {Y}_{g} | \mathcal {V}_{g}, \mathcal {E}_{g}, \Gamma )\) is the labeling probability for the nodes in the parse graph.

This formulation provides a principled guideline for designing GPNN. We design the network to approximate the computations of \(\mathop {\mathrm {argmax}}\nolimits _{g} p(\mathcal {V}_{g}, \mathcal {E}_{g} | \Gamma , \mathcal {G})\) and \(\mathop {\mathrm {argmax}}\nolimits _{g} p(\mathcal {Y}_{g} | \mathcal {V}_{g}, \mathcal {E}_{g}, \Gamma )\). We introduce four types of functions as individual modules in the forward pass of GPNN: link functions, message functions, update functions, and readout functions (illustrated in Fig. 2). The link functions \(L(\cdot )\) estimate the graph structure, approximating \(p(\mathcal {V}_{g}, \mathcal {E}_{g} | \Gamma , \mathcal {G})\). The message, update, and readout functions together resemble the belief propagation process and approximate \(\mathop {\mathrm {argmax}}\nolimits _{\mathcal {Y}_{g}} p(\mathcal {Y}_{g} | \mathcal {V}_{g}, \mathcal {E}_{g}, \Gamma )\).

Specifically, the link function takes edge features as input and infers the connectivity between nodes. The resulting soft adjacency matrix serves as weights for the messages passing along edges between nodes. The incoming messages for a node are summarized by the message function, then the hidden embedding state of the node is updated based on these messages by an update function. Finally, readout functions compute the target output for each node. The four types of functions are defined as follows:

Link Function. We first infer an adjacency matrix that represents the connectivity (i.e., the graph structure) between nodes by a link function. A link function \(L(\cdot )\) takes the node features \(\Gamma ^{\mathcal {V}}\) and edge features \(\Gamma ^{\mathcal {E}}\) as input and outputs an adjacency matrix \(A\in [0,1]^{|\mathcal {V}| \times |\mathcal {V}|}\):

$$\begin{aligned} \begin{aligned} A_{vw} = L(\Gamma _{v}, \Gamma _{w}, \Gamma _{vw}) \end{aligned} \end{aligned}$$
(2)

where \(A_{vw}\) denotes the (v, w)-th entry of the matrix A. Here we overload the notation and let \(\Gamma _{v}\) denote node features and \(\Gamma _{vw}\) denote edge features. In this way, the structure of a parse graph g is approximated by the adjacency matrix. We then propagate messages over the parse graph, where the soft adjacency matrix controls the information passed through each edge.

Message and Update Functions. Based on the learned graph structure, a message passing algorithm is adopted to infer the node labels. During belief propagation, the hidden states of the nodes are iteratively updated by communicating with other nodes. Specifically, message functions \(M(\cdot )\) summarize the messages coming to a node from other nodes, and update functions \(U(\cdot )\) update the hidden node states according to the incoming messages. At each iteration step s, the two functions compute:

$$\begin{aligned} \begin{aligned} m_{v}^{s} = \sum \nolimits _{w} A_{vw} M(h_{v}^{s-1}, h_{w}^{s-1}, \Gamma _{vw}) \end{aligned} \end{aligned}$$
(3)
$$\begin{aligned} \begin{aligned} h_{v}^{s} = U(h_{v}^{s-1}, m_{v}^{s}) \end{aligned} \end{aligned}$$
(4)

where \(m_{v}^{s}\) is the summarized incoming message for node v at the s-th iteration and \(h_{v}^{s}\) is the hidden state of node v. The node connectivity A encourages information flow between nodes in the parse graph. The message passing phase runs for S steps towards convergence. At the first step, the node hidden states \(h_{v}^{0}\) are initialized with the node features \(\Gamma _{v}\).

Readout Function. Finally, for each node, the hidden state is fed into a readout function to output a label:

$$\begin{aligned} \begin{aligned} y_{v} = R(h_{v}^{S}). \end{aligned} \end{aligned}$$
(5)

Here the readout function \(R(\cdot )\) computes the output \(y_v\) for node v by activating its hidden state \(h_{v}^{S}\) (node embedding).

Iterative Parsing. Based on the above four functions, messages are passed along the graph, weighted by the learned adjacency matrix A. We further extend this process into a joint learning framework that iteratively infers the graph structure and propagates information to infer the node labels. In particular, instead of learning A only at the beginning, we re-infer A from the updated node information and edge features at each step s:

$$\begin{aligned} \begin{aligned} A_{vw}^{s} = L(h_{v}^{s-1}, h_{w}^{s-1}, m_{vw}^{s-1}). \end{aligned} \end{aligned}$$
(6)

Then the messages in Eq. 3 are redefined as:

$$\begin{aligned} \begin{aligned} m_{v}^{s} = \sum \nolimits _{w} A_{vw}^{s} M(h_{v}^{s-1}, h_{w}^{s-1}, \Gamma _{vw}). \end{aligned} \end{aligned}$$
(7)

In this way, both the graph structure and the message update can be jointly and iteratively learned in a unified framework. In practice, we find that this strategy brings better performance (detailed in Sect. 4.3).
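To make the control flow concrete, the following minimal sketch (in PyTorch, anticipating the implementation in Sect. 3.2) shows one possible forward pass implementing Eqs. 6 and 7. The module interfaces and tensor shapes are our assumptions, and for simplicity the link function here conditions on the current hidden states and the raw edge features rather than the previous messages:

```python
import torch

def gpnn_forward(node_feats, edge_feats, link_fn, msg_fn, update_fn, readout_fn, S=3):
    """Sketch of one GPNN forward pass: jointly infer the graph and pass messages.

    node_feats: (|V|, d_V) node features Gamma^V.
    edge_feats: (|V|, |V|, d_E) edge features Gamma^E.
    """
    h = node_feats                           # h^0 initialized with node features
    num_nodes = h.size(0)
    for s in range(S):
        # Eq. 6: re-infer the soft adjacency from the current node states.
        A = link_fn(h, edge_feats)           # (|V|, |V|), entries in [0, 1]
        # Eq. 7: each node receives a weighted sum of messages from the others.
        msgs = []
        for v in range(num_nodes):
            incoming = sum(A[v, w] * msg_fn(h[v], h[w], edge_feats[v, w])
                           for w in range(num_nodes) if w != v)
            msgs.append(incoming)
        m = torch.stack(msgs)                # (|V|, message dimension)
        h = update_fn(m, h)                  # Eq. 4, e.g. a GRU cell
    return readout_fn(h)                     # Eq. 5: per-node label scores
```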

In the next section, we show that by implementing each function with neural networks, the entire system becomes differentiable end-to-end. Hence all parameters can be learned using gradient-based optimization.

3.2 Network Architecture

Link Function. Given the complete HOI graph \(\mathcal {G} = (\mathcal {V}, \mathcal {E}, \mathcal {Y})\), we use \(d_{V}\) and \(d_{E}\) to denote the dimensions of the node features and the edge features, respectively. In a message passing step s, we first concatenate all the node features (hidden states) \(\{h^s_v\in \mathbb {R}^{d_{V}}\}_v\) and all the edge features (messages) \(\{m^s_{vw}\in \mathbb {R}^{d_{E}}\}_{v,w}\) to form a feature matrix \(F^{s}\in \mathbb {R}^{|\mathcal {V}| \times |\mathcal {V}| \times (2d_{V}+d_{E})}\) (see Fig. 2). The link function is defined as a small neural network with one or several convolutional layers (with \(1\times 1\times (2d_{V}+d_{E})\) kernels) and a sigmoid activation. The adjacency matrix \(A^s \in [0,1]^{|\mathcal {V}| \times |\mathcal {V}|}\) is then computed as:

$$\begin{aligned} \begin{aligned} A^{s} = \sigma (\mathbf {W}^L*F^{s}), \end{aligned} \end{aligned}$$
(8)

where \(\mathbf {W}^L\) denotes the learnable parameters of the link function network \(L(\cdot )\) and \(*\) denotes the convolution operation. The sigmoid \(\sigma (\cdot )\) normalizes the elements of \(A^{s}\) into [0, 1]. The effect of multiple convolutional layers with \(1 \times 1\) kernels is similar to fully connected layers applied to each individual edge feature, except that the filter weights are shared across all edges. In practice, we find this operation produces sufficiently good results with high computational efficiency.
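As an illustration, a minimal PyTorch sketch of such a link function might look as follows; the layer widths follow the 128-128-1 design used in Sect. 4.1, while the class and argument names are our own:

```python
import torch
import torch.nn as nn

class LinkFunction(nn.Module):
    """Infers a soft adjacency matrix from per-edge features via 1x1 convolutions."""
    def __init__(self, in_ch, hidden_ch=128):
        super().__init__()
        # 1x1 convs act like a fully connected net applied to every edge (v, w),
        # with weights shared across all edges.
        self.net = nn.Sequential(
            nn.Conv2d(in_ch, hidden_ch, kernel_size=1), nn.ReLU(),
            nn.Conv2d(hidden_ch, hidden_ch, kernel_size=1), nn.ReLU(),
            nn.Conv2d(hidden_ch, 1, kernel_size=1),
        )

    def forward(self, F):
        # F: (batch, 2*d_V + d_E, |V|, |V|) feature matrix (channels-first layout).
        logits = self.net(F)                      # (batch, 1, |V|, |V|)
        return torch.sigmoid(logits.squeeze(1))   # soft adjacency in [0, 1]
```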

For spatial-temporal problems, where the adjacency matrix should account for previous states, we model \(L(\cdot )\) with a convolutional LSTM [39] in the temporal domain. At time t, the link function takes \(F^{s, t}\) as input features and the previous adjacency matrix \(A^{s, t-1}\) as hidden state: \(A^{s, t} = convLSTM(F^{s, t}, A^{s, t-1})\). Again, the kernel size of the conv layers in the convLSTM is \(1 \times 1 \times (2d_{V}+d_{E})\).
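A bare-bones 1 × 1 convolutional LSTM cell for this purpose could be sketched as below (reusing the imports from the previous sketch); feeding the adjacency matrix back as the hidden state and squashing the output with a sigmoid is our reading of the description above, not a confirmed implementation detail:

```python
class ConvLSTMLink(nn.Module):
    """1x1 convLSTM cell whose hidden state doubles as the soft adjacency matrix."""
    def __init__(self, in_ch, hid_ch=1):
        super().__init__()
        # A single 1x1 conv produces all four LSTM gates at once.
        self.gates = nn.Conv2d(in_ch + hid_ch, 4 * hid_ch, kernel_size=1)

    def forward(self, F, state):
        # F: (batch, 2*d_V + d_E, |V|, |V|); state = (A_prev, c_prev).
        A_prev, c_prev = state
        i, f, o, g = torch.chunk(self.gates(torch.cat([F, A_prev], dim=1)), 4, dim=1)
        c = torch.sigmoid(f) * c_prev + torch.sigmoid(i) * torch.tanh(g)
        h = torch.sigmoid(o) * torch.tanh(c)
        A = torch.sigmoid(h)  # keep adjacency entries in [0, 1] (our assumption)
        return A, (A, c)
```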

Message Function. In our implementation, the message function \(M(\cdot )\) in Eq. 3 is computed by:

$$\begin{aligned} \begin{aligned} M(h_{v}, h_{w}, \Gamma _{vw}) = [\mathbf {W}_V^M h_{v}, \mathbf {W}_V^M h_{w}, \mathbf {W}_E^M \Gamma _{vw}], \end{aligned} \end{aligned}$$
(9)

where [., .] denotes concatenation. The function concatenates the outputs of linear transforms (i.e., fully connected layers parameterized by \(\mathbf {W}_V^M\) and \(\mathbf {W}_E^M\)) that take the node hidden states \(h_{v}\), \(h_{w}\) and the edge features \(\Gamma _{vw}\) as input.
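In code, a hedged sketch of Eq. 9 could be (assuming, as in Eq. 9, that both endpoints share the node transform \(\mathbf {W}_V^M\)):

```python
import torch
import torch.nn as nn

class MessageFunction(nn.Module):
    """Eq. 9: concatenate transformed endpoint states and the edge feature."""
    def __init__(self, d_v, d_e):
        super().__init__()
        self.node_fc = nn.Linear(d_v, d_v)  # W_V^M, shared by both endpoints
        self.edge_fc = nn.Linear(d_e, d_e)  # W_E^M

    def forward(self, h_v, h_w, gamma_vw):
        # Output dimension: 2 * d_v + d_e.
        return torch.cat([self.node_fc(h_v), self.node_fc(h_w),
                          self.edge_fc(gamma_vw)], dim=-1)
```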

Update Function. Recurrent neural networks [10, 18] are a natural choice for simulating the iterative update process, as in previous work [12]. Here we adopt the Gated Recurrent Unit (GRU) [5] as the update function, because of its recurrent nature and smaller number of parameters. The update function in Eq. 4 is thus implemented as:

$$\begin{aligned} \begin{aligned} h_{v}^{s} = U(h_{v}^{s-1}, m_{v}^{s}) = GRU(h_{v}^{s-1}, m_{v}^{s}), \end{aligned} \end{aligned}$$
(10)

where \(h_{v}^{s}\) is the hidden state and \(m_{v}^{s}\) serves as the input features. As demonstrated in [25], the GRU is more effective than vanilla recurrent neural networks.
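In PyTorch this maps directly onto a GRU cell whose input is the summarized message; the dimensions below are hypothetical:

```python
import torch
import torch.nn as nn

d_v, d_e, num_nodes = 256, 256, 5                 # hypothetical sizes
update_fn = nn.GRUCell(input_size=2 * d_v + d_e,  # message dimension from Eq. 9
                       hidden_size=d_v)

h_prev = torch.randn(num_nodes, d_v)              # h^{s-1}: previous node states
m = torch.randn(num_nodes, 2 * d_v + d_e)         # m^s: summarized incoming messages
h_new = update_fn(m, h_prev)                      # Eq. 10: message acts as GRU input
```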

Readout Function. A typical readout function combines one or several fully connected layers (parameterized by \(\mathbf {W}^R\)) with an activation function:

$$\begin{aligned} \begin{aligned} y_v = R(h_v^S) = \varphi (\mathbf {W}^R h_v^S). \end{aligned} \end{aligned}$$
(11)

Here the activation function \(\varphi (\cdot )\) is a softmax (for single-label outputs) or a sigmoid (for multi-label outputs), depending on the HOI task.
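A corresponding sketch with a single linear layer follows; the multi_label switch is our addition to cover both activations:

```python
import torch
import torch.nn as nn

class ReadoutFunction(nn.Module):
    """Eq. 11: map the final node embedding to per-class scores."""
    def __init__(self, d_v, num_classes, multi_label=True):
        super().__init__()
        self.fc = nn.Linear(d_v, num_classes)  # W^R
        self.multi_label = multi_label

    def forward(self, h_v):
        z = self.fc(h_v)
        # Sigmoid for multi-label HOI detection, softmax for single-label tasks.
        return torch.sigmoid(z) if self.multi_label else torch.softmax(z, dim=-1)
```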

In this way, the entire GPNN is fully differentiable and end-to-end trainable. The loss for a specific HOI task is computed on the outputs of the readout functions, and the error is propagated back by the chain rule. In the next section, we offer more details on implementing GPNN for HOI tasks in spatial and spatial-temporal settings and present qualitative as well as quantitative results.

4 Experiments

To verify the effectiveness and generic applicability of GPNN, we perform experiments on two HOI problems: (i) HOI detection in images [1, 17], and (ii) HOI recognition and anticipation in videos [22]. The first experiment is performed on the HICO-DET [1] and V-COCO [17] datasets, showing that our approach scales to large datasets (about 60K images in total) and achieves good detection accuracy over a large number of classes (more than 600 HOI categories). The second experiment is reported on the CAD-120 dataset [22], showing that our method applies well to spatial-temporal domains.

4.1 Human-Object Interaction Detection in Images

For HOI detection in an image, the goal is to detect pairs of human and object bounding boxes together with the interaction class label connecting them.

Table 1. HOI detection results (mAP) on HICO-DET dataset [1]. Higher values are better. The best scores are marked in bold.
Fig. 3. HOI detection results on HICO-DET [1] test images. Humans and objects are shown in red and green rectangles, respectively. Best viewed in color.

Datasets. We use the HICO-DET [1] and V-COCO [17] datasets for benchmarking our GPNN model. HICO-DET provides more than 150K annotated instances of human-object pairs in 47,051 images (37,536 training and 9,515 testing). It shares the same 80 object categories as MS-COCO [28] and has 117 action categories. V-COCO is a subset of MS-COCO [28]. It consists of 10,346 images with 16,199 person instances, with \(\sim \)2.5K images in the train set, \(\sim \)2.8K for validation, and \(\sim \)4.9K for testing. Each annotated person has binary labels for 26 action classes. Note that three actions (i.e., cut, eat, and hit) are annotated with two types of targets: instrument and direct object.

Implementation Details. Humans and objects are represented by nodes in the graph, while human-object interactions are represented by edges. In this experiment, we use a pre-trained deformable convolutional network [6] for object detection and feature extraction. Based on the detected bounding boxes, we extract node features (\(7 \times 7 \times 80\)) from the position-sensitive region of interest (PS RoI) pooling layer of the deformable ConvNet. We extract each edge feature from a combined bounding box, i.e., the smallest box that contains both nodes’ bounding boxes. The functions of GPNN are implemented as follows. We use a convolutional network (128-128-1)-Sigmoid(\(\cdot \)) with \(1 \times 1\) kernels for the link function. The message functions are composed of a fully connected layer, concatenation, and summation: for a node v, the neighboring node feature \(\Gamma _{w}\) and edge feature \(\Gamma _{vw}\) are passed through fully connected layers and concatenated, and the final incoming message is a weighted sum of the messages from all neighboring nodes. Specifically, the message for node v coming from node w through edge \(e=(v, w)\) is the concatenation of the outputs of FC(\(d_{V}\)-\(d_{V}\)) and FC(\(d_{E}\)-\(d_{E}\)). A GRU(\(d_{V}\)) is used for the update function. The number of propagation steps S is set to 3. For the readout function, we use FC(\(d_{V}\)-117)-Sigmoid(\(\cdot \)) and FC(\(d_{V}\)-26)-Sigmoid(\(\cdot \)) for HICO-DET and V-COCO, respectively.
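For example, the combined box for an edge can be computed with a small helper like the one below (our illustration; boxes are assumed to be (x1, y1, x2, y2) tuples):

```python
def union_box(box_a, box_b):
    """Smallest box containing both inputs, each given as (x1, y1, x2, y2)."""
    return (min(box_a[0], box_b[0]), min(box_a[1], box_b[1]),
            max(box_a[2], box_b[2]), max(box_a[3], box_b[3]))

# e.g. union_box((10, 20, 50, 80), (40, 10, 90, 60)) == (10, 10, 90, 80)
```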

The probability of an HOI label for a human-object pair is given by the product of the final output probabilities of the human node and the object node. We employ an L1 loss on the adjacency matrix. For the node outputs, we use a weighted multi-class multi-label hinge loss. The reasons are two-fold: the training examples are unbalanced, and each node poses an inherently multi-label problem (there might not even exist a meaningful human-object interaction for the detected humans and objects).
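A hedged sketch of such a combined objective is given below; the margin formulation, the weighting scheme, and the balance factor lam are our guesses at reasonable choices, and A_gt stands for a ground-truth adjacency derived from the annotated human-object pairs:

```python
import torch

def gpnn_loss(scores, targets, A, A_gt, class_weights, margin=1.0, lam=1.0):
    """scores: (N, C) node outputs; targets: (N, C) in {0, 1}; A, A_gt: (N, N)."""
    signs = 2.0 * targets - 1.0                 # +1 for positive labels, -1 otherwise
    hinge = torch.clamp(margin - signs * scores, min=0.0)
    node_loss = (class_weights * hinge).mean()  # weighted multi-class multi-label hinge
    graph_loss = (A - A_gt).abs().mean()        # L1 supervision on the adjacency
    return node_loss + lam * graph_loss
```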

Our model is implemented in PyTorch and trained on a machine with a single Nvidia Titan Xp GPU. We start with a learning rate of 1e-3 and decay it by a factor of 0.8 every 5 epochs. The training process takes about 20 epochs (\(\sim \)15 h) to roughly converge with a batch size of 32.
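This schedule maps directly onto a standard PyTorch setup; the choice of Adam below is our assumption, as the paper does not name the optimizer:

```python
import torch

model = torch.nn.Linear(10, 2)  # stand-in for the assembled GPNN modules
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
# Decay the learning rate by a factor of 0.8 every 5 epochs.
scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=5, gamma=0.8)

for epoch in range(20):
    # ... one pass over the training set with batch size 32 goes here ...
    scheduler.step()
```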

Table 2. HOI detection results (mAP) on V-COCO [17] dataset. Legend: Set 1 indicates 18 HOI actions with one object, and Set 2 corresponds to 3 HOI actions (i.e., cut, eat, hit) with two objects (instrument and object).
Fig. 4. HOI detection results on V-COCO [17] test images. Humans and objects are shown in red and green rectangles, respectively. Best viewed in color.

Comparative Methods. We compare our method with eight baselines: (1) Fast-RCNN (union) [13]: for each human-object proposal from the detection results, the attention window (the union of the two boxes) is used as the region proposal for Fast-RCNN. (2) Fast-RCNN (score) [13]: given human-object proposals, the HOI is predicted by linearly combining the human and object detection scores. (3) HO-RCNN [1]: a multi-stream architecture with ConvNets classifying the human, object, and human-object proposals, respectively; the final output combines the scores of all three streams. (4) HO-RCNN+IP [1] and (5) HO-RCNN+IP+S [1]: HO-RCNN with additional components, where Interaction Patterns (IP) act as an attention filter on the images, and S is an extra path with a single neuron that uses the raw object detection score to produce an offset for the final detection. More detailed descriptions of the above five baselines can be found in [1]. (6) Gupta et al. [17]: trained based on Fast-RCNN [13]; we use the scores reported in [14]. (7) Shen et al. [38]: final predictions come from two Faster RCNN [36] based networks trained to predict verb and object classes, respectively. (8) InteractNet [14]: a modified Faster RCNN [36] with an additional human-centric branch that estimates an action-specific density map for locating objects.

Experiment Results. Following the standard settings of the HICO-DET and V-COCO benchmarks, we evaluate HOI detection using mean average precision (mAP). An HOI detection is counted as a true positive when the human detection, the object detection, and the interaction class are all correct; the human and object bounding boxes are considered correct if they overlap a ground truth bounding box of the same class with an intersection over union (IoU) greater than 0.5. For the HICO-DET dataset, we report the mAP over three HOI category sets: (i) all 600 HOI categories in HICO (Full); (ii) 138 HOI categories with fewer than 10 training instances (Rare); and (iii) 462 HOI categories with 10 or more training instances (Non-Rare). For the V-COCO dataset, since we concentrate on HOI detection, we report the mAP over three groups: (i) 18 HOI action classes with one target object; (ii) 3 HOI categories with two types of objects; and (iii) all 24 (=\(18+3\times 2\)) HOI classes. Results are evaluated on the test sets and reported in Tables 1 and 2.
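For reference, the IoU criterion can be computed as follows (a standard helper, included here for completeness):

```python
def iou(box_a, box_b):
    """Intersection over union for boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    ix2, iy2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    return inter / (area_a + area_b - inter)

# A detection is a true positive only if iou(pred, gt) > 0.5 for both the
# human and the object box, and the interaction class is correct.
```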

As shown in Table 1, the proposed GPNN substantially outperforms the comparative methods, achieving 31.89%, 30.45%, and 32.13% improvements over the second best methods on the three HOI category sets of the HICO-DET dataset. The results on the V-COCO dataset (Table 2) consistently demonstrate the superior performance of GPNN. Two important conclusions can be drawn from these results: (i) our method is scalable to large datasets; and (ii) our method performs better than pure neural networks. Visual results are shown in Figs. 3 and 4.

4.2 Human-Object Interaction Recognition in Videos

The goal of this experiment is to detect and anticipate the human sub-activity labels and object affordance labels as a human-object interaction progresses in a video. The problem is challenging since it involves complex interactions: humans interact with multiple objects, and the objects also interact with each other.

CAD-120 Dataset [22]. It has 120 RGB-D videos of 4 subjects performing 10 activities, each of which is a sequence of sub-activities involving 10 actions (e.g., reaching, opening) and 12 object affordances (e.g., reachable, openable) in total.

Table 3. Human activity detection and future anticipation results on CAD-120 [22] dataset, measured via F1-score.
Fig. 5. Confusion matrices of HOI detection (a)(b) and anticipation (c)(d) results on the CAD-120 [22] dataset. Zoom in for more details.

Implementation Details. The link function is implemented as convLSTM(1024-1024-1024-1)-Sigmoid(\(\cdot \)) (i.e., a four-layer convLSTM). We use the same architecture as in the previous experiment for the message and update functions: [FC(\(d_{V}\)-\(d_{V}\)), FC(\(d_{E}\)-\(d_{E}\))] for the message function and GRU(\(d_{V}\)) for the update function. The number of propagation steps S is set to 3. We use FC(\(d_{V}\)-10)-Softmax(\(\cdot \)) and FC(\(d_{V}\)-12)-Softmax(\(\cdot \)) as the readout functions for sub-activity and object affordance detection/anticipation, respectively. We employ an L1 loss for the adjacency matrix and a cross-entropy loss for the node outputs. We use the publicly available node and edge features from [23].

Comparative Methods. We compare our method with two baselines: the anticipatory temporal CRF (ATCRF) [22] and the structural RNN (S-RNN) [20]. ATCRF is a top-performing graphical model for this problem, while S-RNN is the state-of-the-art method using structured neural networks. ATCRF models human activities with a spatial-temporal conditional random field. S-RNN casts a pre-defined spatial-temporal graph as an RNN mixture by representing nodes and edges as LSTMs.

Fig. 6. HOI detection results on a “cleaning objects” activity from the CAD-120 [22] dataset. The human is shown in a red rectangle; the two objects are shown in green and blue rectangles, respectively. Detection and anticipation results are shown as separate bars. For the anticipation task, the label of the sub-activity at time t is anticipated at time \(t-1\). Best viewed in color.

Experiment Results. Table 3 shows the quantitative comparison of our method against the competitors, reporting F1-scores averaged over all classes on the detection and anticipation tasks. GPNN improves substantially over ATCRF and S-RNN, especially on the anticipation task. Our method outperforms the other two for the following reasons. (i) Compared to ATCRF, which is limited by the Markov assumption, our method allows arbitrary graph structures with improved representation ability. (ii) Our method enjoys the deep integration of graphical models and neural networks and can be learned in an end-to-end manner. (iii) Rather than relying on a pre-fixed graph structure as in S-RNN, we infer the graph structure by learning an adjacency matrix and can thus control the information flow between nodes during message passing. Figure 5 shows the confusion matrices for detecting and anticipating the sub-activities and object affordances. From the above results we draw two conclusions: (i) our method applies well to the spatial-temporal domain; and (ii) our method outperforms both pure graphical models (e.g., ATCRF) and deep networks with pre-fixed graph structures (e.g., S-RNN). Figure 6 shows a qualitative visualization of a “cleaning objects” activity, with one representative frame for each sub-activity and the corresponding detections and anticipations.

4.3 Ablation Study

In this section, we analyze the contributions of different model components to the final performance and examine the effectiveness of our main assumptions. Table 4 shows the detailed results on all three datasets.

Integration of DNN with Graphical Model. We first examine the influence of integrating DNNs with a graphical model. We directly feed the features originally used by GPNN into fully connected networks that predict HOI action or object classes. As Table 4 shows, the performance of w/o graph is significantly worse than the full GPNN model across the HOI datasets. This supports our view that jointly modeling high-level structures and leveraging the learning capability of DNNs is essential for HOI tasks.

Table 4. Ablation study of GPNN model. Higher values are better.

GPNN with Fixed Graph Structures. In Sect. 3, GPNN automatically infers the graph structure (i.e., the parse graph) by learning a soft adjacency matrix. To assess this strategy, we fix all entries of the soft adjacency matrices to the constant 1, so that the graph structure is fixed and the information flow between nodes is not weighted. For this constant graph baseline, we observe an obvious performance decrease compared with the full GPNN model, indicating that inferring graph structures is critical for reasonable performance.

GPNN without Supervision on Link Functions. We also perform experiments with the L1 loss on the adjacency matrices turned off (w/o graph loss in Table 4). We observe that the intermediate L1 loss is effective, further justifying our design of learning the graph structure. Interestingly, training the model without this loss has a similar effect to training with a constant graph; hence supervision on the graph is fairly important.

Jointly Learning Parse Graphs and Message Passing. We next study the effect of jointly learning graph structures and message passing. By isolating graph parsing from message passing, we obtain w/o joint parsing, where the adjacency matrices are computed by the link functions from the edge features only at the beginning. We observe a performance decrease in Table 4, showing that learning graph structures and message passing together indeed boosts performance.

Iterative Learning Process. To examine the effect of iterative message passing, we report three baselines: 1 iteration, 2 iterations, and 4 iterations, corresponding to results after different numbers of message passing iterations. The baseline GPNN (first row in Table 4) gives the results after three iterations. We observe that the iterative learning process gradually improves performance in general, but beyond a certain number of iterations the performance drops slightly.

5 Conclusion

In this paper, we propose the Graph Parsing Neural Network (GPNN) for inferring a parse graph in an end-to-end manner. The network decomposes into four distinct functions, namely link functions, message functions, update functions, and readout functions, for iterative graph inference and message passing. GPNN provides a generic HOI representation that is applicable in both spatial and spatial-temporal domains. We demonstrate substantial performance gains on three HOI datasets, showing the effectiveness of the proposed framework.