1 Introduction

The task of human-object interaction (HOI) understanding aims to infer the relationships between humans and objects, such as “riding a bike” or “washing a bike”. Beyond the traditional visual recognition of individual instances, e.g., human pose estimation, action recognition, and object detection, recognizing HOIs requires a deeper semantic understanding of image contents. Recently, deep neural networks (DNNs) have shown impressive progress on the above individual instance recognition tasks, while relatively few methods [1, 2, 14, 38] have been proposed for HOI recognition. This is mainly because the task requires reasoning beyond perception, integrating information from humans, objects, and their complex relationships.

In this paper, we propose a novel model, the Graph Parsing Neural Network (GPNN), for HOI recognition. GPNN offers a general framework that explicitly represents HOI structures with graphs and automatically parses the optimal graph structure in an end-to-end manner. In principle, it is a generalization of the Message Passing Neural Network (MPNN) [12]. An overview of GPNN is shown in Fig. 1. The following two aspects motivate our design.

First, we seek a unified framework that combines the learning capability of neural networks with the power of graphical representations. Recent deep learning based HOI models have shown promising results, but few have addressed how to explicitly represent and leverage the spatial and temporal dependencies and human-object relations in such a structured task. Aiming at this, we introduce GPNN. It inherits the complementary strengths of neural networks and graphical models, forming a coherent HOI representation with strong learning ability. Specifically, with the structured representation of an HOI graph, the rich relations are explicitly utilized, and information from individual elements can be efficiently integrated and broadcast over the structure. The whole model and its message passing operations are well-defined and fully differentiable, so it can be learned efficiently from data in an end-to-end manner.

Fig. 1. Illustration of the proposed GPNN for learning HOI. GPNN offers a generic HOI representation that applies to (a) HOI detection in images and (b) HOI recognition in videos. By integrating graphical models and neural networks, GPNN iteratively learns/infers the graph structure (a.v) and performs message passing (a.vi). The final parse graph explains a given scene with the graph structure (e.g., the link between the person and the knife) and the node labels (e.g., lick). A thicker edge corresponds to stronger information flow between nodes in the graph.

Second, building on this efficient HOI representation and its learning power, GPNN applies to diverse HOI tasks in both static and dynamic scenes. Previous HOI studies achieved good performance in their specific domains (spatial [1, 14] or temporal [20, 34, 35]), but none of them offers a generic framework for representing and learning HOI in both images and videos. The key difficulty lies in the diverse relations between components. Given a set of human and object candidates, there may exist an unknown number of human-object interaction pairs (see Fig. 1 (a.ii) for an example), and the relations become even more complex once temporal factors are taken into consideration. Thus pre-fixed graph structures, as adopted by most previous graphical or structured DNN models [11, 20, 22, 43], are not an optimal choice. Seeking better generalization ability, GPNN incorporates an essential link function that addresses the problem of graph structure learning. It learns to infer the adjacency matrix in an end-to-end manner and can thus produce a parse graph that explicitly explains the HOI relations. With such a learnable graph structure, GPNN can also limit the information flow from irrelevant nodes while encouraging messages to propagate between related nodes, thus improving graph parsing.

We extensively evaluate the proposed GPNN on three HOI datasets, namely HICO-DET [1], V-COCO [17], and CAD-120 [22], covering HOI detection in images (HICO-DET, V-COCO) and HOI recognition and anticipation in spatial-temporal settings (CAD-120). The experimental results verify the generality and scalability of our GPNN based HOI representation and show substantial improvements over state-of-the-art approaches, including pure graphical models and pure neural networks. We also demonstrate that GPNN outperforms its variants and other graph neural networks with pre-fixed structures.

This paper makes three major contributions. First, we propose GPNN, which incorporates structural knowledge into DNNs for learning and inference. Second, with a set of well-defined modular functions, GPNN addresses the HOI problem by jointly performing graph structure inference and message passing. Third, we empirically show that GPNN offers a scalable and generic HOI representation that applies to both static and dynamic settings.

2 Related Work

Human-Object Interaction. Reasoning about human actions with objects (like “playing baseball”, “playing guitar”), rather than recognizing individual actions (“playing”) or object instances (“baseball”, “guitar”), is essential for a more comprehensive understanding of what is happening in a scene. Early work in HOI understanding studied Bayesian models [15, 16], utilized contextual relationships between humans and objects [47,48,49], learned structured representations with spatial interaction and context [8], exploited compositional models [9], or referred to a set of HOI exemplars [19]. These methods were mainly based on handcrafted features (e.g., color, HOG, and SIFT) combined with object and human detectors. More recently, inspired by the notable success of deep learning and the availability of large-scale HOI datasets [1, 2], several deep learning based HOI models have been proposed. Specifically, Mallya et al. [29] modified the Fast RCNN model [13] for HOI recognition, with the assistance of Visual Question Answering (VQA). In [38], zero-shot learning was applied to address the long-tail problem in HOI recognition. In [1], human proposals, object regions, and their combinations were fed into a multi-stream network for HOI detection. Gkioxari et al. [14] estimated an action-type specific density map for identifying interacted object locations, with a modified Faster RCNN architecture [36].

Although promising results have been achieved by the above deep HOI models, two issues remain unsolved. First, they lack a powerful tool to represent the structures in HOI tasks explicitly and to encode them efficiently into modern network architectures. Second, despite their successes on specific tasks, a complete and generic HOI representation is missing; these approaches cannot be easily extended to HOI recognition in videos. To address these issues, we introduce GPNN for imposing high-level relations onto DNNs, leading to a powerful HOI representation that is applicable in both static and dynamic settings.

Neural Networks with Graphs/Graphical Models. In the literature, several approaches have been proposed to combine graphical models and neural networks. The most intuitive approach is to build a graphical model upon a DNN, where the network that generates features is trained first and its output is used to compute the potential functions of the graphical predictor. Typical examples appear in human pose estimation [42], human part parsing [33, 45], and semantic image segmentation [3, 4]. These methods lack a deep integration in the sense that the computation process of the graphical model cannot be learned end-to-end. Other attempts [7, 21, 31, 32, 37, 40, 44, 51] generalize neural network operations (e.g., convolutions) directly from regular grids (e.g., images) to graphs. For the HOI problem, however, a structured representation is needed to capture the high-level spatial-temporal relations between humans and objects. Some other work integrated network architectures with graphical models [12, 20] and gained promising results on applications such as scene understanding [24, 30, 46], object detection and parsing [27, 50], and VQA [41]. However, these methods only apply to problems with pre-fixed graph structures. Liang et al. [26] merged graph nodes using Long Short-Term Memory (LSTM) for the human parsing problem, under the assumption that the nodes are mergeable.

These methods achieved promising results on their specific tasks and demonstrated the benefit of complementing deep architectures with domain-specific structures. However, most of them are based on pre-fixed graph structures, and they have not yet been studied for HOI recognition. In this work, we extend previous graph neural networks with learnable graph structures, which better capture the rich, high-level relations in HOI problems. The proposed GPNN can automatically infer the graph structure and utilize it to enhance information propagation and further inference. It offers a generic HOI representation for both spatial and spatial-temporal settings. To the best of our knowledge, this is the first attempt to integrate graphical models and neural networks in a unified framework achieving state-of-the-art results in HOI recognition.

Fig. 2. Illustration of the forward pass of GPNN. GPNN takes node and edge features as input and outputs a parse graph in a message passing fashion. The structure of the parse graph is given by a soft adjacency matrix, computed by the link function from the features (or hidden node states); the darker the color in the adjacency matrix, the stronger the connectivity. The message functions then compute incoming messages for each node as a weighted sum of the messages from other nodes, where thicker edges indicate larger information flows. The update functions update the hidden internal state of each node. The above process is repeated for several steps, iteratively and jointly learning the graph structure and the message passing. Finally, for each node, the readout functions output HOI action or object labels from the hidden node states. See Sect. 3 for more details.

3 Graph Parsing Neural Network for HOI

3.1 Formulation

For HOI understanding, humans and objects are represented by nodes, and their relations are defined as edges. Given a complete HOI graph that includes all possible relationships among humans and objects, we want to automatically infer a parse graph by keeping the meaningful edges and labeling the nodes.

Formally, let \(\mathcal {G} = (\mathcal {V}, \mathcal {E}, \mathcal {Y})\) denote the complete HOI graph. Nodes \(v \in \mathcal {V}\) take unique values from \(\{1, \cdots , |\mathcal {V}|\}\). Edges \(e \in \mathcal {E}\) are two-tuples \(e = (v, w) \in \mathcal {V} \times \mathcal {V}\). Each node v has an output state \(y_{v} \in \mathcal {Y}\) that takes a value from a set of labels \(\{1, \cdots , Y_{v}\}\) (e.g., actions). A parse graph \(g = (\mathcal {V}_{g}, \mathcal {E}_{g}, \mathcal {Y}_{g})\) is a sub-graph of \(\mathcal {G}\), where \(\mathcal {V}_{g} \subseteq \mathcal {V}\) and \(\mathcal {E}_{g} \subseteq \mathcal {E}\). Given node features \(\Gamma ^{\mathcal {V}}\) and edge features \(\Gamma ^{\mathcal {E}}\), we want to infer the optimal parse graph \(g^{*}\) that best explains the data according to a probability distribution p:

$$\begin{aligned} \begin{aligned} g^{*}&= \mathop {\mathrm {argmax}}\limits _{g}~p(g | \Gamma , \mathcal {G}) = \mathop {\mathrm {argmax}}\limits _{g}~p(\mathcal {V}_{g}, \mathcal {E}_{g}, \mathcal {Y}_{g} | \Gamma , \mathcal {G}) \\&= \mathop {\mathrm {argmax}}\limits _{g}~p(\mathcal {Y}_{g} | \mathcal {V}_{g}, \mathcal {E}_{g}, \Gamma ) p(\mathcal {V}_{g}, \mathcal {E}_{g} | \Gamma , \mathcal {G}) \end{aligned} \end{aligned}$$
(1)

where \(\Gamma = \{\Gamma ^{\mathcal {V}}, \Gamma ^{\mathcal {E}}\}\). Here \(p(\mathcal {V}_{g}, \mathcal {E}_{g} | \Gamma , \mathcal {G})\) evaluates the graph structure, and \(p(\mathcal {Y}_{g} | \mathcal {V}_{g}, \mathcal {E}_{g}, \Gamma )\) is the labeling probability for the nodes in the parse graph.

This formulation provides a principled guideline for designing GPNN. We design the network to approximate the computations of \(\mathop {\mathrm {argmax}}\nolimits _{g} p(\mathcal {V}_{g}, \mathcal {E}_{g} | \Gamma , \mathcal {G})\) and \(\mathop {\mathrm {argmax}}\nolimits _{g} p(\mathcal {Y}_{g} | \mathcal {V}_{g}, \mathcal {E}_{g}, \Gamma )\). We introduce four types of functions as individual modules in the forward pass of GPNN: link functions, message functions, update functions, and readout functions (illustrated in Fig. 2). The link functions \(L(\cdot )\) estimate the graph structure, approximating \(p(\mathcal {V}_{g}, \mathcal {E}_{g} | \Gamma , \mathcal {G})\). The message, update, and readout functions together resemble the belief propagation process and approximate \(\mathop {\mathrm {argmax}}\nolimits _{\mathcal {Y}_{g}} p(\mathcal {Y}_{g} | \mathcal {V}_{g}, \mathcal {E}_{g}, \Gamma )\).

Specifically, the link function takes edge features as input and infers the connectivity between nodes. The resulting soft adjacency matrix serves as weights for the messages passing along edges between nodes. The incoming messages for a node are summarized by the message function, then the hidden embedding state of the node is updated based on these messages by an update function. Finally, readout functions compute the target output for each node. The four types of functions are defined as follows:

Link Function. We first infer an adjacency matrix that represents the connectivity (i.e., the graph structure) between nodes by a link function. A link function \(L(\cdot )\) takes the node features \(\Gamma ^{\mathcal {V}}\) and edge features \(\Gamma ^{\mathcal {E}}\) as input and outputs an adjacency matrix \(A\in [0,1]^{|\mathcal {V}| \times |\mathcal {V}|}\):

$$\begin{aligned} \begin{aligned} A_{vw} = L(\Gamma _{v}, \Gamma _{w}, \Gamma _{vw}) \end{aligned} \end{aligned}$$
(2)

where \(A_{vw}\) denotes the (v, w)-th entry of the matrix A. Here we overload the notation and let \(\Gamma _{v}\) denote node features and \(\Gamma _{vw}\) denote edge features. In this way, the structure of a parse graph g is approximated by the adjacency matrix. We then propagate messages over the parse graph, where the soft adjacency matrix controls the information passed through each edge.

Message and Update Functions. Based on the learned graph structure, a message passing algorithm is adopted to infer the node labels. During belief propagation, the hidden states of the nodes are iteratively updated by communicating with other nodes. Specifically, message functions \(M(\cdot )\) summarize the messages coming to a node from other nodes, and update functions \(U(\cdot )\) update the hidden node states according to the incoming messages. At each iteration step s, the two functions compute:

$$\begin{aligned} \begin{aligned} m_{v}^{s} = \sum \nolimits _{w} A_{vw} M(h_{v}^{s-1}, h_{w}^{s-1}, \Gamma _{vw}) \end{aligned} \end{aligned}$$
(3)
$$\begin{aligned} \begin{aligned} h_{v}^{s} = U(h_{v}^{s-1}, m_{v}^{s}) \end{aligned} \end{aligned}$$
(4)

where \(m_{v}^{s}\) is the summarized incoming message for node v at the s-th iteration and \(h_{v}^{s}\) is the hidden state of node v. The node connectivity A encourages information flow between nodes in the parse graph. The message passing phase runs for S steps towards convergence. At the first step, the node hidden states \(h_{v}^{0}\) are initialized with the node features \(\Gamma _{v}\).

Readout Function. Finally, for each node, the hidden state is fed into a readout function to output a label:

$$\begin{aligned} \begin{aligned} y_{v} = R(h_{v}^{S}). \end{aligned} \end{aligned}$$
(5)

Here the readout function \(R(\cdot )\) computes the output \(y_v\) for node v by activating its hidden state \(h_{v}^{S}\) (node embedding).

Iterative Parsing. Based on the above four functions, messages are passed along the graph, weighted by the learned adjacency matrix A. We further extend this process into a joint learning framework that iteratively infers the graph structure and propagates information to infer the node labels. In particular, instead of learning A only at the beginning, we re-infer A from the updated node information and edge features at each step s:

$$\begin{aligned} \begin{aligned} A_{vw}^{s} = L(h_{v}^{s-1}, h_{w}^{s-1}, m_{vw}^{s-1}). \end{aligned} \end{aligned}$$
(6)

Then the messages in Eq. 3 are redefined as:

$$\begin{aligned} \begin{aligned} m_{v}^{s} = \sum \nolimits _{w} A_{vw}^{s} M(h_{v}^{s-1}, h_{w}^{s-1}, \Gamma _{vw}). \end{aligned} \end{aligned}$$
(7)

In this way, both the graph structure and the message update can be jointly and iteratively learned in a unified framework. In practice, we find that this strategy brings better performance (detailed in Sect. 4.3).
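To make the control flow concrete, the following minimal sketch (in PyTorch, anticipating the implementation in Sect. 3.2) shows one possible forward pass implementing Eqs. 6 and 7. The module interfaces and tensor shapes are our assumptions, and for simplicity the link function here conditions on the current hidden states and the raw edge features rather than the previous messages:

```python
import torch

def gpnn_forward(node_feats, edge_feats, link_fn, msg_fn, update_fn, readout_fn, S=3):
    """Sketch of one GPNN forward pass: jointly infer the graph and pass messages.

    node_feats: (|V|, d_V) node features Gamma^V.
    edge_feats: (|V|, |V|, d_E) edge features Gamma^E.
    """
    h = node_feats                           # h^0 initialized with node features
    num_nodes = h.size(0)
    for s in range(S):
        # Eq. 6: re-infer the soft adjacency from the current node states.
        A = link_fn(h, edge_feats)           # (|V|, |V|), entries in [0, 1]
        # Eq. 7: each node receives a weighted sum of messages from the others.
        msgs = []
        for v in range(num_nodes):
            incoming = sum(A[v, w] * msg_fn(h[v], h[w], edge_feats[v, w])
                           for w in range(num_nodes) if w != v)
            msgs.append(incoming)
        m = torch.stack(msgs)                # (|V|, message dimension)
        h = update_fn(m, h)                  # Eq. 4, e.g. a GRU cell
    return readout_fn(h)                     # Eq. 5: per-node label scores
```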

In the next section, we show that by implementing each function with neural networks, the entire system becomes differentiable end-to-end. Hence all parameters can be learned using gradient-based optimization.

3.2 Network Architecture

Link Function. Given the complete HOI graph \(\mathcal {G} = (\mathcal {V}, \mathcal {E}, \mathcal {Y})\), we use \(d_{V}\) and \(d_{E}\) to denote the dimensions of the node features and the edge features, respectively. In a message passing step s, we first concatenate all the node features (hidden states) \(\{h^s_v\in \mathbb {R}^{d_{V}}\}_v\) and all the edge features (messages) \(\{m^s_{vw}\in \mathbb {R}^{d_{E}}\}_{v,w}\) to form a feature matrix \(F^{s}\in \mathbb {R}^{|\mathcal {V}| \times |\mathcal {V}| \times (2d_{V}+d_{E})}\) (see Fig. 2). The link function is defined as a small neural network with one or several convolutional layers (with \(1\times 1\times (2d_{V}+d_{E})\) kernels) and a sigmoid activation. The adjacency matrix \(A^s \in [0,1]^{|\mathcal {V}| \times |\mathcal {V}|}\) is then computed as:

$$\begin{aligned} \begin{aligned} A^{s} = \sigma (\mathbf {W}^L*F^{s}), \end{aligned} \end{aligned}$$
(8)

where \(\mathbf {W}^L\) denotes the learnable parameters of the link function network \(L(\cdot )\) and \(*\) denotes the convolution operation. The sigmoid \(\sigma (\cdot )\) normalizes the elements of \(A^{s}\) into [0, 1]. The effect of multiple convolutional layers with \(1 \times 1\) kernels is similar to fully connected layers applied to each individual edge feature, except that the filter weights are shared across all edges. In practice, we find this operation produces sufficiently good results with high computational efficiency.
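As an illustration, a minimal PyTorch sketch of such a link function might look as follows; the layer widths follow the 128-128-1 design used in Sect. 4.1, while the class and argument names are our own:

```python
import torch
import torch.nn as nn

class LinkFunction(nn.Module):
    """Infers a soft adjacency matrix from per-edge features via 1x1 convolutions."""
    def __init__(self, in_ch, hidden_ch=128):
        super().__init__()
        # 1x1 convs act like a fully connected net applied to every edge (v, w),
        # with weights shared across all edges.
        self.net = nn.Sequential(
            nn.Conv2d(in_ch, hidden_ch, kernel_size=1), nn.ReLU(),
            nn.Conv2d(hidden_ch, hidden_ch, kernel_size=1), nn.ReLU(),
            nn.Conv2d(hidden_ch, 1, kernel_size=1),
        )

    def forward(self, F):
        # F: (batch, 2*d_V + d_E, |V|, |V|) feature matrix (channels-first layout).
        logits = self.net(F)                      # (batch, 1, |V|, |V|)
        return torch.sigmoid(logits.squeeze(1))   # soft adjacency in [0, 1]
```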

For spatial-temporal problems, where the adjacency matrix should account for previous states, we model \(L(\cdot )\) with a convolutional LSTM [39] in the temporal domain. At time t, the link function takes \(F^{s, t}\) as input features and the previous adjacency matrix \(A^{s, t-1}\) as hidden state: \(A^{s, t} = convLSTM(F^{s, t}, A^{s, t-1})\). Again, the kernel size of the conv layers in the convLSTM is \(1 \times 1 \times (2d_{V}+d_{E})\).
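A bare-bones 1 × 1 convolutional LSTM cell for this purpose could be sketched as below (reusing the imports from the previous sketch); feeding the adjacency matrix back as the hidden state and squashing the output with a sigmoid is our reading of the description above, not a confirmed implementation detail:

```python
class ConvLSTMLink(nn.Module):
    """1x1 convLSTM cell whose hidden state doubles as the soft adjacency matrix."""
    def __init__(self, in_ch, hid_ch=1):
        super().__init__()
        # A single 1x1 conv produces all four LSTM gates at once.
        self.gates = nn.Conv2d(in_ch + hid_ch, 4 * hid_ch, kernel_size=1)

    def forward(self, F, state):
        # F: (batch, 2*d_V + d_E, |V|, |V|); state = (A_prev, c_prev).
        A_prev, c_prev = state
        i, f, o, g = torch.chunk(self.gates(torch.cat([F, A_prev], dim=1)), 4, dim=1)
        c = torch.sigmoid(f) * c_prev + torch.sigmoid(i) * torch.tanh(g)
        h = torch.sigmoid(o) * torch.tanh(c)
        A = torch.sigmoid(h)  # keep adjacency entries in [0, 1] (our assumption)
        return A, (A, c)
```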

Message Function. In our implementation, the message function \(M(\cdot )\) in Eq. 3 is computed by:

$$\begin{aligned} \begin{aligned} M(h_{v}, h_{w}, \Gamma _{vw}) = [\mathbf {W}_V^M h_{v}, \mathbf {W}_V^M h_{w}, \mathbf {W}_E^M \Gamma _{vw}], \end{aligned} \end{aligned}$$
(9)

where [., .] denotes concatenation. The function concatenates the outputs of linear transforms (i.e., fully connected layers parameterized by \(\mathbf {W}_V^M\) and \(\mathbf {W}_E^M\)) that take the node hidden states \(h_{v}\), \(h_{w}\) and the edge features \(\Gamma _{vw}\) as input.
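In code, a hedged sketch of Eq. 9 could be (assuming, as in Eq. 9, that both endpoints share the node transform \(\mathbf {W}_V^M\)):

```python
import torch
import torch.nn as nn

class MessageFunction(nn.Module):
    """Eq. 9: concatenate transformed endpoint states and the edge feature."""
    def __init__(self, d_v, d_e):
        super().__init__()
        self.node_fc = nn.Linear(d_v, d_v)  # W_V^M, shared by both endpoints
        self.edge_fc = nn.Linear(d_e, d_e)  # W_E^M

    def forward(self, h_v, h_w, gamma_vw):
        # Output dimension: 2 * d_v + d_e.
        return torch.cat([self.node_fc(h_v), self.node_fc(h_w),
                          self.edge_fc(gamma_vw)], dim=-1)
```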

Update Function. Recurrent neural networks [10, 18] are a natural choice for simulating the iterative update process, as in previous work [12]. Here we adopt the Gated Recurrent Unit (GRU) [5] as the update function, because of its recurrent nature and smaller number of parameters. The update function in Eq. 4 is thus implemented as:

$$\begin{aligned} \begin{aligned} h_{v}^{s} = U(h_{v}^{s-1}, m_{v}^{s}) = GRU(h_{v}^{s-1}, m_{v}^{s}), \end{aligned} \end{aligned}$$
(10)

where \(h_{v}^{s}\) is the hidden state and \(m_{v}^{s}\) serves as the input features. As demonstrated in [25], the GRU is more effective than vanilla recurrent neural networks.
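In PyTorch this maps directly onto a GRU cell whose input is the summarized message; the dimensions below are hypothetical:

```python
import torch
import torch.nn as nn

d_v, d_e, num_nodes = 256, 256, 5                 # hypothetical sizes
update_fn = nn.GRUCell(input_size=2 * d_v + d_e,  # message dimension from Eq. 9
                       hidden_size=d_v)

h_prev = torch.randn(num_nodes, d_v)              # h^{s-1}: previous node states
m = torch.randn(num_nodes, 2 * d_v + d_e)         # m^s: summarized incoming messages
h_new = update_fn(m, h_prev)                      # Eq. 10: message acts as GRU input
```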

Readout Function. A typical readout function combines one or several fully connected layers (parameterized by \(\mathbf {W}^R\)) with an activation function:

$$\begin{aligned} \begin{aligned} y_v = R(h_v^S) = \varphi (\mathbf {W}^R h_v^S). \end{aligned} \end{aligned}$$
(11)

Here the activation function \(\varphi (\cdot )\) is a softmax (for single-label outputs) or a sigmoid (for multi-label outputs), depending on the HOI task.
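A corresponding sketch with a single linear layer follows; the multi_label switch is our addition to cover both activations:

```python
import torch
import torch.nn as nn

class ReadoutFunction(nn.Module):
    """Eq. 11: map the final node embedding to per-class scores."""
    def __init__(self, d_v, num_classes, multi_label=True):
        super().__init__()
        self.fc = nn.Linear(d_v, num_classes)  # W^R
        self.multi_label = multi_label

    def forward(self, h_v):
        z = self.fc(h_v)
        # Sigmoid for multi-label HOI detection, softmax for single-label tasks.
        return torch.sigmoid(z) if self.multi_label else torch.softmax(z, dim=-1)
```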

In this way, the entire GPNN is fully differentiable and end-to-end trainable. The loss for a specific HOI task is computed on the outputs of the readout functions, and the error is propagated back by the chain rule. In the next section, we offer more details on implementing GPNN for HOI tasks in spatial and spatial-temporal settings and present qualitative as well as quantitative results.

4 Experiments

To verify the effectiveness and generic applicability of GPNN, we perform experiments on two HOI problems: (i) HOI detection in images [1, 17], and (ii) HOI recognition and anticipation in videos [22]. The first experiment is performed on the HICO-DET [1] and V-COCO [17] datasets, showing that our approach scales to large datasets (about 60K images in total) and achieves good detection accuracy over a large number of classes (more than 600 HOI categories). The second experiment is reported on the CAD-120 dataset [22], showing that our method applies well to spatial-temporal domains.

4.1 Human-Object Interaction Detection in Images

For HOI detection in an image, the goal is to detect pairs of human and object bounding boxes together with the interaction class label connecting them.

Table 1. HOI detection results (mAP) on HICO-DET dataset [1]. Higher values are better. The best scores are marked in bold.
Fig. 3. HOI detection results on HICO-DET [1] test images. Humans and objects are shown in red and green rectangles, respectively. Best viewed in color.

Datasets. We use the HICO-DET [1] and V-COCO [17] datasets for benchmarking our GPNN model. HICO-DET provides more than 150K annotated instances of human-object pairs in 47,051 images (37,536 training and 9,515 testing). It shares the same 80 object categories as MS-COCO [28] and has 117 action categories. V-COCO is a subset of MS-COCO [28]. It consists of 10,346 images with 16,199 person instances, with \(\sim \)2.5K images in the train set, \(\sim \)2.8K for validation, and \(\sim \)4.9K for testing. Each annotated person has binary labels for 26 action classes. Note that three actions (i.e., cut, eat, and hit) are annotated with two types of targets: instrument and direct object.

Implementation Details. Humans and objects are represented by nodes in the graph, while human-object interactions are represented by edges. In this experiment, we use a pre-trained deformable convolutional network [6] for object detection and feature extraction. Based on the detected bounding boxes, we extract node features (\(7 \times 7 \times 80\)) from the position-sensitive region of interest (PS RoI) pooling layer of the deformable ConvNet. We extract each edge feature from a combined bounding box, i.e., the smallest box that contains both nodes’ bounding boxes. The functions of GPNN are implemented as follows. We use a convolutional network (128-128-1)-Sigmoid(\(\cdot \)) with \(1 \times 1\) kernels for the link function. The message functions are composed of a fully connected layer, concatenation, and summation: for a node v, the neighboring node feature \(\Gamma _{w}\) and edge feature \(\Gamma _{vw}\) are passed through fully connected layers and concatenated, and the final incoming message is a weighted sum of the messages from all neighboring nodes. Specifically, the message for node v coming from node w through edge \(e=(v, w)\) is the concatenation of the outputs of FC(\(d_{V}\)-\(d_{V}\)) and FC(\(d_{E}\)-\(d_{E}\)). A GRU(\(d_{V}\)) is used for the update function. The number of propagation steps S is set to 3. For the readout function, we use FC(\(d_{V}\)-117)-Sigmoid(\(\cdot \)) and FC(\(d_{V}\)-26)-Sigmoid(\(\cdot \)) for HICO-DET and V-COCO, respectively.
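For example, the combined box for an edge can be computed with a small helper like the one below (our illustration; boxes are assumed to be (x1, y1, x2, y2) tuples):

```python
def union_box(box_a, box_b):
    """Smallest box containing both inputs, each given as (x1, y1, x2, y2)."""
    return (min(box_a[0], box_b[0]), min(box_a[1], box_b[1]),
            max(box_a[2], box_b[2]), max(box_a[3], box_b[3]))

# e.g. union_box((10, 20, 50, 80), (40, 10, 90, 60)) == (10, 10, 90, 80)
```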

The probability of an HOI label for a human-object pair is given by the product of the final output probabilities of the human node and the object node. We employ an L1 loss on the adjacency matrix. For the node outputs, we use a weighted multi-class multi-label hinge loss. The reasons are two-fold: the training examples are unbalanced, and each node poses an inherently multi-label problem (there might not even exist a meaningful human-object interaction for the detected humans and objects).
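A hedged sketch of such a combined objective is given below; the margin formulation, the weighting scheme, and the balance factor lam are our guesses at reasonable choices, and A_gt stands for a ground-truth adjacency derived from the annotated human-object pairs:

```python
import torch

def gpnn_loss(scores, targets, A, A_gt, class_weights, margin=1.0, lam=1.0):
    """scores: (N, C) node outputs; targets: (N, C) in {0, 1}; A, A_gt: (N, N)."""
    signs = 2.0 * targets - 1.0                 # +1 for positive labels, -1 otherwise
    hinge = torch.clamp(margin - signs * scores, min=0.0)
    node_loss = (class_weights * hinge).mean()  # weighted multi-class multi-label hinge
    graph_loss = (A - A_gt).abs().mean()        # L1 supervision on the adjacency
    return node_loss + lam * graph_loss
```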

Our model is implemented in PyTorch and trained on a machine with a single Nvidia Titan Xp GPU. We start with a learning rate of 1e-3 and decay it by a factor of 0.8 every 5 epochs. The training process takes about 20 epochs (\(\sim \)15 h) to roughly converge with a batch size of 32.
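This schedule maps directly onto a standard PyTorch setup; the choice of Adam below is our assumption, as the paper does not name the optimizer:

```python
import torch

model = torch.nn.Linear(10, 2)  # stand-in for the assembled GPNN modules
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
# Decay the learning rate by a factor of 0.8 every 5 epochs.
scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=5, gamma=0.8)

for epoch in range(20):
    # ... one pass over the training set with batch size 32 goes here ...
    scheduler.step()
```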

Table 2. HOI detection results (mAP) on V-COCO [17] dataset. Legend: Set 1 indicates 18 HOI actions with one object, and Set 2 corresponds to 3 HOI actions (i.e., cut, eat, hit) with two objects (instrument and object).
Fig. 4. HOI detection results on V-COCO [17] test images. Humans and objects are shown in red and green rectangles, respectively. Best viewed in color.

Comparative Methods. We compare our method with eight baselines: (1) Fast-RCNN (union) [13]: for each human-object proposal from the detection results, the attention window (the union of the two boxes) is used as the region proposal for Fast-RCNN. (2) Fast-RCNN (score) [13]: given human-object proposals, the HOI is predicted by linearly combining the human and object detection scores. (3) HO-RCNN [1]: a multi-stream architecture with ConvNets classifying the human, object, and human-object proposals, respectively; the final output combines the scores of all three streams. (4) HO-RCNN+IP [1] and (5) HO-RCNN+IP+S [1]: HO-RCNN with additional components, where Interaction Patterns (IP) act as an attention filter on the images, and S is an extra path with a single neuron that uses the raw object detection score to produce an offset for the final detection. More detailed descriptions of the above five baselines can be found in [1]. (6) Gupta et al. [17]: trained based on Fast-RCNN [13]; we use the scores reported in [14]. (7) Shen et al. [38]: final predictions come from two Faster RCNN [36] based networks trained to predict verb and object classes, respectively. (8) InteractNet [14]: a modified Faster RCNN [36] with an additional human-centric branch that estimates an action-specific density map for locating objects.

Experiment Results. Following the standard settings of the HICO-DET and V-COCO benchmarks, we evaluate HOI detection using mean average precision (mAP). An HOI detection is counted as a true positive when the human detection, the object detection, and the interaction class are all correct; the human and object bounding boxes are considered correct if they overlap a ground truth bounding box of the same class with an intersection over union (IoU) greater than 0.5. For the HICO-DET dataset, we report the mAP over three HOI category sets: (i) all 600 HOI categories in HICO (Full); (ii) 138 HOI categories with fewer than 10 training instances (Rare); and (iii) 462 HOI categories with 10 or more training instances (Non-Rare). For the V-COCO dataset, since we concentrate on HOI detection, we report the mAP over three groups: (i) 18 HOI action classes with one target object; (ii) 3 HOI categories with two types of objects; and (iii) all 24 (=\(18+3\times 2\)) HOI classes. Results are evaluated on the test sets and reported in Tables 1 and 2.
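For reference, the IoU criterion can be computed as follows (a standard helper, included here for completeness):

```python
def iou(box_a, box_b):
    """Intersection over union for boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    ix2, iy2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    return inter / (area_a + area_b - inter)

# A detection is a true positive only if iou(pred, gt) > 0.5 for both the
# human and the object box, and the interaction class is correct.
```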

As shown in Table 1, the proposed GPNN substantially outperforms the comparative methods, achieving 31.89%, 30.45%, and 32.13% improvements over the second best methods on the three HOI category sets of the HICO-DET dataset. The results on the V-COCO dataset (Table 2) consistently demonstrate the superior performance of GPNN. Two important conclusions can be drawn from these results: (i) our method is scalable to large datasets; and (ii) our method performs better than pure neural networks. Visual results are shown in Figs. 3 and 4.

4.2 Human-Object Interaction Recognition in Videos

The goal of this experiment is to detect and anticipate the human sub-activity labels and object affordance labels as a human-object interaction progresses in a video. The problem is challenging since it involves complex interactions: humans interact with multiple objects, and the objects also interact with each other.

CAD-120 Dataset [22]. It has 120 RGB-D videos of 4 subjects performing 10 activities, each of which is a sequence of sub-activities involving 10 actions (e.g., reaching, opening) and 12 object affordances (e.g., reachable, openable) in total.

Table 3. Human activity detection and future anticipation results on CAD-120 [22] dataset, measured via F1-score.
Fig. 5. Confusion matrices of HOI detection (a)(b) and anticipation (c)(d) results on the CAD-120 [22] dataset. Zoom in for more details.

Implementation Details. The link function is implemented as convLSTM(1024-1024-1024-1)-Sigmoid(\(\cdot \)) (i.e., a four-layer convLSTM). We use the same architecture as in the previous experiment for the message and update functions: [FC(\(d_{V}\)-\(d_{V}\)), FC(\(d_{E}\)-\(d_{E}\))] for the message function and GRU(\(d_{V}\)) for the update function. The number of propagation steps S is set to 3. We use FC(\(d_{V}\)-10)-Softmax(\(\cdot \)) and FC(\(d_{V}\)-12)-Softmax(\(\cdot \)) as the readout functions for sub-activity and object affordance detection/anticipation, respectively. We employ an L1 loss for the adjacency matrix and a cross-entropy loss for the node outputs. We use the publicly available node and edge features from [23].

Comparative Methods. We compare our method with two baselines: the anticipatory temporal CRF (ATCRF) [22] and the structural RNN (S-RNN) [20]. ATCRF is a top-performing graphical model for this problem, while S-RNN is the state-of-the-art method using structured neural networks. ATCRF models human activities with a spatial-temporal conditional random field. S-RNN casts a pre-defined spatial-temporal graph as an RNN mixture by representing nodes and edges as LSTMs.

Fig. 6. HOI detection results on a “cleaning objects” activity from the CAD-120 [22] dataset. The human is shown in a red rectangle; the two objects are shown in green and blue rectangles, respectively. Detection and anticipation results are shown as separate bars. For the anticipation task, the label of the sub-activity at time t is anticipated at time \(t-1\). Best viewed in color.

Experiment Results. Table 3 shows the quantitative comparison of our method against the competitors, reporting F1-scores averaged over all classes on the detection and anticipation tasks. GPNN improves substantially over ATCRF and S-RNN, especially on the anticipation task. Our method outperforms the other two for the following reasons. (i) Compared to ATCRF, which is limited by the Markov assumption, our method allows arbitrary graph structures with improved representation ability. (ii) Our method enjoys the deep integration of graphical models and neural networks and can be learned in an end-to-end manner. (iii) Rather than relying on a pre-fixed graph structure as in S-RNN, we infer the graph structure by learning an adjacency matrix and can thus control the information flow between nodes during message passing. Figure 5 shows the confusion matrices for detecting and anticipating the sub-activities and object affordances. From the above results we draw two conclusions: (i) our method applies well to the spatial-temporal domain; and (ii) our method outperforms both pure graphical models (e.g., ATCRF) and deep networks with pre-fixed graph structures (e.g., S-RNN). Figure 6 shows a qualitative visualization of a “cleaning objects” activity, with one representative frame for each sub-activity and the corresponding detections and anticipations.

4.3 Ablation Study

In this section, we analyze the contributions of different model components to the final performance and examine the effectiveness of our main assumptions. Table 4 shows the detailed results on all three datasets.

Integration of DNN with Graphical Model. We first examine the influence of integrating DNNs with a graphical model. We directly feed the features originally used by GPNN into fully connected networks that predict HOI action or object classes. As Table 4 shows, the performance of w/o graph is significantly worse than the full GPNN model across the HOI datasets. This supports our view that jointly modeling high-level structures and leveraging the learning capability of DNNs is essential for HOI tasks.

Table 4. Ablation study of GPNN model. Higher values are better.

GPNN with Fixed Graph Structures. In Sect. 3, GPNN automatically infers the graph structure (i.e., the parse graph) by learning a soft adjacency matrix. To assess this strategy, we fix all entries of the soft adjacency matrices to the constant 1, so that the graph structure is fixed and the information flow between nodes is not weighted. For this constant graph baseline, we observe an obvious performance decrease compared with the full GPNN model, indicating that inferring graph structures is critical for reasonable performance.

GPNN without Supervision on Link Functions. We also perform experiments with the L1 loss on the adjacency matrices turned off (w/o graph loss in Table 4). We observe that the intermediate L1 loss is effective, further justifying our design of learning the graph structure. Interestingly, training the model without this loss has a similar effect to training with a constant graph; hence supervision on the graph is fairly important.

Jointly Learning Parse Graphs and Message Passing. We next study the effect of jointly learning graph structures and message passing. By isolating graph parsing from message passing, we obtain w/o joint parsing, where the adjacency matrices are computed by the link functions from the edge features only at the beginning. We observe a performance decrease in Table 4, showing that learning graph structures and message passing together indeed boosts performance.

Iterative Learning Process. To examine the effect of iterative message passing, we report three baselines: 1 iteration, 2 iterations, and 4 iterations, corresponding to results after different numbers of message passing iterations. The baseline GPNN (first row in Table 4) gives the results after three iterations. We observe that the iterative learning process gradually improves performance in general, but beyond a certain number of iterations the performance drops slightly.

5 Conclusion

In this paper, we propose the Graph Parsing Neural Network (GPNN) for inferring a parse graph in an end-to-end manner. The network decomposes into four distinct functions, namely link functions, message functions, update functions, and readout functions, for iterative graph inference and message passing. GPNN provides a generic HOI representation that is applicable in both spatial and spatial-temporal domains. We demonstrate substantial performance gains on three HOI datasets, showing the effectiveness of the proposed framework.