Detecting human–object interaction with multi-level pairwise feature network

Human–object interaction (HOI) detection is crucial for human-centric image understanding which aims to infer ⟨human, action, object⟩ triplets within an image. Recent studies often exploit visual features and the spatial configuration of a human–object pair in order to learn the action linking the human and object in the pair. We argue that such a paradigm of pairwise feature extraction and action inference can be applied not only at the whole human and object instance level, but also at the part level at which a body part interacts with an object, and at the semantic level by considering the semantic label of an object along with human appearance and human–object spatial configuration, to infer the action. We thus propose a multi-level pairwise feature network (PFNet) for detecting human–object interactions. The network consists of three parallel streams to characterize HOI utilizing pairwise features at the above three levels; the three streams are finally fused to give the action prediction. Extensive experiments show that our proposed PFNet outperforms other state-of-the-art methods on the V-COCO dataset and achieves comparable results to the state-of-the-art on the HICO-DET dataset.

Deep learning has achieved remarkable progress in visual recognition [1] and object detection [2][3][4]. To achieve deeper levels of image understanding, researchers have turned to detecting visual relationships rather than isolated instances [5,6], a task which remains challenging due to the wide variety of relations. More specifically, detecting human-centric relationships with surrounding objects, referred to as human–object interaction (HOI) detection [7,8], has become crucial for tasks like video understanding [9] and visual question answering [10]. The goal is to determine the ⟨human, action, object⟩ triplets in a single image.
Current attempts to address HOI detection usually consider all ⟨human, object⟩ pairs in an image, where the pairwise features comprise three components: visual features of the human, visual features of the object, and the spatial configuration linking the human and object [7,11]. These components help to recognize actions with a typical spatial interaction pattern, e.g., ride, or actions strongly correlated with the presence of a person or specific objects. However, most existing methods only extract such pairwise features at the global instance level [12][13][14], which we argue is insufficient to distinguish some fine-grained actions that require subtle local cues from body parts and/or knowledge of object semantic labels; for instance, the action of eating something often involves multiple nearby objects.
This paper seeks to apply more informative pairwise representations for HOI detection in addition to the global instance information. We observe that an inherent hierarchical structure exists in pairwise features, as can be seen in Fig. 1. Beyond the instance-level interactions between a person and an object, there are actions strongly associated with a body part, e.g., the hold action involves a hand, and the kick action, a foot. Therefore, additional pairwise features at the body part-level that capture interactions between body parts and nearby objects can provide useful local cues for recognizing such fine-grained actions. Compared to previous part-based approaches [15,16], our proposed part-level pairwise features are more comprehensive, consisting of three components (visual features of the body part, visual features of the object, and their relative spatial configuration), whereas previous methods do not consider all three components and are thus less capable of modeling subtle interactions between body parts and objects. Furthermore, we observe that the semantic label of an object can serve as a reliable prior as well as a substitute for object appearance when the object is partially occluded. Given the object semantic label, the number of visual phrases (i.e., valid pairs of action and object) becomes far smaller than the total number of ⟨action, object⟩ combinations. We therefore propose a third level of pairwise features at the semantic-level, which utilizes object labels to allow the learning of sparse correspondences between actions and objects.

Fig. 1 HOI can be characterized using three levels of pairwise features: the instance-, body part-, and semantic-levels. At the part-level, the visually and spatially related hand and object pair indicates hold; at the semantic-level, the object label pizza strongly suggests eat.
In order to effectively utilize the multi-level pairwise features presented above to detect human–object interactions, we propose a novel multi-level pairwise feature network (PFNet) consisting of three parallel streams. PFNet aggregates pairwise visual and spatial features at three levels and incorporates both local body parts and semantic priors to achieve more robust and accurate HOI detection. The instance-level stream of PFNet captures visual and spatial configuration features of ⟨human, object⟩ pairs. The part-level stream captures visual and spatial relationship features of ⟨body part, object⟩ pairs. Specifically, at the part-level, we enlarge the receptive field of the object visual feature to be the union of the bounding boxes covering the object and a neighbouring body part. The part-level spatial configuration is represented by the distance between the object and its nearest body part. The semantic-level stream resembles its instance-level counterpart but captures pairwise relations by replacing the object visual feature with its semantic label feature. Lastly, the three streams are fused to predict the HOI. A comparison with other methods on two large-scale datasets, V-COCO [17] and HICO-DET [7], shows that our method achieves state-of-the-art performance on V-COCO and highly competitive results on HICO-DET, without needing any extra annotation.

Related work
Action recognition is a human-centric visual recognition task closely related to HOI detection. Action recognition usually relies on pose-guided human appearance and contextual information. Zhao et al. [18] generate body parts with the assistance of a human pose estimation network and use the state of body parts to complement global human appearance. Luvizon et al. [19] conduct multitask learning for both action recognition and pose estimation to improve the performance for each task. Attention mechanisms are also widely used for action recognition. Abdulmunem et al. [20] extract both local and global descriptors for efficient action recognition guided by object saliency. Girdhar and Ramanan [21] propose a top-down and bottom-up attention mechanism to capture global context and local features. Although action recognition can be considered as an image-level task, these strategies can readily be transferred to recognizing and detecting instance-level human-object interactions. In our work, we also utilize detailed information about body parts to enrich global features.
Human-object interaction detection lies at the intersection of action recognition and general visual relationship detection. In existing instance-level approaches, a multi-stream network extracts pairwise visual and spatial features for a human-object pair for interaction prediction. In addition, instance-centric attention [11], or spatial relation guided attention [22], can be used to refine the pairwise features. Wang et al. [13] introduce context-aware human and object appearance features that better incorporate information from background scenes. Some networks further predict a binary interaction score [12,15] for a human-object pair or estimate the object location with a localization branch [8]. HOIs can also be parsed as a scene graph [5] so that information from all human-object pairs in one image can be utilized. Instance features are refined using iterative message passing [23] or graph convolution [22].
PMFNet [15] and RPNN [16] are two typical partbased approaches that utilize visual features of body parts. PMFNet has a zoom-in module that extracts local visual and spatial features from pose keypoint guided regions. RPNN has a graph for human and body parts and another graph for object and body parts, which enrich the coarse instance-level human and object features with weighted body part visual features. However, they capture a certain scope of local feature which is suboptimal for representing interactions between body parts and objects. Unlike these approaches, we employ the same type of pairwise representation for local part-level features as for instance-level pairs.
Moreover, semantic features carried by action and object labels have also been explored to obtain better generalization when few examples exist for an HOI category [6,[24][25][26]. A common approach is to learn a joint embedding space that matches visual and language features of HOIs [6,24,25] so that a similarity term can be appended to the final prediction score to determine how an action prediction matches its semantic meanings. Instead of learning a joint embedding space, we directly model semantic dependency for action categories based on object labels.
Finally, very recently models built on anchor-free point-based detection frameworks have been proposed to perform HOI detection [27,28]. They treat HOIs as keypoints lying between a human and an object.

Network architecture
We adopt a two-stage pipeline consisting of an instance localization stage and an interaction recognition stage, following Refs. [11,12]. Given an image I, an off-the-shelf object detector, e.g., Faster R-CNN [2] with a ResNet50 backbone, first detects all human instances with bounding boxes b_h, and all object instances with bounding boxes b_o and class labels c_o. The feature map F for the image I is extracted from the ResNet50 C4 conv layer. Meanwhile, a human pose estimation network parses each human instance h into keypoints k_h, from which body part boxes b_{k_h} are extracted. We perform action prediction on all pairs of humans and objects.
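As a concrete illustration of the first stage's output, the sketch below (the detection dictionaries and the `enumerate_pairs` helper are hypothetical, not part of PFNet's codebase) enumerates all candidate human–object pairs that the second stage then scores:

```python
import itertools

def enumerate_pairs(detections):
    """Return all (human, object) candidate pairs from one image's detections.
    Each detection is a dict with a class label, a box (x1, y1, x2, y2),
    and a confidence score."""
    humans = [d for d in detections if d["label"] == "person"]
    objects = [d for d in detections if d["label"] != "person"]
    return list(itertools.product(humans, objects))

dets = [
    {"label": "person", "box": (10, 20, 110, 220), "score": 0.95},
    {"label": "person", "box": (150, 30, 240, 210), "score": 0.90},
    {"label": "pizza",  "box": (90, 160, 130, 190), "score": 0.80},
    {"label": "cup",    "box": (200, 100, 220, 130), "score": 0.60},
]
pairs = enumerate_pairs(dets)  # 2 humans x 2 objects = 4 candidate pairs
```

Action prediction is then run on every pair, so the cost of the second stage grows with the product of human and object counts.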
We regard a human–object interaction as a function of pairwise features at multiple levels: the instance level, body part level, and semantic level. As shown in Fig. 2, the outputs of the three streams are fused to give the final action prediction. Next, we detail how we extract the three levels of pairwise features through three parallel streams. Each stream follows a similar concise pattern, with minor structural changes to adapt it to the variations between levels.

Instance-level pairwise feature stream
The instance-level pairwise feature captures the holistic visual and spatial relationships for a human and object pair [7,[11][12][13]15]. To obtain this pairwise feature, we first crop visual features of the human instance and the object instance in the pair by applying the RoI-Align [29] operation to the feature map F. Then we apply global average pooling (GAP), followed by two fully connected layers, to obtain the feature vectors f^h_ins and f^o_ins. The spatial configuration can be represented as a three-channel spatial-pose map [12,15], which augments the two-channel binary mask map of the human and object [7,11] with an additional pose channel. In the third channel we draw the pose keypoints in k_h as body joints and the lines linking them as the body skeleton. The lines have gray values ranging from 0.15 to 0.95 in order to encode different parts. The human pose is obtained as in Ref. [30] and the body skeleton follows a conventional pattern [12]. After resizing the spatial-pose map to a size of 64 × 64 × 3, we use two conv layers followed by max pooling layers and two fully connected layers to obtain the spatial feature vector f^sp_ins. Considering both visual and spatial relations, the instance-level pairwise feature f_ins can be represented as a concatenation of the feature vectors f^h_ins, f^o_ins, and f^sp_ins:
f_ins = f^h_ins ⊕ f^o_ins ⊕ f^sp_ins    (2)
where ⊕ denotes concatenation of feature vectors.
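The concatenation itself is straightforward; a minimal sketch (omitting RoI-Align and the conv/FC layers, and using plain Python lists in place of tensors) shows the dimension bookkeeping, assuming the 1024-D per-component features stated in our implementation details:

```python
def concat_features(f_h, f_o, f_sp):
    """Instance-level pairwise feature: f_ins = f_h (+) f_o (+) f_sp.
    With 1024-D human, object, and spatial vectors, the concatenated
    pairwise feature is 3072-D."""
    return list(f_h) + list(f_o) + list(f_sp)

# Toy 1024-D placeholders standing in for the pooled RoI features.
f_ins = concat_features([0.0] * 1024, [0.0] * 1024, [0.0] * 1024)
assert len(f_ins) == 3072
```

The same concatenation pattern is reused, with different inputs, by the part-level and semantic-level streams.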

Part-level pairwise feature stream
The part-level pairwise feature stream is responsible for capturing local interactions between objects and body parts. To this end, we first consider visual features and spatial relations for a body part and object pair. We then organize a set of part-level pairwise features into a form like Eq. (2), with an aggregated body part feature, an augmented object feature, and an aggregated spatial configuration. Given a set of human pose keypoints k_h, we crop n = 10 body part regions whose center points can be well defined by k_h, following Ref. [31]. Specifically, the ten body parts include the head, pelvis, left and right arms, both hands, both knees, and both feet. All body-part regions are square boxes with size proportional to the height of the human bounding box (Fig. 3(a)). We denote the region of the i-th body part by b^{k_i}_h. We apply RoI-Align to crop a body part feature from the feature map according to b^{k_i}_h, followed by global average pooling to generate a feature f^{k_i}_h for each body part. Body part visual features are aggregated using a fully connected layer W^h_par to get the part-level human visual feature f^h_par:
f^h_par = W^h_par (f^{k_1}_h ⊕ · · · ⊕ f^{k_n}_h)
The original object bounding box often has too limited a receptive field to capture important visual cues about how an object interacts with a local body part. However, this is crucial for recognizing actions when a body part and an object are in direct contact or close to each other. To address this issue we introduce an augmented object feature that also includes a neighbouring body part. Given that a number of actions involve local interaction between hands and objects, we consider a hand-augmented object feature. Specifically, as shown in Fig. 3(b), we first find the hand part box closest to an object (for the left or right hand). Then we generate a union box covering that hand part and the object, and expand the region by a margin. Similarly, we adopt RoI-Align to crop the feature, followed by a conv layer, GAP, and a fully connected layer W^o_par, to extract the hand-augmented object feature f^o_par. Note that, although we augment the object feature with a specific type of body part here, one could also consider an arbitrary group of body parts.

Fig. 3 Examples of part-level pairwise features. (a) Body part regions, e.g., head state (above) for read and knee state (below) for throw. (b) We augment the object feature by cropping the feature from the expanded union box covering the object (blue) and its neighbouring body part (e.g., hand) to enrich the visual cues for local interactions, e.g., throw and hold, which are strongly correlated with body parts. The distance between the object and body part (dashed lines) is utilized as a local spatial feature.
As the size of a body part bounding box does not carry specific meaning, we utilize the body part to object distance as a discriminative feature to indicate which body part has a close spatial relation with the object. We use the normalized distance between the object center (u_o, v_o) and a body part box center (u_i, v_i):
d_i = D((u_o, v_o), (u_i, v_i)) / √(H² + W²)
where H and W are the height and width of the union bounding box of the human b_h and object b_o, and D(·, ·) is the Euclidean distance between two two-dimensional (2D) points. Considering all distances, we have a distance vector d = (d_1, …, d_n), from which we obtain the spatial feature f^sp_par by applying another fully connected layer W^sp_par:
f^sp_par = W^sp_par d
Finally, the part-level pairwise feature is represented by concatenating the aggregated local body part visual feature, the augmented object feature, and the local spatial relations:
f_par = f^h_par ⊕ f^o_par ⊕ f^sp_par
We do not employ attention modules for feature refinement [13,15], as the pose keypoints are already effective for region selection. The effectiveness of each component is validated in Section 4.5.2.
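The distance feature can be sketched in a few lines (pure Python; normalizing by the diagonal of the human–object union box is one plausible reading of the normalization, and the function name is illustrative):

```python
import math

def part_distance_feature(obj_center, part_centers, union_h, union_w):
    """Normalized distances d_i between the object center and each of the
    n body-part box centers. Distances are divided by the diagonal of the
    human-object union box so the feature is scale-invariant."""
    diag = math.hypot(union_h, union_w)
    return [math.dist(obj_center, c) / diag for c in part_centers]

# Object centred on the first part; second part is 5 px away in a
# 30 x 40 union box (diagonal 50), giving a normalized distance of 0.1.
d = part_distance_feature((50.0, 50.0), [(50.0, 50.0), (53.0, 54.0)], 30.0, 40.0)
```

Small values of d_i then indicate which body part is most likely to be involved in the interaction.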
Since the instance-level and part-level pairwise features together encode the appearance of a human–object pair, we concatenate them and pass the result through fully connected layers W_appr to predict an action probability p^a_{h,o}:
p^a_{h,o} = σ(W_appr (f_ins ⊕ f_par))
where σ is a sigmoid layer.
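A toy stand-in for this prediction head (a single linear layer with per-class sigmoids, rather than the paper's actual FC stack) illustrates the multi-label scoring, where each action class gets an independent probability:

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def predict_actions(f_appr, weights, biases):
    """Multi-label action scores p^a = sigma(W f): one independent sigmoid
    per action class, so several actions can score high simultaneously.
    `weights` has one row per action class."""
    return [sigmoid(sum(w * x for w, x in zip(row, f_appr)) + b)
            for row, b in zip(weights, biases)]

# Toy 2-D feature and two action classes with hand-picked weights.
scores = predict_actions([1.0, -1.0], [[2.0, 0.0], [0.0, 2.0]], [0.0, 0.0])
```

Because the classes are scored independently (no softmax), the same human–object pair can be assigned, e.g., both hold and drink.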

Semantic-level pairwise feature stream
Inspired by Ref. [26], which utilizes a group of semantically related object labels to improve generalization, we propose a semantic-level pairwise feature incorporating object labels to explore the semantic dependency of different actions. The semantic-level pairwise feature for a human–object pair is constructed in the same way as the instance-level pairwise feature, except that the visual object feature is replaced by the language embedding feature of its object label. This is based on the observation that, given the human appearance, human–object spatial relations, and object label, very reasonable predictions can be made in scenarios like eat or drink. Thus, the semantic-level pairwise feature is defined as
f_sem = f^h_sem ⊕ f^o_sem ⊕ f^sp_sem
We adopt a weight sharing strategy to learn this feature. Weights for the human visual feature and spatial relation are shared across instance-level and semantic-level features for joint learning and better consistency. Therefore f^h_sem shares parameters with f^h_ins, and f^sp_sem shares parameters with f^sp_ins; f^h_ins and f^sp_ins were described in Section 3.2. The semantic feature f^o_sem is obtained from the object label c_o with three fully connected layers, the first of which is initialized by word2vec [32] embeddings.
We utilize f_sem to independently predict an action classification score q^a_{h,o} as a semantic prior:
q^a_{h,o} = σ(W_sem f_sem)
where σ is a sigmoid layer and W_sem denotes two fully connected layers.

Loss function
Our proposed network can be trained in an end-to-end fashion. In an HOI detection task, a person can simultaneously perform more than one action, making it a multi-label classification problem. As positive ⟨human, action, object⟩ triplets are relatively sparse among all candidate triplets, some previous work [12,15] further predicts an interaction or affinity term to filter out human–object pairs that are not interacting. Here, we instead address the sparsity problem by applying a focal loss [33] to the multi-label action classification, which down-weights the many easy negative examples.
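For reference, the scalar (single-label) form of the focal loss of Lin et al. [33] can be sketched as follows; the α and γ defaults are the common values from that paper, not necessarily the settings used by PFNet:

```python
import math

def focal_loss(p, y, alpha=0.25, gamma=2.0):
    """Binary focal loss for one predicted probability p and label y in {0, 1}.
    Well-classified examples are down-weighted by (1 - p_t)^gamma, which
    keeps the loss from being dominated by the many easy negatives."""
    p_t = p if y == 1 else 1.0 - p
    alpha_t = alpha if y == 1 else 1.0 - alpha
    return -alpha_t * (1.0 - p_t) ** gamma * math.log(p_t)

# An easy negative (p = 0.1) contributes far less than a hard one (p = 0.9).
easy = focal_loss(0.1, 0)
hard = focal_loss(0.9, 0)
```

In the multi-label setting the loss is simply summed over the per-action sigmoid outputs.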

Experiments
In this section we first introduce datasets and evaluation metrics used in our experiments. Then comparisons with state-of-the-art methods are presented; we also conducted extensive experiments to validate the effectiveness of our proposed network.

Datasets
V-COCO [17] and HICO-DET [7] are the two most commonly used benchmark datasets for HOI detection. V-COCO annotates 10,346 human instances with 26 actions for a subset of the COCO [34] dataset. It has 2533 images for training, 2867 for validation, and 4946 for testing. HICO-DET is a much larger dataset with 117 action labels, the 80 object labels in COCO, and altogether 600 HOI classes. The training set has 38,118 images and the test set has 9658 images. The whole dataset annotates more than 150,000 instances.

Evaluation metrics
We use the standard mean average precision (mAP) as the evaluation metric for both datasets. A ⟨human, action, object⟩ result is regarded as a true positive if the action is correctly predicted and the intersection over union (IoU) between each detected human/object instance and its ground truth instance is greater than 0.5.
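The matching criterion can be sketched directly (hypothetical helper names; boxes are (x1, y1, x2, y2) corner tuples):

```python
def iou(a, b):
    """Intersection over union of two axis-aligned boxes (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter)

def is_true_positive(pred, gt, thresh=0.5):
    """An HOI triplet counts as correct only if the action matches and BOTH
    the human and the object boxes overlap ground truth with IoU > thresh."""
    return (pred["action"] == gt["action"]
            and iou(pred["human_box"], gt["human_box"]) > thresh
            and iou(pred["object_box"], gt["object_box"]) > thresh)
```

Note the conjunction: a correct action with a poorly localized object box still counts as a false positive.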

Implementation details
To enable a fair comparison we adopt the same setting as Ref. [12]. Faster R-CNN ResNet50 [2] pretrained on the COCO dataset is used as the feature backbone and kept frozen. The pose estimation results are obtained using AlphaPose [30] and the object detection results come from Detectron [35]. In each level of pairwise features, f^h, f^o, and f^sp are all 1024-dimensional, giving the overall pairwise feature a size of 3072. The number of hidden units for all fully connected layers is set to 1024. In the part-level feature, the dimension of body part features is reduced to 256 with a spatial size of 5 × 5. Following Ref. [12], we train V-COCO on the trainval set. During training, we sample positive and negative examples in a ratio of 1:3, use a batch of 8 training images, and optimize with Adam [36]. We set the initial learning rate to 10^-4 and reduce it to 10^-5 in the 11th epoch for V-COCO and the 7th epoch for HICO-DET. Our model is trained for 20 epochs in total on both datasets. During testing, the object threshold is set to 0.1, while the human threshold is set to 0.3 for V-COCO and 0.5 for HICO-DET. As the whole pipeline starts with a localization stage, the quality of detected human and object instances affects the final HOI detection score. Therefore the final action prediction score is merged with the instance confidence scores. We apply the Low-grade Instance Suppression (LIS) function [12] to make a non-linear adjustment to the original detection scores. For V-COCO we also conduct post-processing to remove contradictory predictions, following Ref. [12].
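For illustration, the LIS reweighting can be sketched as a saturating logistic of the raw detection score; the functional form and the constants T, k, w below follow one published parameterization of LIS and may differ from the values used in this work:

```python
import math

def lis(score, T=8.3, k=12.0, w=10.0):
    """Low-grade Instance Suppression: a monotone, saturating non-linear
    reweighting of a raw detection score. Low-confidence detections are
    suppressed much more strongly than linear scaling would suggest,
    while high-confidence ones retain similar relative weight."""
    return T / (1.0 + math.exp(k - w * score))

# Ratio lis(0.9)/lis(0.3) is far larger than the linear ratio 0.9/0.3,
# so weak detections contribute little to the merged HOI score.
```

The merged HOI score then multiplies the action probability by the LIS-adjusted human and object detection scores.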

Comparison with state of the art
We report quantitative results from our proposed pipeline on the V-COCO dataset and compare them to the results of other state-of-the-art methods in Table 1. One can see that our proposed approach achieves a role mAP of 52.8, surpassing all other methods. Table 2 compares results from a number of approaches, using COCO-pretrained detectors, on the HICO-DET dataset. HICO-DET has two settings, Default and Known Objects. For each setting, the model is evaluated in three different modes: the full mode with all 600 HOIs, the rare mode with the 138 HOIs that have fewer than 10 training samples, and the non-rare mode with the remaining 462 HOIs. We report a very competitive mAP of 20.05 in the Default mode. Note that we outperform PMFNet [15], another pose-guided multi-level network, by a large margin (2.59 mAP), owing to better pairwise feature representation and training strategies. While two recently published methods [39,40] achieve 21.34 and 22.65 respectively, they rely on external 3D information and heavily annotated body part states. Figure 4 provides some qualitative results for the V-COCO and HICO-DET datasets. As can be seen, our model distinguishes fine-grained actions well and is able to handle challenging cases in which multiple humans interact with multiple objects.

Effect of each level of pairwise features
Our network consists of three levels of pairwise features. To fully understand how they contribute to the final result, we conducted ablation experiments on the V-COCO dataset. We used the model with only the instance-level feature as the baseline and ablated the other two pairwise features. All models were trained with the same settings; results are shown in Table 3. We also investigated the necessity of predicting a pairwise interaction or affinity score. Using our full model, we applied an interaction score pretrained on HICO-DET, provided by Li et al. [12], to filter out non-interacting pairs. The final performance improved only very slightly, by 0.08, indicating that our model has implicitly captured the pairwise affinity.
We also evaluated the per-class performance to examine the effect of the semantic-level pairwise feature. As shown in Table 4, the model using a pairwise semantic feature significantly outperforms one without semantics on specific action classes. Actions like drink and read can be well predicted with the assistance of a class-specific action prior. This demonstrates the efficacy of the semantic-level feature. Figure 5 shows various cases in which predictions are considerably improved by employing part-level and semantic-level features.

Components in part-level feature
As a pairwise feature exploits both visual and spatial information, we also investigated the contribution of each component of the part-level feature. We considered various combinations of the aggregated body part feature f^h_par, augmented object feature f^o_par, and part-level spatial feature f^sp_par; results are shown in Table 5.

Limitations
Our approach has limitations. Firstly, as shown in Table 4, our approach performs worse with the semantic-level pairwise feature on some action classes such as "talk on phone". This is mainly because the semantic prior may lead to an incorrect association between human and object in confusing scenes: see Fig. 6 (left). A possible solution could be to apply attention modules for level-wise feature selection to weight the different features. Secondly, our approach is two-stage, and its results are influenced by the accuracy of object detection: see Fig. 6 (right). An end-to-end multi-task network that simultaneously detects objects and interactions could help to improve both accuracy and efficiency.

Conclusions
In this paper, we have presented a multi-level pairwise feature network (PFNet) for human–object interaction detection. We represent the human–object interaction as multi-level pairwise visual and spatial relations in a unified formulation. In addition to the instance-level pairwise feature, the part-level pairwise feature exploits local visual and spatial relations between a body part and an object guided by pose keypoints, while the semantic-level pairwise feature represents an object by its semantic label. Extensive experiments show that our proposed approach utilizing multi-level pairwise features for HOI detection outperforms other methods on the V-COCO dataset, while various ablation studies demonstrate the utility of multi-level pairwise features and of fine-grained visual and spatial features involving body parts.

Xiaolei Huang is an associate professor in the College of Information Sciences and Technology at Pennsylvania State University. Her research interests lie at the intersection of biomedical image analysis, machine learning, and computer vision. She has over 140 publications and holds 7 patents in these areas. She is an associate editor of the Computer Vision and Image Understanding journal. She received her bachelor's degree in computer science from Tsinghua University, and her master's and doctoral degrees in computer science from Rutgers University.
Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made.
The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder.