JMNet: A joint matting network for automatic human matting

We propose a novel end-to-end deep learning framework, the Joint Matting Network (JMNet), to automatically generate alpha mattes for human images. We utilize the intrinsic structures of the human body as seen in images by introducing a pose estimation module, which can provide both global structural guidance and a local attention focus for the matting task. Our network model includes a pose network, a trimap network, a matting network, and a shared encoder to extract features for the above three networks. We also append a trimap refinement module and utilize gradient loss to provide a sharper alpha matte. Extensive experiments have shown that our method outperforms state-of-the-art human matting techniques; the shared encoder leads to better performance and lower memory costs. Our model can process real images downloaded from the Internet for use in composition applications.


Introduction
Various graphics applications in the digital media sector, such as mixed reality, film production, and photographic composition, need accurate extraction of human regions from images. Some recent human segmentation methods [1,2] focus on producing coarse human silhouettes. However, for the aforementioned applications, coarse segmentation without accurate alpha estimation for translucent human regions, such as clothing and hair, is insufficient. Therefore, automatic human alpha matting has attracted much attention in the computer vision community.
Alpha matting takes an image I as input and assumes that it is a linear blend of a foreground image F and a background image B:
$$I = \alpha F + (1 - \alpha) B \tag{1}$$
where α is the per-pixel alpha matte with values in the range [0, 1]. For a color image, this leads to 7 unknown variables at each pixel (3 for F, 3 for B, and 1 for α) but only 3 known values in I, making this problem extremely ill-posed. In previous methods, user interaction is used to guide the matting algorithms [3,4], using, e.g., trimaps or scribbles. However, these interactions need a certain level of professional knowledge and can be time-consuming to get satisfactory results.
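As a small illustration of the compositing model, the sketch below blends a single RGB pixel as I = αF + (1 − α)B. The function name `composite` and the pixel values are hypothetical and chosen only for this example.

```python
def composite(alpha, fg, bg):
    """Blend one RGB pixel: I = alpha * F + (1 - alpha) * B, channel by channel."""
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must lie in [0, 1]")
    return tuple(alpha * f + (1.0 - alpha) * b for f, b in zip(fg, bg))


# A half-transparent red foreground over a blue background.
pixel = composite(0.5, (1.0, 0.0, 0.0), (0.0, 0.0, 1.0))
print(pixel)  # (0.5, 0.0, 0.5)

# Matting inverts this: only the 3 values of I are observed, while
# F (3 values), B (3 values), and alpha (1 value) are all unknown.
```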
In the human matting task, because of the latent intrinsic structure of the human body, a trimap can be intuitively generated by eroding the human segmentation map, as a guide for matte estimation. But this trivial solution rarely produces satisfactory results [5]. One reason is that semantic segmentation aims to coarsely separate humans from the background; it often ignores fine details in the segmentation results. Simple fixed-width boundary erosion of the segmentation map cannot be used universally for different semantic parts. For example, hair needs a wider boundary than the limbs. Chen et al. [6] propose an end-to-end deep learning framework for human matting, which combines a segmentation network with a matting network using a fusion module. Their method can automatically generate alpha mattes, but may miss some body parts or details when the background is complicated.
We propose a novel end-to-end deep learning framework, the Joint Matting Network (JMNet), to improve the accuracy of human matting for more complicated images. We utilize the semantic structure of the human image regions to assist the neural network's learning process. We combine the matting network with a pose estimation module, which produces heatmaps of body keypoints representing intrinsic human structures.
The heatmaps provide global topology for the trimap generation process and guide the matting network to pay adaptive attention to different body parts. JMNet contains three subnetworks, a pose network, a trimap network, and a matting network. These subnetworks share features extracted from the image by a backbone encoder. The outputs of the earlier subnetworks are also fed into the later subnetworks as guiding inputs. A trimap refinement component is used for better trimap boundary generation. To get a finer and sharper alpha matte, we apply gradient loss and skip connections. Figure 1 shows an alpha matte predicted by JMNet.
Extensive experiments have shown that our method can produce high quality alpha mattes for high-resolution human images. Our experiments used the DIM dataset [7] and a newly collected large-scale human image dataset with carefully annotated matte values. They show that our method outperforms state-of-the-art human matting approaches on several quantitative metrics. We have also tested our model on some Internet images and produced impressive alpha matte results for a composition task, demonstrating the generalizability of our framework.

Image matting
Most image matting techniques can be classified as being sampling-based, affinity-based, or learning-based.
Sampling-based methods [8][9][10][11] gather several color samples from the background and the foreground marked by the trimap. These samples help build a foreground and background model to estimate the alpha value in the transition area. Affinity-based methods propagate alpha values of known foreground and background pixels to the unknown regions according to affinity scores based on, e.g., spatial proximity and color similarity, which can lead to high computational and memory costs.
Recently, with the rapid development of deep learning techniques, learning-based image matting methods have been introduced, achieving impressive results. Cho et al. [12] use a deep convolutional network to refine alpha matte values produced by closed-form matting [3] and KNN matting [4]. Xu et al. [7] present an end-to-end deep learning network trained by alpha and composition loss functions. They also provide a large natural image dataset with precise matte annotations, the first large-scale image matting dataset available for training neural networks. AlphaGAN [13] applies generative adversarial networks (GANs) to the image matting task for more realistic and sharper results. Tang et al. [14] propose a hybrid sampling-based and learning-based matting method, which estimates the background and foreground color values for unknown regions as guidance in predicting alpha matte values.
All the above matting techniques rely on user interaction, generally in the form of providing trimaps labeling background, foreground, and uncertain regions. Drawing a trimap with per-pixel annotation is not easy for novice users. Shen et al. [5] propose an automatic portrait matting method, utilizing a fully-convolutional network [15] to predict the trimap for the input image. The network also learns to predict the parameters of the Laplacian matrix. The alpha matte is eventually computed by the formula provided by Levin et al. [3]. Chen et al. [6] combine PSPNet [16] with the matting network proposed by Xu et al. [7] and simultaneously train them with a fusion module. Zhang et al. [17] present a novel matting method taking a single RGB image as input. Their framework contains two branches for foreground and background classification respectively and applies a fusion branch to predict alpha matte values from the two classification results.
Nevertheless, all of these automatic image matting techniques ignore the intrinsic structures of the human body, possibly leading to erroneous decisions when facing complex scenes.

Semantic segmentation and pose estimation
Many successful semantic segmentation techniques are based on deep convolutional neural networks. FCN [15] was the first fully CNN-based method for semantic segmentation. U-Net [18] uses skip connections to build a bridge between the encoder and decoder, allowing the network to be aware of low-level details. DeepLab [19] applies atrous convolution in the segmentation network for dense feature extraction and field-of-view enlargement. PSPNet [16] presents a pyramid pooling module to take advantage of multiscale features containing both global structure and local texture.
Human segmentation methods are usually built upon general segmentation networks like the aforementioned ones. PFCN+ [2] calculates an average mask from the training set and aligns it to each portrait image by detecting facial landmarks. The mask is then fed into the segmentation network as prior information along with the input image. Chen et al. [1] use a boundary attention module to refine boundary areas in portrait segmentation; boundary attention is not used in general segmentation frameworks but is important for portrait images.
DCNN-based pose estimation methods either use regression to determine the positions of keypoints [20,21], or predict heatmaps generated by the keypoints [22][23][24]. Some papers use pose heatmaps to guide human parsing [25,26] and human image completion [27]. Our method uses heatmaps as a pose representation, because heatmaps can readily encode body structures and provide convenient inputs to the other neural networks.

Overview
Our deep learning model, the Joint Matting Network (JMNet), addresses automatic human image matting. Its input is an RGB image; its output is an alpha matte giving the transparency value for each foreground pixel. Unlike previous image matting techniques, our approach does not need user interaction, but generates a trimap automatically during the prediction process. Given an input image, our framework produces pose heatmaps, a trimap, and the alpha matte in turn. Earlier outputs are fed into the next subnetwork as guidance. This end-to-end deep learning framework allows our method to predict a human image alpha matte quickly and accurately.

Architecture
JMNet contains an encoder and three subnetworks: a pose network, a trimap network, and a matting network. The encoder first extracts multi-scale features from the input image. These features are then fed into the pose network to produce the heatmaps. The heatmaps and the features are fed into the trimap network to generate the trimap. Finally, the trimap, the heatmaps, and the features are fed into the matting network to predict the alpha matte. Figure 2 provides an overview of JMNet.
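The data flow described above can be sketched as a plain function that chains four callables. This is only an illustration of the ordering of inputs and outputs: `jmnet_forward` and the stand-in lambdas are hypothetical names, not the authors' implementation, and the real subnetworks are convolutional networks rather than string functions.

```python
def jmnet_forward(image, encoder, pose_net, trimap_net, matting_net):
    """Run the three subnetworks in order, passing earlier outputs forward as guidance."""
    features = encoder(image)                        # shared multi-scale features
    heatmaps = pose_net(features)                    # K keypoint heatmaps
    trimap = trimap_net(features, heatmaps)          # foreground / background / uncertain
    alpha = matting_net(features, heatmaps, trimap)  # final alpha matte
    return heatmaps, trimap, alpha


# Stand-in callables (not real networks) that only record what each stage receives.
outputs = jmnet_forward(
    "image",
    encoder=lambda img: f"feat({img})",
    pose_net=lambda f: f"pose({f})",
    trimap_net=lambda f, h: f"trimap({f}, {h})",
    matting_net=lambda f, h, t: f"alpha({f}, {h}, {t})",
)
print(outputs[2])
```

The encoder is called once and its features are reused by every later stage, which is the sharing that the architecture relies on.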

Feature encoder
We use a pre-trained ResNet-50 [28] to extract features since it is a highly successful image classifier and has proved to be a reliable feature extractor in other computer vision tasks [16,29,30]. Then we apply the atrous spatial pyramid pooling (ASPP) module from DeepLab-v2 [19]. ASPP can exploit multi-scale image representations from the extracted features, where features with large dilations contribute to the global body structure while features with small dilations are helpful for local parts.
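To make the role of dilation concrete, the following one-dimensional sketch applies the same 3-tap kernel with different dilation rates. It is a toy illustration of atrous convolution, not the ASPP module itself; the function names and the dilation rates used are assumptions for this example and are not taken from JMNet.

```python
def atrous_conv1d(signal, kernel, dilation):
    """Valid 1D cross-correlation whose kernel taps are `dilation` samples apart."""
    span = (len(kernel) - 1) * dilation  # distance between first and last tap
    return [
        sum(k * signal[i + j * dilation] for j, k in enumerate(kernel))
        for i in range(len(signal) - span)
    ]


def receptive_field(kernel_size, dilation):
    """Number of input samples covered by one output value."""
    return (kernel_size - 1) * dilation + 1


signal = [1, 2, 3, 4, 5, 6, 7]
print(atrous_conv1d(signal, [1, 1, 1], dilation=1))  # [6, 9, 12, 15, 18]
print(atrous_conv1d(signal, [1, 1, 1], dilation=2))  # [9, 12, 15]
print(receptive_field(3, 1), receptive_field(3, 6))  # 3 13
```

With the same number of weights, a larger dilation covers a wider context, which is why large dilations suit global body structure and small ones suit local parts.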

Pose network
The pose subnetwork consists of several convolutional layers. The input to the pose network comprises the features extracted by the encoder; its output comprises K-dimensional heatmaps, where K is the number of keypoints in the human pose. The ground-truth heatmaps consist of 2D Gaussians (with a standard deviation of 1 pixel) centered on the joint locations. The pose network is able to explore intrinsic human structures, encoding them into the heatmaps as guidance for downstream subnetworks.
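A ground-truth heatmap for one keypoint can be sketched as follows. The standard deviation of 1 pixel comes from the text; the unnormalized Gaussian with peak value 1, the (x, y) ordering of `center`, and the function name `keypoint_heatmap` are assumptions made for this example.

```python
import math


def keypoint_heatmap(height, width, center, sigma=1.0):
    """Return a height x width grid holding a 2D Gaussian centered on `center` = (x, y)."""
    cx, cy = center
    return [
        [
            math.exp(-((x - cx) ** 2 + (y - cy) ** 2) / (2.0 * sigma ** 2))
            for x in range(width)
        ]
        for y in range(height)
    ]


# One heatmap per keypoint; stacking K of them gives the K-channel target.
keypoints = [(2, 2), (4, 1)]  # hypothetical joint locations
heatmaps = [keypoint_heatmap(5, 6, kp) for kp in keypoints]
print(len(heatmaps), heatmaps[0][2][2])  # 2 1.0
```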

Trimap network
The trimap network aims to produce the trimap automatically to avoid the need for user interaction. Generation of the trimap can be regarded as a three-class semantic segmentation task, determining background, foreground, and the uncertain region. The input to the trimap network is a combination of the features and the heatmaps. The output is a three-channel map indicating the probabilities that the pixel belongs to the foreground, background, and uncertain region. The original output trimap is 1/8 the size of the input image and simply upsampling it will lead to an over-smooth result. Therefore, we append a refinement module after the trimap network to sharpen the upsampled trimap, making it more similar to a manual annotation.
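Treating trimap generation as three-class segmentation means a discrete trimap can be read off the per-pixel probabilities by taking the most likely class. The sketch below assumes the channel order (foreground, background, uncertain) and the common gray-level convention 255 / 0 / 128; both are assumptions for illustration, and the refinement module is not modeled.

```python
def probs_to_trimap(probs):
    """Map an H x W grid of (foreground, background, uncertain) probabilities to gray levels."""
    values = (255, 0, 128)  # foreground, background, uncertain
    return [
        [values[max(range(3), key=lambda c: p[c])] for p in row]
        for row in probs
    ]


probs = [[(0.8, 0.1, 0.1), (0.1, 0.7, 0.2), (0.2, 0.2, 0.6)]]
print(probs_to_trimap(probs))  # [[255, 0, 128]]
```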

Matting network
The features, the heatmaps, and the trimap are concatenated and fed into the matting network, which predicts the alpha matte for the human image. The matting task focuses on local details while the segmentation task pays more attention to global semantics. Nevertheless, the final features extracted by the encoder generally represent high-level abstract information and may not capture many details. Inspired by U-Net [18], we use skip connections from the shallow layers of the encoder to the deep layers of the matting network, so that low-level features can be transmitted to the matte estimation process.
We also adopt the fusion module proposed in SHM [6]. The final alpha matte combines the alpha values predicted by the matting network and the probability of each class predicted by the trimap network:
$$\alpha = F_s + U_s \, \alpha_m \tag{2}$$
where $\alpha$ denotes the final alpha matte, $\alpha_m$ denotes the alpha matte predicted by the matting network, and $F_s$ and $U_s$ denote the probabilities that the pixel belongs to the foreground and the uncertain regions respectively after a softmax function. Table 1 provides architectural details for the above three subnetworks.
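The fusion step can be illustrated per pixel, assuming the SHM form in which the final alpha is the foreground probability plus the uncertain probability times the matting network's prediction. The channel order (foreground, background, uncertain) and the function names are assumptions for this sketch.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a short sequence of scores."""
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def fuse_alpha(trimap_logits, alpha_m):
    """Fuse one pixel: alpha = F_s + U_s * alpha_m, with F_s and U_s taken after softmax."""
    f_s, _b_s, u_s = softmax(trimap_logits)
    return f_s + u_s * alpha_m


# Ambiguous pixel: equal scores give F_s = U_s = 1/3, so alpha = 1/3 + 0.6/3.
print(fuse_alpha((0.0, 0.0, 0.0), 0.6))
```

Where the trimap network is confident about foreground or background, the fused value is driven to 1 or 0 regardless of the matting prediction; only in the uncertain region does the matting network's output pass through.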

Loss functions
The loss function for optimizing the parameters of each subnetwork is defined by the difference between the predicted result and the ground truth for each subnetwork.
For the pose network, following past pose estimation methods [22][23][24], we use the mean squared error (MSE) loss function for heatmap generation. It can be formulated as
$$\mathcal{L}_p = \| p - \hat{p} \|_2^2 \tag{3}$$
where $p$ denotes heatmaps produced by the pose network and $\hat{p}$ denotes ground-truth heatmaps. For the trimap network, cross entropy loss is adopted, which can be formulated as
$$\mathcal{L}_t = -\sum_{c=1}^{C} \hat{t}_c \log t_c \tag{4}$$
where $C$ denotes the number of classes (3 in our task), $t_c$ denotes the predicted probability that the pixel belongs to class $c$, and $\hat{t}_c$ denotes the ground-truth class (1 or 0).
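The two losses can be written out for flat lists of values as below. Averaging the squared error over all heatmap entries and adding a small `eps` inside the logarithm are assumptions of this sketch, as are the function names; the source does not state the normalization.

```python
import math


def mse_loss(pred, target):
    """Mean squared error between predicted and ground-truth heatmap values."""
    squared = [(p - t) ** 2 for p, t in zip(pred, target)]
    return sum(squared) / len(squared)


def cross_entropy(pred_probs, target_onehot, eps=1e-12):
    """Cross entropy for one pixel: -sum_c target_c * log(pred_c)."""
    return -sum(t * math.log(p + eps) for p, t in zip(pred_probs, target_onehot))


print(mse_loss([0.0, 1.0], [0.0, 0.0]))            # 0.5
print(cross_entropy([0.5, 0.25, 0.25], [1, 0, 0]))  # close to log(2)
```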
For the matting network, we use the alpha loss and the compositional loss proposed by Xu et al. [7]. Alpha loss is defined as the $L_1$ distance between the predicted alpha matte $\alpha$ and the ground-truth alpha matte $\hat{\alpha}$. Compositional loss is defined as the $L_1$ distance between the predicted compositional image values $I$ and the ground-truth image values $\hat{I}$. However, the matting network produces blurry alpha mattes if we only train it using the above two losses. To address this issue, we introduce the gradient loss defined by the $L_1$ distance between the spatial gradients of the predicted and the ground-truth alpha mattes, denoted by $G$ and $\hat{G}$ respectively. The final loss function of the matting network is
$$\mathcal{L}_m = \lambda_\alpha \| \alpha - \hat{\alpha} \|_1 + \lambda_I \| I - \hat{I} \|_1 + \lambda_G \| G - \hat{G} \|_1 \tag{5}$$
where $\lambda_\alpha$, $\lambda_I$, and $\lambda_G$ balance the three losses. We set $\lambda_\alpha = 0.4$, $\lambda_I = 0.3$, $\lambda_G = 0.3$ in our experiments.
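The effect of the gradient term can be seen in a small sketch that combines the three L1 terms with the weights 0.4, 0.3, and 0.3 given in the text. Several choices here are assumptions and not taken from the source: the spatial gradient is approximated by forward differences (|dx| + |dy|), each L1 term is averaged over pixels, and the composited images are single-channel grids for brevity.

```python
def l1(a, b):
    """Mean absolute difference between two equally sized 2D grids."""
    diffs = [abs(x - y) for ra, rb in zip(a, b) for x, y in zip(ra, rb)]
    return sum(diffs) / len(diffs)


def gradients(alpha):
    """Forward-difference gradient magnitude |dx| + |dy|, zero past the far border."""
    h, w = len(alpha), len(alpha[0])
    return [
        [
            abs((alpha[y][x + 1] if x + 1 < w else alpha[y][x]) - alpha[y][x])
            + abs((alpha[y + 1][x] if y + 1 < h else alpha[y][x]) - alpha[y][x])
            for x in range(w)
        ]
        for y in range(h)
    ]


def matting_loss(alpha, alpha_gt, comp, comp_gt, w_alpha=0.4, w_comp=0.3, w_grad=0.3):
    """Weighted sum of alpha, compositional, and gradient L1 losses."""
    return (
        w_alpha * l1(alpha, alpha_gt)
        + w_comp * l1(comp, comp_gt)
        + w_grad * l1(gradients(alpha), gradients(alpha_gt))
    )


sharp = [[0.0, 1.0], [0.0, 1.0]]
blurry = [[0.25, 0.75], [0.25, 0.75]]
# The blurry prediction is penalized both for its values and for its weaker edge.
print(matting_loss(blurry, sharp, sharp, sharp))
```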
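The overall loss function is a weighted sum of the above three losses:
$$\mathcal{L} = \lambda_p \mathcal{L}_p + \lambda_t \mathcal{L}_t + \lambda_m \mathcal{L}_m \tag{6}$$
where $\mathcal{L}_p$, $\mathcal{L}_t$, and $\mathcal{L}_m$ denote the pose, trimap, and matting losses respectively. We set $\lambda_p = 0.001$, $\lambda_t = 0.01$, $\lambda_m = 1$ in our experiments.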

Implementation details
Since JMNet has several joint subnetworks, and it is hard to train them simultaneously, we divide the training process into two stages. In the first stage, we use the ground-truth pose heatmaps and trimaps as inputs to the subnetworks. In the second stage, we use the output of the corresponding subnetwork as the input instead of the ground truth for end-to-end training. We train the entire network for 5 epochs in each of the two stages. We use Adam as the optimizer and set the learning rate to 0.00001. The batch size is 32. It takes about 1 day for the two-stage training process on four NVIDIA GTX 1080 Ti GPUs. We apply OpenPose [31] to find 2D pose keypoints for all the images in the human dataset. The heatmaps are then generated using these poses and regarded as the ground truth for training. The ground-truth trimap is generated by dilating the ground-truth alpha matte by a random kernel size in the range [5,20]. Following Xu et al. [7], data augmentation is performed by random cropping and horizontal flipping. The images are cropped to different sizes: 320 × 320, 480 × 480, 640 × 640, and then resized to a fixed size of 320 × 320.
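One plausible reading of the trimap generation step is sketched below: pixels with fractional alpha form the uncertain region, which is then dilated with a square kernel whose size is drawn from [5, 20]. The square kernel shape, the gray levels 255 / 0 / 128, and the function names are assumptions for this example; the source only states that the alpha matte is dilated with a random kernel size.

```python
import random


def trimap_from_alpha(alpha, kernel_size):
    """Label pixels 255 (foreground), 0 (background), or 128 (uncertain, dilated)."""
    h, w = len(alpha), len(alpha[0])
    r = kernel_size // 2
    trimap = [[255 if a >= 1.0 else 0 if a <= 0.0 else 128 for a in row] for row in alpha]
    for y in range(h):
        for x in range(w):
            if 0.0 < alpha[y][x] < 1.0:
                # Spread the uncertain label over the kernel window around this pixel.
                for yy in range(max(0, y - r), min(h, y + r + 1)):
                    for xx in range(max(0, x - r), min(w, x + r + 1)):
                        trimap[yy][xx] = 128
    return trimap


def random_trimap(alpha, rng=random):
    """Draw the kernel size uniformly from [5, 20], as in the training setup."""
    return trimap_from_alpha(alpha, rng.randint(5, 20))


row = [[0.0, 0.0, 0.0, 0.5, 1.0, 1.0, 1.0]]
print(trimap_from_alpha(row, 3))  # [[0, 0, 128, 128, 128, 255, 255]]
```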

Dataset
Xu et al. [7] provided the Deep Image Matting (DIM) dataset; its training set includes 493 foreground objects. We selected all 216 training images which only contain humans from the DIM dataset. To enhance the scalability of our method, we also collected an extra 30,000 human images for training. To obtain matte annotations, we asked a few volunteers to draw trimaps and used the closed-form matting algorithm [3] to estimate the alpha matte values as well as foreground color values. Since the backgrounds of these images are all pure colors, the estimated results are mostly accurate and can be regarded as ground truth.
Following Xu et al. [7], we constructed our training set by compositing the foreground images with the background images by alpha blending. We randomly sampled background images excluding humans from the MS-COCO dataset [32]. Each foreground image is composited with N background images. We set N = 100 for the DIM dataset and N = 1 for our own dataset. Therefore, we had 51,600 human images in total for training; some examples are shown in Fig. 3. Meanwhile, we followed the same approach to construct the test set. We selected all 10 human images from the DIM test set where N = 20 and collected 300 human images with matte annotations where N = 1. Consequently, the test set for evaluations consisted of 500 images.
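The dataset sizes quoted above follow directly from the number of foregrounds and the number N of backgrounds per foreground. The short check below reproduces the arithmetic; the helper name `composited_count` is hypothetical.

```python
def composited_count(groups):
    """Total composited images for (num_foregrounds, backgrounds_per_foreground) pairs."""
    return sum(n_fg * n_bg for n_fg, n_bg in groups)


train_total = composited_count([(216, 100), (30000, 1)])  # DIM humans, own dataset
test_total = composited_count([(10, 20), (300, 1)])
print(train_total, test_total)  # 51600 500
```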

Comparison
Our approach is compared with Semantic Human Matting (SHM) [6], the state-of-the-art human matting method. We implemented and trained the SHM model by using our new dataset. We also compared our method with Closed Form (CF) matting [3] and Deep Image Matting (DIM) [7], where the trimap is produced by PSPNet50 [16] pre-trained for trimap generation. Figure 4 shows several predicted alpha matte results using the above methods. Because JMNet exploits the intrinsic structure of humans by pose estimation, our method can recover the complete human body when other methods fail. Our method also performs better for fine details owing to the gradient loss and skip connections.
We also conducted quantitative comparison experiments. Four metrics were used to measure the quality of the predicted alpha matte: SAD (sum of absolute differences), MSE (mean square error), gradient error, and connectivity error defined by Rhemann et al. [33]. All metrics were calculated over the entire image instead of only the unknown region and averaged by the total number of pixels. The results in Table 2 show that our method outperforms other state-of-the-art matting methods, for all metrics.
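The two simplest metrics, computed over the entire image and averaged by the number of pixels as described above, can be sketched as follows. Gradient and connectivity errors follow the definitions of Rhemann et al. [33] and are not reproduced here; the function names are hypothetical.

```python
def sad_per_pixel(pred, gt):
    """Sum of absolute differences over the whole image, divided by the pixel count."""
    diffs = [abs(p - g) for rp, rg in zip(pred, gt) for p, g in zip(rp, rg)]
    return sum(diffs) / len(diffs)


def mse_per_pixel(pred, gt):
    """Mean squared error over the whole image."""
    diffs = [(p - g) ** 2 for rp, rg in zip(pred, gt) for p, g in zip(rp, rg)]
    return sum(diffs) / len(diffs)


pred = [[0.0, 0.5], [1.0, 1.0]]
gt = [[0.0, 1.0], [1.0, 0.0]]
print(sad_per_pixel(pred, gt), mse_per_pixel(pred, gt))  # 0.375 0.3125
```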
Due to the fully-convolutional architecture of JMNet, our method can be applied to test images of arbitrary resolution. We evaluated the inference time as well as the GPU memory cost of our method for different resolutions of input images, and compared them to those of SHM [6]. Experiments were conducted using a 3.40 GHz, 4-core Intel Core i7-4770 CPU and an NVIDIA GTX 1080 Ti GPU. Since JMNet uses a single encoder to extract features and shares them with the subsequent subnetworks, its computational cost and memory cost are both less than for SHM: see Table 3.
Fig. 3 Pairs of images and alpha mattes in our new compositional dataset. The two leftmost foreground images are from the DIM dataset; the right three pairs were collected by ourselves.

Ablation study
In order to investigate the effectiveness of the various components in our framework, we conducted several ablation studies. We excluded the pose network, skip connections, and the gradient loss respectively, and trained the resulting models in the same way as for the full model. We used SAD and gradient error as metrics to quantitatively analyze the effect of each component. Table 4 demonstrates that these components indeed improve the matting results.

Image composition result
Compositing the foreground with a new background is a natural application of image matting. Figure 5 shows several results composited using the alpha matte determined by our method. However, these input images come from the test set of the composited dataset. To demonstrate the generalizability of our method, we tested our model on several real images. Results in Fig. 6 show that our method is able to produce reasonable alpha matte results when compositing real images downloaded from the Internet.
Using the multi-stage joint network exploring the intrinsic structure of the human body, our method is able to recover the global silhouette of the human image as well as fine details in local regions like the hair or the tassel.

Conclusions
We have proposed the Joint Matting Network (JMNet) for human image matting. JMNet is an end-to-end deep learning framework able to predict an alpha matte for human images without any user interaction. Even for some images with sophisticated backgrounds, we are able to generate accurate alpha matte results. With the assistance of global semantic structural information extracted by the pose estimation subnetwork, our trimap and matting subnetworks are capable of generating better matting masks than previous methods. The shared encoder also reduces the computational cost of extracting features. We further apply gradient loss and skip connections to produce finer and sharper alpha mattes. The composited results for Internet images demonstrate the generalizability of our approach.
In future, more semantic information from natural images could be incorporated to extend our method to a more generic solution. More accurate structural representations like the human parsing map, which are able to indicate per-pixel semantic labels such as hair or face, could be considered to improve the matting process by providing more adaptive attention. Our method could also be extended to human video by taking temporal coherence into consideration.