1 Introduction

Facial expressions (FE) are important cues for recognizing non-verbal behaviour. The ability to automatically mine human intentions, attitudes or experiences has many applications, such as building socially aware systems [4, 18], improving e-learning [9], adapting a game's state to the player's emotions [1], and detecting deception during police interrogations [11].

Fig. 1. Patch and structure learning are key problems in AU recognition. (a) By masking a region, an expressive face becomes indistinguishable from a neutral one. (b) Multiple, correlated AUs can be active at the same time.

The Facial Action Coding System (FACS) [5] is a descriptive coding scheme of FEs that focuses on what the face can do, without assuming any cognitive or emotional value. Its basic components, called Action Units (AUs), combine to form a complete representation of FEs.

AUs are patterns of muscular activation and the way they modify facial morphology is localized (Fig. 1a). While initial AU recognition methods (like JPML [25] and APL [28]) used shallow, predefined representations, recent methods (like DRML [26], ROI [12] and GL [7]) applied deep learning to learn richer local features that capture facial morphology. Therefore, one could predict specific AUs from informative face regions selected depending on the facial geometry. For instance, contrary to non-adaptive methods like DRML [26] and APL [28], ROI [12] and JPML [25] extract features around facial landmarks, which are more robust to non-rigid shape changes. Patch learning is challenging as the human face is highly articulated and different patches can contribute to either specific AUs or groups of AUs. Learning the best patch combination together with learning specific features from each patch could be beneficial for AU recognition.

AU recognition is also multi-label. Several AUs can be active at the same time and certain AU combinations are more probable than others (Fig. 1b). AU prediction performance could thus be improved by considering probabilistic dependencies. In deep learning approaches, correlations can be addressed implicitly in the fully connected layers (e.g. DRML [26], GL [7] and ROI [12]). However, structure is not learned explicitly; inference and sparsity remain implicit in the design. JPML [25] addresses the problem by including pre-learned priors about AU correlations in its learning. Learning structured outputs has also been studied with graphical models [6, 19, 25]. However, these models are not end-to-end trainable.

In this work, we claim that patch and structure learning are key problems in AU recognition. We propose a deep neural network that tackles these problems in an integrated way through an incremental and end-to-end trainable approach. First, the model learns local and holistic representations exhaustively from facial patches. Then it captures structure between patches when predicting specific AUs. Finally, AU correlations are captured by a structure inference network that replicates message passing inference algorithms in a connectionist fashion. Table 1 compares some of the most important features of the proposed method to the state-of-the-art (specifically JPML [25], APL [28], DRML [26], GL [7] and ROI [12]). We show that by treating these problems separately in different parts of the network while optimizing them jointly, we improve the state-of-the-art by 5.3% and 8.2% on the BP4D and DISFA datasets, respectively. Summarizing, our 2 main contributions are: (1) we propose a model that learns representation, patch and output structure end-to-end, and (2) we introduce a structure inference topology that replicates inference algorithms in probabilistic graphical models by using a recurrent neural network.

Table 1. Features of our model and related work. LRL: local representation learning, AP: adaptive patch, PL: patch learning, SL: structured learning, EE: end-to-end.

The paper is organized as follows. Section 2 presents related work. Section 3 details the proposed model and Sect. 4 the results. Section 5 concludes the paper.

2 Related Work

Related work is discussed in relation to patch learning or structure learning.

Patch Learning. Inspired by locally connected convolutional layers [17], Zhao et al. [26] proposed a regional connected convolutional layer that learns specific convolutional filters from sub-areas of the input. In [12], different CNNs are trained on different parts of the face merging features in an early fusion fashion with fully connected layers. Zhao et al. [25] performed patch selection and structure learning with shallow representations where patches for each AU were selected by group sparsity learning. Jaiswal et al. [8] used domain knowledge and facial geometry to pre-select a relevant image region for a particular AU, passing it to a convolutional and bi-directional Long Short-Term Memory (LSTM) neural network. Zhong et al. [28] proposed a multi-task sparse learning framework for learning common and specific discriminative patches for different expressions. Patch location was predefined and did not take into account facial geometry.

Structure Learning. Zhang et al. [23] proposed a multi-task approach to learn a common kernel representation that describes AU correlations. Elefteriadis et al. [6] adopted a latent variable Conditional Random Field (CRF) to jointly detect multiple AUs from predesigned features. While most existing methods capture local pairwise AU dependencies, Wang et al. [20] proposed a restricted Boltzmann machine that captures higher-order AU interactions. Together with patch learning, Zhao et al. [25] used positive and negative competitions among AUs to model a discriminative multi-label classifier. Walecki et al. [19] placed a CRF on top of deep representations learned by a CNN. Both components are trained iteratively to estimate AU intensity. Wu et al. [21] used a Restricted Boltzmann Machine that captures joint probabilities between facial landmark locations and AUs. More recently, Benitez et al. [7] proposed a loss combining the recognition of isolated AUs and groups of AUs.

3 Method

Let \(\mathcal {D}=\{\mathbf {X},\mathbf {Y}\}\) be a set of pairs of input images \(\mathbf {X}=\{\mathbf {x}_1,\ldots ,\mathbf {x}_M\}\) and output AU labels \(\mathbf {Y}=\{\mathbf {y}_1,\ldots ,\mathbf {y}_M\}\), with M the number of instances. Each image \(\mathbf {x}_i\) is composed of P patches \(\{I_1,\ldots ,I_P\}\) and each output label \(\mathbf {y}_i\) is a set of N AUs \(\{y_1,\ldots ,y_N\}\), each taking a binary value \(\{0,1\}\). Several AU classes can be active for a single observation, making this a multi-label problem. Predicting such an output is challenging, as a softmax function cannot be applied to the set of outputs, contrary to standard mono-label, multi-class problems. In addition, using independent AU activation functions in losses like cross-entropy ignores AU correlations. Including the ability to learn structure in the model design is thus relevant.
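To make the multi-label setup concrete, the sketch below (PyTorch, with illustrative tensor names of our own) shows N independent sigmoid activations scored with a per-AU binary cross-entropy; by itself such a loss treats the AUs independently and ignores their correlations.

```python
import torch
import torch.nn as nn

N_AUS = 12                                          # e.g. the 12 AUs coded in BP4D
logits = torch.randn(8, N_AUS)                      # raw scores for a batch of 8 faces
targets = torch.randint(0, 2, (8, N_AUS)).float()   # binary AU labels {0, 1}

# A softmax over the N outputs would force the classes to compete, which is
# wrong here since several AUs can be active at once; instead each AU gets its
# own sigmoid and cross-entropy term.
probs = torch.sigmoid(logits)
loss = nn.BCELoss()(probs, targets)                 # ignores AU correlations by itself
```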

Two main ways of solving multi-label learning in AU recognition are either capturing correlations through fully-connected layers [7, 12, 26] or inferring structure through probabilistic graphical models (PGM) [6, 19, 25]. While the former can capture correlations between classes, this is not done explicitly. On the other hand, PGMs offer an explicit solution and their optimization is well studied. Unfortunately, placing classical PGMs on top of neural network predictions considerably lowers the capacity of the model to learn high-order relationships, since the result is not end-to-end trainable. One solution is to replicate graphical model inference in a connectionist fashion, which makes joint optimization possible. Jointly training CNNs and CRFs has been previously studied in different problems [2, 3, 27]. Following this trend, in this work we formulate AU recognition as a graphical model and implement it with neural networks, more specifically CNNs and a recurrent neural network (RNN). This way, AU predictions from local regions, along with AU correlations, are learned end-to-end.

Let \(\mathcal {G}=(\mathcal {V},\mathcal {E})\) denote a graph with vertices \(\mathcal {V}=\mathbf {y}\) specifying AUs and edges \(\mathcal {E}\subseteq \mathcal {V}\times \mathcal {V}\) indicating the relationships between AUs. Given the Gibbs distribution we compute conditional probability \( P (\mathbf {y}|\mathbf {x},\varTheta )\) as:

$$\begin{aligned} P (\mathbf {y}|\mathbf {x},\varTheta )=\frac{1}{Z(\mathbf {y},\mathbf {x},\varTheta )}e^{-E(\mathbf {y}|\mathbf {x},\varTheta )}, \end{aligned}$$
(1)

where \(\varTheta \) are model parameters, Z is a normalization function and E is an energy function. The model can be updated by introducing latent variables \(\mathbf {p}\) as:

$$\begin{aligned} P (\mathbf {y}|\mathbf {x},\varTheta )=\sum _\mathbf {p} P (\mathbf {y},\mathbf {p}|\mathbf {x},\varTheta ), \end{aligned}$$
(2)

where \(\mathbf {p}\) is given as the output of the CNNs. The vertices and edges in the graph \(\mathcal {G}\) can be updated as \(\mathcal {V}=\mathbf {y} \cup \mathbf {p}\) and \(\mathcal {E}=\mathcal {E}_y \cup \mathcal {E}_{py} \cup \mathcal {E}_p\). Although the edges \(\mathcal {E}_y\) could be defined by prior knowledge taken from a given dataset, we use a fully connected graph independent of the dataset and assign a mutual gating strategy to control the information passing through the edges (more details in Sect. 3.3). We define \(\mathcal {E}_{py}\) as the edges between \(\mathbf {p}\) and \(\mathbf {y}\), and use a selective strategy to define the edges in this set. Finally, the edge set \(\mathcal {E}_p\) is empty, since in our model an independent CNN is trained on each image patch \(I_j\) and we do not assign any edges among \(\mathbf {p}\). Given this assumption, the probability distribution \( P (\mathbf {y},\mathbf {p}|\mathbf {x},\varTheta )\) is given by:

$$\begin{aligned} P (\mathbf {y},\mathbf {p}|\mathbf {x},\varTheta )= P (\mathbf {y}|\mathbf {p},\mathbf {x},\varTheta ) \prod _{k} P (p_k|\mathbf {x},\varTheta ). \end{aligned}$$
(3)

As in CRF, energy function E(.) is computed by unary and pairwise terms as:

$$\begin{aligned} \small E(\mathbf {y},\mathbf {p},\mathbf {x},\varTheta )=\sum _k \varphi _p(p_k,\mathbf {x},\pi ) + \sum _{(i,k)\in \mathcal {E}_{py}} \psi _{py}(y_i,p_k,\phi ) + \sum _{(i,j)\in \mathcal {E}_y} \psi _y(y_i,y_j,\omega ), \end{aligned}$$
(4)

where \(\varphi _p(.)\) is a unary term, \(\psi _*(.)\) are pairwise terms and \(\varTheta =\pi \cup \phi \cup \omega \). Figure 2 presents our Deep Structure Inference Network (DSIN). It consists of three components, each designed to solve a term in Eq. 4. We refer to the initial part as Patch Prediction (PP), whose purpose is to exhaustively learn deep local representations from facial patches and produce local predictions. Then, the Fusion (F) module performs patch learning per AU. The final stage, Structure Inference (SI), refines the AU predictions by capturing relationships between AUs. The DSIN is end-to-end trainable and the CNN features can be trained based on gradients back-propagated from structure inference in a multi-task learning fashion.
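As an overview, the following PyTorch-style sketch shows one way the three modules could be chained; the class and argument names are ours and the submodules are placeholders for the components detailed in Sects. 3.1-3.3.

```python
import torch
import torch.nn as nn

class DSIN(nn.Module):
    """Sketch of the three-stage pipeline of Fig. 2 (names are illustrative)."""
    def __init__(self, patch_cnns, fusion_units, structure_inference):
        super().__init__()
        self.patch_cnns = nn.ModuleList(patch_cnns)       # one CNN per patch (PP)
        self.fusion_units = nn.ModuleList(fusion_units)   # one small MLP per AU (F)
        self.structure_inference = structure_inference    # recurrent SI module

    def forward(self, patches):
        # patches: list of P cropped regions, one per patch CNN
        p = torch.stack([cnn(x) for cnn, x in zip(self.patch_cnns, patches)], dim=-1)  # (B, N, P)
        # fuse the P per-patch predictions of each AU j into a single score f_j
        f = torch.cat([unit(p[:, j, :]) for j, unit in enumerate(self.fusion_units)], dim=1)
        # refine the fused predictions by iterative message passing between AUs
        y_hat, chi = self.structure_inference(f)
        return p, f, y_hat, chi
```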

Fig. 2. The Deep Structure Inference Network (DSIN) learns independent AU predictions from globally and locally learned features and refines each AU prediction by taking into account its correlation with the other AUs. Each input image is cropped into a set of patches \(\{I_i\}_{i=1}^{P}\), each of which is used to train an independent CNN producing a probability vector \(p_i\) over the N AUs (\(\varphi _p\) in Eq. 4). From \(s_j\) (the patch predictions for a specific AU) we learn a combination producing a single AU prediction \(f_j\) (simplified \(\psi _{py}\) in Eq. 4). Final predictions \(y_j\) are computed by inferring structure among AUs through iterative message passing, similar to inference in a probabilistic graphical model (\(\psi _y\) in Eq. 4).

3.1 Patch Prediction

Given image patches \(\mathbf {x}\), the unary terms \(\varphi _p(\mathbf {p},\mathbf {x},\pi )\) provide AU confidences for each patch, defined as the log probability:

$$\begin{aligned} \varphi _p(\mathbf {p},\mathbf {x},\pi )=\log P (\mathbf {p}|\mathbf {x},\pi ). \end{aligned}$$
(5)

Probability \( P (\mathbf {p}|\mathbf {x},\pi )\) is modeled by independent patch prediction functions \(\{\varPi _i(I_i;\pi _i)\}_{i=1}^{P}\), where \(I_i\) is the input image patch and \(\pi _i\) are the function parameters. Each \(\varPi _i\) is a CNN computing N AU probabilities through a sigmoid function at the last layer. P independent predictions are provided at this stage, each being a vector of AU predictions. Although image patches may overlap, we assume independence to let each network become an expert at predicting AUs on its local region. By learning independent global and local representations, we can better capture facial morphology and address AU locality.

Fig. 3. (a) Topology of the patch prediction CNNs. Each convolutional block has stride 2 and batch normalization; the number of filters followed by the kernel size is marked. The last layers are fully-connected (FC) layers marked with the number of neurons. All neurons use ReLU activations. (b) Each fusion unit is a stack of 2 FC layers. (c) A structure inference unit. For better visualization, we only show the interface of the unit without its inner topology. See details in Sect. 3.3.

In Fig. 3(a) we detail the topology of the CNNs used for learning the patch prediction functions. Many complex topologies have been proposed in recent years and searching for the best one is out of the scope of this work. The chosen topology, a shallow network, follows the intuition behind well-known models like VGG [16].
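For concreteness, a minimal sketch of one such patch CNN is given below; the stride-2 convolutions, batch normalization, ReLU activations and sigmoid outputs follow Fig. 3(a), while the filter counts and FC widths are illustrative placeholders rather than the exact values marked in the figure.

```python
import torch.nn as nn

def conv_block(in_ch, out_ch, k):
    # every convolutional block uses stride 2 and batch normalization (Fig. 3(a))
    return nn.Sequential(
        nn.Conv2d(in_ch, out_ch, kernel_size=k, stride=2, padding=k // 2),
        nn.BatchNorm2d(out_ch),
        nn.ReLU(inplace=True),
    )

class PatchCNN(nn.Module):
    """Shallow VGG-like patch predictor Pi_i (filter counts are illustrative)."""
    def __init__(self, n_aus=12, in_ch=3):
        super().__init__()
        self.features = nn.Sequential(
            conv_block(in_ch, 32, 3),
            conv_block(32, 64, 3),
            conv_block(64, 128, 3),
        )
        self.classifier = nn.Sequential(
            nn.Flatten(),
            nn.LazyLinear(256),
            nn.ReLU(inplace=True),
            nn.Linear(256, n_aus),
            nn.Sigmoid(),            # per-AU probabilities p_i (Sect. 3.1)
        )

    def forward(self, x):
        return self.classifier(self.features(x))
```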

3.2 Fusion

The computational complexity of marginalizing the pairwise relationships in \(\mathcal {E}_{py}\) is high. In our formulation, we simplify the edges such that \(\mathcal {E}_{py}\) becomes directed from nodes in \(\mathbf {p}\) to nodes in \(\mathbf {y}\). This means we omit mutual relationships between \(\mathbf {p}\) and \(\mathbf {y}\); the nodes in \(\mathbf {y}\) are conditioned on the nodes in \(\mathbf {p}\). However, we want each AU node in \(\mathbf {y}\) to be conditioned on the same AU nodes in \(\mathbf {p}\) from the different patches. This way, different patches can provide complementary information for predicting the target AU, independently of the other AUs. Finally, \(\psi _{py}(\mathbf {y},\mathbf {p},\phi )\) is defined as the log probability of \( P (\mathbf {y}|\mathbf {p},\phi )\), which is modeled by a set of independent functions, so-called fusion functions \(\{\varPhi _j(s_j;\phi _j)\}_{j=1}^N\), where \(s_j\subset \mathbf {p}\) corresponds to the set of j-th AU predictions from all patches and \(\phi _j\) are the function parameters. We simply model each function \(\varPhi _j\) with 2 fully connected layers with 64 hidden units, each followed by a sigmoid layer, as shown in Fig. 3(b). We found that 64 hidden units work well in practice, while a higher dimensionality does not bring additional performance and quickly starts over-fitting. The output of each \(\varPhi _j\) is the predicted probability \(f_j\) for the j-th AU.
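A minimal sketch of one fusion unit \(\varPhi _j\) follows; the assumption that each AU receives 6 patch-level scores (the whole face plus the 5 cropped patches) is ours for illustration.

```python
import torch.nn as nn

class FusionUnit(nn.Module):
    """One fusion function Phi_j (Fig. 3(b)): combines the j-th AU predictions
    s_j from all patch predictors into a single probability f_j."""
    def __init__(self, n_patches, hidden=64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(n_patches, hidden), nn.Sigmoid(),
            nn.Linear(hidden, 1), nn.Sigmoid(),
        )

    def forward(self, s_j):           # s_j: (batch, n_patches) scores for AU j
        return self.net(s_j)          # f_j: (batch, 1)

# one independent unit per AU; 6 inputs per AU is an assumption for illustration
fusion_units = nn.ModuleList(FusionUnit(n_patches=6) for _ in range(12))
```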

3.3 Structure Inference

Up to now, we computed individual AU probabilities in a feed-forward neural network without explicitly taking AU relationships into account. The goal is to model the pairwise terms \(\psi _y\) such that the whole process is end-to-end trainable in a compact way. Belief propagation, i.e. message passing between nodes, is one of the best-known algorithms for PGM inference. Inspired by [3], which proposes a connectionist implementation for action recognition, we build a Structure Inference (SI) module in the final part of DSIN.

The SI module updates each AU prediction in an iterative manner by taking into account information from the other AUs. The intuition behind this is that by passing information between predictions in an explicit way, we can capture AU correlations and improve predictions. The structure inference module is a collection of interconnected recurrent structure inference units (SIU) (see Fig. 3(c)). For each AU there is a dedicated SIU. We denote the computation done by an SIU by a function \(\varOmega \). Let \(\{\varOmega _j\}_{j=1}^N\) be the set of SIU functions \(\varOmega _j:\mathbb {R}^{N+2} \rightarrow \mathbb {R}^2\) where:

$$\begin{aligned} \hat{y}_j^{t}, m_j^{t} = \varOmega _j(f_j, m_1^{t-1}, m_2^{t-1}, ..., m_N^{t-1}, \hat{y}_j^{t-1};\omega _j). \end{aligned}$$
(6)

At each iteration t, \(\varOmega _j\) takes as input the initial prediction \(f_{j}\) for its class, the set of incoming messages \(\{m_i^{t-1}\}_{i=1}^{N}\) from the SIUs corresponding to the other classes, and its own previous prediction \(\hat{y}_j^{t-1}\). Each function \(\varOmega _j\) has two inline units: one producing the j-th AU prediction \(\hat{y}_j^{t}\) and one producing the message \(m^{t}_{j}\) for the next time step. In this way, predictions are improved iteratively by receiving information from the other nodes. Computationally, we replicate this iterative message passing mechanism in the collection of SIUs with a recurrent neural network that shares the parameters \(\omega _j\) across all time steps. We show an SIU in Fig. 3(c).

A message unit essentially corresponds to the distribution of its AU node. The message produced by an SIU is a parametrized function of the previous messages, the initial fused prediction and the previous prediction of the same SIU:

$$\begin{aligned} m_{j}^{t} = \sigma \left( \omega _{j}^{m}\left[ \mu (m_{1}^{t-1}, ..., m_{N}^{t-1}), f_j, \hat{y}_j^{t-1}\right] + \beta _{j}^{m} \right) , \end{aligned}$$
(7)

where \(\sigma (.)\) is the sigmoid function, \(\mu (.)\) is the mean function, \(\omega _{j}^{m}\in \mathbb {R}^3\) and \(\beta _{j}^{m}\in \mathbb {R}\) are message function parameters. Messages between two nodes at each time step have a mutual relationship which can be controlled by a gating strategy. Therefore, a set of correction factors are computed as:

$$\begin{aligned} \chi _j^{t} = \sigma \left( \omega _{j}^{g} \left[ \mu (m_{1}^{t}, ..., m_{N}^{t}), f_j, \hat{y}_j^{t-1}\right] + \beta _j^g \right) , \end{aligned}$$
(8)

where \(\omega _{j}^{g}\in \mathbb {R}^3\) and \(\beta _{j}^{g}\in \mathbb {R}\) are gating function parameters. Then, a message \(m_{i\rightarrow j}^{t}\) that is passed from AU node i to j will be updated by the mutual factors of the gate between nodes i and j as:

$$\begin{aligned} \overline{m}^{t}_{j} = \mu (\chi _{i}^{t}, \chi _{j}^{t}) m_{i\rightarrow j}^{t}. \end{aligned}$$
(9)

Finally, updated messages coming to the j-th node along with initial estimation \(f_j\) are used to produce output prediction \(\hat{y}_j^{t}\) as:

$$\begin{aligned} \hat{y}_j^{t} = \sigma \left( \omega _{j}^y \left[ \mu (\overline{m}_{1}^{t}, ..., \overline{m}_{N}^{t}), f_j\right] + \beta _j^{y} \right) , \end{aligned}$$
(10)

where \(\omega _{j}^{y}\in \mathbb {R}^2\) and \(\beta _{j}^{y}\in \mathbb {R}\) are prediction function parameters. By doing this, we are able to combine representation learning in the functions \(\varPi \), patch learning in the functions \(\varPhi \) and structure inference in the functions \(\varOmega \) in a single end-to-end trainable model. We introduce our training strategy in Sect. 4.1.
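A sketch of the structure inference module implementing Eqs. 7-10 is given below; initializing the messages with f and including a node's own message in the means are simplifying assumptions of this sketch, not specifications from the text.

```python
import torch
import torch.nn as nn

class StructureInference(nn.Module):
    """Sketch of the recurrent structure inference (Eqs. 7-10). Per AU j:
    w_m, w_g in R^3 and w_y in R^2 plus scalar biases, shared across the
    T message-passing iterations."""
    def __init__(self, n_aus, n_iters=10):
        super().__init__()
        self.n_iters = n_iters
        self.w_m = nn.Parameter(0.1 * torch.randn(n_aus, 3))   # message weights (Eq. 7)
        self.b_m = nn.Parameter(torch.zeros(n_aus))
        self.w_g = nn.Parameter(0.1 * torch.randn(n_aus, 3))   # correction-factor weights (Eq. 8)
        self.b_g = nn.Parameter(torch.zeros(n_aus))
        self.w_y = nn.Parameter(0.1 * torch.randn(n_aus, 2))   # prediction weights (Eq. 10)
        self.b_y = nn.Parameter(torch.zeros(n_aus))

    def forward(self, f):
        # f: (B, N) fused per-AU probabilities; y and m start from f (assumption)
        y, m = f, f
        for _ in range(self.n_iters):
            # Eq. 7: outgoing message of every AU node
            m = torch.sigmoid(self.w_m[:, 0] * m.mean(dim=1, keepdim=True)
                              + self.w_m[:, 1] * f + self.w_m[:, 2] * y + self.b_m)
            # Eq. 8: correction factors gating the messages
            chi = torch.sigmoid(self.w_g[:, 0] * m.mean(dim=1, keepdim=True)
                                + self.w_g[:, 1] * f + self.w_g[:, 2] * y + self.b_g)
            # Eq. 9: message i -> j scaled by the mean of the two correction factors;
            # Eq. 10 then averages the gated incoming messages of every node j
            gated = 0.5 * (chi.unsqueeze(1) + chi.unsqueeze(2)) * m.unsqueeze(2)  # (B, i, j)
            y = torch.sigmoid(self.w_y[:, 0] * gated.mean(dim=1)
                              + self.w_y[:, 1] * f + self.b_y)
        return y, chi
```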

4 Experimental Analysis

In the following, we describe experimental settings and results.

4.1 Experimental Setting

Data. We used the BP4D [24] and DISFA [13] datasets. BP4D contains 2D and 3D videos of 41 young adults. It has 328 videos (8 videos for each of the 41 participants) with 12 coded AUs, resulting in about 140k valid face images [24]. DISFA contains 27 adults (12 women and 15 men) with ages between 18 and 50 years and relative ethnic diversity. The data corpus consists of approximately 130k frames in total. AU intensity is coded for each video frame on a 0 (not present) to 5 (maximum intensity) ordinal scale. For our purpose we consider all labels with an intensity greater than 3 as active and the rest as non-active. Both datasets are widely used in most recent AU recognition works.

Fig. 4. Each input image is aligned and cropped into 5 patches.

Preprocessing. For each image, the facial geometry is estimated using [10]. From all neutral faces we compute 3 reference anchors as the means of the eye and mouth landmarks. Faces are resized to \(224 \times 224 \times 3\) and a rigid transformation is applied to register them to the anchors, reducing variance in scale and rotation. We crop 5 patches of size \(56 \times 56 \times 3\) around points defined by the detected landmarks (see Fig. 4). To reduce redundancy, we ignore the corresponding symmetrical patches, like the left eye and left cheek.
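A sketch of this preprocessing step is shown below, assuming OpenCV and a 68-point landmark convention; the landmark indices and patch centers are rough placeholders, while Fig. 4 shows the actual patch locations.

```python
import cv2
import numpy as np

def align_and_crop(img, landmarks, anchors, face_size=224, patch_size=56):
    """Register the face to the 3 reference anchors (eye and mouth centers)
    with a similarity transform, then crop patches around landmark-defined
    points. Landmark indices are placeholders for the detector's convention."""
    src = np.float32([landmarks[36:42].mean(axis=0),   # left eye center
                      landmarks[42:48].mean(axis=0),   # right eye center
                      landmarks[48:68].mean(axis=0)])  # mouth center
    M, _ = cv2.estimateAffinePartial2D(src, np.float32(anchors))
    face = cv2.warpAffine(img, M, (face_size, face_size))
    warped = landmarks @ M[:, :2].T + M[:, 2]          # landmarks in the aligned face

    def crop(c):
        x, y = int(c[0]) - patch_size // 2, int(c[1]) - patch_size // 2
        return face[y:y + patch_size, x:x + patch_size]

    # one plausible set of 5 patch centers (right eye, between the eyes, nose,
    # mouth, right cheek); symmetric left-side patches are ignored
    centers = [warped[42:48].mean(axis=0), warped[27], warped[30],
               warped[48:68].mean(axis=0), warped[35]]
    return face, [crop(c) for c in centers]
```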

Training. We incrementally train each part of DSIN before training the model end-to-end. During training we use supervision on the patch predictions p, the fusion outputs f and the structure inference outputs \(\hat{y}\). On p we use a weighted \(L_2\) loss denoted by \(L_{\varPi }(p,y)\). The weights are inversely proportional to the ratio of positives in the total number of observations for each AU class in training. The weighting gives more importance to the minority classes in each training batch, which ensures more balanced gradient updates across classes and better overall performance. On the fusion and structure inference outputs we apply a binary cross-entropy loss (denoted by \(L_\varPhi (f,y)\) and \(L_\varOmega (\hat{y},y)\)). For the structure inference we include a regularization on the correction factors (denoted by \(\chi \) in Eqs. 8 and 9) to force sparsity in the message passing. Details of the training procedure are shown in Algorithm 1. We use an Adam optimizer with a learning rate of 0.001 and mini-batch size 64, with early stopping. Experimentally, we found the individual loss contributions \(w_1=0.25\), \(w_2=0.25\) and \(w_3=0.5\) to work well in training. For both datasets we perform a subject-exclusive 3-fold cross-validation. Similarly to [12], on DISFA we take the best CNNs trained for patch prediction on BP4D and retrain the fully connected layers for the new set of outputs, keeping the convolutional filters fixed throughout the rest of the training.
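The combined objective can be sketched as follows; the tensor shapes, the class-weighted \(L_2\) form and the \(L_1\) penalty on \(\chi \) are plausible instantiations of the description above rather than the exact implementation.

```python
import torch
import torch.nn.functional as F

def dsin_loss(p, f, y_hat, chi, targets, class_w, w=(0.25, 0.25, 0.5), r=5e-3):
    """Sketch of the multi-task loss. p: (B, N, P) per-patch AU predictions,
    f, y_hat: (B, N) fusion and structure inference outputs, chi: correction
    factors, targets: (B, N) binary labels (float), class_w: (N,) weights
    inversely proportional to each AU's positive ratio."""
    l_pi = (class_w.view(1, -1, 1) * (p - targets.unsqueeze(-1)) ** 2).mean()
    l_phi = F.binary_cross_entropy(f, targets)
    l_omega = F.binary_cross_entropy(y_hat, targets)
    return w[0] * l_pi + w[1] * l_phi + w[2] * l_omega + r * chi.abs().mean()
```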

Methods and Metrics. We compare against the state-of-the-art alternatives CPM [22], APL [28], JPML [25], DRML [26] and ROI [12]. We evaluate the F1-frame score \(F1 = 2\frac{PR}{P+R}\), where \(P=\frac{tp}{tp+fp}\), \(R=\frac{tp}{tp+fn}\), with tp the true positives, fn the false negatives and fp the false positives. All metrics are computed per AU and then averaged. The targeted AUs are shown in Fig. 6.
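A small helper computing this averaged per-AU F1, assuming scikit-learn, could look as follows.

```python
import numpy as np
from sklearn.metrics import f1_score

def mean_f1(y_true, y_pred):
    """Per-AU F1 averaged over AUs (the F1-frame score used in the evaluation).
    y_true, y_pred: binary arrays of shape (n_frames, n_aus)."""
    return np.mean([f1_score(y_true[:, j], y_pred[:, j])
                    for j in range(y_true.shape[1])])
```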

Algorithm 1. Incremental training procedure of DSIN.

4.2 Results

In the following, we explore the effect of the design decisions included in the DSIN, followed by a comparison against state-of-the-art alternatives and qualitative examples.

Ablation Study. We analyze DSIN design decisions in the following.

Class Balancing. In both datasets, the classes are strongly imbalanced, which can be harmful during training. To alleviate this, we use a weighted loss on the patch prediction CNNs. Table 2 shows results with and without class balancing. Balancing improves overall performance, especially on poorly represented classes. On BP4D the classes with ratios of positives lower than \(30\%\) of the total samples are AU01, AU02, AU04, AU17 and AU24; these are the classes that improve the most. AUs like AU07 or AU12 have positive-to-total ratios higher than \(50\%\), and balancing can reduce performance on these classes.

Table 2. Recognition results on BP4D. PP([patch]) stands for patch prediction on the indicated patch, F for the fusion, and DSIN is the final model. We indicate training on individual AUs with [method]\(^{ind}\), tuning of the decision threshold on the validation set with DSIN\(^{tt}\), the number of structure inference iterations with DSIN\(_{T}\), and training without correction factors with DSIN\(^{ncf}\). VGG(face)\(^{ft}\) is a pre-trained VGG-16 [14] fine-tuned on BP4D. PP(face)\(^{ncb}\) is patch prediction without class balancing. All results are obtained by 3-fold cross-validation on BP4D.

Choice of Prediction Topology. In Table 2 we compare the proposed CNN for patch prediction (PP(face)) against VGG-16. The VGG-16 model used was trained for face recognition [14] and fine-tuned on our data for AU recognition. Our model shows superior performance.

Targeting Subsets of AUs. We explore the effect of the considered target set on the overall prediction performance. In Table 2 we show prediction results from the right eye and from the mouth patches when training either on the full set of targets ([method]) or on individual targets (\([method]^{ind}\)). When training on individual AUs the decision for the classifier is simpler; on the other hand, any correlation information between classes that could be captured by the FC layers is ignored. In certain cases the individual prediction is superior to the exhaustive prediction. In the case of the right eye patch this is particularly true for AU01, but this is rather the exception. On average and across patches, training on groups of AUs or on all AUs is beneficial, as correlation information between classes is employed by the network in the fully connected layers. Additionally, predicting AUs individually with independent nets would quickly increase the number of parameters, with considerable effects on the training speed and final model performance.

Tables 2 and 3 show AU recognition results on both datasets when training on patches, which supports the locality assumption. When training on the mouth, the performance on the upper-face AUs is greatly affected. Similarly, training on the eye affects the performance on the lower-face AUs. This is expected, as the patch predictor can only infer the other AUs from the ones visible in the patch.

Learning Local Representations. On average, face prediction performs better than patch prediction on the entire output set. However, when individual AUs are considered, this is no longer the case. For BP4D, the performance on AU15 and AU24 is considerably higher when predicting from the mouth patch than from the face (see Table 2). On DISFA the prediction from the whole face is the best on just 3 AUs (see Table 3). The nose patch is better for predicting AU06 and AU09, the mouth patch is better for AU12, AU25 and AU26, and the between-eyes patch for AU01.

Fig. 5. Different levels of regularization on the mean \(\mu (\chi )\) (white line) and standard deviation \(\sigma (\chi )\) (envelope) of the correction factors during training. Small regularization values force the correction factors to diverge faster. Increasing regularization collapses the correction factors, hurting the message passing.

Patch Learning. Tables 2 and 3 show results of AU-wise fusion for BP4D and DISFA (PP + F). On both, patch learning through fusion is beneficial, but on DISFA the benefits are higher. This might be due to the fact that prediction results on DISFA are considerably more balanced across patches. Overall, on BP4D the fusion improves results on almost all AUs compared to face prediction. This shows that even though the other patches perform worse on certain classes, there is structure to learn from their predictions that helps to improve performance. However, the fusion is not capable of replicating the result of the mouth prediction on AU14. On DISFA, in almost every case fusion gets close to or surpasses the best patch prediction. In both cases, fusion has more trouble improving over individual patches when the input predictions are already very noisy.

Table 3. Results of DSIN on DISFA. PP([patch]) stands for patch prediction on the indicated patch. F stands for the fusion. DSIN is the final model. For DISFA we only show the DSIN with \(T=10\), the best performing on BP4D.

Structure Learning. Tables 2 and 3 show results of the final DSIN model. For BP4D, we also perform a study of the number of iterations T used for structure inference. Since the parameters \(\omega _j\) are shared across iterations, more iterations are beneficial to capture AU relationships in a fully connected graph with a large number of nodes (12 in our case). We also trained DSIN without correction factors (Eq. 9 is not applied in this case); results are inferior compared with the same model with correction factors. In the case of DISFA, we only applied the structure inference with the best previously found \(T=10\) steps. Structure inference is beneficial in both cases. On BP4D, it considerably improves AU02 and AU14. For DISFA, the results are even more conclusive: adding the structure inference brings more than 5% improvement over the fusion.

Correction Factor Regularization. Figure 5 shows the effect of increasing the regularization applied on the correction factors \(\chi \). Overall, regularizing \(\chi \) does not bring significant benefits. When comparing \(r=10^{-2}\) with no regularization, the differences are minimal: the network has the ability to learn sparse message passing by itself without regularization. Still, small values of r lead to faster divergence of \(\chi \) and faster convergence of the network, with no significant difference in performance. On the other hand, values of \(r>5 \times 10^{-2}\) negatively affect performance, as most of the \(\chi \) get close to 0 and no messages are passed anymore. For these reasons, we keep \(r=5 \times 10^{-3}\).

Fig. 6. Facial action units targeted in this work.

Fig. 7. \(\tau \) vs. AU performance on the BP4D validation set. Black circles denote the best score.

Threshold Tuning. The prediction for each AU takes values between 0 and 1. In all results, we compute the performance by binarizing the output with respect to a threshold \(\tau =0.5\). Although class balancing through a weighted loss is beneficial, it does not completely solve the data imbalance. Figure 7 shows performance as a function of \(\tau \) on the validation set of BP4D. As shown, a threshold \(\tau =0.5\) is not ideal; for most classes \(\tau \in [0.1,0.3]\) is preferable, with AU04 being the exception. Tables 2 and 3 show the performance of the proposed model after tuning \(\tau \) per class (DSIN\(^{tt}\)). This way, \(2.8\%\) and \(3.1\%\) of performance is gained on BP4D and DISFA, respectively.
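The per-class tuning can be sketched as a simple grid search on the validation set; the grid below is an assumption.

```python
import numpy as np
from sklearn.metrics import f1_score

def tune_thresholds(y_true, y_prob, grid=np.arange(0.05, 1.0, 0.05)):
    """Pick, for each AU, the threshold tau maximizing validation F1
    (the DSIN^tt variant). y_true: binary labels, y_prob: predicted
    probabilities, both of shape (n_frames, n_aus)."""
    taus = []
    for j in range(y_true.shape[1]):
        scores = [f1_score(y_true[:, j], (y_prob[:, j] >= t).astype(int)) for t in grid]
        taus.append(grid[int(np.argmax(scores))])
    return np.array(taus)
```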

Table 4. AU recognition results on BP4D. Best results are shown in bold. Second best results are shown in brackets. For the proposed model we show an additional set of results (DSIN\(^{tt}\)) obtained when the decision threshold is tuned per AU.

Comparison with State-of-the-Art. Tables 4 and 5 show how our model compares against the state-of-the-art related methods on BP4D and DISFA, respectively. DSIN and ROI are the best performing on both datasets. Both methods learn deep local representations and patch combinations end-to-end. The worst performing methods, JPML on BP4D and APL on DISFA, use predefined features and are not end-to-end trained. Comparing DSIN and ROI with DRML, one can observe the advantage of learning independent local representations. Both ROI and our model learn independent local representations, while DRML disentangles the representation learning in just one layer of its network. There is, however, an exception: on BP4D, CPM performs slightly better than DRML even though it is not a deep learning method. When comparing our proposed model with ROI on BP4D, our CNN trained just on the face without class balancing obtains inferior results. When we include class balancing and patch learning, our topology improves performance, further enhanced by structure inference and end-to-end final training. In the case of DISFA, a single CNN trained on the whole face with class balancing reaches 43.9, which is \(4.6\%\) lower than ROI. When we add patch prediction fusion (PP + F) we are just \(0.5\%\) below ROI, while adding the structure inference and threshold tuning surpasses ROI. Finally, DSIN shows the best results on both datasets. For BP4D, out of the 12 targeted AUs it performs best on 5 and second best on another 5. In the case of DISFA the improvement over ROI is greater, with DSIN performing best on all but one AU. Overall, we obtain a 5.3% absolute (9.4% relative) performance improvement on BP4D and an 8.2% absolute (16.9% relative) improvement on DISFA.

Table 5. AU recognition results on DISFA. Best results are shown in bold. Second best results are shown in brackets.
Fig. 8. (a) Examples of AU predictions: ground-truth (top), fusion module (middle) and structure inference (bottom) predictions; colors distinguish true positives from false positives. (b) AU correlations in BP4D; colors distinguish positive from negative correlations and line thickness is proportional to the correlation magnitude. (c) Class activation map for AU24 showing the discriminative regions of simple patch prediction (left) and DSIN (right). Best seen in color.

Qualitative Results. Figure 8(a) shows examples of how structure inference tends to correct predictions following AU correlations. We show the magnitude of the AU correlations on BP4D in Fig. 8(b). In the first 3 column examples, AU06 and AU07 are not correctly classified by the fusion model (middle row). Both these AUs are highly correlated with already detected AUs like AU10, AU12 and AU14; such correlations can be captured by the SI (bottom row). The rightmost example shows how AU17, a false positive, is corrected. As shown in Fig. 8(b), AU17 is negatively correlated with AU04, which was already detected. In Fig. 8(c) we show a class activation map [15] for AU24 of the patch prediction (left) vs. the DSIN (right). Contrary to the very localized patch prediction, the attention of DSIN expands to a larger area of the face where possibly correlated AUs might exist.

5 Conclusion

We proposed the Deep Structure Inference Network, designed to deal with both patch and structure learning for AU recognition. DSIN first learns independent local and global representations and the corresponding predictions. Then, it learns relationships between predictions per AU through stacked fully connected layers. Finally, inspired by inference algorithms in graphical models, DSIN replicates a message passing mechanism in a connectionist fashion, adding the ability to capture correlations in the output space. The model is end-to-end trainable and improves state-of-the-art results by 5.3% and 8.2% on the BP4D and DISFA datasets, respectively. Future work includes learning patch structure at the feature level and a structure inference module with increased capacity.