Spatio-temporal Pain Recognition in CNN-Based Super-Resolved Facial Images

  • Marco Bellantonio
  • Mohammad A. Haque
  • Pau Rodriguez
  • Kamal Nasrollahi
  • Taisi Telve
  • Sergio Escalera
  • Jordi Gonzalez
  • Thomas B. Moeslund
  • Pejman Rasti
  • Gholamreza Anbarjafari
Conference paper

DOI: 10.1007/978-3-319-56687-0_13

Part of the Lecture Notes in Computer Science book series (LNCS, volume 10165)
Cite this paper as:
Bellantonio M. et al. (2017) Spatio-temporal Pain Recognition in CNN-Based Super-Resolved Facial Images. In: Nasrollahi K. et al. (eds) Video Analytics. Face and Facial Expression Recognition and Audience Measurement. FFER 2016, VAAM 2016. Lecture Notes in Computer Science, vol 10165. Springer, Cham


Automatic pain detection is a long-awaited solution to the prevalent medical problem of pain management. It is especially relevant when the subjects are young children or patients with limited ability to communicate their pain experience. Computer vision-based analysis of facial pain expressions provides an efficient way of detecting pain, and the advent of deep learning methods has improved detection performance further. In this paper, we identify three important factors to exploit in automatic pain detection: the spatial information about pain available in each facial video frame, the temporal information about the pain expression pattern across a subject's video sequence, and the variation of face resolution. We employ a combination of a convolutional neural network and a recurrent neural network to set up a deep hybrid pain detection framework that exploits both spatial and temporal pain information from facial video. To analyze the effect of different facial resolutions, we use a super-resolution algorithm to generate facial video frames with different resolution setups. We evaluate the framework on the publicly available UNBC-McMaster Shoulder Pain database. As a contribution, the paper provides new information on the performance of a hybrid deep learning framework for pain detection in facial images of different resolutions.


Keywords: Super-Resolution, Convolutional Neural Network (CNN), Recurrent Neural Network (RNN), Pain detection

1 Introduction

Pain is a prevalent medical problem that manifests as an unpleasant experience and needs to be managed effectively as a moral and professional responsibility [5]. Traditionally, pain is measured by ‘self-report’. However, self-reported pain level assessment requires cognitive, linguistic and social competencies of the affected person. These requirements make self-report infeasible for young children and patients with limited ability to communicate [37]. Thus, the notion of computer vision-based automatic pain level assessment was introduced [31, 32].

Facial pain expression can be considered a subset of facial expression that conveys the emotional state associated with experiencing pain [2]. It can also provide information about the severity of pain, which can be assessed by using the Facial Action Coding System (FACS) [6, 52]. FACS has long been used to measure the appearance and intensity of facial expressions. Vision-based approaches therefore emerged to measure pain using features of facial appearance change. Prkachin first reported the consistency of facial pain expressions across different pain modalities in [45] and then, together with Solomon, developed a FACS-based pain metric called the Prkachin and Solomon Pain Intensity (PSPI) scale in [47].

The task of assessing the pain level from a facial image or video is rather challenging. A substantial body of literature has been produced in recent years to address the challenges [3, 10, 29, 46, 48]. A glimpse of why pain level detection is difficult can be found in Fig. 1 [14]. From the facial images in the figure, we can see that the pain and non-pain frames may not present enough visual difference, even though the self-reports distinguish clearly between pain and non-pain status. The challenges also increase in the presence of external factors such as the ‘smiling in pain’ phenomenon and gender differences (male vs. female ways of experiencing pain) [28, 30, 53]. This in turn results in non-linearly warped facial emotion levels in a high-dimensional space [51].
Fig. 1.

Pain and non-pain facial expressions are sometimes very difficult to distinguish visually. Examples from the UNBC-McMaster shoulder pain database [38]. The pain frames are on the left and the non-pain frames are on the right.

Recent advances in facial video analysis using deep learning frameworks such as Convolutional Neural Networks (CNN) or Deep Belief Networks (DBN) provide a means of modeling non-linear high-dimensional compositions [49]. Deep learning architectures have been widely used in face recognition [18, 35, 43, 55], facial expression recognition [24, 56], and emotion detection [21, 23, 49]. Pain level estimation using a deep learning framework has also been proposed [57]. Employing a deep learning framework for pain level assessment from facial video entails processing two kinds of information from facial video sequences: (i) spatial information and (ii) temporal information. Spatial information captures the pain-related content of the facial expression in a single video frame. Temporal information, on the other hand, captures the relationship between the pain expressions revealed in consecutive video frames.

While exploring spatial and temporal information from facial images, face quality (e.g. low face resolution) can also play an important role, as studied in [11, 12, 13]. The first limitation on image resolution is imposed by the image acquisition devices or imaging sensors [41]. The spatial resolution of a captured image is determined by the sensor size or the number of sensor elements. Hence, one straightforward way to increase the spatial resolution of an imaging system is to increase the sensor density by reducing the sensor size. However, as the sensor size decreases, the amount of light incident on each sensor also decreases, which causes shot noise [41]. The hardware cost of a sensor also rises with increasing sensor density, or equivalently with increasing image pixel density. The other approach to enhancing face resolution is to apply signal processing tools. One well-known technique is Super Resolution (SR). The basic idea behind SR methods is to obtain a high resolution (HR) image from one or more low resolution (LR) images [42, 54]. Huang and Tsai [17], pioneers of SR, proposed a method to improve the spatial resolution of satellite images of the earth, for which a large set of translated images of the same scene is available. They showed that, with multiple offset images of the same scene and proper registration, better restoration can be achieved than with spline interpolation. Since then, SR methods have become common practice in many applications in different fields, such as remote sensing [34], surveillance video [4, 36], and medical imaging such as ultrasound, magnetic resonance imaging (MRI), and computerized tomography (CT) scans [22, 39, 40, 44].

The desire for HR stems from two principal application areas:
  • Improvement of resolution for human interpretation: in these applications, a human is the end user of the system. SR methods improve the resolution and visual quality of the captured image. For example, a doctor can diagnose or treat a patient using images captured from outside or inside the patient’s body.

  • Helping representation for automatic machine perception: SR methods are used to improve the resolution and image quality in order to facilitate machine processing. They are applied to various problems such as optical character recognition (OCR) or machine face recognition [1, 9, 15, 50].

In this paper, we investigate the plausibility of using a Recurrent Neural Network (RNN) [57] with Long Short-Term Memory (LSTM) [8, 16] to exploit the temporal axis information from facial video and estimate the pain level expressed in the face. The RNN is fed with the features extracted by a CNN that explores spatial information. We employ an SR technique to generate super-resolved high-resolution images from low resolution faces, and we use the CNN+RNN based deep learning framework to observe the performance. We report our results on the publicly available and challenging UNBC-McMaster Shoulder Pain database [38]. The major contributions of the paper are as follows:
  • Analyzing the pain detection performance fluctuation due to facial image resolution.

  • Determining the impact of employing SR techniques in pain expression detection.

  • Employing a hybrid deep learning framework by combining CNN and RNN to exploit spatio-temporal information of pain in video sequences.

The rest of the paper is organized as follows. Section 2 describes the proposed methodology for pain level assessment. Section 3 presents the experimental environment and the obtained results. Section 4 contains the conclusions.

2 The Proposed Pain Detection Framework

In this section we first describe the facial pain-expression database to be used in our investigation. We then describe the procedure of generating facial images with different resolutions and, finally, the deep learning-based classification framework for the experiment.

2.1 The Database

We use the UNBC-McMaster Shoulder Pain database collected by researchers at McMaster University and the University of Northern British Columbia [38]. The database contains facial video sequences of participants who had been suffering from shoulder pain and were performing a series of active and passive range-of-motion tests on their affected and unaffected limbs on multiple occasions. The database also contains FACS information for the video frames, self-reported pain scores at the sequence level and facial landmark points obtained by an appearance model. The database was originally created by capturing facial videos from 129 participants (63 males and 66 females). The participants had a wide variety of occupations and ages. During data capture the participants underwent eight standard range-of-motion tests: abduction, flexion, and internal and external rotation of each arm separately. Participants’ self-reported pain scores, along with pain intensity ratings by offline independent observers, were recorded. At present, the UNBC-McMaster database contains 200 video sequences with 48398 FACS coded frames of 25 subjects.

2.2 Obtaining Pain-Expression Data with Varying Face Resolution

We created multiple datasets by taking the original images from the UNBC-McMaster database and then varying their resolution by down-up sampling or by SR algorithms. The down-up sampling was accomplished by simply down-sampling the original images and then up-sampling the down-sampled images back to the resolution of the original images using cubic interpolation.
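The down-up sampling step can be illustrated with a minimal sketch in plain Python. The source does not say how the down-sampling itself is performed, so plain decimation (keeping every `factor`-th pixel) is an assumption here, as are the function names and the choice of the Keys cubic convolution kernel with a = -0.5 for the cubic interpolation.

```python
import math

def cubic_kernel(t, a=-0.5):
    """Keys cubic convolution kernel (a = -0.5 is an assumed, common choice)."""
    t = abs(t)
    if t <= 1:
        return (a + 2) * t**3 - (a + 3) * t**2 + 1
    if t < 2:
        return a * t**3 - 5 * a * t**2 + 8 * a * t - 4 * a
    return 0.0

def resize_row(row, new_len):
    """Resample a 1-D list of pixel values to new_len samples with cubic interpolation."""
    old_len = len(row)
    scale = old_len / new_len
    out = []
    for i in range(new_len):
        src = (i + 0.5) * scale - 0.5          # position in source coordinates
        base = math.floor(src)
        value = 0.0
        for k in range(base - 1, base + 3):    # four nearest source samples
            idx = min(max(k, 0), old_len - 1)  # replicate border pixels
            value += row[idx] * cubic_kernel(src - k)
        out.append(value)
    return out

def resize(image, new_h, new_w):
    """Separable cubic resize of a 2-D list-of-lists image."""
    rows = [resize_row(r, new_w) for r in image]
    cols = [resize_row(list(c), new_h) for c in zip(*rows)]
    return [list(r) for r in zip(*cols)]

def down_up_sample(image, factor):
    """Down-sample by decimation (assumption), then up-sample to the original size."""
    h, w = len(image), len(image[0])
    small = [row[::factor] for row in image[::factor]]
    return resize(small, h, w)
```

The output has the same size as the input but has lost the detail removed by the down-sampling, which is the property the down-up sampled datasets rely on.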

In order to generate SR images, a state-of-the-art technique, namely example-based learning [26], is adopted. The work in [26] is an extension of [25], which uses kernel ridge regression to estimate the high-frequency details of the underlying HR image. A combination of gradient descent and kernel matching pursuit keeps the time complexity at a moderate level. The method of [26] also improves on the SR method presented in [7]. In this algorithm, for a given set of training data points \(\{(x_{1},y_{1}),...,(x_{l},y_{l})\}\subset \mathfrak {R}^{M}\times \mathfrak {R}^{N}\), the following regularized cost functional is minimized:
$$\begin{aligned} O\left( \left\{ f^{1},...,f^{N} \right\} \right) =\sum _{i=1,...,N}\left( \frac{1}{2}\sum _{j=1,...,l}\left( f^{i}\left( x_{j} \right) -y_{j}^{i} \right) ^{2}+\frac{1}{2}\lambda \left\| f^{i} \right\| _{H}^{2} \right) \end{aligned}$$
where \(y_{j}=\left[ y_{j}^{1},...,y_{j}^{N} \right] \) and H is a reproducing kernel Hilbert space. Due to the reproducing property, the minimizer of Eq. 1 is expanded in kernel functions:
$$\begin{aligned} f^{i}\left( \cdot \right) =\sum _{j=1,...,l}a_{j}^{i}k\left( x_{j},\cdot \right) ,\quad \text {for}\quad i=1,...,N \end{aligned}$$
where k is the generating kernel for H, which is chosen as a Gaussian kernel \(\left( k\left( x,y \right) =\exp \left( -\left\| x-y \right\| ^{2}/\sigma _{k} \right) \right) \). Equation 1 is the sum of individual convex cost functionals, one for each scalar-valued regressor, which can therefore be minimized separately. The final estimate of the pixel value at image location (x, y) is then obtained as a convex combination of candidates, with weights given in the form of a softmax:
$$\begin{aligned} Y\left( x,y \right) =\sum _{i=1,...,N}w_{i}\left( x,y \right) Z\left( x,y,i \right) \end{aligned}$$
where \(w_{i}\left( x,y \right) =\exp \left( -\frac{\left| d_{i}\left( x,y \right) \right| }{\sigma _{C}} \right) /\left[ \sum _{j=1,...,N}\exp \left( -\frac{\left| d_{j}\left( x,y \right) \right| }{\sigma _{C}} \right) \right] \) and Z is the initial SR image that is generated by a bicubic interpolation.
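As an illustration of the kernel expansion above, the following sketch fits one scalar-valued regressor by kernel ridge regression with the Gaussian kernel. It is a simplification under stated assumptions: the coefficients are obtained from the closed-form solution (K + λI)a = y of the regularized least-squares cost, whereas the text mentions gradient descent and kernel matching pursuit for efficiency. All function names are hypothetical.

```python
import math

def gaussian_kernel(x, y, sigma_k):
    """k(x, y) = exp(-||x - y||^2 / sigma_k), as defined in the text."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-sq_dist / sigma_k)

def solve_linear_system(A, b):
    """Solve A a = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    M = [row[:] + [b[i]] for i, row in enumerate(A)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            factor = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= factor * M[col][c]
    a = [0.0] * n
    for i in range(n - 1, -1, -1):                    # back substitution
        s = sum(M[i][c] * a[c] for c in range(i + 1, n))
        a[i] = (M[i][n] - s) / M[i][i]
    return a

def fit_krr(xs, ys, lam, sigma_k):
    """Coefficients a_j of f(.) = sum_j a_j k(x_j, .) for one scalar output."""
    l = len(xs)
    K = [[gaussian_kernel(xs[i], xs[j], sigma_k) for j in range(l)]
         for i in range(l)]
    for i in range(l):
        K[i][i] += lam                                # K + lambda * I
    return solve_linear_system(K, ys)

def predict_krr(xs, coeffs, x_new, sigma_k):
    """Evaluate the kernel expansion at a new input."""
    return sum(a * gaussian_kernel(xj, x_new, sigma_k)
               for a, xj in zip(coeffs, xs))
```

Because the cost decomposes over the N output components, `fit_krr` would be called once per component, each time with the same inputs and a different target vector.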

We use the down-sampled images as input to the SR algorithm and obtain the super-resolved images.
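The softmax-weighted combination of candidates used for the final pixel estimate in this section can be sketched as follows. The values d_i(x, y) are taken as given inputs, since this section does not specify how they are computed; the function names are hypothetical.

```python
import math

def softmax_weights(distances, sigma_c):
    """w_i = exp(-|d_i| / sigma_c) / sum_j exp(-|d_j| / sigma_c)."""
    scores = [-abs(d) / sigma_c for d in distances]
    top = max(scores)                      # subtract the max for numerical stability
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def combine_candidates(candidates, distances, sigma_c):
    """Convex combination Y = sum_i w_i * Z_i of candidate pixel values."""
    weights = softmax_weights(distances, sigma_c)
    return sum(w * z for w, z in zip(weights, candidates))
```

Since the weights are non-negative and sum to one, the combined value always lies between the smallest and the largest candidate, and candidates with smaller |d_i| dominate as σ_C shrinks.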

2.3 Deep Hybrid Classification Framework

We use a hybrid framework that combines a CNN and an RNN to exploit both spatial and temporal information of facial pain expressions for pain detection. The hybrid pain detection framework is depicted in Fig. 2. In order to extract discriminative facial features, we fine-tune VGG_Faces [43], a 16-layer CNN pre-trained on 2.6M facial images of 2.6K people. Concretely, we replace the last layer of the CNN with a randomly initialized fully-connected layer with one output for each of the three pain levels to recognize, and set its learning rate to ten times the learning rate of the rest of the CNN.

Once the model is fine-tuned, we extract the features of its fc7 layer and use them as input to a Long Short-Term Memory (LSTM) Recurrent Neural Network (RNN) [16]. LSTMs are particular implementations of RNNs that use forget (f), input (i), and output (o) gates to mitigate the vanishing and exploding gradient problems, making them suitable for learning long-term time dependencies. These gates control the flow of information through the model by using point-wise multiplications and sigmoid functions \(\sigma \), which bound the information flow between zero and one:
$$\begin{aligned} i(t) = \sigma (W_{(x\rightarrow i)}x(t) + W_{(h \rightarrow i)}h(t-1) + b_{(1\rightarrow i)}) \end{aligned}$$
$$\begin{aligned} f(t) = \sigma (W_{(x\rightarrow f)}x(t) + W_{(h\rightarrow f)}h(t-1) + b_{(1\rightarrow f)}) \end{aligned}$$
$$\begin{aligned} z(t) = \tanh (W_{(x\rightarrow c)}x(t) + W_{(h\rightarrow c)}h(t-1) + b_{(1\rightarrow c)}) \end{aligned}$$
$$\begin{aligned} c(t) = f(t)c(t-1) + i(t)z(t), \end{aligned}$$
$$\begin{aligned} o(t) = \sigma (W_{(x\rightarrow o)}x(t) + W_{(h\rightarrow o)}h(t-1) + b_{(1\rightarrow o)}) \end{aligned}$$
$$\begin{aligned} h(t) = o(t)\tanh (c(t)), \end{aligned}$$
where z(t) is the input to the cell at time t, c is the cell, and h is the output. \(W_{(x\rightarrow y)}\) are the weights from x to y. More detail can be found in the original implementation [33].
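The gate equations above translate directly into a single time step. The sketch below uses plain Python lists so that each line mirrors one equation; the `params` layout (a dictionary mapping the gate names "i", "f", "c", "o" to a tuple of input weights, recurrent weights and bias) and the function names are assumptions made for illustration, not the implementation of [33].

```python
import math

def sigmoid(v):
    return 1.0 / (1.0 + math.exp(-v))

def affine(W_x, x, W_h, h, b):
    """Compute W_x x + W_h h + b for list-of-lists weight matrices."""
    return [
        sum(wx * xi for wx, xi in zip(W_x[r], x))
        + sum(wh * hi for wh, hi in zip(W_h[r], h))
        + b[r]
        for r in range(len(b))
    ]

def lstm_step(x, h_prev, c_prev, params):
    """One LSTM step: returns the new output h(t) and cell state c(t)."""
    pre = {g: affine(W_x, x, W_h, h_prev, b)
           for g, (W_x, W_h, b) in params.items()}
    i = [sigmoid(v) for v in pre["i"]]       # input gate
    f = [sigmoid(v) for v in pre["f"]]       # forget gate
    z = [math.tanh(v) for v in pre["c"]]     # candidate cell input
    o = [sigmoid(v) for v in pre["o"]]       # output gate
    c = [ft * cp + it * zt for ft, cp, it, zt in zip(f, c_prev, i, z)]
    h = [ot * math.tanh(ct) for ot, ct in zip(o, c)]
    return h, c
```

Running the step over the frames of a sequence, feeding each returned `h` and `c` back in, yields the final output from which the pain level of the last frame would be predicted.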
Fig. 2.

The block diagram of the deep hybrid classification framework based on a combination of CNN and RNN

Labels are predicted sequence-wise, i.e. given a sequence of n frames \(f_i \in \{f_1, ..., f_n\}\), the target prediction is the pain level of the \(f_n\) frame. Thus, training is set so that the information contained in the past frames is used in order to predict the current pain level. We optimize the LSTM with Adam [27] with an initial learning rate of 0.001 so as to alleviate the hyper-parameter tuning problem.
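The sequence-wise labelling can be sketched as a sliding window over the per-frame CNN features, where each window is paired with the label of its last frame. The window length of 16 frames is the value reported in Sect. 3.1; the stride of one frame and the function name are assumptions.

```python
def make_sequences(frame_features, frame_labels, window=16):
    """Pair every window of consecutive frames with the label of its last frame."""
    samples = []
    for end in range(window, len(frame_features) + 1):
        sequence = frame_features[end - window:end]
        target = frame_labels[end - 1]     # pain level of frame f_n
        samples.append((sequence, target))
    return samples
```

A video with fewer frames than the window length produces no samples under this sketch; how such videos are handled is not stated in the source.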

3 Experimental Results and Discussions

3.1 Experimental Environment

As stated in the previous section, we evaluated the performance of pain detection at varying face resolutions by employing the hybrid deep learning framework on the UNBC-McMaster Shoulder Pain database [38]. The video frames of the database show patients who were suffering from shoulder pain while they were performing a series of active and passive range-of-motion tests. The pain indexes were computed following the Prkachin and Solomon Pain Intensity (PSPI) scale from [47], and the pain levels vary in the interval 0–16 based on the FACS codes. Following [20], we classified each pain index into three categories: no pain (pain index 0 or 1), weak pain (pain index from 2 to 5) and strong pain (pain index of 6 or greater). The three categories were balanced by dropping consecutive no-pain frames at the beginning and at the end of each video, or by discarding entire video sequences that do not contain pain.
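The mapping from PSPI index to pain category and the trimming of no-pain frames can be sketched as below. The thresholds follow the pain-index ranges listed in Table 1 (0 and 1 for no pain, 2 to 5 for weak pain); treating every index from 6 to 16 as strong pain is an assumption, as is dropping all leading and trailing no-pain frames rather than only some of them. The function names are hypothetical.

```python
def pain_category(pspi):
    """Map a PSPI index in [0, 16] to one of the three pain categories."""
    if not 0 <= pspi <= 16:
        raise ValueError("PSPI index must lie in the interval 0-16")
    if pspi <= 1:
        return "no pain"
    if pspi <= 5:
        return "weak pain"
    return "strong pain"               # assumption: 6 and above

def trim_no_pain(categories):
    """Drop leading and trailing no-pain frames; return [] if no pain occurs."""
    pain_idx = [i for i, c in enumerate(categories) if c != "no pain"]
    if not pain_idx:
        return []                      # sequence without pain is discarded
    return categories[pain_idx[0]:pain_idx[-1] + 1]
```

No-pain frames that fall between two pain frames are kept, so the temporal continuity inside the retained segment is preserved.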

We applied the down-up sampling and the SR algorithm described in Sect. 2.2 to generate four experimental datasets. The first dataset consisted of the original images from the UNBC-McMaster database (also used by [19]). The second and third datasets (both labelled with the prefix ‘SR’) were created by applying down-up sampling with two different values to the first dataset. The fourth dataset, denoted by ‘SR2’, was created by applying the SR algorithm from Sect. 2.2 to the down-sampled images. The LSTM network was configured with 3 hidden layers of 64 hidden units each and a temporal window of 16 consecutive video frames. For the purpose of comparison, the experimental setup of the LSTM was kept fixed for all experiments on the four datasets. The performance was estimated with a leave-one-subject-out cross-validation protocol.
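A leave-one-subject-out protocol holds out all samples of one subject per fold and trains on the remaining subjects. The following sketch generates such folds from a list of per-sample subject identifiers; the function name and the identifier format are hypothetical.

```python
def leave_one_subject_out(subject_ids):
    """Yield (held_out_subject, train_indices, test_indices) for every subject."""
    for held_out in sorted(set(subject_ids)):
        train = [i for i, s in enumerate(subject_ids) if s != held_out]
        test = [i for i, s in enumerate(subject_ids) if s == held_out]
        yield held_out, train, test
```

Splitting by subject rather than by frame keeps frames of the same person out of both the training and the test set of a fold, so the reported accuracy reflects generalization to unseen subjects.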

3.2 The Obtained Results

Table 1 shows the results of the proposed system on the four datasets. We report the accuracy in percentage for each of the three categories, namely “No pain”, “Weak Pain” and “Strong Pain”. The experiments show that applying the proposed method to super-resolved images yields better performance than applying it to the plain down-sampled versions, as reflected in the higher pain detection rate on the super-resolved images compared with the LR ones. In other words, when recognizing pain from super-resolved images, a more powerful SR method leads to recognition rates closer to those obtained with the original images. Pain detection is better on super-resolved images than on down-sampled ones by a large margin for strong pain, while for the other two levels, no pain and weak pain, the performance is only slightly better. This is because strong pain imposes more changes on the face than weak or no pain, and these changes are more pronounced in super-resolved images; hence the detection accuracy improves far more in the strong pain class than in the other classes. To see how temporal information affects the final results, Table 2 compares the SR2 accuracy of a linear classifier on the plain CNN features with that of the LSTM predictions, which aggregate the temporal information. The results are reported for each subject in the considered dataset. Temporal information improves the predictions by a large margin, \(16\%\) on average, indicating that spatial features alone are not enough for determining the pain level in facial images. Thus, the temporal variation across frames allows for finding higher-level facial features, like FACS, which are central for predicting the PSPI pain score [48].
Table 1.

Pain detection results for the four experimental datasets created from the UNBC-McMaster [38] database

Semantic ground truth    Pain index
No pain                  0, 1
Weak pain                2, 3, 4, 5
Strong pain              [not recoverable]

[Accuracy columns for the individual datasets: values not recoverable.]

From Table 2 we notice that in two cases, specifically subject number 7 and subject number 8, the LSTM failed to improve on the accuracy of the CNN. After a detailed study of the dataset, we noticed that, for both subjects, the pain index sometimes changes very rapidly between consecutive frames. The same pattern also occurs, in a milder form, for subject 6, whose improvement in accuracy is not as large as for the other subjects. In addition, subject 7 is the only subject with just one video in the validation set, while subject 8 has three videos, one of which is very noisy and has only 20 frames. We think that these differences could be the key factors that lead to such different performance across subjects.
Table 2.

Comparison between CNN and LSTM performances on the SR2 dataset (in accuracy \(\%\)). The CNN relies on the information of a single frame, while the LSTM takes into account variations of the images along the temporal axis. The LSTM enhances the prediction accuracy for most subjects, with an improvement of \(16\%\) on average.

[Per-subject accuracy values not recoverable.]

4 Conclusions

We investigated the performance of a recurrent deep learning framework trained on super-resolved high-resolution images for pain level classification. The system is a combination of a CNN and an LSTM used to exploit both spatial and temporal information in videos. We evaluated our proposed method on the UNBC-McMaster database by down-sampling with different factors and by applying a super-resolution algorithm. From the experimental results on pain detection performance we conclude that super-resolution and temporal information are key for obtaining good recognition results. Our experiments also showed that including deep temporal information within the model increases its generalization capability in discriminating among different levels of pain. Employing super-resolution techniques leads to an improvement in the performance of our pain detector. Down-sampling, on the other hand, worsens the system's capabilities.

Copyright information

© Springer International Publishing AG 2017

Authors and Affiliations

  • Marco Bellantonio (1)
  • Mohammad A. Haque (2)
  • Pau Rodriguez (1)
  • Kamal Nasrollahi (2)
  • Taisi Telve (3)
  • Sergio Escalera (1)
  • Jordi Gonzalez (1)
  • Thomas B. Moeslund (2)
  • Pejman Rasti (3)
  • Gholamreza Anbarjafari (3)
  1. Computer Vision Center (UAB), University of Barcelona, Barcelona, Spain
  2. Visual Analysis of People (VAP) Laboratory, Aalborg University, Aalborg, Denmark
  3. iCV Research Group, Institute of Technology, University of Tartu, Tartu, Estonia
