
1 Introduction

Physical exercise reduces the risk of developing and/or dying from cardiovascular disease by maintaining various physiological parameters (heart rate, blood pressure, etc.) and blood components (blood sugar, cholesterol, triglycerides, etc.). It enhances and maintains physical fitness, increases muscle strength, reduces depression and anxiety, and lowers the risk of various diseases [1, 2]. The benefits of physical exercise can be increased by proper real-time monitoring [3]. Exercise by the elderly and during rehabilitation can lead to accidents, many of which are caused by a lack of proper real-time monitoring of the exercise [4]. Intensity (how hard the exercise feels to the person performing it) can be measured in subjective or objective ways. Basically, there are three ways of monitoring physical exercise intensity: by extracting or monitoring physiological parameters, such as heart rate (HR) and respiratory rate (RR); by the rated perceived exertion scale; and by the talk test (how hard it is for the subject to talk). Exercise intensity usually rises and falls with heart rate; therefore, by measuring heart rate we can define the level of physical exercise.

In a submaximal graded exercise, intensity is generally perceived as “very light” at the beginning and “very hard” at the end. The perceived exertion depends on many overall body responses, including heart rate (HR), respiratory rate (RR), blood lactate, physical status, mood state, etc. Therefore, proper measurement or estimation of these parameters during exercise helps to monitor it.

The Borg scale is a common way to classify the feeling of hardness during physical exercise [5]. Borg proposed a subjective technique to classify perceived exertion during exercise, called the rating of perceived exertion (RPE). Measuring these levels during exercise is challenging because the feeling is individual and the subject must be familiar with the scale, which makes the measurement difficult for people without sufficient knowledge of the Borg scale. Nowadays, physical exercise can be monitored by extracting physiological features using invasive or non-invasive techniques. The subjective way of defining the level of exercise intensity has been used for a long time and has considerable validity [5]. Several instruments use invasive techniques or contact-sensor technology to measure physiological signals, including heart rate, respiratory rate, and blood lactate [6, 7]. By measuring these parameters, we can correlate them with the exercise intensity level. Non-invasive identification/classification of exercise intensity can consist of measuring physiological data and converting it into exercise intensity levels or classes, or of directly recognizing exercise intensity using computer vision techniques. It is commonly observed that when a person gets tired, his/her facial expression and facial color change, which can be an important cue for classifying the intensity level.

Most recent research on the measurement of physical exercise intensity involves facial image analysis using feature point analysis [8, 9], facial color analysis, mouth and eye blink analysis [10], body movement tracking [11], etc. In the literature, we can find various ways to measure physical exercise intensity using non-invasive methods. Fatigue can be detected by analyzing muscle movement patterns [12]. Head motion and pose can be measured by tracking feature points, which can then be analyzed using statistical and machine learning algorithms [8]. Haque [12] presented an efficient non-contact system for detecting non-localized physical fatigue from maximal muscle activity using facial videos recorded in a realistic environment. Salik [9] proposed exercise intensity classification using facial feature point analysis.

In computer vision, deep learning is an emerging approach to image classification. The classification of physical exercise intensity lends itself to facial expression analysis, since facial expression changes when a person experiences higher exercise intensity [9]. Nowadays, facial emotion analysis using deep learning is also very common and has achieved better results than traditional machine learning techniques [13,14,15,16]. Deep learning can also be applied to analyze or monitor physical exercise from body parameters. Gordienko [17] proposed a multimodal approach to estimate fatigue using deep learning, where the input parameters were extracted using wearable sensors.

Exercise intensity levels have long been classified using subjective techniques, but achieving this with objective techniques remains challenging. In this paper, an objective (quantitative) technique is proposed to classify exercise intensity using computer vision. The ground-truth class/level of exercise intensity was defined according to the incremental HR: the class at the beginning of the exercise (minimum HR) is the initial class, ‘light’, and the class at the end of the exercise (maximum HR) is the final class, ‘hard’. The intermediate classes or levels are likewise defined by the HR. A deep learning approach using a convolutional neural network was applied to classify the facial images, which were extracted from videos collected during submaximal exercise.

2 Methods

2.1 Dataset Description

Twenty university students (mean age = 26.88 ± 6.01 years, mean weight = 72.56 ± 14.27 kg, mean height = 172.88 ± 12.04 cm; 14 males and six females, all white Caucasian) participated in the study. An informed consent form was signed by each participant prior to data collection, and they were informed of the study protocol before the recordings. The test consisted of a submaximal ramp exercise protocol on a Wattbike cycloergometer (Wattbike Ltd, Nottingham, UK), after a 5-min warm-up at a constant power output of 60 W. The initial power output was 75 W and was increased by 15 W min−1 until participants reached 85% of their maximal heart rate (calculated as 208 − (0.7 × age)) or until they were unable to maintain the cadence needed to generate the required power output throughout the stage. Heart rate data was collected at 100 Hz using a Polar T31 cardiofrequencimeter (Polar Electro, Kempele, Finland), synchronized to the Wattbike load cell for power output measures, also sampled at 100 Hz. For facial tracking, facial video (25 Hz, spatial resolution of 1080 × 1920 pixels) was recorded during the test using a video camera placed in the frontal plane view (at a 90° angle between the face and the camera) to capture the participants’ faces while performing the exercise. The participants were not allowed to talk during the test but could express their feelings freely through facial expression throughout.

For the purpose of this study, a dataset containing classes of images with different levels of tiredness was prepared. The image frames were manually assigned to categories according to the heart rate. For two classes, the initial 500 frames of each video were considered class one (not-tired faces) and the last 500 frames class two (tired faces). Since there were 20 subjects, the total number of images per class was 10,000, and the total number of images in the dataset was 10,000 times the number of classes. The dataset was prepared for two, three, and four classes separately (see Fig. 1).

Fig. 1.
figure 1

Allocation of time slots to extract images with the initial class (Minimum exercise intensity), the intermediate classes, and the final class (Maximum exercise intensity).

The allocation of time slots in a video was based on the incremental HR value. For classifications with more than two classes, the middle classes were defined according to the heart rate value. For instance, if the minimum heart rate is 80 bpm and the final heart rate is 180 bpm, then the image frames recorded around 130 bpm are considered the second (middle) class. Likewise, the time slots for additional classes are obtained by synchronizing the heart rate with the frame number.
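The mapping described above can be sketched with two small helpers (hypothetical code written for illustration; the paper does not provide an implementation, and the function and variable names are ours). Each class is assigned a target HR evenly spaced between the minimum and maximum HR, and the time slot for that class is centered on the frame whose synchronized HR sample is closest to the target:

```python
def class_target_hrs(hr_min, hr_max, n_classes):
    """Evenly spaced target heart rates, one per class, from start to end HR."""
    step = (hr_max - hr_min) / (n_classes - 1)
    return [hr_min + step * i for i in range(n_classes)]

def slot_center_frame(hr_series, target_hr):
    """Index of the frame-synchronized HR sample closest to the target HR."""
    return min(range(len(hr_series)),
               key=lambda i: abs(hr_series[i] - target_hr))

# The example from the text: HR rises from 80 to 180 bpm; with three
# classes, the middle class is anchored at 130 bpm.
targets = class_target_hrs(80, 180, 3)  # [80.0, 130.0, 180.0]
```

In this sketch, the 500 frames around each center frame would then be labeled with that class, matching the per-class frame counts given in Sect. 2.1.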

2.2 Pre-processing

Before feeding the neural network with inputs, several image pre-processing steps were applied; the full pipeline is shown in Fig. 2. Since the images were extracted from video of a moving subject (head movement), the extracted frames may be blurred. Therefore, the first pre-processing step consists of detecting and removing any blurred image frames. In the second step, the face was detected in the frame so that we could analyze the face specifically, rather than the whole frame. The well-known Viola-Jones algorithm [18] was applied to detect the face. After detecting the face, we cropped it and down-sampled it to 96 × 96, the size of the input layer. One of the basic purposes of this research is to find the best color channel for classifying physical exercise intensity; therefore, the experiments were performed with a separate raw 2D image for each color channel (red, green, and blue) and for grayscale. So, after cropping and resizing the face, the RGB frames were split into R, G, and B channels and also converted to grayscale images.

Fig. 2.
figure 2

Block diagram of the detailed preprocessing, with output images.

2.3 Proposed CNN Architecture

A deep neural network based on a Convolutional Neural Network (CNN), or ConvNet, was designed with five hidden layers and two fully connected layers, as shown in Fig. 3. Three main types of layers are used to build ConvNet architectures: the convolutional layer, the pooling layer, and the fully connected layer (exactly as in regular neural networks). These layers were stacked to form the full ConvNet architecture:

Fig. 3.
figure 3

Proposed convolutional neural network architecture.

  • Input layer [96 × 96]: holds the raw pixel values of the 2D image of the face.

  • CONV layer: computes the output of neurons that are connected to local regions in the input, each computing a convolution between its weights and the small region it is connected to in the input volume.

  • Fully connected layer: computes the class scores, resulting in a volume of size [1 × 1 × n], where n is the number of classes. As with ordinary neural networks, and as the name implies, each neuron in this layer is connected to all the activations in the previous layer.

  • The activation function chosen was ReLU.

  • Maxpooling with Pool size (2, 2).

  • 25% dropout for regularization.

Each hidden layer starts with a convolutional layer (Conv2D), followed by spatial batch normalization, max-pooling, dropout, and ReLU activation; each hidden layer consists of these five operations. After the five convolutional layers, the network leads into two fully connected layers, each consisting of an affine operation and ReLU activation.

We implemented this architecture with the well-known Python library Keras. The experiments were carried out on a Google Colab GPU.

2.4 Experiments

The first convolutional layer consists of 64 3 × 3 filters; the second had 128 3 × 3 filters; the third 256 3 × 3 filters; the fourth 512 3 × 3 filters; and the last also had 512 3 × 3 filters. All the hidden layers used a stride of 1, batch normalization, max-pooling of size 2 × 2, dropout of 0.25, and ReLU as the activation function. These five hidden layers are followed by two fully connected layers with 256 and 512 neurons, respectively. Both fully connected layers used batch normalization, dropout, and ReLU with the same parameters. A softmax output layer was used with the classification loss. Figure 3 shows our deep neural network architecture.
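The architecture above could be realized in Keras roughly as follows (a sketch, not the authors' code: the filter counts, kernel sizes, pooling, dropout, and dense layer sizes follow the text, while the padding, L2 strength, and choice of optimizer are unstated in the paper and are our assumptions):

```python
from tensorflow import keras
from tensorflow.keras import layers, regularizers

def build_model(n_classes, input_shape=(96, 96, 1)):
    """Five conv blocks + two dense layers, as described in Sect. 2.3-2.4."""
    inputs = keras.Input(shape=input_shape)
    x = inputs
    # Five hidden layers: Conv2D -> BatchNorm -> ReLU -> MaxPool -> Dropout.
    for filters in (64, 128, 256, 512, 512):
        x = layers.Conv2D(filters, (3, 3), strides=1, padding="same",
                          kernel_regularizer=regularizers.l2(1e-4))(x)
        x = layers.BatchNormalization()(x)
        x = layers.Activation("relu")(x)
        x = layers.MaxPooling2D(pool_size=(2, 2))(x)
        x = layers.Dropout(0.25)(x)
    x = layers.Flatten()(x)
    # Two fully connected layers with 256 and 512 neurons.
    for units in (256, 512):
        x = layers.Dense(units)(x)
        x = layers.BatchNormalization()(x)
        x = layers.Activation("relu")(x)
        x = layers.Dropout(0.25)(x)
    # Softmax output over the exercise-intensity classes.
    outputs = layers.Dense(n_classes, activation="softmax")(x)
    model = keras.Model(inputs, outputs)
    model.compile(optimizer="adam", loss="categorical_crossentropy",
                  metrics=["accuracy"])
    return model
```

With a 96 × 96 input, the five 2 × 2 poolings reduce the spatial size to 3 × 3 before the flatten, so the first dense layer sees a 3 × 3 × 512 volume.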

The training was performed for 75 epochs with a batch size of 64. The dataset of 10,000 images per class was randomly split into training, validation, and test sets in the ratio 80:10:10. For two classes (tired and not tired) the total number of images was 20,000, of which 16,000 were used for training, 2,000 for validation, and 2,000 for testing. Experiments with two, three, and four classes were performed. To reduce overfitting, we used dropout and batch normalization in addition to L2 regularization.
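The 80:10:10 random split can be sketched as follows (our helper; the paper does not state which routine was used for the split):

```python
import numpy as np

def split_indices(n_samples, seed=0):
    """Shuffled index arrays for an 80:10:10 train/val/test split."""
    rng = np.random.default_rng(seed)
    idx = rng.permutation(n_samples)
    n_train = int(0.8 * n_samples)
    n_val = int(0.1 * n_samples)
    return (idx[:n_train],
            idx[n_train:n_train + n_val],
            idx[n_train + n_val:])

# Two-class dataset: 20,000 images -> 16,000 train, 2,000 val, 2,000 test.
train_idx, val_idx, test_idx = split_indices(20_000)
```

The three index sets are disjoint and together cover every image, so no frame appears in more than one partition.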

3 Experimental Results and Discussion

Separate experiments were carried out to classify into two, three, and four classes, and the accuracy of each case was analyzed. In the experiments, the color images were split into their red, green, and blue components, and the original RGB images were also converted into grayscale. The green color component provided the best classification accuracy. The confusion matrix was computed in each case. In most cases, the accuracy of the two-class classification was more than 99% (see Table 1). Overall, classification into two and three classes resulted in very high classification accuracy, whereas classification into four classes had lower performance for every channel.

Table 1. The average accuracy of classification into two, three, and four classes, using red, green, blue and gray channels.

Based on the results presented in the table, the classification into two classes reaches 100% accuracy in all cases. It also shows that the best raw color channel is green, which obtained average accuracies of 100%, 99.86%, and 99.75% for two-, three-, and four-class classification, respectively. From these results, we conclude that the level of tiredness, or physical exercise intensity, is best reflected by the green color channel. Therefore, in the remainder of this article, all experimental results and plots are based only on the green color channel.

The accuracy and loss histories during the 75 training epochs are shown in Fig. 4(a) and (b), respectively. Only the plot for the green channel is shown, since the green channel yielded the best average prediction accuracy among all the color channels (Tables 2, 3 and 4).

Fig. 4.
figure 4

Training and validation accuracy and loss vs. epoch of green color for four class classification. (Color figure online)

Table 2. Classification accuracy of each class in the classification of physical exercise intensity into two classes.
Table 3. Classification accuracy of each class in the classification of physical exercise intensity into three classes.
Table 4. Classification accuracy of each class in the classification of physical exercise intensity into four classes.

The confusion matrices of the green color channel for all the classification settings are presented in Figs. 5, 6 and 7. The accuracy of the two-class classification is 100%, which shows that a convolutional neural network can very easily separate normal and fully tired faces. The test set contained 2,000 randomly selected images, of which 1,037 were normal faces and 963 were tired faces.

Fig. 5.
figure 5

Confusion matrix of the two-class classification of physical exercise intensity.

Fig. 6.
figure 6

Confusion matrix of the three-class classification of physical exercise intensity.

Fig. 7.
figure 7

Confusion matrix of the four-class classification of physical exercise intensity.

Similarly, the confusion matrix of the three-class classification is shown in Fig. 6. In this case, misclassification occurs only between the first and second classes. The last class is 100% accurate: none of the other classes was classified into it, nor was it classified into any other class. For class one, only one of 1,025 images was misclassified as class two. Likewise, for class two, three images out of 1,018 were misclassified as class one.

Likewise, recognizing fully tired faces was easier than recognizing the other classes. The misclassification rate was always greatest for the nearest class: for example, the first class, normal (not tired) faces, was mostly misclassified as the second class; the second class was misclassified as the first and third classes; and so on.

4 Conclusion

Based on experiments with various types of image datasets, the deep learning approach to exercise intensity classification based on facial expression is a potential method for classifying exercise intensity into two, three, four, or more levels. For the two-class classification the accuracy is 100%, and even for the three-class classification it is around 99%. From all the experiments, it can be concluded that the best color channel for the raw input image, in terms of classification accuracy, is green. The training and testing datasets were randomly prepared from the same subjects; therefore, this approach is most appropriate for personalized physical exercise monitoring.

Future work can extend the classification to more than four classes. The experiments were done with only 20 subjects, with little diversity in age and origin. To generalize this model, it can be trained with a greater number and diversity of subjects in order to improve the test accuracy. Considering that training and testing were performed on the same subjects, this approach might be most appropriate for personalized exercise monitoring systems, where the system can be trained on the same subject with image datasets taken in various exercise sessions.