1 Introduction

The search for beauty has been pursued by humanity since its beginnings. Discovering the secret of beauty has been a goal of philosophers, artists, and scientists throughout history [1]. In ancient Greece, for example, beauty was already associated with symmetry. Nowadays, facial beauty receives even more interest due to the rapid development of plastic surgery and the cosmetics industry [2].

Computer science is no stranger to this interest; facial beauty has therefore become an active research topic in computer vision and machine learning [3, 4]. This research mainly focuses on facial beauty estimation and classification/prediction, which can be useful in various applications such as cosmetic recommendations [5], plastic surgery planning [6], facial beautification [7], and social network services (SNS) such as Facebook, Instagram, and dating websites [8]. In addition, automatic facial beauty prediction (FBP) may find application wherever attractiveness is a basic requirement, such as in advertising, magazine covers, and the selection of applicants for certain professions, for example in the entertainment industry and the modeling business [6].

Motivation

Over the past decade, CNNs have become the dominant solution for most computer vision and machine learning tasks [9, 10]. Despite these tremendous advances in deep learning methods, FBP has not been able to benefit much from deep learning. One of the goals of this work is to leverage some of the recent powerful CNN architectures to develop an accurate and robust solution for FBP.

Developing such an accurate solution for facial beauty estimation is a difficult task because facial beauty is a subjective notion that varies from person to person, and facial attributes (gender, ethnicity, age, etc.) also affect its evaluation. In addition, a person's internal state, as reflected in facial expression, can influence the evaluation as well [11]. Although deep learning, and CNN architectures in particular, has made significant progress in facial beauty assessment and prediction, training deep CNNs requires large amounts of labelled data. To deal with this data limitation, we use active data augmentation. Moreover, models pre-trained on the ImageNet database [12] are used to extract high-level features.

In this paper, we present a deep learning approach for predicting the beauty of faces. The presented approach is based on three main contributions. First, we propose a two-backbone architecture in which two different CNN architectures are fused into a single architecture trained in an end-to-end manner. Second, we propose dynamic robust loss functions for training the deep regressors. Third, we propose an ensemble of regressors in which the final prediction is the average of all predictions, without retraining the final solution on a new validation set. In the ensemble solution, each model is trained separately. The ensemble consists of single-branch architectures (ResneXt-50 and Inception-v3) and our proposed two-backbone architecture (2B-IncRex) trained with different loss functions. In this approach, three robust loss functions are made dynamic: ParamSmoothL1, Huber, and Tukey.

In the following, the most important contributions of the proposed solution are explained one after the other.

  • We propose the ParamSmoothL1 regression loss function and a dynamic law that adjusts the parameters of robust loss functions during training. Specifically, we apply a parabolic law to the robust loss functions ParamSmoothL1, Huber, and Tukey, which removes the difficulty of searching for the best loss-function parameter. Moreover, these dynamic losses improve training convergence compared to the standard loss functions (MSE and L1) and to the robust loss functions (SmoothL1, Huber, and Tukey) with a fixed parameter.

  • A network with two backbones (2B-IncRex) based on ResneXt-50 and Inception-v3 architectures is proposed for face beauty prediction.

  • An ensemble regression for face beauty estimation is proposed by fusing the predicted values of the one-branch networks (ResneXt-50 and Inception-v3) and the two-backbone network (2B-IncRex). This ensemble of five CNNs is trained with the dynamic loss functions Dynamic ParamSmoothL1, Dynamic Tukey, Dynamic ParamSmoothL1, Dynamic Huber, and Dynamic Tukey, respectively. Although the individual regression models are trained separately with the same fixed hyperparameters, the estimates produced by the resulting ensemble are more accurate than those of the individual models as well as state-of-the-art solutions. The code to train and test our approach is publicly available at https://github.com/faresbougourzi/Dynamic_ER-CNN (last accessed on March 25th, 2022).

The paper is organized as follows: Section 2 presents related work on facial beauty prediction. Section 3 explains the backbone CNN architectures used, the proposed approach, and the proposed dynamic robust losses. Section 4 describes the databases, the evaluation metrics, and the experimental setup. Section 5 presents the performance evaluation on the SCUT-FBP5500 dataset. Section 6 presents the performance evaluation on the KDEF-PT dataset. Section 7 provides a discussion and a comparison with state-of-the-art methods. Finally, Section 8 concludes the paper.

2 Related work

Automatic prediction of facial beauty is still a young problem, but it is becoming increasingly important in machine learning and computer vision. There is a unified concept of facial beauty that enables the automation of this prediction [13, 14]. Accordingly, facial beauty classification and attractiveness score prediction were developed to associate facial attractiveness with image features in a quantitative manner [15]. The first database created to treat FBP as a regression task dates back to 2015 [16]. Two main families of FBP methods can be distinguished in the literature: hand-crafted [17,18,19,20,21,22,23,24] and deep learning [18, 25,26,27]. Hand-crafted methods are further classified as geometry-based or appearance-based [22].

Before the heyday of deep learning architectures, hand-crafted methods were commonly used for FBP. Aarabi et al. [21] and H. Yan [22] presented work on appearance-based hand-crafted methods. In the first work, an automatic system for evaluating the beauty of faces was developed. It is based on the ratios between facial features (concretely, the face, eyes, eyebrows, and mouth) and uses the K-nearest neighbor algorithm to learn the beauty assignment. H. Yan [22], on the other hand, proposed a new Cost-Sensitive Ordinal Regression (CSOR) method to measure the importance of samples in different classes. CSOR is applied to four types of features: intensity, LBP [28], SIFT [29], and LE [30]. A typical geometry-based hand-crafted method was presented by Zhang et al. [20]. This technique uses a huge amount of data (tens of thousands of face images, both female and male). These are mapped to a human face shape subspace, and a quantitative method is used to analyze the effects of facial geometry on the beauty of the human face. The analysis was performed using a transformation-invariant shape distance measure. Liang et al., in turn, proposed a mixed technique combining geometry-based and appearance-based hand-crafted methods. It relies on geometric features (18-dimensional ratio features extracted from faces) and appearance features (40 Gabor feature maps), with shallow predictors, namely linear regression (LR) and Support Vector Regression (SVR).

Most of the hand-crafted methods listed above have been tested on the SCUT-FBP5500 database. This database includes 5500 frontal, neutral-looking, and unoccluded faces of individuals aged 15 to 60 years [18]. On the other hand, it should be mentioned that the introduction of deep learning methods in computer vision, and especially in FBP, has surpassed the results obtained with hand-crafted methods. As a result, in recent years, deep learning architectures have been widely used for evaluating the beauty of faces.

In [18], Liang et al. presented their face beauty database (SCUT-FBP5500) with two evaluation protocols (a 60-40% split and five-fold cross-validation). They tested three CNN architectures (Alexnet [31], Resnet-18 [32], and ResneXt-50 [33]). Their results show that the ResneXt-50 architecture outperforms the other two deep architectures. Regarding the improvement that deep learning methods represent over hand-crafted ones, it should also be noted that all of the deep neural networks tested in their work (including Alexnet and Resnet-18) performed better than the hand-crafted methods tested with various shallow regressors. Cao et al. used a residual-in-residual (RIR) block to build a deeper network with multilevel skip connections that achieves a better gradient flow. In addition, they used both channel-wise and spatial attention mechanisms to capture the inherent correlation between feature maps. Their approach was tested on the SCUT-FBP5500 database [18] and showed good performance. Lin et al. [27] proposed the R3CNN architecture. It consists of two components: a regression component and a ranking component. The regression component is a Siamese network (two identical regression sub-networks) that consistently maps each face image to a beauty value. The ranking component, in turn, applies the Siamese network to pairs of images and provides an auxiliary task that improves the learning of the regression sub-networks: the ranking network learns the pairwise beauty ranking of two images. Their architecture showed promising results on the SCUT-FBP [16] and SCUT-FBP5500 [18] databases. Dornaika et al. [34] introduced a multi-layer local discriminative embedding algorithm that integrates feature selection as a main step; feature selection captures the most relevant and discriminative features of an input face image or face descriptor. All the methods mentioned so far are supervised learning methods. However, the work presented in [35] shows that semi-supervised learning also yields promising results in facial beauty estimation.

3 Methodology

This section focuses on presenting the CNN architectures used, our proposed approach, and the proposed dynamic robust losses.

3.1 Backbone CNN Architectures

The use of CNN architectures in FBP has become increasingly popular since Deep Learning methods have demonstrated their efficient performance [31].

This work is also based on CNNs: the presented architecture is a combination of ResneXt-50 [33] and Inception-v3 [36]. However, the approach is open to other backbone architectures. In addition, we use models pre-trained on the ImageNet challenge database [12].

To keep the paper self-contained, this section briefly introduces the two CNNs (ResneXt-50 and Inception-v3) used as backbone architectures in our proposed solution.

ResneXt-50 Architecture:

The ResneXt-50 architecture [33] is a variant of the popular Resnet [37] architecture. The main idea is to modify the residual blocks by adding parallel convolutional layers with a smaller number of filters. The outputs of these parallel paths are combined by summation and serve as input for the next residual block.
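For illustration, the following minimal PyTorch sketch (our own, with illustrative channel sizes) shows a ResNeXt-style bottleneck block; the parallel low-width paths are expressed equivalently as a grouped convolution, and the block output is summed with the identity:

```python
import torch.nn as nn

class ResNeXtBottleneck(nn.Module):
    """Sketch of a ResNeXt bottleneck: 32 parallel low-width paths,
    realized as a grouped 3x3 convolution, summed with the identity."""
    def __init__(self, channels=256, width=128, cardinality=32):
        super().__init__()
        self.transform = nn.Sequential(
            nn.Conv2d(channels, width, kernel_size=1, bias=False),
            nn.BatchNorm2d(width), nn.ReLU(inplace=True),
            # grouped conv == summing the outputs of 32 parallel branches
            nn.Conv2d(width, width, kernel_size=3, padding=1,
                      groups=cardinality, bias=False),
            nn.BatchNorm2d(width), nn.ReLU(inplace=True),
            nn.Conv2d(width, channels, kernel_size=1, bias=False),
            nn.BatchNorm2d(channels),
        )
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x):
        return self.relu(x + self.transform(x))  # residual summation
```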

Inception-v3 Architecture:

Inception-v3 [36] is an evolution of the GoogLeNet architecture [38], in which the Inception module was introduced. The main idea of this module is to use parallel convolutional layers with different kernel sizes as well as pooling layers. In this way, different receptive fields can be applied to the input in an efficient way.
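As an illustration, here is a minimal sketch of an Inception-style module; the branch widths are ours, chosen for illustration, and do not reproduce the exact Inception-v3 configuration:

```python
import torch
import torch.nn as nn

class InceptionModule(nn.Module):
    """Sketch of an Inception-style module: parallel convolutions with
    different kernel sizes plus a pooling branch, concatenated."""
    def __init__(self, in_ch=192):
        super().__init__()
        self.b1 = nn.Conv2d(in_ch, 64, kernel_size=1)
        self.b3 = nn.Sequential(nn.Conv2d(in_ch, 48, kernel_size=1),
                                nn.Conv2d(48, 64, kernel_size=3, padding=1))
        self.b5 = nn.Sequential(nn.Conv2d(in_ch, 48, kernel_size=1),
                                nn.Conv2d(48, 64, kernel_size=5, padding=2))
        self.bp = nn.Sequential(nn.AvgPool2d(3, stride=1, padding=1),
                                nn.Conv2d(in_ch, 32, kernel_size=1))

    def forward(self, x):
        # each branch sees a different receptive field; spatial size is preserved
        return torch.cat([self.b1(x), self.b3(x), self.b5(x), self.bp(x)], dim=1)
```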

3.2 Our approach

Our method is described in Fig. 1. The predicted score of beauty is the mean of multiple scores, which means that we employ an ensemble of multiple regression models, each of which independently provides an individual score. In our implementation, we use five models. There are two main contributions to this ensemble: (i) the deep network with two backbones (2B-IncRex) (see Section 3.4) and (ii) the dynamic robust loss functions (see Section 3.5).

Fig. 1

General structure of the proposed approach (Dynamic ER-CNN). Note that every model in this set of five solutions is trained separately using a given regression loss

The first two scores are predicted by the trained Inception-v3 and ResneXt-50 deep networks using the Dynamic Tukey loss function and the Dynamic ParamSmoothL1 loss function, respectively. The selection of the associated loss function for each backbone was determined empirically when these backbones were evaluated in Section 5.1.2. The remaining three scores are estimated after training the proposed two-backbone deep network (2B-IncRex) with the three dynamic loss functions: Dynamic ParamSmoothL1, Dynamic Huber, and Dynamic Tukey. The two-backbone network consists of ResneXt-50 and Inception-v3 merged into a single architecture. As will be seen in the experimental section, the performance using the two contributions without the ensemble is already better than that of the state-of-the-art methods; the ensemble shown in Fig. 1 further improves the results.
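The averaging step itself is straightforward. A minimal sketch, assuming five trained PyTorch regressors that each return a (batch, 1) score tensor:

```python
import torch

@torch.no_grad()
def ensemble_score(models, face_batch):
    """Average the beauty scores predicted by the trained regressors."""
    preds = []
    for model in models:
        model.eval()
        preds.append(model(face_batch).squeeze(-1))  # (batch,) scores
    return torch.stack(preds, dim=0).mean(dim=0)     # mean over the models
```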

3.3 Face preprocessing

In the preprocessing phase of the faces, we adopted the 2D alignment scheme described in [39] and [40]. This scheme is summarized in Fig. 2. To obtain a rectified and cropped face region, we apply three steps to the raw face image. First, the face image is rotated so that the two eyes have the same vertical coordinates. For the SCUT-FBP5500 dataset [18], we used the face landmarks provided by the authors of this dataset. For the KDEF-PT dataset [11], we used the Dlib library [41] to obtain these landmarks. Once the image and its associated detected points are rotated in the image plane, the three furthest face points in the left, right, and bottom directions are selected as the three boundaries of the face. We denote the distance from the lower boundary to the vertical position of the eyes as d1. The upper boundary of the face is set at a distance d2 from the eyes, which is set to d2 = 0.6 d1. It is worth noting that the distance d2 determines the region of the forehead included in the cropped face. Empirically, we found that d2 = 0.6 d1 works well. Finally, the face ROI is obtained by cropping the face using the four specified boundaries. The obtained ROI is then resized to a fixed size that depends on the input size of the corresponding convolutional neural network.
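The following sketch illustrates these alignment and cropping steps with OpenCV; the landmark dictionary layout and helper names are our own assumptions, not the exact implementation:

```python
import cv2
import numpy as np

def align_and_crop(image, landmarks, d2_ratio=0.6, out_size=224):
    """Rotate so the eye line is horizontal, then crop with the four boundaries."""
    left_eye, right_eye = landmarks["left_eye"], landmarks["right_eye"]  # (x, y)
    center = ((left_eye[0] + right_eye[0]) / 2.0, (left_eye[1] + right_eye[1]) / 2.0)
    angle = np.degrees(np.arctan2(right_eye[1] - left_eye[1],
                                  right_eye[0] - left_eye[0]))
    M = cv2.getRotationMatrix2D(center, angle, 1.0)
    rotated = cv2.warpAffine(image, M, (image.shape[1], image.shape[0]))
    pts = np.asarray(landmarks["points"], dtype=np.float32)   # all face landmarks
    pts = cv2.transform(pts[None], M)[0]                      # rotate landmarks too
    left = max(pts[:, 0].min(), 0)                            # left boundary
    right = pts[:, 0].max()                                   # right boundary
    bottom = pts[:, 1].max()                                  # chin boundary
    eyes_y = center[1]            # eye-line height is unchanged by this rotation
    d1 = bottom - eyes_y
    top = max(eyes_y - d2_ratio * d1, 0)                      # d2 = 0.6 * d1
    roi = rotated[int(top):int(bottom), int(left):int(right)]
    return cv2.resize(roi, (out_size, out_size))
```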

Fig. 2

Face Region of Interest. The left image is an original image from the database SCUT-FBP5500 [18]. The second image is the rotated face with its 86 detected landmarks used to estimate the three face boundary lines (right, left, and bottom). These boundaries correspond to the three points ∗ marked in blue. The third image shows how the upper boundary of the face is determined. It is located at a distance d2 = 0.6d1 from the vertical position of the two eyes. The fourth image shows the cropped and rescaled face image with 224 × 224 pixels. Note that the distances D1 and D2 are constant for all cropped faces

3.4 Two branches architecture

Recently, many successful deep architectures have been proposed for a variety of computer vision tasks. In our solution, we employ two such architectures jointly to exploit the different capabilities of deep neural networks. Since FBP image data is limited, we propose to jointly exploit the low-level and high-level feature extraction capabilities of two powerful architectures. Figure 3 summarizes the introduced two-branch architecture. The first and second branches are the ResneXt-50 and Inception-v3 architectures, respectively, with their decision layers removed. In our proposed two-backbone architecture, we added the FC1 layer, which maps the encoded deep features of the ResneXt-50 branch (a vector of dimension 2048) to 1024 neurons. Similarly, we added the FC2 layer, which maps the embedded deep features of the Inception-v3 branch (a vector of dimension 2048) to 1024 neurons. FC1 and FC2 are concatenated into a single vector FC, which is followed by the FC3 layer that performs the regression, namely the prediction of the beauty score. Note that the weights of both branches are those of the pre-trained ResneXt-50 and Inception-v3 models (trained on the ImageNet challenge database [12]), while the FC1, FC2, and FC3 layers are randomly initialized. Our proposed two-branch network is called the 2B-IncRex architecture. In the training phase, we fine-tune this architecture for FBP.
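A minimal PyTorch sketch of this two-backbone design is given below. It assumes torchvision ≥ 0.13 (where passing aux_logits=False together with pretrained weights strips Inception-v3's auxiliary classifier after loading) and a common input size of 299 × 299 pixels, the minimum required by Inception-v3; the layer names follow the text:

```python
import torch
import torch.nn as nn
from torchvision import models

class TwoBackboneNet(nn.Module):
    """Sketch of a 2B-IncRex-style network: two ImageNet-pretrained backbones,
    two 2048->1024 projections, concatenation, and a regression head."""
    def __init__(self):
        super().__init__()
        self.resnext = models.resnext50_32x4d(weights="IMAGENET1K_V1")
        self.resnext.fc = nn.Identity()        # expose 2048-d features
        self.inception = models.inception_v3(weights="IMAGENET1K_V1",
                                             aux_logits=False)
        self.inception.fc = nn.Identity()      # expose 2048-d features
        self.fc1 = nn.Linear(2048, 1024)       # FC1: ResneXt-50 branch
        self.fc2 = nn.Linear(2048, 1024)       # FC2: Inception-v3 branch
        self.fc3 = nn.Linear(2048, 1)          # FC3: beauty score

    def forward(self, x):                      # x: (N, 3, 299, 299)
        f1 = self.fc1(self.resnext(x))
        f2 = self.fc2(self.inception(x))
        fc = torch.cat([f1, f2], dim=1)        # concatenated 2048-d vector FC
        return self.fc3(fc).squeeze(-1)
```

Feeding both branches a single shared input size is our simplification; the two backbones natively use 224 × 224 and 299 × 299 inputs, and both tolerate 299 × 299 thanks to their adaptive pooling layers.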

Fig. 3

Our proposed two branches network 2B-IncRex

3.5 Loss Functions: the use of dynamic robust losses

During convolutional network training, the loss function measures the error (the loss) between the ground truth and the current predicted values. Training a CNN aims to minimize the loss based on the gradients of the loss function used to update the weights of the network. In this section, we will describe the loss functions we used for training our proposed architectures. We will also introduce a dynamic law that adjusts the parameters of these robust losses during training. The losses are computed for a batch of N face images. Let yi denote the ground truth score of the ith image, and \(\hat {y_{i}}\) denote the predicted score.

3.5.1 Dynamic Parameterized SmoothL1 (ParamSmoothL1) loss function

The loss function SmoothL1 produces a criterion that uses a quadratic term when the absolute element-wise error falls below 1, and the absolute error otherwise. It is commonly used for training deep CNN-based regressions because it is less sensitive to the presence of outliers than the Mean Square Error loss function and in some cases prevents exploding gradients [42]. The SmoothL1 loss function of N images is defined by:

$$ L_{SmoothL1} = \frac{1}{N} \sum\limits_{i=1}^{N}{z_{i}} $$
(1)

where N is the batch size and zi is given by:

$$ z_{i}= \left\{\begin{array}{ll} 0.5 (y_{i} - \hat{y}_{i})^{2}, & \text{if } \lvert y_{i} - \hat{y}_{i} \rvert < 1\\ \lvert y_{i} - \hat{y}_{i} \rvert - 0.5, & \text{otherwise} \end{array}\right. $$
(2)

In this work, we introduce the Dynamic Parametrized SmoothL1 loss function. First, we present the Parametrized SmoothL1 loss. We then present its dynamic variant.

Since the threshold can differ from one task to another, we propose a Parameterized SmoothL1 loss function, defined by:

$$ L_{ParamSmoothL1} = \frac{1}{N} \sum\limits_{i=1}^{N}{z_{i}} $$
(3)

where N is the batch size and zi is given by:

$$ z_{i}= \left\{\begin{array}{ll} 0.5 (y_{i} - \hat{y}_{i})^{2}, & \text{if } \lvert y_{i} - \hat{y}_{i} \rvert \leq \alpha \\ \lvert y_{i} - \hat{y}_{i} \rvert + 0.5 \alpha^{2} - \alpha, & \text{otherwise} \end{array}\right. $$
(4)

where α is a tunable parameter. Our proposed dynamic robust loss functions are motivated by the following observation: during the training of CNNs, the robust loss functions can be adjusted as the training progresses, since the model evolves and the set of outlier examples may vary. In the early stages of training, the model is usually neither stable nor accurate enough to handle outlier examples correctly; it is therefore advantageous to use a mostly quadratic loss. Towards the end of training, the model has typically become accurate enough to handle the outliers, so it is useful to use a more rigorous robust loss in which the range of non-outlier errors is relatively small. In other words, we can schedule the parameter of the robust loss function (ParamSmoothL1) so that it is initialized with a maximum value and decreases monotonically as training progresses. From a practical point of view, it is extremely difficult to know the best value of α in advance. However, the variation interval [αmin, αmax] can be known in advance. Therefore, to better fit the robust loss function to the training progress, we propose a dynamic parameter α that decreases according to a parabolic law as a function of the epoch number. The current value of α is given by:

$$ \alpha_{e}= \alpha_{max} - (\alpha_{max} - \alpha_{min}) \left( \frac{e}{n_{e}} \right)^{2} $$
(5)

where αe is the value of α in the current epoch e, which varies between 1 and the total number of epochs ne. αmax and αmin are the maximum and minimum values of α. In this paper, we denote the proposed Dynamic Parameterized SmoothL1 loss by Dynamic ParamSmoothL1. Figure 4 illustrates the variation of α using the proposed dynamic law (5) as a function of the epoch number; here αmax and αmin are fixed at 0.7 and 0.3, respectively. Our dynamic law was inspired by the schedules used to control the learning rate in stochastic gradient descent methods [43].
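A minimal sketch of the parabolic schedule (5) and the ParamSmoothL1 loss (3)-(4) could look as follows (function names are ours):

```python
import torch

def dynamic_alpha(epoch, n_epochs, a_max=0.7, a_min=0.3):
    """Parabolic decay of the loss parameter, Eq. (5)."""
    return a_max - (a_max - a_min) * (epoch / n_epochs) ** 2

def param_smooth_l1(pred, target, alpha):
    """ParamSmoothL1 loss, Eqs. (3)-(4): quadratic for |e| <= alpha, linear beyond."""
    err = torch.abs(target - pred)
    z = torch.where(err <= alpha,
                    0.5 * err ** 2,
                    err + 0.5 * alpha ** 2 - alpha)  # continuous at |e| = alpha
    return z.mean()
```

The schedule is evaluated once at the start of each epoch, and the resulting α is used for all batches of that epoch.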

Fig. 4

Dynamic parameter α that decreases from 0.7 to 0.3

Fig. 5

Illustration of two loss functions: the Mean Square Error loss (L2 loss) and the Huber loss function with four β values (0.7, 0.5, 0.3 and 0.1)

3.5.2 Dynamic Huber loss function

Huber is another robust loss function that is less sensitive to outliers in the data than the Mean Square Error loss function. For N training images, the Huber loss function is defined by [44]:

$$ L_{Huber} = \frac{1}{N} \sum\limits_{i=1}^{N}{z_{i}} $$
(6)

where N is the batch size and zi is defined by:

$$ z_{i}= \left\{\begin{array}{ll} 0.5 (y_{i} - \hat{y}_{i})^{2}, & \text{if } \lvert y_{i} - \hat{y}_{i} \rvert \leq \beta \\ \beta \lvert y_{i} - \hat{y}_{i} \rvert - 0.5 \beta^{2}, & \text{otherwise} \end{array}\right. $$
(7)

where β is a tunable parameter. Figure 5 shows a visualization of the Huber loss function with four β values (0.7, 0.5, 0.3 and 0.1) together with the L2 loss function.

Similar to ParamSmoothL1 loss, we suggest using dynamic β during training according to the equation:

$$ \beta_{e}= \beta_{max} - (\beta_{max} - \beta_{min}) \left( \frac{e}{n_{e}} \right)^{2} $$
(8)

where βe is the value of β in the current epoch e, with e increasing from 1 to the total number of epochs ne. βmax and βmin are the defined maximum and minimum values of β.
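Since the Huber loss of (6)-(7) matches PyTorch's built-in nn.HuberLoss (with δ playing the role of β), the dynamic variant can be sketched by simply rebuilding the criterion at each epoch; this is an assumption of ours, not necessarily the authors' implementation:

```python
import torch.nn as nn

n_epochs, beta_max, beta_min = 40, 0.7, 0.3
for epoch in range(1, n_epochs + 1):
    beta = beta_max - (beta_max - beta_min) * (epoch / n_epochs) ** 2  # Eq. (8)
    criterion = nn.HuberLoss(delta=beta)  # stateless, so rebuilding is cheap
    # ... inner training loop: loss = criterion(model(images), scores)
```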

3.5.3 Dynamic Tukey loss function

The Tukey loss function [45] suppresses the influence of outlier data during backpropagation by reducing the magnitude of its gradient toward zero. Another interesting property of this loss function is its smooth transition between inliers and outliers [46]. The Tukey loss function is defined by:

$$ L_{Tukey} = \frac{1}{N} \sum\limits_{i=1}^{N}{z_{i}} $$
(9)

where N is the batch size and zi is given by:

$$ z_{i} = \left\{\begin{array}{ll} \frac{c^{2}}{6} \left[1 - \left(1-\left(\frac{\lvert y_{i} - \hat{y}_{i} \rvert}{c}\right)^{2}\right)^{3}\right], & \text{if } \lvert y_{i} - \hat{y}_{i} \rvert \leq c \\ \frac{c^{2}}{6}, & \text{otherwise} \end{array}\right. $$
(10)

where c is an adjustable parameter. Similar to ParamSmoothL1 and Huber losses, we propose to use dynamic c during training through the equation:

$$ c_{e}= c_{max} - (c_{max} - c_{min}) \left( \frac{e}{n_{e}} \right)^{2} $$
(11)

where ce is the value of c in the current epoch e, with e increasing from 1 to the total number of epochs ne. cmax and cmin are the maximum and minimum values of c.
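Tukey's biweight loss is not built into PyTorch, so a sketch of (9)-(10) as a custom function could be:

```python
import torch

def tukey_loss(pred, target, c):
    """Tukey biweight loss, Eqs. (9)-(10); the constant c**2/6 branch for
    outliers drives their gradient to zero."""
    err = torch.abs(target - pred)
    inlier = (c ** 2 / 6.0) * (1.0 - (1.0 - (err / c) ** 2) ** 3)
    z = torch.where(err <= c, inlier, torch.full_like(err, c ** 2 / 6.0))
    return z.mean()
```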

4 Experimental setting

4.1 Database and evaluation protocols

To evaluate the performance of our approach, we used the SCUT-FBP5500 database [18]. It consists of 5500 frontal faces of subjects with different attributes: age (from 15 to 60), gender (male/female), and ethnicity (Asian/Caucasian). Each facial image was labelled with a beauty score in the range [1-5] by 60 volunteers. In addition, each facial image has 86 facial landmarks. Figure 6 shows some facial samples with their corresponding face beauty scores. The creators of the SCUT-FBP5500 database provided two evaluation scenarios [18]. In the first scenario, the data were divided into training and test splits (60-40%). In the second scenario, the data were divided into 5 folds to perform five-fold cross-validation. In our evaluations, we use both scenarios.

Fig. 6

Facial beauty samples from the SCUT-FBP5500 database. (a) Female Asian samples; their scores from left to right are 1.88, 3.00, 3.93, and 4.28. (b) Male Asian samples; their scores from left to right are 1.73, 2.48, 3.53, and 4.43. (c) Female Caucasian samples; their scores from left to right are 1.93, 2.87, 3.63, and 4.7. (d) Male Caucasian samples; their scores from left to right are 1.88, 2.67, 3.27, and 4.43

Fig. 7

Facial beauty samples from the KDEF-PT dataset. (a) Female samples; their scores from left to right are 2.78, 4.81, and 3.47 for angry, happy, and neutral faces, respectively. (b) Male samples; their scores from left to right are 2.43, 3.34, and 3.24 for angry, happy, and neutral faces, respectively

In addition to the SCUT-FBP5500 dataset, the KDEF-PT dataset [11] was used to evaluate the performance of our approach in the presence of facial expressions. KDEF-PT consists of 70 subjects (35 females and 35 males). Each subject performs three facial expressions, namely joy, neutrality, and anger. To determine facial attractiveness, each image was rated by participants on a 7-point scale (1 = not at all attractive to 7 = very attractive). Each image was rated by a varying number of subjects (from 34 to 42). The attractiveness score is the average of the subjects' ratings. Fig. 7 shows two examples from the KDEF-PT dataset. In our experiments, we split the 70 subjects into a training set and a validation set (80% and 20%) so that no subject appears in both sets. Since there are only 168 training images, we used the trained models from the first fold of the SCUT-FBP5500 dataset and then performed transfer learning (model fine-tuning) on the training part of the KDEF-PT dataset.

4.2 Evaluation metrics

To evaluate the performance of each model, four evaluation metrics are used: Mean Absolute Error (MAE), Root Mean Square Error (RMSE), Pearson Correlation coefficient (PC), and the 𝜖-error. Let us consider Y = (y1, y2, ..., yn) the ground-truth scores of the n tested images and \(\hat {Y} = (\hat {y}_{1}, \hat {y}_{2}, ..., \hat {y}_{n})\) their corresponding estimated scores, where n denotes the number of tested face images. The evaluation metrics are defined as follows:

Mean Absolute Error (MAE): MAE is defined by:

$$ MAE = \frac{1}{n} \sum\limits_{i=1}^{n}{\lvert y_{i} - \hat{y}_{i} \rvert} $$
(12)

MAE is a scale-dependent accuracy measure; that is, it is expressed on the same scale as the data being measured.

Root Mean Square Error (RMSE): RMSE is defined by:

$$ RMSE = \sqrt{\frac{1}{n} \sum\limits_{i=1}^{n}{(y_{i} - \hat{y}_{i})^{2}}} $$
(13)

The RMSE is another scale-dependent accuracy measure. Unlike MAE, the effect of any error on the RMSE is proportional to the squared error; thus, larger errors have a disproportionately large effect on the final RMSE. Consequently, the RMSE is sensitive to outliers.

Pearson Correlation coefficient (PC): PC was developed by Karl Pearson [47] and it is defined by:

$$ PC = \frac{{\sum}_{i=1}^{n}{\left(y_{i} - \overline{y}\right) \left(\hat{y}_{i} - \overline{\hat{y}}\right)}}{\sqrt{{\sum}_{i=1}^{n}\left(y_{i} - \overline{y}\right)^{2}} \sqrt{{\sum}_{i=1}^{n}\left(\hat{y}_{i} - \overline{\hat{y}}\right)^{2}}} $$
(14)

where \(\overline{y}\) and \(\overline{\hat{y}}\) are the means of the ground-truth scores and the estimated scores, respectively. PC takes values between +1 and -1; it is a statistic that measures the linear correlation between the two variables Y and \(\hat {Y}\). A value of +1 means total positive linear correlation, 0 means no linear correlation, and -1 means total negative linear correlation.

𝜖-error: The 𝜖-error is defined by:

$$ \epsilon-error = \frac{1}{n} \sum\limits_{i=1}^{n}{\left( 1- \exp \left( -\frac{\left( y_{i} - \hat{y}_{i}\right)^{2}}{2 {\sigma_{i}^{2}}} \right) \right)} $$
(15)

where σi is the standard deviation of the scores given by all raters of image i. The 𝜖-error is the average of the per-image terms \(\epsilon-error_{i} = 1- \exp (-\frac{(y_{i} - \hat {y}_{i})^{2}}{2 {\sigma _{i}^{2}}})\). When the absolute error of image i goes toward zero (i.e., \(y_{i} = \hat {y}_{i}\)), 𝜖-errori is zero. In contrast, when the absolute error is large, the 𝜖-error takes into account the rating uncertainty, represented by \({\sigma _{i}^{2}}\). In more detail, the division by \({\sigma _{i}^{2}}\) yields a smaller contribution to the 𝜖-error when the rating uncertainty is large, and vice versa.
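For reference, a compact NumPy sketch of the four metrics (12)-(15), where sigma holds the per-image rater standard deviations, could be:

```python
import numpy as np

def fbp_metrics(y, y_hat, sigma):
    """MAE, RMSE, PC, and epsilon-error, Eqs. (12)-(15)."""
    y, y_hat, sigma = map(np.asarray, (y, y_hat, sigma))
    mae = np.abs(y - y_hat).mean()
    rmse = np.sqrt(((y - y_hat) ** 2).mean())
    pc = np.corrcoef(y, y_hat)[0, 1]  # Pearson correlation coefficient
    eps = (1.0 - np.exp(-(y - y_hat) ** 2 / (2.0 * sigma ** 2))).mean()
    return mae, rmse, pc, eps
```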

4.3 Experimental setup

All experiments are carried out with the PyTorch library [48] on an NVIDIA GeForce TITAN RTX GPU with 24 GB of memory. All networks are trained for 40 epochs using the Adam optimizer [49] with a batch size of 15. The initial learning rate is 1e-4 for the first 20 epochs, decreases to 1e-5 for the next 10 epochs, and decreases again to 1e-6 for the last 10 epochs. Active data augmentation is performed by rotating the input face by a random angle in [-5, 5] degrees. For all experiments, the reported results correspond to the best PC on the test data over the 40 training epochs.
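This learning-rate schedule corresponds to a standard step decay; a sketch of the setup in PyTorch (the model line is a placeholder):

```python
import torch
from torchvision import models

model = models.resnext50_32x4d()  # placeholder; any of the networks above
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)
# learning rate: 1e-4 (epochs 1-20) -> 1e-5 (21-30) -> 1e-6 (31-40)
scheduler = torch.optim.lr_scheduler.MultiStepLR(optimizer,
                                                 milestones=[20, 30], gamma=0.1)
for epoch in range(1, 41):
    # ... one training epoch with batch size 15 and random rotation in [-5, 5] degrees
    scheduler.step()
```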

5 Performance evaluation on SCUT-FBP5500 dataset

5.1 Experimental results on the 60-40% split scenario

In this section, we limit the study to the provided 60-40% split.

5.1.1 Raw input vs the proposed face preprocessing

Preprocessing of faces is considered an important step in face analysis by machine learning. However, deep learning architectures are capable of learning high-level features in scenarios with shape and rotation variations. In this section, we investigate the impact of the preprocessing step on estimating the beauty of a face. For this purpose, we used ResneXt-50 and Inception-v3 with the default MSE loss function and considered two input scenarios: raw face images and aligned face images. Table 1 shows the results obtained. From this table, it can be seen that preprocessing the face image provides a significant improvement for the ResneXt-50 architecture, whereas the improvement is small for Inception-v3. In general, face alignment and cropping can support the training of CNN architectures by discarding background features and prioritizing face features.

Table 1 Face beauty prediction using ResneXt-50 and Inception-v3 networks with MSE loss function and two input image scenarios (The raw image and the detected face with our preprocessing scheme)

5.1.2 Dynamic vs fixed loss function parameter

To investigate the effectiveness of the proposed dynamic law for the parametric robust loss functions, we use ResneXt-50 in two cases: (i) a loss adopting a fixed parameter and (ii) a loss adopting a dynamic parameter using the parabolic law. The considered parametric loss functions are the proposed ParamSmoothL1, Huber and Tukey. For each parametric loss, we choose an interval for the dynamic law. For the fixed values, we choose the left and right limits of the interval and a set of values within the interval. The interval for the dynamic law is chosen experimentally and determined independently for each loss function. We compare the dynamic law not only with the fixed values, but also with their average.

The results obtained are summarized in Table 2. For both the ParamSmoothL1 and Huber loss functions, the dynamic law interval is set to [0.7-0.3] and the fixed values are {0.7, 0.6, 0.5, 0.4, 0.3}. Table 2 shows that the ParamSmoothL1 and Huber loss functions with the proposed dynamic law perform better than the fixed values and their average. For the Tukey loss function, the dynamic interval of the c parameter is first set to [2-1.5] with fixed values {2, 1.7, 1.5}. Similar to ParamSmoothL1 and Huber, the Dynamic Tukey loss function using the parabolic law achieves better performance than the fixed c values and their average. Moreover, the dynamic interval [2-1] achieved better performance than the interval [2-1.5], as shown in Table 2. Based on these results, the dynamic intervals for the parameters α, β, and c are set to [0.7-0.3], [0.7-0.3], and [2-1] for ParamSmoothL1, Huber, and Tukey, respectively.

Table 2 Comparison between dynamic and fixed parameters of ParamSmoothL1, Huber and Tukey loss functions using ResneXt-50 network

5.1.3 Two branches vs one branch

The goal of this section is to compare our proposed two-backbone architecture (2B-IncRex) with the pre-trained CNNs used to create it. To this end, we used five loss functions (L1, MSE, Dynamic ParamSmoothL1, Dynamic Huber, and Dynamic Tukey) to test ResneXt-50, Inception-v3, and 2B-IncRex, as shown in Tables 3, 4, and 5, respectively. In addition to comparing our proposed two-backbone architecture with the individual backbones, these results also compare our proposed dynamic loss functions with the standard loss functions (L1 and MSE).

Table 3 Facial Beauty Prediction using ResneXt-50 Network with five loss functions (L1, MSE, Dynamic ParamSmoothL1, Dynamic Huber and Dynamic Tukey losses)
Table 4 Facial Beauty Prediction using Inception-v3 Network with five loss functions (L1, MSE, Dynamic ParamSmoothL1, Dynamic Huber and Dynamic Tukey)
Table 5 Facial Beauty Prediction using the proposed two backbones Network (2B-IncRex) with five loss functions (L1, MSE, Dynamic ParamSmoothL1, Dynamic Huber and Dynamic Tukey losses)

Based on the results of ResneXt-50 in Table 3, we can see that our proposed dynamic loss function ParamSmoothL1 achieves the best performance. Moreover, the other two dynamic loss functions (Huber and Tukey) obtained similar results to the Dynamic ParamSmoothL1 loss function and better performance than L1 and MSE. This proves the efficiency of using the dynamic law not only compared to fixed parametric losses (as in Table 2), but also compared to other loss functions. From the Inception-v3 results in Table 4, we can also see that the dynamic loss functions give better results than L1 and MSE. For the Inception-v3 architecture, the Dynamic Tukey loss function achieved the best performance.

The results of 2B-IncRex with the five loss functions are summarized in Table 5. Again, we note that the proposed dynamic loss functions achieve better performance than L1 and MSE. Moreover, the proposed Dynamic ParamSmoothL1 achieved the best performance for our proposed 2B-IncRex architecture. From the results of ResneXt-50, Inception-v3, and 2B-IncRex (Tables 3, 4, and 5), we conclude that the proposed 2B-IncRex converges to the lowest error compared to ResneXt-50 and Inception-v3. This proves the effectiveness of both our proposed two-backbone CNN architecture and the proposed dynamic law for the robust loss functions.

5.1.4 CNN ensemble

The goal of this section is to use the trained models from Section 5.1.3 to improve the performance of FBP. To this end, we select the best models for the two individual backbones (ResneXt-50 and Inception-v3) and the three best models of the proposed 2B-IncRex. In summary, three ensemble scenarios were tested. First, we combine the two single backbones (ResneXt-50 and Inception-v3). Second, the three best models of the proposed 2B-IncRex are combined. The third scenario is the combination of all five models (the best individual backbones and the best three 2B-IncRex models, corresponding to the models trained with the dynamic robust losses); the results are shown in Table 6. Since the creators of the SCUT-FBP5500 dataset provided two evaluation scenarios (60-40% and five-fold cross-validation), each considering only training and test splits, we considered the last model obtained after training for 40 epochs. The goal of selecting the last model in our ensemble approach is to use the test data only once. For an input image fed to the regressors (trained CNN models), the ensemble prediction is obtained by computing the average of the different regressors' outputs. From the results of Table 6, we can observe the following:

  • The ensemble of the two individual backbones outperforms the individual CNN backbones.

  • The ensemble of our proposed 2B-IncRex trained with the proposed dynamic losses outperforms the ensemble of the two individual backbones.

  • Finally, we find that the ensemble of the two single backbones and the three 2B-IncRex models improves the results compared to the previous ensemble schemes. Based on these results, we consider this last ensemble as our proposed solution for FBP. We refer to it as Dynamic ER-CNN, since it benefits from the proposed dynamic loss functions and ensemble of CNN architectures for the regression task.

Table 6 Facial Beauty Prediction using the proposed CNN ensemble of different trained models on 60-40% data split

From the above results, we conclude that all ensemble scenarios improve face beauty estimation. Although the second ensemble scenario achieves a better result than the first, the combination of both further improves the results. This proves that the individual backbones contribute estimator diversity to the proposed Dynamic ER-CNN solution.

5.2 Experimental results using the five fold cross-validation scenario

In addition to the 60-40% evaluation scheme of the SCUT-FBP5500 dataset, the creators of this dataset provided five-fold cross-validation splits. In this section, we test the best single- and two-backbone solutions identified in the 60-40% scenario. Specifically, these are the following solutions: ResneXt-50 trained with Dynamic ParamSmoothL1, Inception-v3 trained with Dynamic Tukey, and 2B-IncRex trained with the three dynamic robust loss functions (ParamSmoothL1, Huber, and Tukey). The obtained results are summarized in Table 7. For each architecture and corresponding loss function, we report the results using four evaluation metrics (PC, MAE, RMSE, and 𝜖-error) for each fold and their average over the five folds.

Table 7 Five-fold cross-validation of Facial Beauty Prediction using single backbone networks (Inception-v3 with Dynamic Tukey loss and ResneXt-50 with Dynamic ParamSmoothL1 loss) and the two-backbone network (2B-IncRex with Dynamic ParamSmoothL1, Dynamic Huber, and Dynamic Tukey losses)

Similar to the 60-40% split results, the two-backbone architecture achieves higher performance than the single backbones, which again proves the efficiency of the proposed 2B-IncRex architecture. On the other hand, we notice that the two-backbone architecture achieves similar performance with the different dynamic robust losses, with a slight preference for the Dynamic ParamSmoothL1 loss function according to the MAE, RMSE, and 𝜖-error metrics.

Similar to Section 5.1.4, we tested three ensemble scenarios: (i) the ensemble of the single backbones (ResneXt-50 and Inception-v3), (ii) the ensemble of the proposed 2B-IncRex architecture trained with the three dynamic robust losses, and (iii) the ensemble of all models used in the first two scenarios, denoted by Dynamic ER-CNN. Table 8 summarizes the obtained results of the three ensemble scenarios for each fold and their mean. From the results for one and two backbones (Table 7) and the ensemble scenarios (Table 8), we note the following:

  • The fusion of the individual backbones (scenario (i)) performs better than the individual backbone networks (Inception-v3 with Dynamic Tukey and ResneXt-50 with the Dynamic ParamSmoothL1 loss function).

  • The second ensemble scenario shows that the ensemble of 2B-IncRex models (scenario (ii)) outperforms all results obtained by the individual two-backbone networks (2B-IncRex with the three dynamic robust loss functions).

  • Our proposed ensemble approach Dynamic ER-CNN (scenario (iii)) outperforms not only the single-backbone and 2B-IncRex networks, but also their respective ensembles.

Table 8 Five folds cross-validation of Facial Beauty Prediction using the proposed CNN ensemble of different trained models

The comparison between the results of Tables 7 and 8 confirms the effectiveness of the proposed Dynamic ER-CNN for facial beauty assessment.

6 Performance evaluation on KDEF-PT dataset

In this experiment, we used the KDEF-PT dataset [11], which contains ratings of facial beauty in the presence of facial expressions. Table 9 summarizes the results obtained with the selected individual CNN architectures in our ensemble trained with the proposed dynamic loss functions. Similar to the ensemble experiments in the SCUT-FBP5500 dataset, three ensemble scenarios are evaluated. From the results of Table 9, we can make the following observations:

  • The proposed 2B-IncRex trained with the various dynamic robust losses performs better than the individual backbones and their ensemble.

  • The fusion of the individual backbones performs better than the individual backbone networks (Inception-v3 with Dynamic Tukey and ResneXt-50 with the Dynamic ParamSmoothL1 loss function).

  • The 2B-IncRex-based ensemble scenario outperforms all results obtained by the individual two-backbone networks (2B-IncRex with the three dynamic robust loss functions).

  • Our proposed ensemble approach Dynamic ER-CNN outperforms not only the single-backbone and 2B-IncRex networks, but also their respective ensembles.

Table 9 Facial Beauty Prediction using single-backbone CNN architectures (Inception-v3 and ResneXt-50) and our proposed 2B-IncRex architecture. Furthermore, the ensemble of these approaches and our proposed Dynamic ER-CNN approach are evaluated. All these methods are tested on the KDEF-PT dataset

Despite the presence of facial expressions and a limited amount of data, our approach achieves very good performance in estimating facial beauty using transfer learning (model fine-tuning). Moreover, the 𝜖-error shows that our approach performs very well despite the high labeling uncertainty of the ground truth. From the above results and discussion, it is clear that all of our proposed elements (2B-IncRex, the dynamic law for robust loss functions, and the ensemble) and their combination prove their efficiency for FBP. To the best of our knowledge, this is the first time that facial beauty estimation has been evaluated with machine learning methods in the presence of facial expressions on the KDEF-PT dataset.

7 Discussion and comparison

The goal of this section is to compare the performance of our proposed solutions with state-of-the-art approaches. In summary, this comparison examines the two evaluation scenarios of SCUT-FBP5500 (60-40% and five-fold cross-validation). Table 10 summarizes the comparison with the state-of-the-art approaches using the first evaluation scenario (60-40% split). The comparison consists of two parts. First, we compare our proposed Dynamic ER-CNN with the state-of-the-art approaches on three evaluation metrics (PC, MAE, and RMSE). This comparison shows that our proposed Dynamic ER-CNN outperforms the state-of-the-art methods. In addition, our proposed 2B-IncRex architecture trained with the proposed Dynamic ParamSmoothL1 loss function alone achieves better performance than the state-of-the-art approaches. These comparisons prove that the superiority of our approach over the state-of-the-art methods is not only due to the ensemble of models: both the proposed 2B-IncRex network and the dynamic parabolic law for the robust loss functions play a crucial role in achieving this performance.

Table 10 Comparison with the state-of-the-art methods using the 60-40% split. Dynamic ParamSmoothL* is our 2B-IncRex network trained using the Dynamic ParamSmoothL1 loss function

Similar to the comparison with state-of-the-art approaches in the 60-40% evaluation scenario, we used our proposed Dynamic ER-CNN and the 2B-IncRex architecture trained with Dynamic ParamSmoothL1 in the five-fold scenario. From Table 11, our approach (Dynamic ER-CNN) outperforms all state-of-the-art methods on the three evaluation metrics. Moreover, not only does our Dynamic ER-CNN outperform the state-of-the-art methods, but our proposed two-backbone network with Dynamic ParamSmoothL1 also achieves better performance than the state-of-the-art methods. This confirms that the strength of our proposed Dynamic ER-CNN is not only due to the ensemble of different regressors: both the proposed 2B-IncRex network and the dynamic loss functions play a crucial role in outperforming state-of-the-art methods. The efficiency of our proposed Dynamic ER-CNN has thus been demonstrated in both the 60-40% and five-fold cross-validation scenarios.

Table 11 Comparison with the state-of-the-art methods using the five-fold cross-validation scenario. + The authors of [27] used ResNeXt-50 as the backbone network to re-implement the methods of [50] and [51] on the newly constructed SCUT-FBP5500 dataset. Dynamic ParamSmoothL* is our 2B-IncRex network trained using the Dynamic ParamSmoothL1 loss function

8 Conclusion

In this paper, we presented a framework based on an ensemble of regression CNNs (Dynamic ER-CNN). Our proposed approach averages the outputs of five trained CNN architectures. The CNNs used are ResneXt-50, Inception-v3, and the proposed 2B-IncRex architecture. These architectures were trained with the proposed Dynamic ParamSmoothL1, Dynamic Huber, and Dynamic Tukey losses. For these dynamic loss functions (ParamSmoothL1, Huber, and Tukey), a parabolic law is proposed to decrease the loss parameter during training. The dynamic schemes were found to be very efficient both in terms of performance and in avoiding the grid search for the best parameter value, which has a high computational cost. Moreover, the dynamic loss functions performed better than two standard loss functions, namely L1 and MSE.

The obtained results show the superiority of the proposed 2B-IncRex over the ResneXt-50 and Inception-v3 networks. Moreover, the proposed approach (Dynamic ER-CNN) outperformed not only the one- and two-branch networks, but also their fused models. On the other hand, the proposed approach performed better than the state-of-the-art methods in both the 60-40% and cross-validation experiments for the three evaluation metrics (PC, MAE, and RMSE) on the SCUT-FBP5500 dataset. The experimental results on KDEF-PT proved the efficiency of our approach for estimating facial beauty with transfer learning, despite the presence of facial expressions and limited data. We also found that using the proposed dynamic robust loss functions generally leads to better estimates.