1 Introduction

In the last decade, soft biometric traits have been widely used for person identification because of their robustness to noise, non-intrusiveness, and privacy preservation. In recent years, deep learning approaches have also been proposed to extract soft-biometric attributes from face images. However, the high performance they achieve comes at the cost of high computational power and a large training dataset. Liu et al. [1] proposed a method based on two CNNs, trained for face localization (LNet) and attribute prediction (ANet). The top network layer, FC, is exploited to learn identity-related features, such as gender and race, while layers C3 and C4 are exploited to extract identity-unrelated attributes, such as facial expression or wearing a hat or sunglasses. Samangouei et al. [2,3,4] proposed a CNN architecture suitable for mobile devices, which is based on the analysis of face parts. Dhar et al. [5] considered the usefulness of the outputs of the internal layers of two deep convolutional networks, ResNet-101 and Inception-ResNet-v2, for the prediction of facial attributes. Izadi [6] proposed fusing the extracted facial attributes with the face image to perform face recognition on a shared CNN architecture. Other recent works [7, 8] showed that the final representation computed by a deep convolutional neural network embeds information not only about identity but also about head pose and illumination.

In this paper we propose to extract information from an internal layer of the HMAX network in order to predict different facial attributes. The HMAX model, which was developed before deep learning took over many computer vision problems, demonstrated the feasibility of a biologically inspired neural architecture for face recognition. The model was tested on several publicly available databases such as LFW, PubFig and SURF-W [9], providing state-of-the-art results. In [10], a new C3EF layer, inspired by the ventral and dorsal streams of the visual cortex, was added to perform view-independent face recognition. Hu [11] proposed a version of the HMAX model, named ‘sparse HMAX’, addressing the local-to-global structure of the hierarchy, where the S2 bases are learned by sparse coding.

The novel hybrid system proposed here is based on the HMAX network architecture. The outputs of the internal S2 layer are used as seeds for extracting interest regions, which are then used to generate the feature vector for the classification of the facial attributes. The following issues are addressed:

  • How the salient feature points extracted from the HMAX architecture can improve the prediction of facial attributes.

  • To which extent the devised system can be applied to predict different kinds of facial attributes.

  • How robust an attentive visual system is to variations in head pose, lighting and facial expression.

2 Prediction of Facial Attributes

Most of the time, celebrities or familiar people are remembered because of their distinctive hair style, accessories or even clothes. This everyday observation is exploited by soft biometrics, or general visual attributes. These attributes can add significant information to face images and are quite robust to image degradation and changes in appearance. In the proposed system, the internal S2 layer of HMAX is used to detect the most salient points on the subject’s face. The Local Binary Pattern (LBP) feature extractor is then used to build a local description of the image texture around the selected points. The feature vectors computed for each salient point are concatenated to produce a global feature vector that characterizes the face image. This vector is fed to a binary SVM classifier to predict several visual attributes. The general architecture of the proposed framework is shown in Fig. 1.

Fig. 1. Proposed hybrid system for the prediction of facial attributes.

2.1 The Hierarchical HMAX Network

HMAX is a hierarchical system that closely follows the organization of the visual cortex and builds an increasingly complex and invariant feature representation by alternating between template matching and max pooling [12]. As the network structure is fixed, only a limited number of training examples is required for learning. The computation is hierarchical and invariant to position, scale and viewpoint. Along the hierarchy, the size of the receptive fields and the complexity of their optimal stimuli increase. The model consists of four computational layers, where simple ‘S’ units alternate with complex ‘C’ units.

Fig. 2. General architecture of the HMAX model.

The first layer of the HMAX model, S1, consists of a bank of Gabor filters applied to the full-resolution image (Fig. 2). The input image at the finest scale has size 256 \(\times \) 256 \(\times \) 1. Four Gabor filters are applied at each pixel position, so the output of the S1 layer at the finest scale has size 246 \(\times \) 246 \(\times \) 4. The response of a patch of pixels X to a particular S1 filter G is given by:

$$\begin{aligned} R(X,G)= \left| \frac{ \sum _i {X_i G_i} }{\sqrt{\sum _i {X_i}^ 2 }} \right| \end{aligned}$$

The size of the Gabor filter is 11 \(\times \) 11 and it is formulated as:

$$\begin{aligned} G(x,y)= \exp \left( -\frac{X^2 + \gamma ^2 Y^2}{2\sigma ^2}\right) \cos \left( \frac{2\pi }{\lambda }X\right) \end{aligned}$$

where \(X = x\cos \theta - y\sin \theta \) and \(Y = x\sin \theta + y\cos \theta \); x and y vary between −5 and 5, and \(\theta \) varies between 0 and \(\pi \). The parameters \(\gamma \) (aspect ratio), \(\sigma \) (effective width), and \(\lambda \) (wavelength) are set to 0.3, 4.5 and 5.6, respectively. In the local invariance layer (C1), a local maximum is computed for each orientation, and the result is subsampled by a factor of 5 in both the X and Y directions [13]. In the intermediate feature layer (S2), a response is computed at each C1 grid position, where each feature is tuned to a preferred pattern as its stimulus. Starting from an image of size 256 \(\times \) 256 pixels, the final S2 output has dimension 44 \(\times \) 44 \(\times \) 400. The response of a C1 patch X to a preferred pattern P is obtained using:

$$\begin{aligned} R(X,P)=\exp \left( -\frac{\left\| X-P \right\| ^2 }{\sigma ^2}\right) \end{aligned}$$
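The S1 and C1 stages described above can be sketched in NumPy as follows. This is only an illustrative sketch, not the tool of [13]: the function names are hypothetical, the four orientations are assumed to be equally spaced in [0, π), and the C1 pooling window of 10 pixels is an assumption, since the text only gives the subsampling factor of 5.

```python
import numpy as np


def gabor_bank(size=11, gamma=0.3, sigma=4.5, lam=5.6, n_orient=4):
    """Return an (n_orient, size, size) bank of Gabor filters.

    Orientations are assumed to be k * pi / n_orient, k = 0..n_orient-1.
    """
    half = size // 2
    y, x = np.mgrid[-half:half + 1, -half:half + 1].astype(float)
    bank = []
    for k in range(n_orient):
        theta = k * np.pi / n_orient
        X = x * np.cos(theta) - y * np.sin(theta)
        Y = x * np.sin(theta) + y * np.cos(theta)
        envelope = np.exp(-(X ** 2 + gamma ** 2 * Y ** 2) / (2 * sigma ** 2))
        bank.append(envelope * np.cos(2 * np.pi * X / lam))
    return np.stack(bank)


def s1_response(image, bank, eps=1e-12):
    """Normalized response |sum(X * G)| / sqrt(sum(X^2)) at every valid position."""
    size = bank.shape[-1]
    # View of all size x size patches: (H - size + 1, W - size + 1, size, size)
    patches = np.lib.stride_tricks.sliding_window_view(image, (size, size))
    dots = np.einsum("ijkl,okl->ijo", patches, bank)
    norms = np.sqrt((patches ** 2).sum(axis=(2, 3)))
    # eps avoids a division by zero on all-black patches
    return np.abs(dots) / (norms[..., None] + eps)


def c1_pool(s1, pool=10, step=5):
    """Local maximum per orientation, subsampled by `step` in both directions.

    The pooling window size `pool` is an assumption.
    """
    windows = np.lib.stride_tricks.sliding_window_view(s1, (pool, pool), axis=(0, 1))
    return windows[::step, ::step].max(axis=(-2, -1))
```

With a 256 × 256 input, `s1_response` returns an array of shape 246 × 246 × 4, which matches the S1 size quoted in the text for 11 × 11 filters applied at valid positions only.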

The last layer of the architecture is the global invariance layer (C2), which computes the maximum response to each intermediate feature over all (X, Y) positions and all scales. The result is a feature vector that can be used for classification. For the implementation of the HMAX model we use the tool proposed in [13]. The C2 layer thus yields a vector of 400 features, one maximum per S2 output. Each maximum corresponds to a location (the best coordinates) at which a patch responds most strongly in the image. These coordinates are accumulated, projected onto the original face image, and used as interest points.
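A minimal single-scale sketch of how the S2 tuning, the C2 maxima and their locations might be computed is given below. It assumes the response decays with the distance to the preferred pattern, and the proportional mapping from S2 grid cells to image pixels in `to_image_coords` is an assumption, since the text does not detail the projection. All function names are hypothetical.

```python
import numpy as np


def s2_response(c1, prototype, sigma=1.0):
    """Gaussian tuning exp(-||X - P||^2 / sigma^2) at every valid C1 position.

    c1: (h, w, n_orient) array; prototype: (n, n, n_orient) preferred pattern.
    """
    n = prototype.shape[0]
    # patches: (h - n + 1, w - n + 1, n_orient, n, n)
    patches = np.lib.stride_tricks.sliding_window_view(c1, (n, n), axis=(0, 1))
    diff = patches - np.moveaxis(prototype, -1, 0)
    dist2 = (diff ** 2).sum(axis=(-3, -2, -1))
    return np.exp(-dist2 / sigma ** 2)


def c2_and_locations(s2_maps):
    """Global maximum of each S2 map and the (row, col) where it occurs.

    s2_maps: (h, w, n_prototypes). Returns (c2, locs), locs of shape (n_prototypes, 2).
    """
    h, w, k = s2_maps.shape
    flat = s2_maps.reshape(h * w, k)
    idx = flat.argmax(axis=0)
    c2 = flat[idx, np.arange(k)]
    rows, cols = np.unravel_index(idx, (h, w))
    return c2, np.stack([rows, cols], axis=1)


def to_image_coords(locs, s2_shape, image_shape):
    """Map S2 grid cells to pixel coordinates (proportional scaling, an assumption)."""
    scale = np.asarray(image_shape, dtype=float) / np.asarray(s2_shape, dtype=float)
    return np.floor((locs + 0.5) * scale).astype(int)
```

In the setting of the text, `s2_maps` would hold 400 maps, so `c2` would be the 400-dimensional C2 vector and `locs` the candidate interest points, one per preferred pattern.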

2.2 Local Texture Description Based on LBP

LBP is a type of visual descriptor used for classification in computer vision. The idea of texture extraction using LBP is to assign each pixel a code that depends on the gray levels of its neighbors. The gray level of the central pixel (\(i_c\)) is compared with those of its neighbors (\(i_n\)) according to the following formula:

$$\begin{aligned} LBP(x_c,y_c)=\sum _{n=0}^{P-1}{s(i_n-i_c)2^n},\quad s(x)= {\left\{ \begin{array}{ll} 0, &{} x<0 \\ 1, &{} x\ge 0 \end{array}\right. } \end{aligned}$$

The LBP code of the current pixel is produced by concatenating the 8 comparison results (P = 8) into a binary code. Each window is centered on one of the interest points obtained from HMAX.
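The LBP coding and the per-point description can be illustrated with the short NumPy sketch below. Several details are assumptions, because the text does not specify them: the ordering of the 8 neighbors, the window half-size, and the use of a normalized 256-bin histogram of codes as the local descriptor. The function names are hypothetical.

```python
import numpy as np

# Offsets (d_row, d_col) of the 8 neighbors, n = 0..7 (ordering is an assumption)
OFFSETS = [(-1, -1), (-1, 0), (-1, 1), (0, 1), (1, 1), (1, 0), (1, -1), (0, -1)]


def lbp_image(gray):
    """8-neighbor LBP code of every interior pixel; output is (h - 2, w - 2)."""
    gray = np.asarray(gray, dtype=float)
    h, w = gray.shape
    centre = gray[1:-1, 1:-1]
    codes = np.zeros(centre.shape, dtype=np.int64)
    for n, (dy, dx) in enumerate(OFFSETS):
        neigh = gray[1 + dy:h - 1 + dy, 1 + dx:w - 1 + dx]
        # s(i_n - i_c) = 1 when the neighbor is at least as bright as the centre
        codes += (neigh >= centre).astype(np.int64) << n
    return codes


def lbp_descriptor(gray, points, half=8):
    """Concatenate LBP histograms of windows centred on each (row, col) point.

    Window half-size and the histogram description are assumptions.
    """
    codes = lbp_image(gray)
    feats = []
    for r, c in points:
        r, c = r - 1, c - 1  # shift into the interior-pixel grid of `codes`
        win = codes[max(r - half, 0):r + half + 1, max(c - half, 0):c + half + 1]
        hist = np.bincount(win.ravel(), minlength=256).astype(float)
        feats.append(hist / max(hist.sum(), 1.0))
    return np.concatenate(feats)
```

With this layout, each interest point contributes 256 values, so the global feature vector grows linearly with the number of points kept.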

2.3 Binary Classification with Support Vector Machines

SVMs are a family of learning techniques designed to solve discrimination problems, i.e. deciding to which class a pattern belongs, and regression problems, i.e. predicting the numerical value of a variable. The method rests on solid mathematical foundations. Like the perceptron, an SVM looks for a hyperplane that separates the classes, but among all separating hyperplanes it selects the optimal one, which maximizes the margin. The data are projected into a feature space by non-linear functions, and the optimal hyperplane separating the transformed data is built in that space. The principal idea is thus to build a linear separation surface in the feature space, which corresponds to a non-linear surface in the input space (Fig. 3 illustrates this non-linear transformation).

Fig. 3. The non-linearity.

The Support Vector Machine approach proceeds in two steps. Training consists of searching for the optimal separating hyperplane by maximizing the margin, which requires solving a quadratic program and determining the Lagrange multipliers [14]. Testing, once the Lagrange multipliers have been determined, applies the decision function to the test examples to determine their class [14]. Classification is carried out with the SVM-KM toolbox [15], using a Gaussian kernel with a gamma value of 1e-7 and an error-term penalty parameter C = 100.
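The test step can be illustrated by evaluating the standard kernel decision function f(x) = Σ α_i y_i K(x_i, x) + b once training has produced the multipliers. The sketch below is not the SVM-KM implementation: it assumes the Gaussian kernel is parameterized as K(x, z) = exp(−γ‖x − z‖²), which may differ from the toolbox's own convention, and the function names are hypothetical.

```python
import numpy as np


def gaussian_kernel(A, B, gamma=1e-7):
    """K[i, j] = exp(-gamma * ||A[i] - B[j]||^2) (assumed parameterization)."""
    sq = (A ** 2).sum(axis=1)[:, None] + (B ** 2).sum(axis=1)[None, :] - 2.0 * A @ B.T
    # Clip tiny negative values caused by rounding before exponentiating
    return np.exp(-gamma * np.maximum(sq, 0.0))


def svm_decide(X_test, support_vectors, alpha, y_sv, b, gamma=1e-7):
    """Return (labels, scores) with scores = sum_i alpha_i * y_i * K(x_i, x) + b.

    alpha holds the Lagrange multipliers found in training; in the soft-margin
    formulation they satisfy 0 <= alpha_i <= C (C = 100 in the text).
    """
    K = gaussian_kernel(X_test, support_vectors, gamma)
    scores = K @ (alpha * y_sv) + b
    return np.where(scores >= 0, 1, -1), scores
```

One such binary classifier would be evaluated per facial attribute, with the concatenated LBP feature vector of the face as input.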

3 Experimental Results

Several large, publicly available datasets have been used to test the proposed architecture. The CelebFaces Attributes Dataset (CelebA) [1] is a large-scale face attributes dataset with more than 200K celebrity images, each annotated with 40 attributes. The images in CelebA include large variations in appearance, such as pose and background. It contains 10,177 identities with 20 images each on average, and its identities do not overlap with those of LFW. We also use the LFW dataset [16], a large, real-world face dataset consisting of 13,000 images of faces collected from the internet. These images were taken in completely uncontrolled situations and contain variations in pose, lighting, expression, camera and imaging conditions. CelebA and LFW have been used intensively in the recent literature on facial attribute prediction. The PubFig dataset [19] has been used to test the sensitivity to variations in pose, illumination and facial expression. PubFig is a large, real-world face dataset (including both celebrities and politicians) consisting of 58,797 images of 200 subjects collected from the internet. It is both larger and deeper than LFW, with about 300 images per individual on average. As in LFW, the images were taken in completely uncontrolled situations and contain variations in pose, lighting, expression, camera and imaging conditions; unlike LFW, however, PubFig provides enough examples for each subject.

Experiment 1: The first experiment addresses facial attribute prediction on the Labeled Faces in the Wild (LFW) and CelebA datasets. Table 1 and Table 2 report the facial attribute prediction results on LFW and CelebA, respectively. In this experiment we also compare the proposed system, which relies on an internal layer, with the top layers of HMAX, VGG, AlexNet and ResNet-50. AlexNet is a convolutional neural network trained on more than a million images from the ImageNet database [19]. The network is 8 layers deep and can classify images into 1000 object categories [20, 21]. It has an image input size of 227-by-227. ResNet-50 is also a CNN trained on images from ImageNet [19]. It is 50 layers deep with an input size of 224-by-224. Unlike AlexNet, the layers of ResNet-50 are organised in residual blocks, each comprising at least 3 convolutional layers (1 \(\times \) 1, 3 \(\times \) 3 and 1 \(\times \) 1 convolutions) together with a shortcut connection. VGG is a convolutional neural network trained on more than a million images from the ImageNet database. The network is 16 layers deep and can classify images into 1000 object categories, such as keyboard, mouse, pencil, and many animals. As a result, the network has learned rich feature representations for a wide range of images [20].

Table 1. Facial attributes prediction accuracies on LFW.
Table 2. Facial attributes prediction accuracies on CelebA.

Experiment 2: In the second experiment we compare the results obtained on LFW and CelebA with recent state-of-the-art results, namely FaceTracer [17], PANDA [18], LNets+ANet [3] and the shared CNN proposed in [8]. Table 3 and Table 4 report the results on the LFW and CelebA datasets, respectively. The aim of this experiment is to compare the proposed hybrid architecture with systems based on DNNs recently proposed in the literature.

Table 3. Comparison of the attribute prediction frameworks on LFW
Table 4. Comparison of the attribute prediction frameworks on CelebA

Experiment 3: In the third experiment, we use the PubFig dataset, as it contains a large number of images for each subject, to study the performance of the proposed system under variations in pose, illumination and expression. Because the PubFig images are hosted on the web and many of them are no longer available, we built a database composed of 54 subjects with 8942 images. Table 5 reports the obtained results, which are also compared with AlexNet, VGG and ResNet-50 features.

Table 5. Comparison of the attribute prediction on PubFig database

From the obtained results (Table 1, Table 2), one can clearly see that our hybrid system obtains better results on LFW than on CelebA. This is mainly because the CelebA faces were used in a non-uniform order with respect to identity: CelebA is composed of thousands of identities with twenty images per identity on average, and these images are distributed randomly in the data, whereas in the LFW database all the images of the same subject appear consecutively. The proposed system shows an accuracy comparable to that of the pretrained deep neural networks (VGG, AlexNet and ResNet-50) and, at the same time, surpasses the results obtained with the features of the final HMAX layer, C2. Some facial attributes, such as ‘Attractive’, ‘Gray Hair’, ‘Male’ and ‘Mustache’, were predicted well by our model compared with the pretrained DCNNs. In addition, the proposed biologically inspired system performs well compared with the state-of-the-art models [3, 8, 17, 18], especially on identity-related facial attributes. Such attributes, for example ‘gender’, ‘hair color’, ‘nose’ and ‘lips shape’, ‘chubby’ and ‘Bald’, can add meaningful information for face identification. Another advantage is that our model uses the same set of salient locations to predict all the facial attributes. In contrast, LNets+ANet [1] uses different layers to predict different kinds of attributes, and [3] fuses different, specific face regions to detect a specific facial attribute. In the latter case, not all facial regions may be available, especially under different poses on mobile devices.

Even though PubFig is a very challenging database with variations in pose, illumination and expression, the proposed approach can distinguish between frontal and dark-lighting images.

4 Conclusion

In this paper, a visual attention-based system has been proposed to predict facial attributes. This hybrid system allows a biologically inspired hierarchical network, HMAX, to attend to particular salient regions of the input face while reducing complexity by discarding irrelevant information. LBP is then applied to these regions to extract texture features around the interest points selected by HMAX. The proposed framework shows promising results compared with deep neural network architectures. The success of the hybrid architecture is due not only to its biologically inspired visual perception but also to its flexibility: it can learn from a small amount of data and predict different facial attributes. This addresses a main problem of DCNNs, which require a large amount of training data. The proposed approach can have a positive impact on face recognition, as it can predict attributes in the most challenging real-world scenarios (different backgrounds, pose variation, illumination, expression) that can significantly degrade face recognition performance.