1 Introduction

Face Recognition (FR) is a common authentication tool in many applications. The face is a physical biometric trait that is non-invasive and well accepted by users: unlike the iris or the fingerprint, it requires no direct contact with the acquisition device. Nowadays, FR is the most widespread authentication technique [1]. Facial features are not exclusive to humans; animals have them as well [38, 40]. The main architecture of FR is shown in Fig. 1.

Fig. 1 Classical face recognition pipeline

The classical FR pipeline consists of two phases: an Offline phase and an Online phase. The Offline phase runs while no user is logged in: part of the dataset goes through preprocessing steps such as face cropping, denoising, smoothing, and alignment; a feature extraction step then computes the biometric signature of each face, and these signatures are classified to categorize the features. The Online phase, by contrast, is carried out each time a user queries the dataset: the query face goes through the same steps, i.e., preprocessing and feature extraction, and a check is then performed to compute a membership score for each class and to reach the corresponding decision. The decision step is a 1:N problem that compares the computed biometric signature of the query face image against all the stored signatures to determine the identity of the query face. The run time of this phase must be kept low.
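To make the two phases concrete, here is a minimal, hedged sketch in Python; the preprocessing and signature functions are deliberately naive placeholders, not the techniques used later in this paper.

import numpy as np
from sklearn.neighbors import KNeighborsClassifier

def preprocess(img):
    # Stand-in for cropping, denoising, smoothing, and alignment
    return (img - img.mean()) / (img.std() + 1e-8)

def extract_signature(face):
    # Stand-in biometric signature: flattened pixels (a real system would
    # use PCA, LBP, or deep features, as discussed later in this paper)
    return face.ravel()

# Offline phase: preprocess the gallery and train a classifier on signatures
gallery = [np.random.rand(32, 32) for _ in range(10)]  # toy face crops
labels = list(range(10))
clf = KNeighborsClassifier(n_neighbors=1)
clf.fit([extract_signature(preprocess(g)) for g in gallery], labels)

# Online phase: same steps on the query face, then a 1:N identity decision
query = gallery[3] + 0.05 * np.random.rand(32, 32)
identity = clf.predict([extract_signature(preprocess(query))])[0]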

The challenges facing FR are lighting conditions, pose variation, occlusions, facial expressions, and low resolution [66]. All these problems decrease the recognition rate. To address them, the preprocessing step must be well developed to strengthen the recognition system and to achieve good results on faces taken in the wild. Numerous approaches related to this field have been advanced; however, several challenges still persist [7, 23].

In this paper, we present a method in the context of FR. The aim is to tackle the pose variation problem, because recognizing and identifying a person from a single 2D image under pose variation remains a great challenge. To reach the highest recognition rate, the alignment step, an essential preprocessing step in face recognition, has to be well developed. Thus, our work includes:

  1. Feature extraction to add keypoints to the 68 traditional fiducial landmarks, since these keypoints provide rich information about facial geometry.

  2. 3D face reconstruction from the 2D keypoints obtained from a single image under an arbitrary view, to localize the self-occluded face parts in the case of large poses.

  3. 3D face alignment by fitting the 3D reconstructed face to the 2D face image using keypoint matching, to render the frontal view by pose normalization and correction.

  4. Application of face recognition using Deep Convolutional Neural Networks (DCNNs) on the aligned faces.

Facial alignment and 3D reconstruction are two different tasks, but the relationship between them is now well understood. 2D face alignment has shown a clear weakness: its inability to address large poses. The link between 3D face reconstruction and face alignment consists essentially in estimating the 3D face geometry from a single 2D image, with the main objective of computing the visibility and position of the 2D landmarks.

Many existing methods, especially the earliest contributions, have used hand-crafted features to improve performance. In this paper, our approach is applied directly to RGB face images, using compact features without engineered descriptors, to achieve good performance. We therefore exploit the power of CNNs, which learn the features on large multi-identity datasets, for 3D face alignment with application to face recognition.

2 Related works

2.1 Face recognition

Currently, FR is a widely used biometric technique, as the face has become the most attractive biometric. The COVID-19 pandemic has also shifted several statistics worldwide, including those on biometric modalities. In the latest FindBiometrics results (Fig. 2), reported in a review survey [30], FR retained the top spot as the year's most used and exciting modality.

Fig. 2 Results of the latest FindBiometrics survey, up to year 2020

2.1.1 Face Recognition studies

FR methods can be classified into three categories: global (also known as holistic) methods using the entire facial surface [3], local methods based on local regions or patches rather than the whole face [51], and hybrid methods [74] combining global and local feature descriptors.

Global Face Recognition methods

The global or holistic methods for 2D FR extract features from the entire facial surface. The descriptors used are not dedicated to a specific part or region of the face; the extracted features describe the facial surface as a whole. This is time consuming but very effective for synthesizing the complete face.

EigenFace [88] is a global FR method which uses Principal Component Analysis (PCA) [72]. Faces are described by eigenvectors, computed from features such as the nose tip, mouth, eye corners, and chin edges. Since global methods project the face representation into a small subspace or a correlation plane, EigenFaces are projected onto a reduced face space by PCA. EigenFace has been reused in several other works with modifications and improvements, as presented in [67].
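As a hedged illustration of this idea (random arrays stand in for vectorized, aligned face images), the reduced face space can be computed with scikit-learn's PCA:

import numpy as np
from sklearn.decomposition import PCA

faces = np.random.rand(100, 64 * 64)        # one vectorized face per row

pca = PCA(n_components=50)                  # learn the reduced face space
projections = pca.fit_transform(faces)      # gallery projected onto it
eigenfaces = pca.components_.reshape(-1, 64, 64)

# A query face is identified by its nearest projection in the face space
query = pca.transform(np.random.rand(1, 64 * 64))
nearest = int(np.argmin(np.linalg.norm(projections - query, axis=1)))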

Fisherface [43] is a well-known holistic FR method whose principle is to maximize the separation between classes during training. Fisherface reduces the face space dimension using PCA, and Fisher's Linear Discriminant (FLD) [95] is used to generate face features as a linear combination able to separate two or more classes. This famous algorithm has undergone several modifications according to several criteria, as presented in [34, 73].

In [75], a comparative study between Fisherface and EigenFace is presented. Other methods focusing on fusion based on PCA and FLD are presented in [69].

Joshua et al. [86] presented a non-linear holistic approach capable of extracting complex natural observations and ensuring a globally optimal solution for converging to the true structure of face images in a low-dimensional input space.

LaplacianFaces [33] consists in mapping the face into a face subspace based on Locality Preserving Projections (LPP) [32] to get the best global face description.

Local Face Recognition methods

FR methods based on local features focus on fiducial points and parts of the face to generate features. These techniques compute local features through pixel parameters, face histograms, geometric shapes, and correlation planes between different regions. Local feature-based methods require no reduction of the face representation since they work on local features of the face.

The most popular techniques are based on different descriptors, such as the Local Binary Pattern (LBP) and its derivatives [47], the Histogram of Oriented Gradients (HOG) [55], the Vander Lugt Correlator (VLC), and the Scale Invariant Feature Transform (SIFT). All these descriptors are presented in [78].
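For illustration, here is a minimal sketch of two of these descriptors with scikit-image; the parameter values are common defaults, not those of the cited works:

import numpy as np
from skimage.feature import local_binary_pattern, hog

face = np.random.rand(128, 128)             # stand-in grayscale face crop

# Uniform LBP codes, summarized as a histogram feature vector
lbp = local_binary_pattern(face, P=8, R=1, method="uniform")
lbp_hist, _ = np.histogram(lbp, bins=10, range=(0, 10), density=True)

# HOG: histograms of gradient orientations over cells and blocks
hog_feat = hog(face, orientations=9, pixels_per_cell=(8, 8),
               cells_per_block=(2, 2))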

In [16], Chandrakala et al. dealt with the pose variation, scale, facial expression, and illumination challenges using a cascade of LBP and HOG.

A recent work presented in [46] is based on a variation of the Local Radius of Gyration Face (LRGF), invariant to lighting variation, pose change, and noise.

Hybrid Face Recognition methods

Hybrid FR methods consist of a fusion of global and local methods. In fact, the global characteristics are combined with the local ones, making this FR category the most efficient and robust [27, 65].

In a recent work [24], the authors focused on feature optimization, selecting the optimal characteristics of the face with the Particle Swarm Optimization (PSO) algorithm based on the active region of interest of the face.

An FR system using the LBP Histogram (LBPH) descriptor for local and global spatial features of the face is presented in [21].

Table 1 presents a brief study of some FR approaches.

Table 1 Summary of some FR methods in the literature in a chronological order

2.1.2 Deep face recognition studies

With the advent of Big Data and data mining, FR methods and approaches have multiplied. In this work, our goal is to recognize individuals from their faces under pose variations using CNNs, a method that has proved to give impressive results. With the advent of multi-core CPUs and GPUs [54], CNNs and Deep CNNs can now be trained on huge amounts of data.

CNNs can be classified in the category of hybrid FR methods. They are suited to feature learning and label prediction, mapping the input data to deep features (the output of the last hidden layer) and then to the predicted labels. Feature learning is carried out automatically, with weights shared between different layers. DCNNs achieve superior performance since they are able to extract high-level features through their classification architecture [93]. Once deep features are extracted, most methods directly calculate the similarity between two features using cosine, L2, or nearest neighbor (NN) metrics, and then establish the comparison for identification. Yet, deep networks which perform perfectly on benchmark datasets may fail in real-world applications.
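A minimal sketch of this matching stage, assuming 512-D embeddings have already been extracted by a DCNN (random vectors stand in for real deep features):

import numpy as np

gallery = np.random.randn(100, 512)          # one deep feature per identity
query = np.random.randn(512)

# Cosine similarity between the query and every stored feature
cos_scores = gallery @ query / (np.linalg.norm(gallery, axis=1)
                                * np.linalg.norm(query))
# L2 distances, as used by nearest-neighbor (NN) identification
l2_dists = np.linalg.norm(gallery - query, axis=1)

best_by_cosine = int(np.argmax(cos_scores))
best_by_nn = int(np.argmin(l2_dists))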

Most of the recent methods perform face image representation using hand-crafted local image descriptors, such as SIFT, LBP, and HOG [9, 48, 61].

Contrary to the aforementioned methods, our method is applied to RGB pixels without combining other descriptors to improve performance.

Researchers have used CNNs and DCNNs in FR applications, whether for feature learning, feature extraction, or feature classification.

In CosFace [94], a novel loss function, the Large Margin Cosine Loss (LMCL), is used to remove radial variations and to maximize the decision margin in the angular space. LMCL guides DCNNs to learn highly discriminative face features, so intra-class variance is minimized and inter-class variance is maximized.
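For reference, the LMCL of [94] takes the following form, where N is the batch size, s a scaling factor, m the cosine margin, y_i the ground-truth class of sample i, and θ_{j,i} the angle between the i-th deep feature and the j-th class weight vector:

$$ L_{lmc}=\frac{1}{N}\sum\limits_{i} -\log\frac{e^{s(\cos(\theta_{y_{i},i})-m)}}{e^{s(\cos(\theta_{y_{i},i})-m)}+\sum_{j\neq y_{i}} e^{s\cos(\theta_{j,i})}} $$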

SphereFace [58] represents class centers in the angular space and penalizes the angles between deep features and their corresponding weights in a multiplicative way, since the authors found that the linear transformation matrix in the last fully connected layer of the CNN is useful for this purpose. This multiplicative angular margin helps to obtain the highly discriminative features learned via DCNNs for FR.

In the same context, RegularFace [99] uses an intuitive geometric interpretation, penalizing the angle between an identity and its nearest neighboring class to improve inter-class separability.

In [56], the authors focused on decreasing information redundancy in feature learning while maintaining the most informative components of the spatial feature maps. This attention module is added to the convolutional layers of a standard CNN.

FR methods based on deep CNNs are in full development. Indeed, to reach a high recognition rate, it is absolutely necessary to focus on the features, since CNNs perform feature learning automatically. Most methods thus add a module or an additional function to the CNN layers, or focus on the preprocessing steps, to keep only the salient features of the face (Table 2).

Table 2 Summary of some Deep-based FR methods in the literature in a chronological order

2.2 Face alignment

As mentioned in the previous subsection, the recognition rate is relative to the extracted and the learned face features. For this reason, the face must be well preprocessed before performing the recognition test.

The alignment process is part of the preprocessing steps and involves placing the face in a frontal position (pitch (ϕ) = 0, yaw (γ) = 0, and roll (𝜃) = 0). More precisely, it is a pose normalization, since the frontal pose covers the canonical view of a face taken arbitrarily in the wild. Aligning poses makes FR easier.

In many papers, authors conflate face alignment with face detection, whereas aligning faces consists in establishing an in-plane rotation and bringing the face to a frontal view. Moreover, a face image captured under pose variation presents missing data, which can degrade the recognition rate.

Methods of face alignment are numerous and have shown impressive results with sophisticated techniques.

2D face alignment aims at establishing pose normalization when faces are in frontal or near-frontal poses, as shown in Fig. 3. However, this transformation fails under out-of-plane rotation, so 2D face alignment has difficulties [41] when addressing large poses (Fig. 4). In contrast, 3D face alignment consists in aligning faces despite the presence of out-of-plane rotations.

Fig. 3 Head poses close to the frontal pose

Fig. 4 Examples of large poses

Whatever face alignment method is used, one must always bear in mind that the starting point is the facial landmarks.

The human face contains regions that make it unique, even in the case of twins. These regions are called landmarks and/or keypoints (Table 3).

Table 3 Summary of some face alignment methods based on fiducial landmarks

Landmarks: characteristic points present in every face, such as the eyes, eyebrows, ears, chin, nose, mouth, etc. Their number is standard and fixed according to the applied algorithm, and automatic face annotation algorithms exist to generate them. Landmarks serve to localize the salient regions of the face for face alignment, face morphing, face replacement, face recognition, etc.

Keypoints: characteristic points specific to a single face; two faces cannot contain the same keypoints, such as wrinkles, moles, warts, scars, etc.

2.2.1 3D face alignment methods based on fitting 3D generic models to 2D faces

The human face is characterized by 68 landmarks which can provide information about the head pose. The fitting process consists in attaching a 3D face model to the 2D face using the landmarks as references, by minimizing the difference between the face image and the appearance of the 3D face model. The purpose of fitting lies in the possibility of rotating the face and performing alignment to a frontal pose. Fitting is widely used for 3D face alignment, especially for medium poses; in large poses, however, it is very challenging because of the dramatic appearance variations close to the profile view (Table 4).

Table 4 Summary of some 3D face alignment methods based on 3D morphable models fitting

In [100], the authors introduced 3D Dense Face Alignment (3DDFA), which fits the 3D Morphable Model (3DMM) [12]. 3DDFA synthesizes face appearances by labeling landmarks that are invisible due to large poses; its objective is to skip 2D landmark detection and start directly from 3DMM fitting. HPEN [101] aims at fitting the 3DMM to 2D faces captured in the wild. An approximation method is also employed to avoid iterative visibility estimation of the masked landmarks in large poses. In addition, an identity-preserving normalization is carried out by correcting the 3D transformation and adjusting anchors in the meshed image. In the same context, the method proposed in [79] uses the Basel Face Model (BFM) [37] for 3D face alignment and keypoint localization. It consists of a deep evolutionary model integrating sparse 3D Diffusion Heat Maps (DHM) for pose assistance; a CNN is used for feature extraction and a Recurrent Neural Network (RNN) for learning.

The methods cited above have achieved the best results in the FR framework, including face alignment. However, the big challenge always arises when dealing with large poses. Their main drawback is therefore the limited geometry of the 3D models used. Moreover, using a generic 3D model, such as 3DMM or BFM, to establish fitting always leaves a common signature in the extracted features.

2.2.2 3D face alignment methods based on 3D face model reconstruction

This process consists in reconstructing a 3D face model from a 2D face, so that each input 2D face image has its own model without the need for a generic 3D model such as 3DMM or BFM, or any external data (Table 5). Indeed, each reconstructed 3D model has its own characteristics and parameters. Thereafter, the 3D reconstructed model and the 2D landmarks are put in correspondence by a specific technique.

Table 5 Summary of some 3D face alignment methods based on face model reconstruction from the input 2D face image

DeepFace [82] models a 3D face based on 67 extracted fiducial points. This method consists in warping the detected facial crop onto a 3D frontal model after mesh reconstruction by Delaunay triangulation; the 67 anchor points are fitted to the obtained 3D shape to get a correspondence between the 67 detected fiducial points and their 3D references. In the same context, another work used the Iterative Closest Point (ICP) [45] algorithm to establish correspondence between each reconstructed 3D face and the ground-truth point cloud; the normalized mean error (NME), normalized by the face bounding box size, is then calculated.

Feng et al. [28] proposed a new approach for 3D face reconstruction using the UV space as a position map [85]. The UV position map represents the full 3D facial structure along with alignment information: it is a 2D image recording the 3D positions of all points in UV space. The full facial geometry is thus reconstructed along with its semantic meaning and regressed to get aligned faces.

The previously cited works established alignment through face model reconstruction without relying on a generic 3D model, which is challenging but has yielded good results, with no restriction from a 3D model shape or template. In our method, the reconstruction of a 3D model for each 2D face is carried out as explained in the following section.

3 Proposed method

The conventional pipeline consists of face detection, face alignment to get the frontal pose, face representation to be trained in the DCNN, and finally face classification to establish identification. Face detection and face alignment are preprocessing steps. Our global pipeline is presented in Fig. 5.

Fig. 5 Overview of the proposed method

A specification of the main algorithm of the proposed method is presented in Algorithm 1. The different steps are detailed in the following subsections.

Algorithm 1 Overall.

3.1 Face detection and cropping

Before detecting faces in the images, we eliminate the duplicated images and check the labels. For face detection, Modified Viola–Jones algorithm [63] is used.

When it first appeared [92], this method was effective for detecting faces in a frontal position; after several modifications, it has become effective in all scenarios. Thus, the face, which is our region of interest (ROI), can be detected under various poses, various illumination conditions, different skin colors, and complex backgrounds, while maintaining a considerable speedup by parallelizing the training. Once the face is detected, bounding boxes are randomly generated around the detected window (Fig. 6).
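As a hedged sketch of this step, OpenCV's standard Haar-cascade implementation of Viola–Jones can be used as follows; the modified variant of [63] differs in its internals, so this only approximates the detection stage, and the cascade file is the one bundled with OpenCV:

import cv2

img = cv2.imread("face.jpg")                 # hypothetical input image
gray = cv2.cvtColor(img, cv2.COLOR_BGR2GRAY)

cascade = cv2.CascadeClassifier(
    cv2.data.haarcascades + "haarcascade_frontalface_default.xml")
boxes = cascade.detectMultiScale(gray, scaleFactor=1.1, minNeighbors=5)

for (x, y, w, h) in boxes:                   # crop and rescale each face
    face = cv2.resize(img[y:y + h, x:x + w], (224, 224))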

Fig. 6 Face detection using the Modified Viola–Jones algorithm

Once facial detection is established, all images are resized to the same scale. For images containing multiple faces, each detected face is labeled manually and assigned to the appropriate class.

3.2 3D Face reconstruction

In this paper, we revisit the alignment step, which consists in searching landmarks based on global shape or texture models to configure the landmark locations. However, under some view angles, landmarks are invisible: performance decreases for non-frontal faces, and invisible landmarks are treated as self-occlusions. This is why face reconstruction is required. The difference between using a generic 3D model and a reconstructed 3D model is that, in the latter case, each 2D face has its own 3D model which preserves its texture, shape, and other features. Using a generic model, such as BFM or 3DMM, introduces a common signature among all faces, which increases the error rate afterwards.

3D reconstruction is based on detecting keypoints which are added to the traditional fiducial landmarks (Fig. 7). Indeed, adding supplementary keypoints to the face features is helpful in the reconstruction stage, because the 68 landmarks are not enough for 3D mesh creation.

Fig. 7 The 68 traditional facial landmarks

3.2.1 Facial keypoints detection and extraction

First, we locate the 68 fiducial points using the facial landmark detector included in the dlib library and OpenCV, presented in [71].

The 68 extracted (x, y) landmarks allow us to delineate the facial surface in the face image, as shown in Fig. 8. Our new ROI is thus delimited by the jaw and eyebrow keypoints. This method was tested under large poses, and this step is successfully performed.
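A minimal sketch of this localization step; the predictor file is dlib's standard pre-trained 68-landmark model, downloaded separately:

import cv2
import dlib

detector = dlib.get_frontal_face_detector()
predictor = dlib.shape_predictor("shape_predictor_68_face_landmarks.dat")

gray = cv2.cvtColor(cv2.imread("face.jpg"), cv2.COLOR_BGR2GRAY)
for rect in detector(gray, 1):               # upsample once for small faces
    shape = predictor(gray, rect)
    landmarks = [(shape.part(i).x, shape.part(i).y) for i in range(68)]
    # Points 0-16 trace the jaw and 17-26 the eyebrows; together they
    # delimit the ROI used in the rest of the pipeline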

Fig. 8 68-landmark detection using the dlib library. First row: input face images; second row: the numbered 68 landmarks

Our choice of the 68 facial landmark detector was made following a series of tests and experiments that proved its robustness against large poses; they are detailed in the self-evaluation section.

According to state-of-the-art studies, out-of-plane or invisible landmarks appear in large poses. Keypoints are therefore added, since the 68 landmarks are not enough for 3D face reconstruction (Algorithm 2). Indeed, this is our basic contribution.

Algorithm 2 Keypoints detection and extraction.

The edges in the face image are detected using Canny and Prewitt edge detection algorithms [91]. Only the features in the delimited ROI are kept.

The Canny method finds edges by looking for local maxima of the image gradient, calculated using the derivative of a Gaussian filter. The method uses two thresholds to detect strong and weak edges, and includes the weak edges in the output only if they are connected to strong ones. By using two thresholds, the Canny method helps to detect true weak edges, which can represent wrinkles in the face (Fig. 9(c)).

Fig. 9 Facial keypoints detection: (a) input face image, (b) 68 landmarks, (c) edges detected using Canny, (d) edges detected using Prewitt, (e) regions detected using MSER, and (f) all detected keypoints

On the other hand, the Prewitt method aims at finding the edges at the points where the image gradient is maximum using the Prewitt approximation to the derivative (Fig. 9(d)).

Since the output is a binary image, the edge pixels are located and extracted to be added to the other keypoints. We note that the number of keypoints varies from one face to another.
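A hedged scikit-image sketch of this edge-based keypoint extraction; the thresholds are illustrative, and the ROI is assumed to have been masked already:

import numpy as np
from skimage.feature import canny
from skimage.filters import prewitt

face = np.random.rand(128, 128)              # stand-in grayscale face ROI

edges_canny = canny(face, sigma=2.0)         # binary map, double thresholding
grad = prewitt(face)                         # Prewitt gradient magnitude
edges_prewitt = grad > 0.5 * grad.max()      # illustrative binarization

# Edge pixels become additional (row, col) keypoints; counts vary per face
edge_keypoints = np.argwhere(edges_canny | edges_prewitt)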

In addition to edge detection, Maximally Stable Extremal Regions (MSER) features [77] are added. This descriptor (Fig. 9(e)) provides good identification of significant image parts, usually combined with high repeatability under typical image distortions. It also highlights the boundaries of the ROI, which are maximally stable extremal regions. Moreover, MSER helps to find correspondences between image elements in two images with different viewpoints.
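Similarly, a hedged OpenCV sketch of the MSER step:

import cv2
import numpy as np

gray = (np.random.rand(128, 128) * 255).astype(np.uint8)  # stand-in ROI

mser = cv2.MSER_create()
regions, _ = mser.detectRegions(gray)        # each region: array of (x, y)

# Region pixels are pooled with the edge points and the 68 landmarks
mser_keypoints = (np.vstack(regions) if len(regions) > 0
                  else np.empty((0, 2), dtype=int))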

The number of detected keypoints differs for each input 2D image (Fig. 9(f)). Once detected, the keypoints are extracted and saved under the same label as the image, to be used in the 3D reconstruction process. Table 6 gives the number of extracted keypoints for two query face images.

Table 6 Examples of the number of extracted keypoints of two query face images (the images are the faces of two celebrities coming from the datasets we test)

In this work, we thus add keypoints beyond the traditional 68 landmarks, as we believe the face contains more points that characterize it.

Using two examples, Table 6 shows that each face has a variable number of characteristic points at each step of feature extraction. The number of keypoints matters for 3D face reconstruction, 3D face processing (mesh subdivision), face fitting, and the face alignment process, and it is required for face meshing.

3.2.2 Face meshing

Once the keypoints are extracted, we start meshing the ROI using Delaunay triangulation [11]. Algorithm 3 presents the main steps of 2D face meshing.

Algorithm 3 2D face meshing.

Delaunay triangulation creates a triangulation of a set of points such that the circumcircle associated with each triangle contains no other point of the set in its interior. The Delaunay triangulation derived from the extracted facial keypoints is shown in Fig. 10.
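A minimal SciPy sketch of this step, assuming the extracted keypoints have been stacked into an (n, 2) array:

import numpy as np
from scipy.spatial import Delaunay

keypoints = np.random.rand(300, 2) * 128     # stand-in (x, y) facial keypoints

tri = Delaunay(keypoints)
triangles = tri.simplices                    # (m, 3) keypoint indices per facet
# Each row indexes the three keypoints of one mesh triangle; the empty
# circumcircle property holds by construction.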

Fig. 10 3D face reconstruction from a single 2D face image: step-by-step image meshing (from 2D keypoints to 3D mesh)

After the triangulation process, we obtain facial points in the 3D domain, derived from the facial keypoints in the 2D domain, where n is the number of extracted landmarks. It is worth noting that n is not the same for each given face (P0: initial points, Pm: meshed points).

$$ \begin{array}{@{}rcl@{}} P_{0}={[x_{1}, y_{1}, x_{2}, y_{2}, \dots, x_{n}, y_{n}]}^{T} \in \mathbb{R}^{2n \times 1} \end{array} $$
(1)
$$ \begin{array}{@{}rcl@{}} P_{m}={[x_{1}, y_{1}, z_{1}, x_{2}, y_{2}, z_{2}, \dots, x_{n}, y_{n}, z_{n}]}^{T} \in \mathbb{R}^{3n \times 1} \end{array} $$
(2)

As previously mentioned, face cropping extracts the face from the image, but part of the background remains. This part is useful in the alignment step; however, it should be ignored in the 3D face reconstruction, because only the salient part of the face is needed. Keeping the background in the 3D reconstruction would be very demanding in terms of time and complexity.

3.2.3 3D face preprocessing

This step is very important since the obtained mesh is not of good quality, due to several factors such as mesh irregularity and holes coming from self-occlusions. Vertices with no connections can also be found. Algorithm 4 presents the steps followed to perform 3D face preprocessing.

Algorithm 4 3D face preprocessing.

First of all, we extract the facial surface using Region Growing [6], a segmentation algorithm suitable for 3D meshes. The nose tip is used as a seed point, and several tests were performed to determine an extraction radius suitable for any face shape (r = 0.6 × the bounding box length). The geodesic distance is then used to obtain an oval shape, as shown in Fig. 11. Indeed, the keypoints residing around the jaws and their neighborhoods are taken into consideration.

Fig. 11 3D facial surface extraction: (a) localization of the nose tip, (b) facial patch detection, (c) extracted face

Once the suitable facial region (patch) is extracted from the initially generated mesh, we locate the diagonal of the face from the annotated landmarks (28, 29, 30, 31, 34, 52, 63, 67, 58, 9), as shown in Fig. 12(b). We also extract other facial diagonal keypoints having the same y-axis coordinates as these. We then generate vertices symmetrical about the y axis for each facial landmark, considering the x and z axes, as shown in Fig. 12(c). This solves the problem of missing parts and self-occlusions caused by large poses and profile views (Fig. 12(a)).

Fig. 12 3D mesh reconstruction (update) by adding the missing parts: (a) missing parts of the face caused by large poses, (b) localization of facial diagonal keypoints referring to the 68 annotated landmarks (the red points represent landmarks 28, 29, 30, 31, 34, 52, 63, 67, 58, and 9; the blue ones represent other detected keypoints having the same y coordinates), (c) missing parts reconstruction

After adding the missing parts of the 3D face, the quality of the preprocessed mesh is improved for the pose normalization task. Remeshing, to connect the new vertices, and facial surface subdivision of the mesh are performed using the Butterfly subdivision algorithm [60] and the Ball Pivoting Algorithm (BPA) [8] for triangular interpolation (Fig. 13).

Fig. 13 Remeshing process: three mesh subdivision iterations using the Butterfly algorithm. Red circles show interpolating triangulation using BPA

The Butterfly algorithm is used for mesh subdivision and vertex connection. This process is essential in 3D reconstruction to produce additional vertices, the goal being a mesh regularity controlled by BPA so as to preserve the facial shape.

Using the Butterfly algorithm, we normalize all the 3D reconstructed faces to a defined number of vertices and facets, since the original meshes do not share the same parameters: the number of extracted landmarks varies from one face to another.

For each facet consisting of 3 vertices (each with 3 coordinates x, y, and z in 3D space), BPA pivots around an edge (which connects two vertices) until it touches another vertex, forming another triangle. BPA thus builds relationships between vertices having no connections, which improves the mesh regularity. This process is iterated until all the vertices in the mesh are connected.
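The sketch below approximates this remeshing stage with Open3D. Note that Open3D provides midpoint and Loop subdivision rather than Butterfly, so the subdivision call is a stand-in, and the ball radii are illustrative:

import numpy as np
import open3d as o3d

pts = np.random.rand(500, 3)                 # stand-in facial mesh vertices
pcd = o3d.geometry.PointCloud()
pcd.points = o3d.utility.Vector3dVector(pts)
pcd.estimate_normals()                       # BPA requires oriented normals

# Ball pivoting: a ball of each radius rolls over the points, creating a
# triangle whenever it rests on three vertices
radii = o3d.utility.DoubleVector([0.05, 0.1])
mesh = o3d.geometry.TriangleMesh.create_from_point_cloud_ball_pivoting(
    pcd, radii)

# Three subdivision iterations (midpoint here, Butterfly in the paper)
mesh = mesh.subdivide_midpoint(number_of_iterations=3)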

BPA is a widely used and efficient technique for mesh interpolation, exhibiting linear time complexity and robustness on the given 3D meshes. Although these two techniques are old, they are very efficient; in the experimental part, we justify our choice with some discriminating values.

3.3 3D face alignment

3.3.1 Pose normalization

After 3D mesh reconstruction and preprocessing, we wrap all the detected 2D facial keypoints by projecting the 3D reconstructed face onto the image plane using the Weak Perspective Projection [14], based on the 2D positions of the 3D points on the image plane. Algorithm 5 summarizes the main steps.

Algorithm 5 3D alignment.

Then, we fit the obtained 3D face by minimizing the difference between the 2D extracted landmarks and their references in our 3D reconstructed model, considering the rotation parameters (R is a 3 × 3 matrix constructed from the pitch (ϕ), yaw (γ), and roll (𝜃)), the translation vector t3d, and the scale factor f given by the normalization process.

$$ \begin{array}{@{}rcl@{}} \underset{f,R,t_{3d}}{\arg\min} \ || P_{m}- P_{0}|| \end{array} $$
(3)

The rotation matrix is obtained by multiplying the following three matrices:

$$ \begin{array}{@{}rcl@{}} R_{x}(\phi) = \left( \begin{array}{ccc} 1 & 0 & 0 \\ 0 & \cos(\phi) & \sin(\phi) \\ 0 & -\sin(\phi) & \cos(\phi) \end{array} \right) \end{array} $$
(4)
$$ \begin{array}{@{}rcl@{}} R_{y}(\gamma)= \left( \begin{array}{ccc} \cos(\gamma) & 0 & -\sin(\gamma) \\ 0 & 1 & 0 \\ \sin(\gamma) & 0 & \cos(\gamma) \end{array} \right) \end{array} $$
(5)
$$ \begin{array}{@{}rcl@{}} R_{z}(\theta) = \left( \begin{array}{ccc} \cos(\theta) & -\sin(\theta) & 0 \\ \sin(\theta) & \cos(\theta) & 0 \\ 0 & 0 & 1 \end{array}\right) \end{array} $$
(6)

Figure 14 presents the results of the fitting process using our 3D reconstructed models. The first row shows celebrity faces taken from the tested datasets; the second row contains the fitting results.

Fig. 14 Fitting results using our 3D reconstructed models: the first row contains the original face images; the second row contains the fitting results

The salient surface of the face is completely wrapped by our reconstructed 3D model. The advantage of 3D reconstruction is that each identity has its own specific 3D model, which is useful for alignment and makes it unique and original: there is no common factor between the different identities, which benefits the recognition task.

Later, we perform pose correction for the alignment step. The 3D face denoted by Pm in (7) is rotated, normalizing with R−1 to the frontal pose with a 0° view centered at the nose tip, considering the pose map of the 2D extracted keypoints. This step is iterated until the face is aligned (Pa) to the desired view according to the pitch (ϕ), yaw (γ), and roll (𝜃) values of the frontal pose.

$$ \begin{array}{@{}rcl@{}} P_{a}=R^{-1}P_{m} \end{array} $$
(7)
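A NumPy sketch of this pose correction, composing R from the three elemental rotations of (4)-(6) and exploiting the fact that R−1 = Rᵀ for a rotation matrix; the angle values are illustrative:

import numpy as np

def rotation_matrix(pitch, yaw, roll):
    cp, sp = np.cos(pitch), np.sin(pitch)
    cy, sy = np.cos(yaw), np.sin(yaw)
    cr, sr = np.cos(roll), np.sin(roll)
    Rx = np.array([[1, 0, 0], [0, cp, sp], [0, -sp, cp]])
    Ry = np.array([[cy, 0, -sy], [0, 1, 0], [sy, 0, cy]])
    Rz = np.array([[cr, -sr, 0], [sr, cr, 0], [0, 0, 1]])
    return Rz @ Ry @ Rx

P_m = np.random.rand(3, 200)                        # stand-in mesh vertices
R = rotation_matrix(np.deg2rad(10), np.deg2rad(35), np.deg2rad(-5))

posed = R @ P_m                                     # face under the pose
P_a = R.T @ posed                                   # normalization, eq. (7)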

Once the 3D face is normalized to the frontal pose, correspondence between 3D and 2D keypoints is redone to refine the new 2D keypoints location.

Following our literature review, we noticed that face alignment methods using generic 3D models suffer from broken correspondences, especially in the case of large poses: the keypoints on the face contour boundary are not consistent. In addition, the shape of the generic 3D model always remains, which implies that after the fitting process all faces share a common touch despite their different identities, simply because they are fitted with the same generic 3D model. For this reason, a full reconstruction of the 3D face for each given 2D face is efficient and recommended: each 2D face gets its own 3D model, which keeps it truly original through the fitting and alignment steps.

3.3.2 Aligned image cleaning

After the fitting and alignment processes, we notice that the obtained images are not in good condition: they contain holes and missing parts due to the alignment.

Some preprocessing operations are performed to clean the resulting images and to increase the recognition rate. It is not possible to generate a reasonable face image exactly like one taken in the frontal view, so the artifacts are treated using the mirroring method [22], whose purpose is to fill the holes and missing parts caused by the alignment.

Figure 15 shows the graphical results of 3D face alignment when applying our method. The blue circles show the robustness of our method and justify our contribution at the level of keypoint addition, which serves to detect more regions and to wrap all the visible parts of the face. In fact, extracting more keypoints yields a better 3D face reconstruction, which leads to fitting the whole face region and thus to better face alignment. The purpose of such face alignment is to increase face recognition results, no matter how challenging the conditions are. Other use cases of our alignment approach are presented in [39].

Fig. 15 Graphical results of face cleaning and alignment: (a) input image, (b) 3D face alignment, (c) results of image cleaning and alignment

3.4 Deep face recognition

After face frontalization and preprocessing, we move to face verification using DCNNs, which eliminate the need for manual feature extraction: the features are learned directly. We train our DCNN on a multi-class face dataset. For this operation, our main objectives are a fast GPU implementation of a DCNN and the selection of DCNN architectures that have proved successful on such big datasets. Applying a DCNN to aligned facial images makes the network more robust to small registration errors.

In our work, we tried several DCNNs and kept the best recognition rate obtained for each dataset. Our DCNN is therefore trained on aligned RGB face images, with the image size adapted to the input layer of each tested DCNN.

Our input is an RGB image of the aligned face, given to a convolutional layer (CL) and resized according to the CL characteristics of each tested DCNN. AlexNet [52] gave the best recognition rates, as detailed in the experimental part of this paper.
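As a hedged sketch of this stage (the experiments below used MatConvNet; PyTorch/torchvision stands in here), AlexNet can be fine-tuned on the aligned faces as follows; the class count and hyperparameters are illustrative:

import torch
import torch.nn as nn
from torchvision import models

num_classes = 1595                          # e.g., the number of YTF subjects

# Pre-trained AlexNet with a new identity head; inputs are 224x224 RGB faces
net = models.alexnet(weights=models.AlexNet_Weights.DEFAULT)
net.classifier[6] = nn.Linear(4096, num_classes)

criterion = nn.CrossEntropyLoss()
optimizer = torch.optim.SGD(net.parameters(), lr=1e-3, momentum=0.9)

# One illustrative training step on a toy batch of aligned faces
images = torch.randn(8, 3, 224, 224)
targets = torch.randint(0, num_classes, (8,))
optimizer.zero_grad()
loss = criterion(net(images), targets)
loss.backward()
optimizer.step()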

4 Experimental results

We present the experiments conducted with our method on the YTF, LFW, and BIWI datasets, which are well-known benchmarks for face recognition. Our implementation is based on the dlib library using Python 3, MatConvNet, and the MATLAB Image Processing and Graph toolboxes for 3D mesh processing; MeshLab linked to the NVIDIA packages is used to accelerate training. All our experiments were carried out using NVIDIA CUDA 9.2 and were run on an Intel(R) Core(TM) i7-7500U at 2.70 GHz-2.90 GHz with 8 GB of RAM.

4.1 Experimentation and results on LFW dataset

Labeled Faces in the Wild is a big dataset for face verification testing in unconstrained conditions (lighting, poses, facial expressions). It contains 13,233 face images of 5,749 different identities collected from the web, including 1,680 people with two or more images and 4,069 people with only a single image in the dataset.

In our experiments, we used the configuration described in the paper [36] related to the dataset, and we only used the LFW samples; no outside data were used. The LFW dataset offers two protocols: image-restricted and image-unrestricted.

Under the restricted protocol, only binary labels are available: "matched" or "mismatched" verification is performed for pairs of images. Under the unrestricted protocol, the identity information of the person in the image is also available, which makes it possible to form new pairs of images.

Following this experimentation, we tested several DCNNs, and the best recognition results were obtained using AlexNet: 98.37% with the restricted protocol and 97.28% with the unrestricted protocol. Table 7 compares our results with those of existing methods using different alignment methods, as described in the previous sections.

Table 7 Comparison of FR rates with some existing methods on LFW dataset

4.2 Experimentation and results on YTF dataset

The YouTube Faces dataset [96] includes 3,425 YouTube videos of 1,595 different subjects. The classes used are the same as in LFW (a subset of the celebrities present in the LFW dataset [36]). The videos were taken by professional photographers and were divided into 5,000 video pairs and 10 splits, used to evaluate video-level face verification. The images of this dataset are not of good quality due to acquisition problems, so a preprocessing step, including smoothing and other filters, was essential.

In this paper, we performed our experiments employing the restricted protocol, which limits the information available for training to the same/not-same labels in the training splits.

Before performing 3D alignment, FR was tested with different DCNNs to check whether alignment increases the recognition rate. Using AlexNet, the recognition rate was 99.14%. Table 8 presents a comparison with some related works.

Table 8 Comparison with the state-of-the-art on YTF dataset

4.3 Experimentation and results on Biwi dataset

The BIWI dataset includes 15,678 frames collected from videos of 20 individuals: 6 women and 14 men (some of whom were recorded twice). It contains 24 sequences acquired with a Kinect sensor, collected under controlled conditions and different head poses.

In our experiments, we used the 2D (RGB) frame images of the dataset. We performed the same processing steps used for the two other tested datasets, then applied our proposed 3D face alignment and pose normalization method.

For FR, we followed the experimental protocol used by several works in the literature for this dataset: we randomly split the dataset into 70% for training and 30% for testing and verification. Using AlexNet, the recognition rate was 97.92%. Table 9 presents a comparison with some related works.

Table 9 Comparison with the state-of-the-art on BIWI dataset

4.4 Self evaluation

We carried out a series of tests to justify our qualitative and quantitative choices of the different parameters and techniques. Beyond highlighting the robustness of our contribution through the obtained rates, we would also like to emphasize the quality of our work.

First of all, we justify the use of the 68-landmark detector. As mentioned in the proposed method, we used dlib and OpenCV through Python 3. This technique gave the best face annotation results compared to the Chow-Liu algorithm [18], which is widely used in recent face landmark detection methods although it is an old technique (Fig. 16(b)), and to the Gauss-Newton method [89], which is also widely used in face alignment (Fig. 16(c)). The comparison can also be made through the graphical results in Fig. 16.

Fig. 16 68 facial landmarks detection: (a) input face image, (b) detection using the Chow-Liu algorithm, (c) detection using the Gauss-Newton method, (d) detection using dlib and OpenCV

The chosen technique established landmark detection in almost all pose variations, contrary to some other techniques, which produced landmark detection errors or bad locations in critical scenarios that would be disturbing during mesh reconstruction.

Our first contribution consists in adding more keypoints to the traditional 68 facial landmarks. This is useful for the 3D model reconstruction used in the alignment process.

So, is 3D reconstruction perfect?

To answer this question, an experiment was carried out. The 3D reconstruction was evaluated on the BU3DFE dataset [83], which contains 3D meshes accompanied by 2D images, to make sure that our reconstruction is accurate and close to 3D faces as captured by 3D acquisition devices.

We used the Mean Absolute Error (MAE) evaluation metric, which measures the average magnitude of the errors between the prediction (3D reconstructed faces) and the real 3D faces.
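Concretely, denoting by $\hat{v}_{i}$ the reconstructed position of vertex i and by $v_{i}$ its ground-truth position over n vertices, the MAE is:

$$ MAE=\frac{1}{n}\sum\limits_{i=1}^{n} |\hat{v}_{i}-v_{i}| $$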

The average MAE of the 3D reconstructed faces decreases with each addition of keypoints (Fig. 17), which justifies adding points to accomplish the reconstruction task. However, the rates obtained are not yet within the standard range.

Fig. 17 MAE of the proposed 3D reconstruction method on BU3DFE

For this reason, 3D mesh preprocessing was performed to regularize the mesh and further decrease the MAE, thereby securing the alignment phase, as shown in Fig. 18.

Fig. 18 MAE of the proposed 3D reconstruction method on BU3DFE after preprocessing with the Butterfly and BPA algorithms

In Fig. 19, the histogram presents a quantitative study of the numbers of vertices and facets during the 3D reconstruction phase. We perform 3 iterations of mesh subdivision using the Butterfly algorithm in the remeshing step; this choice was established after a series of tests. For the interpolating triangulation using BPA, the pivoting ball radius is 3.3231 and the angle threshold is 90°.

Fig. 19 Numbers of vertices and facets during the 3D reconstruction process of the query face image

Once 3D model reconstruction is performed for each given 2D face, fitting to wrap all the detected 2D facial landmarks is conducted by projecting the 3D reconstructed faces onto the 2D ones.

For self-evaluation, the fitting process was tested using two widely used existing models for face alignment, in addition to the model we generated. We noticed that the alignment process when fitting BFM (Fig. 20(a)) is not well adapted to the 2D face due to projection errors; the shift is very noticeable. Beyond the cases of large poses, many images are missed because the projection is unreachable.

Fig. 20 Fitting results: (a) fitting process using BFM, (b) fitting process using 3DMM, (c) fitting process using our reconstructed model

When using 3DMM (Fig. 20(b)), the fitting process was successful under large poses, and facial expressions are well illustrated on the obtained model, thanks to this generic model being learned from 10,000 faces in the wild. However, using this model has one drawback: at each image meshing, the shape of the 3DMM is present in all faces. This implies that all identities share the same signature, which degrades face frontalization.

Performing fitting with an appropriate 3D face model, as shown in Fig. 20(c), helps preserve identity during pose correction. All the 2D keypoints undergo this change of plane while referring to the 3D ones.

Moreover, quantitative tests were carried out to justify and highlight our contribution. A recognition test was therefore established after carrying out the alignment process with each of the previously mentioned fitting methods, using the same technique of keypoint projection and keypoint matching. The recognition rates are presented in Tables 10, 11, and 12.

Table 10 Face recognition rates of aligned faces on YTF dataset
Table 11 Face recognition rates of aligned faces on LFW dataset, testing restricted and unrestricted protocols
Table 12 Face recognition rates of aligned faces on BIWI dataset

Indeed, BFM and 3DMM are two different generic models used in the fitting process. For pose normalization, both image cleaning and image classification were carried out in the same way, to allow comparisons between the results.

To ensure that our approach is efficient and effective, the time factor is considered. Figure 21 shows the time consumed in each step.

Fig. 21 Computation time of each step of the preprocessing stage of the FR pipeline, per query image

4.5 Discussion

Our contribution consists in applying 3D face alignment to FR. The results obtained are among the best, thanks essentially to the efficiency of our 3D face alignment method.

Adding keypoints covers the cropped facial surface, which reduces the number and size of the regions hidden by the pose. This guarantees a sophisticated 3D mesh reconstruction from a single input face image. The aim of 3D reconstruction is to wrap the maximum number of keypoints when the fitting process is established; this process facilitates face rotation with only slight damage to the 2D face image.

5 Conclusion

This paper presents research on face recognition using DCNNs with appropriate training. We added keypoints to the 68 traditional fiducial landmarks using the MSER, Canny, and Prewitt techniques.

We reconstructed 3D meshes based on Delaunay triangulation, followed by facial surface extraction using the Region Growing algorithm, then mesh subdivision and remeshing using the Butterfly and BPA algorithms.

Then, we projected the obtained 3D mesh onto the 2D image plane and wrapped it. This step was followed by pose correction whose purpose was to establish face alignment.

The recognition rates we obtained are explained by several factors, including the well-developed preprocessing steps and the efficient addition of more keypoints, which proves that the 3D mesh reconstruction was conducted very carefully. The resulting face images were thus given directly to the DCNNs without any intervention.

The results obtained are comparable to those reported in the state-of-the-art. In the near future, we plan further experiments with our proposed method on other existing benchmarks, such as LFPW and WFLW.