1 Introduction

Ever since the outbreak of the COVID-19 pandemic, the whole world has been struggling to overcome it. The world order has changed, and people are developing new protocols to cope with the highly infectious disease. The World Health Organization recently suggested that the world will have to learn to live with COVID-19 even after a vaccine is launched. This will also involve modifying existing systems to work with the new protocols.

To prevent the spread of the highly infectious disease, people have to wear masks at all public places such as airports. Ngan et al. [32] from the National Institute of Standards and Technology (NIST) recently published an exhaustive report on face recognition accuracy with masks using pre-COVID-19 algorithms. The report contains a detailed analysis of the results obtained from 89 different face recognition algorithms from different companies when applied to masked faces. Some of the face recognition algorithms that have been tested on masked faces are [20, 30, 36, 44]. The report states that, because of the mask, the face recognition algorithms produced a higher false non-match rate (FNMR) and false match rate (FMR) than when the same pre-COVID-19 algorithms were applied to unmasked faces. One reason present-day face recognition algorithms fail on masked faces is that they require the whole face image as input; when the mask hides most of the face below the eyes, the algorithms cannot generate features rich enough to discriminate between different faces. A second reason for the higher FNMR and FMR is that the mask adds noise to the generated face features. This noise dominates the relevant face features and contributes to misclassification.

Though there have been improvements in primitive machine learning classifiers and deep learning networks for face recognition [1, 12, 20, 31], these improvements were not made with masked faces in mind. Hence, there is an urgent need for a face recognition system that can recognize faces even with a mask, or in other words, a system that can recognize faces using only the eye region. A system that recognizes a face using the eye region is called a periocular recognition system. A periocular recognition system has the advantage that the periocular region does not change with pose, aging, expression, facial changes and other artifacts. Because of this advantage, a lot of research is ongoing in the domain of periocular biometric recognition [11, 41]. Some works have used primitive features such as local binary patterns, while others have used modern deep learning models for feature extraction and classification [3, 21, 24]. In all cases, however, periocular recognition systems are not as robust as full face recognition systems. The reason is clear: the mask hides the informative portion of the face, so robust face recognition cannot be done on the basis of features from the periocular region alone.

Face recognition using the thermal modality is an emerging field of research [28]. Unlike visible images, the thermal modality is unaffected by illumination variation and by most occlusions (except occlusion caused by glass). Thermal images can also be captured in the dark, which is not possible with a visible camera. This means that the thermal and visible modalities complement each other by providing the information missing in the other modality. However, thermal images alone cannot be used for face recognition [6, 35], because face detection algorithms work well only on visible images and not on thermal images. To use the thermal modality for face recognition, visible images have to be reconstructed from the thermal images [9, 19, 47]. This process is heavyweight, complex and prone to error. The reconstruction of a visible image from a thermal image may not always be correct, because information required for the reconstruction may be missing from the thermal image, which can lead to inconsistent face recognition results. A lot of research has therefore been done on the fusion of thermal and visible features [27, 37, 49]. Fusion of thermal and visible images has given good results for face recognition, because the two modalities complement each other, and the fusion of thermal and visible features is a comparatively lightweight process compared with reconstructing a visible face from a thermal one. In our survey, we did not find any research work that applied the fusion of the visible and thermal modalities to periocular recognition. In this paper, we therefore propose a Masked Mobile Lightweight Thermo-visible Periocular Recognition system (MmLwThV) that performs comparably to existing face recognition systems by fusing thermal and visible features.

At most places, such as airports, arranging a large number of training images is not possible. Hence, we propose a periocular recognition system that is lightweight and needs only a few templates of each individual face for recognition. Another challenge that limits the performance of the MmLwThV framework is the noise introduced by the mask. This noise may affect the discriminative ability of the thermo-visible features and hence the periocular recognition accuracy. To negate the effect of the noise, we use an ensemble of classifiers [34] along with random subspace sampling [7, 8] of the training samples in the proposed MmLwThV framework. Our proposed lightweight MmLwThV framework can be deployed on any mobile or handheld device that has a visible and a thermal camera.

Hence, to summarize, we make the following contributions in this paper:

  • We propose an efficient and robust periocular recognition framework for masked faces that uses the fusion of thermal and visible features and an ensemble of subspace networks to mitigate the effect of noise introduced into the features by the mask. The proposed framework is called the Masked Mobile Lightweight Thermo-visible Face Recognition (MmLwThV) framework.

  • We collected an unconstrained, in-the-wild thermo-visible masked dataset for validation of our proposed MmLwThV framework.

2 Related work

A lot of work has been done on face recognition using visible images [20, 30, 36, 44]. However, all these works fail on masked faces [32]. Occlusion is one of the real world challenges that degrade the performance of face recognition, and various approaches have been proposed to handle it [10, 23, 25, 26, 38, 42, 43, 46]. All of these works, however, address random occlusion; they are therefore not suited for conditions where most of the face is occluded by a mask, leaving only the periocular region uncovered.

To overcome the challenges of face recognition in visible images, [18] and [40] reported face recognition using thermal images. Singh et al. [40] used the wavelet domain and eigenspace domain to combine and fuse the features from visible and thermal images, and used a genetic algorithm to find an optimal fusion strategy. Madheswari et al. [27] also used feature fusion of thermal and visible images. To fuse features from images of the two modalities, the authors considered features such as the discrete wavelet transform and the Curvelet transform (CT), and then employed particle swarm optimization, self-tuning particle swarm optimization and the brain storm optimization algorithm to find optimal fusion coefficients. In such works, where features from the thermal and visible modalities were fused, finding an appropriate feature and subsequently an optimal fusion strategy was always a challenge. A feature is also needed that can be extracted from both thermal and visible images without any compatibility issues.

To reduce sensitivity to noise, illumination conditions and facial expressions, texture analysis has time and again proven to be highly efficient. Ojala et al. [33] introduced local binary patterns (LBP) for capturing the texture in an image. LBP is computationally simple and yet capable of capturing fine details in an image, and it has been used for feature extraction not only in visible images but also in thermal images. Several variants of LBP have been proposed, such as Local Derivative Patterns (LDP) [5] and Local Variant Patterns (LVP) [17]. In 2015, the authors of [11, 41] used the Local Binary Pattern to extract features from the eye region and performed bit shifting during feature matching, since accuracy may degrade because of head movement. Dubey et al. [15], also in 2015, introduced a novel low-dimensional and time-efficient variant of LBP called the Local Bit-plane Decoded Pattern (LBDP). LBDP encodes the relationship of each pixel in the image with its neighbours in each bit plane separately. Since LBDP encodes the image at its lowest level, i.e. the bit level, it can efficiently capture the texture variation of images from two different modalities. These works motivated us to use LBP and its variants to extract features from both visible and thermal images and to fuse them.

3 Datasets

To validate our proposed MmLwThV framework, we needed a dataset that contained masked and unmasked images of each subject, with both the thermal and the visible modality for every image. In our experiments for the MmLwThV framework, we need the periocular region from both the thermal and visible images. However, cropping the periocular region from the thermal image is difficult because eye detection algorithms do not work on thermal images. The periocular region can be cropped from the thermal image using the same coordinates as the periocular region in the visible image only when the thermal image is registered with the visible image. The dataset for our experiments was therefore required to have pixel-to-pixel registration between the thermal images and the corresponding visible images. The WHU-IIP dataset contains registered thermal and visible face images of different subjects; however, it does not contain masked images of the subjects. So we collected and prepared a masked thermo-visible dataset and named it the CVBL Masked Face Recognition dataset. The details of the datasets are given below:

3.1 WHU-IIP dataset

The WHU-IIP dataset is a thermo-visible dataset that contains thermal and visible images of 33 different unmasked subjects. For each of the 33 subjects, the dataset contains 24 thermal and 24 visible images. The thermal and the corresponding visible images are pixel-to-pixel registered. Some of the thermal and visible images from the WHU-IIP dataset are shown in Fig. 1.

Fig. 1
figure 1

Registered thermal and visible images of WHU-IIP dataset. The WHU-IIP dataset has no masked faces

3.2 CVBL masked face recognition dataset

To verify our proposed MmLwThV framework, we collected a masked dataset in an unconstrained environment for face recognition and named it the CVBL Masked Face Recognition dataset. The dataset has both visible and corresponding thermal images for the masked and unmasked faces of all the subjects; examples are shown in Fig. 2. The images were captured using a Sonel KT150 Thermal Imager, which captures both the thermal and visible images of a subject simultaneously, and the thermal and visible images are pixel-to-pixel registered. The images were captured in a real world environment with varying pose, illumination, resolution and distance of the subject from the camera. Because of the lighting variations, the quality of the thermal images also varies and is not consistent throughout; the quality of the thermal images is affected by the daylight intensity, which kept varying during the entire data collection session. The visible images also exhibit wide variation in quality, and the periocular region is sometimes not clearly captured because of inconsistent lighting and varying pose, as shown in Fig. 2. Most importantly, the masked dataset is realistic and not synthetic: the quality of the masks and their position on the face vary from person to person. The number of images per subject also varies, so the CVBL masked dataset represents the actual situation where the number of images available for training the classifier may differ per subject depending on availability. The CVBL masked dataset thus presents these practical, real world challenges to a periocular recognition system and allows us to check the validity and robustness of our proposed method.

Fig. 2
figure 2

Registered thermal and visible images from the CVBL Masked dataset. The dataset is unconstrained and captured in a real world environment. There is high variation in illumination, pose and distance of the subject from the camera. The subjects wear real masks. The quality of the thermal images is also affected by the intensity of the daylight when the image was captured

There are in total 235 visible images and 235 corresponding thermal images for 21 subjects, i.e. on average 11 images per subject in each of the visible and thermal modalities. The lowest number of images for any subject is 6 and the highest is 16. For unmasked images, there are 236 images from the same 21 subjects, with the lowest and highest numbers of unmasked images for any subject again being 6 and 16, respectively.

4 Ablation study

4.1 Face recognition on masked faces

In this section, we present the experimental results obtained when a masked face is recognized using pre-COVID-19 algorithms based on Local Binary Patterns (LBP) [2] and its variants, and compare face recognition on masked and unmasked full faces. The experiments were performed on the CVBL masked dataset using thermal, visible and thermo-visible features; the results are summarized in Table 1. We used LBP [2], LBDP [15], LDP [5] and LVP [17] features, and lightweight primitive classifiers such as the Minimum Distance Classifier, Support Vector Machines (SVM) [13] and its variants, K-Nearest Neighbour (KNN) [16] and its variants, Ensemble Subspace Discriminant [4] and Ensemble Subspace KNN [48] for classification.
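The following is a minimal, hedged sketch (not our exact code; the file paths and LBP parameters are illustrative assumptions) of how such a thermo-visible feature could be built: an LBP histogram is computed on the visible crop and on its registered thermal counterpart, and the two histograms are concatenated.

```python
import cv2
import numpy as np
from skimage.feature import local_binary_pattern

def lbp_histogram(gray, P=8, R=1):
    """Pool uniform LBP codes into a normalized histogram feature."""
    codes = local_binary_pattern(gray, P, R, method="uniform")
    hist, _ = np.histogram(codes.ravel(), bins=np.arange(0, P + 3), density=True)
    return hist.astype(np.float32)

# Illustrative file names; in practice these are the registered crops of one subject.
visible = cv2.imread("subject01_visible_crop.png", cv2.IMREAD_GRAYSCALE)
thermal = cv2.imread("subject01_thermal_crop.png", cv2.IMREAD_GRAYSCALE)

# Feature-level fusion: simple concatenation of the two modality histograms.
thermo_visible_feature = np.concatenate([lbp_histogram(visible), lbp_histogram(thermal)])
```

The same concatenated vector can then be fed to any of the lightweight classifiers listed above.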

In all experiments, we trained the classifiers on features from the unmasked face images and used 5-fold cross-validation to test them on unmasked images. Using the classifiers trained on unmasked faces, we also tested on the masked images. The results in Table 1 can be summarized as follows. Comparing the corresponding values in the 'M' and 'Un' rows for a classifier, we find that the face recognition accuracy on masked faces is much lower than the corresponding accuracy on unmasked faces for a given modality. The highest accuracy obtained on masked faces is 27%, with the Ensemble Subspace Discriminant classifier and LBDP features of the masked thermal face image; the corresponding accuracy on unmasked faces is 63.6%. The highest accuracy obtained on unmasked faces is 92.80%, when Ensemble Subspace KNN classified the thermo-visible LBDP features of the unmasked images.

Table 1 Comparative Results of Face Recognition for MASKED (M) and UNMASKED (Un) faces on CVBL Masked Dataset. The experiment has been performed on features from Visible (V) and Thermal (T) images

From Table 1, it can easily be observed that in many cases the fusion of thermal and visible features did not produce better results than the visible or thermal features used individually. This pattern is more prevalent for masked images than for unmasked images. This means that when the features of the thermal and visible modalities are fused, the noise due to the facial mask becomes more dominant than the facial features themselves. Hence, face recognition accuracy on masked faces dropped when the features from the thermal and visible modalities were fused.

4.2 Strength of periocular recognition

In this section, we experimentally examine the strength of periocular recognition. We extracted the lightweight LBDP features of the full face, left eye and right eye from the unmasked images in the WHU-IIP dataset and compared the results, which are summarized in Table 2. The predictions were made using the minimum distance classifier with Euclidean distance.
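For reference, the minimum distance classifier used here is simply a nearest-template rule; the following hedged sketch (the gallery layout and variable names are illustrative, not our exact implementation) shows one way it could be written:

```python
import numpy as np
from scipy.spatial.distance import cdist

def minimum_distance_classify(probe, gallery_feats, gallery_labels):
    """Assign the probe to the subject whose template is closest in Euclidean distance."""
    dists = cdist(probe[None, :], gallery_feats, metric="euclidean")[0]
    return gallery_labels[int(np.argmin(dists))]

# Usage sketch: gallery_feats is an (n_templates, n_dims) array of LBDP features,
# gallery_labels holds the corresponding subject identities, probe is one feature vector.
```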

Table 2 Recognition Accuracy for different face regions for Thermal and Visible Images using Euclidean distance based prediction

In Table 2, we compare the results for the left and right eyes with those for the full face. For the thermal or the visible modality alone, the periocular recognition accuracy for the left and right eyes is considerably lower than the face recognition accuracy. For the thermo-visible modality, however, the periocular recognition accuracy becomes comparable to the face recognition accuracy: the accuracies for the left and right eyes are 92.05% and 91.29%, respectively.

So we can conclude that periocular recognition can make a very strong contribution towards identification of a person by using thermo-visible features.

5 The proposed methodology

We describe our proposed Masked Mobile Lightweight Thermo-visible Face Recognition (MmLwThV) framework in this section. The entire framework is shown in Fig. 3. The MmLwThV framework follows these steps for recognition of masked faces:

  1.

    Periocular Region Extraction

    As shown in Fig. 3, the training set X contains registered masked thermal and visible images of n subjects. For each subject, the periocular region is extracted from the visible image by cropping, based on the coordinates provided by an eye detection algorithm, and the corresponding periocular region in the thermal image is extracted using the same bounding box, since the two images are registered. The bounding box is obtained using the functions available in OpenCV (a minimal sketch of this step is given after this list).

  2.

    Thermo-visible Feature Extraction

    After obtaining the periocular region from the thermal and visible images of a subject in the training set, handcrafted features are generated from the periocular region in each modality and concatenated to produce the fused feature Xk, where k varies from 1 to n. Thus X1, X2, X3, ..., Xn are the fused features obtained from the subjects in the training set X.

  3.

    Random Subspace Sampling

    This is shown in the block named Random Feature Selection in Fig. 3. The training set X is randomly sampled to produce D different sample subspaces \(X^{1}, X^{2}, \ldots, X^{d}, \ldots, X^{D}\). These D random subspaces are used to train D different classifier models.

  4.

    Classifier Training

    Classifier training is shown in the block named Random Feature Selection in Fig. 3. D different models C1, C2, C3, ..., CD are trained, such that classifier model Cd is trained using the sample subspace \(X^{d}\), where d lies between 1 and D. Thus, the ensemble network consisting of D weak classifiers is trained using the D random subspaces. The trained network can now be used for periocular recognition of masked faces.

  5.

    Periocular Recognition by the Ensemble Network

    After training, the ensemble of classifiers C1, C2, C3, ..., CD is used for classification. Registered thermal and visible images of a masked face are captured, and the periocular region is extracted from both the visible and the thermal image. Features are then extracted from both periocular crops, fused, and given as input to the ensemble network, which classifies the input using the majority voting rule.
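As a minimal sketch of the periocular region extraction in step 1 (hedged: the Haar cascade, margin and cropping rule are illustrative assumptions, not our exact implementation), the eyes are detected on the visible image and the resulting bounding box is reused on the registered thermal image:

```python
import cv2

# OpenCV ships Haar cascades; the eye cascade is used here purely as an example detector.
eye_detector = cv2.CascadeClassifier(cv2.data.haarcascades + "haarcascade_eye.xml")

def crop_periocular(visible_gray, thermal_gray, margin=10):
    """Detect eyes on the visible image; reuse the same box on the registered thermal image."""
    eyes = eye_detector.detectMultiScale(visible_gray)
    if len(eyes) == 0:
        return None, None
    # Union of the detected eye boxes, padded by a small margin.
    x0 = max(min(x for x, y, w, h in eyes) - margin, 0)
    y0 = max(min(y for x, y, w, h in eyes) - margin, 0)
    x1 = min(max(x + w for x, y, w, h in eyes) + margin, visible_gray.shape[1])
    y1 = min(max(y + h for x, y, w, h in eyes) + margin, visible_gray.shape[0])
    # Pixel-to-pixel registration means the same coordinates are valid in the thermal image.
    return visible_gray[y0:y1, x0:x1], thermal_gray[y0:y1, x0:x1]
```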

Fig. 3
figure 3

Complete block diagram of MmLwThV framework

In Fig. 4, three different visible facial images \({I^{V}_{A}}\), \(I^{V_{i}}_{P}\) and \(I^{V_{j}}_{N}\) are shown, drawn from a facial image dataset of size N such that 1 ≤ i, j ≤ N − 1 and i ≠ j for all i and j. \({I^{V}_{A}}\) is the anchor image; \(I^{V_{i}}_{P}\) is a positive image, containing the face of the same person as in \({I^{V}_{A}}\); and \(I^{V_{j}}_{N}\) is a negative image, containing the face of a person different from that in \({I^{V}_{A}}\).

Similarly, \({I^{T}_{A}}\), \(I^{T_{i}}_{P}\) and \(I^{T_{j}}_{N}\) are the three thermal images corresponding to the visible images \({I^{V}_{A}}\), \(I^{V_{i}}_{P}\) and \(I^{V_{j}}_{N}\) respectively, such that \({I^{V}_{A}}\) and \({I^{T}_{A}}\) are pixel-to-pixel registered. Similarly, \(I^{V_{i}}_{P}\) and \(I^{T_{i}}_{P}\), as well as \(I^{V_{j}}_{N}\) and \(I^{T_{j}}_{N}\), also have pixel-to-pixel correspondence.

Features of the visible images \({I^{V}_{A}}\), \(I^{V_{i}}_{P}\) and \(I^{V_{j}}_{N}\) are \({F^{V}_{A}}\), \(F^{V_{i}}_{P}\) and \(F^{V_{j}}_{N}\) respectively such that:

$$ {F^{V}_{A}} = [F^{V}_{A_{1}}, F^{V}_{A_{2}}, F^{V}_{A_{3}}, F^{V}_{A_{4}}, F^{V}_{A_{5}}, \ldots, F^{V}_{A_{l}}] $$
(1)
$$ F^{V_{i}}_{P} = [F^{V_{i}}_{P_{1}}, F^{V_{i}}_{P_{2}}, F^{V_{i}}_{P_{3}}, F^{V_{i}}_{P_{4}}, F^{V_{i}}_{P_{5}}, \ldots, F^{V_{i}}_{P_{l}}] $$
(2)
$$ F^{V_{j}}_{N} = [F^{V_{j}}_{N_{1}}, F^{V_{j}}_{N_{2}}, F^{V_{j}}_{N_{3}}, F^{V_{j}}_{N_{4}}, F^{V_{j}}_{N_{5}}, \ldots, F^{V_{j}}_{N_{l}}] $$
(3)

The distance between the anchor image \({I^{V}_{A}}\) and positive image \(I^{V_{i}}_{P}\) is denoted by \(D^{V_{i}}_{AP}\) and the distance between the \({I^{V}_{A}}\) and negative image \(I^{V_{j}}_{N}\) is denoted by \(D^{V_{j}}_{AN}\) where,

$$ D^{V_{i}}_{AP} = {F^{V}_{A}} - F^{V_{i}}_{P} $$
(4)
$$ D^{V_{j}}_{AN} = {F^{V}_{A}} - F^{V_{j}}_{N} $$
(5)

Calculating \(D^{V_{i}}_{AP}\) and \(D^{V_{j}}_{AN}\) using Euclidean distance, we get:

$$ D^{V_{i}}_{AP} = \biggl\{{\sum\limits}_{k=1}^{l}(F^{V}_{A_{k}} - F^{V_{i}}_{P_{k}})^{2} \biggr\}^{1/2} $$
(6)
$$ D^{V_{j}}_{AN} = \biggl\{{\sum\limits}_{k=1}^{l}(F^{V}_{A_{k}} - F^{V_{j}}_{N_{k}})^{2} \biggr\}^{1/2} $$
(7)

From unmasked faces, we obtain discriminative features from the whole face. Hence, the intra-class distance \(D^{V_{i}}_{AP}\) is assumed to be less than the inter-class distance \(D^{V_{j}}_{AN}\). This can be written, ∀k: 1 ≤ k ≤ l, as:

$$ (F^{V}_{A_{k}} - F^{V_{i}}_{P_{k}}) <<< (F^{V}_{A_{k}} - F^{V_{j}}_{N_{k}}) $$
(8)

So, the classification can be done correctly as the demarcation line between the classes is clear.
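As an illustrative toy case (the numbers are made up purely for exposition and do not come from any experiment), take three-dimensional visible features

$$ {F^{V}_{A}} = [5, 7, 2], \qquad F^{V_{i}}_{P} = [5, 6, 3], \qquad F^{V_{j}}_{N} = [1, 2, 9] $$

Then, by (6) and (7),

$$ D^{V_{i}}_{AP} = \sqrt{0^{2} + 1^{2} + 1^{2}} = \sqrt{2} \approx 1.41, \qquad D^{V_{j}}_{AN} = \sqrt{4^{2} + 5^{2} + 7^{2}} = \sqrt{90} \approx 9.49 $$

so the intra-class distance is clearly smaller than the inter-class distance and the anchor is matched to the positive image.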

Fig. 4
figure 4

The Anchor, Positive and Negative Visible Images. For correct classification, the Intra-class distance should be less than the Inter-class distance

5.1 Problem 1

For masked faces, suppose the kth feature \(NF^{V_{i}}_{P_{k}}\) is noisy because of the mask. This may affect the relation in (8), which may become:

$$ (F^{V}_{A_{k}} - NF^{V_{i}}_{P_{k}}) >> (F^{V}_{A_{k}} - F^{V_{j}}_{N_{k}}) $$
(9)

If (9) holds for most values of k, then the result is:

$$ D^{V_{i}}_{AP} >> D^{V_{j}}_{AN} $$
(10)

i.e. the discriminative ability of the features is affected. The problem is to devise a method for masked face recognition such that the noise due to the mask does not affect the discriminative ability of the features needed for correct classification.

5.2 Proposed solution stage 1: Thermo-visible fusion

We propose to use only the periocular region of the masked faces, so that most of the noise due to the mask is removed. For simplicity of nomenclature, we let \({F^{V}_{A}}\), \(F^{V_{i}}_{P}\), \(F^{V_{j}}_{N}\), \({F^{T}_{A}}\), \(F^{T_{i}}_{P}\), \(F^{T_{j}}_{N}\) now denote the visible and thermal features of the periocular region. Then we get:

$$ (F^{V}_{A_{k}} - F^{V_{i}}_{P_{k}}) \approx < (F^{V}_{A_{k}} - F^{V_{j}}_{N_{k}}) $$
(11)

However, since the features from the periocular region alone are not highly discriminative, we get

$$ D^{V_{i}}_{AP} \approx D^{V_{j}}_{AN} $$
(12)

This may lead to erratic classification results. Hence, the fusion of thermal features with the visible features is done to increase the discriminative power of the thermo-visible features of the periocular region.

Given the thermal and visible features, we fuse the features of the visible image with those of the corresponding thermal image. Then, as shown in Fig. 5, the thermo-visible features can be written as:

$$ \begin{array}{@{}rcl@{}} F^{VT}_{A} &=& [F^{V}_{A_{1}}, F^{V}_{A_{2}}, F^{V}_{A_{3}}, F^{V}_{A_{4}}, F^{V}_{A_{5}}, \ldots, F^{V}_{A_{l}}, \\ &&F^{T}_{A_{1}}, F^{T}_{A_{2}}, F^{T}_{A_{3}}, F^{T}_{A_{4}}, F^{T}_{A_{5}}, \ldots, F^{T}_{A_{l}}] \end{array} $$
(13)
$$ \begin{array}{@{}rcl@{}} F^{VT_{i}}_{P} &=& [F^{V_{i}}_{P_{1}}, F^{V_{i}}_{P_{2}}, F^{V_{i}}_{P_{3}}, F^{V_{i}}_{P_{4}}, F^{V_{i}}_{P_{5}}, \ldots, F^{V_{i}}_{P_{l}}, \\ &&F^{T_{i}}_{P_{1}}, F^{T_{i}}_{P_{2}}, F^{T_{i}}_{P_{3}}, F^{T_{i}}_{P_{4}}, F^{T_{i}}_{P_{5}}, \ldots, F^{T_{i}}_{P_{l}}] \end{array} $$
(14)
$$ \begin{array}{@{}rcl@{}} F^{VT_{j}}_{N} &=& [F^{V_{j}}_{N_{1}}, F^{V_{j}}_{N_{2}}, F^{V_{j}}_{N_{3}}, F^{V_{j}}_{N_{4}}, F^{V_{j}}_{N_{5}}, \ldots, F^{V_{j}}_{N_{l}}, \\ &&F^{T_{j}}_{N_{1}}, F^{T_{j}}_{N_{2}}, F^{T_{j}}_{N_{3}}, F^{T_{j}}_{N_{4}}, F^{T_{j}}_{N_{5}}, \ldots, F^{T_{j}}_{N_{l}}] \end{array} $$
(15)
Fig. 5
figure 5

The figure explains how the mask causes noise in the thermo-visible features. The noise affects the discriminative ability of the features

If we assume, ∀k: 1 ≤ k ≤ l:

$$ D^{V_{i}}_{AP_{k}} = (F^{V}_{A_{k}} - F^{V_{i}}_{P_{k}}) $$
(16)

we can write (6) and (7) as:

$$ D^{V_{i}}_{AP} = \sqrt{{\sum\limits}_{k=1}^{l}(D^{V_{i}}_{AP_{k}})^{2}} $$
(17)
$$ D^{V_{j}}_{AN} = \sqrt{{\sum\limits}_{k=1}^{l}(D^{V_{j}}_{AN_{k}})^{2}} $$
(18)

Similar to (4) and (5), the Euclidean distance between the fused features \(F^{VT}_{A}\) and \(F^{VT_{i}}_{P}\), denoted by \(D^{VT_{i}}_{AP}\), and the distance between \(F^{VT}_{A}\) and \(F^{VT_{j}}_{N}\), denoted by \(D^{VT_{j}}_{AN}\), are given by:

$$ D^{VT_{i}}_{AP} = \sqrt{{\sum\limits}_{k=1}^{l}(F^{V}_{A_{k}} - F^{V_{i}}_{P_{k}})^{2} + {\sum\limits}_{k=1}^{l}(F^{T}_{A_{k}} - F^{T_{i}}_{P_{k}})^{2}} $$
(19)
$$ D^{VT_{j}}_{AN} = \sqrt{{\sum\limits}_{k=1}^{l}(F^{V}_{A_{k}} - F^{V_{j}}_{N_{k}})^{2} + {\sum\limits}_{k=1}^{l}(F^{T}_{A_{k}} - F^{T_{j}}_{N_{k}})^{2}} $$
(20)

Squaring both sides of (6) and (7) and substituting into (19) and (20) respectively, we get:

$$ D^{VT_{i}}_{AP} = \sqrt{ (D^{V_{i}}_{AP})^{2} + {\sum\limits}_{k=1}^{l}(F^{T}_{A_{k}} - F^{T_{i}}_{P_{k}})^{2} } $$
(21)
$$ D^{VT_{j}}_{AN} = \sqrt{(D^{V_{j}}_{AN})^{2} + {\sum\limits}_{k=1}^{l}(F^{T}_{A_{k}} - F^{T_{j}}_{N_{k}})^{2}} $$
(22)

When the thermal features are concatenated with the visible features, they supply information missing from the visible features; the thermal features complement the visible features. The term \((F^{T}_{A_{k}} - F^{T_{i}}_{P_{k}})\) in (21) is less than the term \((F^{T}_{A_{k}} - F^{T_{j}}_{N_{k}})\) in (22), i.e.:

$$ F^{T}_{A_{k}} - F^{T_{i}}_{P_{k}} < F^{T}_{A_{k}} - F^{T_{j}}_{N_{k}} $$
(23)

∀k: 1 ≤ k ≤ l. Hence, only a small increment is added to the intra-class distance \(D^{V_{i}}_{AP}\), while there is a larger increment in the inter-class distance \(D^{V_{j}}_{AN}\). So, we get:

$$ D^{VT_{i}}_{AP} <<< D^{VT_{j}}_{AN} $$
(24)

Equation (24) shows that the fusion of the thermal and visible features makes the fused feature more discriminative: the intra-class distance is much less than the inter-class distance.
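The following small numeric illustration (the feature values are synthetic and chosen only to mirror (19)-(24); they are not measured data) shows how concatenating complementary thermal features widens the gap between intra- and inter-class distances:

```python
import numpy as np

# Synthetic visible features: the negative is almost as close to the anchor as the positive.
F_V_A = np.array([0.30, 0.40, 0.20])   # anchor, visible
F_V_P = np.array([0.28, 0.41, 0.26])   # positive, visible
F_V_N = np.array([0.26, 0.45, 0.27])   # negative, visible

# Synthetic thermal features: the negative is far from the anchor in this modality.
F_T_A = np.array([0.50, 0.10, 0.35])
F_T_P = np.array([0.49, 0.12, 0.36])
F_T_N = np.array([0.20, 0.40, 0.10])

d = np.linalg.norm
print(d(F_V_A - F_V_P), d(F_V_A - F_V_N))            # visible only: ~0.06 vs ~0.09 (ambiguous)
print(d(np.r_[F_V_A, F_T_A] - np.r_[F_V_P, F_T_P]),  # fused intra-class: stays ~0.07
      d(np.r_[F_V_A, F_T_A] - np.r_[F_V_N, F_T_N]))  # fused inter-class: grows to ~0.50
```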

5.3 Problem 2

In real world circumstances, the mask is placed on the face arbitrarily. When the periocular region is cropped from the face in the thermal or visible image, a portion of the mask remains in the cropped region, as shown in Fig. 5. This small masked area in the cropped thermal and visible periocular images adds noise and may affect the discriminative ability of the features. To recognize the periocular region from a masked face, it is compared with the periocular region of an unmasked face template. Here, the unmasked periocular template is free from mask noise, while the test image may be a periocular region from a masked positive image or a negative image.

We can therefore rewrite (16) for noisy features as:

$$ D^{V_{i}}_{AP_{k}} = (F^{V}_{A_{k}} - NF^{V_{i}}_{P_{k}}) $$
(25)
$$ D^{V_{j}}_{AN_{k}} = (F^{V}_{A_{k}} - NF^{V_{j}}_{N_{k}}) $$
(26)

such that \(NF^{V_{i}}_{P_{k}}\) and \(NF^{V_{j}}_{N_{k}}\) are the noisy feature values from the periocular region of a positive and a negative visible image, \(I^{V_{i}}_{P}\) and \(I^{V_{j}}_{N}\) respectively.

As stated before, if \(D^{V_{i}}_{AP_{k}} < D^{V_{j}}_{AN_{k}}\) for most values of k, then the condition \(D^{V_{i}}_{AP} < D^{V_{j}}_{AN}\) is satisfied and the classification is correct.

However, the noisy feature values \(NF^{V_{i}}_{P_{k}}\) and \(NF^{V_{j}}_{N_{k}}\) can take any random value. This can affect the relation between \(D^{V_{i}}_{AP_{k}}\) and \(D^{V_{j}}_{AN_{k}}\), and therefore between \(D^{V_{i}}_{AP}\) and \(D^{V_{j}}_{AN}\). The relation may become \(D^{V_{i}}_{AP} > D^{V_{j}}_{AN}\), resulting in misclassification.

The problem is to propose a method so that the noise in the features can be ignored for better classification.

5.4 Proposed solution stage 2: Random subspace sampling method and ensemble of networks

We propose to use Random Subspace Sampling Method to overcome the effect of noise in the thermo-visible features. Random Subspace Sampling Method is described below:

Let us suppose that we have a training set X = (X1, X2, ..., Xn). Each training object Xi in the training set X is a p-dimensional feature vector, Xi = (xi1, xi2, ..., xip).

To construct a random subspace Xd from the available training set X, we select r random features from each p-dimensional feature vector Xi, with r < p. The modified random subspace training set is \(X^{d} = ({X^{d}_{1}}, {X^{d}_{2}}, \ldots, {X^{d}_{n}})\). Each training object \({X^{d}_{i}}\), i = 1, 2, ..., n, in the random subspace training set is an r-dimensional vector \({X^{d}_{i}} = (x^{d}_{i1}, x^{d}_{i2}, \ldots, x^{d}_{ir})\), where each component \(x^{d}_{ij}\), j = 1, 2, ..., r, is selected from the p-dimensional vector Xi = (xi1, xi2, ..., xip).

We now construct an ensemble of D classifiers Cd(x), d = 1, 2, ..., D. Each classifier Cd(x) is trained using a separate random subspace Xd. The results of the classifiers Cd(x) are combined using the majority voting rule. The algorithm for training the ensemble of classifiers on random subspace samples from the original training set is shown in Fig. 6.

Fig. 6
figure 6

Random Subspace Sampling Method
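A compact, hedged sketch of this procedure is given below (the base learner, the subspace size r and the ensemble size D are illustrative choices, not our exact configuration):

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

rng = np.random.default_rng(0)

def train_subspace_ensemble(X, y, D=30, r=128):
    """Train D weak classifiers, each on r randomly chosen feature dimensions of X."""
    ensemble = []
    for _ in range(D):
        idx = rng.choice(X.shape[1], size=r, replace=False)      # random subspace X^d
        clf = KNeighborsClassifier(n_neighbors=1).fit(X[:, idx], y)
        ensemble.append((idx, clf))
    return ensemble

def predict_majority(ensemble, x):
    """Combine the D weak decisions on a probe vector x by majority voting."""
    votes = [clf.predict(x[idx][None, :])[0] for idx, clf in ensemble]
    labels, counts = np.unique(votes, return_counts=True)
    return labels[np.argmax(counts)]
```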

The thermo-visible features of the anchor image are as shown in (13). The features of a masked face image contain noise due to the mask; hence we can rewrite (14) and (15) to include the noise, as shown in (27) and (28):

$$ \begin{array}{@{}rcl@{}} F^{VT_{i}}_{P} &=& [F^{V_{i}}_{P_{1}}, NF^{V_{i}}_{P_{2}}, NF^{V_{i}}_{P_{3}}, F^{V_{i}}_{P_{4}}, F^{V_{i}}_{P_{5}}, \ldots, F^{V_{i}}_{P_{l}}, \\ &&F^{T_{i}}_{P_{1}}, NF^{T_{i}}_{P_{2}}, NF^{T_{i}}_{P_{3}}, F^{T_{i}}_{P_{4}}, F^{T_{i}}_{P_{5}}, \ldots, F^{T_{i}}_{P_{l}}] \end{array} $$
(27)
$$ \begin{array}{@{}rcl@{}} F^{VT_{j}}_{N} &=& [F^{V_{j}}_{N_{1}}, F^{V_{j}}_{N_{2}}, NF^{V_{j}}_{N_{3}}, NF^{V_{j}}_{N_{4}}, F^{V_{j}}_{N_{5}}, \ldots, F^{V_{j}}_{N_{l}}, \\ &&F^{T_{j}}_{N_{1}}, F^{T_{j}}_{N_{2}}, NF^{T_{j}}_{N_{3}}, NF^{T_{j}}_{N_{4}}, F^{T_{j}}_{N_{5}}, \ldots, F^{T_{j}}_{N_{l}}] \end{array} $$
(28)

where NF in a feature term represents a noisy feature; for example, \(NF^{V_{i}}_{P_{k}}\) or \(NF^{T_{j}}_{N_{3}}\), k = 1, 2, 3, ..., 2l, denote noisy features. Let us say we generate a random subspace Xd, d = 1, 2, 3, ..., D, of dimension r, r < 2l, using the random subspace sampling method; then one of the sampled objects \(F^{VT_{i}}_{P}\) can be expressed as:

$$ F^{VT_{i}}_{P} = [F^{V_{i}}_{P_{1}}, F^{V_{i}}_{P_{4}}, F^{V_{i}}_{P_{5}},..., F^{T_{i}}_{P_{r}}] $$
(29)
$$ F^{VT_{j}}_{N} = [F^{V_{j}}_{N_{1}}, F^{V_{j}}_{N_{2}}, NF^{T_{j}}_{N_{4}},..., F^{T_{j}}_{N_{r}}] $$
(30)

Because r < 2l, where r is the dimension of the randomly selected subspace and 2l is the dimension of the fused features in the original training set, the probability that any particular feature is selected into the sampled subspace Xd is r/l when only the visible features (of dimension l) are used, whereas it is only r/(2l) when the fused thermo-visible features are used. Since r/l >> r/(2l), the fusion of thermal and visible features and random subspace sampling together lessen the probability that a noisy feature appears in the final sampled subspace, and hence reduce misclassification.
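For example, with purely illustrative values of l = 256 feature dimensions per modality and a subspace size of r = 64, the chance that a given noisy coordinate survives the sampling is halved by fusion:

$$ \frac{r}{l} = \frac{64}{256} = 0.25, \qquad \frac{r}{2l} = \frac{64}{512} = 0.125 $$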

The ensemble of networks has the following advantages in our proposed approach:

  1.

    Generalization property of the ensemble

    The ensemble of networks has a good generalization property: it combines weak classifiers and generalizes their results to produce better results than any individual classifier.

  2.

    Ensemble reduces the impact of noise

    The probability of a noisy feature appearing in a randomly selected subspace is r/(2l). This probability is low, so the noisy feature may not appear in most of the sampled subspaces, which causes most of the classifiers in the ensemble to make a correct prediction. The final classification decision is made by majority voting, which helps nullify the effect of the noise on the final classification result of the ensemble network.

6 Implementation

To verify the efficacy of our proposed MmLwThV framework, we carried out experiments on the WHU-IIP dataset and the CVBL masked face recognition dataset. As discussed before, the WHU-IIP dataset has pixel-to-pixel registered thermal and visible images but does not have masked images of the subjects. Using the WHU-IIP dataset, we therefore performed periocular recognition with visible, thermal and thermo-visible features extracted from unmasked faces. The CVBL masked face recognition dataset contains masked and unmasked faces in the thermal and visible modalities, so we performed exhaustive experiments on it for periocular recognition using thermal, visible and thermo-visible features on masked and unmasked faces. We used ensemble networks for classification, and for comparison we also used basic classifiers such as the minimum distance classifier (using Euclidean distance), Support Vector Machines (SVM), K-Nearest Neighbour (KNN) and their different variants. The results on the WHU-IIP dataset and the CVBL dataset are summarized in Tables 3 and 4, respectively.

Table 3 Results of MmLwThV framework on WHU-IIP dataset on Thermal(T) and Visible (V) Images for different Features
Table 4 Results of MmLwThV framework on CVBL Masked dataset on Thermal(T) and Visible (V) Images for different Features

6.1 Periocular recognition using MmLwThV framework on WHU-IIP dataset

As discussed above, the WHU-IIP dataset does not contain masked faces; hence, there is no noise due to masks in the thermo-visible features. We therefore performed the experiment on the WHU-IIP dataset to validate the efficacy of the fusion of thermal and visible features. We used 5-fold cross-validation for the test results, which are summarized in Table 3.

In order to evaluate different classifiers, it is necessary to determine whether there is any statistically significant difference between the results of two classifiers. If the results of two classifiers are statistically different, the classifiers can be compared for better performance; if the results are statistically the same, we can say that the two classifiers behave similarly. Hence, we analysed the results of the MmLwThV framework on the WHU-IIP dataset using the Wilcoxon signed-rank test [14, 39].

6.1.1 Wilcoxon signed-ranks test

The Wilcoxon signed-rank test [14, 39] is a non-parametric statistical test: it does not assume that the data follow any particular distribution, only that the paired observations can be ranked. It is used to determine whether two sets of paired observations differ from one another in a statistically significant manner. The test assumes that the two samples are dependent (paired) observations of the same cases, and that the pairs are drawn randomly and independently from the population.
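As a hedged illustration of how such a comparison could be run in practice (the per-fold accuracy values below are placeholders, not results from this paper), the paired accuracies of two classifiers can be passed to scipy:

```python
from scipy.stats import wilcoxon

# Placeholder per-fold accuracies for two classifiers evaluated on the same folds.
acc_classifier_a = [0.97, 0.99, 0.98, 1.00, 0.99]
acc_classifier_b = [0.96, 0.98, 0.99, 0.99, 0.98]

stat, p_value = wilcoxon(acc_classifier_a, acc_classifier_b)
# A small p-value indicates the two result sets differ significantly ('D' in Tables 5 and 6);
# otherwise they are treated as similar ('S').
print(stat, p_value)
```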

6.1.2 Evaluation of Wilcoxon signed-ranks test on WHU-IIP dataset

From Table 3, we can see that the Ensemble Subspace Discriminant has the highest accuracy of 99.9%. Hence, we compare the Ensemble Subspace Discriminant with every other classifier listed in Table 3, using the Wilcoxon signed-rank test, to find whether its results are statistically different from those of the other classifiers. The results of the Wilcoxon signed-rank test are summarized in Table 5, where each column presents the comparison of the Ensemble Subspace Discriminant with a different classifier. In the first column of Table 5, for example, the Ensemble Subspace Discriminant is compared with Ensemble Subspace KNN.

Table 5 Results of the Wilcoxon Signed-ranks Test for the MmLwThV framework on the WHU-IIP dataset. Ensemble Subspace Discriminant is compared with all other classifiers. The other classifiers are represented by the letters in brackets: Ensemble Subspace KNN (B), Minimum Distance Classifier (C), Linear SVM (D), Quadratic SVM (E), Cubic SVM (F), Fine Gaussian SVM (G), Medium Gaussian SVM (H), Coarse Gaussian SVM (J), Fine KNN (K), Medium KNN (L), Coarse KNN (M), Cosine KNN (N), Cubic KNN (O), Weighted KNN (P). S stands for similar and D stands for different

From the results in Table 5, it is clear that the results of the Ensemble Subspace KNN, Minimum Distance classifier, Linear SVM and Quadratic SVM are similar to those of the Ensemble Subspace Discriminant classifier, while the results of all other classifiers are different from it. Hence, we can say that the Ensemble Subspace Discriminant shows better performance than most of the classifiers on the WHU-IIP dataset.

6.1.3 Results analysis and discussion

From Table 3, it can be observed that the results of the MmLwThV framework on the WHU-IIP dataset are highly encouraging. The highest periocular recognition accuracy obtained is 99.9%, for the LBP thermo-visible features with the Ensemble Subspace Discriminant classifier; this accuracy is comparable to any state-of-the-art result. For the LBP feature, the periocular recognition accuracy of the Ensemble Subspace Discriminant is 99.1% in the visible modality and 98.1% in the thermal modality, and the fusion of the thermal and visible LBP features increases the accuracy to 99.9%. This result validates our proposed framework. Also, the ensemble subspace classifiers perform best among all the classifiers, giving the highest accuracy in every column of Table 3.

6.2 Periocular recognition using MmLwThV framework on CVBL masked face recognition dataset

We performed experiments on the CVBL masked dataset. Since both masked and unmasked images are available in the dataset, we cropped the periocular region from the registered thermal and visible images of both masked and unmasked faces. For the masked faces, features were extracted from the periocular region of both the thermal and visible images and subsequently fused with each other; the same was done for the periocular regions of the unmasked images. We then trained the classifiers on the thermo-visible features from the periocular regions of the unmasked faces and tested them on the thermo-visible features from the periocular regions of the masked faces. To understand the effect of the fusion of thermal and visible features, we conducted separate experiments using only thermal images and only visible images, for both masked and unmasked faces. The results are summarized in Table 4.

6.2.1 Evaluation of Wilcoxon signed-ranks test on CVBL masked dataset

To see whether the results of Ensemble Subspace KNN are the same as, or different from, those of the other classifiers listed in Table 4, we conducted the Wilcoxon signed-ranks test. The results are summarized in Table 6. In the first column, for example, Ensemble Subspace KNN is compared with Ensemble Subspace Discriminant.

Table 6 Results of the Wilcoxon Signed-ranks Test for the MmLwThV framework on the CVBL masked dataset. Ensemble Subspace KNN is compared with all other classifiers. The other classifiers are represented by the letters in brackets: Ensemble Subspace KNN (A), Ensemble Subspace Discriminant (B), Minimum Distance Classifier (C), Linear SVM (D), Quadratic SVM (E), Cubic SVM (F), Fine Gaussian SVM (G), Medium Gaussian SVM (H), Coarse Gaussian SVM (J), Fine KNN (K), Medium KNN (L), Coarse KNN (M), Cosine KNN (N), Cubic KNN (O), Weighted KNN (P). S stands for similar and D stands for different

From the results in Table 6, it is clear that the results of the Ensemble Subspace Discriminant, Quadratic SVM and Cubic SVM classifiers are similar to those of the Ensemble Subspace KNN, while the results of all other classifiers are different from it. Hence, we can say that the Ensemble Subspace KNN shows better performance than most of the classifiers on the CVBL dataset.

6.2.2 Results analysis and discussion

The results on the CVBL masked dataset are a more realistic evaluation of the MmLwThV framework. Firstly, as already discussed, the CVBL dataset was captured in real world circumstances and therefore presents real world challenges for periocular recognition, and the masks in the dataset are real rather than synthetic. Secondly, the unmasked images from the dataset are used for training the classifiers and the masked images are used for testing them. This is unlike the WHU-IIP dataset, where we used unmasked images for both training and testing of the classifiers with 5-fold cross-validation.

As shown in Table 4, the highest periocular accuracy on masked faces is 70.64%, obtained using Ensemble Subspace KNN on the thermo-visible LBP features. The second highest accuracy, 65.95%, is obtained by the Minimum Distance classifier on the LBP thermo-visible features, and the third rank is also held by ensemble subspace classifiers on thermo-visible features, which indicates the efficacy of these classifiers on thermo-visible fused features.

The efficacy of the ensemble subspace classifiers is further validated when we analyse Table 4 column-wise. By ensemble subspace classifiers, we mean either Ensemble Subspace Discriminant or Ensemble Subspace KNN. For every column in Table 4, the highest periocular accuracy is given by an ensemble subspace classifier. There is one exception, however: for the visible modality, the Minimum Distance classifier gives the highest accuracy of 56.17% on LBDP features, with Ensemble Subspace KNN next at 54.47%. For all the thermo-visible features, though, the highest periocular accuracy is given by the ensemble subspace classifiers.

In summary, the experiments on the CVBL masked dataset re-validate all the conclusions drawn from the results of the MmLwThV framework on the WHU-IIP dataset.

6.2.3 Effect of noise on the results of MmLwThV framework

In this paper, we have performed experiments on two different datasets: the WHU-IIP dataset and the CVBL masked dataset. The WHU-IIP dataset contains images without masks, whereas the CVBL masked dataset contains images with real masks, captured in a real world environment.

Hence, the periocular regions from the images in the WHU-IIP dataset contain no noise due to masks. The periocular images from the CVBL dataset, however, do contain such noise, because when the periocular region is extracted from the faces in the CVBL dataset, a portion of the mask remains in the cropped periocular image.

Comparison of the results in Tables 3 and 4 clearly demonstrates the effect of the noise due to masks. The highest accuracy of 99.9% on the WHU-IIP dataset, compared with 70.64% on the CVBL dataset, clearly shows this effect. It can also be seen by comparing corresponding cells of Tables 3 and 4: the accuracy of a classifier on a feature for the CVBL dataset is lower than the accuracy of the same classifier on the same feature for the WHU-IIP dataset. This means that the noise due to masks negatively impacts the accuracy of face recognition.

On analysing the results column-wise in Table 4, we can see that the accuracy of either the Ensemble Subspace Discriminant or the Ensemble Subspace KNN is the highest. This means that the drop in accuracy due to noise is larger for all classifiers other than the Ensemble Subspace Discriminant and Ensemble Subspace KNN. Hence, our proposed MmLwThV framework is less affected by the noise than the other classifiers, and we can conclude that the MmLwThV framework is robust against noise due to masks.

6.2.4 Complexity of the MmLwThV framework

For the SVM-based classifiers in Table 1, the number of parameters in a Support Vector Machine (SVM) is equal to the dimension of its input. Since we feed handcrafted features such as LBP, LDP, etc. to the classifier, the largest input size is 4096. Hence, the largest number of trainable parameters in an SVM is 4096.

As we know, a discriminant analysis classifier learns coefficients for projecting a sample onto the correct class. If there are k classes, a k × k structure of coefficient matrices is learnt by the classifier. An Ensemble Subspace Discriminant classifier contains 30 such discriminant classifiers, and there are 33 classes in the WHU-IIP dataset and 21 classes in the CVBL Masked Face Recognition dataset. Therefore, the total number of parameters is 32,670 for the WHU-IIP dataset and 13,230 for the CVBL Masked Face Recognition dataset. An Ensemble Subspace KNN has no trainable parameters, because KNN requires no training.
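Spelling out the arithmetic behind these counts (30 learners, with a k × k coefficient structure per learner, following the counting rule stated above):

$$ 30 \times 33 \times 33 = 32{,}670 \quad\text{(WHU-IIP)}, \qquad 30 \times 21 \times 21 = 13{,}230 \quad\text{(CVBL)} $$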

When we compare the Ensemble Subspace KNN or Ensemble Subspace Discriminant classifier with any SVM-based classifier, we can say that, in general, all the classifiers mentioned in Table 4 are machine learning based classifiers, and all of them are lightweight. This is best appreciated when any of these machine learning based classifiers is compared with a deep learning model, since a deep learning model contains at least a million parameters.

If we compare ensemble-based classifiers with SVM-based classifiers on the number of learnable parameters, the ensemble-based classifiers may appear more complex, but the increase in the number of parameters is not very large. The slight increase in complexity of the ensemble-based classifiers is due to the larger number of learners in the model, and it comes with increased robustness against noise and better generalization capability. Hence, we can conclude that our proposed MmLwThV framework is a lightweight solution for masked face recognition.

6.2.5 Summary of discussion on MmLwThV framework

Putting together the results on the WHU-IIP dataset, we can say that periocular recognition can effectively be used instead of full face identification if the visible features are fused with the thermal features. However, upon the fusion of the thermal and visible features, the noise due to the mask can still dominate and increase the false reject and false accept rates. The ensemble subspace classifiers have been able to combat the effect of this noise effectively and to generalize well over the features from the periocular region.

Thus, it is experimentally validated that our proposed MmLwThV framework is highly effective in improving the robustness and accuracy of masked periocular recognition over existing visible periocular recognition systems. The MmLwThV framework accomplishes this by using ensemble subspace networks over thermo-visible features.

7 Comparison with the state-of-the-art methods

Because COVID-19 emerged only in the recent past, little work has been published in the domain of masked face recognition. Diaz et al. [22] used deep learning networks for feature extraction from the periocular region of masked faces, and the features were then classified using a Euclidean distance measure. Li et al. [29] used an attention-based network with spatial and channel attention modules to recognize faces with masks; the attention modules force the network to attend to those areas of the input face that can help generate discriminative features. Another work, by Wu et al. [45], used a pyramidal attention module in addition to the spatial and channel attention modules for masked face recognition.

Table 7 Comparison with state-of-the-art methods

We implemented the works discussed above on our CVBL masked face dataset. Since the CVBL masked dataset is too small, with only 21 classes, it cannot be used to train a deep learning network from scratch. Hence, we used pretrained deep learning models and retrained them on the CVBL masked face dataset using transfer learning. The fine-tuning of the deep networks was carried out for 100 epochs. The results are summarized in Table 7.
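For completeness, the following is a hedged sketch of such a transfer-learning setup (the backbone, weights, optimizer and learning rate are illustrative assumptions using the torchvision >= 0.13 API, not the exact configuration used for Table 7):

```python
import torch.nn as nn
import torch.optim as optim
from torchvision import models

# Start from an ImageNet-pretrained backbone and replace the classification head.
model = models.resnet18(weights=models.ResNet18_Weights.DEFAULT)
model.fc = nn.Linear(model.fc.in_features, 21)   # 21 subjects in the CVBL masked dataset

criterion = nn.CrossEntropyLoss()
optimizer = optim.Adam(model.parameters(), lr=1e-4)
# ...standard supervised training loop over the masked face images for 100 epochs...
```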

From Table 7, it can easily be concluded that the state-of-the-art methods perform poorly on the challenging CVBL masked face dataset. The reason is that the CVBL masked face dataset is a real world dataset collected in a real world environment, and such a dataset must be large enough to train a deep learning network properly. Since the CVBL masked dataset is small, the deep networks cannot be trained adequately and their masked face recognition results are poor. Hence, we can say that our proposed method is an efficient, robust and lightweight method for masked face recognition with a small dataset.

8 Conclusion

For COVID-19-like scenarios, we proposed a novel framework for periocular recognition that is robust and does not require much data for training. The proposed MmLwThV framework fuses the thermal and visible features from the periocular region and classifies them using an ensemble subspace network, chosen for its ability to generalize and to ignore the noise introduced by the mask. We tested the MmLwThV framework on two thermo-visible datasets, the WHU-IIP dataset and the CVBL masked dataset. On both datasets, the framework successfully improved the periocular recognition results over the visible or thermal modality alone by using thermo-visible fused features and the ensemble subspace network. We obtained the highest accuracy of 99.96% on the WHU-IIP dataset using the LBP feature, a result comparable to any state-of-the-art face recognition system. On the unconstrained and challenging CVBL masked face dataset, the MmLwThV framework increases the accuracy of visible periocular recognition from 68.94% to 70.64%, and it makes the periocular recognition system robust to the noise due to masks. The MmLwThV framework can be flexibly customized to work with features other than LBP for better performance and suitability, and, being lightweight, it can easily be deployed on any mobile phone that has both a visible and an infrared camera installed.