1 Introduction

Video surveillance applications have grown by leaps and bounds in areas like protecting public monuments, private establishments, and monitoring traffic. This is due to the availability of cheaper hardware and better-performing algorithms. Nowadays, cameras can be found at every corner and intersection. This, in turn, has made them an integral part of maintaining the nearby environment’s security. Research in the face recognition field has also developed at a similar pace, and better, sophisticated ways are proposed each day for improving the performance of surveillance systems. Keeping in view the problems faced by face recognition systems, a large amount of time has been devoted to studying unconstrained environment scenarios [1, 32]. These include low lighting, occlusion, low resolution, and noisy scenarios [2, 28, 36].

Viola and Jones [67] laid down the primitive framework of face recognition systems with an incremental classification of features detected in images. Each classifier is weak and does not describe the whole image accurately. Still, when several of these weak classifiers are applied incrementally, the features can be mapped with a high degree of accuracy, and faces can be identified. Ordinary digital cameras record videos at a lower resolution than still pictures. Hence, the frames captured from these videos significantly degrade face recognition performance. Super-Resolution (SR) is a method that can be adopted to enhance these low-resolution images and video sequences, thereby increasing face recognition rates [77]. Little work has been done to study the effects of SR on images obtained from a face recognition system [40, 57]. Improving the quality of such images using SR can substantially boost the performance of a face recognition technology.

This paper proposes a super-resolution enabled system for real-time face recognition in video surveillance. The proposed system aims to detect and collect face images and super-resolve them in real time. Further, the system aims to improve the descriptor count obtained from the super-resolved faces and explores the effect of noise, scale, and descriptors on face recognition performance. An experimental analysis on the ORL, Caltech, and Chokepoint datasets is performed to evaluate the performance of the presented approach. PSNR and face recognition rate are used as performance evaluation measures. Additionally, a thorough comparison of the proposed system with other state-of-the-art approaches is performed. The results show an increase in recognition rates where the face images do not contain pose, expression, or scale variations. Further, for the complicated cases involving scale, pose, and lighting variations, the presented approach yields a 5%-6% performance improvement in each case.

1.1 Contributions

Video surveillance involves detecting scene(s) and looking for specific patterns that are indecorous or that may indicate the existence of improper behaviour. The surveillance process includes identifying areas of concern, and by viewing the selected images at appropriate times, it is possible to determine if an improper activity is occurring. However, previously developed face detection systems suffer from low-resolution images or images under severely distorted conditions. This hinders the performance of the video surveillance system. This work focuses on overcoming the problems of low-resolution, blur, and noisy images in face detection systems by employing a super-resolution-based approach. The main contributions of this work include:

  1. This paper presents an approach for real-time face detection and recognition in a video surveillance system using super-resolution.

  2. The presented approach addresses the effects of noise, scale, and descriptors on face recognition performance and improves the descriptor count obtained from the super-resolved faces.

  3. An empirical evaluation of the presented approach is performed on three different image datasets, and different performance measures are used to assess its performance.

  4. The presented super-resolution-based approach is combined with the eigenface and BRISK approaches to overcome the low-resolution constraint of video images.

  5. A thorough comparison of the presented approach with original image data and low-resolution image data is carried out to assess the performance improvement of the presented approach.

The rest of the paper is organized as follows. Section 2 discusses the background and related work on face recognition. The presented approach is discussed in Section 3. Section 4 presents the experimental setup, including the datasets, performance metrics, and face detection approaches used. The experimental results are discussed in Section 5, followed by conclusions in Section 6.

2 Background and related work

In a face recognition system, image resolution plays a vital role. Two resolution levels are of interest: the best resolution and the minimal resolution. The best resolution is the one at which the descriptor performs at optimal speed and provides the best recognition rates. The minimal resolution is the threshold value below which the recognition performance drops sharply. Wang et al. [68] demonstrated the use of facial structure information for face recognition purposes. Studies of the resolution problem were taken up in [10, 21, 37]. The main takeaway from these was that the minimal resolution depends on the system and databases used. Low resolution (LR) images suffer from small size and image quality problems. An insufficient number of pixels in the obtained image causes inaccurate descriptions. According to [53], if the face size is smaller than 32 × 24, most conventional methods fail. Depending on focus and illumination, severe blur distortions can degrade image quality and cause misclassifications [54].

Several researchers have developed different approaches for face detection and recognition. Some of them addressed the problems of low-resolution images, while others addressed the issues of blur, noise, and scale of images. This section discusses the reported works on low-resolution face recognition approaches, super-resolution approaches, and other state-of-the-art approaches for face recognition.

2.1 Super-resolution based approaches for face recognition

Numerous techniques, such as MAP-based, example-based, DSR-based, S2R2-based, and FFD-based approaches, have been used for super-resolution in face recognition.

MAP based approaches: Given a set of LR-HR pairs, where \(I_{l}\) refers to the image in LR and \(I_{h}\) refers to the image in HR, this approach calculates \(D\) (the downsampling operator) such that \(|I_{l} - DI_{h}|^{2}\) is minimized. Capel [16] used MAP-based methods by dividing the face region into six unrelated parts and using PCA to determine important regions for face hallucination. Using Baker’s [76] work as a reference, Dedeoglu established spatiotemporal coherence between face images and hallucinated face images to very high magnification (nearly 16 times). Baker’s work also inspired Freeman [12] to integrate a parametric model for hallucination at the global face image level and the local level. Freeman used PCA linear references to maximize \(P(I_{l}|I^{g}_{h})P(I^{g}_{h})\), thereby obtaining an optimal face image. The main drawback of the method was that it used explicit down-sampling. Soft and hard constraints were also proposed to beautify faces. Soft constraints made the obtained face image closer to the mean face, whereas the hard constraint was used to faithfully reproduce the discriminating facial details. Noise is assumed to be independent and identically distributed with Gaussian base functions [39].
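As a hedged illustration of the MAP formulation referred to above, the estimate can be written in the following generic form. The notation is an assumption of this sketch rather than taken from the cited works: \(I_{l}\) is the LR observation, \(I_{h}\) the HR image, \(D\) the downsampling operator, and \(\sigma^{2}\) the variance of independent and identically distributed Gaussian noise.

$$ \hat{I}_{h} = \arg\max_{I_{h}} P(I_{l}|I_{h})P(I_{h}) = \arg\min_{I_{h}} \left\{ \frac{1}{2\sigma^{2}} \|I_{l} - DI_{h}\|^{2} - \log P(I_{h}) \right\} $$

Under this Gaussian assumption, the likelihood term reduces to the squared reconstruction error, and the prior \(P(I_{h})\) acts as the regularizer.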

Example based approaches: These approaches focus on HR-LR pair similarity and assume that if an HR image can be constructed from a linear combination of other HR images, then the same can be done in the LR domain as well. For this purpose, weights are necessary to properly balance the contributions of individual images. Chang et al. [23] used the concept of manifolds for locating similar local geometry patterns in different feature spaces. However, the authors did not compensate for treating SR as patch-based, thereby losing local details in the final images. Park [51] separated the face image into texture and PCA-based feature sets and then defined training sets to imitate the model observed. The final LR input image was constructed based on the nearest match in both domains. When SR is performed with the aforementioned methods, the data domain determines the best possible LR image for reconstruction in the HR domain based on distance or similarity measures, while the algorithm domain ensures that the HR image selected belongs to a face. The main limitation arises from the assumption that changes in the HR image are reflected proportionally in the LR domain, which may not be the case. Hence, data constraint optimization fails. Similarly, when faces are recognized using SR algorithms, not all the available information, such as class labels in the training set, is utilized to enhance accuracy.

DSR based approaches: As the mapping done by humans and machines is different, two constraints are proposed [78], one for each scenario. The data constraint is developed to perform the linear mapping between the VLR and HR domains, thereby minimizing the reconstruction error. The second constraint provides discriminant analysis for faces, as machines use reliable face descriptors for face recognition. Different clustering methods with different parameters can be used. In this case, linearity is defined as the measure for clustering pairs in the VLR-HR domain. A set of images is defined as pair P such that all the closest neighbors of this pair have the same linearity. This is ensured by calculating gradients of all possible pairs near location x of the original pair such that the gradient difference is minimized. Even when the clusters have been identified for VLR-HR pairs, the correct relationship between the pairs must still be identified to transform the images to the HR domain. Keeping this in mind, a relation factor R is proposed such that, when applied to \(I_{l}\), it converts it into \(I_{h}\). This is unique, as earlier methods converted HR to the VLR domain for error determination, resulting in the loss of useful information. The error ε must be minimized to find a reliable HR image. Face features and discriminants are more important than reliable HR-VLR pairs for machine-based descriptive learning. By using multiple factors for images, error measurements are enhanced, and misclassifications are reduced. The SR algorithm can extract more images with improved data constraints based on linear classifiers. The reconstruction factor tends to overfit the VLR in the HR domain in many cases. This problem is tackled by using linearity as the criterion for classification and allowing nonlinear cases to be analyzed as well. The proposed SR algorithm also stands out as the first to use class labels to enhance testing for generic VLR images. Visual quality is assessed using the RLSR [38] algorithm, for which the CAS-PEAL and YaleB databases yielded significant improvements. Even in the case of face variations due to external elements, the algorithm performed moderately well.

S2R2 (Simultaneous super-resolution based approaches): Instead of sequentially reconstructing and then identifying the face in the image, the parameters can be combined and then used to classify and enhance the LR image simultaneously. This method is proposed in [25]. A probe set (also called a test set), to which the low-resolution image belongs, is assumed to be available. The other set is the gallery, which contains training data and high-resolution images for the corresponding low-resolution images. The main problem boils down to finding a distance vector whose magnitude represents the minimum distance between the LR and HR image spaces. For this purpose, the authors choose an image xyp from the probe set and calculate its distance from xg, i.e., an image from the gallery. For the LR scenario, xyp may not always be available, and hence a need for conversion of LR to HR arises. All LR images are denoted as yp belonging to the probe set. There are two main ways in which one can perform inter-domain matching. Firstly, an approximate \(\bar {x_{p}}\) can be found corresponding to yp such that it can be matched in place of the actual xyp. For all the pairs (xg, yp), the distance metric needs to be minimized, which in turn leads to the determination of the parameters mentioned in the base cases. The first part of the equation deals with the SR such that the image found is close to the HR space. The second part denotes smoothness for the SR. The third part refers to the features derived from SR and chooses the best among them. This problem can be simplified to determining the weight w such that all the constants are inside the w matrix. The domain is separated into two parts, one for each set (gallery and probe). Every part’s IDA is calculated and compared for each image pair, and the lowest discriminant score is chosen. Powell’s method is used to calculate α, β, and λ. All these values are then combined to form the w matrix.

FFD based approaches: FFD [31] based SR techniques are proposed for the face reconstruction problem in non-rigid registration scenarios. Face distortion and expressions play a major role in accurately registering and reconstructing faces collected from consecutive frames. The FFD technique uses a mesh of control points located on the face image to deform the face and bring it closer to the other view angles for accurate registration. This step is further broken down into local and global registration. As local registration performs precise enhancement, it occurs in the HR grid after B-spline interpolation. The global registration is done in the low-resolution grid with fast and slightly imprecise methods. This multi-level elastic deformation technique performs global deformations to account for expression changes. The precision is further improved using edge information such as SSD. The face edge contour information between adjacent frames provides accurate registrations. After global registration, a set of sub-image pairs consisting of the global image and the reference image is taken, and correlation coefficients are calculated. If the value of the coefficients is very small, precise enhancements in the HR grid are required. In the end, a POCS algorithm is used for SR reconstruction. The experiments are conducted on the Chokepoint video database and record a 16% improvement in face recognition accuracy.

Recently, some researchers have used super-resolution (SR) based approaches for different face recognition tasks. Kim et al. [30] proposed an edge and identity preserving network (EIPNet) that uses face SR to reduce distortion by employing a lightweight edge block and identity information. The presented network restored facial components in detail and generated high-quality 8× scaled SR images. Furthermore, the network reconstructed a 128 × 128 SR image at 215 fps. The experimental analysis on the CelebA and VGGFace2 datasets showed that the presented network outperformed other state-of-the-art methods. Cai et al. [11] proposed the FCSR-GAN approach based on joint face completion and super-resolution via multi-task learning. The experiments were performed on the CelebA and Helen datasets. Results demonstrated that the proposed approach produced better performance than other state-of-the-art methods for face super-resolution (up to 8 times scale).

Shamsolmoali et al. [59] used a deep convolution network for surveillance record super-resolution. The presented work aimed to recover the low-resolution objects and points in the surveillance record. The developed model was tested on the SCface and Chokepoint datasets, and the PSNR measure was used to evaluate the performance. The results showed that the model produced promising results. Lu et al. [64] presented an approach for very low-resolution (VLR) face recognition and super-resolution based on a semi-coupled locality-constrained representation. The presented approach enhances the consistency between VLR and high-resolution local manifold geometries and overcomes the negative effects of one-to-many mapping. The authors used the AR and CMU face recognition datasets to validate the presented approach. The results showed that the proposed method outperformed numerous state-of-the-art SR and recognition algorithms. Some other works, such as [55], [14], and [65], also explored the use of super-resolution for face detection and recognition. Table 1 summarizes the review of the related work.

Table 1 A Comparative Analysis of Different Super Resolution Methodologies

2.2 Approaches to low resolution face recognition

Low resolution (LR) face recognition methods can be divided into indirect and direct methods. Indirect methods form high-resolution (HR) images from LR images and then classify the results using normal HR techniques. The important works that outlined this include Baker et al. [4] and S2R2 (Hennings) [25]. Direct methods extract discriminating features from the images independent of resolution. The techniques that follow this are coupled locality preserving mappings (CLPM) [9] and multidimensional scaling (MDS) [8]. Direct methods can be categorized further into those based on resolution-robust features and those based on the interrelationship between HR-LR pairs for classification. Super-resolution methods such as interpolation [69], reconstruction-based [5], and learning-based [72] methods are used to enhance the images.

Liao et al. [43] presented a JPEG image steganography method based on the dependencies of inter-block coefficients. The presented method preserves the inter-block dependencies that describe the interaction among coefficients at the corresponding positions in different discrete cosine transform (DCT) blocks. The experimental analysis was performed on six brain magnetic resonance imaging (MRI) images. Results showed that the presented method efficiently clustered inter-block embedding changes, improving anti-steganalysis performance. In another work, Liao et al. [42] presented a steganographic embedding function to preserve the correctness and efficiency of the image. The function is utilized to discriminate the image’s smoothness. The authors performed an experimental analysis to validate the presented function by developing and testing special data hiding methods. The results showed that the presented function could perform better than the prior works. In another similar work, Liao et al. [41] presented a separable data hiding method in the encrypted image using compressive sensing and the discrete Fourier transform. Through experimental analysis, the authors showed that the presented method could generate better image quality at the same embedding capacity.

Sharma et al. [60] presented the D-FES, a deep face expression recognition system using a recurrent neural network. The presented system could detect six different facial expressions based on the lip structure. The presented system is trained and tested using the JAFFE, MMI, and Cohn-Kanade datasets. Results showed that the presented system had achieved the precision, recall, and f1-score values of 93.8%, 94.5%, and 94.2%, respectively. Kumar et al. [35] presented a superpixel-based color spatial feature approach for salient object detection in another work. The presented approach generates a spatial color feature and combines it with the center-based position before creating the saliency map. The experimental analysis of six datasets showed that the presented approach produced improved AUC, recall, precision, and F1-score measures. In another similar work, Negi et al. [48] presented a deep neural architecture to detect the face mask amid the Covid-19 pandemic. The presented architecture combines CNN and VGG16 models and is trained on the simulated masked face dataset. The results showed that the best-achieved training, validation, and testing accuracy was 99.47%, 98.59%, and 98.97%, respectively.

A model for predictive analytics for human activities recognition using the residual network has been presented by Negi et al. [47]. The authors have developed a human action recognition system, which uses the ResNet-50 model with transfer learning. The experiments have been performed on the UTKinect Action-3D dataset. The results showed that the presented system yielded better performance than other state-of-the-art methods. Kumar et al. [34] have presented a fast and deep event summarization (F-DES) approach. The presented approach extracts the features, resolves the problem of variations in illumination, removes fine texture details, and detects the objects in a frame. The experimental results showed that the presented F-DES approach has successfully reduced the video content and kept the meaningful information in events. Other similar works for the object-detection and face recognition have been reported by [33, 46, 63].

3 Presented face recognition approach

The main objective of the presented work is to design a system that performs real-time face recognition for a large set of video databases and produces aesthetically better results using the incorporated super-resolution system. Developing an effective face recognition system requires tackling the low-resolution and face recognition problems simultaneously. Therefore, it is necessary to use robust methods for each problem separately. Face detection and recognition are the two most important steps in a face recognition system as a whole. Super-resolution (SR) is an important technique for enhancing images to identify faces in images and videos [15, 62]. Figure 1 shows the overview of the presented face recognition approach. The approach takes a low-resolution video as input and generates the detected and recognized face as the output. The proposed super-resolution-based face recognition system has three main steps: face detection, super-resolution, and face matching and recognition. Each component is described in the following subsections.

Fig. 1 Overview of the presented face recognition approach

3.1 Face Detection

In any face recognition system, the primary step is detecting faces across a wide variety of dimensions. Since very low-resolution face images are considered in the presented work, faces with resolutions ranging from 24 × 24 to 48 × 48 are captured. The presented face recognition system accepts faces of size 96 × 96, which gives magnification factors of 4 and 2, respectively, for performing super-resolution. First, low-resolution face regions are collected from the input image. Since the input image is of low resolution (LR), super-resolution is applied in the next step to the collected face regions to enhance their quality. A minimum of 8 face images is required to perform super-resolution [52]. If the number is below this limit, it was found that little enhancement in the final image is obtained. Moreover, if many faces are collected, the super-resolution step takes a long time, making it infeasible. The following three steps are applied to collect the face images. (1) Bypass the initial 4 frames, as they contain large noise and blur due to motion variations. (2) Collect the frames until the count reaches 8. (3) Pass the collected frame array to the super-resolution module. The pre-processing and collection step takes a negligible amount of time and helps maintain the real-time objective.
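The three collection steps above can be sketched as follows. This is a minimal illustration under stated assumptions, not the implementation used in this work: the names `collect_faces` and `magnification` are hypothetical, and each element of `frames` stands in for one detected face crop.

```python
def collect_faces(frames, skip=4, needed=8):
    """Skip the first `skip` face crops, then gather `needed` crops.

    Returns the collected list, or None if the stream ends before
    enough crops are available (assumed behaviour for this sketch).
    """
    collected = []
    for index, face in enumerate(frames):
        if index < skip:
            continue  # step (1): bypass the initial noisy/blurred frames
        collected.append(face)  # step (2): collect until the count is reached
        if len(collected) == needed:
            return collected  # step (3): hand this array to the SR module
    return None


def magnification(face_size, target=96):
    """Integer magnification factor needed to reach the target face size."""
    return target // face_size


# Usage: frame indices stand in for face crops.
batch = collect_faces(list(range(20)))
print(batch)                                 # indices 4..11
print(magnification(24), magnification(48))  # 4 2
```

Returning `None` for short streams is a design choice of the sketch; a real system might instead keep waiting for further frames.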

3.2 Super-Resolution

Super-resolution [29] is a technology used to sharpen out-of-focus images or smooth rough edges in images that have been enlarged using a general up-scaling process (such as a bilinear or bicubic process), thereby delivering an image with high-quality resolution. The proposed super-resolution approach uses the L1-norm median face method proposed by Sina et al. [61]. Figure 2 depicts the steps of the proposed super-resolution approach. Several steps are applied to perform super-resolution, as discussed below.

Fig. 2 Super resolution steps

3.2.1 Affine motion estimation

To perform super-resolution, the frames must be correctly aligned with each other. For this purpose, descriptors are used for motion detection, which generates a 2 × 3 matrix for affine motions. This model accounts for rotation, scale, and translation. Since the captured frames are related temporally, the first frame is assigned as the reference and the motion matrices are calculated with respect to it. An image alignment problem [18] can be expressed as in (1).

$$ I_{r} (x) = I_{w} (\varphi(x;p)), \forall x\in \tau $$
(1)

Where \(I_{r}\) is the template image, \(I_{w}\) is the warped image, \(x\) is the set of coordinates for pixels in the template image, and \(\varphi(x;p)\) is the set of parametric correspondences for the warped image.

Algorithm-1 depicts the steps of the affine motion estimation process [18]. Figure 3 illustrates an example of the affine motion estimation process. The algorithm performs better than the optical flow optimization technique proposed by Lucas and Kanade. Its efficacy has been demonstrated for 1D and 2D translation estimation in registration [22].

Algorithm 1 Affine motion estimation

Fig. 3 Results for matching descriptors with minimum distance

The transformation matrix as proposed in [49] containing the six components is given as follows.

$$ \begin{pmatrix} Z_{x} \cos(a) & -q_{1} \sin(a) & d_{x}\\ q_{2} \sin(a) & Z_{y} \cos(a) & d_{y} \end{pmatrix} $$

Where \(d_{x}\) and \(d_{y}\) are the translations along the x and y axes,

\(Z_{x}\) and \(Z_{y}\) are the relative scale factors along the x and y axes,

\(a\) is the angle of rotation, and

\(q_{1}\) and \(q_{2}\) are the skew parameters, which cause the image to skew to one side.
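To make the role of the six components concrete, the following sketch builds the 2 × 3 matrix from its parameters and applies it to a single pixel coordinate. The function names `affine_matrix` and `warp_point` are hypothetical, and the parameter values in the usage lines are arbitrary.

```python
import math


def affine_matrix(zx, zy, a, q1, q2, dx, dy):
    """Build the 2 x 3 affine matrix from scale (zx, zy), rotation angle a
    (radians), skew (q1, q2) and translation (dx, dy)."""
    return [
        [zx * math.cos(a), -q1 * math.sin(a), dx],
        [q2 * math.sin(a), zy * math.cos(a), dy],
    ]


def warp_point(matrix, x, y):
    """Map pixel coordinate (x, y) through the affine matrix."""
    new_x = matrix[0][0] * x + matrix[0][1] * y + matrix[0][2]
    new_y = matrix[1][0] * x + matrix[1][1] * y + matrix[1][2]
    return new_x, new_y


# Pure translation: the point (1, 1) moves by (3, -2).
shift_only = affine_matrix(1, 1, 0.0, 1, 1, 3, -2)
print(warp_point(shift_only, 1, 1))

# Pure 90-degree rotation: (1, 0) maps (up to rounding) onto (0, 1).
quarter_turn = affine_matrix(1, 1, math.pi / 2, 1, 1, 0, 0)
print(warp_point(quarter_turn, 1, 0))
```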

3.2.2 ECC based optimization

When different images of the same scene need to be aligned with respect to geometric distortions, it becomes necessary to consider some objective functions that minimize the overall error rate. Here, an Enhanced Correlation Coefficient (ECC) [18] has been used to solve the affine motion problem. The benefit of using ECC-based optimization is that it is invariant to photometric distortions like brightness and contrast in consecutive frames, and it is fast because of the conversion of the non-linear parametric equation to linear form. The following steps are involved in estimating affine motion by the ECC algorithm.

  1. Select a template face image as the reference frame. Here, the first frame is considered the template image.

  2. Measure the similarity of the consecutive frames with respect to the first frame and calculate the conversion matrices.

  3. Warp the obtained frames with respect to the first frame and store them in the database.

  4. Use the SR algorithm to determine the best representation of the face image obtained from all the frames.
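The photometric invariance mentioned above follows from the form of the similarity measure: both images are made zero-mean and normalized before they are correlated. The sketch below computes only this coefficient for two small images given as nested lists; it does not perform the iterative search for the warp parameters. The function name `ecc` is hypothetical. In practice a library routine, for example OpenCV's `findTransformECC`, could be used for the full parameter estimation.

```python
import math


def ecc(template, warped):
    """Correlation coefficient between two zero-mean, normalized images.

    Both arguments are 2D lists of equal shape. The value is 1.0 when
    the images differ only by a positive gain and a brightness offset.
    """
    a = [p for row in template for p in row]
    b = [p for row in warped for p in row]
    mean_a = sum(a) / len(a)
    mean_b = sum(b) / len(b)
    za = [p - mean_a for p in a]  # remove brightness offset
    zb = [p - mean_b for p in b]
    numerator = sum(x * y for x, y in zip(za, zb))
    # Dividing by the norms removes the contrast (gain) difference.
    denominator = math.sqrt(sum(x * x for x in za)) * math.sqrt(sum(y * y for y in zb))
    return numerator / denominator if denominator else 0.0


frame = [[10, 20], [30, 40]]
brighter = [[2 * p + 15 for p in row] for row in frame]  # contrast and brightness change
mirrored = [[20, 10], [40, 30]]                          # geometric change

print(ecc(frame, brighter))  # close to 1.0: photometric change is ignored
print(ecc(frame, mirrored))  # below 1.0: misalignment is penalised
```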

3.2.3 Construction of an average image

To perform error minimization, a hypothesis frame must be synthesized. This frame is obtained by aligning the LR frames produced by affine motion estimation and performing suitable transformations. The required steps are given below.

  1. All the input LR frames are interpolated using the Nearest Neighbour method, as shown in Fig. 4. In this method, each interpolated pixel is assigned the value of the nearest sample point in the input image. The method is computationally very fast and hence well suited to the real-time goal.

  2. After all the LR images are aligned correctly in a high-resolution (HR) grid, the mean of all values in a pixel neighborhood is taken [61]. If the number of images is less than the square of the magnification factor (i.e., \(N < r^{2}\)), the pixel values are singular, and the mean and median estimators perform similarly. In other cases, where no singular value is present, no estimate is entered.

  3. The above method generates a blurred and noisy high-resolution image.

Fig. 4 Interpolation methods for image resizing

The result of performing this is a reference frame H that is used as an initial input in the Gradient Descent Iterative Back projection approach.
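A minimal sketch of this construction is given below, assuming that the LR frames have already been warped onto a common grid. The function names `nn_upscale` and `hypothesis_frame` are hypothetical. The pixel-wise median is used by default because the adopted method is median-based; `statistics.mean` can be passed as `estimator` to obtain the mean instead.

```python
from statistics import median


def nn_upscale(image, r):
    """Nearest-neighbour interpolation of a 2D list by integer factor r."""
    height, width = len(image), len(image[0])
    return [[image[y // r][x // r] for x in range(width * r)]
            for y in range(height * r)]


def hypothesis_frame(aligned_lr_frames, r, estimator=median):
    """Fuse aligned LR frames into one HR-sized frame H.

    Each frame is upscaled to the HR grid, then the chosen estimator
    is applied pixel-wise across all frames.
    """
    upscaled = [nn_upscale(frame, r) for frame in aligned_lr_frames]
    height, width = len(upscaled[0]), len(upscaled[0][0])
    return [[estimator([frame[y][x] for frame in upscaled])
             for x in range(width)]
            for y in range(height)]


# Three aligned 2 x 2 frames; the third has one outlier pixel (9).
frames = [[[1, 2], [3, 4]], [[1, 2], [3, 4]], [[9, 2], [3, 4]]]
H = hypothesis_frame(frames, r=2)
for row in H:
    print(row)  # the outlier is rejected by the median
```

The median makes the fused frame robust to a single badly registered or noisy frame, at the cost of the blocky appearance inherent to nearest-neighbour interpolation.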

3.2.4 Iterative back projection

After obtaining affine motion transformations and hypothesis frame H, the main error/cost minimization problem can be reduced to (2) [26].

$$ X_{n+1} = X_{n} - \beta \left\{ \sum\limits_{k=1}^{N} F_{k}^{T} H_{k}^{T} D_{k}^{T} \, \mathrm{sign}(D_{k}H_{k}F_{k}X_{n} - Y_{k}) + \lambda \sum\limits_{l=-P}^{P} \sum\limits_{m=0}^{P} \alpha^{|m|+|l|} \left[I - S_{y}^{-m}S_{x}^{-l}\right] \mathrm{sign}(X_{n} - S_{x}^{l} S_{y}^{m} X_{n}) \right\} $$
(2)

Where,

\(X_{n}\) = HR frame used as input in the nth iteration

\(X_{n+1}\) = HR frame obtained after the nth iteration

\(\beta\) = Scalar defining the step size in the direction of the gradient

\(N\) = Number of LR frames

\(Y_{k}\) = kth low-resolution frame

\(F_{k}\) = Geometric motion operator for the kth LR frame

\(H_{k}\) = Blur operator for the kth LR frame

\(D_{k}\) = Downsampling operator for the kth LR frame

\(I\) = Identity matrix

\(\lambda\) = Regularization factor

\(\alpha\) = Smoothing/Decaying addition term

\(P\) = Maximum shift considered in the regularization term

\(S_{x}\) = Shift operator in the x-direction

\(S_{y}\) = Shift operator in the y-direction
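The update in (2) can be illustrated with a deliberately simplified one-dimensional sketch. The code below is not the implementation used in this work and rests on several assumptions: the signal is 1D, the blur operator is the identity, each motion operator is an integer shift (`offset`) on the HR grid, downsampling keeps every r-th sample, and shifts are zero-padded. The names `shift`, `sign`, and `sr_step` are hypothetical.

```python
def sign(value):
    """Return -1, 0 or 1 according to the sign of value."""
    return (value > 0) - (value < 0)


def shift(x, l):
    """Zero-padded shift of list x by l samples (right if l > 0)."""
    n = len(x)
    if l >= 0:
        return [0.0] * l + x[:n - l]
    return x[-l:] + [0.0] * (-l)


def sr_step(x, frames, r, beta, lam, alpha, P):
    """One steepest-descent iteration on the L1 data term plus a 1D
    analogue of the regularization term.

    frames: list of (offset, y) pairs, where y is an LR signal whose
    j-th sample observes HR position offset + j * r.
    """
    grad = [0.0] * len(x)
    # Data term: D^T sign(D x - y), accumulated over all LR frames.
    for offset, y in frames:
        for j, sample in enumerate(y):
            i = offset + j * r
            grad[i] += sign(x[i] - sample)
    # Regularization term: alpha^l [I - S^-l] sign(x - S^l x).
    for l in range(1, P + 1):
        s = [sign(a - b) for a, b in zip(x, shift(x, l))]
        back = shift(s, -l)
        for i in range(len(x)):
            grad[i] += lam * alpha ** l * (s[i] - back[i])
    return [xi - beta * g for xi, g in zip(x, grad)]


# Two LR frames (offsets 0 and 1) that together cover a 4-sample HR grid.
x0 = [0.0, 0.0, 0.0, 0.0]
frames = [(0, [1.0, 1.0]), (1, [1.0, 1.0])]
print(sr_step(x0, frames, r=2, beta=0.1, lam=0.0, alpha=0.5, P=1))
```

Because only the sign of each residual enters the gradient, a single grossly wrong LR sample moves the estimate by at most `beta` per iteration, which is the source of the robustness of the L1 formulation.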

3.2.5 Super resolution for misregistration, deblurring, and denoising of image

Figure 5 shows the steps of the presented super-resolution approach for misregistration, deblurring, and denoising of the image. The hypothesis frame H can be taken as the first frame \(X_{0}\) for the algorithm. The main idea behind the procedure is to minimize the error caused by blur and noise terms using gradient descent. The work reported in [61] showed that the maximum decrease in cost occurs when one moves in the direction opposite to the gradient. The cost function includes the errors between the predicted LR images and the actual LR inputs. Hence, to minimize the cost function (containing the summation of error terms), the HR image is refined iteratively. Algorithm-2 describes the working of the super-resolution approach for misregistration, deblurring, and denoising of the image [17].

Algorithm 2 Super-resolution for misregistration, deblurring, and denoising

Fig. 5 Super resolution algorithm for misregistration, deblurring and denoising [56]

The image representation in the above algorithm is in the form of a vector. Also, the blur and downsampling matrices are assumed to be available beforehand with the correct dimensions. The regularization term helps remove noise while preserving sharp edges. The impact of the different parameters in the algorithm is described below.

Regularization factor (λ): When the variation in image intensity is weak, smoothing should be encouraged equally in all directions; hence, this factor’s value should be sufficiently high to normalize variations and predict intensities using nearby pixels. On the other hand, if a pixel is surrounded by non-similar pixels, BTV (bilateral total variation) considers it a heavily noisy pixel. It uses a large neighborhood to determine whether smoothing is to be performed or not. This ensures that edges are preserved.

Smoothing/Decaying addition term (α): This term gives respective weights to the pixels surrounding the current pixel in a decaying fashion. This ensures that nearby pixel values are given more weight than pixels located far away.
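The decaying behaviour of α can be seen by tabulating the weight \(\alpha^{|m|+|l|}\) from (2) for every shift pair. The sketch below is illustrative only; the function name `decay_weights` and the values α = 0.5 and P = 2 are arbitrary choices.

```python
def decay_weights(alpha, P):
    """Weight alpha^(|l| + |m|) for each shift (l, m), with l in [-P, P]
    and m in [0, P], matching the summation ranges in (2)."""
    return {(l, m): alpha ** (abs(l) + abs(m))
            for l in range(-P, P + 1)
            for m in range(0, P + 1)}


weights = decay_weights(alpha=0.5, P=2)
print(weights[(1, 0)])   # immediate neighbour: 0.5
print(weights[(-2, 2)])  # farthest shift in the window: 0.0625
```

With 0 < α < 1, the weight shrinks geometrically with the shift distance, so nearby pixels dominate the regularization term.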

Scalar defining the step size in the direction of gradient (β): This determines the step size in the direction of the gradient. If the step size is large, the error is minimized in fewer iterations, but if it becomes very large, the solution becomes very prone to incorrect estimations. Hence statistical analysis of results is necessary to optimize their values correctly.

This step results in enhanced face images, which can further be processed for face detection and recognition.

3.3 Face matching and recognition

The super-resolution approach improves the aesthetic aspects of the image. However, it is not guaranteed that the distinguishing face details in the image are enhanced. Therefore, a robust recognition method is needed to ensure that face recognition is unaffected by variations in illumination, rotation, scale, and pose. Algorithm-3 describes the steps of face matching and recognition.

figure c

Earlier works used the Eigenface method for face recognition, but its time-intensive calculations make it unviable for real-time implementation [45]. Some other works used the SURF method for face recognition, but its matching rates are low compared to binary descriptors [24]. The BRISK (Binary Robust Invariant Scalable Keypoints) method has also shown a large performance improvement over a wide range of binary descriptors [38]. The capabilities of these methods are used to develop a face recognition system. The presented face recognition system consists of the following sub-modules.

  1. Key point detection and descriptor building: To find a sufficient number of key points using BRISK [38], a detection threshold of 15 was chosen. Only the top 50 key points were used for further processing.

  2. Matching and finding good features: After the descriptors of the key points are obtained, they are matched against a set of training image descriptors. Since binary descriptors are stored as strings of 1’s and 0’s, a simple Hamming distance computed with the XOR operator is sufficient to measure the distance. The nearest neighbor, i.e., the descriptor with the least distance, is selected as the best match. To make matching more robust, a ratio test was also performed: a match is considered good if its distance is less than 0.7 times the distance to the second nearest neighbor. The image with the maximum number of good matches defines the person’s identity.

  3. Quality estimation: If the maximum number of good matches falls below 20, control is passed back to the super-resolution module, which enlarges the LR image set to 10 images. This recursive process enlarges the LR image set by 2 images each time a sufficient match is not found. The process is repeated a maximum of 4 times, i.e., until the LR image set grows to 16 images. If a sufficient match is still not found, the face image is termed unqualified for recognition, and a new face image is processed. Algorithm 4 shows the precision enhancement process for matching and recognition of the image.
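The matching sub-module can be sketched in a few lines. This is an illustrative sketch, not the authors' OpenCV/C++ implementation: descriptors are modelled as Python `bytes` objects, and the names `hamming`, `good_matches`, and `identify` as well as the gallery layout are assumptions made for the example. Only the XOR-based Hamming distance, the 0.7 ratio test, and the "most good matches wins" rule come from the text.

```python
def hamming(a: bytes, b: bytes) -> int:
    """Hamming distance between two equal-length binary descriptors via XOR."""
    return sum(bin(x ^ y).count("1") for x, y in zip(a, b))


def good_matches(query, train, ratio=0.7):
    """Count query descriptors whose best match passes the ratio test."""
    count = 0
    for q in query:
        dists = sorted(hamming(q, t) for t in train)
        # Good match: nearest distance < ratio * second-nearest distance.
        if len(dists) >= 2 and dists[0] < ratio * dists[1]:
            count += 1
    return count


def identify(query, gallery):
    """Return (identity, good match count) for the best-matching gallery entry."""
    scores = {name: good_matches(query, descs) for name, descs in gallery.items()}
    best = max(scores, key=scores.get)
    return best, scores[best]


# Toy one-byte descriptors; real BRISK descriptors are 64 bytes long.
gallery = {
    "alice": [b"\x0f", b"\xf0", b"\xaa"],
    "bob": [b"\x55", b"\xaa", b"\x33"],
}
print(identify([b"\x0f", b"\xf0"], gallery))  # → ('alice', 2)
```

The ratio test rejects ambiguous matches: for "bob", every candidate lies at the same distance from the query, so no match is counted as good.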

figure d
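The quality-estimation loop can be sketched as follows. The starting size of 8 LR frames is an assumption inferred from the text (the first enlargement yields 10 images, in steps of 2), and `match_fn` is a hypothetical callback that runs super-resolution on the given number of frames and returns the best identity together with its number of good matches.

```python
def recognise_with_retries(match_fn, start=8, step=2, max_frames=16, min_good=20):
    """Enlarge the LR frame set until enough good matches are found.

    Returns (identity, frames_used), or (None, max_frames) when the face
    is unqualified for recognition.
    """
    n = start
    while n <= max_frames:
        identity, good = match_fn(n)  # super-resolve n frames, then match
        if good >= min_good:
            return identity, n
        n += step  # ask the super-resolution module for two more frames
    return None, max_frames
```

With these defaults the frame counts tried are 8, 10, 12, 14, and 16, i.e., at most four enlargements after the initial attempt.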

4 Experimental setup and analysis

This section describes the experimental procedure used for the analysis. Three different datasets, ORL, Caltech, and Chokepoint, are used to build and evaluate the performance of the presented approach.

4.1 Used image datasets

Three different datasets, the ORL face dataset, the Caltech Pedestrian dataset, and the Chokepoint video surveillance dataset, have been used to evaluate the performance of the presented super-resolution-based approach.

4.1.1 ORL face dataset

The ORL face database consists of 400 images of 40 subjects, including 10 images of each subject [58]. The images were taken at different times for different subjects, varying the lighting, facial expressions, and facial details. The subjects were imaged in an upright, frontal position against a black, uniform background. Each image is 92x112 pixels and has 256 grey levels per pixel. This dataset has been used to evaluate how the presented super-resolution-based approach improves the image quality and helps in accurate face recognition.

4.1.2 Caltech dataset

The Caltech dataset corresponds to 10 hours of 640x480 30Hz video captured from a vehicle travelling in typical traffic in an urban setting [20]. There are around 250,000 annotated frames (in roughly 137 minute-long parts) with a total of 350,000 bounding boxes and 2300 individual pedestrians. This dataset [48] contains a varying number of images for each subject. The dataset generation was similar to that of the ORL database, except that spike noise levels were altered instead of Gaussian noise levels.

4.1.3 Chokepoint dataset

For testing on a video surveillance dataset, the ChokePoint dataset is used [73]. This dataset can be used for person identification/verification under real-world surveillance conditions. It consists of face images of persons walking through pedestrian traffic. Three cameras were mounted over two portals (P1 and P2) to capture the video sequences of the subjects entering (E) or leaving (L) the portals in a natural manner. The images vary in terms of illumination conditions, pose, and sharpness, as well as misalignment due to automatic face localization/detection. The dataset includes 25 subjects in portal 1 and 29 subjects in portal 2. The frame rate is 30 fps, and the image resolution is 800x600 pixels. A total of 48 video sequences and 64,204 face images are included in the dataset.

4.1.4 Dataset preparation for the system

Figure 6 shows the data preparation process for the system.

Fig. 6 Dataset preparation for the system

4.2 Performance measures

The PSNR and face recognition rate are used as the performance measures.

1. Peak Signal-to-Noise Ratio (PSNR): It is defined as the ratio between the maximum possible power of an image and the power of corrupting noise that affects the quality of its representation. The PSNR value of an image is computed by comparing the image with an ideal clean image with the maximum possible power. It is defined by (3) [19].

$$ PSNR = 10\log_{10}\frac{(L-1)^{2}}{MSE} = 20\log_{10}\frac{(L-1)}{RMSE} $$
(3)

Here, L is the number of possible intensity levels, so L-1 is the maximum possible intensity value, and RMSE is the square root of MSE. MSE is defined by (4) [13].

$$ MSE = \frac{1}{mn} \sum\limits_{i=0}^{m-1} \sum\limits_{j=0}^{n-1}(O(i,j)-D(i,j))^{2} $$
(4)

Here, O is the original image matrix and D is the degraded image matrix. m is the number of rows of pixels, and i is the row index. n is the number of columns of pixels, and j is the column index.

2. Face Recognition Rate: It is defined as the percentage of images in which the face is correctly identified. It is calculated as given in (5).

$$ Face~ recognition~ rate = \frac{no.~of~correctly~ identified~ images}{Total~ no.~ of~ images} \times 100 $$
(5)
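The two measures can be computed directly from their definitions. The sketch below uses plain nested lists for images; the function names and the default of 256 intensity levels (8-bit images, as in ORL) are assumptions made for the example.

```python
import math


def mse(original, degraded):
    """Mean squared error over all m x n pixels, as in (4)."""
    m, n = len(original), len(original[0])
    total = sum((original[i][j] - degraded[i][j]) ** 2
                for i in range(m) for j in range(n))
    return total / (m * n)


def psnr(original, degraded, levels=256):
    """PSNR in dB, as in (3); infinite when the images are identical."""
    err = mse(original, degraded)
    if err == 0:
        return math.inf
    return 10 * math.log10((levels - 1) ** 2 / err)


def recognition_rate(predicted, actual):
    """Percentage of correctly identified images, as in (5)."""
    correct = sum(p == a for p, a in zip(predicted, actual))
    return 100 * correct / len(actual)


o = [[0, 255], [255, 0]]
d = [[0, 255], [255, 10]]
print(mse(o, d))  # → 25.0
print(recognition_rate(["a", "b", "c", "a"], ["a", "b", "b", "a"]))  # → 75.0
```

For the toy images above, a single pixel differs by 10, so the MSE is 100/4 = 25 and the PSNR equals 20 log10(255/5), roughly 34.15 dB.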

4.3 Used face recognition methods

4.3.1 Eigenface method

The Eigenface approach is based on identifying the most important vectors that can describe faces in the database, termed as face space [66]. Eigenface accomplishes this by capturing variations in a large set of training images and comparing it with other images without discarding any information in captured pixels. Each face image can be expressed in terms of a linear combination of M eigenfaces. For computational efficiency, only the best faces are chosen for forming these M eigenfaces.
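As a short illustration in the standard eigenface notation (the symbols below are not defined in this paper and are introduced here as assumptions), let Γ be a face image written as a vector, Ψ the mean of the training faces, and u_1, ..., u_M the selected eigenfaces. The projection weights and the resulting approximation of the face are

$$ \omega_{k} = u_{k}^{T}(\Gamma - \Psi), \qquad \Gamma \approx \Psi + \sum\limits_{k=1}^{M} \omega_{k} u_{k} $$

A test face is then typically assigned to the training identity whose weight vector is closest to its own.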

4.3.2 Fisherface method

The Fisherface method uses projections onto a linear sub-space to determine the classes of faces. The Fisherface method [7] improves on the Eigenface technique as it tries to maximize the inter-class differences while simultaneously minimizing the intra-class differences. This helps in classifying images with a higher degree of accuracy.
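In the usual formulation of Fisher's linear discriminant (the notation is introduced here for illustration and is not taken from this paper), with C classes, class means μ_c, overall mean μ, N_c samples in class c, and X_c the set of samples of class c, the between-class and within-class scatter matrices are

$$ S_{B} = \sum\limits_{c=1}^{C} N_{c}(\mu_{c}-\mu)(\mu_{c}-\mu)^{T}, \qquad S_{W} = \sum\limits_{c=1}^{C} \sum\limits_{x \in X_{c}} (x-\mu_{c})(x-\mu_{c})^{T} $$

and the projection is chosen to maximize their ratio:

$$ W_{opt} = \arg\max_{W} \frac{|W^{T} S_{B} W|}{|W^{T} S_{W} W|} $$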

4.3.3 Scale invariant feature transform (SIFT)

The SIFT [44] method produces a highly distinctive descriptor that can match objects with high probability over a large collection of similar objects. This approach can be readily used for recognition purposes, as it provides high accuracy even in highly cluttered and occluded scenarios. SIFT is robust to changes in scale, noise, and orientation.

4.3.4 Speeded-up robust features (SURF)

SURF [6] was developed as an alternative to SIFT. The SURF method is based on two components: the SURF detector and the SURF descriptor. The SURF detector extracts key points in an image by applying LoG masks to the image at varying scales and then calculating the Hessian matrix for each scale. The intensity comparison between scales is computed using integral images, as only values within a rectangle are compared. The SURF descriptor is a rotation- and scale-invariant scheme. Rotation invariance is ensured by finding the feature’s dominant direction and rotating the sampling window to align with that angle. Scale invariance is ensured by sampling the descriptor over a window proportional to the detection window size.
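The benefit of integral images is that the sum of any rectangle can be read with four lookups, regardless of its size. The following is a generic sketch of that idea and not SURF itself; the function names and the half-open rectangle convention are assumptions made for the example.

```python
def integral_image(img):
    """Build a summed-area table with one extra row and column of zeros."""
    h, w = len(img), len(img[0])
    ii = [[0] * (w + 1) for _ in range(h + 1)]
    for y in range(h):
        row_sum = 0
        for x in range(w):
            row_sum += img[y][x]
            ii[y + 1][x + 1] = ii[y][x + 1] + row_sum
    return ii


def box_sum(ii, top, left, bottom, right):
    """Sum of img[top:bottom][left:right] using four table lookups."""
    return ii[bottom][right] - ii[top][right] - ii[bottom][left] + ii[top][left]


img = [[1, 2, 3],
       [4, 5, 6],
       [7, 8, 9]]
ii = integral_image(img)
print(box_sum(ii, 1, 1, 3, 3))  # → 28  (5 + 6 + 8 + 9)
```

Because the cost of `box_sum` does not depend on the rectangle size, box filters of different scales can be evaluated at the same cost.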

4.3.5 BRISK

BRISK handles low lighting, pose variation, and scale changes using key point detection, orientation compensation, and descriptor construction [38]. BRISK uses a handcrafted sampling pattern consisting of concentric circles of varying radius around the key point. This generates 512 sampling pairs, with a sampling point at the center of each circle of the pattern. The pairs are further divided into long and short pairs: if the pair distance is below a threshold, it is a short pair; otherwise, it is a long pair. The long pairs determine the orientation, and the short pairs provide the intensity comparisons.
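The split into short and long pairs, and the way short pairs yield descriptor bits, can be sketched as below. This follows the single-threshold description given above; the sampling points, the threshold value, and the bit convention (1 when the first point is brighter) are illustrative assumptions and do not reproduce the actual BRISK pattern.

```python
import math
from itertools import combinations


def split_pairs(points, threshold):
    """Split all index pairs into short (< threshold) and long pairs."""
    short, long_ = [], []
    for i, j in combinations(range(len(points)), 2):
        if math.dist(points[i], points[j]) < threshold:
            short.append((i, j))
        else:
            long_.append((i, j))
    return short, long_


def descriptor_bits(intensities, short_pairs):
    """One bit per short pair from a simple intensity comparison."""
    return [1 if intensities[i] > intensities[j] else 0 for i, j in short_pairs]


points = [(0, 0), (1, 0), (0, 1), (10, 10)]  # toy sampling pattern
short, long_ = split_pairs(points, threshold=2.0)
print(short)                                 # → [(0, 1), (0, 2), (1, 2)]
print(descriptor_bits([5, 3, 9, 1], short))  # → [1, 0, 0]
```

The long pairs, here every pair involving the distant point `(10, 10)`, would be used to estimate the local gradient direction for orientation compensation.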

4.4 Implementation details

All the experiments were performed on an Intel Core i5 2.4 GHz machine with 4 GB RAM. The platform used was OpenCV/C++ with Ubuntu 10.04 as the operating system. Face detection is performed on videos with low resolutions ranging from 400*300 to 200*150 per frame. The frame rate is around 25 fps, giving Haar-based face detection an average time of 100ms per frame [50]. Since super-resolution takes an estimated 1s for 7-8 frames, the combined detection, super-resolution, and recognition rate is set to 6 fps. It is important to note that if the video resolution is very small, the super-resolution time decreases drastically, to 500-600ms for 8 frames, which increases the overall processing rate to 10 fps. This indicates that super-resolution can be used for face recognition in real-time surveillance videos [27]. In the face recognition system, the face image is affected by blur, motion, and noise while being captured by a camera. These parameters are not known accurately; to perform super-resolution, the motion characteristics have been estimated using descriptors, while the blur kernel is presumed to be known for the given camera [61].

5 Results and comparative analysis

This section discusses the results of the presented face recognition approach on the ORL, Caltech, and Chokepoint datasets. The proposed super-resolution approach has been applied to the datasets at different noise levels to evaluate its robustness and, further, the face recognition rate. The results for the different performance measures are discussed in the following subsections.

5.1 Robustness of the presented approach for the different noise and spike levels

Figure 7 shows the real-time performance benchmark for the presented super-resolution-based approach with successive iterations. Four noise spike levels, 5, 5.2, 5.4, and 5.6, are used. As the number of iterations increases, a linear increase in time is observed for all the spike levels. When the noise content is changed from 20 to 10, as shown in Fig. 8, the real-time performance is not significantly affected. The approach behaves similarly for the different noise levels, which indicates the robustness of the presented super-resolution approach to noise.

Fig. 7 Time vs. iteration for the proposed super-resolution-based approach at noise = 20 (the series correspond to noise spike levels 5, 5.2, 5.4, and 5.6)

Fig. 8 Time vs. iteration for the proposed super-resolution-based approach at noise = 10 (the series correspond to noise spike levels 5, 5.2, 5.4, and 5.6)

5.2 PSNR analysis of the presented approach

A PSNR ratio analysis of the presented approach has been performed against the original image to measure the quality of face reconstruction. The better the face reconstruction, the higher the PSNR gain. Figures 9 and 10 show the PSNR ratio values of the super-resolution approach at noise levels of 20 and 10, respectively. Four noise spike levels, 500, 520, 540, and 560, have been considered for the PSNR analysis. From the figures, it can be observed that the PSNR ratio values increase with successive iterations for all the considered noise spike levels. The highest PSNR value achieved is approximately 0.8, for the noise spike level 520. These results show that the presented super-resolution approach performs well for face reconstruction and produces improved performance for both of the considered noise levels.

Fig. 9 PSNR vs. iteration for the proposed super-resolution-based approach at various spike noise levels with noise = 20 (the series correspond to noise spike levels 500, 520, 540, and 560)

Fig. 10 PSNR vs. iteration for the proposed super-resolution-based approach at various spike noise levels with noise = 10 (the series correspond to noise spike levels 500, 520, 540, and 560)

5.3 Face recognition rate of the presented approach

The performance of the super-resolution approach has been evaluated using the face recognition rate for the ORL and Caltech datasets. The super-resolution approach has been combined with two face recognition approaches (eigenface and BRISK), and the results have been evaluated. Initially, the eigenface and BRISK approaches were applied to the original images extracted from the datasets. Then, the original images were degraded by applying noise and spike levels to produce low-resolution images, called LR/2 (48*48 size) and LR/4 (24*24 size). The eigenface and BRISK approaches were applied again to the LR images, and the performance was recorded. Finally, the presented super-resolution approach was applied to the generated LR/2 and LR/4 images to enhance their quality. The results are called SR4by8 (super-resolution from 24*24 to 96*96) and SR2by8 (super-resolution from 48*48 to 96*96). The eigenface and BRISK approaches were then applied together with the presented super-resolution approach, and the performance improvement achieved by super-resolution was recorded.

Different training set sizes have been used to determine the worst-case performance of the presented face recognition system. Training sets of size K = 1 to 5 are formed by randomly selecting images for training and testing. Since each person has 10 images with varying pose and lighting, the performance variations of the BRISK and eigenface approaches can be determined while demonstrating that super-resolution increases the performance in each scenario.

Figure 11 shows the face recognition rate on the ORL dataset using eigenface with the presented super-resolution approach. Table 2 shows the face recognition rate on the Caltech face dataset with different training set sizes, K, using eigenface and the presented super-resolution approach. The figure and table show that eigenface produced the best performance on the original images. The performance of eigenface decreased significantly on the LR images (LR/2 and LR/4). However, the performance increased significantly when eigenface was applied with the presented super-resolution approach.

Similarly, Fig. 12 shows the face recognition rate on the ORL dataset using the super-resolution and BRISK approaches, and Table 3 shows the face recognition rate on the Caltech face dataset with different training set sizes, K, using the super-resolution and BRISK approaches. With BRISK (Table 3 and Fig. 12), it is again observed that the best performance is obtained on the original images and that performance decreases for the LR images. However, when super-resolution and BRISK are both applied to the LR images, the performance of BRISK increases significantly. Further, it is found that BRISK is more resistant to changes in noise and gives better recognition rates under normal conditions. It is observed from Table 3 that, for K = 4, BRISK and proposed-SR2by8 produced the best performance. These results show that the presented super-resolution approach improves the performance of both the eigenface and BRISK approaches for low-resolution images. The results produced by combining the presented SR approach with the eigenface and BRISK approaches are comparable to the results on the original images.
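The construction of the LR/2 and LR/4 test sets can be imitated with a small sketch. The paper does not specify the degradation model, so everything below is an assumption made for illustration: block averaging for the size reduction, additive Gaussian noise with standard deviation `sigma`, and spike (salt-and-pepper) noise applied with probability `spike_prob`. The 96*96 input stands in for an original face image.

```python
import random


def downsample(img, factor):
    """Average non-overlapping factor x factor blocks (assumed decimation)."""
    h, w = len(img) // factor, len(img[0]) // factor
    return [[sum(img[y * factor + dy][x * factor + dx]
                 for dy in range(factor) for dx in range(factor)) / factor ** 2
             for x in range(w)] for y in range(h)]


def add_noise(img, sigma, spike_prob, rng):
    """Replace some pixels by 0/255 spikes and add Gaussian noise to the rest."""
    out = []
    for row in img:
        new_row = []
        for v in row:
            if rng.random() < spike_prob:
                v = rng.choice((0.0, 255.0))  # spike noise
            else:
                v = min(255.0, max(0.0, v + rng.gauss(0.0, sigma)))
            new_row.append(v)
        out.append(new_row)
    return out


def make_lr(img, factor, sigma=10.0, spike_prob=0.05, seed=0):
    """Produce a degraded low-resolution image, e.g. factor=2 for LR/2."""
    rng = random.Random(seed)
    return add_noise(downsample(img, factor), sigma, spike_prob, rng)


hr = [[float((x + y) % 256) for x in range(96)] for y in range(96)]
lr2 = make_lr(hr, 2)  # 48 x 48
lr4 = make_lr(hr, 4)  # 24 x 24
```

Fixing the seed keeps the degraded sets reproducible, so the same LR images can be fed to each recognition method being compared.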

Fig. 11 Face recognition rates on the ORL dataset using the proposed super-resolution-based approach

Table 2 Face recognition rate on the Caltech face dataset using the proposed super-resolution-based approach and other approaches with different training set sizes, K

Fig. 12 Face recognition rates on the ORL dataset using the proposed super-resolution-based approach and BRISK

Table 3 Face recognition rate on the Caltech face dataset with different training set sizes, K

A comparison of the BRISK and Eigenface methods is presented in Fig. 13 and Table 4 in the presence of Gaussian noise = 15 and varying spike noise levels to demonstrate the effectiveness of the BRISK method in combination with super-resolution. Even LR images with high noise content can be identified correctly under severe lighting variations. Figure 14 and Table 5 demonstrate the same concept with varying noise levels, indicating that BRISK is not affected much by the presence of Gaussian and spike noise owing to the effects of multi-frame super-resolution.

Fig. 13 Face recognition rate on the ORL face dataset with different spike noise levels (the series correspond to different noise spike levels)

Table 4 Face recognition rate on the Caltech face dataset with different Gaussian noise and spike noise levels

Fig. 14 Face recognition rate on the ORL face dataset with different Gaussian noise levels

Table 5 Face recognition rate on the Caltech face dataset with different Gaussian noise levels

5.4 Result analysis on chokepoint dataset

For testing on a video surveillance dataset, the ChokePoint dataset has been used [73]. This dataset was selected because it has been used by various authors previously and is available in a public repository, which allows the comparative analysis to be performed on the same data. The dataset consists of low-quality images captured in a real-world unconstrained environment and is hence quite challenging, as the images contain variations in lighting, pose, expression, etc. Three cameras were mounted over two portals (P1 and P2) to capture the video sequences of the subjects entering (E) or leaving (L) the portals in a natural manner. Since no noise has been added to these videos from outside, the performance gain obtained by our technique has to be calculated directly on them. For training, 10 images have been selected randomly from the S1_C1 sequence for each subject. The PSNR values are calculated and reported in Fig. 15. Figure 15(a), (b), and (c) demonstrate that the SR algorithm performs correctly for real-world images where the noise content is inherently present. It is to be noted that only simple translational motions are considered while capturing images. This means that for each expression change or pose variation, translated versions of the images have been generated for the scene. Each result shows interpolation to the 96*96 level using 15 iterations.

Fig. 15 PSNR results on P1L sequences of the ChokePoint dataset (Series1 = 24*24, Series2 = 48*48)

Table 6 shows the recognition rates for the images obtained after super-resolution. It is to be noted that the system was trained using random images obtained from each video, and the face recognition rates were then calculated for each case individually. The training set ranged from 5 to 9 images per video, showing that face recognition rates increase when more training data is available. The scale changes considered were from 24*24 ⇒ 96*96 and from 48*48 ⇒ 96*96.

Table 6 Face Recognition results on P1L sequences of ChokePoint dataset

In unconstrained motion scenarios, a person walking towards the camera may exhibit expression changes. These changes cause distortion in the face image. After applying ECC-based registration on 8 frames, the face image collected after SR was processed and stored. The training image set consisted of the 7 LR frames obtained initially, with a comparison performed between the initial LR image and the obtained SR image. BRISK was used as the descriptor for identification. In Fig. 16(a), the video sequence contained a person entering with the camera facing the front, which ensured maximum face coverage. In Fig. 16(b), the subject entered with the camera facing his side. This also gave good results, indicating that BRISK was able to extract unique features from the image even when the face was only partially visible. A steady increase in descriptors was obtained in all the cases, even for highly complicated motions and pose variations (Fig. 16(c)). This shows the effectiveness of the proposed system even for videos.

Fig. 16 BRISK descriptor count on P1L sequences of ChokePoint dataset for video 3 descriptors (Series1 = 24*24, Series2 = 48*48)

5.5 Discussion

Unconstrained face recognition conditions and low-resolution images are serious constraints on the accuracy of automated video surveillance. In real-life surveillance circumstances, images from surveillance cameras typically have low contrast and contain substantial blur and noise. The existing methods developed for high-resolution images do not generalize well to low-resolution images, and therefore, the face recognition task becomes challenging. This work presented a super-resolution-based approach to enhance the quality of low-resolution images and improve the accuracy of the face recognition system. The presented super-resolution approach is combined with the eigenface and BRISK approaches to overcome the video’s low-resolution image constraint. A performance evaluation on three different image and video datasets has been performed. The following observations have been drawn from the experimental analysis.

  • The results show that a significant performance improvement in face recognition can be achieved by combining BRISK descriptors with the presented multi-frame super-resolution approach. The presented approach with BRISK (BRISK-SR2by8 and BRISK-SR4by8) achieved around 5% improvement in the face recognition rate compared to low-resolution images (Table 3 reports these results).

  • The noise and spike level analysis showed that the presented approach is robust to noise. The performance of the approach remains nearly the same for different noise levels. Figures 7 and 8 showed that when the noise spike level increased from 5.0 to 5.6, the performance of the presented approach decreased only marginally.

  • The presented approach achieved a high PSNR value for different noise levels. This shows that the presented approach can accurately reconstruct the face from low-resolution images. Figures 9 and 10 report these results. The highest PSNR value achieved is approximately 0.8, for the noise spike level 520.

  • The evaluation using the Chokepoint video surveillance dataset demonstrated that the presented approach successfully extracted unique features from blurred or noisy images. This confirms the effectiveness of the presented approach.

  • Finally, the comprehensive experimental analysis has demonstrated that the presented super-resolution approach increases the efficacy of the system in face recognition and could be used in severely noisy and blurred conditions.

6 Conclusions and future work

For video-based surveillance techniques, recognizing face images from a long distance is a crucial yet challenging task due to the low image quality. To address this problem, the low-resolution (LR) images need to be enhanced to make them viable for recognition. The presented work aimed to demonstrate and verify the effect of a multi-frame super-resolution technique on the latest binary descriptor-based face recognition techniques. This work developed a system that generates a super-resolved image from multiple frames and verifies the face recognition performance. The efficacy of the system was determined by training it on a small set of HR images and then testing it on 24*24 and 48*48 LR images. The experimental analysis was performed on three video surveillance and image datasets: ORL, Caltech, and Chokepoint. The results showed an increase in image recognition rates where the face image didn’t contain pose, expression, and scale variations. Similarly, an increase in BRISK descriptor count has been observed for complicated cases involving scale, pose, and lighting variations. For the LR images, a performance increment of 5%-6% was observed in each case after applying SR and interpolating them to 96*96 resolution.

In the future, the use of the proposed system for unconstrained face recognition conditions will be explored. Further, a better registration mechanism will be devised to correctly align subsequent frames with each other, and the super-resolution framework will be replaced with an example-based super-resolution method to enhance speed and visual accuracy.