
SN Applied Sciences 1:1511

Understanding unconventional preprocessors in deep convolutional neural networks for face identification

  • Chollette C. Olisah
  • Lyndon Smith
Open Access
Case Study
Part of the following topical collections:
  1. Engineering: Data Science, Big Data and Applied Deep Learning: From Science to Applications

Abstract

Deep convolutional neural networks have achieved huge successes in application domains such as object and face recognition. The performance gain is attributed to different facets of the network architecture, such as the depth of the convolutional layers, the activation function, pooling, batch normalization, forward and back propagation, and many more. However, very little emphasis is placed on the network's preprocessing module. Therefore, in this paper, the preprocessing module is varied across different preprocessing approaches while the other facets of the deep network architecture are kept constant, in order to investigate the contribution preprocessing makes to the network. The commonly used preprocessors, data augmentation and normalization, are termed conventional preprocessors. The others are termed unconventional preprocessors: color space converters; grey-level resolution preprocessors; full-based and plane-based image quantization; Gaussian blur; and illumination normalization and insensitive feature preprocessors. To achieve fixed network parameters, a CNN with transfer learning is employed. The aim is to transfer knowledge from the high-level feature vectors of the Inception-V3 network to offline-preprocessed LFW target data; the features are then trained using a softmax classifier for face identification. The experiments show that the discriminative capability of deep networks can be improved by preprocessing RGB data with some of the unconventional preprocessors before feeding it to the CNN. However, for best performance, the right setup of preprocessed data with augmentation and/or normalization is required. In summary, preprocessing data before it is fed to the deep network is found to increase the homogeneity of neighborhood pixels even at reduced bit depth, which serves for better storage efficiency.

Keywords

Deep convolutional neural networks · Face identification · Preprocessing · Transfer learning

1 Introduction

Humans have an intuitive ability to effortlessly analyze, process and store face information for the purposes of identification and authentication [1]. This ability extends even to the recognition of face images at low resolution [2]. However, since the inception of convolutional neural networks (CNNs), a class of deep machine learning algorithms, machines have been developed that perform face identification and verification tasks at a level of efficiency comparable to that of humans.

The ability of intelligent machines to perform recognition tasks successfully depends on the CNN architecture and the format of the input data. While there is an arsenal of research on the architecture of deep networks [3], there is less focus on the input data of the CNN. This may be due to the perception that the network only needs the raw face images in RGB format in order to extract and learn relevant features for discerning between faces, without prior processing. However, some recognition applications present the network with scenarios where the information needed for recognition is not sufficiently represented. For instance, the dataset might be too small for training a neural network, the distribution of the data may vary, degradation due to noise or reduced grey-level resolution might be a problem, or the color space might differ. There may also be cases of intra-person variation resulting from pose differences and/or lighting. Additionally, there may be inter-person similarity, where faces of persons of different classes closely resemble one another. The latter is more typical within a face space than in an object space, but it is not within the scope of this study.

Typically, the size of the dataset for training a CNN is highly significant for making sense of the intricate patterns in the data. Compared to traditional machine learning algorithms that employ handcrafted feature extraction, the amount of data input to the CNN must be large (as is the case for ImageNet [4]). To address the small sample size problem, data augmentation methods such as translation, rotation, scaling and reflection are often employed in the literature; examples are [5, 6, 7, 8]. Another common practice is data normalization, which reduces variation in the distribution of the data: the mean is subtracted from each pixel of the input data and the result is divided by the standard deviation of the data. The study in [9] considered the object classification problem; it first applied zero-mean, one-standard-deviation normalization to the training data, followed by zero component analysis to enhance the edges of features fed to the CNN. Their work showed significant improvement over the raw input format. These practices are widely used and are known to substantially improve CNN performance in face recognition tasks. Another, less popular, preprocessing strategy is color space conversion. Reddy et al. [10] trained CNN models on different color spaces to investigate their influence on the performance of the network. Their results show that the best color format to serve to CNN models as input is the raw RGB image format. Though this study is in the context of object recognition, it appears relevant to face recognition. On the other hand, the robustness of CNNs to samples degraded by noise, studied in [11], is gaining interest. Degradation can take another form: reduced image grey-level resolution. This impacts the visual quality of images due to reduced bit depth and has been seen to affect performance [12, 13]. However, reduced bit depth can be useful for real-world applications [14, 15] of face recognition on mobile devices requiring little memory and processing. For this reason, it is a significant problem to be studied.

The problems of dissimilarity due to pose differences and lighting are particularly significant for hand-crafted feature extraction models. With preprocessing approaches such as face alignment, contrast enhancement and feature preprocessing employed prior to facial feature extraction, face recognition classifiers showed considerable increases in accuracy [16, 17]. Consequently, the effects of some of the commonly used preprocessors have been investigated on deep networks. An interesting characteristic of preprocessing in the CNN pipeline can be seen in the work of [18]. Here, raw input images of a general classification problem, which includes places, things and objects, were run through a local contrast normalization preprocessor before its output was sent to the network. The preprocessor showed significant improvement in the performance of the model in comparison to the use of the raw image format. Similarly, illumination normalization and contrast enhancement [19] were seen to enhance the performance of deep networks for face recognition on the Extended Yale face dataset by a large margin. Also, [19, 20] show that contrast enhancement does improve CNN accuracies. Pitaloka et al. [21] studied different preprocessors, including contrast enhancement and additive noising. The results show that preprocessing actually improves recognition accuracy: a remarkable 20.37% and 31.33% improvement over the recognition accuracy of the original raw input data was observed with histogram equalization and noise addition, respectively, on facial expression datasets. Hu et al. [22] applied and studied various feature preprocessors: large- and small-scale features (LSSF), Difference-of-Gaussian (DoG), and Single Scale Retinex (SSR). These feature preprocessors were applied to face images that were fed to the CNN model. Their work showed that a 10.85% increase over the recognition accuracy of the network on the original input data can be achieved.

In all these studies, the input data, whether raw or preprocessed, were trained on the CNN and the accuracies of the model were validated and tested. However, the future of deep learning should encompass solving large search engine problems. Search engines, such as Google, house collections of data on various entities such as faces, objects, places, things, and so on. Therefore, it is unclear how deep convolutional models trained on specific recognition task domains can translate to practical search engine usage, given that the data format may have changed. Another challenge is, for a search of a given person's face, for the engine to output similar faces of the individual. For cases such as this, transfer learning may be useful. The search engine can house a pool of data for features such as faces, objects, places, and/or things, which vary extensively and are captured at different camera sensitivities. Therefore, it is worth investigating whether preprocessors are relevant when knowledge of general classification feature model parameters is transferred from a pretrained model to a specific new classification problem like face identification.

In this paper, the performance gain of commonly used and yet simple preprocessors is studied, from a generalized classification problem to a more streamlined search problem. The preprocessors being considered are chosen with respect to the assumptions made about the data. The assumptions are: (a) small sample dataset, (b) varying distribution, (c) pose variability, (d) input data of a different color space, (e) image degradation due to grey-level resolution, and (f) intra-class dissimilarity due to varying lighting. To address these issues, the following categories of preprocessors are employed, respectively: data augmentation, data normalization, data alignment, RGB-to-other color space conversion, grey-level quantization, and illumination normalization and insensitive feature preprocessing. The preprocessed input is then fed to a deep CNN model pre-trained on 1.2 million images for generalized classification tasks that mimic a search engine application domain. The weights and biases of the pre-trained model are transferred to the new specific search problem, while the last fully-connected layer is updated as the new data is trained. The contributions of this paper are summarized as follows:
  • Empirical analysis of the preprocessing module in deep networks with knowledge transfer, to demonstrate the performance of the network when the data format of the target domain differs from the source domain in color space, grey-level resolution, or lighting.

  • Exhaustive evaluation of conventional preprocessing alongside unconventional preprocessing methods, to investigate the best preprocessor setup in deep networks.

  • Demonstration and proposal of effective preprocessing strategies for input images to deep networks: plane-based quantization and Gaussian blurring. Quantization increases the homogeneity of neighboring pixels, and blurring retains the relevant features; both utilize a reduced bit depth for better storage efficiency. In contrast, prior work typically fashions quantization inside the CNN architecture and uses Gaussian blur only to mimic degraded probe samples.

2 The framework

The preprocessing module is varied across different preprocessing approaches while the other facets of the deep convolutional neural network architecture are kept constant. Instead of the raw RGB image, a preprocessed image, S, is input to the network as follows:
$$f\left( S \right) = WS + b$$
where S is the preprocessed input image, W the weight matrix (a learnable parameter) and b the bias term. Since the learned W depends on the input, it is possible that a preprocessed input can achieve a better discriminative space.
The setup of the preprocessors in the CNN for the face identification task is illustrated in Fig. 1 and described in the following subsections.
Fig. 1

The preprocessor module in a deep convolutional learning framework. Fixed network parameters are achieved through transfer of knowledge from generalized raw RGB ImageNet data samples, comprising faces, objects, places, things, animals, etc., to preprocessed target LFW samples

2.1 Preprocessing

The raw RGB face images are preprocessed using data augmentation by translation, data alignment by deep funneling, and normalization by the zero-mean, one-standard-deviation method. In addition, Hue-Saturation-Value (HSV), CIELAB and YCbCr color space converters, image quantizers, and illumination normalization and insensitive feature preprocessors are applied, the latter using histogram equalization (HE), rgb-gamma encoding in the log domain (rgbGELog) [23], local contrast normalization (LN), illumination normalization based on LSSF [24] and the complete face structural pattern (CFSP). These preprocessors individually address assumptions (a)-(f), respectively, associated with the dataset.

2.1.1 Data augmentation

Training from scratch or from a pretrained network requires a good number of data samples per class for the CNN to generalize well to the given class. Data augmentation has been used as an artificial means to grow the size of the training data. Different transformation approaches are commonly used: translation, rotation, scaling and reflection [5, 6, 7, 8]. Each contributes differently to CNN performance; for interested readers, the work by Paulin et al. [25] exemplifies the performance of each transformation. Since data augmentation is commonly adopted by the deep learning community, it is deemphasized in this work. Therefore, only a translation operation is performed, such that the data are translated by [+30, −30] pixels to create four (4) additional faces per class, shifted to the left, right, top and bottom.
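As a sketch, the ±30 pixel translation described above can be implemented as follows (a minimal NumPy version; the function names and the zero-filling of the vacated border are our assumptions, since the paper does not state how the border is handled):

```python
import numpy as np

def translate(image, dx, dy):
    """Shift an H x W x C image by dx pixels horizontally and dy pixels
    vertically, filling the vacated border with zeros (no wrap-around)."""
    h, w = image.shape[:2]
    out = np.zeros_like(image)
    src_y = slice(max(0, -dy), min(h, h - dy))
    src_x = slice(max(0, -dx), min(w, w - dx))
    dst_y = slice(max(0, dy), min(h, h + dy))
    dst_x = slice(max(0, dx), min(w, w + dx))
    out[dst_y, dst_x] = image[src_y, src_x]
    return out

def augment_by_translation(image, shift=30):
    """Create the four shifted variants per face: right, left, down, up."""
    return [translate(image, +shift, 0), translate(image, -shift, 0),
            translate(image, 0, +shift), translate(image, 0, -shift)]
```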

2.1.2 Data alignment and normalization

A common problem of real-world face image data is that face appearance, from a person's frontal view to his/her profile, varies drastically across large poses. Face alignment is used to improve the performance of handcrafted feature extraction algorithms and is currently applied to the input faces of deep networks, to set face data of multiple pose variations to a canonical pose. Since removing pose variability significantly improves recognition performance, this work continues in that trend by using images aligned with the deep funneling method [26, 27]. Also, to make convergence of the network faster while training, zero-mean, one-standard-deviation normalization is adopted. This ensures that the input data are normalized to exhibit the same distribution.
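A sketch of the zero-mean, one-standard-deviation normalization (we assume the statistics are computed over the whole training set; the paper does not specify per-image versus per-dataset statistics):

```python
import numpy as np

def normalize(images):
    """Zero-mean, one-standard-deviation normalization of an
    N x H x W x C float array, using dataset-wide statistics."""
    mean, std = images.mean(), images.std()
    return (images - mean) / (std + 1e-8)  # epsilon guards against a zero std
```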

2.1.3 Color space conversion

In computer vision, color spaces other than RGB are somewhat robust to lighting changes. Therefore, following on from the work in [10], this study evaluates color space preprocessors such as Hue-Saturation-Value (HSV), YCbCr and CIE L*a*b*. In YCbCr, Y represents luminance, while Cb and Cr represent the chrominance components. In CIE L*a*b*, L is luminance, while a* and b* are the green–red and blue–yellow color components. The analysis is not only focused on whether the RGB performance is improved, but also addresses the following question: is it possible for a CNN trained on RGB input data to transfer its knowledge to input data of a different color space?
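For illustration, the color space conversions can be done with OpenCV (a sketch; note that OpenCV reads images in BGR plane order and stores YCbCr as Y, Cr, Cb):

```python
import cv2

# Conversion codes for the color spaces evaluated in this study
CODES = {
    "HSV": cv2.COLOR_BGR2HSV,
    "YCBCR": cv2.COLOR_BGR2YCrCb,  # OpenCV orders the planes Y, Cr, Cb
    "CIELAB": cv2.COLOR_BGR2LAB,
}

def to_color_space(bgr_image, space):
    """Convert an 8-bit BGR image to one of the evaluated color spaces."""
    return cv2.cvtColor(bgr_image, CODES[space])
```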

2.1.4 Image degradation

Grey-level resolution reduction is considered one of the ways an image can be degraded. Two forms of degradation are considered in this paper: degradation by quantization and by image blurring. Degradation by quantization often arises from color quantization, which compresses an image's grey levels to a reduced bit depth. Most online-acquired data suffer from color compression because most handheld devices support only a limited number of colors. Two approaches are used: full-based (global) quantization and plane-based quantization. Both employ Otsu's thresholding method, but the former generates a multilevel threshold vector from the whole RGB image based on a specified level; the threshold vector values change as the quantization level changes. For example, a 6-level quantization (see Fig. 2) means that 6 threshold values are generated from the entire raw-format image and used to quantize it. The latter takes into consideration that the grey levels change from plane to plane in an RGB image; therefore, threshold values are generated for the red, green and blue planes separately, creating a 6-value threshold vector per plane if the chosen quantization level is 6. To capture real-world mobile device data, it is worth observing the performance of deep networks at reduced grey levels. This study hypothesizes that the quantization process increases the homogeneity of neighboring pixels, which is expected to increase the discriminative capability of the CNN classifier. Note that this study considers quantization as preprocessing, unlike other quantization-based CNN approaches that quantize within the architecture of the network [14, 15].
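The two quantization approaches might be sketched as follows with scikit-image's multi-Otsu thresholding (our assumption for the threshold generator; the paper states only that Otsu's method is used). `np.digitize` then maps each pixel to its quantization level:

```python
import numpy as np
from skimage.filters import threshold_multiotsu

def quantize_full(rgb, levels):
    """Full-based quantization: one Otsu threshold vector is computed from
    the entire RGB image and applied to all three planes at once."""
    thresholds = threshold_multiotsu(rgb, classes=levels)
    return np.digitize(rgb, bins=thresholds).astype(np.uint8)

def quantize_plane(rgb, levels):
    """Plane-based quantization: a separate Otsu threshold vector is
    computed for the red, green and blue planes, respectively."""
    out = np.empty(rgb.shape, dtype=np.uint8)
    for c in range(3):  # quantize each plane with its own thresholds
        thresholds = threshold_multiotsu(rgb[..., c], classes=levels)
        out[..., c] = np.digitize(rgb[..., c], bins=thresholds)
    return out
```

The output holds level indices (0 to levels − 1); rescaling back to displayable grey values would be a further step.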
Fig. 2

Image quantization. Using a region in an image, the homogeneity of neighboring pixels is illustrated for a 224 grey-level region converted to 7 levels and 4 levels, respectively

Image blur is another form of degradation which reduces image quality. Gaussian blur is a popular image blurring technique in the field of image processing. Its importance is heightened in handcrafted feature-based face recognition, but it seems out of place in a CNN architecture because the network itself operates with convolutional kernels. However, we postulate that it might enhance learning in CNN-based face recognition as it does in handcrafted feature-based face recognition. For this reason, we explore Gaussian blur as a preprocessing module that serves to retain features of significance from fine to coarse, and see how it fits within a CNN architecture.

The Gaussian blur is achieved using Gaussian kernels with σ = 3, 7, 11 to simulate fine-to-coarse features. Unlike [12, 13], which considered blur as a degradation property of probe samples, the Gaussian blur here is applied to both the target and probe data of the LFW validation set, just like every other preprocessor used in this paper.
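A sketch of the blur preprocessor, applied identically to target and probe images (SciPy version; sigma acts on the spatial axes only, not across color planes):

```python
from scipy.ndimage import gaussian_filter

def gaussian_blur(rgb, sigma):
    """Blur each color plane with a Gaussian kernel; sigma in {3, 7, 11}
    simulates feature retention from fine to coarse."""
    return gaussian_filter(rgb.astype(float), sigma=(sigma, sigma, 0))
```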

2.1.5 Illumination normalization and insensitive feature preprocessing

For face images acquired at different spectral bands, the effects of spatial variation in the sensitivity of camera systems are likely to occur. To minimize this variability for face images of the same class, rgbGELog [23] and the commonly used LSSF illumination normalization technique [24] are employed. Other approaches, such as LN and CFSP, involve the extraction of illumination-insensitive features; these feature preprocessors mostly enhance edges as opposed to low-level features. To output a color image with LSSF, HE, LN and CFSP, a color version of each preprocessor is used: given an RGB image, each channel plane is processed individually, as sketched below.
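A sketch of this per-channel strategy, shown with histogram equalization as the single-channel preprocessor (the wrapper name is ours; LSSF, LN and CFSP would slot into the same wrapper):

```python
import numpy as np
from skimage import exposure

def per_channel(preprocess, rgb):
    """Apply a single-channel preprocessor to each RGB plane individually,
    so that a color image is returned."""
    return np.stack([preprocess(rgb[..., c]) for c in range(3)], axis=-1)

face = np.random.rand(250, 250, 3)                   # stand-in for an LFW face
he_face = per_channel(exposure.equalize_hist, face)  # color version of HE
```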

2.2 Convolutional neural networks

Current trends in the application of deep CNNs show that their possibilities in the real world are endless. A remarkable attribute of deep networks is their ability, given sufficient training data, to process raw pixel data directly, extract and learn deep structural features for discrimination, and generalize incredibly well to new data. More remarkable still is that the deep structural features of the network can be transferred irrespective of the domain [28, 29]. For this reason, transfer learning is explored.

2.3 Transfer learning

A deep CNN architecture designed by Szegedy et al. [30], denoted Inception-V3, was used. The fact that the Inception-V3 model is trained on ImageNet [31], a huge dataset of 1.2 million generalized data samples and 1000 distinct class labels (of faces, objects, places, things, animals, etc.), makes it a good fit for the objective of this study. It is commonly presented in the literature that transfer learning is mainly for situations where training data is insufficient [32]. However, a more significant property of interest in this study is the transfer of knowledge from one (source) domain to an almost unidentical (target) domain. By unidentical it is meant that the data format of the target set might change: it may be of a different color space, reduced grey-level resolution, varying lighting, etc. The work of [33] investigated the rich features of the CNN at different layers and showed that the lower layers respond to edge-like features, while succeeding layers combine lower-layer responses into more abstract features, which are finally merged at the higher layers as global features. This is likened to the recognition ability of the human visual cortex [1], which processes parts of the face individually and puts them together as a global feature to make sense of a person's identity. In [34], the output of the last layer, which comprises the high-level feature vectors of a pretrained CNN, was shown to generalize better to a new target dataset than fine-tuning some layers of the network. It is for these reasons that the high-level feature vectors of the Inception-V3 model are found useful for this study's face search problem.
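The paper's implementation uses TensorFlow-Slim; the same transfer setup can be sketched in tf.keras (an equivalent formulation, not the authors' exact code): freeze the pretrained Inception-V3 feature extractor and train only a new softmax head for the LFW classes.

```python
import tensorflow as tf

# Inception-V3 pretrained on ImageNet, without its 1000-way classifier head
base = tf.keras.applications.InceptionV3(
    include_top=False, weights="imagenet",
    input_shape=(299, 299, 3), pooling="avg")
base.trainable = False  # freeze the lower layers; reuse their features as-is

# New softmax classifier trained on the (preprocessed) LFW target data
model = tf.keras.Sequential([
    base,
    tf.keras.layers.Dense(10, activation="softmax"),  # 10 person classes
])
model.compile(optimizer="adam",
              loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])
```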

3 Experimental setup

For a better understanding of the experimentation carried out in this study, the data, the transfer model settings and the evaluation strategy are presented in the succeeding subsections.

3.1 Data

The LFW dataset [26] is commonly used for modelling real-world data. It is well known for intra-class variability resulting from pose, illumination and expression problems. The images are in RGB format and comprise 13,233 face images of 5749 individuals. The face search problem, with respect to the objective of this study, does not necessarily demand huge data for classifier training. Therefore, only individuals with over 50 images are considered, to enable the classifier to generalize well to the new data.

Each of the 1456 deep-funneled [27] face samples, belonging to 10 person classes, comprises an RGB image of 250 × 250 resolution, but is resized to 299 × 299 because the pretrained network was trained on images of size 299 × 299 × 3. After resizing, the images are further scaled to the range [0, 1], as sketched below.
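A sketch of this resize-and-scale step (scikit-image's `resize` already returns floats in [0, 1] for 8-bit input):

```python
from skimage.transform import resize

def prepare(face_250):
    """Resize a 250 x 250 LFW face to the 299 x 299 x 3 input expected by
    Inception-V3; the output is float-valued and scaled to [0, 1]."""
    return resize(face_250, (299, 299), anti_aliasing=True)
```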

3.2 Training the classifier

The training was implemented using Inception-V3 pretrained on ImageNet in the standard software package TensorFlow-Slim [35]. The lower layers of the network are frozen, while the last (high-level features) layer of the network is used as provided by the freely available source code. This layer is believed to resemble the human global assembly of individually processed parts of a face and is therefore transferred for training new weights and biases on the face search data. Computation was performed using an Intel® Core™ i7-7500U CPU with 4 logical processors. The dataset was split into training, validation and testing sets containing 70%, 5% and 25% of the images, respectively. The validation set controls the training process, while the final accuracy is determined using the test set. The Adam optimizer was used with an exponentially decayed learning rate, from an initial 0.003 down to 0.0001 (stepped down every 29 iterations); a sketch of this schedule follows.
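The schedule might look as follows in tf.keras (a sketch; the decay factor per 29-step interval is our assumption, chosen so the rate falls from 0.003 toward 0.0001 over training, since the paper does not state the exact factor):

```python
import tensorflow as tf

# Step-down exponential decay: start at 0.003 and drop every 29 iterations
schedule = tf.keras.optimizers.schedules.ExponentialDecay(
    initial_learning_rate=0.003,
    decay_steps=29,
    decay_rate=0.9,   # assumed per-interval factor; ~940 steps reach 1e-4
    staircase=True)   # step-down rather than continuous decay
optimizer = tf.keras.optimizers.Adam(learning_rate=schedule)
```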

3.3 Evaluation

Performance is evaluated using the Top-1 accuracy metric. Since the data augmentation and normalization preprocessors are commonly used on aligned data by the CNN research community, this study terms them conventional preprocessors. Consequently, they become the basis for evaluating the unconventional preprocessors in a CNN, under four configurations: with_augmentation (WA), without_augmentation (NA), with_normalization (WN), and without_normalization (NN). The performance of each of the unconventional preprocessors is reported under these categories: color space conversion, illumination normalization and insensitive feature preprocessors, and grey-level resolution degradation. Finally, we present the unconventional preprocessors' performance alongside state-of-the-art methods.
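For reference, Top-1 accuracy is simply the fraction of test faces whose highest-probability class matches the true identity; a minimal sketch:

```python
import numpy as np

def top1_accuracy(probs, labels):
    """probs: N x K softmax outputs; labels: N integer class IDs."""
    return float(np.mean(np.argmax(probs, axis=1) == labels))
```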

4 Results and discussion

Here, the applicability of unconventional preprocessing together with conventional preprocessing methods in transfer networks, for a specific target problem such as face search, is observed and reported. The results are discussed in four stages. Stage one: the general performance of the preprocessors, encompassing color space conversion and illumination normalization and insensitive feature preprocessors. Stage two: a streamlined report on the various grey-level degradation preprocessors; here, various levels of quantization (224 grey levels to 7, 6, 5 and 4 levels) and Gaussian kernels with σ = 3, 7, 11 are presented, and the accuracy and loss of some of the preprocessors in this category are further illustrated. Stage three: the face search is depicted through the performance of each of the unconventional preprocessors for a search of a given person's identity. Stage four: a comparison with state-of-the-art results.

4.1 Preprocessor performance in transfer network

Table 1 shows the evaluation results of passing the target data through the color space conversion and the illumination normalization and insensitive feature preprocessors; the latter are lent from handcrafted feature extraction algorithms. The evaluation of these preprocessors is compared against the raw RGB format commonly used in CNNs. This evaluation seeks to answer the following questions: (1) how much knowledge can be transferred when the input formats of the source and target domains are different? and (2) are preprocessors for addressing the effects of spatial variation due to camera system sensitivity relevant in transfer networks?
Table 1

RGB versus preprocessor performance in transfer networks

| Category | Preprocessor | WA and WN (%) | WA and NN (%) | NA and WN (%) | NA and NN (%) | Mean (%) |
|---|---|---|---|---|---|---|
| Original | RGB | **70.7386** | **65.6250** | 51.1364 | 50.0000 | 59.3700 |
| Color space conversion | CIELAB | 41.7614 | 40.6250 | 43.4659 | 39.7727 | 41.4062 |
| | HSV | 51.1364 | 41.4318 | 52.8409 | 53.4091 | 49.7046 |
| | YCbCr | 57.1023 | **65.6250** | **65.3409** | **64.4886** | 63.1392 |
| | rgbGELog | 58.8068 | 56.8182 | 59.6591 | 53.4091 | 57.1733 |
| Illumination normalization and insensitive methods | Histogram equalization (HE) | 64.7727 | 51.9886 | 55.3977 | 56.5341 | 57.1733 |
| | CFSP | 59.3750 | 55.1136 | 51.7054 | 53.6932 | 54.9719 |
| | Local contrast normalization (LN) | 59.6591 | 58.8068 | 59.6391 | 52.8409 | 57.7365 |
| | Large- and small-scale features (LSSF) | 68.7500 | 61.3636 | 63.9205 | 59.9432 | **63.4943** |

The highest result in each column is highlighted in bold

Surprisingly, the performance of the color space preprocessors showed that format does not hinder knowledge transfer. That is to say, no matter the input format of the data in a transfer setting, the transferred features still generalize well to the new target data. CIELAB was by far the worst color format across the board: a difference of 28.9772% is observed when compared against the RGB format in the augmented and normalized data experiment. This might be because it captures only structural features. Though RGB has the highest identification accuracy, YCbCr remained consistent across all the experiments; on average, it outperformed the raw RGB format by a 3.7692% margin.

On the other hand, LSSF appears to perform best within the illumination normalization and insensitive feature preprocessor category. The others performed similarly, except for CFSP, which does not seem promising in a transfer network. The best performing illumination preprocessor is thus LSSF, followed by LN and then HE.

Additionally, it is evident that different preprocessors react differently to data augmentation and normalization. Raw RGB, HE, CFSP, LN and LSSF perform better when the data are normalized and augmented. YCbCr is best with only data augmentation, and rgbGELog is best with only normalized data.

4.2 Grey-level preprocessor performance in transfer network

Even more interesting is the performance of grey-level resolution reduction preprocessors in transfer networks. As shown in Table 2, the results of this experiment support our claim that the homogeneity of neighboring pixels is increased through quantization and image blurring. Top-1 accuracies of 100.00%, 99.7159% and 72.7273% are achieved for Gaussian-blurred images with σ = 11 and σ = 3, and for a 224 grey-level image quantized to 7 levels using the plane-based quantization approach, respectively. The experiment also shows that creating variants of the data by translation does not favor quantized data, whereas it favors Gaussian-blurred images in deep networks. Despite the fact that the raw RGB format achieved a 70.7386% accuracy with increased data, it is worth noting that the quantized image achieved better accuracy without augmentation, and the Gaussian-blurred data exceeded expectations with its very high accuracy.
Table 2

Comparing the performance of RGB data and grey-level preprocessors in a transfer network

| Category | Preprocessor | WA and WN (%) | WA and NN (%) | NA and WN (%) | NA and NN (%) | Mean (%) |
|---|---|---|---|---|---|---|
| Original | RGB | **70.7386** | 65.6250 | 51.1364 | 50.0000 | 59.3700 |
| Full | F-Level-4 | 66.4773 | 63.9205 | 68.4659 | 66.4773 | 66.3353 |
| | F-Level-5 | 63.3523 | 64.4886 | 65.0568 | 64.7727 | 64.4176 |
| | F-Level-6 | 63.9205 | 60.2273 | 62.2159 | 59.0909 | 61.3637 |
| | F-Level-7 | 67.6136 | 69.6023 | 67.0455 | 67.0455 | 67.8262 |
| Plane | P-Level-4 | 64.4886 | 63.3523 | 63.3523 | 67.6136 | 67.7017 |
| | P-Level-5 | 64.7727 | 70.1705 | 70.4545 | 65.0568 | 67.6136 |
| | P-Level-6 | 65.6250 | 64.2045 | 67.6136 | 63.3523 | 65.1989 |
| | P-Level-7 | 67.0455 | 68.7500 | **72.7273** | 65.3409 | 68.4659 |
| Gauss-blur | Sigma-3 | – | 98.5795 | – | **99.7159** | 99.1477 |
| | Sigma-7 | – | 99.1477 | – | 99.1477 | 99.1477 |
| | Sigma-11 | – | **100.0000** | – | **99.7159** | **99.8579** |

The highest result in each column is highlighted in bold. The Gaussian blur preprocessor was evaluated without normalization (see Sect. 4.2), hence the empty WN cells

In the first experiment, with data augmentation and normalization, quantization to 7 levels using the full-image-based approach outperforms its other full-based counterparts as well as the plane-based quantization approach. A similar pattern is observed for data quantized to 4 levels with no augmentation and no normalization. In the second and third experiments, the 5-level quantization preprocessors competed between augmentation and normalization with almost a draw in performance. However, the 7-level quantization performs significantly well when the data are normalized.

Further, the experiment with the Gaussian blur preprocessor was performed without normalization; we assume that such a preprocessor operates in a similar way to a normalizer. It is not clear from the experiment why σ = 7 did not improve on the accuracy of σ = 3, but it might imply that pixels within a 3 × 3 and an 11 × 11 neighborhood share a lot more in common than those within a 7 × 7 neighborhood. One might think that the performance of the Gaussian blur preprocessor is a result of overfitting, but the accuracy plot in Fig. 3 reveals otherwise. However, the face identity search and retrieval experiment points to σ = 3 as the more consistent performer.
Fig. 3

Visualizing the performance of some unconventional preprocessors in deep transfer network

4.3 The transfer network face search problem

The layout of this experiment is illustrated in Fig. 4. Given the different types of preprocessors, whose individual performances have been observed, it is important to determine the practicability of the face search in the transfer network. The transfer network makes the search problem solvable. For a large database of faces, objects, places, things, and so on, a search engine searching for a face identity might take hours or even days. This is where the transfer network becomes really useful, because the engine can be modelled to comprise a large engine (for all the possible classes: faces, objects, places, things, etc.) and a mini-engine (for specific sample data), possibly on a handheld mobile device. Here it is of interest to know how well the preprocessors are able to retrieve the identity of a given face relative to the other face classes of the target dataset. The result of this experiment is presented in Fig. 5, which shows the result of querying the engine for Ariel Sharon and George W. Bush to determine their identities. The CNN with plane-based quantization from 224 grey levels to 7 levels achieved a 99.88% accuracy with no augmentation and with normalization, while the CNN with the RGB format achieved 99.27% with augmentation and normalization.
Fig. 4

The face search data model. Exploring the identity retrieval of different preprocessors in a transfer network

Fig. 5

Identity retrieval accuracy for Ariel Sharon and George W. Bush across the 10 face classes for different preprocessors. The view is limited to top 4 accuracies of the identity retrieval

Interestingly, the LSSF, HE, YCbCr, rgbGELog, and full- and plane-based quantization preprocessors were all useful and competitive at retrieving the right identity for a query image. However, in order to attain the best performance from any of these preprocessors, it is best to pair them with the conventional preprocessing (augmentation and/or normalization) that suits them.

The face search experiment also reveals that the raw RGB format only performs well when the data size is increased by augmentation and the distribution of the data is normalized; otherwise, it fails to perform favorably. LSSF, YCbCr, rgbGELog, and HE maintained good accuracies independent of normalization and/or augmentation. LSSF in particular performed better than the raw RGB in identity retrieval.

4.4 On the state-of-the-art results

The accuracies of unconventional-preprocessor-driven networks are reported alongside state-of-the-art results on the LFW dataset, particularly to show that preprocessors can contribute to the learning of features that matter to deep networks if properly explored. The comparison is limited to deep networks which utilized outside data for training, though the methods under consideration differ in the number of LFW validation samples, evaluation protocols, preprocessors and CNN architectures.

For completeness, Table 3 reports the accuracies of LFW benchmarks under deep face recognition methods with outside data. Though the evaluation protocols are not the same, we compare only against methods based on deep networks. From Table 3 we see that, with augmentation and normalization, the deep network with LSSF (Inception-V3-LSSF) performed comparably to CNN-L-SyntheticFace [22], with a difference margin of about 0.2%.
Table 3

Unconventional preprocessors comparison with state-of-the-art results on LFW

| Method | Preprocessing | Training set | Accuracy (%) |
|---|---|---|---|
| DeepFace [36] | alignment, augmentation | Facebook (4.4 M, 4 K) | 97.35 |
| DeepID^a | none | CelebFaces+ (0.2 M, 10 K) | 99.53 |
| VGGFace [37] | augmentation | VGGFace (2.6 M, 2.6 K) | 98.95 |
| L-Softmax [38] | alignment, crop, PCA | CASIA-WebFace (0.49 M, 10 K) | 99.42 |
| SphereFace [39] | crop, normalization | CASIA-WebFace (0.49 M, 10 K) | 99.42 |
| RingLoss [40] | none | MS-Celeb-1M (3.5 M, 31 K) | 99.50 |
| RangeLoss^b | none | MS-Celeb-1M, CASIA-WebFace (5 M, 100 K) | 99.52 |
| CNN-L-SyntheticFace [22] | face synthesis with facial parts | face synthesis with facial parts | 69.11 |
| CNN-L-SyntheticFace-LSSF [22] | face synthesis with facial parts | face synthesis with facial parts | 68.97 |
| CNN-L-SyntheticFace [22] | face synthesis with facial parts | CASIA-WebFace (0.5 M, 10 K) | 97.23 |
| Inception-V3-Original [12] | none | VGGFace (1.8 M, 1 K) | 92.15 |
| Inception-V3-Original [12] | Gaussian blur with σ = 3 (probe set) | VGGFace (1.8 M, 1 K) | 70.00 |
| Inception-V3-Original [12] | Gaussian blur with σ = 11 (probe set) | VGGFace (1.8 M, 1 K) | 50.00 |
| Inception-V3-Original | augmentation, normalization | ImageNet (1.2 M, 1 K) | 70.74 |
| Inception-V3-LSSF | augmentation, normalization | ImageNet (1.2 M, 1 K) | 68.75 |
| Inception-V3-Quantization | plane-based quantization to 7 levels | ImageNet (1.2 M, 1 K) | 72.73 |
| Inception-V3-Gaussianblur | Gaussian blur with σ = 11 | ImageNet (1.2 M, 1 K) | 99.72 |
| Inception-V3-Gaussianblur | Gaussian blur with σ = 3 | ImageNet (1.2 M, 1 K) | 99.72 |
| Inception-V3-Gaussianblur | Gaussian blur with σ = 11, augmentation | ImageNet (1.2 M, 1 K) | 100.00 |

^a arXiv preprint arXiv:1502.00873

^b arXiv preprint arXiv:1611.08976

Our preprocessor-based deep network is contrary in goal to the work in [12], which is basically to understand the performance of a deep network model when probed with an image degraded by blurriness from camera settings. They applied the Gaussian blur of different sigma levels only to the probe set, and their results clearly showed that blurred images drastically reduce the accuracy of a deep network. On the other hand, we applied the Gaussian blur with the goal of preprocessing, to retain features of significance from fine to coarse and see how it fits within a CNN architecture. From our experiments it is evident that Gaussian blur is a promising preprocessor for deep networks. The performance of blur with σ = 3 over σ = 11 graphed in Fig. 5 confirms our earlier observation on performance consistency.

5 Conclusion

In contrast to what is commonly believed, the preprocessing module has proved significant in deep learning frameworks. In this paper, the various facets of the CNN architecture were kept constant while the preprocessing module was varied across different preprocessing algorithms.

HE, full-based and plane-based quantization, rgbGELog, and YCbCr (the unconventional preprocessors in CNNs) showed that the discriminative capability of deep networks can be improved by preprocessing the raw RGB data prior to feeding it to the network. However, the best performance of these preprocessors was achieved by considering, under various preprocessing setups, data augmentation and/or normalization (the conventional preprocessors in CNNs). Even though the raw RGB format performed well, quantizing a 224 grey-level image to 7 levels and Gaussian blur with σ = 3 outperformed RGB: they achieved above 72% and 99% accuracy, respectively, with data normalization and no augmentation. These preprocessors were found to be an effective preprocessing strategy in deep networks because they both increase the homogeneity of neighboring pixels and utilize a reduced bit depth for better storage efficiency.


Funding

The authors received no specific funding for this work.

Compliance with ethical standards

Conflict of interest

The authors declare that they have no conflict of interest.

References

  1. Dakin SC, Watt RJ (2009) Biological bar codes in human faces. J Vis 9:1–10
  2. Sinha P, Balas B, Ostrovsky Y, Russell R (2006) Face recognition by humans: nineteen results all computer vision researchers should know about. Proc IEEE 94:1948–1962
  3. Liu W et al (2017) A survey of deep neural network architectures and their applications. Neurocomputing 234:11–26
  4. Krizhevsky A, Sutskever I, Hinton GE (2012) ImageNet classification with deep convolutional neural networks. In: Advances in neural information processing systems, pp 1097–1105
  5. Gudi A et al (2015) Deep learning based FACS action unit occurrence and intensity estimation. In: 11th IEEE international conference and workshops on automatic face and gesture recognition, pp 1–5
  6. Khorrami P, Paine T, Huang T (2015) Do deep neural networks learn facial action units when doing expression recognition? In: Proceedings of the IEEE international conference on computer vision workshops, pp 19–27
  7. Mollahosseini A, Chan D, Mahoor MH (2016) Going deeper in facial expression recognition using deep neural networks. In: Proceedings applications of computer vision, pp 1–10
  8. Lopes T et al (2017) Facial expression recognition with convolutional neural networks: coping with few data and the training sample order. Pattern Recogn 61:610–628
  9. Pal KK, Sudeep KS (2016) Preprocessing for image classification by convolutional neural networks. In: IEEE international conference on recent trends in electronics, information and communication technology, pp 1778–1781
  10. Reddy KS, Singh U, Uttam PK (2017) Effect of image colourspace on performance of convolution neural networks. In: IEEE international conference on recent trends in electronics, information and communication technology, pp 2001–2005
  11. Dodge S, Karam L (2016) Understanding how image quality affects deep neural networks. In: 8th international conference on quality of multimedia experience, pp 1–6
  12. Grm K, Struc V, Artiges A, Caron M, Ekenel HK (2017) Strengths and weaknesses of deep learning models for face recognition against image degradations. IET Biom 7:81–89
  13. Karahan S, Yildirum MK, Kirtac K, Rende FS, Butun G, Ekenel HK (2016) How image degradations affect deep CNN-based face recognition? In: IEEE international conference in biometrics special interest group, pp 1–5
  14. Wu J et al (2016) Quantized convolutional neural networks for mobile devices. In: IEEE conference on computer vision and pattern recognition (CVPR), pp 4820–4828
  15. Zhao DD, Li F, Sharif K et al (2019) Space efficient quantization for deep convolutional neural networks. J Comput Sci Technol 34:305–317. https://doi.org/10.1007/s11390-019-1912-1
  16. Tan X, Triggs B (2010) Enhanced local texture feature sets for face recognition under difficult lighting conditions. IEEE Trans Image Process 19:1635–1650
  17. Huang D et al (2011) Local binary patterns and its application to facial image analysis: a survey. IEEE Trans Syst Man Cybern Part C 41:765–781
  18. Yu S et al (2017) A shallow convolutional neural network for blind image sharpness assessment. PLoS ONE 12:e0176632
  19. Ghazi MM, Ekenel HK (2016) A comprehensive analysis of deep learning-based representation for face recognition. In: Proceedings of the IEEE conference on computer vision and pattern recognition workshops, pp 34–41
  20. Dosovitskiy A, Springenberg JT, Riedmiller M, Brox T (2014) Discriminative unsupervised feature learning with convolutional neural networks. In: Advances in neural information processing systems, pp 766–774
  21. Pitaloka DA, Wulandari A, Basaruddin T, Liliana DY (2017) Enhancing CNN with preprocessing stage in automatic emotion recognition. Procedia Comput Sci 116:523–529
  22. Hu G et al (2018) Frankenstein: learning deep face representations using small data. IEEE Trans Image Process 27:293–303
  23. Olisah CC (2016) Minimizing separability: a comparative analysis of illumination compensation techniques in face recognition. Int J Inf Technol Comput Sci 9:40–51. https://doi.org/10.5815/ijitcs
  24. Xie X, Zheng WS, Lai J, Yuen PC, Suen CY (2011) Normalization of face illumination based on large- and small-scale features. IEEE Trans Image Process 20(7):1807–1821
  25. Paulin M (2014) Transformation pursuit for image classification. In: Proceedings of the IEEE conference on computer vision and pattern recognition, pp 3646–3653
  26. Huang G, Mattar M, Lee H, Learned-Miller EG (2012) Learning to align from scratch. In: Advances in neural information processing systems, pp 764–772
  27. Huang G, Mattar M, Lee H, Learned-Miller EG (2008) Labeled faces in the wild: a database for studying face recognition in unconstrained environments. In: Workshop on faces in 'Real-Life' images: detection, alignment, and recognition
  28. Patricia N, Caputo B (2014) Learning to learn, from transfer learning to domain adaptation: a unifying perspective. In: Proceedings of the IEEE conference on computer vision and pattern recognition, pp 1442–1449
  29. Tan C (2018) A survey on deep transfer learning. In: International conference on artificial neural networks, pp 270–279
  30. Szegedy C (2015) Going deeper with convolutions. In: Proceedings of the IEEE conference on computer vision and pattern recognition, pp 1–9
  31. Deng J (2009) ImageNet: a large-scale hierarchical image database. In: Proceedings computer vision and pattern recognition, pp 248–255
  32. Oquab M, Bottou L, Laptev I, Sivic J (2014) Learning and transferring mid-level image representations using convolutional neural networks. In: Proceedings of the IEEE conference on computer vision and pattern recognition, pp 1717–1724
  33. Yosinski J, Clune J, Bengio Y, Lipson H (2014) How transferable are features in deep neural networks? In: Advances in neural information processing systems, pp 3320–3328
  34. Akcay S, Kundegorski ME, Willcocks CG, Breckon TP (2018) Using deep convolutional neural network architectures for object classification and detection within X-ray baggage security imagery. IEEE Trans Inf Forensics Secur 13:2203–2215
  35. Xia X, Xu C, Nan B (2017) Inception-v3 for flower classification. In: International conference on image, vision and computing, pp 783–787
  36. Taigman Y, Yang M, Ranzato M, Wolf L (2014) DeepFace: closing the gap to human-level performance in face verification. In: CVPR, pp 1701–1708
  37. Parkhi OM, Vedaldi A, Zisserman A (2015) Deep face recognition. BMVC 1:6–17
  38. Liu W, Wen Y, Yu Z, Yang M (2016) Large-margin softmax loss for convolutional neural networks. In: ICML, pp 507–516
  39. Liu W, Wen Y, Yu Z, Li M, Raj B, Song L (2017) SphereFace: deep hypersphere embedding for face recognition. In: CVPR, pp 212–220
  40. Zheng Y, Pal DK, Savvides M (2018) Ring loss: convex feature normalization for face recognition. In: CVPR, pp 5089–5097

Copyright information

© The Author(s) 2019

Open Access This article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made.

Authors and Affiliations

  1. Baze University, Abuja, Nigeria
  2. The University of the West of England, Bristol, UK
