1 Introduction

Artificial intelligence has been developing rapidly, with many real-world applications such as time series prediction [24], image classification [40, 46], and smart cities [28]. Among them, personal identification using biometric traits has become a prominent trend and has received increasing attention in the computer vision community. Among biometric characteristics, the face can be easily captured by a camera in a non-invasive acquisition process. Therefore, face recognition can be widely applied in public environments such as video surveillance, criminal identification, access control systems, mobile device security, etc. [9]. Although diverse methods for face recognition have been introduced [18, 26, 36], they still have shortcomings; for this reason, face recognition remains a challenging topic. Figure 1 shows several face recognition challenges, such as facial expression, head pose, illumination, and background complexity. There are also other difficulties, including occlusion, aging, makeup, and image quality. These challenges are formidable to handle well.

Fig. 1 Representative challenges in face recognition: (a) illumination variations, (b) head pose/viewpoint variations, (c) facial occlusion, (d) facial expressions

A face recognition application typically consists of face detection, feature extraction, and classification. The feature extraction stage plays a vital role: the system will fail to achieve decent results when the employed feature descriptor is not adequate. Indeed, most well-known methods rely on feature descriptors that are highly discriminative and robust to extrinsic changes. In recent years, most face recognition algorithms, which have been studied extensively to obtain robust and discriminative descriptors, follow three primary approaches: holistic, local, and hybrid models [23]. The holistic approach exploits the entire face and projects it into a small subspace, such as Eigenfaces in manifold space [45] and Fisherfaces [16, 33]. The local approach considers certain facial features, such as Speeded-Up Robust Features (SURF) [17] and Local Binary Patterns (LBP) [22]. In the hybrid approach, local information is combined with holistic information to enrich the feature descriptors for performance improvements: the fusion of 54 Gabor functions and fuzzy logic for facial expression recognition [15], two color local descriptors called Color ZigZag Binary Pattern (CZZBP) [19], or a fusion of deep features [12].

Thanks to their low computational cost and efficient feature extraction capability, LBP-based methods have been studied and widely applied to many tasks such as face recognition, facial expression classification, and texture classification. A large number of LBP variants and hybrid models based on LBPs have been introduced [1, 36] for face recognition. However, they still have some drawbacks, such as noise sensitivity, loss of contrast information, or sensitivity to illumination variation. This paper proposes a weighting statistical binary pattern framework that improves the local descriptor in terms of discriminative power and robustness against noise and illumination variation.

This work extends our prior efforts, in which we considered neighborhoods in a straight-line topology [44], to utilize more useful information for local feature descriptors through statistical binary patterns [14, 37]. In this way, the proposed framework first considers two statistical moments (mean and variance) to eliminate noise and obtain complementary information. Then, the proposed LBP variant is applied to the first-moment image for LBP representations. The second-moment image serves as a complementary component for building the weighted histogram that incorporates the contribution of each pattern. This framework can thus enrich local descriptors by utilizing both moments without increasing the fused histogram dimension. The present study addresses prior shortcomings and proposes an upgraded descriptor for face recognition. The contributions of this work are as follows:

  • We present a straight-line topology approach with LBP by direction (known as LBPα), which is robust against several visual challenges, such as noise, illumination, and facial expressions, as a base foundation.

  • Then, we propose a novel complementary LBP variant (known as CLBPα), which is inspired by the local difference magnitude-sign transform to complement information for the local descriptor.

  • To extract more robust descriptors from the salient information in statistical moments, we propose a fused histogram of CLBPα, which is constructed using WSBPα to obtain enriched features.

  • A comprehensive evaluation of six public datasets suggests that our proposed framework outperforms the state-of-the-art methods.

The paper is organized as follows. Section 2 reviews related work and background on LBP. Section 3 details the proposed framework. In Section 4, we describe the implementation and the parameter settings used for the evaluations. Experimental results are interpreted in Section 5. Section 6 discusses the proposed framework, and Section 7 presents our conclusions and future work.

2 Related works

Many methods based on the basic LBP descriptor, which encodes the local appearance through the relation between neighborhoods, have been introduced. However, there exist several shortcomings, such as local information loss or sensitivity to noise. Diverse LBP variants with new neighborhood topologies or encoding operators have been proposed to address these shortcomings, such as Dominant Rotated Local Binary Patterns (DRLBP) [32] and Enhanced Line Local Binary Pattern (EL-LBP) [44].

Recently, several hybrid models based on LBP-like descriptors for face analysis have been examined and proved to have highly discriminative power [22]. Lin et al. [27] proposed a fast algorithm, called the LBP edge-mapped descriptor, which fuses LBP and SIFT using the maxima of gradient magnitude points in the image to delineate facial contours for face recognition. Ding et al. [11] introduced the Dual-Cross Patterns (DCPs) as a core algorithm to extract facial features at both the holistic and component levels of a human face, and then applied the first derivative of Gaussian to eliminate differences in illumination. The Multi-scale Block Local Multiple Patterns (MB-LMP) [49] exploited multiple feature maps based on a modified Weber's ratio, then fused the histograms of non-overlapping patches for more robust features. Kas et al. [21] addressed shortcomings of previous LBPs and proposed Mixed Neighborhood Topology Cross Decoded Patterns (MNTCDP) by considering multi-radial and multi-orientation information simultaneously to exploit the relationship between the reference point and its neighbors in each 5 × 5 pixel block. Inspired by LBP-like descriptors in face recognition, Shu et al. [43] proposed Equilibrium Difference LBP (ED-LBP) in multiple color channels (RGB, HSV, YCbCr), accompanied by an SVM classifier, for face spoofing detection. Unlike the traditional LBP circle, the Local Diagonal Extrema Number Pattern (LDENP) [42] descriptor only encoded information within the local diagonal neighbors using first-order local diagonal derivatives to obtain a compact description for face recognition. Deng et al. [10] proposed an accurate face recognition method by exploiting compressive binary patterns (CBP) on a set of the first six random-field eigenfilters, which reduced the bit error rate of LBP-like descriptors and were more robust against additive Gaussian noise. Another LBP-inspired approach, the Local Gradient Hexa Pattern (LGHP) [6], encoded information by examining neighboring pixels at different distances across different derivative directions and generated descriptors that discriminate well between inter-class facial images. Lu et al. [29] proposed an unsupervised feature learning approach, called Simultaneous Local Binary Feature Learning and Encoding (SLBFLE), to represent face images from raw pixels and to jointly encode a codebook for small regions, yielding highly discriminative descriptors.

The other direction is to utilize more information in the descriptors to overcome local information loss within images. For instance, the Completed LBP technique (CLBP) [14] used the local difference sign-magnitude transform to obtain higher performance. Another improvement of CLBP, the statistical binary patterns model [37], was built on several statistical moments to obtain robust descriptors and improved the performance.

2.1 LBP

LBP was first introduced by Ojala et al. [38]. The LBP feature describes the spatial relationship in an image by encoding the neighboring points of a given central point. Let f be a 2D discrete image in \(\mathbb {Z}^{2}\) space. Then, the LBP encoding of f can be considered as a mapping from \(\mathbb {Z}^{2}\) to \(\{0,1\}^{P}\) (equivalently, to an integer code in \([0, 2^{P}-1]\)):

$$ \text{LBP}_{P,R}(f)(\mathbf{c}) = \sum\limits_{p = 0}^{P-1} s(f(\mathbf{g}_{p}) - f(\mathbf{c}))\, 2^{p}, \quad \text{with } s(x) = \begin{cases} 1, \quad x \geq 0 \\ 0, \quad \text{otherwise} \end{cases} $$
(1)

where \(\mathbf{g}_{p}\) denote the P neighboring points sampled on the circle of radius R centered at the point c, and \(f(\mathbf{g}_{p})\) their intensities.

The dimension of the LBP descriptor can be reduced by considering only its uniform patterns, i.e. patterns whose transition measure U(LBP\(_{P,R}\)) ≤ 2, defined by the following equation:

$$ \mathbf{U}(\text{LBP}_{P,R}) = \sum\limits_{p=1}^{P} |\text{LBP}_{P,R}^{p} - \text{LBP}_{P,R}^{p-1}| $$
(2)

where \(\text {LBP}_{P,R}^{p}\) is the p-th bit of LBP\(_{P,R}\), and \(\text {LBP}_{P,R}^{P} = \text {LBP}_{P,R}^{0}\). \(\text {LBP}_{P,R}^{u2}\) [38] is a very robust and reliable descriptor for face representation and texture classification. The mapping from LBP\(_{P,R}\) to \(\text {LBP}^{u2}_{P,R}\) produces L = P(P − 1) + 3 distinct output values and is implemented by building a lookup table with \(2^{P}\) entries. The local descriptor is then described as follows:

$$ \mathbf{H} = [H_{0}, H_{1}, ..., H_{L-1}]^{T} $$
(3)

where

$$ H_{t} = \sum\limits_{x,y} T\big\{\text{LBP}_{P,R}(x,y) = t\big\}, \quad \text{with } T\big\{ A \big\} = \begin{cases} 1, \quad \text{if } A \text{ is true} \\ 0, \quad \text{otherwise} \end{cases} $$
(4)

in which H\(_{t}\) is the number of occurrences of the t-th \(\text{LBP}^{u2}\) code, with t ∈ [0..L − 1]. The length of the histogram in the uniform LBP representation is therefore L = P(P − 1) + 3.
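For concreteness, the following sketch shows how the uniform-pattern lookup table and the histogram of (2)–(4) can be built. It is a minimal illustration, assuming Python with NumPy, a grayscale array, neighbor coordinates rounded to the nearest pixel (the usual bilinear interpolation is omitted), and border pixels skipped; all names are illustrative.

```python
import numpy as np

def uniform_lbp_histogram(img, P=8, R=1):
    """Minimal sketch of an LBP_{P,R}^{u2} histogram (Eqs. 1-4)."""
    H, W = img.shape
    # Circular neighbor offsets, rounded to the nearest pixel for brevity.
    offsets = [(int(round(R * np.cos(2 * np.pi * p / P))),
                int(round(R * np.sin(2 * np.pi * p / P)))) for p in range(P)]

    def transitions(bits):
        # U(LBP): number of 0/1 changes in the circular bit string (Eq. 2).
        return sum(bits[p] != bits[(p - 1) % P] for p in range(P))

    # Lookup table: each uniform pattern gets its own bin,
    # all non-uniform patterns share the last bin, giving L = P(P-1)+3 bins.
    L = P * (P - 1) + 3
    lut, next_bin = {}, 0
    for code in range(2 ** P):
        bits = [(code >> p) & 1 for p in range(P)]
        if transitions(bits) <= 2:
            lut[code] = next_bin
            next_bin += 1
        else:
            lut[code] = L - 1

    hist = np.zeros(L)
    m = int(np.ceil(R))
    for y in range(m, H - m):
        for x in range(m, W - m):
            c = img[y, x]
            code = 0
            for p, (dx, dy) in enumerate(offsets):
                code |= int(img[y + dy, x + dx] >= c) << p   # Eq. (1)
            hist[lut[code]] += 1                              # Eq. (4)
    return hist
```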

2.2 Completed LBP

Guo et al. [14] considered a local difference sign-magnitude transform and proposed a completed model as a state-of-the-art variant of LBP. The transform \(d_{p} = s_{p} m_{p}\) consists of two components, i.e. signs: \(s_{p} = \text{sign}(d_{p})\) and magnitudes: \(m_{p} = |d_{p}| = |f(\mathbf{g}_{p}) - f(\mathbf{c})|\). Based on these components, three operators, called CLBP-Sign (CLBP_S), CLBP-Magnitude (CLBP_M), and CLBP-Center (CLBP_C), were designed to encode the three features S, M, and C. The first operator, CLBP_S, is the same as the original LBP operator and produces the S component. The M component, which expresses the local variation of magnitude, should be consistent with S and is defined as follows:

$$ \text{CLBP\_M}_{P,R}(f)(\mathbf{c}) = (s(m_{p} - \bar{m}))_{0 \leq p < P} $$
(5)

where \(\bar{m}\) is the mean value of \(m_{p}\) over the whole image. Moreover, the last component C also carries discriminant information. Therefore, the CLBP_C operator is formulated as:

$$ \text{CLBP\_C}(f)(\mathbf{c}) = s(f(\mathbf{c}) - \bar{f}) $$
(6)

where \(\bar{f}\) is set to the mean gray level of the whole image. Because of the complementary relationship between these operators, the completed LBP descriptor turns out to be useful for the texture classification task.
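As an illustration of the sign-magnitude decomposition, a minimal Python sketch is given below; it assumes a grayscale NumPy array and nearest-pixel neighbor sampling, omits the uniform mapping and the CLBP_C operator, and uses illustrative names only.

```python
import numpy as np

def clbp_codes(img, P=8, R=1):
    """Minimal sketch of the CLBP sign/magnitude decomposition [14]."""
    img = img.astype(np.float64)
    H, W = img.shape
    offsets = [(int(round(R * np.cos(2 * np.pi * p / P))),
                int(round(R * np.sin(2 * np.pi * p / P)))) for p in range(P)]

    # Local difference magnitudes m_p for every interior pixel, needed for
    # the global threshold m-bar used by CLBP_M (Eq. 5).
    mags = np.zeros((H - 2 * R, W - 2 * R, P))
    center = img[R:H - R, R:W - R]
    for p, (dx, dy) in enumerate(offsets):
        shifted = img[R + dy:H - R + dy, R + dx:W - R + dx]
        mags[..., p] = np.abs(shifted - center)
    m_bar = mags.mean()                      # mean magnitude over the image

    clbp_s = np.zeros((H - 2 * R, W - 2 * R), dtype=np.int32)
    clbp_m = np.zeros_like(clbp_s)
    for p, (dx, dy) in enumerate(offsets):
        shifted = img[R + dy:H - R + dy, R + dx:W - R + dx]
        s_p = (shifted >= center)            # sign component (CLBP_S)
        m_p = (mags[..., p] >= m_bar)        # magnitude component (CLBP_M)
        clbp_s |= s_p.astype(np.int32) << p
        clbp_m |= m_p.astype(np.int32) << p
    return clbp_s, clbp_m
```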

2.3 Face representation based on LBPs

Face representation based on LBP descriptors was first introduced by Ahonen et al. [1], who analyzed small local regions of the face instead of striving for a holistic facial texture representation. In this local approach, a face image is partitioned into m non-overlapping patches R(j) (j = 1..m), and an LBP operator is applied independently to each patch to produce local histograms. All LBP histograms are then fused into a single vector (also known as the local LBP descriptor) for facial texture representation. Concatenation is a simple and efficient way to build this description. Each LBP histogram H(j) of image patch R(j) is computed by (3). Finally, the global LBP descriptor over all patches R(j) is formulated as follows (T is the transpose operator):

$$ \mathbf{H} = [(\mathbf{H}^{(1)})^{T} (\mathbf{H}^{(2)})^{T} ... (\mathbf{H}^{(m)})^{T}]^{T} $$
(7)

The resulting feature vector has a size of m × n, where n is the length of the LBP histogram for the chosen topology. This approach to face representation is therefore more robust under variations such as pose or illumination. Notably, the small patches within an image can have different sizes or overlap. Many face recognition works have followed this local approach and obtained significant LBP variants [5, 42, 47, 49].
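A minimal sketch of the block-based descriptor of (7) is shown below, assuming that a per-pixel code image (LBP or a variant, already mapped to n_bins values) is available; the grid size and bin count are illustrative choices, not fixed settings of the paper.

```python
import numpy as np

def block_lbp_descriptor(code_img, grid=(7, 7), n_bins=59):
    """Minimal sketch of the block-based face descriptor of Eq. (7)."""
    H, W = code_img.shape
    gy, gx = grid
    hists = []
    for j in range(gy):
        for i in range(gx):
            # Non-overlapping patch R(j); unequal border remainders are ignored.
            patch = code_img[j * H // gy:(j + 1) * H // gy,
                             i * W // gx:(i + 1) * W // gx]
            h, _ = np.histogram(patch, bins=n_bins, range=(0, n_bins))
            hists.append(h)
    return np.concatenate(hists)   # length m x n: m patches of n bins each
```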

2.4 Statistical moment images

Since we define that f is a 2D discrete image in \(\mathbb {Z}^{2}\) space, we can obtain a real-valued image in \(\mathbb {R}\) by a mapping technique. The spatial support, which is employed to compute the local statistics, is modeled as \({\mathscr{B}} \subset \mathbb {Z}^{2}\), such that \(\mathcal {O} \in {\mathscr{B}}\), where \(\mathcal {O}\) is the origin of \(\mathbb {Z}^{2}\) [37]. Figure 2 illustrates how to construct a spatial support \({\mathscr{B}}\).

Fig. 2 Illustration of the spatial support with an example \({\mathscr{B}} = \{(1, 4); (2, 8)\}\), designed as a collection of neighbor points sampled on different circles having the same center

The r-order moment image associated to f and \({\mathscr{B}}\) is also a mapping from \(\mathbb {Z}^{2}\) to \(\mathbb {R}\), defined as

$$ m_{(f, \mathcal{B})}^{r}(\mathbf{c}) = \frac{1}{|\mathcal{B}|} \underset{\textbf{b} \in \mathcal{B}}{\sum} (f(\textbf{c} + \textbf{b}))^{r} $$
(8)

where c is a pixel from \(\mathbb {Z}^{2}\), and \(|{\mathscr{B}}|\) is the cardinality of the structuring element \({\mathscr{B}}\). Accordingly, the r-order centered moment image (r > 1) is defined as

$$ \mu_{(f, \mathcal{B})}^{r}(\mathbf{c}) = \frac{1}{|\mathcal{B}|} \underset{\textbf{b} \in \mathcal{B}}{\sum} (f(\textbf{c} + \textbf{b}) - m_{(f, \mathcal{B})}^{1}(\mathbf{\textbf{c}}))^{r} $$
(9)

where \(m_{(f, {\mathscr{B}})}^{1}(\mathbf {\textbf {c}})\) is the average value (1-order moment) calculated around c. Finally, the r-order normalized centered moment image (r > 2) is defined as

$$ {\upbeta}_{(f, \mathcal{B})}^{r}(\mathbf{c}) = \frac{1}{|\mathcal{B}|} \underset{\textbf{b} \in \mathcal{B}}{\sum}\left(\frac{f(\textbf{c} + \textbf{b}) - m_{(f, \mathcal{B})}^{1}(\textbf{c})}{\sqrt{\mu_{(f, \mathcal{B})}^{2}(\mathbf{c})}}\right)^{r} $$
(10)

where \(\mu _{(f, {\mathscr{B}})}^{2}(\mathbf {c})\) is the variance (2-order centered moment) calculated around c.
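The moment images of (8) and (9) can be computed with simple shifted-array averaging, as in the following sketch; it assumes nearest-pixel sampling of the support \({\mathscr{B}} = \{(R_i, P_i)\}\) plus the origin, and edge padding at the borders (an implementation assumption).

```python
import numpy as np

def moment_images(img, support=((1, 8),)):
    """Minimal sketch of the mean (m^1, Eq. 8) and variance (mu^2, Eq. 9)
    moment images over the spatial support B = {(R_i, P_i)}."""
    img = img.astype(np.float64)
    offsets = [(0, 0)]                         # the origin O belongs to B
    for R, P in support:
        offsets += [(int(round(R * np.cos(2 * np.pi * p / P))),
                     int(round(R * np.sin(2 * np.pi * p / P)))) for p in range(P)]

    pad = max(max(abs(dx), abs(dy)) for dx, dy in offsets)
    padded = np.pad(img, pad, mode='edge')
    H, W = img.shape

    # One shifted copy of the image per point of the support.
    stack = np.stack([padded[pad + dy:pad + dy + H, pad + dx:pad + dx + W]
                      for dx, dy in offsets])
    m1 = stack.mean(axis=0)                    # 1st-order moment image
    mu2 = ((stack - m1) ** 2).mean(axis=0)     # 2nd-order centered moment
    return m1, mu2
```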

3 Weighted Statistical Binary Patterns by direction α (WSBPα)

We propose the Weighted Statistical Binary Patterns by direction α (WSBPα) descriptor to enhance the discriminative capability of LBPs for face recognition while reducing their sensitivity to representative challenges, such as facial expressions, noise, or illumination. The descriptor encodes spatial information in a set of local statistical moment images and maps this coding to uniform \(\text{LBP}^{u2}\) patterns to produce a more compact descriptor. Because of their complementary and consistent characteristics, two crucial components, CLBP_Sα and CLBP_Mα (here, α is a given direction), are computed on a mean image and weighted by a variance image to improve the performance. The details are given as follows.

3.1 Local Binary Patterns by direction (LBPα)

In the original LBP and several variants, the neighbors \(\mathbf{g}_{p}\) have coordinates \((R\cos \limits (2\pi p/P), R\sin \limits (2\pi p/P))\) lying on a circle of radius R. In the proposed LBPα, we instead consider the relationship between pixels along a straight-line topology in a direction α, with the coordinate of c taken as (0,0). The neighbors in the straight-line topology are defined as follows:

$$ \mathbf{g}^{p}_{\alpha} = \left(\frac{2pR\cos\alpha}{P}, \frac{2pR\sin\alpha}{P}\right)_{-P/2 \leq p \leq P/2,\; p \neq 0} $$
(11)

When considering a line topology, the number of neighbors should be even, and the neighbors are bilaterally symmetric about the central point c. Figure 3 illustrates four LBP\(_{\alpha _{i}}\) operators, each considering 6 neighbors along a line topology.

Fig. 3 An example of several Local Binary Patterns by directions {αi} = {0°, 45°, 90°, 135°}, where {linei} = {yellow, orange, green, blue}, respectively

Similar to the traditional LBP in (1), we encode an image with the LBPα operator, which can be expressed as follows:

$$ \text{LBP}_{\alpha (P, R)}(f)(\mathbf{c}) = {\sum}_{p = 0}^{P-1} s(f(\mathbf{g}_{\alpha}^{p}) - f(\mathbf{c}))\, 2^{p}, \quad \text{with } s(x) = \begin{cases} 1, \quad x \geq 0 \\ 0, \quad \text{otherwise} \end{cases} $$
(12)

where \(\mathbf {g}_{\alpha }^{p}\) denotes the P neighbors defined in (11), re-indexed from 0 to P − 1, and the remaining variables f, c, P, and R are defined as in (1). The LBPα operator produces \(2^{P}\) distinct patterns, which leads to a huge descriptor. Inspired by the uniform LBP principle in Section 2.1, we reduce the number of patterns by applying the uniform-pattern concept to LBPα. After this process, the LBPα "uniform patterns" have P(P − 1) + 3 distinct output values, obtained from a lookup table with \(2^{P}\) entries.

The main difference between the circular LBP and LBPα is that LBP considers the spatial relationship on a circle, whereas LBPα exploits spatial information along a straight line of neighbors in a given direction. Although primary factors such as the direction of exposure, illumination, and facial expressions pose challenges in face recognition, the LBPα-based representation turns out to be robust against changes of illumination and scale, since it examines micro-patterns in a line topology. Moreover, by inheriting the advantages of the traditional LBP, the proposed LBPα can characterize the distribution of local pixels along a direction, and the frequency of occurrence of LBPα values can be used to represent various facial structures.
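To make the line topology of (11) and the encoding of (12) concrete, the sketch below generates the neighbor offsets for a direction α and encodes each interior pixel. Following the bilateral-symmetry description, the index p = 0 (the center itself) is skipped so that exactly P neighbors remain, and coordinates are rounded to integer pixels; both are implementation assumptions, and all names are illustrative.

```python
import numpy as np

def line_neighbors(P=6, R=3, alpha_deg=45):
    """Minimal sketch of the straight-line neighbor set g_alpha^p of Eq. (11)."""
    alpha = np.deg2rad(alpha_deg)
    coords = []
    for p in range(-P // 2, P // 2 + 1):
        if p == 0:                      # the center c itself is not a neighbor
            continue
        coords.append((int(round(2 * p * R * np.cos(alpha) / P)),
                       int(round(2 * p * R * np.sin(alpha) / P))))
    return coords

def lbp_alpha(img, P=6, R=3, alpha_deg=0):
    """Encodes each interior pixel with the LBP_alpha code of Eq. (12)."""
    img = img.astype(np.float64)
    H, W = img.shape
    offs = line_neighbors(P, R, alpha_deg)
    center = img[R:H - R, R:W - R]
    code = np.zeros((H - 2 * R, W - 2 * R), dtype=np.int32)
    for i, (dx, dy) in enumerate(offs):
        shifted = img[R + dy:H - R + dy, R + dx:W - R + dx]
        code |= (shifted >= center).astype(np.int32) << i
    return code
```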

3.2 Complementary Local Binary Patterns by direction α (CLBPα)

The CLBP [14] has been used for texture classification by combining the three operators CLBP_S, CLBP_M, and CLBP_C in a joint or hybrid way. Similar to CLBP, we propose Complementary Local Binary Patterns by direction α (CLBPα), which considers the neighbors \(\mathbf {g}_{\alpha }^{p}\) along a direction α for the face recognition task. The proposed CLBPα consists of two operators: CLBPα-Sign (CLBP_Sα) and CLBPα-Magnitude (CLBP_Mα). The CLBP_Sα operator is the proposed LBPα described in (12) and captures the structure of the image f through the local sign relationship, whereas CLBP_Mα encodes the local difference magnitude as complementary information in a format consistent with that of CLBP_Sα. This operator is defined as follows:

$$ \begin{array}{@{}rcl@{}} \text{CLBP\_M}_{\alpha (P, R)}(f)(\mathbf{c}) &=& (s(m_{\alpha}^{p} - \bar{m}_{\alpha}))_{0 \leq p < P} \text{, } \\ m_{\alpha}^{p} &=& |d_{p}| = |f(\mathbf{g}_{\alpha}^{p}) - f(\mathbf{c})| \end{array} $$
(13)

where \(\bar{m}_{\alpha}\) is the mean value of \(m_{\alpha }^{p}\) over the whole discrete image f. Each component S and M has P(P − 1) + 3 distinct values corresponding to the "uniform" LBPα coding of the discrete image f. Following the construction of CLBP descriptors [14], there are two ways to combine the components into an enhanced descriptor. The first descriptor, CLBP_S/Mα, forms a joint 2D histogram from the CLBP_Sα and CLBP_Mα codes and has [P(P − 1) + 3]² values. The second descriptor, CLBP_S_Mα, concatenates the two histograms and has 2[P(P − 1) + 3] values. The distribution of the first one can become too sparse when the dimension (i.e., the number of neighbors P) increases, whereas the marginal histogram of the second one retains a reasonable size of 2[P(P − 1) + 3]. As a trade-off between performance and computational cost, the marginal histogram approach is used in our experiments. Note that the component C, which expresses the local gray level of the image, is ignored in our proposed model. The proposed CLBP_S_Mα provides a more reliable and expressive facial feature representation.
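Given per-pixel CLBP_Sα and CLBP_Mα codes and the uniform lookup table of Section 2.1 (here assumed to be a NumPy integer array of length 2**P mapping each raw code to one of the L = P(P − 1) + 3 bins), the marginal (concatenated) histogram CLBP_S_Mα can be formed as in this short sketch; inputs and names are illustrative.

```python
import numpy as np

def clbp_s_m_alpha_hist(s_codes, m_codes, uniform_lut, L):
    """Minimal sketch of the marginal (concatenated) CLBP_S_M_alpha histogram."""
    hs = np.bincount(uniform_lut[s_codes.ravel()], minlength=L)   # S histogram
    hm = np.bincount(uniform_lut[m_codes.ravel()], minlength=L)   # M histogram
    return np.concatenate([hs, hm])          # length 2[P(P-1)+3]
```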

3.3 Weighted Statistical CLBP by directions α i (WSBP\(_{\alpha _{i}}\))

Statistical Binary Patterns (SBP) [37] introduced statistical moments into an LBP-based operator. The first-order moment, the mean m1, captures the contribution of individual pixel intensities over the entire image, while the second-order moment, the variance μ2, measures how each pixel varies from its neighboring pixels and highlights salient regions in an image. Our proposed WSBP builds a novel histogram from CLBP\(_{\alpha _{i}}\) descriptors by computing the CLBP\(_{\alpha _{i}}\) image on the first-order moment m1 and counting the occurrences of every pattern on that CLBP\(_{\alpha _{i}}\) image with a significance index corresponding to the salient regions in the new second-order moment \(\mu ^{\prime }_{2}\). The proposed descriptor can thus suppress noise, illumination effects, and near-uniform regions. Figure 4 illustrates the flow diagram of our WSBP\(_{\alpha _{i}}\) descriptors. On the mean image m1, the spatial relationship between local structures is represented using the CLBP\(_{\alpha _{i}}\) operator to obtain the two essential components S\(_{\alpha _{i}}\) and M\(_{\alpha _{i}}\). Then, each component obtained by the CLBP\(_{\alpha _{i}}\) operator is weighted by the contribution of every local pattern according to the new variance image \(\mu ^{\prime }_{2}\) to form the weighted histogram.

Fig. 4 The flow diagram of our WSBP\(_{\alpha _{i}}\) descriptors. The input image is initially decomposed into mean (m1) and variance (μ2) moments. A new variance moment (\(\mu ^{\prime }_{2}\)), which contains distinctive facial features, is prepared by extracting the k-th root. Then, once the Sign and Magnitude components along four different directions are constructed from the mean moment, a weighting according to the new variance is applied to each component. Finally, the weighted histograms of the Sign and Magnitude components are concatenated to build the weighted CLBP histogram

Let H be the histogram vector of each component, and (x,y) be the location of a pixel in each component of the CLBP\(_{\alpha _{i} (P, R)}\) image. The histogram of each component is then built from the contribution of every location (pixel) in the new variance moment \(\mu ^{\prime }_{2}\). The occurrence of every CLBP\(_{\alpha _{i} (P, R)}\) code t, originally counted as in (4), is now weighted as follows:

$$ H_{t} = \begin{cases} \underset{\forall (x,y)}{\sum} \mu^{\prime}_{2}(x,y), \text{ if } \text{CLBP}_{\alpha_{i} (P, R)}(x,y) = t \\ 0, \text{ otherwise} \end{cases} $$
(14)

The SBP descriptor [37] produces enhanced descriptors but assigns the same weight to all patterns, ignoring their significance. In this paper, the WSBP\(_{\alpha _{i}}\) descriptors capture the local relationships within images through the mean moment, and exploit contrast and gradient magnitude information through the variance moment to enhance the description of local relationships. Equation (14) describes how the occurrence of every pixel is weighted by the contribution of the corresponding pixel in the new variance moment \(\mu ^{\prime }_{2}\). The histogram of each component S\(_{\alpha _{i}}\) and M\(_{\alpha _{i}}\) has P(P − 1) + 3 values, so the dimensionality of the WSBP\(_{\alpha _{i}}\) descriptor is 2[P(P − 1) + 3] after concatenating the histograms. As a result, the WSBP\(_{\alpha _{i}}\) descriptor is not only compact but also robust to noise, illumination, and other variations.
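The weighted counting of (14) amounts to a histogram in which each occurrence contributes its \(\mu ^{\prime }_{2}\) value instead of a unit count; a minimal sketch, assuming NumPy, aligned code/weight images, and codes already mapped to uniform bins, is:

```python
import numpy as np

def weighted_histogram(code_img, weight_img, n_bins):
    """Minimal sketch of the weighted occurrence counting of Eq. (14):
    every pixel adds its value in the new variance moment mu'_2."""
    return np.bincount(code_img.ravel(), weights=weight_img.ravel(),
                       minlength=n_bins)
```

For a WSBP\(_{\alpha _{i}}\) descriptor, this function would be applied to the S and M code images computed on m1, with \(\mu ^{\prime }_{2}\) as the weight image, and the two resulting histograms concatenated.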

3.4 The computational complexity

In this section, we address the computational complexity of the WSBP descriptor for an input image of size N × N. Suppose that the pre-defined spatial support \({\mathscr{B}}\) is \(\{(R_{1},P_{1}), (R_{2},P_{2})\}\) and that WSBPα is calculated by considering P neighbors. The computational complexity of the WSBP descriptor depends on the following factors.

  • Construction of moment images: At each pixel, the mean value can be obtained in O(P1 + P2) operations, while the variance value requires O((P1 + P2)²) operations. Therefore, the construction of the moment images can be done in O((P1 + P2)²N²) = O(N²).

  • Construction of CLBPα: CLBPα consists of the two components CLBP_Sα and CLBP_Mα. The first one is calculated in O(PN²), and the second one has the same complexity of O(PN²). As a result, the complexity of CLBPα is O(2PN²) = O(N²).

  • Construction of WSBPα: WSBPα computes CLBPα on the mean image and uses the variance image to construct the weighted histogram. As mentioned above, each component can be computed in O(N²).

Therefore, the computational complexity of WSBP is O(N²). WSBP evidently requires more computation than LBP, but both have the same computational complexity order. This guarantees that our operator remains competitive with non-LBP methods in terms of computation time.

4 Implementation

In this section, we detail the configuration of the WSBP descriptor.

4.1 The fusion of different descriptors WSBP\(_{\alpha _{i}}\)

If WSBPα considered only one direction α, it could lead to an inadequate description, because such a descriptor would exploit the local relationship along that direction only. Our aim is for the descriptor to utilize all useful surrounding features. Inspired by LBP operators in a circular topology (with a scale of (P, R) = (8, 1)), we propose to consider at least four directions for the fused histogram, αi ∈ {0°, 45°, 90°, 135°} (see Section 5). Figure 5 shows the S and M components of CLBP at the four directions {αi} as four views of a given image. The fusion of these four views yields an adequate descriptor for recognizing faces under illumination or head pose variations. Such a WSBP can be expressed as follows:

  • WSBP = WSBP\(_{\alpha _{1}}\)_WSBP\(_{\alpha _{2}}\)_WSBP\(_{\alpha _{3}}\)_WSBP\(_{\alpha _{4}}\)

Fig. 5 Illustration of CLBP\(_{\alpha _{i}}\) for components S (a, b, c, d) and M (e, f, g, h) with four different directions {0°, 45°, 90°, 135°}, respectively. Each CLBP\(_{\alpha _{i}}\) operator, consisting of two components, is computed on the mean moment with a structuring element \({\mathscr{B}}=\{(1,8)\}\)

4.2 Moment parameters

For a successful implementation of our descriptor, a proper parameter setting has to be made. As a pre-processing step, the mean (m1) and variance (μ2) moments, obtained by computing statistics over the spatial support \({\mathscr{B}}\), are used to reduce noise sensitivity. Thus, the moment parameters should be set optimally for this purpose.

We define the structuring element as a circular spatial support \({\mathscr{B}} = \{(R_{i},P_{i})\}\), where Pi is the number of neighbors sampled on a circle of radius Ri. Figure 6 shows an example of the two moment images using \({\mathscr{B}} = \{(1,8)\}\). Because the second-order moment (variance moment) tends to emphasize only dominant edges, some potentially important information could be discarded. To handle this problem, we propose to extract the k-th root of the variance moment, \(\mu ^{\prime }_{2} = \sqrt [k]{\mu _{2}}\) (k ∈ [2,16]). For example, Fig. 6e shows the new variance moment (\(\mu ^{\prime }_{2}\)) built by extracting the 9-th root of the original one. In this way, useful facial features, such as the eyes, nose, and mouth, are enhanced as salient regions, and the weighted histogram can emphasize these essential areas by exploiting the contribution of every statistical pattern in the variance image. In the next section, we show how \(\mu ^{\prime }_{2} = \sqrt [9]{\mu _{2}}\) under the structuring element \({\mathscr{B}} = \{(1,6)\}\) makes a substantial difference through a series of experiments on six public face datasets.
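The k-th-root enhancement itself is a one-line operation; the sketch below (NumPy, with k = 9 as in our setting) makes the compressive mapping explicit: it flattens the dynamic range of \(\mu _{2}\) so that weaker but informative variations keep a non-negligible weight.

```python
import numpy as np

def enhanced_variance(mu2, k=9):
    """Minimal sketch of the k-th-root enhancement mu'_2 = mu_2^(1/k), k in [2, 16]."""
    return np.power(mu2, 1.0 / k)
```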

Fig. 6 Illustration of (c) the mean (m1) and (d) variance (μ2) moments for (a) the given input using (b) the structuring element \({\mathscr{B}} = \{(1,8)\}\). (e) A new variance (\(\mu ^{\prime }_{2}\)) obtained by extracting the 9-th root of μ2

5 Experiments

This section describes experiments with six face datasets: ORL, YALE, AR, Caltech, FERET, and KDEF. Our statistical feature descriptors were computed with the algorithm described above. The following features, each obtained as the concatenation of CLBP\(_{\alpha _{i}}\) operators over 4 directions, were used in our experiments:

  • CLBP_S(m1) = CLBP_S0_CLBP_S45_CLBP_S90 _CLBP_S135

  • CLBP_M(m1) = CLBP_M0_CLBP_M45_CLBP_M90 _CLBP_M135

  • CLBP_S(\(m_{1},\mu ^{\prime }_{2}\)) = CLBP_S(m1)_CLBP_S(\(\mu ^{\prime }_{2}\))

  • CLBP_M(\(m_{1},\mu ^{\prime }_{2}\)) = CLBP_M(m1)_CLBP_M(\(\mu ^{\prime }_{2}\))

  • CLBP_S_M(m1) = CLBP_S(m1)_CLBP_M(m1)

  • CLBP_S_M(\(m_{1},\mu ^{\prime }_{2})=\) CLBP_S_M(m1) _CLBP_S_M(\(\mu ^{\prime }_{2}\))

  • WSBP_S, WSBP_M, and WSBP were weighted-statistical CLBPs applied to the S, M, and fused S and M components as described in Section 3, respectively. Note that each descriptor concatenated the histograms of CLBP\(_{\alpha _{i}}\) over the 4 directions {αi} = {0°, 45°, 90°, 135°}.

The fusion of different directions and components (S, M, m1, \(\mu ^{\prime }_{2}\)) leads to a very long descriptor as the concatenation of histograms. To handle this problem, Principal Component Analysis (PCA), retaining 95% of the cumulative sum of eigenvalues, was adopted for dimension reduction. For the classification task, linear SVMs were used.
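A sketch of this dimension-reduction and classification stage is given below, assuming scikit-learn; the feature matrix X is the set of WSBP descriptors from Section 3, and the feature standardization step is a common practice rather than a requirement of the paper.

```python
# Sketch of the PCA (95% cumulative variance) + linear SVM stage.
from sklearn.decomposition import PCA
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import LinearSVC

def build_classifier():
    # PCA keeps enough components to explain 95% of the variance,
    # matching the cumulative-eigenvalue criterion used here.
    return make_pipeline(StandardScaler(),
                         PCA(n_components=0.95, svd_solver='full'),
                         LinearSVC())

# Hypothetical usage with precomputed WSBP descriptors X and labels y:
# clf = build_classifier(); clf.fit(X_train, y_train); acc = clf.score(X_test, y_test)
```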

5.1 Databases and experimental protocols

The ORL dataset

This dataset had 40 subjects, and 10 different gray-scale images of size 92 × 112 were collected from each subject. All ORL images were collected under various conditions such as facial expression, illumination changes, and occlusion (sunglasses); see Fig. 7.

Fig. 7 The ORL dataset samples, decomposed into pairs of moment images (mean m1, variance \(\mu ^{\prime }_{2}\)) for each two-row pair

The YALE Face dataset

This dataset included 165 images from 15 individuals, with 11 different images of size 243 × 320 collected from each subject. The dataset covered various expressions and lighting conditions; see Fig. 8.

Fig. 8 The YALE dataset samples, decomposed into pairs of moment images (mean m1, variance \(\mu ^{\prime }_{2}\)) for each two-row pair

The Caltech 1999 dataset

This dataset, produced by the California Institute of Technology, had 447 images from 26 persons; the number of images per person varied, and the images were collected against unconstrained backgrounds. The dataset covered various conditions such as different expressions, illuminations, and occlusions; see Fig. 9.

Fig. 9 The Caltech 1999 dataset samples, decomposed into pairs of moment images (mean m1, variance \(\mu ^{\prime }_{2}\)) for each two-row pair

The KDEF dataset

This dataset [3] is a set of 4900 photographs of facial expressions. It covers 70 persons (35 males and 35 females) displaying seven facial expressions under five different viewing angles. For the present evaluation, only the frontal view of each facial expression was considered. This subset contained 490 color images of 70 individuals, as shown in Fig. 10, where each subject expresses seven different emotions.

Fig. 10 The KDEF dataset samples, decomposed into pairs of moment images (mean m1, variance \(\mu ^{\prime }_{2}\)) for each two-row pair

The AR dataset

This dataset [31] had 3016 face images of 116 persons (63 men and 53 women), each having 26 color images (768 × 576) under severe illumination conditions (left light, right light, or all side lights), 7 basic emotions (happy, sad, neutral, sleepy, anger, surprised, and wink), head poses, and occlusion (sunglasses and scarves). Figure 11 shows several examples in which the original color images were converted to gray-scale and decomposed into mean and variance moment images. For this dataset, we conducted two experiments for a comprehensive evaluation. Since the images are in color, the first experiment used gray-scale images following the same protocol as the other datasets, while the second was carried out on a color channel to offer additional perspectives.

Fig. 11 The AR dataset samples, decomposed into pairs of moment images (mean m1, variance \(\mu ^{\prime }_{2}\)) for each two-row pair

The FERET dataset

This dataset [41], collected in 15 sessions over four years, is a large benchmark used extensively for comparison. It comprises a total of 14,126 images from 1199 individuals. The subset adopted for our evaluations had 1400 images of 200 subjects (7 images per person), including variations in pose, expression, and illumination. Figure 12 shows images of the 7 conditions for each person.

Fig. 12 The FERET dataset samples, decomposed into pairs of moment images (mean m1, variance \(\mu ^{\prime }_{2}\)) for each two-row pair

5.2 Results with the ORL and YALE datasets

For the ORL dataset, Ntrain training images per subject were randomly selected (Ntrain = 2, 4, 5, 8), while the remaining (10 − Ntrain) images were used for testing. For the YALE dataset, Ntrain training images per subject were randomly selected (Ntrain = 2, 4, 6, 8), and the remaining (11 − Ntrain) images were used for testing. The measurements were repeated 100 times by shuffling the data. The average classification rates are shown in Tables 1 and 2.

Table 1 Recognition rates for the ORL dataset
Table 2 Recognition rate for the YALE dataset

Tables 1 and 2 summarize our experimental results under the various configurations. First, CLBP_S(m1) and CLBP_M(m1), as well as WSBP_S and WSBP_M, produced similar recognition results. CLBP_S(m1) and WSBP_S played a major role in achieving good results, suggesting that the S component encoded more valuable information from each face image. Second, when utilizing both the mean (m1) and variance (\(\mu ^{\prime }_{2}\)) moments as input for CLBP, CLBP_S(\(m_{1}, \mu ^{\prime }_{2}\)) and CLBP_M(\(m_{1}, \mu ^{\prime }_{2}\)) increased the classification rate considerably compared with using only the first-order (m1) moment, and the fusion of the S and M components further improved the performance. Indeed, CLBP_S_M(m1) produced better results. For instance, when (P, R) = (4, 2), the best results obtained by CLBP_S_M(m1) on the ORL dataset with N = 2, 4, 5, and 8 training images were 86.22%, 96.2%, 98.25%, and 99.45%, respectively. Similarly, the WSBP descriptors reached 87.91%, 96.77%, 98.51%, and 99.41%, respectively, while WSBP and CLBP_S_M(\(m_{1},\mu ^{\prime }_{2}\)) performed similarly. For the YALE dataset (Table 2), our WSBP outperformed the other methods with 91.95%, 97.14%, 98.72%, and 99.44% at the scale of (P, R) = (4, 2).

Our proposed method was also compared with other state-of-the-art methods, as shown in Table 3, which summarizes the technique and recognition rate of each method. The ten methods at the top, including ours, are based on hand-crafted features, and the three remaining ones are based on deep features. Our method achieved 98.51% and 98.72% recognition rates on the ORL and YALE datasets, which were higher than those of the other methods, suggesting that our descriptor was robust against visual challenges such as illumination variation, facial expressions, head poses (multi-orientation), and occlusion.

Table 3 Performance comparison with the ORL and YALE datasets

5.3 Results with Caltech 1999 and KDEF datasets

Since the number of images per class in the Caltech 1999 dataset varied, we did not vary the number of training images per class as in the previous experiments. Here, we randomly chose half of the images in each class as the training set, and the remaining ones were used as the testing set. Table 4 shows our results for three configurations of (P, R), indicating that our WSBP and CLBP_S_M(\(m_{1},\mu ^{\prime }_{2}\)) achieved the highest recognition rates. Table 5 compares our results with a deep learning approach based on Deep Stack Denoising Sparse Autoencoders (DSDSA) [13]. As can be seen from these tables, even using a single small scale (P, R) = (4, 2), our descriptors WSBP (Ours 1) and CLBP_S_M(\(m_{1},\mu ^{\prime }_{2}\)) reached recognition rates of 98.83% and 98.96%, respectively, which exceeded the performance of DSDSA. When CLBP_S_M(\(m_{1},\mu ^{\prime }_{2}\)) was considered at the scale of (P, R) = (6, 3), the performance was 99.03% (Ours 2).

Table 4 Recognition rates for the Caltech dataset
Table 5 Performance comparison with the Caltech dataset

For the KDEF dataset, we conducted the face recognition task by changing the number of training images per person to examine the accuracy for each train/test split. Several numbers of training images Ntrain (Ntrain = 2, 3, 4, 5) were randomly chosen, while the remaining images (7 − Ntrain) were used for testing. The evaluation was repeated 100 times by shuffling the data to obtain the average accuracy. Table 6 shows our results for three configurations of (P, R). Specifically, at the scale of (P, R) = (4, 2), our WSBP descriptor increased the accuracy to 94.11%, 97.87%, 99.07%, and 99.33% with Ntrain = 2, 3, 4, and 5 training images, respectively. Such high performance suggests that our descriptor can effectively deal with visual challenges such as diverse facial expressions, illumination, or occlusions.

Table 6 Recognition rate for the KDEF dataset

5.4 Results with the AR dataset

5.4.1 Evaluation with gray-scale images

We carried out cross-validation for training and testing. Different numbers of training images (Ntrain = 10, 13, 15, 20) were used, with the remaining (26 − Ntrain) images forming the testing sets, so that the testing images were unseen during training. Results over 100 shuffle splits are summarized in Table 7 (recognition rates) and Table 8 (comparison with other methods).

Table 7 Recognition rate for the AR dataset
Table 8 Performance comparison with the AR dataset

Table 7 shows the results of several LBPs obtained with various parameters. As can be seen, CLBP_S(m1) and CLBP_M(m1) provided the baseline results. With the specific parameter (P, R) = (6, 2), the best results obtained by CLBP_S(m1) with Ntrain = 10, 13, 15, 20 were 82.87%, 88.83%, 90.79%, and 95.90%, respectively. Similarly, the results of CLBP_M(m1) were 80.09%, 86.22%, 89.83%, and 95.67%, respectively. However, the recognition rates improved significantly when the complementary S and M components were combined in CLBP_S_M(m1), reaching 98.46%, 98.68%, 99.04%, and 99.93%, respectively. In this case, CLBP_S_M(m1) yielded an improvement of up to 12.46% at Ntrain = 13 compared with CLBP_M(m1) and CLBP_S(m1). Moreover, the CLBP_S_M(\(m_{1},\mu ^{\prime }_{2}\)) and WSBP descriptors further increased the performance, reaching 98.79% and 99.37%. With this parameter, our proposed WSBP framework outperformed CLBP_S_M(m1) by 0.69% and CLBP_S_M(\(m_{1}, \mu ^{\prime }_{2}\)) by 0.58%. Notice that the improvement of WSBP reached 13% compared with the original LBPs.

Table 8 compares our method with others. In terms of recognition rate, ours outperformed the state-of-the-art methods, including both hand-crafted and deep feature techniques. Our WSBP was better than the Multi-resolution dictionary [30] (82.19%), MNTCDP [21] (96.18%), Local Multiple Patterns [49] (98.00%), and even the deep facial features CS [2] (93.99%) by a substantial margin. The remaining algorithms, including EL-LBP [44] (98.27%) and the deep feature FDDL + CNN [39] (98%), were comparable with our descriptors, and yet ours prevailed.

5.4.2 Evaluation with color channel

The motivation of this experiment was to check the behavior of the facial descriptors on a color channel. The experiment was conducted with HSV color images, keeping the other experimental settings the same as in the gray-scale case. First, an RGB color image was converted into an HSV color image. Second, the Hue channel was extracted from the HSV space, called the H image, and fed as input to our experiment. Our descriptors were able to extract the eyes, eyebrows, and mouth from the H image, probably because these areas had distinctive colors. However, the Sign (S) and Magnitude (M) components, computed by CLBPα on m1, could not discriminate the subtle color changes occurring within the facial skin area, as shown in Fig. 13.

Fig. 13 Illustration of resulting images for the H (Hue) and gray-scale images. The upper part contains the mean (m1) and new variance (\(\mu ^{\prime }_{2} = \sqrt [9]{\mu _{2}}\)) images of the H image using the structuring element \({\mathscr{B}} = \{(1, 5); (2, 8)\}\) and its Sign-Magnitude components calculated by CLBP\(_{\alpha _{i}}\) operators. The lower part is similar to the upper one but uses the gray-scale image with the structuring element \({\mathscr{B}} = \{(1, 8)\}\). All resulting images are put together for a comprehensive view

The evaluation with the color channel was conducted with the same protocol settings as in the gray-scale case. Results over 100 shuffle splits are summarized in Table 9. Note that (P, R) = (4, 2) and Ntrain = 13 were the specific parameters chosen in Table 9; the results suggest that S worked better than M in all three cases. For instance, the accuracy of CLBP_S(m1), CLBP_S(\(m_{1},\mu ^{\prime }_{2}\)), and WSBP_S was 49.41%, 93.56%, and 90.28%, respectively, while that of CLBP_M(m1), CLBP_M(\(m_{1},\mu ^{\prime }_{2}\)), and WSBP_M was as low as 24.36%, 55.32%, and 57.06%, respectively, indicating that the combination of S and M impaired the overall accuracy compared with the S-only case. The accuracy of CLBP_S_M(m1), CLBP_S_M(\(m_{1},\mu ^{\prime }_{2}\)), and WSBP reached 33.73%, 82.52%, and 83.05%, respectively.

Table 9 Performance comparison with S, M and S-M components where the structuring element was \({\mathscr{B}} = \{(1, 5);(2, 8)\}\)

CLBPα, which was inspired by CLBP [14], was designed for the gray-scale case, where the M component provides crucial complementary information; it is not very effective at discriminating the color change within the facial skin (see the Magnitude components of the H and gray-scale images in Fig. 13). On the other hand, the S component worked very well on the H image when utilizing the statistical moments (\(m_{1}, \mu ^{\prime }_{2}\)), since its accuracy was comparable with the state-of-the-art methods. For instance, the accuracy of CLBP_S(\(m_{1},\mu ^{\prime }_{2}\)) and WSBP_S reached 93.56% and 90.28%, respectively, while CLBP_S(m1) reached 49.41%; that is, CLBP_S(\(m_{1},\mu ^{\prime }_{2}\)) and WSBP_S were better than CLBP_S(m1) by margins of 44.15% and 40.87%, respectively. These results suggest that our descriptor extracts the spatial relationship of neighboring pixels, rather than simply discriminating the magnitude difference between pixels.

5.5 Results with the FERET dataset

Previous works [10, 29] in the literature performed their experiments with a protocol using only the frontal sets (Fa, Fb, Fc, Duplicate I, and Duplicate II), where Fa, with 1196 images, serves as the gallery and the others as probes. Unlike this protocol, we used a subset created from 1400 images (ba, bd, be, bf, bg, bj, bk), in which each person had two facial expression images, two left-pose images, two right-pose images, and one illumination image. This subset is more challenging than the previous one since it comprises not only frontal faces but also multiple orientations and expressions. Moreover, experimental results on this subset can reflect the accuracy for each train/test portion.

The experiment was carried out with Ntrain randomly selected training images per class (Ntrain = 1, 2, 3, 4, 5, 6) and Ntest testing images (Ntest = 7 − Ntrain), averaged over 100 splits. Table 10 illustrates the results achieved by the CLBP\(_{\alpha _{i}}\) operators at various scales of (P, R). In most cases, WSBP obtained the best results and reached over 90% accuracy with only 2 training images. Table 11 compares a few recent methods and shows that our descriptors achieved the best performance. In detail, WSBP with 3 training images exceeded MNTCDP [21] by 2.57% on this challenging FERET subset containing images under multiple orientations. As mentioned above, FERET has two different evaluation protocols, and it would not be fair to compare methods under different protocols. Nevertheless, it is interesting to relate our results to previous reports; for this purpose, we list the average accuracy reported by CLBP [10], SLBFLE [29], and WPCBP+FLD (HI) [47] (see Table 11), which performed efficiently on the subset of frontal faces.

Table 10 Recognition rate for the FERET dataset
Table 11 Performance comparison with the FERET dataset

5.6 Robustness against degraded images

In practical surveillance scenarios, image degradation often happens during the acquisition process and can significantly affect system performance. Therefore, the motivation of this experiment was to examine how our facial descriptors deal with such problems. In the first scenario, Gaussian noise was added to the original images at five different levels, levels = {10%, 20%, 30%, 40%, 50%}, using the Matlab function "imnoise". In the second scenario, occlusion was simulated by adding a white rectangle at a random position within the face region; each rectangle had a size varying from [20, 20] to [30, 60] and was drawn with the Matlab function "insertShape". Figure 14 shows both scenarios.
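A rough Python analogue of this corruption procedure is sketched below (the original experiments used Matlab's imnoise and insertShape). Interpreting the noise level as the Gaussian variance on a [0, 1] intensity scale is an assumption about the original setup, and the rectangle size range follows the values quoted above.

```python
import numpy as np

rng = np.random.default_rng(0)

def add_gaussian_noise(img, level=0.1):
    """Adds zero-mean Gaussian noise; 'level' is taken as the variance on a
    [0, 1] intensity scale (assumption about the Matlab imnoise setting)."""
    x = img.astype(np.float64) / 255.0
    noisy = x + rng.normal(0.0, np.sqrt(level), x.shape)
    return (np.clip(noisy, 0.0, 1.0) * 255).astype(np.uint8)

def add_occlusion(img, min_size=(20, 20), max_size=(30, 60)):
    """Places one white rectangle of random size and position (cf. insertShape)."""
    out = img.copy()
    h = rng.integers(min_size[0], max_size[0] + 1)
    w = rng.integers(min_size[1], max_size[1] + 1)
    y = rng.integers(0, img.shape[0] - h)
    x = rng.integers(0, img.shape[1] - w)
    out[y:y + h, x:x + w] = 255
    return out
```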

Fig. 14 The modified ORL images: (1) the first row has five levels of Gaussian noise (10%–50%); (2) the second row has occlusion by white random rectangles

In each scenario, five images per class were chosen as training samples and the rest as testing samples, over 100 random splits of the data. The average recognition rates of the different methods are shown in Table 12. Here, we fine-tuned the structuring element \({\mathscr{B}}_{2} = \{(1, 5); (2, 6)\}\) to obtain the best results for WSBP and CLBP_S_M(\(m_{1},\mu ^{\prime }_{2}\)). This structuring element made our descriptors more robust against noise and occlusion compared with the other methods.

Table 12 Performance comparison for different descriptors with the ORL dataset added with Gaussian noise and occlusion

5.7 The processing time

This section describes the computational cost of several LBP-based descriptors. Experiments on the ORL dataset (400 images of 92 × 112 pixels) were carried out on a machine with a 3.5 GHz CPU, 32 GB RAM, and the Windows 10 64-bit operating system. Table 13 shows the computational cost from two aspects: first, the processing time of the feature extraction phase and, second, the processing time of the matching phase (in seconds) for various descriptors under three configurations of (P, R). The processing time was measured with the structuring element \({\mathscr{B}} = \{(1, 6)\}\), where the training and testing sets each contained 200 images.

Table 13 The processing time of the descriptors used for the present study with different parameters (FT: feature extraction time, FTS: feature size of LBPs descriptors without using any dimension reduction techniques, and MT: matching time)

Table 13 shows that the WSBP required a longer processing time than CLBP_S(m1) or WSBP_S for both the feature extraction and matching phases. Indeed, the processing time grew with the scale (P, R) because of the larger descriptor dimension. Nevertheless, our WSBP descriptor was effective when compared with CLBP_S_M(\(m_{1},\mu ^{\prime }_{2}\)), since both achieved approximately the same recognition rates; see Tables 13 and 1.

6 Summary and discussion

Based on our experiments, we summarize and discuss several advantages of our proposed descriptors:

  • The WSBP descriptor is designed to extend the LBPs with local difference sign-magnitude distributions on statistical moments. As a pre-processing step, the statistical moment images obtained by local filtering over the spatial support \({\mathscr{B}}\) can eliminate noise coming from contrast change or illumination variation (mean moment) while deriving useful information from the salient regions of a face image (variance moment) (see Fig. 6).

  • The classical LBPs consider neighborhoods on a circle, whereas our WSBP descriptors exploit CLBP\(_{\alpha _{i}}\) operators along multiple directions, i.e. four directions, independently and combine them in the final descriptors. They are found to be robust against different lighting conditions, head poses, and facial expressions, achieving high performance (see CLBP_S_M(m1), CLBP_S_M(\(m_{1}, \mu ^{\prime }_{2}\)), and WSBP in Tables 1, 2, 4, 7, and 10).

  • Since the WSBP is built by fusing CLBPs along four different directions {αi} = {0°, 45°, 90°, 135°}, it works well with a single scale (P, R) of CLBP operators. It is therefore unnecessary to exploit a multi-scale approach using many parameters (P, R), which could lead to a high-dimensional descriptor.

  • Evaluation on six face datasets suggests that our descriptors outperform state-of-the-art methods such as EL-LBP [44], AECLBP-S (B16) [22], Multi-resolution dictionary [30], DR-LBP + LDA [35], and LDENP [42]. Moreover, our WSBP descriptors achieve better results than some deep facial features such as Deep Belief Net (GDBN) [8], Deep Autoencoders (DSDSA) [13], Compressive Sensing (CS) [2], and FDDL + CNN [39] (see Tables 3, 5, and 8).

  • According to an additional experiment with a color channel, the magnitude transform captures the relationship of pixel magnitudes on gray-scale images very well but is not effective on the Hue image (see Fig. 13), since the combination of the sign and magnitude of CLBPα in the Hue space performs worse than in the gray-scale case. In contrast, fusing the statistical moments (m1 and μ2) in CLBP_S and WSBP_S achieves higher accuracy in the Hue space while ignoring texture pixel intensity. This evaluation suggests a new direction in face recognition problems, such as integrating many color channels to enhance face spoofing detection performance [43].

  • Although the YALE dataset contains some facial expression cases, it is interesting to test how our descriptor handles a systematic variation of facial expressions. We therefore also used the KDEF dataset, which has seven facial expressions per subject, to study the effect of facial expressions. The results suggest that our descriptor deals with such cases very well.

  • Our facial descriptors using both the mean (m1) and variance (μ2) have shown their robustness against degraded images in the evaluation on the ORL dataset with artificial noise. Even though a Gaussian noise level of 50% makes the degraded face challenging to recognize even for human eyes, our WSBP(\({\mathscr{B}}_{2}\)) descriptor still reaches an acceptable accuracy of 93.05% for noise and 85.09% for occlusion, which is much higher than those of the other LBPs.

7 Conclusions and future work

We present a set of descriptors wherein the local difference distributions of local binary patterns are exploited by directions, and a weighting approach for binary patterns is applied to statistical moment images to obtain an efficient and robust facial feature representation. A comprehensive evaluation with several standard face datasets was carried out to validate our proposal. We analyzed the behavior of several descriptors on gray-scale images and found that our method mostly outperforms state-of-the-art methods. An analysis on color images was also conducted using the Hue channel of the AR dataset. We further simulated a few practical scenarios, which can occur during the data acquisition stage, by adding various levels of Gaussian noise and random occlusion to the ORL dataset. The spatial support strategy can be understood as a special preprocessing technique for eliminating noise, and the choice of the structuring element \({\mathscr{B}}\) depends on the levels and types of noise. For the scenarios examined in this study, a structuring element of two circles is found to eliminate noise very efficiently. Although such degradation can lower the recognition performance, our experimental results remain higher than those of the other methods, showing that our proposed descriptor is robust against the degradation of the given image. Overall, our experimental results suggest that the proposed descriptor is robust against noise, contrast change, illumination variation, and facial expressions by exploiting different directions of binary pattern operators on the mean moment and incorporating the contribution of each binary pattern through the variance moment.

We expect that these descriptors will find more applications in face recognition and in other areas such as facial paralysis analysis and face spoofing detection. Although our proposed framework is novel and high-performing, it has a few issues to be addressed: (1) the computational cost of matching increases when the descriptor dimension becomes larger; and (2) the optimal k parameter for the k-th root extraction of the variance moment must be fine-tuned. We plan to focus on addressing these issues. It would also be interesting to combine the WSBP descriptors with deep neural networks to build more powerful descriptors.