Face image retrieval based on shape and texture feature fusion

Humongous amounts of data bring various challenges to face image retrieval. This paper proposes an efficient method to solve those problems. Firstly, we use accurate facial landmark locations as shape features. Secondly, we utilise shape priors to provide discriminative texture features for convolutional neural networks. These shape and texture features are fused to make the learned representation more robust. Finally, to increase efficiency, a coarse-to-fine search mechanism is exploited to find similar objects. Extensive experiments on the CASIA-WebFace, MSRA-CFW, and LFW datasets illustrate the superiority of our method.


Introduction
One of the first visual patterns an infant learns to recognize is the face. The face provides a natural means for people to recognize each other. For this and several other reasons, face recognition and retrieval have been problems of prime interest in the fields of computer vision, biometrics, pattern recognition, and machine learning for decades. The face has been very successfully used in biometrics due to its unobtrusive nature and ease of use; it is suited to both overt and covert applications. Along with advances in face analysis technology, face recognition, expression recognition, attribute analysis, and other applications have come to the fore. Also, content-based image information retrieval technology has gradually matured, and major search engines now offer a search by image function. Progress in face recognition and content-based information retrieval technology has made automatic similar face retrieval possible. Similar face retrieval has high application value in the fields of entertainment search, criminal surveillance, and so on. Figure 1 illustrates large-scale face retrieval in the field of prevention of terrorist crimes.
As a specific application of image retrieval, face retrieval has the same research characteristics. Unlike face recognition and face identification, the aim of face retrieval is to search for all the face images similar to an input image in a given face image database, and to sort the results by similarity. Existing face retrieval methods are usually designed to compute geometric properties and relationships between significant local features, such as the eyes, nose, and mouth [1,2]. Bach et al. [3] manually annotated images of faces and used artificial features extracted from the annotated regions for face matching, thus providing a semiautomatic face retrieval system. Eickeler [4] applied the pseudo 2D hidden Markov model method for the first time in a face retrieval system, achieving good results. Gudivada and Raghavan [5] borrowed methods from face matching and proposed using features extracted from face matching in a face retrieval system. Wang et al. [6] proposed a multitask learning structure using local binary patterns (LBP) [7] to solve face verification and retrieval problems.
Learning face representations via deep learning has achieved a series of breakthroughs in recent years [8-13]. The idea of mapping a pair of face images to a distance originated in Ref. [14], whose authors trained Siamese networks as a basis for the similarity metric, which is small for positive pairs and large for negative pairs. This approach requires image pairs as input.
Very recently, Refs. [12,15] supervised the learning process in CNNs using challenging identification signals (with a softmax loss function), which brings richer identity-related information to deeply learned features. Subsequently, a joint identification-verification supervision signal was adopted in Refs. [10,13], leading to more discriminative representation features. Reference [16] enhanced supervision by adding a fully connected layer and loss functions to each convolutional layer. The advantage of triplet loss has been proved in Refs. [8,9,17]. With deep embedding, the distance between an anchor and a positive instance is minimized, while the distance between an anchor and a negative instance is maximized until a preset margin is met. These methods achieved state-of-the-art performance on the LFW dataset.
We propose a method for fast large-scale face retrieval using fused shape and texture features to represent a face. Firstly, we use accurate face alignment to gain shape information, inspired by SDM [18]. Secondly, we adopt a modified GoogleNet [19] to gain texture information about the face. Thirdly, we fuse these two features to represent the face image. Furthermore, we use a coarse-to-fine structure that clusters the dataset into several dense subsets to achieve fast retrieval. We thoroughly evaluate the contributions of each part in this paper and show that it achieves excellent performance on experimental datasets.

Overview
Figure 2 provides an overview of our shape and texture cascade face retrieval approach. Firstly we use SDM to extract face landmarks and a modified GoogleNet to extract face texture information. Secondly we fuse and balance the two features using principal component analysis (PCA). Finally, we search the face dataset using the fused features to get the result.

Shape feature representation
This section describes the use of SDM in the context of face alignment. Algorithm 1 shows the main steps of the SDM evaluation procedure. SDM is based on a regressor that starts from a raw initial shape guess $x_0$ and progressively refines this estimate using descent directions $R_k$ and bias terms $b_k$, outputting a final shape estimate $x_k$. The descent directions $R_k$ and bias terms $b_k$ are learned during training. Following the SDM formulation [18], the training procedure at step $k$ corresponds to minimizing:

$$\arg\min_{R_k,\, b_k} \sum_{i} \left\| x_*^i - x_k^i - R_k\, h(I^i, x_k^i) - b_k \right\|_2^2$$

where $x_*^i$ are the manually annotated face landmarks of the $i$-th training image $I^i$, $x_k^i$ is the current shape estimate, and $h(I^i, x_k^i)$ denotes the features extracted around $x_k^i$. Minimizing this corresponds to a linear least squares problem that can be solved in closed form.
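The closed-form least squares step can be illustrated with a short sketch. It assumes the standard SDM objective, in which each stage regresses the residual between the annotated shape and the current estimate on features extracted at the current landmarks. The function name `fit_sdm_stage` and the synthetic data are hypothetical and only serve to show the computation.

```python
import numpy as np

def fit_sdm_stage(features, x_current, x_true):
    """Fit one SDM stage (R_k, b_k) by linear least squares.

    features:  (N, F) features extracted around the current landmarks
    x_current: (N, P) current shape estimates x_k
    x_true:    (N, P) annotated shapes x_*
    """
    delta = x_true - x_current  # regression targets: x_* - x_k
    # Append a constant column so the bias b_k is fitted jointly with R_k.
    A = np.hstack([features, np.ones((features.shape[0], 1))])
    sol, *_ = np.linalg.lstsq(A, delta, rcond=None)  # shape (F + 1, P)
    R_k = sol[:-1].T  # (P, F) descent direction
    b_k = sol[-1]     # (P,) bias term
    return R_k, b_k

# Synthetic, noise-free data (assumption): the residual is exactly linear
# in the features, so the fit should recover R_true and b_true.
rng = np.random.default_rng(0)
N, F, P = 200, 6, 4
features = rng.normal(size=(N, F))
R_true = rng.normal(size=(P, F))
b_true = rng.normal(size=P)
x_current = rng.normal(size=(N, P))
x_true = x_current + features @ R_true.T + b_true

R_k, b_k = fit_sdm_stage(features, x_current, x_true)
```

In a full cascade, the fitted stage would be applied to every training sample to obtain the next estimates, and the procedure would be repeated for the following stage.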

Algorithm 1 Face alignment via supervised descent method (SDM)
Input: image I, descent directions $R_k$, bias terms $b_k$, initial guess $x_0$.
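The evaluation procedure can be sketched as a simple loop over the learned stages. This is a minimal illustration, assuming the usual SDM update in which each stage adds a linear function of the current features to the shape estimate. The names `sdm_align` and `toy_features` are hypothetical, and the toy feature function stands in for a real local descriptor.

```python
import numpy as np

def sdm_align(image, x0, regressors, extract_features):
    """Apply a learned SDM cascade: x_{k+1} = x_k + R_k h(I, x_k) + b_k."""
    x = np.asarray(x0, dtype=float).copy()
    for R_k, b_k in regressors:
        features = extract_features(image, x)  # h(I, x_k)
        x = x + R_k @ features + b_k
    return x

# Toy setting (assumption): the "image" is just the true shape and the
# feature is the residual, so each stage with R_k = 0.5 * I halves the error.
def toy_features(image, x):
    return image - x

true_shape = np.array([1.0, 2.0])
regressors = [(0.5 * np.eye(2), np.zeros(2)) for _ in range(3)]
x_final = sdm_align(true_shape, np.zeros(2), regressors, toy_features)
```

After three stages the toy estimate has closed seven eighths of the gap to the true shape, which mirrors the progressive refinement described in the text.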

Texture feature representation
This section explains how we use CNNs modified from GoogleNet V2 [20] to extract the texture features. Convolutional neural networks (CNNs) have played an extremely significant role in computer vision due to the revolutionary improvements they provide over the state of the art in many applications. In the field of face analysis, however, large-scale public datasets are extremely scarce. Thus, here we use a face dataset containing 20,000 celebrities, each with 50-1000 images, for a total of about 2,000,000 images taken from the Internet. We combine the state-of-the-art performance of GoogleNet V2 with the accurate and efficient triplet loss approach [8] to train our face texture extraction model using the above dataset.
GoogleNet Inception V1 is the earliest version of GoogleNet, appearing in 2014 [19]. Generally, the most direct way to increase network performance is to increase the depth and width of the network, which means generating a massive number of parameters. However, so many parameters not only cause overfitting but also increase the computation. Reference [19] argues that the fundamental way to solve these two drawbacks is to convert the connections, even the convolutions, to a sparse set of connections. For non-uniform sparse data, the computational efficiency of computer software and hardware is very poor, so determining an approach that not only keeps the sparsity of the network, but also permits the high computational performance associated with dense matrices, is a key issue. A large number of papers show that computing performance can be improved by clustering the sparse matrix into dense submatrices. Inspired by those methods, the Inception module was designed to realize the above ideas. Figure 3(a) shows the initial version of the Inception module. The different sizes of convolutions mean different sizes of receptive fields; filter concatenation fuses features at diverse scales. As the network deepens, the features tend to become more abstract, and the receptive field of each feature involved is also increased. Thus, with an increasing number of layers, the proportion of 3×3 and 5×5 convolutions also increases, resulting in a huge computational load. Inspired by Ref. [21], a 1×1 convolutional kernel is applied for dimensionality reduction. The dimension-reduction form of the Inception module is shown in Fig. 3(b).
Building deeper networks has since become mainstream, but computational efficiency falls as models grow. Hence, Szegedy et al. [20] sought a way to expand the network while avoiding increased computational requirements. GoogleNet V2, proposed in 2015, improves on V1 by replacing n×n convolutional kernels with asymmetric 1×n and n×1 kernels. With this scheme, the convolutional neural network keeps a wide range of receptive fields while reducing the number of parameters needed when expanding the network, which increases the computational speed. Figure 4 illustrates the architecture of the Inception module of GoogleNet V2. Here, n = 7 for the 17×17 grid. By virtue of its high performance and lightweight model, we choose it as the basic network used to extract face texture features.
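The parameter savings from the 1×1 dimensionality reduction and from asymmetric kernels can be checked with simple arithmetic. The sketch below counts convolution weights only (biases ignored). The channel counts are hypothetical and are not taken from the network used in this paper.

```python
def conv_weights(kh, kw, c_in, c_out):
    """Number of weights in a kh x kw convolution (biases ignored)."""
    return kh * kw * c_in * c_out

# 1x1 reduction before a 5x5 convolution (hypothetical channel counts).
c_in, c_mid, c_out = 192, 16, 32
direct_5x5 = conv_weights(5, 5, c_in, c_out)
reduced_5x5 = conv_weights(1, 1, c_in, c_mid) + conv_weights(5, 5, c_mid, c_out)

# Replacing a 7x7 kernel by a 1x7 kernel followed by a 7x1 kernel (n = 7).
c = 128
full_7x7 = conv_weights(7, 7, c, c)
factorised_7x7 = conv_weights(1, 7, c, c) + conv_weights(7, 1, c, c)

print(direct_5x5, reduced_5x5)    # 153600 15872
print(full_7x7, factorised_7x7)   # 802816 229376
```

For equal input and output channels, the asymmetric pair needs 2n weights per channel pair instead of n², a factor of 3.5 for n = 7.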
As an improvement, we adopt a triplet-based loss to learn a face embedding when we train the GoogleNet. In brief, the triplet loss works as follows: given a pair of alike faces (a, b) and a third, differing face c, the aim is to ensure that a is more similar to b than to c. This differs from traditional metric learning approaches.
The output $\phi(\ell_t) \in \mathbb{R}^D$ of the pretrained GoogleNet is $\ell_2$-normalised and mapped to an $L \ll D$ dimensional space using an affine projection $x_t = W\phi(\ell_t)/\|\phi(\ell_t)\|_2$, where $W \in \mathbb{R}^{L \times D}$. There are two key differences compared to the use of a linear predictor: firstly, $L \neq D$ is not equal to the number of class identities, but is the size of the descriptor embedding; secondly, the projection $W$ is trained to minimise the empirical triplet loss:

$$E(W) = \sum_{(a,p,n) \in T} \max\left\{0,\ \alpha - \|x_a - x_n\|_2^2 + \|x_a - x_p\|_2^2\right\}$$

where $\alpha \geq 0$ is a fixed scalar representing a learning margin and $T$ is a set of training triplets. Unlike a linear predictor, no bias is learned here. A triplet $(a, p, n)$ is composed of an anchor face $a$, a positive sample $p \neq a$ of the anchor's identity, and a negative sample $n$ of a different identity.
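A small numerical sketch may help make the triplet loss concrete. It assumes the standard hinge form, max(0, α − ‖x_a − x_n‖² + ‖x_a − x_p‖²), summed over the triplets, with each embedding obtained by projecting the ℓ2-normalised network output. The function names, the projection matrix, and the toy vectors are all hypothetical.

```python
import numpy as np

def embed(W, phi):
    """Project the l2-normalised network output: x = W phi / ||phi||_2."""
    return W @ (phi / np.linalg.norm(phi))

def triplet_loss(W, triplets, alpha):
    """Empirical triplet loss over (anchor, positive, negative) outputs."""
    total = 0.0
    for phi_a, phi_p, phi_n in triplets:
        x_a, x_p, x_n = embed(W, phi_a), embed(W, phi_p), embed(W, phi_n)
        d_ap = np.sum((x_a - x_p) ** 2)  # anchor-positive squared distance
        d_an = np.sum((x_a - x_n) ** 2)  # anchor-negative squared distance
        total += max(0.0, alpha - d_an + d_ap)
    return total

# Toy projection from D = 3 to L = 2 dimensions (keeps the first two axes).
W = np.array([[1.0, 0.0, 0.0],
              [0.0, 1.0, 0.0]])
anchor = np.array([1.0, 0.0, 0.0])
positive = np.array([2.0, 0.0, 0.0])        # same direction as the anchor
far_negative = np.array([0.0, 1.0, 0.0])    # margin satisfied, contributes 0
close_negative = np.array([3.0, 0.0, 0.0])  # margin violated, contributes alpha

triplets = [(anchor, positive, far_negative),
            (anchor, positive, close_negative)]
loss = triplet_loss(W, triplets, alpha=0.5)
print(loss)  # 0.5
```

Only the triplet whose negative lies within the margin contributes to the loss, which is why training focuses on triplets that still violate the margin.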
We obtain our texture feature representation by training using a face dataset that contains 2,000,000 images; the model size is 58.7 MB.

Fast face retrieval via coarse-to-fine procedure
This section explains how we achieve fast face retrieval for large-scale databases, using two main steps. The first fuses face shape and texture features, which are 132- and 256-dimensional vectors respectively. We apply PCA to reduce the combined features to a final fused feature vector of 128 dimensions. All face data is used in this operation. The second step clusters the fused feature vectors for each dataset into several dense subclusters. We determine the number of clusters according to the number of images in each dataset. Our experiments show that about 100,000 images per cluster gives the best balance between speed and precision of retrieval. Therefore, we choose 5 and 2 clusters respectively for the CASIA-WebFace [22] (abbreviated as CASIA in the following) and MSRA-CFW [23] (abbreviated as CFW) datasets.
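The fusion step can be sketched with a plain SVD-based PCA. This is an illustration under assumptions: the two feature vectors are simply concatenated before PCA (the text does not specify any additional scaling), random vectors stand in for real shape and texture features, and the helper names `pca_fit` and `pca_transform` are hypothetical.

```python
import numpy as np

def pca_fit(X, n_components):
    """Return the data mean and the top principal directions of X."""
    mean = X.mean(axis=0)
    # Rows of Vt are orthonormal principal directions, sorted by variance.
    _, _, Vt = np.linalg.svd(X - mean, full_matrices=False)
    return mean, Vt[:n_components]

def pca_transform(X, mean, components):
    """Project X onto the principal directions."""
    return (X - mean) @ components.T

# Random stand-ins for the 132-D shape and 256-D texture features.
rng = np.random.default_rng(0)
shape_feats = rng.normal(size=(500, 132))
texture_feats = rng.normal(size=(500, 256))

combined = np.hstack([shape_feats, texture_feats])  # (500, 388)
mean, components = pca_fit(combined, n_components=128)
fused = pca_transform(combined, mean, components)   # (500, 128)
```

A query face would be projected with the same `mean` and `components` before searching, so that query and database features live in the same 128-dimensional space.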

Experimental data
As Table 1 shows, we have performed experiments on three datasets. As most identities contain only one image in LFW [24], we conduct face verification on this dataset to demonstrate the excellent selectivity of our face feature representation. The other two datasets are used for face retrieval. Figure 5 shows some examples of face images in these three face datasets. All face images from CASIA are cropped to a uniform size but we use the original images from CFW. Thus, CASIA only contains face images while CFW includes many busts and full-body pictures.

Evaluation
We now explain how we carried out the experiments. Extensive experiments on the LFW dataset were used to evaluate the performance of the features extracted by our method. Both CASIA and CFW were collected for training face recognition models and provide neither a standard test set nor a benchmark for face retrieval, so we manually selected one representative test image for each identity in both datasets. This gives a test set of 10,575 representative face images for CASIA and, using the same method, 1583 representative face images for CFW. Following standard image retrieval experimental practice, we use top-1 and top-5 retrieval precisions as our performance metrics. Top-1 and top-5 precisions are calculated using:

$$\mathrm{precision} = \frac{1}{n} \sum_{i=1}^{n} C(X_i, Y_i)$$

where $n$ represents the number of representative face images in the test set, and $C(X_i, Y_i)$ compares the ground truth $X_i$ and the retrieval result $Y_i$. In top-1 retrieval mode, $Y_i$ contains just the most similar retrieval result, and if it is equal to the ground truth, $C(X_i, Y_i) = 1$, otherwise $C(X_i, Y_i) = 0$. In top-5 retrieval mode, $Y_i$ contains the five most similar retrieval results, and as long as one of the five results is equal to the ground truth, $C(X_i, Y_i) = 1$, otherwise $C(X_i, Y_i) = 0$.
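The top-k precision described above amounts to the fraction of probes whose ground-truth identity appears among the k most similar results. A minimal sketch follows. The function name and the toy identities are hypothetical.

```python
def top_k_precision(ground_truth, retrieved, k):
    """Fraction of probes whose identity is among the first k results.

    ground_truth: list of identities X_i, one per probe
    retrieved:    list of ranked result lists Y_i (most similar first)
    """
    hits = sum(1 for x, ys in zip(ground_truth, retrieved) if x in ys[:k])
    return hits / len(ground_truth)

# Toy example with four probes and ranked identity lists.
ground_truth = ["A", "B", "C", "D"]
retrieved = [
    ["A", "B", "C", "D", "E"],  # hit at rank 1
    ["C", "B", "A", "D", "E"],  # hit at rank 2
    ["A", "B", "D", "E", "F"],  # miss
    ["D", "A", "B", "C", "E"],  # hit at rank 1
]

top1 = top_k_precision(ground_truth, retrieved, k=1)
top5 = top_k_precision(ground_truth, retrieved, k=5)
print(top1, top5)  # 0.5 0.75
```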

Face retrieval evaluation
As Table 1 shows, CASIA contains 494,414 face images with 10,575 identities while CFW contains 202,792 face images with 1583 identities. Here we conduct two kinds of experiments. The first strategy performs face retrieval by directly calculating the Euclidean distance between the test image and all images in the test database (the linear scan approach). Sorting the distances gives the top-1 and top-5 retrieval results. We also use a coarse-to-fine strategy (the coarse-to-fine approach). Firstly, we adopt k-means to cluster the database image features into k dense subsets (k = 5 and k = 2 respectively for CASIA and CFW). Secondly, we find the nearest subset to the test image. Finally, we search this closest subset for the final top-1 and top-5 results.
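The two search strategies can be compared in a short sketch. It is illustrative only: it uses a plain Lloyd's k-means written in NumPy, random toy features rather than real face descriptors, and hypothetical function names. The clustering implementation actually used in the experiments is not specified in the text.

```python
import numpy as np

def nearest_center(points, centers):
    """Index of the closest center (Euclidean) for each row of points."""
    d = ((points[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
    return d.argmin(axis=1)

def kmeans(X, k, n_iter=20, seed=0):
    """Plain Lloyd's k-means; returns (centers, labels)."""
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), size=k, replace=False)].copy()
    for _ in range(n_iter):
        labels = nearest_center(X, centers)
        for j in range(k):
            members = X[labels == j]
            if len(members) > 0:  # keep the old center if a cluster empties
                centers[j] = members.mean(axis=0)
    return centers, nearest_center(X, centers)

def linear_scan(query, X, top=5):
    """Rank all database features by Euclidean distance to the query."""
    d = np.linalg.norm(X - query, axis=1)
    return np.argsort(d)[:top]

def coarse_to_fine(query, X, centers, labels, top=5):
    """Pick the nearest cluster, then rank only that cluster's members."""
    nearest = nearest_center(query[None, :], centers)[0]
    idx = np.flatnonzero(labels == nearest)
    d = np.linalg.norm(X[idx] - query, axis=1)
    return idx[np.argsort(d)[:top]]

# Toy database: two well-separated groups of 8-D features.
rng = np.random.default_rng(1)
X = np.vstack([rng.normal(0.0, 1.0, size=(100, 8)),
               rng.normal(10.0, 1.0, size=(100, 8))])
centers, labels = kmeans(X, k=2)

query = X[3]  # probe identical to database item 3
scan_top5 = linear_scan(query, X)
c2f_top5 = coarse_to_fine(query, X, centers, labels)
```

The coarse-to-fine search only computes distances within one cluster, which is where the speed-up comes from. It can miss true neighbours that fall in another cluster, so some loss of precision relative to the linear scan is to be expected.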
Our retrieval results are shown in Table 2. For the CASIA dataset we find that our features give excellent performance, achieving 96.62% and 99.34% precision in top-1 and top-5 modes respectively using linear scan to find the top-k face images. However, the linear scan method is time consuming: the average search time per probe face is nearly 3 s, which is unacceptable. Therefore, we use a coarse-to-fine structure to speed up the retrieval, which takes about 0.3 s to produce retrieval results per probe image. The retrieval speed increased by 8-9 times, at the cost of a decrease in precision of approximately 2%.
We also achieve outstanding performance on CFW, the retrieval precisions in top-1 and top-5 modes using linear scan being 98.61% and 99.30% respectively. As the dataset is much smaller than CASIA, the retrieval time is only about 0.5 s. When we applied the coarse-to-fine procedure to the retrieval, the results were quite different from those expected. In top-1 mode, the time cost of each retrieval did not reduce, but increased. This experiment illustrates that if the dataset is not large, the coarse-to-fine operation does not reduce the retrieval time, but increases the complexity of the search.
Table 2 Face retrieval results for CASIA and CFW; retrieval time is the average search time per probe face (columns: linear scan and coarse-to-fine retrieval, each in top-1 and top-5 modes, for CASIA and CFW)
To show that fusing features gives better retrieval results, we performed comparative experiments on both CASIA and CFW using fused features and using texture features only. Table 3 shows the retrieval results, which confirm our expectations. For CASIA, using only texture features, top-1 and top-5 retrieval accuracies decreased by 8% and 5% respectively.
The reduction for CFW is more severe, top-1 and top-5 retrieval accuracies being reduced by 17% and 11% respectively. The differences between the two databases led to these quite different accuracy reductions: all face images of CASIA are cropped to uniform size but CFW still contains the original images. As expected, the facial shape information indeed contributes to the good performance.
We demonstrate some results using real examples. Figures 6 and 7 show top-10 results for CASIA and CFW retrieved by the coarse-to-fine method. All retrieval experiments were carried out on a desktop computer with an Intel i7-2600 CPU and 24 GB RAM.

Face verification evaluation
We conducted a face verification evaluation using the LFW dataset, which is the standard test set for face verification in an unconstrained environment. We report mean face verification accuracy and the receiver operating characteristic (ROC) curve on the 6000 given face pairs in LFW. We rely on a huge outside dataset for training our face representation model, like all recent high-performance face representation methods [12,15,25-34]. We compared our method with these methods, which all used unrestricted, labeled outside data for training. Furthermore, we used an SVM to learn a threshold to verify whether two faces have the same identity or not. In this way, we achieved 97.68% face verification accuracy. Using texture features only, we achieved 96.70% face verification accuracy, once again demonstrating the advantage of our fused features. Comparisons of accuracy and ROC curves with previous state-of-the-art methods on LFW are shown in Table 4 and Fig. 8, respectively. We achieve outstanding results that demonstrate the excellence of our face representation model.
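The thresholding idea behind the verification step can be illustrated with a small sketch. The paper learns the threshold with an SVM. The code below swaps in a simpler exhaustive search over candidate thresholds, which is an assumption made only for illustration. The distances and labels are toy values, not results from LFW.

```python
def best_threshold(distances, same):
    """Pick the threshold t maximising accuracy of 'distance <= t => same'.

    distances: feature distances for face pairs
    same:      True if the pair shares an identity, else False
    Returns (threshold, accuracy); ties keep the smallest threshold.
    """
    best_t, best_acc = None, -1.0
    for t in sorted(set(distances)):
        correct = sum((d <= t) == s for d, s in zip(distances, same))
        acc = correct / len(distances)
        if acc > best_acc:
            best_t, best_acc = t, acc
    return best_t, best_acc

# Toy pairs: small distances mostly belong to matching identities.
distances = [0.2, 0.4, 0.5, 0.9, 1.1, 1.3]
same = [True, True, True, False, True, False]

threshold, accuracy = best_threshold(distances, same)
print(threshold)  # 0.5
```

At test time, a pair would be declared to share an identity when the distance between its two fused features does not exceed the learned threshold.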