The generalized similarity measure (GSM), a similarity measure expressed by a linear combination of the Mahalanobis distance and bilinear similarity, obtains good recognition accuracy for face recognition [1]. The GSM is also practical because learning the GSM is expressed as a convex optimization problem and the global solution can be found by existing algorithms. While the GSM shows good accuracy and practicability, it still has a scalability issue. To deal with massive data from the web, scalability becomes as important as accuracy and practicability. When recognizing a query, the similarity measures must be calculated for all samples to find the closest sample. Hence, computing the similarity measures is expensive if the number of samples is large.

A feasible way to reduce the computational cost is to limit the samples used for calculating the similarity measures. This can be realized using approximate nearest neighbor search (ANNS) methods such as locality sensitive hashing (LSH) [2], fast library for approximate nearest neighbors (FLANN) [3], and bucket distance hashing (BDH) [4]. However, they are not directly applicable to the GSM because the GSM differs from the similarity measures that existing ANNS methods support. The few acceleration methods for other similarity measures include recognition based on a binary similarity measure, which was accelerated by introducing LSH [5], and recognition based on the cosine similarity measure with LSH [6]. To the best of our knowledge, no method accelerates recognition based on the GSM.

In this paper, we propose a method that accelerates GSM-based recognition using an ANNS method. We embed the GSM, defined on a d-dimensional feature space, into a (d+1)-dimensional Euclidean space, where d is the dimensionality of the feature space. This enables us to use the GSM with a Euclidean distance-based ANNS method, so that the computational cost of the GSM-based search can be reduced. Experimental results show that the proposed method realizes fast recognition without degrading accuracy on face, object, and character datasets.


In this section, we explain the GSM [1], which is the similarity measure used by the proposed method. The GSM is defined as a linear combination of the Mahalanobis distance and bilinear similarity. The bilinear similarity measure is adopted as one of the GSM components because it shows favorable recognition results [7]. The GSM is more useful than the cosine similarity measure [8] because the cost function for learning the GSM is convex with respect to M and G. Let x and y be feature vectors. The GSM \(f_{(M,G)}(\boldsymbol{x},\boldsymbol{y})\) is expressed with the Mahalanobis distance \(d_{M}(\boldsymbol{x},\boldsymbol{y})=(\boldsymbol{x}-\boldsymbol{y})^{\top}M(\boldsymbol{x}-\boldsymbol{y})\) and the bilinear similarity measure \(s_{G}(\boldsymbol{x},\boldsymbol{y})=\boldsymbol{x}^{\top}G\boldsymbol{y}\) as follows:

$$ f_{\left(M,G\right)}\left(\boldsymbol{x},\boldsymbol{y}\right)= s_{G}\left(\boldsymbol{x},\boldsymbol{y}\right) - d_{M}\left(\boldsymbol{x},\boldsymbol{y}\right). $$
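As a minimal sketch of Eq. (1), the GSM can be evaluated directly with NumPy. The function names below are illustrative and do not come from [1]; the toy matrices M = G = I are chosen only so that the value is easy to check by hand.

```python
import numpy as np

def mahalanobis(x, y, M):
    """d_M(x, y) = (x - y)^T M (x - y)."""
    diff = x - y
    return float(diff @ M @ diff)

def bilinear(x, y, G):
    """s_G(x, y) = x^T G y."""
    return float(x @ G @ y)

def gsm(x, y, M, G):
    """f_(M,G)(x, y) = s_G(x, y) - d_M(x, y), following Eq. (1)."""
    return bilinear(x, y, G) - mahalanobis(x, y, M)

# Toy check with M = G = I, where f reduces to x.y - ||x - y||^2.
x = np.array([1.0, 2.0])
y = np.array([3.0, 1.0])
I = np.eye(2)
value = gsm(x, y, I, I)  # x.y = 5 and ||x - y||^2 = 5
```

A larger value means a closer match, so a nearest neighbor search over the GSM looks for the sample that maximizes this function.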

To reduce the effect of the intra-class variations, Eq. (1) uses mapped feature vectors x and y according to [1]. Let \(\boldsymbol{z}_{i}^{j}\) be the feature extracted from the jth image of subject \(i\in\{1,\ldots,S\}\). The intra-class covariance matrix is then defined by

$$ C_{S}=\sum_{i\in S}\left(\boldsymbol{z}_{i}^{{\phantom{.}}j} - \boldsymbol{z}_{i}^{{\phantom{.}}k} \right) \left(\boldsymbol{z}_{i}^{{\phantom{.}}j} - \boldsymbol{z}_{i}^{{\phantom{.}}k}\right)^{\top}. $$

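Let \((\lambda_{1},\ldots,\lambda_{l})\), \(V=(\boldsymbol{v}_{1},\ldots,\boldsymbol{v}_{l})\), and \(\boldsymbol{X}\) be the top l eigenvalues of \(C_{S}\), the corresponding eigenvectors, and the original feature vector of x, respectively. A mapped feature vector x is expressed as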

$$ \boldsymbol{x}= \text{diag} \left(\lambda_{1}^{-1/2}, \ldots, \lambda_{l}^{-1/2}\right) V^{\top} \boldsymbol{X}. $$

The parameters of the GSM, M and G, are learned by similarity metric learning on the intra-class subspace (sub-SML) [1].
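The mapping onto the intra-class subspace in Eqs. (2) and (3) can be sketched as below. This is an illustration under assumptions rather than the implementation of [1]: Eq. (2) does not state the range of j and k, so the sketch sums over all unordered image pairs of each subject, and the function names and random toy data are hypothetical.

```python
import numpy as np
from itertools import combinations

def intra_class_covariance(features_by_subject):
    """Sum of outer products of feature differences over same-subject image pairs.

    Assumption: every unordered pair (j, k) of a subject's images is used.
    """
    dim = features_by_subject[0].shape[1]
    C = np.zeros((dim, dim))
    for Z in features_by_subject:              # Z has shape (n_images, dim)
        for j, k in combinations(range(len(Z)), 2):
            diff = Z[j] - Z[k]
            C += np.outer(diff, diff)
    return C

def fit_mapping(C, l):
    """Return P = diag(lambda^{-1/2}) V^T built from the top-l eigenpairs of C."""
    eigvals, eigvecs = np.linalg.eigh(C)       # eigenvalues in ascending order
    order = np.argsort(eigvals)[::-1][:l]      # indices of the l largest
    lam = eigvals[order]
    V = eigvecs[:, order]
    return (V / np.sqrt(lam)).T                # row i is lambda_i^{-1/2} v_i^T

# Toy data: 5 subjects, 4 images each, 6-dimensional original features.
rng = np.random.default_rng(0)
subjects = [rng.standard_normal((4, 6)) for _ in range(5)]
C = intra_class_covariance(subjects)
P = fit_mapping(C, l=3)
x = P @ subjects[0][0]                         # mapped feature vector, Eq. (3)
```

With this mapping the intra-class covariance becomes the identity on the retained subspace, which is the sense in which intra-class variation is normalized.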

Acceleration of the GSM-based nearest neighbor search

Nearest neighbor search based on the GSM is computationally expensive because the value of \(f_{(M,G)}(\cdot,\cdot)\) in Eq. (1) must be calculated for every sample in the database. Our idea is to accelerate the search by introducing an ANNS method. Thanks to approximation and efficient calculation, ANNS can be realized by calculating only a limited number of distances. The biggest obstacle to introducing an ANNS method into the search is that the function \(f_{(M,G)}(\cdot,\cdot)\) in Eq. (1) cannot be directly treated as an \(L_{n}\) norm such as the Euclidean and Manhattan distances because it consists of two terms with different characteristics. Thus, we transform Eq. (1) so that it can be calculated as an \(L_{2}\) norm (i.e., the Euclidean distance).

Suppose that x is a mapped sample feature in a database and y is a mapped query feature. The GSM of x and y is expressed by the following equation:

$$ f_{\left(M,G\right)}\left(\boldsymbol{x},\boldsymbol{y}\right) =\boldsymbol{x}^{\top}G\boldsymbol{y}-\left(\boldsymbol{x}-\boldsymbol{y}\right)^{\top}M\left(\boldsymbol{x}-\boldsymbol{y}\right). $$

Let \(\boldsymbol{x}^{p}=(G+2M)\boldsymbol{x}\). Assuming that M and G are symmetric, Eq. (4) can be rewritten as

$$ f_{\left(M,G\right)}\left(\boldsymbol{x},\boldsymbol{y}\right) = \boldsymbol{y}^{\top} \boldsymbol{x}^{{\phantom{.}}p} - \boldsymbol{x}^{\top} M \boldsymbol{x} - \boldsymbol{y}^{\top} M \boldsymbol{y}. $$
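The step from Eq. (4) to Eq. (5) can be spelled out as follows. The expansion relies on the assumption that M and G are symmetric, so that \(\boldsymbol{x}^{\top}G\boldsymbol{y}=\boldsymbol{y}^{\top}G\boldsymbol{x}\) and \(\boldsymbol{x}^{\top}M\boldsymbol{y}=\boldsymbol{y}^{\top}M\boldsymbol{x}\):

$$ \begin{aligned} f_{\left(M,G\right)}\left(\boldsymbol{x},\boldsymbol{y}\right) &= \boldsymbol{x}^{\top}G\boldsymbol{y} - \boldsymbol{x}^{\top}M\boldsymbol{x} + 2\boldsymbol{x}^{\top}M\boldsymbol{y} - \boldsymbol{y}^{\top}M\boldsymbol{y} \\ &= \boldsymbol{y}^{\top}\left(G+2M\right)\boldsymbol{x} - \boldsymbol{x}^{\top}M\boldsymbol{x} - \boldsymbol{y}^{\top}M\boldsymbol{y} \\ &= \boldsymbol{y}^{\top}\boldsymbol{x}^{p} - \boldsymbol{x}^{\top}M\boldsymbol{x} - \boldsymbol{y}^{\top}M\boldsymbol{y}. \end{aligned} $$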

The first term in Eq. (5) is the dot product of y and \(\boldsymbol{x}^{p}\). Hence, it is represented using the Euclidean distance \(\|\boldsymbol{y}-\boldsymbol{x}^{p}\|\) as

$$ \boldsymbol{y}^{\top}\boldsymbol{x}^{{\phantom{.}}p}= -\frac{1}{2}\left\{\left\|\boldsymbol{y}-\boldsymbol{x}^{{\phantom{.}}p}\right\|^{2} -\left\|\boldsymbol{x}^{{\phantom{.}}p}\right\|^{2} -\left\|\boldsymbol{y}\right\|^{2} \right\}. $$

We substitute Eq. (6) into Eq. (5) and obtain

$$ f_{\left(M,G\right)}\left(\boldsymbol{x},\boldsymbol{y}\right) =-\frac{1}{2} \left\{\left\|\boldsymbol{y}-\boldsymbol{x}^{{\phantom{.}}p}\right\|^{2} -\left\|\boldsymbol{y}\right\|^{2}+2\boldsymbol{y}^{\top}M\boldsymbol{y} +L(\boldsymbol{x}) \right\}, $$


$$ L(\boldsymbol{x}) =\boldsymbol{x}^{\top}\left\{2M-\left(G+2M\right)^{\top}\left(G+2M\right)\right\} \boldsymbol{x}. $$
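For readers who want the intermediate step, substituting Eq. (6) into Eq. (5) and collecting the terms that depend only on x gives

$$ \begin{aligned} f_{\left(M,G\right)}\left(\boldsymbol{x},\boldsymbol{y}\right) &= -\frac{1}{2}\left\{\left\|\boldsymbol{y}-\boldsymbol{x}^{p}\right\|^{2} - \left\|\boldsymbol{x}^{p}\right\|^{2} - \left\|\boldsymbol{y}\right\|^{2}\right\} - \boldsymbol{x}^{\top}M\boldsymbol{x} - \boldsymbol{y}^{\top}M\boldsymbol{y} \\ &= -\frac{1}{2}\left\{\left\|\boldsymbol{y}-\boldsymbol{x}^{p}\right\|^{2} - \left\|\boldsymbol{y}\right\|^{2} + 2\boldsymbol{y}^{\top}M\boldsymbol{y} + 2\boldsymbol{x}^{\top}M\boldsymbol{x} - \left\|\boldsymbol{x}^{p}\right\|^{2}\right\}. \end{aligned} $$

Because \(\|\boldsymbol{x}^{p}\|^{2}=\boldsymbol{x}^{\top}(G+2M)^{\top}(G+2M)\boldsymbol{x}\), the last two terms in the braces equal \(L(\boldsymbol{x})\) as defined in Eq. (8).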

Note that the second and third terms in the braces on the right-hand side (RHS) of Eq. (7), which contribute \(\frac{1}{2}\|\boldsymbol{y}\|^{2}-\boldsymbol{y}^{\top}M\boldsymbol{y}\), can be ignored because they are common to all samples in the database. Among the remaining terms, only the first term \(-\frac{1}{2}\|\boldsymbol{y}-\boldsymbol{x}^{p}\|^{2}\) is a function of y, and it cannot be calculated until a query y is given. In contrast, the fourth term \(-\frac{1}{2}L(\boldsymbol{x})\) can be calculated before the query is given. As a consequence, ranking by Eq. (7) amounts to adding the fourth term to the squared Euclidean distance in the first term. Although this sum is not itself a Euclidean distance, it can be transformed into one by introducing augmented vectors that have one more dimension than the original vectors. The augmented vectors \(\boldsymbol{x}^{p'}\) and \(\boldsymbol{y}'\) are given as

$$\begin{array}{@{}rcl@{}} \boldsymbol{x}^{{\phantom{.}}p^\prime} &=& \left({\boldsymbol{x}^{{\phantom{.}}p}}^{\top}, \sqrt{L(\boldsymbol{x})} \right)^{\top}, \end{array} $$
$$\begin{array}{@{}rcl@{}} \boldsymbol{y}' &=& \left({\boldsymbol{y}}^{\top}, 0 \right)^{\top}. \end{array} $$

For \(\sqrt{L(\boldsymbol{x})}\) in Eq. (9) to be real, \(L(\boldsymbol{x})\geq 0\) must be satisfied. Here, \(\|M\|\leq 0.5\) is required because \(2M-(G+2M)^{\top}(G+2M)\) in Eq. (8) must be a positive-semidefinite matrix to satisfy \(L(\boldsymbol{x})\geq 0\) for every x. Using Eqs. (9) and (10), we can rewrite Eq. (7) with \(\boldsymbol{x}^{p'}\) and \(\boldsymbol{y}'\) as

$$ f_{\left(M,G\right)}\left(\boldsymbol{x},\boldsymbol{y}\right)=-\frac{1}{2}\left\{ \left\|\boldsymbol{y}'-\boldsymbol{x}^{{\phantom{.}}p^\prime}\right\|^{2} -\left\|\boldsymbol{y}\right\|^{2}+2\boldsymbol{y}^{\top}M\boldsymbol{y}\right\}. $$

As the second and third terms in the RHS of Eq. (11) are common to all samples and can be ignored, maximizing Eq. (11) over the database is equivalent to minimizing the Euclidean distance between \(\boldsymbol{x}^{p'}\) and \(\boldsymbol{y}'\). Thus, we can apply ANNS.
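The embedding in Eqs. (8) to (11) can be checked numerically with a short sketch. Everything here is illustrative: the function names are hypothetical, M and G are synthetic symmetric matrices chosen so that the matrix in Eq. (8) is positive semidefinite, and an exhaustive Euclidean search stands in for an ANNS method such as BDH.

```python
import numpy as np

def gsm_scores(X, y, M, G):
    """Brute-force GSM f(x, y) = x^T G y - (x - y)^T M (x - y) for each row x of X."""
    diff = X - y
    return X @ G @ y - np.einsum("ij,jk,ik->i", diff, M, diff)

def augment_database(X, M, G, tol=1e-10):
    """Map each row x of X to (x^p, sqrt(L(x))) as in Eq. (9); done once, offline."""
    A = G + 2.0 * M
    Q = 2.0 * M - A.T @ A                      # matrix inside L(x), Eq. (8)
    if np.linalg.eigvalsh(0.5 * (Q + Q.T)).min() < -tol:
        raise ValueError("2M - (G+2M)^T (G+2M) is not positive semidefinite")
    L = np.einsum("ij,jk,ik->i", X, Q, X)      # L(x) = x^T Q x for every row
    Xp = X @ A.T                               # x^p = (G + 2M) x
    return np.hstack([Xp, np.sqrt(np.clip(L, 0.0, None))[:, None]])

def augment_query(y):
    """Append a zero coordinate to the query, Eq. (10)."""
    return np.append(y, 0.0)

# Synthetic symmetric M and G sharing eigenvectors, so Eq. (8) stays PSD.
rng = np.random.default_rng(0)
d, n = 8, 200
U, _ = np.linalg.qr(rng.standard_normal((d, d)))
M = U @ np.diag(np.linspace(0.10, 0.20, d)) @ U.T
G = U @ np.diag(np.linspace(0.05, 0.15, d)) @ U.T

X = rng.standard_normal((n, d))                # mapped database features
X_aug = augment_database(X, M, G)              # (n, d + 1), precomputed

y = rng.standard_normal(d)                     # mapped query feature
sq_dist = np.sum((X_aug - augment_query(y)) ** 2, axis=1)

f_direct = gsm_scores(X, y, M, G)
f_embedded = -0.5 * (sq_dist - y @ y + 2.0 * y @ M @ y)   # Eq. (11)
best_by_gsm = int(np.argmax(f_direct))
best_by_distance = int(np.argmin(sq_dist))
```

In this sketch `f_embedded` reproduces `f_direct`, and the sample with the largest GSM is the one with the smallest Euclidean distance in the augmented space. In practice, `X_aug` would be indexed by a Euclidean ANNS structure instead of being scanned exhaustively.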


We used three datasets for the evaluation: the Labeled Faces in the Wild dataset (LFW) [9], Amsterdam Library of Object Images (ALOI) [10], and ETL9B. We describe the experimental setting and results for each dataset.


LFW is a celebrity face database collected from Yahoo! News. It has 13,233 images of 5790 subjects. The image set we used is called “LFW-a” [11], whose images were cropped and normalized to 250 × 250 pixels by a commercial face detector. We used up to 482 subjects for learning M and G, and 1198 subjects for the gallery and probe. The subjects used for learning did not overlap with the subjects for the gallery and probe. We learned M and G with both 50 and 482 subjects. When we used 50 subjects, the number of images per subject was fixed to 22. When we used 482 subjects, the number of images differed across subjects, with an average of 22 images per subject. We chose one image per subject for the gallery and one image per subject for the probe. We evaluated the computational time and recognition rate while increasing the number of subjects for the gallery and probe from 100 to 1000 in increments of 100.

We extracted features following Cao’s method [1]. That is, nine feature points were fixed as shown in Fig. 1, and the SIFT descriptor [12] was extracted at these points at three scales: 2, 6, and 10. The extracted features were concatenated and the dimensionality was reduced to 100 using principal component analysis. In the recognition process, we used the 1-nearest neighbor rule. We used the BDH [4] as the ANNS method. Figure 2 shows the recognition rates and average search times of the proposed method with 1000 subjects in the gallery as the BDH parameters were changed. This indicates that the recognition rate and speed depend on the BDH parameters. Therefore, we ran the experiments with many different parameter settings and report the best recognition rate in this paper.

Fig. 1

Facial feature points. Red dots show the feature points

Fig. 2

Recognition rates and computational times as the BDH parameters change

We compared the proposed method with the face recognition method proposed in [13], called local generic representation (LGR). The LGR focuses on improving recognition accuracy when a single image per person is available for the gallery and probe. In the literature, the best recognition rate on the LFW database is 30.4% with 50 subjects [13]. We used only 50 subjects for learning in the LGR because the number of images per subject must be uniform for learning in the LGR. We could evaluate the LGR only up to 400 subjects in the gallery because of the memory limitation. We also compared the proposed method with the fast face recognition method proposed in [14], which uses the PCA-SIFT for image representation and BDH for search. In the literature, a 100% cumulative recognition rate with 139-ms search time on an original 5 million-item database has been presented [14].

All experiments were conducted on a PC with an Intel Xeon E5-4627 v2 (3.30 GHz) processor and 8 GB of RAM running the Debian 4.9.2-10 operating system using a single processor core. We measured the search time for all queries and calculated the average search time of each query. The search time excluded feature extraction time and learning time.

The recognition rate and average search time are shown in Fig. 3. In Fig. 3a, “without ANNS” is the proposed method with the ANNS step excluded, “BDH+PCA-SIFT” is the method proposed in [14], and the number following each method name is the number of subjects used for learning. The proposed method showed the same recognition rate as “without ANNS” in Fig. 3a. This indicates that the proposed method can recognize subjects without reducing the accuracy. The proposed method showed a better recognition rate than the LGR and “BDH+PCA-SIFT.” This indicates that the proposed method can achieve satisfactory accuracy for the face recognition task.

Fig. 3

Recognition rate and average search time on the LFW. a Recognition rate. b Average search time

In Fig. 3b, the proposed method and “without ANNS” used the parameters learned with the 482 subjects. Figure 3b shows that the proposed method is about 24,000 times faster than the LGR, 88 times faster than “without ANNS,” and about 1600 times faster than “BDH+PCA-SIFT.”


ALOI is an object image database that consists of 110,250 color images of 1000 small objects. We used a part of the ALOI called “ALOI-VIEW,” whose images were taken from 72 different directions by rotating each object in the plane at intervals of 5 degrees. The number of objects is 1000 and the total number of images is 72,000. The image size is 384 × 288 pixels. We sampled images according to the rotation of the objects to use for learning parameters: every 180 degrees (0+k180), 90 degrees (0+k90), 45 degrees (0+k45), 20 degrees (0+k20), and 10 degrees (0+k10). We also used the sampled data for the gallery, and the rest were used for the probe in the recognition process. We set the number of objects used for the gallery and probe to 100, 200, 500, 700, and 1000, and evaluated the computational time and recognition rate. We used the bag-of-features model with the SIFT features for image representation. To acquire features covering the whole image, we sampled feature points every 5 pixels horizontally and vertically, and extracted the SIFT descriptors at these points. The scales of the SIFT were fixed to 20 and 30. We fixed the number of visual words to 300 based on preliminary experimental results. We used a soft-voting k-nearest neighbor (k-NN) classifier in which the voting weight was given by the similarity score. We fixed parameter k to 100.
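A soft-voting k-NN classifier of the kind described above can be sketched as follows. The text does not give implementation details, so this is only one plausible reading: each of the k most similar gallery items votes for its class with a weight equal to its similarity score. The function name and toy scores are hypothetical, and the sketch does not address how negative similarity scores should be weighted.

```python
import numpy as np

def soft_voting_knn(scores, labels, k):
    """Predict a class from similarity scores between a query and gallery items.

    scores: 1-D array, higher means more similar (e.g., GSM values).
    labels: class label of each gallery item, in the same order as scores.
    """
    top = np.argsort(scores)[::-1][:k]         # indices of the k most similar items
    votes = {}
    for idx in top:
        votes[labels[idx]] = votes.get(labels[idx], 0.0) + float(scores[idx])
    return max(votes, key=votes.get)           # class with the largest weighted vote

# Toy example: class "b" wins with k = 3 because its two votes outweigh "a".
scores = np.array([0.9, 0.5, 0.45, 0.1])
labels = ["a", "b", "b", "a"]
pred_k3 = soft_voting_knn(scores, labels, k=3)
pred_k1 = soft_voting_knn(scores, labels, k=1)
```

With k = 1 the classifier reduces to the 1-nearest neighbor rule, while larger k lets several moderately similar items of one class outvote a single very similar item of another.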

We compared the proposed method with a multiclass SVM [15]. The training and probe data were identical to those of the proposed method, and we used a linear kernel function. We also compared the proposed method with a k-NN classifier using the Euclidean distance. We conducted the experiments on the same PC used for the LFW experiments.

Figure 4 shows the recognition rates and search times. Figure 4a shows that the proposed method obtained a better recognition rate than the other methods when the learning data were identical. In Fig. 4b, the proposed method was faster than the other methods under almost all experimental settings. When the learning data were 0+k180 and 0+k90, and the number of objects was 100, the multiclass SVM was faster than the proposed method. However, when the number of objects increased to more than 200, the proposed method was faster than the multiclass SVM. This indicates that the proposed method has better scalability than the multiclass SVM.

Fig. 4

Recognition rate and average search time on the ALOI. a Recognition rate. b Average search time


ETL9B is a handwritten Japanese character database that is a binarized version of ETL9 [16] and consists of 3036 Japanese characters with 200 images per character. The size of the character images is 64 × 63 pixels. We used the first 100 images of each character for learning and the gallery, and the rest for the probe. We set the number of characters for the gallery and probe to 100, 500, 1000, 2000, and 3036 to evaluate the proposed method. We resized the images to 16 × 16 pixels and converted the resized images to feature vectors by concatenating the pixel values. We also used the directional element features [17], which have a dimensionality of 196, to represent the images. The recognition method was the same as for ALOI, and the PC used for evaluation was the same as for LFW. We compared the proposed method with a Euclidean distance-based k-NN classifier. Figure 5 shows the recognition rate and average search time. In Fig. 5a, the proposed method shows the same recognition rate as the method without ANNS, just as for LFW and ALOI. Figure 5b shows that the computational time increased in a sublinear manner. Consequently, we confirmed that the proposed method was much faster than the existing methods and, with respect to recognition accuracy, no worse than them.

Fig. 5

Recognition rate and average search time on the ETL9B. a Recognition rate. b Average search time


In this paper, we proposed a fast recognition method based on the GSM. The proposed method embeds the GSM into a Euclidean distance and applies an existing ANNS method to reduce the number of calculated similarity measures. The experimental results on the LFW show that the proposed method was 88 times faster than the same method without acceleration. In addition, an evaluation on three databases demonstrates that the increase in the computational time of the proposed method was sublinear as the number of subjects in the gallery increased.