Linguistic Descriptors in Face Recognition

In this study, we propose a linguistic descriptor-based approach to the problem of face identification as realized by both humans and computers. The approach is motivated by the observation that linguistic descriptors make it possible to formalize and exploit important pieces of knowledge describing the human face. These entities are used by people in face recognition and can be of importance in building machine-oriented recognition schemes. Moreover, the evident human ability to recognize other individuals can be incorporated into computational face recognition as an invaluable vehicle for improving the recognition rate of machine-oriented classifiers. Specifically, we propose an application of the analytic hierarchy process to determine linguistic values of facial features. The experts’ assessments of faces in terms of such attributes help cope with the uncertainty captured in the experts’ decisions and result in a set of useful descriptors assuring the desired property of within-class similarities and between-class differences among faces. It is worth noting that the method presented in this study can be easily applied to any other classification problem in which experts are present.


Introduction
Face recognition has been a challenging problem for more than two decades. The reason stems from the omnipresence of computers and a genuine need to identify people using biometric methods, among which face recognition seems to be the least invasive. Moreover, forensic identification methods are strongly related to biometric methods, particularly in the case of face recognition. The literature of this field is rich and comprises various approaches. Most of the methods suffer from a lack of robustness to variation in pose, illumination, expression, age, occlusion, distance to the camera, and other factors. The existing methods could be augmented by bringing in some crucial factors that humans consider when recognizing faces. In this regard, people are still better than machines, at least when recognizing familiar persons. Regardless of place of birth, race, education, and other factors coming from the living environment, people describe subjects in a similar manner. Of course, factors such as own-race bias can skew this general opinion, but for all people any feature and its attribute, such as "big nose", have the same obvious meaning. The area of computational face recognition is vast and covers many approaches that are still being developed and extended, e.g., well-known local descriptors, sparse representation, and deep learning. The literature concerning human face recognition mainly focuses on two threads: the biological foundations of recognition, namely detection of brain activity regions [1] or eye tracking [2], and the saliency of facial regions together with the ability to recognize so-called familiar and unfamiliar faces [3,4]. Analogous considerations, though from the computational point of view, were presented in [5][6][7] and others, where the relevance of particular facial features or facial regions (e.g., upper/lower or eye/nose/mouth area) was discussed.
Those can result in determining the weights for the feature-based algorithms, where the aggregation of classifiers or the information fusion may be applied with a proper weight set. A brief survey of the papers can be found in [8]. It is worth noting that the task of face description and its parts has been discussed in numerous studies. Let us discuss some of them here. Government institutions (say, police) use the systems like Identikits, Evofit, and others [9] where the process of searching is carried out manually or automatically, and it is based on the manual (sketched) or automatic compositions of the images. It is important that finally, at the end of the process, the forensic expert has to confirm or reject the results of the search [10]. Very comprehensive and detailed guidelines on how to describe the criminal can be found in the standardizing documents [11], police websites with instructions for the witnesses of the crime like [12], and the textbooks for policemen [13]. An interesting approach to face identification and face retrieval was proposed in [14,15]. A set of features considered in the studies was described in terms of linguistic descriptions including terms such as small, medium, big. Those descriptors are characterized (quantified) in terms of fuzzy sets. It is worth noting that the variants of this proposal included the emotions' descriptions. An in-depth study of the way people describe human attributes was presented in [16]. Axiomatic fuzzy set was utilized to obtain a semantic facial descriptor in [17]. Interval value fuzzy TOPSIS technique was applied to 3D facial classification system in [18]. Recently, the results of AHP by experts were applied to a neural network classifier [19]. Moreover, the AHP was used to obtain the weights of facial features in [20]. Finally, the linguistic descriptors were measured by using experts' votes in [21]. 
Other approaches based on linguistic descriptors expressed in terms of fuzzy sets, fuzzy geometries, granular computing, and others were described in [22][23][24][25][26][27][28]. A comprehensive survey of methods utilizing the linguistic descriptors in face recognition can be found in [29]. Finally, it is noteworthy to add here that the AHP was used in object recognition tasks, however in a different manner than discussed in this study. The method was applied to image semantic representation in the image retrieval method [30,31] and to face emotion recognition [32].
The main objective of this study is to propose a novel method based on linguistic descriptors which can be applied to the face recognition or face retrieval problem, particularly to the problem of criminal identification, with or without the use of any numeric measures related to particular face images. The process of identification becomes easy and intuitively appealing, and it can be conducted either by a professional expert from the field of criminology or by a witness of the crime. We are interested in the use of a mechanism commonly encountered in multi-criteria decision-making theory, namely the analytic hierarchy process (AHP) [33][34][35]. The problem of facial recognition can be decomposed into two levels of hierarchy. At the higher level of the hierarchy, we form the weights related to the abstract facial features involved in the process of classification. At the lower level of the hierarchy, using the AHP method again, we transform the linguistic descriptions of the concrete facial features into numeric variables specifying the importance of all the possible attributes related to a given feature. Additionally, our aim is to investigate how the AHP can improve the recognition rate when it is combined with other methods based on the geometrical relationships present in the face. It is worth noting that the method presented in this study can be easily applied to any other expert-oriented task. The face recognition problem is treated here as a case study.
The paper is organized as follows. In Sect. 2, the role played by the AHP method is presented. In Sect. 3, we describe the methods of assignment of information coming from the numeric values of the features to their linguistic labels (descriptors) and the general scheme of processing. Section 4 covers the experimental results, while Sect. 5 offers conclusions and elaborates on the perspectives for the future work.

The Role of the Analytic Hierarchy Process
This section is devoted to a concise introduction to the AHP method as it was originally proposed in [33,34]. Using this approach one can obtain the ranking and the priorities of any set of features or attributes under consideration. The algorithm is briefly outlined as follows. First, the hierarchy present in the problem is formed. The goal is positioned at the top, next the criteria are formulated, while at the bottom of the hierarchy the set of alternatives is located. In our case, there are two objectives. First, we intend to produce the weights of the facial features, which can be utilized in the process of classification (whenever the weights can be applied to prioritize the classifiers). Second, we are interested in obtaining concrete degrees of membership of the facial features to the individual linguistic terms describing the set of attributes occurring in the population. For instance, we would like to know whether someone's nose is short, middle, or long, and to which extent it belongs to each of the classes, i.e., short, middle, or long noses.
At the next step of the algorithm, the expert (or a group of experts) quantifies the judgments between the elements (i.e., alternatives) of the hierarchy. These assessments are based on the pairwise comparisons of the elements. For n alternatives, the experts' responses are collected in the $n \times n$ matrix A, where n is the number of options to be considered (in our case, facial features).
The experts generate the pairwise comparison results using the following scale [34,35]: equal importance (1), weak importance (2), moderate importance (3), moderate plus (4), essential/strong importance (5), strong plus (6), very strong/demonstrated importance (7), very, very strong (8), extreme importance (9). A is called a reciprocal matrix, meaning that it satisfies the following requirements: for each element $a_{ij}$ we have $a_{ij} = 1/a_{ji}$, $i, j = 1, \ldots, n$, and $a_{ii} = 1$. Let us introduce the expression $\nu = (\lambda_{\max} - n)/(n - 1)$, where $\lambda_{\max} \ge n$ is the maximal eigenvalue of the reciprocal matrix A, and the value $\mu = \nu/r$, where $r = 0, 0, 0.52, 0.89, 1.11, 1.25, 1.35, 1.40, 1.45, 1.49$ for $n = 1, \ldots, 10$, respectively. These values are the mean consistency indices of 500 randomly generated reciprocal matrices [36]. For matrices of higher dimensionality, the methods generating the pertinent values are discussed in [37]. $\nu$, $\mu$, and $r$ are called the inconsistency index, consistency ratio, and random inconsistency index, respectively. From a practical perspective, the consistency ratio should not exceed 0.1 to assure a satisfactory level of consistency of the results [35]. However, it can be difficult to obtain such a level of consistency, especially when non-numerical, intangible features are compared. The final ranking of the priorities is constructed using the elements of the eigenvector of the matrix A associated with the maximal eigenvalue $\lambda_{\max}$. As mentioned in the introduction, we use the AHP method to obtain the weights related to particular facial features and the degree of membership of specific features to the linguistic attributes. Therefore, we assume that we extract the most descriptive facial features, which can be relatively easily estimated by people when looking at a 2D facial image.
The main idea of the process is that the experts perform pairwise comparisons between the features, answering questions of the following form: To what extent is feature A preferred over feature B? At the level of a single feature, the question becomes: To what extent is one attribute preferred over the other attributes of this feature? The algorithm results in the normalized vector $w = [w_1, w_2, \ldots, w_N]$ containing N weights associated with the considered features.
Let us now consider a specific face and its particular features. Each feature of this set can be described in terms of the available descriptors; this way we can produce vectors $f_1, f_2, \ldots, f_N$ which can be concatenated into a single vector presenting a description of a given face, say $f = [f_1, f_2, \ldots, f_N]$. Let us assume that the whole set of such image descriptions is denoted by X. Our intent is to classify a new face as belonging or not belonging to one of the faces in X. The face is characterized by some vector g. If we consider one of the possible applications, e.g., the classification process, it can be done, for instance, using the nearest neighbor rule by determining the minimal distance between g and f coming from X. To illustrate this, we can look, for example, at the feature eyebrow length. Let us assume that the set of faces was assessed by an expert and the expert's answers regarding the length of eyebrows were as follows: short-middle: 1-5, short-long: 1-9, middle-long: 1-7. Then, with rows and columns ordered as short, middle, long, the pairwise comparison matrix A comes in the form $A = \begin{bmatrix} 1 & 1/5 & 1/9 \\ 5 & 1 & 1/7 \\ 9 & 7 & 1 \end{bmatrix}$. The entries of its normalized eigenvector associated with the maximal eigenvalue, $f_{\text{eyebrow length}}$, are related to the linguistic values short, middle, long. From this example, one can see that the eyebrow length is likely long rather than short or middle. Once all the features have been estimated, the face is described by a vector with entries in $[0, 1]$, being the result of the concatenation of all the N normalized eigenvectors built in the same way as $f_{\text{eyebrow length}}$. It is worth noting here that psychological studies suggest that people have difficulties with numeric and linguistic estimation of human physical attributes such as height and width [16]. Therefore, the method of pairwise comparisons can be a sound alternative here.
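The eyebrow-length example can be traced numerically with a minimal Python sketch. It assumes that the expert's answers (1-5, 1-9, 1-7) translate into a reciprocal matrix with rows ordered short, middle, long, and it approximates the principal eigenvector by power iteration; the function name `ahp` and the iteration count are illustrative choices, not part of the original method description.

```python
# Random inconsistency indices r for n = 1..10, as listed in the text.
RANDOM_INDEX = [0.0, 0.0, 0.52, 0.89, 1.11, 1.25, 1.35, 1.40, 1.45, 1.49]


def ahp(A, iterations=200):
    """Return (priorities, lambda_max, inconsistency index, consistency ratio).

    The principal eigenvector is approximated by power iteration, which
    converges for matrices with strictly positive entries. Only n <= 10
    is supported because of the RANDOM_INDEX table above.
    """
    n = len(A)
    w = [1.0 / n] * n
    lam = float(n)
    for _ in range(iterations):
        Aw = [sum(A[i][j] * w[j] for j in range(n)) for i in range(n)]
        lam = sum(Aw)              # w sums to 1, so sum(A w) estimates lambda_max
        w = [x / lam for x in Aw]  # renormalize so that the entries sum to 1
    nu = (lam - n) / (n - 1) if n > 1 else 0.0  # inconsistency index
    r = RANDOM_INDEX[n - 1]
    mu = nu / r if r > 0 else 0.0               # consistency ratio
    return w, lam, nu, mu


# Assumed matrix for the answers short-middle 1-5, short-long 1-9, middle-long 1-7.
A = [
    [1.0, 1 / 5, 1 / 9],
    [5.0, 1.0, 1 / 7],
    [9.0, 7.0, 1.0],
]
f_eyebrow_length, lam_max, nu, mu = ahp(A)
labels = ["short", "middle", "long"]
best = labels[f_eyebrow_length.index(max(f_eyebrow_length))]
print(best, [round(x, 3) for x in f_eyebrow_length], round(mu, 2))
```

Under these assumptions the largest entry corresponds to long, in line with the conclusion drawn in the text. The computed consistency ratio for this particular matrix comes out above the 0.1 threshold, which illustrates the remark that satisfactory consistency can be hard to reach for intangible features.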

The Fusion of Information Coming from the Experts' Assessments and Numeric Measures
In the previous section, we elaborated on how to express the linguistic terms coming from the expert's opinion in the form of numerical values. The vectors obtained in this manner can be supplemented by numerical descriptors coming directly from the geometrical relations between the particular facial features. Starting from the linguistic description of the set of faces, we can get the degree of membership of the numeric values of measurable facial features (e.g., eyebrows or nose length) to attributes such as short, medium, long. An intuitively appealing technique of interest here is the well-known K-means algorithm [38]. In this study, we present a model based on the above-mentioned K-means and membership functions, which can be easily extended by using other approaches.

K-Means and Features' Lengths Normalization
Among all N facial features, some are measurable features, for which one can easily determine numerical values such as the length of the nose. In contrast to non-measurable features such as the shape of the face or the shape of the nose tip, for measurable features there is a natural linear order of linguistic descriptors, say short, medium, long. This observation is the starting point for selecting measurable features whose specific values can serve as input data to be clustered by the K-means or the FCM method. Hence, a sound alternative here is to use the classical K-means algorithm to cluster the investigated dataset into 3 groups, corresponding to the linguistic descriptions small, average, big, or short, medium, long, for each of the M measured features separately. Clustering each feature separately allows for a deeper analysis of the key differences between the considered faces, and as a result we obtain a separate set of clusters for each feature, not three general, multidimensional ones. The clustering is based on data resulting from measuring the distances between the characteristic points on each of the faces in an image dataset. The location of the landmarks is visualized in Fig. 1a. For instance, forehead width can be obtained by the formula $0.5\,[d(P_1, P_2) + d(P_3, P_4)]$, where $P_i$ denotes the coordinates of the ith point ($i = 1, \ldots, 55$) shown in Fig. 1a. It should be emphasized that the measurable features discussed in this study serve as illustrative examples and do not exhaust the set of all measurable features. For example, for the feature "length of eyebrows", cluster centers corresponding to the linguistic descriptors short, medium, long are determined. Next, for each person, the degree of membership to each cluster is determined. These clusters describe the possible values of the feature "length of eyebrows". After determining the above-mentioned lengths of the M measurable features $a_i^k$, $i = 1, \ldots, M$, $k = 1, \ldots, m$ (m is the number of considered faces), the results are normalized. The distances are scaled by setting the same distance between the pupils for all the faces, namely $a_i^{k*} = \mathrm{const} \cdot a_i^k$, where const is a scaling coefficient. Subsequently, they are clustered with the use of the K-means algorithm. Normalized in this manner, the length of a particular feature is the starting point for testing the degree of membership of every person to a cluster determined for each feature separately. After applying the K-means method, we receive a set of cluster centers. More specifically, for each of the measurable facial parts, we receive three numerical values corresponding to linguistic descriptors such as small, average, big. Denote such descriptions by $c_i$, $i = 1, \ldots, M$. Taking these vectors into account, we can determine the degrees of membership to the respective centers in the following manner. Note that the degree of membership to a cluster should be determined by the distance between the numeric value of a concrete feature and the center of the cluster.
Fig. 1 a Location of selected face landmarks; the face coming from the FERET dataset [39]; b The relations between the cluster centers and a specific feature length. Note that the closer the position of the numeric (measured in pixels) length d is to a particular cluster center, the greater the value of $z^{*}$, see the description in the text; c Examples of membership functions of the linguistic terms short, medium, and long
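The per-feature clustering step can be sketched in plain Python as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the reference interpupillary distance of 100 pixels, the quantile-based initialization, and the sample eyebrow lengths are all hypothetical.

```python
def normalize_lengths(lengths, pupil_distance, reference=100.0):
    """Scale one face's feature lengths so the interpupillary distance equals `reference`.

    `reference` plays the role of the common pupil distance; its value is an assumption.
    """
    const = reference / pupil_distance  # scaling coefficient for this face
    return [const * a for a in lengths]


def kmeans_1d(values, k=3, iterations=100):
    """Lloyd's algorithm on scalar data; returns the k cluster centers in ascending order."""
    data = sorted(values)
    # Assumed deterministic initialization: evenly spaced quantiles of the data.
    centers = [data[(2 * j + 1) * len(data) // (2 * k)] for j in range(k)]
    for _ in range(iterations):
        groups = [[] for _ in range(k)]
        for x in data:
            nearest = min(range(k), key=lambda j: abs(x - centers[j]))
            groups[nearest].append(x)
        # Keep the old center if a cluster happens to be empty.
        updated = [sum(g) / len(g) if g else centers[j] for j, g in enumerate(groups)]
        if updated == centers:
            break
        centers = updated
    return sorted(centers)


# Hypothetical normalized eyebrow lengths (in pixels) for nine faces.
eyebrow_lengths = [24, 25, 26, 36, 37, 38, 46, 47, 48]
short_c, medium_c, long_c = kmeans_1d(eyebrow_lengths)
print(short_c, medium_c, long_c)
```

Running the clustering once per measurable feature yields one triple of centers per feature, which corresponds to the vectors $c_i$ in the text.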
Assume that the vector $d_k$, $k = 1, \ldots, m$, contains the values of the measurable features for the kth person. Based on the results of the K-means algorithm for each of the features, the distances to the centers $c_i$ are calculated. In other words, for each person $k = 1, \ldots, m$ the elements of the new vectors $z_i^k$, $i = 1, \ldots, M$, are given by $z_{i,j}^k = |d_{k,i} - c_{i,j}|$, where $j = 1, 2, 3$ is an index corresponding to small, average, and big, respectively. More precisely, if a specific eyebrow length is $d = 35$ and for this feature the vector of centers c has the form $[25, 37, 47]$, then the vector containing the distances between the numerical value and the centers of the clusters is $z = [10, 2, 12]$. In the next step, the vectors $z_i^k$, $i = 1, \ldots, M$, $k = 1, \ldots, m$, are standardized as $z_{i,j}^{k*} = (r_i - z_{i,j}^k)/z_{i,j}^k$, where $r_i = \max_{1 \le k \le m} d_{k,i} - \min_{1 \le k \le m} d_{k,i}$ denotes the dispersion (spread) of the ith measurable feature. This procedure is illustrated in Fig. 1b. In our example, if this spread for the feature eyebrow length is 48, then $z^{*} = [3.8, 23, 3]$. The final result is a set of vectors $z_i^{k*}$ which are normalized to the length of 1. These vectors contain the degrees of membership to the particular clusters corresponding to the linguistic values.
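The worked example (d = 35, centers [25, 37, 47], spread 48) can be reproduced with a short Python sketch. Two details are assumptions rather than statements of the source: the final normalization is taken to mean that the entries sum to 1, and a small constant `eps` guards against division by zero when a value coincides with a cluster center.

```python
def center_distances(d, centers):
    """Absolute distances between a feature value and the cluster centers."""
    return [abs(d - c) for c in centers]


def standardize(z, spread, eps=1e-9):
    """Apply (r - z) / z elementwise; eps is an assumed guard for z == 0."""
    return [(spread - zj) / max(zj, eps) for zj in z]


def normalize(v):
    """Assumed reading of 'normalized to the length of 1': entries sum to 1."""
    total = sum(v)
    return [x / total for x in v]


z = center_distances(35, [25, 37, 47])   # distances to short, medium, long centers
z_star = standardize(z, spread=48)       # larger value means closer to the center
memberships = normalize(z_star)
print(z, z_star, [round(x, 3) for x in memberships])
```

The closest center (medium, at 37) receives by far the largest share, which matches the intuition stated in the caption of Fig. 1b.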

Formation of Membership Functions
We consider triangular membership functions as a sound way of quantifying individual linguistic variables such as short, medium, and long, by introducing the following parameterized membership functions 1; x max x; are the elements of vectors assigned to the kth face, and x k is a value of the feature's length. It is worth noting that other types of membership functions such as Gaussian ones could be considered here.
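As an illustration of how triangular membership functions for short, medium, and long could be evaluated, consider the following Python sketch. The exact parameterization used in the study is not reproduced here; the sketch assumes that the three breakpoints are the cluster centers of a feature, that short and long saturate at 1 beyond the outermost breakpoints, and that medium is a full triangle. The function name and the sample breakpoints are hypothetical.

```python
def memberships(x, c_short, c_medium, c_long):
    """Degrees of membership of x to [short, medium, long].

    Assumes c_short < c_medium < c_long. Short is a left shoulder,
    long is a right shoulder, and medium is a triangle peaking at c_medium.
    """
    if x <= c_short:
        short = 1.0
    elif x >= c_medium:
        short = 0.0
    else:
        short = (c_medium - x) / (c_medium - c_short)

    if x <= c_short or x >= c_long:
        medium = 0.0
    elif x <= c_medium:
        medium = (x - c_short) / (c_medium - c_short)
    else:
        medium = (c_long - x) / (c_long - c_medium)

    if x >= c_long:
        long_ = 1.0
    elif x <= c_medium:
        long_ = 0.0
    else:
        long_ = (x - c_medium) / (c_long - c_medium)

    return [short, medium, long_]


# Hypothetical breakpoints taken from the eyebrow-length example in the text.
print([round(v, 3) for v in memberships(35, 25, 37, 47)])
```

With this construction the three degrees sum to 1 for any x, so the resulting vector can be concatenated with the other descriptors without further scaling. A Gaussian variant would replace the piecewise-linear branches with exponentials centered at the same breakpoints.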

Classification Process
The main flow of the proposed classification process is outlined in Fig. 2. A group of experts describes a certain face by quantifying its features with the use of the AHP method. The results of the assessments of the individual features are concatenated into vectors representing activation levels of face descriptors forming a certain linguistic space. The vectors can be easily averaged using, for instance, the arithmetic mean. In parallel, the original measures of facial features are held in the form of vectors containing the memberships to the linguistic values short, medium, long. The vectors formed on the basis of the linguistic terms and the numeric values of the measurable facial features are used to carry out classification. An intuitively appealing alternative is the nearest neighbor (NN) classification algorithm in which a weighted Euclidean distance is considered: $d_w(x, y) = \sqrt{\sum_{i=1}^{n} w_i (x_i - y_i)^2}$, where $x = [x_1, \ldots, x_n]$, $y = [y_1, \ldots, y_n]$, and $w = [w_1, \ldots, w_n]$. Considering that p experts took part in the process of pairwise comparisons of N features, one obtains p vectors of weights, i.e., $w_1, \ldots, w_p$, describing the importance of facial cues, namely $w_i = [w_{i,1}, \ldots, w_{i,N}]$, $i = 1, \ldots, p$. Similarly, as in the case of the concatenated vectors corresponding to particular facial features, we can reform the weight vectors by stretching them and build the Q-element vectors $v_i = [v_{i,1}, \ldots, v_{i,Q}]$, $i = 1, \ldots, p$. Here, Q is the sum of the numbers of linguistic values corresponding to the N facial features. By stretching we mean that the weight corresponding to a feature, e.g., the eyebrow length from the above example, is associated with each of the elements of the vector $v_i$ corresponding to that feature (the three potential linguistic values short, medium, long). Finally, the weights are averaged, i.e., $v = \frac{1}{p} \sum_{i=1}^{p} v_i$.
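The classification step can be sketched in Python as follows. The sketch assumes the common form of the weighted Euclidean distance (weights multiplying squared differences) and an arithmetic mean over the experts' stretched weight vectors; the helper names, the toy gallery, and all numeric values are hypothetical.

```python
import math


def stretch(weights, sizes):
    """Repeat each feature weight once per linguistic value of that feature."""
    stretched = []
    for w, q in zip(weights, sizes):
        stretched.extend([w] * q)
    return stretched


def average(vectors):
    """Elementwise arithmetic mean of equally long vectors."""
    p = len(vectors)
    return [sum(column) / p for column in zip(*vectors)]


def weighted_distance(x, y, w):
    """Weighted Euclidean distance (assumed form: weights on squared differences)."""
    return math.sqrt(sum(wi * (xi - yi) ** 2 for xi, yi, wi in zip(x, y, w)))


def nearest_neighbor(g, gallery, w):
    """Return the label of the gallery vector closest to the query vector g."""
    return min(gallery, key=lambda label: weighted_distance(g, gallery[label], w))


# Toy setting: 2 features with 3 and 2 linguistic values (Q = 5), 2 experts.
sizes = [3, 2]
expert_weights = [[0.7, 0.3], [0.5, 0.5]]
v = average([stretch(w, sizes) for w in expert_weights])

gallery = {
    "person_a": [0.1, 0.2, 0.7, 0.8, 0.2],
    "person_b": [0.6, 0.3, 0.1, 0.3, 0.7],
}
query = [0.1, 0.3, 0.6, 0.7, 0.3]
print(nearest_neighbor(query, gallery, v))
```

Averaging after stretching, as done here, gives the same result as averaging the N-element weight vectors first and stretching afterwards, since both operations are linear.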

Experimental Studies
The experimental study was completed for the well-known FERET dataset [39]. We consider the first 50 images of the subset called ba and the first 50 images coming from the subset referred to as bk. The first group of images stands for set A (we treat it as the training set), while the second one forms set B (testing set). Three experts (our laboratory members or friends) were asked to describe the images belonging to set A, and 3 experts were asked to describe the images belonging to set B. Two of them filled in the questionnaires regarding both sets. More precisely, each of them played the role of a crime witness describing a facial image using the pairwise comparisons described above. The experts completed the questionnaires using specially prepared forms to make their task easier. In this manner, we got 3 questionnaires for set A and 3 questionnaires for set B. For the needs of the experiments, the questionnaires were formed as special tables prepared in a spreadsheet program, where the questions were given in the form presented in Sect. 2. Moreover, we asked an experienced expert from the field of face recognition (a member of our laboratory) to describe $N = 52$ of the most descriptive, in our opinion, facial features. These facial features come with linguistic descriptors. The cues selected in this way, along with their descriptors, are intuitive and easy to identify and compare for experts in the fields of forensics, cognitive psychology, etc., or, what is probably most desired in the context of applying the proposed method, for potential witnesses of a crime who have to describe a criminal to be identified. In the process of running the AHP, expert survey results concerning the estimation of interrelations between the 52 facial features were utilized. The features were chosen with the use of the standards described in [11,12,40].
The non-measurable features were: gender, origin, shape of the face, hair length, hair texture, hairstyle, forehead shape, forehead skin, forehead profiling, eyebrows direction, eyebrows shape, shape of the lower eyelid, direction of the fissures, placement of eyes shallow, shape of the nasal bridge, shape of the nasal tip, ears protrusion, ears shape, size of the earlobe, position of the earlobe, earlobe shape, cheeks fullness, shape of the opening between lips, mouth fullness, depth of the philtrum, chin shape, chin prominence, chin details. The measurable features were: forehead width, forehead height, eyebrows length, distance between the eyebrows, eyebrows position, eyebrows thickness, distance between eyelids, fissures length, inter-eye distance, nose length, nose width, width of the nasal bridge, height of the nasal bridge, nostrils, size of the nose holes, ears length, length of the cheeks, width of the cheeks bones, mouth width, upper lip height, lower lip height, width of the philtrum, chin size. The results of our experiments are collected in Table 1. The AHP has proved to be a useful tool. In particular, when more than one expert takes part in the estimation process, we may anticipate reaching a good recognition rate (more than 90%). The results show that the methods assessing the degrees of membership to the corresponding linguistic values, represented by membership functions or clusters, can strongly support the description process realized by the experts. Fusing the information coming from the experts' linguistic descriptions with the measurements of the features' lengths in the form of a concatenated vector improves the accuracy of the classification algorithm. For instance, if the face images are described by a single expert, the subject is correctly identified in 94% of cases. The efficiency of the algorithm improves when more experts are involved.
The participation of two experts in face evaluation arises as a good and relatively inexpensive alternative. When K-means is used to build an augmented feature vector, good results are reported when only two experts are involved in the description of the training set and one expert answers the questions regarding the testing images (recognition rate over 97%). The use of the weights produced by the AHP process completed in the linguistic facial space results in an improvement in performance at the level of 6%. Finally, it is worth noting that optimizing the experts' answers, regarding both the concrete descriptions of specific facial features and the weights assigned to the considered cues, by the well-known particle swarm optimization (PSO) algorithm [41], with the termination criterion that the inconsistency index should not exceed the value 0.1, leads to a slight improvement in the recognition rate of up to 1.5%. These special results are denoted in Table 1 by AHP + PSO and AHP + PSO + K-means. In our case, we set the number of particles in the PSO algorithm to 30 and the number of generations to 200. To compare our proposal based on linguistic descriptors with other algorithms, we have chosen methods based on so-called local descriptors. The last four rows of Table 1 contain the recognition rates obtained for well-known local descriptors, namely local binary patterns (LBP) [42] and multi-scale block local binary pattern (MBLBP) [43]. They were obtained using the same FERET subset with the following setup: each image was divided into $n \times n$ rectangles of equal size. The best results were obtained for $n = 6$. In the case of MBLBP, the blocks of pixels were built from 3, 5, and 7 pixels, respectively. Note that in most cases our method outperforms machine-based feature extraction approaches such as LBP and MBLBP.
Moreover, we present 3 methods related to linguistic descriptors, namely AHP with no information about distances [19], voting on the chosen feature lengths [20], and fuzzy sets based on the weights obtained directly from the users [21].

The even rows of the results present the results obtained with the weights associated with the saliency of each of the facial features.

Conclusions and Future Studies
In this study, we have presented a novel application of the analytic hierarchy process, regarded here as a useful vehicle for developing linguistic descriptors of facial features obtained from the experts' evaluations of the faces. This approach produced very good results, particularly when it was fused with a standard classifier. The average recognition rates, varying from 94 to 100%, show the efficiency of the method in applications where the presence of experts is necessary and very important, e.g., in forensics. Moreover, we have introduced a method of determining the weights of facial cues, which significantly improved the accuracy of the classification process. Furthermore, the method can be easily improved by invoking optimization methods. Here, as an example, the application of the well-known PSO has led to improved results. Future work may focus on automation of the process, applying other weights and various aggregation techniques, i.e., the aggregation of the corresponding elements of reciprocal matrices or the aggregation of the AHP results, assessing the weights for specific elements of matrices, a deeper investigation of the manner of calculating such weights, and an application of the method to other fields of decision making where experts play a significant role. Moreover, it is worth examining other forms of classifiers such as SVM or the fuzzy Sugeno classifier, see, e.g., [44].
Open Access This article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made.