1 Introduction

Recently, various studies have been conducted to provide customized services through gender classification in mobile and Internet shopping applications and social networks [1, 2]. In particular, since correct gender recognition is very helpful for assessing situations accurately in video surveillance applications [3], many studies on gender classification have been carried out [4,5,6,7]. Facial images are generally used to differentiate gender because the face contains useful information such as the shape of the eyes, nose, and lips. Developing an appropriate gender classifier requires many facial images as ground truth data for training. Most recent studies on gender classifiers use training datasets of facial images obtained with restricted backgrounds or particular facial angles. To support the efficient generation of new ground truth data, a video annotation system named INHA-VAS was proposed, which searches and edits previous related data in an integrated video annotation database [8].

However, gender classifiers trained on facial images obtained from controlled environments can hardly be expected to classify facial images with uncontrolled backgrounds accurately [4]. That is, classifiers developed from controlled training datasets lack the ability to handle variations such as changes in facial angle, occlusion, illumination, or background complexity, which occur in uncontrolled environments.

To solve this problem, several studies have proposed using real facial images obtained from uncontrolled environments as the training dataset for gender classifiers [5,6,7]. In many cases, the Gallagher dataset, which is built from real facial images obtained from Flickr, is used as the uncontrolled dataset [7]. Pablo Dago-Cases et al. [6] used 7380 facial images each of males and females from the Gallagher dataset as a training dataset, and conducted an experiment with LFW (Labeled Faces in the Wild), which is composed of actual facial images from uncontrolled environments [9].

Both the Gallagher dataset, which has been widely used as a training dataset for the development of gender classifiers, and the LFW dataset, which has been widely used as a testing dataset to evaluate the accuracy of gender classifiers, predominantly contain Westerners' facial images. For example, Asian faces make up less than 5% of the LFW dataset [10]. To develop a gender classifier for a specific region, for example Korea, a large quantity of actual uncontrolled facial images of Koreans should be used as the training dataset. However, a significant amount of time is needed to collect and refine uncontrolled actual facial images for a dataset. For example, collecting the Gallagher dataset took roughly 270 days, since the images needed to be filtered to eliminate duplicates. Moreover, LFW has been collecting and refining facial images from the Internet with the Viola-Jones face detector since 2007 [11]. Therefore, a quicker method of collecting actual facial images for a dataset is needed to develop a gender classifier [12].

Most gender classifiers have lower classification accuracy on actual uncontrolled test datasets than on controlled ones. For example, Pablo Dago-Cases et al. [6] used LBP (Local Binary Patterns) to extract feature information from the Gallagher dataset, a well-known uncontrolled facial image set, and developed a gender classifier using the SVM (Support Vector Machine) machine learning scheme. When tested on the LFW dataset, the classifier achieved an accuracy rate of 89.77%. In contrast, Nesli Erdoğmus et al. [4] and Erno Mäkinen et al. [13] obtained classification accuracy rates of 90.68% and 93.33%, respectively, using facial images obtained from a controlled environment, FERET [14], for their training and test datasets. A gender classifier developed with an uncontrolled training dataset should therefore be improved to reach accuracy similar to that of a classifier developed with a controlled training dataset.

Therefore, this study proposes the following two strategies to solve the abovementioned problems. First, this study proposes using a Facebook dataset as the uncontrolled training dataset of actual facial images for developing a gender classifier. Actual facial images of people registered as Facebook friends are collected and used as a training dataset. This method has the advantage of collecting uncontrolled actual facial images of people from a particular region in less time than was needed for the Gallagher dataset. Second, a weighted bagging gender classification scheme, which can achieve better performance than previous methods, is proposed and trained with the collected Facebook dataset. The scheme uses the LBP (Local Binary Patterns) [15], Gabor wavelets [16], and HOG (Histogram of Oriented Gradients) [17] algorithms to extract unique features from facial images, and combines the resulting classifications into a final gender decision. To improve the accuracy of the final decision, the gender probability values calculated by each classifier are used as the weighting factors.

The rest of the paper is organized as follows. Section 2 introduces previous uncontrolled datasets of facial images and appearance-based feature extraction methods. Section 3 explains the procedure for constructing a Facebook dataset. Section 4 discusses the weighted bagging gender classification scheme, and Sect. 5 discusses the experiments and their results. Lastly, Sect. 6 concludes with plans for future studies.

2 Related works

2.1 Facial image dataset from uncontrolled environment

The Gallagher dataset [7] was obtained by refining facial image data from pictures that were searched for and collected from Flickr using specific terms such as marriage, bride, groom, and family. A maximum of 100 pictures were collected per day to prevent duplicated images, and the whole data collection procedure took 270 days. After the refining procedure, a total of 28,231 facial images were collected for the Gallagher dataset.

LFW (Labeled Faces in the Wild) [9] is a dataset composed of facial images with uncontrolled facial angle, partial occlusion, illumination, and complex background. It contains a total of 13,233 images collected from web-based articles. The LFW dataset was used in the uncontrolled gender classification protocol of BeFIT (Benchmarking Facial Image Analysis Technologies), an international collaboration for standardizing the evaluation of facial image analysis technologies [10]. Hence, the LFW dataset is widely used to evaluate the gender classification accuracy of facial images.

Fig. 1 LBP histogram sequence of a facial image

Fig. 2 Convolution result of a facial image and 40 Gabor wavelets

2.2 Appearance-based feature extraction

Appearance-based feature extraction approaches have been widely used to differentiate the unique features of facial images [15,16,17]. These approaches include LBP (Local Binary Patterns) [15], Gabor wavelets [16], and HOG (Histogram of Oriented Gradients) [17], which process image pixel information to extract unique features for gender classification. LBP, proposed by Timo Ojala et al. [15], has been used in face recognition and image analysis for its robustness against various illumination conditions [18]. Generally, LBP converts the relative intensity differences between each pixel and its surrounding \(3 \times 3\) neighborhood into a binary code that is used as a descriptor. Two extensions of LBP have been proposed by Timo Ojala et al. [19]. First, LBP with an adjustable number of sampling points around the center pixel and an adjustable radius was proposed to extract more distinctive features for better differentiation. Second, LBP with uniform patterns was proposed to reduce the computational cost considerably without losing a significant amount of information. An LBP descriptor is usually written as \(LBP_{P, R}^{u2} \), where P is the number of sampling points on a circle of radius R, and u2 denotes uniform patterns.

A facial image is represented by a histogram of the binary codes that an LBP descriptor assigns to its pixels. The facial image is first divided into \(k \times k\) non-overlapping domains to preserve its regional shape information. The LBP histograms of the divided domains are then concatenated serially to make a single LBP HS (histogram sequence). Figure 1 shows a descriptor formed from the LBP histograms of a facial image.
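
The histogram sequence described above can be sketched as follows. This is a minimal illustration rather than the implementation used in this study: it assumes the basic \(3 \times 3\) neighborhood (P = 8, R = 1, no interpolation), a 59-bin uniform-pattern histogram per domain, and hypothetical function names (`uniform_lookup`, `lbp_codes`, `lbp_histogram_sequence`).

```python
import numpy as np

def uniform_lookup():
    """Map each 8-bit LBP code to one of 59 labels: 58 uniform patterns plus one shared bin."""
    table = np.zeros(256, dtype=np.int64)
    next_label = 0
    for code in range(256):
        bits = [(code >> i) & 1 for i in range(8)]
        # Count 0/1 transitions around the circular bit string.
        transitions = sum(bits[i] != bits[(i + 1) % 8] for i in range(8))
        if transitions <= 2:
            table[code] = next_label
            next_label += 1
        else:
            table[code] = 58  # all non-uniform codes share the last bin
    return table

def lbp_codes(image):
    """Basic 3x3 LBP codes (assumed P=8, R=1) for the interior pixels of a grayscale image."""
    img = image.astype(np.int32)
    h, w = img.shape
    center = img[1:-1, 1:-1]
    # Neighbour offsets, visited clockwise starting from the top-left pixel.
    offsets = [(-1, -1), (-1, 0), (-1, 1), (0, 1), (1, 1), (1, 0), (1, -1), (0, -1)]
    codes = np.zeros_like(center)
    for bit, (dy, dx) in enumerate(offsets):
        neighbour = img[1 + dy:h - 1 + dy, 1 + dx:w - 1 + dx]
        codes |= (neighbour >= center).astype(np.int32) << bit
    return codes

def lbp_histogram_sequence(image, k=8):
    """Concatenate the uniform-LBP histograms of k x k non-overlapping domains."""
    labels = uniform_lookup()[lbp_codes(image)]
    histograms = []
    for rows in np.array_split(labels, k, axis=0):
        for block in np.array_split(rows, k, axis=1):
            histograms.append(np.bincount(block.ravel(), minlength=59))
    return np.concatenate(histograms)
```

With k = 8, the resulting descriptor has 8 × 8 × 59 = 3776 entries regardless of the image size.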

Since the frequency and orientation representations of Gabor wavelets are similar to those of the human visual system, Gabor wavelet based features have been used in the areas of texture segmentation, handwritten numeral recognition, and fingerprint recognition [20]. Moreover, since Gabor functions offer optimized resolution in the spatial and frequency domains, they can be used to extract the holistic form of an image [21]. The Gabor representation of a facial image is obtained by convolving the facial image with Gabor wavelets, and can be defined as follows.

$$\begin{aligned} O_{\mu ,\upsilon } \left( x \right) =I*\psi _{\mu ,\upsilon } \end{aligned}$$

Here, I represents the pixel values of a facial image, and \(\psi _{\mu ,\upsilon }\) represents a Gabor wavelet with orientation \(\mu \) and scale \(\upsilon \). \(O_{\mu ,\upsilon } (x)\) represents the convolution output of the facial image and the Gabor wavelet with that orientation and scale. Most research on facial recognition uses Gabor wavelets with scales \(\upsilon \in \{0, \ldots ,4\}\) and orientations \(\mu \in \{0, \ldots ,7\}\) to extract distinct characteristics from an image [21]. Figure 2 shows the convolution of one facial image with 40 Gabor wavelets, which yields distinct features that represent the facial image through edge information at different scales and orientations. All the features are concatenated to obtain an augmented feature vector, which is used as a descriptor to represent the facial image.
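
As a hedged sketch of how the 40 convolution outputs \(O_{\mu ,\upsilon }\) could be turned into one feature vector, the code below builds a bank of 5 scales and 8 orientations and concatenates downsampled response magnitudes. The kernel formula and its parameters (`k_max`, `f`, `sigma`, kernel size), the use of magnitudes, and the downsampling step are assumptions taken from common practice in Gabor-based face analysis, not values stated in this paper.

```python
import numpy as np

def gabor_kernel(mu, nu, size=31, k_max=np.pi / 2, f=np.sqrt(2), sigma=2 * np.pi):
    """Complex Gabor wavelet for orientation index mu and scale index nu (assumed parameters)."""
    # Wave vector written as a complex number: magnitude from the scale, angle from the orientation.
    k = (k_max / f ** nu) * np.exp(1j * np.pi * mu / 8)
    half = size // 2
    y, x = np.mgrid[-half:half + 1, -half:half + 1]
    k_sq = np.abs(k) ** 2
    envelope = (k_sq / sigma ** 2) * np.exp(-k_sq * (x ** 2 + y ** 2) / (2 * sigma ** 2))
    # Oscillating carrier minus a constant term that removes the DC component.
    carrier = np.exp(1j * (k.real * x + k.imag * y)) - np.exp(-sigma ** 2 / 2)
    return envelope * carrier

def convolve_same(image, kernel):
    """FFT-based linear convolution, cropped to the size of the image."""
    h, w = image.shape
    kh, kw = kernel.shape
    shape = (h + kh - 1, w + kw - 1)
    full = np.fft.ifft2(np.fft.fft2(image, shape) * np.fft.fft2(kernel, shape))
    top, left = (kh - 1) // 2, (kw - 1) // 2
    return full[top:top + h, left:left + w]

def gabor_features(image, step=8):
    """Concatenate downsampled magnitudes of the 5 x 8 = 40 Gabor responses."""
    image = image.astype(np.float64)
    parts = []
    for nu in range(5):          # scales 0..4
        for mu in range(8):      # orientations 0..7
            response = np.abs(convolve_same(image, gabor_kernel(mu, nu)))
            parts.append(response[::step, ::step].ravel())
    return np.concatenate(parts)
```

Downsampling keeps the augmented vector manageable: an 80 × 80 image with `step=8` yields 40 × 100 = 4000 values instead of 256,000.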

HOG (Histogram of Oriented Gradients) [17], known as an efficient algorithm for detecting pedestrians, is robust to changes in orientation and illumination because it encodes the gradient vector information of the edge pixels in each divided domain [22]. To apply the HOG algorithm to a facial image, the image is first divided into uniformly sized cells, and a histogram of edge pixel orientations is calculated for each cell, as in Fig. 3. The histograms of all cells are then concatenated into one histogram sequence, which is used as the descriptor of the facial image.
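
The cell-wise histogram computation can be illustrated with the simplified sketch below. It assumes 8 × 8 pixel cells, 9 unsigned orientation bins, and a single global normalization; the overlapping block normalization of the full HOG algorithm is deliberately omitted, and the function name `hog_descriptor` is hypothetical.

```python
import numpy as np

def hog_descriptor(image, cell=8, bins=9):
    """Simplified HOG: one magnitude-weighted orientation histogram per cell, concatenated."""
    img = image.astype(np.float64)
    gy, gx = np.gradient(img)                 # derivatives along rows and columns
    magnitude = np.hypot(gx, gy)
    # Unsigned gradient orientation in [0, 180) degrees.
    angle = np.rad2deg(np.arctan2(gy, gx)) % 180.0
    # Clamp guards against an angle that rounds to exactly 180 degrees.
    bin_index = np.minimum((angle / (180.0 / bins)).astype(int), bins - 1)
    rows, cols = img.shape[0] // cell, img.shape[1] // cell
    hist = np.zeros((rows, cols, bins))
    for r in range(rows):
        for c in range(cols):
            window = (slice(r * cell, (r + 1) * cell), slice(c * cell, (c + 1) * cell))
            hist[r, c] = np.bincount(bin_index[window].ravel(),
                                     weights=magnitude[window].ravel(),
                                     minlength=bins)
    feature = hist.ravel()
    norm = np.linalg.norm(feature)
    return feature / norm if norm > 0 else feature
```

For an 80 × 80 image this produces 10 × 10 cells and a 900-dimensional histogram sequence.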

Fig. 3 Facial image with the HOG algorithm applied

Fig. 4 Comprehensive Facebook dataset construction procedure

3 Facebook dataset of uncontrolled actual facial images

3.1 Construction of a Facebook dataset

Facebook is currently one of the most popular social network services. Facebook users upload information such as personal history, academic background, hometown, and photos to share with friends. In particular, since numerous images taken mostly in uncontrolled environments are uploaded, these images can be used to construct a training dataset of actual facial images with uncontrolled backgrounds.

A dataset composed of uncontrolled actual facial images collected from Facebook can be constructed through the following procedure (Fig. 4). First, publicly uploaded photos are collected using the Facebook API provided by Facebook and a developed application. The identifiers of friends and photos are then used to exclude duplicated images. The remaining photos are included in the Facebook dataset after the preprocessing procedure described in Subsection 3.2.
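
The duplicate-exclusion step can be sketched as below. The record layout is purely hypothetical: the field names `friend_id` and `photo_id` stand in for whatever identifiers the collection application stores, and no actual Facebook API call is shown.

```python
def deduplicate(photos):
    """Keep the first occurrence of each (friend identifier, photo identifier) pair."""
    seen = set()
    unique = []
    for photo in photos:
        key = (photo["friend_id"], photo["photo_id"])  # hypothetical field names
        if key not in seen:
            seen.add(key)
            unique.append(photo)
    return unique

# Hypothetical records, for illustration only.
collected = [
    {"friend_id": "f1", "photo_id": "p1"},
    {"friend_id": "f1", "photo_id": "p1"},  # same photo collected twice
    {"friend_id": "f2", "photo_id": "p2"},
]
remaining = deduplicate(collected)
```

Because the check is a set lookup, filtering cost grows linearly with the number of collected photos.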

3.2 Preprocessing phase

Photos of registered Facebook friends taken in actual environments can be collected with the Facebook API and the developed application program. The preprocessing shown in Fig. 5 was applied so that the collected image data could be used as a gender classification training dataset.

Fig. 5 Preprocessing for construction of a Facebook facial image dataset

Table 1 Comparison of the Facebook and Gallagher datasets

Fig. 6 Classification example by the weighted bagging gender classifier

First, the facial images were extracted from the collected photos. The Viola-Jones face detector [11], a widely used face detection algorithm known for its speed, was used for this purpose. Despite the use of the face detector, many false positive images remained in the dataset. To exclude such images, an eye detector provided by OpenCV [23] was used to select only facial images in which both eyes were open. Secondly, the facial images were aligned vertically to improve the accuracy of the gender classifier [24], because training on facial images in which the eyes, eyebrows, and nose appear at different locations degrades classifier accuracy. To keep the size of the facial images constant, the facial region was cropped to 91\(\times \)127 pixels. Thirdly, since contrast and brightness vary among facial images taken in uncontrolled environments, histogram equalization was applied to each image to standardize contrast and brightness. A bilateral filter, which smooths an image while keeping its edges sharp, was then used to remove the noise introduced by equalization. Lastly, an elliptical mask was applied to eliminate clutter around the face. As the final step of constructing the Facebook dataset, the remaining false positive images were removed manually. This preprocessing normalized the Facebook facial images for use as a training dataset for the development of a gender classifier.
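
Two of the normalization steps, histogram equalization and elliptical masking, can be sketched with plain NumPy as follows. This is an illustrative stand-in and not the code used in this study: face and eye detection, alignment, and the bilateral filter are omitted, the crop is assumed to be 91 pixels wide by 127 pixels high, and the function names are hypothetical.

```python
import numpy as np

def equalize_histogram(gray):
    """Global histogram equalization for an 8-bit grayscale image."""
    hist = np.bincount(gray.ravel(), minlength=256)
    cdf = hist.cumsum()
    cdf_min = cdf[cdf > 0][0]
    if cdf[-1] == cdf_min:          # constant image: nothing to equalize
        return gray.copy()
    lut = np.round((cdf - cdf_min) / (cdf[-1] - cdf_min) * 255.0)
    return np.clip(lut, 0, 255).astype(np.uint8)[gray]

def elliptical_mask(height, width):
    """Boolean mask that is True inside the ellipse inscribed in the image."""
    y, x = np.ogrid[:height, :width]
    cy, cx = (height - 1) / 2.0, (width - 1) / 2.0
    return ((y - cy) / (height / 2.0)) ** 2 + ((x - cx) / (width / 2.0)) ** 2 <= 1.0

def normalize_face(gray, fill=0):
    """Equalize contrast, then blank out everything outside the elliptical face region."""
    out = equalize_histogram(gray)
    out[~elliptical_mask(*gray.shape)] = fill
    return out
```

Applied to a 127 × 91 array (rows × columns), `normalize_face` returns an image of the same size whose corners are set to the fill value.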

3.3 Facebook dataset

Images collected by the developed application program based on the Facebook API went through the preprocessing phase, yielding a total of 28,235 facial images from uncontrolled environments for the Facebook dataset. The number of facial images in the Facebook dataset is similar to that of the Gallagher dataset. However, whereas the Gallagher dataset was collected at a maximum of 100 photos per day over a total of 270 days, the Facebook dataset took only 31 days to complete, including the exclusion of redundant images through the identifiers of friends and photos and the manual exclusion of false positive images. Table 1 compares the data and the collection duration of the Facebook dataset and the Gallagher dataset. To compare the collection durations, the number of facial images in the Facebook dataset was set close to that of the Gallagher dataset, even though more facial images could be collected from Facebook. Therefore, a Facebook dataset may be used to swiftly develop a region-specific gender classifier by collecting uncontrolled actual facial images for a training dataset.

4 A weighted bagging gender classification scheme

The bagging classification scheme is a widely used method that collects the results of several classifiers and determines a final classification result from them. Such a scheme typically applies majority voting to the results of the element classifiers, selecting the most frequent result. However, majority voting has the disadvantage of disregarding how reliable each classification result is. The weighted bagging gender classification scheme was used to compensate for this drawback. The weight of each element classifier and its decision can be determined either statically or dynamically. In the static method, each element classifier is assigned a weight based on its gender classification ability, and that weight is applied to all images. In the dynamic method, the gender probability value computed for each image is treated as a certainty factor and used as the weight for deciding the gender. It is generally known that the results of the element classifiers can be combined better by assigning different weights based on the certainty value of each classification than by using a fixed weight for all images; therefore, this study used the dynamic weighting method.

In the weighted bagging gender classification scheme, the characteristic features of the training dataset were first extracted using the LBP, Gabor wavelets, and HOG algorithms, which are commonly used for face recognition among the appearance-based feature extraction approaches. A voting-based classification scheme generally needs diverse element classifiers to obtain high accuracy. LBP extracts local texture information, Gabor wavelets extract global form information, and HOG extracts the edge orientation of sub-regions from the facial images. The features extracted by each of these three algorithms are used to develop an individual gender classifier using SVM, which is widely used for binary classification. The final gender decision is made through weighted voting on the probability values of the classification results of each classifier. The gender probability values calculated by the classifiers are summed per gender and normalized, and the gender with the higher value is selected as the final decision. As an example, in Fig. 6, although 2 out of 3 element classifiers classified the male facial image as female, the weighted bagging gender classification scheme correctly classifies the face as male through this stepwise procedure.
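
The difference between majority voting and the dynamic weighted voting described above can be sketched as follows. The probability values are hypothetical and chosen only to reproduce the kind of situation shown in Fig. 6; they are not the values from the figure, and the function names are illustrative.

```python
def majority_vote(probabilities):
    """probabilities: list of (p_male, p_female) pairs, one per element classifier."""
    male_votes = sum(1 for p_male, p_female in probabilities if p_male > p_female)
    return "male" if male_votes > len(probabilities) / 2 else "female"

def weighted_vote(probabilities):
    """Sum the probability values per gender, normalize, and pick the larger score."""
    male_score = sum(p_male for p_male, _ in probabilities)
    female_score = sum(p_female for _, p_female in probabilities)
    total = male_score + female_score
    male_score, female_score = male_score / total, female_score / total
    label = "male" if male_score >= female_score else "female"
    return label, male_score, female_score

# Hypothetical outputs of the LBP, Gabor, and HOG classifiers for one male face.
example = [(0.45, 0.55), (0.40, 0.60), (0.95, 0.05)]
by_majority = majority_vote(example)            # two of three classifiers say female
by_weight, male_p, female_p = weighted_vote(example)
```

In this example two weakly confident classifiers outvote one highly confident classifier under majority voting, whereas the weighted scheme yields normalized scores of about 0.6 for male and 0.4 for female.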

5 Experiment and analysis

5.1 Parameter value decision experiment

A balanced training dataset of 11,463 images each for males and females was selected from the Facebook dataset to develop a classifier with the weighted bagging gender classification scheme. To find the parameter values needed for an optimally performing gender classifier on this dataset, the following experiments were conducted with 10-fold cross-validation to evaluate classifier performance.

First, the SVM classifiers had to be optimized to obtain better performance from the weighted bagging gender classification scheme. The performance of an SVM classifier is affected by the parameter values used in its kernel. More specifically, the RBF (Radial Basis Function) kernel is greatly affected by the values of the gamma and cost parameters [25]. Therefore, the grid search provided by LIBSVM [26] was used to find the parameter values that give optimal SVM performance, and these values were used in the experiments.
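
The idea of the grid search can be sketched independently of any SVM library. In the sketch below, `evaluate` is a hypothetical callable that would return the cross-validation accuracy for a given cost and gamma; the exponent ranges are an assumption based on commonly used defaults for this kind of search and are not values reported in this paper.

```python
import itertools
import math

def grid_search(evaluate, log2c=range(-5, 16, 2), log2g=range(3, -16, -2)):
    """Try every (cost, gamma) pair on an exponential grid and keep the best one."""
    best = None
    for c_exp, g_exp in itertools.product(log2c, log2g):
        cost, gamma = 2.0 ** c_exp, 2.0 ** g_exp
        accuracy = evaluate(cost, gamma)   # e.g. mean 10-fold cross-validation accuracy
        if best is None or accuracy > best[0]:
            best = (accuracy, cost, gamma)
    return best

# Stand-in scoring function with a known optimum, used only to exercise the search.
def toy_score(cost, gamma):
    return -abs(math.log2(cost) - 3) - abs(math.log2(gamma) + 5)

best_accuracy, best_cost, best_gamma = grid_search(toy_score)
```

In practice `evaluate` would train and validate an RBF-kernel SVM for each pair, which is why a coarse exponential grid is preferred over a dense linear one.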

The size of the facial images used as training data influences gender classifier performance, and larger images raise the computational cost. Therefore, to find the image size at which the SVM classifiers using LBP, Gabor wavelets, and HOG perform best, the facial images were resized to \(64 \times 64\), \(80 \times 80\), and \(96 \times 96\) for the experiments. When the accuracy of each classifier was computed and the mean accuracy rates were compared across image sizes, the \(80 \times 80\) images showed the highest accuracy rate.

The performance of a classifier using the LBP algorithm is affected by the LBP radius and the number of blocks into which an image is divided. Therefore, experiments were performed to find the LBP radius and the number of image blocks that give optimal classifier performance. First, the LBP radius was varied to find the best-performing setting. Uniform patterns were used to reduce the computational cost. Figure 7 indicates that the overall gender classification accuracy of the SVM classifier with LBP is highest when the radius is 1. The best performance was obtained at a radius of 1 because \(LBP_{8,1}^{u2} \) contains the most facial information [18].

Fig. 7 Gender classification accuracy for the SVM classifier using LBP with different radii

When LBP is used to represent a facial image, the image must be divided into \(k \times k\) domains to preserve its regional information. Since the performance of a gender classifier changes as a function of k, an experiment was conducted in which k was varied to find the value that gives optimal performance. Figure 8 shows the accuracy rate of \(LBP_{8,1}^{u2} \) as a function of k. In this experiment, k = 8 gave the best performance.

Fig. 8 Gender classification accuracy of \(LBP_{8,1}^{u2} \) as a function of k

Fig. 9 Accuracy comparisons of 3 gender classifiers

Fig. 10 Confusion matrix comparison with the LFW dataset

Fig. 11 Confusion matrix comparison with only Koreans from the LFW dataset

5.2 Accuracy analysis for the weighted bagging gender classifier

The widely used LFW dataset, which serves as a standard protocol for gender classification of facial images taken in uncontrolled environments, was used to conduct the experiments objectively and to analyze the accuracy of the proposed gender classifier. The preprocessing described in Subsection 3.2 was applied so that the Facebook and LFW datasets could be compared on equal terms. As a result, 7586 male and 2200 female facial images from the LFW dataset were selected as the final test dataset. Although the final test dataset contains considerably more male images than female images, the ratio was not artificially altered, to allow a more accurate comparison with previous studies that used LFW with the same imbalance [6].

To obtain good overall accuracy, the weighted bagging gender classification scheme requires element classifiers with high gender classification accuracy. The parameter values obtained in Subsection 5.1 were used with both linear and RBF kernels to develop the SVM classifiers with LBP, Gabor wavelets, and HOG. When the linear and RBF kernel based SVM classifiers were compared using the LFW dataset as a test dataset, the accuracy of the SVM classifier with the RBF kernel was about 3% higher than that of the SVM classifier with the linear kernel. Therefore, the SVM classifier with the RBF kernel was used in the weighted bagging gender classification scheme.

First, the performances of five classifiers were compared: the 3 element gender classifiers using the LBP, Gabor wavelets, and HOG features, respectively; a gender classifier that fuses the results of the 3 element classifiers by general majority voting; and a gender classifier based on the weighted bagging model proposed in this study. Figure 9 shows that the majority voting model and the weighted bagging model, which fuse the results of the element classifiers, achieved higher accuracy rates than the element classifiers that each use a single feature. Their accuracy rates were similar to or greater than the 89.77% accuracy of a classifier trained with the Gallagher dataset and evaluated with LFW [6]. This suggests that a gender classifier trained on a Facebook dataset is effective.

To compare the classification accuracy of the majority voting model and the weighted bagging model more precisely, confusion matrices were constructed as in Fig. 10. According to the matrices, the weighted bagging model tended to be more accurate than the majority voting model for both males and females. Moreover, additional analysis showed a total of 120 classification mismatches between the two models, with 47 mismatches for the weighted bagging model and 73 for the majority voting model, indicating superior accuracy of the weighted bagging model.
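
The kind of comparison reported above can be reproduced with a few lines of code. The sketch below builds a confusion matrix for each model and counts, among the images on which two models disagree, how many each model misclassifies; the labels are toy data invented for illustration and are unrelated to the experimental results.

```python
import numpy as np

def confusion_matrix(actual, predicted, labels=("male", "female")):
    """Rows are actual genders, columns are predicted genders."""
    index = {label: i for i, label in enumerate(labels)}
    matrix = np.zeros((len(labels), len(labels)), dtype=int)
    for a, p in zip(actual, predicted):
        matrix[index[a], index[p]] += 1
    return matrix

def per_class_accuracy(matrix):
    """Fraction of each actual gender that was classified correctly."""
    return matrix.diagonal() / matrix.sum(axis=1)

def disagreement_errors(actual, pred_a, pred_b):
    """Among images where the two models disagree, count the errors of each model."""
    errors_a = errors_b = 0
    for a, pa, pb in zip(actual, pred_a, pred_b):
        if pa != pb:
            errors_a += int(pa != a)
            errors_b += int(pb != a)
    return errors_a, errors_b

# Toy labels, for illustration only.
actual   = ["male", "male", "male", "female", "female", "male"]
majority = ["male", "female", "female", "female", "male", "male"]
weighted = ["male", "male", "female", "female", "female", "male"]
cm_majority = confusion_matrix(actual, majority)
cm_weighted = confusion_matrix(actual, weighted)
weighted_errors, majority_errors = disagreement_errors(actual, weighted, majority)
```

For a two-class problem, exactly one of the two models is wrong on every image where they disagree, so the two error counts always sum to the number of disagreements.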

To analyze the feasibility of using a Facebook dataset as training data for a gender classifier for a specific region such as Korea, a second assessment was conducted using Koreans, together with Japanese and Chinese people of similar appearance and skin tone, selected from the final test dataset. This test dataset included 180 males and 53 females. The confusion matrices of the test results of the majority voting model and the weighted bagging model are shown in Fig. 11.

Compared with the confusion matrices of Fig. 10, which used the entire LFW test dataset, Fig. 11 shows better gender classification accuracy with only Koreans, especially for females. Analysis of the images indicated that the distinctive appearance and makeup style of Korean females led to a comparatively higher gender classification accuracy. These results suggest that a gender classifier trained on a Facebook dataset containing a larger number of Korean female images can classify such images accurately. This study therefore showed that a Facebook dataset of easily obtainable uncontrolled facial images from a specific region is effective for developing a gender classifier for that region.

6 Conclusions

This study proposed collecting a large training dataset of uncontrolled face images from Facebook to develop an appropriate gender classifier for a specific region. It also proposed a weighted bagging gender classification scheme, which votes on the gender classification results of classifiers based on the LBP, Gabor wavelets, and HOG algorithms to increase the gender recognition accuracy for actual facial images taken in uncontrolled environments. The probability values calculated by each classifier were used for the weighted voting to increase the classification accuracy. The classifiers in the scheme were trained on uncontrolled actual facial images collected with the Facebook API and tested on the LFW dataset, which is generally used as a test dataset, and achieved a comparatively high accuracy rate of 94.68%. Future studies should consider collecting useful information other than facial images through the Facebook API to further increase the gender classification accuracy.