Evaluation of different chrominance models in the detection and reconstruction of faces and hands using the growing neural gas network

Physical traits such as the shape of the hand and face can be used for human recognition and identification in video surveillance systems and in biometric authentication smart card systems, as well as in personal health care. However, the accuracy of such systems suffers from illumination changes, unpredictability, and variability in appearance (e.g. occluded faces or hands, cluttered backgrounds, etc.). This work evaluates different statistical and chrominance models in different environments with increasingly cluttered backgrounds where changes in lighting are common and with no occlusions applied, in order to get a reliable neural network reconstruction of faces and hands, without taking into account the structural and temporal kinematics of the hands. First a statistical model is used for skin colour segmentation to roughly locate hands and faces. Then a neural network is used to reconstruct in 3D the hands and faces. For the filtering and the reconstruction we have used the growing neural gas algorithm which can preserve the topology of an object without restarting the learning process. Experiments conducted on our own database but also on four benchmark databases (Stirling’s, Alicante, Essex, and Stegmann’s) and on deaf individuals from normal 2D videos are freely available on the BSL signbank dataset. Results demonstrate the validity of our system to solve problems of face and hand segmentation and reconstruction under different environmental conditions.


Introduction
Over the last decades, there has been an increasing interest in using neural networks and computer vision techniques to allow users to directly explore and manipulate objects in a natural and intuitive environment without the use of electromagnetic tracking systems. Such a sensor-free human-machine interaction system is simpler and more flexible and therefore more adaptable for a broad range of applications in all aspects of life in a modern society: from gaming and robotics to medical tasks. Considering recent progress in the computer vision field, there has been an increasing interest in the medical domain [55], especially in relation to screening or assessment of acquired neurological impairments associated with motor changes in older individuals, such as dementia, stroke and Parkinson's disease. Deploying hand gesture recognition or hand trajectory tracking systems is one of the most practical approaches, which is also attributed to their natural and intuitive quality. Moreover, with the recent rise of non-intrusive sensors (e.g. Microsoft Kinect, Leap motion) gesture recognition and face detection have added an extra dimension to human-machine interaction. However, the images captured of hand gestures, which are effectively a 2D projection of a 3D object, can become highly complex for any recognition system. Systems that follow a model-based method [1,46] require an accurate 3D model that efficiently captures the hand's articulation in terms of its high degrees of freedom (DOF) and elasticity. The main drawback of a model-based method is that it requires massive calculations, making it unrealistic for real-time implementation. Since this method is too complicated to implement, the most widespread alternative is the feature-based method [26] where features such as the geometric properties of the hand or face are analysed using either neural networks (NNs) [47,52] or stochastic models such as hidden Markov models (HMMs) [11,49]. This is feasible because of the emergence of cheap 3D sensors capable of providing a real-time data stream and therefore enabling feature-based computation of three-dimensional environment properties like curvature, an approach closer to human learning procedures.
Another approach for both faces and hands is to use a skin colour classifier [27]. Colour processing is much faster than processing other facial features. Under certain lighting conditions, colour is orientation invariant. It reduces the search space for human targets by segmenting images into skin and non-skin regions based on pixel colour. However, tracking human faces using colour as a feature has several problems: the colour representation of a face obtained by a camera is influenced by many factors (luminance, object movement, etc.); different cameras produce significantly different colour values, even for the same person under the same lighting conditions; and skin colour differs from person to person. Nevertheless, many researchers have worked with skin colour segmentation. Ghazali et al. [17] proposed a skin Gaussian model in YCgCr colour space for detecting human faces. However, the technique still produces false positives in complex backgrounds. Subasic et al. [45] used mean shift and AdaBoost to segment and label human faces in complex background images. However, the image database they used for evaluation testing consists of just a single frontal face detection. Khan et al. [24] noted that detection rate is dependent on skin colour selection, which can be improved by using an appropriate lighting correction algorithm. Zakaria and Suandi [54] reported that skin colour detection failure due to illumination effects increases the false positive rate. Additionally, the processing time also increases because many face candidates are sent to the classifier for verification purposes. Yan et al. [51] proposed a hierarchy based on use of a structure model and structural support vector machine (SVM) in learning to handle global variation in the appearance of human faces. However, a hierarchical structure approach in face detection architecture needs integration between one or more classifications, and this increases the overall processing time.
In order to use colour as a feature for face or hand tracking, we have to solve these problems. In the learning framework, the initialisation of the object is crucial. The main approach is to find a suitable means of segmentation that separates the object of interest from the background. While a great deal of research has been focused on efficient detectors and classifiers, little attention has been paid to efficiently acquiring and labelling suitable training data. One method is to partition the image into regions. Each region of interest is spatially contiguous and the pixels in that region are of the same kind with respect to the predefined criteria. However, the segmentation process itself may be time-consuming as it is usually performed manually [3]. Obtaining a set of training examples automatically is a more difficult task. Existing approaches to minimise labelling effort [28,33,39,40] use a classifier which is trained in a small number of examples. The classifier is then applied by means of a training sequence, and the detected patches are added to the previous set of examples. However, to learn the model for feature position and appearance, a great number (e.g., 10,000 images) of hand-labelled face images are needed. A further disadvantage of these approaches is that either manual initialisation [21] or a pre-trained classifier is needed to initialise the learning process. With a sequence of images, these disadvantages can be avoided by using an incremental model.
In this work, what we are interested in is the accurate initialisation of the first frame of the neural network model. If this is achieved, by performing an accurate segmentation in relation to the background so that the regions that represent the foreground (objects of interest) can be classified by the learning model, then the network preserves the topology in consecutive frames and acceleration to the network is achieved. The key to successful segmentation relies on reducing meaningless image data. We achieve automatic segmentation by taking into consideration that human skin has a relatively unique colour, and we apply appropriate parametric skin distribution modelling. Although the use of the SOM-based techniques of neural gas (NG) [31], growing cell structures (GCS) [13] and growing neural gas (GNG) [14] for various data inputs has already been studied and successful results have been reported [8,19,20,37,44,46], some limitations still persist. Most of these studies have assumed noise-free environments and low complexity distributions. Therefore, applying these methods to challenging real-world data obtained using noisy 2D 1 and 3D 2 sensors is our main study. These particular noninvasive sensors have been used in the associated experiments and are typical, contemporary technology.
The remainder of this paper is organised as follows. Section 2 presents the initialisation of the object using probabilistic colour models. Section 3 provides a description of the original GNG algorithm and the modifications for topology preservation in 3D. In Sect. 4, a set of experimental results is presented for various datasets before conclusions are drawn in Sect. 5.

Approach and methodology
The growing neural gas (GNG) [14] is an incremental neural model able to learn the topological relations of a given set of input patterns by means of competitive Hebbian learning [30]. Unlike other methods, the incremental character of this model avoids the necessity of previously specifying the network size. On the contrary, from a minimal network size, a growth process takes place, where new neurons are inserted successively using a particular type of vector quantisation [14]. With this approach, however, the problem of background modelling takes central stage, where the goal is to get a segmentation of the background, i.e. the irrelevant part of the scene and the foreground. If the model is accurate, the regions that represent the foreground (objects of interest) can then be extracted. This problem also plays a central role, since we are interested in setting the initial frame for the GNG algorithm.

Background modelling
We subdivide background modelling methods into two categories: (1) background subtraction methods; and (2) statistical methods. In background subtraction methods, the background is modelled as a single image and the segmentation is estimated by thresholding the background image and the current input image. Background subtraction can be done either using a frame differencing approach or using a pixel-wise average or median filter over n number of frames. In statistical methods, a statistical model for each pixel describing the background is estimated. The more the variance of the pixel values, the more accurate the multi-modal estimation. In the evaluation stage of the statistical models, the pixels in the input image are tested if they are consistent with the estimated model. The most well-known statistical models are the eigenbackgrounds [9,34], the Single Gaussian Model (SGM) [6,50] and Gaussian Mixture Models (GMM) [12,42].
The methods based on background subtraction are limited in more complicated scenarios. For example, if the foreground objects have similar colour to the background, these objects cannot be detected by thresholding. Furthermore, these methods only adapt to minor changes in environmental conditions. Changes such as turning on the light cannot be captured by these models. In addition, these methods are limited to segmenting the whole object from the background, although for many tasks, such as face recognition, gesture tracking, etc., specific parts need to be detected. Since most image sources (i.e. cameras) provide colour images, we can use this additional information in our model for the segmentation of the first image. This information can then be stored in the network structure and used to detect changes between consecutive frames.
2.1.1 Probabilistic colour models: single Gaussian-Image segmentation based on colour is studied by many researchers especially in applications of object tracking [5,7,29,41] and human-machine interaction [4,15]. Also, a great deal of research has been done in the field of skin colour segmentation [22,23,35] since the human skin can create clusters in the colour space and thus be described by a multivariate normal distribution. First, we attempt to model skin colour using a Single Gaussian distribution. For the skin domain, we have used the Stirling 3 and Essex 4 databases. With SGM, the model can be obtained via the maximum likelihood criterion which looks for the set of parameters (mean and covariance) that maximises the likelihood function. The likelihood function has a single maximum, and the estimates μ and Σ for the mean vector and the covariance matrix are obtained analytically by μ = 1 where μ is the estimated mean vector, Σ is the estimated covariance matrix, T is the number of observations in the sample set, and x t is the tth observation. The resulting Gaussian PDF that fits the data is given by where is the square Mahalanobis distance and d is the dimensionality of the Gaussian function (which is 2 in our case). Figure 1 illustrates the SGM model applied to different colour spaces (nRGB, HSV, CIE X, Y, Z, and CIE L*, a*, b*). It is evident that the SGM model covers the entire area of the distribution for both skin and background. In some colour spaces, the differentiation is greater, but the overlapping between skin and non-skin regions is sufficient to produce high false positive rate (FPR). As such, an SGM distribution cannot model all possible variation in the skin colour data. The existing approaches [6,50] were extended by using Gaussian Mixture Models [36,53].

Probabilistic colour models: Gaussian mixture model-Below
we summarise the steps involved in a MG skin colour model.

•
Firstly, the variance caused by the intensity is removed. This is achieved by normalising the data or by transforming the original pixel values into a different colour space (e.g., rg colour space [38] or HSV colour space [35]).
• Secondly, a colour histogram is computed, which is used to estimate an initial mixture model.

•
Finally, a Gaussian mixture model is estimated, which can efficiently be done by applying the iterative EM-algorithm [10].
Assume a Gaussian mixture model: where π (j) denotes the prior probability of expert j, and φ (j) = {μ (j) , Σ (j) } denotes the parameters mean μ (j) , and full-rank covariance matrix Σ (j) of the expert. The GMM's output is given by: where is the jth Gaussian density of the GMM. The method for determining the parameters of a Gaussian mixture model from a data set is based on maximising the data likelihood.
Because the likelihood is a differentiable function, it is possible to use general purpose optimisation algorithms such as the EM algorithm for fast convergence. After the initialisation of θ 0 , the EM iteration is as follows: 1.
E-step As we do not know the class labels, but do know their probability distribution, what we can do is to use the expected values of the class labels given the current parameters. For the nth iteration, we form the function Q(θ|θ n ) as follows: and define Using Bayes' theorem, we can calculate h n j x t as: which is actually the expected posterior distribution of the class labels given the observed data. In other words, the probability that x t belongs to group j given the current estimates θ n is given by h n j x t . The calculation of Q is the E-step of the algorithm and determines the best guess of the membership function h n j x t .

2.
To compute the new set of parameter values of θ (denoted as θ*), we optimise Q(θ|θ n ); such as θ* = arg max θ Q(θ|θ n ). This is the M-step of the algorithm.
• Increment n by 1 and repeat the E-step until convergence.
To determine μ (k)* , differentiate Q with respect to μ (k) and equate to zero ( which gives: To determine Σ (k)* , differentiate Q with respect to Σ (k) and equate to zero ( which gives: To determine π (k)* , maximise Q(θ|θ n ) with respect to π (k) subject to the constraint Σ j = 1 J π j = 1 which gives: The two steps are repeated until the likelihood does not change significantly or the maximum number of iterations is reached. The GMMs obtained after 5 EM iterations for the different colour spaces (nRGB, HSV, CIE X, Y, Z, and CIE L*, a*, b*) are shown in Fig. 2.
The results were obtained from a database containing approximately one half million pixels. Figures 3 and 4 show the GMM probability map for CIE L*, a*, b* skin colour against different backgrounds, which can then be used to initialise the network for the GNG algorithm. Figure 5 shows another example of the probability map for skin colour in deaf individuals, using normal 2D videos freely available from the BSL SignBank dataset, 5 which can then be implemented with the specific goal of developing an automated dementia screening toolkit. In all three examples, the colour space used to represent the input image plays an important part in the segmentation. As we have seen above, some models are more perceptually uniform than others and some separate out information such as luminance and chrominance.
At each node of the network, we experimented with more cluttered backgrounds with the perceptually uniform colour model CIE L*, a*, b*, and the non-perceptually uniform colour models, nRGB, and CIE X, Y, Z. We also experimented with the HSV colour model, which separates the brightness component from hue and saturation, to compensate for changes in illumination (Fig. 6). Unlike CIE L*, a*, b*, HSV is not perceptually mapped to the human visual system, meaning changes in colour values are not proportional to changes in the perceived significance of the change. A detailed discussion of the different colour models can be found in [23,43,48].
In order to address some of the challenges we face in real environments with changes in illumination, shadows and more cluttered backgrounds, we have tested the different chrominance models on two synthetic images we have generated with our virtual environment UnrealROX [32]. Figure 7 shows the segmentation of the skin colour after applying the four different chrominance models and getting the skin probability with a GMM. Qualitative results demonstrate CIE L*, a*, b*, as the most efficient for skin segmentation since it is the most perceptually mapped to the visual system.
Given consideration of perceptual uniformity our best option is the CIE L*, a*, b* colour space. Thus, we converted the RGB values to the CIE L*, a*, b* values. The colour conversion from RGB to CIE L*, a*, b* is obtained by undergoing a linear conversion from RGB to CIE X, Y, Z, and a nonlinear conversion from CIE X, Y, Z to CIE L*, a*, b*.
where f r = r 1 3 if r > 0 . 008856 The X n , Y n and Z n refer to the CIE X, Y, Z values for a specified white point. To get the learning process started, we need a simple but robust method to obtain the topology preserving graph (TPG) with as little user intervention as possible. Henceforth, the topology preserving graph TPG = 〈N, C〉 is defined with a vertex (neurons) set N and an edge set C that connects them. The self-organising neural network growing neural gas (GNG) is used to filter out non-face candidates and detect and reconstruct the face and or the hands.

Growing neural gas (GNG) algorithm in 2D and 3D
In order to determine where to insert new neurons, local error measures are gathered during the adaptation process and each new unit is inserted near the neuron which has the highest accumulated error. At each adaptation step, a connection between the winner and the second-nearest neuron is created as dictated by the competitive Hebbian learning algorithm. This is continued until an ending condition is fulfilled, as for example evaluation of the optimal network topology based on the topographic product [16]. This measure is used to detect deviations between the dimensionalities of the network and that of the input space, detecting folds in the network and indications that it is trying to approximate to an input manifold with different dimensions. In addition, in a GNG network the learning parameters are constant in time, in contrast to other methods where learning is based on decaying parameters.
The network is specified as: • A set N of nodes (neurons). Each neuron c ∈ N has its associated reference vector w c ∈ R d . The reference vectors can be regarded as positions in the input space of their corresponding neurons.
• A set of edges (connections) between pairs of neurons. These connections are not weighted and their purpose is to define the topological structure. The edges are determined using the competitive Hebbian learning algorithm. An edge ageing scheme is used to remove connections that are invalid because of the activation of the neuron during the adaptation process.
The GNG learning is presented in Algorithm 1. With the new advances of low-cost 3D sensors, it is possible to generate more natural gesture-based 3D interactions. However, these systems need to address the problem of 3D tracking of hand joints. In [2], the original GNG algorithm is extended to perform 3D surface reconstruction by considering surface normal information during the learning process. It also modifies the original competitive Hebbian learning process by producing wire-frame 3D representations. In order to obtain 3D information in the form of point clouds, we had to project disparity information obtained from the device to the three-dimensional space using the known geometry of the sensor. The relationship between a disparity map provided by the Kinect sensor and a normalised disparity map is given by d = 1/8 · (d off − kd), where d is the normalised disparity, kd is the disparity provided by the Kinect and d off is a particular offset of a Kinect sensor. Calibration values can be obtained in the calibration step [25]. Figure 8 shows the 3D mesh created using the method discussed in [2].

3.
Decrease the error for all prototypes

end while
It can be seen how this extended algorithm is able to create a coloured 3D mesh that represents the input data. Since point clouds obtained using the Kinect are partial 3D views, the mesh obtained is not complete and therefore the model generated by the GNG is an open coloured mesh. Figures 9 and 10 show the whole process for different input samples that were synthetically generated using UnrealROX [32]. From left to right, we can see the original RGB image and its corresponding ground truth segmentation mask. The ground truth segmentation mask for the skin class and the obtained results using the proposed method (CIE L*, a*, b* Gaussian).
Finally, on the right we can see the 3D reconstruction results using as an input the skin mask, RGB and depth images. Using the GNG algorithm we are able to create a simplified coloured mesh from the input data. Since the skin mask has some false positives, the reconstructed mesh also shows small outliers.

Experiments
In this section, different experiments are shown validating the capabilities of our method (e.g. which statistical model is best for GNG reconstruction) which has also been used in 3D datasets. We tested our system on our own data set (University of Alicante, and University of Westminster) of faces and hands recorded from 15 participants. To create this data set we have recorded images over several days using a simple webcam with image resolution 800 × 600. In total, we have recorded over 112,500 frames, and for computational efficiency, we have resized the images to 300 × 225, 200 × 160, 198 × 234, and 124 × 123 pixels. In addition, experiments were conducted based on two publicly available databases, Mikkel B. Stegmann 6 and Stirling's. 7 All methods have been developed and tested on a desktop machine of 2.26 GHz Pentium IV processor. These methods have been implemented in MATLAB and C++. Figure 11 shows the ground truth model which we use to measure the error produced by our method. To measure the class probability of skin to non-skin regions we use the positive (TPR) and false positive (FPR) rates as obtained by: where TP is number of true positive (pixels correctly assigned to the skin class). FP is the number of false positives (non-skin pixels wrongly assigned to skin). S is the total of skin pixels and NS the totals of non-skin pixels. These rates TPR and FPR are calculated using both models (SGM and GMM) for all test images. Through this computation we get vector of measures P = (TPR, FPR) that express performance of the given model. In order to measure the similarity between the predicted skin colour region and the ground truth region in the two synthetic images, we applied the Intersection over Union (IoU) metric, as a standard performance measure for all the four chrominance models. The IoU metric measures the number of pixels common to both the target and prediction regions, divided by the total number of pixels present across both regions.

IoU = target∩prediction target∪prediction
We also calculated the loss function of our image segmentation tasks in the two synthetic images based on the Dice coefficient calculated as: Tables 1 and 2 show the results.
For further comparison between the SGM and the GMM model, we used ROC curves for all the test images. For drawing ROC curves, we calculated TPR and FPR for all images using different threshold (threshold value is set as per P(S|x)′). Thus by using K different thresholds, we can get K point vectors, which, when plotted, results in a ROC curve for the specific model with respect to test image. Figure 12, shows ROC curves for the test images 4 and 17, respectively, from Fig. 11. It is clear from the ROC curve that both SGM and GMM have almost the same TPR. The FPR in GMM curve (orange dotted line) becomes constant at a certain threshold, as shown by the green circle. However, FPR for SGM (blue line) keeps increasing as the threshold value decreases. FPR reaches nearly to 0.9. Several of the thresholds for images 4 and 17 are shown in Figs. 13 and 14 respectively. From the threshold images, it is very clear that SGM images have a higher FPR rate. Based on the ROC curves for all test images, it is evident that SGM have a very high FPR and thus they perform badly as compared to GMM. Additionally, for several images TPR at a given threshold for SGM was marginally low as compared to GMM. This marginally low TPR will not have any effect on the images with large skin area. Nevertheless for the images with small skin area (small faces), this low TPR can have an adverse effect. Tables 3 and 4 show calculated TPR and FPR for all four colour spaces using the SGM and GMM statistical models, respectively. For the calculations, we have used a fixed threshold value of 0.55, since it is the middle range value where we can find the highest TPR and lowest FPR for comparison purposes. It can be seen how the CIE L*, a*, b* model outperforms all other three colour spaces since it has the lowest FPR rate among the 20 images, followed by the CIE X, Y, Z. The comparisons demonstrate that GMM outperforms SGM with low FPR rate, which makes it a suitable model for image segmentation especially in cases where hands and faces are involved. This is then used in the initialisation of the first frame of the GNG algorithm. Figure 15 shows the correctly detected hands and face after applying EM to segment the skin region. The reconstruction of the 2D topology of both hands and face is done with the GNG algorithm. Figure 16 shows the 3D reconstruction of a human face acquired using the Kinect sensor (top) and the 3D reconstruction of a synthetically generated human face (bottom). Both faces were reconstructed using the GNG for 3D surface reconstruction discussed in Sect. 4. Synthetic data was generated using Blensor software [18], for simulating a virtual Kinect sensor (noise-free).
While 3D downsampling and reconstruction methods like Poisson or Voxelgrid are not able to deal with noisy data, the GNG method is able to avoid outliers and obtain an accurate representation in the presence of noise. This ability is due to the Hebbian learning rule used and its random nature that updates vertex locations based on the average influence of a large number of input patterns.

Conclusions and future work
In this paper, we have compared the performance of different probabilistic colour models and colour spaces for skin segmentation as an initialisation stage for the GNG algorithm. Based on the capabilities of GNG to readjust to new input patterns without restarting the learning process, we are interested in reducing meaningless image data by taking into consideration that human skin has a relatively unique colour and applying appropriate parametric skin distribution modelling. We concluded that GMM was superior to SGM with lower FPR rates. We also showed that CIE L*, a*, b* colour space outperforms all three other colour spaces since it has the lowest FPR rate among the dataset. **Preprocessing was also used as an initialisation stage in the 3D reconstruction of faces and hands based on the work conducted in [2]. Further work will aim at improving system performance by accelerating GPUs which can then be used for robotic system recognition. Nonetheless, we are currently working, after obtaining a clean segmentation, on hand sign trajectories in order to analyse the sign space envelope (sign trajectories/depth/speed) and facial expressions of deaf individuals. An automated screening toolkit will be beneficial not only to screening of deaf individuals for dementia, but also for assessment of other acquired neurological impairments associated with motor changes, for example, stroke and Parkinson's disease.

Funding
This work has been supported by the Spanish Government TIN2016-76515R Grant, supported with FEDER funds, the University of Alicante Project GRE16-19, the Valencian Government Project GV-2018-022, and the UK Dunhill Medical Trust Grant RPGF1802\37. Image segmentation on a different background based on skin colour information. a original input image, and b probability map for the skin colour Image segmentation by skin colour in the BSL Signbank dataset. a Original input image, and b probability map for the skin colour Skin colour images in cluttered backgrounds. Top row shows the original images followed by the various colour spaces Image segmentation on two synthetic images generated with UnrealROX [32]. From top to bottom, original input image, and probability maps for the skin colour after applying the four different chrominance models   Overview of the whole process using the first synthetic sample. Top row shows the ground truth segmentation and skin masks. Bottom part shows skin segmentation results using CIE L*, a*, b* (Gaussian). The images on the right show the error map of the computed skin mask and the 3D reconstructed mesh using the GNG algorithm  Overview of the whole process using the second synthetic sample. Top row shows the ground truth segmentation and skin masks. Bottom part shows skin segmentation results using CIE L*, a*, b* (Gaussian). The images on the right show the error map of the computed skin mask and the 3D reconstructed mesh using the GNG algorithm  GNG 3D reconstructions. Top: 3D face reconstruction from data obtained using the Kinect sensor. Bottom: 3D face reconstruction from data synthetically generated using Blensor software     Bold indicates the best achieved score