1 Introduction

Human-related tasks such as pornographic image filtering, personal identity verification, hand detection and tracking, video surveillance, face detection and tracking, image retrieval, human pose modelling, naked people detection, and facial expression analysis depend largely on skin detection algorithms to perform optimally. Existing systems for human skin pattern classification/detection suffer from certain major setbacks, including individual pixel operation (otherwise known as pixel-by-pixel (PBP) or pixel-wise (PW) operation), high rates of false hits, and poor performance, especially in predicting darker-complexioned skin. These systems still pose challenging pattern recognition tasks for computer vision and have therefore attracted a great deal of research in recent years [17, 28, 54]. Skin detection methods utilize color information from conventional color spaces. However, according to [8], there is a substantial disparity in the accuracy of classifying darker skin colors against their lighter counterparts, which demands the urgent attention of commercial companies in building genuinely fair, transparent, and accountable skin analysis algorithms. Skin detection algorithms suggest the presence of human skin in a digital image; this is an important pre-processing step for techniques such as face detection and semantic filtering of web content. According to [2], every color space contains an optimal skin detector scheme such that the performance of all the schemes is the same.

In [36], the basic steps in skin detection include the representation of image pixels in color spaces, a suitable distribution of skin and non-skin pixels, and skin colour modelling (which uses an underlying skin color distribution characteristic of a colour space to detect skin colour pixels quickly). However, human skin appearance in images is affected by various factors such as illumination, background, camera characteristics, and ethnicity; as such, skin detection using color information can be a challenging task [22, 28]. Numerous techniques exist in the literature for skin detection using color; nonetheless, owing to real-world conditions such as illumination and viewing conditions, many of these studies are limited in performance. According to [28, 53], these techniques are prone to false skin detection in most cases and are therefore unable to cope with the variety of human skin colors across different ethnic groups. Thus, in this paper, we propose a fast algorithm for human skin detection from coloured 2D images, based on an improved, combined (HSL and RGB) color model threshold and our new packed k-dimensional tree (PKT). The adopted skin colour threshold was deduced through exhaustive experimentation in toning human skin color. The procedure involves normalizing the RGB/HSL color channels of several randomly selected colored images; the final standardized RGB/HSL coordinate values yield the adopted skin color threshold. Additional comprehensive channel toning was equally applied to improve insensitivity to luminance.

For the pixel-by-pixel (PBP) problem, a common structure used for performance enhancement is the quad-tree. However, the repeated, deep, quad-tree-like partitioning is cumbersome, and in some cases quad-trees have been shown to offer poor shape analysis and poor pattern recognition performance, owing to their inability to efficiently compare two images under different translations or rotations. We therefore present the PKT to overcome most of the challenges in these structures. The PKT algorithm starts by reducing the image to only about 60% of its original size; this data reduction pre-processing step aims solely at increasing the speed of the application. The main purpose of the proposed model is to eliminate the common state-of-the-art PBP/PW approach to pixel classification in recognition procedures.

To the best of our knowledge, this is the first time a structure like the PKT has been developed. The structure shows high promise in terms of speed, low rate of false hits, reduced computational cost and complexity, and high accuracy and precision. Going by the performance comparison of existing models in [28], our experiments show that the proposed algorithm achieves a very high accuracy rate, precision, and efficiency (Table 4).

1.1 Description of color spaces (channels)

According to [2, 25, 38], the RGB (Red, Green, Blue), HSV (Hue, Saturation, Value), HSL (Hue, Saturation, Lightness) and YCbCr (Luminance, Chrominance) color models are some of the main parameters for identifying and recognizing a skin pixel. In [33], the HS color hexagon was described as what picture windows use in their color picker to display the brightest possible versions of all possible colors, based on their hue and saturation. This justifies our decision to adopt this color model as the tool of choice for skin color detection. Additionally, the characterization of the color range for skin detection is achieved by manipulating the H channel of the HS color model [13]. The values of H, S, and L are derived from the RGB coordinates of the image. The H channel of HSL is used to characterize the color range for skin detection, the S channel defines the saturation of the H pigment, and the L channel normalizes the shade or saturation of both H and S.

We have used the PKT for the classification, prediction, and recognition of human skin pixels in an image; however, we have shown that the model is robust and versatile and can be useful in many other machine learning and pattern recognition procedures, including clustering (for instance, clustering skin pixels on the face as a blob in face recognition, or cell/DNA clustering in biology for matching purposes) and the design of discovery systems (e.g., gene pattern discovery and identification in bioinformatics, data mining and knowledge discovery, etc.).

The major differences between the current study and existing models in terms of colour are:

  a. The adaptability of the hue channel to different ethnic skin colour shades, achieved through significant range normalization between these color categories;

  b. The speed-up (Table 5) of segmentation and classification procedures using a spatial model;

  c. Most importantly, the high precision and accuracy achieved by our model (Table 4).

2 Related work

Human skin related recognition and identification technologies, according to [49], have proven to work less accurately on darker skin. One reason, according to a study by [8], is that skin type classification systems are overwhelmingly designed to favour lighter-skinned subjects, with error rates of up to 34.7% and higher overall accuracy for identifying men than women. Their study established that darker-skinned females are the most misclassified group. Illumination, pose, noise, and expression are the opposing factors faced during face capture and analysis. According to [53, 55], these factors greatly affect the performance of facial recognition systems in particular; thus, among all biometric systems, the authors report that facial recognition has shown the highest false acceptance and rejection rates. Several algorithms have been proposed to solve this problem; notwithstanding, they only manage to attain slightly acceptable results characterized by high false positives [28, 54]. This means that human skin related recognition systems need better skin detection algorithms to improve their performance. Besides, great attention has been paid to real-time human detection in applications such as autonomous driving, video surveillance, and human activity understanding. However, despite a large body of work devoted to human detection in recent decades, it remains an open problem [48].

2.1 PBP (PW) operations

Most existing skin detection methods [4, 28, 51, 59] depend on building an n-dimensional histogram for pixel classification. In most cases, two histograms are constructed from sample pixels for training purposes: one for skin pixels and the other for non-skin pixels. Using these two histograms, a classification rule such as Bayes' (or any other) is applied to each pixel of the test image to complete the detection process.
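As a concrete illustration, a minimal sketch of this histogram-plus-Bayes scheme might look as follows (the function names, bin count, and choice of RGB space are our assumptions; the cited works vary in all three):

```python
import numpy as np

def build_histograms(skin_pixels, non_skin_pixels, bins=32):
    """Build normalized 3-D RGB histograms from labelled training pixels.

    skin_pixels / non_skin_pixels: (N, 3) arrays of RGB samples in 0..255.
    """
    rng = [(0, 256)] * 3
    h_skin, _ = np.histogramdd(skin_pixels, bins=bins, range=rng)
    h_non, _ = np.histogramdd(non_skin_pixels, bins=bins, range=rng)
    return h_skin / h_skin.sum(), h_non / h_non.sum()

def classify_pixel(rgb, h_skin, h_non, prior_skin=0.5, bins=32):
    """Bayes rule: label one pixel skin if P(skin|color) > P(non-skin|color)."""
    idx = tuple(int(c) * bins // 256 for c in rgb)  # histogram bin of this color
    p_skin = h_skin[idx] * prior_skin
    p_non = h_non[idx] * (1.0 - prior_skin)
    return p_skin > p_non
```

Note that the decision rule still has to be evaluated once per pixel of the test image, which is exactly the PW cost discussed below.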

Many state-of-the-art techniques for human skin detection from images depend wholly on PW operation. However, the efficacy of PW classification is limited [23]. The main objective of PW skin color detection, according to [57], is to build a decision rule that classifies each pixel individually as skin or non-skin. [23, 31] claimed that the accuracy of skin detection algorithms has remained low due to the high degree of overlap between "skin" and "non-skin" pixels. As a solution, they applied a Bayesian classifier and a connected-component algorithm to identify individual "true skin" pixels using a first posterior probability threshold. Though this method helps to improve skin classification performance, especially the false positive rate, it goes through the rigorous task of checking every pixel individually and is therefore inefficient in terms of speed.

According to [18], several state-of-the-art methods for skin detection use a single color region; in contrast, they applied genetic algorithms to determine optimal skin color regions in a selected color space, treating skin color as a union of multiple smaller CbCr color regions rather than a single region. However, even though they applied an optimization on the CbCr color model used in their work, the image pixels were, as in other systems, tested individually. The work in [30], like our proposed system, starts by reducing the size of the given image and then applies the RGB and YCbCr colour models; however, it eventually ends up processing individual candidate pixels in the skin color range to detect human skin. Like the other models presented so far, [35] offered a similar method of individual pixel examination using only the HSV color model, but applied two different types of noise filters, NOGIE (Noise Object Global Image Enhancement) and NOWGIE (Noise Object With Global Image Enhancement), for an improved result.

In [9], an adaptive neuro-fuzzy inference system (ANFIS) for skin/non-skin pixel detection was proposed. [25] presented a new threshold based on a combination of RGB, HSV, and YCbCr values for skin/non-skin pixel detection. A modified likelihood ratio, in addition to multi-scale analysis, was used for PBP skin pigment classification in [40]. Likewise, by establishing correlation rules between the chrominance components PCr and PCb of a pixel P, [17] formulated two conditions (PCr − PCb ≥ IP and |PCb − PCbs| ≤ JP) that must both hold before a single pixel P can be classified as skin, where IP = max(PCr, PCb), JP = maxDistance((PY, PCb), (PY, PCbs)), PY is the value of P on the Y component of the YCbCr space, and PCbs is an estimated value of PCb. [56] equally presented a system that uses pixel-by-pixel operation for pixel classification.

In [23], a spatial-based system for skin detection was presented. The system applies a discriminative feature space as a domain for spatial analysis of skin pixels, based on textural features extracted from skin colour probability maps. The texture is extracted in the form of seeds, taking advantage of the fact that real skin areas contain pixels with a high skin probability when an image is binarized. A distance function is applied to find the shortest routes from large seed blobs to every pixel; eventually, pixels that are not close to any of the seed blobs are rejected, and the skin regions are extracted. While this method seems promising due to its texture-based idea, it does not involve actual space partitioning, which is typical of hierarchical classification for efficient distance threshold queries; as such, query performance is impaired by the bulky distance calculations between seeds and individual pixels. Again, pixel-wise (PW) classification was applied to a hand detection procedure in [27], to find a human hand at pixel level in a video. [28, 54] also applied the PW classification method for pixel matching purposes.

2.2 Tree-based solutions

Many tree-based systems have been proposed to improve skin prediction procedures. However, as discovered in this study, these systems still suffer from a high rate of false hits; the effectiveness of our model will therefore contribute to improving the time and computational complexity of learning systems. [14], in addition to Deep Neural Network and Naïve Bayesian models, presented a decision-tree-based solution for skin detection that overcomes the challenge of color range thresholding. By calculating the probability of each pixel, their proposed equation is tested by the PW technique before a pixel is classified using a skin/no-skin decision tree. Their method performs reasonably; however, it still exhibits the shortcomings of most existing systems, including time inefficiency due to the stages involved in processing individual pixels, and inaccuracy in predicting certain skin types, as stated by the authors.
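As an illustration only (not the exact model of [14], whose features and split criteria are not reproduced here), a pixel-level skin/no-skin decision tree can be sketched with a standard library:

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

# Hypothetical toy training data: rows of [R, G, B], labels 1 = skin, 0 = non-skin.
X = np.array([[224, 180, 150], [200, 140, 120], [40, 45, 60], [30, 90, 30]])
y = np.array([1, 1, 0, 0])

clf = DecisionTreeClassifier(max_depth=5).fit(X, y)

# The PW cost the paper criticizes: every pixel of a test image must be
# pushed through the tree individually (here the image is an (H*W, 3) array).
image = np.array([[210, 170, 140], [35, 40, 80]])
print(clf.predict(image))   # per-pixel skin / non-skin labels
```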

The quad-tree structure in [40] differs greatly from what we present here: for each 32 × 32 neighborhood of pixels, if a boundary is detected, the neighborhood is divided into four sub-neighborhoods, and the modified likelihood ratio test is performed on each of them. The procedure repeats recursively until a decision is reached or the window becomes too small for a significant decision. This method may be promising, but the repeated, deep, quad-tree-like partitioning is cumbersome. The method proposed in [1] applied a Bayesian Rough Decision Tree (BRDT) classifier to improve the accuracy of human skin detection. A quadtree classified vector quantization (QCVQ) method was used in [11]; it first partitions the image into its usual quadtree segmentation and then classifies blocks into smooth and high-detail blocks. The authors claim that the scheme yields better retrieval performance than well-known vector quantization (VQ)-based image retrieval methods. However, even though quad-trees are very good on images with large areas of a single color, which compact well, they have been shown to offer poor shape analysis and poor pattern recognition performance, owing to their inability to efficiently compare two images under different translations or rotations, especially when an image has a different color for every pixel. [16, 24, 50, 58] discussed other tree-based structures similar to decision trees or quad-trees.

2.3 Super-pixels

The Super-pixel (Sp) paradigm, introduced in [39] and presented in [19, 29, 32, 47], is a pre-processing step that divides an image into adjacent regions/clusters of pixels. In most Sp-based procedures, a watershed image segmentation process is carried out on the image before creating the Sp(s). Constructing Sp(s) requires a pre-defined similarity measure based on perceptual features. The steps for creating Sp(s) with the common SLIC algorithm are:

  1. Create initial regions based on ¢ (the parameter that determines the number of Sp(s));

  2. Group the regions to cluster pixels using a similarity criterion;

  3. Modify the connections between regions.

Thus, based on ¢, the number of pixels in a Sp would be:

$$ Y = \sqrt{N / \text{¢}} $$

where N is the number of pixels in the input image, ¢ is the estimated number of super-pixels (sub-regions) to construct, and Y is the number of pixels in each Sp. This equation differs from the concept of the PKT. After constructing the Sp(s), the sums of the probabilities of their pixels (over the training set images) are computed; these sums are compared against the probability map of an input image, and the Sp(s) of the input image with lower sums are returned as the predicted cells/values. The procedure described above is entirely different from that of the PKT. Unlike the supervised/semi-supervised construction process of Sp-based systems (which starts from an initial human-marked image segmentation), the PKT is fully automatic and unsupervised: it requires no hand-annotated images, training sets, initial estimated regional clusters, or input parameter (like ¢). Rather than finding Y, the PKT computes ∆ as in the equation below (see ALGORITHM 1 for details).

$$ \Delta = \left\lceil \left(\mathrm{DL}/\mu\right)^{1/k} \right\rceil $$

In the above equation, ∆ is analogous to ¢, DL to N, and Y to μ. However, while ¢ is estimated in Sp(s), ∆ is automatically computed in the PKT from the number of dimensions of the image, or the number of principal attributes of the data table (for non-spatial data). The PKT is a multidimensional structure and applies to higher-dimensional spaces and high-dimensional datasets. Moreover, the projected number of Sp regions based on ¢ can lead to over-segmentation [19], whereas the number of PKT sub-regions ∆ is optimally computed from the fixed determinant value (μ = 25). In addition, setting initial positions for Sp regions using constant distance increments risks placing these centres on an image border, so Sp might fail to obtain a good segmentation [29]. This is not the case with the PKT: PKT cells (regions) are generated automatically and instantly through a recursive partitioning strategy. Furthermore, only 5 strategic pixels are selected from each PKT sub-region, as against computing the sum of the pixel probabilities over each Sp sub-region. Finally, using the predefined determinant μ to automatically derive the optimal number of PKT sub-regions reduces the time taken to find the skin colour pigments by eliminating the daunting search for ¢.
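To make the contrast concrete, the two quantities can be computed side by side (a minimal sketch in our notation, writing ¢ as c):

```python
import math

def superpixel_size(N, c):
    """Y = sqrt(N / c): pixels per super-pixel for a user-estimated count c."""
    return math.sqrt(N / c)

def pkt_partitions(DL, mu=25, k=2):
    """Delta = ceil((DL / mu)^(1/k)): derived from the fixed mu, no guessed c."""
    return math.ceil((DL / mu) ** (1.0 / k))

print(superpixel_size(3000, 100))  # depends entirely on the chosen c
print(pkt_partitions(3000))        # 11 for a 3000-pixel image
```

The point of the comparison is that superpixel_size needs a user-supplied c, whereas pkt_partitions derives everything from the fixed determinant μ.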

In fact, all the methods presented above differ considerably from our proposed system. One thing common among several of them is PBP examination to develop a classification rule. Even with tree-based enhancement, some adjustments are still required if performance is of interest. For example, the decision tree is only an analytical, decision, or visualization support tool and may not be proficient for multidimensional or spatial analysis. The quad-tree, as we have mentioned, is not balanced and therefore not very efficient during computation; moreover, an image with a different colour for every pixel will involve very tedious partitioning, losing its effect. Additionally, the four-children constraint and constant partitioning in quad-trees limit the proper utilization of the quadtree's leaf nodes. Besides, [51] reiterated that in neural network methods, the training stage may take a long time if the number of training patterns is very large. In addition, even though most spatial-based models for skin segmentation use the Bayesian classifier (BC), BC methods alone, according to [28], are not efficient because they cannot detect skin pixels without false alarms. Thus, to overcome the above limitations and challenges, including the complications associated with quadtree-like structures, and to reduce the use of PBP/PW methodologies, we present the PKT and the PQT.

3 Proposed system (PKT)

Our proposed tree structure is an integration of the methods described in [42, 43, 45, 46]. These materials present various related spatial indexing and modelling mechanisms, including improvement strategies for spatial structures and techniques. The current work aggregates them into an efficient technique for spatial modelling, which we apply in this paper to predicting pixels related to human skin.

The basic idea behind our technique is the fact that skin pixels are rarely isolated: once a skin pigment is encountered at a certain location, there is a high probability that the neighbouring pixels are skin as well. As such, selecting a tiny fraction of the pixels in that bounded area will most probably satisfy the skin/no-skin classification condition. We therefore build an effective k-dimensional tree structure for partitioning and indexing the pixels of an input image into sub-groups. After the partitioning, we interpolate (by inverse distance weighting, IDW) through the leaf nodes using only a few sample pixels (r out of Ω0, where Ω0 is the total number of pixels in the leaf node). Finally, only the leaf nodes where all r pixels satisfy the skin/no-skin criterion (i.e., Eq. 6 evaluates to T for those pixels) are returned.

Notice the high rate of false hits of the other combined-colour methods in Figs. 7b and 7c: most of the skin area in some images was not captured, while in other images non-skin areas were captured as skin. Our suggestion in Fig. 8 improves on these problems, and this improvement contributes to the high precision and accuracy of our spatial model.

3.1 Method description (spatial modelling)

The segmentation procedure in Section 3.4, which predicts skin and non-skin pixels in an input image based on the defined color threshold, is normally performed (in most state-of-the-art systems) by testing individual pixels (PBP or PW operations). However, this process is very slow, with a significant degree of false hits (Table 2); it can be sped up and made to perform better when enhanced by a k-dimensional data structure like the PKT, which is efficient for detecting patterns in colored k-dimensional images.

The model largely depends on an arbitrary value μ, which determines its performance. Through exhaustive experimentation, μ = 25 was established as the most fitting value for any image type and size, although this depends on the underlying task. For dimension k = 2 and an image of size DL, given that ∆ = ⌈(DL/μ)^{1/k}⌉, the number of pixels to be processed reduces from DL to ð, where ð = r × ∆ × ∆, r = 5 (the pixels selected out of Ω0), ∆ is the number of partitions in each dimension, and Ω0 is the number of data elements (pixels) in each leaf node (Ln) of the proposed tree. This means that for an image of 3000 pixels, rather than processing all pixel elements, only 605 pixels are processed, yet an efficient outcome is achieved.
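The reduction can be reproduced directly from the quantities just defined (a sketch; function and variable names are ours):

```python
import math

def pkt_workload(DL, mu=25, k=2, r=5):
    """Return (Delta, Omega0, selected pixels) for an image of DL pixels."""
    delta = math.ceil((DL / mu) ** (1.0 / k))   # partitions per dimension
    omega0 = DL // delta ** k                   # approx. pixels per leaf node
    selected = r * delta ** k                   # pixels actually classified
    return delta, omega0, selected

print(pkt_workload(3000))   # (11, 24, 605): only ~20% of the 3000 pixels
```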

[figure a: algorithm listing]

3.1.1 Spatial analysis

Elucidation

Given a set of P points/pixels in a 2-dimensional (k = 2) space of an N × N image (raster), each point pj, j = 1, 2, …, P, occupies a single location (∂j), as shown in Fig. 1a (the grid of P pixels).

Fig. 1 Sample input image: (a) pixel locations; (b) vectorized form of the input image pixel locations; (c) block diagram of our skin detection procedure

Thus, in k = 2 dimensions, we define a spatial operation for the image of the form:

$$ \mathrm{A} = f\left(\partial_1, \partial_2, \ldots, \partial_{N \times N}\right) \quad \text{for}\ \partial_j \in \mathbb{R}^k,\ j = 1, 2, \ldots, N \times N $$
(1)

Eq. 1 depicts a region of local spatial features (A) for the input image; that is, A is a function of a k-dimensional region of dispersed elements (∂j). This means that ∂j is the location of a group of features for the jth pixel. In this case, ∂j is the X, Y position of pixel pj (pjx, pjy) on the N × N image raster. Therefore, A can be represented as an N × N matrix as in Eq. 2.

$$ \mathrm{A}' = \left[\partial_1(x_1, y_1),\ \partial_2(x_2, y_2),\ \ldots,\ \partial_{N \times N}\left(x_{N \times N}, y_{N \times N}\right)\right] $$
(2)

Thus, in Eq. 3, we create a row/column vector representation of the image as in Fig. 1b, without explicitly reflecting the xy tuple.

$$ \mathrm{A}'' = \left[\partial_1, \partial_2, \ldots, \partial_{N \times N}\right] $$
(3)

3.2 PKT

Building the PKT generally starts with compressing the image to only 60% of its actual size; this pre-processing step increases the speed of the application. Similar to the KD-tree [26, 41], where the space and dimensionality of the dataset drive the partitioning, which is carried out on each dimension in an iterative manner, PKT partitioning considers space and dimensions too; however, it employs a recursive partitioning strategy, i.e., partitioning on one dimension is recursively completed before moving to the next.

The PKT (typical structure in Fig. 2) performs the partitioning of a k-dimensional space (the output of the tree is shown in Fig. 5). Building a static PKT from P points/pixels has an average time complexity of O(log n); the same holds for the PKT skin detection operation. After the size reduction, the PKT starts the partitioning procedure in ALGORITHM 1. The partitioning is simply an array subdivision procedure for spatial data, where the dimensionality of the array is determined by the underlying space holding the image. For non-spatial datasets, the computation of ∆ in ALGORITHM 1 will vary: μ might be the number of attributes/columns, or the number of attributes divided by an arbitrary value (to avoid only one item per leaf node); everything else remains the same. For now, we concentrate on spatial datasets.

Fig. 2 Typical PKT in two (2) dimensions

The tree goes no deeper than a maximum depth of two (2) for 2-dimensional spaces and three (3) for 3-dimensional spaces. This means that the partitioning always ends with the axis-aligned bounding box of the leaf. Fig. 2 shows the root node as a forest of internal subtrees; each internal node is equally a forest of sub-nodes. These sub-nodes become leaf nodes (Ln) containing the image pixels once the partitioning procedure has reached the last dimension. Note: if the size of the pixel array (DL = P = N × N) is not even, the last leaf node is extended to a super-node. The idea of a super-node does not affect the performance of the PKT, because in terms of pixel classification, the mid pixel in the bounding box is assumed to carry the most classification weight (Eq. 10).

For static datasets, the tree employs a recursive top-down partitioning strategy, such that partitioning only takes place if the number of elements in the dataset (DL) ≥ a certain value (∆). All sub-nodes are stored in the root as internal nodes, and further partitioning of an internal node occurs only if the number of elements in partition j (∆j) is greater than ∆ (ALGORITHM 1).

For a dynamic dataset, the above procedure starts from the leaves, in the sense that Ω0 pixels are recursively stored on each leaf, and ∆ leaf nodes are stored on the upper (internal) node, until the root is reached.

The output of the tree after partitioning is shown in Fig. 5 for different values of μ (see Fig. 5e). μ is the determinant of tree behavior and performance: it stands for the expected number of items in the leaf node (Ln). Increasing the value means more items per leaf node; decreasing it means fewer. Either action affects the tree significantly. After repeated, exhaustive experimentation, we chose 25 as the most efficient value of μ, behaving well for all image sizes:

$$ \Rightarrow \Omega_0 \cong \mu $$

[figure b: ALGORITHM 1 listing]

In ALGORITHM 1, we perform (by sorting A″ on all the dimensions) the partitioning of A″ (a row/column vector of the image pixels, with length DL, derived in Eq. 3) into predetermined sub-regions using the proposed partitioning strategy. After the partitioning, each array A″jk, containing a total of ¥k elements, is further divided by ∆ until the last dimension is reached and a certain condition is met. Ultimately, each cell in the grid (with Ω0 elements, bounded by an axis-aligned bounding box denoted (jk = k1)) forms a leaf node (Ln) of the PKT, and a group of Ln forms an internal node k.
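ALGORITHM 1 itself appears as an image in the original; the following is therefore only a plausible reconstruction, under the definitions above, of the recursive one-dimension-at-a-time partitioning it describes:

```python
import math

def build_pkt(points, mu=25, k=2):
    """Compute Delta once from DL = len(points), then partition recursively."""
    delta = math.ceil((len(points) / mu) ** (1.0 / k))
    return _partition(points, delta, k, dim=0)

def _partition(points, delta, k, dim):
    # Leaf node (Ln): last dimension reached, or too few elements to split.
    if dim >= k or len(points) <= delta:
        return points
    pts = sorted(points, key=lambda p: p[dim])   # sort on the current dimension
    chunk = math.ceil(len(pts) / delta)
    # Partitioning on one dimension is completed before moving to the next.
    return [_partition(pts[i:i + chunk], delta, k, dim + 1)
            for i in range(0, len(pts), chunk)]
```

Here points is A″ with locations retained (a list of (x, y, pixel) tuples), and the innermost lists returned are the leaf nodes Ln of roughly Ω0 elements each.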

[figure c: ALGORITHM 1(a) listing (PQT construction)]

ALGORITHM 1(a) describes the construction of the PQT (Fig. 14c), an improved quad-tree. Like the conventional quad-tree, the PQT continues partitioning until a certain condition is met; however, with the PQT, the partitioning parameter is pre-set by automatically computing ∆.

After extracting the pixels of an input image into array A″, partitioning begins if the length of A″ is greater than ∆; this means that partitioning terminates when:

$$ \left|\mathrm{A}''\right| \div 4 \le \Delta $$

The depth of the PQT is given as:

$$ dep \cong \left\lceil \frac{\log_2\left(\mathrm{DL}/\Delta\right)}{2} \right\rceil $$

and the number of children in each leaf node of the PQT is approximately:

$$ \Omega_0 \cong \mathrm{DL}/4^{dep} $$

The PQT is a highly balanced tree, as all leaf nodes always reside at the same level/depth.

Note that with the above equations, the PQT adapts easily to various tasks, as the parameters that determine the structure can be adjusted to suit the task at hand: the depth of the tree and the number of children residing in each leaf node can be established a priori.
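For instance, the two formulas can be evaluated directly (a small sketch following the rounding in the equations above):

```python
import math

def pqt_shape(DL, mu=25, k=2):
    """Depth (dep) and approximate leaf occupancy (Omega0) of the PQT."""
    delta = math.ceil((DL / mu) ** (1.0 / k))
    dep = math.ceil(math.log2(DL / delta) / 2)   # dep = ceil(log2(DL/Delta) / 2)
    omega0 = DL / 4 ** dep                       # Omega0 = DL / 4^dep
    return dep, omega0

print(pqt_shape(3000))   # (5, ~2.9) for the 3000-pixel example
```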

Next, we find R (ALGORITHM 2). R is an array of qualifying pixels, selected by spatial interpolation from the Ω0 elements using the inverse distance weighting (IDW) function in Section 3.3. Only the elements in R are processed during the classification stage.

[figure d: ALGORITHM 2 listing]

Note that the length of R is constant for any size of A″.

3.2.1 Selecting r pixels

r pixels are selected from each leaf node (jk = k1) to form a single R (ALGORITHM 2), such that they match the pixels depicted in Figs. 4c and 4d (a selection sketch follows the list below), where:

r1: pixel at the top left of the bounding box of (jk = k1);

r2: pixel at the top right of the bounding box of (jk = k1);

r3: pixel at the bottom left of the bounding box of (jk = k1);

r4: pixel at the bottom right of the bounding box of (jk = k1);

r5: pixel at the middle of the bounding box of (jk = k1).
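A sketch of this selection, assuming a leaf stored as a flat list of (x, y, pixel) tuples as in the PKT reconstruction above (the helper name nearest is ours):

```python
def select_r(leaf):
    """Pick the five strategic pixels r1..r5 from a leaf node's bounding box."""
    xs = [p[0] for p in leaf]
    ys = [p[1] for p in leaf]
    x0, x1, y0, y1 = min(xs), max(xs), min(ys), max(ys)
    cx, cy = (x0 + x1) / 2, (y0 + y1) / 2

    def nearest(tx, ty):
        # pixel in the leaf closest to a target corner / midpoint
        return min(leaf, key=lambda p: (p[0] - tx) ** 2 + (p[1] - ty) ** 2)

    return [nearest(x0, y0),   # r1: top left
            nearest(x1, y0),   # r2: top right
            nearest(x0, y1),   # r3: bottom left
            nearest(x1, y1),   # r4: bottom right
            nearest(cx, cy)]   # r5: middle (carries the largest weight)
```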

After computing R, we next find the value of ð (the total expected number of significant pixels in an image for efficient skin/no-skin classification).

It has been established that Ω0 is the total number of pixels in each leaf node (jk = k1), r = 5 is the size of array R (the selected significant pixels in a cell), and DL is the size of the input array of image pixels. Thus, in 2 dimensions, the total number of pixels to be processed would normally be:

$$ \mathrm{DL} = \Omega_0 \times \Delta \times \Delta $$
(4)

However, since we select only 5 pixels/elements out of Ω0, the new total number of pixels to be processed becomes:

$$ \text{ð} = r \times \Delta \times \Delta = 5 \times \Delta \times \Delta $$
(5)

It is evident from Eqs. 4 and 5 that the total number of potential candidate pixels (ð) is smaller than DL.

Note: ð in Eq. 5 can vary greatly depending on the size of A″; however, r and Ω0 (typically ≅ 24, Eq. 8) are constant. r is not arbitrary: we chose only 5 strategically positioned pixels from each cell/leaf node. Note also that the number of occurrences of ∆ in ALGORITHM 1 depends directly on the number of dimensions of the space holding the input image, or the number of principal components for non-spatial datasets. Therefore, for a k-dimensional space, we would have:

$$ \underbrace{\Delta \times \Delta \times \cdots \times \Delta}_{k\ \text{times}} = \Delta^k \quad \text{as the total number of leaf nodes in the PKT.} $$

Below, we show that Eq. 5 greatly reduces the size of the computation for predicting pixels that correspond to human skin, i.e., for our example image of size DL (the length of A″), where DL = 3000 in 2 dimensions, we have:

$$ \Delta = \left\lceil (3000/25)^{1/2} \right\rceil = 11, \qquad \text{ð} = 5 \times 11 \times 11 = 605 $$

This enhancement indicates that rather than matching all 3000 available pixels in A″ against the skin color threshold in Table 1 to find human skin, only 605 pixels are compared, reducing the number of computations to only about 20% of the total.

Table 1 Skin color threshold comparison

3.3 Interpolation process

For pixel matching purposes, the skin segmentation program would normally test a total of DL elements/pixels (Eq. 4). However, this makes the segmentation/classification process highly inefficient and impractical. Therefore, we find R (from ALGORITHM 2) to reduce the computation from DL to ð (Eq. 5). The interpolation procedure below is then adopted to classify the pixels into skin or non-skin, using only ð pixels in total.

Consider the P pixels and their locations in A″ as a set of S spatial events {p1, p2, …}, as illustrated in Eq. 3. Let Њ be a color threshold as defined in Table 1, for classifying pj as skin/no-skin, i.e.:

$$ \text{Њ}(p_j) = \begin{cases} T, & \text{if}\ p_j\ \text{falls within the thresholds of Table 1} \\ F, & \text{otherwise} \end{cases} $$
(6)

If the function evaluates to T, then pj is skin, and non-skin otherwise.

Proposition

If Њ in Eq. 6 is accurate, only ð pixels, instead of DL, are significant for the predictions in Eqs. 10 and 11 to hold, with an accurate, precise, and fast result.

Proof 1

Let the instances of S be each pj located at a single location (∂j) on the grid of P pixels. This means that for the array of pixels (A″), since there is a discrete hybrid partition (i.e., the PKT) of A″ based on spatial proximity between the locations ∂j, j = 1, 2, …, DL, a spatial inferential rule (as in Eq. 7) can be discovered faster and more efficiently.

Following Proof 1, we can now say that with μ = 25 and k = 2, the total number of pj (for any subset pj ⊆ P) in each leaf node of the PKT (ALGORITHM 1) is:

$$ \mathrm{C} \cong \left\lfloor \mathrm{DL}/\Delta/\Delta \right\rfloor, \quad \text{where}\ \Delta \cong \left\lceil \left(\mathrm{DL}/\mu\right)^{1/k} \right\rceil \ \text{(ALGORITHM 1)} $$
(7)

Eq. 7 means that, following the analysis in Proof 1, for any object or image in k-dimensional space, there will always be approximately C (computed in Eq. 8) data elements (pixels, in the case of images) in each leaf node of the PKT.

For example, let DL = 3000, k = 2

$$ \Rightarrow \mathrm{ceil}\left((3000/25)^{1/2}\right) = 11 $$
$$ \Rightarrow \mathrm{floor}\left(3000/11/11\right) \cong 24 $$
$$ \therefore\ \mathrm{C} = \Omega_0 \cong 24 \ \text{(number of pixels/points in a leaf node, for all input sizes)} $$
(8)

Eq. 8 is affected significantly by the value of μ. Notice that if μ is 100, Eq. 8 yields ≅ 102, meaning approximately 102 items per leaf node. Such a large number of elements in a leaf node can make the model inefficient due to oversized sub-regions; similarly, having far fewer than 24 elements per leaf node can equally lead to inefficiency.

Proof 2

Spatial autocorrelation measures the similarity between samples of a given population as a function of spatial distance [6, 7, 44]. Figure 3 plots the relationship between sample skin pixels from the input image (Fig. 5). Fig. 3a shows that these pixels (pj) are highly correlated, with correlation coefficient r = 0.9181; the figure also shows that the pixels are densely clustered. Figure 3b shows the degree of normality in the pixel distribution within the sample image.

Fig. 3 Sample skin pixels from the input image in Fig. 5: (a) relationships between image pixels, with correlation coefficient r = 0.9181; (b) the degree of normality in the pixel distribution

Thus, under the assumption that the skin pixels are highly correlated and located within near-zero proximity of their neighbours, we define q (the measure of the spatial distance (SP) between locations ∂j) as any spatial construct:

$$ \mathrm{q} = \{\mathrm{q} : \mathrm{q} \in \mathrm{SP}\}, \quad \mathrm{q} = \begin{cases} 0, & \text{if}\ \mathrm{SP}'(p_j) \\ \ge 1, & \text{otherwise} \end{cases} $$
(9)

SP′ signifies very close proximity (within a bounding box) between the pj in P.

∴ For all pj in Proof 1, since spatial autocorrelation arises from the correlation of a variable with itself through space [10], we assume q = 0 in Eq. 9. That is, ∀pj : pj ∈ P, ∀pr : pr ∈ P, observations made from pr include the information present in pj. Therefore, it is rational that the sample size r be less than the total number of observations Ω0 in each Ln bounding box/cell.

We therefore select only r sample pixels (described in ALGORITHM 2) at strategic locations in each Ln cell (Fig. 4) as the five significant sample points for interpolation, and then move on to the interpolation procedure in Section 3.3.1.

Fig. 4 Interpolation procedure: (a) underlying idea of predicting new unknown cell values from known sample points; (b) expanded cells showing skin and non-skin pixels (in their bounding boxes) as classified by the PKT, cropped from Fig. 11f; (c) expanded single cell from the skin area in (b-ii); (d) expanded single cell from the non-skin area in (b-i)

3.3.1 Prediction by interpolation using IDW

In this section, we describe how we predict 24 points (pixels) from only 5. Since all Ω0 elements in an Ln cell are highly correlated pixels, and since the cell is very small (Fig. 5e), there is a high probability, following Proof 2, that if the mid pj in the cell (pjM) passes the threshold test in Eq. 10, then every other pj in the cell will pass, i.e.:

$$ \text{Њ}\left(p_{jM}\right) = T \ \Rightarrow \ \text{Њ}\left(p_j\right) = T, \quad \forall\, p_j \in L_n $$
(10)
Fig. 5 Output of the PKT for different values of μ: (a) the original image; (b, c, d, e) outcome of the partitioning for different values of μ: (b) μ = 20, (c) μ = 40, (d) μ = 100, (e) μ = 25

Interpolation, as depicted in Fig. 4a, is a way of predicting values in a cell from a limited number of sample data points. Figure 4 depicts the prediction procedure; the white points in Figs. 4c and 4d are arbitrary unknown pixel values. The blue point at the middle, pjM, is a known sample point, which carries the largest classification weight (w = 1/d(pr, pj)², where d is the distance between known pixel pr and unknown pixel pj, as in Eq. 12).

We already showed that q = 0 in Eq. 9; that means the distance between pjM and all pj in R (ALGORITHM 2) is 0. Therefore, to find the value of each pj ∉ R in an Ln cell, we interpolate (Eq. 12) through the locations ∂j to test the color channels of pj against the threshold Њ, based on the known points/pixels in R, i.e.:

(11)

So, let pr be the pj ∈ R, and Њ(pr) their threshold values; let d be the distance between pr and an unknown pj. Thus, from Eq. 12, we find a discrete assignment for each pj in a cell using inverse distance weighting (IDW) with a power of 2:

$$ \text{Њ}(p_j) = \frac{\sum_{p_r \in R} w_r\, \text{Њ}(p_r)}{\sum_{p_r \in R} w_r}, \qquad w_r = \frac{1}{d(p_r, p_j)^2} $$
(12)

Following the evaluation in Eq. 12, the threshold values of all pj in each cell are projected such that:

(13)

Thus, all cells that meet the criterion in Eq. 13 are returned as skin cells (ALGORITHM 3). Based on this proposition, the calculation in Eq. 4 reduces to the calculation in Eq. 5, warranting up to 80% less work and time.
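Since Eqs. 10 to 13 are rendered as images in the original, the following IDW step is our reconstruction from the "power of 2" description alone, with the value at each of the five r pixels being the outcome of the threshold test Њ (1.0 for T, 0.0 for F):

```python
def idw_predict(known, query, power=2):
    """Inverse distance weighting: estimate a value at `query` from `known`.

    known: list of ((x, y), value) pairs, i.e. the five r pixels of a leaf.
    query: (x, y) of an unknown pixel p_j in the same leaf.
    """
    num, den = 0.0, 0.0
    for (x, y), value in known:
        d2 = (x - query[0]) ** 2 + (y - query[1]) ** 2
        if d2 == 0:
            return value              # query coincides with a sample point
        w = 1.0 / d2 ** (power / 2)   # weight = 1 / d^power
        num += w * value
        den += w
    return num / den                  # e.g. > 0.5 can be read as skin
```

Because the mid pixel r5 lies nearest to most unknown pixels in the cell, its 1/d² weight dominates the sum, which matches the large classification weight attributed to pjM above.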

Note that combining the interpolation process with a fast spatial search structure like the PKT achieves an efficient log N interpolation performance, which is highly suitable for large-scale problems.

It is evident from Eq. 13 that for an image or object of any size in 2 dimensions, only r × ∆ × ∆ = 5 × ∆ × ∆ of the data elements/pixels, rather than Ω0 × ∆ × ∆, are tested for matching purposes.

If we convert the calculation in Eq. 14 (Q) into a percentage using our example image of size 3000, that is, we take the ratio of the number of operations to the input size and multiply by 100, then:

$$ \mathrm{Q} = \frac{\text{number of operations}}{\text{size of the input}} \times 100 \ \Rightarrow \ \mathrm{Q} = \left(605/3000\right) \times 100 \cong 20\% $$
(14)

Thus, we conclude that the percentage of pixels/points needed to detect human skin, or to match/classify patterns, from any image is:

$$ \cong 20\% \ \text{of}\ \mathrm{DL} $$

3.4 Finding the human skin pixel

We have described our interpolation procedure using IDW in Section 3.3. Now let us look at how the PKT carries out the classification/segmentation process (using our proposed HSL and RGB color models in Table 1) to identify the presence of human skin in an image.

Initially, the program extracts all the pixels from the image and stores them in a (k-1)-dimensional array (as described in Section 3). Next, the array is partitioned to produce the leaf nodes Ln, following the procedure in ALGORITHM 1.

The process of mining patterns from images can generally be enhanced by adopting a k-dimensional data structure like the PKT, which is efficient for detecting patterns in coloured k-dimensional images. ALGORITHM 3 highlights the steps involved in this operation.

From each Ln (the tiny rectangles/cells in Fig. 5e) containing Ω0 pixels, only the r pixels belonging to R are selected strategically. The RGB values of each r pixel (pr) in the array are converted to HSL values. All pr in each cell are compared recursively against the skin/no-skin threshold in Table 1; eventually, only the Ln where all pr meet the colour matching criteria are returned.
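Putting the pieces together, the per-leaf test of ALGORITHM 3 might be sketched as follows (the HSL and RGB bounds are those quoted in the Fig. 8 caption, with L1 read as the L channel; select_r is the selection sketched in Section 3.2.1; p[2] holds a pixel's (R, G, B) tuple):

```python
import colorsys

def is_skin(r, g, b):
    """Combined RGB + HSL threshold test (bounds as given in Fig. 8)."""
    h, l, s = colorsys.rgb_to_hls(r / 255.0, g / 255.0, b / 255.0)
    hsl_ok = 10.0 <= h * 360.0 <= 30.0 and s >= 0.20 and l >= 0.10
    rgb_ok = (r > 50 and g < 220 and b != 60 and b < 200
              and r > g and r > b and abs(r - g) > 20)
    return hsl_ok and rgb_ok

def skin_leaves(leaves):
    """Return only the leaf nodes where all five r pixels pass the test."""
    return [leaf for leaf in leaves
            if all(is_skin(*p[2]) for p in select_r(leaf))]
```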

[figure i: ALGORITHM 3 listing]

4 Results/discussion

4.1 Experiment

The PKT experiment was implemented from scratch in JavaScript and run on a machine with an Intel® Core™ i5-5200U CPU @ 2.20 GHz and 8 GB RAM, running Windows 10. All PKT modules/functions (ALGORITHMs 1-3, etc.), sub-functions (e.g., bounding boxes, selecting r), and helper functions (e.g., color channel conversion and matching procedures, sort, sum, distance) were custom-built de novo in JavaScript and visualised using HTML tags, served from Python's local HTTP server.

4.2 Datasets

We conducted the experiments using real-world, synthetic, and image segmentation datasets (all two-dimensional): the ColorFERET dataset [34] (Fig. 12b), the Pratheepan human skin dataset [52] (Fig. 12a), and various other images of diverse complexion, pose, orientation, age, illumination, and sex selected from the internet (Fig. 12c). The results of applying our algorithm to these datasets are shown in Figs. 12 and 13.

4.3 Elucidation

The proposed improved skin colour threshold in Table 1 and the multidimensional spatial structure (PKT) are applied in this work to effectively detect human skin in an image. Below we present the results of the model. The results show that the structure is as versatile as it is promising, showing tendencies of greater prospects (Table 4).

By performing a few geometric operations on the reverse aspect of the model (Fig. 6), some facial features, including the face, nose, eyes, and mouth, can be detected. In Fig. 6, the non-skin areas have been marked with red points by the PKT. Using some simple distance metrics, the head and neck could be extracted; additionally, working out the positions of facial features can help find the nose, mouth, and eyes, but this will be addressed in a later version.

Fig. 6 (a) The reverse aspect of our test image; non-skin areas are shaded. (b) Cropped portion of the image isolating facial features

4.4 Comparison between common color thresholds for human skin classification and ours

[12] described several color thresholds for modeling skin colors. Researchers including [3, 5, 21, 25, 37] adopted similar RGB/HS color models for human skin identification and possible recognition; these models fall within a given threshold for all skin color types (Table 1).

Our study disputes the human skin color threshold premise and assumptions made by the authors of [3, 5, 21, 25, 37], as they fall short of reality for certain human skin color codes. Three main colors (red, pink, brown) pose the greatest problem in setting a human skin color threshold; of course, this is because they are very close in shade to the red color underlying human skin [20]. After exhaustive toning of both the HSL and RGB color models to enhance insensitivity to luminance, we arrived at a more efficient threshold (Table 1) that tackles the darker skin color problems mentioned in Section 2 (Figs. 7 and 8).

Fig. 7 (a) Original image; (b) and (c) common HSL and RGB color models as adopted in [3, 5, 21, 25, 37]: (b) RGB (R > 95 and G > 40 and B > 20 and R > G and R > B and |R - G| > 15 and A > 15); (c) HS (0.0 <= H <= 50.0 and 0.23 <= S <= 0.68)

Fig. 8 (a) and (b) our suggested color models: (a) HSL (H >= 10.0 && H <= 30.0 && S >= 0.20 && L1 >= 0.10); (b) RGB (R > 50 && G < 220 && B != 60 && B < 200 && R > G && R > B && |R - G| > 20)

It was noted, however, that these color models do not perform efficiently when applied discretely; that is, there is always a high degree of false hits. Hence, we tested a combination of both, with the results shown in Fig. 9 (for the thresholds suggested by others) and Fig. 10 (for our suggestion). The improved colour threshold we suggest in Table 1 is efficient, especially as an improvement for darker skin prediction; evidence of this is equally shown in Figs. 9 and 10.

Fig. 9 RGB + HSL: result of combining the HSL and RGB color models (other methods)

Fig. 10 RGB + HSL: result of combining the HSL and RGB color models (our suggestion)

Figure 11 shows the various stages of the PKT procedure: the ground truth image (Fig. 11a), the original image (Fig. 11b), and the results of various stages of the skin detection process for different values of μ. Figure 11f is our final result (with μ = 25). The red points on the face are the midpoints of the bounding boxes of each leaf node (jk = k1) where Eq. 6 evaluates to T.

Fig. 11 The skin detection procedure completed with the proposed tree structure: (a) ground truth image; (b) the original image; (c) output showing the skin area when μ = 500; (d) output when μ = 100; (e) midpoints of the leaf bounding boxes that match our skin threshold value (ALGORITHM 3), shown as red points; (f) the expected and final result of the skin detection process, with extracted skin areas marked as red points; (g) points representing the skin area, extracted from an image using the PKT; (h) PKT-smoothed version of the image in Fig. 10c

Figure 11g shows only the points representing the skin area. As we can see, μ affects the performance and behavior of the tree. In Fig. 11c, where μ = 500, some skin pigments were not detected, which gives rise to a high rate of false hits. Even though Fig. 11d (where μ = 100) looks promising, there is still a tendency toward some measure of false negatives. Figure 11e shows the outcome of Fig. 11d without displaying the tree. At μ = 25, in Fig. 11f, a perfect result was achieved. Figure 11h shows how the PKT was used to smooth the result of the image in Fig. 10c.

In Fig. 12, we show the result of applying the PKT algorithm to find human skin of varying type, complexion, illumination, shade, pose, position, etc. The images in Fig. 13 show skin pigments detected using the same technique, but with the tree boundaries not displayed.

Fig. 12 Other images showing detected skin using our proposed model: (a) Pratheepan images; (b) ColorFERET images; (c) internet images

Fig. 13 More images tested (tree not displayed)

4.5 Evaluation

To evaluate the performance of the tree structure against the commonly used methods (PBP/PW operations and quad-tree-like structures) adopted by many authors, we compared the tree's performance with these techniques. Figure 14 shows the performance of the various models. In Fig. 14a we have the ground truth image; Fig. 14 (a-i) shows the skin area, and Fig. 14 (a-ii) highlights the non-skin areas identified by the PW technique. No enhancement was applied; thus, each pixel was checked individually against our proposed color threshold for skin pixel classification in Section 3.1.

Fig. 14 Comparison of the performance of the three methods of study, using an image from the Pratheepan dataset: (a) the ground truth image; (a-i) PBP-predicted skin pixels (in red); (a-ii) reverse process of PBP, showing non-skin pixels in red; (b) result and performance of the PKT (proposed model) skin identification procedure; (b-i) extracted skin area in red (red points represent the midpoints of the leaf nodes (Ln) in the region); (b-ii) reverse process showing the non-skin area; (c) result and performance of the PQT skin identification procedure; (c-i) extracted skin area in black points; (c-ii) reverse process showing the non-skin area

In Fig. 14b, we show the same image with the skin area mapped out by the PKT. In Fig. 14 (b-i) the tree boundaries are not displayed, and Fig. 14 (b-ii) depicts the reverse effect of the tree, showing the non-skin areas it identified.

Figure 14c shows the result of applying the PQT to the skin prediction procedure based on our improved color threshold in Section 3.1. The skin areas mapped out by the PQT are shown in Fig. 14 (c-i) without displaying the boundaries; Fig. 14 (c-ii) depicts the reverse effect of the PQT, showing the non-skin areas it identified.

4.6 Performance graph

A plot of time performance for the various structures is shown in Fig. 15. Fig. 15a compares the construction times of the PKT and the PQT. Needless to say, the quad-tree's deep quad-partitioning strategy negatively affects the speed of the structure; though the quad structure performs fairly in terms of classification (Fig. 14c), some improvement is needed to speed up the system. As can be seen, the worst performance in terms of pixel classification time is the PBP technique, followed by the PQT method (Fig. 15b). This means that even though there is no structure to build in the PBP method, it cannot improve beyond a time complexity of O(n). Note that the timing here includes pixel extraction, array manipulation, tree partitioning, and pixel classification. These comparisons are also shown in Table 5.

Fig. 15 Time of construction and classification: (a) comparison of elapsed times between the proposed PKT and the other systems; (b) data size versus pixel classification duration for the methods of study

4.7 Precision, recall and accuracy

To evaluate the accuracy and precision of our proposed model, we have prepared Tables 2, 3 and 4 for different pixel sizes. The tables show the accuracy, recall, and precision of the three methods in terms of pixel classification and skin segmentation. The accuracy, recall, and precision calculations used to evaluate our skin/no-skin pixel classification model are based on the formulas in [15]. The higher precision and accuracy achieved by the PKT stem from the large IDW weight attached to the midpoint pixel (pjM) of the leaf bounding boxes, such that Њ(pjM) must evaluate to T (Eq. 10) for any other unknown pixel pj to be a valid skin pixel. Even though the PQT method employs a similar restriction strategy, its four-child partitioning is bound to weaken the restriction by splitting a cell of very close neighbours into two dissimilar subsets. PBP-based systems have no such constraint; they select every area of the image where there appears to be a trace of the defined colour threshold (Њ).
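For reference, the metrics in Tables 2, 3 and 4 follow the standard confusion-matrix definitions (the paper bases them on the formulas in [15]; the standard definitions are sketched below):

```python
def evaluate(tp, fp, tn, fn):
    """Standard precision, recall and accuracy from a confusion matrix."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    accuracy = (tp + tn) / (tp + fp + tn + fn)
    return precision, recall, accuracy
```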

Table 2 Measurement of accuracy and precision for the PBP method
Table 3 Measurement of accuracy and precision for the PQT method
Table 4 Measurement of accuracy and precision for the PKT method

5 Conclusion and future works

PBP operations for human skin detection and skin pixel classification are often characterized by a high rate of false hits and increased time consumption. In this paper, we have presented an improved color-threshold-based algorithm for recognizing and classifying human skin pixels in an image, using a combination of the RGB and HSL color models. To speed up the process and improve performance, we proposed and implemented a k-dimensional structure for the classification procedure. Our proposed model shows very promising results in terms of precision, recall, and accuracy compared with most state-of-the-art systems.

Images from different sources were tested, and the model scaled well. From the results presented in Tables 2, 3 and 4 (the overall performance of the algorithm), it can be seen that the proposed model provides a very significant reduction in false detection rates compared with the PBP testing mechanism and the quad-tree-like techniques applied in many systems. Quad-trees have been used to speed up the detection process; we therefore proposed and implemented the PQT, an improved quad-tree structure against which we compared our main model, the PKT (Table 5).

Table 5 Run-times for different numbers of pixels

Although there is a significant improvement over the PBP techniques in terms of speed and accuracy, the quad-tree structure showed certain drawbacks in construction speed and pixel classification speed, attributable to the structure's partitioning strategy. We can boldly say that the proposed approach yields better detection performance than the state-of-the-art PBP and quad-tree-based techniques, with a significant reduction in time and computational cost.

We have equally shown that, with a little geometry, the algorithm can detect the face, hands, and other features and gestures. For future work, an improved PKT is being investigated that performs second-level filtering of the PKT cells (sub-regions) to produce a (skin) patch rather than skin pixels. This second PKT variant (which will include an edge detection procedure) will apply more smoothly to generalised object detection, segmentation, and recognition. We shall equally investigate parallelizing the structure to further improve its speed and efficiency for general pattern mining. We are also investigating implementations of the PKT and PQT for higher-dimensional spaces and for clustering non-spatial datasets.

Finally, we claim that using these structures (PKT, PQT), only ≅ 20% of the pixels in an image are required for pixel classification in a skin detection procedure.