1 Introduction

The problem of localizing characteristic facial areas in thermal images has been studied for many years and described in numerous publications [5, 12]. Precise localization of selected facial elements opens up possibilities for further analysis of this type of image. Methods of this kind are applied to thermal imaging in biometric systems [1, 2], as an initial stage preceding the actual recognition process, and in medicine [9, 10], where locating facial areas enables further automated analysis of thermograms. They can also be used to build human-computer interfaces that facilitate interaction with the computer [3] or increase safety [6]. The eyes are a particularly important element of the face because they carry a large amount of information [7, 19]. Therefore, many of the described solutions concern eye localization directly [18, 8, 13, 20, 21], or eye localization forms part of a larger system for analysing the entire face [1, 10, 14, 19]. The eye localization algorithm for thermal images presented in this paper makes it possible to automate this process. Importantly, it works correctly even when the position, scale and orientation of the head are unknown (for rotations in the range of about −20 to +20 degrees about the y axis). In summary, the main purpose of this study is fast, automatic localization of the characteristic points of the eyes despite the typical problems of thermal imaging and image transformations.

2 Related works

Selected methods of eye localization in thermal images and thermograms, illustrating typical solutions, are presented and discussed below.

The authors of [19] used a combined edge and corner detector, the Harris operator, for the initial detection of eye points of interest; it singles out groups of pixels located in the areas with the largest changes in image brightness. These groups were located in the eye and mouth areas. In the next stage, “eigenimages” were created for each face area from the regions covering these groups of pixels. The resulting features were classified using an SVM trained separately for each eye area (left and right). The authors did not report the localization effectiveness; the method was used in an emotion recognition system. The test set comprised 270 images.

In paper [18], a genetic algorithm combined with the active contour method (“snake”) was used for exact eye localization. In the first stage, the genetic algorithm determines the initial position of the active contour, which then adapts to the eye area. The initial contour of the snake is formed by two parabolas; during processing, the top and bottom parabolas adjust to the eye edges. Finally, the circle of the cornea is additionally marked. The authors report 90 % effectiveness on a set of 125 images.

In [13], the authors used an IR illuminator and eye-monitoring software intended to help avoid car accidents. For this purpose, the driver’s face is illuminated with a near-infrared (NIR) illuminator to improve the visibility of the pupils. The key element of this solution was an eye localization algorithm based on pupil localization. With IR illumination, simple methods based on the highest pixel values could be used. Next, two potential regions were selected as eye candidates using a centroid function. The authors noted that the eye regions were correctly detected in the vicinity of the pre-designated points, but they did not report the localization effectiveness.

In paper [8], the face area was initially divided into regions of 16 × 16 or 32 × 32 pixels, which were then subjected to a wavelet transform using the Haar wavelet. The resulting set of features was classified with SVM or GentleBoost, allowing rectangles surrounding the eyes to be localized. The collection of test images, however, was small (78 cases), and the test images covered only human heads of similar scale with standardized location and orientation. The authors evaluated the effectiveness by calculating the value of err_eye; the result was about 83 %.

In publication [20], the authors proposed an automatic eye localization method for subjects with and without eyeglasses. For subjects wearing eyeglasses, the locations of valleys in the vertical and horizontal projection curves were used to perform eye localization, and a set of 3 features (the width and height of the valleys and their combinations) was used by an SVM classifier to localize the eyeglasses. For subjects without eyeglasses, the image was divided into 15 sub-regions and eight types of Haar-like features were extracted. Eight classifiers were trained on features selected by the AdaBoost algorithm for the left and right eye, respectively. Finally, eye localization was performed using a voting strategy. To calculate the effectiveness, the authors used the err_eye methodology mentioned above, obtaining values of about 83 %–86 %. These last two methods were used for comparison with the method presented in this article.

As shown above, the solutions described in the literature rarely address eye localization under variable orientation, scale and position of the head. Such conditions are much closer to real ones than forcing a fixed position and keeping the head vertical. A high speed of the localization algorithm is equally important, as it allows the method to be used on video sequences or in real time. The main aim of this study is therefore an eye localization method for thermal images that is robust to changes in orientation, scale and position of the head.

3 Material – collection of images

The test collection comprised 198 thermograms covering a portion of the torso and the head. The images were characterized by large variability of test subjects and diverse features: the location, orientation and size of the head and torso areas. The number of subjects was 128; for 36 people, 2–4 images per subject were acquired, whereas for the remaining subjects one thermogram each was acquired. The images were taken interchangeably with several thermal cameras (AGEMA 590 PAL, ThermaCam S65, A310, T335) with a detector resolution of 320 × 240 pixels. The camera-to-subject distance was about 1 m. In order to increase the versatility of the algorithm, the analysed images were additionally rotated from −20 to +20 degrees in steps of 10 degrees, which gave a total of 990 images (198 × 5 orientations).
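The augmentation step can be reproduced with a short script. The sketch below assumes 8-bit images stored as NumPy arrays and uses scipy's rotate, which is an assumption of this write-up rather than the authors' tool; it only illustrates how five orientations per thermogram yield 198 × 5 = 990 images.

    import numpy as np
    from scipy.ndimage import rotate

    def augment_with_rotations(image, angles=(-20, -10, 0, 10, 20)):
        """Return rotated copies of a thermogram, one per angle (degrees).

        Hypothetical reconstruction of the augmentation step described in
        the paper: rotations from -20 to +20 degrees in 10-degree steps.
        """
        rotated = []
        for angle in angles:
            # reshape=False keeps the original frame size; pixels rotated
            # in from outside the frame are filled with 0 (coldest level).
            rotated.append(rotate(image, angle, reshape=False,
                                  mode='constant', cval=0, order=1))
        return rotated

    # Usage: five orientations of one 320 x 240 thermogram
    thermogram = np.zeros((240, 320), dtype=np.uint8)
    variants = augment_with_rotations(thermogram)
    assert len(variants) == 5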

Central head regions of 30 × 30 pixels, with brightness in the range 0–255, were extracted from the generated set - Fig. 1 (described in section 4).

Fig. 1 Examples of images from the test collection and the analysed areas sized 30 × 30 pixels

The cropped images were used to create the training and test sets for the developed feature generation algorithm and the classifier based on it. The network was trained on a set of 200 randomly selected images; the criteria for the sizes of the training and test sets were adopted from the literature [4, 16]. Then, for each training image, a set of pairs of potential areas that could be designated as parts of the eyes was generated - Fig. 2. The remaining 790 images were used as the test set.

Fig. 2 Sample cropped images of the central face area and automatically identified potential areas of the eyes

4 Methodology

When analysing the collection of images, it was assumed that, as long as the frontal position of the face relative to the camera is maintained, the potential areas of the eyes lie near the centre of the head. It was also assumed that locating the brightest regions in the head area is sufficient to correctly determine the location of the eyes. For the tests, cropped images of 30 × 30 pixels located near the centre of the head were selected - Fig. 1. The centre-of-head regions were determined automatically using previously developed solutions [9, 10].

4.1 Initial assumptions

Based on a statistical analysis of the collection of images in the standard position (observing how many potential areas are required for correct localization), basic assumptions for the segmentation mechanism were defined, taking into account the need for at least 2 areas that could be fragments of the eyes (Table 1).

Table 1 The number of selected areas depending on the value of the automatically set brightness threshold

In most cases, the automatically set brightness threshold was sufficient to single out at least 2 brightest regions (in 81 % of the images these were the actual eye points) - Fig. 2a, g). In about 13 % of cases, the brightness threshold had to be reduced further to obtain 3 brightest areas covering the characteristic points of the eyes - Fig. 2b, h). In about 6 % of cases, identifying the potential areas required reducing the threshold until four or five clusters of pixels were obtained - Fig. 2c, d, e). For 17 images, the segmentation block did not properly identify the required 2 potential eye areas - Fig. 2f).

On the basis of these results, an attempt was made to develop a classifier allowing unequivocal identification of the pairs of areas that were actually elements of the eyes.

In the first stage, the algorithm searches for at least two areas, which are then classified using a neural network. If no pair of areas is classified as the eyes, the algorithm looks for a third area (by reducing the brightness threshold) and subjects all newly created pairs of areas to classification. If these are still not eye areas, a fourth region is generated and each pair of areas is classified again. The automatically set threshold value used to identify potential areas resulted, in some cases, in up to 5 potential areas, which then had to be correctly classified. A block diagram of the algorithm is shown in Fig. 3, followed by a code sketch of the search loop.

Fig. 3 Block diagram of the algorithm
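The search loop of Fig. 3 can be summarized as follows. This is a reconstruction from the description above, not the authors' implementation; segment_at_threshold and classify_pair are placeholder names standing in for the segmentation block and the neural-network classifier.

    from itertools import combinations

    def locate_eye_pair(image, segment_at_threshold, classify_pair,
                        start=255, step=2, max_areas=5):
        """Sketch of the search loop from Fig. 3.

        The brightness threshold is lowered from `start` in steps of
        `step`; each time the set of areas changes, all pairs of the
        current areas are offered to the classifier until one pair is
        accepted as the eyes or the area limit is reached.
        """
        threshold = start
        while threshold > 0:
            areas = segment_at_threshold(image, threshold)
            if len(areas) >= 2:
                for pair in combinations(areas, 2):
                    if classify_pair(image, pair):  # network output > 0.5
                        return pair
            if len(areas) >= max_areas:
                break  # give up: no pair classified as eyes (cf. Fig. 2f)
            threshold -= step
        return None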

4.2 Pre-processing

The silhouette area was selected and the approximate centre of the head was determined using the methods described in [9, 10]. Determining the centre of the head and cropping a 30 × 30-pixel image around it provided the test images. At this stage, 990 cropped images of the face area containing the eyes were prepared for analysis – Fig. 1. In a small number of images (4), the 30 × 30-pixel crop did not include the potential areas, which did not have a major impact on effectiveness; this situation can be eliminated by increasing the analysis area beyond 30 × 30 pixels.

Pre-processing involved a 3 × 3 median filter to eliminate single bright pixels caused by interference. The image was then subjected to automatic determination of the brightness threshold used to identify the potential areas of the eye sockets.

The proposed segmentation algorithm cyclically reduces the threshold (from the maximum image value of 255, in steps of 2) until areas classified as the eyes are found. Successive areas, counted only when their surface area exceeds 5 pixels (a value of 3 pixels generated inappropriate areas, while 9 reduced effectiveness), are generated until the number of areas exceeds four. With this segmentation block, a maximum of 5 areas were identified (Fig. 2), but it also led to situations in which the algorithm did not identify the potential eye areas (Fig. 2f) for 74 (out of 990) images.
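A possible reading of this pre-processing and segmentation block in Python with scipy; the connectivity of the components and the exact bookkeeping are assumptions based on the description above, not the authors' code.

    import numpy as np
    from scipy.ndimage import median_filter, label

    def segment_at_threshold(image, threshold, min_area=5):
        """Hypothetical segmentation block: median-filter the 30 x 30 crop,
        binarize at `threshold`, and keep connected components larger than
        `min_area` pixels (the paper reports that 3 px let spurious areas
        through and 9 px reduced effectiveness)."""
        smoothed = median_filter(image, size=3)  # remove single bright pixels
        binary = smoothed >= threshold
        labels, n = label(binary)                # 4-connected components
        areas = []
        for k in range(1, n + 1):
            mask = labels == k
            if mask.sum() > min_area:
                areas.append(mask)
        return areas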

4.3 Selection and generation of feature values

As mentioned in section 2, different sets of features are used in the eye localization process. In [18, 19], the authors used simple information based on edges and corners. In [13], the highest pixel values were sufficient owing to the use of IR illumination. By contrast, in [8] the feature set was prepared by the Haar wavelet transform and required more computation. In [20], the authors used two sets of features: the first based on parameters calculated from the image projection and the second on Haar-like features. Against this background, this article proposes a set of features describing pairs of potential eye regions.

Given the size of the cropped image, and the large and complex network that an unprocessed image matrix would require, an attempt was made to develop a method for generating a set of imaging features that describe pairs of eye areas properly, i.e. most effectively.

The segmentation algorithm discussed earlier made it possible to generate the areas automatically and then determine a set of several features. These describe selected geometric relationships and brightness distributions in the designated areas and are independent of the position and orientation of the eyes. The tests were initially carried out for a 7-element feature vector (features 1, 2, 5, 6, 7, 8, 9 – Table 2), but the results were not satisfactory (effectiveness of 88 % according to Eq. (5)). Eventually, the feature set was expanded and a 9-element vector was used, based on the parameters of the pairs of eye areas identified by automatic segmentation and on the parameters of the studied area as a whole. Each pair of areas is represented by a feature vector Z (1).

$$ Z = {\left( Z_1, Z_2, Z_3, Z_4, Z_5, Z_6, Z_7, Z_8, Z_9 \right)}^{T} $$
(1)
Table 2 Statistical information of features for pairs of areas

Basic statistical information and a description of the features are presented in Table 2. The brightness of the pixels in the image is denoted by I, and the distance between the potential areas of the eyes by D.

Figure 4 shows a graphical interpretation of selected image features and the method of determining them, based on the characteristic brightness distribution along the section D.

Fig. 4 Examples of pairs of areas, marked features and a graph of brightness along the section D
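Since Table 2 is not reproduced here, the following extractor is only a hypothetical illustration of features of this kind: the distance D between a pair of candidate areas, their sizes, and brightness statistics I along the section D, as visualized in Fig. 4. The actual nine features are those defined in Table 2.

    import numpy as np

    def pair_features(image, area_a, area_b, n_samples=20):
        """Illustrative features for a pair of candidate areas.

        These are NOT the nine features of Table 2; they only demonstrate
        the kind of geometric and brightness descriptors the paper relies
        on: the distance D between the areas and the brightness I along it.
        """
        ca = np.argwhere(area_a).mean(axis=0)   # centroid of area A (row, col)
        cb = np.argwhere(area_b).mean(axis=0)   # centroid of area B
        d = np.linalg.norm(ca - cb)             # the distance D

        # Sample the brightness along the section D between the centroids.
        rows = np.linspace(ca[0], cb[0], n_samples).round().astype(int)
        cols = np.linspace(ca[1], cb[1], n_samples).round().astype(int)
        profile = image[rows, cols].astype(float)

        return np.array([
            d,                                  # geometric: inter-area distance
            area_a.sum(), area_b.sum(),         # geometric: sizes of the areas
            image[area_a].mean(),               # brightness of area A
            image[area_b].mean(),               # brightness of area B
            profile.mean(), profile.min(),      # brightness along section D
        ])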

The features determined in this way are normalized [11, 15] to prevent features with large magnitudes from dominating. Normalization was performed according to Eq. (2).

$$ Z_i = \frac{z_i - z_{i\_\min}}{z_{i\_\max} - z_{i\_\min}} $$
(2)
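Eq. (2) is ordinary min-max scaling; a minimal sketch, assuming the minima and maxima are taken over the training set:

    import numpy as np

    def minmax_normalize(features):
        """Min-max normalization of Eq. (2): each feature is rescaled to
        [0, 1] using its minimum and maximum over the training set.
        `features` is an (n_samples, 9) array; a constant feature would
        cause division by zero, so a small epsilon guards against it."""
        f_min = features.min(axis=0)
        f_max = features.max(axis=0)
        return (features - f_min) / np.maximum(f_max - f_min, 1e-12)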

4.4 Classification of designated areas

After generating the areas (from 2 up to 5), the algorithm verifies whether a pair can be singled out from the set of designated areas that will be classified as the two eye points (as indicated in Fig. 4).

A “feedforward” neural network with the sigmoid activation function was used as the classifier. After reviewing the literature on the construction of neural networks [11, 16, 17], a network with 9 inputs, 1 hidden layer (9 neurons) and 1 output was proposed. Developing an algorithm that generates a set of specialized features for the analysed images made it possible to significantly reduce the complexity, training time and operation time of the network. The normalized feature values determined automatically from the test images were used as the input vector (Table 2).
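For illustration, an equivalent 9-9-1 sigmoid network can be set up in Python with scikit-learn; this stands in for the Matlab feedforward network actually used (see the end of this section), and the solver choice is an assumption of this sketch, not the authors' training method.

    from sklearn.neural_network import MLPClassifier

    # 9 inputs -> 1 hidden layer of 9 sigmoid ("logistic") neurons -> 1 output.
    # scikit-learn stands in here for the Matlab feedforward network used in
    # the paper; the default solver is an assumption, not "method 1".
    net = MLPClassifier(hidden_layer_sizes=(9,),
                        activation='logistic',
                        max_iter=2000)

    # Training on the 400-element set u_400 of Eq. (3):
    # net.fit(Z_train, t_train)      # Z_train: (400, 9), t_train in {0, 1}
    # Decision rule of Eq. (4): output W_K > 0.5 means "pair of eyes".
    # is_eyes = net.predict_proba(Z_test)[:, 1] > 0.5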

A randomly generated set of 200 images (out of 990) was used as the training set. For images where the feature selection block chose more than 2 potential areas, both the first selected pair (e.g. O2 - O3 – Fig. 4) and the second one (e.g. O3 - O4 – Fig. 4), appropriately labelled by the expert, were used in the learning process. This increased the training set of pairs of areas for the classifier (from 200 to 249 unique pairs) and the number of training examples to 400. As a result, the training set contained 400 elements, each a vector of feature values zi with a class label ti (1 - eyes, 0 - not eyes) - (3).

$$ u_{400} = \left\{ \left( z_i, t_i \right) \mid i = 1, \ldots, 400;\ z_i = {\left( z_{i1}, z_{i2}, \ldots, z_{i9} \right)}^{T},\ t_i \in \left\{ 0, 1 \right\} \right\} $$
(3)

The above neural network was then trained in a supervised manner using two methods, selected in order to verify how the learning method affects network operation. The results are provided in Table 3.

Table 3 The results of training neural networks used as classifiers

Finally, a neural network trained with method 1 was used as the classifier: the learning results obtained on the training set showed greater effectiveness in this particular application, and the learning time was significantly shorter (Table 3). The analysed set of images was classified by feeding the 9-element vector of feature values Z (quantitatively describing a pair of candidate eye areas) to the input layer of the neural network. The classification result WK is presented on the single output, assuming (4).

$$ \begin{array}{l} W_K > 0.5 - \text{the designated pair of areas is the eyes} \\ W_K \le 0.5 - \text{the designated pair of areas is not the eyes} \end{array} $$
(4)
Table 4 The effectiveness of the algorithm for the test set

To build the classifier, the authors used the Matlab implementation of a feedforward neural network. The structure is presented in Fig. 5.

Fig. 5 Structure of the neural network

5 Results and discussion

In order to verify the algorithm, the images constituting the training set were removed from the set of 990 images, leaving 790 test images. The effectiveness SD was determined by dividing the number of images where the eyes were located correctly, ND, by the total number of test images, NA - Eq. (5).

$$ S_D = \frac{N_D}{N_A} \cdot 100\% $$
(5)
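Eq. (5) is a simple ratio; for completeness:

    def effectiveness(n_correct, n_total):
        """Eq. (5): the effectiveness S_D as the percentage of test images
        in which the eyes were located correctly."""
        return n_correct / n_total * 100.0

    # e.g. on the 790-image test set, S_D above 91 % corresponds to more
    # than 0.91 * 790 = about 719 correctly processed images.
    assert effectiveness(719, 790) > 91.0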

Analysing the results, it can be concluded that the proposed set of image features allows a precise quantitative description of the studied problem. Effectiveness of more than 91 % makes the proposed solution suitable for fast eye localization from thermal images and thermograms (990 images were analysed in less than 60 s on an i7 processor at 2.4 GHz). An important property is that the algorithm takes as input a raw cropped image containing fragments of the human eyes; this allows any mechanism for locating the head centre to be used as a block preceding the described algorithm.

Figure 6 shows sample results for correct and incorrect localizations. The proposed method works effectively in highly diverse cases where the brightness, sharpness, orientation and visibility of details vary considerably between images. This confirms the effectiveness of the blocks of segmentation, area generation and image feature determination. The images in Fig. 6a–g) are cases in which 2, 3, 4 or 5 potential eye areas were designated; this confirms that the two brightest areas do not always unambiguously define the eyes, and indicates the correct operation of the proposed classifier.

Fig. 6 Sample results

The cases in Fig. 6h–l) are images where the eyes were not correctly located. Fig. 6h, i) show a situation in which the classifier incorrectly accepts a pair of areas as the eyes, which blocks further generation of areas. In Fig. 6j, k, l), the block generating the brightest areas identifies the potential areas correctly, but they are incorrectly classified by the neural network. This confirms that this block of the algorithm can be further improved.

5.1 Comparison with other authors’ results

To measure the effectiveness of the proposed approach, the authors used a method proposed by Martinez [8] and used by Wang [20]. The displacement error between the manually selected and automatically detected characteristic points of the eyes was defined as (6):

$$ err_{eye} = \frac{\max \left( \left\Vert P_l - \hat{P}_l \right\Vert, \left\Vert P_r - \hat{P}_r \right\Vert \right)}{\left\Vert P_l - P_r \right\Vert} $$
(6)

where Pl and Pr are the true coordinates of the left and right eye points, and \( \hat{P}_l \) and \( \hat{P}_r \) are the coordinates of the left and right eye points detected automatically by our method; ‖·‖ denotes the L2 norm. The authors of [8, 20] regarded err_eye < 0.15 as correct localization.
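A direct implementation of Eq. (6) for completeness; representing the points as (row, col) pairs is an assumption of this sketch.

    import numpy as np

    def err_eye(p_l, p_r, p_l_hat, p_r_hat):
        """Displacement error of Eq. (6): the larger of the two eye-point
        errors, normalized by the true inter-eye distance (L2 norms).
        Localization counts as correct when err_eye < 0.15 [8, 20]."""
        p_l, p_r = np.asarray(p_l, float), np.asarray(p_r, float)
        p_l_hat, p_r_hat = np.asarray(p_l_hat, float), np.asarray(p_r_hat, float)
        err = max(np.linalg.norm(p_l - p_l_hat),
                  np.linalg.norm(p_r - p_r_hat))
        return err / np.linalg.norm(p_l - p_r)

    # Example: a worst-case error of sqrt(2) px over a 40-px inter-eye
    # distance gives err_eye of about 0.035, well below 0.15.
    assert err_eye((100, 80), (140, 80), (101, 81), (139, 79)) < 0.15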

The values of err_eye were calculated for each analysed image (the 198 original images without any transformation) and are presented in Fig. 7, together with the error threshold of 0.15. As can be seen from the chart, the results were unsatisfactory only in a few cases, where the characteristic points were localized incorrectly. In 86 % of cases the proposed method localized the eye points successfully, i.e. the err_eye values were below 0.15.

Fig. 7 Localization error of the characteristic points of the eyes

To compare the obtained results with other methods, the authors chose the two studies described above; the experimental results are shown in Table 5. It can be seen that the proposed method gives better results in terms of Eq. (6). The cases with err_eye greater than 0.15 are the images in which the algorithm incorrectly localized the potential eye regions (e.g. Fig. 6i, j); this algorithm block can be improved in the future. The cases with err_eye less than 0.15 are the images in which the algorithm correctly and accurately localized the characteristic points of the eyes.

Table 5 Eye detection experimental performances on samples

In summary, the proposed algorithm offers new possibilities in comparison to other methods:

  • it works for both vertical and non-vertical head orientations

  • it is very fast

  • the results obtained on the basis of Eq. (6) indicate the advantage of the proposed method

6 Conclusions

Owing to its high effectiveness (about 91 %) and speed (approximately 0.06 s per image), the presented method can be used as part of a larger system for fast eye localization from thermal images under varied conditions. Its most important elements are the developed method of automatic segmentation, the algorithm for generating image features and the classification mechanism based on a neural network. Through extensive testing and the selection of relevant features (9 features describing a pair of eye areas), a very simple network structure could be used, which has a positive effect on the speed of learning and operation. The neural network is composed of 9 inputs, one hidden layer containing 9 neurons with the sigmoid activation function, and one output. At this stage of research, it seems that the feature set could be extended with parameters describing a larger area of the image around the potential eye regions, which could improve the effectiveness of region classification. Correct localization of the eyes, as the most important areas of the face, is essential for further quantitative analysis of face images. The described algorithm will ultimately be used as part of a comprehensive system for eye localization directly from thermal images, in which the centre area of the head will be designated automatically. Such an approach will fully automate eye localization from thermal images and thermograms (after initial conversion of temperature to brightness levels), regardless of location, size or orientation. In light of the above, the presented solution can be applied in several areas, such as medical or biometric systems, or in the creation of human-computer interfaces.