
1 Introduction

Modern applications from diverse fields rely on robust image-based object detection. These fields include, but are not limited to, autonomous driving [4, 8] and scene understanding [36], driver monitoring [5, 28], eye tracking [10, 51], cognitive sciences [54], psychology [29], and medicine [16]. Many leading object detection techniques are based on Deep Neural Networks and, in particular, on Convolutional Neural Networks (CNNs) [33, 50]. Recent improvements to CNNs include multi-scale layers [23], deconvolution layers [62] (transposed convolutions), and recurrent architectures [41, 44]. Nevertheless, the main disadvantage of such networks is that they need an immense amount of annotated data to become robust and general. For instance, in the realm of eye tracking, gaze position estimation and eye movement detection are based on robust detection of the pupil center in eye images [20]. More specifically, modern eye trackers rely on image-based pupil center detection and head pose estimation, where multiple landmarks have to be detected initially. A state-of-the-art approach to cope with this problem is to synthesize image data. For example, [48] employed rendered images for gaze position estimation in both head-mounted and remote eye tracking. [32, 45] used rendering to measure the effect of eyeglasses on gaze estimation. [59] applied a k-nearest neighbor estimator to rendered images to compute the gaze signal of a person directly from an image. This approach was further improved by [63], who used rendered data to train a CNN.

However, rendering data is itself challenging, since the goal is highly realistic data that not only covers a certain variety of anatomical structures of the eye and head, but also reflects realistic image capturing properties of the real world. Consequently, models generally need to be trained on both synthetic and real images. Since annotating real-world images is a tedious task, we propose an algorithm supporting accurate image annotation, coined Multiple Annotation Maturation (MAM). MAM is a self-training algorithm based on a grid of detectors. Unlabeled data is clustered based on detection, iteration, and recognition. To ensure a high detection accuracy for each point, our approach uses a grid of detectors; the deformation of this grid is used to cope with object deformation and occlusions. MAM enables labeling a large amount of data based on only a small fraction of annotated data and is also capable of reusing already trained detectors under different environmental conditions. Additionally, it delivers specialized object detectors, which can further be used for new data annotations or online detection.

The remainder of this paper is organized as follows. After a review of related work on transfer learning, the proposed approach is described. We then present examples of our new dataset as well as how it was annotated. The last sections cover the evaluation of the proposed approach on public datasets and its limitations.

2 Related Work

Our method belongs to the domain of transfer learning. Transfer learning refers to the problem of adapting a classification model or detector to a new problem, or enhancing its general performance on unknown data. This problem can be solved in an inductive, transductive, or unsupervised way. In the inductive case, annotated data in the target domain is provided in addition to labeled data from the source domain. The process is called self-taught learning or multi-task learning. In self-taught learning, unlabeled data is used to improve the classification performance. For example, [42] proposed a two-step architecture: In the first step, feature extraction is improved by analyzing the unlabeled data using sparse coding [38]. The obtained basis vectors are afterwards used to generate a new training set from the labeled data. Then, a machine learning approach, such as a support vector machine (SVM), is trained on the new training data. In multi-task learning, the goal is to improve the classification based on the information gained from other tasks or classes. It has been shown experimentally in [2, 7, 11, 52] that, if the tasks are related to each other, multi-task learning outperforms individual task learning. In [2], for example, a Gaussian Mixture Model on a general Bayesian-statistics-based approach as developed by [1, 3] was employed. [11] developed a nonlinear kernel function similar to SVMs, which couples the multi-task parameters to a relation between two regularization parameters and separate slack variables per task. In another work, [6] inspected the problem of detecting pedestrians in different datasets recorded with different systems (DC [35] and NICTA [37]). The authors used a nearest neighbor search to adapt the distribution between the training sets of both datasets and construct a new training set.

In the transductive case of transfer learning, available labeled data in the source domain is employed with the intention of adapting the model to a new (but related) domain, i.e., domain adaptation. If the domain stays the same, the problem reduces to the sample selection bias, i.e., finding a weighting of the training samples that yields a better-generalized classifier, as proposed by [25]. Another approach is the covariance shift proposed by [46], which is the importance weighting of samples in a cross-validation scenario with the same goal of producing a better-generalized classifier. If the domain or distribution between the training set and the target set differs, it is usually known as domain adaptation. Numerous works have been proposed in this field of transfer learning. For example, [24] proposed Large Scale Detection through Adaptation (LSDA), which learns the difference between the classification and the detection task to transform labeled classification data into detection data by finding a minimal bounding box for the classification label. [43] adapts a region-based convolutional neural network (RCNN) detector trained on labeled data to unlabeled data. Here, the first step is normalizing the data in the source and target domain by calculating the first n principal components. Afterwards, a transformation matrix aligning both domains is computed. The source data is then transformed using this matrix, and the RCNN detector is trained on the transformed data. In [26], a Gaussian process regression was used to reclassify uncertain detections of a Haar cascade classifier [56]. The Gaussian process is initialized with confident detections selected by a threshold. In [12], domain adaptation was used to improve image classification. Their proposed pipeline starts with maximum mean discrepancy (as in [34, 39, 47]) for dimensionality reduction and aims to minimize the distance between the means of the source and target domain. Afterwards, a transformation based on Gaussian Mixture Models is computed and applied to the source domain; this step aims to adjust the marginal distribution. The last step is a class-conditional distribution adaptation as proposed in [13], which is again based on Gaussian Mixture Models. The same procedure is used in [34], where a modified version of the maximum mean discrepancy is used for the marginal and conditional distribution adaptation. [47] learned a nonlinear transformation kernel as first proposed by [40], with the difference that they used an eigendecomposition to avoid the need for semidefinite programming (SDP) solvers. In the realm of deformable part-based models, [61] proposed to incrementally improve the target classifier based on multiple instance learning. Therefore, their model needs either some ground truth in the target data or a previously trained detector. For a new image, the training data is updated based on the detections, and the detectors are retrained on this data. This step is repeated until there is no update to the training set.

The last (and most challenging) category is unsupervised learning. The most famous representative of this group is Principal Component Analysis [58]. The main application of unsupervised learning is feature extraction (from images or audio data) [39] based on autoencoders. The signal itself is the target label, and the internal weights are learned as a sparse representation. This representation serves as a simpler, more understandable structure of the input data for machine learning algorithms. Based on such features, more advanced approaches like one-shot object classification, as proposed by [14], or one-shot gesture recognition, as proposed by [57], can be applied. [14] initialized a multi-dimensional Gaussian Mixture Model on already learned object categories and retrained it on a small set of new object classes using Variational Bayesian Expectation Maximization. [60] proposed a new feature extractor, the extended motion history image, which includes gait energy information (compensation for pixels with low or no motion) and inverse recording (to recover the loss of initial frames).

Our approach for automatic video labeling belongs to the category of self-training. It does not require prior knowledge of the object to be detected, only a very small set of labeled examples. Initialization can be done either with a previously trained detector or by labeling a few object positions (ten in our evaluation).

3 Method

The general idea behind our algorithm is that an object occurs under similar conditions in a video, but at different timestamps. By similar conditions we mean, for example, the same pose and illumination; different conditions, in turn, pose different challenges. As illustrated in Fig. 1(a), the orange line represents the same object under different conditions (y-axis) over time (x-axis). Using this interpretation, we can consider the object in a video as a function (orange line). Given some examples (gray dots in Fig. 1), our algorithm tries to detect objects under similar conditions in the entire video (horizontally juxtaposed dots on the orange line). The knowledge gained from the first iteration is represented by the green bars in Fig. 1. In the second iteration, this knowledge is extended (blue bars) by retraining a detector on the existing knowledge. This approach alone leads to saturation, which is especially pronounced if some challenges are overrepresented in the video. Saturation is even more likely if the object does not follow a continuous function, which also impedes tracking (orange line in Fig. 1(b)).

Fig. 1. Our approach, MAM, tries to extend its knowledge of the object. The orange line represents the object to be detected in the video under different conditions, such as reflections or changing illumination (challenges). The x-axis represents the timeline of the video, and the gray dots represent the initially given labels. The green bars represent the detected objects under similar challenges; blue is the detection state after the second iteration. (Color figure online)

To cope with this problem, we propose to cluster the detections (knowledge K) into age groups (A, Eq. 3), where the age is determined by the number of re-detections. This clustering allows us to train a set of detectors for different age groups. The detector trained on the entire knowledge obtained from the video (V) is used to validate new detections over multiple iterations (re-detection). The detectors trained on younger subsets are used to extend the knowledge. The challenge then becomes evaluating whether a newly trained detector is reliable or not. Here, we use a threshold TH on recall and precision (on the training set). If either falls below TH, the algorithm is stopped or the detector is excluded from the iteration (Eq. 1).

$$\begin{aligned} STOP={\left\{ \begin{array}{ll}1&{}\frac{TP}{TP+FP}<TH \\ 1&{}\frac{TP}{TP+FN}<TH \end{array}\right. } \end{aligned}$$
(1)
$$\begin{aligned} D_{Iter,Feat}^{Age}= \frac{1}{2} ||w||^{2} \sum _{i}^{|A<Age|} \alpha _{i} \big ( y_{i} (\langle x_{i},w \rangle +b)-1 \big ), \quad y_{i}\in L_{A<Age},\; x_{i} \in Feat(K_{A<Age}) \end{aligned}$$
(2)

Equation 2 shows the simplified optimization of an SVM for the age subsets (used in this work). w is the vector orthogonal to the hyperplane, \(\alpha _{i}\) are the Lagrange multipliers, and b is the shift. In this optimization, we seek to maximize \(\alpha \) and minimize w and b. With \(L_{A<Age}\), we address the subset of found labels L which have a lower age than Age. The same applies to \(K_{A<Age}\), where Feat() represents a transformation of the input data. In our implementation, we only used the raw and histogram-equalized images. The detector \(D_{Iter, Feat}^{Age}\) can be any machine learning algorithm, e.g., a CNN, random forest, neural net, etc.
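As a minimal sketch of Eq. 2, the following Python snippet trains a detector on the knowledge subset whose age is below a given value. It uses scikit-image HOG features and a linear SVM from scikit-learn as a simplified stand-in for the DLIB HOG/SVM detector used in our implementation; the data layout (a list of (patch, label, age) triples) and all names are illustrative, not the exact code.

import numpy as np
from skimage.feature import hog
from sklearn.svm import LinearSVC

def feat(patch):
    # Feat(): HOG descriptor of a fixed-size gray-scale image patch.
    return hog(patch, pixels_per_cell=(8, 8), cells_per_block=(2, 2))

def train_age_detector(knowledge, age):
    # Train D_{Iter,Feat}^{Age} on all samples whose age A is below `age`.
    subset = [(p, lab) for (p, lab, a) in knowledge if a < age]
    X = np.array([feat(p) for p, _ in subset])
    y = np.array([lab for _, lab in subset])   # +1 object, -1 background
    clf = LinearSVC(C=1.0)
    clf.fit(X, y)
    return clf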

$$\begin{aligned} A(i)={\left\{ \begin{array}{ll}A(i)+a &{}, K(i)\in D_{Iter,Feat}^{Age}(V)\\ 0 &{}, \text {else}\end{array}\right. } \end{aligned}$$
(3)

Equation 3 specifies the aging function. If the detector \(D_{Iter,Feat}^{Age}\) detects a previously found object in an image, the age of this object is increased by a constant factor a. In the following, we describe the details of our algorithm and address our solutions to the challenge of detecting the position accurately without further information about the object (avoiding drift).
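The following sketch illustrates the aging update of Eq. 3 together with the stop criterion of Eq. 1. The data structures (a dictionary mapping frame indices to ages, a set of re-detected frame indices) and the parameter names are assumptions made for illustration.

def update_ages(ages, redetected, a):
    # Eq. 3: increase the age of re-detected objects by the group-specific
    # constant a; reset the age of objects the detector did not re-find.
    for frame in ages:
        ages[frame] = ages[frame] + a if frame in redetected else 0

def stop(tp, fp, fn, th):
    # Eq. 1: stop (or exclude the detector) if precision or recall on the
    # training set falls below the threshold TH.
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision < th or recall < th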

Fig. 2. Workflow of the MAM algorithm. The gray boxes on top represent the input and those on the bottom the output of each iteration. The algorithm starts by splitting its knowledge into age groups and trains detectors for each of them. Afterwards, knowledge and age are updated and a new iteration starts (orange arrow). (Color figure online)

Figure 2 shows the workflow of the algorithm, where either a previously labeled set or a detector can serve as input. The input represents the initial knowledge of the algorithm. In the first iteration, only one detector can be trained (since only one age group exists). After n iterations, there can be theoretically n age groups, though this does not happen in practice. Nonetheless, it is useful to restrict the number of age groups for two reasons. First, it reduces the computational costs in the detection part (since each detector has to see the entire video). Second, it packs together similar challenges, which would generate more stable detectors. For all our implementations, we used three age groups. The first group (G1) trains on the entire knowledge for validation (Eq. 1) and correction. In the second group (G2), all objects detected twice are selected. Then, in the last group (G3), only objects detected once are selected. After detection, the age is updated, where we assign each group a different a as specified in Eq. 3.
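A minimal sketch of this grouping, assuming that the age directly reflects the number of detections, is given below; the data layout (a dictionary mapping frame indices to (sample, age) pairs) is an assumption for illustration.

def split_into_age_groups(knowledge):
    # G1: the entire knowledge, used for validation and correction (Eq. 1).
    g1 = dict(knowledge)
    # G2: objects that were detected twice (read here as age == 2).
    g2 = {f: (s, a) for f, (s, a) in knowledge.items() if a == 2}
    # G3: objects that were detected only once so far.
    g3 = {f: (s, a) for f, (s, a) in knowledge.items() if a == 1}
    return g1, g2, g3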

For the implementation, we used the histogram of oriented gradients (HOG) together with an SVM, as proposed by [15]; more specifically, we used the DLIB implementation from [31]. HOG features rely on cells, which makes them either inaccurate (at the pixel level) or memory-consuming (overlapping cells). In our implementation, we shifted the computed gradients below the cell grid in x and y direction, ranging from one to eight pixels (the used cell size cs). For each shift, we run a detection and collect the results. The idea is that the average of all detections is accurate. For some of those detections, the counterpart is missing (no detection on the opposite shift); therefore, we remove outliers beyond two times the standard deviation. This shift procedure not only improves the accuracy, but also increases the detection rate.
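The sketch below approximates this refinement by shifting the input image instead of the internal HOG gradients, which has a comparable effect. The exact shift set, the wrap-around border handling, and the detector interface (any callable returning (x0, y0, x1, y1) boxes, in our case the DLIB HOG/SVM detector) are assumptions for illustration.

import numpy as np

def refined_center(image, detect, cs=8):
    # Run the detector on shifted copies of the image, map all detection
    # centers back to original coordinates, and average after 2-sigma
    # outlier removal (border wrap-around is ignored in this sketch).
    centers = []
    for dx in range(-cs, cs + 1):
        for dy in range(-cs, cs + 1):
            shifted = np.roll(np.roll(image, dx, axis=1), dy, axis=0)
            for (x0, y0, x1, y1) in detect(shifted):
                centers.append(((x0 + x1) / 2 - dx, (y0 + y1) / 2 - dy))
    if not centers:
        return None
    c = np.array(centers)
    mean, std = c.mean(axis=0), c.std(axis=0)
    keep = np.all(np.abs(c - mean) <= 2 * std, axis=1)
    return tuple(c[keep].mean(axis=0))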

Fig. 3. Subset of challenges which arise in pupil center detection: deformations, reflections, motion blur, nearly closed eyes, and contact lenses. Images are taken from [18, 20, 21].

Accuracy becomes even more challenging for deformable objects, in addition to moving occlusions, changing lighting conditions, and distractors (Fig. 3). Specifically, for pupil center detection, the circular pupil deforms to an ellipse, as shown in Fig. 3. Moreover, the pupil size changes, and many people use makeup or need eyeglasses, all of which lead to reflections in the near-infrared spectrum. To adapt to those challenges, we propose to use a grid of detectors and average over the deformation. This averaging depends on the possible combinations of different success patterns of the grid (symmetric patterns).

Fig. 4. Some exemplary symmetric means for a detector grid with size nine.

In our implementation, we chose the minimal grid consisting of nine detectors with a shift of gs pixels. Some valid symmetric mean patterns are shown in Fig. 4, where a red dot indicates that the detector belonging to this grid position found an object. These patterns can be calculated using the binomial coefficient to enumerate all possible combinations. To evaluate whether a combination is symmetric, the sum of its coordinates, centered on the central detector, has to be zero (for example, \(x,y \in \{-1,0,1\}\), where (0, 0) is the central detector).
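The symmetric patterns can be enumerated programmatically; the following sketch lists all subsets of the nine grid positions whose offsets, centered on the middle detector, sum to zero in x and y. The minimum number of firing detectors is an assumption made for illustration.

from itertools import combinations

OFFSETS = [(x, y) for x in (-1, 0, 1) for y in (-1, 0, 1)]  # 3x3 grid positions

def symmetric_patterns(min_hits=2):
    # A pattern (set of positions that fired) is symmetric if its centered
    # offsets cancel out, i.e., sum to zero in both x and y.
    patterns = []
    for k in range(min_hits, len(OFFSETS) + 1):
        for subset in combinations(OFFSETS, k):
            if sum(x for x, _ in subset) == 0 and sum(y for _, y in subset) == 0:
                patterns.append(subset)
    return patterns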

Fig. 5. Exemplary images of the new dataset.

4 New Dataset

In addition to the proposed algorithm, this work contributes a new dataset with more than 16,200 hand-labeled images (\(1280 \times 752\)) of six different subjects. These images were recorded using a near-infrared remote camera in a driving simulator setting (a prototype car with all standard utilities included) at Bosch GmbH, Germany. As exemplarily shown in Fig. 5, the subjects drove in a naturalistic way; e.g., when the steering wheel is turned, the eyes or head may be occluded.

Fig. 6. Exemplary eyelid and pupil annotations. The red dots are on the pupil boundary, the green dots represent the upper eyelid, the blue dots the lower eyelid, and the turquoise dots are on the eye corners. (Color figure online)

We annotated all eyes in these images using a modified version of EyeLad from [19]. Eyes that are occluded by approximately 50% or more were not annotated. We labeled the smallest enclosing eye boxes, the pupil outline with five points, and three points each for the eye corners and the upper and lower eyelid. The pupil annotation consists of five points on the outline with sub-pixel accuracy (Fig. 6). The new data contains different kinds of occlusions, for instance, reflections (Fig. 6(d)), the nose of the subject (Fig. 6(f)), occlusion due to steering (Fig. 6(e)), and occlusion of the pupil or eyelids due to eyelashes (Fig. 6(b)). Therefore, we believe that our dataset is a valuable contribution to the research community in the realm of object detection, specifically for gaze tracking.

Table 1. Eye detection results (recall; T = true, F = false) for the first, middle, and last iteration. Subject 6 (images on the left) has many unannotated frames, since the eyes are occluded by approximately 50% (100% of the error is on non-annotated locations). The red star represents a detection by our algorithm that was not annotated, and the green star represents an annotation that was successfully found.
Table 2. Head-mounted pupil center detection results with an error of up to five pixels [21].
Table 3. Remote pupil center detection results (3 and 6 denote the pixel error).
Table 4. Remote eyelid point detection results (3 and 6 denote the pixel error).

5 Evaluation

We evaluated our algorithm on several publicly available datasets ([17, 18, 21, 27, 55, 64]) for self-learning, together with our proposed dataset. The first evaluation is without the grid of detectors to demonstrate the performance of the aging approach itself. Table 1 shows the results for the eye detection task (without grid). We ran the algorithm for a maximum of 15 iterations. Most of the error in the proposed dataset stems from unlabeled images due to the annotation criterion of labeling only eyes with less than 50% occlusion. This error is especially apparent for subject 6, where it reaches 28% in relation to all possible correct detections. The same applies to subjects 2 and 5. The subsequent evaluations refer to pixel-precise object recognition.

Table 2 shows the results comparing our approach to the state-of-the-art algorithms [49], ExCuSe [18], and ElSe [21]. The results show that our approach had the highest detection performance on all datasets. Here, the maximum number of iterations was set to 15. For the initialization of our algorithm, we selected ten annotations. The distance between the selected annotations was again ten frames (\(i \mod 10=0\)). Though our algorithm outperforms all competitors, the results still leave room for further improvement. The input to the algorithm was each entire dataset, except for dataset XIX. Here, we performed the same selection of ten frames from 13,473 images as for the other sets, but for the iterations, we divided it into three sets: The first two sets contained 5,000 images each and the last set 3,473 images. This division was made because the original size of the dataset exceeded the memory capacity of our server (see Footnote 1).

For the comparison in remote pupil detection, we chose the best competitor in [17], which is the second part of ElSe [21], since it outperformed all other algorithms [9, 22, 53] on all datasets. For the datasets GI4E [55] and BioID [17, 27], we used the labeled eye boxes and increased their size by twenty pixels in each direction in order to increase the challenge. For the proposed dataset, we selected the eye center and extracted a \(161 \times 161\) area surrounding it. We only used the left eye (from the viewer's perspective) for the pupil center evaluation to reduce the dataset size. For the proposed approach, we again initially selected ten images with a fixed distance of ten frames (\(i \mod 10=0\)). As indicated in Table 3, the proposed approach surpasses the state-of-the-art. Moreover, the effect of the increased eye boxes is shown for ElSe (see Footnote 2).
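A minimal sketch of this data preparation is given below; the helper names are hypothetical and border handling is omitted for brevity.

import numpy as np

def crop_around_center(image, cx, cy, size=161):
    # Extract a size x size patch centered on the annotated eye center.
    half = size // 2
    x0, y0 = int(round(cx)) - half, int(round(cy)) - half
    return image[max(y0, 0):y0 + size, max(x0, 0):x0 + size]

def enlarge_box(x0, y0, x1, y1, margin=20):
    # Grow a labeled eye box by `margin` pixels in each direction.
    return x0 - margin, y0 - margin, x1 + margin, y1 + margin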

For the eyelid experiment, we evaluated our approach against the shape detector from [30]. This predictor was trained on all datasets except the one used for evaluation; for example, the evaluation for subject 1 involved training the predictor on subjects 2 through 6. The defined eyelid shape is constructed from four points, as illustrated in the image next to Table 4. The left and right eye corner points are used as the ground truth data. For the upper and lower eyelid point, we interpolated the connection using Bezier splines and selected the center point on both curves. The images were the same as in the previous experiment. For the point selection, we again used ten points with a distance of ten frames (\(i \mod 10=0\)). We selected different starting locations to give a broader spectrum of possible results of the algorithm. As can be seen in Table 4, our algorithm is most often the most accurate, even though it detects each point separately without any global optimization between the points. In addition, it should be noted that we optimized the evaluation in favor of the approach from [30]: It expects a bounding box equally centered on the object to estimate the outline and fails otherwise, whereas for our approach it does not matter if the eye box is shifted (see Footnote 2).
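As one possible reading of this protocol (a sketch, not the exact code used), the eyelid center point can be obtained by treating the three annotated eyelid points as control points of a quadratic Bezier curve and evaluating the curve at t = 0.5:

import numpy as np

def quadratic_bezier(p0, p1, p2, t):
    # Standard quadratic Bezier curve B(t) with control points p0, p1, p2.
    p0, p1, p2 = map(np.asarray, (p0, p1, p2))
    return (1 - t) ** 2 * p0 + 2 * (1 - t) * t * p1 + t ** 2 * p2

def eyelid_center(points):
    # points: the three annotated eyelid points [(x, y), (x, y), (x, y)].
    return tuple(quadratic_bezier(points[0], points[1], points[2], 0.5))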

6 Conclusion

We proposed a novel algorithm for automatic and accurate point labeling in various scenarios with remarkable performance. While our algorithm is capable of generating detectors in addition to the annotation, it remains difficult to evaluate their generality; hence, we refer to them as specialized detectors. In addition to the proposed algorithm, we introduced a dataset with more than 16,000 manually labeled images with annotated eye boxes, eyelid points, eye corners, and the pupil outline, which will be made publicly available together with a library.