Recording eye-movement data with the VOG method has relatively straightforward requirements. First, one or more cameras are needed to capture eye images. Second, an algorithm is needed to detect a few distinguishable features in the eye images. For calculating the pupil center in particular, different features can be used (e.g., limbus, pupil contour). In this paper, we detect the pupil center directly from a region of interest that is determined by the glint detection. In the following subsections, we describe our system.
Hardware design
To reduce the cost of the eye tracker (less than 600 euros), we use only one camera. Moreover, IR emitters (LEDs) are used to create glints and a dark-pupil effect in the eye images. Table 1 shows the components necessary to build the eye tracker. The main challenge in building a high-speed eye tracker is finding a suitable camera: most high-speed cameras are either expensive or have a very low resolution. We use a CMOS PYTHON 500 by ON Semiconductor with a 1/3.6” global-shutter sensor. A USB 3.0 monochrome camera equipped with this sensor is able to run at 575.0 fps at 800 × 600 pixel resolution.
Table 1 List of components for building the high-speed eye tracker

We use SFH 4557 LEDs (840 nm) from OSRAM Opto Semiconductors Inc. for IR illumination, and an Optolite Infra Red Acrylic visible-light filter, chosen for its performance/cost ratio (Instrument Plastics Ltd Optolite Infra Red Acrylic, n.d.), to block wavelengths below 740 nm. Figure 1 shows the eye tracker mounted close to the computer screen. Based on the Intersil IR safety guide, our LEDs have a total irradiance of 0.8 W/m2, which is far below the 100 W/m2 limit (according to IEC 62471, based on CIE S009).
We developed our eye-tracking system on a PC with an Intel i7-7700K CPU (4.2 GHz) and 16 GB RAM. Although we tested our eye tracker on this computer, our results show that neither this much RAM nor this much CPU power is required.
Software design
The whole system is written in C/C++ and uses the OpenCV, Qt, and Boost libraries. On startup, a user interface is shown from which calibration, gaze estimation, image or video upload, and playback can be chosen. The user interface is built for easy usage of the different processes (Fig. 2). All data are saved inside a user folder. The preferred sequence of using the system is as follows:
Show face: First, the user should check the head position. The "show face" function opens the camera and shows the user's face with the detected pupil centers and glints mapped onto the image (see Fig. 5 for an example). In this view, the user can already get an impression of the detection quality and whether the head should be adjusted for a better position and focus in the camera.
Calibration: After adjusting the head position, a calibration procedure is run to allow the system to adjust the default values of the 3D eye model to the user. This step is also necessary to train our pupil detection algorithm to learn the most probable pupil sizes for each user (explained in detail in the section "Pupil detection"); afterwards, the gaze estimation algorithm can calculate the visual/optical axis based on the estimates of the pupil and cornea diameter. This step takes around 10-30 s.
The user can then inspect the calculated values, which are plotted onto the calibration pattern. As the gaze estimation highly depends on the results of the calibration, a shift or similar artifacts can be seen on this screen. After calibration, the user has several options.
Gaze estimation evaluation: An optional gaze estimation evaluation procedure for calculating the accuracy and precision on unseen data can be chosen.
Live Gaze: To experience how well the tracker works, the user can start a live gaze overlay that shows the estimated gaze on the screen.
Stimulus playback: If the user has uploaded an image or video as stimulus, it can be played back while gaze is recorded in real time. As stimuli can have different frame rates, we synchronize gaze and stimulus by saving the timestamp of each gaze sample when it is captured and the timestamp of each stimulus frame when it is presented on the screen. Afterwards, the mapping takes the timestamp of each gaze sample and matches it to the timestamp of the corresponding stimulus frame. The resulting video has the rate of the gaze signal (575 Hz). For example, if the stimulus video runs at 30 Hz, each stimulus frame timestamp corresponds to about 19 gaze timestamps; the resulting video then contains about 19 copies of the same stimulus frame, each with a different gaze sample mapped onto it. The raw gaze data with the timestamps and the stimulus with the mapped gaze are placed inside the user folder.
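As an illustration of this mapping, the following sketch (with illustrative type and function names, not our actual implementation) matches each gaze sample to the latest stimulus frame presented at or before its timestamp:

```cpp
// Hypothetical sketch of the offline gaze-to-stimulus mapping: for each gaze
// sample (575 Hz) we pick the stimulus frame whose presentation timestamp is
// the latest one not after the gaze timestamp.
#include <cstdint>
#include <vector>

struct GazeSample    { std::int64_t t_us; float x, y; };    // capture time, screen coords
struct StimulusFrame { std::int64_t t_us; int frameIdx; };  // presentation time

// Both vectors are assumed to be sorted by timestamp.
std::vector<int> mapGazeToFrames(const std::vector<GazeSample>& gaze,
                                 const std::vector<StimulusFrame>& frames)
{
    std::vector<int> frameForSample;
    frameForSample.reserve(gaze.size());
    std::size_t f = 0;
    for (const GazeSample& g : gaze) {
        // Advance to the last frame presented at or before this gaze sample.
        while (f + 1 < frames.size() && frames[f + 1].t_us <= g.t_us)
            ++f;
        frameForSample.push_back(frames[f].frameIdx);
    }
    return frameForSample;  // at a 30-Hz stimulus, ~19 consecutive samples share a frame
}
```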
In general, our system consists of two main threads. The first handles image capturing and feature detection for both eyes; the second handles the display of the stimulus and the timed saving of the detected features according to the stimulus. The main components of the first thread are: (1) image capturing, (2) glint detection, and (3) pupil detection. The second thread shows the stimulus and grabs the saved features from the first thread. Either the calibration, the evaluation, the live gaze overlay, or the video playback procedure runs in the second thread. These procedures differ only in how they use the captured data. During calibration, the detected values are used to estimate the visual and optical axes of the user's eyes; during gaze estimation and video playback, the values are used to estimate the gaze based on the calibration model. In all procedures, the values are saved inside the user folder (Fig. 3).
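The following sketch outlines this two-thread split as a standard producer/consumer pattern; the types and function names are illustrative and simplified compared to the actual implementation:

```cpp
// Hypothetical sketch of the two-thread design: thread 1 captures frames and
// detects features, thread 2 consumes them for calibration, live gaze, or
// stimulus playback. Names are illustrative only.
#include <atomic>
#include <condition_variable>
#include <mutex>
#include <queue>
#include <thread>

struct EyeFeatures { double t; float pupilX, pupilY, glintX[3], glintY[3]; };

std::queue<EyeFeatures> featureQueue;
std::mutex              queueMutex;
std::condition_variable queueCv;
std::atomic<bool>       running{true};

void captureThread() {              // thread 1: grab frame, detect glints and pupil
    while (running) {
        EyeFeatures f{};            // placeholder for the per-frame detection result
        {
            std::lock_guard<std::mutex> lock(queueMutex);
            featureQueue.push(f);
        }
        queueCv.notify_one();
    }
}

void consumerThread() {             // thread 2: calibration / gaze estimation / playback
    while (running) {
        std::unique_lock<std::mutex> lock(queueMutex);
        queueCv.wait(lock, [] { return !featureQueue.empty() || !running; });
        while (!featureQueue.empty()) {
            EyeFeatures f = featureQueue.front();
            featureQueue.pop();
            // use f: fit the calibration model or estimate gaze and save to the user folder
        }
    }
}

int main() {
    std::thread t1(captureThread), t2(consumerThread);
    // ... run until the user stops the recording ...
    running = false;
    queueCv.notify_all();
    t1.join();
    t2.join();
}
```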
Image capture
The uEye camera software from IDS Imaging provides an SDK to set camera parameters. The manually set parameters in our system are, for example, the frame rate (575 fps), the exposure time (1.6 ms), and the image brightness. The camera gain is another important parameter, as a high gain value leads to a very noisy image. With higher frame rates, the exposure time gets shorter and therefore less light reaches the sensor. For detection, the image must be bright enough to distinguish the different features. As an exposure time of 1.6 ms with the default gain value did not provide enough brightness for robust feature detection, the gain value of the sensor needs to be adjusted too. A higher gain value leads to a brighter image, but also increases the noise in the image, which can disturb the detection of edges (e.g., glint contours). Setting these values is therefore a trade-off between brightness and noisiness, and they (gamma 220, gain 3) were adjusted to obtain an image that is bright enough to distinguish the features but low-noise enough not to disturb edges. Next, the frame capture is started and sends the data to our software. When a frame has been processed within the camera memory, the captured data are saved in the working memory (RAM) of the PC, where they are accessible for the second thread.
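A minimal sketch of such a camera setup with the IDS uEye SDK is given below. The calls shown exist in the SDK, but the exact unit scaling of the gamma and gain values, and whether the older-style setters are still preferred in current SDK versions, are assumptions to be checked against the SDK documentation:

```cpp
// Hypothetical sketch of setting the camera parameters via the IDS uEye SDK
// (ueye.h). The gamma/gain value scaling is an assumption; verify it against
// the SDK documentation for the sensor in use.
#include <ueye.h>
#include <cstdio>

int main() {
    HIDS hCam = 0;                                   // 0 = first available camera
    if (is_InitCamera(&hCam, nullptr) != IS_SUCCESS) {
        std::fprintf(stderr, "camera init failed\n");
        return 1;
    }

    double requestedFps = 575.0, actualFps = 0.0;
    is_SetFrameRate(hCam, requestedFps, &actualFps); // camera reports the achievable rate

    double exposureMs = 1.6;                         // short exposure at 575 fps
    is_Exposure(hCam, IS_EXPOSURE_CMD_SET_EXPOSURE, &exposureMs, sizeof(exposureMs));

    is_SetGamma(hCam, 220);                          // brightness correction (assumed scaling)
    is_SetHardwareGain(hCam, 3,                      // master gain; color gains ignored (mono)
                       IS_IGNORE_PARAMETER, IS_IGNORE_PARAMETER, IS_IGNORE_PARAMETER);

    // ... start the capture (e.g., is_CaptureVideo()) and hand frames to the software ...

    is_ExitCamera(hCam);
    return 0;
}
```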
Glint detection
When a camera frame is ready, the image is convolved with a Gaussian kernel for denoising. On the denoised image, a 2D Laplacian filter (Laplacian of Gaussian) is applied to enhance small bright spots such as the glints. The kernel size for both filters highly depends on the size of the bright points that should be detected and is a trade-off between speed and quality of results. We choose the smallest possible kernel size to achieve the fastest detection: in conjunction with optimizing the brightness of the IR emitters by controlling their voltage and angle, a 3 × 3 kernel suffices to detect the glints in the images. As the enhancement also strengthens other, less bright white blobs, the image is thresholded. The threshold value can be adjusted in the interface if needed; we found that the difference in brightness between glints and other bright points is large enough to use a fixed threshold of 176 on images with brightness normalized to the range 0 to 255. The thresholding removes less bright white blobs that could disturb the contour detection of the brightest blobs (assumed to be the correct glints) or lead to false detections in the next step. After contour detection on the thresholded image, the centers of the minimum-area rectangles of the found contours are passed as glint candidates to the next step. For each glint candidate, the brightness of and distance to the surrounding pixels are checked. These values depend on the size of the glints and on the distance between them. We ignore reflections, assuming they lie on the sclera, when the mean brightness of the surrounding pixels is above a certain limit. We placed the IR emitters such that the glints often appear at the border of the iris when a user sits at a 60–70-cm distance from the camera. A glint candidate whose surrounding mean brightness is too high is treated as a reflection of another environmental light source. One problem here is choosing the correct threshold: low limit values eliminate glints that lie on the border between iris and sclera, whereas high limit values create a large number of false glints on the sclera.
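A compact sketch of this candidate-extraction stage with OpenCV could look as follows; the kernel sizes and the threshold follow the values above, while the function and variable names are illustrative:

```cpp
// Hypothetical sketch of the glint-candidate stage with OpenCV: denoise,
// enhance bright spots with a Laplacian, threshold at a fixed value, and take
// the centers of the minimum-area rectangles of the remaining contours.
#include <opencv2/imgproc.hpp>
#include <vector>

std::vector<cv::Point2f> glintCandidates(const cv::Mat& eyeImageGray)
{
    cv::Mat denoised, enhanced, normalized, binary;

    cv::GaussianBlur(eyeImageGray, denoised, cv::Size(3, 3), 0);         // 3x3 denoising
    cv::Laplacian(denoised, enhanced, CV_16S, 3);                        // 3x3 Laplacian (LoG)
    cv::convertScaleAbs(enhanced, enhanced);                             // back to 8-bit magnitude
    cv::normalize(enhanced, normalized, 0, 255, cv::NORM_MINMAX, CV_8U); // brightness range 0..255
    cv::threshold(normalized, binary, 176, 255, cv::THRESH_BINARY);      // keep only the brightest blobs

    std::vector<std::vector<cv::Point>> contours;
    cv::findContours(binary, contours, cv::RETR_EXTERNAL, cv::CHAIN_APPROX_SIMPLE);

    std::vector<cv::Point2f> candidates;
    for (const auto& c : contours)
        candidates.push_back(cv::minAreaRect(c).center);                 // glint candidate centers
    return candidates;                            // the surrounding-brightness check follows
}
```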
In addition to the above procedure, we check whether the geometric relations of the glints fit the LED arrangement. We know the physical arrangement of the LEDs: two lie on the same horizontal axis and one is centered between them, shifted downwards. Thus, we can search for a similar geometry inside the eye image. For example, when two glints lie on the same horizontal axis within a vertical buffer of ± 4 pixels, we assume these are two of the three glints we are searching for. We know that the third glint must lie horizontally in the center of these two (or slightly shifted to one side, depending on the corneal curvature) and is shifted downwards vertically. The buffer and the distances between the glints in pixels were defined empirically and fit the possible glint sizes over the whole range in which the camera is in focus.
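A simple way to implement this check is to test all candidate triples against the expected LED geometry, as in the following sketch. Only the ± 4 pixel vertical buffer is taken from our description; the centering and downward-shift tolerances are illustrative:

```cpp
// Hypothetical sketch of the geometry check: search candidate triples for two
// glints on (roughly) the same horizontal line and a third one centered between
// them and shifted downwards (image y grows downwards).
#include <opencv2/core.hpp>
#include <algorithm>
#include <cmath>
#include <cstddef>
#include <vector>

struct GlintTriple { cv::Point2f left, right, bottom; bool valid = false; };

GlintTriple matchLedGeometry(const std::vector<cv::Point2f>& cand,
                             float centerTolPx = 6.f, float minDropPx = 3.f)
{
    GlintTriple best;
    for (std::size_t i = 0; i < cand.size(); ++i)
        for (std::size_t j = 0; j < cand.size(); ++j)
            for (std::size_t k = 0; k < cand.size(); ++k) {
                if (i == j || j == k || i == k) continue;
                const cv::Point2f a = cand[i], b = cand[j], c = cand[k];
                if (a.x >= b.x) continue;                           // enforce a = left, b = right
                if (std::abs(a.y - b.y) > 4.f) continue;            // same horizontal axis (+/- 4 px)
                const float midX = 0.5f * (a.x + b.x);
                if (std::abs(c.x - midX) > centerTolPx) continue;   // third glint roughly centered
                if (c.y < std::max(a.y, b.y) + minDropPx) continue; // and shifted downwards
                best.left = a; best.right = b; best.bottom = c; best.valid = true;
                // If several triples match, the surrounding-brightness check decides.
            }
    return best;
}
```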
If more than one glint combination fits our geometry, the combination with the lowest surrounding mean brightness is taken. Due to our LED design, one of the combinations must lie entirely on the pupil or iris. In all other combinations, at least one glint lies on the sclera or on the border to the iris and was not ignored in the previous steps; combinations with such a glint therefore have a higher surrounding mean brightness. Figure 3 shows two cases. Image a) shows three reflections that are positioned entirely on either the pupil or the iris. Their background is dark. We detect these three reflections as our glints, as they fit the geometry of the LEDs.
In Fig. 3 image b), there are more than three reflections and therefore more than one possible combination that fits the LED geometry. As only one combination is positioned entirely on a dark background (iris), we can ignore the other combinations: our algorithm detects that the remaining combinations have at least one reflection on the brighter background of the sclera or near the border to it, so their surrounding brightness is much higher than that of the combination lying on the iris. Once we have a glint combination for each eye, we define a region of interest around each eye and pass a much smaller image (100 × 100 pixels) to the pupil detection algorithm. These glint combinations are calculated for each incoming frame. Figure 4 shows a summary of the glint detection algorithm, and Fig. 5 shows a sample image with the detected glints and pupil center. An image of 100 × 100 pixels compared to the full image size is a good ratio, because the eye is fully covered over the full range in which the camera stays in focus. This image size appears typical for other eye trackers; for example, in The Eye Tribe the eye image is 100 × 90 pixels based on our tests.
Pupil detection
The pupil detection algorithm is an implementation of the BORE algorithm from Fuhl, Santini, and Kasneci (2017), which is based on the oriented edge optimization formulation from Fuhl et al. (2017). Oriented edge optimization gives a rating for each pixel based on its surrounding edges. In Fuhl et al. (2017), polynomials are used as the edge pattern. Each edge whose orientation relative to the current image position matches the gradient pattern (a correct edge) contributes a positive evaluation; wrongly oriented edges do not count toward the evaluation. BORE (Fuhl et al., 2017) is a reformulation of the optimization for circles and ellipses. In the case of circles, different radii (r) are considered for each image position, where the current image position corresponds to the center of the circle (p). To scan the oriented edges along a circle, different angles (a) with a fixed increment are considered.
$$ \underset{p}{\arg\max} \int_{r=r_{min}}^{r_{max}} \int_{a=0}^{2\pi} L\left(\Delta C(p,r,a)\right) \, da \, dr $$
(1)
Equation 1 describes the approach formally. rmin and rmax are the minimum and maximum circle radii. ΔC() is the oriented edge weight, which is either ignored or added to the current evaluation via the evaluation function L (Eq. 2).
$$ L = \begin{cases} 1 & \text{if } inner < outer \\ 0 & \text{otherwise} \end{cases} $$
(2)
In Eq. 2, the inner pixel value is compared with the outer pixel value. After every image position has been evaluated, the maximum is selected as the pupil center. This position corresponds to one pixel in the image, which is too inaccurate for the low-resolution eye images that typically occur in remote eye tracking. To overcome this restriction, Fuhl et al. (2017) provide an extended version with an ellipse fit: the ellipse fit selects the oriented edges that are positively evaluated for this center and calculates an ellipse over these points. As this procedure is computationally very expensive, Fuhl et al. (2017) also present an unsupervised boosting approach for automated person-specific training of the algorithm.
To train the algorithm, BORE is given video sequences captured during calibration (without any annotation). After calibration, BORE calculates the pupil centers for all oriented gradients occurring in the training images. Afterwards, all gradients are evaluated by their contribution to the detections, and unimportant gradients are removed from the algorithm sequence based on predefined run-time restrictions. This means that BORE can be adjusted to be faster or more accurate, and the algorithm can be used for frame rates of up to 1000 Hz. Further details regarding the algorithm can be found in Fuhl et al. (2017).
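For illustration, the following brute-force sketch evaluates Eqs. 1 and 2 directly, without the ellipse fit and without the boosting-based speed-up that makes BORE real-time capable:

```cpp
// Brute-force sketch of Eqs. 1 and 2 (not the optimized BORE implementation):
// for every candidate center, sample circles of different radii and count the
// angles at which the pixel just inside the circle is darker than the pixel
// just outside (dark pupil on a brighter iris). The position with the highest
// count is taken as the pupil center. gray: 8-bit single-channel eye image
// (e.g., the 100 x 100 ROI).
#include <opencv2/core.hpp>
#include <cmath>

cv::Point bruteForcePupilCenter(const cv::Mat& gray, int rMin, int rMax)
{
    const double angleStep = CV_PI / 16.0;        // fixed angular increment
    cv::Point best(0, 0);
    int bestScore = -1;

    for (int y = rMax + 1; y < gray.rows - rMax - 1; ++y) {
        for (int x = rMax + 1; x < gray.cols - rMax - 1; ++x) {
            int score = 0;
            for (int r = rMin; r <= rMax; ++r) {
                for (double a = 0.0; a < 2.0 * CV_PI; a += angleStep) {
                    const double dx = std::cos(a), dy = std::sin(a);
                    const int ix = x + static_cast<int>(dx * (r - 1));   // just inside
                    const int iy = y + static_cast<int>(dy * (r - 1));
                    const int ox = x + static_cast<int>(dx * (r + 1));   // just outside
                    const int oy = y + static_cast<int>(dy * (r + 1));
                    if (gray.at<uchar>(iy, ix) < gray.at<uchar>(oy, ox)) // Eq. 2
                        ++score;
                }
            }
            if (score > bestScore) { bestScore = score; best = cv::Point(x, y); }
        }
    }
    return best;                                  // Eq. 1: argmax over image positions
}
```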
Gaze estimation
During calibration, the user is presented with different screens containing calibration targets. First, the full nine-point calibration grid is displayed for 1 s. It is followed by a blank white screen and then the calibration targets one after another (from top left to bottom right). The presentation time per target is 1500 ms, of which we discard the first and last 500 ms. We consider the first 500 ms as necessary to establish a stable fixation on the target, and we observed that the users' focus typically decreases after 1000 ms as they begin to explore the space around the calibration point. To avoid outliers and decreasing precision in the calibration, we decided to discard those data. We determined these times based on our observations of the data during development of the system.
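A minimal sketch of this per-target sample selection, with illustrative names and a simplified sample type, is:

```cpp
// Hypothetical sketch: keep only gaze samples captured in the central 500 ms
// of the 1500 ms presentation window of each calibration target.
#include <cstdint>
#include <vector>

struct CalibSample { std::int64_t t_ms; float pupilX, pupilY; };

std::vector<CalibSample> centralWindow(const std::vector<CalibSample>& samples,
                                       std::int64_t targetOnsetMs)
{
    const std::int64_t begin = targetOnsetMs + 500;   // skip first 500 ms (fixation onset)
    const std::int64_t end   = targetOnsetMs + 1000;  // skip last 500 ms (focus decreases)
    std::vector<CalibSample> kept;
    for (const CalibSample& s : samples)
        if (s.t_ms >= begin && s.t_ms < end)
            kept.push_back(s);
    return kept;                                      // roughly 287 samples at 575 Hz
}
```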
During the central 500 ms, we capture the position of the calibration target, the detected eye features (shown in Fig. 5), and cropped image regions of the eyes. The images produce a considerable amount of data that needs to be held in memory; thus, we implemented a temporary buffer that reserves the required memory before recording. After each calibration target, however, the data need to be copied to a global queue that is accessible to the different threads. This copying and resetting of the temporary buffer happens during the last 500 ms of stimulus presentation.
In the next step, we use the cropped eye images to fine-tune the pupil detection algorithm: it learns which pupil sizes are likely to occur during the recording, which optimizes the processing speed later on. The pupil center and glint locations for each calibration point are then detected and passed on to the fitting of a 3D eye model. All calibration targets, the locations of the detected features, and the percentage of images in which the tracker could not detect the necessary features are written to a file.
For gaze estimation, that is, converting image feature locations into a 3D gaze ray, we use the 3D eye model described by Guestrin and Eizenman (2006). They specify models for any number of light sources and cameras; we employ the version with three light sources and one camera as a good trade-off between accuracy and cost efficiency. This way, we do not need to synchronize multiple high-speed cameras. One disadvantage of this configuration is that, with one camera, the model is not very robust to head movements and the optimization needs considerable computation time.
The fitting process is split into three parts. First, the optical axis of the eye is reconstructed. Accurate measurements of the physical distances between camera, LEDs, and screen are required for this, as all coordinates need to be transformed into a common world coordinate system. The cornea center is then calculated as the intersection of planes spanned by the camera location, a light source location, and the location of the corresponding reflection on the cornea surface. This problem can be solved by assuming a spherical cornea and unprojecting the glint image feature on the camera sensor into 3D. Afterwards, the pupil location can be determined in a similar way, and the optical axis is described as the vector from the cornea center to the pupil center.
In the second part, the deviation between the optical and the visual axis needs to be determined (illustrated in Fig. 6). This deviation differs between humans, and a calibration is needed to estimate the offset. The visual axis is defined by the nodal point of the eye and the center of the fovea (the region of highest acuity on the retina). Typically, this offset is about 1.5° below and between 4° and 5° temporal to the intersection point of the optical axis with the retina. The whole model is described as an optimization problem in which, depending on the camera and light setup, an equation system needs to be solved for up to 13 unknown variables simultaneously. The number of unknown variables, and therefore the model complexity, shrinks with more light sources and especially with more cameras. As soon as the visual axis is reconstructed, the model can be used to estimate the gaze (third part).
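The last two steps can be sketched as follows under simplifying assumptions: the optical axis as the unit vector from cornea center to pupil center, and the visual axis obtained by rotating it by the calibrated offsets alpha and beta. The full model of Guestrin and Eizenman (2006) expresses these offsets in an eye-fixed coordinate frame; here they are applied as plain world-frame rotations for illustration only:

```cpp
// Hypothetical sketch (simplified): optical axis from the fitted 3D centers,
// visual axis by rotating it with the calibrated angular offsets.
#include <opencv2/core.hpp>
#include <cmath>

// Unit direction of the optical axis from cornea center to pupil center (world coords).
cv::Vec3d opticalAxis(const cv::Vec3d& corneaCenter, const cv::Vec3d& pupilCenter)
{
    const cv::Vec3d d = pupilCenter - corneaCenter;
    return d * (1.0 / cv::norm(d));
}

// Visual axis: rotate by alpha (horizontal, ~4-5 deg temporal) and beta (vertical,
// ~1.5 deg). Simplified world-frame yaw/pitch instead of the eye-fixed frame.
cv::Vec3d visualAxis(const cv::Vec3d& o, double alphaRad, double betaRad)
{
    const double ca = std::cos(alphaRad), sa = std::sin(alphaRad);
    const double cb = std::cos(betaRad),  sb = std::sin(betaRad);
    const cv::Vec3d yawed( ca * o[0] + sa * o[2],
                           o[1],
                          -sa * o[0] + ca * o[2]);      // rotate around the vertical axis
    return cv::Vec3d( yawed[0],
                      cb * yawed[1] - sb * yawed[2],
                      sb * yawed[1] + cb * yawed[2]);   // rotate around the horizontal axis
}
```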
After the calibration is calculated, we re-run the calibration calculations to remove possible outliers. To this end, we apply a filter that removes all measurements whose Euclidean distance to their respective calibration target exceeds half the average distance plus/minus the standard deviation.
Signal quality
Because humans usually move slightly without a chin rest, and because the camera resolution means that each pixel covers a relatively large part of the scene, the detection algorithms can only work pixel-wise and are therefore noisy. With many more and smaller pixels this noise could be minimized, as small light changes would be spread over smaller areas. The noise is problematic when finding contours in an image, as light changes at pixels on the border of a contour can lead to a fluctuating contour. Another way to address this is to filter the signal. We filter the pupil center signal by applying a moving-average filter over 20 consecutive samples (window size: 20 samples). The filter takes the positions of all 20 samples of the current window into account to correct the position of the current sample and thus removes noise from the signal. It does not remove any sample, but rather optimizes the position of the current sample by smoothing it with the positions of the previous samples. When the algorithm receives a new pupil center position, we test whether it lies within one standard deviation of the current window. If so, the window moves one sample further by removing the oldest sample and adding the new value (PCM), and the mean of the current window, calculated with Eq. 3, is used as the new pupil center position (PCNew). If the new position lies outside one standard deviation of the window, we assume that the movement of the pupil signal was not noise but a real movement, and we restart the window with the new value as its first sample (emptying PC0, ⋯, PCM).
$$ PC_{New} = \frac{1}{n} \sum\limits_{i=0}^{n-1} PC_{M-i} $$
(3)
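A sketch of this filter for one coordinate (applied to the horizontal and vertical components separately in this sketch) could look as follows; the reset behavior outside one standard deviation follows our description above, the rest is illustrative:

```cpp
// Hypothetical sketch of the pupil-center smoothing (Eq. 3): a 20-sample
// moving-average window with a one-standard-deviation gate. Values within the
// gate are averaged; values outside it are treated as a real movement and
// restart the window.
#include <cmath>
#include <cstddef>
#include <deque>

class PupilCenterSmoother {
public:
    // Returns the smoothed coordinate for a new raw pupil-center coordinate.
    double filter(double raw) {
        if (window_.size() >= 2 && std::abs(raw - mean()) > stddev())
            window_.clear();                 // outside 1 SD: assume a real movement
        window_.push_back(raw);              // add PC_M
        if (window_.size() > kWindowSize)
            window_.pop_front();             // drop the oldest sample
        return mean();                       // Eq. 3: mean of the current window
    }

private:
    static constexpr std::size_t kWindowSize = 20;
    std::deque<double> window_;

    double mean() const {
        double s = 0.0;
        for (double v : window_) s += v;
        return window_.empty() ? 0.0 : s / window_.size();
    }
    double stddev() const {
        const double m = mean();
        double s = 0.0;
        for (double v : window_) s += (v - m) * (v - m);
        return std::sqrt(s / window_.size());
    }
};
```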
As the pupil signal can change rapidly, and filtering must be fast and should not falsify saccade detection in later steps, we chose this moving-average filter for the pupil center signal. For the glints, we use a Kalman filter (Bishop, Welch, et al., 2001) to minimize the noise. A Kalman filter can be problematic for the pupil signal, as pupil movements can be very fast and create large jumps in the image; the filter would smooth these jumps out and create a delay, which is not acceptable because saccades could no longer be detected correctly. As the glints do not move very far on the cornea, the Kalman filter is well suited to stabilizing their positions. We use this filter because slight movements of the participant already introduce noise into the glint signal: due to the small resolution of the camera image, one pixel covers a larger area, and slight movements can lead to flickering of the pixels at the boundaries of the glints. In our tests, the Kalman filter reduced this noise much better than a moving-average filter. We configured the Kalman filter with noise variances of less than one pixel on the horizontal and vertical axes.
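A minimal configuration of such a glint filter with OpenCV's cv::KalmanFilter, using a constant-velocity model, could look like this; the concrete variance values are assumptions, chosen below one pixel as described above:

```cpp
// Hypothetical sketch of the glint stabilization with cv::KalmanFilter:
// constant-velocity model with state (x, y, vx, vy) and measurement (x, y).
#include <opencv2/video/tracking.hpp>

cv::KalmanFilter makeGlintFilter(float dt = 1.0f / 575.0f)
{
    cv::KalmanFilter kf(4, 2, 0, CV_32F);             // 4 state variables, 2 measurements

    kf.transitionMatrix = (cv::Mat_<float>(4, 4) <<   // constant-velocity motion model
        1, 0, dt, 0,
        0, 1, 0, dt,
        0, 0, 1,  0,
        0, 0, 0,  1);
    cv::setIdentity(kf.measurementMatrix);            // we only observe x and y
    cv::setIdentity(kf.processNoiseCov, cv::Scalar::all(1e-3f));    // assumed process noise
    cv::setIdentity(kf.measurementNoiseCov, cv::Scalar::all(0.5f)); // sub-pixel measurement noise
    cv::setIdentity(kf.errorCovPost, cv::Scalar::all(1.0f));
    return kf;
}

// Per frame: predict, then correct with the detected glint center.
cv::Point2f filterGlint(cv::KalmanFilter& kf, const cv::Point2f& measured)
{
    kf.predict();
    cv::Mat m = (cv::Mat_<float>(2, 1) << measured.x, measured.y);
    cv::Mat corrected = kf.correct(m);
    return cv::Point2f(corrected.at<float>(0), corrected.at<float>(1));
}
```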