Noise-robust fixation detection in eye movement data: Identification by two-means clustering (I2MC)
Eye-tracking research in infants and older children has gained a lot of momentum over the last decades. Although eye-tracking research in these participant groups has become easier with the advance of the remote eye-tracker, this often comes at the cost of poorer data quality than in research with well-trained adults (Hessels, Andersson, Hooge, Nyström, & Kemner Infancy, 20, 601–633, 2015; Wass, Forssman, & Leppänen Infancy, 19, 427–460, 2014). Current fixation detection algorithms are not built for data from infants and young children. As a result, some researchers have even turned to hand correction of fixation detections (Saez de Urabain, Johnson, & Smith Behavior Research Methods, 47, 53–72, 2015). Here we introduce a fixation detection algorithm—identification by two-means clustering (I2MC)—built specifically for data across a wide range of noise levels and when periods of data loss may occur. We evaluated the I2MC algorithm against seven state-of-the-art event detection algorithms, and report that the I2MC algorithm’s output is the most robust to high noise and data loss levels. The algorithm is automatic, works offline, and is suitable for eye-tracking data recorded with remote or tower-mounted eye-trackers using static stimuli. In addition to application of the I2MC algorithm in eye-tracking research with infants, school children, and certain patient groups, the I2MC algorithm also may be useful when the noise and data loss levels are markedly different between trials, participants, or time points (e.g., longitudinal research).
KeywordsEye-tracking Fixation detection Noise Data quality Data loss
The emergence of the remote video-based eye-tracker has allowed researchers to conduct eye movement research with a plethora of participant groups for which conventional eye-tracking techniques are unsuitable. Unlike, for example, scleral coil techniques or head-mounted and tower-mounted video-based eye-trackers, remote eye-trackers can be positioned at a distance from the participants and allow them to move freely within a specified range. Remote video-based eye-trackers are therefore suitable to use in participant groups whose head movements are difficult to restrain, such as infants (Oakes, 2012) or school children (e.g., Holmberg, Holmqvist, & Sandberg, 2015). As a result, eye-tracking research in, for instance, infants has gained a lot of momentum over the last decades (e.g., Aslin & McMurray, 2004; Oakes, 2012). Although eye-tracking research in these participants groups has become easier with the advance of remote eye-trackers, and although studies are available that provide advice on how to choose an eye-tracker for research in nonoptimal conditions (Hessels, Cornelissen, Kemner, & Hooge, 2015), data quality is still often low relative to recordings of well-trained adults (Hessels, Andersson, Hooge, Nyström, & Kemner, 2015; Wass, Forssman, & Leppänen, 2014; Wass, Smith, & Johnson, 2013). Current solutions for the automatic detection of one of the most commonly investigated events, fixations, in eye-tracking data are not built for low-quality data. This applies to both the solutions provided by eye-tracker manufacturers and the research community. This is problematic for eye-tracking research with infants and young children, in which data of low quality frequently occur. As a result, some researchers have moved away from fully automatic analysis techniques and turned to manual correction of fixation detection in eye movement data from infants (Saez de Urabain, Johnson, & Smith, 2015). Here we consider the consequences of low data quality for fixation detection, describe and quantify the noise in infant data from remote video-based eye-trackers, and introduce a new and superior solution for detecting fixations in noisy data.
Event detection and data quality
An event detection algorithm generally consists of two parts. The first, which we refer to as the “search rule,” aims to separate fast periods (saccades) and slow periods (fixations) in the data from each other. The second part, the “categorization rule(s),” accepts, rejects, and/or merges the saccade and/or fixation candidates from the first rule according to a set of criteria. These criteria may include, for instance, a minimum fixation time, maximum saccade duration, and so forth. Moreover, these criteria may be based on physiological constraints of the eye—for example, a maximum acceleration during a saccade (Nyström & Holmqvist, 2010)—or on the experimental setup—for example, a minimum saccade amplitude of 2° when the elements to be fixated are spaced 4° apart. Event detection algorithms are often referred to by their search rule. For example, two popular types of event detection algorithms for labeling fixations and saccades are velocity- and dispersion-based algorithms (see Holmqvist et al., 2011, pp. 147–175, for an elaborate overview). Velocity-based event detection algorithms compute a velocity signal from the gaze position signal, and subsequently use a velocity cutoff to label periods of data as fixation candidates. The velocity cutoff used may be set in advance, for example at 30°/s. Other strategies involve first detecting saccade candidates with a fixed threshold, and subsequently finding fixation start and end points by adapting the threshold to the mean velocity in a period preceding the saccade candidate (Smeets & Hooge, 2003) or using the median velocity plus a certain number of standard deviations (Engbert & Kliegl, 2003). Dispersion-based algorithms, on the other hand, label periods of data as “fixation candidates” when a set of subsequent samples exceed a minimum time and do not exceed a maximum distance from each other. As long as subsequent samples stay within the distance limit, they are added to the fixation candidate. If, however, a subsequent sample does exceed the maximum distance, the current fixation candidate is ended and a new fixation candidate is started in the next available window that fits the minimum-time and maximum-distance requirements. Both event detection algorithms may subsequently employ different or identical categorization rules to accept, reject, or merge fixation candidates into fixations.
The two classes of event detection algorithms just discussed are widely applied for analyzing eye movement data. Although such event detection algorithms may lead to reasonable results for adult data with large saccades and a low noise level (Fig. 1, top panel), they may not necessarily do so for data with a greater noise level. Figure 1 (middle panel) depicts adult data in which the noise level is higher and smaller saccades were made. However, the fixations and saccades are still relatively easy to distinguish at first glance. In infant data (Fig. 1, bottom panel), the amplitude of noise is frequently higher than in adult data, and often over short bursts no data are reported by the eye-tracker (Hessels, Andersson, et al., 2015; Wass et al., 2014). How these differences in data quality affect event detection may depend on the specific event detection algorithm used.
Two aspects of data quality are important to consider for the present event detection purposes: spatial precision and data loss.1 First, the spatial precision of the data refers to the reliability of the measurement when no movement of the eye takes place: the variable error, or noise, in the signal. Although the eye always moves slightly (e.g., tremor or drift), one way to estimate the noise amplitude is to calculate the sample-to-sample position change during fixation (Holmqvist, Nyström, & Mulvey, 2012). When the sample-to-sample position change is low, so is the noise amplitude. One may observe that the noise amplitude is lowest for adult data using the SR Research EyeLink 1000 (Fig. 1, top panel), followed by adult data using the Tobii TX300 (Fig. 1, middle panel), and finally for infant data using the Tobii TX300 (Fig. 1, bottom panel). The noise level is determined not only by the hardware (i.e., the eye-tracker), but also by the behavior of the participant group (Holmqvist et al., 2011). For instance, the poorer data quality in infant eye-tracking research may in part be due to the higher amount of movement in infants. An increase in noise amplitude may affect outcome measures such as the number of fixation candidates and the mean duration of fixation candidates. Moreover, depending on the specific search rule used, the number of fixation candidates and the fixation duration may either increase or decrease. If a fixed velocity threshold is used to separate fixations from saccades, decreased precision may break up long fixation candidates into shorter fixation candidates due to noise spuriously exceeding the threshold (Wass et al., 2013). When long fixation candidates are broken up into multiple shorter fixation candidates, the number of fixations increases, and the mean fixation duration decreases. On the other hand, if a velocity threshold is adaptively chosen on the basis of the noise amplitude in the data, small saccades with a velocity close to that of the noise may be missed. This may then cause multiple fixation candidates to be merged into longer fixation candidates (Holmqvist et al., 2012). As a consequence, the number of fixations decreases, and the mean fixation duration increases.
The second aspect of data quality, data loss, refers to periods in which no position coordinates are reported by the eye-tracker. Although this intuitively may be attributed to a participant not looking at the screen, data loss may often occur due to unstable tracking of the eye by the eye-tracker (Hessels, Andersson, et al., 2015; Wass et al., 2014). Figure 1 (bottom panel) depicts such brief loss of contact: Between 3 and 4 s, the recorded data are interrupted by short periods of data loss. During event detection, fixation candidates may be broken up by periods of data loss (Holmqvist et al., 2012). When more data loss occurs, the number of fixations increases, and the mean fixation duration decreases, as compared to lower data loss levels. Whether changing the parameters of the categorization rule(s) may compensate for the differences in output by the search rule is part of ongoing research (Zemblys & Holmqvist, 2016). On the other hand, although the problem of reduced data quality in, for instance, infant research is a common one, few solutions have been designed to accommodate it.
Event detection in noisy data
To our knowledge, two solutions have been proposed specifically to accomplish event detection in noisy data (Saez de Urabain et al., 2015; Wass et al., 2013). Wass et al. adapted a velocity-based event detection algorithm specifically designed to cope with eye movement data from infants. The search rule of this algorithm is as follows: The algorithm first selects only the portions for which data for both eyes are available, and applies a smoothing procedure. Subsequently, periods of data loss up to 150 ms are interpolated if the velocity between the start and end of the data loss period does not exceed the velocity threshold. Thereafter, all periods of data below the velocity threshold of 35°/s are marked as fixation candidates. The categorization rules that follow to label fixation candidates as fixations are extensive. First, if a fixation candidate borders on a period of data loss, it is excluded. Second, saccades are excluded (and fixation candidates consequently merged) if the fixation candidates before and after are within 0.25° distance from each other. Third, if a saccade is preceded by a fixation candidate with an average velocity over 12°/s, the saccade and the bordering fixation candidate are excluded. Fourth, if a saccade is preceded by three samples with an average velocity over 12°/s, the saccade and bordering fixation candidate are also excluded. Fifth, if the distance between the gaze positions for the two eyes prior to the saccade is larger than 3.6°, the saccade and bordering fixation candidates are excluded. Finally, fixation candidates shorter than 100 ms are excluded. Wass et al. reported that this algorithm remains reliable for data containing higher noise amplitude, whereas standard dispersion-based algorithms decrease in reliability with increasing noise amplitude. Saez de Urabain et al., on the other hand, use a two-step approach to event detection. In a graphical user interface, the user can set a number of parameters for a first estimation of which data segments are fixation candidates (i.e., the search rule). Thereafter, the user may manually correct the fixation candidates (i.e., a manual categorization procedure). Although the machine-coding, manual-correction approach by Saez de Urabain et al. may increase the amount of eye movement data that can be successfully labeled as fixations and used in further analysis, it is a highly time-consuming, and subjective, process. Moreover, although the velocity threshold adaptation by Wass et al. is automatic, it features a large number of categorization rules for rejecting data, leading to significant amounts of data being excluded when the noise level is high. Intuitively, an ideal approach would be an algorithm that is automatic and that can reliably achieve fixation labeling in periods of noisy data, instead of excluding such data. Here we introduce such an approach.
The present article introduces the “identification by two-means clustering” (I2MC) algorithm. This algorithm was specifically designed to accomplish the labeling of fixations across a wide range of noise levels and when periods of data loss may be present, without the need to set a large number of parameters or perform manual coding. Before we introduce how the algorithm operates, we first discuss how to the algorithm is to be evaluated.
Evaluating the algorithm
How can the output of an event detection algorithm be evaluated? Or, the question that more generally arises; is the output from my event detection algorithm “correct”? Intuitively, it makes sense to ask this question. One would want to know whether the fixations labeled by an algorithm are “correct,” or whether the output of the algorithm conforms to a golden standard. The problem is, however, that although researchers appear to informally discuss fixations and saccades with relative ease, there is no gold standard for when a fixation starts or stops (Andersson, Larsson, Holmqvist, Stridh, & Nyström, 2016). Essentially, a fixation is defined by how it is computed, which differs for each event detection algorithm. Holmqvist et al., (2011, p. 150) also note “In reality, perfect matches between the fixations detected by an algorithm and moments of stillness of the eye are very rare. To make matters worse, the term fixation is sometimes also used for the period during which the fixated entity is cognitively processed by the participant. The oculomotor, the algorithmically detected, and the cognitive ‘fixations’ largely overlap, but are not the same.” Tackling the evaluation of algorithms based on the fixations detected is therefore problematic, as the definitions for fixations differ between algorithms.
Komogortsev, Gobert, Jayarathna, Koh, and Gowda (2010), for instance, aimed to determine a goodness of fit of the eye movement data to the stimulus that was presented. Participants were presented with a white dot that appeared sequentially at 15 locations on screen for one second each. In this case, an event detection algorithm was considered ideal if it detects 15 fixations, 14 saccades, an average fixation duration of one second, and an accuracy of 0°. Importantly, Komogortsev et al. (2010) noted that this method is inherently flawed. For instance, eye-trackers typically report accuracies of 0.5°, even under ideal circumstances. Moreover, 14 saccades and 15 fixations would imply that no corrective saccades are made. However, Komogortsev et al. (2010) preferred their method over manual techniques, which “are frequently employed to classify eye-movement behavior. However, this type of classification technique [i.e., manual coding] is susceptible to human error and can be open for biased interpretation with limited generalizability” (p. 2643). Andersson et al. (2016), on the other hand, employed two experienced human coders as their gold standard for comparing with algorithms. They noted that although human coders generally agreed more with each other than with automatic algorithms, there is a problem: Human coders make mistakes and have disagreements, and eventually it is impossible to determine whether an algorithm or a human coder should be “right.” Finally, in the approach by Zemblys and Holmqvist (2016), the algorithm considered “best” by the authors was taken as the gold standard—a choice that will most likely provoke debate. To sum, there is little consensus on the manner of evaluating the performance of a fixation detection algorithm.
Instead of focusing on whether a gold standard for the evaluation of fixation detection algorithms can be approximated, we take a different approach here. The purpose of the presented algorithm, called I2MC, is to achieve consistent labeling of fixations when there may be large differences in data quality between participants and between trials, as is often encountered in, for instance, infant research. This means that the I2MC algorithm should achieve fixation labeling across a range of noise amplitudes and when short periods of data loss may be present. If the labeling of fixations is done for data with a small noise amplitude and few periods of data loss, the output thereof can be compared to the output after adding noise of higher amplitude and periods of data loss to the same data. If the number of labeled fixations and the corresponding distribution of fixation durations remain unchanged as noise and data loss increase, the algorithm is considered to be robust to noise and data loss. The key decision left to the eye movement researcher is then whether the fixation-labeling output at low noise and data loss levels is satisfactory. If one is satisfied with the initial output of the algorithm, one can generalize this satisfaction to higher noise and data loss levels, given the algorithm’s stability in the face of increasing noise and data loss. This decision inevitably has to be made by every researcher, since no two experimental setups and data sets are identical.
To evaluate whether the I2MC algorithm improves on the currently available solutions, seven competing state-of-the-art algorithms were chosen from the literature. The general motivation for including an algorithm is that it should be able to deal with one or more of the data quality issues that we outlined above. The specific motivations are given in the Method section. As we discussed briefly, algorithms may differ in both their search rules and their categorization rules for labeling fixations. Moreover, the preprocessing steps, such as data smoothing and interpolation to impute missing data, prior to application of these rules may differ between algorithms. It is paramount to note that the focus here is not on finding a combination of preprocessing steps with search and categorization rules that produces the most noise-robust output by testing all possible combinations and exhaustively searching through their parameter spaces. Instead, the focus is on whether previous solutions, taken as is, produce noise-robust output, and whether or not the present I2MC algorithm improves over them. We reasoned that taking the algorithms as they come “out of the box” is what the vast majority of researchers who are not experts on event detection algorithms would do when choosing algorithms for purposes of data analysis.
Because previous research has shown that increased noise amplitude may affect the number of fixations, and consequently the fixation durations (Holmqvist et al., 2012; Wass et al., 2013), we calculated the numbers of fixations, mean fixation durations, and standard deviations of the fixation durations for all noise and data loss levels. Such an approach is similar to that of Zemblys and Holmqvist (2016), who investigated the parameter settings of event detection algorithms as a function of increasing noise level. However, their aim was to see how the settings of an algorithm should be adapted for the output to approach that of their gold standard (the algorithm they considered to be the “best”). Here, however, we compared each algorithm against itself to examine its robustness of the outcome measures to noise and data loss levels. After examining the noise robustness of the outcome measures of the I2MC algorithm in this manner, we examined the application of the I2MC algorithm to infant data.
Interpolation of missing data
To maximize the amount of eye-tracking data that can be used for event detection, imputation of short periods of missing data is performed through interpolation. We chose an interpolation method satisfying two conditions: (1) interpolation must be locally determined by the gaze samples at each end of the interpolation window, and (2) interpolation must be monotonic (i.e., there are no extrema in the data points between the start and end points). The interpolation method adopted here, which satisfies both constraints, was developed by Steffen (1990). It should be noted that the commonly used cubic spline interpolation (e.g., Frank, Vul, & Johnson, 2009) does not satisfy the constraints posed here, because it can produce extrema in the interpolated data.
Interpolation was performed as follows. Periods of missing coordinates in the gaze coordinate signals were interpolated provided that the following criteria were met. First, the period of missing coordinates had to be shorter than a set value. The value used in this article was 100 ms, and was chosen so as not to interpolate over entire saccade–fixation–saccade sequences (saccades with a latency of 100 ms are considered extremely early; Fischer & Ramsperger, 1984). In addition, blink durations are usually higher than 100 ms, and so relatively few blinks should be interpolated. The value of 100 ms is, however, not fixed, and may be adapted according to the periods of data loss observed in the eye-tracking data. Second, valid data had to be available for at least two samples at each end of the missing window.
If the current window contains no missing data (if it does, go to Step 4): Force the gaze position signal into two clusters. Cluster membership is a value of either 1 or 2 (i.e., the cluster the sample belongs to). It is important to note that cluster membership itself is not relevant, but only where the membership transitions occur from Cluster 1 to Cluster 2, or vice versa. Only the times at which these transitions from one cluster to another occur are used in the next step.
Construct a clustering weight for the current window from the cluster membership determined in Step 1. The clustering weight for samples in which a cluster membership transition occurs is 1/number of total transitions in the window. The clustering weight for the other samples (i.e., those at which no transition occurs) is 0. If, for example, one transition occurs from Cluster 1 to Cluster 2, as in the saccade example in Fig. 2, the clustering weight for the sample containing the transition is 1. For all samples containing a transition in the fixation example in Fig. 2, the clustering weight is much lower, because there are many transitions from one cluster to the other in the window.
To ensure that transitions are not caused solely by high-frequency noise in the data, down-sample the gaze position signal to integer divisions of the original sampling frequency, and repeat Steps 1 and 2 for each down-sampled position signal. For example, the data in Fig. 2 (and in this article) were recorded at 300 Hz and down-sampled to 150, 60, and 30 Hz. The clustering weights for the gaze position signal at its original sampling frequency, as well as the down-sampled signals, are subsequently summed.
Move the window in the gaze position signal. The window may be moved either one sample or a number of samples. We used a step of 20 ms (six samples at 300 Hz) here, because it provides nearly identical results to moving the window one sample, but decreases the computation time sixfold. This step size is, however, configurable. If the subsequent window contains missing data or is moved past the end of the data, go back in steps of one sample to determine the last possible window. If no additional windows are possible backward in time up to the previous possible window, find the first possible window after the period of missing data. As long as the end of the window does not reach the end of the gaze position signal, return to Step 1.
For each sample, average the clustering weights assigned in Steps 1–3 for each time that sample was included in the moving window. For example, if a sample was included in three windows, it will have been assigned three clustering weights, and these three weights are averaged. The subsequent clustering-weight signal (see Fig. 2) can now be used for fixation detection.
For binocular eye-tracking data, the clustering procedure described above was run on the data for the left and right eyes separately. These two clustering-weight signals were then averaged to determine the final clustering-weight signal. Doing this has the advantage that if only one eye moved according to the eye-tracker, which most likely represents noise, the result is unlikely to lead to a large peak in the clustering-weight signal. For monocular eye-tracking data, and when the data from only one eye are available in binocular eye-tracking data due to data loss, the clustering-weight signal from that one eye is used.
The categorization rules for the present algorithm are as follows. A cutoff is used to determine fixation candidates from the clustering-weight signal (see panel B in Fig. 2). Here we used a cutoff of the mean clustering weight plus two standard deviations. However, different cutoffs may be required for different datasets. All periods of clustering-weight signal below this cutoff are labeled as fixation candidates, and thereafter consecutive fixation candidates are merged. Finally, short fixation candidates are excluded from the output. The settings for merging fixation candidates may depend on the stimuli used in the experiment, the noise level in the eye-tracking data, or the size of the saccades of interest. Here we opted for merging fixation candidates that were less than 0.7° apart and were separated by less than 40 ms. Fixation candidates shorter than 40 ms were removed.
Settings for the identification by two-means clustering (I2MC) algorithm and their suggested values for data at 300 Hz
Used Value(s) at 300 Hz
Impact When Changed
Increase will lead to interpolation of blinks, decrease will lead to less periods of data loss being interpolated.
6.7 ms (two samples)
Increase will require more data points at data loss edge, and will not interpolate in the event of flicker (i.e., repetition of short period of data loss and data points). At least two samples are required.
Clustering window size
Increase will lead to clustering procedure being more readily carried out over saccade-fixation-saccade sequence.
150, 60, and 30 Hz
Removal of downsampling steps will lead to more susceptibility to short bursts of noise.
Window step size
We observed no difference between 3.3 and 20 ms.
2 standard deviations above the mean
Increase will lead to fewer fixation candidates (more conservative), decrease to more fixation candidates (more liberal).
Merge fixation distance
Increase will lead to more fixation candidates being merged.
Merge fixation time
Increase will lead to more fixation candidates being merged.
Min. fixation duration
Increase will lead to more short fixation candidates being excluded.
The algorithms were compared in terms of the following outcome measures: the number of fixations, the mean fixation duration, and the standard deviation of the fixation duration. These outcome measures were obtained from eye movement data with increasing noise amplitude and periods of data loss, as well as the combination of the two. A dataset consisting of binocular data that were recorded with the SR Research EyeLink 1000 by Hooge, Nyström, Cornelissen, and Holmqvist (2015) was used, in which the noise amplitude and data loss were artificially increased. This dataset was chosen on the basis of its low noise amplitude and low data loss levels. In Hooge et al.’s experiment, the participants made horizontal and vertical saccades of a wide range of amplitudes.
To examine robustness, the noise amplitude and data loss level in the data were artificially increased. To provide a valid test of algorithm performance, it is important that the added noise and data loss be representative of levels that actually occur when doing eye-tracking research in suboptimal conditions. To achieve this, we first characterized what the noise and data loss looked like in a set of infant data recorded by Hessels, Andersson, et al. (2015; for an example, see the bottom panel of Fig. 1). We subsequently developed methods to add noise and data loss with these characteristics at varying levels to our clean data. These methods are described in detail in Appendix A. The noise level was varied from a sample-to-sample root-mean square (RMS) noise level of 0 to 5.57°. The latter value was chosen as being beyond the upper limit for noise typically encountered in eye movement data. For comparison, the RMS noise level in infant eye movement data rarely exceeds 3° (Hessels, Kemner, van den Boomen, & Hooge, 2015). Data loss was varied by changing the occurrence of periods of data loss from 0 % to 100 % of a trial. Because the noise and data loss were characterized in a dataset recorded at 300 Hz, our clean data to which we subsequently added noise and data loss were down-sampled from 1000 to 300 Hz using first-order interpolation.
After examining the noise robustness of I2MC and competing algorithms in eye movement data with a range of artificially generated noise and data loss levels, we applied the algorithms to real infant data. To interpret the outcome measures of the algorithms when applied to the infant data, four eye movement experts (authors R.H., D.N., and I.H., as well as an external expert) hand-coded the fixations in the infant data. A subset of the infant data from Hessels, Hooge, and Kemner (2016) were extracted for manual coding. The data from 20 infants, amounting to a total of 40 min of eye movement data, were coded. Hand coding was done in custom MATLAB software and took each manual coder approximately 3 h to complete.
Algorithms for comparison
Seven algorithms were chosen for comparison against the I2MC algorithm. Because only three of the seven algorithms provided output for all noise and/or data loss levels, only these algorithms are discussed here. The remaining four algorithms are included in Appendix B. For each algorithm, only parameters that were dependent on the sample frequency of the input data were adjusted. They were set to match their initial values for the 300-Hz data we used here. The search and categorization rules, as well as the motivation for including each algorithm, are described below.
Adaptive velocity algorithms for low-frequency data
An implementation of an adaptive velocity search rule is given by Hooge and Camps (2013; hereafter, HC). Their algorithm labels fixations instead of saccades and was originally designed for low-frequency (120 Hz or lower) eye movement data. The categorization rules that they employed are (1) adjacent fixations that are less than 1.0° away from each other are merged, and (2) fixations shorter than 60 ms are excluded. This algorithm’s was included because it is a simple fixation-labeling algorithm with few parameters that is based on a threshold that is adaptive to the noise level in the data.
Another implementation of an adaptive velocity search rule was given by van der Lans, Wedel, and Pieters (2011). They described their algorithm as follows: “Our Binocular-Individual Threshold (BIT) algorithm for identifying fixations is . . . a parameter-free fixation-identification algorithm that automatically identifies task- and individual-specific velocity thresholds by optimally exploiting the statistical properties of the eye-movement data across different eyes and directions of eye movements” (p. 240). The algorithm improves over standard adaptive velocity search rules by using the covariance between movement of the left and right eyes. If the left eye moves in a given direction, the right eye often does so too, whereas in noise the movement of the two eyes is uncorrelated. Given this feature, the BIT algorithm may more readily be able to distinguish saccades from noise than are standard velocity algorithms, and therefore we included it in this comparison. No further categorization rules are reported by the authors for the algorithm.
Identification by analysis of variance and covariance
Veneri et al. (2011) designed a fixation-labeling algorithm (hereafter, C-DT) with a search rule based on the covariance of the horizontal and vertical eye position signals. The algorithm labels gaze samples as belonging to a fixation when an F test indicates that the variances of the horizontal and vertical eye positions are equal. When the covariance between x- and y-coordinates is high, samples are labeled as belonging to a saccade. The remaining samples are labeled according to a combination of their covariances and the F test for equal variances. Veneri et al. reported that their C-DT algorithm identified fixations more accurately than a standard dispersion algorithm when the noise amplitude was high. The reason it was included is that it appears to be robust to noise. No further categorization rules are reported by the authors for the algorithm.
The HC algorithm showed an increase in the mean fixation duration as noise increased, as well as an increase in the standard deviation of the fixation durations. The BIT algorithm also showed increases in both the mean fixation duration and the standard deviation of the fixation duration. The CDT algorithm showed a slowly decreasing, but still fairly consistent, mean fixation duration and standard deviation of the fixation duration. Finally, the I2MC algorithm showed both a fairly consistent mean fixation duration and standard deviation of the fixation duration as a function of noise level, with only a small decrease in the mean fixation duration when more than 5° of RMS noise was added to the eye movement data.
These findings show that the outcome measures from the I2MC algorithm were most robust, among the algorithms we tested, to increasing noise levels in the eye movement data.
Variable RMS noise
For the mean fixation durations and the standard deviations of the fixation durations, the results overall matched the previous RMS noise analyses, albeit once again with smaller increases and decreases.
For the CDT algorithm, the mean fixation duration and standard deviation of the fixation duration decreased as a function of data loss. For the HC algorithm, a slight decrease was followed by an increase, and then again by a decrease for both the mean fixation duration and the standard deviation of the fixation duration as a function of data loss. Finally, for the BIT and I2MC algorithms, the mean fixation duration was fairly stable as data loss increased.
Combined noise and data loss
As a final test of the robustness of the outcome measures of the four algorithms, a combination of RMS noise and data loss was added to the eye movement data.
Interpreting noise robustness
Applying the algorithm to infant data
The purpose of the present work was to address the need for an algorithm capable of labeling fixations across a wide range of noise levels in data in which periods of data loss may occur. This is particularly relevant because of the rise of remote video eye-tracking in participant groups such as infants, school children, and certain patient groups whose body movement is difficult to restrain– which may strongly affect the eye movement data quality. Here we proposed and evaluated a new algorithm designed specifically for eye movement data of varying data quality: identification by two-means clustering.
In comparison with other state-of-the-art event detection algorithms, we found the following. First, we report that the numbers of fixations, mean fixation durations, and standard deviations of the fixation durations recorded by the I2MC algorithm were most robust to increases in noise level than the same measures recorded by competing algorithms. This was the case both when the noise level was increased for the entire eye movement signal and when the noise level was increased for only part of the eye movement signal, mimicking short bursts of noise. Second, differences between the algorithms were smaller for the data loss analysis than for the noise analyses. However, the outcome measures of the I2MC and BIT algorithms were most robust to increases in data loss. We report that when adding both noise and data loss to eye movement data, the outcome measures for the I2MC algorithm are most robust. This was particularly the case for 0°–2° of noise, at which levels the outcome measures were almost identical. Since the BIT algorithm appeared to show stable outcome measures as a function of noise for noise amplitudes larger than 2°, albeit with marked differences between the data loss levels, and the HC algorithm showed stable outcome measures for the highest data loss level only, we examined the distributions of fixation duration more closely. Here, we report that the I2MC algorithm showed nearly identical distributions of fixation durations for the lowest noise levels, regardless of data loss level, whereas differences in the distributions of fixation durations for the competing algorithms were already present at the lowest noise levels. However, when the noise amplitude was larger than 2°, the distribution of fixation durations according to the I2MC algorithm also began to show marked differences as compared to the lower noise levels. Finally, when the I2MC was applied to infant data, the outcome measures best approached the average of those of four expert coders, as compared to the other algorithms. We concluded that the outcome measures for the I2MC algorithm were most robust to noise, most notably in the range 0°–2° of RMS noise (for each data loss level) added to the eye movement data.
If we compare the noise robustness of the I2MC algorithm to what has been found in previous research, we see that it also compares favorably to manufacturer-provided algorithms. For instance, Zemblys and Holmqvist (2016) reported that the numbers of fixations and mean fixation durations for the SMI I-VT and I-DT algorithms changed drastically for noise levels higher than 0.25° RMS. In addition, they created a mathematical model to estimate the best-compromise threshold for the SMI dispersion- and velocity-threshold algorithms from the noise levels observed in the eye movement recording. Although adjusting the threshold may compensate for higher noise levels, this comes at the cost of reduced agreement of the algorithms’ output (i.e., numbers of fixations and mean fixation durations) relative to the gold standard. Moreover, a large-scale study of data quality revealed that many of remote eye-trackers produce data with noise levels over 2° for some of participants, with infrequent recordings in which the RMS noise level was even over 3° (Zemblys & Holmqvist, 2016). In addition, Hessels, Kemner, et al. (2015) reported that the RMS noise was rarely over 3° in infant research. The fact that the outcome measures of the I2MC are noise-robust, and particularly so between 0° and 2° RMS noise, means that the algorithm may apply to most real-world situations. Moreover, no parameters need to be adjusted to achieve the same output of the fixation parameters when the noise level varies between 0 and 2°, whereas this was the case for manufacturer-provided algorithms in this range (Zemblys & Holmqvist, 2016).
As we noted in our introduction, a fixation is defined by how it is computed. This means that a fixation for one algorithm is not the same as a fixation for another algorithm. Even when the noise level was low in the eye movement data investigated here, there were large differences between algorithms in the numbers of fixations and the fixation durations (see also Appendix B). This was not only a result of the search rule algorithms apply to find fixation candidates, but also of the categorization rules used to merge or exclude these fixation candidates. When one algorithm excludes more fixation candidates under uncertainty, it will produce fewer fixations as output than will an algorithm that does not exclude any fixation candidates. As such, comparing fixation parameters between algorithms is like comparing apples and oranges. Instead of doing so, we first compared the fixation parameters for each algorithm to itself as a function of data quality—the absolute values for each algorithm were not so important, only the change in the value as a function of noise or data loss. In essence, we compared apples to apples, and oranges to oranges. Here we have presented an algorithm that produces the same output under a wide range of circumstances—the apple remains roughly the same, regardless of the situation. Does this then mean that the output of the presented algorithm is “good”? That is a very difficult question to answer. Commonly in the literature, this question is tackled by comparing the output of an algorithm to an expert coder. Here, when comparing the outcome measures of I2MC and the competing algorithms to the output of four expert coders, I2MC also outperformed the other algorithms. It should be noted, however, that the expert coders did not produce identical outcome measures, such that the question becomes how informative one expert coder actually is. Future research should examine whether expert coders serve as a good gold standard for event detection algorithms.
The I2MC algorithm is applicable to fixation labeling in situations in which the data quality may be low—for instance, when working with infants, school children, or certain patient groups. The I2MC algorithm may also be used when the noise and data loss levels are markedly different between trials and/or participants; the outputs should be comparable despite these differences in noise and data loss levels. This is also particularly relevant for studies in which two groups are compared: For instance, Shic, Chawarska, and Scassellati (2008) reported that changing the parameters of a fixation detection algorithm may reverse the effects between toddlers with autism spectrum disorder and typically developing controls. In addition, recent work has also suggested that differences in data quality between groups should be carefully monitored (Keehn & Joseph, 2016). Here, using an algorithm whose outcome measures are robust to differences in data quality between groups may be a better solution.
Although the I2MC algorithm may be applicable in a wide range of studies, it has several limitations. First, the algorithm is built only for labeling fixations. When saccade parameters are of interest, for example, the I2MC is not a sensible analysis tool. Second, we restricted the present study to labeling fixations in data collected with remote or tower-mounted eye-trackers using static stimuli. When head-mounted eye-trackers are used, the gaze position signal is complicated by periods of optokinetic nystagmus and the vestibulo-ocular reflex, which may require a different event detection strategy. Moreover, when moving stimuli are used in remote or tower-mounted eye-trackers, smooth pursuit movements may occur. Recent work has begun to address event detection in these situations (Larsson et al., 2015). In addition, a general limitation to the present work is that we focused on 300-Hz data, whereas a broad range of sampling frequencies are being used in eye-tracking research. Moreover, although the selection of algorithms against which we tested the I2MC algorithm was motivated, it represents but a subset of the entire event detection catalog.
Here we presented a fixation detection algorithm for eye-tracking data recorded with remote or tower-mounted eye-trackers using static stimuli. The algorithm works offline and is automatic. The key improvement made by this algorithm is that it labels fixations across a wide range of noise levels and when periods of data loss may occur.
Although accuracy is also an aspect of data quality, this does not affect the detection of fixations.
The authors thank Jacco van Elst for valuable help with coding the eye movement data. The study was financed through the Consortium on Individual Development (CID). CID is funded through the Gravitation program of the Dutch Ministry of Education, Culture, and Science and by the Netherlands Organization for Scientific Research (NWO Grant No. 024.001.003, awarded to author C.K.). The funding body had no involvement at any stage of the study.
- Findlay, J. M. (1971). Frequency analysis of human involuntary eye movement. Kybernetik, 6, 1–8.Google Scholar
- Holmqvist, K., Nyström, M., Andersson, R., Dewhurst, R., Jarodzka, H., & van de Weijer, J. (2011). Eye tracking: A comprehensive guide to methods and measures. Oxford, UK: Oxford University Press.Google Scholar
- Shic, F., Chawarska, K., & Scassellati, B. (2008). The amorphous fixation measure revisited: With applications to autism. In B. C. Love, K. McRae, & V. M. Sloutsky (Eds.), Proceedings of the 30th Annual Meeting of the Cognitive Science Society (pp. 1964–1969). Austin, TX: Cognitive Science Society.Google Scholar
- Steffen, M. (1990). A simple method for monotonic interpolation in one dimension. Astronomy and Astrophysics, 239, 443–450.Google Scholar
- Zemblys, R., & Holmqvist, K. (2016). Optimal settings for commercial event detection algorithms based on the level of noise. Manuscript submitted for publication.Google Scholar
Open Access This article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made.