We will present our review ordered by the categories Eye-tracking methods, Environment, Setup and geometry, Participant, Calibration, Features of the experiment, Signal processing, Event detection, Area-of-Interest measures, and Higher-order measures. The minimal reporting guideline itself can be found in Section “An empirically based minimal reporting guideline
Eye-tracking methods: Similarities and differences
Over the past 130 years (e.g. Delabarre, 1898; Lamare, 1892), many methods for eye movement registration have been developed. A recent comprehensive overview is provided by Holmqvist and Andersson (2017, Ch. 4). For other overviews of eye trackers and methods for measuring eye movements, see Hansen and Ji (2010), Duchowski (2007, pp. 51–59), Ciuffreda and Tannen (1995, pp. 184–205), Young and Sheena (1975), and Ditchburn (1973, pp. 36–77).
In this section, we describe how characteristics of the eye-tracker signals differ between the measurement techniques and between various eye-tracker models. From the perspective of a researcher embarking on a new project, with a limited budget, each measurement technique is likely to have some advantages and some disadvantages. Within each technique, differences between manufacturer models in data quality and other properties may be found to be large enough to determine the success or failure of the upcoming study.
Table 2 summarises 42 existing cross-comparative benchmarking studies of eye trackers, to which we refer the reader for specific details. In short, these 42 studies show that data quality often differs considerably, and in many ways, between eye trackers, although some eye trackers record data of similar quality. The studies in Table 2 may assist in assessing whether an eye tracker can actually produce data of the desired quality, either in preparation for acquiring a system, or when preparing a replication in which the eye tracker differs from the one used in the original publication.
Summarising studies on accuracy and precision in particular, Holmqvist and Andersson (2017) point out that the difference in distribution of RMS-S2S precision values between eye trackers may be up to two orders of magnitude, while in comparison between-subjects differences in precision within each eye tracker tend to be relatively small. In contrast, the distributions of accuracy values overlap considerably between eye trackers (i.e. they have similar accuracy), but exhibit a very wide range within each eye tracker, which reflects data from people with different eye physiologies and spectacles, and data obtained during fixations in the corner vs central positions of monitors. This suggests that for precision the eye tracker matters more, while for accuracy the participant, the calibration and the geometrical setup matter more. This was found for adult human participants in the lab and may differ for infants, animals and difficult recording environments.
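To make the two quality measures concrete, the following minimal sketch computes RMS-S2S precision as the root mean square of the distances between successive gaze samples, and accuracy as the distance between the mean gaze position and the fixated target. The function names, the use of degrees as unit, and the choice to compute accuracy from the mean gaze position (rather than, for instance, the mean of per-sample offsets) are assumptions made for illustration; the studies cited above may operationalise these measures differently.

```python
import math


def rms_s2s(xs, ys):
    """Root mean square of the distances between successive gaze samples.

    xs, ys: lists of gaze coordinates (assumed here to be in degrees)
    recorded while the participant fixates a single target.
    """
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two equally long lists with at least two samples")
    squared = [
        (x1 - x0) ** 2 + (y1 - y0) ** 2
        for x0, x1, y0, y1 in zip(xs, xs[1:], ys, ys[1:])
    ]
    return math.sqrt(sum(squared) / len(squared))


def accuracy(xs, ys, target_x, target_y):
    """Distance between the mean gaze position and the target position."""
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    return math.hypot(mean_x - target_x, mean_y - target_y)


if __name__ == "__main__":
    # Hypothetical samples, for illustration only.
    gaze_x = [0.0, 1.0, 0.0, 1.0]
    gaze_y = [0.0, 0.0, 0.0, 0.0]
    print(rms_s2s(gaze_x, gaze_y))
    print(accuracy(gaze_x, gaze_y, 0.0, 0.0))
```

Note that the two measures are independent: a signal can be very precise (small sample-to-sample distances) and still be inaccurate (large offset from the target), which is why they are reported separately.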
As we outline below, irrespective of measurement method, anything that interferes with obtaining or processing a feature used in estimating gaze direction (P, CR, P1, P4, limbus, magnetic induction or retinal features) will affect the quality of the signal reported by the eye tracker.
P–CR eye tracking
Video-based P–CR eye tracking was introduced by Merchant (1967). In 2021, camera-based P–CR eye trackers dominate the market almost completely. The P of P–CR eye trackers refers to the pupil centre in the camera image, and the CR to one or more reflection centre(s) in the cornea from infrared illuminators in the eye tracker. P–CR eye trackers estimate gaze direction as a function of the relative positions of P and CR coordinates in the pixel coordinate system of the video image, for instance by subtracting the CR coordinate from the P coordinate. Note that more advanced models have been developed (Hansen and Ji, 2010).
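The subtraction-based principle can be sketched as follows. This is a deliberately simplified illustration and not the method of any particular eye tracker: it assumes that each gaze coordinate is a linear function of the corresponding component of the P–CR vector, and fits that function by ordinary least squares on calibration data. The function names and the numbers in the example are hypothetical; real systems typically use higher-order polynomial or model-based mappings.

```python
def pcr_vector(pupil, cr):
    """Pupil centre minus corneal-reflection centre, in image pixels."""
    return (pupil[0] - cr[0], pupil[1] - cr[1])


def fit_line(v, g):
    """Least-squares slope and intercept for g = slope * v + intercept."""
    n = len(v)
    mean_v = sum(v) / n
    mean_g = sum(g) / n
    sxx = sum((a - mean_v) ** 2 for a in v)
    sxy = sum((a - mean_v) * (b - mean_g) for a, b in zip(v, g))
    slope = sxy / sxx
    return slope, mean_g - slope * mean_v


def calibrate(pupils, crs, targets):
    """Fit one linear mapping per axis from P-CR vectors to target positions."""
    vectors = [pcr_vector(p, c) for p, c in zip(pupils, crs)]
    model_x = fit_line([v[0] for v in vectors], [t[0] for t in targets])
    model_y = fit_line([v[1] for v in vectors], [t[1] for t in targets])
    return model_x, model_y


def estimate_gaze(pupil, cr, model):
    """Map one P-CR vector to a gaze position using the fitted model."""
    vx, vy = pcr_vector(pupil, cr)
    (ax, bx), (ay, by) = model
    return ax * vx + bx, ay * vy + by


if __name__ == "__main__":
    # Hypothetical calibration data: image features in pixels, targets in degrees.
    pupils = [(110.0, 105.0), (120.0, 100.0), (100.0, 95.0)]
    crs = [(100.0, 100.0)] * 3
    targets = [(21.0, 14.0), (41.0, -1.0), (1.0, -16.0)]
    model = calibrate(pupils, crs, targets)
    print(estimate_gaze((105.0, 102.0), (100.0, 100.0), model))
```

Because gaze is computed from the difference between the two features, any error in locating either the pupil centre or the CR centre carries over directly to the estimated gaze position.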
More types and models of P–CR eye trackers are available than for any other measurement technique, and prices vary over a wide range. There exists plenty of software for stimulus presentation, data processing and analysis, and the learning threshold for beginning researchers is lower than for other eye-tracking methods.
Many studies have examined aspects of P–CR eye trackers (Table 2). A host of issues with the feature detection of both pupil and corneal reflection may impair quality of gaze and pupil-size data. As we point out elsewhere, P–CR trackers suffer from the pupil-size artefact (Section “Environment”) and the pupil foreshortening artefact (Section “Setup and geometry”). Refraction in the cornea alters the pupil size in the camera image and its position with respect to the limbus (Villanueva & Cabeza, 2008). Pupil occlusion and mascara can interfere with pupil detection. Blue irises tend to result in poorer precision (in dark-pupil eye trackers), which is due to poor contrast between (a dark) pupil and iris in the infra-red light of video-based eye trackers (Section “Participants”, and Figure 4.13 in Holmqvist & Andersson, 2017). Combining the pupil with the CR signal to form the P–CR gaze signal may amplify post-saccadic oscillations and overestimate peak saccadic velocity (Hooge et al., 2016).
P–CR eye trackers exhibit clear post-saccadic oscillations (PSOs) (Hooge et al., 2015; Nyström et al., 2013), which make it difficult to draw a clear border between a saccade and the subsequent fixation, and which have led to the development of event detection algorithms that include PSO detection (Larsson et al., 2013; Nyström & Holmqvist, 2010; Zemblys et al., 2019).
Discussing which technologies could be used for future studies of saccade dynamics, Hooge et al. (2016) reason that variants of CR-tracking without the involvement of the pupil feature could be the preferred future method. However, Holmqvist and Blignaut (2020) reported incorrectly measured amplitudes of small eye movements (below 2°) in all 11 P–CR eye trackers they tested, and suggest that this is due to erroneous calculations of the CR centre by the image processing algorithms in the eye trackers, interacting with the resolution of the eye camera sensor. Other artefacts in the CR signal arise from changes in head position (relative to the eye tracker), which may alter the size and the shape of corneal reflections (Guestrin & Eizenman, 2006). Patterns in the iris may interact with the CR image and change the calculated CR centre (Tran & Kaufman, 2003). Illumination levels, sampling frequency and the optic lenses in the camera may all affect the CR. Droege and Paulus (2009) point out that the use of low-quality eye cameras may further degrade precision in the gaze signal: slower pixel updating makes pixels retain some of the brightness of the passing corneal reflection, leaving a bright trace behind the real reflection and making the calculation of the CR centre more error-prone.
DPI eye tracking
The Dual-Purkinje Imaging (DPI) system is an analogue eye tracker that bases its estimation of gaze on the relative movement of the infrared reflection off the cornea (P1) versus the reflection at the back of the crystalline lens (P4), and reports P1, gaze and head translation as voltages (Crane & Steele, 1985). At present, there are around 60 DPI trackers left in the world (Personal communication; Warren Ward). As the DPI produces a continuous signal, it can be digitised to the desired sampling frequency in an AD-converter. Internal bandwidth restrictions limit the maximum sampling frequency to 39.06 kHz (Personal communication; Warren Ward).
The DPI used to be the main workhorse of many psychology laboratories and features in many influential publications such as Frazier and Rayner (1982) and Deubel and Schneider (1996). The learning threshold is clearly higher than for P–CR trackers, but the major drawback of the DPI is that it is a bulky and sensitive machine built using optoelectronics from the 1970s that are serviced commercially by only one person. However, the camera-based DPI built by Rucci et al. (2020) has a data quality comparable to the original analogue system and is built with modern electronics, which may revive the DPI measurement technique.
The P1 in DPI eye tracking is the same reflection as the CR of P–CR trackers, with the important distinction that P–CR eye trackers estimate the centre of the CR from a small portion of a pixelated camera image, while the DPI finds the centre of an analogue light beam. This has been proposed as the reason that the DPI does not mismeasure the amplitudes of small eye movements (Holmqvist and Blignaut, 2020).
The DPI records gaze signals with a quality sufficient to detect tremor, oculomotor drift, microsaccades, and smooth pursuit with good reliability (see Holmqvist & Blignaut, 2020; Ko et al., 2016; Poletti & Rucci, 2016, for details). Holmqvist (2015) reports a median precision of 0.008° and an accuracy of 0.4° across 192 participants, both better than any video-based P–CR system. The quality of DPI data is generally lower when recording participants with small pupils that cover the P4 reflection, which causes inaccuracies and data loss (Crane and Steele, 1985; Holmqvist et al., 2020). DPI data are best recorded with participants who have large pupils, either in dark rooms or with artificially dilated pupils. The reliance on the P4 reflection furthermore results in the largest measured amplitudes of post-saccadic oscillations in any eye tracker (Deubel & Bridgeman, 1995).
Scleral search coils
Scleral search coils were introduced by Robinson (1963) and adapted for use with human participants by Collewijn et al. (1975). The scleral search coil method involves placing a copper wire coil, embedded in an annulus or contact lens, onto the sclera. The participant is placed in oscillating magnetic fields and the induced voltage in the eye coil is taken to represent the orientation of the eye with respect to the magnetic fields. This technique was dubbed the gold standard of eye tracking by Collewijn (1998). Reulen and Bakker (1982) presented the double magnetic induction principle, improved by Bour et al. (1984). Like the DPI, scleral search coil systems are analogue trackers, and data can be digitised at very high sampling frequencies. Coils can even record combined eye and head rotation for the same participant (Collewijn et al., 1985).
Houben et al. (2006) compared a coil system with a torsion-capable video eye tracker, finding that the gaze signal from the coil system was ten times more precise, and Ko et al. (2016) compared a coil system to a DPI, finding that although data from a coil system are somewhat more precise, both systems provide a data resolution sufficient for reliable detection of intersaccadic (fixational) eye movements. Collewijn (2001) sampled data at 10,000 Hz, and additionally reported a tracking range of 20° in all directions with a resolution of 1′, while Malpeli (1998) reports a precision of 1′ (0.017°) and Collewijn et al. (1988) recorded saccades with amplitudes of up to 80°.
All studies in Table 2 that have compared EyeLink systems with scleral search coils reported substantial agreement between the two systems in precision and in the detection of microsaccades and oculomotor drift (see McCamy et al., 2015, for a review). Note, however, that coils have been suspected to slow down the saccades of participants who wear them (Frens and van der Geest, 2002; Träisk et al., 2005). Even so, coils probably estimate saccadic velocity more accurately than P–CR eye trackers, which overestimate it (Hooge et al., 2016).
The scleral coil tracking method is distinctly invasive, and evidence exists that older coil systems, in combination with the anaesthetics that were applied, caused temporary reductions in visual acuity (Irving et al., 2003, but see Murphy et al., 2001), deformation of the visual field (Duwaer et al., 1982), and blurred vision (Arend & Skavenski, 1979). Contemporary search coils are embedded in flexible contact lenses and used for research and clinical diagnostic purposes in neuro-ophthalmology and neurology, due to their high precision, and the fact that patients often suffer from uncontrolled head and body movements.
EOG
Schott (1922) and Meyers (1929) produced recordings of the horizontal component of gaze, based on the corneo-retinal potential principle discovered in 1849 by Du Bois-Reymond. An EOG system records eye movements using electrodes on the side of the eyes that pick up an electromagnetic field produced by this corneo-retinal electrical potential of 10–30 mV (Brown et al., 2006). The signal is then taken through an isolated instrumentation amplifier connected to a chart recorder or a computer. EOG is an analogue method. EOG systems are often part of other recording devices. For instance, electroencephalogram (EEG) systems often have extra electrodes for the eyes that can be used for EOG recordings.
Brown et al. (2006) proposed a standardized measurement procedure for clinical EOG measurements, aiming at acquiring high-quality EOG data. Their procedure includes dilating the pupil, preparing the skin of the participant, and then applying two electrodes on the sides of each eye and a reference electrode to the forehead. The corneo-retinal potential is mainly derived from the retinal pigment epithelium, and it changes in response to retinal illumination. Hence, in a totally dark environment, the participant spends 15 minutes looking at dim fixation targets, followed by a light phase of similar duration. This darkness-light sequence maximizes the corneo-retinal potential. The actual data recording then commences.
EOG can be a useful variety of eye tracking when studying larger movements of the eye. Small movements will drown in the noise of EOG data (compare Fig. 2). One specific advantage of EOG is that it can be used when the eyes are closed, for instance to study REM sleep (Aserinsky and Kleitman, 1953). However, EOG eye tracking has poor accuracy compared with most other eye trackers: Young and Sheena (1975) report a 1.5–2° inaccuracy on average.
Limbus tracking
The first published implementation of a (photo-electric) limbus tracker was by Török et al. (1951). Limbus trackers estimate the limbus border between the iris and sclera, either from video or photosensors. Limbus eye trackers based on photodiodes were sold for research up until the year 2000 by the Skalar company, but are now only known for controlling the laser during refractive surgery of the eye (Arba-Mosquera and Aslanides, 2012). The Ober Saccadometer is not a limbus tracker, but a corneal bulge tracker (Holmqvist & Andersson, 2017, p. 73), although like the Skalar limbus tracker, the Saccadometer uses photosensors to track the corneal bulge.
Video-based limbus trackers use the fact that the limbus border (between iris and sclera) has a contrast comparable to the pupil-iris border. However, limbus trackers do not suffer from pupil-based artefacts, which affect both DPI and P–CR systems. Refraction in the cornea is also not a problem. Eye trackers with low-resolution cameras may benefit from using the limbus method. The drawback is that a large portion of the limbus may be covered by the eyelid, which poses challenges for image processing.
Piezoelectric eye tracking
The piezoelectric transduction method, first introduced by Bengi and Thomas (1968), involves bringing a silicone-tipped piezoelectric bimorph into contact with the sclera, typically in the interpalpebral region near the temporal limbus. It outputs voltage signals, in which horizontal microsaccades and oculomotor tremor can be detected. This analogue eye tracker has not been used for purposes other than measuring intrafixational eye movements. There is a suspicion that the introduced pressure on the sclera affects the microsaccade behaviour (see McCamy et al., 2013, for a discussion).
Retinal image-based eye tracking
Computational tracking of retinal features involves finding the optic disk, blood vessels and smaller features, and was first done by Cornsweet (1958). A computer vision algorithm provides an analysis of the movement of features in the camera view, and infers eye movements.
Retinal image-based eye trackers are the most accurate and precise of all existing eye trackers. An early system by Cornsweet (1958), albeit limited in that it only tracked features along one axis, could detect eye movements (microsaccades) down to amplitudes of 10 seconds of arc (0.0028°). Putnam et al. (2005) presented very impressive numbers on gaze position accuracy (5″, which is 0.0014°) based on snapshots taken with an adaptive optics retinal camera.
The retinal-based eye trackers with the highest speed and best accuracy are preferably built from scanning imagery, specifically from scanning laser ophthalmoscopes (SLO). These rely on the so-called ‘rolling shutter’ principle to recover eye motion (Mulligan, 1997), and are especially effective in SLOs that use adaptive optics that offer high resolution, high magnification and densely sampled retinal video (Stevenson and Roorda, 2005). Stevenson et al. (2016) introduced the first binocular system, which optically divided a single SLO image field between two eyes.
Retinal imaging systems generally occlude forward viewing, impeding stimulus presentation. This may however change: Bartuzel et al. (2020) describe a MEMS-based retinal imaging system that allows for presentation of stimuli while recording with a high sampling frequency (1240 Hz). Even then, the measurement range (also “trackable range”) tends to be smaller than with other eye trackers: Bartuzel et al. (2020) report a 16° range (8° left, 8° right), which we can compare to 20–40° for the DPI and many video-based P–CR trackers, and 90° or more for scleral coils.
Retinal image-based eye-tracking systems typically rely on a reference frame which, in a scanning system, is a single retinal image to which strips of all movie frames are registered in order to compute the eye motion. This process generally yields two outputs: a stabilised movie and an eye motion trace. If the reference frame is perfect and every strip from each scanned frame is perfectly registered to it, then it follows that the eye motion trace will also be perfect. However, distortions in the reference remain a challenge to overcome and these distortions yield artefacts in the eye motion trace. Recent efforts have been made to correct for these (Azimipour et al., 2018; Bedggood and Metha, 2017) but, if uncorrected, these artefacts are evident as peaks in the power spectrum of eye motion (Bowers et al., 2019).
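The registration idea can be illustrated with a one-dimensional toy example. The sketch below is an assumption-laden simplification and not the algorithm of any cited system: each "strip" is a single row of intensity values, and its displacement relative to the reference is found by an exhaustive search over integer shifts that minimises the mean squared difference (a stand-in for the two-dimensional cross-correlation that real systems typically use). All names and data are hypothetical.

```python
def best_shift(reference, strip, max_shift):
    """Integer shift s that best aligns strip[i] with reference[i + s]."""
    best = None
    for s in range(-max_shift, max_shift + 1):
        total = 0.0
        count = 0
        for i, value in enumerate(strip):
            j = i + s
            if 0 <= j < len(reference):
                total += (reference[j] - value) ** 2
                count += 1
        if count == 0:
            continue  # no overlap at this shift
        error = total / count
        if best is None or error < best[0]:
            best = (error, s)
    return best[1]


def motion_trace(reference, strips, max_shift):
    """One displacement estimate per strip, i.e. a (toy) eye motion trace."""
    return [best_shift(reference, strip, max_shift) for strip in strips]


if __name__ == "__main__":
    # Hypothetical intensity profile with one bright retinal feature.
    reference = [0, 0, 1, 5, 9, 5, 1, 0, 0, 0]
    unshifted = list(reference)
    shifted = reference[2:] + [0, 0]  # same feature, displaced by two samples
    print(motion_trace(reference, [unshifted, shifted], max_shift=3))
```

The sketch also shows why the reference matters: every displacement is expressed relative to the reference, so any distortion in the reference is inherited by all estimates in the trace.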
To date, however, retinal-image-based eye trackers have had a limited scope of application. The intrinsic trade-off between accuracy and range has rendered them most useful to study eye movements during steady fixation (Bowers et al., 2019). Retinal eye trackers have predominately been used in ophthalmology applications, often relating to disease in the retina and how that expresses itself in vision and miniature eye movements (Godara et al., 2010).
Binocular vs monocular eye tracking
The different technologies above can be constructed or set up to record either monocularly or binocularly. A common use of binocular eye tracking, particularly in remote eye trackers, is to combine the left and right signal by averaging synchronous data samples from the two eyes in the recording software, sometimes referred to as “cyclopean gaze”. Cui and Hondzinski (2006) report that averaging left and right signals improves accuracy, but Hooge et al. (2019) found that averaging the gaze positions from the two eyes improved accuracy only for some of the participants.
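A minimal sketch of such sample-wise averaging is given below. How recording software treats samples in which one eye is missing is not specified in the text above; returning no value for those samples is an assumption made here for illustration (another option would be to fall back on the remaining eye). The function name and data are hypothetical.

```python
def cyclopean_gaze(left, right):
    """Average synchronous left- and right-eye gaze samples.

    left, right: equally long lists of (x, y) tuples, with None marking
    a sample in which that eye was not tracked.
    Returns a list of (x, y) tuples, or None where either eye is missing.
    """
    if len(left) != len(right):
        raise ValueError("left and right signals must have the same length")
    averaged = []
    for l, r in zip(left, right):
        if l is None or r is None:
            averaged.append(None)  # assumption: no averaging without both eyes
        else:
            averaged.append(((l[0] + r[0]) / 2.0, (l[1] + r[1]) / 2.0))
    return averaged


if __name__ == "__main__":
    left_eye = [(10.0, 5.0), None, (12.0, 6.0)]
    right_eye = [(12.0, 5.0), (11.0, 5.0), (14.0, 8.0)]
    print(cyclopean_gaze(left_eye, right_eye))
```

Averaging cancels errors of opposite sign in the two eyes but not errors of the same sign, which may be one reason why it does not improve accuracy for every participant.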
Furthermore, head-mounted eye trackers may suffer from parallax errors, which happen because the vantage points of the eye and the scene camera do not coincide, typically when the measurement is not confined to a single plane. Binocular averaging is regularly done in glasses-based eye trackers (SMI ETG, Tobii Glasses, for instance), and in the Ober Saccadometer, which helps to alleviate the parallax issue. A thorough investigation of the geometry of the parallax error is provided by Mardanbegi and Hansen (2012), Narcizo et al. (2017), Narcizo and Hansen (2015), and Tatler et al. (2019).
Alternatively, the two signals from the two eyes can be used to measure vergence (e.g. Liversedge et al., 2006). Jaschinski et al. (2010) showed that the EyeLink II, assuming no environmental and participant artefacts, can resolve vergence eye movements of just below 40 mm in depth at a 60 cm viewing distance. However, vergence measurements with P–CR eye trackers are sensitive to artefacts that affect accuracy: Hooge et al. (2019) and Jaschinski (2016) both report effects of the pupil-size artefact on vergence. Calibration for binocular recordings introduces the choice whether to calibrate both eyes at once, or separately (Kirkby et al., 2013; Nuthmann and Kliegl, 2009; Švede et al., 2015). Additionally, Wang et al. (2019) found that the calculated vergence point (the intersection between the gaze direction vectors of the left and right eye) may deviate considerably from the fixated point, with a wide distribution in depth and a misestimation of the mean vergence point towards the participant.
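To relate a depth difference to a vergence angle, the sketch below uses the standard geometry for symmetric fixation on the midline, where the vergence angle is 2·atan(IPD / (2·d)) for interpupillary distance IPD and viewing distance d. The interpupillary distance of 63 mm is an assumed illustrative value, not one taken from the studies cited above, and the function names are hypothetical.

```python
import math


def vergence_angle_deg(ipd_mm, distance_mm):
    """Vergence angle for symmetric fixation at a given distance on the midline."""
    if ipd_mm <= 0 or distance_mm <= 0:
        raise ValueError("ipd_mm and distance_mm must be positive")
    return math.degrees(2.0 * math.atan((ipd_mm / 2.0) / distance_mm))


def vergence_change_deg(ipd_mm, distance_mm, depth_step_mm):
    """Change in vergence angle when the target moves depth_step_mm farther away."""
    near = vergence_angle_deg(ipd_mm, distance_mm)
    far = vergence_angle_deg(ipd_mm, distance_mm + depth_step_mm)
    return near - far


if __name__ == "__main__":
    ASSUMED_IPD_MM = 63.0  # illustrative assumption
    print(vergence_angle_deg(ASSUMED_IPD_MM, 600.0))
    print(vergence_change_deg(ASSUMED_IPD_MM, 600.0, 40.0))
```

Because the vergence angle falls off with distance, the same depth step corresponds to a smaller change in vergence angle at larger viewing distances, so depth resolution stated in millimetres is only meaningful together with the viewing distance.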
Environment
Eye tracking may take place in various environments, such as an MRI scanner, cars, fighter jets, behind a desk, in VR, and during sports. These environments may differ in light conditions, vibrations and sound, temperature and the presence of other people.
Direct sunlight has a critical impact on data quality in video-based P–CR and DPI eye trackers. Hansen and Pece (2005) and Holmqvist and Andersson (2017, pp. 138–139) show several examples of how infrared radiation from sunlight and hot light bulbs undermines tracking in video-based P–CR trackers. The importance of a controlled light environment is exemplified by Wang et al. (2010), who excluded 32% of participants, recorded while driving a real car, from one of their analyses due to poor data quality, but only had to remove 17% of participants recorded in a car simulator. The authors attributed the difference in data quality to the variable lighting conditions encountered during real driving. In a study of six pupil-centre calculation algorithms for video-based outdoor eye tracking, Fuhl et al. (2016) note that pupil algorithms have good average performance, but there are still problems in obtaining robust pupil centres in the case of poor illumination conditions. Rapid changes in illumination, common in car driving and flight deck research, can be detrimental to data quality and lead to a time-consuming investment in manual post-processing (Kasneci et al., 2014). Non-commercial algorithms to improve tracking in sunlight have been developed by Santini et al. (2018) and Hansen and Pece (2005).
Even moderate changes in light levels can indirectly affect data quality. Multiple studies have established the existence of the pupil-size artefact, in which changes in pupil size affect gaze position accuracy both in video-based P–CR systems (Choe et al., 2016; Drewes et al., 2012, 2014, 2011; Hooge et al., 2021; Hooge et al., 2019; Jaschinski, 2016; Wildenmann & Schaeffel, 2013; Wyatt, 2010) and in the DPI (Holmqvist et al., 2020; Holmqvist, 2015). Manipulating light levels to affect pupil size typically results in increased gaze inaccuracy of 1 to 5°. The reason that changes in pupil size affect reported gaze direction is that the pupil constricts and dilates asymmetrically, altering the pupil shape, and hence the calculated centre of the pupil image shifts position. In any video-based P–CR eye tracker, this implies a shift in gaze, even though the eyeball has not rotated with respect to the head. In a DPI, a small pupil may result in the P4 reflection at the back of the crystalline lens being obstructed. The geometry of the setup, gaze direction and distance to the eye camera have also been found to influence the magnitude of pupil-based errors (Ahmed et al., 2016; Hooge et al., 2021; Wilson et al., 1992; Wyatt, 2010, 1995). In addition, it has been reported that pupil size in P–CR eye trackers is also related to some eye-movement measures, such as the saccadic peak velocity (Nyström et al., 2016).
Accuracy in video-based P–CR trackers is generally better for participants who have smaller baseline pupils (before calibration), measured under controlled illumination, as reported by Ahmed et al. (2016) and Holmqvist (2015). For the DPI eye tracker, the opposite is true: a large baseline pupil size results in better accuracy (Holmqvist, 2015). The signals of EOG systems and scleral coils are likely independent of pupil size, while data from retinal trackers benefit from a large pupil.
The pupil-size artefact may affect other measures. For instance, Hooge et al. (2019) found that light levels affect vergence estimations, with an error of 0.36–0.75° per mm change in pupil size (similar findings were reported by Jaschinski, 2016). We can expect that gaze position errors induced by the pupil-size artefact will inevitably propagate to many AOI- and other higher-order measures.
Environmental vibrations and ambient noise
Sources of vibration in the recording environment contribute to increased variation in the gaze signal, as exemplified by Figure 6.24 in Holmqvist and Andersson (2017), which shows how transients in the signal appear when a person walks in a room where an artificial eye is being measured with a tower eye tracker. Vibrations could be expected to matter particularly on flight decks, in cars, and during sports. For instance, De Reus et al. (2012) report that alignment shifts of the eye tracker inside the flight helmet due to external motion frequently caused inaccuracies of gaze (see also Niehorster et al., 2020b). For lab studies, a nearby elevator shaft, a powerful air conditioning unit, or vibrations caused by someone walking nearby on hard floors may add measurable noise to a sensitive eye-tracking recording. Sound in the recording situation is another form of oscillation that could make the eye tracker vibrate and affect the quality of recorded data. However, Hooge et al. (2019) recorded Tobii TX300 data at an indoor science festival with moderately loud music and found accuracy values close to manufacturer specifications. Controlled studies of the effect of vibrations on eye-tracking data quality appear to be lacking.
Presence of others
The presence of other people during the recordings may affect measures of eye movements and gaze behaviour in ways that are little understood. Social appropriateness may matter: The very presence of an eye tracker can impact head and eye movements, with people looking only at what they feel is socially appropriate when they believe that an eye tracker is recording (Risko and Kingstone, 2011; Nasiopoulos et al., 2015). Distraction is another possible factor: For instance, infants are easily distracted, looking at nearby people rather than at the monitor (Tomalski & Malinowska-Korczak, 2020). Accidental mismeasurements may happen when the infant is seated in the lap of a parent, and the eye tracker finds and records the parent’s eyes. Additionally, Oliva et al. (2017) found longer latencies in the antisaccade task when adult participants were recorded in proximity to one another, for reasons that are not well understood.
Special recording environments
The MRI scanner environment consists of a dark and noisy tunnel, with powerful magnetic fields, in which participants must lie down. The duration of experiments and pacing of stimuli often differ from those outside the MRI. Importantly, data quality from video-based P–CR tracking in MRI (SR Research, SMI, Arrington, Gaze Intelligence) generally appears to be lower than outside the MRI: poorer precision and accuracy, and more frequent data loss (Dar et al., 2021). For infrared limbus trackers (MR-Eyetracker, Cambridge Research Systems) attached to the headcoil, even small movements of the head may over time result in data loss. MRI trackers also exist that use a multicore fiber to transmit light to outside the MRI machine, where the reflections of the corneal bulge are processed. The Ober MRI-tracker exhibits crosstalk (i.e. correlation) between horizontal and vertical signals, which makes the gaze signal useful only for horizontal tracking.
A curious observation is that saccadic latencies are longer when obtained in an MRI scanner than outside the MRI scanner, which could reflect the long fixation periods between saccades required in scanners, or other differences, such as participants lying down and potentially feeling drowsy (e.g. Talanow et al., 2020, their Table 1). Furthermore, the magnetic field of 7T MRIs has been reported to induce nystagmus in some participants (Roberts et al., 2011).
Head-mounted virtual-reality sets allow exclusive control over the visual stimulation provided to a subject, while shutting out any visual references provided by the outside world. Little is known of the data quality of eye trackers integrated into VR goggles, but Pastel et al. (2021) found that precision is significantly poorer in the SMI Vive VR goggles compared to the SMI glasses. Accuracy however differs only in some conditions, mostly when the distance to the fixation point changes. Stein et al. (2021) found that the end-to-end latency of common VR headsets ranged from 45 ms to 81 ms (compare Section “Signal properties and
Setup and geometry
When preparing a manuscript about an experiment involving an eye tracker, it is important to realise that an eye-tracking setup is more than just the eye tracker itself. Hessels and Hooge (2019) point out that a screen-based eye-tracking setup may consist of at least an eye tracker, computer screen, a seat for the participant, and a table or mounting device for positioning the eye tracker. For wearable eye trackers, the setup includes the participant, eye tracker, and whatever frame, headbands, helmets or straps are used to position the eye tracker relative to the participant’s eyes. With geometry, we mean the “absolute position and orientations of the eye, the eye-tracker camera, and the IR illuminator” (Hooge et al., 2021), and in the case of screen-based eye tracking, the screen. The geometry can thus (partially) be described by the distances between eye tracker (camera and/or IR illuminator), participant, and screen, and their relative orientations. A picture or schematic can be useful in providing this information, as done in Choe et al. (2016, Figure 1), Hessels and Hooge (2019, Figure 2), Valtakari et al. (2021, Figure 1), and our Fig. 3.
Gaze direction, measurement space and monitor size
Relevant properties of the setup may include the distance and relative orientation between participant and eye tracker, participant and computer screen, and the size and resolution of the computer screen. Most video eye trackers report gaze position in pixels on a screen. For some research this is sufficient (e.g. area-of-interest research in marketing). For other studies, one may wish to report the orientation and rotation of the eye in angular measurements (e.g. Haslwanter, 1995). In order to convert a gaze position on a screen in pixels to an angular measurement, it is necessary to know the distance and relative orientation between participant and eye tracker, participant and computer screen, and the size and resolution of the computer screen. If the width and height of the screen are smaller than 20° (10° to the left and 10° to the right), the small-angle approximation may be applied. This allows one to transform gaze positions in centimetres or pixels on screen to angles with a simple multiplication factor. For a general and more accurate method for this transformation, see Holmqvist and Andersson (2017, p. 21).
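The difference between the small-angle approximation and a trigonometric conversion can be sketched as follows. The sketch assumes a flat screen perpendicular to the line of sight, with the gaze offset measured from the point on the screen directly in front of the eye; it is not the general method of Holmqvist and Andersson (2017), and all function and parameter names are hypothetical.

```python
import math


def offset_cm(offset_px, screen_width_cm, screen_width_px):
    """Convert an on-screen offset from pixels to centimetres."""
    return offset_px * screen_width_cm / screen_width_px


def angle_exact_deg(offset_cm_value, viewing_distance_cm):
    """Visual angle of an offset from the straight-ahead point, via arctangent."""
    return math.degrees(math.atan(offset_cm_value / viewing_distance_cm))


def angle_small_deg(offset_cm_value, viewing_distance_cm):
    """Small-angle approximation: the angle is a constant factor times the offset."""
    factor = math.degrees(1.0 / viewing_distance_cm)  # degrees per centimetre
    return factor * offset_cm_value


if __name__ == "__main__":
    # Hypothetical setup: 53 cm wide screen, 1920 px wide, viewed from 57.3 cm.
    x_cm = offset_cm(362.0, 53.0, 1920.0)
    print(angle_exact_deg(x_cm, 57.3), angle_small_deg(x_cm, 57.3))
```

The approximation always overestimates the angle, and the overestimation grows with eccentricity, which is why it is only recommended for small screens or central gaze positions.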
When the monitor is larger than the measurement range of the eye tracker (Section “Eye-tracking methods: Similarities and differences”), data quality will be poorer in the outer parts. Niehorster et al. (2020b), Schlegelmilch and Wertz (2019), Popelka et al. (2016), Holmqvist (2015), and Guestrin and Eizenman (2006) all found that data recorded in the corners of the monitor (or measurement plane) are of poorer quality than those recorded at the monitor’s centre. Generally, recordings made while looking at corner positions exhibit a precision that might be worsened by a factor of 3, and accuracy by an average 1–10°, depending on the system. Such findings led Majaranta et al. (2009) to suggest putting important information in gaze-controlled systems in the centre of the screen, to give the user a better perceived accuracy.
As most P–CR eye trackers do not report physical pupil size, but pupil size in the eye image, the pupil-size signal is susceptible to viewing direction and distance. Therefore, in experimental designs in which the participant is required to look around the screen, researchers should also be aware of the pupil foreshortening artefact (Brisson et al., 2013; Mathur et al., 2013; Young and Sheena, 1975). As the gaze direction deviates from the eye-tracker camera axis, the image of the pupil in the eye-camera sensor deforms, making the pupil shape appear more oval and the pupil diameter – a common basis for pupil-size measurements – artificially shorter, and pupil area measurements artificially smaller. This is of particular importance for experiments using the pupil size as a measurement for estimates of the participant’s psychological state (e.g. cognitive load or arousal) during free-viewing.
Various compensation algorithms have been developed to decrease the pupil foreshortening artefact, for instance relying on a geometrical model (Gagl et al., 2011), or using data from an artificial eye rotating horizontally in front of the screen (Hayes & Petrov, 2016).
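To illustrate the idea behind a geometrical compensation, the sketch below applies a simple cosine model: a circular pupil viewed at an angle between the gaze direction and the camera axis projects as an ellipse whose area shrinks with the cosine of that angle. This is a simplified assumption for illustration only and is not the specific procedure of Gagl et al. (2011) or Hayes & Petrov (2016); the coordinate system, function name and example positions are hypothetical.

```python
import math


def correct_pupil_area(measured_area, eye_pos, camera_pos, target_pos):
    """Undo cosine foreshortening of a pupil area measured in the eye image.

    All positions are (x, y, z) tuples in one common coordinate system.
    Assumes the measured area scales with cos(theta), where theta is the
    angle between the eye-to-camera and eye-to-target directions.
    """
    to_camera = [c - e for c, e in zip(camera_pos, eye_pos)]
    to_target = [t - e for t, e in zip(target_pos, eye_pos)]
    dot = sum(a * b for a, b in zip(to_camera, to_target))
    norm_camera = math.sqrt(sum(a * a for a in to_camera))
    norm_target = math.sqrt(sum(b * b for b in to_target))
    cos_theta = dot / (norm_camera * norm_target)
    if cos_theta <= 0.0:
        raise ValueError("gaze is 90 degrees or more away from the camera axis")
    return measured_area / cos_theta


if __name__ == "__main__":
    eye = (0.0, 0.0, 0.0)
    camera = (0.0, -10.0, 60.0)   # hypothetical camera below the screen
    for target in [(0.0, -10.0, 60.0), (0.0, 15.0, 60.0), (25.0, 15.0, 60.0)]:
        print(target, round(correct_pupil_area(10.0, eye, camera, target), 3))
```

Under this model the correction is smallest when the participant looks straight at the camera and grows as gaze moves towards the far corners of the screen.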
Distance between participant and eye tracker
The distance between participant and eye tracker needs to be given attention, for all eye trackers, remote as well as head mounted systems. Chatelain et al., (2020) report that when participants are allowed to choose for themselves where to sit in front of a remote eye tracker, the distance to the eye tracker ranges from 40–120cm. This self-preferred range of seating distances is larger than what eye trackers can handle. Most manufacturers of remote eye trackers recommend having the distance between the participant and the eye tracker to be within a narrow range, defined by the optics of the system, with its centre at around 60–70cm (the LC EyeFollower being an exception with a specified range of 46–97cm). When a participant moves outside of the tracking range, the inaccuracies and noise levels in data can quickly triple and data loss also increases (Blignaut and Beelders, 2012; Blignaut & Wium, 2014; Kolakowski & Pelz, 2006; Schlegelmilch & Wertz, 2019).
Restrained vs. free head movements
The history of eye-movement research includes numerous examples of attempts to minimize the participants’ head movements. Often, the use of head restriction is based on assumptions that the recorded data will be of better quality with a restricted head (e.g. van der Laan et al., 2017). Although overall there is a lack of studies on the effect of using chinrests, there are a few indications that they may be useful: For instance, Hermens (2015) concluded that in some cases, the EyeLink II may produce artificial microsaccades due to small head movements, and Cerrolaza et al., (2012) showed that inaccuracies may originate from small stabilizing head movements that participants make. Additionally, Holmqvist et al., (2021) found that recording participants in a chinrest reduced the level of noise in some eye trackers.
Head restriction methods can be roughly divided into chinrest, forehead rest, and bite bar/board, the three of which can be combined to prevent both rotation and translation of the head. For some animal participants that take part in concurrent eye-movement and neurophysiological measurements, such as the rhesus macaque, the desire for head-movement restriction from both measurement methods has led to head restraints being surgically attached to the animal’s skull for data collection with video-based eye trackers (McFarland et al., 2013) or they may have scleral coils implanted in their eyes for use with magnetic coil trackers (Kimmel et al., 2012).
The P–CR technique, found in the vast majority of eye trackers today, originally came about to allow some head movement by the participant (Merchant, 1967). While the original P–CR method may handle small movements of the head, on the order of a few millimetres up to a centimetre, recent remote video-based eye trackers are designed to allow for free head movements in a much larger space (the headbox, see Fig. 3), tens of centimetres or more across.
One way to make room for larger head movements is to use a wide-angled eye camera that covers a large space around the participant, and use a trade-off: The sampling frequency of the eye camera can be increased by reducing the size of the recording window on the camera sensor so it just samples the eye region. When the participant moves, this recording window on the camera sensor must be moved in real-time (or physically, using a pan-tilt camera as in the LC EyeFollower). Although moving the recording window allows for larger head-movements, this window motion introduces sample dropping (data loss) in some eye trackers (Holmqvist and Andersson, 2017, p. 168). Studying the effect on accuracy, precision, latency and loss of data, Blignaut (2018) found that one or two headbox adjustments per second had no effect on accuracy, but did affect spatial and temporal precision (in the author’s custom-built eye tracker). However, some eye trackers change sampling frequency altogether when the eye is lost in the recording window of the camera sensor and the eye tracker goes into full-sensor search mode (Hessels et al., 2015, Figure 3).
When the participant’s eyes are at the center of the headbox, eye-tracking data quality is best. When the eyes are located away from the headbox center, data quality is negatively affected, as experienced by many infancy researchers and investigated experimentally by Hessels et al., (2015) and Niehorster et al., (2018), who found a strong effect of rotating the head on the quality of eye-tracking data on a number of eye trackers. In fact, any relative movement between eye and the eye camera of the eye tracker can reduce data quality, also in eye-tracking glasses (Niehorster et al., 2020b).
During gaze interaction, the human–computer interaction technique of controlling a computer with gaze, the participant/user has immediate cursor feedback of where the eye tracker thinks that gaze is located. Gaze inaccuracy originating from the users’ movements undermines effective usage. Chinrests are not a solution here, because many users have involuntary head movements or seating positions that make a simple head restriction impossible, requiring a different user interface design (Donegan, 2012). Some users (try to) actively use head movements to adjust gaze pointing inaccuracies (Špakov et al., 2014). The authors speculate that this can be common among people with disabilities who actually use gaze control in their everyday life.
For infants, adults with certain disabilities, and animals, head restriction methods are not always practically usable, and alternative methods for head movement reduction are often used. Hessels et al., (2015) compared the eye-tracking data quality of infants recorded in a reclining car seat versus that of infants sitting on the parent’s lap or in a highchair. Accuracy was worse (higher) for infants seated on the parent’s lap or in the highchair than for infants in the car seat. Yet, a participant’s positioning puts additional constraints on the placement of the eye tracker. Hessels and Hooge (2019) found that placing infants in a car seat required the eye tracker to be tilted forward substantially, which might not be possible for some eye trackers without extensive modifications and additional equipment. Similarly, for patients confined to the bed, mounting the eye tracker on an adjustable arm allowed for effective gaze interaction for disabled users lying on their back (Blignaut, 2017; Hansen et al., 2011).
In this section, we review how certain characteristics of participants are related to the quality of recorded eye-tracking data, to eye-movement measures and to higher-order measures of gaze behaviour. The characteristics we discuss include gender, age, visual acuity, visual aids, physiology of the eye region, mental state (e.g. sleep deprivation, mental fatigue, cognitive workload), expertise, and psychopathology. A complete review of all these characteristics – particularly expertise and psychopathology – is beyond the scope of the present paper. Rather, our goal here is to show that these characteristics may be relevant, so that researchers can take them into account when defining their participant group and exclusion criteria. Whenever possible, we direct readers to more in-depth reviews on the specific topics.
Attrition rate is operationalised as the proportion (or percentage) of participants who were not included in the analysis. Attrition rate exhibits a large variation between studies. For instance, Dalveren and Cagiltay (2019) report an attrition rate of 17.9% for the EyeTribe, while Holmqvist (2015) reports 1.0% for the same eye tracker. The reported attrition rates appear to be lower in studies with adult participants in light-controlled labs, for instance 0–8.2% in Holmqvist (2015), compared to recordings made in sun-lit environments, for instance Wang et al., (2010), who report 32% attrition rate during outdoor driving. Attrition rates may be high for infant studies, for instance: 59–64% in Burmester and Mast (2010), and for children in the autism spectrum (100% in Birmingham et al., 2017).
Older remote video-based eye trackers have been reported to have higher attrition values also for lab studies with adults. For instance, Sibert and Jacob (2000) reported 38% attrition rate for ASL Model 3250R, while Schnipke and Todd (2000) reported 62.5% for the ASL 504.
52.2% of the publications in the reporting database (see Section “Reporting practices and existing reporting guidelines” for details) report the number of participants excluded from analysis. Their main reasons for excluding participants were “data quality” (44.1% of the publications), “impossible to calibrate” (19.8%), “the participant” (12.6%), “other” (7.2%), “error in the experimental procedure” (5.4%), and “failed to follow the instructions” (0.9%). This suggests that poor data quality is the major reason for excluding participants from analysis.
Alternatively, attrition rate can refer to the number or proportion of trials or events per participant that were excluded, for those participants included in the analysis. In the reporting database, 30.9% of the studies reported excluding trials or fixations. Each study reported a slightly different reason for exclusion, many of which relate to data quality, outliers, technical failures or behavioural mishaps.
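The two senses of attrition rate described above can be made concrete with a short sketch. The function names, the data layout (one list of exclusion flags per participant) and the example numbers are hypothetical and are not taken from any of the cited studies.

```python
def attrition_rate(n_excluded, n_total):
    """Proportion of units (participants or trials) excluded from analysis."""
    if n_total <= 0:
        raise ValueError("n_total must be positive")
    if not 0 <= n_excluded <= n_total:
        raise ValueError("n_excluded must lie between 0 and n_total")
    return n_excluded / n_total


def trial_attrition_by_participant(excluded_flags):
    """Per-participant trial attrition.

    excluded_flags maps a participant id to a list of booleans,
    where True marks a trial that was excluded from analysis.
    """
    return {
        pid: attrition_rate(sum(flags), len(flags))
        for pid, flags in excluded_flags.items()
    }


if __name__ == "__main__":
    # Hypothetical study: 40 participants recorded, 5 excluded.
    print(f"participant attrition: {attrition_rate(5, 40):.1%}")
    # Hypothetical trial-level exclusions for two included participants.
    flags = {"p01": [False, True, False, False], "p02": [False] * 4}
    print(trial_attrition_by_participant(flags))
```

Reporting both the counts and the resulting proportions, together with the reason for each exclusion, lets readers distinguish participant-level from trial-level attrition.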
There are some reports of differences between genders in gaze behaviour towards other people (Coutrot et al., 2016; Gluckman & Johnson, 2013; Rupp & Wallen, 2007), and in pupil reactions to pain (Ellermeier & Westphal, 1995). Coors et al., (2021) found that although gender-related differences in eye-movement measures (blink rate, smooth pursuit gain) do exist, most are negligible in magnitude.
Blignaut and Wium (2014) report that, statistically, Asian participants are more difficult to track, and the resulting data are on average of worse quality than for participants of European or African ethnicity (see also Holmqvist, 2015). These findings reflect the generally narrower palpebral aperture in the east Asian population. Amatya et al., (2011) found a larger proportion of express saccade makers in the Asian participant group, indicative of faster saccadic reaction times.
Data quality as well as many eye movement measures covary with the age of the participant. Firstly, infant researchers have consistently shown that eye-tracking data quality tends to be worse for younger children than for adults. For example, accuracy and precision are generally worse, and data loss is generally higher, for infants and toddlers than for school-aged children and adults (Dalrymple et al., 2018; Hessels et al., 2016, 2019). Interestingly, worse precision in infant eye-tracking data is not due to fixation instability (Seemiller et al., 2018). Moreover, higher amounts of data loss with infant participants are not only due to infants looking away more from the screen, as it is often characterised by short periods of data loss (less than 100ms: Hessels et al., 2015; Wass et al., 2014). Neither is this due to blinking, as young children blink significantly less than adults (Stern et al., 1994). In addition, it seems that individual differences in data quality are larger for the younger participants (5–10 months) than for the older participants (3–9 years, Hessels and Hooge, 2019). The latter is particularly problematic when analysis methods are used that are susceptible to differences in data quality.
The oculomotor system develops into adulthood and old age. The resting pupil diameter has been found to be larger for young adults (around 20 years) than for older (around 70 years), independent of luminance level (Bitsios et al., 1996). Saccadic amplitudes have been found to be shorter both for children (below 10 years) and older adults (above 60), compared to young adults (30–40 years, Helo et al., 2014; Açik et al., 2009; Mackworth & Bruner, 1970; Açık et al., 2010). The latencies of said saccades follow the same pattern, decreasing from childhood into adulthood (Luna & Velanova, 2011; Salman et al., 2006), and then increasing again as participants grow older (Moschner & Baloh, 1994). Smooth pursuit parameters such as latency (time until the movement is initiated) and gain (how closely gaze follows the target velocity) also have been found to be related to age. While latency is longer for older than for younger adults (Sharpe & Sylvester, 1978), gain is closer to the ideal value in young adults compared to children (Luna & Velanova, 2011; Salman et al., 2006).
Binocular coordination during reading is also poorer in children than in adults (Blythe et al., 2006). In a review of the eye movements of the aging reader, Paterson et al., (2020) point out changes both on lexical (e.g. the word frequency effect), and orthographic levels (e.g. sensitivity to removal of inter-word spacing). Age variation in fixations and blinks has not been systematically explored outside reading research (Marandi and Gazerani, 2019).
Also, with older age, it is more likely that the participant will wear spectacles or lenses, have droopy eyelids, have cataracts, or an artificial lens from cataract surgery, macular degeneration and peripheral scotomas, as well as several neurodegenerative ailments, which tend to make either data quality worse or alter eye movements, or both.
Visual acuity and visual impairment
For readers with low acuity, the fixation durations are longer, saccades shorter, and consequently text reading takes much longer (Legge et al., 1997). Furthermore, blurred vision caused by, for instance, myopic refractive error results in an increase of the amplitude of microsaccades (Ghasia & Shaikh, 2015). Eye movements are dramatically different for participants with low vision, i.e. a loss of vision that cannot be corrected by medical or surgical treatments or conventional eyeglasses, such as macular degeneration, scotomas, cataracts, or nystagmus (Leigh & Zee, 2006).
Spectacles, lenses and makeup
Nyström et al., (2013) investigated the effect of eye-region physiology, spectacles and other factors on accuracy, precision and data loss in the SMI HiSpeed1250, finding poorer precision when participants wear spectacles, and poorer accuracy, precision and data loss when contact lenses are worn. In a large follow-up using 12 eye trackers, Holmqvist (2015) reports up to 10∘ worse accuracy and up to three times (300%) poorer precision for recordings where the participants wore spectacles that were scratched or dirty or that had an anti-reflective coating, compared to recordings where no visual aids were used. Data recorded from participants wearing soft contact lenses exhibited 0.5–3∘ poorer accuracy and on average 20–40% poorer precision, compared to when participants wore no visual aid. Asking a participant to remove the spectacles to record data of better quality might result in poorer acuity that may alter the eye movements (see above).
Makeup (eyeliner, eye shadow and mascara) results in poorer accuracy by 0.2–3∘, and up to three times poorer precision (Holmqvist, 2015). For participants with forward- and downward-pointing eyelashes, makeup results in poor data quality (see also Nyström et al., 2013). Mascara is black in both infrared and visible light, and Holmqvist and Andersson (2017, Figure 5.5) show eye images from actual recordings that depict how the dark mascara may interact with the pupil center calculation.
Physical properties of the eye region
Differences in eye physiology refer to eye colour, lash direction, ocular dominance, baseline pupil size and more. Holmqvist (2015), Hessels et al., (2015), and Nyström et al., (2013) investigated the relation of data quality to physical properties of eyes, from large groups ranging between 75 and 194 participants, in up to 12 eye trackers, and reported compatible findings. In this subsection, we report effect sizes from these three studies, as ranges from the many eye trackers.
Holmqvist (2015) found that darker pigmentation in hair, eyes and skin correlates with better (lower) accuracy on most video-based eye trackers (0.5–1∘), and also with better precision (20–80% lower RMS-S2S). The disadvantage of blue eyes relative to dark iris pigmentation has been hypothesised to result from poor contrast between pupil and iris when the eye image is recorded in infrared light: A blue iris appears dark, while a brown iris appears bright (Holmqvist and Andersson, 2017, Figure 4.13), and the latter provides a clearer contrast between iris and the dark pupil, which the image processing algorithms can make better use of.
Clinical participant groups may have features in their irises that may make tracking more difficult for some eye trackers. For instance, participants who lack an iris, known as aniridia (Beby et al., 2011), are likely difficult to record with P–CR trackers. Participants with William’s Syndrome have a stellate pattern in the iris (Tran & Kaufman, 2003) that could interfere with the CR image of P–CR trackers. These iris features are often associated with specific eye-movements. For instance, participants with albinism may have transillumination effects in their irises, and their lack of pigmentation in skin and in the retina is associated with congenital nystagmus (Collewijn et al., 1985).
A smaller baseline pupil results in better accuracy (up to 2∘) and up to three times poorer precision (Holmqvist, 2015). Interocular distance is defined as the distance between pupil centres when looking straight ahead. Holmqvist (2015) found poorer accuracy (0.5–1.0∘) for small interocular distances, but only in remote eye trackers.
A larger eye opening (also ‘palpebral fissure’ or ‘eye cleft’) correlates with better accuracy: up to 1∘ better in fully open eyes compared to eyes with the smallest palpebral fissure. Forward or upward-pointing lashes show the best accuracy, while downward-pointing eye lashes, which Holmqvist (2015) found in about 10% of their 194 participants, exhibit a poorer accuracy (up to 4∘) and precision, although some eye trackers are more affected than others. A more closed eye is more likely to block the eye tracker’s view of pupil and CR features, but this depends on the geometry of the setup, both in remote and head-mounted systems.
Arousal, mental fatigue and cognitive workload
Ayres et al., (2021) present a meta-study of 33 experiments and conclude that eye-movement measures of cognitive load are more sensitive than heart, skin, and brain measures. Mental workload and arousal are positively associated with pupil dilation as shown in a large number of controlled studies and life-like human factors studies, measured using high- or low-end eye trackers (Einhäuser, 2017). Examples include performing a memory task (Kahneman and Beatty, 1966), arithmetic tasks (Ahern & Beatty, 1979; Hess & Polt, 1964), Air Traffic Control (Ahlstrom & Friedman-Berg, 2006), (simulated) driving (Čegovnik et al., 2018), tasting a disgusting drink (Kaneko et al., 2019) and social stress caused by having to sing a song (Toet et al., 2017). Other parameters of eye movement behaviour can be affected as well, but this seems to be context or task dependent. For instance, for blinking rate, Recarte et al., (2008) and Čegovnik et al., (2018) found an increase with increasing workload, whereas Brouwer et al., (2014) found no effect; and Bauer et al., (1987) and Fogarty and Stern (1989) found a decrease in blinking rate with increasing workload. This variation in results may be caused by the differences in the workload-inducing task across these studies.
Workload has also been reported to decrease microsaccade rates but increase their amplitudes (Siegenthaler et al., 2014), increase fixation duration (Rayner & Pollatsek, 1989) and decrease horizontal scanning during driving (Recarte & Nunes, 2003). Mental fatigue and workload have been found to affect saccade and microsaccade dynamics during visual search (Di Stasi et al., 2013), surgery (Di Stasi et al., 2014) and for pilots suffering from low levels of oxygen (Di Stasi et al., 2014). When researchers investigate workload, these eye-movement measures are often combined. For instance, Van Orden et al., (2000) developed a model using regression analyses from eye movement data on a surveillance tracking task, showing that fixation duration, blink duration and mean pupil dilation combined to a robust and reliable predictor of the performance of surveillance tracking.
Many studies have reported effects of partial and total sleep deprivation on eye movements. Sleep deprivation is known to result in increased saccadic latency and reduced saccadic peak velocity and smooth pursuit velocity, as well as more antisaccade errors (Ahlstrom et al., 2013; Fransson et al., 2008; Meyhöfer et al., 2017). Furthermore, Schalén et al., (1983) present data showing that saccadic and smooth pursuit peak velocity may vary with the circadian rhythm.
Moreover, sleep deprivation has been shown to cause mental fatigue and affect a myriad of cognitive domains such as memory (Van Der Werf et al., 2009), cognitive speed (Van Dongen and Dinges, 2005) and arousal (Gunzelmann et al., 2007), which in turn may affect eye movements.
Many eye-tracking studies of expertise have been made. Good overall reviews are provided by Reingold and Sheridan (2011) and Gegenfurtner et al., (2011). For instance, expert chess players tend to have fewer, longer fixations in the middle, while novices scan more (Charness et al., 2001). Expert radiologists tend to fixate abnormalities earlier than novices (Nodine et al., 2002; Alexander et al., 2020). Even the ability to keep one’s eye still is affected by training and experience (Cherici et al., 2012; Di Russo et al., 2003). In medical expertise research, a lack of experience or familiarity in the task has been correlated with blink rate and duration, fixation duration, transition rate, and pupil dilation (Lee et al., 2019, 2020). Machine learning approaches have been used to differentiate between levels of language proficiency (Karolus et al., 2017). Findings in expertise studies do not easily transfer to other domains of expertise. The one and same participant can be an expert in one task while having no expertise in a very related task (Kevic et al., 2015). In fact, it is important to understand that the participant’s field of expertise, the task, and the stimulus are crucial determinants of what effect can be expected in terms of eye movements.
Pathology and personality
Several different psychiatric disorders have independently been found to coincide with oculomotor impairments with medium-to-large effect sizes, although these depend on diagnosis and experimental task (Alexander et al., 2018; Smyrnis et al., 2019). For instance, patients with schizophrenia reliably show reduced smooth pursuit accuracy (reduced gain, increased root-mean-square error of the signal, increased frequency of saccades during pursuit). In a meta-study on the eye movements of patients with schizophrenia, O’Driscoll and Callahan (2008) stated that “Average effect sizes and confidence limits for global measures of pursuit and for maintenance of gain place these measures alongside the very strongest neurocognitive measures in the literature.” (p. 359). Patients with schizophrenia also reliably show increased rates of direction errors on the antisaccade task. Similar impairments, albeit with smaller effect size, are observed in patients with bipolar disorder or major depressive disorder (Katsanis et al., 1997).
Differences in gaze behaviour between individuals with and without a diagnosis of autism spectrum disorder (ASD) have also been substantially investigated (see e.g. Bast et al., 2021; Guillon et al., 2014; Sasson et al., 2011). One often-reported finding is differences in gaze behaviour to the eyes of a face between individuals with and without an ASD diagnosis (e.g. Dalton et al., 2005; Jones et al., 2008, 2013; Klin et al., 2002; Rice et al., 2012). However, these findings are not unequivocal (see e.g. Dapretto et al., 2006; McPartland et al., 2011; van der Geest et al., 2002). Several potential explanations have been posited for the inconsistent findings, including the presence of alexithymia (Bird et al., 2011) and the cognitive demand required in the experimental setting (Senju & Johnson, 2009). A meta-analysis of 122 studies on gaze differences to social and non-social information between people with and without autism is given by Frazier et al., (2017). Other reported differences include eye movements during visual search (e.g. Keehn and Joseph, 2016; Kemner et al., 2008) and attentional disengagement (e.g. Keehn et al., 2013).
Furthermore, Alzheimer’s (Kapoula et al., 2014), Parkinson’s (Otero-Millan et al., 2018) and Huntington’s disease are known to affect several characteristics of eye movements (Leigh & Zee, 2006).
Variation in human personality has been associated with eye movements (Bargary et al., 2017) and with gaze patterns to social stimuli (Wu et al., 2014).
Medication and drugs
For studies that investigate differences in eye-movement measures between clinical and control groups, recording patients who may be under medication, the question may arise whether it is the psychopathological state or the medication that drives the difference. For example, benzodiazepine drugs cause reduced saccade peak velocity (De Visser et al., 2003) as well as increased saccade latency and reduced spatial accuracy of saccades (Ettinger et al., 2018). Measures of intra-individual variability of saccades are also increased. Benzodiazepines also reliably reduce smooth pursuit velocity (Karpouzian et al., 2019).
Even in non-clinical trials, drug use may be a consideration. Acute consumption of nicotine may improve smooth pursuit accuracy, reduce catch-up saccades (Meyhöfer et al., 2019; Avila et al., 2003) and may reduce antisaccade latencies as well as the rates of direction errors in the antisaccade task (Ettinger & Kumari, 2019). Cannabis has the opposite effects to nicotine: latencies and errors in the antisaccade and memory-guided saccade tasks are increased, and saccade peak velocity is lower (Huestegge et al., 2009). Pupil size is affected by some drugs (Newmeyer et al., 2017). Increased blood alcohol levels impair the quality of smooth pursuit (Flom et al., 1976; Wilkinson et al., 1974), decrease saccade velocity (Lehtinen et al., 1979) and increase fixation durations (Moser et al., 1998). Alcohol also has effects on gaze behaviour. For instance, Buikhuisen and Jongman (1972) presented a traffic film containing 86 important events to participants, while tracking their eye movements. Those who were alcohol-intoxicated fixated on fewer events, especially when located away from the centre of the display, than non-intoxicated participants.
Calibration and accuracy
Calibrating the eye tracker for the specific participant is a prerequisite for recording gaze in some eye trackers and for optimal accuracy on all eye trackers. In this section, we first describe the procedure and principles of calibration generally, how to assess calibration, and correct for poor accuracy, and then we describe methods for calibrating challenging participants, such as infants, dogs, and people with nystagmus. These methods all aim to ensure the best possible accuracy.
How is calibration done?
Just before or at the beginning of a recording session, participants typically need to perform a small initial task of looking at a set of pre-defined targets that either appear on, or smoothly move across the stimulus monitor, or are otherwise presented in front of the participant. If the recording is made within the software of a video-based P–CR tracker, when the participant fixates the point, the eye tracker registers the relative positions of features (such as P and CR) for each calibration point. Quite often, the researcher may choose how many targets (often points) will be shown during this initial phase, and in some cases, where targets appear, and what the target will look like. For most other technologies (DPI, coils, EOG, etc.), calibration needs to be done with custom software and will likely also involve looking at or following fixation targets.
The choice of calibration target may have an effect on the data quality in the subsequent recording. Thaler et al., (2013) examined which fixation target results in the least dispersion during fixation for adult participants, while Schlegelmilch and Wertz (2019) investigated the effects of calibration targets on the dispersion of the gaze position signal of the EyeLink 1000 Plus, for infant research. Whether showing a calibration target that minimises dispersion will result in better accuracy is unknown.
Colour and luminance of the background
Previously referenced studies on the pupil-size artefact (Section “Environment”) tell us that changes in pupil size will affect the accuracy of the gaze position signal. Thus, calibrating at a different luminance from the luminances displayed during data collection is likely to affect the accuracy of the measurement. If stimuli vary in luminance, it may be useful to calibrate for a range of pupil sizes (Drewes et al., 2012).
Which data segment to use for the calibration?
The eye-tracking software, manufacturer-based or custom tailored, selects a segment of data for when it estimates that the participant is looking at the calibration target. The decision of which segment of data is used for calibration is mostly made by the software itself (Hansen & Ji, 2010). Nyström et al., (2013), however, showed that accuracy is better when the participant indicates that s/he is looking at the fixation target than when this decision is left to the system. This finding also relates to the idea behind the participant-controlled post-calibration by Ko et al., (2016). However, participant-controlled calibration does not appear to be the standard in most eye-tracking software today.
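As an illustration of what an automatic segment selection might look like, the sketch below picks the most stable window of raw samples recorded while a calibration target is shown. This is a hypothetical heuristic for illustration only; it does not reproduce the (typically undocumented) selection rule of any particular eye-tracking software, and the function name, window length and sample values are assumptions.

```python
def most_stable_window(samples, window_len):
    """Return (start_index, mean_x, mean_y) of the least dispersed window.

    samples is a list of (x, y) raw feature or gaze values recorded while
    one calibration target was shown. Dispersion is the mean squared
    distance of the samples in the window from the window centroid.
    """
    if window_len <= 0 or window_len > len(samples):
        raise ValueError("window_len must be between 1 and len(samples)")
    best = None
    for start in range(len(samples) - window_len + 1):
        window = samples[start:start + window_len]
        mean_x = sum(x for x, _ in window) / window_len
        mean_y = sum(y for _, y in window) / window_len
        dispersion = sum(
            (x - mean_x) ** 2 + (y - mean_y) ** 2 for x, y in window
        ) / window_len
        if best is None or dispersion < best[0]:
            best = (dispersion, start, mean_x, mean_y)
    return best[1], best[2], best[3]


if __name__ == "__main__":
    # Hypothetical samples: a saccade towards the target, a stable
    # fixation, then an early departure from the target.
    samples = [(100, 100), (300, 250), (500, 400),
               (500, 400), (500, 400), (650, 500)]
    print(most_stable_window(samples, window_len=3))
```

A participant-controlled procedure replaces this estimate with the samples around the moment the participant confirms fixation, for instance by a button press.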
Number of targets and the mathematics of calibration
Akkil et al., (2014) reported for the Tobii T60 that calibrating with 9 points results in better accuracy compared to using 5 or 2 points, with a difference of about 0.2∘ between the 9-point and the 2-point calibrations.
In a number of video-based eye trackers (most SMIs, all EyeLinks, and many Tobiis, for instance the T60), the calibration involves finding a best fit between the sensor values (P and CR positions in the eye camera, for instance) and the spatial positions of calibration points. The exact polynomials used in these equations vary by manufacturer, but also by the number of calibration points. Thus, it is important to realise that the choice of a specific number of calibration points in the eye-tracker manufacturer software is also a choice of a specific set of equations used for the calibration procedure. Each set of polynomial equations may result in different accuracy values for the same eye movement data (Blignaut and Wium, 2013; Blignaut, 2014; Cerrolaza et al., 2012).
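The best-fit step can be sketched as a least-squares regression from sensor values to screen coordinates. The second-order polynomial below is one common textbook choice and is an assumption made for illustration; it is not the proprietary set of equations used by any of the manufacturers named above. All function names are hypothetical, and the fit is written with the standard library only (normal equations solved by Gaussian elimination).

```python
def poly_terms(x, y):
    """Illustrative second-order terms of a P-CR vector (x, y)."""
    return [1.0, x, y, x * y, x * x, y * y]


def solve_linear(a, b):
    """Solve a.w = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    w = [0.0] * n
    for r in range(n - 1, -1, -1):
        known = sum(m[r][c] * w[c] for c in range(r + 1, n))
        w[r] = (m[r][n] - known) / m[r][r]
    return w


def fit_axis(features, targets):
    """Least-squares weights for one screen axis via the normal equations."""
    rows = [poly_terms(x, y) for x, y in features]
    k = len(rows[0])
    ata = [[sum(r[i] * r[j] for r in rows) for j in range(k)] for i in range(k)]
    atb = [sum(r[i] * t for r, t in zip(rows, targets)) for i in range(k)]
    return solve_linear(ata, atb)


def fit_calibration(features, screen_points):
    """Fit separate polynomials for horizontal and vertical screen position."""
    weights_x = fit_axis(features, [p[0] for p in screen_points])
    weights_y = fit_axis(features, [p[1] for p in screen_points])
    return weights_x, weights_y


def apply_calibration(model, feature):
    """Map one sensor value (x, y) to an estimated screen position."""
    weights_x, weights_y = model
    terms = poly_terms(*feature)
    return (sum(w * t for w, t in zip(weights_x, terms)),
            sum(w * t for w, t in zip(weights_y, terms)))
```

With six terms per axis, at least six well-spread calibration points are needed for a unique solution, which illustrates why choosing the number of calibration points also constrains which polynomial can be fitted.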
Modelling the 3D shape of the eyeball is possible when multiple cameras and/or multiple corneal reflections are employed. Theoretically, the minimum number of calibration points is one, and this point is needed to measure the difference between optical and visual axes (Guestrin & Eizenman, 2006; Hansen & Ji, 2010). Recently, some manufacturers have developed calibration methods that model the eyeball more extensively. In particular, the curvature of the cornea is an important part in these calibration models, which have been used in eye trackers such as the SMI glasses, many Tobii eye trackers (US Patent US7,572,008), and in the open-source eye tracker by Barsingerhorn et al., (2018).
Calibration software is not supplied with every eye tracker. For instance, the DPI eye trackers require the researcher to employ custom-built calibration algorithms to establish the mapping between sensor values and points on the monitor. Holmqvist (2015) used a RANSAC fit (Fischler and Bolles, 1981) followed by a linear shift to calibrate the DPI.
Using the calibration of another participant
There are also examples of researchers calibrating their eye tracker on a person other than their actual participant, when the actual participant is difficult to calibrate. For example, Kulke (2015) calibrated on adults, and then recorded infants by reusing that adult calibration, arguing that this procedure improved data quality compared to calibrating for infants. Indeed, Harrar et al., (2018) present data showing that this practice does not introduce non-linearities (variations in accuracy over space), but also find that calibrating on one person and recording on another led to a poorer accuracy by 2–4∘. Similarly, researchers recording with artificial eyes also calibrate on themselves before recording with the artificial eye. Holmqvist and Blignaut (2020) show that no noticeable non-linearities appear in the data when using the human calibration for a subsequent recording with artificial eyes, but also note that accuracy is likely to be poorer.
Validation of the calibration
Present eye-tracker vendor software almost always reports accuracy after each calibration, recorded on validation points immediately after the calibration sequence. If the accuracy is not sufficient after the first calibration, commercial recording software may allow the operator to recalibrate several times, and select the calibration with the best accuracy in the validation test.
Although it is rarely done, a poor accuracy after calibration can also be improved using a post-calibration correction. This procedure involves a second round of looking at points. For instance, Blignaut et al., (2014) used a regression model to improve accuracy by 0.3–0.6∘. Correction can also be made by letting the participant manually guide an online, calibrated, gaze-contingent visualisation of raw gaze samples to fall exactly in line of his/her gaze (Poletti and Rucci, 2016), i.e. until these samples are projected onto the centre of the fovea, and then push a recalibration button, which in their study improved the already very accurate DPI by a factor of 2.
Drift, and methods for drift correction
Accuracy that worsens over time is often called drift (not to be confused with oculomotor drift), irrespective of its source: small body adjustments, head-mount slippage, changes in pupil size, or some change in the hardware or software setup. Head-mount slippage could be the reason that the SMI EyeLink I and the SR Research EyeLink II were known to be so drift-prone that most researchers used to adjust their calibration, via a one-point drift correction, once before each trial (e.g. Greene & Rayner, 2001). Although drift refers to accuracy, other measures may also be affected by long recordings. For instance, Hessels et al., (2015) and Wass (2014) report a decline in precision from an early trial to a later one.
It is not known how much drift there is in current eye trackers, which are often sold as “drift free” (S. R. Research, 2017, p. 24), but a certain drift still exists in some instruments. Nyström et al., (2013) report a 0.2∘ drift during a 15-min reading task with the SMI HiSpeed 1250, and Choe et al., (2016, Figure 2) show drift due to the pupil-size artefact. Ko et al., (2016) found that the DPI and coils recording artificial eyes drift by around 0.03’ per minute. Drift happens not only in long recordings, but also in cases where the recording does not immediately follow calibration: Chatelain et al., (2020) found that when recording participants on the Tobii 4C in sessions over one month with no recalibrations, accuracy degraded by 0.30∘ + 0.13∘/month, i.e. the initial drop in accuracy is the largest.
Drift correction procedures involve re-calibrating with a single point, shifting all subsequent data by the measured offset. Later EyeLink models offer drift checks in which the offset between gaze cursor and target is assessed, and the experimenter can optionally make a linear shift of estimated gaze. In infant research, Constantino et al., (2017) implemented automatic drift correction on the fly, using an appearing fixation target and a criterion on accuracy. Jones et al., (2014) instead used a happy face and a probability calculation that decided whether the infant had fixated on the face, even if the eye tracker records the contrary, in which case an automatic drift correction was made. The threshold for when to perform drift correction may impose a maximum allowed accuracy. However, this is not the same as the empirically determined accuracy, and there is no guarantee that a central drift correction will improve accuracy in more peripheral points. When the user has a visible gaze cursor, as with users of gaze-controlled computers, Graupner and Pannasch (2014) show that they can learn to take advantage of the visible cursor as a cue to understand variations in accuracy over space, and choose to recalibrate when it is needed for the functionality they want.
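The one-point procedure described above amounts to a constant shift of the gaze data. The sketch below is a minimal illustration under that assumption; the function names and the example coordinates (in degrees) are hypothetical, and vendor implementations may differ.

```python
def drift_offset(measured_gaze, target):
    """Offset that moves the measured gaze onto the known target position."""
    return (target[0] - measured_gaze[0], target[1] - measured_gaze[1])


def apply_drift_correction(samples, offset):
    """Shift every subsequent (x, y) gaze sample by the same offset."""
    dx, dy = offset
    return [(x + dx, y + dy) for x, y in samples]


# The participant fixates a target at (10.0, 0.0) deg,
# but the tracker reports (10.5, -0.3) deg.
offset = drift_offset(measured_gaze=(10.5, -0.3), target=(10.0, 0.0))
corrected = apply_drift_correction([(0.0, 0.0), (1.0, 1.0)], offset)
```

Because the same shift is applied everywhere, an offset measured at a central target corrects peripheral positions only to the extent that the error there happens to be the same.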
If accuracy is found to be poor after the recordings are completed, while inspecting the data as scanpath plots, the EyeLink Data Viewer by SR Research allows the possibility of ‘performing drift correction on fixations’ by simply grabbing any fixation or group of fixations and pulling it to a new position. A simple test reveals that saccade amplitudes and velocities also change during these data editing operations, not only the fixation positions themselves (Data Viewer 3.1.97). The Data Viewer manual states that when batch-moving fixations like this, a movement of more than 30 pixels is not acceptable; however, for those users who want to move fixations more than this, the 30-pixel setting can easily be changed. Later, SMI also started offering this feature in the BeGaze software, and it is also possible in OGAMA (ogama.net). Note that the researcher has to be very careful not to move fixations in favour of a hypothesis to avoid subsequently arriving at faulty conclusions.
This practice is mostly relevant for text reading, in particular when participants read more than one line of text. Cohen (2013, p. 677) comments on practices in reading research that “Fixations are typically corrected manually, sometimes within a program such as EyeDoctor” (https://blogs.umass.edu/eyelab/software/, accessed 10-03-2021). Alternative software solutions for re-aligning inaccurate gaze data to lines of text are offered by Cohen (2013), Hyrskykari (2006) and Špakov et al., (2019).
Dragging fixations in place has also been applied in infant research (Frank et al., 2012; Kooiker et al., 2016). Manual post hoc calibration was commonplace in nystagmus research in the past, and tended to be based on finding the fixation periods of the nystagmus waveform and using those gaze locations for the re-alignment (Dell’Osso, 2005).
Recording from the participant’s dominant eye results on average in 0.2∘ better accuracy and also better precision (Holmqvist, 2015; Nyström et al., 2013), as compared to recordings from the non-dominant eye. This difference in data quality between the dominant and non-dominant eye leads to one consideration when calibrating for binocular recordings: whether to calibrate both eyes simultaneously or to instead calibrate the two eyes separately, patching one while calibrating the other. Calibrating both eyes at once, binocularly, may give an erroneous (absolute) disparity value because the calibration procedure assumes that both eyes are directed towards the calibration point, when in fact one eye may be slightly off. Nuthmann and Kliegl (2009) nevertheless calibrate for both eyes simultaneously, arguing that they can still correctly measure relative changes in disparity. Švede et al., (2015) and Liversedge et al., (2006) recommended a separate monocular calibration for each eye when using binocular recordings, for investigating the absolute disparity between the two lines of gaze. This should be done by covering one eye, calibrating the other, and then switching.
Calibration of special populations
Researchers working with participant populations other than young adults, such as infants or animals, will likely be faced with additional challenges during calibration. This may be due, for example, to these participants not being able to respond to verbal instructions. While some animals can to a degree be trained to remain still and to look at the desired calibration target (Park et al., 2020), infants and some monkeys can be nudged to look at the desired point by using contracting and dilating images, or by using transient appearances of calibration targets on screen (e.g. Hessels et al., 2015; Jones et al., 2014).
Patients with age-related macular degeneration have difficulty foveating calibration targets (because they have no or reduced foveal vision). Harrar et al., (2018, p. 9) suggest using the calibration of another person and found that accuracy degrades by 2–4∘ with this method, but that it does not introduce non-linearities.
Calibrating an eye tracker for participants with an unstable gaze, such as nystagmus or continuous square wave jerks, presents the problem that as they look at a calibration point their eyes will not be still. For these participant groups, researchers have developed dedicated calibration routines specific to the particular oculomotor condition (Dunn et al., 2019; Rosengren et al., 2020). Note that not all eye trackers allow for these calibration routines, e.g. when a standard calibration procedure has to be performed before a recording can commence. Eye trackers that can record without explicit calibration include the DPI and scleral coils (Holmqvist and Andersson, 2017, pp. 214–217) and some P–CR eye trackers.
Features of the experiment
Here, we address only those aspects of experimental design that may be specifically relevant or problematic in the context of eye-tracking research such as the operator skill level, eye-movement measures used as dependent variables, the number of trials and experiment duration.
Operator skill level
By operator we mean the person (researcher or research assistant) who records data from the participant. Nyström et al., (2013) report an advantage of 0.2∘ in the accuracy recorded by experienced operators compared to inexperienced ones. Hessels and Hooge (2019) report that experienced operators tend to succeed in calibrating difficult participants where inexperienced operators give up, and point out that training of operators could have a beneficial effect on data quality.
The instruction to participants
Task instructions have a strong influence on eye movement behaviour, as elegantly shown by Buswell (1935, p. 136) and Yarbus (1967, p. 174). The instruction to the participants is part of the experimental design, and can be used actively to drive participant behaviour. However, small differences in wording may have unexpected effects, and the exact instruction may need to be verified during piloting. For instance, asking participants to “fixate” rather than “hold the eyes still” reduces the rate of microsaccades (Poletti & Rucci, 2016), and Enright and Hendriks (1994) found that “staring” differs from “scrutinizing”, in that the latter involves a larger net muscular force exerted on the eye from the opposing rectus muscles, pulling the eyeball backward in its socket.
Trial durations and trial-by-trial effects
Besides the fact that data quality seems to be worse after longer periods of time (Section “Calibration and accuracy”), the duration of trials and experiments is relevant also for other reasons. For instance, during scene viewing, fixations tend to be shorter and saccade amplitudes larger during the first second or two of a trial. This can be interpreted as an initial overview/ambient scan followed by detailed/focal inspection, shown by Tatler and Vincent (2008), Unema et al., (2005) and Buswell (1935) for free-viewing, by Scinto et al., (1986) for visual search and by Over et al., (2007) for visual search and free viewing. This would imply that when trials vary in duration, mean fixation duration for long-lasting trials may be longer than mean fixation duration for short trials, irrespective of other factors. Also, when trials are short and mean fixation durations are compared for short sequences of saccades, one should consider excluding initial fixation durations, because initial fixation durations are longer than subsequent fixation durations (Hooge and Erkelens, 1996; Zingale & Kowler, 1987). This also holds for infant participants (Hessels et al., 2016).
A technical trial-by-trial effect is that the duration of the initial fixation of a sequence of fixations may not reflect the whole duration of that initial fixation, because it started before the trial started, and was cut in two by the change of trial. In the visual-cognition literature, when analysing fixation durations, the first and last fixations are typically discarded (e.g. Nuthmann, 2013).
Tatler and Hutton (2007) found trial-by-trial effects in the antisaccade task: Both the error rates and latencies increased on trials following a trial with an erroneous anti-saccade. Switching from making an antisaccade in one trial to making a prosaccade in the next trial involves a cost in increased saccade latency of the prosaccade (Tari et al., 2019). Similarly, a saccade to a location that was fixated at the end of the previous trial may be preceded by a prolonged fixation (Carpenter, 2001), and may affect latencies and fixation durations in the current trial.
Eye-movement measures as dependent variables
In some research fields the choice of the appropriate eye-movement measures, and the range of task parameters, for the study at hand is either straightforward or very well established. This is for instance the case in reading research (Clifton et al., 2007), and for studies employing the anti-saccade paradigm (Antoniades et al., 2013).
In some applied research fields, measure selection is anything but obvious and the terminology of measures is confusing (e.g. Sharafi et al., 2015). A line of publications may get accustomed to a choice of measures that later turns out to be unfortunate. See for instance Šmideková et al., (2020) for a discussion of the selection of measures for research in classroom management.
Naming of events is also variable. What some know as saccade latency (Holmqvist and Andersson, 2017, p. 580) is sometimes termed saccade reaction time or calculated as time to first fixation (Tatham et al., 2020). Fixation duration is sometimes called ‘fixation time’, but also ‘dwell time’, or ‘dwell time of the fixation’. Oster and Stern (1980) used the terms saccadic reaction time and intersaccadic interval for fixation duration. The original term was ‘pause time’ (Erdmann & Dodge, 1898), and the term ‘pause duration’ was used long into the 1940s.
Terminology for the dwell time measure also varies. In some parts of human factors research, the dwell time measure is called ‘glance duration’ (Horrey & Wickens, 2007), while Loftus and Mackworth (1978) used the term ‘duration of the first fixation’ for the first dwell time in an AOI. Terms like ‘observation’ and ‘visit’ can also be found. In reading and some parts of scene perception research, dwell time is often called ‘gaze duration’ or ‘regional gaze duration’, and ‘first-pass fixation time’ when the AOI consists of two words (Clifton et al., 2007).
Signal properties and processing
In this section, we discuss the properties and processing of the stream of data from the eye tracker, such as gaze position signals, time stamps, pupil-size signal, and more.
Sampling frequency (also temporal resolution) is the number of measurements per second. The sampling frequency of modern video-based eye trackers ranges from 30 to over 2000Hz. Some eye trackers, like the DPI, scleral search coils and some other analogue systems have no sampling frequency. Instead, their analogue signals may be digitized to any desirable frequency up to at least 10000Hz (Collewijn, 2001). Collewijn (2001) remarked that “The choice of 10000Hz followed from the general rule that the (temporal) resolution of a measurement should preferably be an order of magnitude better than the expected effect.” (p. 3417). For video-based eye trackers, the video camera and its settings determine the sampling frequency.
Sampling frequency is one of the most highlighted properties of modern eye trackers, often being either a part of, or mentioned directly in connection to the model name. The competition for higher sampling frequencies has made some manufacturers of video-based eye-tracking systems with multiple cameras interleave image acquisition to achieve higher effective sampling rates. For instance, the Tobii Glasses 2 have two cameras per eye, each sampling the eye at 50Hz. This system is made into a 100Hz eye tracker by alternately sampling each camera. However, the alternating samples are offset in the resulting data, yielding a zigzag pattern that is very common in 100Hz data from Tobii Glasses but does not happen in 50Hz data (see Figure 11 in Niehorster et al., 2020b). The EyeFollower from LC Technologies uses two 60Hz cameras, one per eye, to achieve a net gaze sampling rate of 120Hz by alternatingly sampling the right and left eyes.
In theory, high sampling rates when combined with low velocity noise would allow for very precise determination of velocity and acceleration, and therefore facilitate more precise determination of on- and offset of fixation, saccades and other events. This would obviate the need for filtering and for averaging metrics such as saccade latency / fixation duration over large numbers of trials, which are difficult to record with patients and other groups that only provide small samples.
In practice, however, the many different eye trackers exhibit a large variation of both sampling frequencies and precision levels. Research on the relation between eye-tracking measures and sampling frequency shows that some outcome measures (e.g. fixation durations) are less sensitive to sampling frequency, whereas others (saccadic peak velocity) are more so.
For instance, Andersson et al., (2010) quantified the effect of sampling frequency on event durations, such as fixation durations, in a series of simulations and tests on human eye-movement data. They also provided estimates of the number of measurements that are required to average out the mis-estimations of the on- and offset of fixations due to a low sampling frequency.
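The principle behind such estimates can be illustrated with a toy simulation, which is not the procedure of Andersson et al., (2010): a fixation of known duration starts at a random phase relative to the sample clock, and its measured duration is the number of samples that fall inside it multiplied by the sample interval. All names and parameter values below are assumptions for the example.

```python
import math
import random


def measured_duration(true_ms, fs_hz, rng):
    """Duration recovered from samples when onset phase is random."""
    dt = 1000.0 / fs_hz                 # sample interval in ms
    onset = rng.uniform(0.0, dt)        # onset relative to the sample clock
    first = math.ceil(onset / dt)       # index of first sample inside the fixation
    last = math.ceil((onset + true_ms) / dt)  # index of first sample after it
    return (last - first) * dt


def mean_measured(true_ms, fs_hz, n_fixations, seed=1):
    """Average measured duration over many simulated fixations."""
    rng = random.Random(seed)
    values = [measured_duration(true_ms, fs_hz, rng) for _ in range(n_fixations)]
    return sum(values) / len(values)


single = measured_duration(250.0, 50.0, random.Random(0))   # one fixation at 50 Hz
average = mean_measured(250.0, 50.0, n_fixations=20000)     # many fixations at 50 Hz
print(single, round(average, 2))
```

In this simplified model a single measurement can be off by up to one sample interval (20 ms at 50Hz), whereas the mean over many fixations converges on the true duration, which is why low sampling frequencies can be compensated for by more data when only mean durations are of interest.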
Saccadic peak velocity measures are more dependent on sampling frequency, but exactly how much more is a matter of debate. Wierts et al., (2008) showed that although a 50Hz eye tracker cannot provide accurate saccadic peak acceleration/deceleration values, it can be used to accurately measure peak velocities without aliasing if saccades are at least 5∘. Inchingolo and Spanio (1985) used a 200Hz EOG system and found that saccade duration and velocity values in that data were comparable to those obtained in data of a 1000Hz system, as long as the saccades were larger than 5∘ in size. However, using EOG- and photoelectric eye-tracking systems to study 20∘ saccades, Juhola et al., (1985) provided evidence that sampling frequency should preferably be higher than 300Hz in order to reliably calculate the peak saccade velocity. Mack et al., (2017) replicate the finding that the peak saccade velocity estimation is more inaccurate for lower sampling frequencies. Unfortunately, these somewhat contradictory results are made more difficult to interpret because of differences in the precision of the eye trackers, how velocity is calculated, and whether filters were involved in the velocity calculation. The observations that both DPI and P–CR technologies misestimate saccade velocity (e.g. Hooge et al., 2016) add complication to the interpretation of these studies.
Temporal precision is the variation in the inter-sample durations. A perfect temporal precision means that samples always arrive after exactly the same time interval. However, when temporal precision is poor, there could sometimes be, for instance, 33ms between samples, and other times 43ms (actual intervals found in data from an EyeTribe, Holmqvist and Andersson, 2017, p. 193). This is indicative of an unstable sampling frequency, the explanation for which could be in small head movements, the camera type and transfer protocols as well as image processing. Examples of eye trackers with unstable sampling frequencies include the EyeTribe (Ooms et al., 2015), the Pupil Labs 240Hz (Ehinger et al., 2019), the Tobii 1750 (Shukla et al., 2011), and the SMI REDm 60/120, and the SMI RED 250 (Hessels et al., 2015). Some implementations of algorithms for filtering, velocity and acceleration calculation, as well as event detectors, may assume a stable sampling frequency, and may thus not be suitable for data with unstable sampling frequencies.
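A quick check for an unstable sampling frequency is to inspect the inter-sample intervals computed from the recorded timestamps. The sketch below summarises them with their mean, standard deviation and range; using the standard deviation as the summary of temporal precision is an assumption made here, and the timestamps are made up to mirror the 33ms and 43ms intervals mentioned above.

```python
import statistics


def inter_sample_intervals(timestamps_ms):
    """Differences between successive sample timestamps."""
    return [b - a for a, b in zip(timestamps_ms, timestamps_ms[1:])]


def temporal_precision(timestamps_ms):
    """Summary statistics of the inter-sample intervals (in ms)."""
    isi = inter_sample_intervals(timestamps_ms)
    return {
        "mean_ms": statistics.mean(isi),
        "sd_ms": statistics.pstdev(isi),   # 0 for a perfectly stable clock
        "min_ms": min(isi),
        "max_ms": max(isi),
    }


# Hypothetical timestamps alternating between 33 ms and 43 ms intervals.
summary = temporal_precision([0, 33, 76, 109, 152])
print(summary)
```

A non-zero standard deviation, or a wide gap between the minimum and maximum interval, suggests that algorithms assuming a fixed sample interval should be applied with care.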
Precision ranges reported in the publications of Table 2 vary between eye trackers by a factor of 100 or more (median RMS-S2S deviation 0.001–0.75∘). Precision ranges vary little with calibration, and can be calculated from participants (and artificial eyes) without their cooperation. Precision calculations can be made in many different ways (Niehorster et al., 2020c). The resulting precision values change when filtering the gaze signal with the built-in manufacturer filters (Niehorster et al., 2021).
Precision recorded with human eyes is often worse (i.e. higher RMS-S2S deviation) than precision recorded with artificial eyes (Holmqvist et al., 2021; Niehorster et al., 2020c), but different artificial eyes may also result in different precision levels.
Niehorster et al., (2020c) investigated how four different precision measures correlate, depend on sampling frequency and express different properties of the signal. In particular, RMS-S2S deviation reflects the noise velocity in the signal, while STD (standard deviation) and BCEA of the gaze signal (bivariate contour ellipse area, Steinman, 1965; Crossland and Rubin, 2002) are measures of the dispersion of gaze samples. The slope α of the power spectrum density instead measures the colour of the noise, as does RMS-S2S divided by STD (for the same gaze data).
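For readers who want to compute these measures on their own data, the sketch below uses commonly cited definitions: RMS-S2S as the root mean square of sample-to-sample distances, STD as the square root of the summed horizontal and vertical variances, and BCEA as 2kπσxσy√(1−ρ²) with P = 1 − e^(−k). The function names, the default P = 0.68 and the toy data are assumptions for this example, and the slope of the power spectrum is left out for brevity.

```python
import math
import statistics


def rms_s2s(x, y):
    """Root mean square of successive sample-to-sample distances."""
    sq = [(x2 - x1) ** 2 + (y2 - y1) ** 2
          for x1, x2, y1, y2 in zip(x, x[1:], y, y[1:])]
    return math.sqrt(sum(sq) / len(sq))


def std_2d(x, y):
    """Dispersion: square root of summed horizontal and vertical variance."""
    return math.sqrt(statistics.pvariance(x) + statistics.pvariance(y))


def bcea(x, y, p=0.68):
    """Bivariate contour ellipse area covering a proportion p of samples."""
    k = -math.log(1.0 - p)
    sx, sy = statistics.pstdev(x), statistics.pstdev(y)
    mx, my = statistics.fmean(x), statistics.fmean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y)) / len(x)
    rho = cov / (sx * sy)
    return 2.0 * k * math.pi * sx * sy * math.sqrt(max(0.0, 1.0 - rho ** 2))


def colour_ratio(x, y):
    """RMS-S2S divided by STD, computed on the same gaze data."""
    return rms_s2s(x, y) / std_2d(x, y)


# Toy gaze samples (deg) from one fixation on a stationary target.
gx = [0.0, 1.0, 0.0, 1.0]
gy = [0.0, 0.0, 1.0, 1.0]
print(rms_s2s(gx, gy), std_2d(gx, gy), bcea(gx, gy), colour_ratio(gx, gy))
```

For uncorrelated (white) noise the expected squared sample-to-sample difference is twice the variance, so the ratio tends towards √2; smoother, more coloured noise yields lower values.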
Together, these four measures allow for a more complete characterization of the precision in gaze data from an eye tracker. Niehorster et al., (2020c) provide code to generate noise based on this characterization. Adding synthetic noise to data is a method to test event detectors, and can also be used to provide identification privacy in future consumer products with inbuilt eye-tracking systems (Liu et al., 2019).
The most common way to reduce (improve) precision values is to employ a filter. McConkie (1981) proposes that all filters should be reported. Filtering of the resulting data stream compensates for noise generated earlier at the level of sensors, light, fans and more. However, filtering affects various characteristics of the signal differently, and using the four different measures above allows researchers to investigate whether filters are present (Niehorster et al., 2021).
Ko et al., (2016) remarked that an optimal filter should be based on (a) a characterization of the noise level and (b) the component of eye movements one is interested in examining. Most other design criteria of filters seem to be guided by heuristics, or ‘rules of thumb’, motivated by visual inspection of the data (e.g. Stampe, 1993). Notice that pattern matching filters, such as those described by Stampe (1993, p. 138, known as the heuristic filter in EyeLink and SMI trackers) and Duchowski (2007) amplify parts of the gaze signal with a similar appearance to the filter pattern, while attenuating other portions. Špakov (2012) compared several noise filters, and revealed that finite-impulse response filters with triangular or Gaussian kernel (weighting) functions, and parameters dependent on signal state, show the best performance, as judged by a comparison to idealised saccade models using multiple criteria.
Derivatives of the gaze position signals are used by both researchers and event detection algorithms. Numerical differentiation of a signal however amplifies high frequency content (which is usually noise) in the signal. Specific filters are therefore often used to counteract the increased high frequency noise resulting from differentiation. The most detailed investigations of these filters were conducted by Inchingolo and Spanio (1985) and Larsson (2010), who showed how saccade parameters (e.g. duration and peak velocity) were affected by the type of differentiation filter and peak velocity threshold in the event detector. Larsson (2010) concluded that the Savitzky–Golay filter used by Nyström and Holmqvist (2010) and the differential filter used by Engbert and Kliegl (2003) produced eye movement velocity and acceleration most like those found in literature. Unlike the pattern-matching filters, these two filters make no strong assumptions on the overall shape of the velocity curve.
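To make the difference between plain and smoothed differentiation concrete, the sketch below contrasts a two-point difference with the five-sample moving-window velocity commonly attributed to Engbert and Kliegl (2003), v_n = (x_{n+2} + x_{n+1} − x_{n−1} − x_{n−2}) / (6Δt). The function names and the test signal are assumptions for the example; published implementations may treat the edges of the signal differently.

```python
def two_point_velocity(x, fs_hz):
    """Plain sample-to-sample difference, in position units per second."""
    return [(b - a) * fs_hz for a, b in zip(x, x[1:])]


def moving_window_velocity(x, fs_hz):
    """Five-sample moving-window velocity for samples 2 .. len(x) - 3."""
    dt = 1.0 / fs_hz
    return [
        (x[n + 2] + x[n + 1] - x[n - 1] - x[n - 2]) / (6.0 * dt)
        for n in range(2, len(x) - 2)
    ]


# A 10 deg/s ramp sampled at 1000 Hz, with +/-0.01 deg alternating noise added.
fs = 1000.0
clean = [0.01 * i for i in range(20)]
noisy = [v + (0.01 if i % 2 else -0.01) for i, v in enumerate(clean)]

v_plain = two_point_velocity(noisy, fs)
v_smooth = moving_window_velocity(noisy, fs)
print(max(abs(v - 10.0) for v in v_plain), max(abs(v - 10.0) for v in v_smooth))
```

On this toy signal the two-point difference turns the small alternating noise into velocity errors of about 20 deg/s, while the moving-window estimate stays within a few deg/s of the true 10 deg/s, at the cost of some temporal smoothing.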
Data loss and interpolation
Several studies have shown that average data loss differs between eye trackers. Holmqvist (2015) reports that the video-based eye trackers SMI HiSpeed 1250 and the EyeLink 1000 had the lowest data loss with around 3% of the raw data samples lost on average, while the Tobii T60 XL and the TX300 lost 15% or more. Nevalainen and Sajaniemi (2004) report 3.0–8.7% data loss for the Tobii 1750 and two ASL trackers, while Funke et al., (2016) found 22% in EyeTribe and 24% data loss in Tobii EyeX. For reference, around 2% of the data are lost due to blinks (Holmqvist and Andersson, 2017, p. 167). In contrast to the values reported for the TX300 by Holmqvist & Andersson (2017, p. 167), Hessels et al., (2015, Figure 6) reported less than 3% data loss for the TX300 for upright head orientations, and Hessels & Hooge (2019, Figure 9) reported less than 10% data loss for 9-year-old children measured with the TX300. There is thus a large range in the reported data loss values for each eye-tracker model. This suggests that not only the eye-tracker hardware itself plays a role, but also operator experience, participant groups, lighting conditions, stimuli and experimental procedures, and laboratory protocols. This should be taken into account when interpreting data loss values reported in the literature.
Furthermore, Castner et al., (2020) reported that data loss values produced by manufacturer software are not always reliable. They found that for a participant with a reported tracking ratio of 98% (a data loss of 2%), an additional large gap in the left eye gaze signal (approximately 3.5s out of a 90s recording) appeared as data loss, but was labelled as a blink.
Fixation points positioned in the corner of the monitor, as well as recording participants with downward-pointing eyelashes and large head movements, tend to result in higher data loss (Hessels et al., 2015; Holmqvist et al., 2011; Niehorster et al., 2018), though the operator might have a significant influence as well (Hessels & Hooge, 2019).
Data loss may affect the output of event detection, if the event detector terminates fixations and other events whenever a period of data loss is encountered. Holmqvist et al., (2012) added increasing amounts of data loss (as short segments) into data with no data loss, and found that 18% data loss reduces the number of fixations by about one quarter, and increases their average duration by around 50ms, when using the Nyström and Holmqvist (2010) algorithm. Hessels et al., (2017) found that adding periods of data loss to eye-tracking data affected the number of fixations and corresponding fixation durations for different event detection algorithms strongly and idiosyncratically.
Some algorithms merge fixations close in time and space where there are small bursts of data loss (Komogortsev et al., 2010; Wass et al., 2013; Zemblys et al., 2018), reducing some of the effect of periods of data loss. The solution to gaps in data in the Tobii Pro Lab software is to allow users to fill the gaps of data loss using a linear interpolation with synthetic data. This interpolation is selected in the event detection dialog menu in the Tobii software. The I2MC algorithm (Hessels et al., 2017) also employs interpolation of gaps up to a certain duration, but instead uses a non-linear Steffen interpolation (Steffen, 1990).
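The general idea of gap filling can be sketched as follows, using linear interpolation and a maximum gap length. This is a minimal illustration, not the implementation in Tobii Pro Lab or in I2MC (which uses Steffen interpolation); the function name, the use of None for lost samples and the example values are assumptions.

```python
def interpolate_gaps(signal, max_gap_samples):
    """Linearly fill runs of None no longer than max_gap_samples.

    Gaps at the start or end of the signal, and gaps longer than the
    limit, are left untouched.
    """
    out = list(signal)
    n = len(out)
    i = 0
    while i < n:
        if out[i] is not None:
            i += 1
            continue
        start = i
        while i < n and out[i] is None:
            i += 1
        end = i                      # index of first valid sample after the gap
        gap_len = end - start
        if start == 0 or end == n or gap_len > max_gap_samples:
            continue                 # no valid neighbour on one side, or gap too long
        left, right = out[start - 1], out[end]
        for j in range(start, end):
            frac = (j - start + 1) / (gap_len + 1)
            out[j] = left + frac * (right - left)
    return out


# Horizontal gaze (deg) with a short gap, a long gap and a trailing gap.
raw = [1.0, None, None, 4.0, None, None, None, None, 9.0, None]
filled = interpolate_gaps(raw, max_gap_samples=2)
print(filled)
```

Only the two-sample gap is filled here; the four-sample gap and the trailing gap stay marked as lost, so that an event detector can still treat them as data loss.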
Latency, gaze contingency
Latency (also known as temporal accuracy and end-to-end delay, e.g. Reingold, 2014) is often defined as the average end-to-end delay from the time of an actual movement of the tracked eye until the recording computer signals the eye movement. Theoretically, there is always a latency of a few milliseconds, and in the optimal case, it is constant. Any processes run by the computers involved in the data recording may add to this basic latency.
A known constant latency is uncritical for most research (except closed-loop, gaze-contingent experiments). A variable latency, which translates to high temporal imprecision, is much more critical, as it cannot be easily compensated for, particularly if the eye tracker does not provide reliable timestamps.
A large and variable latency is somewhat tricky to detect, measure, and prevent, and may come as an unpleasant surprise long after data were recorded. McConkie (1997) looked back at the foundational work on reading using gaze-contingency (McConkie and Rayner, 1975), and remarked that they were unaware of a filter in the eye-tracker circuitry that increased the latency by 25ms between the eye movement and the registered signal, potentially undermining their conclusions.
Table 3 lists existing measurements of eye-tracker latencies. Measurement type 1 concerns the time from when an eye movement is made until the output gaze coordinates change, while measurement types 2–5 include the time needed to update the monitor.
Gaze-contingent paradigms and latencies
Whether a gaze-contingent paradigm – for instance, boundary and moving window paradigms (Hohenstein & Kliegl, 2014; McConkie & Rayner, 1975; Nuthmann, 2013) or saccadic adaptation paradigms (McLaughlin, 1967; Pélisson et al., 2010) – can be run without exceeding the maximum allowed latency depends on how quickly a gaze coordinate can be fed back to the stimulus program so that the stimulus monitor can be changed without the participant realising (facilitated by saccadic suppression, Campbell & Wurtz, 1978; Holt, 1903). Loschky and Wolverton (2007) reported that it is enough to update the stimulus image within 60ms after the onset of the eye movement. However, Slattery et al., (2011) point out that the position of gaze during the display change has an effect on fixation durations (for the next word after the boundary) that can be seen already at 15–25ms delay of the signal. This behavioural change indicates detection of the manipulation, and the delay can be compared to the measured latencies in Table 3. Note that a single detection may be enough to affect behaviour, which means that maximum latency, rather than the mean, would be the most relevant comparison.
Saccade latency measurements versus system latencies
In other cases, researchers are concerned whether their eye-movement recording was properly synchronized to stimulus onsets on their displays. Improper synchronization would for instance affect eye latency measures, such as saccadic latencies. One method to check this has been to compare the eye video to the file of the raw data stream or gaze scanpath (Morgante et al., 2012). This, however, has the drawback that both data streams are generated by the same software, and could be affected by the same latencies. Also, the video is usually of a low temporal resolution in comparison to the eye-tracking data, which limits detection of synchronization issues to the temporal resolution of the video recording. As an alternative method of measuring synchronisation, Shukla et al. (2011) used a mirror positioned next to the participant and a 300Hz high-speed camera, which made a recording of the participant’s eye and, through the mirror, the monitor where the stimuli appeared and disappeared. Results revealed a variable latency with a mean of 27ms on their Tobii 1750, similar to the latencies reported by Leppänen et al. (2015) in a study using the same approach with a low temporal resolution camera and a Tobii TX300, while Morgante et al. (2012) reported latencies of up to 54ms for the Tobii TX60XL.
Fixation and saccade detection
Historically, fixation and saccade detection was conducted manually and was very time-consuming. For instance, Hartridge and Thomson (1948) presented a novel method to process eye movements at a rate of approximately 10000s (almost three hours) of manual work for 1s of recorded data. Decades later, Monty (1975) remarked: “It is not uncommon to spend days processing data that took only minutes to collect” (p. 331–332). Today, software can run a similar analysis in a matter of minutes, even for several hours of recorded data. Potential reasons for still doing manual analysis include that it allows for better general monitoring of data quality as well as participant performance and engagement.
Event detection algorithms (or event classification, see Hessels et al., 2018) are used to process a time series signal (gaze position, pupil size, etc.) into labelled, meaningful units, such as fixations, saccades, blinks, etc. What happens inside the event detection algorithms was considered important enough by McConkie (1981) that he recommended that details about these algorithms should be published in the paper presenting the processed events.
Note that operationalisations for fixations may depend on the frame of reference (i.e. whether the eye tracker is fixed to the world or to the head). A moving observer who fixates a static object in the world produces a gaze point in the world that is stationary with respect to the object, but slowly moving with respect to the head. This point is extensively discussed in Lappi (2016), Holmqvist and Andersson (2017, Chapter 7) and Hessels et al. (2018).
There are many different event detection algorithms available. Here, we describe a select number of them to give an idea of the breadth and scope. The I-DT finds fixations using a spatial threshold on maximum gaze dispersion (typically 0.75–1.5°) and a temporal threshold on minimum fixation duration (typically 50–150ms). What remains is assumed to be saccades. The I-VT instead finds saccades using a minimum peak velocity criterion (such as 20–100°/s), and assumes that everything in between saccades is a fixation. The I-DT and I-VT were described by Salvucci and Goldberg (2000), and later appeared in software from manufacturers. For instance, BeGaze by SMI offers both the I-VT and the I-DT algorithms, whereas Tobii Pro Lab provides a version of the I-VT, and the Data Viewer by SR Research has an I-VT-related saccade detector with both velocity and acceleration thresholds.
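The logic of the two threshold-based approaches can be sketched as follows. This is a simplified illustration under stated assumptions, not the implementation found in any manufacturer software: gaze positions are assumed to be in degrees, timestamps in milliseconds, dispersion is taken as the sum of the horizontal and vertical ranges, and the velocity criterion is applied sample by sample. The function names and default thresholds are hypothetical.

```python
import math

def ivt_labels(x, y, t_ms, velocity_threshold=30.0):
    """Label every sample 'sac' or 'fix' from sample-to-sample velocity (deg/s)."""
    labels = ["fix"]  # the first sample has no velocity; assumed to be fixation
    for i in range(1, len(t_ms)):
        dist = math.hypot(x[i] - x[i - 1], y[i] - y[i - 1])
        velocity = dist / ((t_ms[i] - t_ms[i - 1]) / 1000.0)
        labels.append("sac" if velocity > velocity_threshold else "fix")
    return labels

def dispersion(xs, ys):
    """Dispersion as horizontal range plus vertical range (deg)."""
    return (max(xs) - min(xs)) + (max(ys) - min(ys))

def idt_fixations(x, y, t_ms, max_dispersion=1.0, min_duration_ms=100):
    """Return fixations as (start, end) sample indices, both inclusive."""
    fixations = []
    n = len(t_ms)
    start = 0
    while start < n:
        # Find the smallest window that spans the minimum fixation duration.
        end = start
        while end < n and t_ms[end] - t_ms[start] < min_duration_ms:
            end += 1
        if end >= n:
            break  # not enough data left for another fixation
        if dispersion(x[start:end + 1], y[start:end + 1]) <= max_dispersion:
            # Grow the window until adding one more sample exceeds the threshold.
            while end + 1 < n and dispersion(x[start:end + 2], y[start:end + 2]) <= max_dispersion:
                end += 1
            fixations.append((start, end))
            start = end + 1
        else:
            start += 1  # slide the window forward by one sample
    return fixations

# Toy 100 Hz recording: fixation at 0 deg, a 10-deg saccade, fixation at 10 deg.
x = [0.0] * 20 + [2.5, 5.0, 7.5] + [10.0] * 20
y = [0.0] * len(x)
t_ms = [i * 10 for i in range(len(x))]
labels = ivt_labels(x, y, t_ms)
fixations = idt_fixations(x, y, t_ms)
```

In the toy recording, both approaches separate the two stable periods from the intervening saccade, but they do so from opposite directions: the velocity-based sketch labels the fast samples, whereas the dispersion-based sketch labels the stable windows.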
The NH2010 algorithm by Nyström and Holmqvist (2010) is an improvement of the I-VT algorithm that adapts the peak velocity threshold to the level of noise in the data, and additionally outputs detected post-saccadic oscillations. The I2MC by Hessels et al. (2017) is an algorithm designed to be robust against increasing levels of noise and data loss, common in infant research.
GazeNet by Zemblys et al. (2019) is a fully end-to-end machine learning-based event detector that learns from examples, and detects fixations, saccades, and post-saccadic oscillations with very high resemblance to human expert coders. The Deep eye movement classifier by Startsev et al. (2019) is another recent machine-learning algorithm that also detects periods of smooth pursuit in data.
There also exist dedicated event detection algorithms for data from head-mounted eye trackers, used to describe gaze behaviour during e.g. navigation in real environments (Hessels et al., 2020; Niehorster et al., 2020a). For researchers interested in labelling eye-tracking data from head-mounted eye trackers into smooth pursuit, fixations during head movements, OKN, vergence etc, no automated techniques exist at the moment. However, this is a quickly evolving field, in which relevant work is done on some of the problems it involves (Kothari et al., 2020; Larsson et al., 2014).
Furthermore, there are many other special-purpose event detectors (for instance, blink detectors, microsaccade detectors, algorithms for desaccading smooth pursuit or nystagmus data, and smooth pursuit detectors), summarised by Holmqvist and Andersson (2017, Section 7.4).
Most event detection algorithms are offline, operating on already recorded data. However, for gaze-contingent research, event detection algorithms have to be fast and online, operating in real-time when saccades happen (Holmqvist & Andersson, 2017, pp. 234–235). Such an online algorithm is necessary in the Fixation-Contingent Scene Quality Paradigm (Henderson et al., 2013; Walshe & Nuthmann, 2014). In the boundary paradigm, however, there is just a simple check whether raw data (typically one eye only, see discussion in Nuthmann & Kliegl, 2009, p. 23) have crossed the boundary, assuming such a crossing to mean that a saccade is in progress (see also Slattery et al., 2011).
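The boundary check described above can be sketched in a few lines. The sketch assumes left-to-right reading, horizontal gaze coordinates in pixels from one eye, and `None` for lost samples; the function name `first_crossing_index` is hypothetical, and a real experiment would run the comparison inside the stimulus program's sample loop rather than on a stored list.

```python
def first_crossing_index(samples_x, boundary_x):
    """Return the index of the first valid sample to the right of the boundary.

    No saccade detection is performed: a single raw sample beyond the
    boundary is taken to mean that a saccade is in progress.
    """
    for i, gaze_x in enumerate(samples_x):
        if gaze_x is not None and gaze_x > boundary_x:
            return i
    return None  # the boundary was never crossed

# Toy stream of horizontal gaze samples with one lost sample.
stream = [100, 105, None, 180, 260, 400]
trigger = first_crossing_index(stream, boundary_x=250)
```

Because the check operates on single raw samples, a noisy sample that happens to fall beyond the boundary would also trigger the display change, which is one reason why precision matters in this paradigm.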
The risk that poor precision poses for the detection of small eye movements
Small eye movements may be hidden in the noisy, imprecise parts of data. For instance, Fig. 2A shows how the large saccades are often followed by small saccades which are clearly seen and reasonably easy to detect by algorithms. In Fig. 2B, the big saccades are visible, but the small saccades, if they were made during the recording, have left a trace that is harder to distinguish from noise, for human data inspectors and algorithms alike.
The degree to which outcome measures of event-detection algorithms are sensitive to the noise level has been systematically investigated by Hessels et al. (2017), Holmqvist (2016), and Holmqvist et al. (2012), who all investigated the effect of artificially increasing noise levels (degrading precision) on the outcome of event detectors, and by van Renswoude et al. (2018), who investigated correlations between precision and outcome measures. Effect sizes are large; for instance, using the algorithm by Nyström and Holmqvist (2010), Holmqvist et al. (2012) compared the precision levels 0.03–0.37° and found an increase of average fixation durations from 430ms to 630ms and a reduction of the number of fixations by about one-third, for the same eye-movement data. Hessels et al. (2017) and Holmqvist (2016) report (and illustrate in figures) how for some algorithms, no fixations whatsoever are found when imprecision increases beyond a certain level.
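The general procedure of degrading precision artificially can be sketched as follows. This is an illustration only: it assumes independent Gaussian (white) noise added to each coordinate, which is one of several possible noise models, and it quantifies precision as the root mean square of sample-to-sample distances (RMS-S2S). The function names are hypothetical.

```python
import math
import random

def rms_s2s(xs, ys):
    """Precision as the RMS of distances between successive samples (deg)."""
    squared = [
        (xs[i + 1] - xs[i]) ** 2 + (ys[i + 1] - ys[i]) ** 2
        for i in range(len(xs) - 1)
    ]
    return math.sqrt(sum(squared) / len(squared))

def add_gaussian_noise(xs, ys, sd, seed=0):
    """Return copies of xs and ys with white Gaussian noise of the given SD added."""
    rng = random.Random(seed)  # seeded so that the degradation is reproducible
    noisy_x = [x + rng.gauss(0, sd) for x in xs]
    noisy_y = [y + rng.gauss(0, sd) for y in ys]
    return noisy_x, noisy_y

# A perfectly still signal, then the same signal with noise of SD 0.1 deg.
clean_x, clean_y = [0.0] * 1000, [0.0] * 1000
noisy_x, noisy_y = add_gaussian_noise(clean_x, clean_y, sd=0.1)
precision_clean = rms_s2s(clean_x, clean_y)
precision_noisy = rms_s2s(noisy_x, noisy_y)
```

The degraded signal can then be passed to an event detector at several noise levels, and the resulting number and duration of fixations compared with those from the original data.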
Event detection algorithms have a variety of settings, some examples of which are the minimal peak velocity threshold for saccade detection (I-VT, EyeLink), the minimal fixation duration and the maximum gaze dispersion for fixations (I-DT). Changing the settings of these algorithms can have large effects on measures such as number and duration of fixations and saccades (Blignaut, 2009; Holmqvist, 2016; Manor & Gordon, 2003; Shic et al., 2008). For some experimental designs, in particular between-subjects comparisons, and when comparing between studies, or when conducting replication studies, a change of algorithm settings may have an impact on the rejection of a hypothesis (see for instance, Shic et al., 2008, for a within-subjects design with comparison between different stimulus types).
Settings can be manually adapted based on, for instance, the precision of the data. Holmqvist (2016) and Holmqvist and Andersson (2017, Ch 7) provide practical advice on the relationship between precision and settings and the outcome measures, for two commonly used algorithms: I-DT and I-VT. The larger the saccades are in the task, the higher the thresholds can be. Studies with a focus on small saccades need good precision and low thresholds.
There are also adaptive algorithms that change the thresholds based on the precision in the data (e.g. Braunagel et al., 2016; Engbert & Kliegl, 2003; Hooge & Camps, 2013; Mould et al., 2012; Nyström & Holmqvist, 2010). However, an adaptive algorithm does not solve the problem of variable precision, as it may adapt the parameters to the level of noise, but changed parameters have consequences for the fixation and saccade output by the algorithm. Hessels et al. (2017) developed an algorithm which had the explicit goal to be robust to differences in data quality and enable comparisons across conditions when there are differences in data quality. Note, however, that although noise-resilient algorithms may produce fixations that result in the same average fixation duration from data of varying precision, further investigations are needed to assess the extent to which the individual events (their on- and offsets) change as precision varies.
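How a threshold can be adapted to the noise level may be illustrated with a minimal sketch of an iterative, data-driven velocity threshold. It is written in the spirit of such adaptive approaches and does not reproduce any of the cited algorithms: the starting value, the multiplier of six standard deviations, and the stopping tolerance of 1°/s are assumptions made for this example, and the function name is hypothetical.

```python
from statistics import mean, pstdev

def adaptive_velocity_threshold(velocities, initial=100.0, n_sd=6.0, tol=1.0, max_iter=100):
    """Iteratively set the threshold to mean + n_sd * SD of sub-threshold velocities."""
    threshold = initial
    for _ in range(max_iter):
        below = [v for v in velocities if v < threshold]
        if len(below) < 2:
            break  # too few samples to estimate the noise level
        new_threshold = mean(below) + n_sd * pstdev(below)
        if abs(new_threshold - threshold) < tol:
            return new_threshold  # converged
        threshold = new_threshold
    return threshold

# Toy velocity signals (deg/s): the same saccades, but different noise levels.
quiet = [5, 6, 4, 5, 6, 4] * 20 + [300] * 5
noisy = [20, 30, 10, 20, 30, 10] * 20 + [300] * 5
threshold_quiet = adaptive_velocity_threshold(quiet)
threshold_noisy = adaptive_velocity_threshold(noisy)
```

In the toy data, the noisier recording ends up with a considerably higher threshold, which illustrates the point made above: the parameters follow the noise, so small saccades that are detected in the quiet recording may go undetected in the noisy one.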
Not everyone is free to choose which event detection algorithm to use, but for those who are and want an algorithm adapted to their wishes, there are many algorithms to choose from. The many existing event-detection algorithms do not necessarily produce the same output measures when given the same eye-tracking data. In fact, several algorithm comparisons have reported large differences in fixation and saccade measures between algorithms (Andersson et al., 2017; Benjamins et al., 2018; Dalveren & Cagiltay, 2019; Komogortsev et al., 2010; Salvucci & Goldberg, 2000; Stuart et al., 2019). This research suggests that differences in, for instance, average fixation durations between studies that use different algorithms may in part stem from differences between the algorithms.
It has become common that developers of algorithms benchmark their novel algorithm against previous ones (e.g. Hessels et al., 2017; Otero-Millan et al., 2014; Zemblys et al., 2018, 2019). Event detectors based on machine learning have started to appear, whose behaviour cannot be fully described in terms of rules that relate to concepts humans have about the eye-movement signal. Consequently, trust in the algorithm derives from benchmarking against human coders or existing algorithms (Zemblys et al., 2019).
There is an ongoing discussion around the methods in building and evaluating event detectors, in particular how to calculate inter-rater reliability, used to compare algorithms against algorithms or against human coders (e.g. Friedman, 2020; Startsev et al., 2019; Zemblys et al., 2019, 2021). Other current topics concern whether human coding of events is a good benchmark to test the algorithms against (Hooge et al., 2018), or build algorithms from (Zemblys et al., 2019), and what kind of noise to add to the data when testing the noise-robustness of an event detector (Niehorster et al., 2020c).
Fixations, saccade latencies, amplitudes, and curvature have been operationalised in more than one way. For instance, a common way to calculate saccade amplitudes is to calculate the Euclidean distance between start and end of a saccade (e.g. van der Geest et al., 2002). Alternatively, the amplitude can be measured as the distance along the saccade path (calculated, for instance, as duration multiplied by average velocity). These two amplitude calculations will differ for curved saccades (Holmqvist & Andersson, 2017, p. 613).
Different algorithms calculate fixation durations and other measures in different ways (Andersson et al., 2017). In particular, some algorithms exclude the post-saccadic oscillation (PSO) from both the saccade and the following fixation event (e.g. Nyström & Holmqvist, 2010; Zemblys et al., 2019), while the I-VT algorithm and the EyeLink algorithm have no separate detection of PSOs and assign parts of the PSO either to the saccade or the fixation, largely depending on the amplitude of the PSO.
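The difference between the two amplitude definitions can be made concrete with a short sketch. It assumes that the saccade is given as lists of gaze positions in degrees, and it computes the path-based amplitude as the sum of sample-to-sample distances, which corresponds to duration multiplied by average velocity for evenly sampled data. The function names are hypothetical.

```python
import math

def euclidean_amplitude(xs, ys):
    """Straight-line distance between the first and last saccade sample."""
    return math.hypot(xs[-1] - xs[0], ys[-1] - ys[0])

def path_amplitude(xs, ys):
    """Distance travelled along the saccade path, summed over samples."""
    return sum(
        math.hypot(xs[i + 1] - xs[i], ys[i + 1] - ys[i])
        for i in range(len(xs) - 1)
    )

# A straight saccade: both definitions agree (8 deg).
straight = ([0, 2, 4, 6, 8], [0, 0, 0, 0, 0])
# An (exaggerated) curved saccade: 6 deg end to end, 10 deg along the path.
curved = ([0, 3, 6], [0, 4, 0])
```

The curvature in the second example is deliberately exaggerated to keep the arithmetic simple; for real saccades the discrepancy is smaller but follows the same logic.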
Area-of-interest (AOI) measures
Areas of Interest (AOIs, also known as Regions of Interest, ROI, and Interest Areas, IA) are employed when the researcher’s interest is in the relation between gaze behaviour and the visual world (e.g. Buswell, 1935; Viviani, 1990). Researchers may be interested in what parts of a webpage attract gaze most effectively, and in what order (Goldberg et al., 2002), or interested in gaze behaviour while listening to ambiguous sentences about a scene (Allopenna et al., 1998). AOI-measures such as absolute or relative time spent in AOIs or the number of transitions between various AOIs may be used for this.
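As an illustration of how such AOI measures can be derived, the following minimal sketch computes total dwell time, relative dwell time, and transition counts. It assumes that fixations have already been detected and assigned to AOIs, and that each fixation is given as a pair of AOI name (or `None` for fixations outside all AOIs) and duration in milliseconds. Ignoring fixations outside all AOIs when counting transitions is a choice made for this example; the function names are hypothetical.

```python
from collections import Counter

def total_dwell_time(fixations):
    """Sum fixation durations (ms) per AOI, ignoring fixations outside all AOIs."""
    dwell = Counter()
    for aoi, duration_ms in fixations:
        if aoi is not None:
            dwell[aoi] += duration_ms
    return dict(dwell)

def relative_dwell_time(fixations):
    """Dwell time per AOI as a proportion of the total fixation time."""
    total = sum(duration_ms for _, duration_ms in fixations)
    return {aoi: d / total for aoi, d in total_dwell_time(fixations).items()}

def transition_counts(fixations):
    """Count gaze shifts from one AOI to a different AOI."""
    sequence = [aoi for aoi, _ in fixations if aoi is not None]
    return Counter((a, b) for a, b in zip(sequence, sequence[1:]) if a != b)

# Toy sequence of fixations on a face stimulus.
fixations = [("eyes", 200), ("eyes", 300), ("mouth", 250), (None, 100), ("eyes", 150)]
dwell = total_dwell_time(fixations)
proportions = relative_dwell_time(fixations)
transitions = transition_counts(fixations)
```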
Areas of Interest provide fundamental processing tools for the analysis of eye-tracking data, and are used in many branches of cognitive psychology, architecture, marketing, clinical research, neuroscience, educational science and many other fields. Multiple methods exist to relate the AOIs to the stimulus, presented by Holmqvist & Andersson (2017, Ch 8), Hessels et al. (2016), and Orquin et al. (2016).
There are methods that assist with the same function that AOIs are used for, but that are not referred to as AOIs: Reading researchers use non-proportional fonts and oftentimes study single sentences only. This way, fixation-to-word and/or fixation-to-letter assignment is easily done post-hoc; all they need to know is the horizontal offset of the sentence and the PPC value (pixels per character), along with the actual sentence. This also makes gaze-contingent reading research (moving window and boundary paradigms) technically easier to implement. For reading researchers who prefer to use AOIs, both BeGaze from SMI and the SR Research stimulus presentation software automatically segment text into AOIs at the word, sentence, and character level.
When the stimulus consists of animated material or videos, a static segmentation of space into AOIs may not suffice. Dynamic interest areas can be made to move in sync with the underlying object, but may require AOI measures to be calculated based on raw data samples rather than using fixations (e.g. because event detectors often are not reliable when smooth pursuit is present).
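The post-hoc assignment described above can be sketched as follows. The sketch assumes a single-line sentence in a non-proportional font, words separated by single spaces, and horizontal positions in pixels. Assigning the space before a word to that word is a convention chosen for this example, and the function names are hypothetical.

```python
def fixation_to_char_index(fix_x, x_offset, ppc, sentence):
    """Map a horizontal fixation position to a character index, or None."""
    index = int((fix_x - x_offset) // ppc)
    if 0 <= index < len(sentence):
        return index
    return None  # fixation to the left or right of the sentence

def fixation_to_word_index(fix_x, x_offset, ppc, sentence):
    """Map a fixation to a word index; a space counts towards the following word."""
    index = fixation_to_char_index(fix_x, x_offset, ppc, sentence)
    if index is None:
        return None
    # The number of spaces up to and including this character gives the word index.
    return sentence[: index + 1].count(" ")

sentence = "The quick brown fox"
words = sentence.split(" ")
# The sentence starts 100 px from the left edge; each character is 12 px wide.
word_index = fixation_to_word_index(163, x_offset=100, ppc=12, sentence=sentence)
fixated_word = words[word_index]
```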
The size of the AOI is of great importance. If the accuracy of the gaze data is poor, the eye tracker might report a gaze position that is outside the AOI, even though the participant was looking in that area, and vice versa (Holmqvist et al., 2012).
Hessels et al. (2016) report the effects of altering the size of AOIs (face stimuli) on important AOI measures (dwell time, total dwell time, time to first AOI hit), pointing out that effect sizes are large and the relationship is non-linear. Below a certain AOI size, the total dwell times are no longer significantly different between the two AOIs (eyes vs mouth) used in their study. Orquin et al. (2016) reanalysed four experiments using different AOI sizes, and found only some effects of varying AOI size on the outcome of the statistical analysis. Orquin et al. (2016) also note that one third of the researchers in their survey reported conducting analyses with multiple AOI sizes, which may help confirm that the result is robust over all AOI sizes.
Orquin and Holmqvist (2018) present simulations where they vary AOI size, the shape and position of the AOIs, and accuracy and precision, and investigate the effect on the AOI measure hit rate. They report complex, non-linear interactions between data quality measures and AOI properties.
Not only the inaccuracy of the eye tracker matters when calculating AOI measures from AOIs of different sizes. The minimum size of an AOI that encircles a target stimulus is also limited by the inaccuracy of the visuo-oculomotor system when targeting small objects, which can be larger for some participant groups (Clayden et al., 2020; Pajak & Nuthmann, 2013).
It has been suggested that margins should be added around AOIs to compensate for inaccuracy (Holmqvist & Andersson, 2017; Orquin et al., 2016), which may or may not be possible depending on how densely populated the stimulus is. Hooge and Camps (2013) point out that if the visual stimulus is sparse, AOIs could be made as large as possible, sharing the remaining empty space between nearby AOIs. Their argument is that in sparse stimuli, there is not much crowding, and the functional visual field is large (Engel, 1971; Toet & Levi, 1992). A large functional visual field implies that objects are visible at larger eccentricities (or larger distance from the gaze point), allowing observers to overview larger areas around the gaze point.
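The effect of adding a margin can be illustrated with a minimal hit test. The sketch assumes rectangular AOIs given as (left, top, right, bottom) in the same units as the gaze coordinates, and a margin that is added equally on all sides; the function name and the example values are hypothetical.

```python
def in_aoi(gaze_x, gaze_y, aoi, margin=0.0):
    """Return True if the gaze point falls inside the AOI expanded by the margin."""
    left, top, right, bottom = aoi
    return (
        left - margin <= gaze_x <= right + margin
        and top - margin <= gaze_y <= bottom + margin
    )

# An AOI around a target, and a gaze sample reported 5 px to the right of it,
# for instance because of inaccuracy in the recording.
target_aoi = (100, 100, 200, 150)
hit_without_margin = in_aoi(205, 120, target_aoi)
hit_with_margin = in_aoi(205, 120, target_aoi, margin=10)
```

With densely populated stimuli, expanded AOIs may start to overlap, in which case an additional rule is needed to decide which AOI a gaze point belongs to.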
Outcome measures that build upon or are derived from AOI or fixation and saccade measures could be referred to as higher-order measures. As a rule of thumb, the higher-order measures have a large number of settings that can be varied, whether in:
scan path analysis (Anderson et al., 2015; Cristino et al., 2010; Dewhurst et al., 2012; Duchowski et al., 2010; Jarodzka et al., 2010; Kübler et al., 2014);
(hidden) Markov models (Chuk et al., 2014; Coutrot et al., 2018; Ellis & Stark, 1986);
recurrence quantification analysis (Anderson et al., 2013; Pérez et al., 2018);
entropy analyses (Allsop & Gray, 2014, 2017; Hessels et al., 2019; Hooge & Camps, 2013; Krejtz et al., 2014; Niehorster et al., 2019); or
heatmap-based analysis (Caldara & Miellet, 2011).
It is reasonable to expect that data loss, as well as poor precision and accuracy, will be carried through event detection and AOI procedures, and propagate into these higher-order measures. Similarly, settings in the event detector and choices of AOI sizes may also have strong effects on the higher-order measures.
To date, very few studies have been made of the effect on higher-order analyses of changing settings and varying data quality. One example is Krejtz et al. (2015), who show that the size of gridded AOIs affects gaze transition entropy results, with non-linear relationships and large effect sizes in outcome entropy.
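To show where the AOI size enters such a measure, the following sketch maps gaze positions to the cells of a regular grid and computes a transition entropy from the resulting AOI sequence. It is a simplified illustration and not the procedure of the cited study: the entropy is computed as the conditional entropy of the next AOI given the current one, with the weighting of each source AOI estimated from the observed proportion of transitions starting there. The function names are hypothetical.

```python
import math
from collections import Counter

def grid_label(x, y, cell_size):
    """Map a gaze position to the (column, row) of a regular grid of AOIs."""
    return int(x // cell_size), int(y // cell_size)

def transition_entropy(aoi_sequence):
    """Conditional entropy (bits) of the next AOI given the current AOI."""
    pair_counts = Counter(zip(aoi_sequence, aoi_sequence[1:]))
    source_totals = Counter()
    for (source, _), count in pair_counts.items():
        source_totals[source] += count
    n_transitions = sum(pair_counts.values())
    entropy = 0.0
    for (source, _), count in pair_counts.items():
        p_source = source_totals[source] / n_transitions  # weight of the source AOI
        p_conditional = count / source_totals[source]     # P(next AOI | source AOI)
        entropy -= p_source * p_conditional * math.log2(p_conditional)
    return entropy

# Strictly alternating gaze is fully predictable: zero entropy.
predictable = transition_entropy(list("ABABABAB"))
# From A, gaze stays or leaves equally often: higher entropy.
less_predictable = transition_entropy(list("AABAABAAB"))
```

Changing `cell_size` changes which gaze positions share a label, and therefore changes the AOI sequence from which the entropy is computed.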
We have reviewed research on how the eye tracker, methodology, environment, participant, settings of event detectors and AOI tools, etc., affect (or relate to) the quality of the eye-tracking data obtained, the properties of the eye-tracker signals, and the eye-movement and gaze measures. Our review has shown that there exists a significant body of research that has investigated the quality of data from eye trackers and what this quality relates to.
These studies have reported that sunlight and luminance (environment) have large effects on gaze, that the accuracy, precision and data loss often vary significantly between different eye trackers, and that the setup and geometry of the recording situation is of great importance to the quality of the data.
These studies have also shown, for instance, that accuracy, precision and data loss vary between participants, depending on age, eye-region physiology and many other factors. We have seen that calibration matters for accuracy, and that operator skill and trial structures may influence outcome measures. We have learnt that some researchers use filters to counter poor precision, interpolation across gaps of data loss, and manual methods for re-aligning inaccurate gaze data.
The reviewed literature suggests that algorithms for event detection vary dramatically between studies and most algorithms are highly influenced by both precision and settings. Other research has quantified the large non-linear effects of data quality on area-of-interest and higher-order measures.
In the next section, we will examine how the various factors reviewed above are reflected in current reporting practices and guidelines.