Introduction

Eye tracking is a method used to investigate eye movements, gaze behaviour, and pupil dilation in many different research fields (e.g. perception, attention, memory, reading, psychopathology, ophthalmology, neuroscience, human–computer interaction, animal research, human factors, consumer behaviour, optometry, etc.; see Duchowski, 2002; Kowler, 2011; Liversedge et al., 2011; Majaranta, 2011; Rayner, 1998, for overviews). In addition, there is a belief that eye tracking will soon become a ubiquitous technology in laptops and augmented reality headsets for consumers (e.g. Chuang et al., 2019; Clay et al., 2019). Eye tracking is widespread and likely to become more so in the future.

While eye tracking may be used in a variety of research fields to answer very different questions, many methodological aspects appear to be shared, such as the eye-tracker models used, or the algorithms for processing and analysing recorded eye-tracking data. One therefore might expect that the part of the method sections describing the eye-tracking setup, and the processing and analysis of the data it yields, are similar across very different research fields (such as human factors research and neuroscience).

However, recent research suggests that this is not necessarily the case. For example, in many studies using an eye tracker, reporting the quality of the eye-tracking data obtained is not common practice (see e.g. Hessels & Hooge, 2019; Holmqvist et al., 2012). Moreover, although there exist a number of reporting guidelines for research using eye trackers (e.g. Carter & Luke, 2020; Fiedler et al., 2019; McConkie, 1981; Oakes, 2010; Strohmaier et al., 2020), existing guidelines differ substantially from one another, are based on consensus decisions within a small group of authors or researchers, and to the best of our knowledge are not widely used.

The present paper was initiated after several large-scale meetings between eye-tracking researchers from many different disciplines. In these meetings, it was established that there is a need for guidance on what to report in a study using an eye tracker. However, needs regarding what should be reported may differ substantially between research or applied fields. Therefore, the first step was to combine previous research into an empirical foundation for any future reporting standard.

Evidence-based reporting guidelines are essential for at least three reasons.

1. Being expected to report a specific set of features of the experiment may help researchers with planning and designing their studies, as they will be more aware of preparing and collecting information that needs to be reported at a later stage.

2. The adoption of reporting guidelines, leading to sufficient detail in the reported methods of a study, may allow reviewers and future readers to assess the validity of that study’s claims.

3. Following reporting guidelines may assist authors in providing sufficient detail about a study to enable other researchers to reproduce (and potentially replicate) a study. A well-known study on replicability estimated that a mere one third of the findings in psychological science are replicable, qualifying this as a ‘replication crisis’ (Open Science Collaboration, 2015). In eye-tracking research specifically, replication may be particularly hampered by an over-reliance on the performance of the eye trackers and their default algorithms and settings.

Note that we distinguish between reporting guidelines, which offer researchers the possibility to make informed choices about what to report, and reporting standards, which prescribe mandatory reporting items approved by one or another authority. Here, we deliver the empirical foundation and derive from it what should minimally be reported according to empirical research. Our efforts may be followed up by e.g. consensus-based approaches to deliver formal reporting standards. We expect these to differ between, for example, fundamental research fields and clinical applications, due to different considerations with regard to e.g. safety, ethics, legal requirements, researcher background knowledge, and the nature of the research field.

In what follows, we review the existing literature with regard to the following central question: How do the various aspects of a study using an eye tracker (such as the instrument, methodology, environment, participant, etc.) affect the quality of the eye-tracking data obtained, or the eye-movement and gaze measures? We contrast what has been shown to be relevant against what existing reporting guidelines prescribe, and against a database of 207 publications documenting what researchers have reported in eye-tracking research on judgement and decision-making (Fiedler et al., 2019). This review of empirical research forms the basis of our minimal reporting guideline.

As will become apparent, a large proportion of the studies that we discuss are conducted from the perspective of eye-tracking data quality. Why is that important? Better data quality may result in, for example, a lower attrition rate, fewer subjects, shorter experimental sessions, more statistical power, better diagnosis, etc. In other words, it means getting more out of each measurement, observation, or experimental session. The data quality approach entails scrutinising aspects of the procedure of an eye-tracking experiment and improving them such that the quality of the eye-tracking data may be increased. In studies on eye-tracking data quality, the eye trackers or aspects of the eye-tracking data analysis are the target of interest, analogous to the focus on specific traits of humans or animals in a psychological study. The goal can, for example, be to understand how the data from an eye tracker changes when the illuminance of the room, or the distance between a human’s eye and the eye tracker, is varied. Likewise, researchers may be interested in the relationship between aspects of the eye-tracker signal and the age, eye colour, or eye physiology of the human or animal being tracked. Often, the effects of such environmental, setup-related or participant-related factors are quantified in terms of eye-tracking data quality (see e.g. Ehinger et al., 2019; Hessels et al., 2015; Holmqvist, 2015; Nyström et al., 2013).

Also, researchers may be interested in how the quality of eye-tracking data affects eye-movement measures when fed through a particular aspect of the eye-tracking data analysis pipeline (Fig. 1). For example, researchers may be interested in how a ‘fixation duration’ as reported by a fixation-detection algorithm is affected by the precision (Table 1) in the gaze-position signal, or how a measure derived from an area-of-interest (Table 1) analysis may be affected by the settings of a fixation-detection algorithm.

Fig. 1

From eye orientation to higher-order eye-tracking measures. This is a crude division of the process from eye orientation to higher-order eye-tracking measures. There may be cases where a more fine-grained division is applicable

Table 1 List of some common terms used in this paper

This paper may be useful for at least two types of readers: researchers interested in eye tracking per se, and researchers for whom eye tracking is not their core business but who use eye tracking as one of the tools in their research toolbox.

Structure of this paper

For eye-tracking researchers at all levels of experience to follow along, it is vital that we clarify a number of important terms, among which are the characteristics of eye-tracking data quality, the various eye-tracking methods, and common terms in eye-tracking data processing and analysis. Table 1 lists some common terms and definitions. Figure 1 furthermore depicts a general flow from eye-tracking recording to eye-movement measure.

In Section “Measuring data quality of eye-tracker signals”, we briefly explain how the fundamental data quality measures for eye-tracking data are operationalised and calculated. We will use the terms defined in Section “Measuring data quality of eye-tracker signals”: accuracy, precision, data loss, latency etc., throughout the paper.

Section “A review of empirical eye-tracking studies as the basis for a reporting guideline”, the first of the three subsequent content sections, consists of a scoping review of available research relevant to our question: How do the various aspects of a study using an eye tracker, such as the instrument, methodology, environment, and participant affect (or relate to) the quality of the eye-tracking data obtained, the properties of the eye-tracker signals, or the eye-movement and gaze measures? We furthermore review how the quality of the eye-tracking data and the data processing and analysis methods used may affect eye-movement and gaze measures.

In Section “Reporting practices and existing reporting guidelines”, we compare the findings from our scoping review (Section “A review of empirical eye-tracking studies as the basis for a reporting guideline”) against five existing reporting guidelines for research with an eye tracker, and against actual reporting practices. Conveniently for the latter, four of our co-authors have coded the frequencies of the actual reporting of 99 common aspects of eye-tracking experiments from 207 published studies using eye trackers in research on decision-making. See Fiedler et al., (2019) for an earlier presentation of the same data.

Finally, Section “An empirically based minimal reporting guideline” presents a summary of what is empirically relevant to report. This summary could serve as a flexible reporting guideline, offering researchers the ability to make informed choices about what to report for their particular study. This final section is written from the point of view that any aspect of a study that matters to its outcome should be reported.

Measuring data quality of eye-tracker signals

Eye-tracking data quality is often characterised by three measures: accuracy, precision, and data loss (see Fig. 2). Accuracy refers to the difference between the true gaze position and the gaze position reported by the eye tracker. Precision refers to the reproducibility of a gaze position by the eye tracker when the true gaze position does not change. Finally, data loss refers to the amount of data lost in an eye-tracker signal. In addition, a fourth data quality concept is sometimes reported: system latency, which refers to the time it takes to produce gaze coordinates from the sensor data (camera image, for instance). Below, we will give operationalisations for these data quality concepts.

Fig. 2
figure 2

Characteristics of eye-tracking data quality. A Horizontal gaze position (in Fick, 1854, coordinates, see Haslwanter (1995)) of the right eye as a function of time. The gaze position was recorded from an adult participant with an EyeLink 1000 by Hooge et al., (2015). Call-outs indicate the relatively precise gaze-position signal (compared with panel B). B Horizontal gaze position in Fick coordinates of the right eye as a function of time. The gaze position was recorded from an infant participant with the Tobii TX300 by Hessels et al., (2016). Call-outs indicate the relatively imprecise gaze-position signal (compared with panel A), short gaps in the gaze-position signal (data loss), and an extreme gaze position reported by the eye tracker. The extreme gaze position is interesting because it can be considered an aspect of eye-tracking data quality not captured in the measures accuracy, precision, or data loss. C, D Gaze position signals (black dots) in a 2D representation, i.e. as if on a screen. Gaze position signals were recorded from adult participants by Hooge et al., (2019). Gaze position samples with high velocity were removed such that saccades are not visible. Orange markers represent validation targets. They are positioned to illustrate good/poor accuracy and do not correspond to the location of the actual validation targets in the experiment by Hooge et al., (2019). Call-outs indicate validation targets with corresponding precise and accurate, precise and inaccurate, imprecise and accurate, and imprecise and inaccurate gaze position signals, respectively. Note that the qualifications ‘precise’, ‘imprecise’, ‘accurate’, and ‘inaccurate’ are relative here and are often quantified

Operationalizing accuracy

requires that the participants look at a set of fixation targets on screen, often just after having completed the calibration. The accuracy measurement is commonly known as a validation procedure. Research on the positioning of validation points is lacking, but accuracy values may be underestimated (i.e. appear better) if the same points are used for validation as for calibration, or if only part of the stimulus is covered by validation points. Additionally, a second validation procedure and accuracy calculation at the end of the experiment might be beneficial for detecting changes in accuracy between the start and end of the experiment.

Accuracy may be calculated as the mean difference between the reported gaze locations near a validation target and the actual position of that validation target. The achieved accuracy thus critically depends on participant gaze during calibration. Instructing the participant to confirm when s/he is looking at the target (Nyström et al., 2013) or letting the participant adjust the parameters of the calibration while getting feedback from online gaze data (Poletti and Rucci, 2016) may improve accuracy.
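As a concrete illustration, this calculation could be implemented as follows. The snippet is a minimal sketch under our own assumptions (gaze and target positions already expressed in degrees, and the fixation samples already selected, a non-trivial step discussed in the next paragraph), not any manufacturer’s implementation:

```python
import numpy as np

def accuracy_deg(gaze_x, gaze_y, target_x, target_y):
    """Mean angular offset between recorded gaze and a validation target.

    gaze_x, gaze_y: gaze samples (in degrees) from the period in which
    the participant fixated the target.
    target_x, target_y: the position of the validation target (degrees).
    """
    offsets = np.hypot(np.asarray(gaze_x) - target_x,
                       np.asarray(gaze_y) - target_y)
    return offsets.mean()  # 0 would be perfect; lower is better
```

Averaging such values over all validation targets then yields an overall accuracy estimate for the recording.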

When participants produce a saccade to a validation target, they may under- or overshoot the target, make a small correction and only then fixate the target. A method is needed to find the period when the participant looks at the validation target. Manufacturers have built such selection methods into their software for calibration and validation, and some researchers have also investigated and used various sample selection principles (e.g. Hessels et al., 2015; Holmqvist, 2015; Niehorster et al., 2020c; Van der Stigchel et al., 2017). We refer to these studies for details.

Precision

of the gaze position signal may be operationalised in different ways, such as the Root Mean Square sample-to-sample deviation (RMS-S2S) of a segment of gaze data collected when the participants’ gaze is fixed on a validation target. Following Niehorster et al., (2020c), RMS-S2S is calculated as in Eq. 1:

$$ \text{RMS-S2S} = \sqrt {\frac{1}{n-1} \sum\limits_{i=1}^{n-1} {(x_i-x_{i+1})^2 + (y_i-y_{i+1})^2}} $$
(1)

where \((x_i, y_i)\) and \((x_{i+1}, y_{i+1})\) are successive gaze positions during a fixation. Another measure would be the standard deviation (STD) of that segment, or the Bivariate Contour Ellipse Area (BCEA; Crossland and Rubin, 2002; Steinman, 1965). As detailed in Niehorster et al., (2020c), these calculations operationalise different aspects of the gaze signal. RMS-S2S reflects how far the signal travels from one sample to the next; given a stable sampling frequency, this makes the RMS-S2S value of the gaze signal an indicator of noise velocity, which can be compared to the velocity threshold in the event detectors (Section “Fixation and saccade detection”). In contrast, the STD calculation operationalises the dispersion of gaze samples in a segment of data. The dispersion measure STD is calculated as in Eq. 2, where \(\overline{x}\) denotes the mean of quantity x:

$$ \text{STD} = \sqrt{\frac{1}{n} \sum\limits_{i=1}^{n} {(x_i - \overline{x})^2 + (y_i - \overline{y})^2}} $$
(2)

The two calculations (1) and (2) can be applied not only to gaze data, but to any sequence of data from an eye tracker, such as pupil and CR position or pupil diameter data to investigate, for instance, the stability of a pupil dilation measurement.
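For concreteness, Eqs. 1 and 2 may be implemented along the following lines (a sketch assuming NumPy arrays of coordinates from a single fixation segment; the function names are ours):

```python
import numpy as np

def rms_s2s(x, y):
    """Eq. 1: root mean square of the n-1 sample-to-sample deviations."""
    dx, dy = np.diff(x), np.diff(y)        # successive differences
    return np.sqrt(np.mean(dx**2 + dy**2))

def std_dispersion(x, y):
    """Eq. 2: dispersion of the samples around the segment mean."""
    x, y = np.asarray(x, float), np.asarray(y, float)
    return np.sqrt(np.mean((x - x.mean())**2 + (y - y.mean())**2))
```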

Data loss

may be operationalised as the percentage (or proportion) of samples which lack coordinates for the gaze signal. For example, an eye tracker with an advertised sampling frequency of 250Hz that reports only 2000 gaze coordinates during 10 s has a data loss of 20%. However, there are other operationalisations of data loss that may be useful in some situations: for instance, in some cases, the researcher might wish to count gaze or pupil samples that are missing due to blinks as data loss. Blinks may account for about 2% loss of the total data set (Holmqvist and Andersson, 2017, p. 167). In some cases, gaze shifts outside the tracking range of the eye tracker may count as data loss. In developmental research, where young children are prone to look away from a monitor when they are no longer interested, researchers might wish to exclude periods of looking away from the calculation of data loss (see e.g. Hessels et al., 2015; Wass et al., 2014, for operationalisations of data loss in developmental research).
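This basic operationalisation amounts to the following sketch (our illustration, using the 250Hz example from the text):

```python
def data_loss_percent(n_valid, sampling_frequency, duration_s):
    """Percentage of expected samples that lack gaze coordinates."""
    n_expected = sampling_frequency * duration_s
    return 100.0 * (1.0 - n_valid / n_expected)

# 250 Hz for 10 s should yield 2500 samples; 2000 reported samples
# thus correspond to 20% data loss:
data_loss_percent(n_valid=2000, sampling_frequency=250, duration_s=10)  # 20.0
```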

System latency

(also known as temporal accuracy and end-to-end delay, e.g. Reingold, 2014, p. 641) may be operationalised as the average duration from the time of an actual movement of the tracked eye until the recording computer signals that the eye movement has taken place. In a video-based P–CR tracker, the optimal latency is the time from image acquisition to calculated gaze, which takes 1–3 samples (1–3ms in a 1000Hz recording, see Holmqvist & Andersson, 2017, p. 85). Any timing issues in the processes run by the computers involved in the data recording may add latencies. A large variability in the latency may be characterised as poor temporal precision.

Long and variable latencies are problematic for the interpretation of measurements that are assumed to be synchronised: eye tracker and EEG, for instance, or eye tracker and stimulus monitor. The latter is very important in gaze-contingent research, where latencies are reported to be 10–60ms, including the delay to the next retrace of the monitor (Section “Signal properties and processing”).

Latencies can be measured in at least the following five ways, some of which require specific equipment and/or software. The first method measures the time until there is an update in the gaze signal. Methods three to five measure latency until a display change has been completed. The second method can be used for either of these two types of measurements.

1. Compare the file of the raw data stream against a video output of the participant’s eye (Leppänen et al., 2015) or gaze scanpath (Morgante et al., 2012).

2. Equip an artificial eye with two diodes that act as artificial corneal reflections per IR illuminator, and turn one off while the other diode is turned on, so that the eye appears to move, and then measure the time until a movement is seen in the gaze signal, or until the display changes (Bernard et al., 2007; Holmqvist et al., 2012; Reingold, 2014).

3. Shukla et al., (2011) used a mirror positioned next to the participant’s face and a 300Hz high-speed camera, which captured the participant’s eye and, through the mirror, the monitor where the stimuli appeared and disappeared.

4. Saunders and Woods (2014) tested gaze-contingent monitors with the EyeLink 1000, by blinding the eye tracker with an infrared pulse and measuring the time until the gaze-contingent monitor changed, recording both the infrared pulse and the monitor with a 1000Hz camera.

5. Hohenstein and Kliegl (2014) measured the latency between saccades and display changes in a gaze-contingent study, with a light sensor attached to the monitor.

As is evident from the operationalisations above, lower values for accuracy, precision, data loss, and system latency are better: The ideal value is 0 for each data quality measure. Worse data quality manifests as higher values.

Examples of procedures, formulas, (pseudo)code or links to software for estimating some measures of data quality and effects thereof may be found in e.g. Crossland and Rubin (2002), Blignaut and Beelders (2012), Akkil et al., (2014), Dalrymple et al., (2018), Hessels et al., (2017), Orquin and Holmqvist (2018), Kangas et al., (2020), and Niehorster et al., (2020a, 2020c).

A review of empirical eye-tracking studies as the basis for a reporting guideline

We will present our review ordered by the categories Eye-tracking methods, Environment, Setup and geometry, Participant, Calibration, Features of the experiment, Signal processing, Event detection, Area-of-Interest measures, and Higher-order measures. The minimal reporting guideline itself can be found in Section “An empirically based minimal reporting guideline”.

Eye-tracking methods: Similarities and differences

Over the past 130 years (e.g. Delabarre, 1898; Lamare, 1892), many methods for eye movement registration have been developed. A recent comprehensive overview is provided by Holmqvist and Andersson (2017, Ch 4). For other overviews of eye trackers and methods for measuring eye movements, see Hansen and Ji (2010), Duchowski (2007, pp. 51–59), Ciuffreda and Tannen (1995, pp. 184–205), Young and Sheena (1975), and Ditchburn (1973, pp. 36–77).

In this section, we describe how characteristics of the eye-tracker signals differ between the measurement techniques and between various eye-tracker models. From the perspective of a researcher embarking on a new project with a limited budget, each measurement technique is likely to have some advantages and some disadvantages. Within each technique, differences between manufacturers’ models in data quality and other properties may be large enough to determine the success or failure of the upcoming study.

Table 2 summarises 42 existing cross-comparative benchmarking studies of eye trackers, which we refer the reader to for specific details. In short, these 42 studies inform their readers that data quality often differs considerably, and in many ways, between some eye trackers, while other eye trackers record data of similar quality. The studies in Table 2 may assist in assessing whether an eye tracker can actually produce data of the desired quality, either in preparation for acquiring a system, or when preparing a replication where the eye tracker in the intended replication study differs from the eye tracker in the original publication.

Table 2 Comparative benchmarking studies

Summarising studies on accuracy and precision in particular, Holmqvist and Andersson (2017) point out that the difference in distribution of RMS-S2S precision values between eye trackers may be up to two orders of magnitude, while, in comparison, between-subjects differences in precision within each eye tracker tend to be relatively small. In contrast, the distributions of accuracy values overlap considerably between eye trackers (i.e. they have similar accuracy), but exhibit a very wide range within each eye tracker, reflecting data from people with different eye physiologies, with and without spectacles, and data obtained during fixations at corner versus central positions on monitors. This suggests that for precision, the eye tracker matters more, while for accuracy, the participant, the calibration, and the geometrical setup matter more. These findings were obtained with adult human participants in the lab and may differ for infants, animals, and difficult recording environments.

As we outline below, irrespective of measurement method, anything that interferes with obtaining or processing a feature used in estimating gaze direction (P, CR, P1, P4, limbus, magnetic induction, or retinal features) will affect the quality of the signal in the data reported by the eye tracker.

P–CR eye tracking

Video-based P–CR eye tracking was introduced by Merchant (1967). In 2021, camera-based P–CR eye trackers dominate the market almost completely. The P of P–CR eye trackers refers to the pupil centre in the camera image, and the CR to one or more reflection centre(s) in the cornea from infrared illuminators in the eye tracker. P–CR eye trackers estimate gaze direction as a function of the relative positions of P and CR coordinates in the pixel coordinate system of the video image, for instance by subtracting the CR coordinate from the P coordinate. Note that more advanced models have been developed (Hansen and Ji, 2010).
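To illustrate the principle (and only the principle), a P–CR feature and a simple calibrated mapping might look as follows. The second-order polynomial form and all names here are our assumptions for the sketch; commercial trackers use more advanced models (Hansen and Ji, 2010):

```python
import numpy as np

def pcr_vector(pupil_xy, cr_xy):
    """P-CR feature: pupil centre minus CR centre, in eye-camera pixels."""
    return np.asarray(pupil_xy, float) - np.asarray(cr_xy, float)

def gaze_from_pcr(pcr_xy, coeffs_x, coeffs_y):
    """Map a P-CR vector to screen coordinates via a second-order
    polynomial whose six coefficients per axis are estimated during
    calibration (e.g. by least squares on the calibration targets)."""
    px, py = pcr_xy
    features = np.array([1.0, px, py, px * py, px**2, py**2])
    return features @ coeffs_x, features @ coeffs_y
```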

More types and models of P–CR eye trackers are available than for any other measurement technique, and prices vary over a wide range. There exists plenty of software for stimulus presentation, data processing and analysis, and the learning threshold for beginning researchers is lower than for other eye-tracking methods.

Many studies have examined aspects of P–CR eye trackers (Table 2). A host of issues with the feature detection of both pupil and corneal reflection may impair the quality of gaze and pupil-size data. As we point out elsewhere, P–CR trackers suffer from the pupil-size artefact (Section “Environment”) and the pupil foreshortening artefact (Section “Setup and geometry”). Refraction in the cornea alters the pupil size in the camera image and its position with respect to the limbus (Villanueva & Cabeza, 2008). Pupil occlusion and mascara can interfere with pupil detection. Blue irises tend to result in poorer precision (in dark-pupil eye trackers), due to the poor contrast between (a dark) pupil and iris in the infrared light of video-based eye trackers (Section “Participants”, and Figure 4.13 in Holmqvist & Andersson, 2017). Combining the pupil with the CR signal to form the P–CR gaze signal may amplify post-saccadic oscillations and overestimate peak saccadic velocity (Hooge et al., 2016).

P–CR eye trackers exhibit clear post-saccadic oscillations (PSOs) (Hooge et al., 2015; Nyström et al., 2013), which make it difficult to draw a clear border between saccade and subsequent fixation, and which has led to the development of event detection algorithms that include PSO detection (Larsson et al., 2013; Nyström & Holmqvist, 2010; Zemblys et al., 2019).

Discussing which technologies could be used for future studies of saccade dynamics, Hooge et al., (2016) reason that variants of CR tracking without the involvement of the pupil feature could be the preferred future method. However, Holmqvist and Blignaut (2020) reported incorrectly measured amplitudes of small eye movements (below 2°) in all 11 P–CR eye trackers they tested, and suggest that this is due to erroneous calculations of the CR centre by the image-processing algorithms in the eye trackers, interacting with the resolution of the eye-camera sensor. Other artefacts in the CR signal arise from changes in head position (relative to the eye tracker), which may alter the size and the shape of corneal reflections (Guestrin & Eizenman, 2006). Patterns in the iris may interact with the CR image and change the calculated CR centre (Tran & Kaufman, 2003). Illumination levels, sampling frequency, and the optic lenses in the camera may all affect the CR. Droege and Paulus (2009) point out that low-quality eye cameras may further degrade precision in the gaze signal: their slower pixel updating makes pixels retain some of the brightness of the passing corneal reflection, leaving a bright trace behind the real reflection and making centre calculation of the CR image more error-prone.

DPI eye tracking

The Dual-Purkinje Imaging (DPI) system is an analogue eye tracker that bases its estimation of gaze on the relative movement of the infrared reflection off the cornea (P1) versus the reflection at the back of the crystalline lens (P4), and reports P1, gaze and head translation as voltages (Crane & Steele, 1985). At present, there are around 60 DPI trackers left in the world (Personal communication; Warren Ward). As the DPI produces a continuous signal, it can be digitised to the desired sampling frequency in an AD-converter. Internal bandwidth restrictions limit the maximum sampling frequency to 39.06kHz (Personal communication; Warren Ward).

The DPI used to be the main workhorse of many psychology laboratories and features in many influential publications such as Frazier and Rayner (1982) and Deubel and Schneider (1996). The learning threshold is clearly higher than for P–CR trackers, but the major drawback of the DPI is that it is a bulky and sensitive machine built using optoelectronics from the 1970s that are serviced commercially by only one person. However, the camera-based DPI built by Rucci et al., (2020) has a data quality comparable to the original analogue system and is built with modern electronics, which may revive the DPI measurement technique.

The P1 in DPI eye tracking is the same reflection as the CR of P–CR trackers, with the important distinction that P–CR eye trackers estimate the center of the CR from a small portion of a pixelated camera image, while the DPI finds the centre of an analogue light beam. This has been proposed to be the reason that the DPI does not mismeasure the amplitudes of small eye movements (Holmqvist and Blignaut, 2020).

The DPI records gaze signals with a quality sufficient to detect tremor, oculomotor drift, microsaccades, and smooth pursuit with good reliability (see Holmqvist & Blignaut, 2020; Ko et al., 2016; Poletti & Rucci, 2016, for details). Holmqvist (2015) reports a median precision of 0.008° and an accuracy of 0.4° across 192 participants, both better than any video-based P–CR system. The quality of DPI data is generally lower when recording participants with small pupils that cover the P4 reflection, which causes inaccuracies and data loss (Crane and Steele, 1985; Holmqvist et al., 2020). DPI data are best recorded from participants with large pupils, either in dark rooms or with artificially dilated pupils. The reliance on the P4 reflection furthermore results in the largest measured amplitudes of post-saccadic oscillations in any eye tracker (Deubel & Bridgeman, 1995).

Scleral search coils

Scleral search coils were introduced by Robinson (1963) and adapted for use with human participants by Collewijn et al., (1975). The scleral search coil method involves placing a copper wire coil, embedded in an annulus or contact lens, onto the sclera. The participant is placed in oscillating magnetic fields and the induced voltage in the eye coil is taken to represent the orientation of the eye with respect to the magnetic fields. This technique was dubbed the gold standard of eye tracking by Collewijn (1998). Reulen and Bakker (1982) presented the double magnetic induction principle, improved by Bour et al., (1984). Like the DPI, scleral search coil systems are analogue trackers, and data can be digitised at very high sampling frequencies. Coils can even record combined eye and head rotation for the same participant (Collewijn et al., 1985).

Houben et al., (2006) compared a coil system with a torsion-capable video eye tracker, finding that the gaze signal from the coil system was ten times more precise, and Ko et al., (2016) compared a coil system to a DPI, finding that although data from a coil system are somewhat more precise, both systems provide a data resolution sufficient for reliable detection of intersaccadic (fixational) eye movements. Collewijn (2001) sampled data at 10000Hz, and additionally reported a tracking range of 20° in all directions with a resolution of 1′, while Malpeli (1998) reports a precision of 1′ (0.017°) and Collewijn et al., (1988) recorded saccades with amplitudes of up to 80°.

All studies in Table 2 that have compared EyeLink systems with scleral search coils reported substantial agreement in precision and detection of microsaccades and oculomotor drift in both systems (see McCamy et al., 2015, for a review). Note, however, that coils have been suspected to slow down the saccades of participants who wear them (Frens and van der Geest, 2002; Träisk et al., 2005). Even so, coils probably estimate saccadic velocity more accurately than P–CR eye trackers, which overestimate it (Hooge et al., 2016).

The scleral coil tracking method is distinctly invasive, and evidence exists that older coil systems, in combination with the anaesthetics that were applied, caused temporary reductions in visual acuity (Irving et al., 2003, but see Murphy et al., 2001), deformation of the visual field (Duwaer et al., 1982), and blurred vision (Arend & Skavenski, 1979). Contemporary search coils are embedded in flexible contact lenses and used for research and clinical diagnostic purposes in neuro-ophthalmology and neurology, due to their high precision, and the fact that patients often suffer from uncontrolled head and body movements.

EOG

Schott (1922) and Meyers (1929) produced recordings of the horizontal component of gaze based on the corneo-retinal potential principle, discovered in 1849 by Du Bois-Reymond. An EOG system records eye movements using electrodes at the sides of the eyes that pick up the electric field produced by this corneo-retinal potential of 10–30mV (Brown et al., 2006). The signal is then taken through an isolated instrumentation amplifier connected to a chart recorder or a computer. EOG is an analogue method. EOG systems are often part of other recording devices. For instance, electroencephalogram (EEG) systems often have extra electrodes for the eyes that can be used for EOG recordings.

Brown et al., (2006) proposed a standardized measurement procedure for clinical EOG measurements, aimed at acquiring high-quality EOG data. Their procedure includes dilating the pupil, preparing the skin of the participant, and then applying two electrodes to the sides of each eye and a reference electrode to the forehead. The corneo-retinal potential is mainly derived from the retinal pigment epithelium, and it changes in response to retinal illumination. Hence, in a totally dark environment, the participant spends 15 minutes looking at dim fixation targets, followed by a light phase of similar duration. This darkness-light sequence maximizes the corneo-retinal potential. The actual data recording then commences.

EOG can be a useful form of eye tracking when studying larger movements of the eye; small movements will drown in the noise of EOG data (compare Fig. 2). One specific advantage of EOG is that it can be used when the eyes are closed, for instance to study REM sleep (Aserinsky and Kleitman, 1953). However, EOG eye tracking comes with poor accuracy compared with most other eye trackers: Young and Sheena (1975) report a 1.5°–2° inaccuracy on average.

Limbus tracking

The first published implementation of a (photo-electric) limbus tracker was by Török et al., (1951). Limbus trackers estimate the limbus border between the iris and sclera, either from video or photosensors. Limbus eye trackers based on photodiodes were sold for research up until the year 2000 by the Skalar company, but are now only known for controlling the laser during refractive surgery of the eye (Arba-Mosquera and Aslanides, 2012). The Ober Saccadometer is not a limbus tracker, but a corneal bulge tracker (Holmqvist & Andersson, 2017, p. 73), although like the Skalar limbus tracker, the Saccadometer uses photosensors to track the corneal bulge.

Video-based limbus trackers use the fact that the limbus border (between iris and sclera) has a contrast comparable to the pupil–iris border. However, limbus trackers do not suffer from pupil-based artefacts, which affect both DPI and P–CR systems. Refraction in the cornea is also not a problem. Eye trackers with low-resolution cameras may benefit from using the limbus method. The drawback is that a large portion of the limbus may be covered by the eyelid, which poses challenges for the image processing.

Piezoelectric eye tracking

The piezoelectric transduction method, first introduced by Bengi and Thomas (1968), involves bringing a silicone-tipped piezoelectric bimorph into contact with the sclera, typically in the interpalpebral region near the temporal limbus. It outputs voltage signals, in which horizontal microsaccades and oculomotor tremor can be detected. This analogue eye tracker has not been used for purposes other than measuring intrafixational eye movements. There is a suspicion that the introduced pressure on the sclera affects the microsaccade behaviour (see McCamy et al., 2013, for a discussion).

Retinal image-based eye tracking

Computational tracking of retinal features involves finding the optic disk, blood vessels and smaller features, and was first done by Cornsweet (1958). A computer vision algorithm provides an analysis of the movement of features in the camera view, and infers eye movements.

Retinal image-based eye trackers are the most accurate and precise of all existing eye trackers. An early system by Cornsweet (1958), albeit limited in that it only tracked features along one axis, could detect eye movements (microsaccades) down to amplitudes of 10 seconds of arc (0.0028°). Putnam et al., (2005) presented very impressive numbers on gaze position accuracy (5′′, which is 0.0014°) based on snapshots taken with an adaptive optics retinal camera.

The retinal-based eye trackers with the highest speed and best accuracy are preferably built from scanning imagery, specifically from scanning laser ophthalmoscopes (SLO). These rely on the so-called ‘rolling shutter’ principle to recover eye motion (Mulligan, 1997), and are especially effective in SLOs that use adaptive optics that offer high resolution, high magnification and densely sampled retinal video (Stevenson and Roorda, 2005). Stevenson et al., (2016) introduced the first binocular system, which optically divided a single SLO image field between two eyes.

Retinal imaging systems also generally occlude forward viewing, impeding stimulus presentation. This may however change: Bartuzel et al., (2020) describe a MEMS-based retinal imaging system that allows for presentation of stimuli while recording with a high sampling frequency (1240Hz). Even then, the measurement range (also “trackable range”) tends to be smaller than with other eye trackers: Bartuzel et al., (2020) report a 16° range (8° left, 8° right), which we can compare to 20°–40° for the DPI and many video-based P–CR trackers, and 90° or more for scleral coils.

Retinal image-based eye-tracking systems typically rely on a reference frame which, in a scanning system, is a single retinal image upon which strips of all movie frames are registered to compute the eye motion. This process generally yields two outputs: a stabilised movie and an eye motion trace. If the reference frame is perfect and every strip from each scanned frame is perfectly registered to it, then the eye motion trace will also be perfect. However, distortions in the reference remain a challenge to overcome, and these distortions yield artefacts in the eye motion trace. Recent efforts have been made to correct for these (Azimipour et al., 2018; Bedggood and Metha, 2017) but, if uncorrected, these artefacts are evident as peaks in the power spectrum of eye motion (Bowers et al., 2019).

To date, however, retinal-image-based eye trackers have had a limited scope of application. The intrinsic trade-off between accuracy and range has rendered them most useful for studying eye movements during steady fixation (Bowers et al., 2019). Retinal eye trackers have predominantly been used in ophthalmology applications, often relating to disease in the retina and how it expresses itself in vision and miniature eye movements (Godara et al., 2010).

Binocular vs monocular eye tracking

The different technologies above can be constructed or set up to record either monocularly or binocularly. A common use of binocular eye tracking, particularly in remote eye trackers, is to combine the left and right signal by averaging synchronous data samples from the two eyes in the recording software, sometimes referred to as “cyclopean gaze”. Cui and Hondzinski (2006) report that averaging left and right signals improves accuracy, but Hooge et al., (2019) found that averaging the gaze positions from the two eyes improved accuracy only for some of the participants.
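Such averaging amounts to little more than the following sketch (ours); with NaN-coded signals, samples where either eye is missing simply remain missing:

```python
import numpy as np

def cyclopean_gaze(left_xy, right_xy):
    """Average synchronous left- and right-eye gaze samples.
    NaNs (missing samples) propagate into the averaged signal."""
    return (np.asarray(left_xy, float) + np.asarray(right_xy, float)) / 2.0
```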

Furthermore, head-mounted eye trackers may suffer from parallax errors, which arise because the vantage point of the eye and the scene camera do not coincide, typically when the measurement is not confined to a single plane. Binocular averaging is regularly done in glasses-based eye trackers (SMI ETG, Tobii Glasses, for instance), and in the Ober Saccadometer, which helps to alleviate the parallax issue. A thorough investigation of the geometry of the parallax error is provided by Mardanbegi and Hansen (2012), Narcizo et al., (2017), Narcizo and Hansen (2015), and Tatler et al., (2019).

Alternatively, the two signals from the two eyes can be used to measure vergence (e.g. Liversedge et al., 2006). Jaschinski et al., (2010) showed that the EyeLink II, assuming no environmental and participant artefacts, can resolve vergence eye movements of just below 40mm in depth at a 60cm viewing distance. However, vergence measurements with P–CR eye trackers are sensitive to artefacts that affect accuracy: Hooge et al., (2019) and Jaschinski (2016) both report effects of the pupil-size artefact on vergence. Calibration for binocular recordings introduces the choice whether to calibrate both eyes at once or separately (Kirkby et al., 2013; Nuthmann and Kliegl, 2009; Švede et al., 2015). Additionally, Wang et al., (2019) found that the calculated vergence point (the intersection of the gaze direction vectors of the left and right eye) may deviate substantially from the fixated point, with a wide distribution in depth and a misestimation of the mean vergence point towards the participant.
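Because two gaze direction vectors rarely intersect exactly in 3D, the vergence point is commonly operationalised as the midpoint of the shortest segment connecting the two gaze rays. The sketch below is our illustration of that geometric idea, not the method of Wang et al., (2019):

```python
import numpy as np

def vergence_point(p_left, d_left, p_right, d_right):
    """Midpoint of the closest approach of two gaze rays.

    p_left, p_right: 3D positions of the two eyes.
    d_left, d_right: 3D gaze direction vectors of the two eyes.
    """
    p1, d1 = np.asarray(p_left, float), np.asarray(d_left, float)
    p2, d2 = np.asarray(p_right, float), np.asarray(d_right, float)
    w0 = p1 - p2
    a, b, c = d1 @ d1, d1 @ d2, d2 @ d2
    d, e = d1 @ w0, d2 @ w0
    denom = a * c - b * b
    if np.isclose(denom, 0.0):
        return None  # (near-)parallel gaze rays: no stable estimate
    t1 = (b * e - c * d) / denom   # distance along the left-eye ray
    t2 = (a * e - b * d) / denom   # distance along the right-eye ray
    return ((p1 + t1 * d1) + (p2 + t2 * d2)) / 2.0
```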

Environment

Eye tracking may take place in various environments, such as an MRI scanner, a car, a fighter jet, behind a desk, in VR, and during sports. These environments may differ in light conditions, vibrations, sound, temperature, and the presence of other people.

Light conditions

Direct sunlight has a critical impact on data quality in video-based P–CR and DPI eye trackers. Hansen and Pece (2005) and Holmqvist and Andersson (2017, pp. 138–139) show several examples of how infrared radiation from sunlight and hot light bulbs undermines tracking in video-based P–CR trackers. The importance of a controlled light environment is exemplified by Wang et al., (2010), who excluded 32% of participants recorded while driving a real car from one of their analyses due to poor data quality, but only had to remove 17% of participants recorded in a car simulator. The authors attributed the difference in data quality to the variable lighting conditions encountered during real driving. In a study of six pupil-centre calculation algorithms for video-based outdoor eye tracking, Fuhl et al., (2016) note that pupil algorithms have good average performance, but that obtaining robust pupil centres under poor illumination conditions remains a problem. Rapid changes in illumination, common in car driving and flight-deck research, can be detrimental to data quality and lead to a time-consuming investment in manual post-processing (Kasneci et al., 2014). Non-commercial algorithms to improve tracking in sunlight have been developed by Santini et al., (2018) and Hansen and Pece (2005).

Even moderate changes in light levels can indirectly affect data quality. Multiple studies have established the existence of the pupil-size artefact, in which changes in pupil size affect gaze position accuracy in both video-based P–CR systems (Choe et al., 2016; Drewes et al., 2011, 2012, 2014; Hooge et al., 2019, 2021; Jaschinski, 2016; Wildenmann & Schaeffel, 2013; Wyatt, 2010) and the DPI (Holmqvist et al., 2020; Holmqvist, 2015). Manipulating light levels to affect pupil size typically results in increased gaze inaccuracy of 1° to 5°. The reason that changes in pupil size affect reported gaze direction is that the pupil constricts and dilates asymmetrically, altering the pupil shape, and hence the calculated centre of the pupil image shifts position. In any video-based P–CR eye tracker, this implies a shift in gaze, even though the eyeball has not rotated with respect to the head. In a DPI, a small pupil may cause the P4 reflection at the back of the crystalline lens to be obstructed. The geometry of the setup, gaze direction, and distance to the eye camera have also been found to influence the magnitude of pupil-based errors (Ahmed et al., 2016; Hooge et al., 2021; Wilson et al., 1992; Wyatt, 1995, 2010). In addition, it has been reported that pupil size in P–CR eye trackers is also related to some eye-movement measures, such as the saccadic peak velocity (Nyström et al., 2016).

Accuracy in video-based P–CR trackers is generally better for participants who have smaller baseline pupils (before calibration), measured under controlled illumination, as reported by Ahmed et al., (2016) and Holmqvist (2015). For the DPI eye tracker, the opposite is true: a large baseline pupil size results in better accuracy (Holmqvist, 2015). The signals of EOG systems and scleral coils are likely independent of pupil size, while data from retinal trackers benefit from a large pupil.

The pupil-size artefact may affect other measures. For instance, Hooge et al., (2019) found that light levels affect vergence estimations, with an error of 0.36°–0.75° per mm change in pupil size (similar findings were reported by Jaschinski, 2016). We can expect that gaze position errors induced by the pupil-size artefact will inevitably propagate to many AOI and other higher-order measures.

Environmental vibrations and ambient noise

Sources of vibration in the recording environment contribute to increased variation in the gaze signal, as exemplified by Figure 6.24 in Holmqvist and Andersson (2017), showing how transients in the signal appear when a person walks in a room where an artificial eye is being measured with a tower eye tracker. Vibrations could be expected to matter particularly on flight decks, in cars, and during sports. For instance, De Reus et al., (2012) report that alignment shifts of the eye tracker inside the flight helmet due to external motion frequently caused inaccuracies of gaze (see also Niehorster et al., 2020b). For lab studies, a nearby elevator shaft, a powerful air conditioning unit, or vibrations caused by someone walking nearby on hard floors may add measurable noise to a sensitive eye-tracking recording. Sound in the recording situation is another form of oscillation that could make the eye tracker vibrate and affect the quality of recorded data. However, Hooge et al., (2019) recorded Tobii TX300 data at an indoor science festival with moderately loud music and found accuracy values close to manufacturer specifications. Controlled studies of the effect of vibrations on eye-tracking data quality appear to be lacking.

Presence of others

The presence of other people during the recordings may affect measures of eye movements and gaze behaviour in ways that are little understood. Social appropriateness may matter: The very presence of an eye tracker can impact head and eye movements, with people looking only at what they feel is socially appropriate when they believe that an eye tracker is recording (Risko and Kingstone, 2011; Nasiopoulos et al., 2015). Distraction is another possible factor: For instance, infants are easily distracted, looking at nearby people rather than at the monitor (Tomalski & Malinowska-Korczak, 2020). Accidental mismeasurements may happen when the infant is seated in the lap of a parent, and the eye tracker finds and records the parent’s eyes. Additionally, Oliva et al., (2017) found longer latencies in the antisaccade task when adult participants were recorded in proximity to one another, for reasons that are not well understood.

Special recording environments

The MRI scanner environment consists of a dark and noisy tunnel, with powerful magnetic fields, in which participants must lie down. The duration of experiments and pacing of stimuli often differs from outside the MRI. Importantly, data quality from video-based P–CR tracking in MRI (SR Research, SMI, Arrington, Gaze Intelligence) generally appears to be lower than outside the MRI: poorer precision and accuracy, and more frequent data loss (Dar et al., 2021). For infrared limbus trackers (MR-Eyetracker, Cambridge Research Systems) attached to the headcoil, even small movements of the head may over time result in data loss. MRI trackers also exist that use a multicore fiber to transmit light back to outside the MRI machine where they process the reflections of the corneal bulge. The Ober MRI-tracker exhibits crosstalk (i.e. correlation) between horizontal and vertical signals, which makes the gaze signal useful only for horizontal tracking.

A curious observation is that saccadic latencies are longer when obtained in an MRI scanner than outside it, which could reflect the long fixation periods between saccades required in scanners, or other differences, such as participants lying down and potentially feeling drowsy (e.g. Talanow et al., 2020, their Table 1). Furthermore, the magnetic field of 7T MRIs has been reported to induce nystagmus in some participants (Roberts et al., 2011).

Head-mounted virtual-reality sets allow exclusive control over the visual stimulation provided to a subject, while shutting out any visual references provided by the outside world. Little is known of the data quality of eye trackers integrated into VR goggles, but Pastel et al., (2021) found that precision is significantly poorer in the SMI Vive VR goggles than in the SMI glasses. Accuracy, however, differs only in some conditions, mostly when the distance to the fixation point changes. Stein et al., (2021) found that the end-to-end latency of common VR headsets ranged from 45ms to 81ms (compare Section “Signal properties and processing”).

Setup and geometry

When preparing a manuscript about an experiment involving an eye tracker, it is important to realise that an eye-tracking setup is more than just the eye tracker itself. Hessels and Hooge (2019) point out that a screen-based eye-tracking setup may consist of at least an eye tracker, a computer screen, a seat for the participant, and a table or mounting device for positioning the eye tracker. For wearable eye trackers, the setup includes the participant, the eye tracker, and whatever frame, headbands, helmets or straps are used to position the eye tracker relative to the participant’s eyes. By geometry, we mean the “absolute position and orientations of the eye, the eye-tracker camera, and the IR illuminator” (Hooge et al., 2021), and, in the case of screen-based eye tracking, the screen. The geometry can thus (partially) be described by the distances between eye tracker (camera and/or IR illuminator), participant, and screen, and their relative orientations. A picture or schematic can be useful in providing this information, as done in Choe et al., (2016, Figure 1), Hessels and Hooge (2019, Figure 2), Valtakari et al., (2021, Figure 1), and our Fig. 3.

Fig. 3

Example of a head-boxed eye-tracking setup. The setup consists of a participant, eye tracker (camera and IR illuminator) and a computer screen. The geometry of this setup can be described by the relative orientations and distances of the monitor, camera and IR illuminator, and participant. Some eye trackers have a fixed relation with the computer screen (e.g. Tobii Pro Spectrum), while others do not and allow for more adjustments (e.g. SR Research EyeLink 1000). Note that the eye-tracker distance and screen distance are not identical. Screen height and width refer to both the physical and the pixel measures

Gaze direction, measurement space and monitor size

Relevant properties of the setup may include the distance and relative orientation between participant and eye tracker, participant and computer screen, and the size and resolution of the computer screen. Most video eye trackers report gaze position in pixels on a screen. For some research this is sufficient (e.g. area-of-interest research in marketing). For other studies, one may wish to report the orientation and rotation of the eye in angular measurements (e.g. Haslwanter, 1995). In order to convert a gaze position on a screen in pixels to an angular measurement, it is necessary to know the distance and relative orientation between participant and eye tracker, participant and computer screen, and the size and resolution of the computer screen. If the width and height of the screen are smaller than 20° (10° to the left and 10° to the right), the small-angle approximation may be applied. For example, this allows one to transform gaze positions in centimetres or pixels on screen to angles with a simple multiplication factor. For a general and more accurate method for this transformation, see Holmqvist & Andersson (2017, p. 21).
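Both the small-angle shortcut and the exact conversion can be sketched as follows (our illustration; it assumes a horizontal extent centred on the line of sight, with known screen size, resolution, and viewing distance):

```python
import math

def pixels_to_degrees_approx(px, screen_width_px, screen_width_cm, distance_cm):
    """Small-angle approximation: one multiplication factor suffices."""
    extent_cm = px * screen_width_cm / screen_width_px
    return math.degrees(extent_cm / distance_cm)

def pixels_to_degrees_exact(px, screen_width_px, screen_width_cm, distance_cm):
    """Exact conversion for an extent centred on the line of sight."""
    extent_cm = px * screen_width_cm / screen_width_px
    return 2.0 * math.degrees(math.atan2(extent_cm / 2.0, distance_cm))
```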

When the monitor is larger than the measurement range of the eye tracker (Section “Eye-tracking methods: Similarities and differences”), data quality will be poorer in the outer parts. Niehorster et al., (2020b), Schlegelmilch and Wertz (2019), Popelka et al., (2016), Holmqvist (2015), and Guestrin and Eizenman (2006) all found that data recorded in the corners of the monitor (or measurement plane) are of poorer quality than those recorded at the monitor’s centre. Generally, recordings made while looking at corner positions exhibit a precision that might be worsened by a factor of 3, and an accuracy worsened by an average of 1–10°, depending on the system. Such findings led Majaranta et al., (2009) to suggest putting important information in gaze-controlled systems in the centre of the screen, to give the user a better perceived accuracy.

As most P–CR eye trackers do not report physical pupil size, but pupil size in the eye image, the pupil-size signal is susceptible to viewing direction and distance. Therefore, in experimental designs in which the participant is required to look around the screen, researchers should also be aware of the pupil foreshortening artefact (Brisson et al., 2013; Mathur et al., 2013; Young and Sheena, 1975). As the gaze direction deviates from the eye-tracker camera axis, the image of the pupil in the eye-camera sensor deforms, making the pupil shape appear more oval and the pupil diameter – a common basis for pupil-size measurements – artificially shorter, and pupil area measurements artificially smaller. This is of particular importance for experiments using pupil size as a measure of the participant’s psychological state (e.g. cognitive load or arousal) during free viewing.

Various compensation algorithms have been developed to decrease the pupil foreshortening artefact, for instance relying on a geometrical model (Gagl et al., 2011), or using data from an artificial eye rotating horizontally in front of the screen (Hayes & Petrov, 2016).
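To illustrate the geometric idea behind such compensations: to a first approximation, the pupil image shrinks with the cosine of the angle between the gaze direction (the pupil plane normal) and the eye-camera axis, so dividing the measured area by that cosine restores the frontal-view value. This is our simplified sketch, not the actual algorithms of Gagl et al., (2011) or Hayes and Petrov (2016):

```python
import numpy as np

def corrected_pupil_area(measured_area, gaze_dir, camera_axis):
    """Undo first-order pupil foreshortening (cosine correction)."""
    g = np.asarray(gaze_dir, float)
    c = np.asarray(camera_axis, float)
    cos_theta = abs(g @ c) / (np.linalg.norm(g) * np.linalg.norm(c))
    return measured_area / cos_theta
```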

Distance between participant and eye tracker

The distance between participant and eye tracker needs attention for all eye trackers, remote as well as head-mounted systems. Chatelain et al., (2020) report that when participants are allowed to choose for themselves where to sit in front of a remote eye tracker, the distance to the eye tracker ranges from 40–120cm. This self-preferred range of seating distances is larger than what eye trackers can handle. Most manufacturers of remote eye trackers recommend keeping the distance between the participant and the eye tracker within a narrow range, defined by the optics of the system, with its centre at around 60–70cm (the LC EyeFollower being an exception, with a specified range of 46–97cm). When a participant moves outside of the tracking range, the inaccuracies and noise levels in the data can quickly triple, and data loss also increases (Blignaut and Beelders, 2012; Blignaut & Wium, 2014; Kolakowski & Pelz, 2006; Schlegelmilch & Wertz, 2019).

Restrained vs. free head movements

The history of eye-movement research includes numerous examples of attempts to minimize the participants’ head movements. Often, the use of head restriction is based on the assumption that the recorded data will be of better quality with a restricted head (e.g. van der Laan et al., 2017). Although there is an overall lack of studies on the effect of using chinrests, there are a few indications that they matter: for instance, Hermens (2015) concluded that in some cases the EyeLink II may produce artificial microsaccades due to small head movements, and Cerrolaza et al., (2012) showed that inaccuracies may originate from small stabilizing head movements that participants make. Additionally, Holmqvist et al., (2021) found that recording participants in a chinrest increased the level of noise in some eye trackers.

Head restriction methods can be roughly divided into chinrest, forehead rest, and bite bar/board, the three of which can be combined to prevent both rotation and translation of the head. For some animal participants that take part in concurrent eye-movement and neurophysiological measurements, such as the rhesus macaque, the desire for head-movement restriction from both measurement methods has led to head restraints being surgically attached to the animal’s skull for data collection with video-based eye trackers (McFarland et al., 2013) or they may have scleral coils implanted in their eyes for use with magnetic coil trackers (Kimmel et al., 2012).

The P–CR technique, found in the vast majority of eye trackers today, originally came about to allow some head movement by the participant (Merchant, 1967). While the original P–CR method may handle small movements of the head, of a few millimetres up to a centimetre, recent remote video-based eye trackers are designed to allow free head movements in a much larger space (the headbox, see Fig. 3), tens of centimetres or more across.

One way to accommodate larger head movements is to use a wide-angled eye camera that covers a large space around the participant, combined with a trade-off: the sampling frequency of the eye camera can be increased by reducing the size of the recording window on the camera sensor so that it samples just the eye region. When the participant moves, this recording window must be moved in real time, either on the camera sensor or physically (using a pan-tilt camera, as in the LC EyeFollower). Although moving the recording window allows for larger head movements, this window motion introduces sample dropping (data loss) in some eye trackers (Holmqvist and Andersson, 2017, p. 168). Studying the effect on accuracy, precision, latency and loss of data, Blignaut (2018) found that one or two headbox adjustments per second had no effect on accuracy, but did affect spatial and temporal precision (in the author’s custom-built eye tracker). However, some eye trackers change sampling frequency altogether when the eye is lost in the recording window of the camera sensor and the eye tracker goes into full-sensor search mode (Hessels et al., 2015, Figure 3).

Eye-tracking data quality is best when the participant’s eyes are at the centre of the headbox. Away from the headbox centre, data quality is negatively affected, as experienced by many infancy researchers and investigated experimentally by Hessels et al., (2015) and Niehorster et al., (2018), who found a strong effect of rotating the head on the quality of eye-tracking data in a number of eye trackers. In fact, any relative movement between the eye and the eye camera of the eye tracker can reduce data quality, also in eye-tracking glasses (Niehorster et al., 2020b).

During gaze interaction, the human–computer interaction technique of controlling a computer with gaze, the participant/user has immediate cursor feedback of where the eye tracker estimates gaze to be located. Gaze inaccuracy originating from the users’ movements undermines effective usage. Chinrests are not a solution here, because many users have involuntary head movements or seating positions that make a simple head restriction impossible, requiring a different user interface design (Donegan, 2012). Some users (try to) actively use head movements to compensate for gaze-pointing inaccuracies (Špakov et al., 2014). The authors speculate that this may be common among people with disabilities who actually use gaze control in their everyday life.

For infants, adults with certain disabilities, and animals, head restriction methods are not always practically usable, and alternative methods for reducing head movement are often used. Hessels et al., (2015) compared the eye-tracking data quality of infants recorded in a reclining car seat versus that of infants sitting on the parent’s lap or in a highchair. Accuracy was worse (higher) for infants seated on the parent’s lap or in the highchair than for infants in the car seat. Yet, a participant’s positioning puts additional constraints on the placement of the eye tracker. Hessels and Hooge (2019) found that placing infants in a car seat required the eye tracker to be tilted forward substantially, which might not be possible for some eye trackers without extensive modifications and additional equipment. Similarly, mounting the eye tracker on an adjustable arm has allowed effective gaze interaction for disabled users confined to bed and lying on their back (Blignaut, 2017; Hansen et al., 2011).

Participants

In this section, we review how certain characteristics of participants are related to the quality of recorded eye-tracking data, to eye-movement measures, and to higher-order measures of gaze behaviour. The characteristics we discuss include gender, age, visual acuity, visual aids, physiology of the eye region, mental state (e.g. sleep deprivation, mental fatigue, cognitive workload), expertise, and psychopathology. A complete review of all these characteristics – particularly expertise and psychopathology – is beyond the scope of the present paper. Rather, our goal here is to show that these characteristics may be relevant, so that researchers can take them into account when defining their participant group and exclusion criteria. Whenever possible, we direct readers to more in-depth reviews on the specific topics.

Attrition rate

Attrition rate is operationalised as the proportion (or percentage) of participants who were not included in the analysis. Attrition rate varies widely between studies. For instance, Dalveren and Cagiltay (2019) report an attrition rate of 17.9% for the EyeTribe, while Holmqvist (2015) reports 1.0% for the same eye tracker. Reported attrition rates appear to be lower in studies with adult participants in light-controlled labs, for instance 0–8.2% in Holmqvist (2015), than in recordings made in sun-lit environments, for instance Wang et al., (2010), who report a 32% attrition rate during outdoor driving. Attrition rates may be high for infant studies (59–64% in Burmester and Mast, 2010) and for children on the autism spectrum (100% in Birmingham et al., 2017).

Older remote video-based eye trackers have been reported to have higher attrition values also for lab studies with adults. For instance, Sibert and Jacob (2000) reported 38% attrition rate for ASL Model 3250R, while Schnipke and Todd (2000) reported 62.5% for the ASL 504.

Of the publications in the reporting database (see Section “Reporting practices and existing reporting guidelines” for details), 52.2% report the number of participants excluded from analysis. The main reasons for excluding participants were “data quality” (44.1% of the publications), “impossible to calibrate” (19.8%), “the participant” (12.6%), “other” (7.2%), “error in the experimental procedure” (5.4%), and “failed to follow the instructions” (0.9%). This suggests that poor data quality is the major reason for excluding participants from analysis.

Alternatively, attrition rate can refer to the number or proportion of trials or events per participant that were excluded, for those participants included in the analysis. In the reporting database, 30.9% of the studies reported excluding trials or fixations. Each study reported a slightly different reason for exclusion, many of which relate to data quality, outliers, technical failures or behavioural mishaps.

Gender

There are some reports of differences between genders in gaze behaviour towards other people (Coutrot et al., 2016; Gluckman & Johnson, 2013; Rupp & Wallen, 2007), and in pupil reactions to pain (Ellermeier & Westphal, 1995). Coors et al., (2021) found that although gender-related differences in eye-movement measures (blink rate, smooth pursuit gain) do exist, most are negligible in magnitude.

Ethnicity

Blignaut and Wium (2014) report that, statistically, Asian participants are more difficult to track, and the resulting data are on average of worse quality than for participants of European or African ethnicity (see also Holmqvist, 2015). These findings reflect the generally narrower palpebral aperture in the East Asian population. Amatya et al., (2011) found a larger proportion of express saccade makers in the Asian participant group, indicative of faster saccadic reaction times.

Age

Data quality as well as many eye-movement measures covary with the age of the participant. Firstly, infant researchers have consistently shown that eye-tracking data quality tends to be worse for younger children than for adults. For example, accuracy and precision are generally worse, and data loss is generally greater, for infants and toddlers than for school-aged children and adults (Dalrymple et al., 2018; Hessels et al., 2016, 2019). Interestingly, worse precision in infant eye-tracking data is not due to fixation instability (Seemiller et al., 2018). Moreover, the higher amounts of data loss with infant participants are not only due to infants looking away from the screen more, as the loss is often characterised by short periods (less than 100ms: Hessels et al., 2015; Wass et al., 2014). Neither is it due to blinking, as young children blink significantly less than adults (Stern et al., 1994). In addition, it seems that individual differences in data quality are larger for younger participants (5–10 months) than for older participants (3–9 years, Hessels and Hooge, 2019). The latter is particularly problematic when analysis methods are used that are susceptible to differences in data quality.

The oculomotor system develops into adulthood and old age. The resting pupil diameter has been found to be larger for young adults (around 20 years) than for older adults (around 70 years), independent of luminance level (Bitsios et al., 1996). Saccadic amplitudes have been found to be shorter both for children (below 10 years) and older adults (above 60), compared to young adults (30–40 years, Helo et al., 2014; Açik et al., 2009; Mackworth & Bruner, 1970; Açık et al., 2010). Saccade latencies follow the same pattern, decreasing from childhood into adulthood (Luna & Velanova, 2011; Salman et al., 2006), and then increasing again as participants grow older (Moschner & Baloh, 1994). Smooth pursuit parameters such as latency (time until the movement is initiated) and gain (how closely gaze follows the target velocity) have also been found to be related to age. While latency is longer for older than for younger adults (Sharpe & Sylvester, 1978), gain is closer to the ideal value in young adults compared to children (Luna & Velanova, 2011; Salman et al., 2006).

Binocular coordination during reading is also poorer in children than in adults (Blythe et al., 2006). In a review of the eye movements of the aging reader, Paterson et al., (2020) point out changes at both the lexical level (e.g. the word frequency effect) and the orthographic level (e.g. sensitivity to removal of inter-word spacing). Age variation in fixations and blinks has not been systematically explored outside reading research (Marandi and Gazerani, 2019).

With older age, it is also more likely that the participant will wear spectacles or lenses, or have droopy eyelids, cataracts, an artificial lens from cataract surgery, macular degeneration, peripheral scotomas, or one of several neurodegenerative ailments, all of which tend to make data quality worse, alter eye movements, or both.

Visual acuity and visual impairment

For readers with low acuity, the fixation durations are longer, saccades shorter, and consequently text reading takes much longer (Legge et al., 1997). Furthermore, blurred vision caused by, for instance, myopic refractive error results in an increase of the amplitude of microsaccades (Ghasia & Shaikh, 2015). Eye movements are dramatically different for participants with low vision, i.e. a loss of vision that cannot be corrected by medical or surgical treatments or conventional eyeglasses, such as macular degeneration, scotomas, cataracts, or nystagmus (Leigh & Zee, 2006).

Spectacles, lenses and makeup

Nyström et al., (2013) investigated the effect of eye-region physiology, spectacles and other factors on accuracy, precision and data loss in the SMI HiSpeed1250, finding poorer precision when participants wear spectacles, and poorer accuracy, precision and data loss when contact lenses are worn. In a large follow-up using 12 eye trackers, Holmqvist (2015) reports up to 10° worse accuracy and up to three times (300%) poorer precision for recordings where the participants wore spectacles that were scratched or dirty or that had an anti-reflective coating, compared to recordings where no visual aids were used. Data recorded from participants wearing soft contact lenses exhibited 0.5–3° poorer accuracy and on average 20–40% poorer precision, compared to when participants wore no visual aid. Asking a participant to remove their spectacles to record data of better quality might result in poorer acuity, which may in turn alter the eye movements (see above).

Makeup (eyeliner, eye shadow and mascara) results in poorer accuracy by 0.2–3°, and up to three times poorer precision (Holmqvist, 2015). For participants with forward- and downward-pointing eyelashes, makeup results in poor data quality (see also Nyström et al., 2013). Mascara is black in both infrared and visible light, and Holmqvist and Andersson (2017, Figure 5.5) show eye images from actual recordings that depict how the dark mascara may interfere with the pupil-centre calculation.

Physical properties of the eye region

Differences in eye physiology include eye colour, lash direction, ocular dominance, baseline pupil size and more. Holmqvist (2015), Hessels et al., (2015), and Nyström et al., (2013) investigated the relation of data quality to physical properties of eyes, in large groups ranging between 75 and 194 participants, in up to 12 eye trackers, and reported compatible findings. In this subsection, we report effect sizes from these three studies, as ranges across the many eye trackers.

Holmqvist (2015) found that darker pigmentation in hair, eyes and skin correlates positively with better (lower) accuracy on most video-based eye trackers (0.5–1°), and also with better precision (20–80% lower RMS-S2S). The advantage of dark iris pigmentation over blue eyes has been hypothesised to result from poor contrast between pupil and iris when the eye image is recorded in infrared light: a blue iris is dark, while a brown iris is bright (Holmqvist and Andersson, 2017, Figure 4.13), providing a clearer contrast between the iris and the dark pupil, which the image processing algorithms can make better use of.

Clinical participant groups may have features in their irises that make tracking more difficult for some eye trackers. For instance, participants who lack an iris (a condition known as aniridia; Beby et al., 2011) are likely difficult to record with P–CR trackers. Participants with Williams syndrome have a stellate pattern in the iris (Tran & Kaufman, 2003) that could interfere with the CR image of P–CR trackers. These iris features are often associated with specific eye movements. For instance, participants with albinism may have transillumination effects in their irises, and their lack of pigmentation in skin and in the retina is associated with congenital nystagmus (Collewijn et al., 1985).

A smaller baseline pupil results in better accuracy (up to 2°) and up to three times poorer precision (Holmqvist, 2015). Interocular distance is defined as the distance between pupil centres when looking straight ahead. Holmqvist (2015) found poorer accuracy (0.5–1.0°) for small interocular distances, but only in remote eye trackers.

A larger eye opening (also ‘palpebral fissure’ or ‘eye cleft’) correlates with better accuracy: up to 1° better in fully open eyes compared to eyes with the smallest palpebral fissure. Forward- or upward-pointing lashes show the best accuracy, while downward-pointing eyelashes, which Holmqvist (2015) found in about 10% of their 194 participants, exhibit poorer accuracy (up to 4°) and precision, although some eye trackers are more affected than others. A more closed eye is more likely to block the eye tracker’s view of the pupil and CR features, but this depends on the geometry of the setup, in both remote and head-mounted systems.

Arousal, mental fatigue and cognitive workload

Ayres et al., (2021) present a meta-study of 33 experiments and conclude that eye-movement measures of cognitive load are more sensitive than heart, skin, and brain measures. Mental workload and arousal are positively associated with pupil dilation as shown in a large number of controlled studies and life-like human factors studies, measured using high- or low-end eye trackers (Einhäuser, 2017). Examples include performing a memory task (Kahneman and Beatty, 1966), arithmetic tasks (Ahern & Beatty, 1979; Hess & Polt, 1964), Air Traffic Control (Ahlstrom & Friedman-Berg, 2006), (simulated) driving (Čegovnik et al., 2018), tasting a disgusting drink (Kaneko et al., 2019) and social stress caused by having to sing a song (Toet et al., 2017). Other parameters of eye movement behaviour can be affected as well, but this seems to be context or task dependent. For instance, for blinking rate, Recarte et al., (2008) and Čegovnik et al., (2018) found an increase with increasing workload, whereas Brouwer et al., (2014) found no effect; and Bauer et al., (1987) and Fogarty and Stern (1989) found a decrease in blinking rate with increasing workload. This variation in results may be caused by the differences in the workload-inducing task across these studies.

Workload has also been reported to decrease microsaccade rates but increase their amplitudes (Siegenthaler et al., 2014), to increase fixation duration (Rayner & Pollatsek, 1989), and to decrease horizontal scanning during driving (Recarte & Nunes, 2003). Mental fatigue and workload have been found to affect saccade and microsaccade dynamics during visual search (Di Stasi et al., 2013), surgery (Di Stasi et al., 2014) and for pilots suffering from low levels of oxygen (Di Stasi et al., 2014). When researchers investigate workload, these eye-movement measures are often combined. For instance, Van Orden et al., (2000) developed a regression model from eye-movement data on a surveillance tracking task, showing that fixation duration, blink duration and mean pupil dilation combined into a robust and reliable predictor of surveillance-tracking performance.

Sleep deprivation

Many studies have reported effects of partial and total sleep deprivation on eye movements. Sleep deprivation is known to result in increased saccadic latency and reduced saccadic peak velocity and smooth pursuit velocity, as well as more antisaccade errors (Ahlstrom et al., 2013; Fransson et al., 2008; Meyhöfer et al., 2017). Furthermore, Schalén et al., (1983) present data showing that saccadic and smooth pursuit peak velocity may vary with the circadian rhythm.

Moreover, sleep deprivation has been shown to cause mental fatigue and affect a myriad of cognitive domains such as memory (Van Der Werf et al., 2009), cognitive speed (Van Dongen and Dinges, 2005) and arousal (Gunzelmann et al., 2007), which in turn may affect eye movements.

Expertise

Many eye-tracking studies of expertise have been conducted. Good overall reviews are provided by Reingold and Sheridan (2011) and Gegenfurtner et al., (2011). For instance, expert chess players tend to have fewer, longer fixations in the middle of the board, while novices scan more (Charness et al., 2001). Expert radiologists tend to fixate abnormalities earlier than novices (Nodine et al., 2002; Alexander et al., 2020). Even the ability to keep one’s eye still is affected by training and experience (Cherici et al., 2012; Di Russo et al., 2003). In medical expertise research, a lack of experience or familiarity with the task has been correlated with blink rate and duration, fixation duration, transition rate, and pupil dilation (Lee et al., 2019, 2020). Machine learning approaches have been used to differentiate between levels of language proficiency (Karolus et al., 2017). Findings in expertise studies do not easily transfer to other domains of expertise. One and the same participant can be an expert in one task while having no expertise in a closely related task (Kevic et al., 2015). In fact, the participant’s field of expertise, the task, and the stimulus are crucial determinants of what effect can be expected in terms of eye movements.

Pathology and personality

Several different psychiatric disorders have independently been found to coincide with oculomotor impairments with medium-to-large effect sizes, although these depend on diagnosis and experimental task (Alexander et al., 2018; Smyrnis et al., 2019). For instance, patients with schizophrenia reliably show reduced smooth pursuit accuracy (reduced gain, increased root-mean-square error of the signal, increased frequency of saccades during pursuit). In a meta-study on the eye movements of patients with schizophrenia, O’Driscoll and Callahan (2008) stated that “Average effect sizes and confidence limits for global measures of pursuit and for maintenance of gain place these measures alongside the very strongest neurocognitive measures in the literature.” (p. 359). Patients with schizophrenia also reliably show increased rates of direction errors on the antisaccade task. Similar impairments, albeit with smaller effect size, are observed in patients with bipolar disorder or major depressive disorder (Katsanis et al., 1997).

Differences in gaze behaviour between individuals with and without a diagnosis of autism spectrum disorder (ASD) have also been substantially investigated (see e.g. Bast et al., 2021; Guillon et al., 2014; Sasson et al., 2011). One often-reported finding is differences in gaze behaviour to the eyes of a face between individuals with and without an ASD diagnosis (e.g. Dalton et al., 2005; Jones et al., 2008, 2013; Klin et al., 2002; Rice et al., 2012). However, these findings are not unequivocal (see e.g. Dapretto et al., 2006; McPartland et al., 2011; van der Geest et al., 2002). Several potential explanations have been posited for the inconsistent findings, including the presence of alexithymia (Bird et al., 2011) and the cognitive demand required in the experimental setting (Senju & Johnson, 2009). A meta-analysis of 122 studies on gaze differences to social and non-social information between people with and without autism is given by Frazier et al., (2017). Other reported differences include eye movements during visual search (e.g. Keehn and Joseph, 2016; Kemner et al., 2008) and attentional disengagement (e.g. Keehn et al., 2013).

Furthermore, Alzheimer’s disease (Kapoula et al., 2014), Parkinson’s disease (Otero-Millan et al., 2018) and Huntington’s disease are known to affect several characteristics of eye movements (Leigh & Zee, 2006).

Variation in human personality has been associated with eye movements (Bargary et al., 2017) and with gaze patterns to social stimuli (Wu et al., 2014).

Medication and drugs

In studies that investigate differences in eye-movement measures between clinical and control groups, where patients may be under medication, the question may arise whether it is the psychopathological state or the medication that drives the difference. For example, benzodiazepine drugs cause reduced saccade peak velocity (De Visser et al., 2003) as well as increased saccade latency and reduced spatial accuracy of saccades (Ettinger et al., 2018). Measures of intra-individual variability of saccades are also increased. Benzodiazepines also reliably reduce smooth pursuit velocity (Karpouzian et al., 2019).

Even in non-clinical trials, drug use may be a consideration. Acute consumption of nicotine may improve smooth pursuit accuracy, reduce catch-up saccades (Meyhöfer et al., 2019; Avila et al., 2003) and may reduce antisaccade latencies as well as the rates of direction errors in the antisaccade task (Ettinger & Kumari, 2019). Cannabis has the opposite effects to nicotine: latencies and errors in the antisaccade and memory-guided saccade tasks are increased, and saccade peak velocity is lower (Huestegge et al., 2009). Pupil size is affected by some drugs (Newmeyer et al., 2017). Increased blood alcohol levels impair the quality of smooth pursuit (Flom et al., 1976; Wilkinson et al., 1974), decrease saccade velocity (Lehtinen et al., 1979) and increase fixation durations (Moser et al., 1998). Alcohol also affects gaze behaviour. For instance, Buikhuisen and Jongman (1972) presented a traffic film containing 86 important events to participants while tracking their eye movements. Alcohol-intoxicated participants fixated on fewer events than non-intoxicated participants, especially events located away from the centre of the display.

Calibration and accuracy

Calibrating the eye tracker for the specific participant is a prerequisite for recording gaze in some eye trackers, and for optimal accuracy in all eye trackers. In this section, we first describe the procedure and principles of calibration generally, how to assess a calibration and how to correct for poor accuracy; we then describe methods for calibrating challenging participants, such as infants, dogs, and people with nystagmus. These methods all aim to ensure the best possible accuracy.

How is calibration done?

Just before or at the beginning of a recording session, participants typically need to perform a small initial task of looking at a set of pre-defined targets that either appear on, or smoothly move across, the stimulus monitor, or are otherwise presented in front of the participant. If the recording is made within the software of a video-based P–CR tracker, the eye tracker registers the relative positions of image features (such as P and CR) while the participant fixates each calibration point. Quite often, the researcher may choose how many targets (often points) will be shown during this initial phase, and in some cases where the targets appear and what they will look like. For most other technologies (DPI, coils, EOG, etc.), calibration needs to be done with custom software and will likely also involve looking at or following fixation targets.

Fixation targets

The choice of calibration target may have an effect on the data quality in the subsequent recording. Thaler et al., (2013) examined which fixation target results in the least dispersion during fixation for adult participants, while Schlegelmilch and Wertz (2019) investigated the effects of calibration targets on the dispersion of the gaze position signal of the EyeLink 1000 Plus, for infant research. Whether showing a calibration target that minimises dispersion will result in better accuracy is unknown.

Colour and luminance of the background

Previously referenced studies on the pupil-size artefact (Section “Environment”) tell us that changes in pupil size will affect the accuracy of the gaze position signal. Thus, calibrating at a different luminance from the luminances displayed during data collection is likely to affect the accuracy of the measurement. If stimuli vary in luminance, it may be useful to calibrate for a range of pupil sizes (Drewes et al., 2012).

Which data segment to use for the calibration?

The eye-tracking software, whether manufacturer-supplied or custom-tailored, selects the segment of data for which it estimates that the participant was looking at the calibration target. This decision is mostly made by the software itself (Hansen & Ji, 2010). Nyström et al., (2013), however, showed that accuracy is better when the participant indicates that they are looking at the fixation target than when this decision is left to the system. This finding also relates to the idea behind the participant-controlled post-calibration of Ko et al., (2016). However, participant-controlled calibration does not appear to be the standard in most eye-tracking software today.

Number of targets and the mathematics of calibration

Akkil et al., (2014) reported for the Tobii T60 that calibrating with 9 points results in better accuracy than using 5 or 2 points, with a difference of about 0.2° between the 9-point and the 2-point calibrations.

In a number of video-based eye trackers (most SMIs, all EyeLinks, and many Tobiis, for instance the T60), the calibration involves finding a best fit between the sensor values (P and CR positions in the eye camera, for instance) and the spatial positions of the calibration points. The exact polynomials used in these equations vary by manufacturer, but also with the number of calibration points. Thus, it is important to realise that the choice of a specific number of calibration points in the eye-tracker manufacturer software is also a choice of a specific set of equations used in the calibration procedure. Each set of polynomial equations may result in different accuracy values for the same eye-movement data (Blignaut and Wium, 2013; Blignaut, 2014; Cerrolaza et al., 2012).
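To make the fitting concrete, the following is a minimal sketch of such a polynomial calibration, assuming a second-order bivariate polynomial and ordinary least squares; manufacturers’ actual polynomials and fitting procedures differ and are often proprietary, and the function names here are ours.

```python
import numpy as np

def fit_calibration(pcr, screen):
    """Least-squares fit of a second-order bivariate polynomial that
    maps pupil-minus-CR vectors to screen coordinates.

    pcr, screen: (N, 2) arrays of sensor values and target positions.
    With six unknowns per screen axis, at least six calibration points
    are needed; nine points over-determine and stabilise the fit.
    """
    x, y = pcr[:, 0], pcr[:, 1]
    # Design matrix with terms 1, x, y, xy, x^2, y^2
    A = np.column_stack([np.ones_like(x), x, y, x * y, x**2, y**2])
    coeffs, *_ = np.linalg.lstsq(A, screen, rcond=None)
    return coeffs  # shape (6, 2): one column per screen axis

def apply_calibration(coeffs, pcr):
    """Map new sensor values through the fitted polynomial."""
    x, y = pcr[:, 0], pcr[:, 1]
    A = np.column_stack([np.ones_like(x), x, y, x * y, x**2, y**2])
    return A @ coeffs
```

Choosing a different number of calibration points would, in manufacturer software, typically also change the set of polynomial terms, which is why accuracy can differ between, say, 5-point and 9-point calibrations of the same data.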

Modelling the 3D shape of the eyeball is possible when multiple cameras and/or multiple corneal reflections are employed. Theoretically, the minimum number of calibration points is one, and this point is needed to measure the difference between optical and visual axes (Guestrin & Eizenman, 2006; Hansen & Ji, 2010). Recently, some manufacturers have developed calibration methods that model the eyeball more extensively. In particular, the curvature of the cornea is an important part in these calibration models, which have been used in eye trackers such as the SMI glasses, many Tobii eye trackers (US Patent US7,572,008), and in the open-source eye tracker by Barsingerhorn et al., (2018).

Calibration software is not supplied with every eye tracker. For instance, the DPI eye trackers require the researcher to employ custom-built calibration algorithms to establish the mapping between sensor values and points on the monitor. Holmqvist (2015) used a RANSAC fit (Fischler and Bolles, 1981) followed by a linear shift to calibrate the DPI.

Using the calibration of another participant

There are also examples of researchers calibrating their eye tracker on a person other than the actual participant, when the actual participant is difficult to calibrate. For example, Kulke (2015) calibrated on adults and then recorded infants by reusing that adult calibration, arguing that this procedure improved data quality compared to calibrating on the infants themselves. Indeed, Harrar et al., (2018) present data showing that this practice does not introduce non-linearities (variations in accuracy over space), but also find that calibrating on one person and recording on another led to accuracy that was poorer by 2–4°. Similarly, researchers recording with artificial eyes may calibrate on themselves before recording with the artificial eye. Holmqvist and Blignaut (2020) show that no noticeable non-linearities appear in the data when using a human calibration for a subsequent recording with artificial eyes, but also note that accuracy is likely to be poorer.

Validation of the calibration

Present eye-tracker vendor software almost always reports accuracy after each calibration, recorded on validation points immediately after the calibration sequence. If the accuracy is not sufficient after the first calibration, commercial recording software may allow the operator to recalibrate several times, and select the calibration with the best accuracy in the validation test.

Post-calibration correction

Although it is rarely done, poor accuracy after calibration can also be improved using a post-calibration correction. This procedure involves a second round of looking at points. For instance, Blignaut et al., (2014) used a regression model to improve accuracy by 0.3–0.6°. A correction can also be made by letting the participant manually guide an online, calibrated, gaze-contingent visualisation of raw gaze samples to fall exactly in line with their gaze (Poletti and Rucci, 2016), i.e. until these samples are projected onto the centre of the fovea, and then push a recalibration button; in their study, this improved the already very accurate DPI by a factor of 2.

Drift, and methods for drift correction

Accuracy that worsens over time is often called drift (not to be confused with oculomotor drift), irrespective of its source: small body adjustments, head-mount slippage, changes in pupil size, or some change in the hardware or software setup. Head-mount slippage could be the reason that the SMI EyeLink I and the SR Research EyeLink II were known to be so drift-prone that most researchers used to adjust their calibration, via a one-point drift correction, once before each trial (e.g. Greene & Rayner, 2001). Although drift refers to accuracy, other measures may also be affected by long recordings. For instance, Hessels et al., (2015) and Wass (2014) report a decline in precision from an early trial to a later one.

It is not known how much drift there is in current eye trackers, which are often sold as “drift free” (SR Research, 2017, p. 24), but a certain drift still exists in some instruments. Nyström et al., (2013) report a 0.2° drift during a 15-min reading task with the SMI HiSpeed 1250, and Choe et al., (2016, Figure 2) show drift due to the pupil-size artefact. Ko et al., (2016) found that the DPI and coils recording artificial eyes drift by around 0.03’ per minute. Drift happens not only in long recordings, but also in cases where the recording does not immediately follow calibration: Chatelain et al., (2020) found that when recording participants on the Tobii 4C in sessions over one month with no recalibrations, accuracy degraded by 0.30° + 0.13°/month, i.e. the initial drop in accuracy is the largest.

Drift correction procedures involve re-calibrating with a single point, shifting all subsequent data by the measured offset. Later EyeLink models offer drift checks in which the offset between gaze cursor and target is assessed, and the experimenter can optionally apply a linear shift to the estimated gaze. In infant research, Constantino et al., (2017) implemented automatic drift correction on the fly, using an appearing fixation target and a criterion on accuracy. Jones et al., (2014) instead used a happy face and a probability calculation that decided whether the infant had fixated the face even when the eye tracker recorded the contrary, in which case an automatic drift correction was made. The threshold for when to perform drift correction may impose a maximum on the allowed inaccuracy. However, this is not the same as the empirically determined accuracy, and there is no guarantee that a central drift correction will improve accuracy at more peripheral points. When the user has a visible gaze cursor, as with users of gaze-controlled computers, Graupner and Pannasch (2014) show that users can learn to take advantage of the visible cursor as a cue to understand variations in accuracy over space, and choose to recalibrate when it is needed for the functionality they want.
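At its core, a one-point drift correction is a simple translation. The sketch below is our own illustration, not any vendor’s implementation: it shifts the gaze signal by the offset measured while the participant fixates a known drift-check target.

```python
import numpy as np

def one_point_drift_correction(gaze, target, fixating_idx):
    """Shift all gaze samples by the offset measured at a single
    drift-check target.

    gaze:         (N, 2) array of recorded gaze positions
    target:       (2,) known position of the drift-check point
    fixating_idx: indices of the samples recorded while the
                  participant fixated the target
    """
    offset = np.asarray(target) - gaze[fixating_idx].mean(axis=0)
    return gaze + offset  # the same linear shift is applied everywhere
```

Note that such a global shift can at best correct a uniform offset; it cannot repair inaccuracy that varies over the screen.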

If accuracy is found to be poor after the recordings are completed, while inspecting the data as scanpath plots, the EyeLink Data Viewer by SR Research offers the possibility of ‘performing drift correction on fixations’ by simply grabbing any fixation or group of fixations and pulling it to a new position. A simple test reveals that saccade amplitudes and velocities also change during these data-editing operations, not only the fixation positions themselves (Data Viewer 3.1.97). The Data Viewer manual states that when batch-moving fixations like this, a movement of more than 30 pixels is not acceptable; however, for those users who want to move fixations further, the 30-pixel setting can easily be changed. Later, SMI also started offering this feature in the BeGaze software, and it is also possible in OGAMA (ogama.net). Note that the researcher has to be very careful not to move fixations in favour of a hypothesis, to avoid subsequently arriving at faulty conclusions.

This practice is mostly relevant for text reading, in particular when participants read more than one line of text. Cohen (2013, p. 677) comments on practices in reading research that “Fixations are typically corrected manually, sometimes within a program such as EyeDoctor” (https://blogs.umass.edu/eyelab/software/, accessed 10-03-2021). Alternative software solutions for re-aligning inaccurate gaze data to lines of text are offered by Cohen (2013), Hyrskykari (2006) and Špakov et al., (2019).

Dragging fixations in place has also been applied in infant research (Frank et al., 2012; Kooiker et al., 2016). Manual post hoc calibration was commonplace in nystagmus research in the past, and tended to be based on finding the fixation periods of the nystagmus waveform and using those gaze locations for the re-alignment (Dell’Osso, 2005).

Binocular calibration

Recording from the participant’s dominant eye results on average in 0.2° better accuracy and also better precision (Holmqvist, 2015; Nyström et al., 2013), as compared to recordings from the non-dominant eye. This difference in data quality between the dominant and non-dominant eye leads to one consideration when calibrating for binocular recordings: whether to calibrate both eyes simultaneously or to instead calibrate the two eyes separately, patching one while calibrating the other. Calibrating both eyes at once, binocularly, may give an erroneous (absolute) disparity value because the calibration procedure assumes that both eyes are directed towards the calibration point, when in fact one eye may be slightly off. Nuthmann and Kliegl (2009) nevertheless calibrate both eyes simultaneously, arguing that they can still correctly measure relative changes in disparity. Švede et al., (2015) and Liversedge et al., (2006) recommended a separate monocular calibration for each eye when using binocular recordings to investigate the absolute disparity between the two lines of gaze. This is done by covering one eye, calibrating the other, and then switching.

Calibration of special populations

Researchers working with participant populations other than young adults, such as infants or animals, will likely be faced with additional challenges during calibration. This may be due, for example, to these participants not being able to respond to verbal instructions. While some animals can to a degree be trained to remain still and to look at the desired calibration target (Park et al., 2020), infants and some monkeys can be nudged to look at the desired point by using contracting and dilating images, or by using transient appearances of calibration targets on screen (e.g. Hessels et al., 2015; Jones et al., 2014).

Patients with age-related macular degeneration have difficulty foveating calibration targets (because they have no or reduced foveal vision). Harrar et al., (2018, p. 9) suggest using the calibration of another person and found that accuracy degrades by 4–8° with this method, but that it does not introduce non-linearities.

Calibrating an eye tracker for participants with an unstable gaze, such as nystagmus or continuous square wave jerks, presents the problem that as they look at a calibration point their eyes will not be still. For these participant groups, researchers have developed dedicated calibration routines specific to the particular oculomotor condition (Dunn et al., 2019; Rosengren et al., 2020). Note that not all eye trackers allow for these calibration routines, e.g. when a standard calibration procedure has to be performed before a recording can commence. Eye trackers that can record without explicit calibration include the DPI and scleral coils (Holmqvist and Andersson, 2017, pp. 214–217) and some P–CR eye trackers.

Features of the experiment

Here, we address only those aspects of experimental design that may be specifically relevant or problematic in the context of eye-tracking research such as the operator skill level, eye-movement measures used as dependent variables, the number of trials and experiment duration.

Operator skill level

By operator we mean the person (researcher or research assistant) who records data from the participant. Nyström et al., (2013) report an advantage of 0.2° in the accuracy recorded by experienced operators compared to inexperienced ones, whereas Hessels and Hooge (2019) report that experienced operators tend to succeed in calibrating difficult participants where inexperienced operators give up, and point out that training operators could have a beneficial effect on data quality.

The instruction to participants

Task instructions have a strong influence on eye-movement behaviour, as elegantly shown by Buswell (1935, p. 136) and Yarbus (1967, p. 174). The instruction to the participants is part of the experimental design, and can be used actively to drive participant behaviour. However, small differences in wording may have unexpected effects, and the exact instruction may need to be verified during piloting. For instance, asking participants to “fixate” rather than “hold the eyes still” reduces the rate of microsaccades (Poletti & Rucci, 2016), and Enright and Hendriks (1994) found that “staring” differs from “scrutinizing”, in that the latter involves a larger net muscular force exerted on the eye by the opposing rectus muscles, pulling the eyeball backward in its socket.

Trial durations and trial-by-trial effects

Besides the fact that data quality seems to be worse after longer periods of time (Section “Calibration and accuracy”), the duration of trials and experiments is relevant for other reasons as well. For instance, during scene viewing, fixations tend to be shorter and saccade amplitudes longer during the first second or two of a trial. This can be interpreted as an initial overview/ambient scan followed by detailed/focal inspection, shown by Tatler and Vincent (2008), Unema et al., (2005) and Buswell (1935) for free-viewing, by Scinto et al., (1986) for visual search, and by Over et al., (2007) for visual search and free viewing. This implies that when trials vary in duration, the mean fixation duration for long trials may be longer than for short trials, irrespective of other factors. Also, when comparing mean fixation durations across short trials with few saccades, one should consider excluding the initial fixation, because initial fixation durations are longer than subsequent fixation durations (Hooge and Erkelens, 1996; Zingale & Kowler, 1987). This also holds for infant participants (Hessels et al., 2016).

A technical trial-by-trial effect is that the duration of the initial fixation of a sequence may not reflect the whole duration of that fixation, because it started before the trial did and was cut in two by the trial change. In the visual-cognition literature, the first and last fixations of a trial are therefore typically discarded when analysing fixation durations (e.g. Nuthmann, 2013), as sketched below.
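A minimal sketch of that exclusion convention, assuming fixation durations have already been computed per trial (the function name is ours):

```python
def mean_fixation_duration(fixation_durations_per_trial_ms):
    """Mean fixation duration over trials, discarding the first and
    last fixation of each trial, whose durations are distorted by the
    trial boundaries."""
    kept = [d for trial in fixation_durations_per_trial_ms
            for d in trial[1:-1]]
    return sum(kept) / len(kept) if kept else float("nan")

# Two example trials (durations in ms): only 230, 250 and 210 are kept
print(mean_fixation_duration([[410, 230, 250, 180], [390, 210, 260]]))
# -> 230.0
```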

Tatler and Hutton (2007) found trial-by-trial effects in the antisaccade task: both error rates and latencies increased on trials following a trial with an erroneous antisaccade. Switching from making an antisaccade in one trial to making a prosaccade in the next involves a cost in increased latency of the prosaccade (Tari et al., 2019). Similarly, a saccade to a location that was fixated at the end of the previous trial may be preceded by a prolonged fixation (Carpenter, 2001); such carry-over may affect latencies and fixation durations in the current trial.

Eye-movement measures as dependent variables

In some research fields, the choice of appropriate eye-movement measures and the range of task parameters for the study at hand is either straightforward or very well established. This is for instance the case in reading research (Clifton et al., 2007), and for studies employing the antisaccade paradigm (Antoniades et al., 2013).

In some applied research fields, however, measure selection is anything but obvious, and the terminology of measures is confusing (e.g. Sharafi et al., 2015). A line of publications may become accustomed to a choice of measures that later turns out to be unfortunate. See for instance Šmideková et al., (2020) for a discussion of the selection of measures for research in classroom management.

Naming of events is also variable. What some know as saccade latency (Holmqvist and Andersson, 2017, p. 580) is sometimes termed saccade reaction time or calculated as time to first fixation (Tatham et al., 2020). Fixation duration is sometimes called ‘fixation time’, but also ‘dwell time’, or ‘dwell time of the fixation’. Oster and Stern (1980) used the terms saccadic reaction time and intersaccadic interval for fixation duration. The original term was ‘pause time’ (Erdmann & Dodge, 1898), and the term ‘pause duration’ was used long into the 1940s.

Terminology for the dwell time measure also varies. In some parts of human factors research, the dwell time measure is called ‘glance duration’ (Horrey & Wickens, 2007), while Loftus and Mackworth (1978) used the term ‘duration of the first fixation’ for the first dwell time in an AOI. Terms like ‘observation’ and ‘visit’ can also be found. In reading and some parts of scene perception research, dwell time is often called ‘gaze duration’ or ‘regional gaze duration’, and ‘first-pass fixation time’ when the AOI consists of two words (Clifton et al., 2007).

Signal properties and processing

In this section, we discuss the properties and processing of the stream of data from the eye tracker, such as gaze position signals, time stamps, pupil-size signal, and more.

Sampling frequency

Sampling frequency (also temporal resolution) is the number of measurements per second. The sampling frequency of modern video-based eye trackers ranges from 30 to over 2000Hz. Some eye trackers, like the DPI, scleral search coils and some other analogue systems, have no inherent sampling frequency. Instead, their analogue signals may be digitized at any desirable frequency up to at least 10000Hz; Collewijn (2001) remarked that “The choice of 10000Hz followed from the general rule that the (temporal) resolution of a measurement should preferably be an order of magnitude better than the expected effect.” (p. 3417). For video-based eye trackers, the video camera and its settings determine the sampling frequency.

Sampling frequency is one of the most highlighted properties of modern eye trackers, often being either a part of, or mentioned directly in connection to, the model name. The competition for higher sampling frequencies has made some manufacturers of video-based eye-tracking systems with multiple cameras interleave image acquisition to achieve higher effective sampling rates. For instance, the Tobii Glasses 2 have two cameras per eye, each sampling the eye at 50Hz. This system is made into a 100Hz eye tracker by alternately sampling each camera. However, the alternating samples are offset in the resulting data, yielding a zigzag pattern that is very common in 100Hz data from Tobii Glasses but does not happen in 50Hz data (see Figure 11 in Niehorster et al., 2020b). The EyeFollower from LC Technologies uses two 60Hz cameras, one per eye, to achieve a net gaze sampling rate of 120Hz by alternatingly sampling the right and left eyes.

In theory, high sampling rates combined with low velocity noise would allow very precise determination of velocity and acceleration, and therefore facilitate more precise determination of the on- and offsets of fixations, saccades and other events. This would obviate the need for filtering, and for averaging measures such as saccade latency or fixation duration over large numbers of trials; such numbers are difficult to obtain from patients and other groups that provide only small samples.

In practice, however, eye trackers exhibit a large variation in both sampling frequency and precision. Research on the relation between eye-tracking measures and sampling frequency shows that some outcome measures (e.g. fixation durations) are less sensitive to sampling frequency, whereas others (e.g. saccadic peak velocity) are more so.

For instance, Andersson et al., (2010) quantified the effect of sampling frequency on event durations, such as fixation durations, in a series of simulations and tests on human eye-movement data. They also provided estimates of the number of measurements that are required to average out the mis-estimations of the on- and offset of fixations due to a low sampling frequency.
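In the same spirit (though this is our own toy simulation, not the procedure of Andersson et al., 2010), the following shows how the estimated duration of a fixed-length event depends on the random phase of the sample clock: the error averages out over many measurements, but the single-measurement error grows as the sampling frequency drops. The 203.4-ms example duration is arbitrary.

```python
import numpy as np

rng = np.random.default_rng(0)

def simulated_duration_estimates(true_duration_ms, fs_hz, n=100_000):
    """Estimate an event's duration as (number of samples falling
    inside the event) x (sampling interval), with the delay between
    event onset and the first sample drawn uniformly at random."""
    dt = 1000.0 / fs_hz
    first_sample_delay = rng.uniform(0.0, dt, n)
    n_samples = np.floor((true_duration_ms - first_sample_delay) / dt) + 1
    return n_samples * dt

for fs in (50, 250, 1000):
    err = simulated_duration_estimates(203.4, fs) - 203.4
    print(f"{fs:5d} Hz: mean error {err.mean():+5.2f} ms, "
          f"SD {err.std():5.2f} ms")
```

At 50Hz a single event's estimate can be off by up to one sampling interval (20ms), while the mean over many measurements stays close to the true duration, which is the essence of averaging out sampling-induced misestimation.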

Saccadic peak velocity measures are more dependent on sampling frequency, but exactly how much more is a matter of debate. Wierts et al., (2008) showed that although a 50Hz eye tracker cannot provide accurate saccadic peak acceleration/deceleration values, it can be used to accurately measure peak velocities without aliasing, provided that saccades are at least 5° in amplitude. Inchingolo and Spanio (1985) used a 200Hz EOG system and found that saccade duration and velocity values in those data were comparable to those obtained with a 1000Hz system, as long as the saccades were larger than 5°. However, using EOG- and photoelectric eye-tracking systems to study 20° saccades, Juhola et al., (1985) provided evidence that the sampling frequency should preferably be higher than 300Hz in order to reliably calculate saccadic peak velocity. Mack et al., (2017) replicate the finding that peak saccade velocity estimates are more inaccurate at lower sampling frequencies. Unfortunately, these somewhat contradictory results are made more difficult to interpret by differences in the precision of the eye trackers, in how velocity is calculated, and in whether filters were involved in the velocity calculation. The observation that both DPI and P–CR technologies misestimate saccade velocity (e.g. Hooge et al., 2016) further complicates the interpretation of these studies.

Temporal precision

Temporal precision is the variation in inter-sample durations. Perfect temporal precision means that samples always arrive after exactly the same time interval. When temporal precision is poor, there could sometimes be, for instance, 33ms between samples, and at other times 43ms (actual intervals found in data from an EyeTribe, Holmqvist and Andersson, 2017, p. 193). This is indicative of an unstable sampling frequency, the explanation for which could lie in small head movements, the camera type and transfer protocols, as well as image processing. Examples of eye trackers with unstable sampling frequencies include the EyeTribe (Ooms et al., 2015), the Pupil Labs 240Hz (Ehinger et al., 2019), the Tobii 1750 (Shukla et al., 2011), the SMI REDm 60/120, and the SMI RED 250 (Hessels et al., 2015). Some implementations of algorithms for filtering, velocity and acceleration calculation, as well as event detectors, may assume a stable sampling frequency, and may thus not be suitable for data with unstable sampling frequencies.
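Temporal precision can be checked directly from the recorded timestamps. A minimal sketch (our own, assuming timestamps in milliseconds):

```python
import numpy as np

def inter_sample_intervals(timestamps_ms):
    """Summarise the inter-sample intervals of a recording. A near-zero
    SD indicates stable sampling; bimodal intervals (e.g. 33 vs 43 ms)
    reveal an unstable sampling frequency."""
    isi = np.diff(np.asarray(timestamps_ms, dtype=float))
    return {"mean_ms": isi.mean(), "sd_ms": isi.std(),
            "min_ms": isi.min(), "max_ms": isi.max()}

# Nominally 30 Hz data with one late sample
print(inter_sample_intervals([0.0, 33.3, 66.7, 109.9, 143.2]))
```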

Spatial precision

Precision ranges reported in the publications of Table 2 vary between eye trackers by a factor of 100 or more (median RMS-S2S deviation 0.001°–0.75°). Precision varies little with calibration, and can be calculated from participants (and artificial eyes) without their cooperation. Precision calculations can be made in many different ways (Niehorster et al., 2020c). The resulting precision values change when the gaze signal is filtered with the built-in manufacturer filters (Niehorster et al., 2021).

Precision recorded with human eyes is often worse (e.g. higher RMS-S2S deviation) than precision recorded with artificial eyes (Holmqvist et al., 2021; Niehorster et al., 2020c), but different artificial eyes may also result in different precision levels.

Niehorster et al., (2020c) investigated how four different precision measures correlate, depend on sampling frequency, and express different properties of the signal. In particular, RMS-S2S deviation reflects the velocity of the noise in the signal, while the STD (standard deviation) and BCEA (bivariate contour ellipse area, Steinman, 1965; Crossland and Rubin, 2002) of the gaze signal are measures of the dispersion of gaze samples. The slope α of the power spectral density instead measures the colour of the noise, as does RMS-S2S divided by STD (for the same gaze data).

Together, these four measures allow for a more complete characterization of the precision in gaze data from an eye tracker. Niehorster et al., (2020c) provide code to generate noise based on this characterization. Adding synthetic noise to data is a method to test event detectors, and can also be used to provide identification privacy in future consumer products with inbuilt eye-tracking systems (Liu et al., 2019).
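As an illustration of two of these measures (this is not the reference implementation accompanying Niehorster et al., 2020c), the following computes RMS-S2S deviation and pooled STD, whose ratio is approximately √2 for white noise:

```python
import numpy as np

def rms_s2s(x, y):
    """Root-mean-square of sample-to-sample distances: reflects the
    velocity of the noise in the gaze signal."""
    return np.sqrt(np.mean(np.diff(x)**2 + np.diff(y)**2))

def std_pooled(x, y):
    """Dispersion of gaze samples around their centroid, pooled over
    the horizontal and vertical components."""
    return np.sqrt(np.var(x) + np.var(y))

# White (uncorrelated) noise: the ratio RMS-S2S / STD is ~sqrt(2);
# other ratios indicate differently coloured noise.
rng = np.random.default_rng(1)
x, y = rng.normal(0, 0.05, 10_000), rng.normal(0, 0.05, 10_000)
print(rms_s2s(x, y) / std_pooled(x, y))  # ~1.414
```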

Filters

The most common way to reduce (improve) precision values is to employ a filter. McConkie (1981) proposes that all filters should be reported. Filtering of the resulting data stream compensates for noise generated earlier at the level of sensors, light, fans and more. However, filtering affects various characteristics of the signal differently, and using the four different measures above allows researchers to investigate whether filters are present (Niehorster et al., 2021).

Ko et al., (2016) remarked that an optimal filter should be based on (a) a characterization of the noise level and (b) the component of eye movements one is interested in examining. Most other design criteria for filters seem to be guided by heuristics, or ‘rules of thumb’, motivated by visual inspection of the data (e.g. Stampe, 1993). Notice that pattern-matching filters, such as those described by Stampe (1993, p. 138, known as the heuristic filter in EyeLink and SMI trackers) and Duchowski (2007), amplify parts of the gaze signal that resemble the filter pattern, while attenuating other portions. Špakov (2012) compared several noise filters and found that finite-impulse-response filters with triangular or Gaussian kernel (weighting) functions, and parameters dependent on signal state, show the best performance, as judged by comparison to idealised saccade models using multiple criteria.

Derivatives of the gaze position signals are used by both researchers and event-detection algorithms. Numerical differentiation of a signal, however, amplifies its high-frequency content (which is usually noise). Specific filters are therefore often used to counteract the increased high-frequency noise resulting from differentiation. The most detailed investigations of these filters were conducted by Inchingolo and Spanio (1985) and Larsson (2010), who showed how saccade parameters (e.g. duration and peak velocity) are affected by the type of differentiation filter and the peak velocity threshold in the event detector. Larsson (2010) concluded that the Savitzky–Golay filter used by Nyström and Holmqvist (2010) and the differential filter used by Engbert and Kliegl (2003) produced eye-movement velocities and accelerations most like those found in the literature. Unlike the pattern-matching filters, these two filters make no strong assumptions about the overall shape of the velocity curve.
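A minimal sketch of such a smoothing differentiation, assuming SciPy is available and gaze coordinates in degrees; the window length and polynomial order below are illustrative choices, not the settings of Nyström and Holmqvist (2010):

```python
import numpy as np
from scipy.signal import savgol_filter

def gaze_velocity(x_deg, y_deg, fs_hz, window=7, polyorder=2):
    """Angular gaze velocity (deg/s) via a Savitzky-Golay
    differentiation filter, which smooths while it differentiates and
    thereby tempers the high-frequency noise that plain sample-to-sample
    differencing would amplify."""
    dt = 1.0 / fs_hz
    vx = savgol_filter(x_deg, window, polyorder, deriv=1, delta=dt)
    vy = savgol_filter(y_deg, window, polyorder, deriv=1, delta=dt)
    return np.hypot(vx, vy)
```

In practice, the window length should be chosen relative to the sampling frequency and the shortest events one wishes to preserve.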

Data loss and interpolation

Several studies have shown that average data loss differs between eye trackers. Holmqvist (2015) reports that the video-based SMI HiSpeed 1250 and EyeLink 1000 had the lowest data loss, with around 3% of the raw data samples lost on average, while the Tobii T60 XL and the TX300 lost 15% or more. Nevalainen and Sajaniemi (2004) report 3.0–8.7% data loss for the Tobii 1750 and two ASL trackers, while Funke et al., (2016) found 22% data loss in the EyeTribe and 24% in the Tobii EyeX. For reference, around 2% of the data are lost due to blinks (Holmqvist and Andersson, 2017, p. 167). In contrast to the values reported for the TX300 by Holmqvist & Andersson (2017, p. 167), Hessels et al., (2015, Figure 6) reported less than 3% data loss for the TX300 for upright head orientations, and Hessels & Hooge (2019, Figure 9) reported less than 10% data loss for 9-year-old children measured with the TX300. There is thus a large range in the reported data loss values for each eye-tracker model. This suggests that not only the eye-tracker hardware itself plays a role, but also operator experience, participant groups, lighting conditions, stimuli and experimental procedures, and laboratory protocols. This should be taken into account when interpreting data loss values reported in the literature.

Furthermore, Castner et al., (2020) reported that data loss values produced by manufacturer software are not always reliable. They found that for a participant with a reported tracking ratio of 98% (a data loss of 2%), an additional large gap in the left-eye gaze signal (approximately 3.5s of a 90s recording) was in fact data loss, but had been labelled as a blink.

Fixation points positioned in the corners of the monitor, participants with downward-pointing eyelashes, and large head movements all tend to result in higher data loss (Hessels et al., 2015; Holmqvist et al., 2011; Niehorster et al., 2018), though the operator may have a significant influence as well (Hessels & Hooge, 2019).

Data loss may affect the output of event detection, if the event detector terminates fixations and other events whenever a period of data loss is encountered. Holmqvist et al., (2012) added increasing amounts of data loss (as short segments) into data with no data loss, and found that 18% data loss reduces the number of fixations by about one quarter, and increases their average duration by around 50ms, when using the Nyström and Holmqvist (2010) algorithm. Hessels et al., (2017) found that adding periods of data loss to eye-tracking data affected the number of fixations and corresponding fixation durations for different event detection algorithms strongly and idiosyncratically.

Some algorithms merge fixations that are close in time and space when they are separated by small bursts of data loss (Komogortsev et al., 2010; Wass et al., 2013; Zemblys et al., 2018), reducing some of the effects of periods of data loss. The Tobii Pro Lab software instead allows users to fill gaps of data loss with synthetic data generated by linear interpolation; this interpolation is selected in the event detection dialog of the software. The I2MC algorithm (Hessels et al., 2017) also interpolates gaps up to a certain duration, but uses a non-linear Steffen interpolation (Steffen, 1990).
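
The sketch below illustrates the general idea of gap-filling with linear interpolation; it is not the Tobii Pro Lab or I2MC implementation (I2MC uses Steffen interpolation), and the maximum gap duration is a parameter the researcher must choose and report.

```python
import numpy as np

def fill_short_gaps(gaze, max_gap_samples):
    """Linearly interpolate NaN gaps no longer than max_gap_samples.

    Longer gaps, and gaps at the edges of the recording, are left as is.
    """
    gaze = np.asarray(gaze, dtype=float).copy()
    lost = np.isnan(gaze)
    edges = np.diff(lost.astype(int))
    starts = np.where(edges == 1)[0] + 1   # first lost sample of each gap
    ends = np.where(edges == -1)[0] + 1    # first valid sample after each gap
    if lost[0]:
        starts = np.r_[0, starts]
    if lost[-1]:
        ends = np.r_[ends, len(gaze)]
    for s, e in zip(starts, ends):
        # only fill short interior gaps with valid neighbours on both sides
        if e - s <= max_gap_samples and s > 0 and e < len(gaze):
            gaze[s:e] = np.interp(np.arange(s, e),
                                  [s - 1, e], [gaze[s - 1], gaze[e]])
    return gaze
```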

Latency, gaze contingency

Latency (also known as temporal accuracy and end-to-end delay, e.g. Reingold, 2014) is often defined as the average end-to-end delay from the time of an actual movement of the tracked eye until the recording computer signals the eye movement. Theoretically, there is always a latency of a few milliseconds, and in the optimal case, it is constant. Any processes run by the computers involved in the data recording may add to this basic latency.

A known, constant latency is unproblematic for most research (except closed-loop, gaze-contingent experiments). A variable latency, which translates to high temporal imprecision, is much more critical, as it cannot easily be compensated for, particularly if the eye tracker does not provide reliable timestamps.

A large and variable latency is tricky to detect, measure, and prevent, and may come as an unpleasant surprise long after the data were recorded. McConkie (1997) looked back at the foundational work on reading using gaze contingency (McConkie and Rayner, 1975), and remarked that they had been unaware of a filter in the eye-tracker circuitry that increased the latency between the eye movement and the registered signal by 25ms, potentially undermining their conclusions.

Table 3 lists existing measurements of eye-tracker latencies. Measurement type 1 concerns the time from when an eye movement is made until the output gaze coordinates change, while measurement types 2–5 include the time needed to update the monitor.

Table 3 Studies of eye-tracker mean latencies. While measurement type 1 concerns the duration from when an eye movement starts until the gaze coordinate changes, measurement types 2–5 include the time needed to update the monitor in a gaze-contingent setup. Numbers in brackets denote standard deviations

Gaze-contingent paradigms and latencies

Whether a gaze-contingent paradigm – for instance, boundary and moving window paradigms (Hohenstein & Kliegl, 2014; McConkie & Rayner, 1975; Nuthmann, 2013) or saccadic adaptation paradigms (McLaughlin, 1967; Pélisson et al., 2010) – can be run without exceeding the maximum allowed latency depends on how quickly a gaze coordinate can be fed back to the stimulus program so that the stimulus monitor can be changed without the participant noticing (facilitated by saccadic suppression, Campbell & Wurtz, 1978; Holt, 1903). Loschky and Wolverton (2007) reported that it is sufficient to update the stimulus image within 60ms after the onset of the eye movement. However, Slattery et al., (2011) point out that the position of gaze during the display change has an effect on fixation durations (for the next word after the boundary) that can be seen at signal delays as short as 15–25ms. This behavioural change indicates detection of the manipulation, and the delay can be compared to the measured latencies in Table 3. Note that a single detection may be enough to affect behaviour, which means that the maximum latency, rather than the mean, would be the most relevant comparison.

Saccade latency measurements versus system latencies

In other cases, researchers are concerned whether their eye-movement recording was properly synchronized to stimulus onsets on their displays. Improper synchronization would, for instance, affect eye latency measures, such as saccadic latencies. One method to check this has been to compare the eye video to the file of the raw data stream or gaze scanpath (Morgante et al., 2012). This, however, has the drawback that both data streams are generated by the same software, and could be affected by the same latencies. Also, the video usually has a low temporal resolution in comparison to the eye-tracking data, which limits the detection of synchronization issues to the temporal resolution of the video recording. As an alternative method of measuring synchronisation, Shukla et al., (2011) used a mirror positioned next to the participant and a 300Hz high-speed camera, which recorded the participant’s eye and, through the mirror, the monitor where the stimuli appeared and disappeared. Results revealed a variable latency with a mean of 27ms on their Tobii 1750, similar to the latencies reported by Leppänen et al., (2015) in a study using the same approach with a low temporal resolution camera and a Tobii TX300, while Morgante et al., (2012) reported latencies of up to 54ms for the Tobii TX60XL.

Fixation and saccade detection

Historically, fixation and saccade detection was conducted manually and was very time-consuming. For instance, Hartridge and Thomson (1948) presented a novel method to process eye movements that required approximately 10000s (almost three hours) of manual work for 1s of recorded data. Decades later, Monty (1975) remarked: “It is not uncommon to spend days processing data that took only minutes to collect” (p. 331–332). Today, software can run a similar analysis in a matter of minutes, even for several hours of recorded data. Potential reasons for still doing manual analysis include that it allows for better general monitoring of data quality as well as of participant performance and engagement.

Event detection algorithms (or event classification, see Hessels et al.,, 2018) are used to process a time series signal (gaze position, pupil size, etc.) into labelled, meaningful units, such as fixations, saccades, blinks, etc. What happens inside the event detection algorithms was considered important enough by McConkie (1981) that he recommended that details about these algorithms should be published in the paper presenting the processed events.

Note that operationalisations of fixations may depend on the frame of reference (i.e. whether the eye tracker is fixed to the world or to the head). A moving observer who fixates a static object in the world produces a gaze point in the world that is stationary with respect to the object, but slowly moving with respect to the head. This point is extensively discussed in Lappi (2016), Holmqvist and Andersson (2017, Chapter 7) and Hessels et al., (2018).

There are many different event detection algorithms available. Here, we describe a select number of them to give an idea of the breadth and scope. The I-DT finds fixations using a spatial threshold on maximum gaze dispersion (typically 0.75–1.5°) and a temporal threshold on minimum fixation duration (typically 50–150ms); what remains is assumed to be saccades. The I-VT instead finds saccades using a minimum peak velocity criterion (such as 20–100°/s), and assumes that everything between saccades is a fixation. The I-DT and I-VT were described by Salvucci and Goldberg (2000), and later appeared in software from manufacturers. For instance, BeGaze by SMI offers both the I-VT and the I-DT algorithms, whereas Tobii Pro Lab provides a version of the I-VT, and the Data Viewer by SR Research has an I-VT-related saccade detector with both velocity and acceleration thresholds.
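
A toy version of the I-VT logic is sketched below; real implementations, such as those in manufacturer software, add further steps (e.g. merging of nearby fixations), and the thresholds here are illustrative values from the ranges above, not recommendations.

```python
import numpy as np

def ivt_fixations(velocity, fs, vel_threshold=30.0, min_fix_ms=60.0):
    """Toy I-VT: samples below the velocity threshold (deg/s) are fixation
    samples; contiguous runs shorter than min_fix_ms are discarded.

    Returns a list of (onset_sample, offset_sample, duration_ms) tuples.
    """
    is_fix = np.asarray(velocity) < vel_threshold
    fixations, start = [], None
    for i, f in enumerate(np.r_[is_fix, False]):  # sentinel closes a trailing run
        if f and start is None:
            start = i
        elif not f and start is not None:
            dur_ms = (i - start) * 1000.0 / fs
            if dur_ms >= min_fix_ms:
                fixations.append((start, i, dur_ms))
            start = None
    return fixations
```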

The NH2010 algorithm by Nyström and Holmqvist (2010) is an improvement of the I-VT algorithm that adapts the peak velocity threshold to the level of noise in the data, and additionally outputs detected post-saccadic oscillations. The I2MC by Hessels et al., (2017) is an algorithm designed to be robust against increasing levels of noise and data loss, which are common in infant research.

GazeNet by Zemblys et al., (2019) is a fully end-to-end machine learning-based event detector that learns from examples, and detects fixations, saccades, and post-saccadic oscillations with very high resemblance to human expert coders. The deep eye movement classifier by Startsev et al., (2019) is another recent machine-learning algorithm, which also detects periods of smooth pursuit in the data.

There also exist dedicated event detection algorithms for data from head-mounted eye trackers, used to describe gaze behaviour during e.g. navigation in real environments (Hessels et al., 2020; Niehorster et al., 2020a). For researchers interested in labelling eye-tracking data from head-mounted eye trackers as smooth pursuit, fixations during head movements, OKN, vergence, etc., no automated techniques exist at the moment. However, this is a quickly evolving field, in which relevant work on some of these problems is under way (Kothari et al., 2020; Larsson et al., 2014).

Furthermore, there are many other special-purpose event detectors (for instance, blink detectors, microsaccade detectors, algorithms for desaccading smooth pursuit or nystagmus data, and smooth pursuit detectors), summarised by Holmqvist and Andersson (2017, Section 7.4).

Most event detection algorithms operate offline, on already recorded data. For gaze-contingent research, however, event detection algorithms have to be fast and online, operating in real time as saccades happen (Holmqvist & Andersson, 2017, p. 234–235). Such online detection is necessary in the Fixation-Contingent Scene Quality Paradigm (Henderson et al., 2013; Walshe & Nuthmann, 2014). In the boundary paradigm, however, there is just a simple check whether the raw data (typically from one eye only, see discussion in Nuthmann & Kliegl, 2009, p. 23) have crossed the boundary, such a crossing being taken to mean that a saccade is in progress (see also Slattery et al.,, 2011).
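
In code, such a boundary check can be as simple as the following sketch (assuming a left-to-right reading direction and a pixel-based boundary position chosen by the experimenter):

```python
def boundary_crossed(raw_gaze_x, boundary_x):
    """Boundary-paradigm trigger: once the raw gaze x-coordinate (typically
    from one eye) passes the invisible boundary, a saccade is assumed to be
    in progress and the display change can be initiated."""
    return raw_gaze_x >= boundary_x
```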

The risk that poor precision poses for the detection of small eye movements

Small eye movements may be hidden in the noisy, imprecise parts of the data. For instance, Fig. 2A shows how large saccades are often followed by small saccades, which are clearly visible and reasonably easy for algorithms to detect. In Fig. 2B, the big saccades are visible, but the small saccades, if they were made during the recording, have left a trace that is harder to distinguish from noise, for human data inspectors and algorithms alike.

The degree to which outcome measures of event-detection algorithms are sensitive to the noise level has been systematically investigated by Hessels et al., (2017), Holmqvist (2016), and Holmqvist et al., (2012), who all investigated the effect of artificially increasing noise levels (degrading precision) on the outcome of event detectors, and by van Renswoude et al., (2018), who investigated correlations between precision and outcome measures. Effect sizes are large; for instance, using the algorithm by Nyström and Holmqvist (2010), Holmqvist et al., (2012) compared precision levels of 0.03°–0.37° and found an increase of average fixation durations from 430ms to 630ms and a reduction of the number of fixations by about one-third, for the same eye-movement data. Hessels et al., (2017) and Holmqvist (2016) report (and illustrate in figures) how, for some algorithms, no fixations whatsoever are found when imprecision increases beyond a certain level.

Algorithm settings

Event detection algorithms have a variety of settings, for example the minimum peak velocity threshold for saccade detection (I-VT, EyeLink), and the minimum fixation duration and maximum gaze dispersion for fixations (I-DT). Changing these settings can have large effects on measures such as the number and duration of fixations and saccades (Blignaut, 2009; Holmqvist, 2016; Manor and Gordon, 2003; Shic et al., 2008). For some experimental designs, in particular between-subjects comparisons, comparisons between studies, and replication studies, a change of algorithm settings may affect whether a hypothesis is rejected (see for instance, Shic et al.,, 2008, for a within-subjects design comparing different stimulus types).

Settings can be manually adapted based on, for instance, the precision of the data. Holmqvist (2016) and Holmqvist and Andersson (2017, Ch 7) provide practical advice on the relationship between precision, settings, and outcome measures for two commonly used algorithms: I-DT and I-VT. The larger the saccades in the task, the higher the thresholds can be. Studies with a focus on small saccades need good precision and low thresholds.

There are also adaptive algorithms that change the thresholds based on the precision of the data (e.g. Braunagel et al.,, 2016; Engbert & Kliegl, 2003; Hooge & Camps, 2013; Mould et al.,, 2012; Nyström & Holmqvist, 2010). However, an adaptive algorithm does not solve the problem of variable precision: it may adapt its parameters to the level of noise, but the changed parameters have consequences for the fixations and saccades output by the algorithm. Hessels et al., (2017) developed an algorithm with the explicit goal of being robust to differences in data quality, enabling comparisons across conditions that differ in data quality. Note, however, that although noise-resilient algorithms may produce fixations that result in the same average fixation duration from data of varying precision, further investigations are needed to assess the extent to which the individual events (their on- and offsets) change as precision varies.

Algorithm comparisons

Not everyone is free to choose which event detection algorithm to use, but for those who are, and who want an algorithm adapted to their needs, there are many algorithms to choose from. The many existing event-detection algorithms do not necessarily produce the same output measures when given the same eye-tracking data. In fact, several algorithm comparisons have reported large differences in fixation and saccade measures between algorithms (Andersson et al., 2017; Benjamins et al., 2018; Dalveren and Cagiltay, 2019; Komogortsev et al., 2010; Salvucci & Goldberg, 2000; Stuart et al., 2019). This research suggests that differences in, for instance, average fixation durations between studies that use different algorithms may in part stem from differences between the algorithms.

It has become common for developers of algorithms to benchmark their novel algorithm against previous ones (e.g. Hessels et al.,, 2017; Otero-Millan et al.,, 2014; Zemblys et al.,, 2018, 2019). Event detectors based on machine learning have started to appear, whose behaviour cannot be fully described in terms of rules that relate to human concepts about the eye-movement signal. Consequently, trust in such an algorithm derives from benchmarking against human coders or existing algorithms (Zemblys et al., 2019).

There is an ongoing discussion around the methods in building and evaluating event detectors, in particular how to calculate inter-rater reliability, used to compare algorithms against algorithms or against human coders (e.g. Friedman, 2020; Startsev et al.,, 2019; Zemblys et al.,, 2019, 2021). Other current topics concern whether human coding of events is a good benchmark to test the algorithms against (Hooge et al., 2018), or build algorithms from (Zemblys et al., 2019), and what kind of noise to add to the data when testing the noise-robustness of an event detector (Niehorster et al., 2020c).

Event operationalisation

Fixations, saccade latencies, amplitudes, and curvature have each been operationalised in more than one way. For instance, a common way to calculate saccade amplitude is as the Euclidean distance between the start and end of a saccade (e.g. van der Geest et al.,, 2002). Alternatively, the amplitude can be measured as the distance along the saccade path (calculated, for instance, as duration multiplied by average velocity). These two amplitude calculations will differ for curved saccades (Holmqvist & Andersson, 2017, p. 613).
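
The difference between the two operationalisations is easy to make explicit in code; the sketch below computes both for the samples of a single saccade (coordinates in degrees):

```python
import numpy as np

def saccade_amplitudes(x, y):
    """Return (straight-line amplitude, path-length amplitude) of one saccade.

    The Euclidean amplitude uses only the start and end points; the path
    length sums sample-to-sample distances and is at least as large,
    with the difference growing for curved saccades.
    """
    euclidean = np.hypot(x[-1] - x[0], y[-1] - y[0])
    along_path = np.sum(np.hypot(np.diff(x), np.diff(y)))
    return euclidean, along_path
```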

Different algorithms calculate fixation durations and other measures in different ways (Andersson et al., 2017). In particular, some algorithms exclude the post-saccadic oscillation (PSO) from both the saccade and the following fixation event (e.g. Nyström & Holmqvist, 2010; Zemblys et al.,, 2019), while the I-VT algorithm and the EyeLink algorithm have no separate detection of PSOs and assign parts of the PSO either to the saccade or the fixation, largely depending on the amplitude of the PSO.

Area-of-interest (AOI) measures

Areas of Interest (AOIs, also known as Regions of Interest, ROI, and Interest Areas, IA) are employed when the researcher’s interest is in the relation between gaze behaviour and the visual world (e.g. Buswell, 1935; Viviani, 1990). Researchers may be interested in what parts of a webpage attract gaze most effectively, and in what order (Goldberg et al., 2002), or in gaze behaviour while listening to ambiguous sentences about a scene (Allopenna et al., 1998). AOI measures such as the absolute or relative time spent in AOIs, or the number of transitions between AOIs, may be used for this.

Areas of Interest provide fundamental processing tools for the analysis of eye-tracking data, and are used in cognitive psychology, architecture, marketing, clinical research, neuroscience, educational science and many other fields. Multiple methods exist to relate AOIs to the stimulus; these are presented by Holmqvist & Andersson (2017, Ch 8), Hessels et al., (2016), and Orquin et al., (2016).

There are methods that serve the same function as AOIs but are not referred to as such: reading researchers use non-proportional fonts and often study single sentences only. This way, fixation-to-word and/or fixation-to-letter assignment is easily done post hoc; all that is needed is the horizontal offset of the sentence and the PPC value (pixels per character), along with the actual sentence. This also makes gaze-contingent reading research (moving window and boundary paradigms) technically easier to implement. For reading researchers who prefer to use AOIs, both BeGaze from SMI and the SR Research stimulus presentation software automatically segment text into AOIs at the word, sentence, and character level.
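
As a sketch of this post-hoc assignment (with hypothetical variable names), mapping a fixation to a character index requires only the sentence's horizontal offset and the pixels-per-character value:

```python
def fixation_to_char_index(fixation_x_px, sentence_x0_px, ppc):
    """Map a fixation's horizontal position to a 0-based character index in a
    monospaced (non-proportional) sentence; word assignment then follows
    from the sentence text itself."""
    return int((fixation_x_px - sentence_x0_px) // ppc)
```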

When the stimulus consists of animated material or videos, a static segmentation of space into AOIs may not suffice. Dynamic interest areas can be made to move in sync with the underlying object, but may require AOI measures to be calculated from raw data samples rather than from fixations (e.g. because event detectors are often unreliable when smooth pursuit is present).

AOI size

The size of the AOI is of great importance. If the accuracy of the gaze data is poor, the eye tracker might report a gaze position outside the AOI even though the participant was looking inside it, and vice versa (Holmqvist et al., 2012).

Hessels et al., (2016) report the effects of altering the size of AOIs (face stimuli) on important AOI measures (dwell time, total dwell time, time to first AOI hit), pointing out that effect sizes are large and the relationship is non-linear. Below a certain AOI size, the total dwell times were no longer significantly different between the two AOIs (eyes vs. mouth) used in their study. Orquin et al., (2016) reanalysed four experiments using different AOI sizes, and found only some effects of varying AOI size on the outcome of the statistical analysis. Orquin et al., (2016) also note that one third of the researchers in their survey reported conducting analyses with multiple AOI sizes, which may help confirm that a result is robust across AOI sizes.

Orquin and Holmqvist (2018) present simulations where they vary AOI size, the shape and position of the AOIs, and accuracy and precision, and investigate the effect on the AOI measure hit rate. They report complex, non-linear interactions between data quality measures and AOI properties.

The inaccuracy of the eye tracker is not the only factor that matters when calculating AOI measures from AOIs of different sizes. The minimum size of an AOI that encircles a target stimulus is also limited by the inaccuracy of the visuo-oculomotor system when targeting small objects, which can be larger for some participant groups (Clayden et al., 2020; Pajak & Nuthmann, 2013).

It has been suggested that margins be added around AOIs to compensate for inaccuracy (Holmqvist & Andersson, 2017; Orquin et al., 2016), which may or may not be possible depending on how densely populated the stimulus is. Hooge and Camps (2013) point out that if the visual stimulus is sparse, AOIs can be made as large as possible, sharing the remaining empty space between nearby AOIs. Their argument is that in sparse stimuli there is little crowding, and the functional visual field is large (Engel, 1971; Toet & Levi, 1992). A large functional visual field implies that objects are visible at larger eccentricities (larger distances from the gaze point), allowing observers to overview larger areas around the gaze point.
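
A minimal sketch of a rectangular AOI hit test with such a margin (the units and margin size are the researcher's choice, and real AOIs may of course be non-rectangular):

```python
def in_aoi(gaze_x, gaze_y, aoi, margin=0.0):
    """Test whether a gaze sample or fixation falls inside a rectangular AOI,
    with an optional margin added on all sides to compensate for inaccuracy.

    aoi is (left, top, right, bottom) in the same units as the gaze data.
    """
    left, top, right, bottom = aoi
    return (left - margin <= gaze_x <= right + margin and
            top - margin <= gaze_y <= bottom + margin)
```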

Higher-order measures

Outcome measures that build upon or are derived from AOI or fixation and saccade measures could be referred to as higher-order measures. As a rule of thumb, the higher-order measures have a large number of settings that can be varied, whether in

  • scan path analysis (Anderson et al., 2015; Cristino et al., 2010; Dewhurst et al., 2012; Duchowski et al., 2010; Jarodzka et al., 2010; Kübler et al., 2014)

  • (hidden) Markov models (Chuk et al., 2014; Coutrot et al., 2018; Ellis & Stark, 1986)

  • recurrence quantification analysis (Anderson et al., 2013; Pérez et al., 2018)

  • entropy analyses (Allsop & Gray, 2014, 2017; Hessels et al.,, 2019; Hooge & Camps, 2013; Krejtz et al.,, 2014; Niehorster et al.,, 2019)

  • heatmap-based analysis (Caldara & Miellet, 2011)

It is reasonable to expect that data loss, as well as poor precision and accuracy, will be carried through event detection and AOI procedures, and propagate into these higher-order measures. Similarly, settings in the event detector and choices of AOI sizes may also have strong effects on the higher-order measures.

To date, very few studies have examined the effect of changing settings and varying data quality on higher-order analyses. One example is Krejtz et al., (2015), who show that the size of gridded AOIs affects gaze transition entropy results, with non-linear relationships and large effect sizes in the outcome entropy.
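
For readers unfamiliar with the measure, the sketch below computes a simplified first-order gaze transition entropy over a sequence of AOI labels; it omits the normalisation and statistical testing steps of the full Krejtz et al. (2015) procedure.

```python
import numpy as np

def gaze_transition_entropy(aoi_labels, n_aois):
    """Simplified transition entropy H = -sum_i p_i sum_j p(j|i) log2 p(j|i),
    computed over transitions between different AOIs in a label sequence."""
    counts = np.zeros((n_aois, n_aois))
    for a, b in zip(aoi_labels[:-1], aoi_labels[1:]):
        if a != b:                       # self-transitions are excluded
            counts[a, b] += 1
    row_sums = counts.sum(axis=1, keepdims=True)
    total = row_sums.sum()
    if total == 0:
        return 0.0
    p_cond = np.divide(counts, row_sums,
                       out=np.zeros_like(counts), where=row_sums > 0)
    p_i = row_sums.ravel() / total
    log_p = np.log2(p_cond, out=np.zeros_like(p_cond), where=p_cond > 0)
    return float(-(p_i * (p_cond * log_p).sum(axis=1)).sum())
```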

Summary

We have reviewed research on how the eye tracker, methodology, environment, participant, settings of event detectors and AOI tools, etc., affect (or relate to) the quality of the eye-tracking data obtained, the properties of the eye-tracker signals, and the eye-movement and gaze measures. Our review has shown that there exists a significant body of research that has investigated the quality of data from eye trackers and what this quality relates to.

These studies have reported that sunlight and luminance (environment) have large effects on the recorded gaze data, that accuracy, precision and data loss often vary significantly between different eye trackers, and that the setup and geometry of the recording situation are of great importance to the quality of the data.

These studies have also shown, for instance, that accuracy, precision and data loss vary between participants, depending on age, eye-region physiology and many other factors. We have seen that calibration matters for accuracy, and that operator skill and trial structure may influence outcome measures. We have learnt that some researchers use filters to counter poor precision, interpolation to bridge gaps of data loss, and manual methods to re-align inaccurate gaze data.

The reviewed literature suggests that algorithms for event detection vary dramatically between studies and most algorithms are highly influenced by both precision and settings. Other research has quantified the large non-linear effects of data quality on area-of-interest and higher-order measures.

In the next section, we will examine how the various factors reviewed above are reflected in current reporting practices and guidelines.

Reporting practices and existing reporting guidelines

The many studies reviewed in the previous section show that the knowledge exists to help make good choices when conducting a study with an eye tracker. Is this knowledge readily applied by researchers using eye trackers? How does our literature review (Section “A review of empirical eye-tracking studies as the basis for a reporting guideline”) of important aspects of an eye-tracking study compare to the reporting practices of researchers using eye trackers? In the current section, we first summarise reporting practices from a database of 207 eye-tracking studies of judgement and decision-making (see Fiedler et al.,, 2019, for details) and discuss this in the light of our literature review. We then discuss reporting practices in light of five existing reporting guidelines, which attempt to make explicit what researchers are expected to report.

Reporting practices

The reporting database used here was first made public on https://decisionlab.shinyapps.io/iGuidelines/ on June 13, 2018, and later on https://osf.io/ysvzk/?view_only=1be57d949dff43e99189ec6ad13f8a23 as supplementary material to the present paper. Table 4 presents a comprehensive synopsis of this section.

Table 4 Synopsis of reporting frequencies of different aspects of studies derived from the reporting database

Environment

Only 12.5% of the 207 publications in the database report the location and setting where data were recorded.

Eye tracker

Table 4 presents data showing, for instance, that the eye-tracker model (90.8% of studies) and eye-tracker sampling frequency (75.8%) are often reported. While ranges of data quality values differ radically between eye trackers, sampling frequency is of importance only in some cases (Section “Signal properties and processing”). In contrast, the fundamental data quality measures (accuracy, precision, data loss and latency) are virtually never reported in the 207 publications of the database. Only 4.3% of the studies reported precision, and only 3.8% reported data loss. Only 0.5% of the studies were found to have reported a (measured or reiterated) latency value. Studies report the manufacturer’s specified accuracy (29.3%) almost ten times more often than self-measured accuracy (3.5%).

Geometry and setup

56.5% of studies reported the monitor resolution, while only 29.6% reported its physical size. Furthermore, 56.5% of studies reported the distance between participant and eye tracker (range 18–280cm, with 60 and 70cm being most common). To make full use of any one of these measures, the other two are usually also required; all three measures are reported in 20.3% of the studies. In comparison, 27.7% of the studies report that the authors applied a chinrest during recordings; their reasons are not revealed by the database.

Software

The software used for stimulus presentation was reported by 17.9% of the studies, and 44.9% of the studies reported which software was used for data processing and analysis. The most commonly reported processing tools were SMI BeGaze, Tobii, and SR Research Data Viewer, while the most common statistical tools were SPSS, R and Matlab. Papers that investigate the relationship between software tools and data quality are, to the best of our knowledge, currently lacking.

Participants

The gender distribution is reported in 77.8% of the publications in the reporting database. Although gender is potentially relevant to certain aspects of some studies, there is no clear evidence that it is related to eye-tracking data quality, and it relates only to a small extent to aspects of eye-movement behaviour. Age is reported in 67% of the studies in the reporting database, and in contrast to gender, age has been found to relate to smaller pupils, more frequent use of spectacles, droopier eyelids and other issues that affect data quality, as well as to changes in the eye movements themselves (Section “Participants”). Of those studies that report age, the average age is below 25 years in 67.4% of the publications, and between 26 and 46 years in the remaining 32.6%. Use of spectacles or lenses to correct poor visual acuity of participants is reported by 40.6% of the authors. Reports of having excluded recorded participants from further analysis were found in 51.2% of the publications, in which case exclusion criteria were always given.

Calibration

59.4% of the studies reported having calibrated only at the beginning, versus 16.4% that reported having recalibrated at some point during the study. 41.1% reported the number of calibration targets, with 9 points being most common (67% of the studies that report the number of targets), and 5 (17.6%), 13 (5.8%), and 3 (3.5%) occasionally used. Only 2.4% of all studies reported the background colour of the screen during calibration.

Features of the experiment

99% of the studies in the reporting database report the number of participants (on average just above 40). As an example of the range, Noton and Stark (1971) used data from 2 participants in their first experiment and 4 in their second, whereas Coors et al., (2021) compared eye-movement data from almost 4000 people to draw their conclusions. 94.2% report the number of trials (on average just below 60). 31% of the authors report the duration of the total recording; of these, 31% report durations of 16–30 minutes, 28% 31–45 minutes and 20% 46–60 minutes. Only one study (0.5%) reported who recorded the data.

Unsurprisingly, 100% of the authors reported which dependent variables were used. This does not necessarily mean that reporting dependent variables is straightforward: the naming of dependent variables is often unclear. For instance, dwell time is also called gaze duration and glance duration, depending on the research field. Sometimes terminology is confused, as when fixation duration is called dwell time, or when time to first fixation is named saccade latency or saccadic reaction time.

Exclusion criteria

Exclusion criteria for trials and events were reported by 30.9% of the authors, while 53.6% report having used exclusion criteria for participants. Exclusion criteria are composed of conditions on data quality and event values, personal characteristics, behavioural mishaps by participants or operators, technical issues, and more.

Event detector

Overall, 27.0% reported the event detector that was used. Among those authors who used fixation-based or saccade-based measures as their dependent variables, 37.0% reported their event detector. However, only 2.1% of those authors who used event detectors in their analysis reported precision, compared to 4.3% of the authors overall.

Areas of Interest

76.3% of the authors in the reporting database included a figure with a stimulus image in their publication (which may have had an AOI drawn onto it). Of those who used AOI analyses, 28.7% report accuracy, although these authors always reiterated values from manufacturer specifications and never measured accuracy in their own data. 24% reported the size of their AOIs. 33% of the authors in the reporting database stated that the AOI was larger than the stimulus object (margin included), 27.5% that the AOI and the stimulus object were the same size (no margin), and 5% that the AOI was smaller than the stimulus object (negative margin). 1% of the authors used overlapping AOIs, 68% made clear that their AOIs did not overlap, while 31% mentioned neither. Only 8% mentioned the distance between AOIs, of which 3% stated a zero distance between AOIs, and the rest reported distances between 5 and 241cm.

Summary

Many authors in the database report dependent variables, number of participants, eye-tracker sampling frequency, and eye-tracker model, which are readily available in most studies, but often fail to report measures and settings that we have found to be relevant from a data quality perspective. We can only speculate as to why this is: Lack of knowledge of what is relevant to report may play a role. Some researchers may find it unclear how to measure and calculate accuracy and precision. An over-reliance on the eye tracker and its software may add to that, as evidenced by the large proportion of authors reporting manufacturer-specified accuracy (29.3%) rather than measured accuracy (3.5%). In sum, we conclude that there is a discrepancy between reporting practices (the current section) and what is relevant to report for a study using an eye tracker (Section 4).

Existing reporting guidelines

The discrepancy between what is relevant from a data quality perspective, and the actual reporting practices, raises the question whether it is difficult for the users of eye trackers to find out what they need to report.

There are at least five existing reporting guidelines (Carter and Luke, 2020; Fiedler et al., 2019; McConkie, 1981; Oakes, 2010; Strohmaier et al., 2020). McConkie (1981) provides an early but still remarkably relevant example of general publishing guidance for eye-movement research, from an era when researchers often built their own eye trackers, and there were only a few manufacturers who sold them. In 2010, the journal Infancy adopted a policy for what to report in eye-tracking studies (Oakes, 2010). In the field of eye tracking in decision-making studies, Fiedler et al., (2019) proposed a reporting standard aimed to support replicability, based on suggestions from a panel of researchers. Carter and Luke (2020) provide a standard for reporting for eye-tracking studies, as part of a broader goal to describe best practices in a variety of disciplines around psychophysiology. In a review of eye-tracking research on mathematics education, in preparation of their guideline, Strohmaier et al., (2020) reported that “Although studies necessarily vary in the specific eye-tracking method they use, we found large inconsistencies in the reporting of these methods” (p. 165).

In Table 5, we summarise all recommendations that are common to at least two of the existing guidelines. Table 5 shows that there are also inconsistencies between the existing reporting guidelines. Although all five guidelines recognise the importance of reporting monitor properties and procedures for event detection, they diverge on everything else. Even when several existing guidelines recommend that the same feature be reported, they differ in details such as which operationalisations and terminology they use.

Table 5 Features of eye-tracking experiments that were common to at least two existing reporting guidelines. Terminology used in this table is by necessity reduced, but as closely as possible quoted from each guideline publication. See original publications for details

For instance, McConkie (1981) presents three separate tests of accuracy that each researcher should conduct and report, while Oakes (2010) only requires available information about accuracy to be reported, which may amount to the accuracy specification of the manufacturer, and Strohmaier et al., (2020) ask for average accuracy, i.e. accuracy measured by the researchers in their own experiment. While Strohmaier et al., (2020) specifically ask for event detection algorithms and thresholds, Oakes (2010) asks that future papers provide specifics concerning the definitions of saccades and fixations.

Furthermore, each guideline appears to have its own specific focus, which may reflect the field it originated from. For instance, the guideline by Oakes (2010) requires that filtering and interpolation algorithms for post-processing of eye-tracking data be reported. This presumably reflects the fact that eye-tracking data in infant research tend to have poor precision and frequent periods of data loss, which may require interpolation and filtering (Section “Signal properties and processing”). Also, Oakes (2010) is the only guideline to ask for recovery time data to be reported: the time it takes to resume tracking when the eyes reappear in the eye-tracker camera view after a period of track loss.

The guideline by Strohmaier et al., (2020) asks for “correlation between all used measures”, which is presumably intended to detect cases where multiple eye-movement measures are used as separate, independent corroborations of a single hypothesis, for instance number of fixations in an AOI, and dwell time in the AOI. A similar argument is made by Orquin and Holmqvist (2018).

The guideline by McConkie (1981) is the only one to emphasize measurement and reporting of linearity, system latency, drift, and multiple tests of accuracy.

The guideline by Fiedler et al., (2019) is the only one to ask authors to report on many experimental design parameters, such as inter-stimulus interval, counterbalancing of the position of AOIs, number of trials and the location where the data were collected. Furthermore, Fiedler et al., (2019) provide the most specific recommendations on reporting AOI details, which makes it surprising that Fiedler et al., (2019) do not recommend that accuracy be reported.

Carter and Luke (2020) is the only existing guideline to ask for basic demographic information, and the only one to also ask for “A list of the dependent variables selected for analysis, and a justification for that selection”.

Existing guidelines may not be the obvious choice for all researchers. For instance, Uesbeck et al., (2020) did not make use of any of the guidelines above, but opted to report according to the CONSORT reporting guideline, which is used in the medical sciences (http://www.consort-statement.org).

Our summary suggests that previous reporting guidelines are incomplete and inconsistent, and often biased towards specific research fields. Therefore, in the next section, we will design a minimal reporting guideline based on empirical research, which may be used as is or form the basis for developing mandatory reporting standards.

Considerations on reporting guidelines for eye-tracking research

Ideally, all details necessary to replicate a study, or assess the validity of a study’s claims, should be reported. The review above forms the empirical foundation for what may need to be reported in studies using an eye tracker.

Guidelines may also include requirements on, for instance, data formats and data-sharing principles that make collaboration more convenient. Similarly, researchers publishing in an APA journal that requires the gender and socioeconomic status of participants to be reported should report those. Such items are not specific to eye tracking, and are therefore not considered in this paper. For specific research fields, there may exist additional considerations for conducting eye-tracking studies (see e.g. Sharafi et al.,, 2020, for software engineering). Our minimal guideline can be extended with such items for use in specific contexts.

Furthermore, previous guidelines have presented a single list with everything each author must report. We believe that this is counterproductive. Eye trackers are used in many different research fields, and not all of the many aspects in Table 6 are relevant for each and every study. For example, reporting monitor properties is nonsensical for studies that do not use screens, such as a wearable eye-tracking study during locomotion. Reporting the interocular distance does not make sense for monocular eye-tracking studies, nor do firmware versions apply to analogue eye trackers. The exact reporting in each study needs to take the study’s particularities into account. The purpose of a reporting guideline should be to provide authors with the information that allows them to make an informed selection of which specific aspects to report, how to measure them, and how to describe them.

Recommendations for making informed choices

Based on these considerations, we have arrived at a flexible reporting structure with three parts. Firstly, Table 6 provides a list of reporting items that may be useful to report, depending on the specifics of each particular study. Secondly, we deem certain central aspects found in our review to be essential to report in any study. These are found in Table 7 and could form a minimal core of future reporting guidelines. The third part of our recommendations comprises a list of prototypical situations and contexts (Tables 8–18) that may assist readers in selecting reporting items from Table 6 for studies in specific research areas.

Table 6 Detailed descriptions of each aspect are found in Section “A review of empirical eye-tracking studies as the basis for a reporting guideline”. Please note that these suggestions comprise neither a mandatory, nor an exhaustive list; common sense is highly recommended
Table 7 Reporting aspects common to all studies. We consider this a strongly recommended list of aspects to report, albeit not exhaustive
Table 8 Research comparing specific groups of participants
Table 9 Clinical studies or case studies in neuropsychology, psychiatry, rehabilitation or ophthalmology
Table 10 Eye tracking for fixation control, i.e. do participants look where they are instructed to look?
Table 11 Pupil-size estimation
Table 12 AOI research on a single screen
Table 13 Research with more than one screen (for instance vehicle simulators)
Table 14 Wearable eye-tracking studies in unconstrained situations, e.g. in supermarkets, cars, and flight decks, or during locomotion and sports
Table 15 Development, evaluation or validation of eye-tracker methodology
Table 16 Gaze interaction in applications and experimental studies
Table 17 Gaze-contingent research
Table 18 Saccade reaction time studies

An empirically based minimal reporting guideline

We have presented research showing how various aspects of a study with an eye tracker, such as the instrument, methodology, environment, participant, etc., affect the quality of the eye-tracking data obtained, the properties of the eye-tracker signals, and the eye-movement and gaze measures. We have summarised these aspects in Table 6. We have then shown that this body of research has not made any major imprint on current reporting practices. We have also shown that existing reporting guidelines for research using an eye tracker leave much to be desired.

Conclusions

What is reported in eye-tracking publications is decided in a case-by-case negotiation between authors, reviewers, and action editors of the journal/venue in question, which appears to lead to a large variation in reporting practices (Table 4).

Our review of the existing literature showed that many factors in the environment, setup, participant, eye tracker, experimental design, event detectors, and area-of-interest settings may impact the conclusions of an eye-tracking study. We examined a separate database on what is reported in published research on decision-making using eye trackers, which suggests that actual reporting is variable and may be in need of guidance. We also examined five existing reporting guidelines for eye-tracking research and concluded that they are inconsistent, incomplete, and little used.

We have proposed a flexible, minimal reporting guideline with a core set of aspects that everyone should aim to report (Table 7), a large list of suggestions that may apply to many or some studies (Table 6), and several scenarios for specific uses of eye trackers (Tables 8–18). This information may help in making informed decisions about what to report.

The reporting items that we have listed may also be used as a checklist by researchers when designing and conducting their eye-tracking experiments, and when analysing their eye-tracking data. Moreover, reviewers and journal editors may use Table 6 when assessing research during peer-review to ensure that sufficient detail is provided for replication.

Our proposal of reporting aspects may also be taken as the empirical component for a future process to develop a formalised and mandatory reporting standard (using the EQUATOR approach or similar). It is possible that potential future mandatory standards would differ between clinical practice and research, or between research fields. However, we urge all such future endeavours to consider including the suggestions for reporting that we present in our empirical approach.

Open Practices Statement

The reporting database has been made available at https://osf.io/ysvzk/?view_only=1be57d949dff43e99189ec6ad13f8a23.