A total of 13 university students (seven women, 13 right-handed) took part in Experiment 1 after exclusion of one participant due to a failure to calibrate, and one for excessive data loss and difficulties tracking. They ranged in age from 19 to 28 years (M = 22.08, SD = 2.40). All participants had normal or corrected-to-normal vision and reported being able to complete the study without relying on vision correction. Hence, no participants wore glasses or contact lenses throughout the experiment. Finally, no participants wore make-up during the experiment.
Apparatus and task environment
The experiment was run on a Dell computer (Intel Core i7-3610QM @ 2.30 GHz, 16 GB RAM, Windows 10) and the task stimuli were presented on a monitor (53 x 30 cm, 60 Hz refresh rate, 1920 x 1080 pixels, 45.99 x 27.01 degrees of visual angle). The experiment was completed in a dimly lit, sound-proof testing room.
The remote GP3-HD eye-tracker, recording at 150 Hz, was used in this study. The eye-tracker was controlled through a custom script in PsychoPy (Peirce et al., 2019). The eye-tracker was placed at a 45° angle and 60–65 cm from the participants’ eyes (mean distance = 62.45 cm), in line with the instructions provided in the Gazepoint manual. The Gazepoint control and monitoring window, and physical measurement (before and between tasks) were used to aid setup and find an optimal position. Eye-tracker specifications provided by the manufacturer are summarized in Table 1.
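The visual angles reported for the monitor follow from its physical size and the mean viewing distance. The sketch below reproduces that conversion; it assumes the eye is centred on the screen, and the function name is illustrative rather than taken from the study's code.

```python
import math


def visual_angle_deg(size_cm: float, distance_cm: float) -> float:
    """Visual angle (degrees) of an extent centred on the line of sight."""
    return math.degrees(2 * math.atan(size_cm / (2 * distance_cm)))


# Values from the text: 53 x 30 cm screen viewed from a mean distance of 62.45 cm.
width_deg = visual_angle_deg(53, 62.45)
height_deg = visual_angle_deg(30, 62.45)

# Average horizontal pixel density, used to convert pixel errors to degrees.
px_per_deg = 1920 / width_deg

print(f"{width_deg:.2f} x {height_deg:.2f} deg, {px_per_deg:.1f} px/deg")
```

With roughly 42 pixels per degree on average, the 40-pixel calibration criterion and target size correspond to approximately 1 degree of visual angle, as stated.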
Prior to starting the main tasks, calibration was performed using a nine-point grid followed by a validation sequence. Satisfactory calibration criteria for continuing with the task were determined a priori: (a) all nine calibration points had to be deemed valid according to the Gazepoint Control software; (b) average calibration error had to be below or equal to 40 pixels (approximately 1 degree of visual angle); and finally, (c) using a real-time gaze relay nine-point grid, where participants’ gaze was shown as moving green dots on the screen, participants were asked to report how good they thought the eye-tracker was at approximating where they were actually looking using a 0–10 scale after explicitly attending to each of the gaze targets. Only answers equal to, or above, 8 were accepted.
Participants completed two main tasks. The Fixation-Saccade task was designed to provide data for the calculation of fixation and saccade metrics, and for accuracy and precision analyses. Participants were presented with nine black dots on the screen (size: 40 pixels, approximately 1 degree of visual angle; with an average distance of 11 degrees between target dots; see Fig. 1a). A target (blue dot) started on the central dot and then transitioned from the center to each peripheral dot at random throughout the task. All target positions were sampled before any were repeated. The duration for which the central dot was blue varied between 2 and 5 s on each transition to prevent participants from predicting when and where the dot would move next. The task finished when each peripheral dot had turned blue twice.
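One way to generate such a target schedule is sketched below. It assumes eight peripheral dots (nine dots minus the central one), that the target returns to the center before each peripheral transition, and that central durations are drawn uniformly between 2 and 5 s. The function and its parameters are hypothetical and do not come from the study's PsychoPy script.

```python
import random


def target_sequence(n_peripheral=8, repetitions=2, seed=None):
    """Return a list of (central_duration_s, peripheral_target_index) pairs.

    Every peripheral target is visited once, in random order, before any
    target is repeated (sampling without replacement within each block).
    """
    rng = random.Random(seed)
    sequence = []
    for _ in range(repetitions):
        order = list(range(1, n_peripheral + 1))
        rng.shuffle(order)
        for target in order:
            # Assumed: uniform jitter of the central dwell time (2-5 s).
            central_duration = rng.uniform(2.0, 5.0)
            sequence.append((central_duration, target))
    return sequence


schedule = target_sequence(seed=0)
```

Shuffling within blocks, rather than drawing targets independently, is what guarantees that all positions are sampled before any repeats.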
Pupillary light reflex task
The second task was designed to evoke the pupillary light reflex (PLR). Participants were presented with a grey dot in the middle of the screen (size: 50 pixels, approximately 1.2 degrees of visual angle) and instructed to fixate on it while black and white backgrounds alternated every 5 s. Each screen was presented 12 times (24 changes of color, see Fig. 1b).
Participants completed the setup and the calibration procedure, followed by the tasks detailed above. Each task was completed twice, once with the participants’ heads placed on the chinrest to limit their head movements, and once without. In both conditions participants were asked to avoid head and body movements. The order of tasks and conditions (chinrest vs. no-chinrest) was counterbalanced. The calibration procedure was performed prior to each task. After the experiment participants were debriefed. All experimental procedures were conducted in accordance with the revised 2013 Declaration of Helsinki and were approved by the local research ethics committee.
Eye-tracking data were pre-processed using custom code in R (version: 3.6.1). Gaze samples falling outside screen coordinates were eliminated, as well as the samples labeled as invalid by the eye-tracker (in total, 2.72% of the samples were excluded, out of which 0.74% fell outside the screen boundaries, and 1.98% were labeled as invalid by the eye-tracker). A simple implementation of the adaptive velocity-based algorithm proposed by Engbert and Kliegl (2003) was used to detect fixations and saccades in the ‘Fixation-Saccade’ task. Saccades were defined as periods of at least 20 ms (the duration of three adjacent gaze samples) where velocity exceeded an adaptive threshold set for each participant and condition (chinrest and no-chinrest) based on the level of noise in the data. The velocity threshold was defined as 5 median absolute deviations above the median velocity for each participant and condition. Finally, to prevent artificial improvements in accuracy and precision, no smoothing, filtering, or interpolation was applied to gaze position coordinates.
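The thresholding logic described above can be sketched as follows. This is a simplified illustration, not the study's R code and not the full Engbert and Kliegl (2003) procedure: it assumes gaze positions already converted to degrees, uses plain sample-to-sample speed, and counts the minimum duration in velocity samples (three, about 20 ms at 150 Hz). All function names are illustrative.

```python
import math
from statistics import median


def sample_speeds(t, x, y):
    """Speed (deg/s) between each pair of adjacent gaze samples."""
    return [
        math.hypot(x[i] - x[i - 1], y[i] - y[i - 1]) / (t[i] - t[i - 1])
        for i in range(1, len(t))
    ]


def adaptive_threshold(speeds, n_mad=5.0):
    """Median speed plus n_mad median absolute deviations."""
    med = median(speeds)
    mad = median(abs(s - med) for s in speeds)
    return med + n_mad * mad


def detect_saccades(speeds, threshold, min_samples=3):
    """Return (start, end) index pairs (end exclusive) into `speeds` for
    runs of at least `min_samples` values above the threshold."""
    saccades, start = [], None
    for i, s in enumerate(list(speeds) + [0.0]):  # sentinel closes a trailing run
        if s > threshold and start is None:
            start = i
        elif s <= threshold and start is not None:
            if i - start >= min_samples:
                saccades.append((start, i))
            start = None
    return saccades
```

The threshold would be computed separately for each participant and condition, so that noisier recordings receive a proportionally higher cut-off.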
For the PLR task, pupil data were first cleaned by removing pupil sizes outside the range of 2–10 mm. Pupil data were then pre-processed using functions from the R package GazeR (Geller et al., 2020). Data loss (e.g., blinks) up to 150 ms in duration were imputed using linear interpolation. Finally, a subtractive baseline correction was applied in line with the recommendations in Mathôt et al. (2018). Median pupil size during the last 20 samples of the preceding trial and the first 20 samples of the current trial (approximately 240 ms, incorporating an equal duration of light and dark screens) was taken as baseline pupil size, from which all individual pupil sizes were subtracted on each trial.
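The pupil pre-processing steps can be illustrated with a minimal sketch. It does not use or reproduce the GazeR functions; missing samples are represented as None, and the maximum gap length in samples (for example, 22 samples for about 150 ms at 150 Hz) is an assumption the caller supplies.

```python
from statistics import median


def clean_range(pupil, lo=2.0, hi=10.0):
    """Mark pupil sizes outside the plausible range (mm) as missing."""
    return [p if (p is not None and lo <= p <= hi) else None for p in pupil]


def interpolate_short_gaps(pupil, max_gap_samples):
    """Linearly interpolate interior gaps no longer than max_gap_samples."""
    out = list(pupil)
    i, n = 0, len(out)
    while i < n:
        if out[i] is None:
            j = i
            while j < n and out[j] is None:
                j += 1
            gap = j - i
            # Only fill gaps bounded by valid samples on both sides.
            if i > 0 and j < n and gap <= max_gap_samples:
                left, right = out[i - 1], out[j]
                for k in range(i, j):
                    out[k] = left + (right - left) * (k - i + 1) / (gap + 1)
            i = j
        else:
            i += 1
    return out


def baseline_correct(trial, previous_trial, n=20):
    """Subtract the median of the last n samples of the previous trial
    and the first n samples of the current trial."""
    window = [p for p in previous_trial[-n:] + trial[:n] if p is not None]
    base = median(window)
    return [None if p is None else p - base for p in trial]
```

Using the median rather than the mean for the baseline keeps a few residual artifacts in the window from shifting every value in the trial.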
Metrics and analyses
To assess the calibration quality, two metrics were used based on the manufacturer’s calibration procedure: 1) the number of calibration attempts it took until the experimenter accepted the calibration, and 2) the average error of the accepted calibration. All calibrations needed to have nine valid calibration points to be accepted so the number of valid calibration points was not considered in further analyses.
Sampling rate variability
As the GP3-HD eye-tracker has a sampling frequency of 150 Hz, the expected average inter-sample time is approximately 0.0067 s (6.7 ms). Sampling rate variability was assessed by calculating the mean and the standard deviation of inter-sample time as well as their robust equivalents (median and median absolute deviation), for both the chinrest and no-chinrest condition. Sampling rate variability was assessed across both the Fixation-Saccade and the PLR task.
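A minimal sketch of these descriptives is given below, assuming timestamps in seconds. The function name and the returned keys are illustrative.

```python
from statistics import mean, median, stdev


def inter_sample_stats(timestamps_s):
    """Mean, SD, median and MAD of inter-sample times, in milliseconds."""
    dt_ms = [(b - a) * 1000 for a, b in zip(timestamps_s, timestamps_s[1:])]
    med = median(dt_ms)
    return {
        "mean": mean(dt_ms),
        "sd": stdev(dt_ms),
        "median": med,
        "mad": median(abs(d - med) for d in dt_ms),
    }


# An ideal 150 Hz stream: every interval is 1000 / 150 ms (about 6.67 ms).
ideal = inter_sample_stats([i / 150 for i in range(151)])
```

For a perfectly regular 150 Hz stream both the SD and the MAD are effectively zero, so any spread in the observed values reflects timing jitter or dropped samples.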
Data loss occurs when the eye-tracker cannot detect the position of the eyes, and individual samples where this happens are labeled as invalid by the device. Proportion of lost gaze was computed for each trial and participant and compared between conditions (chinrest and no-chinrest).
Prior to computing accuracy and precision, the first and last 250 ms of each trial were removed to give participants time to fixate on the new target dot and to limit the extent to which participants’ anticipatory saccades influenced these metrics. This interval was chosen after calculation and visual inspection of saccade latency. Accuracy was computed as the error between the estimated gaze location and the location of a known target (Holmqvist et al., 2012). Horizontal and vertical accuracy were calculated for each gaze sample in the Fixation-Saccade task by subtracting estimated x and y gaze coordinates from the pre-defined x and y coordinates of each target dot location. Sample-level global accuracies were calculated as Euclidian distances between estimated x and y gaze coordinates and pre-defined x and y coordinates of target dot locations (see Fig. 1c, d).
Having calculated all three types of sample-level accuracies, outlier samples were removed if they were greater than 4 median absolute deviations from the median respective accuracy for each participant, condition, and trial (2.1% of the samples were excluded for vertical accuracy, 3% for horizontal accuracy, and 2.9% for global accuracy—note that these values are not independent). These outliers corresponded mostly to saccades, with the size of the error matching the expected saccade sizes during the task (note that analyses with outliers yielded consistent results, see Tables S1 and S2 in Supplementary materials). This was performed to avoid biasing the accuracy calculation by including gaze samples where the participant was likely to have clearly moved their eyes away from the target dot. Finally, mean vertical, horizontal, and global accuracy were calculated for each participant, condition, and trial.
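The accuracy computation and the outlier rule can be sketched as below. The sketch assumes gaze and target coordinates in the same units (for example, degrees) and keeps horizontal and vertical errors as signed differences (target minus gaze), which is one reading of the description above. Function names are illustrative.

```python
import math
from statistics import mean, median


def sample_errors(gaze, target):
    """Per-sample horizontal, vertical and global (Euclidean) errors."""
    tx, ty = target
    horizontal = [tx - gx for gx, _ in gaze]
    vertical = [ty - gy for _, gy in gaze]
    global_err = [math.hypot(h, v) for h, v in zip(horizontal, vertical)]
    return horizontal, vertical, global_err


def drop_outliers(values, n_mad=4.0):
    """Keep values within n_mad median absolute deviations of the median.

    Note: if the MAD is zero, only values equal to the median are kept.
    """
    med = median(values)
    mad = median(abs(v - med) for v in values)
    return [v for v in values if abs(v - med) <= n_mad * mad]


def mean_accuracy(values, n_mad=4.0):
    """Mean error after MAD-based outlier removal."""
    return mean(drop_outliers(values, n_mad))
```

In the analysis described here, the rule would be applied separately for each participant, condition, and trial, and separately for each of the three error types.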
Following the calculation of descriptive statistics, linear mixed models in lme4 (Bates et al., 2014) were fitted to test whether global accuracy differed between conditions (chinrest and no-chinrest) and target dot locations (central and peripheral) while accounting for participant and trial random effects. Maximal models were always fitted first (Barr et al., 2013), and convergence and singularity warnings were resolved by simplifying the random structure using principal component analysis to determine the most relevant random components (Bates et al., 2015).
Finally, we decided to compare accuracy on the central dot location against accuracy on all the peripheral dot locations grouped together for two reasons: (a) in psychological research, stimuli are commonly presented at the center of the screen, and it might be useful for researchers to know whether accuracy at this location is superior to accuracy at any peripheral location; (b) since trials with different target dot locations varied in frequency and duration (central target was presented more frequently and for a longer period of time in comparison to peripheral targets), grouping all peripheral locations allowed us to increase statistical power. Additionally, only sample-level accuracies within the first 2 s of the central trials were used in this comparison in order to match their duration with the duration of peripheral trials. For analyses, accuracy was log-transformed to correct the violation of the assumption that the residuals of the model are normally distributed.
Precision is a measure of the spatial variance in accuracy when the eye is assumed to be relatively stationary (Holmqvist et al., 2012). Therefore, gaze samples were first parsed into fixations and saccades based on the adaptive velocity threshold described above. Only fixations longer than 80 ms were used for calculating precision to avoid including small saccades which may be inaccurately labeled as fixations. Horizontal and vertical precision were calculated on a trial-by-trial basis for each participant by computing the root mean square of the differences between successive gaze samples for each fixation to the target dot. Global precision was calculated by first estimating the Euclidian distances between pairs of adjacent gaze samples and then computing the root mean square over these distances. After calculating descriptive statistics, linear mixed models were fitted to test whether the three types of precision (horizontal, vertical, global) varied between conditions (chinrest and no-chinrest) and target locations (central and peripheral) while controlling for participant and trial variability. Comparison between central and peripheral target locations as well as the choice of the models’ random structure followed the same logic as the analysis of accuracy. Precision data were also log-transformed for analysis due to non-normality of residuals.
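The sample-to-sample root mean square measure can be sketched as follows for a single fixation. The sketch assumes coordinates in degrees and that the fixation has already been selected (longer than 80 ms, that is, at least 12 samples at 150 Hz). Names are illustrative.

```python
import math


def rms(values):
    """Root mean square of a sequence of numbers."""
    return math.sqrt(sum(v * v for v in values) / len(values))


def precision_rms(x, y):
    """Horizontal, vertical and global RMS of successive-sample differences."""
    dx = [b - a for a, b in zip(x, x[1:])]
    dy = [b - a for a, b in zip(y, y[1:])]
    dist = [math.hypot(h, v) for h, v in zip(dx, dy)]
    return {"horizontal": rms(dx), "vertical": rms(dy), "global": rms(dist)}


# Toy fixation alternating between two points 3 units apart in x and 4 in y.
example = precision_rms([0.0, 3.0, 0.0, 3.0], [0.0, 4.0, 0.0, 4.0])
```

Because the measure is built from differences between adjacent samples, it reflects sample-to-sample noise and is insensitive to a constant offset from the target, which is what separates precision from accuracy.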
Following pupil pre-processing and baseline correction, the degree of PLR elicited by the changing stimulus was estimated (i.e., pupil constriction in response to light and pupil dilation in response to darkness). More specifically, linear mixed models were fitted for each condition (chinrest and no-chinrest) to compare pupil size changes in response to the two stimuli (black vs. white).
Saccade starting and landing error, amplitude, gain, curvature, latency, mean, and peak velocity were calculated (see Fig. 1). These metrics are provided to allow researchers to judge how well saccade parameters are reflected in gaze data from the GP3-HD, as there are no standard norms against which these values can be judged, other than direct comparisons with other systems.
Saccade starting error was calculated as the Euclidian distance between the gaze sample labeled as the saccade onset and the center of the starting target, while saccade landing error was calculated as the Euclidian distance between the gaze sample labeled as the saccade end and the target center (Dalmaijer, 2014). After calculating saccade amplitude (size of the saccade in degrees of visual angle), we proceeded to compute gain, the ratio between the observed saccade amplitude and the expected saccade amplitude, that is, the actual distance between the two consecutive target locations (Noto & Robinson, 2001). Saccades with gains less than 1 were too small (hypometric), while saccades with gains higher than 1 were too large (hypermetric). Gain provides an approximation of over- or under-estimation of the expected saccade amplitudes.
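These definitions translate directly into a few distance computations. The sketch below assumes all points are (x, y) pairs in degrees of visual angle and takes amplitude as the straight-line distance between the saccade onset and end samples; this is an assumption, since the text does not state how amplitude was measured. The function name is illustrative.

```python
import math


def saccade_metrics(onset, offset, start_target, end_target):
    """Start error, landing error, amplitude and gain for one saccade."""
    amplitude = math.dist(onset, offset)              # observed saccade size
    expected = math.dist(start_target, end_target)    # distance between targets
    return {
        "start_error": math.dist(onset, start_target),
        "landing_error": math.dist(offset, end_target),
        "amplitude": amplitude,
        "gain": amplitude / expected,
    }


# A saccade that starts 0.5 deg late and lands 0.5 deg short of a 10 deg step.
m = saccade_metrics((0.5, 0.0), (9.5, 0.0), (0.0, 0.0), (10.0, 0.0))
```

In this toy case the gain is 0.9, so the saccade would be classed as hypometric.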
In order to capture saccade trajectories, curvature was calculated as the median angle between each gaze point in a saccade and an imaginary straight line connecting the start and the end of the saccade, following the strategy of van Leeuwen and Belopolsky (2018). Saccade latency is the time from target onset until the initiation of the saccade to that target.
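One possible reading of this curvature measure is sketched below: for every intermediate sample, take the signed angle between the vector from saccade start to that sample and the vector from saccade start to saccade end, then take the median. This interpretation is an assumption and may differ in detail from the procedure of van Leeuwen and Belopolsky (2018). The function needs at least three samples.

```python
import math
from statistics import median


def saccade_curvature_deg(points):
    """Median signed angular deviation (degrees) of intermediate samples
    from the straight line joining the first and last sample."""
    (x0, y0), (x1, y1) = points[0], points[-1]
    reference = math.atan2(y1 - y0, x1 - x0)
    angles = []
    for x, y in points[1:-1]:
        a = math.atan2(y - y0, x - x0) - reference
        a = (a + math.pi) % (2 * math.pi) - math.pi  # wrap to [-pi, pi)
        angles.append(math.degrees(a))
    return median(angles)


straight = saccade_curvature_deg([(0, 0), (1, 0), (2, 0), (3, 0)])
curved = saccade_curvature_deg([(0, 0), (1, 1), (2, 0)])
```

A straight trajectory gives a curvature of zero, while the sign of a non-zero value indicates the side towards which the trajectory bends.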
Finally, velocity was calculated for each gaze sample making up a saccade by dividing inter-sample distance (in degrees of visual angle) by inter-sample time (in seconds), after which mean and peak velocity were computed for each saccade as a whole.
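As a minimal sketch of this step, assuming timestamps in seconds and gaze coordinates in degrees for the samples of a single saccade (the function name is illustrative):

```python
import math


def saccade_velocities(t, x, y):
    """Return (mean, peak) sample-to-sample velocity in deg/s."""
    v = [
        math.hypot(x[i] - x[i - 1], y[i] - y[i - 1]) / (t[i] - t[i - 1])
        for i in range(1, len(t))
    ]
    return sum(v) / len(v), max(v)


# Toy saccade: steps of 1, 3 and 1 deg at 10 ms intervals.
mean_v, peak_v = saccade_velocities(
    [0.0, 0.01, 0.02, 0.03], [0.0, 1.0, 4.0, 5.0], [0.0, 0.0, 0.0, 0.0]
)
```

With only a handful of samples per saccade at 150 Hz, the peak of these sample-level values can miss the true velocity peak, which falls between samples.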
Calibration metrics included the number of attempts it took to achieve an acceptable calibration, and the average error of the accepted calibration. Due to the violation of normality (Shapiro–Wilk test: W(12) = 0.72, p < .001), Wilcoxon signed-rank tests were performed to examine differences between the chinrest and no-chinrest conditions on both calibration quality metrics. No differences were detected in the number of calibration attempts (chinrest: M = 1.81; no-chinrest: M = 1.58; W(12) = 30, p = .402) nor in the average error of the accepted calibration (chinrest: M = 0.93; no-chinrest: M = 0.99; W(12) = 47, p = .946) (see Fig. 2).
Sampling rate variability
As expected, the observed average inter-sample time was 6.7 ms and the standard deviation of the inter-sample time was 0.79 ms (robust descriptives: Mdn = 6.68 ms; MAD = 35.29 ms). Only 0.0003% of the inter-sample times were greater than the duration of two consecutive samples (13.4 ms). Distribution of inter-sample time is shown in Fig. 2.
A Wilcoxon signed-rank test was performed to compare whether the chinrest and no-chinrest conditions differed in the proportion of lost gaze across tasks, and no differences were observed (chinrest: M = .031; no-chinrest: M = .029; W(12) = 43, p = .583) (see Fig. S4 in the Supplementary materials). Additionally, the quality of the accepted calibration prior to each combination of condition and task did not correlate with the proportion of lost gaze (Kendall’s tau; Fixation-Saccade task, chinrest: τ = –.24, p = .228; Fixation-Saccade task, no-chinrest: τ = .01, p = 1; pupil task, chinrest: τ = .01, p = 1; pupil task, no-chinrest: τ = –.21, p = .331).
A visualization of participants’ gaze positions superimposed on target positions for the Fixation-Saccade task is provided in Fig. 3. No differences in global accuracy, defined as the Euclidian distance of the actual gaze sample from the expected gaze position, were found between conditions with and without the chinrest (estimate < 0.01, SE = 0.01, t = –0.37, p = .710), while peripheral target locations had slightly better global accuracy in comparison to the central position (estimate = –0.09, SE = 0.02, t = –4.29, p < .01). However, different accuracy profiles were observed when vertical and horizontal accuracies were analyzed separately. In the case of vertical accuracy, no differences were found between conditions (estimate = 0.02, SE = 0.01, t = 1.64, p = .102), while vertical error at the peripheral target locations was lower than at the central target location (estimate = –0.19, SE = 0.03, t = –6.79, p < .001). In the case of horizontal accuracy, better accuracy was achieved without the chinrest (estimate = –0.04, SE = 0.01, t = –3.16, p < .01), and peripheral target locations had worse horizontal accuracy in comparison to the central target location (estimate = 0.16, SE = 0.03, t = 4.95, p < .001). Descriptive statistics for vertical, horizontal, and global accuracy for each condition and target location can be seen in Fig. 3 and Table 2.
These results suggest that globally, the accuracy of the GP3-HD is closer to the upper bound of the expected ~ 0.5–1° values, although horizontal accuracy is consistently closer to 0.5°. Nonetheless, the range of accuracy values is similar to what has been reported in previous evaluations of commercial eye-trackers (Funke et al., 2016; Holmqvist, 2017). Overall, the GP3-HD shows precision below 0.5°, which is also in line with what is reported using similarly priced eye-trackers and even some high-end systems (Funke et al., 2016; Holmqvist, 2017).
No differences were detected between central and peripheral target locations in any of the precision metrics (Horizontal: estimate = 0.01, SE < 0.01, t = 1.86, p = .069; Vertical: estimate < 0.01, SE < 0.01, t = –1.07, p = .288; Global: estimate < 0.01, SE = 0.01, t = 0.52, p = .607), whereas the differences between chinrest and no-chinrest conditions showed a less consistent pattern. The no-chinrest condition yielded increased vertical precision (estimate = –0.01, SE < 0.01, t = –2.66, p < .01), whereas no differences were observed in horizontal (estimate < 0.01, SE < 0.01, t = 1.53, p = .126) and global precision (estimate < 0.01, SE < 0.01, t = –0.80, p = .426). Descriptive statistics for vertical, horizontal, and global precision for each condition and target location can be seen in Table 2.
Bayesian equivalents for condition comparisons are provided in the Supplementary materials – see Bayesian analyses for condition comparisons.
Pupillary light reflex
The PLR was reliably detected in both the chinrest (estimate = –19.55, SE = 0.26, t = –74.50, p < .001) and no-chinrest conditions (estimate = –20.75, SE = 0.30, t = –69.79, p < .001; see Fig. 4a, b). As expected, the PLR effect is very large, accounting for 95% of the variance in both conditions.
Descriptive statistics for different saccade metrics are provided in Table 3. The relationships between saccade amplitude, duration and peak velocity are visualized in Fig. 4c, d.
As expected, since the majority of the saccades detected in the Fixation-Saccade task would be classified as small using the guidelines of Gibaldi and Sabatini (2021), the relationship between these metrics was approximately linear.
Finally, we examined each participant’s saccade trajectories from the central target to each peripheral target (see Fig. 5). Individual saccade trajectories plotted in Fig. 5 are not smooth, but appear broken and edgy, which is a sign of under-sampling (Dalmaijer, 2014). While clear identification of saccade events is possible, this suggests that the GP3-HD eye-tracker is less suitable for saccadometry research.
In summary, accuracy measures are comparable to the range reported in the existing eye-tracking literature for similar grade devices. In both ideal (chinrest) and non-ideal (no-chinrest) conditions, overall gaze error was at, or even above, the upper limit of the values stated by the manufacturer (1 degree of visual angle). While this degree of accuracy is sufficient to capture gaze behavior reliably across most experimental tasks, tracking targets smaller than this error could be problematic. While clear separation of saccades and fixations is possible using the GP3-HD, the study of saccade kinematics is compromised by the low sampling rate.
Experiment 1 provided a standard validation of the GP3-HD eye-tracker. Overall, calibration and data loss were acceptable, with calibration achieved successfully for most participants after one or only a few attempts, and with a low rate of data loss over the experiment, comparable to more expensive eye-trackers based on previous reports (Holmqvist et al., 2011). Accuracy and precision of gaze tracking were also acceptable, but closer to upper desired limits (~1 degree of visual angle). This means that during stimulus design and presentation, researchers should accommodate this tracking error by ensuring that target locations are separated by 3 degrees of visual angle or more, so that if a participant were looking exactly midway between two targets, the estimated gaze locations (even accounting for error) would not fall onto either of the targets. It is worth noting, however, that better accuracy values were achieved (0.5 degrees of visual angle), particularly for the horizontal dimension. This is useful to know, as it is possible to calibrate how gaze is assigned to areas of interest given the tracking error associated with a particular participant at a particular point in time (Hessels & Hooge, 2019; Orquin & Holmqvist, 2018). Similarly, individualized accuracy and precision profiles can be used to calibrate gaze parsing filters (see Feit et al., 2017).
With regard to saccade metrics, while the identification of saccades appears good, the calculation of kinematic parameters appears to be affected by the low sampling rate. As a result, the study of the properties of saccades using the GP3-HD may lead to inaccuracies depending on the specific parameters being studied. For example, it appears that the amplitude of saccades is underestimated, as the gain value was < 1, indicating hypometric saccades, that is, saccades where the amplitude was smaller than expected. Looking at the plots in Fig. 5, it is apparent that saccade patterns show breaks as a result of the sampling rate. This is consistent with Nyquist's theorem, which specifies that a signal must be sampled at more than twice the highest frequency component of the signal. Note, however, that saccade detection as well as peak velocity approximations are reliable with even 60 Hz sampling, and there are saccade reconstruction techniques that can improve the approximation of the “true” parameters of saccades (Wierts et al., 2008).
The lack of consistent significant differences between chinrest and no-chinrest conditions suggests that the algorithm for gaze estimation used by GP3-HD is relatively robust to small head movements. However, increased infra-red ‘bounce’ was observed during the sessions with a chinrest, where small and temporary reflections from the chinrest would cause it to be mistakenly detected as a part of the eye. Improvements in tracking accuracy expected from use of the chinrest may have been lost as a result. While this problem is solvable by eliminating infra-red reflections, a possibility for development would be for Gazepoint to offer a user intervention enabling the operator to manually correct the misidentified reflections during calibration, such that the Gazepoint algorithm would subsequently underweight those regions when estimating gaze location. Similarly, providing the stream of estimated distances from the screen, in addition to the gaze coordinates and pupil samples, would be useful for researchers to accommodate changes in distance from the screen in non-chinrest conditions.