Validation of Gazepoint low-cost eye-tracking and psychophysiology bundle

Cuve, Hélio Clemente; Stojanov, Jelka; Roberts-Gaal, Xavier; Catmur, Caroline; Bird, Geoffrey

doi:10.3758/s13428-021-01654-x

Open access
Published: 17 August 2021

Volume 54, pages 1027–1049 (2022)

Behavior Research Methods

Hélio Clemente Cuve ORCID: orcid.org/0000-0001-9436-8292¹,
Jelka Stojanov¹,
Xavier Roberts-Gaal¹,
Caroline Catmur² &
Geoffrey Bird¹,³

Abstract

Eye-tracking and recording of physiological signals are increasingly used in research within cognitive science and human–computer interaction. For example, gaze position and measures of autonomic arousal, including pupil dilation, skin conductance (SC), and heart rate (HR), provide an indicator of cognitive and physiological processes. The growing popularity of these techniques is partially driven by the emergence of low-cost recording equipment and the proliferation of open-source software for data collection and analysis of such signals. However, the use of new technology requires investigation of its reliability and validation with respect to real-world usage and against established technologies. Accordingly, in two experiments (total N = 69), we assessed the GP3-HD eye-tracker and the Biometrics (GPB) system from Gazepoint. We show that the accuracy, precision, and robustness of the eye-tracker are comparable to competing systems. While fixation and saccade events can be reliably extracted, the study of saccade kinematics is affected by the low sampling rate. The GP3-HD is also able to capture psychological effects on pupil dilation in addition to the well-defined pupillary light reflex. Finally, moderate-to-strong correlations between physiological recordings and derived metrics of SC and HR between the GPB and the well-established BIOPAC MP160 support its validity. However, low amplitude of the SC signal obtained from the GPB may reduce sensitivity when separating phasic and tonic components. Similarly, data loss in pulse monitoring may pose difficulties for certain HR variability analyses.

Introduction

Eye-tracking and psychophysiological recording^{Footnote 1} have gained popularity in recent years as a way to gain insight into cognitive processes, particularly the time course of those processes (Cacioppo et al., 2016; Holmqvist et al., 2011). The use of these techniques for research is not new; attempts to track human gaze and link physiological signals (e.g., heart rate, skin conductance, and pupillary change) to cognition, although highly expensive and invasive, can be found as far back as the early 1900s (see Buswell, 1935; Cacioppo et al., 2016; Dodge & Cline, 1901, for a review).

While of interest to researchers for decades, this technology was, until recently, mostly limited to research groups that could afford its high cost along with the proprietary software necessary to analyze the data (Funke et al., 2016). However, the proliferation of technology companies working on virtual reality, human–computer interaction, and marketing has diversified research into, and applications of, eye-tracking and psychophysiology technologies.

Manufacturers are now starting to offer low-cost eye-trackers (e.g., GP3 and GP3-HD from Gazepoint; Tobii Eye Tracker 4C and Tobii Eye Tracker 5 from Tobii or the now discontinued EyeTribe), and there is increased availability of open-source data acquisition and analysis software (e.g., OpenSesame - Mathôt et al., 2012; PsychoPy - Peirce et al., 2019; PyGaze - Dalmaijer et al., 2014; GazeR - Geller et al., 2020). Similarly, there are several psychophysiology devices for skin conductance and heart rate measurement targeted at both consumers, e.g., Fitbit bracelets, Apple Watch, and smartphone apps (Mühlen et al., 2021), and researchers (e.g., Shimmer by Tobii, the E4 wristband by Empatica), alongside the more conventional (and more expensive) devices traditionally used for scientific research. A number of established open-source tools for analyses of signals like SCR (Ledalab, Benedek & Kaernbach, 2010; PSPM - Bach & Staib, 2015; EDAExplorer - Taylor et al., 2015) and heart rate and variability (ArtiFact, Kaufmann et al., 2011; Kubios - Tarvainen et al., 2014; RapidHRV; Kirk et al., 2021) have made it easier to automate often cumbersome pre-processing procedures.

These inexpensive eye-tracking solutions represent a very attractive option, not only for researchers operating on a limited budget, but also for those interested in more portable and less cumbersome eye-tracking devices that can be easily moved and retrofitted according to specific study purposes and environments. There are also several potential advantages provided by the newer and simpler devices to measure SC and HR. For instance, traditional psychophysiological recording takes time to set up and can be invasive (e.g., attaching specialized ECG sensors to the participant’s chest and torso, often requiring the removal of clothing), which can add burden particularly to special participant populations (e.g., clinical groups).

While the diversification of eye-tracking and psychophysiology solutions provides considerable opportunities, it can also represent a risk to research validity and reproducibility if researchers are not adequately informed about the limitations of the low-cost devices available on the market (Orquin & Holmqvist, 2018; Society for Psychophysiological Research Ad, 2012). For eye-tracking applications, manufacturers commonly specify spatial accuracy—the average distance between a known target in space and the gaze position estimated by the eye-tracker; and spatial precision—the average distance between consecutive gaze position data points where gaze is assumed to have remained relatively stationary (Holmqvist et al., 2012). However, manufacturers’ performance evaluations are usually conducted under optimal conditions with trained participants using chinrests, or even using artificial eyes (Hessels, Cornelissen, et al., 2015b). As a result, relying solely on performance estimates provided by the manufacturers can yield unjustified optimism when evaluating the suitability of low-cost eye-tracking devices to answer certain research questions. Aware of the need for performance evaluations in more realistic experimental conditions, eye-tracking researchers have conducted extensive validation and comparison studies for some of the most frequently used eye-tracking devices (see, Funke et al., 2016; Hessels, Andersson, et al., 2015a; Janthanasub & Meesad, 2015; Leube et al., 2017; Mannaru, Balasingam, Pattipati, Sibley, & Coyne, 2017b; Niehorster et al., 2018). A common observation across these studies is that, even under ideal conditions, there is still a great deal of variability in how well different eye-tracking systems perform.

In addition to gaze position, eye-movement researchers often study saccades. However, systematic analysis of saccades is often overlooked in validation studies. While system accuracy and precision can inform saccadometry research, there are few established baselines for saccade metrics, and most research has relied on direct comparison of different eye-trackers (e.g., Dalmaijer, 2014; Nyström, Niehorster, Andersson & Hooge, 2021). Nonetheless, it is possible to assess saccade parameters descriptively, for example, by examining the regular relationships between saccade parameters (e.g., duration, amplitude, velocity) known as the saccadic ‘main sequence’ (Bahill et al., 1975; Gibaldi & Sabatini, 2021). The shape of the main sequence is well known: for small to medium saccades (up to roughly 15 to 20 degrees of visual angle in size), one should expect the relationship between these saccade metrics to be approximately linear (Gibaldi & Sabatini, 2021).

Similarly, for a given task where the size of the expected saccade is known, researchers could use the actual observed saccades of typical participants to assess undershooting or overshooting, as well as the degree of saccade curvature (van Leeuwen & Belopolsky, 2018).

In this study, we aimed to assess the performance of a new relatively low-cost eye-tracker, the GP3-HD (Gazepoint), with a sampling rate of 150 Hz and incorporating a high-definition machine vision-powered camera. The GP3-HD replaces the previous model, the GP3, which recorded at 60 Hz and for which independent validations exist (Brand et al., 2020; Mannaru, Balasingam, Pattipati, & Sibley, 2017a).

In addition to the GP3-HD, Gazepoint recently launched a Biometrics system (GPB) for the measurement of autonomic responses, specifically, skin conductance (SC) and heart rate (HR). SC and HR provide an indication of the degree of an individual’s physiological arousal, and the physio-anatomical mechanisms underlying changes in SC and HR are relatively well understood. As is the case with the GP3-HD, however, the reliability and validity of the GPB are currently unknown. A comparison of raw and derived SC and HR metrics obtained from the GPB and from a well-established device would provide a useful insight into the potential of the GPB to provide valid measurements. Therefore, this study also aimed to validate the GPB.

In Experiment 1, common data quality indicators (calibration quality, data loss, accuracy, precision) were obtained for the GP3-HD eye-tracker. We also provide information on sampling rate variability and fixation and saccade metrics (Holmqvist et al., 2011). In addition to gaze position and saccade analyses, we provide pupillometry analyses tracking the pupillary light reflex (PLR), a physiological process in which the pupil constricts in response to increased light intensity and dilates in response to reduced light intensity (Mathôt, 2018). In Experiment 2, we provide a second validation of the GP3-HD system that enabled us to study its performance under the conditions encountered in a typical psychological experiment, as well as to further test measurement of pupillary responses. Additionally, in Experiment 2, data were collected simultaneously from a well-established psychophysiological recording system (BIOPAC-MP160) and from the GPB system to assess the validity of SC and HR data, and derived metrics, recorded from the GPB system. Finally, recommendations for researchers planning to use this technology are provided.

Experiment 1

Method

Participants

A total of 13 university students (seven women, 13 right-handed) took part in Experiment 1 after exclusion of one participant due to a failure to calibrate, and one for excessive data loss and difficulties tracking. They ranged in age from 19 to 28 years (M = 22.08, SD = 2.40). All participants had normal or corrected-to-normal vision and reported being able to complete the study without relying on vision correction. Hence, no participants wore glasses or contact lenses throughout the experiment. Finally, no participants wore make-up during the experiment.

Apparatus and task environment

The experiment was run on a Dell computer (Intel Core i7-3610QM @ 2.30 GHz, 16 GB RAM, Windows 10) and the task stimuli were presented on a monitor (53 x 30 cm, 60 Hz refresh rate, 1920 x 1080 pixels, 45.99 x 27.01 degrees of visual angle). The experiment was completed in a dimly lit, sound-proof testing room.

Eye-tracking

The remote GP3-HD eye-tracker, recording at 150 Hz, was used in this study. The eye-tracker was controlled through a custom script in PsychoPy (Peirce et al., 2019). The eye-tracker was placed at a 45° angle and 60–65 cm from the participants’ eyes (M_DISTANCE = 62.45 cm), in line with the instructions provided in the Gazepoint manual. The Gazepoint control and monitoring window, and physical measurement (before and between tasks) were used to aid setup and find an optimal position. Eye-tracker specifications provided by the manufacturer are summarized in Table 1.
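The monitor size in degrees of visual angle reported above can be reproduced from the physical monitor dimensions and the mean viewing distance. The following is a minimal sketch, not the authors' code; the function names are hypothetical, and the pixel-to-degree conversion assumes a single average scale factor across the screen, which is only an approximation on a flat display.

```python
import math

def visual_angle_deg(size_cm, distance_cm):
    """Full angle subtended by an extent centred on the line of sight."""
    return math.degrees(2 * math.atan(size_cm / (2 * distance_cm)))

# Values reported in the text: 53 x 30 cm monitor, 1920 px wide,
# mean eye-to-tracker distance of 62.45 cm (used here as viewing distance).
WIDTH_CM, HEIGHT_CM = 53.0, 30.0
WIDTH_PX = 1920
DISTANCE_CM = 62.45

width_deg = visual_angle_deg(WIDTH_CM, DISTANCE_CM)
height_deg = visual_angle_deg(HEIGHT_CM, DISTANCE_CM)

# Average horizontal scale factor (assumption: treated as constant,
# although degrees per pixel vary slightly with eccentricity).
px_per_deg = WIDTH_PX / width_deg

def px_to_deg(n_px):
    """Approximate size in degrees of an extent given in pixels."""
    return n_px / px_per_deg

print(round(width_deg, 2), round(height_deg, 2))
print(round(px_to_deg(40), 2), round(px_to_deg(50), 2))
```

Under these assumptions, the 40-pixel calibration criterion and target size correspond to roughly 1 degree of visual angle.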

Table 1 GP3-HD eye-tracker specifications offered by the manufacturer (Gazepoint)

Prior to starting the main tasks, calibration was performed using a nine-point grid followed by a validation sequence. Satisfactory calibration criteria for continuing with the task were determined a priori: (a) all nine calibration points had to be deemed valid according to the Gazepoint Control software; (b) average calibration error had to be below or equal to 40 pixels (approximately 1 degree of visual angle); and finally, (c) using a real-time gaze relay nine-point grid, where participants’ gaze was shown as moving green dots on the screen, participants were asked to report how good they thought the eye-tracker was at approximating where they were actually looking using a 0–10 scale after explicitly attending to each of the gaze targets. Only answers equal to, or above, 8 were accepted.

Tasks

Fixation-Saccade task

Participants completed two main tasks. The Fixation-Saccade task was designed to provide data for the calculation of fixation and saccade metrics, and for accuracy and precision analyses. Participants were presented with nine black dots on the screen (size: 40 pixels, approximately 1 degree of visual angle; with an average distance of 11 degrees between target dots; see Fig. 1a). A target (blue dot) started on the central dot and then transitioned from the center to each peripheral dot at random throughout the task. All target positions were sampled before any were repeated. The duration of time for which the central dot was blue was varied between 2 and 5 s on each transition to prevent participants from trying to predict when and where the dot will move next. The task finished when each peripheral dot had turned blue twice.

Pupillary light reflex task

The second task was designed to evoke the pupillary light reflex (PLR). Participants were presented with a grey dot in the middle of the screen (size: 50 pixels, approximately 1.2 degrees of visual angle) and instructed to fixate on it while black and white backgrounds^{Footnote 2} interchanged every 5 s. Each screen was presented 12 times (24 changes of color, see Fig. 1b).

Procedure

Participants completed the setup and the calibration procedure, followed by the tasks detailed above. Each task was completed twice, once with the participants’ heads placed on the chinrest to limit their head movements, and once without. In both conditions participants were asked to avoid head and body movements. The order of tasks and conditions (chinrest vs. no-chinrest) was counterbalanced. The calibration procedure was performed prior to each task. After the experiment participants were debriefed. All experimental procedures were conducted in accordance with the revised 2013 Declaration of Helsinki and were approved by the local research ethics committee.

Pre-processing

Eye-tracking data were pre-processed using custom code in R (version: 3.6.1). Gaze samples falling outside screen coordinates were eliminated, as well as the samples labeled as invalid by the eye-tracker (in total, 2.72% of the samples were excluded, out of which 0.74% fell outside the screen boundaries, and 1.98% were labeled as invalid by the eye-tracker). A simple implementation of the adaptive velocity-based algorithm proposed by Engbert and Kliegl (2003) was used to detect fixations and saccades in the ‘Fixation-Saccade’ task. Saccades were defined as periods of at least 20 ms (the duration of three adjacent gaze samples) where velocity exceeded an adaptive threshold set for each participant and condition (chinrest and no-chinrest) based on the level of noise in the data. The velocity threshold was defined as 5 median absolute deviations above the median velocity for each participant and condition. Finally, to prevent artificial improvements in accuracy and precision, no smoothing, filtering, or interpolation was applied to gaze position coordinates.
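The thresholding rule described above (median velocity plus 5 median absolute deviations, with a minimum run of three samples) can be sketched as follows. This is a simplified illustration rather than the authors' R implementation: it applies a single radial speed threshold instead of the separate horizontal and vertical criteria of Engbert and Kliegl (2003), and all function names and the synthetic trace are hypothetical.

```python
import math
from statistics import median

def sample_speeds(x, y, t):
    """Point-to-point speed between consecutive samples (position units per second)."""
    return [
        math.hypot(x[i + 1] - x[i], y[i + 1] - y[i]) / (t[i + 1] - t[i])
        for i in range(len(t) - 1)
    ]

def adaptive_threshold(speeds, n_mad=5.0):
    """Median speed plus n_mad median absolute deviations."""
    med = median(speeds)
    mad = median([abs(s - med) for s in speeds])
    return med + n_mad * mad

def detect_saccades(speeds, threshold, min_samples=3):
    """Return (start, end) index pairs, end exclusive, of supra-threshold runs."""
    events, start = [], None
    # A sentinel below any threshold closes a run that reaches the end of the data.
    for i, s in enumerate(list(speeds) + [float("-inf")]):
        if s > threshold:
            if start is None:
                start = i
        elif start is not None:
            if i - start >= min_samples:
                events.append((start, i))
            start = None
    return events

# Synthetic 150 Hz trace (illustrative only): jittery fixation near 0 deg,
# a 10 deg rightward saccade over five samples, then fixation near 10 deg.
jitter = [0.00, 0.01, 0.03, 0.02]
x = [jitter[i % 4] for i in range(40)]
x += [2.0, 4.0, 6.0, 8.0, 10.0]
x += [10.0 + jitter[i % 4] for i in range(45, 85)]
y = [0.0] * len(x)
t = [i / 150 for i in range(len(x))]

speeds = sample_speeds(x, y, t)
threshold = adaptive_threshold(speeds)
saccades = detect_saccades(speeds, threshold)
print(threshold, saccades)
```

Because the threshold is derived from the data of each participant and condition, noisier recordings yield a higher threshold, which is the sense in which the procedure is adaptive.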

For the PLR task, pupil data were first cleaned by removing pupil sizes outside the range of 2–10 mm. Pupil data were then pre-processed using functions from the R package GazeR (Geller et al., 2020). Data loss (e.g., blinks) up to 150 ms in duration was imputed using linear interpolation. Finally, a subtractive baseline correction was applied in line with the recommendations in Mathôt et al. (2018). Median pupil size during the last 20 samples of the preceding trial and the first 20 samples of the current trial (approximately 267 ms at 150 Hz, incorporating an equal duration of light and dark screens) was taken as baseline pupil size, which was subtracted from all individual pupil sizes on each trial.
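The subtractive baseline correction described above can be illustrated with a short sketch. It is not the GazeR implementation; the function name and the pupil values are hypothetical, and the window of 20 samples on either side of the trial boundary follows the description in the text.

```python
from statistics import median

def baseline_correct(prev_trial, current_trial, n=20):
    """Subtract the median pupil size around the trial boundary from the current trial.

    The baseline window is the last n samples of the preceding trial plus
    the first n samples of the current trial.
    """
    window = list(prev_trial[-n:]) + list(current_trial[:n])
    baseline = median(window)
    corrected = [p - baseline for p in current_trial]
    return corrected, baseline

# Hypothetical pupil sizes in mm: a dilated pupil on the dark screen, then a
# constriction that begins 10 samples after the switch to the light screen.
prev_trial = [5.0] * 30
current_trial = [5.0] * 10 + [3.0] * 40

corrected, baseline = baseline_correct(prev_trial, current_trial)
print(baseline, corrected[0], corrected[-1])
```

After correction, negative values indicate constriction relative to the boundary baseline and positive values indicate dilation.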

Metrics and analyses

Calibration quality

To assess the calibration quality, two metrics were used based on the manufacturer’s calibration procedure: 1) the number of calibration attempts it took until the experimenter accepted the calibration, and 2) the average error of the accepted calibration. All calibrations needed to have nine valid calibration points to be accepted so the number of valid calibration points was not considered in further analyses.

Sampling rate variability

As the GP3-HD eye-tracker has a sampling frequency of 150 Hz, the expected average inter-sample time is approximately 0.0067 s (6.7 ms). Sampling rate variability was assessed by calculating the mean and the standard deviation of inter-sample time as well as their robust equivalents (median and median absolute deviation), for both the chinrest and no-chinrest condition. Sampling rate variability was assessed across both the Fixation-Saccade and the PLR task.
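As a rough illustration of these descriptives, the sketch below computes the mean, standard deviation, median, and median absolute deviation of inter-sample time, together with the proportion of intervals longer than two sample durations. The timestamps are hypothetical and the function name is an assumption, not part of the authors' pipeline.

```python
from statistics import mean, median, stdev

SAMPLING_RATE_HZ = 150
EXPECTED_MS = 1000.0 / SAMPLING_RATE_HZ  # about 6.67 ms

def inter_sample_summary(timestamps_s):
    """Classical and robust descriptives of inter-sample time, in milliseconds."""
    dt_ms = [(b - a) * 1000.0 for a, b in zip(timestamps_s, timestamps_s[1:])]
    med = median(dt_ms)
    return {
        "mean": mean(dt_ms),
        "sd": stdev(dt_ms),
        "median": med,
        "mad": median([abs(d - med) for d in dt_ms]),
        # Proportion of intervals longer than two nominal sample durations.
        "prop_long": sum(d > 2 * EXPECTED_MS for d in dt_ms) / len(dt_ms),
    }

# Hypothetical timestamps: a regular 150 Hz stream with two dropped samples.
timestamps = [i / SAMPLING_RATE_HZ for i in range(12) if i not in (5, 6)]
summary = inter_sample_summary(timestamps)
print(summary)
```

In this toy example the single long gap inflates the mean and standard deviation but leaves the median and MAD essentially unchanged, which is why both sets of descriptives are informative.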

Data loss

Data loss occurs when the eye-tracker cannot detect the position of the eyes, and individual samples where this happens are labeled as invalid by the device. Proportion of lost gaze was computed for each trial and participant and compared between conditions (chinrest and no-chinrest).

Accuracy

Prior to computing accuracy and precision, the first and last 250 ms of each trial were removed to give participants time to fixate on the new target dot and to limit the extent to which participants’ anticipatory saccades influenced these metrics. This interval was chosen after calculation and visual inspection of saccade latency. Accuracy was computed as the error between the estimated gaze location and the location of a known target (Holmqvist et al., 2012). Horizontal and vertical accuracy were calculated for each gaze sample in the Fixation-Saccade task by subtracting estimated x and y gaze coordinates from the pre-defined x and y coordinates of each target dot location. Sample-level global accuracies were calculated as Euclidian distances between estimated x and y gaze coordinates and pre-defined x and y coordinates of target dot locations (see Fig. 1c, d).

Having calculated all three types of sample-level accuracies, outlier samples were removed if they were greater than 4 median absolute deviations from the median respective accuracy for each participant, condition, and trial (2.1% of the samples were excluded for vertical accuracy, 3% for horizontal accuracy, and 2.9% for global accuracy—note that these values are not independent). These outliers corresponded mostly to saccades, with the size of the error matching the expected saccade sizes during the task (note that analyses with outliers yielded consistent results, see Tables S1 and S2 in Supplementary materials). This was performed to avoid biasing the accuracy calculation by including gaze samples where the participant was likely to have clearly moved their eyes away from the target dot. Finally, mean vertical, horizontal, and global accuracy were calculated for each participant, condition, and trial.
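The accuracy computation and outlier rule described above can be sketched as follows. This is an illustration under stated assumptions rather than the authors' code: the horizontal and vertical errors are taken as absolute offsets, the gaze samples are invented, and the function names are hypothetical.

```python
import math
from statistics import mean, median

def sample_errors(gaze_xy, target_xy):
    """Per-sample horizontal, vertical, and Euclidean (global) error."""
    tx, ty = target_xy
    return [
        (abs(tx - gx), abs(ty - gy), math.hypot(tx - gx, ty - gy))
        for gx, gy in gaze_xy
    ]

def drop_outliers(values, n_mad=4.0):
    """Keep values within n_mad median absolute deviations of the median."""
    med = median(values)
    mad = median([abs(v - med) for v in values])
    return [v for v in values if abs(v - med) <= n_mad * mad]

# Hypothetical gaze samples (degrees) around a target at the origin;
# the last sample mimics a saccade away from the target.
target = (0.0, 0.0)
gaze = [(0.30, 0.40), (0.36, 0.48), (0.24, 0.32),
        (0.33, 0.44), (0.27, 0.36), (11.0, 0.0)]

global_errors = [e[2] for e in sample_errors(gaze, target)]
kept = drop_outliers(global_errors)
global_accuracy = mean(kept)
print(len(kept), global_accuracy)
```

Note that when more than half of the values are identical the MAD is zero, in which case this simple rule keeps only values equal to the median; real data with measurement noise rarely hit that edge case.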

Following the calculation of descriptive statistics, linear mixed models in lme4 (Bates et al., 2014) were fitted to test whether global accuracy differed between conditions (chinrest and no-chinrest) and target dot locations (central and peripheral) while accounting for participant and trial random effects. Maximal models were always fitted first (Barr et al., 2013), and convergence and singularity warnings were resolved by simplifying the random structure using principal component analysis to determine the most relevant random components (Bates et al., 2015).

Finally, we decided to compare accuracy on the central dot location against accuracy on all the peripheral dot locations grouped together for two reasons: (a) in psychological research, stimuli are commonly presented at the center of the screen, and it might be useful for researchers to know whether accuracy at this location is superior to accuracy at any peripheral location; (b) since trials with different target dot locations varied in frequency and duration (central target was presented more frequently and for a longer period of time in comparison to peripheral targets), grouping all peripheral locations allowed us to increase statistical power. Additionally, only sample-level accuracies within the first 2 s of the central trials were used in this comparison in order to match their duration with the duration of peripheral trials. For analyses, accuracy was log-transformed to correct the violation of the assumption that the residuals of the model are normally distributed.

Precision

Precision is a measure of the spatial variance in accuracy when the eye is assumed to be relatively stationary (Holmqvist et al., 2012). Therefore, gaze samples were first parsed into fixations and saccades based on the adaptive velocity threshold described above. Only fixations longer than 80 ms were used for calculating precision to avoid including small saccades which may be inaccurately labeled as fixations. Horizontal and vertical precision were calculated on a trial-by-trial basis for each participant by computing the root mean square from successive gaze samples for each fixation to the target dot. Global precision was calculated by first estimating the Euclidian distances between pairs of adjacent gaze samples and then computing the root mean square over these distances. After calculating descriptive statistics, linear mixed models were fitted to test whether the three types of precision (horizontal, vertical, global) varied between conditions (chinrest and no-chinrest) and target locations (central and peripheral) while controlling for participant and trial variability. Comparison between central and peripheral target locations as well as the choice of model’s random structure followed the same logic as the analysis of accuracy. Precision data were also log-transformed for analysis due to non-normality of residuals.
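A minimal sketch of the sample-to-sample root mean square computation described above is given below. It assumes that gaze samples belonging to one fixation have already been selected; the function names and coordinates are hypothetical.

```python
import math

def rms(values):
    """Root mean square of a sequence of numbers."""
    return math.sqrt(sum(v * v for v in values) / len(values))

def precision_rms_s2s(xs, ys):
    """RMS of differences between successive gaze samples within one fixation."""
    dx = [b - a for a, b in zip(xs, xs[1:])]
    dy = [b - a for a, b in zip(ys, ys[1:])]
    dist = [math.hypot(a, b) for a, b in zip(dx, dy)]
    return {"horizontal": rms(dx), "vertical": rms(dy), "global": rms(dist)}

# Hypothetical fixation samples in degrees of visual angle.
xs = [0.0, 0.3, 0.0, 0.3]
ys = [0.0, 0.4, 0.0, 0.4]
precision = precision_rms_s2s(xs, ys)
print(precision)
```

Lower values indicate less sample-to-sample dispersion, that is, better precision.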

PLR

Following pupil pre-processing and baseline correction, the degree of PLR elicited by the changing stimulus was estimated (i.e., pupil constriction in response to light and pupil dilation in response to darkness). More specifically, linear mixed models were fitted for each condition (chinrest and no-chinrest) to compare pupil size changes in response to the two stimuli (black vs. white).

Saccade metrics

Saccade starting and landing error, amplitude, gain, curvature, latency, and mean and peak velocity were calculated (see Fig. 1). These metrics are provided to allow researchers to judge how well saccade parameters are reflected in gaze data from the GP3-HD, as there are no standard norms against which these values can be judged, other than direct comparisons with other systems.

Saccade starting error was calculated as the Euclidian distance between the gaze sample labeled as the saccade onset and the center of the starting target, while saccade landing error was calculated as the Euclidian distance between the gaze sample labeled as the saccade end and the target center (Dalmaijer, 2014). After calculating saccade amplitude (size of the saccade in degrees of visual angle), we computed gain, the ratio between the observed saccade amplitude and the expected saccade amplitude (i.e., the actual distance between the two consecutive target locations; Noto & Robinson, 2001). Saccades with gains less than 1 were too small (hypometric), while saccades with gains higher than 1 were too large (hypermetric). Gain thus approximates the degree to which the expected saccade amplitudes are over- or under-estimated.
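The starting error, landing error, amplitude, and gain defined above can be computed as in the sketch below. The coordinates are invented for illustration and the function names are hypothetical; they do not come from the authors' code.

```python
import math

def distance(p, q):
    """Euclidean distance between two (x, y) points."""
    return math.hypot(q[0] - p[0], q[1] - p[1])

def saccade_summary(onset_xy, offset_xy, start_target_xy, end_target_xy):
    """Starting error, landing error, amplitude, and gain for one saccade."""
    amplitude = distance(onset_xy, offset_xy)
    expected = distance(start_target_xy, end_target_xy)
    gain = amplitude / expected
    if gain < 1:
        kind = "hypometric"
    elif gain > 1:
        kind = "hypermetric"
    else:
        kind = "exact"
    return {
        "starting_error": distance(onset_xy, start_target_xy),
        "landing_error": distance(offset_xy, end_target_xy),
        "amplitude": amplitude,
        "gain": gain,
        "type": kind,
    }

# Hypothetical saccade (degrees): targets 11 deg apart, gaze lands short.
result = saccade_summary((0.2, 0.0), (10.1, 0.0), (0.0, 0.0), (11.0, 0.0))
print(result)
```

In this example the observed amplitude is about 9.9 degrees against an expected 11 degrees, giving a gain of about 0.9, which would be classed as hypometric.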

In order to capture saccade trajectories, curvature was calculated as the median angle between each gaze point in a saccade and an imaginary straight line connecting the start and the end of the saccade, following the strategy of van Leeuwen and Belopolsky (2018). Saccade latency is the time from target onset until the initiation of the saccade to that target.
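One possible reading of the curvature and latency definitions above is sketched below. It is an interpretation for illustration, not the procedure of van Leeuwen and Belopolsky (2018) or the authors' code: each intermediate sample's angle is measured at the saccade start, relative to the straight start-to-end line, and the function names and trajectories are hypothetical.

```python
import math
from statistics import median

def saccade_curvature_deg(samples_xy):
    """Median signed angle (degrees) between each intermediate sample and the
    straight line from the first to the last sample of the saccade."""
    (x0, y0), (x1, y1) = samples_xy[0], samples_xy[-1]
    ux, uy = x1 - x0, y1 - y0
    angles = []
    for x, y in samples_xy[1:-1]:
        vx, vy = x - x0, y - y0
        cross = ux * vy - uy * vx  # sign gives the side of the line
        dot = ux * vx + uy * vy
        angles.append(math.degrees(math.atan2(cross, dot)))
    return median(angles) if angles else 0.0

def saccade_latency_ms(target_onset_s, saccade_onset_s):
    """Time from target onset to saccade onset, in milliseconds."""
    return (saccade_onset_s - target_onset_s) * 1000.0

# Hypothetical trajectories in degrees of visual angle.
curved = [(0.0, 0.0), (2.0, 0.2), (5.0, 0.5), (8.0, 0.2), (10.0, 0.0)]
straight = [(0.0, 0.0), (2.0, 0.0), (5.0, 0.0), (8.0, 0.0), (10.0, 0.0)]

print(saccade_curvature_deg(curved), saccade_curvature_deg(straight))
print(saccade_latency_ms(1.000, 1.180))
```

With only a handful of samples per saccade at 150 Hz, the median is taken over very few angles, so curvature estimates of this kind should be interpreted cautiously.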

Finally, velocity was calculated for each gaze sample making up a saccade by dividing inter-sample distance (in degrees of visual angle) by inter-sample time (in seconds), after which mean and peak velocity were computed for each saccade as a whole.
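The velocity computation described above amounts to dividing each inter-sample distance by the corresponding inter-sample time and then summarizing over the saccade. A minimal sketch with invented samples follows; the function name is hypothetical.

```python
import math

def saccade_velocities(xs, ys, ts):
    """Velocity (deg/s) for each pair of successive samples within a saccade."""
    return [
        math.hypot(xs[i + 1] - xs[i], ys[i + 1] - ys[i]) / (ts[i + 1] - ts[i])
        for i in range(len(ts) - 1)
    ]

# Hypothetical horizontal saccade sampled at 150 Hz (positions in degrees).
xs = [0.0, 1.0, 3.0, 6.0, 8.0, 9.0]
ys = [0.0] * len(xs)
ts = [i / 150 for i in range(len(xs))]

velocities = saccade_velocities(xs, ys, ts)
mean_velocity = sum(velocities) / len(velocities)
peak_velocity = max(velocities)
print(mean_velocity, peak_velocity)
```

Because each velocity value is an average over one inter-sample interval, the peak obtained this way can only approach the true instantaneous peak as the sampling rate increases.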

Results

Calibration quality

Calibration metrics included the number of attempts it took to achieve an acceptable calibration, and the average error of the accepted calibration. Due to the violation of normality (Shapiro–Wilk test: W(12) = 0.72, p < .001), a Wilcoxon signed-rank test was performed to examine differences between the chinrest and no-chinrest conditions on both calibration quality metrics. No differences were detected in the number of calibration attempts (M_CHINREST = 1.81, M_NOCHINREST = 1.58, W(12) = 30, p = .402) nor in the average error of the accepted calibration (M_CHINREST = 0.93^{Footnote 3}, M_NOCHINREST = 0.99, W(12) = 47, p = .946) (see Fig. 2).

Sampling rate variability

As expected, the observed average inter-sample time was 6.7 ms and the standard deviation of the inter-sample time was 0.79 ms (robust descriptives: Mdn = 6.68 ms; MAD = 35.29 µs). Only 0.0003% of the inter-sample times were greater than the duration of two consecutive samples (13.4 ms). Distribution of inter-sample time is shown in Fig. 2.

Data loss

A Wilcoxon signed-rank test was performed to test whether the chinrest and no-chinrest conditions differed in the proportion of lost gaze across tasks, and no differences were observed (M_CHINREST = .031, M_NOCHINREST = .029, W(12) = 43, p = .583) (see Fig. S4 in the Supplementary materials). Additionally, the quality of the accepted calibration prior to each combination of condition and task (Fixation-Saccade task – chinrest condition; Fixation-Saccade task – no chinrest condition; pupil task – chinrest condition; pupil task – no chinrest condition) did not correlate with the proportion of lost gaze (Kendall’s tau: t_{FIX_SACC_CHINREST} = –.24, p = .228; t_{FIX_SACC_NO_CHINREST} = .01, p = 1; t_{PUPIL_CHINREST} = .01, p = 1; t_{PUPIL_NO_CHINREST} = –.21, p = .331).

Accuracy

A visualization of participants’ gaze positions superimposed on target positions for the Fixation-Saccade task is provided in Fig. 3. No differences in global accuracy, defined as the Euclidian distance of the actual gaze sample from the expected gaze position, were found between conditions with and without the chinrest (estimate < 0.01, SE = 0.01, t = – 0.37, p = .710), while peripheral target locations had slightly better global accuracy in comparison to the central position (estimate = – 0.09, SE = 0.02, t = – 4.29, p < .01). However, different accuracy profiles were observed when vertical and horizontal accuracies were analyzed separately. In the case of vertical accuracy, no differences were found between conditions (estimate = 0.02, SE = 0.01, t = 1.64, p = .102), while vertical error at the peripheral target locations was lower than at the central target location (estimate = – 0.19, SE = 0.03, t = – 6.79, p < .001). In the case of horizontal accuracy, better accuracy was achieved without the chinrest (estimate = – 0.04, SE = 0.01, t = – 3.16, p < .01), and peripheral target locations had worse horizontal accuracy in comparison to the central target location (estimate = 0.16, SE = 0.03, t = 4.95, p < .001). Descriptive statistics for vertical, horizontal, and global accuracy for each condition and target location can be seen in Fig. 3 and Table 2.

Table 2 Descriptive statistics for accuracy and precision

These results suggest that globally, the accuracy of the GP3-HD is closer to the upper bound of the expected ~ 0.5–1° values, although horizontal accuracy is consistently closer to 0.5°. Nonetheless, the range of accuracy values is similar to what has been reported in previous evaluations of commercial eye-trackers (Funke et al., 2016; Holmqvist, 2017). Overall, the GP3-HD shows precision below 0.5°, which is also in line with what is reported using similarly priced eye-trackers and even some high-end systems (Funke et al., 2016; Holmqvist, 2017).

Precision

No differences were detected between central and peripheral target locations in any of the precision metrics (Horizontal: estimate = 0.01, SE < 0.01, t = 1.86, p = .069; Vertical: estimate < 0.01, SE < 0.01, t = – 1.07, p = .288; Global: estimate < 0.01, SE = 0.01, t = 0.52, p = .607), whereas the differences between chinrest and no-chinrest conditions showed a less consistent pattern. The no-chinrest condition yielded increased vertical precision (estimate = – 0.01, SE < 0.01, t = – 2.66, p < .01), whereas no differences were observed in horizontal (estimate < 0.01, SE < 0.01, t = 1.53, p = .126) and global precision (estimate < 0.01, SE < 0.01, t = – 0.80, p = .426). Descriptive statistics for vertical, horizontal, and global precision for each condition and target location can be seen in Table 2.

Bayesian equivalents for condition comparisons are provided in the Supplementary materials – see Bayesian analyses for condition comparisons.

Pupillary light reflex

The PLR was reliably detected in both the chinrest (estimate = – 19.55, SE = 0.26, t = – 74.50, p < .001) and no-chinrest conditions (estimate = – 20.75, SE = 0.30, t = – 69.79, p < .001; see Fig. 4a, b). As expected, the PLR effect is very large, accounting for 95% of the variance in both conditions.

Saccadometry

Descriptive statistics for different saccade metrics are provided in Table 3. The relationships between saccade amplitude, duration and peak velocity are visualized in Fig. 4c, d.

Table 3 Saccade metrics calculated using GP3-HD data

As expected, since the majority of the saccades detected in the Fixation-Saccade task would be classified as small using the guidelines of Gibaldi and Sabatini (2021), the relationship between these metrics was approximately linear.

Finally, we examined each participant’s saccade trajectories from the central target to each peripheral target (see Fig. 5). Individual saccade trajectories plotted in Fig. 5 are not smooth, but appear broken and edgy, which is a sign of under-sampling (Dalmaijer, 2014). While clear identification of saccade events is possible, this suggests that the GP3-HD eye-tracker is less suitable for saccadometry research.

In summary, accuracy measures are comparable to the range reported in the existing eye-tracking literature for similar grade devices. In both ideal (chinrest) and non-ideal (no-chinrest) conditions, overall gaze error was at, or even above, the upper limit of the values stated by the manufacturer (1 degree of visual angle). While this degree of accuracy is perfectly capable of capturing gaze behavior reliably across most experimental tasks, tracking targets smaller than this error could be problematic. While clear separation of saccades and fixations is possible using the GP3-HD, study of the properties of saccade kinematics is compromised by the low sampling rate.

Discussion

Experiment 1 provided a standard validation of the GP3-HD eye-tracker. Overall, calibration and data loss were acceptable, with calibration achieved successfully for most participants after one or only a few attempts, and with a low rate of data loss over the experiment, comparable to that reported for more expensive eye-trackers (Holmqvist et al., 2011). Accuracy and precision of gaze tracking were also acceptable, but closer to the upper desired limits (~1 degree of visual angle). This means that during stimulus design and presentation, researchers should accommodate this tracking error by ensuring that target locations are separated by 3 degrees of visual angle or more, so that if a participant were looking exactly midway between two targets, the estimated gaze locations (even accounting for error) would not fall onto either of the targets. It is worth noting, however, that better accuracy values were achieved (0.5 degrees of visual angle), particularly for the horizontal dimension. This is useful to know, as it is possible to calibrate how gaze is assigned to areas of interest given the tracking error associated with a particular participant at a particular point in time (Hessels & Hooge, 2019; Orquin & Holmqvist, 2018). Similarly, individualized accuracy and precision profiles can be used to calibrate gaze parsing filters (see Feit et al., 2017).
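As a practical illustration of the separation guideline, the following sketch converts a visual angle into on-screen pixels using standard viewing geometry. The function names and the screen geometry in the example (60 cm viewing distance, 53 cm wide screen, 1920 px horizontal resolution) are hypothetical and are not the settings of this study; researchers would substitute their own setup.

```python
import math

def deg_to_px(angle_deg, distance_cm, screen_width_cm, screen_width_px):
    """Size in pixels of a stimulus subtending angle_deg, centred on the line of sight."""
    size_cm = 2.0 * distance_cm * math.tan(math.radians(angle_deg) / 2.0)
    return size_cm * screen_width_px / screen_width_cm

def px_to_deg(size_px, distance_cm, screen_width_cm, screen_width_px):
    """Inverse conversion: visual angle subtended by size_px on the screen."""
    size_cm = size_px * screen_width_cm / screen_width_px
    return math.degrees(2.0 * math.atan(size_cm / (2.0 * distance_cm)))

# Hypothetical setup (assumed values, not taken from the study).
DISTANCE_CM, WIDTH_CM, WIDTH_PX = 60.0, 53.0, 1920

# Minimum centre-to-centre target separation implied by a 3-degree rule.
min_separation_px = deg_to_px(3.0, DISTANCE_CM, WIDTH_CM, WIDTH_PX)
```

The same conversion can be used in the other direction to express a measured pixel error as degrees of visual angle when checking a participant's accuracy against an area-of-interest margin.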

With regard to saccade metrics, while the identification of saccades appears good, the calculation of kinematic parameters appears to be affected by the low sampling rate. As a result, the study of the properties of saccades using the GP3-HD may lead to inaccuracies depending on the specific parameters being studied. For example, it appears that the amplitude of saccades is underestimated, as the gain value was < 1, indicating hypometric saccades, that is, saccades where the amplitude was smaller than expected. Looking at the plots in Fig. 5, it is apparent that saccade patterns show breaks as a result of the sampling rate. This is consistent with Nyquist's theorem, which specifies that a signal must be sampled at more than twice the highest frequency component of the signal. Note, however, that saccade detection as well as peak velocity approximations are reliable with even 60 Hz sampling, and there are saccade reconstruction techniques that can improve the approximation of the “true” parameters of saccades (Wierts et al., 2008).
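The two quantities discussed here can be sketched as follows: saccade gain as the ratio of observed to required amplitude, and the number of samples a tracker collects during one saccade at a given sampling rate. This is an illustrative sketch only. The function names, the coordinates, and the 40 ms saccade duration are assumptions chosen for the example, not values from this study.

```python
import math

def saccade_gain(saccade_start, saccade_end, fixation_target, peripheral_target):
    """Observed saccade amplitude divided by the amplitude required to reach the target.

    A gain below 1 indicates a hypometric saccade (undershoot); above 1, an overshoot.
    All points are (x, y) tuples in the same units, e.g. degrees of visual angle.
    """
    observed = math.dist(saccade_start, saccade_end)
    required = math.dist(fixation_target, peripheral_target)
    return observed / required

def samples_per_saccade(duration_ms, sampling_rate_hz):
    """Expected number of samples recorded over a saccade of the given duration."""
    return duration_ms / 1000.0 * sampling_rate_hz

# Hypothetical saccade that lands 1 degree short of a target 10 degrees away.
gain = saccade_gain((0.0, 0.0), (9.0, 0.0), (0.0, 0.0), (10.0, 0.0))

# An assumed 40 ms saccade yields very few samples at 60 Hz compared with 1000 Hz.
n_60 = samples_per_saccade(40.0, 60.0)
n_1000 = samples_per_saccade(40.0, 1000.0)
```

With so few samples per saccade at low rates, the first and last recorded points can fall inside the movement rather than at its true onset and offset, which is one way amplitude can end up underestimated.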

The lack of consistent significant differences between chinrest and no-chinrest conditions suggests that the algorithm for gaze estimation used by GP3-HD is relatively robust to small head movements. However, increased infra-red ‘bounce’ was observed during the sessions with a chinrest, where small and temporary reflections from the chinrest were mistakenly detected as part of the eye. Improvements in tracking accuracy expected from use of the chinrest may have been lost as a result. While this problem is solvable by eliminating infra-red reflections, a possibility for development would be for Gazepoint to offer a user intervention enabling the operator to manually correct the misidentified reflections during calibration, such that the Gazepoint algorithm would subsequently underweight those regions when estimating gaze location. Similarly, providing the stream of estimated distances from the screen, in addition to the gaze coordinates and pupil samples, would help researchers accommodate changes in distance from the screen in non-chinrest conditions.

Experiment 2

Experiment 1 provided a standard validation of the GP3-HD eye-tracker in terms of its accuracy and precision, degree of data loss and the benefits of head stabilization. While this provides a useful benchmark assessment of the GP3-HD, such eye-tracking validation studies do not reflect typical experiments, in which task demands are usually greater and the study design imposes fewer checks on the eye-tracker’s performance (Niehorster et al., 2018). Therefore, in Experiment 2 we provide data from a typical real-world psychological experiment to assess GP3-HD performance. Importantly, as typical behavioral experiments may vary from a few minutes to hours, we investigated how data quality parameters changed over time, across a 1-h-long experiment.
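One simple way to examine data quality over time is to compute the proportion of lost samples within consecutive time bins. The sketch below is a hypothetical illustration of that idea, not the analysis reported here; the function name, the per-sample validity flag, and the example values are assumptions.

```python
def data_loss_by_bin(timestamps_s, valid_flags, bin_width_s):
    """Proportion of invalid samples in each time bin.

    timestamps_s : sample times in seconds from the start of the recording
    valid_flags  : True where the tracker reported a valid gaze sample
    Returns a dict mapping bin index (0, 1, 2, ...) to the proportion lost.
    Bins that contain no samples are omitted from the result.
    """
    totals, lost = {}, {}
    for t, ok in zip(timestamps_s, valid_flags):
        b = int(t // bin_width_s)
        totals[b] = totals.get(b, 0) + 1
        if not ok:
            lost[b] = lost.get(b, 0) + 1
    return {b: lost.get(b, 0) / totals[b] for b in sorted(totals)}

# Hypothetical recording: six samples, split into two 3-second bins.
loss = data_loss_by_bin(
    [0, 1, 2, 3, 4, 5],
    [True, True, False, True, False, False],
    bin_width_s=3,
)
```

Plotting these proportions against bin index would show whether data loss drifts upward as a session progresses.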

Experiment 1 also demonstrated that GP3-HD is able to capture pupil changes such as the PLR. In most cognitive and behavioral research, however, researchers are typically interested in how pupil changes are modulated by cognitive and affective factors rather than low-level visual properties. This can range from quantifying cognitive effort (Papesh & Goldinger, 2012; Piquado et al., 2010) to emotional arousal (Bradley et al., 2008). Considering that cognitive and emotional effects on pupil diameter are much smaller in magnitude than luminance effects (Mathôt, 2018), it is unclear how well the GP3-HD is able to capture such effects. In Experiment 2 we used data from a paradigm that allowed us to explore how pupil size is modulated by emotional factors. Specifically, the effect of viewing emotionally arousing images on pupil diameter was investigated. We predicted a positive correlation between self-rated arousal and pupil size. Since this task makes use of naturalistic visual stimuli varying in low-level properties, pupil size can also be regressed on the luminance of the stimuli to model modulation of pupil size due to the PLR. Stimulus brightness should have a negative correlation with pupil size, such that brighter stimuli should lead to decreased pupil size (pupil constriction) whereas darker stimuli should predict increased pupil size (pupil dilation).
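The logic of modeling arousal and luminance together can be sketched with an ordinary least-squares fit using two predictors. This is a deliberately simplified illustration and should not be read as the statistical model used in this study. The function name is hypothetical and the data are synthetic and noise-free, generated from assumed coefficients purely so that the expected signs are visible.

```python
from statistics import fmean

def fit_two_predictors(y, x1, x2):
    """Ordinary least squares for y = b0 + b1*x1 + b2*x2 via the normal equations."""
    my, m1, m2 = fmean(y), fmean(x1), fmean(x2)
    c1 = [v - m1 for v in x1]   # centred predictor 1
    c2 = [v - m2 for v in x2]   # centred predictor 2
    cy = [v - my for v in y]    # centred outcome
    s11 = sum(a * a for a in c1)
    s22 = sum(b * b for b in c2)
    s12 = sum(a * b for a, b in zip(c1, c2))
    s1y = sum(a * v for a, v in zip(c1, cy))
    s2y = sum(b * v for b, v in zip(c2, cy))
    det = s11 * s22 - s12 ** 2
    if det == 0:
        raise ValueError("predictors are constant or perfectly collinear")
    b1 = (s22 * s1y - s12 * s2y) / det
    b2 = (s11 * s2y - s12 * s1y) / det
    b0 = my - b1 * m1 - b2 * m2
    return b0, b1, b2

# Synthetic trials (assumed values): arousal ratings and normalized luminance.
arousal = [1, 3, 5, 7, 9, 2]
luminance = [0.2, 0.8, 0.4, 0.6, 0.1, 0.9]
# Synthetic pupil size built from assumed effects: +0.2 per arousal unit,
# -1.5 per unit of luminance, around a baseline of 4.0.
pupil = [4.0 + 0.2 * a - 1.5 * l for a, l in zip(arousal, luminance)]

intercept, b_arousal, b_luminance = fit_two_predictors(pupil, arousal, luminance)
```

A positive arousal coefficient alongside a negative luminance coefficient is the pattern the predictions describe; with real data, repeated measures per participant would usually call for a model that accounts for that structure.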

Additionally, in Experiment 2, SC and HR data were collected from participants simultaneously using both the GPB and a well-validated physiological recording system (BIOPAC-MP160). Strong correlations between devices would indicate the capability of the GPB to capture physiological signals.
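Comparing two devices that sample at different rates generally requires placing both signals on a common time base before correlating them. The sketch below illustrates one such approach, using linear interpolation followed by a Pearson correlation. It is an assumption-laden example rather than the procedure used in this study: the function names, timestamps, and signal values are all hypothetical.

```python
import math
from bisect import bisect_left

def interpolate(times, values, query_times):
    """Linearly interpolate (times, values) at query_times.

    times must be sorted in ascending order. Queries outside the recorded
    range are clamped to the first or last value.
    """
    out = []
    for t in query_times:
        i = bisect_left(times, t)
        if i == 0:
            out.append(values[0])
        elif i == len(times):
            out.append(values[-1])
        else:
            t0, t1 = times[i - 1], times[i]
            w = (t - t0) / (t1 - t0)
            out.append(values[i - 1] + w * (values[i] - values[i - 1]))
    return out

def pearson_r(a, b):
    """Pearson correlation coefficient between two equal-length sequences."""
    ma, mb = sum(a) / len(a), sum(b) / len(b)
    cov = sum((x - ma) * (y - mb) for x, y in zip(a, b))
    va = sum((x - ma) ** 2 for x in a)
    vb = sum((y - mb) ** 2 for y in b)
    return cov / math.sqrt(va * vb)

# Hypothetical signals: device A sampled at 1 Hz, device B at 2 Hz.
a_times, a_values = [0.0, 1.0, 2.0, 3.0], [1.0, 2.0, 4.0, 3.0]
b_times = [0.0, 0.5, 1.0, 1.5, 2.0, 2.5, 3.0]

a_on_b = interpolate(a_times, a_values, b_times)  # device A on device B's clock
b_values = [2.0 * v + 5.0 for v in a_on_b]        # B as a rescaled copy of A
r = pearson_r(a_on_b, b_values)
```

Because the correlation is insensitive to scale and offset, two devices can correlate strongly even when one reports a much lower signal amplitude, so amplitude differences need to be examined separately.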