Online webcam-based eye tracking in cognitive science: A first look
Keywords: Online experiment, Web technology, Eye tracking, Online study, Cognitive psychology
Eye tracking provides a unique way to observe the allocation of human attention in an extrinsic manner. By identifying where a person looks, scientists are able to identify what guides human visual attention. Yarbus (1967) was one of the first to quantify this attentional value by recording eye-gaze patterns, and since then techniques to measure and algorithms to interpret eye movements have only been improved. The main approaches to utilize eye-tracking data are to measure the onset of a saccade – a jerk-like re-allocation of foveal fixation – and smooth pursuit movements, the stable tracking of moving objects. While the former introduces a shift in human attention by voluntarily or involuntarily centering a point of interest at the physically highest resolution (the fovea), pursuit movements are used to follow motion. Therefore, using eye fixations as a metric of information gathering reveals when, for how long, and how often someone looks at certain parts of an image to obtain visual information.
Consequently, eye tracking can be found in many areas of science (see Duchowski, 2002, for an overview). Its applications range from basic research on perception (e.g., Chua, Boland, & Nisbett, 2005) or memory (e.g., Allopenna, Magnuson, & Tanenhaus, 1998), through more applied fields such as marketing (e.g., Wedel & Pieters, 2000), industrial engineering (e.g., Chapman, Underwood, & Roberts, 2002), and learning (van Gog & Scheiter, 2010), to clinical settings, for example to quantify differences between special populations and controls (e.g., Boraston & Blakemore, 2007; Holzman, Proctor, & Hughes, 1973). Taken together, the ease of application makes eye tracking one of the most widespread scientific tools.
On the other hand, eye tracking can be considered a laborious approach. It relies on expensive and stationary in-lab equipment, usually a meticulous calibration procedure as well as the need for a scientist to perform the experiment. Fortunately, recently a movement in the eye-tracking research community has emerged that tries to move away from the classical costly setup towards cheaper alternatives of eye-tracking devices (Dalmaijer, 2014), open source software for data analyses (Dalmaijer, Mathôt, & Van der Stigchel, 2013; Mathôt, Schreij, & Theeuwes, 2012), wearable eye trackers (Bulling & Gellersen, 2010), or even the use of webcams for recording gaze patterns (Valenti, Staiano, Sebe, & Gevers, 2009). Together, these endeavors indicate that scientists in the field of eye tracking wish to broaden its applications by making use of new technologies as supplements or even substitutes to classical tracking setups. Beyond a reduction of costs, the exchange of open source software and new areas of data acquisition (e.g., mobile recordings through wearables or large-scale longitudinal studies through consumer-grade devices) can be defined as main factors for a change of behavior in the community.
As such, the movement of the eye-tracking research community is comparable to a similar movement in the area of psychophysics. Following the success of online questionnaires (Birnbaum, 2000, amongst others), several studies extended methodological investigations of online data acquisition to error-rate-based studies (Germine et al., 2012) and, lately, to comparisons of millisecond-based reaction time studies (Semmelmann & Weigelt, 2016). The general findings are that an overwhelming majority of experimental effects can be reproduced in an online setting and, in line with earlier literature about online participants (Gosling, Vazire, Srivastava, & John, 2000), that the data are of the same quality as data acquired in the lab. The reasons for moving towards online data collection are numerous: main points include the opportunity to obtain larger, more diverse participant samples required for the generalization of results, a higher speed of data acquisition thanks to parallel, autonomous data recordings, independence of available soft- and hardware, and lower costs due to less time involvement of participants as well as scientists. If the eye-tracking community could utilize these advantages in its movement towards new areas, larger and more diverse participant samples and the use of new technology could likewise be facilitated to a greater extent.
While we analyzed each of the tasks in a classical in-lab setting (also using consumer-grade webcams) to have a baseline of accuracy, we extended our data set by acquiring data through an online (crowdscience) environment. This allowed us to go a step beyond the technical question of hard- and software accuracy in a controlled environment by changing the setting in which the experiment is conducted, and to separate limitations of the technology itself from effects of online data acquisition. While we generally assumed that online data might be noisier and potentially less accurate than in-lab acquired data, we specifically investigated the differences and discuss potential solutions for emerging offsets. Taken together, we see this as one of the first investigations on the potential and possible limitations of webcam-based online eye tracking in the field of cognitive psychology and beyond.
Hardware and software
In our in-lab setting, participants were seated 50 cm from the 15-in. display of a MacBook Pro with a resolution of 1,680 × 1,050 px and its built-in webcam (720p FaceTime HD). The experiment itself was programmed in HTML5, supported by jQuery 1.12.3 and PHP 5.3.3 on an Apache 2.6.18. The eye-tracking algorithms were taken from Papoutsaki et al. (2016; https://webgazer.cs.brown.edu) due to the good documentation and ease of integration into a new application and were modified to fit our purpose (available at the Open Science Framework; https://osf.io/jmz79). Mainly, we removed the reliance on mouse movements and clicks and integrated a calibration-based model training method. Although the algorithm was not intended for external calibration, we were able to get it running “out of the box” and to adapt it to our needs thanks to its clear, object-oriented, and well-documented build.
Stimuli and design
Calibration and validation
After the calibration, validation started. In each validation phase, five points appeared one after the other, in random order, on each of the four edge positions of the calibration (positions being 20% top, bottom, left, and right) and in the center. Each validation sample required the fixation to be within 200 px (4.58° in-lab) of the center of the dot. If a position was not successfully validated within 20 s, the participant was sent back to the instructional page and had to re-calibrate. If there were five unsuccessful attempts in a row, the user was informed that their hardware, software, or setup was not suitable for the experiment and the experiment ended automatically. If the webcam was successfully calibrated and validated, the actual task started. Before each task, the participant received a written on-screen instruction.
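The validation rule described above can be sketched in a few lines. This is only an illustration under stated assumptions, not the authors' implementation (their code is JavaScript and is available at the OSF link): the function names `is_valid_sample` and `next_step` are hypothetical, and only the 200 px radius and the limit of five attempts are taken from the text.

```python
import math

VALIDATION_RADIUS_PX = 200  # from the text: gaze must be within 200 px of the dot center
MAX_FAILED_ATTEMPTS = 5     # from the text: five unsuccessful attempts in a row end the study


def is_valid_sample(gaze, target, radius=VALIDATION_RADIUS_PX):
    """Return True if a gaze estimate (x, y) lies within `radius` px of the target center."""
    return math.dist(gaze, target) <= radius


def next_step(consecutive_failures):
    """Hypothetical helper: decide what happens after a validation position timed out."""
    if consecutive_failures >= MAX_FAILED_ATTEMPTS:
        return "abort"       # setup is considered unsuitable for the experiment
    return "recalibrate"     # participant returns to the instruction page
```

In such a sketch, the 20 s timeout would be handled by the caller, which counts consecutive failures and passes the count to `next_step`.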
Each trial of the fixation task started with a 1,500 ms black fixation cross, followed by a 500-ms blank screen, before a 50 × 50 px blue-colored (RGB: 221, 73, 75) dot appeared for 2,000 ms. The fixation cross and dots each covered 50 × 50 px, which corresponded to 1 cm on-screen size in the in-lab setting, translating into 1.15° of visual angle.
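The visual angles reported for the in-lab setting follow from the standard formula θ = 2·atan(s / 2d), with stimulus size s and viewing distance d. The sketch below assumes the in-lab values given in the text (50 cm viewing distance, roughly 50 px per cm); the function names are illustrative only.

```python
import math


def px_to_cm(px, px_per_cm=50.0):
    """Convert pixels to centimeters (about 50 px per cm on the in-lab display)."""
    return px / px_per_cm


def visual_angle_deg(size_cm, distance_cm=50.0):
    """Visual angle in degrees subtended by `size_cm` at `distance_cm`."""
    return math.degrees(2 * math.atan(size_cm / (2 * distance_cm)))


# A 1 cm dot at 50 cm subtends about 1.15 degrees; the 200 px validation
# radius (about 4 cm) corresponds to about 4.58 degrees.
dot_angle = visual_angle_deg(px_to_cm(50))
radius_angle = visual_angle_deg(px_to_cm(200))
```

For online participants neither the viewing distance nor the physical pixel size is known, so this conversion only applies to the in-lab case.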
In the pursuit task a ramp stimulus (Fukushima et al., 2013) was used. At the beginning of each trial, a black dot appeared at one of four starting positions, each having 20% margin towards either the top or bottom and either left or right (see Fig. 2). The dot turned red after 1,500 ms and moved towards another position within 2,000 ms. Starting position and motion were randomly picked from the 12 possible combinations. The distance covered was 596 px (11.5 cm, 13.12°) in the vertical, 1,007 px (20 cm, 22.62°) in the horizontal, and 1,170 px (23 cm, 25.91°) in the diagonal motions.
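The 12 start/end combinations follow from four corner positions with three possible destinations each. A minimal sketch of how such trials could be enumerated is given below; the function names and the viewport size used in the usage note are assumptions, not values from the authors' code.

```python
import math
from itertools import permutations


def corner_positions(width, height, margin=0.20):
    """Four positions with a 20% margin from the viewport edges."""
    xs = (margin * width, (1 - margin) * width)
    ys = (margin * height, (1 - margin) * height)
    return [(x, y) for y in ys for x in xs]


def pursuit_trials(width, height):
    """All ordered (start, end) pairs of distinct corners with their distance in px."""
    trials = []
    for start, end in permutations(corner_positions(width, height), 2):
        trials.append({"start": start, "end": end,
                       "distance_px": math.dist(start, end)})
    return trials
```

Calling `pursuit_trials(1000, 500)` on a hypothetical 1,000 × 500 px viewport yields 12 trials with three distinct path lengths: vertical, horizontal, and diagonal.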
Lastly, the free-viewing task consisted of a centered 1,000 ms fixation cross, followed by a 500-ms blank screen, before one of 30 facial images (15 female, 15 male) was shown for 3,000 ms. The image was aligned with its center at 25% from either the left border or the center of the screen and was scaled to fill 100% of the height of the screen. Each trial ended with a blank screen, before the next trial started. On average, the faces covered 14 × 16 cm (615 × 820 px) of the in-lab screen, corresponding to 15.94° × 18.18° of visual angle. All images were taken from the Glasgow Unfamiliar Face Database (Burton, White, & McNeill, 2010; http://www.facevar.com/downloads/gufd), cropped to the faces with hair only and presented on a grey background.
The experiment was approved by the local ethics committee and participants provided a completed consent form before being able to continue on the website. All data was analyzed on the fly on the participants’ computers (either on the MacBooks for the in-lab or their personal systems in the online case), thus only sending the eye position to our server. Compared to sending whole video files, this approach allows increasing anonymity and decreasing data size to a large degree.
Thirty in-lab participants (age M = 22.68 years, SD = 3.71, range 19–32; one did not report their age; 76% female, 24% male) were recruited at the Ruhr-Universität Bochum and participated for course credit or 5€ for the 30-min experiment. For the crowdscience approach, we used www.crowdflower.com to distribute the experiment. In total, we had 84 consent form transmissions, which divided into 52 incomplete and 32 complete data sets. Of the incomplete ones, 27 started the first, 11 the second, four the third, and three the fourth calibration before quitting the experiment. Overall, there were 59 failed validation trials over 25 participants of the incomplete data sets, 33 over 32 participants in the complete online data sets, and four over 30 participants in the in-lab case. These numbers show that most dropouts occur due to either (a) privacy concerns or (b) high perceived effort to complete the task before starting the calibration. Only complete online participations were rewarded with a US$4 compensation. We distributed the experiment in Germany only to avoid potential cultural influences in comparison to the in-lab participants. The mean age of the online participants was 39.92 years (SD = 12.88, range 20–62; two did not report their age), 29% identified as female, and 71% as male. On average, in-lab participants indicated fewer visual impairments (17% glasses, 14% corrected through contact lenses, 7% uncorrected, vs. 62% without visual impairment) than online participants (32% glasses, 11% corrected through contact lenses, 46% good vision, 11% uncorrected). Overall, one in-lab participant had to be excluded because of missing data (transmission error), and four crowdflower data sets were removed due to multiple participation (see the note on crowdflower below) and conducting the task without visual aids despite the participants having identified themselves as visually impaired. This left us with 29 data sets for the in-lab and 28 for the online setting for all further analyses.
For the fixation task
As an initial indication of whether data quality was high enough to detect changes in fixation position through saccades, we plotted the fixation positions over time for the fixation task. This allowed observing the change from the middle of the screen (fixation cross) towards the appearing dot at a very basic level, given a sufficiently accurate algorithm. To account for saccadic preparation and the saccade itself (Walker, Walker, Husain, & Kennard, 2000), we assumed that the participants’ fixations should have arrived at the target point 1,000 ms after the target appears (3,000–4,000 ms). During this time frame, we further quantified the accuracy of the approach by calculating the offset as the Euclidean distance between intended target position and recorded gaze estimation. Accordingly, analyses of variance (ANOVAs) quantified a potential difference between position of the target and experimental setting (in-lab vs. online).
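As an illustration of the accuracy measure described above, the following sketch computes the mean Euclidean offset between gaze estimates and the target within the 3,000–4,000 ms window. The sample format `(t_ms, x, y)` and the function name are assumptions made for this example.

```python
import math


def mean_offset(samples, target, t_start=3000, t_end=4000):
    """Mean Euclidean distance (px) between gaze samples and the target.

    samples: iterable of (t_ms, x, y) gaze estimates for one trial.
    Only samples with t_start <= t_ms <= t_end are used.
    Returns None if no sample falls into the window.
    """
    dists = [math.dist((x, y), target)
             for t, x, y in samples
             if t_start <= t <= t_end]
    if not dists:
        return None
    return sum(dists) / len(dists)
```

Averaging this value per participant would give one offset per person and setting, which can then enter an ANOVA.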
We followed a similar approach with the pursuit task. Here, we expected a saccade towards the initial fixation point (which appears at trial start), followed by a steady Euclidean distance to the target that should not change significantly if participants followed the movement (starting at 1,500 ms) and the algorithm was sensitive enough to follow the variability in fixation position. Consequently, we first calculated the offset per participant between 2,000 and 4,000 ms and calculated a t-test to identify the potential difference between in-lab and online data acquisition. Beyond this main analysis, to gain a better understanding of the overall data quality, we calculated a principal component analysis (PCA) for each setting and for each direction of movement to be able to compare the intended target movement to the recorded viewing behavior. Furthermore, to compare our results to earlier pursuit literature (e.g., Duchowski, 2002; Lisberger, Morris, & Tychsen, 1987), we calculated pursuit speed.
As we instructed participants to look freely at the stimulus in the free-viewing task, we initially used a very basic differentiation of fixated display side to check whether an appropriate saccade towards the target stimulus was detectable, or whether participants did not look at the presented image at all. For more in-depth analyses, we defined regions of interest (ROIs) in each image shown. This was done by dividing each image into a grid with 5% distances and then marking the coordinates (similar to Dalmaijer, 2014; Pelphrey et al., 2002) for image (full image), head (leaving out white space on the image, on average 71.32% of the image area), inner face (not including forehead, chin or ears, 26.99%), and the key regions of the face, namely eyes (without eyebrows, 7.98%), nose (2.72%), and mouth (3.63%), as can be seen in Fig. 2. These ROI analyses allowed checking for a replication of the distribution of fixation time on key facial features (e.g., Caldara et al., 2010) and for potential differences between experimental settings (in-lab vs. online).
Besides those main factors, we further analyzed potential non-task-specific factors. As a first factor, we compared overall experimental duration to reveal whether online participants take longer or shorter and/or take more or fewer pauses compared to in-lab observers. Additionally, we assumed a difference in available hardware power between in-lab and home systems. Thus, we calculated the frame rate (frames per second, fps) our application was able to achieve and investigated potential influences on the results above. These factors indicate whether potential accuracy differences are due to behavioral (longer pauses, less concentration) or computational (low-performing computer systems) reasons.
Please note that all analyses were intentionally carried out on unprocessed, unsmoothed data without outlier removal to give a picture of the raw data quality and to identify potential noise. Additional processing (which often depends on the research question) would allow improving the accuracy of results in subsequent studies. To allow detailed questions on how data quality improvements can enhance the analyses to be answered, we published the data online for interested colleagues to apply their methodology (see above). Furthermore, due to the differences in screen size, some statistical tests were calculated based on percentage of screen size instead of absolute pixels. This is necessary because (a) the position of stimuli was calculated relative to screen size (e.g., “20% from the left edge”) and (b) only in this way are values comparable across the different screen sizes and resolutions used in the online case. We tried to note both pixel-wise values (for a clearer mental representation) and percentage values (for better comparability) whenever possible.
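To illustrate how a PCA can be used to compare recorded gaze with the intended target movement, the sketch below computes the first principal axis of 2D gaze samples in closed form and its angular deviation from the target path. This is one possible reading of the analysis, not the authors' procedure; all function names are hypothetical.

```python
import math


def principal_axis_deg(points):
    """Orientation (0-180 degrees) of the first principal component of 2D points."""
    n = len(points)
    mx = sum(x for x, _ in points) / n
    my = sum(y for _, y in points) / n
    sxx = sum((x - mx) ** 2 for x, _ in points) / n
    syy = sum((y - my) ** 2 for _, y in points) / n
    sxy = sum((x - mx) * (y - my) for x, y in points) / n
    # Closed-form angle of the leading eigenvector of a 2x2 covariance matrix.
    return math.degrees(0.5 * math.atan2(2 * sxy, sxx - syy)) % 180


def axis_deviation_deg(points, start, end):
    """Smallest angle between the gaze principal axis and the target path."""
    target = math.degrees(math.atan2(end[1] - start[1], end[0] - start[0])) % 180
    diff = abs(principal_axis_deg(points) - target) % 180
    return min(diff, 180 - diff)
```

A deviation near 0° would indicate that the recorded gaze varied mainly along the direction in which the target moved.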
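A simplified version of the ROI analysis can be sketched as follows. It assumes rectangular ROIs given in image-relative coordinates (0 to 1), whereas the ROIs in the study were marked on a 5% grid and need not be rectangles; the function name and the ROI values in the usage note are hypothetical.

```python
def roi_proportions(samples, rois):
    """Share of gaze samples falling into each ROI.

    samples: list of (x, y) in image-relative coordinates (0..1).
    rois: dict mapping ROI name to a rectangle (x0, y0, x1, y1).
    ROIs may be nested, so one sample can count towards several ROIs.
    """
    counts = {name: 0 for name in rois}
    for x, y in samples:
        for name, (x0, y0, x1, y1) in rois.items():
            if x0 <= x <= x1 and y0 <= y <= y1:
                counts[name] += 1
    n = len(samples)
    return {name: (c / n if n else 0.0) for name, c in counts.items()}
```

For example, with `rois = {"head": (0.1, 0.05, 0.9, 0.95), "eyes": (0.25, 0.35, 0.75, 0.45)}`, a sample at `(0.5, 0.4)` counts towards both regions, while a sample at `(0.02, 0.02)` counts towards neither.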
Additionally, to differentiate between random and systematic offset, we calculated the general average offset (compared to Euclidean distance we used before) of samples for both settings. This approach cancels out random noise (i.e., samples that gather around the target position in any direction) by averaging over fixation samples and yields a more systematic variation from target position. We found −69 px (6.1%) and −54 px (5.5%) offset for the in-lab and online cases, respectively, which did not differ significantly for the between-subject factor setting, t(43.85) = 0.06, p = .95. Please see the Appendix for a table of these offsets over target positions.
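The difference between the two measures can be made concrete with a small sketch. The mean Euclidean distance treats every deviation as positive, whereas the mean signed offset lets deviations in opposite directions cancel. How the two axes were combined into the single values reported here is not stated in this section, so the sketch returns the signed offset per axis; the function name is hypothetical.

```python
import math


def offsets(samples, target):
    """Return the mean signed offset per axis and the mean Euclidean offset (px)."""
    n = len(samples)
    mean_dx = sum(x - target[0] for x, _ in samples) / n
    mean_dy = sum(y - target[1] for _, y in samples) / n
    mean_euclid = sum(math.dist(s, target) for s in samples) / n
    return {"systematic": (mean_dx, mean_dy), "euclidean": mean_euclid}
```

Samples scattered symmetrically 50 px around the target give a Euclidean offset of 50 px but a signed offset of zero; shifting all samples 20 px to the left leaves a signed offset of −20 px on the horizontal axis.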
In short, while both experimental settings showed a clearly detectable saccade of about 300 px (22% of screen size, 7.39° visual angle in the in-lab case) towards the target position, in-lab conducted data, especially in the upper parts of the screen, seemed to be more accurate due to shorter saccadic duration and lower variance (with 171 px, 15%, 3.94° visual angle offset for in-lab, 207 px, 18% online).
Using consumer-grade webcams, we aimed to establish a first common ground regarding the potential and limitations of web technology in the acquisition of eye-tracking data online. We employed three paradigms (a fixation task, a pursuit task, and a free-viewing task) and measured each of them in a classical in-lab setting and online (through a crowdscience approach). We found that the expected viewing patterns (fixations, saccades, and ROIs) consistently matched our paradigms, with an offset of about 191 px (15% of screen size, 4.38° visual angle) in-lab, whereas online data exhibited a higher variance, a lower sampling rate, and longer experimental sessions, but no significant difference in accuracy (offset 211 px, 18% of screen size).
Accuracy of the approach
Here we discuss the individual findings from the most basic to the most detailed (and hence across the three tasks that we employed). In the free-viewing task, we were clearly able to determine whether a participant was making a saccade towards the target screen side. This very basic result did not show a significant difference between experimental settings, thus arguing for comparable quality of in-lab and online data. From there, the fixation task results were able to show that we can clearly detect saccades with an average duration of about 450 ms in the in-lab and 750 ms in the online case, reducing the Euclidean distance by 305 px and resulting in an offset of about 188 px (3.94°), whereas commercial systems reach an accuracy of, for example, 0.15° visual angle (EyeLink 1000, e.g., Dalmaijer, 2014). Together, these two results allow the inference that saccades can be identified through online webcam-based eye tracking.
The findings are furthered by the pursuit task. We showed that after the motion started, the spatial offset did not increase but was constant at about 214 px, 4.90°, thus indicating that the participant followed the motion and the pursuit of the target was identified correctly by our application. Again, we did not find a difference in accuracy between experimental settings, putting online data acquisition on par with in-lab data quality. On the other hand, analyzing the speed of eye movements, it becomes apparent that there are still challenges in absolute accuracy when using this methodology. By recording speeds around 37.89°/s, averaged over settings, the measurements indicate a faster movement than the target stimuli (mean 20.41°/s). We see three potential reasons for this deviation. First, because we did not smooth our data, micro-saccades (e.g., Møller, Laursen, Tygesen, & Sjølie, 2002) could account for these high speeds. Second, due to measurement inaccuracies of the application, the variance of gaze estimations of the same target position creates artificial local eye movements by recording different fixation points, despite the participant keeping their fixation constant. Third, there might be a conflict between presentation and recording on the same computer system, which usually is avoided by using different machines in an in-lab setting. Very carefully designed experiments might be able to pinpoint the exact reason behind the deviation, but our paradigm is not suited for this specific task.
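The second explanation can be illustrated with a small sketch: when speed is computed from consecutive raw samples, jitter in the gaze estimate adds path length and thus inflates the measured speed. The code assumes the in-lab geometry from the text (about 50 px per cm, 50 cm viewing distance) and converts each step to visual angle separately, which is a simplification; the function name and the example data are hypothetical.

```python
import math


def mean_speed_deg_per_s(samples, px_per_cm=50.0, distance_cm=50.0):
    """Mean sample-to-sample speed in degrees of visual angle per second.

    samples: list of (t_ms, x_px, y_px), sorted by time.
    """
    speeds = []
    for (t0, x0, y0), (t1, x1, y1) in zip(samples, samples[1:]):
        dt = (t1 - t0) / 1000.0
        if dt <= 0:
            continue  # skip duplicate or out-of-order timestamps
        step_cm = math.dist((x0, y0), (x1, y1)) / px_per_cm
        step_deg = math.degrees(2 * math.atan(step_cm / (2 * distance_cm)))
        speeds.append(step_deg / dt)
    return sum(speeds) / len(speeds) if speeds else 0.0


# Smooth horizontal motion sampled every 100 ms ...
smooth = [(i * 100, i * 50, 500) for i in range(21)]
# ... and the same motion with an alternating +/-50 px vertical jitter.
jittered = [(i * 100, i * 50, 500 + (50 if i % 2 else -50)) for i in range(21)]
```

In this constructed example, `mean_speed_deg_per_s(jittered)` is more than twice `mean_speed_deg_per_s(smooth)`, although the underlying motion is identical.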
Through the third task we were able to identify specific, attention-guided fixations by introducing a semantic layer in the free-viewing task. Here, we were able to replicate the common finding that Western observers use the eyes of faces as a very distinct feature to analyze (Blais et al., 2008; Caldara et al., 2010; Janik, Wellens, Goldberg, & Dell’Osso, 1978). The significant differences in fixations were as expected: participants pre-dominantly fixated the eye region when compared to other facial landmarks such as the nose or the mouth. Comparing online and in-lab data, we only found an increase in number of fixations for off-image areas in the online data collection. This is in line with our other results that online data quality is lower and noisier than in-lab conducted data, and participants take more time, but argues for no difference in semantic interpretation of stimuli between settings.
With regard to the free-viewing task, we were rather strict when defining the ROIs of our images. Pelphrey et al. (2002), for example, defined 27% of the image as key feature regions, while in our case we only defined 14%. Thus, increasing the ROIs to more liberal values might compensate for the slight inaccuracies of fixation estimation. Lastly, we assume that technical control mechanisms, like head movement detection or instructions on how to use a makeshift chin rest, could improve the data quality even further and should be implemented.
Next to these obvious improvements, further questions are opened up due to the novelty of this research method. For example, not much is known about how many calibration and validation trials are necessary to achieve a high level of gaze estimation accuracy. In our experiments, we chose the values arbitrarily and therefore the calibration/validation phase took up to 50% of the experimental time of about 30 min. If the validation phase could be minimized or even performed intrinsically (e.g., with a method similar to the one Papoutsaki et al. are using), the comfort of participating in such studies would be increased dramatically. Gamification of the experimental paradigms would help with this task. Another critical question that will need further investigation is the absolute accuracy of the gaze estimation. We found it to be about 200 px (4.16°), which is in line with around 215 px from Papoutsaki et al. (2016). A much higher accuracy has been achieved by Xu et al. (2015; 1.06–1.32°), due to their use of a headrest and post-hoc processing to detect fixations (meanshift clustering) of three subjects, instead of relying on raw gaze estimation data. This approach cancels out random noise, as we have shown in the fixation-task analysis, which results in a lower nominal offset (around 60 px, 6% of screen size in our case). Yet, we assume that with very careful calibration, instructions, and pre-selections, these values might be improved. This question could be investigated by combining classical in-lab studies with common eye-tracking hardware and web technology studies to have a very clear differentiation between attentional offset (e.g., fixating the forehead vs. the eyes) and measurement inaccuracy (e.g., classical eye-tracking hardware indicates eye region, but the web technology implementation indicates forehead due to inaccuracy). Another potentially interesting factor is indicated by our findings, which show a lower accuracy in the lower portion of the screen.
We assume that this is due to the fact that most webcams are positioned on top of the display, yet a specifically designed experiment about spatial accuracy resolution would be necessary to pinpoint potential effects. To ease the process of data acquisition and improve the comfort for participants, it would also be very interesting to re-integrate the automatic calibration through mouse movement and/or clicks by Papoutsaki et al. (2016). This approach would allow for a much more gamified version of online eye tracking, thereby potentially improving the number of interested participants and reducing the number of dropouts throughout the study.
At the current state, we would argue that only those studies that require a very detailed spatial resolution of fixations (e.g., studies in reading, or the dissection of singular items in a crowded display), very time-sensitive information (e.g., high spatio-temporal resolution), or a very small number of trials (e.g., one-trial paradigms) cannot be conducted online. Large ROIs or sections of the screen should be possible to observe. A limiting factor, especially when talking about the temporal course of attention allocation, would be the participant’s system performance. We recommend testing the performance beforehand by determining the fps and either ending the experiment early or excluding those participants before analyses. Additionally, the resolution of the webcam should be sufficient, although the lower limit has not yet been determined. It is also important to decide which browsers (currently Firefox and Chrome) and what hardware (currently only tested on desktop computers and laptops, but it should be usable on tablets and mobile phones as well) should be supported by the experiment. Cross-browser functionality increases participation rates, but might lead to higher code maintenance effort. In general, to achieve a higher data quality, very explicit instructions should be given: no head movement, only eye movement, a suitable distance to the screen, and good illumination are just a few, as can be seen in the instructional figure above. Moreover, we would argue for implementing a head-tracking system that restarts calibration once too much movement is detected.
Again, it should be decided whether internal (e.g., using mouse movements) or external (e.g., calibration points) calibration is applicable: in most studies, external calibration should allow for a more exact calibration and convey the seriousness of the experiment, while some studies might need to rely on internal calibration (e.g., because external calibration would be too exhausting for the population, or to increase gamification aspects).
Taken together, there is obviously a long road ahead towards perfectly reliable and accurate online web technology-based eye tracking, yet with these results of a first investigation we do think that it is one worth travelling. The ease of access to participants, rapid data collection, diversity of demographics, and lower cost and time investments are just a few of the factors to consider when deciding on online data collection. Furthermore, even now the spatial resolution should be sufficient for many eye-tracking studies that do not need pixel-wise but rather area-wise accuracy. As the foundation of online experimentation grows, we expect that algorithms and software will develop in turn and that gold standards will emerge that will improve the accuracy towards levels comparable to classical acquisition methods. Thus, we think it is worth employing these (still experimental) methods to broaden the possibilities in psychological data acquisition.
We used the service of www.crowdflower.com in our study, as the more common Amazon Mechanical Turk (AMT) was not accessible in Germany at the time. Furthermore, crowdflower accesses many different external mini-task job sites and distributes its jobs to them, potentially allowing a better generalization of the obtained data compared to AMT, where studies have questioned the diversity of participants (Stewart et al., 2015). While we had great experiences in psychophysical studies with crowdflower, we encountered one severe case of fraud during this specific work. Specifically, one person used the success code, which is necessary to pass the system’s compensation program, multiple times on multiple computers with different accounts, thereby circumventing the system and breaching the terms of service. Despite the literature being mostly in favor of fair participation in online studies (e.g., Gosling, Vazire, Srivastava, & John, 2000), and our overall positive experience with online participants, we highly recommend that multiple participation does not go unchecked. Methods to do so include checking IP address, user agent, and screen size, or simply the use of a success-based reimbursement system that only compensates users after successful participation and data validation by the scientists (“bonus” payment).
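The performance check recommended above could look like the following sketch, which estimates the achieved frame rate from the timestamps of gaze samples and flags systems below a threshold. The threshold of 10 fps is a placeholder chosen for this example, not a value from the study, and the function names are hypothetical.

```python
def estimate_fps(timestamps_ms):
    """Estimate frames per second from a sorted list of sample timestamps (ms)."""
    if len(timestamps_ms) < 2:
        return 0.0
    duration_s = (timestamps_ms[-1] - timestamps_ms[0]) / 1000.0
    if duration_s <= 0:
        return 0.0
    # n timestamps span n - 1 frame intervals
    return (len(timestamps_ms) - 1) / duration_s


def meets_fps_requirement(timestamps_ms, min_fps=10.0):
    """True if the system reached at least `min_fps` (placeholder threshold)."""
    return estimate_fps(timestamps_ms) >= min_fps
```

Such a check could run during a short test phase before calibration, so that participants with low-performing systems are informed early instead of being excluded after completing the study.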
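A simple check for multiple participation along these lines is sketched below. It groups submissions by a fingerprint built from selected fields; the field names (`ip`, `user_agent`, `screen`, `success_code`, `id`) are hypothetical and would have to match whatever the experiment server actually logs.

```python
from collections import defaultdict


def find_duplicates(submissions, keys=("ip", "user_agent", "screen")):
    """Group submission ids that share the same values for `keys`.

    submissions: list of dicts with an "id" field plus the fields in `keys`.
    Returns a list of id lists, one per fingerprint seen more than once.
    """
    groups = defaultdict(list)
    for sub in submissions:
        fingerprint = tuple(sub.get(k) for k in keys)
        groups[fingerprint].append(sub["id"])
    return [ids for ids in groups.values() if len(ids) > 1]
```

Calling `find_duplicates(submissions, keys=("success_code",))` would flag reuse of the same success code across accounts. Matching fingerprints are only an indication and should be checked manually, since several people may share one IP address or device.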
We would like to thank Astrid Hönekopp and Alexander Diel for help in collecting the data and Katharina Sommer for providing the instructional images. All code, raw data and analysis files can be found at The Open Science Framework (https://osf.io/jmz79).
- Birnbaum, M. H. (2000). Introduction to psychological experiments on the internet. In M. H. Birnbaum (Ed.), Psychological experiments on the internet (pp. xv–xx). Academic Press. doi: 10.1016/B978-012099980-4/50001-0
- Blais, C., Jack, R. E., Scheepers, C., Fiset, D., & Caldara, R. (2008). Culture shapes how we look at faces. PLoS ONE, 3(8). doi: 10.1371/journal.pone.0003022
- Chen, M. (2001). What can a mouse cursor tell us more? Correlation of eye/mouse movements on web browsing. Proceedings of the ACM Conference on Human Factors in Computing Systems, 281–282. doi: 10.1145/634067.634234
- Dalmaijer, E. S., Mathôt, S., & Van der Stigchel, S. (2013). PyGaze: An open-source, cross-platform toolbox for minimal-effort programming of eyetracking experiments. Behavior Research Methods, (February 2016), 1–16. doi: 10.3758/s13428-013-0422-2
- Fukushima, K., Fukushima, J., Warabi, T., & Barnes, G. R. (2013). Cognitive processes involved in smooth pursuit eye movements: Behavioral evidence, neural substrate and clinical correlation. Frontiers in Systems Neuroscience, 7(March), 4. doi: 10.3389/fnsys.2013.00004
- Germine, L., Nakayama, K., Duchaine, B. C., Chabris, C. F., Chatterjee, G., & Wilmer, J. B. (2012). Is the Web as good as the lab? Comparable performance from Web and lab in cognitive/perceptual experiments. Psychonomic Bulletin & Review, 19(5), 847–857. doi: 10.3758/s13423-012-0296-9
- Papoutsaki, A., Sangkloy, P., Laskey, J., Daskalova, N., Huang, J., & Hays, J. (2016). WebGazer: Scalable webcam eye tracking using user interactions. International Joint Conference on Artificial Intelligence.
- Valenti, R., Staiano, J., Sebe, N., & Gevers, T. (2009). Webcam-based visual gaze estimation (1), pp. 662–671.
- Xu, P., Ehinger, K. A., Zhang, Y., Finkelstein, A., Kulkarni, S. R., & Xiao, J. (2015). TurkerGaze: Crowdsourcing saliency with webcam based eye tracking. arXiv Preprint arXiv: …, 91(12), 5. Retrieved from http://arxiv.org/abs/1504.06755