The amount of behavioural research conducted online has increased vastly in the past few years. For instance, the number of papers tracked by Web of Science with the keywords ‘MTurk’ or ‘Mechanical Turk’ (Amazon’s popular platform for accessing online participants or workers, available since 2005) was 642 in 2018, a more than five-fold increase from the 121 publications in 2013 (Fig. 1). While MTurk is not used exclusively for psychological experiments, the increase is indicative of a broader trend. For example, Bohannon (2016) reported that published MTurk studies in social science increased from 61 in 2011 to 1200 in 2015, an almost 20-fold increase.
A unique problem with internet-based testing is its reliance on participants’ hardware and software. Researchers who are used to lab-based testing will be intimately familiar with their computer, stimulus software, and hardware for response collection. At the very least, they can be sure that all participants are tested using the very same system. For online testing, the exact opposite is true: participants use their own computer (desktop, laptop, tablet, or even phone), with their own operating system, and access experiments through a variety of web browsers.
In addition to participant degrees of freedom, researchers can choose between various options to generate experiments. These vary from programming libraries (e.g. jsPsych) to graphical experiment builders (e.g. Gorilla Experiment Builder), and come with their own idiosyncrasies with respect to timing, presentation of visual and auditory stimuli, and response collection.
This presents a potential problem for researchers: Are all of the unique combinations of hardware and software equal? Here, we first investigate the types of software that potential participants use, and how common each option is. We then provide a thorough comparison of the timing precision and accuracy of the most popular platforms, operating systems, internet browsers, and common hardware. We specifically compare four frequently used platforms that facilitate internet-based behavioural research:
We included these packages because, in our experience, they are among the most frequently used platforms, although little quantitative data are available to support this impression. Regrettably, other notable platforms such as LabVanced (www.labvanced.com) and the OSWeb extension to OpenSesame (Mathôt, Schreij, & Theeuwes, 2012) remain untested here due to practical restrictions on our time and resources.
A brief history of online experiments
The almost exponential increase in papers citing MTurk is surprisingly recent. While the internet has been available since the 1990s, and tools like MTurk have existed since the mid-2000s, the adoption of online research only began to accelerate in the past 5–10 years. There are, however, early examples of online experimentation, including studies of spatial cognition (Givaty et al., 1998), visual motion extrapolation (Hecht et al., 1999), and probability learning (Birnbaum & Wakcher, 2002), as well as the establishment of labs dedicated to web experiments (Reips, 2001). In the late 1990s and early 2000s, several guidance books and articles on the subject were published (Birnbaum, 2000; McGraw et al., 2000), with one 1995 review even coining the term ‘Cyberpsych’ to describe internet-based psychological science (Kelley-Milburn & Milburn, 1995). Sadly, it appears that the term did not catch on. Articles providing technical guidance on running experiments, such as maintaining a web server (Schmidt et al., 1997) and analysing server logs (Reips & Stieger, 2004), also emerged around this time. However, despite the availability of these tools and the promise of larger sample sizes, it took years to reach the current high levels of demand. There are several potential explanations for this apparent adoption lag: the required level of technical ability, the availability of personal devices, and concerns over data quality.
Building a research project online in the late 2000s required a much higher level of web-specific technical skill. Experimenters would have needed to know how to construct web pages and load resources (e.g. images and videos), capture and transmit participant data, configure and maintain a server to host the web pages and receive the participant data, and store the participant data in a database. Additionally, the capabilities of web applications at that time did not allow for much more than slow image and text presentation. Interactive animations and dynamic elements were inconsistent, and often slow to load for most users. Survey tools such as Qualtrics, Survey Monkey, and Lime Survey were available (Baker, 2013), but these permitted only relatively simple experiments.
Individuals’ access to the internet via a personal or shared device has also increased over this period, and continues to increase roughly linearly. This is illustrated in Fig. 2, using data provided by the United Nations International Telecommunication Union. This pattern indicates that the potential reach of web-based research continues to extend to larger proportions of populations across the globe. This is particularly important given the historical problem of under-powered research leading to unreliable results, a problem that increased sample sizes can help address (Button et al., 2013).
The current state
Despite the potential availability of large samples online, there is hesitancy to adopt certain types of tasks and experiments, particularly those that use short stimulus durations (e.g. visual masking experiments) or that need very accurate response time logging (such as an attentional flanker task). The additional noise in online studies can be characterised as coming from two independent sources:
1. Differences in participant behaviour relative to a lab setting
2. Differences in technology, such as software (OS, web browsers, and platforms) and hardware (screens, computers, mobile devices)
Differences in participant behaviour when taking part remotely are difficult to address systematically with software or hardware, and ultimately come down to the design of the experiment and the use of certain tools. That being said, there are ways to reduce this noise: a brief summary of how to improve the quality of data collected online is given by Rodd (2019), and the topic is also discussed in Clifford & Jerit (2014) and, more recently, in a tutorial by Sauter, Draschkow, & Mack (2020). This paper, however, focuses on issues related to the second point: measurement error introduced by technology. This error can be reduced by restricting hardware and software, and quantifying the imprecision it introduces would help reassure researchers, enabling them to use large web-based samples in timing-sensitive experiments.
Various claims have been made in the scientific literature regarding the display and response timing ability of experimental set-ups using web browsers: for instance, that timing can be good depending on device and set-up (Pronk, Wiers, Molenkamp, & Murre, 2019), and that different techniques for rendering animations lead to reduced timing precision (Garaizar & Reips, 2019). Ultimately, though, the variance in timing reflects the many different ways to create an online experiment and the state of the software and hardware landscape at the time of assessment, all of which are changing at a fast rate. We previously discussed this changing hardware and software ecosystem in Anwyl-Irvine et al. (2019). To address this variance, it is important to report any timing validation on a range of devices. To the authors’ knowledge, the largest device comparison of online software to date was undertaken by Reimers and Stewart (2015), who assessed 19 Windows machines and suggested that systems (OS and devices) contribute the greatest variability, with Windows XP showing less variability than Windows 7. Their justification for testing only Windows devices was that 85–90% of their participants used them. However, this has changed since 2015; see the demographics section of this paper for more details.
In a highly commendable concurrent effort, Bridges, Pitiot, MacAskill, and Peirce (2020) compare a wide range of online and offline experimental software across several operating systems (Windows, Linux, and macOS) and web browsers (Chrome, Edge, Edge-Chromium, Firefox, and Safari). Their data paint an encouraging picture, with reaction time (RT) lags of 8–67 ms, precision of < 1 ms to 8 ms, visual lags of 0–2 frames, and variance of under 10 ms for most combinations. Auditory lag is poorer across the board, with average delays in the hundreds of milliseconds. Our study asks similar questions and uses a similar approach, with a few crucial differences. Firstly, Bridges et al. (2020) employed a set-up that is highly suitable for testing lab environments (a high-precision USB button box), whereas we aimed to realistically simulate participants’ home environments by using an actuator to perform presses on consumer keyboards. Secondly, the authors assessed only one frame duration (200 ms), so they were not sensitive to any interaction between duration and timing errors, whereas we assess 29 different durations. Thirdly, the authors used fewer trials for their duration tests than we do (1000 vs 4350), and were therefore less likely to detect irregular delays. Nevertheless, our two concurrent studies have come to similar conclusions, with some differences and limitations to ecological validity in both that are further explored in the Discussion. Together, the two studies provide a richer picture of the current state of affairs than either would alone.
A realistic approach to chronometry
Researchers must be furnished with the information they need to make sensible decisions about the limitations of browsers, devices, and operating systems. With this information, they can trade off the size of their participant pool against the accuracy and precision of the collected data. If timing validation is to be functionally informative to users, our methods must be representative of the real-world set-ups that our participants will be using. Failure to achieve this could result in unexpected behaviour, even when running previously well-replicated experiments (Plant, 2016).
When researchers assess the timing accuracy of software, the software and hardware set-ups are often adjusted significantly in order to record optimum performance in an ideal environment. Such set-ups require removing keyboard keys and soldering on wires (Reimers & Stewart, 2015) or specialised button boxes (Bridges et al., 2020), and include discrete graphics cards (Garaizar, Vadillo, & López-de-Ipiña, 2014; Bridges et al., 2020). This does not represent the average internet user’s device. For instance, in the first quarter of 2019, less than 30% of new PCs sold included discrete (i.e. non-integrated) graphics cards (Peddie, 2019), and the proportion among online participants is likely even smaller. Recently, Pronk et al. (2019) utilised a robotic actuator to press keyboard keys and touchscreens, a more representative assessment of RT recording. Testing on ideal-case set-ups, whilst vital for establishing the frontier of what is possible with online software, is likely to reflect poorly the situation researchers face when collecting data online. Consequently, we have attempted to use more realistic set-ups in our study, such as an actuator pressing keys on consumer keyboards.
The first and second parts of this paper test the visual display and response logging performance of different software on different common browsers and devices, in order to give an indication of each set-up’s limits. The final part of the paper then provides an overview of the device demographics of online participants, based on a snapshot sample of over 200,000 Gorilla participants taken in 2019. Pronk et al. (2019) selected the browsers they tested using global web user data, but this population may differ from the sub-population of people who take part in online research. Our approach is therefore well suited to estimating the distribution and variability of devices and browsers within the online participant population.
For the testing sections, we selected a realistic variety of devices. The Windows and macOS operating systems cover the majority of the population for online testing (73% of our user sample). The devices we use comprise a desktop PC with an external monitor, a desktop Mac with an integrated monitor, a high-spec Windows Ultrabook, and a lightweight Mac laptop. Further, the devices are assessed as they are, with no steps taken to restrict the browsers or operating systems, increasing the likelihood that they reflect users’ actual set-ups.
In order to provide researchers with a barometer of how their participants’ devices will perform, we have endeavoured to cover as many commonly used tools, operating systems, and devices as possible (given the number of trials needed for each test). We assessed these using an external chronometry device that can independently capture the accuracy and precision of each system.
We also distinguish between the average accuracy of a set-up’s timing (i.e. on average, how close a given set-up’s record is to the actual reaction time) and the variability of this accuracy (i.e. how much the reaction time error varies within one experiment). Variability in presentation and reaction times increases the noise in the experiment. For example, a delayed but consistent reaction time record still permits comparisons between trials and conditions, whereas a variable delay can obscure small differences between conditions. These properties are referred to as accuracy and precision, respectively.
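To make this distinction concrete, the short sketch below (illustrative only, and not part of our analyses; all numbers and variable names are arbitrary assumptions) simulates two hypothetical set-ups with the same average lag. One adds a constant delay to every recorded reaction time, the other adds a delay that varies from trial to trial; the constant delay leaves the measured condition difference intact, whereas the variable delay inflates the trial-level noise.

```python
# Hypothetical simulation: two set-ups add lag to "true" reaction times from
# two conditions that differ by 20 ms on average. Set-up A is inaccurate but
# precise (constant 60 ms lag); set-up B has the same average lag but is
# imprecise (the lag varies across trials).
import numpy as np

rng = np.random.default_rng(0)
n = 200
cond_fast = rng.normal(480, 30, n)   # true RTs, condition 1 (ms)
cond_slow = rng.normal(500, 30, n)   # true RTs, condition 2 (ms)

constant_lag = np.full(2 * n, 60.0)               # set-up A: fixed delay
variable_lag = rng.normal(60, 25, 2 * n).clip(0)  # set-up B: noisy delay

for label, lag in [("constant lag", constant_lag),
                   ("variable lag", variable_lag)]:
    fast = cond_fast + lag[:n]
    slow = cond_slow + lag[n:]
    # A constant lag preserves the condition difference; a variable lag adds
    # trial-to-trial noise, making the same difference harder to detect.
    diff = slow.mean() - fast.mean()
    pooled_sd = np.sqrt((fast.var(ddof=1) + slow.var(ddof=1)) / 2)
    print(f"{label}: mean difference = {diff:.1f} ms, pooled SD = {pooled_sd:.1f} ms")
```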
In all data reporting, we have intentionally avoided the use of inferential statistics, and chosen to show descriptive statistics, an approach previous studies have taken (Neath et al., 2011; Reimers & Stewart, 2015, 2016). We made this choice for two reasons. Firstly, the distributions of the data traces produced are highly irregular, and deviations are either very small and frequent or very large and infrequent, making formal comparison very difficult. Secondly, there is no ideal way to define a unit of observation. If we consider each sample within a condition, the large number of samples is likely to make any minor difference statistically significant, even if it is not practically meaningful. Alternatively, if we consider each device-browser-platform combination, comparisons would be severely under-powered. We thus report descriptive statistics, as well as the entire distribution of samples within each cell.
We undertake three analyses in this paper to answer the questions of accuracy and precision in realistic set-ups. The first deals with the timing of visual stimuli presented on a screen, where the delay we report is the difference between the expected and the actual on-screen duration. The second characterises the accuracy of each set-up in recording keyboard presses in response to an item displayed on-screen, where the delay is the difference between the recorded press onset and the actual onset. The third characterises the participants themselves: what devices they use, where they are based, and what recruitment services are used; this provides context for our results.
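As a simple illustration of the two timing measures used in the first two analyses, the sketch below computes display-duration error and response-registration delay from hypothetical paired measurements (requested values and software logs versus values recorded by an external chronometry device). The variable names and example numbers are assumptions for illustration, not our actual data format.

```python
# Illustrative computation of the two delay measures (hypothetical data):
# 1) visual error   = actual on-screen duration - requested duration
# 2) response delay = software-recorded press onset - actual press onset
import statistics

# Each tuple: (requested duration in ms, duration measured by a photodiode in ms)
visual_trials = [(50, 66.7), (50, 50.1), (200, 216.5), (200, 200.2)]

# Each tuple: (actual key-press onset in ms, onset logged by the browser in ms)
response_trials = [(1000.0, 1032.5), (1500.0, 1541.8), (2000.0, 2028.9)]

visual_errors = [measured - requested for requested, measured in visual_trials]
response_errors = [logged - actual for actual, logged in response_trials]

for name, errors in [("visual duration error", visual_errors),
                     ("response registration delay", response_errors)]:
    # Accuracy: mean error; precision: standard deviation of the error.
    print(f"{name}: accuracy = {statistics.mean(errors):.1f} ms, "
          f"precision (SD) = {statistics.stdev(errors):.1f} ms")
```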