The amount of behavioural research conducted online has increased vastly in the past few years. For instance, the number of papers tracked by Web of Science with the keywords ‘MTurk’ or ‘Mechanical Turk’ (Amazon’s popular platform for accessing online participants or workers, available since 2005) was 642 in 2018, a more than five-fold increase from the 121 publications in 2013 (Fig. 1). While MTurk is not used exclusively for psychological experiments, the increase is indicative of a broader trend. For example, Bohannon (2016) reported that published MTurk studies in social science increased from 61 in 2011 to 1200 in 2015, an almost 20-fold increase.
A unique problem with internet-based testing is its reliance on participants’ hardware and software. Researchers who are used to lab-based testing will be intimately familiar with their computer, stimulus software, and hardware for response collection. At the very least, they can be sure that all participants are tested using the very same system. For online testing, the exact opposite is true: participants use their own computer (desktop, laptop, tablet, or even phone), with their own operating system, and access experiments through a variety of web browsers.
In addition to participant degrees of freedom, researchers can choose between various options to generate experiments. These vary from programming libraries (e.g. jsPsych) to graphical experiment builders (e.g. Gorilla Experiment Builder), and come with their own idiosyncrasies with respect to timing, presentation of visual and auditory stimuli, and response collection.
This presents a potential problem for researchers: Are all of the unique combinations of hardware and software equal? Here, we first investigate the types of software that potential participants use, and how common each option is. We then provide a thorough comparison of the timing precision and accuracy of the most popular platforms, operating systems, internet browsers, and common hardware. We specifically compare four frequently used platforms that facilitate internet-based behavioural research:
We included these packages because, in our experience, they are among the most frequently used platforms, although little quantitative data are available to support this impression. Regrettably, other notable platforms such as LabVanced (www.labvanced.com) and the OSWeb extension to OpenSesame (Mathôt, Schreij, & Theeuwes, 2012) remain untested here due to practical restrictions on our time and resources.
A brief history of online experiments
The almost exponential increase in papers citing MTurk is surprisingly recent. While the internet has been available since the 1990s, and tools like MTurk have existed since the mid-2000s, the adoption of online research only began to accelerate in the past 5–10 years. There are, however, early examples of online experimentation, including studies of spatial cognition (Givaty et al., 1998), visual motion extrapolation (Hecht et al., 1999), and probability learning (Birnbaum & Wakcher, 2002), as well as the establishment of labs dedicated to web experiments (Reips, 2001). In the late 1990s and early 2000s, several guidance books and articles on the subject were published (Birnbaum, 2000; McGraw et al., 2000), with one 1995 review even coining the term ‘Cyberpsych’ to describe internet-based psychological science (Kelley-Milburn & Milburn, 1995). Sadly, it appears that the term did not catch on. Articles providing technical guidance on running experiments, such as maintaining a web server (Schmidt et al., 1997) and analysing server logs (Reips & Stieger, 2004), also emerged around this time. However, despite the availability of these tools and the promise of larger sample sizes, it took years to reach the current high levels of demand. There are several potential explanations for this apparent adoption lag: the required level of technical ability, the availability of personal devices, and concerns over data quality.
Building a research project online in the late 2000s required a much higher level of web-specific technical skill. Experimenters would have needed to know how to construct web pages and load resources (e.g. images and videos), capture and transmit participant data, configure and maintain a server to host the web pages and receive the participant data, and store the participant data in a database. Additionally, the capabilities of web applications at that time did not allow for much more than slow image and text presentation. Interactive animations and dynamic elements were inconsistent, and often slow to load for most users. Survey tools such as Qualtrics, Survey Monkey, and Lime Survey were available (Baker, 2013), but these permitted only relatively simple experiments.
Individuals’ access to the internet via a personal or shared device has also increased over this period, and continues to increase roughly linearly. This is illustrated in Fig. 2, using data provided by the United Nations International Telecommunication Union. This pattern indicates that the potential reach of web-based research continues to extend to larger proportions of populations across the globe. This is particularly important given the historical problem of under-powered research leading to unreliable results, a problem that increased sample sizes can help address (Button et al., 2013).
The current state
Despite the potential availability of large samples online, there is hesitancy to adopt certain types of tasks and experiments, particularly those that use short stimulus durations (e.g. visual masking experiments) or that need very accurate response time logging (such as an attentional flanker task). The additional noise in online studies can be characterised as coming from two independent sources:
1. Differences in participant behaviour relative to a lab setting
2. Differences in technology, such as software (OS, web browsers, and platforms) and hardware (screens, computers, mobile devices)
Differences in participant behaviour when taking part remotely are difficult to address systematically with software or hardware, and ultimately come down to the design of the experiment and the use of certain tools. That being said, there are ways to reduce this noise: a brief summary of how to improve the quality of data collected online is given by Rodd (2019), and the topic is also discussed in Clifford & Jerit (2014) and, more recently, in a tutorial by Sauter, Draschkow, & Mack (2020). This paper, however, focuses on issues related to the second point: measurement error introduced by technology. This error can be reduced by restricting hardware and software, and quantifying the imprecision it introduces would help reassure researchers, enabling them to use large web-based samples in timing-sensitive experiments.
Various claims have been made in the scientific literature regarding the display and response timing ability of experimental set-ups using web browsers: for instance, that timing can be good depending on device and set-up (Pronk, Wiers, Molenkamp, & Murre, 2019), and that different techniques for rendering animations lead to reduced timing precision (Garaizar & Reips, 2019). Ultimately, though, the variance in timing reflects the many different ways to create an online experiment and the state of the software and hardware landscape at the time of assessment, all of which are changing at a fast rate. We previously discussed this changing hardware and software ecosystem in Anwyl-Irvine et al. (2019). To address this variance, it is important to report any timing validation on a range of devices. To the authors’ knowledge, the largest device comparison of online software to date was undertaken by Reimers and Stewart (2015), who assessed 19 Windows machines and suggested that systems (OS and devices) contribute the greatest variability, with Windows XP showing less variability than Windows 7. Their justification for testing only Windows devices was that 85–90% of their participants used them. However, this has changed since 2015; see the demographics section of this paper for more details.
In a highly commendable concurrent effort, Bridges, Pitiot, MacAskill, and Peirce (2020) compare a wide range of online and offline experimental software across several operating systems (Windows, Linux, and macOS) and web browsers (Chrome, Edge, Edge-Chromium, Firefox, and Safari). Their data paint an encouraging picture, with reaction time (RT) lags of 8–67 ms, precision of < 1 ms to 8 ms, visual lags of 0–2 frames, and variance of under 10 ms for most combinations. Auditory lag is poorer across the board, with average delays in the hundreds of milliseconds. Our study asks similar questions and uses a similar approach, with a few crucial differences. Firstly, Bridges et al. (2020) employed a set-up that is highly suitable for testing lab environments (a high-precision USB button box), whereas we aimed to realistically simulate participants’ home environments by using an actuator to perform presses on consumer keyboards. Secondly, the authors assessed only one frame duration (200 ms), so they were not sensitive to any interaction between duration and timing errors, whereas we assess 29 different durations. Thirdly, the authors used fewer trials for their duration tests than we do (1000 vs 4350), and were therefore less likely to detect irregular delays. Nevertheless, our two concurrent studies have come to similar conclusions, with some differences and limitations to ecological validity in both that are further explored in the Discussion. Together, the two studies provide a richer picture of the current state of affairs than either would alone.
A realistic approach to chronometry
Researchers must be furnished with the information they need to make sensible decisions about the limitations of browsers, devices, and operating systems. With this information, they can trade off the size of their participant pool against the accuracy and precision of the collected data. If timing validation is to be functionally informative to users, our methods must be representative of the real-world set-ups that our participants will be using. Failure to achieve this could result in unexpected behaviour, even when running previously well-replicated experiments (Plant, 2016).
When researchers assess the timing accuracy of software, the software and hardware set-ups are often adjusted significantly in order to record optimum performance in an ideal environment. Such set-ups require removing keyboard keys and soldering on wires (Reimers & Stewart, 2015) or specialised button boxes (Bridges et al., 2020), and include discrete graphics cards (Garaizar, Vadillo, & López-de-Ipiña, 2014; Bridges et al., 2020). This does not represent the average internet user’s device. For instance, in the first quarter of 2019, less than 30% of new PCs sold included discrete (i.e. non-integrated) graphics cards (Peddie, 2019), and the proportion among online participants is likely even smaller. Recently, Pronk et al. (2019) utilised a robotic actuator to press keyboard keys and touchscreens, a more representative assessment of RT recording. Testing on ideal-case set-ups, whilst vital for establishing the frontier of what is possible with online software, is likely to reflect poorly the situation researchers face when collecting data online. Consequently, we have attempted to use more realistic set-ups in our study, such as an actuator pressing keys on consumer keyboards.
The first and second parts of this paper test the visual display and response logging performance of different software on different common browsers and devices, in order to give an indication of each set-up’s limits. The final part of the paper then provides an overview of the device demographics of online participants, based on a snapshot sample of over 200,000 Gorilla participants taken in 2019. Pronk et al. (2019) selected the browsers they tested using global web user data, but this population may differ from the sub-population of people who take part in online research. Our approach is therefore well suited to estimating the distribution and variability of devices and browsers within the online participant population.
For the testing sections, we selected a realistic variety of devices. The Windows and macOS operating systems cover the majority of the population for online testing (73% of our user sample). The devices we use comprise a desktop PC with an external monitor, a desktop Mac with an integrated monitor, a high-spec Windows Ultrabook, and a lightweight Mac laptop. Further, the devices are assessed as they are, with no steps taken to restrict the browsers or operating systems, increasing the likelihood that they reflect users’ actual set-ups.
In order to provide researchers with a barometer of how their participants’ devices will perform, we have endeavoured to cover as many commonly used tools, operating systems, and devices as possible (given the number of trials needed for each test). We assessed these using an external chronometry device that can independently capture the accuracy and precision of each system.
We also distinguish between the average accuracy of a set-up’s timing (i.e. on average, how close a given set-up’s record is to the actual reaction time) and the variability of this accuracy (i.e. how much the reaction time error varies within one experiment). Variability in presentation and reaction times increases the noise in the experiment. For example, a delayed but consistent reaction time record still permits comparisons between trials and conditions, whereas a variable delay can obscure small differences between conditions. These properties are referred to as accuracy and precision, respectively.
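To make this distinction concrete, the short sketch below (illustrative only, and not part of our analyses; all numbers and variable names are arbitrary assumptions) simulates two hypothetical set-ups with the same average lag. One adds a constant delay to every recorded reaction time, the other adds a delay that varies from trial to trial; the constant delay leaves the measured condition difference intact, whereas the variable delay inflates the trial-level noise.

```python
# Hypothetical simulation: two set-ups add lag to "true" reaction times from
# two conditions that differ by 20 ms on average. Set-up A is inaccurate but
# precise (constant 60 ms lag); set-up B has the same average lag but is
# imprecise (the lag varies across trials).
import numpy as np

rng = np.random.default_rng(0)
n = 200
cond_fast = rng.normal(480, 30, n)   # true RTs, condition 1 (ms)
cond_slow = rng.normal(500, 30, n)   # true RTs, condition 2 (ms)

constant_lag = np.full(2 * n, 60.0)               # set-up A: fixed delay
variable_lag = rng.normal(60, 25, 2 * n).clip(0)  # set-up B: noisy delay

for label, lag in [("constant lag", constant_lag),
                   ("variable lag", variable_lag)]:
    fast = cond_fast + lag[:n]
    slow = cond_slow + lag[n:]
    # A constant lag preserves the condition difference; a variable lag adds
    # trial-to-trial noise, making the same difference harder to detect.
    diff = slow.mean() - fast.mean()
    pooled_sd = np.sqrt((fast.var(ddof=1) + slow.var(ddof=1)) / 2)
    print(f"{label}: mean difference = {diff:.1f} ms, pooled SD = {pooled_sd:.1f} ms")
```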
In all data reporting, we have intentionally avoided the use of inferential statistics, and chosen to show descriptive statistics, an approach previous studies have taken (Neath et al., 2011; Reimers & Stewart, 2015, 2016). We made this choice for two reasons. Firstly, the distributions of the data traces produced are highly irregular, and deviations are either very small and frequent or very large and infrequent, making formal comparison very difficult. Secondly, there is no ideal way to define a unit of observation. If we consider each sample within a condition, the large number of samples is likely to make any minor difference statistically significant, even if it is not practically meaningful. Alternatively, if we consider each device-browser-platform combination, comparisons would be severely under-powered. We thus report descriptive statistics, as well as the entire distribution of samples within each cell.
We undertake three analyses in this paper to answer the questions of accuracy and precision in realistic set-ups. The first deals with the timing of visual stimuli presented on a screen, where the delay we report is the difference between the expected and the actual on-screen duration. The second characterises the accuracy of each set-up in recording keyboard presses in response to an item displayed on-screen, where the delay is the difference between the recorded press onset and the actual onset. The third characterises the participants themselves: what devices they use, where they are based, and what recruitment services are used; this provides context for our results.
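As a simple illustration of the two timing measures used in the first two analyses, the sketch below computes display-duration error and response-registration delay from hypothetical paired measurements (requested values and software logs versus values recorded by an external chronometry device). The variable names and example numbers are assumptions for illustration, not our actual data format.

```python
# Illustrative computation of the two delay measures (hypothetical data):
# 1) visual error   = actual on-screen duration - requested duration
# 2) response delay = software-recorded press onset - actual press onset
import statistics

# Each tuple: (requested duration in ms, duration measured by a photodiode in ms)
visual_trials = [(50, 66.7), (50, 50.1), (200, 216.5), (200, 200.2)]

# Each tuple: (actual key-press onset in ms, onset logged by the browser in ms)
response_trials = [(1000.0, 1032.5), (1500.0, 1541.8), (2000.0, 2028.9)]

visual_errors = [measured - requested for requested, measured in visual_trials]
response_errors = [logged - actual for actual, logged in response_trials]

for name, errors in [("visual duration error", visual_errors),
                     ("response registration delay", response_errors)]:
    # Accuracy: mean error; precision: standard deviation of the error.
    print(f"{name}: accuracy = {statistics.mean(errors):.1f} ms, "
          f"precision (SD) = {statistics.stdev(errors):.1f} ms")
```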