In Experiment 1, we examined how accurately web applications on touchscreen and keyboard devices present stimuli for specified durations, and in Experiment 2, how accurately they measure RTs. In a simulation, we examined how the accuracy of RT measurements affects the reliability with which individual differences can be measured. The results of each experiment are discussed below, followed by a general assessment of the technical capabilities of web applications for mental chronometry.
With regard to stimulus presentation, we first compare the results for the different methods of timing and presenting stimuli, followed by an assessment of timing accuracy across devices and browsers. Timing via rAF realized the requested stimulus durations more accurately than timing via CSS animations did. In part, this was because stimuli timed via CSS animations on iOS were consistently presented for one frame longer than requested. In such cases, requesting slightly shorter durations than was done in this study could improve the accuracy of stimuli timed with CSS animations. However, this consistency was not found for all devices and browsers, so overall we recommend using rAF for timing stimuli. We suspect that the inconsistent behavior of CSS animations may be due to the standard for CSS animations still being a working draft (World Wide Web Consortium, 2018) at the time of this study.
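As a concrete illustration of timing via rAF, the sketch below presents a stimulus for a requested number of frames by counting rAF callbacks. This is a minimal sketch rather than the exact implementation used in this study; the identifiers are illustrative, and it assumes the stimulus element starts out hidden via the opacity property.

```javascript
// Minimal sketch of stimulus timing via rAF: count frames until the
// requested number has elapsed, then hide the stimulus again.
// Identifiers are illustrative; this is not the study's exact code.
function showForFrames(stimulus, nFrames) {
  return new Promise((resolve) => {
    let framesShown = 0;
    requestAnimationFrame(function onFrame(timestamp) {
      if (framesShown === 0) {
        stimulus.style.opacity = '1';   // stimulus becomes visible this frame
      }
      framesShown += 1;
      if (framesShown <= nFrames) {
        requestAnimationFrame(onFrame); // keep counting frames
      } else {
        stimulus.style.opacity = '0';   // hide after the last requested frame
        resolve(timestamp);
      }
    });
  });
}

// On a 60-Hz display, 12 frames approximate a 200-ms presentation:
// showForFrames(document.getElementById('stimulus'), 12);
```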
When timing via rAF, the presentation method had a relatively small effect on accuracy, with opacity outperforming background position and canvas by up to five percentage points. Because presentation methods affected timing accuracy much less than timing methods did, a researcher might choose a presentation method on the basis of practical considerations. For instance, canvas may be more suitable than opacity or background position when stimuli are generated dynamically. Also, note that web browsers support a range of presentation methods besides the three considered here, such as Scalable Vector Graphics and the Web Graphics Library (WebGL; Garaizar, Vadillo, & López-de-Ipiña, 2014a). Future research could establish whether the findings reported here generalize to those presentation methods as well.
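For illustration, the sketch below contrasts the opacity and canvas presentation methods; both could be driven by the same rAF loop shown earlier. The identifiers are illustrative, and, as noted above, which method performs best may vary across devices and browsers.

```javascript
// (a) Opacity: the stimulus is an existing DOM element that is shown
// or hidden by changing its CSS opacity.
function setVisibleViaOpacity(element, visible) {
  element.style.opacity = visible ? '1' : '0';
}

// (b) Canvas: the stimulus is drawn and cleared programmatically, which
// can be convenient when stimuli are generated dynamically.
function setVisibleViaCanvas(canvas, visible) {
  const ctx = canvas.getContext('2d');
  ctx.clearRect(0, 0, canvas.width, canvas.height);
  if (visible) {
    ctx.fillStyle = 'black';
    ctx.fillRect(80, 80, 160, 160);   // draw a simple square stimulus
  }
}
```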
Internal chronometry measures were as accurate at estimating stimulus durations as frame counting was at realizing them. This finding differs from prior research (Barnhoorn et al., 2015; Garaizar & Reips, 2018), which may be due to differences in study aims and designs. The present study included a larger variety of devices and browsers and was the first to simultaneously compare timing stimuli by counting frames with estimating stimulus durations via internal measures. We found that for devices and browsers on which stimulus timing by counting frames was near perfect, internal measures of stimulus duration [e.g., JavaScript’s window.performance.now() high-resolution timer] were also near perfect. Conversely, for devices and browsers on which timing was less accurate, internal measures were less accurate as well. Hence, any increase in accuracy attributed to internal duration measures in previous studies may have occurred because the corresponding devices and browsers were already very accurate.
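To illustrate how such an internal measure can be obtained, the sketch below takes high-resolution timestamps at stimulus onset and offset; their difference serves as an internal estimate of the realized duration. Identifiers are illustrative and do not reflect the study's exact code.

```javascript
// Sketch: internally estimate the realized stimulus duration by taking
// window.performance.now() readings at stimulus onset and offset.
let onsetTime = null;

function showStimulus(stimulus) {
  stimulus.style.opacity = '1';
  onsetTime = window.performance.now();        // high-resolution onset time
}

function hideStimulusAndMeasure(stimulus) {
  stimulus.style.opacity = '0';
  const offsetTime = window.performance.now(); // high-resolution offset time
  return offsetTime - onsetTime;               // estimated duration in ms
}
```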
Although internal chronometry did not improve timing accuracy in the present study, it may provide more general estimates of timing accuracy in a variety of other ways. For instance, an approach based on the regularity with which software events such as rAF callbacks occur (Eichstaedt, 2001) may be useful. Internal measures can also be important for estimating the refresh rate of a device (Anwyl-Irvine et al., 2019). Although such approaches are beyond the scope of this article, we hope to facilitate them by making all data of the present study available for reanalysis; the URL of the OSF repository containing all materials is listed at the end of this article. Additionally, internal chronometry may identify extreme levels of JavaScript load. A simple way of illustrating the latter is to have a JavaScript application run a never-ending loop: so long as this loop is executing, no other events will be processed.
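A sketch of one such approach is given below: the refresh rate is estimated from the mean interval between successive rAF callbacks, and irregular intervals may indicate heavy JavaScript load, since no rAF callbacks fire while, for example, a long-running synchronous loop is executing. The sample size and return values are illustrative.

```javascript
// Sketch: estimate the refresh rate from inter-frame intervals and keep
// the raw intervals, whose irregularity may indicate JavaScript load.
function estimateRefreshRate(nSamples = 60) {
  return new Promise((resolve) => {
    const intervals = [];
    let previous = null;
    requestAnimationFrame(function sample(timestamp) {
      if (previous !== null) {
        intervals.push(timestamp - previous);  // ms between two frames
      }
      previous = timestamp;
      if (intervals.length < nSamples) {
        requestAnimationFrame(sample);
      } else {
        const mean = intervals.reduce((a, b) => a + b, 0) / intervals.length;
        resolve({
          refreshRateHz: 1000 / mean,          // ~60 on a 60-Hz display
          intervals: intervals,                // irregularity suggests load
        });
      }
    });
  });
}
```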
On the basis of the most accurate timing and presentation methods found in this study (rAF and opacity), we assessed the accuracy with which keyboard and touchscreen devices can time stimuli. Some devices and browsers, of both touchscreen and keyboard type, achieved near-perfect timing: namely, iOS with Chrome, Firefox, and Safari, as well as Windows with Chrome. Hence, in settings where the device and browser can be controlled, web applications can be as accurate as specialized software. Most devices and browsers realized most presentation durations within one frame of the requested duration, though Chrome and Safari on MacOS tended to present stimuli up to six frames (100 ms) more briefly than requested. Hence, when the device and browser cannot be controlled, the reliability and validity of mental chronometry paradigms that require brief presentations may be affected.
With regard to the accuracy of RT measurements, different internal measures for RT gave similar results. Quantization of RTs at 60 Hz was found on one device, which may be acceptable (Ulrich & Giray, 1989). RT overestimation varied across devices, similar to what was found in previous research (Neath et al., 2011; Reimers & Stewart, 2015). The range of mean RT overestimations was similar to or smaller than the distributions assumed in various simulation studies with between-group designs (Brand & Bradley, 2012; Reimers & Stewart, 2015; Vadillo & Garaizar, 2016). The iOS device had the lowest mean RT overestimation, whereas MacOS had the highest. Windows with Chrome had the smallest variation in RT overestimation, whereas MacOS again had the largest. In general, when the device administering mental chronometry can be controlled, RTs may be measured quite accurately, though not at the level that specialized hardware and software, such as button boxes under Linux, can provide (Stewart, 2006). For particular combinations of devices and browsers, namely MacOS with Chrome and Firefox, RT overestimations were bimodally distributed, with centers that differed by up to 30 ms. Given both the similarity of these RT overestimations to results from previous empirical studies and the robustness reported in simulation studies, we assume that the prior recommendations still apply: A decrease in the reliability of finding group differences in RTs may be compensated for by increasing the number of participants by about 10% (Reimers & Stewart, 2015).
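As an illustration of such internal measures, the sketch below registers an RT to a keydown event via two clocks that modern browsers place on the same timeline: window.performance.now() and the event's timeStamp property. The surrounding task logic is illustrative.

```javascript
// Sketch: register an RT via two internal measures. In modern browsers,
// event.timeStamp and window.performance.now() share the same time origin,
// so both can be compared against the same stimulus-onset reading.
let stimulusOnset = null;

function onStimulusShown() {
  stimulusOnset = window.performance.now();
}

document.addEventListener('keydown', (event) => {
  const rtViaNow = window.performance.now() - stimulusOnset;
  const rtViaEvent = event.timeStamp - stimulusOnset;
  console.log(rtViaNow, rtViaEvent);  // typically very similar values
});
```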
Prior simulations have quantified the impact of the accuracy of RT measurements on the reliability of detecting group differences. As far as we are aware, none have quantified the impact on the reliability of measuring individual differences. Our modeling work indicated that different factors may affect reliability, including the number of trials and the variance of the measured trait. The reliability of absolute RT measurements was affected by device noise, whereas that of relative RT measurements was hardly affected. This could be because between-device variation was larger than within-device variation; for relative RT, between-device variation is removed because RTs are subtracted between conditions within participants. A rather striking result was that, with higher numbers of trials, relative RT was more reliable than absolute RT, even though the traits in the relative RT simulations were correlated .5. This may appear to contradict the commonly held belief that the difference between two positively correlated scores is less reliable than each of these scores individually. Although a comprehensive examination of this result is beyond the scope of this article, we offer two possible explanations. First, a classic result underlying this belief is based on two observations per participant (Lord & Novick, 1968), but aggregating across larger numbers of observations may yield more reliable difference scores (Miller & Ulrich, 2013). Second, we modeled latent traits as the mean components of ex-Gaussian RT distributions (for absolute RT) and as differences between these mean components (for relative RT). The distribution of absolute RTs was more skewed than the distribution of relative RTs, so the mean absolute RT was perhaps a less reliable estimator of the trait score than the mean relative RT was of differences in trait scores.
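To make the structure of such a simulation concrete, the sketch below estimates the split-half reliability of a relative RT measure from ex-Gaussian trial RTs, with the two condition traits correlated .5 across participants. All parameter values are illustrative and do not correspond to those of the reported simulations.

```javascript
// Sketch: split-half reliability of relative RT (mean condition difference),
// with ex-Gaussian trial RTs and condition traits correlated ~.5.
function randNorm(mean, sd) {                       // Box-Muller transform
  const u = 1 - Math.random(), v = Math.random();
  return mean + sd * Math.sqrt(-2 * Math.log(u)) * Math.cos(2 * Math.PI * v);
}
function exGaussian(mu, sigma, tau) {               // Gaussian + exponential
  return randNorm(mu, sigma) - tau * Math.log(1 - Math.random());
}
function correlation(x, y) {
  const n = x.length;
  const mx = x.reduce((a, b) => a + b) / n, my = y.reduce((a, b) => a + b) / n;
  let sxy = 0, sxx = 0, syy = 0;
  for (let i = 0; i < n; i++) {
    sxy += (x[i] - mx) * (y[i] - my);
    sxx += (x[i] - mx) ** 2;
    syy += (y[i] - my) ** 2;
  }
  return sxy / Math.sqrt(sxx * syy);
}

const nParticipants = 500, nTrialsPerCondition = 100;
const half1 = [], half2 = [];
for (let p = 0; p < nParticipants; p++) {
  const shared = randNorm(0, 30);                   // shared variance: r = .5
  const muA = 500 + shared + randNorm(0, 30);       // condition A trait
  const muB = 550 + shared + randNorm(0, 30);       // condition B trait
  const sums = [{ a: 0, b: 0 }, { a: 0, b: 0 }];    // odd/even trial halves
  for (let t = 0; t < nTrialsPerCondition; t++) {
    sums[t % 2].a += exGaussian(muA, 50, 100);
    sums[t % 2].b += exGaussian(muB, 50, 100);
  }
  const nHalf = nTrialsPerCondition / 2;
  half1.push((sums[0].b - sums[0].a) / nHalf);      // relative RT, half 1
  half2.push((sums[1].b - sums[1].a) / nHalf);      // relative RT, half 2
}
console.log('Split-half reliability:', correlation(half1, half2));
```

An analogous computation on the per-half mean RTs of a single condition would give the reliability of an absolute RT measure, to which per-participant device offsets could be added to model between-device variation.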
However, in both group and individual difference research, any confound between device type and study design could affect RT results more severely (Reimers & Stewart, 2015). For instance, in a longitudinal study, participants could upgrade their devices between observations. If newer devices overestimate RTs less, this could produce a spurious decrease in measured absolute RTs over time. Another such confound arises when participant traits covary with device preference. Personality research has found that Mac users are more open to experience than PC users (Buchanan & Reips, 2001). If Macs overestimate RTs more than PCs do, as was found in our sample of devices, this could produce a spurious covariance between openness to experience and an absolute-RT measure. Although more recent studies have shown negligible differences in personality across a number of brands (Götz, Stieger, & Reips, 2017), similar risks apply to any trait for which covariation with device preference has not been studied. For relative RTs, these risks are less severe. Nevertheless, differences between devices in the accuracy with which RT is measured can cause differences in measurement reliability, which in turn can cause violations of measurement invariance.
In summary, in controlled settings, web applications may time stimuli quite accurately and may register RTs sufficiently accurately when a constant overestimation of RTs is acceptable. In uncontrolled settings, web applications may not time stimuli accurately enough for mental chronometry paradigms that require brief stimulus presentations. Differences in the degree to which devices overestimate RT may more severely affect the reliability with which individual differences are measured via absolute RT than via relative RT.
Web applications offer a means to deploy studies both inside and outside of the lab. Frameworks are being developed that make it increasingly easy for researchers to deploy mental chronometry paradigms as web applications (Anwyl-Irvine et al., 2019; De Leeuw, 2015; Henninger, Shevchenko, Mertens, Kieslich, & Hilbig, 2019; Murre, 2016). Studies of timing accuracy suggest limits to what may be achieved, but also introduce technical innovations for achieving higher accuracy. The experiments reported in this article examined a range of these technical innovations in order to offer some guidelines on optimal methods. A sample of ten combinations of devices and browsers was studied so that these guidelines, and the level of accuracy that can be achieved, may be generalized with some confidence.
The results of this study may be representative of web browsers, as the three browsers selected represent a large majority of browsers used online (StatCounter, 2018). However, the sample of four devices was quite small compared to the variety of devices available to web applications. This limitation may apply less to MacOS and iOS devices, whose hardware is relatively homogeneous, and more to Android and Windows devices, which come in a very wide range of hardware of different makes and quality. Additionally, each included device was a relatively high-end model but was 3–4 years old at the time of the study. Because device technology progresses very rapidly, these devices may not be representative of newer generations of devices, nor of the budget smartphones that are becoming commonplace in developing countries (Purnell, 2019). Device load was not included in the present study because previous studies reported negligible effects of it (Barnhoorn et al., 2015; Pinet et al., 2016), but this may well be different for such budget smartphones.
A study with a wider range of devices, preferably with multiple samples per device, could replicate the systematic differences found in this study. If replicated, the results could be used to correct for timing inaccuracies and RT overestimations by detecting participants’ device and browser, as sketched below. Note that this undertaking would require significant effort, given its scale. Also, it would need to be repeated for each new generation of devices, as well as after significant OS and web browser updates. The design of the present study, which could assess timing accuracy at the level of individual trials, could be helpful here. By making all materials openly accessible online, we hope to facilitate such efforts.
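As a hypothetical illustration of what such a correction could look like, the snippet below detects the platform from the browser's user-agent string and subtracts a device- and browser-specific RT overestimation. The lookup values are placeholders, not estimates from this study, and user-agent parsing in practice requires more care than shown here.

```javascript
// Hypothetical sketch: correct raw RTs by a per-platform overestimation.
// The values below are placeholders, not estimates from this study.
const rtOverestimationMs = {
  'Windows/Chrome': 60,
  'MacOS/Safari': 85,
};

function detectPlatform() {
  const ua = navigator.userAgent;
  const os = /Windows/.test(ua) ? 'Windows'
    : /Macintosh/.test(ua) ? 'MacOS'
    : /iPhone|iPad/.test(ua) ? 'iOS'
    : /Android/.test(ua) ? 'Android' : 'Other';
  const browser = /Firefox/.test(ua) ? 'Firefox'
    : /Chrome/.test(ua) ? 'Chrome'    // test Chrome before Safari, since
    : /Safari/.test(ua) ? 'Safari'    // Chrome's UA also contains "Safari"
    : 'Other';
  return os + '/' + browser;
}

function correctedRt(rawRtMs) {
  const bias = rtOverestimationMs[detectPlatform()];
  return bias === undefined ? rawRtMs : rawRtMs - bias;
}
```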
A solenoid was used for generating responses (similar to Neath et al., 2011). A benefit of the solenoid used in this study was that it provided a method for generating responses that was suitable for both keyboard and touchscreen devices. Responses were defined as the moment the solenoid came into contact with the touchscreen or keyboard. Although this is indeed the moment at which a touch response can be registered, a keyboard response additionally requires the key to be pressed down. Because the pressing of a key occurred later than the touching of it, keyboards could begin registering responses only at a later point in time than touchscreens. However, given the high consistency and speed with which the solenoid went down, we expect this delay to have been at most 2 ms. Given that the RT overestimations we encountered were 57 ms or more, we deem the solenoid-incurred delay negligible in light of our findings. Alternatively, keyboard responses could be triggered by disassembling a keyboard and hot-wiring its key switches (Pinet et al., 2016; Reimers & Stewart, 2015), and touchscreen responses could be triggered via an electrode (Schatz, Ybarra, & Leitner, 2015).
Overall, touchscreen devices seem technically capable of administering a substantial number of mental chronometry paradigms, when some limitations and best practices are taken into account. As smartphone ownership and internet connectivity become ubiquitous, this offers various opportunities for administering mental chronometry on large scales and outside of the lab. Implementing RT tasks as web applications bases them on durable and open standards, allowing a single implementation to be deployed on desktops, laptops, smartphones, and tablets. We hope that this article helps address doubts about the timing accuracy of such an approach and provides some insight into how the reliability of RT measurements can be affected when millisecond accuracy cannot be achieved.