Psychologists have been measuring response times in online experiments for nearly two decades (Musch & Reips, 2000; see, e.g., Nosek, Banaji, & Greenwald, 2002). Recently, the growing popularity of online behavioral experiments, drawing from a diverse population of Internet users, has prompted increased interest in empirical validation of response time measures collected on the Web, as well as in online behavioral experiment methodology more broadly. This is an important area of research, because online data collection may present confounds to some methodologies, particularly those that depend on tight control of visual stimulus presentation and response recording (Crump, McDonnell, & Gureckis, 2013; Reimers & Stewart, 2014). Considering the importance of precise measurements in psychophysical research, psychologists who hope to measure response times in an online experiment would benefit from a strong body of scholarly work demonstrating the validity of these methods. Recent studies have begun to establish this literature, by either replicating previous laboratory response time research using exclusively online methods or directly measuring display lags using online or laboratory systems. In this article, we offer a novel contribution to this growing literature, directly comparing human participants’ response times between browser-based and laboratory-based experimental platforms, in a within-subjects design.
It is encouraging that many of these laboratory experiments have been replicated in an online environment. Positive replications provide evidence that response times measured online are comparable to those measured in the laboratory, and being able to conduct such experiments online could improve the generalizability of findings, enable faster data collection from many more participants, and expand the range of possible methodologies. But when a replication attempt fails, there are many potential explanations: the hardware and software used for running the experiment and measuring response times differ across participants, these consumer systems may be of lower quality than laboratory equipment, the subject population differs, the effect may not generalize to a diverse population of Internet users, some replication attempts will fail by chance alone (Francis, 2013), and so on.
All systems used for measuring response times, whether online or in the laboratory, introduce some timing error, and for most research questions the amount of error (the time lag and variability introduced by software and hardware) generated by standard laboratory hardware and software seems to be acceptable to the research community. The question most relevant to researchers interested in online response time measurement is therefore not how much error a particular online-ready software package generates, but how the error generated by software used for online data collection compares with the error generated by software used in the lab.
In this experiment, we compared human participants’ response times measured by different software packages for the same experiment, making within-subjects comparisons between software packages while keeping all other experimental variables equal. The advantage of this method is that it allows direct comparisons of response time distributions at the subject level across variations of meaningful psychological parameters. Thus, it is possible to check not only whether an effect replicates in a statistical sense, but also how the difference in the response time distributions measured by each software package changes over a range of possible human response times. The major disadvantage of this approach, as compared to approaches that use an external device to simulate responses, is that substantially more variation will be introduced by human responses, which diminishes the ability to detect statistical differences between the software packages. However, for behavioral researchers interested in response time measurements, the relevant question is probably not whether there are any differences at all, but whether the differences would systematically affect a distribution of the size typically collected for a behavioral experiment.
We tested both platforms in a simple visual search task. Visual search was selected as a representative psychophysics paradigm because (1) it is a highly investigated visual task using largely standardized methods (Wolfe & Horowitz, 2004); (2) it is possible to test subjects on a large number of trials, and gain a large amount of response time data, in a relatively short amount of time (4 s per trial, in the present study); and most importantly, (3) experimental manipulations of the number of items in the search array yield robust, large, well-characterized changes in response times that can easily be modeled as a simple linear function (Wolfe, 1998). On the basis of previous research (Shen & Reingold, 2001), we expected mean response times ranging from roughly 700 to 1,200 ms, increasing linearly with the number of stimulus items in the search set.
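Point (3) can be stated concretely: in this paradigm, response times are conventionally summarized by a linear search function (the notation below is our own, introduced for reference):

$$\mathrm{RT}(n) = \beta_0 + \beta_1\, n,$$

where $n$ is the number of items in the array, $\beta_0$ is the base response time in ms, and $\beta_1$ is the search slope in ms/item. The set-size manipulation acts on $n$, so the slope $\beta_1$ is the parameter that carries the effect of interest.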
A total of 30 subjects (19 females, 11 males) participated in the experiment in exchange for $10. One subject was excluded from the analysis due to partial data loss. The subjects were 18–34 years old (mean 21.7), and most were students at Indiana University.
Subjects completed a visual search task in which they identified the presence or absence of a target item in an array of distractor items. The task closely matched a previously reported experiment by Wang, Cavanagh, and Green (1994).
At the start of the experiment, subjects completed 30 practice trials, which were identical to the experimental trials except that corrective feedback was given after each response. The feedback remained on the screen for 2,000 ms. During the practice phase, a new trial began every 6,000 ms. There was a 20-s break after the practice trials. Subjects then performed 400 experimental trials, with a 45-s break after the first 200 trials. During the experimental trials, a new trial began every 4,000 ms.
Interleaved experiment design
The subjects were seated 2.13 m away from the screen. The projected search array was 0.25 m in diameter, occupying approximately 6.5 deg of visual angle.
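For reference, the visual angle $\theta$ subtended by a stimulus of extent $w$ viewed at distance $D$ is given by the standard formula; with the dimensions above it works out to roughly the value reported (the small difference presumably reflects rounding):

$$\theta = 2\arctan\!\left(\frac{w}{2D}\right) = 2\arctan\!\left(\frac{0.25}{2 \times 2.13}\right) \approx 6.7^{\circ}.$$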
During the experiment, only one system displayed a trial at a time. PTB displayed all of the practice trials, but the test trials were split evenly between the two systems. We used two Arduino microcontrollers to control which system presented each trial. The Arduinos communicated with each other via a serial connection, and each Arduino could communicate with one of the two iMacs via a USB connection. One of the Arduinos, the “master,” contained code to randomly generate a trial order, ensuring that equal numbers of trials were run on both systems. This controller was responsible for initiating a new trial: A message was sent to the other Arduino via the serial connection, and both Arduinos simultaneously relayed the start message to their respective computers via the USB connection. The Arduinos also recorded the responses generated by the subject pressing a button. The button devices were connected in parallel to each Arduino board, and both boards relayed the response to their computer via the USB connection, as if the response had been generated on a standard keyboard. The Arduinos sampled the digital ports connected to the response devices every 1 ms.
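The following is a minimal sketch of how the "master" controller's logic could be organized. All pin numbers, identifiers, and the one-byte message format are our own illustrative assumptions (the description above does not specify them), and the sketch assumes a Leonardo-class board with a hardware Serial1 port and native USB keyboard emulation; it illustrates the coordination scheme, not the firmware actually used.

```cpp
// Hypothetical sketch of the "master" Arduino's trial-interleaving logic.
#include <Keyboard.h>

const int N_TRIALS = 400;                   // experimental trials, split evenly
const int RESPONSE_PIN = 2;                 // button wired in parallel to both boards
const unsigned long TRIAL_INTERVAL = 4000;  // ms between trial onsets

byte trialOrder[N_TRIALS];  // 0 = this system runs the trial, 1 = the other

void setup() {
  Serial1.begin(9600);      // board-to-board serial link
  Keyboard.begin();         // relay events to the iMac as USB keystrokes
  pinMode(RESPONSE_PIN, INPUT_PULLUP);

  // Exactly half the trials on each system, in random order (Fisher-Yates shuffle).
  for (int i = 0; i < N_TRIALS; i++) trialOrder[i] = (i < N_TRIALS / 2) ? 0 : 1;
  randomSeed(analogRead(A0));
  for (int i = N_TRIALS - 1; i > 0; i--) {
    int j = random(i + 1);
    byte tmp = trialOrder[i]; trialOrder[i] = trialOrder[j]; trialOrder[j] = tmp;
  }
}

void loop() {
  static int trial = 0;
  static unsigned long lastStart = 0;
  static bool wasDown = false;

  // Initiate a trial on schedule: send the start message to the other board,
  // and to this board's own computer (here as a hypothetical keystroke).
  if (trial < N_TRIALS && millis() - lastStart >= TRIAL_INTERVAL) {
    Serial1.write(trialOrder[trial]);
    Keyboard.write(trialOrder[trial] == 0 ? 's' : 'o');
    lastStart = millis();
    trial++;
  }

  // Sample the response button once per millisecond; on a new press, relay
  // it upstream as if it had been typed on a standard keyboard.
  bool isDown = (digitalRead(RESPONSE_PIN) == LOW);
  if (isDown && !wasDown) Keyboard.write(' ');
  wasDown = isDown;
  delay(1);
}
```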
Response times were excluded from the analysis if the participant responded incorrectly or if the trial timed out (cutoff at 2,000 ms). Mean accuracy across all subjects (excluding practice) was 96.1%, with a range from 84.3% to 98.8%.
Bayesian data analysis
We used Bayesian data analysis for all of our analyses. There are numerous reasons to prefer Bayesian data analysis techniques to null hypothesis significance testing (Kruschke, 2011). Our primary motivation in the context of this experiment was to have richer information about the effect of software on the response time distributions. For example, with Bayesian techniques we could determine a distribution of credible values that described the difference in mean response times between the two software packages, giving us an estimate of the magnitude of the difference and the uncertainty of this estimate, rather than relying on a p value to indicate whether a particular observed difference was likely to have occurred by chance. This estimate is often summarized by the 95% highest density interval (HDI), which is the range of values of a parameter that contains 95% of the distribution, with all values inside the HDI being more probable than all values outside the HDI. The 95% HDI tells us what parameter values are most likely, given the model and the observed data.
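As an illustration of what computing an HDI from an MCMC sample involves, here is a self-contained sketch (our own example, not part of the analysis pipeline, which used JAGS and R) that finds the narrowest interval containing 95% of a set of samples; for a unimodal posterior this narrowest interval is the HDI:

```cpp
#include <algorithm>
#include <cmath>
#include <cstdio>
#include <utility>
#include <vector>

// Returns the narrowest interval containing `mass` (e.g., 0.95) of the
// samples; for a unimodal posterior this is the highest density interval.
std::pair<double, double> hdi(std::vector<double> samples, double mass = 0.95) {
  std::sort(samples.begin(), samples.end());
  const std::size_t n = samples.size();
  const std::size_t span = static_cast<std::size_t>(std::ceil(mass * n));
  std::size_t best = 0;
  for (std::size_t i = 0; i + span <= n; ++i)  // slide a 95%-mass window
    if (samples[i + span - 1] - samples[i] <
        samples[best + span - 1] - samples[best])
      best = i;
  return {samples[best], samples[best + span - 1]};
}

int main() {
  // Toy stand-in for an MCMC sample of one parameter (e.g., a difference in
  // mean RT): roughly normal draws with mean 30 and standard deviation 10.
  std::vector<double> draws;
  unsigned seed = 12345u;
  for (int i = 0; i < 60000; ++i) {
    double s = 0.0;
    for (int k = 0; k < 12; ++k) {  // sum of 12 uniforms approximates a normal
      seed = seed * 1664525u + 1013904223u;
      s += (seed >> 8) / double(1u << 24);
    }
    draws.push_back(30.0 + 10.0 * (s - 6.0));
  }
  const auto [lo, hi] = hdi(draws);
  std::printf("95%% HDI: [%.1f, %.1f]\n", lo, hi);
  return 0;
}
```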
We built a hierarchical model to describe the parameters of interest in the data. The full model specification is presented in the “Appendix”. Conceptually, the model performed a linear regression for each subject, estimating the search function parameters (intercept and slope) in each of the four within-subjects conditions (2 software packages × 2 trial types: target present or absent), while simultaneously estimating the group-level distribution for each of the subject-level parameters. This technique has the desirable property of introducing shrinkage into the estimates of the individual subjects’ parameters, allowing the parameter estimates for individual subjects to mutually inform each other. This is helpful for dealing with noisy data, because it moves outliers toward the group mean (Kruschke, 2011). The parameters of interest for us were the group-level estimates. These describe the overall effects of the software environment across all subjects.
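The full specification is in the Appendix; schematically, and in notation of our own choosing (a sketch consistent with the description above and with the lognormal likelihood discussed in the notes, not a verbatim reproduction of the JAGS model), the model has the following shape:

$$
\begin{aligned}
\mathrm{RT}_t &\sim \operatorname{Lognormal}\big(m_{s,c}(n_t),\ v_{s,c}\big) && \text{trial $t$, subject $s$, condition $c$; mean/variance parameterization}\\
m_{s,c}(n) &= \alpha_{s,c} + \beta_{s,c}\, n && \text{subject-level linear search function}\\
\alpha_{s,c} &\sim \operatorname{Normal}\big(\mu^{\alpha}_{c},\ \sigma^{\alpha}_{c}\big), \quad \beta_{s,c} \sim \operatorname{Normal}\big(\mu^{\beta}_{c},\ \sigma^{\beta}_{c}\big) && \text{group-level distributions inducing shrinkage}
\end{aligned}
$$

with vague priors on the group-level means and spreads; the group-level parameters $\mu^{\alpha}_{c}$ and $\mu^{\beta}_{c}$ are the estimates reported below.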
To estimate the parameters of the model, we used Just Another Gibbs Sampler (JAGS; Plummer, 2003) and the runjags R package (Denwood, 2014) for Markov chain Monte Carlo (MCMC) sampling. The sample consisted of three independent chains, each run for 20,000 iterations after an adaptation period of 1,000 iterations and a burn-in period of 4,000 iterations. We assessed the convergence of the chains for each parameter of interest via the Gelman–Rubin statistic (Gelman & Rubin, 1992); the R̂ values were less than 1.025 for all parameters. The full model specification, in JAGS format, and the complete MCMC sample, in .Rdata format, are available online at http://hdl.handle.net/2022/19253.
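In its common form (Gelman & Rubin, 1992), the statistic compares within- and between-chain variability: for $m$ chains of length $n$ with chain means $\bar\theta_j$, grand mean $\bar\theta$, and within-chain variances $s_j^2$,

$$W = \frac{1}{m}\sum_{j=1}^{m} s_j^2, \qquad B = \frac{n}{m-1}\sum_{j=1}^{m}\big(\bar\theta_j - \bar\theta\big)^2, \qquad \hat{R} = \sqrt{\frac{\tfrac{n-1}{n}\,W + \tfrac{1}{n}\,B}{W}},$$

with values near 1 indicating that the chains have converged on the same distribution.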
Estimates of the search functions
Given that there are multiple instances of this exact visual search task in the literature (Shen & Reingold, 2001; Wang et al., 1994), our first analysis was simply to verify that we found search functions similar to those from previous studies (ignoring any possible effects of software). As in the previous research, adding items to the search set increased response times, and response times during target-absent trials were longer than those during target-present trials. The group-level mean estimate for the slope of the search function in target-present trials was 81.5 ms/item (95% HDI: 69.8 to 93.6 ms/item); for target-absent trials, the mean estimate was 110 ms/item (95% HDI: 98.4 to 122 ms/item). The group-level mean estimate for the difference in intercepts between target-present and target-absent trials was 66.5 ms (95% HDI: 32.9 to 98.6 ms), with target-absent intercepts being larger.
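As a point-estimate illustration (our own arithmetic on the posterior means above, not an additional model quantity), these estimates imply a target-absent minus target-present difference that grows with set size $n$:

$$\Delta(n) \approx 66.5 + (110 - 81.5)\,n = 66.5 + 28.5\,n \ \text{ms}.$$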
Effects of software package
The other three parameters of interest showed no reliable difference between jsPsych and PTB. The estimates of the difference in the coefficients of set size (95% HDI: –4.2 to 7.01 ms/item), the difference in the standard deviations of the RT distributions (95% HDI: –9.93 to 21.1 ms), and the difference in the standard deviations of the estimates of set size (95% HDI: –5.71 to 3.13 ms/item) all spanned 0 and had means relatively close to 0.
Validity of the analysis model
Data and model estimates for each cell
                      jsPsych                                        PTB
                      Mean RT [95% HDI]       SD of RT [95% HDI]     Mean RT [95% HDI]       SD of RT [95% HDI]
Target present
  Set size level 1    731 [686 to 779]        148 [130 to 166]       702 [656 to 749]        145 [127 to 163]
  Set size level 2    814 [760 to 865]        191 [172 to 211]       783 [730 to 835]        190 [170 to 209]
  Set size level 3    896 [835 to 955]        235 [212 to 257]       864 [803 to 923]        234 [212 to 257]
  Set size level 4    1,060 [978 to 1,140]    321 [290 to 352]       1,030 [946 to 1,100]    323 [292 to 354]
Target absent
  Set size level 1    855 [808 to 902]        167 [149 to 186]       826 [779 to 873]        164 [146 to 182]
  Set size level 2    966 [913 to 1,020]      192 [172 to 211]       935 [882 to 987]        190 [170 to 209]
  Set size level 3    1,080 [1,020 to 1,140]  216 [194 to 238]       1,040 [985 to 1,110]    215 [193 to 237]
  Set size level 4    1,300 [1,220 to 1,380]  265 [235 to 295]       1,260 [1,180 to 1,340]  267 [237 to 297]

Note. All values are model estimates in ms, given as posterior mean [95% HDI]. Set size levels are ordered from smallest to largest array; the assignment of column groups to software packages is inferred from the consistent offset between the paired estimates. The observed across-subjects mean RT and mean SD columns of the original table could not be recovered.
These data are based on citations of the three articles that the authors of PTB have indicated as their preferred citations (Brainard, 1997; Kleiner, Brainard, & Pelli, 2007; Pelli, 1997). The data were collected from the Web of Science on September 23, 2014. Duplicate citations (i.e., an article citing more than one of the three) were not counted toward the total.
We used a lognormal distribution to model the response time data because we were only interested in the mean and variance of the distributions. Since the lognormal can be reparameterized into the mean and variance of the original (non-log-transformed) data, it was a reasonable option that provided a conceptually clear mapping between the model parameters and the basic research question we were asking, yet still acknowledged the skew inherent in response time distributions. Although other distributions, such as an ex-Gaussian, could have been used, the lognormal provided good fits to the data without additional parameters.
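Concretely, the reparameterization rests on standard lognormal identities (stated here for reference): if $X \sim \operatorname{Lognormal}(\mu, \sigma^2)$, the mean $m$ and variance $v$ of the untransformed data are

$$m = \mathbb{E}[X] = e^{\mu + \sigma^2/2}, \qquad v = \operatorname{Var}[X] = \big(e^{\sigma^2} - 1\big)\, e^{2\mu + \sigma^2},$$

and inverting these gives the log-scale parameters in terms of the quantities of interest:

$$\sigma^2 = \ln\!\left(1 + \frac{v}{m^2}\right), \qquad \mu = \ln m - \frac{\sigma^2}{2}.$$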
We thank Richard Viken for providing laboratory space to conduct the experiment, Chris Eller and the IU Advanced Visualization Laboratory for equipment and technical support, Michael Bailey for assistance in the data collection, John Kruschke for suggesting improvements to our analysis model, and Tony Walker and Alex Shroyer for assistance with the Arduino platform. This material is based on work that was supported by a National Science Foundation Graduate Research Fellowship under Grant No. DGE-1342962.
- Denwood, M. J. (2014). runjags: Interface utilities, parallel computing methods and additional distributions for MCMC models in JAGS [Software]. Retrieved from http://cran.r-project.org/web/packages/runjags/index.html
- Kleiner, M., Brainard, D., & Pelli, D. (2007). What’s new in Psychtoolbox-3? Perception, 36(ECVP Abstract Supplement).
- Kruschke, J. K. (2011). Doing Bayesian data analysis: A tutorial with R and BUGS (1st ed.). Orlando: Academic Press.
- Kruschke, J. K. (2013). Posterior predictive checks can and should be Bayesian: Comment on Gelman and Shalizi, “Philosophy and the practice of Bayesian statistics.” British Journal of Mathematical and Statistical Psychology, 66, 45–56. doi:10.1111/j.2044-8317.2012.02063.x
- Plummer, M. (2003). JAGS: A program for analysis of Bayesian graphical models using Gibbs sampling. In K. Hornik, F. Leisch, & A. Zeileis (Eds.), Proceedings of the 3rd International Workshop on Distributed Statistical Computing (pp. 1–10). Retrieved from www.r-project.org/conferences/DSC-2003/Proceedings/
- Ross, J., Irani, L., Silberman, M. S., Zaldivar, A., & Tomlinson, B. (2010). Who are the crowdworkers? Shifting demographics in Mechanical Turk. In Proceedings of the 28th International Conference on Human Factors in Computing Systems, CHI 2010 (pp. 2863–2872). New York, NY: ACM. doi:10.1145/1753846.1753873