Psychologists have been measuring response times in online experiments for nearly two decades (Musch & Reips, 2000; see, e.g., Nosek, Banaji, & Greenwald, 2002). Recently, the growing popularity of online behavioral experiments, drawing from a diverse population of Internet users, has prompted increased interest in empirical validation of response time measures collected on the Web, as well as in online behavioral experiment methodology more broadly. This is an important area of research, because online data collection may present confounds to some methodologies, particularly those that depend on tight control of visual stimulus presentation and response recording (Crump, McDonnell, & Gureckis, 2013; Reimers & Stewart, 2014). Considering the importance of precise measurements in psychophysical research, psychologists who hope to measure response times in an online experiment would benefit from a strong body of scholarly work demonstrating the validity of these methods. Recent studies have begun to establish this literature, by either replicating previous laboratory response time research using exclusively online methods or directly measuring display lags using online or laboratory systems. In this article, we offer a novel contribution to this growing literature, directly comparing human participants’ response times between browser-based and laboratory-based experimental platforms, in a within-subjects design.
It is encouraging that many of these laboratory experiments have been replicated in an online environment. Positive replications provide evidence that response times measured online are comparable to those measured in the laboratory, and being able to conduct such experiments online could improve the generalizability of findings, enable faster data collection from many more participants, and expand the range of possible methodologies. But when a replication attempt fails, there are many potential explanations: The hardware and software used for running the experiment and measuring response times are different between participants, these consumer systems may be of lower quality than laboratory equipment, the subject population will be different, the effect may not have been generalizable to a diverse population of Internet users, some replication attempts should fail just by chance (Francis, 2013), and so on.
All systems used for measuring response times, whether online or in the laboratory, will introduce some timing error, and for most research questions the amount of error (the time lag and variability introduced by software and hardware) generated by standard laboratory hardware and software seems to be acceptable to the research community. The question that is most relevant to researchers interested in online response time measurement is not how much error is generated by a particular online-ready software package, but rather, how does the error generated by software packages used for online data collection compare to that from software used in the lab?
In this experiment, we compared human participants’ response times measured by different software packages for the same experiment, making within-subjects comparisons between software packages, keeping all other experimental variables equal. The advantage of this method is that it allows for direct comparisons of response time distributions at the subject level across variations of meaningful psychological parameters. Thus, it is possible not only to check whether an effect replicates in a statistical sense, but also how the difference in the response time distributions measured by each software package changes over a range of possible human response times. The major disadvantage of this approach, as compared to approaches that use an external device to simulate responses, is that substantially more variation will be introduced by human responses, which will diminish the ability to detect statistical differences between the software packages. However, for behavioral researchers interested in response time measurements, the relevant question is probably not whether there are any differences, but whether the differences would systematically affect a distribution of the size typically collected for a behavioral experiment.
We tested both platforms in a simple visual search task. Visual search was selected as a representative psychophysics paradigm because (1) it is a highly investigated visual task using largely standardized methods (Wolfe & Horowitz, 2004); (2) it is possible to test subjects on a large number of trials, and gain a large amount of response time data, in a relatively short amount of time (4 s per trial, in the present study); and most importantly, (3) experimental manipulations of the number of items in the search array yield robust, large, well-characterized changes in response times that can easily be modeled as a simple linear function (Wolfe, 1998). On the basis of previous research (Shen & Reingold, 2001), we expected mean response times ranging from roughly 700 to 1,200 ms that increased linearly with the number of stimulus items in the search set.
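The linear search function mentioned above can be illustrated with an ordinary least-squares fit of response time against set size. This is only a sketch: the set sizes and response times below are hypothetical values chosen for illustration, and the analysis reported later in this article uses a hierarchical Bayesian model rather than least squares.

```python
from statistics import mean


def fit_search_function(set_sizes, rts_ms):
    """Least-squares fit of RT = intercept + slope * set_size."""
    x_bar, y_bar = mean(set_sizes), mean(rts_ms)
    sxx = sum((x - x_bar) ** 2 for x in set_sizes)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(set_sizes, rts_ms))
    slope = sxy / sxx                  # ms per additional item
    intercept = y_bar - slope * x_bar  # predicted RT at set size 0
    return intercept, slope


# Hypothetical mean RTs (ms) at hypothetical set sizes; not data from the study.
set_sizes = [2, 4, 6]
rts_ms = [700.0, 900.0, 1100.0]
intercept, slope = fit_search_function(set_sizes, rts_ms)
```

In this toy example the slope is the cost in milliseconds of each added item, and the intercept is the set-size-independent component of the response time.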
A total of 30 subjects (19 females, 11 males) participated in the experiment in exchange for $10. One subject was excluded from the analysis due to partial data loss. The subjects were 18–34 years old (mean 21.7), and most were students at Indiana University.
Subjects completed a visual search task in which they identified the presence or absence of a target () in an array of distractors (). The task closely matched a previously reported experiment by Wang, Cavanagh, and Green (1994).
At the start of the experiment, subjects completed 30 practice trials, which were identical to the experimental trials except that corrective feedback was given after each response. The feedback remained on the screen for 2,000 ms. During the practice phase, a new trial began every 6,000 ms. There was a 20-s break after the practice trials. Subjects then performed 400 experimental trials, with a 45-s break after the first 200 trials. During the experimental trials, a new trial began every 4,000 ms.
Interleaved experiment design
The subjects were seated 2.13 m away from the screen. The projected search array was 0.25 m in diameter, occupying approximately 6.5 deg of visual angle.
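The relation between viewing distance, stimulus size, and visual angle can be checked with the standard formula, angle = 2 · atan(size / (2 · distance)). The sketch below applies it to the dimensions given above; the function name is ours, for illustration only.

```python
import math


def visual_angle_deg(size_m, distance_m):
    """Visual angle (degrees) subtended by an object of a given size and distance."""
    return math.degrees(2 * math.atan(size_m / (2 * distance_m)))


# Dimensions reported in the text: 0.25 m array viewed from 2.13 m.
angle = visual_angle_deg(0.25, 2.13)
```

With these inputs the formula gives roughly 6.7 deg, close to the approximate value reported.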
During the experiment, only one system displayed a trial at a time. PTB displayed all of the practice trials, but the test trials were split evenly between the two systems. We used two Arduino microcontrollers to control which system presented each trial. The Arduinos communicated with each other via a serial connection, and each Arduino could communicate with one of the two iMacs via a USB connection. One of the Arduinos, the “master,” contained code to randomly generate a trial order, ensuring that equal numbers of trials were run on both systems. This controller was responsible for initiating a new trial: A message was sent to the other Arduino via the serial connection, and both Arduinos simultaneously relayed the start message to their respective computers via the USB connection. The Arduinos also recorded the responses generated by the subject pressing a button. The button devices were connected in parallel to each Arduino board, and both boards relayed the response to their computer via the USB connection, as if the response had been generated on a standard keyboard. The Arduinos sampled the digital ports connected to the response devices every 1 ms.
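The balanced random assignment of trials to systems can be sketched as follows. This is an illustration in Python under our own assumptions, not the Arduino code used in the experiment: the function name, the seed argument, and the shuffle-based approach are hypothetical.

```python
import random


def balanced_trial_order(n_trials, systems=("jsPsych", "PTB"), seed=None):
    """Random order of systems in which each system runs equally many trials."""
    if n_trials % len(systems) != 0:
        raise ValueError("n_trials must divide evenly among the systems")
    per_system = n_trials // len(systems)
    # Build the full list first, then shuffle, so the counts are exactly equal.
    order = [system for system in systems for _ in range(per_system)]
    random.Random(seed).shuffle(order)
    return order


# 400 experimental trials, as in the procedure described above.
order = balanced_trial_order(400, seed=1)
```

Shuffling a prebuilt list guarantees the equal split, whereas drawing a system independently on each trial would balance the counts only on average.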
Response times were excluded from the analysis if the participant responded incorrectly or if the trial timed out (cutoff at 2,000 ms). Mean accuracy across subjects (excluding practice trials) was 96.1%, with a range from 84.3% to 98.8%.
Bayesian data analysis
We used Bayesian data analysis for all of our analyses. There are numerous reasons to prefer Bayesian data analysis techniques to null hypothesis significance testing (Kruschke, 2011). Our primary motivation in the context of this experiment was to have richer information about the effect of software on the response time distributions. For example, with Bayesian techniques we could determine a distribution of credible values that described the difference in mean response times between the two software packages, giving us an estimate of the magnitude of the difference and the uncertainty of this estimate, rather than relying on a p value to indicate whether a particular observed difference was likely to have occurred by chance. This estimate is often summarized by the 95% highest density interval (HDI), which is the range of values of a parameter that contains 95% of the distribution, with all values inside the HDI being more probable than all values outside the HDI. The 95% HDI tells us what parameter values are most likely, given the model and the observed data.
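To make the HDI concrete, the following sketch computes it from a set of posterior samples by finding the narrowest interval that contains 95% of them. The function name and the simulated sample are hypothetical; this is one common sample-based approach and is not necessarily the routine used for the analyses reported here.

```python
import math
import random


def hdi(samples, mass=0.95):
    """Return the narrowest interval containing `mass` of the samples."""
    ordered = sorted(samples)
    n = len(ordered)
    n_inside = math.ceil(mass * n)  # number of samples the interval must cover
    if n_inside < 2 or n_inside > n:
        raise ValueError("need enough samples and 0 < mass <= 1")
    # Slide a window of n_inside consecutive sorted samples; keep the narrowest.
    best_low, best_high = ordered[0], ordered[n_inside - 1]
    for start in range(1, n - n_inside + 1):
        low, high = ordered[start], ordered[start + n_inside - 1]
        if high - low < best_high - best_low:
            best_low, best_high = low, high
    return best_low, best_high


# Hypothetical posterior sample of a difference in mean RT (ms); not study data.
rng = random.Random(0)
posterior = [rng.gauss(5.0, 10.0) for _ in range(20000)]
low, high = hdi(posterior)
```

For a symmetric, unimodal posterior the HDI is close to the central 95% interval; for skewed posteriors the two can differ, which is one reason to report the HDI.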
We built a hierarchical model to describe the parameters of interest in the data. The full model specification is presented in the “Appendix”. Conceptually, the model performed a linear regression for each subject, estimating the search function parameters (intercept and slope) in each of the four within-subjects conditions (2 software packages × 2 trial types: target present or absent), while simultaneously estimating the group-level distribution for each of the subject-level parameters. This technique has the desirable property of introducing shrinkage into the estimates of the individual subjects’ parameters, allowing the parameter estimates for individual subjects to mutually inform each other. This is helpful for dealing with noisy data, because it moves outliers toward the group mean (Kruschke, 2011). The parameters of interest for us were the group-level estimates. These describe the overall effects of the software environment across all subjects.
To estimate the parameters of the model, we used Just Another Gibbs Sampler (JAGS; Plummer, 2003) and the runjags R package (Denwood, 2014) for Markov-chain Monte Carlo (MCMC) sampling. The sample consisted of three independent chains, each sampled for 20,000 iterations after an adaptation period of 1,000 iterations and a burn-in period of 4,000 iterations. We assessed the convergence of the chains for each parameter of interest via the Gelman–Rubin test (Gelman & Rubin, 1992). The Gelman–Rubin statistic (R̂) was less than 1.025 for all parameters. The full model specification, in JAGS format, and the complete MCMC sample, in .Rdata format, are available online at http://hdl.handle.net/2022/19253.
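The logic of the Gelman–Rubin diagnostic can be sketched in a few lines: it compares the variance between chains to the variance within chains, and values near 1 indicate that the chains agree. The code below implements the basic form of the statistic on simulated chains; the function name and the simulated data are hypothetical, and the value reported by a given software package may include further corrections.

```python
import math
import random
from statistics import mean, variance


def gelman_rubin(chains):
    """Basic potential scale reduction factor for equal-length chains."""
    m = len(chains)
    n = len(chains[0])
    if m < 2 or n < 2 or any(len(c) != n for c in chains):
        raise ValueError("need at least two chains of equal length (>= 2)")
    within = mean([variance(c) for c in chains])       # W: mean within-chain variance
    between = n * variance([mean(c) for c in chains])  # B: scaled variance of chain means
    pooled = (n - 1) / n * within + between / n        # pooled estimate of the variance
    return math.sqrt(pooled / within)


# Simulated chains for illustration only; not the MCMC sample from this study.
rng = random.Random(42)
mixed = [[rng.gauss(0.0, 1.0) for _ in range(2000)] for _ in range(3)]
stuck = [[rng.gauss(shift, 1.0) for _ in range(2000)] for shift in (0.0, 5.0, 10.0)]
rhat_mixed = gelman_rubin(mixed)  # chains sampling the same distribution
rhat_stuck = gelman_rubin(stuck)  # chains centered on different values
```

Chains that sample the same distribution give a value very close to 1, whereas chains that have settled in different regions give a value well above 1.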
Estimates of the search functions
Given that there are multiple instances of this exact visual search task in the literature (Shen & Reingold, 2001; Wang et al., 1994), our first analysis was simply to verify that we found search functions similar to those from previous studies (ignoring any possible effects of software). As in the previous research, adding items to the search set increased response times, and response times during target-absent trials were longer than those during target-present trials. The group-level mean estimate for the slope of the search function for target-present trials was 81.5 ms/item (95% HDI: 69.8 to 93.6 ms/item). For target-absent trials, the mean estimate was 110 ms/item (95% HDI: 98.4 to 122 ms/item). The group-level mean estimate for the difference in intercepts between target-present and target-absent trials was 66.5 ms (95% HDI: 32.9 to 98.6 ms), with target-absent trials being longer.
Effects of software package
The other three parameters of interest showed no reliable difference between jsPsych and PTB. The estimates of the difference in the coefficients of set size (95% HDI: –4.2 to 7.01 ms/item), the difference in the standard deviations of the RT distributions (95% HDI: –9.93 to 21.1 ms), and the difference in the standard deviations of the estimates of set size (95% HDI: –5.71 to 3.13 ms/item) all spanned 0 and had means relatively close to 0.
Validity of the analysis model
Bayesian methods are only able to report the relative likelihood of parameter values for a particular model. Therefore, once the most likely parameter values are found, it is important to verify that the model can generate data that are a reasonable approximation of the empirical measurements (Kruschke, 2011, 2013). To examine the predictive validity of our analysis model, we generated credible regression lines from the posterior distributions and plotted them against the group-level data. As is shown in Fig. 4, the credible regression lines capture the overall patterns in the data very well, with the raw means of the data falling on top of the credible regression lines. We conducted a similar analysis for the distributions of subjects’ search function parameters, generating credible normal distribution functions to describe the distributions of search function parameters, and found that the analysis model captured these patterns reasonably well.
Footnote 1. These data are based on citations to the three articles that the authors of PTB have indicated are the desired citations (Brainard, 1997; Kleiner, Brainard, & Pelli, 2007; Pelli, 1997). The data were collected from the Web of Science on September 23, 2014. Duplicate citations (i.e., an article citing more than one of the articles) were not counted toward the total.
Footnote 2. We used a lognormal distribution to model the response time data because we were only interested in the mean and variance of the distributions. Since the lognormal can be reparameterized into the mean and variance of the original (non-log-transformed) data, it was a reasonable option that provided a conceptually clear mapping between the model parameters and the basic research question we were asking, yet still acknowledged the skew inherent in response time distributions. Although other distributions, such as an ex-Gaussian, could have been used, the lognormal provided good fits to the data without additional parameters.
Arnett, J. J. (2008). The neglected 95%: Why American psychology needs to become less American. American Psychologist, 63, 602–614. doi:10.1037/0003-066X.63.7.602
Barnhoorn, J. S., Haasnoot, E., Bocanegra, B. R., & van Steenbergen, H. (2014). QRTEngine: An easy solution for running online reaction time experiments using Qualtrics. Behavior Research Methods. doi:10.3758/s13428-014-0530-7
Brainard, D. H. (1997). The Psychophysics Toolbox. Spatial Vision, 10, 433–436. doi:10.1163/156856897X00357
Crump, M. J. C., McDonnell, J. V., & Gureckis, T. M. (2013). Evaluating Amazon’s Mechanical Turk as a tool for experimental behavioral research. PloS One, 8, e57410. doi:10.1371/journal.pone.0057410
De Clercq, A., Crombez, G., Buysse, A., & Roeyers, H. (2003). A simple and sensitive method to measure timing accuracy. Behavior Research Methods, Instruments, & Computers, 35, 109–115. doi:10.3758/BF03195502
Denwood, M. J. (2014). runjags: Interface utilities, parallel computing methods and additional distributions for MCMC models in JAGS [Software]. Retrieved from http://cran.r-project.org/web/packages/runjags/index.html
Francis, G. (2013). Replication, statistical consistency, and publication bias. Journal of Mathematical Psychology, 57, 153–169. doi:10.1016/j.jmp.2013.02.003
Garaizar, P., Vadillo, M. A., & López-de-Ipiña, D. (2014). Presentation accuracy of the web revisited: Animation methods in the HTML5 era. PloS One, 9, e109812. doi:10.1371/journal.pone.0109812
Gelman, A., & Rubin, D. (1992). Inference from iterative simulation using multiple sequences. Statistical Science, 7, 457–472.
Hawkins, R. X. D. (2014). Conducting real-time multiplayer experiments on the web. Behavior Research Methods. doi:10.3758/s13428-014-0515-6
Henrich, J., Heine, S. J., & Norenzayan, A. (2010). The weirdest people in the world? Behavioral and Brain Sciences, 33, 61–83. doi:10.1017/S0140525X0999152X. disc. 83–135.
Kleiner, M., Brainard, D., & Pelli, D. (2007). What’s new in Psychtoolbox-3? Perception, 36(ECVP Abstract Supplement).
Kruschke, J. K. (2011). Doing Bayesian data analysis: A tutorial with R and BUGS (1st ed.). Orlando: Academic Press.
Kruschke, J. K. (2013). Posterior predictive checks can and should be Bayesian: Comment on Gelman and Shalizi, “Philosophy and the practice of Bayesian statistics.” British Journal of Mathematical and Statistical Psychology, 66, 45–56. doi:10.1111/j.2044-8317.2012.02063.x
Lagroix, H. E. P., Yanko, M. R., & Spalek, T. M. (2012). LCDs are better: Psychophysical and photometric estimates of the temporal characteristics of CRT and LCD monitors. Attention, Perception, & Psychophysics, 74, 1033–1041. doi:10.3758/s13414-012-0281-4
Musch, J., & Reips, U.-D. (2000). A brief history of Web experimenting. In M. H. Birnbaum (Ed.), Psychological experiments on the Internet (pp. 61–87). San Diego: Academic Press.
Neath, I., Earle, A., Hallett, D., & Surprenant, A. M. (2011). Response time accuracy in Apple Macintosh computers. Behavior Research Methods, 43, 353–362. doi:10.3758/s13428-011-0069-9
Nosek, B. A., Banaji, M., & Greenwald, A. G. (2002). Harvesting implicit group attitudes and beliefs from a demonstration web site. Group Dynamics: Theory, Research, and Practice, 6, 101–115. doi:10.1037/1089-2699.6.1.101
Pelli, D. G. (1997). The VideoToolbox software for visual psychophysics: Transforming numbers into movies. Spatial Vision, 10, 437–442. doi:10.1163/156856897X00366
Plant, R. R., & Turner, G. (2009). Millisecond precision psychological research in a world of commodity computers: New hardware, new problems? Behavior Research Methods, 41, 598–614. doi:10.3758/BRM.41.3.598
Plummer, M. (2003). JAGS: A program for analysis of Bayesian graphical models using Gibbs sampling. In K. Hornik, F. Leisch, & A. Zeileis (Eds.), Proceedings of the 3rd International Workshop on Distributed Statistical Computing (pp. 1–10). Retrieved from www.r-project.org/conferences/DSC-2003/Proceedings/
Reimers, S., & Stewart, N. (2007). Adobe Flash as a medium for online experimentation: A test of reaction time measurement capabilities. Behavior Research Methods, 39, 365–370. doi:10.3758/BF03193004
Reimers, S., & Stewart, N. (2008). Using Adobe Flash Lite on mobile phones for psychological research: Reaction time measurement reliability and interdevice variability. Behavior Research Methods, 40, 1170–1176. doi:10.3758/BRM.40.4.1170
Ross, J., Irani, L., Silberman, M. S., Zaldivar, A., & Tomlinson, B. (2010). Who are the crowdworkers? Shifting demographics in Mechanical Turk. In Proceedings of the 28th International Conference on Human Factors in Computing Systems, CHI 2010 (pp. 2863–2872). New York, NY: ACM. doi:10.1145/1753846.1753873
Schubert, T. W., Murteira, C., Collins, E. C., & Lopes, D. (2013). ScriptingRT: A software library for collecting response latencies in online studies of cognition. PloS One, 8, e67769. doi:10.1371/journal.pone.0067769
Shen, J., & Reingold, E. M. (2001). Visual search asymmetry: The influence of stimulus familiarity and low-level features. Perception & Psychophysics, 63, 464–475. doi:10.3758/BF03194413
Simcox, T., & Fiez, J. A. (2014). Collecting response times using Amazon Mechanical Turk and Adobe Flash. Behavior Research Methods, 46, 95–111. doi:10.3758/s13428-013-0345-y
Wang, Q., Cavanagh, P., & Green, M. (1994). Familiarity and pop-out in visual search. Perception & Psychophysics, 56, 495–500. doi:10.3758/BF03206946
Wolfe, J. M. (1998). What can 1 million trials tell us about visual search? Psychological Science, 9, 33–39. doi:10.1111/1467-9280.00006
Wolfe, J. M., & Horowitz, T. S. (2004). What attributes guide the deployment of visual attention and how do they do it? Nature Reviews Neuroscience, 5, 495–501. doi:10.1038/nrn1411
Zwaan, R. A., & Pecher, D. (2012). Revisiting mental simulation in language comprehension: Six replication attempts. PloS One, 7, e51382. doi:10.1371/journal.pone.0051382
We thank Richard Viken for providing laboratory space to conduct the experiment, Chris Eller and the IU Advanced Visualization Laboratory for equipment and technical support, Michael Bailey for assistance in the data collection, John Kruschke for suggesting improvements to our analysis model, and Tony Walker and Alex Shroyer for assistance with the Arduino platform. This material is based on work that was supported by a National Science Foundation Graduate Research Fellowship under Grant No. DGE-1342962.
Appendix: Analysis model
We model an individual response time from trial i, y_i, as coming from a lognormal distribution (see Footnote 2) that is specific for the subject, s_i; trial type (target present vs. absent), t_i; and software package, p_i, associated with that trial:
The lognormal distribution is parameterized by location, κ, and shape, η, parameters; however, our main interest in conducting the analysis was to understand how the mean, μ, and standard deviation, σ, of the original (non-log-transformed) response time data were affected by the software package. To make this conceptually clear in the model, we constructed our model to estimate parameters in the scale of the original data, and then transformed these parameters into the location and shape parameters of the lognormal distribution:
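For readers reconstructing the model, the likelihood and the reparameterization described above can be written as follows. This is a sketch using the standard moment-matching relations for the lognormal distribution, under the assumption that κ and η denote the mean and standard deviation of the log-transformed response times; the form of the original equations may differ.

```latex
\begin{align}
y_i &\sim \operatorname{LogNormal}\left(\kappa_{s_i,t_i,p_i},\ \eta_{s_i,t_i,p_i}\right) \\
\eta &= \sqrt{\ln\left(1 + \frac{\sigma^2}{\mu^2}\right)} \\
\kappa &= \ln(\mu) - \frac{\eta^2}{2}
\end{align}
```

Under these relations, the mean of y is exp(κ + η²/2) = μ and its variance is (exp(η²) − 1) exp(2κ + η²) = σ², so μ and σ are on the scale of the original data. Note that the lognormal distribution in JAGS (dlnorm) takes a precision, 1/η², rather than η as its second argument.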
The regression portion of the model was built to find values of the mean and standard deviation parameters in the scale of the original data. The model estimated intercept, b, and slope, m, parameters (relative to the set size, x) for each unique combination of subject, trial type, and software package:
The intercept and slope parameters were linear combinations of a subject-level baseline, β, a subject-level estimate of the difference between the software packages, φ, and a subject-level estimate of the difference between trial types, λ. To estimate a particular difference (such as the difference in intercepts) between software packages, the model estimated a single subject-level difference parameter, ω, and then half of that parameter value was added to jsPsych trials and half of the parameter value was subtracted from PTB trials. We used this approach because we could then apply a group-level distribution to the difference parameter itself. In addition to this parameter mapping nicely onto our main analysis question (what are the differences between the software packages?), this particular implementation of the model also created shrinkage on the difference parameter directly, improving the estimate of the parameter in the presence of noise. We applied this strategy of directly estimating the difference parameter to both the trial type and software package differences.
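The half-difference coding described above can be sketched as follows. The function and argument names are hypothetical, the numerical values are illustrative only, and the sign convention for trial type (half added to target-absent trials) is our assumption; the text specifies the direction only for the software packages.

```python
def condition_value(baseline, software_diff, trial_type_diff, software, trial_type):
    """Combine a baseline with half-difference offsets for one condition."""
    # Half the difference is added for jsPsych and subtracted for PTB (per the text).
    software_sign = {"jsPsych": 0.5, "PTB": -0.5}[software]
    # Assumed direction: half added for target-absent, subtracted for target-present.
    trial_sign = {"absent": 0.5, "present": -0.5}[trial_type]
    return baseline + software_sign * software_diff + trial_sign * trial_type_diff


# Hypothetical subject-level values in ms; not estimates from the study.
baseline, software_diff, trial_type_diff = 500.0, 25.0, 66.5
js_absent = condition_value(baseline, software_diff, trial_type_diff, "jsPsych", "absent")
ptb_absent = condition_value(baseline, software_diff, trial_type_diff, "PTB", "absent")
```

With this coding, the difference between the two software conditions equals the difference parameter itself, and the baseline is the average over all four conditions, which is why a group-level distribution can be placed directly on the difference.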
The subject-level parameters were modeled as coming from higher-level group distributions. These group-level parameters that describe the distribution of subject-level parameters were the main parameters of interest for our analysis.
The group-level parameters had diffuse priors appropriate to the scale of the data.
All deflection parameters had the same priors.
Keywords
- Psychophysics Toolbox
- Response times
- Visual search
- Online experiments