Measuring sequences of keystrokes with jsPsych: Reliability of response times and interkeystroke intervals
Keywords: Interkeystroke intervals, Online experiment, Motor sequence
Online research via Web interfaces is becoming increasingly important in the field of cognitive psychology (Gosling & Mason, 2015). Collecting large amounts of data over hundreds of participants in a short amount of time holds the promise of overcoming the statistical power limitations of typical laboratory samples (Reips, 2002). However, online experiments imply a trade-off between what is gained by dramatic increases in sample size and better sampling of the whole population, and what is lost to uncontrolled factors such as distracting environments and diversity in equipment configurations. The latter is acutely relevant when the experiments rely on mental chronometry (Posner, 1978).
A number of studies have evaluated the accuracy and reliability of response time measurements performed through various Web-based interfaces. Such empirical evaluations (reviewed in Reimers & Stewart, 2015) include direct comparisons of experimental results from Web and lab implementations (Reimers & Stewart, 2007; Schubert, Murteira, Collins, & Lopes, 2013), attempts to replicate classical experimental effects on response times (RTs) with online measures (Crump, McDonnell, & Gureckis, 2013; Enochson & Culbertson, 2015; Reimers & Maylor, 2005), and measures of the timing performance of Web-based testing setups using specialist software or hardware (Keller, Gunasekharan, Mayo, & Corley, 2009; Reimers & Stewart, 2015; Simcox & Fiez, 2014). In general, these studies have agreed that online RTs are reliable, if slightly overestimated (in the range of tens of milliseconds; de Leeuw & Motz, 2016; Reimers & Stewart, 2007; Schubert et al., 2013).
The studies above have concerned the chronometry of single keypresses, with RTs being defined as the time elapsed between the onset of a stimulus and the unitary response. In contrast, tasks involving rapid sequences of keystrokes have never been investigated with online setups. Sequences of keystrokes are important for researchers interested in behaviors such as typing, musical performance, motor-sequence learning, serial RT tasks, rhythm production, and so forth. This line of research is especially interested in the structure of sequence programming and how temporal and ordinal forms of information are acquired during learning, and it relates to the general problem of serial order in behavior (for a review, see Rhodes, Bullock, Verwey, Averbeck, & Page, 2004). Two dependent variables can be derived from sequences of keystrokes: RTs, defined above, and interkeystroke intervals (IKIs), the time elapsing between two successive keystrokes. There are two broad reasons why IKIs might be more sensitive than RTs to the noise induced by online measurements.
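The two dependent variables defined above can be illustrated with a minimal sketch. It assumes each trial yields a stimulus onset time and a list of keypress timestamps in milliseconds; the function name and the example values are hypothetical and are not taken from the study's code.

```python
from typing import List, Tuple


def rt_and_ikis(stimulus_onset_ms: float,
                key_times_ms: List[float]) -> Tuple[float, List[float]]:
    """Return (RT, IKIs) for one trial.

    RT is the time from stimulus onset to the first keystroke.
    Each IKI is the time between two successive keystrokes.
    """
    if not key_times_ms:
        raise ValueError("a trial needs at least one keystroke")
    rt = key_times_ms[0] - stimulus_onset_ms
    ikis = [later - earlier
            for earlier, later in zip(key_times_ms, key_times_ms[1:])]
    return rt, ikis


# Hypothetical trial: stimulus at 0 ms, three keystrokes.
rt, ikis = rt_and_ikis(0.0, [512.0, 704.0, 880.0])
print(rt, ikis)
```

A three-keystroke sequence therefore yields one RT and two IKIs per trial.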
First, IKIs are typically much shorter than RTs. For example, they can last a few tens of milliseconds for expert participants in typing studies (Rumelhart & Norman, 1982), and less than 200 ms for well-trained participants in serial-RT tasks (Nissen & Bullemer, 1987). A given level of (unavoidable) chronometric imprecision could have a larger impact on the accuracy of such shorter durations than on longer RTs. Standard keyboards are connected through USB ports sampled at a rate of 125 Hz (i.e., every 8 ms). Such quantization could distort small differences that happen to be near that range. Crump et al. (2013) and Reimers and Stewart (2015) previously highlighted the difficulties inherent to shorter timings.
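The effect of 125-Hz polling on short intervals can be sketched with a small simulation. The model below is an assumption made for illustration only: each keypress is taken to be reported at the first 8-ms poll at or after the physical press, and the press onset is taken to be uniformly distributed relative to the polling clock. All function names are hypothetical.

```python
import math
import random

POLL_MS = 8.0  # 125 Hz polling interval, as stated in the text


def reported_time(press_ms: float) -> float:
    """Assumed model: a press is reported at the first poll at or after it."""
    return math.ceil(press_ms / POLL_MS) * POLL_MS


def recorded_iki(first_press_ms: float, true_iki_ms: float) -> float:
    """IKI as seen after polling, for a given onset of the first press."""
    second_press_ms = first_press_ms + true_iki_ms
    return reported_time(second_press_ms) - reported_time(first_press_ms)


def simulate(true_iki_ms: float, n: int = 20000, seed: int = 0):
    """Recorded IKIs for n presses with random onsets (hypothetical data)."""
    rng = random.Random(seed)
    return [recorded_iki(rng.uniform(0.0, 1000.0), true_iki_ms)
            for _ in range(n)]


samples = simulate(150.0)
print(sorted(set(samples)))         # recorded values are multiples of 8 ms
print(sum(samples) / len(samples))  # close to the true 150 ms on average
```

Under these assumptions, a single recorded interval can be off by up to one polling step, whereas the average over many trials stays close to the true value. This is why quantization matters more for fine-grained comparisons of individual short intervals than for condition means.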
Second, we were also concerned that operating system (OS) settings, such as accessibility features or keyboard hotkeys, might potentially interfere in a detectable way with the recordings and the expected pattern of results. OSs interpret successive or concomitant keyboard events according to both automatic and user-based settings (multiple key presses interpreted as one, combinations of key presses triggering a particular event, etc.). The impact of this intermediate layer of software on keyboard chronometry is not known.
There are preliminary indications of how reliably the timing of successive responses can be recorded. Simcox and Fiez (2014) and Keller et al. (2009) used specialized equipment to generate a stream of keystrokes separated by a fixed, known interval. They measured how well this interval was recovered by Web-based software and found good timing accuracy. A limitation of this approach is that the manipulation involved fixed intervals and a single button. A crucial and original feature of the present study is the use of three distinct keys of the keyboard and the generation of variable delays between keys, as occurs in actual experiments involving sequence production.
To assess the accuracy of IKIs measured online, we adopted a twofold strategy. First, we measured the timing accuracy of the jsPsych interface using specialized hardware (Black Box Toolkit) that was modified so that three response switches of the keyboard could be triggered alternately and at variable delays without human intervention. Second, we ran an actual experiment online and performed a complete quality check of the data using descriptive and inferential statistics.
The experiment we designed involved finger movement sequences for which effects on the recorded RT and the IKIs are well-established (conditions 4 vs. 6 of Rosenbaum, Inhoff, & Gordon, 1984, Exp. 3). In a given block, participants produced two pretrained sequences of three consecutive finger responses. Across the two sequences, two of the responses were identical, whereas the third was different, which generates an uncertainty. In the original study, RTs decreased when the uncertain response occurred later in the sequence. In addition, the IKI preceding the uncertain response was lengthened. These effects of uncertainty on sequence programming and execution have been replicated (Rosenbaum, Hindorff, & Munro, 1987).
Hardware assessment of timing accuracy via the Black Box Toolkit
Materials and procedure
To assess the reliability of the RTs generated by multiple consecutive keystrokes, we resorted to specialized hardware, the Black Box Toolkit (BBTK; Black Box ToolKit Ltd, Sheffield, UK), which can automatically generate and record triggers with submillisecond precision. Three mechanical keys (corresponding to the letters S, D, and F) of a Dell standard USB keyboard were wired to the Black Box Toolkit. With this wiring, the BBTK was able to close the three key switches on demand and generate keyboard response sequences. In our tests, the BBTK was programmed to detect a visual stimulus through its opto-detector and then automatically generate a sequence of three responses. To handle the display of visual stimuli and the keyboard response collection, we used a jsPsych procedure similar to that used in the actual experimental task (described in the next section and available online: https://github.com/blri/Online_experiments_jsPsych). This procedure was run on an iMac 27 using the Safari Web browser (version 9.0). Three tests were run, all of which consisted in displaying a white @ character on a black background 40 times, responded to with three keystrokes, thus generating 120 automated keyboard responses. The key identities activated by the BBTK were randomized across and within trials. The programmed RTs in the first test were randomly chosen within the interval [100, 250] ms. In the second test, they were fixed at 150 ms. In the third test, the range was [350, 500] ms. The data are available at the following repository: https://osf.io/r5dfg/.
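The structure of the three tests can be sketched as a schedule generator. This is only an illustration of the design described above, not the BBTK programming interface: the function name is hypothetical, delays are assumed to be whole milliseconds measured from the previous event, and keys are assumed to be drawn independently (the text does not say whether a key could repeat within a trial).

```python
import random

KEYS = ("S", "D", "F")  # the three wired keys named in the text


def make_schedule(n_trials=40, keys_per_trial=3,
                  interval_range_ms=(100, 250), seed=0):
    """Return a list of trials; each trial is a list of (key, delay_ms).

    delay_ms is the programmed delay since the previous event
    (stimulus onset for the first key, previous keystroke otherwise).
    """
    rng = random.Random(seed)
    low, high = interval_range_ms
    schedule = []
    for _ in range(n_trials):
        trial = [(rng.choice(KEYS), rng.randint(low, high))
                 for _ in range(keys_per_trial)]
        schedule.append(trial)
    return schedule


test1 = make_schedule(interval_range_ms=(100, 250))  # variable, short
test2 = make_schedule(interval_range_ms=(150, 150))  # fixed at 150 ms
test3 = make_schedule(interval_range_ms=(350, 500))  # variable, long
print(sum(len(trial) for trial in test1))  # keystrokes in one test
```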
The automatic assessment of sequence keystroke timings generated by the BBTK and recorded by a jsPsych procedure revealed that IKIs are unbiased (mean deviation 0 ms) but largely quantized (at 125 Hz, the USB port sampling rate). We next examined whether a classic effect on human sequence production could be replicated under these recording conditions.
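The notion of an unbiased but quantized measurement can be made concrete with a short sketch that compares programmed and recorded intervals. The function name and the four example values are hypothetical and chosen only to illustrate the computation; they are not data from the BBTK tests.

```python
import statistics


def timing_bias(programmed_ms, recorded_ms):
    """Return (mean deviation, SD of deviations) of recorded minus programmed."""
    if len(programmed_ms) != len(recorded_ms):
        raise ValueError("both lists must have the same length")
    deviations = [r - p for p, r in zip(programmed_ms, recorded_ms)]
    return statistics.mean(deviations), statistics.stdev(deviations)


# Hypothetical values: every recorded interval is a multiple of 8 ms.
programmed = [150, 150, 150, 150]
recorded = [144, 152, 152, 152]
bias, spread = timing_bias(programmed, recorded)
print(bias, spread)
```

A mean deviation near zero indicates accuracy on average, while the spread of the deviations reflects the precision lost to quantization and other noise.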
The jsPsych platform was used to collect data from a large sample of participants. To evaluate how reliably jsPsych measures the timing of sequences of keystrokes, we designed a task involving rapid keystroke sequences. To suit an online format, the task had to be short (less than 20 min) and come with easy-to-understand instructions. Therefore, we aimed to reproduce two classic effects on motor sequence performance with a design adapted from Experiment 3, conditions 4 versus 6, of Rosenbaum, Inhoff, and Gordon (1984). From the original design, we selected only two conditions, to show differences between RTs and IKIs related to the structure of the sequence performed while keeping the duration of the experiment as short as possible. The conditions were the most different in terms of the sequences used (different hands and fingers; see below) and yielded the largest difference in the RT and IKI measures. The original study explored the motor representations used to perform sequences in a choice-reaction-time task design, in which participants had to select a sequence of motor responses to a visual stimulus (X or O). In the original experiment, the sequences had three finger responses that differed by one element placed at either Position 2 or Position 3. The variable element involved both a change of hand and the choice of a nonhomologous finger (e.g., R-Ring to L-Index). Rosenbaum et al. (1984) found that the position of the uncertain response had an effect on RTs, with longer RTs for uncertain responses at Position 2 than at Position 3 (with means of about 460 and 380 ms, respectively; data from Rosenbaum et al., 1984).
In addition, the IKI preceding the uncertain response was found to be longer, as was attested by a significant interaction between the position of the uncertainty and the position of the required response (with mean IKI1 uncertain = 193 ms; mean IKI1 certain = 177 ms; mean IKI2 uncertain = 233 ms; mean IKI2 certain = 163 ms; data from Rosenbaum et al., 1984). The original experiment involved six participants.
Handedness: 471 right, 69 left
Gender: 324 women, 214 men
Age (years): mean = 39.64, median = 39, Q1–Q3 = 30–47, range = 21–69
Affiliation: 228 work at AMU, 21 study at AMU, 19 unrelated, 273 not specified
Web browsers and OS sample characteristics
Stimuli and design
The experiment was divided into two parts, one for each uncertainty condition; each part comprised a training phase and a test phase. Written instructions described the task and introduced the pairing between visual stimuli (X and O) and sequences for each part. Participants were first familiarized with the sequences during the training phase, which ended when each sequence had been correctly performed four times. During training, the plugin was set to a training mode, which could be switched on from the input settings and which provided feedback by changing the color of the stimulus before it disappeared (the initially black symbol became green if the key sequence was correct, red otherwise). During the test phase, one of the two visual stimuli was displayed and stayed on until the participant hit three keys, followed by an interval of 500 ms. The test phase comprised two blocks of 20 trials each, in which ten trials of each sequence were intermixed. After the experiment, participants were asked to answer a few questions (handedness, gender, age, employed or not employed by the university) and had the opportunity to report whether any problem had occurred during the experiment. No monetary compensation was offered.
All data are available at the following repository: https://osf.io/r5dfg/.
To replicate the original study, the data were first assessed via ANOVAs performed on the RTs and IKIs averaged per participant and design cell. Then, mixed linear regressions were used to estimate the actual effect sizes, as well as the effects of additional variables: trial number, gender, age, handedness, OS, and Web browser. These variables were tested as linear predictors, except for age. The relation between age and performance has been reported as nonlinear (Baltes & Lindenberger, 1997), and spline interpolation has been successfully used in cognitive-aging research to approximate the age effect trajectory (Fozard, Vercruyssen, Reynolds, Hancock, & Quilter, 1994). Visual inspection of our data suggested that the effects of age on performance (RTs) could be nonlinear. Therefore, we used restricted cubic splines with three knots, which allowed the effect to be modeled separately in two intervals, without a priori knowledge of the point of separation between these intervals. In the model, we included random intercepts for participants and items (i.e., the different finger sequences). Since the computation of p values in this kind of analysis has been debated (Bates, Mächler, Bolker, & Walker, 2015), we took t values to approximate z values, and considered any value above 1.96 significant.
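A restricted cubic spline with three knots expands age into two regression columns: a linear term and one nonlinear term that is zero below the first knot and linear again beyond the last knot. The sketch below uses the common truncated-power parameterization; the knot positions (30, 39, and 47 years, the quartiles of age reported for this sample) are an assumption made for illustration, since the text does not state where the knots were placed, and the function name is hypothetical.

```python
def rcs_basis_3knots(x, knots):
    """Return the two basis values (linear, nonlinear) for one value x.

    Truncated-power form of a restricted cubic spline with three knots.
    The nonlinear term is scaled by (t3 - t1)**2 so that both columns
    are on a comparable scale.
    """
    t1, t2, t3 = knots
    if not t1 < t2 < t3:
        raise ValueError("knots must be strictly increasing")

    def cube_plus(u):
        # (u)+ cubed: zero for negative arguments
        return u ** 3 if u > 0 else 0.0

    scale = (t3 - t1) ** 2
    nonlinear = (cube_plus(x - t1)
                 - cube_plus(x - t2) * (t3 - t1) / (t3 - t2)
                 + cube_plus(x - t3) * (t2 - t1) / (t3 - t2)) / scale
    return float(x), nonlinear


KNOTS = (30.0, 39.0, 47.0)  # assumed knot positions, in years
for age in (25, 35, 45, 60):
    print(age, rcs_basis_3knots(age, KNOTS))
```

The two columns would then enter the mixed model as ordinary fixed-effect predictors, and each coefficient's t value would be compared with 1.96 as described above.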
To characterize the data reliability and the added value of increasing the number of observations, we calculated the means and confidence intervals of RTs and IKIs over random samples of increasing size, then ran the same regression models on samples of increasing size.
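The resampling idea can be sketched as follows, assuming one mean RT per participant, random samples drawn without replacement, and a normal-approximation 95% confidence interval. The data are synthetic (drawn from a Gaussian with arbitrary parameters), and the function names and sample sizes are hypothetical; the actual analysis may have used a different resampling scheme.

```python
import random
import statistics


def mean_ci(values, z=1.96):
    """Mean and normal-approximation confidence interval of a sample."""
    m = statistics.mean(values)
    se = statistics.stdev(values) / len(values) ** 0.5
    return m, m - z * se, m + z * se


def ci_by_sample_size(participant_means, sizes, seed=0):
    """Draw one random sample per size and summarize it."""
    rng = random.Random(seed)
    return {n: mean_ci(rng.sample(participant_means, n)) for n in sizes}


# Synthetic per-participant mean RTs (hypothetical values, in ms).
data_rng = random.Random(1)
participant_means = [data_rng.gauss(600.0, 100.0) for _ in range(529)]

summary = ci_by_sample_size(participant_means, sizes=(10, 50, 100, 500))
for n, (m, low, high) in summary.items():
    print(n, round(m, 1), round(low, 1), round(high, 1))
```

The width of the interval shrinks roughly with the square root of the sample size, which is the trade-off that the analysis over samples of increasing size is meant to expose.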
Finally, on the basis of the results of the BBTK assessment, we searched for quantization in our data. We focused mainly on the sampling bias that could result from USB keyboards’ sampling rate (125 Hz; i.e., a value being sampled every 8 ms), and used the same methodology as reported above.
Replication of original study
RT and IKI distributions
Twelve out of the sample of 541 participants were excluded because they did not reach 85% accuracy on the task, leaving a final sample of 529 participants. Only correct trials were included in the following analysis. We also excluded trials in which any IKI was equal to zero, since the order of key pressing then could not be determined.
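The exclusion rules above can be sketched as a simple filter. The trial representation (a dictionary with a `correct` flag and a list of IKIs) and the function name are hypothetical; only the 85% accuracy criterion and the zero-IKI rule come from the text.

```python
def filter_participant(trials, min_accuracy=0.85):
    """Apply the exclusion rules to one participant.

    trials: list of dicts with keys "correct" (bool) and "ikis" (list of ms).
    Returns None if the participant is excluded for low accuracy,
    otherwise the list of correct trials without any zero IKI.
    """
    if not trials:
        return None
    accuracy = sum(1 for t in trials if t["correct"]) / len(trials)
    if accuracy < min_accuracy:
        return None
    return [t for t in trials
            if t["correct"] and all(iki != 0 for iki in t["ikis"])]


# Hypothetical participant: 18 of 20 trials correct, one with a zero IKI.
trials = ([{"correct": True, "ikis": [180, 170]} for _ in range(17)]
          + [{"correct": True, "ikis": [0, 170]}]
          + [{"correct": False, "ikis": [200, 150]} for _ in range(2)])
kept = filter_participant(trials)
print(len(kept))
```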
Effects of uncertainty on RT and IKI
For RTs, the design included the following factors: Position of Uncertainty (on 2nd or 3rd keystroke), Type of Sequence (constant or varying), and Hand of the constant sequence (left or right). We did not include any interaction between Hand and the other factors in the design. An ANOVA revealed a main effect of uncertainty [F(1, 528) = 373, p < .001]; RTs were shorter for sequences varying on the 3rd rather than on the 2nd key (M2nd = 677 ms, M3rd = 560 ms). This is in good agreement with the original results of Rosenbaum et al. (1984). The main effect of sequence also reached significance [F(1, 528) = 187, p < .001]: RTs were shorter for constant sequences (Mc = 597 ms, Mv = 640 ms). Finally, the Sequence × Uncertainty interaction was also significant [F(1, 528) = 7.3, p < .01]. The main effect of hand was not significant (F < 1).
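For a within-participant factor with two levels, the F statistic with (1, n - 1) degrees of freedom equals the square of the paired t statistic computed on each participant's two condition means. The sketch below illustrates this equivalence, which is consistent with the F(1, 528) reported for 529 participants. It collapses over the other factors and uses hypothetical values, so it illustrates the test logic rather than the analysis actually run by the authors.

```python
import statistics


def within_f_two_levels(cond_a, cond_b):
    """F statistic and degrees of freedom for a two-level within factor.

    cond_a, cond_b: per-participant means in the two conditions,
    in the same participant order.
    """
    if len(cond_a) != len(cond_b):
        raise ValueError("both conditions need one value per participant")
    diffs = [a - b for a, b in zip(cond_a, cond_b)]
    n = len(diffs)
    t = statistics.mean(diffs) / (statistics.stdev(diffs) / n ** 0.5)
    return t * t, (1, n - 1)


# Hypothetical per-participant mean RTs (ms) for four participants.
uncertain_2nd = [700, 650, 690, 720]
uncertain_3rd = [560, 580, 600, 590]
f_value, df = within_f_two_levels(uncertain_2nd, uncertain_3rd)
print(round(f_value, 2), df)
```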
Effects of participant characteristics on RTs and IKIs
Mixed model regression coefficients for RTs
Sequence × Uncertainty
Age (spline 1)
Age (spline 2)
OS: OS X/Windows
Web Browser: Chrome/Firefox
Web Browser: IE/Firefox
Web Browser: Safari/Firefox
Web Browser: Other/Firefox
Mixed model regression coefficients for IKIs
Sequence × Uncertainty
Sequence × Position
Uncertainty × Position
Sequence × Uncertainty × Position
Age (spline 1)
Age (spline 2)
OS: OS X/Windows
Web browser: Chrome/Firefox
Web browser: IE/Firefox
Web browser: Safari/Firefox
Web browser: Other/Firefox
Summarizing the results on RTs (see Table 4), we obtained the same significant effects as with the original ANOVA, although the Sequence × Uncertainty interaction did not reach significance and presented a small estimate. Importantly, the estimates for the main effects of sequence and uncertainty were the largest. Regarding IKIs (see Table 5), all main effects and interactions reached significance, except the main effect of hand. Our interaction of interest, Position × Uncertainty, presented one of the largest estimates.
Introducing personal characteristics in the model yielded a significant main effect of gender, with male participants being faster than female participants, in terms of both RTs and IKIs. We also observed a slight slowing down of both RTs and IKIs with age, but no significant effect of handedness (Tables 4 and 5). Regarding computer configurations, only the contrast of Chrome against Firefox on RTs approached significance, with responses collected from Chrome being faster (see the supplementary material). This contrast did not reach significance for IKIs.
Having reproduced the result from Rosenbaum et al. (1984), we aimed to further characterize the data collected via the online platform.
Estimation of data quantization
This sampling bias was assessed on the transformed IKI distributions, generated by taking the remainder values from the division by 8. The homogeneity of these distributions was quantified over the whole sample by means of independent chi-square tests for each participant (see the Method section), which revealed that 449 participants (83% of the sample) presented a sampling bias (as indexed by significant chi-square tests, FDR-corrected). This confirms that we should not expect a precision higher than 8 ms on actual IKI measurements. The reason why some of the data from some participants did not show this quantization could not be meaningfully traced to the specific features of their computer configurations that were available to us.
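The remainder-based test can be sketched in two steps: a chi-square statistic for the uniformity of the remainders within each participant, then a Benjamini-Hochberg correction across participants. Rounding IKIs to whole milliseconds before taking the remainder is an assumption, as is the use of the Benjamini-Hochberg procedure for the FDR correction; the function names and example values are hypothetical. Converting the statistic to a p value requires the chi-square distribution with 7 degrees of freedom, which is not included here.

```python
def remainder_chi_square(ikis_ms, step=8):
    """Chi-square statistic testing that IKI remainders mod `step` are uniform."""
    counts = [0] * step
    for iki in ikis_ms:
        counts[int(round(iki)) % step] += 1
    expected = len(ikis_ms) / step
    return sum((c - expected) ** 2 / expected for c in counts)


def benjamini_hochberg(p_values, q=0.05):
    """Return a list of booleans: True where the null is rejected at FDR q."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    cutoff = 0
    for rank, i in enumerate(order, start=1):
        if p_values[i] <= q * rank / m:
            cutoff = rank  # largest rank meeting the step-up criterion
    rejected = [False] * m
    for rank, i in enumerate(order, start=1):
        rejected[i] = rank <= cutoff
    return rejected


# Hypothetical participants: one fully quantized, one with uniform remainders.
quantized = [144, 152, 160] * 16
uniform = list(range(100, 180))
print(remainder_chi_square(quantized), remainder_chi_square(uniform))
print(benjamini_hochberg([0.001, 0.04, 0.03, 0.5]))
```

A large statistic indicates that the remainders cluster on a few values, as expected when recorded intervals fall on multiples of 8 ms.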
To evaluate the effect of sample size on the experimental effects, we followed the same procedure, taking random samples of increasing size from the whole distribution and running the mixed regression model described above on each sample. Figure 6B presents the evolution of the beta estimate for the Uncertainty × Position interaction across samples. It shows that the effects are well estimated from a sample size of 50, because from this value and above, all of the confidence intervals include the mean of the whole distribution.
The finding of a null difference between the programmed and recorded IKIs with the BBTK is a good indication that online testing can be used for assessing differences in timing between the elements of motor sequences performed through the keyboard. It generalizes the findings of previous experiments that used a stream of identical keystrokes separated by fixed intervals in comparable conditions (Keller et al., 2009; Simcox & Fiez, 2014). Here, we measured a sequence of responses to a visual signal, with variable interresponse intervals. In addition, we showed that even for short time intervals, the online system performed with very good timing precision. The standard deviations for the IKIs were very small (around 10 ms), a value that is considered accurate in RT measurements (see Reimers & Stewart, 2007, 2015). The BBTK tests therefore also indicate a good reliability of the measures.
The quantizing of IKIs occurred for all three of the BBTK tests and for the online measurements. A similar phenomenon was reported by Neath, Earle, Hallett, and Surprenant (2011), who tested various configurations of programs, computers, and keyboards. In their study, quantizing occurred only occasionally and seemed to be related to one particular configuration of a given Macintosh computer and a particular type of keyboard, and it was not discussed further. Quantizing of responses was also observed by Reimers and Stewart (2015) on the basis of cumulative frequency distributions. Quantizing occurred more frequently under certain configurations, but the authors could not find any systematic predictor of quantizing in the data. Here, in addition to showing quantizing in the usual way, with cumulative frequency distributions, we used a procedure that allowed us to quantify the step at which the data were sampled. In the BBTK tests, data were “packed” by multiples of 8 ms, a value that corresponds well to the sampling rate of a USB port (125 Hz). This was also the case in data recorded online with actual participants and variable computer settings. A great majority of the sample displayed the same quantizing by steps in multiples of 8 ms. This strongly suggests that quantizing is due to the sampling of the keyboard by the system. Such noncontinuous sampling of the data should be acknowledged, and researchers for whom quantizing matters should be aware of it and choose non-USB input devices. However, a specific study has shown that under typical RT measurement conditions, the variability in human performance outweighs the imprecision in response devices (Damian, 2010).
Our online study on a large sample of participants also replicated the original results of Rosenbaum et al. (1984): The position of the uncertain response had an effect on RTs, with longer RTs for uncertain responses at Position 2 than for those at Position 3. In addition, we found the same interaction between the position of the uncertainty and the position of the required response on IKI measurements. The main effect of uncertainty reported in the original experiment did not reach significance in our sample, whereas the other effects we reported had not been significant in the original study. This discrepancy is probably related to the instability of the effects estimated by Rosenbaum et al. (1984) in the original study. The interaction of interest was replicated later on (Rosenbaum et al., 1987) and can be considered reliable. It should nonetheless be kept in mind that the original effects were reported with a sample of only six participants. With regard to our specific conditions (e.g., the same number of trials as in the original study, but only a subset of the original experimental conditions) and the experimental context (online measurements), we found that a minimum sample size of 50 participants was necessary to provide a relatively good and reliable estimation of the effect of interest. This threshold is nonetheless specific to the present experiment, and clearly is not a recommendation for a minimum sample size in any online experiment, since its value is likely to depend on a number of factors, such as the number of trials and the effect sizes. Our analysis nonetheless provides an illustration of the trade-off between sample size and the precision of estimates of the effects of interest, which depends on the constraints of a given experiment. Systematic tests of this type in methodological or experimental online studies would be useful to get a better overview of the minimally required sample sizes in various contexts.
Our measured RTs were longer than those from Rosenbaum et al.’s (1984) study: This could be due to the time lag (as measured with the BBTK) introduced by the operating system, computer, screen, and keyboard used (see Neath et al., 2011, for an example for the keyboard), as well as by the online configuration of the browser and jsPsych. This finding is in agreement with previous studies that had compared in-lab and Web-based experiments and typically found delays from 25 to 100 ms (Crump et al., 2013; de Leeuw & Motz, 2016; Reimers & Stewart, 2007; Schubert et al., 2013).
Regression analyses also indicated that the material configuration had no measurable effect on IKIs, although an effect of browser was evidenced for RTs. Moreover, our online interface could capture some variability linked to demographic variables: For instance, it allowed for extra assessments of the effects of variables such as age or gender. We found slight influences of age and gender on both RTs and IKIs. Previously, no effects of age have been reported on typing rates (Salthouse, 1984) or on indirect measures of motor-sequence learning (Howard & Howard, 1989). However, a male advantage on motor speeds has been reported (Nicholson & Kimura, 1996). Such effects should be taken cautiously, because predictors such as age, gender, and computer configurations might be correlated, as was suggested by Reimers and Stewart (2015), and might lead to spurious effects of the covariates. Mixed regression analyses then present the advantage of accounting for all predictors and their specific variabilities at the same time.
In conclusion, online measurement using jsPsych appears to be an accurate way to test for fine differences between the IKIs in various conditions. It offers a promising tool for researchers interested in motor-sequence learning and execution.
This study was partly supported by the French Agence Nationale de la Recherche (MEGALEX Grant No. ANR-12-CORP-0001; AMIDEX Grant No. ANR-11-IDEX-0001-02; Brain and Language Research Institute Grant No. ANR-11-LABX-0036; MULTIMEX Grant No. ANR-11-BSH2-0010), the European Research Council (FP7/2007–2013, Grant No. 263575), the Ministère de l’Enseignement et de la Recherche (doctoral MNRT grant to S.P.), and the Fédération de Recherche 3C (Aix-Marseille Université).
The delay could occur between the jsPsych pseudocommand “display stim” (with which the jsPsych initial clock read would occur) and the actual physical display on the screen (detected by the opto-detector; the BBTK initial clock read). This delay is probably due to the computation of the display screen, the communication between graphic electronic components and the physics of LCD screens, and the timing inaccuracies of jsPsych.