Optimizing subject behavior for mouse-tracking
If subjects sometimes make their category decisions before moving their mouse cursors (that is, if they wait to begin moving their cursors until they have already made a decision), then their mouse trajectories may begin too late to capture dynamic category competition (Freeman et al., 2016). For this reason, at the end of each trial in which the subject took more than 700 ms (by default) to begin moving the cursor, the questionnaire issues a “started too late” alert warning the subject to begin moving the cursor sooner on future trials. Additionally, to encourage fast decision-making and to discourage subjects from taking unscheduled breaks from the experiment, after any trial in which the subject takes longer than 5000 ms (by default) to make a category decision, the questionnaire issues a “took too long” alert reminding the subject to answer more quickly (Freeman et al., 2016). Some investigators choose not to limit total response time (e.g., Kieslich & Hilbig, 2014), in which case the parameter maxLatencyTime can simply be set to a very large value, such as 50,000 ms. All alerts are recorded in the dataset at the time they are triggered; however, to avoid disrupting the subject’s behavior during the trial, they are displayed onscreen only after the subject selects a category button and before the subject clicks the Next button to proceed to the next trial. The recorded alert data allow investigators to exclude trials or subjects receiving certain types of alerts if desired. The full text of all alert messages appears in the Online Supplement.
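For instance, such exclusions could be implemented along the following lines. This is a minimal sketch rather than the provided analysis code itself; the logical columns started.too.late and took.too.long are hypothetical names for the recorded alert indicators in a long-format dataset d with one row per trial.

# Illustrative trial-level exclusion based on recorded alerts; the columns
# started.too.late and took.too.long are hypothetical names for the alert
# indicators, and d is assumed to have one row per trial
d.clean <- d[ !d$started.too.late & !d$took.too.long, ]

# Or exclude every subject who ever triggered either alert
alerted <- tapply(d$started.too.late | d$took.too.long, d$subject, any)
d.clean.subj <- d[ !d$subject %in% names(alerted)[alerted], ]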
Special considerations for online use
As mentioned, allowing subjects to complete the experiment on their own devices, rather than in a controlled lab setting, poses several challenges for collecting reliable and precise mouse-tracking data. For example, the software cannot precisely position the subject’s cursor at the start of each trial; browsers deliberately withhold this functionality to preclude malicious misuse. Furthermore, the experiment interface is displayed with the same pixel dimensions for every subject and trial, regardless of the size and resolution of each subject’s screen, potentially yielding interfaces of somewhat different visual sizes for different subjects. Fixing the visual size, rather than the pixel dimensions, of the experiment interface across subjects was not feasible because the survey software does not have reliable access to each subject’s screen size and resolution. Additionally, if a subject attempts to complete the experiment with a browser window smaller than the experiment interface (for example, because the device’s screen is physically too small), then the subject may have to scroll in the middle of each trial, yielding discontinuous mouse trajectories and erroneous reaction times.
As mentioned above, the use of fixed pixel dimensions does not guarantee that the visual distance between the buttons will be the same for every subject, given the many possible combinations of physical monitor dimensions and pixel-per-inch resolutions. In addition, some subjects might use their browser’s zoom function, changing both the pixel distances and the visual distances. Therefore, our R analysis code by default rescales all trajectories to unit length in both the x- and y-dimensions. However, the validation study described in “Validation study” below found that subjects whose trajectories suggested non-standard pixel scaling (due, for example, to zooming) typically showed systematically larger values of the outcome measures. These differences persisted even though the trajectories had been rescaled to unit length. Importantly, despite these mean differences in the outcome measures, the key stimulus ambiguity effects were comparable between subjects with non-standard pixel scaling and subjects with standard pixel scaling. In practice, then, investigators might choose simply to adjust analysis models for covariates indicating whether a subject had non-standard pixel scaling (operationalized as having unexpectedly large or small pixel distances between the starting and ending x-coordinates on any trial) and whether a subject ever had a too-small window; this is the approach we adopt in the validation study. Because the experimental manipulation is randomized, these idiosyncrasies of the visual display size effectively introduce “non-differential” noise in the continuous outcome measures, in which case the estimate of the effect of stimulus ambiguity remains unbiased even without adjustment for the scaling and window-size variables (Rothman, Greenland, & Lash, 2008). Thus, estimates should be comparable across samples with different frequencies of non-standard scaling and too-small windows. However, adjusting for these variables as “precision covariates” may improve statistical power by removing some of the residual variation in the outcome measures that is due to these visual idiosyncrasies rather than to stimulus ambiguity. The provided R code automatically includes these two indicator variables (called weird.scaling and wts, respectively) in the prepared long-format dataset. Alternatively, subjects displaying these idiosyncrasies could simply be excluded.
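For concreteness, the covariate-adjustment approach might be sketched as follows using a mixed model fit with the lme4 package; the outcome xflips and the predictor ambiguous are hypothetical column names, whereas weird.scaling and wts are the indicators described above.

# Sketch of adjusting for the two indicators as precision covariates in a
# mixed model with random subject intercepts; column names other than
# weird.scaling and wts are hypothetical
library(lme4)

m.adj <- lmer(xflips ~ ambiguous + weird.scaling + wts + (1 | subject),
              data = d)
summary(m.adj)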
As an additional data quality concern in online settings, it is sometimes possible for automated “bots” to complete Mechanical Turk tasks, yielding invalid data (Difallah, Demartini, & Cudré-Mauroux, 2012). Because bots do not physically use computer mice or trackpads to proceed through the questionnaire, but rather select buttons directly, they would not provide any mouse trajectories at all for our data collection system to erroneously record. If a bot managed to complete the questionnaire and respond to any alerts in the process, our data preparation script would automatically flag its data for exclusion due to missing trajectories.
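As a sketch of how such cases would surface during data preparation, subjects with no recorded trajectories at all could be flagged as follows; x.coords, a hypothetical column name, holds each trial’s recorded x-coordinates as a string.

# Flag subjects who never produced any trajectory data (bot-like behavior);
# x.coords is a hypothetical name for the raw coordinate string per trial
no.traj <- tapply(d$x.coords, d$subject,
                  function(x) all(is.na(x) | x == ""))
bot.like.ids <- names(no.traj)[no.traj]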
Limitations
Our implementation has limitations. Occasional idiosyncrasies (e.g., extremely poor-quality connections, use of proxy servers) can cause losses of coordinate data for some trials or subjects. Our R code automatically checks for subjects with these data losses and creates a list of subject IDs that should be excluded, along with the reasons for exclusion. The validation study presented below suggested that, when data are collected in an uncontrolled crowdsourcing setting, these issues affect a small fraction of trials for approximately 10% of subjects. A conservative analysis approach, which we adopt in the validation study, is to exclude every subject with data losses on any trial.
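To illustrate the general form of such an exclusion list (the provided R code constructs its own version automatically), affected subjects could be tabulated as follows; lost.coords, a per-trial flag for missing coordinate data, is a hypothetical column name.

# Build an exclusion list pairing each affected subject ID with a reason;
# lost.coords (a per-trial logical flag) is a hypothetical column name
n.lost <- tapply(d$lost.coords, d$subject, sum)
exclusions <- data.frame(subject = names(n.lost)[n.lost > 0],
                         reason = "coordinate data lost on at least one trial")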
Additionally, our implementation cannot control subjects’ individual mouse speed settings. That is, different mice and trackpads may be set to respond with larger or smaller onscreen movements for any given physical movement of the subject’s hand, and these differences in mouse dynamics could affect the confusion measures. Because our implementation collects data through an Internet browser, it is not able to measure subjects’ mouse speeds independently of, for example, their hand speeds. However, like the visual idiosyncrasies produced by non-standard pixel scaling or small browser windows, we would expect differences in mouse speed settings to introduce only non-differential noise in the outcomes and thus not compromise estimation of stimulus ambiguity effects (albeit with some loss of statistical power).
Last, although our implementation appears to perform reliably across common browsers (see “Effect of stimulus ambiguity on mouse trajectories”), it is incompatible with Internet Explorer; subjects running Internet Explorer will be unable to proceed through the questionnaire, and no data will be collected. (At present, Internet Explorer holds only a 3% share of browser usage worldwide; “Browser market share worldwide,” 2018.) Additionally, subjects with very slow Internet connections, causing image stimuli to load slowly, may receive a large number of “started too late” alerts, although their data will otherwise be usable. In practice, subjects with a high frequency of “started too late” alerts could be discarded if this were of concern, as sketched below.
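For example, under a hypothetical rule discarding subjects for whom more than 25% of trials triggered a “started too late” alert:

# Discard subjects with a high frequency of "started too late" alerts; the
# 25% threshold and the column name started.too.late are hypothetical
late.prop <- tapply(d$started.too.late, d$subject, mean)
discard.ids <- names(late.prop)[late.prop > 0.25]
d.kept <- d[ !d$subject %in% discard.ids, ]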