Internet-based language production research with overt articulation: Proof of concept, challenges, and practical advice

Vogt, Anne; Hauber, Roger; Kuhlen, Anna K.; Rahman, Rasha Abdel

doi:10.3758/s13428-021-01686-3

Internet-based language production research with overt articulation: Proof of concept, challenges, and practical advice

Open access
Published: 19 November 2021

Volume 54, pages 1954–1975, (2022)
Cite this article

Download PDF

You have full access to this open access article

Behavior Research Methods Aims and scope Submit manuscript

Internet-based language production research with overt articulation: Proof of concept, challenges, and practical advice

Download PDF

Anne Vogt ORCID: orcid.org/0000-0001-6024-8516^1,2^na1,
Roger Hauber¹^na1,
Anna K. Kuhlen¹ &
…
Rasha Abdel Rahman^1,2

We’re sorry, something doesn't seem to be working properly.

Please try refreshing the page. If that doesn't work, please contact support so we can address the problem.

Abstract

Language production experiments with overt articulation have thus far only scarcely been conducted online, mostly due to technical difficulties related to measuring voice onset latencies. Especially the poor audiovisual synchrony in web experiments (Bridges et al. 2020) is a challenge to time-locking stimuli and participants’ spoken responses. We tested the viability of conducting language production experiments with overt articulation in online settings using the picture–word interference paradigm – a classic task in language production research. In three pre-registered experiments (N = 48 each), participants named object pictures while ignoring visually superimposed distractor words. We implemented a custom voice recording option in two different web experiment builders and recorded naming responses in audio files. From these stimulus-locked audio files, we extracted voice onset latencies offline. In a control task, participants classified the last letter of a picture name as a vowel or consonant via button-press, a task that shows comparable semantic interference effects. We expected slower responses when picture and distractor word were semantically related compared to unrelated, independently of task. This semantic interference effect is robust, but relatively small. It should therefore crucially depend on precise timing. We replicated this effect in an online setting, both for button-press and overt naming responses, providing a proof of concept that naming latency – a key dependent variable in language production research – can be reliably measured in online experiments. We discuss challenges for online language production research and suggestions of how to overcome them. The scripts for the online implementation are made available.

Web-based language production experiments: Semantic interference assessment is robust for spoken and typed response modalities

Article Open access 04 April 2022

DOLD: a digital platform for conducting online language experiments and surveys

Article 03 July 2024

Timed written picture naming in 14 European languages

Article Open access 24 May 2017

Use our pre-submission checklist

Avoid common mistakes on your manuscript.

Introduction

Reasons for conducting online language production experiments

Many psychological experiments based on behavioral measures can be run online. This brings great advantages in comparison to lab-based testing. For example, online experiments facilitate testing larger samples, promoting science to a larger community, and potentially consuming less resources during data collection (e.g., Grootswagers, 2020). This greater efficiency in data collection has led to the replication and extension of many behavioral paradigms in online settings. Even experiments which require precise measures of reaction times can reliably be conducted on the web (e.g., Anwyl-Irvine, Dalmaijer, Hodges, & Evershed, 2020a; Anwyl-Irvine, Massonnié, Flitton, Kirkham, & Evershed, 2020b; de Leeuw, 2015; Gallant & Libben, 2019; Hilbig, 2016; Pinet et al., 2017). Furthermore, there is an increasing awareness for the need to test more diverse populations in order to raise the external validity of experimental findings (Speed et al., 2018). Web-based testing is one of the options to ensure that our understanding of the human mind extends to the population at large. Most recently, the popularity of web-based testing has been gaining additional momentum as the COVID-19 pandemic forces researchers to think of alternatives to lab-based testing (Sauter, Draschkow, & Mack, 2020).

Within psycholinguistics, language production research is so far underrepresented when it comes to online-based testing. To the best of our knowledge, typical language production tasks such as picture naming, during which participants’ overt articulatory responses are acquired in order to determine voice onset latencies, have so far not been investigated in online settings. In this study, we provide a proof of principle for deriving voice onset latencies from recordings of overt articulatory responses time-locked to pictorial stimuli. To this end, we implemented the picture–word interference (PWI) task, a classic paradigm to investigate lexical access during language production (for a recent review see Bürki et al., 2020), in an online version. We demonstrate the viability of this approach by comparing voice onsets computed from short audio recordings of overt naming responses with a manual classification task providing response times depending on manual keyboard button-press responses. In a direct comparison, we show similar interference effects typically observed in lab-based picture–word interference studies for both overt picture naming responses and for manual button-press classifications of the picture names. Furthermore, we provide practical advice on moving language production research online.

Current challenges when running language production studies online

There are several challenges to conducting online language production research relying on overt naming responses. First, in the lab, the technical equipment (e.g., microphones, sound shielded booths) ensures a high quality of the acquired speech data and technical requirements are kept constant within and across participants. In an online study, participants need a microphone and need to explicitly grant access in order to record speech as a dependent variable. This approval process can disrupt the experimental procedures at different time points – depending on the individual browser’s security settings. Furthermore, recording quality may differ widely between participants due to technical reasons or background noise. Second, in the lab, the experimenter would typically monitor if the participant is complying with the instructions, e.g., naming the pictures presented on the screen in a correct manner. However, there is no easy way to monitor the performance of participants’ online verbal responses. In an online language production experiment, recorded verbal responses can only be checked after completion of the experiment. Third, lab-based experiments have the option of using voice keys, special hardware devices for automatically detecting the onset of vocal responses. Alternatively, or in addition, in the lab vocal responses can be recorded as audio files, which are scanned for the start of speech in a subsequent step after the experiment. Voice onset latencies are a key dependent variable in many language production experiments and result as the latency between the onset of a picture presentation and the onset of the naming response. However, none of the available tools for running online experiments offers the possibility to determine voice onset latencies instantly and directly log them, rendering online language production research potentially more laborious after data acquisition.

Lastly, and perhaps the most serious challenge, is to precisely timelock vocal responses to certain events, e.g., the visual presentation of a picture on a screen, with little or no variation within and between participants. In the lab, experimenters use specific hardware and software to control for audiovisual synchrony, ensuring that the timing between stimulus presentation and voice recording is reliable. Unfortunately, to date, none of the packages or programs for running online experiments offers an option for recording overt articulatory responses precisely time-locked to a (visual) stimulus. Only very recently has there been some development in this area resulting in beta versions of experimental software with audio recording possibilities and there seems to be a lively development process surrounding these versions (e.g., see the Gorilla Audio Recording Zone, or the jspsych-image-audio-response-plugin by Gilbert, 2020). However, none of these versions can, to date, ensure the high audiovisual synchrony which would be needed in order to test the typically investigated effects in language production research, which sometimes rely on mean voice onset differences in the range of a few milliseconds. To date, a published validation of their timing properties is still missing.

A recent meta-analysis compares a range of experiment builders concerning the reliability of synchronous presentation of visual and auditory stimuli, both lab-based and online, and testing different operation systems and browsers (Bridges, Pitiot, MacAskill, & Peirce, 2020; Reimers & Stewart, 2016). These studies demonstrate that the lag between visual and auditory onset varies considerably. As audio output timing, i.e., the start of audio recordings, relies on specific hardware and software properties as well, it can be assumed that it is likewise difficult to control for a precisely stimulus-locked onset of an audio recording in online settings where a large number of participants with different computer and browser configurations take part. Bridges et al. (2020) conclude that for online experiments, none of the tested packages can guarantee reliable audiovisual synchrony. They argue that JavaScript technology, which is the basis for most web-based experiments, would need to be improved in order to obtain precisely timed audio measures in the millisecond range. However, precisely timelocking overt articulation to pictorial stimuli might be an essential prerequisite to ensure replicability of language production paradigms in an online setting (Plant, 2016).

For the current study, as a proof of concept, we adopted a pragmatic approach to these challenges. As Bridges et al. (2020) point out, solving the problem of poor audio-visual synchrony in online software technology is not a task for the scientific user but rather for the community of software developers working with JavaScript. In the meantime, however, we aim for a “good-enough” approach, i.e., answering the question if the current methods are reliable enough to detect mean differences in classic paradigms and to replicate classic effects. Even for measurements made with instruments with poor resolution, mean differences can be successfully detected when aggregating over a large enough sample size (Ulrich & Giray, 1989). Therefore, given a sufficiently large number of trials and participants, one might be able to detect differences in naming latencies recorded in online settings in spite of the multiple challenges and limitations of web technology (Brand & Bradley, 2012). The same approach applies, to some extent, more broadly to all experimental research conducted online: while the data collected in these settings will invariably show some increased noise due to lack of experimental control and limitations introduced by variable personal hardware (e.g., laptop keyboards), effects should still be, and indeed are, detectable with sufficient power (Mathot & March, 2021; Pinet et al., 2017).

Testing an online implementation of the picture–word interference paradigm with verbal and manual responses

Given the many advantages of online experiments, the aim of the present study is to test whether a robust and well-replicated, but relatively small effect in language production can be replicated in an online implementation of the task including overt naming responses. In the picture–word interference paradigm, participants name pictures while ignoring simultaneously presented distractor words. Naming latencies in this paradigm depend on the semantic relation between the distractor and the target picture, with increased naming times for semantically related versus unrelated distractor words (e.g., Lupker, 1979; Schriefers et al., 1990). This well-replicated semantic interference effect has been interpreted as a marker for the cognitive processes underlying lexical access (Bürki et al., 2020). Analyzing results from 162 studies with a Bayesian meta-analysis, Bürki et al. (2020) demonstrated that the semantic interference effect amounts to 21 ms with a 95% credible interval ranging from 18 to 24 ms. Therefore, replicating this small but robust effect in an online setting would demonstrate the viability of running time sensitive language production experiments online.

Crucially, semantic interference has not only been demonstrated for overt naming responses. A comparable effect has also been observed for a manual button-press classification task in which participants are asked to identify the last letter of the picture name as a vowel or a consonant by pressing one of two response buttons (Abdel Rahman & Aristei, 2010; Hutson, Damian, & Spalek, 2013; Tufft & Richardson, 2020). The similarity of semantic interference in vocal and manual naming responses allows us to directly compare these effects in an online implementation of the paradigm. While manual responses can more readily be implemented and recorded in online settings, we can compare the effects in this response modality directly to the recording of overt naming responses.

To this end, we adopted the design of the study by Abdel Rahman and Aristei (2010) in which participants received both versions of the task with the experimental manipulation of semantic relatedness as a within-subject factor. In this way, manual button response times can serve as a benchmark for a potential effect in the vocal onsets in audio responses. For naming latencies, we recorded audio files for each naming trial, time locked to the onset of picture presentation. Naming latencies were then computed offline. To enable online voice recording, we customized freely available tools for audio recordings on the web and included them in two different experiment builder programs. It was our primary goal to find a working solution for running language production online. Therefore, we conducted the same experiment in two different implementations to increase chances for finding a working solution. The first version was programmed and hosted in SoSciSurvey (Leiner, 2019), a platform used for conducting social and behavioral research in Germany in combination with an audio recording function based on RecordRTC (Khan, 2020). The second implementation was programmed using jsPsych (de Leeuw, 2015) with a custom audio recording plugin relying on Recorderjs (Diamond, 2016) and hosted on JATOS (Lange et al., 2015).The methods and predictions of this work have been preregistered on AsPredicted.com (AsPredicted#: 43871, available for viewing under https://aspredicted.org/blind.php?x=6ma52w). Data and analysis scripts are available at OSF (https://osf.io/uh5vr/?view_only=229679aa33604aa2a5cb400eab62099). The comparison between the two implementations is an exploratory analysis which was not preregistered.

Experiment 1

Methods

One version of the experiment was implemented in SoSciSurvey (Leiner, 2019) and the other version was implemented in jsPsych (de Leeuw, 2015). The two versions of the experiment (labeled SoSciSurvey and jsPsych1, respectively) were nearly identical regarding design and procedure. If not specified otherwise the information applies to both versions.

Participants

In the SoSciSurvey version, a total of 116 native German speakers between 18 and 35 years were recruited over the commercial platform Prolific (www.prolific.co.uk) and completed the experiment. They were included in the final sample when meeting all inclusion criteria until the final sample consisted of 48 participants as determined in the preregistration (21 females, 18–33 years, M_age = 25.71, SD_age = 4.28) (see also the section on Data Exclusion in Data Analysis for details on the criteria for inclusion in the final sample). The sample size was determined via an a priori power analysis using the simr package (Green & MacLeod, 2016). Simr uses simulation to estimate power, by simulating data for which the user can define the parameter estimates. We estimated the power for the overt naming task but needed to rely on estimates from a study employing the PWI paradigm (Lorenz et al., 2018) that used LMMs to analyze their data, which Abdel Rahman and Aristei (2010) did not. The resulting suggested sample size for a power estimate of 80% was 36, but we anticipated a need for more power in online studies and had decided a priori to increase the estimated sample size by one-third, thus amounting to the sample size of n = 48.

In the jsPsych1 version, a total of 108 native German speakers between 18 and 35 years were recruited over the commercial platform Prolific (www.prolific.co.uk) and completed the experiment. They were included in the final sample when meeting all inclusion criteria until the final sample consisted of 48 participants as determined in the preregistration (24 females, 18–33 years, M_age = 26.06, SD_age = 3.99).^{Footnote 1}

Participants provided informed consent to their participation in the study. The study was conducted on the basis of the principles expressed in the Declaration of Helsinki and was approved by the local Ethics Committee. Participants received monetary compensation distributed via the platform Prolific.

Materials

The stimulus set consisted of 40 black-and-white line drawings of common objects, all of which have frequently been used in lab-based picture-naming studies in our group. Half of the German words for these objects ended in a vowel, the other half in a consonant. For the related condition, each drawing was assigned a semantically related distractor word that was not part of the response set. For the unrelated condition, the same distractor words were reassigned to different drawings to which they were not semantically related. In both conditions, half of the assigned distractors matched the name of the drawing with regards to the type of the last letter (vowel vs. consonant) and the other half did not. Thus, response compatibility of the distractor and the target regarding the classification of the last letter was balanced across the stimulus set. In the case of a compatible match, the last letters were never identical, but only matched concerning the type of letter, i.e., vowel or consonant. The line drawings were presented together with the visual distractor words that were superimposed without obscuring the visibility of the object. See online Supplementary Material A for a table of the stimulus set (https://osf.io/uh5vr/?view_only=229679aa33604aa2a5cb400eab62099).

Design

The experiment consisted of a 2 x 2 design with the within-subject factors task (button-press vs. overt naming) and relatedness (related vs. unrelated distractors). The dependent measure was response latency (of the button-press and the overt naming, respectively). The order of the tasks (button-press – overt naming vs. overt naming – button-press) and the assignment of buttons to responses (p for vowel, q for consonant vs. p for consonant, q for vowel) was counterbalanced across participants.

Procedure

At the start of the experiment, participants were given general instructions and then a preview of all 40 drawings with the corresponding names (but without distractors) to familiarize participants with the stimulus set. Instructions for the first task were then presented including a catch trial to ensure participants read the instructions carefully, followed by four practice trials. Each main task consisted of 80 trials showing all 80 stimuli in random order. Each trial started with a 500-ms fixation cross at the center of the screen. The stimulus, a line drawing with a superimposed distractor word, then appeared at the center of the screen (200 x 200 pixels) for a total of 2000 ms, followed by a blank screen for another 1000 ms before the next fixation cross appeared. In the button-press task response, labels (e.g., “Q consonant”, “P vowel”) were shown below the stimulus to the left and right, respectively. Once a button was pressed, the corresponding response label was highlighted but the stimulus remained on screen for the full duration of the trial. In the overt naming task, audio files were recorded for each trial starting at stimulus onset, producing 80 recordings with a predefined duration of 3000 ms for each participant. There was a short break after the first task before the instructions of the second part were presented. When both tasks were completed, participants received debriefing information and were then linked back to the website of Prolific in order to validate their participation.

Technical Implementation of audio recording

The technical implementations for both experimental platforms relied on JavaScript. JavaScript is a programming language that forms, together with HTML and CSS, the core technology of the Internet. Importantly, all modern browsers rely on JavaScript and therefore no prior installation of the language itself is necessary neither on the programmers’ nor the users’ side. The implementations in our study build on APIs (application programming interfaces), which can be thought of as ready-made tool sets allowing for certain functionalities to be used. The access to participants’ microphones and the streaming of their voice input is realized via such APIs in both implementations.

The experimental platform SoSciSurvey (Leiner, 2019) is based mainly on the programming language PHP, but JavaScript code can be implemented in the functionalities provided by SoSciSurvey, as we did in our study. For the implementation of the audio recording, we included a JavaScript-based function within each audio trial. This function captures the participants’ audio input, presents the visual stimulus and starts an audio recording at the same time. The audio input is then saved in the browser’s native file format (e.g., .ogg or .webm) and transferred to the SoSciSurvey server. The JavaScript plugin RecordRTC.js, which we used for this purpose, is provided by Khan (2020) who provides and actively maintains a wide range of readymade JavaScript applications under the open WebRTC (web real-time communication) standard.

In our jsPsych version, the functionality of the experiment timeline relies on the experiment library jsPsych (de Leeuw, 2015), while the data are saved via a server specified by the experimenter, in our case a server based at our institute set up with JATOS (Lange, Kühn, & Filevich, 2015). For the technical implementation of audio recording within jsPsych, access to participants’ microphones is granted only once at the beginning of the overt naming part and remains permanently active during the overt naming task. The recordings are started within each trial upon stimulus presentation. Then files are immediately saved in wav format and transferred to the server. Unlike the SoSciSurvey implementation, the recording in jsPsych relies on the JavaScript plugin recorder.js provided by Diamond (2016), which we used to customize a jsPsych plugin to enable audio recording. Note that even though the functionality of recorder.js builds the base of RecordRTC.js (as implemented in the SoSciSurvey implementation described above), and also of other audio recording plugins, it is not actively maintained and therefore might not be working in the future, e.g., if browser standards change.

The main difference with regard to the implementation of the audio recording is that our custom jsPsych plugin uses a Web Worker API during the recording and saving of audio files. Web Workers allow to run tasks in the background without interfering with the user’s interface. This should ensure, for example, that the next trial can start as predefined even if the audio file from the previous trial has not yet been transferred to the server. For anyone interested in more details of the technical background, we recommend consulting the MDN Web Docs site as a starting point, as this site provides information about Open Web technologies including JavaScript, HTML, CSS, and APIs (https://developer.mozilla.org/de/).

Data analysis

Data preprocessing

For the SoSciSurvey data, the recordings were first converted to wav format from the browser’s default compressed recording format. For the jsPsych data, the recordings were already in wav format. To extract the naming latency from the audio recordings in the overt naming task, all audio files were then processed with the tool Chronset (Roux et al., 2017), which is an automated tool for the detection of speech onsets from audio files. Afterwards, the audio files were manually checked using the software Praat (Boersma & Weenink, 2020) and a custom script (van Scherpenberg et al., 2020) to ensure that participants were producing the correct target word and to manually correct the determined speech onset where necessary.

Data exclusion

Replacement of participants due to prescreening of data

Participants were excluded and replaced in the dataset if more than 20 % of the trials were incomplete or marked as deficient. Trials were marked as deficient if (1) participants produced an error (wrong picture name or wrong letter classification ), (2) the audio files of the naming response did not contain any sound, or (3) if there were other technical difficulties concerning the audio files. These difficulties included excessive background noise, an extremely low audio signal, or irregular lengths of the audio file within a participant. In the SoSciSurvey version of the experiment, irregular file lengths were so pervasive that we did not consider it practical to replace participants on these grounds in this version. See the Discussion for details on this issue. See also Fig. 1 for an overview of the exclusion of participants due to the prescreening criteria.

Data exclusion of single trials

In the data of the final 48 participants in both versions of the experiment, single trials were excluded if no response was given, participants made an erroneous response, or if participants responded prematurely (i.e., reaction times under 200 ms). See Table 1 for an overview of the data exclusion of single trials.

Table 1 Data loss caused by preprocessing the final samples of n = 48 in % of total data in Experiment 1 (SoSciSurvey and jsPsych1) and Experiment 2 (jsPsych2). Trials were excluded from analysis if participants did not press a button in the binary button-press classification task (button-press task – no reaction), classified the last letter incorrectly (button-press task – error), did not produce an object name in the naming task (naming task – no reaction), did not produce the correct target word in the naming task (naming task – error), or if a voice onset of less than 200 ms was registered (naming task – early response)

Full size table

Data transformation and selection of linear mixed effect models

To approximate a normal distribution of the residuals of the dependent variable, the Box–Cox power transformation procedure was applied to the response latency data (Box & Cox, 1964). The specific transformation that was performed is noted in the respective section of the Results.

For analysis of the response latency data, we used the packages lme4(Bates, Mächler, et al., 2015b) and lmerTest (Kuznetsova et al., 2017) in the statistical software R (R Core Team, 2019; Version: 3.6.1) to fit linear mixed effect models (LMMs) of the (transformed) response latency with the fixed effect predictors task (button-press vs. overt naming task), relatedness (related vs. unrelated), as well as their interaction. Both predictors were coded as sum contrasts (button-press – overt naming and related – unrelated, respectively). In order to examine more closely the effect of relatedness in the two tasks, nested LMMs were also fitted, in which the fixed effect of relatedness was estimated separately for the two levels of the factor task. The additional factors repetition (whether a picture was seen for the first or second time within a task) and task-order (button-press—naming vs. naming—button-press) were included as separate fixed effects (both contrast-coded). If their inclusion led to an increase in model fit, as indicated by a likelihood ratio test, they remained in the final model.

In specifying the structure of the models’ random effects, we followed the procedure outlined by Bates and colleagues (2015a). Initially, a full model with the complete variance-covariance matrix of the random effects allowed for by the design (i.e., random effects by subject and by picture) was fitted. This model was then simplified by first forcing the correlation parameters between the random effects to zero, then identifying overfitting of the parameters in the random effects using principal component analysis and dropping those random effects that contributed the least to the cumulative proportion of variance as identified by the principal component analysis until dropping a random effect led to a reduction in the goodness of fit. Correlation parameters between random effects were then reintroduced and kept in the final model if their re-inclusion led to an increase in the model fit and did not lead to non-convergence of the model. Models reported in the Results section are always final, reduced models.

Results

In the SoSciSurvey data, the mean response latency in the button-press task was 1161 ms (SE = 7 ms) in the related condition and 1143 ms (SE = 7 ms) in the unrelated condition. In the overt naming task, the mean response latency was 875 ms (SE = 7 ms) in the related condition and 859 ms (SE = 6 ms) in the unrelated condition. In the jsPsych data, the mean response latency in the button-press task, was 1162 ms (SE = 7 ms) in the related condition and 1147 ms (SE = 6 ms) in the unrelated condition. In the overt naming task, the mean response latency was 1003 ms (SE = 6 ms) in the related condition and 989 ms (SE = 5 ms) in the unrelated condition. See Fig. 2 for line plots of mean response latency by task and relatedness for both versions of the experiment, Fig. 3 for line plots of mean response latency by task, relatedness and repetition for all online experiments and Fig. 4 for raincloud plots of the single trial data and their distribution by task and relatedness for all experiments as well as for the previous study by Abdel Rahman and Aristei (2010). Note that overall response latencies are slower in the online experiments compared to the lab. We discuss possible reasons for this in the General discussion section.

Preregistered analysis

SoSciSurvey

For the SoSciSurvey data, the box-cox procedure suggested a log-transformation of the response latency variable. In the final model (containing main effects of task, relatedness, and their interaction) both main effects were significant, while the interaction task*relatedness was not significant. The positive sign of the estimate for task and the coding of the task contrast show that response latency was slower in the button-press task compared to the naming task. Likewise, for relatedness the response latency was slower in the related than in the unrelated condition. In the final nested model (i.e. estimating separate fixed effects of relatedness, for the two levels of task), the nested effect of relatedness was marginally significant in the button-press task and did not reach significance in the overt naming task. See Table 2 for an overview of the reported models from the preregistered analysis including model formula, coefficients and random effect variance parameters.

Table 2 Table of final models from the preregistered analysis of the SoSciSurvey version of Experiment 1 (SoSciSurvey and jsPsych1). Indexing of estimate column denotes which transformation was applied to the dependent variable. *** = p < .001; ** = p < .01: * = p < .05

Full size table

jsPsych1

For jsPsych1, the Box–Cox procedure suggested transforming the response latency variable by raising to the power of – 0.5 (i.e., 1 divided by the square root of the variable). This type of transformation reverses the sign of a model’s parameter estimates compared to log-transformed or untransformed data. We therefore transformed using – 1 in the nominator (– 1/square root of the variable) to maintain the same sign as in the other (log-transformed) models. In the final model, both main effects task and relatedness were significant, while their interaction was not significant. In the final nested model, the nested effect of relatedness was significant in the button-press task and marginally significant in the overt naming task. See Table 3 for an overview of the reported models from the preregistered analysis including model formula, coefficients and random effect variance parameters.

Table 3 Table of final models from the preregistered analysis of the jsPsych1 version of Experiment 1 (SoSciSurvey and jsPsych1). Indexing of estimate column denotes which transformation was applied to the dependent variable. *** = p < .001; ** = p < .01: * = p < .05

Full size table

Exploratory analyses

While in both versions of the experiment the relatedness effect was significant in the models containing the main effect of relatedness across tasks and did not interact with the factor task, relatedness did not reach significance when examining the effect separately for the overt naming task. It is plausible to assume that the overt naming data from an online environment would suffer from an increased level of noisiness and this is also evident in the longer tails of the response time data when comparing the single trial data from Experiment 1 to the data from Abdel Rahman and Aristei (2010), see Fig. 4. Therefore, we performed outlier correction employing an approach specifically tailored to LMMs suggested by Baayen and Milin (2010). This approach relies on model criticism after model fitting rather than a priori screening for extreme values, by removing those data points with absolute standardized residuals that exceed 2.5 standard deviations. Baayen and Milin demonstrated that this approach proves more conservative (i.e., excludes fewer data points) compared to more traditional approaches to outlier correction. For the nested models, this led to the exclusion of 2.39% of trials for the prescreened SoSciSurvey data and 1.75% of trials for the prescreened jsPsych1 data.

Refitting the final nested model with the outlier corrected SoSciSurvey data, the nested effect of relatedness was significant in the button-press task but not significant in the overt naming task. In the refitted final model of the outlier corrected jsPsych1 data, the nested effect of relatedness was significant in the button-press task and in the overt naming task. See Table 4 for an overview of the reported models from the preregistered analysis including model formula, coefficients, and random effect variance parameters.

Table 4 Table of final models from the exploratory analysis of outlier corrected data from Experiment 1 (SoSciSurvey and jsPsych1). Indexing of estimate column denotes which transformation was applied to the dependent variable. *** = p < .001; ** = p < .01: * = p < .05

Full size table

Discussion

The results of the two versions of the first experiment were promising regarding the demonstration of the semantic interference effect in an online setting.For both versions, we found a significant effect of relatedness, with slower response latencies when a picture was accompanied by a semantically related written distractor word compared to an unrelated distractor, replicating the classic picture–word interference effect observed in lab settings. As we did not find an interaction of the effect of relatedness with the factor task, this effect appears to be independent of the task.

As the particular focus of the current study was demonstrating that voice onset latencies could be collected in online settings, we examined the two tasks separately to investigate the effect of relatedness specifically in the overt naming task. When looking at the effect in both tasks separately via nested models, we found that the effect did not reach significance in the overt naming task when we ran the preregistered analyses without any outlier correction. When we applied outlier correction by using model criticism and ran the models on the corrected data, the relatedness effect in the overt naming task reached significance in the jsPsych1 version, but not in the SoSciSurvey version (where the t value of the estimate actually decreased). An additional model of the overt naming data from both experimental platforms did not yield a significant interaction of the factors platform (SoSciSurvey vs. jsPsych) and relatedness (b = 0.001, t = 0.153, p = 0.878). The absence of this interaction indicates that it is not possible to conclude that the relatedness effect is actually stronger in one platform compared to the other. Even so, the significance of the effect depended on an outlier correction which, while specifically tailored to single trial data in the context of LMMs, we had not planned and therefore not preregistered. It is very likely and plausible that response latency measurements collected via the audio recordings employed in our online experiments suffer from increased random error, i.e. noise, which would decrease the power to find effects.

Presumably, one of the factors contributing to this increased noisiness of the data is the technical implementation of the audio recordings and the reliability of the recordings’ timing relative to stimulus onset. This is reflected, for example, in the issue of the variability of the audio file lengths. Only very few of the participants’ audio files were exactly of the anticipated length of 3000 ms. Presumably, deviations from this file length are due to factors like audio sampling rate, technical variability between the users’ machines, and fluctuation in Internet connection quality. It is beyond the scope and goal of this study to address the technical details of the recording process and to solve the problems underlying the variable file lengths. As long as the file lengths for a single participant were homogeneous, we expected the recording process for that participant to be reliable enough to determine reliable speech onsets. Importantly, in the majority of datasets within the jsPsych1 version, the file lengths were homogeneous within any one participant, and only a few participants (ten out of 59 participants) showed considerable variability of file lengths (i.e., in more than 20% of files). We excluded and replaced these ten participants, as we could not rule out that in shorter audio files the recording started later than programmed and thus we might not be able to infer the correct voice onset timing.

In the SoSciSurvey audio files, the issue was a lot more pervasive. In contrast to the jsPsych1 data set, 31 of 48 participants included in the final dataset of the SoSciSurvey version showed within-subject variability of file lengths in more than 20% of files. The issue was present more frequently than absent, which meant that excluding and replacing these participants would have further escalated the already large number of participants required to be collected before reaching the target sample size. The implementation of audio recordings employed in SoSciSurvey would therefore seem to be more susceptible to the variability within any one participant’s technical setup. If the variability of the audio file length is an indication of the reliability of the timing of the audio recording within the experiment, the measurements in jsPsych are more reliable.

An additional issue concerning the reliability of the timing of audio recordings in both platforms is the overall difference between the response latencies in the overt naming task of the two versions: voice onset latencies were quicker in SoSciSurvey compared to jsPsych1 (M = 867 ms across relatedness for SoSciSurvey vs. M = 996 ms for jsPsych1). We cannot account for this difference, as the two versions of the experiment were nearly identical and therefore interpret it as a technical issue in the audio recording implementation of the SoSciSurvey version. This would align with the finding that the relatedness effect in the jsPsych version could also be found when looking at the overt naming task separately in the nested model after exclusion of outliers with a model criticism procedure.

Nevertheless, even the jsPsych1 version had its difficulties and the effect in the overt naming task was only marginally significant in our preregistered analysis. To assess if the effect in this version was a stable finding or a spurious result, we decided to conduct a second experiment using the same implementation with jsPsych as experiment builder and JATOS as server with several minor adjustments.

One apparent problem with both versions of the first experiment was the high number of data sets which had to be excluded according to our preregistered inclusion criteria. The majority of the excluded participants made too many errors in the button-press task, often erroneously classifying the written distractor words’ last letter instead of the targets’. Adjusting the instructions in the follow-up experiment to be clearer with respect to the button-press task, and providing examples of correct responses to practice trials should improve the error rates and therefore make data collection more efficient. In another adjustment to address the problem of participant exclusion rates, we decided to recruit the participants for the second experiment via the institute’s participant pool instead of Prolific, as these participants may be more accustomed to reaction time experiments similar to the current study.

Furthermore, collecting a second dataset using the jsPsych experiment builder would also allow to pool both datasets in a separate analysis, thereby increasing the power to find an effect.