Manually established measurements of the onset and offset times of words in participant recordings are the gold standard for evaluating tools like AlignTool. We compiled a sample of participant recordings obtained in experimental and semi-experimental settings typical of language production research, annotated them manually, and used these annotations to evaluate the accuracy of the measurements generated automatically by AlignTool. Note that the tool is designed in such a way that users can easily inspect its accuracy and correct the measurements where necessary. Hence, its actual accuracy in proper use will be much higher than in a purely automatic analysis. The evaluations reported below can thus serve as an estimate of how exact a purely automatic annotation with AlignTool can be.
Speech corpora I: single and multiple word utterances
We used two corpora to assess AlignTool’s accuracy in establishing the temporal onsets and offsets of words in spoken utterances of one to four words recorded in an experimental setting. By assessing the accuracy of determining the temporal onset of the first word in an utterance, we specifically assessed AlignTool’s accuracy as a voice-key. In doing so, we also compared its performance to the performance of a custom-made hardware voice-key (Hasomed NesuBox 2) and of SayWhen and Chronset, the software-based voice-keys presented by Jansen and Watter (2008) and Roux et al. (2016), respectively, both of which we could apply to our data.
Rastle&Davis corpus (Rastle & Davis, 2002; English)
As outlined in the Introduction, Rastle and Davis (2002) had 24 participants name two groups of 20 words, one beginning with /s/ (simple onset) and one beginning with /sp/ or /st/ (complex onset). The speakers were participants from the University of Cambridge (cf. Rastle & Davis, 2002, p. 309). Data from two participants had to be excluded due to technical problems, yielding a total of 880 critical trials. An additional set of filler words, none of which started with /s/, was recorded but was not considered in the analyses. Onset times were measured manually (for details, see Rastle & Davis, 2002) and using two different types of voice-key. For the purpose of this evaluation, we focus on the manual measurements of participants' speech onset times, which were longer for simple than for complex onsets.
The parameters for analyzing the word onset and offset times in the Rastle&Davis corpus with AlignTool are presented in Table 1. In order to measure the onset times of the utterances with SayWhen, we pre-processed the data so as to bring them into a format suitable for SayWhen. First, we resampled the recordings from 22,050 Hz to 44,100 Hz, using SoX (sox [inputfile].wav -r 44100 [outputfile].wav). Next, we concatenated the trialwise wav files into one wav file in order to simulate a recording of a full experiment. In order for this file to be processed by SayWhen, it also needed to include a 10-ms trial onset signal on the left channel, which had not been part of the original recordings. We incorporated this marker in the concatenation process by including a trial onset signal file (with the trial onset signal on the left channel and silence on the right channel) before each trial recording. In addition, we added 30 ms of silence at the beginning and the end of the concatenated file, as this was necessary for SayWhen to find the first and last trials reliably. The concatenated file thus comprised 30 ms of silence, followed by a series of pairs of trial onset signal files and trial recordings, and a final 30 ms of silence. A new wav file header was added to the concatenated audio file, which was then entered into SayWhen (using default settings). The onset latencies established by SayWhen were saved to a CSV file. In a last processing step, we subtracted 10 ms from all latencies provided by SayWhen, as its measurements started at the beginning of the 10-ms trial onset marker, which had not been part of the original trial recording.
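For illustration, the following Python sketch reproduces this concatenation pipeline. The file names, the 1-kHz marker tone, and the placement of the speech on the right channel are our assumptions for the purpose of the example; they are not a specification of SayWhen's input format beyond what is described above.

```python
# A minimal sketch of the SayWhen pre-processing described in the text.
import subprocess
import numpy as np
import soundfile as sf

SR = 44100  # SayWhen expects 44.1-kHz input

def resample(infile, outfile, rate=SR):
    # Equivalent to: sox infile -r 44100 outfile
    subprocess.run(["sox", infile, "-r", str(rate), outfile], check=True)

# 10-ms trial onset marker: an assumed 1-kHz tone on the left channel,
# silence on the right channel
t = np.arange(int(0.010 * SR)) / SR
marker = np.column_stack([0.8 * np.sin(2 * np.pi * 1000 * t), np.zeros_like(t)])

pad = np.zeros((int(0.030 * SR), 2))   # 30 ms of stereo silence

chunks = [pad]
for trial_file in ["trial001.wav", "trial002.wav"]:  # ... one file per trial
    audio, sr = sf.read(trial_file)
    assert sr == SR
    if audio.ndim == 1:                 # put mono speech on the right channel
        audio = np.column_stack([np.zeros_like(audio), audio])
    chunks += [marker, audio]
chunks.append(pad)

sf.write("session_for_saywhen.wav", np.concatenate(chunks), SR,
         subtype="PCM_16")
# After running SayWhen, subtract 10 ms from each latency to discount
# the marker, as described above.
```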
MTAS corpus (English, German, Dutch)
The MTAS (short for Manually Temporally Annotated Speech) corpus was specifically created for the evaluation of AlignTool: We collected data from 30 native speakers of German, 30 native speakers of Dutch, and 30 native speakers of British English, who all completed an object-naming and a word-reading task. We chose these two tasks as they are frequently used in language production research. The participants were recruited via the participant pools of the Department of Linguistics at Ruhr-University Bochum, the Max Planck Institute for Psycholinguistics in Nijmegen, and the School of Psychology at the University of Birmingham, respectively. We did not record the region of origin of the speakers or their accents.
For the single object-naming task, we selected 100 pictures of objects for each language from the databases provided by Bates et al. (2003) for German and English and by Severens, van Lommel, Ratinckx, and Hartsuiker (2005) for Dutch. These databases include details on the frequency of the object names in the respective languages (cf. Baayen, Piepenbrock, & van Rijn, 1995) and the name agreement associated with the pictures of the objects – the more likely participants are to use the same word(s) for a given object, the higher its name agreement. In selecting the objects, we made sure that their names were frequent words and that the pictures had high name agreement, meaning that in the norming studies, pictures were associated with three different names at most. In addition, we selected the stimuli in such a way that their names covered a variety of different onset phonemes. Occasionally, this required applying a less strict criterion on name agreement. The full list of stimuli is provided in Appendix 2. Participants were asked to name the pictures as fluently and accurately as possible.
In the word-reading task, the same object names were used as in the object-naming task, but they were combined into a total of 50 different four-word combinations, such as tent dog grapes skeleton. Each object name appeared twice across all word combinations. The word combinations were compiled such that successive words ended and began with the same phoneme (grapes – skeleton), with very similar phonemes (frog – clock), or with very dissimilar phonemes (broom – chair), so as to model easy and hard conditions for establishing precisely the onset and offset times of individual words within the utterance. Again, participants were asked to read the object names out loud as fluently and accurately as possible. During testing, we used a custom-made hardware voice-key (Hasomed NesuBox 2) for establishing participants' utterance onset times online, allowing us to evaluate AlignTool against this voice-key. The voice-key emitted a beep signal when it was triggered, allowing the experimenter to note all trials on which the voice-key was triggered too early, too late, or not at all.
Regarding the manner of articulation, the sets included a large number of picture names with fricative onsets (34 for German, 33 for English, and 39 for Dutch) and plosive onsets (39 for German, 35 for English, and 29 for Dutch). Vowels and semivowels were less frequent onsets (17 for German, 20 for English, and 21 for Dutch), and nasals, approximants, and trills were rather rare. This distribution reflects the fact that phonemes vary in how frequently they occur in word onsets. Some onsets are not included in the materials, either because they did not constitute word onsets in any of the three languages or because the name agreement of the words containing them was too low.
For the word-reading task, we assembled the written names of the objects selected for the single object naming into 50 lists of four names. Within each list, we manipulated the similarity of the final phoneme of the first (second, third) word and the first phoneme of the second (third, fourth) word. Below, we refer to these pairs of offsets and onsets of consecutive words as transitions. With 50 four-word lists per language and three transitions per list, there were 150 transitions per language. Across word groups, we identified five categories of transitions, depending on the similarity of the two phonemes at the transition between two words:
The two phonemes differed in both place and manner of articulation.
The two phonemes were similar, i.e., they shared the same manner of articulation but differed in the place of articulation (as in tent – plate) or vice versa (as in plate – swan), or they shared the same place and manner of articulation, but differed in voicing (as in frog – clock).
The two phonemes were identical (as in grapes – skeleton).
The data collection procedure was identical across the three languages. All participants were tested in both the single object-naming and the word-reading task in their respective native language. We asked them to complete the word-reading task first, as we hoped that familiarizing the participants with the object names in this task would increase name agreement in the single object-naming task. Participants received written instructions prior to each task and were given the opportunity to ask questions. Each task was preceded by three practice trials so as to familiarize the participants with the task and the procedure. In the word-reading task, a fixation point was shown for 500 ms, followed by the four words for 6 s and a blank screen for 150 ms. After that, the next trial was initiated. In the single object-naming task, the same trial structure was used, but the object was presented for 2 s and the blank screen between two trials was shown for 750 ms. The apparatus was also parallel across the three languages, featuring standard desktop Pentium computers for controlling the stimulus presentation and 17-in. to 19-in. computer screens for presenting the stimuli using the NESU software (Nijmegen Experiment SetUp).
In the English and Dutch testing settings, participants were seated in a quiet room; the German recordings were made with participants seated in a soundproof booth. The responses of the participants were registered using a Sony ECM-MS907 microphone (German and English) and a Sennheiser ME64 microphone (Dutch), respectively. The signal was fed through an external voice-key (Hasomed NesuBox 2) on to a second computer for recording (German) and an external DAT recorder (Dutch), respectively. For the English recordings, we had planned to use the same DAT recording setup as for Dutch. However, due to technical difficulties, we had to record the utterances with an M-Audio MicroTrack II recorder, which produced recordings of very poor quality. In the end, the recording quality was excellent for the Dutch utterances, average for the German utterances, and very poor for the English data. These unintended differences in recording quality allow us to evaluate how much the three software-based analysis tools, AlignTool, SayWhen, and Chronset, are affected by differences in recording quality.
The parameters for analyzing the word onset and offset times in the MTAS corpus with AlignTool are presented in Table 1. In order to measure the onset times of the utterances with SayWhen, we first split our audio recordings of the experimental sessions into trialwise recordings, based on the trial onset beeps we had recorded during the experiment. These trialwise recordings were treated in the same way as the recordings of the Rastle&Davis corpus, with the exception that no resampling was necessary.
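A rough Python sketch of this splitting step is given below. The beep frequency, the detection threshold, and the minimum spacing between trials are illustrative assumptions that would need to be tuned to the actual recordings; the sketch only conveys the general idea of locating the recorded trial onset beeps and cutting the session file at those points.

```python
# Split a session recording at trial onset beeps (assumed to be 1-kHz tones).
import numpy as np
import soundfile as sf

audio, sr = sf.read("session.wav")
if audio.ndim > 1:
    audio = audio[:, 0]          # assume the beeps sit on the first channel

win = int(0.010 * sr)            # 10-ms analysis window
t = np.arange(win) / sr
probe = np.sin(2 * np.pi * 1000 * t)    # assumed beep frequency

# Short-time correlation with the probe tone, hopped in 10-ms steps
hops = range(0, len(audio) - win, win)
score = np.array([np.abs(np.dot(audio[i:i + win], probe)) for i in hops])
score /= score.max()

onsets = []
last = -sr                        # enforce >= 1 s between detected beeps
for k, s in enumerate(score):
    i = k * win
    if s > 0.5 and i - last >= sr:   # threshold of 0.5 is an assumption
        onsets.append(i)
        last = i

# Cut the session into trialwise files, each running up to the next beep
bounds = onsets + [len(audio)]
for n, (a, b) in enumerate(zip(bounds[:-1], bounds[1:]), start=1):
    sf.write(f"trial{n:03d}.wav", audio[a:b], sr)
```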
Speech corpora II: semi-spontaneous speech
The semi-spontaneous speech data were used to pilot AlignTool’s accuracy in analyzing the temporal structure of semi-spontaneous speech elicited in description tasks.
Display Comparison Task corpus (Sichelschmidt et al., 2010; German)
For German, we used a subset of a corpus of utterances recorded from pairs of speakers engaged in semi-spontaneous dialogues (see Sichelschmidt, Jang, Kösling, Ritter, & Weiß, 2010). Each speaker saw a set of colored objects on the screen but was unable to see the display of their partner. The displays differed in only one detail, and the dialogue partners' task was to describe their displays to each other so as to identify the difference. Speakers were recruited at Bielefeld University (Sichelschmidt et al., 2010). We have no detailed information on the speaker characteristics. In total, we annotated 10,822 words from seven pairs of speakers. The signal-to-noise ratio of the recordings was poor, as they unavoidably included noise generated by the computers and other background noise in the room. The parameters for analyzing the word onset and offset times in the Display Comparison Task corpus with AlignTool are presented in Table 1.
Sjerps&Meyer corpus (Sjerps & Meyer, 2015; Dutch)
For the evaluation in Dutch, we employed data from an experiment using a pseudo-dialogue setting (Sjerps & Meyer, 2015; Experiment 1, Speaking Only task). Participants described the spatial positioning of two pairs of objects, using sentences of the form “put the A above (below) the B and put the C below (above) the D”. Manual annotations of the onset and offset times were available for all nouns in the utterances (A, B, C, and D). Participants were native speakers of Dutch and were recruited from the participant pool of the Max Planck Institute for Psycholinguistics in Nijmegen. We have no detailed information on the speaker characteristics.
We selected correct responses only, yielding a total of 993 utterances with 12,367 words, 3,972 of which had been annotated manually. Originally, we had planned to include a second set of utterances from the Tapping and Speaking task, which required participants to tap rhythmically while speaking. Unfortunately, the tapping noise was clearly audible in the recordings and made it impossible for AlignTool to operate reliably, so we could not include these data. The parameters for analyzing the word onset and offset times in the Sjerps&Meyer corpus with AlignTool are presented in Table 1.
Map Task corpus (Anderson et al., 1991; English)
This corpus includes route descriptions by 64 different speakers, most of whom were Scottish and came from "within a 30 mile radius of the center of Glasgow" (Anderson et al., 1991, p. 361). The recordings include the speech of an instructor and a dialogue partner recorded on separate channels. We converted the .ses (raw) audio files to wav files using the Linux-based SoX utility and transcribed and temporally annotated a total of 13,619 words of the instructor in the recordings of 21 pairs of speakers. We selected those recordings because they included only a few intervals in which the two speakers spoke simultaneously. The parameters for analyzing the word onset and offset times in the Map Task corpus with AlignTool are presented in Table 1.
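Because the .ses files are headerless raw audio, SoX has to be told how to interpret them. The following call (wrapped in Python) shows the general form of such a conversion; the sample rate, encoding, and channel count given here are placeholders that must be matched to the actual recording parameters of the corpus.

```python
# Convert headerless raw audio to wav with SoX, called from Python.
import glob
import subprocess

for ses in glob.glob("*.ses"):
    wav = ses[:-4] + ".wav"
    subprocess.run(
        ["sox",
         # raw input: rate, encoding, bit depth, and channels are placeholders
         "-t", "raw", "-r", "20000", "-e", "signed", "-b", "16", "-c", "2",
         ses, wav],
        check=True)
```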
Manual annotations of word onsets and offsets
We used the AVS audio editor to annotate the words manually. With this tool, it is possible to set markers within an audio file and create a "marker list" in which markers can be added, merged, renamed, replayed, and saved to an XML file. The values stored in the XML file provide the time stamp of a marker multiplied by the sampling rate, i.e., the sample index of the marker. The annotation rules are summarized in Appendix 3.
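To give an idea of how such a marker list can be processed further, the following sketch converts the stored values back into milliseconds. The XML element and attribute names used here are purely illustrative, as they depend on the editor's export format; only the convention stated above (time stamp multiplied by the sampling rate) is taken from the text.

```python
# Read an exported marker list and convert sample indices to milliseconds.
import xml.etree.ElementTree as ET

SR = 44100  # sampling rate of the annotated recording

tree = ET.parse("markers.xml")
for m in tree.getroot().iter("marker"):     # hypothetical element name
    samples = int(m.get("position"))        # hypothetical attribute name
    print(m.get("name"), 1000.0 * samples / SR, "ms")
```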
Two to three weeks into the annotation process, we double-checked the marked onsets and offsets by exchanging the annotated data among the annotators and checking whether they would all have annotated the data as their colleagues had done. This was usually the case. After this, we addressed problems that occurred frequently. For instance, it turned out that word-final plosives like /k/ and /t/ had sometimes been left unmarked. We corrected the data accordingly and also checked whether the first annotator had observed the rule that successive words should always be 1 ms apart. In a second round of corrections, we distributed the data in such a way that they were assigned to annotators who had not previously seen them and had each annotator check 15 % of the data to determine whether he or she would have marked the same word beginnings and endings. Whenever annotators differed by 10 ms or more, they were asked to correct the marker, and if the annotations in a file deviated multiple times, they were asked to check the entire file.
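The two formal checks described here are easy to automate. The sketch below shows one possible implementation: flagging successive words whose boundary markers are less than 1 ms apart (our reading of the spacing rule; see Appendix 3 for the exact convention), and flagging boundaries on which two annotators differ by 10 ms or more. The data structures are illustrative.

```python
# Sanity checks on manual annotations; data structures are illustrative.

def check_spacing(words, min_gap_ms=1.0):
    # words: list of (onset_ms, offset_ms) tuples in utterance order;
    # returns the offset/onset pairs that violate the 1-ms spacing rule
    problems = []
    for (_, off1), (on2, _) in zip(words, words[1:]):
        if on2 - off1 < min_gap_ms:
            problems.append((off1, on2))
    return problems

def flag_disagreements(annotator_a, annotator_b, tol_ms=10.0):
    # annotator_a/b: parallel lists of boundary times in ms;
    # returns indices where the two annotators differ by 10 ms or more
    return [i for i, (a, b) in enumerate(zip(annotator_a, annotator_b))
            if abs(a - b) >= tol_ms]
```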
At the end of the annotation process, each file was assigned to yet another annotator who annotated about 10 % of the data from scratch (see Table 2). These annotations were used to assess the consistency of the annotations across annotators. For the sake of comparability, we restricted the consistency analyses to those data points that were also included in the evaluation of AlignTool (see below). Table 2 presents the average differences between annotators in ms (averaging across differences greater and smaller than 0), as well as the average absolute difference between annotators. It also provides the standard deviation of the measurements provided by each annotator, the covariance between annotators and their correlation (Pearson’s r). In addition, we provide intraclass correlation (ICC) scores (ICC(C,1); McGraw & Wong, 1996) as a measure of consistency. The differences were small and the ICCs exceptionally high throughout. However, they display a small dip for the English MTAS data, which is most likely due to the poor audio quality of these data.
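For readers who wish to compute the same consistency measures for their own double-annotated data, the following sketch shows one way to do so in Python with the pingouin package, where the ICC3 estimate corresponds to ICC(C,1) in the terminology of McGraw and Wong (1996). The column and annotator names are illustrative.

```python
# Consistency measures for double-annotated onsets, as reported in Table 2.
import pandas as pd
import pingouin as pg

# long-format table with one row per (word, annotator) pair
df = pd.read_csv("double_annotations.csv")  # columns: word, annotator, onset_ms

wide = df.pivot(index="word", columns="annotator", values="onset_ms")
diff = wide["A1"] - wide["A2"]              # hypothetical annotator labels
print("mean difference:", diff.mean(), "ms")
print("mean absolute difference:", diff.abs().mean(), "ms")
print("covariance:", wide["A1"].cov(wide["A2"]))
print("Pearson r:", wide["A1"].corr(wide["A2"]))

icc = pg.intraclass_corr(data=df, targets="word",
                         raters="annotator", ratings="onset_ms")
print(icc[icc["Type"] == "ICC3"])           # consistency, single measures
```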
Comparison of automatic and manual annotations: single and multiple word utterances
Rastle&Davis corpus (Rastle & Davis, 2002; English)
One participant was excluded from the analyses as this person inhaled audibly on almost all trials, which caused substantial deviations of the automatic temporal alignments provided by AlignTool from the manual measurements. Such deviations would usually be corrected manually, but as the present evaluation was geared towards establishing the results of the automatic temporal alignments alone, we excluded this participant from the analyses. Of the remaining data, 18 trials were excluded due to participant errors. Table 3 lists the mean utterance onset times established manually and using AlignTool, SayWhen, and Chronset. The results of the manual annotations are given for all 22 participants originally included in the analyses reported by Rastle and Davis (2002) and for the subset of 21 participants included in the present analysis. Rastle and Davis had found that with their manual annotation, response times were 9 ms faster for complex than for simple onsets. This effect was significant by participants and approached significance by items (p = .05). Excluding this participant yielded an effect of 8 ms, with p < .05 for the by-participants and p = .061 for the by-items analysis.
AlignTool automatically annotated the onset times in the two conditions about 33 ms earlier than the manual annotations. This reduction was numerically slightly more pronounced in the simple than in the complex condition, reducing the effect of condition to 2 ms (n.s.). Critically, however, the reduction in response times was not affected significantly by condition (see Table 3).
SayWhen allocated the onset times about 77 ms later than Rastle and Davis had done, yielding a difference between conditions in the opposite direction of what Rastle and Davis had found (-9 ms). This effect approached significance in the by-items analysis (p = .063) but was not significant in the by-participants analysis. The discrepancy between the manual annotation and that established using SayWhen was particularly pronounced in the complex condition, yielding a significant effect of condition on the difference between the two types of annotation (see Table 3).
Chronset suffered from a parallel problem: While it was better than SayWhen at detecting the onsets in the simple condition, it was as inaccurate as SayWhen in the complex condition, yielding a substantial effect of condition in the opposite direction of that seen with the manual annotations and a highly significant effect of condition on Chronset's deviation from the manual annotations. Chronset's and SayWhen's results correspond to those obtained with a threshold-based voice-key by Rastle and Davis (2002).
The results obtained with Chronset highlight the relevance of a post-hoc correction: Unlike a threshold-based hardware voice-key or a fully automatic software voice-key like Chronset, SayWhen and AlignTool allow the user to manually correct the automatic annotations, eventually yielding much higher levels of accuracy. All in all, our findings suggest that the preliminary analyses of the response times provided by AlignTool are more accurate overall than those obtained by SayWhen, Chronset, or a voice-key.
MTAS corpus (English, German, Dutch)
For the MTAS corpus, we carried out two sets of analyses. First, we compared manual annotations of the utterance onset times with those obtained by the voice-key employed during data collection, by SayWhen, by Chronset, and by AlignTool, respectively. In the second analysis, we compared the annotations generated by AlignTool with the manual annotations of the onset and offset times of words within utterances, taking into account the similarity of the last phoneme of the first and the first phoneme of the second word in pairs of successive words.
Table 4 shows how many data points had to be excluded in each language because of recording problems, participant errors, voice-key malfunction, and AlignTool malfunction. In the Dutch data set, the recordings of the word-reading task were faulty for two participants, requiring us to exclude the corresponding data points from further analysis. Also, the voice-key was very sensitive in the Dutch and English experimental setups, causing it to be triggered too early on a substantial number of trials, which were excluded from the analysis. In the English data set, many additional trials were lost due to a malfunction of AlignTool. As the audio quality of the English data was rather poor, AlignTool failed to annotate automatically about 9 % of the data. One would, of course, be able to annotate these trials manually. For the purpose of the present evaluation, however, we simply excluded them.
Unsurprisingly, the analysis of utterance onset times yielded largely parallel patterns of results for the picture-naming and the word-reading task (see Table 5). In German, the measurements generated by AlignTool differed least from the manual measurements, compared to SayWhen, Chronset, and the voice-key. For Chronset, there was only a small difference from the manual annotations in the object-naming task, but that difference was almost twice as large in the word-reading task. Note that there were only half as many trials in the word-reading as in the object-naming task, so a few larger deviations would have a greater impact on the mean in the word-reading task than in the object-naming task. The voice-key tended to be triggered about 75 ms too late. SayWhen allocated the onset times about 160 ms too early, yielding the greatest deviation from the manual measurements.
In Dutch and English, the voice-key tended to generate the most exact measurements. In Dutch, AlignTool established the utterance onset times about 40 ms too early, i.e., it was too sensitive. By contrast, SayWhen and Chronset were not sensitive enough, establishing the utterance onset times 20–30 ms too late. Overall, SayWhen and Chronset deviated substantially less from the manual measurements in Dutch than in German, possibly because the recording quality was better for the Dutch than for the German speakers. In line with this interpretation of the Dutch data, SayWhen's and Chronset's performance dropped markedly for the English recordings, which were of by far the poorest quality overall. AlignTool was able to deal with this problem reasonably well, yielding much smaller deviations from the manual annotations than SayWhen and Chronset. The voice-key was as exact for the English speakers as for the Dutch speakers. Recall that the audio signal was first fed to the external voice-key (Hasomed NesuBox 2) and was then recorded. The relatively unaffected operation of the voice-key, along with the rather poor recordings, suggests that the recording problems in the English corpus arose after the voice-key had operated.
Overall, these findings suggest that AlignTool establishes utterance onset times reasonably exactly, irrespective of the quality of the recordings in terms of the signal-to-noise ratio. However, users will need to tune the parameters to their recording quality. In Appendix A6 of the User Manual (Belke et al., 2017), we give some advice on how to do this. The present results also indicate that users need to edit some of the automatically generated measurements manually in order to obtain optimal results. To this end, they can access the TextGrid files established by AlignTool and edit them directly. All changes made can be saved and imported to the Excel workbook by means of the Import to TextGrids function.
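As a starting point for such post-processing, the sketch below reads a TextGrid produced by AlignTool with the third-party Python package textgrid and prints the word boundaries. The tier name "words" is an assumption; the actual tier names depend on the AlignTool configuration.

```python
# Extract word onsets and offsets from an AlignTool TextGrid.
import textgrid

tg = textgrid.TextGrid.fromFile("trial001.TextGrid")
words = tg.getFirst("words")            # assumed tier name
for interval in words:
    if interval.mark.strip():           # skip empty (silent) intervals
        print(interval.mark, interval.minTime, interval.maxTime)
```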
Table 6 presents the average deviation from the manual annotations at the transitions of successive words in the four-word utterances generated in the reading task. These transitions comprised the last phoneme of one word and the first phoneme of the next, i.e., the transitions between the first and second, the second and third, and the third and fourth word. Table 7 presents the results of the statistical analysis of the effects of transition position, transition similarity, and their interaction on the difference between the measurements established by AlignTool and the manual measurements. The position of the transition within the four-word utterances had a substantial effect on the accuracy of the measurements generated by AlignTool (or MAUS, to be precise), with measurement accuracy decreasing substantially across the four-word sequence (see Table 6). Indeed, from a technical perspective one would expect accuracy to decrease with increasing length, as the search space (i.e., all possible segmentations) of the HMM-based alignment algorithm increases quadratically with utterance length, thus increasing the probability of errors at later positions within the utterance. The position effect was particularly pronounced in the Dutch data, yielding average deviations of 400–500 ms for the transition from the third to the fourth word. Given that the recording quality of the Dutch data was excellent, this finding is surprising and clearly exceeds the technically induced position effect caused by the increase in length.
Inspection of trials yielding such high deviations between the third and the fourth word suggested that segmentSpeech had malfunctioned on some occasions, taking audible breathing of the participants towards the end of the trial to be speech rather than noise (see Fig. 5 for an example trial). This is caused by the automatic floor noise detection used to establish the silence threshold. The excellent audio quality of the Dutch recordings led to a low floor noise estimate, so that breathing was likely more prominent than in recordings with a poor signal-to-noise ratio. This problem may have occurred on other trials as well. To assess this, we trimmed all audio data, deleting the interval starting 500 ms after utterance offset (as established in the manual annotation) and ending at the end of the trial (see Footnote 3). Table 6 presents the average deviations in onsets and offsets by position and transition similarity. Trimming the audio data improved the results of the automatic annotation considerably, eliminating the statistical effect of transition position (see Table 7). There was still an effect of transition similarity, with the offset and onset times of dissimilar and similar words being annotated a little earlier than the manual annotations early on in the utterances and slightly later than the manual annotations at later positions in the utterance. This effect was reversed, however, for transitions with identical phonemes at the end of the first and the beginning of the second word, yielding a significant interaction of transition similarity and transition position. It is not clear why the position effect reversed for this type of transition only, but it is important to keep in mind that with identical transitions, manual annotations are largely arbitrary (see Appendix 3 for the annotation guidelines we followed). For the time being, the most important finding is that trimming the data improved the performance of AlignTool considerably, suggesting that segmentSpeech had not segmented the speech-relevant intervals in the trials reliably. This reflects a trade-off between setting highly sensitive parameters and sacrificing onset and offset accuracy.
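The trimming step itself is straightforward to implement. The following sketch removes everything from 500 ms after the manually annotated utterance offset to the end of each trial recording; the file naming and the format of the offsets table are illustrative.

```python
# Trim trial recordings 500 ms after the manually annotated utterance offset.
import pandas as pd
import soundfile as sf

offsets = pd.read_csv("manual_offsets.csv")  # columns: file, offset_ms
for _, row in offsets.iterrows():
    audio, sr = sf.read(row["file"])
    cut = int((row["offset_ms"] + 500) / 1000 * sr)
    sf.write(row["file"].replace(".wav", "_trimmed.wav"), audio[:cut], sr)
```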
Figure 6 presents the deviations from the manual annotations for the onset and offset times by phoneme type. For the Dutch MTAS corpus, Fig. 6 shows the results of the analysis of the trimmed audio data. There was a clear connection between the quality of the audio recordings and the overall amount of deviation of the annotation generated by AlignTool from the manual annotation. While there was little deviation from the manual annotations in Dutch (excellent recording quality), there was more deviation in German (average recording quality) and most in English, where the recordings were worst. The deviation seen across languages was not systematically affected by phoneme type – positive and negative deviations from the manual annotations were seen in all phoneme classes alike.
Comparison of automatic and manual annotations: semi-spontaneous speech
In the English and German corpora of semi-spontaneous speech, we had to exclude some of the data from the analysis. As these corpora involved dialogue-like settings, speakers occasionally spoke at the same time. These sections of the corpus were excluded, as MAUS is unlikely to be able to deal with them. Interjections like "hm", speech errors, and incomprehensible words were excluded from the analysis as well. Finally, all trials associated with AlignTool malfunction were excluded, leaving 75.4 % of the words in the Map Task corpus and 82 % of the words in the Display Comparison Task corpus for analysis. There was no need to exclude any data from the Sjerps&Meyer corpus.
Table 8 presents the average measurement deviation in the corpora and the correlation between this measurement deviation and the position of a word in the participants' utterance. Unlike in the analyses of the MTAS data, there was no consistent positive correlation between the position of a word in the utterance and AlignTool's accuracy: the correlation was positive for the Dutch data, negative for the German data, and absent for the English data. In order to interpret these correlations, we also assessed the absolute difference between the annotations by AlignTool and the manual raters (see Table 8). For the Display Comparison Task corpus, AlignTool seems to have annotated the relevant word boundaries too early, and this error intensified over positions. As a result, there was a significant negative correlation between utterance position and the average differences between AlignTool and the manual annotations, along with a positive correlation between utterance position and the absolute differences. The Dutch data suggest that AlignTool again tended to annotate relevant word boundaries earlier than the manual annotators. Over the four positions in the Dutch utterances, this difference grew more positive, accounting for the positive correlation between the average differences and utterance position. For the absolute differences, there was no significant correlation with utterance position: the average differences were first negative, then moved towards 0, and finally became positive, so that the absolute differences first decreased and then increased across positions, yielding a non-linear pattern. Finally, in the English Map Task corpus, there was no systematic effect of utterance position on the average differences between the annotations generated by AlignTool and by the manual annotators. In fact, the analysis of the absolute differences showed that the error became smaller over utterance positions rather than bigger.
It is noteworthy that the utterance length in the three corpora of semi-spontaneous speech differed considerably. For Dutch, participants produced utterances of the type "Put the A above (below) the B and put the C below (above) the D", comprising 13 words in total. For the Map Task corpus, by contrast, there were, on average, 2 min and 40 s of pure speech in each of the 21 recordings we analyzed. For the Display Comparison Task corpus, the recordings were even longer: on average, there were 7 min and 30 s of pure speech from each speaker pair. Hence, one might expect that the effect of word position on the average absolute differences seen in the English and German data decreases considerably when the recordings are pre-segmented into shorter sections of, say, 30 s each, prior to applying alignMAUS to the recordings (see Belke et al., 2017).
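Such a pre-segmentation need not be sophisticated. The sketch below cuts a long recording into sections of roughly 30 s, placing each cut at the quietest 10-ms frame within a 2-s window around the target boundary so as to avoid cutting through words; all window sizes are assumptions to be adapted to the material.

```python
# Pre-segment a long recording into ~30-s sections, cutting at quiet frames.
import numpy as np
import soundfile as sf

audio, sr = sf.read("long_recording.wav")
if audio.ndim > 1:
    audio = audio.mean(axis=1)       # mix down to mono for the energy analysis

frame = int(0.010 * sr)              # 10-ms analysis frame
target = 30 * sr                     # desired section length
margin = 2 * sr                      # search +/- 2 s around each 30-s mark

start, part = 0, 1
while start + target < len(audio):
    lo = start + target - margin
    hi = min(start + target + margin, len(audio) - frame)
    # RMS energy of each 10-ms frame in the search region
    frames = range(lo, hi, frame)
    rms = [np.sqrt(np.mean(audio[i:i + frame] ** 2)) for i in frames]
    cut = lo + int(np.argmin(rms)) * frame   # quietest frame = cut point
    sf.write(f"section{part:03d}.wav", audio[start:cut], sr)
    start, part = cut, part + 1
sf.write(f"section{part:03d}.wav", audio[start:], sr)
```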
Figure 7 presents the deviations from the manual annotations for the onset and offset times by phoneme type. In German and English, the results mirrored those obtained for the MTAS corpus: While the quality of the audio recordings, which was excellent in the English Map Task corpus but much poorer in the German Display Comparison Task corpus, clearly impacted on the overall deviation of the annotation generated by AlignTool from the manual annotation, there was no systematic effect of phoneme type. By contrast, the results for the Dutch corpus differed from those obtained for the MTAS corpus in that there was a marked effect of phoneme type, especially for word onsets. AlignTool annotated the onsets of words starting with plosives substantially earlier than the human raters, whereas all other phoneme types were annotated similarly by AlignTool and the human raters. For offsets this effect was not visible. We assume that the systematic deviations in the onsets came about because the human raters had used the moment of plosion in order to establish the speech onset whereas AlignTool established the beginning of the plosives slightly earlier. This difference impacts on the temporal annotation of onsets only, as for plosives in the word offsets, both AlignTool and the human raters established the moment of plosion as the offset of the word.
General discussion
AlignTool is an open source tool for the semi-automatic temporal alignment of speech in single- and multiple-word utterances and semi-spontaneous speech. In a large-scale evaluation, we have identified strengths and weaknesses of AlignTool, demonstrating that it can provide highly accurate automatic alignments for recordings with an excellent signal-to-noise ratio but becomes less accurate as the recording quality decreases.
Evidently, each researcher will try to ensure that the recordings are of the best possible quality in terms of their signal-to-noise ratio, but in our experience, ideal recording conditions are rarely given. Therefore, we configured AlignTool to perform with recordings of poorer quality as well and to allow for easy-to-implement manual corrections in Praat. However, AlignTool is likely to perform less robustly and less accurately with audio signals of poorer quality, requiring the user to correct more trials than with audio signals of better quality.
In language production research, AlignTool can be used as a digital voice-key as well as a tool for establishing word onset and offset times in more complex, semi-spontaneous settings. Our pilot data from evaluating AlignTool's accuracy in automatically aligning semi-spontaneous speech are promising in that the deviations were small, especially for recordings of excellent quality, and there was no systematic effect of a word's position in the utterance on alignment accuracy. Note that MAUS deserves most of the credit for this, as the alignments of semi-spontaneous speech largely relied on MAUS.
Unlike for the semi-spontaneous speech, we found that the automatic analyses of the four-word utterances in the MTAS corpus with AlignTool exhibited a substantial effect of word position. We presume that this contrast between the two types of utterances came about because the recordings in the MTAS corpus included long silent intervals, namely the interval between stimulus onset and response onset, when participants were planning their utterance, and the interval after utterance completion until the beginning of the next trial. We have demonstrated for the Dutch data that even in cases when the recording quality is excellent, non-speech sounds can have an impact on the quality of the automatic alignments generated by AlignTool. One might reduce such problems by training the automatic speech recognition system to distinguish between speech and non-speech sounds. Note, however, that such training is necessarily tied to the given recording scenario and is therefore unlikely to transfer to other recording scenarios.
Given that users can correct the results of AlignTool manually, the alignments provided by AlignTool can potentially be as accurate as those generated by hand. However, the aim is, of course, to have AlignTool generate as many word onsets on its own as possible. To this end, we recommend that users prepare manual annotations of a sample of the utterances they want to align with AlignTool and use them to find the optimal parameter settings for their recording setting. In Appendix A6 of the User Manual, we give some recommendations on how to do this. In all likelihood, an optimal parameter setting for a given recording scenario can be carried over to new recordings made in the same scenario. Therefore, it is worthwhile to invest some tuning effort when first using AlignTool.
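In practice, such tuning can be as simple as the following loop, which compares AlignTool's output under different parameter settings against the manually annotated sample and retains the setting with the smallest mean absolute deviation. The setting labels and file names are purely illustrative and do not refer to actual AlignTool parameters.

```python
# Pick the parameter setting that best matches a manually annotated sample.
import numpy as np

manual = np.loadtxt("manual_onsets_ms.txt")   # one onset per sample trial

def load_auto_onsets(setting):
    # onsets exported from an AlignTool run carried out with this setting;
    # the file naming scheme is illustrative
    return np.loadtxt(f"auto_onsets_{setting}.txt")

best = None
for setting in ["low", "medium", "high"]:     # candidate sensitivity settings
    deviation = np.mean(np.abs(load_auto_onsets(setting) - manual))
    if best is None or deviation < best[1]:
        best = (setting, deviation)
print(f"best setting: {best[0]} (mean abs. deviation: {best[1]:.1f} ms)")
```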
However, our evaluation results also indicate that even when parameters are optimized, users must not rely on the results generated by AlignTool blindly but need to be intelligent inspectors of its results. For instance, by generating histograms of the onset times and the durations of individual words, users can identify apparent outliers so as to find out whether there are problems in the recordings of the kind reported for the Dutch section of the MTAS corpus, where a breathing noise was mistaken for speech and distorted the automatic alignment.
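One simple way to implement this kind of screening is sketched below: histograms of the onset times and word durations exported by AlignTool, plus a list of the longest words for manual inspection in Praat. Column names are illustrative.

```python
# Screen AlignTool output for outliers via histograms of onsets and durations.
import matplotlib.pyplot as plt
import pandas as pd

df = pd.read_csv("aligntool_output.csv")  # columns: word, onset_ms, offset_ms
df["duration_ms"] = df["offset_ms"] - df["onset_ms"]

fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(9, 3))
ax1.hist(df["onset_ms"], bins=50)
ax1.set_xlabel("word onset (ms)")
ax2.hist(df["duration_ms"], bins=50)
ax2.set_xlabel("word duration (ms)")
plt.tight_layout()
plt.show()

# Flag implausibly long words for manual inspection in Praat
print(df[df["duration_ms"] > df["duration_ms"].quantile(0.99)])
```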
Apart from the domains for which we have evaluated AlignTool in this paper, it can also be used in research on language comprehension, where it may be of interest to link the temporal structure of spoken utterances to listeners' eye movements, for instance in visual-world experiments or in instruction settings like the Map Task setting. In addition, AlignTool can be applied to analyzing the onset and offset times of pseudowords, as long as the pseudowords are phonotactically plausible in the speakers' language, such that the G2P-service (BAS, 2017b) can generate a pronunciation based on the pseudowords' orthography. We have tested AlignTool on short sequences of pseudowords used in an artificial language learning study with German speakers (Bebout & Belke, 2017). Each of the utterances consisted of four or eight pseudowords (cf. the prose and rhyme training conditions in the study). There were no manual annotations of the onset and offset times of the pseudowords, but we assessed the outcome of AlignTool's measurements visually for a sample of the 144 recordings we aligned in this way and found the results to be accurate.
Moving on from AlignTool, the next big challenge will be to develop efficient tools for the (semi-)automatic annotation of speech recorded in dialogue settings involving multiple speakers. Rosenfelder et al. (2011) have presented FAVE-align, a tool that allows users to temporally align speech recorded from multiple speakers in dialogue settings, such as sociolinguistic interviews. Users transcribe each speaker's utterances in a separate tier and feed this information to FAVE-align, which performs forced alignments using the Penn Phonetics Lab Forced Aligner (P2FA). The tool has not been evaluated for temporal accuracy, but given that it allows for manual corrections of the alignment, very high levels of accuracy should be achievable.
In sum, many psycholinguistic studies require precise information about the time course of spoken utterances. AlignTool is an open source instrument that should, we hope, support researchers in the semi-automatic analysis of their corpora. Since it functions both as a voice-key and as a tool for the analysis of word onset and offset times in more complex utterances, we expect that AlignTool will open new avenues in language production research.
This research was funded by grant no. BE 3176/4-1, awarded to Eva Belke, Britta Wrede, and Antje Meyer by the German Research Foundation (DFG). We thank Andrea Krott for making her laboratory facilities available to us in order to collect the data for the British English section of the MTAS corpus. We are most grateful to Theodor Berwe, Sabrina Böckmann, Saskia Bohemann, Maria Hosfeld, Anastasia Los, Sara Kattanek, Mareike Klamandt, Natalie Kroll, and Esther Seyffarth for their help in measuring manually the onset and offset times of the words and to Jeroen van Paridon for carrying out the analyses with SayWhen. AlignTool and its documentation are available at https://www.linguistics.ruhr-uni-bochum.de/~belke/aligntool.shtml