Chronset: An automated tool for detecting speech onset
The analysis of speech onset times has a longstanding tradition in experimental psychology as a measure of how a stimulus influences a spoken response. Yet the lack of accurate automatic methods to measure such effects forces researchers to rely on time-intensive manual or semiautomatic techniques. Here we present Chronset, a fully automated tool that estimates speech onset on the basis of multiple acoustic features extracted via multitaper spectral analysis. Using statistical optimization techniques, we show that the present approach generalizes across different languages and speaker populations, and that it extracts speech onset latencies that agree closely with those from human observations. Finally, we show how the present approach can be integrated with previous work (Jansen & Watter, Behavior Research Methods, 40:744–751, 2008) to further improve the precision of onset detection. Chronset is publicly available online at www.bcbl.eu/databases/chronset.
Keywords: Speech onset; Reading aloud; Automatic detection; Spectral analysis; Optimization
Reaction time (RT) experiments have a longstanding history in experimental psychology and have been instrumental to achieving several fundamental insights into human cognition (Donders, 1868; Posner & Mitchell, 1967; Sternberg, 1966; Stroop, 1935). Although RTs are classically measured as the time needed to execute a button press, another useful technique consists in measuring the time required to produce spoken responses. Given the ubiquity of speech in human behavior, this approach offers a natural way of measuring response latencies.
To assess speech onset, the current gold standard is to rely on human raters, often aided by semi-automatic rating techniques (Jansen & Watter, 2008; Protopapas, 2007). To date, this approach has yielded some of the most accurate and consistent estimates of speech onsets. Nevertheless, it is suboptimal in several respects. First, it is extremely time-consuming, because human raters have to process the waveforms of vocal recordings on a trial-by-trial basis. Second, although agreement levels among raters are typically relatively high, human ratings remain prone to subjective bias and other sources of measurement error (Green & Swets, 1966; Morrow, Mood, Disch, & Kang, 2016). Consequently, a considerable amount of work has been dedicated to developing fully automated approaches (Bansal, Griffin, & Spieler, 2001; Jansen & Watter, 2008; Kawamoto & Kello, 1999; Protopapas, 2007).
Hardware-based methods for onset detection, such as voice-key devices, operate by detecting the point in time at which sound pressure levels exceed a given threshold (Rastle & Davis, 2002). In the absence of noise or nonspeech sounds (e.g., lip smacking, respiration, or coughing), voice keys can produce accurate measurements if the experiments are carefully controlled (Korvorst, Roelofs, & Levelt, 2006; Meyer & van der Meulen, 2000; Roelofs, 2005). However, in the majority of cases voice keys fail to achieve accurate measurements whenever vocal responses are preceded by loud noise sounds or are characterized by complex acoustic onsets (Kessler, Treiman, & Mullennix, 2002; Rastle & Davis, 2002), both of which are frequent occurrences in natural speech.
With the ubiquity of personal computers, several software-based solutions have been developed to improve semi- and fully automatic speech onset detection, thereby providing a novel framework for the automatic assessment of speech onset times (Bansal et al., 2001; Donkin, Brown, & Heathcote, 2009; Jansen & Watter, 2008; Kawamoto & Kello, 1999). One important limitation of these approaches, however, is that speech onset is estimated on the basis of a sustained elevation of sound amplitude over time, which in fact can also be triggered by loud nonspeech sounds (Rastle & Davis, 2002). This causes amplitude-based approaches to produce a large number of spurious speech onset estimates, thereby decreasing the overall reliability of these techniques. To overcome this limitation, recent work has focused on estimating speech onset from multiple acoustic features, to make automatic algorithms more robust against noise (Jansen & Watter, 2008; Kello & Kawamoto, 1998). However, despite the significant improvements that have been achieved by adding multiple features to speech onset detection, the accuracy achieved by current hardware and software has remained below the precision of human raters.
Here we present Chronset, a fully automated technique aimed at further enhancing the robustness and accuracy of automatic speech onset detection. The present approach is inspired by previous work in songbirds showing that vocalized sounds have a rich harmonic structure that is absent in noise or nonvocal sounds (Tchernichovski, Nottebohm, Ho, Pesaran, & Mitra, 2000). Building on these findings, we hypothesized that human speech onset and noise sounds have distinct spectral signatures on the basis of which they can be distinguished from each other. We extracted multiple spectral features from audio waveforms, and signaled a speech onset if four different features were simultaneously above a set of threshold levels. To estimate a set of speaker and language-independent threshold parameters, we used an optimization procedure to tune the thresholds for a broad range of waveforms sampled from two laboratories where spoken responses were recorded in two different languages and experimental contexts (Jansen & Watter, 2008; Sadat, Martin, Alario, & Costa, 2012).
Our findings show that Chronset detects voice onset with a high degree of precision relative to human ratings, such that most of its errors occur in a small (<50 ms) window surrounding a manually annotated RT. On the basis of Monte Carlo simulations, we show that the size of Chronset’s estimation error does not substantially reduce the statistical power of experimental data analyses for experiments with standard sample sizes. Furthermore, we show how our approach can be combined with the SayWhen algorithm to achieve estimates of speech onset that are even closer to human rater precision. The present work therefore demonstrates a novel approach for the automatic extraction of speech onset times that provides a substantial improvement over previously reported techniques.
Human onset detection
To assess the algorithm’s performance, two previously published datasets were analyzed in the present study: one dataset comprised waveforms in the Spanish language (hereafter, the Spanish dataset; Sadat et al., 2012) and a second dataset comprised waveforms in the English language (hereafter, the English dataset; Jansen & Watter, 2008). For each dataset, speech onset latencies were calculated by a group of human raters. For the Spanish dataset analyzed in the present study, the data from two raters who used the CheckVocal software (Protopapas, 2007) were already available. In the case of the English dataset, ratings from one individual were supplemented by the data from two additional raters who manually identified onsets after splitting the continuous recording of the audio from the entire experiment for each participant into separate waveforms for each trial. The two additional raters used the Audacity software to measure onsets using a combination of listening, waveform inspection, and spectrogram inspection, whereas the original rater used the SayWhen software (Jansen & Watter, 2008).
Spectral analysis and acoustic features
Unless stated otherwise, all features were normalized to range from 0 to 1 for each sound recording, with 0 and 1 corresponding to the minimum and maximum values of each feature. Gaussian smoothing over time (10 ms) was applied to all voice features, thus ensuring that only large, systematic changes in the feature would be used in determining the voice onset. A brief description of each feature follows below (for additional details, see Tchernichovski et al., 2000). These and all subsequent analyses were carried out using the Chronux toolbox (http://chronux.org/; Bokil, Andrews, Kulkarni, Mehta, & Mitra, 2010), as well as custom scripts written in MATLAB (The Mathworks).
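As a rough illustration of this preprocessing, the sketch below rescales a feature track to the 0 to 1 range and smooths it with a Gaussian kernel. It is written in plain Python rather than the MATLAB/Chronux code used for the analyses. It assumes (the text does not say) that the 10 ms figure is the kernel's standard deviation, that frames are 1 ms apart, and that smoothing precedes normalization. The function names are hypothetical.

```python
import math


def normalize_01(values):
    """Rescale a feature track so its minimum maps to 0 and its maximum to 1."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # A constant track carries no information; map it to all zeros.
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


def gaussian_smooth(values, sigma_frames):
    """Smooth a track with a Gaussian kernel truncated at 3 standard deviations.

    Near the edges the kernel is renormalized over the frames that exist,
    so a constant track stays constant.
    """
    radius = max(1, int(math.ceil(3 * sigma_frames)))
    offsets = range(-radius, radius + 1)
    kernel = [math.exp(-0.5 * (k / sigma_frames) ** 2) for k in offsets]
    n = len(values)
    smoothed = []
    for i in range(n):
        acc = 0.0
        weight_sum = 0.0
        for k, w in zip(offsets, kernel):
            j = i + k
            if 0 <= j < n:
                acc += w * values[j]
                weight_sum += w
        smoothed.append(acc / weight_sum)
    return smoothed


# Hypothetical feature track: a single transient spike at frame 50.
track = [0.0] * 50 + [5.0] + [0.0] * 50
# Assumption: 1 ms frames, so sigma_frames=10 corresponds to 10 ms.
feature = normalize_01(gaussian_smooth(track, sigma_frames=10.0))
```

After smoothing, the isolated spike is spread over neighboring frames, which is the sense in which only large, sustained changes in a feature survive to influence onset detection.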
Amplitude
The amplitudes of the speech sounds were computed as the logarithm of spectral power integrated over frequencies between 0.15 and 22.05 kHz (Fig. 1c). Speech sounds typically have higher amplitude than noise, and thus provide an indicator of speech onset. However, other sounds (e.g., coughing, lip smacking) have high amplitude as well, thereby reducing the reliability of this feature for speech onset detection.
Wiener entropy (WE)
Entropy is a measure of how random versus organized a signal is. In the context of voice sounds, it can be applied to measure whether a sound is occurring relatively uniformly across the entire spectrum, or only in specific frequency bands. Noise sounds tend to have a flat spectrum, whereas the spectrum of voiced sounds will be organized into clear peaks. Many different mathematical formulations of entropy have been developed, which all relate to these same principles (Shannon, 1948). Here, we used WE, which is computed as the ratio of the geometric mean and the arithmetic mean of the spectrum (Tchernichovski et al., 2000). Importantly for the present purposes, WE is amplitude-independent, and thus is not affected by the distance of the speaker from the recording device, the speaker’s overall loudness, or the absolute signal-to-noise ratio (Fig. 1d).
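The ratio described above can be sketched in a few lines of plain Python. This is only an illustration of the geometric-to-arithmetic-mean definition given in the text, not the Chronux implementation; the function name and the small floor that guards against zero power values are assumptions.

```python
import math


def wiener_entropy(power_spectrum, floor=1e-12):
    """Ratio of the geometric mean to the arithmetic mean of a power spectrum.

    Values near 1 indicate a flat, noise-like spectrum; values near 0
    indicate power concentrated in a few frequency bins.
    """
    # Assumption: clamp tiny or zero power values so the logarithm is defined.
    powers = [max(p, floor) for p in power_spectrum]
    n = len(powers)
    geometric_mean = math.exp(sum(math.log(p) for p in powers) / n)
    arithmetic_mean = sum(powers) / n
    return geometric_mean / arithmetic_mean


# Hypothetical spectra: flat (noise-like) versus a single dominant peak.
flat = [2.0] * 8
peaked = [1e-6] * 7 + [1.0]
we_flat = wiener_entropy(flat)
we_peaked = wiener_entropy(peaked)
# Scaling every bin by the same gain leaves the ratio unchanged.
we_peaked_louder = wiener_entropy([100.0 * p for p in peaked])
```

The last line illustrates the amplitude independence noted in the text: a uniform gain multiplies both means by the same factor, so it cancels in the ratio.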
Spectral change (SC)
SC is a combined measure of how power changes simultaneously in both time and frequency (Fig. 1e). Noise typically shows less spectral change than do voiced sounds.
Amplitude modulation (AM)
AM is the overall change in power across all frequencies, and thus reflects the magnitude of the change in energy of a speech sound over time (Fig. 1f). Voiced sounds will tend to modulate amplitude more strongly than other sounds, such as lip smacks, which typically generate a very transient sound.
Frequency modulation (FM)
FM assesses how much the concentration of power changes across spectral bands over time (Fig. 1g). The energy of tonal sounds tends to remain concentrated within specific spectral bands (low FM), whereas noise typically has a rapidly changing, nonconstant concentration of power across frequencies (high FM).
Harmonic pitch (HP)
HP estimates the spectral structure of harmonic sounds. To estimate HP, we computed a second spectrum of the power spectrum, which measures the periodicity of the peaks in the power spectrum (Bogert & Healy, 1963; Oppenheim & Schafer, 1989) for each time point (Fig. 1h). A high value of HP signifies that the acoustic signal is composed of harmonically related frequencies. In contrast, low HP indicates the absence of harmonic structure. Voiced sounds, which typically contain a collection of harmonics (formants), will show a high level of periodicity in the power spectrum (Noll, 1967), whereas the spectrum of noise is typically characterized by the absence of resonance.
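The idea of taking a spectrum of the power spectrum can be illustrated with a small, dependency-free sketch. It is not the Chronux code: the naive discrete Fourier transform, the logarithm applied before the second transform (the classic cepstrum convention), the mean removal, and all names are assumptions made for the example.

```python
import cmath
import math


def second_spectrum(power_spectrum):
    """Magnitude spectrum of the mean-removed log power spectrum.

    A strong peak at index k means the spectral peaks repeat roughly
    every len(power_spectrum) / k frequency bins.
    """
    logs = [math.log(p) for p in power_spectrum]
    mean = sum(logs) / len(logs)
    centered = [x - mean for x in logs]
    n = len(centered)
    magnitudes = []
    for k in range(n // 2 + 1):
        # Naive O(n^2) DFT, adequate for a short illustrative spectrum.
        coeff = sum(x * cmath.exp(-2j * math.pi * k * i / n)
                    for i, x in enumerate(centered))
        magnitudes.append(abs(coeff))
    return magnitudes


# Hypothetical harmonic spectrum: power rises and falls every 8 bins.
harmonic = [math.exp(math.cos(2 * math.pi * i / 8)) for i in range(64)]
# Hypothetical noise-like spectrum: flat, with no periodic structure.
flat = [1.0] * 64

harmonic_mags = second_spectrum(harmonic)
flat_mags = second_spectrum(flat)
peak_index = max(range(1, len(harmonic_mags)), key=lambda k: harmonic_mags[k])
```

For the harmonic example the peak falls at index 8, that is, 64 / 8 bins between spectral peaks, whereas the flat spectrum yields no peak at all. A summary such as the height of the largest peak could then serve as an HP-like value per time point.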
Automatic detection criteria
To improve the robustness of our algorithm against loud noise sounds, speech onset was detected if four of the six features were simultaneously elevated above threshold levels for 35 ms. Furthermore, to ensure that our algorithm could detect unvoiced onsets such as [s], [f], or [p], we quantified speech onset as the first point in time at which the amplitude was elevated above threshold within the time window defined by the four-feature criterion. Because the algorithm uses a low threshold for the amplitude feature, this additional criterion allowed us to detect low-amplitude unvoiced sounds that often precede voiced sounds but do not have a harmonic spectrum.
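The two-stage criterion can be sketched as follows. This is one literal reading of the description, not the released Chronset code. It assumes 1 ms frames, that every feature has been oriented so that larger values are more speech-like, and that the "time window" is the contiguous run of frames satisfying the four-feature criterion. The function, feature keys, and threshold values are hypothetical.

```python
def detect_onset(features, thresholds, frame_ms=1.0, min_features=4,
                 min_duration_ms=35.0, amplitude_key="amplitude"):
    """Return the onset time in ms, or None if no window qualifies.

    features:   dict mapping feature name -> list of values, one per frame
    thresholds: dict mapping feature name -> threshold for that feature
    """
    n_frames = len(features[amplitude_key])
    needed = int(round(min_duration_ms / frame_ms))
    # Stage 1: mark frames where enough features exceed their thresholds.
    enough = [
        sum(1 for name, track in features.items()
            if track[t] > thresholds[name]) >= min_features
        for t in range(n_frames)
    ]
    run_start = None
    for t in range(n_frames + 1):
        inside = t < n_frames and enough[t]
        if inside and run_start is None:
            run_start = t
        elif not inside and run_start is not None:
            if t - run_start >= needed:
                # Stage 2: first supra-threshold amplitude frame in the window.
                for u in range(run_start, t):
                    if features[amplitude_key][u] > thresholds[amplitude_key]:
                        return u * frame_ms
            run_start = None
    return None


def track(n_frames, high_ranges):
    """Build a 0/1 track that is 1 inside the given [start, stop) ranges."""
    return [1.0 if any(a <= t < b for a, b in high_ranges) else 0.0
            for t in range(n_frames)]


# Hypothetical trial: a 10 ms click at 20 ms, then speech from 100 ms on.
n = 200
features = {
    "amplitude": track(n, [(20, 30), (110, n)]),
    "wiener_entropy": track(n, [(20, 30), (100, n)]),
    "spectral_change": track(n, [(20, 30), (100, n)]),
    "amplitude_modulation": track(n, [(20, 30), (100, n)]),
    "frequency_modulation": track(n, [(20, 30)]),
    "harmonic_pitch": track(n, [(20, 30), (100, n)]),
}
thresholds = {name: 0.5 for name in features}
onset_ms = detect_onset(features, thresholds)  # 110.0 for this toy trial
```

In this toy trial the click drives all six features above threshold but lasts only 10 ms, so it is rejected by the 35 ms duration requirement, and the onset is placed at the first supra-threshold amplitude frame inside the later speech window.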
Statistical tuning of optimized feature thresholds
To identify a set of thresholds that provided good sensitivity and validity across waveforms, we estimated the thresholds for each individual feature using a customized version of the gradient descent algorithm (Hinton & Sejnowski, 1986; see also Armstrong, Watson, & Plaut, 2012). Feature thresholds were optimized on the basis of vocal responses recorded in the Spanish dataset, which comprised 150 waveforms per participant (n = 14) for which the individual voice onsets were estimated by two human raters using semi-automatic techniques (total number of analyzed waveforms: 2,100). To ensure that the optimized feature thresholds were not overfitted, we randomly divided the waveforms of the Spanish dataset into a training set (80 % of the data) and a testing set (20 %). Optimization was performed on the training set, whereas the testing set was used to assess the accuracy achieved by the optimized feature thresholds. In both sets, the fit associated with a particular set of feature thresholds was quantified by measuring the maximum likelihood estimate of the standard deviation (SD) of the regression residuals, which assessed the difference between the individual hand-coded onset latencies and the fitted regression line. A poor fit of the estimated thresholds was thus associated with a higher SD for the regression residuals, whereas a better fit was associated with a lower SD. By selecting those thresholds that lowered the overall SD of the test data, it was possible to optimize the thresholds such that the returned automatic estimates of speech onset were in close agreement with those of the human ratings across a broad range of waveforms (for more details, see the supplemental information). The optimization algorithm stopped either after 1,000 attempts to modify the thresholds or if 50 consecutive attempts failed to improve the fit. 
To maximize the likelihood that the best possible thresholds would be identified, we repeated this optimization process for 100 randomly chosen partitions of the waveforms into training and test sets. The final thresholds that we selected were those associated with the smallest SD and the largest R2 on the testing data.
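The stopping rules described above can be illustrated with a simple search loop. The sketch below swaps in a random-perturbation hill climber for the customized gradient descent procedure, whose details are in the supplemental information and are not reproduced here. The loss function is a toy stand-in for the SD of the regression residuals on the training set, and the step size, seed, and all names are assumptions.

```python
import random


def tune_thresholds(loss, start, step=0.05, max_attempts=1000,
                    patience=50, seed=0):
    """Perturb one threshold at a time and keep changes that lower the loss.

    Stops after max_attempts modifications, or after `patience`
    consecutive attempts that fail to improve the fit.
    """
    rng = random.Random(seed)
    best = list(start)
    best_loss = loss(best)
    attempts = 0
    failures = 0
    while attempts < max_attempts and failures < patience:
        attempts += 1
        candidate = list(best)
        i = rng.randrange(len(candidate))
        # Thresholds apply to features normalized to the 0-1 range.
        candidate[i] = min(1.0, max(0.0, candidate[i] + rng.uniform(-step, step)))
        candidate_loss = loss(candidate)
        if candidate_loss < best_loss:
            best, best_loss = candidate, candidate_loss
            failures = 0
        else:
            failures += 1
    return best, best_loss, attempts


# Toy loss (assumption): squared distance to arbitrary "ideal" thresholds.
# In practice the loss would rerun onset detection on the training
# waveforms and return the SD of the regression residuals.
IDEAL = [0.1, 0.2, 0.3, 0.7, 0.8, 0.9]


def toy_loss(thresholds):
    return sum((t - g) ** 2 for t, g in zip(thresholds, IDEAL))


start = [0.5] * 6
tuned, tuned_loss, n_attempts = tune_thresholds(toy_loss, start)
```

Repeating such a search over many random training/test partitions, as described above, would amount to calling `tune_thresholds` once per partition with a different loss and keeping the thresholds that perform best on the held-out data.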
Dependent measures of algorithm performance
To assess the performance of both Chronset and the other speech onset detection algorithms, we evaluated each algorithm using two dependent measures: (1) absolute-difference scores and (2) regression fits in terms of R2 and regression residuals.
Absolute difference (AD):
Regression fits (R2 and regression residuals):
The regression fit quantified how well the regression line fit the unobserved linear relationship between the automatic estimates and manual ratings. Note that R2 is more sensitive to large deviations from the optimal fit than are ADs, and is less sensitive to small deviations from the regression line, because differences are squared. These properties are a strength of the regression measure, because they highlight whether an algorithm generates large numbers of outliers, and because they eliminate any measurement bias that can be attributed to either a rater’s perceptual bias (Green & Swets, 1966; Morrow et al., 2016) or an algorithm’s systematic measurement error (for more details, see Figs. S1–S4 in the supplemental materials).
One important limitation of using regression statistics to evaluate algorithm performance, however, is that a good fit (high R2) will not necessarily mean that the absolute values of the onsets are identical. Rather, it means that relative changes in the manual onsets are reflected by highly systematic changes in the automatic onsets. Thus, here we argue that an optimal algorithm should maximize performance on a new composite measure that minimizes ADs and maximizes regression fit.
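The contrast between the two measures can be made concrete with a short sketch. It assumes an ordinary least-squares regression of the manual onsets on the automatic onsets and a residual SD that divides by n (the maximum likelihood form). The function names and the onset values are hypothetical.

```python
import math


def absolute_differences(automatic, manual):
    """Per-trial absolute difference between automatic and manual onsets."""
    return [abs(a - m) for a, m in zip(automatic, manual)]


def regression_fit(automatic, manual):
    """Regress manual onsets on automatic onsets.

    Returns (slope, intercept, r_squared, residual_sd), where residual_sd
    is the maximum likelihood estimate (sum of squares divided by n).
    """
    n = len(automatic)
    mean_x = sum(automatic) / n
    mean_y = sum(manual) / n
    sxx = sum((x - mean_x) ** 2 for x in automatic)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(automatic, manual))
    syy = sum((y - mean_y) ** 2 for y in manual)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    residuals = [y - (intercept + slope * x) for x, y in zip(automatic, manual)]
    ss_res = sum(r * r for r in residuals)
    r_squared = 1.0 - ss_res / syy
    residual_sd = math.sqrt(ss_res / n)
    return slope, intercept, r_squared, residual_sd


# Hypothetical onsets (ms): the algorithm is always exactly 20 ms late.
manual = [500.0, 600.0, 700.0, 800.0]
automatic = [m + 20.0 for m in manual]

ads = absolute_differences(automatic, manual)
slope, intercept, r_squared, residual_sd = regression_fit(automatic, manual)
```

Here every AD is 20 ms, yet the regression fit is perfect (R2 of 1 and a residual SD of 0) because the constant offset is absorbed by the intercept. This is the sense in which a high R2 does not guarantee identical absolute onsets, and why the two measures are complementary.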
To benchmark the accuracy of Chronset and the other previously reported algorithms, we first compared the performance of Chronset against two frequently employed techniques: Epd (Bansal et al., 2001) and CheckVocal (Protopapas, 2007). In the case of CheckVocal, which was originally designed for semi-automatic onset detection, the onset latencies were accepted without visual inspection, so the reported performance of CheckVocal does not reflect the accuracy of semi-automatic analyses.
Cross-validation of optimized feature thresholds
We cross-validated the thresholds optimized on the basis of the first dataset (Spanish dataset) by testing the robustness of these same thresholds on a second set of audio recordings obtained from a sample of waveforms recorded in English (English dataset). These new waveforms had been used previously to develop and assess the reliability of the SayWhen onset detection software (Jansen & Watter, 2008). In total, this dataset comprised approximately 167 trials per participant (n = 22), from which voice onsets were identified manually by three human raters as well as by the SayWhen algorithm (total number of analyzed waveforms: 3,674). This second dataset thus provided a benchmark against which we examined the robustness of the optimized thresholds for influences specific to different languages and speakers, as well as to recording equipment-related influences. It also enabled us to compare our results to those from the SayWhen algorithm.
Human onset detection and interrater variability
Systematic differences between the individual raters were measured by using the intraclass correlation (ICC). Given that all raters independently coded all trials, the ICC was computed using a two-way mixed-effects model of the single-trial voice onset latencies (Shrout & Fleiss, 1979). According to this approach, an ICC value of 0 reflects the absence of agreement—that is, all scores differed systematically across raters—whereas an ICC value of 1 reflects perfect agreement between the raters. The different datasets employed a combination of ratings from new raters and the original ratings from the published research. In the case of the English dataset, raters segmented the raw continuous audio file into individual waveforms for each trial (for more details on the original data, see Jansen & Watter, 2008).
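For readers who want to reproduce this kind of analysis, the sketch below computes single-measure ICCs from the two-way ANOVA mean squares defined by Shrout and Fleiss (1979). The text does not state whether the consistency or the absolute-agreement form was used, so both are returned; that choice, the function name, and the example latencies are assumptions.

```python
def icc_two_way(ratings):
    """Single-measure ICCs from a trials-by-raters table.

    ratings: list of rows, one row per trial, one column per rater.
    Returns (consistency, agreement). Consistency ignores constant
    offsets between raters; agreement penalizes them.
    """
    n = len(ratings)        # number of trials
    k = len(ratings[0])     # number of raters
    grand = sum(sum(row) for row in ratings) / (n * k)
    row_means = [sum(row) / k for row in ratings]
    col_means = [sum(row[j] for row in ratings) / n for j in range(k)]
    ss_total = sum((x - grand) ** 2 for row in ratings for x in row)
    ss_trials = k * sum((m - grand) ** 2 for m in row_means)
    ss_raters = n * sum((m - grand) ** 2 for m in col_means)
    ss_error = ss_total - ss_trials - ss_raters
    ms_trials = ss_trials / (n - 1)
    ms_raters = ss_raters / (k - 1)
    ms_error = ss_error / ((n - 1) * (k - 1))
    consistency = (ms_trials - ms_error) / (ms_trials + (k - 1) * ms_error)
    agreement = (ms_trials - ms_error) / (
        ms_trials + (k - 1) * ms_error + k * (ms_raters - ms_error) / n
    )
    return consistency, agreement


# Hypothetical onsets (ms): rater 2 is always 10 ms later than rater 1.
ratings = [[500.0, 510.0], [600.0, 610.0], [700.0, 710.0], [800.0, 810.0]]
consistency, agreement = icc_two_way(ratings)
```

With a constant 10 ms offset between raters, the consistency form is at ceiling while the agreement form falls slightly below it, which shows how the choice of ICC form determines whether systematic rater differences are counted against agreement.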
Detection performance for distinct phonetic onset categories
To examine how the onset detection performance of Chronset may be affected by distinct phonetic onset types, such as unvoiced consonants, which are more challenging to detect than voiced vowels, we carried out a separate analysis in which we split the data from the English dataset into different groups of waveforms that corresponded to different phonetic onsets. The phonetic code of each individual waveform was determined by two human raters who listened to each waveform and then selected the corresponding phonetic onset code from the Carnegie Mellon University Pronouncing Dictionary (www.speech.cs.cmu.edu/cgi-bin/cmudict#about). We then examined the fit in terms of the AD and regression fits for each phonetic onset for both Chronset and SayWhen.
Agreement between human ratings for the speech recordings in the Spanish dataset
Comparison of automatic speech onset detection for the speech recordings in the Spanish dataset
Lower correspondences between the automatic and manual scores were observed for the two other examined algorithms. The proportions of regression residuals that were within the ±10 ms range remained below 10 % for both Epd and CheckVocal (Epd, 8 %, SD = 118.8 ms; CheckVocal, 6 %, SD = 164 ms; Figs. 2e and f), which was considerably lower than Chronset’s performance. The tendency for these algorithms to misestimate the regression residuals persisted across the entire distribution, as is shown in these figures and reflected in the poor summaries of the fits via R2 (Epd, R2 = .57, offset = 235 ms; CheckVocal, R2 = .18, offset = 651 ms; Figs. 2c and d). The proportions of AD scores that remained below 10 ms were fractionally higher for both CheckVocal (30 %; Fig. 2g) and Epd (33 %; Fig. 2h) than for Chronset. However, assessment of the cumulative density functions revealed that these algorithms produced highly variable misestimations throughout the 0 to 1,000 ms range. This is reflected by the standard deviation of the difference scores, which showed four to ten times more overall variability for these algorithms than for Chronset (SDs: CheckVocal = 313 ms, Epd = 129 ms). These algorithms were thus fractionally better at estimating latencies within the 10 ms window, but when they failed to do so, the misestimations were substantial. These deviations appear to be attributable primarily to false early detections.
Agreement between the human ratings for the speech recordings in the English dataset
Comparison of automatic accuracy levels for the speech recordings in the English dataset
Turning to the other algorithms, the correlations were substantially lower between the manual ratings and the automatic scores estimated by all of the other algorithms examined (SayWhen, R2 = .87, offset = 76 ms; Epd, R2 = .11, offset = 786 ms; CheckVocal, R2 = .45, offset = 451 ms; Figs. 4d–f). In addition, when compared to Chronset, lower proportions of the regression residuals were observed within the ±10 ms range for all of the other algorithms (SayWhen: 6 %, SD = 172 ms; Epd: 2 %, SD = 454 ms; CheckVocal: 3 %, SD = 356 ms; Figs. 4g–i). The higher accuracy of Chronset persisted when we examined the full distributions of regression residuals, and not only those falling within the 10 ms window. Turning to the AD scores, a different pattern of results emerged: Both SayWhen and Epd produced more AD scores within 10 ms of the human rating (64 % for SayWhen, 54 % for Epd), whereas CheckVocal produced only 17 % of its RTs in this window (Figs. 4j–l). This indicates that both SayWhen and Epd are capable of producing very precise RTs on a substantial number of trials. However, inspection of the full cumulative density functions showed that all three algorithms also produced more outlier misestimations outside the 50 ms range than did Chronset, particularly in the cases of Epd and CheckVocal. Thus, the extreme sensitivity to real speech onset displayed by these other algorithms is accompanied by an increased likelihood to be triggered prematurely by nonspeech sounds.
Comparison of speech onset detections for distinct phonetic onset categories
On the basis of inspection of the Chronset data for regression residuals, it is clear that estimating a few of the onsets (EH, AW, M, and TH) was more difficult, although performance was still relatively good (R2 range for these onsets: .75–.91). Performance for all of the other onsets was near ceiling (all R2s ≥ .96). Similarly, AD scores tended to decrease as R2 increased, although there were many small and a few more substantial rearrangements in performance. Of particular note is that, again, four phonemes showed notably worse performance than the others, with SDs of the AD scores >50 ms. Two of these items were also associated with the poorest R2 scores (EH and M), but the two others were not (AE and CH), although they still fell in the lower half of all observed R2 scores.
As compared to Chronset, SayWhen had a similar overall range of performance across all phonemes in terms of the regression fits (R2 range: .74–1.0). However, there was more variability and gradation in how well SayWhen was able to predict onsets for the different phonemes: Whereas Chronset only had three phonemes with R2 fits below .9, SayWhen had 16 such phonemes. Similarly, whereas Chronset had only four phonemes for which the SD in the overall AD scores was >50 ms, SayWhen had eight such phonemes.
These results suggest that the better performance for Chronset in overall R2 and its different distribution of AD scores is largely attributable to better fits for many of the phonemes that were most challenging for SayWhen (the fact that SayWhen still performed well on those phonemes notwithstanding). These results also point to two practical implications: On the one hand, they provide guidance for which types of onsets should be targeted in future improvements to automatic onset detection algorithms. On the other hand, these results suggest that insofar as some experimental conclusions can be derived without relying on the four most problematic onsets for Chronset, there should be an even smaller difference between manual onset detection and our automated approach.
In the present article, we report a novel approach that permits the automatic detection of speech onsets in audio recordings of the human voice, thereby allowing this measure to be used by researchers studying cognitive and perceptual processes including, but not limited to, speech (Coltheart, Rastle, Perry, Langdon, & Ziegler, 2001; Donders, 1868; Plaut, McClelland, Seidenberg, & Patterson, 1996; Posner & Mitchell, 1967; Sternberg, 1966; Stroop, 1935). Our results demonstrate that the onset latencies identified by Chronset are substantially more robust than those produced by other popular algorithms, particularly in terms of avoiding outliers, at the expense of a small degree of fine-grained precision relative to an alternative algorithm, SayWhen. They also show that our (and other) approaches produce measurements that are rapidly converging on “gold standard” human levels of precision. Related to this point, we also observed nontrivial variability in our raters’ generation of manual onsets, particularly with respect to the absolute value of the onset. Collectively, these observations indicate that a more rigorous consideration of the gold standard is warranted when comparing automated and manual onset detection in the future. This could include efforts to better estimate the gold standard by extracting the latent “true” onset by using dimensionality reduction, as we have done in our analyses. Similarly, it could include updating the gold standard to a composite measure, which would include both ADs and regression fits.
Advantage of using multiple features in speech onset detection
Threshold-dependent measures that estimate voice onset from fluctuations in the amplitude level of speech waveforms often fail to detect complex onsets of speech (Kessler et al., 2002; Rastle & Davis, 2002). Similarly, these techniques frequently produce false alarms if high-energy nonvocal or prevocal sounds occur prior to actual vocal responses (Rastle & Davis, 2002). Our results show that the combination of multiple features achieves higher accuracy levels than do approaches that detect voice onset only on the basis of amplitude fluctuations (albeit at the expense of a small amount of fine-grained precision). This claim is also supported by the substantially higher proportion of overall variability in human measurements that is explained by Chronset estimates but not by other algorithms, by the reduction in variability across different onset phonemes, and by the smaller set of phonemes whose estimation performance was not near ceiling. These differences in performance are likely due to the fact that amplitude is sensitive to absolute sound levels, and therefore cannot reliably discern loud nonspeech from genuine speech sounds, whereas a combination of features (some of which are amplitude-independent) can.
Optimization and cross-validation of feature thresholds in a different language and experimental setting
To derive a set of thresholds that would be well suited for application to a broad range of waveforms and yield accurate onset estimates, we used statistical optimization (Papadimitriou & Steiglitz, 1982) and tuned the thresholds for each individual feature by minimizing the SD of the regression residuals. This allowed us to identify a set of thresholds that were robust against speaker-specific differences and that achieved accuracy levels nearly identical to human observations. Moreover, we assessed the basic validity of our method by testing our optimized thresholds on a second novel dataset, which comprised speech waveforms recorded in a different language using different equipment and another task that required spoken responses. Together, these results support the robustness of our optimized thresholds against differences in language and lab equipment.
Toward a composite gold standard for algorithm benchmarking: Integrating insights from difference scores and regression fits
Differences between manual ratings and automatic scores have been reported in the literature to quantify the measurement error of automatic speech onset detection (Jansen & Watter, 2008). This measure has clear theoretical value, but is also limited in that it may be confounded in some cases by the measurement bias of human raters, as demonstrated in our own comparisons of interrater reliability. This variability in the thresholds at which human raters visually detect the onset of speech in audio waveforms can distort the error attributed to algorithm performance. For instance, if a human rater systematically rates the true speech onset with an offset of 10 ms, the mean deviation of the algorithm from human ratings might be shifted by 10 ms incorrectly, which can work either for or against the algorithm, depending on that algorithm’s own potential biases.
To circumvent this problem, in the present study we utilized regression residuals as a second measure to benchmark the performance of Chronset against other algorithms. Regression residuals measure the difference between automatic scores and the regression line, which represents the unobserved relationship between the automatic scores and the onset of speech. The regression line itself is fitted by minimizing the differences between the regression line and the manual and automatic scores (here using maximum likelihood estimation), thereby automatically minimizing the contribution of systematic bias in each variable (Cohen, Cohen, West, & Aiken, 2003). Thus, any consistent bias that was due to the rater or the algorithm would be minimized by the regression residuals, and only those values that did not lie on the regression line would generate a large nonzero value. Regression residuals can therefore be interpreted as a bias-free estimate of measurement error that is due to either the algorithm or the human rater, which is an important advantage over the classic measure based on AD scores.
Regression estimates, however, have their own limitations, in that they are more sensitive to extreme values than they are to small deviations from the regression line. They also lack the simple transparency offered by difference-score measures. Thus, we view the present research as proposing a novel gold standard for evaluating speech onset detection algorithms, one that includes both difference scores and regression residuals. These measures clearly offer complementary insights into how each algorithm’s responses align with manual responses, and together they can provide more targeted guidance for refining model development.
Simulation of the effects of measurement error on statistical power
Combining Chronset and SayWhen in a “mixture-of-experts” model
Taken together, the prior set of results highlights that our fully automated voice onset detection algorithm is able to provide sufficiently precise estimates as to have a negligible impact on the analyses of the results of a standard experiment. However, it is clear that in some settings (e.g., experiments involving extremely small numbers of trials or participants), an even more precise estimate of speech onset is important. One clue to how such an improved algorithm can be achieved emerges from a comparison of the R2 and raw difference score results. These data highlight that in many ways, Chronset and SayWhen (the next best algorithm in terms of R2, and the superior algorithm in terms of AD, particularly in the 10 ms window nearest the human ratings) exhibit complementary patterns of performance. Chronset sacrifices extremely high precision for individual latencies to ensure the robust estimation of relatively precise onsets in the absence of many outlier misestimations, the avoidance of which is especially important for a fully automatic onset detection procedure. In contrast, SayWhen achieves extremely precise estimates at the expense of many outliers, which may be an appropriate compromise in the context of a semi-automatic procedure in which outlier trials can be manually reinspected.
Related approaches in automatic voice onset time detection and voice activity detection
Voice onset time (VOT) reflects the delay between the beginning of a speech sound and the onset of vocal cord vibration, and has been applied to study phonetic perception (Clayards, Tanenhaus, Aslin, & Jacobs, 2008). Voice activity detection (VAD), in turn, is a speech-processing technique that detects the presence or absence of human speech (Ramírez, Górriz, & Segura, 2007), and has been applied in telecommunication. Recent work in automatic VOT detection has achieved accuracy levels that are similar to human precision levels by combining multidimensional feature extraction from speech signals with machine learning (Lin & Wang, 2011; Sonderegger & Keshet, 2012). Similarly, feature extraction and machine learning techniques have been applied to improve the performance of VAD in several fields (Kim, Chin, & Chang, 2013; Park et al., 2014). These studies therefore support the present approach toward the automatic extraction of onset latencies from human speech in the context of behavioral experiments. Moreover, the Chronset algorithm may help inform algorithms designed for automatic VOT detection, given that VOT measurements depend on the accurate detection of speech onset latencies (Das & Hansen, 2004).
Conclusion and outlook for Chronset
Our data show that multiple features can enhance the accuracy of automatic speech onset detection as compared to several previously reported approaches, particularly with respect to extreme misestimations. These findings are robust against at least some language- and speaker-specific influences in standard laboratory settings, as demonstrated by our tests of performance on two distinct datasets and languages. Of course, additional work remains to establish how broadly the present features generalize. This is a critical and fundamentally empirical question, and one that is typically not even asked in studies of automatic speech onset detection.
To facilitate answering this question, we have provided an easy-to-use Web platform, as well as the full source code for Chronset, for use by other researchers. These tools should also enable the rapid reoptimization of Chronset’s parameters if other data sources are discovered in which performance is suboptimal, as may be the case with some specialized populations (e.g., onset detection in children or patient populations).
As we highlighted with our “mixture-of-experts” algorithm, the availability of our source code and an automated platform for onset estimation may also be useful for combining the unique strengths and overcoming the weaknesses of multiple estimation algorithms, to improve performance above the level of Chronset’s (or any other algorithm’s) performance in isolation. Such comparisons may also be especially fruitful in identifying the areas of a particular algorithm that may benefit from targeted improvement. Indeed, the detailed comparisons that we conducted between Chronset and the current standard models in the field have helped point us toward improving Chronset’s onset sensitivity within the < 50 ms range. Future work will evaluate whether improvements in this window can be achieved through a more extensive and computationally intensive optimization of the parameters governing the temporal smoothing that helps Chronset achieve robust overall patterns of performance, and through the incorporation of additional features into the current feature set.
Chronset is available for public use under the GNU General Public License through the Chronset website (www.bcbl.eu/databases/chronset), either through a Web interface or by downloading and running the source code. To estimate speech onset latencies automatically via the Chronset website, speech recordings must be uploaded to the website in .wav format. Once the files are uploaded, they will be processed using Chronset (average processing time per .wav file of ~1–15 s, depending on the server load). The resulting onset latencies will be sent via email message once Chronset has finished processing each file. Using the source version of Chronset, it is also possible to use parallel computing to process multiple files simultaneously.
We thank P. Jansen and two anonymous reviewers for their constructive comments and for helping us improve the present article, especially by motivating the mixture-of-experts model and our discussion of the merits of different measures of model fit. F.R. and B.C.A. were both supported by Marie Skłodowska-Curie grants (to F.R., PIEF-GA-2013-62772; to B.C.A., PIIF-GA-2013-627784). M.C. was supported by the BCBL and Ikerbasque, the Basque Foundation for Science, and the European Research Council (Grant No. ERC-2011-ADG-295362). B.C.A. and M.C. were also supported by the Severo Ochoa program, Grant No. SEV-2015-049 awarded to the BCBL. The authors thank C. Martin and J. Saddat, as well as P. Jansen and S. Watter, for generously making their data available to facilitate the development of Chronset and for discussion of the original research. Finally, we also thank the research assistants involved in the manual coding of onset latencies.
- Armstrong, B. C., Ruiz-Blondet, M., Khalifian, N., Zanpeng, J. J., Kurtz, K. J., & Laszlo, S. (2015). Brainprint: Assessing the uniqueness, collectability, and permanence of a novel method for ERP biometrics. Neurocomputing, 166, 59–66. doi:10.1016/j.neucom.2015.04.025
- Bansal, P., Griffin, Z., & Spieler, D. (2001). Epd: Matlab wave file parsing software. Retrieved from oak.psych.gatech.edu/~spieler/software.html
- Bogert, B., & Healy, M. (1963). The frequency analysis of time series for echoes: Cepstrum, pseudo-autocovariance, cross-cepstrum and Saphe cracking. In M. Rosenblatt (Ed.), Time series analysis (pp. 209–243). New York: Wiley.
- Cohen, J., Cohen, P., West, S. G., & Aiken, L. S. (2003). Applied multiple regression/correlation analysis for the behavioral sciences (3rd ed.). Mahwah: Erlbaum.
- Das, S., & Hansen, J. H. L. (2004). Detection of voice onset time (VOT) for unvoiced stops (/p/,/t/,/k/) using the Teager energy operator (TEO) for automatic detection of accented English. In Proceedings of the 6th Nordic Signal Processing Symposium, NORSIG 2004 (pp. 344–347). Piscataway: IEEE Press.
- Donders, F. C. (1868). Die Schnelligkeit psychischer Prozesse. Archiv für Anatomie und Physiologie und Wissenschaftliche Medizin, III, 269–317.
- Green, D. M., & Swets, J. A. (1966). Signal detection theory and psychophysics. New York: Wiley.
- Hinton, G. E., & Sejnowski, T. J. (1986). Learning and relearning in Boltzmann machines. In D. E. Rumelhart, J. L. McClelland, & the PDP Research Group (Eds.), Parallel distributed processing: Explorations in the microstructure of cognition (Vol. 1, pp. 282–317). Cambridge, MA: MIT Press.
- Morrow, J. R., Jr., Mood, D., Disch, J. G., & Kang, M. (2016). Measurement and evaluation in human performance (5th ed.). Champaign: Human Kinetics.
- Oppenheim, A. V., & Schafer, R. W. (1989). Discrete-time signal processing. Upper Saddle River: Prentice-Hall.
- Papadimitriou, C. H., & Steiglitz, K. (1982). Combinatorial optimization: Algorithms and complexity. Mineola: Dover.
- Park, J., Kim, W., Han, D. K., & Ko, H. (2014). Voice activity detection in noisy environments based on double-combined Fourier transform and line fitting. Scientific World Journal, 2014, 146040.
- Ramírez, J., Górriz, J. M., & Segura, J. C. (2007). Voice activity detection: Fundamentals and speech recognition system robustness. In M. Grimm & K. Kroschel (Eds.), Robust speech recognition and understanding (pp. 1–22). Vienna: I-Tech.
Open Access This article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made.