4.1 Submission Statistics
During the preparations for the evaluation [21], it was decided that in written publications comparative results [22] are to be presented anonymously, but that individual sites can of course present their own results [1, 3, 9]. This policy was inspired by the practice of the very successful NIST Speaker Recognition campaigns; for N-Best, the most important reason was to make the evaluation more attractive to industrial participants. However, one industrial subscriber to the evaluation pulled out at the last moment, so the anonymity in this publication only serves to adhere to the original agreements.
There were seven sites participating in the evaluation, including the five ASR sites from the N-Best project. Six of these submitted results before the deadline, totaling 52 submissions distributed over the four primary tasks. Each of the six sites included its primary system in these submissions. One participant (“sys 1”) deferred receiving the first results until about 3 days after these had been sent to the other five participants, in order to finish two ‘unlimited time’ contrastive runs for its CTS system.
One of the participants (“sys 4”) did not submit results until about 4 months after the official deadline, due to unavailability of personnel, but refrained from interaction with any of the involved parties in the meantime. A delay of this size is quite unusual in formal evaluations, and it is difficult to guarantee that no information about the evaluation reached this participant.
“Sys 3” ran three different ASR systems and four different runs of its main system. “Sys 2” ran a single-pass system as a contrast to its multi-pass primary system, and “sys 5” ran a contrastive language-model system. Finally, ‘sys 6’ and ‘sys 7’ submitted only the required minimum of the four primary tasks.
4.2 Primary Evaluation Results
Results for all seven primary systems in the primary conditions in all four primary tasks are shown in Table 15.3 and are plotted in Fig. 15.1. The systems are numbered in order of increasing average word error rate (WER) over the primary tasks. It should perhaps be noted here that ‘sys 1,’ showing the lowest word error rates for all tasks, submitted a 10 × RT system as its primary system, and had a slightly better performing ‘unlimited time’ contrastive system, which is still in accordance with the rules.
Table 15.3 Overall results of N-Best 2008. Figures indicate the WER, in %. Systems with * indicate late submissions.

We can observe from the results that CTS gives higher error rates than BN, which is consistent with results reported for English [5]. Apart from the smaller bandwidth of the audio channel, CTS also contains more spontaneous speech than the more prepared speech style that is characteristic of BN. The acoustics of CTS also contain more regional variability than speech broadcast on radio and television, so the acoustic models have less spectral information with which to model more widely varying acoustic realisations of the sounds of the language. Another effect that gives BN data fewer errors than CTS data is that the majority of the language model training material matches the linguistic content of the BN speech better than that of CTS.
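Since all figures in this chapter are reported as WER, the minimal sketch below (in Python, with function names of our own choosing) illustrates how WER is obtained from a reference and a hypothesis transcript via a standard Levenshtein alignment; the text normalisation applied by the actual scoring tools is omitted here.

```python
# Minimal WER sketch (illustrative only): the edit distance between the
# reference and hypothesis word sequences counts substitutions, deletions
# and insertions; WER = (S + D + I) / N, with N the number of reference words.

def wer(reference: str, hypothesis: str) -> float:
    ref = reference.split()
    hyp = hypothesis.split()

    # d[i][j] = minimal edit cost aligning the first i reference words
    # with the first j hypothesis words.
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(1, len(ref) + 1):
        d[i][0] = i                      # i deletions
    for j in range(1, len(hyp) + 1):
        d[0][j] = j                      # j insertions
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            sub = d[i - 1][j - 1] + (ref[i - 1] != hyp[j - 1])
            dele = d[i - 1][j] + 1
            ins = d[i][j - 1] + 1
            d[i][j] = min(sub, dele, ins)

    return 100.0 * d[len(ref)][len(hyp)] / max(len(ref), 1)


if __name__ == "__main__":
    # Hypothetical example: one substitution and one deletion against a
    # six-word reference gives a WER of 33.3 %.
    print(wer("het nieuws van acht uur vanavond",
              "het nieuws van acht vanmiddag"))
```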
4.3 Focus Conditions for Broadcast News Data
NIST has defined standard ‘focus conditions’ for the various types of speech that may appear in BN material: clean speech, spontaneous speech, telephone speech, speech with background noise, and degraded speech. SPEX has annotated the test material for these five standard focus conditions, but these conditions were not included in the selection criteria for the final evaluation material. Hence, the amount of data found in each of the focus conditions is not homogeneously distributed. In Table 15.4 and Fig. 15.2 the WER performance conditioned on focus condition, regional variant and speaker’s sex is shown in various combinations.
Table 15.4 BN performance expressed in WER (in %), as plotted in Fig. 15.2, but separated for the Northern (left) and Southern (right) regional variants. Also indicated is the number of words N_w over which the statistics are calculated (‘k’ means 1,000). Systems with * indicate late submissions. Focus conditions are: all, clean speech, spontaneous speech, telephone speech, speech with background noise, and degraded speech.

Even though the performance varies widely over the different systems, ranging from 10 to 60 %, the clean focus condition clearly shows a lower WER, which is not surprising. Some systems took a particularly big hit on telephone speech in the NL regional variant. This may result from the way the BN training material (and therefore the dry-run test material) is organised in CGN: contrary to the VL variant, CGN does not contain whole news shows for the NL variant. It is conjectured that the systems that proved particularly vulnerable to telephone speech concentrated more on the NL part during development, and may have missed the fact that BN shows can contain this type of speech. This is consistent with deletions being the most frequent error type for these systems in the telephone condition.
However, the performance on telephone speech in BN is still a lot better than in the CTS task for all systems, with one notable exception: ‘sys 6’ for NL. This system’s CTS performance is actually better than its performance in the BN telephone focus condition. This could be explained by ‘sys 6’ not detecting telephone speech in NL BN data, and thus not benefiting from its relatively good NL CTS acoustic models.
4.4 Other Accent Effect
Related to this is the analysis of results by origin of the partner. Of the N-Best project partners, the partners located in Belgium performed relatively well on Southern Dutch, while the Dutch university performed better on the Northern Dutch variant. This can be appreciated from Fig. 15.3, where the interaction between the participant’s home country (North for The Netherlands, South for Belgium) and the regional variant of the speech is shown. This is, in a way, similar to the famous ‘Other Race Effect’ in human face recognition, which is also observed for automatic face recognition systems [18]. We therefore coin this the ‘Other Accent Effect.’ We have no direct evidence for why this occurs, but one reason could be the choice of phone set, pronunciation dictionary and grapheme-to-phoneme conversion tools. This is one part of the ASR systems that was not specified as part of the primary training conditions. We can surmise that the researchers had a better-quality dictionary for their own regional accent than for the other region.
4.5 Contrastive System Conditions
Three sites submitted contrastive system conditions. ‘Sys 1’ submitted contrastive results showing the effect of processing speed. In Fig. 15.4 it can be seen that stricter processing-speed restrictions have a negative effect on performance, but that there is probably hardly any benefit in going beyond 10 × RT.
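As a brief aside, the real-time (RT) factor used here relates processing time to the duration of the audio being recognised; the following minimal sketch, with hypothetical numbers, makes the relation explicit.

```python
# Illustrative only: a "10 x RT" system may spend up to ten times the audio
# duration on decoding.

def rt_factor(processing_seconds: float, audio_seconds: float) -> float:
    """Return how many times slower than real time the processing ran."""
    return processing_seconds / audio_seconds

# Hypothetical example: decoding a 30-minute broadcast in 5 hours of CPU time
# corresponds to a factor of 10 x RT.
print(rt_factor(5 * 3600, 30 * 60))  # -> 10.0
```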
‘Sys 2’ ran a single-pass system as a contrast to its multi-pass primary system. The results show a quite consistent improvement in WER of approximately 10 %-point for all primary tasks when running the multi-pass system (Fig. 15.5).
Finally, ‘sys 3’ submitted many different contrastive conditions. The main variation was in system architecture: this site submitted results based on SoftSound’s ‘Abbot’ hybrid Neural Net/Hidden Markov Model (HMM) system [19], the site’s own ‘SHoUT’ recogniser [9], and ‘Sonic’ (University of Colorado, [17]), the latter two being pure HMM systems. Using SHoUT, both single-pass and double-pass system results were submitted, and additionally a ‘bugfix’ version of each of these two was scored by the coordinator. A plot comparing all of these submissions from ‘sys 3’ is shown in Fig. 15.6. The multi-pass systems did not improve either of the HMM systems very much: about 1 %-point for BN in both accent regions in the case of SHoUT’s SMAPLR (structural Maximum A Posteriori Linear Regression) adaptation technique, and about 0.5 %-point for Sonic’s CMLLR (Constrained Maximum Likelihood Linear Regression) implementation.
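Both SMAPLR and CMLLR adapt the acoustic models to the test data with linear transforms estimated in a second pass. As a rough illustration of the CMLLR idea only, the sketch below applies an already-estimated feature-space affine transform to a matrix of acoustic features; the maximum-likelihood estimation of the transform itself is out of scope, and all names, shapes and values are hypothetical.

```python
# Illustrative only: CMLLR (also known as fMLLR) adapts a speaker's features
# with a single affine transform, x' = A x + b, shared by all Gaussians of
# the acoustic model. This sketch only shows how an already-estimated
# transform would be applied during a second decoding pass.

import numpy as np

def apply_cmllr(features: np.ndarray, A: np.ndarray, b: np.ndarray) -> np.ndarray:
    """Apply a constrained MLLR transform to a (frames x dim) feature matrix."""
    return features @ A.T + b

# Hypothetical example with 13-dimensional features (e.g. MFCCs).
dim = 13
frames = np.random.randn(100, dim)                   # stand-in for real features
A = np.eye(dim) + 0.01 * np.random.randn(dim, dim)   # near-identity transform
b = 0.1 * np.random.randn(dim)                       # bias term
adapted = apply_cmllr(frames, A, b)
print(adapted.shape)                                 # (100, 13)
```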