In this section we investigate under which circumstances the proposed lexical methodology can enhance the name recognition performance. We first conduct experiments on the ASNC, which covers the person and topographical name domains. Then, we verify whether our conclusions remain valid when we move to another domain, namely the point-of-interest (POI) domain.
5.1 Modes of Operation
Since in certain situations it is plausible to presume prior knowledge of the speaker tongue and/or the name source, three relevant modes of operation of the recogniser are considered:
M1 : In this mode, the speaker tongue and the source of the inquired name are a priori known. Think, for instance, of a tourist who uses a voice-driven GPS system to find his way in a foreign country where all names (geographical names, POI names) originate from the language spoken in that country.
M2 : In this mode, the speaker tongue is known but names from different sources can be inquired. Think of the same tourist who is now traveling in a multilingual country like Belgium where the names can either be Dutch, English, French, German, or a mixture of those.
M3 : In this mode, neither the mother tongue of the actual user nor the source of the inquired name is a priori known. This mode applies, for instance, to an automatic call routing service of an international company.
The first experiments are carried out under the assumption of mode M1. In that case, we know in which cell we are and we only add variants for names that can occur in that cell. Furthermore, we can in principle use a different P2P converter in each cell. However, since for the ASNC names we only had typical native Dutch transcriptions, we could actually train only four P2P converters, one per name source. Each P2P converter is learned on a lexical database containing one entry (orthography + Dutch G2P transcription + typical Dutch transcription) per name of the targeted name source.
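The per-source training databases can be sketched as follows. The entry layout mirrors the description above (orthography + Dutch G2P transcription + typical Dutch transcription per name), but the type names, the grouping helper and the example transcriptions are illustrative, not taken from the actual system.

```python
from collections import defaultdict
from typing import NamedTuple

class LexEntry(NamedTuple):
    orthography: str   # name spelling
    g2p: str           # automatic Dutch G2P transcription (P2P input)
    typical: str       # typical native Dutch transcription (P2P target)
    source: str        # name source: "DU", "EN", "FR" or "NN2"

def p2p_training_sets(entries):
    """Group entries by name source: one P2P converter is trained per
    source, learning to map the G2P transcription onto the typical one."""
    sets = defaultdict(list)
    for e in entries:
        sets[e.source].append((e.g2p, e.typical))
    return dict(sets)

# Illustrative entries; the phoneme strings are made up for the example.
entries = [
    LexEntry("Duivenstraat", "d Y v @ s t r a t", "d Y v @ n s t r a t", "DU"),
    LexEntry("Butrus Benhida", "b Y t r Y s", "b u t r u s", "NN2"),
]
training = p2p_training_sets(entries)
```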
5.2 Effectiveness of P2P Variants
After evaluating the transcription accuracy improvement as a function of the number of selected P2P variants, we came to the conclusion (cf. ) that adding only the four most likely P2P variants to the baseline lexicon is a viable option. By doing so, we obtained the NERs listed in Table 14.3.
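The variant-selection step can be illustrated with the following sketch; the scored-variant format and all names and probabilities are hypothetical, not output of the actual P2P converters.

```python
def add_p2p_variants(baseline, scored_variants, k=4):
    """Extend a baseline lexicon (name -> list of transcriptions) with the
    k most likely P2P variants per name, skipping duplicates of entries
    already in the lexicon."""
    lexicon = {name: list(prons) for name, prons in baseline.items()}
    for name, variants in scored_variants.items():
        # variants: list of (transcription, probability) pairs
        top = sorted(variants, key=lambda v: v[1], reverse=True)[:k]
        for pron, _ in top:
            if pron not in lexicon[name]:
                lexicon[name].append(pron)
    return lexicon

baseline = {"duivenstraat": ["d Y v @ s t r a t"]}
variants = {"duivenstraat": [("d Y v @ n s t r a t", 0.4),
                             ("d A u v @ s t r a t", 0.1),
                             ("d Y v @ s t r a t", 0.3)]}
lexicon = add_p2p_variants(baseline, variants)
```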
The most substantial improvement (47 % relative) is obtained for Dutch speakers reading NN2 names. For Dutch speakers reading French names, no improvement is observed. The gains in all other cells are more modest (10–25 % relative), but nevertheless statistically significant (p < 0.05, and even p < 0.01 for Dutch and NN2 names uttered by Dutch speakers; Footnote 6).
The absence of a gain for native speakers reading French names is partly due to the fact that the margin for improvement was very small (the baseline G2P system makes only seven errors in that cell, cf. also Table 14.2). Furthermore, the number of examples available for P2P training is limited for French names. While there are 1,676 training instances for Dutch names, there are only 322 for English names, 161 for French names and 371 for NN2 names. Therefore, we performed an additional experiment in which the sets of English and French training names were extended with 684 English and 731 French names not appearing in the ASNC test set. The name set including these extensions is called ASNC+. Training on this set does lead to a performance gain for French names. Moreover, the gain for English names becomes significant at the level of p < 0.01 (cf. Table 14.3).
In summary, given enough typical transcriptions to train a P2P converter, our methodology yields a statistically significant (p < 0.01) reduction of the NER for (almost) all cells involving Dutch natives. For the utterances of non-natives the improvements are only significant at the level of p < 0.05 for speakers whose mother tongue is covered by the acoustic model. This is not surprising, since the Dutch typical transcriptions that we used for the P2P training were not expected to represent non-native pronunciations. Larger gains are anticipated with dedicated typical training transcriptions for these cells.
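The chapter reports significance levels without naming the test. A common choice for comparing two recognisers evaluated on the same utterances is McNemar's test, sketched below with an exact binomial p-value; this is an assumption for illustration, not necessarily the test used in the study.

```python
from math import comb

def mcnemar_exact(n01, n10):
    """Exact two-sided McNemar test.
    n01: utterances the baseline recognised correctly but the P2P system
         got wrong; n10: the reverse case.
    Returns a p-value for the hypothesis that both systems are equally
    accurate (discordant pairs split 50/50)."""
    n = n01 + n10
    k = min(n01, n10)
    # exact binomial tail probability with p = 0.5, doubled for two sides
    tail = sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    return min(1.0, 2 * tail)
```

With 10 errors corrected against 1 newly introduced, the difference is significant at p < 0.05; with a 5-versus-6 split it clearly is not.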
5.3 Analysis of Recognition Improvements
Our first hypothesis concerning the good results for native speakers was that for these speakers, there is not that much variation to model within a cell. Hence, one single TY transcription target per name might be sufficient to learn good P2P converters. To verify this hypothesis we measured, per cell, the fraction of training utterances for which the auditorily verified transcription is not included in the baseline G2P lexicon. This was the case for 33 % of the utterances in cell (DU,DU), around 50 % in (DU,EN) and (DU,FR) and around 75 % in all other cells, including (DU,NN2) for which we also observed a big improvement.
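The measurement described above can be sketched as follows; the lexicon and utterance data are illustrative stand-ins for one cell of the corpus.

```python
def out_of_lexicon_fraction(utterances, lexicon):
    """Fraction of utterances whose auditorily verified transcription is
    not among the baseline G2P transcriptions of the uttered name."""
    missing = sum(1 for name, verified in utterances
                  if verified not in lexicon.get(name, []))
    return missing / len(utterances)

# Illustrative data: one of three verified transcriptions is out of lexicon.
lexicon = {"duivenstraat": ["d Y v @ s t r a t"]}
utterances = [("duivenstraat", "d Y v @ s t r a t"),
              ("duivenstraat", "d Y v @ n s t r a t"),
              ("duivenstraat", "d Y v @ s t r a t")]
frac = out_of_lexicon_fraction(utterances, lexicon)
```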
The small improvement achieved for NN2 speakers reading Dutch names is due to the fact that many NN2 speakers have a low proficiency in reading Dutch, which implies that they often produce very atypical phonetisations. The latter are not modeled by the Dutch typical transcriptions in our lexical database. Another observation is that NN2 speakers often hesitate a lot while uttering a native name (cf. ), and these hesitations are not modeled at all either.
In order to find an explanation for the good results for Dutch speakers reading NN2 names, we have compared two sets of P2P converters: one trained towards typical transcriptions and one trained towards ideal (auditorily verified) transcriptions as targets. We have recorded how many times the two P2P converters correct the same recognition error in a cell and how many times only one of them does. Figure 14.3 shows the results for the four cells comprising Dutch speakers. Footnote 7
It is remarkable that in cell (DU,NN2) the percentage of errors corrected by both P2P converters is significantly larger than in the other cells. Digging deeper, we came to the conclusion that most of these common corrections were caused by a small number of simple vowel substitution rules that are picked up by both P2P converters because they represent truly systematic discrepancies between the G2P and the typical transcriptions. The most decisive rules express that the frequently occurring letter "u" in NN2 names (e.g. Curukluk Sokagi, Butrus Benhida, Oglumus Rasuli, etc.) is often pronounced as /u/ (as in "boot") while it is transcribed as /Y/ (as in "mud") or /y/ (as in the French "cru") by the G2P converter.
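The effect of such a substitution rule can be illustrated with a toy example. Real P2P rules are context-dependent; the context-free helper and the transcription of "Butrus" below are illustrative only.

```python
def apply_rule(phonemes, source, target):
    """Apply a context-free phoneme substitution rule (source -> target)
    to a transcription given as a list of phoneme symbols."""
    return [target if p == source else p for p in phonemes]

# Illustrative G2P output for "Butrus": letter "u" transcribed as /Y/.
g2p = "b Y t r Y s".split()
# The rule maps /Y/ to /u/, the vowel NN2 names are typically read with.
typical = apply_rule(g2p, "Y", "u")
```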
Similarly, we have also examined for which names the P2P variants make a positive difference in the other cells. Table 14.4 gives some representative examples of names that were more often correctly recognised after we added P2P variants.
An interesting finding (Table 14.4) is that a minor change in the name transcription (one or two phoneme modifications) can make a huge difference in the recognition accuracy. The insertion of an /n/ in the pronunciation of “Duivenstraat” for instance leads to five corrected errors out of six occurrences.
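The size of such a change can be quantified as an edit distance over phoneme symbols rather than characters; a standard Levenshtein sketch (the transcriptions are illustrative):

```python
def phoneme_edit_distance(a, b):
    """Levenshtein distance between two transcriptions given as lists of
    phoneme symbols."""
    prev = list(range(len(b) + 1))
    for i, pa in enumerate(a, 1):
        cur = [i]
        for j, pb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                  # deletion
                           cur[j - 1] + 1,               # insertion
                           prev[j - 1] + (pa != pb)))    # substitution
        prev = cur
    return prev[-1]

# Illustrative transcriptions of "Duivenstraat": the P2P variant differs
# from the baseline G2P output by a single inserted /n/.
g2p = "d Y v @ s t r a t".split()
variant = "d Y v @ n s t r a t".split()
```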
5.4 Effectiveness of Variants in Mode M2
So far, it was assumed that the recogniser knows the mother tongue of the user and the origin of the name that will be uttered (mode M1). In many applications, including the envisaged POI business service, a speaker of the targeted group (e.g. the Dutch speakers) can inquire about names of different origins. In that case, we can let the same P2P converters as before generate variants for the names they are designed for, and incorporate all these variants simultaneously in the lexicon. With such a lexicon we obtained the results listed in Table 14.5.
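Building the mode-M2 lexicon amounts to merging the source-specific variants into one shared lexicon; a sketch in which the converter interface, the language tags and all transcriptions are stand-ins:

```python
def build_m2_lexicon(baseline, name_sources, converters, k=4):
    """Mode M2: each source-specific P2P converter generates variants only
    for names of its own source, but all variants enter one lexicon."""
    lexicon = {name: list(prons) for name, prons in baseline.items()}
    for name, source in name_sources.items():
        g2p_pron = lexicon[name][0]        # baseline G2P transcription
        for variant in converters[source](g2p_pron)[:k]:
            if variant not in lexicon[name]:
                lexicon[name].append(variant)
    return lexicon

# Stand-ins for the trained P2P converters (one per name source).
converters = {
    "DU": lambda pron: ["d Y v @ n s t r a t"],
    "EN": lambda pron: ["tS 9: r tS I l a n"],
}
baseline = {"duivenstraat": ["d Y v @ s t r a t"],
            "churchilllaan": ["x Y r x I l a n"]}
sources = {"duivenstraat": "DU", "churchilllaan": "EN"}
m2 = build_m2_lexicon(baseline, sources, converters)
```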
For the pure native situation, the gain attainable under mode M2 is only 50 % of the gain that was achieved under mode M1. However, for cross-lingual cases (apart from the French names case), most of the gain achieved under mode M1 is preserved under mode M2. Note that in case of the French names, the sample size is small and the difference between 1.4 and 2.2 % is only a difference of three errors.
5.5 Effectiveness of Variants in Mode M3
In case neither the mother tongue of the speaker nor the origin of the name is given beforehand (mode M3), the recognition task becomes even more challenging. Then variants for all name sources and the most relevant speaker tongues have to be added at once.
Since we had no typical non-native pronunciations of Dutch names at our disposal, a fully realistic evaluation of mode M3 was not possible. Consequently, our lexicon remained the same as that used for mode M2, meaning that the results for native speakers remain unaffected. The results for the non-native speakers are listed in Table 14.6. They are contrasted with the baseline results and with the results of lexical modeling under mode M1.
The figures show that in every cell about 50 % of the gain is preserved. This implies that lexical modeling for proper name recognition in general is worthwhile to consider.
5.6 Evaluation of the Method in the POI Domain
In a final evaluation it was verified whether the insights acquired with person and topographical names transfer to the new POI domain. For the training of the P2P converters we had 3,832 unique Dutch, 425 unique English and 216 unique French POI names available, each delivered with one or more plausible native Dutch transcriptions Footnote 8 and a language tag. Since there was far less training material for French and English names, we also compiled an extended dataset (POI+) by adding the French and English training instances of the ASNC+ dataset.
For the experimental evaluation of our method, we used the POI name corpus that was created in Autonomata Too, and that is described in Chap. 4 of this book (cf. Sect. 4.5, p. 74) and in .
Here we just recall that Dutch speakers were asked to read Dutch, English, French and mixed origin (Dutch-English, Dutch-French) POI, while foreign speakers were asked to read Dutch and mixed origin POI only. The recordings were conducted such that the emphasis was on the cases of Dutch natives reading foreign POI and on non-natives reading Dutch POI.
The vocabulary of the recogniser consisted of 10K POI: all POI spoken in the POI name corpus, supplemented with additional POI that were drawn from background POI lexica provided by TeleAtlas. There was no overlap between this vocabulary and the POI set that was available for P2P training. Also, none of the POI occur in the ASNC.
Table 14.7 shows NER results for Dutch utterances under the assumptions of modes M1 and M2 respectively. Table 14.8 depicts similar results for the non-native speakers.
The data support the portability of our methodology. Adding P2P variants for POI in mode M1 strongly reduces the NER for Dutch native speakers and modestly improves the recognition for non-native speakers. In mode M2, the overall result that a substantial part of the gain is preserved still holds. However, there are differences in the details. We now see a good preservation of the gain obtained in the purely native case, but the gains in the cross-lingual settings are more diverse. The preserved gain ranges from only 22 % (for Dutch speakers reading English names, with an extended training set) to 100 % (for English and French speakers reading Dutch names).
Furthermore, we see how an extended training set for English and French POI yields no improvement for English POI and only a small gain for French POI. This either reflects that the ASNC proper name transcriptions are not suited as training material for POI names, or that relevant information regarding the “correct” transcription of proper names can already be captured with a limited training set of name transcriptions. To verify the latter hypothesis, we performed two additional mode M1 recognition experiments for Dutch POI in which only one fourth (corresponding to about 1K unique names, 1.7K training instances) and one sixth (corresponding to about 1K training instances, for nearly 650 unique training names) of the training set names for Dutch POI were included for the P2P converter training. We found that for both set-ups the NER was even (slightly) lower than before (6.4 % for 1K unique training POI names and 6.5 % for 1K training instances). We therefore argue that a limited training set of around 1K transcribed training names will typically be sufficient to learn a good P2P converter.
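The training-set reduction experiment can be sketched as follows. Sampling by unique name, keeping all training instances of each kept name, is one plausible reading of the set-up described above; the helper and data are illustrative.

```python
import random

def subsample_by_names(pairs, n_names, seed=0):
    """Keep all training instances (name, transcription) of a random
    subset of n_names unique names, e.g. one fourth of the Dutch POI
    names, to probe how much training material the P2P converter needs."""
    names = sorted({name for name, _ in pairs})
    keep = set(random.Random(seed).sample(names, n_names))
    return [(n, t) for n, t in pairs if n in keep]

# Illustrative training pairs: 4 unique names, one with two instances.
pairs = [("a", "t1"), ("a", "t2"), ("b", "t3"), ("c", "t4"), ("d", "t5")]
subset = subsample_by_names(pairs, 2)
```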
A qualitative evaluation of the improvements induced by the P2P transcriptions has been performed as well and is described in . That evaluation confirmed that relatively simple phoneme conversions (substitutions, deletions, insertions) account for most of the obtained NER gains, but that a large number of more structural variations (e.g. syllable-sized segment deletions) are not modeled by the P2P converters. An explicit modeling of these variations, possibly by means of other techniques, could further improve the performance of the POI recogniser.