Introduction

Experimental psychologists have studied the processes underlying reading aloud for well over a century (e.g. Huey, 1908). According to a longstanding account, skilled readers make use of two distinct procedures operating in parallel to translate orthography into phonology: a lexical procedure that looks up the pronunciation of familiar words, and a rule-based procedure that computes a pronunciation via the application of grapheme-phoneme correspondence rules such as ea → /i:/ (e.g. Coltheart, 1987; Morton, 1980; Ellis & Young, 2013). This dual-route account was subsequently implemented as a computational model, the dual-route cascaded model (DRC; Coltheart et al., 2001). A key motivation for positing two distinct procedures is to explain the ability to read aloud both irregular words like “pint” and nonwords like “slint”. However, this dual-route account was challenged by Seidenberg and McClelland (1989), who argued that a gradient descent learning mechanism could give rise to a set of probabilistic associations that are sufficient to explain how both irregular words and nonwords can be read aloud. Computational models instantiating this theoretical claim were introduced by Seidenberg and McClelland (1989) and Plaut et al., (1996). A third approach combines aspects of the above two accounts, pairing a lexical procedure with a probabilistic association pathway that learns to map graphemes to phonemes (Perry et al., 2007, 2010).

Assessing models of reading aloud: Rule-based, analogy-based and corpus-based methods of scoring nonword pronunciation

These models of reading aloud can be tested by examining their ability to generalise their knowledge of spelling-sound correspondences to novel stimuli, i.e. their ability to name nonwords. Indeed, a fundamental challenge to Seidenberg and McClelland (1989)’s model was offered by Besner et al., (1990), whose analysis of the model’s nonword naming performance concluded that the model “failed to produce the correct phonological response to almost 50% of stimuli” (p. 434). Plaut et al., (1996) conceded that the model’s ability to name nonwords was significantly worse than that of skilled readers, but also noted that Besner et al., (1990)’s method of assessing the model was somewhat problematic. Besner et al., (1990) scored as correct any pronunciation of a nonword that followed grapheme-phoneme correspondence rules (i.e. correctness was based solely on regularity). However, this is an unfair test, as readers do not always follow grapheme-phoneme correspondence rules when pronouncing nonwords (e.g. Andrews and Scarratt, 1998; Glushko, 1979). By the same token, Coltheart et al., (2001)’s scoring of the DRC model’s naming of nonwords – which adopted the same rule-based method as Besner et al., (1990) – was overly permissive towards a model that relies almost exclusively on grapheme-phoneme correspondence rules to name nonwords.

Because human readers sometimes produce nonword pronunciations that appear to be derived by analogy with real words rather than by following rules, Seidenberg et al., (1994) proposed an alternative analogy-based scoring method, according to which nonword pronunciations were scored as correct “if we could identify a plausible basis for them (either a rule or an analogy to a neighbouring word)” (p.1185). For example, their model was scored correct for pronouncing “jook” as /juk/ by analogy with the rhyming words “book”, “cook”, etc., whereas the scoring method used by Besner et al., (1990) and Coltheart et al., (2001) would score this pronunciation as incorrect (grapheme-phoneme correspondence rules imply a pronunciation rhyming with “spook”). The model scored extremely well when this method was used; for instance, its performance on the Glushko (1979) set was scored as 96.5% correct (the original participants scored 94.9% on average, according to this criterion). The same scoring method was used by Perry et al., (2007) in their assessment of their CDP+ model. They reported an error rate of 6.3% for a set of 592 monosyllabic nonwords, comparable to the error rate of 7.3% for human participants.

However, just as Coltheart et al., (2001)’s rule-based scoring of the DRC model’s nonword naming is overly generous in its assessment of that model’s performance, the analogy-based method used by Seidenberg et al., (1994) and Perry et al., (2007) to score their models seems far too lenient. For example, the nonword “jinth” could be pronounced with a long vowel, by analogy with the word “ninth”, but Andrews and Scarratt (1998) found that not one of their 44 participants produced this pronunciation (all participants gave the regular pronunciation that rhymes with the word “plinth”). Indeed, Andrews and Scarratt (1998) found that the nonwords with irregular body neighbours that they tested were given regular pronunciations by their participants 85% of the time, overall. Thus, although researchers may be able to point to analogies that support the irregular pronunciations produced by models, it is not clear how relevant this is to comparing models with human performance. In the same way, Pritchard et al., (2012) criticised the analogy-based scoring method as too lenient because it includes “too many pronunciation possibilities that readers simply do not consider” (p.1277). An additional limitation concerns the potential subjectivity of this measure, since a researcher must decide the extent to which position-specific or position-independent rules are considered (extreme cases of the latter may lead to particularly lenient pronunciations such as the famous “fish” pronunciation of the nonword “ghoti”).

An apparently straightforward way to overcome the problems associated with the rule-based and analogy-based scoring methods is to consider what pronunciations readers actually assign to nonwords. As Plaut et al., (1996) argued, the important question is not whether a model’s pronunciations are “correct” (in the sense of following grapheme-phoneme correspondence rules), but whether these pronunciations are similar to those produced by human readers. However, assessing models on the basis of human pronunciations is not quite as straightforward as one might initially expect. One issue is that many of the older data sets in the literature do not record what pronunciations participants gave – the coding is limited to whether the pronunciation was regular (e.g. Andrews and Scarratt, 1998; Glushko, 1979).Footnote 1

The more fundamental issue, though, is that there is considerable variability in human nonword naming (e.g. Pritchard et al., 2012; Mousikou et al., 2017). This variability challenges the notion that there is a correct pronunciation of nonwords (which in itself is another reason that the rule-based scoring method is problematic). Accommodating this pronunciation variability suggests the need for datasets (corpora) that record all of the responses given by participants and scoring methods that are appropriately sensitive to this variability.

To this end, Pritchard et al., (2012) assembled a corpus of naming responses for 412 monosyllabic nonwords (each of which was read aloud by 45 adults). All responses were transcribed and the resulting corpus was used to assess the DRC, CDP+ and CDP++ (Perry et al., 2010) models. A model’s pronunciation of each nonword was deemed correct if it matched with any of the pronunciations in the corpus. This corpus-based approach captures the variability of human responses (contrary to the rule-based scoring method used by Coltheart et al., 2001) but is not as lenient as the analogy-based scoring method used by Seidenberg et al., (1994) and Perry et al., (2007). Pritchard et al., (2012) found that error rates were 1.5% for DRC, 49.0% for CDP+, and 26.9% for CDP++. That is, for almost half of the nonwords the CDP+ model produced a response that was not produced by any of the 45 human readers. These results appear to pose a strong challenge to both the CDP+ and CDP++ models. However, it is important to note that the assessment of the DRC model depends on the details of the scoring method. Although it was rare for the model’s pronunciation not to correspond to one that was given by at least one participant (reflecting the frequency with which regular pronunciations are produced), for over a quarter of the nonwords the most frequent response given by participants did not correspond to the pronunciation output by DRC.

More recently, Mousikou et al., (2017) have extended the corpus-based method to disyllabic nonwords. Disyllabic words and nonwords pose additional challenges given that they contain a higher proportion of inconsistent grapheme-phoneme correspondences and raise the problem of stress assignment. Mousikou et al., (2017) constructed a corpus of disyllabic nonword naming by transcribing the responses of 41 adult participants who read aloud 915 disyllabic nonwords. This corpus demonstrates the striking variability of human responses; on average there were 5.9 alternative pronunciations per nonword, with a range from 1 to 22.

Mousikou et al., (2017) used this corpus to assess two models that are able to read aloud disyllabic nonwords, CDP++ and RC00 (Perry et al., 2010; Rastle and Coltheart, 2000). They found that the two models had different strengths, with CDP++ doing better at stress assignment and RC00 doing better at pronunciation. But the performance of both models differed from human performance: around one in four of CDP++ pronunciations and around one in eight of RC00 pronunciations were not produced by any human reader.

Limitations of the corpus-based method

The corpora collected by Pritchard et al., (2012) and Mousikou et al., (2017) are a valuable resource for understanding human nonword naming and testing models. Nevertheless, there are some limitations of the corpus-based approach to assessing models. Some of these limitations are related to the extremely resource-intensive nature of the corpus-based approach. In the case of Mousikou et al., (2017), a single listener was asked to transcribe all the pronunciations of the 41 participants (over 37K sound recordings) into English phonemes and then compare these transcriptions to the phonemic outputs of the models. Quite apart from the Herculean nature of this task, placing such demands on a single listener raises the risk of both random coding errors and systematic biases. Indeed, recent evidence suggests that such coding errors may be quite likely, with De Simone et al., (2021) reporting that two trained transcribers of English were in only moderate agreement (κ = .57) when transcribing a set of English pseudowords in their Experiment 1. Future model evaluation will require an expanded set of nonwords for analysing models; reliance on researchers to construct large corpora may impede progress.

Other limitations of the corpus-based method are related to the resulting data. It makes sense to compare model outputs with human pronunciations, but there is a need for caution in interpreting matches and mismatches between model and data. We can distinguish two potential problems, which we label false positives and false negatives. False positives refer to cases where a match between model and data gives a misleading assessment of the success of the model. Such diagnostic errors can occur if the corpus contains errors, either as a result of participant error or transcription error. Indeed, it is possible to find many examples in the Mousikou et al. corpus that appear to be participant or transcription errors. Pronunciations that would probably not be considered plausible by most listeners include, for example, tamcem pronounced as t{ksim, pispy pronounced as pIpsi, and daxing pronounced as d1ksIN. If one of the models considered by Mousikou et al. produced any of these pronunciations it would have been scored correct (on the basis that any output matched by a transcription in the database is correct), but we suggest that scoring a model in this way would constitute a false positive. Experiments 1 and 2 (below) will provide evidence for this claim.

False negatives refer to cases where the failure to find a match between model and data (because the model’s pronunciation is not present in the corpus) gives a misleading assessment of the failure of the model. The fundamental issue here is that it is not safe to assume that the participants’ responses exhaust all possible valid responses. Restricting the set of valid responses to those actually produced by at least one participant neglects the potential for alternative responses that may be apparent to individual readers. As noted above, Pritchard et al. criticise Perry et al.’s scoring criterion as overly lax for including pronunciation possibilities that “readers simply do not consider”. But we must also ask whether forcing participants to produce a single pronunciation of a nonword allows us to sample all of the possibilities that they do consider. A methodology that excludes plausible pronunciations may provide a biased standard to assess the performance of models. Furthermore, false negatives may also arise as a consequence of transcription choices (i.e. an erroneous transcription will result in the actual pronunciation being excluded from the corpus of ‘correct’ pronunciations), as we show next.

Testing a different model using the corpus-based method

To illustrate some of the above issues, we applied the corpus-based method to a test of a different model of reading aloud. Sequitur G2P (henceforth Sequitur) is a leading grapheme-to-phoneme conversion tool based on the data-driven algorithm introduced by Bisani and Ney (2008).Footnote 2 It is commonly used as a component of speech synthesis applications (e.g. Sawada et al., 2014), as well as in speech recognition applications (e.g. Panayotov et al., 2015), in both cases serving as a tool to generate phonemic transcriptions of words for which no dictionary entry is available to the system (e.g. proper names, toponyms, rare words, etc.). Sequitur is based on the idea of joint-sequence modelling; that is, a word-pronunciation pair is modelled as a sequence of units called graphones, each one representing a mapping from adjacent letters to adjacent phonemes, where the maximum allowed number of symbols (letters or phonemes) on each side of a graphone is a parameter set by the user. The algorithm learns associations between sequences of graphones, the maximum length of learnable sequences also being a free parameter. These associations are trained on large phonemic dictionaries in the target language. In the current work, Sequitur was trained on a selection of 64,598 entries from the CELEX phonetic dictionary for English (Baayen et al., 1995).
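To make the joint-sequence idea concrete, the sketch below shows the kind of representation involved. It is our own minimal illustration in Python: the graphone segmentation and the probabilities are invented for the example, and it is not Sequitur’s actual implementation, which learns both the segmentation and the graphone n-gram statistics from the training dictionary.

```python
# Minimal illustration (not Sequitur's code): a word-pronunciation pair is
# represented as a sequence of graphones, each pairing a short letter string
# with a short phoneme string (DISC symbols here). Sequitur learns both the
# segmentation and an n-gram model over graphones from the training
# dictionary; here both are supplied by hand just to show the idea.
from dataclasses import dataclass
from math import log

@dataclass(frozen=True)
class Graphone:
    letters: str   # at most L letters (L is a user-set parameter)
    phonemes: str  # at most L phonemes

# One possible segmentation of "outlaw" -> 6tl$ (DISC) into three graphones.
outlaw = [Graphone("ou", "6"), Graphone("tl", "tl"), Graphone("aw", "$")]

# Toy bigram log-probabilities over graphones (values invented for the example).
BIGRAM_LOGPROB = {
    (Graphone("ou", "6"), Graphone("tl", "tl")): log(0.4),
    (Graphone("tl", "tl"), Graphone("aw", "$")): log(0.3),
}

def sequence_score(graphones):
    """Log-probability of a graphone sequence under the toy bigram model;
    unseen bigrams receive a small floor probability."""
    return sum(BIGRAM_LOGPROB.get(pair, log(1e-6))
               for pair in zip(graphones, graphones[1:]))

print(sequence_score(outlaw))  # higher scores indicate more probable pronunciations
```

Roughly speaking, a trained model of this kind considers the possible graphone segmentations of an input string and outputs the pronunciation associated with the highest-scoring one.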

Although it is not a psychological model, Sequitur may provide further insights into the strengths and weaknesses of each method that would not be detected by just testing CDP++ and RC00. Sequitur is capable of producing pronunciations for any text string, and so we were able to test it on the 915 disyllabic nonwords in Mousikou et al., (2017)’s database (see Section 1 in Supplementary Materials for further details about Sequitur and how we generated pronunciations from it). For each nonword, the model’s output was deemed correct if it matched at least one reference pronunciation. To ensure our application of Mousikou et al., (2017)’s method was correct we performed a similar assessment of the CDP++ model (Perry et al., 2010) and the RC00 rule-based disyllabic algorithm of Rastle and Coltheart (2000). Note that Mousikou et al., (2017) evaluated CDP++ and RC00 on both their output pronunciations and their stress assignment. Here, we focus on pronunciation only and, as a consequence, our results should be compared with those in the Pronunciation section in Mousikou et al., (2017).

Table 1 reports match scores for each model (i.e. the percentage of nonwords for which the model’s output matched with at least one human pronunciation in the corpus). Following Mousikou et al., (2017), we scored each pronunciation by both strict criteria (which required phoneme strings to match exactly) and lenient criteria (which allowed substitutions between short vowels and schwas). The match scores for CDP++ and RC00 replicate those calculated in Mousikou et al., (2017)Footnote 3. Sequitur scored the worst among the three models, with a score of 70% using strict scoring and 75% using lenient scoring, though it was not far behind the CDP++ model.

Table 1 Matching (%) of CDP++, RC00 and Sequitur pronunciations of the 915 nonwords from Mousikou et al., (2017) against human pronunciations from the same work
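To make the scoring procedure explicit, the sketch below gives our own reconstruction of corpus-based matching (it is not the original authors’ script): a model output counts as a match if it equals at least one human transcription recorded for that nonword, either exactly (strict) or after allowing substitutions between short vowels and schwa (lenient). The DISC short-vowel set and the assumption of one character per phoneme are ours.

```python
# Sketch of corpus-based matching (our reconstruction, for illustration).
# Pronunciations are DISC strings, assumed to use one character per phoneme.
SHORT_VOWELS = set("IE{QVU@")  # assumed DISC short vowels, plus schwa (@)

def phonemes_equivalent(a: str, b: str, lenient: bool) -> bool:
    if a == b:
        return True
    # Lenient criterion: schwa may substitute for a short vowel and vice versa.
    return lenient and "@" in (a, b) and {a, b} <= SHORT_VOWELS

def pronunciations_match(p1: str, p2: str, lenient: bool) -> bool:
    return len(p1) == len(p2) and all(
        phonemes_equivalent(a, b, lenient) for a, b in zip(p1, p2))

def match_score(model_outputs: dict, corpus: dict, lenient: bool = False) -> float:
    """Percentage of nonwords whose model output matches at least one
    human pronunciation recorded for that nonword."""
    hits = sum(
        any(pronunciations_match(out, human, lenient) for human in corpus[nonword])
        for nonword, out in model_outputs.items())
    return 100 * hits / len(model_outputs)

# Hypothetical usage with invented corpus entries:
corpus = {"pifty": {"pIfti"}}
print(match_score({"pifty": "p@fti"}, corpus))                # 0.0   (strict)
print(match_score({"pifty": "p@fti"}, corpus, lenient=True))  # 100.0 (I ~ @ allowed)
```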

Sequitur’s performance and the problem of false negatives

To get further insight into Sequitur’s poor performance under the corpus-based method, we looked specifically at its pronunciations that were deemed incorrect. This led us to discover a number of systematic discrepancies between the human pronunciations in Mousikou et al., (2017) and Sequitur’s pronunciation of the same nonword.

The most common discrepancy involved a 9→$ substitution:Footnote 4,Footnote 5 Many nonwords were pronounced by Sequitur with a $ phoneme (i.e. the long vowel, ɔː, in law, thought, and war) when all human transcriptions used a 9 phoneme (i.e. the diphthong, ʊə, in jury and cure) in the same place. For example, outslaw was pronounced as 6ts1$ by Sequitur and as 6ts19 by humans. This can be explained by the fact that Mousikou et al., (2017) treated $ phonemes and 9 phonemes as equivalent and conflated them into 9.Footnote 6 Conversely, the $ phoneme never appears in the outputs of CDP++ and RC00, nor in Mousikou et al., (2017)’s transcriptions of human speakers. This creates an unfair assessment of Sequitur because even if its use of a $ phoneme is actually acceptable, all of Sequitur’s outputs that used $ were deemed incorrect under the corpus-based method since it was impossible for them to match any transcriptions of human speakers. In reality, it is likely that at least some of these pronunciations by Sequitur were actually acceptable, given that similar pronunciations of real words can be found in its training set, CELEX. For example, Sequitur’s pronunciations of outslaw (6ts1$) and glorak (gl$r{k) are generalizations from the CELEX pronunciations of real words like outlaw (6tl$) and glory (gl$rI), respectively. That is, the corpus-based scoring of some of the model’s pronunciations reflects false negatives.

A number of other discrepancies between Sequitur and Mousikou et al., (2017)’s transcriptions of human speakers are described in Section 4.1 in Supplementary Materials. Note that unlike the 9→$ substitution example above, these remaining discrepancies are not due to conflation of two phonemes (none of the phonemes mentioned below were conflated in Mousikou et al., 2017) but they still highlight potential false negatives – cases where Sequitur may have learned acceptable pronunciations from CELEX only to be deemed incorrect under the corpus-based method. For example, many nonwords with word final ‘y’ such as pifty were pronounced with a final I by Sequitur whereas all humans pronounced pifty with a final i. Yet similar pronunciations of real words can be found in CELEX (e.g. fifty → fIftI, misty → mIstI). Similarly, Sequitur pronounced nonwords like chansem with # instead of { because the CELEX pronunciation of words like chance is J#ns and not J{ns. If Sequitur learned correctly from the training set (as the examples above suggest), then either CELEX itself is wrong or those pronunciations should be considered correct (despite not matching any transcription of human speakers in Mousikou et al., 2017). CDP++ may have been similarly susceptible to false negatives, since this model was also trained on CELEX. Indeed, analysis of its pattern of errors reveals errors similar to those described above (e.g. sometimes using I instead of i for nonwords with word final ‘y’). Note that RC00 uses hard-wired rules to sidestep many of these errors (e.g. its algorithm identifies word final ‘y’ as a suffix, and based on its stored rules for suffixes, applies the i pronunciation).
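The kind of check performed informally here can be expressed as a small search over a CELEX-style lexicon: given a spelling pattern and the phoneme pattern a model produced for it, list the real words that show the same mapping. The toy lexicon below contains only the examples mentioned above and is purely illustrative.

```python
# Illustrative check for precedents of a model's pronunciation pattern in a
# CELEX-style lexicon mapping spellings to DISC pronunciations (toy subset).
LEXICON = {"fifty": "fIftI", "misty": "mIstI", "outlaw": "6tl$", "chance": "J#ns"}

def supporting_words(spelling_suffix: str, phoneme_suffix: str, lexicon: dict) -> list:
    """Real words whose spelling ends in spelling_suffix and whose
    pronunciation ends in phoneme_suffix."""
    return [word for word, pron in lexicon.items()
            if word.endswith(spelling_suffix) and pron.endswith(phoneme_suffix)]

# Does Sequitur's word-final 'y' -> I pattern have precedents in the lexicon?
print(supporting_words("y", "I", LEXICON))  # ['fifty', 'misty']
```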

A new method of testing models of reading aloud

Given the potential for diagnostic errors associated with the corpus-based method, together with the other limitations of this method, it is worth considering other methods for assessing model pronunciations. To this end, we propose a new method in which participants are asked to listen to and rate the plausibility of nonword pronunciations. This method does not require any transcription and allows researchers to directly test the plausibility of candidate pronunciations, rather than discarding these pronunciations because they do not conform to grapheme-phoneme correspondence rules, or were not produced by a (relatively small) sample of human readers. The outputs of theoretical models can be produced by a human speaker, or (as is the case in the experiments described here) can provide the input to a speech synthesizer, so that the models really do name the nonwords aloud. This ratings-based method also has the benefit of being easy to implement online, which makes data collection easier and allows studies to be run entirely remotely.

In the following two experiments, we directly compare this new ratings-based method with the corpus-based method. We took the output of CDP++, RC00 and Sequitur for the nonwords from Mousikou et al., (2017), and used these outputs as the input to a speech synthesizer. The resulting sound files were played to online participants one at a time along with the corresponding written nonword, and participants rated how well the pronunciation matched the written nonword. We considered a model’s output appropriate if listeners judged the pronunciation reasonable given the spelling. Our key findings are: a) the corpus-based method does indeed introduce a number of mistakes, both false positives (categorising implausible pronunciations as correct) and more often, false negatives (categorising acceptable pronunciations as incorrect), b) our ratings method has a high hit rate and low false-alarm rate in assessing nonword reading accuracy, and arguably does a better job than the corpus-based method, and c) although we observe a similar outcome to Mousikou et al., (2017)’s evaluation in terms of RC00 outperforming CDP++, the two methods differ in terms of their evaluation of Sequitur, which performed much better under our ratings-based method. As we detail below, it can be argued that our method provided a more accurate description of the performance of Sequitur. Together, these findings indicate that our new method can be used to facilitate the development of better models in the future.

Experiment 1

The first aim of Experiment 1 was to assess the reliability of our ratings method’s assessment of nonword pronunciations. Participants were asked to rate correct (modal and minor) human pronunciations from Mousikou et al., (2017) as well as manipulated pronunciations where deliberate errors were introduced. The ratings of modal responses were used to estimate the sensitivity of the method (i.e. proportion of modal pronunciations rated as correct) and ratings of deliberate errors were used to estimate the specificity of the method (i.e. proportion of deliberate error pronunciations rated as incorrect). Ratings of modal and minor responses were used to assess whether the method could detect fine-grained differences amongst correct pronunciations, e.g. although both modal and minor pronunciations are likely to be rated as correct using binary classification, the former may yield more positive responses than the latter on the six-point scale.

The second aim was to use our ratings method to re-evaluate possible errors of the corpus-based method. For this purpose, we used the ratings method to re-assess nonword pronunciations that were deemed errors on the basis of the model’s output matching with 0 out of 41 human responses (i.e. we tested for false negatives), and to re-assess pronunciations that were deemed correct on the basis of the output matching with only one out of 41 human responses (i.e. testing for false positives). These latter items may be good candidates for false positives, with the single match reflecting either a transcription error or a lapse in concentration by the participant that caused an implausible pronunciation.

Method

Design and materials

Experiment 1 used 528 nonwords, corresponding to all of the nonwords from Mousikou et al., (2017) for which at least one of the three models (CDP++, RC00 and Sequitur) was deemed (a) incorrect (did not match with any human responses), or (b) correct, by matching with only one out of 41 human responses. We tested multiple pronunciations for each nonword. These pronunciations were distributed across six conditions. The first three conditions were designed as a test of the ratings method:

(i) Human Modal Pronunciation

The modal pronunciation of each nonword was determined based on its most frequent pronunciation in Mousikou et al., (2017)’s corpus. We expected that the pronunciations in this condition would receive the highest ratings.

(ii) Human Minor Pronunciation

A minor pronunciation of each nonword was determined by choosing a pronunciation that was produced by between two and six speakers in Mousikou et al., (2017)’s corpus (if there was more than one candidate, the more frequently produced pronunciation was chosen). It was not possible to use all 528 nonwords in this condition because some did not have a pronunciation that was shared between two and six speakers. We expected that the pronunciations in this condition would receive lower ratings than those in the Modal Pronunciation condition.

(iii) Deliberate Error condition

For each nonword we generated an erroneous pronunciation by changing one phoneme from the Human Modal pronunciation at random, according to the following constraints: A consonant could be substituted only by another consonant with a different place and manner, and a vowel or diphthong could be substituted only by another vowel or diphthong with a different position (front, mid, back) and length (short vowel, long vowel or diphthong). For example, i, which is a long and fronted vowel, could be substituted by U, which is short and back, but not by I, because it is also fronted; likewise, p (bilabial plosive) could be substituted by S (alveolar fricative) but not by b (also bilabial). In this way we tried to obtain errors that were unequivocal but at the same time not too distant from the written form. We expected the pronunciations in this condition to receive the lowest ratings.
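The sketch below illustrates this error-generation procedure. The feature tables cover only a handful of DISC symbols, with feature values following the classifications given in the text; the full inventory used to generate the actual stimuli is not reproduced here.

```python
# Sketch of the Deliberate Error procedure: substitute one randomly chosen
# phoneme with another of the same broad class (consonant or vowel) that
# differs on both of the stated features. The feature tables below are a
# small, assumed subset; classifications follow the examples in the text.
import random

CONSONANTS = {  # symbol: (place, manner)
    "p": ("bilabial", "plosive"), "b": ("bilabial", "plosive"),
    "t": ("alveolar", "plosive"), "S": ("alveolar", "fricative"),
    "k": ("velar", "plosive"),    "f": ("labiodental", "fricative"),
}
VOWELS = {      # symbol: (position, length)
    "i": ("front", "long"), "I": ("front", "short"),
    "U": ("back", "short"), "$": ("back", "long"),
}

def deliberate_error(pron: str) -> str:
    """Return pron with one phoneme replaced under the substitution constraints."""
    idx = random.randrange(len(pron))
    target = pron[idx]
    if target in CONSONANTS:
        place, manner = CONSONANTS[target]
        candidates = [c for c, (p, m) in CONSONANTS.items() if p != place and m != manner]
    elif target in VOWELS:
        position, length = VOWELS[target]
        candidates = [v for v, (p, l) in VOWELS.items() if p != position and l != length]
    else:
        return pron  # symbol not covered by the toy tables
    return pron[:idx] + random.choice(candidates) + pron[idx + 1:]

print(deliberate_error("pIfti"))  # e.g. "SIfti" (p -> S) or "p$fti" (I -> $)
```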

The remaining three conditions – (iv) CDP++, (v) RC00 and (vi) Sequitur – were designed as a cross-check on the accuracy of the scoring of model outputs under the corpus-based method. The pronunciations in each of these conditions were those produced by the respective models. These were pronunciations that had been produced by either no human participants (‘Zero Match’ items) or one participant (‘One Match’ items) in Mousikou et al., (2017)’s corpus. Table 2 reports the number of pronunciations per category included in our experiment.

Table 2 Number of pronunciations per condition in Experiment 1

Using Microsoft Speech SynthesizerFootnote 7, pronunciations were synthesised from either Mousikou et al., (2017)’s DISC transcriptions of human pronunciations or from the models’ DISC outputs. The British female voice “en-GB, Hazel” was used in all experiments. This synthesiser allows the user to specify the desired pronunciation of a given written stimulus in terms of phonetic transcription, as well as to control some prosodic aspects, like speech rate. Although we found this synthesiser to be better than alternative ones (e.g. eSpeakFootnote 8), one limitation is the lack of control over the positioning of lexical stress: although stress marks are accepted by the programming interface, they are ignored by the synthesizer, which places the stress according to pre-determined rules not accessible to the user. As we were not focused on stress assignment, we accepted this limitation. Note that the results reported in the Pronunciation section of Mousikou et al., (2017) also disregarded stress assignment. The reader interested in using Microsoft Speech Synthesizer for their own research is referred to Section 5 in Supplementary Materials for practical suggestions.
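For readers who want a sense of how a specific phoneme string can be handed to an SSML-capable synthesiser, the sketch below builds a standard SSML request using the <phoneme> element. The DISC-to-IPA mapping is a partial, assumed subset, the voice string is the one named above, and the call that actually renders the SSML to audio (which differs across Microsoft’s interfaces) is omitted.

```python
# Sketch only: construct an SSML string that asks an SSML-capable synthesiser
# to pronounce a written item with a specified IPA transcription. The
# DISC-to-IPA table is a small assumed subset; rendering the SSML to audio is
# left to whichever speech interface is being used.
DISC_TO_IPA = {"6": "aʊ", "t": "t", "l": "l", "$": "ɔː", "9": "ʊə"}

def disc_to_ipa(disc: str) -> str:
    return "".join(DISC_TO_IPA[symbol] for symbol in disc)

def build_ssml(orthography: str, disc_pron: str, rate: str = "-10%") -> str:
    ipa = disc_to_ipa(disc_pron)
    return (
        '<speak version="1.0" xml:lang="en-GB">'
        '<voice name="en-GB, Hazel">'
        f'<prosody rate="{rate}">'
        f'<phoneme alphabet="ipa" ph="{ipa}">{orthography}</phoneme>'
        '</prosody></voice></speak>'
    )

print(build_ssml("outlaw", "6tl$"))
```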

Procedure

A short screening test was used before the main experiment. Here, participants completed five multiple-choice questions, each of which played aloud the correct pronunciation of an existing English word; participants were required to match the pronunciation with one of three written forms presented on the screen (e.g. on hearing “crane”, choose among crane, frame, and train). If participants failed to get more than 3/5 correct answers, they did not progress to the main experiment.

For the main experiment, six stimuli lists were constructed, each containing 528 nonword pronunciations, so that each pronunciation condition of a given nonword featured at least once across the lists (although no list contained the same orthographic form more than once and hence participants were never exposed to two pronunciations of the same nonword). An additional constraint was to ensure that each participant rated no more than 200 stimuli (to avoid boredom and fatigue), and thus, each of the six lists was randomly divided into three lists of 176 pronunciations, giving a total of 18 lists.

In addition to the material described above, we added ten catch trials to each list (the same ones for all lists). These were ten pronunciations of nonwords that were not amongst the 528 nonwords used in our experiment, but which all 41 participants from Mousikou et al., (2017) produced (and were thus highly likely to be correct). Five of them, called Accurate, were synthesised to be consistent with the human pronunciation. The other five, called Inaccurate, were distorted by the same procedure used to obtain Deliberate Errors. Participants in our experiment were expected to rate Accurate stimuli as very good and Inaccurate stimuli as very bad. If this was not the case, participants were likely to be inattentive or using the wrong audio equipment.

Gorilla Experiment BuilderFootnote 9 was used to host the experiment online (Anwyl-Irvine et al., 2018). Each orthographic stimulus was presented while the corresponding auditory stimulus was played once. Participants could re-play the audio by pressing the space bar and rated the pronunciation by clicking on a six-point scale (Very bad, Bad, Probably not OK, Probably OK, Good, Very good) displayed below the orthographic stimulus. The task can be experienced online.Footnote 10

Participants

Participants were recruited using ProlificFootnote 11. The average completion time was 12 minutes and participants were paid 2.20 GBP (11 GBP/hour pro-rata). Criteria for selection were as follows: (i) Monolingual English-as-first-language Speakers, (ii) British citizen and resident, (iii) No diagnosis of literacy difficulties (e.g. dyslexia), (iv) Prolific approval rate above 95%.

A total of 121 participants were recruited. Nine of these were rejected because they did not finish the experiment and four were rejected because they failed the initial screening task. Thus, 108 participants completed the experiment (ensuring there were six participants for each of the 18 stimuli lists). This sample size was deemed sufficient because it ensured that we obtained approximately 3000 observations per condition, which satisfies recently proposed criteria for properly powered experiments of this kind (Brysbaert and Stevens, 2018; Brysbaert, 2019).

Overall, 98/108 of participants answered at least 9/10 Catch Trials correctly, and 104/108 participants answered at least 8/10 correctly, which suggests that overall participants were attentive and that the synthesiser was capable of rendering speech satisfactorily. However, upon closer inspection of the four participants that answered more than two catch trials incorrectly, we identified three of these participants as outliers: two of them rated almost all the stimuli from the main experiment as implausible, while the third one rated Deliberate Error pronunciations better than anything else and in general produced erratic responses (the fourth participant’s answers were relatively sound). Those three participants were excluded from further analyses as we believe that they either did not understand or did not attend to the task.

Results

Assessing sensitivity and specificity

In this section, we assess the reliability of the ratings method by obtaining Sensitivity and Specificity estimates from participants’ ratings of Human Modals and Deliberate Errors. Sensitivity is defined as the proportion of Human Modal pronunciations rated as correct, while Specificity is the proportion of Deliberate Error pronunciations rated as incorrect.

We obtained one binary correct/incorrect score for each pronunciation (i.e. 528 scores for Human Modals and 528 scores for Deliberate Errors) by (i) calculating the median rating of each pronunciation (medians were used because we treated the rating scale as an ordinal rather than a continuous scale), and (ii) converting each of these median scores to a correct response (rating of “Probably OK”, “Good”, or “Very good”), or incorrect response (rating of “Probably Not OK”, “Bad”, “Very bad”). When a median score was tied between “Probably Not OK” and “Probably OK” this was rounded down to “Probably Not OK” (this was the case for eight Modal and 21 Deliberate Error pronunciations). The pattern of results is displayed in Fig. 1. Using these scores, our initial estimation of Sensitivity was 475/528 = 90%, and Specificity was 465/528 = 88%.

Fig. 1 Rating counts for Experiment 1 separated by pronunciation condition and rating
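The aggregation just described can be summarised in a few lines of Python; the numeric coding of the six-point scale (1 = “Very bad” to 6 = “Very good”) is our own convention for the sketch.

```python
# Sketch of the rating-aggregation step: ratings are coded 1 ("Very bad") to
# 6 ("Very good"); the median over raters is computed per pronunciation and a
# median of at least 4 ("Probably OK") counts as correct. Tied medians of 3.5
# are rounded down, as described in the text.
from statistics import median

def is_rated_correct(ratings: list) -> bool:
    return median(ratings) >= 4  # a tied median of 3.5 falls below threshold

def sensitivity(modal_ratings: list) -> float:
    """Proportion of Human Modal pronunciations classified as correct."""
    return sum(map(is_rated_correct, modal_ratings)) / len(modal_ratings)

def specificity(error_ratings: list) -> float:
    """Proportion of Deliberate Error pronunciations classified as incorrect."""
    return sum(not is_rated_correct(r) for r in error_ratings) / len(error_ratings)

# Hypothetical ratings from six participants for two pronunciations:
print(is_rated_correct([5, 6, 4, 5, 4, 6]))  # True  (median 5)
print(is_rated_correct([3, 4, 3, 4, 3, 4]))  # False (median 3.5, rounded down)
```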

However, after an inspection of the 53 Human Modal pronunciations that were rated as incorrect (and which thus reduced the Sensitivity score), it became clear that sensitivity would be near-perfect were it not for the 9→$ substitution confound in Mousikou et al., (2017)’s stimuli set. As described above, the 9→$ substitution refers to the fact that in Mousikou et al., (2017)’s transcriptions of human pronunciations, $ phonemes (i.e. the long vowel, ɔː, in law, thought, and war) and 9 phonemes (i.e. the diphthong, ʊə, in jury and cure) were treated as equivalent and conflated into 9. This is relevant to our sensitivity estimate because 48 of the 53 Human Modal pronunciations that were rated as incorrect by our participants were transcribed by Mousikou et al., (2017) with a 9 phoneme when they probably should have been transcribed with a $ phoneme. Consistent with this claim, we found that for 27 of these nonwords, a Sequitur pronunciation of the same nonword used a $ phoneme in the same position that the Modal pronunciation used a 9 phoneme; crucially, all of these Sequitur pronunciations were rated as acceptable (see Table S6 in Supplementary Materials). Once we correct for this error, we obtain a sensitivity score of 475/480 = 99%, which confirms that our method is sufficiently sensitive for detecting the plausibility of nonword pronunciations.

Furthermore, an analysis of the five remaining incorrect Human Modal pronunciations revealed that two of them, baininx and cherinx, were transcribed with a final k. A check of the original audio filesFootnote 12 revealed that speakers pronounced those two nonwords with a final ks, which is clearly more plausible. This example revealed a clear limitation of the corpus-based method’s reliance on transcriptions and further highlighted the need to cross-check transcriptions.

We conducted a similar evaluation of our initial specificity estimate (i.e. the 63 Deliberate Error pronunciations that were rated as acceptable; the full list is reported in Table S7 in Supplementary Materials). At least six of those 63 ratings can be attributed to poor design of the Deliberate Error: although these pronunciations were designed to be Deliberate Errors, their random generation unintentionally coincided with a pronunciation from another category (e.g. the Deliberate Error for the nonword outbost was generated from the Human Modal pronunciation 6tbQst, but the random Q→5 substitution produced 6tb5st, which was identical to the Human Minor pronunciation).

Although we could only identify six cases like this (when the error was an exact match with another pronunciation), it is likely that other Deliberate Errors were also poorly designed (i.e. although the error did not match with another pronunciation, the random generation nevertheless produced a plausible pronunciation).

Another possible source of the lower specificity score is that some of these Deliberate Errors were poorly synthesized. This may have obscured the error we introduced and led participants to judge the pronunciation as acceptable. In an attempt to assess the role of the speech synthesizer in producing false-positive responses, we recruited three trained phoneticians, who all had a PhD in Phonetics, were native British English speakers, and were blind to the experiment’s purpose, and asked them to verify whether each of the 63 phonemic strings had been produced accurately by our synthesizer. We used Gorilla Experiment Builder to present phoneticians with each of the 63 phonemic strings (converted from DISC to IPA) alongside an audio presentation of its synthesis. Phoneticians were asked to judge whether the phonemic string had been accurately rendered by the synthesiser using the same six-point scale employed in the main experiment and were told that positive ratings (“Probably OK” or better) would be taken to indicate that all phonemes in the string were pronounced to a satisfactory standard, whereas negative ratings (“Probably not OK” or worse) would indicate that at least one phoneme had been synthesised incorrectly. For 16 out of 63 pronunciations, the modal response from the three phoneticians was that the synthesizer had produced the phoneme string incorrectly (likely a consequence of the fact that some Deliberate Errors will naturally be difficult to pronounce). Both the errors of stimulus construction and the limitations of the synthesizer we used will have artificially reduced our specificity estimate. If we recompute the specificity while putting aside these items, it is estimated at 465/512 = 91%. Of course the limits of the synthesizer do reduce the specificity in the current experiments, but this is not a limitation of the ratings method per se.

In sum, our on-line ratings-based experiment showed high Sensitivity and Specificity scores, it was able to discriminate between human modal and minor responses, and it revealed a systematic bias in Mousikou et al., (2017)’s transcriptions of which we were not aware. These results would likely be further improved if the experiment were carried out under laboratory conditions and with a better speech synthesizer. This suggests that the ratings method is a reasonable measure for assessing pronunciations of disyllabic nonwords.

Pronunciation ratings for zero and one match items

In this section, we assess the accuracy of CDP++, RC00, and Sequitur pronunciations of the zero and one match items using our ratings method. As noted above, these are the items that are most likely to have been misclassified by the corpus-based method, with zero match pronunciations being candidates for false negatives (if a pronunciation does not match exactly with any pronunciations from the reference list of 41 participants, this is not necessarily an erroneous pronunciation) and one match pronunciations candidates for false positives (a match with only one of 41 participants may reflect the fact that this participant mispronounced or that it was mistranscribed). Based on the 9→$ confound described earlier, we decided to exclude all orthographic stimuli for which any pronunciation contained the 9 and/or $ phoneme (95/528 nonwords were excluded).

In Fig. 1, top row, we report ratings for the Human Pronunciations and Deliberate Errors, corresponding to the Sensitivity and Specificity scores above. The bottom row of Fig. 1 reports the ratings of each model: darker shades of grey denote ‘Zero Match’ pronunciations whereas lighter shades of grey denote ‘One Match’ pronunciations. Strikingly, over half of the Zero Match pronunciations (58%) were rated as correct (“Probably OK” or better) despite the fact that all of these pronunciations were rated as incorrect under Mousikou et al., (2017)’s criterion (i.e. evidence for false negatives in the corpus-based method). Notable examples include freacely (frislI), conglist (k@nglIst), and afflave (@fl1v), which all used substitutions identified earlier (e.g. i→I, Q→@, {→@), thus supporting the notion that these generalizations learned from CELEX were indeed correct. Turning our attention to ‘One Match’ pronunciations (shaded in lighter grey), these were judged more consistently with the corpus-based method (most were judged positively) but there were nevertheless some inconsistencies between the methods (19% were judged negatively). Notable examples include surbeft (s3bft), conclise (kQn2sz), pilprem (pIspr@m), and udgement (v_im), which were all judged as “Bad” or “Very bad” by our participants. Since all of these are clearly bad pronunciations, they highlight examples of when Mousikou et al., (2017)’s reference list included either transcription errors or unreliable human responses that made their method susceptible to false positives.

In sum, applying the ratings method to the zero match and one match items suggests that the corpus-based method is prone to making false-negative errors and false-positive errors. The false-negative errors may simply reflect the fact that the 41 participants did not produce all plausible pronunciations of these nonwords, and the false-positive errors reflect the fact that matching one response out of 41 participants is not sufficient grounds for characterizing a pronunciation as correct (people make mistakes, and matching a mistake does not entail a correct pronunciation). Combined with the high sensitivity and specificity results of the ratings method, and the fact that the ratings method picked up a transcription error that the corpus-based method was blind to, the ratings method seems a promising approach for evaluating model productions that can easily be applied to any item. In Experiment 2, we compare the success of the CDP++, RC00, and Sequitur models in naming the same set of disyllabic nonwords using the corpus-based and ratings-based methods.

Experiment 2

In Experiment 2, a new group of participants was asked to rate the output of the three nonword naming models on the same set of 803 nonwords from Mousikou et al., (2017) (not just the 528 from Experiment 1 that matched with ≤ 1 human responses)Footnote 13. They were also asked to rate the responses of three different speakers from Mousikou et al., (2017) who named the nonwords in different manners. The ratings of the model outputs and the ratings for the responses for the three speakers were compared.

Method

Materials

Experiment 2 used seven pronunciation conditions: (i) CDP++, (ii) RC00, (iii) Sequitur, (iv) Deliberate Error, (v) Modal Speaker, (vi) Typical Speaker, and (vii) Outlier Speaker. Conditions (i) to (iv) are the same as their namesakes in Experiment 1 except that we include each model’s pronunciation of 803 nonwords. In contrast to Experiment 1, we also used pronunciations by individual human speakers (Conditions v-vii), as opposed to artificial categories like Human Modal and Human Minor, which are not necessarily representative of any individual speaker. To select these conditions, we employed the Surprise Index (SI) (Good, 1956), a statistic based on information theory that quantifies how unexpected an outcome is given the available outcome probabilities (Siegelman et al., 2020). This measure was applied to each pronunciation of each speaker, providing 41 speaker profiles, each one describing how the productions of a particular speaker depart from the majority, i.e. how ‘surprising’ they are. The three selected profiles were the least surprising (Modal), the most surprising (Outlier), and the one at the median (Typical). It turns out that the Modal Speaker corresponds to the speaker who produced the modal pronunciation most often, while the Outlier is the one who produced unique pronunciations most often, thus justifying the use of SI. A formal definition of SI and a detailed description of how SI was computed on the pronunciations in Mousikou et al., (2017) is provided in Section 7 in Supplementary Materials. As in Experiment 1, all stimuli were synthesized using Microsoft Speech Synthesizer from either Mousikou et al., (2017)’s DISC transcriptions or the models’ DISC outputs.
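The sketch below illustrates one common form of Good’s (1956) surprise index; the precise definition and the aggregation into speaker profiles used here are given in Section 7 of the Supplementary Materials, so this should be read as an illustration of the general idea rather than a reimplementation.

```python
# Sketch of a surprise index in the spirit of Good (1956): how unexpected a
# particular response is given the distribution of responses for that item.
# We use a log form, SI = log2( sum_i p_i^2 / p_observed ); the exact
# definition used in the present work is given in the Supplementary Materials.
from collections import Counter
from math import log2

def surprise_index(observed: str, all_responses: list) -> float:
    counts = Counter(all_responses)
    n = len(all_responses)
    p_obs = counts[observed] / n
    expected_p = sum((c / n) ** 2 for c in counts.values())
    return log2(expected_p / p_obs)

# Hypothetical distribution of 41 responses to one nonword: a modal
# pronunciation produced 30 times, a minor one 10 times, a unique one once.
responses = ["pIfti"] * 30 + ["pIftI"] * 10 + ["p2fti"]
print(surprise_index("pIfti", responses))  # low SI: unsurprising response
print(surprise_index("p2fti", responses))  # high SI: surprising response
```

A speaker’s profile can then be summarised as, for example, the mean SI over all of that speaker’s pronunciations.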

Design & procedure

Experiment 2 used 803 nonwords, and a total of 3086 unique pronunciations (sometimes a nonword had identical pronunciations in multiple categories, so the total was not 803 multiplied by the number of categories). To ensure lists were an appropriate length for participants, stimuli were divided into 16 lists of 192 or 193 pronunciations (no nonword was used more than once within a list). The same screening task was used as in Experiment 1, and stimuli were presented under the same conditions. A full demonstration of Experiment 2, as well as all materials, is available online.Footnote 14

Participants

A total of 155 participants were recruited, seven of whom were rejected for failing the screening task and 20 of whom were rejected for not finishing the experiment. Thus, 128 participants completed the experiment, ensuring there were at least eight participants for each of the 16 lists. Recruitment strategy, sample size justification, and exclusion criteria were the same as in Experiment 1. The average completion time was 14 min and participants were paid 2.34 GBP (10 GBP/h pro-rata).

Results

Comparing models using the ratings method

Figure 2 reports the median ratings for nonwords, separated by Condition (CDP++, RC00, Sequitur, Modal Speaker, Typical Speaker, Outlier Speaker, Deliberate Error). Four nonwords (out of 803) were excluded from the analysis since the median ratings of the three human speakers’ pronunciations (Modal, Typical, Outlier) were all below “Probably OK” and hence we suspected some specific problems related to their transcription or synthesis. These were baininx and cherinx, whose transcriptions were identified as wrong, combire, which was poorly rendered by the synthesiser due to a final @r, and udstame, for which we cannot find an explanation.

Fig. 2 Rating counts for Experiment 2 separated by pronunciation condition and rating

Two mixed-effects regression models were fitted to test for statistical differences between conditions. Both models serve the same purpose, but one is a logistic regression modelFootnote 15 which predicts binary ratings (correct/incorrect) and the other is a cumulative link modelFootnote 16 that predicts ordinal-scale ratings. Both models used Condition as a fixed factor and treated Participant and Nonword (i.e. the orthographic stimulus) as random intercepts. A by-participant random slope for Condition was fitted for the logistic model but did not converge for the ordinal model. BIC comparisons indicated that all the terms in both models were justified.

Fig. 3a illustrates the logistic regression model by displaying the predicted probability (estimated marginal meansFootnote 17) estimated for each condition (i.e. the probability of each condition receiving a plausible rating). As can be seen, the Modal Speaker had a very high probability score of 0.95, immediately followed by the Typical Speaker at 0.94, then RC00 and Sequitur both at 0.90, the Outlier Speaker at 0.87, and CDP++ at 0.82. Finally, the Deliberate Error condition had a much lower probability of 0.15. To test whether these differences were significant, we used Tukey-adjusted pairwise comparisons, which compared each condition against the others. All conditions were significantly different from each other (all t test p values < 0.0001) except Modal Speaker versus Typical Speaker, and RC00 versus Sequitur. Perhaps most interestingly, RC00 and Sequitur both performed significantly better than the Outlier Speaker, which itself outperformed CDP++.

Fig. 3 Predictions (estimated marginal means) and confidence intervals from the linear mixed-effect models fitted on the data from Experiment 2

The results from the ordinal model (Fig. 3b) provide a more detailed picture by calculating the predicted proportions of each ordinal rating (“Very good”, “Good”, “Probably OK”, “Probably not OK”, “Bad”, “Very bad”) for each condition. Notably, the predicted proportions of “Very good” ratings were highest for the Modal and Typical Speakers at around 0.47, followed by Sequitur, RC00 and the Outlier Speaker at around 0.39, and then CDP++ at 0.32. This confirms that two of the three models (RC00 and Sequitur) produced pronunciations that were judged within the same range of acceptability as pronunciations by real human speakers.

Comparing methods

Table 3 compares the ratings and the corpus-based evaluations. As can be seen in the table, the two methods provided similar assessments of the RC00 and CDP++ models, whereas the performance of Sequitur was judged to be lower under the corpus-based method (76% on the strict criterion) than under the ratings method (90%). The discrepancy with Sequitur provides an opportunity to test which method is doing a better job in evaluating model performance.

Table 3 Comparison of model evaluations under corpus-based methods (i.e. % match score) and ratings-based methods (i.e. % judged as acceptable)

In a first attempt to understand the different outcomes, we collected an independent assessment of the acceptability of the discrepant naming outcomes. The same three phoneticians from Experiment 1 saw the orthographic string of all the discrepant nonwords alongside their corresponding phonemic strings (converted from DISC to IPA) and were asked to judge whether each was an acceptable pronunciation (note that judging the acceptability of phonemic strings rather than audio files removes any potential influence from the synthesiser on acceptability). The phoneticians’ median judgement was significantly more likely to agree with the outcome of the ratings method than with the corpus-based method (57 vs. 43%), χ2 = 6.78, df = 1, p = 0.004, and this effect did not differ between models (χ2 = 0.90, df = 2, p = 0.63). If we adopt the phoneticians’ median judgement as ground truth for the discrepant items and combine it with the items on which the methods agreed, we obtain the following values for model pronunciation acceptability: 79% for CDP++, 93% for RC00 and 87% for Sequitur. This pattern of results is more similar to the estimates of the ratings method and suggests that the ratings method provided a more accurate assessment of the performance of Sequitur.

Figure 4 provides a closer look at phoneticians’ median ratings; the left panel shows the distribution of phoneticians’ median ratings for discrepant items that were accepted under the corpus-based method (and rejected under the ratings method) whereas the right panel shows phoneticians’ median ratings for discrepant items accepted under the ratings method. The higher overall count of items in the right panel reflects the fact that there were more discrepant items rated acceptable under the ratings method. Importantly, 62% of the items accepted by the ratings method were also accepted by the phoneticians, with very few rated “Bad” or “Very bad”. By contrast, for the items accepted by the corpus-based method, there was no clear pattern in the phoneticians’ judgements. Again, this suggests that the phoneticians agreed more with the ratings method than with the corpus-based method.

Fig. 4 Phoneticians’ median rating counts for discrepant items of Experiment 2, separated by rating and by type of discrepancy, i.e. accepted under the corpus-based method (hence rejected under ratings, left) or accepted under ratings (hence rejected under the corpus-based method, right)

It is also interesting to note that phoneticians’ judgements on the discrepant items were closely related to the typicality of the pronunciations, as measured through the surprise index (SI). The discrepant items accepted under the corpus-based method but rejected by the phoneticians had a high mean SI of 3.0, whereas those items accepted by the phoneticians had a low mean SI of 1.9 (one-sided t test yielded t = -4.3, df = 71.2, p value < 0.0001)Footnote 18.

This suggests that discrepant items with a high SI reflected false positives from the corpus-based method: When participants from Mousikou et al. produced a rare pronunciation (a word with a high SI), it was rejected by both the ratings method and the phoneticians because it was likely a mistaken production or transcription error.

Although the above analyses suggest the ratings method was more accurate overall, it is still not clear why Sequitur fared so much better under the ratings method than the corpus-based method (see Table 3).Footnote 19 One possibility is that some of the discrepant nonword productions were poorly synthesized, with more subtle errors missed in the ratings method. If this occurred more often for the nonwords produced by Sequitur it might help explain the results. Note that the previous analyses do not rule this possibility out, as the phoneticians rated the phonemic text strings rather than the synthesized outputs.

To explore this possibility we asked the three phoneticians to verify the synthesiser’s pronunciations. As in Experiment 1, phoneticians were asked to judge whether the phonemic string of each discrepant item (converted from DISC to IPA) had been accurately rendered by the synthesiser on the same scale used in the main experiment, with positive ratings (“Probably OK” or better) indicating that all phonemes in the string were pronounced to a satisfactory standard and negative ratings (“Probably not OK” or worse) indicating that at least one phoneme had been synthesised poorly. Only 3% of discrepant items were unanimously rated as poorly synthesised (the remaining 97% were rated positively by at least one phonetician). However, when using the median of the three phoneticians’ ratings as the outcome (as opposed to the unanimous rating), 15% of the discrepant items were rated as poorly synthesised.Footnote 20

Importantly though, no model was affected differently by poor synthesis when considering all discrepant items (χ2 = 1.45, df = 2, p = 0.49) or when specifically considering discrepant items rejected by the corpus-based method (χ2 = 0.60, df = 2, p = 0.74).

Finally, we performed a detailed analysis of the discrepant items using the same methods employed above in order to gain some additional insight. We identified 19 phonemic patterns that appeared in at least five of the discrepant items, and these are listed in Table S8 in Supplementary Material, where each pattern is analysed separately. Here we summarise the main findings.

Other, albeit less numerous, cases highlighted in Section S8 of Supplementary Material demonstrate the unfair penalisation of some typical Sequitur pronunciation patterns. However, some sporadic counter-evidence can be found as well. For example, g→_ in geveld, I→2 in etind and oppind, and {→1 in apreds are cases where Sequitur generalised from CELEX (see Section S8 in Supplementary Material for details), but these pronunciations were accepted under ratings and rejected by the phoneticians, thus in agreement with the corpus-based method. In cases like these it remains unclear why some generalisations from CELEX are rated negatively by phoneticians, but further investigation of this issue is beyond the scope of this work.

To summarise, the phoneticians’ ratings of nonword pronunciations suggest that the ratings method was more accurate in assessing Sequitur’s performance, and the analysis of error patterns identified some frequent pronunciation patterns by Sequitur, originating from CELEX, that were deemed acceptable by the ratings method and by the phoneticians but not by the corpus-based method. These seem to be examples of the false-negative errors associated with the corpus-based method, with acceptable responses rejected on the basis that they did not match a reference set of pronunciations (generated from a limited set of 41 participants).

General Discussion

In this paper we have compared two different methods of assessing disyllabic nonword naming performance. For the corpus-based method we relied on the nonword dataset of Mousikou et al., (2017) that includes 915 disyllabic nonwords and the pronunciation transcriptions of 41 human speakers. According to this method, a model is correct if its output matches at least one human transcription. Apart from providing a measure of model performance, a key contribution of the corpus-based method is that it can document the extreme variability of human nonword naming responses (e.g. in Mousikou et al., (2017)’s paper, the mean number of different pronunciations per nonword was 5.9, ranging from 1 to 22) and thus, the method is particularly important for determining factors that predict consistency or variability of pronunciations. Still, there are some limitations to the corpus-based method, including that it is extremely resource intensive, making it a challenge to expand the dataset to many more nonwords as required for future model development.

We compared this corpus-based method to a ratings-based method that makes it much easier to assess the outputs of a model, by asking human participants to judge its nonword pronunciations when produced by a speech synthesiser. Despite its relative simplicity, this method was at least as good at characterising the nonword naming performance of the two psychological models considered by Mousikou et al., (2017) (RC00 and CDP++), as well as of a freely available commercial text-to-speech product called Sequitur. Indeed, the ratings method (i) identified limitations of the corpus-based method that led to errors in classifying nonword pronunciations (both false positives and false negatives), (ii) correctly assessed the accuracy of nonword pronunciations with high sensitivity and specificity, (iii) reached a similar conclusion regarding the successes of the RC00 (Rastle and Coltheart, 2000) and CDP++ (Perry et al., 2010) accounts of word naming, and (iv) did a better job of assessing the performance of the Sequitur model (see Table 3). Implications of these findings are discussed below.
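To make point (ii) concrete, the sketch below shows one way sensitivity and specificity could be computed for the ratings method, treating an independent judgement (e.g. the phoneticians’ verdicts) as the reference. The pairing of decisions and the example data are hypothetical, not our analysis code.

```python
# A minimal sketch (assumed data layout) of sensitivity/specificity for the
# ratings method against a reference judgement treated as ground truth.
def sensitivity_specificity(pairs):
    """pairs: sequence of (rating_accepts, reference_accepts) booleans."""
    pairs = list(pairs)
    tp = sum(r and ref for r, ref in pairs)
    tn = sum((not r) and (not ref) for r, ref in pairs)
    fp = sum(r and (not ref) for r, ref in pairs)
    fn = sum((not r) and ref for r, ref in pairs)
    sensitivity = tp / (tp + fn) if (tp + fn) else float("nan")
    specificity = tn / (tn + fp) if (tn + fp) else float("nan")
    return sensitivity, specificity

# Illustrative data only: (accepted by ratings method, accepted by reference)
pairs = [(True, True), (True, False), (False, False), (True, True), (False, True)]
print(sensitivity_specificity(pairs))
```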

With regard to the limits of the corpus-based method, we have identified three sources of error that result from the assumption that a pronunciation is correct only if it matches a transcription from a reference list of human pronunciations. First, there were transcription errors. Mousikou et al., (2017)’s reference list was transcribed by a single person (who transcribed over 37K responses), and the inevitable mistakes were treated as correct pronunciations under the corpus-based method. Second, there were (equally inevitable) human pronunciation errors that were also treated as correct pronunciations under the corpus-based method. Examples of transcription errors and human pronunciation errors were identified above (e.g. surbeft→s3bft, conclise→kQn2sz, pilprem→pIspr@m, and udgement→v_im). Third, the reference list contains only a limited sample of human responses (41 speakers in this case), and in some cases there were plausible pronunciations of a nonword that were not produced by any participant (or which were excluded from the list because erroneous transcriptions prevented the actual pronunciations from appearing). For example, Experiment 1 highlighted numerous model outputs that did not match any human pronunciation (‘Zero Match’ pronunciations) but nevertheless received positive ratings in the ratings method. Although some of these cases may be considered borderline acceptable, many outputs obtained a median rating of ’Very good’ and are thus clear examples of false negatives in the corpus-based method (e.g. freacely→frislI, conglist→k@nglIst, and afflave→@fl1v).
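A simple way to surface candidate false negatives of this third kind is to cross-tabulate the corpus-based decision with the median rating, as in the sketch below. The field names, rating coding, and threshold are our own assumptions rather than part of either method’s specification.

```python
# A minimal sketch (hypothetical fields and threshold): flag model outputs that
# match no human transcription ("Zero Match") yet receive a high median rating.
from statistics import median

def flag_false_negative_candidates(items, rating_threshold=2):
    """items: iterable of dicts with keys 'nonword', 'output', 'human_prons', 'ratings'."""
    flagged = []
    for it in items:
        zero_match = it["output"] not in it["human_prons"]
        if zero_match and median(it["ratings"]) >= rating_threshold:
            flagged.append((it["nonword"], it["output"]))
    return flagged

items = [  # illustrative data only
    {"nonword": "nonword_A", "output": "pronX",
     "human_prons": {"pronY", "pronZ"}, "ratings": [2, 2, 1]},
]
print(flag_false_negative_candidates(items))
```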

We also inspected the discrepant items more closely in an attempt to further gauge the reliability of each method. First, we demonstrated that for the majority of these items the synthesiser had rendered the pronunciation accurately, thus lending credibility to participants’ judgements of these pronunciations. Second, we asked phoneticians to judge the phonemic string of each model pronunciation (independently of the synthesiser) and found that they were more likely to agree with the outcome of the ratings method than with that of the corpus-based method. The phoneticians also confirmed many cases of false negatives under the corpus-based method, such as the examples outlined above (frislI, k@nglIst, @fl1v), which they unanimously deemed acceptable. We also examined whether there were any common patterns of error that could provide insight into the discrepancies between methods. Notably, some of Sequitur’s outputs that were deemed incorrect under the corpus-based method (but correct under the ratings method and by the phoneticians) were based on generalisations from the CELEX training set, such as iI, {#, and kQnk@n. That is, the model had generalised correctly from its training set, and listeners (and trained phoneticians) considered these productions appropriate, but because the productions were not consistent with the limited sample of human pronunciations in Mousikou et al., (2017), they were deemed incorrect under the corpus-based method.

It is difficult to remedy the three sources of error associated with the corpus-based method outlined above. Participant errors are difficult to avoid, and although transcription errors may be reduced by using more than one transcriber and cross-checking their transcriptions, this would be time consuming and expensive. To reduce susceptibility to the third source of error (the limited sample of human responses), the corpus-based method would require a very large number of human participants in order to capture as many pronunciation variants as possible; however, increasing the number of participants would in turn increase the number of transcription errors and human pronunciation errors.
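If a second transcriber were available, the cross-checking step could be as simple as flagging trials on which the two transcriptions differ, so that only those recordings need to be re-examined. The sketch below illustrates this idea; the data layout and identifiers are hypothetical.

```python
# A minimal sketch (hypothetical data layout) of cross-checking two independent
# transcribers: report only the trials on which their transcriptions disagree.
def transcription_disagreements(transcriber_a, transcriber_b):
    """Both arguments map (participant_id, nonword) -> transcription string."""
    return {
        key: (transcriber_a[key], transcriber_b[key])
        for key in transcriber_a.keys() & transcriber_b.keys()
        if transcriber_a[key] != transcriber_b[key]
    }

a = {("p01", "nonword_A"): "pronX", ("p01", "nonword_B"): "pronY"}  # illustrative
b = {("p01", "nonword_A"): "pronX", ("p01", "nonword_B"): "pronZ"}
print(transcription_disagreements(a, b))
```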

The ratings method sidesteps the three sources of error outlined above because it does not rely on constructing reference lists of human pronunciations; there is therefore no inherent risk of wrongly penalising pronunciations because they do not match a given set of references (false negatives), or of wrongly accepting pronunciations because they match a reference that is itself erroneous (false positives). Furthermore, the ratings method is much easier to implement because (i) there is no need to create a reference list of human pronunciations each time a new dataset of nonwords is introduced, and (ii) data can be collected online, which is much faster and relatively cheap. Nevertheless, we acknowledge that the ratings method has its own limitations, and it is important to consider how these may be addressed in future implementations. For example, although synthesis was occasionally inaccurate in the current implementation, this problem should diminish as speech synthesis technology improves. Furthermore, we have made our full dataset available to facilitate further analyses of the discrepant items and thus provide further insight into the strengths and weaknesses of each method (for example, the original sound files can be examined to assess which discrepant items really were transcription errors and which really were never pronounced by the human participants). The full dataset may also be used in the development of future computational models, as a benchmark for capturing the fine-grained and heterogeneous range of acceptable pronunciations for each pseudoword (e.g. Schmalz et al., 2020; Ulicheva et al., 2021).

In summary, generalisation to unfamiliar nonwords is a key challenge for models of word naming. Datasets such as that collected by Mousikou et al., (2017) are valuable for testing models, and particularly for helping to explain the variability of human pronunciations. However, comparing model output with human pronunciations is not straightforward. The corpus-based method of assessing model performance, which requires collecting and transcribing human productions, is extremely labour intensive, making it difficult to test models on a wide range of nonwords beyond existing datasets, and we have identified a number of limitations to its accuracy. By contrast, the ratings-based approach is an easy-to-implement alternative that is not prone to the same mistakes as the corpus-based method and is arguably a more accurate test of model performance. We hope this method will prove useful in future model development.