Forced-Alignment and Edit-Distance Scoring for Vocabulary Tutoring Applications

  • Serguei Pakhomov
  • Jayson Richardson
  • Matt Finholt-Daniel
  • Gregory Sales
Part of the Lecture Notes in Computer Science book series (LNCS, volume 5246)


We demonstrate an application of Automatic Speech Recognition (ASR) technology to the assessment of young children’s basic English vocabulary. We use a test set of 2935 speech samples, manually rated by three reviewers, to compare several approaches to measuring and classifying the accuracy of the children’s pronunciation of words, including acoustic confidence scoring obtained by forced alignment and edit distance between the expected and actual ASR output. We show that phoneme-level language modeling can be used to obtain good classification results even with a relatively small amount of acoustic training data. The area under the ROC curve of the ASR-based classifier that uses a bi-phone language model interpolated with a general English bi-phone model is 0.80 (95% CI 0.78–0.82). At the operating point where sensitivity and specificity are jointly maximized, sensitivity is 0.74 and specificity is 0.80 (harmonic mean 0.77), which is comparable to human performance (ICC = 0.75; absolute agreement = 81%).
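To illustrate the edit-distance measure named in the abstract, the following is a minimal Python sketch that compares an expected phoneme sequence with a recognized one using Levenshtein distance. It is not the authors' implementation: the ARPAbet-style phoneme symbols, the unit edit costs, the length normalization, and the 0.5 acceptance threshold are all assumptions chosen for illustration. The last helper simply checks the harmonic mean of the sensitivity and specificity reported above.

```python
from typing import Sequence


def edit_distance(expected: Sequence[str], actual: Sequence[str]) -> int:
    """Levenshtein distance between two phoneme sequences (unit costs assumed)."""
    prev = list(range(len(actual) + 1))
    for i, exp_ph in enumerate(expected, start=1):
        curr = [i]
        for j, act_ph in enumerate(actual, start=1):
            cost = 0 if exp_ph == act_ph else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def normalized_distance(expected: Sequence[str], actual: Sequence[str]) -> float:
    """Edit distance divided by the longer sequence length (an assumed normalization)."""
    longest = max(len(expected), len(actual))
    return edit_distance(expected, actual) / longest if longest else 0.0


def is_acceptable(expected: Sequence[str], actual: Sequence[str],
                  threshold: float = 0.5) -> bool:
    """Classify a pronunciation as acceptable; the threshold is hypothetical."""
    return normalized_distance(expected, actual) <= threshold


def harmonic_mean(a: float, b: float) -> float:
    """Harmonic mean of two rates, e.g. sensitivity and specificity."""
    return 2 * a * b / (a + b) if (a + b) else 0.0


# Hypothetical example: target word "cat" and two recognizer outputs.
expected = ["K", "AE", "T"]
close_attempt = ["K", "AH", "T"]   # one substituted vowel
far_attempt = ["G", "AH", "D"]     # every phoneme differs

print(edit_distance(expected, close_attempt), is_acceptable(expected, close_attempt))
print(edit_distance(expected, far_attempt), is_acceptable(expected, far_attempt))
print(round(harmonic_mean(0.74, 0.80), 2))
```

In practice the threshold would be tuned on rated data, for example by sweeping it to trace the ROC curve and choosing the operating point that balances sensitivity and specificity.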


Keywords: Automatic speech recognition; vocabulary tutor; sub-word language modeling





Copyright information

© Springer-Verlag Berlin Heidelberg 2008

Authors and Affiliations

  • Serguei Pakhomov (1)
  • Jayson Richardson (2)
  • Matt Finholt-Daniel (3)
  • Gregory Sales (3)

  1. University of Minnesota, Minneapolis
  2. University of North Carolina, Wilmington
  3. Seward Incorporated, Minneapolis
