Investigating Critical Speech Recognition Errors in Spoken Short Messages

Part of the Signals and Communication Technology book series (SCT)


Understanding dictated short messages requires the system to perform speech recognition on the user's speech, a process that is prone to errors. If the system can automatically detect the presence of an error, it can use dialog to clarify or correct its transcript. In this work, we analyze the types of errors a recognition system makes and propose a method to detect the critical ones. In particular, we distinguish simple errors from critical errors, in which the meaning of the transcript differs from what the user dictated. We show that our method outperforms standard baseline techniques by 2% absolute in F-score.
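The simple/critical distinction above can be illustrated with a minimal sketch. The snippet below is an illustrative toy, not the paper's method: it assumes word-level recognizer confidence scores are available, flags low-confidence words as error candidates, and treats a candidate as "critical" only when it is a content-bearing word (i.e., likely to change the message meaning). The threshold, function-word list, and labels are all assumptions made for the example.

```python
# Toy sketch of simple-vs-critical ASR error flagging.
# Assumes each hypothesis word comes with a confidence score in [0, 1].
# Threshold and function-word list are illustrative, not from the paper.

FUNCTION_WORDS = {"a", "an", "the", "to", "of", "in", "is", "at", "on"}

def classify_errors(hypothesis, threshold=0.6):
    """Return (word, label) pairs with label 'ok', 'simple', or 'critical'."""
    labels = []
    for word, confidence in hypothesis:
        if confidence >= threshold:
            labels.append((word, "ok"))
        elif word.lower() in FUNCTION_WORDS:
            # Low confidence on a function word: likely a low-impact error.
            labels.append((word, "simple"))
        else:
            # Low confidence on a content word: likely changes the meaning.
            labels.append((word, "critical"))
    return labels

# Example: "meet me at eight" misrecognized as "meet me at ate".
hyp = [("meet", 0.9), ("me", 0.8), ("at", 0.4), ("ate", 0.3)]
print(classify_errors(hyp))
```

A dialog system could then ask a clarification question only for words labeled "critical", avoiding needless confirmations for low-impact errors.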



Copyright information

© Springer International Publishing Switzerland 2016

Authors and Affiliations

  1. Carnegie Mellon University, Pittsburgh, USA
  2. Honda Research Institute, Mountain View, USA
