Multimedia Tools and Applications, Volume 76, Issue 19, pp 20359–20376

Improving speech transcription by exploiting user feedback and word repetition

  • Xiangdong Wang
  • Ying Yang
  • Hong Liu
  • Yueliang Qian


Speech transcription is important for video/audio retrieval and many other applications. In automatic speech transcription, recognition errors are inevitable, which makes user feedback such as manual error correction necessary. In this paper, an approach is proposed to improve the accuracy of speech transcription by exploiting user feedback and word repetition. The method learns from user feedback and from the recognition results of preceding utterances, and then corrects errors when repeated words are misrecognized in subsequent utterances. An interaction scheme for user feedback is proposed, which facilitates error correction through candidate lists and provides a new kind of feedback, referred to as word indication, that extends error correction from repeated words to repeated phrases. For template extraction and matching, a representation of word templates and recognition results based on the syllable confusion network (SCN) is proposed. During transcription, SCN-based templates of multi-syllable words/phrases are extracted from user feedback and the N-best lattice, and are then matched against the SCN corresponding to the recognition results of subsequent utterances, yielding a new candidate list when repeated words are detected. Experimental results show that considerable error reduction is achieved in the newly generated candidate lists.
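
The matching step described above can be illustrated with a minimal sketch. It is not the authors' algorithm: it assumes an SCN can be modelled as a list of slots that map candidate syllables to posterior probabilities, that a template is a plain syllable sequence (the paper's templates are themselves SCN-based), and that a match is scored by multiplying posteriors. The function name `match_template`, the threshold value, and the example syllables are all hypothetical.

```python
from typing import Dict, List, Tuple

# Assumption: a syllable confusion network (SCN) is modelled as a list of
# slots; each slot maps candidate syllables to posterior probabilities.
Slot = Dict[str, float]


def match_template(
    scn: List[Slot],
    template: List[str],
    threshold: float = 0.01,  # hypothetical cut-off, not from the paper
) -> List[Tuple[int, float]]:
    """Slide a syllable template over the SCN and return (start, score) hits.

    The score is the product of the posteriors of the template syllables in
    the aligned slots; a syllable absent from a slot gives a score of zero.
    """
    hits = []
    for start in range(len(scn) - len(template) + 1):
        score = 1.0
        for offset, syllable in enumerate(template):
            score *= scn[start + offset].get(syllable, 0.0)
        if score >= threshold:
            hits.append((start, score))
    return hits


# Hypothetical data: the top-ranked path reads "jin tian", but the template
# "jing tian" (imagined as learned from an earlier user correction) is still
# present among the lower-ranked candidates.
scn = [
    {"wo": 0.9, "wu": 0.1},
    {"jin": 0.6, "jing": 0.4},
    {"tian": 0.8, "dian": 0.2},
]
hits = match_template(scn, ["jing", "tian"])
```

Here the template is found starting at slot 1 even though it is not on the top-ranked path, which is the situation in which a repeated word could be offered in a new candidate list.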


Speech transcription, Error correction, User feedback, Repeated word



Copyright information

© Springer Science+Business Media New York 2017

Authors and Affiliations

  1. Research Center for Ubiquitous Computing Systems, Institute of Computing Technology, Beijing, China
  2. Beijing Key Laboratory of Mobile Computing and Pervasive Device, Institute of Computing Technology, Chinese Academy of Sciences, Beijing, China
  3. China Agricultural University, Beijing, China
