Semi-Supervised Acoustic Model Retraining for Medical ASR

Conference paper
Part of the Lecture Notes in Computer Science book series (LNCS, volume 11096)


Training models for speech recognition usually requires accurate word-level transcription of available speech data. For the domain of medical dictations, it is common to have “semi-literal” transcripts available: large numbers of speech files along with their associated formatted episode report, whose content only partially overlaps with the spoken content of the dictation. We present a semi-supervised method for generating acoustic training data by decoding dictations with an existing recognizer, confirming which sections are correct by using the associated report, and repurposing these audio sections for training a new acoustic model. The effectiveness of this method is demonstrated in two applications: first, to adapt a model to new speakers, resulting in a 19.7% reduction in relative word errors for these speakers; and second, to supplement an already diverse and robust acoustic model with a large quantity of additional data (from already known voices), leading to a 5.0% relative error reduction on a large test set of over one thousand speakers.
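The confirmation step described above can be sketched as follows: decode each dictation with an existing recognizer, align the word-level hypothesis against the semi-literal report text, and keep only the audio spans where a long-enough run of consecutive words agrees. This is a minimal illustration, not the paper's implementation; the `Word` record, the `min_run` threshold, and the use of `difflib.SequenceMatcher` as the aligner are all assumptions made for the sketch.

```python
# Sketch of semi-supervised data confirmation: keep audio regions where the
# ASR hypothesis and the associated report agree on a run of words.
# `Word`, `min_run`, and the SequenceMatcher alignment are illustrative choices.
from dataclasses import dataclass
from difflib import SequenceMatcher


@dataclass
class Word:
    text: str     # recognized word
    start: float  # start time in seconds
    end: float    # end time in seconds


def confirmed_segments(hyp, report_tokens, min_run=5):
    """Return (start, end) audio spans where the hypothesis matches the
    report in runs of at least `min_run` consecutive words."""
    hyp_tokens = [w.text.lower() for w in hyp]
    ref_tokens = [t.lower() for t in report_tokens]
    matcher = SequenceMatcher(a=hyp_tokens, b=ref_tokens, autojunk=False)
    spans = []
    for block in matcher.get_matching_blocks():
        if block.size >= min_run:
            words = hyp[block.a:block.a + block.size]
            spans.append((words[0].start, words[-1].end))
    return spans


# Toy example: the filler word "um" and trailing speech are not confirmed
# by the report, so only the six-word agreeing span survives.
hyp = [Word(t, float(i), i + 0.5) for i, t in enumerate(
    "patient presents with acute chest pain um noted".split())]
report = "Patient presents with acute chest pain .".split()
print(confirmed_segments(hyp, report))  # prints [(0.0, 5.5)]
```

The confirmed spans (with the recognizer's hypothesis as their transcript) can then be cut from the audio and pooled as training data for the new acoustic model.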


Keywords: Medical speech recognition · ASR · Medical dictation · Acoustic modeling



Copyright information

© Springer Nature Switzerland AG 2018

Authors and Affiliations

  1. EMR.AI Inc., San Francisco, USA
  2. University of California, Berkeley, Berkeley, USA
  3. DHBW, Karlsruhe, Germany
