Using Presentation Slides and Adjacent Utterances for Post-editing of Speech Recognition Results for Meeting Recordings

  • Conference paper
Text, Speech, and Dialogue (TSD 2021)

Part of the book series: Lecture Notes in Computer Science (LNAI, volume 12848)

Abstract

Automatic speech recognition (ASR) systems are increasingly used in meetings, for example for minutes generation and speaker diarization. However, ASR systems often misrecognize words because meetings contain domain-specific content. In this paper, we propose a novel method for automatically post-editing ASR results by using the presentation slides that meeting participants use and the utterances adjacent to a target utterance. We focus on automatic post-editing rather than domain adaptation because post-editing makes it easy to incorporate external information and can be applied to arbitrary speech recognition engines. In experiments, we found that our method can significantly improve the recognition accuracy of domain-specific words (proper nouns). We also found an improvement in the word error rate (WER).
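For readers unfamiliar with the metric, WER is conventionally the word-level edit distance (substitutions, deletions, and insertions) between a reference transcript and a hypothesis, divided by the number of reference words. The following is a minimal sketch of that standard definition, not the evaluation code used in the paper. The function name `word_error_rate`, the pre-tokenized input, and the example sentences are all illustrative assumptions; how the paper tokenizes its transcripts is not stated in the abstract.

```python
from typing import Sequence


def word_error_rate(reference: Sequence[str], hypothesis: Sequence[str]) -> float:
    """Word-level Levenshtein distance divided by the reference length.

    Hypothetical helper for illustration; inputs are assumed to be
    already split into word tokens.
    """
    if not reference:
        raise ValueError("reference must contain at least one token")

    # prev[j] holds the edit distance between the reference prefix
    # processed so far and hypothesis[:j].
    prev = list(range(len(hypothesis) + 1))
    for i, ref_tok in enumerate(reference, start=1):
        curr = [i]
        for j, hyp_tok in enumerate(hypothesis, start=1):
            cost = 0 if ref_tok == hyp_tok else 1
            curr.append(min(
                prev[j] + 1,         # reference word missing from hypothesis
                curr[j - 1] + 1,     # extra word in hypothesis
                prev[j - 1] + cost,  # substitution, or match when cost is 0
            ))
        prev = curr
    return prev[-1] / len(reference)


# Made-up example: one misrecognized proper noun out of five words.
reference = "the kaldi decoder runs offline".split()
asr_output = "the caldy decoder runs offline".split()
post_edited = "the kaldi decoder runs offline".split()

print(word_error_rate(reference, asr_output))   # one substitution in five words
print(word_error_rate(reference, post_edited))  # identical to the reference
```

Under this definition, a post-editing step that restores a single misrecognized proper noun lowers WER by one error over the reference length, which is why proper-noun accuracy and WER can move by very different amounts.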

Notes

  1. https://github.com/kaldi-asr/kaldi

  2. https://github.com/julius-speech/julius

  3. https://azure.microsoft.com/en-us/services/cognitive-services/speech-to-text/

  4. https://github.com/utanaka2000/fairseq/blob/japanese_bart_pretrained_model/JAPANESE_BART_README.md

Author information

Corresponding author

Correspondence to Kentaro Kamiya.

Copyright information

© 2021 Springer Nature Switzerland AG

About this paper

Cite this paper

Kamiya, K., Kawase, T., Higashinaka, R., Nagao, K. (2021). Using Presentation Slides and Adjacent Utterances for Post-editing of Speech Recognition Results for Meeting Recordings. In: Ekštein, K., Pártl, F., Konopík, M. (eds) Text, Speech, and Dialogue. TSD 2021. Lecture Notes in Computer Science, vol 12848. Springer, Cham. https://doi.org/10.1007/978-3-030-83527-9_28

  • DOI: https://doi.org/10.1007/978-3-030-83527-9_28

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-030-83526-2

  • Online ISBN: 978-3-030-83527-9

  • eBook Packages: Computer Science, Computer Science (R0)
