
Improving Word Recognition in Speech Transcriptions by Decision-Level Fusion of Stemming and Two-Way Phoneme Pruning

  • Conference paper
Advanced Computing (IACC 2020)

Abstract

We introduce an unsupervised approach for correcting highly imperfect speech transcriptions, based on decision-level fusion of stemming and two-way phoneme pruning. Transcripts are obtained from videos by extracting the audio with the FFmpeg framework and converting it to text with the Google speech API. The benchmark LRW dataset contains 500 word classes, with 50 mp4 videos per class; each video is 29 frames (1.16 s) long, and the target word appears in the middle of the video. Our approach improves on the baseline word-recognition accuracy of 9.34% through stemming, phoneme extraction, filtering, and pruning. Applying a stemming algorithm to the text transcript raises the accuracy to 23.34%. To convert words to phonemes we use the Carnegie Mellon University (CMU) Pronouncing Dictionary, which maps English words to their pronunciations. We propose a two-way phoneme pruning comprising two non-sequential steps: (1) filtering the phoneme sequence down to vowels and plosives, and (2) filtering it down to vowels and fricatives. Decision-level fusion of the stemming and two-way phoneme pruning outputs raises the word recognition rate further, to 32.96%.
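The two-way phoneme pruning described above can be sketched in a few lines of Python. This is a minimal illustration, not the paper's implementation: it assumes "pruning" means retaining only the named phoneme classes, uses a tiny hand-written stand-in for the CMU Pronouncing Dictionary (stress markers stripped), lists only a subset of ARPAbet vowels, plosives, and fricatives, and models the decision-level fusion as accepting a match if either pruned view agrees (the stemming branch of the fusion is omitted).

```python
# Illustrative subset of ARPAbet phoneme classes.
VOWELS = {"AA", "AE", "AH", "AO", "AW", "AY", "EH", "ER",
          "EY", "IH", "IY", "OW", "OY", "UH", "UW"}
PLOSIVES = {"P", "B", "T", "D", "K", "G"}
FRICATIVES = {"F", "V", "TH", "DH", "S", "Z", "SH", "ZH", "HH"}

# Hypothetical mini pronouncing dictionary (word -> phoneme list),
# a stand-in for the full CMU Pronouncing Dictionary.
CMU_DICT = {
    "about": ["AH", "B", "AW", "T"],
    "abuse": ["AH", "B", "Y", "UW", "S"],
    "absent": ["AE", "B", "S", "AH", "N", "T"],
}

def signature(word, keep):
    """Pruned-phoneme signature of a word, or None if the word is unknown."""
    phones = CMU_DICT.get(word)
    if phones is None:
        return None
    return tuple(p for p in phones if p in keep)

def fused_match(hypothesis, reference):
    """Decision-level fusion sketch: accept if EITHER pruned view agrees."""
    views = [VOWELS | PLOSIVES, VOWELS | FRICATIVES]
    for keep in views:
        hyp_sig = signature(hypothesis, keep)
        if hyp_sig is not None and hyp_sig == signature(reference, keep):
            return True
    return False
```

For example, `fused_match("about", "about")` succeeds in the vowel-plosive view, while `fused_match("about", "abuse")` fails in both views because the pruned signatures differ.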



Author information

Correspondence to Sunakshi Mehra.


Copyright information

© 2021 Springer Nature Singapore Pte Ltd.

About this paper


Cite this paper

Mehra, S., Susan, S. (2021). Improving Word Recognition in Speech Transcriptions by Decision-Level Fusion of Stemming and Two-Way Phoneme Pruning. In: Garg, D., Wong, K., Sarangapani, J., Gupta, S.K. (eds) Advanced Computing. IACC 2020. Communications in Computer and Information Science, vol 1367. Springer, Singapore. https://doi.org/10.1007/978-981-16-0401-0_19

Download citation

  • DOI: https://doi.org/10.1007/978-981-16-0401-0_19

  • Publisher Name: Springer, Singapore

  • Print ISBN: 978-981-16-0400-3

  • Online ISBN: 978-981-16-0401-0

  • eBook Packages: Computer Science, Computer Science (R0)
