Abstract
We introduce an unsupervised approach for correcting highly imperfect speech transcriptions based on a decision-level fusion of stemming and two-way phoneme pruning. Transcripts are acquired from videos by extracting the audio with the FFmpeg framework and converting the audio to text with the Google API. The benchmark LRW dataset contains 500 word categories with 50 videos per class in MP4 format. Each video consists of 29 frames (1.16 s in total), and the target word appears in the middle of the video. Our approach improves on the baseline accuracy of 9.34% through stemming, phoneme extraction, filtering, and pruning. Applying a stemming algorithm to the text transcripts raises word-recognition accuracy to 23.34%. To convert words to phonemes we use the Carnegie Mellon University (CMU) pronouncing dictionary, which maps English words to their phonetic pronunciations. We propose a two-way phoneme pruning comprising two non-sequential steps: (1) filtering and pruning the phonemes containing vowels and plosives, and (2) filtering and pruning the phonemes containing vowels and fricatives. Fusing the results of stemming and two-way phoneme pruning at the decision level further improves the word-recognition rate to 32.96%.
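The phoneme-pruning and fusion steps described above can be sketched as follows. This is a minimal illustration, not the authors' implementation: the mini pronouncing dictionary stands in for the full CMU dictionary, the LRW vocabulary is reduced to three words, and the positional match score and max-based fusion rule are assumptions made for the sake of a runnable example.

```python
# Illustrative sketch of two-way phoneme pruning with decision-level fusion.
# Phoneme classes follow the ARPAbet symbols used by the CMU dictionary.
PLOSIVES = {"P", "B", "T", "D", "K", "G"}
FRICATIVES = {"F", "V", "TH", "DH", "S", "Z", "SH", "ZH", "HH"}
VOWELS = {"AA", "AE", "AH", "AO", "AW", "AY", "EH", "ER", "EY",
          "IH", "IY", "OW", "OY", "UH", "UW"}

# Hypothetical mini dictionary (word -> phonemes, stress markers removed);
# the real system would use the full CMU pronouncing dictionary.
CMU_MINI = {
    "ABOUT":      ["AH", "B", "AW", "T"],
    "ABSOLUTELY": ["AE", "B", "S", "AH", "L", "UW", "T", "L", "IY"],
    "ACCESS":     ["AE", "K", "S", "EH", "S"],
}

def prune(phonemes, keep_consonants):
    """Keep only vowels plus one consonant class (plosives or fricatives)."""
    return [p for p in phonemes if p in VOWELS or p in keep_consonants]

def match_score(a, b):
    """Fraction of aligned positions that agree (an assumed similarity)."""
    if not a or not b:
        return 0.0
    hits = sum(1 for x, y in zip(a, b) if x == y)
    return hits / max(len(a), len(b))

def recognize(hypothesis, vocabulary=CMU_MINI):
    """Score both pruning channels and fuse them at the decision level."""
    best_word, best_score = None, -1.0
    for word, ref in vocabulary.items():
        s_plo = match_score(prune(hypothesis, PLOSIVES),
                            prune(ref, PLOSIVES))
        s_fri = match_score(prune(hypothesis, FRICATIVES),
                            prune(ref, FRICATIVES))
        score = max(s_plo, s_fri)  # decision-level fusion of the two channels
        if score > best_score:
            best_word, best_score = word, score
    return best_word

# A transcription with one corrupted phoneme still recovers the word:
print(recognize(["AH", "B", "AW", "D"]))  # → ABOUT
```

The point of the two channels is that a recognition error in a plosive (here T heard as D) does not hurt the fricative channel, so the fused decision stays correct.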
Copyright information
© 2021 Springer Nature Singapore Pte Ltd.
Mehra, S., Susan, S. (2021). Improving Word Recognition in Speech Transcriptions by Decision-Level Fusion of Stemming and Two-Way Phoneme Pruning. In: Garg, D., Wong, K., Sarangapani, J., Gupta, S.K. (eds) Advanced Computing. IACC 2020. Communications in Computer and Information Science, vol 1367. Springer, Singapore. https://doi.org/10.1007/978-981-16-0401-0_19
Print ISBN: 978-981-16-0400-3
Online ISBN: 978-981-16-0401-0