Abstract
We introduce an unsupervised approach for correcting highly imperfect speech transcriptions based on a decision-level fusion of stemming and two-way phoneme pruning. Transcripts are acquired from videos by extracting the audio with the FFmpeg framework and converting the audio to text with the Google API. The benchmark LRW dataset contains 500 word categories with 50 videos per class in MP4 format. Each video consists of 29 frames (1.16 s in total), and the target word appears in the middle of the video. Our approach improves on the baseline accuracy of 9.34% through stemming, phoneme extraction, filtering, and pruning. Applying a stemming algorithm to the text transcripts raises word-recognition accuracy to 23.34%. To convert words to phonemes we use the Carnegie Mellon University (CMU) pronouncing dictionary, which maps English words to their phonetic pronunciations. We propose a two-way phoneme pruning comprising two non-sequential steps: (1) filtering and pruning the phonemes containing vowels and plosives, and (2) filtering and pruning the phonemes containing vowels and fricatives. Fusing the results of stemming and two-way phoneme pruning at the decision level further improves the word-recognition rate to 32.96%.
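The phoneme-pruning and fusion steps described above can be sketched as follows. This is a minimal illustration, not the authors' implementation: the mini pronouncing dictionary stands in for the full CMU dictionary, the LRW vocabulary is reduced to three words, and the positional match score and max-based fusion rule are assumptions made for the sake of a runnable example.

```python
# Illustrative sketch of two-way phoneme pruning with decision-level fusion.
# Phoneme classes follow the ARPAbet symbols used by the CMU dictionary.
PLOSIVES = {"P", "B", "T", "D", "K", "G"}
FRICATIVES = {"F", "V", "TH", "DH", "S", "Z", "SH", "ZH", "HH"}
VOWELS = {"AA", "AE", "AH", "AO", "AW", "AY", "EH", "ER", "EY",
          "IH", "IY", "OW", "OY", "UH", "UW"}

# Hypothetical mini dictionary (word -> phonemes, stress markers removed);
# the real system would use the full CMU pronouncing dictionary.
CMU_MINI = {
    "ABOUT":      ["AH", "B", "AW", "T"],
    "ABSOLUTELY": ["AE", "B", "S", "AH", "L", "UW", "T", "L", "IY"],
    "ACCESS":     ["AE", "K", "S", "EH", "S"],
}

def prune(phonemes, keep_consonants):
    """Keep only vowels plus one consonant class (plosives or fricatives)."""
    return [p for p in phonemes if p in VOWELS or p in keep_consonants]

def match_score(a, b):
    """Fraction of aligned positions that agree (an assumed similarity)."""
    if not a or not b:
        return 0.0
    hits = sum(1 for x, y in zip(a, b) if x == y)
    return hits / max(len(a), len(b))

def recognize(hypothesis, vocabulary=CMU_MINI):
    """Score both pruning channels and fuse them at the decision level."""
    best_word, best_score = None, -1.0
    for word, ref in vocabulary.items():
        s_plo = match_score(prune(hypothesis, PLOSIVES),
                            prune(ref, PLOSIVES))
        s_fri = match_score(prune(hypothesis, FRICATIVES),
                            prune(ref, FRICATIVES))
        score = max(s_plo, s_fri)  # decision-level fusion of the two channels
        if score > best_score:
            best_word, best_score = word, score
    return best_word

# A transcription with one corrupted phoneme still recovers the word:
print(recognize(["AH", "B", "AW", "D"]))  # → ABOUT
```

The point of the two channels is that a recognition error in a plosive (here T heard as D) does not hurt the fricative channel, so the fused decision stays correct.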
Copyright information
© 2021 Springer Nature Singapore Pte Ltd.
Mehra, S., Susan, S. (2021). Improving Word Recognition in Speech Transcriptions by Decision-Level Fusion of Stemming and Two-Way Phoneme Pruning. In: Garg, D., Wong, K., Sarangapani, J., Gupta, S.K. (eds) Advanced Computing. IACC 2020. Communications in Computer and Information Science, vol 1367. Springer, Singapore. https://doi.org/10.1007/978-981-16-0401-0_19
Print ISBN: 978-981-16-0400-3
Online ISBN: 978-981-16-0401-0