A study on the challenges and opportunities of speech recognition for Bengali language

Artificial Intelligence Review

Abstract

Speech recognition lets people interact with and command machines, which makes it a core technology in human-computer interaction. A speech recognition system is language-dependent: it is built directly on the linguistic and textual properties of the target language. Automatic speech recognition (ASR) systems are now widely used to convert speech to text. Although ASR systems are well established for widely spoken international languages, their implementation for the Bengali language has not yet reached an acceptable state. In this work, we survey the current status of research on Bengali ASR systems. We then present the challenges most often encountered when constructing a Bengali ASR system, divide them into language-dependent and language-independent challenges, and suggest how each may be overcome. Based on this investigation, we conclude that Bengali ASR requires architectures designed specifically around the grammatical and phonetic structure of the Bengali language.

Acknowledgements

We would like to thank the Advanced Machine Learning (AML) lab and Bangladesh University of Business and Technology (BUBT) for their support.

Author information

Corresponding author

Correspondence to M. F. Mridha.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

About this article

Cite this article

Mridha, M.F., Ohi, A.Q., Hamid, M.A. et al. A study on the challenges and opportunities of speech recognition for Bengali language. Artif Intell Rev 55, 3431–3455 (2022). https://doi.org/10.1007/s10462-021-10083-3

  • DOI: https://doi.org/10.1007/s10462-021-10083-3
