Introduction

Automatic Speech Recognition (ASR) is one of the most continually evolving and explored areas in speech processing. Researchers have invested considerable effort in improving the performance of real-time speech processing applications such as digit recognition, large-vocabulary broadcast-news transcription, speech dictation systems, dialogue systems, etc. Despite the advancements being made in the field throughout the world, ASR still poses an enormous number of challenges, and many languages are yet to be explored efficiently. A limited number of automatic speech recognition systems are available for commercial use; e.g., Apple Siri supports 22 speech varieties, which is exactly the number of languages officially recognized in India. The gap between the number of languages spoken around the globe and the number of languages with technical support remains very wide. Some of the challenges include the complexity of designing systems for large-vocabulary databases, especially for minority languages where standard speech corpora are scarce; the articulatory and phonetic characteristics of the speaker in speaker-independent ASR; spontaneous phenomena such as "ah", "um", silences, and out-of-vocabulary words in the speech signal; and the effect of external environmental factors such as noise or distortion through the channel. These problems are prevalent in real-time ASR applications. Despite all these interferences, the speech processing community has succeeded in developing highly effective applications for some majority languages, including dictation systems that automatically generate written text from the input signal, database access, interfaces for human communication, machine control, and automatic remote services accessed over a dial-up connection.

Framework of ASR

The following block diagram depicts an automatic speech recognizer. The front-end processing includes pre-emphasis and feature extraction; these feature extraction methods are explained in later sections of the paper. The speech sample is decoded at the back end with the help of knowledge gained from the acoustic model, language model, and pronunciation model, which transforms the input speech signal into a text string in a readable format. The grammar rules are fed into the language model. The front end corresponds to the training phase and the back end corresponds to the testing phase.
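As a rough illustration of the front-end stage described above, the sketch below applies pre-emphasis, framing with a Hamming window, and a log power spectrum to a signal. The parameter values (a 0.97 pre-emphasis coefficient, 25 ms frames, 10 ms shift) are common defaults assumed here for the example and are not taken from any system surveyed in this paper; a mel filterbank and DCT would follow to obtain MFCCs.

```python
# A minimal front-end sketch with assumed parameter values; it illustrates
# the pre-emphasis and framing stage described above, not any particular
# system from the literature.
import numpy as np

def pre_emphasis(signal, alpha=0.97):
    """Boost high frequencies: y[n] = x[n] - alpha * x[n-1]."""
    return np.append(signal[0], signal[1:] - alpha * signal[:-1])

def frame_signal(signal, sample_rate, frame_ms=25, shift_ms=10):
    """Split the signal into overlapping, Hamming-windowed frames."""
    frame_len = int(sample_rate * frame_ms / 1000)
    frame_shift = int(sample_rate * shift_ms / 1000)
    n_frames = 1 + max(0, (len(signal) - frame_len) // frame_shift)
    frames = np.stack([signal[i * frame_shift: i * frame_shift + frame_len]
                       for i in range(n_frames)])
    return frames * np.hamming(frame_len)

def log_power_spectrum(frames, n_fft=512):
    """Per-frame log power spectrum; a mel filterbank and DCT would follow
    to obtain MFCC features."""
    spectrum = np.abs(np.fft.rfft(frames, n_fft)) ** 2
    return np.log(spectrum + 1e-10)

# Example: 1 s of synthetic audio at 16 kHz.
sr = 16000
audio = np.random.randn(sr).astype(np.float32)
feats = log_power_spectrum(frame_signal(pre_emphasis(audio), sr))
print(feats.shape)  # (n_frames, n_fft // 2 + 1)
```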

ASR relies on several dominant and powerful probabilistic modeling techniques to represent real-world data, including speech signals and spoken-language documents gathered from real-world applications. The ASR problem can be formulated as a fundamental statistical classification problem based on pattern recognition [73, 83]. Classes are defined from the sequences of allowed words in the dictionary, and the speech signal is represented parametrically. The classification problem in ASR is to find the sequence of words W that maximizes Pr(W|X). Bayes' theorem is used to factorize this quantity [73] as

$$ \Pr \left( W|X \right) = \frac{\Pr \left( X|W \right)\Pr \left( W \right)}{\Pr \left( X \right)}. $$
(1)

For an acoustic input sequence X, Pr(W|X) is maximized by searching for the class W that maximizes the numerator on the right-hand side of Eq. (1), i.e., Pr(X|W)Pr(W). The Language Model (LM) [257], represented by the factor Pr(W), is based on high-level constraints and linguistic information about the word set chosen for a task. The acoustic model is given by the factor Pr(X|W), which describes the statistics of the parameterized acoustic observations in the given feature space, conditioned on the phonemes of the uttered words. In the 1990s, the Hidden Markov Model (HMM) proved to be the most effective acoustic model for efficiently modeling sub-word units, including phonemes and syllables, as well as complete spoken sentences [257, 121]. The Markov chain model, also known as the n-gram model, is popularly used in language models for sentences and word classification in text documents. For both HMMs and n-gram models, learning the model parameters from large training datasets with appropriate training criteria is imperative. Results have shown that the effectiveness of data-driven models depends largely on the quality of the estimated models and on the modeling techniques adopted for training. Although HMMs have been proven to perform effectively for ASR acoustic modeling applications, including efficient pattern recognition, and have contributed greatly to different fields of research, especially speech recognition, they suffer from major intrinsic limitations. Owing to this, researchers have pursued different approaches to building ASR applications. A hybrid of Artificial Neural Networks (ANNs) with HMMs was proposed [37] to overcome these drawbacks. Existing Continuous Density Hidden Markov Models (CDHMMs) trained with the forward–backward or Viterbi algorithms show reduced discriminative power compared with several other techniques, as they are trained on Maximum Likelihood (ML) criteria, which are not discriminative. Also, the large number of parameters in HMMs hinders their implementation in hardware. Consequently, to overcome such limitations, discriminatively trained ANNs can perform non-parametric classification over sequences of patterns. ANNs use a fixed number of parameters, which makes them feasible for implementation in neural microchips. As a result, hybrid ANN–HMM systems have outperformed the earlier systems. Advancements in computing hardware over the past two decades have made acoustic training of ANN systems much simpler. ANNs with large numbers of hidden units and large amounts of training data require efficient hardware such as Graphics Processing Units (GPUs) for the massive computations involved. Typically, feature input vectors spanning wide temporal contexts of acoustic frames are handled more easily by ANNs than by standard GMM–HMMs. Hybrid ANN–HMMs use log mel-frequency spectral coefficients directly, without the de-correlating discrete cosine transform (DCT) [217, 183] that was required earlier by Gaussian Mixture Models (GMMs). These elements demonstrated the efficiency of such systems. The combination of HMMs and ANNs began in the 1990s and was termed the hybrid HMM/ANN approach [81, 99, 182]. With advancements in computational power, an advanced form of ANN, the Deep Neural Network (DNN), gained popularity.
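To make the decoding rule in Eq. (1) concrete, the toy sketch below scores a handful of hypothetical word sequences with made-up acoustic and language-model log-probabilities and returns the argmax. The candidate list, the scores, and the lm_weight scaling factor are illustrative assumptions only; real recognizers search enormous hypothesis lattices rather than a fixed list.

```python
# A toy illustration of the decoding rule in Eq. (1): the recognizer picks
# the hypothesis W maximizing Pr(X|W)Pr(W). The candidates and scores below
# are invented for the example.
# log Pr(X|W): acoustic-model scores for three hypothetical hypotheses
acoustic_logp = {
    "recognize speech": -120.4,
    "wreck a nice beach": -118.9,
    "recognise peach": -125.1,
}

# log Pr(W): language-model scores for the same hypotheses
lm_logp = {
    "recognize speech": -8.2,
    "wreck a nice beach": -14.6,
    "recognise peach": -13.0,
}

def decode(acoustic_logp, lm_logp, lm_weight=1.0):
    """Return argmax_W [log Pr(X|W) + lm_weight * log Pr(W)]."""
    return max(acoustic_logp,
               key=lambda w: acoustic_logp[w] + lm_weight * lm_logp[w])

print(decode(acoustic_logp, lm_logp))  # -> "recognize speech"
```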
There are a number of architectures useful for implementing deep learning: the Restricted Boltzmann Machine (RBM), Deep Belief Network (DBN), Auto-Encoder (AE), and Convolutional Neural Network (CNN). RBMs are extensively used in deep learning for building generative stochastic ANN models. The RBM is a variant of the Boltzmann Machine (BM); BMs are neural networks with stochastic processing units connected bidirectionally, whereas a DBN stacks many layers of RBMs and uses a greedy layer-wise algorithm for training. A DBN is a probabilistic generative model with stochastic hidden units. The two top layers have undirected, symmetric connections, while each upper layer provides top-down connections to the layer below. Every two adjacent layers form an RBM: each RBM's visible layer is connected to the previous RBM's hidden layer, the top two layers are undirected, and the connections between upper and lower layers are directed in a top-down manner. The various RBM layers in a DBN are trained sequentially. The CNN, in contrast, belongs to the discriminative deep learning architectures. A CNN has two types of layers: convolution layers, also called c-layers, and sub-sampling layers, also known as s-layers. The c-layers and s-layers are connected alternately and form the middle part of the network. The inputs are convolved with trainable filters to produce feature maps in the first c-layer; each filter carries a layer of connection weights. The first s-layer then produces additional sub-sampled feature maps, and this procedure continues to yield the feature maps of the following c-layers and s-layers. An illustrative sketch of this alternating structure is given below.
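The sketch below (in PyTorch) is a minimal, hypothetical illustration of the alternating c-layer/s-layer pattern just described, applied to a log-mel spectrogram patch treated as a single-channel image. The layer sizes, the 40 x 20 input patch, and the 40 output classes are assumptions made for the example and do not correspond to any system reviewed here.

```python
# A minimal c-layer / s-layer sketch with assumed dimensions; not a model
# from any of the cited works.
import torch
import torch.nn as nn

class SmallCNN(nn.Module):
    def __init__(self, n_classes=40):  # e.g. 40 phone classes (assumed)
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(1, 16, kernel_size=3, padding=1),   # c-layer 1
            nn.ReLU(),
            nn.MaxPool2d(2),                              # s-layer 1
            nn.Conv2d(16, 32, kernel_size=3, padding=1),  # c-layer 2
            nn.ReLU(),
            nn.MaxPool2d(2),                              # s-layer 2
        )
        self.classifier = nn.Linear(32 * 10 * 5, n_classes)

    def forward(self, x):
        x = self.features(x)              # (batch, 32, 10, 5) for a 40x20 input
        return self.classifier(x.flatten(1))

# Example: a batch of 8 spectrogram patches, 40 mel bins x 20 frames.
model = SmallCNN()
logits = model(torch.randn(8, 1, 40, 20))
print(logits.shape)  # torch.Size([8, 40])
```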

Approaches to automatic speech recognition

In the field of machine learning, different statistical learning approaches have been studied thoroughly. To build an effective classifier for pattern recognition applications, there exist two distinct learning approaches, namely, generative learning and discriminative learning. The generative learning technique uses density estimation models to compute the probability of each class from the distribution of the data. Inherent dependencies among the data can be easily exploited with generative learning methods using structural constraints. However, the major limitation of these models lies in the fact that they require the true distribution of the data to build an optimal classifier. On the other side, discriminative learning schemes have recently gained immense popularity in artificial intelligence and machine learning applications. Since they do not model the underlying distribution of the data and instead directly optimize a mapping function from the inputs to the output class labels, discriminative models have been widely adopted for training different systems. Optimization of the mapping function can be achieved by implementing different training criteria. However, latent variables cannot be handled directly to capture the underlying structure of the data in discriminative training. Furthermore, computations over competing classes are carried out in parallel, which increases the computational complexity of the technique. Pure discriminative techniques are often used only as a substitute component, e.g., neural networks for feature extraction in the front-end phase. Researchers have therefore combined generative and discriminative learning techniques, an approach termed discriminative learning of generative models. Ample research has been conducted in the past years to propose efficient algorithms that can learn generative models in a discriminative manner for machine learning and ASR applications. These hybrid techniques include research carried out since the early 1980s in which HMMs were trained discriminatively using various training methods such as Maximum Mutual Information Estimation (MMIE, also known as CMLE), Minimum Classification Error (MCE), Maximum Likelihood Estimation (MLE), Minimum Phone Error (MPE), etc. In this survey article, the author reviews the most relevant work in the literature on training techniques for speech recognition models, especially concerning the discriminative training of the various models used for building speech recognition systems. The survey also focuses on the literature on feature vectors, one of the crucial elements in any discriminative learning system. The author avoids any technical or experimental errors in detailing the results. The ensuing article is organized into the following sections: in section "Challenges and issues", a study of the prevalent challenges for ASR systems is presented. The subsequent section, "Motivation", outlines the motivation for developing better ASR systems with improved recognition accuracy. Section "Quality assessment" deals with the quality assessment measures taken by the authors. Section "Recognition of non-Indian ASR systems" presents the research work carried out by researchers around the globe for foreign languages. In section "Recognition of Indian ASR Systems", the author presents the different ASR systems and techniques used in research on Indian languages.
Section "Recognition results of non-Indian and Indian languages" compares the recognition accuracies and Word Error Rates (WERs) achieved by different research works, as summarized in the tables compiled by the author. The tables are segregated into non-Indian languages, Indian languages, and speech research conducted on publicly available datasets. Section "Synthesis analysis" provides some suggestions for researchers to carry the work in this field forward. The author concludes in sections "Suggestions on future directions" and "Conclusion" by analyzing the different techniques and results.
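Since WER is the metric used throughout the comparison tables, the short sketch below shows how it is conventionally computed: the Levenshtein (edit) distance between the reference and hypothesis word sequences, divided by the number of reference words. The example sentences are invented for illustration.

```python
# A minimal sketch of the standard WER computation via edit distance.
def wer(reference, hypothesis):
    ref, hyp = reference.split(), hypothesis.split()
    # dp[i][j] = edit distance between ref[:i] and hyp[:j]
    dp = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        dp[i][0] = i
    for j in range(len(hyp) + 1):
        dp[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1  # substitution
            dp[i][j] = min(dp[i - 1][j] + 1,      # deletion
                           dp[i][j - 1] + 1,      # insertion
                           dp[i - 1][j - 1] + cost)
    return dp[len(ref)][len(hyp)] / len(ref)

print(wer("the cat sat on the mat", "the cat sat on mat"))  # 1/6 ≈ 0.167
```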

Objectives of speech recognition applications

With the development and advancement of digital signal processing hardware as well as software, speech recognition systems have made great strides. Still, machines cannot match the performance of humans in terms of accuracy or speed. The suitability of each method depends on the application. The following are the various objectives of Speech Recognition applications:

  • To understand Speech Recognition and the way it works.

  • To review Speech Recognition applications in various areas.

  • To understand how Speech Recognition can be implemented as a single application.

  • To examine how Speech Recognition is faster than writing by hand and to check the hands-free capability of Speech Recognition.

  • To make Speech Recognition useful for mentally and physically disabled persons.

  • To use Speech Recognition for various applications like voice dialing, dictation, and navigation.

Challenges and issues

An extensive study of the existing literature has been carried out to identify the various techniques and methodologies, along with the existing challenges in speech recognition that could trigger further research in the field. Speech can be classified into isolated words, connected words, and continuous or spontaneous speech. The basic modes of speech include speaker-dependent and speaker-independent recognition. Each speech recognition system faces several challenges. First, there is wide variation among speakers in uttering a word, leading to pronunciation differences. This variation is further attributed to the age, gender, dialect, and physical characteristics of the speaker. Secondly, background noise can add to the problems of recognizing speech accurately in different environments and real-time applications. The next challenge involves the physiological aspect of pronouncing words, i.e., how stress is placed on different syllables, phones, and vowels; this particularly affects the speech recognition of tonal languages. A continuous speech system is rather difficult to implement owing to the uninterrupted speech we use in real life, which poses further problems for speech recognition systems. Other factors contributing to lower recognition rates include poor microphone quality and the position and direction of the microphone relative to the speaker. Besides these, environment variation, channel variation, speaking style, age, and gender contribute to the challenging task of speech recognition; e.g., Kirchhoff and Vergyri [158] mentioned that Arabic script falls short of vowels as well as other phone-related information. The major difficulties posing issues for speech recognition of the Russian language are the variations in informal speech and the non-existence of standard dictionaries [123]. Some speech corpora have been collected by researchers/research agencies in India, but these corpora have not been made available to researchers for carrying out work in the field. Thus, there is a dearth of standard databases for Indian languages. Shanthi Therese and Lingam [299] reported that for designing an efficient speech recognition system, selecting and extracting the most relevant parametric information is crucial. Segmentation of words into their corresponding phonemes in Indian languages is a tedious task. Owing to the linguistic variations among the different languages of India, a single language may have many scripts and multiple languages may share one script. Furthermore, people from different regions speak a common language in different accents or tones. For instance, Punjabi, being a tonal language, is spoken differently in different parts of Punjab. Although a lot of work has been done for other languages, Punjabi, being one of the most widely spoken languages across the globe, needs more attention from researchers in terms of speech processing. Accuracy, noise removal, information retrieval, and varying bit rates are some of the most significant challenges in speech recognition.

Motivation

Discriminative learning schemes have gained immense popularity in artificial intelligence and machine learning applications. A lot of research has already been carried out on foreign languages using different discriminative criteria for training purposes. However, Indian languages still need discriminative training techniques to be employed for improved performance of ASR systems. Future study should focus on determining whether dialect-ID systems are robust against speaker variability, or whether systems incorporating prosodic information are required to provide further improvements. The development of a successful ASR engine depends upon a few factors. The selection of appropriate features affects the training and testing phases of the system, and these feature vectors depend upon various conditions. The discriminative features then need to be passed to an appropriate classifier, so the selection of a classifier also plays a crucial role; each classifier has its own limitations and capabilities. Recognition performance in the testing phase is affected by the features used for training and by how their observations are stored in different classifiers. The training of the system depends mainly upon its corpus: the size of the speech corpus and the methods of its collection may drastically affect the system output. Also, the language models should be developed appropriately depending on the size and the nature of the data. Such factors significantly affect the accuracy of the system. The prevalent ASR systems for the majority of languages spoken across the globe have still not achieved the required efficiency [168]. Speech recognition techniques, including discriminative training of language models for Large Vocabulary Continuous Speech Recognition (LVCSR) systems, need to be implemented more efficiently to improve recognition accuracy.

Quality assessment

After the inclusion/exclusion criteria were applied to identify the relevant papers, a quality assessment was performed on the remaining papers. The quality assessment form ("Appendix 1") covers all the papers included in the review, which contain high-quality speech recognition research. The questions included in Sect. 1 of "Appendix 1" served as a basis for screening the studies. After a research paper was included, it was classified based on section "Inclusion/exclusion criteria". Then, we proceeded to sections "Recognition of non-Indian ASR systems" and "Recognition of Indian ASR Systems".

Data extraction

At the start of the review, we faced numerous problems. Extracting all the relevant data ("Appendix 2") from many studies was very difficult. Due to this, we needed to contact various researchers to find details that could not be inferred from the papers themselves. The steps carried out in the data extraction are as follows:

  • All the papers were surveyed first, and then, from the primary studies, data were extracted.

  • Another researcher checked the consistency of data extraction by performing the extraction on the primary data. The samples were randomly selected, and the results were cross-checked.

  • If any disagreement was noticed during cross-checking, the authors resolved it in a consensus meeting.

Inclusion/exclusion criteria

A systematic review of literature is a method to identify, evaluate, and interpret the available literature in the form of research papers, articles, journals, etc., so that the studied literature can be summarized, research gaps can be identified and a base for carrying out future research can be formulated [162]. When the review procedure was being carried out, the authors focused on the following research questions:

  • Question: What types of standard databases have been used to carry out the experiments?

  • Question: What terminologies have been applied to formulate new databases? For example, the number of speakers included, environment variations, bias in the gender of utterances, etc.

  • Question: What challenges have been identified for the techniques applied to Indian as well as non-Indian languages?

  • Question: What feature extraction techniques have been employed?

  • Question: What results have been reported in the findings of the experiment?

The datasets mentioned in the studies are summarized in Table 1.

Other types of research questions could have been identified and included in the review procedure, but the above-mentioned questions clearly define the insights and focus of the research that needs to be carried out in the field of ASR. From the immense amount of existing literature on speech recognition and its terminologies, primary research articles focusing on the central idea were filtered out with the help of database searches using significant keywords. The authors then refined the selected papers manually; irrelevant papers were discarded based on the information in their titles. The following facets were identified based on the focus of the research and the research questions, which drove the inclusion–exclusion principle for the review: speech recognition, feature extraction, large vocabulary continuous speech recognition, and ASR systems. The study followed a systematic approach to include published quantitative and qualitative research articles, which made the database search more comprehensive. Finally, the primary studies were included in the review based on their abstracts and full texts. Our systematic approach and strict inclusion criteria likely reduced heterogeneity, but did not eliminate biases in the original studies, diversity in study design and population, or publication bias.

Recognition of non-Indian ASR systems

Jiang [131] summarized the discriminative training techniques of HMMs for automatic speech recognition. The author presented the discriminative training criteria available in the literature as well as the optimization methodologies employed for discriminative learning. Gales et al. [87] reviewed different forms of discriminative learning models, including maximum entropy Markov models, hidden conditional random fields, and conditional augmented models, and discussed the application of these models to large vocabulary continuous speech recognition systems. Hinton et al. [111] presented a survey to compare and summarize the existing methodologies used in different stages of speech recognition. The paper focuses on different feature extraction techniques, including LPC, MFCC, etc., and several approaches to speech recognition; in addition, the evaluation of different ASR techniques is demonstrated. Hemakumar and Punitha [110] provided a technological overview of the fundamental research carried out in the field of ASR over the past many years. The paper discussed different problems persisting in ASR and the methodologies developed so far for feature extraction, classification, models employed for different databases, and strategies used for performance evaluation. Johnson et al. [201] performed a systematic review of the literature on speech recognition (SR) in health care settings from 2000 to 2014. Six medical databases were searched by a qualified health librarian. They concluded that SR, although not as accurate as human transcription, delivers reduced turnaround times and cost-effective reporting, with only equivocal evidence of improved workflow processes. Debatin et al. [64] reviewed offline voice recognition on Android mobile devices and revealed that the research priorities for offline SR are reducing the error rate, developing neural network language models, and researching advanced statistical models; however, very few solutions have been offered for offline voice recognition on Android mobile devices. Clark et al. [56] found that work on speech-based human–computer interaction focuses on nine key topics, including system speech production, modality comparison, user speech production, assistive technology and accessibility, design insight, and experiences with interactive voice response (IVR) systems. Nassif et al. [230] conducted a review of the literature on speech recognition from 2006 to 2019. They concluded that most of the reported studies still used MFCCs for feature extraction from speech signals, and MFCCs were heavily used with traditional classifiers (HMM and GMM); 75% of the DNN models were standalone models, while only 25% were hybrid models. The present paper explores state-of-the-art speech recognition with respect to different feature and modeling approaches for Indian and non-Indian languages across the world. Singh et al. [306] conducted a review of the spoken languages of India based on the relevant research articles published from 2000 to 2018; the purpose of that systematic survey was to sum up the best available research on automatic speech recognition of Indian languages. Kaur et al. [147] reviewed the status of speech recognition research conducted on tonal languages spoken around the globe.
The authors observed that a lot of work has been done for Asian tonal languages, i.e., Chinese, Thai, Vietnamese, and Mandarin, but little work has been reported for Mizo, Bodo, and Indo-European tonal languages such as Punjabi, Latvian, and Lithuanian, as well as for African tonal languages such as Hausa and Yoruba (Tables 1, 2).

English

English, an Indo-European language, has been accepted globally and is spoken by people around the world for communication. The English language has many accents: British, American, Indian, etc. A few standard databases have been created by researchers for speech analysis and recognition research, and some of them have been made public. Results of research work conducted on these databases are discussed in Table 3. Richardson and Campbell [269] developed an SVM classifier for the NIST 2007 language corpus. Benzeghiba et al. [29] presented a comprehensive analysis of different terminologies and methodologies used for automatic speech recognition. The review included factors like speaking accent, physiology of the speaker, age, emotions, and style of speech; different modeling techniques were also presented. A comparative study of various model architectures using hybrid HMMs was examined on TIMIT phone recognition; the global discriminative training methods employed performed slightly better than the baseline HMM approach. Variants of neural networks used for automatic speech recognition have also been discussed. Saon and Chien [279] introduced Bayesian sensing hidden Markov models (BS-HMMs) to perform Bayesian sensing and model regularization for heterogeneous training data. Results of BS-HMMs on an LVCSR task exhibited improvements over conventional HMMs based on Gaussian mixture models. Cai et al. [41] applied acoustic maxout neural networks to the Switchboard phone call data. Experiments were carried out to minimize the effect of underfitting, and the results reported that maxout networks converged faster than linear networks. Lopez-Moreno et al. [194] used a DNN as an end-to-end LID classifier and extracted bottleneck features. Experiments were carried out in two separate settings: the complete NIST Language Recognition Evaluation 2009 dataset (LRE'09) and Voice of America (VOA) data from LRE'09. DNN-based systems significantly outperform the i-vector system when dealing with short-duration utterances. Khademian and Homayounpour [152] developed a joint-token passing algorithm and used deep neural networks for joint speaker identification and gain estimation, achieving a 5.3% absolute task performance improvement. Badino et al. [22] experimented with DNN–HMM phone recognition systems that use measured articulatory information. Evaluations on both the MOCHA-TIMIT and mngu0 datasets show that the recovered articulatory features reduce the phone error rate (PER) in both clean and noisy speech conditions. Moore et al. [223] evaluated the performance of the CHiME3 baseline ASR system in a diverse range of acoustic conditions using the recently released ACE Challenge database of AIRs and noise. The benefit of speech enhancement processing was demonstrated, with a reduction in WER of up to 82%. Hanani et al. [101] worked on language identification of 14 regional accents of British English in the ABI-1 corpus. Their system achieves a recognition accuracy of 89.6%, compared with 95.18% for the ACCDIST-based system. Sailor and Patil [272] proposed an unsupervised learning model based on a convolutional restricted Boltzmann machine (ConvRBM) with rectified linear units. Experiments on the TIMIT and AURORA 4 databases show that ConvRBM can learn more general representations of speech signals. Hori et al. [115] proposed a system with end-to-end recurrent neural networks (RNNs).
The beamformed signal is processed by a single-channel long short-term memory (LSTM) enhancement network, which is used to extract stacked mel-frequency cepstral coefficient (MFCC) features. The recurrent neural network-based system was extended by applying beamforming, noise-robust feature extraction techniques, and large-scale LSTM RNN language models, and achieved a 5.05% WER on the real test data. Maas et al. [195] found that increasing model size and depth are simple but effective ways to improve WER performance. Their experiments suggest that the DNN architecture is quite competitive with specialized architectures such as DCNNs and DLUNNs; the DNN architecture outperformed other architecture variants in both frame classification and final system WER. Sainath et al. [273] introduced a joint CNN/DNN architecture to allow speaker-adapted features to be used, and the authors investigated a strategy to make dropout effective after HF sequence training. Experiments on three LVCSR tasks, namely 50 and 400 h BN tasks and a 300 h SWB task, indicate that a CNN with the proposed speaker-adapted and ReLU + dropout ideas allows a 12%–14% relative improvement in WER over a strong DNN system. Seide et al. [290] presented CD-DNN–HMMs for speech recognition tasks, which reduced the word error rate compared with discriminatively trained Gaussian-mixture HMMs on the Switchboard corpus. Swietojanski et al. [318] investigated CNNs for large vocabulary distant speech recognition using the AMI meeting corpus and found that CNNs improve the WER by 6.5% relative compared to conventional DNN models. Similar results were also found by Li et al. [184] and Abdel-Hamid et al. [1]. Weng et al. [356] introduced DNNs in experiments on CHiME; the DNN was trained using MFCC features, and the authors achieved a significant improvement in WER as compared to the basic DNN system. Sainath et al. [273] applied CNNs to LVCSR systems. Experiments were conducted on 50 h of data from the English Broadcast News Speech Corpora, and advanced features were extracted to train the system. Zhang et al. [364] proposed a hybrid approach of CNN and CTC (Connectionist Temporal Classification) to overcome the limitations of the CNN. 40-dimensional log Mel filter bank coefficients were extracted, and results were shown on the TIMIT database. Beck et al. [27] introduced Hidden Conditional Random Fields (CRFs) for large vocabulary systems to overcome the poor feature-modeling capability of HMMs. Experiments were conducted on the Switchboard 300 h speech corpus, and the results were reported with a neural network trained on Gammatone features. Bahdanau et al. [23] developed LVCSR systems using an RNN that performs sequence prediction directly at the character level. Two methods were proposed to speed up the search operation: in the first, the scan was restricted to a subset of the most promising frames, and in the second, pooling the information contained in neighboring frames helped reduce the source sequence length. Chen et al. [49] proposed a modular training framework for E2E ASR in which end-to-end decoding is retained. The results show that the performance gap between the CI-phone CTC and the A2W model is reduced. Ravanelli et al. [266] performed experiments on the TIMIT, DIRHA, CHiME, and LibriSpeech databases. Hybrid feature extraction techniques using MFCC, fBANK, and FMLLR features were employed to train the RNN–HMM system, and a significant improvement over standard RNN systems was reported (see also Seltzer et al. [293]; Jing et al. [132]).

Table 1 Speech corpus
Table 2 Facets for inclusion/exclusion criteria
Table 3 Recognition results for non-Indian languages

Mandarin

Liu et al. (2010) applied different existing discriminative training approaches such as MPE and fMPE on a bilingual speech corpus. The results showed that the STA phone clustering technique is better than the existing phone clustering criteria. The bilingual corpus combined Mandarin and English. Chien and Huang [50] gave a discriminative linear regression adaptation algorithm for HMM-based speech recognition. Aggregate a posteriori linear regression (AAPLR) was proposed for discriminative adaptation when the classification errors of adaptation data have to be minimized. Hwang et al. [124] pointed out that since Mandarin is a tonal language, adding pitch information might help in speech recognition; therefore, they added pitch information to the input of the Tandem neural nets. The system built with confusion network combination yields a 9.1% CER on the DARPA GALE 2007 dataset.

Hoffmeister et al. [114] developed the RWTH LVCSR system for Mandarin. The proposed system integrated additional feature streams, such as tone and NN-based posteriors, and combined multiple systems. Plahl et al. [246] again developed an RWTH-based LVCSR system for Mandarin; a new reduced toneme set was developed, which helped reduce the character error rate by about 3% relative. Wang et al. [352] explored the use of multifactor clustering for training data and the use of MPE–MAP and fMPE–MAP acoustic model adaptations. A 6% relative reduction in recognition error rate was achieved compared to a Mandarin recognition system that does not use genre-specific acoustic models. Valente et al. [339] investigated the prevalent front-ends for scalability; results reveal that MLP features produce relative improvements at the different steps of a multi-pass system. Yang et al. [359] proposed a hybrid technique for automatically generating Chinese abbreviations and performed vocabulary expansion using the output of the abbreviation model for voice search. An improvement from 16.9 to 79.2% was achieved by incorporating the top-10 abbreviation candidates into the vocabulary. Li et al. [185] conducted experiments on the Hub4 Chinese broadcast news database; a 3-gram language model was trained for the experiments. Chen et al. [48] developed large-scale Mandarin speech corpora and studied their pronunciation patterns. The system was evaluated in multi-speaker read-speech mode. Research has also been carried out on the tonal aspect of the language using continuous speech, reporting improvement in recognition rate (Lei et al. 2016). Huang et al. [121] conducted experiments on Mandarin speech recognition by extracting pitch-related features and training a DNN–HMM-based acoustic model. They proposed an Encoder-Classifier framework for modeling Mandarin tones using RNNs, and the results show that the proposed network improves tone classification accuracy. Zou et al. [366] presented the contrasting behavior of CTC and attention-based encoder–decoder models. The experiments were conducted on the DidiCallcenter and DidiReading datasets.

Japanese

A large vocabulary corpus of spontaneous Japanese speech, telephone-based name recognition data, and the MIT JUPITER weather information continuous speech data were examined by McDermott et al. [206]. Minimum Classification Error (MCE) training was evaluated on the available corpora and outperformed the baseline methods in terms of word error rate.

Shimizu et al. [303] performed experiments on Chinese–Japanese and Chinese–English datasets and achieved a recognition accuracy of 82–94%. Nakamura [229] used a neural network-based model on multiple languages, including English and Japanese. Kinoshita et al. [155] concentrated on dealing with the effect of late reverberation. The proposed technique initially estimates late reverberation using long-term multi-step linear prediction and afterward reduces the late reverberation effect by employing spectral subtraction. The technique showed significant improvements in ASR performance on real recordings under severe reverberant conditions. Ichikawa et al. [126] proposed dynamic features for different types of data that outperformed MFCC features. Hotta [117] worked on a database of 5240 words; the experiments were conducted with the HTK toolkit, and FBank and MFCC features were extracted from the data. Kawahara [148] describes an automatic transcription system for the Japanese Parliament. The authors proposed a lightly supervised training scheme based on statistical language model transformation that fills the gap between faithful transcripts of spoken utterances and the final texts for documentation, achieving character accuracy of nearly 90%. Moriya et al. [224] used the covariance matrix adaptation evolution strategy (CMA-ES) with multi-objective Pareto optimization to tune DNN–HMM-based large vocabulary speech recognition systems. The experiments were performed on the Spontaneous Japanese corpus, and the proposed technique optimizes systems for achieving high accuracy. Mufungulwa et al. [225] proposed an algorithm in the speech modulation spectrum for Running Spectrum Analysis (RSA) and applied it to speech data; accuracy in noisy environments increased by about 4% compared to conventional methods. Fukuda et al. [82] focused on detecting breathing sounds in the continuous speech of Japanese telephone conversations. The authors achieved high accuracy with GMM- and SVM-based models.

Russian

Kipyatkova et al. [156] proposed a language model that integrated statistical and syntactical text analysis. HMMs were used for acoustic modeling, and the phonemes were modeled using continuous HMMs. The speech corpus consisted of 100 continuous utterances and 1068 words. Ronzhin et al. [270] presented a comprehensive survey of the Russian language and the methods and models applied for speech recognition in Russia and foreign countries. Karpov et al. [142] described an ASR system for large vocabulary tasks in the Russian language. A hybrid of knowledge-based and statistical approaches was used to build the acoustic model; to develop the language model, a novel method combining the syntactical and statistical aspects of the training text was used. The results were computed on two distinct Russian databases. The proposed language model was evaluated on a 204-thousand-word vocabulary and proved to be more efficient than the existing models. On similar grounds, a speech recognition system based on a phonetic decoding technique was developed by Savchenko [287]. Vazhenina and Markov [341] focused on a technique to select phonemes based on a hybrid of phonological and statistical analysis. The proposed technique, when applied to the IPA Russian phonetic set with a reduced number of phonemes, achieved better results.

Karpov et al. [143] built the acoustic model with a combination of knowledge-based and statistical approaches to create several different phoneme sets. The analysis was conducted with a 204-thousand-word vocabulary, and the performance of standard statistical n-gram LMs was compared with that of language models created using their syntactico-statistical method. The results confirmed that the proposed language modeling approach reduces word recognition errors. Kapralova et al. [141] re-decoded speech logged by a production recognizer to improve the quality of the ground-truth transcripts used for training alignments. A fully unsupervised approach to acoustic modeling was described that took advantage of the large traffic volume of Google's speech recognition products. Yanzhou and Mianzhu [360] optimized the recognition algorithm by implementing different feature extraction techniques. Prudnikov et al. [254] and Smirnov et al. [309] proposed a system to detect keywords from LVCSR systems, with experiments conducted on the CMU-Sphinx platform. Similarly, LVCSR systems for the Russian language have been developed [142; Tatarnikova et al. 2006]. Medennikov and Prudnikov [207] developed a speech recognition system combining DNNs with deep Bidirectional Long Short-Term Memory networks; the proposed techniques achieve a WER of 16.4%. Potapova and Grigorieva [250] conducted a perceptual–auditory analysis at various levels of Russian and German speech utterances in a noisy environment.

Kipyatkova and Karpov [157] constructed neural network-based models with different numbers of elements in the hidden layer and performed linear interpolation of the neural network models with the baseline trigram language model. The authors revealed that the application of RNN-based LMs reduced the WER compared to baseline systems. Khokhlov et al. [154] used different acoustic modeling techniques, such as i-vectors, multilingual speaker-dependent bottleneck features, and a combination of feedforward and recurrent neural networks, for building ASR in Russian. The study revealed that fully connected DNNs with max-out activations outperformed TDNN and BLSTM models. Markovnikov et al. [200] presented an end-to-end speech recognition system for recognizing an extra-large vocabulary of Russian speech. The researchers applied CTC and attention-based encoder–decoder approaches with DNN modeling. The SPIIRAS dataset, a speech corpus of 30 h, was used for training, and the KALDI and TensorFlow toolkits were used for conducting the experiments. Kaya and Karpov [149] proposed computationally efficient feature normalization strategies for the challenging task of cross-corpus acoustic emotion recognition. The use of suprasegmental feature normalization strategies shows an enhancement in performance over benchmark normalization approaches. Iakushkin et al. [125] developed a Russian-language speech recognition system based on DeepSpeech. They analyzed the utility of TensorFlow technology for optimizing linear algebra computations in neural network training. The proposed system yields a WER of 18%.

Romanian

Burileanu et al. [39] discussed the speech recognition module of a dialogue system for the Romanian language. MFCC and PLP features were employed to train the model. The medium-sized vocabulary system was tested for efficiency only on a limited number of Romanian words. The speech data used for training consisted of 54 h of speech, and utterances from 17 male and 12 female speakers were used for the test data. Chiopu and Oprea [51] employed neural networks for a discriminative speech recognition system in the Romanian language. Mean Squared Error (MSE), Minimum Classification Error (MCE), Maximum Mutual Information (MMI), and Minimum Phone Error (MPE) discriminative training frameworks were used to improve the recognition accuracy of the system. The minimum error computed for the MLP with MSE was 0.0889. Dumitru and Gavat [74] presented a comparative study on continuous speech recognition systems in the Romanian language; the database for training consisted of 3300 phrases. The SpeeD (Speech and Dialogue) Research Association [91] achieved significant results in developing LVCSR for the Romanian language. Different configurations of acoustic and language models were used on the speech corpora provided by SpeeD in 2014, and employing DNN–HMM hybrid models further improved the WER. The system was also tested on live transcription in Romanian. Militaru et al. [212] developed the ProtoLOGOS ASR system for Romanian speech recognition. The acoustic model was based on HMMs, the language model was a bi-gram model, and Perceptual Linear Prediction (PLP) features were used for the first time in Romanian ASR. Heigold et al. [109] experimented with cross-lingual and multilingual networks for 11 Romance languages using 10 k hours of speech corpus; the average relative gains over the monolingual baselines are 42%.

Cucu et al. [58] presented the improvements the authors brought to the SpeeD automatic speech recognition system. They discussed a noise robustness approach for the ASR system and showed experimentally that the proposed acoustic features and feature transforms improve accuracy. Caranica et al. [42] created an automatic speech recognition (ASR) system for spoken Romanian connected digits. HMMs and a finite state grammar language model were used to build and optimize a fully functional digit recognizer for the Romanian language. Tufiș and Dan [334] discussed the latest developments for the Romanian language; the CoRoLa project has more than 152 h of pre-processed speech recordings. Georgescu et al. (2018) presented a GMM–UBM modeling technique on the RoDigits corpus of the Romanian language. A feature vector of MFCC and LPC features was used to model the system with GMM–UBM techniques (Fig. 1).

Arabic

Vergyri et al. [346] investigated the use of morphology-based models for speech recognition at different stages in conversational Arabic. Language models, i.e., class models and single-stream factored models, were combined within an N-best list re-scoring framework. A large vocabulary recognition system was used to evaluate the proposed techniques [276]. Satori et al. [286] presented an Arabic ASR that worked on the open-source CMU Sphinx-4; the Hello_Arabic_Digit application employed the proposed system. Hsiao et al. [118] suggested improvements to Generalized Discriminative Feature Extraction (GDFT), called regularized GDFT (rGDFT). The new system was evaluated on Iraqi and Arabic ASR tasks; MFCC features were extracted and combined with Linear Discriminant Analysis (LDA) frames. Ali et al. [11] prepared language resources to train and test Arabic ASR. 200 h of GALE data, publicly available through the LDC, were used in the system. The whole experiment was conducted on Kaldi and achieved good results. Baig et al. [24] applied Maximum Likelihood (ML) training on the Holy Quran, followed by the discriminative MPE criterion. Afify et al. [5] introduced an algorithm for simple word decomposition given a text corpus and an affix list. Lexicons were developed with the help of the proposed technique, and a relative improvement in WER was observed. Kirchhoff and Vergyri [158] considered the cross-dialectal variations in Arabic between Modern Standard Arabic and Egyptian Colloquial Arabic. The missing information was recovered with the help of morphological, contextual, and acoustic methods. Kuo et al. [176] investigated Discriminative Language Modeling (DLM) on large vocabulary Arabic broadcast ASR, which was further employed in the Phase 5 DARPA GALE evaluation; details of the minimum Bayes risk (MBR) method for DLM were given. Zarrouk et al. [363] worked on hybrid systems to identify isolated words from a large multi-dialect dataset of Arabic vocabulary. The authors achieved a higher accuracy rate using a hybrid MLP/HMM. El-Amrani et al. [76] conducted experiments on simplified Arabic phonemes to develop a language model for the Holy Quran. A WER of 1.5% was achieved by training on a small dataset of audio files with the Sphinx tool. Speech corpora for carrying out research in Arabic speech recognition have been developed and reported [12, 209, 279]. Telmem and Ghanou [323] presented a CMU-Sphinx system based on HMMs for 11,220 audio files in the native language. Alsharhan and Ramsay [14] worked on the phonological aspects of the Arabic language. A set of language-dependent grapheme-to-allophone rules was formulated; in addition, stress features were extracted with the aim of improving the acoustic modeling.

Malay

Seman and Jusoff [294] handled pronunciation variations in a Standard Malay (SM) speech recognition system. The Standard Malay speech corpus employed in the research included utterances from Buletin Utama TV3 broadcast news, comprising around 550 words and 4 h of speech. PVD reduced the word error rate significantly; however, the proposed technique did not prove effective for handling phone changes. Fook et al. [80] presented a brief review of ASR technologies implemented on Malay corpora. The authors focused on the prevalent issues in speech recognition systems for noisy environments and the techniques to overcome these problems. Jamal and Shanta [130] developed an ASR system for the treatment of aphasic patients. Eng and Ahmad [79] proposed a hybrid technique of Self-Organizing Maps (SOM) and Multilayer Perceptrons (MLP) to recognize speech in Malay. LPC and a two-dimensional SOM feature map were used for feature extraction. Experiments on 15 Malay syllables improved the recognition accuracy by 4%. Rosdi and Ainon (2008) developed a speech recognition system for isolated words in the Malay language based on an HMM acoustic model. The research was carried out on five isolated phoneme word structures, and 88% recognition accuracy was reported. Seman and Jusoff [295] proposed a system to segment and transcribe the spontaneous speech signal automatically instead of employing a manually annotated speech database. Evaluation results on Standard Malay Television (TV3) news for local and non-local speakers were reported to be 42.53% and 30.8%, respectively. Seman et al. [296] proposed an endpoint detection technique to detect isolated Malay words from Parliament sessions in Malaysia. The algorithm combined Short-Term Energy (STE), Zero Crossing Rate (ZCR), frame-based Teager's Energy (FTE), and the Energy Entropy Feature (EEF). A Discrete Hidden Markov Model (DHMM) classifier attained considerable results. HMM models were also developed for the evaluation of the speech of children suffering from language disorders such as stuttering (Tan et al. 2007). Sakti et al. [277] developed a system named A-STAR, a network-based speech translation system for Asian languages.

Liu and Sim [187] investigated the proposed Temporally Varying Weight Regression (TVWR) method for cross-lingual speech recognition. In experiments using Czech, Hungarian, and Russian posterior features conducted on Malay speech, TVWR was found to consistently outperform tandem systems trained on the same features. Rahman et al. [260] worked on a small dataset of 390 sentences to recognize child speech in the Malay language; a speech recognition accuracy of 76% was achieved with the HTK toolkit. Apandi and Jamil [20] developed a speech corpus in the Malay language covering different emotions such as happiness, anger, and sadness, with the help of 30 speakers including children, young adults, and middle-aged adults. MSTAT (Malay Speech Therapy Assistance Tools) assisted therapists in diagnosing children with speech disorders. Malay speech recognition systems for children's speech have also been developed by Ting et al. [329], Rahman et al. [260], Ting and Yunus [328], and Mustafa et al. [227]. Draman et al. [70] extracted voice samples from Telekom Malaysia's call center to develop an ASR for Malay speech; n-gram models were then implemented on the transcribed data. Maseri and Mamat [204] implemented a speech recognition system for Malay with the help of HMM training to recognize preschool children's speech. MFCC features were extracted from the available dataset.

Thai

Suebvisai et al. [317] constructed a Thai speech recognizer by applying a rapid bootstrap technique to build the acoustic model. Using pronunciation variants for the words, rather than consonantal cluster phones, improved the accuracy. Schultz et al. [289] contributed to research on speech-to-speech translation systems. The system was bootstrapped, and a translation component was built; the prototype was built in English for the doctor and in Thai for the patient, and it translates the speech input. Deemagaran and Kawtrakul [65, 66] presented a speaker-independent Thai connected digit speech recognition system. To extract features, MFCC, delta MFCC, delta–delta MFCC, delta energy, and delta–delta energy techniques were implemented (a small sketch of the delta computation is given below). Continuous density HMMs were employed in the speech recognition procedure. Charoenpornsawat et al. [46] proposed a grapheme-based speech recognition system for the Thai language; tri-grapheme modeling with 500 acoustic models achieved better results. Theera-Umpon et al. [325] implemented a new technique for the classification of the tonal accent of syllables. Speech samples from 30 speakers were collected, MFPLP and MFCC features were extracted from the samples, and 66.4% accuracy on tonal accents was reported. Srisuwan et al. [313] focused on using surface electromyography (sEMG) for classifying Thai tonal speech. It was reported that sEMG was able to classify the tones into different categories better than other existing techniques. Thangthai et al. [324] worked on open-vocabulary Thai LVCSR by developing a hybrid language model to eliminate the out-of-vocabulary problem; BEST, LOTUS-BN, and HIT-BTEC were used to develop the language model. Hu et al. [119] incorporated tonal features such as F0 and FFV to improve the accuracy of Thai speech recognition; a CNN was trained for acoustic modeling of the ASR system. Srijiranon and Eiamkanitchat [312] followed a neuro-fuzzy approach to recognize Thai speech. PLP features were extracted from the speech samples, and a recognition accuracy of 20% on the test set was reported. Sertsi et al. [297] described performance in terms of computational capability and recognition accuracy on a mobile device; the proposed offline system achieves a 24% lower real-time factor (RTF) compared to the online system on the mobile device. Wang et al. [351] introduced Automatic Speech Recognition (ASR) systems for Southeast Asian languages. The speech corpus collected for Thai and other languages was used to build ASR systems, and deep learning techniques such as Bidirectional Long Short-Term Memory networks and Time Delay Neural Networks were used for acoustic modeling. Torres et al. [332] worked on the NIST 2016 SRE system, which included Thai as one of the languages in the dataset. Tantibundhit et al. [321] developed a Thai speech recognition system based on different factors, namely, phonemic balance, familiarity, reliability, list equivalency, and homogeneity; the limitations of the system were also discussed. Speech recognition systems focused on the tonal aspects of the Thai language were also developed by Kertkeidkachorn et al. [151].
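As referenced above, delta and delta–delta coefficients are typically derived from the base MFCC matrix with a simple regression over neighboring frames. The sketch below is a generic illustration of that computation, assuming a regression window of N = 2 and random stand-in MFCC values; it is not taken from the cited Thai systems.

```python
# A small, generic sketch of delta and delta-delta feature computation
# using the standard regression formula with an assumed window N = 2.
import numpy as np

def delta(features, N=2):
    """features: (n_frames, n_coeffs). Returns regression-based deltas."""
    padded = np.pad(features, ((N, N), (0, 0)), mode="edge")
    denom = 2 * sum(n * n for n in range(1, N + 1))
    out = np.zeros_like(features)
    for t in range(features.shape[0]):
        out[t] = sum(n * (padded[t + N + n] - padded[t + N - n])
                     for n in range(1, N + 1)) / denom
    return out

mfcc = np.random.randn(100, 13)          # stand-in for real MFCCs
d1 = delta(mfcc)                         # delta MFCC
d2 = delta(d1)                           # delta-delta MFCC
feats = np.hstack([mfcc, d1, d2])        # 39-dimensional feature vectors
print(feats.shape)                       # (100, 39)
```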

Chunwijitra et al. [54] proposed a syllable-based unit called the pseudo-morpheme (PM) and a hybrid recurrent neural network language model (RNNLM) framework for Thai. The presented hybrid lexicon constituted an open vocabulary for Thai LVCSR that reduced the OOV rate to around 1% using just 42% of the vocabulary size. The hybrid RNNLM obtained a 1.54% relative WER reduction when compared with a conventional word-based RNNLM. Kaewprateep et al. [136] performed experiments on small-scale deep learning neural networks for Thai speech recognition. CNN and LSTM models were built with a relatively small speech corpus, and the results show that the CNN outperformed the LSTM for small-scale deep learning. Tantisatirapong et al. [322] compared feature extraction techniques for accent-dependent Thai speech. Four frequency analysis methods were explored: Energy Spectral Density (ESD), Power Spectral Density (PSD), Mel-Frequency Cepstral Coefficients (MFCC), and the Spectrogram (SPT). A corpus of isolated words from 60 speakers was recorded. Results revealed that the MFCC-based feature gives better accuracy than ESD, PSD, and SPT.

Croatian

Tadić and Fulgosi [319] developed a Croatian Language Lexicon (CML) from two different universities, which included two sub-lexicons. Martincic-Ipsic et al. [203] developed context-dependent acoustic modeling using context-dependent triphone hidden Markov models and Croatian phonetic rules. The proposed system for Croatian acoustic modeling was developed as part of the speech interfaces in a spoken dialogue system for the weather forecast domain. Gulić et al. [98] collected letters and digits from two data sources and developed a Sphinx model for the Croatian language; the results were reported in terms of WER and SER. Nouza et al. [235] demonstrated a cost-effective approach for developing LVCSR systems for the Croatian language. The authors used 39 MFCC features on 320 h of audio data. The training data were collected from three different datasets, and the WER on the respective datasets was reported. Dunder (2014) worked on English and Croatian in a combined approach for the business domain. Dunder [191] worked on the Croatian corpus hrWaC, comprising 2 billion words. SVM and Random Forest (RF) classifiers were used in the experiments, achieving accuracies of 70.4% and approximately 59.28%, respectively. Kacur and Rozinaj (2011) used HMM models for building large vocabulary recognition systems using the MASTER training scheme. MFCC and PLP techniques were used for feature extraction. The voicing feature was tested, and surprisingly the average improvement was 19.96% for PLP and 24.51% for MFCC. Nouza et al. [234] developed two corpora for Croatian and Slovene from the available corpora hrWaC and slWaC, consisting of 59,212 and 37,032 tokens for Croatian and Slovene, respectively. An overall 10% improvement was observed on the datasets of both languages. Agić, Ljubešić, and Klubička [189] developed the first linguistically annotated data for the Croatian language; the data were extracted from the SETIMES parallel corpus. The authors of [188] developed a Croatian training corpus, hr500k, including 500,000 tokens. Russo et al. [271] proposed a speech recognition system based on cochlear behavior, emulated by the filtering operations of a gammatone filterbank followed by an Inner Hair Cell (IHC) processing stage. Results revealed that the proposed Gammatone Hair Cell (GHC) coefficients give lower performance for clean speech conditions but show a substantial increase in performance in noisy conditions.

Czech

Byrne et al. [40] worked on 10,000 h of annotated spontaneous speech to achieve a WER of 40% for English as well as Czech. MFCC features were extracted with the HTK toolkit. Nouza et al. [233] developed the first speech recognition system in Czech that can transcribe spoken broadcast programs into text. A vocabulary of 200 k words was built and a language model was trained on a 300 M-word text. Results on different types of broadcast programs were listed. Ircing et al. [128] worked with rich morphological tags on a class-based n-gram language model with many-to-many word-to-class mapping. This model improved recognition accuracy over the word-based baseline system. Boriland and Hassen [35] presented unsupervised frequency-domain and cepstral-domain equalizations that increase ASR resistance to the Lombard effect. The proposed system provides an absolute word error rate (WER) reduction of 8.7%. Kolar and Liu [164] combined three statistical models: HMM, maximum entropy, and a boosting-based model, BoosTexter. The results revealed that superior outcomes are achieved when all three models are combined through posterior probability interpolation. Rajnoha and Pollak [262] built a speech recognition system with HMMs that recognizes digits in a noisy environment. The results showed that Bark-frequency scaling, equal-loudness pre-emphasis, and the intensity-loudness power law in the MFCC pipeline brought improvement in the noise robustness of the system. Nouza et al. [236] worked on the Slavic languages Czech and Slovak. Experiments were performed on 350 K words in Czech and 170 K words in Slovak, and distinctive features of these languages were also discussed. Procházka et al. [253] examined publicly available n-gram corpora for the creation of applicable language models (LMs). Results showed that the Web1T-based LMs, even after rigorous cleaning and normalization procedures, cannot compete with those made of smaller but more consistent corpora. Kombrink et al. [165] developed speech recognition systems for under-resourced languages; they applied machine translation to translate English transcripts of telephone speech into the Czech language to improve a Czech CTS speech recognition system. Přibil and Přibilová [252] investigated emotional types of spectral and prosodic features for Czech and Slovak emotional speech classification based on Gaussian mixture models (GMMs). Experiments were conducted with four emotional states (joy, sadness, anger, and a neutral state). Tests exhibited the principal importance of correct classification of the speaker gender in the first level, which had a heavy influence on the resulting recognition score of the emotion classification. Nouza et al. [233] worked on transcribing 83,000 h of data from the Czech and Czechoslovak Radio archives. The dictionaries involved 41 Czech and 48 Slovak phones. Baseline MFCC features were extracted and 32-mixture GMMs were used for modeling triphone state output pdfs. Chaloupka et al. [43] transcribed 80,000 h of the Czech Radio audio archive of the CRaT database. HMM- and GMM-based frameworks were trained for a different set of experiments. Mateju et al. [205] trained various DNNs with different training strategies, inner structures, and kinds of features. The resulting strategy for training DNNs uses the ReLU activation function and filter-bank-based features. Šturm and Volín [315] worked on testing the behavior of phonotactic structures of the Czech language. Šturm [316] worked on Czech syllables; 174 disyllabic Czech words from 30 speakers were used for the experiment.
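
The finding of Mateju et al. above, that DNN acoustic models trained on filter-bank features with ReLU activations work well, corresponds to the standard hybrid DNN–HMM recipe in which a feed-forward network maps spliced filter-bank frames to tied triphone-state (senone) posteriors. The sketch below illustrates that idea in PyTorch; the layer sizes, splice width, and senone count are illustrative assumptions, not values from the cited paper.

```python
import torch
import torch.nn as nn

class AcousticDNN(nn.Module):
    """Feed-forward DNN mapping spliced filter-bank frames to senone posteriors."""
    def __init__(self, feat_dim=40, context=5, num_senones=2000,
                 hidden_dim=1024, num_layers=5):
        super().__init__()
        input_dim = feat_dim * (2 * context + 1)  # splice +/- context frames
        layers = []
        for i in range(num_layers):
            layers.append(nn.Linear(input_dim if i == 0 else hidden_dim, hidden_dim))
            layers.append(nn.ReLU())              # ReLU, as in the cited training strategy
        layers.append(nn.Linear(hidden_dim, num_senones))
        self.net = nn.Sequential(*layers)

    def forward(self, x):
        return self.net(x)  # raw logits; train with cross-entropy against alignments

# Hypothetical training step on one mini-batch of spliced filter-bank features.
model = AcousticDNN()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)
criterion = nn.CrossEntropyLoss()

features = torch.randn(32, 40 * 11)        # 32 frames, 40 fbank dims x 11-frame splice
targets = torch.randint(0, 2000, (32,))    # frame-level senone labels from a forced alignment
loss = criterion(model(features), targets)
loss.backward()
optimizer.step()
```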

Dutch

Ordelman et al. [237] developed a language model of 65 K words collected from Dutch newspapers. A hybrid of RNN and HMM systems was used for automatic speech recognition. Hämäläinen et al. [100] studied whether longer-length acoustic units were better suited than traditional phoneme-length units for modeling pronunciation variation and long-term temporal dependencies in speech. A hierarchical method that mixed word-, syllable-, and phoneme-length units was used and revealed that the presented approach did increase word accuracy. Despres et al. [67] developed a system for the Northern (NL) and Southern (VL) varieties of Dutch in the joint 'LIMSI-Vecsys' speech-to-text transcription systems for broadcast news (BN) and conversational telephone speech (CTS). Word error rates under 10% were obtained on the BN development data. Pelemans et al. [245] explored three new application areas in ASR for Dutch using the SPRAAK toolkit. Scharenborg et al. [288] trained an ASR system on the Dutch and Mboshi languages. 40-dimensional filterbank features were extracted with the Kaldi toolkit and trained using a multi-layer DNN. A relative improvement of 6.62% was reported. Pelemans et al. [244] experimented with a layered architecture approach to check whether it also works for a large lexicon (400 k words) and language models (5-gram). The outcome shows that the architecture is already competitive and can be applied to acoustic models, language models, and lexicons. Shi et al. [301] used the Corpus Spoken Dutch, which included 44,368 different words; 2% accuracy in predicting the words was reported. Cucchiarini and Van hamme [57] started the JASMIN-CGN corpus project to extend the Corpus Gesproken Nederlands in three dimensions: age, mother tongue, and interaction mode. In total, 111 h and 40 min of speech were collected. Shi et al. [302] undertook re-scoring experiments on a challenging corpus of spoken Dutch and investigated sentence and word length to measure the performance of conventional language models. The results of experiments on CGN 32 data and WSJ data show that integrating sentence length and word length can achieve improvement. Watanabe et al. [355] introduced a model that recognizes 10 different languages by directly predicting graphemes (characters/chunked characters). The proposed model is based on a hybrid attention/connectionist temporal classification (CTC) architecture. Seki et al. [291] developed an ASR for multiple languages. 80-dimensional Mel filter bank features were extracted in concatenation with pitch features using the Kaldi toolkit.
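
The multi-lingual end-to-end systems of Watanabe et al. and Seki et al. mentioned above rely, among other components, on the connectionist temporal classification (CTC) objective, which scores a frame-level grapheme posterior sequence against an unsegmented transcript. The minimal sketch below shows how such a loss can be computed with PyTorch's built-in CTCLoss; the tensor shapes and label values are illustrative placeholders, not those of the cited systems.

```python
import torch
import torch.nn as nn

# T encoder frames, a batch of N utterances, C output symbols (graphemes + blank).
T, N, C = 50, 4, 30
log_probs = torch.randn(T, N, C, requires_grad=True).log_softmax(dim=2)

# Unsegmented grapheme targets (values 1..C-1; index 0 is reserved for the blank symbol).
targets = torch.randint(1, C, (N, 10), dtype=torch.long)
input_lengths = torch.full((N,), T, dtype=torch.long)    # encoder frames per utterance
target_lengths = torch.full((N,), 10, dtype=torch.long)  # graphemes per transcript

ctc = nn.CTCLoss(blank=0)
loss = ctc(log_probs, targets, input_lengths, target_lengths)
loss.backward()  # in a real system this gradient trains the encoder producing log_probs
```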

Finnish

Pylkkönen and Kurimo (2004) worked on HMM-based phone duration modeling. To carry out the experiments, the speech recognition models used were speaker-dependent triphone models. 12 h of training data were collected from a Finnish book read by a female speaker, and a WER of 19.8% was cited. Siivola et al. [305] collected 300 million words from newspapers, books, and magazines, from which 12 h of data were used for training the acoustic model. A WER of 56.4% on words was reported. Turunen and Kurimo [335] worked to improve the performance of morph-based spoken document retrieval. Audio data of 288 h of spoken news in Finnish were collected; 26 h of speech data were used to train the acoustic model and a WER of 40.89% was reported. Kurimo and Turunen [179] worked on recovering speech recognition errors in spoken documents. The experiment was conducted on 270 spoken news stories uttered by a female speaker. Hirsimäki et al. [113] worked on large vocabulary language models of 40 million words collected from two different speech corpora, and significant reductions in error rate were listed. Enarvi and Kurimo [77] worked on the available SPEECON and FinDialogue corpora. Finnish conversations from 13 distinct speakers, podcasts by 5 speakers, and recordings from 67 students were collected for transcription. A 1.0% WER reduction was observed on the small data, and an overall 55.6% WER was reported on the web data of the existing data sets.

Ginter et al. [93] developed a speech corpus named FinnTreeBank-3 from the existing resources. Parsing was done based on morphological tagging and dependency parsing. Enarvi et al. [78] worked on LVCSR systems for Finnish and Estonian speech recognition. The training data for Finnish included 85 h of speech data from three different sources; for Estonian, training data of 164 h from broadcasts, news, and lectures were used. Mansikkaniemi et al. [199] developed an ASR system for Finnish as well as Arabic using three distinct data sets for acoustic model training. Acoustic model training was carried out in Kaldi and language modeling with the VariKN toolkit, and RNNLMs were trained with the help of the TheanoLM toolkit. Behravan et al. [28] handled leveling in Finnish regional dialects using the SAPU (Satakunta in Speech) corpus. The authors used attribute features such as manner and place of articulation for dialect leveling. Experiments conducted with an i-vector system revealed that attribute features achieve higher dialect recognition accuracy and were less sensitive to age-related leveling. Varjokallio et al. [340] introduced a novel language model look-ahead technique using a class bi-gram model. This technique gave better results than the unigram look-ahead model.
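
Many of the Finnish systems above depend on n-gram language models built over words, morphs, or word classes. To make that component concrete, the snippet below sketches a toy word-level bigram model with simple interpolation smoothing; it is an illustrative implementation, not the VariKN or class bi-gram look-ahead machinery cited in the studies, and the toy sentences are placeholders.

```python
import math
from collections import Counter

def train_bigram(sentences):
    """Count unigrams and bigrams from tokenized sentences (lists of words)."""
    unigrams, bigrams = Counter(), Counter()
    for sent in sentences:
        tokens = ["<s>"] + sent + ["</s>"]
        unigrams.update(tokens)
        bigrams.update(zip(tokens, tokens[1:]))
    return unigrams, bigrams

def log_prob(sentence, unigrams, bigrams, lam=0.7):
    """Interpolated bigram log-probability of a tokenized sentence."""
    total = sum(unigrams.values())
    vocab = len(unigrams)
    tokens = ["<s>"] + sentence + ["</s>"]
    logp = 0.0
    for prev, word in zip(tokens, tokens[1:]):
        p_uni = (unigrams[word] + 1) / (total + vocab)           # add-one unigram estimate
        p_bi = bigrams[(prev, word)] / unigrams[prev] if unigrams[prev] else 0.0
        logp += math.log(lam * p_bi + (1 - lam) * p_uni)         # linear interpolation
    return logp

corpus = [["hyvää", "päivää"], ["hyvää", "iltaa"]]   # toy Finnish sentences
uni, bi = train_bigram(corpus)
print(log_prob(["hyvää", "päivää"], uni, bi))
```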

French

Adda-Decker et al. [4] studied radio interviews in French and analyzed the syllabic structures and the corresponding variations. The authors also shed light on using ASR systems in French as a linguistic tool. The corpus used for experimentation includes 30 h of data with 254 k words. New terminologies to analyze W-syllables and S-syllables were introduced. Razavi et al. [267] related graphemes and phonemes by training HMMs. Experimental studies were confined to two databases, Phone Book and the MediaParl corpus, and a G2P approach was used for the pronunciation of the words. Dimulescu and Mareüil [69] worked to determine the origin of a speaker from an uttered speech sample by analyzing the phones. It was observed that the system worked well for Arabic speakers and poorly for speakers of Portuguese. Accents that were commonly mistaken were Spanish–Italian and English–German. Bérard et al. [30] used the French–English BTEC corpus to perform experiments on machine translation. The proposed system performed closely to the baseline systems. Kocabiyikoglu et al. [163] worked on the segmentation and augmentation of the existing speech corpus, LibriSpeech, and presented a large-scale corpus of 236 h. Venail et al. [344] developed software using word lists gathered from the speakers' lexicons, which were then assessed based on manual as well as automatic scoring. Decker et al. [3] derived the text data for experimentation from the CHAMBER discussions and sampled it. Three acoustic models for English, French, and German were built separately. 65-word lists were used for training, and an OOV rate of 2.4% was reported on the CHAMBER text. Botros et al. [36] studied different clustering methods and compared their performance on the available data sets to further improve keyword search and WER. Huet et al. [122] re-ordered the N-best lists to extract morpho-syntactic details in the post-processing of ASR output. A 200,000-word data set was extracted from the ESTER training corpus, while evaluation was carried out on 11,300 words. A WER of 22% and an overall accuracy of 95% were reported. Imseng et al. [127] presented a bilingual database, 'MediaParl', containing recordings in both French and German. Experiments were conducted using HMM/GMM systems. Multi-Layer Perceptron (MLP) and RASTA-based multi-lingual bottleneck features were examined for acoustic modeling in the German and French languages [336, 337]. Results from experiments showed that multi-lingual BN features offered better cross-lingual portability [2]. Christodoulides et al. [53] presented a multi-level annotator, 'DisMo', for spoken language corpora. It integrates part-of-speech tagging with basic disfluency detection and annotation and multi-word unit recognition. The proposed system was trained and tested on a 57 k-token corpus. Experiments revealed that 'DisMo' achieves a precision of 95–96.8%. Tong et al. [330, 331] examined multi-lingual (French, German) CTC training in the context of adaptation and regularization techniques that had proved beneficial in more conventional settings. Learning Hidden Unit Contribution (LHUC) was investigated for language-adaptive training, and the performance of the universal phoneme-based CTC system was improved by applying dropout and LHUC.

German

Larson and Eickeler [180] demonstrated the use of syllable-based indexing features and how they outperform word-based indexing features on large-vocabulary German-language radio documentaries. Burget et al. [38] used a different approach to develop a multi-lingual speech recognition system: they used entirely different phone sets, but the model had parameters that were not tied to specific states and were shared across languages. They used a Subspace Gaussian Mixture Model for the experiments and obtained significant WER improvements with this approach. Siniscalchi et al. [307] developed a technique to design 'language-universal' acoustic models for phone recognition systems under the 'automatic speech attribute transcription' framework. Specifically, a phone recognizer that can decode languages with minimal available target-specific training data was built.

Weninger et al. [357] presented a manually segmented and annotated speech corpus of over 160 h of German broadcast news and proposed an evaluation framework for LVCSR systems. The proposed framework achieved a word error rate of 9.2%. Radeck-Arneth et al. [258, 259] collected a corpus for German distant speech in a controlled and clean environment. A total of 36 h of the corpus were recorded from 180 different speakers. The Kaldi toolkit was used for the development of the ASR system, and a WER of 20.5% was recorded for German distant speech recognition.

Gonzalez-Dominguez et al. [95] presented an end-to-end multi-language ASR architecture deployed at Google. This helps in the selection of arbitrary combinations of spoken languages; acoustic information was exploited by a DNN-based LID classifier. Ali [13] aimed to check whether conventional speech features can detect voice pathology and whether they relate to voice quality. An automatic detection system based on MFCC was developed and tested on three different voice disorders. The accuracy of the MFCC-based system differed from database to database, and the intra-database detection rate ranged from 72 to 95%. Ciobanu et al. [55] used SVM-based ensembles for Swiss and German data sets including distinct speakers. A 62.03% F1 score was reported. Milde et al. [211] performed experiments on three distinct open-source databases available for the German language. The experiments were conducted with the Kaldi toolkit. Acoustic models like GMM–HMM and TDNN were used, and a WER of 16.49% was cited by the authors. Spille et al. [311] tried to predict the speech recognition threshold (SRT) for normal-hearing people. For this, a deep neural network (DNN)-based system was employed to convert the acoustic input into phoneme predictions, and the ASR was trained on matched and multi-condition training. The best predictions were obtained for multi-condition training that employed amplitude modulation features. Milde and Kohn [211] trained an ASR system for German with the Kaldi toolkit on two datasets. A total of 412 h of German read-speech data from a Wikipedia corpus were taken, and the system achieved a relative word error reduction of 26%.

ASR systems for Indian languages

Gaikwad et al. [86] presented a state-of-the-art survey of the techniques available for speech recognition in Indian languages, covering the available speech recognition and feature extraction techniques. A speech recognition system was modeled in four working stages: analysis, feature extraction, modeling, and testing. The authors concluded by proposing an interface for the Marathi language. Shanthi and Lingam [299] attempted to summarize the advances made so far in feature extraction for speech recognition. Feature extraction techniques including Cepstral Analysis, Mel Cepstrum Analysis, MFCC, LDA, Fusion MFCC, LPC, and perceptually based Linear Predictive analysis (PLP) were discussed. For large vocabulary speech recognition systems, an Indian language database was established [172, 173]. Speech data from 560 speakers were collected for building ASR systems in Tamil, Telugu, and Marathi. Grapheme-to-phoneme conversion was performed, followed by text selection. The acoustic models, together with the language models, were tested on the Sphinx 2 toolkit. Hemakumar and Punitha [110] demonstrated the contributions made by researchers in developing ASRs for Indian languages. A broad view of the existing technologies and toolkits used for the different stages of speech recognition was identified and presented chronologically. It was recognized that only a few Indian languages, including Hindi, Marathi, Malayalam, Tamil, Telugu, and Bengali, have well-developed ASRs, while other languages are still under research. Compared to non-Indian languages, research on speech recognition for Indian languages has not yet reached the same level of maturity; therefore, research in the field of speech recognition of local Indian languages is still ongoing. Work on Indian languages has relied mainly on HMM-based models, and only limited attempts have been made at tone recognition and discriminative analysis using deep learning models of speech.

Hindi

A good number of researchers have worked on the recognition of speech in the Hindi language. Wani et al. [354] developed a Hindi ASR. For feature extraction, MFCC was used, and the isolated words in Hindi were recognized using K-Nearest Neighbor (K-NN) and Gaussian Mixture Model (GMM) classifiers. The speech corpus for training and testing was prepared by distinct male and female speakers in Hindi. The research proved to be satisfactory and useful for disabled and illiterate people for recognizing Hindi words. Sharma et al. [300] presented hybrid features that combine linear prediction and multi-resolution wavelet features. The classifier was based on a linear discriminant function and HMM for both speaker-dependent and -independent isolated Hindi speech data. The authors concluded that higher recognition accuracy was obtained using 3-level WBLPC features, while 4-level WBLPC features gave higher accuracy in the speaker-dependent case. Kumar et al. [169] built a Hindi speech recognition system for connected words. A database of 102 words from 12 speakers was selected to develop the system on the Hidden Markov Model Toolkit (HTK). A comparative approach was used by the authors to manifest the improved results. An innovative technique to extract features using ensemble modules for speech recognition in Hindi was proposed [169]. The outputs of the ensemble classifier were combined with the help of the ROVER method of voting, and the suggested system outperformed the baseline ASRs in Hindi. Aggarwal and Dave [7] combined the most efficient attributes of conventional, hybrid, and segmental HMMs using the ROVER technique, i.e., three distinct recognizers were developed and then merged with their unique sets of features and classifiers. The WER was significantly reduced as compared to the traditional ASR systems. Aggarwal and Dave [8] proposed a method to integrate the existing feature extraction methodologies, including MFCC, PLP, and gravity centroids, to improve the performance of Hindi ASR systems. The results reported improved accuracy for medium-sized lexicons containing 600 words. Aggarwal and Dave [6] surveyed the existing dimensionality reduction techniques and their areas of application. The techniques studied, along with their advantages and disadvantages, were PCA, LDA, HLDA, etc. All the techniques were implemented on Hindi speech data and the experimental results were reported. A discriminative approach to train the HMM for continuous speech systems in Hindi was proposed by [72]. The feature extraction technique combined MFCC and PLP features. For acoustic training of the model, MMIE and MME discriminative techniques were adopted, and the proposed ensemble features with MPE gave better results than other feature extraction and discriminative techniques. A database of 100 speakers and 1000 sentences was used. Kumar et al. [170] developed an LVCSR system in Hindi trained on 40 h of speech data from 120 speakers and 26,000 sentences. The trigram language model was trained on 3 million words. Dua et al. [71] presented a Differential Evolution (DE) technique for optimizing the filters in MFCC, GFCC, and BFCC. The performance of the proposed technique was evaluated in both noise-free and noisy environments. A total of 100 speakers were used for the process, where 80 speakers were used for training and 20 for testing. Upadhyaya et al. [338] proposed Context-Dependent Deep Neural Network HMMs (CD-DNN–HMMs) for large vocabulary Hindi speech using the Kaldi automatic speech recognition toolkit. The experiments, performed on the AMUAV database, demonstrated that CD-DNN–HMMs outperform the conventional CD-GMM–HMM model and provide an improvement in WER of 3.1% over the conventional triphone model.
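
Several of the Hindi systems above (for example, the isolated-word recognizer of Wani et al.) classify MFCC feature vectors with one Gaussian Mixture Model per word and pick the best-scoring model at test time. The sketch below illustrates that pattern with scikit-learn; the random feature arrays, the two word labels, and the mixture size are illustrative placeholders, not data or settings from the cited work.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

def train_word_models(train_data, n_components=8):
    """Fit one GMM per word class over its pooled MFCC frames.

    train_data: dict mapping word label -> array of shape (num_frames, num_mfcc).
    """
    models = {}
    for word, frames in train_data.items():
        gmm = GaussianMixture(n_components=n_components,
                              covariance_type="diag", random_state=0)
        gmm.fit(frames)
        models[word] = gmm
    return models

def recognize(models, utterance_frames):
    """Pick the word whose GMM gives the highest average log-likelihood."""
    scores = {word: gmm.score(utterance_frames) for word, gmm in models.items()}
    return max(scores, key=scores.get)

# Toy example with random 13-dimensional 'MFCC' frames (placeholders).
rng = np.random.default_rng(0)
train = {"ek": rng.normal(0, 1, (200, 13)), "do": rng.normal(2, 1, (200, 13))}
models = train_word_models(train)
print(recognize(models, rng.normal(2, 1, (50, 13))))   # expected to print "do"
```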

Bangla

Paul et al. [243] developed a Bangla speech recognition system where LPC was used for feature extraction and ANN for pattern recognition. It was cited that an MLP with 5 layers was more robust than an MLP with 3 layers. Muhammad et al. [226] proposed a novel ASR system to recognize digits in Bangla. The speech data were recorded with the help of the local people of Bangladesh and were medium in size. Feature extraction employed MFCC features, and an HMM-based classifier was used to recognize the digits. It was also reported that the recognition accuracy of the system was affected by dialect variations. Mandal et al. [196] developed a speech corpus for Bengali. The proposed algorithm for text selection can be used for optimum text selection, whereas triphones or diphones can be used as the selection parameter. Hasnat et al. [104] reported the use of pattern classification by an HMM model incorporated with a stochastic language model. Adaptive noise cancellation and endpoint detection were used to preprocess the signals, and spectral feature vectors were extracted for every speech input signal. The research was conducted on isolated as well as continuous speech sentences. Das et al. [61] developed a corpus for Bengali speech to recognize speech in speaker-independent continuous speech systems. The speech data were categorized into different classes depending on age and language. Phone- and triphone-based speech data were prepared and incorporated with statistical modeling techniques, and the collected data were implemented with 39 features on HTK. Banerjee et al. [25] compared acoustic models based on triphones and monophones. Methods were described to develop clusters of triphones with the help of a decision-tree-based methodology. 4000 recorded sentences were used for training and 600 distinct sentences from the same speakers were used for testing. The authors of [261] were concerned with developing a speech-system front end that could be used for segmenting and clustering continuous speech sentences in Bangla into the desired number of clusters. Six different speakers recorded the speech data and the system was tested on 758 words from 120 sentences. Das et al. [62] performed experiments on the Bengali corpora BENG_YO and BENG_OL using speaker adaptation techniques. Hossain et al. [116] implemented a BPN (Back Propagation Network) for a speech recognition system in Bangla. Ten speakers were used to record ten digits in Bangla. MFCC features from 5 speakers' data were used for training and the other 5 for testing. The ASR worked well for both speaker-dependent and speaker-independent systems. Bhowmik et al. [34] used the CRBLP and C-DAC speech corpora for training a DNN. MFCC features were extracted from the speech samples. Bhowmik and Mandal [33] applied a DNN to Bengali continuous speech samples from the C-DAC corpus. A recognition accuracy of 89.40% was reported on the TIMIT database. Reza et al. [268] developed a Bengali ASR based on isolated word datasets using HMM with Gaussian emissions and DNN. 96.67% accuracy was achieved using the HMM–GMM classifier. Pal et al. [238] developed an ASR in the Bengali language for handling queries regarding agricultural commodities. The experiments were conducted on the Kaldi toolkit using a speech corpus of local people. Nahid et al. [228] showed that a deep LSTM network can model Bengali speech effectively. The context of phones is taken into consideration while modeling phoneme-based speech recognition, and the authors solely emphasized detecting individual Bengali words. They achieved a word detection error rate of 13.2% and a phoneme detection error rate of 28.7% on the Bangla-RealNumber audio dataset. Al Amin et al. [10] used DNN–HMM and GMM–HMM-based models, implemented in the Kaldi toolkit, for continuous Bengali speech recognition benchmarking on a standard and publicly published corpus called SHRUTI. The study showed that Kaldi-based feature extraction recipes with DNN–HMM and GMM–HMM acoustic models achieved WERs of 0.92% and 2.02%, respectively. Popli and Kumar [249] observed that training improves Query-by-Example Spoken Term Detection (QbE-STD) not only for a language of the same language family, like Hindi, but also for other Indian languages like Tamil and Telugu.

Marathi

Kayte and Gawali [150] discussed and reviewed the existing terminologies for speech recognition in Marathi. Different approaches, methods, tools, techniques, and applications of speech synthesis were demonstrated. The database used for the research was named IWAMSR and contained five speakers from diverse age groups, genders, and backgrounds, of which three were male and two female [153]. Three different databases of IWAMSR were constructed, and a performance evaluation showed that database 3 of IWAMSR had improved recognition accuracy. Gaikwad et al. [84] studied the feature extraction and classification techniques in Marathi continuous speech recognition systems. The authors researched different hybrid feature extraction techniques; when evaluated on accuracy, MFLDWT proved to be the best hybrid model. Gaikwad et al. [85] proposed a hybrid feature extraction methodology that combined MFCC and LDA. The results of the proposed feature extraction method were cited and compared with traditional feature extraction techniques. A comparative analysis of the traditional feature extraction methods including MFCC, vector quantization (VQ), and LPC for Marathi was reported [208]. The authors detailed the creation of a Marathi database which contained 120 samples of Marathi vowels and 360 samples of Marathi consonants recorded by one male and one female speaker. Gawali et al. [89] proposed the creation of a Marathi database containing 175 samples. The speakers were local Marathi people aged 22–35. A speech recognition system based on the combination of MFCC and DTW feature extraction techniques was developed. Waghmare et al. [349] worked on emotional speech recognition, exploring MFCC feature extraction techniques and classifying with the help of LDA. A Marathi database was constructed with samples of data obtained from 5 Marathi movies. The results were presented in 5 classes: Happy, Anger, Sad, Afraid, and Surprise. Darekar and Dhande [60] implemented a novel technique to recognize emotions using a hybrid PSO-FF algorithm in Marathi speech. Cepstral, NMF, and MFCC feature extraction techniques were used to extract features from Marathi and benchmark databases. The whole experiment was conducted in MATLAB. Patil et al. [242] recorded a database of 5300 phonetically balanced Marathi sentences to train context-dependent HMMs. The subjective quality measures (MOS and PWP) show that HMMs with seven hidden states can give an adequate quality of synthesized speech as compared to five states and with less time complexity than seven state HMMs. Yi et al. [362] proposed a language-adversarial transfer learning technique to improve the performance of low-resource speech recognition tasks. Experiments were conducted on IARPA Babel datasets. The authors used adversarial learning to ensure that the shared layers of the SHL-Model would learn more language-invariant features. Bhanja et al. [31] investigated the language discriminating ability of various acoustic features like pitch chroma, mel-frequency cepstral coefficients (MFCCs), and their combination. The system performance was analyzed for features extracted using different analysis units, such as syllables and utterances.
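
The MFCC + DTW systems mentioned above (e.g., the Marathi recognizer of Gawali et al.) compare an input utterance against stored templates with dynamic time warping, which aligns two feature sequences of different lengths before measuring their distance. A compact NumPy implementation of the textbook DTW distance is sketched below; the template names and random feature arrays are illustrative placeholders, not the variant used in the cited work.

```python
import numpy as np

def dtw_distance(seq_a, seq_b):
    """Dynamic time warping distance between two per-frame feature sequences.

    seq_a: (n, d) array, seq_b: (m, d) array (e.g., MFCC frames of two utterances).
    """
    n, m = len(seq_a), len(seq_b)
    cost = np.full((n + 1, m + 1), np.inf)
    cost[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            local = np.linalg.norm(seq_a[i - 1] - seq_b[j - 1])  # frame-to-frame distance
            # Extend the cheapest of the three allowed alignment moves.
            cost[i, j] = local + min(cost[i - 1, j],      # insertion
                                     cost[i, j - 1],      # deletion
                                     cost[i - 1, j - 1])  # match
    return cost[n, m]

# Template matching: label the test utterance with the closest stored template.
rng = np.random.default_rng(1)
templates = {"namaskar": rng.normal(0, 1, (40, 13)),
             "dhanyavad": rng.normal(3, 1, (35, 13))}
test = rng.normal(3, 1, (38, 13))
print(min(templates, key=lambda w: dtw_distance(test, templates[w])))
```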

Tamil

Tamil is a Dravidian language. Saraswathi and Geetha [280] proposed an enhanced morpheme-based language model for Tamil with a limited vocabulary size. The text and speech corpora for Tamil were collected from newspapers and magazines, and articles on political issues from newspapers. The enhanced morpheme-based trigram language model with back-off smoothing performed better for the two corpora in Tamil. Plauche et al. [248] proposed an affordable approach for collecting linguistic resources from literate and illiterate agricultural workers in three districts of Tamil Nadu and developed an ASR for the farmers. Chandrasekar and Ponnavaiko [44] presented a continuous speech recognition system in Tamil. An approach based on segmentation of the speech signal followed by a BPN for classification was deployed, and 247 Tamil characters were tested in the system. Saraswathi and Geetha [281] implemented language models at various steps in speech recognition systems for Tamil, i.e., for segmentation, recognition, and error correction. The error rate was significantly improved at the phoneme, syllable, and word recognition phases for Tamil. Charles et al. [45] proposed an enhanced continuous speech recognition system in Tamil, named Alaigal, which was independent of the speaker and the device. The acoustic model comprised three basic steps: feature extraction, HMM modeling, and estimation of parameters using Gaussian estimation. Premkumar et al. [251] developed an LVCSR system based on distinct factors, namely, the pronunciation dictionary, language modeling, and the front end. A 21.34% syllable error rate was reported. Chen et al. [47] presented approaches for a keyword search (KWS) system using conversational Tamil provided by the IARPA Babel program. Strategies like data selection optimization through Gaussian-component-indexed N-grams, keyword-aware language modeling, and subword modeling of morphemes and homophones can help in tackling low-resource challenges. Manohar et al. [198] used DNN acoustic models to obtain a WER of 0.5%. The Fisher English Corpus was used to carry out the experiments. Sivaranjani and Bharathi [308] developed a speech recognition system in Tamil by extracting MFCC features and building the model using HMM; 95% accuracy was reported. A system based on the triphone decision-tree clustering method was developed by Lokesh et al. [192]. The authors conducted experiments on the FIRE 2011 dataset and implemented BRNN-SOM on the dataset to achieve an accuracy of 93.6%. Sarma et al. [284] explored the usage of a monolingual Deep Neural Network model to address the problem of language recognition (LR) in the i-vector framework. A Time Delay Deep Neural Network (TDDNN) architecture was used. Experiments showed that the proposed system gave a lower average cost performance as compared to the GMM–UBM-based system.

Telugu

Hegde et al. [106, 108] used a hybrid of the Modified Group Delay Feature (MODGDF) and MFCC for computing joint features for continuous speech recognition of Tamil and Telugu. The proposed features gave better results when compared to MFCC alone. Hegde et al. [107] modified the group delay function to extract cepstral features, named modified group delay features (MODGDF). Results were evaluated on the DBIL Tamil and Telugu, TIMIT, OGI_MLTS, and NTIMIT databases, and the modified features outperformed the baseline features. Ramamohan and Dandapat [263] presented an emotional speech recognition system in Telugu and English. 30 native male speakers recorded the data for Telugu. It was reported that the sinusoidal features outperformed linear prediction and cepstral features for speech emotion recognition. Venkateswarlu et al. [345] used both MLP and TLRN models to train and test the speech recognition model; results of the Multilayer Perceptron (MLP) and the Time-Lagged Recurrent Neural Network (TLRN) with LPCC and MFCC features were compared. Renjith and Manju [105] built a system to recognize emotions in the Tamil and Telugu languages. Linear Predictive Cepstral Coefficients (LPCC) and the Hurst parameter were extracted from the emotional speech. K-Nearest Neighbor (K-NN) and Artificial Neural Network (ANN) classifiers were used to identify the emotions, and the Hurst parameter gave better accuracy than LPCC. In 2018, Vegesna et al. [342] again worked on emotion recognition, using a Telugu speech corpus of 64,464 utterances to extract MFCC and prosody features. A hybrid GMM–HMM classifier was employed. Results showed that adapted emotive speech models yielded better performance than the existing neutral speech models. Mannepalli et al. [197] prepared a speech corpus with the help of local Telugu speakers with varying accents. MFCC feature extraction was followed by GMM classification. The accents considered were coastal Andhra, Telangana, and Rayalaseema.

Kannada

Kannada is one of the most widely spoken languages of Southern India, with over 50 million speakers in India. Punitha and Hemakumar [110] designed Kannada continuous speech recognition for the speaker-dependent mode. LPC coefficients were extracted as features, and K-means clustering was used to categorize the classes. Anusuya and Katti [17] presented a review of distinct feature extraction techniques with or without the Wavelet transform for Kannada speech recognition. For noise-free data, Discrete Wavelet Transforms (DWT), and for noisy data, Wavelet Packet Decomposition (WPD) was adopted for preprocessing the signal. For testing, 500 different samples were recorded by 10 female speakers. Harisha et al. [102] cited the existing ASR techniques for Indian languages along with research efforts to develop isolated digit recognition. MFCC features were extracted, followed by an ANN classifier with a back-propagation algorithm for speech recognition. Five distinct speakers aged between 20 and 35 recorded the speech database. Anusuya and Katti [19] used vector quantization to eliminate silence from the input speech sample. For training, 100 speech signals were used. Removing noise and silence from the signals reduces the WER. Cutajar et al. [59] reviewed different techniques and technologies of ASR systems. Antony et al. [16] developed an isolated-word Kannada speech recognition system using a hybrid of DWT and PCA. Sajjan and Vijaya [274] developed a speech recognition system using phoneme modeling, wherein each phoneme was characterized by a tristate HMM and each state was represented by a GMM. The performance was tested for monophone and word-internal triphone modeling. Pardeep and Rao [240] compared some baseline speech recognition systems like HMM–GMM, HMM–ANN, and HMM–DNN. Experiments showed that the HMM–DNN baseline system gave an improvement of about 7–8% over the other baseline systems. Yadava and Jayanna [358] demonstrated a spoken query system that could be used to access the latest agricultural commodity prices and weather information in Kannada. MFCC was used to extract features from the voice samples, and 80% and 20% of the validated speech data were used for system training and testing, respectively. Sajjan and Vijaya [275] presented a speech recognition system employing decision-tree-based clustering to build context-dependent triphone HMMs. It was observed that clustering of triphones using a universal list of articulatory questions performs well compared with manually created phonetic question lists. Geethashree and Ravi [90] developed an emotional speech corpus. Different methods were used for building and evaluating the system, such as the Mean Opinion Score (MOS), K-NN (K-Nearest Neighbour), and LVQ (Learning Vector Quantization) classifiers. Kannadaguli and Bhat [140] evaluated the performance of Bayesian and HMM-based techniques to recognize emotions in Kannada speech. Kumar et al. [171] developed a speech recognition system for a noisy environment, with a speech corpus collected from 2400 speakers. The acoustic models were built using monophone, triphone1, triphone2, triphone3, subspace Gaussian mixture models (SGMM), DNN–HMM, and a combination of DNN and SGMM. The DNN–HMM system gave the least WER.

Malayalam

Malayalam is the eighth most widely spoken language in India. Mohamed and Nair [215] developed a small-vocabulary speaker-independent continuous speech recognition system in Malayalam. A CDHMM was used for modeling the phonemes, and acoustic features were extracted with the use of MFCC features. The Baum–Welch and Viterbi algorithms were used for training. Antony et al. [16] used 1080 words in the speech corpus for the experiment. An SVM classifier was used for speech recognition, and uncertainties in the lexical tones were identified. Kurian and Balakrishnan [178] proposed a speaker-independent system to recognize digits. Acoustic features were extracted with the help of MFCC, followed by an HMM classifier for speech recognition. Krishnan et al. [168] employed four different types of wavelets for extracting features. The classifier for pattern recognition was an ANN, and 160 speech samples were used to conduct the research. Mohamed and Nair [215] developed a speech recognition system on a small corpus trained using continuous density HMMs to model the phonemes, where the observation probability density functions (pdfs) were continuous. The presented system produced a word accuracy of 94.67%. In the following year, Mohamed and Nair proposed a context-dependent, small-vocabulary, continuous speech recognition system in Malayalam in which HMMs were combined with ANNs. 108 sentences with 540 words and a total of 3060 phonemes were used for training. Kurian and Balakrishnan [177] used HMM and GMM tied states to evaluate the ASR system. Bigram and trigram models were used for performance evaluation. Anand et al. [15] developed an LVCSR system for visually impaired people. 30 h of speech data from 80 native speakers were used in the system. HMM was used for acoustic modeling, and the pronunciation variations in the dictionary were handled by combining rule-based and statistical methods. Mohamed and Ramachandran Nair [216] explored the use of pairwise neural networks as an alternative to multi-class neural network systems for estimating the emission probabilities of the states of an HMM. Results showed that the pairwise system outperforms the multi-class recognition system. Thennattil and Mary [326] developed a system implementing a phonetic engine (PE) for continuous speech. A speech corpus of more than 1 h was collected and transcribed using the International Phonetic Alphabet. Phonemes were mapped to 40 frequently occurring phonemes and then modeled using continuous HMMs. Mohamed and Lajish [218] proposed methods for recognizing vowel phonemes using nonlinear speech parameters like the maximal Lyapunov exponent and Phase Space Anti-diagonal Point Distribution (PSAPD). PSAPD, when combined with MFCC, gave a better accuracy of 80.44%.

Oriya

Very few researchers have contributed to developing ASR systems in Oriya. Mohanty and Swain [219] proposed an Oriya isolated speech recognition system to enable visually impaired students to appear in examinations. The system could translate isolated spoken answers into isolated text. For training, a set of 1800 isolated Oriya words spoken by 30 speakers was used, followed by HMM-based recognition. Mohanty and Swain [221] proposed a method for emotion recognition with classes of anger, sadness, surprise, astonishment, fear, happiness, and neutral. Fuzzy k-means clustering was adopted on the speech data collected from 35 native speakers aged between 22 and 58 years. Mohanty and Bhattacharya [220] proposed wavelet neural networks for speech recognition in Oriya. Two speakers recorded the speech data, which was followed by noise cancellation, and an ANN was used for recognition. Patil and Basu [241] collected corpora for various tasks in text-independent speaker identification. Experiments confirmed that MFCC performed better than LPC and LPCC. Londhe and Kshirsagar [193] developed a new speech corpus of 100 different isolated words and 67 sentences from 478 different speakers. The dataset was collected in the native region and included words from English as well as the Chhattisgarhi language, including scripts from literature and newspaper articles.

Kumar et al. [174] developed a prosodically and phonetically rich transcribed speech corpus for the Bengali and Oriya languages. Ten hours of read speech, 5 h of conversational speech, and 5 h of extempore speech were collected. The International Phonetic Alphabet (IPA) was used to transcribe the speech corpus. Mohanty and Swain [222] presented an HMM-based speech input–output system for the Indian farming community to retrieve cultivation information like the availability of seeds and fertilizers. The word accuracy was found to be 75.13%. Dash et al. [63] developed a system for four Indian languages, Hindi, Marathi, Bengali, and Oriya, by integrating articulatory information into the acoustic features. Both speaker-dependent and -independent recognition experiments were conducted using GMM–HMM, DNN–HMM, and LSTM–HMM. The DNN–HMM system outperformed the other models.

Assamese

Sarma et al. [283] developed a speech corpus generation technique with the help of an LMS filter and LPC cepstrum; an ANN was used for recognition. Sarma and Sarma [285] adopted LPC and PCA features for handling the mood and gender diversity of native Assamese speakers. The ANN was modeled using a hybrid of SOM, LVQ, and MLP. Dutta and Sarma [75] proposed a hybrid of LPC and MFCC features for recognizing Assamese speech, and a Recurrent Neural Network (RNN) framework was deployed for speech recognition. Kalita [137] reported a study of six Bodo vowels and eight Assamese vowels. Ten male and female speakers of each language recorded speech samples, and LPC and MFCC techniques were employed to extract features. Kandali et al. [138] worked on emotional speech recognition in Assamese using MFCC features and a GMM classifier. Speech data from 14 male and 13 female speakers were used for training. Kandali et al. [139] extracted features based on WPCC2 (Wavelet Packet Cepstral Coefficients computed by method 2), MFCC, tfWPCC2 (Teager energy operated in the transform domain), and tfMFCC. The classifier was based on GMM. MESDNEI (Multilingual Emotional Speech Database of North East India) was developed for the research. Shahnawazuddin et al. [298] developed a spoken query system in the Assamese language for farmers in the agricultural commodity domain. Data were collected from the local farmers. MFCC features were extracted from the collected data and trained on the HTK toolkit using HMM/GMM. Research work on the Assamese language combined with other languages was reported by Tuske et al. [337] and Hartmann et al. [103]. Misra et al. [213] classified vowels with the help of GMM and compared the results with ANN; the ANN achieved a 4% higher accuracy rate than the GMM. Dey et al. [68] developed an ASR for agricultural commodities using SGMM- and DNN-based acoustic models. A relative improvement of 32% in WER was observed as compared to the GMM–HMM ASR system. Ismail and Singh [129] worked on identification of the Kamrupi and Goalparia dialects of the language by extracting MFCC features from the speech samples. Sarma et al. [282] developed a speech recognition system using a small training data set of 20 h in three different modes; 78.05% accuracy was achieved using the HMM-based HTK toolkit. Bharali and Kalita [32] researched isolated words spoken by 15 speakers. Variants of MFCC features were used to train HMM, VQ, and i-vector models. Yi et al. [361] presented an adversarial end-to-end acoustic model for low-resource languages. The attention-based adversarial end-to-end language identification carries enough language information. Experiments were conducted on IARPA Babel datasets. The end-to-end model was trained with the CTC loss function, and the proposed approach gained a 9.7% relative word error rate reduction.

Punjabi

Kumar and Singh [175] developed a speaker-independent autonomous speech recognition system in Punjabi. The corpus included 1433 sentences and the authors reported an accuracy of 90.8% using MFCC features. A GUI in Java was built for the language model. Kaur and Singh [145] presented an effective speech model based on PNCC features for continuous Punjabi speech recognition. 34 phones were used for training 158 words, and an HMM model was used for training the system. An accuracy of 71.92% in a noisy environment was observed. Kumar and Singh [169] built an isolated speech recognition system in Punjabi using LPC features. VQ and DTW were used for recognizing the speech, and 94% accuracy was reported. Kaur and Singh [146] compared three feature extraction techniques, namely PNCC, PLP, and MFCC. The system was trained using HMM, and 34 phones for the Punjabi language were used to break each word into small sound frames. MFCC gave the best results in a noise-free environment, with 86.05% accuracy, over PLP and PNCC. Guglani and Mishra [97] extracted PLP as well as MFCC features for continuous speech samples of Punjabi. The experiment was conducted on the Kaldi toolkit, and the performance of triphone and monophone models was compared. Results showed that MFCC features improved the performance of the ASR system and the triphone model performed better than the monophone model. Kadyan et al. [134] tested three different combinations at the speech feature vector generation stage and two hybrid classifiers. MFCC, RASTA-PLP, and PLP were randomly combined with GA + HMM and DE + HMM techniques to produce refined model parameters. Results from experiments showed that the MFCC and DE + HMM combination improved the accuracy when compared with RASTA-PLP and PLP using hybrid HMM classifiers. Kadyan et al. [135] used DNNs against GMMs. Baseline MFCC and GFCC methods were integrated with cepstral mean and variance normalization for feature extraction, and the hybrid classifiers GMM–HMM and DNN–HMM were used to obtain performance improvements. Kadyan et al. [353] worked on the role of prosody-modification-based out-of-domain data augmentation on a children's speech corpus. In addition, they also studied the effect of varying the number of senones, the number of hidden nodes, and the number of hidden layers, as well as early stopping, resulting in a 32.1% Relative Improvement (RI) in comparison to the baseline system with varied senones.

Zhang Goyal et al. [365] tackled the issue of dialect classification based on tonal aspects of the laryngeal phoneme [h] in four major dialects of Indian Punjabi, using two key parameters, namely F0 variation and acoustic space, which were calculated using two formant frequencies, F1 and F2. The work was further extended by processing acoustic information at the feature level and comparing performance using basic and hybrid Linear Predictive Cepstral Coefficient (LPCC) feature extraction methods. The results showed that the hybrid LPCC + F0 system achieved a Relative Improvement (R.I.) of 6.94% with a Subspace Gaussian Mixture Model in comparison to the basic LPCC approach.
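
The tonal-dialect study above relies on F0 variation over a laryngeal segment as a key parameter. The fragment below sketches how an F0 contour and simple variation statistics could be extracted with librosa's pYIN pitch tracker; the file name and the frequency search range are illustrative assumptions and are not taken from the cited work, which additionally used the formants F1 and F2.

```python
import numpy as np
import librosa

# Load an utterance containing the segment of interest (placeholder file name).
y, sr = librosa.load("punjabi_utterance.wav", sr=16000)

# pYIN F0 tracking over a typical adult speech range (assumed, not from the paper).
f0, voiced_flag, voiced_prob = librosa.pyin(y, fmin=65.0, fmax=400.0, sr=sr)

# Keep only voiced frames (unvoiced frames are NaN) and summarize the F0 variation.
f0_voiced = f0[~np.isnan(f0)]
if f0_voiced.size:
    print("mean F0: %.1f Hz, range: %.1f Hz, std: %.1f Hz"
          % (f0_voiced.mean(), f0_voiced.max() - f0_voiced.min(), f0_voiced.std()))
```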

Recognition results of non-Indian and Indian languages

This section presents a brief report on the WER and accuracies achieved by researchers so far for speech recognition in different languages. A review of the prevalent feature extraction techniques is presented in Table 3. Spoken language processing systems such as automatic speech recognition depend mainly upon a speech corpus. Only a few languages in India, such as Hindi and English, currently enjoy the advantages of language technologies such as successful speech recognition engines. ASR in Indian languages is still at an early stage of research and is getting more attention nowadays. To date, only a few languages have been able to assemble a speech corpus from their native resources.

Building speech corpora for an Indian language is a difficult task. Statistical approaches used for modeling an ASR engine depend upon a large amount of training corpus that helps in the recognition of uttered signals. Thus, it is mandatory to have a database that comprises all characteristics of the typical user speaking in a realistic environment. A speech dataset is of two types: a dataset collected in a particular task domain and a general-purpose dataset. Previously [52], a Hindi language corpus was built with 50,000 Devanagari scripts to develop an ASR system. Speech corpora for different languages were designed and evaluated: for Marathi by TIFR and IIT Bombay [94], and a Hindi travel-domain dataset by C-DAC Noida [21]. A Telugu dataset for the Mandi information system was developed by IIIT Hyderabad, and an English, Hindi, and Telugu dataset for travel and emergency services was collected by IIT Hyderabad. Another general-purpose corpus of Telugu, Hindi, Tamil, and Kannada was prepared by IIT Kharagpur [265]. A Hindi and Indian English corpus was developed by KIIT, Bhubaneswar, supported by the Nokia research center, China [304]. These corpora have been studied to analyze the efforts made towards the development of ASR systems in Indian languages. Optimal databases were constructed for the three languages Tamil, Telugu, and Marathi to catalyze research activities. Two methods were adopted to collect the corpus for a landline and mobile dataset [172, 173], tested on the Sphinx-II toolkit. The corpus was collected in coordination with IIT Hyderabad and HP Labs, India. The Indian English and Hindi language corpus was collected [9] to capture the variability of the mobile communication environment with 100 speakers. The constructed database was employed in mobile speech recognition services. A Punjabi speech corpus of the Malwai dialect was collected from 50 speakers, which helped in the development of a Punjabi speech synthesis system. The recorded dataset was labeled phonemically to obtain phonemic and sub-phonemic information [26]. Issues in the construction of speech corpora for Indian languages were observed and studied [160]. MFCC and LPCC are the two most useful and promising techniques for extracting features from speech samples. MFCC has been used by many researchers for the feature extraction stage (Dua et al. 2006). However, Cutajar et al. [59] reported that MFCC features are not robust to noisy environments. Furthermore, the MFCC technique assumes that a frame of speech contains the information of a single phoneme at a time; however, this may not be true for continuous speech recognition systems, where a frame may contain information from two or more phonemes.

Another significant drawback is that MFCC features are extracted from the power spectra of the speech sample and thus do not include the phase spectra. Owing to this, speech enhancement needs to be performed on the speech signal before extracting the features [166]. The Discrete Wavelet Transform methodology includes temporal information along with the frequency information, and DWTs have been explored [264]. Researchers have tried using a hybrid of MFCC and DWT to enhance recognition, known as MFDWC (mel-frequency discrete wavelet coefficients) [333]. LPCC features have been used to overcome the demerits of LPC and MFCC [18], and the extensive use of LPCC has been studied. Lee and Hung [181] presented a comparative analysis of MFCC and LPC. Perceptual Linear Prediction (PLP) features consider three perceptual aspects: critical-band (Bark) spectral resolution, equal-loudness pre-emphasis, and the intensity-loudness power law. PLP features are well suited for noisy environments [88]. RelAtive SpecTrA PLP (RASTA-PLP) enhances the robustness of PLP; RASTA-PLP features are best used in noisy environments rather than noise-free environments. The Vector Quantization (VQ) method introduces quantization errors; VQ can be merged with MFCC [120] as well as DWT. Principal Component Analysis (PCA) improves the robustness of the system for noisy data [181, 320, 343]. Linear Discriminant Analysis (LDA) involves a supervised mechanism, unlike PCA, and variations of the conventional LDA have been reported [210, 343]. A brief review of the different techniques used in identifying and recognizing speech for ASR systems in non-Indian languages is presented in the table, which contains information about the data sets used, the feature extraction methodologies, and the recognition results.
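
Since the paragraph above contrasts PCA (unsupervised) with LDA (supervised) as post-processing of frame-level features, the short sketch below applies both reductions to the same labeled feature matrix with scikit-learn; the array shapes, class labels, and retained dimensions are illustrative placeholders, not settings from any of the cited studies.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

rng = np.random.default_rng(2)
# Toy frame-level features (e.g., 39-dimensional MFCC + delta vectors) with phone labels.
features = rng.normal(size=(500, 39))
labels = rng.integers(0, 5, size=500)        # 5 hypothetical phone classes

# PCA: unsupervised projection that maximizes the retained variance.
pca = PCA(n_components=13)
features_pca = pca.fit_transform(features)

# LDA: supervised projection that maximizes class separability (at most classes-1 dims).
lda = LinearDiscriminantAnalysis(n_components=4)
features_lda = lda.fit(features, labels).transform(features)

print(features_pca.shape, features_lda.shape)   # (500, 13) (500, 4)
```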

A comprehensive view of the speech recognition terminologies on Indian languages is presented in Table 4.

Table 4 Recognition results of Indian languages

A thorough study has been performed on the research papers used in this review, and most of the major languages have been covered. The research data have been put in tabular form for better understanding. In the next section, a synthesis analysis of the studies is presented. All the included studies have been critically reviewed and the analysis has been put forth for the readers.

Synthesis analysis

The authors list the following findings from the literature studied. The most relevant finding is that, to attain better accuracy, a researcher should use hybrid feature extraction techniques; such techniques provide efficient and useful information as input to a classifier.

(a) The method used to combine different classifiers depends on the information provided by each individual classifier.

(b) Owing to the variations in the speaking styles of individuals from different regions, speech recognition has been a demanding research area.

(c) The accuracy of the speech recognition and identification process relies on features with high discriminating power. Thus, there is a pressing need to study different feature selection algorithms that can achieve good accuracy.

(d) Due to the distinct characteristics and variations in tone, including the diverse speaking styles of speakers, it sometimes becomes a ponderous task for the researcher to work with different speech corpora with good accuracy.

(e) Since the performance of a classifier relies on the features obtained at the feature extraction stage, it is important to carefully select the feature extraction methods and the classifiers.

(f) Most of the work on Indian languages has been carried out on languages such as Hindi, Bangla, Telugu, and Tamil. Languages like Punjabi, Devanagari, Dogri, Kashmiri, Gujarati, and the languages of the northeast part of India are yet to be explored.

(g) The availability of standard databases for researchers is significantly low for Indian languages. This presents a future direction to perform various experiments.

(h) It is observed that the most commonly used feature extraction methods for speech recognition and identification tasks are PLP, LFCC, MFCC, and RASTA.

(i) It is also noted that the most commonly used classifiers for speech recognition and identification tasks are HMM, DNN, and DNN–HMM.

(j) It is also observed from the literature studied that the HMM classifier is very commonly used for speech recognition with good accuracy. The toolkits commonly used by the researchers are Sphinx and HTK. But again, some findings proved that accuracy rates can be increased with DNN or DNN–HMM models.

Suggestions on future directions

In the field of speech recognition, an immense number of directions can be explored for further research, as the techniques currently used for extracting features from speech samples can be extended by combining different techniques to improve the recognition rate and WER. The authors offer the following suggestions on future research directions for speech recognition [131]:

(a) Researchers should develop standard databases for Indian languages, with greater focus on LVCSR databases. These databases should be made accessible to researchers so that future research can flourish.

(b) Efficient feature extraction techniques have not yet been employed for Indian languages; researchers have largely focused on baseline MFCC features, as seen in the prevalent literature. Hybrid features like MFCC + LDA + MLLT, MFCC + BFCC + GFCC, LDA + MFCC, MF + PLP, RASTA-PLP, etc. may be implemented.

(c) More research can be carried out on methodologies for different regional languages.

(d) Researchers in the studied literature have reported their recognition accuracy results in clean, i.e., noise-free, environments, especially for Indian languages. However, in real-time applications, background noise and other factors affect the accuracy of recognizing speech.

(e) Tonal aspects of languages have not been put forward by most researchers. As per the literature, the authors could find tonal research only on languages like Mandarin Chinese, Bangla, and Arabic; other tonal languages lack such research initiatives.

(f) Languages like Punjabi, Dogri, and other under-resourced languages need a standard database to work on. Also, the tonal features of these languages have not been considered yet.

(g) Research on extracting efficient and suitable tonal features should be conducted for developing ASR systems for tonal languages.

Conclusion

In this paper, the authors have presented an extensive review and analysis of the different feature extraction techniques employed for speech recognition in non-Indian and Indian languages. The different evaluation parameters in which the researchers have reported their work have been summarized with the help of tables; these parameters include WER, PER, SER, accuracy, recognition rate, and comparative analyses of different techniques. The authors have presented a comprehensive study of the work done for different foreign languages across the world, including Chinese, Japanese, Russian, Romanian, Malay, Thai, and Arabic. A summary of the speech recognition research on Indian languages, i.e., Hindi, Bangla, Oriya, Tamil, Telugu, Malayalam, Punjabi, Kannada, Marathi, and Assamese, has also been given. The authors suggest that efficient feature extraction techniques implemented for non-Indian languages can be applied to Indian languages as well, because the research area for Indian languages is still wide open and not many accurate ASR systems have been developed for these languages. Furthermore, LVCSR systems should be explored more by researchers to improve the accuracy of speech recognition, so that applications for real-time use can be developed.

Fig. 1 Block diagram of ASR