Real Time Challenges to Handle the Telephonic Speech Recognition System

  • Joyanta Basu
  • Milton Samirakshma Bepari
  • Rajib Roy
  • Soma Khan
Conference paper
Part of the Lecture Notes in Electrical Engineering book series (LNEE, volume 222)


This paper describes the real-time challenges in designing a telephonic Automatic Speech Recognition (ASR) system. Telephonic speech data are collected automatically from all geographical regions of West Bengal to cover the major dialectal variations of spoken Bangla. All incoming calls are handled by an Asterisk server, i.e. a Computer Telephony Interface (CTI). The system asks a set of queries, and users' spoken responses are stored and transcribed manually for ASR system training. When the telephonic ASR is in use, users' voice queries pass through the Signal Analysis and Decision (SAD) module; based on its decision, the speech signal may enter the back-end ASR engine, and the relevant information is delivered to the user automatically. In a real-time scenario, telephonic speech contains channel drops, silence or no-speech events, truncated speech, noisy signals, etc., along with the desired speech event. This paper presents techniques that handle such unwanted signals in telephonic speech to a certain extent and provide nearly clean speech to the ASR system. Real-time telephonic ASR performance increased by 8.91 % after implementing the SAD module.


Keywords: Asterisk server · Interactive voice response · Transcription tool · Temporal and spectral features · Knowledge base

1 Introduction

Modern human life depends heavily on technology, and the associated devices, such as mobiles and PDAs with connectivity like GPRS, are becoming more and more portable. Alongside this, there is a growing demand for hands-free, voice-controlled public information retrieval services, such as weather forecasting, road-traffic reporting, travel enquiry and health informatics, accessible via hand-held devices (mobiles or telephones) to meet urgent, on-the-spot requirements. Real-life deployment of all these applications involves developing the modules required for a voice-query-based, easy user interface and quick information retrieval over mobiles. In fact, throughout the world the number of telephone users is much higher than that of PC users. Moreover, speech is the fastest form of human communication in a busy daily schedule, which further extends the usability of voice-enabled mobile applications in emergency situations. In such a scenario, a speech-centric user interface on smart hand-held devices is currently foreseen as a desirable interaction paradigm, for which Automatic Speech Recognition (ASR) is the only available enabling technology.

Interactive Voice Response (IVR) systems provide a simple yet efficient way of retrieving information from computers in speech form through telephones, but in most cases users still have to navigate the system via Dual Tone Multiple Frequency (DTMF) input and type their query on the telephone keypad. A comparative study by Lee and Lai [1] revealed that, in spite of occasionally low accuracy rates, a majority of users preferred interacting with the system by speech, as it is more satisfying, more entertaining and more natural than the touch-tone modality, which involves the use of hands, is quite time consuming and requires at least knowledge of the English alphabet. Furthermore, a variety of ASR system architectures [2] have been implemented, ranging from server-based implementations accessed by the device over a wireless network to recognizers embedded in the local processor of a specific device [3].

The present paper addresses some real-time challenges in handling a telephonic ASR application so as to provide better results. It also gives a clear picture of the above tasks in a well-planned, sequential manner, aiming towards the development of an IVR application in spoken Bangla. The methodology is language independent and can easily be adapted to other applications of a similar type.

2 Motivation of the Work

A practical IVR system should be designed so that it can handle real-time telephony hazards such as channel drops, clipping and speech truncation. It should also provide robust performance with respect to the following issues:
  1. Speaker variability: handle speech from any arbitrary speaker of any age, i.e. it should be a speaker-independent ASR system.
  2. Pronunciation/accent variability: different pronunciations, dialectal variations and accents within a particular Indian language.
  3. Channel variability: different channels, such as landline versus cellular, and different cellular technologies, such as GSM and CDMA.
  4. Handset variability: variability among mobile handsets due to differences in their spectral characteristics.
  5. Background noise: various kinds of environmental noise, so that the system is robust in real-world applications.


Considering the above requirements, the telephonic ASR is designed so that, to some extent, it meets the above-mentioned capabilities. Speech data are mainly collected from all geographical regions where the native Bangla-speaking population is considerably high. The collected speech data are then verified and used for ASR training. The reason for choosing a large geographical area for data collection is to cope with speaker and accentual variability. Additionally, various telephonic-channel issues, such as channel drops or packet loss during transmission, handset variability, service-provider variability, and various types of background noise such as cross-talk and vehicle noise, have been observed, analyzed and estimated from the collected speech data; modeling them can improve ASR performance. Addressing these issues not only improves system performance but also provides good research motivation for other telephonic applications.

3 Overall System Overview

The telephonic ASR system is designed so that users get the relevant information in a convenient manner. First the system offers the user a language preference (among Hindi, Bangla and Indian English); thereafter a directed question is asked each time, and the user replies with an appropriate response from a small set of words. The system is composed of three major parallel components: the IVR server (hardware and API), the Signal Processing Blocks and the Information Source. Figure 1 presents an overall block diagram of the system.
Fig. 1

Block diagram of telephonic ASR system

3.1 IVR Hardware and API

As shown in Fig. 2, the Interactive Voice Response (IVR) system consists of IVR hardware (generally telephony hardware), a computer and application software running on the computer. The IVR hardware is connected in parallel to the telephone line.
Fig. 2

Interactive voice response system

The IVR hardware lifts the telephone automatically when a user calls, recognizes the input information (such as a dialed digit or speech), interacts with the computer to obtain the necessary information, converts that information into speech form, and also converts the incoming speech into digital form and stores it on the computer.

In the development of the telephonic ASR system, Asterisk [4, 5], an open-source IVR server and converged telephony platform designed primarily to run on Linux, is used. It supports VoIP protocols such as SIP and H.323, interfaces with PSTN channels, supports various PCI cards, and has open-source drivers and libraries available.

3.2 Signal Processing Block

This block consists of three major sub-blocks: the Speech Acquisition and Enhancement module, the Signal Analysis and Decision module, and the ASR engine. A block diagram of the Signal Processing Block is shown in Fig. 3.
Fig. 3

Basic block diagram of signal processing blocks including automatic speech recognition system

3.2.1 Speech Acquisition and Enhancement Module

The first block, which comprises the acoustic environment plus the transduction equipment, can strongly affect the generated speech representations, because additive noise, room reverberation, the recording device type, etc. are associated with the process. A speech enhancement module suppresses these effects so that the incoming speech can be recognized even under heavily perturbed conditions.

3.2.2 Signal Analysis and Decision Module

In this module, every incoming speech waveform is analyzed to determine whether it contains a valid speech signal. This is the most important module of the system. It extracts temporal features such as Zero Crossing Rate (ZCR) and Short-Time Energy (STE), and spectral features such as formants. Using a predefined Knowledge Base (KB), the module then delivers a decision about the incoming speech signal: either the user is asked to re-record, or, if re-recording is not required, the recorded speech passes to the ASR engine for decoding. Details of this module are given in Sect. 4. In this paper we mainly examine the performance of this module.

3.2.3 ASR Engine

The basic task of Automatic Speech Recognition (ASR) is to derive a sequence of words from a stream of acoustic information. Automatic recognition of telephonic voice queries requires a robust back-end ASR system. CMU SPHINX [6], an open-source speech recognition engine, is used here; it consists of a speech feature extraction module, an acoustic model, a language model and a decoder [7].

3.3 Information Source

A repository of all relevant information is known as a trusted Information Source, and its design architecture typically depends on the type of information. In the current work, queries on dynamic information are referred to the trusted information source or online server, while all other information that does not change much over a considerable period of time is kept in a local database populated by a web crawler. For example, the PNR record, seat availability and running status of a train are typically dynamic information in the travel domain; the system's response to a query on dynamic information must deliver the latest information, so a specific reference is made to the trusted information source. On the other hand, information such as train schedules, train names and train numbers is fetched and stored periodically in the local database. This approach ensures quick delivery of information.
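The two-tier split described above can be sketched as a small class: static records are served from a periodically refreshed local cache, while dynamic queries always go to the trusted source. This is a minimal illustration under our own naming; the paper's actual crawler and server interfaces are not specified.

```python
import time

class InfoSource:
    """Sketch of the two-tier information source: static records are
    served from a local cache refreshed periodically (as by a crawler),
    while dynamic queries always hit the trusted online source."""

    def __init__(self, fetch_static, fetch_dynamic, refresh_s=86400):
        self.fetch_static = fetch_static    # e.g. crawler pulling train schedules
        self.fetch_dynamic = fetch_dynamic  # e.g. live PNR / seat-availability lookup
        self.refresh_s = refresh_s          # cache refresh period in seconds
        self._cache = {}
        self._stamp = 0.0

    def static_lookup(self, key):
        # Refresh the local database only when the cache has gone stale.
        if time.time() - self._stamp > self.refresh_s:
            self._cache = self.fetch_static()
            self._stamp = time.time()
        return self._cache.get(key)

    def dynamic_lookup(self, query):
        # Dynamic information must always reflect the latest state,
        # so every such query is referred to the trusted source.
        return self.fetch_dynamic(query)
```

Hooked to real fetchers, `static_lookup` answers train-name queries instantly from the local database, while `dynamic_lookup` guarantees fresh PNR status at the cost of one remote call per query.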

4 Signal Analysis and Decision Module

This is one of the most challenging modules of the entire telephonic ASR system. Here every speech signal is analyzed using temporal and spectral features, and, with the help of prior knowledge, the module decides what to do next with the stored speech signal. Currently four types of erroneous signals are considered, arising from the following problems:
  1. Telephonic-channel-related problems.
  2. Signals truncated at the beginning and/or end.
  3. Silence-only recordings.
  4. Heavy background noise (air flow, sudden noise, etc.) and cross-talk.

This module consists of a feature extraction module, a decision module and a Knowledge Base (KB). Figure 4 shows the basic blocks of the speech analysis and decision module.
Fig. 4

Speech analysis and decision module blocks

4.1 Temporal Feature Extraction

In this work, various temporal features such as Zero Crossing Rate (ZCR) and Short-Time Energy (STE) are used.

4.1.1 Zero Crossing Rate

The ZCR is the rate of sign changes along the signal, i.e. the rate at which the signal changes from positive to negative or vice versa. ZCR is higher for unvoiced speech than for voiced speech. It is defined as
$$ \mathrm{zcr} = \frac{1}{N-1}\sum_{n=1}^{N-1} \mathbb{I}\left\{ s(n)\,s(n-1) < 0 \right\}, $$
where s(n) is a speech signal of length N and the indicator function \( \mathbb{I}\{A\} \) is 1 if its argument A is true and 0 otherwise.
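The ZCR formula maps directly onto a couple of NumPy lines; this is a minimal sketch (the function name is ours, not the paper's):

```python
import numpy as np

def zero_crossing_rate(s):
    """Fraction of adjacent-sample pairs with a sign change,
    i.e. zcr = 1/(N-1) * sum over n of I{ s(n) s(n-1) < 0 }."""
    s = np.asarray(s, dtype=float)
    # s[1:] * s[:-1] < 0 is True exactly where consecutive samples
    # have opposite signs; the mean divides by N-1 pairs.
    return float(np.mean(s[1:] * s[:-1] < 0))
```

On a high-frequency (unvoiced-like) signal this returns a value close to 1, and on a slowly varying (voiced-like) signal a value close to 0, matching the discriminative use described above.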

4.1.2 Short-Time Energy

In this work, STE is used to determine the energy of the voiced and unvoiced regions of the signal. STE can also be used to detect transitions from unvoiced to voiced regions and vice versa, since the energy of a voiced region is greater than that of an unvoiced region. The STE is given by
$$ E_{n} = \sum\limits_{m = - \infty }^{\infty } {\left\{ {s(m)} \right\}^{2} } h(n - m),$$
where s(m) is the signal and h(n-m) is the window function.
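The STE equation is a convolution of the squared signal with the window h; with a rectangular window it can be sketched as follows (window choice and function name are ours):

```python
import numpy as np

def short_time_energy(s, win):
    """STE via E_n = sum_m s(m)^2 h(n-m): the squared signal convolved
    with the analysis window h (here a rectangular window of `win` samples)."""
    s = np.asarray(s, dtype=float)
    h = np.ones(win)
    # mode="same" gives one energy value per sample position n.
    return np.convolve(s ** 2, h, mode="same")
```

Plotting this envelope against the waveform reproduces the kind of STE curves shown in Figs. 6 and 7: high over voiced stretches, near zero over silence.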

4.2 Spectral Feature Extraction

In this work only the formant parameters are extracted from the speech signal. Formants are much more clearly visible in voiced signals than in unvoiced signals. The center frequency of the lowest resonance of the vocal tract, called the first formant frequency or F1, corresponds closely to the articulatory and/or perceptual dimension of vowel height (high vs. low, or close vs. open, vowels). Vowel classification has been carried out by measuring the formant frequencies F1, F2 and F3.
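The paper does not specify its formant analysis method; a standard way to estimate formants, sketched below under that assumption, is linear predictive coding (LPC): fit an all-pole model to the frame and read the formant frequencies off the angles of the complex pole pairs.

```python
import numpy as np

def lpc(x, order):
    """Autocorrelation-method LPC coefficients via the Levinson-Durbin recursion."""
    n = len(x)
    r = np.array([np.dot(x[: n - i], x[i:]) for i in range(order + 1)])
    a = np.zeros(order + 1)
    a[0] = 1.0
    e = r[0]
    for i in range(1, order + 1):
        # Reflection coefficient for stage i.
        k = -(r[i] + np.dot(a[1:i], r[i - 1:0:-1])) / e
        a_prev = a.copy()
        a[i] = k
        for j in range(1, i):
            a[j] = a_prev[j] + k * a_prev[i - j]
        e *= 1.0 - k * k
    return a

def formants(x, fs, order=10):
    """Estimate formant frequencies (Hz) from the angles of the LPC poles."""
    roots = np.roots(lpc(x, order))
    roots = roots[np.imag(roots) > 0]          # keep one root per conjugate pair
    freqs = np.angle(roots) * fs / (2 * np.pi)
    return sorted(f for f in freqs if f > 90)  # discard near-DC poles
```

Sorted ascending, the first three returned frequencies play the roles of F1, F2 and F3 in the vowel classification described above; in practice the analysis is run per voiced frame with a model order of roughly fs/1000 + 2.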

4.3 Knowledge Base

The KB module gathers information from the transcribed speech data that is required for ASR system training. A transcription tool [8] has been designed for offline transcription of the recorded speech data, so that all transcriptions made during data collection can be checked, corrected and verified manually by human experts. Automatic conversion of text to phonemes (phonetic transcription) is necessary to create the pronunciation lexicon used in ASR system training.

The methodology for Grapheme-to-Phoneme (G2P) conversion in Bangla is based on orthographic rules; however, Bangla G2P conversion sometimes depends not only on orthographic information but also on Parts-of-Speech (POS) information and semantics [9]. G2P conversion is an important task for data transcription, from which much information about the telephonic speech data was gathered. During transcription, the transcriber must assign transcription remark tags as well as noise tags; this is an entirely human-driven task. Descriptions and measurements of the remarks are given in Table 1, and Table 2 shows the different types of noise tags.
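The core of rule-based G2P is longest-match lookup over a grapheme-to-phoneme rule table. The sketch below is a toy illustration with a hypothetical romanized mapping of our own; the actual Bangla rules additionally need orthographic context and sometimes POS information, as noted above [9].

```python
# Toy longest-match grapheme-to-phoneme lookup. The mapping is a
# hypothetical romanized stand-in for illustration only.
G2P_RULES = {
    "kh": "K_H",  # two-letter grapheme must win over single "k"
    "k": "K",
    "a": "A",
    "o": "O",
    "n": "N",
}

def g2p(word, rules=G2P_RULES):
    """Convert a word to a phone list by greedy longest-match over the rules."""
    phones, i = [], 0
    while i < len(word):
        # Try the longest grapheme string that could match at position i.
        for length in range(min(len(word) - i, 2), 0, -1):
            g = word[i:i + length]
            if g in rules:
                phones.append(rules[g])
                i += length
                break
        else:
            raise ValueError(f"no rule covers {word[i]!r}")
    return phones
```

The longest-match ordering is what keeps a digraph like "kh" from being split into "k" + "h"; real Bangla G2P layers context-sensitive rewrite rules on top of this lookup.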
Table 1 Description and measurement of remarks

Wave remarks (Wr):
  A_UTTR (Amplitude): amplitude (partly or fully) will be modified
  C_UTTR (Clean): speech utterance may contain some non-overlapping non-speech event

Transcription remarks (Tr):
  CLPD_UTTR (Clipped): clipping of the speech utterance
  CPC_UTTR (Channel problem consider): channel drop occurs randomly in a silence region and does not affect the speech region
  CPR_UTTR (Channel problem reject): some words or phonemes dropped randomly
  I_UTTR (Improper): the utterance differs slightly from the prompt at the phoneme level
  Noise within speech
  Reasonable silence (pause) within speech
  R_UTTR (Reject): the speech is wrongly spelt, too noisy, or contains nonsense words, and cannot be understood
  S_UTTR (Silence): no speech utterance present
  TA_UTTR (Truncate accept): truncation of an insignificant amount (one or two phonemes) of the speech utterance
  TR_UTTR (Truncate reject): truncation of a significant amount of the speech utterance
  W_UTTR (Wrong): the utterance is totally different from the corresponding prompt

Table 2 Types of noise tags (explanations/examples): air flow; animal sound; sudden (impulsive) noise due to banging of a door; telephonic beep sound; sound of a bird; general background noise; breath noise; background speech (babble); background song; background instrument; children crying; clearing of throat; horn noise of vehicles; line noise; lip smack; hiccups, yawns, grunts; pause or silence; phone ringing; tongue click; vehicle noise

From Table 1, S_UTTR, CPR_UTTR, TR_UTTR and R_UTTR form the rejection tag set; the other tags may be accepted and considered during ASR training. These four rejection tags serve as knowledge obtained from the offline data transcription, and the main objective of the SAD module is to detect these rejection remarks in the incoming speech signal on-the-fly using this KB.
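Operationally, the KB reduces to a small decision table at call time: a rejection tag triggers a re-record prompt, anything else passes to the ASR engine. The sketch below uses the four tags from Table 1; the prompt texts are illustrative placeholders, not the system's actual prompts.

```python
# The four rejection tags from Table 1 drive the re-record decision.
REJECTION_TAGS = {"S_UTTR", "CPR_UTTR", "TR_UTTR", "R_UTTR"}

# Hypothetical re-prompt wording, one entry per rejection tag.
REPROMPTS = {
    "S_UTTR":   "No speech was detected. Please speak after the beep.",
    "CPR_UTTR": "Part of your reply was lost. Please repeat your answer.",
    "TR_UTTR":  "Your reply was cut off. Please repeat your answer.",
    "R_UTTR":   "Sorry, that was not understood. Please repeat your answer.",
}

def needs_rerecord(tag):
    """True if the SAD module's tag means the caller must be re-prompted;
    otherwise the utterance goes on to the ASR engine."""
    return tag in REJECTION_TAGS
```

Keeping the tag set and prompts in plain data structures lets the same IVR loop serve any domain: only the KB contents change.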

5 Observations and Results

For this work we collected almost 60 h of telephonic speech data from nineteen districts of West Bengal and transcribed it manually. This transcribed data helped to build up the KB. Table 3 presents the distribution of the collected speech data according to speakers' gender, age, educational qualification, recording handset model, service provider and environment, as a percentage of occurrences in each criterion. Figure 5 presents the true picture of the nature of real-world speech data after being checked and remarked by human experts with the help of the transcription tool. Table 4 describes the percentage of occurrence of the major noise tags.
Table 3 Variations in collected speech data: distribution (in %) by gender; age group (Child: 0–15, Adult: 15–30, Medium: 30–50, Senior: 50–99); qualification; handset model; service provider; and environment
Fig. 5

Percentage of occurrence of different types of remarks

Table 4 Percentage of occurrence of the major noise tag set
Figure 6 shows the STE and ZCR plots for observed CPR_UTTR, S_UTTR and TR_UTTR speech utterances. From the figures it has been observed that: (1) in the case of a channel problem (CPR), the signal ends (or sometimes begins) with high STE and high ZCR values, because users sometimes start speaking too late within the allotted time span, or too early; (2) in the case of S_UTTR, the signal is generally channel noise and mostly unvoiced sound, which is why high ZCR and low STE are observed over almost the entire recording; (3) in the case of TR_UTTR, channel packets have been lost, possibly due to poor network strength on the user's side or the type of handset, a very common problem in telephonic applications; here both the ZCR and the STE values change suddenly from low to high or from high to low. Formant analysis has also been carried out for all these rejection cases; the formant may be a good spectral feature for detecting S_UTTR automatically, but for the other rejection cases formant analysis does not perform well.
Fig. 6

CPR_UTTR. a Signal vs. STE. b Signal vs. ZCR; S_UTTR. c Signal vs. STE. d Signal vs. ZCR; TR_UTTR. e Signal vs. STE. f Signal vs. ZCR
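The observations above suggest simple frame-level rules: almost no high-energy frames means S_UTTR, while speech energy touching the recording boundary suggests a channel/timing problem (CPR). The sketch below implements just those two rules; the frame sizes and energy threshold are illustrative, not the paper's tuned values.

```python
import numpy as np

def sad_decision(s, fs, frame_s=0.025, hop_s=0.010, energy_thr=0.01):
    """Rule-based sketch of two SAD decisions described above.
    Frames whose short-time energy exceeds `energy_thr` count as speech."""
    s = np.asarray(s, dtype=float)
    frame, hop = int(frame_s * fs), int(hop_s * fs)
    ste = np.array([np.sum(s[i:i + frame] ** 2)
                    for i in range(0, len(s) - frame + 1, hop)])
    speech = ste > energy_thr
    if speech.mean() < 0.05:
        return "S_UTTR"    # (almost) no speech anywhere in the recording
    if speech[0] or speech[-1]:
        # High energy at the signal boundary: the caller spoke too early
        # or ran past the recording window (CPR-style rejection).
        return "CPR_UTTR"
    return "ACCEPT"
```

A fuller version would add the sudden low-to-high STE/ZCR discontinuity test for TR_UTTR and a ZCR-based voiced/unvoiced check, per the observations in Fig. 6.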

Figure 7 shows one observation of R_UTTR, in which many voiced and unvoiced regions are present, but these are basically background speech, not the expected reply from the user; the STE and ZCR plots also look natural. Hence it is genuinely difficult to extract R_UTTR automatically in all cases. However, when the overall STE is below the expected level for incoming speech, the speech analysis module marks the signal as R_UTTR. The performance of automatic R_UTTR extraction essentially depends on the background noise or unwanted speech; more work on this rejection type is in progress.
Fig. 7

R_UTTR. a Signal vs. STE. b Signal vs. ZCR

Based on the above observations, the signal analysis module was designed for the telephonic ASR system and tested in real-life scenarios. On-the-fly extraction of the rejection remark patterns is the main objective of this module. We analyzed 887 incoming user calls comprising 10,327 utterances in total. Table 5 shows the results of automatic extraction of the rejection remarks except R_UTTR; detecting R_UTTR still requires some manual intervention.
Table 5 Output of the signal analysis module and its decision (number of incoming calls; number of utterances)
Table 6 shows the accuracy of the signal analysis and decision blocks. It has been observed that the percentage accuracy for S_UTTR and TR_UTTR is better than for CPR_UTTR, while automatic extraction of R_UTTR is not yet satisfactory.
Table 6 Accuracy of the signal analysis and decision module (types of rejection remarks; % of correct decisions)
Table 7 shows the telephonic ASR performance with and without the signal analysis and decision module. With the module, the percentage accuracy improved by 8.91 %, which is really encouraging for the next level of research on this work.
Table 7 Telephonic ASR system accuracy (number of utterances; without vs. with the signal analysis and decision module)

6 Conclusion

In the proposed work, a detailed design of the signal analysis and decision module has been described to cope with the real-time challenges of a telephonic automatic speech recognition system. The work addresses various telephonic-channel issues, such as channel drops and packet loss during transmission, and handles several of them. Automatic extraction of CPR_UTTR, S_UTTR and TR_UTTR by the SAD module is encouraging, but automatic extraction of R_UTTR is not so easy, and further research on this particular utterance type is in progress. It has also been observed that, with the help of the SAD module, the telephonic ASR accuracy increased by 8.91 %, which is really encouraging for the researchers. More importantly, this kind of real voice-based information retrieval application is especially useful to people with no access to computers and the Internet, people who may lack computer skills or even reading/writing abilities, and the visually challenged. After successful completion of the present work, it will enable the development of similar speech-based access systems for other public-domain applications (such as medical, tourism, transport and emergency services).


  1. Lee K-M, Lai J (2005) Speech vs. touch: a comparative study of the use of speech and DTMF keypad for navigation. International Journal of Human-Computer Interaction 19(3)
  2. Furui S (2000) Speech recognition technology in the ubiquitous/wearable computing environment. In: Proceedings of the international conference on acoustics, speech and signal processing, pp 3735–3738
  3. Maes SH, Chazan D, Cohen G, Hoory R (2000) Conversational networking: conversational protocols for transport, coding, and control. In: Proceedings of the international conference on spoken language processing
  4. Gomillion D, Dempster B. Building telephony systems with Asterisk. Packt Publishing Ltd. ISBN 1-904811-15-9
  5. Meggelen JV, Madsen L, Smith J. Asterisk: the future of telephony. O'Reilly. ISBN-10: 0-596-51048-9, ISBN-13: 978-0-596-51048-0
  6. CMU Sphinx: open source speech recognition engine
  7. Basu J, Khan S, Roy R, Bepari MS (2011) Designing voice enabled railway travel enquiry system: an IVR based approach on Bangla ASR. ICON 2011, Anna University, Chennai, India, pp 138–145
  8. Basu J, Bepari MS, Roy R, Khan S (2012) Design of telephonic speech data collection and transcription methodology for speech recognition systems. FRSM 2012, KIIT, Gurgaon, pp 147–153
  9. Basu J, Basu T, Mitra M, Das Mandal S (2009) Grapheme to Phoneme (G2P) conversion for Bangla. O-COCOSDA international conference, pp 66–71

Copyright information

© Springer India 2013

Authors and Affiliations

  • Joyanta Basu (1)
  • Milton Samirakshma Bepari (1)
  • Rajib Roy (1)
  • Soma Khan (1)

  1. Centre for Development of Advanced Computing, Kolkata, India
