Perceptually Similar Languages

Language Identification (LID) refers to the task of identifying an unknown language from test utterances. In this paper, a new feature set, viz., T-MFCC, is developed by amalgamating the Teager Energy Operator (TEO) and the well-known Mel frequency cepstral coefficients (MFCC). The effectiveness of the newly derived feature set is demonstrated for identifying perceptually similar Indian languages such as Hindi and Urdu. A modified polynomial classifier structure of 2nd and 3rd order approximation is used for the LID problem. The results are compared with the state-of-the-art feature set, viz., MFCC, and T-MFCC is found to be effective (an average jump of 21.66%) in the majority of cases. This may be because T-MFCC represents the combined effect of airflow properties in the vocal tract (which are known to be language and speaker dependent) and the human perception process for hearing.


Introduction
Language Identification (LID) refers to the task of identifying an unknown language from test utterances. LID applications fall into two main categories: pre-processing for machine understanding systems and pre-processing for human listeners. In the first case, an LID system could be run in advance of the speech recognizer. In the second, LID might be used to route an incoming telephone call to a human switchboard operator fluent in the corresponding language [6]. Several techniques, such as spectral, prosodic, phoneme-level and word-level approaches, have been proposed in the literature for the LID problem. In this paper, we adopt a spectral-based approach [5] and show the effectiveness of the newly derived feature set, viz., Teager Energy based Mel Frequency Cepstral Coefficients (T-MFCC), for identification of perceptually similar Indian languages, viz., Hindi and Urdu.
with the help of a voice-activated tape recorder (Sanyo model no. M-1110C and Aiwa model no. JS299) with microphone input and a close-talking microphone (viz., Frontech and Intex). During recording of the contextual speech, the interviewer asked the speaker some questions in order to motivate him or her to speak on his or her chosen topic. Other details of the experimental setup and data collection are given in [7].

The Teager Energy Operator (TEO)
Features derived from linear speech production models assume that airflow propagates in the vocal tract as a linear plane wave. This pulsatile flow is considered the source of sound production [9]. According to Teager [8], this assumption may not hold, since the flow is actually separated and concomitant vortices are distributed throughout the vocal tract. He suggested that the true source of sound production is the vortex-flow interactions, which are non-linear, and proposed a non-linear model based on the energy of the airflow. Fig. 1 shows Teager's original investigations of the distinct flow patterns for the vowel 'i' at the top and bottom rear of the front oral cavity (due to the non-linear airflow) [8].

There are two broad ways to model the human speech production process. One approach is to model the vocal tract structure using a source-filter model. This approach assumes that the underlying source of a speaker's identity is the vocal tract configuration of the articulators (i.e., the size and shape of the vocal tract) and the manner in which the speaker uses the articulators in sound production [4]. An alternative way to characterize speech production is to model the airflow pattern in the vocal tract. The underlying concept here is that, while the vocal tract articulators do move to configure the vocal tract shape (creating cues for the speaker's identity [4]), it is the resulting airflow properties that serve to excite those models which a listener will perceive for a particular speaker's voice [8], [9].

Modeling the time-varying vortex flow is a formidable task. Teager therefore devised a simple algorithm that uses a non-linear energy-tracking operator, called the Teager Energy Operator (TEO) (in discrete time), for signal analysis, with the supporting observation that hearing is the process of detecting energy. The concept was further extended to the continuous domain by Kaiser [3].
According to Kaiser, the energy in a speech frame is a function of both amplitude and frequency. Let us discuss this point briefly.

The dynamics of a mass-spring system and its solution (which is a simple harmonic motion, S.H.M.) are described as

$$\frac{d^2x}{dt^2} + \frac{k}{m}\,x = 0 \;\Rightarrow\; x(t) = A\cos(\Omega t + \phi),$$

and the energy is given by

$$E = \frac{1}{2}\,m\,\Omega^2 A^2 \;\Rightarrow\; E \propto A^2\Omega^2. \tag{1}$$

From (1), it is clear that the energy of the S.H.M. displacement signal $x(t)$ is directly proportional not only to the square of the amplitude of the signal but also to the square of its frequency. Kaiser and Teager proposed an algorithm to calculate a running estimate of the energy content in the signal. The S.H.M. solution can be expressed in the discrete-time domain as

$$x_n = A\cos(\Omega n + \phi).$$

By trigonometry,

$$x_n^2 - x_{n+1}\,x_{n-1} = A^2\sin^2(\Omega) \approx A^2\Omega^2 = E_n \quad \text{(for small } \Omega\text{)},$$

where $E_n$ gives the running estimate of the signal's energy. In continuous and discrete time, the TEO of a signal is defined by

$$\Psi_c[x(t)] = \left(\frac{dx(t)}{dt}\right)^2 - x(t)\,\frac{d^2x(t)}{dt^2}, \qquad \Psi_d[x(n)] = x^2(n) - x(n+1)\,x(n-1). \tag{2}$$

It is well known that speech can be modeled as a linear combination of AM-FM signals in some cases [7], [9]. Each resonance or formant is represented by an AM-FM signal of the form

$$x(t) = a(t)\cos(\phi(t)), \tag{3}$$

where $a(t)$ is a time-varying amplitude signal and $\omega_i(t)$ is the instantaneous frequency given by $\omega_i(t) = d\phi/dt$. This model allows the amplitude and formant frequency (resonance) to vary instantaneously within one pitch period. It is known that the TEO can track the modulation energy and identify the instantaneous amplitude and frequency. Motivated by this fact, a new feature set based on the non-linear model of (3) is developed in this paper using the TEO. The idea of using the TEO instead of the commonly used instantaneous energy is to take advantage of its modulation energy tracking capability. This leads to a better representation of formant information (which is speaker specific and possibly language specific) in the feature vector than MFCC [7]. In the next section, we discuss the details of T-MFCC.
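The discrete-time operator is simple enough to check numerically. The following is a minimal sketch, assuming NumPy; the function name `teager_energy` and the tone parameters are illustrative choices, not taken from the paper. It applies the discrete TEO to a pure tone and compares the output with the closed form $A^2\sin^2(\Omega)$.

```python
import numpy as np

def teager_energy(x):
    """Discrete-time TEO: psi[n] = x[n]^2 - x[n+1] * x[n-1].

    Returns len(x) - 2 values, because the first and last samples
    lack one of the two neighbours the operator needs.
    """
    x = np.asarray(x, dtype=float)
    return x[1:-1] ** 2 - x[2:] * x[:-2]

# Illustrative tone parameters (assumptions, not values from the paper).
A, omega, phi = 0.5, 0.2, 0.3        # amplitude, rad/sample, phase
n = np.arange(200)
x = A * np.cos(omega * n + phi)

psi = teager_energy(x)
closed_form = A ** 2 * np.sin(omega) ** 2   # exact TEO output for a pure tone
small_angle = A ** 2 * omega ** 2           # the A^2 * Omega^2 approximation
```

For a pure tone the operator output is constant at every sample and independent of the phase, which is what makes it usable as a running energy estimate; the gap between `closed_form` and `small_angle` shrinks as $\Omega$ decreases.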

Teager Energy Based MFCC (T-MFCC)
For a particular speech sound in a language, the human perception process, by means of the ear, responds with better frequency resolution in the lower frequency range and relatively poorer frequency resolution in the higher frequency range. MFCC was developed to mimic this process. To compute MFCC, we warp the speech spectrum onto the Mel frequency scale. This Mel frequency warping is done by multiplying the magnitude of the speech spectrum of a pre-processed frame by the magnitude of the triangular filters in a Mel filterbank, followed by log-compression of the subband energies and finally a DCT. Davis and Mermelstein proposed one such filterbank in 1980 for speech recognition applications [2]. Thus, MFCC can be a potential feature for identifying perceptually distinct languages (because for perceptually similar languages there will be confusion in MFCC due to its dependence on the human perception process for hearing).

Traditional MFCC-based feature extraction thus involves pre-processing, computing the Mel spectrum of the pre-processed speech, log-compression of the subband energies, and finally a DCT to obtain the MFCC per frame [2]. In our approach, we employ the TEO for calculating the energy of the speech signal. One could apply the TEO in the frequency domain, i.e., take the TEO of each subband at the output of the Mel filterbank, but this is difficult from an implementation point of view. Let us discuss this point in detail. In the frequency domain, (2) for the pre-processed speech $x_p(n)$ implies

$$\mathcal{F}\{\Psi_d[x_p(n)]\} = \mathcal{F}\{x_p^2(n)\} - \mathcal{F}\{x_p(n+1)\,x_p(n-1)\}. \tag{4}$$

Using the shifting and multiplication properties of the Fourier transform, we have

$$\mathcal{F}\{\Psi_d[x_p(n)]\} = \frac{1}{2\pi}\left[X_p(\omega) \circledast X_p(\omega)\right] - \frac{1}{2\pi}\left[e^{j\omega}X_p(\omega) \circledast e^{-j\omega}X_p(\omega)\right], \tag{5}$$

where $X_p(\omega)$ denotes the Fourier transform of $x_p(n)$ and $\circledast$ denotes periodic convolution in frequency. Because (5) involves frequency-domain convolutions, it is difficult to implement in discrete time and is also time-consuming. We have therefore applied the TEO in the time domain. Let us now see the computational details of T-MFCC.
The speech signal $x(n)$ is first passed through a pre-processing stage to give the pre-processed speech signal $x_p(n)$. Next, we calculate the Teager energy of $x_p(n)$:

$$\Psi_d[x_p(n)] = x_p^2(n) - x_p(n+1)\,x_p(n-1).$$
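To make the difference between the two front ends concrete, here is a minimal sketch, assuming NumPy only. The text above gives the first T-MFCC step (time-domain TEO of the pre-processed frame); the remaining steps in this sketch are an assumption that T-MFCC then follows the usual MFCC chain (magnitude spectrum, Mel filterbank, log-compression, DCT). The filter count (24), FFT size (512), sampling rate (22050 Hz), the Mel formula used, and the choice to drop the zeroth cepstral coefficient are illustrative assumptions, and all function names are hypothetical.

```python
import numpy as np

def hz_to_mel(f):
    return 2595.0 * np.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def mel_filterbank(n_filters, n_fft, sample_rate):
    """Triangular filters with centres spaced uniformly on the Mel scale."""
    n_bins = n_fft // 2 + 1
    nyquist = sample_rate / 2.0
    edges_hz = mel_to_hz(np.linspace(0.0, hz_to_mel(nyquist), n_filters + 2))
    freqs = np.linspace(0.0, nyquist, n_bins)
    bank = np.zeros((n_filters, n_bins))
    for i in range(n_filters):
        lo, mid, hi = edges_hz[i], edges_hz[i + 1], edges_hz[i + 2]
        rising = (freqs - lo) / (mid - lo)
        falling = (hi - freqs) / (hi - mid)
        bank[i] = np.clip(np.minimum(rising, falling), 0.0, None)
    return bank

def dct2(v, n_out):
    """Unnormalized DCT-II, keeping coefficients 1..n_out (c0 is dropped)."""
    n = len(v)
    k = np.arange(1, n_out + 1)[:, None]
    m = np.arange(n)[None, :]
    return np.cos(np.pi * k * (2 * m + 1) / (2 * n)) @ v

def cepstrum(signal, bank, n_fft, n_ceps):
    spectrum = np.abs(np.fft.rfft(signal, n_fft))   # magnitude spectrum
    energies = bank @ spectrum                      # Mel subband energies
    return dct2(np.log(energies + 1e-10), n_ceps)   # log-compression, then DCT

def mfcc(frame, bank, n_fft, n_ceps=12):
    return cepstrum(frame, bank, n_fft, n_ceps)

def t_mfcc(frame, bank, n_fft, n_ceps=12):
    # Time-domain TEO first; the output is two samples shorter than the frame.
    teo = frame[1:-1] ** 2 - frame[2:] * frame[:-2]
    return cepstrum(teo, bank, n_fft, n_ceps)

# Stand-in for one pre-processed frame (random noise, not real speech).
rng = np.random.default_rng(0)
bank = mel_filterbank(24, 512, 22050)
frame = rng.standard_normal(514)
c_mfcc = mfcc(frame[:512], bank, 512)
c_tmfcc = t_mfcc(frame, bank, 512)
```

Because the TEO consumes one neighbour on each side, the T-MFCC frame here is two samples longer than the MFCC frame so that both paths feed 512 samples to the FFT.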

Experimental Results
In this paper, a modified polynomial classifier of 2nd and 3rd order approximation is used as the basis for all the experiments [1]. A detailed discussion of the modified classifier structure is beyond the scope of this paper and is given in [7]. Feature analysis was performed using 23.2 ms frames with an overlap of 50%, and the feature dimension was kept at 12. Each frame was pre-emphasized with the filter $1 - 0.97z^{-1}$, followed by Hamming windowing and then [...]. Two more samples per frame were taken to compute T-MFCC than MFCC because of the TEO processing. The experiments were performed for different testing speech durations (1 s, 3 s, 5 s, 7 s, 10 s, 12 s and 15 s) and training speech durations (30 s, 60 s, 90 s and 120 s). The results are shown as average success rates (over testing speech durations) in Table 2 (for Hindi and Urdu) and Table 3 (for Marathi and Hindi). In addition, overall success rates (computed as the average over testing speech durations followed by the average over training speech durations) are shown in Tables 4 and 5.

- [...] be language and speaker dependent [7]) and human perception process. So, T-MFCC is able to capture the speaker-specific and language-specific information better than MFCC.
- On the other hand, for both 2nd order and 3rd order polynomial approximation and identification of perceptually distinct languages (i.e., Marathi and Hindi), MFCC outperformed T-MFCC.
- There is a significant improvement in the performance of T-MFCC for the 3rd order approximation as compared to the 2nd order approximation. This is expected for a classifier with a higher-order polynomial approximation.
- The confusion matrix for T-MFCC was better than that for MFCC. This shows that T-MFCC has better class discrimination power than MFCC for distinguishing perceptually similar languages.
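The front-end settings quoted at the start of this section (23.2 ms frames, 50% overlap, per-frame pre-emphasis with $1 - 0.97z^{-1}$, Hamming windowing) can be sketched as follows. This is a minimal illustration assuming NumPy; the sampling rate of 22050 Hz is an assumption (the text does not state it), chosen because it makes a 23.2 ms frame come out at 512 samples. The function names and the random test signal are hypothetical stand-ins.

```python
import numpy as np

def frame_signal(x, frame_len, hop):
    """Slice x into overlapping frames; trailing samples that do not fill a frame are dropped."""
    n_frames = 1 + (len(x) - frame_len) // hop
    idx = np.arange(frame_len)[None, :] + hop * np.arange(n_frames)[:, None]
    return x[idx]

def preprocess(frames, alpha=0.97):
    """Per-frame pre-emphasis y[n] = x[n] - alpha * x[n-1], then Hamming windowing."""
    emphasized = np.empty_like(frames)
    emphasized[:, 0] = frames[:, 0]      # no previous sample inside the frame
    emphasized[:, 1:] = frames[:, 1:] - alpha * frames[:, :-1]
    return emphasized * np.hamming(frames.shape[1])

sample_rate = 22050                              # assumption, not stated in the text
frame_len = int(round(0.0232 * sample_rate))     # 23.2 ms frame
hop = frame_len // 2                             # 50% overlap

# One second of noise as a stand-in for recorded speech.
x = np.random.default_rng(0).standard_normal(sample_rate)
frames = preprocess(frame_signal(x, frame_len, hop))
```

Under these assumptions a 1 s test utterance yields 85 frames of 512 samples, so even the shortest test duration provides many feature vectors per decision.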

Conclusion
In this paper, Teager Energy based MFCC (T-MFCC) features are proposed for identifying perceptually similar Indian languages, viz., Hindi and Urdu. The performance of the newly proposed feature set was compared with that of MFCC and found to be effective. This work can be readily extended to identifying other perceptually similar Asian or European languages.