GMM based language identification system using robust features

In this work, we propose new feature vectors for a spoken language identification (LID) system. Mel frequency cepstral coefficients (MFCCs) and formant frequencies are derived from short-time windows of the speech signal. The formant frequencies are extracted from linear prediction (LP) analysis of the speech signal. From these two kinds of features, new feature vectors are derived using a cluster-based computation, and a GMM based classifier is designed using these new feature vectors. Language-specific a priori knowledge is applied to the recognition output. Experiments carried out on the OGI database show improved LID recognition performance.

Since they require only raw speech signals, LID systems are popular. These systems are very useful in several applications such as call routing, language translation, spoken document retrieval and front-end processing in multilingual systems (Schultz and Waibel 2001; Waibel et al. 2000; Chelba et al. 2008). LID is also a topic of great interest in the areas of intelligence and security for information distillation.
In practice, text independent spoken language recognition is far more challenging than text-based language recognition because there is no guarantee that a machine is able to transcribe speech to text without errors. We know that humans recognize languages through a perceptual or psychoacoustic process that is inherent in the auditory system. Therefore, the type of perceptual cues that human listeners use is always the source of inspiration for automatic spoken language recognition (Zhao et al. 2008).
These systems do not require labeled and segmented speech. There are several cues for identifying a spoken language, such as phonemes, prosody, phonotactics, syntax and structure. Among these, one of the most important cues for the LID task is the acoustic-phonetic cue. While the term acoustic refers to physical sound patterns, the term phonotactic refers to the constraints that determine permissible syllable structures in a language. We can consider acoustic features as a proxy for the phonetic repertoire and call them acoustic-phonetic features. On the other hand, we see phonotactic features as the manifestation of the phonotactic constraints in a language (NIST Language Recognition Evaluations 2007; Martin and Garofolo 2007).
The acoustic LID approach aims at capturing the essential differences among languages by modeling the distribution of spectral vectors directly. These systems use acoustic features such as Mel frequency cepstral coefficients, shifted delta cepstral coefficients, perceptual linear prediction (PLP) features and formants, which are extracted directly from speech signals.
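To make the acoustic front-end concrete, the following is a minimal MFCC sketch for a single windowed frame: power spectrum, mel-spaced triangular filterbank, log compression, then a DCT. The filterbank size and the 1e-10 log floor are illustrative choices, not values taken from the paper.

```python
import numpy as np

def mel_filterbank(n_filters, n_fft, sr):
    """Triangular mel-spaced filterbank, shape (n_filters, n_fft//2 + 1)."""
    hz_to_mel = lambda f: 2595.0 * np.log10(1.0 + f / 700.0)
    mel_to_hz = lambda m: 700.0 * (10.0 ** (m / 2595.0) - 1.0)
    mel_pts = np.linspace(hz_to_mel(0.0), hz_to_mel(sr / 2.0), n_filters + 2)
    bins = np.floor((n_fft + 1) * mel_to_hz(mel_pts) / sr).astype(int)
    fb = np.zeros((n_filters, n_fft // 2 + 1))
    for i in range(1, n_filters + 1):
        l, c, r = bins[i - 1], bins[i], bins[i + 1]
        for k in range(l, c):                       # rising slope
            fb[i - 1, k] = (k - l) / max(c - l, 1)
        for k in range(c, r):                       # falling slope
            fb[i - 1, k] = (r - k) / max(r - c, 1)
    return fb

def mfcc(frame, sr, n_mfcc=12, n_filters=26):
    """MFCCs for one frame: power spectrum -> mel energies -> log -> DCT."""
    n_fft = len(frame)
    spec = np.abs(np.fft.rfft(frame * np.hamming(n_fft))) ** 2
    mel_energy = mel_filterbank(n_filters, n_fft, sr) @ spec
    log_energy = np.log(mel_energy + 1e-10)
    # DCT-II (coefficients 1..n_mfcc) decorrelates the log energies
    n = np.arange(n_filters)
    dct = np.cos(np.pi * np.outer(np.arange(1, n_mfcc + 1),
                                  (2 * n + 1)) / (2 * n_filters))
    return dct @ log_energy
```

For the 12-dimensional MFCC vectors used later in this paper, `n_mfcc=12` matches the experimental setup.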
In the past, LID systems were developed with different types of features using statistical methods such as vector quantization, hidden Markov models and Gaussian mixture models. A pattern-classifier approach to LID using GMMs, with experiments on 100-dimensional LPC-derived feature vectors, was reported in Cimarusti and Eves (1982). A GMM is used to approximate the acoustic-phonetic distribution of a language, and it is generally believed that each Gaussian density in a GMM captures some broad phonetic classes. However, a GMM is not intended to model the contextual or dynamic information of speech; GMM based LID systems perform classification using information from a single observation.
Sound frequencies differ across languages, and this difference is characterized by acoustic features such as Mel frequency cepstral coefficients and delta cepstral coefficients. Using these features, LID was implemented with a Gaussian mixture classifier (Zissman 1996). Nagarajan and Murthy (2004) developed LID based on the recognition of syllable-like units using HMMs. The basic requirement for building syllable-like unit recognizers for all the languages to be identified is an efficient segmentation algorithm; earlier, Kamakshi Prasad et al. (2004) proposed an algorithm that segmented the speech signal into syllable-like units using minimum phase group delay.
Several systems have been implemented using vector quantization (VQ), discrete HMMs and Gaussian mixture models (GMMs) with acoustic features such as MFCCs and shifted delta cepstra (SDC) (Nakagawa and Suzuki 1993). Muthusamy et al. (1994) discussed segment-based approaches to LID with acoustic, phonotactic and prosodic features and carried out experiments combining spectral feature vectors and pitch. Nagarajan and Murthy (2002) proposed VQ based LID using several statistical methods with MFCCs and used a usefulness parameter to improve LID performance.
Communication among humans is established by speech, wherein the information to be conveyed is embedded in the sequence of sound units produced. These basic sound units are normally referred to as phonemes, and the sequences of phonemes used for communication are governed by the rules of the language. The acoustic characteristics of the phonemes are closely related to the manner in which they are produced: the phonemes depend on the type of excitation and the shape of the vocal tract system. As the basic sound units are characterized by a set of formant frequencies which correspond to the resonances of the vocal tract system, formants are one of the major acoustic cues for the identification of language from speech. Yet formant frequencies have rarely been used as acoustic features for language recognition, in spite of their phonetic significance.
State-of-the-art formant estimators locate candidate peaks of the spectra from short-time analysis of speech and perform temporal tracking. Traditional formant frequency estimation methods are based on spectral analysis and peak-picking techniques. The characteristics of phonemes are generally manifested in the spectral properties of the speech signal, and formants are a good choice for representing the acoustic features of the basic sound units, which makes them useful for LID (Yegnanarayana 1978). Formants are extracted using linear prediction coefficients (LPC), and these formants carry information about the different sounds (Bruce et al. 2002; Bruce and Mustafa 2006).
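The LP-based formant extraction described above can be sketched as follows: LP coefficients are obtained by the autocorrelation method, and formant candidates are taken from the angles of the complex roots of the prediction polynomial A(z). The LP order of 10 is an illustrative default, not necessarily the order used in the paper.

```python
import numpy as np

def lpc_coeffs(frame, order=10):
    """LP coefficients via the autocorrelation (Yule-Walker) method."""
    n = len(frame)
    r = np.correlate(frame, frame, mode='full')[n - 1:n + order]
    # Toeplitz normal equations R a = r[1:]
    R = np.array([[r[abs(i - j)] for j in range(order)] for i in range(order)])
    a = np.linalg.solve(R, r[1:order + 1])
    return np.concatenate(([1.0], -a))   # A(z) = 1 - sum_k a_k z^{-k}

def formants(frame, sr, order=10, n_formants=5):
    """Formant estimates from the angles of the roots of A(z)."""
    a = lpc_coeffs(frame, order)
    roots = np.roots(a)
    roots = roots[np.imag(roots) > 0]          # one root per conjugate pair
    freqs = np.sort(np.angle(roots) * sr / (2 * np.pi))
    return freqs[:n_formants]                  # lowest candidates as formants
```

A production estimator would also filter candidates by bandwidth (root magnitude) and track them across frames; this sketch only shows the core root-angle computation.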
In this paper, we propose new features for a text-independent LID system. Mel frequency cepstral coefficients (MFCCs) and formant frequencies are extracted from short-time processing of speech, and these two kinds of features are concatenated to form a feature vector. The formant features are extracted using the linear prediction coefficients of the speech signal. In the training phase, one GMM is created for each language using the new feature vectors of that language. In the testing phase, MFCCs and formants are extracted from the unknown utterance, the combined features are transformed into new feature vectors, and these are evaluated against the GMM of each listed language. The usefulness of the derived feature vectors is computed as their weightage, and a language is hypothesized based on the maximum usefulness value of the sequence of new feature vectors of the unknown utterance against each GMM. The steps followed in the implementation of the LID system are explained in the following sections.

Feature extraction
The performance of any automatic LID system depends on several parameters, among which the selection of feature vectors is very important. For text-independent LID, feature vectors are extracted from the speech signal without any knowledge of its content. From existing systems (Nagarajan and Murthy 2002), it is observed that if the frequency of phonemes differs across languages, the frequency of feature vectors also differs, as there is a correspondence between these two entities.

Derivation of new features
In this work, m-dimensional MFCC features and n-dimensional formant frequencies are extracted from the speech signal and concatenated to form (m + n)-dimensional feature vectors. These feature vectors are grouped into r clusters using a clustering algorithm, and one Gaussian is designed for each cluster. Each combined feature vector is passed through all the Gaussians, and its probability under each is calculated using the probability density function of the respective Gaussian. These probabilities are treated as the coefficients of the new feature vector, so each (m + n)-dimensional feature vector is transformed into an r-dimensional new feature vector as described in Fig. 1. This method for deriving new features is used in both the training and testing phases of the LID system. In the training phase, new features are extracted from a large corpus of language-specific speech for each language; in the testing phase, they are extracted from the unknown utterance.
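The derivation above can be sketched as follows. K-means stands in for the unspecified clustering algorithm, one diagonal-covariance Gaussian is fitted per cluster, and each input vector is mapped to its r Gaussian likelihoods. The diagonal covariance and the variance floor are assumptions for numerical stability, not details from the paper.

```python
import numpy as np
from scipy.cluster.vq import kmeans2
from scipy.stats import multivariate_normal

def fit_cluster_gaussians(features, r=15, seed=0):
    """Group (m+n)-dim features into r clusters; fit one Gaussian per cluster."""
    _, labels = kmeans2(features, r, seed=seed, minit='++')
    gaussians = []
    for k in range(r):
        members = features[labels == k]
        if len(members) == 0:          # guard against an empty cluster
            members = features
        mean = members.mean(axis=0)
        var = members.var(axis=0) + 1e-6   # diagonal covariance with a floor
        gaussians.append((mean, var))
    return gaussians

def transform(features, gaussians):
    """Map each (m+n)-dim vector to its r Gaussian pdf values (new feature)."""
    out = np.empty((len(features), len(gaussians)))
    for k, (mean, var) in enumerate(gaussians):
        out[:, k] = multivariate_normal.pdf(features, mean=mean,
                                            cov=np.diag(var))
    return out
```

With the paper's settings (m = 12, n = 5, r = 15), `transform` would map 17-dimensional concatenated vectors to 15-dimensional new feature vectors.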

Motivation of this work
The human speech apparatus is capable of producing a wide range of sounds. Speech sounds as concrete acoustic events are referred to as phones, whereas speech sounds as entities in a linguistic system are termed as phonemes (Bruce et al. 2002). The number of phonemes used in a language ranges from about 15 to 50, with the majority having around 30 phonemes each. Phonetic repertoires differ from language to language although languages may share some common phonemes. These differences between phonetic repertoires imply that each language has its unique set of phonemes thus acoustic-phonetic feature distributions (Kirchhoff 2006).
The performance of any LID system depends on the type of feature vectors and the classifier used. If the feature vectors do not represent the underlying phonetic content of the speech, the system will perform poorly irrespective of the classifier used, so the selection of features is very important for good recognition performance. The fundamental cue for recognizing a spoken language is that the frequency of occurrence of basic sound units differs across languages. In short-term speech processing, it is very likely that most of the cues of the basic sound units are captured within a short-term window, so there is a close resemblance between basic sound units and the derived feature vectors. This has motivated us to explore new features. In earlier systems, phonemes were described with acoustic features, which are represented well by MFCCs. However, state-of-the-art LID systems gave poor results for tonal languages with MFCC features alone. This has motivated us to form features by combining MFCCs and formants.
In VQ based LID systems, only one code book index is considered for each feature vector, discarding the second-best and third-best indices. But these cues are also very important for making comprehensive decisions. Hence it is proposed to use the k-best alternatives in the decision-making process instead of a single code book index.
As the probability of a feature vector in one language may be greater than in some languages and smaller than in others, the significance of a feature vector cannot be estimated in isolation. Instead, a weightage is given to each feature vector based on its log likelihood ratio for language identification. To estimate the significance of a feature vector among the languages, its usefulness is computed between a pair of languages, and the language that gives the maximum usefulness is allowed further comparisons with the other languages (Nagarajan and Murthy 2004). We propose to evaluate our new features using this usefulness criterion.

Design of text-independent LID using proposed model
In the proposed model, GMM based LID is implemented using the new feature vectors of the speech signal and their usefulness. The LID system involves two phases, namely training and testing. Each spoken language is represented by one GMM; if there are M languages, there are correspondingly M GMMs in the recognition system. Each of the M GMMs is trained on a language-specific speech corpus, unlike the r cluster Gaussians, for which the speech corpora of all languages are combined, r clusters are formed and one Gaussian is designed per cluster as described in Sect. 2.

Training phase
The training phase involves two steps. In the first step, a speech corpus of 25 minutes duration per language is considered. m-dimensional MFCC feature vectors and n-dimensional formant frequencies are extracted from each of the listed languages by applying overlapped short-time windows, and these (m + n)-dimensional feature vectors are converted into r-dimensional new feature vectors as discussed in Sect. 2.1. The second step involves training the GMMs, one for each language, using the Baum-Welch reestimation algorithm (Nagarajan and Murthy 2002). The r-dimensional new feature vectors of the speech corpus are used for this training.
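The per-language model training can be sketched with a small diagonal-covariance GMM trained by EM (the reestimation step underlying Baum-Welch). The mixture count, iteration count and variance floor below are illustrative defaults, not the paper's exact settings.

```python
import numpy as np

def train_gmm(X, n_mix=8, n_iter=50, seed=0):
    """Diagonal-covariance GMM for one language, trained by EM reestimation."""
    rng = np.random.default_rng(seed)
    means = X[rng.choice(len(X), n_mix, replace=False)]
    var = np.tile(X.var(axis=0) + 1e-6, (n_mix, 1))
    w = np.full(n_mix, 1.0 / n_mix)
    for _ in range(n_iter):
        # E-step: responsibilities from per-component log densities
        logp = (-0.5 * (((X[:, None, :] - means) ** 2 / var).sum(-1)
                        + np.log(2 * np.pi * var).sum(-1)) + np.log(w))
        logp -= logp.max(axis=1, keepdims=True)
        resp = np.exp(logp)
        resp /= resp.sum(axis=1, keepdims=True)
        # M-step: reestimate weights, means and variances
        nk = resp.sum(axis=0) + 1e-10
        w = nk / len(X)
        means = (resp.T @ X) / nk[:, None]
        var = (resp.T @ (X ** 2)) / nk[:, None] - means ** 2 + 1e-6
    return w, means, var

def log_likelihood(X, model):
    """Total log likelihood of a feature-vector sequence under one GMM."""
    w, means, var = model
    logp = (-0.5 * (((X[:, None, :] - means) ** 2 / var).sum(-1)
                    + np.log(2 * np.pi * var).sum(-1)) + np.log(w))
    m = logp.max(axis=1, keepdims=True)
    return float((m[:, 0] + np.log(np.exp(logp - m).sum(axis=1))).sum())
```

In training, `train_gmm` would be called once per language on that language's r-dimensional new feature vectors; `log_likelihood` is the score used at test time.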

Testing phase
In the first step, the new feature vectors of the unknown utterance are derived using the procedure followed in the training phase, as in Fig. 2. These new feature vectors are then used as the observation sequence for the GMMs.
In the second step, the new feature vectors of the unknown utterance are evaluated against each of the M GMMs, where M is the number of languages under consideration, using the forward-backward algorithm as in Fig. 3.
The significance of feature vectors among the languages is obtained by computing the usefulness of a feature vector between a pair of languages, and the language which gives the maximum usefulness is allowed further comparisons with the other languages (Nagarajan and Murthy 2002). The usefulness of a spectral vector is defined (Nagarajan and Murthy 2002) as the log likelihood ratio

u(V_k) = log [ P(V_k / λ_i) / P(V_k / λ_j) ]    (1)

where V is the sequence of feature vectors, λ_i and λ_j are the languages considered in training, and P(V_k / λ_i) is the likelihood of feature vector V_k in language λ_i.
In the third step, the usefulness of all feature vectors for a pair of languages is calculated using (1). The language which gives the maximum usefulness over all feature vectors is allowed further comparisons with the other languages, one at a time. This process is repeated for all languages under consideration; if there are M languages, the number of comparisons is only (M − 1). After all languages have been compared, the language which gives the maximum usefulness in the last comparison is identified as the recognized language.
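The pairwise knockout over the M languages can be sketched as follows. The per-vector log likelihoods are assumed to come from the language GMMs, and summing the per-vector log ratios is an assumed reading of "usefulness of all feature vectors" under Eq. (1).

```python
import numpy as np

def usefulness(loglik_i, loglik_j):
    """Total usefulness of a sequence for language i against language j:
    the sum of per-vector log likelihood ratios, as in Eq. (1)."""
    return float(np.sum(np.asarray(loglik_i) - np.asarray(loglik_j)))

def identify(per_lang_logliks):
    """Knockout over M languages: M-1 comparisons, the winner stays on.

    per_lang_logliks maps each language name to the array of per-vector
    log likelihoods of the unknown utterance under that language's GMM.
    """
    langs = list(per_lang_logliks)
    winner = langs[0]
    for challenger in langs[1:]:
        if usefulness(per_lang_logliks[challenger],
                      per_lang_logliks[winner]) > 0:
            winner = challenger
    return winner
```

For M languages this performs exactly M − 1 comparisons, matching the count stated above.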

Experimental setup
The experiments are carried out using MATLAB 9.0 on the Windows 7 platform. The OGI database has been used for this study (OGI Multi Language Telephone Speech 2004). MFCC and formant features are extracted from each short-term window, with successive windows overlapped. In our experiments, 12 MFCCs (m = 12) and five formant frequencies (n = 5) are considered, and the 17-dimensional (m + n) concatenated feature vectors are evaluated against 15 Gaussians (r = 15) to derive the new feature vectors. Gaussian mixture models with 8, 16 and 32 mixtures are implemented using the new features. Testing is performed on utterances of 1 s, 2 s and 3 s duration using the proposed method.

Results
The performance of language identification on the OGI database, for different durations of test utterance and varying numbers of GMM mixtures using the usefulness value, is calculated with different feature vectors extracted from the short-term windowed speech signal. The performance using only MFCC feature vectors is depicted in Table 1. The performance with concatenated MFCC and formant features is furnished in Table 2; formants are extracted from the LP spectrum for both training and testing. The performance of the LID system using the new features derived from the concatenated MFCC and formant features is shown in Table 3. Performance is measured as the percentage of test samples correctly identified. The performance of the LID system is also reported in terms of Identification Rate (IR), False Acceptance Rate (FAR) and False Rejection Rate (FRR). IR is the percentage of test utterances that are in a given language and are classified as "true" for that language. FAR is the percentage of test utterances that are not in a given language but are classified as "true" for it. FRR is the percentage of test utterances that are in a given language but are classified as "false" for it.
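The three rates defined above can be computed directly from true and predicted label lists; this helper is a straightforward reading of those definitions, not code from the paper.

```python
def lid_rates(true_labels, predicted, language):
    """IR, FAR and FRR (in %) for one language from label lists.

    IR  = in-language utterances classified as that language
    FAR = out-of-language utterances classified as that language
    FRR = in-language utterances classified as some other language
    """
    in_lang = [p for t, p in zip(true_labels, predicted) if t == language]
    out_lang = [p for t, p in zip(true_labels, predicted) if t != language]
    ir = 100.0 * sum(p == language for p in in_lang) / max(len(in_lang), 1)
    far = 100.0 * sum(p == language for p in out_lang) / max(len(out_lang), 1)
    frr = 100.0 * sum(p != language for p in in_lang) / max(len(in_lang), 1)
    return ir, far, frr
```

Note that IR and FRR sum to 100 % for each language, since every in-language utterance is either accepted or rejected.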
The performance of LID in terms of IR, FAR, FRR for different duration of test utterances and varying number of Mixtures of GMM using likelihood value and usefulness is depicted in Table 4.
It is observed that the average performance of the LID task on the OGI languages, for utterances of 1 s, 2 s and 3 s duration, is higher with 32 mixtures than with 16 or 8 mixtures, as shown in Tables 3 and 4. A computational time analysis is also performed for the experiments carried out in this work on a Core i3 processor with 2 GB RAM. A comparison of the time taken to test unknown speech utterances of different durations is given in Table 5.

Conclusions
In this paper, a new GMM based approach has been proposed for text-independent language recognition using new feature vectors derived from MFCC feature vectors and formants. The formants are extracted from the LP spectrum of the speech signal, and the LID system is developed using Gaussian mixture models with different numbers of mixtures. The formant and MFCC feature vectors represent the acoustic characteristics of the speech signal well, so LID performance is improved. A significant improvement in recognition performance was found when the usefulness criterion was combined with the new feature vectors. The procedures adopted in this paper are general in nature and hence could be extended to any GMM or HMM based pattern recognition task. The best average recognition performance of this text-independent LID system, achieved with 32 mixtures on the OGI languages, is 98.8 %.
Open Access This article is distributed under the terms of the Creative Commons Attribution License which permits any use, distribution, and reproduction in any medium, provided the original author(s) and the source are credited.