1 Introduction

Speaker classification is a fundamental component of speaker recognition systems and covers two alternative tasks: speaker identification and speaker verification. The goal of speaker identification is to label an unknown speech file with a speaker identity, whereas the task of speaker verification is to validate a speaker's claim about his or her identity [1, 2]. Speaker classification has been used in human-machine dialog systems, forensics, medicine and many other applications. One interesting application is as a preprocessing step in speech recognition and keyword spotting, where it isolates the speaker of interest, which is in turn useful in many security applications. Mixture models have been widely adopted to address the speaker classification task [3]. Recently, mixture models have also been employed for object recognition and classification through clustering [4, 5]. A two-level hierarchical clustering framework based on the inverted Dirichlet mixture model is presented in [6], where it is used for object clustering and recognition. In this work, the same hierarchical clustering framework is adapted to use the bounded generalized Gaussian mixture model (BGGMM) with independent component analysis (ICA) and is employed for speaker classification. Gender classification and 10-speaker classification are performed through the hierarchical clustering framework using the ICA mixture model. The BGGMM with ICA presented in [7] is applied for the statistical learning of the clustering framework. Speaker classification based on supervised hierarchical clustering also serves to validate the effectiveness of the ICA mixture model in speaker recognition and statistical learning. Gender classification is performed on the TIMIT and TSP speech databases, and 10-speaker classification is conducted on the TSP speech database.
Both classification frameworks are also implemented using a Gaussian mixture model (GMM) in order to assess the performance of the ICA mixture model in statistical learning. It is observed that the classification framework based on hierarchical clustering performs well in both classification scenarios and that the ICA mixture model outperforms the GMM in terms of classification rate. It is also observed that the well-known difficulty of recognizing female speakers is alleviated by employing a multi-cluster model instead of the classical model during learning.

2 Supervised Hierarchical Clustering via ICA Mixture Model

In this section, the supervised hierarchical clustering framework based on the ICA mixture model is presented and applied to speaker classification. The ICA mixture model is trained on the training data, and the posterior probability is used to compute the cluster membership of each training observation. The class labels of the training data are then used to assign each cluster to a particular class. For the testing data, the posterior probability is computed and the cluster-class information obtained during training is used to find the class of each test observation. Since the class labels of the training data are used to assign clusters to classes and the ICA mixture model is used for the statistical learning, this framework is called the supervised hierarchical clustering framework based on the ICA mixture model. Let the training data be represented as \(\mathcal {X}=({\varvec{X}}_{1},\ldots ,{\varvec{X}}_{N})\), where each observation is a D-dimensional random variable \({\varvec{X}}_{i}=(X_{i1},\ldots ,X_{iD})\). The random variable \({\varvec{X}}_{i}\) follows a K-component mixture distribution if its probability distribution can be written in the following form:

$$\begin{aligned} p({\varvec{X}}_{i}|\Theta )=\sum _{j=1}^{K}p({\varvec{X}}_{\varvec{i}}|\theta _{j})p_{j} \end{aligned}$$
(1)

provided \(p_{j}\ge 0\) and \(\sum _{j=1}^{K}p_{j}=1\). In Eq. (1), \(\Theta =\left\{ p_1,\ldots ,p_K,\theta _1,\ldots ,\theta _K\right\} \), where \(\theta _j\) is the set of parameters of the jth component and \(p_j\) is the mixing proportion of the jth component of the mixture model. For training data \(\mathcal {X}\) consisting of N independent and identically distributed vectors, the likelihood of the K-component mixture model can be expressed as follows:

$$\begin{aligned} p(\mathcal {X}|\Theta )=\prod _{i=1}^{N}\sum _{j=1}^{K}p({\varvec{X}}_{\varvec{i}}|\theta _{j})p_{j} \end{aligned}$$
(2)

For each random variable \({\varvec{X}}_{i}\), let \(Z_i\) be a K-dimensional vector representing the missing group indicator, which indicates the component to which \({\varvec{X}}_{i}\) belongs: \(Z_{ij}\) is equal to 1 if \({\varvec{X}}_{i}\) belongs to component j and 0 otherwise. The complete data likelihood is then:

$$\begin{aligned} p(\mathcal {X},Z|\Theta )=\prod _{i=1}^{N}\prod _{j=1}^{K}\left( p({\varvec{X}}_{\varvec{i}}|\theta _{j})p_{j}\right) ^{Z_{ij}} \end{aligned}$$
(3)

The complete data log-likelihood can be written as:

$$\begin{aligned} L(\Theta ,Z,\mathcal {X})=\sum _{i=1}^{N}\sum _{j=1}^{K}{Z_{ij}}\log \left( p({\varvec{X}}_{\varvec{i}}|\theta _{j})p_{j}\right) \end{aligned}$$
(4)

Each \(Z_{ij}\) is then replaced by its expectation, defined as the posterior probability that the ith observation belongs to the jth component of the mixture model:

$$\begin{aligned} Z_{ij}=p(j| {\varvec{X}}_{i})=\frac{p({\varvec{X}}_{i}|\theta _j)p_{j}}{\sum _{j=1}^{K}p({\varvec{X}}_{i}|\theta _j)p_{j}} \end{aligned}$$
(5)
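The posterior in Eq. (5) can be sketched numerically as follows. This is a minimal illustration rather than the authors' implementation: the function name `posterior_memberships` and the toy density values are assumptions, and the per-component log-densities \(\log p({\varvec{X}}_{i}|\theta _j)\) are taken as given, so any component density can be plugged in. The computation is carried out in the log domain for numerical stability.

```python
import numpy as np

def posterior_memberships(log_lik, mix_weights):
    """Posterior p(j | X_i) of Eq. (5).

    log_lik     : (N, K) array of log p(X_i | theta_j), supplied by the caller
    mix_weights : (K,) array of mixing proportions p_j (non-negative, sum to 1)
    """
    log_joint = log_lik + np.log(mix_weights)            # log(p(X_i|theta_j) p_j)
    log_joint -= log_joint.max(axis=1, keepdims=True)    # shift rows to avoid underflow
    joint = np.exp(log_joint)
    return joint / joint.sum(axis=1, keepdims=True)      # normalise over components

# Toy density values (hypothetical, for illustration only): 3 observations, 2 components.
log_lik = np.log(np.array([[0.20, 0.10],
                           [0.05, 0.40],
                           [0.30, 0.30]]))
weights = np.array([0.5, 0.5])

Z = posterior_memberships(log_lik, weights)
clusters = Z.argmax(axis=1)   # hard cluster membership of each observation
```

Taking the `argmax` over components turns the soft memberships into the hard cluster assignment that the framework uses when marking clusters with class labels.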

The membership of \({\varvec{X}}_{\varvec{i}}\) computed from the posterior probability, together with its class label, can be used to assign each cluster to a particular class. This cluster-class information is then used to decode the test data: each test observation is assigned to a cluster through its posterior probability and inherits the class of that cluster. If the testing data is represented as \(\mathcal {Y}=({\varvec{Y}}_{1},\ldots ,{\varvec{Y}}_{L})\), the posterior probability for \({\varvec{Y}}_{\varvec{l}}\) can be computed using the trained mixture model as follows:

$$\begin{aligned} p(j| {\varvec{Y}}_{l})=\frac{p({\varvec{Y}}_{l}|\theta _j)p_{j}}{\sum _{j=1}^{K}p({\varvec{Y}}_{l}|\theta _j)p_{j}} \end{aligned}$$
(6)

Fig. 1. Gender speaker classification using clustering

Fig. 2. Multi-speakers classification using clustering

The supervised hierarchical framework for gender classification is shown in Fig. 1. The speech data consists of MFCC features for male and female speakers, and the class labels are also provided. The ICA mixture model is trained in an unsupervised fashion and the posterior probability of each training observation is computed. The posterior probability assigns each observation to a specific cluster, and the class information of the training data is then used to assign each cluster to the class to which it belongs. For instance, if \({\varvec{X}}_{i}\) belongs to the male class and lies in cluster 2, then cluster 2 is marked as a male cluster. In this way, all clusters can be marked as male or female from the training information and the class labels. In Fig. 1, it is assumed that the ICA mixture model is learned with 10 mixture densities and that a class label is available for each observation. From the posterior probability it is inferred that the female observations in the speech data belong to clusters J1, J7 and J9, so these clusters are labeled as the female class, and the remaining clusters are labeled as the male class in the same way. It is worth mentioning that the training of the ICA mixture model is unsupervised because the speech data is used without any class labels during training. However, the clustering framework is supervised because the class labels are employed after training to assign the clusters to specific classes. In the 10-speaker classification, the same binary classification framework is extended to 10 classes (see Fig. 2), and the clusters obtained from the posterior probability are decoded into particular classes based on the class labels of the training data. In classification using clustering, one important aspect is to choose the number of clusters that represents the data accurately. In the classical approach, the data is modeled by a mixture model with a fixed number of components, equal to the number of classes.
There are two problems associated with the classical approach: (i) a single density component per class does not necessarily fit the class data, and (ii) the classes overlap when a single distribution is used to model each class [6]. In speaker recognition, modeling several speakers, or even a single speaker, as one class may suffer from both problems. This is because several speakers in a single class always have some distinct features, and even the same speaker will behave differently when pronouncing the same words or utterances at different times. Because of these problems with the classical model, we have adopted a multi-cluster model, which improves the learning of the classification framework. There is a further problem with the learning of female speakers: it is reported that speaker recognition performance for female speakers is generally worse than for male speakers [8, 9]. It is observed that with multi-cluster modeling, the learning of the female class is improved. The bounded generalized Gaussian mixture model with ICA proposed in [7] is employed as the statistical model for learning; it uses maximization of the log-likelihood and the ICA model to estimate its parameters. In an ICA mixture model, it is assumed that the observed data comes from a mixture model and can be categorized into mutually exclusive classes, each of which is modeled as an ICA [10–12]. The mixture model in Eq. (2) is composed of bounded generalized Gaussian distributions (BGGDs), each of which has a mean \(\mu \), a standard deviation \(\sigma \) and a shape parameter \(\lambda \) as its parameters. The ideas of bounded support mixture models and of the bounded generalized Gaussian mixture model were proposed in [13] and [14], respectively.
For the ICA mixture model, each D-dimensional data vector \({\varvec{X}}_{i}=(X_{i1},\ldots ,X_{iD})\) can be represented as \({\varvec{X}}_{i}=\text {A}_{j}{\text {s}}_{j,i}+{\text {b}}_{j}\), where \(\text {A}_{j}\) is an \(L\times D\) matrix of basis functions, \({\text {s}}_{j,i}\) is a D-dimensional source vector and \({\text {b}}_{j}\) is an L-dimensional bias vector for a particular mixture component j [10–12, 15]. For simplicity, the number of linear combinations (L) is taken to be equal to the number of sources (D) for each observation of the dataset. In an ICA mixture model, the basis functions \(\text {A}_{j}\) and the bias vector \({\text {b}}_{j}\) must be estimated along with the parameters of the mixture model. The mean, standard deviation and prior probability are estimated by maximizing the log-likelihood. The shape parameters, basis functions and bias vector are estimated using the standard ICA model and gradient ascent. The parameter estimation for the BGGMM with ICA is provided in [7], and the complete learning procedure is given in Algorithm 1.
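The cluster-to-class decoding described in this section can be sketched as follows. This is an illustration under stated assumptions, not the procedure of [7]: the function names `label_clusters` and `decode` are hypothetical, and the majority-vote rule for clusters that contain training observations from more than one class is an assumption, since the text does not specify how such clusters are resolved. Hard cluster memberships are assumed to have been obtained already from the posterior probabilities.

```python
import numpy as np

def label_clusters(train_clusters, train_labels, n_clusters, n_classes):
    """Assign each cluster the class that contributes most training observations to it.

    Majority voting is an assumption of this sketch. Clusters that receive
    no training observation are marked with -1.
    """
    counts = np.zeros((n_clusters, n_classes), dtype=int)
    # counts[c, k] = number of training observations of class k in cluster c
    np.add.at(counts, (np.asarray(train_clusters), np.asarray(train_labels)), 1)
    cluster_to_class = counts.argmax(axis=1)
    cluster_to_class[counts.sum(axis=1) == 0] = -1
    return cluster_to_class

def decode(test_clusters, cluster_to_class):
    """Map the cluster membership of each test observation to its class."""
    return cluster_to_class[np.asarray(test_clusters)]

# Toy data (hypothetical): 4 clusters, 2 classes (0 = male, 1 = female).
train_clusters = [0, 0, 1, 2, 2, 2]
train_labels   = [0, 0, 1, 1, 1, 0]
mapping = label_clusters(train_clusters, train_labels, n_clusters=4, n_classes=2)

test_clusters = [2, 0, 1]
predicted = decode(test_clusters, mapping)
```

With this design, training of the mixture model stays unsupervised; the labels enter only through the count matrix built after training.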

3 Experiments and Results

3.1 Design of Experiments

In this section, the experimental framework for male/female and 10-speaker classification based on supervised hierarchical clustering is presented; it uses the ICA mixture model for statistical learning as described in Sect. 2. In the pre-processing stage, voice activity detection (VAD) is employed to distinguish between the speech and non-speech parts of the speech sequences. Introducing the VAD in pre-processing ensures that the training of the ICA mixture model is not affected by the non-speech segments of the data set. The next stage is feature extraction, for which Mel Frequency Cepstral Coefficients (MFCCs) are selected as features. MFCCs have demonstrated their effectiveness in speech recognition and speaker classification, and we compute 13-dimensional features, as in the standard hidden Markov model toolkit (HTK). The ICA mixture model is trained on the training part of the speech databases, and the posterior probability is used to determine the membership of each observation in a particular cluster. The class labels of the training data are used to assign the clusters to particular classes. The posterior probability is then computed for the testing data, and the cluster-class information from training is used to find the class of each test observation. This classification framework is called supervised hierarchical clustering based on the ICA mixture model and is presented in detail in Sect. 2. The framework is also implemented using a Gaussian mixture model in order to compare and examine the validity of the statistical learning of the ICA mixture model in speaker classification.
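The framing step that precedes MFCC extraction can be sketched as follows, using the 25 ms frame length and 10 ms shift reported in Sect. 3.2. This is only an illustration: the function name `frame_signal`, the 16 kHz sampling rate and the choice to drop trailing samples that do not fill a whole frame are assumptions, and the VAD and MFCC computation themselves are not shown.

```python
import numpy as np

def frame_signal(signal, sample_rate=16000, frame_ms=25, shift_ms=10):
    """Split a 1-D signal into overlapping frames.

    sample_rate is an assumption of this sketch; frame_ms and shift_ms
    follow the values stated in the experimental setup.
    """
    signal = np.asarray(signal)
    frame_len = int(round(sample_rate * frame_ms / 1000))   # samples per frame
    shift = int(round(sample_rate * shift_ms / 1000))       # samples between frame starts
    if len(signal) < frame_len:
        return np.empty((0, frame_len))
    # Trailing samples that do not fill a complete frame are dropped (assumption).
    n_frames = 1 + (len(signal) - frame_len) // shift
    idx = np.arange(frame_len)[None, :] + shift * np.arange(n_frames)[:, None]
    return signal[idx]

# Placeholder one-second signal; a real utterance would be used after VAD.
signal = np.arange(16000, dtype=float)
frames = frame_signal(signal)
```

Each row of `frames` would then be converted into one 13-dimensional MFCC vector, so that an utterance becomes a sequence of observations for the mixture model.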

3.2 Experimental Framework and Results

Speaker classification based on supervised hierarchical clustering is evaluated on the TIMIT and TSP speech databases [16, 17]. The TIMIT speech corpus consists of 6300 speech utterances, of which 4620 are for training and 1680 are for testing. The TSP speech database consists of 1378 speech utterances spoken by 23 speakers (11 male, 12 female). For gender classification, 6 speakers from the TSP database are selected for testing and the rest of the data is used for training. For 10-speaker classification, 10 speakers (5 male, 5 female) with 60 speech utterances each are selected from the TSP database, with 40 utterances per speaker for training and 20 for testing. The TIMIT speech corpus is employed for gender classification only, whereas the TSP database is used for both classification scenarios. In the clustering framework for both scenarios, each speech utterance is segmented into frames of 25 ms with a window shift of 10 ms, and each frame is represented by 13 MFCCs. The VAD is applied before feature extraction so that only speech frames are present in the training and testing data. The k-means algorithm is employed to initialize the parameters of the ICA mixture model, with the shape parameter set to 2 for each component of the mixture model. For gender classification, the ICA mixture model is trained on the training sets of the two speech databases separately. Using the posterior probability, the speech utterances are divided into clusters according to their membership in particular components of the mixture model. The class label of each utterance is provided for the training data, which allows the clusters to be assigned to particular classes. Once the clusters are labeled with their classes, the cluster-class information is used to decode the testing data into male/female speakers. The classification framework is evaluated using the classification accuracy computed from the confusion matrices.
For the TIMIT speech corpus, the classification accuracy is computed for different numbers of mixture components between 2 and 100 and plotted in Fig. 3a. The classification accuracy curves for both classes show that the classification rate increases with the number of mixture components. However, beyond 30 mixture components the increase in classification accuracy is slow. The classification framework with the ICA mixture model is compared with the same framework with a GMM on the basis of classification rate. With 100 mixture components, the overall classification rate is 88.92 % for the ICA mixture model and 81.87 % for the GMM. It is also noted that for small numbers of mixture components the recognition of female speakers is very poor, and that it improves for larger numbers of mixture components. It is also observed that the multi-cluster model improves the model learning for both classes compared with the classic model. In the classic model, the female speakers are fitted poorly when their data is modeled as a single class. In comparison with the GMM, the ICA mixture model performs well, which validates its effectiveness for speaker classification and statistical learning. For the TSP speech database, the speech utterances from 17 speakers (8 male, 9 female) are used to train the ICA mixture model, whereas 6 speakers (half male, half female) are used for testing, with each speaker having 60 speech utterances. The classification accuracy for different numbers of components of the ICA mixture model and the GMM in the gender classification framework is computed and plotted in Fig. 3b. The highest overall classification accuracy is observed at 40 mixture components (86.94 %) for the ICA mixture model and at 50 mixture components (81.11 %) for the GMM.
For the 10-class speaker classification, the TSP speech database is employed for tuning the speaker classification framework. In this scenario, 10 speakers are chosen, with 40 speech utterances per speaker selected for training and 20 speech utterances per speaker for testing. The classification results are computed for different numbers of mixture components, and the resulting confusion matrices for the classic and multi-cluster models are shown in Table 1a, 1b and 1c. In order to compare the ICA mixture model with the GMM for 10-speaker classification, the same framework is implemented with a GMM and the overall classification rate is plotted for both models in Fig. 3c. The highest classification rate is observed at 60 mixture components for both models (78.50 % for the ICA mixture model and 69 % for the GMM), which demonstrates the effectiveness of the ICA mixture model in this setting.
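The classification accuracy used throughout this section can be derived from a confusion matrix as sketched below. The function name `accuracy_from_confusion`, the convention that rows index the true class and columns the predicted class, and the counts in the example are all assumptions made for illustration; the counts are made up and are not results from the experiments.

```python
import numpy as np

def accuracy_from_confusion(conf):
    """Overall and per-class accuracy from a confusion matrix.

    Assumes conf[t, p] counts observations of true class t predicted as class p.
    """
    conf = np.asarray(conf, dtype=float)
    overall = np.trace(conf) / conf.sum()           # correct decisions / all decisions
    per_class = np.diag(conf) / conf.sum(axis=1)    # correct decisions per true class
    return overall, per_class

# Made-up two-class counts (not taken from the paper).
conf = [[90, 10],
        [30, 70]]
overall, per_class = accuracy_from_confusion(conf)
```

Reporting the per-class values alongside the overall rate makes a gap between classes, such as the one discussed for female speakers, directly visible.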

Fig. 3. Classification accuracy for male/female and 10 speakers using ICA mixture and GMM (Colour figure online)

Table 1. 10 speakers classification confusion matrix using TSP database.

4 Conclusion

In this paper, a supervised hierarchical clustering framework is presented and applied to speaker classification. The first stage of the clustering is performed by the ICA mixture model; in the second stage, the clusters obtained from the posterior probability are further classified using the class labels of the training data. The cluster-class label information from the training process is then used to classify the testing data. The classification framework is validated on the TIMIT and TSP speech corpora. The framework also validates the statistical learning of the ICA mixture model proposed in [7]. In order to examine the performance of the ICA mixture model, the classification framework is also implemented with a GMM and the classification accuracy in the different settings is compared. The proposed framework with the ICA mixture model is employed for gender and 10-speaker classification. It is concluded that the supervised hierarchical clustering framework performs considerably well for speaker classification and that the ICA mixture model surpasses the GMM in classification rate and model learning. It is also concluded that the multi-cluster model fits the data of the female class better than the classic model, which alleviates the problem of female speaker recognition.