1 Introduction

Early detection of vocal fold pathology through non-invasive methods and speech-processing techniques has attracted considerable attention in recent years. The aim is to develop new techniques for processing patients' speech signals in order to reduce treatment costs and increase diagnostic accuracy.

Medical specialists currently rely on techniques based on direct examination of the vocal folds, such as laryngeal videoendoscopy [1,2], glottography [3], and stroboscopy [4]. These methods have two main drawbacks. First, they are invasive: they may cause patient discomfort and consequently distort the actual signal. Second, the instruments are expensive to purchase and maintain.

A promising way to overcome the disadvantages of these instruments is to employ acoustic analysis techniques. They let medical specialists examine the vocal folds quickly and with minimal discomfort, and they allow pathologies to be revealed at early stages.

In recent years, a number of methods based on acoustic analysis have been developed for vocal fold pathology classification [5-7]. These methods usually consist of two phases: feature extraction and classification. The feature extraction phase transforms the speech signal into a set of parameters or features; the classification phase applies one of a variety of machine learning methods to those features.

Traditionally, the feature extraction phase relies on parameters such as jitter [8,9], shimmer [10,11], signal-to-noise ratio [12,13], and formants [14,15]. Well-known classifiers used in previous works include the support vector machine (SVM) [16-19], the Gaussian mixture model (GMM) [20-22], the artificial neural network (ANN) [23-25], and the hidden Markov model (HMM) [26-28].

In [29], the authors investigated the role of two different datasets: the Massachusetts Eye and Ear Infirmary (MEEI) dataset and the Principe de Asturias (PdA) dataset, both of which contain recordings of the sustained vowel ‘a.’ Using the same method, they reported a classification accuracy of 96.37% on the MEEI dataset and 87.85% on the PdA dataset. However, when they trained a system on the MEEI dataset and used the PdA dataset as the test set, the accuracy dropped to 78.14%. In the inverse setting, training on the PdA dataset and testing on the MEEI dataset, they obtained an accuracy of 83.13%.

Therefore, [29] demonstrates that the classification accuracy of vocal fold pathology detection systems depends strongly on the dataset and its characteristics, such as its size. Consequently, the accuracies reported in previous works are not comparable, because they were obtained under different conditions, in particular on different datasets. Even with the same dataset, different train and test splits make the reported accuracies incomparable.

In fact, the acoustic characteristics of vowels differ across languages; studies such as [30,31] have investigated these differences. For this reason, each language needs its own tailored technique for vocal fold pathology detection, and existing methods for other languages such as English, Arabic, or Korean cannot be applied directly to Russian. Our main aim is therefore to develop a highly efficient method for vocal fold pathology detection for the Russian language.

Some of the previous works [9-12,17,20,21,23,24,27,29,32-53] on the vocal fold pathology classification problem have been analyzed with respect to the datasets and classifiers used [see Additional file 1]. Due to the lack of common experimental conditions, the accuracies reported in these works cannot be compared in order to decide which classifier or method is best.

With respect to these differences, a common infrastructure for our research on the Russian language had to be established. In addition, two strong candidates [12,32] from the previous works were implemented under our conditions (i.e., on the same datasets) in order to compare their performance with ours. For this purpose, a Russian-language dataset was created by the experts of the Belarusian Republican Center of Speech, Voice and Hearing Pathologies; its details are presented in Section 2. The well-known MEEI dataset, which is employed by these candidate methods, is used as well.

As can be seen in Additional file 1, another shortcoming of some previous vocal fold pathology detection systems is that their reported classification accuracies are often about 90% or less. Such accuracies are not sufficient for a problem that directly concerns human health. In this research, raising the classification accuracy to a satisfactory level is taken as the main goal, and it is pursued through feature dimensionality reduction methods.

The rest of the paper is organized as follows. Section 2 describes the datasets. Section 3 presents the initial feature vector based on the combination of the MFCC and the WPD. Section 4 investigates the optimization of the initial feature vector by means of feature reduction methods. Section 5 describes the ANN classifier. Section 6 summarizes the experimental results and analysis, and Section 7 concludes the paper.

2 Datasets

Our dataset (RusDS) was created by specialists from the Belarusian Republican Center of Speech, Voice and Hearing Pathologies. It includes 500 healthy samples (from 500 healthy persons) and 500 pathological samples (from 500 patients with vocal fold paralysis). Information about the subjects is given in Table 1. During recording, each speaker pronounces the vowel ‘a’ for 1 s and also reads a given special text. All samples are wave files in PCM format, mono, with a sampling rate of 44,100 Hz and a bit depth of 16 bits.

Table 1 The details of the RusDS dataset

The well-known MEEI dataset, created by the Massachusetts Eye and Ear Infirmary, is also used. It includes approximately 700 recordings of the sustained phonation of the vowel /ah/ (1 to 3 s long). The speech samples were collected in a controlled environment and sampled at 25 or 50 kHz with 16 bits of resolution. The subset taken from this dataset contains 53 normal and 173 pathological samples.

3 Initial feature vector

As shown in Figure 1, 13 Mel frequency cepstral coefficients (MFCCs) are first extracted from the cepstral representation of the input signal. Then, a five-level wavelet packet decomposition (WPD) is applied to the input signal to build the wavelet packet tree; the structure of the resulting tree, with 63 nodes, is illustrated in Figure 2. From the nodes of this tree, 63 energy features and 63 Shannon entropy features are extracted. Finally, combining these features yields the initial feature vector of length 139.

Figure 1 The construction of the initial feature vector.

Figure 2 The structure of the obtained wavelet packet tree in five levels of decomposition.

The MATLAB code to calculate the MFCC features is adapted from the Auditory Toolbox (Malcolm Slaney). The speech signal is windowed with a Hamming window in the time domain and converted into the frequency domain by the fast Fourier transform (FFT), which gives the FFT magnitude. The FFT data are then converted into filter bank outputs, and the cosine transform of these outputs is taken to reduce dimensionality. The filter bank consists of 13 linearly spaced filters (133.33 Hz between center frequencies) followed by 27 log-spaced filters (separated by a factor of 1.0711703 in frequency); each filter is formed as a combination of the amplitudes of the corresponding FFT bins. For the decomposition of the signals, the tenth-order Daubechies wavelet (‘db10’) is used. A sketch of the whole extraction pipeline is given below.
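The following is a minimal sketch of the extraction of the 139-feature initial vector, assuming MATLAB with the Wavelet Toolbox and Slaney's Auditory Toolbox on the path. Averaging the frame-level MFCCs over time to obtain 13 per-recording coefficients is our assumption, as the paper does not state how the frame-level values are aggregated.

```matlab
function fv = initialFeatureVector(x, fs)
% x: speech signal (column vector), fs: sampling rate in Hz.

% 13 MFCCs per frame from the Auditory Toolbox; average over frames
% to get one 13-element vector per recording (our assumption).
ceps = mfcc(x, fs);                  % 13 x nFrames
mfccFeat = mean(ceps, 2)';           % 1 x 13

% Five-level wavelet packet decomposition with 'db10': a full tree
% with 1 + 2 + 4 + 8 + 16 + 32 = 63 nodes (root included).
T = wpdec(x, 5, 'db10');
nNodes = 63;
energyFeat  = zeros(1, nNodes);
entropyFeat = zeros(1, nNodes);
for n = 0:nNodes-1
    c = wpcoef(T, n);                           % coefficients of node n
    energyFeat(n+1)  = sum(c.^2);               % node energy
    entropyFeat(n+1) = wentropy(c, 'shannon');  % node Shannon entropy
end

fv = [mfccFeat, energyFeat, entropyFeat];       % 13 + 63 + 63 = 139 features
end
```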

4 Optimized feature vector

The discrimination power of the initial features can be evaluated with the t test, which checks whether the means of two groups are statistically different from each other by computing the ratio between the difference of the group means and the variability within the groups. Accordingly, the t test is applied to each feature in our dataset, and the P value of each feature is used as a measure of how effective it is at separating the two groups. The result is shown in Figure 3; a sketch of the computation is given after the figure.

Figure 3 The P value for the features.
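A minimal sketch of the per-feature t test, assuming the Statistics and Machine Learning Toolbox; the feature matrix X (nSamples x 139) and label vector y (0 = healthy, 1 = pathological) are our own illustrative names.

```matlab
nFeatures = size(X, 2);
p = zeros(1, nFeatures);
for j = 1:nFeatures
    % Two-sample t test between the healthy and pathological groups.
    [~, p(j)] = ttest2(X(y == 0, j), X(y == 1, j));
end
plot(p);                  % cf. Figure 3
nStrong = sum(p < 0.05);  % features with strong discrimination power
```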

About 40% of the features have P values close to 0, and about 60% have P values smaller than 0.05; that is, about 83 of the original 139 features have strong discrimination power, while the remaining 40% (56 features) are weak for classification purposes. Therefore, a feature reduction phase is necessary and important for our task. However, it is very difficult to know how many features are required unless some domain knowledge is available or the maximum number of features has been dictated in advance by outside constraints.

Using every feature in the classification process is therefore not a good idea, as it may increase the misclassification error rate. It is better to select the proper features from the full set, a process called ‘feature reduction’ or ‘feature selection.’ In other words, the goal is to reduce the dimension of the data by finding a small set of important features that still gives good classification performance.

Feature reduction algorithms can be categorized into two families: filter methods and wrapper methods. Filter methods evaluate and select feature subsets based on general characteristics of the data, without involving the chosen learning algorithm or classifier. Wrapper methods, in contrast, use performance feedback from the chosen learning algorithm or classifier to evaluate each candidate feature subset. Wrapper methods search for a subset of features that better fits the chosen classifier, but they can be significantly slower than filter methods if the learning algorithm takes a long time to train.

In this article, a well-known filter method, principal component analysis (PCA), is employed for the feature reduction phase; it has been used frequently in previous works. In addition, a novel genetic algorithm (GA)-based method, which belongs to the wrapper family, is proposed for the feature reduction phase.

The main limitation of the PCA is that it favors the features whose sample values have larger variance than the others, and it does not collaborate with the classifier. To overcome this disadvantage, a GA-based method built on the genetic algorithm is proposed; it includes the misclassification error rate of the classifier in its fitness function and tries to minimize it. For this purpose, each chromosome is defined as a vector of integers (from 1 to 139) whose length equals the desired length of the reduced feature vector; the value of each gene is the index of the feature (in the initial feature vector) that should take part in the reduced feature vector. A fitness function $f$ is defined as the misclassification error rate of the ANN classifier on the train set:

$$ f = \frac{\sum_{i=1}^{n} \left| a_i - r_i \right|}{n} $$

Here, $a_i$ is the classifier's output and $r_i$ the real class label for the ith sample, and $n$ is the number of samples in the train set. The aim of the proposed GA-based method is to find the subset of the initial features that minimizes $f$, as sketched below.
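A minimal sketch of this fitness function, assuming a feature matrix X (nSamples x 139), a label vector y with entries 0/1, and a hypothetical helper trainANN that trains the ANN and returns its predicted labels on the train set.

```matlab
function f = gaFitness(chrom, X, y)
% chrom: integer vector of feature indices (each gene in 1..139).
Xsub = X(:, chrom);               % keep only the selected features
a = trainANN(Xsub, y);            % predicted labels a_i (hypothetical helper)
f = sum(abs(a - y)) / numel(y);   % misclassification error rate
end
```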

Some of the parameters used in MATLAB for developing the proposed optimized feature vector are shown in Table 2; a sketch of the corresponding GA call follows the table. The uniform mutation function is a two-step algorithm: first, it selects a fraction of the vector entries of an individual for mutation; then, it replaces each selected entry by a random number drawn uniformly from the range for that entry. The scattered crossover function creates a random binary vector, takes the genes where the vector is 1 from the first parent and the genes where it is 0 from the second parent, and combines them to form the child.

Table 2 The implementation's parameters
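A sketch of the GA call with the mutation and crossover operators named above, assuming the Global Optimization Toolbox; the reduced length k and the rounding of genes to valid indices inside the fitness are our own choices for keeping the sketch simple.

```matlab
k = 30;                                   % desired reduced feature length
lb = ones(1, k);                          % each gene ranges over 1..139
ub = 139 * ones(1, k);
opts = optimoptions('ga', ...
    'MutationFcn',  @mutationuniform, ... % uniform mutation (Table 2)
    'CrossoverFcn', @crossoverscattered); % scattered crossover (Table 2)
% Round genes to integers inside the fitness so they index features.
fitness = @(chrom) gaFitness(round(chrom), X, y);
best = ga(fitness, k, [], [], [], [], lb, ub, [], opts);  % best subset found
```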

5 Classifier

Artificial neural networks generally consist of several layers of interconnected nodes, each node computing a non-linear function of its inputs. The inputs to a node may come directly from the input data or from other nodes, and some nodes serve as the outputs of the network. In this article, the ANN is used for supervised learning: the network is trained by presenting a target output for each input pattern. The simplest choice is a feed-forward neural network, whose input nodes receive the inputs and whose outputs are taken from its output nodes. The middle layer of nodes, visible to neither the inputs nor the outputs, is termed the hidden layer; unlike the input and output layers, its size is not fixed. A minimal sketch of such a network is given below.
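A minimal sketch of the feed-forward network used as the classifier, assuming MATLAB's Deep Learning (formerly Neural Network) Toolbox; X (139 x nSamples), y (1 x nSamples with entries 0/1), and the one-hot target encoding are our own illustrative choices.

```matlab
T = [y == 0; y == 1];             % one-hot targets: healthy vs. pathological
net = patternnet(5, 'trainscg');  % one hidden layer with, e.g., 5 neurons
net = train(net, X, T);           % supervised training on (input, target)
out = net(Xtest);                 % network outputs for unseen samples
[~, yhat] = max(out, [], 1);      % predicted class index per sample
```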

6 Experiments and results

In this section, five experiments are designed; all of them are simulated in MATLAB. The experiments evaluate the performance of three well-known training algorithms for the ANN: resilient backpropagation (‘trainrp’ in MATLAB), scaled conjugate gradient backpropagation (‘trainscg’ in MATLAB), and gradient descent with momentum and adaptive learning rate backpropagation (‘traingdx’ in MATLAB). Different numbers of neurons in the hidden layer of the ANN are also investigated. The tenfold cross-validation scheme is adopted to assess the generalization capability of the system; a sketch of the cross-validation loop is given below. In the first experiment, the resubstitution error is calculated as well.
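A sketch of the tenfold cross-validation loop, assuming the Statistics and Machine Learning Toolbox for cvpartition; X (nSamples x 139), y (1 x nSamples), nHidden, and trainFcn are our own illustrative names.

```matlab
cv = cvpartition(size(X, 1), 'KFold', 10);
T = [y == 0; y == 1];                        % one-hot targets
mce = zeros(1, cv.NumTestSets);
for i = 1:cv.NumTestSets
    tr = cv.training(i);  te = cv.test(i);   % logical fold indices
    net = patternnet(nHidden, trainFcn);     % e.g., 5 and 'trainscg'
    net.divideFcn = 'dividetrain';           % use the whole fold for training
    net = train(net, X(tr, :)', T(:, tr));
    [~, yhat]  = max(net(X(te, :)'), [], 1); % predicted classes
    [~, ytrue] = max(T(:, te), [], 1);       % real classes
    mce(i) = mean(yhat ~= ytrue);            % fold misclassification rate
end
accuracy = 100 * (1 - mean(mce));            % tenfold CV accuracy (%)
```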

In the first experiment, the classification is based on the initial feature vector containing all 139 features. As shown in Figure 4, for each sample in the train set, the 139 features (13 MFCC, 63 energy, and 63 entropy features) are extracted according to the initial feature vector and fed to the ANN for training. Then, for each sample in the test set, the same 139 features are extracted and fed to the trained ANN for classification. Finally, the real class labels and the predicted class labels of the test set samples are compared to calculate the misclassification error rate.

Figure 4 The scheme of the classification method based on the initial feature vector.

The result of this experiment for the resilient backpropagation training algorithm, in terms of classification accuracy, is illustrated in Figure 5. The tenfold cross-validation MCE is a reliable error estimate because the test samples take no part in the training phase. From the tenfold cross-validation point of view, the ANN with the resilient backpropagation training algorithm achieves its best performance (85.4% accuracy) with six neurons in its hidden layer.

Figure 5 The obtained classification accuracies by the use of the ANN with the ‘trainrp’ training algorithm.

The result for the scaled conjugate gradient backpropagation training algorithm is illustrated in Figure 6. From the tenfold cross-validation point of view, the ANN with this training algorithm achieves its best performance (88.5% accuracy) with five neurons in its hidden layer.

Figure 6 The obtained classification accuracies by the use of the ANN with the ‘trainscg’ training algorithm.

The result for the gradient descent with momentum and adaptive learning rate backpropagation training algorithm is illustrated in Figure 7. From the tenfold cross-validation point of view, the ANN with this training algorithm achieves its best performance (88.5% accuracy) with ten neurons in its hidden layer.

Figure 7 The obtained classification accuracies by the use of the ANN with the ‘traingdx’ training algorithm.

Overall, the results show a better performance of ‘traingdx’ than the other algorithms in terms of classification accuracy: it achieves 100% accuracy in the resubstitution case and 88.5% accuracy in the tenfold cross-validation case.

In the second experiment, the classification is based on the optimized feature vector obtained with the PCA-based method. As shown in Figure 8, for each sample in the train set, the 139 features (13 MFCC, 63 energy, and 63 entropy features) are first extracted according to the initial feature vector. The PCA-based method then reduces the length of the initial feature vector to construct the final feature vector, and the selected feature values of the train samples are fed to the ANN for training. Next, for each sample in the test set, the 139 features are extracted, and the values selected by the final feature vector are fed to the trained ANN for classification. Finally, the real and predicted class labels of the test set samples are compared to calculate the misclassification error rate. Five neurons are used in the hidden layer of the ANN, and feature vectors with lengths from 1 to 83 are evaluated. The features selected for the final optimized feature vector and the obtained classification accuracies are shown in Table 3; a sketch of the PCA reduction step follows the table.

Figure 8 The scheme of the classification method based on the PCA-based method.

Table 3 The selected features and the obtained classification accuracies by the use of the PCA-based method
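A minimal sketch of the PCA reduction step, assuming the Statistics and Machine Learning Toolbox; fitting PCA on the train set only and projecting the test set with the same axes is our reading of the method, chosen so that no test information leaks into the reduction.

```matlab
[coeff, score] = pca(Xtrain);              % principal axes from the train set
Xtrain_red = score(:, 1:k);                % first k components of the train set
mu = mean(Xtrain, 1);                      % train-set mean used for centering
Xtest_red = (Xtest - mu) * coeff(:, 1:k);  % project the test set the same way
```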

In the third experiment, the classification is based on the optimized feature vector obtained with the proposed GA-based method. As shown in Figure 9, the pipeline mirrors that of the second experiment: the 139 initial features (13 MFCC, 63 energy, and 63 entropy features) are extracted for each train sample, the GA-based method reduces the length of the initial feature vector to construct the final feature vector, and the selected feature values of the train samples are fed to the ANN for training. For each sample in the test set, the 139 features are extracted, and the values selected by the final feature vector are fed to the trained ANN for classification. The real and predicted class labels of the test set samples are then compared to calculate the misclassification error rate. Five neurons are used in the hidden layer of the ANN, and feature vectors with lengths from 1 to 83 are evaluated. The features selected for the final optimized feature vector and the obtained classification accuracies are shown in Table 4.

Figure 9 The scheme of the classification method based on the proposed GA-based method.

Table 4 The selected features and the obtained classification accuracies by the use of the GA-based method

The comparative results are shown in Table 5. As Table 5 makes clear, both the PCA-based and the GA-based methods increase the classification accuracy; in terms of classification accuracy, however, the proposed GA-based method outperforms the PCA-based method. From the training algorithm point of view, ‘trainscg’ shows better results than the others.

Table 5 The obtained classification accuracies (%)

Overall, the experiments show the best performance for the proposed method, which combines the ANN with the ‘trainscg’ algorithm as the classifier and the GA-based method as the feature reduction approach. It provides the best classification accuracy (95.3%) and reduces the length of the feature vector from 139 to 30 features. Consequently, the response times of the vocal fold pathology classification system with the initial feature vector (length 139) and with the reduced feature vector (length 30) should differ.

The fourth experiment compares the response time of the vocal fold pathology classification system with the initial and the reduced feature vectors. It was carried out on a personal computer equipped with an Intel dual-core 2.13-GHz processor and 2 GB of memory. The measured response time is 8.7 ms with the initial feature vector (139 features) and 4.6 ms with the reduced feature vector (30 features). Therefore, the reduced feature vector decreases the response time of the vocal fold pathology classification program compared with the non-reduced feature vector. A sketch of such a timing measurement is given below.
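A sketch of how such response times can be measured in MATLAB; the classifySample helper (feature extraction followed by the trained ANN) and the variable names are our own illustrative choices.

```matlab
% timeit runs the function handle repeatedly and returns a robust estimate.
tFull    = timeit(@() classifySample(x, fs, netFull,    1:139));       % 139 features
tReduced = timeit(@() classifySample(x, fs, netReduced, selectedIdx)); % 30 features
fprintf('full: %.1f ms, reduced: %.1f ms\n', 1e3 * tFull, 1e3 * tReduced);
```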

The fifth experiment compares the performance of the proposed method with the recent works [12,32]; here, the effects of the two datasets (the MEEI and the RusDS) are also investigated. The results, shown in Table 6, demonstrate the higher classification accuracy of the proposed method in comparison with the recent works.

Table 6 The comparison of the proposed method with recent works

7 Conclusions

In this article, an initial feature vector based on the combination of the wavelet packet decomposition and the Mel frequency cepstral coefficients (MFCCs) is proposed. The performance of the ANN with three training algorithms (‘trainrp’, ‘trainscg’, and ‘traingdx’) in the task of vocal fold pathology diagnosis is investigated, together with three kinds of feature vector: the initial feature vector, the feature vector optimized by means of the PCA-based method, and the feature vector optimized by means of the proposed GA-based method. The experimental results show the superiority of the feature vector optimized by the proposed GA-based method. This better performance comes from taking the ANN classifier into account during the feature reduction phase: the proposed GA-based method optimizes the initial feature vector with the aim of decreasing the misclassification error rate of the ANN, whereas the PCA-based method focuses only on the data, without any attention to the misclassification error rate of the ANN classifier.

The final proposed method is therefore the hybrid of the ANN with the ‘trainscg’ algorithm as the classifier and the GA-based method as the feature reduction approach. It achieves the highest accuracy (95.3%) and the lowest response time among the evaluated configurations. Its performance is also compared with the recent works [12,32], with the effects of the two datasets (the MEEI and the RusDS) investigated, and the proposed method shows the higher classification accuracy.