Robust phoneme classification for automatic speech recognition using hybrid features and an amalgamated learning model


Abstract

Phoneme recognition is an important aspect of speech processing and recognition. Research on phoneme recognition has a long history, and numerous algorithms have been developed over the years to improve its accuracy. In this paper, we present a quantitative analysis of phoneme recognition using supervised learning. Most approaches to phoneme recognition rely on mel-frequency cepstral features to identify the phoneme class. In our approach, we augment the mel-frequency cepstral coefficients with the vocal tract area function and analyze the change in accuracy that its introduction into the feature set produces. Support Vector Machines are an attractive approach to pattern recognition, and their use as a supervised learning model is popular in the speech processing community. On our feature set, we compare Support Vector Machines with other supervised learning models: the Naïve Bayes, k-Nearest Neighbors, and linear discriminant analysis classifiers. We then combine the three best classifiers under a soft voting rule to produce our variant of a voting classifier, and we further improve its accuracy with a priority-based approach that estimates the three next most likely phonemes after the predicted one. Through graphical and quantitative comparisons, we show that our modified algorithm outperforms traditional methods. Experiments were conducted on WSJCAM0, a British English corpus.
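
As a concrete illustration of the hybrid feature set, the sketch below pairs standard MFCCs with a lossless-tube vocal tract area function (VTAF) estimated from LPC reflection coefficients, one classical inverse-filtering formulation. The paper's exact estimation procedure is not reproduced here; the analysis parameters, the area-ratio sign convention, and the librosa-based helpers are assumptions of this sketch, not the authors' implementation.

```python
# Minimal sketch: MFCC + vocal-tract-area-function hybrid features.
# Assumes librosa; parameters and the VTAF convention are illustrative.
import numpy as np
import librosa

def reflection_coeffs(lpc):
    """Step-down (backward Levinson) recursion: from LPC polynomial
    coefficients [1, a1, ..., ap] to reflection coefficients k1..kp."""
    a = np.asarray(lpc[1:], dtype=float)
    p = len(a)
    k = np.zeros(p)
    for i in range(p - 1, -1, -1):
        k[i] = a[i]
        if i > 0:
            a = (a[:i] - k[i] * a[:i][::-1]) / (1.0 - k[i] ** 2)
    return k

def vtaf_from_frame(frame, order=12):
    """Lossless-tube area function from reflection coefficients, using
    the convention A[m+1] = A[m] * (1 - k[m]) / (1 + k[m]), normalized
    at the lip end. Conventions vary with the LPC sign definition."""
    k = reflection_coeffs(librosa.lpc(frame, order=order))
    areas = np.ones(order + 1)
    for m in range(order):
        areas[m + 1] = areas[m] * (1.0 - k[m]) / (1.0 + k[m])
    return areas

def hybrid_features(y, sr, n_mfcc=13, frame_len=400, hop=160):
    """Per-frame MFCCs concatenated with the estimated VTAF samples.
    center=False keeps MFCC frames aligned with librosa.util.frame."""
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=n_mfcc,
                                n_fft=frame_len, hop_length=hop,
                                center=False)
    frames = librosa.util.frame(y, frame_length=frame_len, hop_length=hop)
    vtaf = np.stack([vtaf_from_frame(frames[:, i])
                     for i in range(frames.shape[1])], axis=1)
    n = min(mfcc.shape[1], vtaf.shape[1])
    return np.vstack([mfcc[:, :n], vtaf[:, :n]]).T  # (frames, features)
```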

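The amalgamated learning model and the priority step can likewise be sketched with scikit-learn's soft-voting ensemble: the base classifiers average their class probabilities, and ranking those probabilities yields the predicted phoneme followed by the three next most likely candidates. Which three classifiers performed best on this feature set, and all hyperparameters shown, are assumptions of this sketch rather than the paper's reported configuration.

```python
# Minimal sketch: soft-voting ensemble plus priority-based top-4 output.
import numpy as np
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.ensemble import VotingClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Soft voting needs per-class probabilities, hence probability=True for
# the SVM. SVM, k-NN, and LDA as the "three best" is an assumption here.
voter = VotingClassifier(
    estimators=[
        ("svm", make_pipeline(StandardScaler(), SVC(probability=True))),
        ("knn", KNeighborsClassifier(n_neighbors=5)),
        ("lda", LinearDiscriminantAnalysis()),
    ],
    voting="soft",
)

def top_phonemes(model, X, n=4):
    """Per frame: the predicted phoneme followed by the next three most
    likely candidates, ordered by averaged ensemble probability."""
    proba = model.predict_proba(X)
    order = np.argsort(proba, axis=1)[:, ::-1][:, :n]
    return model.classes_[order]

# Usage on a labelled feature matrix (X: frames x features, y: phoneme
# labels); X_train, y_train, X_test are placeholders for WSJCAM0 data:
#   voter.fit(X_train, y_train)
#   candidates = top_phonemes(voter, X_test)  # shape (frames, 4)
```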

Author information

Correspondence to Mohammed Kamal Khwaja.


About this article

Cite this article

Khwaja, M.K., Vikash, P., Arulmozhivarman, P. et al. Robust phoneme classification for automatic speech recognition using hybrid features and an amalgamated learning model. Int J Speech Technol 19, 895–905 (2016). https://doi.org/10.1007/s10772-016-9377-x
