Manner of articulation based Bengali phoneme classification

Published in the International Journal of Speech Technology.

Abstract

A phoneme classification model for Bengali continuous speech was developed in this study using a deep neural network (DNN) based classifier. In the first phase, the phoneme classification task was performed with the deep-structured model alongside two baseline systems, one based on a hidden Markov model and the other on a multilayer perceptron; the deep-structured model achieved better overall classification accuracy than both baselines. The confusion matrix of all Bengali phonemes generated by the classifier was then examined, and the phonemes were divided into nine groups, which yielded an overall classification accuracy of 98.7%. In the next phase of the study, phonological features based on place and manner of articulation were detected and classified. The phonemes were regrouped into 15 groups using manner-of-articulation knowledge, and the deep-structured model was retrained, this time achieving 98.9% overall classification accuracy. This is almost equal to the accuracy observed for the nine phoneme groups, but because the nine groups were subdivided into 15, the phoneme confusion within each group decreased, yielding a better phoneme classification model.
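The abstract does not specify how phoneme groups are derived from the confusion matrix, so the following is only an illustrative sketch: a hypothetical greedy procedure that merges phonemes whose mutual confusion rate exceeds a threshold (via union-find), and a helper that scores the group-level accuracy in which within-group confusions count as correct. The phoneme labels, threshold value, and toy counts below are all assumptions for illustration, not the paper's actual data or method.

```python
import numpy as np

def group_phonemes(conf, labels, threshold=0.1):
    """Greedily merge phonemes whose mutual confusion rate exceeds
    `threshold`, using union-find. `conf[i, j]` counts frames of
    phoneme i classified as phoneme j."""
    n = len(labels)
    rates = conf / conf.sum(axis=1, keepdims=True)  # row-normalised rates
    parent = list(range(n))

    def find(i):
        # Path-halving find for the union-find forest.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for i in range(n):
        for j in range(i + 1, n):
            # Symmetric confusion: i mistaken for j plus j mistaken for i.
            if rates[i, j] + rates[j, i] > threshold:
                parent[find(i)] = find(j)

    groups = {}
    for i in range(n):
        groups.setdefault(find(i), []).append(labels[i])
    return list(groups.values())

def group_accuracy(conf, groups, labels):
    """Overall accuracy when any within-group confusion counts as correct."""
    idx = {lab: k for k, g in enumerate(groups) for lab in g}
    n = len(labels)
    correct = sum(conf[i, j] for i in range(n) for j in range(n)
                  if idx[labels[i]] == idx[labels[j]])
    return correct / conf.sum()

# Toy 4-phoneme confusion matrix: /p/ and /b/ are frequently confused.
labels = ["p", "b", "s", "a"]
conf = np.array([[80, 15,  3,  2],
                 [12, 82,  4,  2],
                 [ 2,  3, 90,  5],
                 [ 1,  1,  4, 94]])
groups = group_phonemes(conf, labels, threshold=0.2)
acc = group_accuracy(conf, groups, labels)
```

With these toy counts, /p/ and /b/ collapse into one group while /s/ and /a/ stay separate, and the group-level accuracy rises above the raw per-phoneme accuracy, mirroring how the paper's nine- and fifteen-group schemes trade fine-grained labels for fewer within-group confusions.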



Author information

Correspondence to Tanmay Bhowmik.

About this article

Cite this article

Bhowmik, T., Mandal, S.K.D. Manner of articulation based Bengali phoneme classification. Int J Speech Technol 21, 233–250 (2018). https://doi.org/10.1007/s10772-018-9498-5

