Combined classification method for prosodic stress recognition in Farsi language

Article

Abstract

Stress in speech conveys additional information to the listener but makes automatic speech recognition harder. The first step toward recognizing stressed speech is detecting the boundaries of stressed segments. In this research, the boundaries of prosodic stress were extracted from stressed Farsi sentences. Acoustic and prosodic features were used to train hidden Markov models for stress boundary recognition, and the fast correlation-based filter (FCBF) method was used to select the most effective features. The influence of different feature sets on stress boundary recognition performance was evaluated, and a combined classifier scheme was proposed on the basis of this evaluation. Experimental results showed that the proposed combined model improved stress boundary detection performance by 12% compared to the baseline model, giving a final recognition rate of 85% for prosodic stress boundary recognition.
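As a rough illustration of the feature selection step named in the abstract, the sketch below follows the general FCBF idea: rank features by their symmetrical uncertainty with the class label, then drop any feature that is at least as correlated with an already kept feature as it is with the label. It is not the authors' implementation. It assumes the features have already been discretized (the abstract does not say how this was done), and the function names, the `threshold` parameter, and the feature names in the usage lines are hypothetical.

```python
import math
from collections import Counter


def entropy(values):
    """Shannon entropy (in bits) of a sequence of discrete values."""
    n = len(values)
    return -sum((c / n) * math.log2(c / n) for c in Counter(values).values())


def symmetrical_uncertainty(xs, ys):
    """SU(X, Y) = 2 * I(X; Y) / (H(X) + H(Y)), a value in [0, 1]."""
    hx, hy = entropy(xs), entropy(ys)
    if hx + hy == 0:
        return 0.0  # both variables are constant
    joint = entropy(list(zip(xs, ys)))
    mutual_info = hx + hy - joint
    return 2.0 * mutual_info / (hx + hy)


def fcbf(features, labels, threshold=0.0):
    """Return the names of the selected features.

    features: dict mapping a feature name to a list of discrete values
              (assumption: continuous features were binned beforehand).
    labels:   list of class labels, e.g. stressed / unstressed.
    """
    relevance = {
        name: symmetrical_uncertainty(column, labels)
        for name, column in features.items()
    }
    # Keep only relevant features, most relevant first.
    ranked = sorted(
        (name for name in relevance if relevance[name] > threshold),
        key=lambda name: relevance[name],
        reverse=True,
    )
    selected = []
    for name in ranked:
        # A feature is redundant if some kept feature predicts it at least
        # as well as the feature itself predicts the class label.
        redundant = any(
            symmetrical_uncertainty(features[kept], features[name])
            >= relevance[name]
            for kept in selected
        )
        if not redundant:
            selected.append(name)
    return selected


# Toy usage with hypothetical, pre-binned features.
labels = [0, 0, 1, 1]
features = {
    "pitch_bin": [0, 0, 1, 1],       # perfectly predicts the label
    "pitch_bin_copy": [0, 0, 1, 1],  # duplicate of pitch_bin, so redundant
    "constant": [0, 0, 0, 0],        # carries no information
}
selected = fcbf(features, labels)
```

In this toy case only `pitch_bin` survives: the duplicate is discarded as redundant and the constant feature as irrelevant.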

Keywords

Prosodic stress; Stress boundaries detection; Stress recognition; Hidden Markov model; MFCC; Formant; Pitch

Copyright information

© Springer Science+Business Media, LLC, part of Springer Nature 2018

Authors and Affiliations

  1. Department of Electrical Engineering, Shahid Beheshti University, Tehran, Iran
  2. Department of Electrical Engineering, Islamic Azad University, South Tehran Branch, Tehran, Iran
