Abstract
Detecting word boundaries in continuous speech is difficult because there is usually no definite pause or silence at a word boundary, which makes continuous speech recognition a very challenging task. However, prosodic word boundaries, unlike written word boundaries, can be predicted from the prosodic parameters of continuous speech. This paper proposes a method for detecting such prosodic word boundaries in Bengali continuous speech. Bengali is a bound-stress language, in which stress falls on the first syllable of a prosodic word. Empirical Mode Decomposition (EMD) is applied to the logarithm of the fundamental frequency (F0) contour of continuous speech to detect prosodic word boundaries. A total of 200 Bengali read-out sentences, read by ten speakers, are analyzed in the present work. The method achieves an overall prosodic boundary detection accuracy of 88.05%, with a precision of 90.73%, a recall of 88.31%, and an F-score of 89.5%. Using the proposed method, a prosodic word dictionary of 5031 prosodic words has been developed from an analysis of 1526 Bengali sentences.
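The abstract names two technical ingredients: EMD of the log-F0 contour, and boundary scoring with precision, recall and F-score. The sketch below illustrates both under stated assumptions and is not the authors' implementation. It uses piecewise-linear envelopes and a fixed number of sifting passes instead of the cubic-spline envelopes and stopping criterion of standard EMD. The rule "local minima of the first IMF are boundary candidates", the matching tolerance `tol` (in frames), the greedy matching, and the synthetic contour are all hypothetical choices made for illustration. The last line checks that the reported F-score follows from the reported precision and recall.

```python
import math


def _local_extrema(x):
    """Interior indices of local maxima and local minima of the sequence x."""
    maxima = [i for i in range(1, len(x) - 1) if x[i - 1] < x[i] >= x[i + 1]]
    minima = [i for i in range(1, len(x) - 1) if x[i - 1] > x[i] <= x[i + 1]]
    return maxima, minima


def _linear_envelope(idx, x, n):
    """Piecewise-linear envelope through x[idx], held constant beyond the outer knots.

    Simplification (assumption): standard EMD uses cubic-spline envelopes instead.
    """
    knots = [0] + idx + [n - 1]
    vals = [x[idx[0]]] + [x[i] for i in idx] + [x[idx[-1]]]
    env, k = [], 0
    for t in range(n):
        while knots[k + 1] < t:
            k += 1
        t0, t1 = knots[k], knots[k + 1]
        w = (t - t0) / (t1 - t0)
        env.append(vals[k] + w * (vals[k + 1] - vals[k]))
    return env


def emd(signal, max_imfs=6, sift_iters=10):
    """Simplified EMD returning (imfs, residue) with sum(imfs) + residue == signal.

    Assumption: a fixed number of sifting passes replaces a formal stopping criterion.
    """
    n = len(signal)
    residue = list(signal)
    imfs = []
    for _ in range(max_imfs):
        maxima, minima = _local_extrema(residue)
        if len(maxima) < 2 or len(minima) < 2:
            break  # too few extrema left: treat the remainder as the trend
        h = residue[:]
        for _ in range(sift_iters):
            mx, mn = _local_extrema(h)
            if len(mx) < 2 or len(mn) < 2:
                break
            upper = _linear_envelope(mx, h, n)
            lower = _linear_envelope(mn, h, n)
            # One sifting pass: subtract the mean of the upper and lower envelopes.
            h = [v - (u + l) / 2 for v, u, l in zip(h, upper, lower)]
        imfs.append(h)
        residue = [r - c for r, c in zip(residue, h)]
    return imfs, residue


def f_score(precision, recall):
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall) if precision + recall else 0.0


def boundary_scores(detected, reference, tol=2):
    """Precision, recall and F-score. A detection is a hit if it lies within
    tol frames of a not-yet-matched reference boundary (greedy matching)."""
    unused = sorted(reference)
    hits = 0
    for d in sorted(detected):
        match = next((r for r in unused if abs(r - d) <= tol), None)
        if match is not None:
            unused.remove(match)
            hits += 1
    precision = hits / len(detected) if detected else 0.0
    recall = hits / len(reference) if reference else 0.0
    return precision, recall, f_score(precision, recall)


if __name__ == "__main__":
    # Hypothetical stand-in for a log-F0 contour (100 frames): slow declination
    # plus a rise-fall every 25 frames mimicking word-initial stress peaks.
    contour = [math.log(200 - 0.5 * t) + 0.05 * math.sin(2 * math.pi * t / 25)
               for t in range(100)]
    imfs, residue = emd(contour)
    # Hypothetical rule: local minima of the first IMF are boundary candidates.
    _, candidates = _local_extrema(imfs[0])
    print("candidate boundaries:", candidates)
    # Consistency check on the reported figures: P = 90.73%, R = 88.31%.
    print(round(100 * f_score(0.9073, 0.8831), 2))  # 89.5
```

The code is kept dependency-free so it runs anywhere; for closer fidelity to standard EMD, the linear envelopes could be replaced by cubic splines (for example `scipy.interpolate.CubicSpline`), and which IMF (or sum of IMFs) carries word-level stress would have to be chosen empirically.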
Cite this article
Bhowmik, T., Das Mandal, S.K. Prosodic word boundary detection from Bengali continuous speech. Lang Resources & Evaluation 54, 747–765 (2020). https://doi.org/10.1007/s10579-019-09478-0