Phonetic segmentation using multiple speech features

Abstract

In this paper we propose a method for improving the segmentation of speech waveforms into phonetic units. The method is based on the well-known Viterbi time-alignment algorithm and utilizes the phonetic boundary predictions obtained from multiple speech parameterization techniques. Specifically, the phone transition position prediction that is most appropriate for the given boundary type serves as the initial point from which Viterbi time-alignment starts when predicting the successor phonetic boundary. The proposed method was evaluated on the TIMIT database using several Fourier-based and wavelet-based speech parameterization algorithms that are well known in the area of speech processing. For a tolerance of 20 milliseconds, the experimental results indicated an absolute improvement in segmentation accuracy of approximately 0.70% over the baseline speech segmentation scheme.
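
The tolerance-based accuracy figure quoted above can be illustrated with a minimal sketch. It assumes the common convention that a predicted boundary counts as correct when it lies within the tolerance of the manually labeled boundary, and that forced alignment yields exactly one prediction per reference boundary. The function name boundary_accuracy and the boundary times are hypothetical and are not taken from the paper or from TIMIT.

```python
from typing import Sequence


def boundary_accuracy(predicted_s: Sequence[float],
                      reference_s: Sequence[float],
                      tolerance_s: float = 0.020) -> float:
    """Fraction of predicted boundaries within tolerance_s of the reference.

    Assumes (not stated in the abstract) that both sequences are in seconds
    and aligned one-to-one, as is the case for forced alignment against a
    known phone sequence.
    """
    if len(predicted_s) != len(reference_s):
        raise ValueError("expected one prediction per reference boundary")
    if not reference_s:
        raise ValueError("no boundaries to score")
    hits = sum(abs(p - r) <= tolerance_s
               for p, r in zip(predicted_s, reference_s))
    return hits / len(reference_s)


# Hypothetical boundary times in seconds (illustrative values only).
reference = [0.100, 0.250, 0.400, 0.610]
predicted = [0.105, 0.275, 0.395, 0.640]

accuracy = boundary_accuracy(predicted, reference)
print(f"accuracy at 20 ms tolerance: {accuracy:.2%}")
```

Under this convention, an absolute improvement of 0.70% means that the share of boundaries falling inside the 20 ms window rises by 0.70 percentage points relative to the baseline.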

Author information

Corresponding author

Correspondence to Todor Ganchev.

About this article

Cite this article

Mporas, I., Ganchev, T. & Fakotakis, N. Phonetic segmentation using multiple speech features. Int J Speech Technol 11, 73–85 (2008). https://doi.org/10.1007/s10772-009-9038-4

Keywords

  • Speech segmentation
  • Automatic phonetic segmentation
  • Viterbi algorithm
  • Hidden Markov models