Phonetic segmentation using multiple speech features

Article

Abstract

In this paper we propose a method for improving the segmentation of speech waveforms into phonetic units. The method builds on the well-known Viterbi time-alignment algorithm and uses the phonetic boundary predictions obtained from multiple speech parameterization techniques. Specifically, for each boundary we take the phone transition prediction that is most appropriate for the boundary type and use it as the starting point of the Viterbi time-alignment that predicts the next phonetic boundary. The method was evaluated on the TIMIT database using several Fourier-based and wavelet-based speech parameterization algorithms that are well known in speech processing. For a tolerance of 20 milliseconds, the experimental results indicate an absolute improvement in segmentation accuracy of approximately 0.70% over the baseline speech segmentation scheme.
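To make the evaluation criterion concrete, the following is a minimal sketch of how segmentation accuracy at a given tolerance is commonly computed: the share of predicted boundaries that fall within the tolerance of the corresponding reference boundary. The function name, the millisecond units and the toy boundary values are illustrative assumptions, not taken from the paper. The sketch also assumes that predicted and reference boundaries are paired one to one, as is the case when the phone sequence is known in advance.

```python
def boundary_accuracy(predicted_ms, reference_ms, tolerance_ms=20):
    """Fraction of boundaries predicted within tolerance_ms of the reference.

    Assumes (hypothetically) that both lists hold boundary positions in
    milliseconds and that the i-th prediction pairs with the i-th reference.
    """
    if len(predicted_ms) != len(reference_ms):
        raise ValueError("predicted and reference boundaries must be paired")
    if not reference_ms:
        raise ValueError("at least one boundary is required")
    hits = sum(
        abs(p - r) <= tolerance_ms
        for p, r in zip(predicted_ms, reference_ms)
    )
    return hits / len(reference_ms)


# Toy data, invented for illustration only.
reference = [100, 250, 400, 620]
baseline = [105, 275, 395, 650]   # two boundaries are off by more than 20 ms
improved = [105, 262, 395, 650]   # one boundary is off by more than 20 ms

baseline_acc = boundary_accuracy(baseline, reference)
improved_acc = boundary_accuracy(improved, reference)
print(f"baseline: {baseline_acc:.2%}, improved: {improved_acc:.2%}")
```

An absolute improvement, as reported in the abstract, is the plain difference between two such accuracy values rather than a relative change.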

Keywords

Speech segmentation; Automatic phonetic segmentation; Viterbi algorithm; Hidden Markov models

Copyright information

© Springer Science+Business Media, LLC 2009

Authors and Affiliations

  1. Wire Communications Laboratory, Department of Electrical and Computer Engineering, University of Patras, Rion-Patras, Greece