Abstract
In this paper we propose a method for improving the segmentation of speech waveforms into phonetic units. The method is based on the well-known Viterbi time-alignment algorithm and utilizes the phonetic boundary predictions obtained from multiple speech parameterization techniques. Specifically, for each boundary we select the phone transition position predicted by the parameterization most appropriate for that boundary type, and use it as the starting point of the Viterbi time-alignment that predicts the successor phonetic boundary. The method was evaluated on the TIMIT database using several well-known Fourier-based and wavelet-based speech parameterization algorithms. For a tolerance of 20 milliseconds, the experimental results indicate an absolute improvement in segmentation accuracy of approximately 0.70% compared to the baseline speech segmentation scheme.
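As an illustration of the tolerance-based accuracy mentioned in the abstract, the following minimal sketch computes the fraction of predicted boundaries that fall within a given tolerance of the reference boundaries. It assumes a one-to-one pairing between predicted and reference boundaries, as in forced alignment with a known phone sequence. The function name `boundary_accuracy` and the sample boundary times are hypothetical, and the exact metric definition used in the paper may differ.

```python
from typing import Sequence


def boundary_accuracy(
    predicted: Sequence[float],
    reference: Sequence[float],
    tolerance: float = 0.020,
) -> float:
    """Return the fraction of predicted boundaries lying within
    `tolerance` seconds of their paired reference boundary.

    Assumption: boundaries are paired by index, i.e. the phone
    sequence is known and only the boundary positions are estimated.
    """
    if len(predicted) != len(reference):
        raise ValueError("predicted and reference must have equal length")
    if len(reference) == 0:
        raise ValueError("at least one boundary is required")
    hits = sum(
        1 for p, r in zip(predicted, reference) if abs(p - r) <= tolerance
    )
    return hits / len(reference)


# Hypothetical boundary times in seconds (illustrative values only).
reference = [0.100, 0.250, 0.400, 0.575]
predicted = [0.105, 0.275, 0.410, 0.570]

# The second boundary is off by 25 ms and falls outside the 20 ms tolerance.
accuracy = boundary_accuracy(predicted, reference)
print(f"accuracy within 20 ms: {accuracy:.2%}")
```

With this definition, an absolute improvement of 0.70% means that the share of boundaries inside the 20 ms window rises by 0.70 percentage points.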
About this article
Cite this article
Mporas, I., Ganchev, T. & Fakotakis, N. Phonetic segmentation using multiple speech features. Int J Speech Technol 11, 73–85 (2008). https://doi.org/10.1007/s10772-009-9038-4
DOI: https://doi.org/10.1007/s10772-009-9038-4