Phonetic segmentation using multiple speech features

International Journal of Speech Technology

Abstract

In this paper we propose a method for improving the segmentation of speech waveforms into phonetic units. The method is based on the well-known Viterbi time-alignment algorithm and uses the phonetic boundary predictions obtained from multiple speech parameterization techniques. Specifically, for each boundary we take the phone transition position predicted by the parameterization most appropriate for that boundary type and use it as the initial point from which Viterbi time-alignment predicts the successor phonetic boundary. The method was evaluated on the TIMIT database using several Fourier-based and wavelet-based speech parameterization algorithms that are well known in speech processing. For a tolerance of 20 milliseconds, the experimental results indicate an absolute improvement in segmentation accuracy of approximately 0.70% over the baseline speech segmentation scheme.
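
The two ideas in the abstract that lend themselves to a small illustration are the tolerance-based accuracy measure and the choice of a prediction according to boundary type. The Python sketch below is only an illustration under stated assumptions: the function names, the feature labels ("mfcc", "wavelet"), the boundary-type labels and all numeric values are hypothetical and do not come from the article, and the Viterbi re-alignment that the method starts from each selected boundary is deliberately left out.

```python
from typing import Dict, List, Sequence

TOLERANCE_MS = 20  # tolerance quoted in the abstract, in milliseconds


def segmentation_accuracy(predicted: Sequence[float],
                          reference: Sequence[float],
                          tolerance_ms: float = TOLERANCE_MS) -> float:
    """Fraction of predicted boundaries lying within the tolerance of the
    reference boundary at the same index (forced alignment pairs them 1:1)."""
    if len(predicted) != len(reference):
        raise ValueError("boundary lists must have equal length")
    if not reference:
        return 0.0
    hits = sum(abs(p - r) <= tolerance_ms for p, r in zip(predicted, reference))
    return hits / len(reference)


def select_boundaries(predictions: Dict[str, List[float]],
                      boundary_types: List[str],
                      best_feature: Dict[str, str]) -> List[float]:
    """For each boundary, keep the prediction of the feature set that is
    assumed to suit its boundary type best (hypothetical mapping)."""
    return [predictions[best_feature[btype]][i]
            for i, btype in enumerate(boundary_types)]


# Illustrative, made-up boundary positions in milliseconds.
predictions = {
    "mfcc": [100, 250, 400],
    "wavelet": [130, 230, 430],
}
boundary_types = ["vowel-stop", "stop-vowel", "vowel-stop"]
best_feature = {"vowel-stop": "mfcc", "stop-vowel": "wavelet"}
reference = [105, 225, 405]

selected = select_boundaries(predictions, boundary_types, best_feature)
print("selected boundaries:", selected)
print("accuracy (mfcc only):", segmentation_accuracy(predictions["mfcc"], reference))
print("accuracy (selected):", segmentation_accuracy(selected, reference))
```

In the method described above, each selected boundary would additionally serve as the starting point of the Viterbi time-alignment that predicts the next boundary; the sketch simply picks per-boundary predictions and scores them.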

Author information

Corresponding author

Correspondence to Todor Ganchev.

About this article

Cite this article

Mporas, I., Ganchev, T. & Fakotakis, N. Phonetic segmentation using multiple speech features. Int J Speech Technol 11, 73–85 (2008). https://doi.org/10.1007/s10772-009-9038-4

  • DOI: https://doi.org/10.1007/s10772-009-9038-4
