Phonetic segmentation using multiple speech features

Mporas, Iosif; Ganchev, Todor; Fakotakis, Nikos

doi:10.1007/s10772-009-9038-4

Published: 14 August 2009

Volume 11, pages 73–85 (2008)

International Journal of Speech Technology

Abstract

In this paper we propose a method for improving the segmentation of speech waveforms into phonetic units. The method is based on the well-known Viterbi time-alignment algorithm and uses the phonetic boundary predictions obtained from multiple speech parameterization techniques. Specifically, we take the phone transition position prediction that is most appropriate for the boundary type and use it as the starting point of the Viterbi time-alignment that predicts the next phonetic boundary. The method was evaluated on the TIMIT database using several Fourier-based and wavelet-based speech parameterization algorithms that are well known in speech processing. For a tolerance of 20 milliseconds, the experimental results indicate an improvement in absolute segmentation accuracy of approximately 0.70% over the baseline speech segmentation scheme.
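
Two ideas in the abstract can be illustrated with a short sketch: scoring boundary predictions against a 20 ms tolerance, and choosing, for each boundary type, the speech parameterization whose predictions are most often within that tolerance. The code below is a minimal illustration under stated assumptions, not the authors' implementation. The function names, the boundary-type labels, the feature names and the use of held-out data for the selection are all hypothetical, and the Viterbi re-alignment step that starts from the selected prediction is not shown.

```python
from collections import defaultdict


def boundary_accuracy(predicted_ms, reference_ms, tolerance_ms=20):
    """Fraction of predicted boundaries within tolerance_ms of the reference.

    Assumes one prediction per reference boundary, as in forced alignment
    against a known phone sequence. Times are in milliseconds.
    """
    if len(predicted_ms) != len(reference_ms):
        raise ValueError("expected one prediction per reference boundary")
    if not reference_ms:
        return 0.0
    hits = sum(abs(p - r) <= tolerance_ms
               for p, r in zip(predicted_ms, reference_ms))
    return hits / len(reference_ms)


def best_feature_per_boundary_type(predictions, reference_ms, boundary_types,
                                   tolerance_ms=20):
    """Pick, for each boundary type, the feature with the most in-tolerance hits.

    predictions:    {feature name: [predicted boundary time in ms, ...]}
    reference_ms:   [hand-labelled boundary time in ms, ...]
    boundary_types: [label of each boundary, e.g. "vowel-stop", ...]
    Ties go to the feature listed first in `predictions`.
    """
    hits = defaultdict(lambda: defaultdict(int))
    for feature, predicted_ms in predictions.items():
        for p, r, btype in zip(predicted_ms, reference_ms, boundary_types):
            hits[btype][feature] += int(abs(p - r) <= tolerance_ms)
    return {btype: max(counts, key=counts.get)
            for btype, counts in hits.items()}


if __name__ == "__main__":
    # Toy data, invented for illustration only.
    reference = [100, 250, 400, 520]
    types = ["vowel-stop", "stop-vowel", "vowel-stop", "stop-vowel"]
    predictions = {
        "MFCC": [105, 290, 410, 560],
        "PLP": [140, 255, 450, 530],
    }
    print(boundary_accuracy(predictions["MFCC"], reference))
    print(best_feature_per_boundary_type(predictions, reference, types))
```

In a full system along the lines the abstract describes, the selected feature's prediction for a boundary would then seed the alignment of the following boundary; here the selection table is simply returned.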

Author information

Authors and Affiliations

Wire Communications Laboratory, Department of Electrical and Computer Engineering, University of Patras, Rion-Patras, 26500, Greece
Iosif Mporas, Todor Ganchev & Nikos Fakotakis

Corresponding author

Correspondence to Todor Ganchev.

About this article

Cite this article

Mporas, I., Ganchev, T. & Fakotakis, N. Phonetic segmentation using multiple speech features. Int J Speech Technol 11, 73–85 (2008). https://doi.org/10.1007/s10772-009-9038-4

Received: 10 April 2009
Accepted: 29 July 2009
Published: 14 August 2009
Issue Date: June 2008
DOI: https://doi.org/10.1007/s10772-009-9038-4
