Improving the Accuracy of the Speech Synthesis Based Phonetic Alignment Using Multiple Acoustic Features

Paulo, Sérgio; Oliveira, Luís C.

doi:10.1007/3-540-45011-4_5

Part of the book series: Lecture Notes in Computer Science (LNAI, volume 2721)

Included in the following conference series:

International Workshop on Computational Processing of the Portuguese Language

Abstract

The phonetic alignment of spoken utterances for speech research is commonly performed by HMM-based speech recognizers in forced-alignment mode, but training the phonetic segment models requires considerable amounts of annotated data. When no such material is available, a possible solution is to synthesize the same phonetic sequence and align the resulting speech signal with the spoken utterance. However, without a careful choice of the acoustic features used in this procedure, it can perform poorly when applied to continuous speech utterances. In this paper we propose a new method to select the best features to use in the alignment procedure for each pair of phonetic segment classes. The results show that this selection considerably reduces the segment boundary location errors.
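
The synthesis-based alignment idea described in the abstract can be illustrated with a minimal sketch. It assumes the alignment step is dynamic time warping (DTW) over per-frame feature values, which the abstract does not state explicitly. The names `dtw_path`, `map_boundaries` and `dist` are hypothetical, and the one-dimensional toy "features" stand in for real acoustic features. The feature selection proposed in the paper would correspond, roughly, to changing what `dist` compares depending on the pair of phonetic classes at a boundary; that part is not implemented here.

```python
from math import inf


def dtw_path(synth, natural, dist):
    """Return a minimum-cost warping path as (synth_frame, natural_frame) pairs."""
    n, m = len(synth), len(natural)
    # cost[i][j] is the best cumulative cost of aligning the first i synthetic
    # frames with the first j natural frames.
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = dist(synth[i - 1], natural[j - 1])
            cost[i][j] = d + min(cost[i - 1][j], cost[i][j - 1], cost[i - 1][j - 1])

    # Backtrack from the last frame pair to the first one.
    path = []
    i, j = n, m
    while i > 0 and j > 0:
        path.append((i - 1, j - 1))
        candidates = [
            (cost[i - 1][j - 1], i - 1, j - 1),  # advance both signals
            (cost[i - 1][j], i - 1, j),          # advance the synthetic signal only
            (cost[i][j - 1], i, j - 1),          # advance the natural signal only
        ]
        _, i, j = min(candidates)
    path.reverse()
    return path


def map_boundaries(path, synth_boundaries):
    """Map each synthetic boundary frame to the first natural frame aligned with it."""
    first_match = {}
    for i, j in path:
        first_match.setdefault(i, j)
    return [first_match[b] for b in synth_boundaries]


# Toy example: two "segments" with feature values 0.0 and 1.0.
# The synthetic signal has a known boundary at frame 2 (start of the second segment).
synth = [0.0, 0.0, 1.0, 1.0]
natural = [0.0, 0.0, 0.0, 1.0, 1.0, 1.0]
path = dtw_path(synth, natural, lambda a, b: abs(a - b))
print(map_boundaries(path, [2]))  # [3]
```

Because the synthesizer knows where each phonetic segment starts in its own output, the warping path is enough to transfer those boundaries onto the recorded utterance. How well this works depends on the distance measure, which is the sensitivity to feature choice that the abstract points to.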

Author information

Authors and Affiliations

L2F Spoken Language Systems Lab, INESC-ID/IST, Rua Alves Redol 9, 1000-029, Lisbon, Portugal
Sérgio Paulo & Luís C. Oliveira

Editor information

Editors and Affiliations

L2F, INESC-ID Lisboa, Technical University of Lisbon, Rua Alves Redol, 9, 1000-029, Lisbon, Portugal
Nuno J. Mamede & Isabel Trancoso
Faculty of Humanities and Social Sciences, University of Algarve, Campus de Gambelas, 8005-139, Faro, Portugal
Jorge Baptista
NILC, ICMC-USP São-Carlos, Av. do Trabalhador São-Carlense, 400, 13560-970, São Carlos, SP, Brazil
Maria das Graças Volpe Nunes

About this paper

Cite this paper

Paulo, S., Oliveira, L.C. (2003). Improving the Accuracy of the Speech Synthesis Based Phonetic Alignment Using Multiple Acoustic Features. In: Mamede, N.J., Trancoso, I., Baptista, J., das Graças Volpe Nunes, M. (eds) Computational Processing of the Portuguese Language. PROPOR 2003. Lecture Notes in Computer Science, vol 2721. Springer, Berlin, Heidelberg. https://doi.org/10.1007/3-540-45011-4_5

DOI: https://doi.org/10.1007/3-540-45011-4_5
Published: 18 June 2003
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-40436-1
Online ISBN: 978-3-540-45011-5
eBook Packages: Springer Book Archive
