Abstract
Building a speech recognition system requires a recorded set of phrases from which to compute the acoustic models. This set of phrases must be phonetically rich and balanced in order to obtain a robust recognizer. Traditionally, the set is defined manually, which demands considerable human effort. In this paper we propose an automated method for assembling a phonetically balanced corpus (set of phrases) from the Web. The proposed method was used to construct a phonetically balanced corpus for Mexican Spanish.
Copyright information
© 2004 Springer-Verlag Berlin Heidelberg
About this paper
Cite this paper
Villaseñor-Pineda, L., Montes-y-Gómez, M., Vaufreydaz, D., Serignat, J.F. (2004). Experiments on the Construction of a Phonetically Balanced Corpus from the Web. In: Gelbukh, A. (ed.) Computational Linguistics and Intelligent Text Processing. CICLing 2004. Lecture Notes in Computer Science, vol 2945. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-540-24630-5_50
DOI: https://doi.org/10.1007/978-3-540-24630-5_50
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-21006-1
Online ISBN: 978-3-540-24630-5
eBook Packages: Springer Book Archive