Emilia: a speech corpus for Argentine Spanish text to speech synthesis

Torres, Humberto M.; Gurlekian, Jorge A.; Evin, Diego A.; Cossio Mercado, Christian G.

doi:10.1007/s10579-019-09447-7

Original Paper
Published: 02 February 2019

Volume 53, pages 419–447, (2019)

Language Resources and Evaluation

Abstract

This paper introduces Emilia, a speech corpus created to build a female voice for the Aromo text-to-speech system in the Spanish spoken in Buenos Aires. Aromo is a unit selection text-to-speech system that uses diphones as its synthesis units. The key design requirement for Emilia was to synthesize any Spanish text as high-quality speech from a corpus of minimum size. The text corpus was designed to guarantee phonetic and prosodic coverage, using a three-stage strategy. In the first stage, 741 sentences were designed to contain all the syllables of Spanish as spoken in Argentina, with and without stress, and in all positions within the word. In the second stage, 852 sentences were added to balance the distribution of diphones. In the third and final stage, carried out after a perceptual evaluation of the quality of the synthesized speech, 625 sentences were added to achieve the specified unit coverage and to introduce sentences with more complex syntactic and prosodic structures. Issues from all three corpus-building stages are reported. The paper also presents the results of perceptual quality evaluations of the synthesized voice. Emilia has a duration of three hours and 15 minutes; the quality of the speech synthesized from it with the Aromo system is similar to that of commercial systems, with a real-time ratio of less than one.
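As a rough illustration of the quantities named in the abstract, the following Python sketch adds up the three stage sizes, counts diphones in phone-transcribed sentences, and computes a real-time ratio. It is only a sketch under stated assumptions: the function names, the "_" silence padding symbol, the toy transcriptions, and the toy target inventory are hypothetical, and the real-time ratio is assumed to mean synthesis time divided by the duration of the generated audio. The abstract does not specify the authors' tooling or exact definitions.

```python
from collections import Counter

# Sentence counts per design stage, as stated in the abstract.
STAGE_SENTENCES = {"stage 1": 741, "stage 2": 852, "stage 3": 625}
# "Three hours and 15 minutes", expressed in seconds.
TOTAL_DURATION_S = 3 * 3600 + 15 * 60


def diphones(phones):
    """Return adjacent phone pairs of one utterance.

    The utterance is padded with "_" (a hypothetical silence symbol) so
    that utterance-initial and utterance-final transitions are counted.
    """
    padded = ["_"] + list(phones) + ["_"]
    return list(zip(padded, padded[1:]))


def diphone_counts(utterances):
    """Count every diphone over a list of phone sequences."""
    counts = Counter()
    for phones in utterances:
        counts.update(diphones(phones))
    return counts


def coverage(counts, inventory):
    """Fraction of the target diphone inventory seen at least once."""
    if not inventory:
        raise ValueError("inventory must not be empty")
    covered = sum(1 for d in inventory if counts[d] > 0)
    return covered / len(inventory)


def real_time_ratio(synthesis_seconds, audio_seconds):
    """Assumed definition: processing time over generated audio duration."""
    return synthesis_seconds / audio_seconds


total_sentences = sum(STAGE_SENTENCES.values())
mean_sentence_s = TOTAL_DURATION_S / total_sentences

# Toy data (hypothetical): two short words and a three-diphone target set.
toy_utterances = [["k", "a", "s", "a"], ["m", "e", "s", "a"]]
toy_inventory = {("s", "a"), ("k", "a"), ("a", "m")}
toy_counts = diphone_counts(toy_utterances)
toy_coverage = coverage(toy_counts, toy_inventory)
```

With the figures from the abstract, the three stages add up to 2,218 sentences. If the stated duration covers all of them, the mean is a little over five seconds per sentence. Under the assumed definition, a real-time ratio below one means that synthesis finishes faster than the resulting audio takes to play.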

Notes

http://htk.eng.cam.ac.uk.
http://www.ilc.cnr.it/EAGLES96/home.html.
http://nlp.lsi.upc.edu/freeling/doc/tagsets/tagset-es.html.
http://www.speech.kth.se/software/.
http://www.fon.hum.uva.nl/praat/.
http://www.mathworks.com/.
These resources are available, in full, partial or demonstrations, for academic or commercial purpose(s), by e-mail to the authors.

Acknowledgements

The authors would like to thank the anonymous reviewers for their insightful feedback. This research was supported by Ministerio de Ciencia y Tecnología and Consejo Nacional de Investigaciones Científicas y Técnicas, Argentina.

Author information

Authors and Affiliations

Laboratorio de Investigaciones Sensoriales, INIGEM, CONICET-UBA, Av. Córdoba 2351, 9 Piso Sala 2. C.A.B.A. (1120), Buenos Aires, Argentina
Humberto M. Torres & Jorge A. Gurlekian
Center for Research and Transfer in Acoustics (CINTRA), UTN-FRC UA CONICET, Maestro M. López esq. Cruz Roja Argentina, Ciudad Universitaria, X5016ZAA, Córdoba Capital, Argentina
Diego A. Evin
Departamento de Computación, FCEN, UBA, Ciudad Universitaria, C1428EGA, Buenos Aires, Argentina
Christian G. Cossio Mercado

Corresponding author

Correspondence to Humberto M. Torres.

About this article

Cite this article

Torres, H.M., Gurlekian, J.A., Evin, D.A. et al. Emilia: a speech corpus for Argentine Spanish text to speech synthesis. Lang Resources & Evaluation 53, 419–447 (2019). https://doi.org/10.1007/s10579-019-09447-7

Published: 02 February 2019
Issue Date: 15 September 2019
DOI: https://doi.org/10.1007/s10579-019-09447-7
