Abstract
This paper (1) gives an extensive overview of the field of diacritics restoration in Romanian texts; (2) presents our own experiments and results and promotes the word-based Viterbi algorithm as a more accurate solution, one already used in a free web-based TTS implementation; and (3) announces the production of a new, high-quality, high-volume corpus of Romanian texts, twice the size of the Romanian-language subset of the JRC-Acquis.
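The abstract names the technique without describing it, so the following is only a minimal sketch of what word-level Viterbi decoding for diacritics restoration can look like. It is not the paper's implementation. The add-one-smoothed bigram model, the function names `strip_diacritics`, `train` and `restore`, and the four-sentence toy corpus are all assumptions made for illustration.

```python
import math
import unicodedata
from collections import defaultdict


def strip_diacritics(word):
    """Remove combining marks, e.g. 'pește' -> 'peste'."""
    decomposed = unicodedata.normalize("NFD", word)
    return "".join(c for c in decomposed if not unicodedata.combining(c))


def train(sentences):
    """Collect unigram and bigram counts plus the diacritic variants of each stripped form."""
    unigrams = defaultdict(int)
    bigrams = defaultdict(int)
    variants = defaultdict(set)
    for sentence in sentences:
        words = ["<s>"] + sentence.lower().split()
        unigrams["<s>"] += 1
        for prev, cur in zip(words, words[1:]):
            unigrams[cur] += 1
            bigrams[(prev, cur)] += 1
            variants[strip_diacritics(cur)].add(cur)
    return unigrams, bigrams, variants


def restore(text, model):
    """Pick the most probable sequence of word variants with Viterbi decoding."""
    unigrams, bigrams, variants = model
    vocab = len(unigrams)

    def logp(prev, cur):
        # Add-one smoothed bigram log-probability (an assumed, simplistic model).
        return math.log((bigrams.get((prev, cur), 0) + 1)
                        / (unigrams.get(prev, 0) + vocab))

    # best maps each candidate for the current position to
    # (score of the best path ending in it, that path).
    best = {"<s>": (0.0, [])}
    for token in text.lower().split():
        key = strip_diacritics(token)
        # Unknown words are passed through unchanged.
        candidates = variants.get(key, {token})
        new_best = {}
        for cand in candidates:
            score, path = max(
                ((s + logp(prev, cand), p) for prev, (s, p) in best.items()),
                key=lambda item: item[0],
            )
            new_best[cand] = (score, path + [cand])
        best = new_best
    return " ".join(max(best.values(), key=lambda item: item[0])[1])


# Hypothetical toy corpus: 'peste' (over) versus 'pește' (fish).
toy_corpus = [
    "el mănâncă pește",
    "pisica mănâncă pește",
    "podul peste râu",
    "trece peste râu",
]
toy_model = train(toy_corpus)
print(restore("el mananca peste", toy_model))  # el mănâncă pește
print(restore("podul peste rau", toy_model))   # podul peste râu
```

Working at the word level means the decoder chooses among attested word forms only, so the context of neighbouring words decides between variants such as "peste" and "pește". A realistic system would need a far larger corpus and better smoothing than this sketch assumes.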
Copyright information
© 2012 Springer-Verlag Berlin Heidelberg
About this paper
Cite this paper
Grozea, C. (2012). Experiments and Results with Diacritics Restoration in Romanian. In: Sojka, P., Horák, A., Kopeček, I., Pala, K. (eds) Text, Speech and Dialogue. TSD 2012. Lecture Notes in Computer Science, vol 7499. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-32790-2_24
DOI: https://doi.org/10.1007/978-3-642-32790-2_24
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-642-32789-6
Online ISBN: 978-3-642-32790-2
eBook Packages: Computer Science, Computer Science (R0)