Romanian Syllabication Using Machine Learning
The task of finding syllable boundaries can be straightforward or challenging, depending on the language. Text-to-speech applications have been shown to perform considerably better when syllabication, whether orthographic or phonetic, is used to break the text down into units below word level. Romanian syllabication is non-trivial, mainly, though not exclusively, because of its hiatus-diphthong ambiguity, a phenomenon that affects both phonetic and orthographic syllabication. In this paper, we focus on orthographic syllabication for Romanian and show that the task can be carried out with a high degree of accuracy by using sequence tagging. We compare this approach to support vector machines and rule-based methods. The features we use are simply character n-grams with end-of-word marking.
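To make the feature description concrete, the following is a minimal sketch of how character n-gram features with end-of-word marking might be extracted for a sequence tagger. It is not the authors' implementation: the padding symbol `$`, the window size, the maximum n-gram length, the labelling scheme (1 if a syllable boundary follows a letter, 0 otherwise) and the function names `boundary_labels` and `ngram_features` are all assumptions made for illustration.

```python
from typing import Dict, List, Tuple

PAD = "$"  # assumed word-boundary marker; the abstract does not name a symbol


def boundary_labels(syllabified: str, sep: str = "-") -> Tuple[str, List[int]]:
    """Split a hyphenated word into its letters and one label per letter.

    A label of 1 means a syllable boundary follows that letter
    (an assumed encoding, chosen here for simplicity).
    """
    letters: List[str] = []
    labels: List[int] = []
    for ch in syllabified:
        if ch == sep:
            if labels:
                labels[-1] = 1  # mark the letter just before the separator
        else:
            letters.append(ch)
            labels.append(0)
    return "".join(letters), labels


def ngram_features(word: str, i: int, window: int = 2, max_n: int = 2) -> Dict[str, str]:
    """Character n-grams in a window around position i, with padded word ends."""
    padded = PAD * window + word + PAD * window
    center = i + window  # index of letter i inside the padded string
    feats: Dict[str, str] = {}
    for n in range(1, max_n + 1):
        # every n-gram that fits entirely inside the window
        for start in range(center - window, center + window - n + 2):
            offset = start - center
            feats[f"{n}gram[{offset}]"] = padded[start:start + n]
    return feats


word, labels = boundary_labels("ma-mă")
features = [ngram_features(word, i) for i in range(len(word))]
```

Each position then yields a dictionary of string-valued features paired with a binary label, which is the general shape of input that a sequence tagger or a per-position classifier could consume. The padding symbol lets n-grams near the start and end of the word record that they touch a word boundary.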