International Journal of Speech Technology

Volume 10, Issue 2–3, pp 121–141

A framework for efficient development of Slovenian written language resources used in speech processing applications

  • Matej Rojc
  • Darinka Verdonik
  • Zdravko Kačič


This paper presents a framework for the efficient development and representation of morphological and phonetic lexicons for use in speech technology applications. Developing such lexicons requires analyzing which solutions are most appropriate for building speech technologies for a specific language. The paper addresses the development of resources, good word coverage in general texts, efficient coding of lexicons, their representation (with respect to time and memory space), and the integration of lexicons into speech processing applications. The construction process within the proposed framework is based on finite-state machines and heterogeneous relation graph structures. It significantly reduces the time and effort needed to construct large-scale lexicons, minimizes analysis errors, and represents the lexicons efficiently with respect to time and memory usage. The wordlist construction process presented in the paper also guarantees that the constructed lexicons achieve high word coverage in general texts. The SIlex lexicons are large-scale phonetic and morphological lexicons for the Slovenian language, constructed within the new framework using a toolset developed for this purpose, and they represent valuable language resources for developing various speech processing applications for Slovenian.
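
As a rough illustration of two ideas named in the abstract, the following sketch stores a wordlist as a deterministic acyclic finite-state acceptor and measures word coverage over a token list. It is not the authors' toolset: the class `TrieLexicon`, the function `coverage`, and the sample words are all hypothetical, and the acceptor is a plain trie without the minimization or transducer output a production lexicon would likely use.

```python
from typing import Dict, Iterable, List


class TrieLexicon:
    """Hypothetical wordlist acceptor: a plain trie (unminimized acyclic FSM)."""

    def __init__(self) -> None:
        # State 0 is the start state; each state maps a symbol to a next state.
        self.transitions: List[Dict[str, int]] = [{}]
        self.final: List[bool] = [False]

    def add(self, word: str) -> None:
        """Insert a word, reusing states for any prefix already stored."""
        state = 0
        for symbol in word:
            nxt = self.transitions[state].get(symbol)
            if nxt is None:
                nxt = len(self.transitions)
                self.transitions.append({})
                self.final.append(False)
                self.transitions[state][symbol] = nxt
            state = nxt
        self.final[state] = True

    def accepts(self, word: str) -> bool:
        """Return True only if the word ends in a final state."""
        state = 0
        for symbol in word:
            nxt = self.transitions[state].get(symbol)
            if nxt is None:
                return False
            state = nxt
        return self.final[state]


def coverage(lexicon: TrieLexicon, tokens: Iterable[str]) -> float:
    """Fraction of tokens (lowercased) that the lexicon accepts."""
    token_list = list(tokens)
    if not token_list:
        return 0.0
    hits = sum(1 for token in token_list if lexicon.accepts(token.lower()))
    return hits / len(token_list)


# Illustrative data only; these words are not taken from the SIlex lexicons.
lexicon = TrieLexicon()
for form in ["hiša", "hiše", "hiši"]:
    lexicon.add(form)

sample_tokens = ["Hiša", "hiše", "miza", "hiši"]
print(f"coverage: {coverage(lexicon, sample_tokens):.2f}")
```

Because the three sample forms share the prefix "hiš", the acceptor needs only seven states for them, which hints at why finite-state representations can be compact for highly inflected languages.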


Written language resources · Morphology lexicon · Phonetic lexicon · Heterogeneous relation graphs (HRG) · Finite-state machines (FSM) · Slovenian language





Copyright information

© Springer Science+Business Media, LLC 2009

Authors and Affiliations

  1. Faculty of Electrical Engineering and Computer Science, University of Maribor, Maribor, Slovenia
