Automatic Acquisition of a Slovak Lexicon from a Raw Corpus

  • Conference paper
Text, Speech and Dialogue (TSD 2005)

Part of the book series: Lecture Notes in Computer Science (LNAI, volume 3658)

Abstract

This paper presents an automatic methodology that we used in an experiment to acquire a morphological lexicon for the Slovak language, together with the lexicon we obtained. This methodology extends and refines approaches that have proven efficient, e.g., for the acquisition of French verbs or of Croatian and Russian nouns, adjectives and verbs. It relies only on a raw corpus and on a morphological description of the language. The underlying idea is to build all possible lemmas that can explain the words found in the corpus, according to the morphological description, and to rank these hypothetical lemmas by their likelihood given the corpus. Hand-validation and iteration of the whole process are still needed to achieve a high-quality lexicon, but the human involvement required is orders of magnitude lower than the cost of developing such a resource fully by hand. Moreover, this technique can easily be applied to other languages with a rich morphology that lack large-coverage lexical resources.
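The underlying idea can be illustrated with a minimal Python sketch. Everything in it is an assumption made for illustration: the two toy paradigms and their endings, the function names `hypothesize` and `rank`, and above all the scoring, which simply measures the share of a paradigm's forms attested in the corpus. The abstract does not specify the likelihood measure actually used, so this coverage score is a stand-in, not the paper's method.

```python
from collections import defaultdict

# Hypothetical toy paradigms: name -> (lemma ending, endings of inflected forms).
# These endings are illustrative only; a real morphological description of
# Slovak is far richer.
PARADIGMS = {
    "noun_fem_a": ("a", ["a", "y", "e", "u", "ou"]),
    "noun_masc_0": ("", ["", "a", "u", "om", "y"]),
}


def hypothesize(corpus_words):
    """Map each (lemma, paradigm) hypothesis to the corpus words it explains."""
    explained = defaultdict(set)
    for word in corpus_words:
        for name, (lemma_ending, form_endings) in PARADIGMS.items():
            for ending in form_endings:
                # Require a non-empty stem once the ending is stripped.
                if word.endswith(ending) and len(word) > len(ending):
                    stem = word[: len(word) - len(ending)]
                    explained[(stem + lemma_ending, name)].add(word)
    return explained


def rank(explained):
    """Sort hypotheses by the share of paradigm forms attested (assumed score)."""
    def score(item):
        (_, name), words = item
        return len(words) / len(PARADIGMS[name][1])

    # Highest coverage first; ties broken by (lemma, paradigm) for determinism.
    return sorted(explained.items(), key=lambda item: (-score(item), item[0]))


# Toy "corpus": five inflected forms of one noun.
words = ["ryba", "ryby", "rybe", "rybu", "rybou"]
ranking = rank(hypothesize(words))
best_hypothesis, best_words = ranking[0]
```

On this toy input the top-ranked hypothesis is the lemma `ryba` under the `noun_fem_a` paradigm, since it is the only hypothesis that accounts for all five words. The lower-ranked competitors are the kind of candidates a human validator would reject during hand-validation.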

We warmly thank Katarína Maťašovičová, a native speaker of Slovak, who served as our validator during the acquisition process described here.

Copyright information

© 2005 Springer-Verlag Berlin Heidelberg

About this paper

Cite this paper

Sagot, B. (2005). Automatic Acquisition of a Slovak Lexicon from a Raw Corpus. In: Matoušek, V., Mautner, P., Pavelka, T. (eds) Text, Speech and Dialogue. TSD 2005. Lecture Notes in Computer Science, vol 3658. Springer, Berlin, Heidelberg. https://doi.org/10.1007/11551874_20

  • DOI: https://doi.org/10.1007/11551874_20

  • Publisher Name: Springer, Berlin, Heidelberg

  • Print ISBN: 978-3-540-28789-6

  • Online ISBN: 978-3-540-31817-0

  • eBook Packages: Computer Science, Computer Science (R0)
