
Slovene Multi-word Units: Identification, Categorization, and Representation


Part of the Lecture Notes in Computer Science book series (LNAI, volume 11755)


In this paper, we present the results of a manual annotation of a Slovene training corpus with multi-word units (MWUs) relevant for inclusion in a lexicon of Slovene MWUs. We analyze the annotations in terms of (a) the frequency with which a string has been identified as an MWU, (b) the degree to which the annotators agree on the category of the identified MWU, and (c) the degree to which the annotators agree on the range of the MWU in terms of its lexicalized elements. The results of the analysis will be useful in different stages of the compilation of a Slovene MWU lexicon. The list of dictionary-relevant MWUs obtained in the annotation task will be used to enrich the lexicon and to train models for the automatic identification of MWUs in running text. The findings will also help revise the criteria for identifying and categorizing dictionary-relevant MWUs in relation to free phrases, and define more clearly the distinction between the lexicalized elements of MWUs and the more or less stable elements of their textual environment. This distinction will be useful when determining the canonical forms of MWUs in the lexicon on the one hand, and their relation to their variable elements and syntactic conversions on the other.


  • Multi-word units
  • Slovene
  • Identification
  • Categorization
  • Multi-word lexicon

  • DOI: 10.1007/978-3-030-30135-4_8
  • Chapter length: 14 pages
Fig. 1.


  1. In this case, lexicalized elements refer to the elements that must be present in each occurrence of the MWU and must always be realized by the same lexeme.

  2. For a detailed description, see Gantar et al. (2017, 2019).

  3. Collocations were also excluded from the PARSEME Shared Task annotation campaign.

  4.

  5. In related work on English MWUs, these expressions are usually called compounds. See Atkins and Rundell (2008: 171) for detailed classification.

  6. This category of MWUs has also been called compound prepositions (in spite of), MWUs with syntactic function (with regard to), prepositional phrases (in bed, in jail), complex prepositions (on top of), etc. For a more detailed overview, see Gantar et al. (2019).

  7.

  8.

  9. Each token in the ssj500k v2.1 corpus has a unique ID. We used IDs instead of word forms or word lemmas to join batches to avoid introducing noise in case the same form/lemma occurred multiple times in the sentence.

  10. The lemmatized form sorted in alphabetical order was used in order to aggregate strings that were essentially the same, but differed inflectionally, e.g. ustavno sodišče (‘constitutional court’ - nominative), ustavnega sodišča (‘constitutional court’ - genitive).

  11. For the sake of conciseness, each different form in the cluster is only shown once although it may actually appear multiple times.

  12. 24 clusters were excluded from the analysis either because of clustering errors (see Sect. 2.2) or because the annotator incorrectly included two MWUs in a single annotation or annotated only a single element of an otherwise correctly identified MWU.

  13. In some cases, the possessive pronoun can also be lexicalized, e.g. proti svoji volji ‘against his/her/their own will’.
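
Notes 9 and 10 describe two keys: token IDs for matching annotations of the same corpus occurrence, and alphabetically sorted lemmas for aggregating inflectional variants into one cluster. The following is a minimal sketch of that idea and not the authors' implementation. The record layout, the token ID scheme (`s1.t3`), the annotator labels, and the function names are all hypothetical and do not reflect the actual ssj500k format.

```python
from collections import defaultdict

# Hypothetical annotation records: each lists (token_id, form, lemma) triples.
# The ID scheme and field layout are assumptions made for illustration only.
annotations = [
    {"annotator": "A1", "tokens": [("s1.t3", "ustavno", "ustaven"),
                                   ("s1.t4", "sodišče", "sodišče")]},
    {"annotator": "A2", "tokens": [("s1.t3", "ustavno", "ustaven"),
                                   ("s1.t4", "sodišče", "sodišče")]},
    {"annotator": "A1", "tokens": [("s7.t2", "ustavnega", "ustaven"),
                                   ("s7.t3", "sodišča", "sodišče")]},
]


def occurrence_key(annotation):
    """Identify one corpus occurrence by its token IDs (cf. note 9)."""
    return tuple(sorted(tid for tid, _, _ in annotation["tokens"]))


def cluster_key(annotation):
    """Join the sorted lemmas so inflected variants share a key (cf. note 10).

    Python sorts by code point rather than Slovene collation; any
    deterministic order is enough for grouping.
    """
    return " ".join(sorted(lemma for _, _, lemma in annotation["tokens"]))


def build_clusters(records):
    """Group annotations by lemma key; track forms and annotators per occurrence."""
    clusters = defaultdict(
        lambda: {"forms": set(), "occurrences": defaultdict(set)}
    )
    for rec in records:
        cluster = clusters[cluster_key(rec)]
        # Each distinct surface form is stored once (cf. note 11).
        cluster["forms"].add(" ".join(form for _, form, _ in rec["tokens"]))
        cluster["occurrences"][occurrence_key(rec)].add(rec["annotator"])
    return clusters


clusters = build_clusters(annotations)
```

Keying occurrences on token IDs rather than forms keeps two instances of the same word in one sentence apart, which is the noise that note 9 refers to.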


  • Arhar Holdt, Š., Gorjanc, V.: Korpus FidaPLUS: nova generacija slovenskega referenčnega korpusa. Jezik in slovstvo 52(2), 95–110 (2007)

  • Atkins, B.T.S., Rundell, M.: The Oxford Guide to Practical Lexicography. Oxford University Press, New York (2008)

  • Gantar, P.: Stalne besedne zveze v slovenščini. Založba ZRC, ZRC SAZU, Ljubljana (2007)

  • Gantar, P., Krek, S.: Slovene lexical database. In: Majchráková, D., Garabík, R. (eds.) Proceedings of the Natural Language Processing, Multilinguality: Sixth International Conference, Modra, Slovakia, 20–21 October 2011, pp. 72–80. Tribun EU, Brno (2011)

  • Gantar, P.: Leksikografski opis slovenščine v digitalnem okolju. Znanstvena založba Filozofske fakultete UL, Ljubljana (2015)

  • Gantar, P., Krek, S., Kuzman, T.: Verbal multiword expressions in Slovene. In: Mitkov, R. (ed.) EUROPHRAS 2017. LNCS (LNAI), vol. 10596, pp. 247–259. Springer, Cham (2017)

  • Gantar, P., Colman, L., Parra Escartín, C., Martínez Alonso, H.: Multiword expressions: between lexicography and NLP. Int. J. Lexicogr. 32(2), 138–162 (2019)

  • Hanks, P., El Marouf, I., Oakes, M.: Flexibility of multiword expressions and corpus pattern analysis. In: Sailer, M., Markantonatou, S. (eds.) Multiword Expressions: Insights from a Multi-lingual Perspective, pp. 93–119. Language Science Press, Berlin (2018)

  • Hunston, S., Francis, G.: Pattern Grammar: A Corpus-Driven Approach to the Lexical Grammar of English. John Benjamins, Amsterdam (2000)

  • Kosem, I., et al.: Collocations Dictionary of Modern Slovene (2018)

  • Kosem, I., Krek, S., Gantar, P., Arhar Holdt, Š., Čibej, J., Laskowski, C.: Kolokacijski slovar sodobne slovenščine. In: Proceedings of the Conference on Language Technologies & Digital Humanities, Ljubljana, pp. 133–139 (2018)

  • Kilgarriff, A., Rychly, P., Smrz, P., Tugwell, D.: The sketch engine. Inf. Technol. 105, 116–127 (2004)

  • Krek, S., Gantar, P., Kosem, I., Gorjanc, V., Laskowski, C.: Baza kolokacijskega slovarja slovenskega jezika. In: Proceedings of the Conference on Language Technologies & Digital Humanities, Ljubljana, pp. 101–105 (2016)

  • Krek, S., et al.: Training corpus ssj500k 2.1, Slovenian language resource repository CLARIN.SI (2018)

  • Moon, R.: Fixed Expressions and Idioms in English. A Corpus-Based Approach. Clarendon Press, Oxford (1998)

  • Sag, I.A., Baldwin, T., Bond, F., Copestake, A., Flickinger, D.: Multiword expressions: a pain in the neck for NLP. In: Gelbukh, A. (ed.) Proceedings of the 3rd International Conference on Intelligent Text Processing and Computational Linguistics (CICLing 2002), pp. 1–15 (2002)

  • Sinclair, J. (ed.): Looking Up: An Account of the COBUILD Project in Lexical Computing and the Development of the Collins COBUILD English Language Dictionary. Collins, London and Glasgow (1987)

  • Sinclair, J.: Corpus, Concordance, Collocation. Oxford University Press, Oxford (1991)

  • Sinclair, J.: The lexical item. In: Weigand, E. (ed.) Contrastive Lexical Semantics, pp. 1–24. John Benjamins Publishing Company, Amsterdam/Philadelphia (1998)



The study presented in this paper was conducted within the New Grammar of Modern Standard Slovene: Resource and Methods project (J6-8256), which was financially supported by the Slovenian Research Agency between 2017 and 2020. The authors also acknowledge the financial support from the Slovenian Research Agency (research core funding No. P6-0411 - Language Resources and Technologies for Slovene and No. P6-0215 - Slovene Language - Basic, Contrastive, and Applied Studies). The authors would also like to thank the annotators: Anna Maria Grego, Tjaša Jelovšek, Tajda Liplin Šerbetar, Pia Rednak, Jana Vaupotič, Zala Vidic, Karolina Zgaga, and Kaja Žvanut.

Author information


Corresponding authors

Correspondence to Polona Gantar, Jaka Čibej or Mija Bon.

Copyright information

© 2019 Springer Nature Switzerland AG

About this paper

Cite this paper

Gantar, P., Čibej, J., Bon, M. (2019). Slovene Multi-word Units: Identification, Categorization, and Representation. In: Corpas Pastor, G., Mitkov, R. (eds) Computational and Corpus-Based Phraseology. EUROPHRAS 2019. Lecture Notes in Computer Science, vol 11755. Springer, Cham.

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-030-30134-7

  • Online ISBN: 978-3-030-30135-4

  • eBook Packages: Computer Science, Computer Science (R0)