Verbal Multiword Expressions in Slovene

Part of the Lecture Notes in Computer Science book series (LNAI, volume 10596)


This paper discusses the building of a manually annotated training corpus of Slovene verbal multiword expressions (VMWEs), which was part of the PARSEME shared task covering eighteen languages from various language families. In the course of the project, annotation guidelines were compiled that describe the annotation scope in detail and propose a multilingual system for verbal MWE categorisation. We present the methods of identification, the annotation scope, and the linguistic tests that determine the structural, syntactic and lexical characteristics of candidate verbal MWE lexical units. Furthermore, we highlight examples that apply specifically to the Slovene language. We also present the tools and previously available data used in the project: an annotation tool and a syntactically and morphosyntactically annotated training corpus for Slovene.


  • PARSEME shared task
  • Verbal multiword expressions
  • Categorisation
  • Training corpus
  • Slovene

  • DOI: 10.1007/978-3-319-69805-2_18
  • Chapter length: 13 pages


  1.

  2.

  3.

    The final corpus, consisting of 5.5 million tokens and 60,000 VMWE annotations in eighteen languages, is available on:

  4.

  5.

  6.

    Fixed expressions differ from collocations, which are usually not independent lexical units. The term “fixed expressions” mostly covers semi-terminological word combinations, such as stara mama ‘grandmother’, dnevna soba ‘living room’ etc. Collocations, being semantically transparent units, are not included in the categorisation of VMWEs.

  7.

    In Slovene categorisation, this VMWE type has been considered problematic, since its syntactic structure consists of a basic relationship between subject and predicate, and such examples are formally categorised as “PUs with S-structure” (Toporišič 1973/1974). However, as they function similarly to “verbal PUs”, an independent subcategory, “false verbal PUs”, was suggested (Kržišnik 1994, pp. 63, 66). It forms part of the top-level category “phrase structure PUs”, which separates these units from “S-structure PUs” such as Obleka naredi človeka ‘Clothes make the man’ or Čas je denar ‘Time is money’.

  8.

    The rule also applies to noun phrases that function as subject complements, e.g. (biti) mož beseda ‘(to be) a man of his word’, (biti) alfa in omega ‘(to be) alpha and omega’ (meaning ‘(to be) the basis of something’).

  9.

    Within the shared task, it was decided that cases in which the verb has light semantics per se, e.g. commit a crime in English, or izkazati interes ‘to show interest’ or vzeti v službo (lit. ‘to take into the job’, i.e. ‘to employ’) in Slovene, are treated as LVCs, given the differing understandings of the notion of “light verbs” and the possibility of distinguishing between support verbs and light verbs, or between light verbs and vague action verbs.

  10.

    Free word order is a feature of a language that has either enough grammatical markers to eliminate any ambiguity in meaning, or easily identifiable verbs and a word order in which the subject always comes first, so that it cannot be confused with the object.

  11.

    It is interesting that this type of VMWE was not considered in the PARSEME shared task for other Slavic languages, such as Polish, Czech, Croatian and Bulgarian. VPCs are typical mainly of English and other Germanic languages.

  12.

    E.g. gre za vodičem ‘he follows the guide’ in comparison to gre za naše temeljno načelo ‘it is about our fundamental principle’.

  13.

    Accessible on repository:

  14.

  15.

  16.

    The current state of the annotation in FLAT can be observed at:


  • Arhar Holdt, Š., Gorjanc, V.: Korpus FidaPLUS: nova generacija slovenskega referenčnega korpusa. Jezik in slovstvo 52(2), 95–110 (2007)

  • Atkins, B.T.S., Rundell, M.: The Oxford Guide to Practical Lexicography. Oxford University Press, New York (2008)

  • Baldwin, T., Kim, S.N.: Multiword expressions. In: Indurkhya, N., Damerau, F.J. (eds.) Handbook of Natural Language Processing, 2nd edn, pp. 267–292. CRC Press, Boca Raton (2010)

  • Dobrovoljc, K., Krek, S., Rupnik, J.: Skladenjski razčlenjevalnik za slovenščino. In: Erjavec, T., Žganec Gros, J. (eds.) Zbornik Osme konference Jezikovne tehnologije, pp. 42–47. Institut Jožef Stefan, Ljubljana (2012)

  • Gantar, P., Kosem, I., Krek, S.: Discovering automated lexicography: the case of Slovene lexical database. Int. J. Lexicogr. 29(2), 220–225 (2016)

  • Gantar, P., Krek, S.: Slovene lexical database. In: Majchraková, D., Garabík, R. (eds.) Natural Language Processing, Multilinguality: Sixth International Conference, pp. 72–80. Slovenská akadémia vied, Jazykovedný ústav Ľudovíta Štúra, Modra (2011)

  • Keber, J.: Slovar slovenskih frazemov. Založba ZRC. Inštitut za slovenski jezik Frana Ramovša, Ljubljana (2011)

  • Kozlevčar Černelič, I.: O funkciji glagolov z oslabljenim pomenom tipa biti. Jezik in slovstvo 21(3), 76–81 (1975)

  • Krek, S., Dobrovoljc, K., Erjavec, T.: Training corpus ssj500k 1.4. Slovenian language resource repository CLARIN.SI (2015). Accessed 15 June 2017

  • Kržišnik, E.: Slovenski glagolski frazemi (ob primeru glagolov govorjenja). Doktorska disertacija. Univerza v Ljubljani, Filozofska fakulteta (1994)

  • Sag, I.A., Baldwin, T., Bond, F., Copestake, A., Flickinger, D.: Multiword expressions: a pain in the neck for NLP. In: Gelbukh, A. (ed.) CICLing 2002. LNCS, vol. 2276, pp. 1–15. Springer, Heidelberg (2002). doi:10.1007/3-540-45715-1_1

  • Toporišič, J.: K izrazju in tipologiji slovenske frazeologije. Jezik in slovstvo (8), 273–279 (1973/1974)

  • Vidovič Muha, A.: Slovensko leksikalno pomenoslovje – Govorica slovarja. Znanstveni inštitut Filozofske fakultete, Ljubljana (2000)


Author information

Corresponding author

Correspondence to Simon Krek.

Copyright information

© 2017 Springer International Publishing AG

About this paper

Cite this paper

Gantar, P., Krek, S., Kuzman, T. (2017). Verbal Multiword Expressions in Slovene. In: Mitkov, R. (ed.) Computational and Corpus-Based Phraseology. EUROPHRAS 2017. Lecture Notes in Computer Science, vol 10596. Springer, Cham.

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-319-69804-5

  • Online ISBN: 978-3-319-69805-2

  • eBook Packages: Computer Science, Computer Science (R0)