Abstract
This paper discusses the building of a manually annotated training corpus of Slovene verbal multiword expressions (verbal MWEs), which was part of the PARSEME shared task covering eighteen languages from various language families. In the course of the project, annotation guidelines were compiled, describing the annotation scope in detail and proposing a multilingual system for verbal MWE categorisation. In this paper, we present the methods of identification, the annotation scope and the linguistic tests that determine the structural, syntactic and lexical characteristics of candidate verbal MWEs. Furthermore, we highlight examples that apply specifically to the Slovene language. Tools and previously available data that were used in the project are also presented: an annotation tool and a syntactically and morphosyntactically annotated training corpus for Slovene.
Keywords
- PARSEME shared task
- Verbal multiword expressions
- Categorisation
- Training corpus
- Slovene
Notes
- 1.
- 2.
- 3.
The final corpus, consisting of 5.5 million tokens and 60,000 VMWE annotations in eighteen languages, is available at: https://gitlab.com/parseme/sharedtask-data/tree/master.
- 4.
- 5.
- 6.
Fixed expressions differ from collocations, which are usually not independent lexical units. The term “fixed expressions” mostly covers semi-terminological word combinations, such as stara mama ‘grandmother’, dnevna soba ‘living room’ etc. Collocations, being semantically transparent units, are not included in the categorisation of VMWEs.
- 7.
In Slovene categorisation, this VMWE type has been problematised, since its syntactic structure consists of a basic relationship between subject and predicate, and such examples are formally categorised as “PUs with S-structure” (Toporišič 1973/1974). However, as they function similarly to “verbal PUs”, an independent subcategory, “false verbal PUs”, was suggested (Kržišnik 1994, pp. 63, 66), which forms part of the top-level category “phrase structure PUs”, separating them from “S-structure PUs” such as: Obleka naredi človeka ‘Clothes make the man’ or Čas je denar ‘Time is money’.
- 8.
The rule also applies to noun phrases that function as subject complements, e.g. (biti) mož beseda ‘(to be) a man of his word’, (biti) alfa in omega ‘(to be) alpha and omega’ (meaning ‘(to be) the basis of something’).
- 9.
Within the shared task, the decision was taken that cases in which the verb has light semantics per se, e.g. commit a crime in English or izkazati interes (‘to show interest’) or vzeti v službo (lit. ‘to take into the job’, i.e. ‘to employ’) in Slovene, are treated as LVCs, given the different understandings of the notion of “light verbs” and the possibility of distinguishing between support verbs and light verbs, or between light verbs and vague action verbs.
- 10.
Free word order is a feature of a language that has either enough grammatical markers to eliminate any ambiguity in meaning, or easily identifiable verbs and a word order in which the subject always comes first, so that it cannot be confused with the object.
- 11.
It is interesting that this type of VMWE was not considered in the PARSEME shared task for other Slavic languages, such as Polish, Czech, Croatian and Bulgarian. VPCs are typical mostly of English and other Germanic languages.
- 12.
E.g. gre za vodičem ‘he follows the guide’ in comparison to gre za naše temeljno načelo ‘it is about our fundamental principle’.
- 13.
Accessible in the CLARIN.SI repository: https://www.clarin.si/repository/xmlui/handle/11356/1052.
- 14.
- 15.
- 16.
The current state of the annotation in FLAT can be observed at: http://mwe.phil.hhu.de/bot/mwe_count_perlang_html.
References
Arhar Holdt, Š., Gorjanc, V.: Korpus FidaPLUS: nova generacija slovenskega referenčnega korpusa. Jezik in slovstvo 52(2), 95–110 (2007)
Atkins, B.T.S., Rundell, M.: The Oxford Guide to Practical Lexicography. Oxford University Press, New York (2008)
Baldwin, T., Kim, S.N.: Multiword expressions. In: Indurkhya, N., Damerau, F.J. (eds.) Handbook of Natural Language Processing, 2nd edn, pp. 267–292. CRC Press, Boca Raton (2010)
Dobrovoljc, K., Krek, S., Rupnik, J.: Skladenjski razčlenjevalnik za slovenščino. In: Erjavec, T., Žganec Gros, J. (eds.) Zbornik Osme konference Jezikovne tehnologije, pp. 42–47. Institut Jožef Stefan, Ljubljana (2012)
Gantar, P., Kosem, I., Krek, S.: Discovering automated lexicography: the case of Slovene lexical database. Int. J. Lexicogr. 29(2), 220–225 (2016)
Gantar, P., Krek, S.: Slovene lexical database. In: Majchraková, D., Garabík, R. (eds.) Natural Language Processing, Multilinguality: Sixth International Conference, pp. 72–80. Slovenská akadémia vied, Jazikovedný ústav Ludovíta Štúra, Modra (2011)
Keber, J.: Slovar slovenskih frazemov. Založba ZRC. Inštitut za slovenski jezik Frana Ramovša, Ljubljana (2011)
Kozlevčar Černelič, I.: O funkciji glagolov z oslabljenim pomenom tipa biti. Jezik in slovstvo 21(3), 76–81 (1975)
Krek, S., Dobrovoljc, K., Erjavec, T.: Training corpus ssj500k 1.4. Slovenian language resource repository CLARIN.SI (2015). http://hdl.handle.net/11356/1052. Accessed 15 June 2017
Kržišnik, E.: Slovenski glagolski frazemi (ob primeru glagolov govorjenja). Univerza v Ljubljani, Doktorska disertacija. Filozofska fakulteta (1994)
Sag, I.A., Baldwin, T., Bond, F., Copestake, A., Flickinger, D.: Multiword expressions: a pain in the neck for NLP. In: Gelbukh, A. (ed.) CICLing 2002. LNCS, vol. 2276, pp. 1–15. Springer, Heidelberg (2002). doi:10.1007/3-540-45715-1_1
Toporišič, J.: K izrazju in tipologiji slovenske frazeologije. Jezik in slovstvo (8), 273–279 (1973/1974)
Vidovič Muha, A.: Slovensko leksikalno pomenoslovje – Govorica slovarja. Znanstveni inštitut Filozofske fakultete, Ljubljana (2000)
Copyright information
© 2017 Springer International Publishing AG
About this paper
Cite this paper
Gantar, P., Krek, S., Kuzman, T. (2017). Verbal Multiword Expressions in Slovene. In: Mitkov, R. (ed.) Computational and Corpus-Based Phraseology. EUROPHRAS 2017. Lecture Notes in Computer Science, vol 10596. Springer, Cham. https://doi.org/10.1007/978-3-319-69805-2_18
DOI: https://doi.org/10.1007/978-3-319-69805-2_18
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-69804-5
Online ISBN: 978-3-319-69805-2
eBook Packages: Computer Science, Computer Science (R0)