In this paper, we present the results of a manual annotation of a Slovene training corpus with multi-word units (MWUs) relevant for inclusion in a lexicon of Slovene MWUs. We analyze the annotations in terms of (a) the frequency with which a string has been identified as an MWU, (b) the degree to which the annotators agree on the category of the identified MWU, and (c) the degree to which the annotators agree on the range of the MWU in terms of its lexicalized elements. The results of the analysis will be useful in different stages of the compilation of a Slovene MWU lexicon. The list of dictionary-relevant MWUs obtained in the annotation task will be used to enrich the lexicon and to train models for the automatic identification of MWUs in running text. The findings will also help revise the criteria for identifying and categorizing dictionary-relevant MWUs in relation to free phrases, and define more clearly the distinction between the lexicalized elements of MWUs and the more or less stable elements of their textual environment. This distinction will be useful when determining the canonical forms of MWUs in the lexicon on the one hand, and their relation to their variable elements and syntactic conversions on the other.
- Multi-word units
- Multi-word lexicon
In this case, lexicalized elements refer to the elements that must be present in each occurrence of the MWU and must always be realized by the same lexeme.
Collocations were also excluded from the PARSEME Shared Task annotation campaign.
In related work on English MWUs, these expressions are usually called compounds. See Atkins and Rundell (2008: 171) for detailed classification.
This category of MWUs has also been called compound prepositions (in spite of), MWUs with syntactic function (with regard to), prepositional phrases (in bed, in jail), complex prepositions (on top of), etc. For a more detailed overview, see Gantar et al. (2019).
Each token in the ssj500k v2.1 corpus has a unique ID. We used IDs instead of word forms or word lemmas to join batches to avoid introducing noise in case the same form/lemma occurred multiple times in the sentence.
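The ID-based join described in this note can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' actual procedure: the record layout (`annotator`, `token_ids`, `category`), the ID strings, and the category label are hypothetical stand-ins for the corpus identifiers and annotation scheme.

```python
from collections import defaultdict

def merge_batches(batches):
    """Group annotations from several batches by the exact token IDs they cover."""
    merged = defaultdict(list)
    for batch in batches:
        for ann in batch:
            # A frozenset of token IDs identifies the annotated span unambiguously,
            # even if the same word form or lemma occurs twice in the sentence.
            key = frozenset(ann["token_ids"])
            merged[key].append((ann["annotator"], ann["category"]))
    return dict(merged)

# Hypothetical records: two annotators mark the same span, one marks an extra span.
batch_a = [{"annotator": "A", "token_ids": ["s1.t2", "s1.t3"], "category": "nominal"}]
batch_b = [
    {"annotator": "B", "token_ids": ["s1.t2", "s1.t3"], "category": "nominal"},
    {"annotator": "B", "token_ids": ["s1.t7", "s1.t8"], "category": "nominal"},
]
merged = merge_batches([batch_a, batch_b])
```

Keying on IDs rather than on strings means that two occurrences of the same word in one sentence can never be conflated when the batches are combined.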
The lemmatized form, sorted alphabetically, was used to aggregate strings that were essentially the same but differed inflectionally, e.g. ustavno sodišče (‘constitutional court’, nominative) and ustavnega sodišča (‘constitutional court’, genitive).
For the sake of conciseness, each different form in the cluster is only shown once although it may actually appear multiple times.
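The aggregation described in the two preceding notes can be illustrated with a short sketch. It assumes that each annotation is available as a pair of surface string and lemma list; the function names and the lemma values in the example are hypothetical, and Python's default string ordering stands in for whatever alphabetical ordering was actually used.

```python
from collections import defaultdict

def cluster_key(lemmas):
    """Build an order-independent key from the lemmas of an annotated string."""
    return tuple(sorted(lemma.lower() for lemma in lemmas))

def cluster_annotations(annotations):
    """Map each lemma key to the distinct surface forms annotated for it."""
    clusters = defaultdict(list)
    for surface, lemmas in annotations:
        forms = clusters[cluster_key(lemmas)]
        if surface not in forms:  # list each distinct form only once
            forms.append(surface)
    return dict(clusters)

# Hypothetical input: the nominative form is annotated twice, the genitive once.
annotations = [
    ("ustavno sodišče", ["ustaven", "sodišče"]),
    ("ustavnega sodišča", ["ustaven", "sodišče"]),
    ("ustavno sodišče", ["ustaven", "sodišče"]),
]
clusters = cluster_annotations(annotations)
```

Because the key ignores word order, strings that contain the same lemmas in a different order would also share a cluster in this sketch.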
Twenty-four clusters were excluded from the analysis, either because of clustering errors (see Sect. 2.2) or because the annotator had incorrectly included two MWUs in a single annotation or had annotated only a single element of an otherwise correctly identified MWU.
In some cases, the possessive pronoun can also be lexicalized, e.g. proti svoji volji ‘against his/her/their own will’.
Arhar Holdt, Š., Gorjanc, V.: Korpus FidaPLUS: nova generacija slovenskega referenčnega korpusa. Jezik in slovstvo 52(2), 95–110 (2007)
Atkins, B.T.S., Rundell, M.: The Oxford Guide to Practical Lexicography. Oxford University Press, New York (2008)
Gantar, P.: Stalne besedne zveze v slovenščini. Založba ZRC, ZRC SAZU, Ljubljana (2007)
Gantar, P., Krek, S.: Slovene lexical database. In: Majchráková, D., Garabík, R. (eds.) Proceedings of the Natural Language Processing, Multilinguality: Sixth International Conference, Modra, Slovakia, 20–21 October 2011, pp. 72–80. Tribun EU, Brno (2011)
Gantar, P.: Leksikografski opis slovenščine v digitalnem okolju. Znanstvena založba Filozofske fakultete UL, Ljubljana (2015)
Gantar, P., Krek, S., Kuzman, T.: Verbal multiword expressions in Slovene. In: Mitkov, R. (ed.) EUROPHRAS 2017. LNCS (LNAI), vol. 10596, pp. 247–259. Springer, Cham (2017). https://doi.org/10.1007/978-3-319-69805-2_18
Gantar, P., Colman, L., Parra Escartín, C., Martínez Alonso, H.: Multiword expressions: between lexicography and NLP. Int. J. Lexicogr. 32(2), 138–162 (2019). https://doi.org/10.1093/ijl/ecy012
Hanks, P., El Marouf, I., Oakes, M.: Flexibility of multiword expressions and corpus pattern analysis. In: Sailer, M., Markantonatou, S. (eds.) Multiword Expressions: Insights from a Multi-lingual Perspective, pp. 93–119. Language Science Press, Berlin (2018)
Hunston, S., Francis, G.: Pattern Grammar: A Corpus-Driven Approach to the Lexical Grammar of English. John Benjamins, Amsterdam (2000)
Kosem, I., et al.: Collocations Dictionary of Modern Slovene (2018). https://viri.cjvt.si/kolokacije/eng/
Kosem, I., Krek, S., Gantar, P., Arhar Holdt, Š., Čibej, J., Laskowski, C.: Kolokacijski slovar sodobne slovenščine. In: Proceedings of the Conference on Language Technologies & Digital Humanities, Ljubljana, pp. 133–139 (2018)
Kilgarriff, A., Rychly, P., Smrz, P., Tugwell, D.: The sketch engine. Inf. Technol. 105, 116–127 (2004)
Krek, S., Gantar, P., Kosem, I., Gorjanc, V., Laskowski, C.: Baza kolokacijskega slovarja slovenskega jezika. In: Proceedings of the Conference on Language Technologies & Digital Humanities, Ljubljana, pp. 101–105 (2016)
Krek, S., et al.: Training corpus ssj500k 2.1, Slovenian language resource repository CLARIN.SI (2018). http://hdl.handle.net/11356/1181
Moon, R.: Fixed Expressions and Idioms in English. A Corpus-Based Approach. Clarendon Press, Oxford (1998)
Sag, I.A., Baldwin, T., Bond, F., Copestake, A., Flickinger, D.: Multiword expressions: a pain in the neck for NLP. In: Gelbukh, A. (ed.) Proceedings of the 3rd International Conference on Intelligent Text Processing and Computational Linguistics (CICLing 2002), pp. 1–15 (2002)
Sinclair, J. (ed.): Looking Up: An Account of the COBUILD Project in Lexical Computing and the Development of the Collins COBUILD English Language Dictionary. Collins, London and Glasgow (1987)
Sinclair, J.: Corpus, Concordance, Collocation. Oxford University Press, Oxford (1991)
Sinclair, J.: The lexical item. In: Weigand, E. (ed.) Contrastive Lexical Semantics, pp. 1–24. John Benjamins Publishing Company, Amsterdam/Philadelphia (1998)
The study presented in this paper was conducted within the New Grammar of Modern Standard Slovene: Resource and Methods project (J6-8256), which was financially supported by the Slovenian Research Agency between 2017 and 2020. The authors also acknowledge the financial support from the Slovenian Research Agency (research core funding No. P6-0411 - Language Resources and Technologies for Slovene and No. P6-0215 - Slovene Language – Basic, Contrastive, and Applied Studies). The authors would also like to thank the annotators: Anna Maria Grego, Tjaša Jelovšek, Tajda Liplin Šerbetar, Pia Rednak, Jana Vaupotič, Zala Vidic, Karolina Zgaga, and Kaja Žvanut.
© 2019 Springer Nature Switzerland AG
Cite this paper
Gantar, P., Čibej, J., Bon, M. (2019). Slovene Multi-word Units: Identification, Categorization, and Representation. In: Corpas Pastor, G., Mitkov, R. (eds) Computational and Corpus-Based Phraseology. EUROPHRAS 2019. Lecture Notes in Computer Science, vol 11755. Springer, Cham. https://doi.org/10.1007/978-3-030-30135-4_8
Publisher Name: Springer, Cham
Print ISBN: 978-3-030-30134-7
Online ISBN: 978-3-030-30135-4
eBook Packages: Computer Science, Computer Science (R0)