Abstract
A morphological analyzer is a significant component of many natural language processing applications, especially for morphologically rich languages. It divides an input word into all of its constituent morphemes and identifies their morphological roles. This paper introduces a comprehensive morphological analyzer for Central Kurdish (CK), also known as Sorani, a low-resourced language with rich morphology. Building upon the limited existing literature, we first assembled and systematically categorized an extensive collection of the morphological and morphophonological rules of the language. We also collected and manually labeled a generative lexicon of nearly 10,000 entries, comprising verb, noun, and adjective stems, named entities, and other types of word stems. We used these rule sets and resources to implement the CKMorph analyzer, which is based on finite-state transducers. To provide a benchmark for future research, we collected, manually labeled, and publicly shared test sets for evaluating the accuracy and coverage of the analyzer. CKMorph correctly analyzed 95.9% of the first test set, which contains 1,000 CK words morphologically analyzed according to their context. Moreover, CKMorph produced at least one analysis for 95.5% of the 4.22 million CK tokens in the second test set. A demonstration of the application and the resources, including the CK verb database and the test sets, are openly accessible at github.com/CKMorph.
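The two evaluation figures above measure different things: accuracy compares the analyzer's output against a manually labeled analysis, while coverage only asks whether any analysis was produced. The following minimal sketch illustrates that distinction. It is not the CKMorph implementation or its API: the `toy_analyzer` function, its English entries, and the tag strings are hypothetical placeholders, and counting a word as correct when the gold analysis appears among the returned analyses is an assumption, not the paper's stated scoring rule.

```python
from typing import Callable, Iterable, List, Tuple

# An analyzer maps a surface word to zero or more analysis strings.
Analyzer = Callable[[str], List[str]]


def coverage(analyzer: Analyzer, tokens: Iterable[str]) -> float:
    """Fraction of tokens that receive at least one analysis."""
    tokens = list(tokens)
    if not tokens:
        return 0.0
    analyzed = sum(1 for token in tokens if analyzer(token))
    return analyzed / len(tokens)


def accuracy(analyzer: Analyzer, gold: Iterable[Tuple[str, str]]) -> float:
    """Fraction of (word, gold analysis) pairs whose gold analysis is returned.

    Assumption: a word counts as correct if the gold analysis appears
    anywhere in the analyzer's output for that word.
    """
    gold = list(gold)
    if not gold:
        return 0.0
    correct = sum(1 for word, expected in gold if expected in analyzer(word))
    return correct / len(gold)


# Hypothetical stand-in for a finite-state analyzer: a plain lookup table
# with made-up English entries and tags, used only to exercise the metrics.
TOY_LEXICON = {
    "cats": ["cat+N+PL"],
    "walked": ["walk+V+PAST"],
    "saw": ["see+V+PAST", "saw+N+SG"],  # ambiguous word: two analyses
}


def toy_analyzer(word: str) -> List[str]:
    return TOY_LEXICON.get(word, [])


if __name__ == "__main__":
    tokens = ["cats", "saw", "xyz", "walked"]
    gold = [
        ("cats", "cat+N+PL"),
        ("saw", "saw+N+SG"),
        ("walked", "walk+V+PRES"),  # gold disagrees with the toy lexicon
        ("xyz", "xyz+N+SG"),        # word unknown to the toy lexicon
    ]
    print("coverage:", coverage(toy_analyzer, tokens))
    print("accuracy:", accuracy(toy_analyzer, gold))
```

Because coverage needs no gold labels, it can be computed over a very large unlabeled corpus, whereas accuracy is limited by the size of the manually labeled set.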
Data availability
The resources, including the Central Kurdish verb database (https://doi.org/10.5281/zenodo.6300522) and the evaluation data sets (https://doi.org/10.5281/zenodo.6300602), are publicly accessible in the CKMorph project repository at github.com/CKMorph.
About this article
Cite this article
Naserzade, M., Mahmudi, A., Veisi, H. et al. CKMorph: a comprehensive morphological analyzer for Central Kurdish. Int J Digit Humanities 5, 187–232 (2023). https://doi.org/10.1007/s42803-022-00062-7
DOI: https://doi.org/10.1007/s42803-022-00062-7