
CKMorph: a comprehensive morphological analyzer for Central Kurdish

  • Original Paper
  • International Journal of Digital Humanities

Abstract

A morphological analyzer divides an input word into all its component morphemes and identifies their morphological roles; it is a significant component of many natural language processing applications, especially for morphologically rich languages. This paper introduces a comprehensive morphological analyzer for Central Kurdish (CK), also known as Sorani, a low-resourced language with rich morphology. Building upon the limited existing literature, we first assembled and systematically categorized an extensive collection of the morphological and morphophonological rules of the language. We also collected and manually labeled a generative lexicon containing nearly 10,000 verb, noun, and adjective stems, named entities, and other types of word stems. We used these rule sets and resources to implement the CKMorph analyzer based on finite-state transducers. To provide a benchmark for future research, we collected, manually labeled, and publicly shared test sets for evaluating the accuracy and coverage of the analyzer. CKMorph correctly analyzed 95.9% of the first test set, which contains 1,000 CK words morphologically analyzed in context. Moreover, CKMorph produced at least one analysis for 95.5% of the 4.22 million CK tokens in the second test set. A demonstration of the application and the resources, including the CK verb database and the test sets, are openly accessible at github.com/CKMorph.
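
The two evaluation measures named in the abstract can be illustrated with a minimal Python sketch. It is not CKMorph's code or API: `toy_analyze`, the English toy lexicon, and the tag strings are hypothetical stand-ins. It also assumes that a word counts as correctly analyzed when its gold analysis appears among the analyzer's candidates, a criterion this abstract does not spell out.

```python
from typing import Callable, Iterable, List, Sequence, Tuple

# An analyzer maps a surface word to zero or more candidate analyses.
Analyzer = Callable[[str], List[str]]


def coverage(tokens: Iterable[str], analyze: Analyzer) -> float:
    """Share of tokens that receive at least one analysis."""
    token_list = list(tokens)
    if not token_list:
        return 0.0
    covered = sum(1 for token in token_list if analyze(token))
    return covered / len(token_list)


def accuracy(gold: Sequence[Tuple[str, str]], analyze: Analyzer) -> float:
    """Share of (word, gold analysis) pairs whose gold analysis is among
    the analyzer's candidates (assumed criterion, see the lead-in)."""
    if not gold:
        return 0.0
    correct = sum(1 for word, gold_tag in gold if gold_tag in analyze(word))
    return correct / len(gold)


# Hypothetical toy lexicon standing in for a real finite-state analyzer.
TOY_LEXICON = {
    "books": ["book+N+PL"],
    "walked": ["walk+V+PST"],
    "run": ["run+V", "run+N"],
}


def toy_analyze(word: str) -> List[str]:
    """Return the toy analyses for a word, or an empty list if unknown."""
    return TOY_LEXICON.get(word, [])


cov = coverage(["books", "walked", "run", "xyzzy"], toy_analyze)
acc = accuracy(
    [("books", "book+N+PL"), ("run", "run+N"), ("walked", "walk+V+PRS")],
    toy_analyze,
)
print(f"coverage={cov:.3f} accuracy={acc:.3f}")
```

In this toy run three of the four tokens receive an analysis, and two of the three gold pairs are matched. Coverage needs only raw tokens, which is why it can be measured on a corpus of millions of tokens, whereas accuracy requires manually labeled gold analyses.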


Data availability

The resources, including the Central Kurdish verb database (https://doi.org/10.5281/zenodo.6300522) and the evaluation data sets (https://doi.org/10.5281/zenodo.6300602), are publicly accessible in the CKMorph project repository at github.com/CKMorph.

Notes

  1. http://hunspell.github.io/.

  2. https://github.com/unimorph/ckb

  3. Available at https://github.com/AsoSoft/AsoSoft-Text-Corpus

  4. https://github.com/CKMorph/Evaluations


Author information

Corresponding author

Correspondence to Hadi Veisi.

Rights and permissions

Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.


About this article


Cite this article

Naserzade, M., Mahmudi, A., Veisi, H. et al. CKMorph: a comprehensive morphological analyzer for Central Kurdish. Int J Digit Humanities 5, 187–232 (2023). https://doi.org/10.1007/s42803-022-00062-7


  • DOI: https://doi.org/10.1007/s42803-022-00062-7
