Morphological Disambiguation of Classical Sanskrit

  • Oliver Hellwig
Conference paper
Part of the Communications in Computer and Information Science book series (CCIS, volume 537)


Sanskrit, the “sacred language” of Ancient India, is a morphologically rich Indo-Iranian language that has received some attention in NLP during the last decade. This paper describes a system for the tokenization and morphosyntactic analysis of Sanskrit. The system combines a fixed morphological rule base with statistical selection of the most probable analysis of an input text. After an introduction to the research history and to the linguistic peculiarities of Sanskrit that are relevant to the task, the paper describes the present architecture of the system and new extensions that increase its accuracy on morphologically ambiguous forms. The algorithms are tested on a gold-annotated data set of 3,587,000 words.
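
The combination of a rule base with statistical selection can be illustrated with a minimal sketch. It is not the system described in the paper: the lexicon `RULE_BASE`, the tag names, the probabilities in `BIGRAM`, and the function `disambiguate` are all hypothetical, and a plain tag-bigram model with Viterbi search stands in for whatever statistical model the paper actually uses. The example only shows the general idea that rules propose candidate analyses and a statistical model picks the most probable sequence.

```python
import math

# Hypothetical rule base: each surface form maps to every analysis
# (lemma, tag) that the morphological rules allow for it.
RULE_BASE = {
    "rāmaḥ": [("rāma", "NOUN.nom.sg")],
    "vanam": [("vana", "NOUN.nom.sg"), ("vana", "NOUN.acc.sg")],
    # Finite verb, or locative singular of the present participle.
    "gacchati": [("gam", "VERB.3.sg.pres"), ("gam", "PTCP.loc.sg")],
}

# Invented tag-bigram probabilities, for illustration only.
BIGRAM = {
    ("<s>", "NOUN.nom.sg"): 0.6,
    ("NOUN.nom.sg", "NOUN.nom.sg"): 0.1,
    ("NOUN.nom.sg", "NOUN.acc.sg"): 0.4,
    ("NOUN.nom.sg", "VERB.3.sg.pres"): 0.3,
    ("NOUN.nom.sg", "PTCP.loc.sg"): 0.05,
    ("NOUN.acc.sg", "VERB.3.sg.pres"): 0.5,
    ("NOUN.acc.sg", "PTCP.loc.sg"): 0.05,
}
FLOOR = 1e-3  # probability assigned to tag pairs not listed above


def log_p(prev_tag, tag):
    """Log probability of `tag` following `prev_tag`."""
    return math.log(BIGRAM.get((prev_tag, tag), FLOOR))


def disambiguate(tokens):
    """Return the most probable list of (token, lemma, tag) triples.

    Viterbi search over the candidates proposed by RULE_BASE.
    Raises KeyError for tokens that the rule base does not cover.
    """
    # Maps the last chosen analysis to (log score, path so far).
    best = {("", "<s>"): (0.0, [])}
    for token in tokens:
        new_best = {}
        for lemma, tag in RULE_BASE[token]:
            score, path = max(
                (
                    (prev_score + log_p(prev[1], tag), prev_path)
                    for prev, (prev_score, prev_path) in best.items()
                ),
                key=lambda item: item[0],
            )
            new_best[(lemma, tag)] = (score, path + [(token, lemma, tag)])
        best = new_best
    return max(best.values(), key=lambda item: item[0])[1]


if __name__ == "__main__":
    for token, lemma, tag in disambiguate(["rāmaḥ", "vanam", "gacchati"]):
        print(token, lemma, tag)
```

With these invented numbers, the ambiguous forms `vanam` and `gacchati` are resolved to the accusative noun and the finite verb, because that tag sequence has the highest product of bigram probabilities.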


Word Order · Conditional Random Field · Parallel Corpus · Lexical Database · Noun Class
These keywords were added by machine and not by the authors. This process is experimental and the keywords may be updated as the learning algorithm improves.



Copyright information

© Springer International Publishing Switzerland 2015

Authors and Affiliations

  1. University of Düsseldorf, Düsseldorf, Germany
