Khmer POS Tagging Using Conditional Random Fields

  • Sokunsatya Sangvat
  • Charnyote Pluempitiwiriyawej
Conference paper
Part of the Communications in Computer and Information Science book series (CCIS, volume 781)


A transformation-based approach combined with a hybrid of rule-based and trigram methods has already been introduced for Khmer part-of-speech (POS) tagging. To explore this topic further, we present an alternative approach to Khmer POS tagging using Conditional Random Fields (CRFs). Because the choice of features greatly affects tagging accuracy, we investigate five groups of features and use them with the CRF model. First, we study different kinds of contextual information and use them as our baseline model. We then analyze the characteristics of Khmer and derive three additional groups of language-related features: morphemes, word shapes, and named entities. We also explore the use of a lexicon as features to further improve the accuracy of our tagger. Our proposed approach has been evaluated on a corpus of 41,058 words and 27 POS tags. The comparative study shows that our approach achieves accuracy competitive with other Khmer POS tagging approaches.
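As a rough illustration of how such feature groups can be presented to a CRF, the sketch below builds a feature dictionary for each token from three of the groups named above: contextual words, word shape, and lexicon entries. It is a minimal sketch under stated assumptions, not the authors' feature set. The function names (`word_shape`, `token_features`), the shape codes, the window size, the `<PAD>` marker, and the tag names in the usage note are all hypothetical. Morpheme and named-entity features are left out because the abstract does not say how they are defined.

```python
from typing import Dict, List, Set


def word_shape(token: str) -> str:
    """Collapse a token into a coarse shape string (hypothetical coding).

    'd' = digit (Khmer digits included), 'k' = other Khmer-block character,
    'a' = other alphabetic character, 'p' = anything else.
    Runs of the same code are merged, so "abc123" becomes "ad".
    """
    shape: List[str] = []
    for ch in token:
        if ch.isdigit():
            code = "d"
        elif "\u1780" <= ch <= "\u17ff":  # Unicode Khmer block
            code = "k"
        elif ch.isalpha():
            code = "a"
        else:
            code = "p"
        if not shape or shape[-1] != code:
            shape.append(code)
    return "".join(shape)


def token_features(
    tokens: List[str],
    i: int,
    lexicon: Dict[str, Set[str]],
    window: int = 2,
) -> Dict[str, str]:
    """Build a feature dict for tokens[i] (illustrative features only)."""
    tok = tokens[i]
    feats: Dict[str, str] = {"w0": tok, "shape": word_shape(tok)}

    # Contextual features: neighbouring words within +/- window positions.
    for off in range(-window, window + 1):
        if off == 0:
            continue
        j = i + off
        feats[f"w{off:+d}"] = tokens[j] if 0 <= j < len(tokens) else "<PAD>"

    # Lexicon features: one indicator per tag the lexicon lists for the word.
    for tag in sorted(lexicon.get(tok, ())):
        feats[f"lex={tag}"] = "1"

    return feats


def sentence_features(
    tokens: List[str], lexicon: Dict[str, Set[str]]
) -> List[Dict[str, str]]:
    """One feature dict per token, the usual input shape for CRF toolkits."""
    return [token_features(tokens, i, lexicon) for i in range(len(tokens))]
```

For example, `sentence_features(["a", "b", "c"], {"a": {"NN"}})` returns three dictionaries, the first of which contains `"w-1": "<PAD>"`, `"w+1": "b"`, and `"lex=NN": "1"`. A list of such dictionaries per sentence, paired with the gold tag sequence, is the kind of input many CRF toolkits accept for training.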


Khmer, Part-of-speech tagging, POS tagging, Conditional Random Fields



This research project was supported by the Faculty of Information and Communication Technology, Mahidol University.



Copyright information

© Springer Nature Singapore Pte Ltd. 2018

Authors and Affiliations

  1. Faculty of Information and Communication Technology, Mahidol University, Salaya, Thailand
