Statistical Khmer Name Romanization

  • Chenchen Ding
  • Vichet Chea
  • Masao Utiyama
  • Eiichiro Sumita
  • Sethserey Sam
  • Sopheap Seng
Conference paper
Part of the Communications in Computer and Information Science book series (CCIS, volume 781)


We address the task of Khmer name Romanization. Although several standard Romanization systems exist for Khmer, conventional transcription methods prevail in practice. These methods are inconsistent and, in some cases, complicated, because they rest on unstable phonemic, orthographic, and etymological principles. Statistical approaches are therefore required for the task. We collect and manually align 7,658 Khmer name Romanization instances. The alignment scheme is designed to give a precise, consistent, and monotonic correspondence between the two writing systems at the grapheme level, which facilitates a range of machine learning approaches. Experimental results show that standard conditional random field and support vector machine approaches, supervised by the manual alignment, achieve a precision of 0.99 at the grapheme level, outperforming a state-of-the-art recurrent neural network approach applied in a pure sequence-to-sequence manner. The manually aligned data have been released under a CC BY-NC-SA license for the research community.
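As a rough illustration of why a monotonic grapheme-level alignment makes standard classifiers applicable, the sketch below turns one aligned name into a sequence-labeling instance: each source unit receives exactly one Latin chunk as its label, and the Romanization is the concatenation of the labels. The toy alignment, the function names, the context-window features, and the scoring function are all assumptions made for illustration. They are not taken from the released data or from the paper's actual alignment scheme, feature set, or metric definition.

```python
from typing import Dict, List, Tuple

# One aligned name: (source grapheme unit, Latin chunk), in monotonic order.
# Hypothetical toy alignment, NOT drawn from the released data.
Aligned = List[Tuple[str, str]]


def to_labeling_instance(aligned: Aligned) -> Tuple[List[str], List[str]]:
    """Split an aligned name into a source sequence and a label sequence."""
    source = [src for src, _ in aligned]
    labels = [tgt for _, tgt in aligned]
    return source, labels


def window_features(source: List[str], i: int, width: int = 1) -> Dict[str, str]:
    """Context-window features for position i (assumed feature design)."""
    feats = {}
    for off in range(-width, width + 1):
        j = i + off
        feats[f"g[{off}]"] = source[j] if 0 <= j < len(source) else "<pad>"
    return feats


def romanize(labels: List[str]) -> str:
    """Monotonic alignment means the output is the labels joined in order."""
    return "".join(labels)


def grapheme_precision(predicted: List[str], gold: List[str]) -> float:
    """Fraction of source units whose predicted label matches the gold label.

    Assumes one label per source unit; the paper's exact metric may differ.
    """
    if len(predicted) != len(gold):
        raise ValueError("predicted and gold must have the same length")
    correct = sum(p == g for p, g in zip(predicted, gold))
    return correct / len(gold)


toy: Aligned = [("ស", "s"), ("ុ", "o"), ("ខ", "k")]
source, labels = to_labeling_instance(toy)
features = [window_features(source, i) for i in range(len(source))]
```

The feature dictionaries and labels produced this way could be passed to any per-position classifier or linear-chain sequence labeler. A pure sequence-to-sequence model, by contrast, has to learn the correspondence without this supervision.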



Copyright information

© Springer Nature Singapore Pte Ltd. 2018

Authors and Affiliations

  • Chenchen Ding (1)
  • Vichet Chea (2)
  • Masao Utiyama (1)
  • Eiichiro Sumita (1)
  • Sethserey Sam (2)
  • Sopheap Seng (2)
  1. Advanced Translation Technology Laboratory, ASTREC, National Institute of Information and Communications Technology, Kyoto, Japan
  2. Research and Development Center, National Institute of Posts, Telecommunication and ICT, Phnom Penh, Cambodia
