Burmese (Myanmar) Name Romanization: A Sub-syllabic Segmentation Scheme for Statistical Solutions

  • Conference paper
Computational Linguistics (PACLING 2017)

Part of the book series: Communications in Computer and Information Science (CCIS, volume 781)

Abstract

We focus on Burmese name Romanization, a critical task in translating Burmese into languages that use Latin script. As Burmese is under-researched and not well resourced, we collected and manually annotated 2,335 Romanization instances to enable statistical approaches. The annotation includes string segmentation and alignment between Burmese and Latin scripts. Although previous studies regard syllables as unbreakable units when processing Burmese, in this study Burmese strings are segmented into well-designed sub-syllabic units to achieve precise and consistent alignment with Latin script. The experiments show that sub-syllabic units are better units than syllables for statistical approaches to Burmese name Romanization. The annotated data and segmentation program have been released under a CC BY-NC-SA license.

Notes

  1. http://www.nlpresearch-ucsy.edu.mm/NLP_UCSY/name-db.html.

  2. Typical ones are the Myanmar Language Commission Transcription System, the Library of Congress's ALA-LC Romanization index system for Burmese (http://www.loc.gov/catdir/cpso/romanization/burmese.pdf), and Okell's system [13].

  3. Yayit originally represents while in the modern standard Burmese the phoneme has been merged into

  4. Actually a voiceless sign, e.g., changing to .

  5. Yapin can also be combined with .

  6. may be argued in some references. The combination appears marginally in loanwords and interjections.

  7. E.g., is actually or .

  8. E.g., changing to and changing to .

  9. The visarga is usually not transcribed, and aukmyit is inconsistently represented by a final t in Romanization.

  10. Multiple medial consonants for one initial consonant are possible, although yapin and yayit cannot appear simultaneously.

  11. As mentioned, glottal endings take no tones.

  12. However, the swapped order may cause no display problems, so both orders are used in everyday typing.

  13. Using GIZA++ [12] at http://www.statmt.org/moses/giza/GIZA++.html.

  14. An open-source tool is available at https://github.com/lemaoliu/Agtarbidir.

  15. http://taku910.github.io/crfpp/.

  16. http://www.phontron.com/kytea/.

  17. I.e., on the level in the bottom rank in Fig. 1, with no explicit alignment or unit boundaries between characters.

  18. SEG cannot be applied to the RNN approach as the alignment and segmentation are not explicit variables.

  19. I.e., the results in Tables 1 and 2 are based on the middle and upper-right parts in Fig. 1, respectively.

  20. The Romanization instance is taken directly from the released data set. A more common Romanization of the Pali-derived name is Wunna.

References

  1. Banchs, R.E., Zhang, M., Duan, X., Li, H., Kumaran, A.: Report of NEWS 2015 machine transliteration shared task. In: Proceedings of NEWS, pp. 10–23 (2015)

  2. Costa-Jussà, M.R.: Moses-based official baseline for NEWS 2016. In: Proceedings of NEWS, pp. 88–90 (2016)

  3. Ding, C., Thu, Y.K., Utiyama, M., Finch, A., Sumita, E.: Empirical dependency-based head finalization for statistical Chinese-, English-, and French-to-Myanmar (Burmese) machine translation. In: Proceedings of IWSLT, pp. 184–191 (2014)

  4. Ding, C., Thu, Y.K., Utiyama, M., Sumita, E.: Parsing Myanmar (Burmese) by using Japanese as a pivot. In: Proceedings of ICCA (Myanmar), pp. 158–162 (2016)

  5. Ding, C., Thu, Y.K., Utiyama, M., Sumita, E.: Word segmentation for Burmese (Myanmar). ACM Trans. Asian Low Resour. Lang. Inf. Process. 15(4), 22 (2016)

  6. Finch, A., Liu, L., Wang, X., Sumita, E.: Neural network transduction models in transliteration generation. In: Proceedings of NEWS, pp. 61–66 (2015)

  7. Finch, A., Liu, L., Wang, X., Sumita, E.: Target-bidirectional neural models for machine transliteration. In: Proceedings of NEWS, pp. 78–82 (2016)

  8. Lafferty, J., McCallum, A., Pereira, F.: Conditional random fields: probabilistic models for segmenting and labeling sequence data. In: Proceedings of ICML, pp. 282–289 (2001)

  9. Liu, L., Finch, A., Utiyama, M., Sumita, E.: Agreement on target-bidirectional LSTMs for sequence-to-sequence learning. In: Proceedings of AAAI, pp. 2630–2637 (2016)

  10. Naing, H.M.S., Hlaing, A.M., Pa, W.P., Hu, X., Thu, Y.K., Hori, C., Kawai, H.: A Myanmar large vocabulary continuous speech recognition system. In: Proceedings of APSIPA, pp. 320–327 (2015)

  11. Neubig, G., Nakata, Y., Mori, S.: Pointwise prediction for robust, adaptable Japanese morphological analysis. In: Proceedings of ACL-HLT, pp. 529–533 (2011)

  12. Och, F.J., Ney, H.: A systematic comparison of various statistical alignment models. Comput. Linguist. 29(1), 19–51 (2003)

  13. Okell, J.: A guide to the Romanization of Burmese (1971)

  14. Papineni, K., Roukos, S., Ward, T., Zhu, W.J.: Bleu: a method for automatic evaluation of machine translation. In: Proceedings of ACL, pp. 311–318 (2002)

  15. Sha, F., Pereira, F.: Shallow parsing with conditional random fields. In: Proceedings of HLT-NAACL, pp. 134–141 (2003)

  16. Thu, Y.K., Pa, W.P., Finch, A., Ni, J., Sumita, E., Hori, C.: The application of phrase based statistical machine translation techniques to Myanmar grapheme to phoneme conversion. In: Hasida, K., Purwarianti, A. (eds.) Computational Linguistics. CCIS, vol. 593, pp. 238–250. Springer, Singapore (2016). https://doi.org/10.1007/978-981-10-0515-2_17

  17. Thu, Y.K., Pa, W.P., Ni, J., Shiga, Y., Finch, A., Hori, C., Kawai, H., Sumita, E.: HMM based Myanmar text to speech system. In: Proceedings of INTERSPEECH, pp. 2237–2241 (2015)

Author information

Corresponding author

Correspondence to Chenchen Ding.

Appendix

Figure 6 shows specific annotation instances for further illustration. The data are organized in a three-section format of

  • original Burmese name,

  • original Romanization, and

  • aligned Burmese/Latin graphemes,

separated by |||.
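The three-section layout can be read with a few lines of Python. The following is only a minimal sketch under stated assumptions: it supposes that the third section is a whitespace-separated list of `burmese/latin` pairs (as the notation in the instance descriptions suggests), and the names `Instance` and `parse_instance` are hypothetical and not taken from the released segmentation program. The sample line uses the ASCII placeholders `B1` and `B2` in place of real Burmese units. How a space inside a multi-word Romanization is encoded in the aligned section is not described here, so the sketch does not handle it specially.

```python
from typing import List, NamedTuple, Tuple


class Instance(NamedTuple):
    """One annotated record (hypothetical container, for illustration only)."""
    burmese: str                   # section 1: original Burmese name
    romanization: str              # section 2: original Romanization
    pairs: List[Tuple[str, str]]   # section 3: aligned (Burmese unit, Latin unit)


def parse_instance(line: str) -> Instance:
    """Split a record on '|||' and break the aligned section into pairs."""
    fields = [field.strip() for field in line.split("|||")]
    if len(fields) != 3:
        raise ValueError(f"expected 3 sections, got {len(fields)}")
    burmese, romanization, aligned = fields

    pairs = []
    for token in aligned.split():
        # Assumption: the last '/' separates the Burmese unit from the Latin unit.
        source, sep, target = token.rpartition("/")
        if not sep:
            raise ValueError(f"missing '/' in aligned token: {token!r}")
        pairs.append((source, target))
    return Instance(burmese, romanization, pairs)


# 'B1' and 'B2' are ASCII stand-ins for Burmese sub-syllabic units.
record = parse_instance("B1B2 ||| Ga ||| B1/G B2/a")
print(record.romanization)
print(record.pairs)
```

A record with a missing section or an aligned token without a slash raises `ValueError`, which makes malformed lines easy to spot when scanning a data file.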

The descriptions of specific instances are as follows.

Fig. 6. Specific annotation instances of Burmese name Romanization.

  I. An ordinary Romanization instance.

  II. A Burmese name with a Western expression (Grace) as a component. Generally, such Western expressions are segmented according to their Burmese spellings. In this instance, Grace is segmented into /G /@ /r /a /@ /ce. Note that the same symbol @ is used for the dummy vowel on the Burmese side and for the silent placeholder on the Latin side, which causes no confusion.

  III. A Burmese name derived from Pali (Wanna; see Note 20), where stacked consonants appear (/n /n). The stacked consonants are split and aligned to separate Latin letters. If no doubled Latin letters are used, the second Burmese character is simply aligned to a silent placeholder @. The stacking operator is always aligned to @.

  IV. A Burmese name with complex stacking, in which the rhyme of the previous syllable (/in) is stacked with the following onset (/gy).

  V. A Burmese name with more complex stacking, in which part of the rhyme of the previous syllable (/ein) is stacked with the following onset (/g), which takes a further vowel diacritic (/i). Instances IV and V illustrate the necessity of segmenting stacked characters.

  VI. A Burmese name with stacked consonants, for which two syllables are kept as one word (Thinzar) in Romanization.

  VII. A Burmese name with stacked consonants, for which two syllables are separated into two words (Thin Zar) in Romanization. Note that the Burmese names in instances VI and VII are identical. They are treated as two different Romanization instances because their Romanized spellings differ.

Copyright information

© 2018 Springer Nature Singapore Pte Ltd.

About this paper

Cite this paper

Ding, C., Pa, W.P., Utiyama, M., Sumita, E. (2018). Burmese (Myanmar) Name Romanization: A Sub-syllabic Segmentation Scheme for Statistical Solutions. In: Hasida, K., Pa, W. (eds) Computational Linguistics. PACLING 2017. Communications in Computer and Information Science, vol 781. Springer, Singapore. https://doi.org/10.1007/978-981-10-8438-6_16

  • DOI: https://doi.org/10.1007/978-981-10-8438-6_16

  • Publisher Name: Springer, Singapore

  • Print ISBN: 978-981-10-8437-9

  • Online ISBN: 978-981-10-8438-6

  • eBook Packages: Computer Science, Computer Science (R0)
