From legacy encodings to Unicode: the graphical and logical principles in the scripts of South Asia
Much electronic text in the languages of South Asia has been published on the Internet. However, while Unicode has emerged as the favoured encoding system of corpus and computational linguists, most South Asian language data on the web uses one of a wide range of non-standard legacy encodings. This paper describes the difficulties inherent in converting text in these encodings to Unicode. Among the various legacy encodings for South Asian scripts, the most problematic are 8-bit fonts based on graphical principles (as opposed to the logical principles of Unicode). Graphical fonts typically encode several features in ways highly incompatible with Unicode. For instance, half-form glyphs used to construct conjunct consonants are typically separate code points in 8-bit fonts; in Unicode they are represented by the full consonant followed by virama. There are many more such cases. The solution described here is an approach to text conversion based on mapping rules. A small number of generalised rules (plus the capacity for more specialised rules) captures the behaviour of each character in a font, building up a conversion algorithm for that encoding. This system is embedded in a font-mapping program, outputting CES-compliant SGML Unicode. This program, a generalised text-conversion tool, has been employed extensively in corpus-building for South Asian languages.
Keywords: Unicode, Font, Devanagari, South Asian languages/scripts, Legacy text, Encoding, Conversion, Virama, Conjunct consonant, Vowel diacritic
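The rule-based conversion outlined in the abstract can be illustrated with a minimal sketch. It is not the font-mapping program itself: the byte values, the rule names (`full`, `half`, `prebase`, `direct`) and the `convert` function are all hypothetical, chosen only to show how a half-form glyph in a graphical 8-bit font might become a full consonant followed by virama (U+094D) in Unicode. The `prebase` rule illustrates one possible specialised rule: the Devanagari vowel sign I (U+093F) is drawn to the left of its consonant, but Unicode stores it after the consonant in logical order.

```python
VIRAMA = "\u094D"  # DEVANAGARI SIGN VIRAMA

# Hypothetical 8-bit font table: byte value -> (rule kind, Unicode character).
# These byte assignments are invented for illustration; a real legacy font
# needs its own table.
RULES = {
    0x6B: ("full", "\u0915"),     # full KA
    0x4B: ("half", "\u0915"),     # half-form KA glyph
    0x73: ("full", "\u0938"),     # full SA
    0x53: ("half", "\u0938"),     # half-form SA glyph
    0x74: ("full", "\u0924"),     # full TA
    0x66: ("prebase", "\u093F"),  # vowel sign I, drawn before the consonant
    0x20: ("direct", " "),        # characters that map one-to-one
}


def convert(data: bytes) -> str:
    """Convert text in the hypothetical graphical font to logical-order Unicode."""
    out = []
    pending = ""  # pre-base vowel sign waiting for the end of its consonant cluster
    for byte in data:
        if byte not in RULES:
            raise ValueError(f"no mapping rule for byte 0x{byte:02X}")
        kind, char = RULES[byte]
        if kind == "half":
            # Half-form glyph: full consonant followed by virama.
            out.append(char + VIRAMA)
        elif kind == "full":
            # A full consonant closes the cluster, so a pending vowel sign follows it.
            out.append(char + pending)
            pending = ""
        elif kind == "prebase":
            # Seen before the consonant visually; held back until the cluster ends.
            pending = char
        else:
            out.append(char)
    # Emit any vowel sign that never found a consonant rather than dropping it.
    return "".join(out) + pending


if __name__ == "__main__":
    # Half-form SA + full TA: SA, virama, TA in Unicode.
    print(convert(b"\x53\x74"))
```

Keeping the font-specific knowledge in a table and the behaviour in a handful of rule kinds means that supporting another font is mostly a matter of writing a new table, which is in the spirit of the small set of generalised rules the abstract describes.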