
New Perspectives in Sinographic Language Processing through the Use of Character Structure

  • Conference paper
Computational Linguistics and Intelligent Text Processing (CICLing 2013)

Part of the book series: Lecture Notes in Computer Science (LNTCS, volume 7816)

Abstract

Chinese characters have a complex, hierarchical graphical structure that carries both semantic and phonetic information. We use this structure to enhance the text model and obtain better results in standard NLP operations. First, to tackle the problem of graphical variation, we define allographic classes of characters. Next, the relation of inclusion of a subcharacter in a character provides us with a directed graph of allographic classes. We equip this graph with two weights, semanticity (the semantic relation between subcharacter and character) and phoneticity (the phonetic relation), and calculate the “most semantic subcharacter paths” for each character. Finally, by adding the information contained in these paths to unigrams, we claim to increase the efficiency of text mining methods. We evaluate our method on a text classification task on two corpora (Chinese and Japanese) totalling 18 million characters and obtain an improvement of 3% over an already high baseline of 89.6% precision, obtained with a linear SVM classifier. Other possible applications and perspectives of the system are discussed.
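
The pipeline outlined in the abstract (an inclusion graph weighted by semanticity, a "most semantic" path per character, and unigrams enriched with that path) can be illustrated with a minimal sketch. Everything below is an assumption made for illustration: the toy graph `INCLUSION`, its weights, the greedy path rule, and the `sub:` feature prefix are hypothetical and are not taken from the paper, whose exact path definition is not given in the abstract.

```python
from collections import Counter

# Hypothetical toy inclusion graph: character -> {subcharacter: semanticity}.
# The entries and weights are invented for illustration only.
INCLUSION = {
    "湖": {"氵": 0.9, "胡": 0.1},
    "胡": {"古": 0.2, "月": 0.3},
    "河": {"氵": 0.9, "可": 0.1},
}


def most_semantic_path(char, graph):
    """Greedily follow the highest-semanticity subcharacter until a leaf.

    This greedy rule is an assumed stand-in for the paper's path definition.
    """
    path = []
    seen = {char}
    current = char
    while current in graph and graph[current]:
        nxt = max(graph[current], key=graph[current].get)
        if nxt in seen:  # guard against cycles in the graph
            break
        path.append(nxt)
        seen.add(nxt)
        current = nxt
    return path


def augmented_unigrams(text, graph):
    """Count character unigrams plus the subcharacters on each character's path."""
    counts = Counter()
    for ch in text:
        counts[ch] += 1
        for sub in most_semantic_path(ch, graph):
            counts["sub:" + sub] += 1
    return counts


features = augmented_unigrams("湖河", INCLUSION)
print(features)
```

In this toy case the two characters share no unigram, yet both contribute to the common feature `sub:氵`. That shared evidence is the kind of signal a downstream classifier such as a linear SVM could exploit.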



Copyright information

© 2013 Springer-Verlag Berlin Heidelberg

About this paper

Cite this paper

Haralambous, Y. (2013). New Perspectives in Sinographic Language Processing through the Use of Character Structure. In: Gelbukh, A. (ed.) Computational Linguistics and Intelligent Text Processing. CICLing 2013. Lecture Notes in Computer Science, vol 7816. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-37247-6_17

  • DOI: https://doi.org/10.1007/978-3-642-37247-6_17

  • Publisher Name: Springer, Berlin, Heidelberg

  • Print ISBN: 978-3-642-37246-9

  • Online ISBN: 978-3-642-37247-6

  • eBook Packages: Computer Science, Computer Science (R0)
