Topic Modeling of Chinese Language Using Character-Word Relations

Abstract
Topic models are hierarchical Bayesian models for language modeling and document analysis. They have been widely used, with considerable success, in modeling English documents. However, unlike English and most other alphabetic languages, the basic structural unit of Chinese is the character rather than the word, and Chinese words are written without spaces between them. Most previous work on applying topic models to Chinese documents ignored this character-word relationship and simply treated the word as the basic term of a document. In this paper, we propose a novel model that incorporates the character-word relation into topic modeling by placing an asymmetric prior on the topic-word distribution of the standard Latent Dirichlet Allocation (LDA) model. Compared to LDA, the proposed model improves document classification performance, especially when the test data contain a considerable number of Chinese words that do not appear in the training data.
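The paper is not accompanied by code, but the core idea, replacing LDA's single symmetric topic-word hyperparameter with a per-word prior informed by character composition, can be sketched. Below is a minimal illustration in Python; the function name asymmetric_beta and the specific weighting scheme (a word's prior mass grows with the corpus frequency of its constituent characters) are assumptions made for illustration, not the authors' exact formulation.

    # Hypothetical sketch: derive a per-word Dirichlet prior for LDA's
    # topic-word distributions from character-word composition. The weighting
    # rule below is an illustrative assumption, not the paper's formulation.
    from collections import Counter

    import numpy as np

    def asymmetric_beta(vocab, train_docs, base=0.01):
        """Return one Dirichlet prior weight per vocabulary word.

        vocab:      list of Chinese words (strings of one or more characters)
        train_docs: iterable of token lists, used to count character frequencies
        """
        char_counts = Counter(ch for doc in train_docs for word in doc for ch in word)
        total = sum(char_counts.values())
        beta = np.empty(len(vocab))
        for i, word in enumerate(vocab):
            # Mean relative frequency of the word's characters: a word built
            # from characters seen often in training keeps informed prior mass
            # even if the word itself never occurred.
            char_freq = np.mean([char_counts[ch] / total for ch in word])
            beta[i] = base * (1.0 + len(vocab) * char_freq)
        return beta

    # "脑海" never occurs as a word in training, but both of its characters do,
    # so it receives more prior mass than the flat base weight.
    vocab = ["电脑", "大海", "脑海"]
    train_docs = [["电脑", "大海"], ["电脑"]]
    print(asymmetric_beta(vocab, train_docs))

In a collapsed Gibbs sampler for LDA, such a vector simply replaces the symmetric hyperparameter: the sampling weight for assigning word w in document d to topic k becomes (n_{d,k} + alpha) * (n_{k,w} + beta_w) / (n_k + sum_v beta_v). Under this (assumed) construction, a test word whose characters were frequent in training still receives informed prior mass despite a zero training count, which is consistent with the classification gains the abstract reports on unseen words.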
Cite this paper
Zhao, Q., Qin, Z., Wan, T.: Topic Modeling of Chinese Language Using Character-Word Relations. In: Lu, B.-L., Zhang, L., Kwok, J. (eds.) Neural Information Processing. ICONIP 2011. LNCS, vol. 7064. Springer, Berlin, Heidelberg (2011). https://doi.org/10.1007/978-3-642-24965-5_16