Abstract
This paper proposes a refined Hierarchical Dirichlet Process (HDP) model for unsupervised Chinese word segmentation. The model uses a dictionary-based model to obtain a better estimate of the HDP base measure. We also show that the initial segmentation state plays a very important role in the performance of the HDP model: a better initial segmentation can lead to better performance. We test our model on the PKU and MSRA datasets provided by the Second International Chinese Word Segmentation Bakeoff (SIGHAN 2005) [1], and it outperforms the state-of-the-art systems.
References
[1] Emerson, T.: The second international Chinese word segmentation bakeoff. In: Proceedings of the Fourth SIGHAN Workshop on Chinese Language Processing, vol. 133 (2005)
[2] Duan, H., Sui, Z., Tian, Y., Li, W.: The CIPS-SIGHAN CLP 2012 Chinese word segmentation on microblog corpora bakeoff (2012)
[3] Zhao, H., Kit, C.: An empirical comparison of goodness measures for unsupervised Chinese word segmentation with a unified framework. In: The Third International Joint Conference on Natural Language Processing (IJCNLP 2008), Hyderabad, India (2008)
[4] Magistry, P., Sagot, B.: Unsupervized word segmentation: the case for Mandarin Chinese. In: Proceedings of the 50th Annual Meeting of the Association for Computational Linguistics: Short Papers, vol. 2, pp. 383–387. Association for Computational Linguistics (2012)
[5] Wang, H., Zhu, J., Tang, S., Fan, X.: A new unsupervised approach to word segmentation. Computational Linguistics 37(3), 421–454 (2011)
[6] Goldwater, S., Griffiths, T.L., Johnson, M.: A Bayesian framework for word segmentation: Exploring the effects of context. Cognition 112(1), 21–54 (2009)
[7] Teh, Y.W., Jordan, M.I., Beal, M.J., Blei, D.M.: Hierarchical Dirichlet processes. Journal of the American Statistical Association 101(476) (2006)
[8] Casella, G., George, E.I.: Explaining the Gibbs sampler. The American Statistician 46(3), 167–174 (1992)
[9] Xu, J., Gao, J., Toutanova, K., Ney, H.: Bayesian semi-supervised Chinese word segmentation for statistical machine translation. In: Proceedings of the 22nd International Conference on Computational Linguistics, vol. 1, pp. 1017–1024. Association for Computational Linguistics (2008)
[10] Kempe, A.: Experiments in unsupervised entropy-based corpus segmentation. In: Workshop of EACL in Computational Natural Language Learning, pp. 7–13 (1999)
[11] Tanaka-Ishii, K.: Entropy as an indicator of context boundaries: An experiment using a web search engine. In: Dale, R., Wong, K.-F., Su, J., Kwong, O.Y. (eds.) IJCNLP 2005. LNCS (LNAI), vol. 3651, pp. 93–105. Springer, Heidelberg (2005)
[12] Jin, Z., Tanaka-Ishii, K.: Unsupervised segmentation of Chinese text by use of branching entropy. In: Proceedings of the COLING/ACL on Main Conference Poster Sessions, pp. 428–435. Association for Computational Linguistics (2006)
[13] Church, K.W., Hanks, P.: Word association norms, mutual information, and lexicography. Computational Linguistics 16(1), 22–29 (1990)
Copyright information
© 2013 Springer-Verlag Berlin Heidelberg
About this paper
Cite this paper
Pei, W., Han, D., Chang, B. (2013). A Refined HDP-Based Model for Unsupervised Chinese Word Segmentation. In: Sun, M., Zhang, M., Lin, D., Wang, H. (eds) Chinese Computational Linguistics and Natural Language Processing Based on Naturally Annotated Big Data. NLP-NABD CCL 2013. Lecture Notes in Computer Science, vol 8202. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-41491-6_5
DOI: https://doi.org/10.1007/978-3-642-41491-6_5
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-642-41490-9
Online ISBN: 978-3-642-41491-6
eBook Packages: Computer Science (R0)