Abstract
This paper examines how to obtain state-of-the-art Chinese word segmentation using global linear models. We provide experimental comparisons that give a detailed road map for obtaining state-of-the-art accuracy on various datasets. In particular, we compare reranking with full beam search; we compare various methods for learning weights for full-sentence features, such as language model features; and we compare an Averaged Perceptron global linear model with the Exponentiated Gradient max-margin algorithm.
This research was partially supported by NSERC, Canada (RGPIN: 264905) and by an IBM Faculty Award. Thanks to Michael Collins and Terry Koo for help with the EG implementation (any errors are our own), to the anonymous reviewers, and to the SIGHAN bakeoff organizers and participants.
Copyright information
© 2009 Springer-Verlag Berlin Heidelberg
About this paper
Cite this paper
Song, D., Sarkar, A. (2009). Training Global Linear Models for Chinese Word Segmentation. In: Gao, Y., Japkowicz, N. (eds) Advances in Artificial Intelligence. Canadian AI 2009. Lecture Notes in Computer Science, vol 5549. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-01818-3_15
DOI: https://doi.org/10.1007/978-3-642-01818-3_15
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-642-01817-6
Online ISBN: 978-3-642-01818-3
eBook Packages: Computer Science, Computer Science (R0)