Exploring Multiple Chinese Word Segmentation Results Based on Linear Model

Su, Chen; Zhang, Yujie; Guo, Zhen; Xu, Jinan

doi:10.1007/978-3-642-41644-6_6

Part of the book series: Communications in Computer and Information Science (CCIS, volume 400)

Included in the following conference series:

CCF International Conference on Natural Language Processing and Chinese Computing

Abstract

In developing a domain-specific Chinese-English machine translation system, the accuracy of Chinese word segmentation on large amounts of training text often decreases because of unknown words. The lack of domain-specific annotated corpora prevents supervised learning approaches from adapting to a target domain. This problem causes many errors in translation knowledge extraction and therefore seriously lowers translation quality. To solve the domain adaptation problem, we implement Chinese word segmentation in two ways: by exploring n-gram statistical features in a large raw Chinese corpus, and by bilingually motivated Chinese word segmentation. Moreover, we propose a method of combining multiple Chinese word segmentation results based on a linear model to improve domain adaptation. For evaluation, we conduct Chinese word segmentation and Chinese-English machine translation experiments using the data of the NTCIR-10 Chinese-English patent task. The experimental results show that the proposed method improves both the F-measure of Chinese word segmentation and the BLEU score of the Chinese-English statistical machine translation system.
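
As an illustration of the two ideas named in the abstract, the following is a minimal sketch, not the chapter's implementation. It assumes that each candidate segmentation of a sentence comes with a few numeric feature scores, that a linear model scores a candidate as a weighted sum of those features, and that the highest-scoring candidate is selected. The feature names (`supervised`, `ngram`, `bilingual`), the weights, and the example values are hypothetical; the abstract does not specify the actual features, the weights, or the granularity at which results are combined. The F-measure here is the standard word-level measure computed over character spans.

```python
from typing import Dict, List, Sequence, Set, Tuple

Segmentation = List[str]
Candidate = Tuple[Segmentation, Dict[str, float]]


def spans(words: Sequence[str]) -> Set[Tuple[int, int]]:
    """Map a segmentation to the set of (start, end) character offsets of its words."""
    result: Set[Tuple[int, int]] = set()
    start = 0
    for word in words:
        result.add((start, start + len(word)))
        start += len(word)
    return result


def f_measure(predicted: Sequence[str], gold: Sequence[str]) -> float:
    """Word-level F-measure: a predicted word is correct if its span matches a gold span."""
    predicted_spans, gold_spans = spans(predicted), spans(gold)
    correct = len(predicted_spans & gold_spans)
    if correct == 0:
        return 0.0
    precision = correct / len(predicted_spans)
    recall = correct / len(gold_spans)
    return 2 * precision * recall / (precision + recall)


def linear_score(features: Dict[str, float], weights: Dict[str, float]) -> float:
    """Weighted sum of feature values; features without a weight contribute zero."""
    return sum(weights.get(name, 0.0) * value for name, value in features.items())


def select_segmentation(candidates: Sequence[Candidate],
                        weights: Dict[str, float]) -> Segmentation:
    """Return the candidate segmentation with the highest linear-model score."""
    best_words, _ = max(candidates, key=lambda c: linear_score(c[1], weights))
    return best_words


# Hypothetical weights and feature scores, chosen only for illustration.
weights = {"supervised": 0.4, "ngram": 0.3, "bilingual": 0.3}
candidates: List[Candidate] = [
    (["研究", "生命", "起源"], {"supervised": 0.6, "ngram": 0.9, "bilingual": 0.8}),
    (["研究生", "命", "起源"], {"supervised": 0.7, "ngram": 0.3, "bilingual": 0.2}),
]
gold = ["研究", "生命", "起源"]

chosen = select_segmentation(candidates, weights)
print(chosen, f_measure(chosen, gold))
```

In a setting like the one the abstract describes, the weights would be tuned on held-out data rather than set by hand; how the chapter does this is not stated in the abstract.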

Author information

Authors and Affiliations

School of Computer and Information Technology, Beijing Jiaotong University, Beijing, 100044, China
Chen Su, Yujie Zhang, Zhen Guo & Jinan Xu

Editor information

Editors and Affiliations

Soochow University, 1 Shizi Street, 215006, Suzhou, China
Guodong Zhou
Department of Computer Science and Technology, Tsinghua University, 100084, Beijing, China
Juanzi Li
Institute of Computer Science & Technology, Peking University, 100871, Beijing, China
Dongyan Zhao & Yansong Feng

About this paper

Cite this paper

Su, C., Zhang, Y., Guo, Z., Xu, J. (2013). Exploring Multiple Chinese Word Segmentation Results Based on Linear Model. In: Zhou, G., Li, J., Zhao, D., Feng, Y. (eds) Natural Language Processing and Chinese Computing. NLPCC 2013. Communications in Computer and Information Science, vol 400. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-41644-6_6

DOI: https://doi.org/10.1007/978-3-642-41644-6_6
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-642-41643-9
Online ISBN: 978-3-642-41644-6
eBook Packages: Computer Science, Computer Science (R0)
