Improving Chinese Word Segmentation Using Partially Annotated Sentences

Zhang, Kaixu; Su, Jinsong; Zhou, Changle

doi:10.1007/978-3-642-41491-6_1

Conference paper

Part of the book series: Lecture Notes in Computer Science (LNAI, volume 8202)

Abstract

Manual annotation is important for statistical NLP models, but it is time-consuming and labor-intensive. We describe a learning task that can use partially annotated data as training data; traditional supervised learning is a special case of this task. In particular, we adapt the perceptron algorithm to train Chinese word segmentation models on a mix of conventional fully segmented Chinese sentences and partially annotated sentences. Partially annotated sentences can be generated automatically, without any additional manual annotation, from heterogeneous segmented corpora as well as from naturally annotated data such as sentences in markup languages like wikitext. The experiments show that our method improves the performance of both supervised and semi-supervised models.
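The idea of training a perceptron on a mix of fully and partially annotated sentences can be illustrated with a small sketch. The code below is not the authors' implementation: it assumes a character-tagging view of segmentation (tag B starts a word, tag I continues one), represents any annotation as a set of permitted tags per character, and updates the weights toward the best-scoring output that is consistent with those sets, which is one common way to adapt a perceptron to partial annotation. The feature templates, the function names, and the toy sentences are all hypothetical. Under this representation a fully segmented sentence is simply the case where every set has exactly one tag, so ordinary supervised training falls out as a special case. Sentences are assumed to be non-empty.

```python
TAGS = ("B", "I")  # B: character starts a word; I: character continues a word

def features(chars, i, prev_tag, tag):
    # Toy feature templates, chosen for illustration only (an assumption).
    left = chars[i - 1] if i > 0 else "<s>"
    return (f"cur={chars[i]}|{tag}", f"bigram={left}{chars[i]}|{tag}",
            f"trans={prev_tag}>{tag}")

def decode(chars, weights, allowed):
    """Viterbi search restricted to the tags in allowed[i] at each position i."""
    best = []  # best[i][tag] = (score of best path ending in tag, previous tag)
    for i in range(len(chars)):
        prev_column = best[-1] if best else {"<s>": (0.0, None)}
        column = {}
        for tag in TAGS:
            if tag not in allowed[i]:
                continue
            column[tag] = max(
                (score + sum(weights.get(f, 0.0)
                             for f in features(chars, i, prev, tag)), prev)
                for prev, (score, _) in prev_column.items())
        best.append(column)
    tag = max(best[-1], key=lambda t: best[-1][t][0])
    tags = [tag]
    for i in range(len(chars) - 1, 0, -1):  # follow back-pointers
        tag = best[i][tag][1]
        tags.append(tag)
    return tags[::-1]

def free_constraints(n):
    # No annotation: only the first character is known to start a word.
    return [{"B"}] + [set(TAGS) for _ in range(n - 1)]

def constraints_from_words(words):
    # Full annotation: exactly one permitted tag per character.
    return [{"B"} if j == 0 else {"I"} for w in words for j in range(len(w))]

def constraints_from_span(n, start, end):
    # Partial annotation from a markup span chars[start:end] (e.g. a wiki link):
    # a word starts at `start` and at `end`; everything else stays unknown.
    allowed = free_constraints(n)
    allowed[start] = {"B"}
    if end < n:
        allowed[end] = {"B"}
    return allowed

def count_features(chars, tags):
    counts, prev = {}, "<s>"
    for i, tag in enumerate(tags):
        for f in features(chars, i, prev, tag):
            counts[f] = counts.get(f, 0) + 1
        prev = tag
    return counts

def train(data, epochs=10):
    weights = {}
    for _ in range(epochs):
        for chars, allowed in data:
            predicted = decode(chars, weights, free_constraints(len(chars)))
            if all(t in a for t, a in zip(predicted, allowed)):
                continue  # prediction already agrees with the annotation
            # Best output that is consistent with the (partial) annotation.
            target = decode(chars, weights, allowed)
            for f, c in count_features(chars, target).items():
                weights[f] = weights.get(f, 0.0) + c
            for f, c in count_features(chars, predicted).items():
                weights[f] = weights.get(f, 0.0) - c
    return weights

def to_words(chars, tags):
    words = []
    for ch, tag in zip(chars, tags):
        if tag == "B":
            words.append(ch)
        else:
            words[-1] += ch
    return words

# Toy training data: one fully segmented sentence, one with a single marked span.
data = [
    ("我们喜欢厦门", constraints_from_words(["我们", "喜欢", "厦门"])),
    ("厦门大学很美", constraints_from_span(6, 0, 4)),
]
weights = train(data)
tags = decode("厦门大学很美", weights, free_constraints(6))
print(to_words("厦门大学很美", tags))
```

Two sentences are far too few to learn a useful model; the example only shows how both kinds of annotation pass through the same update rule. The span constraint fixes the two boundaries around the marked text and leaves its interior open, which is the sense in which such a sentence is only partially annotated.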

Author information

Authors and Affiliations

Xiamen University, Xiamen, Fujian, 361005, China
Kaixu Zhang & Jinsong Su
Institute of Artificial Intelligence, Xiamen University, Xiamen, Fujian, 361005, China
Changle Zhou

Editor information

Editors and Affiliations

Department of Computer Science and Technology, Tsinghua University, 100084, Beijing, China
Maosong Sun
Horizon Doctoral Training Centre, School of Computer Science, University of Nottingham, NG8 1BB, Nottingham, UK
Min Zhang
Google Inc., Mountain View, CA, USA
Dekang Lin
Baidu Inc., Beijing, China
Haifeng Wang

Cite this paper

Zhang, K., Su, J., Zhou, C. (2013). Improving Chinese Word Segmentation Using Partially Annotated Sentences. In: Sun, M., Zhang, M., Lin, D., Wang, H. (eds) Chinese Computational Linguistics and Natural Language Processing Based on Naturally Annotated Big Data. NLP-NABD CCL 2013. Lecture Notes in Computer Science, vol 8202. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-41491-6_1

DOI: https://doi.org/10.1007/978-3-642-41491-6_1
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-642-41490-9
Online ISBN: 978-3-642-41491-6
eBook Packages: Computer Science, Computer Science (R0)