Exploiting Heterogeneous Annotations for Weibo Word Segmentation and POS Tagging

  • Jiayuan Chao
  • Zhenghua Li (email author)
  • Wenliang Chen
  • Min Zhang
Conference paper
Part of the Lecture Notes in Computer Science book series (LNCS, volume 9362)

Abstract

This paper describes our system for the NLPCC 2015 shared task on Chinese word segmentation (WS) and POS tagging for Weibo text. We treat WS and POS tagging as two separate tasks and use a cascaded approach. Our main focus is how to effectively exploit multiple heterogeneous datasets to boost the performance of statistical models. We consider three heterogeneous datasets: Weibo (\(\textit{WB}\), 10K sentences), Penn Chinese Treebank 7.0 (\(\textit{CTB7}\), 50K), and People’s Daily (\(\textit{PD}\), 280K). For WS, we adopt the recently proposed coupled sequence labeling to combine \(\textit{WB}\), \(\textit{CTB7}\), and \(\textit{PD}\), boosting the F1 score from \(93.76\%\) (baseline model trained on \(\textit{WB}\) only) to \(95.58\%\) (\(+1.82\%\)). For POS tagging, the three datasets follow three different annotation standards, so we adopt an ensemble approach that combines coupled sequence labeling with the guide-feature based method. First, we convert \(\textit{PD}\) into the annotation style of \(\textit{CTB7}\) using coupled sequence labeling; the converted data is denoted by \(\textit{PD}^{\textit{CTB}}\). Then, we merge \(\textit{CTB7}\) and \(\textit{PD}^{\textit{CTB}}\) to train a POS tagger, denoted by \(\textit{Tag}_{\textit{CTB7}+\textit{PD}^{\textit{CTB}}}\), which is then used to produce guide features on \(\textit{WB}\). As a result, the tagging F1 score improves from \(87.93\%\) to \(88.99\%\) (\(+1.06\%\)).
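
As a rough illustration of the guide-feature step described above, the sketch below shows how the tags predicted by a source-standard tagger can be turned into extra features for the target tagger. It is a minimal sketch under stated assumptions, not the system's actual implementation: the function name `add_guide_features`, the feature templates, and the toy lexicon tagger standing in for \(\textit{Tag}_{\textit{CTB7}+\textit{PD}^{\textit{CTB}}}\) are all hypothetical.

```python
from typing import Callable, List, Sequence

# Hypothetical type: any tagger that maps a word sequence to one tag per word.
Tagger = Callable[[Sequence[str]], List[str]]


def add_guide_features(words: Sequence[str], source_tagger: Tagger) -> List[List[str]]:
    """Return one feature list per word, augmented with guide features.

    The guide features are the tags predicted by `source_tagger`, a tagger
    trained under a different annotation standard than the target data.
    The feature templates below are illustrative assumptions.
    """
    guide_tags = source_tagger(words)
    if len(guide_tags) != len(words):
        raise ValueError("source tagger must return exactly one tag per word")

    features = []
    for i, word in enumerate(words):
        guide = guide_tags[i]
        prev_guide = guide_tags[i - 1] if i > 0 else "<BOS>"
        features.append([
            f"word={word}",                           # ordinary lexical feature
            f"guide={guide}",                         # source-standard tag as a feature
            f"word+guide={word}|{guide}",             # word conjoined with its guide tag
            f"prev_guide+guide={prev_guide}|{guide}",  # guide-tag bigram
        ])
    return features


def toy_ctb_tagger(words: Sequence[str]) -> List[str]:
    """Stand-in for a CTB-style tagger; a tiny lexicon lookup for illustration only."""
    lexicon = {"我": "PN", "爱": "VV", "微博": "NN"}
    return [lexicon.get(w, "NN") for w in words]


if __name__ == "__main__":
    for feats in add_guide_features(["我", "爱", "微博"], toy_ctb_tagger):
        print(feats)
```

The target tagger is then trained on the target-standard data with these augmented feature lists, so it can learn how the source-standard tags correlate with its own tag set instead of requiring the two standards to agree.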

Copyright information

© Springer International Publishing Switzerland 2015

Authors and Affiliations

  • Jiayuan Chao (1)
  • Zhenghua Li (1, email author)
  • Wenliang Chen (1)
  • Min Zhang (1)
  1. School of Computer Science and Technology, Soochow University, Suzhou, China
