Skip to main content

Enhancing Chinese Word Segmentation with Character Clustering

  • Conference paper
  • 1542 Accesses

Part of the Lecture Notes in Computer Science book series (LNAI,volume 8202)

Abstract

In semi-supervised learning framework, clustering has been proved a helpful feature to improve system performance in NER and other NLP tasks. However, there hasn’t been any work that employs clustering in word segmentation. In this paper, we proposed a new approach to compute clusters of characters and use these results to assist a character based Chinese word segmentation system. Contextual information is considered when we perform character clustering algorithm to address character ambiguity. Experiments show our character clusters result in performance improvement. Also, we compare our clusters features with widely used mutual information (MI). When two features integrated, further improvement is achieved.

Keywords

  • Brown clustering
  • Chinese word segmentation
  • semi-supervised learning

This is a preview of subscription content, access via your institution.

Buying options

Chapter
USD   29.95
Price excludes VAT (USA)
  • DOI: 10.1007/978-3-642-41491-6_6
  • Chapter length: 9 pages
  • Instant PDF download
  • Readable on all devices
  • Own it forever
  • Exclusive offer for individuals only
  • Tax calculation will be finalised during checkout
eBook
USD   49.99
Price excludes VAT (USA)
  • ISBN: 978-3-642-41491-6
  • Instant PDF download
  • Readable on all devices
  • Own it forever
  • Exclusive offer for individuals only
  • Tax calculation will be finalised during checkout
Softcover Book
USD   64.99
Price excludes VAT (USA)

Preview

Unable to display preview. Download preview PDF.

Unable to display preview. Download preview PDF.

References

  1. Wang, Y., Kazama, J., Tsuruoka, Y., Chen, W., Zhang, Y., Torisawa, K.: Improving Chinese word segmentation and pos tagging with semi-supervised methods using large auto-analyzed data. In: Proceedings of the Fifth International Joint Conference on Natural Language Processing, IJCNLP 2011 (2011)

    Google Scholar 

  2. Sun, W., Xu, J.: Enhancing chinese word segmentation using unlabeled data. In: Proceedings of the Conference on Empirical Methods in Natural Language Processing, pp. 970–979. Association for Computational Linguistics (2011)

    Google Scholar 

  3. Miller, S., Guinness, J., Zamanian, A.: Name tagging with word clusters and discriminative training. In: Proceedings of HLT-NAACL, vol. 4. Citeseer (2004)

    Google Scholar 

  4. Liang, P.: Semi-supervised learning for natural language. PhD thesis, Massachusetts Institute of Technology (2005)

    Google Scholar 

  5. Brown, P.F., Desouza, P.V., Mercer, R.L., Pietra, V.J.D., Lai, J.C.: Class-based n-gram models of natural language. Computational Linguistics 18(4), 467–479 (1992)

    Google Scholar 

  6. Chen, W., Kazama, J., Uchimoto, K., Torisawa, K.: Improving dependency parsing with subtrees from auto-parsed data. In: Proceedings of the 2009 Conference on Empirical Methods in Natural Language Processing, vol. 2, pp. 570–579. Association for Computational Linguistics (2009)

    Google Scholar 

  7. Okazaki, N.: Crfsuite: a fast implementation of conditional random fields (crfs) (2007)

    Google Scholar 

  8. Turian, J., Ratinov, L., Bengio, Y.: Word representations: a simple and general method for semi-supervised learning. In: Proceedings of the 48th Annual Meeting of the Association for Computational Linguistics, pp. 384–394. Association for Computational Linguistics (2010)

    Google Scholar 

Download references

Author information

Authors and Affiliations

Authors

Editor information

Editors and Affiliations

Rights and permissions

Reprints and Permissions

Copyright information

© 2013 Springer-Verlag Berlin Heidelberg

About this paper

Cite this paper

Liu, Y., Che, W., Liu, T. (2013). Enhancing Chinese Word Segmentation with Character Clustering. In: Sun, M., Zhang, M., Lin, D., Wang, H. (eds) Chinese Computational Linguistics and Natural Language Processing Based on Naturally Annotated Big Data. NLP-NABD CCL 2013 2013. Lecture Notes in Computer Science(), vol 8202. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-41491-6_6

Download citation

  • DOI: https://doi.org/10.1007/978-3-642-41491-6_6

  • Publisher Name: Springer, Berlin, Heidelberg

  • Print ISBN: 978-3-642-41490-9

  • Online ISBN: 978-3-642-41491-6

  • eBook Packages: Computer ScienceComputer Science (R0)