Combining Language Modeling and Discriminative Classification for Word Segmentation

  • Dekang Lin
Part of the Lecture Notes in Computer Science book series (LNCS, volume 5449)


Generative language modeling and discriminative classification are two main techniques for Chinese word segmentation, and most previous methods adopt only one of them. We present a hybrid model that combines the disambiguation power of language modeling with the ability of discriminative classifiers to handle out-of-vocabulary words. The combined model achieves a 9% error reduction over the discriminative classifier alone.
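The abstract does not say how the two components are combined, so the following is only a rough sketch of the general idea, not the paper's model. It scores candidate segmentations with a toy unigram language model and falls back to a separate scorer for out-of-vocabulary words. The probabilities in `UNIGRAM`, the per-character penalty in `oov_log_score` (a crude stand-in for a discriminative classifier), and all function names are hypothetical.

```python
import math

# Hypothetical toy unigram probabilities; these are NOT taken from the paper.
UNIGRAM = {
    "研究": 0.02,
    "生命": 0.01,
    "研究生": 0.005,
    "命": 0.001,
    "起源": 0.01,
}


def oov_log_score(word):
    """Placeholder for a discriminative classifier's score of an unseen word.

    Assumption: a flat penalty per character. A real system would use a
    trained model here.
    """
    return -10.0 * len(word)


def word_log_score(word):
    """Language-model log-probability for known words, OOV scorer otherwise."""
    if word in UNIGRAM:
        return math.log(UNIGRAM[word])
    return oov_log_score(word)


def segment(text, max_len=4):
    """Return the highest-scoring segmentation by dynamic programming."""
    n = len(text)
    # best[i] holds (best score of text[:i], start index of its last word).
    best = [(-math.inf, 0)] * (n + 1)
    best[0] = (0.0, 0)
    for end in range(1, n + 1):
        for start in range(max(0, end - max_len), end):
            score = best[start][0] + word_log_score(text[start:end])
            if score > best[end][0]:
                best[end] = (score, start)
    # Follow the back-pointers from the end of the string.
    words = []
    end = n
    while end > 0:
        start = best[end][1]
        words.append(text[start:end])
        end = start
    return words[::-1]


print(segment("研究生命起源"))  # ['研究', '生命', '起源']
```

With these toy numbers the language model prefers "研究 生命 起源" over the competing reading "研究生 命 起源", which illustrates the kind of disambiguation the abstract attributes to language modeling.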


Segmentation Maximum Entropy Language Model 





Copyright information

© Springer-Verlag Berlin Heidelberg 2009

Authors and Affiliations

  • Dekang Lin, Google, Inc., Mountain View, USA
