Statistical Properties of Overlapping Ambiguities in Chinese Word Segmentation and a Strategy for Their Disambiguation

  • Wei Qiao
  • Maosong Sun
  • Wolfgang Menzel
Part of the Lecture Notes in Computer Science book series (LNCS, volume 5246)

Abstract

Overlapping ambiguity is a major ambiguity type in Chinese word segmentation. In this paper, the statistical properties of overlapping ambiguities are studied intensively, based on observations from a very large, balanced, general-purpose Chinese corpus. The relevant statistics are given from different perspectives. The stability of high-frequency maximal overlapping ambiguities is tested using statistical observations from both a general-purpose corpus and domain-specific corpora. A disambiguation strategy for overlapping ambiguities, with a predefined solution for each of the 5,507 pseudo overlapping ambiguities, is then proposed; it suggests that over 42% of overlapping ambiguities in Chinese running text could be resolved without making any error. Several state-of-the-art word segmenters are compared on these overlapping ambiguities. Preliminary experiments show that about 2% of the 5,507 pseudo ambiguities that are mistakenly segmented by these segmenters can be properly handled by the proposed strategy.
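
The lookup idea behind such a strategy can be illustrated with a minimal Python sketch. It is not the authors' implementation: the lexicon, the single table entry, and the function names `find_overlaps` and `resolve` are hypothetical, and the sketch assumes that a pseudo overlapping ambiguity is a string with one segmentation that is correct regardless of context. It detects only pairwise overlaps of two lexicon words, not the maximal overlapping ambiguities studied in the paper.

```python
# Toy lexicon and lookup table: illustrative assumptions, not data from the paper.
LEXICON = {"为人", "人民", "为", "人", "民"}
# Assumed fixed solution for one pseudo overlapping ambiguity.
PREDEFINED = {"为人民": ["为", "人民"]}


def find_overlaps(text, lexicon, max_len=4):
    """Return (start, end) spans covered by two lexicon words that overlap."""
    # Collect every multi-character lexicon word occurring in the text.
    words = [
        (i, j)
        for i in range(len(text))
        for j in range(i + 2, min(len(text), i + max_len) + 1)
        if text[i:j] in lexicon
    ]
    spans = []
    for s1, e1 in words:
        for s2, e2 in words:
            # The second word starts strictly inside the first and ends after it.
            if s1 < s2 < e1 < e2:
                spans.append((s1, e2))
    return spans


def resolve(text, lexicon, table):
    """Map each overlapping string to its predefined segmentation, or None."""
    result = {}
    for start, end in find_overlaps(text, lexicon):
        ambiguous = text[start:end]
        result[ambiguous] = table.get(ambiguous)
    return result


print(resolve("为人民服务", LEXICON, PREDEFINED))
```

Strings that are missing from the table map to `None` here, which stands in for handing the case over to a context-sensitive segmenter.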

Keywords

Overlapping ambiguity; statistical property; disambiguation strategy; domain-specific corpora

Copyright information

© Springer-Verlag Berlin Heidelberg 2008

Authors and Affiliations

  • Wei Qiao (1)
  • Maosong Sun (1)
  • Wolfgang Menzel (2)
  1. State Key Laboratory of Intelligent Technology and Systems, Tsinghua National Laboratory for Information Science and Technology, Department of Computer Sci. & Tech., Tsinghua University, Beijing, China
  2. Department of Informatik, Hamburg University, Hamburg, Germany