Domain-specific Chinese word segmentation using suffix tree and mutual information

Information Systems Frontiers

Abstract

As the amount of online Chinese content grows, there is a critical need for effective Chinese word segmentation approaches to support Web computing applications in a range of domains, including terrorism informatics. Most existing Chinese word segmentation approaches are either statistics-based or dictionary-based. Purely statistical methods tend to have lower precision, while purely dictionary-based methods cannot handle new words that are absent from the dictionary. In this paper, we propose a hybrid method that avoids the limitations of both types of approaches. By combining a suffix tree and mutual information (MI) with a dictionary, our segmenter, called IASeg, achieves high accuracy in word segmentation when domain training is available. It can also identify new words through MI-based token merging and dictionary updating. In addition, with the proposed Improved Bigram method, IASeg can process N-grams. To evaluate the performance of our segmenter, we compare it with two well-known systems, the Hylanda segmenter and the ICTCLAS segmenter, using a terrorism-centric corpus and a general corpus. The experimental results show that IASeg outperforms both benchmarks in precision and recall on the domain-specific corpus and achieves comparable performance on the general corpus.
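The abstract does not state IASeg's exact MI formula or merging rule, so the following is only a minimal Python sketch of the general idea of MI-based token merging. It assumes pointwise mutual information computed over adjacent tokens and a greedy left-to-right merge. The function names, the threshold value, and the toy corpus are hypothetical illustrations and are not taken from the paper.

```python
import math
from collections import Counter


def pmi_table(corpus):
    """Compute pointwise mutual information for every adjacent token pair.

    corpus: a list of token lists (already split into initial tokens).
    Assumption: PMI(a, b) = log2(P(a, b) / (P(a) * P(b))), with probabilities
    estimated from raw unigram and bigram counts.
    """
    unigrams = Counter()
    bigrams = Counter()
    for tokens in corpus:
        unigrams.update(tokens)
        bigrams.update(zip(tokens, tokens[1:]))
    n_uni = sum(unigrams.values())
    n_bi = sum(bigrams.values())
    table = {}
    for (a, b), count in bigrams.items():
        p_ab = count / n_bi
        p_a = unigrams[a] / n_uni
        p_b = unigrams[b] / n_uni
        table[(a, b)] = math.log2(p_ab / (p_a * p_b))
    return table


def merge_tokens(tokens, table, threshold):
    """Greedily merge adjacent tokens whose PMI meets the threshold."""
    merged = []
    i = 0
    while i < len(tokens):
        pair_score = float("-inf")
        if i + 1 < len(tokens):
            pair_score = table.get((tokens[i], tokens[i + 1]), float("-inf"))
        if pair_score >= threshold:
            merged.append(tokens[i] + tokens[i + 1])
            i += 2  # both tokens were consumed by the merge
        else:
            merged.append(tokens[i])
            i += 1
    return merged


# Hypothetical toy corpus of pre-tokenized sentences.
toy_corpus = [
    ["恐怖", "主义", "活动"],
    ["恐怖", "主义", "组织"],
    ["国际", "组织"],
    ["活动", "组织"],
]
table = pmi_table(toy_corpus)
dictionary = {"恐怖", "主义", "活动", "组织", "国际"}

merged = merge_tokens(["恐怖", "主义", "活动"], table, threshold=3.0)
# Merged tokens that the dictionary lacks are candidate new words.
new_words = [t for t in merged if t not in dictionary]
dictionary.update(new_words)
print(merged, new_words)
```

In this sketch, a pair that co-occurs more often than its individual frequencies would predict gets a high score and is merged into one token; adding the merged token to the dictionary mirrors the dictionary-updating step the abstract mentions. A real system would need a tuned threshold and a far larger domain corpus.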



Acknowledgments

The reported work was supported in part by the following grants: NNSFC #90924302 and #60921061, MOST #2006AA010106, CAS #2F07C01, NSF #IIS-0428241, and HKU #10207565. We thank our team member Mr. Qingyang Xu for his help with the experiments. We also thank Ms. Fenglin Li and Ms. Shufang Tang for their help with data preparation and processing.

Author information

Corresponding author

Correspondence to Daniel Zeng.

Additional information

A preliminary and shorter version of this paper appeared in the Proceedings of the 2008 Intelligence and Security Informatics Workshops (Springer LNCS #5075).

About this article

Cite this article

Zeng, D., Wei, D., Chau, M. et al. Domain-specific Chinese word segmentation using suffix tree and mutual information. Inf Syst Front 13, 115–125 (2011). https://doi.org/10.1007/s10796-010-9278-5

  • DOI: https://doi.org/10.1007/s10796-010-9278-5
