Domain-specific Chinese word segmentation using suffix tree and mutual information

Zeng, Daniel; Wei, Donghua; Chau, Michael; Wang, Feiyue

doi:10.1007/s10796-010-9278-5

Published: 17 October 2010

Information Systems Frontiers, Volume 13, pages 115–125 (2011)

Daniel Zeng^1,2, Donghua Wei^1, Michael Chau^3 & Feiyue Wang^1,2

Abstract

As the amount of online Chinese content grows, there is a critical need for effective Chinese word segmentation approaches to support Web computing applications in a range of domains, including terrorism informatics. Most existing Chinese word segmentation approaches are either statistics-based or dictionary-based. Purely statistical methods have lower precision, while purely dictionary-based methods cannot handle new words that are absent from the dictionary. In this paper, we propose a hybrid method that avoids the limitations of both types of approaches. By combining a suffix tree and mutual information (MI) with the dictionary, our segmenter, called IASeg, achieves high accuracy in word segmentation when domain training is available. It can also identify new words through MI-based token merging and dictionary updating. In addition, with the proposed Improved Bigram method, IASeg can process N-grams. To evaluate the performance of our segmenter, we compare it with two well-known systems, the Hylanda segmenter and the ICTCLAS segmenter, using a terrorism-centric corpus and a general corpus. The experimental results show that IASeg outperforms both benchmarks in precision and recall on the domain-specific corpus and achieves comparable performance on the general corpus.
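
The abstract does not give IASeg's MI formula, thresholds, or suffix tree details, so the following is only a minimal sketch of the general idea behind MI-based token merging, not the authors' implementation. It assumes the common pointwise mutual information definition, log2(P(ab) / (P(a) P(b))), estimated from raw counts, and a single greedy left-to-right merge pass. The names `pair_mi`, `merge_tokens`, and the `threshold` parameter are hypothetical and chosen for illustration.

```python
import math
from collections import Counter


def pair_mi(sentences):
    """Map each adjacent token pair to its pointwise mutual information.

    Assumption: probabilities are plain relative frequencies over the
    training sentences; the paper may estimate them differently.
    """
    unigrams = Counter()
    bigrams = Counter()
    for tokens in sentences:
        unigrams.update(tokens)
        bigrams.update(zip(tokens, tokens[1:]))
    n_uni = sum(unigrams.values())
    n_bi = sum(bigrams.values())
    scores = {}
    for (a, b), count in bigrams.items():
        p_ab = count / n_bi
        p_a = unigrams[a] / n_uni
        p_b = unigrams[b] / n_uni
        scores[(a, b)] = math.log2(p_ab / (p_a * p_b))
    return scores


def merge_tokens(tokens, scores, threshold):
    """Greedily merge adjacent tokens whose MI exceeds the threshold.

    Pairs never seen in training score negative infinity and are not merged.
    """
    merged = []
    i = 0
    while i < len(tokens):
        pair = tuple(tokens[i:i + 2])
        if len(pair) == 2 and scores.get(pair, float("-inf")) > threshold:
            merged.append(tokens[i] + tokens[i + 1])
            i += 2
        else:
            merged.append(tokens[i])
            i += 1
    return merged


# Toy single-character corpus: "信" and "息" co-occur in every sentence.
corpus = [["信", "息", "系"], ["信", "息", "统"], ["系", "统", "信", "息"]]
scores = pair_mi(corpus)
print(merge_tokens(["系", "统", "信", "息"], scores, threshold=2.0))
```

In this toy corpus only the pair "信" + "息" scores above the illustrative threshold of 2.0, so it is the only pair merged. A merged token that is not yet in the dictionary would be a candidate new word, which is where a dictionary-updating step could take over.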

Acknowledgments

The reported work was supported in part by the following grants: NNSFC #90924302 and #60921061, MOST #2006AA010106, CAS #2F07C01, NSF #IIS-0428241, and HKU #10207565. We thank our team member Mr. Qingyang Xu for his help with the experiments. We also thank Ms. Fenglin Li and Ms. Shufang Tang for their help with data preparation and processing.

Author information

Authors and Affiliations

Chinese Academy of Sciences, Institute of Automation, Beijing, China
Daniel Zeng, Donghua Wei & Feiyue Wang
The University of Arizona, Tucson, AZ, USA
Daniel Zeng & Feiyue Wang
The University of Hong Kong, Hong Kong, China
Michael Chau

Corresponding author

Correspondence to Daniel Zeng.

Additional information

A preliminary and shorter version of this paper appeared in the Proceedings of the 2008 Intelligence and Security Informatics Workshops (Springer LNCS #5075).

About this article

Cite this article

Zeng, D., Wei, D., Chau, M. et al. Domain-specific Chinese word segmentation using suffix tree and mutual information. Inf Syst Front 13, 115–125 (2011). https://doi.org/10.1007/s10796-010-9278-5

Published: 17 October 2010
Issue Date: March 2011
DOI: https://doi.org/10.1007/s10796-010-9278-5
