Abstract
In order to analyze security and terrorism related content in Chinese, it is important to perform word segmentation on Chinese documents. There are many previous studies on Chinese word segmentation. The two major approaches are statistic-based and dictionary-based approaches. The pure statistic methods have lower precision, while the pure dictionary-based method cannot deal with new words and are restricted to the coverage of the dictionary. In this paper, we propose a hybrid method that avoids the limitations of both approaches. Through the use of suffix tree and mutual information (MI) with the dictionary, our segmenter, called IASeg, achieves a high accuracy in word segmentation when domain training is available. It can identify new words through MI-based token merging and dictionary update. In addition, with the Improved Bigram method it can also process N-grams. To evaluate the performance of our segmenter, we compare it with the Hylanda segmenter and the ICTCLAS segmenter using a terrorism-related corpus. The experiment results show that IASeg performs better than the two benchmarks in both precision and recall.
Access this chapter
Tax calculation will be finalised at checkout
Purchases are for personal use only
Preview
Unable to display preview. Download preview PDF.
References
Chan, H.L., Hon, W.K., Lam, T.W., Sadakane, K.: Dynamic dictionary matching and compressed suffix trees. Society for Industrial and Applied Mathematics, 13–22 (2005) ISBN:0-89871-585-7
Chau, M., Xu, J.: Mining Communities and Their Relationships in Blogs: A Study of Online Hate Groups. International Journal of Human-Computer Studies 65(1), 57–70 (2007)
Chen, H., Xu, J.: Intelligence and Security Informatics. Annual Review of Information Science and Technology 40, 229–289 (2006)
Chen, M.T., Seiferas, J.: Efficient and elegant subword-tree construction. In: Combinatorial Algorithm on Words, NATO Advanced Science Institutes. Series F, vol. 12, pp. 97–107. Springer, Berlin (1985)
Chien, L.F.: PAT-Tree Based Keyword Extraction for Chinese Information Retrieval. In: ACM SIGIR (1997)
Creutz, M., Lagus, K.: Unsupervised Models for Morpheme Segmentation and Morphology Learning. ACM Transactions on Speech and Language Processing 4(1) (January 2007)
Cui, S.Q., Liu, Q., Meng, Y., Yu, H.: Nishino Fumihito. New Word Detection Based on Large-Scale Corpus 43(05), 927–932 (2006)
Dai, Y.B., Khoo, S.G.T., Loh, T.E.: A new statistical formula for Chinese word segmentation incorporating contextual information. In: Proc. of the 22nd ACM SIGIR Conf. on Research and Development in Information Retrieval, pp. 82–89 (1999)
Fang, Y., Yang, H.E.H.: The Algorithm Design and Realization to Calculate The Mutual Information of Four- Word- String in Large Scale Corpus. Computer Development & Applications 1 (2005)
Giegerich, R., Kurtz, S.: From Ukkonen to McCreight and Weiner: A unifying view to linear-time suffix tree construction. Algorithmica 19, 331–353 (1997)
Hockenmaier, J., Brew, C.: Error-driven segmentation of Chinese. Communications of COLIPS 1(1), 69–84 (1998)
Jia, N., Zhang, Q.: Identification of Chinese Names Based on Maximum Entropy Model. Computer Engineering 33(9), 31–33 (2007)
Li, J.F., Zhang, Y.F.: Segmenting Chinese by EM Algorithm. Journal of the China Society for Scientific and Technical Information 03, 13–16 (2002)
Li, R., Liu, S.H., Ye, S.W., et al.: A method of crossing ambiguities in Chinese word segmentation based on SVM and k-NN. Journal of Chinese Information Processing 15(6), 13–18 (2001) (in Chinese)
Low, J.K., Ng, H.T., Guo, W.: A Maximum Entropy Approach to Chinese Word Segmentation. In: Proceedings of the Fourth SIGHAN Workshop on Chinese Language Processing, Jeju Island, Korea, pp. 161–164 (2005)
Maaß, M.: Suffix Trees and their Applications. Ferienakademie 1999 Kurs 2: Bäume-Algorithmik and Kombinatorik (1999)
McCreight, E.M.: A space-economical suffix tree construction algorithm. Journal of ACM 23(2), 262–272 (1976)
Ong, T.H., Chen, H.: Updateable PAT-Tree Approach to Chinese Key Phrase Extraction using Mutual Information: A Linguistic Foundation for Knowledge Management. In: Proceedings Asian Digital Library Conference, Taipei, Taiwan, pp. 63–84 (1999)
Palmer, D.: A trainable rule-based algorithm to word segmentation. In: Proceedings of the 35th Annual Meeting of the Association of Computational Linguistics, Madrid, Spain (1997)
Peng, F.C., Feng, F.F., McCallum, A.: Chinese segmentation and new word detection using conditional random fields. In: COLING 2004, Geneva, Switzerland (2004)
Peng, F.C., Dale, S.: Self-supervised Chinese Word Segmentation. In: Proceedings of the 4th International Symposium of Intelligent Data Analysis, pp. 238–247 (2001)
Ponte, J.M., Croft, W.B.: Useg: A retargetable word segmentation procedure for information retrieval. In: Proceedings of SDAIR 1996, Las Vegas, Nevada (1996)
Rabiner, L.: A tutorial on hidden Markov models and selected applications in speech recognition. Proceedings of the IEEE 77(2), 257–286 (1989)
Sproat, R., Shih, C., Gale, W., Chang, N.: A stochastic finite-state word-segmentation algorithm for Chinese. Computational Linguistics 22(3), 377–404 (1996)
Sun, M.S., Xiao, M., Zou, J.Y.: Chinese Word Segmentation without Using Dictionary Based on Unsupervised Learning Strategy. Chinese Journal of Computers 27(6), 736–742 (2004)
Teahan, W.J., Wen, Y., McNab, R.J., Witten, I.H.: A compression-based algorithm for Chinese word segmentation. Computational Linguistics 26, 375–393 (2000)
Ukkonen, E.: Constructing Suffix Trees On-Line in Linear Time. In: Leeuwen, J.v. (ed.) Algorithms, Software, Architecture, Proc. IFIP 12th World Computer Congress, Information Processing 1992, Madrid, Spain, vol. 1, pp. 484–492 (1992)
Ukkonen, E.: On-line Construction of Suffix-Trees. Algorithmica 14(3) (1995)
Weiner, P.: Linear Pattern Matching Algorithms. In: Proc. 14th IEEE Annual Symp. on Switching and Automata Theory, pp. 1–11 (1973)
Wu, Z., Tseng, G.: Chinese text segmentation for text retrieval achievements and problems. JASIS 44(9), 532–542 (1993)
Xue, N.W., Chiou, F.-D., Palmer, M.: Building a large annotated Chinese corpus. In: The Proceedings of the 19th International Conference on Computational Linguistics, Taipei, Taiwan (2002)
Xue, N.W.: Chinese word segmentation as character tagging. International Journal of Computational Linguistics and Chinese Language Processing 8(1), 29–48 (2003)
Yu, H.K., Zhang, H.P., Liu, Q., Lv, X.Q., Shi, S.C.: Chinese named entity identification using cascaded hidden Markov model. Journal on Communications 27(2), 87–94 (2006)
Zhang, H.P., Yu, H.K., Xiong, D.Y., Liu, Q.: HMM-Based Chinese lexical analyzer ICTCLAS. In: Proc. of the 2nd SIGHAN Workshop, pp. 184–187 (2003)
Zhang, Ch.L., Hao, F.L., Wan, W.L.: An automatic and dictionary-free Chinese word segmentation method based on suffix array. Journal of Jilin University (Science Edition) 4 (2004)
Zhou, L.X., Liu, Q.: A Character-net Based Chinese Text Segmentation Method. In: SEMANET: Building and Using Semantic Networks Workshop, attached with the 19th COLING, pp. 101–106 (2002)
Author information
Authors and Affiliations
Editor information
Editors and Affiliations
Rights and permissions
Copyright information
© 2008 Springer-Verlag Berlin Heidelberg
About this paper
Cite this paper
Zeng, D., Wei, D., Chau, M., Wang, F. (2008). Chinese Word Segmentation for Terrorism-Related Contents. In: Yang, C.C., et al. Intelligence and Security Informatics. ISI 2008. Lecture Notes in Computer Science, vol 5075. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-540-69304-8_1
Download citation
DOI: https://doi.org/10.1007/978-3-540-69304-8_1
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-69136-5
Online ISBN: 978-3-540-69304-8
eBook Packages: Computer ScienceComputer Science (R0)