Chinese Word Segmentation for Terrorism-Related Contents

Zeng, Daniel; Wei, Donghua; Chau, Michael; Wang, Feiyue

doi:10.1007/978-3-540-69304-8_1

Part of the book series: Lecture Notes in Computer Science (LNISA, volume 5075)

Included in the following conference series:

International Conference on Intelligence and Security Informatics

Abstract

Analyzing security- and terrorism-related content in Chinese requires word segmentation of Chinese documents. Chinese word segmentation has been studied extensively, and the two major approaches are statistical and dictionary-based. Purely statistical methods have lower precision, while purely dictionary-based methods cannot handle new words and are limited by the coverage of the dictionary. In this paper, we propose a hybrid method that avoids the limitations of both approaches. By combining a suffix tree and mutual information (MI) with the dictionary, our segmenter, called IASeg, achieves high accuracy in word segmentation when domain training is available. It can identify new words through MI-based token merging and dictionary update. In addition, with the Improved Bigram method it can also process N-grams. To evaluate the performance of our segmenter, we compare it with the Hylanda segmenter and the ICTCLAS segmenter using a terrorism-related corpus. The experimental results show that IASeg outperforms both benchmarks in precision and recall.
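
The idea of MI-based token merging mentioned in the abstract can be illustrated with a minimal sketch. This is not the IASeg implementation, whose details are not given here. It assumes pointwise mutual information estimated from raw unigram and bigram counts, a single left-to-right merging pass, and a hand-chosen threshold. The function name `pmi_merge` and its parameters are hypothetical.

```python
import math
from collections import Counter


def pmi_merge(sentences, threshold):
    """Merge adjacent token pairs whose pointwise MI exceeds `threshold`.

    `sentences` is a list of token lists (for example, the output of a
    dictionary-based segmenter). Returns the re-tokenized sentences and
    the set of merged strings, which are candidates for a dictionary update.
    This is an illustrative assumption, not the method of the paper.
    """
    unigrams = Counter(tok for sent in sentences for tok in sent)
    bigrams = Counter(
        (a, b) for sent in sentences for a, b in zip(sent, sent[1:])
    )
    n_uni = sum(unigrams.values())
    n_bi = sum(bigrams.values())

    def pmi(a, b):
        # PMI(a, b) = log2( P(a, b) / (P(a) * P(b)) ), from relative counts.
        p_ab = bigrams[(a, b)] / n_bi
        p_a = unigrams[a] / n_uni
        p_b = unigrams[b] / n_uni
        return math.log2(p_ab / (p_a * p_b))

    new_words = set()
    merged = []
    for sent in sentences:
        out, i = [], 0
        while i < len(sent):
            # Greedy left-to-right pass: merge a pair, then skip past it.
            if i + 1 < len(sent) and pmi(sent[i], sent[i + 1]) > threshold:
                out.append(sent[i] + sent[i + 1])
                new_words.add(sent[i] + sent[i + 1])
                i += 2
            else:
                out.append(sent[i])
                i += 1
        merged.append(out)
    return merged, new_words
```

In a hybrid setting, the strings returned in `new_words` would be added to the dictionary so that later passes segment them as single words. A practical system would also need frequency cutoffs, since PMI estimates from rare pairs are unreliable.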

Author information

Authors and Affiliations

Institute of Automation, Chinese Academy of Sciences, China
Daniel Zeng, Donghua Wei & Feiyue Wang
The University of Arizona, Tucson, Arizona, USA
Daniel Zeng & Feiyue Wang
The University of Hong Kong, Hong Kong, China
Michael Chau

Editor information

Editors and Affiliations

The Chinese University of Hong Kong, Hong Kong
Christopher C. Yang
The University of Arizona, USA
Hsinchun Chen
The University of Hong Kong, Hong Kong
Michael Chau
Nanyang Technological University, Singapore
Kuiyu Chang
University of Central Florida, USA
Sheau-Dong Lang
Tatung University, Taiwan
Patrick S. Chen
California University of Pennsylvania, USA
Raymond Hsieh
University of Arizona and Chinese Academy of Sciences, USA
Daniel Zeng
Chinese Academy of Sciences, China
Fei-Yue Wang & Wenji Mao
Carnegie Mellon University, USA
Kathleen Carley & Justin Zhan

Cite this paper

Zeng, D., Wei, D., Chau, M., Wang, F. (2008). Chinese Word Segmentation for Terrorism-Related Contents. In: Yang, C.C., et al. Intelligence and Security Informatics. ISI 2008. Lecture Notes in Computer Science, vol 5075. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-540-69304-8_1

DOI: https://doi.org/10.1007/978-3-540-69304-8_1
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-69136-5
Online ISBN: 978-3-540-69304-8
eBook Packages: Computer Science, Computer Science (R0)
