Chinese Word Segmentation for Terrorism-Related Contents

  • Conference paper
Intelligence and Security Informatics (ISI 2008)

Part of the book series: Lecture Notes in Computer Science ((LNISA,volume 5075))

Abstract

To analyze security- and terrorism-related content in Chinese, it is important to perform word segmentation on Chinese documents. Chinese word segmentation has been studied extensively, and the two major approaches are statistical and dictionary-based. Purely statistical methods have lower precision, while purely dictionary-based methods cannot handle new words and are limited by the coverage of the dictionary. In this paper, we propose a hybrid method that avoids the limitations of both approaches. By combining a suffix tree and mutual information (MI) with a dictionary, our segmenter, called IASeg, achieves high segmentation accuracy when domain training is available. It identifies new words through MI-based token merging and dictionary updates. In addition, with the Improved Bigram method it can also process N-grams. To evaluate the performance of our segmenter, we compare it with the Hylanda segmenter and the ICTCLAS segmenter on a terrorism-related corpus. The experimental results show that IASeg outperforms both benchmarks in precision and recall.
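The idea of MI-based token merging followed by a dictionary update can be illustrated with a small sketch. This is not the IASeg implementation: the suffix tree and the Improved Bigram method from the paper are not reproduced here, and the function names (`forward_max_match`, `find_new_words`), the greedy longest-match strategy, the use of pointwise mutual information over adjacent tokens, and the `min_count` and `threshold` values are all assumptions chosen for illustration.

```python
import math
from collections import Counter


def forward_max_match(text, dictionary, max_len=4):
    """Greedy longest-match segmentation against a dictionary.

    Characters not covered by any dictionary entry become single-character
    tokens. (Hypothetical helper; the paper does not specify this strategy.)
    """
    tokens = []
    i = 0
    while i < len(text):
        for size in range(min(max_len, len(text) - i), 0, -1):
            piece = text[i:i + size]
            if size == 1 or piece in dictionary:
                tokens.append(piece)
                i += size
                break
    return tokens


def find_new_words(segmented_docs, min_count=2, threshold=1.0):
    """Propose new words by merging adjacent tokens with high pointwise MI.

    PMI(a, b) = log2( P(a, b) / (P(a) * P(b)) ), estimated from token and
    adjacent-pair counts. min_count and threshold are illustrative values.
    """
    unigrams = Counter()
    bigrams = Counter()
    for tokens in segmented_docs:
        unigrams.update(tokens)
        bigrams.update(zip(tokens, tokens[1:]))

    total_tokens = sum(unigrams.values())
    total_pairs = sum(bigrams.values())
    new_words = set()
    for (left, right), count in bigrams.items():
        if count < min_count:
            continue  # too rare to trust the estimate
        p_pair = count / total_pairs
        p_left = unigrams[left] / total_tokens
        p_right = unigrams[right] / total_tokens
        if math.log2(p_pair / (p_left * p_right)) >= threshold:
            new_words.add(left + right)
    return new_words


if __name__ == "__main__":
    # Toy dictionary and corpus, made up for this example.
    dictionary = {"恐怖", "袭击", "组织"}
    corpus = ["恐怖袭击", "恐怖组织", "基地组织", "基地袭击"]

    segmented = [forward_max_match(text, dictionary) for text in corpus]
    learned = find_new_words(segmented)   # the pair 基 + 地 is merged
    dictionary |= learned                 # dictionary update

    print(forward_max_match("基地组织", dictionary))  # → ['基地', '组织']
```

In this toy run the out-of-dictionary word 基地 is first split into single characters; because the two characters co-occur far more often than chance would predict, they are merged and added to the dictionary, so the second pass segments the word correctly.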

Copyright information

© 2008 Springer-Verlag Berlin Heidelberg

About this paper

Cite this paper

Zeng, D., Wei, D., Chau, M., Wang, F. (2008). Chinese Word Segmentation for Terrorism-Related Contents. In: Yang, C.C., et al. Intelligence and Security Informatics. ISI 2008. Lecture Notes in Computer Science, vol 5075. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-540-69304-8_1

  • DOI: https://doi.org/10.1007/978-3-540-69304-8_1

  • Publisher Name: Springer, Berlin, Heidelberg

  • Print ISBN: 978-3-540-69136-5

  • Online ISBN: 978-3-540-69304-8

  • eBook Packages: Computer Science, Computer Science (R0)
