An adaptive method for Chinese new word detection based on hypothesis testing

  • Short Paper
  • Pattern Analysis and Applications

Abstract

New word detection is an important problem in Chinese information processing and underlies Chinese word segmentation, machine translation, and semantic analysis. To address it, this paper first analyzes the features of Chinese new words and then proposes a detection approach based on hypothesis testing. To simulate how people identify new words, the approach comprises three modules: non-accidental testing identifies strings whose occurrence in the given texts is evidently not accidental; correlation testing evaluates the internal correlation between adjacent characters; and common grammar rules filter out garbage strings after the testing. The approach avoids the subjective selection of thresholds for the statistical features of new words and instead sets thresholds adaptively according to general frequency information. It does not require a large-scale training corpus, which eliminates the influence of corpus choice on the recognition results. Comparative experiments show that the method performs well in both detection time and F-score.
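
The abstract does not state which test statistic the correlation testing uses. As a minimal illustration of the general idea only, the sketch below assumes a Pearson chi-square independence test on a 2x2 contingency table of adjacent characters, with the standard 5% critical value for one degree of freedom. The function names `chi_square_2x2` and `correlated_bigrams` are hypothetical and are not taken from the paper.

```python
from collections import Counter

# Chi-square critical value for 1 degree of freedom at significance level 0.05.
# Using a fixed textbook critical value is an assumption of this sketch.
CHI2_CRITICAL_95 = 3.841


def chi_square_2x2(n11, n12, n21, n22):
    """Pearson chi-square statistic for a 2x2 contingency table."""
    n = n11 + n12 + n21 + n22
    row1, row2 = n11 + n12, n21 + n22
    col1, col2 = n11 + n21, n12 + n22
    denom = row1 * row2 * col1 * col2
    if denom == 0:
        return 0.0  # degenerate table: no evidence either way
    return n * (n11 * n22 - n12 * n21) ** 2 / denom


def correlated_bigrams(text, critical=CHI2_CRITICAL_95):
    """Return adjacent character pairs that reject independence.

    The null hypothesis is that the two characters of a pair occur
    independently of each other. Only positively associated pairs
    are kept, since a candidate word needs characters that attract.
    """
    pairs = list(zip(text, text[1:]))
    n = len(pairs)
    pair_counts = Counter(pairs)
    first_counts = Counter(a for a, _ in pairs)
    second_counts = Counter(b for _, b in pairs)

    result = {}
    for (a, b), n11 in pair_counts.items():
        n12 = first_counts[a] - n11   # a followed by a character other than b
        n21 = second_counts[b] - n11  # b preceded by a character other than a
        n22 = n - n11 - n12 - n21     # neither a first nor b second
        if n11 * n22 <= n12 * n21:
            continue                  # not a positive association
        stat = chi_square_2x2(n11, n12, n21, n22)
        if stat > critical:
            result[a + b] = stat
    return result
```

For example, `correlated_bigrams("新词甲新词乙新词丙新词丁新词戊新词己")` keeps the pair "新词", which always occurs together, and rejects pairs such as "词甲" that occur once by chance. A full detector along the lines the abstract describes would also need the non-accidental test and the grammar-rule filter, which are not sketched here.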

Data availability

The data used in this paper will be made public once the paper is published.

Code availability

All code used in this paper will be made public once the paper is published.

Funding

This work was supported by Fundamental Research Funds for Central Universities under Grant No. BLX2015-17.

Author information

Contributions

Not applicable.

Corresponding author

Correspondence to Dongchen Jiang.

Ethics declarations

Conflict of interest

Not applicable.

Ethics approval

Not applicable.

Consent to participate

Not applicable.

Consent for publication

Not applicable.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

About this article

Cite this article

Jiang, D., Jiang, A. & Tang, S. An adaptive method for Chinese new word detection based on hypothesis testing. Pattern Anal Applic 25, 993–999 (2022). https://doi.org/10.1007/s10044-022-01087-y

  • DOI: https://doi.org/10.1007/s10044-022-01087-y
