Abstract
New word detection is a significant problem in Chinese information processing and underlies Chinese word segmentation, automatic translation, and semantic analysis. To address it, this paper first analyzes the features of Chinese new words and then proposes a detection approach based on hypothesis testing. To simulate how people identify new words, the approach comprises three modules: non-accidental testing identifies strings whose occurrence in the given texts is clearly not accidental, correlation testing evaluates the internal correlation between adjacent characters, and common grammar rules filter out garbage strings after the testing. The approach avoids subjectively selecting thresholds for the statistical features of new words and instead sets thresholds adaptively according to general frequency information. Its implementation does not require a large-scale training corpus, which eliminates the influence that the choice of corpus would otherwise have on the recognition results. Comparative experiments show that the method performs well in both detection time and F-score.
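The first two modules can be illustrated with a minimal sketch. The abstract does not state which test statistics the paper uses, so everything below is an assumption made for illustration: the non-accidental test is modeled as a one-sided Poisson test of a bigram's count against a hypothetical `background_rate`, the correlation test as a Pearson chi-square test of independence on a 2x2 table of adjacent characters, and the function names are invented here. The grammar-rule filter is omitted.

```python
import math
from collections import Counter

# Chi-square critical value for 1 degree of freedom at alpha = 0.05.
CHI2_CRIT_1DF_05 = 3.841


def poisson_upper_tail(k, lam):
    """Return P(X >= k) for X ~ Poisson(lam)."""
    if k <= 0:
        return 1.0
    term = math.exp(-lam)  # P(X = 0)
    cdf = term
    for i in range(1, k):
        term *= lam / i    # P(X = i) from P(X = i - 1)
        cdf += term
    return max(0.0, 1.0 - cdf)


def chi_square_2x2(n_ab, n_a, n_b, n):
    """Pearson chi-square for the table (first char is a?) x (second char is b?)."""
    o11 = n_ab                      # a followed by b
    o12 = n_a - n_ab                # a followed by something else
    o21 = n_b - n_ab                # something else followed by b
    o22 = n - n_a - n_b + n_ab      # neither
    den = (o11 + o12) * (o21 + o22) * (o11 + o21) * (o12 + o22)
    if den == 0:
        return 0.0
    return n * (o11 * o22 - o12 * o21) ** 2 / den


def candidate_bigrams(text, background_rate=0.01, alpha=0.05):
    """Keep bigrams that pass both assumed tests (hypothetical helper)."""
    pairs = Counter(text[i:i + 2] for i in range(len(text) - 1))
    n = sum(pairs.values())
    first, second = Counter(), Counter()
    for pair, count in pairs.items():
        first[pair[0]] += count
        second[pair[1]] += count

    kept = []
    for pair, count in pairs.items():
        # Assumed non-accidental test: is the count too high for the background rate?
        if poisson_upper_tail(count, background_rate * n) >= alpha:
            continue
        # Assumed correlation test: do the two characters co-occur more than
        # independence would predict?
        n_a, n_b = first[pair[0]], second[pair[1]]
        chi2 = chi_square_2x2(count, n_a, n_b, n)
        if chi2 > CHI2_CRIT_1DF_05 and count > n_a * n_b / n:
            kept.append(pair)
    return sorted(kept)


if __name__ == "__main__":
    print(candidate_bigrams("abcdabefabghabijabklab"))
```

In this toy string, the pair "ab" recurs six times and its characters always appear together, so it passes both tests, while the pairs that occur once are rejected by the first test. A fixed `background_rate` is used only to keep the sketch short; the paper describes setting thresholds adaptively from general frequency information.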
Data availability
The data used in this paper will be made public once the paper is published.
Code availability
All code used in this paper will be made public once the paper is published.
Funding
This work was supported by the Fundamental Research Funds for the Central Universities under Grant No. BLX2015-17.
Author information
Contributions
Not applicable.
Ethics declarations
Conflict of interest
Not applicable.
Ethics approval
Not applicable.
Consent to participate
Not applicable.
Consent for publication
Not applicable.
About this article
Cite this article
Jiang, D., Jiang, A. & Tang, S. An adaptive method for Chinese new word detection based on hypothesis testing. Pattern Anal Applic 25, 993–999 (2022). https://doi.org/10.1007/s10044-022-01087-y
DOI: https://doi.org/10.1007/s10044-022-01087-y