An adaptive method for Chinese new word detection based on hypothesis testing

Jiang, Dongchen; Jiang, Aoyuan; Tang, Shuai

doi:10.1007/s10044-022-01087-y

Short Paper
Published: 04 July 2022

Pattern Analysis and Applications, volume 25, pages 993–999 (2022)

Abstract

New word detection is a significant problem in Chinese information processing and underlies Chinese word segmentation, automatic translation, and semantic analysis. To address it, this paper first analyzes the features of Chinese new words and then proposes a detection approach based on hypothesis testing. To simulate how people identify new words, the approach comprises three modules: non-accidental testing identifies strings whose occurrence in the given texts is clearly not accidental; correlation testing evaluates the internal correlation between adjacent characters; and common grammar rules filter out garbage strings after the testing. The approach avoids the subjective selection of thresholds for the statistical features of new words and instead sets thresholds adaptively according to general frequency information. Its implementation does not require a large-scale training corpus and can eliminate the influence that the choice of corpus has on the recognition results. Comparative experiments show that the method performs well in both detection time and F-score.
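
The abstract does not specify the test statistics, so the following is only a minimal sketch of what a correlation test between adjacent characters could look like. It assumes a Pearson chi-square test of independence on a 2x2 contingency table of adjacent character pairs; the function names, the choice of statistic, and the 0.05 significance level are illustrative assumptions, not the authors' implementation.

```python
from typing import Tuple

# Critical value of the chi-square distribution with 1 degree of freedom
# at significance level 0.05 (assumed level, not taken from the paper).
CHI2_CRITICAL_DF1_ALPHA_05 = 3.841


def contingency(text: str, first: str, second: str) -> Tuple[int, int, int, int]:
    """Count adjacent character pairs (x, y) in a 2x2 table.

    a: x == first and y == second
    b: x == first and y != second
    c: x != first and y == second
    d: x != first and y != second
    """
    a = b = c = d = 0
    for x, y in zip(text, text[1:]):
        if x == first:
            if y == second:
                a += 1
            else:
                b += 1
        elif y == second:
            c += 1
        else:
            d += 1
    return a, b, c, d


def chi_square(a: int, b: int, c: int, d: int) -> float:
    """Pearson chi-square statistic for a 2x2 table (no continuity correction)."""
    n = a + b + c + d
    denom = (a + b) * (c + d) * (a + c) * (b + d)
    if denom == 0:
        return 0.0  # a row or column is empty, so the test is undefined
    return n * (a * d - b * c) ** 2 / denom


def is_correlated(text: str, first: str, second: str,
                  critical: float = CHI2_CRITICAL_DF1_ALPHA_05) -> bool:
    """Reject independence only when the association is positive and significant."""
    a, b, c, d = contingency(text, first, second)
    positively_associated = a * d > b * c
    return positively_associated and chi_square(a, b, c, d) > critical


if __name__ == "__main__":
    sample = "区块链很火。区块链技术。区块链应用。区块链平台。区块链安全。"
    print(is_correlated(sample, "区", "块"))
    print(is_correlated(sample, "链", "区"))
```

In this sketch the threshold comes from the chosen significance level rather than from a hand-tuned cutoff on a statistic such as mutual information, which is the general idea the abstract describes. With counts as small as those in the sample string the chi-square approximation is rough, so a real implementation would need longer texts or an exact test.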

Data availability

The data used in this paper will be made public once the paper is published.

Code availability

All code used in this paper will be made public once the paper is published.

Funding

This work was supported by the Fundamental Research Funds for the Central Universities under Grant No. BLX2015-17.

Author information

Authors and Affiliations

School of Information Science and Technology, Beijing Forestry University, Beijing, 100083, China
Dongchen Jiang, Aoyuan Jiang & Shuai Tang
Engineering Research Center for Forestry-oriented Intelligent Information Processing of National Forestry and Grassland Administration, Beijing, China
Dongchen Jiang & Aoyuan Jiang

Contributions

Not applicable.

Corresponding author

Correspondence to Dongchen Jiang.

Ethics declarations

Conflict of interest

Not applicable.

Ethics approval

Not applicable.

Consent to participate

Not applicable.

Consent for publication

Not applicable.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

About this article

Cite this article

Jiang, D., Jiang, A. & Tang, S. An adaptive method for Chinese new word detection based on hypothesis testing. Pattern Anal Applic 25, 993–999 (2022). https://doi.org/10.1007/s10044-022-01087-y

Received: 16 July 2021
Accepted: 03 June 2022
Published: 04 July 2022
Issue Date: November 2022
DOI: https://doi.org/10.1007/s10044-022-01087-y
