An Enhanced New Word Identification Approach Using Bilingual Alignment

Yang, Ziyan; Zhang, Huaping; Shang, Jianyun; Wushour, Silamu

doi:10.1007/978-3-031-17120-8_8

Part of the book series: Lecture Notes in Computer Science (LNAI, volume 13551)

Included in the following conference series:

CCF International Conference on Natural Language Processing and Chinese Computing

Abstract

Traditional new word detection has focused on finding the positional distribution of new words in Chinese text and has rarely addressed other languages. It has also been difficult to obtain semantic information or translations for these new words. This paper proposes NEWBA, an enhanced new word identification algorithm that uses bilingual corpus alignment. The results indicate that NEWBA performs better than the traditional unsupervised method. In addition, it produces bilingual word pairs, which provide translations beyond detection alone. NEWBA thus expands the scope of traditional new word detection and extracts more valuable information from bilingual aligned corpora.

Acknowledgments

This work is partly supported by the Beijing Natural Science Foundation (No. 4212026 and No. 4202069) and the Fundamental Strengthening Program Technology Field Fund (No. 2021-JCJQ-JJ-0059).

Author information

Authors and Affiliations

Beijing Institute of Technology, Beijing, 100081, China
Ziyan Yang, Huaping Zhang & Jianyun Shang
Xinjiang University, Xinjiang, 830046, China
Silamu Wushour

Corresponding author

Correspondence to Huaping Zhang.

Editor information

Editors and Affiliations

Singapore University of Technology and Design, Singapore, Singapore
Wei Lu
Nanjing University, Nanjing, China
Shujian Huang
Soochow University, Suzhou, China
Yu Hong
Soochow University, Soochow, China
Xiabing Zhou

About this paper

Cite this paper

Yang, Z., Zhang, H., Shang, J., Wushour, S. (2022). An Enhanced New Word Identification Approach Using Bilingual Alignment. In: Lu, W., Huang, S., Hong, Y., Zhou, X. (eds) Natural Language Processing and Chinese Computing. NLPCC 2022. Lecture Notes in Computer Science, vol 13551. Springer, Cham. https://doi.org/10.1007/978-3-031-17120-8_8

DOI: https://doi.org/10.1007/978-3-031-17120-8_8
Published: 24 September 2022
Publisher Name: Springer, Cham
Print ISBN: 978-3-031-17119-2
Online ISBN: 978-3-031-17120-8
eBook Packages: Computer Science, Computer Science (R0)
