Abstract
With the development of computers and the Internet, applications based on bilingual (or multilingual) parallel corpora are becoming more common in natural language processing. Beyond machine translation, parallel corpora are also valuable for bilingual dictionary compilation, word sense disambiguation, and cross-language information retrieval. Word-aligned and sentence-aligned bilingual corpora are now available at large scale, and the corresponding alignment algorithms are relatively mature. In contrast, chunk-level alignment algorithms remain to be studied, and the chunk-aligned corpora they require are scarce. The construction of bilingual corpora and their automatic alignment are of great significance to the development of computational linguistics. However, existing bilingual corpora at home and abroad, especially Chinese-English corpora, are not large, their processing standards are not unified, and no general-purpose bilingual corpus is publicly available. This work lays a foundation for the large-scale construction of a bilingual language information and knowledge base with unified standards and norms, covering multiple fields and genres and aligned at the sentence level.
© 2023 ICST Institute for Computer Sciences, Social Informatics and Telecommunications Engineering
About this paper
Cite this paper
Jie, S. (2023). Construction of Large-Scale Chinese-English Bilingual Corpus and Sentence Alignment. In: Jan, M.A., Khan, F. (eds) Application of Big Data, Blockchain, and Internet of Things for Education Informatization. BigIoT-EDU 2022. Lecture Notes of the Institute for Computer Sciences, Social Informatics and Telecommunications Engineering, vol 466. Springer, Cham. https://doi.org/10.1007/978-3-031-23947-2_42
DOI: https://doi.org/10.1007/978-3-031-23947-2_42
Publisher Name: Springer, Cham
Print ISBN: 978-3-031-23946-5
Online ISBN: 978-3-031-23947-2
eBook Packages: Computer Science, Computer Science (R0)