Construction of Large-Scale Chinese-English Bilingual Corpus and Sentence Alignment

Jie, Sun

doi:10.1007/978-3-031-23947-2_42

Part of the book series: Lecture Notes of the Institute for Computer Sciences, Social Informatics and Telecommunications Engineering ((LNICST,volume 466))

Included in the following conference series:

EAI International Conference, BigIoT-EDU

Abstract

With the development of computers and the Internet, applications based on bilingual (or multilingual) parallel corpora are increasingly common in natural language processing. Beyond machine translation, parallel corpora are also valuable for bilingual dictionary compilation, word sense disambiguation, and cross-language information retrieval. Bilingual corpora aligned at the word and sentence levels already exist at large scale, and the corresponding alignment algorithms are relatively mature. In contrast, chunk-level alignment algorithms remain to be studied, and the chunk-level aligned corpora that such algorithms require are scarce. The construction of bilingual corpora and their automatic alignment is therefore of great significance to the development of computational linguistics. However, existing bilingual corpora at home and abroad, especially Chinese-English ones, are not large, their processing standards are not unified, and there is no general-purpose bilingual corpus available for public use. This work lays a foundation for the large-scale construction of a bilingual language information and knowledge base with unified standards and norms, covering multiple fields and genres, and aligned at the sentence level.

Author information

Authors and Affiliations

School of International Culture and Education, Heilongjiang University, Harbin, 150088, Heilongjiang Province, China
Sun Jie

Corresponding author

Correspondence to Sun Jie.

Editor information

Editors and Affiliations

Department of Computer Science, Abdul Wali Khan University Mardan, Mardan, Pakistan
Mian Ahmad Jan
Department of Computer Science, Abdul Wali Khan University Mardan, Mardan, Pakistan
Fazlullah Khan

About this paper

Cite this paper

Jie, S. (2023). Construction of Large-Scale Chinese-English Bilingual Corpus and Sentence Alignment. In: Jan, M.A., Khan, F. (eds) Application of Big Data, Blockchain, and Internet of Things for Education Informatization. BigIoT-EDU 2022. Lecture Notes of the Institute for Computer Sciences, Social Informatics and Telecommunications Engineering, vol 466. Springer, Cham. https://doi.org/10.1007/978-3-031-23947-2_42

DOI: https://doi.org/10.1007/978-3-031-23947-2_42
Published: 12 January 2023
Publisher Name: Springer, Cham
Print ISBN: 978-3-031-23946-5
Online ISBN: 978-3-031-23947-2
eBook Packages: Computer Science, Computer Science (R0)
