Construction of Large-Scale Chinese-English Bilingual Corpus and Sentence Alignment

  • Conference paper
Application of Big Data, Blockchain, and Internet of Things for Education Informatization (BigIoT-EDU 2022)

Abstract

With the development of computers and the Internet, applications based on bilingual (or multilingual) parallel corpora are becoming increasingly common in natural language processing. Beyond machine translation, parallel corpora are also of great value for bilingual dictionary compilation, word sense disambiguation, and cross-language information retrieval. Word-aligned and sentence-aligned bilingual corpora already exist at large scale, and the corresponding alignment algorithms are relatively mature. In contrast, chunk-level alignment algorithms remain to be studied, and the chunk-aligned corpora that such algorithms require are scarce. The construction of bilingual corpora and their automatic alignment are therefore of great significance to the development of computational linguistics. At present, existing bilingual corpora at home and abroad, especially Chinese-English ones, are not large, their processing standards are not unified, and no general-purpose bilingual corpus is publicly available. This work lays a foundation for the large-scale construction of a bilingual language information and knowledge base with unified standards and norms, covering multiple fields and genres and aligned at the sentence level.

Author information

Corresponding author

Correspondence to Sun Jie.

Copyright information

© 2023 ICST Institute for Computer Sciences, Social Informatics and Telecommunications Engineering

About this paper

Cite this paper

Jie, S. (2023). Construction of Large-Scale Chinese-English Bilingual Corpus and Sentence Alignment. In: Jan, M.A., Khan, F. (eds) Application of Big Data, Blockchain, and Internet of Things for Education Informatization. BigIoT-EDU 2022. Lecture Notes of the Institute for Computer Sciences, Social Informatics and Telecommunications Engineering, vol 466. Springer, Cham. https://doi.org/10.1007/978-3-031-23947-2_42

  • DOI: https://doi.org/10.1007/978-3-031-23947-2_42

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-031-23946-5

  • Online ISBN: 978-3-031-23947-2

  • eBook Packages: Computer Science, Computer Science (R0)
