Abstract
This chapter describes bilingual corpora developed in Japan. First, we discuss problems of corpus development and survey some corpora that are, or will be, available in Japan. Next, we describe the bilingual corpus project of JEIDA (Japan Electronics Industry Development Association), whose main purpose is to develop a medium-sized aligned parallel corpus of English and Japanese. The project also lets us examine the various facets of developing a bilingual corpus, study the alignment of Japanese and English sentences, and investigate the automatic acquisition of linguistic knowledge from the resulting corpus. The chapter then gives an overview of the automatic alignment system developed by NTT (Nippon Telegraph and Telephone Corporation), including a detailed account of the entire alignment algorithm. Finally, it describes BACCS, a graphical alignment environment in which the user can view the alignment results and easily modify both the results and the user dictionary.
Copyright information
© 2000 Springer Science+Business Media Dordrecht
About this chapter
Cite this chapter
Isahara, H., Haruno, M. (2000). Japanese-English aligned bilingual corpora. In: Véronis, J. (ed.) Parallel Text Processing. Text, Speech and Language Technology, vol 13. Springer, Dordrecht. https://doi.org/10.1007/978-94-017-2535-4_16
DOI: https://doi.org/10.1007/978-94-017-2535-4_16
Publisher Name: Springer, Dordrecht
Print ISBN: 978-90-481-5555-2
Online ISBN: 978-94-017-2535-4
eBook Packages: Springer Book Archive