Bilingual Sentence Alignment Based on Punctuation Statistics and Lexicon

  • Thomas C. Chuang
  • Jian-Cheng Wu
  • Tracy Lin
  • Wen-Chie Shei
  • Jason S. Chang
Part of the Lecture Notes in Computer Science book series (LNCS, volume 3248)


This paper presents a new method for aligning bilingual parallel texts based on punctuation statistics and lexical information. Sentence alignment is reportedly more difficult for disparate language pairs such as English and Chinese than for more closely related languages. We examine the feasibility of using punctuation for high-accuracy sentence alignment and show that punctuation statistics are an effective means of achieving good results: aligning sentences in bilingual parallel corpora based solely on punctuation statistics yields an encouraging precision rate, and results improve further when punctuation statistics and lexical information are combined. We evaluated an implementation of the proposed method on the parallel corpora of Sinorama Magazine and the Records of the Hong Kong Legislative Council, with satisfactory results.
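As a rough illustration of why punctuation can serve as an alignment cue, the sketch below compares the order of punctuation marks in an English sentence and a Chinese sentence. It is not the statistical model proposed in the paper: the mapping `PUNCT_CLASS`, the helper names, and the use of a generic sequence-similarity ratio are all assumptions made for this example only.

```python
from difflib import SequenceMatcher

# Hypothetical mapping from punctuation marks to coarse classes shared by
# English and Chinese. The paper's actual punctuation inventory and
# probabilities are not given on this page; this table is an assumption.
PUNCT_CLASS = {
    ",": "C", "，": "C", "、": "C",
    ".": "P", "。": "P",
    "?": "Q", "？": "Q",
    "!": "E", "！": "E",
    ";": "S", "；": "S",
    ":": "L", "：": "L",
}


def punct_signature(text):
    """Return the punctuation classes found in text, in order of appearance."""
    return "".join(PUNCT_CLASS[ch] for ch in text if ch in PUNCT_CLASS)


def punct_similarity(source, target):
    """Score in [0, 1] for how closely two punctuation signatures match."""
    a, b = punct_signature(source), punct_signature(target)
    if not a and not b:
        # Neither side has punctuation: nothing contradicts a match.
        return 1.0
    return SequenceMatcher(None, a, b).ratio()


def best_match(source, candidates):
    """Pick the candidate whose punctuation signature is closest to source."""
    return max(candidates, key=lambda c: punct_similarity(source, c))
```

A real aligner would combine such a signal with a dynamic-programming search over candidate sentence pairings and, as the abstract notes, with lexical information; here `best_match` only ranks candidates for a single source sentence.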


Machine Translation; Statistical Machine Translation; Parallel Corpus; Lexical Information; Language Pair
These keywords were added by machine and not by the authors. This process is experimental and the keywords may be updated as the learning algorithm improves.




Copyright information

© Springer-Verlag Berlin Heidelberg 2005

Authors and Affiliations

  • Thomas C. Chuang (1)
  • Jian-Cheng Wu (2)
  • Tracy Lin (3)
  • Wen-Chie Shei (2)
  • Jason S. Chang (2)
  1. Department of Computer Science, Vanung University, Chung-Li, Tao-Yuan
  2. Department of Computer Science, National Tsing Hua University, Hsinchu
  3. Department of Telecommunication, National Chiao Tung University, Hsinchu
