This paper presents a language-independent, context-based sentence alignment technique for parallel corpora. The problem of aligning sentences can be viewed as finding translations of sentences chosen from different sources. Unlike current approaches, which rely on pre-defined features and models, our algorithm employs features derived from the distributional properties of words and does not use any language-dependent knowledge. We make use of the context of sentences and the notion of Zipfian word vectors, which effectively model the distributional properties of the words in a given sentence. We take the context to be the frame within which reasoning about sentence alignment is carried out. We evaluate the performance of our system using two measures: sentence alignment accuracy and sentence alignment coverage. We compare our system with commonly used sentence alignment systems and show that, for moderately sized corpora, it performs 1.2149 to 1.6022 times better in reducing the error rate in alignment accuracy and coverage.
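The abstract does not define Zipfian word vectors, so the following is only a minimal sketch of one plausible reading: each sentence is represented by how many of its words fall into each logarithmic band of corpus frequency, which requires no language-specific resources. The function names, the base-2 binning, the default `num_bins`, and the use of cosine similarity are all assumptions made for illustration and are not taken from the paper.

```python
import math
from collections import Counter


def corpus_frequencies(sentences):
    """Count how often each whitespace-separated token occurs in a corpus."""
    counts = Counter()
    for sentence in sentences:
        counts.update(sentence.lower().split())
    return counts


def zipfian_word_vector(sentence, counts, num_bins=8):
    """Hypothetical Zipfian word vector: entry i counts the words of the
    sentence whose corpus frequency f satisfies floor(log2(f)) == i.
    Frequencies beyond the last bin are clamped into it (an assumption)."""
    vector = [0] * num_bins
    for word in sentence.lower().split():
        freq = counts.get(word, 1)  # unseen words are treated as frequency 1
        bin_index = min(int(math.log2(freq)), num_bins - 1)
        vector[bin_index] += 1
    return vector


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


# Toy monolingual corpus, used only to illustrate the binning.
corpus = ["the cat sat", "the dog sat", "the bird flew"]
counts = corpus_frequencies(corpus)
vec = zipfian_word_vector("the cat sat", counts, num_bins=4)
```

Because such vectors depend only on frequency bands, vectors built separately from the source-side and target-side corpora could in principle be compared directly, for example with `cosine`, when scoring candidate sentence pairs. How the paper actually combines them with sentence context is not stated in this abstract.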


Keywords: sentence alignment, context, Zipfian word vectors, multilingual





Copyright information

© Springer-Verlag Berlin Heidelberg 2008

Authors and Affiliations

  • Ergun Biçici (1)
  1. Koç University, Sariyer, Istanbul, Turkey
