Machine Translation, Volume 33, Issue 3, pp 205–237

Efficient document alignment across scenarios

  • Andoni Azpeitia
  • Thierry Etchegoyhen

Abstract

We present and evaluate an approach to document alignment designed for efficiency and portability: it relies on automatically extracted lexical translations and simple set-theoretic operations to compute document-level similarity. We compare our approach to the state of the art on a variety of alignment scenarios, showing that it outperforms alternative document-alignment methods in the vast majority of cases, on both parallel and comparable corpora. We also explore several forms of simple component optimisation to evaluate the potential for improvement of the core method, and describe several successful optimisation paths that lead to significant improvements over strong baselines. The proposed approach constitutes an effective and easy-to-deploy method for accurate document alignment across scenarios, with the potential to improve the creation of parallel corpora.
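
As an illustration only, a set-theoretic document similarity of the kind described above might be sketched as follows. The sketch assumes a Jaccard-style overlap between token sets, computed after mapping source tokens through a lexical translation table. The function names, the toy lexicon and the exact formula are assumptions made for this example; they are not taken from the article and should not be read as the method it evaluates.

```python
def translate_tokens(tokens, lexicon):
    """Map source tokens to the union of their candidate translations.

    `lexicon` is a hypothetical dict from a source word to a set of
    target-language words; tokens missing from it are simply dropped.
    """
    translated = set()
    for token in tokens:
        translated.update(lexicon.get(token, ()))
    return translated


def jaccard_similarity(source_tokens, target_tokens, lexicon):
    """Jaccard overlap between translated source tokens and target tokens."""
    source_set = translate_tokens(set(source_tokens), lexicon)
    target_set = set(target_tokens)
    union = source_set | target_set
    if not union:
        # Two empty documents: define the similarity as 0 to avoid 0/0.
        return 0.0
    return len(source_set & target_set) / len(union)


# Toy Spanish-to-English lexicon, invented for this example.
lexicon = {"casa": {"house", "home"}, "roja": {"red"}}
score = jaccard_similarity(["la", "casa", "roja"], ["the", "red", "house"], lexicon)
print(score)
```

In this toy case two of the four distinct words are shared after translation, so the score is 0.5. Aligning a collection would then amount to scoring candidate document pairs and keeping the best-scoring target for each source document, again under the assumptions stated above.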

Keywords

Document alignment, Comparable corpora, Parallel corpora

Notes

Funding

This work was partially funded by the Spanish Ministry of Economy and Competitiveness, via project AdapTA (RTC-2015-3627-7), and the Department of Economic Development and Competitiveness of the Basque Government, via project TRADIN (IG-2015/0000347).

Copyright information

© Springer Nature B.V. 2019

Authors and Affiliations

  1. Vicomtech, Donostia/San Sebastián, Spain
