Analyzing Parallelism and Domain Similarities in the MAREC Patent Corpus

  • Katharina Wäschle
  • Stefan Riezler
Part of the Lecture Notes in Computer Science book series (LNCS, volume 7356)


Statistical machine translation of patents requires large amounts of sentence-parallel data. Translations of patent text often exist for parts of the patent document, namely the title, abstract and claims. However, there are no direct translations of the largest part of the document, the description or background of the invention. We document a twofold approach for extracting parallel data from all document sections of the patents in a large multilingual patent corpus. Since language and style differ depending on document section (title, abstract, description, claims) and patent topic (according to the International Patent Classification), we sort the processed data into subdomains to enable its use in domain-oriented translation, e.g. when applying multi-task learning. We investigate several similarity metrics and apply them to the domains of patent topic and patent document section. The product of our research is a corpus of 23 million parallel German-English sentences extracted from the MAREC patent corpus, together with a descriptive analysis of its subdomains.


Keywords: Noun Phrase; Machine Translation; European Patent; Patent Document; Statistical Machine Translation
These keywords were added by machine and not by the authors. This process is experimental and the keywords may be updated as the learning algorithm improves.





Copyright information

© Springer-Verlag Berlin Heidelberg 2012

Authors and Affiliations

  • Katharina Wäschle (1)
  • Stefan Riezler (1)
  1. Department of Computational Linguistics, Heidelberg University, Germany
