Using Comparable Corpora for Under-Resourced Areas of Machine Translation

  • Inguna Skadiņa
  • Robert Gaizauskas
  • Bogdan Babych
  • Nikola Ljubešić
  • Dan Tufiş
  • Andrejs Vasiļjevs

Table of contents

  1. Front Matter
    Pages i-vi
  2. Inguna Skadiņa, Robert Gaizauskas, Andrejs Vasiļjevs, Monica Lestari Paramita
    Pages 1-11
  3. Bogdan Babych, Fangzhong Su, Anthony Hartley, Ahmet Aker, Monica Lestari Paramita, Paul Clough et al.
    Pages 13-53
  4. Monica Lestari Paramita, Ahmet Aker, Paul Clough, Robert Gaizauskas, Nikos Glaros, Nikos Mastropavlos et al.
    Pages 55-87
  5. Mārcis Pinnis, Nikola Ljubešić, Dan Ştefănescu, Inguna Skadiņa, Marko Tadić, Tatjana Gornostaja et al.
    Pages 89-139
  6. Ahmet Aker, Alexandru Ceaușu, Yang Feng, Robert Gaizauskas, Sabine Hunsicker, Radu Ion et al.
    Pages 141-188
  7. Bogdan Babych, Yu Chen, Andreas Eisele, Sabine Hunsicker, Mārcis Pinnis, Inguna Skadiņa et al.
    Pages 189-254
  8. Reinhard Rapp, Vivian Xu, Michael Zock, Serge Sharoff, Richard Forsyth, Bogdan Babych et al.
    Pages 255-290
  9. Ahmet Aker, Radu Ion, Nikos Mastropavlos, Monica Paramita, Mārcis Pinnis, Dan Ştefănescu et al.
    Pages 291-323

About this book


This book provides an overview of how comparable corpora can be used to overcome the lack of parallel resources when building machine translation systems for under-resourced languages and domains. It presents methods and open tools for building comparable corpora from the Web, evaluating comparability, and extracting parallel data for use in machine translation. It is divided into several sections, each covering a specific task such as building, processing, or using comparable corpora, with a particular focus on under-resourced language pairs and domains.

The book is intended for anyone interested in data-driven machine translation for under-resourced languages and domains, especially for developers of machine translation systems, computational linguists and language workers. It offers a valuable resource for specialists and students in natural language processing, machine translation, corpus linguistics and computer-assisted translation, and promotes the broader use of comparable corpora in natural language processing and computational linguistics.


Keywords

  • Comparable corpora
  • Under-resourced languages
  • Multilingual processing
  • Comparability metric
  • Parallel data extraction from comparable corpora
  • Term extraction
  • Machine translation
  • Domain adaptation

Editors and affiliations

  • Inguna Skadiņa
    • 1
  • Robert Gaizauskas
    • 2
  • Bogdan Babych
    • 3
  • Nikola Ljubešić
    • 4
  • Dan Tufiş
    • 5
  • Andrejs Vasiļjevs
    • 6
  1. Tilde, Riga, Latvia
  2. Department of Computer Science, University of Sheffield, Sheffield, UK
  3. School of Modern Languages & Cultures, University of Leeds, Leeds, UK
  4. Faculty of Humanities & Social Sciences, University of Zagreb, Zagreb, Croatia
  5. Institute for Artificial Intelligence, Romanian Academy, Bucharest, Romania
  6. Tilde, Riga, Latvia

Bibliographic information