Spanish-Basque Parallel Corpus Structure: Linguistic Annotations and Translation Units

  • A. Casillas
  • A. Díaz de Illarraza
  • J. Igartua
  • R. Martínez
  • K. Sarasola
  • A. Sologaistoa
Part of the Lecture Notes in Computer Science book series (LNCS, volume 4629)


In this paper we propose a corpus structure for representing and managing an aligned parallel corpus. The structure is based on a stand-off annotation model composed of several XML documents. A bilingual parallel corpus represented in this structure contains: (1) the entire corpus together with its linguistic information, and (2) translation units and alignment relations between units of the two languages at three levels: paragraphs, sentences and named entities. The proposed structure allows the corpus to be used both as a linguistically annotated corpus and as a translation memory.
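As a rough illustration of the stand-off idea, the sketch below keeps the text, the named-entity annotations and the alignment links in separate XML documents that refer to one another only by identifier. It is a minimal sketch under stated assumptions: the element and attribute names (`text`, `s`, `annotations`, `ne`, `alignment`, `link`, `target`, `level`), the toy sentences and the helper functions are all hypothetical and are not the schema or tooling described in the paper.

```python
import xml.etree.ElementTree as ET

# Hypothetical stand-off documents. Every element name, attribute name and
# sentence here is an illustrative assumption, not the paper's actual schema.
ES_TEXT = """<text lang="es">
  <s id="es.s1">El Gobierno Vasco aprobó el decreto.</s>
  <s id="es.s2">Entrará en vigor mañana.</s>
</text>"""

EU_TEXT = """<text lang="eu">
  <s id="eu.s1">Eusko Jaurlaritzak dekretua onartu zuen.</s>
  <s id="eu.s2">Bihar sartuko da indarrean.</s>
</text>"""

# Annotations live in their own documents and point into the text by
# sentence id plus character offsets (the stand-off part).
ES_ANNOT = """<annotations lang="es">
  <ne id="es.ne1" target="es.s1" type="ORG" start="3" end="17"/>
</annotations>"""

EU_ANNOT = """<annotations lang="eu">
  <ne id="eu.ne1" target="eu.s1" type="ORG" start="0" end="18"/>
</annotations>"""

# Alignment relations are a third document that only stores id pairs.
ALIGNMENT = """<alignment>
  <link level="sentence" es="es.s1" eu="eu.s1"/>
  <link level="sentence" es="es.s2" eu="eu.s2"/>
  <link level="ne" es="es.ne1" eu="eu.ne1"/>
</alignment>"""


def sentences(text_xml):
    """Map sentence id -> sentence string for one language."""
    root = ET.fromstring(text_xml)
    return {s.get("id"): s.text for s in root.iter("s")}


def entities(annot_xml, sents):
    """Resolve each stand-off named entity to the text span it points at."""
    root = ET.fromstring(annot_xml)
    spans = {}
    for ne in root.iter("ne"):
        sentence = sents[ne.get("target")]
        spans[ne.get("id")] = sentence[int(ne.get("start")):int(ne.get("end"))]
    return spans


def translation_units(alignment_xml, level, es_units, eu_units):
    """Return (Spanish, Basque) pairs for one alignment level."""
    root = ET.fromstring(alignment_xml)
    return [
        (es_units[link.get("es")], eu_units[link.get("eu")])
        for link in root.iter("link")
        if link.get("level") == level
    ]


es_s, eu_s = sentences(ES_TEXT), sentences(EU_TEXT)
es_ne, eu_ne = entities(ES_ANNOT, es_s), entities(EU_ANNOT, eu_s)

# Translation-memory view: aligned sentence pairs.
sentence_pairs = translation_units(ALIGNMENT, "sentence", es_s, eu_s)
# Annotated-corpus view combined with alignment: aligned named entities.
ne_pairs = translation_units(ALIGNMENT, "ne", es_ne, eu_ne)
```

Because the links store only identifiers, the same text documents can serve both views: reading the annotation documents gives the linguistically annotated corpus, while reading the alignment document gives the translation-memory pairs.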


Machine Translation, Minority Language, Linguistic Information, Parallel Corpus, Corpus Structure





Copyright information

© Springer-Verlag Berlin Heidelberg 2007

Authors and Affiliations

  • A. Casillas (1)
  • A. Díaz de Illarraza (2)
  • J. Igartua (2)
  • R. Martínez (3)
  • K. Sarasola (2)
  • A. Sologaistoa (2)
  1. Dpt. Electricidad y Electrónica, UPV-EHU
  2. IXA Taldea
  3. NLP&IR Group, UNED
