Recycling Annotated Parallel Corpora for Bilingual Document Composition

  • Arantza Casillas
  • Joseba Abaitua
  • Raquel Martinez
Conference paper
Part of the Lecture Notes in Computer Science book series (LNCS, volume 1934)


Parallel corpora enriched with descriptive annotations facilitate the development of multilingual authoring. Starting from an annotated bitext, we show how SGML markup can be recycled to produce complementary language resources. On the one hand, several translation memory databases, together with glossaries of proper nouns, have been produced. On the other, DTDs for source and target documents have been derived and put into correspondence. This paper discusses how these resources have been automatically generated and applied to an interactive bilingual authoring system. This tool can handle a substantial proportion of the text in both the composition and the translation of structured documents.


Machine Translation, Logical Structure, Document Type, Source Document, Proper Noun
These keywords were added by machine and not by the authors. This process is experimental and the keywords may be updated as the learning algorithm improves.





Copyright information

© Springer-Verlag Berlin Heidelberg 2000

Authors and Affiliations

  • Arantza Casillas (1)
  • Joseba Abaitua (2)
  • Raquel Martinez (3)
  1. Departamento de Automática, Universidad de Alcalá, Spain
  2. Facultad de Filosofía y Letras, Universidad de Deusto, Bilbao
  3. Departamento de Sis. Informáticos y Programación, Facultad de Matemáticas, Universidad Complutense de Madrid, Spain
