Bitext Generation Through Rich Markup


This paper reports on a method for exploiting a bitext as the primary linguistic information source for the design of a generation environment for specialized bilingual documentation. The paper discusses such issues as Text Encoding Initiative (TEI) proposals for specialized corpus tagging, text segmentation and alignment of translation units and their allocation into translation memories, Document Type Definition (DTD) abstraction from tagged texts, and DTD deployment for bilingual text generation. The parallel corpus used for experimentation has two main features:


Casillas, A., Martínez, R. Bitext Generation Through Rich Markup. Computers and the Humanities 38, 223–251 (2004).

  • alignment
  • bilingual document generation
  • bitext
  • parallel corpus
  • segmentation
  • SGML
  • TEI
  • translation memories