Parallel Text Processing, pp 187-200
Parallel text alignment using crosslingual information retrieval techniques
Abstract
In this chapter, we demonstrate that aligning a sentence with its translation is not fundamentally different from finding a sentence on the same topic in the target corpus, using the source sentence as a query. Both processes rely on the semantic proximity of two sentences in different languages. Their major difference is that information retrieval only needs to ensure that the sentence found contains most of the information in the query, whereas sentence alignment requires that the parts that are not common to both languages be as small as possible. A crosslingual query system can therefore be used to obtain candidates for sentence alignment, provided that the measure of semantic proximity is slightly modified. More classical techniques that take sequential order into account can also be used, but our approach is very robust to text desynchronization, such as text segments missing in one language, or texts such as glossaries or indexes that are not in the same order in different languages.
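The contrast between the two proximity measures can be illustrated with a minimal Python sketch. It is not the chapter's weighted boolean model: the toy lexicon `LEXICON`, the function names, and both scoring formulas are assumptions chosen only to show the idea. The retrieval-style score rewards coverage of the query alone, while the alignment-style score also penalizes target words that have no counterpart in the source. Candidates are ranked without any reference to their position in the text, which is why this kind of scoring does not depend on sentence order.

```python
from typing import Dict, List, Set, Tuple

# Hypothetical toy French-to-English lexicon (illustration only).
LEXICON: Dict[str, Set[str]] = {
    "le": {"the"},
    "chat": {"cat"},
    "dort": {"sleeps"},
}


def covered_words(source: List[str], target: List[str],
                  lexicon: Dict[str, Set[str]]) -> Tuple[Set[str], Set[str]]:
    """Return the source words and target words linked through the lexicon."""
    source_set, target_set = set(source), set(target)
    covered_source = {w for w in source_set
                      if lexicon.get(w, set()) & target_set}
    covered_target = {t for t in target_set
                      if any(t in lexicon.get(w, set()) for w in source_set)}
    return covered_source, covered_target


def retrieval_score(source: List[str], target: List[str],
                    lexicon: Dict[str, Set[str]]) -> float:
    """Fraction of distinct query words found (via translation) in the target."""
    covered_source, _ = covered_words(source, target, lexicon)
    distinct = len(set(source))
    return len(covered_source) / distinct if distinct else 0.0


def alignment_score(source: List[str], target: List[str],
                    lexicon: Dict[str, Set[str]]) -> float:
    """Symmetric score: unmatched words on either side lower the result."""
    covered_source, covered_target = covered_words(source, target, lexicon)
    total = len(set(source)) + len(set(target))
    return (len(covered_source) + len(covered_target)) / total if total else 0.0


def best_candidate(source: List[str], candidates: List[List[str]],
                   lexicon: Dict[str, Set[str]]) -> int:
    """Index of the candidate with the highest alignment score (any order)."""
    return max(range(len(candidates)),
               key=lambda i: alignment_score(source, candidates[i], lexicon))


query = ["le", "chat", "dort"]
exact = ["the", "cat", "sleeps"]
longer = ["the", "cat", "sleeps", "while", "the", "dog", "barks", "outside"]

# Both candidates fully cover the query, so retrieval cannot separate them.
print(retrieval_score(query, exact, LEXICON),
      retrieval_score(query, longer, LEXICON))
# The alignment-style score prefers the candidate with less extra material.
print(alignment_score(query, exact, LEXICON),
      alignment_score(query, longer, LEXICON))
```

In this toy case both candidates are equally good answers to the query, but only the shorter one is a plausible translation, and `best_candidate` selects it regardless of where it appears in the candidate list.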
Keywords
Cross-language information retrieval, weighted boolean model, sentence alignment, word alignment, bilingual corpora, French, English