MEDITE: A Unilingual Textual Aligner
This paper addresses a natural language text alignment problem arising in textual genetic criticism, a humanities discipline in which different versions of a text must be compared. The paper shows that this task is hard because versions can differ greatly and texts with many internal repetitions present specific difficulties. MEDITE is a natural language text aligner that compares texts written in the same language. It detects modifications at the character level, whereas related applications either remain at the word level or give poor results at the character level. It also detects moved blocks of text, a capability that follows from our formalism based on edit distance with moves. The algorithm is closely related to sequence alignment in bioinformatics: it uses similar building blocks and applies them to this natural language processing task. A benchmark analysis comparing MEDITE with other aligners shows that our approach is superior to existing ones, especially in hard cases.
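To make the distinction between character-level edits and moved blocks concrete, here is a minimal sketch. It is not MEDITE's algorithm and does not compute an edit distance with moves: it uses the longest-matching-block alignment in Python's standard `difflib` module, then applies a naive rule that relabels a deleted fragment as a move when the same fragment reappears as an insertion. The function name `align_chars` and the parameter `min_move_len` are hypothetical, chosen only for this illustration.

```python
from difflib import SequenceMatcher


def align_chars(old: str, new: str, min_move_len: int = 4):
    """Classify character-level differences between two versions of a text.

    Returns a list of (tag, old_fragment, new_fragment) tuples, where tag is
    'equal', 'delete', 'insert', 'replace', or 'move'.
    Illustrative only: this is NOT the MEDITE algorithm.
    """
    # autojunk=False stops difflib from ignoring very frequent characters.
    matcher = SequenceMatcher(None, old, new, autojunk=False)
    ops = [
        (tag, old[i1:i2], new[j1:j2])
        for tag, i1, i2, j1, j2 in matcher.get_opcodes()
    ]

    # Naive move detection (an assumption of this sketch): a fragment that is
    # deleted in one place and inserted verbatim elsewhere, ignoring
    # surrounding whitespace, is treated as a moved block.
    deleted = {o.strip() for tag, o, _ in ops if tag == "delete"}
    inserted = {n.strip() for tag, _, n in ops if tag == "insert"}
    moved = {frag for frag in deleted & inserted if len(frag) >= min_move_len}

    result = []
    for tag, o, n in ops:
        if tag == "delete" and o.strip() in moved:
            result.append(("move", o, ""))
        elif tag == "insert" and n.strip() in moved:
            result.append(("move", "", n))
        else:
            result.append((tag, o, n))
    return result


if __name__ == "__main__":
    # A single-character deletion inside a word.
    for op in align_chars("colour", "color"):
        print(op)
    # Two sentences swapped: the shorter one is reported as a moved block.
    for op in align_chars("The cat sat. A dog ran.", "A dog ran. The cat sat."):
        print(op)
```

A plain alignment would report the swapped sentence as one unrelated deletion and one unrelated insertion. Pairing the two is the simplest form of the moved-block idea, though exact pairing of this kind would miss blocks that were also edited while being moved.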
Keywords: Machine Translation, Edit Distance, Natural Language Text, Text Version, Approximate String Match