MEDITE: A Unilingual Textual Aligner

  • Julien Bourdaillet
  • Jean-Gabriel Ganascia
Conference paper
Part of the Lecture Notes in Computer Science book series (LNCS, volume 4139)


This paper addresses a natural language text alignment problem arising in textual genetic criticism, a humanities discipline in which different versions of a text must be compared. The paper shows that this task is hard because such versions can differ widely and because texts with many internal repetitions pose specific difficulties. MEDITE is a natural language text aligner that compares texts written in the same language. It detects modifications at the character level, whereas related applications either remain at the word level or give poor results at the character level. It also detects moved blocks of text, a capability that follows from our formalism based on edit distance with moves. The algorithm is closely related to sequence alignment in bioinformatics: it uses similar building blocks and applies them to this natural language processing task. A benchmark analysis comparing MEDITE with other aligners shows that our approach outperforms existing ones, especially in hard cases.
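To make the idea of character-level alignment with moved blocks concrete, here is a minimal sketch in Python. It is not MEDITE's algorithm: it relies on the standard library's `difflib.SequenceMatcher` rather than on an edit distance with moves, and it simply pairs deleted spans with inserted spans that share a long common substring. The function name `char_align` and the threshold `min_move` are assumptions introduced for illustration only.

```python
from difflib import SequenceMatcher


def char_align(a: str, b: str, min_move: int = 4):
    """Align two versions of a text at character level.

    Returns (ops, moves):
      ops   -- difflib opcodes (tag, i1, i2, j1, j2) describing equal,
               replaced, deleted and inserted character spans;
      moves -- list of (pos_in_a, pos_in_b, length) for spans removed
               from `a` that reappear inside spans added to `b`.
    `min_move` is a hypothetical threshold: shorter repeats are ignored.
    """
    # autojunk=False disables difflib's popularity heuristic so that
    # frequent characters (spaces, vowels) are still matched.
    sm = SequenceMatcher(None, a, b, autojunk=False)
    ops = sm.get_opcodes()

    # Character spans present only in `a` and only in `b`, respectively.
    deleted = [(i1, i2) for tag, i1, i2, _, _ in ops
               if tag in ("delete", "replace")]
    inserted = [(j1, j2) for tag, _, _, j1, j2 in ops
                if tag in ("insert", "replace")]

    moves = []
    for i1, i2 in deleted:
        for j1, j2 in inserted:
            # Longest common substring of one deleted and one inserted span.
            sub = SequenceMatcher(None, a[i1:i2], b[j1:j2], autojunk=False)
            m = sub.find_longest_match(0, i2 - i1, 0, j2 - j1)
            if m.size >= min_move:
                moves.append((i1 + m.a, j1 + m.b, m.size))
    return ops, moves


if __name__ == "__main__":
    old = "Paris is grey. Rain falls."
    new = "Rain falls. Paris is grey."
    ops, moves = char_align(old, new)
    for i, j, n in moves:
        print(f"moved {old[i:i + n]!r}: position {i} -> {j}")
```

A plain diff would report the swapped sentence as one deletion plus one unrelated insertion; pairing the two spans is what lets the sketch report a move instead. Texts with many internal repetitions, which the paper singles out as hard, would produce ambiguous pairings that this naive approach does not resolve.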


Keywords: Machine Translation; Edit Distance; Natural Language Text; Text Version; Approximate String Match





Copyright information

© Springer-Verlag Berlin Heidelberg 2006

Authors and Affiliations

  • Julien Bourdaillet (1)
  • Jean-Gabriel Ganascia (1)
  1. Université Pierre et Marie Curie – Laboratoire d’Informatique de Paris 6, Paris, France
