MEDITE: A Unilingual Textual Aligner

Bourdaillet, Julien; Ganascia, Jean-Gabriel

doi:10.1007/11816508_46

Part of the book series: Lecture Notes in Computer Science (LNAI, volume 4139)

Included in the following conference series:

International Conference on Natural Language Processing (in Finland)

Abstract

This paper addresses a natural language text alignment problem that arises in textual genetic criticism, a humanities discipline in which different versions of a text must be compared. The paper shows that this task is hard because such versions can differ greatly and because texts with many internal repetitions pose specific difficulties. MEDITE is a natural language text aligner that compares texts written in the same language. It detects modifications at the character level, whereas related applications either remain at the word level or give poor results at the character level. The paper introduces the detection of moved blocks in the text, which follows from our formalism based on edit distance with moves. The algorithm is closely related to sequence alignment in bioinformatics: it uses similar building blocks and applies them to this natural language processing task. A benchmark analysis comparing MEDITE with other aligners shows that our approach is superior to existing ones, especially in hard cases.
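
To make the idea of character-level alignment with moved blocks concrete, here is a minimal Python sketch. It is not MEDITE's algorithm, which this page does not describe in detail. It substitutes the standard library's `difflib.SequenceMatcher` for the aligner and then applies a simple heuristic: a block counts as moved if text removed from one version reappears verbatim in text added to the other. The function name `align_with_moves` and the `min_move` threshold are assumptions introduced for illustration only.

```python
from difflib import SequenceMatcher


def align_with_moves(old: str, new: str, min_move: int = 4):
    """Align two strings character by character and flag likely moved blocks.

    Hypothetical helper, not part of MEDITE. Returns (ops, moves):
      ops   -- difflib opcodes (tag, i1, i2, j1, j2) over characters
      moves -- (text, old_start, new_start) for each block removed from
               `old` that reappears verbatim in text added to `new`
    """
    # autojunk=False turns off difflib's heuristic that ignores very
    # frequent characters in long inputs.
    matcher = SequenceMatcher(None, old, new, autojunk=False)
    ops = matcher.get_opcodes()

    # Collect the character spans that the alignment could not pair up.
    removed = [(i1, old[i1:i2]) for tag, i1, i2, _, _ in ops
               if tag in ("delete", "replace")]
    added = [(j1, new[j1:j2]) for tag, _, _, j1, j2 in ops
             if tag in ("insert", "replace")]

    moves = []
    for old_pos, gone in removed:
        for new_pos, fresh in added:
            pair = SequenceMatcher(None, gone, fresh, autojunk=False)
            m = pair.find_longest_match(0, len(gone), 0, len(fresh))
            if m.size >= min_move:  # skip coincidental short matches
                moves.append((gone[m.a:m.a + m.size],
                              old_pos + m.a, new_pos + m.b))
    return ops, moves


ops, moves = align_with_moves("ABCDEFGHIJ1234", "1234ABCDEFGHIJ")
print(moves)  # [('1234', 10, 0)]
```

A plain edit script would report `1234` as one deletion plus one unrelated insertion. The second pass pairs the two and reports a single move from offset 10 in the old version to offset 0 in the new one. On texts with many internal repetitions this heuristic can pair the wrong occurrences, which is one of the difficulties the abstract points to.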

Author information

Authors and Affiliations

Université Pierre et Marie Curie – Laboratoire d’Informatique de Paris 6, 8 rue du Capitaine Scott, 75015, Paris, France
Julien Bourdaillet & Jean-Gabriel Ganascia

Editor information

Editors and Affiliations

Turku Centre for Computer Science (TUCS), Department of Information Technology, University of Turku, Joukahaisenkatu 3-5 B, FIN-20520, Turku, Finland
Tapio Salakoski
Turku Centre for Computer Science (TUCS) and Department of IT, University of Turku, Lemminkäisenkatu 14 A, 20520, Turku, Finland
Filip Ginter & Sampo Pyysalo
Department of Information Technology, University of Turku, Lemminkäisenkatu 14–18 A, FIN-20520, Turku, Finland
Tapio Pahikkala

About this paper

Cite this paper

Bourdaillet, J., Ganascia, JG. (2006). MEDITE: A Unilingual Textual Aligner. In: Salakoski, T., Ginter, F., Pyysalo, S., Pahikkala, T. (eds) Advances in Natural Language Processing. FinTAL 2006. Lecture Notes in Computer Science, vol 4139. Springer, Berlin, Heidelberg. https://doi.org/10.1007/11816508_46

DOI: https://doi.org/10.1007/11816508_46
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-37334-6
Online ISBN: 978-3-540-37336-0
eBook Packages: Computer Science, Computer Science (R0)
