Article: Collating Texts Using Progressive Multiple Alignment

Spencer, Matthew; Howe, Christopher

doi:10.1007/s10579-004-8682-1

Published: August 2004

Volume 38, pages 253–270 (2004)

Computers and the Humanities

Matthew Spencer¹ & Christopher Howe¹

Abstract

To reconstruct a stemma or do any other kind of statistical analysis of a text tradition, one needs accurate data on the variants occurring at each location in each witness. These data are usually obtained from computer collation programs. Existing programs either collate every witness against a base text or divide all texts up into segments as long as the longest variant phrase at each point. These methods do not give ideal data for stemma reconstruction. We describe a better collation algorithm (progressive multiple alignment) that collates all witnesses word by word without a base text, adding groups of witnesses one at a time, starting with the most closely related pair.
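
The idea in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the function names (`align`, `column_score`, `distance`, `collate`), the scoring values (+1 for a shared word, -1 for a mismatch or gap), and the use of single-linkage grouping to decide which groups to merge next are all assumptions made for illustration. Each witness is a list of words; groups of already-aligned witnesses are merged pairwise, closest first, with no base text.

```python
from itertools import combinations

GAP = None  # marks a position where a witness has no word


def column_score(c1, c2):
    """+1 if two alignment columns share a word, else -1 (assumed scoring)."""
    return 1 if (set(c1) & set(c2)) - {GAP} else -1


def align(a, b, score, gap=-1):
    """Global (Needleman-Wunsch style) alignment of two column lists.

    Returns a list of (i, j) index pairs; None on one side means a gap."""
    n, m = len(a), len(b)
    dp = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        dp[i][0] = i * gap
    for j in range(1, m + 1):
        dp[0][j] = j * gap
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            dp[i][j] = max(dp[i - 1][j - 1] + score(a[i - 1], b[j - 1]),
                           dp[i - 1][j] + gap,
                           dp[i][j - 1] + gap)
    pairs, i, j = [], n, m
    while i > 0 or j > 0:  # trace back from the bottom-right corner
        if i > 0 and j > 0 and dp[i][j] == dp[i - 1][j - 1] + score(a[i - 1], b[j - 1]):
            pairs.append((i - 1, j - 1))
            i, j = i - 1, j - 1
        elif i > 0 and dp[i][j] == dp[i - 1][j] + gap:
            pairs.append((i - 1, None))
            i -= 1
        else:
            pairs.append((None, j - 1))
            j -= 1
    return pairs[::-1]


def distance(w1, w2):
    """Fraction of aligned positions at which two witnesses disagree."""
    pairs = align([(x,) for x in w1], [(x,) for x in w2], column_score)
    diff = sum(1 for i, j in pairs if i is None or j is None or w1[i] != w2[j])
    return diff / max(len(pairs), 1)


def collate(witnesses):
    """Progressively merge groups of witnesses, closest groups first."""
    # A group is (list of witness names, list of columns); a column holds
    # one word (or GAP) per witness in the group.
    groups = [([name], [(w,) for w in words]) for name, words in witnesses.items()]
    dist = {frozenset(p): distance(witnesses[p[0]], witnesses[p[1]])
            for p in combinations(witnesses, 2)}

    def group_dist(g, h):  # single linkage: closest pair of members
        return min(dist[frozenset((x, y))] for x in g[0] for y in h[0])

    while len(groups) > 1:
        g, h = min(combinations(groups, 2), key=lambda p: group_dist(*p))
        merged = []
        for i, j in align(g[1], h[1], column_score):
            left = g[1][i] if i is not None else (GAP,) * len(g[0])
            right = h[1][j] if j is not None else (GAP,) * len(h[0])
            merged.append(left + right)
        groups = [x for x in groups if x is not g and x is not h]
        groups.append((g[0] + h[0], merged))
    names, cols = groups[0]
    return {name: [col[k] for col in cols] for k, name in enumerate(names)}


if __name__ == "__main__":
    toy = {
        "A": "the quick brown fox jumps".split(),
        "B": "the quick fox jumps".split(),
        "C": "a quick brown fox leaps".split(),
    }
    for name, row in collate(toy).items():
        print(name, row)
```

In this toy example, A and B are merged first because they differ at only one position, and C is then aligned against the resulting group. Every witness ends up with one row of equal length, with `None` marking a word that a witness lacks, which is the word-by-word form of data that stemma reconstruction needs.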


Author information

Authors and Affiliations

Department of Mathematics and Statistics, Dalhousie University, Nova Scotia, Canada
Matthew Spencer & Christopher Howe

About this article

Cite this article

Spencer, M., Howe, C. Collating Texts Using Progressive Multiple Alignment. Computers and the Humanities 38, 253–270 (2004). https://doi.org/10.1007/s10579-004-8682-1

Issue Date: August 2004
DOI: https://doi.org/10.1007/s10579-004-8682-1
