Collating Texts Using Progressive Multiple Alignment

Computers and the Humanities

Abstract

To reconstruct a stemma or do any other kind of statistical analysis of a text tradition, one needs accurate data on the variants occurring at each location in each witness. These data are usually obtained from computer collation programs. Existing programs either collate every witness against a base text or divide all texts up into segments as long as the longest variant phrase at each point. These methods do not give ideal data for stemma reconstruction. We describe a better collation algorithm (progressive multiple alignment) that collates all witnesses word by word without a base text, adding groups of witnesses one at a time, starting with the most closely related pair.
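The approach summarised above can be illustrated with a minimal Python sketch. It is not the authors' implementation: the abstract does not give the scoring scheme or the rule for choosing which groups to join, so the unit mismatch and gap costs, the mean cross-pair column cost, the average-linkage merge order, and the names `column_cost`, `align` and `collate` are all assumptions made for illustration.

```python
from itertools import combinations

GAP = None  # marks a location where a witness has no word


def column_cost(col_a, col_b):
    """Mean mismatch over all cross pairs of cells (assumed scoring)."""
    pairs = [(x, y) for x in col_a for y in col_b]
    return sum(x != y for x, y in pairs) / len(pairs)


def align(cols_a, wa, cols_b, wb):
    """Globally align two groups of already-aligned witnesses.

    cols_a, cols_b are lists of columns (tuples of words or GAP);
    wa, wb are the numbers of witnesses in each group.
    Returns the merged columns and the total alignment cost.
    """
    gap_a, gap_b = (GAP,) * wa, (GAP,) * wb
    n, m = len(cols_a), len(cols_b)
    # cost[i][j]: cheapest alignment of the first i columns of a
    # with the first j columns of b
    cost = [[0.0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        cost[i][0] = cost[i - 1][0] + column_cost(cols_a[i - 1], gap_b)
    for j in range(1, m + 1):
        cost[0][j] = cost[0][j - 1] + column_cost(gap_a, cols_b[j - 1])
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost[i][j] = min(
                cost[i - 1][j - 1] + column_cost(cols_a[i - 1], cols_b[j - 1]),
                cost[i - 1][j] + column_cost(cols_a[i - 1], gap_b),
                cost[i][j - 1] + column_cost(gap_a, cols_b[j - 1]),
            )
    # Trace back from the bottom-right corner to recover the columns.
    merged, i, j = [], n, m
    while i > 0 or j > 0:
        if (i > 0 and j > 0 and cost[i][j] ==
                cost[i - 1][j - 1] + column_cost(cols_a[i - 1], cols_b[j - 1])):
            merged.append(cols_a[i - 1] + cols_b[j - 1])
            i, j = i - 1, j - 1
        elif i > 0 and cost[i][j] == cost[i - 1][j] + column_cost(cols_a[i - 1], gap_b):
            merged.append(cols_a[i - 1] + gap_b)
            i -= 1
        else:
            merged.append(gap_a + cols_b[j - 1])
            j -= 1
    merged.reverse()
    return merged, cost[n][m]


def collate(witnesses):
    """Collate a dict {name: text} word by word without a base text."""
    tokens = {name: text.split() for name, text in witnesses.items()}
    # Pairwise word-level edit distances between all witnesses.
    dist = {}
    for a, b in combinations(tokens, 2):
        _, d = align([(w,) for w in tokens[a]], 1, [(w,) for w in tokens[b]], 1)
        dist[frozenset((a, b))] = d

    def group_dist(g, h):
        # Average distance between members of two groups (assumed linkage).
        total = sum(dist[frozenset((a, b))] for a in g[0] for b in h[0])
        return total / (len(g[0]) * len(h[0]))

    # Each group is (witness names, aligned columns); start with singletons.
    groups = [((name,), [(w,) for w in words]) for name, words in tokens.items()]
    while len(groups) > 1:
        # Join the two most closely related groups first.
        g, h = min(combinations(groups, 2), key=lambda p: group_dist(*p))
        groups.remove(g)
        groups.remove(h)
        cols, _ = align(g[1], len(g[0]), h[1], len(h[0]))
        groups.append((g[0] + h[0], cols))
    return groups[0]


witnesses = {
    "A": "the quick brown fox jumps",
    "B": "the quick fox jumps",
    "C": "a quick brown fox leaps",
}
names, columns = collate(witnesses)
for idx, name in enumerate(names):
    print(name, [col[idx] or "-" for col in columns])
```

In this toy example A and B are the closest pair and are aligned first, so B receives a gap opposite "brown"; C is then aligned against the A/B group as a whole rather than against any single base text. Every column of the result is one variant location, which is the kind of word-by-word data the abstract says stemma reconstruction needs.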



Cite this article

Spencer, M., Howe, C. Collating Texts Using Progressive Multiple Alignment. Computers and the Humanities 38, 253–270 (2004). https://doi.org/10.1007/s10579-004-8682-1
