Abstract
To reconstruct a stemma or do any other kind of statistical analysis of a text tradition, one needs accurate data on the variants occurring at each location in each witness. These data are usually obtained from computer collation programs. Existing programs either collate every witness against a base text or divide all texts up into segments as long as the longest variant phrase at each point. These methods do not give ideal data for stemma reconstruction. We describe a better collation algorithm (progressive multiple alignment) that collates all witnesses word by word without a base text, adding groups of witnesses one at a time, starting with the most closely related pair.
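The idea of progressive multiple alignment can be illustrated with a minimal sketch. It is not the authors' implementation: the word-level edit distance, the average-linkage rule for choosing which groups to merge next, the column score, and the gap penalty of -0.5 are all assumptions made for illustration, and the names `collate`, `align_profiles`, `column_score`, and `edit_distance` are hypothetical.

```python
from itertools import combinations

GAP = None  # marks a location where a witness has no word


def edit_distance(a, b):
    """Word-level Levenshtein distance between two token lists."""
    prev = list(range(len(b) + 1))
    for i, wa in enumerate(a, 1):
        cur = [i]
        for j, wb in enumerate(b, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (wa != wb)))
        prev = cur
    return prev[-1]


def column_score(col_a, col_b):
    """Fraction of cross-group word pairs that agree (a gap never agrees)."""
    hits = sum(x is not GAP and x == y for x in col_a for y in col_b)
    return hits / (len(col_a) * len(col_b))


def align_profiles(A, B, width_a, width_b, gap=-0.5):
    """Globally align two lists of columns (tuples) by dynamic programming."""
    n, m = len(A), len(B)
    S = [[0.0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        S[i][0] = S[i - 1][0] + gap
    for j in range(1, m + 1):
        S[0][j] = S[0][j - 1] + gap
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            S[i][j] = max(S[i - 1][j - 1] + column_score(A[i - 1], B[j - 1]),
                          S[i - 1][j] + gap,
                          S[i][j - 1] + gap)
    # Trace back from the bottom-right corner, padding with gaps as needed.
    merged, i, j = [], n, m
    while i > 0 or j > 0:
        if i > 0 and j > 0 and S[i][j] == S[i - 1][j - 1] + column_score(A[i - 1], B[j - 1]):
            merged.append(A[i - 1] + B[j - 1])
            i, j = i - 1, j - 1
        elif i > 0 and (j == 0 or S[i][j] == S[i - 1][j] + gap):
            merged.append(A[i - 1] + (GAP,) * width_b)
            i -= 1
        else:
            merged.append((GAP,) * width_a + B[j - 1])
            j -= 1
    merged.reverse()
    return merged


def collate(witnesses):
    """Collate {siglum: text} word by word, without a base text."""
    texts = {k: v.split() for k, v in witnesses.items()}
    # Normalised pairwise distances decide the merge order.
    dist = {frozenset((a, b)): edit_distance(texts[a], texts[b])
            / max(len(texts[a]), len(texts[b]), 1)
            for a, b in combinations(texts, 2)}
    # Each group is (sigla, columns); every witness starts as its own group.
    groups = [([k], [(w,) for w in t]) for k, t in texts.items()]

    def avg(g, h):
        pairs = [(a, b) for a in g[0] for b in h[0]]
        return sum(dist[frozenset(p)] for p in pairs) / len(pairs)

    while len(groups) > 1:
        # Merge the two most closely related groups first.
        g, h = min(combinations(groups, 2), key=lambda p: avg(*p))
        groups = [x for x in groups if x is not g and x is not h]
        columns = align_profiles(g[1], h[1], len(g[0]), len(h[0]))
        groups.append((g[0] + h[0], columns))
    sigla, columns = groups[0]
    return {s: [col[i] for col in columns] for i, s in enumerate(sigla)}


if __name__ == "__main__":
    demo = collate({"A": "the quick brown fox jumps",
                    "B": "the quick fox jumps",
                    "C": "a quick brown fox leaps"})
    for siglum, row in demo.items():
        print(siglum, row)
```

Each row of the result has one entry per location, with `None` where a witness omits a word, so the columns can be read directly as variant locations. A real collation would also need to handle spelling variation and transpositions, which this sketch ignores.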
Cite this article
Spencer, M., Howe, C. Article: Collating Texts Using Progressive Multiple Alignment. Computers and the Humanities 38, 253–270 (2004). https://doi.org/10.1007/s10579-004-8682-1
DOI: https://doi.org/10.1007/s10579-004-8682-1