Computers and the Humanities, Volume 38, Issue 3, pp 253–270

Article: Collating Texts Using Progressive Multiple Alignment

  • Matthew Spencer
  • Christopher Howe


To reconstruct a stemma or do any other kind of statistical analysis of a text tradition, one needs accurate data on the variants occurring at each location in each witness. These data are usually obtained from computer collation programs. Existing programs either collate every witness against a base text or divide all texts up into segments as long as the longest variant phrase at each point. These methods do not give ideal data for stemma reconstruction. We describe a better collation algorithm (progressive multiple alignment) that collates all witnesses word by word without a base text, adding groups of witnesses one at a time, starting with the most closely related pair.

Keywords: dynamic programming, multiple alignment, stemma reconstruction, text collation, variants
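The abstract's description of progressive multiple alignment can be illustrated with a minimal Python sketch. It is not the authors' implementation. It assumes that relatedness is measured by word-level edit distance, that groups are merged greedily by average distance between their members, that a gap costs the same as a mismatch, and that every witness contains at least one word. All function names (`edit_distance`, `column_cost`, `align_profiles`, `collate`) are hypothetical.

```python
from itertools import combinations

GAP = None  # marks a location where a witness has no word


def edit_distance(a, b):
    """Word-level Levenshtein distance between two lists of words."""
    prev = list(range(len(b) + 1))
    for i, wa in enumerate(a, 1):
        cur = [i]
        for j, wb in enumerate(b, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (wa != wb)))
        prev = cur
    return prev[-1]


def column_cost(col_a, col_b):
    """Mean mismatch over all cross pairs of entries in two columns."""
    return sum(x != y for x in col_a for y in col_b) / (len(col_a) * len(col_b))


def align_profiles(pa, pb):
    """Align two groups of already-aligned witnesses by dynamic programming."""
    gap_a, gap_b = (GAP,) * len(pa[0]), (GAP,) * len(pb[0])
    n, m = len(pa), len(pb)
    # score[i][j]: minimum cost of aligning the first i columns of pa
    # with the first j columns of pb
    score = [[0.0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        score[i][0] = score[i - 1][0] + column_cost(pa[i - 1], gap_b)
    for j in range(1, m + 1):
        score[0][j] = score[0][j - 1] + column_cost(gap_a, pb[j - 1])
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            score[i][j] = min(
                score[i - 1][j - 1] + column_cost(pa[i - 1], pb[j - 1]),
                score[i - 1][j] + column_cost(pa[i - 1], gap_b),
                score[i][j - 1] + column_cost(gap_a, pb[j - 1]),
            )
    # Trace back from the end to recover the merged columns.
    merged, i, j = [], n, m
    while i > 0 or j > 0:
        if (i > 0 and j > 0 and
                score[i][j] == score[i - 1][j - 1] + column_cost(pa[i - 1], pb[j - 1])):
            merged.append(pa[i - 1] + pb[j - 1])
            i, j = i - 1, j - 1
        elif i > 0 and score[i][j] == score[i - 1][j] + column_cost(pa[i - 1], gap_b):
            merged.append(pa[i - 1] + gap_b)
            i -= 1
        else:
            merged.append(gap_a + pb[j - 1])
            j -= 1
    merged.reverse()
    return merged


def collate(witnesses):
    """Collate a dict of siglum -> word list; returns (sigla, columns)."""
    dist = {frozenset(p): edit_distance(witnesses[p[0]], witnesses[p[1]])
            for p in combinations(witnesses, 2)}
    # Each group is (list of sigla, profile); a profile is a list of columns.
    groups = [([s], [(w,) for w in words]) for s, words in witnesses.items()]

    def linkage(pair):
        # Average pairwise distance between the members of two groups.
        (sa, _), (sb, _) = pair
        return sum(dist[frozenset((x, y))] for x in sa for y in sb) / (len(sa) * len(sb))

    while len(groups) > 1:
        # Merge the two most closely related groups first; no base text is used.
        ga, gb = min(combinations(groups, 2), key=linkage)
        groups = [g for g in groups if g is not ga and g is not gb]
        groups.append((ga[0] + gb[0], align_profiles(ga[1], gb[1])))
    return groups[0]


witnesses = {
    "A": "the quick brown fox".split(),
    "B": "the quick fox".split(),
    "C": "a quick brown fox jumps".split(),
}
sigla, columns = collate(witnesses)
for k, siglum in enumerate(sigla):
    print(siglum, [col[k] if col[k] is not GAP else "-" for col in columns])
```

Each column of the result is one location in the tradition, with one entry per witness, so removing the gaps from any row gives back that witness unchanged. A fuller treatment would likely use a proper guide tree and a more careful gap model than this greedy sketch.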





Copyright information

© Kluwer Academic Publishers 2004

Authors and Affiliations

  • Matthew Spencer (1)
  • Christopher Howe (1)
  1. Department of Mathematics and Statistics, Dalhousie University, Nova Scotia, Canada
