Reconstituting typeset Marriage Registers using simple software tools

  • David F. Brailsford
Special Issue Paper


In a world of fully integrated software applications, which can seem daunting to develop and to maintain, it is sometimes useful to recall that a system of loosely-linked software components can provide surprisingly powerful and flexible methods for software development.

This paper describes a project which aims to re-typeset a series of volumes from the Phillimore Marriage Registers, first published in England around the turn of the last century. The source material is plain text derived from running Optical Character Recognition (OCR) on a set of page scans taken from the original printed volumes. The regular, tabular structure of the Register pages allows us to automate the re-typesetting process.

The UNIX troff software and its tbl preprocessor are used for the typesetting itself, but a series of simple awk-based software tools, all of them parsers and code generators of one sort or another, is used to bring about the OCR-to-troff transformation.
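As a rough illustration of what one such parser-plus-code-generator stage might look like, the sketch below turns OCR text lines into tbl source. It is written in Python rather than awk, and it is not the project's actual tool: the entry layout ("groom & bride .. date"), the three-column table format and the name `ocr_to_tbl` are all assumptions made for the example.

```python
import re

# Hypothetical OCR line shape: "<groom> & <bride> .. <date>".
# The real register layout may well differ; this is an assumption.
ENTRY = re.compile(
    r"^(?P<groom>.+?)\s*&\s*(?P<bride>.+?)\s*\.{2,}\s*(?P<date>.+)$"
)

def ocr_to_tbl(lines):
    """Return (tbl source, rejected lines) for a list of OCR text lines."""
    rows, rejects = [], []
    for line in lines:
        m = ENTRY.match(line.strip())
        if m:
            # tbl data rows are tab-separated by default
            rows.append("\t".join(m.group("groom", "bride", "date")))
        else:
            # Lines that do not parse are kept for manual inspection
            rejects.append(line)
    # .TS/.TE delimit a tbl table; the format line ends with "."
    table = [".TS", "l l r.", *rows, ".TE"]
    return "\n".join(table), rejects
```

Lines that fail to match are returned separately instead of being dropped, since OCR errors are most likely to show up there.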

By re-parsing the generated troff codes it is possible to produce a surname index as a supplement to the re-typeset volume. Moreover, this second-stage parsing has been invaluable in discovering subtle ‘typos’ in the automatically generated material. With small adjustments to this parser it would be possible to output the complete marriage entries in standard XML or GEDCOM notations.
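The second-stage idea can be sketched in a few lines. The Python example below (again a stand-in, not the paper's awk parser) assumes the generated tbl source holds one tab-separated row per marriage with three fields (groom, bride, date) and that a surname is the last word of a name. The function name `surname_index` and the use of a running entry number in place of a page reference are likewise assumptions for illustration.

```python
from collections import defaultdict

def surname_index(tbl_source):
    """Return (surname -> entry numbers, suspicious rows) from tbl source."""
    index, suspects = defaultdict(list), []
    in_table = seen_format = False
    entry_no = 0
    for line in tbl_source.splitlines():
        if line == ".TS":
            in_table, seen_format = True, False
        elif line == ".TE":
            in_table = False
        elif in_table and not seen_format:
            # tbl ends its format section with a line terminated by "."
            seen_format = line.endswith(".")
        elif in_table:
            entry_no += 1
            fields = line.split("\t")
            # A row with the wrong shape hints at a typo upstream
            if len(fields) != 3 or not all(f.strip() for f in fields):
                suspects.append((entry_no, line))
                continue
            for name in fields[:2]:
                # Assumption: the surname is the last word of the name
                index[name.split()[-1]].append(entry_no)
    return dict(sorted(index.items())), suspects
```

Rows that do not have the expected shape are reported instead of indexed, which is one simple way a re-parse can surface errors in the generated material.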


Keywords: Re-typesetting, OCR, troff, Parsing, Genealogy, Hyperlinking, Indexing



Copyright information

© Springer-Verlag 2010

Authors and Affiliations

  1. Document Engineering Research Group, School of Computer Science, University of Nottingham, Nottingham, UK