Advanced Training Set Construction for Retrieval in Historic Documents

  • Andrea Ernst-Gerlach
  • Norbert Fuhr
Part of the Lecture Notes in Computer Science book series (LNCS, volume 6458)


Retrieval in historic documents with non-standard spelling requires a mapping from search terms onto the historic terms in the document. To describe this mapping, we have developed a rule-based approach. The bottleneck of this method has been the construction of the training set, for which an expert has to manually assign current word forms to historic spelling variants. As a better solution, we apply a spell checker to a corpus of historic texts, which yields a list of candidate terms and associated suggestions. The new method generates possible rules for the suggestions and accepts the most frequent ones. Experimental results with German and English texts from different centuries demonstrate the feasibility of our approach. Thus, a training set can be constructed with much less initial effort.
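The procedure outlined in the abstract (spell checking, rule generation, frequency-based acceptance) can be illustrated with a minimal sketch. It is not the authors' implementation: the toy lexicon, the function names, the use of Python's difflib as a stand-in for a real spell checker, and the rule format (a changed substring plus one character of left context) are all assumptions made for illustration.

```python
from collections import Counter
from difflib import SequenceMatcher, get_close_matches

# Hypothetical modern lexicon standing in for a spell checker's dictionary.
MODERN_LEXICON = ["teil", "tat", "tier", "tor", "rat", "not", "sein"]


def suggest(word, lexicon=MODERN_LEXICON, n=3):
    """Return modern-form suggestions for words the lexicon does not know."""
    if word in lexicon:
        return []  # known word, so not a candidate historic spelling
    return get_close_matches(word, lexicon, n=n, cutoff=0.6)


def candidate_rules(modern, historic):
    """Yield (modern_substring, historic_substring) pairs where the forms differ.

    Assumption: one character of left context is added when available,
    so an inserted "h" after "t" becomes the rule ("t", "th").
    """
    matcher = SequenceMatcher(None, modern, historic, autojunk=False)
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag == "equal":
            continue
        if i1 > 0 and j1 > 0:
            i1, j1 = i1 - 1, j1 - 1
        yield modern[i1:i2], historic[j1:j2]


def frequent_rules(corpus_words, min_count=2):
    """Count candidate rules over all suggestions and keep the frequent ones."""
    counts = Counter()
    for word in sorted(set(corpus_words)):
        for suggestion in suggest(word):
            counts.update(candidate_rules(suggestion, word))
    return {rule: count for rule, count in counts.items() if count >= min_count}


if __name__ == "__main__":
    # Toy corpus of historic German spellings plus one modern word ("tor").
    corpus = ["theil", "that", "thier", "thor", "rath", "noth", "seyn", "tor"]
    for (modern, historic), count in frequent_rules(corpus).items():
        print(f"{modern} -> {historic} ({count} occurrences)")
```

On this toy corpus the rule t -> th is supported by six word pairs and is accepted, while ei -> ey occurs only once and falls below the assumed threshold of two. A real setup would replace the lexicon lookup with an actual spell checker and tune the threshold on the corpus.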


Keywords: spelling variation, training set construction, historic documents





Copyright information

© Springer-Verlag Berlin Heidelberg 2010

Authors and Affiliations

  • Andrea Ernst-Gerlach (1)
  • Norbert Fuhr (1)
  1. Department of Computational and Cognitive Sciences, University of Duisburg-Essen, Duisburg, Germany
