A Cross-Language Approach to Historic Document Retrieval

  • Marijn Koolen
  • Frans Adriaans
  • Jaap Kamps
  • Maarten de Rijke
Part of the Lecture Notes in Computer Science book series (LNCS, volume 3936)


Our cultural heritage, as preserved in libraries, archives and museums, is made up of documents written many centuries ago. Large-scale digitization initiatives make these documents available to non-expert users through digital libraries and vertical search engines. For a user, querying a historic document collection may be a disappointing experience: queries involving modern words may not be very effective for retrieving documents that contain many historic terms. We propose a cross-language approach to historic document retrieval, and investigate (1) the automatic construction of translation resources for historic languages, and (2) the retrieval of historic documents using cross-language information retrieval techniques. Our experimental evidence is based on a collection of 17th-century Dutch documents and a set of 25 known-item topics in modern Dutch. Our main findings are as follows. First, we are able to automatically construct rules for modernizing historic language by comparing historic and modern corpora on (a) phonetic sequence similarity, (b) the relative frequency of consonant and vowel sequences, and (c) the relative frequency of character n-gram sequences. Second, modern queries are not very effective for retrieving historic documents, but the historic language tools lead to a substantial improvement in retrieval effectiveness. These improvements are above and beyond the improvement due to using a modern stemming algorithm (whose effectiveness actually goes up when the historic language is modernized).
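
To make criterion (c) more concrete, the following is a minimal sketch of how the relative frequency of character n-grams in a historic and a modern word list might be compared. It is not the authors' implementation: the function names, the `#` word-boundary marker, the smoothing constant, and the toy word lists are all assumptions chosen for illustration.

```python
from collections import Counter


def ngram_relative_freq(words, n):
    """Return the relative frequency of each character n-gram in a word list."""
    counts = Counter()
    for word in words:
        padded = f"#{word.lower()}#"  # '#' marks word boundaries (an assumption)
        for i in range(len(padded) - n + 1):
            counts[padded[i:i + n]] += 1
    total = sum(counts.values())
    return {gram: c / total for gram, c in counts.items()} if total else {}


def typically_historic(historic_words, modern_words, n=3, smoothing=1e-6):
    """Rank n-grams by how much more frequent they are in the historic list.

    The smoothing constant (hypothetical) avoids division by zero for
    n-grams that never occur in the modern list.
    """
    hist = ngram_relative_freq(historic_words, n)
    mod = ngram_relative_freq(modern_words, n)
    ratios = {g: f / (mod.get(g, 0.0) + smoothing) for g, f in hist.items()}
    return sorted(ratios, key=ratios.get, reverse=True)


# Toy word lists for illustration only; not taken from the paper's collection.
historic = ["menschen", "visch", "vleesch"]
modern = ["mensen", "vis", "vlees"]
ranked = typically_historic(historic, modern, n=3)
print(ranked[:2])
```

On these toy lists the highest-ranked sequences are those containing the older "sch" spelling, which are candidates for a rewrite rule. Deriving and selecting actual rules would require further steps that this sketch does not cover.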


Keywords: Historic Document, Modern Corpus, Dutch Word, Mean Reciprocal Rank, Spelling Variant
These keywords were added by machine and not by the authors. This process is experimental and the keywords may be updated as the learning algorithm improves.





Copyright information

© Springer-Verlag Berlin Heidelberg 2006

Authors and Affiliations

  • Marijn Koolen (1, 2)
  • Frans Adriaans (1, 3)
  • Jaap Kamps (1, 2)
  • Maarten de Rijke (1)

  1. ISLA, University of Amsterdam, The Netherlands
  2. Archives and Information Studies, University of Amsterdam, The Netherlands
  3. Utrecht Institute of Linguistics OTS, Utrecht University, The Netherlands
