Advanced Training Set Construction for Retrieval in Historic Documents
Retrieval in historic documents with non-standard spelling requires a mapping from search terms onto the historic terms in the document. We have developed a rule-based approach to describe this mapping. The bottleneck of this method has been the construction of the training set, for which an expert has to manually assign current word forms to historic spelling variants. As a better solution, we apply a spell checker to a corpus of historic texts, which gives us a list of candidate terms and associated suggestions. The new method generates possible rules for the suggestions and accepts the most frequent rules. Experimental results with German and English texts from different centuries demonstrate the feasibility of our approach. Thus, a training set can be constructed with much less initial effort.
Keywords: Spelling variation, training set construction, historic documents
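The rule generation step described in the abstract (derive possible rules from spell checker suggestions, then accept the most frequent ones) can be illustrated with a minimal sketch. It is not the authors' algorithm: the rule format (a differing substring plus one character of left context), the character alignment via Python's `difflib`, the `min_count` threshold, and the function names `candidate_rules` and `frequent_rules` are all assumptions made for illustration. The spell checker itself is left out; its output is assumed to be available as (historic word, suggested modern word) pairs.

```python
from collections import Counter
from difflib import SequenceMatcher


def candidate_rules(modern: str, historic: str) -> list[tuple[str, str]]:
    """Return (modern, historic) substring pairs where the two words differ.

    Each rule keeps one character of left context (assumed rule format).
    """
    matcher = SequenceMatcher(None, modern, historic, autojunk=False)
    rules = []
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag == "equal":
            continue
        # Extend the differing span by one preceding character as context.
        left_m, left_h = max(i1 - 1, 0), max(j1 - 1, 0)
        rules.append((modern[left_m:i2], historic[left_h:j2]))
    return rules


def frequent_rules(pairs, min_count=2):
    """Count candidate rules over all pairs and keep the frequent ones.

    `pairs` holds (historic_word, suggested_modern_word) tuples, assumed
    to come from running a spell checker over a historic corpus.
    """
    counts = Counter()
    for historic, suggestion in pairs:
        counts.update(candidate_rules(suggestion, historic))
    return {rule: n for rule, n in counts.items() if n >= min_count}


# Hypothetical spell checker output for a few German historic spellings.
pairs = [("theil", "teil"), ("thür", "tür"), ("thun", "tun"), ("seyn", "sein")]
print(frequent_rules(pairs))
```

With these four pairs, the rule mapping modern "t" to historic "th" is seen three times and is accepted, while the rule mapping "ei" to "ey" is seen only once and falls below the assumed threshold.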