Generating Search Term Variants for Text Collections with Historic Spellings

  • Andrea Ernst-Gerlach
  • Norbert Fuhr
Part of the Lecture Notes in Computer Science book series (LNCS, volume 3936)

Abstract

In this paper, we describe a new approach to retrieval in texts with non-standard spelling, which is important for historic texts in English or German. For this purpose, we present a new algorithm for generating search term variants in ancient orthography. By applying a spell checker to a corpus of historic texts, we generate a list of candidate terms, to which the contemporary spellings have to be assigned manually. From these pairs, our algorithm then produces a set of probabilistic rules. The rule probabilities can be used for ranking in the retrieval stage. An experimental comparison shows that our approach outperforms competing methods.
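
As a rough illustration of how probabilistic rules could expand a contemporary search term into scored historic variants, here is a minimal Python sketch. The rule format (modern substring, historic substring, probability), the two example rules, their probabilities, and the scoring by product of probabilities are all assumptions made for illustration; they are not taken from the paper, whose rules are derived from manually assigned spelling pairs.

```python
from typing import Dict, List, Tuple

# Hypothetical rules: (modern substring, historic substring, probability).
# The values are illustrative assumptions, not results from the paper.
Rule = Tuple[str, str, float]
RULES: List[Rule] = [
    ("t", "th", 0.4),
    ("ei", "ey", 0.3),
]


def generate_variants(term: str, rules: List[Rule]) -> Dict[str, float]:
    """Map each candidate historic spelling of `term` to a score.

    Rules are processed in order. Each rule rewrites one occurrence of its
    left-hand side at a time, in every variant known so far. A variant's
    score is the product of the probabilities of the rules used to derive
    it; if several derivations yield the same string, the best score wins.
    """
    variants: Dict[str, float] = {term: 1.0}
    for modern, historic, prob in rules:
        # Iterate over a snapshot so that output of this rule is not
        # fed back into the same rule.
        for word, score in list(variants.items()):
            start = word.find(modern)
            while start != -1:
                candidate = word[:start] + historic + word[start + len(modern):]
                new_score = score * prob
                if new_score > variants.get(candidate, 0.0):
                    variants[candidate] = new_score
                start = word.find(modern, start + 1)
    return variants


def ranked_variants(term: str, rules: List[Rule]) -> List[Tuple[str, float]]:
    """Return the variants sorted by descending score, for use in ranking."""
    return sorted(generate_variants(term, rules).items(), key=lambda kv: -kv[1])


if __name__ == "__main__":
    for variant, score in ranked_variants("teil", RULES):
        print(f"{variant}\t{score:.2f}")
```

With these assumed rules, the modern German term "teil" yields the candidates "teil", "theil", "teyl", and "theyl", in that order of score. A retrieval engine could then search for all candidates and weight matches by score.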

Keywords

Transformation Rule · Word Form · Variant Graph · Historic Text · Collection Frequency


Copyright information

© Springer-Verlag Berlin Heidelberg 2006

Authors and Affiliations

  • Andrea Ernst-Gerlach (1)
  • Norbert Fuhr (1)
  1. University of Duisburg-Essen, Germany
