Retrieving Time from Scanned Books

  • John Foley
  • James Allan
Conference paper

DOI: 10.1007/978-3-319-16354-3_24

Part of the Lecture Notes in Computer Science book series (LNCS, volume 9022)
Cite this paper as:
Foley J., Allan J. (2015) Retrieving Time from Scanned Books. In: Hanbury A., Kazai G., Rauber A., Fuhr N. (eds) Advances in Information Retrieval. ECIR 2015. Lecture Notes in Computer Science, vol 9022. Springer, Cham

Abstract

While millions of scanned books have become available in recent years, this vast collection of data remains under-utilized. Book search is often limited to summaries or metadata, and connecting information to primary sources can be a challenge.

Even though digital books provide rich historical information on all subjects, leveraging this data is difficult. To explore how we can access this historical information, we study the problem of identifying relevant times for a given query. That is, given a user query or a description of an event, we attempt to use historical sources to locate that event in time.

We use state-of-the-art NLP tools to identify and extract mentions of times present in our corpus, and then propose a number of models for organizing this historical information.

Since no truth data is readily available for our task, we automatically derive dated event descriptions from Wikipedia, leveraging both the wisdom of the crowd and the wisdom of experts. Using 15,000 events from between the years 1000 and 1925 as queries, we evaluate our approach on a collection of 50,000 books from the Internet Archive. We discuss the tradeoffs between context, retrieval performance, and efficiency.

Copyright information

© Springer International Publishing Switzerland 2015

Authors and Affiliations

  • John Foley (1)
  • James Allan (1)
  1. Center for Intelligent Information Retrieval, University of Massachusetts Amherst, Amherst, USA
