Retrieving Time from Scanned Books

  • John Foley
  • James Allan
Part of the Lecture Notes in Computer Science book series (LNCS, volume 9022)


While millions of scanned books have become available in recent years, this vast collection of data remains under-utilized. Book search is often limited to summaries or metadata, and connecting information to primary sources can be a challenge.

Even though digital books provide rich historical information on all subjects, leveraging this data is difficult. To explore how we can access this historical information, we study the problem of identifying relevant times for a given query. That is, given a user query or a description of an event, we attempt to use historical sources to locate that event in time.

We use state-of-the-art NLP tools to identify and extract mentions of times present in our corpus, and then propose a number of models for organizing this historical information.
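
As a rough illustration of the idea of extracting time mentions and then organizing them by query, the following minimal sketch may help. It is not the authors' method: a plain regular expression stands in for the NLP tools mentioned above, the scoring is simple term overlap rather than any of the proposed models, and the names `YEAR_RE`, `extract_years`, and `rank_years` are hypothetical.

```python
import re
from collections import Counter

# Assumption: any four-digit number from 1000 to 1925 counts as a year mention.
# A real temporal tagger would also handle expressions such as "the spring of 1675".
YEAR_RE = re.compile(r"\b(1[0-8]\d\d|19[01]\d|192[0-5])\b")


def extract_years(text):
    """Return all year mentions (as ints) found in a passage, in order."""
    return [int(match) for match in YEAR_RE.findall(text)]


def rank_years(query, passages):
    """Score each year by the query-term overlap of the passages mentioning it.

    Assumption: whitespace-split query terms and simple overlap counts are
    used here only to keep the sketch short.
    """
    query_terms = set(query.lower().split())
    scores = Counter()
    for passage in passages:
        passage_terms = set(re.findall(r"[a-z]+", passage.lower()))
        overlap = len(query_terms & passage_terms)
        if overlap == 0:
            continue
        # Count each year once per passage so repeated mentions do not dominate.
        for year in set(extract_years(passage)):
            scores[year] += overlap
    # Highest-scoring years first.
    return scores.most_common()
```

Called with the query "king philip war" and a few passages about that conflict, `rank_years` would return the years mentioned alongside the query terms, ordered by how strongly their passages match.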

Since no ground-truth data is readily available for our task, we automatically derive dated event descriptions from Wikipedia, leveraging both the wisdom of the crowd and the wisdom of experts. Using 15,000 events from between the years 1000 and 1925 as queries, we evaluate our approach on a collection of 50,000 books from the Internet Archive. We discuss the tradeoffs between context, retrieval performance, and efficiency.





Copyright information

© Springer International Publishing Switzerland 2015

Authors and Affiliations

  • John Foley (1)
  • James Allan (1)
  1. Center for Intelligent Information Retrieval, University of Massachusetts Amherst, Amherst, USA
