Intelligent Information Access from Scientific Papers

Part of The Information Retrieval Series book series (INRE, volume 29)


We describe a novel search engine for scientific literature. The system supports sentence-level search starting from portable document format (PDF) files and integrates text and image search, so that, for example, information present in tables and figures can be retrieved using both image and caption content. In addition, the system lets the user build, in an intuitive manner, complex queries for search terms that are related through particular grammatical (and thus implicitly semantic) relations. Grid processing techniques are used to parallelise the analysis of large numbers of scientific papers. User evaluations are still in progress; here we report a preliminary evaluation and a comparison with Google Scholar, which demonstrate the potential utility of the novel features. Finally, we discuss future work and the system's potential for, and complementarity with, patent search.


Information Extraction, Average Precision, Complex Query, Image Search, Portable Document Format
These keywords were added by machine and not by the authors. This process is experimental and the keywords may be updated as the learning algorithm improves.



This work was supported in part by a BBSRC e-Science programme grant to the University of Cambridge (FlySlip), and an STFC miniPIPSS grant to the University of Cambridge and iLexIR Ltd (Scalable and Robust Grid-based Text Mining of Scientific Papers). This chapter is an extended version of a paper that appeared in the demonstration session proceedings of the annual conference of the North American Chapter of the Association for Computational Linguistics in June 2010.



Copyright information

© Springer-Verlag Berlin Heidelberg 2011

Authors and Affiliations

  1. University of Cambridge, Cambridge, UK
  2. iLexIR Ltd, Cambridge, UK
  3. Camtology Ltd, Cambridge, UK
  4. University of Aberdeen, Aberdeen, UK
