Intelligent Information Access from Scientific Papers

  • Chapter
Current Challenges in Patent Information Retrieval

Abstract

We describe a novel search engine for scientific literature. The system supports sentence-level search starting from portable document format (PDF) files and integrates text and image search, which facilitates, for example, the retrieval of information in tables and figures using both image and caption content. In addition, the system allows the user to construct, in an intuitive manner, complex queries for search terms that are related through particular grammatical (and thus implicitly semantic) relations. Grid processing techniques are used to parallelise the analysis of large numbers of scientific papers. User evaluations are ongoing; here we report a preliminary evaluation and a comparison with Google Scholar that demonstrate the potential utility of the novel features. Finally, we discuss future work and the system's potential for, and complementarity to, patent search.

Acknowledgements

This work was supported in part by a BBSRC e-Science programme grant to the University of Cambridge (FlySlip), and an STFC miniPIPSS grant to the University of Cambridge and iLexIR Ltd (Scalable and Robust Grid-based Text Mining of Scientific Papers). This chapter is an extended version of a paper that appeared in the demonstration session proceedings of the annual North American Association for Computational Linguistics conference in June 2010.

Corresponding author

Correspondence to Ted Briscoe.

Copyright information

© 2011 Springer-Verlag Berlin Heidelberg

About this chapter

Cite this chapter

Briscoe, T. et al. (2011). Intelligent Information Access from Scientific Papers. In: Lupu, M., Mayer, K., Tait, J., Trippe, A. (eds) Current Challenges in Patent Information Retrieval. The Information Retrieval Series, vol 29. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-19231-9_16

  • DOI: https://doi.org/10.1007/978-3-642-19231-9_16

  • Publisher Name: Springer, Berlin, Heidelberg

  • Print ISBN: 978-3-642-19230-2

  • Online ISBN: 978-3-642-19231-9

  • eBook Packages: Computer Science (R0)
