Intelligent Information Access from Scientific Papers

Briscoe, Ted; Harrison, Karl; Naish, Andrew; Parker, Andy; Rei, Marek; Siddharthan, Advaith; Sinclair, David; Slater, Mark; Watson, Rebecca

doi:10.1007/978-3-642-19231-9_16

Part of the book series: The Information Retrieval Series (INRE, volume 29)

Abstract

We describe a novel search engine for scientific literature. The system supports sentence-level search starting from Portable Document Format (PDF) files and integrates text and image search, so that, for example, information in tables and figures can be retrieved using both image and caption content. In addition, the system lets the user build, in an intuitive manner, complex queries for search terms that are related through particular grammatical (and thus implicitly semantic) relations. Grid processing techniques are used to parallelise the analysis of large numbers of scientific papers. User evaluations are in progress; here we report a preliminary evaluation and a comparison with Google Scholar that demonstrate the potential utility of the novel features. Finally, we discuss future work and the system's potential and complementary role in patent search.
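
As a rough illustration of what sentence-level search over grammatical relations could look like, the following Python sketch indexes each sentence by the relations a parser has found in it and answers a query by intersecting posting sets. It is a minimal sketch under stated assumptions, not the system described in the chapter: the `Relation` and `RelationIndex` names, the relation labels, and the sample sentences are all hypothetical, and the relations are supplied by hand rather than produced by a parser.

```python
from collections import defaultdict
from typing import NamedTuple


class Relation(NamedTuple):
    """One grammatical relation (hypothetical format): label, head, dependent."""
    label: str
    head: str
    dependent: str


class RelationIndex:
    """Inverted index from a relation to the (document, sentence) pairs containing it."""

    def __init__(self):
        self._postings = defaultdict(set)

    def add_sentence(self, doc_id, sent_no, relations):
        # Record that this sentence contains each of the given relations.
        for rel in relations:
            self._postings[rel].add((doc_id, sent_no))

    def search(self, *relations):
        """Return, in sorted order, the sentences that contain every queried relation."""
        if not relations:
            return []
        hits = set(self._postings.get(relations[0], set()))
        for rel in relations[1:]:
            hits &= self._postings.get(rel, set())
        return sorted(hits)


# Illustrative data only; the labels and words are assumptions, not parser output.
subj = Relation("ncsubj", "express", "cell")
obj = Relation("dobj", "express", "gene")

index = RelationIndex()
index.add_sentence("paper-1", 3, [subj, obj])
index.add_sentence("paper-2", 7, [obj])

print(index.search(obj))        # [('paper-1', 3), ('paper-2', 7)]
print(index.search(subj, obj))  # [('paper-1', 3)]
```

Requiring all queried relations to occur in the same sentence is what separates this kind of query from a plain keyword match, where the two words could appear anywhere in the document and in any grammatical role.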

Acknowledgements

This work was supported in part by a BBSRC e-Science programme grant to the University of Cambridge (FlySlip), and an STFC miniPIPSS grant to the University of Cambridge and iLexIR Ltd (Scalable and Robust Grid-based Text Mining of Scientific Papers). This chapter is an extended version of one that appeared in the demonstration session proceedings of the annual conference of the North American Chapter of the Association for Computational Linguistics in June 2010.

Author information

Authors and Affiliations

University of Cambridge, Cambridge, UK
Ted Briscoe, Karl Harrison, Andy Parker, Marek Rei & Mark Slater
iLexIR Ltd, Cambridge, UK
Ted Briscoe & Rebecca Watson
Camtology Ltd, Cambridge, UK
Andrew Naish, Andy Parker & David Sinclair
University of Aberdeen, Aberdeen, UK
Advaith Siddharthan

Corresponding author

Correspondence to Ted Briscoe.

Editor information

Editors and Affiliations

Information Retrieval Facility, Donau-City Straße 1, Vienna, 1220, Austria
Mihai Lupu
Information Retrieval Facility, Donau-City Straße 1, Vienna, 1220, Austria
Katja Mayer
Information Retrieval Facility, Donau-City Straße 1, Vienna, 1220, Austria
John Tait
3LP Advisors, Post Rd. 7003, Dublin, 43016, Ohio, USA
Anthony J. Trippe

About this chapter

Cite this chapter

Briscoe, T. et al. (2011). Intelligent Information Access from Scientific Papers. In: Lupu, M., Mayer, K., Tait, J., Trippe, A. (eds) Current Challenges in Patent Information Retrieval. The Information Retrieval Series, vol 29. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-19231-9_16

DOI: https://doi.org/10.1007/978-3-642-19231-9_16
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-642-19230-2
Online ISBN: 978-3-642-19231-9
eBook Packages: Computer Science, Computer Science (R0)
