Information Extraction and Semantic Annotation for Multi-Paradigm Information Management

Cunningham, Hamish; Tablan, Valentin; Roberts, Ian; Greenwood, Mark A.; Aswani, Niraj

doi:10.1007/978-3-642-19231-9_15

Part of the book series: The Information Retrieval Series (INRE, volume 29)

Abstract

This chapter describes the development of GATE Mímir, a new tool for indexing documents according to multiple paradigms: full text, conceptual model, and annotation structures. We also present a usage example for patent searchers covering measurements and high-level structural information which was automatically extracted from a large patent corpus.
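
As a rough illustration of what indexing by more than one paradigm can mean, the sketch below builds a toy inverted index over both tokens and annotation spans, and answers a query that combines the two. It is not Mímir's implementation or API: the class `ToyIndex`, its method names, the `Measurement` annotation type, and the sample documents are all hypothetical, and whitespace tokenisation is assumed purely for brevity.

```python
from collections import defaultdict


class ToyIndex:
    """Hypothetical toy index over tokens and annotation spans."""

    def __init__(self):
        # term -> {doc_id: [token positions]}
        self.tokens = defaultdict(lambda: defaultdict(list))
        # annotation type -> {doc_id: [(start, end) token spans, end exclusive]}
        self.annotations = defaultdict(lambda: defaultdict(list))

    def add(self, doc_id, text, annotations=()):
        """Index a document's tokens and its (type, start, end) annotations."""
        for pos, tok in enumerate(text.lower().split()):
            self.tokens[tok][doc_id].append(pos)
        for ann_type, start, end in annotations:
            self.annotations[ann_type][doc_id].append((start, end))

    def docs_with_token(self, term):
        """Full-text lookup: documents containing the term."""
        return set(self.tokens.get(term.lower(), {}))

    def token_inside(self, term, ann_type):
        """Combined lookup: documents where the term occurs inside a span of ann_type."""
        hits = set()
        for doc_id, positions in self.tokens.get(term.lower(), {}).items():
            spans = self.annotations.get(ann_type, {}).get(doc_id, [])
            if any(start <= p < end for p in positions for start, end in spans):
                hits.add(doc_id)
        return hits


index = ToyIndex()
# Tokens: a(0) pipe(1) of(2) 5(3) cm(4) diameter(5); the span covers "5 cm".
index.add("p1", "a pipe of 5 cm diameter", [("Measurement", 3, 5)])
index.add("p2", "the cm scale is marked")

text_hits = index.docs_with_token("cm")
combined_hits = index.token_inside("cm", "Measurement")
```

Here the plain text query matches both documents, while the combined query keeps only the one in which "cm" falls inside a measurement annotation.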

Notes

1. In the interests of protecting the innocent, the first author lays claim to the introduction.
2. http://opendotdotdot.blogspot.com/.
3. I used to hope that as time passed I would become older and wiser, but it seems that in fact I just become odder and wider.
4. JAPE is a regular-expression-based language for matching annotations; see http://gate.ac.uk/userguide/chap:jape.
5. http://lucene.apache.org/java/.
6. Old Norse: “The rememberer, the wise one”.
7. Although the focus is currently on indexing text documents, specifically patents, it would be perfectly feasible to associate annotations and KB data with multimedia documents, where offsets may refer to time spans in videos, areas of an image, etc.
8. Inverted indexes are data structures traditionally used in Information Retrieval to support indexing of text.
9. http://mg4j.dsi.unimi.it/.
10. See http://www.w3.org/RDF/ and http://www.w3.org/TR/owl-features/.
11. See http://www.ontotext.com/ordi/ and http://www.ontotext.com/owlim/.
12. This assumes that an index named root exists and was used to store the morphological root of the words.
13. In general, dates are encoded as yyyymmdd. This encoding allows dates to be treated as numbers, enabling a wide variety of search restrictions.
14. http://ir-facility.net/prototypes/marec/.
15. The number of sections within a patent can vary widely from one patent office to another and even, over time, within the same office. Most of the patents we examined during the reported work do, however, contain around twenty sections.
16. Within the precision allowed by double-precision floating-point arithmetic.
17. This query is approximately equal to the others, as the two values have been rounded to the nearest whole numbers.
18. Detailed statistics are available from the World Intellectual Property Organization at http://www.wipo.int/ipstats/.
19. http://gatecloud.net.
20. Such as the Sun Grid Engine (http://gridengine.sunsource.net/) or Hadoop (http://hadoop.apache.org/).
21. Whilst this is true for the patents in the MAREC collection, which we used when building this example index, it may not be true for all patents. In fact, the structure of patents varies widely, which is one reason why effectively searching large patent corpora by hand is difficult.
22. As previously mentioned, dates are usually encoded as numbers in the form yyyymmdd. As such, 20070000 is not actually a valid day but does fall between the last day of 2006 and the first day of 2007.
23. As with abstracts, the titles of the inventions are also listed in multiple languages, and so a restriction to English is included in the query.
24. http://demos.gate.ac.uk/mimir/patents/gus/search.
25. http://www.ontotext.com/owlim/.
26. http://gate.ac.uk/sale/talks/sam/repositories-workshop-agenda.html.
27. Gianni Amati (Fondazione Ugo Bordoni/University of Glasgow); Mike Baycroft (Fairview Research); Norbert Fuhr (University of Essen-Duisburg); Erik Graf (University of Glasgow); Atanas Kiryakov (Ontotext); Borislav Popov (Ontotext); Ralf Schenkel (MPG); John Tait (IRF); Arjen de Vries (ACM/CWI); Francisco Webber (Matrixware/IRF); Valentin Tablan (University of Sheffield); Kalina Bontcheva (University of Sheffield); Hamish Cunningham (University of Sheffield).
28. http://mg4j.dsi.unimi.it/.
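
The date encoding described in notes 13 and 22 can be sketched in a few lines. This is a minimal illustration, assuming Python's standard `datetime.date`; the helper names `encode_date` and `in_year` are hypothetical and are not part of any Mímir API.

```python
from datetime import date


def encode_date(d: date) -> int:
    """Encode a date as the integer yyyymmdd."""
    return d.year * 10000 + d.month * 100 + d.day


def in_year(encoded: int, year: int) -> bool:
    """Check that an encoded date falls in the given year.

    The bounds (for example 20070000 and 20080000) are not valid days,
    but every valid day of the year lies strictly between them.
    """
    lower = year * 10000
    upper = (year + 1) * 10000
    return lower < encoded < upper


last_day_2006 = encode_date(date(2006, 12, 31))
first_day_2007 = encode_date(date(2007, 1, 1))
```

Because the encoded values sort in the same order as the dates themselves, a restriction such as "published during 2007" reduces to a plain numeric range comparison.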

Acknowledgements

This work was funded by the Information Retrieval Facility (ir-facility.org). Erik Graf helped us get off the blocks with MG4J and Sebastiano Vigna helped us run the extra mile. Thanks also to all the workshop participants listed above. We are grateful to our reviewers who made salient and constructive contributions.

Author information

Authors and Affiliations

Department of Computer Science, University of Sheffield, Sheffield, UK
Hamish Cunningham, Valentin Tablan, Ian Roberts, Mark A. Greenwood & Niraj Aswani

Corresponding author

Correspondence to Hamish Cunningham.

Editor information

Editors and Affiliations

Information Retrieval Facility, Donau-City Straße 1, Vienna, 1220, Austria
Mihai Lupu
Information Retrieval Facility, Donau-City Straße 1, Vienna, 1220, Austria
Katja Mayer
Information Retrieval Facility, Donau-City Straße 1, Vienna, 1220, Austria
John Tait
3LP Advisors, Post Rd. 7003, Dublin, 43016, Ohio, USA
Anthony J. Trippe

About this chapter

Cite this chapter

Cunningham, H., Tablan, V., Roberts, I., Greenwood, M.A., Aswani, N. (2011). Information Extraction and Semantic Annotation for Multi-Paradigm Information Management. In: Lupu, M., Mayer, K., Tait, J., Trippe, A. (eds) Current Challenges in Patent Information Retrieval. The Information Retrieval Series, vol 29. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-19231-9_15

DOI: https://doi.org/10.1007/978-3-642-19231-9_15
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-642-19230-2
Online ISBN: 978-3-642-19231-9
eBook Packages: Computer Science, Computer Science (R0)
