
Information Extraction and Semantic Annotation for Multi-Paradigm Information Management

Current Challenges in Patent Information Retrieval

Part of the book series: The Information Retrieval Series (INRE, volume 29)


Abstract

This chapter describes the development of GATE Mímir, a new tool for indexing documents according to multiple paradigms: full text, conceptual model, and annotation structures. We also present a usage example for patent searchers covering measurements and high-level structural information which was automatically extracted from a large patent corpus.


Notes

  1. In the interests of protecting the innocent, the first author lays claim to the introduction.

  2. http://opendotdotdot.blogspot.com/.

  3. I used to hope that as time passed I would become older and wiser, but it seems that in fact I just become odder and wider.

  4. JAPE is a regular-expression-based language for matching annotations; see http://gate.ac.uk/userguide/chap:jape.

  5. http://lucene.apache.org/java/.

  6. Old Norse “The rememberer, the wise one”.

  7. Although the focus is currently on indexing text documents, specifically patents, it would be perfectly feasible to associate annotations and KB data with multimedia documents, where offsets may refer to time spans in videos, areas of an image, etc.

  8. Inverted indexes are data structures traditionally used in Information Retrieval to support indexing of text.

  9. http://mg4j.dsi.unimi.it/.

  10. See http://www.w3.org/RDF/ and http://www.w3.org/TR/owl-features/.

  11. See http://www.ontotext.com/ordi/ and http://www.ontotext.com/owlim/.

  12. This assumes that an index named root exists and was used to store the morphological roots of the words.

  13. In general, dates are encoded as yyyymmdd. This encoding allows dates to be treated as numbers, enabling a wide variety of search restrictions.

  14. http://ir-facility.net/prototypes/marec/.

  15. The number of sections within a patent can vary widely from one patent office to another and even, over time, within the same office. Most of the patents we examined during the reported work do, however, contain around twenty sections.

  16. Within the precision allowed by double-precision floating-point arithmetic.

  17. This query is approximately equal to the others, as the two values have been rounded to the nearest whole number.

  18. Detailed statistics are available from the World Intellectual Property Organization at http://www.wipo.int/ipstats/.

  19. http://gatecloud.net.

  20. Such as the Sun Grid Engine (http://gridengine.sunsource.net/) or Hadoop (http://hadoop.apache.org/).

  21. Whilst this is true for the patents in the MAREC collection, which we used when building this example index, it may not be true for all patents. In fact, the structure of patents varies widely, which is one reason why effectively searching large patent corpora by hand is difficult.

  22. As previously mentioned, dates are usually encoded as numbers in the form yyyymmdd. As such, 20070000 is not actually a valid day but does fall between the last day of 2006 and the first day of 2007.

  23. As with abstracts, the titles of the inventions are also listed in multiple languages, and so a restriction to English is included in the query.

  24. http://demos.gate.ac.uk/mimir/patents/gus/search.

  25. http://www.ontotext.com/owlim/.

  26. http://gate.ac.uk/sale/talks/sam/repositories-workshop-agenda.html.

  27. Gianni Amati (Fondazione Ugo Bordoni/University of Glasgow); Mike Baycroft (Fairview Research); Norbert Fuhr (University of Duisburg-Essen); Erik Graf (University of Glasgow); Atanas Kiryakov (Ontotext); Borislav Popov (Ontotext); Ralf Schenkel (MPG); John Tait (IRF); Arjen de Vries (ACM/CWI); Francisco Webber (Matrixware/IRF); Valentin Tablan (University of Sheffield); Kalina Bontcheva (University of Sheffield); Hamish Cunningham (University of Sheffield).

  28. http://mg4j.dsi.unimi.it/.
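
Notes 13 and 22 describe dates being encoded as yyyymmdd numbers so that date restrictions become plain numeric comparisons. The following is a minimal Python sketch of that idea only; it is not Mímir's query syntax or implementation, and the function names (`encode_date`, `year_bounds`) and the sample dates are hypothetical, introduced purely for illustration.

```python
from datetime import date
from typing import Tuple


def encode_date(d: date) -> int:
    """Encode a calendar date as the integer yyyymmdd."""
    return d.year * 10000 + d.month * 100 + d.day


def year_bounds(year: int) -> Tuple[int, int]:
    """Return exclusive numeric bounds that enclose every day of `year`.

    Neither bound is a valid day (month 00, day 00), but each sorts
    strictly between 31 December of one year and 1 January of the next.
    """
    return year * 10000, (year + 1) * 10000


# Hypothetical sample data, not taken from any patent corpus.
publication_dates = [
    date(2006, 12, 31),
    date(2007, 1, 1),
    date(2007, 6, 15),
    date(2008, 1, 1),
]

encoded = [encode_date(d) for d in publication_dates]
low, high = year_bounds(2007)

# Restricting to 2007 is a strict numeric range test on the encoded values.
in_2007 = [n for n in encoded if low < n < high]
print(in_2007)  # [20070101, 20070615]
```

Here 20070000 plays the role described in note 22: it is not a real day, yet it falls between 20061231 and 20070101, so it serves as a convenient lower bound for "any date in 2007".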


Acknowledgements

This work was funded by the Information Retrieval Facility (ir-facility.org). Erik Graf helped us get off the blocks with MG4J and Sebastiano Vigna helped us run the extra mile. Thanks also to all the workshop participants listed above. We are grateful to our reviewers who made salient and constructive contributions.


Corresponding author

Correspondence to Hamish Cunningham.


Copyright information

© 2011 Springer-Verlag Berlin Heidelberg

About this chapter

Cite this chapter

Cunningham, H., Tablan, V., Roberts, I., Greenwood, M.A., Aswani, N. (2011). Information Extraction and Semantic Annotation for Multi-Paradigm Information Management. In: Lupu, M., Mayer, K., Tait, J., Trippe, A. (eds) Current Challenges in Patent Information Retrieval. The Information Retrieval Series, vol 29. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-19231-9_15


  • DOI: https://doi.org/10.1007/978-3-642-19231-9_15

  • Publisher Name: Springer, Berlin, Heidelberg

  • Print ISBN: 978-3-642-19230-2

  • Online ISBN: 978-3-642-19231-9

  • eBook Packages: Computer Science, Computer Science (R0)
