
A framework for traversing dense annotation lattices

Language Resources and Evaluation

Abstract

Pattern matching, or querying, over annotations is a general-purpose paradigm for inspecting, navigating, mining, and transforming annotation repositories, the common representation basis for modern pipelined text processing architectures. The open-ended nature of these architectures and the expressiveness of feature structure-based annotation schemes account for the natural tendency of such annotation repositories to become very dense, as multiple levels of analysis get encoded as layered annotations. This characteristic presents challenges for the design of a pattern matching framework capable of interpreting ‘flat’ patterns over arbitrarily dense annotation lattices. We present an approach where a finite state device applies (compiled) pattern grammars over what is, in effect, a linearized ‘projection’ of a particular route through the lattice. The route is derived by a mix of static grammar analysis and runtime interpretation of navigational directives within an extended grammar formalism; it selects just the annotation sequence appropriate for the patterns at hand. For expressive and efficient pattern matching in dense annotation stores, our implemented approach achieves a mix of lattice traversal and finite state scanning by exposing a language which, to its user, provides constructs for specifying sequential, structural, and configurational constraints among annotations.
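The idea of matching a ‘flat’ pattern over a linearized route through a dense lattice can be illustrated with a minimal Python sketch. It is not the authors' implementation: the `Annotation` class, the `linearize` and `find_matches` helpers, the priority-list mechanism, and the sample offsets are all hypothetical, chosen only to show how one route is selected from overlapping annotations and then scanned sequentially. It assumes every annotation is text-consuming (non-empty span).

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Annotation:
    type: str
    begin: int
    end: int


def linearize(annotations, priority):
    """Pick one non-overlapping route through the lattice.

    Only types listed in `priority` are visible. At the nearest start
    offset, the annotation whose type ranks earliest in `priority` wins
    (ties go to the longer span); scanning then resumes at its end.
    """
    rank = {t: i for i, t in enumerate(priority)}
    # Assumption: zero-width annotations are ignored (text-consuming only).
    candidates = [a for a in annotations if a.type in rank and a.end > a.begin]
    route, pos = [], 0
    while True:
        ahead = [a for a in candidates if a.begin >= pos]
        if not ahead:
            return route
        nearest = min(a.begin for a in ahead)
        best = min((a for a in ahead if a.begin == nearest),
                   key=lambda a: (rank[a.type], -(a.end - a.begin)))
        route.append(best)
        pos = best.end


def find_matches(route, pattern):
    """Return every contiguous run of the route whose types equal `pattern`."""
    n = len(pattern)
    return [route[i:i + n] for i in range(len(route) - n + 1)
            if [a.type for a in route[i:i + n]] == list(pattern)]


# Hypothetical lattice over the text "John Smith left".
lattice = [
    Annotation("Token", 0, 4), Annotation("Token", 5, 10), Annotation("Token", 11, 15),
    Annotation("Name", 0, 4), Annotation("Name", 5, 10),
    Annotation("PName", 0, 10), Annotation("VG", 11, 15),
]

coarse_route = linearize(lattice, ["PName", "VG", "Token"])
fine_route = linearize(lattice, ["Name", "VG"])
```

Here the same lattice yields two different linear sequences depending on which types the (hypothetical) grammar makes visible: `coarse_route` sees the person name as a single unit, while `fine_route` sees its two component names. A flat pattern such as `["PName", "VG"]` only needs to scan the sequence selected for it.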


Notes

  1. gate: General Architecture for Text Engineering.

  2. uima: Unstructured Information Management Architecture.

  3. nltk: Natural Language Toolkit.

  4. Language and Information Engineering Lab at Jena University, http://www.julielab.de/.

  5. http://www.verbs.colorado.edu/ula2008/.

  6. International Standards Organization, Technical Committee 37, Sub-Committee 4, Language Resource Management, http://www.iso.org/iso/iso_catalogue/catalogue_tc.

  7. Requesting all the text fragments which match a pattern is, conceptually, no different from querying an annotation repository for all annotations (or annotation configurations) which satisfy a certain set of constraints, themselves specified in a pattern (query).

  8. Tokens are just instances of an annotation type. Multiple tokenizers would introduce multiple token streams; it is not uncommon for complex applications to deploy multiple tokenizers, e.g. if models for different components have been trained over disparate pre-tagged text sources.

  9. The notation [ TypeName ] refers to an annotation of type TypeName in the text.

  10. Here, and in this paper in general, we assume that all annotations manipulated through the framework are text-consuming.

  11. It may be the case that an annotation will have inner properties (or features: uima uses typed feature structures to represent annotations); in that case testing for a match would require checking values of features against their specifications too. Still, this operation is carried out just on the annotation itself.

  12. Elements of the formalism which translate into posting new, or modifying existing, annotations are somewhat orthogonal to issues of navigation; we will not discuss transduction symbols or mechanisms here. We also deliberately gloss over the question of what the span of the new [ Subj ] annotation should be, but see the example grammar immediately below.

  13. Patterns may refer to both type instances and supertypes: the framework will admit e.g. a PName annotation as an instance of a Named supertype specified in the grammar as a match target; supertypes thus are akin to wild cards. Note that if both [ A ] and [ B ] are defined to be subtypes of [ Element ], a pattern specification … Element []. Element [] … would match both sequences [ A ] followed by [ B ], and [ B ] followed by [ A ]; this allows for order-independent grammars.

  14. At grammar load time, when the interpreter is initialized.

  15. Note that, while appealing to ‘common intuitions’ in the interpretation of ‘longer [ PName ] annotations stand for nodes in a tree hierarchy above shorter [ Name ] annotations’ (Sect. 2), it is essential for the system’s completeness and correctness that such relationships are explicitly encoded in a set of priority declarations.

  16. Assuming that [ Phrase ] is declared a common supertype to both [ NP ] and [ VG ].
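The supertype behaviour described in notes 13 and 16 can be sketched as follows. This is an illustrative sketch only: the `SUPERTYPE` table and the `is_a` and `matches` helpers are hypothetical names, and the single-parent hierarchy is an assumption rather than a description of the framework's actual type system.

```python
# Hypothetical single-inheritance type table: subtype -> declared supertype.
SUPERTYPE = {
    "A": "Element",
    "B": "Element",
    "NP": "Phrase",
    "VG": "Phrase",
    "PName": "Named",
}


def is_a(type_name, target):
    """True if `type_name` equals `target` or inherits from it."""
    while type_name is not None:
        if type_name == target:
            return True
        type_name = SUPERTYPE.get(type_name)  # None once the root is passed
    return False


def matches(sequence, pattern):
    """Match a sequence of annotation types against a flat pattern,
    letting each pattern symbol admit any of its subtypes."""
    return (len(sequence) == len(pattern)
            and all(is_a(t, p) for t, p in zip(sequence, pattern)))
```

Under these assumptions, a pattern of two `Element` symbols accepts both `["A", "B"]` and `["B", "A"]`, which is the order-independence that note 13 points out, while a sequence containing an unrelated type such as `NP` is rejected.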

Abbreviations

uima: Unstructured information management architecture

fst: Finite state transduction

afst: Annotation-based finite state transduction

gate: General architecture for text engineering

ula: Unified linguistic annotation

References

  • Appelt, D. E., & Onyshkevych, B. (1996). The common pattern specification language. In Proceedings of a workshop held at Baltimore, Maryland (pp. 23–30). Morristown, NJ, USA: Association for Computational Linguistics.

  • Bird, S. (2006). NLTK: The natural language toolkit. In Demonstration session, 45th annual meeting of the ACL. Sydney, Australia.

  • Bird, S., Buneman, P., & Tan, W.-C. (2000). Towards a query language for annotation graphs. In Second international language resources and evaluation conference. Athens, Greece.

  • Bird, S., & Liberman, M. (2001). A formal framework for linguistic annotation. Speech Communication, 33(1–2), 23–60.


  • Boguraev, B. (2000). Towards finite-state analysis of lexical cohesion. In Proceedings of the 3rd international conference on finite-state methods for NLP, intex-3. Liege, Belgium.

  • Boguraev, B., & Ando, R. K. (2005). TimeML-compliant text analysis for temporal reasoning. In Nineteenth international joint conference on artificial intelligence (IJCAI-05). Edinburgh, Scotland.

  • Boguraev, B., & Neff, M. (2007). An annotation-based finite state system for UIMA: User documentation and grammar writing manual. Technical report, IBM T.J. Watson Research Center, Yorktown Heights, New York.

  • Cassidy, S. (2002). XQuery as an annotation query language: A use case analysis. In Third international language resources and evaluation conference. Las Palmas, Spain.

  • Cunningham, H. (2002). gate, a general architecture for language engineering. Computers and the Humanities, 36(2), 223–254.


  • Cunningham, H., Maynard, D., & Tablan, V. (2000). JAPE: A Java annotation patterns engine. Technical Memo CS-00-10, Institute for Language, Speech and Hearing (ILASH), and Department of Computer Science, University of Sheffield, Sheffield.

  • Cunningham, H., & Scott, D. (2004). Software architectures for language engineering. Special Issue. Natural Language Engineering, 10(4).

  • Dale, R. (2005). Industry watch. Natural Language Engineering, 11, 435–439.


  • Droẑdẑyński, W., Krieger, H.-U., Piskorski, J., Schäfer, U., & Xu, F. (2004). Shallow processing with unification and typed feature structures—Foundations and applications. Künstliche Intelligenz, (1), 17–23.

  • Ferrucci, D., & Lally, A. (2004). UIMA: An architectural approach to unstructured information processing in the corporate research environment. Natural Language Engineering, 10(4). Special Issue on Software Architectures for Language Engineering.

  • Grefenstette, G. (1999). Light parsing as finite state filtering. In A. Kornai (Ed.), Extended finite state models of language, studies in natural language processing (pp. 86–94). Cambridge UK: Cambridge University Press.

  • Grover, C., Matheson, C., Mikheev, A., & Moens, M. (2000). lt-ttt: A flexible tokenisation tool. In Proceedings of the second international conference on language resources and evaluation (pp. 1147–1154). Spain.

  • Hahn, U., Buyko, E., Tomanek, K., Piao, S., McNaught, J., Tsuruoka, Y., & Ananiadou, S. (2007). An annotation type system for a data-driven NLP pipeline. In Linguistic annotation workshop (the LAW); ACL-2007. Prague, Czech Republic.

  • Ide, N., & Romary, L. (2004). International standard for a linguistic annotation framework. Natural Language Engineering, 10(4). Special Issue on Software Architectures for Language Engineering.

  • Ide, N., & Suderman, K. (2007). GrAF: A graph-based format for linguistic annotation. In Linguistic annotation workshop (the LAW); ACL-2007. Prague, Czech Republic.

  • Lai, C., & Bird, S. (2004). Querying and updating treebanks: A critical survey and requirements analysis. In Australasian language technology workshop. Sydney.

  • Park, Y., Byrd, R., & Boguraev, B. (2002). Automatic glossary extraction: Beyond terminology identification. In Proceedings of the 19th international conference on computational linguistics (COLING) (pp. 772–778). Taiwan.

  • Silberztein, M. (2000). intex: An integrated FST development environment. Theoretical Computer Science, 231(1), 33–46.


  • Simov, K., Kouylekov, M., & Simov, A. (2002). Cascaded regular grammars over XML documents. In Proceedings of the second international workshop on NLP and XML (NLPXML-2002). Taipei, Taiwan.

  • Srihari, R. K., Li, W., Cornell, T., & Niu, C. (2008). InfoXtract: A customizable intermediate level information extraction engine. Natural Language Engineering.

  • Verhagen, M., Stubbs, A., & Pustejovsky, J. (2007). Combining independent syntactic and semantic annotation schemes. In Linguistic annotation workshop (the LAW); ACL-2007. Prague, Czech Republic.


Author information


Corresponding author

Correspondence to Branimir Boguraev.


About this article

Cite this article

Boguraev, B., Neff, M. A framework for traversing dense annotation lattices. Lang Resources & Evaluation 44, 183–203 (2010). https://doi.org/10.1007/s10579-010-9123-y


  • DOI: https://doi.org/10.1007/s10579-010-9123-y
