Language Resources and Evaluation, Volume 44, Issue 3, pp 183–203

A framework for traversing dense annotation lattices

Abstract

Pattern matching, or querying, over annotations is a general-purpose paradigm for inspecting, navigating, mining, and transforming annotation repositories, the common representation basis for modern pipelined text processing architectures. The open-ended nature of these architectures and the expressiveness of feature structure-based annotation schemes account for the natural tendency of such annotation repositories to become very dense, as multiple levels of analysis get encoded as layered annotations. This particular characteristic presents challenges for the design of a pattern matching framework capable of interpreting ‘flat’ patterns over arbitrarily dense annotation lattices. We present an approach where a finite state device applies (compiled) pattern grammars over what is, in effect, a linearized ‘projection’ of a particular route through the lattice. The route is derived by a mix of static grammar analysis and runtime interpretation of navigational directives within an extended grammar formalism; it selects just the annotation sequence appropriate for the patterns at hand. For expressive and efficient pattern matching in dense annotation stores, our implemented approach achieves a mix of lattice traversal and finite state scanning by exposing a language which, to its user, provides constructs for specifying sequential, structural, and configurational constraints among annotations.
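
As a rough illustration of the idea of scanning a linearized projection of one route through a dense lattice, the following minimal Python sketch may help. It is not the AFST implementation and does not use the UIMA API: the `Annotation` class, the `linearize` and `scan` helpers, the single-letter type codes, and the toy lattice are all assumptions made for this example, and Python's `re` module merely stands in for a compiled finite state device. The set of wanted types plays the role that static grammar analysis would play in picking the route.

```python
import re
from dataclasses import dataclass


@dataclass(frozen=True)
class Annotation:
    type: str   # hypothetical type name, e.g. "Det", "NP"
    begin: int  # start character offset (inclusive)
    end: int    # end character offset (exclusive)


def linearize(annotations, wanted_types):
    """Pick one non-overlapping route through the lattice.

    Only annotations whose type is in wanted_types are considered. Candidates
    are ordered by start offset, longest first; an annotation is skipped when
    it overlaps one already placed on the route.
    """
    candidates = sorted(
        (a for a in annotations if a.type in wanted_types),
        key=lambda a: (a.begin, -(a.end - a.begin)),
    )
    route, last_end = [], 0
    for a in candidates:
        if a.begin >= last_end:
            route.append(a)
            last_end = a.end
    return route


def scan(route, pattern, codes):
    """Match a 'flat' pattern over the route.

    Each annotation is encoded as one character, so match offsets in the
    encoded string are also indices into the route.
    """
    encoded = "".join(codes[a.type] for a in route)
    return [route[m.start():m.end()] for m in re.finditer(pattern, encoded)]


# Toy lattice over the text "the old man saw a dog": part-of-speech
# annotations plus phrase- and sentence-level annotations on the same spans.
LATTICE = [
    Annotation("Sentence", 0, 21),
    Annotation("NP", 0, 11), Annotation("NP", 16, 21),
    Annotation("Det", 0, 3), Annotation("Adj", 4, 7), Annotation("Noun", 8, 11),
    Annotation("Verb", 12, 15),
    Annotation("Det", 16, 17), Annotation("Noun", 18, 21),
]
CODES = {"Det": "D", "Adj": "A", "Noun": "N", "Verb": "V", "NP": "P"}

# Route 1: word-level types only; the pattern is Det Adj* Noun.
word_route = linearize(LATTICE, {"Det", "Adj", "Noun"})
noun_groups = scan(word_route, "DA*N", CODES)

# Route 2: phrase-level types over the same lattice; the pattern is NP Verb NP.
phrase_route = linearize(LATTICE, {"NP", "Verb"})
clauses = scan(phrase_route, "PVP", CODES)
```

The same lattice yields two different linear sequences depending on which types the pattern mentions, which is the sense in which the pattern determines the route. Note that annotations left off a route (such as the verb in the first route) leave no trace in the encoded sequence, so a fuller design would need an explicit policy for gaps.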

Keywords

AFST; UIMA; Annotation-based analytics development; Pattern matching over annotations; Annotation lattices; High density annotation repositories; Finite-state transduction; Corpus analysis

Abbreviations

UIMA: Unstructured information management architecture
FST: Finite state transduction
AFST: Annotation-based finite state transduction
GATE: General architecture for text engineering
ULA: Unified linguistic annotation

Copyright information

© Springer Science+Business Media B.V. 2010

Authors and Affiliations

  1. IBM T.J. Watson Research Center, Yorktown Heights, USA
