Facilitating the Analysis of Discourse Phenomena in an Interoperable NLP Platform
Abstract
The analysis of discourse phenomena is essential in many natural language processing (NLP) applications. The growing diversity of available corpora and NLP tools brings a multitude of representation formats. In order to alleviate the problem of incompatible formats when constructing complex text mining pipelines, the Unstructured Information Management Architecture (UIMA) provides a standard means of communication between tools and resources. U-Compare, a text mining workflow construction platform based on UIMA, further enhances interoperability through a shared system of data types, allowing free combination of compliant components into workflows. Although U-Compare and its type system already support syntactic and semantic analyses, support for the analysis of discourse phenomena was previously lacking. In response, we have extended the U-Compare type system with new discourse-level types. We illustrate processing and visualisation of discourse information in U-Compare by providing several new deserialisation components for corpora containing discourse annotations. The new U-Compare is downloadable from http://nactem.ac.uk/ucompare.
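As a rough illustration of how a shared type system lets independently written components interoperate, the sketch below models a discourse-level type that builds on a common span type. It does not use the actual U-Compare or UIMA API: the class names (`Annotation`, `DiscourseUnit`, `CausalRelation`), the one-line standoff format, and the helper functions are all hypothetical and chosen only for this example.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Annotation:
    """Hypothetical base type: a character span over the document text."""
    begin: int
    end: int


@dataclass
class DiscourseUnit(Annotation):
    """Hypothetical discourse-level span, e.g. a clause acting as an argument."""


@dataclass
class CausalRelation(Annotation):
    """Hypothetical relation type; its own span covers the causal trigger."""
    cause: Optional[DiscourseUnit] = None
    effect: Optional[DiscourseUnit] = None


def read_causality(line: str) -> CausalRelation:
    """Deserialise one line of an invented standoff format:
    'trigger_begin trigger_end cause_begin cause_end effect_begin effect_end'.
    """
    tb, te, cb, ce, eb, ee = map(int, line.split())
    return CausalRelation(tb, te, DiscourseUnit(cb, ce), DiscourseUnit(eb, ee))


def covered_text(text: str, ann: Annotation) -> str:
    """Generic consumer: works on any subtype of the shared base type."""
    return text[ann.begin:ann.end]


text = "Smoking causes cancer."
relation = read_causality("8 14 0 7 15 21")
print(covered_text(text, relation))         # the trigger span
print(covered_text(text, relation.cause))   # the cause argument
print(covered_text(text, relation.effect))  # the effect argument
```

Because `covered_text` depends only on the base type, a component written before the discourse types existed can still process them, which is the kind of free combination the shared type system is meant to allow.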
Keywords
UIMA, interoperability, U-Compare, discourse, causality, coreference, meta-knowledge