Abstract
Corpora are an important resource for knowledge acquisition in natural language processing (NLP). However, in the biomedical domain, comparatively few corpora focus on the semantic associations among all tokens in a sentence. We propose an annotation scheme, based on feature structure theory, for enriching biomedical corpora with token semantic association (TSA). We annotated 227 documents from the BioNLP GE ST training data to form the TSA corpus, in which each annotated item represents a token semantic association expressed as a triple. Annotating token semantic associations has the potential to significantly advance biomedical text mining by providing rich token-level semantic information to NLP systems, especially sophisticated information extraction (IE) systems such as those for bio-event extraction.
Additional information
Foundation item: Supported by the National Natural Science Foundation of China (61202304, 61173095, 61173062, 61202193)
Biography: WEI Xiaomei, female, Ph.D. candidate, research direction: biomedical informatics and natural language processing.
About this article
Cite this article
Wei, X., Huang, S., Chen, B. et al. BioTSA: Annotating token semantic association to support biomedical text mining. Wuhan Univ. J. Nat. Sci. 20, 134–140 (2015). https://doi.org/10.1007/s11859-015-1071-3
DOI: https://doi.org/10.1007/s11859-015-1071-3