Abstract
To make the information contained in biodiversity literature more accessible and searchable, we have developed a text mining-based framework that automatically transforms text into a structured knowledge repository. A text mining workflow employing information extraction techniques, namely named entity recognition and relation extraction, was implemented in the Argo platform and applied to biodiversity literature to extract structured information. The resulting annotations were stored in a repository following the emerging Open Annotation standard, thus promoting interoperability with external applications. Accessible as a SPARQL endpoint, the repository facilitates knowledge discovery over a large body of biodiversity literature by retrieving annotations that match user-specified queries. We present use cases to illustrate the types of queries that the knowledge repository currently accommodates.
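As a rough illustration of the idea described in the abstract, the sketch below builds one Open Annotation-style record for an extracted entity mention and composes a SPARQL query that would retrieve matching annotations. It is a minimal sketch under stated assumptions, not the authors' schema or endpoint: the `example.org` URIs, the character offsets, and the helper names `make_annotation` and `build_query` are hypothetical, and only the `oa:` vocabulary terms come from the Open Annotation model.

```python
import json

# Namespace of the W3C Open Annotation vocabulary.
OA = "http://www.w3.org/ns/oa#"


def make_annotation(ann_id, doc_uri, start, end, body_uri):
    """Build one Open Annotation-style record as a JSON-LD-like dict.

    The body is the concept the text span was linked to; the target is the
    span itself, given as character offsets in the source document.
    """
    return {
        "@context": {"oa": OA},
        "@id": ann_id,
        "@type": "oa:Annotation",
        "oa:hasBody": {"@id": body_uri},
        "oa:hasTarget": {
            "@type": "oa:SpecificResource",
            "oa:hasSource": {"@id": doc_uri},
            "oa:hasSelector": {
                "@type": "oa:TextPositionSelector",
                "oa:start": start,
                "oa:end": end,
            },
        },
    }


def build_query(body_uri):
    """Compose a SPARQL query for the documents and offsets of all
    annotations whose body is the given concept URI."""
    return f"""PREFIX oa: <{OA}>
SELECT ?doc ?start ?end WHERE {{
  ?ann a oa:Annotation ;
       oa:hasBody <{body_uri}> ;
       oa:hasTarget ?target .
  ?target oa:hasSource ?doc ;
          oa:hasSelector ?sel .
  ?sel oa:start ?start ;
       oa:end ?end .
}}"""


# Hypothetical identifiers, for illustration only.
annotation = make_annotation(
    "http://example.org/annotation/1",
    "http://example.org/doc/42",
    10,
    25,
    "http://example.org/taxon/placeholder",
)
query = build_query("http://example.org/taxon/placeholder")

print(json.dumps(annotation, indent=2))
print(query)
```

The query string could be sent to any SPARQL endpoint that holds annotations in this shape; the endpoint address of the repository itself is not given here, so none is assumed.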
Keywords
- Resource Description Framework
- Conditional Random Field
- Named Entity Recognition
- Relation Extraction
- Anatomical Part
These keywords were added by machine and not by the authors. This process is experimental and the keywords may be updated as the learning algorithm improves.
Acknowledgments
We would like to thank Prof. Marilou Nicolas for her valuable input. This work is funded by the British Council [172722806 (COPIOUS)], and is partially supported by the Engineering and Physical Sciences Research Council [EP/1038099/1 (CDT)].
Copyright information
© 2017 Springer International Publishing AG
About this paper
Cite this paper
Batista-Navarro, R., Zerva, C., Nguyen, N.T.H., Ananiadou, S. (2017). A Text Mining-Based Framework for Constructing an RDF-Compliant Biodiversity Knowledge Repository. In: Lossio-Ventura, J., Alatrista-Salas, H. (eds) Information Management and Big Data. SIMBig 2015, SIMBig 2016. Communications in Computer and Information Science, vol 656. Springer, Cham. https://doi.org/10.1007/978-3-319-55209-5_3
DOI: https://doi.org/10.1007/978-3-319-55209-5_3
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-55208-8
Online ISBN: 978-3-319-55209-5
eBook Packages: Computer Science, Computer Science (R0)