A Text Mining-Based Framework for Constructing an RDF-Compliant Biodiversity Knowledge Repository

Conference paper
Part of the Communications in Computer and Information Science book series (CCIS, volume 656)


To make the information in biodiversity literature more accessible and searchable, we have developed a text mining-based framework that automatically transforms text into a structured knowledge repository. A text mining workflow employing information extraction techniques, namely named entity recognition and relation extraction, was implemented in the Argo platform and applied to biodiversity literature to extract structured information. The resulting annotations were stored in a repository following the emerging Open Annotation standard, thus promoting interoperability with external applications. Accessible as a SPARQL endpoint, the repository facilitates knowledge discovery over a large body of biodiversity literature by retrieving annotations that match user-specified queries. We present use cases to illustrate the types of queries that the knowledge repository currently accommodates.
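As an illustration of the kind of user-specified query described above, the following is a minimal sketch in Python that builds a SPARQL query over annotations expressed with the Open Annotation vocabulary (`oa:Annotation`, `oa:hasBody`, `oa:hasTarget`, `oa:hasSource`, `oa:hasSelector`, `oa:exact`). The graph pattern, the function names, and the endpoint URL `http://example.org/sparql` are assumptions made for illustration only; the actual schema and address of the repository are not given here and may differ.

```python
from urllib.parse import urlencode

# Namespace of the Open Annotation vocabulary.
OA_NS = "http://www.w3.org/ns/oa#"


def build_annotation_query(entity_label: str, limit: int = 10) -> str:
    """Build a SPARQL query for annotations whose quoted text mentions a label.

    The graph pattern is an assumed layout (annotation -> target -> selector),
    not the repository's confirmed schema.
    """
    # Escape backslashes and double quotes so the label is a valid SPARQL literal.
    escaped = entity_label.replace("\\", "\\\\").replace('"', '\\"')
    return f"""PREFIX oa: <{OA_NS}>
SELECT ?annotation ?source ?text
WHERE {{
  ?annotation a oa:Annotation ;
              oa:hasBody ?body ;
              oa:hasTarget ?target .
  ?target oa:hasSource ?source ;
          oa:hasSelector ?selector .
  ?selector oa:exact ?text .
  FILTER (CONTAINS(LCASE(STR(?text)), LCASE("{escaped}")))
}}
LIMIT {int(limit)}"""


def build_request_url(endpoint: str, query: str) -> str:
    """Encode the query as an HTTP GET request, per the SPARQL protocol."""
    return endpoint + "?" + urlencode({"query": query})


if __name__ == "__main__":
    # Hypothetical endpoint; replace with the real SPARQL endpoint address.
    query = build_annotation_query("Shorea")
    print(build_request_url("http://example.org/sparql", query))
```

The sketch only constructs the query and request URL; sending the request and parsing the results would depend on the endpoint's supported result formats.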


Resource Description Framework; Conditional Random Field; Named Entity Recognition; Relation Extraction; Anatomical Part
These keywords were added by machine and not by the authors. This process is experimental and the keywords may be updated as the learning algorithm improves.



We would like to thank Prof. Marilou Nicolas for her valuable input. This work is funded by the British Council [172722806 (COPIOUS)], and is partially supported by the Engineering and Physical Sciences Research Council [EP/I038099/1 (CDT)].



Copyright information

© Springer International Publishing AG 2017

Authors and Affiliations

  1. School of Computer Science, University of Manchester, Manchester, UK
