A Text Mining-Based Framework for Constructing an RDF-Compliant Biodiversity Knowledge Repository

Part of the Communications in Computer and Information Science book series (CCIS,volume 656)


To make the information contained in biodiversity literature more accessible and searchable, we have developed a text mining-based framework for automatically transforming text into a structured knowledge repository. A text mining workflow employing information extraction techniques, namely named entity recognition and relation extraction, was implemented in the Argo platform and subsequently applied to biodiversity literature to extract structured information. The resulting annotations were stored in a repository following the emerging Open Annotation standard, thus promoting interoperability with external applications. Accessible as a SPARQL endpoint, the repository facilitates knowledge discovery over a large body of biodiversity literature by retrieving annotations that match user-specified queries. We present some use cases to illustrate the types of queries that the knowledge repository currently accommodates.
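As a rough illustration of what querying such a repository could look like, the sketch below builds a SPARQL query over annotations expressed with the Open Annotation vocabulary and encodes it as a GET request following the SPARQL protocol. It is not taken from the paper: the endpoint URL, the concept IRI and the function names are hypothetical, and the assumption that annotations expose `oa:hasBody`, `oa:hasTarget`, `oa:hasSource`, `oa:hasSelector` and `oa:exact` reflects the general Open Annotation model rather than the repository's actual schema.

```python
from urllib.parse import urlencode

# Namespace of the Open Annotation vocabulary.
OA_NS = "http://www.w3.org/ns/oa#"

# Hypothetical endpoint; the real repository address is not given here.
ENDPOINT = "http://example.org/sparql"


def build_annotation_query(body_iri: str, limit: int = 10) -> str:
    """Build a SPARQL query for annotations whose body is the given concept.

    Assumes (not confirmed by the source) that each annotation links a
    concept IRI as its body to a text span selected within a source document.
    """
    if limit < 1:
        raise ValueError("limit must be a positive integer")
    # Reject characters that would break out of the <...> IRI delimiters.
    if any(ch in body_iri for ch in "<> \n\t\""):
        raise ValueError("body_iri contains characters not allowed in an IRI")
    return (
        f"PREFIX oa: <{OA_NS}>\n"
        "SELECT ?annotation ?source ?text WHERE {\n"
        "  ?annotation a oa:Annotation ;\n"
        f"              oa:hasBody <{body_iri}> ;\n"
        "              oa:hasTarget ?target .\n"
        "  ?target oa:hasSource ?source ;\n"
        "          oa:hasSelector ?selector .\n"
        "  ?selector oa:exact ?text .\n"
        "}\n"
        f"LIMIT {limit}"
    )


def build_request_url(endpoint: str, query: str) -> str:
    """Encode the query as a SPARQL-protocol GET request URL."""
    return endpoint + "?" + urlencode({"query": query})


if __name__ == "__main__":
    # Hypothetical concept IRI standing in for a taxon or habitat identifier.
    query = build_annotation_query("http://example.org/concept/taxon-1", limit=5)
    print(query)
    print(build_request_url(ENDPOINT, query))
```

The query is only constructed and encoded, not sent, so the sketch runs without network access; pointing `ENDPOINT` at a real SPARQL service and fetching the URL would be the next step.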


  • Resource Description Framework
  • Conditional Random Field
  • Named Entity Recognition
  • Relation Extraction
  • Anatomical Part

These keywords were added by machine and not by the authors. This process is experimental and the keywords may be updated as the learning algorithm improves.




We would like to thank Prof. Marilou Nicolas for her valuable input. This work is funded by the British Council [172722806 (COPIOUS)], and is partially supported by the Engineering and Physical Sciences Research Council [EP/1038099/1 (CDT)].

Author information

Corresponding author

Correspondence to Sophia Ananiadou.

Copyright information

© 2017 Springer International Publishing AG

About this paper

Cite this paper

Batista-Navarro, R., Zerva, C., Nguyen, N.T.H., Ananiadou, S. (2017). A Text Mining-Based Framework for Constructing an RDF-Compliant Biodiversity Knowledge Repository. In: Lossio-Ventura, J., Alatrista-Salas, H. (eds) Information Management and Big Data. SIMBig 2015, SIMBig 2016. Communications in Computer and Information Science, vol 656. Springer, Cham.

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-319-55208-8

  • Online ISBN: 978-3-319-55209-5

  • eBook Packages: Computer Science, Computer Science (R0)