A Text Mining Framework for Accelerating the Semantic Curation of Literature

Conference paper
Part of the Lecture Notes in Computer Science book series (LNCS, volume 9819)


The Biodiversity Heritage Library is the world’s largest digital library of biodiversity literature. It currently contains almost 40 million pages and can be explored through a search interface based on keyword matching, which unfortunately fails to address issues caused by ambiguity. Tools that automatically attach semantic metadata to documents, e.g., biodiversity concept recognisers, help alleviate these issues. However, developing such tools requires gold standard, semantically annotated textual corpora. In the biodiversity domain, such corpora are almost non-existent, especially since constructing semantically annotated resources is typically a time-consuming and laborious process. Aiming to accelerate the development of a corpus of biodiversity documents, we propose a text mining framework that hastens curation through an iterative feedback loop of (1) manual annotation and (2) training and application of statistical concept recognition models. Even after only a few iterations, our curators were observed to spend less time and effort on annotation.
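
The feedback loop described above can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: a simple term-lookup tagger stands in for the statistical concept recognition model, the curator is simulated by comparing suggestions against reference annotations, and all names (`train`, `pre_annotate`, `curate`, `feedback_loop`) and the toy sentences are hypothetical.

```python
from collections import Counter


def train(annotated_docs):
    """Hypothetical stand-in for training a statistical concept recogniser.

    It only memorises the most frequent label seen for each annotated term;
    a real system would fit a sequence model on the curated documents.
    """
    votes = {}
    for doc in annotated_docs:
        for term, label in doc["annotations"].items():
            votes.setdefault(term.lower(), Counter())[label] += 1
    return {term: counts.most_common(1)[0][0] for term, counts in votes.items()}


def pre_annotate(model, text):
    """Apply the current model: suggest a label for every known term in the text."""
    lowered = text.lower()
    return {term: label for term, label in model.items() if term in lowered}


def curate(suggestions, gold):
    """Simulate a curator correcting the suggestions.

    Returns the final annotations and the number of manual actions
    (additions, corrections or deletions) that were needed.
    """
    actions = sum(1 for term, label in gold.items() if suggestions.get(term) != label)
    actions += sum(1 for term in suggestions if term not in gold)
    return dict(gold), actions


def feedback_loop(batches):
    """Iterate: pre-annotate a batch, let the curator fix it, then retrain."""
    annotated, effort_per_iteration = [], []
    model = {}  # no model exists before the first round of manual annotation
    for batch in batches:
        effort = 0
        for text, gold in batch:
            suggestions = pre_annotate(model, text)
            final, actions = curate(suggestions, gold)
            annotated.append({"text": text, "annotations": final})
            effort += actions
        effort_per_iteration.append(effort)
        model = train(annotated)  # retrain on everything curated so far
    return model, effort_per_iteration


# Made-up toy input, for illustration only (not data from the paper).
batches = [
    [("quercus alba grows in missouri",
      {"quercus alba": "Taxon", "missouri": "Location"})],
    [("quercus alba was collected in missouri",
      {"quercus alba": "Taxon", "missouri": "Location"})],
]
model, effort = feedback_loop(batches)
```

In this toy run the first batch is annotated entirely by hand, while the second batch arrives pre-annotated by the retrained model, so the simulated curator has fewer manual actions to perform. The sketch only illustrates the shape of the loop; it says nothing about the size of the effect on real documents.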


Manual Annotation; Conditional Random Field; Name Entity Recognition; Conditional Random Field Model; Unstructured Information Management Architecture
These keywords were added by machine and not by the authors. This process is experimental and the keywords may be updated as the learning algorithm improves.



Copyright information

© Springer International Publishing Switzerland 2016

Authors and Affiliations

  1. School of Computer Science, University of Manchester, Manchester, UK
  2. Smithsonian Institute, Washington, D.C., USA
  3. Missouri Botanical Garden, Missouri, USA
