Building Concise Text Corpora from Web Contents

  • Wolfram Bartussek


This is a report on ongoing work in a research project for small and medium-sized enterprises (SMEs), funded by the German Federal Ministry of Education and Research (Funding ID: 01IS15056D; project duration: Jan 2016 – Dec 2017). The project, named OntoPMS, targets post-market surveillance (PMS) of medical devices as required by the Medical Device Regulation (MDR; Regulation (EU) 2017/745 of the European Parliament and of the Council of 5 April 2017 on medical devices, OJ. L, pp 1–175, 2017), which entered into force following its formal publication in May 2017. Being a regulation, it is immediately legally binding in all member states of the European Union. The project aims to provide both technical support and supporting procedures to satisfy recital 4 of the MDR: “Key elements of the existing regulatory approach, such as the supervision of notified bodies, conformity assessment procedures, clinical investigations and clinical evaluation, vigilance and market surveillance should be significantly reinforced, whilst provisions ensuring transparency and traceability regarding medical devices should be introduced, to improve health and safety.” This chapter focuses on one component of the software system under development, the corpus builder. This component retrieves scientific publications of interest from the web and other sources, checks them for relevance, and transfers them to a linguistic corpus and, in parallel, to a search engine based on the open source package Elasticsearch. The challenge in this case was not to take everything one can get hold of (whole-web crawling) but to find and take only those publications that really belong to the domain of interest and are relevant to surveillance aspects. The guiding principle was therefore to build comprehensive yet minimal corpora for the purposes at hand.
Although the software has been developed in the context of medical device PMS, its use is not bound in any way to this specific application area.
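
The retrieve, check, and transfer steps described above can be illustrated with a minimal Python sketch. It is not the OntoPMS implementation: the class `CorpusBuilderSketch`, the `Publication` fields, the keyword-overlap relevance test, and the threshold `min_hits` are all hypothetical, and a plain in-memory list stands in for the Elasticsearch-based search engine.

```python
import re
from dataclasses import dataclass, field
from typing import Iterable, List, Set


@dataclass(frozen=True)
class Publication:
    """A retrieved document; the fields are illustrative assumptions."""
    url: str
    title: str
    text: str


@dataclass
class CorpusBuilderSketch:
    """Hypothetical stand-in for the corpus builder described in the text."""
    domain_terms: Set[str]            # lowercase single-word domain vocabulary
    min_hits: int = 2                 # assumed relevance threshold
    corpus: List[Publication] = field(default_factory=list)
    search_index: List[dict] = field(default_factory=list)  # stand-in for a search engine

    def is_relevant(self, pub: Publication) -> bool:
        # Assumed criterion: count distinct domain terms occurring in title or text.
        words = set(re.findall(r"[a-z]+", f"{pub.title} {pub.text}".lower()))
        return len(words & self.domain_terms) >= self.min_hits

    def ingest(self, publications: Iterable[Publication]) -> int:
        """Store relevant publications in both sinks; return how many were accepted."""
        accepted = 0
        for pub in publications:
            if not self.is_relevant(pub):
                continue  # keep the corpus minimal: drop off-domain documents
            self.corpus.append(pub)
            self.search_index.append(
                {"url": pub.url, "title": pub.title, "text": pub.text}
            )
            accepted += 1
        return accepted
```

In a real system the relevance check would likely be far richer than term overlap, and the second sink would be an actual search engine client rather than a list; the sketch only shows the shape of a filter that admits a document to both stores or to neither.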



Acknowledgements go to all participants of the OntoPMS consortium. With respect to ontologies, accompanying workflows, and available technologies, I would like to thank Prof. Heinrich Herre, Alexandr Uciteli, and Stephan Kropf from the IMISE at the University of Leipzig for many inspiring conversations. I would not have had much chance of understanding medical regulations in Europe without the help of the novineon personnel Timo Weiland (consortium project lead), Prof. Marc O. Schurr, Stefanie Meese, and Klaus Gräf, and of the quality manager from Ovesco, Matthias Leenen. The participants from the BfArM, the German Federal Institute for Drugs and Medical Devices, Prof. Wolfgang Lauer and Robin Seidel, helped me understand the MAUDE database and how to connect it to the CorpusBuilder. IntraFind (Christoph Goller and Philipp Blohm) developed an ingenious enhancement to the search engine exploiting the corpus, and MT2IT (Prof. Jörg-Uwe Meyer, Michael Witte) will provide the structures of the overall system in which the CorpusBuilder will be embedded. I would also like to thank my colleagues at OntoPort, Anatol Reibold and Günter Lutz-Misof, for their astute remarks on earlier versions of this chapter.


References

  1. Herre H (2010) General formal ontology (GFO): a foundational ontology for conceptual modelling. In: Poli R, Healy M, Kameas A (eds) Theory and applications of ontology: computer applications. Springer, Dordrecht, pp 297–345
  2. Uciteli A, Goller C, Burek P, Siemoleit S, Faria B, Galanzina H, Weiland T, Drechsler-Hake D, Bartussek W, Herre H (2014) Search ontology, a new approach towards semantic search. In: Plödereder E, Grunske L, Schneider E, Ull D (eds) FoRESEE: Future Search Engines 2014–44. Annual meeting of the GI, Stuttgart – GI edition proceedings LNI. Köllen, Bonn, pp 667–672
  3. Medical Device Regulation (EU) 2017/745 of the European Parliament and of the Council of 5 April 2017 on medical devices, OJ. L (2017) pp 1–175

Copyright information

© Springer-Verlag GmbH Germany, part of Springer Nature 2018

Authors and Affiliations

  1. OntoPort UG, Sulzbach, Germany
