GGTWEAK: Gene Tagging with Weak Supervision for German Clinical Text

  • Conference paper

Part of the book series: Lecture Notes in Computer Science (LNAI, volume 13897)

Abstract

Accurate extraction of biomolecular named entities, such as genes and proteins, from medical documents is an important task for many clinical applications. So far, most gene taggers have been developed for English-language scientific articles. However, documents from other genres, like clinical practice guidelines, are usually written in the language used by the respective clinical practitioners. To our knowledge, no annotated corpora or machine learning models for gene named entity recognition are currently available for the German language.

In this work, we present GGTweak, a publicly available gene tagger for German medical documents based on a large corpus of clinical practice guidelines. Since obtaining sufficient gold-standard annotations of gene mentions for training supervised machine learning models is expensive, our approach relies solely on programmatic, weak supervision for model training. We combine various label sources based on the surface form of gene mentions and on gazetteers of known gene names, each of which covers only part of the training data. Using a small amount of hand-labelled data for model selection and evaluation, our weakly supervised approach achieves an \(F_1\) score of 76.6 on a held-out test set, an increase of 12.4 percentage points over a strongly supervised baseline.

While there is still a performance gap to state-of-the-art gene taggers for the English language, weak supervision is a promising direction for obtaining solid baseline models without the need to conduct time-consuming annotation projects. GGTweak can be readily applied in-domain to derive semantic metadata and enable the development of computer-interpretable clinical guidelines, while the out-of-domain robustness still needs to be investigated.
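
The combination of surface-form and gazetteer label sources described in the abstract can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration: the gazetteer entries, the regular expression, the stoplist, the function names, and the simple majority vote used to merge the sources are hypothetical and are not taken from GGTweak, whose actual label sources and aggregation method are not specified in the abstract.

```python
import re
from collections import Counter

# Hypothetical label sources; none of these values come from GGTweak itself.
GAZETTEER = {"BRCA1", "BRCA2", "KRAS", "TP53", "EGFR"}   # known gene names
SURFACE_PATTERN = re.compile(r"^[A-Z]{2,5}\d{0,2}$")     # e.g. "HER2", "KRAS"
STOPLIST = {"WHO", "CT", "MRT", "EKG"}                   # abbreviations that are not genes

GENE, OTHER, ABSTAIN = "GENE", "O", None


def lf_gazetteer(token):
    """Vote GENE if the token is a known gene name, otherwise abstain."""
    return GENE if token in GAZETTEER else ABSTAIN


def lf_surface(token):
    """Vote GENE if the token looks like a gene symbol, otherwise abstain."""
    return GENE if SURFACE_PATTERN.match(token) else ABSTAIN


def lf_stoplist(token):
    """Vote O for common abbreviations that resemble gene symbols."""
    return OTHER if token in STOPLIST else ABSTAIN


def lf_lowercase(token):
    """Vote O for fully lower-case tokens."""
    return OTHER if token.islower() else ABSTAIN


LABEL_FUNCTIONS = [lf_gazetteer, lf_surface, lf_stoplist, lf_lowercase]


def weak_label(tokens):
    """Merge the votes of all label sources with a simple majority vote.

    Each source covers only part of the data and abstains elsewhere.
    Tokens without any vote, and ties, default to O.
    """
    labels = []
    for token in tokens:
        votes = [lf(token) for lf in LABEL_FUNCTIONS]
        counts = Counter(v for v in votes if v is not ABSTAIN)
        labels.append(GENE if counts[GENE] > counts[OTHER] else OTHER)
    return labels


sentence = "Bei Mutationen in BRCA1 oder KRAS wird laut WHO eine Testung empfohlen".split()
for token, label in zip(sentence, weak_label(sentence)):
    print(f"{token}\t{label}")
```

In this sketch, "HER2" would be tagged through the surface pattern alone even though it is missing from the gazetteer, while "WHO" matches the pattern but is voted down by the stoplist. The resulting noisy labels would then serve as training data for a tagger in place of gold-standard annotations.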

S. Steinwand and F. Borchert share first authorship equally.

Acknowledgements

Parts of this work were generously supported by a grant of the German Federal Ministry of Research and Education (01ZZ1802H).

Author information

Corresponding author

Correspondence to Florian Borchert.

Copyright information

© 2023 The Author(s), under exclusive license to Springer Nature Switzerland AG

About this paper

Cite this paper

Steinwand, S., Borchert, F., Winkler, S., Schapranow, MP. (2023). GGTWEAK: Gene Tagging with Weak Supervision for German Clinical Text. In: Juarez, J.M., Marcos, M., Stiglic, G., Tucker, A. (eds) Artificial Intelligence in Medicine. AIME 2023. Lecture Notes in Computer Science, vol 13897. Springer, Cham. https://doi.org/10.1007/978-3-031-34344-5_22

  • DOI: https://doi.org/10.1007/978-3-031-34344-5_22

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-031-34343-8

  • Online ISBN: 978-3-031-34344-5

  • eBook Packages: Computer Science, Computer Science (R0)
