GGTWEAK: Gene Tagging with Weak Supervision for German Clinical Text

Steinwand, Sandro; Borchert, Florian; Winkler, Silvia; Schapranow, Matthieu-P.

doi:10.1007/978-3-031-34344-5_22

Conference paper
First Online: 05 June 2023

Part of the book series: Lecture Notes in Computer Science (LNAI, volume 13897)

Abstract

Accurate extraction of biomolecular named entities like genes and proteins from medical documents is an important task for many clinical applications. So far, most gene taggers have been developed for English-language scientific articles. However, documents from other genres, like clinical practice guidelines, are usually written in the language used by the respective clinical practitioners. To our knowledge, no annotated corpora or machine learning models for gene named entity recognition are currently available for the German language.

In this work, we present GGTweak, a publicly available gene tagger for German medical documents based on a large corpus of clinical practice guidelines. Since obtaining sufficient gold-standard annotations of gene mentions for training supervised machine learning models is expensive, our approach relies solely on programmatic, weak supervision for model training. We combine various label sources based on the surface form of gene mentions and gazetteers of known gene names, each of which covers only part of the training data. Using a small amount of hand-labelled data for model selection and evaluation, our weakly supervised approach achieves an \(F_1\) score of 76.6 on a held-out test set, an increase of 12.4 percentage points over a strongly supervised baseline.
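The idea of combining partial label sources can be illustrated with a minimal Python sketch. It is not the GGTweak implementation: the gazetteer entries, the surface-form pattern, the abbreviation stop list, and the simple vote used for aggregation are all hypothetical stand-ins chosen for illustration, and the aggregation model actually used by the authors is not described in this abstract.

```python
import re

GENE, OUTSIDE, ABSTAIN = "GENE", "O", None

# Hypothetical mini-gazetteer; the gazetteers used in the paper are not reproduced here.
GAZETTEER = {"BRCA1", "TP53", "KRAS", "EGFR"}

# Hypothetical surface-form heuristic: two or more capitals followed by digits, e.g. "HER2".
SURFACE_PATTERN = re.compile(r"[A-Z]{2,}[0-9]+[A-Z0-9]*")

# Hypothetical stop list of clinical abbreviations that are not genes.
NON_GENE_ABBREVIATIONS = {"CT", "MRT", "PET"}


def lf_gazetteer(token):
    """Vote GENE if the token is a known gene name, otherwise abstain."""
    return GENE if token in GAZETTEER else ABSTAIN


def lf_surface_form(token):
    """Vote GENE if the whole token looks like a gene symbol, otherwise abstain."""
    return GENE if SURFACE_PATTERN.fullmatch(token) else ABSTAIN


def lf_abbreviation(token):
    """Vote O for known non-gene abbreviations, otherwise abstain."""
    return OUTSIDE if token in NON_GENE_ABBREVIATIONS else ABSTAIN


LABELING_FUNCTIONS = [lf_gazetteer, lf_surface_form, lf_abbreviation]


def weak_label(token):
    """Aggregate the non-abstaining votes; GENE wins only with a strict majority."""
    votes = [lf(token) for lf in LABELING_FUNCTIONS]
    gene_votes = votes.count(GENE)
    outside_votes = votes.count(OUTSIDE)
    return GENE if gene_votes > outside_votes else OUTSIDE


def tag(tokens):
    """Return (token, weak label) pairs for a pre-tokenized sentence."""
    return [(token, weak_label(token)) for token in tokens]


if __name__ == "__main__":
    sentence = "Mutationen in BRCA1 und HER2 wurden im CT nachgewiesen".split()
    for token, label in tag(sentence):
        print(f"{token}\t{label}")
```

Each function covers only part of the data and abstains elsewhere: in this sketch, "HER2" is missed by the gazetteer but caught by the surface-form rule, which is the kind of complementary coverage the paragraph above describes. The weak labels produced this way would then serve as training data for a tagger in place of gold-standard annotations.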

While there is still a performance gap to state-of-the-art gene taggers for the English language, weak supervision is a promising direction for obtaining solid baseline models without the need to conduct time-consuming annotation projects. GGTweak can be readily applied in-domain to derive semantic metadata and enable the development of computer-interpretable clinical guidelines, while the out-of-domain robustness still needs to be investigated.

S. Steinwand and F. Borchert: these authors equally share first authorship.

Acknowledgements

Parts of this work were generously supported by a grant from the German Federal Ministry of Education and Research (01ZZ1802H).

Author information

Authors and Affiliations

HPI Digital Health Center, Hasso Plattner Institute, University of Potsdam, Prof.-Dr.-Helmert-Str. 2-3, 14482 Potsdam, Germany
Sandro Steinwand, Florian Borchert, Silvia Winkler & Matthieu-P. Schapranow

Corresponding author

Correspondence to Florian Borchert.

Editor information

Editors and Affiliations

University of Murcia, Murcia, Spain
Jose M. Juarez
Universitat Jaume I, Castellón de la Plana, Spain
Mar Marcos
University of Maribor, Maribor, Slovenia
Gregor Stiglic
Brunel University London, Uxbridge, UK
Allan Tucker

About this paper

Cite this paper

Steinwand, S., Borchert, F., Winkler, S., Schapranow, M.-P. (2023). GGTWEAK: Gene Tagging with Weak Supervision for German Clinical Text. In: Juarez, J.M., Marcos, M., Stiglic, G., Tucker, A. (eds) Artificial Intelligence in Medicine. AIME 2023. Lecture Notes in Computer Science, vol 13897. Springer, Cham. https://doi.org/10.1007/978-3-031-34344-5_22

DOI: https://doi.org/10.1007/978-3-031-34344-5_22
Published: 05 June 2023
Publisher Name: Springer, Cham
Print ISBN: 978-3-031-34343-8
Online ISBN: 978-3-031-34344-5
eBook Packages: Computer Science, Computer Science (R0)
