Abstract
Accurate extraction of biomolecular named entities like genes and proteins from medical documents is an important task for many clinical applications. So far, most gene taggers have been developed for English-language scientific articles. However, documents from other genres, like clinical practice guidelines, are usually written in the language used by the clinical practitioners they address. To our knowledge, no annotated corpora or machine learning models for gene named entity recognition are currently available for the German language.
In this work, we present GGTweak, a publicly available gene tagger for German medical documents based on a large corpus of clinical practice guidelines. Since obtaining sufficient gold-standard annotations of gene mentions for training supervised machine learning models is expensive, our approach relies solely on programmatic, weak supervision for model training. We combine various label sources based on the surface form of gene mentions and gazetteers of known gene names, each of which covers only part of the training data. Using a small amount of hand-labelled data for model selection and evaluation, our weakly supervised approach achieves an \(F_1\) score of 76.6 on a held-out test set, an increase of 12.4 percentage points over a strongly supervised baseline.
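To illustrate the idea of combining label sources with partial coverage, the following is a minimal sketch and not the GGTweak implementation. The gazetteer entries, the surface-form pattern, the three labelling functions, and the majority-vote aggregation are all assumptions chosen for illustration; the actual label sources and aggregation model used in this work may differ.

```python
import re
from collections import Counter

# Hypothetical gazetteer of known gene symbols (illustrative entries only).
GAZETTEER = {"BRCA1", "BRCA2", "KRAS", "EGFR", "HER2"}

# Hypothetical surface-form heuristic: two or more capitals followed by digits.
SURFACE_PATTERN = re.compile(r"^[A-Z]{2,}[0-9]+$")

ABSTAIN, OUTSIDE, GENE = None, "O", "GENE"


def lf_gazetteer(token):
    """Vote GENE if the token is a known gene symbol, otherwise abstain."""
    return GENE if token in GAZETTEER else ABSTAIN


def lf_surface_form(token):
    """Vote GENE if the token looks like a gene symbol, otherwise abstain."""
    return GENE if SURFACE_PATTERN.match(token) else ABSTAIN


def lf_lowercase_word(token):
    """Vote O for plain lower-case words, otherwise abstain."""
    return OUTSIDE if token.islower() else ABSTAIN


LABELING_FUNCTIONS = [lf_gazetteer, lf_surface_form, lf_lowercase_word]


def weak_label(tokens):
    """Aggregate the non-abstaining votes per token by simple majority.

    Tokens on which every labelling function abstains default to O.
    Majority voting is a stand-in for a learned aggregation model.
    """
    labels = []
    for token in tokens:
        votes = [lf(token) for lf in LABELING_FUNCTIONS]
        votes = [v for v in votes if v is not ABSTAIN]
        if votes:
            labels.append(Counter(votes).most_common(1)[0][0])
        else:
            labels.append(OUTSIDE)
    return labels


tokens = "Bei Mutationen in BRCA1 oder TP53 wird eine Beratung empfohlen".split()
weak_labels = weak_label(tokens)
for token, label in zip(tokens, weak_labels):
    print(f"{token}\t{label}")
```

In this sketch, "TP53" is missing from the gazetteer but is still labelled through the surface-form heuristic, which shows how sources with partial individual coverage can complement each other. The resulting noisy labels would then serve as training data for a sequence tagger.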
While there is still a performance gap to state-of-the-art gene taggers for the English language, weak supervision is a promising direction for obtaining solid baseline models without the need to conduct time-consuming annotation projects. GGTweak can be readily applied in-domain to derive semantic metadata and enable the development of computer-interpretable clinical guidelines, while the out-of-domain robustness still needs to be investigated.
S. Steinwand and F. Borchert—These authors equally share first authorship.
Acknowledgements
Parts of this work were generously supported by a grant from the German Federal Ministry of Education and Research (01ZZ1802H).
Copyright information
© 2023 The Author(s), under exclusive license to Springer Nature Switzerland AG
About this paper
Cite this paper
Steinwand, S., Borchert, F., Winkler, S., Schapranow, MP. (2023). GGTWEAK: Gene Tagging with Weak Supervision for German Clinical Text. In: Juarez, J.M., Marcos, M., Stiglic, G., Tucker, A. (eds) Artificial Intelligence in Medicine. AIME 2023. Lecture Notes in Computer Science, vol 13897. Springer, Cham. https://doi.org/10.1007/978-3-031-34344-5_22
DOI: https://doi.org/10.1007/978-3-031-34344-5_22
Publisher Name: Springer, Cham
Print ISBN: 978-3-031-34343-8
Online ISBN: 978-3-031-34344-5
eBook Packages: Computer Science, Computer Science (R0)