Privacy Risks in German Patient Forums: A NER-Based Approach to Enrich Digital Twins

Schultenkämper, Sergej; Bäumer, Frederik Simon

doi:10.1007/978-3-031-48981-5_9

Part of the book series: Communications in Computer and Information Science (CCIS, volume 1979)

Included in the following conference series:

International Conference on Information and Software Technologies

Abstract

The online sharing of personal health data by individuals has raised privacy concerns. This paper presents a Named Entity Recognition (NER)-based analysis to detect potential privacy risks in German patient forums. The objective is to extract sensitive information from user-generated texts and augment existing digital profiles of users to demonstrate the potential threats posed by the aggregation of information. To achieve this, we trained a NER model on a large corpus of German patient forum texts and evaluated its performance using standard metrics. The results show that the NER model can effectively extract health-related information from German texts with a micro-average precision of 0.8666, a recall of 0.9633 and an F₁-score of 0.9124. This enables the creation of Digital Twins that accurately reflect the health-related characteristics of individuals. However, when this information is combined with data from different platforms, it poses a potential threat to users’ privacy and underlines the need to warn users.
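
The reported scores can be related to one another with a short calculation. The following is a minimal sketch, not the authors' evaluation code: it assumes the usual definitions (micro-averaging pools true positives, false positives and false negatives over all entity labels, and F₁ is the harmonic mean of precision and recall). The function names, label names and counts are hypothetical and serve only as illustration.

```python
from typing import Dict, Tuple


def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall (0.0 if both are zero)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


def micro_average(counts: Dict[str, Tuple[int, int, int]]) -> Tuple[float, float, float]:
    """Micro-averaged (precision, recall, F1).

    `counts` maps each entity label to (true positives, false positives,
    false negatives). The counts are pooled over all labels before the
    ratios are formed, so frequent labels weigh more than rare ones.
    """
    tp = sum(c[0] for c in counts.values())
    fp = sum(c[1] for c in counts.values())
    fn = sum(c[2] for c in counts.values())
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall, f1_score(precision, recall)


# Consistency check on the values reported in the abstract.
reported_f1 = f1_score(0.8666, 0.9633)
print(f"F1 from reported precision and recall: {reported_f1:.4f}")

# Made-up counts for two hypothetical labels, for illustration only.
toy_counts = {"LABEL_A": (8, 2, 0), "LABEL_B": (2, 0, 2)}
toy_precision, toy_recall, toy_f1 = micro_average(toy_counts)
print(f"toy micro-average: P={toy_precision:.4f} R={toy_recall:.4f} F1={toy_f1:.4f}")
```

Under these definitions, the reported precision and recall reproduce the reported F₁-score to four decimal places, so the three figures in the abstract are mutually consistent.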

Notes

1. Available at https://prodi.gy, last accessed 2023-03-27.

Acknowledgments

This research is funded by dtec.bw – Digitalization and Technology Research Center of the Bundeswehr. dtec.bw is funded by the European Union – NextGenerationEU.

Author information

Authors and Affiliations

Bielefeld University of Applied Sciences and Arts, Bielefeld, Germany
Sergej Schultenkämper & Frederik Simon Bäumer

Corresponding author

Correspondence to Sergej Schultenkämper.

Editor information

Editors and Affiliations

Kaunas University of Technology, Kaunas, Lithuania
Audrius Lopata
Kaunas University of Technology, Kaunas, Lithuania
Daina Gudonienė
Kaunas University of Technology, Kaunas, Lithuania
Rita Butkienė

About this paper

Cite this paper

Schultenkämper, S., Bäumer, F.S. (2024). Privacy Risks in German Patient Forums: A NER-Based Approach to Enrich Digital Twins. In: Lopata, A., Gudonienė, D., Butkienė, R. (eds) Information and Software Technologies. ICIST 2023. Communications in Computer and Information Science, vol 1979. Springer, Cham. https://doi.org/10.1007/978-3-031-48981-5_9

DOI: https://doi.org/10.1007/978-3-031-48981-5_9
Published: 10 January 2024
Publisher Name: Springer, Cham
Print ISBN: 978-3-031-48980-8
Online ISBN: 978-3-031-48981-5
eBook Packages: Computer Science, Computer Science (R0)