Privacy Risks in German Patient Forums: A NER-Based Approach to Enrich Digital Twins

  • Conference paper
Information and Software Technologies (ICIST 2023)

Part of the book series: Communications in Computer and Information Science (CCIS, volume 1979)

Abstract

The online sharing of personal health data by individuals has raised privacy concerns. This paper presents a Named Entity Recognition (NER)-based analysis to detect potential privacy risks in German patient forums. The objective is to extract sensitive information from user-generated texts and augment existing digital profiles of users to demonstrate the potential threats posed by the aggregation of information. To achieve this, we trained a NER model on a large corpus of German patient forum texts and evaluated its performance using standard metrics. The results show that the NER model can effectively extract health-related information from German texts with a micro-average precision of 0.8666, a recall of 0.9633 and an F1-score of 0.9124. This enables the creation of Digital Twins that accurately reflect the health-related characteristics of individuals. However, when this information is combined with data from different platforms, it poses a potential threat to users’ privacy and underlines the need to warn users.
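The reported scores can be checked for internal consistency: F1 is the harmonic mean of precision and recall, and a micro-average pools true positives, false positives and false negatives over all entity labels before computing the ratios. The following is a minimal sketch of that arithmetic, not the authors' evaluation code. The `LabelCounts` container, the function names, and the label names and counts in the toy example are assumptions made for illustration; only the three reported scores come from the abstract.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class LabelCounts:
    """Entity-level counts for one label (hypothetical container)."""
    tp: int  # predicted entities that match a gold entity
    fp: int  # predicted entities with no gold match
    fn: int  # gold entities the model missed


def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall; 0.0 if both are zero."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


def micro_average(counts: dict[str, LabelCounts]) -> tuple[float, float, float]:
    """Pool counts over all labels, then compute precision, recall, F1."""
    tp = sum(c.tp for c in counts.values())
    fp = sum(c.fp for c in counts.values())
    fn = sum(c.fn for c in counts.values())
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall, f1_score(precision, recall)


# Consistency check on the scores reported in the abstract.
reported_f1 = round(f1_score(0.8666, 0.9633), 4)
print(reported_f1)

# Toy example with made-up labels and counts (not data from the paper).
toy_counts = {
    "MEDICATION": LabelCounts(tp=8, fp=2, fn=0),
    "DIAGNOSIS": LabelCounts(tp=2, fp=0, fn=2),
}
print(micro_average(toy_counts))
```

Rounded to four decimals, the harmonic mean of 0.8666 and 0.9633 is 0.9124, which matches the reported F1-score. Because the micro-average pools counts, frequent labels weigh more heavily in it than rare ones.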

Notes

  1. Available at https://prodi.gy, last accessed 2023-03-27.

Acknowledgments

This research is funded by dtec.bw – Digitalization and Technology Research Center of the Bundeswehr. dtec.bw is funded by the European Union – NextGenerationEU.

Author information

Corresponding author

Correspondence to Sergej Schultenkämper.

Copyright information

© 2024 The Author(s), under exclusive license to Springer Nature Switzerland AG

About this paper

Cite this paper

Schultenkämper, S., Bäumer, F.S. (2024). Privacy Risks in German Patient Forums: A NER-Based Approach to Enrich Digital Twins. In: Lopata, A., Gudonienė, D., Butkienė, R. (eds) Information and Software Technologies. ICIST 2023. Communications in Computer and Information Science, vol 1979. Springer, Cham. https://doi.org/10.1007/978-3-031-48981-5_9

  • DOI: https://doi.org/10.1007/978-3-031-48981-5_9

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-031-48980-8

  • Online ISBN: 978-3-031-48981-5

  • eBook Packages: Computer Science (R0)
