Abstract
The online sharing of personal health data by individuals has raised privacy concerns. This paper presents a Named Entity Recognition (NER)-based analysis to detect potential privacy risks in German patient forums. The objective is to extract sensitive information from user-generated texts and augment existing digital profiles of users to demonstrate the potential threats posed by the aggregation of information. To achieve this, we trained a NER model on a large corpus of German patient forum texts and evaluated its performance using standard metrics. The results show that the NER model can effectively extract health-related information from German texts with a micro-average precision of 0.8666, a recall of 0.9633 and an F1-score of 0.9124. This enables the creation of Digital Twins that accurately reflect the health-related characteristics of individuals. However, when this information is combined with data from different platforms, it poses a potential threat to users’ privacy and underlines the need to warn users.
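The reported scores can be sanity-checked with a few lines of arithmetic. The sketch below is not the authors' evaluation code: it only illustrates how micro-averaged precision, recall and F1 are conventionally derived from pooled true-positive, false-positive and false-negative counts, and verifies that the stated F1-score follows from the stated precision and recall. The function names, the entity labels and the counts are hypothetical and chosen for illustration only.

```python
def f1_from(precision, recall):
    """Harmonic mean of precision and recall (0.0 if both are zero)."""
    total = precision + recall
    return 2 * precision * recall / total if total else 0.0


def micro_prf(counts):
    """Micro-average over labels.

    counts maps a label to a (tp, fp, fn) tuple; the counts are pooled
    across all labels before precision and recall are computed.
    """
    tp = sum(c[0] for c in counts.values())
    fp = sum(c[1] for c in counts.values())
    fn = sum(c[2] for c in counts.values())
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall, f1_from(precision, recall)


# Consistency check of the figures quoted in the abstract.
reported_f1 = f1_from(0.8666, 0.9633)

# Hypothetical per-label counts (not taken from the paper).
example_counts = {"DRUG": (8, 2, 1), "SYMPTOM": (12, 1, 3)}
p, r, f1 = micro_prf(example_counts)
```

Because the counts are pooled before division, frequent entity types weigh more heavily in a micro-average than rare ones, which is worth keeping in mind when reading a single aggregate score.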
Notes
1. Available at https://prodi.gy, last accessed 2023-03-27.
Acknowledgments
This research is funded by dtec.bw – Digitalization and Technology Research Center of the Bundeswehr. dtec.bw is funded by the European Union – NextGenerationEU.
Copyright information
© 2024 The Author(s), under exclusive license to Springer Nature Switzerland AG
About this paper
Cite this paper
Schultenkämper, S., Bäumer, F.S. (2024). Privacy Risks in German Patient Forums: A NER-Based Approach to Enrich Digital Twins. In: Lopata, A., Gudonienė, D., Butkienė, R. (eds) Information and Software Technologies. ICIST 2023. Communications in Computer and Information Science, vol 1979. Springer, Cham. https://doi.org/10.1007/978-3-031-48981-5_9
DOI: https://doi.org/10.1007/978-3-031-48981-5_9
Publisher Name: Springer, Cham
Print ISBN: 978-3-031-48980-8
Online ISBN: 978-3-031-48981-5
eBook Packages: Computer Science, Computer Science (R0)