Towards Identifying De-anonymisation Risks in Distributed Health Data Silos

Podlesny, Nikolai J.; Kayem, Anne V. D. M.; Meinel, Christoph

doi:10.1007/978-3-030-27615-7_3


Part of the book series: Lecture Notes in Computer Science (LNISA, volume 11706)

Included in the following conference series:

International Conference on Database and Expert Systems Applications


Abstract

Accessing distributed and isolated data repositories, such as medical research and treatment data, in a privacy-preserving manner is a challenging problem. Furthermore, in the context of high-dimensional datasets, adhering to strict privacy legislation can be cast as a W[2]-complete problem in which all privacy-violating attribute combinations must be identified. Traditional anonymisation algorithms incur high levels of information loss when applied to high-dimensional data, yet they often do not guarantee privacy, which defeats the purpose of anonymisation. In this paper, we extend our previous work and address these issues by using Bayesian networks to handle data transformation for anonymisation [29]. Computing the conditional probabilities that link attribute pairs, for all attribute pair combinations, allows the privacy exposure risk to be assessed. Attribute pairs linked by a high conditional probability indicate a high risk of de-anonymisation, similar to quasi-identifiers in syntactic anonymisation schemes, and can be separated instead of deleted. Attribute compartmentation removes the risk of privacy exposure, and deletion avoidance results in a significant reduction in information loss. In other words, assimilating the conditional probability of outliers directly in the adjacency matrix in a greedy fashion is efficient and privacy-preserving. Further, we offer deeper evaluation insights into optimising Bayesian networks with a multigrid solver to address state space explosion.
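The pairwise conditional-probability idea described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the toy records, the function names (`conditional_probabilities`, `predictability`, `risky_pairs`), the scoring rule (the expected maximum of P(b | a)), and the 0.9 threshold are all assumptions made purely for illustration.

```python
from collections import Counter
from itertools import permutations


def conditional_probabilities(rows, a, b):
    """Estimate P(b = y | a = x) from value counts (hypothetical helper)."""
    joint = Counter((r[a], r[b]) for r in rows)
    marginal = Counter(r[a] for r in rows)
    return {(x, y): n / marginal[x] for (x, y), n in joint.items()}


def predictability(rows, a, b):
    """Assumed score: expected value over x of max_y P(b = y | a = x).

    A score near 1 means that knowing attribute a almost determines b.
    """
    cond = conditional_probabilities(rows, a, b)
    marginal = Counter(r[a] for r in rows)
    best = {}
    for (x, _y), p in cond.items():
        best[x] = max(best.get(x, 0.0), p)
    total = len(rows)
    return sum(marginal[x] / total * best[x] for x in best)


def risky_pairs(rows, threshold=0.9):
    """List ordered attribute pairs whose score reaches the assumed threshold."""
    attrs = list(rows[0])
    return [
        (a, b)
        for a, b in permutations(attrs, 2)
        if predictability(rows, a, b) >= threshold
    ]


# Toy records, invented for this example only.
rows = [
    {"zip": "14482", "age": "30-39", "diagnosis": "flu"},
    {"zip": "14482", "age": "30-39", "diagnosis": "asthma"},
    {"zip": "10115", "age": "40-49", "diagnosis": "flu"},
    {"zip": "10115", "age": "40-49", "diagnosis": "asthma"},
]

print(risky_pairs(rows))
```

In this toy table, `zip` and `age` determine each other, so a scheme along the lines of the abstract would place them in separate compartments rather than delete either attribute. How the paper actually scores and separates pairs is not specified in the abstract.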


Notes

1. https://ec.europa.eu/info/law/law-topic/data-protection/reform/what-does-general-data-protection-regulation-gdpr-govern_en.

References

1. An, X., Jutla, D., Cercone, N.: A Bayesian network approach to detecting privacy intrusion. In: Proceedings of the 2006 IEEE/WIC/ACM International Conference on Web Intelligence and Intelligent Agent Technology, pp. 73–76. IEEE Computer Society (2006)
2. Aue, G., Biesdorf, S., Henke, N.: ehealth 2.0: how health systems can gain a leadership role in digital health. McKinsey & Company, December 2015
3. Barbaro, M., Zeller, T.: A face is exposed for AOL searcher no. 4417749, August 2006. http://www.nytimes.com/2006/08/09/technology/09aol.html
4. Bayardo, R.J., Agrawal, R.: Data privacy through optimal k-anonymization. In: 2005 Proceedings of 21st International Conference on Data Engineering, ICDE 2005, pp. 217–228. IEEE (2005)
5. Bläsius, T., Friedrich, T., Schirneck, M.: The parameterized complexity of dependency detection in relational databases. In: Guo, J., Hermelin, D. (eds.) 11th International Symposium on Parameterized and Exact Computation (IPEC 2016). Leibniz International Proceedings in Informatics (LIPIcs), vol. 63, pp. 6:1–6:13. Schloss Dagstuhl-Leibniz-Zentrum fuer Informatik, Dagstuhl, Germany (2017). http://drops.dagstuhl.de/opus/volltexte/2017/6920
6. Briggs, W.L., Henson, V.E., McCormick, S.F.: A Multigrid Tutorial. SIAM, Philadelphia (2000)
7. Carr, J.: Applications of Centre Manifold Theory, vol. 35. Springer, New York (2012)
8. Chickering, D.M., Geiger, D., Heckerman, D., et al.: Learning Bayesian networks is NP-hard. Technical Report, MSR-TR-94-17, Microsoft Research (1994)
9. Crossfield, S.S., Clamp, S.: Electronic health records research in a health sector environment with multiple provider types. In: HEALTHINF, pp. 104–111 (2013)
10. Dagum, P., Luby, M.: Approximating probabilistic inference in Bayesian belief networks is NP-hard. Artif. Intell. 60(1), 141–153 (1993)
11. De Montjoye, Y.A., Hidalgo, C.A., Verleysen, M., Blondel, V.D.: Unique in the crowd: the privacy bounds of human mobility. Sci. Rep. 3, 1376 (2013)
12. Dwork, C.: Differential privacy. In: van Tilborg, H.C.A., Jajodia, S. (eds.) Encyclopedia of Cryptography and Security, pp. 338–340. Springer, Boston (2011). https://doi.org/10.1007/978-1-4419-5906-5_752
13. Efron, B.: Bayes’ theorem in the 21st century. Science 340(6137), 1177–1178 (2013)
14. European Commission: opinion 05/2014 on anonymisation techniques, April 2014. https://www.pdpjournals.com/docs/88197.pdf
15. Fulton, S.R., Ciesielski, P.E., Schubert, W.H.: Multigrid methods for elliptic problems: a review. Mon. Weather Rev. 114(5), 943–959 (1986)
16. Kayyali, B., Knott, D., Van Kuiken, S.: The big-data revolution in us health care: accelerating value and innovation, April 2013
17. Kifer, D., Machanavajjhala, A.: No free lunch in data privacy. In: Proceedings of the 2011 ACM SIGMOD International Conference on Management of Data, SIGMOD 2011, pp. 193–204. ACM, New York (2011). https://doi.org/10.1145/1989323.1989345
18. Kohlmayer, F., Prasser, F., Eckert, C., Kuhn, K.A.: A flexible approach to distributed data anonymization. J. Biomed. Inform. 50, 62–76 (2014)
19. Leoni, D.: Non-interactive differential privacy: a survey. In: Proceedings of the First International Workshop on Open Data, pp. 40–52. ACM (2012)
20. Lin, T., Zha, H.: Riemannian manifold learning. IEEE Trans. Pattern Anal. Mach. Intell. 30(5), 796–809 (2008)
21. Liu, F.: Generalized Gaussian mechanism for differential privacy. arXiv preprint arXiv:1602.06028 (2016)
22. Massey, R.: How the GDPR will impact life sciences and health care, February 2017
23. Meng, D., Sivakumar, K., Kargupta, H.: Privacy-sensitive Bayesian network parameter learning. In: 2004 Fourth IEEE International Conference on Data Mining, ICDM 2004, pp. 487–490. IEEE (2004)
24. Meyerson, A., Williams, R.: On the complexity of optimal k-anonymity. In: Proceedings of the Twenty-Third ACM SIGMOD-SIGACT-SIGART Symposium on Principles of Database Systems, pp. 223–228. ACM (2004)
25. Mohammed, N., Fung, B., Hung, P.C., Lee, C.K.: Centralized and distributed anonymization for high-dimensional healthcare data. ACM Trans. Knowl. Discov. Data (TKDD) 4(4), 18 (2010)
26. Narayanan, A., Shmatikov, V.: How to break anonymity of the netflix prize dataset. CoRR abs/cs/0610105 (2006). http://arxiv.org/abs/cs/0610105
27. Podlesny, N.J., Kayem, A.V.D.M., von Schorlemer, S., Uflacker, M.: Minimising information loss on anonymised high dimensional data with greedy in-memory processing. In: Hartmann, S., Ma, H., Hameurlain, A., Pernul, G., Wagner, R.R. (eds.) DEXA 2018. LNCS, vol. 11029, pp. 85–100. Springer, Cham (2018). https://doi.org/10.1007/978-3-319-98809-2_6
28. Olson, L.N., Schroder, J.B.: PyAMG: algebraic multigrid solvers in Python v4.0 (2018). release 4.0, https://github.com/pyamg/pyamg
29. Podlesny, N., Kayem, A.V., Meinel, C.: Identifying data exposure across high-dimensional health data silos through Bayesian networks optimised by multigrid and manifold. In: 2019 IEEE 17th International Conference on Dependable, Autonomic and Secure Computing (DASC). IEEE (2019)
30. Podlesny, N.J.: Enriched health dataset (2017). https://github.com/jaSunny/MA-enriched-Health-Data
31. Rubinstein, I.S., Hartzog, W.: Anonymization and risk. 91 Washington Law Review, p. 703 (2016)
32. Sajda, P.: Machine learning for detection and diagnosis of disease. Annu. Rev. Biomed. Eng. 8, 537–565 (2006)
33. Schadt, E., Chilukuri, S.: The role of big data in medicine, November 2015
34. Smith, G.: Recent developments in quantitative information flow (invited tutorial). In: Proceedings of the 2015 30th Annual ACM/IEEE Symposium on Logic in Computer Science (LICS), pp. 23–31. IEEE Computer Society (2015)
35. Stüben, K.: An introduction to algebraic multigrid. Multigrid, pp. 413–532 (2001)
36. Sweeney, L.: Achieving k-anonymity privacy protection using generalization and suppression. Int. J. Uncertain. Fuzziness Knowl.-Based Syst. 10(05), 571–588 (2002)
37. Takbiri, N., Houmansadr, A., Goeckel, D.L., Pishro-Nik, H.: Fundamental limits of location privacy using anonymization. In: 2017 51st Annual Conference on Information Sciences and Systems (CISS), pp. 1–6. IEEE (2017)
38. Terrovitis, M., Mamoulis, N., Kalnis, P.: Privacy-preserving anonymization of set-valued data. Proc. VLDB Endow. 1(1), 115–125 (2008)
39. Vaněk, P., Mandel, J., Brezina, M.: Algebraic multigrid by smoothed aggregation for second and fourth order elliptic problems. Computing 56(3), 179–196 (1996)
40. Vessenes, P., Seidensticker, R.: System and method for analyzing transactions in a distributed ledger, US Patent 9,298,806, 29 March 2016. https://www.google.com/patents/US9298806
41. Wang, J., Zhang, Z., Zha, H.: Adaptive manifold learning. In: Advances in Neural Information Processing Systems, pp. 1473–1480 (2005)
42. Wright, R., Yang, Z.: Privacy-preserving Bayesian network structure computation on distributed heterogeneous data. In: Proceedings of the Tenth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp. 713–718. ACM (2004)
43. Zhang, B., Dave, V., Mohammed, N., Hasan, M.A.: Feature selection for classification under anonymity constraint. arXiv preprint arXiv:1512.07158 (2015)
44. Zhang, J., Cormode, G., Procopiuc, C.M., Srivastava, D., Xiao, X.: Privbayes: private data release via bayesian networks. ACM Trans. Database Syst. (TODS) 42(4), 25 (2017)
45. Zhang, X., Yang, L.T., Liu, C., Chen, J.: A scalable two-phase top-down specialization approach for data anonymization using mapreduce on cloud. IEEE Trans. Parallel Distrib. Syst. 25(2), 363–373 (2014)
46. Zillner, S., Neururer, S.: Big data in the health sector. In: Cavanillas, J.M., Curry, E., Wahlster, W. (eds.) New Horizons for a Data-Driven Economy, pp. 179–194. Springer, Cham (2016). https://doi.org/10.1007/978-3-319-21569-3_10


Author information

Authors and Affiliations

Hasso Plattner Institute, University of Potsdam, Potsdam, Germany
Nikolai J. Podlesny, Anne V. D. M. Kayem & Christoph Meinel


Editor information

Editors and Affiliations

Clausthal University of Technology, Clausthal-Zellerfeld, Germany
Sven Hartmann
Johannes Kepler University of Linz, Linz, Austria
Josef Küng
The University of Texas at Arlington, Arlington, TX, USA
Sharma Chakravarthy
Johannes Kepler University of Linz, Linz, Austria
Gabriele Anderst-Kotsis
Software Competence Center Hagenberg, Hagenberg im Mühlkreis, Austria
A Min Tjoa
Johannes Kepler University of Linz, Linz, Austria
Ismail Khalil


About this paper

Cite this paper

Podlesny, N.J., Kayem, A.V.D.M., Meinel, C. (2019). Towards Identifying De-anonymisation Risks in Distributed Health Data Silos. In: Hartmann, S., Küng, J., Chakravarthy, S., Anderst-Kotsis, G., Tjoa, A., Khalil, I. (eds) Database and Expert Systems Applications. DEXA 2019. Lecture Notes in Computer Science, vol 11706. Springer, Cham. https://doi.org/10.1007/978-3-030-27615-7_3


DOI: https://doi.org/10.1007/978-3-030-27615-7_3
Published: 03 August 2019
Publisher Name: Springer, Cham
Print ISBN: 978-3-030-27614-0
Online ISBN: 978-3-030-27615-7
eBook Packages: Computer Science, Computer Science (R0)
