Abstract
Accessing distributed and isolated data repositories, such as medical research and treatment data, in a privacy-preserving manner is a challenging problem. Furthermore, for high-dimensional datasets, adhering to strict privacy legislation can be cast as a W[2]-complete problem in which all privacy-violating attribute combinations must be identified. Traditional anonymisation algorithms incur high levels of information loss when applied to high-dimensional data, yet they often do not guarantee privacy, which defeats the purpose of anonymisation. In this paper, we extend our previous work and address these issues by using Bayesian networks to handle data transformation for anonymisation [29]. By computing the conditional probabilities linking every pair of attributes, the privacy exposure risk can be assessed. Attribute pairs with a high conditional probability indicate a high risk of de-anonymisation, similar to quasi-identifiers in syntactic anonymisation schemes, and can be separated instead of deleted. Attribute compartmentation removes the risk of privacy exposure, and avoiding deletion significantly reduces information loss. In other words, assimilating the conditional probability of outliers directly in the adjacency matrix in a greedy fashion is efficient and privacy-preserving. Further, we offer deeper evaluation insights into optimising Bayesian networks with a multigrid solver to address state space explosion.
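The pairwise risk assessment and greedy compartmentation described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the scoring function (the expected value of the most likely value of B given A), the threshold, the toy records, and all function names are assumptions chosen only to make the idea concrete.

```python
from collections import Counter
from itertools import permutations


def conditional_strength(rows, a, b):
    """Assumed score: sum over values of A of P(A=a) * max_b P(B=b | A=a).

    A value close to 1.0 means that knowing A almost determines B.
    """
    pair_counts = Counter((r[a], r[b]) for r in rows)
    best = {}
    for (va, _vb), n in pair_counts.items():
        # Keep the count of the most frequent B value for each A value.
        best[va] = max(best.get(va, 0), n)
    return sum(best.values()) / len(rows)


def risk_matrix(rows, attributes):
    """Adjacency matrix (dict of dicts) over all ordered attribute pairs."""
    matrix = {x: {} for x in attributes}
    for a, b in permutations(attributes, 2):
        matrix[a][b] = conditional_strength(rows, a, b)
    return matrix


def compartmentalise(attributes, matrix, threshold):
    """Greedily place each attribute in the first compartment where it has
    no high-risk link (in either direction) to an existing member."""
    compartments = []
    for attr in attributes:
        for comp in compartments:
            if all(matrix[attr][o] < threshold and matrix[o][attr] < threshold
                   for o in comp):
                comp.append(attr)
                break
        else:
            compartments.append([attr])
    return compartments


# Hypothetical toy records, invented for illustration only.
rows = [
    {"zip": "10115", "age": 34, "diagnosis": "flu"},
    {"zip": "10115", "age": 34, "diagnosis": "flu"},
    {"zip": "10117", "age": 51, "diagnosis": "asthma"},
    {"zip": "10117", "age": 51, "diagnosis": "flu"},
]
attributes = ["zip", "age", "diagnosis"]
matrix = risk_matrix(rows, attributes)
compartments = compartmentalise(attributes, matrix, threshold=0.9)
print(compartments)
```

In this toy data, zip and age determine each other completely, so the sketch separates them into different compartments rather than deleting either attribute. On realistically sized data the score would need smoothing, since small groups inflate the estimate.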
References
An, X., Jutla, D., Cercone, N.: A Bayesian network approach to detecting privacy intrusion. In: Proceedings of the 2006 IEEE/WIC/ACM International Conference on Web Intelligence and Intelligent Agent Technology, pp. 73–76. IEEE Computer Society (2006)
Aue, G., Biesdorf, S., Henke, N.: eHealth 2.0: how health systems can gain a leadership role in digital health. McKinsey & Company, December 2015
Barbaro, M., Zeller, T.: A face is exposed for AOL searcher no. 4417749, August 2006. http://www.nytimes.com/2006/08/09/technology/09aol.html
Bayardo, R.J., Agrawal, R.: Data privacy through optimal k-anonymization. In: 2005 Proceedings of 21st International Conference on Data Engineering, ICDE 2005, pp. 217–228. IEEE (2005)
Bläsius, T., Friedrich, T., Schirneck, M.: The parameterized complexity of dependency detection in relational databases. In: Guo, J., Hermelin, D. (eds.) 11th International Symposium on Parameterized and Exact Computation (IPEC 2016). Leibniz International Proceedings in Informatics (LIPIcs), vol. 63, pp. 6:1–6:13. Schloss Dagstuhl-Leibniz-Zentrum fuer Informatik, Dagstuhl, Germany (2017). http://drops.dagstuhl.de/opus/volltexte/2017/6920
Briggs, W.L., Henson, V.E., McCormick, S.F.: A Multigrid Tutorial. SIAM, Philadelphia (2000)
Carr, J.: Applications of Centre Manifold Theory, vol. 35. Springer, New York (2012)
Chickering, D.M., Geiger, D., Heckerman, D., et al.: Learning Bayesian networks is NP-hard. Technical Report, MSR-TR-94-17, Microsoft Research (1994)
Crossfield, S.S., Clamp, S.: Electronic health records research in a health sector environment with multiple provider types. In: HEALTHINF, pp. 104–111 (2013)
Dagum, P., Luby, M.: Approximating probabilistic inference in Bayesian belief networks is NP-hard. Artif. Intell. 60(1), 141–153 (1993)
De Montjoye, Y.A., Hidalgo, C.A., Verleysen, M., Blondel, V.D.: Unique in the crowd: the privacy bounds of human mobility. Sci. Rep. 3, 1376 (2013)
Dwork, C.: Differential privacy. In: van Tilborg, H.C.A., Jajodia, S. (eds.) Encyclopedia of Cryptography and Security, pp. 338–340. Springer, Boston (2011). https://doi.org/10.1007/978-1-4419-5906-5_752
Efron, B.: Bayes’ theorem in the 21st century. Science 340(6137), 1177–1178 (2013)
European Commission: Opinion 05/2014 on anonymisation techniques, April 2014. https://www.pdpjournals.com/docs/88197.pdf
Fulton, S.R., Ciesielski, P.E., Schubert, W.H.: Multigrid methods for elliptic problems: a review. Mon. Weather Rev. 114(5), 943–959 (1986)
Kayyali, B., Knott, D., Van Kuiken, S.: The big-data revolution in US health care: accelerating value and innovation, April 2013
Kifer, D., Machanavajjhala, A.: No free lunch in data privacy. In: Proceedings of the 2011 ACM SIGMOD International Conference on Management of Data, SIGMOD 2011, pp. 193–204. ACM, New York (2011). https://doi.org/10.1145/1989323.1989345
Kohlmayer, F., Prasser, F., Eckert, C., Kuhn, K.A.: A flexible approach to distributed data anonymization. J. Biomed. Inform. 50, 62–76 (2014)
Leoni, D.: Non-interactive differential privacy: a survey. In: Proceedings of the First International Workshop on Open Data, pp. 40–52. ACM (2012)
Lin, T., Zha, H.: Riemannian manifold learning. IEEE Trans. Pattern Anal. Mach. Intell. 30(5), 796–809 (2008)
Liu, F.: Generalized Gaussian mechanism for differential privacy. arXiv preprint arXiv:1602.06028 (2016)
Massey, R.: How the GDPR will impact life sciences and health care, February 2017
Meng, D., Sivakumar, K., Kargupta, H.: Privacy-sensitive Bayesian network parameter learning. In: 2004 Fourth IEEE International Conference on Data Mining, ICDM 2004, pp. 487–490. IEEE (2004)
Meyerson, A., Williams, R.: On the complexity of optimal k-anonymity. In: Proceedings of the Twenty-Third ACM SIGMOD-SIGACT-SIGART Symposium on Principles of Database Systems, pp. 223–228. ACM (2004)
Mohammed, N., Fung, B., Hung, P.C., Lee, C.K.: Centralized and distributed anonymization for high-dimensional healthcare data. ACM Trans. Knowl. Discov. Data (TKDD) 4(4), 18 (2010)
Narayanan, A., Shmatikov, V.: How to break anonymity of the Netflix Prize dataset. CoRR abs/cs/0610105 (2006). http://arxiv.org/abs/cs/0610105
Podlesny, N.J., Kayem, A.V.D.M., von Schorlemer, S., Uflacker, M.: Minimising information loss on anonymised high dimensional data with greedy in-memory processing. In: Hartmann, S., Ma, H., Hameurlain, A., Pernul, G., Wagner, R.R. (eds.) DEXA 2018. LNCS, vol. 11029, pp. 85–100. Springer, Cham (2018). https://doi.org/10.1007/978-3-319-98809-2_6
Olson, L.N., Schroder, J.B.: PyAMG: algebraic multigrid solvers in Python v4.0 (2018). Release 4.0. https://github.com/pyamg/pyamg
Podlesny, N., Kayem, A.V., Meinel, C.: Identifying data exposure across high-dimensional health data silos through Bayesian networks optimised by multigrid and manifold. In: 2019 IEEE 17th International Conference on Dependable, Autonomic and Secure Computing (DASC). IEEE (2019)
Podlesny, N.J.: Enriched health dataset (2017). https://github.com/jaSunny/MA-enriched-Health-Data
Rubinstein, I.S., Hartzog, W.: Anonymization and risk. Wash. Law Rev. 91, 703 (2016)
Sajda, P.: Machine learning for detection and diagnosis of disease. Annu. Rev. Biomed. Eng. 8, 537–565 (2006)
Schadt, E., Chilukuri, S.: The role of big data in medicine, November 2015
Smith, G.: Recent developments in quantitative information flow (invited tutorial). In: Proceedings of the 2015 30th Annual ACM/IEEE Symposium on Logic in Computer Science (LICS), pp. 23–31. IEEE Computer Society (2015)
Stüben, K.: An introduction to algebraic multigrid. Multigrid, pp. 413–532 (2001)
Sweeney, L.: Achieving k-anonymity privacy protection using generalization and suppression. Int. J. Uncertain. Fuzziness Knowl.-Based Syst. 10(05), 571–588 (2002)
Takbiri, N., Houmansadr, A., Goeckel, D.L., Pishro-Nik, H.: Fundamental limits of location privacy using anonymization. In: 2017 51st Annual Conference on Information Sciences and Systems (CISS), pp. 1–6. IEEE (2017)
Terrovitis, M., Mamoulis, N., Kalnis, P.: Privacy-preserving anonymization of set-valued data. Proc. VLDB Endow. 1(1), 115–125 (2008)
Vaněk, P., Mandel, J., Brezina, M.: Algebraic multigrid by smoothed aggregation for second and fourth order elliptic problems. Computing 56(3), 179–196 (1996)
Vessenes, P., Seidensticker, R.: System and method for analyzing transactions in a distributed ledger, US Patent 9,298,806, 29 March 2016. https://www.google.com/patents/US9298806
Wang, J., Zhang, Z., Zha, H.: Adaptive manifold learning. In: Advances in Neural Information Processing Systems, pp. 1473–1480 (2005)
Wright, R., Yang, Z.: Privacy-preserving Bayesian network structure computation on distributed heterogeneous data. In: Proceedings of the Tenth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp. 713–718. ACM (2004)
Zhang, B., Dave, V., Mohammed, N., Hasan, M.A.: Feature selection for classification under anonymity constraint. arXiv preprint arXiv:1512.07158 (2015)
Zhang, J., Cormode, G., Procopiuc, C.M., Srivastava, D., Xiao, X.: PrivBayes: private data release via Bayesian networks. ACM Trans. Database Syst. (TODS) 42(4), 25 (2017)
Zhang, X., Yang, L.T., Liu, C., Chen, J.: A scalable two-phase top-down specialization approach for data anonymization using MapReduce on cloud. IEEE Trans. Parallel Distrib. Syst. 25(2), 363–373 (2014)
Zillner, S., Neururer, S.: Big data in the health sector. In: Cavanillas, J.M., Curry, E., Wahlster, W. (eds.) New Horizons for a Data-Driven Economy, pp. 179–194. Springer, Cham (2016). https://doi.org/10.1007/978-3-319-21569-3_10
Copyright information
© 2019 Springer Nature Switzerland AG
About this paper
Cite this paper
Podlesny, N.J., Kayem, A.V.D.M., Meinel, C. (2019). Towards Identifying De-anonymisation Risks in Distributed Health Data Silos. In: Hartmann, S., Küng, J., Chakravarthy, S., Anderst-Kotsis, G., Tjoa, A., Khalil, I. (eds) Database and Expert Systems Applications. DEXA 2019. Lecture Notes in Computer Science, vol 11706. Springer, Cham. https://doi.org/10.1007/978-3-030-27615-7_3
DOI: https://doi.org/10.1007/978-3-030-27615-7_3
Publisher Name: Springer, Cham
Print ISBN: 978-3-030-27614-0
Online ISBN: 978-3-030-27615-7
eBook Packages: Computer Science (R0)