Skip to main content

Minimising Information Loss on Anonymised High Dimensional Data with Greedy In-Memory Processing

  • Conference paper
  • First Online:

Part of the book series: Lecture Notes in Computer Science ((LNISA,volume 11029))

Abstract

Minimising information loss on anonymised high dimensional data is important for data utility. Syntactic data anonymisation algorithms address this issue by generating datasets that are neither use-case specific nor dependent on runtime specifications. This results in anonymised datasets that can be re-used in different scenarios which is performance efficient. However, syntactic data anonymisation algorithms incur high information loss on high dimensional data, making the data unusable for analytics. In this paper, we propose an optimised exact quasi-identifier identification scheme, based on the notion of k-anonymity, to generate anonymised high dimensional datasets efficiently, and with low information loss. The optimised exact quasi-identifier identification scheme works by identifying and eliminating maximal partial unique column combination (mpUCC) attributes that endanger anonymity. By using in-memory processing to handle the attribute selection procedure, we significantly reduce the processing time required. We evaluated the effectiveness of our proposed approach with an enriched dataset drawn from multiple real-world data sources, and augmented with synthetic values generated in close alignment with the real-world data distributions. Our results indicate that in-memory processing drops attribute selection time for the mpUCC candidates from 400s to 100s, while significantly reducing information loss. In addition, we achieve a time complexity speed-up of \(O(3^{n/3})\approx O(1.4422^{n})\).

This is a preview of subscription content, log in via an institution.

Buying options

Chapter
USD   29.95
Price excludes VAT (USA)
  • Available as PDF
  • Read on any device
  • Instant download
  • Own it forever
eBook
USD   39.99
Price excludes VAT (USA)
  • Available as EPUB and PDF
  • Read on any device
  • Instant download
  • Own it forever
Softcover Book
USD   54.99
Price excludes VAT (USA)
  • Compact, lightweight edition
  • Dispatched in 3 to 5 business days
  • Free shipping worldwide - see info

Tax calculation will be finalised at checkout

Purchases are for personal use only

Learn about institutional subscriptions

Notes

  1. 1.

    http://www-07.ibm.com/au/hana/pdf/S_HANA_Performance_Whitepaper.pdf.

  2. 2.

    https://www.python.org/download/releases/3.0/.

  3. 3.

    http://news.sap.com/germany/gesundheit-cloud/.

  4. 4.

    https://www.medicare.gov/download/downloaddb.asp.

  5. 5.

    https://www.fda.gov/drugs/informationondrugs/ucm142438.htm.

  6. 6.

    https://health.data.ny.gov/browse?limitTo=datasets&sortBy=alpha.

  7. 7.

    https://www.cms.gov/Medicare/Coding/ICD9ProviderDiagnosticCodes/codes.html.

  8. 8.

    https://github.com/jaSunny/MA-enriched-Health-Data.

  9. 9.

    https://medlineplus.gov/coronaryarterydisease.html.

  10. 10.

    https://github.com/jaSunny/MA-Anonymization-ETL.

  11. 11.

    https://github.com/jaSunny/MA-enriched-Health-Data.

References

  1. Aggarwal, C.C.: On k-anonymity and the curse of dimensionality. In: Proceedings of the 31st International Conference on Very Large Data Bases, VLDB 2005 (2005)

    Google Scholar 

  2. Barbaro, M., Zeller, T., Hansell, S.: A face is exposed for AOL searcher no. 4417749. New York Times 9(2008), 8 (2006). https://www.nytimes.com/2006/08/09/technology/09aol.html

  3. Bayardo, R.J., Agrawal, R.: Data privacy through optimal k-anonymization. In: Proceedings of the 21st International Conference on Data Engineering, ICDE 2005, pp. 217–228. IEEE (2005)

    Google Scholar 

  4. Bhaskar, R., Laxman, S., Smith, A., Thakurta, A.: Discovering frequent patterns in sensitive data. In: Proceedings of the 16th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp. 503–512. ACM (2010)

    Google Scholar 

  5. Bläsius, T., Friedrich, T., Schirneck, M.: The parameterized complexity of dependency detection in relational databases. In: LIPIcs-Leibniz International Proceedings in Informatics. Schloss Dagstuhl-Leibniz-Zentrum fuer Informatik (2017)

    Google Scholar 

  6. Bonomi, L., Xiong, L.: Mining frequent patterns with differential privacy. Proc. VLDB Endow. 6(12), 1422–1427 (2013)

    Article  Google Scholar 

  7. De Montjoye, Y.A., Hidalgo, C.A., Verleysen, M., Blondel, V.D.: Unique in the crowd: the privacy bounds of human mobility. Sci. Rep. 3, 1376 (2013)

    Article  Google Scholar 

  8. Dondi, R., Mauri, G., Zoppis, I.: On the complexity of the l-diversity problem. In: Murlak, F., Sankowski, P. (eds.) MFCS 2011. LNCS, vol. 6907, pp. 266–277. Springer, Heidelberg (2011). https://doi.org/10.1007/978-3-642-22993-0_26

    Chapter  MATH  Google Scholar 

  9. Dwork, C.: Differential privacy: a survey of results. In: Agrawal, M., Du, D., Duan, Z., Li, A. (eds.) TAMC 2008. LNCS, vol. 4978, pp. 1–19. Springer, Heidelberg (2008). https://doi.org/10.1007/978-3-540-79228-4_1

    Chapter  MATH  Google Scholar 

  10. Dwork, C., McSherry, F., Nissim, K., Smith, A.: Calibrating noise to sensitivity in private data analysis. In: Halevi, S., Rabin, T. (eds.) TCC 2006. LNCS, vol. 3876, pp. 265–284. Springer, Heidelberg (2006). https://doi.org/10.1007/11681878_14

    Chapter  Google Scholar 

  11. Färber, F., et al.: The SAP HANA database-an architecture overview. IEEE Data Eng. Bull. 35(1), 28–33 (2012)

    Google Scholar 

  12. Fienberg, S.E., Jin, J.: Privacy-preserving data sharing in high dimensional regression and classification settings. J. Priv. Confid. 4(1), 221–243 (2012)

    Google Scholar 

  13. Fredj, F.B., Lammari, N., Comyn-Wattiau, I.: Abstracting anonymization techniques: a prerequisite for selecting a generalization algorithm. Procedia Comput. Sci. 60, 206–215 (2015)

    Article  Google Scholar 

  14. Ghosh, A., Roughgarden, T., Sundararajan, M.: Universally utility-maximizing privacy mechanisms. SIAM J. Comput. 41(6), 1673–1693 (2012)

    Article  MathSciNet  Google Scholar 

  15. Ibarra, O.H.: Reversal-bounded multicounter machines and their decision problems. J. ACM (JACM) 25(1), 116–133 (1978)

    Article  MathSciNet  Google Scholar 

  16. Islam, M.Z., Brankovic, L.: Privacy preserving data mining: a noise addition framework using a novel clustering technique. Knowl.-Based Syst. 24(8), 1214–1223 (2011)

    Article  Google Scholar 

  17. Karp, R.M.: Reducibility among combinatorial problems. In: Miller, R.E., Thatcher, J.W., Bohlinger, J.D. (eds.) Complexity of Computer Computations. IRSS, pp. 85–103. Springer, Boston (1972). https://doi.org/10.1007/978-1-4684-2001-2_9

    Chapter  Google Scholar 

  18. Kifer, D., Machanavajjhala, A.: No free lunch in data privacy. In: Proceedings of the 2011 ACM SIGMOD, SIGMOD 2011, pp. 193–204. ACM (2011)

    Google Scholar 

  19. Kohlmayer, F., Prasser, F., Eckert, C., Kuhn, K.A.: A flexible approach to distributed data anonymization. J. Biomed. Inform. 50, 62–76 (2014)

    Article  Google Scholar 

  20. Koufogiannis, F., Han, S., Pappas, G.J.: Optimality of the Laplace mechanism in differential privacy (2015)

    Google Scholar 

  21. Lee, J., et al.: High-performance transaction processing in SAP HANA. IEEE Data Eng. Bull. 36(2), 28–33 (2013)

    Google Scholar 

  22. Li, C., Miklau, G., Hay, M., McGregor, A., Rastogi, V.: The matrix mechanism: optimizing linear counting queries under differential privacy. VLDB J. 24(6), 757–781 (2015)

    Article  Google Scholar 

  23. Li, N., Li, T., Venkatasubramanian, S.: T-closeness: privacy beyond k-anonymity and l-diversity. In: 2007 IEEE 23rd ICDE, pp. 106–115, April 2007

    Google Scholar 

  24. Liang, H., Yuan, H.: On the complexity of t-closeness anonymization and related problems. In: Meng, W., Feng, L., Bressan, S., Winiwarter, W., Song, W. (eds.) DASFAA 2013. LNCS, vol. 7825, pp. 331–345. Springer, Heidelberg (2013). https://doi.org/10.1007/978-3-642-37487-6_26

    Chapter  Google Scholar 

  25. Liu, F.: Generalized Gaussian mechanism for differential privacy (2016)

    Google Scholar 

  26. Machanavajjhala, A., Kifer, D., Gehrke, J., Venkitasubramaniam, M.: L-diversity: privacy beyond k-anonymity. ACM TKDD 1(1), 3 (2007)

    Article  Google Scholar 

  27. McSherry, F., Talwar, K.: Mechanism design via differential privacy. In: 48th IEEE Symposium Foundations of Computer Science, FOCS 2007 (2007)

    Google Scholar 

  28. Meyer, A.R., Stockmeyer, L.J.: The equivalence problem for regular expressions with squaring requires exponential space. In: SWAT (FOCS), pp. 125–129 (1972)

    Google Scholar 

  29. Meyerson, A., Williams, R.: On the complexity of optimal k-anonymity. In: Proceedings of the Twenty-Third ACM SIGMOD-SIGACT-SIGART Symposium on Principles of Database Systems, pp. 223–228. ACM (2004)

    Google Scholar 

  30. Mohammed, N., Fung, B., Hung, P.C., Lee, C.K.: Centralized and distributed anonymization for high-dimensional healthcare data. ACM TKDD 4(4), 18 (2010)

    Google Scholar 

  31. Papenbrock, T., Naumann, F.: A hybrid approach for efficient unique column combination discovery. Proc. der Fachtagung Business, Technologie und Web (2017)

    Google Scholar 

  32. Plattner, H., et al.: A Course in In-Memory Data Management. Springer, Heidelberg (2013). https://doi.org/10.1007/978-3-642-55270-0

    Book  Google Scholar 

  33. Polonetsky, J., Tene, O., Finch, K.: Shades of gray: seeing the full spectrum of practical data de-identification (2016)

    Google Scholar 

  34. Rubinstein, I., Hartzog, W.: Anonymization and risk (2015)

    Google Scholar 

  35. Rzhetsky, A., Wajngurt, D., Park, N., Zheng, T.: Probing genetic overlap among complex human phenotypes. Proc. Nat. Acad. Sci. 104(28), 11694–11699 (2007)

    Article  Google Scholar 

  36. Suthram, S., Dudley, J.T., Chiang, A.P., Chen, R., Hastie, T.J., Butte, A.J.: Network-based elucidation of human disease similarities reveals common functional modules enriched for pluripotent drug targets. PLoS Comput. Biol. 6(2), 1–10 (2010)

    Article  Google Scholar 

  37. Sweeney, L.: Achieving k-anonymity privacy protection using generalization and suppression. Int. J. Uncertain. Fuzziness Knowl.-Based Syst. 10(05), 571–588 (2002)

    Article  MathSciNet  Google Scholar 

  38. Sweeney, L.: K-anonymity: a model for protecting privacy. Int. J. Uncertain. Fuzziness Knowl.-Based Syst. 10(05), 557–570 (2002)

    Article  MathSciNet  Google Scholar 

  39. Terrovitis, M., Mamoulis, N., Kalnis, P.: Privacy-preserving anonymization of set-valued data. Proc. VLDB Endow. 1(1), 115–125 (2008)

    Article  Google Scholar 

  40. Vaidya, J., Clifton, C.: Privacy-preserving k-means clustering over vertically partitioned data. In: Proceedings of the Ninth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp. 206–215. ACM (2003)

    Google Scholar 

  41. Vaidya, J., Kantarcıoğlu, M., Clifton, C.: Privacy-preserving Naive Bayes classification. VLDB J.—Int. J. Very Large Data Bases 17(4), 879–898 (2008)

    Article  Google Scholar 

  42. Vessenes, P., Seidensticker, R.: System and method for analyzing transactions in a distributed ledger. US Patent 9,298,806, 29 March 2016

    Google Scholar 

  43. Wernke, M., Skvortsov, P., Dürr, F., Rothermel, K.: A classification of location privacy attacks and approaches. Pers. Ubiquit. Comput. 18(1), 163–175 (2014)

    Article  Google Scholar 

  44. Wimmer, H., Powell, L.: A comparison of the effects of k-anonymity on machine learning algorithms. In: Proceedings of the Conference for Information Systems Applied Research ISSN, vol. 2167, p. 1508 (2014)

    Google Scholar 

  45. Zhang, B., Dave, V., Mohammed, N., Hasan, M.A.: Feature selection for classification under anonymity constraint. arXiv preprint arXiv:1512.07158 (2015)

  46. Zhang, X., Yang, L.T., Liu, C., Chen, J.: A scalable two-phase top-down specialization approach for data anonymization using mapreduce on cloud. IEEE Trans. Parallel Distrib. Syst. 25(2), 363–373 (2014)

    Article  Google Scholar 

  47. Zhou, X., Menche, J., Barabási, A.L., Sharma, A.: Human symptoms-disease network. Nat. Commun. 5, 4212 (2014)

    Article  Google Scholar 

Download references

Author information

Authors and Affiliations

Authors

Corresponding authors

Correspondence to Nikolai J. Podlesny , Anne V. D. M. Kayem , Stephan von Schorlemer or Matthias Uflacker .

Editor information

Editors and Affiliations

Rights and permissions

Reprints and permissions

Copyright information

© 2018 Springer Nature Switzerland AG

About this paper

Check for updates. Verify currency and authenticity via CrossMark

Cite this paper

Podlesny, N.J., Kayem, A.V.D.M., von Schorlemer, S., Uflacker, M. (2018). Minimising Information Loss on Anonymised High Dimensional Data with Greedy In-Memory Processing. In: Hartmann, S., Ma, H., Hameurlain, A., Pernul, G., Wagner, R. (eds) Database and Expert Systems Applications. DEXA 2018. Lecture Notes in Computer Science(), vol 11029. Springer, Cham. https://doi.org/10.1007/978-3-319-98809-2_6

Download citation

  • DOI: https://doi.org/10.1007/978-3-319-98809-2_6

  • Published:

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-319-98808-5

  • Online ISBN: 978-3-319-98809-2

  • eBook Packages: Computer ScienceComputer Science (R0)

Publish with us

Policies and ethics