An evolutionary feature set decomposition based anonymization for classification workloads: Privacy Preserving Data Mining

Cluster Computing

Abstract

Privacy has become an important concern when publishing microdata about a population. The emerging area of privacy preserving data mining (PPDM) focuses on protecting individual privacy without compromising data mining results. Adversarial exploitation of published data poses a risk of disclosing information about individuals. On the other hand, imposing privacy constraints on the data results in substantial information loss and compromises legitimate data analysis. Motivated by the growing number of PPDM algorithms, we first investigate the privacy implications and the crosscutting issues between privacy and utility of data. We then present a privacy model that embeds the anonymization procedure into a learning algorithm, which mitigates the additional overhead imposed on data mining tasks. Our primary concern about PPDM is that the utility of the data should not be compromised by the transformation applied. Different data mining classification workloads are analyzed with the proposed anonymization procedure for any side effects incurred. It is shown empirically that, for most of the datasets, the classification accuracy obtained with the proposed procedure exceeds that obtained with the original dataset.

Acknowledgements

This research was supported by a grant from the Department of Science and Technology, Government of India (SEED/WS/018/2015).

Author information

Corresponding author

Correspondence to A. K. Ilavarasi.

About this article

Cite this article

Ilavarasi, A.K., Sathiyabhama, B. An evolutionary feature set decomposition based anonymization for classification workloads: Privacy Preserving Data Mining. Cluster Comput 20, 3515–3525 (2017). https://doi.org/10.1007/s10586-017-1108-9

  • DOI: https://doi.org/10.1007/s10586-017-1108-9
