Progress in Artificial Intelligence, Volume 6, Issue 2, pp 87–104

Preserving output-privacy in data stream classification

Regular Paper

Abstract

Privacy preservation has emerged as a major concern in the design of data mining systems. However, protecting the privacy of the data mining input does not guarantee that the output is privacy-preserving. This paper focuses on preserving the privacy of data mining output, in particular the output of the classification task. Further, instead of static datasets, we consider the classification of continuously arriving data streams, a rapidly growing research area. Data stream classification poses its own challenges, such as vast volume, a mixture of labeled and unlabeled instances throughout the stream, and the need for timely classifier publication, which make enforcing privacy-preservation techniques even more challenging. To address this, we propose a systematic method for preserving output-privacy in data stream classification that is applicable to tasks such as loan approval, credit card fraud detection, and disease outbreak or biological attack detection. Specifically, we propose an algorithm named Diverse and k-Anonymized HOeffding Tree (DAHOT), which combines the popular data stream classification algorithm Hoeffding tree with a variant of the k-anonymity and l-diversity principles. Empirical results on real and synthetic data streams verify the effectiveness of DAHOT compared with its underlying Hoeffding tree and two other techniques: one that learns sanitized decision trees from a sampled data stream, and another that uses ensemble-based classification. DAHOT guarantees the protection of private patterns while classifying data streams accurately.
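To make the two ingredients named above concrete, the following minimal Python sketch shows the standard Hoeffding bound used by Hoeffding trees to decide when a split is statistically justified, together with a simple leaf-level check in the spirit of k-anonymity and l-diversity. This is not the DAHOT algorithm itself: the abstract does not specify the variant it uses, so the function names (`hoeffding_bound`, `should_split`, `leaf_is_publishable`) and the choice to apply the diversity check to class labels are assumptions made only for illustration.

```python
import math
from collections import Counter


def hoeffding_bound(value_range, delta, n):
    """Standard Hoeffding bound: with probability 1 - delta, the true mean of a
    variable with the given range lies within epsilon of the mean of n samples."""
    return math.sqrt(value_range ** 2 * math.log(1.0 / delta) / (2.0 * n))


def should_split(best_gain, second_gain, value_range, delta, n):
    """Split a leaf once the gap between the two best attributes exceeds the bound."""
    return (best_gain - second_gain) > hoeffding_bound(value_range, delta, n)


def leaf_is_publishable(class_labels, k, l):
    """Hypothetical privacy check (an assumption, not the paper's definition):
    the leaf must cover at least k instances and at least l distinct class labels."""
    distinct = Counter(class_labels)
    return len(class_labels) >= k and len(distinct) >= l


if __name__ == "__main__":
    print(hoeffding_bound(1.0, 1e-7, 1000))
    print(should_split(0.30, 0.15, 1.0, 1e-7, 1000))
    print(leaf_is_publishable(["approve", "approve", "reject"], k=3, l=2))
```

In a sketch like this, a privacy-aware tree learner would publish or split a leaf only when both the statistical test and the privacy check pass; how DAHOT actually combines them is described in the body of the paper, not in this abstract.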

Keywords

Classification; Data streams; Output-privacy; Privacy-preserving classification of data streams; Decision tree; Anonymization

Copyright information

© Springer-Verlag Berlin Heidelberg 2017

Authors and Affiliations

  1. Department of Computer Science and Engineering, Nirma University, Ahmedabad, India
