
Preserving output-privacy in data stream classification

  • Regular Paper
  • Progress in Artificial Intelligence

Abstract

Privacy preservation has emerged as a major concern in the design of data mining systems. However, protecting the privacy of data mining input does not guarantee that the output is privacy-preserving. This paper focuses on preserving the privacy of data mining output, and in particular the output of the classification task. Further, instead of static datasets, we consider the classification of continuously arriving data streams, a rapidly growing research area. Data stream classification poses its own challenges, such as vast volume, a mixture of labeled and unlabeled instances throughout the stream, and the need for timely classifier publication, and these make enforcing privacy-preservation techniques even more difficult. To address this, we propose a systematic method for preserving output privacy in data stream classification that applies to problems such as loan approval, credit card fraud detection, and disease outbreak or biological attack detection. Specifically, we propose an algorithm named Diverse and k-Anonymized HOeffding Tree (DAHOT), which combines the popular data stream classification algorithm, the Hoeffding tree, with a variant of the k-anonymity and l-diversity principles. Empirical results on real and synthetic data streams verify the effectiveness of DAHOT compared with its underlying Hoeffding tree and with two other techniques: one that learns sanitized decision trees from a sampled data stream, and another that uses ensemble-based classification. DAHOT guarantees that private patterns are preserved while classifying the data streams accurately.
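The two building blocks named in the abstract can be illustrated with a minimal sketch. This is not the DAHOT algorithm, whose details are not given here. It shows only the standard Hoeffding bound that a Hoeffding tree uses to decide when to split, together with a plain k-anonymity and distinct l-diversity check on the instances reaching a leaf. The function names, the choice of distinct l-diversity, and the idea of checking a leaf before publication are assumptions made for illustration.

```python
import math
from collections import Counter


def hoeffding_bound(value_range: float, delta: float, n: int) -> float:
    """Standard Hoeffding bound: with probability 1 - delta, the true mean of a
    variable with the given range lies within this epsilon of the mean
    observed over n independent samples."""
    return math.sqrt(value_range ** 2 * math.log(1.0 / delta) / (2.0 * n))


def should_split(best_gain: float, second_gain: float,
                 value_range: float, delta: float, n: int) -> bool:
    """Hoeffding-tree split test: split once the gap between the two best
    attributes exceeds the bound, so the observed winner is very likely the
    true winner."""
    return (best_gain - second_gain) > hoeffding_bound(value_range, delta, n)


def leaf_is_publishable(sensitive_values: list, k: int, l: int) -> bool:
    """Hypothetical privacy check (an assumption, not the paper's variant):
    the leaf must cover at least k instances (k-anonymity) and at least l
    distinct sensitive values (distinct l-diversity)."""
    distinct = Counter(sensitive_values)
    return len(sensitive_values) >= k and len(distinct) >= l


if __name__ == "__main__":
    # A gain gap of 0.15 after 1000 instances clears the bound; 0.05 does not.
    print(should_split(0.30, 0.15, value_range=1.0, delta=1e-7, n=1000))
    print(should_split(0.30, 0.25, value_range=1.0, delta=1e-7, n=1000))
    # Three instances with two distinct sensitive values satisfy k=3, l=2.
    print(leaf_is_publishable(["approve", "approve", "reject"], k=3, l=2))
```

In this sketch the split test and the privacy check are independent. How DAHOT actually interleaves tree growth with its anonymity and diversity constraints is described in the full paper, not in the abstract.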



Author information

Corresponding author

Correspondence to Radhika Kotecha.

Cite this article

Kotecha, R., Garg, S. Preserving output-privacy in data stream classification. Prog Artif Intell 6, 87–104 (2017). https://doi.org/10.1007/s13748-017-0114-8


  • DOI: https://doi.org/10.1007/s13748-017-0114-8
