Preserving output-privacy in data stream classification

Kotecha, Radhika; Garg, Sanjay

doi:10.1007/s13748-017-0114-8

Regular Paper
Published: 06 February 2017

Volume 6, pages 87–104, (2017)

Progress in Artificial Intelligence

Radhika Kotecha¹ & Sanjay Garg¹

Abstract

Privacy preservation has emerged as a major concern in the design of data mining systems. However, protecting the privacy of the data mining input does not guarantee a privacy-preserving output. This paper focuses on preserving the privacy of data mining output, in particular the output of the classification task. Further, instead of static datasets, we consider the classification of continuously arriving data streams, a rapidly growing research area. Data stream classification poses its own challenges, such as vast volume, a mixture of labeled and unlabeled instances throughout the stream, and the need for timely classifier publication, and these make enforcing privacy-preservation techniques even more difficult. To address this, we propose a systematic method for preserving output-privacy in data stream classification that serves applications such as loan approval, credit card fraud detection, and disease outbreak or biological attack detection. Specifically, we propose an algorithm named Diverse and k-Anonymized HOeffding Tree (DAHOT), which combines the popular data stream classification algorithm, the Hoeffding tree, with a variant of the k-anonymity and l-diversity principles. Empirical results on real and synthetic data streams verify the effectiveness of DAHOT compared with its underlying Hoeffding tree and two other techniques: one that learns sanitized decision trees from a sampled data stream, and another that uses ensemble-based classification. DAHOT guarantees preservation of private patterns while classifying the data streams accurately.
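The abstract names the building blocks of DAHOT but does not give the algorithm itself. As a rough illustration only, the sketch below shows the textbook forms of those building blocks: the Hoeffding bound used by Hoeffding trees to decide when a split is statistically safe, and a simple check that a group of records is k-anonymous and distinct l-diverse. The function names, the parameter values, and the idea of applying the check to the records at a leaf are assumptions made for this example; the paper uses "a variant" of these principles that is not described here.

```python
import math
from collections import Counter


def hoeffding_bound(value_range, delta, n):
    """Textbook Hoeffding bound: epsilon = sqrt(R^2 * ln(1/delta) / (2n)).

    value_range (R) is the range of the split metric, delta is the allowed
    probability of error, and n is the number of instances seen so far.
    """
    return math.sqrt(value_range ** 2 * math.log(1.0 / delta) / (2.0 * n))


def should_split(best_gain, second_gain, value_range, delta, n):
    """Split only when the best attribute beats the runner-up by more
    than the Hoeffding bound (standard Hoeffding tree criterion)."""
    return (best_gain - second_gain) > hoeffding_bound(value_range, delta, n)


def leaf_is_private(class_labels, k, l):
    """Hypothetical check (an assumption, not the paper's definition):
    the group holds at least k records (k-anonymity) and at least
    l distinct class values (distinct l-diversity)."""
    counts = Counter(class_labels)
    return len(class_labels) >= k and len(counts) >= l


if __name__ == "__main__":
    # With 1000 instances the gap 0.15 exceeds the bound, so a split is allowed.
    print(should_split(0.30, 0.15, value_range=1.0, delta=1e-7, n=1000))
    # Three records with two distinct classes satisfy k=3, l=2.
    print(leaf_is_private(["approve", "approve", "reject"], k=3, l=2))
```

One plausible way to combine the two checks would be to permit a split only when the Hoeffding criterion holds and every resulting group still passes the privacy check, but whether DAHOT works this way cannot be confirmed from the abstract.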


Author information

Authors and Affiliations

Department of Computer Science and Engineering, Nirma University, Ahmedabad, India
Radhika Kotecha & Sanjay Garg

Authors

Radhika Kotecha
Sanjay Garg

Corresponding author

Correspondence to Radhika Kotecha.

About this article

Cite this article

Kotecha, R., Garg, S. Preserving output-privacy in data stream classification. Prog Artif Intell 6, 87–104 (2017). https://doi.org/10.1007/s13748-017-0114-8

Received: 21 July 2016
Accepted: 13 January 2017
Published: 06 February 2017
Issue Date: June 2017
DOI: https://doi.org/10.1007/s13748-017-0114-8
