Abstract
Instance-based learning, and the k-nearest neighbors algorithm (k-NN) in particular, provide simple yet effective classification algorithms for data mining. Classifiers are often executed on sensitive information such as medical or personal data. Differential privacy has recently emerged as the accepted standard for privacy protection in sensitive data. However, straightforward applications of differential privacy to k-NN classification yield rather inaccurate results. Motivated by this, we develop algorithms to increase the accuracy of private instance-based classification. We first describe the radius neighbors classifier (r-N) and show that its accuracy under differential privacy can be greatly improved by a non-trivial sensitivity analysis. Then, for k-NN classification, we build algorithms that convert k-NN classifiers to r-N classifiers. We experimentally evaluate the accuracy of both classifiers using various datasets. Experiments show that our proposed classifiers significantly outperform both baseline private classifiers (i.e., straightforward applications of differential privacy) and classifiers executed on a dataset published using differential privacy. In addition, the accuracy of our proposed k-NN classifiers is at least comparable to, and in many cases better than, that of other differentially private machine learning techniques.
Notes
Note that Algorithm 2 satisfies \(\varepsilon '\)-DP even if there is only one training instance within the specified radius, since it always returns an answer probabilistically. We illustrate with the following example: Consider a binary classification task with labels a and b, and let there be exactly one training instance, with label a, within the radius of the specified test instance. Let \(\varepsilon = 1.0\). Then the score of a is \(e^{1 \times \frac{1}{2}} \approx 1.65\), and the score of b is \(e^{1 \times \frac{0}{2}} = 1\). Thus, Algorithm 2 returns a with probability \(\frac{1.65}{1+1.65} \approx 0.62\) and b with probability \(\frac{1}{1+1.65} \approx 0.38\). Hence, there is a significant probability of returning b even though the only training instance within the radius has label a.
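The arithmetic in this note can be checked with a short Python sketch. It is not Algorithm 2 itself: it only reproduces the exponential-mechanism style scoring used in the example, where each label's score is the exponential of the privacy parameter times the label's count within the radius, divided by two. The function name `label_probabilities` and the `sensitivity` parameter (set to 1, which is what the factor \(\frac{1}{2}\) in the exponent suggests) are assumptions made for illustration.

```python
import math


def label_probabilities(label_counts, epsilon, sensitivity=1.0):
    """Selection probabilities proportional to exp(epsilon * count / (2 * sensitivity)).

    label_counts maps each class label to the number of training
    instances with that label inside the radius (hypothetical input format).
    """
    scores = {
        label: math.exp(epsilon * count / (2.0 * sensitivity))
        for label, count in label_counts.items()
    }
    total = sum(scores.values())
    return {label: score / total for label, score in scores.items()}


# One training instance with label "a" inside the radius, none with label "b".
probs = label_probabilities({"a": 1, "b": 0}, epsilon=1.0)
print({label: round(p, 3) for label, p in probs.items()})
```

With these inputs, label a is returned roughly 62% of the time and label b roughly 38% of the time, matching the fractions in the note.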
Additional information
Responsible editors: Kurt Driessens, Dragi Kocev, Marko Robnik-Šikonja, Myra Spiliopoulou.
This research was funded by The Scientific and Technological Research Council of Turkey (TUBITAK) under Grant Number 114E261.
About this article
Cite this article
Gursoy, M.E., Inan, A., Nergiz, M.E. et al. Differentially private nearest neighbor classification. Data Min Knowl Disc 31, 1544–1575 (2017). https://doi.org/10.1007/s10618-017-0532-z
DOI: https://doi.org/10.1007/s10618-017-0532-z