
Differentially private nearest neighbor classification

Data Mining and Knowledge Discovery

Abstract

Instance-based learning, and the k-nearest neighbors algorithm (k-NN) in particular, provides simple yet effective classification algorithms for data mining. Classifiers are often executed on sensitive information such as medical or personal data. Differential privacy has recently emerged as the accepted standard for privacy protection in sensitive data. However, straightforward applications of differential privacy to k-NN classification yield rather inaccurate results. Motivated by this, we develop algorithms to increase the accuracy of private instance-based classification. We first describe the radius neighbors classifier (r-N) and show that its accuracy under differential privacy can be greatly improved by a non-trivial sensitivity analysis. Then, for k-NN classification, we build algorithms that convert k-NN classifiers to r-N classifiers. We experimentally evaluate the accuracy of both classifiers using various datasets. Experiments show that our proposed classifiers significantly outperform both baseline private classifiers (i.e., straightforward applications of differential privacy) and classifiers executed on a dataset published using differential privacy. In addition, the accuracy of our proposed k-NN classifiers is at least comparable to, and in many cases better than, that of other differentially private machine learning techniques.


Notes

  1. Note that Algorithm 2 satisfies \(\varepsilon '\)-DP even if there is only one training instance within the specified radius, since it always returns an answer probabilistically. We illustrate with the following example: Consider a binary classification task with labels a and b, and let there be 1 training instance within the radius of the specified test instance, with label a. Let \(\varepsilon = 1.0\). Then, the score of a is \(e^{1 \times \frac{1}{2}} \approx 1.65\), and the score of b is \(e^{1 \times \frac{0}{2}} = 1\). Thus, Algorithm 2 returns a with probability \(\frac{1.65}{1+1.65} \approx 0.62\) and b with probability \(\frac{1}{1+1.65} \approx 0.38\). Hence, there is a significant probability of returning b even though the only training instance within the radius has label a.

  2. http://archive.ics.uci.edu/ml/datasets.html.

  3. http://sci2s.ugr.es/keel/datasets.php.

  4. https://sourceforge.net/projects/privbayes/.
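
The arithmetic in Note 1 can be reproduced with a short sketch. It assumes that each label's score has the exponential-mechanism form \(e^{\varepsilon \times \mathrm{count} / (2 \times \mathrm{sensitivity})}\) with sensitivity 1, which is inferred from the numbers in the note rather than taken from the text of Algorithm 2. The function name `label_probabilities` and its parameters are hypothetical and chosen only for illustration.

```python
import math


def label_probabilities(label_counts, epsilon, sensitivity=1.0):
    """Selection probability per label, proportional to
    exp(epsilon * count / (2 * sensitivity)).

    Assumption: this score form matches the worked example in Note 1;
    it is not a transcription of Algorithm 2 itself.
    """
    scores = {
        label: math.exp(epsilon * count / (2.0 * sensitivity))
        for label, count in label_counts.items()
    }
    total = sum(scores.values())
    return {label: score / total for label, score in scores.items()}


# One training instance with label "a" inside the radius, none with label "b".
probs = label_probabilities({"a": 1, "b": 0}, epsilon=1.0)
print({label: round(p, 2) for label, p in probs.items()})
```

Under these assumptions the sketch gives roughly 0.62 for a and 0.38 for b, matching the note: the minority label is still returned a substantial fraction of the time.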


Author information

Corresponding author

Correspondence to Ali Inan.

Additional information

Responsible editors: Kurt Driessens, Dragi Kocev, Marko Robnik-Šikonja, Myra Spiliopoulou.

This research was funded by The Scientific and Technological Research Council of Turkey (TUBITAK) under Grant Number 114E261.

About this article

Cite this article

Gursoy, M.E., Inan, A., Nergiz, M.E. et al. Differentially private nearest neighbor classification. Data Min Knowl Disc 31, 1544–1575 (2017). https://doi.org/10.1007/s10618-017-0532-z

  • DOI: https://doi.org/10.1007/s10618-017-0532-z
