
Knowledge and Information Systems, Volume 54, Issue 1, pp 33–64

Is my model any good: differentially private regression diagnostics

  • Yan Chen
  • Andrés F. Barrientos
  • Ashwin Machanavajjhala
  • Jerome P. Reiter
Regular Paper

Abstract

Linear and logistic regression are popular statistical techniques for analyzing multivariate data. Typically, analysts do not simply posit a particular form of the regression model, estimate its parameters, and use the results for inference or prediction. Instead, they first use a variety of diagnostic techniques to assess how well the model fits the relationships in the data and how well it can be expected to predict outcomes for out-of-sample records, revising the model as necessary to improve fit and predictive power. In this article, we develop \(\epsilon\)-differentially private diagnostic tools for regression, beginning to fill a gap in privacy-preserving data analysis. Specifically, we create differentially private versions of residual plots for linear regression, and of receiver operating characteristic (ROC) curves and binned residual plots for logistic regression. The residual plot and the binned residual plot help determine whether the data satisfy the assumptions underlying the regression model, and the ROC curve is used to assess the predictive power of the logistic regression model. These diagnostics improve the usefulness of algorithms for computing differentially private regression output, which alone does not allow analysts to assess the quality of the posited model. Our empirical studies show that these algorithms can be effective tools for allowing users to evaluate the quality of their models.
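
To make the idea of a private residual plot concrete, the following is a minimal sketch and not the algorithm developed in the article. It applies the standard Laplace mechanism to a two-dimensional histogram of (fitted value, residual) pairs. Several things here are assumptions made for illustration: the function names, the grid size, the idea that `slope` and `intercept` were released separately and are treated as fixed inputs, that the two ranges were chosen without looking at the data, and that neighboring datasets differ by adding or removing one record (so each record affects exactly one cell count by 1).

```python
import random


def laplace_noise(scale, rng):
    """Draw one Laplace(0, scale) sample as the difference of two
    independent exponential draws with mean `scale`."""
    return rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)


def bin_index(value, low, high, n_bins):
    """Clip `value` to [low, high] and return its bin index in 0..n_bins-1."""
    clipped = min(max(value, low), high)
    idx = int((clipped - low) / (high - low) * n_bins)
    return min(idx, n_bins - 1)  # the upper edge belongs to the last bin


def noisy_residual_histogram(xs, ys, slope, intercept, epsilon,
                             fitted_range, resid_range, n_bins=4, seed=None):
    """Hypothetical helper: grid counts of (fitted, residual) pairs with
    Laplace(1/epsilon) noise added to every cell.

    Assumes slope, intercept, and both ranges do not depend on the private
    data, and that one record changes exactly one cell count by 1.
    """
    rng = random.Random(seed)
    counts = [[0] * n_bins for _ in range(n_bins)]
    for x, y in zip(xs, ys):
        fitted = intercept + slope * x
        resid = y - fitted
        i = bin_index(fitted, fitted_range[0], fitted_range[1], n_bins)
        j = bin_index(resid, resid_range[0], resid_range[1], n_bins)
        counts[i][j] += 1
    scale = 1.0 / epsilon  # L1 sensitivity of the count vector is taken as 1
    return [[c + laplace_noise(scale, rng) for c in row] for row in counts]
```

An analyst could shade each grid cell by its noisy count to get a coarse picture of how residuals spread across fitted values. Smaller values of `epsilon` give stronger privacy but noisier cells, so sparse regions of the plot become hard to interpret.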

Keywords

Differential privacy, Regression diagnostic, Residual plot, Binned residual plot, ROC curve

Notes

Acknowledgements

This work is supported in part by NSF Grants SES 1131897, ACI 1443014, CNS 1408982 and 1253327 and Alfred P. Sloan Foundation G-2-15-20166003.


Copyright information

© Springer-Verlag London Ltd. 2017

Authors and Affiliations

  1. Department of Computer Science, Duke University, Durham, USA
  2. Department of Statistical Science, Duke University, Durham, USA
