Is my model any good: differentially private regression diagnostics

Knowledge and Information Systems

Abstract

Linear and logistic regression are popular statistical techniques for analyzing multivariate data. Typically, analysts do not simply posit a particular form of the regression model, estimate its parameters, and use the results for inference or prediction. Instead, they first use a variety of diagnostic techniques to assess how well the model fits the relationships in the data and how well it can be expected to predict outcomes for out-of-sample records, revising the model as necessary to improve fit and predictive power. In this article, we develop \(\epsilon\)-differentially private diagnostic tools for regression, beginning to fill a gap in privacy-preserving data analysis. Specifically, we create differentially private versions of residual plots for linear regression, and of receiver operating characteristic (ROC) curves and binned residual plots for logistic regression. The residual plot and binned residual plot help determine whether the data satisfy the assumptions underlying the regression model, and the ROC curve is used to assess the predictive power of the logistic regression model. These diagnostics improve the usefulness of algorithms for computing differentially private regression output, which alone does not allow analysts to assess the quality of the posited model. Our empirical studies show that these algorithms can be effective tools for allowing users to evaluate the quality of their models.
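
To make the idea of a private binned residual plot concrete, the following is a minimal sketch and not the algorithm developed in the article. It assumes a generic Laplace mechanism over fixed, equal-width bins of the predicted probability, with the privacy budget split evenly between per-bin counts and per-bin residual sums. The function name `dp_binned_residuals` and its parameters are hypothetical, and the predicted probabilities `p_hat` are assumed to come from a model that was released separately (for example, by a differentially private regression algorithm with its own budget).

```python
import numpy as np

def dp_binned_residuals(p_hat, y, epsilon, n_bins=10, rng=None):
    """Noisy per-bin mean residuals for a logistic model (illustrative sketch)."""
    rng = np.random.default_rng() if rng is None else rng
    p_hat = np.asarray(p_hat, dtype=float)
    y = np.asarray(y, dtype=float)

    # Residuals lie in [-1, 1] because y is 0/1 and p_hat is in [0, 1].
    residuals = y - p_hat

    # Fixed, data-independent bin edges over [0, 1].
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    bin_idx = np.clip(np.digitize(p_hat, edges[1:-1]), 0, n_bins - 1)

    counts = np.bincount(bin_idx, minlength=n_bins).astype(float)
    sums = np.bincount(bin_idx, weights=residuals, minlength=n_bins)

    # Adding or removing one record changes one count by 1 and one sum by
    # at most 1, so each vector has L1 sensitivity 1. Spending epsilon / 2
    # on each release gives Laplace scale 1 / (epsilon / 2) = 2 / epsilon.
    scale = 2.0 / epsilon
    noisy_counts = counts + rng.laplace(0.0, scale, size=n_bins)
    noisy_sums = sums + rng.laplace(0.0, scale, size=n_bins)

    # Post-processing: guard against tiny or negative noisy counts and
    # clip the means back into the valid residual range.
    noisy_means = noisy_sums / np.maximum(noisy_counts, 1.0)
    centers = (edges[:-1] + edges[1:]) / 2.0
    return centers, np.clip(noisy_means, -1.0, 1.0)
```

Plotting the returned bin centers against the noisy means gives a rough private analogue of a binned residual plot. With a small \(\epsilon\) or sparsely populated bins the noise can dominate, so the number of bins is a tuning choice in this sketch.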

Notes

  1. Moreover, the quality of these algorithms also depends on how tight the assumed bounds are.

References

  1. Almeida T, Hidalgo JMG, Silva TP (2013) Towards SMS spam filtering: results under a new dataset. Int J Inf Secur Sci 2(1):1–18

  2. Boyd K, Lantz E, Page D (2015) Differential privacy for classifier evaluation. In: ACM workshop on artificial intelligence and security, pp 15–23

  3. Bun M, Nissim K, Stemmer U, Vadhan S (2015) Differentially private release and learning of threshold functions. In: Proceedings of the 2015 IEEE 56th annual symposium on foundations of computer science (FOCS ’15). IEEE Computer Society, Washington, DC, USA, pp 634–649. https://doi.org/10.1109/FOCS.2015.45

  4. Chaudhuri K, Monteleoni C, Sarwate AD (2011) Differentially private empirical risk minimization. J Mach Learn Res (JMLR) 12:1069–1109

  5. Chaudhuri K, Vinterbo SA (2013) A stability-based validation procedure for differentially private machine learning. In: Advances in neural information processing systems (NIPS), pp 2652–2660

  6. Chen Y, Machanavajjhala A, Reiter JP, Barrientos AF (2016) Differentially private regression diagnostics. In: 2016 IEEE 16th international conference on data mining (ICDM), Barcelona, pp 81–90. https://doi.org/10.1109/ICDM.2016.0019

  7. Dwork C, Roth A (2014) The algorithmic foundations of differential privacy. Found Trends Theor Comput Sci 9:211–407

  8. Go A, Bhayani R, Huang L (2009) Twitter sentiment classification using distant supervision. In: CS224N project report, Stanford, pp 1–12

  9. Hardt M, Ligett K, McSherry F (2012) A simple and practical algorithm for differentially private data release. Adv Neural Inf Process Syst 25:2339–2347

  10. Hay M, Machanavajjhala A, Miklau G, Chen Y, Zhang D (2016) Principled evaluation of differentially private algorithms using DPBench. In: Proceedings of the 2016 international conference on management of data (SIGMOD), pp 139–154

  11. Hay M, Rastogi V, Miklau G, Suciu D (2010) Boosting the accuracy of differentially private histograms through consistency. In: Proceedings of the VLDB endowment, pp 1021–1032

  12. Kairouz P, Oh S, Viswanath P (2015) The composition theorem for differential privacy. In: Proceedings of the 32nd international conference on machine learning (ICML), pp 1376–1385

  13. Kotsogiannis I, Machanavajjhala A, Hay M, Miklau G (2017) Pythia: data dependent differentially private algorithm selection. In: Proceedings of the 2017 international conference on management of data (SIGMOD), pp 1323–1337

  14. Li C, Hay M, Miklau G, Wang Y (2014) A data- and workload-aware query answering algorithm for range queries under differential privacy. In: Proceedings of the VLDB endowment, pp 341–352

  15. Li N, Yang W, Qardaji W (2013) Differentially private grids for geospatial data. In: 2013 IEEE 29th international conference on data engineering (ICDE), pp 757–768

  16. Machanavajjhala A, Kifer D, Abowd JM, Gehrke J, Vilhuber L (2008) Privacy: theory meets practice on the map. In: Carey MJ, Schneider DA (eds) 2008 IEEE 24th international conference on data engineering, pp 277–286

  17. Matthews G, Harel O (2013) An examination of data confidentiality and disclosure issues related to publication of empirical ROC curves. Acad Radiol 20:889–896

  18. Nissim K, Raskhodnikova S, Smith A (2007) Smooth sensitivity and sampling in private data analysis. In: Proceedings of the thirty-ninth annual ACM symposium on theory of computing (STOC ’07). ACM, New York, NY, USA, pp 75–84. https://doi.org/10.1145/1250790.1250803

  19. O’Keefe CM, Good NM (2009) Regression output from a remote analysis server. Data Knowl Eng 68:1175–1186

  20. Qardaji W, Yang W, Li N (2013) Understanding hierarchical methods for differentially private histograms. In: Proceedings of the VLDB endowment, pp 1954–1965

  21. Reiter JP (2003) Model diagnostics for remote access servers. Stat Comput 13:371–380

  22. Reiter JP (2005) Releasing multiply-imputed, synthetic public use microdata: an illustration and empirical study. J R Stat Soc Ser A 168:185–205

  23. Reiter JP, Kohnen CN (2005) Categorical data regression diagnostics for remote servers. J Stat Comput Simul 75:889–903

  24. Reiter JP, Oganian A, Karr AF (2009) Verification servers: enabling analysts to assess the quality of inferences from public use data. Comput Stat Data Anal 53:1475–1482

  25. Sheffet O (2015) Differentially private least squares: estimation, confidence and rejecting the null hypothesis. CoRR. arXiv:1507.02482

  26. Smith A (2011) Privacy-preserving statistical estimation with optimal convergence rates. In: STOC, pp 813–822

  27. Stoddard B, Chen Y, Machanavajjhala A (2014) Differentially private algorithms for empirical machine learning. CoRR. arXiv:1411.5428

  28. Wu X, Fredrikson M, Wu W, Jha S, Naughton JF (2015) Revisiting differentially private regression: lessons from learning theory and their consequences. CoRR. arXiv:1512.06388

  29. Zhang J, Xiao X, Yang Y, Zhang Z, Winslett M (2013) PrivGene: differentially private model fitting using genetic algorithms. In: Proceedings of the 2013 international conference on management of data (SIGMOD), pp 665–676

  30. Zhang J, Zhang Z, Xiao X, Yang Y, Winslett M (2012) Functional mechanism: regression analysis under differential privacy. In: Proceedings of the VLDB endowment, pp 1364–1375

  31. Zhang X, Chen R, Xu J, Meng X, Xie Y (2014) Towards accurate histogram publication under differential privacy. In: Proceedings of the 2014 SIAM international conference on data mining, pp 587–595

Acknowledgements

This work is supported in part by NSF Grants SES 1131897, ACI 1443014, CNS 1408982 and 1253327 and Alfred P. Sloan Foundation G-2-15-20166003.

Author information

Corresponding author

Correspondence to Yan Chen.

Additional information

©2016 IEEE. Reprinted, with permission, from Chen et al. [6].

About this article

Cite this article

Chen, Y., Barrientos, A.F., Machanavajjhala, A. et al. Is my model any good: differentially private regression diagnostics. Knowl Inf Syst 54, 33–64 (2018). https://doi.org/10.1007/s10115-017-1128-z

  • DOI: https://doi.org/10.1007/s10115-017-1128-z
