Abstract
Linear and logistic regression are popular statistical techniques for analyzing multivariate data. Typically, analysts do not simply posit a particular form of the regression model, estimate its parameters, and use the results for inference or prediction. Instead, they first use a variety of diagnostic techniques to assess how well the model fits the relationships in the data and how well it can be expected to predict outcomes for out-of-sample records, revising the model as necessary to improve fit and predictive power. In this article, we develop \(\epsilon\)-differentially private diagnostic tools for regression, beginning to fill a gap in privacy-preserving data analysis. Specifically, we create differentially private versions of residual plots for linear regression, and of receiver operating characteristic (ROC) curves and binned residual plots for logistic regression. The residual plot and binned residual plot help determine whether the data satisfy the assumptions underlying the regression model, and the ROC curve is used to assess the predictive power of the logistic regression model. These diagnostics improve the usefulness of algorithms for computing differentially private regression output, which alone does not allow analysts to assess the quality of the posited model. Our empirical studies show that these algorithms can be effective tools for allowing users to evaluate the quality of their models.
Notes
Moreover, the quality of these algorithms also depends on how tight the assumed bounds are.
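The dependence on the assumed bounds can be illustrated with a minimal sketch. It is not the article's algorithm: it assumes the standard Laplace mechanism applied to a mean of values clamped to [lower, upper], a publicly known record count n, and neighboring datasets that differ by replacing one record. The function names and the synthetic residuals are hypothetical.

```python
import random


def laplace_noise(scale, rng):
    """Draw one Laplace(0, scale) sample as a difference of two exponentials."""
    return scale * (rng.expovariate(1.0) - rng.expovariate(1.0))


def laplace_scale_for_mean(lower, upper, n, epsilon):
    """Laplace scale for an epsilon-DP mean of n values clamped to [lower, upper].

    Assumption: replacing one record moves the clamped mean by at most
    (upper - lower) / n, so that quantity is the sensitivity.
    """
    sensitivity = (upper - lower) / n
    return sensitivity / epsilon


def private_mean(values, lower, upper, epsilon, rng):
    """Clamp values to the assumed bounds, average them, and add Laplace noise."""
    clamped = [min(max(v, lower), upper) for v in values]
    true_mean = sum(clamped) / len(clamped)
    scale = laplace_scale_for_mean(lower, upper, len(clamped), epsilon)
    return true_mean + laplace_noise(scale, rng)


rng = random.Random(0)
residuals = [rng.gauss(0.0, 1.0) for _ in range(500)]  # synthetic stand-in data

# Same data and privacy budget; only the assumed bounds differ.
tight_scale = laplace_scale_for_mean(-4.0, 4.0, len(residuals), epsilon=1.0)
loose_scale = laplace_scale_for_mean(-40.0, 40.0, len(residuals), epsilon=1.0)
noisy_mean = private_mean(residuals, -4.0, 4.0, 1.0, rng)
```

Under these assumptions the noise scale grows linearly with the width of the assumed interval, so bounds that are ten times looser give ten times the noise for the same \(\epsilon\).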
References
Almeida T, Hidalgo JMG, Silva TP (2013) Towards SMS spam filtering: results under a new dataset. Int J Inf Secur Sci 2(1):1–18
Boyd K, Lantz E, Page D (2015) Differential privacy for classifier evaluation. In: ACM workshop on artificial intelligence and security, pp 15–23
Bun M, Nissim K, Stemmer U, Vadhan S (2015) Differentially private release and learning of threshold functions. In: Proceedings of the 2015 IEEE 56th annual symposium on foundations of computer science (FOCS ’15). IEEE Computer Society, Washington, DC, USA, pp 634–649. https://doi.org/10.1109/FOCS.2015.45
Chaudhuri K, Monteleoni C, Sarwate AD (2011) Differentially private empirical risk minimization. J Mach Learn Res 12:1069–1109
Chaudhuri K, Vinterbo SA (2013) A stability-based validation procedure for differentially private machine learning. In: Advances in neural information processing systems (NIPS), pp 2552–2660
Chen Y, Machanavajjhala A, Reiter JP, Barrientos AF (2016) Differentially private regression diagnostics. In: 2016 IEEE 16th international conference on data mining (ICDM), Barcelona, pp 81–90. https://doi.org/10.1109/ICDM.2016.0019
Dwork C, Roth A (2014) The algorithmic foundations of differential privacy. Found Trends Theor Comput Sci 9:211–407
Go A, Bhayani R, Huang L (2009) Twitter sentiment classification using distant supervision. In: CS224N project report, Stanford, pp 1–12
Hardt M, Ligett K, McSherry F (2012) A simple and practical algorithm for differentially private data release. Adv Neural Inf Process Syst 25:2339–2347
Hay M, Machanavajjhala A, Miklau G, Chen Y, Zhang D (2016) Principled evaluation of differentially private algorithms using DPBench. In: Proceedings of the 2016 international conference on management of data (SIGMOD), pp 139–154
Hay M, Rastogi V, Miklau G, Suciu D (2010) Boosting the accuracy of differentially private histograms through consistency. In: Proceedings of the VLDB endowment, pp 1021–1032
Kairouz P, Oh S, Viswanath P (2015) The composition theorem for differential privacy. In: Proceedings of the 32nd international conference on machine learning (ICML), pp 1376–1385
Kotsogiannis I, Machanavajjhala A, Hay M, Miklau G (2017) Pythia: data dependent differentially private algorithm selection. In: Proceedings of the 2017 international conference on management of data (SIGMOD), pp 1323–1337
Li C, Hay M, Miklau G, Wang Y (2014) A data- and workload-aware query answering algorithm for range queries under differential privacy. In: Proceedings of the VLDB endowment, pp 341–352
Li N, Yang W, Qardaji W (2013) Differentially private grids for geospatial data. In: 2013 IEEE 29th international conference on data engineering (ICDE), pp 757–768
Machanavajjhala A, Kifer D, Abowd JM, Gehrke J, Vilhuber L (2008) Privacy: theory meets practice on the map. In: Carey MJ, Schneider DA (eds) 2008 IEEE 24th international conference on data engineering, pp 277–286
Matthews G, Harel O (2013) An examination of data confidentiality and disclosure issues related to publication of empirical ROC curves. Acad Radiol 20:889–896
Nissim K, Raskhodnikova S, Smith A (2007) Smooth sensitivity and sampling in private data analysis. In: Proceedings of the thirty-ninth annual ACM symposium on theory of computing (STOC ’07). ACM, New York, NY, USA, pp 75–84. https://doi.org/10.1145/1250790.1250803
O’Keefe CM, Good NM (2009) Regression output from a remote analysis server. Data Knowl Eng 68:1175–1186
Qardaji W, Yang W, Li N (2013) Understanding hierarchical methods for differentially private histograms. In: Proceedings of the VLDB endowment, pp 1954–1965
Reiter JP (2003) Model diagnostics for remote access servers. Stat Comput 13:371–380
Reiter JP (2005) Releasing multiply-imputed, synthetic public use microdata: an illustration and empirical study. J R Stat Soc Ser A 168:185–205
Reiter JP, Kohnen CN (2005) Categorical data regression diagnostics for remote servers. J Stat Comput Simul 75:889–903
Reiter JP, Oganian A, Karr AF (2009) Verification servers: enabling analysts to assess the quality of inferences from public use data. Comput Stat Data Anal 53:1475–1482
Sheffet O (2015) Differentially private least squares: estimation, confidence and rejecting the null hypothesis. CoRR. arXiv:abs/1507.02482
Smith A (2011) Privacy-preserving statistical estimation with optimal convergence rates. In: STOC, pp 813–822
Stoddard B, Chen Y, Machanavajjhala A (2014) Differentially private algorithms for empirical machine learning. CoRR. arXiv:abs/1411.5428
Wu X, Fredrikson M, Wu W, Jha S, Naughton JF (2015) Revisiting differentially private regression: lessons from learning theory and their consequences. CoRR. arXiv:abs/1512.06388
Zhang J, Xiao X, Yang Y, Zhang Z, Winslett M (2013) PrivGene: differentially private model fitting using genetic algorithms. In: Proceedings of the 2013 international conference on management of data (SIGMOD), pp 665–676
Zhang J, Zhang Z, Xiao X, Yang Y, Winslett M (2012) Functional mechanism: regression analysis under differential privacy. In: Proceedings of the VLDB endowment, pp 1364–1375
Zhang X, Chen R, Xu J, Meng X, Xie Y (2014) Towards accurate histogram publication under differential privacy. In: Proceedings of the 2014 SIAM international conference on data mining, pp 587–595
Acknowledgements
This work is supported in part by NSF Grants SES 1131897, ACI 1443014, CNS 1408982 and 1253327 and Alfred P. Sloan Foundation G-2-15-20166003.
Additional information
©2016 IEEE. Reprinted, with permission, from Chen et al. [6].
Cite this article
Chen, Y., Barrientos, A.F., Machanavajjhala, A. et al. Is my model any good: differentially private regression diagnostics. Knowl Inf Syst 54, 33–64 (2018). https://doi.org/10.1007/s10115-017-1128-z
DOI: https://doi.org/10.1007/s10115-017-1128-z