Abstract
In this paper, we consider conditional selective inference (SI) for a linear model estimated after outliers are removed from the data. To apply the conditional SI framework, it is necessary to characterize the events describing how the robust method identifies outliers. Unfortunately, existing conditional SI methods cannot be directly applied to our problem because they only handle selection events that can be represented by linear or quadratic constraints. We propose a conditional SI method for popular robust regressions, such as least-absolute-deviation regression and Huber regression, by introducing a new computational method based on a convex optimization technique called the homotopy method. We show that the proposed conditional SI method is applicable to a wide class of robust regression and outlier detection methods, and that it performs well empirically on both synthetic and real data.
Notes
Since numerical results on the Huberized Lasso were not presented in Chen and Bien (2020), we implemented it ourselves (see Appendix C in the supplementary material).
References
Allgower, E. L., Georg, K. (1993). Continuation and path following. Acta Numerica, 2, 1–63.
Andrews, D. F. (1974). A robust method for multiple linear regression. Technometrics, 16(4), 523–531.
Atkinson, A. (1986). [Influential observations, high leverage points, and outliers in linear regression]: Comment: Aspects of diagnostic regression analysis. Statistical Science, 1(3), 397–402.
Bach, F. R., Heckerman, D., Horvitz, E. (2006). Considering cost asymmetry in learning classifiers. Journal of Machine Learning Research, 7, 1713–1741.
Barrodale, I., Roberts, F. D. (1973). An improved algorithm for discrete l_1 linear approximation. SIAM Journal on Numerical Analysis, 10(5), 839–848.
Benjamini, Y., Yekutieli, D. (2005). False discovery rate-adjusted multiple confidence intervals for selected parameters. Journal of the American Statistical Association, 100(469), 71–81.
Benjamini, Y., Heller, R., Yekutieli, D. (2009). Selective inference in complex research. Philosophical Transactions of the Royal Society A: Mathematical, Physical and Engineering Sciences, 367(1906), 4255–4271.
Berk, R., Brown, L., Buja, A., Zhang, K., Zhao, L., et al. (2013). Valid post-selection inference. The Annals of Statistics, 41(2), 802–837.
Best, M. J. (1996). An algorithm for the solution of the parametric quadratic programming problem. Applied mathematics and parallel computing, 57-76.
Bickel, P. J. (1973). On some analogues to linear combinations of order statistics in the linear model. The Annals of Statistics, 597–616.
Brownlee, K. A. (1965). Statistical theory and methodology in science and engineering, Vol. 150. New York: Wiley.
Chen, S., Bien, J. (2019). Valid inference corrected for outlier removal. Journal of Computational and Graphical Statistics, 1–12.
Chen, S., Bien, J. (2020). Valid inference corrected for outlier removal. Journal of Computational and Graphical Statistics, 29(2), 323–334.
Choi, Y., Taylor, J., Tibshirani, R., et al. (2017). Selecting the number of principal components: Estimation of the true rank of a noisy matrix. The Annals of Statistics, 45(6), 2590–2617.
Cleveland, W. S. (1979). Robust locally weighted regression and smoothing scatterplots. Journal of the American Statistical Association, 74(368), 829–836.
Cook, R. D. (1977). Detection of influential observation in linear regression. Technometrics, 19(1), 15–18.
Das, D., Duy, V. N. L., Hanada, H., Tsuda, K., Takeuchi, I. (2022). Fast and more powerful selective inference for sparse high-order interaction model. Proceedings of AAAI conference on artificial intelligence.
Duy, V. N. L., Takeuchi, I. (2021). Parametric programming approach for more powerful and general lasso selective inference. In International conference on artificial intelligence and statistics, 901–909.
Duy, V. N. L., Iwazaki, S., Takeuchi, I. (2020a). Quantifying statistical significance of neural network representation-driven hypotheses by selective inference. arXiv preprint arXiv:2010.01823.
Duy, V. N. L., Toda, H., Sugiyama, R., Takeuchi, I. (2020b). Computing valid p-value for optimal changepoint by selective inference using dynamic programming. Advances in Neural Information Processing Systems, 33, 11356–11367.
Efron, B., Hastie, T., Johnstone, I., Tibshirani, R. (2004). Least angle regression. The Annals of Statistics, 32(2), 407–499.
Ellenberg, J. H. (1973). The joint distribution of the standardized least squares residuals from a general linear regression. Journal of the American Statistical Association, 68(344), 941–943.
Ellenberg, J. H. (1976). Testing for a single outlier from a general linear regression. Biometrics, 32, 637–645.
Fithian, W., Taylor, J., Tibshirani, R., Tibshirani, R. (2015). Selective sequential model selection. arXiv preprint arXiv:1512.02565.
Gal, T. (1995). Postoptimal analysis, parametric programming, and related topics. Berlin: Walter de Gruyter.
Giesen, J., Jaggi, M., Laue, S. (2012a). Approximating parameterized convex optimization problems. ACM Transactions on Algorithms (TALG), 9(1), 1–17.
Giesen, J., Müller, J., Laue, S., Swiercy, S. (2012b). Approximating concavely parameterized optimization problems. Advances in Neural Information Processing Systems, 25, 2105–2113.
Harvey, A. C. (1977). A comparison of preliminary estimators for robust regression. Journal of the American Statistical Association, 72(360a), 910–913.
Hastie, T., Rosset, S., Tibshirani, R., Zhu, J. (2004a). The entire regularization path for the support vector machine. Journal of Machine Learning Research, 5, 1391–1415.
Hastie, T., Rosset, S., Tibshirani, R. (2004b). The entire regularization path for the support vector machine. Journal of Machine Learning Research, 5, 1391–1415.
Hill, R. W., and Holland, P. W. (1977). Two robust alternatives to least-squares regression. Journal of the American Statistical Association, 72(360a), 828–833.
Hocking, T., Vert, J.-P., Bach, F., Joulin, A. (2011). Clusterpath: An algorithm for clustering using convex fusion penalties. Proceedings of the 28th international conference on machine learning, 745–752.
Hoeting, J., Raftery, A. E., Madigan, D. (1996). A method for simultaneous variable selection and outlier identification in linear regression. Computational Statistics & Data Analysis, 22(3), 251–270.
Huber, P. J. (2004). Robust statistics, Vol. 523. Hoboken, New Jersey: John Wiley & Sons.
Huber, P. J., et al. (1973). Robust regression: Asymptotics, conjectures and Monte Carlo. The Annals of Statistics, 1(5), 799–821.
Hyun, S., Lin, K., G’Sell, M., Tibshirani, R. J. (2018). Post-selection inference for changepoint detection algorithms with application to copy number variation data. arXiv preprint arXiv:1812.03644.
Joshi, P. C. (1972). Some slippage tests of mean for a single outlier in linear regression. Biometrika, 59(1), 109–120.
Karasuyama, M., Takeuchi, I. (2010). Nonlinear regularization path for quadratic loss support vector machines. IEEE Transactions on Neural Networks, 22(10), 1613–1625.
Karasuyama, M., Harada, N., Sugiyama, M., and Takeuchi, I. (2012). Multi-parametric solution-path algorithm for instance-weighted support vector machines. Machine Learning, 88(3), 297–330.
Koenker, R. (2005). Quantile regression. Cambridge: Cambridge University Press.
Koenker, R., Bassett, G., Jr. (1978). Regression quantiles. Econometrica, 46(1), 33–50.
Lee, G., Scott, C. (2007). The one class support vector machine solution path. Proceedings of international conference on acoustics speech and signal processing 2007, II521–II524.
Lee, J. D., Sun, D. L., Sun, Y., Taylor, J. E., et al. (2016). Exact post-selection inference, with application to the lasso. The Annals of Statistics, 44(3), 907–927.
Leeb, H., Pötscher, B. M. (2005). Model selection and inference: Facts and fiction. Econometric Theory, 21(1), 21–59.
Leeb, H., Pötscher, B. M., et al. (2006). Can one estimate the conditional distribution of post-model-selection estimators? The Annals of Statistics, 34(5), 2554–2591.
Lockhart, R., Taylor, J., Tibshirani, R. J., Tibshirani, R. (2014). A significance test for the lasso. Annals of Statistics, 42(2), 413.
Loftus, J. R., Taylor, J. E. (2014). A significance test for forward stepwise model selection. arXiv preprint arXiv:1405.3920.
Loftus, J. R., Taylor, J. E. (2015). Selective inference in regression models with groups of variables. arXiv preprint arXiv:1511.01478.
Maronna, R. A., Martin, R. D., Yohai, V. J., et al. (2019). Robust statistics: Theory and methods (with R). Hoboken, New Jersey: John Wiley & Sons.
Murty, K. G. (1983). Linear programming. Berlin: Springer.
Ndiaye, E., Takeuchi, I. (2019). Computing full conformal prediction set with approximate homotopy. Advances in Neural Information Processing Systems, 32, 1386–1395.
Ndiaye, E., Le, T., Fercoq, O., Salmon, J., Takeuchi, I. (2019). Safe grid search with optimal complexity. In International conference on machine learning, 4771–4780.
Ogawa, K., Imamura, M., Takeuchi, I., Sugiyama, M. (2013). Infinitesimal annealing for training semi-supervised support vector machines. International conference on machine learning, 897–905.
Osborne, M. R., Presnell, B., Turlach, B. A. (2000). A new approach to variable selection in least squares problems. IMA Journal of Numerical Analysis, 20(3), 389–403.
Pan, J.-X., Fang, K.-T. (1995). Multiple outlier detection in growth curve model with unstructured covariance matrix. Annals of the Institute of Statistical Mathematics, 47(1), 137–153.
Panigrahi, S., Taylor, J., Weinstein, A. (2016). Bayesian post-selection inference in the linear model. arXiv preprint arXiv:1605.08824, 28.
Pötscher, B. M., Schneider, U., et al. (2010). Confidence sets based on penalized maximum likelihood estimators in Gaussian regression. Electronic Journal of Statistics, 4, 334–360.
Ritter, K. (1984). On parametric linear and quadratic programming problems. Mathematical programming: Proceedings of the international congress on mathematical programming, 307–335.
Rosset, S., Zhu, J. (2007). Piecewise linear regularized solution paths. Annals of Statistics, 35, 1012–1030.
Rousseeuw, P. J., Leroy, A. M. (2005). Robust regression and outlier detection, Vol. 589. Hoboken, New Jersey: John Wiley & Sons.
She, Y., Owen, A. B. (2011). Outlier detection using nonconvex penalized regression. Journal of the American Statistical Association, 106(494), 626–639.
Shibagaki, A., Suzuki, Y., Karasuyama, M., Takeuchi, I. (2015). Regularization path of cross-validation error lower bounds. Advances in Neural Information Processing Systems, 28, 1675–1683.
Shimodaira, H., Terada, Y. (2019). Selective inference for testing trees and edges in phylogenetics. Frontiers in Ecology and Evolution, 7, 174.
Srikantan, K. (1961). Testing for the single outlier in a regression model. Sankhyā: The Indian Journal of Statistics, Series A, 251–260.
Srivastava, M. S., von Rosen, D. (1998). Outliers in multivariate regression models. Journal of Multivariate Analysis, 65(2), 195–208.
Sugiyama, K., Duy, V. N. L., Takeuchi, I. (2020). More powerful and general selective inference for stepwise feature selection using the homotopy continuation approach. arXiv preprint arXiv:2012.13545.
Sugiyama, R., Toda, H., Duy, V. N. L., Inatsu, Y., Takeuchi, I. (2021). Valid and exact statistical inference for multi-dimensional multiple change-points by selective inference. arXiv preprint arXiv:2110.08989.
Suzumura, S., Nakagawa, K., Umezu, Y., Tsuda, K., Takeuchi, I. (2017). Selective inference for sparse high-order interaction models. Proceedings of the 34th international conference on machine learning, Vol. 70, 3338–3347.
Takeuchi, I., Sugiyama, M. (2011). Target neighbor consistent feature weighting for nearest neighbor classification. Advances in Neural Information Processing Systems, 24, 576–584.
Takeuchi, I., Nomura, K., Kanamori, T. (2009). Nonparametric conditional density estimation using piecewise-linear solution path of kernel quantile regression. Neural Computation, 21(2), 539–559.
Takeuchi, I., Hongo, T., Sugiyama, M., Nakajima, S. (2013). Parametric task learning. Advances in Neural Information Processing Systems, 26, 1358–1366.
Tanizaki, K., Hashimoto, N., Inatsu, Y., Hontani, H., Takeuchi, I. (2020). Computing valid p-values for image segmentation by selective inference. Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, 9553–9562.
Taylor, J., Lockhart, R., Tibshirani, R. J., Tibshirani, R. (2014). Post-selection adaptive inference for least angle regression and the lasso. arXiv preprint arXiv:1401.3889, 354.
Terada, Y., Shimodaira, H. (2017). Selective inference for the problem of regions via multiscale bootstrap. arXiv preprint arXiv:1711.00949.
Tian, X., Taylor, J., et al. (2018). Selective inference with a randomized response. The Annals of Statistics, 46(2), 679–710.
Tibshirani, R. J., Taylor, J., Lockhart, R., Tibshirani, R. (2016). Exact post-selection inference for sequential regression procedures. Journal of the American Statistical Association, 111(514), 600–620.
Tsuda, K. (2007). Entire regularization paths for graph data. Proceedings of international conference on machine learning, 2007, 919–925.
Welsch, R. E., Kuh, E. (1977). Linear regression diagnostics. Technical Report 173, National Bureau of Economic Research, Cambridge, Massachusetts.
Yamada, M., Umezu, Y., Fukumizu, K., Takeuchi, I. (2018a). Post selection inference with kernels. In International conference on artificial intelligence and statistics, 152–160.
Yamada, M., Wu, D., Tsai, Y.-H. H., Takeuchi, I., Salakhutdinov, R., Fukumizu, K. (2018b). Post selection inference with incomplete maximum mean discrepancy estimator. arXiv preprint arXiv:1802.06226.
Yang, F., Barber, R. F., Jain, P., Lafferty, J. (2016). Selective inference for group-sparse linear models. Advances in Neural Information Processing Systems, 29, 2469–2477.
Zaman, A., Rousseeuw, P. J., Orhan, M. (2001). Econometric applications of high-breakdown robust regression techniques. Economics Letters, 71(1), 1–8.
Acknowledgements
This work was partially supported by MEXT KAKENHI (20H00601, 16H06538), JST CREST (JPMJCR21D3), JST Moonshot R&D (JPMJMS2033-05), JST AIP Acceleration Research (JPMJCR21U2), NEDO (JPNP18002, JPNP20006), and the RIKEN Center for Advanced Intelligence Project. We thank the two anonymous reviewers for their constructive comments, which helped us improve the paper.
Appendix A: Proofs
Proof of Lemma 1
Let \(\rho _i(z) = y_i(z)-\varvec{x}_i^\top \varvec{\beta }\) be a parametric residual for each i. Then, (21) can be written as
Furthermore, for each i, let \(\rho _i^+(z)\) and \(\rho _i^-(z)\) be non-negative numbers with \(\rho _i(z) =\rho _i^+(z)-\rho _i^-(z)\). Since at least one of \(\rho _i^+(z)\) and \(\rho _i^-(z)\) can be taken to be zero, the following equation holds:
Therefore, \(\hat{\varvec{\beta }}^{R}(z)\) in the parametric approach can be obtained by solving the following parametric linear programming problem:
Next, we show that (26) can be expressed as (22). The \(\varvec{\beta }(z)\) in (26) can be re-written as follows:
Thus, since \({\varvec{Y}} = {\varvec{a}} + {\varvec{b}} z\), by letting
we have (22). Finally, we show piecewise linearity. It is known that the optimal value of a parametric linear programming problem of the form (22) is a (convex) piecewise-linear function of the parameter z (see, e.g., Section 8.6 in Murty (1983)). Hence, the solution path \(\hat{\varvec{\beta }}^{R}(z)\) can be represented as a piecewise-linear function of z. \(\square \)
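As a purely illustrative check of this LP reformulation (not the paper's homotopy implementation), the LAD fit at any fixed z can be computed with an off-the-shelf LP solver. The data, the grid of z values, and the `lad_fit` helper below are our own assumptions; we use `scipy.optimize.linprog` for the solve.

```python
import numpy as np
from scipy.optimize import linprog

rng = np.random.default_rng(0)
n, p = 12, 2
X = rng.normal(size=(n, p))
a = rng.normal(size=n)   # y(z) = a + b z: offset of the parametrized response
b = rng.normal(size=n)   # direction of the parametrization

def lad_fit(y):
    # LP reformulation of LAD regression:
    #   min 1'(rho+ + rho-)  s.t.  X beta + rho+ - rho- = y,  rho+, rho- >= 0
    c = np.concatenate([np.zeros(p), np.ones(2 * n)])
    A_eq = np.hstack([X, np.eye(n), -np.eye(n)])
    bounds = [(None, None)] * p + [(0, None)] * (2 * n)
    res = linprog(c, A_eq=A_eq, b_eq=y, bounds=bounds, method="highs")
    return res.x[:p]

# Evaluating the fit on a grid of z values gives points on the
# piecewise-linear solution path of Lemma 1.
zs = np.linspace(-2.0, 2.0, 9)
path = np.array([lad_fit(a + b * z) for z in zs])
```

The homotopy method of the paper instead tracks the path exactly between breakpoints rather than re-solving the LP on a grid.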
Proof of Lemma 2
Let \(\varvec{e}(z) = \varvec{y}(z)-X^\top \varvec{\beta }\) be the parametric residual vector. Suppose that \(\varvec{u}(z)\) and \(\varvec{v}(z)\) are vectors whose i-th elements \(u(z)_i\) and \(v(z)_i\) are respectively given by
where \(e(z)_i\) is the i-th element of \(\varvec{e}(z)\). Then, the following inequality holds:
In addition, the objective function can be expressed as follows:
where \(\varvec{1}\) is the all-ones vector. Thus, \(\hat{\varvec{\beta }}^{R}(z)\) in the parametric approach can be obtained by solving the following parametric quadratic programming problem:
Therefore, noting that \({\varvec{y}} (z) = {\varvec{a}} + {\varvec{b}} z\), letting
it can be shown that (27) is the same as (23). \(\square \)
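For intuition about the Huber fit itself, it can also be computed by iteratively reweighted least squares (IRLS), a standard alternative to the QP reformulation above. The toy data, the injected outlier, and the `huber_fit` helper are illustrative assumptions, not the paper's algorithm.

```python
import numpy as np

rng = np.random.default_rng(1)
n, p, delta = 40, 2, 1.0
X = rng.normal(size=(n, p))
beta_true = np.array([1.0, -0.5])
y = X @ beta_true + rng.normal(scale=0.2, size=n)
y[0] += 8.0                      # inject one gross outlier

def huber_obj(beta):
    # Huber loss: quadratic for small residuals, linear for large ones.
    e = y - X @ beta
    ae = np.abs(e)
    return np.where(ae <= delta, 0.5 * e**2, delta * (ae - 0.5 * delta)).sum()

def huber_fit(iters=200):
    # IRLS: each step solves a weighted least-squares problem whose
    # quadratic objective majorizes the Huber objective at the current point.
    beta = np.linalg.lstsq(X, y, rcond=None)[0]     # start from OLS
    for _ in range(iters):
        ae = np.abs(y - X @ beta)
        w = np.where(ae <= delta, 1.0, delta / np.maximum(ae, 1e-12))
        sw = np.sqrt(w)
        beta = np.linalg.lstsq(sw[:, None] * X, sw * y, rcond=None)[0]
    return beta

beta_huber = huber_fit()
beta_ols = np.linalg.lstsq(X, y, rcond=None)[0]
```

Because each IRLS step minimizes a majorizer of the Huber objective, the objective is non-increasing along the iterations, so the final fit is at least as good (in Huber loss) as the OLS starting point and is typically much less distorted by the gross outlier.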
Next, we consider the KKT optimality conditions of the reformulated optimization problem (21), and show that the optimal solutions are represented as piecewise-linear functions of z by showing that, when the active variables (nonzero variables at the optimal solution) do not change, the optimal solutions linearly change with z (Proposition 1). Let \(\hat{\varvec{u}}(z)\) and \(\hat{\varvec{r}} (z)\) be the vectors of optimal Lagrange multipliers. Then, the KKT conditions of (23) are given by
where \(\varvec{h}(z)=(\varvec{y}(z)\;\;-\varvec{y}(z) \;\; \varvec{0} \;\; \delta \cdot \varvec{1_n} \;\; \varvec{0})^\top \in {\mathbb {R}}^{5n}\). Furthermore, letting \({\mathcal {A}}_z=\{i \in [m] : \hat{u}_i(z)> 0\}\) and \({\mathcal {A}}_z^c=[m] \; \backslash \; {\mathcal {A}}_z\), we have
Then, the following proposition holds:
Proposition 1
Let \(S_{{\mathcal {A}}_z}\) be the submatrix of S consisting of the rows indexed by \({\mathcal {A}}_z\). Consider two real values z and \(z'\) with \(z < z'\). If \({\mathcal {A}}_z = {\mathcal {A}}_{z'}\), then we have
where \(\psi (z) \in {\mathbb {R}}^n\), \(\gamma (z) \in {\mathbb {R}}^{|{\mathcal {A}}_z|}\), \( \begin{bmatrix} \psi (z)\\ \gamma (z) \end{bmatrix} = \left[ \begin{array}{cc} P & S_{{\mathcal {A}}_z}^\top \\ S_{{\mathcal {A}}_z} & 0 \end{array} \right] ^{-1} \left[ \begin{array}{c} \varvec{0}\\ \varvec{u}_{1,{\mathcal {A}}_z} \end{array} \right] .\)
Proposition 1 means that \(\hat{\varvec{\beta }}^{R}(z)\) is a piecewise-linear function of z on the interval \([z,z^\prime ]\). Finally, we prove Proposition 1.
Proof
According to (28), we have the following linear system
By decomposing \(\varvec{h}(z)=\varvec{u}_0+\varvec{u}_1 z\), (31) can be re-written as
Letting \( \begin{bmatrix} \psi (z)\\ \gamma (z) \end{bmatrix} = \left[ \begin{array}{cc} P & S_{{\mathcal {A}}_z}^\top \\ S_{{\mathcal {A}}_z} & \varvec{0} \end{array} \right] ^{-1} \left[ \begin{array}{c} \varvec{0} \\ \varvec{u}_{1,{\mathcal {A}}_z} \end{array} \right] \) with \(\psi (z)\in {\mathbb {R}}^n\) and \(\gamma (z) \in {\mathbb {R}}^{|{\mathcal {A}}_z|}\), the result in Proposition 1 follows immediately. \(\square \)
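The block linear system in this proof can be verified numerically. In the sketch below, P, \(S_{{\mathcal {A}}_z}\), and \(\varvec{u}_{1,{\mathcal {A}}_z}\) are replaced by random stand-ins of compatible shapes (an assumption for illustration only; they are not the actual matrices of the QP).

```python
import numpy as np

rng = np.random.default_rng(2)
n, m = 5, 2
M = rng.normal(size=(n, n))
P = M @ M.T + np.eye(n)          # positive-definite quadratic term (stand-in)
S_A = rng.normal(size=(m, n))    # rows of S indexed by the active set (stand-in)
u1_A = rng.normal(size=m)        # corresponding entries of u_1 (stand-in)

# KKT block system of Proposition 1:
#   [ P     S_A^T ] [ psi   ]   [ 0    ]
#   [ S_A   0     ] [ gamma ] = [ u1_A ]
K = np.block([[P, S_A.T], [S_A, np.zeros((m, m))]])
sol = np.linalg.solve(K, np.concatenate([np.zeros(n), u1_A]))
psi, gamma = sol[:n], sol[n:]
# While the active set stays fixed, the primal solution moves linearly in z:
#   beta(z') = beta(z) + (z' - z) * psi.
```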
Proof of Lemma 3
We derive the truncation region \({\mathcal {Z}}\) in the interval \([z_{t-1},z_t].\) For any \(t \in [T]\) and \(i \in [n]\), the residual \(r_i^R(z)\) in the interval \([z_{t-1},z_t]\) is given by
The condition for the truncation region is the following inequality:
This implies that
From (32) and (33), z is classified into the following four cases based on the value of \(g_{t,i}\):
If \(g_{t,i} = 0\), the truncation region is the entire interval when \(|f_{t,i}|\ge \xi \) holds and is empty otherwise. From the above, \(\mathcal {V}_{t, i}\) is derived as
where we define \([\ell , u] = \emptyset \) if \(\ell > u\). \(\square \)
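The case analysis of Lemma 3 amounts to intersecting \([z_{t-1}, z_t]\) with the solution set of \(|f_{t,i} + g_{t,i} z| \ge \xi \), assuming the residual is affine in z on the interval as in (32). A minimal sketch (the function name and example inputs are our own):

```python
def truncation_interval(f, g, xi, lo, hi):
    """Portion of [lo, hi] where |f + g*z| >= xi, mirroring the
    case analysis on the sign of g in Lemma 3."""
    if g == 0.0:
        # Condition does not depend on z: keep all or nothing.
        return [(lo, hi)] if abs(f) >= xi else []
    z1 = (-xi - f) / g        # solves f + g*z = -xi
    z2 = (xi - f) / g         # solves f + g*z = +xi
    if g > 0:
        pieces = [(lo, min(hi, z1)), (max(lo, z2), hi)]
    else:
        pieces = [(lo, min(hi, z2)), (max(lo, z1), hi)]
    # Convention from the lemma: [l, u] is empty if l > u.
    return [(l, u) for l, u in pieces if l <= u]

# e.g. |z| >= 1 on [-2, 2] keeps the two outer pieces:
print(truncation_interval(0.0, 1.0, 1.0, -2.0, 2.0))  # → [(-2.0, -1.0), (1.0, 2.0)]
```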
Proof of Lemma 4
We derive the truncation region \({\mathcal {Z}}\) in the interval \([z_{t-1},z_t]\). For any \(t \in [T]\) and \((i, i^\prime ) \in \mathcal {O}(y^{\mathrm{observed}}) \times \left( [n] \setminus \mathcal {O}(y^\mathrm{observed})\right) \), the residual \(r_i^R(z)\) in the interval \([z_{t-1},z_t]\) is given by
The condition of the truncation region is the following:
Since a direct case analysis of the absolute values is cumbersome, we instead square both sides:
where \(\alpha =g_{t,i}^2-g_{t,i'}^2\), \(\beta =2f_{t,i}g_{t,i}-2f_{t,i'}g_{t,i'}\), and \(\gamma =f_{t,i}^2-f_{t,i'}^2\). The region satisfying (35) coincides with the region satisfying (34). Next, we consider three cases according to the sign of \(\alpha \).
When \(\alpha =0\), from \(\beta z + \gamma \ge 0\), z is classified into
If \(\beta =0\), the truncation region is the entire interval when \(\gamma \ge 0\) holds and is empty otherwise. Therefore, when \(\alpha = 0\), the truncation region can be expressed as
When \(\alpha >0\), using the quadratic formula, inequality (35) is
Therefore, the truncation region can be derived using the union as
When \(\alpha <0\), using the quadratic formula, inequality (35) is
Therefore, the truncation region can be derived using the intersection as
From the above, \(\mathcal {W}_{t, i}\) is derived as
\(\square \)
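The three cases of Lemma 4 reduce to intersecting \([z_{t-1}, z_t]\) with the solution set of the quadratic inequality \(\alpha z^2 + \beta z + \gamma \ge 0\). A self-contained sketch (the function name and example inputs are our own):

```python
import math

def quad_region(alpha, beta, gamma, lo, hi):
    """Portion of [lo, hi] where alpha*z^2 + beta*z + gamma >= 0,
    following Lemma 4's cases on the sign of alpha."""
    if alpha == 0.0:
        if beta == 0.0:
            return [(lo, hi)] if gamma >= 0 else []
        z0 = -gamma / beta
        piece = (max(lo, z0), hi) if beta > 0 else (lo, min(hi, z0))
        return [piece] if piece[0] <= piece[1] else []
    disc = beta * beta - 4.0 * alpha * gamma
    if disc < 0.0:
        # No real roots: the quadratic has the sign of alpha everywhere.
        return [(lo, hi)] if alpha > 0 else []
    sq = math.sqrt(disc)
    r1, r2 = sorted([(-beta - sq) / (2.0 * alpha), (-beta + sq) / (2.0 * alpha)])
    if alpha > 0:   # union of the two outer pieces
        pieces = [(lo, min(hi, r1)), (max(lo, r2), hi)]
    else:           # intersection: only the middle piece survives
        pieces = [(max(lo, r1), min(hi, r2))]
    return [(l, u) for l, u in pieces if l <= u]

# e.g. z^2 - 1 >= 0 on [-3, 3] keeps the two outer pieces:
print(quad_region(1.0, 0.0, -1.0, -3.0, 3.0))  # → [(-3.0, -1.0), (1.0, 3.0)]
```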
Cite this article
Tsukurimichi, T., Inatsu, Y., Duy, V.N.L. et al. Conditional selective inference for robust regression and outlier detection using piecewise-linear homotopy continuation. Ann Inst Stat Math 74, 1197–1228 (2022). https://doi.org/10.1007/s10463-022-00846-2