
Conditional selective inference for robust regression and outlier detection using piecewise-linear homotopy continuation

Annals of the Institute of Statistical Mathematics

Abstract

In this paper, we consider conditional selective inference (SI) for a linear model estimated after outliers are removed from the data. To apply the conditional SI framework, it is necessary to characterize the events describing how the robust method identifies outliers. Unfortunately, existing conditional SI methods cannot be directly applied to our problem because they are applicable only to cases where the selection events can be represented by linear or quadratic constraints. We propose a conditional SI method for popular robust regressions, such as least-absolute-deviation regression and Huber regression, by introducing a new computational method based on a convex optimization technique called the homotopy method. We show that the proposed conditional SI method is applicable to a wide class of robust regression and outlier detection methods and has good empirical performance in both synthetic and real-data experiments.

Notes

  1. Note that Chen and Bien (2020) point out the possibility of a Huberized Lasso approach but do not provide any numerical results. For the comparison in §5, we implemented it ourselves.

  2. Since numerical results on the Huberized Lasso were not presented in Chen and Bien (2020), we implemented it ourselves (see Appendix C in the supplementary material).

References

  • Allgower, E. L., Georg, K. (1993). Continuation and path following. Acta Numerica, 2, 1–63.

  • Andrews, D. F. (1974). A robust method for multiple linear regression. Technometrics, 16(4), 523–531.

  • Atkinson, A. (1986). Comment: Aspects of diagnostic regression analysis [Influential observations, high leverage points, and outliers in linear regression]. Statistical Science, 1(3), 397–402.

  • Bach, F. R., Heckerman, D., Horvitz, E. (2006). Considering cost asymmetry in learning classifiers. Journal of Machine Learning Research, 7, 1713–1741.

  • Barrodale, I., Roberts, F. D. (1973). An improved algorithm for discrete l_1 linear approximation. SIAM Journal on Numerical Analysis, 10(5), 839–848.

  • Benjamini, Y., Yekutieli, D. (2005). False discovery rate-adjusted multiple confidence intervals for selected parameters. Journal of the American Statistical Association, 100(469), 71–81.

  • Benjamini, Y., Heller, R., Yekutieli, D. (2009). Selective inference in complex research. Philosophical Transactions of the Royal Society A: Mathematical, Physical and Engineering Sciences, 367(1906), 4255–4271.

  • Berk, R., Brown, L., Buja, A., Zhang, K., Zhao, L., et al. (2013). Valid post-selection inference. The Annals of Statistics, 41(2), 802–837.

  • Best, M. J. (1996). An algorithm for the solution of the parametric quadratic programming problem. Applied mathematics and parallel computing, 57-76.

  • Bickel, P. J. (1973). On some analogues to linear combinations of order statistics in the linear model. The Annals of Statistics, 597–616.

  • Brownlee, K. A. (1965). Statistical theory and methodology in science and engineering, Vol. 150. New York: Wiley.

  • Chen, S., Bien, J. (2019). Valid inference corrected for outlier removal. Journal of Computational and Graphical Statistics, 1–12.

  • Chen, S., Bien, J. (2020). Valid inference corrected for outlier removal. Journal of Computational and Graphical Statistics, 29(2), 323–334.

  • Choi, Y., Taylor, J., Tibshirani, R., et al. (2017). Selecting the number of principal components: Estimation of the true rank of a noisy matrix. The Annals of Statistics, 45(6), 2590–2617.

  • Cleveland, W. S. (1979). Robust locally weighted regression and smoothing scatterplots. Journal of the American Statistical Association, 74(368), 829–836.

  • Cook, R. D. (1977). Detection of influential observation in linear regression. Technometrics, 19(1), 15–18.

  • Das, D., Duy, V. N. L., Hanada, H., Tsuda, K., Takeuchi, I. (2022). Fast and more powerful selective inference for sparse high-order interaction model. Proceedings of AAAI conference on artificial intelligence.

  • Duy, V. N. L. Takeuchi, I. (2021). Parametric programming approach for more powerful and general lasso selective inference. In International conference on artificial intelligence and statistics, 901–909.

  • Duy, V. N. L., Iwazaki, S., Takeuchi, I. (2020a). Quantifying statistical significance of neural network representation-driven hypotheses by selective inference. arXiv preprint arXiv:2010.01823.

  • Duy, V. N. L., Toda, H., Sugiyama, R., Takeuchi, I. (2020b). Computing valid p-value for optimal changepoint by selective inference using dynamic programming. Advances in Neural Information Processing Systems, 33, 11356–11367.

  • Efron, B., Hastie, T., Johnstone, I., Tibshirani, R. (2004). Least angle regression. Annals of Statistics, 32(2), 407–499.

  • Ellenberg, J. H. (1973). The joint distribution of the standardized least squares residuals from a general linear regression. Journal of the American Statistical Association, 68(344), 941–943.

  • Ellenberg, J. H. (1976). Testing for a single outlier from a general linear regression. Biometrics, 32, 637–645.

  • Fithian, W., Taylor, J., Tibshirani, R., Tibshirani, R. (2015). Selective sequential model selection. arXiv preprint arXiv:1512.02565.

  • Gal, T. (1995). Postoptimal analysis, parametric programming, and related topics. Berlin: Walter de Gruyter.

  • Giesen, J., Jaggi, M., Laue, S. (2012a). Approximating parameterized convex optimization problems. ACM Transactions on Algorithms (TALG), 9(1), 1–17.

  • Giesen, J., Müller, J., Laue, S., Swiercy, S. (2012b). Approximating concavely parameterized optimization problems. Advances in Neural Information Processing Systems, 25, 2105–2113.

  • Harvey, A. C. (1977). A comparison of preliminary estimators for robust regression. Journal of the American Statistical Association, 72(360a), 910–913.

  • Hastie, T., Rosset, S., Tibshirani, R., Zhu, J. (2004a). The entire regularization path for the support vector machine. Journal of Machine Learning Research, 5, 1391–1415.

  • Hastie, T., Rosset, S., Tibshirani, R. (2004b). The entire regularization path for the support vector machine. Journal of Machine Learning Research, 5, 1391–1415.

  • Hill, R. W., and Holland, P. W. (1977). Two robust alternatives to least-squares regression. Journal of the American Statistical Association, 72(360a), 828–833.

  • Hocking, T., Vert, J.-P., Bach, F., Joulin, A. (2011). Clusterpath: An algorithm for clustering using convex fusion penalties. Proceedings of the 28th international conference on machine learning, 745–752.

  • Hoeting, J., Raftery, A. E., Madigan, D. (1996). A method for simultaneous variable selection and outlier identification in linear regression. Computational Statistics & Data Analysis, 22(3), 251–270.

  • Huber, P. J. (2004). Robust statistics, Vol. 523. Cambridge, Massachusetts: John Wiley & Sons.

  • Huber, P. J., et al. (1973). Robust regression: Asymptotics, conjectures and monte carlo. The Annals of Statistics, 1(5), 799–821.

  • Hyun, S., Lin, K., G’Sell, M., Tibshirani, R. J. (2018). Post-selection inference for changepoint detection algorithms with application to copy number variation data. arXiv preprint arXiv:1812.03644.

  • Joshi, P. C. (1972). Some slippage tests of mean for a single outlier in linear regression. Biometrika, 59(1), 109–120.

  • Karasuyama, M., Takeuchi, I. (2010). Nonlinear regularization path for quadratic loss support vector machines. IEEE Transactions on Neural Networks, 22(10), 1613–1625.

  • Karasuyama, M., Harada, N., Sugiyama, M., and Takeuchi, I. (2012). Multi-parametric solution-path algorithm for instance-weighted support vector machines. Machine Learning, 88(3), 297–330.

  • Koenker, R. (2005). Quantile regression. Cambridge, Massachusetts: Cambridge University Press.

  • Koenker, R., Bassett Jr., G. (1978). Regression quantiles. Econometrica, 46(1), 33–50.

  • Lee, G., Scott, C. (2007). The one class support vector machine solution path. Proceedings of international conference on acoustics speech and signal processing 2007, II521–II524.

  • Lee, J. D., Sun, D. L., Sun, Y., Taylor, J. E., et al. (2016). Exact post-selection inference, with application to the lasso. The Annals of Statistics, 44(3), 907–927.

  • Leeb, H., Pötscher, B. M. (2005). Model selection and inference: Facts and fiction. Econometric Theory, 21(1), 21–59.

  • Leeb, H., Pötscher, B. M., et al. (2006). Can one estimate the conditional distribution of post-model-selection estimators? The Annals of Statistics, 34(5), 2554–2591.

  • Lockhart, R., Taylor, J., Tibshirani, R. J., Tibshirani, R. (2014). A significance test for the lasso. Annals of Statistics, 42(2), 413.

  • Loftus, J. R., Taylor, J. E. (2014). A significance test for forward stepwise model selection. arXiv preprint arXiv:1405.3920.

  • Loftus, J. R., Taylor, J. E. (2015). Selective inference in regression models with groups of variables. arXiv preprint arXiv:1511.01478.

  • Maronna, R. A., Martin, R. D., Yohai, V. J., et al. (2019). Robust statistics: Theory and methods (with R). Cambridge, Massachusetts: John Wiley & Sons.

  • Murty, K. G. (1983). Linear programming. Berlin: Springer.

  • Ndiaye, E., Takeuchi, I. (2019). Computing full conformal prediction set with approximate homotopy. Advances in Neural Information Processing Systems, 32, 1386–1395.

  • Ndiaye, E., Le, T., Fercoq, O., Salmon, J., Takeuchi, I. (2019). Safe grid search with optimal complexity. In International conference on machine learning, 4771–4780.

  • Ogawa, K., Imamura, M., Takeuchi, I., Sugiyama, M. (2013). Infinitesimal annealing for training semi-supervised support vector machines. International conference on machine learning, 897–905.

  • Osborne, M. R., Presnell, B., Turlach, B. A. (2000). A new approach to variable selection in least squares problems. IMA Journal of Numerical Analysis, 20(3), 389–403.

  • Pan, J.-X., Fang, K.-T. (1995). Multiple outlier detection in growth curve model with unstructured covariance matrix. Annals of the Institute of Statistical Mathematics, 47(1), 137–153.

  • Panigrahi, S., Taylor, J., Weinstein, A. (2016). Bayesian post-selection inference in the linear model. arXiv preprint arXiv:1605.08824, 28.

  • Pötscher, B. M., Schneider, U., et al. (2010). Confidence sets based on penalized maximum likelihood estimators in gaussian regression. Electronic Journal of Statistics, 4, 334–360.

  • Ritter, K. (1984). On parametric linear and quadratic programming problems. Mathematical programming: Proceedings of the international congress on mathematical programming, 307–335.

  • Rosset, S., Zhu, J. (2007). Piecewise linear regularized solution paths. Annals of Statistics, 35, 1012–1030.

  • Rousseeuw, P. J., Leroy, A. M. (2005). Robust regression and outlier detection, Vol. 589. Cambridge, Massachusetts: John Wiley & Sons.

  • She, Y., Owen, A. B. (2011). Outlier detection using nonconvex penalized regression. Journal of the American Statistical Association, 106(494), 626–639.

  • Shibagaki, A., Suzuki, Y., Karasuyama, M., Takeuchi, I. (2015). Regularization path of cross-validation error lower bounds. Advances in Neural Information Processing Systems, 28, 1675–1683.

  • Shimodaira, H., Terada, Y. (2019). Selective inference for testing trees and edges in phylogenetics. Frontiers in Ecology and Evolution, 7, 174.

  • Srikantan, K. (1961). Testing for the single outlier in a regression model. Sankhyā: The Indian Journal of Statistics, Series A, 251–260.

  • Srivastava, M. S., von Rosen, D. (1998). Outliers in multivariate regression models. Journal of Multivariate Analysis, 65(2), 195–208.

  • Sugiyama, K., Duy, V. N. L., Takeuchi, I. (2020). More powerful and general selective inference for stepwise feature selection using the homotopy continuation approach. arXiv preprint arXiv:2012.13545.

  • Sugiyama, R., Toda, H., Duy, V. N. L., Inatsu, Y., Takeuchi, I. (2021). Valid and exact statistical inference for multi-dimensional multiple change-points by selective inference. arXiv preprint arXiv:2110.08989.

  • Suzumura, S., Nakagawa, K., Umezu, Y., Tsuda, K., Takeuchi, I. (2017). Selective inference for sparse high-order interaction models. Proceedings of the 34th international conference on machine learning, Vol. 70, 3338–3347.

  • Takeuchi, I., Sugiyama, M. (2011). Target neighbor consistent feature weighting for nearest neighbor classification. Advances in Neural Information Processing Systems, 24, 576–584.

  • Takeuchi, I., Nomura, K., Kanamori, T. (2009). Nonparametric conditional density estimation using piecewise-linear solution path of kernel quantile regression. Neural Computation, 21(2), 539–559.

  • Takeuchi, I., Hongo, T., Sugiyama, M., Nakajima, S. (2013). Parametric task learning. Advances in Neural Information Processing Systems, 26, 1358–1366.

  • Tanizaki, K., Hashimoto, N., Inatsu, Y., Hontani, H., Takeuchi, I. (2020). Computing valid p-values for image segmentation by selective inference. Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, 9553–9562.

  • Taylor, J., Lockhart, R., Tibshirani, R. J., Tibshirani, R. (2014). Post-selection adaptive inference for least angle regression and the lasso. arXiv preprint arXiv:1401.3889, 354.

  • Terada, Y., Shimodaira, H. (2017). Selective inference for the problem of regions via multiscale bootstrap. arXiv preprint arXiv:1711.00949.

  • Tian, X., Taylor, J., et al. (2018). Selective inference with a randomized response. The Annals of Statistics, 46(2), 679–710.

  • Tibshirani, R. J., Taylor, J., Lockhart, R., Tibshirani, R. (2016). Exact post-selection inference for sequential regression procedures. Journal of the American Statistical Association, 111(514), 600–620.

  • Tsuda, K. (2007). Entire regularization paths for graph data. Proceedings of international conference on machine learning, 2007, 919–925.

  • Welsch, R. E., Kuh, E. (1977). Linear regression diagnostics. Technical Report 173, National Bureau of Economic Research, Cambridge, Massachusetts.

  • Yamada, M., Umezu, Y., Fukumizu, K., Takeuchi, I. (2018a). Post selection inference with kernels. In International conference on artificial intelligence and statistics, 152–160.

  • Yamada, M., Wu, D., Tsai, Y.-H. H., Takeuchi, I., Salakhutdinov, R., Fukumizu, K. (2018b). Post selection inference with incomplete maximum mean discrepancy estimator. arXiv preprint arXiv:1802.06226.

  • Yang, F., Barber, R. F., Jain, P., Lafferty, J. (2016). Selective inference for group-sparse linear models. Advances in Neural Information Processing Systems, 29, 2469–2477.

  • Zaman, A., Rousseeuw, P. J., Orhan, M. (2001). Econometric applications of high-breakdown robust regression techniques. Economics Letters, 71(1), 1–8.

Acknowledgements

This work was partially supported by MEXT KAKENHI (20H00601, 16H06538), JST CREST (JPMJCR21D3), JST Moonshot R&D (JPMJMS2033-05), JST AIP Acceleration Research (JPMJCR21U2), NEDO (JPNP18002, JPNP20006) and the RIKEN Center for Advanced Intelligence Project. We thank the two anonymous reviewers for their constructive comments, which helped us improve the paper.

Author information

Corresponding author

Correspondence to Ichiro Takeuchi.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Supplementary Information

Below is the link to the electronic supplementary material.

Supplementary file 1 (pdf 78 KB)

Appendix A: Proofs

Proof of Lemma 1

Let \(\rho _i(z) = y_i(z)-\varvec{x}_i^\top \varvec{\beta }\) be a parametric residual for each i. Then, (21) can be written as

$$\begin{aligned} \hat{\varvec{\beta }}^{R}(z) = \mathop \mathrm{arg~min}\limits _{\varvec{\beta }} \sum _{i=1}^n |\rho _i(z)|. \end{aligned}$$

Furthermore, for each i, let \(\rho _i^+(z)\) and \(\rho _i^-(z)\) be non-negative numbers with \(\rho _i(z) =\rho _i^+(z)-\rho _i^-(z)\). Since at least one of \(\rho _i^+(z)\) and \(\rho _i^-(z)\) can be set to zero, the following equation holds:

$$\begin{aligned} |\rho _i(z)|=|\rho _i^+(z) -\rho _i^-(z)| = \rho _i^+(z) +\rho _i^-(z),\ i \in [n]. \end{aligned}$$

Therefore, \(\hat{\varvec{\beta }}^{R}(z)\) in the parametric approach can be obtained by solving the following parametric linear programming problem:

$$\begin{aligned}&\mathrm{min} \sum _{i=1}^n\left( \rho _i^+(z) +\rho _i^-(z)\right) \nonumber \\&\mathrm{s.t.}\;\; \rho _i^+(z) - \rho _i^-(z)=y_i(z)-\varvec{x}_i^{\top }\varvec{\beta }(z), \rho _i^+(z) \ge 0, \rho _i^-(z) \ge 0,\ i \in [n]. \end{aligned}$$
(26)

Next, we show that (26) can be expressed as (22). The \(\varvec{\beta }(z)\) in (26) can be re-written as follows:

$$\begin{aligned} \varvec{\beta }(z)=\varvec{\beta }^+(z) - \varvec{\beta }^-(z) , \;\;\; \varvec{\beta }^+(z)\ge \varvec{0},\;\; \varvec{\beta }^-(z)\ge \varvec{0}. \end{aligned}$$

Thus, since \({\varvec{y}}(z) = {\varvec{a}} + {\varvec{b}} z\), by letting

$$\begin{aligned} \varvec{r}&= \left( \rho _1^+ (z),\ \rho _1^- (z),\ \ldots ,\ \rho _n^+ (z),\ \rho _n^- (z),\ \beta _{1}^{+} (z),\ \beta _{1}^{-} (z),\ \ldots ,\ \beta _{p}^{+} (z),\ \beta _{p}^{-} (z) \right) ^\top \in {\mathbb {R}}^{2n+2p}, \\ S&= \left( \begin{array}{cccccccccccccc} 1 & -1 & 0 & 0 & \cdots & 0 & 0 & x_{11} & -x_{11} & x_{12} & -x_{12} & \cdots & x_{1p} & -x_{1p}\\ 0 & 0 & 1 & -1 & \cdots & 0 & 0 & x_{21} & -x_{21} & x_{22} & -x_{22} & \cdots & x_{2p} & -x_{2p}\\ \vdots & & & & \ddots & & & \vdots & & & & & & \vdots \\ 0 & 0 & 0 & 0 & \cdots & 1 & -1 & x_{n1} & -x_{n1} & x_{n2} & -x_{n2} & \cdots & x_{np} & -x_{np} \end{array} \right) \in {\mathbb {R}}^{n \times (2n+2p)}, \\ \varvec{q}&= \left( 1,\ 1,\ \ldots ,\ 1,\ 1,\ 0,\ 0,\ \ldots ,\ 0,\ 0 \right) ^\top \in {\mathbb {R}}^{2n+2p}, \quad \varvec{u}_0=\varvec{a}, \quad \varvec{u}_1=\varvec{b}, \end{aligned}$$

we have (22). Finally, we show piecewise linearity. It is known that, for a parametric linear programming problem of the form (22) in which the constraint right-hand side depends linearly on the parameter z, the optimal solution is a piecewise-linear function of z (see, e.g., Section 8.6 in Murty (1983)). Hence, the solution path of \(\hat{\varvec{\beta }}^{R}(z)\) can be represented as a piecewise-linear function with respect to z. \(\square \)
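
To make the reformulation above concrete, the following is a minimal numerical sketch (ours, not part of the paper) that solves the least-absolute-deviation fit for one fixed response vector by handing the variable vector \(\varvec{r}\), the constraint matrix S, and the cost vector \(\varvec{q}\) of (22) to a generic LP solver; the solver choice (scipy.optimize.linprog) and the toy data are our own assumptions.

```python
import numpy as np
from scipy.optimize import linprog

def lad_fit(X, y):
    """Least-absolute-deviation fit via the LP reformulation (22):
    variables r = (rho_1^+, rho_1^-, ..., rho_n^+, rho_n^-,
                   beta_1^+, beta_1^-, ..., beta_p^+, beta_p^-),
    minimize sum_i (rho_i^+ + rho_i^-)
    s.t.  rho_i^+ - rho_i^- + x_i^T (beta^+ - beta^-) = y_i,  r >= 0."""
    n, p = X.shape
    # cost vector q: 1 on the residual variables, 0 on the beta variables
    q = np.concatenate([np.ones(2 * n), np.zeros(2 * p)])
    # equality-constraint matrix S (one row per observation)
    S = np.zeros((n, 2 * n + 2 * p))
    for i in range(n):
        S[i, 2 * i], S[i, 2 * i + 1] = 1.0, -1.0          # rho_i^+ - rho_i^-
        S[i, 2 * n::2], S[i, 2 * n + 1::2] = X[i], -X[i]  # x_i^T (beta^+ - beta^-)
    res = linprog(c=q, A_eq=S, b_eq=y, bounds=(0, None), method="highs")
    r = res.x
    return r[2 * n::2] - r[2 * n + 1::2]                  # beta = beta^+ - beta^-

rng = np.random.default_rng(0)
X = rng.normal(size=(30, 2))
y = X @ np.array([1.0, -2.0]) + 0.1 * rng.normal(size=30)
y[:3] += 8.0                     # a few gross outliers barely move the LAD fit
print(lad_fit(X, y))             # close to [1, -2]
```

Along the homotopy path only the right-hand side \(\varvec{y}(z)=\varvec{a}+\varvec{b}z\) of the equality constraints changes, which is exactly the parametric structure exploited in the proof of Lemma 1.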

Proof of Lemma 2

Let \(\varvec{e}(z) = \varvec{y}(z)-X \varvec{\beta }\) be the parametric residual vector. Let \(\varvec{u}(z)\) and \(\varvec{v}(z)\) be the vectors whose i-th elements \(u(z)_i\) and \(v(z)_i\) are respectively given by

$$\begin{aligned} u(z)_i = \min \{ |e (z)_i |, \delta \}, \ v(z)_i = \max \{|e (z)_i | -\delta ,0 \}, \end{aligned}$$

where \(e(z)_i\) is the i-th element of \(\varvec{e}(z)\). Then, the following inequality holds:

$$\begin{aligned} -\varvec{u}(z)-\varvec{v}(z) \le \varvec{y}(z)-X \varvec{\beta } \le \varvec{u}(z)+\varvec{v}(z). \end{aligned}$$

In addition, the objective function can be expressed as follows:

$$\begin{aligned} \frac{1}{2} \varvec{u}(z)^\top \varvec{u}(z) + \delta \cdot \varvec{1}^\top \varvec{v}(z), \end{aligned}$$

where \(\varvec{1}\) is the vector of all ones. Thus, \(\hat{\varvec{\beta }}^{R}(z)\) in the parametric approach can be obtained by solving the following parametric quadratic programming problem:

$$\begin{aligned} \min&\frac{1}{2} \varvec{u}(z)^\top \varvec{u}(z) + \delta \cdot \varvec{1}^\top \varvec{v}(z)\nonumber \\ \mathrm {s.t.}&-\varvec{u}(z)-\varvec{v}(z) \le \varvec{y}(z)-X\varvec{\beta }(z) \le \varvec{u}(z) + \varvec{v}(z), \nonumber \\&\varvec{0} \le \varvec{u}(z) \le \delta \cdot \varvec{1},\;\; \varvec{v}(z) \ge {\varvec{0}}. \end{aligned}$$
(27)

Therefore, noting that \({\varvec{y}} (z) = {\varvec{a}} + {\varvec{b}} z\), letting

$$\begin{aligned} \varvec{r}&= \left( \varvec{\beta } (z),\ \varvec{u} (z),\ \varvec{v} (z) \right) ^\top \in {\mathbb {R}}^{p+2n}, \\ P&= \left( \begin{array}{ccc} \varvec{0} & \varvec{0} & \varvec{0}\\ \varvec{0} & I_n & \varvec{0}\\ \varvec{0} & \varvec{0} & \varvec{0} \end{array} \right) \in {\mathbb {R}}^{(p+2n) \times (p+2n)}, \\ \varvec{q}&= (\varvec{0} \;\; \varvec{0}\;\; \delta \cdot \varvec{1}_n)^\top \in {\mathbb {R}}^{p+2n}, \\ S&= \left( \begin{array}{ccc} X & -I_n & -I_n\\ -X & -I_n & -I_n\\ \varvec{0} & -I_n & \varvec{0}\\ \varvec{0} & I_n & \varvec{0}\\ \varvec{0} & \varvec{0} & -I_n \end{array} \right) \in {\mathbb {R}}^{5n \times (p+2n)},\\ {\varvec{u}}_0&= ({\varvec{a}} \;\; -{\varvec{a}} \;\; \varvec{0} \;\; \delta \cdot \varvec{1}_n \;\; \varvec{0} ) ^\top \in {\mathbb {R}}^{5n}, \\ {\varvec{u}}_1&= ({\varvec{b}} \;\; -{\varvec{b}} \;\; \varvec{0} \;\; \varvec{0} \;\; \varvec{0} ) ^\top \in {\mathbb {R}}^{5n}, \end{aligned}$$

it can be shown that (27) is the same as (23). \(\square \)
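For illustration, the following numpy sketch (ours, not from the paper; the function name and toy data are hypothetical) assembles the matrices \(P, \varvec{q}, S\) and the vectors \(\varvec{u}_0, \varvec{u}_1\) of (27) for a given design matrix, after which any off-the-shelf QP solver could be applied for a fixed z.

```python
import numpy as np

def huber_qp_matrices(X, a, b, delta):
    """Build (P, q, S, u0, u1) of the parametric QP (27) for the response
    path y(z) = a + b*z, with variable vector r = (beta, u, v)."""
    n, p = X.shape
    Z, I = np.zeros((n, n)), np.eye(n)
    P = np.zeros((p + 2 * n, p + 2 * n))
    P[p:p + n, p:p + n] = I                           # quadratic term acts on u only
    q = np.concatenate([np.zeros(p + n), delta * np.ones(n)])
    S = np.block([
        [ X,                -I, -I],                  #  X beta - u - v <=  y(z)
        [-X,                -I, -I],                  # -X beta - u - v <= -y(z)
        [np.zeros((n, p)),  -I,  Z],                  #  u >= 0
        [np.zeros((n, p)),   I,  Z],                  #  u <= delta * 1
        [np.zeros((n, p)),   Z, -I],                  #  v >= 0
    ])
    u0 = np.concatenate([a, -a, np.zeros(n), delta * np.ones(n), np.zeros(n)])
    u1 = np.concatenate([b, -b, np.zeros(3 * n)])
    return P, q, S, u0, u1

rng = np.random.default_rng(1)
X = rng.normal(size=(5, 2))
a, b = rng.normal(size=5), rng.normal(size=5)
P, q, S, u0, u1 = huber_qp_matrices(X, a, b, delta=1.345)
print(P.shape, S.shape)          # (9, 9) (25, 9) for n = 5, p = 2
z = 0.3
h = u0 + u1 * z                  # right-hand side of S r <= h(z) at this z
```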

Next, we consider the KKT optimality conditions of the reformulated optimization problem (23), and show that the optimal solution is represented as a piecewise-linear function of z by showing that, as long as the active set (the set of constraints whose Lagrange multipliers are nonzero at the optimal solution) does not change, the optimal solution changes linearly with z (Proposition 1). Let \(\hat{\varvec{r}}(z)\) be the optimal solution and \(\hat{\varvec{u}}(z)\) be the vector of optimal Lagrange multipliers. Then, the KKT conditions of (23) are given by

$$\begin{aligned} P \varvec{\hat{r}}(z)+\varvec{q}+S^\top \hat{\varvec{u}}(z)&={\varvec{0}},\\ S\hat{\varvec{r}}(z)-\varvec{h}(z)&\le {\varvec{0}}, \\ \hat{u}_i (z) (S \hat{\varvec{r}}(z)-{\varvec{h}} (z))_i&= 0, \;\;\; \forall i \in [m], \\ \hat{u} _i (z)&\ge 0, \;\;\; \forall i \in [m], \end{aligned}$$

where \(\varvec{h}(z)=(\varvec{y}(z)\;\;-\varvec{y}(z) \;\; \varvec{0} \;\; \delta \cdot \varvec{1_n} \;\; \varvec{0})^\top \in {\mathbb {R}}^{5n}\). Furthermore, letting \({\mathcal {A}}_z=\{i \in [m] : \hat{u}_i(z)> 0\}\) and \({\mathcal {A}}_z^c=[m] \; \backslash \; {\mathcal {A}}_z\), we have

$$\begin{aligned} P \varvec{\hat{r}}(z)+\varvec{q}+S^\top \hat{\varvec{u}}(z)&={\varvec{0}}, \nonumber \\ (S\hat{\varvec{r}}(z)-\varvec{h}(z))_i&= 0, \;\;\; \forall i \in {\mathcal {A}}_z \nonumber \\ (S\hat{\varvec{r}}(z)-\varvec{h}(z))_i&\le 0, \;\;\; \forall i \in {\mathcal {A}}_z^c. \end{aligned}$$
(28)

Then, the following proposition holds:

Proposition 1

Let \(S_{{\mathcal {A}}_z}\) denote the submatrix of S consisting of the rows indexed by \({\mathcal {A}}_z\). Consider two real values z and \(z'\) with \(z < z'\). If \({\mathcal {A}}_z = {\mathcal {A}}_{z'}\), then we have

$$\begin{aligned} \hat{\varvec{r}}(z')-\hat{\varvec{r}}(z)&= \psi (z) \times (z'-z), \end{aligned}$$
(29)
$$\begin{aligned} \hat{\varvec{u}}_{{\mathcal {A}}_z}(z')-\hat{\varvec{u}}_{{\mathcal {A}}_z}(z)&= \gamma (z) \times (z'-z), \end{aligned}$$
(30)

where \(\psi (z)\) has the same dimension as \(\hat{\varvec{r}}(z)\), \(\gamma (z) \in {\mathbb {R}}^{|{\mathcal {A}}_z|}\), and \( \begin{bmatrix} \psi (z)\\ \gamma (z) \end{bmatrix} = \left[ \begin{array}{cc} P & S_{{\mathcal {A}}_z}^\top \\ S_{{\mathcal {A}}_z} & 0 \end{array} \right] ^{-1} \left[ \begin{array}{c} \varvec{0}\\ \varvec{u}_{1,{\mathcal {A}}_z} \end{array} \right] .\)

Proposition 1 means that \(\hat{\varvec{\beta }}^{R}(z)\) is a linear function of z on any interval \([z,z^\prime ]\) over which the active set stays the same, and hence a piecewise-linear function of z overall. Finally, we prove Proposition 1.

Proof

According to (28), we have the following linear system

$$\begin{aligned} \left[ \begin{array}{cc} P & S_{{\mathcal {A}}_z}^\top \\ S_{{\mathcal {A}}_z} & \varvec{0} \end{array} \right] \left[ \begin{array}{c} \hat{\varvec{r}}(z)\\ \hat{\varvec{u}}_{{\mathcal {A}}_z}(z) \end{array} \right] = \left[ \begin{array}{c} -\varvec{q}\\ \varvec{h}_{{\mathcal {A}}_z}(z) \end{array} \right] . \end{aligned}$$
(31)

By decomposing \(\varvec{h}(z)=\varvec{u}_0+\varvec{u}_1 z\), (31) can be re-written as

$$\begin{aligned}&\left[ \begin{array}{cc} P & S_{{\mathcal {A}}_z}^\top \\ S_{{\mathcal {A}}_z} & \varvec{0} \end{array} \right] \left[ \begin{array}{c} \hat{\varvec{r}}(z)\\ \hat{\varvec{u}}_{{\mathcal {A}}_z}(z) \end{array} \right] = \left[ \begin{array}{c} -\varvec{q}\\ \varvec{u}_{0,{\mathcal {A}}_z} \end{array} \right] + \left[ \begin{array}{c} \varvec{0} \\ \varvec{u}_{1,{\mathcal {A}}_z} \end{array} \right] z, \\&\left[ \begin{array}{c} \hat{\varvec{r}}(z)\\ \hat{\varvec{u}}_{{\mathcal {A}}_z}(z) \end{array} \right] = \left[ \begin{array}{cc} P & S_{{\mathcal {A}}_z}^\top \\ S_{{\mathcal {A}}_z} & \varvec{0} \end{array} \right] ^{-1} \left[ \begin{array}{c} -\varvec{q}\\ \varvec{u}_{0,{\mathcal {A}}_z} \end{array} \right] + \left[ \begin{array}{cc} P & S_{{\mathcal {A}}_z}^\top \\ S_{{\mathcal {A}}_z} & \varvec{0} \end{array} \right] ^{-1} \left[ \begin{array}{c} \varvec{0} \\ \varvec{u}_{1,{\mathcal {A}}_z} \end{array} \right] z. \end{aligned}$$

Denoting \( \begin{bmatrix} \psi (z)\\ \gamma (z) \end{bmatrix} = \left[ \begin{array}{cc} P & S_{{\mathcal {A}}_z}^\top \\ S_{{\mathcal {A}}_z} & \varvec{0} \end{array} \right] ^{-1} \left[ \begin{array}{c} \varvec{0} \\ \varvec{u}_{1,{\mathcal {A}}_z} \end{array} \right] \), where \(\psi (z)\) has the same dimension as \(\hat{\varvec{r}}(z)\) and \(\gamma (z) \in {\mathbb {R}}^{|{\mathcal {A}}_z|}\), the result in Proposition 1 follows immediately. \(\square \)
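
As a sanity check of Proposition 1, the following sketch (ours; the toy P, S, \(\varvec{u}_1\) and active set are hypothetical, and the KKT matrix is assumed invertible) computes the direction \((\psi (z), \gamma (z))\) by solving the block linear system with numpy.

```python
import numpy as np

def path_direction(P, S, u1, active):
    """Direction (psi, gamma) of the solution path on a segment where the
    active set A_z is fixed (Proposition 1): solve
    [[P, S_A^T], [S_A, 0]] [psi; gamma] = [0; u1_A]."""
    S_A = S[active]                                  # rows of S indexed by A_z
    m, d = len(active), P.shape[0]
    K = np.block([[P, S_A.T], [S_A, np.zeros((m, m))]])
    rhs = np.concatenate([np.zeros(d), u1[active]])
    sol = np.linalg.solve(K, rhs)                    # assumes K is invertible
    return sol[:d], sol[d:]                          # psi, gamma

# Toy instance; on such a segment, r_hat(z') - r_hat(z) = psi * (z' - z).
rng = np.random.default_rng(2)
d, m_total = 6, 10
P = np.eye(d)                                        # toy positive-definite P
S = rng.normal(size=(m_total, d))
u1 = rng.normal(size=m_total)
psi, gamma = path_direction(P, S, u1, active=[0, 3, 7])
print(psi, gamma)
```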

Proof of Lemma 3

We derive the truncation region \({\mathcal {Z}}\) in the interval \([z_{t-1},z_t].\) For any \(t \in [T]\) and \(i \in [n]\), the residual \(r_i^R(z)\) in the interval \([z_{t-1},z_t]\) is given by

$$\begin{aligned} r_i^R(z)=f_{t,i}+g_{t,i}z, \;\; z \in [z_{t-1}, z_t]. \end{aligned}$$

The condition for the truncation region is the following inequality:

$$\begin{aligned} |r_i^R(z)|&\ge \xi . \end{aligned}$$

This implies that

$$\begin{aligned}&f_{t,i}+g_{t,i}z \ge \xi , \end{aligned}$$
(32)
$$\begin{aligned} \text { or }&f_{t,i}+g_{t,i}z \le -\xi . \end{aligned}$$
(33)

From (32) and (33), z is classified into the following four cases based on the value of \(g_{t,i}\):

$$\begin{aligned}&\left\{ \begin{array}{c} z\ge \frac{\xi -f_{t,i}}{g_{t,i}} \text { if } \;g_{t,i}> 0,\\ z\le \frac{\xi -f_{t,i}}{g_{t,i}} \text { if } \;g_{t,i}< 0,\\ \end{array} \right. \\&\left\{ \begin{array}{c} z\le \frac{-\xi -f_{t,i}}{g_{t,i}} \text { if } \;g_{t,i} > 0,\\ z\ge \frac{-\xi -f_{t,i}}{g_{t,i}} \text { if }\;g_{t,i} < 0. \end{array} \right. \end{aligned}$$

If \(g_{t,i} = 0\), the whole interval \([z_{t-1}, z_t]\) belongs to the truncation region when \(|f_{t,i}|\ge \xi \), and otherwise none of it does. From the above, \(\mathcal {V}_{t, i}\) is derived as

$$\begin{aligned} \mathcal {V}_{t, i} = \left\{ \begin{array}{ll} \left[ z_{t-1}, \min \left\{ z_{t}, \frac{- \xi - f_{t,i}}{g_{t,i}}\right\} \right] \cup \left[ \max \left\{ z_{t-1}, \frac{\xi - f_{t,i}}{g_{t,i}}\right\} , z_t \right] & \text { if } g_{t,i} > 0,\\ \left[ z_{t-1}, \min \left\{ z_{t}, \frac{\xi - f_{t,i}}{g_{t,i}}\right\} \right] \cup \left[ \max \left\{ z_{t-1}, \frac{-\xi - f_{t,i}}{g_{t,i}}\right\} , z_t \right] & \text { if } g_{t,i} < 0,\\ \left[ z_{t-1}, z_{t} \right] & \text { if } g_{t,i} = 0 \text { and } |f_{t,i}| \ge \xi ,\\ \emptyset & \text { if } g_{t,i} = 0 \text { and } |f_{t,i}| < \xi , \end{array} \right. \end{aligned}$$

where we define \([\ell , u] = \emptyset \) if \(\ell > u\). \(\square \)
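
The case analysis of Lemma 3 translates directly into code. The following small helper (a hypothetical name of ours, not from the paper) returns the part of \([z_{t-1}, z_t]\) on which \(|f_{t,i}+g_{t,i}z| \ge \xi \) as a list of closed intervals.

```python
def outlier_truncation_interval(f, g, xi, z_lo, z_hi):
    """Subset of [z_lo, z_hi] on which |f + g*z| >= xi (Lemma 3),
    returned as a list of closed intervals; empty intervals are dropped."""
    if g > 0:
        cand = [(z_lo, min(z_hi, (-xi - f) / g)), (max(z_lo, (xi - f) / g), z_hi)]
    elif g < 0:
        cand = [(z_lo, min(z_hi, (xi - f) / g)), (max(z_lo, (-xi - f) / g), z_hi)]
    else:
        cand = [(z_lo, z_hi)] if abs(f) >= xi else []
    return [(lo, hi) for (lo, hi) in cand if lo <= hi]

# |1 - 2z| >= 3 on [-4, 4]  ->  [(-4.0, -1.0), (2.0, 4.0)]
print(outlier_truncation_interval(1.0, -2.0, 3.0, -4.0, 4.0))
```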

Proof of Lemma 4

We derive the truncation region \({\mathcal {Z}}\) in the interval \([z_{t-1},z_t]\). For any \(t \in [T]\) and \((i, i^\prime ) \in \mathcal {O}(y^{\mathrm{observed}}) \times \left( [n] \setminus \mathcal {O}(y^\mathrm{observed})\right) \), the residuals \(r_i^R(z)\) and \(r_{i^\prime }^R(z)\) in the interval \([z_{t-1},z_t]\) are given by

$$\begin{aligned} r_i^R(z)=f_{t,i}+g_{t,i}z,\;\;\; r_{i'}^R(z)=f_{t,i'}+g_{t,i'}z, \;\; z \in [z_{t-1}, z_t]. \end{aligned}$$

The condition of the truncation region is the following:

$$\begin{aligned} |f_{t,i}+g_{t,i} z| \ge |f_{t,i'} +g_{t,i'}z|. \end{aligned}$$
(34)

Since a case analysis based directly on the absolute values is cumbersome, we instead square both sides:

$$\begin{aligned}&(f_{t,i}+g_{t,i} z)^2 \ge (f_{t,i'} +g_{t,i'}z)^2,\nonumber \\&(g_{t,i}^2-g_{t,i'}^2)z^2 + (2f_{t,i}g_{t,i}-2f_{t,i'}g_{t,i'})z + (f_{t,i}^2-f_{t,i'}^2) \ge 0,\nonumber \\&\alpha z^2 + \beta z + \gamma \ge 0, \end{aligned}$$
(35)

where \(\alpha =g_{t,i}^2-g_{t,i'}^2\), \(\beta =2f_{t,i}g_{t,i}-2f_{t,i'}g_{t,i'}\) and \(\gamma =f_{t,i}^2-f_{t,i'}^2\). The set of z satisfying (35) is exactly the set of z satisfying (34). Note that the discriminant satisfies \(\beta ^2-4\alpha \gamma = 4(f_{t,i}g_{t,i'}-f_{t,i'}g_{t,i})^2 \ge 0\), so the quadratic always has real roots. We then consider three cases according to the sign of \(\alpha \).

When \(\alpha =0\), from \(\beta z + \gamma \ge 0\), z is classified into

$$\begin{aligned} \left\{ \begin{array}{c} z\ge \frac{-\gamma }{\beta } \text { if } \; \beta > 0,\\ z\le \frac{-\gamma }{\beta } \text { if }\; \beta < 0. \end{array} \right. \end{aligned}$$

If \(\beta =0\), the whole interval belongs to the truncation region when \(\gamma \ge 0\), and otherwise it is empty. Therefore, when \(\alpha = 0\), the truncation region can be expressed as

$$\begin{aligned} \left\{ \begin{array}{ll} \left[ \max \left\{ z_{t-1}, \frac{-\gamma }{\beta }\right\} , z_t \right] & \text { if } \alpha =0 \text { and } \beta >0,\\ \left[ z_{t-1}, \min \left\{ z_{t},\frac{-\gamma }{\beta }\right\} \right] & \text { if } \alpha =0 \text { and } \beta <0,\\ \left[ z_{t-1}, z_{t} \right] & \text { if } \alpha =0 \text { and } \beta =0 \text { and } \gamma \ge 0,\\ \emptyset & \text { if } \alpha =0 \text { and } \beta =0 \text { and } \gamma <0. \end{array} \right. \end{aligned}$$

When \(\alpha >0\), using the quadratic formula, inequality (35) holds if and only if

$$\begin{aligned}&z \ge \frac{-\beta +\sqrt{\beta ^2-4 \alpha \gamma }}{2 \alpha },\\ \text { or }&z \le \frac{-\beta -\sqrt{\beta ^2-4 \alpha \gamma }}{2 \alpha }. \end{aligned}$$

Therefore, the truncation region can be derived using the union as

$$\begin{aligned} \left[ \max \left\{ z_{t-1}, \frac{-\beta +\sqrt{\beta ^2-4 \alpha \gamma }}{2 \alpha } \right\} , z_t \right] \cup \left[ z_{t-1}, \min \left\{ z_{t}, \frac{-\beta -\sqrt{\beta ^2-4 \alpha \gamma }}{2 \alpha }\right\} \right] \text { if } \alpha >0. \end{aligned}$$

When \(\alpha <0\), using the quadratic formula, inequality (35) holds if and only if

$$\begin{aligned}&z \le \frac{-\beta -\sqrt{\beta ^2-4 \alpha \gamma }}{2 \alpha }, \\ \text { and }&z \ge \frac{-\beta +\sqrt{\beta ^2-4 \alpha \gamma }}{2 \alpha }. \end{aligned}$$

Therefore, the truncation region can be derived using the intersection as

$$\begin{aligned} \left[ z_{t-1}, \min \left\{ z_{t}, \frac{-\beta -\sqrt{\beta ^2-4 \alpha \gamma }}{2 \alpha } \right\} \right] \cap \left[ \max \left\{ z_{t-1}, \frac{-\beta +\sqrt{\beta ^2-4 \alpha \gamma }}{2 \alpha }\right\} , z_{t} \right] \text { if } \alpha <0. \end{aligned}$$

From the above, \(\mathcal {W}_{t, (i, i^\prime )}\) is derived as

$$\begin{aligned} \mathcal {W}_{t, (i, i^\prime )} = \left\{ \begin{array}{ll} \left[ \max \left\{ z_{t-1}, \frac{-\gamma }{\beta }\right\} , z_t \right] & \text { if } \alpha =0 \text { and } \beta >0,\\ \left[ z_{t-1}, \min \left\{ z_{t},\frac{-\gamma }{\beta }\right\} \right] & \text { if } \alpha =0 \text { and } \beta <0,\\ \left[ z_{t-1}, z_{t} \right] & \text { if } \alpha =0 \text { and } \beta =0 \text { and } \gamma \ge 0,\\ \emptyset & \text { if } \alpha =0 \text { and } \beta =0 \text { and } \gamma <0,\\ \left[ \max \left\{ z_{t-1}, \frac{-\beta +\sqrt{\beta ^2-4 \alpha \gamma }}{2 \alpha } \right\} , z_t \right] \cup \left[ z_{t-1}, \min \left\{ z_{t}, \frac{-\beta -\sqrt{\beta ^2-4 \alpha \gamma }}{2 \alpha }\right\} \right] & \text { if } \alpha >0,\\ \left[ z_{t-1}, \min \left\{ z_{t}, \frac{-\beta -\sqrt{\beta ^2-4 \alpha \gamma }}{2 \alpha } \right\} \right] \cap \left[ \max \left\{ z_{t-1}, \frac{-\beta +\sqrt{\beta ^2-4 \alpha \gamma }}{2 \alpha }\right\} , z_{t} \right] & \text { if } \alpha <0. \end{array} \right. \end{aligned}$$

\(\square \)
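
Similarly, the case analysis of Lemma 4 can be coded as below (our hypothetical helper, not from the paper); the roots are computed with the quadratic formula, using the fact noted above that the discriminant is always non-negative.

```python
import numpy as np

def comparison_truncation_interval(f_i, g_i, f_j, g_j, z_lo, z_hi):
    """Subset of [z_lo, z_hi] on which |f_i + g_i*z| >= |f_j + g_j*z| (Lemma 4),
    obtained from the quadratic inequality alpha*z^2 + beta*z + gamma >= 0."""
    alpha = g_i ** 2 - g_j ** 2
    beta = 2 * (f_i * g_i - f_j * g_j)
    gamma = f_i ** 2 - f_j ** 2

    def clip(lo, hi):
        lo, hi = max(lo, z_lo), min(hi, z_hi)
        return [(lo, hi)] if lo <= hi else []

    if alpha == 0:
        if beta > 0:
            return clip(-gamma / beta, np.inf)
        if beta < 0:
            return clip(-np.inf, -gamma / beta)
        return clip(-np.inf, np.inf) if gamma >= 0 else []
    # discriminant = 4*(f_i*g_j - f_j*g_i)**2 >= 0, so the roots are always real
    disc = np.sqrt(max(beta ** 2 - 4 * alpha * gamma, 0.0))
    roots = sorted([(-beta - disc) / (2 * alpha), (-beta + disc) / (2 * alpha)])
    if alpha > 0:                       # inequality holds outside the roots
        return clip(-np.inf, roots[0]) + clip(roots[1], np.inf)
    return clip(roots[0], roots[1])     # alpha < 0: holds between the roots

print(comparison_truncation_interval(1.0, 2.0, 0.5, -1.0, -3.0, 3.0))
```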

About this article

Cite this article

Tsukurimichi, T., Inatsu, Y., Duy, V.N.L. et al. Conditional selective inference for robust regression and outlier detection using piecewise-linear homotopy continuation. Ann Inst Stat Math 74, 1197–1228 (2022). https://doi.org/10.1007/s10463-022-00846-2
