
Adaptive iterative Hessian sketch via A-optimal subsampling


Iterative Hessian sketch (IHS) is an effective sketching method for modeling large-scale data. It was originally proposed by Pilanci and Wainwright (J Mach Learn Res 17(1):1842–1879, 2016) based on randomized sketching matrices. However, it is computationally intensive because of its iterative sketching process. In this paper, we analyze the IHS algorithm in the unconstrained least-squares setting and then propose a deterministic approach for improving IHS via A-optimal subsampling. Our contributions are threefold: (1) a good initial estimator based on the A-optimal design is suggested; (2) a novel ridged preconditioner is developed for repeated sketching; and (3) an exact line search method is proposed for determining the optimal step length adaptively. Extensive experimental results demonstrate that our proposed A-optimal IHS algorithm outperforms existing accelerated IHS methods.
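For readers new to IHS, the basic iteration for unconstrained least squares can be sketched as follows. This is a minimal illustration of the original sketch-and-iterate scheme of Pilanci and Wainwright with Gaussian sketching matrices, not the accelerated A-optimal variant developed in this paper; the function name and the choice of Gaussian sketch are ours.

```python
import numpy as np

def ihs_lstsq(A, b, m, n_iter=10, seed=None):
    """Basic iterative Hessian sketch (IHS) for the unconstrained
    least-squares problem min_x ||A x - b||^2.

    At each iteration a fresh m x n Gaussian sketch S_t is drawn: the
    Hessian A^T A is replaced by its sketched version (S_t A)^T (S_t A),
    while the gradient term is computed on the full data."""
    rng = np.random.default_rng(seed)
    n, d = A.shape
    x = np.zeros(d)
    for _ in range(n_iter):
        S = rng.standard_normal((m, n)) / np.sqrt(m)  # E[S^T S] = I_n
        SA = S @ A
        H = SA.T @ SA                   # sketched Hessian (d x d)
        g = A.T @ (b - A @ x)           # full-data residual correlation
        x = x + np.linalg.solve(H, g)   # Newton-type step
    return x
```

With a sketch size m of a few multiples of d, each iteration contracts the error toward the full-data least-squares solution, which is the behavior the accelerated methods in this paper aim to speed up.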






  • Benzi, M.: Preconditioning techniques for large linear systems: a survey. J. Comput. Phys. 182(2), 418–477 (2002)


  • Boutsidis, C., Gittens, A.: Improved matrix algorithms via the subsampled randomized Hadamard transform. SIAM J. Matrix Anal. Appl. 34(3), 1301–1340 (2013)


  • Clarkson, K. L., Woodruff, D. P.: Low rank approximation and regression in input sparsity time. In: Proceedings of the Forty-Fifth Annual ACM Symposium on Theory of Computing, pp. 81–90. ACM (2013)

  • Drineas, P., Mahoney, M. W., Muthukrishnan, S.: Sampling algorithms for ℓ2 regression and applications. In: Proceedings of the Seventeenth Annual ACM-SIAM Symposium on Discrete Algorithms, pp. 1127–1136. Society for Industrial and Applied Mathematics (2006)

  • Drineas, P., Mahoney, M.W., Muthukrishnan, S., Sarlós, T.: Faster least squares approximation. Numer. Math. 117(2), 219–249 (2011)


  • Drineas, P., Magdon-Ismail, M., Mahoney, M.W., Woodruff, D.P.: Fast approximation of matrix coherence and statistical leverage. J. Mach. Learn. Res. 13(Dec), 3475–3506 (2012)


  • Gonen, A., Orabona, F., Shalev-Shwartz, S.: Solving ridge regression using sketched preconditioned SVRG. In: International Conference on Machine Learning, pp. 1397–1405 (2016)

  • Horn, R.A., Johnson, C.R.: Matrix Analysis. Cambridge University Press, Cambridge (2012)


  • Johnson, W.B., Lindenstrauss, J.: Extensions of Lipschitz mappings into a Hilbert space. Contemp. Math. 26(189–206), 1 (1984)


  • Knyazev, A.V., Lashuk, I.: Steepest descent and conjugate gradient methods with variable preconditioning. SIAM J. Matrix Anal. Appl. 29(4), 1267–1280 (2007)


  • Lu, Y., Dhillon, P., Foster, D. P., Ungar, L.: Faster ridge regression via the subsampled randomized Hadamard transform. In: Advances in Neural Information Processing Systems, pp. 369–377 (2013)

  • Ma, P., Mahoney, M.W., Yu, B.: A statistical perspective on algorithmic leveraging. J. Mach. Learn. Res. 16(1), 861–911 (2015)


  • Mahoney, M.W.: Randomized algorithms for matrices and data. Found. Trends Mach. Learn. 3(2), 123–224 (2011)


  • Martínez, C.: Partial quicksort. In: Proceedings of the 6th ACM-SIAM Workshop on Algorithm Engineering and Experiments and 1st ACM-SIAM Workshop on Analytic Algorithmics and Combinatorics, pp. 224–228 (2004)

  • McWilliams, B., Krummenacher, G., Lucic, M., Buhmann, J. M.: Fast and robust least squares estimation in corrupted linear models. In: Advances in Neural Information Processing Systems, pp. 415–423 (2014)

  • Nocedal, J., Wright, S.J.: Numerical Optimization. Springer, Berlin (2006)


  • Pilanci, M., Wainwright, M.J.: Iterative Hessian sketch: fast and accurate solution approximation for constrained least-squares. J. Mach. Learn. Res. 17(1), 1842–1879 (2016)


  • Pukelsheim, F.: Optimal Design of Experiments, vol. 50. SIAM, Philadelphia (1993)


  • Seber, G.A.: A Matrix Handbook for Statisticians, vol. 15. Wiley, New York (2008)


  • Tropp, J.A.: Improved analysis of the subsampled randomized Hadamard transform. Adv. Adapt. Data Anal. 3(1–2), 115–126 (2011)


  • Wang, D., Xu, J.: Large scale constrained linear regression revisited: faster algorithms via preconditioning. In: Thirty-Second AAAI Conference on Artificial Intelligence (2018)

  • Wang, J., Lee, J.D., Mahdavi, M., Kolar, M., Srebro, N., et al.: Sketching meets random projection in the dual: a provable recovery algorithm for big and high-dimensional data. Electron. J. Stat. 11(2), 4896–4944 (2017)


  • Wang, H., Yang, M., Stufken, J.: Information-based optimal subdata selection for big data linear regression. J. Am. Stat. Assoc. 114(525), 393–405 (2019)


  • Woodruff, D.P.: Sketching as a tool for numerical linear algebra. Found. Trends Theor. Comput. Sci. 10(1–2), 1–157 (2014)



Author information



Corresponding author

Correspondence to Aijun Zhang.



Appendix A: Extra results with different ridged preconditioners

We further compare the ridged preconditioner \(\varvec{M}=\frac{n}{m}\sum _{i=1}^n\delta _i\varvec{x}_i\varvec{x}_i^T+\lambda \varvec{I}_d\) with its two components: the non-ridged term alone and the scaled identity matrix alone. The three preconditioners are evaluated through \(\mathrm{MSE}_2\) under our proposed algorithm framework. For the identity case we consider only \(\varvec{M}= \varvec{I}\), since any scaling multiplier \(\lambda \) in \(\varvec{M}= \lambda \varvec{I}\) cancels out during the update of \({\hat{\varvec{\beta }}}_t\). The results confirm that the ridging operation can enhance the preconditioner's performance (Fig. 7).

Fig. 7

Estimation error between the estimator and the LSE versus iteration number N for different preconditioners
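To make the roles of the ridged preconditioner and the exact line search concrete, the following minimal sketch combines them in plain preconditioned gradient descent for least squares. It assumes a uniform row subsample `idx` in place of the paper's A-optimal subsampling, and the function names are illustrative rather than the paper's implementation.

```python
import numpy as np

def ridged_preconditioner(X, idx, lam):
    """M = (n/m) * X_S^T X_S + lam * I_d, built from the size-m row
    subsample X_S = X[idx]; the ridge term keeps M invertible even when
    the subsampled Gram matrix is (near-)singular."""
    n, d = X.shape
    Xs = X[idx]
    return (n / len(idx)) * Xs.T @ Xs + lam * np.eye(d)

def precond_gd_exact_ls(X, y, M, n_iter=100):
    """Preconditioned gradient descent for min_b ||y - X b||^2, using
    the exact line search that a quadratic objective admits in closed
    form: along direction p, the optimal step is (g^T p) / ||X p||^2."""
    beta = np.zeros(X.shape[1])
    for _ in range(n_iter):
        g = X.T @ (y - X @ beta)      # negative gradient of the objective
        p = np.linalg.solve(M, g)     # preconditioned search direction
        Xp = X @ p
        denom = Xp @ Xp
        if denom == 0.0:              # stationary point reached
            break
        alpha = (g @ p) / denom       # exact minimizer along p
        beta = beta + alpha * p
    return beta
```

Because M approximates \(\varvec{X}^T\varvec{X}\) up to subsampling noise, each preconditioned step is close to a Newton step, and the closed-form step length removes any tuning of a fixed learning rate.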

Appendix B: Extra results with the same proposed initial estimator

In this section, we report additional experiments in which all methods are initialized with our proposed A-optimal estimator. The subsample size is fixed at \(m=1000\). These experiments further confirm that the proposed Aopt-IHS method generally achieves better convergence rates than the benchmark methods (Figs. 8, 9).

Fig. 8

Estimation error between the estimator and the ground truth versus iteration number N when \(d=50\)

Fig. 9

Estimation error between the estimator and the LSE based on full data versus iteration number N when \(d=50\)


About this article


Cite this article

Zhang, A., Zhang, H. & Yin, G. Adaptive iterative Hessian sketch via A-optimal subsampling. Stat Comput 30, 1075–1090 (2020).



Keywords

  • Hessian sketch
  • Subsampling
  • Optimal design
  • Preconditioner
  • Exact line search
  • First-order method