
Optimal subsampling for softmax regression


A Publisher Correction to this article was published on 12 September 2019


Abstract

To meet the challenge of massive data, Wang et al. (J Am Stat Assoc 113(522):829–844, 2018b) developed an optimal subsampling method for logistic regression. The purpose of this paper is to extend their method to softmax regression, which is also called multinomial logistic regression and is commonly used to model categorical responses with more than two categories. We first derive the asymptotic distribution of the general subsampling estimator, and then derive optimal subsampling probabilities under the A-optimality criterion and the L-optimality criterion with a specific L matrix. Since the optimal subsampling probabilities depend on unknown quantities, we adopt a two-stage adaptive procedure to address this issue and use numerical simulations to demonstrate its performance.
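To make the two-stage procedure concrete, here is a minimal Python sketch of this type of scheme, written by us for illustration (it is not the paper's algorithm verbatim): a uniform pilot subsample yields a pilot estimate, approximate optimal subsampling probabilities proportional to \(\Vert {\mathbf {s}}_i\Vert \Vert {\mathbf {x}}_i\Vert \) (the L-optimality-type choice derived in the Appendix) are computed from it, and a weighted softmax likelihood is then maximized on the second-stage subsample. All function and variable names are ours, and details such as combining the two subsamples and estimating standard errors are omitted.

```python
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(0)

def softmax_probs(X, B):
    """Category probabilities with category 0 as baseline; B has shape (K, d)."""
    eta = np.column_stack([np.zeros(len(X)), X @ B.T])    # (n, K+1) linear predictors
    eta -= eta.max(axis=1, keepdims=True)                 # numerical stability
    e = np.exp(eta)
    return e / e.sum(axis=1, keepdims=True)

def weighted_mle(X, y, w, K):
    """Maximize the weighted softmax log-likelihood (responses y in 0..K)."""
    d = X.shape[1]
    def negloglik(b):
        P = softmax_probs(X, b.reshape(K, d))
        return -np.sum(w * np.log(P[np.arange(len(y)), y]))
    return minimize(negloglik, np.zeros(K * d), method="BFGS").x.reshape(K, d)

def two_stage_estimate(X, y, K, n0=200, n=1000):
    N = X.shape[0]
    # Stage 1: uniform pilot subsample and pilot estimate.
    idx0 = rng.choice(N, size=n0, replace=True)
    B0 = weighted_mle(X[idx0], y[idx0], np.ones(n0), K)
    # Approximate optimal probabilities pi_i proportional to ||s_i(B0)|| * ||x_i||
    # (the A-optimal version would additionally involve M_N^{-1}).
    P = softmax_probs(X, B0)[:, 1:]                       # p_k for k = 1..K
    S = (y[:, None] == np.arange(1, K + 1)[None, :]) - P  # rows s_i
    pi = np.linalg.norm(S, axis=1) * np.linalg.norm(X, axis=1)
    pi /= pi.sum()
    # Stage 2: subsample with these probabilities and reweight by 1/pi.
    idx1 = rng.choice(N, size=n, replace=True, p=pi)
    return weighted_mle(X[idx1], y[idx1], 1.0 / pi[idx1], K)

# Small synthetic example: K + 1 = 3 response categories, d = 3 covariates.
N, d, K = 20000, 3, 2
X = rng.standard_normal((N, d))
B_true = rng.standard_normal((K, d))
y = np.array([rng.choice(K + 1, p=p) for p in softmax_probs(X, B_true)])
print(two_stage_estimate(X, y, K))
```

The inverse-probability weights \(1/\pi _i\) in the second stage are what keep the subsample estimator (approximately) unbiased for the full-data estimator.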


Change history

  • 12 September 2019

    Unfortunately, due to a technical error, the articles published in issues 60:2 and 60:3 received incorrect pagination. Please find here the corrected Tables of Contents. We apologize to the authors of the articles and the readers.

References

  • Atkinson A, Donev A, Tobias R (2007) Optimum experimental designs, with SAS, vol 34. Oxford University Press, Oxford


  • Drineas P, Kannan R, Mahoney MW (2006a) Fast Monte Carlo algorithms for matrices I: approximating matrix multiplication. SIAM J Comput 36(1):132–157


  • Drineas P, Kannan R, Mahoney MW (2006b) Fast Monte Carlo algorithms for matrices II: computing a low-rank approximation to a matrix. SIAM J Comput 36(1):158–183


  • Drineas P, Kannan R, Mahoney MW (2006c) Fast Monte Carlo algorithms for matrices III: computing a compressed approximate matrix decomposition. SIAM J Comput 36(1):184–206


  • Drineas P, Mahoney MW, Muthukrishnan S (2006d) Sampling algorithms for \(l_2\) regression and applications. In: Proceedings of the seventeenth annual ACM-SIAM symposium on discrete algorithm. Society for Industrial and Applied Mathematics, Philadelphia, pp 1127–1136

  • Drineas P, Mahoney M, Muthukrishnan S (2008) Relative-error CUR matrix decomposition. SIAM J Matrix Anal Appl 30:844–881


  • Drineas P, Mahoney M, Muthukrishnan S, Sarlos T (2011) Faster least squares approximation. Numer Math 117:219–249


  • Ferguson TS (1996) A course in large sample theory. Chapman and Hall, London


  • Frieze A, Kannan R, Vempala S (2004) Fast Monte-Carlo algorithms for finding low-rank approximations. J ACM 51:1025–1041


  • Lane A, Yao P, Flournoy N (2014) Information in a two-stage adaptive optimal design. J Stat Plan Inference 144:173–187


  • Ma P, Mahoney M, Yu B (2015) A statistical perspective on algorithmic leveraging. J Mach Learn Res 16:861–911


  • Mahoney MW (2011) Randomized algorithms for matrices and data. Found Trends Mach Learn 3(2):123–224


  • Mahoney MW, Drineas P (2009) CUR matrix decompositions for improved data analysis. Proc Natl Acad Sci USA 106(3):697–702


  • Ortega JM, Rheinboldt WC (1970) Iterative solution of nonlinear equations in several variables, vol 30. SIAM, Philadelphia


  • R Core Team (2017) R: a language and environment for statistical computing. R Foundation for Statistical Computing, Vienna. https://www.R-project.org/

  • Raskutti G, Mahoney M (2016) A statistical perspective on randomized sketching for ordinary least-squares. J Mach Learn Res 17:1–31


  • van der Vaart A (1998) Asymptotic statistics. Cambridge University Press, London


  • Wang H (2018) More efficient estimation for logistic regression with optimal subsample. arXiv preprint. arXiv:1802.02698

  • Wang H, Yang M, Stufken J (2018a) Information-based optimal subdata selection for big data linear regression. J Am Stat Assoc. https://doi.org/10.1080/01621459.2017.1408468

  • Wang H, Zhu R, Ma P (2018b) Optimal subsampling for large sample logistic regression. J Am Stat Assoc 113(522):829–844



Acknowledgements

We gratefully acknowledge the comments from two referees that helped improve the paper.

Author information

Correspondence to HaiYing Wang.


This work was supported by NSF Grant 1812013, a UCONN REP grant, and a GPU Grant from NVIDIA Corporation.

Appendix: Proofs

We prove the two theorems in this section. We use \(O_{P|{\mathcal {D}}_N}(1)\) and \(o_{P|{\mathcal {D}}_N}(1)\) to denote boundedness and convergence to zero, respectively, in conditional probability given the full data. Specifically, for a sequence of random vectors \({\mathbf {v}}_{n,N}\), as \(n\rightarrow \infty \) and \(N\rightarrow \infty \), \({\mathbf {v}}_{n,N}=O_{P|{\mathcal {D}}_N}(1)\) means that for any \(\epsilon >0\), there exists a finite \(C_\epsilon >0\) such that

$$\begin{aligned}&{\mathbb {P}}\big \{\sup _n{\mathbb {P}}(\Vert {\mathbf {v}}_{n,N}\Vert >C_\epsilon |{\mathcal {D}}_N) \le \epsilon \big \}\rightarrow 1; \end{aligned}$$

\({\mathbf {v}}_{n,N}=o_{P|{\mathcal {D}}_N}(1)\) means that for any \(\epsilon >0\) and \(\delta >0\),

$$\begin{aligned}&{\mathbb {P}}\big \{{\mathbb {P}}(\Vert {\mathbf {v}}_{n,N}\Vert >\delta |{\mathcal {D}}_N) \le \epsilon \big \}\rightarrow 1. \end{aligned}$$

Proof

(Theorem 1) By direct calculation under the conditional distribution of the subsample given \({\mathcal {D}}_N\), we have

$$\begin{aligned}&\displaystyle {\mathbb {E}}\left\{ \ell _s^*({\varvec{\beta }})\big |{\mathcal {D}}_N\right\} =\ell _f({\varvec{\beta }}),\nonumber \\&\displaystyle {\mathbb {E}}\left\{ \ell _s^*({\varvec{\beta }})-\ell _f({\varvec{\beta }})\big |{\mathcal {D}}_N\right\} ^2 =\frac{1}{n}\left[ \frac{1}{N^2}\sum _{i=1}^{N}\frac{t_i^2({\varvec{\beta }})}{\pi _i}-\ell _f^2({\varvec{\beta }})\right] , \end{aligned}$$
(8)

where \(t_i({\varvec{\beta }})=\sum _{k=1}^{K}\delta _{i,k}{\mathbf {x}}_i^\mathrm{T}{\varvec{\beta }}_k- \log \Big \{1+\sum _{l=1}^Ke^{{\mathbf {x}}_i^\mathrm{T}{\varvec{\beta }}_l}\Big \}\). Note that

$$\begin{aligned} |t_i({\varvec{\beta }})|&\le \sum _{k=1}^{K}\Vert {\mathbf {x}}_i\Vert \Vert {\varvec{\beta }}_k\Vert +\log \bigg (1+\sum _{k=1}^{K}e^{\Vert {\mathbf {x}}_i\Vert \Vert {\varvec{\beta }}_k\Vert }\bigg )\\&\le K\Vert {\mathbf {x}}_i\Vert \Vert {\varvec{\beta }}\Vert +\log \Big (1+Ke^{\Vert {\mathbf {x}}_i\Vert \Vert {\varvec{\beta }}\Vert }\Big )\\&\le K\Vert {\mathbf {x}}_i\Vert \Vert {\varvec{\beta }}\Vert +1+\log K + \Vert {\mathbf {x}}_i\Vert \Vert {\varvec{\beta }}\Vert \\&=(K+1)\Vert {\mathbf {x}}_i\Vert \Vert {\varvec{\beta }}\Vert +1+\log K, \end{aligned}$$

where the second inequality is from the fact that \(\Vert {\varvec{\beta }}_k\Vert \le \Vert {\varvec{\beta }}\Vert \), and the third inequality is from the fact that \(\log (1+x)<1+\log x\) for \(x\ge 1\). Therefore, from Assumption 2,

$$\begin{aligned} \frac{1}{N^2}\sum _{i=1}^{N}\frac{t_i^2({\varvec{\beta }})}{\pi _i}-\ell _f^2({\varvec{\beta }}) \le&\frac{1}{N^2}\sum _{i=1}^{N}\frac{t_i^2({\varvec{\beta }})}{\pi _i} +\left( \frac{1}{N}\sum _{i=1}^{N}|t_i({\varvec{\beta }})|\right) ^2=O_P(1). \end{aligned}$$
(9)
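As a quick numerical sanity check (ours, not part of the proof) of the bound \(|t_i({\varvec{\beta }})|\le (K+1)\Vert {\mathbf {x}}_i\Vert \Vert {\varvec{\beta }}\Vert +1+\log K\) used above, the following sketch evaluates both sides on random draws; the dimensions are arbitrary:

```python
import numpy as np

rng = np.random.default_rng(1)
K, d = 4, 5                                   # K non-baseline categories, d covariates

for _ in range(1000):
    x = rng.standard_normal(d)
    B = rng.standard_normal((K, d))           # rows are beta_1, ..., beta_K
    k = rng.integers(0, K + 1)                # observed category, 0 = baseline
    delta = (np.arange(1, K + 1) == k).astype(float)
    eta = B @ x                               # x^T beta_l for l = 1..K
    t = delta @ eta - np.log1p(np.exp(eta).sum())
    bound = (K + 1) * np.linalg.norm(x) * np.linalg.norm(B) + 1 + np.log(K)
    assert abs(t) <= bound
print("bound held on all 1000 draws")
```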

Combining (8) and (9), \(\ell _s^*({\varvec{\beta }})-\ell _f({\varvec{\beta }})\rightarrow 0\) in conditional probability given \({\mathcal {D}}_N\). Note that the parameter space is compact, and \({\hat{\varvec{\beta }}}_{{\mathrm {sub}}}\) and \({\hat{\varvec{\beta }}}_{{\mathrm {full}}}\) are the unique global maxima of the continuous concave functions \(\ell _s^*({\varvec{\beta }})\) and \(\ell _f({\varvec{\beta }})\), respectively. Thus, from Theorem 5.9 and its remark in van der Vaart (1998), we obtain that conditionally on \({\mathcal {D}}_N\) in probability,

$$\begin{aligned} \Vert {\hat{\varvec{\beta }}}_{{\mathrm {sub}}}-{\hat{\varvec{\beta }}}_{{\mathrm {full}}}\Vert =o_{P|{\mathcal {D}}_N}(1). \end{aligned}$$
(10)

From Taylor’s theorem (cf. Chap. 4 of Ferguson 1996),

$$\begin{aligned} 0={\dot{\ell }}_{s,j}^*({\hat{\varvec{\beta }}}_{{\mathrm {sub}}}) =&{\dot{\ell }}_{s,j}^*({\hat{\varvec{\beta }}}_{{\mathrm {full}}}) +\frac{\partial {\dot{\ell }}_{s,j}^*({\hat{\varvec{\beta }}}_{{\mathrm {full}}})}{\partial {\varvec{\beta }}^T}({\hat{\varvec{\beta }}}_{{\mathrm {sub}}}-{\hat{\varvec{\beta }}}_{{\mathrm {full}}}) + R_j \end{aligned}$$
(11)

where \({\dot{\ell }}_{s,j}^*({{\varvec{\beta }}})\) is the partial derivative of \(\ell _s^*({{\varvec{\beta }}})\) with respect to \(\beta _j\), and

$$\begin{aligned} R_j=({\hat{\varvec{\beta }}}_{{\mathrm {sub}}}-{\hat{\varvec{\beta }}}_{{\mathrm {full}}})^T \int _0^1\int _0^1\frac{\partial ^2{\dot{\ell }}_{s,j}^*\{{\hat{\varvec{\beta }}}_{{\mathrm {full}}} +uv({\hat{\varvec{\beta }}}_{{\mathrm {sub}}}-{\hat{\varvec{\beta }}}_{{\mathrm {full}}})\}}{\partial {\varvec{\beta }}\partial {\varvec{\beta }}^T}v\mathrm {d}u\mathrm {d}v\ ({\hat{\varvec{\beta }}}_{{\mathrm {sub}}}-{\hat{\varvec{\beta }}}_{{\mathrm {full}}}). \end{aligned}$$

Note that the third partial derivative of the log-likelihood for the subsample takes the form of

$$\begin{aligned} \frac{\partial ^3\ell _s^*({\varvec{\beta }})}{\partial \beta _{j_1}\partial \beta _{j_2}\partial \beta _{j_3}} =\sum _{i=1}^{n}\frac{\alpha _{i,j_1j_2j_3}x_{i,j_1'}x_{i,j_2'}x_{i,j_3'}}{nN\pi ^*_i}, \end{aligned}$$

where \(j_l'=\text {Rem}(j_l/d)+dI\{\text {Rem}(j_l/d)=0\}\), \(l=1,2,3\), \(\text {Rem}(j_l/d)\) is the remainder of the integer division \(j_l/d\), and \(\alpha _{i,j_1j_2j_3}\) satisfies \(|\alpha _{i,j_1j_2j_3}|\le 2\). Here \(|\alpha _{i,j_1j_2j_3}|\le 2\) because it takes one of the forms \(p_{k'}({\mathbf {x}}_i,{\varvec{\beta }})\{1-p_{k'}({\mathbf {x}}_i,{\varvec{\beta }})\}\{1-2p_{k'}({\mathbf {x}}_i,{\varvec{\beta }})\}\), \(p_{k_1'}({\mathbf {x}}_i,{\varvec{\beta }})p_{k_2'}({\mathbf {x}}_i,{\varvec{\beta }})\{2p_{k_2'}({\mathbf {x}}_i,{\varvec{\beta }})-1\}\), or \(2p_{k_1'}({\mathbf {x}}_i,{\varvec{\beta }})p_{k_2'}({\mathbf {x}}_i,{\varvec{\beta }})p_{k_3'}({\mathbf {x}}_i,{\varvec{\beta }})\) for some \(k'\) and \(k_1'\ne k_2'\ne k_3'\). Thus,

$$\begin{aligned} \left\| \frac{\partial ^2{\dot{\ell }}_{s,j}^*({\varvec{\beta }})}{\partial {\varvec{\beta }}\partial {\varvec{\beta }}^T}\right\| \le \frac{2}{n}\sum _{i=1}^{n}\frac{K\Vert {\mathbf {x}}^*_i\Vert ^3}{N\pi ^*_i} \end{aligned}$$

for all \({\varvec{\beta }}\). This gives us that

$$\begin{aligned} \sup _{u,v}\left\| \frac{\partial ^2{\dot{\ell }}_{s,j}^*\{{\hat{\varvec{\beta }}}_{{\mathrm {full}}} +uv({\hat{\varvec{\beta }}}_{{\mathrm {sub}}}-{\hat{\varvec{\beta }}}_{{\mathrm {full}}})\}}{\partial {\varvec{\beta }}\partial {\varvec{\beta }}^T}\right\| =O_{P|{\mathcal {D}}_N}(1), \end{aligned}$$
(12)

because

$$\begin{aligned} P\left( \frac{1}{n}\sum _{i=1}^{n}\frac{\Vert {\mathbf {x}}^*_i\Vert ^3}{N\pi ^*_i}\ge \tau \Bigg |{\mathcal {D}}_N\right) \le \frac{1}{nN\tau }\sum _{i=1}^{n}{\mathbb {E}}\left( \frac{\Vert {\mathbf {x}}^*_i\Vert ^3}{\pi ^*_i}\Bigg |{\mathcal {D}}_N\right) =\frac{1}{N\tau }\sum _{i=1}^{N}\Vert {\mathbf {x}}_i\Vert ^3\rightarrow 0, \end{aligned}$$

in probability as \(\tau \rightarrow \infty \) by Assumption 1. From (12), we have that

$$\begin{aligned} R_j=O_{P|{\mathcal {D}}_N}(\Vert {\hat{\varvec{\beta }}}_{{\mathrm {sub}}}-{\hat{\varvec{\beta }}}_{{\mathrm {full}}}\Vert ^2). \end{aligned}$$
(13)
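For completeness, here is a short check (ours, not part of the original proof) of the bound \(|\alpha _{i,j_1j_2j_3}|\le 2\) used above. Writing \(p_k\) for \(p_k({\mathbf {x}}_i,{\varvec{\beta }})\in (0,1)\), each of the three possible forms satisfies

$$\begin{aligned} |p_{k'}(1-p_{k'})(1-2p_{k'})|&\le \tfrac{1}{4}\cdot 1\le 2,\\ |p_{k_1'}p_{k_2'}(2p_{k_2'}-1)|&\le 1\cdot 1\cdot 1\le 2,\\ |2p_{k_1'}p_{k_2'}p_{k_3'}|&\le 2, \end{aligned}$$

since \(p(1-p)\le \tfrac{1}{4}\) and \(|1-2p|\le 1\) on \((0,1)\).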

Denote \({\mathbf {M}}_n^*=\partial ^2\ell _s^*({\hat{\varvec{\beta }}}_{{\mathrm {full}}})/ \partial {\varvec{\beta }}\partial {\varvec{\beta }}^T=n^{-1}\sum _{i=1}^{n}(N\pi _i^*)^{-1}{\varvec{\phi }}_i^*({\hat{\varvec{\beta }}}_{{\mathrm {full}}})\otimes ({\mathbf {x}}_i^*{{\mathbf {x}}_i^*}^\mathrm{T})\). From (11) and (13), we have

$$\begin{aligned} {\hat{\varvec{\beta }}}_{{\mathrm {sub}}}-{\hat{\varvec{\beta }}}_{{\mathrm {full}}}= -{\mathbf {M}}_n^{*-1}\left\{ {\dot{\ell }}_s^*({\hat{\varvec{\beta }}}_{{\mathrm {full}}}) +O_{P|{\mathcal {D}}_N}(\Vert {\hat{\varvec{\beta }}}_{{\mathrm {sub}}}-{\hat{\varvec{\beta }}}_{{\mathrm {full}}}\Vert ^2)\right\} . \end{aligned}$$
(14)

By direct calculation, we know that

$$\begin{aligned} {\mathbb {E}}({\mathbf {M}}_n^*|{\mathcal {D}}_N)={\mathbf {M}}_N. \end{aligned}$$
(15)

For any component \({\mathbf {M}}_n^{*j_1j_2}\) of \({\mathbf {M}}_n^*\) where \(1\le j_1,j_2\le Kd\),

$$\begin{aligned} {\mathbb {V}}\left( {\mathbf {M}}_n^{*j_1j_2}|{\mathcal {D}}_N\right) =&\frac{1}{n}\sum _{i=1}^{N}\pi _i \left\{ \frac{\{{\varvec{\phi }}_i({\hat{\varvec{\beta }}}_{{\mathrm {full}}})\otimes ({\mathbf {x}}_i{\mathbf {x}}_i^\mathrm{T})\}^{j_1j_2}}{N\pi _i} -{\mathbf {M}}_N^{j_1j_2}\right\} ^2\\ =&\frac{1}{nN^2}\sum _{i=1}^{N}\frac{[\{{\varvec{\phi }}_i({\hat{\varvec{\beta }}}_{{\mathrm {full}}})\otimes ({\mathbf {x}}_i{\mathbf {x}}_i^\mathrm{T})\}^{j_1j_2}]^2}{\pi _i} -\frac{1}{n}({\mathbf {M}}_N^{j_1j_2})^2\\ \le&\frac{1}{nN^2}\sum _{i=1}^{N}\frac{\Vert {\mathbf {x}}_i\Vert ^4}{\pi _i} =O_P(n^{-1}), \end{aligned}$$

where the inequality holds by the fact that all elements of \({\varvec{\phi }}_i\) are between 0 and 1, and the last equality is from Assumption 2. This result, combined with Markov’s inequality and (15), implies that

$$\begin{aligned} {\mathbf {M}}_n^*-{\mathbf {M}}_N&=O_{P|{\mathcal {D}}_N}(n^{-1/2}). \end{aligned}$$
(16)

By direct calculation, we have

$$\begin{aligned} {\mathbb {E}}\left\{ \frac{\partial \ell _s^*({\hat{\varvec{\beta }}}_{{\mathrm {full}}})}{\partial {\varvec{\beta }}}\bigg |{\mathcal {D}}_N\right\} =\frac{\partial \ell _f({\hat{\varvec{\beta }}}_{{\mathrm {full}}})}{\partial {\varvec{\beta }}}= 0. \end{aligned}$$
(17)

Note that

$$\begin{aligned} {\mathbb {V}}\left\{ \frac{\partial \ell _s^*({\hat{\varvec{\beta }}}_{{\mathrm {full}}})}{\partial {\varvec{\beta }}}\bigg |{\mathcal {D}}_N\right\} =\frac{1}{nN^2}\sum _{i=1}^{N}\frac{\varvec{\psi }_i({\hat{\varvec{\beta }}}_{{\mathrm {full}}})\otimes ({\mathbf {x}}_i{\mathbf {x}}_i^\mathrm{T})}{\pi _i}, \end{aligned}$$
(18)

whose elements are bounded by \((nN^2)^{-1}\sum _{i=1}^{N}\pi _i^{-1}\Vert {\mathbf {x}}_i\Vert ^2\), which is of order \(O_P(n^{-1})\) by Assumption 2. From (17), (18) and Markov’s inequality, we know that

$$\begin{aligned} \frac{\partial \ell _s^*({\hat{\varvec{\beta }}}_{{\mathrm {full}}})}{\partial {\varvec{\beta }}}&=O_{P|{\mathcal {D}}_N}(n^{-1/2}). \end{aligned}$$
(19)
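For readability, the Markov-inequality step behind (19) can be spelled out as follows (a routine expansion added by us): for any \(\tau >0\),

$$\begin{aligned} {\mathbb {P}}\left\{ \left\| \frac{\partial \ell _s^*({\hat{\varvec{\beta }}}_{{\mathrm {full}}})}{\partial {\varvec{\beta }}}\right\| \ge \tau n^{-1/2}\Bigg |{\mathcal {D}}_N\right\} \le \frac{n}{\tau ^2}\,{\mathbb {E}}\left\{ \left\| \frac{\partial \ell _s^*({\hat{\varvec{\beta }}}_{{\mathrm {full}}})}{\partial {\varvec{\beta }}}\right\| ^2\Bigg |{\mathcal {D}}_N\right\} =\frac{n}{\tau ^2}\,\mathrm {tr}\left[ {\mathbb {V}}\left\{ \frac{\partial \ell _s^*({\hat{\varvec{\beta }}}_{{\mathrm {full}}})}{\partial {\varvec{\beta }}}\bigg |{\mathcal {D}}_N\right\} \right] =\frac{O_P(1)}{\tau ^2}, \end{aligned}$$

where the first equality uses the zero conditional mean in (17) and the last equality uses the \(O_P(n^{-1})\) bound on the elements of (18); the right-hand side can be made arbitrarily small by taking \(\tau \) large, which gives (19).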

Note that (16) indicates that \({\mathbf {M}}_n^{*-1} =O_{P|{\mathcal {D}}_N}(1)\). Combining this with (10), (14) and (19), we have

$$\begin{aligned} {\hat{\varvec{\beta }}}_{{\mathrm {sub}}}-{\hat{\varvec{\beta }}}_{{\mathrm {full}}} =O_{P|{\mathcal {D}}_N}(n^{-1/2})+o_{P|{\mathcal {D}}_N}(\Vert {\hat{\varvec{\beta }}}_{{\mathrm {sub}}}-{\hat{\varvec{\beta }}}_{{\mathrm {full}}}\Vert ), \end{aligned}$$

which implies that

$$\begin{aligned} {\hat{\varvec{\beta }}}_{{\mathrm {sub}}}-{\hat{\varvec{\beta }}}_{{\mathrm {full}}}=O_{P|{\mathcal {D}}_N}(n^{-1/2}). \end{aligned}$$
(20)

Note that

$$\begin{aligned} {\dot{\ell }}_s^*({\hat{\varvec{\beta }}}_{{\mathrm {full}}}) =\frac{1}{n}\sum _{i=1}^{n}\frac{{\mathbf {s}}_i^*({\hat{\varvec{\beta }}}_{{\mathrm {full}}}) \otimes {\mathbf {x}}_i^*}{N\pi ^*_i} \equiv \frac{1}{n}\sum _{i=1}^{n}{\varvec{\eta }}_i \end{aligned}$$
(21)

Given \({\mathcal {D}}_N\), \({\varvec{\eta }}_1, \ldots , {\varvec{\eta }}_n\) are i.i.d. with mean \({\mathbf {0}}\) and variance

$$\begin{aligned}&{\mathbb {V}}({\varvec{\eta }}_i|{\mathcal {D}}_N)={\mathbf {V}}_{Nc}=O_P(1). \end{aligned}$$
(22)

Meanwhile, for every \(\varepsilon >0\) and some \(\rho >0\),

$$\begin{aligned}&\sum _{i=1}^{n}{\mathbb {E}}\{\Vert n^{-1/2}{\varvec{\eta }}_i\Vert ^2 I(\Vert {\varvec{\eta }}_i\Vert>n^{1/2}\varepsilon )|{\mathcal {D}}_N\}\\&\quad \le \frac{1}{n^{1+\rho /2}\varepsilon ^{\rho }} \sum _{i=1}^{n}{\mathbb {E}}\{\Vert {\varvec{\eta }}_i\Vert ^{2+\rho } I(\Vert {\varvec{\eta }}_i\Vert >n^{1/2}\varepsilon )|{\mathcal {D}}_N\}\\&\quad \le \frac{1}{n^{1+\rho /2}\varepsilon ^{\rho }} \sum _{i=1}^{n}{\mathbb {E}}(\Vert {\varvec{\eta }}_i\Vert ^{2+\rho }|{\mathcal {D}}_N)\\&\quad =\frac{1}{n^{\rho /2}}\frac{1}{N^{2+\rho }}\frac{1}{\varepsilon ^{\rho }} \sum _{i=1}^{N}\frac{\Vert {\mathbf {s}}_i({\hat{\varvec{\beta }}}_{{\mathrm {full}}})\Vert ^{2+\rho } \Vert {\mathbf {x}}_i\Vert ^{2+\rho }}{\pi _i^{1+\rho }}\\&\quad \le \frac{1}{n^{\rho /2}}\frac{1}{N^{2+\rho }}\frac{1}{\varepsilon ^{\rho }} \sum _{i=1}^{N}\frac{\Vert {\mathbf {x}}_i\Vert ^{2+\rho }}{\pi _i^{1+\rho }}=o_P(1) \end{aligned}$$

where the last equality is from Assumption 2. This and (22) show that the Lindeberg–Feller conditions are satisfied, in probability, under the conditional distribution given \({\mathcal {D}}_N\). From (21) and (22), by the Lindeberg–Feller central limit theorem (Proposition 2.27 of van der Vaart 1998), conditionally on \({\mathcal {D}}_N\) in probability,

$$\begin{aligned} n^{1/2}{\mathbf {V}}_{Nc}^{-1/2}{\dot{\ell }}_s^*({\hat{\varvec{\beta }}}_{{\mathrm {full}}})= \frac{1}{n^{1/2}}\{{\mathbb {V}}({\varvec{\eta }}_i|{\mathcal {D}}_N)\}^{-1/2}\sum _{i=1}^{n}{\varvec{\eta }}_i \rightarrow {\mathbb {N}}(0,{\mathbf {I}}), \end{aligned}$$

in distribution. From (14) and (20),

$$\begin{aligned} {\hat{\varvec{\beta }}}_{{\mathrm {sub}}}-{\hat{\varvec{\beta }}}_{{\mathrm {full}}}= -{\mathbf {M}}_n^{*-1}{\dot{\ell }}_s^*({\hat{\varvec{\beta }}}_{{\mathrm {full}}})+O_{P|{\mathcal {D}}_N}(n^{-1}). \end{aligned}$$
(23)

From (16),

$$\begin{aligned} {\mathbf {M}}_n^{*-1}-{\mathbf {M}}_N^{-1}&=-{\mathbf {M}}_N^{-1}({\mathbf {M}}_n^*-{\mathbf {M}}_N){\mathbf {M}}_n^{*-1} =O_{P|{\mathcal {D}}_N}(n^{-1/2}). \end{aligned}$$
(24)

Based on Assumption 1 and (22), it can be verified that

$$\begin{aligned} {\mathbf {V}}={\mathbf {M}}_N^{-1}{\mathbf {V}}_{Nc}{\mathbf {M}}_N^{-1}=O_{P}(1). \end{aligned}$$
(25)

Thus, (23), (24) and (25) yield,

$$\begin{aligned}&n^{1/2}{\mathbf {V}}^{-1/2}({\hat{\varvec{\beta }}}_{{\mathrm {sub}}}-{\hat{\varvec{\beta }}}_{{\mathrm {full}}})\\&\quad =-n^{1/2}{\mathbf {V}}^{-1/2}{\mathbf {M}}_n^{*-1}{\dot{\ell }}_s^*({\hat{\varvec{\beta }}}_{{\mathrm {full}}}) +O_{P|{\mathcal {D}}_N}(n^{-1/2})\\&\quad =-{\mathbf {V}}^{-1/2}{\mathbf {M}}_N^{-1}n^{1/2}{\dot{\ell }}_s^*({\hat{\varvec{\beta }}}_{{\mathrm {full}}}) -{\mathbf {V}}^{-1/2}({\mathbf {M}}_n^{*-1}-{\mathbf {M}}_N^{-1})n^{1/2}{\dot{\ell }}_s^*({\hat{\varvec{\beta }}}_{{\mathrm {full}}}) +O_{P|{\mathcal {D}}_N}(n^{-1/2})\\&\quad =-{\mathbf {V}}^{-1/2}{\mathbf {M}}_N^{-1}{\mathbf {V}}_{Nc}^{1/2}{\mathbf {V}}_{Nc}^{-1/2}n^{1/2}{\dot{\ell }}_s^*({\hat{\varvec{\beta }}}_{{\mathrm {full}}}) +O_{P|{\mathcal {D}}_N}(n^{-1/2}). \end{aligned}$$

The result in Theorem 1 follows from Slutsky's theorem (Theorem 6 of Ferguson 1996) and the fact that

$$\begin{aligned} {\mathbf {V}}^{-1/2}{\mathbf {M}}_N^{-1}{\mathbf {V}}_{Nc}^{1/2}({\mathbf {V}}^{-1/2}{\mathbf {M}}_N^{-1}{\mathbf {V}}_{Nc}^{1/2})^T ={\mathbf {V}}^{-1/2}{\mathbf {M}}_N^{-1}{\mathbf {V}}_{Nc}^{1/2}{\mathbf {V}}_{Nc}^{1/2}{\mathbf {M}}_N^{-1}{\mathbf {V}}^{-1/2}={\mathbf {I}}. \end{aligned}$$

\(\square \)
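To illustrate the structure of the variance \({\mathbf {V}}={\mathbf {M}}_N^{-1}{\mathbf {V}}_{Nc}{\mathbf {M}}_N^{-1}\) appearing in Theorem 1, the sketch below (ours) assembles \({\mathbf {M}}_N\) and \({\mathbf {V}}_{Nc}\) from full-data quantities, assuming the usual softmax structure \({\varvec{\phi }}_i=\mathrm {diag}({\mathbf {p}}_i)-{\mathbf {p}}_i{\mathbf {p}}_i^\mathrm{T}\) and \({\mathbf {s}}_i={\varvec{\delta }}_i-{\mathbf {p}}_i\); the exact definitions of \({\varvec{\phi }}_i\) and \({\mathbf {M}}_N\) are in the main text, so treat this only as an illustration of the Kronecker-product form used in the proof.

```python
import numpy as np

def plug_in_variance(X, probs, delta, pi):
    """V = M_N^{-1} V_Nc M_N^{-1} from full-data quantities (illustrative only).

    X     : (N, d) covariate matrix
    probs : (N, K) fitted non-baseline probabilities p_k(x_i, beta_hat_full)
    delta : (N, K) one-hot indicators delta_{i,k} for the non-baseline categories
    pi    : (N,)   subsampling probabilities, summing to one
    """
    N, d = X.shape
    K = probs.shape[1]
    M_N = np.zeros((K * d, K * d))
    V_Nc = np.zeros((K * d, K * d))
    for i in range(N):
        xxT = np.outer(X[i], X[i])
        phi = np.diag(probs[i]) - np.outer(probs[i], probs[i])  # assumed phi_i
        s = delta[i] - probs[i]                                 # assumed s_i
        M_N += np.kron(phi, xxT) / N                            # implied by (15)
        V_Nc += np.kron(np.outer(s, s), xxT) / (pi[i] * N**2)   # psi_i = s_i s_i^T, cf. (18)
    M_inv = np.linalg.inv(M_N)
    return M_inv @ V_Nc @ M_inv
```

The explicit loop simply mirrors the sums behind (15) and (18); in practice it would be vectorized.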

Proof

(Theorem 2) Note that \(\varvec{\psi }_i({\hat{\varvec{\beta }}}_{{\mathrm {full}}})={\mathbf {s}}_i({\hat{\varvec{\beta }}}_{{\mathrm {full}}}){\mathbf {s}}_i^\mathrm{T}({\hat{\varvec{\beta }}}_{{\mathrm {full}}})\), so

$$\begin{aligned} {\varvec{\psi }}_i({\hat{\varvec{\beta }}}_{{\mathrm {full}}}) \otimes ({\mathbf {x}}_i{\mathbf {x}}_i^\mathrm{T}) =\{{\mathbf {s}}_i({\hat{\varvec{\beta }}}_{{\mathrm {full}}}) \otimes {\mathbf {x}}_i\} \{{\mathbf {s}}_i^\mathrm{T}({\hat{\varvec{\beta }}}_{{\mathrm {full}}}) \otimes {\mathbf {x}}_i^\mathrm{T}\}. \end{aligned}$$
(26)

Therefore, for the A-optimality,

$$\begin{aligned} \mathrm {tr}({\mathbf {V}}_N)&= \mathrm {tr}({\mathbf {M}}_N^{-1}{\mathbf {V}}_{Nc}{\mathbf {M}}_N^{-1}) \\&= \frac{1}{N^2} \mathrm {tr}\bigg \{{\mathbf {M}}_N^{-1} \sum _{i=1}^{N}\frac{{\varvec{\psi }}_i({\hat{\varvec{\beta }}}_{{\mathrm {full}}}) \otimes ({\mathbf {x}}_i{\mathbf {x}}_i^\mathrm{T})}{\pi _i}{\mathbf {M}}_N^{-1}\bigg \} \\&= \frac{1}{N^2} \sum _{i=1}^{N}\frac{\mathrm {tr}\big [{\mathbf {M}}_N^{-1}\{{\mathbf {s}}_i({\hat{\varvec{\beta }}}_{{\mathrm {full}}}) \otimes {\mathbf {x}}_i\} \{{\mathbf {s}}_i^\mathrm{T}({\hat{\varvec{\beta }}}_{{\mathrm {full}}}) \otimes {\mathbf {x}}_i^\mathrm{T}\}{\mathbf {M}}_N^{-1}\big ]}{\pi _i}\\&= \frac{1}{N^2} \sum _{i=1}^{N}\frac{\Vert {\mathbf {M}}_N^{-1}\{{\mathbf {s}}_i({\hat{\varvec{\beta }}}_{{\mathrm {full}}})\otimes {\mathbf {x}}_i\}\Vert ^2}{\pi _i}\times \sum _{i=1}^{N}\pi _i\\&\ge \bigg \{\frac{1}{N}\sum _{i=1}^{N}\Vert {\mathbf {M}}_N^{-1}\{{\mathbf {s}}_i({\hat{\varvec{\beta }}}_{{\mathrm {full}}})\otimes {\mathbf {x}}_i\}\Vert \bigg \}^2. \end{aligned}$$

Here, the last step is due to the Cauchy–Schwarz inequality, and the equality holds if and only if \(\pi _i\) is proportional to \(\Vert {\mathbf {M}}_N^{-1}\{{\mathbf {s}}_i({\hat{\varvec{\beta }}}_{{\mathrm {full}}})\otimes {\mathbf {x}}_i\}\Vert \). Thus, the A-optimal subsampling probabilities take the form of (5).
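In code, the resulting A-optimal probabilities \(\pi _i\propto \Vert {\mathbf {M}}_N^{-1}\{{\mathbf {s}}_i({\hat{\varvec{\beta }}}_{{\mathrm {full}}})\otimes {\mathbf {x}}_i\}\Vert \) could be computed as in the following sketch (ours, using the same assumed score structure \({\mathbf {s}}_i={\varvec{\delta }}_i-{\mathbf {p}}_i\) as above):

```python
import numpy as np

def a_optimal_probs(X, probs, delta, M_N_inv):
    """pi_i proportional to || M_N^{-1} (s_i kron x_i) ||, normalized to sum to one."""
    S = delta - probs                                        # (N, K) rows s_i
    G = np.einsum("ik,id->ikd", S, X).reshape(len(X), -1)    # rows s_i kron x_i
    norms = np.linalg.norm(G @ M_N_inv.T, axis=1)
    return norms / norms.sum()
```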

For the L-optimality, the proof is similar by noticing that

$$\begin{aligned} \mathrm {tr}({\mathbf {V}}_{Nc})&= \frac{1}{N^2}\sum _{i=1}^{N}\frac{\mathrm {tr}\{{\varvec{\psi }}_i({\hat{\varvec{\beta }}}_{{\mathrm {full}}})\otimes ({\mathbf {x}}_i{\mathbf {x}}_i^\mathrm{T})\}}{\pi _i} \\&= \frac{1}{N^2}\sum _{i=1}^{N}\frac{\mathrm {tr}\{{\varvec{\psi }}_i({\hat{\varvec{\beta }}}_{{\mathrm {full}}})\} \mathrm {tr}({\mathbf {x}}_i{\mathbf {x}}_i^\mathrm{T})}{\pi _i} = \frac{1}{N^2}\sum _{i=1}^{N}\frac{\Vert {{\mathbf {s}}}_i({\hat{\varvec{\beta }}}_{{\mathrm {full}}})\Vert ^2 \Vert {\mathbf {x}}_i\Vert ^2}{\pi _i} \\&= \frac{1}{N^2}\sum _{i=1}^{N}\frac{\big [\sum _{k=1}^{K}\{\delta _{i,k} - p_k({\mathbf {x}}_i, {\hat{\varvec{\beta }}}_{{\mathrm {full}}})\}^2\big ]\Vert {\mathbf {x}}_i\Vert ^2}{\pi _i}\times \sum _{i=1}^{N}\pi _i\\&\ge \Bigg ( \frac{1}{N}\sum _{i=1}^{N}\bigg [\sum _{k=1}^{K}\{\delta _{i,k} - p_k({\mathbf {x}}_i, {\hat{\varvec{\beta }}}_{{\mathrm {full}}})\}^2\bigg ]^{1/2}\Vert {\mathbf {x}}_i\Vert \Bigg )^2, \end{aligned}$$

where the last step again uses the Cauchy–Schwarz inequality, and the equality holds if and only if \(\pi _i\) is proportional to \(\Vert {\mathbf {s}}_i({\hat{\varvec{\beta }}}_{{\mathrm {full}}})\Vert \Vert {\mathbf {x}}_i\Vert \).

\(\square \)
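The corresponding L-optimal probabilities \(\pi _i\propto \Vert {\mathbf {s}}_i({\hat{\varvec{\beta }}}_{{\mathrm {full}}})\Vert \Vert {\mathbf {x}}_i\Vert \) drop the \({\mathbf {M}}_N^{-1}\) factor and are therefore cheaper to compute; a corresponding sketch (again with our own names):

```python
import numpy as np

def l_optimal_probs(X, probs, delta):
    """pi_i proportional to ||s_i|| * ||x_i||, normalized to sum to one."""
    w = np.linalg.norm(delta - probs, axis=1) * np.linalg.norm(X, axis=1)
    return w / w.sum()
```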


Cite this article

Yao, Y., Wang, H. Optimal subsampling for softmax regression. Stat Papers 60, 585–599 (2019). https://doi.org/10.1007/s00362-018-01068-6
