Abstract
To meet the challenge of massive data, Wang et al. (J Am Stat Assoc 113(522):829–844, 2018b) developed an optimal subsampling method for logistic regression. The purpose of this paper is to extend their method to softmax regression, which is also called multinomial logistic regression and is commonly used to model data with multiple categorical responses. We first derive the asymptotic distribution of the general subsampling estimator, and then derive optimal subsampling probabilities under the A-optimality criterion and the L-optimality criterion with a specific L matrix. Since the optimal subsampling probabilities depend on the unknowns, we adopt a two-stage adaptive procedure to address this issue and use numerical simulations to demonstrate its performance.
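The two-stage adaptive procedure summarized above can be illustrated with a small numerical sketch. The following Python code is an illustrative implementation, not the authors' code: it draws a uniform pilot subsample to obtain a pilot estimate, computes sampling probabilities proportional to \(\Vert {\mathbf {s}}_i({\hat{\varvec{\beta }}}_0)\Vert \,\Vert {\mathbf {x}}_i\Vert \) (the L-optimal-style form with a specific L matrix), and refits with inverse-probability weights; all function names, the plain gradient-ascent fitter, and the tuning constants are our own assumptions.

```python
import numpy as np

def softmax_probs(X, B):
    """P(y_i = k | x_i) for k = 1..K, with category 0 as baseline."""
    eta = np.hstack([np.zeros((X.shape[0], 1)), X @ B])  # linear predictors, baseline 0
    eta -= eta.max(axis=1, keepdims=True)                # numerical stability
    e = np.exp(eta)
    return (e / e.sum(axis=1, keepdims=True))[:, 1:]     # drop baseline column

def fit_weighted(X, Y, w, iters=800, lr=0.5):
    """Maximize the weighted softmax log-likelihood by gradient ascent.
    Y is N x K one-hot for categories 1..K (an all-zero row = baseline)."""
    B = np.zeros((X.shape[1], Y.shape[1]))
    for _ in range(iters):
        B += lr * X.T @ (w[:, None] * (Y - softmax_probs(X, B))) / w.sum()
    return B

def two_stage_subsample(X, Y, n0, n, rng):
    """Stage 1: uniform pilot of size n0 gives a pilot estimate B0.
    Stage 2: draw n points with pi_i proportional to ||s_i(B0)|| * ||x_i||,
    then refit with 1/pi_i weights."""
    N = X.shape[0]
    idx0 = rng.choice(N, size=n0, replace=True)
    B0 = fit_weighted(X[idx0], Y[idx0], np.ones(n0))
    S = Y - softmax_probs(X, B0)                        # score factors s_i(B0)
    pi = np.linalg.norm(S, axis=1) * np.linalg.norm(X, axis=1)
    pi = pi / pi.sum()
    idx = rng.choice(N, size=n, replace=True, p=pi)
    return fit_weighted(X[idx], Y[idx], 1.0 / pi[idx])  # inverse-probability weights
```

In practice the optimal probabilities are often mixed with a small uniform component to keep the weights \(1/\pi_i\) bounded; the sketch omits this for brevity.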
References
Atkinson A, Donev A, Tobias R (2007) Optimum experimental designs, with SAS, vol 34. Oxford University Press, Oxford
Drineas P, Kannan R, Mahoney MW (2006a) Fast Monte Carlo algorithms for matrices I: approximating matrix multiplication. SIAM J Comput 36(1):132–157
Drineas P, Kannan R, Mahoney MW (2006b) Fast Monte Carlo algorithms for matrices II: computing a low-rank approximation to a matrix. SIAM J Comput 36(1):158–183
Drineas P, Kannan R, Mahoney MW (2006c) Fast Monte Carlo algorithms for matrices III: computing a compressed approximate matrix decomposition. SIAM J Comput 36(1):184–206
Drineas P, Mahoney MW, Muthukrishnan S (2006d) Sampling algorithms for \(l_2\) regression and applications. In: Proceedings of the seventeenth annual ACM-SIAM symposium on discrete algorithms. Society for Industrial and Applied Mathematics, Philadelphia, pp 1127–1136
Drineas P, Mahoney M, Muthukrishnan S (2008) Relative-error CUR matrix decomposition. SIAM J Matrix Anal Appl 30:844–881
Drineas P, Mahoney M, Muthukrishnan S, Sarlos T (2011) Faster least squares approximation. Numer Math 117:219–249
Ferguson TS (1996) A course in large sample theory. Chapman and Hall, London
Frieze A, Kannan R, Vempala S (2004) Fast Monte-Carlo algorithms for finding low-rank approximations. J ACM 51:1025–1041
Lane A, Yao P, Flournoy N (2014) Information in a two-stage adaptive optimal design. J Stat Plan Inference 144:173–187
Ma P, Mahoney M, Yu B (2015) A statistical perspective on algorithmic leveraging. J Mach Learn Res 16:861–911
Mahoney MW (2011) Randomized algorithms for matrices and data. Found Trends Mach Learn 3(2):123–224
Mahoney MW, Drineas P (2009) CUR matrix decompositions for improved data analysis. Proc Natl Acad Sci USA 106(3):697–702
Ortega JM, Rheinboldt WC (1970) Iterative solution of nonlinear equations in several variables, vol 30. SIAM, Philadelphia
R Core Team (2017) R: a language and environment for statistical computing. R Foundation for Statistical Computing, Vienna. https://www.R-project.org/
Raskutti G, Mahoney M (2016) A statistical perspective on randomized sketching for ordinary least-squares. J Mach Learn Res 17:1–31
van der Vaart A (1998) Asymptotic statistics. Cambridge University Press, Cambridge
Wang H (2018) More efficient estimation for logistic regression with optimal subsample. arXiv preprint. arXiv:1802.02698
Wang H, Yang M, Stufken J (2018a) Information-based optimal subdata selection for big data linear regression. J Am Stat Assoc. https://doi.org/10.1080/01621459.2017.1408468
Wang H, Zhu R, Ma P (2018b) Optimal subsampling for large sample logistic regression. J Am Stat Assoc 113(522):829–844
Acknowledgements
We gratefully acknowledge the comments from two referees that helped improve the paper.
This work was supported by NSF Grant 1812013, a UCONN REP grant, and a GPU Grant from NVIDIA Corporation.
Appendix: Proofs
We prove the two theorems in this section. We use \(O_{P|{\mathcal {D}}_N}(1)\) and \(o_{P|{\mathcal {D}}_N}(1)\) to denote boundedness and convergence to zero, respectively, in conditional probability given the full data. Specifically, for a sequence of random vectors \({\mathbf {v}}_{n,N}\), as \(n\rightarrow \infty \) and \(N\rightarrow \infty \), \({\mathbf {v}}_{n,N}=O_{P|{\mathcal {D}}_N}(1)\) means that for any \(\epsilon >0\), there exists a finite \(C_\epsilon >0\) such that
\({\mathbf {v}}_{n,N}=o_{P|{\mathcal {D}}_N}(1)\) means that for any \(\epsilon >0\) and \(\delta >0\),
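For concreteness, one standard way to formalize these two conditions, following the analogous definitions for logistic regression in Wang et al. (2018b), is
\[
{\mathbf {v}}_{n,N}=O_{P|{\mathcal {D}}_N}(1):\quad
P\bigl\{P(\Vert {\mathbf {v}}_{n,N}\Vert \ge C_\epsilon \mid {\mathcal {D}}_N)>\epsilon \bigr\}\longrightarrow 0,
\qquad
{\mathbf {v}}_{n,N}=o_{P|{\mathcal {D}}_N}(1):\quad
P\bigl\{P(\Vert {\mathbf {v}}_{n,N}\Vert \ge \delta \mid {\mathcal {D}}_N)>\epsilon \bigr\}\longrightarrow 0,
\]
where in the first display \(C_\epsilon \) is the finite constant guaranteed for the given \(\epsilon \), and in the second the convergence holds for every \(\epsilon >0\) and \(\delta >0\).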
Proof
(Theorem 1) By direct calculation under the conditional distribution of the subsample given \({\mathcal {D}}_N\), we have
where \(t_i({\varvec{\beta }})=\sum _{k=1}^{K}\delta _{i,k}{\mathbf {x}}_i^\mathrm{T}{\varvec{\beta }}_k- \log \Big \{1+\sum _{l=1}^Ke^{{\mathbf {x}}_i^\mathrm{T}{\varvec{\beta }}_l}\Big \}\). Note that
where the second inequality is from the fact that \(\Vert {\varvec{\beta }}_k\Vert \le \Vert {\varvec{\beta }}\Vert \), and the third inequality is from the fact that \(\log (1+x)<1+\log x\) for \(x\ge 1\). Therefore, from Assumption 2,
Combining (8) and (9), \(\ell _s^*({\varvec{\beta }})-\ell _f({\varvec{\beta }})\rightarrow 0\) in conditional probability given \({\mathcal {D}}_N\). Note that the parameter space is compact, and \({\hat{\varvec{\beta }}}_{{\mathrm {sub}}}\) and \({\hat{\varvec{\beta }}}_{{\mathrm {full}}}\) are the unique global maximizers of the continuous concave functions \(\ell _s^*({\varvec{\beta }})\) and \(\ell _f({\varvec{\beta }})\), respectively. Thus, from Theorem 5.9 of van der Vaart (1998) and the remark following it, we obtain that conditionally on \({\mathcal {D}}_N\) in probability,
From Taylor’s theorem (c.f. Chap. 4 of Ferguson 1996),
where \({\dot{\ell }}_{s,j}^*({{\varvec{\beta }}})\) is the partial derivative of \(\ell _s^*({{\varvec{\beta }}})\) with respect to \(\beta _j\), and
Note that the third partial derivative of the log-likelihood for the subsample takes the form of
where \(j_l'=\text {Rem}(j_l/d)+dI\{\text {Rem}(j_l/d)=0\}\), \(l=1,2,3\), \(\text {Rem}(j_l/d)\) is the remainder of the integer division \(j_l/d\), and \(\alpha _{i,j_1j_2j_3}\) satisfies \(|\alpha _{i,j_1j_2j_3}|\le 2\). Here \(|\alpha _{i,j_1j_2j_3}|\le 2\) because it has the form \(p_{k'}({\mathbf {x}}_i,{\varvec{\beta }})\{1-p_{k'}({\mathbf {x}}_i,{\varvec{\beta }})\}\{1-2p_{k'}({\mathbf {x}}_i,{\varvec{\beta }})\}\), \(p_{k_1'}({\mathbf {x}}_i,{\varvec{\beta }})p_{k_2'}({\mathbf {x}}_i,{\varvec{\beta }})\{2p_{k_2'}({\mathbf {x}}_i,{\varvec{\beta }})-1\}\), or \(2p_{k_1'}({\mathbf {x}}_i,{\varvec{\beta }})p_{k_2'}({\mathbf {x}}_i,{\varvec{\beta }})p_{k_3'}({\mathbf {x}}_i,{\varvec{\beta }})\) for some \(k'\) and \(k_1'\ne k_2'\ne k_3'\). Thus,
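The index mapping \(j_l\mapsto j_l'\) above simply recovers which of the \(d\) covariate coordinates a flattened index \(j\in \{1,\ldots ,Kd\}\) refers to. As a quick sanity check (the function name is ours, not from the paper):

```python
def coord(j, d):
    """Map a 1-based flattened index j in {1, ..., K*d} to its covariate
    coordinate j' in {1, ..., d}: j' = Rem(j/d) + d * I{Rem(j/d) = 0}."""
    r = j % d              # Rem(j/d); r == 0 exactly when d divides j
    return r if r != 0 else d

# With d = 3, flattened indices 1..6 map to coordinates 1, 2, 3, 1, 2, 3.
print([coord(j, 3) for j in range(1, 7)])  # [1, 2, 3, 1, 2, 3]
```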
for all \({\varvec{\beta }}\). This gives us that
because
in probability as \(\tau \rightarrow \infty \) by Assumption 1. From (12), we have that
Denote \({\mathbf {M}}_n^*=\partial ^2\ell _s^*({\varvec{\beta }})/ \partial {\varvec{\beta }}\partial {\varvec{\beta }}^\mathrm{T}=n^{-1}\sum _{i=1}^{n}(N\pi _i^*)^{-1}{\varvec{\phi }}_i^*({\hat{\varvec{\beta }}}_{{\mathrm {full}}})\otimes ({\mathbf {x}}_i^*{{\mathbf {x}}_i^*}^\mathrm{T})\). From (11) and (12), we have
By direct calculation, we know that
For any component \({\mathbf {M}}_n^{*j_1j_2}\) of \({\mathbf {M}}_n^*\) where \(1\le j_1,j_2\le d\),
where the second-to-last inequality holds because all elements of \({\varvec{\phi }}_i\) are between 0 and 1, and the last equality follows from Assumption 2. This result, combined with Markov's inequality and (15), implies that
By direct calculation, we have
Note that
whose elements are bounded by \((nN^2)^{-1}\sum _{i=1}^{N}\pi _i^{-1}\Vert {\mathbf {x}}_i\Vert ^2\) which is of order \(O_P(n^{-1})\) by Assumption 2. From (17), (18) and Markov’s inequality, we know that
Note that (16) indicates that \({\mathbf {M}}_n^{*-1} =O_{P|{\mathcal {D}}_N}(1)\). Combining this with (10), (14) and (19), we have
which implies that
Note that
Given \({\mathcal {D}}_N\), \({\varvec{\eta }}_1, \ldots , {\varvec{\eta }}_n\) are i.i.d. with mean \({\mathbf {0}}\) and variance
Meanwhile, for every \(\varepsilon >0\) and some \(\rho >0\),
where the last equality follows from Assumption 2. This and (22) show that the Lindeberg–Feller conditions are satisfied under the conditional distribution, in probability. From (21) and (22), by the Lindeberg–Feller central limit theorem (Proposition 2.27 of van der Vaart 1998), conditionally on \({\mathcal {D}}_N\) in probability,
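For reference, the version of the theorem invoked here (Proposition 2.27 of van der Vaart 1998, stated for triangular arrays) reads: if \({\mathbf {Y}}_{n,1},\ldots ,{\mathbf {Y}}_{n,n}\) are independent random vectors with mean zero such that, for every \(\varepsilon >0\),
\[
\sum _{i=1}^{n}E\Vert {\mathbf {Y}}_{n,i}\Vert ^2 I\{\Vert {\mathbf {Y}}_{n,i}\Vert >\varepsilon \}\rightarrow 0
\quad \text {and}\quad
\sum _{i=1}^{n}\mathrm {Cov}({\mathbf {Y}}_{n,i})\rightarrow \varvec{\Sigma },
\]
then \(\sum _{i=1}^{n}{\mathbf {Y}}_{n,i}\rightarrow N({\mathbf {0}},\varvec{\Sigma })\) in distribution. In the present proof these two conditions are verified for \({\varvec{\eta }}_i/\sqrt{n}\) under the conditional distribution given \({\mathcal {D}}_N\).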
in distribution. From (14) and (20),
From (16),
Based on Assumption 1 and (22), it can be verified that
Thus, (23), (24) and (25) yield,
The result in Theorem 1 follows from Slutsky's theorem (Theorem 6 of Ferguson 1996) and the fact that
\(\square \)
Proof
(Theorem 2) Note that \(\varvec{\psi }_i({\hat{\varvec{\beta }}}_{{\mathrm {full}}})={\mathbf {s}}_i({\hat{\varvec{\beta }}}_{{\mathrm {full}}}){\mathbf {s}}_i^\mathrm{T}({\hat{\varvec{\beta }}}_{{\mathrm {full}}})\), so
Therefore, for the A-optimality,
Here, the last step is due to the Cauchy–Schwarz inequality, and the equality holds if and only if \(\pi _i\) is proportional to \(\Vert {\mathbf {M}}_N^{-1}\{{\mathbf {s}}_i({\hat{\varvec{\beta }}}_{{\mathrm {full}}})\otimes {\mathbf {x}}_i\}\Vert \). Thus, the A-optimal subsampling probabilities take the form of (5).
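The Cauchy–Schwarz step can be spelled out as follows. Writing \(a_i=\Vert {\mathbf {M}}_N^{-1}\{{\mathbf {s}}_i({\hat{\varvec{\beta }}}_{{\mathrm {full}}})\otimes {\mathbf {x}}_i\}\Vert \), the trace criterion to be minimized over the \(\pi _i\), subject to \(\sum _{i=1}^{N}\pi _i=1\), is proportional (up to a factor not depending on the \(\pi _i\)) to \(\sum _{i=1}^{N}\pi _i^{-1}a_i^2\), and
\[
\sum _{i=1}^{N}\frac{a_i^2}{\pi _i}
=\Bigl (\sum _{i=1}^{N}\pi _i\Bigr )\Bigl (\sum _{i=1}^{N}\frac{a_i^2}{\pi _i}\Bigr )
\ge \Bigl (\sum _{i=1}^{N}\frac{a_i}{\sqrt{\pi _i}}\cdot \sqrt{\pi _i}\Bigr )^2
=\Bigl (\sum _{i=1}^{N}a_i\Bigr )^2,
\]
where the lower bound does not depend on the \(\pi _i\) and is attained if and only if \(a_i/\sqrt{\pi _i}\propto \sqrt{\pi _i}\), i.e., \(\pi _i=a_i/\sum _{j=1}^{N}a_j\).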
For the L-optimality, the proof is similar by noticing that
\(\square \)
Yao, Y., Wang, H. Optimal subsampling for softmax regression. Stat Papers 60, 585–599 (2019). https://doi.org/10.1007/s00362-018-01068-6