Abstract
To meet the challenge of massive data, Wang et al. (J Am Stat Assoc 113(522):829–844, 2018b) developed an optimal subsampling method for logistic regression. The purpose of this paper is to extend their method to softmax regression, which is also called multinomial logistic regression and is commonly used to model data with multiple categorical responses. We first derive the asymptotic distribution of the general subsampling estimator, and then derive optimal subsampling probabilities under the A-optimality criterion and the L-optimality criterion with a specific L matrix. Since the optimal subsampling probabilities depend on the unknowns, we adopt a two-stage adaptive procedure to address this issue and use numerical simulations to demonstrate its performance.
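The two-stage adaptive procedure summarized above can be illustrated with a small numerical sketch. The following Python code is an illustrative implementation, not the authors' code: it draws a uniform pilot subsample to obtain a pilot estimate, computes sampling probabilities proportional to \(\Vert {\mathbf {s}}_i({\hat{\varvec{\beta }}}_0)\Vert \,\Vert {\mathbf {x}}_i\Vert \) (the L-optimal-style form with a specific L matrix), and refits with inverse-probability weights; all function names, the plain gradient-ascent fitter, and the tuning constants are our own assumptions.

```python
import numpy as np

def softmax_probs(X, B):
    """P(y_i = k | x_i) for k = 1..K, with category 0 as baseline."""
    eta = np.hstack([np.zeros((X.shape[0], 1)), X @ B])  # linear predictors, baseline 0
    eta -= eta.max(axis=1, keepdims=True)                # numerical stability
    e = np.exp(eta)
    return (e / e.sum(axis=1, keepdims=True))[:, 1:]     # drop baseline column

def fit_weighted(X, Y, w, iters=800, lr=0.5):
    """Maximize the weighted softmax log-likelihood by gradient ascent.
    Y is N x K one-hot for categories 1..K (an all-zero row = baseline)."""
    B = np.zeros((X.shape[1], Y.shape[1]))
    for _ in range(iters):
        B += lr * X.T @ (w[:, None] * (Y - softmax_probs(X, B))) / w.sum()
    return B

def two_stage_subsample(X, Y, n0, n, rng):
    """Stage 1: uniform pilot of size n0 gives a pilot estimate B0.
    Stage 2: draw n points with pi_i proportional to ||s_i(B0)|| * ||x_i||,
    then refit with 1/pi_i weights."""
    N = X.shape[0]
    idx0 = rng.choice(N, size=n0, replace=True)
    B0 = fit_weighted(X[idx0], Y[idx0], np.ones(n0))
    S = Y - softmax_probs(X, B0)                        # score factors s_i(B0)
    pi = np.linalg.norm(S, axis=1) * np.linalg.norm(X, axis=1)
    pi = pi / pi.sum()
    idx = rng.choice(N, size=n, replace=True, p=pi)
    return fit_weighted(X[idx], Y[idx], 1.0 / pi[idx])  # inverse-probability weights
```

In practice the optimal probabilities are often mixed with a small uniform component to keep the weights \(1/\pi_i\) bounded; the sketch omits this for brevity.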
References
Atkinson A, Donev A, Tobias R (2007) Optimum experimental designs, with SAS, vol 34. Oxford University Press, Oxford
Drineas P, Kannan R, Mahoney MW (2006a) Fast Monte Carlo algorithms for matrices I: approximating matrix multiplication. SIAM J Comput 36(1):132–157
Drineas P, Kannan R, Mahoney MW (2006b) Fast Monte Carlo algorithms for matrices II: computing a low-rank approximation to a matrix. SIAM J Comput 36(1):158–183
Drineas P, Kannan R, Mahoney MW (2006c) Fast Monte Carlo algorithms for matrices III: computing a compressed approximate matrix decomposition. SIAM J Comput 36(1):184–206
Drineas P, Mahoney MW, Muthukrishnan S (2006d) Sampling algorithms for \(l_2\) regression and applications. In: Proceedings of the seventeenth annual ACM-SIAM symposium on discrete algorithms. Society for Industrial and Applied Mathematics, Philadelphia, pp 1127–1136
Drineas P, Mahoney M, Muthukrishnan S (2008) Relative-error CUR matrix decomposition. SIAM J Matrix Anal Appl 30:844–881
Drineas P, Mahoney M, Muthukrishnan S, Sarlos T (2011) Faster least squares approximation. Numer Math 117:219–249
Ferguson TS (1996) A course in large sample theory. Chapman and Hall, London
Frieze A, Kannan R, Vempala S (2004) Fast Monte-Carlo algorithms for finding low-rank approximations. J ACM 51:1025–1041
Lane A, Yao P, Flournoy N (2014) Information in a two-stage adaptive optimal design. J Stat Plan Inference 144:173–187
Ma P, Mahoney M, Yu B (2015) A statistical perspective on algorithmic leveraging. J Mach Learn Res 16:861–911
Mahoney MW (2011) Randomized algorithms for matrices and data. Found Trends Mach Learn 3(2):123–224
Mahoney MW, Drineas P (2009) CUR matrix decompositions for improved data analysis. Proc Natl Acad Sci USA 106(3):697–702
Ortega JM, Rheinboldt WC (1970) Iterative solution of nonlinear equations in several variables, vol 30. SIAM, Philadelphia
R Core Team (2017) R: a language and environment for statistical computing. R Foundation for Statistical Computing, Vienna. https://www.R-project.org/
Raskutti G, Mahoney M (2016) A statistical perspective on randomized sketching for ordinary least-squares. J Mach Learn Res 17:1–31
van der Vaart A (1998) Asymptotic statistics. Cambridge University Press, Cambridge
Wang H (2018) More efficient estimation for logistic regression with optimal subsample. arXiv preprint. arXiv:1802.02698
Wang H, Yang M, Stufken J (2018a) Information-based optimal subdata selection for big data linear regression. J Am Stat Assoc. https://doi.org/10.1080/01621459.2017.1408468
Wang H, Zhu R, Ma P (2018b) Optimal subsampling for large sample logistic regression. J Am Stat Assoc 113(522):829–844
Acknowledgements
We gratefully acknowledge the comments from two referees that helped improve the paper.
This work was supported by NSF Grant 1812013, a UCONN REP grant, and a GPU Grant from NVIDIA Corporation.
Appendix: Proofs
We prove the two theorems in this section. We use \(O_{P|{\mathcal {D}}_N}(1)\) and \(o_{P|{\mathcal {D}}_N}(1)\) to denote boundedness and convergence to zero, respectively, in conditional probability given the full data. Specifically, for a sequence of random vectors \({\mathbf {v}}_{n,N}\), as \(n\rightarrow \infty \) and \(N\rightarrow \infty \), \({\mathbf {v}}_{n,N}=O_{P|{\mathcal {D}}_N}(1)\) means that for any \(\epsilon >0\), there exists a finite \(C_\epsilon >0\) such that
\({\mathbf {v}}_{n,N}=o_{P|{\mathcal {D}}_N}(1)\) means that for any \(\epsilon >0\) and \(\delta >0\),
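For concreteness, one standard way to formalize these two conditions, following the analogous definitions for logistic regression in Wang et al. (2018b), is
\[
{\mathbf {v}}_{n,N}=O_{P|{\mathcal {D}}_N}(1):\quad
P\bigl\{P(\Vert {\mathbf {v}}_{n,N}\Vert \ge C_\epsilon \mid {\mathcal {D}}_N)>\epsilon \bigr\}\longrightarrow 0,
\qquad
{\mathbf {v}}_{n,N}=o_{P|{\mathcal {D}}_N}(1):\quad
P\bigl\{P(\Vert {\mathbf {v}}_{n,N}\Vert \ge \delta \mid {\mathcal {D}}_N)>\epsilon \bigr\}\longrightarrow 0,
\]
where in the first display \(C_\epsilon \) is the finite constant guaranteed for the given \(\epsilon \), and in the second the convergence holds for every \(\epsilon >0\) and \(\delta >0\).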
Proof
(Theorem 1) By direct calculation under the conditional distribution of the subsample given \({\mathcal {D}}_N\), we have
where \(t_i({\varvec{\beta }})=\sum _{k=1}^{K}\delta _{i,k}{\mathbf {x}}_i^\mathrm{T}{\varvec{\beta }}_k- \log \Big \{1+\sum _{l=1}^Ke^{{\mathbf {x}}_i^\mathrm{T}{\varvec{\beta }}_l}\Big \}\). Note that
where the second inequality is from the fact that \(\Vert {\varvec{\beta }}_k\Vert \le \Vert {\varvec{\beta }}\Vert \), and the third inequality is from the fact that \(\log (1+x)<1+\log x\) for \(x\ge 1\). Therefore, from Assumption 2,
Combining (8) and (9), \(\ell _s^*({\varvec{\beta }})-\ell _f({\varvec{\beta }})\rightarrow 0\) in conditional probability given \({\mathcal {D}}_N\). Note that the parameter space is compact, and \({\hat{\varvec{\beta }}}_{{\mathrm {sub}}}\) and \({\hat{\varvec{\beta }}}_{{\mathrm {full}}}\) are the unique global maximizers of the continuous concave functions \(\ell _s^*({\varvec{\beta }})\) and \(\ell _f({\varvec{\beta }})\), respectively. Thus, from Theorem 5.9 of van der Vaart (1998) and the remark following it, we obtain that conditionally on \({\mathcal {D}}_N\) in probability,
From Taylor’s theorem (c.f. Chap. 4 of Ferguson 1996),
where \({\dot{\ell }}_{s,j}^*({{\varvec{\beta }}})\) is the partial derivative of \(\ell _s^*({{\varvec{\beta }}})\) with respect to \(\beta _j\), and
Note that the third partial derivative of the log-likelihood for the subsample takes the form of
where \(j_l'=\text {Rem}(j_l/d)+dI\{\text {Rem}(j_l/d)=0\}\), \(l=1,2,3\), \(\text {Rem}(j_l/d)\) is the remainder of the integer division \(j_l/d\), and \(\alpha _{i,j_1j_2j_3}\) satisfies \(|\alpha _{i,j_1j_2j_3}|\le 2\). Here \(|\alpha _{i,j_1j_2j_3}|\le 2\) because it has the form \(p_{k'}({\mathbf {x}}_i,{\varvec{\beta }})\{1-p_{k'}({\mathbf {x}}_i,{\varvec{\beta }})\}\{1-2p_{k'}({\mathbf {x}}_i,{\varvec{\beta }})\}\), \(p_{k_1'}({\mathbf {x}}_i,{\varvec{\beta }})p_{k_2'}({\mathbf {x}}_i,{\varvec{\beta }})\{2p_{k_2'}({\mathbf {x}}_i,{\varvec{\beta }})-1\}\), or \(2p_{k_1'}({\mathbf {x}}_i,{\varvec{\beta }})p_{k_2'}({\mathbf {x}}_i,{\varvec{\beta }})p_{k_3'}({\mathbf {x}}_i,{\varvec{\beta }})\) for some \(k'\) and \(k_1'\ne k_2'\ne k_3'\). Thus,
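The index mapping \(j_l\mapsto j_l'\) above simply recovers which of the \(d\) covariate coordinates a flattened index \(j\in \{1,\ldots ,Kd\}\) refers to. As a quick sanity check (the function name is ours, not from the paper):

```python
def coord(j, d):
    """Map a 1-based flattened index j in {1, ..., K*d} to its covariate
    coordinate j' in {1, ..., d}: j' = Rem(j/d) + d * I{Rem(j/d) = 0}."""
    r = j % d              # Rem(j/d); r == 0 exactly when d divides j
    return r if r != 0 else d

# With d = 3, flattened indices 1..6 map to coordinates 1, 2, 3, 1, 2, 3.
print([coord(j, 3) for j in range(1, 7)])  # [1, 2, 3, 1, 2, 3]
```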
for all \({\varvec{\beta }}\). This gives us that
because
in probability as \(\tau \rightarrow \infty \) by Assumption 1. From (12), we have that
Denote \({\mathbf {M}}_n^*=\partial ^2\ell _s^*({\varvec{\beta }})/ \partial {\varvec{\beta }}\partial {\varvec{\beta }}^\mathrm{T}=n^{-1}\sum _{i=1}^{n}(N\pi _i^*)^{-1}{\varvec{\phi }}_i^*({\hat{\varvec{\beta }}}_{{\mathrm {full}}})\otimes ({\mathbf {x}}_i^*{{\mathbf {x}}_i^*}^\mathrm{T})\). From (11) and (12), we have
By direct calculation, we know that
For any component \({\mathbf {M}}_n^{*j_1j_2}\) of \({\mathbf {M}}_n^*\) where \(1\le j_1,j_2\le d\),
where the second-to-last inequality holds because all elements of \({\varvec{\phi }}_i\) are between 0 and 1, and the last equality follows from Assumption 2. This result, combined with Markov's inequality and (15), implies that
By direct calculation, we have
Note that
whose elements are bounded by \((nN^2)^{-1}\sum _{i=1}^{N}\pi _i^{-1}\Vert {\mathbf {x}}_i\Vert ^2\) which is of order \(O_P(n^{-1})\) by Assumption 2. From (17), (18) and Markov’s inequality, we know that
Note that (16) indicates that \({\mathbf {M}}_n^{*-1} =O_{P|{\mathcal {D}}_N}(1)\). Combining this with (10), (14) and (19), we have
which implies that
Note that
Given \({\mathcal {D}}_N\), \({\varvec{\eta }}_1, \ldots , {\varvec{\eta }}_n\) are i.i.d. with mean \({\mathbf {0}}\) and variance
Meanwhile, for every \(\varepsilon >0\) and some \(\rho >0\),
where the last equality follows from Assumption 2. This and (22) show that the Lindeberg–Feller conditions are satisfied under the conditional distribution, in probability. From (21) and (22), by the Lindeberg–Feller central limit theorem (Proposition 2.27 of van der Vaart 1998), conditionally on \({\mathcal {D}}_N\) in probability,
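For reference, the version of the theorem invoked here (Proposition 2.27 of van der Vaart 1998, stated for triangular arrays) reads: if \({\mathbf {Y}}_{n,1},\ldots ,{\mathbf {Y}}_{n,n}\) are independent random vectors with mean zero such that, for every \(\varepsilon >0\),
\[
\sum _{i=1}^{n}E\Vert {\mathbf {Y}}_{n,i}\Vert ^2 I\{\Vert {\mathbf {Y}}_{n,i}\Vert >\varepsilon \}\rightarrow 0
\quad \text {and}\quad
\sum _{i=1}^{n}\mathrm {Cov}({\mathbf {Y}}_{n,i})\rightarrow \varvec{\Sigma },
\]
then \(\sum _{i=1}^{n}{\mathbf {Y}}_{n,i}\rightarrow N({\mathbf {0}},\varvec{\Sigma })\) in distribution. In the present proof these two conditions are verified for \({\varvec{\eta }}_i/\sqrt{n}\) under the conditional distribution given \({\mathcal {D}}_N\).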
in distribution. From (14) and (20),
From (16),
Based on Assumption 1 and (22), it can be verified that
Thus, (23), (24) and (25) yield,
The result in Theorem 1 follows from Slutsky's theorem (Theorem 6 of Ferguson 1996) and the fact that
\(\square \)
Proof
(Theorem 2) Note that \(\varvec{\psi }_i({\hat{\varvec{\beta }}}_{{\mathrm {full}}})={\mathbf {s}}_i({\hat{\varvec{\beta }}}_{{\mathrm {full}}}){\mathbf {s}}_i^\mathrm{T}({\hat{\varvec{\beta }}}_{{\mathrm {full}}})\), so
Therefore, for the A-optimality,
Here, the last step is due to the Cauchy–Schwarz inequality, and the equality holds if and only if \(\pi _i\) is proportional to \(\Vert {\mathbf {M}}_N^{-1}\{{\mathbf {s}}_i({\hat{\varvec{\beta }}}_{{\mathrm {full}}})\otimes {\mathbf {x}}_i\}\Vert \). Thus, the A-optimal subsampling probabilities take the form of (5).
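The Cauchy–Schwarz step can be spelled out as follows. Writing \(a_i=\Vert {\mathbf {M}}_N^{-1}\{{\mathbf {s}}_i({\hat{\varvec{\beta }}}_{{\mathrm {full}}})\otimes {\mathbf {x}}_i\}\Vert \), the trace criterion to be minimized over the \(\pi _i\), subject to \(\sum _{i=1}^{N}\pi _i=1\), is proportional (up to a factor not depending on the \(\pi _i\)) to \(\sum _{i=1}^{N}\pi _i^{-1}a_i^2\), and
\[
\sum _{i=1}^{N}\frac{a_i^2}{\pi _i}
=\Bigl (\sum _{i=1}^{N}\pi _i\Bigr )\Bigl (\sum _{i=1}^{N}\frac{a_i^2}{\pi _i}\Bigr )
\ge \Bigl (\sum _{i=1}^{N}\frac{a_i}{\sqrt{\pi _i}}\cdot \sqrt{\pi _i}\Bigr )^2
=\Bigl (\sum _{i=1}^{N}a_i\Bigr )^2,
\]
where the lower bound does not depend on the \(\pi _i\) and is attained if and only if \(a_i/\sqrt{\pi _i}\propto \sqrt{\pi _i}\), i.e., \(\pi _i=a_i/\sum _{j=1}^{N}a_j\).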
For the L-optimality, the proof is similar by noticing that
\(\square \)
Yao, Y., Wang, H. Optimal subsampling for softmax regression. Stat Papers 60, 585–599 (2019). https://doi.org/10.1007/s00362-018-01068-6