Stochastic Online Kernel Selection with Instantaneous Loss in Random Feature Space

Han, Zhizhuo; Liao, Shizhong

doi:10.1007/978-3-319-70087-8_4


Conference paper
First Online: 24 October 2017


Part of the book series: Lecture Notes in Computer Science (LNTCS, volume 10634)

Abstract

Online kernel selection is critical to online kernel learning. However, the per-round time complexity of existing online kernel selection algorithms is linear in the number of examples that have already arrived, which is inefficient for online learning. To address this issue, we propose a novel stochastic online kernel selection algorithm that uses random feature mapping and the instantaneous loss. The algorithm has constant time complexity at each round and comes with a theoretical guarantee. Specifically, the algorithm first maps the arriving example into the random feature space. It then updates the kernel parameter and the weights of the classifier simultaneously using stochastic gradient descent (SGD) to minimize the instantaneous loss. We prove that the algorithm enjoys a sub-linear regret bound. Experimental results on benchmark datasets demonstrate that the proposed algorithm is effective and efficient.
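The two steps described in the abstract can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the sine/cosine feature map with scale $\sqrt{2/D}$ is inferred from the gradient bound in the appendix, while the hinge loss, the toy data stream, and the names `random_features`, `sgd_round`, and `run_toy_stream` are hypothetical choices made for the example.

```python
import math
import random


def random_features(x, gamma, directions):
    """Return (phi, m): sin/cos random features of x and projections m_j = u_j . x."""
    D = 2 * len(directions)
    scale = math.sqrt(2.0 / D)
    m = [sum(ui * xi for ui, xi in zip(u, x)) for u in directions]
    phi = [scale * math.sin(gamma * mj) for mj in m]
    phi += [scale * math.cos(gamma * mj) for mj in m]
    return phi, m


def sgd_round(w, gamma, x, y, directions, eta):
    """One round: suffer the hinge loss (an assumption), then step w and gamma together."""
    phi, m = random_features(x, gamma, directions)
    score = sum(wi * pi for wi, pi in zip(w, phi))
    loss = max(0.0, 1.0 - y * score)
    if loss > 0.0:
        half = len(directions)
        scale = math.sqrt(2.0 / (2 * half))
        # d(score)/d(gamma), evaluated at the current (w, gamma).
        dscore = scale * sum(
            m[j] * (w[j] * math.cos(gamma * m[j]) - w[j + half] * math.sin(gamma * m[j]))
            for j in range(half)
        )
        # Hinge-loss gradients: -y * phi w.r.t. w and -y * dscore w.r.t. gamma.
        w = [wi + eta * y * pi for wi, pi in zip(w, phi)]
        gamma = gamma + eta * y * dscore
    return w, gamma, loss


def run_toy_stream(T=200, d=2, D=20, eta=0.1, seed=0):
    """Feed T synthetic examples; the work per round does not depend on t."""
    rng = random.Random(seed)
    directions = [[rng.gauss(0.0, 1.0) for _ in range(d)] for _ in range(D // 2)]
    w, gamma, total_loss = [0.0] * D, 1.0, 0.0
    for _ in range(T):
        x = [rng.uniform(-1.0, 1.0) for _ in range(d)]
        y = 1 if x[0] * x[1] > 0 else -1  # toy XOR-like labels (hypothetical data)
        w, gamma, loss = sgd_round(w, gamma, x, y, directions, eta)
        total_loss += loss
    return w, gamma, total_loss
```

Each call to `sgd_round` touches only the current example and the $D$ random directions, so its cost is $O(Dd)$ regardless of how many examples have already arrived, which is the constant per-round complexity the abstract refers to.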


References

1. Yang, T., Mahdavi, M., Jin, R., Yi, J., Hoi, S.C.: Online kernel selection: algorithms and evaluations. In: Proceedings of the Twenty-Sixth AAAI Conference on Artificial Intelligence, pp. 1197–1203. AAAI Press (2012)
2. Chapelle, O., Vapnik, V., Bousquet, O., Mukherjee, S.: Choosing multiple parameters for support vector machines. Mach. Learn. 46(1), 131–159 (2002)
3. Cristianini, N., Elisseeff, A., Shawe-Taylor, J., Kandola, J.: On kernel-target alignment. In: Advances in Neural Information Processing Systems (2001)
4. Chen, B., Liang, J., Zheng, N., Príncipe, J.C.: Kernel least mean square with adaptive kernel size. Neurocomputing 191, 95–106 (2016)
5. Fan, H., Song, Q., Shrestha, S.B.: Kernel online learning with adaptive kernel width. Neurocomputing 175, 233–242 (2016)
6. Yang, T., Li, Y.F., Mahdavi, M., Jin, R., Zhou, Z.H.: Nyström method vs random Fourier features: a theoretical and empirical comparison. In: Advances in Neural Information Processing Systems, pp. 476–484 (2012)
7. Dekel, O., Shalev-Shwartz, S., Singer, Y.: The forgetron: a kernel-based perceptron on a budget. SIAM J. Comput. 37(5), 1342–1372 (2008)
8. Hu, J., Yang, H., King, I., Lyu, M.R., So, A.M.C.: Kernelized online imbalanced learning with fixed budgets. In: Proceedings of the Twenty-Ninth AAAI Conference on Artificial Intelligence, pp. 2666–2672 (2015)
9. Lin, M., Weng, S., Zhang, C.: On the sample complexity of random Fourier features for online learning: how many random Fourier features do we need? ACM Trans. Knowl. Discov. Data 8(3), 13 (2014)
10. Rahimi, A., Recht, B.: Random features for large-scale kernel machines. In: Advances in Neural Information Processing Systems, pp. 1177–1184 (2007)
11. Rahimi, A., Recht, B.: Weighted sums of random kitchen sinks: replacing minimization with randomization in learning. In: Advances in Neural Information Processing Systems, pp. 1313–1320 (2009)
12. Forster, J., Warmuth, M.K.: Relative expected instantaneous loss bounds. J. Comput. Syst. Sci. 64(1), 76–102 (2002)
13. Lu, J., Hoi, S.C., Wang, J., Zhao, P., Liu, Z.Y.: Large scale online kernel learning. J. Mach. Learn. Res. 17(47), 1–43 (2016)


Acknowledgments

The work was supported in part by the National Natural Science Foundation of China under grant No. 61673293.

Author information

Authors and Affiliations

School of Computer Science and Technology, Tianjin University, Tianjin, 300350, China
Zhizhuo Han & Shizhong Liao


Corresponding author

Correspondence to Shizhong Liao.

Editor information

Editors and Affiliations

Guangdong University of Technology, Guangzhou, China
Derong Liu
Guangdong University of Technology, Guangzhou, China
Shengli Xie
South China University of Technology, Guangzhou, China
Yuanqing Li
Institute of Automation, Chinese Academy of Sciences, Beijing, China
Dongbin Zhao
King Fahd University of Petroleum and Minerals, Dhahran, Saudi Arabia
El-Sayed M. El-Alfy

Appendix

In this appendix, we provide the proof details of Theorem 1.

Proof

Let $f_{*}(\varvec{x})= (\varvec{w}^{*})^\top \phi (\varvec{x},\gamma ^{*})$ be the optimal classifier in the random feature space that minimizes the expected loss. The desired inequality can be rewritten as

$$\begin{aligned}&\sum _{t=1}^{T}\left( \ell _{t}(\varvec{w}_{t}^{\top }\phi (\varvec{x}_{t},\gamma _{t}),y_t) - \ell _{t}(g^{*}(\varvec{x}_t), y_t)\right) \\&\quad = \sum _{t=1}^{T}\left( \ell _{t}(\varvec{w}_{t}^{\top }\phi (\varvec{x}_{t},\gamma _{t}),y_t) - \ell _{t}((\varvec{w}^{*})^{\top }\phi (\varvec{x}_{t},\gamma _{t}),y_t)\right) \\&\qquad +\sum _{t=1}^{T}\left( \ell _{t}((\varvec{w}^{*})^{\top }\phi (\varvec{x}_{t},\gamma _{t}),y_t) - \ell _{t}((\varvec{w}^{*})^{\top }\phi (\varvec{x}_{t},\gamma ^*),y_t)\right) \\&\qquad +\sum _{t=1}^{T}\left( \ell _{t}((\varvec{w}^{*})^{\top }\phi (\varvec{x}_{t},\gamma ^*),y_t) - \ell _{t}(g^{*}(\varvec{x}_t), y_t)\right) \\&\quad = A + B + C. \end{aligned}$$

First of all, consider $\ell _t$ as a function of $\gamma $. From the convexity of the loss function, we obtain

$$\begin{aligned} \begin{aligned} \ell _t(\gamma _{t})-\ell _t(\gamma ^{*})\le&\nabla \ell _{t}(\gamma _{t})(\gamma _{t}-\gamma ^{*})\\ =&\frac{(\gamma _{t}-\gamma ^{*})^{2}-(\gamma _{t+1}-\gamma ^{*})^{2}}{2\eta }+ \frac{\eta (\nabla \ell _{t}(\gamma _{t}))^{2}}{2}. \end{aligned} \end{aligned}$$
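
The equality in the second line can be checked by expanding a square. The worked step below assumes the SGD update $\gamma _{t+1}=\gamma _{t}-\eta \nabla \ell _{t}(\gamma _{t})$; the update rule is not restated in this appendix, so its exact form here is an assumption consistent with the identity above.

$$\begin{aligned} (\gamma _{t+1}-\gamma ^{*})^{2} = (\gamma _{t}-\gamma ^{*})^{2} - 2\eta \nabla \ell _{t}(\gamma _{t})(\gamma _{t}-\gamma ^{*}) + \eta ^{2}(\nabla \ell _{t}(\gamma _{t}))^{2}. \end{aligned}$$

Solving for $\nabla \ell _{t}(\gamma _{t})(\gamma _{t}-\gamma ^{*})$ and dividing by $2\eta $ recovers the right-hand side of the identity.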

Summing the above over $t=1,\ldots ,T$ leads to

$$\begin{aligned} \begin{aligned} B \le&\frac{(\gamma _{1}-\gamma ^{*})^{2}-(\gamma _{T+1}-\gamma ^{*})^{2}}{2\eta } + \frac{\eta }{2}\sum _{t=1}^{T}(\nabla \ell _{t}(\gamma _{t}))^{2} \\ \le&\frac{(\gamma ^{*})^2}{2\eta } + \frac{\eta }{2}L_1^{2}T, \end{aligned} \end{aligned}$$

where $L_1=\max _{t\in [T]}|\nabla \ell _{t}(\gamma _{t})|$. The last inequality discards the non-positive term $-(\gamma _{T+1}-\gamma ^{*})^{2}$ and holds for the initialization $\gamma _{1}=0$. The same argument applied to $\varvec{w}$, with $\varvec{w}_{1}=\varvec{0}$, gives

$$\begin{aligned} A\le \frac{\Vert \varvec{w}^{*}\Vert ^2 }{2\eta } + \frac{\eta }{2}L_2^{2}T, \;\mathrm {where}\;L_2=\max _{t\in [T]}\Vert \nabla \ell _{t}(\varvec{w}_{t})\Vert . \end{aligned}$$
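
The bounds on $A$ and $B$ have the form of the standard online gradient descent regret bound, which can be checked numerically. The sketch below is an illustration under stated assumptions, not the paper's experiment: it uses a hypothetical one-dimensional convex loss $|\gamma - a_t|$ (subgradient magnitude at most 1), starts from $\gamma _1 = 0$, and compares the measured regret against $(\gamma ^{*})^2/(2\eta ) + \eta L^2 T/2$. The names `ogd_regret` and `regret_bound` are made up for the example.

```python
import math
import random


def ogd_regret(targets, comparator, eta):
    """Run 1-D online subgradient descent on l_t(g) = |g - a_t| from g_1 = 0.

    Returns the cumulative regret against the fixed comparator.
    """
    g = 0.0
    regret = 0.0
    for a in targets:
        regret += abs(g - a) - abs(comparator - a)
        grad = (g > a) - (g < a)  # subgradient of |g - a|, magnitude <= 1
        g -= eta * grad
    return regret


def regret_bound(comparator, eta, T, lipschitz=1.0):
    """Right-hand side of the bound: comparator^2 / (2 eta) + eta * L^2 * T / 2."""
    return comparator ** 2 / (2.0 * eta) + eta * lipschitz ** 2 * T / 2.0


rng = random.Random(0)
T = 2000
targets = [rng.uniform(-1.0, 1.0) for _ in range(T)]
eta = 1.0 / math.sqrt(T)  # this step size makes the bound grow like sqrt(T)
comparator = 0.3

measured = ogd_regret(targets, comparator, eta)
bound = regret_bound(comparator, eta, T)
print(f"regret = {measured:.3f}, bound = {bound:.3f}")
```

With $\eta = 1/\sqrt{T}$ both terms of the bound scale as $\sqrt{T}$, which is the source of the sub-linear regret.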

By the result of [13], with probability at least

$$\begin{aligned} 1-2^8\left( \frac{\gamma _pR}{\epsilon '}\right) ^2\exp \left( \frac{-D(\epsilon ')^2}{4(d+2)}\right) , \end{aligned}$$

we have

$$\begin{aligned} \Vert \varvec{w}^{*}\Vert ^2 \le (1+\epsilon ')\Vert g^{*}\Vert _1^2 \;\mathrm {and}\; C\le \epsilon 'LT\Vert g^*\Vert _1. \end{aligned}$$

Recalling the definition of $\varvec{M}$, we have $ \varvec{M}[j]\sim \mathcal {N}(0, \Vert \varvec{x}\Vert ^{2}). $ This yields the following upper bound on $|\nabla \ell (\gamma )|$:

$$\begin{aligned} |\nabla \ell (\gamma )|\le \sqrt{\frac{2}{D}}\left( \sum _{j=1}^{D/2}|\varvec{w}[j]\varvec{M}[j]| + \sum _{j=D/2+1}^{D}|\varvec{w}[j]\varvec{M}[j-D/2]|\right) . \end{aligned}$$

By the properties of Gaussian random variables, we therefore obtain

$$\begin{aligned} \mathbb {E}\left[ \left| \varvec{w}[j]\varvec{M}[j]\right| \right] \le \Vert \varvec{x}\Vert ^{2}(\varvec{w}[j])^{2}\sqrt{\frac{2}{\pi }} \;\mathrm {and}\; \mathbb {E}\left[ |\nabla \ell (\gamma )|\right] \le \frac{2\Vert \varvec{w}\Vert ^2\Vert \varvec{x}\Vert ^{2}}{\sqrt{\pi D}}. \end{aligned}$$
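
The Gaussian property invoked here is presumably the mean of a half-normal variable: for $Z\sim \mathcal {N}(0,\sigma ^{2})$, $\mathbb {E}|Z| = \sigma \sqrt{2/\pi }$, with $\sigma = \Vert \varvec{x}\Vert $ in the case of $\varvec{M}[j]$. The sketch below is a Monte Carlo check of that identity only, not of the bound stated above; the function name `mean_abs_gaussian` and the sample size are hypothetical choices.

```python
import math
import random


def mean_abs_gaussian(sigma, n, seed=0):
    """Monte Carlo estimate of E|Z| for Z ~ N(0, sigma^2)."""
    rng = random.Random(seed)
    return sum(abs(rng.gauss(0.0, sigma)) for _ in range(n)) / n


sigma = 2.0
estimate = mean_abs_gaussian(sigma, 200_000)
exact = sigma * math.sqrt(2.0 / math.pi)
print(f"estimate = {estimate:.4f}, sigma * sqrt(2/pi) = {exact:.4f}")
```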

By the Chernoff inequality, with probability at least

$$\begin{aligned} 1-\exp \left( -\frac{2\epsilon ^{2}C}{3}\sqrt{\frac{D}{\pi }}\right) , \end{aligned}$$

we have

$$\begin{aligned} |\nabla \ell (\gamma _t)|\le (1+\epsilon )\frac{2C}{\sqrt{\pi D}} \;\mathrm {for\ all}\; t\in \{1,\ldots ,T\}, \end{aligned}$$

where $C=\max _{t\in \{1,\ldots ,T\}}\Vert \varvec{w}_t\Vert ^2\Vert \varvec{x}_t\Vert ^{2}$ is a constant, not to be confused with the term $C$ in the decomposition $A+B+C$. From the identity $\sin ^{2}\theta +\cos ^{2}\theta =1$, it follows that $ \Vert \nabla \ell (\varvec{w})\Vert _2^2=1. $
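
The last claim can be illustrated numerically. The sketch assumes the feature map pairs a sine and a cosine for each of $D/2$ random projections and scales them by $\sqrt{2/D}$ (inferred from the gradient bound above), and that the loss is a hinge-type loss whose gradient with respect to $\varvec{w}$ is $-y\phi (\varvec{x},\gamma )$ when active, so that the gradient norm equals the feature norm. Both points are assumptions, and `feature_norm_sq` is a hypothetical helper.

```python
import math
import random


def feature_norm_sq(x, gamma, D, seed=0):
    """Squared norm of sqrt(2/D) * [sin(gamma * m_j), cos(gamma * m_j)], j = 1..D/2."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(D // 2):
        # m_j = u_j . x with u_j drawn from a standard Gaussian.
        m = sum(rng.gauss(0.0, 1.0) * xi for xi in x)
        total += (2.0 / D) * (math.sin(gamma * m) ** 2 + math.cos(gamma * m) ** 2)
    return total


norm_sq = feature_norm_sq([0.5, -1.2, 3.0], gamma=0.7, D=64)
print(norm_sq)
```

Each sine/cosine pair contributes exactly $2/D$, and there are $D/2$ pairs, so the squared norm is 1 for any input and any value of $\gamma $.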

We now conclude our proof. $\square $


Cite this paper

Han, Z., Liao, S. (2017). Stochastic Online Kernel Selection with Instantaneous Loss in Random Feature Space. In: Liu, D., Xie, S., Li, Y., Zhao, D., El-Alfy, ES. (eds) Neural Information Processing. ICONIP 2017. Lecture Notes in Computer Science, vol 10634. Springer, Cham. https://doi.org/10.1007/978-3-319-70087-8_4


DOI: https://doi.org/10.1007/978-3-319-70087-8_4
Published: 24 October 2017
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-70086-1
Online ISBN: 978-3-319-70087-8
eBook Packages: Computer Science, Computer Science (R0)
