Stochastic Online Kernel Selection with Instantaneous Loss in Random Feature Space

  • Conference paper

Part of the book series: Lecture Notes in Computer Science (LNTCS, volume 10634)

Abstract

Online kernel selection is critical to online kernel learning. However, the per-round time complexity of existing online kernel selection algorithms is linear in the number of examples that have already arrived, which is inefficient for online learning. To address this issue, we propose a novel stochastic online kernel selection algorithm that uses random feature mapping and the instantaneous loss. The algorithm has constant time complexity per round and comes with a theoretical guarantee. Specifically, the algorithm first maps the arriving example into the random feature space. It then updates the kernel parameter and the weights of the classifier simultaneously using stochastic gradient descent (SGD) to minimize the instantaneous loss. We prove that the algorithm enjoys a sub-linear regret bound. Experimental results on benchmark datasets demonstrate that the proposed algorithm is effective and efficient.
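The two steps described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the feature map \(\sqrt{2/D}\,[\cos (\gamma \varvec{M}), \sin (\gamma \varvec{M})]\) with \(\varvec{M}[j]=\varvec{u}_j^\top \varvec{x}\) is inferred from the appendix, while the logistic loss, the step size `eta`, and all function names are assumptions made for illustration only.

```python
import math
import random


def random_features(x, U, gamma):
    """Map x to phi(x, gamma) = sqrt(2/D) * [cos(gamma * M), sin(gamma * M)].

    U holds D/2 Gaussian directions u_j and M[j] = <u_j, x> (assumed form).
    """
    M = [sum(ui * xi for ui, xi in zip(u, x)) for u in U]
    scale = math.sqrt(2.0 / (2 * len(U)))  # sqrt(2 / D) with D = 2 * len(U)
    phi = [scale * math.cos(gamma * m) for m in M]
    phi += [scale * math.sin(gamma * m) for m in M]
    return phi, M


def logistic_loss(w, phi, y):
    """Instantaneous loss log(1 + exp(-y * w^T phi)); the loss is an assumption."""
    score = sum(wi * pi for wi, pi in zip(w, phi))
    return math.log1p(math.exp(-y * score))


def sgd_round(w, gamma, x, y, U, eta):
    """One round: map x, then take a joint SGD step on (w, gamma)."""
    phi, M = random_features(x, U, gamma)
    half = len(U)
    score = sum(wi * pi for wi, pi in zip(w, phi))
    g = -y / (1.0 + math.exp(y * score))  # derivative of the loss w.r.t. the score
    # d phi / d gamma: cos entries give -M[j] * sin entries,
    # sin entries give +M[j] * cos entries.
    grad_gamma = g * sum(
        M[j] * (w[half + j] * phi[j] - w[j] * phi[half + j]) for j in range(half)
    )
    new_w = [wi - eta * g * pi for wi, pi in zip(w, phi)]
    new_gamma = gamma - eta * grad_gamma
    return new_w, new_gamma


rng = random.Random(0)
d, D = 3, 20
U = [[rng.gauss(0.0, 1.0) for _ in range(d)] for _ in range(D // 2)]
w, gamma = [0.0] * D, 1.0
x, y = [0.5, -1.0, 2.0], 1

loss_before = logistic_loss(w, random_features(x, U, gamma)[0], y)
w, gamma = sgd_round(w, gamma, x, y, U, eta=0.1)
loss_after = logistic_loss(w, random_features(x, U, gamma)[0], y)
print(loss_before, loss_after)
```

Each call to `sgd_round` costs on the order of \(D\cdot d\) operations and does not depend on how many examples have been processed, which is the constant per-round cost the abstract refers to.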

References

  1. Yang, T., Mahdavi, M., Jin, R., Yi, J., Hoi, S.C.: Online kernel selection: algorithms and evaluations. In: Proceedings of the Twenty-Sixth AAAI Conference on Artificial Intelligence, pp. 1197–1203. AAAI Press (2012)

  2. Chapelle, O., Vapnik, V., Bousquet, O., Mukherjee, S.: Choosing multiple parameters for support vector machines. Mach. Learn. 46(1), 131–159 (2002)

  3. Cristianini, N., Elisseeff, A., Shawe-Taylor, J., Kandola, J.: On kernel-target alignment. In: Advances in Neural Information Processing Systems (2001)

  4. Chen, B., Liang, J., Zheng, N., Príncipe, J.C.: Kernel least mean square with adaptive kernel size. Neurocomputing 191, 95–106 (2016)

  5. Fan, H., Song, Q., Shrestha, S.B.: Kernel online learning with adaptive kernel width. Neurocomputing 175, 233–242 (2016)

  6. Yang, T., Li, Y.F., Mahdavi, M., Jin, R., Zhou, Z.H.: Nyström method vs random Fourier features: a theoretical and empirical comparison. In: Advances in Neural Information Processing Systems, pp. 476–484 (2012)

  7. Dekel, O., Shalev-Shwartz, S., Singer, Y.: The forgetron: a kernel-based perceptron on a budget. SIAM J. Comput. 37(5), 1342–1372 (2008)

  8. Hu, J., Yang, H., King, I., Lyu, M.R., So, A.M.C.: Kernelized online imbalanced learning with fixed budgets. In: Proceedings of the Twenty-Ninth AAAI Conference on Artificial Intelligence, pp. 2666–2672 (2015)

  9. Lin, M., Weng, S., Zhang, C.: On the sample complexity of random Fourier features for online learning: how many random Fourier features do we need? ACM Trans. Knowl. Discov. Data 8(3), 13 (2014)

  10. Rahimi, A., Recht, B.: Random features for large-scale kernel machines. In: Advances in Neural Information Processing Systems, pp. 1177–1184 (2007)

  11. Rahimi, A., Recht, B.: Weighted sums of random kitchen sinks: replacing minimization with randomization in learning. In: Advances in Neural Information Processing Systems, pp. 1313–1320 (2009)

  12. Forster, J., Warmuth, M.K.: Relative expected instantaneous loss bounds. J. Comput. Syst. Sci. 64(1), 76–102 (2002)

  13. Lu, J., Hoi, S.C., Wang, J., Zhao, P., Liu, Z.Y.: Large scale online kernel learning. J. Mach. Learn. Res. 17(47), 1–43 (2016)

Acknowledgments

The work was supported in part by the National Natural Science Foundation of China under grant No. 61673293.

Author information

Corresponding author

Correspondence to Shizhong Liao.

Appendix

In this appendix, we provide the detailed proof of Theorem 1.

Proof

Let \(f_{*}(\varvec{x})= (\varvec{w}^{*})^\top \phi (\varvec{x},\gamma ^{*})\) be the optimal classifier in the random feature space, i.e. the one that minimizes the expected loss. The regret in the desired inequality can be decomposed as

$$\begin{aligned} \begin{aligned}&\sum _{t=1}^{T}\left( \ell _{t}(\varvec{w}_{t}^{\top }\phi (\varvec{x}_{t},\gamma _{t}),y_t) - \ell _{t}(g^{*}(\varvec{x}_t), y_t)\right) \\&\quad = \sum _{t=1}^{T}\left( \ell _{t}(\varvec{w}_{t}^{\top }\phi (\varvec{x}_{t},\gamma _{t}),y_t) - \ell _{t}((\varvec{w}^{*})^{\top }\phi (\varvec{x}_{t},\gamma _{t}),y_t)\right) \\&\qquad +\sum _{t=1}^{T}\left( \ell _{t}((\varvec{w}^{*})^{\top }\phi (\varvec{x}_{t},\gamma _{t}),y_t) - \ell _{t}((\varvec{w}^{*})^{\top }\phi (\varvec{x}_{t},\gamma ^*),y_t)\right) \\&\qquad +\sum _{t=1}^{T}\left( \ell _{t}((\varvec{w}^{*})^{\top }\phi (\varvec{x}_{t},\gamma ^*),y_t) - \ell _{t}(g^{*}(\varvec{x}_t), y_t)\right) \\&\quad = A + B + C. \end{aligned} \end{aligned}$$

First of all, consider \(\ell _t\) as a function of \(\gamma \). From the convexity of the loss function, we obtain

$$\begin{aligned} \begin{aligned} \ell _t(\gamma _{t})-\ell _t(\gamma ^{*})\le&\nabla \ell _{t}(\gamma _{t})(\gamma _{t}-\gamma ^{*})\\ =&\frac{(\gamma _{t}-\gamma ^{*})^{2}-(\gamma _{t+1}-\gamma ^{*})^{2}}{2\eta }+ \frac{\eta (\nabla \ell _{t}(\gamma _{t}))^{2}}{2}. \end{aligned} \end{aligned}$$

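Here the equality follows from the SGD update \(\gamma _{t+1}=\gamma _{t}-\eta \nabla \ell _{t}(\gamma _{t})\). Summing the above over \(t=1,\ldots ,T\) and telescoping leads to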

$$\begin{aligned} \begin{aligned} B \le&\frac{(\gamma _{1}-\gamma ^{*})^{2}-(\gamma _{T+1}-\gamma ^{*})^{2}}{2\eta } + \frac{\eta }{2}\sum _{t=1}^{T}(\nabla \ell _{t}(\gamma _{t}))^{2} \\ \le&\frac{(\gamma ^{*})^2}{2\eta } + \frac{\eta }{2}L_1^{2}T, \end{aligned} \end{aligned}$$

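where \(L_1=\max _{t\in [T]}|\nabla \ell _{t}(\gamma _{t})|\); the last inequality drops the non-positive term \(-(\gamma _{T+1}-\gamma ^{*})^{2}\) and holds when \(\gamma _1=0\). Applying the same argument with \(\ell _t\) regarded as a function of \(\varvec{w}\) gives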

$$\begin{aligned} A\le \frac{\Vert \varvec{w}^{*}\Vert ^2 }{2\eta } + \frac{\eta }{2}L_2^{2}T, \;\mathrm {where}\;L_2=\max _{t\in [T]}\Vert \nabla \ell _{t}(\varvec{w}_{t})\Vert . \end{aligned}$$
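The bounds on \(A\) and \(B\) are both instances of the standard online gradient descent argument. As a numerical sanity check, the following sketch runs the one-dimensional update on a toy loss that is convex in \(\gamma \) and compares the accumulated regret with \((\gamma ^{*})^2/(2\eta ) + \frac{\eta }{2}L^{2}T\), where \(L\) is taken as the largest gradient magnitude along the trajectory. The loss, the data, the comparator `gamma_star`, and the step size are illustrative assumptions and are not taken from the paper.

```python
import math
import random


def loss(gamma, a, y):
    """Toy 1-D logistic loss log(1 + exp(-y * a * gamma)), convex in gamma."""
    return math.log1p(math.exp(-y * a * gamma))


def grad(gamma, a, y):
    """Derivative of the toy loss with respect to gamma."""
    return -y * a / (1.0 + math.exp(y * a * gamma))


rng = random.Random(1)
T, eta, gamma_star = 500, 0.05, 2.0  # gamma_star: arbitrary fixed comparator
data = [(rng.uniform(-1.0, 1.0), rng.choice([-1, 1])) for _ in range(T)]

gamma = 0.0            # start at gamma_1 = 0, as the bound requires
regret, L = 0.0, 0.0
for a, y in data:
    g = grad(gamma, a, y)
    regret += loss(gamma, a, y) - loss(gamma_star, a, y)
    L = max(L, abs(g))  # running maximum of |gradient|
    gamma -= eta * g    # SGD update gamma_{t+1} = gamma_t - eta * gradient

bound = gamma_star ** 2 / (2 * eta) + eta / 2 * L ** 2 * T
print(f"regret = {regret:.3f}, bound = {bound:.3f}")
```

Because the argument only uses convexity and the update rule, the inequality holds for any fixed comparator, not only for the minimizer.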

From the result of [13], with probability at least

$$\begin{aligned} 1-2^8\left( \frac{\gamma _pR}{\epsilon '}\right) ^2\exp \left( \frac{-D\epsilon '^2}{4(d+2)}\right) , \end{aligned}$$

we have

$$\begin{aligned} \Vert \varvec{w}^{*}\Vert ^2 \le (1+\epsilon ')\Vert g^{*}\Vert _1^2 \;\mathrm {and}\; C\le \epsilon 'LT\Vert g^*\Vert _1. \end{aligned}$$
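Results of this type rest on the fact that inner products of random Fourier features concentrate around the kernel value as \(D\) grows [10]. The sketch below illustrates this for a Gaussian kernel; it assumes the feature map \(\sqrt{2/D}\,[\cos (\gamma \varvec{u}_j^\top \varvec{x}), \sin (\gamma \varvec{u}_j^\top \varvec{x})]\) with \(\varvec{u}_j\sim \mathcal {N}(0, I)\), for which the expected inner product is \(\exp (-\gamma ^2\Vert \varvec{x}-\varvec{z}\Vert ^2/2)\). The points, the value of \(\gamma \), and the function name are hypothetical.

```python
import math
import random


def rff_inner_product(x, z, gamma, D, rng):
    """Return phi(x, gamma)^T phi(z, gamma) for D random Fourier features (D even)."""
    delta = [xi - zi for xi, zi in zip(x, z)]
    total = 0.0
    for _ in range(D // 2):
        u = [rng.gauss(0.0, 1.0) for _ in delta]
        # cos(a)cos(b) + sin(a)sin(b) = cos(a - b), with a - b = gamma * <u, x - z>
        total += math.cos(gamma * sum(ui * di for ui, di in zip(u, delta)))
    return 2.0 / D * total


x, z, gamma = [0.3, -0.2, 0.5], [0.1, 0.4, -0.3], 1.5
sq_dist = sum((xi - zi) ** 2 for xi, zi in zip(x, z))
exact = math.exp(-gamma ** 2 * sq_dist / 2.0)  # Gaussian kernel value

rng = random.Random(0)
errors = {
    D: abs(rff_inner_product(x, z, gamma, D, rng) - exact) for D in (20, 200, 4000)
}
for D, err in errors.items():
    print(f"D = {D:5d}  |approximation - kernel| = {err:.4f}")
```

The approximation error typically shrinks at a rate of about \(1/\sqrt{D}\), which is why the failure probability above decays exponentially in \(D\).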

Recalling the definition of \(\varvec{M}\), we can derive \( \varvec{M}[j]\sim \mathcal {N}(0, \Vert \varvec{x}\Vert ^{2}). \) This leads to the following upper bound on \(|\nabla \ell (\gamma )|\):

$$\begin{aligned} |\nabla \ell (\gamma )|\le \sqrt{\frac{2}{D}}\left( \sum _{j=1}^{D/2}|\varvec{w}[j]\varvec{M}[j]| + \sum _{j=D/2+1}^{D}|\varvec{w}[j]\varvec{M}[j-D/2]|\right) . \end{aligned}$$

By the properties of Gaussian random variables, we therefore obtain

$$\begin{aligned} \mathbb {E}\left[ \left| \varvec{w}[j]\varvec{M}[j]\right| \right] \le \Vert \varvec{x}\Vert ^{2}(\varvec{w}[j])^{2}\sqrt{\frac{2}{\pi }} \;\mathrm {and}\; \mathbb {E}\left[ |\nabla \ell (\gamma )|\right] \le \frac{2\Vert \varvec{w}\Vert ^2\Vert \varvec{x}\Vert ^{2}}{\sqrt{\pi D}}. \end{aligned}$$
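The Gaussian facts behind this step are that \(\varvec{u}^\top \varvec{x}\sim \mathcal {N}(0,\Vert \varvec{x}\Vert ^2)\) when \(\varvec{u}\sim \mathcal {N}(0, I)\), and that a zero-mean Gaussian with standard deviation \(\sigma \) has mean absolute value \(\sigma \sqrt{2/\pi }\). The following Monte Carlo sketch checks both, assuming \(\varvec{M}[j]=\varvec{u}_j^\top \varvec{x}\) (the definition of \(\varvec{M}\) is not restated in this appendix); the vector `x` and the sample size are arbitrary choices.

```python
import math
import random

rng = random.Random(0)
x = [1.0, -2.0, 2.0]  # ||x|| = 3
norm_x = math.sqrt(sum(xi * xi for xi in x))

n = 100_000
samples = []
for _ in range(n):
    u = [rng.gauss(0.0, 1.0) for _ in x]  # u ~ N(0, I)
    samples.append(sum(ui * xi for ui, xi in zip(u, x)))  # M = <u, x>

mean = sum(samples) / n
var = sum((m - mean) ** 2 for m in samples) / n
mean_abs = sum(abs(m) for m in samples) / n
expected_abs = norm_x * math.sqrt(2.0 / math.pi)  # half-normal mean

print(f"empirical variance = {var:.3f}  (||x||^2 = {norm_x ** 2:.3f})")
print(f"empirical E|M|     = {mean_abs:.3f}  (||x|| sqrt(2/pi) = {expected_abs:.3f})")
```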

By the Chernoff bound, with probability at least

$$\begin{aligned} 1-\exp \left( -\frac{2\epsilon ^{2}C}{3}\sqrt{\frac{D}{\pi }}\right) , \end{aligned}$$

we have

$$\begin{aligned} |\nabla \ell (\gamma _t)|\le (1+\epsilon )\frac{2C}{\sqrt{\pi D}}, \;\mathrm {for}\; \forall t\in \{1,\ldots ,T\}, \end{aligned}$$

where \(C=\max _{t\in \{1,\ldots ,T\}}\Vert \varvec{w}_t\Vert ^2\Vert \varvec{x}_t\Vert ^{2}\) (not to be confused with the term \(C\) in the decomposition above). From the identity \(\sin ^2\theta +\cos ^2\theta =1\), it follows that \( \Vert \nabla \ell (\varvec{w})\Vert _2^2=1. \)
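As a worked version of this last step, assume that the feature map has the form \(\phi (\varvec{x},\gamma )=\sqrt{2/D}\,[\cos (\gamma \varvec{M}[1]),\ldots ,\cos (\gamma \varvec{M}[D/2]),\sin (\gamma \varvec{M}[1]),\ldots ,\sin (\gamma \varvec{M}[D/2])]\), which is consistent with the gradient bound above but is not restated in this appendix. Then

$$\begin{aligned} \Vert \phi (\varvec{x},\gamma )\Vert _2^2 = \frac{2}{D}\sum _{j=1}^{D/2}\left( \cos ^2(\gamma \varvec{M}[j]) + \sin ^2(\gamma \varvec{M}[j])\right) = \frac{2}{D}\cdot \frac{D}{2} = 1. \end{aligned}$$

Since the gradient of the loss with respect to \(\varvec{w}\) is a scalar multiple of \(\phi (\varvec{x},\gamma )\), its norm equals the magnitude of that scalar, which is at most 1 for 1-Lipschitz losses such as the hinge loss.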

This concludes the proof.    \(\square \)

Copyright information

© 2017 Springer International Publishing AG

About this paper

Cite this paper

Han, Z., Liao, S. (2017). Stochastic Online Kernel Selection with Instantaneous Loss in Random Feature Space. In: Liu, D., Xie, S., Li, Y., Zhao, D., El-Alfy, ES. (eds) Neural Information Processing. ICONIP 2017. Lecture Notes in Computer Science, vol 10634. Springer, Cham. https://doi.org/10.1007/978-3-319-70087-8_4

  • DOI: https://doi.org/10.1007/978-3-319-70087-8_4

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-319-70086-1

  • Online ISBN: 978-3-319-70087-8

  • eBook Packages: Computer Science (R0)
