Isolation kernel: the X factor in efficient and effective large scale online kernel learning


Abstract

Large scale online kernel learning aims to build an efficient and scalable kernel-based predictive model incrementally from a sequence of potentially infinite data points. Current state-of-the-art approaches focus on improving efficiency. Two key ways to gain efficiency through approximation are (1) limiting the number of support vectors and (2) using an approximate feature map; both often employ a kernel whose feature map has intractable dimensionality. While these approaches can deal with large scale datasets efficiently, they achieve this by compromising predictive accuracy because of the approximation. We offer an alternative approach that puts the kernel itself at the heart of the method: it creates a sparse and finite-dimensional feature map of a kernel called Isolation Kernel. With this approach, achieving the above aim of large scale online kernel learning becomes extremely simple: use Isolation Kernel instead of a kernel whose feature map has intractable dimensionality. We show that, using Isolation Kernel, large scale online kernel learning can be achieved efficiently without sacrificing accuracy.
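
As a rough, hedged illustration of the idea described above (not the authors' exact construction), the sketch below builds a sparse and finite-dimensional feature map of Isolation Kernel using nearest-neighbour (Voronoi) partitioning in the spirit of Qin et al. (2019): each point is mapped to t one-hot blocks of length psi, one block per random partitioning, and a linear online learner can then be trained directly on these features.

```python
import numpy as np

def ik_feature_map(X, psi=16, t=100, seed=0):
    """Hedged sketch of an Isolation Kernel feature map (not the paper's code).

    Each of the t partitionings is defined by psi points sampled from X; a point
    is represented by a one-hot indicator of the cell (nearest sampled point) it
    falls into, giving a sparse binary vector of length t * psi."""
    rng = np.random.default_rng(seed)
    n = X.shape[0]
    centres = [X[rng.choice(n, size=psi, replace=False)] for _ in range(t)]
    phi = np.zeros((n, t * psi))
    for i, C in enumerate(centres):
        # assign every point to the nearest of the psi sampled points
        dist = ((X[:, None, :] - C[None, :, :]) ** 2).sum(axis=-1)
        phi[np.arange(n), i * psi + dist.argmin(axis=1)] = 1.0
    return phi
```

Under this sketch, the kernel value of two points is the fraction of partitionings in which they fall into the same cell, i.e. the dot product of their feature vectors divided by t; a linear learner operates on phi directly and never needs the kernel matrix.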

Notes

  1. The code is available at https://github.com/LIBOL/KOL.

  2. As pointed out in Ting et al. (2018), the Laplacian kernel can be expressed as \(\exp (-\gamma \sum ^d_{\jmath =1} |x_\jmath - y_\jmath |) = \psi ^{-\frac{1}{d} \sum ^d_{\jmath =1} |x_\jmath - y_\jmath |}\), where \(\gamma = \frac{\log (\psi )}{d}\). The Laplacian kernel has been shown to be competitive with the Gaussian kernel in SVM in a recent study (Ting et al. 2018). A numerical check of this identity is sketched after these notes.

  3. Note that this parameter is called budget in Lu et al. (2016), but it is different from the budget used to limit the number of support vectors mentioned in Section 2.1.

  4. The two-class conversions from the original class labels were done for three multi-class datasets: mnist: \(\{3, 4, 6, 7, 9\}\) and \(\{0, 1, 2, 5, 8\}\). smallNORB: \(\{1,4\}\) and \(\{0, 2, 3\}\). cifar-10: \(\{ 0, 2, 3, 4, 5\}\) and \(\{1, 6, 7, 8, 9\}\).

  5. This simulation is more realistic than previous online experiments, which assume that the class label of each point is available immediately after a prediction is made so that the model can be updated (Lu et al. 2016). In practice, the algorithm can be switched to training mode whenever class labels become available, either for part of a block or for the entire block (a sketch of this block-wise protocol is given after these notes).

  6. Note that the first points in the accuracy plots can swing wildly because each is the accuracy on the first data block.

  7. The setting \(b=0\) is used in the experiments in this section; it appears to be the best setting for the datasets used here. \(b>0\) not only produces poorer accuracy on some datasets, but also yields a larger number of features.

  8. LIBSVM appears to perform slightly worse than LIBLINEAR on some datasets with IK\(_a\). This is due to the different heuristics used in their optimisation procedures. Otherwise, both LIBSVM and LIBLINEAR should yield exactly the same accuracy, since the same feature map of Isolation Kernel is used (see the sketch after these notes).

  9. Available in Python at https://scikit-learn.org/stable/modules/kernel_approximation.html (see the usage sketch after these notes).

  10. One study has attributed the poorer SVM accuracy, in comparison with the Nyström method, to the data independence of the feature generation (Yang et al. 2012).

  11. A sparse representation yields vectors having many zero values, where a feature with a zero value means that the feature is irrelevant.

  12. Product Quantization (an improvement over vector quantization) aims to reduce storage and retrieval time for conducting approximate nearest neighbour search.
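
A quick numerical check of the identity in footnote 2 (an illustration only; the variable names are ours):

```python
import numpy as np

d, psi = 10, 64
rng = np.random.default_rng(0)
x, y = rng.random(d), rng.random(d)

gamma = np.log(psi) / d                # gamma = log(psi) / d
l1 = np.abs(x - y).sum()               # sum_j |x_j - y_j|
laplacian = np.exp(-gamma * l1)        # Laplacian kernel value
ik_form = psi ** (-l1 / d)             # psi^{-(1/d) sum_j |x_j - y_j|}

assert np.isclose(laplacian, ik_form)  # the two forms coincide
```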
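
Footnote 5 describes a block-wise protocol: predictions are made on each incoming block before its labels arrive, and the model is updated once they do. A minimal sketch under assumed interfaces (e.g. a scikit-learn estimator offering predict and partial_fit), not the authors' experimental code:

```python
def blockwise_online_evaluation(model, blocks, classes):
    """Predict on each block first, then update the model once its labels arrive."""
    accuracies = []
    fitted = False
    for X_block, y_block in blocks:
        if fitted:
            # prediction is made before the block's labels are available
            accuracies.append((model.predict(X_block) == y_block).mean())
        # switch to training mode once the labels of this block are available
        model.partial_fit(X_block, y_block, classes=classes)
        fitted = True
    return accuracies
```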
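
Footnote 8 states that, given the same explicit feature map, LIBLINEAR and LIBSVM with a linear kernel should reach essentially the same accuracy, differing only through their optimisation heuristics. A hedged scikit-learn analogue of this check, using stand-in features rather than the Isolation Kernel feature map of the paper:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.svm import LinearSVC, SVC

# Stand-in features; in the paper these would be the Isolation Kernel feature map.
X, y = make_classification(n_samples=2000, n_features=50, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

linear_svm = LinearSVC(C=1.0, max_iter=10000).fit(X_tr, y_tr)  # LIBLINEAR backend
kernel_svm = SVC(kernel="linear", C=1.0).fit(X_tr, y_tr)       # LIBSVM backend

# The two test accuracies are typically very close but need not be identical.
print(linear_svm.score(X_te, y_te), kernel_svm.score(X_te, y_te))
```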
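
Footnote 9 points to scikit-learn's kernel approximation module. For context, a minimal usage sketch of the Nyström and random Fourier feature approximators available there (parameter values are illustrative, not those used in the paper):

```python
from sklearn.datasets import make_classification
from sklearn.kernel_approximation import Nystroem, RBFSampler
from sklearn.linear_model import SGDClassifier
from sklearn.pipeline import make_pipeline

X, y = make_classification(n_samples=2000, n_features=50, random_state=0)

# Nystroem: data-dependent approximation of an RBF kernel feeding a linear learner.
nystroem_clf = make_pipeline(Nystroem(gamma=0.1, n_components=200, random_state=0),
                             SGDClassifier(random_state=0)).fit(X, y)

# RBFSampler: data-independent random Fourier features as the alternative.
rff_clf = make_pipeline(RBFSampler(gamma=0.1, n_components=200, random_state=0),
                        SGDClassifier(random_state=0)).fit(X, y)
```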

References

  • Breiman L (2000) Some infinity theory for predictor ensembles. Tech Rep 577, Statistics Dept, University of California, Berkeley

  • Breiman L (2001) Random forests. Mach Learn 45(1):5–32

  • Cavallanti G, Cesa-Bianchi N, Gentile C (2007) Tracking the best hyperplane with a simple budget perceptron. Mach Learn 69(2):143–167

  • Chang CC, Lin CJ (2011) LIBSVM: a library for support vector machines. ACM Trans Intell Syst Technol 2(3):1–27

  • Chang YW, Hsieh CJ, Chang KW, Ringgaard M, Lin CJ (2010) Training and testing low-degree polynomial data mappings via linear SVM. J Mach Learn Res 11:1471–1490

  • Fan RE, Chang KW, Hsieh CJ, Wang XR, Lin CJ (2008) LIBLINEAR: a library for large linear classification. J Mach Learn Res 9:1871–1874

  • Felix XY, Suresh AT, Choromanski KM, Holtmann-Rice DN, Kumar S (2016) Orthogonal random features. In: Advances in Neural Information Processing Systems, pp 1975–1983

  • Geurts P, Ernst D, Wehenkel L (2006) Extremely randomized trees. Mach Learn 63(1):3–42

  • Huang P, Avron H, Sainath TN, Sindhwani V, Ramabhadran B (2014) Kernel methods match deep neural networks on TIMIT. In: Proceedings of the 2014 IEEE International Conference on Acoustics, Speech and Signal Processing, pp 205–209

  • Jose C, Goyal P, Aggrwal P, Varma M (2013) Local deep kernel learning for efficient non-linear SVM prediction. In: Proceedings of the 30th International Conference on Machine Learning, pp III–486–III–494

  • Kafai M, Eshghi K (2019) CROification: accurate kernel classification with the efficiency of sparse linear SVM. IEEE Trans Pattern Anal Mach Intell 41(1):34–48

  • Kivinen J, Smola AJ, Williamson RC (2001) Online learning with kernels. In: Proceedings of the 14th International Conference on Neural Information Processing Systems: Natural and Synthetic, pp 785–792

  • Liu FT, Ting KM, Zhou ZH (2008) Isolation forest. In: Proceedings of the IEEE International Conference on Data Mining, pp 413–422

  • Lu J, Hoi SCH, Wang J, Zhao P, Liu ZY (2016) Large scale online kernel learning. J Mach Learn Res 17(1):1613–1655

  • Maji S, Berg AC, Malik J (2013) Efficient classification for additive kernel SVMs. IEEE Trans Pattern Anal Mach Intell 35(1):66–77

  • Musco C, Musco C (2017) Recursive sampling for the Nyström method. In: Advances in Neural Information Processing Systems, pp 3833–3845

  • Qin X, Ting KM, Zhu Y, Lee VCS (2019) Nearest-neighbour-induced isolation similarity and its impact on density-based clustering. In: Proceedings of The Thirty-Third AAAI Conference on Artificial Intelligence, pp 4755–4762

  • Rahimi A, Recht B (2007) Random features for large-scale kernel machines. In: Advances in Neural Information Processing Systems, pp 1177–1184

  • Rakotomamonjy A, Bach FR, Canu S, Grandvalet Y (2008) SimpleMKL. J Mach Learn Res 9(Nov):2491–2521

  • Schölkopf B, Smola AJ (2001) Learning with kernels: support vector machines, regularization, optimization, and beyond. MIT Press, Cambridge

  • Shen Y, Chen T, Giannakis G (2018) Online ensemble multi-kernel learning adaptive to non-stationary and adversarial environments. In: Proceedings of the Twenty-First International Conference on Artificial Intelligence and Statistics, pp 2037–2046

  • Ting KM, Zhu Y, Zhou ZH (2018) Isolation kernel and its effect on SVM. In: Proceedings of the 24th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, ACM, pp 2329–2337

  • Ting KM, Xu BC, Washio T, Zhou ZH (2020) Isolation distributional kernel: a new tool for kernel based anomaly detection. In: Proceedings of the 26th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp 198–206

  • Vedaldi A, Zisserman A (2012a) Efficient additive kernels via explicit feature maps. IEEE Trans Pattern Anal Mach Intell 34(3):480–492

  • Vedaldi A, Zisserman A (2012b) Sparse kernel approximations for efficient classification and detection. In: Proceedings of the 2012 IEEE Conference on Computer Vision and Pattern Recognition, pp 2320–2327

  • Viet Le Q, Sarlos T, Smola AJ (2013) Fastfood: approximate kernel expansions in loglinear time. In: Proceedings of the 30th International Conference on Machine Learning, pp III–244–III–252

  • Wang Z, Vucetic S (2010) Online passive-aggressive algorithms on a budget. In: Proceedings of the Thirteenth International Conference on Artificial Intelligence and Statistics, pp 908–915

  • Wang Z, Crammer K, Vucetic S (2012) Breaking the curse of kernelization: budgeted stochastic gradient descent for large-scale SVM training. J Mach Learn Res 13(1):3103–3131

  • Williams CKI, Seeger M (2001) Using the Nyström method to speed up kernel machines. In: Advances in Neural Information Processing Systems 13. MIT Press, Cambridge, pp 682–688

  • Wu J, Ding L, Liao S (2017) Predictive Nyström method for kernel methods. Neurocomputing 234:116–125

  • Xu BC, Ting KM, Zhou ZH (2019) Isolation set-kernel and its application to multi-instance learning. In: Proceedings of the 25th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp 941–949

  • Xu BC, Ting KM, Jiang Y (2021) Isolation graph kernel. In: Proceedings of The Thirty-Fifth AAAI Conference on Artificial Intelligence, pp 10487–10495

  • Yang J, Sindhwani V, Fan Q, Avron H, Mahoney M (2014) Random Laplace feature maps for semigroup kernels on histograms. In: Proceedings of the 2014 IEEE Conference on Computer Vision and Pattern Recognition, pp 971–978

  • Yang T, Li YF, Mahdavi M, Jin R, Zhou ZH (2012) Nyström method vs random Fourier features: a theoretical and empirical comparison. In: Advances in Neural Information Processing Systems, pp 476–484

  • Zadeh P, Hosseini R, Sra S (2016) Geometric mean metric learning. In: Proceedings of the International Conference on Machine Learning, pp 2464–2471

Acknowledgements

Kai Ming Ting is supported by the Natural Science Foundation of China (62076120). Takashi Washio is supported by JST CREST Grant Number JPMJCR1666 and the Japan Society for the Promotion of Science (JSPS) KAKENHI Grant Number 20K21815. A discussion with Jaakko Peltonen at the Shonan Meeting in October 2018 provided the initial impetus for creating the feature map of Isolation Kernel described in this paper.

Author information

Correspondence to Kai Ming Ting.

Additional information

Responsible editor: Jingrui He.

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Cite this article

Ting, K.M., Wells, J.R. & Washio, T. Isolation kernel: the X factor in efficient and effective large scale online kernel learning. Data Min Knowl Disc 35, 2282–2312 (2021). https://doi.org/10.1007/s10618-021-00785-1
