A Sparse Online Approach for Streaming Data Classification via Prototype-Based Kernel Models

Coelho, David N.; Barreto, Guilherme A.

doi:10.1007/s11063-021-10701-9

Published: 16 January 2022

Neural Processing Letters, volume 54, pages 1679–1706 (2022)

Abstract

Processing big data streams with machine learning algorithms poses several challenges, such as limited time to train the models, hardware memory constraints, and concept drift. In this paper, we show that prototype-based kernel classifiers designed by sparsification procedures, such as the approximate linear dependence (ALD) method, provide an adequate tradeoff between the accuracy and the size complexity of kernelized nearest neighbor classifiers. The proposed approach automatically selects relevant samples from the training data stream to form a sparse dictionary of prototypes, which are then used in kernelized distance metrics to classify arriving samples on the fly. Additionally, the proposed method is fully adaptive, in the sense that it updates and removes prototypes from the dictionary, enabling it to learn continuously in nonstationary environments. The results obtained from a comprehensive set of computer simulations involving artificial and real streaming data sets indicate that the proposed algorithm can build models with low complexity and competitive classification error rates compared with state-of-the-art methods.
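
To make the dictionary-building idea concrete, the following is a minimal sketch of an ALD-style admission test as it is commonly stated in the kernel recursive least squares literature: a sample $\mathbf{x}$ joins the dictionary when the residual $\delta = k(\mathbf{x},\mathbf{x}) - \mathbf{k}^T\mathbf{K}^{-1}\mathbf{k}$ exceeds a threshold. This is an illustration under that assumption, not the authors' implementation; it omits class labels and the prototype update and removal steps, and the names `ald_residual`, `build_dictionary`, and `nu` are hypothetical.

```python
import math


def gaussian_kernel(u, v, gamma=1.0):
    """Gaussian kernel with width gamma (assumed parameterization)."""
    sq = sum((a - b) ** 2 for a, b in zip(u, v))
    return math.exp(-sq / (2.0 * gamma ** 2))


def solve(A, b):
    """Solve A z = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    M = [row[:] + [b[i]] for i, row in enumerate(A)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            f = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= f * M[col][c]
    z = [0.0] * n
    for i in reversed(range(n)):
        s = sum(M[i][j] * z[j] for j in range(i + 1, n))
        z[i] = (M[i][n] - s) / M[i][i]
    return z


def ald_residual(dictionary, x, kernel):
    """delta = k(x, x) - k_vec^T K^{-1} k_vec for the current dictionary."""
    K = [[kernel(p, q) for q in dictionary] for p in dictionary]
    k_vec = [kernel(p, x) for p in dictionary]
    a = solve(K, k_vec)
    return kernel(x, x) - sum(ki * ai for ki, ai in zip(k_vec, a))


def build_dictionary(stream, kernel, nu=0.1):
    """Keep only samples that are not approximately dependent on the dictionary."""
    dictionary = []
    for x in stream:
        if not dictionary or ald_residual(dictionary, x, kernel) > nu:
            dictionary.append(x)
    return dictionary


stream = [(0.0, 0.0), (0.01, 0.0), (3.0, 3.0), (3.0, 3.01), (0.0, 0.0)]
prototypes = build_dictionary(stream, gaussian_kernel, nu=0.1)
```

In this toy stream, near-duplicates of an existing prototype yield a residual close to zero and are discarded, so the dictionary stays small. A streaming implementation would typically update $\mathbf{K}^{-1}$ recursively rather than re-solving the system for every sample.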

Notes

https://github.com/davidcoelho89/spok-nn
Sometimes normalized by the quadratic norm of the difference vector, i.e. $\Vert \mathbf {w}_{i^*}(t)-\mathbf {x}(t)\Vert ^2$.
A binary vector of length $m_{t-1}$ with the $k$-th element set to 1. All the other elements are set to zero.
https://github.com/vlosing/SAMkNN

Acknowledgements

This study was financed in part by the Coordenação de Aperfeiçoamento de Pessoal de Nível Superior - Brasil (CAPES - Finance Code 001) and by the Brazilian National Research Council (CNPq) via the grant 309379/2019-9.

Author information

Authors and Affiliations

Graduate Program in Teleinformatics Engineering, Federal University of Ceará, Center of Technology, Campus of Pici, Fortaleza, Ceará, Brazil
David N. Coelho & Guilherme A. Barreto

Corresponding author

Correspondence to Guilherme A. Barreto.

Ethics declarations

Conflict of interest

The authors declare that they have no conflict of interest.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Appendix - Kernelized Distance Calculation

This section demonstrates that kernel functions based on the squared Euclidean distance $\left\| \mathbf{x} - \mathbf{w} _i \right\| _{2}^{2}$, such as the widely used Gaussian radial basis function (RBF), lead to kernelized distances that reduce to the standard Euclidean distance or to a monotonically increasing function of it. Also, the derivatives of these functions are proportional to either the difference between the argument vectors $(\mathbf{x} - \mathbf{w} _i)$ or a normalized version of this difference.

Firstly, considering the linear kernel $k(\mathbf {u},\mathbf {v})=\mathbf {u}^T\mathbf {v}$ and Eqs. (9) and (11), the cost function and its derivative are

$$\begin{aligned} J_i(\mathbf{x} )= & {} \mathbf{x} ^{T}{} \mathbf{x} -2\mathbf{w} _{i}^{T}{} \mathbf{x} + \mathbf{w} _i^{T}{} \mathbf{w} _i, \end{aligned}$$

(33)

$$\begin{aligned}= & {} (\mathbf{x} - \mathbf{w} _i)^T(\mathbf{x} - \mathbf{w} _i), \end{aligned}$$

(34)

$$\begin{aligned}= & {} \left\| \mathbf{x} - \mathbf{w} _i \right\| _{2}^{2}, \end{aligned}$$

(35)

whose gradient vector with respect to the prototype $\mathbf{w}_i$ is given by

$$\begin{aligned} \nabla J_i(\mathbf {x}) = 2(\mathbf{w} _{i} - \mathbf{x} ). \end{aligned}$$

(36)
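
As a quick numerical sanity check of Eqs. (33)–(36), the sketch below evaluates the kernelized cost $k(\mathbf{x},\mathbf{x}) - 2k(\mathbf{w}_i,\mathbf{x}) + k(\mathbf{w}_i,\mathbf{w}_i)$ with the linear kernel and compares it with the squared Euclidean distance. It also compares Eq. (36), read as the gradient with respect to $\mathbf{w}_i$, against a central finite difference. The function names and test vectors are illustrative assumptions, not part of the original work.

```python
def linear_kernel(u, v):
    """k(u, v) = u^T v."""
    return sum(a * b for a, b in zip(u, v))


def kernel_cost(kernel, x, w):
    """Kernelized cost in the form of Eq. (33): k(x,x) - 2 k(w,x) + k(w,w)."""
    return kernel(x, x) - 2.0 * kernel(w, x) + kernel(w, w)


def squared_euclidean(x, w):
    """Eq. (35): ||x - w||_2^2."""
    return sum((a - b) ** 2 for a, b in zip(x, w))


def numerical_gradient_w(kernel, x, w, h=1e-6):
    """Central finite-difference gradient of the cost with respect to w."""
    grad = []
    for j in range(len(w)):
        w_plus, w_minus = list(w), list(w)
        w_plus[j] += h
        w_minus[j] -= h
        diff = kernel_cost(kernel, x, w_plus) - kernel_cost(kernel, x, w_minus)
        grad.append(diff / (2.0 * h))
    return grad


x = [1.0, 2.0, -0.5]
w = [0.5, -1.0, 2.0]

j_linear = kernel_cost(linear_kernel, x, w)
grad_analytic = [2.0 * (wj - xj) for wj, xj in zip(w, x)]  # Eq. (36)
grad_numeric = numerical_gradient_w(linear_kernel, x, w)
```

For these vectors the cost equals 15.5, which matches $\left\| \mathbf{x} - \mathbf{w}_i \right\|_2^2$, and the two gradients agree up to floating-point error.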

Now, considering the Gaussian kernel, the cost function and its gradient vector are given by

$$\begin{aligned} J_i \left( \mathbf {x} \right)= & {} \exp \left( - \frac{\left\| \mathbf{x} -\mathbf{x} \right\| _{2}^{2}}{2\gamma ^2} \right) -2 \exp \left( - \frac{\left\| \mathbf{w} _i-\mathbf{x} \right\| _{2}^{2}}{2\gamma ^2} \right) + \exp \left( - \frac{\left\| \mathbf{w} _i-\mathbf{w} _i \right\| _{2}^{2}}{2\gamma ^2} \right) \end{aligned}$$

(37)

$$\begin{aligned} J_i \left( \mathbf {x} \right)= & {} 2 - 2 \exp \left( - \frac{\left\| \mathbf{w} _i-\mathbf{x} \right\| _{2}^{2}}{2\gamma ^2} \right) . \end{aligned}$$

(38)

$$\begin{aligned} \nabla J_{i}(\mathbf{x} )= & {} - 2 \exp \left( - \frac{\left\| \mathbf{w} _i-\mathbf{x} \right\| _{2}^{2}}{2\gamma ^2} \right) \left( - \frac{1}{2\gamma ^2} \right) 2(\mathbf{w} _{i} - \mathbf{x} )\nonumber \\ \nabla J_{i}(\mathbf{x} )= & {} \left( \frac{2}{\gamma ^2} \right) \exp \left( - \frac{\left\| \mathbf{w} _i-\mathbf{x} \right\| _{2}^{2}}{2\gamma ^2} \right) (\mathbf{w} _{i} - \mathbf{x} ). \end{aligned}$$

(39)

Without loss of generality, we can set $\gamma =\frac{1}{\sqrt{2}}$. Thus, Eq. (39) can be rewritten as

$$\begin{aligned} \nabla J_{i}(\mathbf{x} ) = \left[ \frac{4}{\exp \left( \left\| \mathbf{w} _i-\mathbf{x} \right\| _{2}^{2} \right) }\right] (\mathbf{w} _{i} - \mathbf{x} ). \end{aligned}$$

(40)
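
The Gaussian-kernel expressions can be checked numerically in the same spirit. The sketch below, with hypothetical function names and arbitrary test vectors, verifies that Eq. (37) collapses to Eq. (38), that Eq. (39) matches a finite-difference gradient with respect to $\mathbf{w}_i$, and that the scalar factor becomes $4/\exp(\left\| \mathbf{w}_i-\mathbf{x} \right\|_2^2)$ when $\gamma = 1/\sqrt{2}$, as in Eq. (40).

```python
import math


def sq_dist(u, v):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((a - b) ** 2 for a, b in zip(u, v))


def gaussian_kernel(u, v, gamma):
    return math.exp(-sq_dist(u, v) / (2.0 * gamma ** 2))


def gaussian_cost(x, w, gamma):
    """Eq. (37): k(x, x) - 2 k(w, x) + k(w, w)."""
    return (gaussian_kernel(x, x, gamma)
            - 2.0 * gaussian_kernel(w, x, gamma)
            + gaussian_kernel(w, w, gamma))


def gaussian_gradient(x, w, gamma):
    """Eq. (39), read as the gradient with respect to the prototype w."""
    scale = (2.0 / gamma ** 2) * gaussian_kernel(w, x, gamma)
    return [scale * (wj - xj) for wj, xj in zip(w, x)]


def central_difference(cost, x, w, gamma, h=1e-6):
    """Finite-difference gradient of cost(x, w, gamma) with respect to w."""
    grad = []
    for j in range(len(w)):
        w_plus, w_minus = list(w), list(w)
        w_plus[j] += h
        w_minus[j] -= h
        grad.append((cost(x, w_plus, gamma) - cost(x, w_minus, gamma)) / (2.0 * h))
    return grad


x = [0.2, -0.1]
w = [0.5, 0.3]
gamma = 1.0 / math.sqrt(2.0)

closed_form = 2.0 - 2.0 * math.exp(-sq_dist(w, x) / (2.0 * gamma ** 2))  # Eq. (38)
analytic = gaussian_gradient(x, w, gamma)
numeric = central_difference(gaussian_cost, x, w, gamma)
eq40_scale = 4.0 / math.exp(sq_dist(w, x))  # scalar factor in Eq. (40)
```

The check relies only on $k(\mathbf{x},\mathbf{x}) = k(\mathbf{w}_i,\mathbf{w}_i) = 1$ for the Gaussian kernel, which is what turns the first and third terms of Eq. (37) into the constant 2.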

Finally, the Cauchy kernel function leads, respectively, to the following cost function and gradient vector:

$$\begin{aligned} J_i(\mathbf{x} )= & {} \left( 1 + \frac{\left\| \mathbf{x} - \mathbf{x} \right\| ^2}{\gamma ^2} \right) ^{-1} -2\left( 1 + \frac{\left\| \mathbf{w} _i - \mathbf{x} \right\| ^2}{\gamma ^2} \right) ^{-1} + \left( 1 + \frac{\left\| \mathbf{w} _i - \mathbf{w} _i \right\| ^2}{\gamma ^2} \right) ^{-1} \end{aligned}$$

(41)

$$\begin{aligned}= & {} 2 - 2\left( 1 + \frac{\left\| \mathbf{w} _i - \mathbf{x} \right\| ^2}{\gamma ^2} \right) ^{-1} = 2 - 2\left( \frac{\gamma ^2}{\gamma ^2 + \left\| \mathbf{w} _i - \mathbf{x} \right\| ^2} \right) , \end{aligned}$$

(42)

and

$$\begin{aligned}&\nabla J_{i}(\mathbf{x} ) = - 2\gamma ^2\left[ -\frac{1}{\left( \gamma ^2 + \left\| \mathbf{w} _i - \mathbf{x} \right\| ^2 \right) ^2} \right] 2\left( \mathbf{w} _i - \mathbf{x} \right) \nonumber \\&\quad = \left[ \frac{4\gamma ^2}{\left( \gamma ^2 + \left\| \mathbf{w} _i - \mathbf{x} \right\| ^2 \right) ^2} \right] \left( \mathbf{w} _i - \mathbf{x} \right) \end{aligned}$$

(43)
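
The Cauchy-kernel case can be verified in the same way. The sketch below uses hypothetical function names and an arbitrary value of $\gamma$ to confirm that Eq. (41) reduces to Eq. (42) and that Eq. (43) agrees with a finite-difference gradient taken with respect to $\mathbf{w}_i$.

```python
def sq_dist(u, v):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((a - b) ** 2 for a, b in zip(u, v))


def cauchy_kernel(u, v, gamma):
    """k(u, v) = (1 + ||u - v||^2 / gamma^2)^(-1)."""
    return 1.0 / (1.0 + sq_dist(u, v) / gamma ** 2)


def cauchy_cost(x, w, gamma):
    """Eq. (41): k(x, x) - 2 k(w, x) + k(w, w)."""
    return (cauchy_kernel(x, x, gamma)
            - 2.0 * cauchy_kernel(w, x, gamma)
            + cauchy_kernel(w, w, gamma))


def cauchy_gradient(x, w, gamma):
    """Eq. (43), read as the gradient with respect to the prototype w."""
    scale = 4.0 * gamma ** 2 / (gamma ** 2 + sq_dist(w, x)) ** 2
    return [scale * (wj - xj) for wj, xj in zip(w, x)]


def central_difference(cost, x, w, gamma, h=1e-6):
    """Finite-difference gradient of cost(x, w, gamma) with respect to w."""
    grad = []
    for j in range(len(w)):
        w_plus, w_minus = list(w), list(w)
        w_plus[j] += h
        w_minus[j] -= h
        grad.append((cost(x, w_plus, gamma) - cost(x, w_minus, gamma)) / (2.0 * h))
    return grad


x = [0.2, -0.1]
w = [0.5, 0.3]
gamma = 1.5  # arbitrary illustrative value

closed_form = 2.0 - 2.0 * gamma ** 2 / (gamma ** 2 + sq_dist(w, x))  # Eq. (42)
analytic = cauchy_gradient(x, w, gamma)
numeric = central_difference(cauchy_cost, x, w, gamma)
```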

As one can infer from Eqs. (35), (38) and (42), each of these cost functions $J_i \left( \mathbf {x} \right) $ attains its minimum when the squared Euclidean distance $\left\| \mathbf{w} _i-\mathbf{x} \right\| _{2}^{2}$ is minimum. Furthermore, for fixed hyperparameters, the gradient vectors resulting from these cost functions are proportional to $\left( \mathbf{w} _i - \mathbf{x} \right) $, differing only by scalar factors that depend on $\left\| \mathbf{w} _i-\mathbf{x} \right\| _{2}^{2}$. For a given iteration, these factors are constant. This feature motivated us to use a common learning rule, Eq. (20), to update the prototypes whenever we use one of these kernel functions.
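
One practical consequence of this argument is that the winning (nearest) prototype is the same under all three kernelized distances, because Eqs. (38) and (42) are increasing functions of the squared Euclidean distance in Eq. (35). The sketch below illustrates this on randomly generated prototypes and queries; the data, the seed, the value $\gamma = 1$, and the function names are assumptions made only for the illustration.

```python
import math
import random


def sq_dist(u, v):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((a - b) ** 2 for a, b in zip(u, v))


def linear_cost(x, w):
    """Eq. (35)."""
    return sq_dist(x, w)


def gaussian_cost(x, w, gamma=1.0):
    """Eq. (38)."""
    return 2.0 - 2.0 * math.exp(-sq_dist(w, x) / (2.0 * gamma ** 2))


def cauchy_cost(x, w, gamma=1.0):
    """Eq. (42)."""
    return 2.0 - 2.0 * gamma ** 2 / (gamma ** 2 + sq_dist(w, x))


def nearest(cost, x, prototypes):
    """Index of the prototype with the smallest cost for query x."""
    return min(range(len(prototypes)), key=lambda i: cost(x, prototypes[i]))


rng = random.Random(0)  # fixed seed so the illustration is reproducible
prototypes = [[rng.uniform(-1, 1), rng.uniform(-1, 1)] for _ in range(20)]
queries = [[rng.uniform(-1, 1), rng.uniform(-1, 1)] for _ in range(100)]

agree = all(
    nearest(linear_cost, q, prototypes)
    == nearest(gaussian_cost, q, prototypes)
    == nearest(cauchy_cost, q, prototypes)
    for q in queries
)
```

In exact arithmetic the three rankings coincide for any data. In floating point, agreement could break only when distances are so large that the Gaussian term underflows, or when two prototypes are nearly equidistant from the query.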

About this article

Cite this article

Coelho, D.N., Barreto, G.A. A Sparse Online Approach for Streaming Data Classification via Prototype-Based Kernel Models. Neural Process Lett 54, 1679–1706 (2022). https://doi.org/10.1007/s11063-021-10701-9

Accepted: 17 November 2021
Issue Date: June 2022
