Abstract
Processing big data streams with machine learning algorithms poses several challenges, such as limited time to train the models, hardware memory constraints, and concept drift. In this paper, we show that prototype-based kernel classifiers designed by sparsification procedures, such as the approximate linear dependence (ALD) method, provide an adequate tradeoff between accuracy and size complexity of kernelized nearest neighbor classifiers. The proposed approach automatically selects relevant samples from the training data stream to form a sparse dictionary of prototypes, which are then used in kernelized distance metrics to classify arriving samples on the fly. Additionally, the proposed method is fully adaptive, in the sense that it updates and removes prototypes from the dictionary, enabling it to learn continuously in nonstationary environments. The results obtained from a comprehensive set of computer simulations involving artificial and real streaming data sets indicate that the proposed algorithm can build models with low complexity and competitive classification error rates compared to the state of the art.
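To make the sparsification idea concrete, the following is a minimal sketch of an ALD-style admission test in the spirit of Engel et al. (2004), not the authors' implementation. It assumes a Gaussian kernel with the scaling exp(-||u - v||^2 / (2 gamma^2)), a single class-agnostic dictionary, and a hypothetical threshold `nu`; the function names are illustrative, and the prototype update and removal steps described above are omitted.

```python
import numpy as np

def gaussian_kernel(u, v, gamma=1.0):
    """Gaussian RBF kernel; the 1/(2*gamma**2) scaling is an assumption."""
    diff = np.asarray(u, dtype=float) - np.asarray(v, dtype=float)
    return float(np.exp(-np.dot(diff, diff) / (2.0 * gamma ** 2)))

def ald_residual(dictionary, x, kernel):
    """Squared feature-space residual of x w.r.t. the span of the dictionary."""
    K = np.array([[kernel(a, b) for b in dictionary] for a in dictionary])
    k_x = np.array([kernel(a, x) for a in dictionary])
    # Small ridge term (an assumption) keeps the linear solve numerically stable.
    coeffs = np.linalg.solve(K + 1e-10 * np.eye(len(dictionary)), k_x)
    return kernel(x, x) - float(k_x @ coeffs)

def build_dictionary(stream, kernel, nu=0.1):
    """Admit a sample only if it is NOT approximately linearly dependent."""
    dictionary = []
    for x in stream:
        if not dictionary or ald_residual(dictionary, x, kernel) > nu:
            dictionary.append(np.asarray(x, dtype=float))
    return dictionary

# Toy stream: two tight clusters, so only one prototype per cluster is kept.
stream = [[0.0, 0.0], [0.0, 0.01], [3.0, 3.0], [3.01, 3.0], [0.0, 0.0]]
prototypes = build_dictionary(stream, gaussian_kernel, nu=0.1)
print(len(prototypes))
```

This sketch recomputes the kernel matrix for every arriving sample for clarity; an online implementation would typically update the inverse kernel matrix recursively instead.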
Notes
Sometimes normalized by the quadratic norm of the difference vector, i.e. \(\Vert \mathbf {w}_{i^*}(t)-\mathbf {x}(t)\Vert ^2\).
A binary vector of length \(m_{t-1}\) with the k-th element set to 1. All the other elements are set to zero.
Acknowledgements
This study was financed in part by the Coordenação de Aperfeiçoamento de Pessoal de Nível Superior - Brasil (CAPES - Finance Code 001) and by the Brazilian National Research Council (CNPq) via the grant 309379/2019-9.
Ethics declarations
Conflict of interest
The authors declare that they have no conflict of interest.
Additional information
Publisher's Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Appendix - Kernelized Distance Calculation
This section demonstrates that kernel functions based on the squared Euclidean distance \(\left\| \mathbf{x} - \mathbf{w} _i \right\| _{2}^{2}\), such as the widely used Gaussian radial basis function (RBF), lead to kernelized distances that reduce to the standard Euclidean distance function. Also, the derivatives of these functions are proportional to either the difference between the argument vectors \((\mathbf{x} - \mathbf{w} _i)\) or a normalized version of this difference.
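As a rough guide to the argument, the general pattern can be sketched as follows. This is an illustrative derivation under stated assumptions, not a reproduction of the numbered equations of the paper: it assumes the kernelized distance has the usual feature-space form \(J_i(\mathbf{x}) = k(\mathbf{x},\mathbf{x}) - 2k(\mathbf{x},\mathbf{w}_i) + k(\mathbf{w}_i,\mathbf{w}_i)\), and the symbols \(f\) and \(d_i^2\) are introduced here only for illustration.

```latex
% Assumed form of the kernelized distance (feature-space squared distance):
%   J_i(x) = k(x,x) - 2 k(x,w_i) + k(w_i,w_i)

% Linear kernel k(u,v) = u^T v:
\[
J_i(\mathbf{x}) = \left\| \mathbf{x} - \mathbf{w}_i \right\|_2^2,
\qquad
\nabla_{\mathbf{w}_i} J_i(\mathbf{x}) = 2\left( \mathbf{w}_i - \mathbf{x} \right).
\]

% Any radial kernel k(u,v) = f(||u - v||^2), with d_i^2 = ||w_i - x||^2:
\[
J_i(\mathbf{x}) = 2 f(0) - 2 f\!\left( d_i^2 \right),
\qquad
\nabla_{\mathbf{w}_i} J_i(\mathbf{x}) = -4 f'\!\left( d_i^2 \right) \left( \mathbf{w}_i - \mathbf{x} \right).
\]

% If f is strictly decreasing, then J_i is strictly increasing in d_i^2 and
% -4 f'(d_i^2) > 0, so the gradient is a positive multiple of (w_i - x).
```

Under these assumptions, minimizing \(J_i(\mathbf{x})\) over the prototypes selects the same winner as minimizing the squared Euclidean distance, and the gradient always points along \(\left( \mathbf{w}_i - \mathbf{x} \right)\) with a scalar factor that depends only on \(d_i^2\).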
Firstly, considering the linear kernel \(k(\mathbf {u},\mathbf {v})=\mathbf {u}^T\mathbf {v}\) and Eqs. (9) and (11), the cost function and its derivative are
whose gradient vector is given by
Now, considering the Gaussian kernel, the cost function and its gradient vector are given by
Without loss of generality, we can set \(\gamma =\frac{1}{\sqrt{2}}\). Thus, Eq. (39) can be rewritten as
Finally, the Cauchy kernel function leads, respectively, to the following cost function and gradient vector:
and
As one can infer from Eqs. (35), (38) and (42), the minimum value of these cost functions \(J_i \left( \mathbf {x} \right) \) occurs when the squared Euclidean distance \(\left\| \mathbf{w} _i-\mathbf{x} \right\| _{2}^{2}\) is minimum. Furthermore, for fixed hyperparameters, the gradient vectors resulting from these cost functions are proportional to \(\left( \mathbf{w} _i - \mathbf{x} \right) \), differing only by factors that depend on \(\left\| \mathbf{w} _i-\mathbf{x} \right\| _{2}^{2}\). For a given iteration, these factors are constant. This feature motivated us to use the common learning rule in Eq. (20) to update the prototypes whenever we use one of these kernel functions.
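The two claims above can be checked numerically. The sketch below is illustrative only: it assumes the kernelized distance \(k(\mathbf{x},\mathbf{x}) - 2k(\mathbf{x},\mathbf{w}_i) + k(\mathbf{w}_i,\mathbf{w}_i)\), a Gaussian kernel scaled as exp(-d^2 / (2 gamma^2)) and a Cauchy kernel written as 1 / (1 + d^2 / sigma^2); the parameter names `gamma` and `sigma`, the toy data, and all function names are hypothetical.

```python
import numpy as np

def linear_kernel(u, v):
    return float(np.dot(u, v))

def gaussian_kernel(u, v, gamma=1.0):
    # Parametrization exp(-||u - v||^2 / (2 gamma^2)) is an assumption.
    return float(np.exp(-np.sum((u - v) ** 2) / (2.0 * gamma ** 2)))

def cauchy_kernel(u, v, sigma=1.0):
    # Parametrization 1 / (1 + ||u - v||^2 / sigma^2) is an assumption.
    return float(1.0 / (1.0 + np.sum((u - v) ** 2) / sigma ** 2))

def kernelized_distance(x, w, kernel):
    """Squared distance between x and w in the kernel-induced feature space."""
    return kernel(x, x) - 2.0 * kernel(x, w) + kernel(w, w)

def winner(x, prototypes, kernel):
    """Index of the prototype closest to x under the kernelized distance."""
    return int(np.argmin([kernelized_distance(x, w, kernel) for w in prototypes]))

def numerical_gradient(x, w, kernel, eps=1e-6):
    """Central-difference gradient of the kernelized distance w.r.t. w."""
    grad = np.zeros_like(w)
    for j in range(w.size):
        step = np.zeros_like(w)
        step[j] = eps
        grad[j] = (kernelized_distance(x, w + step, kernel)
                   - kernelized_distance(x, w - step, kernel)) / (2.0 * eps)
    return grad

def cosine(a, b):
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

KERNELS = {"linear": linear_kernel, "gaussian": gaussian_kernel, "cauchy": cauchy_kernel}

x = np.array([0.5, -1.0, 2.0])
prototypes = np.array([[0.0, 0.0, 0.0], [1.0, 0.0, 1.5],
                       [3.0, 3.0, 3.0], [-2.0, 1.0, 0.0]])

# Claim 1: every kernel selects the same winner as the Euclidean distance.
euclidean_winner = int(np.argmin(np.sum((prototypes - x) ** 2, axis=1)))
kernel_winners = {name: winner(x, prototypes, k) for name, k in KERNELS.items()}

# Claim 2: each gradient is a positive multiple of (w_i - x), i.e. cosine near 1.
w = prototypes[1]
alignments = {name: cosine(numerical_gradient(x, w, k), w - x)
              for name, k in KERNELS.items()}

print(euclidean_winner, kernel_winners)
print(alignments)
```

Because the gradients differ only by a positive scalar, a single update rule of the form "move the prototype along \(\left( \mathbf{x} - \mathbf{w}_i \right)\)" with an appropriately chosen step size is consistent with all three kernels in this sketch.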
Cite this article
Coelho, D.N., Barreto, G.A. A Sparse Online Approach for Streaming Data Classification via Prototype-Based Kernel Models. Neural Process Lett 54, 1679–1706 (2022). https://doi.org/10.1007/s11063-021-10701-9
DOI: https://doi.org/10.1007/s11063-021-10701-9