Abstract
Accurate online density estimation is crucial to numerous applications built on streaming data. Existing online approaches to density estimation lack prompt adaptability and robustness when facing concept-drifting and noisy data streams, resulting in delayed or even deteriorated approximations. To alleviate this issue, we first propose an adaptive local online kernel density estimator (ALoKDE) for real-time density estimation on data streams. ALoKDE consists of two tightly integrated strategies: (1) a statistical test for concept drift detection and (2) an adaptive weighted local online density estimation once a drift occurs. Specifically, using a weighted form, ALoKDE seeks to provide an unbiased estimate by combining the statistical characteristics of the most recently learned distribution with any distributional change introduced by each incoming instance. A robust variant, R-ALoKDE, is further developed to effectively handle data streams with varied types and levels of noise. Moreover, we analyze the asymptotic properties of ALoKDE and R-ALoKDE and derive their theoretical error bounds in terms of bias, variance, MSE, and MISE. Extensive comparative studies on various artificial and real-world (noisy) streaming data demonstrate the efficacy of ALoKDE and R-ALoKDE in online density estimation and real-time classification (with noise).
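For intuition, the sketch below illustrates the general recipe the abstract describes: buffer incoming instances, flag a drift with a two-sample test, build a local KDE from the newest instances, and blend it into the running estimate through a convex combination. It is only a minimal illustration, not the authors' algorithm: the Gaussian kernel, the Kolmogorov–Smirnov test, the chunk-based processing, the fixed bandwidth h, the fixed mixing weight lam, and all class and function names are assumptions made for exposition.

```python
import numpy as np
from scipy.stats import ks_2samp


def gaussian_kde_eval(x, samples, h):
    """Evaluate a univariate Gaussian KDE with bandwidth h at the point x."""
    u = (x - np.asarray(samples, dtype=float)) / h
    return float(np.mean(np.exp(-0.5 * u ** 2)) / (h * np.sqrt(2.0 * np.pi)))


class WeightedLocalOnlineKDE:
    """Illustrative sketch of an ALoKDE-style estimator: when a two-sample
    test flags a drift, a local KDE built from the newest chunk is blended
    into the running estimate via
        f_{t+1}(x) = lam * f_t_kde(x) + (1 - lam) * f_t(x).
    """

    def __init__(self, chunk_size=100, h=0.3, lam=0.5, alpha=0.01):
        self.chunk_size = chunk_size   # number of instances per test chunk
        self.h = h                     # kernel bandwidth (fixed here for simplicity)
        self.lam = lam                 # mixing weight; a constant stand-in for lambda_t
        self.alpha = alpha             # significance level of the drift test
        self.reference = []            # chunk backing the current estimate
        self.buffer = []               # newest instances awaiting a drift test
        # Initial estimate: a standard normal density as a neutral starting guess.
        self.f_t = lambda x: gaussian_kde_eval(x, [0.0], 1.0)

    def _blend(self, local_samples):
        prev, lam, h = self.f_t, self.lam, self.h
        samples = list(local_samples)
        # Convex combination of the fresh local KDE and the previous estimate.
        self.f_t = lambda x: lam * gaussian_kde_eval(x, samples, h) + (1.0 - lam) * prev(x)

    def update(self, x_t):
        """Consume one streaming instance; re-estimate when a chunk is full."""
        self.buffer.append(float(x_t))
        if len(self.buffer) < self.chunk_size:
            return
        if not self.reference:
            self._blend(self.buffer)               # first chunk: initialize the estimate
        else:
            _, p_value = ks_2samp(self.reference, self.buffer)
            if p_value < self.alpha:               # drift detected: fold in the local KDE
                self._blend(self.buffer)
        self.reference, self.buffer = self.buffer, []

    def density(self, x):
        return self.f_t(x)
```

Instances are fed one at a time through update(), and density(x) can be queried at any moment; the closure in _blend realizes the convex combination that Theorem 1 optimizes, with the fixed lam standing in for the data-driven \(\lambda_t\) and the fixed bandwidth standing in for the adaptive bandwidths.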
Notes
This kernel is asymptotically optimal in efficiency among all kernel functions [7].
“Gaussian”, “Skewed unimodal”, “Strongly skewed”, “Kurtotic unimodal”, “Outlier”, “Bimodal”, “Separated bimodal”, “Skewed bimodal”, “Trimodal”, “Claw”, “Double Claw”, “Symmetric Claw”, “Asymmetric Double Claw”, “Smooth Comb”, and “Discrete Comb”.
References
Gama J, Žliobaitė I, Bifet A, Pechenizkiy M, Bouchachia A (2014) A survey on concept drift adaptation. ACM Comput Surv (CSUR) 46(4):44
Cesa-Bianchi N, Shalev-Shwartz S, Shamir O (2011) Online learning of noisy data. IEEE Trans Inf Theory 57(12):7907–7931
Deng C, Yang E, Liu T, Tao D (2019) Two-stream deep hashing with class-specific centers for supervised image search. IEEE Trans Neural Netw Learn Syst 31(6):2189–2201
Procopiuc CM, Procopiuc O (2005) Density estimation for spatial data streams. In: Bauzer Medeiros C, Egenhofer MJ, Bertino E (eds) Proceedings of advances in spatial and temporal databases, Heidelberg, August 2005. Lecture notes in computer science, vol 3633. Springer, Berlin, pp 109–126
Heinz C, Seeger B (2008) Cluster kernels: resource-aware kernel density estimators over streaming data. IEEE Trans Knowl Data Eng 20(7):880–893
Kristan M, Leonardis A, Skočaj D (2011) Multivariate online kernel density estimation with Gaussian kernels. Pattern Recognit 44(10–11):2630–2642
Qahtan A, Wang S, Zhang X (2017) KDE-Track: an efficient dynamic density estimator for data streams. IEEE Trans Knowl Data Eng 29(3):642–655
Zhang P, Zhu X, Shi Y, Guo L, Wu X (2011) Robust ensemble learning for mining noisy data streams. Decis Support Syst 50(2):469–479
Krawczyk B, Cano A (2018) Online ensemble learning with abstaining classifiers for drifting and noisy data streams. Appl Soft Comput 68(7):677–692
Verma N, Branson K (2015) Sample complexity of learning Mahalanobis distance metrics. In: Proceedings of the advances in neural information processing systems, Montréal, December 2015, pp 2584–2592
Bifet A, Gavalda R (2007) Learning from time-changing data with adaptive windowing. In: Proceedings of the SIAM international conference on data mining, Minneapolis, April 2007, pp 443–448
Scott DW (2015) Multivariate density estimation: theory, practice, and visualization. Wiley, New York
Sheather SJ (2004) Density estimation. Stat Sci 19(4):588–597
Banerjee A, Burlina P (2010) Efficient particle filtering via sparse kernel density estimation. IEEE Trans Image Process 19(9):2480–2490
Hong X, Chen S, Qatawneh A, Daqrouq K, Sheikh M, Morfeq A (2013) Sparse probability density function estimation using the minimum integrated square error. Neurocomputing 115:122–129
Carbone P, Petri D, Barbé K (2017) Nonparametric probability density estimation via interpolation filtering. IEEE Trans Instrum Meas 66(4):681–690
Zhang L, Lin J, Karim R (2018) Adaptive kernel density-based anomaly detection for nonlinear systems. Knowl Based Syst 139:50–63
Wahid A, Rao A (2019) RKDOS: a relative kernel density-based outlier score. IETE Tech Rev 8:1–12
Deng C, Liu X, Li C, Tao D (2018) Active multi-kernel domain adaptation for hyperspectral image classification. Pattern Recognit 77:306–315
Yang M, Deng C, Nie F (2019) Adaptive-weighting discriminative regression for multi-view classification. Pattern Recognit 88:236–245
Zhou A, Cai Z, Wei L, Qian W (2003) M-kernel merging: towards density estimation over data streams. In: Proceedings of the IEEE international conference on database systems for advanced applications, Kyoto, March 2003, pp 285–292
Cao Y, He H, Man H (2012) SOMKE: kernel density estimation over data streams by sequences of self-organizing maps. IEEE Trans Neural Netw Learn Syst 23(8):1254–1268
Kristan M, Leonardis A (2014) Online discriminative kernel density estimator with Gaussian kernels. IEEE Trans Cybern 44(3):355–365
Qiu T, Shen F, Zhao J (2015) Local adaptive and incremental Gaussian mixture for online density estimation. In: Cao T, Lim EP, Zhou ZH, Ho TB, Cheung D, Motoda H (eds) Proceedings of the advances in knowledge discovery and data mining, Ho Chi Minh City, May 2015. Lecture notes in computer science, vol 9077. Springer, Cham, pp 418–428
Wilcox R (2005) Kolmogorov–Smirnov test. Encyclopedia of biostatistics
Lall A (2015) Data streaming algorithms for the Kolmogorov-Smirnov test. In: Proceedings of the IEEE international conference on big data, Santa Clara, October 2015, pp 95–104
Duong T, Hazelton ML (2005) Cross-validation bandwidth matrices for multivariate kernel density estimation. Scand J Stat 32(3):485–506
Raykar VC, Duraiswami R (2006) Fast optimal bandwidth selection for kernel density estimation. In: Proceedings of the SIAM international conference on data mining, Bethesda, April 2006, pp 524–528
Yang C, Duraiswami R, Gumerov NA, Davis L (2003) Improved fast gauss transform and efficient kernel density estimation. In: Proceedings of the IEEE international conference on computer vision, Nice, October 2003, pp 464
Botev ZI, Grotowski JF, Kroese DP (2010) Kernel density estimation via diffusion. Ann Stat 38(5):2916–2957
Foote J (2000) Automatic audio segmentation using a measure of audio novelty. In: Proceedings of the IEEE international conference on multimedia and expo, New York, July 2000, pp 452–455
Losing V, Hammer B, Wersing H (2016) Knn classifier with self adjusting memory for heterogeneous concept drift. In: Proceedings of the IEEE international conference on data mining, Barcelona, December 2016, pp 291–300
Boedihardjo A, Liu C, Chen F (2015) Fast adaptive kernel density estimator for data streams. Knowl Inf Syst 42(2):285–317
Masud M, Gao J, Khan L, Han J, Thuraisingham B (2010) Classification and novel class detection in concept-drifting data streams under time constraints. IEEE Trans Knowl Data Eng 23(6):859–874
Masud M, Chen Q, Khan L, Aggarwal C, Gao J, Han J, Srivastava A, Oza N (2012) Classification and adaptive novel class detection of feature-evolving data streams. IEEE Trans Knowl Data Eng 25(7):1484–1497
Acknowledgements
This publication was made possible by funding from the DOD ARO Grant #W911NF-20-1-0249, and the NIH grants 5U54MD007595, 5P20GM103424, 5U19AG055373 and U54GM104940.
Ethics declarations
Conflict of interest
The authors declare that they have no conflict of interest.
Additional information
Publisher's Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Appendix
1.1 Appendix 1: Proof of Theorem 1
Equation (6) is equivalent to the following equation:
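(In sketch form, using the quantities \(A_{t}\), \(B_{t}\), and \(C_{t}\) defined next, this equivalent problem can be written as a quadratic in the mixing weight.)
\[
\lambda_{t}=\mathop{\arg\min}_{\lambda_{t}\in[0,1]}\;\Big\{\lambda_{t}^{2}A_{t}+(1-\lambda_{t})^{2}B_{t}+2\lambda_{t}(1-\lambda_{t})C_{t}\Big\}.
\]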
In fact, \(A_{t}\) is the MISE of \(\hat{f}_{t}^{kde }\), \(B_t\) is the MISE of \(\hat{f}_{t}\), and \(C_t\) is the covariance between \(\hat{f}_{t}^{kde }\) and \(\hat{f}_{t}\). Technically, they can be efficiently computed by the second-order Taylor series approximation [29], where we can use an estimate of the curvature \(\int |f''({{\varvec{x}}})|^2 d {{\varvec{x}}}\) and choose \(h_{opt}\) accordingly.
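For reference, in the univariate case the standard second-order (Taylor) approximation of the MISE of a kernel estimator built from \(m\) samples with bandwidth \(h\) is
\[
\mathrm{AMISE}(h)=\frac{1}{mh}\int_{-\infty}^{\infty}K^{2}(u)du+\frac{h^{4}}{4}\Big(\int_{-\infty}^{\infty}u^{2}K(u)du\Big)^{2}\int|f''(x)|^{2}dx,
\]
which is minimized at \(h_{opt}=\Big[\int K^{2}(u)du\,\big/\,\big(m\,(\int u^{2}K(u)du)^{2}\int|f''(x)|^{2}dx\big)\Big]^{1/5}\). This textbook expression is recalled here only to indicate how an estimate of the curvature term leads to a choice of \(h_{opt}\).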
Furthermore, since \(Var\{X - Y\} = Var\{X\} + Var\{Y\} - 2\,Cov\{X, Y\} \ge 0\) (in the same way that \(x^2 + y^2 \ge 2xy\)), we have \(Var\{X\} + Var\{Y\} \ge 2\,Cov\{X, Y\}\), and therefore \(A_t + B_t \ge 2C_t\). Note that the equality holds if and only if \(\hat{f}_{t}^{kde}({{\varvec{x}}};h_{\mathcal{D}_{t}},h_{{{\varvec{x}}}_{t}})=\hat{f}_{t}({{\varvec{x}}})\), which is unlikely to happen since we utilize a local sampling strategy to construct \(\hat{f}_{t}^{kde}\) when concept drift occurs. The solution of the optimization problem in Eq. (7) is then obtained by the following case analysis:
- Case 1: if \(\hat{f}_{t}^{kde}({{\varvec{x}}};h_{\mathcal{D}_{t}},h_{{{\varvec{x}}}_{t}})=\hat{f}_{t}({{\varvec{x}}})\), then \(A_t = B_t = C_t\) and any \(\lambda_t \in [0,1]\) is optimal;
- Case 2: if \(\hat{f}_{t}^{kde}({{\varvec{x}}};h_{\mathcal{D}_{t}},h_{{{\varvec{x}}}_{t}}) \ne \hat{f}_{t}({{\varvec{x}}})\), then \(A_t + B_t > 2C_t\), and:
  - Case 2-1: if \(B_t < C_t\), then \(A_t > C_t\) and \(\lambda_t = 0\);
  - Case 2-2: if \(B_t \ge C_t\) and \(A_t < C_t\), then \(\lambda_t = 1\);
  - Case 2-3: if \(B_t \ge C_t\) and \(A_t \ge C_t\), then \(\lambda_t = \frac{B_{t}-C_{t}}{A_{t}+B_{t}-2C_{t}}\) and \(1 - \lambda_t = \frac{A_{t}-C_{t}}{A_{t}+B_{t}-2C_{t}}\).
Therefore, we have a closed-form solution of Eq. (7) for Case 2 (i.e., \(A_t + B_t > 2C_t\)):
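(Combining Cases 2-1 to 2-3, the solution can be written compactly as follows.)
\[
\lambda_{t}=\min\Big\{1,\;\max\Big\{0,\;\frac{B_{t}-C_{t}}{A_{t}+B_{t}-2C_{t}}\Big\}\Big\},
\]
i.e., the unconstrained minimizer \(\frac{B_{t}-C_{t}}{A_{t}+B_{t}-2C_{t}}\) clipped to the interval \([0,1]\).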
This concludes the proof of Theorem 1.
1.2 Appendix 2: Proof of Theorem 2
Since \(\lambda _1=\cdots =\lambda _{t-1}=\lambda _{t}=\lambda\), the following equation holds:
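(in sketch form, the constant-weight version of the convex-combination update from Theorem 1)
\[
\hat{f}_{t+1}(x)=\lambda\,\hat{f}_{t}^{kde}(x;h_{\mathcal{D}_{t}},h_{x_{t}})+(1-\lambda)\,\hat{f}_{t}(x).
\]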
Using the above equation repeatedly, we have:
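(in sketch form, unrolling the recursion down to the initial estimate \(\hat{f}_{1}\))
\[
\hat{f}_{t+1}(x)=\lambda\sum_{j=0}^{t-1}(1-\lambda)^{j}\hat{f}_{t-j}^{kde}(x;h_{\mathcal{D}_{t-j}},h_{x_{t-j}})+(1-\lambda)^{t}\hat{f}_{1}(x).
\]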
Lemma 1 implies that: \(\lim \limits _{m_{t-j}\rightarrow \infty }E(\hat{f}_{t-j}^{kde}(x;h_{\mathcal {D}_{t-j}},h_{x_{t-j}}))=f(x)\), where \(j=0,1,\ldots ,t-1\).
On the other hand, since \(m_1=\cdots =m_{t}\), \(h_{\mathcal {D}_{1}}=\cdots =h_{\mathcal {D}_{t}}\), we have:
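(in sketch form, applying Lemma 1 to each term of the unrolled recursion and using \(\lambda\sum_{j=0}^{t-1}(1-\lambda)^{j}=1-(1-\lambda)^{t}\))
\[
\lim_{m_{t}\rightarrow\infty}E(\hat{f}_{t+1}(x))=\big(1-(1-\lambda)^{t}\big)f(x)+(1-\lambda)^{t}\hat{f}_{1}(x).
\]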
With \(t \rightarrow \infty\), for arbitrary \(\lambda \in (0,1]\), we have: \(\lim \nolimits _{m_{t}\rightarrow \infty }E(\hat{f}_{t+1}(x))=f(x)\).
We then consider the variance. Since \(\mathcal {D}_{j}\) and \(\mathcal {D}_{k}\) (\(j \ne k\)) are two independent sets, \(Cov\{\hat{f}_{j}^{kde}(x;h_{\mathcal {D}_{j}},h_{x_{j}}), \hat{f}_{k}^{kde}(x;h_{\mathcal {D}_{k}},h_{x_{k}}) \} = 0\). Since \(\hat{f}_{1}(x)\) is initialized by two given constants, then for arbitrary t, we have \(Cov\{\hat{f}_{t}^{kde}(x;h_{\mathcal {D}_{t}},h_{x_{t}}),\hat{f}_{1}(x)\} = 0\). Hence, the following equation holds:
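(in sketch form, applying the variance to the unrolled recursion, with \(Var\{\hat{f}_{1}(x)\}=0\) and independence across the sets \(\mathcal{D}_{t-j}\))
\[
Var\{\hat{f}_{t+1}(x)\}=\lambda^{2}\sum_{j=0}^{t-1}(1-\lambda)^{2j}Var\{\hat{f}_{t-j}^{kde}(x;h_{\mathcal{D}_{t-j}},h_{x_{t-j}})\}.
\]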
Lemma 1 implies that: \(\lim \nolimits _{m_{t-j}\rightarrow \infty }m_{t-j}h_{\mathcal {D}_{t-j}}Var\{\hat{f}_{t-j}^{kde}(x;h_{\mathcal {D}_{t-j}},h_{x_{t-j}})\}=f(x)\int _{-\infty }^{\infty }K^2(u)du\), where \(j=0,1,\ldots ,t-1\). Since \(m_1=\cdots =m_{t}\), \(h_{\mathcal {D}_{1}}=\cdots =h_{\mathcal {D}_{t}}\), we have:
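(in sketch form, substituting the limit from Lemma 1 into each term and using \(\sum_{j=0}^{t-1}(1-\lambda)^{2j}=\frac{1-(1-\lambda)^{2t}}{\lambda(2-\lambda)}\))
\[
\lim_{m_{t}\rightarrow\infty}m_{t}h_{\mathcal{D}_{t}}Var\{\hat{f}_{t+1}(x)\}=\frac{\lambda\big(1-(1-\lambda)^{2t}\big)}{2-\lambda}f(x)\int_{-\infty}^{\infty}K^{2}(u)du.
\]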
Now with \(t \rightarrow \infty\), for arbitrary \(\lambda \in (0, 1]\), we have: \(\lim \nolimits _{m_{t}\rightarrow \infty }m_{t}h_{\mathcal {D}_{t}}Var\{\hat{f}_{t+1}(x)\}= \frac{\lambda f(x)}{2-\lambda }\int _{-\infty }^{\infty }K^2(u)du\). This concludes the proof of Theorem 2.
1.3 Appendix 3: Proof of Theorem 3
We consider the mean squared error of \(\hat{f}_{t+1}(x)\), \(MSE (\hat{f}_{t+1}(x))=Var\{\hat{f}_{t+1}(x)\}+(Bias\{\hat{f}_{t+1}(x)\})^2\). According to Theorem 2, the estimator \(\hat{f}_{t+1}(x)\) is asymptotically unbiased and therefore, \(\lim \nolimits _{m_{t}\rightarrow \infty }Bias\{\hat{f}_{t+1}(x)\}=0\). Since \(\lim \nolimits _{m_{t}\rightarrow \infty }m_{t}h_{\mathcal {D}_{t}} = \infty\) and \(f(x)\int _{-\infty }^{\infty }K^2(u)du<\infty\), then for arbitrary \(\lambda \in (0,1]\), we have, \(\lim \nolimits _{m_{t}\rightarrow \infty ,t\rightarrow \infty }Var\{\hat{f}_{t+1}(x)\}=\lim \limits _{m_{t}\rightarrow \infty ,t\rightarrow \infty }\frac{\lambda f(x)\int _{-\infty }^{\infty }K^2(u)du}{(2-\lambda )m_{t}h_{\mathcal {D}_{t}}}=0\). Therefore, \(\hat{f}_{t+1}(x)\) converges to f(x) in \(L^2\) and hence is weakly consistent. This concludes the proof of Theorem 3.
1.4 Appendix 4: Proof of Theorem 4
According to the proof of Theorem 2 (\(\lambda _1=\cdots =\lambda _{t-1}=\lambda _{t}=\lambda\)), the following equation holds:
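(in sketch form, the bias counterpart of the unrolled recursion from the proof of Theorem 2)
\[
Bias\{\hat{f}_{t+1}(x)\}=\lambda\sum_{j=0}^{t-1}(1-\lambda)^{j}Bias\{\hat{f}_{t-j}^{kde}(x;h_{\mathcal{D}_{t-j}},h_{x_{t-j}})\}+(1-\lambda)^{t}\big(\hat{f}_{1}(x)-f(x)\big).
\]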
According to the properties of KDE by Taylor expansion [29], we have, \(Bias\{\hat{f}_{t-j}^{kde}(x;h_{\mathcal {D}_{t-j}},h_{x_{t-j}})\}= \frac{1}{2}f''(x)h_{\mathcal {D}_{t-j}}^2\int _{-\infty }^{\infty }u^2K(u)du+o(h_{\mathcal {D}_{t-j}}^2)\), where \(j=0,1,\ldots ,t-1\).
Thus, with \(m_1=\cdots =m_{t}\), \(h_{\mathcal {D}_{1}}=\cdots =h_{\mathcal {D}_{t}}\), we have:
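(in sketch form, substituting this bias expansion into each term of the sum)
\[
Bias\{\hat{f}_{t+1}(x)\}=\big(1-(1-\lambda)^{t}\big)\Big(\frac{1}{2}f''(x)h_{\mathcal{D}_{t}}^{2}\int_{-\infty}^{\infty}u^{2}K(u)du+o(h_{\mathcal{D}_{t}}^{2})\Big)+(1-\lambda)^{t}\big(\hat{f}_{1}(x)-f(x)\big).
\]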
For arbitrary \(\lambda \in (0,1]\) and t large enough that \((1-\lambda )^t \approx 0\), Eq. (9) follows immediately from the above equation.
For the variance, since Eq. (18) holds and according to the properties of KDE by Taylor expansion [29], we have, \(Var\{\hat{f}_{t-j}^{kde}(x;h_{\mathcal {D}_{t-j}},h_{x_{t-j}})\} = \frac{f(x)}{m_{t-j}h_{\mathcal {D}_{t-j}}}\int _{-\infty }^{\infty }K^2(u)du + o(\frac{1}{m_{t-j}})\), where \(j=0,1,\ldots ,t-1\). Since \(m_1=\cdots =m_{t}\) and \(h_{\mathcal {D}_{1}}=\cdots =h_{\mathcal {D}_{t}}\), we have:
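(in sketch form, substituting this variance expansion into the unrolled recursion and summing the geometric series as in the proof of Theorem 2)
\[
Var\{\hat{f}_{t+1}(x)\}=\frac{\lambda\big(1-(1-\lambda)^{2t}\big)}{2-\lambda}\Big(\frac{f(x)}{m_{t}h_{\mathcal{D}_{t}}}\int_{-\infty}^{\infty}K^{2}(u)du+o\Big(\frac{1}{m_{t}}\Big)\Big).
\]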
For arbitrary \(\lambda \in (0,1]\) and t large enough that \((1-\lambda )^t \approx 0\), Eq. (10) follows immediately from the above equation. This concludes the proof of Theorem 4.
About this article
Cite this article
Chen, Z., Fang, Z., Sheng, V. et al. Adaptive robust local online density estimation for streaming data. Int. J. Mach. Learn. & Cyber. 12, 1803–1824 (2021). https://doi.org/10.1007/s13042-021-01275-y