Fast adaptive kernel density estimator for data streams

  • Regular Paper
  • Knowledge and Information Systems

Abstract

The probability density function (PDF) is an effective data model for a variety of stream mining tasks. As such, accurate estimates of the PDF are essential to reducing the uncertainties and errors associated with mining results. The nonparametric adaptive kernel density estimator (AKDE) provides accurate, robust, and asymptotically consistent estimates of a PDF. However, due to AKDE’s extensive computational requirements, it cannot be directly applied to the data stream environment. This paper describes the development of an AKDE approximation approach that respects the constraints of the data stream environment and supports efficient processing of multiple queries. To this end, this work proposes (1) the concept of local regions, which provide a partition-based variable bandwidth that captures local density structures and enhances estimation quality; (2) a suite of linear-pass methods to construct the local regions and kernel objects online; (3) an efficient algorithm for evaluating multiple queries; (4) a set of approximation techniques to increase the throughput of processing multiple density queries; and (5) a fixed-memory, time-based sliding window that updates the kernel objects in linear time. Comprehensive experiments were conducted with real-world and synthetic data sets to validate the effectiveness and efficiency of the approach.
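For orientation, the sketch below shows a textbook batch form of an adaptive kernel density estimator, in which each sample receives its own bandwidth derived from a fixed-bandwidth pilot estimate. It is not the streaming approximation proposed in the paper. The Gaussian kernel, the rule-of-thumb pilot bandwidth, the sensitivity parameter `alpha`, and all function names are assumptions made for illustration.

```python
import numpy as np


def gaussian_kernel(u):
    """Standard normal density, applied elementwise."""
    return np.exp(-0.5 * u ** 2) / np.sqrt(2.0 * np.pi)


def fixed_kde(x, data, h):
    """Fixed-bandwidth kernel density estimate at the query points x."""
    u = (x[:, None] - data[None, :]) / h
    return gaussian_kernel(u).sum(axis=1) / (len(data) * h)


def adaptive_kde(x, data, alpha=0.5):
    """Batch adaptive KDE with one bandwidth per sample (illustrative only).

    alpha is a hypothetical sensitivity parameter: 0 recovers the
    fixed-bandwidth estimator, larger values adapt more strongly.
    """
    n = len(data)
    # Rule-of-thumb pilot bandwidth (an assumption, not taken from the paper).
    h = 1.06 * data.std(ddof=1) * n ** (-0.2)
    # Pilot density evaluated at every sample: O(n^2) work.
    pilot = fixed_kde(data, data, h)
    # Geometric mean of the pilot values normalises the local factors.
    g = np.exp(np.mean(np.log(pilot)))
    # Sparse regions get wider kernels, dense regions get narrower ones.
    h_i = h * (pilot / g) ** (-alpha)
    u = (x[:, None] - data[None, :]) / h_i[None, :]
    return (gaussian_kernel(u) / h_i[None, :]).sum(axis=1) / n
```

Every query touches all n samples and the pilot step alone costs O(n^2), which illustrates why a direct evaluation of this kind is hard to sustain over an unbounded stream with limited memory.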

Author information

Correspondence to Arnold P. Boedihardjo.

About this article

Cite this article

Boedihardjo, A.P., Lu, CT. & Chen, F. Fast adaptive kernel density estimator for data streams. Knowl Inf Syst 42, 285–317 (2015). https://doi.org/10.1007/s10115-013-0712-0

  • DOI: https://doi.org/10.1007/s10115-013-0712-0
