Knowledge and Information Systems, Volume 42, Issue 2, pp 285–317

Fast adaptive kernel density estimator for data streams

  • Arnold P. Boedihardjo
  • Chang-Tien Lu
  • Feng Chen
Regular Paper

Abstract

The probability density function (PDF) is an effective data model for a variety of stream mining tasks. As such, accurate estimates of the PDF are essential to reducing the uncertainties and errors associated with mining results. The nonparametric adaptive kernel density estimator (AKDE) provides accurate, robust, and asymptotically consistent estimates of a PDF. However, due to AKDE’s extensive computational requirements, it cannot be directly applied to the data stream environment. This paper describes the development of an AKDE approximation approach that heeds the constraints of the data stream environment and supports efficient processing of multiple queries. To this end, this work proposes (1) the concept of local regions to provide a partition-based variable bandwidth to capture local density structures and enhance estimation quality; (2) a suite of linear-pass methods to construct the local regions and kernel objects online; (3) an efficient multiple queries evaluation algorithm; (4) a set of approximate techniques to increase the throughput of multiple density queries processing; and (5) a fixed-size memory time-based sliding window that updates the kernel objects in linear time. Comprehensive experiments were conducted with real-world and synthetic data sets to validate the effectiveness and efficiency of the approach.
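
For orientation, the following is a minimal sketch of a textbook two-stage adaptive kernel density estimator in one dimension. It is not the approximation method proposed in the paper. It assumes a Gaussian kernel, the normal-reference rule for the pilot bandwidth, and a sensitivity exponent of 0.5; all function names are hypothetical and chosen for illustration. The pilot stage evaluates a fixed-bandwidth estimate at every sample, which costs quadratic time in the sample size and hints at why the exact AKDE is hard to apply directly to streams.

```python
import math
from typing import List, Sequence


def gaussian_kernel(u: float) -> float:
    """Standard normal density, used here as the kernel (an assumption)."""
    return math.exp(-0.5 * u * u) / math.sqrt(2.0 * math.pi)


def normal_reference_bandwidth(data: Sequence[float]) -> float:
    """Pilot bandwidth from the normal-reference rule: 1.06 * std * n^(-1/5)."""
    n = len(data)
    mean = sum(data) / n
    std = math.sqrt(sum((x - mean) ** 2 for x in data) / (n - 1))
    return 1.06 * std * n ** (-0.2)


def fixed_kde(x: float, data: Sequence[float], h: float) -> float:
    """Fixed-bandwidth kernel density estimate at x."""
    return sum(gaussian_kernel((x - xi) / h) for xi in data) / (len(data) * h)


def local_bandwidths(data: Sequence[float], h: float, alpha: float = 0.5) -> List[float]:
    """Per-sample bandwidths: wider where the pilot density is low.

    Each bandwidth is h * (pilot_i / g) ** (-alpha), where g is the
    geometric mean of the pilot values. This step is O(n^2).
    """
    pilot = [fixed_kde(xi, data, h) for xi in data]
    g = math.exp(sum(math.log(p) for p in pilot) / len(pilot))
    return [h * (p / g) ** (-alpha) for p in pilot]


def adaptive_kde(x: float, data: Sequence[float], bandwidths: Sequence[float]) -> float:
    """Adaptive estimate at x: each sample contributes with its own bandwidth."""
    total = sum(gaussian_kernel((x - xi) / hi) / hi for xi, hi in zip(data, bandwidths))
    return total / len(data)


if __name__ == "__main__":
    samples = [0.0, 0.1, 0.2, 0.3, 5.0]  # illustrative data, not from the paper
    h0 = normal_reference_bandwidth(samples)
    hs = local_bandwidths(samples, h0)
    print([round(adaptive_kde(q, samples, hs), 4) for q in (0.15, 2.5, 5.0)])
```

In this sketch the isolated sample at 5.0 receives a wider bandwidth than the samples clustered near zero, which is the behaviour that variable-bandwidth estimators are designed to provide.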

Keywords

Data mining · Data streams · Kernel density estimation


Copyright information

© Springer-Verlag London 2013

Authors and Affiliations

  • Arnold P. Boedihardjo (1)
  • Chang-Tien Lu (2)
  • Feng Chen (3)

  1. U.S. Army Corps of Engineers, Alexandria, USA
  2. Computer Science Department, Virginia Tech, Falls Church, USA
  3. Computer Science Department, Carnegie Mellon University, Pittsburgh, USA
