Data skeletons: simultaneous estimation of multiple quantiles for massive streaming datasets with applications to density estimation
We consider the problem of density estimation when the data arrive as a continuous stream of no fixed length. In this setting, the usual methods of density estimation, such as kernel density estimation, are problematic to implement. We propose a method of density estimation for massive datasets that is based on taking the derivative of a smooth curve fitted through a set of quantile estimates. To this end, we propose a low-storage, single-pass, sequential method for the simultaneous estimation of multiple quantiles of massive datasets, which forms the basis of the density estimator. For comparison, we also consider a sequential kernel density estimator. A simulation study shows that the proposed methods perform well and have several distinct advantages over existing methods.
Keywords: Sequential quantile estimation; Sequential density estimation; Online algorithms; Sequential algorithms; Cubic spline
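The idea in the abstract (track several quantiles in one pass, then differentiate a curve through them) can be illustrated with a minimal sketch. This is not the paper's data-skeleton algorithm: the quantile tracker below is a generic stochastic-approximation update, and the density is the slope of a piecewise-linear curve through the quantile estimates rather than the derivative of a cubic spline. The function names `track_quantiles` and `density_from_quantiles`, and the parameters `scale` and `decay`, are hypothetical choices made for this illustration.

```python
import itertools
import random


def track_quantiles(stream, probs, scale=1.0, decay=0.75):
    """Single-pass tracker for several quantiles (illustrative stand-in).

    Stores only len(probs) numbers. `scale` and `decay` are assumed
    tuning parameters: the step size at update n is scale * n**(-decay).
    """
    probs = list(probs)
    stream = iter(stream)
    # Start from the first len(probs) observations, sorted, so the
    # estimates begin inside the range of the data.
    q = sorted(itertools.islice(stream, len(probs)))
    for n, x in enumerate(stream, start=1):
        step = scale * n ** (-decay)
        # Each estimate moves up by step * p when x lies above it and
        # down by step * (1 - p) when x lies at or below it.
        q = [qj + step * (pj - (1.0 if x <= qj else 0.0))
             for qj, pj in zip(q, probs)]
        # Pragmatic guard (an assumption of this sketch): keep the
        # estimates in increasing order.
        q.sort()
    return q


def density_from_quantiles(q, probs):
    """Slope of the piecewise-linear CDF through (q_j, p_j).

    Returns (midpoint, density) pairs, one per pair of adjacent
    quantiles with a positive gap.
    """
    probs = list(probs)
    points = []
    for q_lo, q_hi, p_lo, p_hi in zip(q, q[1:], probs, probs[1:]):
        gap = q_hi - q_lo
        if gap > 0:
            points.append(((q_lo + q_hi) / 2.0, (p_hi - p_lo) / gap))
    return points


# Simulated stream: 20000 standard normal draws, consumed lazily.
rng = random.Random(0)
probs = [i / 10 for i in range(1, 10)]
data = (rng.gauss(0.0, 1.0) for _ in range(20000))

quantiles = track_quantiles(data, probs)
density = density_from_quantiles(quantiles, probs)
for x, d in density:
    print(f"x = {x:6.3f}   density = {d:.3f}")
```

The tracker holds one number per target quantile, so storage does not grow with the length of the stream. The step-size schedule should be matched to the spread of the data; the defaults here assume values on roughly a unit scale.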