Statistics and Computing, Volume 17, Issue 4, pp 311–321

Data skeletons: simultaneous estimation of multiple quantiles for massive streaming datasets with applications to density estimation

  • James P. McDermott (email author)
  • G. Jogesh Babu
  • John C. Liechty
  • Dennis K. J. Lin


We consider the problem of density estimation when the data arrive as a continuous stream with no fixed length. In this setting, implementations of the usual methods of density estimation, such as kernel density estimation, are problematic. We propose a method of density estimation for massive datasets that is based upon taking the derivative of a smooth curve fitted through a set of quantile estimates. To obtain those estimates, we propose a low-storage, single-pass, sequential method for the simultaneous estimation of multiple quantiles of massive datasets; these quantile estimates form the basis of the density estimator. For comparison, we also consider a sequential kernel density estimator. The proposed methods are shown through a simulation study to perform well and to have several distinct advantages over existing methods.
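The core idea in the abstract, reading a density off the slope of a curve drawn through quantile estimates, can be illustrated with a minimal sketch. This is not the authors' algorithm: it assumes the quantile estimates are already available (here they are computed in batch with Python's `statistics.quantiles` rather than sequentially), and it uses a piecewise-linear curve instead of a smooth one, so the density is constant between adjacent quantiles. The function name `density_from_quantiles` and the simulated normal sample are hypothetical choices made for illustration only.

```python
import random
import statistics


def density_from_quantiles(probs, quantiles):
    """Slope of a piecewise-linear CDF through the points (quantile, prob).

    Returns the midpoints of adjacent quantile pairs and the density
    estimate (difference in probability / difference in quantile) there.
    Illustrative simplification: a smooth fit would give a smooth density.
    """
    pairs = list(zip(probs, quantiles))
    mids, dens = [], []
    for (p0, q0), (p1, q1) in zip(pairs, pairs[1:]):
        if q1 <= q0:
            continue  # tied quantiles: the slope is undefined, skip the pair
        mids.append(0.5 * (q0 + q1))
        dens.append((p1 - p0) / (q1 - q0))
    return mids, dens


# Assumption: a simulated standard normal sample stands in for the stream,
# and batch sample quantiles stand in for sequential quantile estimates.
random.seed(0)
sample = [random.gauss(0.0, 1.0) for _ in range(20000)]

n_bins = 20
quantiles = statistics.quantiles(sample, n=n_bins)  # 19 cut points
probs = [k / n_bins for k in range(1, n_bins)]      # 0.05, 0.10, ..., 0.95

mids, dens = density_from_quantiles(probs, quantiles)
for x, f in zip(mids, dens):
    print(f"x = {x:6.3f}   estimated density = {f:.3f}")
```

Only the quantile estimates are needed at estimation time, which is why a compact set of quantiles can stand in for the full data. The estimates near the centre should be close to the standard normal peak density of about 0.399.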


Sequential quantile estimation; Sequential density estimation; Online algorithms; Sequential algorithms; Cubic spline




Copyright information

© Springer Science+Business Media, LLC 2007

Authors and Affiliations

  • James P. McDermott (1), email author
  • G. Jogesh Babu (1)
  • John C. Liechty (2)
  • Dennis K. J. Lin (3)

  1. Department of Statistics, The Pennsylvania State University, University Park, USA
  2. Departments of Marketing and Statistics, The Pennsylvania State University, University Park, USA
  3. Department of Supply Chain and Information Systems, The Pennsylvania State University, University Park, USA
