The VLDB Journal, Volume 25, Issue 4, pp 449–472

Quantiles over data streams: experimental comparisons, new analyses, and further improvements

Regular Paper

Abstract

A fundamental problem in data management and analysis is to generate descriptions of the distribution of data. It is most common to give such descriptions in terms of the cumulative distribution, which is characterized by the quantiles of the data. The design and engineering of efficient methods to find these quantiles has attracted much study, especially in the case where the data are given incrementally, and we must compute the quantiles in an online, streaming fashion. While such algorithms have proved to be extremely useful in practice, there has been limited formal comparison of the competing methods, and no comprehensive study of their performance. In this paper, we remedy this deficit by providing a taxonomy of different methods and describing efficient implementations. In doing so, we propose new variants that have not been studied before, yet which outperform existing methods. To illustrate this, we provide detailed experimental comparisons demonstrating the trade-offs between space, time, and accuracy for quantile computation.

Keywords

Data stream algorithms · Quantiles · Ordinary least squares

Copyright information

© Springer-Verlag Berlin Heidelberg 2016

Authors and Affiliations

  1. Department of Computer Science and Engineering, HKUST, Kowloon, Hong Kong
  2. Department of Computer Science, University of Warwick, Coventry, UK
