Efficient Exact Algorithm for Count Distinct Problem

  • Nikolay Golov
  • Alexander Filatov
  • Sergey Bruskin
Conference paper
Part of the Lecture Notes in Computer Science book series (LNCS, volume 11661)

Abstract

This paper describes and analyses optimization approaches that make possible the exact calculation of millions of hierarchical count distinct measures over hundreds of billions of data rows. The described approach evolved over several years, in parallel with the growing workload of a fast-growing internet company, and was finally implemented as the PEAPM (Pipelined Exact Accumulation for Paralleled Measures) algorithm. The current version of the algorithm outputs exact values (not estimates), runs in a single thread, completes in minutes on general commodity hardware, and requires an amount of RAM equal to twice the size of the required measures.
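To make the problem statement concrete, the following is a minimal sketch of what an exact hierarchical count distinct measure is. It is a naive set-based baseline and not the PEAPM algorithm, whose internals the abstract does not describe. The function name, the row layout, and the sample data are all hypothetical and chosen only for illustration.

```python
from collections import defaultdict


def exact_hierarchical_count_distinct(rows):
    """Naive baseline (NOT PEAPM): exact distinct counts per hierarchy level.

    rows: iterable of (path, user_id) pairs, where path is a tuple such as
    ("Russia", "Moscow"). The layout is an assumption made for this sketch.
    Returns a dict mapping every path prefix to its exact distinct count.
    """
    seen = defaultdict(set)
    for path, user_id in rows:
        # A row contributes to every ancestor level of its path,
        # including the root, represented by the empty tuple ().
        for depth in range(len(path) + 1):
            seen[path[:depth]].add(user_id)
    return {prefix: len(users) for prefix, users in seen.items()}


# Hypothetical clickstream sample: (hierarchy path, visitor id).
rows = [
    (("Russia", "Moscow"), "u1"),
    (("Russia", "Moscow"), "u2"),
    (("Russia", "Kazan"), "u1"),
    (("Russia", "Moscow"), "u1"),
]
counts = exact_hierarchical_count_distinct(rows)
```

In this sample the two city-level counts are 2 and 1, while the country-level count is 2 rather than 3, because the same visitor appears in both cities. Count distinct is therefore not additive across hierarchy levels, so every level needs its own exact measure. The baseline above keeps one set of identifiers per measure, which is the kind of memory cost an exact algorithm has to control at the scale the abstract describes.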

Keywords

Big Data, MPP, Database, Analytics, Cardinality estimation, Distinct elements problem, Clickstream analysis, Performance

Copyright information

© Springer Nature Switzerland AG 2019

Authors and Affiliations

  • Nikolay Golov (1)
  • Alexander Filatov (2)
  • Sergey Bruskin (3, corresponding author)

  1. Avito, Higher School of Economics, Moscow, Russia
  2. Avito, Moscow Engineering Physics Institute (MEPhI), Moscow, Russia
  3. Higher School of Economics, Moscow, Russia
