Computing Hierarchical Summary of the Data Streams

  • Conference paper
Advances in Knowledge Discovery and Data Mining (PAKDD 2016)

Part of the book series: Lecture Notes in Computer Science (LNAI, volume 9652)

Abstract

Data stream processing is an important function in many online applications, such as network traffic analysis, web applications, and financial data analysis. Computing summaries of a data stream is challenging because streaming data is generally unbounded and cannot be stored permanently or accessed more than once. In this paper, we propose two counter-based hierarchical (CHS) \(\epsilon\)-approximation algorithms that create hierarchical summaries of one-dimensional data. CHS maintains a data structure in which each entry holds an incoming data item and an associated counter that stores its frequency. Since not every item in a data stream can be stored, CHS keeps only the frequent items (known as hierarchical heavy hitters) at the various levels of the generalization hierarchy, exploiting the natural hierarchy of the data. The algorithm guarantees that each count is accurate to within an \(\epsilon\) bound. Furthermore, with the aperiodic (CHS-A) and periodic (CHS-P) compression strategies, the proposed technique offers improved space complexities of \(O(\frac{\eta }{\epsilon })\) and \(O(\frac{\eta }{\epsilon }\log \epsilon N)\), respectively. We provide theoretical proofs for both the space and time requirements of the CHS algorithm, and we compare it experimentally with existing benchmark techniques. The experimental results show that the proposed algorithm requires fewer updates per data element and uses a moderate, bounded amount of memory. Moreover, precision-recall analysis shows that the CHS algorithm produces higher-quality output than the existing benchmark techniques. For the experimental validation, we used both synthetic data derived from an open-source generator and real benchmark data sets from an international Internet Service Provider.
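
To make the idea of a counter-based hierarchical summary concrete, the following is a minimal Python sketch. It is not the CHS algorithm from the paper, whose details are not given in the abstract; it is a generic illustration that assumes dotted prefixes (IPv4-like strings) as the hierarchy, compresses once every \(\lceil 1/\epsilon \rceil\) items, and rolls any entry whose counter is below \(\epsilon n\) up into its parent. The names `HierarchicalSummary`, `parent`, `depth`, and `heavy` are hypothetical, and the sketch makes no claim to reproduce the paper's error or space bounds.

```python
from math import ceil


def parent(prefix):
    """Generalize a dotted prefix by one level; '*' is the root (assumed hierarchy)."""
    if prefix == "*":
        return None
    head, _, _ = prefix.rpartition(".")
    return head or "*"


def depth(prefix):
    """Root has depth 0; '10.0.1' has depth 3."""
    return 0 if prefix == "*" else prefix.count(".") + 1


class HierarchicalSummary:
    """Illustrative counter-based summary; NOT the authors' CHS algorithm."""

    def __init__(self, epsilon):
        self.epsilon = epsilon
        self.width = ceil(1 / epsilon)  # compress once per `width` items
        self.n = 0                      # number of items seen so far
        self.counts = {}                # prefix -> counter

    def update(self, item):
        self.n += 1
        self.counts[item] = self.counts.get(item, 0) + 1
        if self.n % self.width == 0:
            self.compress()

    def compress(self):
        """Roll entries with counter < epsilon * n up into their parents."""
        threshold = self.epsilon * self.n
        max_depth = max((depth(p) for p in self.counts), default=0)
        # Deepest level first, so mass moved to a parent is re-examined
        # when the loop reaches the parent's own level.
        for level in range(max_depth, 0, -1):
            for prefix in [p for p in self.counts if depth(p) == level]:
                if self.counts[prefix] < threshold:
                    up = parent(prefix)
                    self.counts[up] = self.counts.get(up, 0) + self.counts.pop(prefix)

    def heavy(self, phi):
        """Entries whose stored counter is at least phi * n."""
        return {p: c for p, c in self.counts.items() if c >= phi * self.n}
```

Because a roll-up only moves a counter to the parent entry, the counters in the summary always sum to the number of items seen. Rare leaves are merged into a shared prefix, which can then become frequent at a coarser level of the hierarchy.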

Notes

  1. http://www-ai.cs.uni-dortmund.de/SOFTWARE/HHHPlugin/index.html

  2. http://www.tcpdump.org/manpages/tcpdump.1.html, accessed 23/02/2015.

  3. https://www.wireshark.org/, accessed 23/02/2015.

Author information

Corresponding author

Correspondence to Zubair Shah.

Copyright information

© 2016 Springer International Publishing Switzerland

About this paper

Cite this paper

Shah, Z., Mahmood, A.N., Barlow, M. (2016). Computing Hierarchical Summary of the Data Streams. In: Bailey, J., Khan, L., Washio, T., Dobbie, G., Huang, J., Wang, R. (eds) Advances in Knowledge Discovery and Data Mining. PAKDD 2016. Lecture Notes in Computer Science, vol 9652. Springer, Cham. https://doi.org/10.1007/978-3-319-31750-2_14

  • DOI: https://doi.org/10.1007/978-3-319-31750-2_14

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-319-31749-6

  • Online ISBN: 978-3-319-31750-2

  • eBook Packages: Computer Science, Computer Science (R0)
