Computing Hierarchical Summary of the Data Streams

Shah, Zubair; Mahmood, Abdun Naser; Barlow, Michael

doi:10.1007/978-3-319-31750-2_14

Part of the book series: Lecture Notes in Computer Science (LNAI, volume 9652)

Included in the following conference series:

Pacific-Asia Conference on Knowledge Discovery and Data Mining

Abstract

Data stream processing is an important function in many online applications, such as network traffic analysis, web applications, and financial data analysis. Computing summaries of a data stream is challenging because streaming data is generally unbounded and cannot be permanently stored or accessed more than once. In this paper, we propose two counter-based hierarchical (CHS) \(\epsilon\)-approximation algorithms that create hierarchical summaries of one-dimensional data. CHS maintains a data structure in which each entry contains an incoming data item and an associated counter that stores its frequency. Since not every item in the stream can be stored, CHS maintains only the frequent items (known as hierarchical heavy hitters) at various levels of the generalization hierarchy, exploiting the natural hierarchy of the data. The algorithms guarantee that each count is accurate to within an \(\epsilon\) bound. Furthermore, using aperiodic (CHS-A) and periodic (CHS-P) compression strategies, the proposed technique offers improved space complexities of \(O(\frac{\eta}{\epsilon})\) and \(O(\frac{\eta}{\epsilon}\log \epsilon N)\), respectively. We provide theoretical proofs of both the space and time requirements of the CHS algorithms. We also compare the proposed algorithms experimentally with existing benchmark techniques. The results show that the proposed algorithms require fewer updates per data element and use a moderate, bounded amount of memory. Moreover, precision-recall analysis demonstrates that CHS produces high-quality output compared with the existing benchmark techniques. For the experimental validation, we used both synthetic data derived from an open source generator and real benchmark data sets from an international Internet Service Provider.
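The abstract describes the data structure only in outline, so the following is a minimal sketch of the general idea and not the authors' CHS-A or CHS-P algorithm. It assumes a lossy-counting-style prune rule with periodic compression, in which the count of a pruned entry is rolled up to its parent prefix. The class name `HierarchicalSummary`, the methods `update` and `estimate`, and the tuple encoding of hierarchical items are all hypothetical choices made for illustration.

```python
from math import ceil


class HierarchicalSummary:
    """Illustrative counter-based summary over hierarchical items.

    Assumption: lossy-counting-style pruning with roll-up to the parent
    prefix. This is a sketch, not the CHS algorithm from the paper.
    """

    def __init__(self, epsilon):
        self.width = ceil(1 / epsilon)  # compress once every `width` items
        self.n = 0                      # number of items seen so far
        self.entries = {}               # prefix tuple -> [count, delta]

    def update(self, item):
        key = tuple(item)
        self.n += 1
        bucket = ceil(self.n / self.width)
        if key in self.entries:
            self.entries[key][0] += 1
        else:
            # delta bounds how many occurrences may have been missed earlier
            self.entries[key] = [1, bucket - 1]
        if self.n % self.width == 0:
            self._compress(bucket)

    def _compress(self, bucket):
        # Visit deeper prefixes first, so a parent that is already stored
        # receives its children's rolled-up counts before it is examined.
        for key in sorted(self.entries, key=len, reverse=True):
            count, delta = self.entries[key]
            if key and count + delta <= bucket:  # the root () is never pruned
                del self.entries[key]
                parent = key[:-1]
                if parent in self.entries:
                    self.entries[parent][0] += count
                else:
                    self.entries[parent] = [count, delta]

    def estimate(self, prefix):
        """Sum the stored counts at or below `prefix` in the hierarchy."""
        prefix = tuple(prefix)
        return sum(c for k, (c, _) in self.entries.items()
                   if k[:len(prefix)] == prefix)


# Hypothetical two-level items, e.g. the leading parts of an address.
stream = [("10", "1")] * 3 + [("10", "2"), ("10", "1"), ("20", "9"),
                              ("10", "1"), ("10", "2")]
summary = HierarchicalSummary(epsilon=0.25)
for item in stream:
    summary.update(item)
print(summary.estimate(("10",)), summary.estimate(()))
```

Because pruned counts only ever move to an ancestor, the estimate for a prefix in this sketch never exceeds its true count, and the estimate for the empty prefix always equals the number of items seen. Whether the error stays within the \(\epsilon\) bound claimed for CHS depends on details the abstract does not give.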

Notes

1. http://www-ai.cs.uni-dortmund.de/SOFTWARE/HHHPlugin/index.html
2. http://www.tcpdump.org/manpages/tcpdump.1.html, Accessed: 23/02/2015.
3. https://www.wireshark.org/, Accessed: 23/02/2015.

Author information

Authors and Affiliations

University of New South Wales, Canberra, Australia
Zubair Shah, Abdun Naser Mahmood & Michael Barlow

Corresponding author

Correspondence to Zubair Shah.

Editor information

Editors and Affiliations

The University of Melbourne, Melbourne, Victoria, Australia
James Bailey
The University of Texas at Dallas, Richardson, Texas, USA
Latifur Khan
Osaka University, Osaka, Japan
Takashi Washio
University of Auckland, Auckland, New Zealand
Gill Dobbie
Shenzhen University, Shenzhen, China
Joshua Zhexue Huang
Massey University, Auckland, New Zealand
Ruili Wang

About this paper

Cite this paper

Shah, Z., Mahmood, A.N., Barlow, M. (2016). Computing Hierarchical Summary of the Data Streams. In: Bailey, J., Khan, L., Washio, T., Dobbie, G., Huang, J., Wang, R. (eds) Advances in Knowledge Discovery and Data Mining. PAKDD 2016. Lecture Notes in Computer Science, vol 9652. Springer, Cham. https://doi.org/10.1007/978-3-319-31750-2_14

DOI: https://doi.org/10.1007/978-3-319-31750-2_14
Published: 12 April 2016
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-31749-6
Online ISBN: 978-3-319-31750-2
eBook Packages: Computer Science (R0)
