A Multi-Domain Architecture for Mining Frequent Items and Itemsets from Distributed Data Streams

Cesario, Eugenio; Mastroianni, Carlo; Talia, Domenico

doi:10.1007/s10723-013-9277-0

Published: 11 October 2013

Journal of Grid Computing, Volume 12, pages 153–168 (2014)

Eugenio Cesario¹,
Carlo Mastroianni¹ &
Domenico Talia²


Abstract

Real-time analysis of distributed data streams is challenging because it requires scalable solutions that can handle data generated very rapidly by multiple sources. This paper presents the design and implementation of an architecture for analyzing data streams in distributed environments. In particular, the analysis computes the items and itemsets whose frequency exceeds a given threshold. The mining approach is hybrid: frequent items are computed in a single pass using a sketch algorithm, while frequent itemsets are computed by a further multi-pass analysis. The architecture combines parallel and distributed processing to keep pace with the rate of the distributed data streams. To keep computation close to the data, miners are distributed among the domains where the data streams are generated. The paper reports experimental results obtained with a prototype of the architecture, tested on a Grid composed of three domains, each handling one data stream.
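
The hybrid approach described in the abstract can be illustrated with a minimal sketch. The abstract does not name the sketch algorithm, so this example assumes a Count-Min sketch as one plausible choice. All names here (`CountMinSketch`, `sketch_domain`, `mine`, `DOMAINS`) and the parameter values are hypothetical and are not taken from the paper. Each domain summarizes its own stream in a single pass, the per-domain sketches are merged, and a further pass counts only the pairs built from items the merged sketch reports as frequent.

```python
import hashlib
from collections import Counter
from itertools import combinations


class CountMinSketch:
    """Fixed-size frequency summary (assumed here; the paper's sketch may differ).
    Estimates may overcount because of hash collisions, but never undercount."""

    def __init__(self, width=256, depth=4):
        self.width, self.depth = width, depth
        self.rows = [[0] * width for _ in range(depth)]

    def _index(self, item, row):
        # One independent hash per row, obtained by salting BLAKE2b with the row id.
        digest = hashlib.blake2b(str(item).encode(), digest_size=8,
                                 salt=row.to_bytes(8, "little")).digest()
        return int.from_bytes(digest, "little") % self.width

    def add(self, item, count=1):
        for r in range(self.depth):
            self.rows[r][self._index(item, r)] += count

    def estimate(self, item):
        return min(self.rows[r][self._index(item, r)] for r in range(self.depth))

    def merge(self, other):
        # Sketches with the same shape and hash functions are additive.
        if (self.width, self.depth) != (other.width, other.depth):
            raise ValueError("sketch shapes differ")
        for r in range(self.depth):
            for c in range(self.width):
                self.rows[r][c] += other.rows[r][c]


def sketch_domain(transactions):
    """Single pass over one domain's stream, run where the data is generated."""
    sketch, seen = CountMinSketch(), set()
    for t in transactions:
        for item in set(t):
            sketch.add(item)
            seen.add(item)  # simplification: a real miner would bound this set
    return sketch, seen


def mine(domains, min_support):
    threshold = min_support * sum(len(d) for d in domains)
    merged, universe = CountMinSketch(), set()
    for d in domains:
        local, seen = sketch_domain(d)
        merged.merge(local)
        universe |= seen
    # Pass 1 result: items whose estimated count reaches the threshold.
    frequent_items = {i for i in universe if merged.estimate(i) >= threshold}
    # Further pass: count exactly the pairs made of frequent items only.
    pair_counts = Counter()
    for d in domains:
        for t in d:
            pair_counts.update(combinations(sorted(set(t) & frequent_items), 2))
    frequent_pairs = {p: c for p, c in pair_counts.items() if c >= threshold}
    return frequent_items, frequent_pairs


# Three toy domains, each with its own stream of transactions (made-up data).
DOMAINS = [
    [("a", "b", "c"), ("a", "b"), ("a", "d")],
    [("a", "b"), ("b", "c"), ("a", "b", "e")],
    [("a", "c"), ("a", "b"), ("b", "d")],
]
items, pairs = mine(DOMAINS, min_support=0.5)
```

Because the sketch never undercounts, every truly frequent item survives the first pass, and the exact counting in the second pass discards any pair that only looked promising because of a collision. Longer itemsets would need additional passes in the same style.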

Author information

Authors and Affiliations

ICAR-CNR, Via P. Bucci 41C, Rende (CS), 87036, Italy
Eugenio Cesario & Carlo Mastroianni
ICAR-CNR and DIMES, University of Calabria, Via P. Bucci 41C, Rende (CS), 87036, Italy
Domenico Talia

Corresponding author

Correspondence to Eugenio Cesario.

About this article

Cite this article

Cesario, E., Mastroianni, C. & Talia, D. A Multi-Domain Architecture for Mining Frequent Items and Itemsets from Distributed Data Streams. J Grid Computing 12, 153–168 (2014). https://doi.org/10.1007/s10723-013-9277-0

Received: 09 October 2012
Accepted: 17 September 2013
Published: 11 October 2013
Issue Date: March 2014
DOI: https://doi.org/10.1007/s10723-013-9277-0
