Abstract
Real-time analysis of distributed data streams is challenging because it requires scalable solutions that can handle data generated very rapidly by multiple sources. This paper presents the design and implementation of an architecture for analyzing data streams in distributed environments. In particular, the analysis computes the items and itemsets whose frequency exceeds a given threshold. The mining approach is hybrid: frequent items are computed in a single pass using a sketch algorithm, while frequent itemsets are computed by a further multi-pass analysis. The architecture combines parallel and distributed processing to keep pace with the rate of the distributed data streams. To keep computation close to the data, miners are distributed among the domains where the data streams are generated. The paper reports experimental results obtained with a prototype of the architecture, tested on a Grid composed of three domains, each handling one data stream.
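The hybrid approach described in the abstract can be illustrated with a minimal sketch. The abstract does not name the sketch algorithm, so a Count-Min sketch is assumed here purely for illustration. The names `CountMinSketch`, `frequent_items`, and `frequent_itemsets`, the table sizes, and the toy transactions are all hypothetical and are not taken from the paper. The example also runs on a single stream in one process and does not model the distribution of miners across domains.

```python
import hashlib
from collections import Counter
from itertools import combinations


class CountMinSketch:
    """Fixed-size table of counters; estimates never undercount (assumed algorithm)."""

    def __init__(self, width=64, depth=4):
        self.width = width
        self.depth = depth
        self.table = [[0] * width for _ in range(depth)]

    def _index(self, item, row):
        # Deterministic per-row hash derived from SHA-256.
        digest = hashlib.sha256(f"{row}:{item}".encode()).digest()
        return int.from_bytes(digest[:8], "big") % self.width

    def add(self, item):
        for row in range(self.depth):
            self.table[row][self._index(item, row)] += 1

    def estimate(self, item):
        return min(self.table[row][self._index(item, row)]
                   for row in range(self.depth))


def frequent_items(transactions, min_count, sketch):
    """Single pass: update the sketch, keep items whose estimate reaches min_count."""
    candidates = set()
    for transaction in transactions:
        for item in set(transaction):
            sketch.add(item)
            if sketch.estimate(item) >= min_count:
                candidates.add(item)
    return candidates


def frequent_itemsets(transactions, items, min_count, max_size=3):
    """Further passes: exact support of itemsets built only from frequent items."""
    result = {}
    for size in range(2, max_size + 1):
        counts = Counter()
        for transaction in transactions:
            kept = sorted(set(transaction) & items)
            counts.update(combinations(kept, size))
        level = {s: c for s, c in counts.items() if c >= min_count}
        if not level:
            break  # no frequent itemset of this size, so none larger either
        result.update(level)
    return result


# Hypothetical toy stream of transactions.
transactions = [
    ["a", "b", "c"],
    ["a", "b"],
    ["a", "c"],
    ["b", "d"],
    ["a", "b", "c"],
]
sketch = CountMinSketch()
items = frequent_items(transactions, min_count=3, sketch=sketch)
itemsets = frequent_itemsets(transactions, items, min_count=3)
```

Because the sketch can overestimate but never underestimate a count, the single pass may admit a few false candidates, which the exact counting in the later passes filters out.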
Cite this article
Cesario, E., Mastroianni, C. & Talia, D. A Multi-Domain Architecture for Mining Frequent Items and Itemsets from Distributed Data Streams. J Grid Computing 12, 153–168 (2014). https://doi.org/10.1007/s10723-013-9277-0
DOI: https://doi.org/10.1007/s10723-013-9277-0