
A Multi-Domain Architecture for Mining Frequent Items and Itemsets from Distributed Data Streams

Journal of Grid Computing

Abstract

Real-time analysis of distributed data streams is challenging because it requires scalable solutions that can handle data generated very rapidly by multiple sources. This paper presents the design and implementation of an architecture for analyzing data streams in distributed environments. Specifically, the analysis computes the items and itemsets whose frequency exceeds a given threshold. The mining approach is hybrid: frequent items are computed in a single pass using a sketch algorithm, while frequent itemsets are computed by a further multi-pass analysis. The architecture combines parallel and distributed processing to keep pace with the rate of the distributed data streams. To keep computation close to the data, miners are distributed among the domains where the data streams are generated. The paper reports experimental results obtained with a prototype of the architecture, tested on a Grid composed of three domains, each handling one data stream.
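
To make the single-pass step more concrete, the following is a minimal sketch of how per-domain sketches could be built and then combined. The abstract does not name the sketch algorithm, so a Count-Min sketch is assumed here purely for illustration; the class `CountMinSketch`, the helper `summarize`, and the `width` and `depth` parameters are hypothetical names and not part of the paper's prototype.

```python
import hashlib


class CountMinSketch:
    """Approximate item counter; estimates never fall below the true count."""

    def __init__(self, width=256, depth=4):
        self.width = width
        self.depth = depth
        self.rows = [[0] * width for _ in range(depth)]

    def _index(self, item, row):
        # One salted hash per row; hashlib keeps indices stable across
        # processes, so sketches built in different domains stay compatible.
        digest = hashlib.sha256(f"{row}:{item}".encode()).digest()
        return int.from_bytes(digest[:8], "big") % self.width

    def add(self, item, count=1):
        for r in range(self.depth):
            self.rows[r][self._index(item, r)] += count

    def estimate(self, item):
        # The smallest counter across rows is the least inflated by collisions.
        return min(self.rows[r][self._index(item, r)] for r in range(self.depth))

    def merge(self, other):
        # Sketches of equal shape combine by cell-wise addition (assumed here
        # as the way summaries from several domains would be aggregated).
        if (self.width, self.depth) != (other.width, other.depth):
            raise ValueError("sketch dimensions differ")
        for r in range(self.depth):
            for c in range(self.width):
                self.rows[r][c] += other.rows[r][c]


def summarize(stream, width=256, depth=4):
    """Single pass over one domain's stream of transactions."""
    sketch = CountMinSketch(width, depth)
    n_transactions = 0
    for transaction in stream:
        n_transactions += 1
        for item in set(transaction):  # count each item once per transaction
            sketch.add(item)
    return sketch, n_transactions


# Hypothetical usage: two domains, one merged summary.
domain_a = [["a", "b"], ["a", "c"], ["a"]]
domain_b = [["a", "b"], ["b"]]
sketch_a, n_a = summarize(domain_a)
sketch_b, n_b = summarize(domain_b)
sketch_a.merge(sketch_b)
threshold = 0.5 * (n_a + n_b)
frequent = [x for x in ("a", "b", "c") if sketch_a.estimate(x) >= threshold]
```

Because a Count-Min estimate can only overcount, filtering on it may admit false positives but never drops a truly frequent item, which is why a later exact pass (such as the multi-pass itemset analysis the abstract mentions) pairs naturally with it.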

Author information

Corresponding author

Correspondence to Eugenio Cesario.

About this article

Cite this article

Cesario, E., Mastroianni, C. & Talia, D. A Multi-Domain Architecture for Mining Frequent Items and Itemsets from Distributed Data Streams. J Grid Computing 12, 153–168 (2014). https://doi.org/10.1007/s10723-013-9277-0

  • DOI: https://doi.org/10.1007/s10723-013-9277-0
