Journal of Grid Computing, Volume 12, Issue 1, pp 153–168

A Multi-Domain Architecture for Mining Frequent Items and Itemsets from Distributed Data Streams

  • Eugenio Cesario
  • Carlo Mastroianni
  • Domenico Talia

Abstract

Real-time analysis of distributed data streams is challenging because it requires scalable solutions that can handle data generated very rapidly by multiple sources. This paper presents the design and implementation of an architecture for analyzing data streams in distributed environments. In particular, the analysis computes the items and itemsets whose frequency exceeds a given threshold. The mining approach is hybrid: frequent items are computed in a single pass using a sketch algorithm, while frequent itemsets are computed by a further multi-pass analysis. The architecture combines parallel and distributed processing to keep pace with the rate of the distributed data streams. To keep computation close to the data, miners are distributed among the domains where the data streams are generated. The paper reports experimental results obtained with a prototype of the architecture, tested on a Grid composed of three domains, each handling one data stream.
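
The hybrid approach described in the abstract can be illustrated with a minimal single-machine sketch. The abstract does not name the sketch algorithm or detail the multi-pass step, so everything below is an assumption made for illustration: a Count-Min sketch stands in for the single-pass stage, an exact count of 2-itemsets stands in for the further analysis, and the names `CountMinSketch`, `frequent_items`, and `frequent_pairs` are hypothetical. The distribution of miners across domains is not modelled.

```python
import hashlib
from collections import Counter
from itertools import combinations


class CountMinSketch:
    """Fixed-size summary whose estimates never fall below the true count."""

    def __init__(self, width=256, depth=4):
        self.width = width
        self.depth = depth
        self.rows = [[0] * width for _ in range(depth)]

    def _index(self, item, row):
        # One hash per row, obtained by salting the item with the row id.
        digest = hashlib.sha256(f"{row}:{item}".encode()).digest()
        return int.from_bytes(digest[:8], "big") % self.width

    def add(self, item):
        for row in range(self.depth):
            self.rows[row][self._index(item, row)] += 1

    def estimate(self, item):
        # Collisions can only inflate a counter, so the minimum is the best bound.
        return min(
            self.rows[row][self._index(item, row)] for row in range(self.depth)
        )


def frequent_items(transactions, threshold, sketch):
    """Single pass: update the sketch, keep items whose estimate reaches the threshold."""
    candidates = set()
    for transaction in transactions:
        for item in set(transaction):
            sketch.add(item)
            if sketch.estimate(item) >= threshold:
                candidates.add(item)
    return candidates


def frequent_pairs(transactions, items, threshold):
    """Further pass: exact counts of 2-itemsets built only from frequent items."""
    counts = Counter()
    for transaction in transactions:
        kept = sorted(set(transaction) & items)
        counts.update(combinations(kept, 2))
    return {pair for pair, n in counts.items() if n >= threshold}


# Toy stream (assumed data, not taken from the paper's experiments).
stream = [["a", "b", "c"], ["a", "b"], ["a", "c"], ["b", "d"], ["a", "b", "d"]]
sketch = CountMinSketch()
items = frequent_items(stream, 3, sketch)
pairs = frequent_pairs(stream, items, 3)
```

Because a Count-Min sketch can overestimate but never underestimate, the first stage may admit a few false candidates yet cannot miss a truly frequent item. The exact second pass then discards any itemset that does not actually reach the threshold.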

Keywords

Distributed data mining · Frequent items · Frequent itemsets · Grid · Stream mining

Copyright information

© Springer Science+Business Media Dordrecht 2013

Authors and Affiliations

  • Eugenio Cesario (1)
  • Carlo Mastroianni (1)
  • Domenico Talia (2)
  1. ICAR-CNR, Rende (CS), Italy
  2. ICAR-CNR and DIMES, University of Calabria, Rende (CS), Italy
