Knowledge and Information Systems, Volume 50, Issue 3, pp 969–997

SILVERBACK+: scalable association mining via fast list intersection for columnar social data

  • Yusheng Xie
  • Zhengzhang Chen
  • Diana Palsetia
  • Goce Trajcevski
  • Ankit Agrawal
  • Alok Choudhary
Regular Paper

Abstract

We present Silverback+, a scalable probabilistic framework for accurate association rule and frequent item-set mining of large-scale social behavioral data. Silverback+ addresses efficient storage utilization and management through (1) a probabilistic columnar infrastructure and (2) Bloom filters and sampling techniques. In addition, probabilistic pruning techniques based on the Apriori method are developed to accelerate the mining of frequent item-sets. The proposed target-driven techniques significantly reduce both the number of frequent item-set candidates and, through a novel list intersection algorithm, the number of repetitive membership checks required. Extensive experimental evaluations demonstrate the benefits of taking infrastructure limitations into account, in a context-aware manner, when applying these techniques. Compared with the traditional Hadoop-based approach of improving scalability by simply adding more hosts, Silverback+ exhibits much better runtime performance, with negligible loss of accuracy.
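The abstract names three ingredients (columnar storage, Bloom-filter membership checks, and list intersection for counting item-set support) without giving their details. The following is only a minimal sketch of how such pieces commonly fit together, not the Silverback+ algorithm itself. All names here (`BloomFilter`, `intersect_sorted`, `approx_intersect`, the `columns` data, and `min_support`) are hypothetical and chosen for illustration.

```python
import hashlib
from itertools import combinations


class BloomFilter:
    """Small Bloom filter: may give false positives, never false negatives."""

    def __init__(self, num_bits=1024, num_hashes=3):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray(num_bits)  # one byte per bit, for readability

    def _positions(self, item):
        # Derive num_hashes positions by salting one hash function.
        for seed in range(self.num_hashes):
            digest = hashlib.sha256(f"{seed}:{item}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, item):
        for pos in self._positions(item):
            self.bits[pos] = 1

    def __contains__(self, item):
        return all(self.bits[pos] for pos in self._positions(item))


def intersect_sorted(a, b):
    """Exact two-pointer intersection of two ascending id lists."""
    i = j = 0
    out = []
    while i < len(a) and j < len(b):
        if a[i] == b[j]:
            out.append(a[i])
            i += 1
            j += 1
        elif a[i] < b[j]:
            i += 1
        else:
            j += 1
    return out


def approx_intersect(small, bloom):
    """Probe a Bloom filter built from the other list (superset of exact result)."""
    return [x for x in small if x in bloom]


# Hypothetical columnar layout: each item maps to a sorted list of user ids.
columns = {
    "page_a": [1, 2, 3, 5, 8],
    "page_b": [2, 3, 5, 7],
    "page_c": [3, 5, 9],
}
min_support = 2  # assumed absolute support threshold

# Apriori-style step: only items that are frequent alone form candidate pairs.
frequent_1 = {item for item, ids in columns.items() if len(ids) >= min_support}
frequent_pairs = {}
for x, y in combinations(sorted(frequent_1), 2):
    common = intersect_sorted(columns[x], columns[y])
    if len(common) >= min_support:
        frequent_pairs[(x, y)] = common

# Approximate membership check against a Bloom filter of one column.
bloom_a = BloomFilter()
for uid in columns["page_a"]:
    bloom_a.add(uid)
exact = intersect_sorted(columns["page_a"], columns["page_b"])
approx = approx_intersect(columns["page_b"], bloom_a)
```

Because a Bloom filter never produces false negatives, `approx` always contains every id in `exact`. Any extra ids are false positives, which is the kind of small accuracy loss a probabilistic approach accepts in exchange for compact storage.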

Keywords

Association rule mining; Frequent item-set mining; Columnar probabilistic databases; Social media; Bloom filter

Copyright information

© Springer-Verlag London 2016

Authors and Affiliations

  • Yusheng Xie (1, 3)
  • Zhengzhang Chen (2)
  • Diana Palsetia (1)
  • Goce Trajcevski (1)
  • Ankit Agrawal (1)
  • Alok Choudhary (1)

  1. Department of Electrical and Computer Engineering, Northwestern University, Evanston, USA
  2. NEC Laboratories America, Princeton, USA
  3. Baidu Research, Sunnyvale, USA
