Data Mining and Knowledge Discovery, Volume 33, Issue 6, pp 1736–1774

Extending inverse frequent itemsets mining to generate realistic datasets: complexity, accuracy and emerging applications

  • Domenico Saccà
  • Edoardo Serra
  • Antonino Rullo

Abstract

The development of novel platforms and techniques for emerging “Big Data” applications requires real-life datasets for data-driven experiments. In most cases, however, such datasets are not accessible, for reasons such as confidentiality, privacy, or simply insufficient availability. An interesting way to ensure high-quality experimental findings is to synthesize datasets that reflect the patterns of real ones using a two-step approach: first, a real dataset X is analyzed to derive relevant patterns Z (latent variables); then, these patterns are used to reconstruct a new dataset \(X'\) that resembles X without being identical to it. The approach can be implemented using inverse mining techniques such as inverse frequent itemset mining (\(\texttt {IFM}\)), which consists of generating a transactional dataset that satisfies given support constraints on the itemsets of an input set, which are typically the frequent ones. This paper introduces various extensions of \(\texttt {IFM}\) within a uniform framework, with the aim of generating artificial datasets that reflect more elaborate patterns of real ones (in particular, infrequency and duplicate constraints). Furthermore, to enlarge the application domain of \(\texttt {IFM}\), an additional extension is introduced that considers more structured schemes for the datasets to be generated, as required in emerging big data applications, e.g., social network analytics.
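To make the notion of support constraints concrete, the following minimal Python sketch checks whether a candidate transactional dataset satisfies a set of constraints. It is an illustration only, not the method of the paper: the names `support` and `satisfies`, the toy items, and the encoding of each constraint as a closed interval of absolute supports are all assumptions made for this example.

```python
from typing import Dict, FrozenSet, List, Tuple

Transaction = FrozenSet[str]
Itemset = FrozenSet[str]


def support(dataset: List[Transaction], itemset: Itemset) -> int:
    """Count the transactions that contain every item of `itemset`."""
    return sum(1 for t in dataset if itemset <= t)


def satisfies(dataset: List[Transaction],
              constraints: Dict[Itemset, Tuple[int, int]]) -> bool:
    """Check that each constrained itemset has a support within its interval.

    Assumption: a constraint is a pair (low, high) of absolute supports.
    """
    return all(low <= support(dataset, s) <= high
               for s, (low, high) in constraints.items())


# Hypothetical constraints, e.g. derived from mining a real dataset X.
constraints = {
    frozenset({"a"}): (3, 4),
    frozenset({"a", "b"}): (2, 2),
    frozenset({"c"}): (0, 1),  # an upper bound keeps {c} infrequent
}

# A candidate synthetic dataset X'; duplicate transactions are allowed.
candidate = [
    frozenset({"a", "b"}),
    frozenset({"a", "b"}),
    frozenset({"a"}),
    frozenset({"b", "c"}),
]

print(satisfies(candidate, constraints))
```

An inverse mining procedure searches for a dataset that passes such a check; the sketch covers only the verification side, which is the simpler part of the problem.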

Keywords

Data mining · Frequent itemset mining · Inverse problems · Classification · Linear programming · Big data · Synthetic dataset

Notes

Funding

This work was funded by MISE, the Italian Ministry for Industry (Grant No. PON ID Service and Protect ID).

Copyright information

© The Author(s), under exclusive licence to Springer Science+Business Media LLC, part of Springer Nature 2019

Authors and Affiliations

  1. DIMES Department, University of Calabria, Rende, Italy
  2. CS Department, Boise State University, Boise, USA
