# Extending inverse frequent itemsets mining to generate realistic datasets: complexity, accuracy and emerging applications

## Abstract

The development of novel platforms and techniques for emerging “Big Data” applications requires real-life datasets for data-driven experiments. In most cases, however, such datasets are not accessible, for reasons such as confidentiality, privacy, or simply insufficient availability. An interesting way to ensure high-quality experimental findings is to synthesize datasets that reflect the patterns of real ones using a two-step approach: first, a real dataset *X* is analyzed to derive relevant patterns *Z* (latent variables); then, these patterns are used to reconstruct a new dataset \(X'\) that resembles *X* but is not identical to it. The approach can be implemented using inverse mining techniques such as inverse frequent itemset mining (\(\texttt {IFM}\)), which consists of generating a transactional dataset that satisfies given support constraints on the itemsets of an input set, which are typically the frequent ones. This paper introduces various extensions of \(\texttt {IFM}\) within a uniform framework, with the aim of generating artificial datasets that reflect more elaborate patterns of real ones (in particular, infrequency and duplicate constraints). Furthermore, to broaden the application domain of \(\texttt {IFM}\), an additional extension is introduced that considers more structured schemes for the datasets to be generated, as required in emerging big data applications, e.g., social network analytics.
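To make the basic \(\texttt {IFM}\) problem statement concrete, the following is a minimal sketch, not the method developed in the paper. It assumes that each input itemset carries a lower and an upper bound on its support and searches exhaustively for a dataset of fixed size that meets all bounds, which is only viable for toy instances. The names `support`, `satisfies` and `brute_force_ifm`, the `(lo, hi)` encoding of constraints, and the example values are all illustrative assumptions.

```python
from itertools import combinations, combinations_with_replacement


def support(itemset, dataset):
    """Count the transactions in `dataset` that contain every item of `itemset`."""
    return sum(1 for transaction in dataset if itemset <= transaction)


def satisfies(dataset, constraints):
    """Check that each constrained itemset has support within its (lo, hi) bounds."""
    return all(lo <= support(s, dataset) <= hi for s, (lo, hi) in constraints.items())


def brute_force_ifm(items, constraints, size):
    """Return a list of `size` transactions meeting all constraints, or None.

    Hypothetical helper: enumerates every multiset of non-empty transactions,
    so the cost grows exponentially with the number of items.
    """
    ordered = sorted(items)
    candidates = [
        frozenset(c)
        for r in range(1, len(ordered) + 1)
        for c in combinations(ordered, r)
    ]
    for dataset in combinations_with_replacement(candidates, size):
        if satisfies(dataset, constraints):
            return list(dataset)
    return None


# Illustrative input (assumed values): {a, b} must occur exactly twice,
# {b, c} exactly once, and {a, c} never.
constraints = {
    frozenset("ab"): (2, 2),
    frozenset("bc"): (1, 1),
    frozenset("ac"): (0, 0),
}
dataset = brute_force_ifm("abc", constraints, size=3)
print(dataset)
```

In this sketch, the zero upper bound on `{a, c}` only loosely mimics an infrequency requirement on a single itemset. Instances can also be infeasible: for example, requiring support 0 for `{a}` and support 1 for `{a, b}` has no solution, because every transaction containing `{a, b}` also contains `{a}`.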

## Keywords

Data mining, Frequent itemset mining, Inverse problems, Classification, Linear programming, Big data, Synthetic dataset

## Notes

### Funding

This work was supported by MISE, the Italian Ministry for Industry (Grant No. PON ID Service and Protect ID).
