Abstract
Boolean data is a core data type in machine learning. It is used to represent categorical and transactional data. Unlike real valued data, it is notoriously difficult to efficiently design boolean datasets that satisfy particular constraints. Inverse Frequent Itemset Mining (IFM) is the problem of constructing a boolean dataset, satisfying given support constraints for some itemsets. Previous work mainly focuses on the theoretical complexity of IFM and practical solutions scale poorly or do not satisfy all the constraints. We propose Items2Data, a practical algorithm for generating boolean datasets which is efficient under specific conditions. We introduce global closure to describe the condition which a dataset can be efficiently constructed. We evaluate Items2Data and its use in designing synthetic datasets and to analyze its accuracy, scalability and speed on real world datasets. The results indicate Items2Data is practical and efficient for generating synthetic boolean data when pre-defined itemsets are globally closed.
Access this chapter
Tax calculation will be finalised at checkout
Purchases are for personal use only
Similar content being viewed by others
References
Trefethen, L., Bau, D.: Numerical Linear Algebra. Other Titles in Applied Mathematics. Society for Industrial and Applied Mathematics (1997)
Belohlavek, R., Vychodil, V.: Discovery of optimal factors in binary data via a novel method of matrix decomposition. J. Comput. Syst. Sci. 76(1), 3–20 (2010)
Guzzo, A., Moccia, L., Saccà , D., Serra, E.: Solving inverse frequent itemset mining with infrequency constraints via large-scale linear programs. ACM Trans. Knowl. Disc. Data (TKDD) 7(4), 18 (2013)
Guzzo, A., Saccà , D., Serra, E.: An effective approach to inverse frequent set mining. In: Ninth IEEE International Conference on Data Mining, ICDM 2009, pp. 806–811. IEEE (2009)
Wu, X., Wu, Y., Wang, Y., Li, Y.: Privacy-aware market basket data set generation: a feasible approach for inverse frequent set mining. In: Proceedings of the 2005 SIAM International Conference on Data Mining, pp. 103–114. SIAM (2005)
Ramesh, G., Zaki, M.J., Maniatty, W.A.: Distribution-based synthetic database generation techniques for itemset mining. In: 9th International Database Engineering and Application Symposium, IDEAS 2005, pp. 307–316. IEEE (2005)
Calders, T.: The complexity of satisfying constraints on databases of transactions. Acta Informatica 44(7–8), 591–624 (2007)
Calders, T.: Computational complexity of itemset frequency satisfiability. In: Proceedings of the Twenty-Third ACM SIGMOD-SIGACT-SIGART Symposium on Principles of Database Systems, pp. 143–154. ACM (2004)
Pasquier, N., Bastide, Y., Taouil, R., Lakhal, L.: Discovering frequent closed itemsets for association rules. In: Beeri, C., Buneman, P. (eds.) ICDT 1999. LNCS, vol. 1540, pp. 398–416. Springer, Heidelberg (1999). https://doi.org/10.1007/3-540-49257-7_25
Mielikainen, T.: On inverse frequent set mining. In: Proceedings of the 3rd IEEE ICDM Workshop on Privacy Preserving Data Mining, pp. 18–23. Citeseer (2003)
Madsen, L., Birkes, D.: Simulating dependent discrete data. J. Stat. Comput. Simul. 83(4), 677–691 (2013)
Agrawal, R., Srikant, R., et al.: Fast algorithms for mining association rules. In: Proceedings of the 20th International Conference on Very Large Data Bases, VLDB, vol. 1215, pp. 487–499 (1994)
Calders, T., Rigotti, C., Boulicaut, J.-F.: A survey on condensed representations for frequent sets. In: Boulicaut, J.-F., De Raedt, L., Mannila, H. (eds.) Constraint-Based Mining and Inductive Databases. LNCS (LNAI), vol. 3848, pp. 64–80. Springer, Heidelberg (2006). https://doi.org/10.1007/11615576_4
Calders, T., Goethals, B.: Non-derivable itemset mining. Data Min. Knowl. Disc. 14(1), 171–206 (2007)
Pedregosa, F., et al.: Scikit-learn: machine learning in Python. J. Mach. Learn. Res. 12, 2825–2830 (2011)
Rish, I.: An empirical study of the Naive Bayes Classifier. In: IJCAI 2001 Workshop on Empirical Methods in Artificial Intelligence, vol. 3, pp. 41–46. IBM (2001)
Dheeru, D., Karra Taniskidou, E.: UCI machine learning repository (2017)
Geurts, K., Wets, G., Brijs, T., Vanhoof, K.: Profiling of high-frequency accident locations by use of association rules. Transp. Res. Rec. J. Transp. Res. Board 1840, 123–130 (2003)
Author information
Authors and Affiliations
Corresponding author
Editor information
Editors and Affiliations
Rights and permissions
Copyright information
© 2019 Springer Nature Switzerland AG
About this paper
Cite this paper
Wong, I.S., Dobbie, G., Koh, Y.S. (2019). Items2Data: Generating Synthetic Boolean Datasets from Itemsets. In: Chang, L., Gan, J., Cao, X. (eds) Databases Theory and Applications. ADC 2019. Lecture Notes in Computer Science(), vol 11393. Springer, Cham. https://doi.org/10.1007/978-3-030-12079-5_6
Download citation
DOI: https://doi.org/10.1007/978-3-030-12079-5_6
Published:
Publisher Name: Springer, Cham
Print ISBN: 978-3-030-12078-8
Online ISBN: 978-3-030-12079-5
eBook Packages: Computer ScienceComputer Science (R0)