
Free-Sets: A Condensed Representation of Boolean Data for the Approximation of Frequency Queries

Data Mining and Knowledge Discovery

Abstract

Given a large collection of transactions containing items, a basic common data mining problem is to extract the so-called frequent itemsets (i.e., sets of items appearing in at least a given number of transactions). In this paper, we propose a structure called free-sets, from which we can approximate any itemset support (i.e., the number of transactions containing the itemset) and we formalize this notion in the framework of ε-adequate representations (H. Mannila and H. Toivonen, 1996. In Proc. of the Second International Conference on Knowledge Discovery and Data Mining (KDD'96), pp. 189–194). We show that frequent free-sets can be efficiently extracted using pruning strategies developed for frequent itemset discovery, and that they can be used to approximate the support of any frequent itemset. Experiments on real dense data sets show a significant reduction of the size of the output when compared with standard frequent itemset extraction. Furthermore, the experiments show that the extraction of frequent free-sets is still possible when the extraction of frequent itemsets becomes intractable, and that the supports of the frequent free-sets can be used to approximate very closely the supports of the frequent itemsets. Finally, we consider the effect of this approximation on association rules (a popular kind of patterns that can be derived from frequent itemsets) and show that the corresponding errors remain very low in practice.
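To make the notions of support and free-set concrete, here is a minimal Python sketch on a toy data set. The abstract defines support but does not define free-sets, so the sketch assumes the usual δ-free-set criterion: an itemset X is δ-free when, for every item a in X, the rule X \ {a} ⇒ a has more than δ exceptions, i.e. support(X \ {a}) − support(X) > δ. The function names, the toy transactions, and the brute-force enumeration are illustrative assumptions and not the extraction algorithm of the paper.

```python
from itertools import combinations

# Toy data (hypothetical): each transaction is a set of items.
TRANSACTIONS = [
    frozenset("abc"),
    frozenset("ab"),
    frozenset("abc"),
    frozenset("c"),
    frozenset("ab"),
]


def support(transactions, itemset):
    """Number of transactions containing every item of `itemset`."""
    items = frozenset(itemset)
    return sum(1 for t in transactions if items <= t)


def is_delta_free(transactions, itemset, delta=0):
    """Assumed criterion: X is delta-free if, for each item a in X,
    the rule X \\ {a} => a has more than `delta` exceptions."""
    items = frozenset(itemset)
    sup_x = support(transactions, items)
    return all(
        support(transactions, items - {a}) - sup_x > delta for a in items
    )


def frequent_free_sets(transactions, min_support, delta=0):
    """Brute-force enumeration for illustration only; it does not use
    the pruning strategies mentioned in the abstract."""
    all_items = sorted(set().union(*transactions))
    result = {}
    for size in range(len(all_items) + 1):
        for candidate in combinations(all_items, size):
            sup = support(transactions, candidate)
            if sup >= min_support and is_delta_free(transactions, candidate, delta):
                result[frozenset(candidate)] = sup
    return result


if __name__ == "__main__":
    for itemset, sup in frequent_free_sets(TRANSACTIONS, min_support=2).items():
        print(sorted(itemset), sup)
```

In this toy data every transaction containing a also contains b, so {a, b} is not 0-free: its support equals that of {a} and can be recovered from it, which is the intuition behind storing only the free-sets.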


References

  • Agrawal, R., Imielinski, T., and Swami, A. 1993. Mining association rules between sets of items in large databases. In Proc. of the 1993 ACM SIGMOD International Conference on Management of Data (SIGMOD'93), Washington, D.C., pp. 207–216.

  • Agrawal, R., Mannila, H., Srikant, R., Toivonen, H., and Verkamo, A.I. 1996. Fast discovery of association rules. In Advances in Knowledge Discovery and Data Mining, AAAI Press, pp. 307–328.

  • Agrawal, R. and Srikant, R. 1994. Fast algorithms for mining association rules in large databases. In Proc. of the 20th International Conference on Very Large Data Bases (VLDB'94), Santiago de Chile, Chile, pp. 487–499.

  • Bayardo, R.J. 1997. Brute-force mining of high-confidence classification rules. In Proc. of the Third International Conference on Knowledge Discovery and Data Mining (KDD'97), Newport Beach, California, pp. 123–126.

  • Bayardo, R.J. 1998. Efficiently mining long patterns from databases. In Proc. of the 1998 ACM SIGMOD International Conference on Management of Data (SIGMOD'98), Seattle, Washington, pp. 85–93.

  • Boulicaut, J.-F. and Bykowski, A. 2000. Frequent closures as a concise representation for binary data mining. In Proc. of the Fourth Pacific-Asia Conference on Knowledge Discovery and Data Mining (PAKDD'00), Vol. 1805 of LNAI, Kyoto, JP, Berlin: Springer-Verlag, pp. 62–73.


  • Boulicaut, J.-F., Bykowski, A., and Rigotti, C. to appear. Approximation of frequency queries by means of free-sets. In Proc. of the 4th European Conference on Principles and Practice of Knowledge Discovery in Databases (PKDD'00), LNAI, Lyon, France, Berlin: Springer-Verlag.

  • Bykowski, A. and Gomez-Chantada, L. 2000. Frequent item set extraction in highly-correlated data: A web usage mining application. In Proc. of the 2000 International Workshop on Web Knowledge Discovery and Data Mining (WKDDM'00), Kyoto, Japan, pp. 27–42.

  • Fujiwara, S., Ullman, J.D., and Motwani, R. 2000. Dynamic miss-counting algorithms: Finding implication and similarity rules with confidence pruning. In Proc. of the 16th International Conference on Data Engineering (ICDE'00), San Diego, California, pp. 501–511.

  • Mannila, H. and Toivonen, H. 1996. Multiple uses of frequent sets and condensed representations. In Proc. of the Second International Conference on Knowledge Discovery and Data Mining (KDD'96), Portland, Oregon, pp. 189–194.

  • Mannila, H. and Toivonen, H. 1997. Levelwise search and borders of theories in knowledge discovery. Data Mining and Knowledge Discovery, 1(3):241–258.


  • Ng, R., Lakshmanan, L.V., Han, J., and Pang, A. 1998. Exploratory mining and pruning optimization of constrained association rules. In Proc. of the 1998 ACM SIGMOD International Conference on Management of Data (SIGMOD'98), Seattle, Washington, pp. 13–24.

  • Pasquier, N., Bastide, Y., Taouil, R., and Lakhal, L. 1999. Efficient mining of association rules using closed item set lattices. Information Systems, 24(1):25–46.


  • Pavlov, D., Mannila, H., and Smyth, P. 2000. Probabilistic models for query approximation with large data sets. Technical Report 2000-07, Department of Information and Computer Science, University of California, Irvine, CA-92697-3425.

  • Piatetsky-Shapiro, G. 1991. Discovery, analysis, and presentation of strong rules. In Knowledge Discovery in Databases, Menlo Park, CA: AAAI Press, pp. 229–248.


  • Jean-François Boulicaut is currently an associate professor at INSA Lyon (Ph.D. in 1992, Habilitation in 2001). His main research interests are databases and knowledge discovery from databases. He is the leader of the “Data Mining group” at INSA Lyon and coordinates the cInQ IST-2000-26469 project (consortium on knowledge discovery with Inductive Queries, 2001–2004) funded by the European Union.

  • Artur Bykowski earned his Ph.D. in Computer Science from INSA Lyon in 2002. His Ph.D. thesis was entitled “Condensed representations of frequent sets: application to descriptive pattern discovery”. His research interests include condensed representations of frequent patterns and rule mining from large databases.

  • Christophe Rigotti received his Ph.D. in Computer Science from INSA Lyon in 1996. His research interests include data mining, constraint programming, inductive programming and databases. He is currently an assistant professor at INSA Lyon and a visiting member of the Ludwig Maximilians University of Munich.


Cite this article

Boulicaut, JF., Bykowski, A. & Rigotti, C. Free-Sets: A Condensed Representation of Boolean Data for the Approximation of Frequency Queries. Data Mining and Knowledge Discovery 7, 5–22 (2003). https://doi.org/10.1023/A:1021571501451


  • DOI: https://doi.org/10.1023/A:1021571501451
