Data Mining and Knowledge Discovery, Volume 7, Issue 1, pp 5–22

Free-Sets: A Condensed Representation of Boolean Data for the Approximation of Frequency Queries

  • Jean-François Boulicaut
  • Artur Bykowski
  • Christophe Rigotti
Article

Abstract

Given a large collection of transactions containing items, a basic common data mining problem is to extract the so-called frequent itemsets (i.e., sets of items appearing in at least a given number of transactions). In this paper, we propose a structure called free-sets, from which we can approximate any itemset support (i.e., the number of transactions containing the itemset) and we formalize this notion in the framework of ε-adequate representations (H. Mannila and H. Toivonen, 1996. In Proc. of the Second International Conference on Knowledge Discovery and Data Mining (KDD'96), pp. 189–194). We show that frequent free-sets can be efficiently extracted using pruning strategies developed for frequent itemset discovery, and that they can be used to approximate the support of any frequent itemset. Experiments on real dense data sets show a significant reduction of the size of the output when compared with standard frequent itemset extraction. Furthermore, the experiments show that the extraction of frequent free-sets is still possible when the extraction of frequent itemsets becomes intractable, and that the supports of the frequent free-sets can be used to approximate very closely the supports of the frequent itemsets. Finally, we consider the effect of this approximation on association rules (a popular kind of patterns that can be derived from frequent itemsets) and show that the corresponding errors remain very low in practice.
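
The two definitions used in the abstract (the support of an itemset, and frequent itemsets as those whose support reaches a given threshold) can be illustrated with a minimal sketch. The function names, the toy transactions and the brute-force enumeration are assumptions made for illustration only. This is not the extraction algorithm of the paper, and it does not compute free-sets.

```python
from itertools import combinations


def support(transactions, itemset):
    """Number of transactions that contain every item of `itemset`."""
    items = frozenset(itemset)
    return sum(1 for t in transactions if items <= t)


def frequent_itemsets(transactions, min_support):
    """Brute-force enumeration of all non-empty itemsets whose support is
    at least `min_support`. Exponential in the number of items, so only
    suitable for tiny examples (hypothetical helper, not from the paper)."""
    all_items = sorted(set().union(*transactions))
    result = {}
    for size in range(1, len(all_items) + 1):
        for candidate in combinations(all_items, size):
            count = support(transactions, candidate)
            if count >= min_support:
                result[frozenset(candidate)] = count
    return result


# Toy data (assumed): four transactions over the items a, b, c.
transactions = [
    frozenset({"a", "b", "c"}),
    frozenset({"a", "b"}),
    frozenset({"a", "c"}),
    frozenset({"b"}),
]

if __name__ == "__main__":
    for itemset, count in sorted(
        frequent_itemsets(transactions, 2).items(),
        key=lambda kv: (len(kv[0]), sorted(kv[0])),
    ):
        print(sorted(itemset), count)
```

The exhaustive enumeration makes the cost of the problem visible: the number of candidates doubles with each additional item, which is why pruning strategies and condensed representations matter on dense data.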

Keywords: condensed representations, frequent pattern discovery, association rules

References

  1. Agrawal, R., Imielinski, T., and Swami, A. 1993. Mining association rules between sets of items in large databases. In Proc. of the 1993 ACM SIGMOD International Conference on Management of Data (SIGMOD'93), Washington, D.C., pp. 207–216.
  2. Agrawal, R., Mannila, H., Srikant, R., Toivonen, H., and Verkamo, A.I. 1996. Fast discovery of association rules. In Advances in Knowledge Discovery and Data Mining, AAAI Press, pp. 307–328.
  3. Agrawal, R. and Srikant, R. 1994. Fast algorithms for mining association rules in large databases. In Proc. of the 20th International Conference on Very Large Data Bases (VLDB'94), Santiago de Chile, Chile, pp. 487–499.
  4. Bayardo, R.J. 1997. Brute-force mining of high-confidence classification rules. In Proc. of the Third International Conference on Knowledge Discovery and Data Mining (KDD'97), Newport Beach, California, pp. 123–126.
  5. Bayardo, R.J. 1998. Efficiently mining long patterns from databases. In Proc. of the 1998 ACM SIGMOD International Conference on Management of Data (SIGMOD'98), Seattle, Washington, pp. 85–93.
  6. Boulicaut, J.-F. and Bykowski, A. 2000. Frequent closures as a concise representation for binary data mining. In Proc. of the Fourth Pacific-Asia Conference on Knowledge Discovery and Data Mining (PAKDD'00), Vol. 1805 of LNAI, Kyoto, JP, Berlin: Springer-Verlag, pp. 62–73.
  7. Boulicaut, J.-F., Bykowski, A., and Rigotti, C. to appear. Approximation of frequency queries by means of free-sets. In Proc. of the 4th European Conference on Principles and Practice of Knowledge Discovery in Databases (PKDD'00), LNAI, Lyon, France, Berlin: Springer-Verlag.
  8. Bykowski, A. and Gomez-Chantada, L. 2000. Frequent item set extraction in highly-correlated data: A web usage mining application. In Proc. of the 2000 International Workshop on Web Knowledge Discovery and Data Mining (WKDDM'00), Kyoto, Japan, pp. 27–42.
  9. Fujiwara, S., Ullman, J.D., and Motwani, R. 2000. Dynamic miss-counting algorithms: Finding implication and similarity rules with confidence pruning. In Proc. of the 16th International Conference on Data Engineering (ICDE'00), San Diego, California, pp. 501–511.
  10. Mannila, H. and Toivonen, H. 1996. Multiple uses of frequent sets and condensed representations. In Proc. of the Second International Conference on Knowledge Discovery and Data Mining (KDD'96), Portland, Oregon, pp. 189–194.
  11. Mannila, H. and Toivonen, H. 1997. Levelwise search and borders of theories in knowledge discovery. Data Mining and Knowledge Discovery, 1(3):241–258.
  12. Ng, R., Lakshmanan, L.V., Han, J., and Pang, A. 1998. Exploratory mining and pruning optimization of constrained association rules. In Proc. of the 1998 ACM SIGMOD International Conference on Management of Data (SIGMOD'98), Seattle, Washington, pp. 13–24.
  13. Pasquier, N., Bastide, Y., Taouil, R., and Lakhal, L. 1999. Efficient mining of association rules using closed item set lattices. Information Systems, 24(1):25–46.
  14. Pavlov, D., Mannila, H., and Smyth, P. 2000. Probabilistic models for query approximation with large data sets. Technical Report 2000-07, Department of Information and Computer Science, University of California, Irvine, CA-92697-3425.
  15. Piatetsky-Shapiro, G. 1991. Discovery, analysis, and presentation of strong rules. In Knowledge Discovery in Databases, Menlo Park, CA: AAAI Press, pp. 229–248.

Jean-François Boulicaut is currently an associate professor at INSA Lyon (Ph.D. in 1992, Habilitation in 2001). His main research interests are databases and knowledge discovery from databases. He is the leader of the “Data Mining group” at INSA Lyon and coordinates the cInQ IST-2000-26469 project (consortium on knowledge discovery with Inductive Queries, 2001–2004) funded by the European Union.

Artur Bykowski earned his Ph.D. in Computer Science from INSA Lyon in 2002. His Ph.D. thesis was entitled “Condensed representations of frequent sets: application to descriptive pattern discovery”. His research interests include condensed representations of frequent patterns and rule mining from large databases.

Christophe Rigotti received his Ph.D. in Computer Science from INSA Lyon in 1996. His research interests include data mining, constraint programming, inductive programming and databases. He is currently an assistant professor at INSA Lyon and a visiting member of the Ludwig Maximilians University of Munich.

Copyright information

© Kluwer Academic Publishers 2003

Authors and Affiliations

  • Jean-François Boulicaut (1)
  • Artur Bykowski (1)
  • Christophe Rigotti (1)

  1. Laboratoire d'Ingénierie des Systèmes d'Information, Villeurbanne Cedex, France