New algorithms for finding approximate frequent item sets

Abstract

In standard frequent item set mining a transaction supports an item set only if all items in the set are present. However, in many cases this is too strict a requirement that can render it impossible to find certain relevant groups of items. By relaxing the support definition, allowing for some items of a given set to be missing from a transaction, this drawback can be amended. The resulting item sets have been called approximate, fault-tolerant or fuzzy item sets. In this paper we present two new algorithms to find such item sets: the first is an extension of item set mining based on cover similarities and computes and evaluates the subset size occurrence distribution with a scheme that is related to the Eclat algorithm. The second employs a clustering-like approach, in which the distances are derived from the item covers with distance measures for sets or binary vectors and which is initialized with a one-dimensional Sammon projection of the distance matrix. We demonstrate the benefits of our algorithms by applying them to a concept detection task on the 2008/2009 Wikipedia Selection for schools and to the neurobiological task of detecting neuron ensembles in (simulated) parallel spike trains.
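The relaxed support definition and the cover-based distances mentioned above can be illustrated with a minimal Python sketch. It is not the implementation of either algorithm in the paper. The function names (`approx_support`, `cover`, `jaccard_distance`), the `max_missing` parameter and the toy transactions are assumptions chosen for illustration, and the Jaccard distance stands in as one example of a distance measure for sets.

```python
from typing import Hashable, Iterable, List, Set


def approx_support(transactions: List[Set[Hashable]],
                   itemset: Iterable[Hashable],
                   max_missing: int = 0) -> int:
    """Count transactions that lack at most `max_missing` items of `itemset`.

    With max_missing=0 this is the standard (exact) support.
    Illustrative only: the paper's support definitions may differ.
    """
    items = set(itemset)
    return sum(1 for t in transactions if len(items - t) <= max_missing)


def cover(transactions: List[Set[Hashable]], item: Hashable) -> Set[int]:
    """Return the indices of the transactions that contain `item`."""
    return {i for i, t in enumerate(transactions) if item in t}


def jaccard_distance(a: Set[int], b: Set[int]) -> float:
    """Jaccard distance between two item covers (0.0 if both are empty)."""
    union = a | b
    if not union:
        return 0.0
    return 1.0 - len(a & b) / len(union)


# Hypothetical toy database of five transactions.
transactions = [
    {"a", "b", "c"},
    {"a", "b"},
    {"a", "c"},
    {"b", "c", "d"},
    {"d"},
]

exact = approx_support(transactions, {"a", "b", "c"}, max_missing=0)
relaxed = approx_support(transactions, {"a", "b", "c"}, max_missing=1)
dist_ab = jaccard_distance(cover(transactions, "a"), cover(transactions, "b"))
```

In this toy database only one transaction contains all of a, b and c, whereas four transactions contain at least two of the three items, so tolerating a single missing item raises the support from 1 to 4.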

Notes

  1. http://www.borgelt.net/sodim.html.

  2. http://www.borgelt.net/genpst.html.

  3. The only case in which groups can be complete is when the co-occurrences of two groups overlap accidentally and this fills (one of) the formerly created gaps. However, this is highly unlikely and we did not observe this case in our experiments.

  4. The script used to perform these experiments can be found in the source package of our SODIM implementation at http://www.borgelt.net/sodim.html.

  5. http://schools-wikipedia.org/.

  6. http://en.wikipedia.org/.

  7. http://schools-wikipedia.org/wp/l/List_of_elements_by_name.htm.

  8. http://www.borgelt.net/genpst.html.

  9. Note that, since we are considering only pairwise comparisons here, we are less restricted than in Segond and Borgelt (2011), where measures referring to $n_{01}$ and $n_{10}$ other than through the sum of these quantities are not applicable.

References

  • Aggarwal CC, Lin Y, Wang J, Wang J (2009) Frequent pattern mining with uncertain data. In: Proceedings of the 15th ACM SIGKDD international conference on knowledge discovery and data mining (KDD 2009, Paris, France). ACM Press, New York, pp 29–38

  • Agrawal R, Srikant R (1994) Fast algorithms for mining association rules. In: Proceedings of the 20th international conference on very large databases (VLDB 1994, Santiago de Chile). Morgan Kaufmann, San Mateo, pp 487–499

  • Besson J, Robardet C, Boulicaut J-F (2006) Mining a new fault-tolerant pattern type as an alternative to formal concept discovery. In: Proceedings of the international conference on computational science (ICCS 2006, Reading, United Kingdom). Springer, Berlin, pp 144–157

  • Berger D, Borgelt C, Diesmann M, Gerstein G, Grün S (2009) An accretion based data mining algorithm for identification of sets of correlated neurons. In: 18th Annual computational neuroscience meeting (CNS*2009). Berlin, Germany

  • Berger D, Borgelt C, Louis S, Morrison A, Grün S (2010) Efficient identification of assembly neurons within massively parallel spike trains. In: Computational intelligence and neuroscience 2010, Article ID 439648. Hindawi Publishing Corp., New York

  • Borgelt C, Wang X (2009) SaM: a split and merge algorithm for fuzzy frequent item set mining. In: Proceedings of the 13th international fuzzy systems association world congress and 6th conference of the European society for fuzzy logic and technology (IFSA/EUSFLAT’09, Lisbon, Portugal). IFSA/EUSFLAT Organization Committee, Lisbon, pp 968–973

  • Boulicaut JF, Bykowski A, Rigotti C (2000) Approximation of frequency queries by means of free-sets. In: Proceedings of the 4th European conference on principles and practice of knowledge discovery in databases (PKDD 2000, Lyon, France). LNCS, vol 1910. Springer, Heidelberg, pp 75–85

  • Buzsáki G (2004) Large-scale recording of neuronal ensembles. Nat Neurosci 7:446–451

  • Calders T, Garboni C, Goethals B (2010) Efficient pattern mining of uncertain data with sampling. In: Proceedings of the 14th Pacific-Asia conference on knowledge discovery and data mining (PAKDD 2010, Hyderabad, India), vol I. Springer, Berlin, pp 480–487

  • Cha S-H, Tappert CC, Yoon S (2006) Enhancing binary feature vector similarity measures. J Pattern Recognit Res 1:63–77

  • Choi S-S, Cha S-H, Tappert CC (2010) A survey of binary similarity and distance measures. J Syst Cybern Inf 8(1):43–48

  • Chui C-K, Kao B, Hung E (2007) Mining frequent itemsets from uncertain data. In: Proceedings of the 11th Pacific-Asia conference on knowledge discovery and data mining (PAKDD 2007, Nanjing, China). Springer, Berlin, pp 47–58

  • Creighton C, Hanash S (2003) Mining gene expression databases for association rules. Bioinformatics 19:79–86

  • Davé RN (1991) Characterization and detection of noise in clustering. Pattern Recognit Lett 12:657–664

  • Dice LR (1945) Measures of the amount of ecologic association between species. Ecology 26:297–302

  • Edwards AL (1976) The correlation coefficient. An introduction to linear regression and correlation. W.H. Freeman, San Francisco, pp 33–46

  • Gerstein GL, Perkel DH, Subramanian KN (1978) Identification of functionally related neural assemblies. Brain Res 140(1):43–62

  • Gionis A, Mannila H, Seppänen JK (2004) Geometric and combinatorial tiles in 0-1 data. In: Proceedings of the 8th European conference on principles and practice of knowledge discovery in databases (PKDD04, Pisa, Italy). LNAI, vol 3202. Springer, Berlin, pp 173–184

  • Grün S, Rotter S (eds) (2010) Analysis of parallel spike trains. Springer, Berlin

  • Hamming RW (1950) Error detecting and error correcting codes. Bell Syst Tech J 29:147–160

  • Hebb DO (1949) The organization of behavior. Wiley, New York

  • Jaccard P (1901) Étude comparative de la distribution florale dans une portion des Alpes et des Jura. Bulletin de la Société Vaudoise des Sciences Naturelles 37:547–579

  • Klement EP, Mesiar R, Pap E (2000) Triangular norms. Kluwer, Dordrecht

  • Kohavi R, Bradley CE, Frasca B, Mason L, Zheng Z (2000) KDD-Cup 2000 organizers’ report: peeling the onion. SIGKDD Explor 2(2):86–93

  • Leung CK-S, Carmichael CL, Hao B (2007) Efficient mining of frequent patterns from uncertain data. In: Proceedings of the 7th IEEE international conference on data mining workshops (ICDMW 2007, Omaha, NE). IEEE Press, Piscataway, pp 489–494

  • Lewicki MS (1998) A review of methods for spike sorting: the detection and classification of neural action potentials. Netw Comput Neural Syst 9:R53–R78

  • Pei J, Tung AKH, Han J (2001) Fault-tolerant frequent pattern mining: problems and challenges. In: Proceedings of the ACM SIGMOD workshop on research issues in data mining and knowledge discovery (DMKD’01, Santa Barbara, CA). ACM Press, New York

  • Pensa RG, Robardet C, Boulicaut JF (2006) Supporting bi-cluster interpretation in 0/1 data by means of local patterns. Intell Data Anal 10:457–472

  • Rehm F, Klawonn F, Kruse R (2007) A novel approach to noise clustering for outlier detection. Soft Comput Fusion Found Methodol Appl 11(5):489–494

  • Rogers DJ, Tanimoto TT (1960) A computer program for classifying plants. Science 132:1115–1118

  • Sammon JW (1969) A nonlinear mapping for data structure analysis. IEEE Trans Comput 18(5):401–409

  • Segond M, Borgelt C (2011) Item set mining based on cover similarity. In: Proceedings of the 15th Pacific-Asia conference on knowledge discovery and data mining (PAKDD 2011, Shenzhen, China). Springer, Berlin (in press)

  • Seppänen JK, Mannila H (2004) Dense itemsets. In: Proceedings of the 10th ACM SIGKDD international conference on knowledge discovery and data mining (KDD 2004, Seattle, WA). ACM Press, New York, pp 683–688

  • Sørensen T (1948) A method of establishing groups of equal amplitude in plant sociology based on similarity of species and its application to analyses of the vegetation on Danish commons. Biologiske Skrifter/Kongelige Danske Videnskabernes Selskab 5(4):1–34

  • Tanimoto TT (1957) IBM Internal Report, November 17

  • Wang X, Borgelt C, Kruse R (2005) Mining fuzzy frequent item sets. In: Proceedings of the 11th international fuzzy systems association world congress (IFSA’05, Beijing, China). Tsinghua University Press/Springer, Beijing/Heidelberg, pp 528–533

  • Yang C, Fayyad U, Bradley PS (2001) Efficient discovery of error-tolerant frequent itemsets in high dimensions. In: Proceedings of the 7th ACM SIGKDD international conference on knowledge discovery and data mining (KDD 2001, San Francisco, CA). ACM Press, New York, pp 194–203

  • Yule GU (1900) On the association of attributes in statistics. Philos Trans R Soc Lond Ser A 194:257–319

  • Zaki MJ, Parthasarathy S, Ogihara M, Li W (1997) New algorithms for fast discovery of association rules. In: Proceedings of the 3rd international conference on knowledge discovery and data mining (KDD’97, Newport Beach, CA). AAAI Press, Menlo Park, pp 283–296

Acknowledgments

This work was partially supported by the European Commission under the 7th Framework Program FP7-ICT-2007-C FET-Open, contract no. BISON-211898, and by the Helmholtz Alliance on Systems Biology.

Author information

Corresponding author

Correspondence to Christian Borgelt.

About this article

Cite this article

Borgelt, C., Braune, C., Kötter, T. et al. New algorithms for finding approximate frequent item sets. Soft Comput 16, 903–917 (2012). https://doi.org/10.1007/s00500-011-0776-2

  • DOI: https://doi.org/10.1007/s00500-011-0776-2

Keywords

  • Spike Train
  • Copy Probability
  • Item Cover
  • Supporting Transaction
  • Fiedler Vector