Soft Computing, Volume 16, Issue 5, pp 903–917

New algorithms for finding approximate frequent item sets

  • Christian Borgelt
  • Christian Braune
  • Tobias Kötter
  • Sonja Grün
Focus

Abstract

In standard frequent item set mining a transaction supports an item set only if all items in the set are present. However, in many cases this is too strict a requirement that can render it impossible to find certain relevant groups of items. By relaxing the support definition, allowing for some items of a given set to be missing from a transaction, this drawback can be amended. The resulting item sets have been called approximate, fault-tolerant or fuzzy item sets. In this paper we present two new algorithms to find such item sets: the first is an extension of item set mining based on cover similarities and computes and evaluates the subset size occurrence distribution with a scheme that is related to the Eclat algorithm. The second employs a clustering-like approach, in which the distances are derived from the item covers with distance measures for sets or binary vectors and which is initialized with a one-dimensional Sammon projection of the distance matrix. We demonstrate the benefits of our algorithms by applying them to a concept detection task on the 2008/2009 Wikipedia Selection for schools and to the neurobiological task of detecting neuron ensembles in (simulated) parallel spike trains.
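The relaxed support notion described above can be illustrated with a minimal sketch. It is not the authors' algorithm (which evaluates a subset size occurrence distribution with an Eclat-like scheme); it only counts, by brute force, the transactions that miss at most a given number of items, and computes one possible set-based distance between item covers. The function names, the `max_missing` parameter, the toy transactions, and the choice of the Jaccard distance are all assumptions made for illustration.

```python
from typing import Hashable, Iterable, List, Set


def approx_support(transactions: List[Set[Hashable]],
                   itemset: Iterable[Hashable],
                   max_missing: int = 1) -> int:
    """Count transactions that lack at most `max_missing` items of `itemset`.

    With max_missing=0 this reduces to the standard (exact) support.
    """
    items = frozenset(itemset)
    return sum(1 for t in transactions if len(items - t) <= max_missing)


def cover(transactions: List[Set[Hashable]], item: Hashable) -> Set[int]:
    """Return the indices of the transactions that contain `item`."""
    return {i for i, t in enumerate(transactions) if item in t}


def jaccard_distance(a: Set[int], b: Set[int]) -> float:
    """Jaccard distance between two covers (assumed example measure)."""
    union = a | b
    if not union:
        return 0.0  # two empty covers are treated as identical
    return 1.0 - len(a & b) / len(union)


# Hypothetical toy data, not taken from the paper.
transactions = [
    {"a", "b", "c"},
    {"a", "b"},
    {"a", "c"},
    {"b", "d"},
    {"a", "b", "c", "d"},
]

exact = approx_support(transactions, {"a", "b", "c"}, max_missing=0)
relaxed = approx_support(transactions, {"a", "b", "c"}, max_missing=1)
dist_ab = jaccard_distance(cover(transactions, "a"), cover(transactions, "b"))
```

On this toy data the item set {a, b, c} is contained completely in two transactions, while allowing one missing item raises the count to four, which shows how the relaxed definition can recover groups that the strict one hides. The cover distance is the kind of quantity a clustering-like approach could start from.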

Copyright information

© Springer-Verlag 2011

Authors and Affiliations

  • Christian Borgelt (1)
  • Christian Braune (1, 2)
  • Tobias Kötter (3)
  • Sonja Grün (4, 5)

  1. European Centre for Soft Computing, Mieres (Asturias), Spain
  2. Department of Computer Science, Otto-von-Guericke-University of Magdeburg, Magdeburg, Germany
  3. Department of Computer Science, University of Konstanz, Constance, Germany
  4. RIKEN Brain Science Institute, Wako-Shi, Japan
  5. Institute of Neuroscience and Medicine (INM-6), Research Center Jülich, Jülich, Germany