Knowledge and Information Systems

, Volume 29, Issue 1, pp 55–80 | Cite as

Efficient mining of all margin-closed itemsets with applications in temporal knowledge discovery and classification by compression

Regular Paper

Abstract

Margin-closed itemsets have previously been proposed as a subset of the closed itemsets with a minimum margin constraint on the difference in support to supersets. The constraint reduces redundancy in the set of reported patterns favoring longer, more specific patterns. A variety of patterns ranging from rare specific itemsets to frequent general itemsets is reported to support exploratory data analysis and understandable classification models. We present DCI_Margin, a new efficient algorithm that mines the complete set of margin-closed itemsets. We modified the DCI_Closed algorithm that has low memory requirements and can be parallelized. The margin constraint is checked on-the-fly reusing information already computed by DCI_Closed. We thoroughly analyzed the behavior on many datasets and show how other data mining algorithms can benefit from the redundancy reduction.

Keywords

Closed itemsets Constrained itemsets Condensed representation Temporal data mining Compression 

Preview

Unable to display preview. Download preview PDF.

Unable to display preview. Download preview PDF.

References

  1. 1.
    Afrati F, Gionis A, Mannila H (2004) Approximating a collection of frequent sets, In: Proceedings of 10th ACM SIGKDD international conference on Knowledge Discovery and Data Mining. ACM Press, pp 12–19Google Scholar
  2. 2.
    Agrawal R, Imielinski T, Swami AN (1993) Mining association rules between sets of items in large databases. In: Proceedings of ACM SIGMOD international conference on Management of Data. ACM Press, pp 207–216.Google Scholar
  3. 3.
    Agrawal R, Srikant R (1995) Mining sequential patterns. In: Yu PS and Chen ASP (eds) Proceedings of 11th international conference on data engineering. pp 3–14Google Scholar
  4. 4.
    Asuncion A, Newman D (2007) UCI machine learning repository. http://www.ics.uci.edu/~mlearn/MLRepository.html
  5. 5.
    Beil F, Ester M, Xu X (2002) Frequent term-based text clustering, In: Proceedings of 8th international conference on knowledge discovery and data mining. pp 436–442Google Scholar
  6. 6.
    Boley M, Grosskreutz H (2009) Approximating the number of frequent sets in dense data. Knowl Inf Syst 21(1): 65–89CrossRefGoogle Scholar
  7. 7.
    Bonchi F, Lucchese C (2005) Pushing tougher constraints in frequent pattern mining, In: Proceedings of Pacific-Asia conference on knowledge discovery and data Mining. pp 114–124Google Scholar
  8. 8.
    Bonchi F, Lucchese C (2006) On condensed representations of constrained frequent patterns. Knowl Inf Syst 9(2): 180–201CrossRefGoogle Scholar
  9. 9.
    Bonchi F, Lucchese C (2007) Extending the state-of-the-art of constraint-based pattern discovery. Data Min Knowl Discov 60(2): 377–399Google Scholar
  10. 10.
    Boulicaut J-F, Bykowski A (2000) Frequent closures as a concise representation for binary data mining. In: Proceedings of Pacific-Asia conference on knowledge discovery and data mining. pp 62–73Google Scholar
  11. 11.
    Boulicaut J-F, Bykowski A, Rigotti C (2003) Free-sets: a condensed representation of boolean data for the approximation of frequency queries. Data Min Knowl Discov 7(1): 5–22MathSciNetCrossRefGoogle Scholar
  12. 12.
    Bringmann B, Zimmermann A (2009) One in a million: picking the right patterns. Knowl Inf Syst 18(1): 61–81CrossRefGoogle Scholar
  13. 13.
    Bykowski A, Rigotti C (2001) A condensed representation to find frequent patterns. In: Proceedings of 20th ACM SIGMOD-SIGACT-SIGART symposium on principles of database systems. ACM Press, pp 267–273Google Scholar
  14. 14.
    Calders T, Goethals B (2002) Mining all non-derivable frequent itemsets. In: Proceedings of 6th European conference on principles of data mining and knowledge discovery. Springer, pp 74–85Google Scholar
  15. 15.
    Calders T, Goethals B (2003) Minimal k-free representations of frequent sets, In: Proceedings of 7th European conference on principles and practice of knowledge discovery in databases. Springer, pp 71–82Google Scholar
  16. 16.
    Calders T, Goethals B (2007) Non-derivable itemset mining. Data Min Knowl Discov 14(1): 171–206MathSciNetCrossRefGoogle Scholar
  17. 17.
    Calders T, Goethals B, Mampaey M (2007) Mining itemsets in the presence of missing values. In: Proceedings of international symposium on applied computing. ACM, pp 404–408Google Scholar
  18. 18.
    Calders T, Rigotti C, Boulicaut J-F (2006) A survey on condensed representations for frequent sets. In: Constraint-based mining and inductive databases. pp 64–80Google Scholar
  19. 19.
    Cheng H, Yan X, Han J, Hsu C (2007) Discriminative frequent pattern analysis for effective classification. In: Proceedings of IEEE international conference on data engineering. pp 716–725Google Scholar
  20. 20.
    Cheng H, Yu PS, Han J (2006) AC-Close: efficiently mining approximate closed itemsets by core pattern recovery. In: Proceedings of IEEE international conference on data mining. IEEE pp 839–844Google Scholar
  21. 21.
    Cheng H, Yu PS, Han J (2008) Approximate frequent itemset mining in the presence of random noise. In: Soft computing for knowledge discovery and data Mining. Springer, pp 363–389Google Scholar
  22. 22.
    Cheng J, Ke Y, Ng W (2006) δ-tolerance closed frequent itemsets. In: Proceedings of 6th IEEE international conference on data mining. IEEE Press, pp 139–148Google Scholar
  23. 23.
    Coenen F (2003) The LUCS-KDD discretised/normalised ARM and CARM data library. http://www.csc.liv.ac.uk/~frans/KDD/Software/LUCS_KDD_DN/
  24. 24.
    De Raedt L, Guns T, Nijssen S (2008) Constraint programming for itemset mining. In: Proceedings of 14th ACM SIGKDD international conference on knowledge discovery and data mining. ACM, pp 204–212Google Scholar
  25. 25.
    Faloutsos C, Megalooikonomou V (2007) On data mining, compression, and kolmogorov complexity. Data Min Knowl Discov 15(1): 3–20MathSciNetCrossRefGoogle Scholar
  26. 26.
    Fung B, Wang K, Ester M (2003) Hierarchical document clustering using frequent itemsets, In: Proceedings of SIAM international conference on data miningGoogle Scholar
  27. 27.
    Gallo A, De Bie T, Cristianini N (2007) Mini: Mining informative non-redundant itemsets, In: Proceedings of European symposium on principles of data mining and knowledge Discovery. pp 438–445Google Scholar
  28. 28.
    Garriga G (2005) Summarizing sequential data with closed partial orders. In: Proceedings of 5th SIAM international conference on data mining. SIAM, pp 380–391Google Scholar
  29. 29.
    Garriga G, Kralj P, Lavrac N (2006) Closed sets for labeled data. In: Proceedings of European conference on principles and practice of knowledge discovery in databases. pp 163–174Google Scholar
  30. 30.
    Geerts F, Goethals B, Mielikäinen T (2004) Tiling databases. In: Proceedings of discovery science. pp 278–289Google Scholar
  31. 31.
    Goethals B, Zaki M (2003) FIMI ’03, frequent itemset mining implementations, In: Proceedings of ICDM 2003 workshop on frequent itemset mining implementationsGoogle Scholar
  32. 32.
    Grice H (1989) Studies in the way of Words. Harvard University Press, CambridgeGoogle Scholar
  33. 33.
    Gupta R, Fang G, Field B, Steinbach M, Kumar V (2008) Quantitative evaluation of approximate frequent pattern mining algorithms. In: Proceedings of 14th ACM SIGKDD international conference on knowledge discovery and data Mining. ACM, pp 301–309Google Scholar
  34. 34.
    Han J, Pei J (2001) Pattern growth methods for sequential pattern mining: Principles and extensions, In: Workshop on temporal data mining, 7th ACM SIGKDD international conference on knowledge discovery and data mining. ACM PressGoogle Scholar
  35. 35.
    Han J, Pei J, Yin Y (2000) Mining frequent patterns without candidate generation. In: Proceedings of ACM SIGMOD international conference on management of data. ACM Press, pp 1–12Google Scholar
  36. 36.
    Hipp J, Güntzer U, Nakhaeizadeh G (2000) Algorithms for association rule mining—a general survey and comparison. SIGKDD Explor 2(1): 58–64CrossRefGoogle Scholar
  37. 37.
    Keogh E, Lonardi S, Ratanamahatana C, Wei L, Lee S, Handley J (2007) Compression-based data mining of sequential data. Data Min Knowl Discov 14(1): 99–129MathSciNetCrossRefGoogle Scholar
  38. 38.
    Kohavi R, Brodley C, Frasca B, Mason L, Zheng Z (2000) ‘KDD-Cup 2000 organizers’ report: Peeling the onion. SIGKDD Explor 2(2): 86–98. http://www.ecn.purdue.edu/KDDCUP
  39. 39.
    Kryszkiewicz M (2001) Concise representation of frequent patterns based on disjunction-free generators, In: Proceedings of 1st IEEE international conference on data mining. IEEE Press, pp 305–312Google Scholar
  40. 40.
    Kuramochi M, Karypis G (2001) Frequent subgraph discovery. In: Proceedings of IEEE international conference on data mining, pp 313–320Google Scholar
  41. 41.
    Li J, Li H, Wong L, Pei J, Dong G (2006) Minimum description length principle: generators are preferable to closed patterns. In: Proceedings of AAAI, pp 409–414Google Scholar
  42. 42.
    Li W, Han J, Pei J (2001) CMAR: Accurate and efficient classification based on multiple class-association rules. In: Proceedings of IEEE international conference on data mining. pp 369–376Google Scholar
  43. 43.
    Liu B, Hsu W, Ma Y (1998) Integrating classification and association rule mining. In: Proceedings of international conference on knowledge discovery and data mining. pp 80–86Google Scholar
  44. 44.
    Liu G, Li J, Wong L (2008) A new concise representation of frequent itemsets using generators and a positive border. Knowl Inf Syst 17(1): 35–56MathSciNetCrossRefGoogle Scholar
  45. 45.
    Liu J, Paulsen S, Wang W, Nobel A, Prins J (2005) Mining approximate frequent itemsets from noisy data. In: Proceedings of 5th international conference data mining. IEEE, pp 721–724Google Scholar
  46. 46.
    Lucchese C, Orlando S, Perego R (2006a) Fast and memory efficient mining of frequent closed itemsets. IEEE Trans Knowl Data Eng 18(1): 21–36CrossRefGoogle Scholar
  47. 47.
    Lucchese C, Orlando S, Perego R (2006b) Mining frequent closed itemsets out of core, In: Proceedings of the 6th SIAM international conference on data mining (SDM’06)Google Scholar
  48. 48.
    Lucchese C, Orlando S, Perego R (2007) Parallel mining of frequent closed patterns: harnessing modern computer architectures. In: Proceedings IEEE international conference on data miningGoogle Scholar
  49. 49.
    Malik H, Kender J (2006) High quality, efficient hierarchical document clustering using closed interesting itemsets. In: Proceedings of IEEE international conference on data mining. pp 991–996Google Scholar
  50. 50.
    Mielikäinen T. (2005) Summarization techniques for pattern collections in data mining, PhD thesis, University of Helsinki, FinlandGoogle Scholar
  51. 51.
    Mörchen F (2006) Algorithms for time series knowledge mining, In: Proceedings 12th ACM SIGKDD international conference on knowledge discovery and data mining. ACM Press, pp 668–673Google Scholar
  52. 52.
    Mörchen F, Ultsch A (2007) Efficient mining of understandable patterns from multivariate interval time series. Data Min Knowl Discov 15(2): 181–215MathSciNetCrossRefGoogle Scholar
  53. 53.
    Muhonen J, Toivonen H (2006) Closed non-derivable itemsets. In: Proceedings European symposium on principles of data mining and knowledge discovery. pp 601–608Google Scholar
  54. 54.
    Ng R, Lakshmanan LV, Han J, Pang A (1998) Exploratory mining and pruning optimizations of constrained associations rules. In: Proceedings of ACM SIGMOD conference on management of Data. ACM, pp 13–24Google Scholar
  55. 55.
    Nijssen S, Fromont E (2007) Mining optimal decision trees from itemset lattices. In: Proceedings of international conference on knowledge discovery and data mining. ACM, pp 530–539Google Scholar
  56. 56.
    Pasquier N, Bastide Y, Taouil R, Lakhal L (1999) Discovering frequent closed itemsets for association rules. In: Proceedings of 7th international conference on database theory. Springer, pp 398–416Google Scholar
  57. 57.
    Pei J, Dong G, Zou W, Han J (2002) On computing condensed frequent pattern bases. In: Proceedings of 2nd IEEE international conference on data mining. IEEE Press, pp 378–385Google Scholar
  58. 58.
    Pei J, Han J, Lakshmanan LVS (2001) Mining frequent itemsets with convertible constraints. In: Proceedings of IEEE international conference on data Engineering. IEEE, pp 433–442Google Scholar
  59. 59.
    Pei J, Tung AK, Han J (2001) Fault-tolerant frequent pattern mining: problems and challenges. In: Workshop on research issues in data mining and knowledge discovery, 20th ACM SIGMOD-SIGACT-SIGART symposium on principles of database systemsGoogle Scholar
  60. 60.
    Pei J, Wang H, Liu J, Wang K, Wang J, Yu P (2006) Discovering frequent closed partial orders from strings. IEEE Trans Knowl Data Eng 18(11): 1467–1481CrossRefGoogle Scholar
  61. 61.
    Pudi V, Haritsa J (2003) Generalized closed itemsets for association rule mining. In: Proceedings of 19th international conference on data engineering. IEEE Press pp 714–716Google Scholar
  62. 62.
    Seppänen J, Mannila H (2004) Dense itemsets. In: Proceedings of 10th ACM SIGKDD international conference on knowledge discovery and data mining. ACM Press, pp 683–688Google Scholar
  63. 63.
    Siebes A (2006) Item sets that compress. In: Proceedings of SIAM Conference on data mining. pp 393–404Google Scholar
  64. 64.
    Song G, Yang D, Cui B, Zheng B, Liu Y, Xie K (2007) CLAIM: An efficient method for relaxed frequent closed itemsets mining over stream data. In: Proceedings of 12th international conference on database systems for advanced applications. Springer, pp 664–675Google Scholar
  65. 65.
    Srikant R, Vu Q, Agrawal R (1997) Mining association rules with item constraints, In: Proceedings of international conference on knowledge discovery and data mining. ACM, pp 67–73Google Scholar
  66. 66.
    Sripada SG, Reiter E, Hunter J (2003) Generating English summaries of time series data using the Gricean maxims, In: Proceedings of 9th ACM SIGKDD international conference on knowledge discovery and data mining. ACM Press, pp 187–196Google Scholar
  67. 67.
    Tatti N (2007) Maximum entropy based significance of itemsets. In: Proceedings of IEEE international conference on data mining. pp 312–321Google Scholar
  68. 68.
    Tatti N (2008) Maximum entropy based significance of itemsets. Knowl Inf Syst 17(1): 57–77CrossRefGoogle Scholar
  69. 69.
    Uno T, Arimura H (2007) An efficient polynomial delay algorithm for pseudo frequent itemset mining. In: Proceedings of 10th international conference discovery science. Springer, pp 219–230Google Scholar
  70. 70.
    Van Leeuwen M, Siebes A (2008) StreamKrimp: Detecting change in data streams. In: Proceedings of European conference on machine learning and principles and practices of knowledge discovery in data. pp 765–774Google Scholar
  71. 71.
    van Leeuwen M, Vreeken J, Siebes A (2006) Compression picks item sets that matter, In: Proceedings of European conference on principles and practice of knowledge discovery in databases. pp 585–592Google Scholar
  72. 72.
    Vreeken J, Siebes A (2008) Filling in the blanks—Krimp minimisation for missing data. In: Proceedings of 8th IEEE international conference on data mining. pp 1067–1072Google Scholar
  73. 73.
    Wang J, Karypis G (2006) On mining instance-centric classification rules. IEEE Trans Knowl Data Eng 18(11): 1497–1511CrossRefGoogle Scholar
  74. 74.
    Wang K, Xu C, Liu B (1999) Clustering transactions using large items. In: Conference on information and knowledge management. pp 483–490Google Scholar
  75. 75.
    Webb GI (2007) Discovering significant patterns. Mach Learn 68(1): 1–33CrossRefGoogle Scholar
  76. 76.
    Xin D, Han J, Yan X, Cheng H (2005) Mining compressed frequent-pattern sets. In: Proceedings of 31st international conference on very large data bases. pp 709–720Google Scholar
  77. 77.
    Yahia SB, Hamrouni T, Mephu Nguifo E (2006) Frequent closed itemset based algorithms: a thorough structural and analytical survey. ACM SIGKDD Explor 8(1): 93–104CrossRefGoogle Scholar
  78. 78.
    Yan X, Cheng H, Han J, Xin D (2005) Summarizing itemset patterns: a profile-based approach, In: Proceedings of 11th ACM SIGKDD international conference on knowledge discovery and data mining. ACM Press, pp 314–323Google Scholar
  79. 79.
    Yang C, Fayyad U, Bradley P (2001) Efficient discovery of error-tolerant frequent itemsets in high dimensions, In: Proceedings of 7th ACM SIGKDD international conference on knowledge discovery and data mining. ACM Press, pp 194–203Google Scholar
  80. 80.
    Yin X, Han J (2003) CPAR: Classification based on predictive association rules. In: Proceedings of SIAM international conference on data miningGoogle Scholar
  81. 81.
    Zaki M (2004) Mining non-redundant association rules. Data Min Knowl Discov 9(3): 223–248MathSciNetCrossRefGoogle Scholar
  82. 82.
    Zaki M, Hsiao C-J (2002) CHARM: An efficient algorithm for closed itemset mining. In: Proceedings of 2nd SIAM international conference on data mining SIAM. pp 457–473Google Scholar
  83. 83.
    Zhao Y, Karypis G (2002) Evaluation of hierarchical clustering algorithms for document datasets. In: Proceedings of 11th Conference of information and knowledge management. pp 515–524Google Scholar

Copyright information

© Springer-Verlag London Limited 2010

Authors and Affiliations

  1. 1.Siemens Corporate ResearchPrincetonUSA
  2. 2.Databionic Research GroupPhilipps-University MarburgMarburgGermany

Personalised recommendations