Knowledge and Information Systems, Volume 30, Issue 1, pp 87–111

Application-independent feature construction based on almost-closedness properties

  • Dominique Gay
  • Nazha Selmaoui-Folcher
  • Jean-François Boulicaut
Regular Paper

Abstract

Feature construction has been studied extensively, including for 0/1 data samples. Given the recent breakthroughs in closedness-related constraint-based mining, we consider their impact on feature construction for classification tasks. We investigate the use of condensed representations of frequent itemsets based on closedness properties as new features. These itemset types have been proposed to avoid set counting in difficult association rule mining tasks, i.e., when data are noisy and/or highly correlated. However, we conjecture that their intrinsic properties (namely maximality for the closed itemsets and minimality for the δ-free itemsets) should have an impact on feature quality. Understanding this impact remains a fairly open question, and we discuss it by means of itemset properties on the one hand and an experimental validation on various, possibly noisy, data sets on the other hand.
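
To make the two itemset properties named in the abstract concrete, here is a minimal Python sketch on a toy 0/1 data set. It is not the authors' implementation: the data and the helper names (`TRANSACTIONS`, `support`, `closure`, `is_closed`, `is_delta_free`) are hypothetical, and the δ-freeness test assumes the usual definition from the free-sets literature, under which an itemset is δ-free when no rule between its own items holds with at most δ exceptions.

```python
# Toy 0/1 data (assumed for illustration): each transaction is the set of
# items whose value is 1 in that row.
TRANSACTIONS = [
    frozenset("abc"),
    frozenset("abc"),
    frozenset("ab"),
    frozenset("ac"),
    frozenset("bd"),
]


def support(itemset, transactions):
    """Count the transactions that contain every item of `itemset`."""
    items = frozenset(itemset)
    return sum(1 for t in transactions if items <= t)


def closure(itemset, transactions):
    """Return the largest superset of `itemset` with the same support.

    It is the intersection of all transactions covering `itemset`.
    """
    items = frozenset(itemset)
    covering = [t for t in transactions if items <= t]
    if not covering:
        # Convention chosen here: an unsupported itemset is returned as-is.
        return items
    return frozenset.intersection(*covering)


def is_closed(itemset, transactions):
    """A closed itemset is maximal: it equals its own closure."""
    return closure(itemset, transactions) == frozenset(itemset)


def is_delta_free(itemset, transactions, delta):
    """A delta-free itemset is minimal: dropping any single item raises
    the support by more than `delta` (no rule with <= delta exceptions)."""
    items = frozenset(itemset)
    full = support(items, transactions)
    return all(
        support(items - {a}, transactions) - full > delta
        for a in items
    )


if __name__ == "__main__":
    # {c} and {a, c} cover the same transactions: {c} is the 0-free
    # (minimal) member of the class, {a, c} is the closed (maximal) one.
    print(sorted(closure("c", TRANSACTIONS)))
    print(is_delta_free("c", TRANSACTIONS, 0), is_closed("ac", TRANSACTIONS))
```

In this toy data, `{c}` and `{a, c}` belong to the same closure equivalence class. Raising `delta` makes the freeness test stricter: `{a, b}` passes for `delta = 0` but fails for `delta = 1`, because the rule b ⇒ a then has only one exception.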

Keywords

Feature construction, Pattern-based classification, δ-free itemsets, δ-strong rules, Closure equivalence classes, Noise-tolerance

Copyright information

© Springer-Verlag London Limited 2010

Authors and Affiliations

  • Dominique Gay (1, 2)
  • Nazha Selmaoui-Folcher (1)
  • Jean-François Boulicaut (3)

  1. University of New-Caledonia, NOUMEA Cédex, France
  2. Orange Labs, TECH/ASAP/PROF, LANNION Cédex, France
  3. INSA-Lyon, Villeurbanne, France
