Knowledge and Information Systems, Volume 42, Issue 2, pp 465–492

Efficient algorithms for finding optimal binary features in numeric and nominal labeled data

  • Michael Mampaey
  • Siegfried Nijssen
  • Ad Feelders
  • Rob Konijn
  • Arno Knobbe
Regular Paper

Abstract

An important subproblem in supervised tasks such as decision tree induction and subgroup discovery is finding an interesting binary feature (such as a node split or a subgroup refinement) based on a numeric or nominal attribute, with respect to some discrete or continuous target variable. Often one is faced with a trade-off between the expressiveness of such features on the one hand and the ability to efficiently traverse the feature search space on the other hand. In this article, we present efficient algorithms to mine binary features that optimize a given convex quality measure. For numeric attributes, we propose an algorithm that finds an optimal interval, whereas for nominal attributes, we give an algorithm that finds an optimal value set. By restricting the search to features that lie on a convex hull in a coverage space, we can significantly reduce computation time. We present some general theoretical results on the cardinality of convex hulls in coverage spaces of arbitrary dimensions and perform a complexity analysis of our algorithms. In the important case of a binary target, we show that these algorithms have linear runtime in the number of examples. We further provide algorithms for additive quality measures, which have linear runtime regardless of the target type. Additive measures are particularly relevant to feature discovery in subgroup discovery. Our algorithms are shown to perform well through experimentation and furthermore provide additional expressive power leading to higher-quality results.
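To make the additive case concrete, the following is a minimal sketch and not the authors' algorithm. It assumes a binary target and uses weighted relative accuracy (WRAcc) as the quality measure, because WRAcc can be written as a sum of per-example contributions `(y - p0) / n`. Under that assumption, the best interval on a numeric attribute is a maximum-sum contiguous run over the sorted values (a Kadane-style scan, linear once the values are sorted), and the best value set on a nominal attribute is the set of values whose total contribution is positive. The function names and the toy data are hypothetical.

```python
from itertools import groupby


def best_interval_wracc(values, labels):
    """Return (wracc, lo, hi) for the closed interval [lo, hi] with maximal WRAcc.

    Assumes binary labels (0/1) and a non-empty data set.
    """
    n = len(values)
    p0 = sum(labels) / n  # overall fraction of positives
    # Sort by attribute value and merge ties: an interval cannot separate equal values.
    pairs = sorted(zip(values, labels))
    blocks = [(v, sum(y - p0 for _, y in grp))
              for v, grp in groupby(pairs, key=lambda p: p[0])]
    # Kadane-style scan: restart the run whenever the running sum is not positive.
    best = (float("-inf"), None, None)
    run_sum, run_start = 0.0, None
    for v, w in blocks:
        if run_start is None or run_sum <= 0:
            run_sum, run_start = w, v
        else:
            run_sum += w
        if run_sum > best[0]:
            best = (run_sum, run_start, v)
    total, lo, hi = best
    return total / n, lo, hi


def best_value_set_wracc(values, labels):
    """Return (wracc, chosen) where chosen is the set of nominal values with maximal WRAcc."""
    n = len(values)
    p0 = sum(labels) / n
    totals = {}
    for v, y in zip(values, labels):
        totals[v] = totals.get(v, 0.0) + (y - p0)
    # With an additive measure, each value can be judged on its own contribution.
    chosen = {v for v, w in totals.items() if w > 0}
    return sum(totals[v] for v in chosen) / n, chosen


if __name__ == "__main__":
    print(best_interval_wracc([1, 2, 3, 4, 5, 6], [0, 1, 1, 1, 0, 0]))
    print(best_value_set_wracc(list("aabbcc"), [1, 1, 1, 0, 0, 0]))
```

For general convex measures that are not additive, this shortcut does not apply; the abstract's convex-hull restriction addresses that case.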

Keywords

Binary features, Decision trees, Subgroup discovery, Numeric data, Nominal data, Labeled data, Convex functions, Convex hulls, Coverage space, ROC analysis

Copyright information

© Springer-Verlag London 2013

Authors and Affiliations

  • Michael Mampaey (1)
  • Siegfried Nijssen (2, 3)
  • Ad Feelders (4)
  • Rob Konijn (2, 5)
  • Arno Knobbe (2)

  1. University of Bonn, Bonn, Germany
  2. Leiden University, Leiden, The Netherlands
  3. KU Leuven, Louvain, Belgium
  4. Utrecht University, Utrecht, The Netherlands
  5. VU University Amsterdam, Amsterdam, The Netherlands
