Knowledge and Information Systems, Volume 51, Issue 2, pp 369–394

Exceptionally monotone models—the rank correlation model class for Exceptional Model Mining

  • Lennart Downar
  • Wouter Duivesteijn
Regular Paper

Abstract

Exceptional Model Mining (EMM) strives to find coherent subgroups of the dataset where multiple target attributes interact in an unusual way. One instance of such an interaction that has been investigated is Pearson’s correlation coefficient between two targets; EMM then finds subgroups with an exceptionally linear relation between the targets. In this paper, we enrich the EMM toolbox by developing the more general rank correlation model class, which finds subgroups with an exceptionally monotone relation between the targets. Apart from catering for this richer set of relations, the rank correlation model class does not necessarily require the assumption of target normality, which is implicitly invoked in the Pearson’s correlation model class, and it is less sensitive to outliers. We provide pseudocode for the employed algorithm, analyze its computational complexity, and experimentally illustrate what the rank correlation model class for EMM can find for you on six datasets from an eclectic variety of domains.
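
As a rough illustration of the core idea, the following minimal Python sketch compares Spearman’s rank correlation between two targets inside a candidate subgroup with the rank correlation on its complement, and uses the absolute difference as an exceptionality score. This is not the authors’ exact algorithm or quality measure; the synthetic data, attribute names, and subgroup condition are purely hypothetical.

  # Minimal sketch of a rank-correlation-based exceptionality score for EMM.
  # Hypothetical data and subgroup condition; not the paper's exact measure.
  import numpy as np
  from scipy.stats import spearmanr

  rng = np.random.default_rng(0)
  n = 500
  x = rng.normal(size=n)                          # descriptive attribute
  t1 = rng.normal(size=n)                         # first target
  t2 = 0.8 * t1 + rng.normal(scale=0.5, size=n)   # second target, monotone in t1

  mask = x > 1.0                                  # hypothetical subgroup description

  rho_sub, _ = spearmanr(t1[mask], t2[mask])      # rank correlation inside the subgroup
  rho_rest, _ = spearmanr(t1[~mask], t2[~mask])   # rank correlation on the complement

  # Score the subgroup by how much its monotone relation deviates from the rest.
  quality = abs(rho_sub - rho_rest)
  print(f"subgroup rho={rho_sub:.3f}, complement rho={rho_rest:.3f}, quality={quality:.3f}")

A full quality measure would typically also account for subgroup size or statistical significance; the raw difference above is only meant to convey the idea of contrasting a monotone relation inside a subgroup with the relation on the rest of the data.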

Keywords

Rank correlation · Exceptional Model Mining · Monotonicity · Subgroup Discovery · Data mining

Acknowledgments

We would like to thank Dr. Johannes Albrecht (Emmy Noether group leader at the TU Dortmund, department of experimental physics, with research focus on the CERN LHCb Experiment) for fruitful discussion and helpful comments. This research is supported in part by the Deutsche Forschungsgemeinschaft (DFG) within the Collaborative Research Center SFB 876 “Providing Information by Resource-Constrained Analysis,” Project A1. This work was supported by the European Union through the ERC Consolidator Grant FORSIED (Project Reference 615517).

Copyright information

© Springer-Verlag London 2016

Authors and Affiliations

  1. Fakultät für Informatik, LS VIII, Technische Universität Dortmund, Dortmund, Germany
  2. Data Science Lab and iMinds, Universiteit Gent, Gent, Belgium
