Maximal exceptions with minimal descriptions

Abstract

We introduce a new approach to Exceptional Model Mining. Our algorithm, called EMDM, is an iterative method that alternates between Exception Maximisation and Description Minimisation. As a result, it finds maximally exceptional models with minimal descriptions. Exceptional Model Mining was recently introduced by Leman et al. (2008) as a generalisation of Subgroup Discovery. Instead of considering a single target attribute, it allows for multiple ‘model’ attributes on which models are fitted. If the model for a subgroup is substantially different from the model for the complete database, it is regarded as an exceptional model. To measure exceptionality, we propose two information-theoretic measures. One is based on the Kullback–Leibler divergence, the other on Krimp. We show how compression can be used for exception maximisation with these measures, and how classification can be used for description minimisation. Experiments show that our approach efficiently identifies subgroups that are both exceptional and interesting.
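To illustrate the idea behind the first measure, the following is a minimal sketch of the standard Kullback–Leibler divergence between a subgroup's empirical distribution and that of the complete database, for a single categorical model attribute. It is not the exact measure used by EMDM, which the abstract does not specify; the helper names (`empirical`, `kl_divergence`) and the toy data are assumptions made purely for illustration.

```python
import math
from collections import Counter


def empirical(values):
    """Empirical distribution of a list of categorical values."""
    counts = Counter(values)
    n = len(values)
    return {v: c / n for v, c in counts.items()}


def kl_divergence(p, q):
    """D_KL(p || q) in bits; q must be non-zero wherever p is non-zero."""
    return sum(pv * math.log2(pv / q[x]) for x, pv in p.items() if pv > 0)


# Hypothetical toy data: one model attribute over the whole database,
# and the values of that attribute for the rows covered by a subgroup.
database = ["a", "a", "b", "b", "b", "c", "a", "b"]
subgroup = ["a", "a", "a", "b"]

p = empirical(subgroup)   # model for the subgroup
q = empirical(database)   # model for the complete database
score = kl_divergence(p, q)
print(f"exceptionality (KL, bits): {score:.3f}")
```

Because the subgroup is drawn from the database, every value it contains also has non-zero probability in the database, so the divergence is finite. A score of zero means the subgroup is distributed exactly like the whole; larger values indicate a more exceptional subgroup under this simplified measure.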

References

  1. Andritsos P, Tsaparas P, Miller RJ, Sevcik KC (2004) LIMBO: scalable clustering of categorical data. In: Proceedings of the EDBT, pp 124–146

  2. Asuncion A, Newman DJ (2007) UCI machine learning repository. http://www.ics.uci.edu/~mlearn/MLRepository.html

  3. Cohen WW (1995) Fast effective rule induction. In: Proceedings of the ICML’95, pp 115–123

  4. Garriga GC, Heikinheimo H, Seppänen JK (2007) Cross-mining binary and numerical attributes. In: Proceedings of the ICDM’07, pp 481–486

  5. Heikinheimo H, Fortelius M, Eronen J, Mannila H (2007) Biogeography of European land mammals shows environmentally distinct and spatially coherent clusters. J Biogeogr 34(6): 1053–1064

  6. Klösgen W (2002) Subgroup discovery chapter 16.3. Oxford University Press, Oxford

  7. Kullback S, Leibler RA (1951) On information and sufficiency. Ann Math Stat 22(1): 79–86

  8. van Leeuwen M, Vreeken J, Siebes A (2006) Compression picks the item sets that matter. In: Proceedings of the ECML PKDD’06, pp 585–592

  9. van Leeuwen M, Bonchi F, Sigurbjörnsson B, Siebes A (2009) Compressing tags to find interesting media groups. In: Proceedings of the CIKM’09, pp 1147–1156

  10. Leman D, Feelders A, Knobbe A (2008) Exceptional model mining. In: Proceedings of the ECML/PKDD’08, 2:1–16

  11. Mitchell-Jones AJ, Amori G, Bogdanowicz W, Krystufek B, Reijnders PJH, Spitzenberger F, Stubbe M, Thissen JBM, Vohralik V, Zima J (1999) The atlas of European mammals. Academic Press, London

  12. Rissanen J (1978) Modeling by shortest data description. Automatica 14(5): 465–471

  13. Siebes A, Vreeken J, van Leeuwen M (2006) Item sets that compress. In: Proceedings of the SDM’06, pp 393–404

  14. Slonim N, Tishby N (1999) Agglomerative information bottleneck. In: Proceedings of the NIPS’99, pp 617–623

  15. Tsoumakas G, Vilcek J, Spyromitros L (2010) MULAN: a java library for multi-label learning. http://mulan.sourceforge.net/

  16. Umek L, Zupan B, Toplak M, Morin A, Chauchat J-H, Makovec G, Smrke D (2009) Subgroup discovery in data sets with multi-dimensional responses: A method and a case study in traumatology. In: Proceedings of AIME’09, pp 265–274

  17. Warner HR, Toronto AF, Veasey LR, Stephenson R (1961) A mathematical model for medical diagnosis, application to congenital heart disease. J Am Med Assoc 177: 177–184

  18. Witten IH, Frank E (2005) Data mining: practical machine learning tools and techniques, 2nd edn. Morgan Kaufmann, San Francisco

Author information

Corresponding author

Correspondence to Matthijs van Leeuwen.

Additional information

Responsible editors: José L Balcázar, Francesco Bonchi, Aristides Gionis, Michèle Sebag.

Rights and permissions

Open Access This is an open access article distributed under the terms of the Creative Commons Attribution Noncommercial License (https://creativecommons.org/licenses/by-nc/2.0), which permits any noncommercial use, distribution, and reproduction in any medium, provided the original author(s) and source are credited.

About this article

Cite this article

van Leeuwen, M. Maximal exceptions with minimal descriptions. Data Min Knowl Disc 21, 259–276 (2010). https://doi.org/10.1007/s10618-010-0187-5

Keywords

  • Exceptional Model Mining
  • Subgroup Discovery
  • Information theory