Exceptional Model Mining

Duivesteijn, Wouter; Feelders, Ad J.; Knobbe, Arno

doi:10.1007/s10618-015-0403-4

Supervised descriptive local pattern mining with complex target concepts

Published: 04 February 2015

Data Mining and Knowledge Discovery, volume 30, pages 47–98 (2016)

Abstract

Finding subsets of a dataset that somehow deviate from the norm, i.e. where something interesting is going on, is a classical Data Mining task. In traditional local pattern mining methods, such deviations are measured in terms of a relatively high occurrence (frequent itemset mining), or an unusual distribution for one designated target attribute (common use of subgroup discovery). These, however, do not encompass all forms of “interesting”. To capture a more general notion of interestingness in subsets of a dataset, we develop Exceptional Model Mining (EMM). This is a supervised local pattern mining framework, where several target attributes are selected, and a model over these targets is chosen to be the target concept. Then, we strive to find subgroups: subsets of the dataset that can be described by a few conditions on single attributes. Such subgroups are deemed interesting when the model over the targets on the subgroup is substantially different from the model on the whole dataset. For instance, we can find subgroups where two target attributes have an unusual correlation, a classifier has a deviating predictive performance, or a Bayesian network fitted on several target attributes has an exceptional structure. We give an algorithmic solution for the EMM framework, and analyze its computational complexity. We also discuss some illustrative applications of EMM instances, including using the Bayesian network model to identify meteorological conditions under which food chains are displaced, and using a regression model to find the subset of households in the Chinese province of Hunan that do not follow the general economic law of demand.
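
To make the framework concrete, the following is a minimal sketch of one EMM instance mentioned above: the model over two target attributes is their Pearson correlation, and a subgroup counts as interesting when its correlation differs strongly from the correlation on the whole dataset. It is not the authors' algorithm. The function names, the absolute-difference quality measure, the restriction to a single equality condition per description, and the toy rows are all assumptions made for illustration.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation of two equal-length sequences (0.0 if undefined)."""
    n = len(xs)
    if n < 2:
        return 0.0
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    if sxx == 0 or syy == 0:
        return 0.0
    return sxy / sqrt(sxx * syy)


def find_exceptional_subgroups(rows, descriptors, targets, min_size=3, top_k=3):
    """Rank single-condition subgroups by how far the correlation of the two
    targets inside the subgroup deviates from the correlation on all rows."""
    tx, ty = targets
    overall = pearson([r[tx] for r in rows], [r[ty] for r in rows])
    results = []
    for attr in descriptors:
        for value in sorted({r[attr] for r in rows}):
            # The subgroup is every row satisfying the description attr = value.
            cover = [r for r in rows if r[attr] == value]
            if len(cover) < min_size:
                continue  # skip subgroups too small to fit the model on
            local = pearson([r[tx] for r in cover], [r[ty] for r in cover])
            # Assumed quality measure: absolute difference in correlation.
            results.append((abs(local - overall), f"{attr} = {value}", len(cover)))
    results.sort(key=lambda item: -item[0])
    return overall, results[:top_k]


# Made-up toy data: demand falls with price in the north, rises in the south.
north = [(1, 12), (2, 10), (3, 8), (4, 6), (5, 4), (6, 2)]
south = [(1, 1), (2, 2), (3, 3)]
rows = []
for i, (price, demand) in enumerate(north + south):
    rows.append({
        "region": "north" if i < len(north) else "south",
        "urban": "yes" if i % 2 == 0 else "no",
        "price": price,
        "demand": demand,
    })

overall, top = find_exceptional_subgroups(
    rows, descriptors=["region", "urban"], targets=("price", "demand")
)
print(f"correlation on whole dataset: {overall:.3f}")
for quality, description, size in top:
    print(f"{description:16s} size={size} quality={quality:.3f}")
```

On this toy data the highest-ranked description is `region = south`, the only subgroup in which price and demand move together. A fuller implementation would search over conjunctions of conditions and would likely correct the quality measure for subgroup size; both are left out here to keep the sketch short.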

Notes

We consider the exact search strategy to be a parameter of the algorithm.
When the description language at hand is very expressive, and the dataset contains many numeric attributes, one can imagine that for every subset of the dataset at least one corresponding description exists.
http://cran.r-project.org.
Available from the Journal of Applied Econometrics Data Archive at http://econ.queensu.ca/jae/.

References

Agresti A (1990) Categorical data analysis. Wiley, New York
Aidt T, Tzannatos Z (2002) Unions and collective bargaining. The World Bank, Washington, DC
Anglin PM, Gençay R (1996) Semiparametric estimation of a hedonic price function. J Appl Econ 11(6):633–648
Atzmüller M, Lemmerich F (2009) Fast subgroup discovery for continuous target concepts. In: Proceedings of ISMIS, pp 35–44
Bay SD, Pazzani MJ (2001) Detecting group differences: mining contrast sets. Data Min Knowl Discov 5(3):213–246
Blockeel H, De Raedt L, Ramon J (1998) Top-down induction of clustering trees. In: Proceedings of ICML, pp 55–63
Boley M, Grosskreutz H (2009) Non-redundant subgroup discovery using a closure system. In: Proceedings of ECML/PKDD, vol 1, pp 179–194
Breiman L, Friedman JH, Olshen RA, Stone CJ (1984) Classification and regression trees. Wadsworth & Brooks/Cole Advanced Books & Software, Monterey
de Campos LM, Fernández-Luna JM, Huete JF (2004) Bayesian networks and information retrieval: an introduction to the special issue. Inf Process Manag 40(5):727–733
Carmona CJ, González P, del Jesus MJ, Herrera F (2010) NMEEF-SD: non-dominated multiobjective evolutionary algorithm for extracting fuzzy rules in subgroup discovery. IEEE Trans Fuzzy Syst 18(5):958–970
Chao C, Velicer C, Slezak JM, Jacobsen SJ (2009) Correlates for completion of 3-dose regimen of HPV vaccine in female members of a managed care organization. Mayo Clin Proc 84(10):864–870
Cook RD (1977) Detection of influential observation in linear regression. Technometrics 19(1):15–18
Cook RD, Weisberg S (1980) Characterizations of an empirical influence function for detecting influential cases in regression. Technometrics 22(4):495–508
Cook RD, Weisberg S (1982) Residuals and influence in regression. Chapman & Hall, London
Costanigro M, Mittelhammer RC, McCluskey JJ (2009) Estimating class-specific parametric models under class uncertainty: local polynomial regression clustering in an hedonic analysis of wine markets. J Appl Econ 24:1117–1135
Davis GA (2003) Bayesian reconstruction of traffic accidents. Law Probab Risk 2:69–89
Díez FJ, Mira J, Iturralde E, Zubillaga S (1997) DIAVAL, a Bayesian expert system for echocardiography. Artif Intell Med 10:59–73
Dong G, Li J (1999) Efficient mining of emerging patterns: discovering trends and differences. In: Proceedings of KDD, pp 43–52
Dougherty C (2011) Introduction to econometrics, 4th edn. Oxford University Press, Oxford
Duivesteijn W, Feelders A, Knobbe AJ (2012) Different slopes for different folks—mining for exceptional regression models with Cook’s distance. In: Proceedings of KDD, pp 868–876
Duivesteijn W, Knobbe AJ, Feelders A, van Leeuwen M (2010) Subgroup discovery meets Bayesian networks—an exceptional model mining approach. In: Proceedings of ICDM, pp 158–167
Duivesteijn W, Loza Mencía E, Fürnkranz J, Knobbe AJ (2012) Multi-label LeGo—enhancing multi-label classifiers with local patterns. In: Proceedings of IDA, pp 114–125
Friedman J, Fisher N (1999) Bump-hunting in high-dimensional data. Stat Comput 9(2):123–143
Friedman N, Linial M, Nachman I, Pe’er D (2000) Using Bayesian networks to analyze expression data. J Comput Biol 7(3/4):601–620
Galbrun E, Miettinen P (2012) From black and white to full color: extending redescription mining outside the Boolean world. Stat Anal Data Min 5(4):284–303
Garriga GC, Heikinheimo H, Seppänen JK (2007) Cross-mining binary and numerical attributes. In: Proceedings of ICDM, pp 481–486
Gallo A, Miettinen P, Mannila H (2008) Finding subgroups having several descriptions: algorithms for redescription mining. In: Proceedings of SDM, pp 334–345
Gentleman JF, Wilk MB (1975) Detecting outliers II: supplementing the direct analysis of residuals. Biometrics 31:387–410
Goodman LA (1970) The multivariate analysis of qualitative data: interaction among multiple classifications. J Am Stat Assoc 65:226–256
Grosskreutz H, Rüping S (2009) On subgroup discovery in numerical domains. Data Min Knowl Discov 19(2):210–226
Hand DJ, Adams NM, Bolton RJ (2002) Pattern detection and discovery, vol 2447. Lecture notes in computer science, Springer, Berlin
Heckerman D, Geiger D, Chickering DM (1995) Learning Bayesian networks: the combination of knowledge and statistical data. Mach Learn 20:197–243
Heikinheimo H, Fortelius M, Eronen J, Mannila H (2007) Biogeography of European land mammals shows environmentally distinct and spatially coherent clusters. J Biogeogr 34(6):1053–1064
Herrera F, Carmona CJ, González P, del Jesus MJ (2011) An overview on subgroup discovery: foundations and applications. Knowl Inf Syst 29(3):495–525
Hochberg Y, Tamhane A (1987) Multiple comparison procedures. Wiley, New York
Jensen RT, Miller NH (2008) Giffen behavior and subsistence consumption. Am Econ Rev 98(4):1553–1577
del Jesús MJ, González P, Herrera F, Mesonero M (2007) Evolutionary fuzzy rule induction process for subgroup discovery: a case study in marketing. IEEE Trans Fuzzy Syst 15(4):578–592
Jorge AM, Azevedo PJ, Pereira F (2006) Distribution rules with numeric attributes of interest. In: Proceedings of PKDD, pp 247–258
Klösgen W (1996) Explora: a multipattern and multistrategy discovery assistant. In: Advances in knowledge discovery and data mining. pp 249–271
Klösgen W (1998) Deviation and association patterns for subgroup mining in temporal, spatial, and textual data bases. In: Rough sets and current trends in computing. Springer, pp 1–18
Klösgen W (1999) Applications and research problems of subgroup mining. In: Proceedings of ISMIS, pp 1–15
Klösgen W (2002) Subgroup discovery. In: Handbook of data mining and knowledge discovery, chap. 16.3. Oxford University Press, New York
Knobbe AJ, Feelders A, Leman D (2012) Exceptional model mining. In: Data mining: foundations and intelligent paradigms, intelligent systems reference library, vol 24, pp 183–198
Knuth DE (1998) The art of computer programming, vol. 3: sorting and searching, 2nd edn. Addison-Wesley, Reading
Kocev D, Vens C, Struyf J, Džeroski S (2013) Tree ensembles for predicting structured outputs. Pattern Recogn 46(3):817–833
Kohavi R (1995) The power of decision tables. In: Proceedings of ECML, pp 174–189
van de Koppel E, Slavkov I, Astrahantseff K, Schramm A, Schulte J, Vandesompele J, de Jong E, Dzeroski S, Knobbe AJ (2007) Knowledge discovery in neuroblastoma-related biological data. In: Data mining in functional genomics and proteomics workshop at PKDD 2007, Warsaw, Poland, pp 45–56
Kralj Novak P, Lavrač N, Webb GI (2009) Supervised descriptive rule discovery: a unifying survey of contrast set, emerging pattern and subgroup mining. J Mach Learn Res 10:377–403
Kriegel H-P, Kröger P, Schubert E, Zimek A (2012) Outlier detection in arbitrarily oriented subspaces. In: Proceedings of ICDM, pp 379–388
Lavrač N, Flach P, Zupan B (1999) Rule evaluation measures: a unifying view. In: Proceedings of the ninth international workshop on inductive logic programming. Lecture notes in artificial intelligence, vol 1634, pp 174–185
Lavrač N, Kavšek B, Flach PA, Todorovski L (2004) Subgroup discovery with CN2-SD. J Mach Learn Res 5:153–188
van Leeuwen M (2010) Maximal exceptions with minimal descriptions. Data Min Knowl Discov 21(2):259–276
van Leeuwen M, Knobbe AJ (2011) Non-redundant subgroup discovery in large and complex data. In: Proceedings of ECML/PKDD, vol 3, pp 459–474
van Leeuwen M, Knobbe AJ (2012) Diverse subgroup set discovery. Data Min Knowl Discov 25(2):208–242
Leman D, Feelders A, Knobbe AJ (2008) Exceptional model mining. In: Proceedings of ECML/PKDD, vol 2, pp 1–16
Lemmerich F, Becker M, Atzmüller M (2012) Generic pattern trees for exhaustive exceptional model mining. In: Proceedings of ECML/PKDD, vol 2, pp 277–292
Mampaey M, Nijssen S, Feelders A, Knobbe AJ (2012) Efficient algorithms for finding richer subgroup descriptions in numeric and nominal data. In: Proceedings of ICDM, pp 499–508
Marshall A (1895) Principles of economics. MacMillan and co, New York
Meeng M, Knobbe AJ (2011) Flexible enrichment with Cortana—Software Demo. In: Proceedings of Benelearn, pp 117–119
Mitchell-Jones T et al (1999) The atlas of European mammals. Poyser natural history. Poyser, London
Moore D, McCabe G (1993) Introduction to the practice of statistics. WH Freeman and Company, New York
Morik K, Boulicaut JF, Siebes A (2005) Local pattern detection. Lecture notes in computer science, vol 3539, Springer, Heidelberg
Neil M, Fenton N, Tailor M (2005) Using Bayesian networks to model expected and unexpected operational losses. Risk Anal 25(4):963–972
Neter J, Kutner M, Nachtsheim CJ, Wasserman W (1966) Applied linear statistical models. WCB McGraw-Hill, Boston
Paine RT (1966) Food web complexity and species diversity. Am Nat 100(910):65–75
Ramakrishnan N, Kumar D, Mishra B, Potts M, Helm RF (1995) Turning CARTwheels: an alternating algorithm for mining redescriptions. In: Proceedings of KDD, pp 837–844
Rezende L (2008) Econometrics of auctions by least squares. J Appl Econ 23:925–948
Scholz M (2005) Knowledge-based sampling for subgroup discovery. In: Morik K, Boulicaut JF, Siebes A (eds) Local pattern detection. Lecture notes in computer science, vol 3539, Springer, Heidelberg, pp 171–189
Schubert E, Wolfe J, Tarnopolsky A (2004) Spectral centroid and timbre in complex, multiple instrumental textures. In: Proceedings of 8th international conference on music perception & cognition, pp 654–657
Siebes A (1995) Data surveying: foundations of an inductive query language. In: Proceedings of KDD, pp 269–274
Stengos T, Zacharias E (2006) Intertemporal pricing and price discrimination: a semiparametric hedonic analysis of the personal computer market. J Appl Econ 21:371–386
Trohidis K, Tsoumakas G, Kalliris G, Vlahavas IP (2008) Multi-label classification of music into emotions. In: Proceedings of 9th international conference on music information retrieval, pp 325–330
Umek L, Zupan B (2011) Subgroup discovery in data sets with multi-dimensional responses. Intell Data Anal 15(4):533–549
Verma T, Pearl J (1990) Equivalence and synthesis of causal models. In: Proceedings of UAI, pp 255–270
Whittaker J (1990) Graphical models in applied multivariate statistics. Wiley, New York
Wrobel S (1997) An algorithm for multi-relational discovery of subgroups. In: Proceedings of PKDD, pp 78–87
Yang G, Le Cam L (2000) Asymptotics in statistics: some basic concepts. Springer, Berlin
Zhang B (2003) Regression clustering. In: Proceedings of ICDM, pp 451–458
Zimmermann A, De Raedt L (2009) Cluster-grouping: from subgroup discovery to clustering. Mach Learn 77(1):125–159

Acknowledgments

This research is supported in part by the Deutsche Forschungsgemeinschaft (DFG) within the Collaborative Research Center SFB 876 “Providing Information by Resource-Constrained Analysis”, project C1, and in part by the Netherlands Organisation for Scientific Research (NWO) under project number 612.065.822 (Exceptional Model Mining).

Author information

Authors and Affiliations

Fakultät für Informatik, LS VIII, Technische Universität Dortmund, Dortmund, Germany
Wouter Duivesteijn
ICS, Utrecht University, Utrecht, the Netherlands
Ad J. Feelders
LIACS, Leiden University, Leiden, the Netherlands
Arno Knobbe

Corresponding author

Correspondence to Wouter Duivesteijn.

Additional information

Responsible editor: M.J. Zaki.

This paper extends the previously published papers (Leman et al. 2008; Duivesteijn et al. 2010, 2012a).

About this article

Cite this article

Duivesteijn, W., Feelders, A.J. & Knobbe, A. Exceptional Model Mining. Data Min Knowl Disc 30, 47–98 (2016). https://doi.org/10.1007/s10618-015-0403-4

Received: 09 August 2013
Accepted: 22 January 2015
Published: 04 February 2015
Issue Date: January 2016
DOI: https://doi.org/10.1007/s10618-015-0403-4

Keywords

Mathematics Subject Classification

H.2.8: Data mining
