Detecting Group Differences: Mining Contrast Sets

Bay, Stephen D.; Pazzani, Michael J.

doi:10.1023/A:1011429418057

Published: July 2001

Data Mining and Knowledge Discovery, Volume 5, pages 213–246 (2001)


Abstract

A fundamental task in data analysis is understanding the differences between several contrasting groups. These groups can represent different classes of objects, such as male or female students, or the same group over time, e.g., freshman students in 1993 through 1998. We present the problem of mining contrast sets: conjunctions of attributes and values that differ meaningfully in their distribution across groups. We provide a search algorithm for mining contrast sets with pruning rules that drastically reduce the computational complexity. Once the contrast sets are found, we post-process the results to present a subset that is surprising to the user given what we have already shown. We explicitly control the probability of Type I error (false positives) and guarantee a maximum error rate for the entire analysis by using Bonferroni corrections.
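To make the idea of a contrast set concrete, the following is a minimal sketch and not the authors' algorithm. It checks a fixed list of candidate conjunctions against two groups and keeps those whose support differs significantly. Several pieces are assumptions that the abstract does not specify: the use of a Pearson chi-square test on a 2 × 2 table, the `min_diff` support threshold, a single Bonferroni cutoff of `alpha / len(candidates)`, all function names, and the toy student records, which are invented for illustration. The search and pruning rules mentioned in the abstract are not modeled.

```python
import math

def count_matches(rows, contrast_set):
    """Count rows that satisfy every (attribute, value) pair in the conjunction."""
    return sum(all(row.get(attr) == val for attr, val in contrast_set) for row in rows)

def chi2_2x2(a, b, c, d):
    """Pearson chi-square statistic for the 2 x 2 table [[a, b], [c, d]]."""
    n = a + b + c + d
    denom = (a + b) * (c + d) * (a + c) * (b + d)
    if denom == 0:
        return 0.0  # a zero margin carries no evidence of a difference
    return n * (a * d - b * c) ** 2 / denom

def p_value_df1(stat):
    """Upper-tail probability of a chi-square variable with 1 degree of freedom."""
    return math.erfc(math.sqrt(stat / 2.0))

def find_contrast_sets(group_a, group_b, candidates, alpha=0.05, min_diff=0.1):
    """Return candidates whose support differs between the two groups.

    Assumption: one Bonferroni cutoff, alpha divided by the number of tests.
    """
    cutoff = alpha / len(candidates)
    found = []
    for cs in candidates:
        hits_a = count_matches(group_a, cs)
        hits_b = count_matches(group_b, cs)
        supp_a = hits_a / len(group_a)
        supp_b = hits_b / len(group_b)
        stat = chi2_2x2(hits_a, len(group_a) - hits_a,
                        hits_b, len(group_b) - hits_b)
        p = p_value_df1(stat)
        if abs(supp_a - supp_b) >= min_diff and p < cutoff:
            found.append((cs, supp_a, supp_b, p))
    return found

# Toy records, invented purely for illustration (hypothetical attributes).
group_a = ([{"major": "cs", "housing": "dorm"}] * 4
           + [{"major": "cs", "housing": "off"}] * 4
           + [{"major": "bio", "housing": "dorm"}]
           + [{"major": "bio", "housing": "off"}])
group_b = ([{"major": "cs", "housing": "dorm"}]
           + [{"major": "cs", "housing": "off"}]
           + [{"major": "bio", "housing": "dorm"}] * 4
           + [{"major": "bio", "housing": "off"}] * 4)

candidates = [
    (("major", "cs"),),
    (("housing", "dorm"),),
    (("major", "cs"), ("housing", "dorm")),
]

found = find_contrast_sets(group_a, group_b, candidates)
for cs, supp_a, supp_b, p in found:
    print(cs, supp_a, supp_b, p)
```

Dividing `alpha` by the number of tests is the simplest form of Bonferroni correction: it bounds the chance of any false positive across all candidates by `alpha`, at the cost of making each individual test stricter.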



Author information

Authors and Affiliations

Department of Information and Computer Science, University of California, Irvine, CA, 92697, USA
Stephen D. Bay
Department of Information and Computer Science, University of California, Irvine, CA, 92697, USA
Michael J. Pazzani


About this article

Cite this article

Bay, S.D., Pazzani, M.J. Detecting Group Differences: Mining Contrast Sets. Data Mining and Knowledge Discovery 5, 213–246 (2001). https://doi.org/10.1023/A:1011429418057


Issue Date: July 2001
DOI: https://doi.org/10.1023/A:1011429418057
