Detecting Group Differences: Mining Contrast Sets

Data Mining and Knowledge Discovery, volume 5, pages 213–246 (2001)

Abstract

A fundamental task in data analysis is understanding the differences between several contrasting groups. These groups can represent different classes of objects, such as male or female students, or the same group over time, e.g., freshman students from 1993 through 1998. We present the problem of mining contrast sets: conjunctions of attributes and values that differ meaningfully in their distribution across groups. We provide a search algorithm for mining contrast sets with pruning rules that drastically reduce the computational complexity. Once the contrast sets are found, we post-process the results to present a subset that is surprising to the user given what we have already shown. We explicitly control the probability of Type I error (false positives) and guarantee a maximum error rate for the entire analysis by using Bonferroni corrections.
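To make the definition concrete, the following is a minimal sketch of checking a single candidate contrast set across two groups. It is an illustration under stated assumptions, not the search algorithm of the article. The abstract does not name the significance test, so a Pearson chi-square test on a 2×2 table is assumed here. The Bonferroni step is the plain form (divide the significance level by the number of tests). The function names, the `min_diff` threshold, and the toy data are all hypothetical.

```python
import math

def count_matches(rows, cset):
    """Number of rows in which every attribute=value pair of cset holds."""
    return sum(all(r.get(a) == v for a, v in cset.items()) for r in rows)

def chi2_pvalue_2x2(hits_a, n_a, hits_b, n_b):
    """Pearson chi-square p-value (1 degree of freedom) for a 2x2 table.

    Rows are the two groups; columns are 'matches the set' / 'does not'.
    Assumed test: the abstract does not specify which test is used.
    """
    total = n_a + n_b
    observed = [[hits_a, n_a - hits_a], [hits_b, n_b - hits_b]]
    row_sums = [n_a, n_b]
    col_sums = [hits_a + hits_b, total - hits_a - hits_b]
    stat = 0.0
    for i in range(2):
        for j in range(2):
            expected = row_sums[i] * col_sums[j] / total
            if expected > 0:
                stat += (observed[i][j] - expected) ** 2 / expected
    # Survival function of chi-square with 1 d.o.f.: P(Z^2 > stat).
    return math.erfc(math.sqrt(stat / 2.0))

def check_contrast_set(group_a, group_b, cset, n_tests,
                       alpha=0.05, min_diff=0.05):
    """Return (support difference, p-value, accepted?) for one candidate.

    n_tests is the total number of candidates tested in the analysis;
    alpha / n_tests is the plain Bonferroni-corrected threshold.
    min_diff is a hypothetical cutoff for a 'meaningful' difference.
    """
    if not group_a or not group_b:
        raise ValueError("both groups must be non-empty")
    hits_a = count_matches(group_a, cset)
    hits_b = count_matches(group_b, cset)
    diff = abs(hits_a / len(group_a) - hits_b / len(group_b))
    p = chi2_pvalue_2x2(hits_a, len(group_a), hits_b, len(group_b))
    accepted = diff >= min_diff and p < alpha / n_tests
    return diff, p, accepted

# Toy data (invented for illustration): two student cohorts.
candidate = {"major": "CS", "housing": "on-campus"}
cohort_1993 = [dict(candidate)] * 40 + [{"major": "CS", "housing": "off-campus"}] * 60
cohort_1998 = [dict(candidate)] * 10 + [{"major": "Math", "housing": "on-campus"}] * 90

diff, p, accepted = check_contrast_set(cohort_1993, cohort_1998, candidate, n_tests=100)
print(f"support difference={diff:.2f}, p={p:.2e}, accepted={accepted}")
```

Dividing `alpha` by the number of candidates bounds the chance of any false positive across the whole analysis by `alpha`. The cost is reduced power when many candidates are tested.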

Cite this article

Bay, S.D., Pazzani, M.J. Detecting Group Differences: Mining Contrast Sets. Data Mining and Knowledge Discovery 5, 213–246 (2001). https://doi.org/10.1023/A:1011429418057
