Abstract
A fundamental task in data analysis is understanding the differences between several contrasting groups. These groups can represent different classes of objects, such as male or female students, or the same group over time, e.g. freshman students in 1993 through 1998. We present the problem of mining contrast sets: conjunctions of attributes and values that differ meaningfully in their distribution across groups. We provide a search algorithm for mining contrast sets with pruning rules that drastically reduce the computational complexity. Once the contrast sets are found, we post-process the results to present a subset that is surprising to the user given what we have already shown. We explicitly control the probability of Type I error (false positives) and guarantee a maximum error rate for the entire analysis by using Bonferroni corrections.
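To make the notion of a contrast set concrete, the following is a minimal Python sketch, not the search algorithm of the paper. It assumes only two groups, enumerates candidate conjunctions by brute force with no pruning, tests each with a Pearson chi-square test on a 2x2 table, and applies a plain Bonferroni division of the significance level by the number of tests. The function names and the `min_diff`, `max_size`, and `alpha` parameters are hypothetical and chosen for illustration.

```python
import math
from itertools import combinations


def count_hits(rows, cset):
    """Number of rows in which every (attribute, value) pair of cset holds."""
    return sum(all(r.get(a) == v for a, v in cset) for r in rows)


def chi2_pvalue_2x2(hits_a, n_a, hits_b, n_b):
    """Pearson chi-square p-value (1 degree of freedom) for the 2x2 table
    'contrast set true / false' by 'group A / group B'."""
    table = [[hits_a, hits_b], [n_a - hits_a, n_b - hits_b]]
    row_totals = [hits_a + hits_b, (n_a - hits_a) + (n_b - hits_b)]
    col_totals = [n_a, n_b]
    total = n_a + n_b
    stat = 0.0
    for i in range(2):
        for j in range(2):
            expected = row_totals[i] * col_totals[j] / total
            if expected == 0:
                return 1.0  # degenerate table: no evidence of a difference
            stat += (table[i][j] - expected) ** 2 / expected
    # For 1 degree of freedom, P(X > stat) = erfc(sqrt(stat / 2)).
    return math.erfc(math.sqrt(stat / 2))


def find_contrast_sets(group_a, group_b, max_size=2, alpha=0.05, min_diff=0.05):
    """Brute-force sketch: return (cset, support_a, support_b, p_value) for
    every conjunction whose support differs by at least min_diff (assumed
    threshold) and whose p-value passes the Bonferroni-adjusted level."""
    if not group_a or not group_b:
        return []
    items = sorted({(a, v) for r in group_a + group_b for a, v in r.items()})
    candidates = [
        c
        for k in range(1, max_size + 1)
        for c in combinations(items, k)
        if len({a for a, _ in c}) == k  # at most one value per attribute
    ]
    if not candidates:
        return []
    threshold = alpha / len(candidates)  # plain Bonferroni correction
    found = []
    for cset in candidates:
        hits_a, hits_b = count_hits(group_a, cset), count_hits(group_b, cset)
        supp_a, supp_b = hits_a / len(group_a), hits_b / len(group_b)
        p = chi2_pvalue_2x2(hits_a, len(group_a), hits_b, len(group_b))
        if abs(supp_a - supp_b) >= min_diff and p < threshold:
            found.append((cset, supp_a, supp_b, p))
    return found


if __name__ == "__main__":
    # Synthetic data: 'degree' differs between the groups, 'sex' does not.
    def make_group(n_phd):
        return [
            {"degree": "phd" if i < n_phd else "ms", "sex": "f" if i % 2 == 0 else "m"}
            for i in range(100)
        ]

    for cset, sa, sb, p in find_contrast_sets(make_group(80), make_group(20)):
        print(cset, round(sa, 2), round(sb, 2), p)
```

Dividing `alpha` by the total number of candidates tested keeps the chance of any false positive across the whole analysis at or below `alpha`, at the cost of reduced power as the candidate space grows. This is why pruning the search space matters statistically as well as computationally.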
Cite this article
Bay, S.D., Pazzani, M.J. Detecting Group Differences: Mining Contrast Sets. Data Mining and Knowledge Discovery 5, 213–246 (2001). https://doi.org/10.1023/A:1011429418057
DOI: https://doi.org/10.1023/A:1011429418057