# Detecting Group Differences: Mining Contrast Sets

## Abstract

A fundamental task in data analysis is understanding the differences between several contrasting groups. These groups can represent different classes of objects, such as male or female students, or the same group over time, e.g., freshman students from 1993 through 1998. We present the problem of mining contrast sets: conjunctions of attributes and values that differ meaningfully in their distribution across groups. We provide a search algorithm for mining contrast sets with pruning rules that drastically reduce the computational complexity. Once the contrast sets are found, we post-process the results to present a subset that is surprising to the user given what has already been shown. We explicitly control the probability of Type I error (false positives) and guarantee a maximum error rate for the entire analysis by using Bonferroni corrections.
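To make the idea concrete, the following is a minimal sketch of testing candidate contrast sets between two groups and applying a plain Bonferroni correction. It is not the search algorithm or the pruning rules from the paper. The function names, the candidate sets, and all counts are hypothetical, and the use of a Pearson chi-square test on a 2×2 table is an assumption made for illustration.

```python
import math

def chi2_two_groups(count_a, n_a, count_b, n_b):
    """Pearson chi-square statistic for a 2x2 table:
    rows are the two groups, columns are 'contrast set holds' / 'does not hold'."""
    table = [[count_a, n_a - count_a], [count_b, n_b - count_b]]
    total = n_a + n_b
    row_sums = [n_a, n_b]
    col_sums = [count_a + count_b, total - count_a - count_b]
    stat = 0.0
    for i in range(2):
        for j in range(2):
            expected = row_sums[i] * col_sums[j] / total
            if expected > 0:
                stat += (table[i][j] - expected) ** 2 / expected
    return stat

def p_value_df1(stat):
    """Upper-tail p-value of a chi-square variable with 1 degree of freedom."""
    return math.erfc(math.sqrt(stat / 2.0))

def significant_contrast_sets(candidates, n_a, n_b, alpha=0.05):
    """Keep candidates whose p-value falls below alpha divided by the number of tests.
    This is the plain Bonferroni rule, assumed here for illustration."""
    if not candidates:
        return []
    threshold = alpha / len(candidates)
    kept = []
    for name, (count_a, count_b) in candidates.items():
        p = p_value_df1(chi2_two_groups(count_a, n_a, count_b, n_b))
        if p < threshold:
            kept.append((name, p))
    return kept

# Hypothetical counts: how many of 1000 records in each group match each conjunction.
candidates = {
    "major=CS AND year=1": (300, 150),
    "housing=dorm": (510, 490),
}
results = significant_contrast_sets(candidates, n_a=1000, n_b=1000)
for name, p in results:
    print(name, p)
```

Dividing the significance level by the number of tests bounds the chance of any false positive across the whole analysis by `alpha`, at the cost of making each individual test more conservative.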
