Data Mining and Knowledge Discovery, Volume 5, Issue 3, pp 213–246

Detecting Group Differences: Mining Contrast Sets

  • Stephen D. Bay
  • Michael J. Pazzani
Article

Abstract

A fundamental task in data analysis is understanding the differences between several contrasting groups. These groups can represent different classes of objects, such as male or female students, or the same group over time, e.g. freshman students in 1993 through 1998. We present the problem of mining contrast sets: conjunctions of attributes and values that differ meaningfully in their distribution across groups. We provide a search algorithm for mining contrast sets with pruning rules that drastically reduce the computational complexity. Once the contrast sets are found, we post-process the results to present a subset that is surprising to the user given what we have already shown. We explicitly control the probability of Type I error (false positives) and guarantee a maximum error rate for the entire analysis by using Bonferroni corrections.

Keywords: data mining, contrast sets, change detection, association rules
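
To make the idea in the abstract concrete, the following is a minimal sketch, not the authors' algorithm. It checks a single candidate contrast set (a conjunction of attribute-value pairs) against two groups: it compares the set's support in each group, runs a chi-square test on the resulting 2x2 table, and applies a plain Bonferroni division of the significance level by the number of tests. The function names, the `min_diff` and `alpha` thresholds, the restriction to two groups, and the toy data are all assumptions made for illustration; the paper's search, pruning rules, and exact correction scheme are not reproduced here.

```python
import math


def matches(row, contrast_set):
    """True if the row has every attribute=value pair in the contrast set."""
    return all(row.get(attr) == val for attr, val in contrast_set.items())


def support(rows, contrast_set):
    """Fraction of rows in one group that match the contrast set."""
    if not rows:
        return 0.0
    return sum(matches(r, contrast_set) for r in rows) / len(rows)


def chi2_two_groups(rows_a, rows_b, contrast_set):
    """Chi-square statistic and p-value for the 2x2 table
    (matches / does not match) x (group A / group B).
    Assumption: exactly two groups, so 1 degree of freedom."""
    hit_a = sum(matches(r, contrast_set) for r in rows_a)
    hit_b = sum(matches(r, contrast_set) for r in rows_b)
    table = [[hit_a, hit_b],
             [len(rows_a) - hit_a, len(rows_b) - hit_b]]
    col_tot = [len(rows_a), len(rows_b)]
    row_tot = [sum(r) for r in table]
    n = sum(col_tot)
    stat = 0.0
    for i in range(2):
        for j in range(2):
            expected = row_tot[i] * col_tot[j] / n
            if expected > 0:
                stat += (table[i][j] - expected) ** 2 / expected
    # Survival function of chi-square with 1 d.o.f.: erfc(sqrt(x / 2)).
    p_value = math.erfc(math.sqrt(stat / 2))
    return stat, p_value


def is_contrast_set(rows_a, rows_b, contrast_set, n_tests,
                    alpha=0.05, min_diff=0.1):
    """Hypothetical decision rule: the support difference must be large
    (>= min_diff) and significant at the Bonferroni-adjusted level
    alpha / n_tests. Both thresholds are illustrative assumptions."""
    diff = abs(support(rows_a, contrast_set) - support(rows_b, contrast_set))
    _, p_value = chi2_two_groups(rows_a, rows_b, contrast_set)
    return diff >= min_diff and p_value < alpha / n_tests


# Toy data (invented): 60 of 100 rows in group A hold a PhD, 20 of 100 in B.
group_a = [{"degree": "PhD"}] * 60 + [{"degree": "BSc"}] * 40
group_b = [{"degree": "PhD"}] * 20 + [{"degree": "BSc"}] * 80
candidate = {"degree": "PhD"}

stat, p_value = chi2_two_groups(group_a, group_b, candidate)
found = is_contrast_set(group_a, group_b, candidate, n_tests=50)
```

Dividing `alpha` by the number of candidates tested is the simplest form of Bonferroni correction; it bounds the chance of any false positive across the whole analysis at `alpha`, at the cost of becoming stricter as more candidates are examined.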

Copyright information

© Kluwer Academic Publishers 2001

Authors and Affiliations

  • Stephen D. Bay (1)
  • Michael J. Pazzani (2)
  1. Department of Information and Computer Science, University of California, Irvine, USA
  2. Department of Information and Computer Science, University of California, Irvine, USA
