Efficient Mining of Top Correlated Patterns Based on Null-Invariant Measures

  • Sangkyum Kim
  • Marina Barsky
  • Jiawei Han
Part of the Lecture Notes in Computer Science book series (LNCS, volume 6912)

Abstract

Mining strong correlations from transactional databases often leads to more meaningful results than mining association rules. In such mining, null (transaction)-invariance is an important property of the correlation measures. Unfortunately, some useful null-invariant measures such as Kulczynski and Cosine, which can discover correlations even for the very unbalanced cases, lack the (anti)-monotonicity property. Thus, they could only be applied to frequent itemsets as the post-evaluation step. For large datasets and for low supports, this approach is computationally prohibitive. This paper presents new properties for all known null-invariant measures. Based on these properties, we develop efficient pruning techniques and design the Apriori-like algorithm NICoMiner for mining strongly correlated patterns directly. We develop both the threshold-bounded and the top-k variations of the algorithm, where top-k is used when the optimal correlation threshold is not known in advance and to give user control over the output size. We test NICoMiner on real-life datasets from different application domains, using Cosine as an example of the null-invariant correlation measure. We show that NICoMiner outperforms support-based approach more than an order of magnitude, and that it is very useful for discovering top correlations in itemsets with low support.

Keywords

Association Rule Pattern Mining Frequent Itemsets Mining Association Rule Market Basket 
These keywords were added by machine and not by the authors. This process is experimental and the keywords may be updated as the learning algorithm improves.

References

  1. 1.
  2. 2.
    Agrawal, R., Imieliński, T., Swami, A.: Mining association rules between sets of items in large databases. In: SIGMOD (1993)Google Scholar
  3. 3.
    Brin, S., Motwani, R., Silverstein, C.: Beyond market baskets: generalizing association rules to correlations. In: SIGMOD (1997)Google Scholar
  4. 4.
    Campagna, A., Pagh, R.: Finding associations and computing similarity via biased pair sampling. In: Perner, P. (ed.) ICDM 2009. LNCS, vol. 5633, Springer, Heidelberg (2009)Google Scholar
  5. 5.
    Cohen, J., West, S.G., Cohen, P., Aiken, L.: Applied Multiple Regression Correlation Analysis for the Behavioral Sciences, 3rd edn. Lawrence Erlbaum Assoc Inc., Mahwah (2002)Google Scholar
  6. 6.
    Hahsler, M.: Groceries dataset (2007), http://rss.acs.unt.edu/Rdoc/library/arules/data/
  7. 7.
    Hahsler, M., Hornik, K., Reutterer, T.: Implications of probabilistic data modeling for mining association rules. In: Proceedings of the 29th Annual Conference of the Gesellschaft für Klassifikation (2006)Google Scholar
  8. 8.
    Han, J., Kamber, M.: Data Mining: Concepts and Techniques, 2nd edn. Morgan Kaufmann, San Francisco (2006)MATHGoogle Scholar
  9. 9.
    Kuo, W.P., Jenssen, T.-K., Butte, A.J., Ohno-Machado, L., Kohane, I.S.: Analysis of matched mrna measurements from two different microarray technologies. Bioinformatics 18(3), 405–412 (2002)CrossRefGoogle Scholar
  10. 10.
    Lee, Y.-K., Kim, W.-Y., Cai, Y.D., Han, J.: Comine: Efficient mining of correlated patterns. In: ICDM (2003)Google Scholar
  11. 11.
    Omiecinski, E.R.: Alternative interest measures for mining associations in databases. IEEE Trans. on Knowl. and Data Eng. 15, 57–69 (2003)CrossRefGoogle Scholar
  12. 12.
    Redmond, M.: Communities and crime dataset (2009), http://archive.ics.uci.edu/ml/datasets/Communities+and+Crime
  13. 13.
    Redmond, M.A., Baveja, A.: A data-driven software tool for enabling cooperative information sharing among police departments. European Journal of Operational Research 141, 660–678 (2002)CrossRefMATHGoogle Scholar
  14. 14.
    Srikant, R., Agrawal, R.: Mining generalized association rules. In: VLDB (1995)Google Scholar
  15. 15.
    Sun, Y., Han, J., Gao, J., Yu, Y.: iTopicModel: Information network-integrated topic modeling. In: Perner, P. (ed.) ICDM 2009. LNCS, vol. 5633, Springer, Heidelberg (2009)Google Scholar
  16. 16.
    Tan, P.-N., Kumar, V., Srivastava, J.: Selecting the right interestingness measure for association patterns. In: KDD (2002)Google Scholar
  17. 17.
    Tan, P.-N., Steinbach, M., Kumar, V.: Introduction to Data Mining, 1st edn. Addison-Wesley Longman Publishing Co., Inc., Amsterdam (2005)Google Scholar
  18. 18.
    von Storch, H., Zwiers, F.W.: Statistical analysis in climate research. Cambridge University Press, Cambridge (2002)Google Scholar
  19. 19.
    Wu, T., Chen, Y., Han, J.: Re-examination of interestingness measures in pattern mining: a unified framework. Data Min. Knowl. Discov. 21(3), 371–397 (2010)MathSciNetCrossRefGoogle Scholar
  20. 20.
    Xin, D., Han, J., Yan, X., Cheng, H.: Mining compressed frequent-pattern sets. In: VLDB (2005)Google Scholar
  21. 21.
    Xiong, H., He, X., Ding, C.H.Q., Zhang, Y., Kumar, V., Holbrook, S.R.: Identification of functional modules in protein complexes via hyperclique pattern discovery. In: Pacific Symposium on Biocomputing (2005)Google Scholar
  22. 22.
    Xiong, H., Zhou, W., Brodie, M., Ma, S.: Top-k φ correlation computation. INFORMS Journal on Computing 20(4), 539–552 (2008)MathSciNetCrossRefMATHGoogle Scholar
  23. 23.
    Zhu, S., Wu, J., Xiong, H., Xia, G.: Scaling up top-k cosine similarity search. Data Knowl. Eng. 70, 60–83 (2011)CrossRefGoogle Scholar

Copyright information

© Springer-Verlag Berlin Heidelberg 2011

Authors and Affiliations

  • Sangkyum Kim
    • 1
  • Marina Barsky
    • 2
  • Jiawei Han
    • 1
  1. 1.University of Illinois at Urbana-ChampaignUrbanaUSA
  2. 2.Simon Fraser UniversityCanada

Personalised recommendations