Correlation Range Query

  • Wenjun Zhou
  • Hao Zhang
Part of the Lecture Notes in Computer Science book series (LNCS, volume 7923)

Abstract

Efficient correlation computation has been an active research area in data mining. Given a large dataset and a specified query item, we are interested in finding the items in the dataset whose correlation with the query item falls within a certain range. This problem, known as the correlation range query (CRQ), is a common task in many application domains. In this paper, we identify piecewise monotone properties of the upper and lower bounds of the φ coefficient, and propose an efficient correlation range query algorithm, called CORAQ. The CORAQ algorithm effectively prunes many items without computing their actual correlation coefficients with the query item, while still guaranteeing that the query results are complete and correct. Experiments with large benchmark datasets show that this algorithm is much faster than its brute-force alternative and scales well with large datasets.
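To make the query concrete, the following is a minimal sketch of the brute-force baseline that the abstract compares against, not of CORAQ itself. It assumes binary market-basket data represented as a list of Python sets, and it computes the φ coefficient from co-occurrence counts using the standard formula φ = (n·n_ab − n_a·n_b) / sqrt(n_a·n_b·(n − n_a)·(n − n_b)). The function names `phi` and `brute_force_crq`, the data layout, and the choice to treat an undefined φ as 0 are illustrative assumptions, not taken from the paper.

```python
from math import sqrt


def phi(n, n_a, n_b, n_ab):
    """phi coefficient of two binary items, computed from counts.

    n    : total number of transactions
    n_a  : transactions containing item a
    n_b  : transactions containing item b
    n_ab : transactions containing both a and b
    """
    denom = n_a * n_b * (n - n_a) * (n - n_b)
    if denom == 0:
        # phi is undefined when an item occurs in no transaction or in all
        # of them; returning 0.0 here is an assumption made for this sketch.
        return 0.0
    return (n * n_ab - n_a * n_b) / sqrt(denom)


def brute_force_crq(transactions, query, low, high):
    """Return {item: phi} for every item whose phi with `query` is in [low, high].

    This computes the exact coefficient for every candidate item, which is
    the cost a pruning algorithm such as CORAQ is designed to avoid.
    """
    n = len(transactions)
    n_q = sum(1 for t in transactions if query in t)
    candidates = set().union(*transactions) - {query}

    result = {}
    for item in candidates:
        n_i = sum(1 for t in transactions if item in t)
        n_qi = sum(1 for t in transactions if query in t and item in t)
        value = phi(n, n_q, n_i, n_qi)
        if low <= value <= high:
            result[item] = value
    return result


if __name__ == "__main__":
    # Toy data, invented purely for illustration.
    baskets = [{"a", "b"}, {"a", "b"}, {"a"}, {"c"}]
    print(brute_force_crq(baskets, "a", 0.5, 1.0))
    print(brute_force_crq(baskets, "a", -1.0, -0.9))
```

Each call scans every transaction once per candidate item, so the cost grows with the number of items times the number of transactions. A bound-based approach, as described in the abstract, would instead discard candidates whose upper and lower bounds on φ already place them outside the requested range.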

Keywords

Association Mining, Correlation Computing, φ Coefficient

Copyright information

© Springer-Verlag Berlin Heidelberg 2013

Authors and Affiliations

  • Wenjun Zhou (1, 2)
  • Hao Zhang (1, 2)
  1. Statistics, Operations, and Management Science Department, University of Tennessee, Knoxville, USA
  2. Department of Electrical Engineering and Computer Science, University of Tennessee, Knoxville, USA
