Machine Learning, Volume 83, Issue 1, pp 103–131

Checkpoint evolution for volatile correlation computing

Article

Abstract

Given a set of data objects, correlation computing is the problem of efficiently identifying the strongly related ones. Existing studies have focused mainly on static data. However, in many real-world scenarios the input data are dynamic, and analytical results have to be updated continually. There is therefore a critical need for a dynamic solution to volatile correlation computing. To this end, we develop a checkpoint scheme that captures dynamic correlation values by establishing an evolving computation buffer. In this paper, we first provide a theoretical analysis of the properties of volatile correlation and derive a tight upper bound. This tight, evolving upper bound is used to identify a small list of candidate pairs, which are maintained as new transactions are added to the database. Once the total number of new transactions exceeds the buffer size, the upper bound is recomputed according to the next checkpoint, and a new list of candidate pairs is identified. Based on this scheme, we design a new algorithm named CHECK-POINT+. Experimental results on real-world data sets show that CHECK-POINT+ can significantly reduce the computation cost in dynamic data environments and uses memory space compactly.
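
The checkpoint idea described in the abstract can be illustrated with a small Python sketch. It is not the CHECK-POINT+ algorithm itself: the closed-form upper bound derived in the paper is not given in the abstract, so the sketch substitutes a brute-force bound that enumerates every way the buffer could fill. The names `phi`, `phi_upper_bound`, and `CheckpointMonitor` are hypothetical, items are assumed to be binary (present or absent in a transaction), and keeping all transactions in memory for a rescan at each checkpoint is a simplification made for brevity.

```python
from itertools import combinations
from math import sqrt


def phi(n, na, nb, nab):
    """Phi (Pearson) correlation of two binary items from their counts."""
    denom = na * (n - na) * nb * (n - nb)
    if denom == 0:
        return 0.0  # assumption: treat the undefined case as uncorrelated
    return (n * nab - na * nb) / sqrt(denom)


def phi_upper_bound(n, na, nb, nab, delta):
    """Largest phi reachable after at most `delta` new transactions.

    Brute-force stand-in for the paper's closed-form bound: each new
    transaction contains both items, only a, only b, or neither.
    """
    best = phi(n, na, nb, nab)
    for both in range(delta + 1):
        for only_a in range(delta + 1 - both):
            for only_b in range(delta + 1 - both - only_a):
                for neither in range(delta + 1 - both - only_a - only_b):
                    k = both + only_a + only_b + neither
                    best = max(best, phi(n + k, na + both + only_a,
                                         nb + both + only_b, nab + both))
    return best


class CheckpointMonitor:
    """Tracks pairs whose phi is at least `threshold` over a growing database."""

    def __init__(self, items, threshold, buffer_size):
        self.items = sorted(items)
        self.threshold = threshold
        self.buffer_size = buffer_size
        self.database = []                        # simplification: keep everything
        self.item_count = {i: 0 for i in self.items}
        self.candidates = {}                      # pair -> co-occurrence count
        self.since_checkpoint = 0
        self._checkpoint()

    def _checkpoint(self):
        """Rebuild the candidate list from the bound for the next buffer."""
        n = len(self.database)
        self.candidates = {}
        for a, b in combinations(self.items, 2):
            nab = sum(1 for t in self.database if a in t and b in t)
            bound = phi_upper_bound(n, self.item_count[a], self.item_count[b],
                                    nab, self.buffer_size)
            if bound >= self.threshold:
                self.candidates[(a, b)] = nab
        self.since_checkpoint = 0

    def add(self, transaction):
        """Add one transaction; only candidate pairs are updated."""
        if self.since_checkpoint == self.buffer_size:
            self._checkpoint()                    # buffer is full: move the checkpoint
        t = set(transaction)
        self.database.append(t)
        for item in t:
            if item in self.item_count:
                self.item_count[item] += 1
        for a, b in self.candidates:
            if a in t and b in t:
                self.candidates[(a, b)] += 1
        self.since_checkpoint += 1

    def strong_pairs(self):
        """Pairs whose current phi meets the threshold."""
        n = len(self.database)
        return {p for p, nab in self.candidates.items()
                if phi(n, self.item_count[p[0]], self.item_count[p[1]], nab)
                >= self.threshold}


monitor = CheckpointMonitor(items="abc", threshold=0.5, buffer_size=3)
for transaction in [{"a", "b"}, {"c"}, {"a", "b"}, {"b", "c"}, {"a", "b"}]:
    monitor.add(transaction)
print(monitor.strong_pairs())
```

Because the bound covers every state reachable before the next checkpoint, a pair left out of the candidate list cannot reach the threshold while the buffer fills, so only the candidates need per-transaction updates. The exhaustive enumeration is practical only for small buffers; a closed-form bound, such as the one the paper derives, is what would make the scheme efficient.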

Keywords

Pearson’s correlation coefficient; Correlation coefficient; Volatile correlation computing; Checkpoint

Copyright information

© The Author(s) 2010

Authors and Affiliations

  1. MSIS Department, Rutgers University, Newark, USA
