
Data Mining and Knowledge Discovery, Volume 12, Issue 2–3, pp 249–273

Mining Adaptive Ratio Rules from Distributed Data Sources

  • Jun Yan (Email author)
  • Ning Liu
  • Qiang Yang
  • Benyu Zhang
  • Qiansheng Cheng
  • Zheng Chen

Abstract

Different from traditional association-rule mining, a new paradigm called Ratio Rule (RR) was proposed recently. Ratio rules aim to capture quantitative association knowledge. We extend this framework to mining ratio rules from distributed and dynamic data sources, which is a novel and challenging problem. The traditional technique for ratio rule mining is an eigen-system analysis, which can often fall victim to noise; this has greatly limited the application of ratio rule mining. Distributed data sources impose an additional constraint: the mining procedure must be robust in the presence of noise, because it is difficult to clean all the data sources in real time in real-world tasks. In addition, traditional batch methods for ratio rule mining cannot cope with dynamic data. In this paper, we propose an integrated method for mining ratio rules from distributed and changing data sources: we first mine the ratio rules from each data source separately through a novel robust and adaptive one-pass algorithm, called Robust and Adaptive Ratio Rule (RARR), and then integrate the rules of the data sources in a simple probabilistic model. In this way, we can acquire the global rules from all the local information sources adaptively. We show that the RARR technique converges to a fixed point and is robust as well. Moreover, the integration of rules is efficient and effective. Both theoretical analysis and experiments illustrate that the performance of RARR and the proposed information integration procedure is satisfactory for the purpose of discovering latent associations in distributed, dynamic data sources.
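To make the idea concrete, below is a minimal sketch of the classical batch approach the abstract contrasts against: extracting a ratio rule as the principal eigenvector of the data's covariance matrix. This is not the RARR algorithm itself (which is one-pass, adaptive, and robust); the toy data, item names, and 1 : 2 : 5 ratio are illustrative assumptions.

```python
import numpy as np

# Toy transaction data: columns are spending on (bread, butter, milk).
# Each customer's amounts roughly follow a 1 : 2 : 5 ratio plus noise.
rng = np.random.default_rng(0)
base = rng.uniform(1, 10, size=(200, 1))
X = base * np.array([1.0, 2.0, 5.0]) + rng.normal(0, 0.1, size=(200, 3))

# Batch eigen-system analysis: center the data, form the covariance
# matrix, and take the eigenvector of the largest eigenvalue as the
# dominant ratio rule.
Xc = X - X.mean(axis=0)
cov = Xc.T @ Xc / (len(X) - 1)
eigvals, eigvecs = np.linalg.eigh(cov)   # eigenvalues in ascending order
rule = eigvecs[:, -1]                    # top principal direction
rule = rule / rule[0]                    # scale so the first item's weight is 1

print(np.round(rule, 2))                 # approximately [1., 2., 5.]
```

A single outlier transaction can noticeably tilt this eigenvector, which is exactly the noise-sensitivity the paper's robust one-pass formulation is designed to avoid.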

Keywords

Multiple-source data mining · Data stream mining · Ratio rules · Robust statistics · Eigen-system analysis


Copyright information

© Springer Science+Business Media, Inc. 2005

Authors and Affiliations

  • Jun Yan (1) (Email author)
  • Ning Liu (2)
  • Qiang Yang (3)
  • Benyu Zhang (4)
  • Qiansheng Cheng (1)
  • Zheng Chen (4)

  1. LMAM, Department of Information Science, School of Mathematical Science, Peking University, Beijing, P.R. China
  2. Department of Mathematical Science, Tsinghua University, Beijing, P.R. China
  3. Department of Computer Science, Hong Kong University of Science and Technology, Hong Kong, P.R. China
  4. Microsoft Research Asia, Beijing, P.R. China
