Knowledge and Information Systems, Volume 33, Issue 1, pp 191–212

Compression and aggregation of Bayesian estimates for data intensive computing

  • Ruibin Xi
  • Nan Lin
  • Yixin Chen
  • Youngjin Kim
Regular Paper

Abstract

Bayesian estimation is a major and robust estimation approach for many advanced statistical models. Because they can incorporate prior knowledge into statistical inference, Bayesian methods have been successfully applied in many fields, such as business, computer science, economics, epidemiology, genetics, imaging, and political science. However, because of its high computational complexity, Bayesian estimation has been deemed difficult, if not impractical, for large-scale databases, stream data, data warehouses, and data in the cloud. In this paper, we propose a novel compression and aggregation (C&A) scheme that enables distributed, parallel, or incremental computation of Bayesian estimates. Assuming a large dataset is partitioned, the C&A scheme compresses each partition into a synopsis and aggregates the synopses into an overall Bayesian estimate without accessing the raw data. Such a C&A scheme can find applications in OLAP for data cubes, stream data mining, and cloud computing. It saves tremendous computing time because it processes each partition only once, which enables fast incremental updates and allows parallel processing. We prove that the compression is asymptotically lossless in the sense that the aggregated estimator deviates from the true model by an error that is bounded and approaches zero as the data size increases. The results show that the proposed C&A scheme makes OLAP of Bayesian estimates in a data cube feasible. Further, it supports real-time Bayesian analysis of stream data, which can only be scanned once and cannot be permanently retained. Experimental results validate our theoretical analysis and demonstrate that our method can dramatically reduce time and space costs with almost no degradation of modeling accuracy.
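
The compress-then-aggregate idea can be illustrated with a minimal sketch. It is not the paper's algorithm: it assumes a toy model (normal data with known variance and a conjugate normal prior on the mean), and the names `compress` and `aggregate`, the precision-weighted combination rule, and all parameter values are hypothetical choices made for illustration. Each partition is reduced to a two-number synopsis (posterior mean and posterior precision), and the synopses are combined without revisiting the raw data.

```python
import random


def compress(partition, prior_mean=0.0, prior_var=100.0, noise_var=1.0):
    """Compress one partition into a synopsis (posterior mean, posterior precision).

    Assumed toy model: x ~ Normal(theta, noise_var), theta ~ Normal(prior_mean, prior_var).
    """
    n = len(partition)
    precision = 1.0 / prior_var + n / noise_var
    mean = (prior_mean / prior_var + sum(partition) / noise_var) / precision
    return mean, precision


def aggregate(synopses):
    """Combine synopses by precision weighting; the raw data are never touched."""
    total_precision = sum(p for _, p in synopses)
    return sum(m * p for m, p in synopses) / total_precision


rng = random.Random(0)
data = [rng.gauss(2.0, 1.0) for _ in range(10_000)]

# Split into 10 partitions; each is scanned exactly once.
partitions = [data[i:i + 1_000] for i in range(0, len(data), 1_000)]
synopses = [compress(part) for part in partitions]

aggregated = aggregate(synopses)
full_mean, _ = compress(data)  # posterior mean computed from all raw data at once
gap = abs(aggregated - full_mean)
print(f"aggregated={aggregated:.6f} full={full_mean:.6f} gap={gap:.2e}")
```

In this toy setting the aggregated value differs slightly from the full-data posterior mean because the prior is counted once per partition, and that gap shrinks as the partitions grow, which mirrors the asymptotically lossless behavior described in the abstract. A new partition can be folded in by compressing it and calling `aggregate` again on the extended list of synopses.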

Keywords

Bayesian estimation, Data cubes, OLAP, Stream data mining, Compression, Aggregation

Copyright information

© Springer-Verlag London Limited 2011

Authors and Affiliations

  1. Center for Biomedical Informatics, Harvard Medical School, Boston, USA
  2. Department of Mathematics, Washington University, St. Louis, USA
  3. Department of Computer Science, Washington University, St. Louis, USA
  4. Google Inc., Mountain View, USA
