The VLDB Journal, Volume 21, Issue 5, pp 651–676

CLARO: modeling and processing uncertain data streams

  • Thanh T. L. Tran
  • Liping Peng
  • Yanlei Diao
  • Andrew McGregor
  • Anna Liu
Regular Paper

Abstract

Uncertain data streams, where data are incomplete and imprecise, have been observed in many environments. Feeding such data streams to existing stream systems produces results of unknown quality, which is of paramount concern to monitoring applications. In this paper, we present the CLARO system that supports stream processing for uncertain data naturally captured using continuous random variables. CLARO employs a unique data model that is flexible and allows efficient computation. Built on this model, we develop evaluation techniques for relational operators by exploring statistical theory and approximation. We also consider query planning for complex queries given an accuracy requirement. Evaluation results show that our techniques can achieve high performance while satisfying accuracy requirements and outperform state-of-the-art sampling methods.

Keywords

Uncertain data streams · Continuous uncertainty · Data models · Query processing


Copyright information

© Springer-Verlag 2011

Authors and Affiliations

  • Thanh T. L. Tran (1)
  • Liping Peng (1)
  • Yanlei Diao (1)
  • Andrew McGregor (1)
  • Anna Liu (1)

  1. University of Massachusetts, Amherst, USA
