CLARO: modeling and processing uncertain data streams

  • Regular Paper
The VLDB Journal

Abstract

Uncertain data streams, where data are incomplete and imprecise, have been observed in many environments. Feeding such data streams to existing stream systems produces results of unknown quality, which is of paramount concern to monitoring applications. In this paper, we present the CLARO system that supports stream processing for uncertain data naturally captured using continuous random variables. CLARO employs a unique data model that is flexible and allows efficient computation. Built on this model, we develop evaluation techniques for relational operators by exploring statistical theory and approximation. We also consider query planning for complex queries given an accuracy requirement. Evaluation results show that our techniques can achieve high performance while satisfying accuracy requirements and outperform state-of-the-art sampling methods.

Author information

Corresponding author

Correspondence to Thanh T. L. Tran.

About this article

Cite this article

Tran, T.T.L., Peng, L., Diao, Y. et al. CLARO: modeling and processing uncertain data streams. The VLDB Journal 21, 651–676 (2012). https://doi.org/10.1007/s00778-011-0261-7

  • DOI: https://doi.org/10.1007/s00778-011-0261-7
