Knowledge and Information Systems, Volume 45, Issue 1, pp 159–190

Sliding windows over uncertain data streams

  • Michele Dallachiesa
  • Gabriela Jacques-Silva
  • Buğra Gedik
  • Kun-Lung Wu
  • Themis Palpanas
Regular Paper

Abstract

Uncertain data streams can have tuples with both value and existential uncertainty. A tuple has value uncertainty when it can assume multiple possible values. A tuple is existentially uncertain when the sum of the probabilities of its possible values is less than 1. Existential uncertainty can arise, for example, when relational operators are applied to streams with value uncertainty. Several prior works have focused on querying and mining data streams with both value and existential uncertainty. However, none of them has studied in depth the implications of existential uncertainty for sliding window processing, even though it arises naturally when processing uncertain data. In this work, we study the challenges arising from existential uncertainty, specifically the management of count-based sliding windows, which are a basic building block of stream processing applications. We extend the semantics of sliding windows to define the novel concept of uncertain sliding windows, and we provide both exact and approximate algorithms for managing windows under existential uncertainty. We also show how current state-of-the-art techniques for answering similarity join queries can be easily adapted for use with uncertain sliding windows. We evaluate our proposed techniques under a variety of configurations using real data. The results show that the algorithms used to maintain uncertain sliding windows operate efficiently while providing a high-quality approximation in query answering. In addition, we show that sort-based similarity join algorithms can perform better than index-based techniques (on 17 real datasets) when the number of possible values per tuple is low, as in many real-world applications.
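
The two notions of uncertainty can be made concrete with a minimal sketch. This is an illustration only, not the exact or approximate algorithms proposed in the paper. It assumes that tuples exist independently of one another, and the tuple representation, the function names (`existence_probability`, `count_distribution`), and the sample stream are all hypothetical. Each tuple's existence probability is the sum of the probabilities of its possible values, and a standard dynamic program then gives the distribution of how many tuples in a sequence actually exist, which is the quantity a count-based window has to reason about once tuples may be absent.

```python
from typing import Dict, List

# Hypothetical representation: a tuple maps each possible value to its probability.
UncertainTuple = Dict[float, float]


def existence_probability(t: UncertainTuple) -> float:
    """Sum of the probabilities of the tuple's possible values."""
    return sum(t.values())


def count_distribution(tuples: List[UncertainTuple]) -> List[float]:
    """Return dist where dist[k] is the probability that exactly k tuples exist.

    Assumes the tuples exist independently of one another (an assumption of
    this sketch, not a statement about the paper's model).
    """
    dist = [1.0]  # with no tuples seen, zero tuples exist with probability 1
    for t in tuples:
        p = existence_probability(t)
        nxt = [0.0] * (len(dist) + 1)
        for k, q in enumerate(dist):
            nxt[k] += q * (1.0 - p)  # this tuple does not exist
            nxt[k + 1] += q * p      # this tuple exists
        dist = nxt
    return dist


stream = [
    {1.0: 0.5, 2.0: 0.5},    # value uncertainty only: probabilities sum to 1
    {3.0: 0.5},              # existential uncertainty: probabilities sum to 0.5
    {4.0: 0.25, 5.0: 0.25},  # both kinds: two possible values, sum is 0.5
]
dist = count_distribution(stream)
```

Under these assumptions, `dist[2] + dist[3]` is the probability that at least two of the three tuples exist, that is, the probability that a count-based window of size 2 can be filled from this portion of the stream.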

Keywords

Data stream processing · Sliding windows · Uncertainty management

Copyright information

© Springer-Verlag London 2014

Authors and Affiliations

  • Michele Dallachiesa (1, 2)
  • Gabriela Jacques-Silva (2)
  • Buğra Gedik (3)
  • Kun-Lung Wu (2)
  • Themis Palpanas (1, 4)

  1. University of Trento, Trento, Italy
  2. IBM T.J. Watson Research Center, Yorktown Heights, USA
  3. Bilkent University, Ankara, Turkey
  4. Paris Descartes University, Paris, France