Skyline queries over incomplete data streams

Abstract

Efficient and effective processing of massive stream data has attracted much attention from the database community, as it is useful in many real applications such as sensor data monitoring and network intrusion detection. In practice, due to malfunctioning sensing devices or imperfect data collection techniques, real-world stream data often contain missing or incomplete attributes. In this paper, we formalize and tackle a novel and important problem, named skyline query over incomplete data stream (Sky-iDS), which retrieves skyline objects (in the presence of missing attributes) with high confidence from an incomplete data stream. To tackle the Sky-iDS problem, we design efficient approaches to impute the missing attributes of objects from the incomplete data stream via differential dependency (DD) rules. We propose effective pruning strategies to reduce the search space of the Sky-iDS problem, devise cost-model-based index structures to facilitate data imputation and skyline computation at the same time, and integrate our proposed techniques into an efficient Sky-iDS query answering algorithm. Extensive experiments confirm the efficiency and effectiveness of our Sky-iDS processing approach over both real and synthetic data sets.

Notes

  1. http://db.csail.mit.edu/labdata/labdata.html

  2. http://archive.ics.uci.edu/ml/datasets/gas+sensors+for+home+activity+monitoring

  3. https://www.kaggle.com/abkedar/times-series-kernel

  4. https://www.kaggle.com/nphantawee/pump-sensor-data/version/1

Acknowledgements

Xiang Lian is supported by NSF OAC No. 1739491 and Lian Startup No. 220981, Kent State University. We thank the anonymous reviewers for the useful suggestions.

Author information

Correspondence to Xiang Lian.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Appendices

Appendix

Proofs of Lemmas for pruning strategies

Proof of Lemma 1

Proof

As shown in Fig. 3a, since \(o'^p.min\) is the minimum corner of the imputed object \(o'^p\), every imputed sample of \(o'^p\) dominates or equals \(o'^p.min\), that is, \(o'^p \preccurlyeq o'^p.min\). Similarly, we have \(o_i^p.max \preccurlyeq o_i^p\). By the lemma assumption that \(o'^p.min \prec o_i^p.max\) and the transitivity of dominance, we can derive \(o'^p \preccurlyeq o'^p.min \prec o_i^p.max \preccurlyeq o_i^p\). Thus, we have \(Pr\{o'^p \prec o_i^p\} = 1\) (or, equivalently, \(Pr\{o'^p \prec o_{il}\} = 1\) for every instance \(o_{il} \in o_i^p\)). According to Eq. (4), it holds that \(P_{Sky\text {-}iDS}(o_i^p) = 0\). Moreover, since \(o'.exp \ge o_i.exp\) holds by the lemma assumption (i.e., object \(o'\) expires no earlier than \(o_i\)), \(o_i^p\) can never be a skyline object during its lifetime, due to the existence of object \(o'^p\). Hence, object \(o_i^p\in iDS\) can be safely pruned, which completes the proof. \(\square \)
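The corner test used in this proof can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the authors' implementation: an imputed object is represented as a plain list of sample points, the function names are hypothetical, and the larger-is-better preference is an assumption chosen so that every sample dominates or equals its own minimum corner, as the proof states.

```python
from typing import List, Sequence, Tuple

Point = Sequence[float]


def dominates(a: Point, b: Point) -> bool:
    """Strict dominance of a over b.

    ASSUMPTION: larger values are preferred, so a dominates b when a is
    at least as large in every dimension and strictly larger in one.
    """
    return all(x >= y for x, y in zip(a, b)) and any(x > y for x, y in zip(a, b))


def min_corner(samples: List[Point]) -> Tuple[float, ...]:
    """Coordinate-wise minimum over all imputed samples of an object."""
    return tuple(min(col) for col in zip(*samples))


def max_corner(samples: List[Point]) -> Tuple[float, ...]:
    """Coordinate-wise maximum over all imputed samples of an object."""
    return tuple(max(col) for col in zip(*samples))


def can_prune_spatially(o_prime_samples: List[Point], o_prime_exp: float,
                        o_i_samples: List[Point], o_i_exp: float) -> bool:
    """Corner test in the style of Lemma 1 (hypothetical helper).

    o_i is pruned when the minimum corner of o' strictly dominates the
    maximum corner of o_i and o' expires no earlier than o_i.
    """
    corner_dominance = dominates(min_corner(o_prime_samples),
                                 max_corner(o_i_samples))
    return corner_dominance and o_prime_exp >= o_i_exp
```

For example, with samples `[(5, 6), (6, 5)]` for \(o'\) and `[(1, 2), (3, 1)]` for \(o_i\), the minimum corner `(5, 5)` dominates the maximum corner `(3, 2)`, so \(o_i\) is pruned as long as \(o'\) does not expire first.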

Proof of Lemma 2

Proof

From Eq. (4), we can derive a probability upper bound as follows.

$$\begin{aligned} P_{Sky\text {-}iDS}(o_i^p) &\le \sum _{\forall o_{il} \in o_i^p} o_{il}.p \cdot (1- Pr\{o'^p \prec o_{il}\}) \\ &= 1- \sum _{\forall o_{il} \in o_i^p} o_{il}.p \cdot Pr\{o'^p \prec o_{il}\}. \end{aligned} \tag{6}$$

Since \(o_i^p.max \preccurlyeq o_{il}\) for every instance \(o_{il}\in o_i^p\) and \(Pr\{o'^p \prec o_i^p.max\} \ge 1-\alpha \) holds, we have \(Pr\{o'^p \prec o_{il}\} \ge Pr\{o'^p \prec o_i^p.max\} \ge 1-\alpha \). Substituting this bound into Eq. (6) and using \(\sum _{\forall o_{il} \in o_i^p} o_{il}.p = 1\), we obtain \(P_{Sky\text {-}iDS}(o_i^p) \le 1- \sum _{\forall o_{il} \in o_i^p} o_{il}.p \cdot (1-\alpha ) = \alpha \). Moreover, since \(o'.exp \ge o_i.exp\) holds, the skyline probability of \(o_i^p\) is never greater than \(\alpha \) during its lifetime, due to the existence of object \(o'\). Thus, object \(o_i\) can be safely pruned. \(\square \)
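The upper bound of Eq. (6) is easy to evaluate once the per-instance domination probabilities are known. The following Python sketch shows the arithmetic only; the function names and the list-based inputs are hypothetical, and it assumes that the instance probabilities \(o_{il}.p\) sum to 1 and that the values \(Pr\{o'^p \prec o_{il}\}\) have already been computed elsewhere.

```python
import math
from typing import Sequence


def skyline_prob_upper_bound(instance_probs: Sequence[float],
                             dom_probs: Sequence[float]) -> float:
    """Upper bound of Eq. (6): 1 - sum_l o_il.p * Pr{o' dominates o_il}.

    instance_probs[l] is the existence probability o_il.p of instance l;
    dom_probs[l] is the (precomputed) probability that o' dominates it.
    """
    if len(instance_probs) != len(dom_probs):
        raise ValueError("one domination probability is needed per instance")
    if not math.isclose(sum(instance_probs), 1.0):
        raise ValueError("instance probabilities are assumed to sum to 1")
    return 1.0 - sum(p * d for p, d in zip(instance_probs, dom_probs))


def can_prune_by_bound(instance_probs: Sequence[float],
                       dom_probs: Sequence[float],
                       alpha: float) -> bool:
    """An object whose bound does not exceed alpha cannot be a skyline answer."""
    return skyline_prob_upper_bound(instance_probs, dom_probs) <= alpha
```

With two equally likely instances dominated with probabilities 0.9 and 0.7, the bound is \(1 - (0.45 + 0.35) = 0.2\), so the object could be pruned for a threshold of \(\alpha = 0.3\) but not for \(\alpha = 0.1\).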

Proof of Lemma 3

Proof

Similar to the proof of Lemma 2, since \(o'^p \preccurlyeq o'^p.min\) and \(Pr\{o'^p.min \prec o_i^p\} \ge 1-\alpha \) hold, we have \(Pr\{o'^p \prec o_i^p\} \ge Pr\{o'^p.min \prec o_i^p\} \ge 1-\alpha \). Substituting this bound into Eq. (6), we obtain \(P_{Sky\text {-}iDS}(o_i^p) \le 1- Pr\{o'^p \prec o_i^p\} \le \alpha \). Thus, since object \(o_i\) expires no later than object \(o'\) (i.e., \(o'.exp \ge o_i.exp\)), the skyline probability of object \(o_i\) is never greater than \(\alpha \) during its lifetime. Hence, object \(o_i\) can be safely pruned. \(\square \)

Proofs of properties for skyline tree ST

Proof of Property 1 of ST

Proof

We prove this property by showing that no valid imputed object \(o_i^p\) exists that is outside the skyline tree ST but is currently a skyline or may become a skyline later.

First, assume that the object \(o_i^p\) is a current skyline. According to Definition 6, we have \(P_{Sky\text {-}iDS}(o_i^p)>\alpha \). Combining this with Eq. (6), for any object \(n^p\) in ST we have \(\sum _{\forall o_{il} \in o_i^p} o_{il}.p \cdot Pr\{n^p \prec o_{il}\} < 1-\alpha \), that is, \(Pr\{n^p \prec o_i^p\}<1-\alpha \). Thus, no object \(n^p\) in ST dominates \(o_i^p\) with probability at least \((1-\alpha )\), and object \(o_i^p\) should be on the first layer of ST.

Second, assume that the object \(o_i^p\) is dominated by some objects \(n^p \in ST\) and may become a skyline after these objects \(n^p\) expire (i.e., \(n^p.exp < o_i^p.exp\)). In this case, object \(o_i^p\) should be a child of one of these objects \(n^p\), since \(Pr\{n^p \prec o_i^p\} \ge 1-\alpha \) and \(n^p.exp < o_i^p.exp\). Therefore, the ST index contains all the objects \(o_i^p \in pDS\) that have a chance to be skylines before they expire. \(\square \)

Proof of Property 2 of ST

Proof

Given an imputed object \(o_i^p \in ST\), if it is not on the first layer of ST, \(o_i^p\) is dominated by its non-empty parent node (object) \(n^p \in ST\) with probability \(Pr\{n^p \prec o_i^p\} \ge 1-\alpha \). Substituting this probability into Eq. (6), we obtain \(P_{Sky\text {-}iDS}(o_i^p)\le 1-\sum _{\forall o_{il} \in o_i^p} o_{il}.p \cdot Pr\{n^p \prec o_{il}\}=1-Pr\{n^p \prec o_i^p\} \le \alpha \), that is, \(P_{Sky\text {-}iDS}(o_i^p)\le \alpha \), which fails the skyline condition of Definition 6. Hence, object \(o_i^p\) cannot become a skyline before its parent node expires from stream iDS. \(\square \)

Proof of Property 3 of ST

Proof

According to Property 2, every object \(n^p\) that is not on the first layer of ST has a skyline probability no greater than \(\alpha \) (\(P_{Sky\text {-}iDS}(n^p) \le \alpha \)). Therefore, all current skyline objects must be on the first layer of ST; in other words, the set of objects on the first layer of ST is a superset of the Sky-iDS answers. \(\square \)
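Property 3 suggests a simple filter-and-refine use of the index: only first-layer objects need their skyline probability checked against \(\alpha \). The sketch below illustrates that refinement step under assumptions: `first_layer` is any iterable of object identifiers and `skyline_prob` is a caller-supplied function returning \(P_{Sky\text {-}iDS}\) for an object; both names are hypothetical and are not taken from the paper.

```python
from typing import Callable, Hashable, Iterable, List


def refine_first_layer(first_layer: Iterable[Hashable],
                       skyline_prob: Callable[[Hashable], float],
                       alpha: float) -> List[Hashable]:
    """Keep the first-layer candidates whose skyline probability exceeds alpha.

    By Property 3 the first layer is a superset of the answers, so deeper
    layers of the tree never need to be examined in this step.
    """
    return [obj for obj in first_layer if skyline_prob(obj) > alpha]
```

For instance, if the first layer holds objects `a`, `b`, and `c` with skyline probabilities 0.9, 0.4, and 0.6, a threshold of \(\alpha = 0.5\) keeps `a` and `c`.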

About this article

Cite this article

Ren, W., Lian, X. & Ghazinour, K. Skyline queries over incomplete data streams. The VLDB Journal 28, 961–985 (2019). https://doi.org/10.1007/s00778-019-00577-6

Keywords

  • Skyline query
  • Incomplete data streams
  • Sky-iDS