Aggregation-Aware Compression of Probabilistic Streaming Time Series

  • Conference paper
Machine Learning and Data Mining in Pattern Recognition (MLDM 2015)

Part of the book series: Lecture Notes in Computer Science (LNAI, volume 9166)

Abstract

In recent years, there has been growing interest in probabilistic data management. We focus on probabilistic time series, whose main characteristic is high data volume, which calls for efficient compression techniques. To date, most work on probabilistic data reduction has produced synopses that minimize the representation error w.r.t. the original data. However, in most cases, the compressed data will be meaningless for common queries involving aggregation operators such as SUM or AVG. We propose PHA (Probabilistic Histogram Aggregation), a compression technique whose objective is to minimize the error of such queries over compressed probabilistic data. We incorporate the aggregation operator given by the end user directly into the compression technique, and obtain much lower error in the long term. We also adopt a global error-aware strategy to manage large sets of probabilistic time series, in which the available memory is carefully balanced among the series according to their individual variability.

Notes

  1. http://fimi.ua.ac.be/data/

Author information

Corresponding author

Correspondence to Florent Masseglia.

Appendix

AVG Operator. To extend PHA to the average (AVG) operator, we need a method for computing the average of two probabilistic histograms. Given two probabilistic histograms \(H_1\) and \(H_2\), aggregating them with the AVG operator means building a histogram H in which the probability of each value k equals the total probability of all cases where the average of a value \(x_1\) from \(H_1\) and a value \(x_2\) from \(H_2\) is equal to k, i.e. \(k=(x_1 + x_2) / 2\). The following lemma gives a formula for aggregating two histograms using the AVG operator.

Lemma 1

Let \(H_1\) and \(H_2\) be two probabilistic histograms. Then, the probability of each value k in the probabilistic histogram obtained from the average of \(H_1\) and \(H_2\), denoted as \(AVG(H_1, H_2)[k]\), is computed as:

\(AVG(H_1, H_2)[k] = \sum _{i=0}^{k} \left( H_1[i] \times H_2[2 \times k - i] + H_1[2 \times k - i] \times H_2[i] \right)\)

Proof. The probability of having a value k in \(AVG(H_1, H_2)\) is equal to the total probability of all cases where the average of a value i in \(H_1\) and a value j in \(H_2\) is equal to k. In other words, j must be equal to \(2 \times k - i\). The summation in the above equation enumerates these cases.
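
As an illustration of Lemma 1, the following minimal Python sketch computes the probability of an integer average k directly from the condition j = 2k - i used in the proof. It assumes that each histogram is stored as a list whose index is the value and whose entry is the probability of that value, and that the two histograms are independent, so that joint probabilities are products. The function name `avg_probability` and this list layout are assumptions of the sketch, not part of the paper. The loop lets i range over 0..2k so that every pair (i, 2k - i) is visited exactly once, including the middle pair i = k.

```python
def avg_probability(h1, h2, k):
    """Probability that (x1 + x2) / 2 == k for x1 drawn from h1, x2 from h2.

    h1 and h2 are lists where h[v] is the probability of value v
    (hypothetical layout chosen for this sketch). Independence of the
    two histograms is assumed.
    """
    total = 0.0
    for i in range(2 * k + 1):
        j = 2 * k - i  # the only j whose average with i equals k
        # Values outside a histogram's range have probability zero.
        if i < len(h1) and j < len(h2):
            total += h1[i] * h2[j]
    return total


# Two histograms over the values 0, 1, 2.
h1 = [0.5, 0.0, 0.5]
h2 = [0.5, 0.0, 0.5]
avg = [avg_probability(h1, h2, k) for k in range(3)]
print(avg)
```

For these inputs the averages 0, 1 and 2 receive probabilities 0.25, 0.5 and 0.25. Pairs whose sum is odd have a non-integer average and are not reached by any integer k; how that probability mass is assigned to histogram buckets is not addressed by this sketch.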

Copyright information

© 2015 Springer International Publishing Switzerland

About this paper

Cite this paper

Akbarinia, R., Masseglia, F. (2015). Aggregation-Aware Compression of Probabilistic Streaming Time Series. In: Perner, P. (ed.) Machine Learning and Data Mining in Pattern Recognition. MLDM 2015. Lecture Notes in Computer Science, vol 9166. Springer, Cham. https://doi.org/10.1007/978-3-319-21024-7_16

  • DOI: https://doi.org/10.1007/978-3-319-21024-7_16

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-319-21023-0

  • Online ISBN: 978-3-319-21024-7

  • eBook Packages: Computer Science (R0)
