Aggregation-Aware Compression of Probabilistic Streaming Time Series

Akbarinia, Reza; Masseglia, Florent

doi:10.1007/978-3-319-21024-7_16

Part of the book series: Lecture Notes in Computer Science (LNAI, volume 9166)

Included in the following conference series:

International Workshop on Machine Learning and Data Mining in Pattern Recognition

Abstract

In recent years, there has been growing interest in probabilistic data management. We focus on probabilistic time series, whose main characteristic is their high data volume, which calls for efficient compression techniques. To date, most work on probabilistic data reduction has provided synopses that minimize the representation error w.r.t. the original data. However, in most cases, the compressed data will be meaningless for usual queries involving aggregation operators such as SUM or AVG. We propose PHA (Probabilistic Histogram Aggregation), a compression technique whose objective is to minimize the error of such queries over compressed probabilistic data. We incorporate the aggregation operator given by the end-user directly into the compression technique, and obtain much lower error in the long term. We also adopt a global error-aware strategy to manage large sets of probabilistic time series, where the available memory is carefully balanced among the series according to their individual variability.

Notes

1. http://fimi.ua.ac.be/data/

Author information

Authors and Affiliations

Inria & LIRMM, Zenith Team - Université Montpellier, Montpellier cedex 5, France
Reza Akbarinia & Florent Masseglia

Corresponding author

Correspondence to Florent Masseglia.

Editor information

Editors and Affiliations

IBaI, Leipzig, Germany
Petra Perner

Appendix

AVG Operator. To extend PHA to the average (AVG) operator, we need a method for computing the average of two probabilistic histograms. Given two probabilistic histograms \(H_1\) and \(H_2\), aggregating them with the AVG operator means constructing a histogram H in which the probability of each value k is equal to the cumulative probability of the cases where the average of two values, \(x_1\) from \(H_1\) and \(x_2\) from \(H_2\), is equal to k, i.e., \(k=(x_1 + x_2) / 2\). In the following lemma, we present a formula for aggregating two histograms using the AVG operator.

Lemma 1

Let \(H_1\) and \(H_2\) be two probabilistic histograms. Then, the probability of each value k in the probabilistic histogram obtained from the average of \(H_1\) and \(H_2\), denoted as \(AVG(H_1, H_2)[k]\), is computed as:

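\(AVG(H_1, H_2)[k] = H_1[k] \times H_2[k] + \sum _{i=0}^{k-1} \left( H_1[i] \times H_2[2 \times k - i] + H_1[2 \times k - i] \times H_2[i] \right)\)

Proof. The probability of having a value k in \(AVG(H_1, H_2)\) is equal to the cumulative probability of all cases where the average of two values i in \(H_1\) and j in \(H_2\) is equal to k. In other words, j should be equal to \(2 \times k - i\). For \(i < k\), the sum covers each such pair together with its mirror image, in which \(2 \times k - i\) comes from \(H_1\) and i from \(H_2\). The pair \(i = j = k\) is its own mirror image, so it is counted once, by the separate first term.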
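As an illustration of the definition \(k=(x_1 + x_2) / 2\), the following is a minimal Python sketch that builds the AVG histogram by enumerating every pair of values directly. It is not the authors' implementation: the function name `avg_histograms`, the dictionary representation of a histogram, and the example values are all hypothetical, and the sketch assumes that the two values are independent so that their probabilities can be multiplied.

```python
from collections import defaultdict


def avg_histograms(h1, h2):
    """Return the distribution of (x1 + x2) / 2 for x1 drawn from h1 and x2 from h2.

    h1, h2: dicts mapping an integer value to its probability.
    Assumption: the two values are independent, so joint probabilities multiply.
    """
    out = defaultdict(float)
    for x1, p1 in h1.items():
        for x2, p2 in h2.items():
            # Each pair (x1, x2) is visited exactly once and adds its joint
            # probability to the bucket of its average.
            out[(x1 + x2) / 2] += p1 * p2
    return dict(out)


# Hypothetical example histograms (not taken from the paper).
h1 = {0: 0.5, 2: 0.5}
h2 = {0: 0.25, 2: 0.75}
result = avg_histograms(h1, h2)
print(result)
```

In this example the average 1 can arise from the pairs (0, 2) and (2, 0), so it receives probability 0.5 × 0.75 + 0.5 × 0.25 = 0.5, and the probabilities of all averages sum to 1. Keying the result by the average itself also keeps half-integer averages, which occur whenever \(x_1 + x_2\) is odd.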

About this paper

Cite this paper

Akbarinia, R., Masseglia, F. (2015). Aggregation-Aware Compression of Probabilistic Streaming Time Series. In: Perner, P. (eds) Machine Learning and Data Mining in Pattern Recognition. MLDM 2015. Lecture Notes in Computer Science, vol 9166. Springer, Cham. https://doi.org/10.1007/978-3-319-21024-7_16

DOI: https://doi.org/10.1007/978-3-319-21024-7_16
Published: 01 July 2015
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-21023-0
Online ISBN: 978-3-319-21024-7
eBook Packages: Computer Science, Computer Science (R0)
