Abstract
In recent years, there has been growing interest in probabilistic data management. We focus on probabilistic time series, whose main characteristic is a high volume of data that calls for efficient compression techniques. To date, most work on probabilistic data reduction has provided synopses that minimize the representation error w.r.t. the original data. However, in most cases, the compressed data is meaningless for common queries involving aggregation operators such as SUM or AVG. We propose PHA (Probabilistic Histogram Aggregation), a compression technique whose objective is to minimize the error of such queries over compressed probabilistic data. We incorporate the aggregation operator given by the end-user directly into the compression technique, and obtain much lower error in the long term. We also adopt a global error-aware strategy to manage large sets of probabilistic time series, in which the available memory is carefully balanced between the series according to their individual variability.
Appendix
AVG Operator. To extend PHA to the average (AVG) operator, we need a method for computing the average of two probabilistic histograms. Given two probabilistic histograms \(H_1\) and \(H_2\), aggregating them with the AVG operator means building a histogram H in which the probability of each value k is equal to the cumulative probability of the cases where the average of two values \(x_1\) from \(H_1\) and \(x_2\) from \(H_2\) is equal to k, i.e. \(k=(x_1 + x_2) / 2\). In the following lemma, we present a formula for aggregating two histograms using the AVG operator.
Lemma 1
Let \(H_1\) and \(H_2\) be two probabilistic histograms. Then, the probability of each value k in the probabilistic histogram obtained from the average of \(H_1\) and \(H_2\), denoted as \(AVG(H_1, H_2)[k]\), is computed as:
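\(AVG(H_1, H_2)[k] = \sum _{i=0}^{k} H_1[i] \times H_2[2 \times k - i] + \sum _{i=0}^{k-1} H_1[2 \times k - i] \times H_2[i]\)
Proof. The probability of having a value k in \(AVG(H_1, H_2)\) is equal to the cumulative probability of all cases where the average of two values i in \(H_1\) and j in \(H_2\) is equal to k. In other words, j should be equal to \(2 \times k - i\). The first sum covers the cases where the value taken from \(H_1\) is at most k, and the second sum covers the cases where it is greater than k. The second sum stops at \(k - 1\) so that the case \(i = j = k\), whose probability is \(H_1[k] \times H_2[k]\), is counted once.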
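The aggregation in Lemma 1 can be sketched in Python by iterating directly over the pairs (i, j) with \(i + j = 2 \times k\), so that every pair, including \(i = j = k\), contributes exactly once. This is only an illustration under stated assumptions, not the authors' implementation: a histogram is assumed to be a plain list in which index v holds the probability of the integer value v, the two histograms are assumed to be independent, and the function names are hypothetical. Pairs whose sum is odd have a non-integer average and are outside the scope of the sketch.

```python
from itertools import product


def avg_probability(h1, h2, k):
    """Probability that (x1 + x2) / 2 == k for x1 drawn from h1 and x2 from h2.

    Assumes h1[v] and h2[v] hold the probability of integer value v and that
    the two histograms are independent (both are assumptions of this sketch).
    """
    total = 0.0
    for i in range(2 * k + 1):
        j = 2 * k - i  # the only partner value that yields average k
        if i < len(h1) and j < len(h2):  # values outside a histogram have probability 0
            total += h1[i] * h2[j]
    return total


def avg_probability_brute_force(h1, h2, k):
    """Reference check: enumerate every pair and keep those averaging to k."""
    return sum(
        p1 * p2
        for (x1, p1), (x2, p2) in product(enumerate(h1), enumerate(h2))
        if x1 + x2 == 2 * k
    )


# Illustrative histograms over the values 0, 1, 2 (made-up numbers).
h1 = [0.2, 0.5, 0.3]
h2 = [0.1, 0.6, 0.3]
result = [avg_probability(h1, h2, k) for k in range(3)]
```

With these made-up inputs, the probability of an average of 1 is approximately 0.39 (0.2 × 0.3 + 0.5 × 0.6 + 0.3 × 0.1). The brute-force function is included only as a cross-check against the definition of the AVG aggregation.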
Copyright information
© 2015 Springer International Publishing Switzerland
About this paper
Cite this paper
Akbarinia, R., Masseglia, F. (2015). Aggregation-Aware Compression of Probabilistic Streaming Time Series. In: Perner, P. (ed.) Machine Learning and Data Mining in Pattern Recognition. MLDM 2015. Lecture Notes in Computer Science, vol 9166. Springer, Cham. https://doi.org/10.1007/978-3-319-21024-7_16
DOI: https://doi.org/10.1007/978-3-319-21024-7_16
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-21023-0
Online ISBN: 978-3-319-21024-7
eBook Packages: Computer Science, Computer Science (R0)