Abstract
Several wavelet synopsis construction algorithms have previously been proposed for optimal \(\hbox {Haar}^+\) synopses. Recently, we proposed the OptExtHP-EB algorithm to find an optimal one-dimensional \(\hbox {Haar}^+\) synopsis. By exploiting novel properties of optimal synopses, OptExtHP-EB represents the set of optimal synopses in a node of a \(\hbox {Haar}^+\) tree by a set of extended synopses. While it is much faster than the previous \(\hbox {Haar}^+\) synopsis construction algorithms, it can handle only one-dimensional data. In this paper, we propose the OptExtHP-EB2D algorithm for two-dimensional \(\hbox {Haar}^+\) synopses by extending OptExtHP-EB. While a node of a one-dimensional \(\hbox {Haar}^+\) tree has only two child nodes and three coefficients, a node of a two-dimensional \(\hbox {Haar}^+\) tree is much more complex: it has four child nodes and seven coefficients. Thus, for each possible subset of the coefficients selected in a node, we develop efficient methods to compute a set of optimal synopses represented by extended synopses. Our experiments confirm the effectiveness of the proposed OptExtHP-EB2D algorithm.
References
Bruno, N., Chaudhuri, S., Gravano, L.: STHoles: a multidimensional workload-aware histogram. In: ACM SIGMOD Record, vol. 30, pp. 211–222. ACM (2001)
Chakrabarti, K., Garofalakis, M., Rastogi, R., Shim, K.: Approximate query processing using wavelets. VLDB J. 10(2–3), 199–223 (2001)
Cormode, G., Garofalakis, M., Sacharidis, D.: Fast approximate wavelet tracking on streams. In: International Conference on Extending Database Technology, pp. 4–22. Springer (2006)
Deshpande, A., Garofalakis, M., Rastogi, R.: Independence is good: dependency-based histogram synopses for high-dimensional data. ACM SIGMOD Rec. 30(2), 199–210 (2001)
Garofalakis, M., Gibbons, P.B.: Probabilistic wavelet synopses. ACM TODS 29(1), 43–90 (2004)
Garofalakis, M., Kumar, A.: Deterministic wavelet thresholding for maximum-error metrics. In: PODS, pp. 166–176 (2004)
Garofalakis, M., Kumar, A.: Wavelet synopses for general error metrics. TODS 30(4), 888–928 (2005)
Gilbert, A.C., Kotidis, Y., Muthukrishnan, S., Strauss, M.J.: One-pass wavelet decompositions of data streams. TKDE 15(3), 541–554 (2003)
Guha, S.: Space efficiency in synopsis construction algorithms. In: VLDB, pp. 409–420 (2005)
Guha, S.: On the space-time of optimal, approximate and streaming algorithms for synopsis construction problems. VLDB J. 17(6), 1509–1535 (2008)
Guha, S., Harb, B.: Wavelet synopsis for data streams: minimizing non-Euclidean error. In: SIGKDD, pp. 88–97 (2005)
Guha, S., Harb, B.: Approximation algorithms for wavelet transform coding of data streams. Inf. Theory 54(2), 811–830 (2008)
Guha, S., Park, H., Shim, K.: Wavelet synopsis for hierarchical range queries with workloads. VLDB J. 17(5), 1079–1099 (2008)
Jestes, J., Yi, K., Li, F.: Building wavelet histograms on large data in MapReduce. PVLDB 5(2), 109–120 (2011)
Karras, P.: Optimality and scalability in lattice histogram construction. PVLDB 2(1), 670–681 (2009)
Karras, P., Mamoulis, N.: One-pass wavelet synopses for maximum-error metrics. In: VLDB, pp. 421–432 (2005)
Karras, P., Mamoulis, N.: The Haar+ tree: a refined synopsis data structure. In: ICDE, pp. 436–445 (2007)
Karras, P., Mamoulis, N.: Hierarchical synopses with optimal error guarantees. TODS 33(3), 18 (2008)
Karras, P., Sacharidis, D., Mamoulis, N.: Exploiting duality in summarization with deterministic guarantees. In: SIGKDD, pp. 380–389. ACM (2007)
Kim, J., Min, J.K., Shim, K.: Efficient Haar+ synopsis construction for the maximum absolute error measure. PVLDB 11(1), 40–52 (2017)
Matias, Y., Vitter, J.S., Wang, M.: Wavelet-based histograms for selectivity estimation. In: SIGMOD, vol. 27, pp. 448–459. ACM (1998)
Matias, Y., Vitter, J.S., Wang, M.: Dynamic maintenance of wavelet-based histograms. In: VLDB, pp. 101–110 (2000)
Morton, G.M.: A computer oriented geodetic data base and a new technique in file sequencing (1966)
Muralikrishna, M., DeWitt, D.J.: Equi-depth multidimensional histograms. In: ACM SIGMOD Record, vol. 17, pp. 28–36. ACM (1988)
Muthukrishnan, S.: Subquadratic algorithms for workload-aware Haar wavelet synopses. In: FSTTCS, pp. 285–296 (2005)
Muthukrishnan, S., Poosala, V., Suel, T.: On rectangular partitionings in two dimensions: algorithms, complexity and applications. In: International Conference on Database Theory, pp. 236–256. Springer (1999)
Muthukrishnan, S., Strauss, M.: Rangesum histograms. In: Proceedings of the 14th Annual ACM-SIAM Symposium on Discrete Algorithms, pp. 233–242. Society for Industrial and Applied Mathematics (2003)
Mytilinis, I., Tsoumakos, D., Koziris, N.: Distributed wavelet thresholding for maximum error metrics. In: SIGMOD, pp. 663–677. ACM (2016)
Natsev, A., Rastogi, R., Shim, K.: WALRUS: a similarity retrieval algorithm for image databases. SIGMOD 28, 395–406 (1999)
Poosala, V., Ioannidis, Y.E.: Selectivity estimation without the attribute value independence assumption. VLDB 97, 486–495 (1997)
Reiss, F., Garofalakis, M., Hellerstein, J.M.: Compact histograms for hierarchical identifiers. In: VLDB, pp. 870–881 (2006)
Srivastava, U., Haas, P.J., Markl, V., Kutsch, M., Tran, T.M.: ISOMER: consistent histogram construction using query feedback. In: Proceedings of the 22nd International Conference on Data Engineering, 2006. ICDE’06, pp. 39–39. IEEE (2006)
Thaper, N., Guha, S., Indyk, P., Koudas, N.: Dynamic multidimensional histograms. In: Proceedings of the 2002 ACM SIGMOD International Conference on Management of Data, pp. 428–439. ACM (2002)
Vitter, J.S., Wang, M.: Approximate computation of multidimensional aggregates of sparse data using wavelets. In: SIGMOD, vol. 28, pp. 193–204. ACM (1999)
Acknowledgements
This research was supported by the Basic Science Research Program through the National Research Foundation of Korea (NRF) funded by the Ministry of Education (NRF-2016R1D1A1A02937186), as well as by the Next-Generation Information Computing Development Program through the National Research Foundation of Korea (NRF) funded by the Ministry of Science, ICT (NRF-2017M3C4A7063570). It was also supported by the Basic Science Research Program through the National Research Foundation of Korea (NRF) funded by the Ministry of Science, ICT (NRF-2019R1F1A1062511).
Appendix
Proof of Lemma 5
For a pair of ranges \(r\in R\) and \(r'\in R'\), \(R_{\mathsf {req}}(r\,{\bowtie }_{\delta }r')=\{r \oplus z\cdot \delta \,|\, mat_\mathsf {max}(r,r') \le z\cdot \delta \le mat_\mathsf {min}(r,r'), z\in {\mathbb {Z}}\}\). Let \(r_1\) and \(r'_1\) be the front ranges of R and \(R'\), respectively. Then, for every range \(r_{\delta }\in R_{\mathsf {req}}(r\,{\bowtie }_{\delta }r')\) with \(r\in R\) and \(r'\in R'\), we have the following.
Since \(r_{\delta }.\mathsf {min}\ge (R_{\mathsf {F}}.\mathsf {front}).\mathsf {min}\) and \(r_{\delta }.\mathsf {max}\ge (R_{\mathsf {F}}.\mathsf {front}).\mathsf {max}\), \(R_{\mathsf {req}}(R\,{\bowtie }_{\delta }R').\mathsf {front}=R_{\mathsf {F}}.\mathsf {front}\). Similarly, we can show that \(R_{\mathsf {req}}(R\,{\bowtie }_{\delta }R').\mathsf {rear}=R_{\mathsf {R}}.\mathsf {rear}\).
Since \(R_{\mathsf {req}}(r\,{\bowtie }_{\delta }r')\) is a \(\delta \)-shifted range set, there always exists a range \(r''\) in \(R\,{\bowtie }_{\delta }R'\) such that \(r''.\mathsf {min}\) is located in \([(R_{\mathsf {F}}.\mathsf {front}).\mathsf {min},(R_{\mathsf {R}}.\mathsf {rear}).\mathsf {min}]\). Thus, \(R_\mathsf {req}(R\,{\bowtie }_{\delta }R')\) is also a \(\delta \)-shifted range set whose front and rear ranges are \(R_{\mathsf {F}}.\mathsf {front}\) and \(R_{\mathsf {R}}.\mathsf {rear}\), respectively. \(\square \)
Proof of Lemma 6
We prove each case as follows:
(a) When every range in R does not contain any range in \(R'\): We break the proof into two subcases.
(a-1) When \(r_{1}.{\mathsf {min}} <r'_{1}.{\mathsf {min}}\): If \(r_{m}.{\mathsf {min}} \ge r'_{1}.{\mathsf {min}}\) holds, since R is a \(\delta \)-shifted range set, there always exists a range \(r_j \in R\) such that \(r_{j}.{\mathsf {min}} =r'_{1}.{\mathsf {min}}\). Then \(r_j\) contains \(r'_1\), which is a contradiction. Thus, we have \(r_{m}.{\mathsf {min}} <r'_{1}.{\mathsf {min}}\). In this case, \(r_m\) and \(r'_1\) form the pair of ranges whose minimum values are the closest among all pairs of ranges taken from R and \(R'\).
For a pair of ranges \(r_j\in R\) and \(r'_k\in R'\), let \(\mathtt{mbr}(\{r_j, r'_k\})=[e_\mathsf {min}, e_\mathsf {max}]\). Then, we have the property \(e_\mathsf {min} \le \min (r_{m}.{\mathsf {min}}, r'_{1}.{\mathsf {min}})\) from the following inequalities.
Symmetrically, we can show \( \max (r_{m}.{\mathsf {max}},r'_{1}.{\mathsf {max}}) \le e_\mathsf {max} \).
Since \(e_\mathsf {min} \le \min (r_{m}.{\mathsf {min}}, r'_{1}.{\mathsf {min}})\), \(e_\mathsf {max} \ge \max (r_{m}.{\mathsf {max}}, r'_{1}.{\mathsf {max}})\) and \([\min (r_{m}.{\mathsf {min}},r'_{1}.{\mathsf {min}}), \max (r_{m}.{\mathsf {max}},r'_{1}.{\mathsf {max}})]=\mathtt{mbr}(\{r_m, r'_1\})\), \(\mathtt{mbr}(\{r_j, r'_k\})\) always contains \(\mathtt{mbr}(\{r_m, r'_1\})\). That is, every range in \(R\,{\bowtie }_\mathsf {mbr}R'\) contains \(\mathtt{mbr}(\{r_m, r'_1\})\). Thus, by Definition 6, the required range set of \(R\,{\bowtie }_\mathsf {mbr}R'\) becomes \(\{\mathtt{mbr}(\{r_m, r'_1\})\}\).
(a-2) When \(r_{1}.{\mathsf {min}} \ge r'_{m}.{\mathsf {min}}\): We omit the proof since it is similar to that of case (a-1).
(b) When there exists a range in R containing a range in \(R'\): Let \(r'_{k_\mathsf {F}}\) be the range in \(R'\) which has the smallest minimum value among all ranges contained in \(r_{j_\mathsf {F}}\). Then, we first show that, for every pair of ranges \(r_j\in R\) and \(r'_k\in R'\) satisfying \(j\le j_\mathsf {F}\) and \(k\le k_\mathsf {F}\), \(\mathtt{mbr}(\{r_j, r'_k\})\) contains \(\mathtt{mbr}(\{r_{j_\mathsf {F}}, r'_{k_\mathsf {F}}\})\). This implies that such \(\mathtt{mbr}(\{r_j, r'_k\})\)s are not included in \(R_\mathsf {req}(R\,{\bowtie }_\mathsf {mbr}R')\). We consider two subcases: (b-1) \(j_\mathsf {F}=1\) and (b-2) \(j_\mathsf {F}>1\).
(b-1) When \(j_\mathsf {F}=1\): Since \(\mathtt{mbr}(\{r_{1}, r'_{k}\})\) always contains \(r_{1}\) and \(\mathtt{mbr}(\{r_{1}, r'_{k_\mathsf {F}}\})=r_{1}\), all \(\mathtt{mbr}(\{r_{1}, r'_{k}\})\)s with \(k\le k_\mathsf {F}\) contain \(\mathtt{mbr}(\{r_{1}, r'_{k_\mathsf {F}}\})\).
(b-2) When \(j_\mathsf {F}>1\): If \(r_{j_\mathsf {F}}.\mathsf {max}<r'_1.\mathsf {max}\), \(r_{j_\mathsf {F}}\) cannot contain any range in \(R'\). If \(r_{j_\mathsf {F}}.\mathsf {max}>r'_1.\mathsf {max}\), \(r_{j_\mathsf {F}-1}\) contains \(r'_1\), which is a contradiction. Thus, we get \(r_{j_\mathsf {F}}.\mathsf {max}=r'_1.\mathsf {max}\) and \(k_\mathsf {F}=1\). Then, for every \(\mathtt{mbr}(\{r_j, r'_1\})\) with \(j\le j_\mathsf {F}\), since \(\mathtt{mbr}(\{r_j, r'_1\}).\mathsf {max}=r'_1.\mathsf {max}=\mathtt{mbr}(\{r_{j_\mathsf {F}}, r'_1\}).\mathsf {max}\) and \(\mathtt{mbr}(\{r_j, r'_1\}).\mathsf {min}=r_j.\mathsf {min}\le \mathtt{mbr}(\{r_{j_\mathsf {F}}, r'_1\}).\mathsf {min}\), \(\mathtt{mbr}(\{r_j, r'_1\})\) contains \(\mathtt{mbr}(\{r_{j_\mathsf {F}}, r'_{1}\})\).
For every pair of ranges \(r_j\in R\) and \(r'_k\in R'\) satisfying \(j\ge j_\mathsf {R}\) and \(k\ge k_\mathsf {R}\), we can symmetrically show that \(\mathtt{mbr}(\{r_j, r'_k\})\) contains \(\mathtt{mbr}(\{r_{j_\mathsf {R}}, r'_{k_\mathsf {R}}\})\). Thus, we only need to consider \(\mathtt{mbr}(\{r_j, r'_k\})\)s with \(j_\mathsf {F}\le j\le j_\mathsf {R}\) and \(k_\mathsf {F}\le k\le k_\mathsf {R}\). Whenever a range \(r_j\in R\) contains a range \(r'_k\in R'\), we have \(\mathtt{mbr}(\{r_j, r'_k\})=r_j\), which is contained in \(\mathtt{mbr}(\{r_j, r'_{k'}\})\) for every \(r'_{k'}\in R'\). Hence, the required range set of \(\{\mathtt{mbr}(\{r_j, r'_k\})\,|\,j_\mathsf {F}\le j\le j_\mathsf {R},k_\mathsf {F}\le k\le k_\mathsf {R}\}\) becomes \(\{r_j\,|\,j_\mathsf {F}\le j\le j_\mathsf {R}\}\). \(\square \)
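The two cases of Lemma 6 can be checked by brute force on small inputs. The following Python sketch is only an illustration under assumptions that this appendix does not spell out: a range is a closed interval stored as a `(min, max)` tuple, a \(\delta \)-shifted range set consists of copies of a front range shifted by multiples of \(\delta \), and the required range set keeps exactly the ranges that contain no other range of the set. The helper names (`contains`, `mbr`, `required`, `mbr_join`, `shifted`) are hypothetical and are not taken from the paper.

```python
from itertools import product

def contains(a, b):
    """True if closed range a = (min, max) contains range b."""
    return a[0] <= b[0] and b[1] <= a[1]

def mbr(a, b):
    """Minimum bounding range of two ranges."""
    return (min(a[0], b[0]), max(a[1], b[1]))

def required(ranges):
    """Assumed definition: keep ranges that contain no other range in the set."""
    rs = set(ranges)
    return {r for r in rs if not any(s != r and contains(r, s) for s in rs)}

def mbr_join(R, Rp):
    """All pairwise minimum bounding ranges of R and R'."""
    return {mbr(r, rp) for r, rp in product(R, Rp)}

def shifted(front, delta, m):
    """Assumed form of a delta-shifted range set with m ranges."""
    lo, hi = front
    return [(lo + j * delta, hi + j * delta) for j in range(m)]

# Case (a-1): no range of R contains a range of R', and R lies to the left of R'.
R_a = shifted((0, 4), 2, 2)     # (0,4), (2,6)
Rp_a = shifted((5, 8), 2, 2)    # (5,8), (7,10)
case_a = required(mbr_join(R_a, Rp_a))

# Case (b): some ranges of R contain ranges of R'.
R_b = shifted((0, 3), 1, 6)     # (0,3) ... (5,8)
Rp_b = shifted((2, 4), 1, 2)    # (2,4), (3,5)
case_b = required(mbr_join(R_b, Rp_b))

print(case_a, sorted(case_b))
```

In the first input the result is the single range `mbr(R_a[-1], Rp_a[0])`, matching case (a-1). In the second input the result consists of exactly those ranges of `R_b` that contain a range of `Rp_b`, matching case (b).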
Proof of Lemma 7
For a pair of ranges \(r_1\in R\) and \(r_2\in R\), if \(r_1\) contains \(r_2\), then \(\mathtt{mbr}(\{r_1, r_3\})\) contains \(\mathtt{mbr}(\{r_2, r_3\})\) for every range \(r_3\in R'\) by Definition 6, and thus \(\mathtt{mbr}(\{r_1, r_3\})\not \in R_\mathsf {req}(\{\mathtt{mbr}(\{r, r'\})\,|\,r\in R,r'\in R'\})\) by Definition 7. Thus, \(R_\mathsf {req}(\{\mathtt{mbr}(\{r, r'\})\,|\,r\in R,r'\in R'\})=R_\mathsf {req}(\{\mathtt{mbr}(\{r, r'\})\,|\,r\in R_\mathsf {req}(R),r'\in R'\})\). Symmetrically, we can show \(R_\mathsf {req}(\{\mathtt{mbr}(\{r, r'\})\,|\,r\in R_\mathsf {req}(R),r'\in R'\})=R_\mathsf {req}(\{\mathtt{mbr}(\{r, r'\})\,|\,r\in R_\mathsf {req}(R),r'\in R_\mathsf {req}(R')\})\). Thus, \(R_\mathsf {req}(R\,{\bowtie }_\mathsf {mbr}R')=R_\mathsf {req}(R_\mathsf {req}(R)\,{\bowtie }_\mathsf {mbr}R_\mathsf {req}(R'))\). We can similarly prove \(R_\mathsf {req}(R\,{\bowtie }_{\delta }R')=R_\mathsf {req}(R_\mathsf {req}(R)\,{\bowtie }_{\delta } R_\mathsf {req}(R'))\). \(\square \)
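The \(\mathtt{mbr}\) part of Lemma 7 says that pruning each input to its required range set before joining does not change the required range set of the join. The sketch below checks this on one hand-picked input. It assumes, as the proofs above suggest but do not state in this appendix, that the required range set keeps the ranges containing no other range of the set. The function names and the sample ranges are hypothetical.

```python
from itertools import product

def contains(a, b):
    """True if closed range a = (min, max) contains range b."""
    return a[0] <= b[0] and b[1] <= a[1]

def mbr(a, b):
    """Minimum bounding range of two ranges."""
    return (min(a[0], b[0]), max(a[1], b[1]))

def required(ranges):
    """Assumed definition: keep ranges that contain no other range in the set."""
    rs = set(ranges)
    return {r for r in rs if not any(s != r and contains(r, s) for s in rs)}

def mbr_join(R, Rp):
    """All pairwise minimum bounding ranges of R and R'."""
    return {mbr(r, rp) for r, rp in product(R, Rp)}

# Arbitrary range sets with nested ranges, so that pruning removes something.
R = {(0, 10), (2, 6), (3, 5), (8, 12)}
Rp = {(4, 9), (5, 7), (11, 15)}

lhs = required(mbr_join(R, Rp))                      # prune after joining
rhs = required(mbr_join(required(R), required(Rp)))  # prune before and after

print(sorted(lhs), lhs == rhs)
```

Pruning first shrinks the join from 12 pairs to 4 here, which is the practical benefit the lemma offers. A single example does not replace the proof; it only shows the identity on one input.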
Kim, J., Min, JK. & Shim, K. Efficient two-dimensional Haar\(^+\) synopsis construction for the maximum absolute error measure. The VLDB Journal 28, 675–701 (2019). https://doi.org/10.1007/s00778-019-00551-2
DOI: https://doi.org/10.1007/s00778-019-00551-2