# Workload-Aware Self-Tuning Histograms of String Data

## Abstract

In this paper we extend STHoles, a very successful algorithm that uses query results to build and maintain multi-dimensional histograms of numerical data. Our contribution is the formal definition of extensions of all relevant concepts; such that they are independent of the domain of the data, but subsume STHoles concepts as their numerical specialization. At the same time, we also derive specializations for the string domain and implement these into a prototype that we use to empirically validate our approach. Our current implementation uses *string prefixes* as the machinery for describing string ranges. Although weaker than regular expressions, prefixes can be very efficiently applied and can capture interesting ranges in hierarchically structured string domains, such as those of filesystem pathnames and URIs. In fact, we base the empirical validation of the approach on existing, publicly available Semantic Web data where we demonstrate convergence to accurate and efficient histograms.

## 1 Introduction

*Query optimizers* in query processing systems typically rely on *histograms*, data structures that approximate data distribution, in order to be able to apply their cost model. Histograms can be constructed by scanning the database tables and aggregating the values of the attributes in the table; and similarly maintained in the face of database updates.

This histogram lifecycle, however, cannot be efficiently applied to large-scale and frequently updated databases, such as, for example, stores of sensor data. An alternative approach is taken by *adaptive* query processing systems that update their histograms by observing and analysing the results of the queries that constitute the client-requested workload, as opposed to maintenance workload only for updating the histograms. The relevant databases literature focuses on numerical attributes, exploiting the concept of an *interval* as a description of a set of numerical values that is *succinct* and that has a *length* that can be used to estimate the cardinality of many different intervals that have roughly the same density.

In the work described here, we investigate how to extend adaptive query processing so that it can be applied to the domain of *strings*, typically treated as purely categorical symbols that can only be described by enumeration. This, however, disregards the fact that there are several classes of strings that have an internal structure and that can be handled in a more sophisticated manner. Specifically, we use string *prefixes* to expresses ‘intervals’, i.e., sub-spaces of the overall string space that are interesting from the point of view of providing query optimization statistics. Although weaker than regular expressions, prefixes can be very efficiently applied and can capture interesting ranges in hierarchically-structured string domains, such as that of URIs.

This attention on URIs is motivated by their prominent position in the increasingly popular *Semantic Web* and *Linked Data* infrastructures for publishing data. In fact, these paradigms motivate adaptive query processing for a further reason besides the scale of the data: *distributed querying* engines often concentrate loose federations of publicly-readable remote data sources over which the distributed querying engine cannot effect that histograms are maintained and published. Furthermore, the URIs of large-scale datasets are not hand-crafted names but are automatically generated following naming conventions, usually hierarchical. These observations both motivate extending adaptive query processing to Semantic Web data stores and also present an opportunity for our string prefix extension.

In the remainder, we review self-tuning histograms and optimized querying of Semantic Web data (Sect. 2) where we identify STHoles as our starting point, a very successful algorithm for multi-dimensional histograms of numerical data. In Sect. 3 we proceed to formalize the key concepts in STHoles in a way that subsumes STHoles as its specialization for numerical intervals; and to provide an extension to string prefixes. We then present experimental results using our prototype implementation (Sect. 4) and conclude (Sect. 5).

## 2 Background

In their simplest form, self-tuning histograms comprise a set of *buckets* where each bucket holds a range of value of a numerical attribute *a* and the number of results that fall within this range. Buckets are progressively refined from query execution feedback after each selection on *a*, using the difference between the actual result sizes and those estimated by the statistics in the buckets of *a*. Since actual values are not uniformly distributed along the range of values that *a* can have, high frequency buckets are split to smaller (i.e., with narrower range) and more accurate ones; to maintain memory usage, buckets with similar frequencies are merged to reclaim memory space. STGrid [1] extends these ideas to multi-dimensional self-tuning histograms that use query workloads to refine a grid-based histogram structure. These self-tuning histograms are a low-cost alternative to traditional histograms with comparable accuracy. However, since the splitting (or merging) of each bucket entails the splitting (or merging) of several other buckets that could be far away from and unrelated to the original one, overall accuracy is degraded in order to satisfy the grid-partitioning constraint.

To alleviate the poor bucket layout problem of STGrid, STHoles [2] allows buckets to overlap. This more flexible data structure allows STHoles to exploit feedback in a truly multi-dimensional way and is adopted by many subsequent algorithms [3, 4], including the one presented here. STHoles allows for inclusion relationships between buckets, resulting in a tree-structured histogram where each node represents a bucket. Holes are sub-regions of a bucket with different tuple density and are buckets themselves. To refine an STHoles histogram, query results are used to count how many tuples fall inside each bucket of the current histogram. Each partial intersection of query results and a bucket can be used to refine the histogram by drilling new holes, whenever the query results diverge from the prediction made through the bucket’s statistics.

In order to maintain a constant number of buckets, buckets with close tuple densities are merged to make space for new holes. A penalty function measures the difference in approximation accuracy between the old and the new histogram to choose which buckets to merge. Parent-child merges are useful to eliminate buckets that become too similar to their parents; sibling merges are useful to extrapolate frequency distributions to yet unseen regions in the data domain and also to consolidate buckets with similar density that cover nearby regions.

STHoles and, in general, workload-aware self-tuning histograms have been successfully used in relational databases as a low-overhead alternative to statically re-scanning database tables. The resulting histogram is focused towards the current workload, providing more accurate statistics for data regions that are being queried more frequently. Furthermore, they are able to adapt to changes in data distribution and thus are well-suited for datasets with frequently changing contents. They are, however, for the most part targeting numerical attributes, since they exploit the idea that a value range is an indication of the size of the range. Turning our attention to the Semantic Web, the *Resource Description Framework (RDF)* is the dominant standard for expressing information. RDF information is a graph where *properties* (labelled edges) link two resources to each other or one resource to a *literal* (a concrete value). The relevance of this discussion to self-tuning histograms is that RDF uses URIs as abstract symbols that denote resources. Given this prominent role of URIs in RDF data, extending self-tuning histograms to string attributes can have a significant impact in optimizing querying of RDF datasets.

There has been relatively limited amount of work around string selectivity estimation in the field of relational databases. Chaudhuri et al. [5] proposed to collect multiple candidate identifying substrings of a string using, for example, a Markov estimator and build a regression tree as a combination function of their estimated selectivities, in order to alleviate the selectivity underestimation problem of queries involving string predicates in previous methods, which used independence and Markov assumptions. In 2005, Lim et al. [6] introduced CXHist, which is a workload-aware histogram for selectivity estimation supporting a broad class of XML string-based queries. CXHist is the first histogram based on classification that uses feature distributions to summarize queries and quantize their selectivities into buckets and a naive-Bayes classifier to capture the mapping between queries and their selectivity.

Within the Semantic Web community itself, the SWOOGLE search engine collects metadata, such as classes, class instances and properties for web documents and relations between documents [7]. LODStats computes several schema-level statistical values for large-scale RDF datasets using an approach based on *statement streams* [8]. Work more closely related to ours includes RDFStats [9], which is a generator for statistics of RDF sources like SPARQL endpoints. They generate different statistical items such as instances per class and histograms. Unlike our approach, they generate different static histograms (i.e. that must be rebuilt to reflect any changes in the RDF source) per class, property and XML data type. For range estimations on strings, RDFStats mentions three possibilities: (a) one bucket for each distinct string, resulting in large histograms; (b) reducing strings to prefixes; or (c) using a hash function to reduce the number of distinct strings, although no appropriate general-purpose hash function has been identified. However, as Harth et al. [10] have also noted in relation to Q-Trees for indexing RDF triples, hashing URIs is a purely syntactic mapping from URIs to numerical coordinates and fails to take into account the semantic similarity between resources; and no universally good function has been identified.

As URIs the most prominent datatype in the Semantic Web, not having developed an extension that can naturally handle URI strings has left Semantic Web data outside the scope of many developments in self-tuning histograms.

## 3 Self-Tuning String Histograms

In this section we first introduce some basic definitions and then proceed to establish a new histogram structure that extends the structure of the STHoles algorithm with the ability to cover strings. We also present the algorithms that construct and refine this new structure. In our treatment, string ranges are specified by prefixes. Prefixes can naturally express ranges of semantically related resources, given the natural tendency to group together relevant items in hierarchical structures such as pathnames and URIs. Although this mostly contributes to applying selftuning histogram algorithms to Semantic Web data, we discuss how our structure is also amenable to regular expressions or other string patterns so that future extensions can address challenges from the wider data management community.

### 3.1 Preliminaries

Let *D* be a *dimension*, any subset of *D* be a *range* in *D*, and \(\mathcal {P}(D)\) the set of all possible ranges in *D*. A range can be defined either implicitly by constraints over the values of *D* or explicitly by enumeration. Note that \(D\in \mathcal {P}(D)\), meaning that a range does not *need* to impose a restriction but can also include the whole dimension. Let *H* be a histogram of *n* dimensions \(D_1, \ldots D_n\). Let \(\mathrm {V}(H)\) be the set of all possible *n*-dimensional vectors \((r_1, \ldots r_n)\) where \(\forall i \in \left[ 1,n\right] : r_i\in \mathcal {P}(D_i)\)

A histogram is represented as an inclusion hierarchy of *buckets*; we shall use \(B_H\) to denote the set of buckets of a histogram *H*.

**Definition 1**

*bucket*\(b\in B_H\) is an entity of histogram

*H*such that:

*b*is associated with a \(\mathrm {box}(b)\in \mathrm {V}(H)\), the vector that specifies the set of tuples that the bucket describes.*b*is associated with a \(\mathrm {size}(b)\) which indicates the number of tuples that match \(\mathrm {box}(b)\).*b*is associated with*n*values \(\mathrm {dvc}(b,D_i), i=1\dots n\) which indicate the number of distinct values appearing in dimension \(D_i\) of the tuples that match \(\mathrm {box}(b)\).

*b*to be the quantity

**Definition 2**

Every histogram implicitly includes a bucket \(b_\top \) such that \(\mathrm {box}(b_\top )\equiv \left( D_1, \ldots D_n\right) \) that is, the bucket that imposes no restrictions in any of the dimensions of *H* and includes all tuples. We call this the *top bucket*\(b_\top \).

The implication of Definition 2 is that the overall size of the dataset and the number of distinct values in each dimension should be known (or at least approximated) regardless of what query feedback has been received. In our implementation we assume the *root bucket* (the top-most bucket of the hierarchy) as an approximation of the top.

Let \(\mathcal {Q}_H\) be the set of all possible queries over the tables covered by *H*. Regardless of how they are syntactically expressed, we perceive \(\mathcal {Q}_H\) as the set of all possible restrictions over the dimensions of *H*; thus:

**Definition 3**

*query*\(q\in \mathcal {Q}_H\) is an entity of histogram

*H*such that:

*q*is associated with a \(\mathrm {box}(q)\in \mathrm {V}(H)\), the vector that specifies the restrictions expressed by the query*q*is associated with a \(\mathrm {size}(q)\) which indicates the number of tuples that are returned by executing*q*.

In our approach, we assume *string prefixes* as the description language for implicitly defining string ranges. The motivation is discussed in the Introduction (Sect. 1). We use \(\mathrm {Pref}(r)\) to denote the set of prefixes that specify a string range. We define intersection between ranges and between boxes as follows:

**Definition 4**

*range intersection*\(r_1 \Cap r_2\) is defined as follows:

- 1.
If \(r_1,r_2\) are string ranges defined by sets of prefixes, then \(r_1 \Cap r_2 = \{p|(p_1,p_2)\in \mathrm {Pref}(r_1) \times \mathrm {Pref}(r_2)\wedge (p=p_1=p_2\vee \) one of \(p_1,p_2\) is a prefix of the other and

*p*is the longest (more specific) of the two\()\}\) - 2.
If one of the ranges is a string range defined by sets of prefixes (say \(r_1\) without loss of generality) and the other is an explicit set of strings (say \(r_2\)), then \(r_1 \Cap r_2 = \left\{ v|v\in r_2 \wedge \exists p\in r_1:p \ \text {is a prefix of}\ v\right\} \)

- 3.
In any other case, \(r_1 \Cap r_2\) = \(r_1 \cap r_2\).

**Definition 5**

*n*-dimensional histogram

*H*, let \(v_1 = \left( r_{1,1}, \ldots r_{1,n}\right) \) and \(v_2 = \left( r_{2,1}, \ldots r_{2,n}\right) \). We define

*box intersection*:

**Definition 6**

*n*-dimensional histogram

*H*, let \(v_1 = \left( r_{1,1}, \ldots r_{1,n}\right) \) and \(v_2 = \left( r_{2,1}, \ldots r_{2,n}\right) \). We say that \(v_1\)

*encloses*\(v_2\) iff \(\forall i\in \left[ 1,n\right] \) at least one of the following holds:

- 1.
\(r_{2,i} \subseteq r_{1,i} \subset D_i\), that is, none of the ranges is the complete dimension and \(r_{2,i}\) is contained within \(r_{1,i}\).

- 2.
\(r_{2,i} = D_i\) and \(r_{1,i} \subset D_i\), that is, if one of the ranges is the complete dimension then it is enclosed by the one that is not.

- 3.
\(r_{2,i} = r_{1,i} = D_i\), that is, both ranges are the complete dimension.

It should be noted that we have defined an unrestricted dimension as *being enclosed by* (rather than enclosing) a restriction. The rationale behind this will be revisited in conjunction with *bucket merging* (Sect. 3.4).

**Definition 7**

Given two boxes \(v_1, v_2\in \mathrm {V}(H)\) from histogram *H*, \(v_1\)*tightly encloses*\(v_2\) iff \(v_1\)*encloses*\(v_2\) and there is no \(u\in \mathrm {V}(H)\) such that \(v_1 \Supset u \Supset v_2\).

**Definition 8**

*q*the

*best fit*, the set of buckets \(\mathrm {bf}(q)\subseteq \mathbb {B}_H\) such that

**Lemma 1**

For every query there is always a non-empty *best fit*.

*Proof*

There is always at least one bucket that *encloses* any \(\mathrm {box}(q)\), the *top bucket*\(b_\top \) (Definition 2). If there is no other bucket that *encloses*\(\mathrm {box}(q)\), then \(b_\top \)*tightly encloses*\(\mathrm {box}(q)\) (Definition 7) and thus \(\mathrm {bf}(q)=\left\{ b_\top \right\} \), which is non-empty. If there are other buckets that *enclose*\(\mathrm {box}(q)\), then there is also at least one that *tightly encloses* \(\mathrm {box}(q)\), so \(\mathrm {bf}(q)\) is non-empty.

### 3.2 Cardinality Estimation

Being able to predict the size of querying results is important input for query execution optimizers, but the specifics of how this optimization is performed is outside the scope of this paper. We will here proceed to define metrics over the values associated with the buckets of *H* in order to predict \(\mathrm {size}(q), q\in \mathcal {Q}_H\), the number of results returned by *q*. More specifically, we will define an extension of self-tuning histograms that handles string values in a more sophisticated manner than explicit ranges.

In the literature, numerical intervals are used to succinctly define ranges and efficiently decide if a query is enclosed by a bucket or not. The numerical difference between the interval’s starting and ending value is sometimes used to define *range length* and, in multi-dimensional buckets, *bucket volume*: an estimator of the number of tuples in a bucket. Categorical dimensions (including strings) are, on the other hand, either not handled at all or only explicitly defined as sets of values, with bucket volume defined using the cardinality of the value range. Given this, we define range length as follows:

**Definition 9**

*D*and a range \(r \in \mathcal {P}(D)\) we define the function \(\mathrm {length} : \mathcal {P}(D) \rightarrow \mathbb {R}\) as follows:

- 1.
Unrestricted ranges that span the whole dimension have length 0.

- 2.
If

*r*is a numerical range defined by interval [*x*,*y*], then \(\mathrm {length}(r) = y-x+1\). The addition of the unit term guarantees that the length cannot be zero even if \(x=y\), i.e., even if the numerical range is a single number. - 3.
If

*r*is a string range defined by a set of prefixes \(\mathrm {Pref}(r)\), then \(\mathrm {length}(r) = \left| 1+\mathrm {Pref}(r)\right| \), the number of prefixes defining the range. - 4.
In any other case, including explicitly defined numerical and string ranges, then \(\mathrm {length}(r) = \left| r\right| \), the number of distinct values in the range.

What should be noted in Definition 9 is that the only situation in which the length of a range can be 1 is when it is an individual value. This is important for Definition 10 immediately below:

**Definition 10**

*H*, we define the function \(\mathrm {est}_H : \mathrm {V}(H) \rightarrow \mathbb {R}\) as follows:

We propose this function as an estimator of the number of tuples that lie inside *q*, given a histogram.

### 3.3 Histogram Construction and Refinement

The construction of the histogram follows the same high level steps as the STHoles algorithm. In particular, we start with an empty histogram. For each query *q* in the workload, we identify *candidate buckets*\(b_i\) that intersect with *q*.

*q*, we expand its bounding box so that it covers

*q*and we update its statistics as follows:

*candidate holes*\(c_i\). We then shrink each candidate hole to the largest sub-region that does not intersect with the box of any other bucket, we count the exact number of tuples from the result stream that lie inside the shrunk hole and the distinct values count. Then, we determine whether the current density of the candidate bucket is close to the actual density of the candidate hole. If not, we ‘drill’ the candidate hole as a new histogram bucket and we move all children of \(b_i\) that are enclosed by \(c_i\) to the new bucket (Algorithms 1 and 2).

*X*be the set of all buckets that partially intersect with candidate hole \(c_i\). STHoles selects at each step the pair \(\langle x,j \rangle \) that comprises bucket \(x\in X\) and dimension

*j*such that shrinking \(c_i\) along

*j*by excluding

*x*has as a result the smallest reduction of \(c_i\). Instead of checking for the optimal \(\langle x,j \rangle \) our method selects the first pair where shrinking \(c_i\) along

*j*by excluding

*x*results in the smallest

*relative*reduction of \(c_i\)’s

*length*in that dimension, the intuition being that often excluding

*x*will give similar relative reduction along all dimensions. We then shrink \(c_i\), we update participants and repeat the procedure until there are no participants left (Algorithm 3). This may result in a suboptimal shrink, but we avoid examining all possible combinations at each step. Furthermore, in STHoles the number of tuples in this shrunk subregion is estimated assuming uniformity; instead, we measure exactly the number of tuples and distinct values per dimension.

### 3.4 Bucket Merging

In order to limit the number of buckets and memory usage, buckets are *merged* to make space for drilling new holes. Following STHoles, our method looks for *parent-child* or *sibling* buckets that can be merged with minimal impact on the cardinality estimations. We diverge from STHoles when computing the box, size, and dvc associated with the merged bucket as well as in the *penalty* measure that guides the merging process towards merges that have the smallest impact.

Let \(b_1,b_2\) be two buckets in the *n*-dimensional histogram *H* and let \(H'\) be the histogram after the merge and \(b_m\) the bucket in \(H'\) that replaces \(b_1\) and \(b_2\). In the *parent-child* case, one of the two buckets, let that be \(b_1\), *tightly encloses* the other. In this case, we merge \(b_2\) into \(b_1\), so that \(\mathrm {box}(b_m)\equiv \mathrm {box}(b_1)\).

Our *penalty* measure should be proportionate to the absolute differences between the cardinality estimates made from the two original buckets and the cardinality estimates made from the resulting bucket. Let \(Q_1,Q_2\) be two sets of queries that would have \(b_1\) and \(b_2\) resp. be involved in their cardinality estimation. That is: \(\forall k \in \{1,2\}\forall q \in Q_k: b_k\in \mathrm {bf}(q)\)

*n*queries so that for each dimension \(D_i, i=1,\ldots n\) of

*H*there is exactly one query that:

has a specific value for the dimension \(D_i\) and this value is inside the range of \(\mathrm {box}(b_k)\) for \(D_i\). It is not important what this particular value is, since the estimation is the same for all values inside the bucket’s box’s range.

is completely unspecified for all other dimensions.

*H*and \(H'\). As seen in Definition 10, only the terms that involve \(b_k\) are significant for calculating the difference, and thus:

*sibling*merges, let \(b_p\) be the bucket that

*tightly encloses*both \(b_1\) and \(b_2\). The merged bucket encloses siblings \(b_1\) and \(b_2\) and is also extended to enclose any further siblings \(b_c\) that partially overlap with either \(b_1\) or \(b_2\). The size of \(b_m\) is estimated by adding; the distinct values count of \(b_m\) is estimated by the maximum distinct values count among the merged buckets:

- 1.
\(\mathrm {box}(b_p)\)

*tightly encloses*\(\mathrm {box}(b_m)\) - 2.
\(\mathrm {box}(b_m)\)

*tightly encloses*both buckets \(b_1, b_2\) - 3.
\(\mathrm {box}(b_m)\)

*tightly encloses*the boxes of all children of \(b_p\) that either of \(b_1, b_2\) partially interstects with. That is, \(\mathrm {box}(b_m)\) encloses \(\mathrm {box}(b_c)\) for all \(b_c\) such that (i) \(b_p\)*tightly encloses*\(b_c\); and (ii) \(\mathrm {box}(b_1)\)*partially overlaps*\(\mathrm {box}(b_c)\) or \(\mathrm {box}(b_2)\)*partially overlaps*\(\mathrm {box}(b_c)\) - 4.
\(\mathrm {size}(b_m) = \sum \limits _{k=1,2,c_1,\ldots }\mathrm {size}(b_k)\)

- 5.
\(\mathrm {dvc}(b_m) = \max \limits _{k=1,2,c_1,\ldots }\mathrm {dvc}(b_k)\).

*not*merged into \(b_m\), but that the latter is expanded so that it can assume \(b_c\) as its children. This is because in some algorithms (including STHoles), \(\mathrm {box}(b_m)\) can become larger than \(\mathrm {box}(b_1)\cup \mathrm {box}(b_2)\) in order to have a ‘rectangular’ description with a single interval in each dimension. As a result, it might cut across other buckets; \(\mathrm {box}(b_m)\) should then be extended so as to subsume those as children. In order to avoid, however, dropping informative restrictions, STHoles only extends \(\mathrm {box}(b_m)\) along dimensions where the boxes of \(b_c\) do have a restriction. In order to capture this, we have defined the

*encloses*relation (Definition 6) in a way that makes unrestricted dimensions

*enclosed by*(rather than enclosing) restrictions. In analogy to the parent-child case, we define the penalty to be the sum over all \(q\in Q_k\) of the estimation error differences between

*H*and \(H'\):

### 3.5 Discussion

We have defined a multi-dimensional histogram over numerical, string, and categorical data. The core added value of this work is that we introduce the notion of *descriptions* in string dimensions, akin to intervals for numerical dimensions. This has considerable advantages for RDF stores and, more generally, in the Semantic Web and Linked Open Data domain, where URIs have a prominent role and offer the opportunity to exploit the hierarchical structure of their string representation.

We propose *prefixes* as the formalism for expressing string ranges, motivated by its applicability to URI structure. This is no loss of generality, since it is straightforward to use more expressive pattern formalisms (such as regular expressions) without altering the core method but at a considerable computational cost. The only requirement is that *intersection* and some notion of *length* can be defined. Length, in particular, can be used in the way STHoles uses it as an indication of a bucket’s size relative to the size of its parent bucket, although that also depends on the specifics of the estimation functions and is not required. In the function we propose (Definition 10), the only requirement is that length equals 1 for fixed singleton values and more than 1 for intervals and prefixes.

This is related to the fact that for range queries we return the statistics of the bucket that more tightly encloses the query, instead of returning an estimation based on the ratio of the volume occupied by the query to the volume of the overall bucket. In other words, we use *length* more as a metric of the size of description, rather than a metric of the bucket size (the number of tuples that fit this description). To compensate, we exactly measure in query results (rather than estimate) bucket size when shrinking buckets, compensating for the extra computational time by avoiding examining all combinations of buckets \(\times \) dimensions (cf. Sect. 3.3).

Furthermore, we can afford a larger number of buckets in the same amount of memory (and thus more accurate estimations) than if strings were treated as categorical values.

A further improvement would be to define the *length* of string prefixes in a way that multi-dimensional volume calculated from these lengths accurately estimates the number of tuples in the buckets. An even more ambitious future goal is to define the length of string ranges in a way that it can be combined with numerical range length, so that multi-dimensional and heterogeneous (strings and numbers) buckets can be assigned a meaningful volume.

A limitation of our algorithm is that when we merge two sibling buckets we assign to the resulting bucket the sum of the sizes of the merged buckets and of the children of the resulting bucket, which is an overestimation of the real size. Furthermore, we also assign as distinct value count the maximum of the distinct value counts of these buckets, which is an underestimation of the real distinct value count. These estimations will persist until subsequent workload queries effect an update of merged bucket’s statistics and will be used in cardinality estimations. Besides empirically testing and tuning these estimators, we are also planning to extend the theoretical framework so that estimated values are represented as ranges or distributions, and subsequent calculations take into account the whole range or the distribution parameters rather than a single value.

In general, and despite these limitations, our framework is an accurate theoretical account of STHoles, a state-of-the-art algorithm for self-tuning multi-dimensional numerical histograms, and an extension to heterogeneous numerical/string histograms that is backwards-compatible with STHoles.

## 4 Experiments

To empirically validate our approach, the algorithm presented above has been implemented in Java as the *STRHist* module of the *Semagrow Stack*, an optimized distributed querying system for the Semantic Web.^{1} The execution flow of the Semagrow Stack starts with client queries, analysed to build an optimal *query plan*. The optimizer relies on cardinality statistics (produced by STRHist) in order to provide an execution plan for the Semagrow *Query Execution Engine*. This engine, besides joining results and serving them to the client application, also forwards to STRHist measurements collected during query execution. STRHist analyses these query feedback logs in batches to maintain the histograms that are used by the optimizer. The histogram is persisted in RDF stores using the Sevod vocabulary [11], but STRHist effects updates on an in-memory representation. This representation is a tree of *bucket* objects, starting for a single root. For our proof-of-concept system we assume that all the buckets fit into the main memory of a single machine.

The expected behaviour of the algorithm is to improve estimates by adding buckets that punch holes and add sub-buckets in areas where there is a difference between the actual statistics and the histogram estimates. Considering how client applications access some ‘areas’ more heavily than others, the algorithm zooms into such critical regions to provide more accurate statistics. Naturally, the more interesting observations relate to the effect of merges as soon as the available space is exhausted.

We applied STRHist to the AGRIS bibliographic database on agricultural research and technology maintained by the *Food and Agriculture Organization of the UN*. AGRIS comprises approximately 23 million RDF triples describing 4 million distinct publications with standard bibliographic attributes.^{2} AGRIS consolidates data from more than 150 institutions from 65 countries. Bibliography items are denoted by URIs that are constructed following a convention that includes the location of the contributing institution and incorporation AGRIS. As scientific output increases through the years and since there is considerable variation in the countries, there are interesting generalizations to be captured by patterns over publication URIs.

We define a 3-dimensional histogram over subject, predicate and object variables. Subject URIs are represented as strings^{3} while predicate URIs are treated as categorical values, since there is always a small number of distinct predicates. Each bucket is composed of a 3-dimensional subject/predicate/object bounding box, a size indicating the number of triples contained in the bucket, and the number of distinct subjects, predicates and objects.

*W*is generated by randomly selecting subject URIs from the dataset

*DB*and given a histogram

*H*over

*D*we estimate the size of the results of hypothetical queries that select rows that have this particular value. We then measure the

*average absolute estimation error*and the

*root mean square error*of histogram

*H*on the evaluation workload

*W*:

*q*and \(\mathrm {act}_{D}(q)\) is the actual number of tuples in

*D*that satisfy

*q*.

Estimation error (RMS and absolute) versus training batch and merges (Parent-Child and Sibling merges). Configured for a maximum of 100 buckets.

Estimation error (RMS and absolute) versus memory usage, after executing all 24 batches.

## 5 Conclusions

In this paper, we have presented an algorithm for building and maintaining multi-dimensional histograms exploiting query feedback. Our algorithm is based on STHoles algorithm, but can also handle URIs by using prefixes as a succinct description of string ranges and store additional statistics (distinct value counts) as well.

As future work, we would like to use more expressive pattern formalisms (such as regular expressions), in order to describe sub-spaces of the URI space and better exploit naming conventions of URIs. It would be also interesting to identify clusters of URIs in the result stream, and therefore describe a string range using a set of prefixes/regular expressions, instead of a single prefix/regular expression. We will also try to lift some of the limitations described above. Furthermore, our algorithm needs to be evaluated in terms of accuracy, convergence, efficiency and adaptability to dataset changes, using automatically generated queries to form a workload for experimentation. Query feedback will be obtained from different real data sources in order to test the response of the system. Among them, there will be constrained and unconstrained queries with focus on the variability of subject and object values against constant predicates. Last but not least, we plan to provide a more scalable and robust implementation of the algorithm, which will be able to serve detailed and accurate data summaries and statistics about extremely large datasets.

## Footnotes

- 1.
*STRHist*is available at https://bitbucket.org/acharal/strhistFor more details on Semagrow, please see http://www.semagrow.eu.

- 2.
Please see http://agris.fao.org for more details on AGRIS. The AGRIS site mentions 7 million distinct publications, but this includes recent additions that are not in end-2013 data dump used for these experiments.

- 3.
We use the

*canonical string representation*of URIs as defined in Sect. 2, IETF RFC 7320 (http://tools.ietf.org/html/rfc7320).

## Notes

### Acknowledgements

The work described here was partially carried out at the 2014 edition of the *International Research-Centred Summer School*, held at NCSR ‘Demokritos’, Athens, Greece, 3–30 July 2014. For more details please see http://irss.iit.demokritos.gr

The research leading to these results has received funding from the European Union’s Seventh Framework Programme (FP7/2007–2013) under grant agreement No. 318497. More details at http://www.semagrow.eu.

### References

- 1.Aboulnaga, A., Chaudhuri, S.: Self-tuning histograms: Building histograms without looking at data. In: Proceedings of the 1999 ACM International Conference on Management of Data (SIGMOD 1999), pp. 181–192. ACM (1999)Google Scholar
- 2.Bruno, N., Chaudhuri, S., Gravano, L.: STHoles: a multidimensional workload-aware histogram. In: Proceedings of 2001 ACM International Conference on Management of Data (SIGMOD 2001), pp. 211–222. ACM (2001)Google Scholar
- 3.Srivastava, U., Haas, P.J., Markl, V., Kutsch, M., Tran, T.M.: ISOMER: Consistent histogram construction using query feedback. In: Proceedings of the 22nd International Conference on Data Engineering (ICDE 2006). IEEE Computer Society (2006)Google Scholar
- 4.Roh, Y.J., Kim, J.H., Chung, Y.D., Son, J.H., Kim, M.H.: Hierarchically organized skew-tolerant histograms for geographic data objects. In: Proceedings of 2010 ACM International Conference on Management of Data (SIGMOD 2010), pp. 627–638. ACM (2010)Google Scholar
- 5.Chaudhuri, S., Ganti, V., Gravano, L.: Selectivity estimation for string predicates: Overcoming the underestimation problem. In: Proceedings of 20th International Conference on Data Engineering (ICDE 2004). IEEE Computer Society (2004)Google Scholar
- 6.Lim, L., Wang, M., Vitter, J.S.: CXHist: An on-line classification-based histogram for XML string selectivity estimation. In: Proceedings of the 31st International Conference on Very Large Data Bases (VLDB 2005), Trondheim, Norway, 30 August – 2 September 2005, pp. 1187–1198 (2005)Google Scholar
- 7.Ding, L., Finin, T., Joshi, A., Pan, R., Cost, R.S., Peng, Y., Reddivari, P., Doshi, V., Sachs, J.: Swoogle: A search and metadata engine for the Semantic Web. In: Proceedings of the 13th ACM International Conference on Information and Knowledge Management (CIKM 2004), pp. 652–659. ACM (2004)Google Scholar
- 8.Auer, S., Demter, J., Martin, M., Lehmann, J.: LODStats – an extensible framework for high-performance dataset analytics. In: ten Teije, A., Völker, J., Handschuh, S., Stuckenschmidt, H., d’Acquin, M., Nikolov, A., Aussenac-Gilles, N., Hernandez, N. (eds.) EKAW 2012. LNCS, vol. 7603, pp. 353–362. Springer, Heidelberg (2012) Google Scholar
- 9.Langegger, A., Wöss, W.: RDFStats - an extensible RDF statistics generator and library. In: Proceedings of DEXA 2009, pp. 79–83 (2009)Google Scholar
- 10.Harth, A., Hose, K., Karnstedt, M., Polleres, A., Sattler, K.U., Umbrich, J.: Data summaries for on-demand queries over linked data. In: Proceedings of 19th International World Wide Web Conference (WWW 2010), Raleigh, NC, USA, 26–30 April 2010 (2010)Google Scholar
- 11.Charalambidis, A., Konstantopoulos, S., Karkaletsis, V.: Dataset descriptions for optimizing federated querying. In: Poster Track, Companion Volume to the Procedings of the 24th Intl World Wide Web Conference (WWW 2015), Florence, Italy, 18–22 May 2015. ACM (2015)Google Scholar