HINT: A Hierarchical Interval Index for Allen Relationships

Indexing intervals is a fundamental problem with a wide range of applications, most notably in temporal and uncertain databases. We propose HINT, a novel and efficient in-memory index for range selection queries over interval collections. HINT applies a hierarchical partitioning approach, which assigns each interval to at most two partitions per level and has controlled space requirements. We reduce the information stored at each partition to the absolute minimum by dividing the intervals in it based on whether they begin inside or before the partition boundaries. In addition, our index includes storage optimization techniques for the effective handling of data sparsity and skewness. We show how HINT can be used to efficiently process queries based on Allen's relationships. Experiments on real and synthetic interval sets of different characteristics show that HINT is typically one order of magnitude faster than existing interval indexing methods.


Introduction
A wide range of applications require managing large collections of intervals. In temporal databases [37,6], each tuple has a validity interval, which captures the period of time that the tuple is valid. In statistics and probabilistic databases [14], uncertain values are often approximated by (confidence or uncertainty) intervals. In data anonymization [36], attribute values are often generalized to value ranges. XML data indexing techniques [27] encode label paths as intervals and evaluate path expressions using containment relationships between the intervals. Several computational geometry problems [5] (e.g., windowing) use interval search as a module. The internal states of window queries in stream processors (e.g., Flink/Kafka) can be modeled and managed as intervals [2]. Event detection systems [12] represent the time periods where events are active as time intervals. Matching of event patterns as relationships between intervals is studied in [23].
We study the classic problem of indexing a large collection S of objects (or records), based on an interval attribute that characterizes each object. Hence, we model each object s ∈ S as a triple ⟨s.id, s.st, s.end⟩, where s.id is the object's identifier (which can be used to access any other attribute of the object), and [s.st, s.end] is the interval associated with s. Our focus is on selection queries, the most fundamental query type over intervals. Given a query interval q = [q.st, q.end], the objective is to find the ids of all objects s ∈ S whose intervals overlap with q, i.e., which satisfy a generalized OVERLAPS (G-OVERLAPS) relationship. In addition, we study the retrieval of data intervals that satisfy one of Allen's interval algebra relationships [1] with q. Allen's algebra is used for describing precise relationships between intervals. Modeling the relative positions of temporal data finds many applications, from manufacturing processes and machine faults to business processes in general [20]. Selection queries are also known as time travel or timeslice queries in temporal databases [35]. Stabbing queries (pure-timeslice queries in temporal databases) are a special class of selection queries for which q.st = q.end and the predicate is CONTAINED BY.
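The G-OVERLAPS test on closed intervals reduces to two comparisons. The following one-liner is an illustrative sketch (names are ours, not the paper's):

```python
def g_overlaps(s_st, s_end, q_st, q_end):
    """Closed intervals [s_st, s_end] and [q_st, q_end] overlap iff
    neither one ends before the other starts."""
    return s_end >= q_st and s_st <= q_end
```

Note that with closed endpoints, intervals that merely touch (e.g., s.end = q.st) count as overlapping.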
Without loss of generality, we assume that the intervals and queries are closed at both ends.^1 Examples of selection queries include the following:
- on a relation storing employment periods: find all employees who were employed sometime inside the [1/1/2021, 2/28/2021] range (G-OVERLAPS); find all employees who started working for a company at 1/1/2021 and stopped before 2/28/2021 (STARTS);
- on uncertain temperatures: find all stations having temperature between 6 and 8 degrees with a nonzero probability (G-OVERLAPS); find all stations having temperatures which are definitely lower/higher than 25 degrees (BEFORE/AFTER).
For efficient selection queries over collections of intervals, classic data structures for managing intervals, like the interval tree [18], are typically used. Competitive indexing methods include the timeline index [21], 1D-grids, and the period index [4]. All these methods, which we review in detail in Section 2, were not optimized for handling very large collections of intervals in main memory. Hence, there is room for new data structures, which exploit the characteristics and capabilities of modern machines that have large enough memory capacities for the scale of data found in most applications.
Contributions. In this paper, we propose a novel and general-purpose Hierarchical index for INTervals (HINT), suitable for applications that manage large collections of intervals. HINT defines a hierarchical decomposition of the domain and assigns each interval in S to at most two partitions per level. If the domain is relatively small and discrete, our index can evaluate G-OVERLAPS queries requiring no comparisons at all. For the general case where the domain is large and/or continuous, we propose a version of HINT, denoted by HINT^m, which limits the number of levels to m + 1 and greatly reduces the space requirements. HINT^m conducts comparisons only for the intervals in the first and last accessed partitions at the bottom levels of the index. Some of the unique and novel characteristics of our index include:
- The intervals in each partition are further divided into groups, based on whether they begin inside or before the partition. This division (1) cancels the need for detecting and eliminating duplicate query results, (2) reduces the data accesses to the absolutely necessary, and (3) minimizes the space needed for storing the objects into the partitions.
- We theoretically prove that the expected number of HINT^m partitions for which comparisons are necessary is at most four. This guarantees fast retrieval times, independently of the query extent and position.

^1 Our index can easily be adapted to manage intervals and/or process selection queries which are open at either or both sides, i.e., [o.st, o.end), (o.st, o.end] or (o.st, o.end).

Table 1: Comparison of interval indices

index                   | query cost | space  | updates
Interval tree [18]      | medium     | low    | slow
Timeline index [21]     | medium     | medium | slow
1D-grid                 | medium     | medium | fast
Period index [4]        | medium     | medium | fast
HINT/HINT^m (our work)  | low        | low    | fast
- The optimized version of our index stores the intervals in all partitions at each level sequentially and uses a dedicated array with just the ids of the intervals there, as well as links between non-empty partitions at each level. These optimizations facilitate sequential access to the query results at each level, while avoiding accessing unnecessary data.
- We show the necessary additional comparisons and accesses on HINT^m for each relationship in Allen's algebra. In addition, we show that HINT^m without the storage optimization is directly suitable for processing queries using all Allen's relationships, while maintaining the excellent performance of HINT^m for G-OVERLAPS queries.
- We show how an index-based nested loops approach for G-OVERLAPS interval joins that uses HINT^m on the inner join input outperforms the state-of-the-art join method when the outer input is relatively small.

Table 1 compares HINT to previous work, based on our experiments on real and synthetic datasets. Our index is typically one order of magnitude faster than the competition. As we explain in Section 2, existing indices typically require at least one comparison for each query result (interval tree, 1D-grid) or may access and compare more data than necessary (timeline index, 1D-grid). Further, the 1D-grid, the timeline index and the period index need more space than HINT in the presence of long intervals in the data, due to excessive replication either in their partitions (1D-grid, period index) or their checkpoints (timeline index). HINT gracefully supports updates, as each partition (or division within a partition) is independent from the others. The building cost of HINT is also low, as we verify experimentally. Overall, HINT is superior in all aspects to the state-of-the-art and constitutes an important contribution, given that selection queries over large interval collections are a fundamental problem with numerous applications.
Comparison to our previous work. This article extends our previous work [13] in three directions. First, we elaborate on the model for tuning the value of the parameter m for HINT^m. Specifically, we include a new experiment which confirms the intuition behind our proposed model. Second, we study HINT^m performance for G-OVERLAPS interval joins. Finally, we study the evaluation of selection queries under all relationships in Allen's algebra; [13] considered only the G-OVERLAPS relationship. We show that HINT^m achieves excellent performance, independently of the query predicate.
Outline. Section 2 reviews related work and presents in detail the characteristics and weaknesses of existing interval indices. In Section 3, we present HINT and its generalized HINT^m version, and analyze their complexity. Focusing primarily on the G-OVERLAPS relationship, optimizations that boost the performance of HINT^m are presented in Section 4, and the first part of our experimental analysis on real and synthetic data against the state-of-the-art is presented in Section 5. Then, Section 6 discusses the necessary changes to HINT^m for efficiently evaluating selection queries under Allen's algebra relationships, and Section 7 follows up with the second part of our experiments. Finally, Section 8 concludes the paper with a discussion about future work.

Related Work
In this section, we present in detail the state-of-the-art main-memory indices for intervals, to which we experimentally compare HINT in Section 5.In addition, we briefly discuss other relevant data structures and previous work on other queries over interval data.
Interval tree. One of the most popular data structures for intervals is Edelsbrunner's interval tree [18], a binary search tree, which takes O(n) space and answers queries in O(log n + K) time (K is the number of query results). The tree divides the domain hierarchically by placing all intervals strictly before (after) the domain's center to the left (right) subtree and all intervals that overlap with the domain's center at the root. This process is repeated recursively for the left and right subtrees using the centers of the corresponding sub-domains. The intervals assigned to each tree node are sorted in two lists based on their starting and ending values, respectively. Interval trees are used to answer selection (i.e., stabbing and range) queries. For example, Figure 1 shows a set of 14 intervals s_1, ..., s_14, which are assigned to 7 interval tree nodes, and a query interval q = [q.st, q.end]. The domain point c corresponding to the tree's root is contained in the query interval, hence all intervals in the root are reported and both the left and right children of the root have to be visited recursively. Since the left child's point c_L is before q.st, we access the END list from the end and report results until we find an interval s for which s.end < q.st; then we access recursively the right child of c_L. This process is repeated symmetrically for the root's right child c_R.

Fig. 1: Example of an interval tree

The main drawback of the interval tree is that we need to perform comparisons for most of the intervals in the query result. In addition, updates on the tree can be slow because the lists at each node should be kept sorted. A relational interval tree for disk-resident data was proposed in [24].
Timeline index. The timeline index [21] is a general-purpose access method for temporal (versioned) data in SAP HANA. It keeps the endpoints of all intervals in an event list, which is a table of ⟨time, id, isStart⟩ triples, where time is the value of the start or end point of the interval, id is the identifier of the interval, and isStart is 1 or 0, depending on whether time corresponds to the start or the end of the interval, respectively. The event list is sorted primarily by time and secondarily by isStart (descending). In addition, at certain timestamps, called checkpoints, the entire set of active object ids is materialized, that is, the intervals that contain the checkpoint. For each checkpoint, there is a link to the first triple in the event list for which isStart = 0 and time is greater than or equal to the checkpoint. Figure 2(a) shows a set of five intervals s_1, ..., s_5 and Figure 2(b) exemplifies a timeline index for them.
To evaluate a selection query (called time-travel query in [21]), we first find the largest checkpoint which is smaller than or equal to q.st (e.g., c_2 in Figure 2) and initialize R as the active interval set at the checkpoint (e.g., R = {s_1, s_3, s_5}). Then, we scan the event list from the position pointed to by the checkpoint, until the first triple for which time ≥ q.st, and update R by inserting to it the intervals corresponding to an isStart = 1 event and deleting the ones corresponding to an isStart = 0 triple (e.g., R becomes {s_3, s_5}). When we reach q.st, all intervals in R are guaranteed query results and they are reported. We continue scanning the event list until the first triple after q.end and add to the result the ids of all intervals corresponding to triples with isStart = 1 (e.g., s_2 and s_4).
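The two-phase scan just described can be sketched as follows. This is a minimal illustration (names are ours): it assumes an event list of (time, id, is_start) triples sorted by time (with is_start descending on ties) and, for simplicity, scans from the beginning of the list, whereas the real index uses the checkpoint to skip the prefix and start from the materialized active set.

```python
def timeline_query(events, q_st, q_end):
    active = set()
    i = 0
    # Phase 1: replay events before q.st to reconstruct the active set.
    while i < len(events) and events[i][0] < q_st:
        _, iid, is_start = events[i]
        (active.add if is_start else active.discard)(iid)
        i += 1
    result = set(active)  # every interval active at q.st overlaps q
    # Phase 2: collect intervals that start inside [q.st, q.end].
    while i < len(events) and events[i][0] <= q_end:
        _, iid, is_start = events[i]
        if is_start:
            result.add(iid)
        i += 1
    return result
```

For instance, with s1 = [1, 4], s2 = [6, 9], s3 = [2, 7], the query [5, 8] reports {s2, s3}.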
The timeline index accesses more data and performs more comparisons than necessary during query evaluation. In the worst-case scenario, where almost all intervals span almost the entire domain, all checkpoints will include almost all intervals, so the space complexity is O(n · C), where C is the number of checkpoints.

1D-grid. A simple and practical data structure for intervals is a 1D-grid, which divides the domain into p partitions P_1, P_2, ..., P_p. The partitions are pairwise disjoint in terms of their interval span and collectively cover the entire data domain D. Each interval is assigned to all partitions that it overlaps with. Figure 3 shows 5 intervals assigned to p = 4 partitions; s_1 goes to P_1 only, while s_5 goes to all four partitions. Given a query q, the results can be obtained by accessing each partition P_i that overlaps with q. For each P_i which is contained in q (i.e., q.st ≤ P_i.st ∧ P_i.end ≤ q.end), all intervals in P_i are guaranteed to overlap with q. For each P_i which overlaps with q but is not contained in q, we should compare each s_i ∈ P_i with q to determine whether s_i is a query result. If the interval of a query q overlaps with multiple partitions, duplicate results may be produced. An efficient approach for handling duplicates is the reference value method [17], which was originally proposed for rectangles but can be directly applied to 1D intervals. For each interval s found to overlap with q in a partition P_i, we compute v = max{s.st, q.st} as the reference value and report s only if v ∈ [P_i.st, P_i.end]. Since v is unique, s is reported in only one partition. In Figure 3, interval s_4 is reported only in P_2, which contains value max{s_4.st, q.st}.
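The reference-value de-duplication can be sketched as follows (an illustrative sketch, assuming partitions[i] holds the (st, end) pairs assigned to P_i and bounds[i] = (P_i.st, P_i.end); all names are ours):

```python
def grid_query(partitions, bounds, q_st, q_end):
    result = []
    for part, (p_st, p_end) in zip(partitions, bounds):
        if p_end < q_st or p_st > q_end:
            continue  # partition does not overlap the query
        for st, end in part:
            if end >= q_st and st <= q_end:  # s overlaps q
                v = max(st, q_st)            # reference value
                if p_st <= v <= p_end:       # report only in the partition containing v
                    result.append((st, end))
    return result
```

An interval replicated across several partitions is reported exactly once: only the partition containing its reference value emits it.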
The 1D-grid has two drawbacks. First, the duplicate results should be computed and checked before being eliminated by the reference value. Second, if the collection contains many long intervals, the index may grow large in size due to excessive replication, which increases the number of duplicate results to be eliminated.

Period index. The period index [4] is a self-adaptive structure based on domain partitioning, specialized for G-OVERLAPS and duration queries. The time domain is split into coarse partitions as in a 1D-grid and then each partition is divided hierarchically, in order to organize the intervals assigned to the partition based on their positions and durations. Figure 4 shows a set of intervals and how they are partitioned in a period index. There are two primary partitions P_1 and P_2 and each of them is divided hierarchically into three levels. Each level corresponds to a duration length and each interval is assigned to the level corresponding to its duration. The top level stores intervals shorter than the length of a division there, the second level stores longer intervals but shorter than a division there, and so on. Hence, each interval is assigned to at most two divisions, except for intervals which are assigned to the bottom-most level, which can go to an arbitrary number of divisions. During query evaluation, only the divisions that overlap with the query interval are accessed; if the query carries a duration predicate, the divisions that are shorter than the query duration are skipped. For G-OVERLAPS queries, the period index performs on par with the interval tree and the 1D-grid [4], so we also compare against this index in Section 5. In the worst case, the space complexity is O(n · C), where C is the number of coarse partitions, and each query and update costs O(C) time (i.e., same as the 1D-grid).
Other indexing works. Another classic data structure for intervals is the segment tree [5], a binary search tree with O(n log n) space complexity that answers stabbing queries in O(log n + K) time. The segment tree is not designed for G-OVERLAPS queries, for which it requires a duplicate result elimination mechanism. In computational geometry [5], indexing intervals was studied as a subproblem within orthogonal 2D range search; typically, the worst-case optimal interval tree is used. Indexing intervals regained interest with the advent of temporal databases [6]. For temporal data, a number of indices have been proposed for secondary memory, mainly for effective versioning and compression [3,26]. Such indices are tailored for historical versioned data, while we focus on arbitrary interval sets, queries, and updates.
Interval joins and aggregation. Additional research on indexing intervals does not address selection queries, but other operations such as temporal aggregation [22,29,21] and interval joins [15,33,8,7,10,34,11,12,38,16]. The timeline index [21] can be directly used for temporal aggregation. Piatov et al. [32] presented plane-sweep algorithms that extend the timeline index to support aggregation over fixed intervals, sliding window aggregates, and MIN/MAX aggregates. Timeline was later adapted for interval overlap joins [33,34]. In Section 5.5, we consider our proposed indexing for join computation in an index-based nested-loops fashion, and compare it against the state-of-the-art algorithm optFS from [10]. Similar to previous work, optFS builds on a highly optimized variant of plane-sweep to join un-indexed collections of intervals. A domain partitioning technique for parallel processing of interval joins was proposed in [8,7,10]. Alternative partitioning techniques for interval joins were proposed in [15,11]. Partitioning techniques for interval joins cannot replace interval indices as they are not designed for selection queries. Temporal joins on Allen's algebra relationships for RDF data were studied in [12]. Multi-way interval joins in the context of temporal k-clique enumeration were studied in [38]. Awad et al. [2] define interval events of the same or different types that are observed in succession in data streams. Analytical operations based on aggregation or reasoning can be used to formulate composite interval events.

HINT
In this section, we propose the Hierarchical index for INTervals (HINT), which defines a hierarchical domain decomposition and assigns each interval to at most two partitions per level. The primary goal of the index is to minimize the number of comparisons during query evaluation, while keeping the space requirements relatively low, even when there are long intervals in the collection. HINT applies a smart division of the intervals in each partition into two groups, which avoids the production and handling of duplicate query results and minimizes the number of accessed intervals. In Section 3.1, we present a version of HINT which avoids comparisons altogether during query evaluation, but is not always applicable and may have high space requirements. Section 3.2 presents HINT^m, the general version of our index, used for intervals in arbitrary domains. Last, Section 3.3 describes our analytical model for setting the m parameter and Section 3.4 discusses updates. Table 2 summarizes the notation used in the paper.

Fig. 5: Hierarchical partitioning and assignment of interval [5, 9]

A comparison-free version of HINT
We first describe a version of HINT which is appropriate for a discrete and not very large domain. Specifically, assume that the domain D, wherefrom the endpoints of the intervals in S take values, is [0, 2^m − 1]. We can define a regular hierarchical decomposition of D into partitions, where at each level ℓ from 0 to m, there are 2^ℓ partitions, denoted by P_{ℓ,0}, ..., P_{ℓ,2^ℓ−1}. Figure 5 illustrates the hierarchical domain partitioning for m = 4. Each interval s ∈ S is assigned to the smallest set of partitions from all levels which collectively define s. It is not hard to show that s will be assigned to at most two partitions per level. For example, in Figure 5, interval [5, 9] is assigned to one partition at level ℓ = 4 and two partitions at level ℓ = 3. The assignment procedure is described by Algorithm 1. In a nutshell, for an interval [a, b], starting from the bottom-most level ℓ, if the last bit of a (resp. b) is 1 (resp. 0), we assign the interval to partition P_{ℓ,a} (resp. P_{ℓ,b}) and increase a (resp. decrease b) by one. We then update a and b by cutting off their last bits (i.e., integer division by 2, or bitwise right-shift). If, at the next level, a > b holds, the indexing of [a, b] is done.
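The assignment loop just described can be sketched as follows (an illustrative sketch of the procedure; `assign` is our name, and endpoints are assumed to lie in the inclusive [0, 2^m − 1] domain of this section):

```python
def assign(a, b, m):
    """Assign interval [a, b] to partitions, at most two per level,
    walking from level m upwards; returns (level, offset) pairs."""
    parts = []
    level = m
    while level >= 0 and a <= b:
        if a & 1:                     # last bit of a is 1
            parts.append((level, a)); a += 1
        if b & 1 == 0:                # last bit of b is 0
            parts.append((level, b)); b -= 1
        a >>= 1; b >>= 1              # cut off the last bits
        level -= 1
    return parts
```

For the running example, assign(5, 9, 4) yields P_{4,5}, P_{3,3} and P_{3,4}; a stabbing interval such as [7, 7] lands in a single bottom-level partition, and [0, 15] in the single root partition.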

Query evaluation
A selection query q can be evaluated by finding, at each level, the partitions that overlap with q. Specifically, the partitions that overlap with the query interval q at level ℓ are partitions P_{ℓ,prefix(ℓ,q.st)} to P_{ℓ,prefix(ℓ,q.end)}, where prefix(k, x) denotes the k-bit prefix of integer x. We call these partitions relevant to the query q. All intervals in the relevant partitions are guaranteed to overlap with q, while intervals that appear in no relevant partition cannot possibly overlap with q. However, since the same interval s may exist in multiple partitions that overlap with a query, s may be reported multiple times in the query result.
We propose a technique that avoids the production, and therefore the need for elimination, of duplicates and, at the same time, minimizes the number of data accesses. For this, we divide the intervals in each partition P_{ℓ,i} into two groups: originals P^O_{ℓ,i} and replicas P^R_{ℓ,i}. Group P^O_{ℓ,i} contains all intervals s ∈ P_{ℓ,i} that begin at P_{ℓ,i}, i.e., prefix(ℓ, s.st) = i. Group P^R_{ℓ,i} contains all intervals s ∈ P_{ℓ,i} that begin before P_{ℓ,i}, i.e., prefix(ℓ, s.st) ≠ i.^2 Each interval is added as an original in only one partition of HINT. For example, interval [5, 9] in Figure 5 is added to P^O_{4,5}, P^R_{3,3}, and P^R_{3,4}. Given a query q, at each level ℓ of the index, we report all intervals in the first relevant partition P_{ℓ,f} (i.e., P^O_{ℓ,f} ∪ P^R_{ℓ,f}). Then, for every other relevant partition P_{ℓ,i}, i > f, we report all intervals in P^O_{ℓ,i} and ignore P^R_{ℓ,i}. This guarantees that no result is missed and no duplicates are produced. The reason is that each interval s appears as an original in just one partition; hence, reporting only originals cannot produce any duplicates. At the same time, all replicas P^R_{ℓ,f} in the first partition per level ℓ that overlaps with q begin before q and overlap with q, so they should be reported. On the other hand, replicas P^R_{ℓ,i} in subsequent relevant partitions (i > f) contain intervals which are either originals in a previous partition P_{ℓ,j}, j < i, or replicas in P^R_{ℓ,f}; so, they can safely be skipped. Algorithm 2 describes the search algorithm using HINT.

ALGORITHM 2: Searching HINT
Input: HINT index H, query interval q
Output: set R of all intervals that overlap with q
1  R ← ∅;
2  foreach level ℓ in H do
3      p ← prefix(ℓ, q.st);                    ▷ 1st relevant partition
4      R ← R ∪ P^O_{ℓ,p} ∪ P^R_{ℓ,p};
5      foreach i ← p + 1 to prefix(ℓ, q.end) do
6          R ← R ∪ P^O_{ℓ,i};                  ▷ originals only
7  return R;

Fig. 6: Accessed partitions for query [5, 9]
For example, consider the hierarchical partitioning of Figure 6 and a query interval q = [5, 9]. The binary representations of q.st and q.end are 0101 and 1001, respectively. The relevant partitions at each level are shown in bold (blue) and dashed (red) lines and can be determined by the corresponding prefixes of 0101 and 1001. At each level ℓ, all intervals (both originals and replicas) in the first partition P_{ℓ,f} (bold/blue) are reported, while in the subsequent partitions (dashed/red), only the original intervals are.
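The comparison-free search can be sketched on the running example. This is a minimal sketch assuming a dictionary layout index[level][offset] with 'O'/'R' lists (our own illustrative representation, not the paper's storage scheme); the toy index below is exactly what inserting the single interval s0 = [5, 9] with m = 4 produces:

```python
# s0 = [5, 9]: original in P_{4,5}, replica in P_{3,3} and P_{3,4}.
index = {
    4: {5: {'O': ['s0'], 'R': []}},
    3: {3: {'O': [], 'R': ['s0']}, 4: {'O': [], 'R': ['s0']}},
}

def search(index, q_st, q_end, m=4):
    result = []
    for level in range(m + 1):
        shift = m - level
        first, last = q_st >> shift, q_end >> shift  # prefix(level, .)
        for off in range(first, last + 1):
            part = index.get(level, {}).get(off)
            if part is None:
                continue
            result += part['O']       # originals: always reported
            if off == first:          # replicas: first relevant partition only
                result += part['R']
    return result
```

Any query overlapping [5, 9] reports s0 exactly once, with no endpoint comparisons at all.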
Discussion. The version of HINT described above finds all query results without any comparisons. Hence, in each partition P_{ℓ,i}, we only have to keep the ids of the intervals that are assigned to P_{ℓ,i} and do not have to store/replicate the interval endpoints. Further, the relevant partitions at each level are computed by fast bit-shifting operations, which are comparison-free. Hence, we expect a pipelined execution, as CPU branch mispredictions are reduced. To use HINT for arbitrary integer domains, we should first normalize all interval endpoints by subtracting the minimum endpoint, in order to convert them to values in a [0, 2^m − 1] domain (the same transformation should be applied on the queries). If the required m is very large, we can index the intervals based on their m-bit prefixes and support approximate search on the discretized data. Approximate search can also be applied on intervals in a real-valued domain, after rescaling and discretization in a similar way.

HINT^m: indexing arbitrary intervals
We now present a generalized version of HINT, denoted by HINT^m, which can be used for intervals in arbitrary domains. HINT^m uses a hierarchical domain partitioning with m + 1 levels, based on a [0, 2^m − 1] domain D; each raw interval endpoint is mapped to a value in D by linear rescaling. The mapping function is f(x) = ⌊(x − min(x)) · (2^m − 1) / (max(x) − min(x))⌋, where min(x) and max(x) are the minimum and maximum interval endpoints in the dataset S, respectively. Each raw interval [s.st, s.end] is mapped to interval [f(s.st), f(s.end)]. The mapped interval is then assigned to at most two partitions per level in HINT^m, using Algorithm 1.
For ease of presentation, we will assume that the raw interval endpoints take values in [0, 2^m′ − 1], where m′ > m, which means that the mapping function f simply outputs the m most significant bits of its input. As an example, assume that m = 4 and m′ = 6. Interval [21, 38] (= [0b010101, 0b100110]) is mapped to interval [5, 9] (= [0b0101, 0b1001]) and assigned to partitions P_{4,5}, P_{3,3}, and P_{3,4}, as shown in Figure 5. So, in contrast to HINT, the set of partitions whereto an interval s is assigned in HINT^m does not define s, but the smallest interval in the [0, 2^m − 1] domain D which covers s. As in HINT, at each level ℓ, we divide each partition P_{ℓ,i} into P^O_{ℓ,i} and P^R_{ℓ,i}, to avoid duplicate results.
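Under this assumption, the mapping is a plain right shift, illustrated below on the example above (variable names are ours; m_raw stands for m′):

```python
# With m = 4 and m' = 6, f keeps the 4 most significant of the 6 bits,
# i.e., a right shift by m' - m = 2 positions.
m, m_raw = 4, 6

def f(x):
    return x >> (m_raw - m)

mapped = (f(0b010101), f(0b100110))  # raw interval [21, 38]
```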

Query evaluation using HINT^m
For a query q, simply reporting all intervals in the relevant partitions at each level (as in Algorithm 2) would produce false positives. Instead, comparisons to the query endpoints may be required for the first and the last partition at each level that overlap with q. Specifically, we can consider each level of HINT^m as a 1D-grid (see Section 2) and go through the partitions at each level ℓ that overlap with q. For the first partition P_{ℓ,f}, we verify whether s overlaps with q for each interval s ∈ P^O_{ℓ,f} and each s ∈ P^R_{ℓ,f}. For the last partition P_{ℓ,l}, we verify whether s overlaps with q for each interval s ∈ P^O_{ℓ,l}. For each partition P_{ℓ,i} between P_{ℓ,f} and P_{ℓ,l}, we report all s ∈ P^O_{ℓ,i} without any comparisons. As an example, consider the HINT^m index and the query interval q shown in Figure 7. The identifiers of the partitions relevant to q are shown in the figure (along with some indicative intervals that are assigned to these partitions). At level m = 4, we have to perform comparisons for all intervals in the first relevant partition P_{4,5}. In partitions P_{4,6}, ..., P_{4,8}, we just report the originals in them as results, while in partition P_{4,9} we compare the start points of all originals with q, before we can confirm whether they are results or not. The following lemma reduces the necessary comparisons:

Lemma 1 At every level ℓ, each s ∈ P^R_{ℓ,f} is a query result iff q.st ≤ s.end. If l > f, each s ∈ P^O_{ℓ,l} is a query result iff s.st ≤ q.end.
Proof For the first relevant partition P_{ℓ,f} at each level ℓ, for each replica s ∈ P^R_{ℓ,f}, s.st < q.st, so q.st ≤ s.end suffices as an overlap test. For the last relevant partition P_{ℓ,l}, if l > f, for each original s ∈ P^O_{ℓ,l}, q.st < s.st, so s.st ≤ q.end suffices as an overlap test.

Avoiding redundant comparisons
One of our most important findings in this study, and a powerful feature of HINT^m, is that at most levels it is not necessary to do comparisons at the first and/or the last partition. For instance, in the previous example, we do not have to perform comparisons for partition P_{3,4}, since any interval assigned to P_{3,4} must overlap with P_{4,8}, and the interval spanned by P_{4,8} is covered by q. This means that the start point of every interval in P_{3,4} is guaranteed to be before q.end (which is inside P_{4,9}). In addition, observe that for any relevant partition which is the last partition at an upper level and covers P_{3,4} (i.e., partitions {P_{2,2}, P_{1,1}, P_{0,0}}), we do not have to conduct the s.st ≤ q.end tests, as intervals in these partitions are guaranteed to start before P_{4,9}. The lemma below formalizes these observations:

Lemma 2 If the first (resp. last) relevant partition for a query q at level ℓ (ℓ < m) starts (resp. ends) at the same value as the first (resp. last) relevant partition at level ℓ + 1, then for every first (resp. last) relevant partition P_{v,f} (resp. P_{v,l}) at levels v ≤ ℓ, each interval s ∈ P_{v,f} (resp. s ∈ P_{v,l}) satisfies s.end ≥ q.st (resp. s.st ≤ q.end).
Proof Let P.st (resp. P.end) denote the first (resp. last) domain value of partition P. Consider the first relevant partition P_{ℓ,f} at level ℓ and assume that P_{ℓ,f}.st = P_{ℓ+1,f}.st. Then, for every interval s ∈ P_{ℓ,f}, s.end ≥ P_{ℓ+1,f}.end; otherwise, s would have been allocated to P_{ℓ+1,f} instead of P_{ℓ,f}. Further, P_{ℓ+1,f}.end ≥ q.st, since P_{ℓ+1,f} is the first partition at level ℓ + 1 which overlaps with q. Hence, s.end ≥ q.st. Moreover, for every interval s ∈ P_{v,f} with v < ℓ, s.end ≥ P_{ℓ+1,f}.end holds, as interval P_{v,f} covers interval P_{ℓ,f}; so, we also have s.end ≥ q.st. Symmetrically, we can prove that if P_{ℓ,l}.end = P_{ℓ+1,l}.end, then for each s ∈ P_{v,l}, v ≤ ℓ, s.st ≤ q.end.
We next focus on how to rapidly check the condition of Lemma 2. Essentially, if the last bit of the offset f (resp. l) of the first (resp. last) partition P_{ℓ,f} (resp. P_{ℓ,l}) relevant to the query at level ℓ is 0 (resp. 1), then the first (resp. last) partition at level ℓ − 1 above satisfies the condition. For example, in Figure 7, consider the last relevant partition P_{4,9} at level 4. The last bit of l = 9 is 1; so, the last partition P_{3,4} at level 3 satisfies the condition and we do not have to perform comparisons in the last partitions at level 3 and above.
Algorithm 3 is a pseudocode for HINT^m search. The algorithm accesses all index levels, bottom-up. It uses two auxiliary flags compfirst and complast to mark whether it is necessary to perform comparisons at the current level (and all levels above it) at the first and the last partition, respectively, according to the discussion in the previous paragraph. At each level ℓ, we find the offsets of the partitions relevant to the query, based on the ℓ-prefixes of q.st and q.end (Line 4). For the first position f = prefix(ℓ, q.st), the partitions holding originals and replicas, P^O_{ℓ,f} and P^R_{ℓ,f}, are accessed. The algorithm first checks whether f = l, i.e., whether the first and the last partitions coincide. In this case, if compfirst and complast are set, then we perform all comparisons in P^O_{ℓ,f} and apply the first observation of Lemma 1 to P^R_{ℓ,f}. Else, if only complast is set, we can safely skip the q.st ≤ s.end comparisons; if only compfirst is set, regardless of whether f = l, we just perform the q.st ≤ s.end comparisons on both the originals and the replicas in the first partition. If neither compfirst nor complast is set, we just report all intervals in the first partition as results. If we are at the last partition P_{ℓ,l} and l > f (Line 17), then we just examine P^O_{ℓ,l} and apply only the s.st ≤ q.end test for each interval there, according to Lemma 1. Last, for all partitions in between the first and the last, we simply report all original intervals there.
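The logic above can be condensed into a short sketch. It assumes raw endpoints in [0, 2^m′ − 1] so that the mapping is a right shift by shift0 = m′ − m; the dictionary layout and all names (build, search, shift0) are illustrative, not the paper's actual storage scheme:

```python
def build(intervals, m, shift0):
    index = {}
    def add(level, off, st, end):
        part = index.setdefault(level, {}).setdefault(off, {'O': [], 'R': []})
        grp = 'O' if (st >> shift0) >> (m - level) == off else 'R'
        part[grp].append((st, end))
    for st, end in intervals:
        a, b = st >> shift0, end >> shift0       # mapped endpoints
        level = m
        while level >= 0 and a <= b:             # Algorithm 1 assignment
            if a & 1:
                add(level, a, st, end); a += 1
            if b & 1 == 0:
                add(level, b, st, end); b -= 1
            a >>= 1; b >>= 1; level -= 1
    return index

def search(index, q_st, q_end, m, shift0):
    result, compfirst, complast = [], True, True
    for level in range(m, -1, -1):               # bottom-up
        shift = shift0 + (m - level)
        f, l = q_st >> shift, q_end >> shift     # first/last relevant offset
        for off in range(f, l + 1):
            part = index.get(level, {}).get(off)
            if part is None:
                continue
            need_st = compfirst and off == f     # test q.st <= s.end?
            need_end = complast and off == l     # test s.st <= q.end?
            for st, end in part['O']:
                if (not need_st or end >= q_st) and (not need_end or st <= q_end):
                    result.append((st, end))
            if off == f:                          # replicas: first partition only
                for st, end in part['R']:
                    if not need_st or end >= q_st:
                        result.append((st, end))
        if f & 1 == 0:                            # Lemma 2: skip start tests above
            compfirst = False
        if l & 1:                                 # Lemma 2: skip end tests above
            complast = False
    return result
```

Once a flag is cleared at some level, it stays cleared for all levels above, mirroring the monotone condition of Lemma 2.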

Complexity Analysis
Let n be the number of intervals in S. Assume that the domain is [0, 2^m′ − 1], with m′ > m. To analyze the space complexity of HINT m, we first prove the following:

Lemma 3 The total number of intervals assigned at the lowest level m of HINT m is expected to be n.

(Algorithm 3: HINT m search. Input: HINT m index H, query interval q. Output: set R of intervals that overlap with q.)

Proof Each interval s ∈ S will go to zero, one, or two partitions at level m, based on the bits of s.st and s.end at position m (see Algorithm 1); on average, s will go to one partition.
Using Algorithm 1, when an interval is assigned to a partition at a level ℓ, the interval is truncated (i.e., shortened) by 2^(m′−ℓ). Based on this, we analyze the space complexity of HINT m as follows.
Theorem 1 Let λ be the average length of the intervals in the input collection S. The space complexity of HINT m is O(n · log(2^(log λ − m′ + m) + 1)).
Proof Based on Lemma 3, each s ∈ S will be assigned on average to one partition at level m and will be truncated by 2^(m′−m). Following Algorithm 1, at the next level m − 1, s is again expected to be assigned to one partition (see Lemma 3) and truncated by 2^(m′−m+1), and so on, until the entire interval is truncated (condition a ≤ b is violated at Line 3 of Algorithm 1). Hence, we are looking for the number of levels to which each s will be assigned, i.e., for the smallest k for which 2^(m′−m) + 2^(m′−m+1) + ... + 2^(m′−m+k−1) ≥ λ. Solving the inequality gives k ≥ log(2^(log λ − m′ + m) + 1), and the space complexity of HINT m is O(n · k).
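The smallest such k follows directly from the inequality; a quick sketch (base-2 logarithms, function name hypothetical):

```python
import math

def expected_levels(lam, m_prime, m):
    # Smallest k with 2^(m'-m) + 2^(m'-m+1) + ... + 2^(m'-m+k-1) >= lam,
    # i.e., k >= log2(lam * 2^(m - m') + 1) = log2(2^(log2(lam) - m' + m) + 1).
    return math.ceil(math.log2(lam * 2 ** (m - m_prime) + 1))
```

For instance, with m′ = m, an average length λ = 7 gives k = 3 (1 + 2 + 4 = 7), so each interval is expected to occupy about three levels.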
For the computational cost of queries in terms of conducted comparisons: in the worst case, O(n) intervals are assigned to the first relevant partition P m,f at level m and O(n) comparisons are required. To estimate the expected cost of query evaluation in terms of conducted comparisons, we assume a uniform distribution of intervals to partitions and random query intervals.

Lemma 4
The expected number of HINT m partitions for which we have to conduct comparisons is four.
Proof At the last level m of the index, we definitely have to do comparisons in the first and the last partition (which are different in the worst case). At level m − 1, for each of the first and last partitions, we have a 50% chance to avoid comparisons, due to Lemma 2. Hence, the expected number of partitions for which we have to perform comparisons at level m − 1 is 1. Similarly, at level m − 2, each of the still-active first/last partitions has a 50% chance to avoid comparisons. Overall, under worst-case conditions, where m is large and q is long, the expected number of partitions for which we need to perform comparisons is 2 + 1 + 0.5 + 0.25 + ... = 4.

Theorem 2
The expected number of comparisons during query evaluation over HINT m is O(n/2^m).
Proof For each query, we conduct comparisons at least in the first and the last relevant partitions at level m. The expected number of intervals in each of these two partitions is O(n/2^m), considering Lemma 3 and assuming a uniform distribution of the intervals in the partitions. In addition, due to Lemma 4, the number of expected additional partitions that require comparisons is 2, and each of these two partitions is also expected to hold at most O(n/2^m) intervals, by applying Lemma 3 to the levels above m and using the truncated intervals after their assignment to level m (see Algorithm 1). Hence, q is expected to be compared with O(n/2^m) intervals in total, and the cost of each such comparison is O(1).
In the worst case, all data intervals fall at the topmost level ℓ = 0 and the queries fall inside [2^m − 2, 2^m − 1]; in this (extreme) case, the query cost is O(n), as all intervals are compared with each query.

Setting m
As shown in Section 3.2.3, the space requirements and the search performance of HINT m depend on the value of m. For large values of m, the cost of accessing comparison-free results will dominate the computational cost of comparisons. We conduct an analytical study for estimating m opt: the smallest value of m which is expected to result in a HINT m with search performance close to the best possible, while achieving the lowest possible space requirements. Our study uses simple statistics, namely the number of intervals n = |S|, the mean length λ s of the data intervals, and the mean length λ q of the query intervals. We assume that the endpoints and the lengths of intervals and queries are uniformly distributed.
The overall cost of query evaluation consists of (1) the cost for determining the relevant partitions per level, denoted by C p, (2) the cost of conducting comparisons between data intervals and the query, denoted by C cmp, and (3) the cost of accessing the query results in the partitions for which we do not have to conduct comparisons, denoted by C acc. Cost C p is negligible, as the partitions are determined by a small number m of bit-shifting operations. To estimate C cmp, we need to estimate the number of intervals in the partitions where we need to conduct comparisons and multiply this by the expected cost β cmp per comparison. To estimate C acc, we need to estimate the number of intervals in the corresponding partitions and multiply this by the expected cost β acc of (sequentially) accessing and reporting one interval. β cmp and β acc are machine-dependent and can easily be estimated by experimentation.
According to Algorithm 3, unless λ q is smaller than the length of a partition at level m, there will be two partitions that require comparisons at level m, one partition at level m − 1, etc., with the expected number of partitions being at most four (see Lemma 4). Hence, we can assume that C cmp is practically dominated by the cost of processing the two partitions at the lowest level m.
As each partition at level m is expected to have n/2^m intervals (see Lemma 3), we have C cmp = 2 · β cmp · n/2^m. Then, the number of accessed intervals for which we expect to apply no comparisons is |Q| − 2 · n/2^m, where |Q| is the total number of expected query results. Under this, we have C acc = β acc · (|Q| − 2 · n/2^m). We can estimate |Q| using the selectivity analysis for (multidimensional) intervals and queries in [31] as |Q| = n · (λ s + λ q)/Λ, where Λ is the length of the entire domain containing all intervals in S (i.e., Λ = max ∀s∈S s.end − min ∀s∈S s.st).
With C cmp and C acc , we now estimate m opt .First, we gradually increase m from 1 to its max value m ′ (determined by Λ), and compute the expected cost C cmp + C acc .For m = m ′ , HINT m corresponds to the comparisonfree HINT with the lowest expected cost.Then, we select as m opt the lowest value of m for which C cmp +C acc converges to the cost of the m = m ′ case.

Updates
We handle insertions into an existing HINT/HINT m by calling Algorithm 1 for each new interval s. Small adjustments are needed for HINT m to add s to the originals division at its first partition assignment, i.e., to P O ℓ,a or P O ℓ,b, and to the replicas division for every other partition, i.e., to P R ℓ,a or P R ℓ,b. Further, we handle deletions using tombstones, similarly to previous studies [25,30] and recent indexing approaches [19]. Given an interval s for deletion, we first search the index to locate all partitions that contain s (both as original and as replica) and then replace s.id by a special "tombstone" id to signal the logical deletion. Each insertion costs O(m) time, as an interval is added to up to 2m partitions and finding the partitions at each level costs O(1) time. By running the same algorithm, we find the partitions that include an interval to be deleted in O(m) time. Last, we handle modifications to an existing interval via a deletion followed by an insertion.
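A sketch of this update scheme (the TOMBSTONE sentinel and the tuple layout are illustrative; the partition enumeration re-runs the assignment loop of Algorithm 1):

```python
TOMBSTONE = -1   # special id marking a logically deleted interval

def partitions_of(st, end, m):
    # Re-run the Algorithm 1 assignment to list the (level, offset)
    # partitions covering [st, end]: at most 2 per level, O(m) overall.
    out, a, b = [], st, end
    for level in range(m, -1, -1):
        if a > b:
            break
        if a % 2 == 1:
            out.append((level, a)); a += 1
        if b % 2 == 0:
            out.append((level, b)); b -= 1
        a, b = a // 2, b // 2
    return out

def insert(index, s, m):
    # For this sketch each partition is one list of (id, st, end) triples.
    for key in partitions_of(s[1], s[2], m):
        index.setdefault(key, []).append(s)

def delete(index, s, m):
    # Tombstone deletion: locate the O(m) partitions that contain s and
    # overwrite its id; physical cleanup can happen in a later batch.
    for key in partitions_of(s[1], s[2], m):
        for pos, t in enumerate(index.get(key, [])):
            if t[0] == s[0]:
                index[key][pos] = (TOMBSTONE, t[1], t[2])
```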

Optimizing HINT m
In this section, we discuss optimization techniques which greatly improve the performance of HINT m (and HINT) in practice. First, we show how to reduce the number of partitions in HINT m where comparisons are performed and how to avoid accessing unnecessary data. Next, we show how to handle very sparse or skewed data at each level of HINT/HINT m. Another optimization is decoupling the storage of the interval ids from the storage of the interval endpoints in each partition. Finally, we revisit updates under the prism of these optimizations.

Subdivisions and space decomposition
Recall that, at each level ℓ of HINT m, every partition P ℓ,i is divided into P O ℓ,i (holding originals) and P R ℓ,i (holding replicas). We propose to further divide each P O ℓ,i into P Oin ℓ,i and P Oaft ℓ,i, so that P Oin ℓ,i (resp. P Oaft ℓ,i) holds the intervals from P O ℓ,i that end inside (resp. after) partition P ℓ,i. Similarly, each P R ℓ,i is divided into P Rin ℓ,i and P Raft ℓ,i.
Queries that overlap with multiple partitions. Consider a query q which overlaps with a sequence of more than one partition at level ℓ. As already discussed, if we have to conduct comparisons in the first such partition P ℓ,f, we should do so for all intervals in P O ℓ,f and P R ℓ,f. By subdividing P O ℓ,f and P R ℓ,f, we get the following lemma:

Lemma 5 If P ℓ,f ≠ P ℓ,l, then (1) each interval s in P Oin ℓ,f ∪ P Rin ℓ,f overlaps with q iff s.end ≥ q.st; and (2) all intervals s in P Oaft ℓ,f ∪ P Raft ℓ,f surely overlap with q.

(Fig. 8: Partition subdivisions in HINT m (level ℓ = 2).)
Proof Follows directly from the fact that q starts inside P ℓ,f but ends after P ℓ,f. Hence, we need just one comparison for each interval in P Oin ℓ,f ∪ P Rin ℓ,f, whereas we can report all intervals in P Oaft ℓ,f ∪ P Raft ℓ,f as query results with no comparisons. As already discussed, for all partitions P ℓ,i between P ℓ,f and P ℓ,l, we just report the intervals in P Oin ℓ,i ∪ P Oaft ℓ,i as results, with no comparisons, whereas for the last partition P ℓ,l, we perform one comparison per interval in P Oin ℓ,l ∪ P Oaft ℓ,l.
Queries that overlap with a single partition. If the query q overlaps only one partition P ℓ,f at level ℓ, we can use the following lemma to minimize the necessary comparisons:

Lemma 6 If P ℓ,f = P ℓ,l, then each interval s in P Oin ℓ,f overlaps with q iff s.st ≤ q.end ∧ q.st ≤ s.end; each interval s in P Oaft ℓ,f overlaps with q iff s.st ≤ q.end; each interval s in P Rin ℓ,f overlaps with q iff s.end ≥ q.st; and all intervals in P Raft ℓ,f overlap with q.

Proof All intervals s ∈ P Oaft ℓ,f end after q, so s.st ≤ q.end suffices as an overlap test. All intervals s ∈ P Rin ℓ,f start before q, so s.end ≥ q.st suffices as an overlap test. All intervals s ∈ P Raft ℓ,f start before and end after q, so they are guaranteed results.
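The four tests of Lemma 6 map directly to code; a sketch over a toy partition layout (the dictionary keys are hypothetical names for the four subdivisions):

```python
def report_single_partition(subdivs, q_st, q_end):
    # Lemma 6: the query overlaps exactly one partition. Each
    # subdivision needs at most the comparisons shown below;
    # the 'R_aft' subdivision is comparison-free.
    out = [s for s in subdivs['O_in'] if s[1] <= q_end and s[2] >= q_st]
    out += [s for s in subdivs['O_aft'] if s[1] <= q_end]  # ends after the partition
    out += [s for s in subdivs['R_in'] if s[2] >= q_st]    # starts before the partition
    out += subdivs['R_aft']                                # guaranteed results
    return out
```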
Overall, the subdivisions help us minimize the number of intervals in each partition for which we have to apply comparisons. Figure 8 shows the subdivisions accessed by query q at level ℓ = 2 of a HINT m index. In partition P ℓ,f = P 2,1, all four subdivisions are accessed, but comparisons are needed only for the intervals in P Oin 2,1 and P Rin 2,1.

Sorting the intervals in each subdivision
We can keep the intervals in each subdivision sorted, in order to reduce the number of comparisons for the queries that access them. For example, consider the last partition P ℓ,l that overlaps with a query q at a level ℓ. If the intervals s in P Oin ℓ,l are sorted on their start endpoint (i.e., s.st), we can simply access and report the intervals until the first s ∈ P Oin ℓ,l such that s.st > q.end. Alternatively, we can perform binary search to find the first s ∈ P Oin ℓ,l with s.st > q.end and then scan and report all intervals before s. Table 3 (second column) summarizes the sort orders for each of the four subdivisions of a partition that can be beneficial in query evaluation. For a subdivision P Oin ℓ,i, intervals may have to be compared based on their end point (if P ℓ,i = P ℓ,f), based on their start point (if P ℓ,i = P ℓ,l), or based on both points (if P ℓ,i = P ℓ,f = P ℓ,l). We choose to sort based on s.st to accommodate two of these three cases. For a subdivision P Oaft ℓ,i, intervals may have to be compared only based on their start point (if P ℓ,i = P ℓ,l). For a subdivision P Rin ℓ,i, intervals may have to be compared only based on their end point (if P ℓ,i = P ℓ,f). Last, for a subdivision P Raft ℓ,i, there is never any need to compare the intervals, so no order provides any benefit. Overall, sorting reduces the expected number of comparisons per query for P Oaft ℓ,l and P Rin ℓ,f to O(log(n/2^m)), but the expected cost for P Oin remains O(n/2^m). Under this, the worst-case query cost remains O(n + K), where K is the number of query results, derived from Theorem 2.
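For instance, with P Oin ℓ,l kept sorted by s.st, the last-partition test reduces to a single binary search; a sketch using Python's bisect (in practice the start points would be a stored key column, not rebuilt per query):

```python
from bisect import bisect_right

def report_last_partition(o_in_sorted_by_st, q_end):
    # Intervals are (id, st, end) triples sorted by st. Everything up
    # to the first interval with st > q_end qualifies; bisect_right
    # finds that cut point with O(log n) comparisons.
    starts = [s[1] for s in o_in_sorted_by_st]
    return o_in_sorted_by_st[:bisect_right(starts, q_end)]
```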

Storage optimization
So far, we have assumed that each interval s is stored in the partitions whereto s is assigned as a triplet ⟨s.id, s.st, s.end⟩. However, if we split the partitions into subdivisions, we do not need to keep all information about the intervals in them. Specifically, for each subdivision P Oin ℓ,i, we may need to use s.st and/or s.end for each interval s ∈ P Oin ℓ,i, while for each subdivision P Oaft ℓ,i, we may need s.st for each s ∈ P Oaft ℓ,i, but we will never need s.end. For the intervals s of each subdivision P Rin ℓ,i, we may need s.end, but we will never use s.st. Finally, for each subdivision P Raft ℓ,i, we just have to keep the s.id identifiers of the intervals. Table 3 (third column) summarizes the data that we need to keep from each interval in the subdivisions of each partition. Since each interval s is stored as an original just once in the entire index, but as a replica in possibly multiple partitions, space can be saved by storing only the necessary data, especially if the intervals span multiple partitions. Note that even when we do not apply the subdivisions, but just use P O ℓ,i and P R ℓ,i (as suggested in Section 3.2), we do not need to store the start points s.st of the intervals in P R ℓ,i, as they are never used in comparisons.

Handling data skewness and sparsity
Data skewness and sparsity may cause many partitions to be empty, especially at the lowest levels of HINT (i.e., for large values of ℓ). Recall that a query accesses a sequence of multiple P O ℓ,i partitions at each level ℓ. Since the intervals are physically distributed in the partitions, this results in unnecessary accesses of empty partitions and may cause cache misses. We propose a storage organization where all P O ℓ,i divisions at the same level ℓ are merged into a single table T O ℓ and an auxiliary index is used to find each non-empty division. The auxiliary index locates the first non-empty partition whose offset is greater than or equal to the ℓ-prefix of q.st (e.g., via binary search or a binary search tree). From thereon, the non-empty partitions which overlap with the query interval are accessed sequentially and distinguished with the help of the auxiliary index. Hence, the contents of the P O ℓ,i's relevant to each query are always accessed sequentially. Figure 9(a) shows an example at level ℓ = 4 of HINT m, where only the non-empty partitions among the total 2^ℓ = 16 at the level are physically stored and indexed.

To avoid performing a binary search at every level, we add to the auxiliary index a link from each partition P O ℓ,i to the partition P O ℓ−1,j at the level above, such that j is the smallest number greater than or equal to i ÷ 2 for which partition P O ℓ−1,j is not empty. Hence, instead of performing binary search at level ℓ−1, we use the link from the first partition P O ℓ,f relevant to the query at level ℓ and (if necessary) apply a linear search backwards from the pointed partition P O ℓ−1,j to identify the first non-empty partition P O ℓ−1,f that overlaps with q. Figure 9(b) shows an example, where each non-empty partition at level ℓ is linked with the first non-empty partition with a greater than or equal prefix at the level ℓ−1 above. Given the query example q, we use the auxiliary index to find the first non-empty partition P O 4,5 which overlaps with q, and then follow its link to reach the first non-empty partition at level 3 which overlaps with q. We repeat this to get partition P O 2,3 at level 2, which, however, is not guaranteed to be the first one overlapping with q, so we search backwards for the first non-empty partition at level 2 that overlaps with q.
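The auxiliary structure can be as simple as a sorted array of the non-empty offsets per level, binary-searched once and then scanned; a minimal sketch:

```python
from bisect import bisect_left

def scan_level(offsets, contents, f, l):
    # offsets: sorted offsets of the non-empty partitions at one level;
    # contents[i]: interval ids stored for offsets[i] (a slice of T^O).
    # Binary-search the first non-empty partition >= f, then walk the
    # non-empty partitions sequentially while their offsets stay <= l.
    out, i = [], bisect_left(offsets, f)
    while i < len(offsets) and offsets[i] <= l:
        out += contents[i]
        i += 1
    return out
```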

Reducing cache misses
At most levels of HINT m, no comparisons are conducted and the only operations are processing the interval ids which qualify the query. Even at the levels ℓ where comparisons are required, these are restricted to the first and the last relevant partitions, P O ℓ,f and P O ℓ,l, and no comparisons are needed for the partitions in-between. Summing up, when accessing any (sub-)partition for which no comparison is required, we do not need any information about the intervals except for their ids. Hence, in our implementation, for each (sub-)partition, we store the ids of all intervals in it in a dedicated array (the ids column) and the interval endpoints (wherever necessary) in a different array.4 If we need the id of an interval that qualifies a comparison, we access the corresponding position of the ids column. This storage organization greatly improves search performance by reducing cache misses, because for the intervals that do not require comparisons, we only access their ids and not their endpoints. This optimization is orthogonal to and applied in combination with the strategy of Section 4.2, i.e., we store all P O divisions at each level ℓ in a single table T O ℓ, which is decomposed into a column that stores the ids and another table for the endpoint data of the intervals. We exemplify the ids column in Figure 9(a). If, for a sequence of partitions at a level, we do not have to perform any comparisons, we just access the sequence of interval ids that are part of the answer, which is implied by the position of the first such partition (obtained via the auxiliary index). In this example, all intervals in P O 4,5 and P O 4,6 are guaranteed to be query results without any comparisons and can be sequentially accessed from the ids column, without having to access the endpoints of the intervals. The auxiliary index guides the search by identifying and distinguishing between the partitions for which comparisons should be conducted (e.g., P O 4,8) and those for which they are not necessary.
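A sketch of the decoupled layout (the class and field names are illustrative, not the paper's code):

```python
class Subdivision:
    # The ids and the endpoint data live in separate parallel arrays,
    # so a comparison-free report touches only the compact ids column.
    def __init__(self, triples, keep_endpoints=True):
        self.ids = [s[0] for s in triples]
        # Endpoint data is kept only where comparisons may be needed
        # (e.g., never for a replicas-after subdivision).
        self.pts = [(s[1], s[2]) for s in triples] if keep_endpoints else None

    def report_all(self):
        return list(self.ids)          # never reads the endpoint array

    def report_if(self, pred):
        # pred(st, end) encodes the comparison required by Lemma 5/6.
        return [i for i, p in zip(self.ids, self.pts) if pred(*p)]
```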

Updates
A version of HINT m that uses all the techniques from Sections 4.1-4.2 is optimized for query operations. Under this, the index cannot efficiently support individual updates, i.e., new intervals inserted one-by-one; dealing with updates in batches is a better fit. This is a common practice for other update-unfriendly indices, e.g., the inverted index in IR. Yet, for mixed workloads (i.e., with both queries and updates), we adopt a hybrid setting where a delta index is maintained to digest the latest updates as discussed in Section 3.4,5 and a fully optimized HINT m, which is periodically updated in batches, holds older data, supporting deletions with tombstones. Both indices are probed upon querying.

4 Similar to the previous section, this storage optimization can be straightforwardly employed also when a partition is divided into the P Oin ℓ,i, P Oaft ℓ,i, P Rin ℓ,i and P Raft ℓ,i subdivisions.

Experimental Analysis
We compare our hierarchical indexing, detailed in Sections 3 and 4, against the interval tree [18],6 the timeline index [21], the (adaptive) period index [4], and a uniform 1D-grid. All indices were implemented in C++ and compiled using gcc (v4.8.5) with -O3.7 The tests ran on a dual Intel(R) Xeon(R) CPU E5-2630 v4 at 2.20GHz with 384 GBs of RAM, running CentOS Linux.

Data and queries
We used 4 collections of real intervals, which have also been used in previous works; Table 4 summarizes their characteristics. BOOKS [8] contains the periods of time in 2013 when books were lent out by Aarhus libraries (https://www.odaa.dk). WEBKIT [8,9,15,33] records the file history in the git repository of the Webkit project from 2001 to 2016 (https://webkit.org); the intervals indicate the periods during which a file did not change. TAXIS [10] stores the time periods of taxi trips (pickup and drop-off timestamps) from NY City in 2013 (https://www1.nyc.gov/site/tlc/index.page). GREEND [11,28] records time periods of power usage from households in Austria and Italy from January 2010 to October 2014. BOOKS and WEBKIT contain around 2M intervals each, which are quite long on average; TAXIS and GREEND have over 100M short intervals. We also generated synthetic collections to simulate different cases for the lengths and the skewness of the input intervals. Table 5 shows the construction parameters for the synthetic datasets and their default values. The domain of the datasets ranges from 32M to 512M, which requires the index level parameter m to range from 25 to 29 for a comparison-free HINT (similar to the real datasets). The cardinality ranges from 10M to 1B. The interval lengths were generated using random.zipf(α) from the numpy library. They follow a zipfian distribution according to the probability density function p(x) = x^(−α)/ζ(α), where ζ is the Riemann zeta function. A small value of α results in most intervals being relatively long, while a large value results in the great majority of intervals having length 1. We generated the positions of the middle points of the intervals from a normal distribution centered at the middle point µ of the domain. So, the middle point of each interval is generated using numpy's random.normalvariate(µ,σ). The greater the value of σ, the more spread the intervals are in the domain.

5 Small adjustments are applied for the P Oin ℓ,i, P Oaft ℓ,i, P Rin ℓ,i and P Raft ℓ,i subdivisions and the storage optimizations.
6 Code from https://github.com/ekg/intervaltree.
7 Source code available at https://github.com/pbour/hint.
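A sketch of such a generator using numpy (the clipping to the domain, the seeding, and the exact midpoint-to-endpoint conversion are assumptions of this sketch):

```python
import numpy as np

def synthetic(n, alpha, sigma, domain, seed=0):
    # Zipfian lengths (mostly long intervals for small alpha, mostly
    # length 1 for large alpha) and normally distributed midpoints
    # centered at the middle of the domain, with spread sigma.
    rng = np.random.default_rng(seed)
    lengths = rng.zipf(alpha, n)                     # requires alpha > 1
    centers = rng.normal(domain / 2, sigma, n)
    st = np.clip(centers - lengths / 2, 0, domain - 1).astype(np.int64)
    end = np.minimum(st + lengths, domain - 1)       # keep inside the domain
    return st, end
```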
On the real datasets, we used queries uniformly distributed in the domain. On the synthetic datasets, the query positions follow the distribution of the data. In both cases, the query extent was fixed to a percentage of the domain size (default 0.1%). In each test, we ran 10K random queries and measured the overall throughput. Measuring query throughput instead of average query time makes sense in applications that manage huge volumes of interval data and offer a search interface to billions of users simultaneously (e.g., public historical databases).

Optimizing HINT/HINT m
In our first set of tests, we study the best setting for our hierarchical indexing. We compare the effectiveness of the two evaluation approaches of Section 3.2.1 and investigate the impact of the optimizations of Section 4.

Query evaluation approaches on HINT m
We compare the straightforward top-down approach for evaluating queries on HINT m, which uses solely Lemma 1, against the bottom-up approach illustrated in Algorithm 3, which additionally employs Lemma 2. Figure 10 reports the throughput of each approach on BOOKS and TAXIS, while varying the number of levels m in the index. We omit the results for WEBKIT and GREEND, which follow identical trends to BOOKS and TAXIS, respectively. We observe that the bottom-up approach significantly outperforms top-down on BOOKS, while on TAXIS this performance gap is very small. As expected, bottom-up performs at its best for inputs that contain long intervals, which are indexed at the high levels of the index, i.e., the intervals in BOOKS. In contrast, the intervals in TAXIS are very short and thus indexed at the bottom level of HINT m, while the majority of the partitions at the higher levels are empty. Hence, top-down conducts no comparisons at the higher levels. For the rest of our tests, HINT m uses the bottom-up approach.

Subdivisions and space decomposition
We next evaluate the subdivisions and space decomposition optimizations described in Section 4.1 for HINT m. Note that these techniques are not applicable to our comparison-free HINT, as that index stores only interval ids. Figure 11 (Optimizing HINT m: subdivisions and space decomposition) shows the effect of the optimizations on BOOKS and TAXIS, for different values of m; similar trends were observed on WEBKIT and GREEND, respectively. The plots include (1) a base version of HINT m, which employs none of the proposed optimizations, (2) subs+sort+opt, with all optimizations activated, (3) subs+sort, which only sorts the subdivisions (Section 4.1.1), and (4) subs+sopt, which uses only the storage optimization for the subdivisions (Section 4.1.2). We observe that the subs+sort+opt version of HINT m is superior to the other three versions in all tests. Essentially, the index benefits from the subs+sort setting only when m is small, i.e., below 15, at the expense of an increased index time compared to base. In this case, the partitions contain a large number of intervals and therefore, using binary search, or scanning until the first interval that does not overlap the query, saves comparisons. On the other hand, the subs+sopt optimization significantly reduces the space requirements of the index. As a result, this version incurs a higher cache hit ratio and thus achieves a higher throughput than base, especially for large values of m, i.e., higher than 10. The subs+sort+opt version combines the benefits of both the subs+sort and the subs+sopt versions, i.e., high throughput in all cases with low space requirements. The effect on performance is more pronounced on BOOKS, because of the long intervals and the high replication ratio. In view of these results, HINT m employs all the optimizations of Section 4.1 for the rest of our experiments.

Handling data skewness & sparsity and reducing cache misses
Table 6 shows the effect of the data skewness & sparsity optimization (Section 4.2) on the comparison-free version of HINT (Section 3.1).8 Observe that the optimization has a great effect on both the throughput and the size of the index on all four real datasets, because empty partitions are effectively excluded from query evaluation and from the indexing process.
Figure 12 shows the effect of either or both of the data skewness & sparsity (Section 4.2) and the cache misses (Section 4.3) optimizations on the performance of HINT m. In all cases, the version of HINT m which uses both optimizations is superior to all the other versions. As expected, the skewness & sparsity optimization helps to reduce the space requirements of the index when m is large, because in this case there are many empty partitions at the bottom levels of the index. At the same time, the cache misses optimization reduces the number of cache misses in all cases where no comparisons are needed. Overall, the optimized version of HINT m converges to its best performance at a relatively small value of m, where the space requirements of the index are relatively low, especially on the BOOKS and WEBKIT datasets, which contain long intervals. For the rest of our experiments, HINT m employs both optimizations and HINT the data skewness & sparsity optimization. Last, by juxtaposing Table 7 with Figures 11 and 12, we also observe that both m opt values correspond to the part of the plots before the index size blows up, usually for m ≥ 20.

Tuning m
After demonstrating the merit of the HINT m optimizations, we now elaborate on how to set the value of m and on the effectiveness of our analytical model from Section 3.3. As already discussed, our model is based on the intuition that as m increases, the cost of accessing comparison-free results comes to dominate the computational cost of the comparisons. Figure 13 illustrates this intuition on BOOKS and TAXIS (the plots for WEBKIT and GREEND exhibit exactly the same trends as BOOKS and TAXIS, respectively). For different values of m and for 10K queries, we report the overall time spent on comparisons between data intervals and query intervals, denoted by C cmp, and the overall time spent to output results with no comparisons, denoted by C acc, i.e., the time taken for simply accessing data intervals which are guaranteed query results. We also include the total execution time, i.e., C cmp + C acc. The plots clearly show the expected behavior. For small values of m, the cost of conducting comparisons dominates the total execution cost, since the partitions at the bottom level m of the index have large extents and contain numerous intervals. As m increases, the fraction of the results collected by just accessing the contents of partitions rises, increasing the C acc cost. The optimal value m opt (i.e., where the total execution time is the lowest) occurs after C acc exceeds C cmp. In fact, we notice that increasing m beyond m opt roughly eliminates the cost of comparisons (C cmp ≈ 0), as the partitions become much shorter than the queries, while the total cost essentially equals the cost of simply accessing the intervals in the comparison-free partitions.
To determine m opt, our model in Section 3.3 selects the smallest value of m for which the estimated index cost converges within 3% of its lowest estimated cost. Table 7 reports, for each real dataset, m opt (est.) and m opt (exps), the value which brought the highest throughput in our tests. Overall, our model estimates a value of m opt which is very close to m opt (exps). Despite a larger gap for WEBKIT, the measured throughput for the estimated m opt = 9 is only 5% lower than the best observed throughput.

Discussion
Table 7 also shows the replication factor k of the index, i.e., the average number of partitions in which every interval is stored, as predicted by our space complexity analysis (see Theorem 1) and as measured experimentally. As expected, the replication factor is high on BOOKS and WEBKIT, due to the large number of long intervals, and low on TAXIS and GREEND, where the intervals are very short and stored at the bottom levels.
Although our analysis uses simple statistics, the predictions are quite accurate.
The next line of the table (avg. comp. part.) shows the average number of HINT m partitions for which comparisons were conducted. Consistent with our analysis in Section 3.2.3, all numbers are below 4, which means that the performance of HINT m is very close to that of the comparison-free, but space-demanding, HINT. To further elaborate on the number of required comparisons, we last show the fraction of the results produced by HINT m without any comparisons. On all datasets, over 99% of the results are collected with no comparisons, which explains how HINT m is able to match the performance of the comparison-free HINT.

Index performance comparison
Next, we compare the optimized versions of HINT and HINT m against the previous-work competitors. We start with our tests on the real datasets. For HINT m, we set m to the best value on each dataset, according to Table 7. Similarly, we set the number of partitions for 1D-grid, the number of checkpoints for the timeline index, and the number of levels and number of coarse partitions for the period index (see Table 7). Table 8 shows the size of each index in memory and Table 9 shows the construction cost of each index, for the default query extent 0.1%. Regarding space, HINT m, along with the interval tree and the period index, has the lowest requirements on the datasets with long intervals (BOOKS and WEBKIT), and very similar requirements to 1D-grid on the rest. On TAXIS and GREEND, where the intervals are indexed mainly at the bottom level, the space requirements of HINT m are significantly lower than those of our comparison-free HINT, due to limiting the number of levels. When compared to the raw data (see Table 4), HINT m is 2 to 3 times bigger for BOOKS and WEBKIT (which contain many long intervals), and about the same size for GREEND and TAXIS. These ratios are smaller than the replication ratios k reported in Table 7, thanks to our storage optimization (cf. Section 4.1.2). Due to its simplicity, 1D-grid has the lowest index time across all datasets. Nevertheless, HINT m is the runner-up in most cases, especially for the biggest inputs, i.e., TAXIS and GREEND, while on BOOKS and WEBKIT its index time is very close to that of the interval tree. Figure 14 compares the throughput of all indices on queries of various extents (as a percentage of the domain size). The first set of bars in each plot corresponds to stabbing queries, i.e., queries of 0 extent. We observe that HINT and HINT m outperform the competition by almost one order of magnitude, across the board. In fact, only on GREEND does the performance of one of the competitors, 1D-grid, come close to the performance of our hierarchical indexing. Due to
the extremely short intervals in GREEND (see Table 4) almost all the results are collected from the bottom level of HINT/HINT m , which essentially resembles the evaluation process in 1D-grid.Yet, our indices are even in this case faster as they require no duplicate elimination.
HINT^m is the best index overall, as it matches the performance of HINT while requiring less space, confirming the findings of our analysis in Section 3.2.3. As shown in Table 8, HINT always has higher space requirements than HINT^m; even up to an order of magnitude higher in the case of GREEND. What is more, since HINT^m offers the option to control the occupied space in memory by appropriately setting the m parameter, it can handle scenarios with space limitations. HINT is marginally better than HINT^m only on datasets with short intervals (TAXIS and GREEND) and only for selective queries. In these cases, the intervals are stored at the lowest levels of the hierarchy, where HINT^m typically needs to conduct comparisons to identify results, while HINT applies comparison-free retrieval.
We next consider the synthetic datasets. In each test, we vary the value of one parameter (domain size, cardinality, α, σ, query extent) and fix the rest to their defaults (see Table 5). The value of m for HINT^m, the number of partitions for 1D-grid, the number of checkpoints for the timeline index, and the number of levels/coarse partitions for the period index are set to their best values on each dataset. The results in Figure 15 (comparing throughputs on the synthetic datasets) follow a similar trend to the tests on the real datasets. HINT and HINT^m are always significantly faster than the competition. Different from the real datasets, 1D-grid is steadily outperformed by the other three competitors. Intuitively, the uniform partitioning of the domain in 1D-grid cannot cope with the skewness of the synthetic datasets. As expected, the domain size, the dataset cardinality and the query extent have a negative impact on all indices. Essentially, increasing the domain size under a fixed query extent affects performance similarly to increasing the query extent, i.e., the queries become longer and less selective, including more results. Further, the querying cost grows linearly with the dataset size, since the number of query results is proportional to it. HINT^m occupies around 8% more space than the raw data, because the replication factor k is close to 1. In contrast, as α grows, the intervals become shorter, so the query performance improves. Similarly, when increasing σ, the intervals are more widespread, meaning that the queries are expected to retrieve fewer results, and the query cost drops accordingly.

Updates
We now test the efficiency of HINT^m under updates, using both the update-friendly version of HINT^m (Section 3.4), denoted by subs+sopt HINT^m, and the hybrid setting for the fully-optimized index from Section 4.4, denoted as HINT^m. We index offline the first 90% of the intervals of each real dataset in batch and then execute a mixed workload with 10K queries of 0.1% extent, 5K insertions of new intervals (randomly selected from the remaining 10% of the dataset), and 1K random deletions. Table 10 reports our findings for BOOKS and TAXIS; the results for WEBKIT and GREEND follow the same trend. Note that we excluded Timeline, since that index is designed for temporal (versioned) data, where updates only happen as new events appended at the end of the event list, and the comparison-free HINT, for which our tests have already shown performance similar to HINT^m at higher indexing/storing costs. Also, all indices handle deletions with "tombstones". We observe that both versions of HINT^m outperform the competition by a wide margin. An exception arises on TAXIS, as the short intervals are inserted in only one partition in 1D-grid. The interval tree has in fact several orders of magnitude slower updates, due to the extra cost of keeping the partitions in the tree sorted at all times. Overall, we also observe that the hybrid HINT^m setting is the most efficient index, as the smaller delta subs+sopt HINT^m handles insertions faster than the 90% pre-filled subs+sopt HINT^m.
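The tombstone mechanism used for deletions by all indices above can be pictured with a minimal sketch (a hypothetical helper class, not the paper's code): deleted ids are only marked, and query results are filtered against the marks, deferring physical removal.

```python
# Hedged sketch of tombstone-based deletion (illustrative, not the paper's
# implementation). Deletion is O(1): the id is recorded instead of
# restructuring any partition; query output is filtered against the marks.
class Tombstones:
    def __init__(self):
        self.dead = set()

    def delete(self, interval_id):
        # mark the interval as deleted without touching the index structure
        self.dead.add(interval_id)

    def filter(self, result_ids):
        # applied to the output of any selection query before reporting
        return [i for i in result_ids if i not in self.dead]
```

A periodic rebuild of the index can then garbage-collect the tombstoned entries.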

Interval Joins
We conclude the first part of our analysis by studying the applicability of HINT^m to the evaluation of interval joins. Given two inputs R, S, the objective is to find all pairs of intervals (r, s), r ∈ R, s ∈ S, such that r G-OVERLAPS with s. The rationale is that if the outer dataset R is very small compared to the inner S, an index already available for S can be used to evaluate the join fast, in an index nested loops fashion. Hence, we show how a HINT^m constructed for each of the four real datasets can be used to evaluate joins where the outer relation is a random sample of the same dataset. As part of the join process, we sort the outer dataset R, in order to achieve better cache locality between consecutive probes to the inner dataset S. As a competitor, we used the state-of-the-art interval join algorithm [10], which sorts both join inputs and applies a specialized sweeping algorithm, optFS. Figure 16 shows the results on the real datasets for various sizes |R| of the outer dataset R. The results confirm our expectation. For small sizes of |R|, HINT^m is able to outperform optFS. On TAXIS in particular, HINT^m loses to [10] only when |R|/|S| ≥ 50%.
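The join strategy above can be sketched as follows; all names are illustrative, and a brute-force scan stands in for the G-OVERLAPS probe on the HINT^m index of S:

```python
def g_overlaps(r, s):
    # two closed intervals are non-disjoint iff each starts before the
    # other ends
    return r[0] <= s[1] and s[0] <= r[1]

def index_nested_loops_join(R, S):
    # Sort the outer input R so that consecutive probes touch nearby parts
    # of the inner index, improving cache locality (as described above).
    pairs = []
    for r in sorted(R):
        # In the paper this probe is a G-OVERLAPS selection on the HINT^m
        # index built over S; a linear scan plays that role in this sketch.
        pairs.extend((r, s) for s in S if g_overlaps(r, s))
    return pairs
```

When |R| is small relative to |S|, each probe touches only a small part of the index, which is exactly the regime where the index nested loops plan beats sorting and sweeping both inputs.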

Supporting Allen's Algebra
We now turn our focus to Allen's algebra for intervals [1]. Table 11 (first two columns) summarizes the basic relationships of the algebra, each denoted by q REL s, where q is the query interval and s an interval in the input collection S. Note that the G-OVERLAPS selection query from the previous sections identifies every interval s non-disjoint to query q, i.e., it combines all of the algebra's basic relationships besides BEFORE and AFTER.
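For reference, the basic relationship holding between two intervals can be decided directly from the endpoints; the sketch below (our own naming, with intervals as (st, end), st < end) mirrors the conditions listed in Table 11:

```python
# Decide which of Allen's 13 basic relationships holds between a query
# interval q and an interval s (illustrative reference, not index code).
def allen(q, s):
    qs, qe = q
    ss, se = s
    if qe < ss: return "BEFORE"
    if se < qs: return "AFTER"
    if qe == ss: return "MEETS"        # q ends where s starts
    if se == qs: return "MET_BY"       # s ends where q starts
    if qs == ss and qe == se: return "EQUALS"
    if qs == ss: return "STARTS" if qe < se else "STARTED_BY"
    if qe == se: return "FINISHES" if qs > ss else "FINISHED_BY"
    if ss < qs and qe < se: return "CONTAINED_BY"   # q during s
    if qs < ss and se < qe: return "CONTAINS"       # s during q
    # remaining cases: proper overlap on one side
    return "OVERLAPS" if qs < ss else "OVERLAPPED_BY"
```

Every interval s non-disjoint to q falls into exactly one of the cases after BEFORE/AFTER, which is the combination reported by a G-OVERLAPS selection.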
We study selection queries on Allen's relationships under two setups for our hierarchical indexing. We focus on HINT^m, which exhibits performance similar to the comparison-free HINT but significantly lower indexing costs, as our experiments in Section 5 showed.

Setup Optimized for G-OVERLAPS
We start off with the HINT^m setup from the first part of our paper (see Table 3), optimized for the G-OVERLAPS selection. In what follows, we discuss how queries based on Allen's relationships can be evaluated without any structural changes to the index. Table 11 summarizes the set of intervals reported for each selection query.
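To make the subdivision bookkeeping in the following paragraphs concrete, a partition in this setup can be pictured as below. This is an illustrative layout only; the stored fields follow the reduced storage of Table 3, where P^Oaft keeps only s.st, P^Rin only s.end, and P^Raft only the interval id:

```python
from dataclasses import dataclass, field

# Hedged sketch of the four subdivisions kept per HINT^m partition under the
# G-OVERLAPS-optimized storage (field contents per Table 3; names mirror the
# paper's P^Oin, P^Oaft, P^Rin, P^Raft).
@dataclass
class Partition:
    Oin:  list = field(default_factory=list)  # originals ending inside: (id, st, end)
    Oaft: list = field(default_factory=list)  # originals outliving the partition: (id, st)
    Raft: list = field(default_factory=list)  # replicas outliving the partition: id only
    Rin:  list = field(default_factory=list)  # replicas ending inside: (id, end)
```

The missing endpoints (s.end in Oaft/Raft, s.st in Rin/Raft) are exactly what forces the indirect checks via other partitions in the evaluations below.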
Relationship EQUALS. An EQUALS selection determines all input intervals identical to query q, i.e., with q.st = s.st and q.end = s.end. To answer such a query, we access two specific index partitions: the first relevant partition P_{ℓ,f} at level ℓ and the last relevant partition P_{ℓ′,l} at level ℓ′. Intuitively, these two partitions correspond to the first and last partition where HINT^m would store the query interval q, respectively. We then distinguish between two cases. If q overlaps a single partition, i.e., if f = l, we need only the intervals that both start and end inside this partition, i.e., the P^Oin_{ℓ,f} subdivision. So, we report the set {s ∈ P^Oin_{ℓ,f} : q.st = s.st ∧ q.end = s.end}. Otherwise, if f ≠ l, we report results among the intervals that start in the first relevant partition (from P^Oaft_{ℓ,f}) and end in the last (from P^Rin_{ℓ′,l}), i.e., the set {s ∈ P^Oaft_{ℓ,f} : q.st = s.st} ∩ {s ∈ P^Rin_{ℓ′,l} : q.end = s.end}. Note that we cannot directly check q.end = s.end on P^Oaft_{ℓ,f}, as it stores only s.st (and s.id).
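For illustration, the first and last relevant partitions per level can be read off the binary representation of the query endpoints. The sketch below assumes a domain of 2^m units, with level ℓ holding 2^ℓ equally sized partitions and level m being the bottom level (a simplification of the paper's scheme):

```python
# Illustrative computation of the (level, f, l) triples for a query
# [q_st, q_end] over a domain of 2^m units (simplified numbering).
def relevant_partitions(q_st, q_end, m):
    # The partition covering a point x at level ℓ is identified by the ℓ
    # highest bits of x, i.e., x >> (m - ℓ).
    return [(level, q_st >> (m - level), q_end >> (m - level))
            for level in range(m + 1)]
```

An EQUALS query then touches only the first relevant partition at the level where q would be stored as an original, and the last relevant partition at the level where its replica would end.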
Relationship STARTS. According to Allen's algebra, a STARTS selection query reports all intervals that start where q does, i.e., with q.st = s.st, but outlive its end, i.e., with q.end < s.end. By construction, HINT^m stores such intervals as originals in the first relevant partition. We consider two cases for every index level ℓ. If f = l, the results are drawn from the originals in f: the intervals in P^Oin_{ℓ,f} that satisfy both conditions and the intervals in P^Oaft_{ℓ,f} that satisfy only q.st = s.st; for the latter intervals, their s.end is by construction after q.end. So, we report {s ∈ P^Oin_{ℓ,f} : q.st = s.st ∧ q.end < s.end} ∪ {s ∈ P^Oaft_{ℓ,f} : q.st = s.st}. In contrast, if f ≠ l, the results can only come from the intervals that end after the first relevant partition at the current level ℓ, i.e., from P^Oaft_{ℓ,f}. But, as the P^Oaft_{ℓ,f} subdivisions store only s.st according to Table 3, we cannot directly check the q.end < s.end condition. Instead, we rely on the replicas inside the last relevant partition at any index level. Intuitively, if an interval s ∈ {s ∈ P^Oaft_{ℓ,f} : q.st = s.st} is stored as a replica in the last relevant partition l at a level ℓ′ and either (1) ends inside l (i.e., s ∈ P^Rin_{ℓ′,l}) but after q.end, or (2) outlives the partition (i.e., s ∈ P^Raft_{ℓ′,l}), then q.end < s.end holds for s. The above two sets are computed as ∪_{∀ℓ′} ({s ∈ P^Rin_{ℓ′,l} : q.end < s.end} ∪ P^Raft_{ℓ′,l}).
Relationship STARTED BY. As an inverse to STARTS, a STARTED BY selection determines all intervals that again start at q.st but end before q.end. Therefore, if f = l holds at a level ℓ, we consider only the intervals that both start and end inside the partition, reporting the set {s ∈ P^Oin_{ℓ,f} : q.st = s.st ∧ q.end > s.end}. Otherwise, results are found among all originals in f. For the P^Oin_{ℓ,f} subdivision, we directly output {s ∈ P^Oin_{ℓ,f} : q.st = s.st}, as their s.end is by construction before q.end. For the intervals s ∈ P^Oaft_{ℓ,f} with q.st = s.st, we apply a technique similar to STARTS for checking the q.end > s.end condition. Intuitively, such an interval s will be reported if it ends at any level ℓ′ either inside a partition i with f < i < l or in the last relevant partition l but before q.end. For this purpose, we check if s is inside the set ∪_{∀ℓ′} (∪_{f<i<l} P^Rin_{ℓ′,i} ∪ {s ∈ P^Rin_{ℓ′,l} : q.end > s.end}).
Relationship FINISHES. This selection query returns all intervals that end exactly where query q does, i.e., with q.end = s.end, but start before q, i.e., with q.st > s.st. If q overlaps a single partition (f = l) at a level ℓ, we consider the intervals that end in the last relevant partition l: {s ∈ P^Oin_{ℓ,l} : q.end = s.end ∧ q.st > s.st} ∪ {s ∈ P^Rin_{ℓ,l} : q.end = s.end}. Otherwise (f ≠ l), only replicas that end inside partition l (subdivision P^Rin_{ℓ,l}) with q.end = s.end can be part of the results. To this end, we face a challenge similar to STARTS/STARTED BY, as P^Rin_{ℓ,l} does not store s.st (see Table 3) to directly check q.st > s.st. The solution is to check if an interval s ∈ {s ∈ P^Rin_{ℓ,l} : q.end = s.end} is contained in the set ∪_{∀ℓ′} ({s ∈ P^Oaft_{ℓ′,f} : q.st > s.st} ∪ P^Raft_{ℓ′,f}), i.e., the intervals that either (1) start before q.st in the first relevant partition f at any level ℓ′, or (2) are stored in P^Raft_{ℓ′,f} and so, their start is by construction before q.st.
Relationship FINISHED BY. A FINISHED BY selection inverses the second condition of FINISHES, determining intervals with q.end = s.end and q.st < s.st. For a level ℓ, if f = l, we report the intervals that start and end inside the partition and satisfy both conditions, i.e., the set {s ∈ P^Oin_{ℓ,l} : q.end = s.end ∧ q.st < s.st}. Otherwise (f ≠ l), the results are among all intervals that end in partition l, i.e., the set {s ∈ P^Oin_{ℓ,l} : q.end = s.end} ∪ {s ∈ P^Rin_{ℓ,l} : q.end = s.end}. For the intervals from subdivision P^Oin_{ℓ,l}, q.st < s.st holds by construction, while for the P^Rin_{ℓ,l} intervals, a direct check of the condition is not possible. Instead, we check such an interval s against the set of intervals that start either (1) after q in the first relevant partition at any level ℓ′, or (2) inside the partitions in between the first and the last relevant, i.e., the set ∪_{∀ℓ′} ({s ∈ P^Oaft_{ℓ′,f} : q.st < s.st} ∪ ∪_{f<i<l} P^Oaft_{ℓ′,i}).
Relationship MEETS. This selection query returns all intervals that start at q.end. Under this, we report for each level ℓ all originals in the last relevant partition l that satisfy the q.end = s.st condition, i.e., the set {s ∈ P^Oin_{ℓ,l} ∪ P^Oaft_{ℓ,l} : q.end = s.st}.
Relationship MET BY. This selection query returns all intervals that end at q.st. To this end, the results are among the intervals that end inside the first relevant partition f, i.e., the set {s ∈ P^Oin_{ℓ,f} ∪ P^Rin_{ℓ,f} : q.st = s.end}, at each level ℓ.
Relationship OVERLAPS. An OVERLAPS selection determines all intervals non-disjoint to query q which start after q.st and end after q.end. If q overlaps a single partition (f = l) at a level ℓ, such intervals are found among the originals in the partition; for the P^Oin_{ℓ,f} subdivision all query conditions are checked, while for an s in P^Oaft_{ℓ,f}, q.end < s.end always holds. So, we report the set {s ∈ P^Oin_{ℓ,f} : q.st < s.st ∧ q.end > s.st ∧ q.end < s.end} ∪ {s ∈ P^Oaft_{ℓ,f} : q.st < s.st ∧ q.end > s.st}. Otherwise, results are reported in two parts. The first part is drawn from the originals in the last relevant partition at each level ℓ, i.e., {s ∈ P^Oin_{ℓ,l} : q.end > s.st ∧ q.end < s.end} ∪ {s ∈ P^Oaft_{ℓ,l} : q.end > s.st}. For the second part, we consider the intervals that start before partition l and outlive q, i.e., the set {s ∈ P^Rin_{ℓ,l} : q.end < s.end} ∪ P^Raft_{ℓ,l}. For every such interval s, q.end > s.st holds by construction, but we need to check its start against q.st. As the P^Rin_ℓ and P^Raft_ℓ subdivisions do not store s.st, we cannot directly check the q.st < s.st condition. Instead, we compare s against the P^Oaft_{ℓ′} subdivisions at any level ℓ′ that contain the intervals starting (1) after q.st in the first relevant partition f, or (2) inside any partition in between f and l, i.e., the set ∪_{∀ℓ′} ({s ∈ P^Oaft_{ℓ′,f} : q.st < s.st} ∪ ∪_{f<i<l} P^Oaft_{ℓ′,i}).
Relationship OVERLAPPED BY. As the inverse of OVERLAPS, the OVERLAPPED BY selection determines all intervals non-disjoint to q that start before q.st and end before q.end. If f = l, we draw the results from all intervals (both originals and replicas) that end inside the partition; the set {s ∈ P^Oin_{ℓ,f} : q.st > s.st ∧ q.st < s.end ∧ q.end > s.end} ∪ {s ∈ P^Rin_{ℓ,f} : q.st < s.end ∧ q.end > s.end}. Otherwise, the results consist of two parts for every level ℓ. The first part includes again originals and replicas that end inside the first relevant partition f, but now condition q.end > s.end always holds by construction. Hence, we report the set {s ∈ P^Oin_{ℓ,f} : q.st > s.st ∧ q.st < s.end} ∪ {s ∈ P^Rin_{ℓ,f} : q.st < s.end}. For the second part, we seek results among all intervals that start before q, i.e., originals {s ∈ P^Oaft_{ℓ,f} : q.st > s.st} and replicas P^Raft_{ℓ,f}; for both sets, q.st < s.end holds by construction, as the intervals outlive the first relevant partition f. As neither the P^Oaft_{ℓ,f} nor the P^Raft_{ℓ,f} subdivision maintains s.end, we check q.end > s.end by determining the replicas at any index level ℓ′ that end (1) either before the last relevant partition l, or (2) inside l before q.end, i.e., the set ∪_{∀ℓ′} (∪_{f<i<l} P^Rin_{ℓ′,i} ∪ {s ∈ P^Rin_{ℓ′,l} : q.end > s.end}).
Relationship CONTAINS. This selection query returns all intervals fully contained inside the query interval q, i.e., with q.st < s.st ∧ q.end > s.end. For every level ℓ, if f = l, q can contain only intervals that both start and end in this partition, i.e., from subdivision P^Oin_{ℓ,f}; we report the set {s ∈ P^Oin_{ℓ,f} : q.st < s.st ∧ q.end > s.end}. Otherwise, the results are drawn from the original intervals in every partition from the first relevant partition f to the last l; for the latter, only originals that end inside the partition are considered. Specifically, for the intervals in the P^Oin_ℓ subdivisions, we report {s ∈ P^Oin_{ℓ,f} : q.st < s.st} ∪ ∪_{f<i<l} P^Oin_{ℓ,i} ∪ {s ∈ P^Oin_{ℓ,l} : q.end > s.end}; observe how only one condition is checked for partitions f and l, while for every partition i in between, all originals that end inside i are directly output. In contrast, for all intervals in the P^Oaft_ℓ subdivisions, we need to check the q.end > s.end condition; additionally, for every s in the P^Oaft_{ℓ,f} subdivision, we also check if q.st < s.st holds.
As the P^Oaft_ℓ subdivisions store only s.st, q.end > s.end is checked similarly to OVERLAPPED BY, i.e., using the set ∪_{∀ℓ′} (∪_{f<i<l} P^Rin_{ℓ′,i} ∪ {s ∈ P^Rin_{ℓ′,l} : q.end > s.end}).
Relationship CONTAINED BY. This selection determines all intervals that fully contain q, i.e., with q.st > s.st ∧ q.end < s.end. For each level ℓ, if f = l, the result intervals are found among all subdivisions in the partition, reporting {s ∈ P^Oin_{ℓ,f} : q.st > s.st ∧ q.end < s.end} ∪ {s ∈ P^Oaft_{ℓ,f} : q.st > s.st} ∪ {s ∈ P^Rin_{ℓ,f} : q.end < s.end} ∪ P^Raft_{ℓ,f}. In contrast, if f ≠ l, the results are among the intervals that (1) start before q.st, corresponding to the set {s ∈ P^Oaft_{ℓ,f} : q.st > s.st} ∪ P^Raft_{ℓ,f}, and (2) end after q.end. As the P^Oaft_ℓ and P^Raft_ℓ subdivisions do not store s.end, in order to check the q.end < s.end condition we need to intersect the above candidate set with the replicas at any level ℓ′ that either end inside the last relevant partition l after q.end or outlive it, i.e., the set ∪_{∀ℓ′} ({s ∈ P^Rin_{ℓ′,l} : q.end < s.end} ∪ P^Raft_{ℓ′,l}). Note that replicas from the partitions in between are ignored, as they would only produce duplicate results.
Relationship AFTER. An AFTER selection determines all intervals that end before q starts. Results are found at each level ℓ among the intervals which end inside either (1) the first relevant partition f, satisfying q.st > s.end, i.e., the set {s ∈ P^Oin_{ℓ,f} ∪ P^Rin_{ℓ,f} : q.st > s.end}, or (2) any partition preceding f; replicas in these preceding partitions are ignored to avoid duplicate results.

One Setup for All
The storage optimization discussed in Section 4.1.2 allows the G-OVERLAPS setup of HINT^m to reduce the memory footprint of the index and improve cache locality. But, as an optimization technique tailored to the G-OVERLAPS relationship, it has a negative impact on the basic relationships of Allen's algebra. The key issue is that we cannot directly check the conditions on s.end for the P^Oaft and P^Raft subdivisions, and on s.st for P^Rin and P^Raft. Instead, we are forced to access extra partitions to implicitly conduct these checks, e.g., the P^Rin_{ℓ′,l} and P^Raft_{ℓ′,l} subdivisions in the last relevant partition l at each index level ℓ′, for the STARTS relationship.
In view of this, we next consider a subs+sort setup of HINT^m for Allen's algebra. Essentially, no changes are required if query q overlaps a single partition (f = l) at a level ℓ, as all necessary information is available for the selection conditions. Further, the computation of MEETS, MET BY, BEFORE and AFTER queries remains unchanged. So, in what follows, we discuss the necessary changes for the rest of the relationships in the f ≠ l case.
Relationship EQUALS. We can now directly retrieve results from the first relevant partition f and the P^Oaft_{ℓ,f} subdivision by checking both query conditions, i.e., we report the set {s ∈ P^Oaft_{ℓ,f} : q.st = s.st ∧ q.end = s.end}.
Relationship STARTS. With s.end stored in P^Oaft_{ℓ,f}, both query conditions can be directly checked at each level ℓ, and thus we report {s ∈ P^Oaft_{ℓ,f} : q.st = s.st ∧ q.end < s.end}.
Relationship STARTED BY. Similar to STARTS, we can directly check both conditions for P^Oaft_{ℓ,f} in the first relevant partition f. We report {s ∈ P^Oin_{ℓ,f} : q.st = s.st} ∪ {s ∈ P^Oaft_{ℓ,f} : q.st = s.st ∧ q.end > s.end} at each level.
Relationship FINISHES. With s.st stored in the P^Rin_{ℓ,l} subdivisions, we can directly check q.st > s.st and report {s ∈ P^Rin_{ℓ,l} : q.end = s.end ∧ q.st > s.st} at each level.
Relationship FINISHED BY. Similar to FINISHES, we can directly check both conditions on P^Rin_{ℓ,l} and thus report, at each level ℓ, the set {s ∈ P^Oin_{ℓ,l} : q.end = s.end} ∪ {s ∈ P^Rin_{ℓ,l} : q.end = s.end ∧ q.st < s.st}.
Relationship OVERLAPS. With s.st stored in the P^Rin_{ℓ,l} and P^Raft_{ℓ,l} subdivisions, we directly check q.st < s.st for partition l. So, we report the intervals {s ∈ P^Rin_{ℓ,l} : q.st < s.st ∧ q.end < s.end} ∪ {s ∈ P^Raft_{ℓ,l} : q.st < s.st} at each level, along with the set {s ∈ P^Oin_{ℓ,l} : q.end > s.st ∧ q.end < s.end} ∪ {s ∈ P^Oaft_{ℓ,l} : q.end > s.st}.
Relationship OVERLAPPED BY. With s.end stored in P^Oaft_{ℓ,f} and P^Raft_{ℓ,f}, we can directly check q.end > s.end, reporting the set {s ∈ P^Oaft_{ℓ,f} : q.st > s.st ∧ q.end > s.end} ∪ {s ∈ P^Raft_{ℓ,f} : q.end > s.end}, along with the intervals contained in {s ∈ P^Oin_{ℓ,f} : q.st > s.st ∧ q.st < s.end} ∪ {s ∈ P^Rin_{ℓ,f} : q.st < s.end}.
Relationship CONTAINS. With s.end stored in the P^Oaft_ℓ subdivisions, we can directly check the q.end > s.end condition to output {s ∈ P^Oaft_{ℓ,f} : q.st < s.st ∧ q.end > s.end} ∪ {s ∈ ∪_{f<i<l} P^Oaft_{ℓ,i} : q.end > s.end}, along with the set {s ∈ P^Oin_{ℓ,f} : q.st < s.st} ∪ ∪_{f<i<l} P^Oin_{ℓ,i} ∪ {s ∈ P^Oin_{ℓ,l} : q.end > s.end}.
Relationship CONTAINED BY. With s.end stored in the P^Oaft_ℓ and P^Raft_ℓ subdivisions, we can now directly check the q.end < s.end condition at each level ℓ, reporting the intervals {s ∈ P^Oaft_{ℓ,f} : q.st > s.st ∧ q.end < s.end} ∪ {s ∈ P^Raft_{ℓ,f} : q.end < s.end}.

Bottom-up Evaluation Approach
Both setups of HINT^m can benefit from the bottom-up approach of Section 3.2.2. The idea is to determine the levels where the last bit of the first (resp. last) relevant partition f (resp. l) is set to 1 or 0 for the first time. Due to lack of space, we discuss only STARTS for the G-OVERLAPS setup, as an example. Specifically, results are found among the original intervals stored in the first relevant partition f up to the level where the last bit of f is 1 for the first time. All originals in f at a higher level start, by construction of the index, before q.st and thus violate q.st = s.st. Further, at levels after the one where the last bit of l is 0 for the first time, q.end < s.end always holds for all s ∈ P^Rin_{ℓ′,l}. Consider the query q in Figure 7. Candidate results are contained only as originals in P_{4,5}, where the last bit of f = 5 is 1. Also, as the last bit of l is 0 at the 4th level, all P^Rin intervals in P_{2,2}, P_{1,1}, P_{0,0} satisfy q.end < s.end.
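The level determination above can be sketched as a bottom-up bit walk. The code below is illustrative, using the same simplified partition numbering as before (partition ids halve from one level to the one above), not the paper's implementation:

```python
# Hedged sketch of the bottom-up pruning for STARTS in the G-OVERLAPS setup.
def prune_levels(q_st, q_end, m):
    # Walk from the bottom level m upwards. Record the level where the last
    # bit of the first relevant partition f first becomes 1 (above it, all
    # originals in f start before q.st) and the level where the last bit of
    # the last relevant partition l first becomes 0 (above it, every P^Rin
    # replica in l ends after q.end).
    f, l = q_st, q_end
    first_f1, first_l0 = None, None
    for level in range(m, -1, -1):
        if first_f1 is None and f & 1:
            first_f1 = level
        if first_l0 is None and not (l & 1):
            first_l0 = level
        f >>= 1   # parent partition at the level above
        l >>= 1
    return first_f1, first_l0
```

For instance, with f = 5 and l = 8 at the bottom level of a 4-level walk (as in the Figure 7 discussion), both conditions trigger at level 4, matching the pruning described above.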

Experiments on Allen's Algebra
For the second part of our experiments, we focus on selection queries under the basic relationships of Allen's algebra. We first compare the two alternative HINT^m setups from Section 6 and then put the best setup against the competition. We extended our code for all competitor indices from Section 5 to support Allen's algebra. We ran our tests on the BOOKS, WEBKIT, TAXIS and GREEND datasets; we omit the results on the synthetic datasets due to lack of space and because we observed similar trends. Lastly, parameter m and all other index parameters are set according to Table 7.

Determining the Best Index Setup
Figure 17 reports the throughputs achieved by the two HINT^m setups; the results on WEBKIT and GREEND are similar and therefore omitted due to lack of space. Note that both setups adopt the bottom-up evaluation (Section 3.2.2) and employ the skewness & sparsity and the cache misses optimizations (Sections 4.2 and 4.3). The 'one setup for all' setup drastically improves the performance of HINT^m for the majority of the queries. Essentially, the G-OVERLAPS setup matches the performance of 'one setup for all' on the G-OVERLAPS relationship, as expected, and on relationships where only one partition per level is examined by both setups without the need to indirectly check a condition, i.e., on MEETS, MET BY, BEFORE and AFTER. On the rest, 'one setup for all' is from one to several orders of magnitude faster.
We also compare the two setups on their index size and update handling. As expected, 'one setup for all' requires more space than the HINT^m setup for G-OVERLAPS (see Table 13), due to disabling the sopt optimization from Section 4.1.2. Regarding insertions and deletions, both setups can employ the hybrid setting of Section 4.4, with similar performance. Overall, in a typical space-time tradeoff, 'one setup for all' increases the space requirements in exchange for drastically accelerating querying, even by several orders of magnitude for some relationships. So, for the rest of our analysis, HINT^m always operates under 'one setup for all'.

Index Performance Comparison
Figure 18 compares the performance of all studied indices. The first 4 rows of plots report the results for OVERLAPS, OVERLAPPED BY, CONTAINS and CONTAINED BY, while varying the query extent, similar to Figure 14. Note that for CONTAINED BY on TAXIS and GREEND, we consider a different range of values, as these datasets contain significantly shorter intervals than BOOKS and WEBKIT. The last row of plots reports on the rest of the relationships, where the selection queries essentially resemble typical stabbing queries, i.e., the query overlaps either one partition per level or, in EQUALS, only two partitions in total.
Overall, HINT^m exhibits the highest throughput for all queries based on Allen's algebra relationships, in line with the results in Figure 14. Its performance gap to the competitor indices ranges from almost half an order of magnitude to several orders of magnitude. Essentially, the smallest

Conclusions
We proposed a hierarchical index (HINT) for intervals, which has low space complexity and minimizes the number of data accesses and comparisons during query evaluation. Our experiments on real and synthetic datasets show that HINT outperforms previous work by almost one order of magnitude on a wide variety of interval data and query distributions. Our index fully supports selection queries based on Allen's relationships [1] between intervals, achieving consistently excellent performance independently of the query predicate. In the future, we intend to extend our work in multiple directions. Regarding the index structure, we plan to consider compression techniques to further reduce the HINT^m memory footprint, and adaptive variants that, e.g., use non-regular partitioning to better deal with long intervals. We also plan to support queries combining temporal selections with selections on additional attributes or on the interval duration [4]. Further, we will study how to manage transactional data using HINT, e.g., streams of events. Last, we plan to investigate hardware-aware techniques, e.g., effective parallelization relying on the fact that HINT partitions are independent, or near-storage computation with Processing-in-Memory.

Fig. 7: Avoiding redundant comparisons in HINT^m.

ALGORITHM 3: Searching HINT^m.

0 4 2
s w 9 / 4 H z + A D 0 k j / o = < / l a t e x i t >

4 , 6 < l a t e x i t s h a 1 _
e H O W 8 O O / O x 6 w 1 5 2 Q z + / A H z u c P z 9 O P w g = = < / l a t e x i t > P O b a s e 6 4 = " A G 7 z m r L d 3 4 A O k v O w y h 4 q 6 b w 5

P^O_{4,6} and P^O_{4,8}. Then, we follow the pointer from P^O_{4,5} to P^O_{3,4}.

Fig. 12: Optimizing HINT^m: impact of handling skewness & sparsity and of reducing cache misses, for different values of m.

In all cases, the version of HINT^m which uses both optimizations is superior to all other versions. As expected, the skewness & sparsity optimization helps to reduce the space requirements of the index when m is large, because in this case there are many empty partitions at the bottom levels of the index. At the same time, the cache misses optimization helps to reduce the number of cache misses in all cases where no comparisons are needed. Overall, the optimized version of HINT^m converges to its best performance at a relatively small value of m, where the space requirements of the index are relatively low, especially on the BOOKS and WEBKIT datasets, which contain long intervals. For the rest of our experiments, HINT^m employs both optimizations, and HINT employs the data skewness & sparsity optimization. Last, by juxtaposing Table 7 with Figures 11 and 12, we also observe that both m_opt values correspond to the part of the plots before the index size blows up, usually for m ≥ 20.

Fig. 14: Comparing throughputs [queries/sec] on the real datasets, per Allen relationship (EQUALS, STARTS, STARTED_BY, FINISHES, FINISHED_BY, MEETS, MET_BY, OVERLAPS, OVERLAPPED_BY, CONTAINS, CONTAINED_BY, BEFORE, AFTER).

Table 3: Necessary data and beneficial sort orders

O_4, as shown at the bottom of the figure (the binary representations of the interval endpoints are shown). For the moment, ignore the ids column for T^O_4 at the right of the figure. The sparse index for T^O_4 has one entry per non-empty partition, pointing to the first interval in it. For the query in the example, the index is used to find the first non-empty partition P^O_{4,5} whose id is greater than or equal to the 4-bit prefix 0100 of q.st. All relevant non-empty partitions P^O
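The sparse-index lookup described above can be sketched as a binary search over the sorted ids of the non-empty partitions at a level. This is a minimal illustration, not the paper's implementation; the names `first_relevant`, `nonempty_ids`, and `starts`, as well as the 8-bit endpoint domain, are assumptions made for the example.

```python
from bisect import bisect_left

def first_relevant(nonempty_ids, starts, q_st, total_bits, level_bits):
    """Return (partition id, offset of its first interval) for the first
    non-empty partition whose id is >= the level_bits-bit prefix of q_st.

    nonempty_ids : sorted ids of the non-empty partitions at this level
    starts       : for each of them, offset of its first interval in the
                   level's interval array (hypothetical layout)
    """
    prefix = q_st >> (total_bits - level_bits)  # e.g., the 4-bit prefix of q.st
    pos = bisect_left(nonempty_ids, prefix)     # binary search in the sparse index
    if pos == len(nonempty_ids):
        return None                             # no relevant partition at this level
    return nonempty_ids[pos], starts[pos]

# Example with 8-bit endpoints: the 4-bit prefix of q.st = 0b01000010 is
# 0b0100; the first non-empty partition with id >= 0b0100 has id 5.
ids = [0b0011, 0b0101, 0b0110, 0b1000]  # non-empty partitions 3, 5, 6, 8
offs = [0, 4, 9, 12]
print(first_relevant(ids, offs, 0b01000010, 8, 4))  # → (5, 4)
```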

Table 4: Characteristics of real datasets

Table 5: Parameters of synthetic datasets

Table 7: Statistics and parameter setting

Table 12: Allen's algebra relationships, 'One setup for all'. For each relationship, the table specifies the conditions evaluated against the intervals s of each relevant partition subdivision (e.g., P^R_in, P^R_aft), dropping comparisons that are implied by the partition boundaries. The base predicates on a query q are:

- FINISHES: q.end = s.end ∧ q.st > s.st
- FINISHED_BY: q.end = s.end ∧ q.st < s.st
- OVERLAPS: q.st < s.st ∧ q.end > s.st ∧ q.end < s.end
- OVERLAPPED_BY: q.st > s.st ∧ q.st < s.end ∧ q.end > s.end
- CONTAINS: q.st < s.st ∧ q.end > s.end
- CONTAINED_BY: q.st > s.st ∧ q.end < s.end
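The base predicates above can be checked directly, as in the following minimal sketch. This is not the paper's optimized per-partition evaluation (which avoids redundant comparisons using the partition boundaries); it simply classifies the Allen relationship between a query interval (q_st, q_end) and a data interval (s_st, s_end), with all names chosen for illustration.

```python
def allen_relationship(q_st, q_end, s_st, s_end):
    """Classify the Allen relationship of query (q_st, q_end)
    against interval (s_st, s_end) using the base predicates."""
    if q_end < s_st:
        return "BEFORE"                 # q ends before s starts
    if q_st > s_end:
        return "AFTER"                  # q starts after s ends
    if q_end == s_st:
        return "MEETS"                  # q's end touches s's start
    if q_st == s_end:
        return "MET_BY"                 # q's start touches s's end
    if q_st == s_st and q_end == s_end:
        return "EQUALS"
    if q_st == s_st:
        return "STARTS" if q_end < s_end else "STARTED_BY"
    if q_end == s_end:
        return "FINISHES" if q_st > s_st else "FINISHED_BY"
    if q_st < s_st and q_end > s_end:
        return "CONTAINS"
    if q_st > s_st and q_end < s_end:
        return "CONTAINED_BY"
    return "OVERLAPS" if q_st < s_st else "OVERLAPPED_BY"

print(allen_relationship(1, 5, 3, 8))   # → OVERLAPS
print(allen_relationship(2, 9, 3, 8))   # → CONTAINS
```

Note the order of the checks: boundary-touching cases (MEETS, MET_BY, EQUALS, STARTS, FINISHES and their inverses) must be tested before the strict-inequality cases, since the latter predicates exclude equality.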