Journal of Intelligent Information Systems, Volume 39, Issue 2, pp 317–334

Adaptive two-level optimization for selection predicates of multiple continuous queries

Authors

    • Department of Non-commissioned Officers, Anyang Science University
  • Won-Suk Lee
    • Department of Computer Science, Yonsei University
Open Access Article

DOI: 10.1007/s10844-011-0192-1

Cite this article as:
Lee, H. & Lee, W. J Intell Inf Syst (2012) 39: 317. doi:10.1007/s10844-011-0192-1

Abstract

A data stream is a massive unbounded sequence of data elements continuously generated at a rapid rate. Query processing for such a data stream should also be continuous and rapid, which requires strict time and space constraints. In order to guarantee these constraints, we have proposed a new scheme called an Attribute Selection Construct (ASC) for an attribute of a data stream in our previous study (Lee and Lee, Information Sciences 178:2416–2432, 2008). As its optimization technique, this paper proposes the new strategy that determines the evaluation order of multiple ASC’s for a given query set at two different levels—macro and micro levels. Based on the two levels, it also proposes two different strategies—macro-sequence and hybrid-sequence—that find the optimized full evaluation sequence of all the ASC’s. In addition, it provides the adaptive strategy that periodically rearranges the evaluation sequence of multiple ASC’s. The performance of the proposed technique is verified by a series of experiments.

Keywords

Data stream, Multiple continuous queries, Selection predicate, ASC, Macro level, Micro level, Macro sequence, Hybrid sequence, Adaptive optimization

1 Introduction

A data stream is defined as a massive unbounded sequence of data elements continuously generated at a rapid rate (Babcock et al. 2002; Motwani et al. 2003). Accordingly, a registered query in a data stream management system (DSMS) is called a continuous query. It should be executed continuously rather than once on demand, producing its results whenever a new tuple of a target data stream arrives (Abadi et al. 2003; Avnur and Hellerstein 2000; Chen et al. 2002). Research activities on data streams are motivated by emerging applications involving massive datasets such as customer click streams, multimedia data, retail chain transactions and network intrusion detection system (NIDS). In these fields, one of the main research issues is message brokers that classify data by some criteria and send them to proper destinations. The classification criteria are implemented by some continuous queries. They should be evaluated in real-time, which requires strict time and space constraints. Since a number of continuous queries are registered together in advance, it is more efficient to evaluate multiple queries collectively by sharing the common constraints of the queries, as already proposed in most DSMS’s (Chandrasekaran and Franklin 2002; Chen et al. 2000; Madden et al. 2002; Sharaf et al. 2007).

We have proposed a new structure called an attribute selection construct (ASC) for the efficient evaluation of selection predicates (Lee and Lee 2008). Given a set of continuous queries, an attribute of a base data stream is defined as a p-attribute (participant attribute) if it is employed to express at least one selection predicate. An ASC is constructed for each p-attribute and contains the encoded information of the selection predicates that are imposed on its corresponding p-attribute. Based on the constraining constant values of the selection predicates, the entire domain of a p-attribute is subdivided into a number of disjoint regions. For every region, the ASC of the p-attribute maintains pre-computed results for all the queries. The results indicate which queries satisfy an incoming tuple if its p-attribute value falls within the region. An ASC can be completely built at compile-time because only the selection predicates of the queries are required to build it. This feature improves run-time efficiency, which is important in a time-critical stream environment.

Based on the ASC scheme, this paper proposes a new adaptive two-level optimization technique. Given a set of continuous queries, a tuple of a data stream can be dropped when it does not satisfy any of the queries. In a detection system such as an NIDS, tuple filtering capability is particularly important because most normal tuples should be filtered out by the system. In order to minimize the run-time overhead of query evaluation, it is very important to filter out such an unmatched tuple as early as possible (Babu et al. 2004). The ASC’s constructed for the queries differ from one another in filtering capability. Furthermore, within an ASC, its regions also have different filtering capabilities. Therefore, the evaluation order of the ASC’s can significantly influence the overall performance of query evaluation. From this viewpoint, the proposed method determines the evaluation order of multiple ASC’s at two different levels. One is the level of ASC’s alone (macro level). The other is the level that combines an ASC with its regions (micro level). Based on the two levels, this paper proposes two different strategies: macro-sequence and hybrid-sequence. A macro-sequence finds the full evaluation order of all the ASC’s, considering only their overall filtering capabilities at the macro level. Meanwhile, a hybrid-sequence considers the respective filtering capabilities of the regions of an ASC at the micro level, as well as its overall filtering capability. This paper also includes an adaptive strategy which dynamically rearranges the current evaluation sequence by periodically capturing the run-time tuple dropping ratio of the sequence. As a user-defined parameter, a rearrangement threshold μ is introduced to indicate the maximum ratio of inefficiency that the current evaluation sequence is allowed to sustain.

Contributions   The contributions of this paper are summarized as follows:
  • Based on the previously proposed ASC scheme (Lee and Lee 2008), it proposes new techniques that determine the evaluation sequence of ASC’s at two different levels, the macro and micro levels, for multiple target queries over a data stream.

  • Considering the macro and micro levels, it provides two different strategies that find the optimized full evaluation sequence of multiple ASC’s: a macro-sequence and a hybrid-sequence. A macro-sequence determines the sequence at the macro level only, whereas a hybrid-sequence determines it at both the macro and micro levels.

  • Due to the selectivity change of selection predicates over a data stream, it also provides the adaptive optimization strategy that periodically rearranges the evaluation sequence of multiple ASC’s.

Paper outline   Section 2 presents related works and Section 3 (preliminary section) briefly introduces how to construct and evaluate the ASC proposed in our previous paper (Lee and Lee 2008). Sections 4 and 5 illustrate how to find the optimized evaluation sequence of ASC’s at two different levels, and how to evaluate and rearrange it at run-time, respectively. In Section 6, the performance of the proposed method is analyzed through a series of experiments. Finally, Section 7 presents our conclusions.

2 Related work

Some studies have proposed grouping-based methods for sharing the common constraints of continuous queries. CACQ (Madden et al. 2002) employs a predicate index for each distinct attribute and maintains various data structures reflecting the characteristics of comparison operators, such as an equality hash table and a greater-than (less-than) tree. PSoup (Chandrasekaran and Franklin 2002) uses a red-black tree based on an IBS-tree (Interval Binary Search tree) (Hanson et al. 1990) for each distinct attribute in order to index all the constraining constants of selection predicates. In the case of a non-equality comparison, both the predicate index of CACQ and the red-black tree of PSoup must be traversed sequentially for its specified range, which can degrade the performance of query evaluation considerably if it is highly selective. Wu et al. (2006) use a query index with a series of hierarchical CEI’s (Containment-Encoded Intervals) for each distinct attribute. Similarly to the disjoint regions of the proposed ASC, a CEI is produced by exclusively dividing the domain of a corresponding attribute according to the constraining constants of its selection predicates. However, in order to find the set of satisfied queries, more than one CEI containing the corresponding attribute value of an incoming tuple must be searched in a cascade. In ASC, by contrast, only one region is searched because each region contains the evaluation results for all the queries. In general, ASC requires more space than CEI’s. However, with regard to search complexity, ASC is simpler than CEI’s. In other words, compared to Wu et al. (2006), our approach focuses on reducing evaluation cost rather than storage cost. There are also some efficient filtering algorithms in publish/subscribe systems (Fabret et al. 2001). The scheme of Fabret et al. (2001) groups subscriptions based on their size and common conjunction of equality predicates, and uses multi-attribute hash indices so that several subscription attributes can be evaluated using a single comparison. However, as the number of subscription attributes increases, the multi-attribute hash indices become more complicated to maintain. Moreover, they cannot benefit from short-cut operations in the conjunctive form of selection predicates, and they need extra operations for inequality predicates.

Despite their run-time overhead, several adaptive optimization strategies have also been proposed. Eddies (Avnur and Hellerstein 2000) creates a selection module for an attribute that is used to express a selection predicate. The execution order of multiple selection modules is decided based on their selectivities. It is also changed adaptively by tracking the selectivities over all the tuples the module has processed recently. Bizarro et al. (2005) proposed CBR (Content-Based Routing), which extends Eddies to support different routes for a single data stream. CBR uses adaptive algorithms that partition input data based on statistical properties, and efficiently route individual tuples through customized plans based on their partition. These adaptive methods can degrade the performance of query evaluation, especially when the number of selection modules is large or the data distribution of incoming tuples changes frequently. STREAM (Widom and Babu 2001) uses the A-greedy algorithm, which monitors the on-going selectivity statistics of various partial evaluation sequences for the selection predicates at run-time. If the current order is not optimal, it is rearranged adaptively. However, it can only be applied to a single continuous query. Munagala et al. (2006) propose a shared execution strategy for the purpose of optimizing multiple continuous queries. For a specific incoming tuple, among the shared filters that are not evaluated yet, the next filter to be evaluated is chosen by cost-based analysis at run-time. Since the unit of evaluation scheduling is an individual filter, its run-time complexity can increase rapidly when the number of filters is large. Furthermore, there is no facility that controls the trade-off between query performance and optimization overhead. Wang et al. (2006) proposed a query index tree based on a decision tree. All kinds of predicate indices on p-attributes are integrated into a single index tree. The optimization point of this scheme is how to select dividing p-attributes during the construction of the tree. For this purpose, Wang et al. (2006) use either Information Gain (IG) or Estimated Time Cost (ETC). However, since the evaluation sequence of p-attributes is decided by the internal structure of the index tree, it is hard to optimize the evaluation sequence adaptively. In fact, the index tree must be reconstructed for adaptive optimization.

3 Preliminaries

3.1 ASC: attribute selection construct

As described in Section 1, an ASC stores the pre-computed matching results of all the regions of the corresponding p-attribute. It is constructed for each p-attribute and is formally defined as follows.

Definition 1 (Attribute Selection Construct)

Given a set of continuous queries \(Q = \{q_{1}, q_{2}, \ldots, q_{k}\}\) registered to a relational stream \(D(A_{1}, A_{2}, \ldots, A_{n})\), let \(A_{p}(Q)\subseteq \{A_{1},A_{2},\ldots,A_{n}\}\) denote the set of p-attributes for the query set Q. An attribute selection construct \(ASC(A_{i})\) for a p-attribute \(A_{i} \in A_{p}(Q)\) with m distinct constraining constant values has the following entries:
  • Query-usage bitmap \({\boldsymbol{(qub[1..k])}}\) indicates whether the p-attribute \(A_{i}\) is employed in the selection predicates of the jth query \(q_{j} \in Q\) (1 ≤ j ≤ k) or not. In other words, if \(A_{i} \in A_{p}(\{q_{j}\})\), qub[j] = 1. Otherwise, qub[j] = 0.

  • Region array \({\boldsymbol{(ra[1..2m+1])}}\) has (2m + 1) entries which correspond to (2m + 1) distinct regions. The rth region ra[r] (1 ≤ r ≤ 2m + 1) maintains the following two fields:
    • Region identifier rif indicates the range that the region covers, explicitly or implicitly. If the rth region \(ASC(A_{i}).ra[r]\) is the constant region that only includes the constant C, then \(ASC(A_{i}).ra[r].rif = C\). Otherwise, \(ASC(A_{i}).ra[r].rif = null\), implying that the region covers the values lying strictly between \(ASC(A_{i}).ra[r-1].rif\) and \(ASC(A_{i}).ra[r+1].rif\).

    • Query-result bitmap \(\boldsymbol{(qrb[1..k])}\) stores the pre-computed matching result of a region. If any tuple whose value of the p-attribute \(A_{i}\) falls within the region \(ASC(A_{i}).ra[r]\) cannot satisfy the jth query \(q_{j} \in Q\), then the jth bit of the bitmap is set to 0, i.e., \(ASC(A_{i}).ra[r].qrb[j] = 0\). Otherwise, the jth bit is set to 1, i.e., \(ASC(A_{i}).ra[r].qrb[j] = 1\).

If the jth query \(q_{j}\) does not have any selection predicate for a p-attribute \(A_{i}\), i.e., \(ASC(A_{i}).qub[j] = 0\), then its corresponding query-result bit \(ASC(A_{i}).ra[r].qrb[j]\) should be set to 1 in all the regions of \(ASC(A_{i})\) because the matching result of the query \(q_{j}\) cannot be determined by the p-attribute \(A_{i}\).
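As an illustration of Definition 1, the following Python sketch builds the query-usage bitmap and the (2m + 1)-entry region array for a single p-attribute. It is only a sketch under stated assumptions: the names `build_asc`, `Region` and `OPS` are hypothetical, the predicates of a query on the attribute are assumed to be a conjunction of numeric comparisons against constants, and each bitmap is stored as an integer whose most significant bit corresponds to \(q_{1}\). The Python list `ra` is 0-based, whereas the definition indexes regions from 1.

```python
import operator
from dataclasses import dataclass
from typing import Dict, List, Optional, Tuple

OPS = {"<": operator.lt, "<=": operator.le, "=": operator.eq,
       ">=": operator.ge, ">": operator.gt}

@dataclass
class Region:
    rif: Optional[float]  # the constant of a constant region, None for an open interval
    qrb: int              # query-result bitmap; q_1 is the most significant bit

@dataclass
class ASC:
    qub: int              # query-usage bitmap, same bit order as qrb
    ra: List[Region]      # 2m + 1 regions in ascending order of the domain

def build_asc(preds: Dict[int, List[Tuple[str, float]]], k: int) -> ASC:
    """preds maps a 1-based query index j to the (operator, constant) pairs that
    q_j imposes on this p-attribute; it must contain at least one predicate."""
    consts = sorted({c for ps in preds.values() for _, c in ps})
    # One representative value per region: below c_1, c_1 itself,
    # between c_1 and c_2, ..., above c_m.
    reps = [(None, consts[0] - 1)]
    for idx, c in enumerate(consts):
        reps.append((c, c))
        upper = consts[idx + 1] if idx + 1 < len(consts) else c + 2
        reps.append((None, (c + upper) / 2))
    qub = 0
    for j in preds:
        qub |= 1 << (k - j)
    ra = []
    for rif, value in reps:
        qrb = 0
        for j in range(1, k + 1):
            # A query without predicates on this attribute always keeps its bit at 1.
            if all(OPS[op](value, c) for op, c in preds.get(j, [])):
                qrb |= 1 << (k - j)
        ra.append(Region(rif, qrb))
    return ASC(qub, ra)

# Hypothetical predicates on one p-attribute: q_2 requires A >= 20, q_4 requires A > 20.
asc = build_asc({2: [(">=", 20)], 4: [(">", 20)]}, k=4)
print(format(asc.qub, "04b"), [format(r.qrb, "04b") for r in asc.ra])
# prints: 0101 ['1010', '1110', '1111']
```

Evaluating one representative value per region is sufficient here because no constraining constant lies strictly inside an open region, so every value of a region yields the same comparison results.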

Example 1

In Fig. 1, the query-result bitmap \(ASC(A_{2}).ra[2].qrb = 1110\) indicates that only the query \(q_{4}\) is not satisfied if the \(A_{2}\) value of an incoming tuple is equal to 20. In this region, the queries \(q_{1}\) and \(q_{3}\) are also considered to be satisfied because the p-attribute \(A_{2}\) is not used in their selection predicates.
https://static-content.springer.com/image/art%3A10.1007%2Fs10844-011-0192-1/MediaObjects/10844_2011_192_Fig1_HTML.gif
Fig. 1 The constructing process of ASCs

3.2 Run-time evaluation of ASC’s

Given a set of continuous queries Q = {q 1, q 2, ..., q k }, the ASC’s of their p-attributes A p (Q) are processed for every incoming tuple one by one in sequence. A global-result bitmap denoted by a GRB[1..k] is introduced to accumulate the intermediate matching result of each ASC in the course of query evaluation. Initially, all the bits of the bitmap GRB are set to 1’s, assumed that all of the queries in Q satisfy an incoming tuple t D of the data stream. The ASC of a p-attribute A i ASC(A i ) is evaluated as follows: Among the disjoint regions of ASC(A i ).ra, the one that includes the A i ’s attribute value of the incoming tuple t D is identified first by binary-searching, based on the constant values of its region-identifiers. Subsequently, let the identified region be ASC(A i ).ra[r]. Its query-result bitmap ASC(A i ).ra[r].qrb is bitwise ANDed with the global result bitmap GRB. The result of this operation is reassigned to GRB. If the updated result of GRB is 0, the incoming tuple t D is dropped immediately even if there are some ASC’s left to be evaluated. Otherwise, the matching process is continued by evaluating the next ASC. When all the ASC’s have been processed successfully, the queries satisfied by the incoming tuple t D are identified.
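The evaluation loop described above can be sketched in Python as follows. This is a minimal sketch, not the authors' implementation: the function names, the dictionary layout of `ascs` (sorted constants plus one query-result bitmap per region) and all sample values are assumptions made for illustration. Regions are 0-based, so the constant region of the ith constant (0-based) has index 2i + 1 and the open interval below it has index 2i.

```python
from bisect import bisect_left

def find_region(consts, value):
    """Binary-search the 0-based index of the region containing value."""
    i = bisect_left(consts, value)
    if i < len(consts) and consts[i] == value:
        return 2 * i + 1   # constant region
    return 2 * i           # open interval below consts[i] (or above the last constant)

def evaluate(t, ascs, order, k):
    """Return (GRB, number of ASC's evaluated) for the tuple t."""
    grb = (1 << k) - 1     # initially every query is assumed to be satisfied
    evaluated = 0
    for attr in order:
        consts, qrbs = ascs[attr]
        grb &= qrbs[find_region(consts, t[attr])]
        evaluated += 1
        if grb == 0:       # no query can match any more: drop the tuple early
            break
    return grb, evaluated

# Hypothetical ASC's for k = 4 queries (bitmaps written as q_1 q_2 q_3 q_4).
ascs = {
    "A1": ([50], [0b1101, 0b0111, 0b0101]),
    "A2": ([20], [0b1010, 0b1110, 0b1111]),
    "A3": ([5],  [0b1010, 0b1110, 0b1111]),
}
print(evaluate({"A1": 70, "A2": 10, "A3": 9}, ascs, ["A1", "A2", "A3"], 4))
# prints: (0, 2)  -> dropped after two ASC's, the third is never evaluated
```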

4 Evaluation sequence

Given a set of continuous queries Q, if the number of p-attributes in the query set Q is more than one, i.e., \(\vert A_{p}(Q)\vert > 1\), the evaluation sequence of their corresponding ASC’s can significantly affect the query performance due to the difference in the filtering capabilities of the p-attributes. Therefore, finding an optimized evaluation sequence is very important. The proposed method determines the evaluation order of ASC’s at two different levels. The first determines the order based on the overall filtering capability of each ASC by averaging the filtering capabilities of all of its regions. The second determines it based on the individual capability of each region of an individual ASC. Given two ASC’s, an evaluation order identified by the first level is called a macro-arrow, whereas an evaluation order identified by the second level is called a micro-arrow. A micro-arrow starts from a specific region of one ASC and ends at the other ASC. This paper proposes two different strategies that find the overall evaluation sequence of multiple ASC’s. One determines the evaluation sequence only by a sequence of macro-arrows. It is called a macro-sequence. The other determines the evaluation sequence by a sequence of macro/micro arrows. It is called a hybrid-sequence. In a hybrid-sequence, a micro-arrow is used when the filtering capability of a specific region of an ASC is much more selective. While the evaluation order of a macro-sequence is fixed, that of a hybrid-sequence may differ according to the attribute values of an incoming tuple, as in (Bizarro et al. 2005). In order to find the evaluation sequence at run-time, a monitoring module should be kept apart from the executor, like the StreaMon in STREAM (Widom and Babu 2001). It identifies the optimized sequence while the continuous queries are being executed, based on the selectivities of ASC’s.

4.1 Extension of ASC

To implement the proposed evaluation sequence, the structure of ASC is slightly extended. Based on Definition 1, a new entry called a candidate-arrow bitmap is added to the region array of ASC. It is defined as follows:
  • Candidate-arrow bitmap \(\boldsymbol{(cab[1..\vert A_{p}(Q)\vert])}\) indicates a set of ASC’s which are candidates for the next ASC to be evaluated. For those tuples whose values of the p-attribute A i are in the region ASC(A i ).ra[r], \(ASC(A_{j}) (j\ne i)\) is a candidate of ASC(A i ) if and only if it can fail at least one query in Q right after evaluating ASC(A i ). In other words, if ASC(A i ).ra[r].qrb & \(ASC(A_{j}).qub \ne 0\), ASC(A i ).ra[r].cab[j] = 1. A micro-arrow from the region ASC(A i ).ra[r] is established to one of those ASC’s whose corresponding bits of ASC(A i ).ra[r].cab are 1’s through monitoring their selectivities at run-time.
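The candidate test above reduces to one bitwise AND per ASC. A minimal Python sketch follows, assuming bitmaps are stored as integers; the function name and the sample bitmaps are hypothetical and are not taken from Fig. 2.

```python
def candidate_arrow_bitmap(qrb, qubs, i):
    """Candidate-arrow bits of one region.

    qrb  -- query-result bitmap of the region ASC(A_i).ra[r]
    qubs -- query-usage bitmaps of all ASC's, in p-attribute order (0-based)
    i    -- 0-based position of the ASC that owns the region
    """
    return [1 if j != i and (qrb & qub_j) != 0 else 0
            for j, qub_j in enumerate(qubs)]

# Hypothetical bitmaps for four ASC's and k = 4 queries.
qubs = [0b1010, 0b0101, 0b1011, 0b0100]
# A region of the second ASC in which only q_2 has failed.
print(candidate_arrow_bitmap(0b1011, qubs, 1))
# prints: [1, 0, 1, 0]
```

In this made-up case the fourth ASC is not a candidate because it is used only by \(q_{2}\), which has already failed in the region, so evaluating it next could not fail any further query.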

Figure 1 shows how the ASCs of conjunctive continuous queries are constructed. The constant values of the selection predicates used in the four queries are arranged by their p-attributes in Fig. 1b. Subsequently, in Fig. 1c, the domain of each p-attribute is divided into a set of disjoint regions based on the constant values. Since the basic components of an ASC are constructed at compile-time, maintaining ASC’s causes nearly negligible run-time overhead.

Example 1

In Fig. 2, the candidate-arrow bitmap ASC(A 2).ra[3].cab = 1010 indicates that a micro-arrow from this region can be destined to either ASC(A 1) or ASC(A 3). A micro-arrow whose destination is ASC(A 4) is not established because ASC(A 2).ra[3].qrb & ASC(A 4).qub = 0.
https://static-content.springer.com/image/art%3A10.1007%2Fs10844-011-0192-1/MediaObjects/10844_2011_192_Fig2_HTML.gif
Fig. 2 The extension (candidate-arrow bitmap) of Fig. 1

Example 2

Figure 3 illustrates how the queries in Fig. 1 are evaluated by an evaluation sequence of macro-arrows ASC(A 1) → ASC(A 2) → ASC(A 3) → ASC(A 4). For the tuple t 2, the matched region of ASC(A 1) is the last region whose boundary is (70, ∞) because t 2[A 1] = 70. The query-result bitmap ASC(A 1).ra[5].qrb = 0101 is bitwise ANDed with the global-result bitmap GRB whose bits are initialized to all 1’s. The result of this operation makes GRB = 0101. This means that the queries q 1 and q 3 are terminated because they do not satisfy the tuple t 2. In the same way, ASC(A 2), ASC(A 3) and ASC(A 4) are also processed one by one. Since the value of the bitmap GRB is finally set to 0001, only the query q 4 satisfies the tuple t 2. For the tuples t 1 and t 3, all of the four queries turn out to be unsatisfied after processing the third ASC ASC(A 3). Consequently, both of t 1 and t 3 are dropped and the remaining ASC ASC(A 4) is not evaluated.
https://static-content.springer.com/image/art%3A10.1007%2Fs10844-011-0192-1/MediaObjects/10844_2011_192_Fig3_HTML.gif
Fig. 3 Run-time evaluation of ASC’s

4.2 Macro-sequence

4.2.1 Minimal cover set

The overall performance of evaluating the queries depends on how early unmatched tuples are dropped because all the queries are collectively processed in the proposed scheme. An incoming tuple of a data stream can be dropped only after the fact that it cannot satisfy any query in a query set Q is determined. If a specific query q ∈ Q does not have any selection predicate for a p-attribute A i , the ASC of the p-attribute A i cannot be employed to determine the matching result of the query q. Therefore, for efficient tuple dropping, it is important to first evaluate those p-attributes that can collectively provide the complete matching result of every query in Q. Such a set of p-attributes is called a cover set. A cover set with the smallest cardinality is defined to be a minimal cover set. A tuple cannot be dropped until all the p-attributes of a minimal cover set are evaluated. A minimal cover set is formally defined, as follows.

Definition 2 (Minimal cover set)

Given a set of continuous queries Q = {q 1, q 2, ..., q k } registered to a data stream D(A 1, A 2, ..., A n ), let V be a subset of p-attributes i.e., \(V \subseteq A_{p}(Q)\). If all the query-usage bitmaps of the ASC’s corresponding to the p-attributes in V are bitwise ORed to all 1’s, then the set V is called a cover set CS(Q) for the queries in Q. A cover set CS(Q) with the smallest cardinality is a minimal cover set MCS(Q).
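Definition 2 can be checked directly on the query-usage bitmaps. The Python sketch below enumerates subsets in order of increasing cardinality, which is only meant to make the definition concrete; it is not the greedy, selectivity-driven search that the paper uses to identify MCS(Q). The function names and the sample bitmaps are hypothetical.

```python
from itertools import combinations

def is_cover_set(attrs, qub, k):
    """True if the query-usage bitmaps of attrs OR together to all 1's."""
    acc = 0
    for a in attrs:
        acc |= qub[a]
    return acc == (1 << k) - 1

def minimal_cover_sets(qub, k):
    """All cover sets of the smallest cardinality (brute force)."""
    for size in range(1, len(qub) + 1):
        found = [set(c) for c in combinations(sorted(qub), size)
                 if is_cover_set(c, qub, k)]
        if found:
            return found
    return []

# Hypothetical query-usage bitmaps for k = 4 queries.
qub = {"A1": 0b1010, "A2": 0b0101, "A3": 0b1100, "A4": 0b0011}
print(minimal_cover_sets(qub, 4))
```

With these made-up bitmaps the two minimal cover sets are {A1, A2} and {A3, A4}; no single p-attribute covers all four queries.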

Example 3

In Fig. 3, MCS(Q) = {A 1, A 2} or {A 3, A 4} because ASC(A 1).qub & ASC(A 2).qub = 0 and ASC(A 3).qub & ASC(A 4).qub = 0.

4.2.2 Finding macro-sequence

A macro-sequence is complete and unique because it arranges the ASC’s of all the p-attributes for a given query set according to their average filtering capabilities. Based on the concept of a minimal cover set, a macro-sequence for a query set Q can be found by two phases. In the first phase, a minimal cover set MCS(Q) is identified, and the sequence of its p-attributes is also identified simultaneously during finding MCS(Q). In the second phase, a complete macro-sequence is identified by arranging the p-attributes excluded from MCS(Q).

To find MCS(Q), a p-attributes sequence ρ is expanded in a greedy manner, and those queries whose matching results are already determined by the previous p-attributes of the sequence ρ are temporarily excluded from the query set Q. For a set of k continuous queries \(Q = \{q_{1}, q_{2}, \ldots, q_{k}\}\) and a p-attribute \(A_{i} \in A_{p}(Q)\), the corresponding attribute value of an incoming tuple should be evaluated for all the queries in Q in order to decide whether it can satisfy each of the queries or not. Let the term evaluation instance denote an evaluation task of an individual query for an incoming tuple on a specific ASC. For every incoming tuple, k individual evaluation instances therefore take place. The conditional selectivity of a p-attributes sequence is formally defined as follows.

Definition 3 (Conditional selectivity)

Given a set of k continuous queries \(Q = \{q_{1}, q_{2}, \ldots, q_{k}\}\) registered to a data stream \(D(A_{1}, A_{2}, \ldots, A_{n})\), let a p-attributes sequence ρ be a partial sequence of p-attributes which is not a minimal cover set MCS(Q) yet. Suppose that the selection predicates for the p-attributes of the sequence ρ cover only w queries in Q (w < k). To find the complete evaluation sequence of MCS(Q), the sequence ρ is repeatedly expanded by appending one of the remaining p-attributes in \(A_{p}(Q)\). When a p-attribute \(A_{v} \notin \rho\) is appended to the sequence ρ, the conditional selectivity \(s_{\tau}(\rho \to A_{v} \mid \rho)\) of the expanded sequence \(\rho \to A_{v}\) for the sequence ρ in a fixed period τ is defined as follows:
$$ s_{\tau}\left(\rho \to A_{v} \mid \rho\right)=\frac{EI_{\tau}\left(\rho \to A_{v} \mid \rho\right)}{\left(k-w\right)\times T_{\tau}} \tag{1} $$
where \(T_{\tau}\) denotes the total number of tuples generated in the period τ, and \(EI_{\tau}(\rho \to A_{v} \mid \rho)\) denotes the number of evaluation instances (EI) successfully passed by the sequence \(\rho \to A_{v}\) for those (k − w) remaining queries whose results are not determined by the sequence ρ.

This conditional selectivity is employed until the currently expanding evaluation sequence of p-attributes becomes a minimal cover set MCS(Q). The number of queries covered by a partial sequence ρ can be traced efficiently by the query-usage bitmaps of the ASC’s in the sequence ρ. It can be obtained by counting the number of 1’s in the result of a bitwise OR operation on all the query-usage bitmaps of the ASC’s. Furthermore, the value of \(EI_{\tau}(\rho \to A_{v} \mid \rho)\) can also be efficiently found at run-time by the global-result bitmap GRB as follows. Let \(Q_{p}(\rho)\) be the set of queries covered by the sequence ρ. Since every query \(q \in Q_{p}(\rho)\) is excluded from computing the conditional selectivity of the sequence \(\rho \to A_{v}\), if \(T_{\tau}\) tuples \(\{t_{1}, \ldots, t_{T_{\tau}}\}\) are generated in a period τ, \(EI_{\tau}(\rho \to A_{v} \mid \rho)\) for a query set \(Q = \{q_{1}, q_{2}, \ldots, q_{k}\}\) is expressed as follows:
$$ EI_{\tau}\left(\rho \to A_{v} \mid \rho\right)=\sum\limits_{i=1}^{T_{\tau}} \eta\left(\rho \to A_{v}, t_{i}\right) \tag{2} $$
where \(\eta(\rho \to A_{v}, t_{i})\) denotes the number of 1’s in those bit positions of GRB[1..k] that correspond to the queries in \(Q - Q_{p}(\rho)\) after the ith tuple \(t_{i}\) is evaluated for the sequence \(\rho \to A_{v}\).
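Equations (1) and (2) can be computed from the GRB values collected in one monitoring period. The Python sketch below assumes that the monitor hands over one GRB integer per tuple, obtained after evaluating the expanded sequence, together with the bitwise OR of the query-usage bitmaps of ρ. The function names and the sample values are hypothetical.

```python
from fractions import Fraction

def popcount(x):
    return bin(x).count("1")

def conditional_selectivity(grbs, covered, k):
    """Conditional selectivity of an expanded sequence (Eqs. 1 and 2).

    grbs    -- GRB after evaluating the expanded sequence, one per tuple of the period
    covered -- OR of the query-usage bitmaps of the ASC's already in rho
    k       -- number of queries
    """
    remaining = ((1 << k) - 1) & ~covered       # bit positions of Q - Q_p(rho)
    w = popcount(covered)                       # number of queries covered by rho
    if w == k or not grbs:
        raise ValueError("needs w < k and at least one tuple in the period")
    ei = sum(popcount(g & remaining) for g in grbs)   # Eq. (2)
    return Fraction(ei, (k - w) * len(grbs))          # Eq. (1)

# Empty rho, one tuple whose GRB is 0101 after evaluating a single ASC.
print(conditional_selectivity([0b0101], 0b0000, 4))   # prints: 1/2
# rho already covers q_2 and q_4; the expanded sequence fails both remaining queries.
print(conditional_selectivity([0b0100], 0b0101, 4))   # prints: 0
```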

After identifying MCS(Q), a complete macro-sequence is determined by extending the evaluation sequence of MCS(Q) one by one. Among all the candidate sequences, the one with the highest tuple dropping ratio is chosen. The tuple dropping ratio of a p-attributes sequence is formally defined as follows.

Definition 4 (Tuple dropping ratio)

Given a set of continuous queries \(Q = \{q_{1}, q_{2}, \ldots, q_{k}\}\) registered to a data stream \(D(A_{1}, A_{2}, \ldots, A_{n})\) and a p-attributes sequence \(\rho = A_{x} \to \ldots \to A_{z}\) including an MCS(Q), let the set of p-attributes in ρ be denoted by \(A_{p}(\rho)\), where \(MCS(Q) \subseteq A_{p}(\rho) \subseteq A_{p}(Q)\). Let \(T_{\tau}^{tot}\) denote the total number of tuples generated in a fixed period τ and \(T_{\tau}^{unm}(\rho)\) denote the number of tuples unmatched by the sequence ρ during the same period. The tuple dropping ratio \(d_{\tau}(\rho)\) of the sequence ρ is defined as follows:
$$ d_{\tau}\left(\rho\right)=\frac{T_{\tau}^{unm}\left(\rho\right)}{T_{\tau}^{tot}} \tag{3} $$

If two or more candidate sequences have the same dropping ratio, the one whose last ASC has the lowest selectivity is chosen. The selectivity of an individual ASC is found when deciding the evaluation sequence of length 1 in the first phase. This procedure is continued repeatedly until the evaluation sequence contains all the p-attributes in A p (Q).
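The selection rule of the second phase can be sketched as follows. The sketch assumes that each candidate extension comes with the final GRB values of the tuples of one period and with the previously measured selectivity of its last ASC; the function names and the sample numbers are hypothetical.

```python
from fractions import Fraction

def dropping_ratio(final_grbs):
    """Eq. (3): share of tuples whose GRB is all 0's after the sequence."""
    return Fraction(sum(1 for g in final_grbs if g == 0), len(final_grbs))

def choose_extension(candidates):
    """candidates: list of (p-attribute, final GRBs, selectivity of its ASC).

    The highest dropping ratio wins; ties go to the lowest selectivity."""
    best = max(candidates, key=lambda c: (dropping_ratio(c[1]), -c[2]))
    return best[0]

# Three tuples, two of them dropped by the sequence.
print(dropping_ratio([0b0000, 0b0001, 0b0000]))        # prints: 2/3
# Equal dropping ratios: the ASC with the lower selectivity is appended.
print(choose_extension([("A3", [0], Fraction(3, 4)),
                        ("A4", [0], Fraction(1, 2))]))  # prints: A4
```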

Example 4

In Fig. 3, let \(\rho = A_{1} \to A_{2} \to A_{3}\) be a p-attributes sequence. If three tuples \(t_{1}\), \(t_{2}\) and \(t_{3}\) are generated in a period τ, then the tuples \(t_{1}\) and \(t_{3}\) are dropped by the sequence ρ. Therefore, the tuple dropping ratio \(d_{\tau}(\rho)\) is 2/3.

4.2.3 Implementation

In a monitoring module, for a fixed period τ, a partial macro-sequence is extended by one of the remaining p-attributes. Let n be the number of p-attributes. The time to find a new complete macro-sequence is τ × (n − 1) because all of the p-attributes should participate in the sequence. To measure the conditional selectivities of different candidate sequences in each period, an instance counter for each candidate sequence is needed. The instance counter of a candidate sequence keeps the cumulative count of 1’s in the GRB’s resulting from evaluating all the incoming tuples on its candidate sequence for a specific period. Its value is utilized to compute the conditional selectivity of a p-attributes sequence in the first phase, which finds MCS(Q). On the other hand, in order to measure the tuple dropping ratios of different candidate sequences in each period, it is only checked whether an incoming tuple is dropped by a target candidate sequence or not. This task is performed by examining the bit-values of GRB after evaluating the sequence. If they are all 0’s, the tuple should be dropped.
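The two-phase greedy search can be sketched offline in Python as follows. Several simplifications are assumptions of this sketch rather than the paper's method: one fixed sample of tuples stands in for the successive monitoring periods, the first phase stops as soon as the chosen p-attributes cover every query (which yields a cover set found greedily, not a proven minimum), and ties are not broken by ASC selectivity. All names and sample values are hypothetical.

```python
from bisect import bisect_left
from fractions import Fraction

def popcount(x):
    return bin(x).count("1")

def grb_after(seq, t, ascs, k):
    """GRB of tuple t after evaluating the ASC's of seq in order."""
    grb = (1 << k) - 1
    for a in seq:
        consts, qrbs, _qub = ascs[a]
        i = bisect_left(consts, t[a])
        r = 2 * i + 1 if i < len(consts) and consts[i] == t[a] else 2 * i
        grb &= qrbs[r]
    return grb

def find_macro_sequence(ascs, k, sample):
    full = (1 << k) - 1
    seq, covered, remaining = [], 0, sorted(ascs)
    while remaining:
        def score(a):
            grbs = [grb_after(seq + [a], t, ascs, k) for t in sample]
            if covered != full:
                # Phase 1: lower conditional selectivity is better.
                rest = full & ~covered
                ei = sum(popcount(g & rest) for g in grbs)
                return Fraction(ei, popcount(rest) * len(grbs))
            # Phase 2: higher dropping ratio is better, so negate it.
            return -Fraction(sum(g == 0 for g in grbs), len(grbs))
        best = min(remaining, key=score)
        seq.append(best)
        covered |= ascs[best][2]
        remaining.remove(best)
    return seq

# Hypothetical ASC's: (sorted constants, qrb per region, qub), k = 4.
ascs = {
    "A1": ([50], [0b1101, 0b0111, 0b0101], 0b1010),
    "A2": ([20], [0b1010, 0b1110, 0b1111], 0b0101),
    "A3": ([5],  [0b0011, 0b0111, 0b1111], 0b1100),
}
sample = [{"A1": 70, "A2": 10, "A3": 9}, {"A1": 10, "A2": 30, "A3": 9}]
print(find_macro_sequence(ascs, 4, sample))
# prints: ['A1', 'A2', 'A3']
```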

Example 5

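In Fig. 3, suppose that the tuples \(t_{1}\), \(t_{2}\) and \(t_{3}\) are generated in the periods \(\tau_{1}\), \(\tau_{2}\) and \(\tau_{3}\) respectively. Let \(\rho_{i}\) denote a macro-sequence whose length is i. For the tuple \(t_{1}\) in the period \(\tau_{1}\), \(\rho_{1} = A_{2}\) because \(s_{\tau_{1}}(A_{1}) = 3/4\), \(s_{\tau_{1}}(A_{2}) = 1/2\), \(s_{\tau_{1}}(A_{3}) = 3/4\) and \(s_{\tau_{1}}(A_{4}) = 3/4\). Subsequently, for the tuple \(t_{2}\) in the period \(\tau_{2}\), \(\rho_{2} = A_{2} \to A_{1}\) because \(s_{\tau_{2}}(A_{2} \to A_{1} \mid A_{2}) = 0\), \(s_{\tau_{2}}(A_{2} \to A_{3} \mid A_{2}) = 1/2\) and \(s_{\tau_{2}}(A_{2} \to A_{4} \mid A_{2}) = 1\). At this point, since \(ASC(A_{2}).qub\) & \(ASC(A_{1}).qub = 0\), \(MCS(Q) = \{A_{2}, A_{1}\}\). After finding MCS(Q), for the tuple \(t_{3}\) in the period \(\tau_{3}\), the dropping ratios of the two candidate sequences, \(d_{\tau_{3}}(A_{2} \to A_{1} \to A_{3})\) and \(d_{\tau_{3}}(A_{2} \to A_{1} \to A_{4})\), are compared. Since \(d_{\tau_{3}}(A_{2} \to A_{1} \to A_{3}) = 1\) and \(d_{\tau_{3}}(A_{2} \to A_{1} \to A_{4}) = 0\), \(\rho_{3} = A_{2} \to A_{1} \to A_{3}\). Accordingly, a macro-sequence \(\rho_{4} = A_{2} \to A_{1} \to A_{3} \to A_{4}\) is found.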

The processing time for finding a macro-sequence mainly depends on the number of p-attributes because its main task is comparing the conditional selectivities or dropping ratios of the remaining p-attributes with one another. In the first phase (period) of a monitoring module, n p-attributes are compared with one another so that one of the p-attributes is initially assigned to the macro-sequence. Subsequently, in the second phase, the (n − 1) remaining p-attributes are compared. In the same way, the remaining p-attributes are compared until all the p-attributes are assigned to the macro-sequence. Therefore, the time complexity of finding a complete macro-sequence for (n − 1) periods is as follows:
$$ n+\left(n-1\right)+\left(n-2\right)+\ldots+1=\sum_{i=0}^{n-1}\left(n-i\right)=O\left(n^{2}\right) \tag{4} $$

4.3 Hybrid-sequence

Each region of an ASC can have a number of micro-arrows. A hybrid-sequence utilizes the filtering capability of a micro-arrow to enhance the performance of a macro-sequence. The procedure for finding a hybrid-sequence is similar to that for finding a macro-sequence. Starting from the first ASC of a macro-sequence, the possibility of employing any micro-arrow is examined. Given a partially identified hybrid-sequence \(\rho^{\prime} = \rho \to ASC(A_{i})\) for a query set Q, if the region ra[r] of \(ASC(A_{i})\) is visited by incoming tuples in a monitoring period τ, one micro-arrow is chosen among the candidate micro-arrows of the region \(ASC(A_{i}).ra[r]\), based on their monitored filtering capabilities. The candidate micro-arrows are indicated in the candidate-arrow bitmap \(ASC(A_{i}).ra[r].cab\). For each \(ASC(A_{j})\) such that \(A_{j} \notin A_{p} (\rho^{\prime})\) and \(ASC(A_{i}).ra[r].cab[j] = 1\), the filtering capability of a candidate micro-arrow \(ASC(A_{i}).ra[r] \to ASC(A_{j})\) is measured as the conditional selectivity \(s_{\tau} (\rho^{\prime} \to ASC(A_{j})\vert \rho^{\prime})\) if \(A_{p}(\rho^{\prime})\subset MCS(Q)\). Otherwise, it is measured as the tuple dropping ratio \(d_{\tau} (\rho^{\prime} \to ASC(A_{j}))\). The conditional selectivity of each candidate micro-arrow is slightly different from Definition 3. It identifies the set of uncovered queries based on the values of the global-result bitmap GRB evaluated by the sequence \(\rho^{\prime}\). In other words, in order to measure the conditional selectivity, only the queries whose corresponding bits of GRB are 1’s are targeted. From the region \(ASC(A_{i}).ra[r]\), if no candidate micro-arrow exists or the filtering capability of the candidate is lower than that of the macro-arrow of \(ASC(A_{i})\) in the macro-sequence, the micro-arrow of the region \(ASC(A_{i}).ra[r]\) is not established. There can be several hybrid-sequences because each region of an ASC can possibly establish its own micro-arrow. These hybrid-sequences are expanded concurrently. The filtering capability of a specific hybrid-sequence is measured against only those tuples that have visited the series of the ASC regions in the sequence.

Example 6

In Fig. 4, suppose that the first ASC in a hybrid-sequence is \(ASC(A_{1})\) and that the tuples \(t_{4}\) and \(t_{5}\) are generated in the periods \(\tau_{1}\) and \(\tau_{2}\), respectively. For the tuple \(t_{4}\) in the period \(\tau_{1}\), the micro-arrow becomes \(ASC(A_{1}).ra[5] \to ASC(A_{2})\) because \(s_{\tau_{1}}(ASC(A_{1}).ra[5] \to ASC(A_{2})) = 0\) is lower than \(s_{\tau_{1}}(ASC(A_{1}).ra[5] \to ASC(A_{4})) = 1/4\). At this point, \(MCS(Q) = \{A_{1}, A_{2}\}\) because \(ASC(A_{1}).qub \,\&\, ASC(A_{2}).qub = 0\). Therefore, for the tuple \(t_{5}\) in the period \(\tau_{2}\), the tuple dropping ratios \(d_{\tau_{2}}(ASC(A_{1}).ra[5] \to ASC(A_{2}).ra[5] \to ASC(A_{3}))\) and \(d_{\tau_{2}}(ASC(A_{1}).ra[5] \to ASC(A_{2}).ra[5] \to ASC(A_{4}))\) are compared. Since the former is 0 and the latter is 1, the latter drops more tuples, and the next micro-arrow becomes \(ASC(A_{2}).ra[5] \to ASC(A_{4})\).
https://static-content.springer.com/image/art%3A10.1007%2Fs10844-011-0192-1/MediaObjects/10844_2011_192_Fig4_HTML.gif
Fig. 4 The process of identifying a hybrid-sequence
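The selection rule for a single region can be sketched in a few lines. This is a minimal Python illustration, not the authors' C implementation: the function `choose_micro_arrow`, its parameters, and the dictionary-based bitmap are hypothetical names introduced here. It assumes that the monitored statistics for the period τ are already available, and the bitmap contents in the usage lines are assumed so that the candidates match those compared in Example 6.

```python
from typing import Dict, Optional, Set


def choose_micro_arrow(
    cab: Dict[int, int],
    evaluated: Set[int],
    inside_mcs: bool,
    stats: Dict[int, float],
    macro_stat: Optional[float] = None,
) -> Optional[int]:
    """Return the index j of the ASC(A_j) the region should point to, or None.

    cab:        candidate-arrow bitmap of the region (attribute index -> 0/1)
    evaluated:  p-attributes already in the partial sequence rho'
    inside_mcs: True while A_p(rho') is a proper subset of MCS(Q)
    stats:      monitored measure per candidate, i.e. conditional selectivity
                when inside_mcs is True and tuple dropping ratio otherwise
    macro_stat: the same measure for the macro-arrow, if it is known
    """
    candidates = [j for j, bit in cab.items() if bit == 1 and j not in evaluated]
    if not candidates:
        return None  # no candidate micro-arrow exists
    if inside_mcs:
        # A lower conditional selectivity means a stronger filter.
        best = min(candidates, key=lambda j: stats[j])
        weaker = macro_stat is not None and stats[best] > macro_stat
    else:
        # A higher tuple dropping ratio means a stronger filter.
        best = max(candidates, key=lambda j: stats[j])
        weaker = macro_stat is not None and stats[best] < macro_stat
    return None if weaker else best


# Example 6, period tau_1: region ASC(A_1).ra[5], candidates A_2 and A_4.
first = choose_micro_arrow({2: 1, 3: 0, 4: 1}, {1}, True, {2: 0.0, 4: 0.25})
# Example 6, period tau_2: region ASC(A_2).ra[5], candidates A_3 and A_4.
second = choose_micro_arrow({3: 1, 4: 1}, {1, 2}, False, {3: 0.0, 4: 1.0})
print(first, second)  # prints "2 4"
```

Passing `macro_stat` lets the caller suppress a micro-arrow that filters less effectively than the macro-arrow; when it is omitted, the best candidate is always returned.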

5 Evaluation and adaptation

5.1 Evaluation

If a macro-sequence is chosen as the run-time evaluation order, the ASC’s are evaluated in the order in which they appear in the sequence. In a hybrid-sequence, a macro-arrow serves as the fallback for a missing micro-arrow: the evaluation first tries to follow the micro-arrow owned by the visited region of the currently evaluated ASC and, if none exists, follows the macro-arrow of the macro-sequence. Consequently, the run-time evaluation order of the ASC’s may differ according to the p-attribute values of an incoming tuple. For this reason, the set of unevaluated ASC’s differs from tuple to tuple and must be traced continuously. For this purpose, an additional bitmap called a Global Evaluation Bitmap (GEB) is introduced. Given a set of p-attributes \(A_{p}(Q)\), the bitmap GEB has \(\vert A_{p}(Q) \vert\) bits, each of which corresponds to a distinct p-attribute. It is initialized to all 0’s. When the ASC of a specific p-attribute has been evaluated, the corresponding bit of GEB is set to 1. Accordingly, the set of unevaluated ASC’s at any point can be identified from GEB. When there is no micro-arrow in the visited region of the evaluated ASC, the unevaluated ASC that appears earliest in the macro-sequence is evaluated next.

Example 7

Fig. 5 illustrates how a hybrid-sequence is evaluated. Suppose that a macro-sequence is \(\rho^{macro}=A_{1} \to A_{2} \to A_{3} \to A_{4}\) and there are three micro-arrows \(\rho_1^{micro} = ASC(A_{1}).ra[5] \to ASC(A_{2})\), \(\rho_2^{micro} = ASC(A_{2}).ra[5] \to ASC(A_{4})\) and \(\rho_3^{micro} = ASC(A_{4}).ra[5] \to ASC(A_{3})\). For the tuple \(t_{6}\), according to its attribute values, the micro-arrows \(\rho_1^{micro}\) and \(\rho_2^{micro}\) are followed until the tuple is dropped. For the tuple \(t_{7}\), since there is no micro-arrow from the region \(ASC(A_{2}).ra[2]\) containing \(t_{7}[A_{2}] = 20\), the first unevaluated ASC in the sequence \(\rho^{macro}\), namely \(ASC(A_{3})\), is evaluated next.
https://static-content.springer.com/image/art%3A10.1007%2Fs10844-011-0192-1/MediaObjects/10844_2011_192_Fig5_HTML.gif
Fig. 5 Run-time evaluation by a hybrid-sequence
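The run-time walk described in Section 5.1 can be sketched as follows. This is a minimal Python illustration under stated assumptions, not the authors' implementation: `evaluate_hybrid` and the `probe` callback are hypothetical names, the regions visited by \(t_{6}\) and \(t_{7}\) outside those named in Example 7 are invented for the usage lines, and a micro-arrow whose target has already been evaluated is assumed to fall back to the macro-sequence.

```python
from typing import Callable, Dict, List, Optional, Tuple


def evaluate_hybrid(
    macro_sequence: List[int],
    micro_arrows: Dict[Tuple[int, int], int],
    probe: Callable[[int], Tuple[int, bool]],
) -> List[int]:
    """Return the order in which the ASC's are evaluated for one tuple.

    macro_sequence: p-attribute indices in macro-sequence order
    micro_arrows:   (attribute, region) -> attribute the micro-arrow points to
    probe:          evaluates one ASC and returns (visited region, dropped?)
    """
    geb = {a: 0 for a in macro_sequence}  # Global Evaluation Bitmap, all 0's
    order: List[int] = []
    current: Optional[int] = macro_sequence[0]
    while current is not None:
        region, dropped = probe(current)
        geb[current] = 1  # mark this ASC as evaluated
        order.append(current)
        if dropped:
            break  # the tuple matches no query, so evaluation stops
        target = micro_arrows.get((current, region))
        if target is not None and geb[target] == 0:
            current = target  # follow the micro-arrow of the visited region
        else:
            # Fall back to the earliest unevaluated ASC of the macro-sequence.
            current = next((a for a in macro_sequence if geb[a] == 0), None)
    return order


macro = [1, 2, 3, 4]
arrows = {(1, 5): 2, (2, 5): 4, (4, 5): 3}
# Hypothetical probe results: attribute -> (visited region, dropped?)
t6 = {1: (5, False), 2: (5, False), 4: (5, True)}
t7 = {1: (5, False), 2: (2, False), 3: (1, True)}
print(evaluate_hybrid(macro, arrows, t6.__getitem__))  # prints [1, 2, 4]
print(evaluate_hybrid(macro, arrows, t7.__getitem__))  # prints [1, 2, 3]
```

The length of the returned list is the number of evaluated ASC’s for the tuple, which is the unit in which Section 6 measures the evaluation cost.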

5.2 Adaptive rearrangement

Since the filtering capability of each p-attribute for a query set Q can vary over time, the tuple dropping ratio of the current evaluation sequence of ASC’s should be monitored periodically in order to keep the sequence as efficient as possible. This period is called the re-computation period λ for a tuple dropping ratio. The A-greedy algorithm (Widom and Babu 2001) addresses only how to adaptively adjust the evaluation sequence of filters in a single query; it does not specify exactly when to start the adjustment. In this paper, the current evaluation sequence is rearranged whenever its tuple dropping ratio changes by more than a specified threshold μ, called a rearrangement threshold. Given the current evaluation sequence ρ, let \(d_{init}(\rho)\) denote the initial dropping ratio of the sequence ρ when it was selected as the current sequence, and let \(d_{cur}(\rho)\) denote the currently monitored dropping ratio of the sequence ρ. The latter is computed against the tuples generated in the last re-computation period λ in order to reflect the recent variation of the dropping ratio. Whenever the following condition is satisfied, the arrangement of the ASC’s in the monitoring module described in Sections 4.2 and 4.3 is performed again to replace the current evaluation sequence.
$$ \Delta d\left( \rho \right)=\left| {\frac{d_{cur} \left( \rho \right)-d_{init} \left( \rho \right)}{d_{init} \left( \rho \right)}} \right|\ge \mu \tag{5} $$
The proposed rearrangement scheme establishes the evaluation sequence for the future data elements of an underlying data stream based on the most recently observed data elements. As the value of μ is set smaller, the current evaluation sequence is rearranged more frequently, which also increases the run-time overhead. The value of μ therefore controls how precisely recent changes in the data stream are reflected in the current evaluation sequence.
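The trigger in condition (5) amounts to a relative-change test. The sketch below is a minimal Python illustration; the function name `needs_rearrangement` is hypothetical, and the handling of \(d_{init}(\rho) = 0\), for which condition (5) is undefined, is an assumption made here rather than something the text specifies.

```python
def needs_rearrangement(d_init: float, d_cur: float, mu: float) -> bool:
    """Check condition (5): |d_cur - d_init| / d_init >= mu."""
    if d_init == 0.0:
        # Assumption: with no initial drops, any observed drop (or mu <= 0)
        # is treated as a reason to rearrange.
        return d_cur > 0.0 or mu <= 0.0
    return abs(d_cur - d_init) / d_init >= mu


# The dropping ratio fell from 0.5 to 0.4, a relative change of about 20%.
print(needs_rearrangement(0.5, 0.4, 0.1))   # prints True
# A relative change of about 4% stays below the threshold.
print(needs_rearrangement(0.5, 0.48, 0.1))  # prints False
```

With μ = 0 the test always succeeds, so a rearrangement is attempted at the end of every re-computation period λ.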

6 Experimental results

In this section, the performance of the proposed method is analyzed comparatively. All the algorithms are implemented in C, and all the experiments are executed on a 2.66 GHz Pentium 4 system with 1 GB of RAM. The system runs Linux with a 2.4.5 kernel and gcc 3.3.2. For the following experiments, two synthetic datasets and one real dataset are used to verify the effectiveness of the ASC arrangement strategy. Each synthetic dataset consists of 500,000 tuples and 20 integer-type attributes. The integer value of each attribute is generated from the range [0..99], but the data distributions of the two synthetic datasets \(D_{1}\) and \(D_{2}\) differ: the former follows a uniform distribution, whereas the latter follows a non-uniform distribution. For the real dataset \(D_{3}\), the US Census 1999 dataset (http://kdd.ics.uci.edu, UCI KDD Archive) is used. It has 10 integer-type attributes and 1,000,000 tuples. Furthermore, a number of different query sets are employed. The characteristics of the query sets are specified in Table 1. In this table, the item “Standard deviation” indicates the standard deviation of the number of 1’s in the query-result bitmap of each region in an ASC. The overall evaluation cost cost(Q, D) of a query set Q for a dataset D is measured by the number of evaluated ASC’s, assuming that the cost of evaluating each ASC is identical.
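Table 1 Specifications of experimental query sets

| | Q1 | Q2 | Q3 | Q3 | Q3 | Q3 | Q3 | Q4 |
|---|---|---|---|---|---|---|---|---|
| Number of queries in Q | 50 | 50 | 30 | 30 | 35 | 40 | 50 | 50 |
| Number of selection predicates | 204 | 173 | 78 | 81 | 98 | 138 | 173 | 362 |
| Number of p-attributes | 10 | 15 | 6 | 7 | 8 | 10 | 15 | 10 |
| Standard deviation | 0.08 | 0.2 | 0.06 | 0.06 | 0.15 | 0.1 | 0.25 | varying |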

In Fig. 6, the processing costs of various evaluation sequences for the ASC’s are compared with the evaluation cost of Ticket Routing (Avnur and Hellerstein 2000). The cost of Ticket Routing is measured by the number of visited selection modules. The query set \(Q_{1}\) is evaluated for the synthetic datasets \(D_{1}\) and \(D_{2}\), whereas the query set \(Q_{2}\) is used for the real dataset \(D_{3}\). The term Macro denotes the evaluation order of ASC’s given by a macro-sequence, and the term Hybrid denotes that given by a hybrid-sequence. In these two schemes, the current evaluation sequence is rearranged adaptively. Assuming that the tuples of a target dataset arrive at a constant rate, the re-computation period λ for a tuple dropping ratio is set to 10,000 tuples. Furthermore, the rearrangement threshold μ is set to 0, which means that the rearrangement process for the current evaluation sequence is invoked whenever the period λ elapses. The terms \(SEQ_{worst}\) and \(SEQ_{best}\) denote the highest and lowest evaluation costs for a target dataset. These two costs are found experimentally by running all possible evaluation sequences of the ASC’s. Their evaluation sequences are fixed rather than changed adaptively. As shown in Fig. 6, both Macro and Hybrid not only operate more efficiently than Ticket Routing but also approximate \(SEQ_{best}\) on all the target datasets. In particular, on the datasets \(D_{2}\) and \(D_{3}\), Hybrid performs better than \(SEQ_{best}\). This means that adaptively changing the current evaluation sequence for incoming tuples can be more effective than using the best fixed evaluation sequence. It also shows that micro-arrows can reduce the run-time evaluation cost.
https://static-content.springer.com/image/art%3A10.1007%2Fs10844-011-0192-1/MediaObjects/10844_2011_192_Fig6_HTML.gif
Fig. 6 Processing costs according to the evaluation sequences of ASC’s (μ = 0)

Figure 7 shows how the processing cost of each scheme presented in Fig. 6 changes as the number of p-attributes is varied. In this experiment, the query set \(Q_{3}\) is processed for the dataset \(D_{2}\). As in the experiment of Fig. 6, the value of λ in Macro and Hybrid is set to 10,000 tuples. As the number of p-attributes increases, the performance gaps among the four schemes widen. In addition, Hybrid performs better than \(SEQ_{best}\).
https://static-content.springer.com/image/art%3A10.1007%2Fs10844-011-0192-1/MediaObjects/10844_2011_192_Fig7_HTML.gif
Fig. 7 Processing costs according to the number of p-attributes (μ = 0)

Figure 8 verifies the effectiveness of micro-arrows by showing that the overall query performance is proportional to the ratio of micro-arrows in an evaluation sequence. The micro ratio ξ is defined as the ratio of the number of evaluated micro-arrows to the total number of evaluated macro-arrows and micro-arrows. The dataset \(D_{1}\) and the query set \(Q_{4}\) are used for this experiment. In this figure, the x-axis indicates the standard deviation of \(Q_{4}\) in Table 1. As this value gets larger, the gaps among the filtering capabilities of the regions in an ASC widen. The relative processing cost ω is defined as the ratio of the processing cost of the Hybrid scheme to that of the Macro scheme. As the standard deviation increases, the micro ratio increases while the relative processing cost decreases, which leads to an improvement in query performance.
https://static-content.springer.com/image/art%3A10.1007%2Fs10844-011-0192-1/MediaObjects/10844_2011_192_Fig8_HTML.gif
Fig. 8 The effectiveness of micro-arrows
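The two quantities plotted in Fig. 8 can be written compactly as follows. The counters \(n_{micro}\) and \(n_{macro}\) (the numbers of evaluated micro-arrows and macro-arrows) and the subscripted cost symbols are notation introduced here for illustration; they do not appear in the original text.

$$ \xi = \frac{n_{micro}}{n_{micro} + n_{macro}}, \qquad \omega = \frac{cost_{Hybrid}(Q_{4}, D_{1})}{cost_{Macro}(Q_{4}, D_{1})} $$

Under these definitions, ω < 1 means that the Hybrid scheme evaluates fewer ASC’s than the Macro scheme on the same tuples.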

Figure 9 shows the effect of the adaptive rearrangement of the current evaluation sequence for different values of the rearrangement threshold μ. To simulate the dynamic change of the selectivities of selection predicates, the datasets \(D_{1}\) and \(D_{2}\) are repeated alternately as sub-datasets to build a target dataset D. The value of λ is set to 80,000 tuples. The processing cost \(cost(Q_{1}, D)\) for the query set \(Q_{1}\) is traced. Initially, the evaluation sequence for the first sub-dataset \(D_{1}\) is used. As shown in this figure, whenever the boundary of a sub-dataset is crossed, the processing cost increases rapidly because the current evaluation sequence is no longer optimal. Subsequently, a newly adjusted evaluation sequence is obtained by the adaptive rearrangement process described in Section 5.2, which brings the processing cost close to the optimal cost. In addition, as the value of μ is set smaller, the rearrangement process is invoked more frequently, so the proposed method approaches the optimal processing cost more rapidly. However, the frequent invocation of the rearrangement process also increases the run-time rearrangement overhead.
https://static-content.springer.com/image/art%3A10.1007%2Fs10844-011-0192-1/MediaObjects/10844_2011_192_Fig9_HTML.gif
Fig. 9 The effect of adaptive rearrangement according to a rearrangement threshold μ

7 Conclusions

In order to process the selection predicates of multiple continuous queries efficiently in a data stream environment, an attribute-based construct ASC and its matching algorithm are proposed in this paper. The proposed approach saves space by sharing the common selection predicates of the target continuous queries. It also reduces run-time overhead by using matching results that are pre-computed at compile-time from the comparison constants of the p-attributes expressed in the selection predicates of the queries. In addition, the query performance is optimized by arranging the evaluation order of multiple ASC’s at two different levels and rearranging the current evaluation sequence adaptively. The goal of this optimization is to minimize unnecessary operations by dropping unmatched tuples as early as possible. A micro-arrow used in a hybrid-sequence plays an important role in furthering this goal. The proposed method and optimization techniques are verified through various experiments.

Acknowledgements

This work was supported by core research program (No. 2011-0016648) and NRL Program (No. 2010-0008007) of the Korea Science and Engineering Foundation (KOSEF) grant funded by the Korea government (MEST).

Open Access

This article is distributed under the terms of the Creative Commons Attribution License which permits any use, distribution, and reproduction in any medium, provided the original author(s) and the source are credited.

Copyright information

© The Author(s) 2012