Efficient Skyline Computation on Massive Incomplete Data

The incomplete skyline query is an important operation for finding the Pareto-optimal tuples on incomplete data. It is harder than the traditional skyline query because dominance over incomplete data is intransitive and can be cyclic. Our analysis shows that the existing algorithms cannot process incomplete skyline queries on massive data efficiently. This paper proposes TSI, a novel table-scan-based algorithm that computes the incomplete skyline on massive data with high efficiency. TSI handles intransitivity and cyclic dominance in two separate stages. In stage 1, TSI computes the candidates by a sequential scan on the table; tuples dominated by others are discarded directly. In stage 2, TSI refines the candidates by another sequential scan. A pruning operation is devised to reduce the execution cost of TSI: with the help of auxiliary structures, TSI can skip the majority of the tuples in stage 1 without actually retrieving them. Extensive experiments on synthetic and real-life data sets show that TSI computes the skyline on massive incomplete data efficiently.


Introduction
The skyline operator filters out a set of interesting tuples from a potentially huge data set. Given the specified skyline criteria, a tuple p is said to dominate another tuple q if p is strictly better than q in at least one attribute and no worse than q in the other attributes. The skyline query returns all tuples that are not dominated by any other tuple.
Due to its practical importance, the skyline query has received extensive attention [2, 3, 5, 6, 8, 9, 14-17, 19, 20]. However, the overwhelming majority of the existing algorithms only consider data sets with complete attributes, i.e., all the attributes of every tuple are available. In real-life applications, for reasons such as delivery failure or deliberate concealment, the data set we encounter is often incomplete, i.e., some attributes of tuples are unknown [13]. On incomplete data, the existing skyline algorithms cannot be applied directly, since all of them assume the transitivity of the dominance relationship.
On complete data, the transitivity rule is that if $p_1$ dominates $p_2$ and $p_2$ dominates $p_3$, then $p_1$ dominates $p_3$ by the definition of dominance. Transitivity is the basis of the efficiency of the existing skyline algorithms, which utilize indexing, partitioning and pre-sorting operations. On incomplete data, some attributes of tuples are missing, so the traditional definition of dominance no longer applies and the dominance relationship is redefined. Given the skyline criteria and two tuples p and q on incomplete data, let C be the common complete attributes of p and q among the skyline criteria; p dominates q if p is no worse than q on every attribute in C and strictly better than q on at least one attribute in C. Under this definition, transitivity does not hold on incomplete data.
As illustrated in Fig. 1, the specified skyline criteria are $\{A_1, A_2, A_3\}$. In the table, $p_1$ dominates $p_2$, since the common attribute of $p_1$ and $p_2$ among the skyline criteria is $A_1$ and $p_1.A_1 < p_2.A_1$. Similarly, $p_2$ dominates $p_3$. But $p_1$ does not dominate $p_3$ here, so transitivity does not hold. Moreover, $p_3$ dominates $p_1$, i.e., on incomplete data we may face the issue of cyclic dominance. These two issues, intransitivity and cyclic dominance, make the processing of the skyline on incomplete data different from that on complete data.
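The dominance test on common complete attributes, and the way it can become intransitive and cyclic, can be illustrated with a minimal Python sketch. The attribute values below are hypothetical (they are not the values of Fig. 1), `None` is assumed to mark an incomplete attribute, and smaller values are preferred.

```python
from typing import Optional, Sequence

Row = Sequence[Optional[float]]  # None marks an incomplete attribute


def dominates(p: Row, q: Row) -> bool:
    """Return True if p dominates q on their common complete attributes.

    Smaller values are preferred. If p and q share no complete attribute,
    neither dominates the other.
    """
    strictly_better = False
    for a, b in zip(p, q):
        if a is None or b is None:
            continue  # attribute is not in the common complete set C
        if a > b:
            return False  # p is worse than q on a common attribute
        if a < b:
            strictly_better = True
    return strictly_better


# Hypothetical tuples over the criteria {A1, A2, A3}.
p1 = (1, None, 5)
p2 = (2, 3, None)
p3 = (None, 4, 2)

print(dominates(p1, p2))  # True: common attribute A1, 1 < 2
print(dominates(p2, p3))  # True: common attribute A2, 3 < 4
print(dominates(p1, p3))  # False: common attribute A3, 5 > 2
print(dominates(p3, p1))  # True: the cycle p1 -> p2 -> p3 -> p1 closes
```

In this toy instance every tuple is dominated by some other tuple, so the incomplete skyline is empty even though no single tuple dominates all the others.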
The current incomplete skyline algorithms can be classified into three categories: replace-based algorithms [7], sorted-based algorithms [1], and bucket-based algorithms [7,10]. The replace-based algorithms first replace the incomplete attributes with a specific value, then compute the traditional skyline on the transformed data, and finally refine the candidates by pairwise comparison to obtain the results. Normally, the number of candidates on massive data is large and the pairwise comparison is very expensive. Sorted-based algorithms use pre-sorted structures to select, one by one, tuples with potentially high dominance capability and use them to prune the non-skyline tuples. They usually perform many passes of scan on the table and incur a high I/O cost on massive data. Bucket-based algorithms first split the tuples into different buckets according to their attribute encoding, so that the tuples in the same bucket have the same encoding and obey the transitivity rule, then compute the local skyline in every bucket, and finally merge the local skyline results to obtain the final results. In incomplete skyline computation, the skyline criteria size is usually greater than that on complete data due to the cyclic dominance, so the number of buckets involved in bucket-based algorithms is often large and the number of local skyline results is relatively great. The computation and merging of the local skyline results often incur high computation cost and I/O cost on massive data. To sum up, the existing algorithms cannot process the incomplete skyline query on massive data efficiently.
Based on the discussion above, this paper proposes the TSI algorithm (Table-scan-based Skyline over Incomplete data) to compute skyline results on massive incomplete data with high efficiency. In order to reduce the computation cost and I/O cost, the execution of TSI consists of two stages. In stage 1, TSI performs a sequential scan on the table and maintains candidates in memory. For each tuple t retrieved in stage 1, any candidate dominated by t is removed, and if t is not dominated by any candidate, t is added to the candidate set. In stage 1, TSI does not consider the intransitivity and cyclic dominance in incomplete skyline computation, but just discards the tuples which are definitely not final results. In stage 2, another sequential scan is executed to refine the candidates. For each tuple t retrieved in stage 2, TSI discards any candidate dominated by t. When stage 2 terminates, the candidates in memory are the incomplete skyline results. In this paper, it is found that the cost of stage 1 dominates the overall cost of TSI, so a pruning operation is devised to skip tuples in stage 1. Auxiliary data structures are pre-constructed, which are used to check whether a tuple is dominated before retrieving it. Extensive experiments are conducted on synthetic and real-life data sets. The experimental results show that the pruning operation can skip the overwhelming majority of tuples in stage 1 and that TSI outperforms the existing algorithms significantly.
The contributions of this paper are listed as follows:
- This paper proposes a novel two-stage, table-scan-based algorithm TSI to process skyline queries on massive incomplete data efficiently.
- Two novel data structures are designed to maintain information about the tuples and to obtain pruning tuples with strong dominance capability.
- This paper devises efficient pruning operations to reduce the execution cost of TSI, which directly skip the tuples dominated by some tuple before actually retrieving them.
- The experimental results show that TSI can compute the incomplete skyline on massive data efficiently.
The rest of the paper is organized as follows. The related work is surveyed in Sect. 2, followed by preliminaries in Sect. 3. The existing algorithms are analyzed in Sect. 4. Baseline algorithm is developed in Sect. 5. Section 6 introduces TSI algorithm. The performance evaluation is provided in Sect. 7. Section 8 concludes the paper.

Related Work
Since [2] first introduced the skyline operator into the database environment, the skyline has been studied extensively by database researchers [2, 3, 5, 6, 8, 11, 14-17, 20]. However, most of the existing skyline algorithms only consider complete data, and they utilize the transitivity of the dominance relationship to acquire significant pruning power. They cannot be directly used for the skyline query on incomplete data, where the dominance relationship is intransitive and cyclic.
In the rest of this section, we survey the skyline algorithms on incomplete data. The current incomplete skyline algorithms can be classified into three categories: replace-based algorithms, sorted-based algorithms, and bucket-based algorithms.

Replace-Based Algorithms
Khalefa et al. [7] propose a set of skyline algorithms for incomplete data. The first two algorithms, replacement and bucket, are extensions of the existing skyline algorithms to accommodate incomplete data. The replacement algorithm first replaces the incomplete attributes by a special value to transform the incomplete data to complete data. A traditional skyline algorithm can then be used to compute the skyline results SKY_comp on the transformed complete data, which is a superset of the skyline results on the incomplete data. Finally, the tuples in SKY_comp are transformed back into their original incomplete form, and an exhaustive pairwise comparison between all tuples in SKY_comp is performed to compute the final results.
The bucket algorithm first divides all the tuples of the incomplete data into different buckets so that all tuples in the same bucket have the same bitmap representation. The dominance relationship within a bucket is then transitive, since the tuples there have the same bitmap representation. A traditional skyline algorithm is utilized to compute the skyline results within each bucket, which are called the local skyline. The local skyline results of all buckets are merged into the candidate skyline, and an exhaustive pairwise comparison is performed on the candidate skyline to compute the query answer.
The ISkyline algorithm employs two new concepts, virtual points and shadow skylines, to improve the bucket algorithm. The execution of ISkyline consists of three phases. In phase I, each newly retrieved tuple is compared against the local skyline and the virtual points to determine whether the tuple needs to be (a) stored in the local skyline, (b) stored in the shadow skyline, or (c) discarded directly. In phase II, the tuples newly inserted into the local skyline are compared with the current candidate skyline; ISkyline updates the candidate skyline and the virtual points correspondingly. Every time t tuples are kept in the candidate skyline, ISkyline enters phase III, updates the global skyline, and clears the current candidate skyline. This processing continues until the end of the input is reached, and ISkyline returns the global skyline.
The replace-based algorithms first replace the incomplete attributes with a specific value, then compute the traditional skyline on the transformed data, and finally refine the candidates by pairwise comparison to obtain the results. Normally, the number of candidates on massive data is large and the pairwise comparison is very expensive.

Sorted-Based Algorithms
Bharuka et al. [1] propose a sort-based skyline algorithm SIDS to evaluate the skyline over incomplete data. SIDS first sorts the incomplete data D in non-descending order for each attribute. Let D i be the sorted list with respect to the ith attribute. Only the ids of the tuples are kept in D i , and the ids of the tuples whose ith attributes are incomplete are not stored. SIDS performs a round-robin retrieval on the sorted lists. For each retrieved data p, if it is not retrieved before, it is compared with each data q in the candidate set, which is initialized to be the whole incomplete data. If p and q are compared already, the next data in the candidate set are retrieved and processed. Otherwise, if p dominates q, q is removed from the candidate set. And if q dominates p and p is in candidate set, p is removed from the candidate set also. If the number of p being retrieved during the round-robin retrieval is equal to the number of its complete attributes and p is not pruned yet, p can be reported to be one of the skyline results. SIDS terminates when candidate set becomes empty or all points in sorted lists are processed at least once.
Sorted-based algorithms use pre-sorted structures to select, one by one, tuples with potentially high dominance capability and use them to prune the non-skyline tuples. They usually perform many passes of scan on the table and incur a high I/O cost on massive data.

Bucket-Based Algorithms
Lee et al. [10] propose a sorting-based SOBA algorithm to optimize the bucket algorithm. Similar to the bucket algorithm, SOBA also first divides the incomplete data into a set of buckets according to their bitmap representation, then computes the local skyline of tuples in each bucket, and finally performs the pairwise comparison for the skyline candidates (the collection of all local skylines). SOBA uses two techniques to reduce the dominance tests for the skyline candidates. The first technique is to sort the buckets in ascending order of the decimal numbers of the bitmap representation. This can identify the non-skyline points as early as possible. The second technique is to rearrange the order of tuples within the bucket. By sorting tuples in the ascending order of the sum of the complete attributes, the tuples accessed earlier have the higher probability to dominate other tuples and this can help reduce the number of dominance tests further.
Bucket-based algorithms first split the tuples into different buckets according to their attribute encoding, so that the tuples in the same bucket have the same encoding and obey the transitivity rule, then compute the local skyline in every bucket, and finally merge the local skyline results to obtain the final results. In incomplete skyline computation, the skyline criteria size is usually greater than that on complete data due to the cyclic dominance, so the number of buckets involved in bucket-based algorithms is often large and the number of local skyline results is relatively great. The computation and merging of the local skyline results often incur high computation cost and I/O cost on massive data.
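The bucket step described above can be sketched in Python as follows. This is only an illustration of the general idea, not the implementation of [7] or [10]: the bit-string encoding, the function names and the simple window-based local skyline are assumptions made for the example, and `None` is assumed to mark an incomplete attribute.

```python
from collections import defaultdict
from typing import Dict, List, Optional, Sequence

Row = Sequence[Optional[float]]  # None marks an incomplete attribute


def bitmap(t: Row) -> str:
    """Encode a tuple as a bit string: '1' for complete, '0' for incomplete."""
    return "".join("0" if v is None else "1" for v in t)


def dominates_same_bucket(p: Row, q: Row) -> bool:
    """Classical dominance; p and q are assumed to share the same bitmap."""
    pairs = [(a, b) for a, b in zip(p, q) if a is not None]
    return all(a <= b for a, b in pairs) and any(a < b for a, b in pairs)


def local_skylines(table: List[Row]) -> Dict[str, List[Row]]:
    """Bucket the tuples by bitmap and keep the local skyline of each bucket."""
    buckets: Dict[str, List[Row]] = defaultdict(list)
    for t in table:
        buckets[bitmap(t)].append(t)
    result: Dict[str, List[Row]] = {}
    for code, rows in buckets.items():
        window: List[Row] = []
        for t in rows:
            # Transitivity holds inside a bucket, so a dominated tuple
            # can be dropped for good.
            if any(dominates_same_bucket(w, t) for w in window):
                continue
            window = [w for w in window if not dominates_same_bucket(t, w)]
            window.append(t)
        result[code] = window
    return result


table = [(1, None, 5), (2, None, 6), (2, 3, None), (None, 4, 2)]
print(local_skylines(table))
```

The union of the local skylines is only a candidate set; tuples from different buckets still have to be compared pairwise, which is the expensive part on massive data.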
For the algorithms mentioned above, the dominance over incomplete data is defined on the common complete attributes. There are also other definitions of dominance over incomplete data. Zhang et al. [21] propose a general framework to extend skyline query. For each attribute, they first retrieve the probability distribution function of the values in the attribute by all the non-missing values on the attribute and then convert incomplete tuples to complete data by estimating all missing attributes. And a mapping dominance is defined on the converted data. Zhang et al. [18] propose PISkyline to compute probabilistic skyline on incomplete data. It is considered in [18] that each missing attribute value can be described by a probability density function. The probability is used to measure the preference condition between missing values and the valid values. Then, the probability of a tuple being skyline can be computed. PISkyline returns the K tuples with the highest skyline probability.
Discussion. Throughout this paper, we use the definition of dominance over incomplete data given in [7]. Firstly, this dominance notion is commonly used in most skyline algorithms over incomplete data. Secondly, the estimation of the incomplete attribute values may be undesirable in some cases. Therefore, we do not guess the incomplete attributes in this paper and do not consider such algorithms further.
In this paper, we consider the skyline over massive incomplete data, i.e., the data set cannot be kept in memory entirely. It is found that the existing algorithms, including [1,7,10], all assume that the data are processed in memory. Their performance is seriously degraded on massive data. Since the cardinality of the skyline increases exponentially with respect to the size of the skyline criteria [4], the replacement algorithm often generates a large number of skyline candidates, and the pairwise comparison among the candidates incurs a prohibitively expensive cost. Bucket-based algorithms, such as the bucket algorithm, ISkyline and SOBA, have to divide the data set into different buckets. Given the size m of the skyline criteria, the number of buckets can be as high as $2^m - 1$; this causes serious performance issues when m is not small. SIDS uses one selected tuple at a time to prune the non-skyline tuples in the candidate set, and each such step incurs a pass of sequential scan on the data. Thus, it requires many passes of scan on the data to finish its execution, which incurs a high I/O cost on massive data.

Preliminaries
Given an incomplete table T of n tuples with attributes $A_1, \ldots, A_M$, some attributes of the tuples in T are incomplete. The attributes in T are considered to be of numerical type, and $A_1, \ldots, A_m$ are the specified skyline criteria. Throughout the paper, it is assumed that smaller attribute values are preferred. In this paper, the attributes with known values are called complete attributes, while the attributes with unknown values are called incomplete attributes. $\forall t \in T$, t has at least one complete attribute among $A_1, A_2, \ldots, A_m$, while all other attributes have a probability p ($0 < p \le 1$) of being incomplete. The frequently used symbols in this paper are listed in Table 1.
The dominance over incomplete data is given in Definition 1. The incomplete skyline returns the tuples in T which are not dominated by any other tuples.

Table 1 Frequently used symbols
$T_{part}$: The current part of T loaded in memory
t: A tuple in T
C: The common complete attribute(s)
PI: The positional index of a tuple
S: The size of allocated memory for storing tuples of T each time
$S_{cnd}$: A set maintaining candidate tuples
$SL_i$: The sorted list which is built for the i-th attribute
MCR: The bit-vector representing the membership checking result of $SL_i$
RIA: The bit-vector representing whether the attribute is complete
$S_c$: The set of the complete attributes of t
$NUM_c$: The number of the complete attributes for each tuple

Definition 1 (Dominance over incomplete data) Given table T and skyline criteria $A_1, \ldots, A_m$, $\forall t_1, t_2 \in T$, let C be their common complete attributes among the skyline criteria. $t_1$ dominates $t_2$ (denoted by $t_1 \prec t_2$) if $\forall A \in C$, $t_1.A \le t_2.A$ and $\exists A \in C$, $t_1.A < t_2.A$.

The positional index is defined in Definition 2. We denote by T(a) the tuple with PI = a, by T(a, … , b) (a ≤ b) the tuples in T whose PIs are between a and b, by

The Analysis for the Existing Algorithms
The existing skyline algorithms over incomplete data can be classified into three types: replacement-based algorithms, bucket-based algorithms, and sort-based algorithms. As discussed in Sect. 2, replacement-based algorithms usually generate too many skyline candidates, and sort-based algorithms often need to perform many passes on the table before returning the results. They both incur high computation cost and I/O cost on massive data. In the following part of this section, we analyze the performance of bucket-based algorithms.
Given table T and the skyline criteria $A_1, \ldots, A_m$, each tuple t is encoded as an m-bit vector whose i-th bit is 1 if $t.A_i$ is complete and 0 otherwise. Note that the most significant bit is the first bit. A bucket-based algorithm divides the tuples in T according to their encoded vectors. Therefore, the tuples in the same bucket share the same vector, and the transitive dominance relation holds among the tuples in a bucket. A traditional skyline algorithm can be utilized to compute the local skyline within the bucket. Any tuple $t_1$ dominated by a tuple $t_2$ in the same bucket can be discarded directly, since it cannot be a skyline result and any tuple dominated by $t_1$ is naturally dominated by $t_2$ as well. Of course, there are other techniques to optimize the pairwise comparison among skyline candidates [10].
As illustrated in Fig. 2, for analysis, we assume that the bitmap encoding of the buckets consists of m cases with equal likelihood: $\exists i$, $1 \le i \le m$, the value of $A_i$ must be known and the other attributes can be unknown with probability p independently. Given an m-bit bit-vector b, let $r = Cnt1(b)$, where Cnt1 is a function returning the number of 1 bits in a bit-vector. Of course, in this paper, $1 \le r \le m$. The bit-vector b can occur in r cases. In each case, the probability of generating b is $(1-p)^{r-1} \times p^{m-r}$, i.e., besides the selected complete attribute, there are $(r-1)$ complete attributes and $(m-r)$ incomplete attributes. Therefore, the probability $pr_b$ of generating b among the overall cases is

$$pr_b = \frac{r}{m} \times (1-p)^{r-1} \times p^{m-r}.$$

Theoretically, a bucket-based algorithm can split T into all $2^m - 1$ buckets. The size of the skyline criteria on incomplete data is usually greater than that on complete data due to the cyclic dominance. This can be verified in the existing skyline algorithms on incomplete data [1,7,10]. Then, the number of buckets is not small. For example, given m = 20, there are possibly 1048575 buckets, and the bucket-based algorithm has to maintain a large number of buckets. For one thing, this increases the management burden of the file system; for another, each bucket then maintains a relatively small number of tuples with a skyline criteria size that is not small.
The size of the skyline candidates for pairwise comparison is the sum of the local skyline sizes over all buckets. Under the independence assumption, the number of local skyline tuples in the bucket of encoded bit-vector b can be estimated as $\frac{(\ln(n \times pr_b) + \gamma)^{r-1}}{(r-1)!}$ [4], where $r = Cnt1(b)$ and $\gamma \approx 0.57721$ is the Euler-Mascheroni constant. In this paper, however, it is found that this cardinality estimation is much lower than the actual cardinality when m is relatively large. Of course, other cardinality estimation methods [12,22] can be used in such a case. For simplicity, we still use the cardinality estimation in [4], since it still provides useful insight for our analysis. Given $n = 10^8$, m = 20 and p = 0.5, the total number of local skyline results is 7641060 even with the cardinality formula mentioned above, which is much lower than the actual value. The number of local skyline results, which are used to perform pairwise comparisons, is still too high.
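The arithmetic of this analysis can be checked with a short Python sketch. It assumes that a bit-vector with r ones has probability (r/m)(1 − p)^(r−1) p^(m−r), that the local skyline of a bucket holding n_b tuples over r complete attributes is estimated as (ln n_b + γ)^(r−1) / (r−1)!, and that each per-bucket estimate is truncated to an integer before summing. The truncation and all function names are assumptions of this sketch; under them, the sum reproduces the figure quoted above.

```python
import math

EULER_GAMMA = 0.57721  # Euler-Mascheroni constant, as quoted in the text


def bucket_probability(r: int, m: int, p: float) -> float:
    """Probability of one specific m-bit vector with r ones (assumed model)."""
    return (r / m) * (1 - p) ** (r - 1) * p ** (m - r)


def local_skyline_estimate(n_b: float, r: int) -> float:
    """Estimated skyline size of n_b tuples over r attributes."""
    return (math.log(n_b) + EULER_GAMMA) ** (r - 1) / math.factorial(r - 1)


def total_local_skylines(n: int, m: int, p: float) -> int:
    """Sum of truncated local skyline estimates over all non-empty bitmaps."""
    total = 0
    for r in range(1, m + 1):
        n_b = n * bucket_probability(r, m, p)
        # math.comb(m, r) buckets share the same number of ones r.
        total += math.comb(m, r) * int(local_skyline_estimate(n_b, r))
    return total


print(2 ** 20 - 1)                            # 1048575 possible buckets
print(total_local_skylines(10 ** 8, 20, 0.5))  # 7641060
```

The bucket probabilities sum to 1 over all non-empty bit-vectors, which is a quick sanity check of the assumed probability model.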
To sum up, the existing skyline algorithms on massive incomplete data all have their performance issue.

Baseline Algorithm
The existing algorithms, as mentioned in Sects. 2 and 4, have rather poor performance and very long execution times on massive incomplete data. Therefore, this section first devises a baseline algorithm BA, which can be used as a benchmark against the algorithm proposed in this paper. Different from the existing methods, BA adopts a block-nested-loop-like execution. It first retrieves T from the beginning and loads a part of T into memory, compares the tuples in memory with all tuples in T, and removes the dominated tuples from memory. The tuples left in memory after each iteration have been compared with all other tuples and can be reported as part of the incomplete skyline results. Then, the next part of T is loaded and the same processing is executed; the iteration continues until all tuples in T have been loaded into memory once and compared with all other tuples. In this paper, let S be the size of the memory allocated for storing tuples of T each time; the number of table scans in BA is $\frac{8 \times M \times n}{S} + 1$. In order to reduce the I/O cost of BA, an n-bit bit-vector $B_{ret}$, with each bit initialized to 1, is maintained. In the first iteration, tuples of total size S bytes are loaded into memory. Let $T_{part}$ be the current part of T loaded in memory. The tuples in $T_{part}$ are compared with all tuples in T. $\forall t = T(a)$, if t is dominated by some tuple in $T_{part}$, the a-th bit in $B_{ret}$ is set to 0. Then, in the next iteration, suppose that the next retrieved tuple is

Example 1
In the rest of this paper, we use a running example, as depicted in Fig. 3, to illustrate the execution of algorithms proposed in this paper. In the running example, we set M to be 3, m to be 3, n to be 16 and S to be 256 bytes. The value field of the attribute is [0, 100). According to the parameters, the execution of BA divides into two iterations.
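The block-nested-loop-like execution of BA can be sketched in Python as below. This is a simplified illustration under stated assumptions: the block size is given in tuples rather than in bytes (with 8-byte attributes, S = 256 and M = 3 would correspond to 10 tuples per block, hence two iterations for n = 16), the $B_{ret}$ optimization is omitted, the table is an in-memory list, and `None` marks an incomplete attribute. The function names are hypothetical.

```python
from typing import List, Optional, Sequence

Row = Sequence[Optional[float]]  # None marks an incomplete attribute


def dominates(p: Row, q: Row) -> bool:
    """p dominates q on their common complete attributes (smaller is better)."""
    strictly_better = False
    for a, b in zip(p, q):
        if a is None or b is None:
            continue
        if a > b:
            return False
        if a < b:
            strictly_better = True
    return strictly_better


def baseline_skyline(table: List[Row], block_size: int) -> List[int]:
    """Return the positions of the skyline tuples, one block at a time."""
    skyline: List[int] = []
    for start in range(0, len(table), block_size):
        block = table[start:start + block_size]
        alive = [True] * len(block)
        for t in table:  # one full scan of T per loaded block
            for k, q in enumerate(block):
                if alive[k] and dominates(t, q):
                    alive[k] = False
        # Survivors were compared with every tuple, so they are final results.
        skyline.extend(start + k for k in range(len(block)) if alive[k])
    return skyline


table = [(1, None, 5), (2, 3, None), (None, 4, 2), (None, None, 1)]
print(baseline_skyline(table, block_size=2))  # [3]
```

Because every block is compared with the whole table, intransitivity and cyclic dominance need no special handling here; the price is one table scan per block.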

TSI Algorithm
In this paper, we propose a new algorithm TSI (Table-scan-based Skyline over Incomplete data) to process skyline over massive incomplete data efficiently. TSI performs two passes of scan on the table to compute the skyline results. Section 6.1 describes the basic execution of TSI algorithm. The pruning operation is presented in Sect. 6.2.

Basic Process
The basic process of TSI consists of two stages. In stage 1, TSI performs the first-pass scan on T to find the candidate tuples, while in the stage 2, TSI scans T again to discard the candidates which are dominated by some tuple. Algorithm 1 is the pseudo-code of the basic process.

Algorithm 1 TSI basic(T)
Input: T is an incomplete table
Output: S_cnd, a set maintaining the skyline tuples over T
1: initialize S_cnd ← ∅
2: // Stage 1: find the candidate tuples
3: while T has more tuples do
4: retrieve the next tuple t of T;
5: if S_cnd = ∅ then
6: S_cnd ← S_cnd ∪ t;
7: else
8: while S_cnd has more tuples do
9: retrieve the next tuple p of S_cnd;
10: if p is dominated by t then
11: remove

In stage 1, TSI retrieves the tuples in T sequentially and maintains the candidate tuples in a set $S_{cnd}$, which is empty initially (line 1). Let t be the currently retrieved tuple. If $S_{cnd}$ is empty, TSI keeps t in $S_{cnd}$ (lines 5-6). Otherwise, $S_{cnd}$ is iterated over, and any candidate which is dominated by t is removed from $S_{cnd}$ (lines 10-11). At the end of the iteration, if t is dominated by some candidate in $S_{cnd}$, t is discarded (lines 14-15); otherwise, TSI keeps t in $S_{cnd}$ (lines 16-17). In stage 1, TSI does not consider the intransitivity and cyclic dominance of the skyline on incomplete data. Any candidate is discarded if it is dominated by some tuple, even though the candidate may dominate later tuples. In this way, TSI does not need to maintain the dominated tuples and reduces the in-memory maintenance cost significantly. It is proved in Theorem 1 that $S_{cnd}$ contains a superset of the query results at the end of stage 1.

Theorem 1 When the first-pass scan of TSI is over, $S_{cnd}$ maintains a superset of the skyline results over T.
Proof $\forall t_1 = T(pi_1)$, if $t_1$ is a skyline tuple, there is no other tuple in T which can dominate $t_1$, so $t_1$ is neither discarded nor removed and is kept in $S_{cnd}$ at the end of stage 1. If $t_1$ is not a skyline tuple, there is another tuple $t_2 = T(pi_2)$ which dominates $t_1$. If $pi_1 < pi_2$, $t_2$ is retrieved after $t_1$ and removes $t_1$ from $S_{cnd}$. If $pi_1 > pi_2$, $t_2$ is retrieved before $t_1$; if $t_2$ has already been dominated by some tuple and discarded, $t_1$ may still be kept in $S_{cnd}$ at the end of stage 1. Hence $S_{cnd}$ contains every skyline tuple and possibly some non-skyline tuples. Q.E.D.
In stage 2, TSI performs another sequential scan on T. Let t be the currently retrieved tuple (lines 22-23); any candidates are removed from $S_{cnd}$ if they are dominated by t (lines 26-27). It is proved in Theorem 2 that the candidates in $S_{cnd}$ are the skyline results at the end of stage 2.

Theorem 2 When the second-pass scan of TSI is over, S cnd maintains the skyline results over T.
Proof ∀t 1 ∈ S cnd , if t 1 is not a skyline tuple, there is another tuple t 2 = T(pi 2 ) which can dominate t 1 . In the second-pass scan, TSI will discard t 1 when retrieving t 2 . Q.E.D.
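The two stages described above can be sketched in Python as follows. This is a minimal in-memory illustration of the basic process without the pruning operation, not the authors' implementation; the table is assumed to be a list of tuples with `None` for incomplete attributes, positions are 0-based, and the function names are hypothetical.

```python
from typing import List, Optional, Sequence, Tuple

Row = Sequence[Optional[float]]  # None marks an incomplete attribute


def dominates(p: Row, q: Row) -> bool:
    """p dominates q on their common complete attributes (smaller is better)."""
    strictly_better = False
    for a, b in zip(p, q):
        if a is None or b is None:
            continue
        if a > b:
            return False
        if a < b:
            strictly_better = True
    return strictly_better


def tsi_basic(table: List[Row]) -> List[int]:
    """Return the positions of the incomplete skyline tuples of `table`."""
    cnd: List[Tuple[int, Row]] = []
    # Stage 1: build a superset of the skyline with one sequential scan.
    for pi, t in enumerate(table):
        t_dominated = False
        survivors = []
        for qi, q in cnd:
            if dominates(q, t):
                t_dominated = True
            if not dominates(t, q):
                survivors.append((qi, q))  # q is not dominated by t
        cnd = survivors
        if not t_dominated:
            cnd.append((pi, t))
    # Stage 2: a second scan removes candidates dominated by any tuple,
    # including tuples that were themselves discarded in stage 1.
    for t in table:
        cnd = [(qi, q) for qi, q in cnd if not dominates(t, q)]
    return [qi for qi, _ in cnd]


table = [(1, None, 5), (2, 3, None), (None, 4, 2), (None, None, 1)]
print(tsi_basic(table))      # [3]
print(tsi_basic(table[:3]))  # []: the three tuples dominate each other cyclically
```

In the second call, stage 1 ends with the third tuple as the only candidate, and stage 2 removes it when the second tuple is retrieved again, which shows why the second scan is needed.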
The existing algorithms utilize many methods, such as replacement, sortedness and bucket, to deal with intransitivity and cyclic dominance. They usually incur high execution cost on massive incomplete data, as analyzed in Sects. 2 and 4. In this paper, TSI neglects the intransitivity and cyclic dominance in the first-pass scan and leaves the refinement of the skyline results in the second-pass scan.

Example 2
The execution in stage 1 of TSI in the running example is illustrated in Fig. 5. Initially, the candidate set $S_{cnd}$ is empty. Then, as the first sequential scan is performed,

Time complexity. On massive incomplete data, the majority of the execution cost of TSI is consumed in stage 1. The reason is that every tuple retrieved in stage 1 needs to be compared with all candidates in $S_{cnd}$ and the size of $S_{cnd}$ increases during the first-pass scan on T, while the size of $S_{cnd}$ decreases gradually in stage 2.
Time complexity of stage 1. As shown in Algorithm 1, the time complexity of stage 1 is determined by the nested loop: the outer loop from line 3 to line 20 and the inner loop from line 8 to line 13. Assume that there are n tuples in the incomplete table, in other words, Algorithm 1 needs to retrieve n tuples. The outer loop iterates O(n) times, once per retrieved tuple. The inner loop involves one sequential scan on $S_{cnd}$, whose size is no more than n. Each iteration of the inner loop takes constant time; thus, the time complexity of the inner loop is $O(|S_{cnd}|)$. On the whole, the time complexity of stage 1 is determined by the number of tuples in T and the number of candidates in $S_{cnd}$, i.e., the time complexity of stage 1 is $O(n \times |S_{cnd}|)$.
Time complexity of stage 2. The execution of stage 2 is described in Algorithm 1. The cost of stage 2 has the same form as that of stage 1, i.e., the product of n and the size of $S_{cnd}$, but it is often insignificant compared with the cost of stage 1. The reason is that $S_{cnd}$ holds a superset of the skyline while it is being generated in stage 1 and is then much larger than in stage 2, where it only shrinks toward the skyline tuples; the size of $S_{cnd}$ in stage 1 therefore often dominates the overall execution cost. On the whole, since $|S_{cnd}| \le n$, the time complexity of Algorithm 1 is $O(n^2)$.
In Sect. 6.2, we propose a pruning method that skips the unnecessary tuples in the sequential scan to further improve the performance of TSI.

Intuitive Idea
On massive incomplete data, it is analyzed that the majority of the execution cost of TSI is consumed in stage 1. In stage 1, TSI computes the candidates of the skyline over T. Obviously, any tuple must not be a skyline tuple if it is dominated by some tuple. In stage 1, TSI utilizes some pre-constructed data structure to skip the tuples in T which are dominated. In this way, TSI will speed up its execution in stage 1, since the pruning operation not only reduces the I/O cost to retrieve tuples, but also reduces the computation cost of dominance checking.

Dominance Checking on Incomplete Data
Given $t_1 \in T$, $\forall t_2 \in T$, let C be the common complete attributes among the skyline criteria of $t_1$ and $t_2$. For one thing, if $t_1 \prec t_2$, it means that $\forall A \in C$, $t_1.A \le t_2.A$ and $\exists A \in C$, $t_1.A < t_2.A$. Suppose that $t_1$ is obtained currently; we can utilize the values of $t_1$ to skip the tuples dominated by it. For another, if C is empty, $t_1$ and $t_2$ cannot be compared in terms of dominance checking. Therefore, the key to the dominance checking on incomplete data is (1) the comparison of complete attributes and (2) the representation of incomplete attributes. In the following, we introduce how to construct data structures to solve the two issues.
In the paper, the value of any incomplete attribute is regarded as positive infinity, since smaller values are preferred. Given table T(
For the representation of incomplete attributes, TSI performs a sequential scan on T and constructs the structure RIA, which consists of M n-bit bit-vectors. For $T(a)$ ($1 \le a \le n$), if $T(a).A_i$ is a complete attribute, $RIA_i(a) = 1$; otherwise, $RIA_i(a) = 0$.

Example 3
The required data structures mentioned above are illustrated in Fig. 7. $SL_1$, $SL_2$, $SL_3$ are three sorted lists, whose elements are arranged in the ascending order of $A_1$, $A_2$, $A_3$, respectively. $MCR_{1,1}$ is a 16-bit bit-vector representing the membership checking results of $SL_1(1, 2^1).PI_T$, i.e., 12 and 8. Therefore, the 8th bit and the 12th bit in $MCR_{1,1}$ are 1, i.e., $MCR_{1,1} = 0000000100010000$. $ITV_1$ keeps the attribute values at exponential gaps in $SL_1$, i.e., $SL_1(2^1).A_1$, $SL_1(2^2).A_1$, $SL_1(2^3).A_1$, $SL_1(2^4).A_1$
By the structures MCR and RIA, given $t_1 \in T$, we want to know which tuples in T are dominated by $t_1$. Let $S_c$ be the set of the complete attributes among $A_1, A_2, \ldots, A_m$ of $t_1$; without loss of generality, assume that $S_c = \{A_1, \ldots,$ is assigned negative infinity. Let $DBV_{t_1}$ be the n-bit bit-vector of dominance checking corresponding to $t_1$, whose bits are initialized to bit 1. It is proved by Theorem 3 that the bit 1s of correspond to the tuples dominated by $t_1$.

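Theorem 3 The bit 1s of $DBV_{t_1} = \left(\bigwedge_{i=1}^{|S_c|} \neg MCR_{i,b_i}\right) \wedge \left(\bigvee_{i=1}^{|S_c|} RIA_i\right)$ represent the tuples which are dominated by $t_1$.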
Proof As mentioned above, the value $b_i$ is determined as the minimum integer value satisfying $ITV_i[b_i] > t_1.A_i$. Therefore, the bit 1s of $\neg MCR_{i,b_i}$ represent the tuples whose $A_i$ values are greater than $t_1.A_i$. Since we treat the incomplete attribute values as positive infinity, $\bigwedge_{i=1}^{|S_c|} \neg MCR_{i,b_i}$ represents the tuples whose values of $A_1, \ldots, A_{|S_c|}$ are all greater than those of $t_1$. Given $t_2$ among these tuples, if at least one of $A_1, \ldots, A_{|S_c|}$ of $t_2$ is a complete attribute, $t_2$ is dominated by $t_1$ according to the dominance definition over incomplete data. If all of $A_1, \ldots, A_{|S_c|}$ of $t_2$ are incomplete, $t_1$ and $t_2$ are not comparable from the perspective of the dominance relationship. The bit 1s of $\bigvee_{i=1}^{|S_c|} RIA_i$ mark the tuples for which at least one of $A_1, \ldots, A_{|S_c|}$ is complete, and its bit 0s mark the tuples for which all of them are incomplete. Hence the bit 1s of $\left(\bigwedge_{i=1}^{|S_c|} \neg MCR_{i,b_i}\right) \wedge \left(\bigvee_{i=1}^{|S_c|} RIA_i\right)$ represent the tuples which are dominated by $t_1$. Q.E.D.
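The bit-vector computation in Theorem 3 can be sketched as below. This is only an illustration under stated assumptions: Python integers stand in for n-bit bit-vectors (bit a, counted from the least significant bit, corresponds to the tuple at 0-based position a), `None` encodes an incomplete attribute, and the names `build_structures` and `dominance_bit_vector` as well as the 8-tuple table are hypothetical. When no ITV value exceeds the attribute value, the sketch conservatively returns an empty vector, which is a choice made here and not taken from the paper.

```python
import math
from typing import List, Optional, Sequence, Tuple

Row = Sequence[Optional[float]]


def build_structures(table: List[Row], i: int) -> Tuple[List[int], List[float], int]:
    """Build (MCR_i, ITV_i, RIA_i) for attribute i; list index b-1 holds the entry for 2^b."""
    n = len(table)

    def value(a: int) -> float:
        return math.inf if table[a][i] is None else table[a][i]

    sl = sorted(range(n), key=value)  # SL_i: positions in ascending order of A_i
    ria = sum(1 << a for a in range(n) if table[a][i] is not None)
    mcr, itv = [], []
    b = 1
    while 2 ** b <= n:
        mcr.append(sum(1 << a for a in sl[: 2 ** b]))  # membership of the first 2^b entries
        itv.append(value(sl[2 ** b - 1]))              # A_i value at the exponential gap
        b += 1
    return mcr, itv, ria


def dominance_bit_vector(table: List[Row], structures, t1: Row) -> int:
    """Bits set to 1 mark tuples that are certainly dominated by t1."""
    full = (1 << len(table)) - 1
    greater_part, complete_part = full, 0
    for i, v in enumerate(t1):
        if v is None:
            continue  # only the complete attributes S_c of t1 take part
        mcr, itv, ria = structures[i]
        b = next((k for k, x in enumerate(itv) if x > v), None)
        if b is None:
            return 0  # assumption: nothing can be certified for this attribute
        greater_part &= full & ~mcr[b]
        complete_part |= ria
    return greater_part & complete_part


TABLE = [(1, 2), (2, None), (None, 1), (4, 4), (5, 6), (6, 3), (7, 7), (8, 5)]
STRUCTS = [build_structures(TABLE, i) for i in range(2)]
DBV = dominance_bit_vector(TABLE, STRUCTS, TABLE[0])
print([a for a in range(len(TABLE)) if DBV >> a & 1])  # prints [4, 6, 7]
```

Because ITV only samples the sorted list at exponential gaps, the marked tuples form a subset of everything the tuple dominates: positions 1, 3 and 5 are also dominated by `TABLE[0]` but are not certified by the bit-vectors in this sample.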

The Extraction of the Pruning Tuples
In order to skip the unnecessary tuples of T in stage 1, we first extract some pruning tuples for the subsequent execution of TSI. The number of pruning tuples should not be large, and they should have relatively strong dominance capability. Since the dimensionality of T can be high, we do not extract the pruning tuples with respect to combinations of different attributes, but with respect to the values of a single attribute and the number of complete attributes of each tuple. It is known that the cardinality of skyline results grows exponentially with the size of skyline criteria [4], and on incomplete data the dominance relationship between two tuples is evaluated over their common complete attributes. Intuitively, if a tuple has a small number of complete attributes and one of its complete attribute values is very small, it tends to have a relatively strong dominance capability.
The pruning tuples can be extracted from M sorted column files $SC_1, \ldots, SC_M$ … where $NUM_c$ is the number of the complete attributes of each tuple. The tuples of $SC_i$ $(1 \le i \le M)$ are sorted on $NUM_c$ and $A_i$, i.e., they are first arranged in the ascending order of $NUM_c$, and then all tuples with the same $NUM_c$ are arranged in the ascending order of $A_i$.
For each sorted column file $SC_i$, we retrieve its tuples sequentially. Let sc be the currently retrieved tuple. If $sc.A_i$ is within the first f% proportion among all $A_i$ values, the $PI_T$ value of sc is maintained in memory; otherwise, the next tuple is retrieved. The process continues until the number of $PI_T$ values maintained in memory reaches $n_{pt}$ or the end of the file is reached. Then, the corresponding tuples of T are extracted and kept in a separate pruning tuple file $PT_i$. In this paper, f is set to 5 and $n_{pt}$ is set to 1000; the pruning effect with this parameter setting is satisfactory in the performance evaluation.

Figure 8 illustrates the extraction of pruning tuples in the running example. $SC_i$ $(1 \le i \le 3)$ is arranged first in the ascending order of $NUM_c$, and the tuples with the same value of $NUM_c$ are sorted in the ascending order of $A_i$. In the running example, $f = 12.5$ ($16 \times 12.5\% = 2$) and $n_{pt} = 1$, so one pruning tuple will be retrieved for each $SC_i$. For $SC_1$, $SC_1(1, \ldots, 11)$ cannot be used to generate pruning tuples since their attribute values are not among the two smallest values of $A_1$. Then, $SC_1(12)$ is selected to obtain the pruning tuple $T(SC_1(12).PI_T)$, since it is the first tuple in $SC_1$ whose $A_1$ value is among the two smallest values of $A_1$. The other pruning tuples (T(14) and T(6)) are obtained similarly.
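The extraction procedure can be sketched in memory as follows. This is a simplified illustration, not the file-based implementation: the function name `extract_pruning_positions` and the sample table are hypothetical, positions are 0-based, `None` encodes an incomplete attribute, and "within the first f% proportion" is interpreted here as "not larger than the k-th smallest $A_i$ value, with k = n × f / 100", following the running example.

```python
import math
from typing import List, Optional, Sequence

Row = Sequence[Optional[float]]


def extract_pruning_positions(table: List[Row], i: int, f: float, n_pt: int) -> List[int]:
    """Return up to n_pt positions of pruning tuples for attribute i."""
    n = len(table)

    def value(t: Row) -> float:
        return math.inf if t[i] is None else t[i]

    def num_complete(t: Row) -> int:
        return sum(v is not None for v in t)

    k = max(1, int(n * f / 100))
    cutoff = sorted(value(t) for t in table)[k - 1]  # k-th smallest A_i value

    # SC_i: ascending NUM_c first, then ascending A_i within the same NUM_c.
    sc = sorted(range(n), key=lambda a: (num_complete(table[a]), value(table[a])))

    picked: List[int] = []
    for a in sc:
        if table[a][i] is not None and value(table[a]) <= cutoff:
            picked.append(a)
            if len(picked) == n_pt:
                break
    return picked


TABLE = [(1, 2), (2, None), (None, 1), (4, 4), (5, 6), (6, 3), (7, 7), (8, 5)]
print(extract_pruning_positions(TABLE, 0, 25, 1))  # prints [1]
print(extract_pruning_positions(TABLE, 1, 25, 1))  # prints [2]
```

In this sample the tuple (2, None) is preferred over (1, 2) for the first attribute because it has fewer complete attributes, which matches the intuition that such tuples are comparable with, and hence can dominate, more tuples.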

The Execution of Pruning Operation
By the pre-constructed structures described above, TSI can utilize the pruning operation to reduce the execution cost in stage 1. In order to execute the pruning operation, TSI maintains an n-bit pruning bit-vector PRB in memory, which is filled with bit 0 initially.

Algorithm 2 TSI Pruning(T, S_cnd)
Input: T is an incomplete table, S_cnd is a set maintaining the candidate tuples
Output: S_cnd, a set maintaining the skyline tuples over T
1: MH is a min-heap to keep m pruning tuples with the highest dominance capability.
2: initialize S_cnd ← ∅, MH ← ∅;
3: // Stage 1: find the candidate tuples
4: extract the involved pruning tuples PT_1, PT_2, ..., PT_m for each skyline criterion of T, and put PT_1, PT_2, ..., PT_m into MH;
5: while MH has more pruning tuples do
6:   retrieve the next tuple pt of MH;
7:   S_c is the set of complete attributes of pt, S_c = {A_1, ..., A_|S_c|};
8:   if PRB(pt) = 1 then
9:     pt can be skipped;
10:  else
11:    for (i = 1; i ≤ |S_c|; i++) do
12:      compute the first value …
…

Algorithm 2 is the pseudo-code of the execution of the pruning operation. At the beginning of stage 1, TSI determines the involved pruning tuple files $PT_1, PT_2, \ldots, PT_m$ according to the current skyline criteria and retrieves pruning tuples from them. While retrieving $PT_1, PT_2, \ldots, PT_m$, TSI maintains a min-heap MH in memory to keep the m pruning tuples with the highest dominance capability (line 4). Given a pruning tuple pt, let $S_c$ be its complete attributes. Likewise, assume that $S_c = \{A_1, \ldots, A_{|S_c|}\}$ (line 5-6). $\forall 1 \le i \le |S_c|$, we determine the first value $ITV_i[b_i]$ of $ITV_i$ which is greater than $pt.A_i$; its dominance capability … (line 11-13). For the retrieved pruning tuple pt, TSI sets the $(pt.PI_T)$th bit of PRB to 1, since it has been retrieved already (line 14). Besides, for each pruning tuple pt, TSI removes any candidates in $S_{cnd}$ which are dominated by pt (line 18-23). If pt is not dominated by any candidate in $S_{cnd}$, TSI keeps it in $S_{cnd}$ (line 26-27). $\forall pt_b \in MH$ $(1 \le b \le m)$, TSI computes its corresponding bit-vector $DBV_{pt_b}$ of dominance checking as in Sect. 6.2.2 (line 29). The final pruning bit-vector PRB is …
In stage 1, ∀1 ≤ a ≤ n , if PRB(a) = 1 , T(a) can be skipped; otherwise, TSI needs to retrieve T(a). The rest of the execution in stage 1 is the same as that in Sect. 6.1.
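The interplay between the pruning bit-vector and the two stages can be sketched as follows. This is an in-memory illustration under assumptions, not the disk-based algorithm: the name `tsi`, the integer encoding of PRB (bit a marks the tuple at 0-based position a), the sample tables, and the sample PRB value are all hypothetical. The sketch assumes that PRB marks only tuples already known to be dominated.

```python
from typing import List, Optional, Sequence

Row = Sequence[Optional[float]]


def dominates(t1: Row, t2: Row) -> bool:
    """Dominance over the common complete attributes (None = incomplete)."""
    common = [(a, b) for a, b in zip(t1, t2) if a is not None and b is not None]
    return bool(common) and all(a <= b for a, b in common) and any(a < b for a, b in common)


def tsi(table: List[Row], prb: int = 0) -> List[int]:
    """Return the positions of the skyline tuples of table."""
    # Stage 1: selective scan; tuples whose PRB bit is 1 are skipped without retrieval.
    candidates: List[int] = []
    for a, t in enumerate(table):
        if prb >> a & 1:
            continue
        candidates = [c for c in candidates if not dominates(t, table[c])]
        if not any(dominates(table[c], t) for c in candidates):
            candidates.append(a)
    # Stage 2: full scan; every tuple, including skipped or discarded ones, may still
    # dominate a candidate because dominance is not transitive on incomplete data.
    for t in table:
        if not candidates:
            break  # all candidates removed, terminate early
        candidates = [c for c in candidates if not dominates(t, table[c])]
    return candidates


TABLE = [(1, 2), (2, None), (None, 1), (4, 4), (5, 6), (6, 3), (7, 7), (8, 5)]
CYCLE = [(1, None, 2), (2, 1, None), (None, 2, 1)]
PRB = (1 << 4) | (1 << 6) | (1 << 7)  # positions assumed to be marked as dominated
print(tsi(TABLE), tsi(TABLE, PRB), tsi(CYCLE))  # prints [2] [2] []
```

The `CYCLE` input shows why stage 2 is needed: the third tuple survives stage 1 only because the second tuple, which dominates it, had already been discarded.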

Example 6
In the running example, TSI only needs to retrieve three tuples (T(8), T(15), T(16)) in stage 1 by use of PRB. This reduces the I/O cost and computation cost significantly.

Experimental Settings
To evaluate the performance of TSI, we implement it in Java with jdk-8u20-windows-x64. The experiments are executed on a LENOVO ThinkCentre M8400 (Intel(R) Core(TM) i7 CPU @ 3.40GHz (8 CPUs), 32GB memory, 3TB HDD, 64-bit Windows 7). In the experiments, we implement TSI, BA, SOBA [10] and SIDS [1]. Under the experimental settings below, the execution times of SOBA and SIDS are so long that we do not report their results here, but evaluate them separately in Sect. 7.8. For BA, the size S of the allocated memory is 4GB. We do not use a larger size for BA because of a trade-off: with the assistance of the bit-vector $B_{ret}$ mentioned in Sect. 5, a larger value of S loads more tuples of T into memory at a time and reduces the number of iterations, but it also reduces the proportion of retrievals that can benefit from the skipping optimization.
In the experiments, we evaluate the performance of TSI in terms of several aspects: tuple number (n), used attribute number (m), incomplete ratio (p), and correlation coefficient (c). The experiments are executed on three data sets: two synthetic data sets (independent distribution and correlated distribution) and a real data set. The parameter settings used are listed in Table 2. For the correlated distribution, the first two attributes have the specified correlation coefficient, while the remaining attributes follow the independent distribution. In order to generate two sequences of random numbers with correlation coefficient c, we first generate two uncorrelated sequences of random numbers $X_1$ and $X_2$; then a new sequence $Y_1 = c \times X_1 + \sqrt{1 - c^2} \times X_2$ is generated, and we obtain two sequences $X_1$ and $Y_1$ with the given correlation coefficient c. When generating synthetic data, we fix M to be 60 and generate data with all attributes complete. Then, according to the skyline criteria used, we first select one attribute, which remains complete. Each of the other $(m - 1)$ attributes in the skyline criteria independently has a probability p of being incomplete.

The real data used are the HIGGS Data Set from the UCI Machine Learning Repository, which is provided for a classification problem and includes 11,000,000 instances. The main reasons for using HIGGS are that (1) it is one of the largest data sets to our knowledge, which allows a better comparison of the performance of the above algorithms, and (2) it is an open data set that can be obtained conveniently. On real data, we evaluate the performance of TSI with varying values of p.
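The data generation described above can be sketched as follows. This is an illustration under assumptions, not the generator used in the experiments: $X_1$ and $X_2$ are drawn here as independent standard normal values (the formula needs them to be uncorrelated with equal variance), the always-complete attribute is taken to be the first one, `None` encodes an incomplete attribute, and all function names are hypothetical.

```python
import math
import random
from typing import List, Optional, Tuple


def correlated_pair(n: int, c: float, rng: random.Random) -> Tuple[List[float], List[float]]:
    """Generate X1 and Y1 = c * X1 + sqrt(1 - c^2) * X2 with correlation close to c."""
    x1 = [rng.gauss(0, 1) for _ in range(n)]
    x2 = [rng.gauss(0, 1) for _ in range(n)]
    y1 = [c * a + math.sqrt(1 - c * c) * b for a, b in zip(x1, x2)]
    return x1, y1


def pearson(xs: List[float], ys: List[float]) -> float:
    """Sample Pearson correlation coefficient."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    return cov / math.sqrt(vx * vy)


def inject_missing(row: List[float], p: float, rng: random.Random) -> List[Optional[float]]:
    """Keep the first attribute complete; drop each other attribute with probability p."""
    return [row[0]] + [None if rng.random() < p else v for v in row[1:]]


rng = random.Random(7)  # fixed seed so the sketch is repeatable
x1, y1 = correlated_pair(20000, 0.6, rng)
rho = pearson(x1, y1)
sample = inject_missing([0.5] * 20, 0.3, rng)
```

Since $X_1$ and $X_2$ have unit variance and are uncorrelated, $\mathrm{Var}(Y_1) = c^2 + (1 - c^2) = 1$ and $\mathrm{Cov}(X_1, Y_1) = c$, so the correlation of $X_1$ and $Y_1$ is c up to sampling noise.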
The required structures are pre-constructed before the experiments. Under the default setting of the experiments, i.e., $M = 60$, $n = 50 \times 10^6$, and $p = 0.3$, it takes 6840.573 seconds to pre-construct the required data structures.

The Comparison of TSI with and Without Pruning
The performance of $TSI_B$ and TSI is compared in different aspects, where $TSI_B$ is the TSI algorithm without the pruning operation. As depicted in Fig. 10a, TSI runs 18.84 times faster than $TSI_B$, and the speedup ratio increases with a greater value of n. This significant advantage is due to the effective pruning operation. The numbers of candidates after stage 1 are illustrated in Fig. 10b. TSI maintains more candidates than $TSI_B$ after stage 1. This is because the pruning operation skips most of the tuples in stage 1, and therefore many candidates which would have been removed by some of the skipped tuples are left. However, the pruning operation reduces the cost in stage 1 significantly. Figure 10c reports the time decomposition of $TSI_B$. The execution time of stage 1 dominates its overall time; the time in stage 2 is not even visible due to its rather small proportion. Figure 10d gives the time decomposition of TSI, which consists of four parts: the time to retrieve pruning tuples, the time to load the required bit-vectors, the time in stage 1, and the time in stage 2. The time in stage 2 of TSI is longer than that of $TSI_B$ due to the greater number of candidates left. However, the time reduction in stage 1 of TSI is much more significant, and TSI runs one order of magnitude faster than $TSI_B$ on average. As shown in Fig. 10(e and f), the pruning operation makes TSI incur less I/O cost and perform fewer dominance checks.

Experiment 1: the Effect of Tuple Number
Given $m = 20$, $M = 60$, $p = 0.3$ and $c = 0$, experiment 1 evaluates the performance of TSI on varying tuple numbers. As shown in Fig. 11a, TSI runs 60.42 times faster than BA on average. The speedup ratio of TSI over BA increases with a greater value of n, from 8.31 at $n = 5 \times 10^6$ to 166.58 at … T to compute incomplete skyline results. At $n = 500 \times 10^6$, BA needs to execute 56 iterations, each loading a part of T and then followed by a table scan on T to remove the dominated tuples. In contrast, the execution time of TSI grows more slowly with the tuple number due to its execution process and pruning operation. As illustrated in Fig. 11d, the pruning operation of TSI can skip the vast majority of tuples in stage 1. The pruning ratio in the experiments is computed by the …

Experiment 2: the Effect of Skyline Criteria Size
Given $M = 60$, $n = 50 \times 10^6$, $p = 0.3$ and $c = 0$, experiment 2 evaluates the performance of TSI on varying skyline criteria sizes. As illustrated in Fig. 12a, with a greater value of m, the execution times of BA and TSI both increase significantly; TSI still runs 85.79 times faster than BA on average. For BA, the I/O cost consists of two parts. First, BA needs to retrieve T once to load it into memory. Second, BA performs a sequential scan on T in each iteration to discard the candidates in memory which are dominated by some tuples. For the first part, BA may not retrieve all tuples into memory, since some tuples may already have been found to be dominated in previous iterations. For the second part, if the current candidates are all discarded, BA does not have to continue the sequential scan but proceeds directly to the next iteration. When the value of m increases with the other parameters fixed, the probability that a tuple is dominated by another tuple becomes lower. Therefore, the I/O cost increases for both parts. This is reported in Fig. 12b. For TSI, the I/O cost also consists of two parts. In stage 1, TSI performs a selective scan on T to obtain the candidates of incomplete skyline results. In stage 2, TSI does another sequential scan on T to compute the results, in which TSI can terminate directly if all candidates are removed. As the value of m increases, the pruning effect in TSI becomes worse in stage 1 (Fig. 12d), and TSI has to retrieve more tuples before it terminates in stage 2. This leads to a higher I/O cost for TSI with a greater value of m, as illustrated in Fig. 12b.
By a similar explanation, as shown in Fig. 12c, the numbers of dominance checks for both algorithms increase with a greater value of m.

Experiment 3: the Effect of Incomplete Ratio
Given $m = 20$, $M = 60$, $n = 50 \times 10^6$ and $c = 0$, experiment 3 evaluates the performance of TSI on varying incomplete ratios. As the value of p increases, the execution time of BA decreases quickly, while the execution time of TSI first decreases and then increases gradually. For BA, the decline of execution time is easy to understand. With a greater value of p, the probability that any tuple is dominated by other tuples increases. As a result, more in-memory candidates in each iteration are dominated by some tuples in the sequential scan, which reduces the I/O cost and the dominance checking cost. As illustrated in Fig. 13c, with a greater value of p, the number of dominance checks in BA decreases steadily. As shown in Fig. 13b, the I/O cost of BA first decreases significantly when p increases from 0.3 to 0.4 and then remains basically unchanged. When p increases from 0.3 to 0.4, the number of in-memory candidates is reduced during the sequential scan and, in each iteration, BA terminates earlier. This results in less I/O cost for BA. When the value of p is greater than 0.4, the number of in-memory candidates is reduced as well, but in each iteration BA reaches an approximately equal scan depth before it terminates. For TSI, the effect of the pruning operation depends on two factors. One is the probability that one tuple can be dominated by other tuples. The other is whether all common attributes of two tuples are incomplete. The two factors have different effects in different cases. With a greater value of p, the probability that a tuple is dominated by some tuples increases, and so does the probability that the common attributes of two tuples are all incomplete. When p increases from 0.3 to 0.5, the first factor has a greater impact; beyond that, the second factor plays a larger role. This explains the trend of the execution time of TSI. Similarly, this explains the variation trend of TSI in I/O cost (Fig. 13b), the number of dominance checks (Fig. 13c), and the pruning ratio (Fig. 13d).

Experiment 4: the Effect of Correlation Coefficient
Given $m = 20$, $M = 60$, $n = 50 \times 10^6$ and $p = 0.3$, experiment 4 evaluates the performance of TSI on varying correlation coefficients. As illustrated in Fig. 14a, TSI runs 47.72 times faster than BA. The correlation coefficients considered range from -0.8 to 0.8. A negative correlation means that there is an inverse relationship between two variables: when one variable decreases, the other increases. A positive correlation means that the variables tend to move in the same direction. Therefore, the skyline computation on negatively correlated data is usually more expensive than that on positively correlated data. The execution times of TSI and BA both show a downward trend in experiment 4. Here, the trend is not significant because the incomplete attributes in the data set reduce the impact of correlation. The I/O cost and the number of dominance checks are depicted in Fig. 14(b and c), respectively, and they have similar variation trends. The effect of the pruning operation of TSI is illustrated in Fig. 14d. Due to the impact of incomplete attributes, the pruning ratio fluctuates considerably, but it still shows an upward trend overall.

Experiment 5: Real Data
The real data, the HIGGS Data Set, are obtained from the UCI Machine Learning Repository. The data set contains 11,000,000 tuples with 28 attributes. We select the first 20 attributes as skyline criteria and evaluate the performance of TSI with varying incomplete ratios. Before the experiment is executed, one attribute is first chosen to be complete, and each of the other $(m - 1)$ attributes in the skyline criteria independently has a probability p of being incomplete. As depicted in Fig. 15a, TSI runs 40.46 times faster than BA. The variation trends of the execution times of BA and TSI are very close to those in Sect. 7.5 and can be explained similarly. The I/O cost and the number of dominance checks are depicted in Fig. 15(b and c), respectively. The pruning ratio in TSI is illustrated in Fig. 15d. The variation in these figures can be explained similarly as in Sect. 7.5.

Experiment 6: the Comparison with SOBA and SIDS
In this part, we evaluate the performance of TSI against BA, SOBA and SIDS on a relatively small data set with a relatively small skyline criteria size. Given $n = 10 \times 10^6$, $p = 0.3$ and $c = 0$, in order to obtain a better performance for SOBA and SIDS, we vary m from 6 to 10 and set M equal to m. This reduces the length of each tuple and also lowers the cost of bucket partitioning for SOBA and SIDS. As illustrated in Fig. 16a, SIDS is the slowest among the four algorithms while TSI is the fastest for all skyline criteria sizes, and the execution time of SOBA increases significantly with m. When $m = 10$, SOBA runs 10.96 times slower than BA, the baseline algorithm in this paper, and 200.91 times slower than TSI. As for SIDS, it runs 21.11 times slower than BA and 386.84 times slower than TSI. On disk-resident data, SOBA and SIDS cannot process incomplete skyline efficiently. The bucket partitioning of SOBA involves two passes of table scan, not to mention the maintenance cost of the large number of partitions on disk if m is not small. Then, the computation of the local skyline involves another pass of tuple retrieval. For a relatively large value of m, the number of local skyline tuples is large as well. As depicted in Fig. 16b, the local skyline makes up 11.7% of the total tuples at $m = 10$. The I/O cost of SOBA is much larger than that of the others, while SIDS and BA are close in I/O cost. The execution time of SIDS grows fast with respect to the skyline criteria size. TSI is efficient not only for in-memory data sets with a small skyline criteria size, but also for disk-resident data with a larger skyline criteria size.
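As a quick consistency check on the figures quoted for $m = 10$, and assuming that "runs x times slower" denotes a ratio of execution times (the symbol $t_X$ for the execution time of algorithm X is introduced here), the two pairs of ratios imply nearly the same speedup of TSI over BA:

```latex
\[
\frac{t_{BA}}{t_{TSI}}
  = \frac{t_{SOBA}/t_{TSI}}{t_{SOBA}/t_{BA}}
  = \frac{200.91}{10.96} \approx 18.3,
\qquad
\frac{t_{BA}}{t_{TSI}}
  = \frac{t_{SIDS}/t_{TSI}}{t_{SIDS}/t_{BA}}
  = \frac{386.84}{21.11} \approx 18.3 .
\]
```

Both derivations agree, so the reported ratios are mutually consistent under this reading.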

Conclusion
This paper considers the problem of incomplete skyline computation on massive data. Our analysis shows that the existing algorithms cannot process the problem efficiently. A table-scan-based algorithm, TSI, is devised in this paper to deal with the problem efficiently. Its execution consists of two stages. In stage 1, TSI maintains the candidates by a sequential scan. In stage 2, TSI performs another sequential scan to refine the candidates and acquire the final results. In order to reduce the cost in stage 1, which dominates the overall cost of TSI, a pruning operation is utilized to skip the unnecessary tuples in stage 1. The experimental results show that TSI outperforms the existing algorithms significantly.
Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.