System workflow
Before reporting on the experimental evaluation of our approach, we would like to consolidate all the intelligent components of the proposed methodology to demonstrate the system workflow. Upon recording predictive analytics queries and their corresponding scores (cardinality values), we first quantize the query-space into a set of centroids, representing the patterns of the issued queries. Then, the novel cluster-head refinement method is applied to increase the reliability of a query’s closest cluster-head, such that the closest cluster-head of a given query should have the most accurate score prediction.
The refinements are achieved by investigating the statistical patterns of the winner and the rival cluster-heads, which are adjusted to minimize the prediction error. The adjustment mechanism adopts a reward-penalty methodology that fine-tunes the positions of the winner and the rival cluster-head vectors in the query space, with the aim of increasing the reliability of the score prediction while, at the same time, minimizing the prediction error.
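For illustration, the following minimal Python sketch shows one way such a reward-penalty adjustment could look, assuming an LVQ-style update in which the winner cluster-head is attracted towards the query and the rival is repelled from it with a step size a(q); the function name and the exact update form are illustrative assumptions rather than the method's precise rule.

```python
import numpy as np

def reward_penalty_update(query, winner, rival, a):
    """Illustrative LVQ-style adjustment (assumed form, not the exact rule):
    reward the winner by moving it towards the query, penalise the rival by
    pushing it away, both scaled by the step size a."""
    winner_new = winner + a * (query - winner)   # reward: attract the winner
    rival_new = rival - a * (query - rival)      # penalty: repel the rival
    return winner_new, rival_new

# toy usage with one 4-dimensional query and two cluster-heads
q = np.array([0.2, 0.8, 0.1, 0.9])
w = np.array([0.3, 0.7, 0.2, 0.8])   # winner (closest cluster-head)
r = np.array([0.9, 0.1, 0.8, 0.2])   # rival (second-closest cluster-head)
w, r = reward_penalty_update(q, w, r, a=0.1)
```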
Finally, the system hosting the refined cluster-heads is ready to provide reliable score predictions for any given query.
Dataset & experiment set-up
We conducted a series of experiments to answer the questions raised in our methodology section. Carrying out these experiments required a real-life data set and query workloads.
Dataset
We adopted the data set of gas contextual sensor data used in [21] and [19], publicly available from the UCI Machine Learning Repository. The data set consists of 14,000 measurements from 16 chemical sensors exposed to 6 gases at different concentration levels, with 8 features per chemical sensor. Our exploration queries focus on the measurements of the first two features of the first chemical sensor for ethanol.
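A minimal loading sketch is given below; the file name and column names are hypothetical placeholders, since the exact layout of the UCI files is not reproduced here.

```python
import pandas as pd

# Hypothetical loading sketch: the actual file name and column layout of the
# UCI gas sensor data set may differ from what is assumed here.
df = pd.read_csv("gas_sensor_measurements.csv")

# keep only the ethanol measurements (assumed label column 'gas')
ethanol = df[df["gas"] == "ethanol"]

# first two features of the first chemical sensor (assumed column names)
data = ethanol[["sensor1_feature1", "sensor1_feature2"]].to_numpy()
```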
Query workload
We used a query workload of size 1000 adopted from the UCI ML Repository, where each query has two (min, max) pairs, and ran it against the data set to obtain the score y of each query. The resulting training set was of size 600 and the testing set of size 400.
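The score of a query is its cardinality, i.e., the number of data points that fall inside the 2-D range defined by its two (min, max) pairs. A minimal sketch of this scoring and of the 600/400 split follows; the random stand-in data and the query encoding as (min1, max1, min2, max2) are assumptions made for the example.

```python
import numpy as np

def query_score(data, query):
    """Cardinality score y of a 2-D range query.
    `query` = (min1, max1, min2, max2): one (min, max) pair per dimension."""
    lo1, hi1, lo2, hi2 = query
    inside = ((data[:, 0] >= lo1) & (data[:, 0] <= hi1) &
              (data[:, 1] >= lo2) & (data[:, 1] <= hi2))
    return int(inside.sum())

rng = np.random.default_rng(0)
data = rng.random((14000, 2))                        # stand-in for the sensor data
queries = np.sort(rng.random((1000, 2, 2)), axis=2)  # 1000 queries, sorted (min, max) pairs
queries = queries.reshape(1000, 4)
scores = np.array([query_score(data, q) for q in queries])

train_q, test_q = queries[:600], queries[600:]       # 600 training / 400 testing queries
train_y, test_y = scores[:600], scores[600:]
```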
Experimental set-up
For the query space quantization, we experimented with the k-means algorithm with values k ∈ {7,15,22,40}. As there is no known method for acquiring the optimal k value in our specific query-driven clustering process, we decided to start with small values of k so that each centroid could be monitored. These k values were also considered in the related work [3], which is used for our comparative assessment.
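A sketch of this quantization step is shown below, using scikit-learn's k-means on the training queries from the previous sketch; initialising each cluster-head's score as the mean score of its assigned queries is our own simplifying assumption.

```python
import numpy as np
from sklearn.cluster import KMeans

# quantize the query space for each candidate k (train_q from the previous sketch)
centroid_sets = {}
for k in (7, 15, 22, 40):
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(train_q)
    centroid_sets[k] = km.cluster_centers_

# one simple (assumed) way to attach a score to each cluster-head for k = 7:
# the mean score of the training queries assigned to it
km7 = KMeans(n_clusters=7, n_init=10, random_state=0).fit(train_q)
centroid_scores = np.array([train_y[km7.labels_ == j].mean() for j in range(7)])
```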
In order to compare our model with the predictive models [3] and [2], we adopted n-fold cross-validation with n = 10. To check whether there is a statistically significant difference between the means of the score prediction accuracy values, we fixed a confidence level α = 0.95; the reasoning is that if a difference is significant at this level, there is only a (1 − α) probability that it arose by chance. Since we use a two-tailed test, the significance threshold is set to (1 − α)/2 = 0.025. In all the reported experiments and comparisons, the derived probability values (p-values) were less than (1 − α)/2 = 0.025 (with the average p-value being 0.0127). This indicates that the comparative assessment of the score prediction accuracy is statistically significant.
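A sketch of this significance check is given below, using a paired two-tailed t-test over the ten fold-wise accuracy values; the accuracy numbers are illustrative placeholders, not results from the paper.

```python
import numpy as np
from scipy.stats import ttest_rel

# illustrative fold-wise accuracies (placeholders, not actual results)
acc_ours = np.array([0.91, 0.89, 0.92, 0.90, 0.93, 0.88, 0.91, 0.92, 0.90, 0.89])
acc_base = np.array([0.87, 0.85, 0.88, 0.86, 0.89, 0.84, 0.88, 0.87, 0.85, 0.86])

t_stat, p_value = ttest_rel(acc_ours, acc_base)   # two-tailed by default
print(f"p = {p_value:.4f}, significant at 0.025: {p_value < 0.025}")
```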
Reliable refinement methodology & comparative assessment
We examine the behaviour of the three refinement methods of reward/penalty w.r.t. the value of a(q) given a query q. We first report on the results of the Rewarding Methods A and B; Method C is investigated later. We determined the best method by measuring the performance of each possible combination of Method A (or B) with the three Refinement Approaches, using the same initial centroids.
For each of these tests, we initially created a new refinement set of 1500 queries and used a method-refinement approach combination to produce a new set of refined centroids. A test refinement set of 400 queries was then used to obtain the average BR, AR and IMP metrics that quantify the refinement's effect. For each refinement set, five test refinement sets were produced and their BR, AR and IMP values were averaged. This process was repeated three times for each experiment with three different refinement sets, and the resulting average BR, AR and IMP values for an experiment were once again averaged.
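As a reading aid, the following sketch computes the kind of reliability rate that BR and AR summarise, assuming BR (before refinement) and AR (after refinement) are the percentages of test-refinement queries for which the rival cluster-head predicts the score better than the winner, and IMP = BR − AR; these definitions are our reading of the metrics, not a verbatim restatement.

```python
import numpy as np

def rival_better_rate(queries, scores, centroids, centroid_scores):
    """Percentage of queries for which the rival (second-closest) cluster-head
    gives a lower absolute score error than the winner (closest) one."""
    count = 0
    for q, y in zip(queries, scores):
        order = np.argsort(np.linalg.norm(centroids - q, axis=1))
        win, riv = order[0], order[1]
        if abs(centroid_scores[riv] - y) < abs(centroid_scores[win] - y):
            count += 1
    return 100.0 * count / len(queries)

# assumed usage:
# BR  = rival_better_rate(test_q, test_y, centroids, centroid_scores)
# AR  = rival_better_rate(test_q, test_y, refined_centroids, centroid_scores)
# IMP = BR - AR
```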
The results for Methods A and B for each refinement approach are provided in Fig. 3. Method A with a(q) = 0.1 achieves the best performance in improving the reliability of the winner representative (according to IMP). Using Method A along with Refinement Approach 2 leads to the best refinement in all experiments. At that point, we concluded that Method A was the best choice for acquiring a value of a, but we were still not convinced that Refinement Approach 2 was the best choice, as we expected the behaviour of the refinement approaches to change as the number of centroids increases.
Based on this reasoning, we conducted a new series of experiments to determine which refinement approach is the best using Method A. In these experiments, four sets of centroids were explored, corresponding to k = 7, k = 15, k = 22 and k = 40. It should be noted that, in all of these experiments, we ensured that the same initial sets of centroids were used for each k. The same methodology as before was used for producing average BR, AR and IMP values. The results of these experiments are provided in Fig. 4. In most cases, the refinement's effect decreases as k increases. The relationship between k and IMP (the refinement's effect) is further examined in Fig. 5. This inverse-like relationship applies for all refinement approaches at all sizes of k, except when k = 15 for Refinement Approach 3, where the refinement seems to be most effective.
It may also be noticed that the lowest AR may belong to a different Refinement Approach at different k values, although in most cases the difference is too small to be deemed important. We consequently concluded that Refinement Approach 2 would be our choice for refining centroids. We then considered that the method for acquiring a value of a can depend on the number of occupants in a representative's cluster [4], i.e., Method C. Accordingly, experiments were conducted in the same manner as before, but this time using Refinement Approach 2 in combination with Method C.
We can see from Fig. 6 that Method C produces the lowest AR results in all cases. Additionally, when k = 22, despite the fact that Refinement Approach 2 with Method C produced the lowest AR, the IMP turns negative. This shows that the refinement function slightly decreased the reliability of the winner representative. It also indicates that there is a limit to how much we can refine a set of centroids before reliability stops improving; this assumption is further examined later. Similar results are obtained for Method C combined with Refinement Approaches 1 and 3.
To examine whether the centroid refinement method has improved the reliability of the winner representative, we provide in Figs. 7, 8, 9, and 10 a comparison with the previous approaches [3] and [2], which use the query's winner representative for the prediction decision. Each comparison reports the number of cases where the rival representative had a better prediction than the winner representative, using the prediction approach in [3] and [2] (red bar) against one of our centroid refinement methods (blue bar), for four sets of centroids. We present the performance of our methodology using Refinement Approaches 1, 2 and 3 with Method A, as well as Refinement Approach 2 with Method C.
From these results we conclude that our proposed centroid refinement offers higher reliability of the winner representative than the existing approaches [3] and [2]. Note that we obtain similar comparative assessment results for the other combinations of refinement approaches and rewarding methods, illustrating the capability of our methodology to provide accurate and reliable predictions after appropriate centroid refinement.
Refinement cycles & prediction reliability
We have discussed previously that the contents of a refinement set can be crucial to the refinement's effect on the centroids. Since refinement sets are random, we suspected that they can contain queries that either improve or worsen the centroid refinement. Hence, we introduce the idea of refinement cycles.
During a refinement cycle, a new refinement set is generated and the refinement function is applied to the centroids resulting from the previous refinement cycle; in the first refinement cycle, refinement is applied to the centroids resulting from the k-means clustering. We keep introducing new refinement cycles as long as there is a positive IMP value, in other words, as long as there is improvement.
We initially created four different sets of centroids, in all cases with k = 7. During each refinement cycle, we create a refinement set of 500 queries to be used by the refinement function and a test refinement set of 400 queries to produce the values of BR, AR and IMP. The refinement function uses Refinement Approach 2 with Method A. At the end of each refinement cycle, we calculate the average values of BR, AR and IMP over the four sets of centroids. We also tried ten different refinement sets during each cycle and eventually used the one that performed best in improving reliability; this way we ensured that, at each cycle, we had the best refinement set we could find.
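The cycle logic can be sketched as follows, reusing the rival_better_rate() helper above; draw_refinement_set() and refine() are hypothetical stand-ins for the workload sampler and the refinement function (Refinement Approach 2 with Method A).

```python
def run_refinement_cycles(centroids, centroid_scores, draw_refinement_set,
                          refine, test_q, test_y, max_cycles=10):
    """Keep refining while the best of ten candidate refinement sets still
    yields a positive IMP; stop as soon as no candidate improves reliability."""
    history = []
    for cycle in range(max_cycles):
        br = rival_better_rate(test_q, test_y, centroids, centroid_scores)
        best_imp, best_centroids = None, None
        for _ in range(10):                    # try ten candidate refinement sets
            ref_q, ref_y = draw_refinement_set(500)
            candidate = refine(centroids, centroid_scores, ref_q, ref_y)
            ar = rival_better_rate(test_q, test_y, candidate, centroid_scores)
            imp = br - ar
            if best_imp is None or imp > best_imp:
                best_imp, best_centroids = imp, candidate
        if best_imp <= 0:                      # negative IMP: refinement stops
            break
        centroids = best_centroids
        history.append((cycle, br, br - best_imp, best_imp))  # (cycle, BR, AR, IMP)
    return centroids, history
```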
The results of these experiments are presented in Figs. 11 and 12. It can be seen that, in all four cases, no further improvements in reliability could be made during the third refinement cycle, as all of the IMP values became negative.
Taking also into account the IMP value when k = 22, we conclude that we can refine the centroids only up to a certain point before the reliability of the winner representative starts getting worse. We can also see that the average AR of the second refinement cycle is close to the value obtained previously (25.98%). Therefore, we deduce that refinement cycles do not lead to a drastic improvement in reliability.
In addition, we used 500 fewer queries to reach a similar AR value, but also used another 500 queries to determine that the best AR had been reached in the previous cycle. Having established the limits of refinement, we proceeded to examine our prediction approaches while using Refinement Approach 2 with Method C for centroid refinement. We have discussed three different prediction approaches in our methodology, all of which are examined in this subsection. The experiments conducted here aimed not only to find the best prediction approach, but also to determine whether improved reliability of the winner representative leads to better predictability. These results are shown in Fig. 13.
We further examined each prediction approach by making predictions for the scores of queries. We made separate predictions for queries derived from the three refinement sets used during centroid refinement and from their corresponding test refinement sets. The actual scores of these queries were in all cases already known, which allowed the prediction errors to be calculated.
All the prediction errors were then used to produce the average absolute prediction error. We made separate predictions using both the refined (R) and unrefined (UR) centroids in order to compare their average prediction errors and determine whether centroid refinement has benefited our predictions. We ensured that, for all refinement sets and test refinement sets, predictions were made with their corresponding centroids from centroid refinement.
The average prediction errors for unrefined centroids can be found in the columns labelled UR, while the average prediction errors for refined centroids can be found in the columns labelled R in Fig. 13. Therefore, for a given k, there are four average prediction error values for each prediction approach. Our initial tests involved three sets of centroids with k = 7, k = 22 and k = 40. We observed that, for all prediction approaches, the average prediction errors tend to drop as the number of centroids increases. This led us to conduct more experiments with higher values of k, namely k = 85, k = 107 and k = 130. There are a few key points that we can conclude from our experimental results.
Firstly, Prediction Approach 2 outperforms all other prediction approaches in all cases. The only case where this does not hold is when k = 7, for predictions made using the refined centroids for the test refinement set, where the average prediction error is the same as with Prediction Approach 1. Hence, we can determine that using the scores of more than one representative improves our predictions. We can assume at this point that our predictions could be further improved if the calculations included the scores of even more centroids close to a query, weighted appropriately.
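One way to realise this weighting idea is an inverse-distance-weighted combination of the scores of the m closest cluster-heads, sketched below; this is an illustrative rule and not necessarily the exact formula of Prediction Approach 2.

```python
import numpy as np

def weighted_score_prediction(query, centroids, centroid_scores, m=2, eps=1e-9):
    """Predict the score of `query` as the inverse-distance-weighted average of
    the scores of its m closest cluster-heads (illustrative weighting rule)."""
    d = np.linalg.norm(centroids - query, axis=1)
    nearest = np.argsort(d)[:m]
    w = 1.0 / (d[nearest] + eps)               # closer cluster-heads weigh more
    return float(np.dot(w, centroid_scores[nearest]) / w.sum())
```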
Secondly, improving the winner representative's reliability does not make predictions better in all cases. Our results show that only for the queries of the refinement sets are the predictions made from refined centroids consistently better than (or at least as good as) those made from the unrefined centroids, for all three prediction approaches and all six sizes of k. This comes as no big surprise, of course, as these are the sets that re-adjusted the centroids. Centroid refinement seemed to be most effective in Prediction Approaches 1 and 3, where the former has a better performance for k ∈ {85,107,130}, while the latter has a better performance for k ∈ {7,22,40}.
Figure 14 displays how score predictions from the refined centroids outperformed all score predictions from the unrefined centroids (using Prediction Approach 1), except at k = 22, where predictions from the unrefined and refined centroids had exactly the same average prediction error. As that is the only case with no drop in the average prediction error, we can look back at Fig. 6 and observe that k = 22 was the test where we actually had a small negative IMP value. For the remaining values of k, IMP was always positive and predictions were improved; the IMP values for k = 85, k = 107 and k = 130 are 0.40%, 0.20% and 0.57%, respectively.
Thirdly, the test refinement set experiments are where we would ideally like to see consistent improvement due to refinement, as these act as indicators of how well our rationale generalizes. In the cases of k = 40 and k = 85, lower average prediction errors were achieved due to centroid refinement for all prediction approaches, even if in some cases the improvement is not large enough to be deemed significant. The comparison of the average prediction errors between predictions made from unrefined and refined centroids, using Prediction Approach 2 for the test refinement sets, is shown in Fig. 15.
In these cases, there is no obvious indicator of when refinement helps or for which prediction approaches. We can further support this claim by taking into account our BR, AR and IMP values. The greatest IMP value was observed at k = 7 and the lowest at k = 22 (where IMP is actually negative). Yet, when we look at the corresponding case of Prediction Approach 2 at k = 7 in Figs. 13 and 15, the unrefined centroids offered the better predictions by a significant margin, whereas in the corresponding case for Prediction Approach 2 at k = 22, the refined centroids had an average prediction error of 132.5 while the unrefined centroids had an average prediction error of only 131.5.
Therefore our conclusions from the above analysis are:
- Firstly, the centroid refinement has the potential to always improve score predictions for queries in the refinement set. We expect to see the average prediction error decrease as long as there is a positive IMP value (see Fig. 14).
- Secondly, there does not seem to be a direct relationship between IMP and the average prediction error when predicting scores for the test refinement queries, as there are cases where predictions got worse as well as cases where they got better due to centroid refinement. Increasing the number of centroids to improve predictions works only up to a certain extent. We can see from Fig. 13 that, in most cases, when we increased k, the average prediction error for the same test at a lower k would be higher. Despite that, the drop in the average prediction error gets smaller every time we increase k. In fact, in certain cases the average prediction error can get worse as k increases, e.g., almost all results of Refinement Approach 1 from k = 85 to k = 130.
- We can also observe from the bar charts in Figs. 14 and 15 that the effect that k has on the average prediction errors becomes more subtle as k increases. We can safely conclude that predictions can get better within a specific range of high k values, but this range can be acquired only through experimentation, as k should differ between data sets and between queries with different numbers of dimensions; an illustrative sketch of such a scan over k is given below.
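The sketch below illustrates such an experimental scan over k, measuring the average absolute prediction error of the simple winner-score predictor on held-out queries for each candidate k; the candidate list and the predictor choice are assumptions made for the example.

```python
import numpy as np
from sklearn.cluster import KMeans

def scan_k(train_q, train_y, test_q, test_y, k_values):
    """Average absolute prediction error on held-out queries for each k,
    using the winner cluster-head's mean score as the prediction."""
    errors = {}
    for k in k_values:
        km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(train_q)
        c_scores = np.array([train_y[km.labels_ == j].mean() for j in range(k)])
        winners = km.predict(test_q)
        errors[k] = float(np.mean(np.abs(c_scores[winners] - test_y)))
    return errors

# e.g. scan_k(train_q, train_y, test_q, test_y, [7, 22, 40, 85, 107, 130])
```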
At the end of these experiments, we picked our best performing prediction models to investigate whether we could answer the question: can we use the predicted score of a query to determine whether it is worth executing based on user criteria? All of the selected prediction models made use of Prediction Approach 2 and used the following sets of centroids as their prediction basis: k = 85 with UR, k = 107 with UR, k = 107 with R and k = 130 with UR. We decided to pick more than one model, as the average prediction errors of these four prediction models were too close to one another to single one out.
Discussion on query execution
We handle the question "is a query worth executing or not?" as a simple binary classification problem. We classify queries, based on user criteria, into queries that are worth executing (class label: 1) and queries that are not worth executing (class label: 0). To do so, we first had to define sets of user criteria, which can be seen in the column headings of the tables displayed in Figs. 16 and 18. A set of user criteria in the format (c1,c2) declares that:
$$ c_{1} \leq y \leq c_{2}, $$
(23)
while a set of user criteria in the format (c1 <) declares that:
$$ c_{1} < y. $$
(24)
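The resulting labelling rule can be written compactly as below; encoding a one-sided criterion (c1 <) as (c1, None) is a representational choice made for the example.

```python
def worth_executing(predicted_score, criteria):
    """Class label 1 if the predicted score satisfies the user criteria, else 0.
    `criteria` is (c1, c2) for c1 <= y <= c2, or (c1, None) for c1 < y."""
    c1, c2 = criteria
    if c2 is None:
        return int(predicted_score > c1)
    return int(c1 <= predicted_score <= c2)

# e.g. worth_executing(142.0, (100, 300)) -> 1, worth_executing(80.0, (300, None)) -> 0
```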
For each prediction model, represented by a row in Figs. 16 and 18, we used the predicted scores for the queries of the test refinement sets (the same test refinement sets whose average prediction errors are shown in Fig. 13), along with their actual scores, to measure sensitivity and specificity. Each sensitivity/specificity value displayed is the average over those test refinement sets. The sensitivity metric indicates the percentage of queries that we predicted are worth executing and that indeed were, according to their actual score.
The sensitivity results can be seen in Fig. 16, while a visual representation of these tests is available through the bar chart in Fig. 19. The specificity metric indicates the percentage of queries that we predicted are not worth executing and that indeed were not, according to their actual score. The specificity results can be seen in Fig. 18; as before, a visual representation of these results is provided in Fig. 19.
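For reference, the sketch below computes the two metrics from predicted and actual class labels, assuming the standard definitions of sensitivity (true-positive rate) and specificity (true-negative rate).

```python
def sensitivity_specificity(predicted, actual):
    """Sensitivity: share of truly worth-executing queries (actual = 1) also
    predicted as 1. Specificity: share of truly not-worth-executing queries
    (actual = 0) also predicted as 0."""
    tp = sum(p == 1 and a == 1 for p, a in zip(predicted, actual))
    fn = sum(p == 0 and a == 1 for p, a in zip(predicted, actual))
    tn = sum(p == 0 and a == 0 for p, a in zip(predicted, actual))
    fp = sum(p == 1 and a == 0 for p, a in zip(predicted, actual))
    sens = tp / (tp + fn) if tp + fn else 0.0
    spec = tn / (tn + fp) if tn + fp else 0.0
    return sens, spec
```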
As we can recall from our previous tests in Fig. 13, the average prediction errors of the four prediction models were very similar, too similar to safely determine which one was better. As one may notice, there are also plenty of similarities between the sensitivity and specificity measurements of the prediction models for certain user criteria. We have drawn a couple of conclusions from these experiments, which we present below.
Sensitivity
We notice the refinement's effect in the cases of k = 107;UR and k = 107;R for the tests (50,100) and (100,300). The UR model performed almost twice as well as the R model in the (50,100) test, while the R model outperformed the UR model in the (100,300) test by a significant margin. In order to determine the best performing prediction model, we calculated the average sensitivity of each prediction model over the declared user criteria; these are (as they appear in Fig. 16 from top to bottom): 62.52%, 63.38%, 64.33% and 61.90%. Such sensitivity values are, of course, only relevant for the specified user criteria, but we can see that no prediction model particularly stands out. There are tests where all models scored well, such as (0,100), and tests where they all scored poorly, such as (200,300). This could be due to the fact that we rely on the scores of only the two closest centroids, which might not always offer the closest prediction. This could again be overcome if we managed to further reduce our average prediction errors by involving more than two centroids in our prediction process.
Specificity
As one may notice from Figs. 16, 17, 18, and 19, when we changed our question from "is a query worth executing?" to "is a query not worth executing?", the performance of our prediction models got significantly better. This is reasonable, as the sensitivity tests impose more boundaries on what the predicted score should be; the specificity metric is therefore more forgiving when it comes to the error of our predicted scores.
We can notice patterns in our specificity values as well: there are tests where all models performed consistently well, such as (300<), and tests where performance was poorer, such as (100<). We have calculated the average specificity values for the four prediction models; these are (as they appear in Fig. 18 from top to bottom): 89.32%, 90.42%, 91.63% and 89.32%.
Although these average sensitivity and specificity values are not absolute, as it is impossible to declare and test every possible set of user criteria, we can still draw certain conclusions about which prediction model is best. Prediction model k = 107;R has the highest average sensitivity and specificity values, even if it did not significantly outperform the other models, as well as the lowest average prediction errors for both the refinement and the test refinement sets. Of course, this does not mean that it is the definitive prediction model for determining whether a query is worth executing or not, as different data sets and queries with more or fewer dimensions would most likely require a different value of k.