1 Introduction

In cloud computing, studying the behaviour of cloud users is challenging because many cloud workload traces lack information about users (i.e. many traces contain no user labels). A potential approach to revealing this information is the use of clustering methods. In [2], we investigated the ability to extract users’ identities through clustering. Our past results have shown that a high extraction accuracy can be achieved using the K-means method. The accuracy is mainly limited by how the attributes (i.e. columns in a trace) used by the clustering method are selected.

Attribute selection can be conducted using unsupervised methods such as the Laplacian score. However, according to [5, 8, 17], such general feature selection methods require further input parameters (e.g. the number of expected clusters), which are not usually available in cloud workload traces (i.e. the traces do not disclose the number of users whose utilization patterns they represent).

Therefore, to address these challenges, we extend our previous work in [2] by introducing an unsupervised method of feature (attribute) selection specialized for cloud workload traces. In other words, this work presents a new method for attribute selection and ranking that does not require parameters that are not easily available in unlabelled traces. Consequently, it addresses the limitation in our previous work of selecting the attributes that potentially have the best ability to extract cloud users’ identities. In our new method (SeQual), we exploit the Silhouette coefficient metric and a set of sequential measures of clustering quality for each attribute in the workload traces. First, each attribute in these traces is clustered. Then the quality of each clustering result is measured using the Silhouette coefficient. Each attribute is clustered for a sequence of potential cluster counts from 2 to 50 (this range represents the potential number of users we expect to be extractable from a typical trace). Consequently, the process of clustering and quality measurement is repeated for each attribute and each cluster count. As a result, this incremental sequence forms a pattern of quality from which the optimal attributes for clustering can be selected.

In this paper, we evaluate the proposed SeQual method by testing its ability to rank the attributes of 19 workload traces containing user information (i.e. each trace line is labelled with a user ID). The evaluation compares the performance of the SeQual method with commonly used feature selection methods. The ground truth of this evaluation is the actual rank of the targeted attributes based on their ability to extract users’ identities through clustering. The extraction ability is measured by comparing the clustering results with the actual user IDs via several quality metrics.

The SeQual method is further evaluated with respect to unlabelled traces. This is conducted by analysing the statistical characteristics of the data distribution that show correlation with ranking accuracy. The ranges of these characteristics are then used to form a measure that indicates the applicability of SeQual ranking to the targeted traces. Finally, this measure is applied to the unlabelled traces to anticipate the range of expected ranking accuracy when the SeQual method is used.

The rest of this paper is structured as follows. In Sect. 2, we give a brief description of feature selection methods and then present related work on their use in cloud computing. Next, Sect. 3 presents our methodology for developing and evaluating our feature selection method. Sect. 4 then focuses on the implementation of the methodology, including testing and the analysis of results. Finally, Sect. 5 draws the conclusions of this paper and outlines our future work.

2 Background and related work

This section presents a brief background on feature selection methods and a literature review of their uses in cloud computing.

2.1 Background

Feature selection can be described as the technique of reducing, ranking and choosing attribute fields from original datasets based on particular ranking and selection criteria [4, 13]. It aims to reduce the dimensionality of the input data by discarding irrelevant information. Based on the availability of classification labels in the targeted datasets, there are two main types of feature selection methods: supervised and unsupervised [5, 8]. Both types can be sub-categorized into filter, wrapper and hybrid methods.

In filter methods, the optimal features in the traces are selected and ranked based on the general properties of the targeted dataset. Speed and scalability are considered the main characteristics of filter methods [17]. One supervised filter approach is the Pearson correlation filter (PCF). In this method, all possible correlations between the attributes in the targeted data are measured, and the attributes with the highest dependency are then disregarded. This is repeated for a specific number of parameters [8]. Another supervised method is the relief filter (RF) algorithm, which detects those features that are statistically relevant to the target concept [10].
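To make the correlation-based filtering idea concrete, the following is a minimal sketch rather than the PCF implementation evaluated in this paper. It assumes that "dependency" means the total absolute Pearson correlation of an attribute with all other attributes, and that attributes are dropped one at a time until a requested number remain. The function names, the `n_keep` parameter and the toy attribute values are hypothetical.

```python
from itertools import combinations
from math import sqrt

def pearson(x, y):
    """Pearson correlation of two equal-length numeric sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sqrt(sum((a - mx) ** 2 for a in x))
    sy = sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy) if sx and sy else 0.0

def correlation_filter(columns, n_keep):
    """Drop the most dependent attribute until n_keep attributes remain.

    columns: dict mapping attribute name -> list of numeric values.
    Assumption: dependency = sum of |r| against all other attributes.
    """
    kept = dict(columns)
    while len(kept) > n_keep:
        dependency = {name: 0.0 for name in kept}
        for a, b in combinations(kept, 2):
            r = abs(pearson(kept[a], kept[b]))
            dependency[a] += r
            dependency[b] += r
        del kept[max(dependency, key=dependency.get)]
    return list(kept)

# Hypothetical toy trace: cpu_time is nearly a copy of run_time.
trace = {
    "run_time": [10, 20, 30, 40, 50],
    "cpu_time": [11, 19, 31, 39, 52],
    "wait_time": [5, 3, 8, 1, 4],
}
selected = correlation_filter(trace, n_keep=2)
print(selected)
```

In this toy example one of the two nearly identical attributes is removed, while the weakly correlated `wait_time` is kept.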

In the wrapper methods of feature selection, the learning algorithms are combined with a clustering method and used in the evaluation process [13]. The selection process in these methods starts by ranking and detecting dataset attributes using search strategies; the ranked attributes are then evaluated through a learning algorithm. Although these methods can show a better ability to improve the intended learning algorithm, they depend on the algorithms used and tend to be computationally costly [17]. Hybrid feature selection aims to combine the advantages of both filter and wrapper methods to design a more effective algorithm for feature and attribute selection [6]. According to [17], the limitation of hybrid methods is that the filter and wrapper may not be combined efficiently, which causes low performance quality.

In conclusion, the disadvantage of these general feature selection methods is that they require different parameters, such as the number of attributes and clusters, which are not necessarily available in workload traces.

2.2 Related work

In the area of cloud computing, various clustering methods are used to extract information from workload traces. Unfortunately, these traces can consist of several attributes that may not be beneficial for clustering. Therefore, it is important to use feature selection methods to rank these attributes based on their efficiency for clustering.

Accordingly, in [11], researchers used a feature selection method to identify the attributes that are not beneficial for predicting resources, such as CPU and memory usage, from Google workload traces. Similarly, the recursive feature elimination (RFE) method was used in [9] to improve the accuracy of the proposed prediction model for job failure. With this method, they aimed to improve resource utilization and cloud application efficiency.

Furthermore, the authors in [12] compared several feature selection algorithms for different learning and heuristic methods in the prediction of application runtime. They applied feature selection to identify attributes associated with similar past jobs.

Bhagtya et al. [3] also used feature selection methods with the aim of increasing the accuracy of predicting CPU, memory and disk utilization to stop user migration.

According to the above studies, feature selection methods have already been used in different prediction and forecasting applications for cloud computing. The selection process in these applications mainly targeted the resource aspect of the traces and was applied to labelled traces in which the targeted characteristics were known. Hence, the challenge of dealing with the non-availability of these features was not encountered. However, the use of feature selection methods for the purpose of extracting the (rarely available) user identity through clustering has not been investigated thoroughly. This non-availability of users’ identities is mainly due to obfuscation attempts to protect sensitive data, as discussed in [14].

Therefore, in this work, we propose a novel feature selection method to reveal users’ identities from unlabelled workload traces through clustering. This differs from the above studies in that it addresses the non-availability of user IDs in these traces, which has not been encountered previously. Those works aimed to study the resource aspect of traces in which the user IDs were known.

3 Methodology

In this section, our methodology for implementing and evaluating the proposed SeQual method is illustrated through two main parts. In the first part, we describe the SeQual method. Then, we illustrate the evaluation process to test its performance.

3.1 Description of SeQual method

In our method, we use a sequentially increasing number of expected clusters to judge the clustering quality and to rank the attributes for selection. We set the predefined number of clusters K for the clustering process as an incremental sequence of inputs from 2 to 50. We use this range because, in the labelled datasets where user identification is available, we have observed that it is rarely possible to identify more than 50 users with high quality. The cumulative percent distribution of the user IDs’ frequencies in the labelled traces ranges from 81% to 99% for 50 user IDs, as shown in Table 1. Accordingly, we expect to extract the majority of user IDs from this range.
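The cumulative share covered by the most frequent user IDs can be computed along the following lines. This is an illustrative sketch only; the function name `top_k_coverage` and the toy records are hypothetical and do not come from the traces used in this paper.

```python
from collections import Counter

def top_k_coverage(user_ids, k=50):
    """Percentage of records that belong to the k most frequent user IDs."""
    counts = Counter(user_ids)
    covered = sum(n for _, n in counts.most_common(k))
    return 100.0 * covered / len(user_ids)

# Hypothetical toy trace: 3 heavy users and 10 users with one record each.
records = ["u1"] * 50 + ["u2"] * 30 + ["u3"] * 10 + [f"x{i}" for i in range(10)]
print(top_k_coverage(records, k=3))   # 90.0
```

Applied to a labelled trace with `k=50`, a value between 81 and 99 would correspond to the range reported in Table 1.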

Table 1 Sample of the cumulative percent distribution in the labelled traces
Fig. 1 Behaviour of the Silhouette coefficient for all the applicable attributes in ANL-Intrepid 2009 trace

After applying the K-means algorithm for all these K values, we obtain a sequence of clustering results for each corresponding attribute. We then measure the quality trend over the various K values (i.e. the number of clusters that K-means is asked to produce), using the Silhouette coefficient metric to quantify the quality. This process is conducted on the labelled traces in their unlabelled form, i.e. with the user identification in these traces disregarded. Then, we use their labelled form for quality comparison, as illustrated in more detail in Sect. 4.
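The sequence of quality measurements for a single attribute can be sketched as follows. This is a minimal, dependency-free illustration and not the implementation used in our experiments: it assumes a plain one-dimensional Lloyd-style K-means with a deterministic quantile start and a naive O(n²) Silhouette computation, so it is only practical for small samples. All function names and the toy values are hypothetical.

```python
def kmeans_1d(values, k, iters=50):
    """Lloyd's algorithm on one numeric attribute; returns a label per value."""
    ordered = sorted(values)
    # Deterministic start (an assumption): centroids at evenly spaced quantiles.
    centroids = [ordered[(2 * i + 1) * len(ordered) // (2 * k)] for i in range(k)]
    labels = [0] * len(values)
    for _ in range(iters):
        labels = [min(range(k), key=lambda c: abs(v - centroids[c])) for v in values]
        updated = []
        for c in range(k):
            members = [v for v, l in zip(values, labels) if l == c]
            updated.append(sum(members) / len(members) if members else centroids[c])
        if updated == centroids:
            break
        centroids = updated
    return labels

def silhouette(values, labels):
    """Mean Silhouette coefficient; singleton clusters contribute 0."""
    clusters = {}
    for v, l in zip(values, labels):
        clusters.setdefault(l, []).append(v)
    if len(clusters) < 2:
        return 0.0
    total = 0.0
    for v, l in zip(values, labels):
        own = clusters[l]
        if len(own) == 1:
            continue
        a = sum(abs(v - o) for o in own) / (len(own) - 1)
        b = min(sum(abs(v - o) for o in other) / len(other)
                for m, other in clusters.items() if m != l)
        if max(a, b) > 0:
            total += (b - a) / max(a, b)
    return total / len(values)

def silhouette_sequence(values, k_range=range(2, 51)):
    """Silhouette coefficient for each cluster count K in k_range."""
    return {k: silhouette(values, kmeans_1d(values, k))
            for k in k_range if k < len(values)}

# Hypothetical attribute with three well-separated groups of values.
attribute = [1, 2, 3, 50, 51, 52, 100, 101, 102]
sequence = silhouette_sequence(attribute, range(2, 6))
print(max(sequence, key=sequence.get))   # 3
```

For real traces a library implementation of K-means and of the Silhouette coefficient would normally be preferable; the sketch only shows how the per-K quality sequence is assembled.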

Although using clustering methods such as K-means has some challenges (e.g. centroid shifting), the mechanisms by which the SeQual technique could cope with these challenges are not covered in this paper, since the scope of this work mainly concerns the ability of the Silhouette coefficient to shape the quality behaviour of the selected clustering method in the SeQual technique.

Figures 1 and 2 show a sample of such a sequential evaluation of the Silhouette coefficient for the most relevant attributes from two sample traces (the ANL-Intrepid 2009 and PIK-IPLEX).

Fig. 2 Behaviour of the Silhouette coefficient for all the applicable attributes in PIK IPLEX trace

Based on these charts, each of the attributes (Requested number of processes, Requested time, Run time, Submit time and Wait time) shows a different pattern with respect to the Silhouette quality. For instance, in Fig. 1 the requested number of processes attribute shows a higher average Silhouette coefficient than the submit time attribute. The figure also demonstrates a noticeable pattern of peaks and troughs over the entire range of the measurement. Such patterns can also be seen in the quality results of other workload traces.

In Fig. 2, the attributes show slightly different Silhouette patterns. For instance, the Run Time attribute shows a pattern with a sharp trough and peak and more sustained quality compared to the gradually decreasing pattern of Wait Time. Note that the sets of attributes shown differ between Figs. 1 and 2. This is due to the differences in the type and number of attributes applicable for clustering in the two traces (ANL-Intrepid 2009 and PIK-IPLEX, as shown in Table 2).

Based on these observations, we concluded that any attribute showing a high mean Silhouette coefficient with a sharp peak and trough will potentially have a high ability to extract users’ identities. Attributes with such a pattern are therefore ranked higher for identity extraction, while those with a gradually increasing or stable pattern are ranked lower. This was used to devise the proposed SeQual method for feature selection (ranking), which is presented in Algorithm 1.

Algorithm 1

In the first step, the workload trace to be used for clustering is selected (line 1 of Algorithm 1). The attributes in the selected trace should be numerical to be ranked successfully. Then, each of these attributes is read individually (line 3) and clustered separately (line 5). The clustering is conducted using the K-means method, and the results are stored separately. The quality of these results is measured using the Silhouette coefficient metric (line 6), in the same loop as the clustering process. Both clustering and quality measurement are carried out for the complete sequence of cluster counts (i.e. 2 to 50). The top-ranked attribute is selected from the resulting pattern by checking whether the corresponding attribute shows the highest average and the sharpest trough (line 8). A trough is identified where the clustering quality is at its lowest for fewer than three cluster counts. Finally, this is repeated for all the attributes in the trace. Figure 3 shows a flowchart of the SeQual method process.
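The final ranking step can be illustrated with the sketch below, which takes a precomputed Silhouette sequence per attribute. It is one possible reading of the selection rule and not a definitive implementation: the text does not fix how a sharp trough is measured, so the sketch assumes that a trough is the run of consecutive points around the minimum that lie below the midpoint between the minimum and the mean, and that it counts as sharp when this run spans fewer than three cluster counts. The function names, this threshold and the toy sequences are hypothetical.

```python
def trough_width(seq):
    """Width of the dip around the minimum of a Silhouette sequence.

    Assumption: a point belongs to the trough if it lies below the midpoint
    between the minimum and the mean. Flat sequences have no trough (width 0).
    """
    mean = sum(seq) / len(seq)
    low = min(seq)
    if mean <= low:
        return 0
    threshold = (mean + low) / 2
    i = seq.index(low)
    left = right = i
    while left > 0 and seq[left - 1] < threshold:
        left -= 1
    while right < len(seq) - 1 and seq[right + 1] < threshold:
        right += 1
    return right - left + 1

def rank_attributes(quality):
    """Rank attribute names: sharp trough first, then higher mean Silhouette."""
    def key(name):
        seq = quality[name]
        sharp = 0 < trough_width(seq) < 3
        return (sharp, sum(seq) / len(seq))
    return sorted(quality, key=key, reverse=True)

# Hypothetical Silhouette sequences for consecutive cluster counts.
quality = {
    "run_time": [0.9, 0.88, 0.4, 0.87, 0.9, 0.89, 0.9, 0.88, 0.9, 0.91],
    "wait_time": [0.8, 0.75, 0.7, 0.65, 0.6, 0.55, 0.5, 0.45, 0.4, 0.35],
    "submit_time": [0.5] * 10,
}
ranking = rank_attributes(quality)
print(ranking)
```

In this toy example the attribute with a high mean and a single narrow dip is ranked first, the gradually declining one second and the flat one last.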

Fig. 3 Flowchart for the SeQual method process

3.2 Comparative evaluation

Now we present a comparative evaluation to measure the performance of the new SeQual method. This is conducted using 10K records randomly selected from traces in workload repositories such as the Grid Workload Archive (GWA) and the Logs of Real Parallel Workloads from Production Systems (PWA [7]). These traces are: ANL-Intrepid, PIK-IPLEX, CIEMAT Euler, KIT ForHLR 2 (KIT-FH2), Cornell Theory Center IBM SP2 (CTC-SP2), the DAS2 5-Cluster Grid Logs (DAS2), LCG, GWA-T-3 NorduGrid, GWA-T-10 SHARCNet, University of Luxembourg Gaia Cluster (UNILU-GAIA), CEA-Curie, LANL Origin 2000 Cluster Nirvana (LANL-O2K), San Diego Supercomputer Center Paragon (SDS Par-95), MetaCentrum, Los Alamos National Lab CM-5 (LANL CM-5), RICC, San Diego Supercomputer Center SP2 (SDSC-SP2), OSC Linux Cluster (OSC-Clust-2000), and GWA-T-4 AuverGrid. The traces are randomized to avoid a biased selection of the data sample. The records from these trace archives are usually labelled, and thus often provide information about users’ identification either directly or in anonymized form. Table 2 describes each trace’s attributes for clustering.

Table 2 Labelled traces description

As shown in Table 2, we classified the attributes in the traces into three groups according to their usefulness for clustering. The first group contains attributes with fewer than three unique values. Using such attributes as input to clustering could mislead the clustering algorithm, even if they are combined with others. The second group contains attributes with fewer unique values than the number of unique user IDs in the traces. These attributes can be combined with other attributes to extract users’ identities. The third group contains attributes that have a wide variety of values and are thus more applicable for extracting user IDs via clustering.
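The three-way grouping can be expressed as a simple check on the number of unique values. The sketch below is illustrative; the function name `attribute_group` and the numeric group codes are hypothetical, and the third group is taken to be every attribute with at least as many unique values as there are user IDs, which is an assumption about where "a wide variety of values" begins.

```python
def attribute_group(values, n_users):
    """Return 1, 2 or 3 according to the number of unique attribute values."""
    unique = len(set(values))
    if unique < 3:
        return 1      # too few values: may mislead the clustering
    if unique < n_users:
        return 2      # usable only in combination with other attributes
    return 3          # wide variety of values: applicable on its own

print(attribute_group([0, 1, 0, 1], n_users=10))         # 1
print(attribute_group([1, 2, 3, 4, 5], n_users=10))      # 2
print(attribute_group(list(range(100)), n_users=10))     # 3
```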

As discussed in Sect. 2, the purpose of developing the SeQual method is to rank and select the attributes (presented in Table 2) for clustering cloud workload traces. The aim of this clustering process is to identify potential user-related activities in the available cloud datasets. Therefore, we consider the ability of each attribute to reveal users’ identities as the criterion for the evaluation process. Accordingly, these attributes are ranked based on the quality measures of their clustering results, using the quality metrics of entropy, adjusted rand index and precision.

We use this variety of quality metrics to prevent our SeQual method from being biased towards a particular quality metric’s behaviour. Entropy measures the homogeneity of the clustering results in comparison with the user ID [15]. The adjusted rand index evaluates the agreement between the clustering results and the targeted reference while correcting for the effect of randomness [16]. Precision, on the other hand, measures the highest possible matching between the clustering results and the user ID reference. Consequently, these metrics produce three actual ranks, each of which represents a different measure of quality.
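The three metrics can be sketched from the contingency counts between cluster labels and user IDs. The entropy and adjusted rand index below follow their standard textbook definitions. For precision, the exact matching scheme is not spelled out here, so the sketch assumes a purity-style measure in which each cluster is matched to its most frequent user ID; that choice, the function names and the toy labels are assumptions.

```python
from collections import Counter
from math import comb, log2

def entropy(clusters, users):
    """Size-weighted entropy of user IDs within each cluster (0 is best)."""
    n = len(users)
    sizes = Counter(clusters)
    h = 0.0
    for (c, _), n_cu in Counter(zip(clusters, users)).items():
        p = n_cu / sizes[c]
        h -= (sizes[c] / n) * p * log2(p)
    return h

def purity(clusters, users):
    """Assumed stand-in for precision: share of records whose user ID is
    the most frequent one in their cluster (1 is best)."""
    best = {}
    for (c, _), n_cu in Counter(zip(clusters, users)).items():
        best[c] = max(best.get(c, 0), n_cu)
    return sum(best.values()) / len(users)

def adjusted_rand_index(clusters, users):
    """Chance-corrected pair agreement between clusters and user IDs."""
    n = len(users)
    together = sum(comb(v, 2) for v in Counter(zip(clusters, users)).values())
    a = sum(comb(v, 2) for v in Counter(clusters).values())
    b = sum(comb(v, 2) for v in Counter(users).values())
    expected = a * b / comb(n, 2)
    maximum = (a + b) / 2
    if maximum == expected:
        return 1.0
    return (together - expected) / (maximum - expected)

users = ["a", "a", "a", "b", "b", "b"]
perfect = [0, 0, 0, 1, 1, 1]
imperfect = [0, 0, 1, 1, 1, 1]
for labels in (perfect, imperfect):
    print(entropy(labels, users), purity(labels, users),
          adjusted_rand_index(labels, users))
```

A clustering that reproduces the user IDs exactly gives an entropy of 0 and a purity and adjusted rand index of 1; the imperfect clustering scores worse on all three.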

After measuring the actual ranks for the targeted attributes, we conduct the comparative evaluation as follows. First, we rank the workload attributes using our SeQual method and the common feature selection methods (LS, RF and PCF) discussed earlier in Sect. 2. For the supervised methods (PCF and RF), we use the attribute that contains the user ID as a reference for the feature selection method. For the unsupervised methods, SeQual and LS, we do not disclose the attributes that can be used for direct user identification (e.g. the user’s ID, its group or the executable used).

Then the performance of the SeQual, LS, PCF and RF methods is measured by evaluating their ranking results against the actual ranks. To identify the actual ranks, we ranked each attribute using the exact user count that can feasibly be extracted from the dataset, according to its capability to help the clustering achieve good quality. In the evaluation process, the performance of each selection method is weighted against each of the actual ranks (precision, entropy and adjusted rand index) individually. This gives three comparative evaluation results, which show, as a percentage, how close the ranking of each selection method is to the actual ranks.

We further extend the evaluation process by analysing the distribution characteristics of the labelled traces that were used for ranking. To conduct this evaluation, we target two statistical measures related to clustering: the coefficient of variation (CV) and the skewness. In this analysis, we calculated these two measures for all attributes in the 19 traces. Then, we present the pattern for the applicable range of both measures and indicate the potential correlation between the ranking quality scale and these statistical measures. Unlabelled traces that fall within these ranges of CV and skewness are therefore expected to show a ranking quality similar to that of the labelled traces. The unlabelled traces that we target in this work are BitBrains, Materna, Google Mustang, Facebook and Alibaba [1]. The description of these traces is shown in Table 3.
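Both statistical measures can be computed per attribute as in the sketch below. It assumes the population forms of the standard deviation and of the moment-based skewness; whether a sample-corrected variant was used in our analysis is not stated here, so these choices, like the function names and the toy values, are assumptions.

```python
from math import sqrt

def coefficient_of_variation(values):
    """Population standard deviation divided by the mean."""
    n = len(values)
    mean = sum(values) / n
    std = sqrt(sum((v - mean) ** 2 for v in values) / n)
    return std / mean if mean else float("nan")

def skewness(values):
    """Moment-based skewness: third central moment over std cubed."""
    n = len(values)
    mean = sum(values) / n
    m2 = sum((v - mean) ** 2 for v in values) / n
    m3 = sum((v - mean) ** 3 for v in values) / n
    return m3 / m2 ** 1.5 if m2 else 0.0

symmetric = [1, 2, 3, 4, 5]
right_skewed = [1, 1, 1, 1, 10]
print(coefficient_of_variation(symmetric), skewness(symmetric))
print(coefficient_of_variation(right_skewed), skewness(right_skewed))
```

A symmetric attribute has a skewness of zero, while an attribute dominated by small values with a few large ones has a positive skewness.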

Table 3 Unlabelled traces description

Similar to Table 2, the attributes of the unlabelled traces are categorized into groups, in this case two. The first group contains attributes with fewer than three unique values. These are not useful for clustering and consequently for ranking. The second group contains attributes that are applicable for clustering and ranking because they have a sufficient number of unique values.

4 Testing and results

The comparative evaluation is conducted using 10K records of 19 labelled traces from workload resources. Each of the methods (SeQual, LS, RF and PCF) is used individually to rank the attributes of these 19 traces. Then, the ranking results of each method are evaluated against three ground truths. These ground truths are the actual rankings obtained by measuring the quality of clustering through the metrics of precision, entropy and adjusted rand index.

The comparison is made through a weighting process, in which the ranking of each attribute by SeQual, LS, PCF and RF is given a weight in comparison with the actual ranks. Figure 4 presents the performance of SeQual compared to the LS, RF and PCF methods based on each quality metric. Each of these quality metrics shows a different scale of performance.

Fig. 4 Comparative evaluation results

The SeQual method shows the highest percentage of accuracy based on the ground truths of all three quality metrics (precision, entropy and adjusted rand index). According to the adjusted rand index, the SeQual method ranks the attributes with 90% accuracy, compared with below 74% for both LS and RF, and 79% for PCF. Based on the precision metric, the accuracy of SeQual was 99%, while that of the other methods was about 92%. The entropy ground truth, on the other hand, shows less diversity in the results: 99% for SeQual and around 98% for LS, RF and PCF.

To show the performance of each method in more detail, we present the ranges over all 19 traces and all performance metrics in the boxplots of Fig. 5.

Fig. 5 Boxplots of distribution for methods performance

These boxplots present the distribution of each feature selection method’s performance. From these charts, we can infer how stable the performance of each method is.

Figure 5a focuses on the evaluation with entropy. It can be seen that our method has a narrower and more precise set of results compared to the other methods. This implies that the SeQual method ranks with more reliable quality, as there are no significant changes in the range of quality. The advantage in ranking quality of the SeQual method is even more evident in Fig. 5b and c, which show the precision-based and adjusted rand index-based evaluations. One limitation is that the run time needed by the SeQual technique is relatively longer than that of other unsupervised methods.

Finally, we evaluate the suitability of unlabelled traces for ranking with the SeQual method by evaluating the ranges of CV and skewness for both labelled and unlabelled traces. We expect that traces with ranges of CV and skewness similar to those of the labelled traces will be ranked by the SeQual method with similar accuracy. Figure 6 shows boxplots of both the CV and skewness distribution ranges for all 160 relevant attributes in the 19 labelled traces.

Fig. 6 Range of the statistical distributions for labelled traces

By comparing the range of the clustering qualities for the labelled traces with the statistical ranges of their CV and skewness, as shown in the above boxplots, we observed the following patterns:

  • Traces with no out-of-range characteristic (for either CV or skewness) have an 80% chance of being ranked with high accuracy by the SeQual method when precision is used as the quality measure.

  • Traces with no out-of-range characteristics (for both CV and skewness) have about an 85% chance of being ranked with high accuracy when the adjusted rand index or entropy is used as the quality measure.

Based on these observations, we calculate the out-of-range level of the targeted traces using the following equation:

$$level_{\text{out-of-range}} = 50\,\frac{\#\text{out-of-range}}{\#\text{attributes}} \tag{1}$$
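Equation 1 can be evaluated as in the following sketch. It assumes that the out-of-range count adds one for each attribute whose CV lies outside the reference range and one for each attribute whose skewness does, which is consistent with the factor of 50 yielding 100% when every attribute is out of range on both measures. The reference ranges and the per-attribute values in the example are hypothetical placeholders and not the ranges read from our boxplots.

```python
def out_of_range_level(stats, cv_range, skew_range):
    """Out-of-range level (in percent) of a trace according to Eq. 1.

    stats: list of (cv, skewness) pairs, one per attribute.
    cv_range, skew_range: (low, high) bounds taken from the labelled traces.
    """
    def outside(x, bounds):
        return not (bounds[0] <= x <= bounds[1])

    count = sum(outside(cv, cv_range) + outside(sk, skew_range)
                for cv, sk in stats)
    return 50.0 * count / len(stats)

# Hypothetical values for a trace with four attributes.
stats = [(0.5, 1.0), (3.0, 1.2), (4.0, 9.0), (0.7, 0.2)]
level = out_of_range_level(stats, cv_range=(0.0, 2.0), skew_range=(-1.0, 3.0))
print(level)   # 37.5
```

Here three of the eight possible checks fail (one CV for the second attribute, and both CV and skewness for the third), giving 50 × 3 / 4 = 37.5%.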

Consequently, we calculated the out-of-range level for all the 19 labelled traces as shown in Fig. 7.

Fig. 7 Level of out-of-range for workload traces

The cleanest traces exhibit an out-of-range level of 0%. Noisy traces whose attributes all fall outside the above CV and skewness ranges show an out-of-range level of 100%. Based on the above, we formulate a range of ranking accuracy for each quality metric, as shown in Table 4.

Table 4 Range of ranking accuracy for each quality metric

In the next step of this evaluation, we applied the above observations to the unlabelled traces presented in Table 3. Table 5 shows the out-of-range level calculated for each unlabelled trace and the expected quality scale of the ranks produced by the SeQual method.

Table 5 Expected quality range for unlabelled traces

According to Table 5, the traces Materna, Ali Baba 2007, GWAT AuverGrid and Google Vm reading have an out-of-range level of 0%, which means that these traces are expected to be ranked with high accuracy by the SeQual method.

The statistical characteristics of BitBrains, in contrast, show an out-of-range level of 18%. This indicates that this trace is expected to be ranked with an accuracy between 85% and 95% according to the adjusted rand index, and around 98% to 100% according to both the entropy and precision quality metrics. In addition, the Facebook Hadoop trace shows a 60% level of noise; thus, according to our analysis, it is not recommended to use such a trace for ranking based on the SeQual method.

The results show that the average accuracy of the proposed SeQual method is 99% when using the quality metrics of entropy and precision, and 90% based on the adjusted rand index. This accuracy reflects the ability of the SeQual method to rank the targeted attributes for extracting users’ identities. This performance was achieved without the need to predefine essential parameters, such as the number of clusters, for our method. These parameters are critical to the quality of the clustering. For instance, changing the number of clusters also changes the ability to extract users’ identities from the targeted attributes.

Furthermore, the results of the comparative tests show that the SeQual method can compete with both unsupervised and supervised methods of feature selection. This is shown in Fig. 4, in which the SeQual method shows less distortion in ranking quality compared to other methods in all three quality metrics. These results also show the stability of SeQual performance in ranking different traces of cloud workload.

As mentioned in the previous section, the distribution range for CV and skewness measures of the labelled traces can be used to determine whether an unlabelled trace can be investigated with SeQual. The results in Table 5 show that the majority of the targeted unlabelled workload traces can be ranked with high expected accuracy using our SeQual method.

5 Conclusion and future work

The outcome of this study is a new unsupervised feature selection method for ranking the attributes of cloud workload traces for the purpose of extracting users’ identities. The new SeQual method exploits the ability of the Silhouette coefficient metric to measure the quality of each clustered attribute. The SeQual method is evaluated by testing its ability to rank the attributes of 19 labelled traces. These traces were tested in unlabelled form and the results were compared to their labelled information. This is followed by a comparison between the new SeQual method and commonly used supervised (RF and PCF) and unsupervised (LS) feature selection methods. As a result, the SeQual method shows more accurate and more stable performance than these methods based on the quality metrics of precision, entropy and adjusted rand index.

Furthermore, this study tests the ability of the new method to rank unlabelled traces. This is conducted by measuring the range of the statistical characteristics of the traces that are recommended for ranking by the SeQual method. The results of this paper will be of interest to the cloud computing research community, especially for data mining applications on cloud workload traces. The new method can assist in ranking unlabelled workload traces for clustering human behaviours in these traces. This can also be beneficial in attribute selection for cloud resource consumption by targeting human behaviours.

For future work, we intend to use the SeQual method to achieve more precise ranking when selecting attributes for predicting cloud users’ behaviour. We also plan to develop a new method for detecting the number of clusters for clustering workload traces and to use this method to reduce the sequential loop in the implementation of the SeQual method, with the aim of achieving a faster ranking. It is also recommended to test this method for general feature selection on different types of datasets.