Clustering: an R library to facilitate the analysis and comparison of cluster algorithms

Clustering is an unsupervised learning method that divides data into groups with similar features. Researchers use this technique to categorise and automatically classify unlabelled data and to reveal data concentrations. Although there are other implementations of clustering algorithms in R, this paper introduces the Clustering library for R, aimed at facilitating the analysis and comparison of clustering algorithms. Specifically, the library uses relevant clustering algorithms from the literature with two objectives: firstly, to group data homogeneously by establishing differences between clusters, and secondly, to generate a ranking of the algorithms and the attributes of a data set in order to obtain the optimal number of clusters. Finally, it is crucial to highlight the added value that the library provides through its interactive graphical user interface, where experiments can be easily configured and executed without requiring expert knowledge of the parameters of each algorithm.

Clusters must be as different as possible from one another, and the elements they contain must be as similar as possible. These conditions are satisfied by maximising, or minimising, quality measures related to the distribution of the data across clusters. Several measures to validate the quality of clusters can be found in the literature [6]. The first kind is based on external metrics, which evaluate the results of an algorithm against a pre-specified structure. This structure is imposed on the data set and reflects our intuition about its clustering. The second kind is based on internal metrics, where the results of a clustering algorithm are evaluated in terms of the characteristics of the instances that belong to each cluster, e.g. the proximity matrix.
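As a toy illustration of the two families of measures (independent of the library, with made-up data), the following base-R snippet computes a simple external measure, purity, against known classes, and a simple internal measure, the mean within-cluster distance, using only the proximity matrix:

```r
# Toy data: 6 points in 2D with known classes and a candidate clustering.
x <- matrix(c(0, 0,  0, 1,  1, 0,
              5, 5,  5, 6,  6, 5),
            ncol = 2, byrow = TRUE)
classes  <- c(1, 1, 1, 2, 2, 2)   # pre-specified structure (external view)
clusters <- c(1, 1, 1, 2, 2, 2)   # labels produced by some algorithm

# External measure: purity -- fraction of points whose cluster's
# majority class matches, computed against the imposed structure.
purity <- sum(sapply(split(classes, clusters),
                     function(cl) max(table(cl)))) / length(classes)

# Internal measure: mean pairwise distance inside each cluster,
# computed from the proximity matrix alone (no labels needed).
within <- mean(sapply(split(seq_len(nrow(x)), clusters), function(idx) {
  mean(dist(x[idx, , drop = FALSE]))
}))

purity   # 1 for this perfect clustering
within   # small value: clusters are compact
```

Here purity is 1 because each cluster contains a single class, and the within-cluster distance is small because the two groups are well separated.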
In the specialised literature, there are many proposals for clustering algorithms; therefore, a review of the algorithms available in R libraries was carried out. The Clustering library for R incorporates the most relevant algorithms from the Hierarchical and Partitioning sections of the Clustering Task View (https://cran.r-project.org/web/views/Cluster.html), based on the number of citations and downloads of the different algorithms. The most cited libraries implementing these algorithms in the Partitional Clustering section are apcluster [8] and cluster [9], while in Hierarchical Clustering the most cited libraries are cluster [9], ClusterR [19], and pvclust [10]. In addition, [7] is also included in our Clustering library because it includes the widely used kmeans algorithm. Table 1 shows a comparison of the features offered by the libraries included in the package and by our Clustering package. Unfortunately, not much software implements quality criteria in clustering to measure and analyse the quality of different algorithms. In particular, current libraries present several problems:
• It is not possible to work with different input formats.
• The algorithms mainly focus on the distribution of the data across clusters, but they do not report the quality of those clusters.
• It is not possible to work with a set of data sets, so comparing different algorithms is not easy.
• Few libraries include a graphical user interface (GUI).
To address these problems, this paper presents the Clustering library for R. It is a library that allows multiple clustering algorithms to be compared simultaneously while assessing the quality of the extracted clusters. The purpose of this library is to evaluate a set of data sets to determine which attributes are the most suitable for obtaining clusters of interest. Therefore, it is possible to assess the clusters created, how they have been distributed, whether the distributions are uniform, and how the data have been categorised. In addition, the library offers the added value of an easy-to-use and highly helpful GUI, which allows experiments to be quickly set up and run with no need for the user to know the parameters of each algorithm, facilitating the analysis and comparison of the results provided by different algorithms.
The advantages provided by the Clustering library compared to other packages are:
• The library can work with a single data set or with a directory containing several data sets.
• Putting all this together, the main advantage and novelty appears: users can run an experimental study with multiple algorithms, quality measures, numbers of clusters, and similarity measures, where the comparison between the algorithms is based on internal and external measures. In addition, the external quality measures used to determine the optimal number of clusters are computed automatically for each of the attributes of the data set taken as a target.
• Finally, another strong point is that the GUI facilitates the use of the library. To the best of our knowledge, only two other libraries implement a GUI (ProjectionBasedClustering and VarSelLCM). However, VarSelLCM does not work, while ProjectionBasedClustering does not allow the comparison of algorithms with quality measures.
The structure of this contribution is as follows: Sect. 2 presents the library together with its architecture and functionalities; Sect. 3 describes an example of the use of the library; and finally, Sect. 4 outlines the conclusions reached.
However, the library can be easily extended with new algorithms and user-specified metrics by making use of the clustering object. R supports an object-oriented programming style based on generic methods, called S3; the methods implemented are print(), summary(), and plot(). In addition, another essential functionality with respect to existing solutions in CRAN is the possibility of sorting, filtering, and exporting the results for further analysis. Regarding the source of the data, it is important to remark that the library accepts different input formats, such as CSV, KEEL, ARFF (Weka), and data.frame objects, and that it can work with directories containing different data sets instead of a single data set. This allows multiple data sets to be processed with a single configuration, saving a lot of time and effort. Finally, the results can be easily exported to LaTeX to facilitate their incorporation into reports and documents. Note that the RWeka and RKeel libraries are not included in the package so as not to increase the number of dependencies, since the code needed to read files in these formats is easy to develop.
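To illustrate the S3 mechanism itself (a generic sketch, not the library's actual class or field names), a method is simply a function named generic.class, and the generic dispatches on the object's class attribute:

```r
# A minimal S3 object mimicking a result container (names are illustrative).
res <- structure(list(algorithm = "kmeans", clusters = 3),
                 class = "clustering_result")

# S3 methods are plain functions named <generic>.<class>.
print.clustering_result <- function(x, ...) {
  cat("Algorithm:", x$algorithm, "- clusters:", x$clusters, "\n")
  invisible(x)
}
summary.clustering_result <- function(object, ...) {
  list(algorithm = object$algorithm, k = object$clusters)
}

print(res)        # dispatches to print.clustering_result
summary(res)$k    # 3
```

The same pattern extends to plot() or to user-defined generics, which is what makes the clustering object easy to build on.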

Software architecture
The Clustering library imports a series of libraries that are used internally for processing. These libraries are represented in Fig. 1, where the grey squares represent the imports of the Clustering library, the red database icon represents the required dependency on base R, and the orange box represents the Clustering library itself.
As mentioned previously, the library includes 16 clustering algorithms from different libraries. Specifically, the algorithms included in the Clustering library are given in Table 2.
All these algorithms are wrapped in the Clustering library and can be executed through the clustering() method, which is the core of this library. This method is in charge of several aspects:
1. Properly handling the parameters of each method.
2. Running the chosen algorithms in parallel and collecting their results for each data set.
3. Assessing the quality of the extracted clusters and ranking them.
4. Displaying all this information and allowing it to be easily managed in a user-friendly way.
When the clustering() method is executed, it returns an object called clustering. This object contains information about the algorithms executed, the metrics used, flags indicating whether internal and external measurements are present, and the results of the execution. The library provides several helper functions, which take the clustering object as input, to evaluate and rank the extracted results, plot them, and export them for further analysis. Finally, all this functionality is accessible through the GUI, which is launched with the appClustering() method.

Software functionalities
The Clustering library provides several functionalities to handle all the previously described components: • clustering(): It is the core function of the library.
The parameters of the method are as follows:
- path: the file path. It is only allowed to use path or df, but not both at the same time. The directory must contain a list of files with the data sets to be loaded. Allowed formats are CSV, KEEL, ARFF (Weka), and data.frame.
- df: a data matrix, data frame, or similarity matrix. It is only allowed to use path or df, but not both at the same time. Through this parameter, it is possible to load a data set. R has several utility packages for reading data sets, the best known being utils [18].
As a result, clustering() generates the clustering object. The library allows sorting and filtering operations for further processing of the results; the '[' operator makes use of the filter method of the dplyr library [22].
• External metrics. These methods are responsible for assessing the quality of the extracted clusters using the attributes of the data set as targets. They receive the clustering object as an input parameter and return the best algorithms, distance measures, and numbers of clusters according to the quality measures. For the methods evaluate_best_validation_external_by_metrics() and result_external_algorithm_by_metric(), the external quality measure must be indicated in addition to the clustering object.
- best_ranked_external_metrics(): executing this method yields, as a ranking, the attributes with the best behaviour by algorithm, distance measure, and number of clusters.
• Internal metrics. These incorporate the same set of methods and input parameters mentioned above for the external metrics, e.g. best_ranked_internal_metrics().
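Putting the above together, a session with the library might look like the sketch below. The argument names algorithm, min, max, and metrics, as well as the name of the bundled data set object, are assumptions inferred from the description in this paper; the exact signature should be checked in the package documentation (?clustering):

```r
library(Clustering)

# Run two algorithms over one data set for a range of clusters
# (argument names are illustrative; see ?clustering).
result <- clustering(df = stock,   # 'stock' data set shipped with the library (name assumed)
                     algorithm = c("clara", "kmeans_rcpp"),
                     min = 3, max = 5,
                     metrics = c("precision", "recall", "silhouette"))

print(result)                          # S3 print method of the clustering object
best_ranked_external_metrics(result)   # best attributes per algorithm and distance
plot(result)                           # distribution of the data by cluster
```

The same result object can then be sorted, filtered with the '[' operator, or exported for further analysis, as described above.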

Example of use of Clustering library
This section presents an illustrative example to show the performance of the Clustering library. It examines the real potential of the library by evaluating algorithms with internal and external quality measures, and by working with a range of numbers of clusters so that the best algorithm can be selected for the configured data. The operation of the library consists of performing parallel runs for each of the attributes of the data set. To execute the clustering() method, the quality measures, the number of clusters, the algorithms, and the data set (or set of data sets) must be indicated through the method's parameters. For the simulation, Precision [27][28][29] and Recall [27][28][29] are used as external quality measures and Silhouette [30] as the internal quality measure. Precision is the ratio tp/(tp + fp), where tp is the number of true positives and fp the number of false positives; intuitively, it is the ability of the classifier not to label as positive a sample that is negative. The best value is 1 and the worst value is 0. Recall is the ratio tp/(tp + fn), where fn is the number of false negatives; again, the best value is 1 and the worst value is 0. The Silhouette coefficient takes values in [−1, 1]. A score of 1 is the best and means that the data point is very compact within the cluster to which it belongs and far away from the other clusters; the worst value is −1, and values near 0 denote overlapping clusters. The data set used is called Stock and is included in the library. It contains the daily stock prices of ten aerospace companies from January 1988 to October 1991. The algorithms used are clara [9] and kmeans_rcpp [19]. Finally, it is required to indicate the number of clusters; it is also possible to work with a range of clusters, which for this study is set between 3 and 5.
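These formulas can be checked by hand in base R; the snippet below (toy numbers, independent of the library) computes precision, recall, and the silhouette coefficient for a small one-dimensional example:

```r
# Precision and recall from confusion counts.
tp <- 8; fp <- 2; fn <- 4
precision <- tp / (tp + fp)   # 0.8
recall    <- tp / (tp + fn)   # 0.666...

# Silhouette of point i: (b - a) / max(a, b), where a is the mean
# distance to the other points of its own cluster and b the mean
# distance to the nearest other cluster.
x  <- c(0.0, 0.2, 0.4, 5.0, 5.2)   # 1-D points
cl <- c(1, 1, 1, 2, 2)             # cluster labels
sil_point <- function(i) {
  d <- abs(x - x[i])
  same <- cl == cl[i]
  a <- sum(d[same]) / (sum(same) - 1)            # d[i] = 0 is excluded by the sum
  b <- min(tapply(d[!same], cl[!same], mean))    # nearest other cluster
  (b - a) / max(a, b)
}
mean(sapply(seq_along(x), sil_point))  # close to 1: compact, well-separated clusters
```

With these toy points the mean silhouette is above 0.9, consistent with two compact, well-separated clusters.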
Tables 3 and 4 show the results obtained after the execution of the clustering() method. In these tables, Algorithm indicates the name of the algorithm, Distance represents the distance measure employed (for methods with a single metric), Clusters is the number of clusters used in that execution, and Data is the data set analysed. The Clustering library tries to find the attribute that provides the best partitioning of the data with respect to the external metrics used; indicating an external measure in the clustering() method is mandatory. To achieve this, it selects each attribute of the data set in turn as a target and calculates the associated external metrics. The column Var in Tables 3 and 4 reflects which attribute of the data set has been used as the target, and the remaining columns, i.e. Time, Precision, and Recall, show the values of the external metrics obtained when using that attribute as the target.
Once the complete analysis is performed, the Clustering library is ready to summarise the data. The objective of this summary is twofold: on the one hand, to determine the optimum number of clusters for each algorithm according to the results extracted; on the other hand, to determine the attribute with the best influence on the results. The methods best_ranked_external_metrics() and best_ranked_internal_metrics() are employed to achieve this. These methods take as parameter a clustering object, obtained as the output of the clustering() method. The results are given in Tables 5 and 6.
It is important to highlight that new columns ending with Att appear in Tables 5 and 6, showing the attribute of the data set with the greatest influence on the metrics analysed.
Finally, there are situations where it is necessary to know which distance measure best suits the external and internal metrics. The main purpose is to reduce and facilitate the analysis and study of several algorithms over multiple data sets. To this end, the Clustering library incorporates multiple methods, as given in Tables 7 and 8 for external metrics and Tables 9 and 10 for internal ones.
The Clustering library incorporates other methods such as plot(), which shows a graphical representation of the distribution of the data by cluster and algorithm, as shown in Fig. 2.
So far, this illustrative example has been performed at console level, but thanks to the appClustering() method, users can interact graphically with the library through its GUI. Specifically, the interface opens in a browser to facilitate execution and analysis by any type of user (both novice and expert). The layout consists of a header, a side menu, and the main panel, as shown in Fig. 3. In the header, the user can choose to display results numerically or as plots, as presented in Fig. 3. In the left menu, the user can see the different parameters with which the algorithms can be run. Finally, the main panel presents the result of the execution of the clustering() method.
The operation of the application is simple, and Fig. 4 shows a step-by-step explanation. In more detail, Fig. 4 presents two ovals marked in red: the first represents the header menu, while the second represents the library configuration parameters. The rectangles are the individual configuration parameters, explained as follows:
• Marked in red, the user can choose whether to work with test data sets or indicate a directory of data set files to be processed.
• In blue, the libraries that implement the clustering algorithms mentioned throughout the paper can be selected. It is possible to mark all the libraries or a subset of them. When a library is marked, all the algorithms implemented within it are marked.
• In yellow, the algorithms implemented by the libraries are shown. Multiple algorithms can be selected.
(c) Silhouette by number of clusters for each algorithm.

Conclusions
This paper presents a novel library for R to facilitate the execution and analysis of clustering algorithms available in CRAN. Specifically, the Clustering library emphasises the metrics used to measure the quality of clusters. In addition, the Clustering library offers the following advantages: it allows one or multiple data sets to be analysed simultaneously using different algorithms, multiple distance measures to be used in the executions, a range of numbers of clusters to be explored, and quality metrics to be incorporated to analyse the most relevant attributes of the data set, as well as providing a user-friendly graphical interface that facilitates the use of the library with no need for in-depth knowledge of R. As future work, the quality of the clusters is being improved using classification techniques such as hyperrectangle learning with genetic algorithms (CHC), to reduce the number of clusters and improve the quality measures.