Differential evolution-based transfer rough clustering algorithm

Because they handle uncertainty in data well, rough clustering methods have been successfully applied in many fields. However, when the amount of available data is limited or the data are disturbed by noise, rough clustering algorithms often cannot effectively explore the structure of the data. Furthermore, rough clustering algorithms are usually sensitive to the initialized cluster centers and easily fall into local optima. To resolve these problems, a novel differential evolution-based transfer rough clustering (DE-TRC) algorithm is proposed in this paper. First, a transfer learning mechanism is introduced into rough clustering and a transfer rough clustering framework is designed, which utilizes knowledge from a related domain to assist the clustering task. Then, the objective function of the transfer rough clustering algorithm is optimized with the differential evolution algorithm to enhance the robustness of the algorithm. This reduces the sensitivity to initialized cluster centers and helps the search approach the global optimum. The proposed algorithm is validated on different synthetic and real-world datasets. Experimental results demonstrate the effectiveness of the proposed algorithm in comparison with both traditional rough clustering algorithms and other state-of-the-art clustering algorithms.


Introduction
As an unsupervised machine learning method, clustering has been widely used in pattern recognition, data mining, image processing, and other fields. It automatically groups objects according to the inherent characteristics of the data, so that similar objects are assigned to the same category. Scholars have developed many clustering methods based on different concepts: (1) prototype-based clustering [1]; (2) density-based clustering [2,3]; (3) graph-based clustering [4,5]; (4) other model-based clustering [6]. Among these types of methods, prototype-based clustering is perhaps the most popular one and has been studied extensively. Hard c-means (HCM) [7] is one of the most widely used prototype-based clustering algorithms. HCM divides data into disjoint clusters, where each data element belongs to exactly one cluster. In real-world scenarios, however, the data distribution has some uncertainties, such as overlapping and crossing clusters. Therefore, hard clustering algorithms have great limitations in practical applications.
To deal with the uncertainty in data, theories of uncertainty have been integrated into clustering algorithms. Fuzzy c-means (FCM) [8] introduces fuzzy set theory into HCM and adopts the concept of membership, with values in the continuous range between 0 and 1, to represent the belongingness of an object to multiple clusters. In contrast to HCM, FCM is a prototype-based soft clustering algorithm. Possibilistic c-means (PCM) [9] relaxes the FCM restriction that the memberships of an object to all clusters sum to one. PCM behaves well on data with outliers. Rough set theory [10,11], proposed by Pawlak, can deal with the uncertainty and inherent incompleteness in data by considering rough approximations in roughly granulated spaces. Building on this characteristic, rough clustering methods have been developed by combining rough set theory with clustering. Lingras and West [12] first proposed rough c-means (RCM), which introduces the lower approximation, upper approximation, and boundary region to represent the certain, possible, and uncertain belongingness of objects to clusters. Many variants of RCM [13,14] have been proposed in recent years. Ubukata et al. proposed a novel rough clustering framework called objective function-based rough membership c-means clustering (RMCM2) [15]. The RMCM2 algorithm adopts a rough membership function, which considers the neighborhood information of the data, and designs an objective function derived from the rough membership c-means algorithm [16].
Traditional rough clustering methods usually cannot effectively explore the structure of the data when the amount of available data is limited or the data are disturbed by noise. Several learning approaches have been developed to address these problems, such as co-clustering [17], multitask learning [18], semi-supervised learning [19], and transfer learning [20,21]. Transfer learning is perhaps the most promising of these because of its specific mechanism. In the past decade or so, many unsupervised transfer clustering algorithms have been developed by combining clustering algorithms with transfer learning. According to the transfer method, transfer clustering algorithms can be roughly divided into four categories [22]: instance-based [23], feature-representation-based [24,25], parameter-based [26,27,28,29], and relational knowledge-based [30]. The earliest study is the self-taught clustering (STC) based on mutual information proposed by Dai et al. [24]. After that, Sun et al. proposed a transfer maximum entropy clustering algorithm based on maximum entropy clustering [29]. In [21,26], transfer learning was applied to prototype-based fuzzy clustering. The recently proposed transfer learning possibilistic c-means (TLPCM) [28] works in applications where data are limited or polluted by noise.
Furthermore, traditional rough clustering algorithms are sensitive to the initialized cluster centers and easily fall into local optima. The differential evolution (DE) algorithm is a nature-inspired optimization algorithm that can approach the global optimum of a specific problem. DE has been widely studied and applied because of its simple implementation and fast convergence [31,32]. Clustering algorithms based on differential evolution are less likely to become trapped in local optima, which enhances their robustness.
To address the problems mentioned above, a novel differential evolution-based transfer rough clustering algorithm (DE-TRC) is proposed in this paper. Using the mechanism of transfer learning, we design the objective function of an unsupervised transfer rough clustering algorithm. It combines transfer learning and rough clustering to improve the performance of the algorithm on sparse data and noisy data. In addition, the differential evolution algorithm is introduced to optimize the objective function of the clustering algorithm, which improves the robustness of the algorithm. The experimental results presented here are thoroughly evaluated. Comparisons with state-of-the-art clustering algorithms demonstrate the advantages of DE-TRC on both synthetic and real-world datasets.
The rest of this paper is organized as follows. "Related works" introduces the work related to this paper, including rough clustering and the transfer mechanism. The implementation details of the proposed model are described in "Proposed methods". Experiments on synthetic and real-world datasets are carried out in "Experiment analysis". "Conclusion" concludes the paper.

Related works
In this section, we review rough set theory and the prototype-based transfer clustering matching mechanism, the two theories most closely related to our proposed method.

Rough set theory and clustering
Rough set theory is a mathematical tool for dealing with uncertain problems, which is considered from the perspective of granular computation [33]. A rough set is the approximation of a vague concept by a pair of precise concepts, namely, the lower and upper approximations.
Let $U = \{x_1, x_2, \cdots, x_N\}$ be a set of $N$ objects and $F = \{X_1, X_2, \cdots, X_C\}$ be a family of $C$ clusters on $U$. $\mu^R_{ci}$ is the rough membership of $x_i$ to $X_c$ with respect to the neighborhood relation $R$. It is computed from $u_{ci}$, the degree to which $x_i$ belongs to $X_c$, and from $R_{it}$, which represents the nearest-neighbor relation. The parameter $k$ of this relation can be determined according to the strategy studied in Ref. [15]. To describe the lower approximation, upper approximation, and boundary region of $X_c$, the corresponding memberships $\underline{u}_{ci}$, $\overline{u}_{ci}$, and $\hat{u}_{ci}$ can be calculated from the rough membership.
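The definitions above can be illustrated with a short NumPy sketch. It is not the authors' implementation and rests on three assumptions that are not taken from this paper: $u_{ci}$ is a crisp nearest-center assignment, $R_{it} = 1$ when $x_t$ is among the $k$ nearest neighbors of $x_i$ (with $x_i$ counted as its own neighbor), and the rough membership takes the Pawlak form, i.e., the fraction of the neighborhood of $x_i$ that is assigned to $X_c$. All function names are hypothetical.

```python
import numpy as np

def knn_relation(X, k):
    """Assumed binary relation R (N x N): R[i, t] = 1 if x_t is one of the
    k nearest neighbors of x_i (x_i counts as its own neighbor)."""
    d = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)
    idx = np.argsort(d, axis=1, kind="stable")[:, :k]
    R = np.zeros((X.shape[0], X.shape[0]))
    np.put_along_axis(R, idx, 1.0, axis=1)
    return R

def rough_membership(X, V, k):
    """Return mu (C x N): share of each object's neighborhood that is
    crisply assigned to cluster c (Pawlak-style rough membership)."""
    d = np.linalg.norm(X[None, :, :] - V[:, None, :], axis=2)  # C x N distances
    u = np.zeros_like(d)
    u[np.argmin(d, axis=0), np.arange(X.shape[0])] = 1.0       # crisp u_ci
    R = knn_relation(X, k)
    # mu[c, i] = sum_t R[i, t] * u[c, t] / sum_t R[i, t]
    return (u @ R.T) / R.sum(axis=1)

def approximations(mu):
    """Lower, upper, and boundary indicators derived from mu."""
    lower = mu == 1.0          # whole neighborhood lies in the cluster
    upper = mu > 0.0           # at least one neighbor lies in the cluster
    boundary = upper & ~lower  # possible but not certain membership
    return lower, upper, boundary
```

Objects whose whole neighborhood falls in one cluster land in its lower approximation, while objects with mixed neighborhoods fall in the boundary regions of the clusters involved.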

Prototype-based transfer mechanism
Generally, transfer learning can use source data to improve the learning result on target data, provided that useful knowledge can be extracted from the source domain and transferred to the target domain [20]. For prototype-based clustering problems, the most critical information is the cluster centers. A prototype-based transfer clustering mechanism is therefore defined on the pairs of centers $\tilde{v}_c$ and $v_c$, where $\tilde{v}_c$ denotes the $c$th cluster center in the source domain and $v_c$ is the $c$th cluster center in the target domain. Here the number $C$ of clusters in the source domain and target domain is the same. This transfer clustering mechanism can be introduced into clustering objective functions to improve the clustering performance. In addition, to handle different numbers of clusters in the source domain and the target domain, a prototype-based transfer matching mechanism [21] has been proposed. In it, $\tilde{v}_k$ denotes the $k$th cluster center in the source domain and $v_c$ is the $c$th cluster center in the target domain. $C_S$ and $C_T$ are the numbers of clusters in the source domain and target domain, respectively. $r_{ck}$ represents the similarity between the $k$th center in the source domain and the $c$th center in the target domain. $m$ is the fuzzifier and is usually set to 2.
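As an illustration of the matching mechanism, the following NumPy sketch assumes that the matching term has the fuzzy-weighted form $\sum_{c}\sum_{k} r_{ck}^{m}\,\|v_c - \tilde{v}_k\|^2$ with $\sum_{k} r_{ck} = 1$ for every target center. This form is an assumption chosen to agree with the symbols defined above; the function names and the small constant `eps` are hypothetical.

```python
import numpy as np

def matching_weights(V_t, V_s, m=2.0, eps=1e-12):
    """Similarity r (C_T x C_S) between target and source centers.
    Under the assumed term, the minimizing weights are FCM-like:
    r_ck is proportional to ||v_c - v~_k||^(-2/(m-1)), rows sum to 1."""
    d2 = ((V_t[:, None, :] - V_s[None, :, :]) ** 2).sum(axis=2) + eps
    w = d2 ** (-1.0 / (m - 1.0))
    return w / w.sum(axis=1, keepdims=True)

def matching_penalty(V_t, V_s, r, m=2.0):
    """Assumed matching term: sum_c sum_k r_ck^m * ||v_c - v~_k||^2."""
    d2 = ((V_t[:, None, :] - V_s[None, :, :]) ** 2).sum(axis=2)
    return float(((r ** m) * d2).sum())
```

Because the weights are normalized per target center, the sketch works when $C_S \neq C_T$: each target center is softly matched to all source centers, with the closest source center receiving the largest weight.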

Proposed methods
To effectively explore the structure of the data and avoid the influence of the initialized cluster centers, a differential evolution-based transfer rough clustering algorithm is proposed to improve the clustering performance when the data in the target domain are insufficient or disturbed by noise. As shown in Fig. 1, the learned knowledge, i.e., the cluster centers, is usually available from the source domain. This knowledge is then applied to the target domain through the transfer learning strategy to help the target domain perform the clustering task effectively. Furthermore, the objective function of the transfer rough clustering algorithm is optimized with the differential evolution algorithm to enhance the robustness of the algorithm.

Transfer rough clustering algorithm (TRC)
We first propose a transfer rough clustering algorithm (TRC). The objective function of TRC uses both the target domain data $X_T$ and the cluster centers $\tilde{V}_S = \{\tilde{v}_k\}$ $(k = 1, 2, \ldots, C_S)$ of the source domain as auxiliary knowledge.
The cluster centers $\tilde{V}_S$ of the source domain are obtained by the classical RMCM2 algorithm. The objective function, Eq. (9), is defined in terms of the following quantities.
$x_i$ is the $i$th sample in the target domain and $N_T$ is the number of samples in the target domain. $\mu^R_{ci}$ is a rough membership function, which represents the degree to which the $i$th sample belongs to the $c$th cluster. $\|x_i - v_c\|$ is the Euclidean distance between the $i$th sample and the $c$th cluster center in the target domain. $\lambda$ is a trade-off parameter.
In Eq. (9), the first term measures the internal compactness of the target domain data and the second term measures the similarity between the cluster centers in the target domain and the cluster centers in the source domain. Minimizing this objective function with Lagrange multipliers yields closed-form update expressions for $r_{ck}$ and $v_c$ (see Appendix A). The rough membership $\mu^R_{ci}$ depends on the nearest neighbor matrix $R_{it}$ and the membership $u_{ci}$, as shown in Eq. (3). The update rule of $u_{ci}$ follows Ref. [15]. Pseudocode for the TRC algorithm is given in Algorithm 1.
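To make the roles of the two terms concrete, the sketch below evaluates one objective of this shape in NumPy. It is a hedged illustration rather than Eq. (9) itself: it assumes a compactness term weighted by $\lambda$ and a center-matching term weighted by $1 - \lambda$, which agrees with the limiting cases described for $\lambda = 1$ and $\lambda = 0$ in "Parameter analysis". The function name and argument layout are hypothetical.

```python
import numpy as np

def trc_objective(X, V, V_src, mu, r, lam=0.9, m=2.0):
    """Assumed TRC-style objective (not the paper's exact Eq. (9)):
        lam       * sum_c sum_i mu[c, i]   * ||x_i - v_c||^2
      + (1 - lam) * sum_c sum_k r[c, k]^m  * ||v_c - v~_k||^2
    X: N_T x d target data, V: C_T x d target centers,
    V_src: C_S x d source centers, mu: C_T x N_T, r: C_T x C_S."""
    d_data = ((V[:, None, :] - X[None, :, :]) ** 2).sum(axis=2)     # C_T x N_T
    d_src = ((V[:, None, :] - V_src[None, :, :]) ** 2).sum(axis=2)  # C_T x C_S
    compactness = (mu * d_data).sum()   # first term: fit to the target data
    transfer = ((r ** m) * d_src).sum() # second term: agreement with the source
    return float(lam * compactness + (1.0 - lam) * transfer)
```

With `lam=1.0` only the target data matter, and with `lam=0.0` only the distance to the source centers matters, so intermediate values trade off the two sources of information.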

Differential evolution-based transfer rough clustering algorithm (DE-TRC)
Although the proposed TRC utilizes the source data to improve the clustering performance, it may be influenced by the initialized cluster centers and easily fall into a local optimum. To solve these problems, the differential evolution (DE) algorithm is introduced and a differential evolution-based transfer rough clustering algorithm is proposed. This method has three important parts: population initialization, evolutionary operators, and fitness function computation. The fitness function describes the quality of an individual in the population. Therefore, the objective function designed in Eq. (9) is adopted as the fitness function in our method.

Population initialization
The population contains the candidate solutions to the clustering problem. In DE-TRC, a solution is a set of cluster centers. We use random initialization to generate the population. $D$ is the dimension of the problem and $NP$ is the population size, which is usually specified by the user. As shown in Fig. 2, a chromosome is represented as a six-dimensional vector of real numbers. In this example, the number of cluster centers is 3.
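The encoding can be sketched as follows. Each chromosome concatenates the coordinates of all cluster centers, so three centers in two dimensions give $D = 6$, matching the example of Fig. 2. Drawing the initial centers uniformly inside the bounding box of the target data is an assumption made for this sketch (the text only states that initialization is random), and the function names are hypothetical.

```python
import numpy as np

def init_population(X, n_clusters, pop_size, rng):
    """Random population of shape (NP, D) with D = n_clusters * n_features.
    Assumption: centers are sampled uniformly within the data's bounding box."""
    lo, hi = X.min(axis=0), X.max(axis=0)
    centers = rng.uniform(lo, hi, size=(pop_size, n_clusters, X.shape[1]))
    return centers.reshape(pop_size, -1)  # flatten each set of centers

def decode(chromosome, n_clusters):
    """Recover the (n_clusters x n_features) center matrix from a chromosome."""
    return chromosome.reshape(n_clusters, -1)
```

For example, `init_population(X, n_clusters=3, pop_size=100, rng=np.random.default_rng(0))` on two-dimensional data returns an array of shape `(100, 6)`.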

Evolution optimization process
After initialization, evolutionary operations including mutation, crossover, and selection [34] are performed to evolve the population. To generate an offspring $p_{i,t+1}$ of chromosome $p_{i,t}$ in the $t$th generation, three different chromosomes $p_{j,t}$, $p_{m,t}$, and $p_{n,t}$ are randomly selected from the population, and a mutant vector $o_{i,t}$ is formed by adding the scaled difference of two of them to the third:

$$o_{i,t} = p_{j,t} + F\,(p_{m,t} - p_{n,t}) \qquad (13)$$

where $F$ denotes a scalar (typically in the interval [0.4, 1]). Then, a binomial crossover operator is applied to $o_{i,t}$ and $p_{i,t}$ to generate the offspring $w_{i,t}$, Eq. (14): each gene $w^j_{i,t}$ is copied from the mutant $o_{i,t}$ when a uniform random number drawn for that gene does not exceed $CR$, and from the parent $p_{i,t}$ otherwise. Here $CR$ denotes the crossover control parameter and $w^j_{i,t}$ represents the $j$th gene of the $i$th chromosome in the $t$th generation.
Finally, the offspring $w_{i,t}$ and the chromosome $p_{i,t}$ are evaluated by computing their fitness values with Eq. (9). Only the better of the two is passed to the next generation. These operators are repeated until the stopping criterion is met. The details of the DE-TRC algorithm are given in Algorithm 2.
In this work, a population size of NP = 100 is sufficient for reliable convergence behavior. If NP is too large, the computational complexity is high. If NP is too small, the diversity of the population may be too low to escape a local minimum [45,47]. As for F, F = 0.5 is usually a good initial choice [31,43], and it works for all datasets in this manuscript. The crossover constant CR is a real number in the interval [0, 1]. CR should not be too large; otherwise the perturbations become too strong and the convergence speed decreases. Following the suggestions in [45,46], we set CR = 0.3 in this manuscript. Usually, the algorithm is stopped after exceeding a maximum number of iterations [44]. To promote global convergence, we set the maximum number of iterations to 500 according to [47].
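The evolutionary loop can be sketched in NumPy as a generic DE/rand/1/bin minimizer. This is a minimal illustration under stated assumptions, not the authors' Algorithm 2: the three donor chromosomes are drawn so that they also differ from the parent, and one randomly chosen gene is always taken from the mutant (the usual convention that prevents a trial from being an exact copy of its parent). The default values of `F`, `CR`, and `max_iter` follow the settings reported above; `fitness` is any callable that maps a chromosome to a value to be minimized, such as an objective of the TRC type.

```python
import numpy as np

def differential_evolution(fitness, pop, F=0.5, CR=0.3, max_iter=500, rng=None):
    """Minimize `fitness` with DE/rand/1/bin; `pop` is an (NP, D) array, NP >= 4."""
    rng = np.random.default_rng() if rng is None else rng
    pop = np.array(pop, dtype=float)
    NP, D = pop.shape
    fit = np.array([fitness(p) for p in pop])
    for _ in range(max_iter):
        new_pop, new_fit = pop.copy(), fit.copy()
        for i in range(NP):
            # Mutation: three distinct chromosomes, none equal to the parent i.
            others = np.delete(np.arange(NP), i)
            j, m, n = rng.choice(others, size=3, replace=False)
            mutant = pop[j] + F * (pop[m] - pop[n])
            # Binomial crossover: take a gene from the mutant when rand <= CR.
            mask = rng.random(D) <= CR
            mask[rng.integers(D)] = True  # assumed convention: force one gene
            trial = np.where(mask, mutant, pop[i])
            # Selection: keep the better of parent and trial.
            f_trial = fitness(trial)
            if f_trial <= fit[i]:
                new_pop[i], new_fit[i] = trial, f_trial
        pop, fit = new_pop, new_fit
    best = int(np.argmin(fit))
    return pop[best], float(fit[best])
```

Because selection is greedy, the best fitness in the population never gets worse from one generation to the next, which is what makes the method less dependent on any single initialization.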

Time complexity analysis
We assume that the size of the dataset in the target domain is $N_T$, and that the numbers of clusters in the target domain and source domain are $C_T$ and $C_S$, respectively. The population size is $N$ and the number of generations is $T$. In each generation, the computation of the fitness function consumes more time than the other operations. For each individual, the time complexity of calculating the fitness $J_{TRC}$ is $O(C_T N_T^2 + C_T C_S)$. Then, the time complexity of the fitness calculations for the whole population is $O(N(C_T N_T^2 + C_T C_S))$. Therefore, the total time complexity of the DE-TRC algorithm over $T$ generations is $O(T N (C_T N_T^2 + C_T C_S))$.

Experiment analysis
In this section, we first verify the superiority of the transfer learning mechanism and the differential evolution algorithm on synthetic and real-world datasets. Then, we conduct experiments to compare the clustering performance of TRC and DE-TRC with relevant clustering methods and transfer clustering methods, including FCM [8], RCM [12], FCDE [35], GARCM [36], E-TFCM [21], and STC [24]. The parameters of all algorithms are presented in Table 1. We use three popular external measures as evaluation criteria: accuracy (ACC) [38], normalized mutual information (NMI) [39], and Rand index (RI) [40]. All of them lie between 0 and 1, and a higher value indicates better clustering performance. We run every algorithm 10 times on each dataset and report the mean and standard deviation. All experiments in this paper are performed using MATLAB R2018b on a 64-bit Windows 10 operating system with an Intel(R) Core i5-10400F CPU and 16 GB RAM.
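Of the three measures, the Rand index is the simplest to state: it is the fraction of object pairs on which two partitions agree, i.e., pairs placed together in both or apart in both. The short Python sketch below (not the MATLAB code used in the experiments; the function name is hypothetical) computes it directly from this definition.

```python
from itertools import combinations

def rand_index(labels_true, labels_pred):
    """Fraction of object pairs on which the two labelings agree
    (same cluster in both, or different clusters in both)."""
    pairs = list(combinations(range(len(labels_true)), 2))
    agree = sum(
        (labels_true[a] == labels_true[b]) == (labels_pred[a] == labels_pred[b])
        for a, b in pairs
    )
    return agree / len(pairs)
```

The index does not depend on how clusters are numbered, so a relabeled but otherwise identical partition still scores 1. The pairwise loop is quadratic in the number of objects, which is acceptable for small target datasets.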

Synthetic datasets
Five different synthetic two-dimensional datasets are generated to evaluate the performance of the proposed algorithm in this experiment. In the real world, there may be insufficient available data in some special or emerging fields [20]. To simulate this scenario, we create T1 and T2, which have the same distribution as the source domain data. The main difference is that T1 and T2 have fewer samples in each cluster than the source domain. Noise and outliers are inevitable in the process of data acquisition, and these disturbances limit the performance of common clustering algorithms. To verify the anti-noise performance of the proposed algorithm, we create T3 and T4 with additive and multiplicative noise, respectively. It is known that the performance of classical clustering algorithms is limited when clustering unbalanced data [42]. We therefore create an unbalanced dataset T5, in which the number of samples varies greatly across categories. Furthermore, existing transfer clustering algorithms have a common limitation, i.e., the numbers of clusters in the source and target domains must be the same. However, in most practical applications, this assumption does not always hold. Therefore, T6 is constructed to verify the performance of clustering algorithms in this situation. S1 and T1 are generated as the datasets in the source domain and target domain (Synthetic dataset 1), respectively; they follow uniform distributions with the same maximum and minimum range values and are derived from Ref. [37]. The parameters used to generate the other four synthetic datasets are listed in Table 2. All source data and target data are shown in Figs. 3 and 4.

Real-world datasets
To further validate the proposed algorithm in real-world scenarios, this paper conducts a series of experiments on UCI datasets [41]. The characteristics of the five classical datasets used in this paper are shown in Table 3. To verify the clustering performance of the DE-TRC algorithm on limited real-world data, each dataset is divided into source domain data and target domain data: 10% of the full data are used as the target data and the rest serve as the source data.
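A split of this kind can be sketched as follows. The text specifies only the 10%/90% proportions, so the uniformly random, non-stratified selection used here is an assumption, and the function name is hypothetical.

```python
import numpy as np

def split_source_target(X, target_fraction=0.1, rng=None):
    """Randomly split the rows of X into (source, target) parts.
    Assumption: a uniformly random split without stratification by class."""
    rng = np.random.default_rng() if rng is None else rng
    perm = rng.permutation(len(X))
    n_target = max(1, int(round(target_fraction * len(X))))
    return X[perm[n_target:]], X[perm[:n_target]]
```

For a dataset with 150 samples, this yields 15 target samples and 135 source samples; the source part would then be clustered first to obtain the centers that are transferred to the target task.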

Parameter analysis
In the proposed method, the parameter λ weights the two terms in the objective function and may greatly affect the performance of the algorithm. When λ = 1, the DE-TRC algorithm degenerates into the RMCM2 algorithm. When λ = 0, the clustering results are completely determined by the transfer term, that is, the final cluster centers depend completely on the knowledge of the source domain. In this section, T2, T5, T6, Iris, Haberman, and Wine are selected for the parameter sensitivity analysis. λ is varied from 0 to 1 in steps of 0.05, and the impact of λ on the performance of the algorithm is presented in Figs. 5 and 6. It can be seen from Figs. 5 and 6 that the clustering performance on almost all datasets improves as λ increases. In Figs. 5(b), 6(a), and 6(c), when λ is higher than 0.8, DE-TRC obtains satisfactory performance in terms of ACC, NMI, and RI. In Figs. 5(a) and 6(b), the clustering performance of DE-TRC improves when λ is higher than 0.9 and 0.6, respectively. However, ACC decreases to 0.6129 when λ is 0.95 in Fig. 6(b). Therefore, λ is set to 0.9 in the following experiments.

Experiments to verify the superiority of transfer mechanism and evolution algorithm
To verify the superiority of the transfer learning mechanism and the differential evolution algorithm, we first set up experiments comparing three algorithms: RMCM2, TRC, and DE-TRC. Figures 7 and 8 show the scatter plots of the clustering results obtained by the RMCM2 and DE-TRC algorithms, respectively. Samples in different categories are represented by different colors and markers, and the cluster prototypes are marked by black pentagrams. As Figs. 7(c) and 8(c) show, the prototype obtained by the RMCM2 algorithm is far away from the center of cluster 2 due to the influence of noise, whereas the DE-TRC algorithm obtains the ideal cluster prototypes. Furthermore, as shown in Figs. 7(e) and 8(e), the prototype of cluster 3 obtained by RMCM2 is biased toward other clusters, while the DE-TRC algorithm finds more desirable cluster prototypes. To quantitatively verify the importance of the transfer learning mechanism and evolutionary optimization in the proposed DE-TRC, RI is selected as the performance index, and the results of the original RMCM2, TRC, and DE-TRC algorithms on the synthetic and real-world datasets are listed in Tables 4 and 5, respectively. In these tables, the best algorithm is highlighted in bold. As shown in Tables 4 and 5, the clustering results of TRC are clearly superior to those of the original RMCM2 algorithm on most datasets. However, although TRC obtains a higher mean RI than RMCM2 on all datasets, its large standard deviation means that the performance of the TRC algorithm is somewhat unstable. DE-TRC obtains a mean value of 1.0000 and a standard deviation of 0.0000 on datasets T1, T2, and T5. Owing to the evolutionary optimization algorithm, DE-TRC not only obtains better clustering results but also achieves better convergence.

Experiments of related comparative algorithm
To further demonstrate the superiority and stability of the proposed algorithm, we conduct comparative experiments with six state-of-the-art algorithms on all datasets. The comparative experimental results on the synthetic and real-world datasets are shown in Tables 6 and 7, respectively. For the STC algorithm, the numbers of clusters in the source and target domain data must be equal. Thus, it is not applicable to T6.
Based on the results presented in Tables 6 and 7, we make the following observations. For T1, T2, and T5, by utilizing the knowledge from the source domain, DE-TRC obtains entirely correct clustering results. Datasets T3 and T4 were polluted by either interference data or noise. In general, E-TFCM and DE-TRC obtained better results than the other algorithms because of the noise robustness of transfer methods. In addition, benefiting from the evolutionary optimization method, DE-TRC has the most stable clustering performance among all compared algorithms.

Time efficiency analysis
The computational times of DE-TRC and the other comparison algorithms on five real-world datasets are listed in Table 8. All algorithms can be divided into three categories: classical clustering algorithms, transfer clustering algorithms, and evolution-based clustering algorithms. The evolution-based algorithms need to compute the fitness functions of individuals in each generation. Therefore, they require more computational time. However, they can obtain better clustering results and more stable clustering performance than the classical clustering algorithms and transfer clustering algorithms, as shown in Tables 6 and 7.

Conclusion
Based on the traditional rough clustering algorithm, we design an unsupervised transfer rough clustering framework based on the differential evolution algorithm (DE-TRC). First, the transfer learning mechanism helps DE-TRC work well on data that are limited or polluted by noise. Second, the differential evolution algorithm is used to optimize the objective function of the transfer rough clustering algorithm, which effectively addresses the sensitivity to the initial cluster centers. Experimental results show that the proposed DE-TRC algorithm provides better clustering results than the traditional rough clustering algorithms and other state-of-the-art clustering algorithms on both synthetic and real-world datasets. However, the efficiency of the algorithm can be further improved. Future work will focus on more effective evolutionary strategies to improve the overall efficiency of the algorithm.

Conflict of interests
The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.
Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.

Appendix A
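To minimize the objective function, the Lagrangian multiplier $\alpha$ is introduced to form the Lagrangian of Eq. (9). Because $r_{ck}$ should satisfy the constraint $\sum_{k=1}^{C_S} r_{ck} = 1$, setting the partial derivatives of the Lagrangian with respect to $r_{ck}$ and $\alpha$ to zero gives $r_{ck}$ in closed form.
Similarly, setting the partial derivative with respect to $v_c$ to zero gives $v_c$ in closed form.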
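The steps of this derivation can be written out under an explicit assumption about the objective. The form of $J$ below (a compactness term weighted by $\lambda$ plus a matching term weighted by $1-\lambda$) is assumed for illustration and is chosen to agree with the symbols and limiting cases described in the text; it should not be read as the exact Eq. (9). One multiplier $\alpha_c$ per target cluster is used here, and $\mu^R_{ci}$ is treated as fixed during the update, with $0 < \lambda < 1$ and $m > 1$.

```latex
\begin{align}
J &= \lambda \sum_{c=1}^{C_T}\sum_{i=1}^{N_T} \mu^R_{ci}\,\|x_i - v_c\|^2
   + (1-\lambda) \sum_{c=1}^{C_T}\sum_{k=1}^{C_S} r_{ck}^{m}\,\|v_c - \tilde{v}_k\|^2
   && \text{(assumed form)} \\
L &= J + \sum_{c=1}^{C_T} \alpha_c \Bigl(1 - \sum_{k=1}^{C_S} r_{ck}\Bigr) \\
\frac{\partial L}{\partial r_{ck}}
  &= (1-\lambda)\, m\, r_{ck}^{m-1}\, \|v_c - \tilde{v}_k\|^2 - \alpha_c = 0
  \;\Rightarrow\;
  r_{ck} = \frac{\|v_c - \tilde{v}_k\|^{-2/(m-1)}}
                {\sum_{l=1}^{C_S} \|v_c - \tilde{v}_l\|^{-2/(m-1)}} \\
\frac{\partial L}{\partial v_c}
  &= -2\lambda \sum_{i=1}^{N_T} \mu^R_{ci}\,(x_i - v_c)
     + 2(1-\lambda) \sum_{k=1}^{C_S} r_{ck}^{m}\,(v_c - \tilde{v}_k) = 0
  \;\Rightarrow\;
  v_c = \frac{\lambda \sum_{i=1}^{N_T} \mu^R_{ci}\, x_i
              + (1-\lambda) \sum_{k=1}^{C_S} r_{ck}^{m}\, \tilde{v}_k}
             {\lambda \sum_{i=1}^{N_T} \mu^R_{ci}
              + (1-\lambda) \sum_{k=1}^{C_S} r_{ck}^{m}}
\end{align}
```

In the third line, the multiplier $\alpha_c$ is eliminated by substituting the stationarity condition into the constraint $\sum_{k} r_{ck} = 1$. Under this assumed form, each target center is a weighted average of the target samples and the source centers, with $\lambda$ controlling how strongly the source knowledge pulls the centers.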