Introduction

Document clustering aims to group similar documents together and put dissimilar ones into different clusters. Research on document clustering plays a key role in intelligent text processing and has been widely used in various advanced applications such as keyword extraction [1], collaborative filtering recommendation [2], text summarization [3], document retrieval [4] and other natural language processing tasks [5]. Generally, this task relies heavily on a commonly accepted clustering hypothesis, i.e., documents on the same topic show high similarity, while those belonging to different clusters differ substantially. That is, the guiding principle of document clustering is to maximize the intra-cluster similarity while minimizing the inter-cluster similarity.

Different types of algorithms have been developed for document clustering, such as partitioning methods, hierarchical methods, and word- and phrase-based methods. Partitioning methods adopt various distance measures and cluster documents accordingly [6, 7]. Hierarchical methods usually create a cluster hierarchy, in which the leaf nodes correspond to the input documents, the parent nodes are merged clusters, and the topmost levels are the final clusters [8, 9]. Word- and phrase-based methods find key sets of words with which to cluster the documents [10, 11]. In this work, we focus on partitioning methods since they are lightweight, simple to implement, and involve few model parameters [12]. Many techniques [13,14,15] in this category are based on or derived from K-means. However, the K-means algorithm easily gets trapped in a local optimum, and its clustering performance is therefore sensitive to the selection of the initial cluster centroids. This unwelcome property restricts the usage of K-means-based methods [16].

To overcome this limitation, Evolutionary Computation (EC) algorithms have been adopted for global search. For example, a Particle Swarm Optimization (PSO)-based document clustering algorithm was proposed to perform a comprehensive search over the entire solution space [17]. Song et al. [18] put forward a hybrid evolutionary approach for document clustering, exploiting an evolutionary algorithm to improve the initialization of PSO and adopting the particles to find the optimal solution. Zhang et al. [19] developed a genetic algorithm-based aggregation method to construct enhanced clusters from several weak clustering results.

In addition to the common problems in document clustering, new challenges may arise from the language of the documents. In contrast to English documents, it is impractical to split Chinese documents directly on whitespace characters: the characters in a Chinese sentence are not separated by spaces, and different segmentations of the same sentence may lead to different interpretations. There is thus a pressing need for clustering algorithms designed specifically for Chinese documents. Toward this end, the algorithm in [20] decomposes the task of Chinese document clustering into two phases. In the first phase, it preprocesses the Chinese documents using a hash-based method and constructs term vectors to represent the documents. Then, with the extracted features, a dynamic method based on self-organizing maps is applied for clustering. Geng et al. [21] put forward a three-layer hybrid algorithm to cluster Chinese microblogs: the K-means algorithm is first adopted to group the microblog texts; an agglomerative algorithm then merges small clusters on the same topic; finally, potential noise is eliminated in the third layer. Considering the fact that the new center of a cluster is easily affected by isolated texts, Yao et al. [13] presented an improved clustering method to solve the isolated-point problem.

Although plenty of works have been developed for Chinese document clustering, three major problems remain unsolved. First, there is no Chinese document clustering framework that provides global optimization capability. Second, many document clustering algorithms require the cluster number as input. In practical working scenarios, however, manually specifying the number of document categories may be inaccurate, resulting in unsatisfactory results. Third, most document clustering algorithms suffer from weak exploitation ability: they cannot fine-tune the cluster centroids to better locations and may thus be trapped in a wrong cluster structure. To address the above-mentioned problems, we propose a general framework for Chinese document clustering. The novelties and main contributions of this work are summarized as follows:

  1. We propose a general framework for Chinese document clustering, in which the preprocessing and feature extraction modules are used to effectively represent the dataset and the main clustering module is fulfilled with an evolutionary approach for global optimization.

  2. We propose an Adaptive Encoding-based Evolutionary Approach (AEEA) to adaptively estimate the cluster number, equipped with carefully designed crossover and mutation operators.

  3. To solve the problem of weak exploitation ability, we introduce a hybrid evolutionary approach with a single step of K-means. In the evolutionary process, the K-means module helps to fine-tune the cluster centroids and to enhance the exploitation ability of the proposed AEEA method.

  4. We compare our method with state-of-the-art methods on different Chinese document clustering datasets. Experimental results validate the effectiveness of our method.

The rest of this paper is organized as follows. Section “Related work” reviews the related document clustering algorithms. In Sect. “The proposed framework”, we introduce the proposed Chinese document clustering framework. Section “Experiments” shows the experimental results. Conclusions are drawn in Sect. “Conclusion”.

Related work

Generally, document clustering can be tackled by phrase-based document indexing [22], concept factorization [23], predictive networks [24], and evolutionary algorithms [17, 19]. As evolutionary algorithms are powerful tools for handling NP-hard problems [25], we focus on this category and classify existing evolutionary algorithms into two classes according to whether the cluster number is pre-defined.

Evolutionary clustering with fixed cluster number

Algorithms in this category require the cluster number as input. For example, Zhang et al. [19] put forward a genetic algorithm for document clustering, where the cluster centroids are fine-tuned in the evolutionary process. Yogesh et al. [26] combined the genetic algorithm and the discrete differential evolution algorithm to find the best cluster centroids. A hybrid K-means algorithm based on PSO was proposed by Kamel et al. [27], where the PSO module optimizes the cluster centroids and the K-means module generates and refines the solutions. In addition to single-objective methods, multi-objective optimization methods can also be used for document clustering. A multi-objective genetic algorithm was proposed to solve the premature convergence problem [28]; it takes two cluster validation indices as separate optimization goals. Laith et al. [29] proposed a multi-objective method that combines distance and similarity measures to improve clustering performance.

Evolutionary clustering with variable number of clusters

Practically, it is hard to determine the underlying cluster number in advance. To handle this problem, methods in this category are designed to learn the underlying cluster number along the evolutionary process. For example, Cobos et al. [30] proposed a memetic algorithm for web document clustering, where niching methods are used to promote diversity in the population. Onan et al. [31] developed an improved ant algorithm with a Latent Dirichlet Allocation-based representation for document clustering. Furthermore, an improved two-phased genetic algorithm for document clustering was proposed by Akter and Chung [32]; by modifying the cluster centroids during crossover, this method allows new centroids to be generated and therefore enlarges the search space. In addition, several multi-objective optimization methods have been proposed for document clustering. For instance, Abdul et al. [33] introduced a multi-objective genetic algorithm for clustering web documents by considering multi-view data representations.

The proposed framework

To handle the problem of document clustering, in this paper, we put forward a general clustering framework based on a novel global optimization strategy. Within this framework, various evolutionary approaches can be adapted for global search. As an illustration, we adopt an Adaptive Encoding-based Evolutionary Approach (AEEA) for Chinese document clustering.

As shown in Fig. 1, the general pipeline for our Chinese document clustering framework is introduced as follows. First, we conduct data preprocessing to analyze the document dataset. Then, the Vector Space Model (VSM) is constructed to represent the dataset. Afterward, the dimension of the initial feature representations is reduced for efficient computation, and the evolutionary approach is applied to get the final clustering results.

Fig. 1 Illustration of the pipeline for our Chinese document clustering framework

Data preprocessing

Given an unlabeled document dataset \(\{d_{i}\mid i=1,2,..., N\}\), we first remove the stop words to refine the data and split the sentences into meaningful words via word segmentation. Stop words are words that carry little meaning in the documents. In English, for example, they can be function words like ‘the’, ‘is’, ‘at’, etc., while in Chinese they can be “啊,哎,吧”, etc. Most stop words contribute nothing to differentiating the topics of documents; on the contrary, they have negative effects such as misleading the clustering process and increasing the computational complexity.

After removing the stop words, we adopt the ‘Jieba’ software [34] to perform word segmentation. Unlike an English sentence, which can be split directly on whitespace characters, a Chinese sentence fuses all characters together. Thus, analyzing Chinese documents requires an additional segmentation step, and ‘Jieba’ provides an automatic way to perform it. The segmentation can be performed at different granularities. For example, a university name “华南理工大学” can be split at a coarse granularity into “华南理工\(\setminus \)大学” or at a fine granularity into “华南 \(\setminus \) 理工 \(\setminus \) 大学”. We adopt the full mode of ‘Jieba’, which lists all possible words in the sentence and therefore provides the finest granularity. This helps us to fully capture the semantics of Chinese sentences, but the side effect is a relatively higher problem dimensionality. After preprocessing, each document consists of meaningful words, which facilitates the subsequent feature extraction procedure.
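For concreteness, the following minimal Python sketch illustrates this preprocessing step with ‘Jieba’ in full mode. The tiny stop-word set is only illustrative (in practice a standard stop-word file is loaded), and stop words are filtered after segmentation here purely for simplicity.

```python
# Minimal preprocessing sketch: full-mode 'Jieba' segmentation plus stop-word filtering.
import jieba

STOP_WORDS = {"啊", "哎", "吧", "的", "了"}  # illustrative only; normally loaded from a file

def preprocess(document):
    """Segment a Chinese document in full mode and drop stop words."""
    words = jieba.lcut(document, cut_all=True)  # full mode: list all candidate words
    return [w for w in words if w.strip() and w not in STOP_WORDS]

print(preprocess("华南理工大学位于广州"))  # full mode lists overlapping candidates such as 华南 / 理工 / 大学 ...
```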

Feature extraction using VSM

We construct the feature vectors via VSM [35] to extract the compact representation of initial documents. We first count the word frequencies and then construct the document weight vector as an efficient feature. For example, when the ith document \(d_{i}\) is traversed, the frequency \(z_{d_{i},j}\) is incremented by one each time the feature word j appears. Afterward, the Term Frequency (TF) and Inverse Document Frequency (IDF) values of each feature word j in the ith document \(d_{i}\) are calculated as

$$\begin{aligned} \mathrm{{TF}}(d_{i},j)=\frac{z_{d_{i},j}}{\sum _{k=1}^{D} z_{d_{i},k}}, \end{aligned}$$
(1)
$$\begin{aligned} \mathrm{{IDF}}(d_{i},j)=\mathrm{{log}}\left( \frac{N}{u_{j}+1}\right) , \end{aligned}$$
(2)

where N is the total number of documents in the whole dataset, and \(u_{j}\) is the number of documents that contain the feature word j. Intuitively, \(\mathrm{{TF}}(d_{i},j)\) can be interpreted as the normalized word frequency and \(\mathrm{{IDF}}(d_{i},j)\) as the inverse document frequency. With these notations, the weight of the feature word j in the document \(d_{i}\) can be calculated as

$$\begin{aligned} f_{d_{i},j}=\mathrm{{TF}}(d_{i},j)\times \mathrm{{IDF}}(d_{i},j). \end{aligned}$$
(3)

All the weights of feature words in the document \(d_{i}\) work together to form the feature vector of the ith document, i.e., \({\varvec{{f}}}_{i}\).

The extracted feature vectors are stacked to represent the dataset as the feature matrix \(M=[{\varvec{{f}}}_{1},{\varvec{{f}}}_{2},...,{\varvec{{f}}}_{N}]^{T}\). We also reduce the dimension of the feature vectors by applying Principal Component Analysis (PCA) [36] to preserve 90% of the original data variation. Here, we need to strike a balance between data quality and data dimensionality. According to the research in [37, 38], 90% is a recommended setting that covers most of the data information while keeping a good dimension reduction rate. Reducing the data dimension brings at least two benefits: (1) documents generally have tens of thousands of attributes, so dimension reduction significantly reduces the computing time and resources; (2) the original feature vectors contain many zero values, and dimension reduction compresses these sparse features, reducing the potential redundancy and noise so that the documents are represented more accurately.
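The feature extraction step can be sketched as follows, implementing Eqs. (1)–(3) directly and using scikit-learn's PCA with a 0.90 variance ratio; the toy count matrix and variable names are only illustrative.

```python
# Sketch of the feature extraction step: TF-IDF weights per Eqs. (1)-(3), then PCA keeping 90% variance.
import numpy as np
from sklearn.decomposition import PCA

def tfidf_matrix(counts):
    """counts: (N, D) array with counts[i, j] = z_{d_i, j}, the frequency of word j in document i."""
    counts = np.asarray(counts, dtype=float)
    N = counts.shape[0]
    tf = counts / counts.sum(axis=1, keepdims=True)      # Eq. (1)
    u = (counts > 0).sum(axis=0)                         # number of documents containing word j
    idf = np.log(N / (u + 1.0))                          # Eq. (2)
    return tf * idf                                      # Eq. (3): f_{d_i, j}

counts = np.random.randint(0, 5, size=(600, 2000))       # toy word-count matrix, illustrative only
M = tfidf_matrix(counts)
M_reduced = PCA(n_components=0.90).fit_transform(M)      # keep 90% of the data variation
print(M_reduced.shape)
```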

AEEA for document clustering

With the extracted features, we propose a novel Adaptive Encoding-based Evolutionary Approach (AEEA) for Chinese document clustering. The flexible encoding scheme is able to adaptively adjust the cluster number, and new crossover and mutation operators are proposed to fit this scheme. Besides, a single step of K-means is introduced in the selection step to enhance the exploitation ability. The details of our AEEA algorithm are presented as follows.

Population initialization

Initialization of the population generates np individuals, where each individual encodes the centroids of randomly generated clusters. Specifically, for the ith (i = 1, 2, ..., np) individual, an integer \(K_{i}\) is randomly chosen within the range of [2, \(\sqrt{N}\)]. Afterwards, \(K_{i}\) cluster centroids are randomly selected from the rows of the feature matrix M, constituting the parameter vector of the ith individual. The length of each parameter vector is \(D\times K_{i}\), where D represents the dimension of the feature, i.e., the number of feature words in the current dataset. By adopting an adaptive encoding scheme, different individuals can model different cluster numbers. Finally, when the evolutionary process terminates, the optimal cluster number is read from the individual with the best fitness. Figure 2 shows three examples of the encoding scheme in a 2-D dataset. For instance, the first individual has two clusters (\(K_{1}\) = 2) and hence its parameter vector contains four dimensions. After randomly selecting the cluster centroids, the first individual becomes (1.8, 2.2, 1.9, 2.3). Similarly, the parameter vectors of the last two individuals contain six and eight dimensions to encode three and four clusters, respectively.
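A minimal sketch of this initialization is given below; the helper name init_population and the use of NumPy are our own illustrative choices.

```python
# Sketch of the adaptive-encoding initialization: each individual stores a variable
# number of centroids drawn from the rows of the (reduced) feature matrix M.
import numpy as np

def init_population(M, np_size=100, rng=None):
    """M: (N, D) feature matrix; returns a list of (K_i, D) centroid arrays."""
    rng = rng or np.random.default_rng()
    N = M.shape[0]
    k_max = int(np.sqrt(N))
    population = []
    for _ in range(np_size):
        k_i = rng.integers(2, k_max + 1)               # K_i drawn from [2, sqrt(N)]
        idx = rng.choice(N, size=k_i, replace=False)   # pick K_i documents as initial centroids
        population.append(M[idx].copy())               # parameter vector of conceptual length D * K_i
    return population
```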

Note that a few existing EAs can adapt the cluster number, but most of them adopt a fixed-length encoding scheme [39, 40]: the cluster number is adjusted by imposing additional activation thresholds that preserve or remove invalid clusters within the chromosome. Compared to our AEEA with its adaptive encoding scheme, these methods suffer from encoding redundancy and require activation control strategies to determine the cluster number.

Fig. 2 The adaptive encoding scheme with flexible lengths

Fig. 3 Illustrations of the cross-dimension learning error. a The individuals exchange information in the correct way according to the spatial information (marked with the solid line). b The individuals exchange information in the wrong way according to the index-based information (marked with the dotted line)

Crossover

The classic crossover operator, i.e., the index-based crossover, suffers from the problem of cross-dimension learning error. As illustrated in Fig. 3, an individual may exchange wrong information when the index-based crossover operator is used. By taking the spatial information into consideration, we design a new crossover operator under the adaptive encoding scheme. This operator can effectively exchange information between variable-length individuals without incurring the cross-dimension learning error. As shown in Fig. 4, the new crossover operator adopts a subspace swapping strategy to exploit the spatial information. The detailed procedure is presented as follows.

Fig. 4 Illustration of the subspace swapping strategy in the proposed crossover operator

Step (a): For the target vector \(X_{i}\) in the current population, we first generate a random number \(r_{1}\in [0,1]\). If \(r_{1}\) is smaller than the crossover rate CR and the target vector \(X_{i}\) contains at least two data points, then proceed to step (b). Otherwise, the trial vector \(U_{i}\) will be a reproduction of \(X_{i}\).

Step (b): A paired vector \(V_{i}\) is randomly selected from the current population for crossover. Next, we randomly select two cluster centers \({\textbf {m}}^\mathrm{{first}}\) and \({\textbf {m}}^\mathrm{{second}}\) from \(V_{i}\).

Step (c): We calculate the distance between \({\textbf {m}}^\mathrm{{first}}\) and \({\textbf {m}}^\mathrm{{second}}\) and set the radius \(rd_{j}\) of the swapping space to half of this distance. The midpoint of the two centers is taken as the center O. Specifically, the center O and the radius \(rd_{j}\) are calculated as

$$\begin{aligned} O=\frac{{\textbf {m}}^\mathrm{{first}}+{\textbf {m}}^\mathrm{{second}}}{2}, \end{aligned}$$
(4)
$$\begin{aligned} rd_{j}=\frac{\mathrm{{Cosine}}\_\mathrm{{dist}}({\textbf {m}}^\mathrm{{first}},{\textbf {m}}^\mathrm{{second}})}{2}. \end{aligned}$$
(5)

Then, a hypersphere representing the exchanging subspace is formed using O as the center and \(rd_{j}\) as the radius. Through the swapping operator, the cluster numbers of different individuals can vary and the corresponding cluster structures also change. Moreover, as the proposed swapping operator is not index-based, the genes in some positions of an individual have the opportunity to learn information from the genes in other positions. Since the subspace swapping strategy increases the population diversity, it helps to enhance the exploration ability of our AEEA algorithm. Note that the subspaces determined by the two vectors are always consistent, because they represent the same area in the spatial layout.

Step (d): For the target vector \(X_{i}\), we use the consistent subspace defined in Step (c) to retrieve the subset of the cluster centroids in \(X_{i}\). Then, we exchange the selected cluster centroids of \(V_{i}\) and \(X_{i}\) to generate the trial vector as

$$\begin{aligned} U_{i}=V_{i}\otimes X_{i}, \end{aligned}$$
(6)

where \(\otimes \) denotes the exchange procedure. As shown in Fig. 4, the cluster numbers in the swapping subspaces of \(X_{i}\) and \(V_{i}\) are distinct. This indicates that the cluster number decoded from the trial vector \(U_{i}\) may differ from those decoded from the parent vectors. In this way, the new crossover operator supports the flexibility of our AEEA algorithm in adaptively adjusting the cluster number. Practically, if the cluster number in the trial vector \(U_{i}\) is less than the minimum cluster number \(K_\mathrm{{min}}\) = 2 or exceeds the maximum number \(K_\mathrm{{max}}=\sqrt{N}\), the cluster number of \(U_{i}\) will be set to that of \(X_{i}\), and the cluster centroid in the individual \(X_{i}\) will be replaced by the nearest cluster centroid in the paired vector \(V_{i}\).

To summarize, the proposed crossover operator not only enables flexible adjustment of the cluster number, but also tackles the problem of the cross-dimension learning error. The target and paired vectors exchange information through the consistent subspace. In this way, the trial vector can learn the spatial information to reach a better clustering structure, in contrast to the crossover operator used in the traditional genetic algorithm, which performs index-based crossover. Thus, the new crossover operator is well suited to our adaptive encoding scheme: by effectively exploiting the spatial information, the potential error from cross-dimension learning is avoided.
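The following sketch illustrates the subspace-swapping crossover of Steps (a)–(d). It assumes that the cosine distance of Eq. (13) is also used to test whether a centroid falls inside the hypersphere, and it simplifies the constraint handling of Step (d) to a plain fallback; both are assumptions rather than the exact procedure.

```python
# Sketch of the subspace-swapping crossover (Steps (a)-(d)).
import numpy as np

def cosine_dist(a, b):
    """Cosine distance: 1 minus the cosine similarity."""
    return 1.0 - np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

def subspace_crossover(X_i, V_i, cr=0.4, k_max=None, rng=None):
    """X_i, V_i: (K, D) centroid arrays of the target and paired individuals."""
    rng = rng or np.random.default_rng()
    k_max = k_max if k_max is not None else len(X_i) + len(V_i)
    if rng.random() >= cr or len(X_i) < 2:                        # Step (a)
        return X_i.copy()
    first, second = rng.choice(len(V_i), size=2, replace=False)   # Step (b)
    O = (V_i[first] + V_i[second]) / 2.0                          # Eq. (4)
    rd = cosine_dist(V_i[first], V_i[second]) / 2.0               # Eq. (5)
    in_x = np.array([cosine_dist(c, O) <= rd for c in X_i])       # centroids of X_i in the subspace
    in_v = np.array([cosine_dist(c, O) <= rd for c in V_i])       # centroids of V_i in the subspace
    trial = np.vstack([X_i[~in_x], V_i[in_v]])                    # Step (d): swap the subspace contents
    if not 2 <= len(trial) <= k_max:    # simplified fallback; the paper instead replaces each
        return X_i.copy()               # centroid of X_i with its nearest centroid in V_i
    return trial
```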

Mutation

The traditional genetic algorithm typically uses the uniform distribution as the mutation mechanism, so cluster centroids are perturbed in a uniform way. However, real-world problems may be affected by noise that cannot be accurately modeled with a uniform distribution. We instead adopt the Gaussian distribution to fine-tune the cluster centroids and thus enhance the exploitation ability. For each cluster centroid decoded from the mutant vector \(M_{i}\), we first generate a random number \(r_{2}\in [0,1]\). If \(r_{2}\) is smaller than the mutation probability \(M_{u}\), the cluster centroid in the mutant vector \(M_{i}\) is updated as

$$\begin{aligned} {\varvec{{m}}}_k={\varvec{{m}}}_k+G(0,0.1)\times ({\varvec{{ub}}}-{\varvec{{lb}}}), \end{aligned}$$
(7)

where \({\varvec{{m}}}_k\) represents the kth cluster centroid decoded from the mutant vector; G(0, 0.1) generates random numbers following the Gaussian distribution with mean 0 and standard deviation 0.1; \({\varvec{{ub}}}\) and \({\varvec{{lb}}}\) are the upper and lower bounds of the range of the feature vectors, respectively. The Gaussian mutation fine-tunes the cluster centroids within small neighborhoods, so local and global searches can be carried out at the same time.
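A small sketch of this Gaussian mutation (Eq. (7)) is given below; the function name and the NumPy usage are illustrative.

```python
# Sketch of the Gaussian mutation in Eq. (7): each centroid of the mutant vector is
# perturbed with probability M_u by N(0, 0.1) noise scaled to the feature range.
import numpy as np

def gaussian_mutation(centroids, lb, ub, mu=0.1, rng=None):
    """centroids: (K, D) array; lb, ub: (D,) lower/upper bounds of the feature vectors."""
    rng = rng or np.random.default_rng()
    mutant = np.array(centroids, dtype=float)
    for k in range(len(mutant)):
        if rng.random() < mu:  # mutation probability M_u
            mutant[k] += rng.normal(0.0, 0.1, size=mutant.shape[1]) * (ub - lb)  # Eq. (7)
    return mutant
```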

Objective evaluation

For document clustering, the distance between cluster centroids should be maximized while the intra-cluster distance is expected to be minimized. Both internal and external indices [41] can be adopted for validating the clustering results. In this work, we use the Davies-Bouldin (DB) [42] and I [43] indices owing to their popularity. The two indices are briefly reviewed as follows.

DB index To calculate the DB index, we first calculate an index value \(R_{i}\) that corresponds to each cluster centroid

$$\begin{aligned} R_{i}=\max _{j,j\ne i}\left( \frac{e_{i}+e_{j}}{D_{ij}}\right) , \end{aligned}$$
(8)

where \(e_{i}=(1/N_{i})\sum _{{\varvec{{f}}}_{i}\in C_{i}} \mathrm{{dist}}({\varvec{{f}}}_{i},{\varvec{{m}}}_{i})\) is the average error within cluster i; \({\varvec{{m}}}_{i}\) denotes the centroid of cluster i; \(D_{ij}=\mathrm{{dist}}({\varvec{{m}}}_{i},{\varvec{{m}}}_{j})\) is the distance between the cluster centroids i and j; \(N_{i}\) denotes the number of documents in cluster \(C_{i}\); and K is the number of clusters. Then, the DB index is formulated as

$$\begin{aligned} \mathrm{{DB}}(K)=\frac{1}{K}\sum ^{K}_{i=1} R_{i}. \end{aligned}$$
(9)

Since the clustering hypothesis is to maximize the inter-cluster distance and minimize the intra-cluster distance, a smaller DB index indicates better clustering results.
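For reference, a direct implementation of Eqs. (8)–(9) could look as follows; it assumes non-empty clusters and takes the distance function as an argument.

```python
# Sketch of the DB index in Eqs. (8)-(9); 'labels' assigns each document to a cluster.
import numpy as np

def db_index(F, labels, centroids, dist):
    """F: (N, D) documents; centroids: (K, D); labels: (N,) integer cluster ids."""
    K = len(centroids)
    e = np.array([np.mean([dist(f, centroids[i]) for f in F[labels == i]])
                  for i in range(K)])                               # intra-cluster error e_i
    R = [max((e[i] + e[j]) / dist(centroids[i], centroids[j])
             for j in range(K) if j != i)                           # Eq. (8)
         for i in range(K)]
    return np.mean(R)                                               # Eq. (9)
```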

I index The I index combines three mutually competing factors and is defined as

$$\begin{aligned} I(K)=\left( \frac{1}{K}\times \frac{P}{E_{k}}\times D_\mathrm{{max}}\right) ^{p}, \end{aligned}$$
(10)

where K is the number of clusters; P is a dataset-dependent parameter, which is commonly set to the number of samples N; \(E_k\) and \(D_\mathrm{{max}}\) are calculated as

$$\begin{aligned} E_{k}=\sum ^{K}_{k=1}\sum _{{\varvec{{f}}}_{i}\in C_{k}}\mathrm{{dist}}({\varvec{{f}}}_{i},{\varvec{{m}}}_{k}), \end{aligned}$$
(11)
$$\begin{aligned} D_\mathrm{{max}}=\max _{i,j=1,2,...,K}\mathrm{{dist}}({\varvec{{m}}}_{i},{\varvec{{m}}}_{j}), \end{aligned}$$
(12)

where \({\varvec{{m}}}_{k}\) denotes the kth cluster centroid, \({\varvec{{f}}}_{i}\) is the ith document and \(C_{k}\) represents the kth cluster. Besides, the parameter \(p>1\) controls the contrast of the I index for different clustering configurations. Generally, a larger I index indicates better clustering results.
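Similarly, Eqs. (10)–(12) can be computed as below; P is set to N as stated above, and p = 2 is only a common choice from the literature, not a value fixed by this paper.

```python
# Sketch of the I index in Eqs. (10)-(12), with P = N and a user-chosen contrast parameter p.
def i_index(F, labels, centroids, dist, p=2):
    """F: (N, D) documents; centroids: (K, D); labels: (N,) integer cluster ids."""
    N, K = len(F), len(centroids)
    E_k = sum(dist(F[i], centroids[labels[i]]) for i in range(N))      # Eq. (11)
    D_max = max(dist(centroids[i], centroids[j])
                for i in range(K) for j in range(K) if i != j)         # Eq. (12)
    return ((1.0 / K) * (N / E_k) * D_max) ** p                        # Eq. (10), with P = N
```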

To obtain the above indices, the distances between documents need to be calculated. It is generally accepted that the distance measure, or equivalently, the similarity measure is important for document clustering [44]. Previously, different types of distance measures have been adopted to fit various scenarios. In this work, we conduct experiments to evaluate the clustering performance using three commonly used distance measures, i.e., the cosine distance, the Euclidean distance and the Manhattan distance. The distance measures can be calculated as

$$\begin{aligned} \mathrm{{Cosine}}\_\mathrm{{dist}}({\varvec{{f}}}_{i},{\varvec{{f}}}_{j})=1-\frac{{\varvec{{f}}}_{i}\cdot {\varvec{{f}}}_{j}}{\parallel {\varvec{{f}}}_{i}\parallel \parallel {\varvec{{f}}}_{j}\parallel }, \end{aligned}$$
(13)
$$\begin{aligned} \mathrm{{Euclidean}}\_\mathrm{{dist}}({\varvec{{f}}}_{i},{\varvec{{f}}}_{j})=\parallel {\varvec{{f}}}_{i}-{\varvec{{f}}}_{j}\parallel _{2}, \end{aligned}$$
(14)
$$\begin{aligned} \mathrm{{Manhattan}}\_\mathrm{{dist}}({\varvec{{f}}}_{i},{\varvec{{f}}}_{j})= \Vert {\varvec{{f}}}_{i}-{\varvec{{f}}}_{j}\Vert _{1}, \end{aligned}$$
(15)

where \({\varvec{{f}}}_{i}\) and \({\varvec{{f}}}_{j}\) denote two feature vectors (e.g., two documents or two cluster centroids). The effects of the three distance measures are examined in our experimental section.
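The three measures, with the cosine distance written as one minus the cosine similarity, can be implemented directly:

```python
# The three distance measures in Eqs. (13)-(15).
import numpy as np

def cosine_dist(a, b):
    return 1.0 - np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

def euclidean_dist(a, b):
    return np.linalg.norm(a - b)

def manhattan_dist(a, b):
    return np.sum(np.abs(a - b))
```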

Selection

First, we calculate the fitness of the mutant vector \(M_{i}\). After the cluster centroids are determined, samples will be assigned accordingly. For each data point, we calculate its distances to all the cluster centroids in \(M_{i}\) and assign it to the nearest cluster.

In the evolutionary process, many zero values remain in the feature vectors, because some words do not appear in a given document when the word frequencies are counted. The clustering results may therefore contain empty clusters, i.e., clusters with fewer than two data points. To relieve this side effect, we adopt a random reassignment method [40] to construct clusters. The clustering results are evaluated with the validation indices; we adopt the DB index and the I index in this work. Then, the current individual \(X_{i}\) and the mutant vector \(M_{i}\) are compared, and the one with the better fitness proceeds to the next generation:

$$\begin{aligned} X_{i,G+1}= {\left\{ \begin{array}{ll} M_{i,G}, &{} \text {if } f(M_{i,G}) \text { is better than } f(X_{i,G}),\\ X_{i,G}, &{} \text {otherwise}. \end{array}\right. } \end{aligned}$$
(16)

Single step of K-means

In addition to the empty clusters, the weak local search ability may also lead to zero values in many dimensions. Since a large number of zero values increases the sparsity of the feature matrix and thus hinders optimization, we introduce a single step of K-means to tackle this issue. After selection, each data point has its own cluster membership. The single step of K-means updates each cluster centroid as the average of all data points within the current cluster, so the centroid gains more information from the whole cluster. In this way, the single step of K-means not only prevents the feature matrix from becoming too sparse, but also speeds up the local search. This strategy enables our algorithm to exploit the clustering space well while enhancing the computational efficiency.
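A sketch of the assignment and the single K-means update is shown below; the empty-cluster reassignment of [40] is omitted and only indicated in a comment.

```python
# Sketch of the selection-time refinement: assign each document to its nearest centroid,
# then perform one K-means-style update that moves every centroid to the mean of its members.
import numpy as np

def single_kmeans_step(F, centroids, dist):
    """F: (N, D) documents; centroids: (K, D); returns (labels, updated centroids)."""
    labels = np.array([min(range(len(centroids)), key=lambda k: dist(f, centroids[k]))
                       for f in F])
    updated = np.array(centroids, dtype=float)
    for k in range(len(centroids)):
        members = F[labels == k]
        if len(members) > 0:          # empty clusters are handled by random reassignment [40]
            updated[k] = members.mean(axis=0)
    return labels, updated
```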

In summary, the overall procedure of AEEA is summarized in Algorithm 1. The algorithm takes advantage of our general framework and is thus highly effective for the Chinese document clustering task. Besides, note that the encoding scheme we adopt is more efficient than the traditional fixed-length encoding scheme in representing solutions of uncertain lengths. The crossover and mutation operations are designed by adding spatial operators to the traditional genetic algorithm, and they can also be extended to other evolutionary algorithms for similar problems.

Algorithm 1 The overall procedure of AEEA
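Since Algorithm 1 is given as a figure, the following schematic outline (not the authors' exact pseudocode) summarizes how the pieces sketched above fit together; it assumes those helper functions are in scope and only approximates the ordering of the selection and K-means steps.

```python
# Schematic outline of the AEEA main loop, assuming the helper sketches above
# (init_population, subspace_crossover, gaussian_mutation, single_kmeans_step,
# cosine_dist, db_index) are defined in the same module.
import numpy as np

def aeea(M, np_size=100, cr=0.4, max_fes=10**6):
    rng = np.random.default_rng()
    lb, ub = M.min(axis=0), M.max(axis=0)
    pop = init_population(M, np_size, rng)
    g_max = max_fes // np_size
    for g in range(g_max):
        mu = 0.1 - 0.06 * g / g_max                          # Eq. (17)
        for i in range(np_size):
            partner = pop[rng.integers(np_size)]
            trial = subspace_crossover(pop[i], partner, cr, int(np.sqrt(len(M))), rng)
            trial = gaussian_mutation(trial, lb, ub, mu, rng)
            labels_t, trial = single_kmeans_step(M, trial, cosine_dist)
            labels_c, cur = single_kmeans_step(M, pop[i], cosine_dist)
            # Eq. (16): keep the fitter vector (for the DB index, smaller is better)
            if db_index(M, labels_t, trial, cosine_dist) < db_index(M, labels_c, cur, cosine_dist):
                pop[i] = trial
            else:
                pop[i] = cur
    return pop
```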

Experiments

Experimental setup

To validate the accuracy and effectiveness of our framework, we conduct extensive experiments for performance evaluation. As summarized in Table 1, the experiments are based on six Chinese document datasets drawn from the Fudan corpus. Dataset D3-600 contains samples from three categories, i.e., computer, economy and politics (“计算机,经济,政治” in Chinese), with 200 documents per category. To construct dataset D4-800, we select different topics, i.e., environment, history, agriculture and sports (“环境,历史,农业,体育” in Chinese). To validate the scalability of our AEEA algorithm, we then merge D3-600 and D4-800 into a larger dataset D7-1400, which therefore contains seven categories, i.e., “计算机,经济,政治,环境,历史,农业,体育” in Chinese. Furthermore, we increase the volume of documents in D7-3500 to 500 documents per category, resulting in 3500 samples in total. In D10-2000, we increase the category number to ten, i.e., politics, environment, history, agriculture, sports, transportation, education, military, medicine and art (“政治,环境,历史,农业,体育,交通,教育,军事,医学,艺术” in Chinese), with 2000 samples in total. In D12-4500, we further enlarge the category number to 12, i.e., computer, economy, politics, environment, history, agriculture, sports, transportation, education, military, medicine and art (“计算机,经济,政治,环境,历史,农业,体育,交通,教育,军事,医学,艺术” in Chinese), with 4500 samples in total.

Table 1 Properties of the datasets
Table 2 Comparing the document clustering results (mean and standard deviation) of the AEEA algorithm with its competitors using the DB index for fitness evaluation

We compare AEEA with a classical algorithm, K-means [16], and four state-of-the-art algorithms, i.e., Density-K-means [14], KSI-H-K [21], the Two-phased Genetic Algorithm (TGA) [32] and the Spectral Clustering algorithm with PSO (SCPSO) [45]. K-means requires a pre-defined cluster number as input to partition the document dataset into K clusters. Density-K-means takes the shape of the clusters into consideration and improves the performance of traditional K-means. KSI-H-K constructs a three-layer model whose layers handle big clusters, noise and clustering-result refinement, respectively. TGA is a two-phased genetic algorithm that allows the crossover point to be anywhere in the chromosome. SCPSO combines spectral clustering with the PSO method: the spectral clustering helps to recognize non-convex distributions, while the PSO module enables the hybrid algorithm to better explore the global space. For all competitors, we adopt the parameter settings recommended in their original papers. The K-means and Density-K-means algorithms and the first and third layers of KSI-H-K terminate when the cluster centroids are no longer updated, and the second layer of KSI-H-K ends when the merge condition is reached. Besides, TGA, SCPSO and the proposed AEEA are assigned a pre-defined maximum number of fitness evaluations of \(10^{6}\).

For the AEEA algorithm, the population size np is set to 100. The minimum and maximum cluster numbers are set to 2 and \(\sqrt{N}\), respectively. The crossover rate CR is set to 0.4, while the mutation rate \(M_{u}\) varies with the generation number G as

$$\begin{aligned} M_{u}=0.1-0.06\times \frac{G}{G_{\text {max}}}, \end{aligned}$$
(17)

where \(G_{\text {max}}\) denotes the maximum number of generations and is set to \(10^{6}/np\). All the simulations were conducted on a DELL computer with an \(\mathrm Intel^{TM}\) \(\mathrm Core^{TM}\) CPU at 3.30 GHz. To avoid undesired random disturbance, we run each algorithm 30 times on every dataset and report the average values for comparison.

Following [46], we adopt the commonly used precision, recall and F-score to examine the performance of competing algorithms. The three measures are defined as

$$\begin{aligned} \mathrm{{Precision}}=\frac{TP}{TP+FP}\times 100\%, \end{aligned}$$
(18)
$$\begin{aligned} \mathrm{{Recall}}=\frac{TP}{TP+FN}\times 100\%, \end{aligned}$$
(19)
$$\begin{aligned} F=\frac{2\times \mathrm{{Precision}}\times \mathrm{{Recall}}}{\mathrm{{Precision}}+\mathrm{{Recall}}}, \end{aligned}$$
(20)

where TP, FP and FN represent the true positive, the false positive and the false negative respectively. As the F-score is a way of combining the precision and recall of the model, it is usually more convincing than the precision and recall for performance evaluation [47].

Experiment results

Validation by the DB Index

First, we use the DB index as the fitness evaluation to guide the optimization direction. Besides, following [46, 48], three external indices are adopted to evaluate the performance of the competing algorithms. Table 2 shows the mean clustering results and standard deviations, including the average DB index value, the precision, the recall and the F-score. In addition, at the bottom of Table 2, we provide the average F-score of the different algorithms. For K-means and Density-K-means, the ground-truth cluster number is adopted as prior information. KSI-H-K requires a cluster number for the initial data clustering, and its second layer sets up the AGNES parameter to perform cluster merging, which can adaptively adjust the cluster number to some extent. The remaining algorithms, namely TGA, SCPSO and the proposed AEEA, adapt the cluster number during the optimization. The best results in Table 2 are marked in boldface.

Table 3 Comparing the document clustering results (mean and standard deviation) of the AEEA algorithm with its competitors using the I index for fitness evaluation

According to the F-score, AEEA exhibits the best performance on all the datasets, with about \(6\%\) improvement over the second-best algorithms. The K-means algorithm obtains the lowest F-score. Although obvious improvements are observed for Density-K-means, its performance is still far from satisfactory; worse, Density-K-means degrades significantly when the number of documents increases. The two competing evolutionary algorithms and AEEA can determine the cluster number adaptively and thus obtain relatively better performance. The F-score of AEEA is higher than those of the KSI-H-K, TGA and SCPSO algorithms, which may come from the innovative design of our evolutionary operators and the appropriate utilization of the single step of K-means. Besides, although the recall of TGA and SCPSO may be comparable to that of AEEA on some datasets, these two algorithms obtain much smaller F-scores than AEEA, which indicates that TGA and SCPSO would likely fail to correctly find all samples in the same group.

Fig. 5 The DB index convergence curves of TGA, SCPSO and AEEA algorithms. a The D4-800 dataset. b The D7-3500 dataset

Compared to TGA and SCPSO, the AEEA algorithm tackles the problem of cross-dimension learning error, which enables the exchanging individuals to gain the correct information; it is thus able to find a better fitness value and obtain a higher clustering precision. SCPSO shows better performance than TGA because it adopts spectral clustering, which helps to detect clusters of arbitrary shapes, so SCPSO can discover the correct cluster structure in an efficient and accurate way. Although prior information on the underlying cluster number is used, the three K-means-based algorithms cannot identify clusters with non-convex shapes and thus may fail to find the complex underlying clustering structure.

Table 4 P-values of Wilcoxon rank sum tests and Kruskal Wallis (KW) tests between AEEA and its competitors using the DB index for fitness evaluation on six datasets
Table 5 P-values of Wilcoxon rank sum tests and Kruskal Wallis (KW) tests between AEEA and its competitors using the I index for fitness evaluation on six datasets

Further, the convergence curves of AEEA are compared with those of TGA and SCPSO. The K-means-based algorithms are excluded because they stop within 10-100 iterations. Taking the D4-800 and D7-3500 datasets as examples, Fig. 5 presents the convergence curves of the different algorithms in terms of the DB index. As AEEA obtains lower DB index values, it is always able to reach a better solution than TGA and SCPSO. As shown in Fig. 5b, when the volume and the number of categories of documents increase, SCPSO may fall into a local optimum at the initial stage and thus fail to reach the global optimal solution in the later stages. Besides, as the performance of spectral clustering is quite sensitive to small changes in the similarity graph, the clustering results of SCPSO rely heavily on the model parameters.

Validation by the I Index

Next, we use the I index as the fitness evaluation, and the results are presented in Table 3, with the best results marked in boldface. It can be observed that our AEEA algorithm outperforms other methods on all datasets.

We also take the D4-800 and D7-3500 datasets as examples to examine the convergence of the competing algorithms. From Fig. 6a, we can see that AEEA obtains a larger I index value, showing its ability to reach a better solution than TGA and SCPSO. In both examples presented in Fig. 6, the AEEA algorithm obtains better fitness than the other two evolutionary algorithms.

Fig. 6 The I index convergence curves of TGA, SCPSO and AEEA algorithms. a The D4-800 dataset. b The D7-3500 dataset

Significance tests

To examine the significance of the performance improvements of AEEA, we conduct statistical tests based on the fitness evaluation indices. In particular, the Wilcoxon rank-sum test [49] is performed for one-on-one comparisons, while the Kruskal-Wallis (KW) test [50] is conducted for multiple comparisons. Tables 4 and 5 present the significance tests using the DB and I indices. The symbol (+) indicates that the performance of AEEA is significantly better than that of the competing algorithm(s), whereas the symbol (-) indicates the opposite. In both cases, the differences between the results of AEEA and those of the other algorithms are statistically significant, since the p-values are always smaller than 0.05, showing the significantly better performance of AEEA in all comparisons.
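Such tests can be reproduced, for example, with SciPy; the score lists below are only illustrative placeholders for the 30 per-run fitness values of each algorithm.

```python
# Minimal sketch of the two significance tests on per-run fitness values.
from scipy.stats import ranksums, kruskal

aeea_scores = [0.52, 0.50, 0.51]    # illustrative values only; 30 runs in practice
tga_scores = [0.61, 0.63, 0.60]
scpso_scores = [0.58, 0.57, 0.59]

print(ranksums(aeea_scores, tga_scores).pvalue)               # pairwise Wilcoxon rank-sum test
print(kruskal(aeea_scores, tga_scores, scpso_scores).pvalue)  # multi-group Kruskal-Wallis test
```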

Further study for AEEA

In this section, we conduct a detailed investigation to afford a comprehensive understanding of AEEA.

Clustering results under different distance measurements

In this part, we run AEEA under different distance measures, i.e., the Manhattan distance, the Euclidean distance and the cosine distance. The results are presented in Table 6, with the best results marked in boldface. As can be seen, AEEA always obtains the best clustering performance when the cosine distance is adopted, which indicates that the cosine distance is more suitable for measuring the similarity among documents. With the increase of the data volume and dimension, the clustering results under the Euclidean and Manhattan distances become worse. Thus, we equip AEEA with the cosine distance measure in all experiments in this manuscript.

Table 6 Comparing the clustering results (mean and standard deviation) of the AEEA algorithm under three distance measurements using the DB index for fitness evaluation

Parameter investigation

There are three controlling parameters in AEEA, i.e., the crossover rate CR, the mutation rate \(M_{u}\) and the population size np. Different parameter settings have a direct influence on the clustering results. Following the classical work [51], we set the population size np to 100. Physically, the mutation rate represents the disturbance probability of a cluster center, so we design an adaptive adjustment process that decreases the mutation probability as the iteration number increases; this reduces the disturbance of the population in the late evolution stage. We conduct further investigations on the setting of CR by testing the performance of AEEA with CR adjusted from 0.1 to 0.9, taking the F-score as the comparative indicator. Figure 7 shows the effects of different CR values on the final clustering performance. On all datasets, AEEA reaches its best performance when CR is set to 0.4, and hence we empirically set CR to 0.4 to ensure that AEEA always works efficiently and accurately.

Fig. 7 The F-score of AEEA on six datasets with different CR values. a Using DB index. b Using I index

Fig. 8 Running time results of the three evolutionary methods on the D4-800 and D7-3500 datasets

Time complexity analysis

This part further compares the time complexity of AEEA and its competitors. The evolutionary procedure is generally divided into two parts: (1) the genetic operators and (2) the fitness evaluation. In the first part, the crossover and mutation operators have comparable time complexity, i.e., O(K) per individual, and the single step of K-means takes an additional O(K) cost. The fitness evaluation takes O(\(N\times K\)) time. Thus, our AEEA algorithm requires a total cost of O(\(\text {MaxFes} \times N\times K\)), where \(\text {MaxFes}\) denotes the maximum number of fitness evaluations. The compared TGA and SCPSO have exactly the same time complexity as AEEA. We also present the practical running times of the three evolutionary algorithms in Fig. 8; the time consumed by AEEA is slightly higher than that of the other two algorithms, but considering the significantly improved accuracy, this increase in running time is acceptable.

Conclusion

In this work, we propose a general evolutionary framework for Chinese document clustering. This novel framework includes the evolutionary approach as the main clustering technique and thus provides opportunities for designing new global optimization models. Specifically, within this framework, we design an Adaptive Encoding-based Evolutionary Approach (AEEA) to adaptively learn the cluster number. The experimental results show that our AEEA algorithm outperforms all competitors on all datasets. To summarize, our framework is flexible and scalable for solving practical document clustering tasks and opens promising prospects for applications in many intelligent systems.

Future directions of this work may include the enhancement of the AEEA model. For example, advanced models that are less sensitive to distance measures can be considered to remove the noisy points and to discover clusters of arbitrary shapes.