Chunking and cooperation in particle swarm optimization for feature selection

Bio-inspired optimization aims to adapt observed natural behavioral patterns and social phenomena to the efficient solution of complex optimization problems, and is currently attracting considerable attention. However, researchers have recently highlighted a mismatch between what the field needs and its actual trend: while innovative contributions are needed, a common practice is to re-iterate existing knowledge in a different form. This paper aims to help fill that gap. More precisely, we first provide new examples of this problem by describing the concepts of chunking and cooperative learning. Second, focusing on particle swarm optimization (PSO), we propose a novel bridge between these two notions, adapted to the problem of feature selection. In the experiments, we investigate the practical relevance of our approach and explore both its strengths and limitations. The results indicate that the approach is mainly suitable for large datasets, and that further research is needed to improve its computational efficiency and to ensure the independence of the sub-problems defined by chunking.


Introduction
Bio-inspired computation has emerged as one of the most studied branches of artificial intelligence over the last decades. The aim of such approaches is to mimic principles of real life and turn them into simple algorithmic solutions that can be applied to complex real-world problems [1].
In particular, over the last few decades, optimization has been strongly affected by this concept. The aim of the proposed approaches, which fall under the umbrella of bio-inspired (or nature-inspired) optimization, is to mimic the behavior and internal functioning of physical, biological and social systems in order to design enhanced optimization algorithms with potential value in real-world applications [2]. In the literature, the aforementioned approaches are generally referred to as a special kind of metaheuristics, most often population-based [3]. Such methods are in many cases based either on evolutionary computation or swarm intelligence [4]. Nowadays, such ideas have shown promise in different optimization problems, in which the computational complexity of exact algorithms grows exponentially with the problem dimension [5], and the performance of classical approximate algorithms (e.g. greedy methods) deteriorates significantly with the size of the search space [6]. In fact, in the age of big data, the need to solve such problems with a large number of decision variables is becoming increasingly crucial.
Hence, this growing attention has resulted in a considerable rise in the number of publications. However, this rapid increase has led to various methods that are similar or even identical, although presented in another form and under a different name. Readers are referred to [7] for more details and clarifications on these issues, which challenge research in the field of metaheuristics, and to [8][9][10] for examples of such approaches. Moreover, as highlighted in the next section, a related problem is that some papers extend or apply existing approaches without cross-referencing them. Therefore, instead of re-iterating existing knowledge in a different form, a line of research that could make a real contribution is to combine already-validated ideas in new ways and to propose designs suited to specific optimization problems. In this paper, we first strengthen awareness of the problem of knowledge re-iteration by examining two bio-inspired concepts that can be used to solve complex optimization problems: chunking and cooperative learning. The former, inspired by human behavior, consists in grouping basic information units to build a higher-level solution. Regarding the latter, it can be noted that swarm intelligence approaches are by principle based on cooperation, and that most bio-inspired algorithms belong to the category of swarm intelligence [2], even if some of them have been criticized, for instance in [7]. In fact, while competition among individuals usually improves their performance, much greater improvement can be achieved through cooperation, as is now widely accepted in both real and artificial life [11].
More precisely, our first contribution in this paper is to give examples, concerning these two concepts, that were not provided in the preceding papers (e.g. [8]). Second, we propose a new design combining the two approaches in a complementary way: chunking decomposes a high-dimensional problem into independent sub-problems that can be optimized more effectively than the original problem, while cooperative learning shares the sub-components of the solution (one per sub-problem) and concatenates them in the most efficient way. In particular, we focus on particle swarm optimization (PSO) as a typical and one of the best-known bio-inspired optimization approaches. PSO is a swarm intelligence algorithm inspired by the intelligent collective behavior of animals (e.g. birds and fish), and was originally designed for continuous optimization problems [12]. It has also been applied and adapted to combinatorial optimization problems [13].
In this work, after reviewing the state of the art of the two concepts, we project them onto PSO and propose a new design that matches the problem of feature selection (FS). FS is a combinatorial optimization problem arising in the field of machine learning (ML); its aim is to improve the generalization ability of learning algorithms by removing redundant, irrelevant and noisy features. With n original features, there are 2^n possible feature subsets, so the search space grows exponentially with the number of features. FS therefore remains a challenging problem, especially on high-dimensional data, due to its huge search space.
There are three main approaches to FS [14]. Filter approaches evaluate feature subsets based on predefined metrics, most often drawn from information theory (e.g. entropy, mutual information (MI)). An example of a classical filter approach is correlation-based feature selection [15]; more information on filter approaches can be found, for instance, in [16]. Wrapper and embedded approaches both use an ML algorithm to score subsets of features according to their predictive capability. The main difference is that, in embedded approaches, the selection is done during the training process, while in wrappers the ML classifier is used as a black box. Therefore, optimization algorithms such as PSO have been adopted more frequently in wrappers than in embedded approaches [17]. Comparing wrappers and filters, the former generally yield better results than the latter, but are more computationally demanding [14]. Thus, in this paper, we aim to take advantage of both approaches: the chunking of the problem is based on a filter approach, and each sub-problem (chunk) is optimized with a wrapper approach.

The rest of the paper is organized as follows: In the next section, we present a literature review while highlighting the problem of knowledge re-iteration. In Section 3, we expose our design combining chunking and cooperative learning for FS. Section 4 presents the experiments. Finally, a conclusion is given.

Literature review
In this review, we focus on three issues. First, we present the current work on the use of chunking in optimization, while exposing the problem of re-iterating knowledge in a different form. Next, while highlighting the same problem, we describe the well-known cooperative learning approaches that have been adopted in PSO in particular and swarm intelligence in general. Finally, we project these ideas onto the FS problem.

Chunking
The first bio-inspired concept considered in this paper is chunking. Its purpose is to limit the cost of computation by dividing the problem into sub-problems that are easier to solve than the initial problem. It was first introduced in psychology [19] and then extended to ML [20,21]. To the best of our knowledge, it was first introduced into optimization in [22], and there have been few attempts to apply this work (e.g. [23] and [24]). Other works interested in using this idea can be found within evolutionary computation. For instance, [25] integrated it into a variable-length genetic algorithm (GA) (though ignoring the earlier work of [23]). The basic idea of this GA variant is to use chromosomes of increasing length in order to solve the optimization problem progressively, through solving smaller sub-problems. This idea was also incorporated into PSO [26].
The concept of chunking was also used for other problems: [27] adopted it for the problem of optimal control, and [28] for redundancy elimination of network packets.
We see that, although the idea of dividing the problem into sub-problems is widely embraced in optimization, few works have directly adopted the concept and the term of chunking. Moreover, there is a lack of connection and cross-referencing among some of them. That is, works on evolutionary optimization (e.g. [25]) often do not cite previous works on local search (e.g. [23]). At the same time, other related concepts are based on a similar idea. An example of such a concept introduced into PSO is clustering, which has the same objective of grouping solution attributes. One of the best-known articles in this respect is [29], which investigates a clustering PSO approach for dynamic optimization problems; the approach consists in generating multiple swarms using single-linkage hierarchical clustering. Here, we adopt the term chunking as being associated with, and more suited to, learning and bio-inspired behaviors.
In addition, other optimization approaches are also inspired by similar ideas. For example, this concept is incorporated into the newly proposed metaheuristic fixed set search [30]. Also, the referenced paper highlighted a number of other methods, known as matheuristics [31], which use mathematical programming to decompose the problem into independent sub-problems, e.g. POPMUSIC [32]. There are also other methods which have leveraged the divide-and-conquer paradigm [33].
We can conclude from this part that the concept of chunking has been introduced in different forms and that a comprehensive literature on it is lacking. In other words, a number of papers have re-introduced the concept of chunking under a different name, while few attempts have been made to improve the initially proposed idea. It is therefore necessary to unify these different works in order to take advantage of their respective ideas. In this paper, we use this concept as a basis for including cooperative learning in PSO.

Cooperative learning
In this part, it should first be mentioned that the fundamental principle of PSO is based on learning through cooperation and information sharing, in such a manner that each particle participates in the evolution of the population and in the improvement of the solution. However, in the canonical version, all the particles follow the global best at each iteration, so the algorithm can easily be trapped in a local optimum. Therefore, several learning approaches with a more decentralized sharing of information have been proposed, such as comprehensive learning [34]; we refer to [35] for other examples of learning approaches adopted in PSO. However, as illustrated in [36], in these learning approaches, as in the canonical PSO, it is possible that some vector components have moved closer to the optimal solution while others have moved away from it. As long as the effect of the improved components outweighs the impact of the weakened ones, these PSO variants consider the new vector an overall enhancement, even though some of its components may have shifted further from the solution. Therefore, another form of learning was introduced in [36] under the name of cooperative PSO.
We can notice a main difference between most PSO variants (including the canonical one) highlighted in the previous paragraph and the cooperative PSO: in the former, each particle forms a complete solution, while in the latter, each swarm optimizes a part of the solution that is complemented by the other swarms (we note here that a swarm could be responsible for a single variable, as shown later). However, in the literature, the term cooperative PSO has been adopted for both cases. Regarding the first, [37] introduced the concept of multi-swarm optimization, which incorporates interaction mechanisms between the different swarms through a "master-slave" model; the same design is also included in the cooperative multi-swarm PSO proposed in [38]. In addition, [39] defined a cooperative PSO with a different design: two surrogate-assisted PSO variants (classical and social learning) cooperate to find the final (or best) solution. The term surrogate refers to a function that approximates the objective function for reasons of computation time; in that paper, two PSO variants share solutions, with the classical PSO [12], assisted by a fitness estimation function, used for exploration, and a surrogate-assisted social-learning PSO [40] used for exploitation. Regarding the second case, some cooperative PSO extensions have been proposed mainly to divide large-scale problems (e.g. [41]). In this paper, we are interested in this second type, which has shown promise and is associated with chunking. Below, we trace the origin of this concept, which has been used in swarm intelligence in general and adapted to combinatorial optimization.
The basic concept of cooperative learning within swarm intelligence, as designed in [36], is to build a family of swarms, each trying to optimize one component of the solution and cooperating to find the final solution. The idea was initially realized in ant colony systems [42], where the ants cooperate using an indirect form of communication mediated by pheromones, which they deposit on the edges of the traveling salesman problem (TSP) graph while constructing solutions. This work was extended in [43], whose authors used a more diverse and dynamic form of agent communication through adaptive memory programming (AMP) processes, applying the method to the same problem (TSP). The idea was adapted to PSO in [36], where several swarms cooperatively optimize the various components of the solution vector.
We can see from the literature that most of the adopted approaches are tailored to continuous optimization. In contrast, less attention has been paid to adapting learning and cooperative approaches to combinatorial optimization. An example of such an approach is given in [44], where a discrete adaptation of the cooperative PSO was used to solve the problem of field programmable gate array placement. Other swarm intelligence approaches have also been adopted for combinatorial optimization; an example is the Multi-leader Migrating Birds Optimisation [45], which, even if the term cooperation was not used directly, was successfully applied to a combinatorial optimization problem using a cooperative approach [46]. Multi-swarm optimization [47] is another metaheuristic based on the idea of introducing cooperative learning into PSO. We can thus observe that cooperative learning is another concept that has been adopted and re-iterated in different forms. Moreover, the same term has been used in PSO to denote two different designs, as illustrated above.
In this paper, we have chosen FS as the combinatorial optimization problem to be examined. In the next section, we briefly describe the FS problem and highlight some related papers. It should be noted that our previous paper [18] defined the design combining the two approaches; the present paper extends it with a more extensive literature review and experiments. Therefore, [18] is not included in the review below.

Feature selection
Nowadays, FS has become an essential technique in data pre-processing, especially on high-dimensional data, in different ML applications [48]. Typical feature subset selection methods are sequential forward selection and sequential backward selection. However, such greedy approaches are prone to getting stuck in local optima, especially in the big data era. A global search technique is then needed to explore huge search spaces more efficiently, and bio-inspired optimization has therefore gained much attention for this problem. Concerning the use of PSO for FS in particular, it has been applied in different forms; the most used is the canonical binary PSO proposed in [13], presented in the following section. This approach has been applied, for instance, in [49,50].
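To make the greedy baseline concrete, sequential forward selection can be sketched as below. This is a minimal illustration in which a made-up additive subset score stands in for a classifier's accuracy; all names are ours and not tied to any specific library.

```python
from typing import Callable, List

def sequential_forward_selection(score: Callable[[List[int]], float],
                                 n_features: int, k: int) -> List[int]:
    """Greedy SFS: start from the empty subset and repeatedly add the
    single feature that most increases the subset score."""
    selected: List[int] = []
    for _ in range(k):
        remaining = [f for f in range(n_features) if f not in selected]
        best = max(remaining, key=lambda f: score(selected + [f]))
        selected.append(best)
    return selected

# Hypothetical additive score: only features 0 and 2 carry information.
useful = {0: 0.3, 2: 0.4}
score = lambda subset: sum(v for f, v in useful.items() if f in subset)
chosen = sequential_forward_selection(score, n_features=4, k=2)
```

With this score, the procedure first picks feature 2, then feature 0; with interacting features, however, such a greedy pass can miss the optimal subset, which motivates the global search methods discussed above.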
With regard to the concepts outlined above, to the best of our knowledge, the only paper that has used this kind of cooperation in swarm intelligence for FS is [51]. Their cooperative design uses a first ant colony system to detect the optimal feature-subset cardinality and a second to select the most relevant features based on this information. However, these two sub-problems are dependent and cannot be optimized separately.
When dealing with the FS problem, we can notice that most papers interested in the concept of chunking define it under the term clustering. For instance, in [52], the features are divided into clusters using a graph-theoretic clustering method; the idea is to form a set of clusters of independent features. Another approach is presented in [53], in which the features are divided into several clusters using a community detection algorithm after representing them in the form of a graph. Other related concepts can also be applied to FS: [26] proposed a variable-length PSO method for FS, where particles in a swarm can have different lengths, which can also be modified during the evolutionary process, and [54] proposed an approach for FS on high-dimensional datasets in which the Hamming distance is introduced as a proximity measure to update the velocity of particles.

The proposed approach
In this section, we start by showing how PSO can be applied to the FS problem and present the fitness function adopted in this paper. We then describe how we have incorporated the concepts of chunking and cooperative learning to extend this approach.

Particle swarm optimization for feature selection
PSO was initially designed for continuous optimization problems, but has been extended in several binary and discrete forms to deal with combinatorial optimization problems such as FS. In this paper, we adopt the traditional and best-known binary version, proposed in [13]. In this binary PSO, the population of particles is updated according to (1), (2) and (3); the inertia weight is updated as in (4) (more details can be found in [13]):

$$v_i^j(t+1) = w(t)\,v_i^j(t) + c_1 r_1 \left(p_i^j(t) - x_i^j(t)\right) + c_2 r_2 \left(p_g^j(t) - x_i^j(t)\right) \tag{1}$$

$$sig(v) = \frac{1}{1 + e^{-v}} \tag{2}$$

$$x_i^j(t+1) = \begin{cases} 1 & \text{if } rand_i < sig\left(v_i^j(t+1)\right) \\ 0 & \text{otherwise} \end{cases} \tag{3}$$

$$w(t) = w_{max} - \left(w_{max} - w_{min}\right)\frac{t}{t_{max}} \tag{4}$$

where $v_i^j(t)$ and $x_i^j(t)$ are the $j$th dimensions of the velocity and position vectors of particle $i$; $p_i^j(t)$ is the $j$th dimension of the best previous position of particle $i$, while $p_g^j(t)$ is the $j$th dimension of the best position found by the whole population; $r_1$ and $r_2$ are two independent uniformly distributed random variables; $c_1$ and $c_2$ are the acceleration factors; $w$ is the inertia weight; $sig$ is the sigmoid function used to transform continuous values into binary ones; $rand_i$ is a uniform random variable in $[0,1]$; and $t$ and $t_{max}$ are, respectively, the current iteration and the maximum number of iterations.
PSO can be integrated with a wrapper approach as in Algorithm 1 [55]. As can be seen in Algorithm 1, the approach consists in evaluating a fitness function fit(t) at each iteration; we describe it in the next part.
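As an illustration, one update step of the binary PSO referenced in (1)-(3) can be sketched as follows. This is a minimal, self-contained sketch with toy dimensions and a fixed inertia weight, not the authors' implementation.

```python
import numpy as np

rng = np.random.default_rng(1)

def sig(v):
    """Sigmoid transforming continuous velocities into probabilities."""
    return 1.0 / (1.0 + np.exp(-v))

def binary_pso_step(x, v, pbest, gbest, w=0.7, c1=1.5, c2=1.5):
    """One update of binary PSO: velocities stay continuous, while the
    binary positions are resampled through the sigmoid threshold."""
    r1, r2 = rng.random(x.shape), rng.random(x.shape)
    v_new = w * v + c1 * r1 * (pbest - x) + c2 * r2 * (gbest - x)
    x_new = (rng.random(x.shape) < sig(v_new)).astype(int)
    return x_new, v_new

# 5 particles over 8 features: each row is a candidate feature mask.
x = rng.integers(0, 2, size=(5, 8))
v = np.zeros((5, 8))
pbest, gbest = x.copy(), x[0].copy()
x, v = binary_pso_step(x, v, pbest, gbest)
```

In a wrapper, each resulting mask would then be scored by training the classifier on the selected columns, with pbest and gbest updated accordingly.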

Fitness function
In this part, we first emphasize that the FS optimization problem has been formulated in different ways. The reason is that, in many cases, the optimization algorithm should not only maximize the prediction accuracy (or, equivalently, minimize the prediction error), but also minimize the number of features [50]. Indeed, in addition to its computational advantage, minimizing the number of features improves the generalization capacity of the ML algorithm (i.e. its performance on the test set). In fact, as we have shown in our previous work [18], when relying only on the prediction accuracy to choose the best optimization algorithm, its performance differs between training and test sets. That is, an optimization approach can give better results on the training set and worse on the test set. Such a situation is not desired, as the model should also perform well on unseen data. Therefore, several fitness functions have been proposed to balance prediction accuracy and generalization capacity.
The trade-off between maximizing the prediction accuracy and minimizing the number of features can be handled in different manners. In this paper, as in [56], we adopt the aggregate fitness function described in (5):

$$fit = \alpha F_1 + (1 - \alpha) F_2 \tag{5}$$

where $F_1$ is the error rate and $F_2$ is the proportion of selected features, as described in (6):

$$F_2 = \frac{p}{n} \tag{6}$$

where $p$ is the number of selected features and $n$ is the total number of features.
We note that we have adapted the fitness function proposed in [56] so as to define the problem as a minimization problem; nevertheless, the same spirit is adopted to aggregate the two objectives. The value of α is set to 0.8 because the error rate has to count for more than the number of features; such a value has been recommended in the literature (e.g. [57] proposed adjusting α between 0.7 and 0.9). Other fitness functions with the same purpose of balancing the two objectives, using similar or related aggregation functions, have been proposed in [57,58].
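As a concrete illustration, the aggregate fitness referenced in (5)-(6) can be computed as follows; the variable names are ours, not the paper's.

```python
def fitness(error_rate: float, n_selected: int, n_total: int,
            alpha: float = 0.8) -> float:
    """Aggregate fitness: weighted sum of the classification error rate
    and the fraction of selected features, both to be minimized."""
    return alpha * error_rate + (1 - alpha) * (n_selected / n_total)

# A subset with slightly higher error but far fewer features can still
# obtain a better (lower) fitness value.
f_small = fitness(error_rate=0.10, n_selected=5, n_total=100)   # ≈ 0.09
f_large = fitness(error_rate=0.09, n_selected=80, n_total=100)  # ≈ 0.232
```

This makes explicit why α = 0.8 privileges the error rate while still rewarding smaller subsets.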

Feature selection by combining chunking and cooperative learning
In this part, our objective is to bridge the gap between the different concepts highlighted in the previous section. That is, we propose a design adapted to the FS problem which adopts cooperative learning in PSO using chunking/clustering.
In the conventional PSO presented in the previous section, each individual in the swarm represents a complete solution vector. In our proposed approach, instead of having one swarm trying to find the complete optimal selection vector, the vector is split into clusters of features that can be considered independent of each other. In other words, the solution vector of selected features is a combination of the different sub-solutions provided by each swarm, following the same principle as the cooperative PSO [36]. However, a challenging issue is to assign each feature to its appropriate swarm so that the swarms optimize independent sub-problems. In this paper, we associate with each swarm a cluster of features whose F-correlation with the remaining features can be neglected, so that the cluster can be considered independent of the others. For this, we adopt the idea defined in [52], which we summarize in the next paragraph.
In [52], after computing the F-correlation for each pair of features, the feature set is represented as a graph: each feature is a vertex, and the weight of the edge between vertices f i and f j is their F-correlation. The concept of a minimum spanning tree (MST) is then adopted for grouping the features, as it does not assume that points are grouped around centers. More concretely, after all vertices have been connected so that the sum of the edge weights is minimal, the clusters are formed by removing the edges whose weights are smaller than the T-Relevance of both their features with the class. We rely here on the observation of [52] that highly correlated features are assembled in the same cluster, which is the basis of our assumption that the different clusters are independent.
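The MST-based grouping can be sketched as follows. This is a simplified stand-in for the method above: we use 1 − |Pearson correlation| as the edge weight instead of F-correlation, and we cut the single longest MST edge to obtain two chunks rather than applying the T-Relevance rule.

```python
import numpy as np

def mst_edges(dist):
    """Prim's algorithm on a dense distance matrix; returns the MST as a
    list of (i, j, weight) edges."""
    n = dist.shape[0]
    in_tree, edges = [0], []
    while len(in_tree) < n:
        best = None
        for i in in_tree:
            for j in range(n):
                if j not in in_tree and (best is None or dist[i, j] < best[2]):
                    best = (i, j, dist[i, j])
        edges.append(best)
        in_tree.append(best[1])
    return edges

def two_chunks(X):
    """Split features into two weakly correlated groups by building an
    MST on 1 - |corr| and removing its longest edge."""
    dist = 1.0 - np.abs(np.corrcoef(X, rowvar=False))
    edges = mst_edges(dist)
    cut = max(edges, key=lambda e: e[2])
    adj = {i: set() for i in range(dist.shape[0])}
    for e in edges:
        if e is cut:
            continue
        adj[e[0]].add(e[1])
        adj[e[1]].add(e[0])
    # Collect the connected component containing feature 0.
    seen, stack = {0}, [0]
    while stack:
        for nxt in adj[stack.pop()] - seen:
            seen.add(nxt)
            stack.append(nxt)
    return sorted(seen), sorted(set(range(dist.shape[0])) - seen)

# Toy data: features {0, 1} are near-duplicates, as are {2, 3}, while the
# two pairs are independent, so the cut should separate the pairs.
rng = np.random.default_rng(2)
a, b = rng.normal(size=300), rng.normal(size=300)
X = np.column_stack([a, a + 0.05 * rng.normal(size=300),
                     b, b + 0.05 * rng.normal(size=300)])
chunk_a, chunk_b = two_chunks(X)
```

Cutting more edges in decreasing weight order would yield more than two chunks under the same principle.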
We note that our approach is a hybridization of filter and wrapper approaches, in which a correlation-based filter approach [15] is used for clustering and a PSO-based wrapper approach is used to select the features within each cluster. Indeed, as noted above, each of these FS approaches has its advantages (wrappers are generally more accurate, while filters are faster), and our motivation is to take advantage of both.
In Algorithm 2, we define the different steps of the proposed approach, which returns the subset of features found by the PSO search and the associated fitness value. The algorithm starts with the clustering of the features using F-correlation (as described above); the particle swarms are then initialized as in the conventional PSO [13], and the main PSO wrapper loop is executed for each chunk. In other words, PSO is executed d times, where d is the number of chunks determined by the chunking/clustering process. The final step is to concatenate the different sub-solutions found by each PSO run, evaluate the fitness function and select the corresponding subset.
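The overall flow of Algorithm 2 (chunk, optimize each chunk, concatenate) can be sketched as follows. This is a toy illustration: the per-chunk optimizer is a placeholder (exhaustive enumeration) standing in for the PSO wrapper, and the per-chunk fitness is made up.

```python
import numpy as np
from itertools import product

def optimize_chunk(fitness, chunk_size):
    """Placeholder for the per-chunk PSO wrapper of Algorithm 1: for this
    tiny sketch we simply enumerate every binary mask of the chunk."""
    return min((np.array(m) for m in product([0, 1], repeat=chunk_size)),
               key=fitness)

def cooperative_fs(chunks, n_features, chunk_fitness):
    """Optimize each chunk independently, then concatenate the
    sub-solutions into one full feature mask (the cooperative step)."""
    full = np.zeros(n_features, dtype=int)
    for chunk in chunks:
        sub = optimize_chunk(lambda m: chunk_fitness(chunk, m), len(chunk))
        full[np.array(chunk)] = sub
    return full

# Made-up per-chunk fitness: within each chunk, selecting exactly the
# even-indexed features is optimal (standing in for a wrapper evaluation).
target = lambda chunk: np.array([1 if f % 2 == 0 else 0 for f in chunk])
chunk_fitness = lambda chunk, m: int(np.abs(m - target(chunk)).sum())
mask = cooperative_fs([[0, 1, 2], [3, 4, 5]], 6, chunk_fitness)
```

The key point is the last assignment inside the loop: each swarm only writes the positions of its own chunk, and the concatenated mask is evaluated as a whole afterwards.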
As we can see above, the approach is simple and consists mainly of combining the approaches explored before. In fact, unless necessary, it is generally recommended to define approaches in their simplest form. Our proposal is based on two concepts, chunking/clustering and cooperative learning; our contribution is to combine them in a new way to better address the FS problem, and to evaluate the approach fairly in order to see to what extent we can move forward in this direction.

Experiments
In this section, we first describe the parameter setting of the experiments, then we compare and discuss the results.

Experimental setup
In our experiments, support vector machines (SVM) [59] are chosen as the classifier, as they have shown good performance in ML classification tasks; that is, SVM is adopted in this paper to measure the predictive power of the selected features. To implement PSO for FS, we have adopted the "PySwarms" research toolkit [60] and the Scikit-learn package. Concerning the PSO parameters, we set the acceleration coefficients $c_1 = 0.5$ and $c_2 = 0.5$. These are the default values provided in "PySwarms" and enable a balance between exploration and exploitation; more details on the impact of these parameters can be found in [35]. The number of iterations is set to 100. We do not choose a higher value because the proposed method must find a good solution within a reasonable time in order to challenge typical FS methods.
To evaluate our approach, we use twelve widely adopted datasets for comparing FS approaches. These datasets represent different domains (e.g. health sciences, pattern recognition) and differ widely in the number of features, the number of samples and the number of classes (three of them are multi-class problems). We note that, facing the FS problem, we are mostly interested in the differences with respect to the number of features. Most of these datasets are available in the UCI ML repository, and some of them were used in the NIPS 2003 FS challenge. Details are provided in Table 1. We compare our approach with the PSO-based wrapper approach (with the same design adopted in [55]) and with two classical filter FS approaches. The considered fitness function is depicted in (5). We also compare the corresponding prediction accuracy, number of selected features and CPU time (in seconds). The algorithms are executed on a computer equipped with an Intel i7-9750H and 16 GB of RAM.
We have limited our comparison to the basic PSO-based approach, not including other metaheuristics or bio-inspired approaches, as our aim is to study the impact of chunking and cooperative learning on it, and whether they improve its performance. We also compare with a χ²-based FS approach [61] and an MI-based approach [62], which are typical methods for performing FS with the Scikit-learn package. The comparison with these approaches is included in order to see to what extent our method (and bio-inspired optimization in general) can be adopted as an alternative to classical FS methods. We note here that the first approach (PSO) is a wrapper approach, while the latter two (χ² and MI) are filters. We should also note that the χ² and MI approaches, as implemented in Scikit-learn, require a manual selection of the number of features to maximize their performance, and we choose moderate values for both. Regarding the approach defined in [52], it is not easy to compare against it, as it was implemented in the "Weka" software, and most of the adopted datasets require a pre-processing step (e.g. transformation of categorical features and classes into numerical values) which is developed in Python rather than in "Weka".
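To make the filter baselines concrete, the χ² scoring used by such methods can be sketched as follows; this is a simplified re-implementation on toy data, mirroring the classical χ² filter criterion rather than calling Scikit-learn directly.

```python
import numpy as np

def chi2_scores(X, y):
    """Chi-squared statistic of each non-negative feature against the
    class: class-wise observed feature sums versus the sums expected
    under independence (the classical chi2 filter criterion)."""
    classes = np.unique(y)
    observed = np.array([X[y == c].sum(axis=0) for c in classes])
    class_prob = np.array([(y == c).mean() for c in classes])[:, None]
    expected = class_prob * X.sum(axis=0)[None, :]
    return ((observed - expected) ** 2 / expected).sum(axis=0)

# Toy data: feature 0 is active almost only in class 1, feature 1 is noise.
rng = np.random.default_rng(4)
y = np.repeat([0, 1], 100)
X = np.column_stack([(y == 1) * rng.random(200), rng.random(200)])
scores = chi2_scores(X, y)
top = int(np.argmax(scores))
```

Ranking features by these scores and keeping the top-k is exactly the manual "number of features" choice mentioned above; such filters are fast because no classifier is trained.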
We note at the end of this section that the number of chunks is set to 2 in this experiment. Our aim is to assess the impact of problem chunking, which can be done with this setting before examining a higher number of chunks. In fact, as the chunking process can be computationally demanding, it is preferable to examine it first in its simplest form before exploring higher chunking dimensions, in order to better analyze and manage the CPU time of the approach.

Comparison of the results
In this part, for each of the four approaches (namely our approach, PSO, χ² and MI), we display in Table 2 the results obtained for the fitness function presented in Section 3.2. In Table 3, we also show the prediction accuracy along with the number of selected features (in parentheses) for each of the four FS approaches.
We can see from Tables 2 and 3 that the approaches give different results, which vary from one dataset to another. Next, we provide a graphical comparison of our approach and the conventional PSO to enable a better analysis of the results.
In fact, to further compare the performance of our approach with the conventional PSO, and to analyze the convergence of the two approaches, we depict in Fig. 1 the evolution of the results during the optimization process. Plotting the solution progress at each iteration is straightforward for the native PSO and is implemented in the PySwarms package; for our approach, we extract the position of the global best at each iteration (referred to in (1)) for the two chunks and concatenate them to determine the corresponding solution. The purpose of this analysis, which is not performed in [52] and other papers, is to track the contribution of chunking at each iteration and its impact on the convergence of the algorithm. Hence, the evolution of the results for both algorithms during the optimization process is displayed in Fig. 1 for all the datasets. We can see from Fig. 1 that our extension is useful mainly for large datasets, whereas it could not improve the results on the smaller examples. More concretely, for the first datasets, mainly "Breast Cancer" and "Lung Cancer", which are similar medical datasets, it is clear that our approach is not useful: it could not find good initial solutions, and the results obtained are inferior to those of the PSO-based approach. This is also the case for the "Zoo" dataset, which has the smallest number of features, although it is a multi-class dataset. In contrast, we can see that the proposed approach is useful for large and high-dimensional datasets. That is, for the "Madelon", "Digits" and "Gisette" datasets, which have a high number of samples, our approach is able to find better initial solutions and obtain better fitness values.
Regarding the "LSVT", "HAR", "Speech", "Gisette" and "Leukemia" datasets, which have the five highest numbers of features among the studied examples, we can see that our approach has better convergence ability and that chunking has helped PSO to find better solutions and to move more efficiently through the search space. Finally, we compare the approaches from a computational perspective. As the CPU time differs between executions, we perform 30 runs to compare the CPU time of our approach with that of the PSO-based approach. Here, we show the results for three datasets, namely "Breast cancer", "Sonar" and "Semeion", and we first display the box plots for these datasets in Fig. 2.
In addition, we have performed a parametric statistical test (a t-test) for the three datasets and found, based on the p-value, that the hypothesis of equal means is not rejected for the "Breast" and "Sonar" datasets, and is rejected for the "Semeion" dataset. In other words, the CPU time of the PSO-based approach is significantly better only for the "Semeion" dataset, while there is no significant difference for the other datasets.
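Such a comparison can be reproduced along the following lines (a sketch on synthetic timing data; in the actual experiments the samples are the CPU times from the 30 runs):

```python
import numpy as np
from scipy.stats import ttest_ind

rng = np.random.default_rng(0)
# Hypothetical CPU-time samples (seconds) from 30 independent runs each.
times_ours = rng.normal(loc=12.0, scale=1.0, size=30)
times_pso = rng.normal(loc=11.8, scale=1.0, size=30)

# Two-sample t-test for the hypothesis of equal mean CPU times.
t_stat, p_value = ttest_ind(times_ours, times_pso)
# Reject the equal-means hypothesis at the 5% level only if p_value < 0.05.
significant = p_value < 0.05
```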
Regarding the CPU time of the χ² and MI-based approaches, we can notice that both our approach and the PSO approach are significantly slower than these filter methods on most of the datasets.
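For reference, timing the two filter baselines is straightforward with scikit-learn's `SelectKBest`; the sketch below (using scikit-learn's built-in digits data as a stand-in for the paper's datasets, and a hypothetical choice of k) illustrates the kind of measurement involved:

```python
import time
from sklearn.datasets import load_digits
from sklearn.feature_selection import SelectKBest, chi2, mutual_info_classif

X, y = load_digits(return_X_y=True)  # non-negative features, as chi2 requires

def time_filter(score_func, k=10):
    # Time a single fit of the filter-based selector.
    start = time.perf_counter()
    SelectKBest(score_func, k=k).fit(X, y)
    return time.perf_counter() - start

chi2_time = time_filter(chi2)
mi_time = time_filter(mutual_info_classif)
```

Filter methods score each feature once, with no wrapped classifier in the loop, which is why their CPU time is typically far below that of wrapper approaches such as PSO.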

Discussion of the results
The aim of this part is to summarize the results and to analyze the impact of our extension. First, based on the above results, we can conclude that our design of chunking and cooperative learning is mainly useful for large datasets. However, it should be noted that the implemented approach could not significantly improve the computation time. On the contrary, for examples like the "Semeion" dataset, its CPU time was worse than that of PSO. Moreover, for both our approach and the PSO approach, the CPU time is significantly worse than for the χ² and MI approaches. It is therefore important to take advantage of recent computational developments in order to improve the CPU time of the approach. Indeed, one of its advantages is that it is suitable for parallel computing (each sub-problem can be solved separately and in parallel), and consequently the CPU time could be considerably reduced by leveraging this feature.
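As an illustration of this parallelization potential, the chunk sub-problems could be dispatched concurrently along the following lines (a minimal sketch; `solve_chunk` is a hypothetical placeholder standing in for a per-chunk PSO run, and a process pool would be preferred in practice for CPU-bound optimization):

```python
from concurrent.futures import ThreadPoolExecutor

def solve_chunk(chunk_id):
    # Placeholder for running PSO on one sub-problem (one chunk of
    # features); returns a stand-in best fitness for that chunk.
    return chunk_id * 0.1

chunks = [0, 1]  # the two chunks defined by the chunking step
with ThreadPoolExecutor(max_workers=len(chunks)) as pool:
    results = list(pool.map(solve_chunk, chunks))
```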
Combining the results on accuracy and CPU time, we can conclude that both our approach and the PSO-based approach might be useful especially for large datasets. For small datasets, the χ² and MI-based approaches can quickly give satisfactory results. The limitation of these filter approaches appears in high-dimensional datasets, and hence the use of bio-inspired optimization is mainly justified for large datasets. This can be seen, first, in the fact that both approaches give better prediction accuracy (or the same for a few examples), as shown in Table 3. Second, we can see from Table 2 that our approach is better suited than the native PSO for the large-scale datasets examined. Both observations hold for all datasets with at least 100 features, except the "Semeion" dataset. Therefore, the performance of bio-inspired algorithms differs from dataset to dataset, and the findings of this paper suggest that advanced approaches such as the one proposed in this work will be of interest for large datasets, subject to a suitable implementation and adequate computational tools. However, we note that other factors also have an impact on the results, as highlighted below.
First, we can notice that the initial solutions have an enormous effect on the performance of the algorithm. This finding is in line with previous FS research, which strongly differentiates between forward and backward strategies [63]. Second, the results also depend on the nature of the dataset. Indeed, as we have seen, the behavior on the cancer datasets is similar, and our approach is not appropriate for either example. To fully understand the behavior of the approach, it is important to distinguish between the different types of datasets. For instance, for the "Leukemia" dataset, a number of specially tailored studies have attempted to deepen the understanding of the dataset and its relevant features (e.g. [64]). Third, regarding the number of classes, it should be noted that our approach did not prove effective on the first three multi-class datasets examined. However, these results seem to be due to the number of features and the data domain, as stated above, and the number of classes does not appear to have a clear impact. In fact, the approach is better than the native PSO on the large multi-class dataset examined ("Gisette"). Nevertheless, more datasets and a dedicated study are needed to confirm this.
Another note concerns the independence of the different sub-problems. In this paper, our intended contribution is to situate the proposed ideas within the existing literature. However, except for [52], there is a lack of FS studies in this regard, and that work did not explore this aspect in sufficient depth; to the best of our knowledge, there is no validated way of dividing the FS problem into independent sub-problems. We believe that this fact was an obstacle to obtaining more comprehensive results on the datasets examined. Moreover, the definition of the fitness function is another important influencing factor that is not widely explored, as its definition also has an impact on the solution. In fact, when examining the approach with a related fitness function proposed in [57] on some of these datasets, we found quite different results. Therefore, this theme is also one of the related open issues.
In sum, our purpose in this paper is to provide insight into the impact of the proposed extensions on PSO through a fair evaluation that shows both the strengths and limitations of the approach, and to formulate proposals for improving it.

Conclusion
In this paper, we introduced a novel extension of PSO based on the hybridization of two bio-inspired concepts, namely chunking and cooperative learning. While introducing our method, we explored how these concepts have been introduced in different ways and re-iterated in different forms. That is, the concept of chunking has been presented in different forms (e.g. clustering), and we observed a lack of connection between the different works. Also, the concept of cooperative learning has been adopted beyond swarm intelligence approaches (e.g. the POPMUSIC method). In this paper, we strengthened the awareness of the problem outlined in [7] by restricting our study to these concepts. Then, we combined and adapted the two concepts to the problem of feature selection. Our experiments compared our method with a PSO-based wrapper approach and with two typical filter approaches. The results indicated that the approach is mainly suited for high-dimensional datasets and those with few classes, as shown on the different datasets examined. However, further study is needed to better explain these results.
On the one hand, future work should attempt to improve the computational efficiency of the approach using, for instance, parallel computing. On the other hand, the definition of the chunking approach is still an open issue. Instead of the minimum spanning tree-based method adopted in this paper, further research may try to incorporate domain knowledge into the chunking process. Indeed, as we have seen above, its impact can differ for each type of dataset, and the concept could be more intuitive for text classification problems (e.g. the "Digits" dataset), since it was originally proposed for language processing [19]. An example of such work can be found in [65]. The cited paper also proposed an in-depth mathematical analysis of the approach, which is likewise important for validating information-theoretic (filter) approaches. Another option is to adopt the concept of variable-length PSO [26]. In fact, this idea, which was briefly described in Section 2.1, has shown promise for high-dimensional problems, as the referenced paper asserts, and could thus be a fruitful way to design an adequate generic approach for large datasets.
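For reference, the minimum spanning tree-based chunking idea can be sketched roughly as follows (an illustrative sketch on toy data, not the paper's exact procedure; it builds a feature-distance graph from pairwise correlations and cuts the heaviest MST edge to split the features into two chunks):

```python
import numpy as np
from scipy.sparse.csgraph import minimum_spanning_tree, connected_components

rng = np.random.default_rng(1)
X = rng.normal(size=(100, 6))  # toy data: 100 samples, 6 features

# Distance between features: 1 - |correlation| (0 = perfectly correlated).
dist = 1.0 - np.abs(np.corrcoef(X, rowvar=False))
np.fill_diagonal(dist, 0.0)

# Build the MST over the complete feature graph, then remove its
# heaviest edge so the tree falls apart into two connected components.
mst = minimum_spanning_tree(dist).toarray()
i, j = np.unravel_index(np.argmax(mst), mst.shape)
mst[i, j] = 0.0
n_comp, labels = connected_components(mst, directed=False)
chunks = [np.where(labels == c)[0] for c in range(n_comp)]
```

Cutting further heavy edges would yield more chunks; how many cuts to make, and on what distance measure, is precisely the kind of open design question discussed above.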
Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.