A multiobjective gene clustering algorithm guided by apriori biological knowledge with intensification and diversification strategies
 761 Downloads
Abstract
Background
Biologists aim to understand the genetic background of diseases, metabolic disorders or any other genetic condition. Microarrays are one of the main highthroughput technologies for collecting information about the behaviour of genetic information on different conditions. In order to analyse this data, clustering arises as one of the main techniques used, and it aims at finding groups of genes that have some criterion in common, like similar expression profile. However, the problem of finding groups is normally multi dimensional, making necessary to approach the clustering as a multiobjective problem where various cluster validity indexes are simultaneously optimised. They are usually based on criteria like compactness and separation, which may not be sufficient since they can not guarantee the generation of clusters that have both similar expression patterns and biological coherence.
Method
We propose a MultiObjective Clustering algorithm Guided by aPriori Biological Knowledge (MOCGaPBK) to find clusters of genes with high levels of coexpression, biological coherence, and also good compactness and separation. Cluster quality indexes are used to optimise simultaneously gene relationships at expression level and biological functionality. Our proposal also includes intensification and diversification strategies to improve the search process.
Results
The effectiveness of the proposed algorithm is demonstrated on four publicly available datasets. Comparative studies of the use of different objective functions and other widely used microarray clustering techniques are reported. Statistical, visual and biological significance tests are carried out to show the superiority of the proposed algorithm.
Conclusions
Integrating apriori biological knowledge into a multiobjective approach and using intensification and diversification strategies allow the proposed algorithm to find solutions with higher quality than other microarray clustering techniques available in the literature in terms of coexpression, biological coherence, compactness and separation.
Keywords
Multiobjective clustering (MOC) Gene expression data External biological knowledge Pathrelinking (PR) Pareto local search (PLS)Abbreviations
 D_{BB}
Biologybased distance
 D_{EB}
Expressionbased distance
 Dev
Overall cluster deviation
 DM
Decision maker
 GaPBK
Guided by apriori biological knowledge
 GO
Gene ontology
 HV
Hypervolume
 IC
Information content
 KEGG
Kyoto encyclopedia of genes and genomes
 MOC
Multiobjective clustering
 NSGAII
Nondominated sorting genetic algorithm
 PF
Pareto front
 PLS
Pareto local search
 PR
Pathrelinking
 PS
Pareto set
 Sep
Cluster separation
 SGA
Single objective genetic algorithm
 TR
Trajectory
 WS
Wang functional similarity
 XB
XieBeni index
Background
During the last two decades, the explosion in the increase of DNA microarray datasets available has promoted the application of machine learning methods for the understanding of the genomic data. A DNA microarray is used to collect information regarding gene expression level [1] under different conditions like a time series during a biological process, experiments of different tissue samples, among others [2]. This highthroughput technology has allowed a fast progress in biological and biomedical research [3], and it has facilitated the study of problems such as differential gene expression [4, 5], patterns of genes with (dis)similar expression levels [6, 7, 8], prediction of response to treatment [9, 10] and detection of gene mutations [11].
Clustering has proved to be a useful unsupervised learning technique for gene expression data analysis [12]. The goal is to find a partition of genetics elements represented in the microarray into k distinct groups, where k is the number of clusters which may or may not be known in advance. It is expected that genetic elements such as gene, EST contigs, noncoding sequences, among others, with similar expression profiles be put into a single cluster [13] as a way to reveal hidden patterns. Carrying out clustering is not a trivial task, in fact this unsupervised learning technique currently remains a complex and challenging task which was proved to be an NPhard problem [14]. Clustering can also be seen as an optimisation problem [15] where a cluster index (objective function) is optimised to obtain clustering solutions of high quality.
Several clustering algorithms for gene expression data have been proposed during the last years [16, 17, 18, 19, 20, 21, 22, 23, 24]. They are based on guiltbyassociation paradigm [25], i.e. groups of genetics elements which are associated, share similar expression profiles are more likely to share function. In recent years, this paradigm has been reformulated because the optimisation of a single cluster quality index based on expression levels can cause some issues. The fact that two genetic elements have similar expression patterns can be because they share some functionality, but also because of noise, which may lead to the misidentification of biological relationships [26]. Because of the above, some authors used external biological knowledge [12, 27] as another source of information about genetics elements as a way to address this situation and to find clusters with more biological coherence. Often external biological knowledge concerns the use of repositories such as the Gene Ontology Project (GO) [28] or Kyoto Encyclopedia of Genes and Genomes (KEGG) [29]. However, this biological knowledge can be partial, i.e. the information could be available only for a subset of the genetic elements. Clearly, using only a quality index based on external biological knowledge will lead to a partition of the data with clusters previously discovered and thereby extract repetitive information.
Clustering methods can use the expression profilesbased distance (D_{EB}) or biologicalbased distance (D_{BB}) to cluster genes. The distance value represents the strength of the relationship between two genes, in terms of expression behaviour (D_{EB}) or biological functions (D_{BB}). The distance (D_{EB} and D_{BB}) between two clusters is calculated as the distance between the medoids (the most central gene located in a cluster) of each cluster. This process helps to uncover new relationships in terms of cellular functions and biological processes in which genes participate; as well as, to understand their interactions, and cellular regulation. These results can also aid the study of the influence of genes in the development of diseases, their association in the formation of tissues or groups of genes that have similar response to a given drug.
where Ω is the set of all feasible clustering solutions, C is a clustering solution, and P_{t},t=1,…,m is a set of m different objective functions (quality indexes), i.e. that clustering C^{∗}⊆Ω corresponds to clustering solutions that have the best optimised m criteria P [30, 31].
In MOC problems, we have several clustering solutions C^{∗} that correspond to the optimisation of objective functions that tend to be in conflict [32], i.e. improving one objective involves worsening another. Then, it is required to reach a “tradeoff” where all objective functions (quality indexes) are satisfied to an acceptable degree which leads to find a set of best solutions called nondominated solutions [30].
Thus, a Pareto optimal set Π corresponds to a set of nondominated solutions, such that there is no other solution in Ω that dominates any of them.
Pareto front The Pareto Front F^{∗} corresponds to the image of Pareto optimal set Π, i.e. to the vectors of criterion functions (quality indexes) to Π.
In the literature it is possible to find multiobjective clustering (MOC) approaches for clustering expression data. The work presented in [33, 34] used a multiobjective genetic algorithm along with supervised techniques to perform the clustering process optimising two cluster quality indexes based on gene expression levels. However, biological information about genes is only used as verification and it is not part of the process of generating clusters. In [35] authors propose a technique for clustering of genes biologically guided by the interaction of a decision maker (DM). The technique optimises several cluster quality indexes based on expression level, meanwhile the DM evaluates clustering solutions based on the relationship in biological terms according to the expert knowledge in the area. Although the approach is interesting, since it considers expression and biology information, the fact that the formation of clusters is affected by DM expertise makes the approach limited to experiments with data in a small area where the decision maker has experience. In these multiobjective clustering approaches for gene expression data it is observed that only few of them use biological knowledge in spite of works in [26, 36, 37], which have shown that the inclusion of biological knowledge during the clustering process allows to find gene clusters with more common biological properties. In this paper, we present a multiobjective clustering algorithm guided by apriori biological knowledge. The proposed algorithm is based on NonDominated Sorting Genetic Algorithm (NSGAII) [38] which includes intensification and diversification strategies based on both PathRelinking (PR) [39] and Pareto Local Search (PLS) [40], respectively. The main contributions of this work are (1) the integration of information of genetic elements regarding their levels of expression and biological functions during the optimisation of cluster quality indexes, and (2) the proposal of adhoc local search strategies that exploit both the memory mechanism and neighbourhood principles of PathRelinking (PR) and Pareto Local Search (PLS) respectively, using a multiobjective approach. This work is tested against state of the art algorithms in order to show the benefits of both: using a multiobjective approach to tackle the clustering of expression data and the method proposed is able to discover clusters with more stronger corelation and common biological properties than literature alternatives.
Method
We call our method “MultiObjective Clustering Guided by aPriori Biological Knowledge” (MOCGaPBK). The method uses biological knowledge by the means of annotations of genetic elements with Gene Ontology (GO) terms. The integration of this biological knowledge is performed by the computation of the biologybased (D_{BB}) and the expressionbased (D_{EB}) distances.
Multiobjective clustering process
Our approach performs the discovery of clusters using NSGAII algorithm [38] along with PathRelinking (PR) [42] and Pareto Local Search (PLS) [43] as intensification and diversification strategies, respectively.
Algorithm 1 shows the pseudocode of the MOCGaPBK algorithm. First, it computes the expression (D_{EB}) and biological distance (D_{BB}) matrices (lines 12) in order to integrate the biological knowledge. Then the algorithm creates an initial population P_{0} of size N (line 4). In each generation v, the algorithm creates an offspring population Q_{v} of N individuals by using a binary tournament selection, the (k1)point crossover and the controllerrandom mutation operations (lines 67). A population R_{v} of size 2N (line 8) is created by union of P_{v} and Q_{v} (line 8). Then, a nondominated sorting and crowding distance calculation is applied on R_{v}. Here, solutions are ranked according to their nondomination level in F_{1},F_{2},…,F_{n}., i.e. NSGAII, label as F_{1} to nondominated solutions, F_{2} to nondominated solutions remaining after eliminating those with F_{1}. The process is repeated until it sorts all solutions in R_{v}. Then, in R_{v}, only those nondominated solutions, that is, the solutions ranked as F_{1}, are maintained (lines 910). The R_{v} population is used as input for the intensification and diversification strategies (lines 1114). The next population P_{v+1} is created by selecting the solutions labeled as F_{1} in R_{v} (line 20). If F_{1}<N, the algorithm complements the population with random solutions P_{r}, P_{v+1}=F_{1}∪P_{r} (lines 1618). On each generation the algorithm evaluates whether it reaches either of the two stopping criteria: (1) when the number of generations is reached and (2) when during a certain number of generations have passed without changes in the values of the objective functions of the solutions in the Pareto frontier. Finally, the algorithm returns the set of nondominated solutions (Pareto Front) in R_{v} (line 24).
Chromosome encoding
We use an integer encoding to represent a solution. Each number represents a cluster medoid which is the most central element located in a cluster, i.e. whose sum of the distances to another element of the cluster is minimum [44]. Each chromosome ch has the same length as the number of clusters K, and each position ch_{i} can have an integer value from 1 to n, n being the number of elements in the dataset. For instance, in case of a dataset with 100 elements, the chromosome [1 6 19 83 14 3] represents six clusters with 1, 6, 19, 83, 14, and 3 as the centre of each clusters.
Initial population
The initial population is generated as follows: first, the 50% of chromosomes is randomly created. Second, the remaining 50% is created based on a single genetic algorithm (SGA) which uses an integer encoding to represent a solution, k1 point crossover and controllerrandom mutation as evolutionary operators. Here, we optimise the PBM [45] index which is calculated using the distances between the points and their barycenters and the distances between the barycenters themselves. This initialisation helps the convergence because it gives to the multiobjective clustering algorithm good initial solutions.
Selection
We use a binary tournament selection along with Pareto ranking and crowding distance. In order to identify the winner chromosomes of each tournament, the individual belonging to the top Pareto ranking is chosen and in the case of a tie, the one with the highest crowding distance is selected.
Crossover and mutation
We use a (k1)point crossover, and a controllerrandom mutation [46] as evolutionary operators. In the crossover, k1 points on both individuals (parents) are randomly selected. All cluster medoid between those points are swapped between the two individuals. In controllerrandom mutation, a random position is selected from the chromosome and the corresponding element is replaced by a randomly chosen element that is not in the chromosome.
Objective functions

n: Number of elements in dataset.

K: Number of clusters.

x_{i}: ith element in dataset, with i=1,2,…,n

z_{k}: kth cluster medoid.

C: Set of all clusters.

C_{k}: kth cluster.
Cluster validity indexes used as objective functions
Validity index  Equation  Type 

XieBeni index (XB) [47] measures  \({XB=\frac {\sum \limits _{k=1}^{K}\sum \limits _{i=1}^{n} D^{2}(z_{k}, x_{i})}{ n \times \min _{k\neq l}\{D^{2}(z_{k}, z_{l})\}}} \)  Minimisation 
the quotient between the total  
variance and the minimum  
separation of the elements  
in the clusters.  
Overall cluster deviation (Dev) [48]  \(Dev= \sum \limits _{k=1}^{K} \sum \limits _{x_{i} \in C_{k}} D(z_{k}, x_{i})\)  Minimisation 
is defined as the overall summed  
distances between genes and their  
corresponding cluster medoid.  
Cluster separation (Sep) [49] is  \( Sep= \frac {2}{K(K1)}\sum \limits _{k=1}^{K} \sum \limits _{j=1, j\not =i}^{K} D^{2}(z_{i}, z_{j})\)  Maximization 
defined as intercluster distances  
between cluster medoids. 
Note that validity indices in Table 1, the distance D is computed using both expression profilesbased distance (D_{EB}) and biologicalbased distance (D_{BB}). Thus, each objective function has two versions: expressionbased (EB) and biologybased (BB) indices which consider Pearson correlation (ρ) and Wang functional similarity (WS) respectively. For instance, when XB index is used, we have two objective functions: XB _{EB} and XB _{BB}.
Intensification and diversification strategies
MOCGaPBK algorithm applies intensification and diversification strategies where promising regions are thoroughly explored. The strategies are based on Pathrelinking (PR) [42] and Pareto Local Search (PLS) [43], respectively.
Multiobjective pathrelinking
The algorithm builds trajectories between solutions as follows: Let C_{1} and C_{2} be two solutions obtained from the Pareto front (PF) of the multiobjective clustering process. The Pathrelinking procedure PR (C_{1},C_{2}) starts with the initial solution C_{1}, and gradually transforms it into solution C_{2}, by swapping out medoids in C_{1} and replacing them with medoids in C_{2}. To choose the initial solutions C_{1} and C_{2}, the algorithm selects the two solutions with the lowest Pareto ranking and the highest crowding distance. Then, for the next pair of solutions \(C^{\prime }_{1}\) and \(C^{\prime }_{2}\), we set \(C^{\prime }_{1} \leftarrow C_{2}\) as the current initial solution and it selects \(C^{\prime }_{2}\) as a new solution from the Pareto front with the lowest Pareto ranking and the highest crowding distance not considered before. This process continues until each solution in the Pareto front (PF) has been selected.
To gradually transform C_{1} into C_{2}, only nonrepeated medoids are considered as follows: Let \(M_{C_{1}}\) be the set of medoids in C_{1}. Let \(M_{C_{1}}M_{C_{2}}\) be the set of medoids in C_{1} not present in C_{2} and symmetrically, let \(M_{C_{2}}M_{C_{1}}\) be the set of medoids in C_{2} not present in C_{1}. Let tr_{0}(C_{1},C_{2})=C_{1} be the start solution in the trajectory TR(C_{1},C_{2}) from C_{1} to C_{2}. To obtain the next solution tr_{1}(C_{1},C_{2}) in the trajectory TR(C_{1},C_{2}), we remove from C_{1} a single medoid \(z_{k} \in M_{C_{1}}M_{C_{2}}\phantom {\dot {i}\!}\), and replace the empty position adding a medoid \(z_{l} \in M_{C_{2}}M_{C_{1}}\phantom {\dot {i}\!}\). To choose the solution in tr_{n} with the best move in each trajectory, we conducted a nondominated and crowding distance sorting. After that, we select the top ranked solution which is the new start solution C_{1}. The procedure is repeated until we reach the guiding solution C_{2}. Finally, each solution with the best move and original start and guiding solutions are stored in an set of intermediate pool (IP) solutions.
In our experiments, we apply PR(C_{1},C_{2}) and PR(C_{2},C_{1}) for each pair C_{1} and C_{2} in PF. Then, we merge IP and PF solutions and we recheck their nondominating and crowding distance levels. The nondominated solutions labelled as F_{1} are returned as the output of the procedure and it corresponds to the new Population R.
Pareto local search
Here, the PLS procedure receives a population of nondominated solutions A_{0} which are marked as unexplored. Then they are duplicated in a population A (Fig. 3a). An iterative process of three steps is performed. Selection step randomly selects a solution C from population A_{0}. After, neighbourhood exploration chooses a medoid z_{k} of C, which is then replaced in all neighbouring solutions N(C)={C1′,C2′,…,Cn′} by a medoid z_{k} that has not been assigned to the current solutions in N(C). Meanwhile, a dominating concept is used as acceptance criterion step,.i.e. the procedure will evaluate each C^{′}∈N(C) and if \(C \nprec C'\) it will mark as unexplored to C^{′} and the population A is updated eliminating those solutions are dominated by C^{′} (Fig. 3b). Once all C^{′}∈N(C) are evaluated, solution C is marked as explored and population A_{0} is updated filling only with the unexplored solutions from population A (Fig. 3c). PLS ends when the population A_{0} has no solutions, i.e., when all the solutions in population A have been explored, and they correspond to the new Population R.
Selection of a single solution
The MOCGaPBK algorithm produces a final Pareto front (PF) with a collection of one or more nondominated solutions. All of these solutions are highquality gene expression data partitions. To compare our results with those available in the literature, we select a single solution based on Silhouette index (S) [54] using the last nondominated solutions set in population R. The solution with the maximum value of S index is selected.
This index score lies between − 1 and 1, so that values close to 1 indicates better clustering solutions. To more detail of S(C) equation refer to [44].
Results
All the algorithms used in the experiment were implemented using R [55] version 3.2.5 and computational tests performed on a computer with Intel Xeon E52650V4 30 MB, 4 CPUs, 2.2Ghz, 96 cores/threads, 128GB RAM, 4TB. The distance between expression profiles were calculated using functions of the “amap” library [56]. The distance of biological functionality using the “GOSemSim” library [57].
Datasets
Gene expression datasets used in experiments
Dataset  Samples  Original elements  Selected elements 

Arabidopsis thaliana  8  138  133 
Yeast cell cycle  17  6000  384 
Yeast sporulation  7  6118  472 
Human fibroblasts serum  13  8613  501 
Experimental parameters
The algorithm is executed with the following parameters: number of generations =100, population size =50, crossover probability =0.80, mutation probability =0.01, generations without improvement =10, number of clusters k={4,5,6}. Evolutionary parameters and k values, were set up considering similar configurations used by other MOC algorithms in [34, 35, 63].
Performance evaluation
To evaluate the performance of MOCGaPBK algorithm, first we determine the best combination of objective functions regarding the hypervolume (HV) indicator. Then, we compare the performance of MOCGaPBK facing contestant clustering algorithms, under three aspects: (1) levels of coexpression, (2) biological coherence, and (3) compactness and separation. To carry out that, we use Eisen [64] and cluster profile [65] plots; annotation enrichment analysis [66]; and Silhouette index, respectively.
Eisen plot It is a tool used in microarray experiment for visual representation of gene expression profiles. It is achieved using heat map and colouring values that usually are red, black and green [65]. Here, Eisen plot is used to show clustering results so that, similar colours are grouped together, showing that the expression profiles of the genes of a cluster are similar to each other.
Cluster profile plot This tool shows gene expression of microarray using x, y matrix representation of time points and level expression [65], respectively. Here, the normalised level expression values (in green) of genes in each cluster are used. Additionally, the average expression along with the standard deviation (in black) are included.
where f represents the total number of genes in a GO category and g the total number of genes within the genome. Thus, this test measures the degree of overlap between the genes in each group and the genes in GO category.
Experimental results
Effect of objective functions
Best hypervolume values achieved by objective functions over 20 runs in all datasets
Objective functions  Arabidopsis  Cell cycle  Sporulation  Serum 

XB _{EB}  XB _{BB}  0.9989  0.9978  0.9993  0.9998 
Dev _{EB}  Dev _{BB}  0.9018  0.9258  0.9307  0.9449 
Sep _{EB}  Sep _{BB}  0.7913  0.8823  0.8625  0.7922 
Effect of local search strategies
Eisen and cluster profile plots
Annotation enrichment analysis
The most significant GO terms in datasets
Dataset  Cluster  Significant GO term  pvalue 

Arabidopsis  Cluster 1  Response to wounding(GO:0009611)  3.63E16 
Cellular biogenic amine metabolic process(GO:0006576)  1.00E14  
Cellular amine metabolic process(GO:0044106)  1.62E14  
Cluster 2  Lipid catabolic process(GO:0016042)  1.91E09  
Response to wounding(GO:0009611)  9.68E09  
Phenylpropanoid metabolic process(GO:0009698)  7.61E08  
Cluster 3  Response to organonitrogen compound(GO:0010243)  5.36E11  
Response to chitin(GO:0010200)  9.51E10  
Jasmonic acid mediated signaling pathway(GO:0009867)  3.03E09  
Cluster 4  Jasmonic acid biosynthetic process(GO:0009695)  7.76E04  
Jasmonic acid metabolic process(GO:0009694)  1.08E03  
Lipid oxidation(GO:0034440)  1.35E03  
Cell cycle  Cluster 1  Positive regulation of transport(GO:0051050)  1.84E04 
Regulation of transport(GO:0051049)  2.93E03  
Regulation of localization(GO:0032879)  3.39E03  
Cluster 2  Cell cycle(GO:0007049)  8.13E17  
Cell division(GO:0051301)  3.26E16  
Cell cycle process(GO:0022402)  2.30E14  
Cluster 3  Cell cycle phase(GO:0022403)  2.34E10  
Mitotic interphase(GO:0051329)  2.71E10  
Interphase(GO:0051325)  2.71E10  
Cluster 4  DNA replication(GO:0006260)  1.24E16  
DNA metabolic process(GO:0006259)  4.36E16  
Cell cycle(GO:0007049)  1.29E11  
Sporulation  Cluster 1  Glucose metabolic process(GO:0006006)  3.69E08 
Carbohydrate metabolic process(GO:0005975)  1.04E07  
Hexose metabolic process(GO:0019318)  2.49E07  
Cluster 2  Oxoacid metabolic process(GO:0043436)  1.76E05  
Organic acid metabolic process(GO:0006082)  1.80E05  
Monocarboxylic acid transport(GO:0015718)  4.42E05  
Cluster 3  Cell cycle process(GO:0022402)  2.76E19  
Cell cycle(GO:0007049)  5.83E19  
Anatomical formation in morphogenesis (GO:0048646)  6.88E19  
Cluster 4  Translation(GO:0006412)  1.03E28  
Ribosome biogenesis(GO:0042254)  1.84E08  
Ribonucleoprotein complex biogenesis(GO:0022613)  6.70E08  
Serum  Cluster 1  Mitotic recombination(GO:0006312)  1.55E11 
G2/M transition of mitotic cell cycle(GO:0000086)  1.68E09  
Chromosome segregation(GO:0007059)  1.74E09  
Cluster 2  Cellular response to zinc ion(GO:0071294)  5.25E08  
Striated muscle cell differentiation(GO:0051146)  5.98E07  
Response to zinc ion(GO:0010043)  1.26E06  
Cluster 3  Cholesterol metabolic process(GO:0008203)  7.46E14  
Cholesterol biosynthetic process(GO:0006695)  1.39E13  
Sterol biosynthetic process(GO:0016126)  2.95E13  
Cluster 4  Multimulticellular organism process(GO:0044706)  8.55E16  
Regulation of smooth muscle cell proliferation(GO:0048660)  1.50E14  
Smooth muscle cell proliferation(GO:0048659)  1.84E14 
Discussion
We evaluate the performance of MOCGaPBK algorithm using three combination of objective functions regarding hypervolume indicator. Table 3 provides information of the effect of them in clustering process which reveals the effectiveness of using XB _{EB}  XB _{BB} since it achieves the best values of hypervolume in all datasets. Such information can also be observed in Fig. 4, where the Pareto front of both XB criteria (cross symbol) dominates most of the solutions in the Pareto optimal set of the other objective functions, due to being further away from the reference point. In the same figure, the gray area represents the objective space covered only by the objective functions XB _{EB} and XB _{BB}. Such situations show that they are the best combination to perform the gene clustering process with our MOCGaPBK algorithm.
We also compare the performance of multiobjective optimization process regarding the incorporation of local search strategies. Figure 5 shows the best Pareto frontiers yield by MOCGaPBK and its variations on each of the four datasets. Generaly speaking, the use of local search improves the solutions found by the NSGAII, but it is not conclusive since for all datasets there are solutions found by NSGAII that dominates some of the solutions produced when a single local search strategy is used. However, the Pareto frontier produced by MOCGaPBK mostly dominates the Pareto frontiers produced by the other algorithms. In particular, for the case of the Yeast Cell Cycle data set (Fig. 5b), our proposal dominates all the solutions of the other algorithm, while in the other cases (Fig. 5a, c and d) there are only a few solutions that are non dominated. Also, MOCGaPBK always dominates all the solutions found by NSGAII. In all of the figures the gray area shows the space that is being covered by MOCGaPBK and not by the others. These results show the positive effect in the NSGAII of including local search strategies, which has been reported on other works, but it also shows that our algorithm overcome simple combinations of NSGAII and local search strategies.
We show the performance of MOCGaPBK algorithm optimizing XB _{EB}  XB _{BB} compared with SemiFeaClustMOO [70], MOfuzzy [63], MOGA [34], SOM [71] and Average linkage [72] clustering techniques. All results are exposed for four real life gene expression data sets, i.e., Arabidopsis Thaliana, Yeast Cell Cycle, Yeast Sporulation, and Human Fibroblasts Serum.
Mean values of Silhouette index over 20 runs of different algorithms
Algorithm  Arabidopsis  Cell cycle  Sporulation  Serum 

MOCGaPBK  0.49  0.63  0.80  0.58 
SemiFeaClustMOO  0.46  0.50  0.70  0.44 
MO fuzzy  0.41  0.43  0.59  0.40 
MOGA  0.40  0.42  0.58  0.38 
SOM  0.23  0.38  0.58  0.34 
Avg. link.  0.32  0.44  0.50  0.36 
Friedman test ranking result for comparing MOCGaPBK algorithm with other state of the art single and multi objective clustering techniques
Dataset  MOCGaPBK  SemiFeaClust  MO fuzzy  MOGA  SOM  Avg. link. 

Arabidopsis  0.49 (1)  0.46 (2)  0.41 (3)  0.40 (4)  0.23 (6)  0.32 (5) 
Cell cycle  0.63 (1)  0.50 (2)  0.43 (4)  0.42 (5)  0.38 (6)  0.44 (3) 
Sporulation  0.80 (1)  0.70 (2)  0.59 (3)  0.58 (4)  0.58 (4)  0.50 (6) 
Serum  0.58 (1)  0.44 (2)  0.40 (3)  0.38 (4)  0.34 (6)  0.36 (5) 
Avg. rank  (1)  (2)  (3.25)  (4.25)  (5.5)  (4.75) 
The test verifies whether the measured average ranks are significantly different from the mean silhouette rank. The method obtains a pvalue of 0.0033 indicating that the difference in the mean silhouette rank obtained by MOCGaPBK algorithm is significant. In fact, our proposed method has an average rank of 1 since it always obtains the best silhouette values in the experimental datasets, while its closest competitor is SemiFeaClustMOO algorithm with an average rank of 2. Clearly, it indicates that our approach obtains groups with better values of compactness and separation than competitive clustering techniques.
To visually demonstrate that clusters yield by MOCGaPBK have high coexpression patterns, we use Eisen and cluster profile plot. To do this, first we determine the best solution (regarding silhouette index), comparing all solutions yield by our method. So, we observe that MOCGaPBK has determined that four is the best number of clusters k for all datasets. Such number of clusters k matches with some values of competitive algorithms [34, 74]. In Figs. 6, 7, 8 and 9, we plot clustering solutions with best silhouette index. For instance, for Yeast Sporulation dataset (Fig. 8a), MOCGaPBK identifies as appropriate partitioning k=4. In such figure, it is evident that expression profiles of the genes in each cluster have similar expression profiles generating similar colour patterns throughout the samples. We can also see (Fig. 8b) that expression patterns of the four clusters of genes differ from each other, while the patterns within a cluster are very similar. So, for example, while cluster 1 has high expression level in the first samples and low in the last ones, the cluster 3 behaves in the opposite way. Similarly, cluster 2 presents low expression levels in first samples but high in remaining ones meanwhile the cluster 4 behaves in opposite way. The other three datasets have similar results. Such situations reveal that genes in the groups found by our algorithm are highly coexpressed.
To establish the biological relevance and coherence of a cluster, we performed an annotation enrichment analysis. In Table 4, we reported to each cluster the three most significant GO terms shared by the genes with their pvalue. It reveals that all the clusters of the solutions found by MOCGaPBK have obtained pvalues less than 0.01, i.e., each cluster has associated biological processes and thus, they are biologically significant and functionally enriched. In this aspect, MOCGaPBK outperforms competitive algorithms since they fail in finding biologically related clusters in some cases. For instance, in Yeast Sporulation dataset, MOGA, FCM, SOM and Average linkage present at least one cluster without biological significance [63].
Conclusions
In this paper we have presented a multiobjective gene clustering algorithm called MOCGaPBK. It includes external biological knowledge during the objective functions optimisation and it integrates intensification and diversification strategies based on both multiobjective PathRelinking and Pareto Local Search.
Results show that MOCGaPBK yields higher quality solutions than other clustering techniques considered here for comparison purposes. It is mainly to the strength of integrating apriori biological knowledge with a multiobjective clustering approach and the use of intensification and diversification strategies. The first one allows having partitions with higher coexpression levels and biological coherence since cluster quality indexes are used to optimise simultaneously gene relationships at expression level and biological functionality. The other aims to improve the clustering solutions to yield higher quality clustering solutions regarding to compactness and separation and to avoid fall into local optima.
The effectiveness of MOCGaPBK was demonstrated quantitatively and visually using statistical comparison test and cluster visualisation tools respectively. Results of silhouette tests, visualisations and annotation enrichment analysis show that the proposed method is able to discover compact, well separated, coexpressed and biologically significant clusters.
To perform the multiobjective clustering process we have used a chromosome representation based on clusters medoid. As a future work we would like to explore a clustering technique based on graph theory to tackle in a better way with datasets where nonconvex groups are present. Furthermore, because we use a multiobjective approach, our algorithm provides a set of solutions, all identically relevant from clustering style point of view. However, for biologists can be more appropriate to have a single solution for clustering of genes or otherwise have a subset of genes that appear grouped together most times. Due to that, in future we would like to develop a voting or ranking technique to identify the set of genes that appear most often in the same clusters, and so facilitate the inference of knowledge by biologists.
Notes
Funding
JPA thanks the SENESCYT of Ecuador for its financial support. MIP thanks CEBIB, STICAMSUD 17STIC05 and Proyecto DICYT O61619IP, VRIDEIUSACH. MD thanks CNPq, Alexander von HumboldtStiftung, and the Fundação de Amparo a Pesquisa do Estado do Rio Grande do Sul (FAPERGS)  PRONUPEQ (16/255100005206). This Research was supported by Microsoft under a Microsoft Azure for Research Award.
Ethical approval and consent to participate
Not applicable.
Availability of data and materials
The preprocessed data sets, the distance matrices and the best clustering solutions are available in http://bioinformatic.diinf.usach.cl/gapbk/The data and results published here are in part based upon data generated by GOSemSim: http://bioconductor.org/packages/2.6/bioc/html/GOSemSim.htmland the Gene Expression, Genome Variation and Functional Profiling Analysis Suite: http://babelomics.bioinfo.cipf.es/.
Authors’ contributions
JPA conducted the literature review, developed the code, carried out the experiments, helped interpreting the results and manuscript writing. MD helped interpreting the results and manuscript draft. MIP contributed with the modelling, experiments design, results analysis and manuscript writing. All authors contributed to manuscript refinement and approved the final version.
Consent for publication
Not applicable.
Competing interests
The authors declare that they have no competing interests.
Publisher’s Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
References
 1.BolónCanedo V, SánchezMaroño N, AlonsoBetanzos A, Benítez JM, Herrera F. A review of microarray datasets and applied feature selection methods. Inf Sci. 2014; 282:111–35.CrossRefGoogle Scholar
 2.Fa R, Nandi AK, Gong LY. Clustering analysis for gene expression data: A methodological review. In: Communications Control and Signal Processing (ISCCSP), 2012 5th International Symposium On.New York: Institute of Electrical and Electronics Engineers (IEEE): 2012. p. 1–6.Google Scholar
 3.Giancarlo R, Scaturro D, Utro F. Valworkbench: An open source java library for cluster validation, with applications to microarray data analysis. Comput Methods Prog Biomed. 2015; 118(2):207–17.CrossRefGoogle Scholar
 4.Vaes E, Khan M, Mombaerts P. Statistical analysis of differential gene expression relative to a fold change threshold on nanostring data of mouse odorant receptor genes. BMC Bioinformatics. 2014; 15(1):39.PubMedPubMedCentralCrossRefGoogle Scholar
 5.Lui T, Tsui N, Chan L, Wong C, Siu P, Yung B. Decode: an integrated differential coexpression and differential expression analysis of gene expression data. BMC Bioinformatics. 2015; 16(1):182.PubMedPubMedCentralCrossRefGoogle Scholar
 6.Liu YC, Cheng CP, Tseng V. Mining differential topk coexpression patterns from time course comparative gene expression datasets. BMC Bioinformatics. 2013; 14(1):230.PubMedPubMedCentralCrossRefGoogle Scholar
 7.Chou J, Zhou T, Kaufmann W, Paules R, Bushel P. Extracting gene expression patterns and identifying coexpressed genes from microarray data reveals biologically responsive processes. BMC Bioinforma. 2007; 8(1):427.CrossRefGoogle Scholar
 8.Chen X, Li J, Daggard G, Huang X. Finding similar patterns in microarray data In: Zhang S, Jarvis R, editors. AI 2005: Advances in Artificial Intelligence. Lecture Notes in Computer Science, vol. 3809. Heidelberg: Springer: 2005. p. 1272–6.Google Scholar
 9.Ban JY, Kim BS, Kim SC, Kim DH, Chung JH. Microarray analysis of gene expression profiles in response to treatment with melatonin in lipopolysaccharide activated raw 264.7 cells. Korean J Physiol Pharmacol Off J Korean Physiol Soc Korean Soc Pharmacol. 2011; 15(1):23–9.CrossRefGoogle Scholar
 10.Jang HS, Chung HS, Ko E, Shin JS, Shin MK, Hong MC, Kim Y, Min BI, Bae H. Microarray analysis of gene expression profiles in response to treatment with bee venom in lipopolysaccharide activated rawx 264.7 cells. J Ethnopharmacol. 2009; 121(2):213–20.PubMedCrossRefGoogle Scholar
 11.Mohammadi A, Saraee MH, Salehi M. Identification of diseasecausing genes using microarray data mining and gene ontology. BMC Med Genet. 2011; 4(1):1.Google Scholar
 12.Ghosh S, Mitra S, Dattagupta R. Fuzzy clustering with biological knowledge for gene selection. Appl Soft Comput. 2014; 16:102–11.CrossRefGoogle Scholar
 13.Jothi R, Mohanty SK, Ojha A. Functional grouping of similar genes using eigenanalysis on minimum spanning tree based neighborhood graph. Comput Biol Med. 2016; 71:135–48.PubMedCrossRefGoogle Scholar
 14.Tan J, Chua KS, Zhang L, Zhu S. Algorithmic and complexity issues of three clustering methods in microarray data analysis. Algorithmica. 2007; 48(2):203–19.CrossRefGoogle Scholar
 15.Bandyopadhyay S, Saha S. Unsupervised Classification: Similarity Measures, Classical and Metaheuristic Approaches, and Applications. Reading, Massachusetts: Springer; 2012.Google Scholar
 16.Wu FX. Genetic weighted kmeans algorithm for clustering largescale gene expression data. BMC Bioinformatics. 2008; 9(Suppl 6):12.CrossRefGoogle Scholar
 17.Hoque S, Istyaq S, Riaz MM. A hierarchical approach for clustering and pattern matching of gene expression data. In: Genetic and Evolutionary Computing (ICGEC), 2012 Sixth International Conference On.New York: Institute of Electrical and Electronics Engineers (IEEE): 2012. p. 413–16.Google Scholar
 18.Sirbu A, Bocicor MI. A dynamic approach for hierarchical clustering of gene expression data. In: Intelligent Computer Communication and Processing (ICCP), 2013 IEEE International Conference On.New York: IEEE: 2013. p. 3–6.Google Scholar
 19.Kramer M, Dutkowski J, Yu M, Bafna V, Ideker T. Inferring gene ontologies from pairwise similarity data. Bioinformatics. 2014; 30(12):34–42.CrossRefGoogle Scholar
 20.Muthukalathi S, Ramanujam R, Thalamuthu A. Consensus clustering for microarray gene expression data. Bonfring Int J Data Min. 2014; 4(4):26.CrossRefGoogle Scholar
 21.Isewon I, Oyelade OJ, Adebiyi E, Brors B. Ensemble based clustering of plasmodium falciparum genes. In: International Conference on African Development Issues (CUICADI). Ota: Covenant University: 2015. p. 15–22.Google Scholar
 22.Suresh R, Dinakaran K, Valarmathie P. Clustering gene expression data using selforganizing maps. J Comput Appl. 2008; 1(4):6.Google Scholar
 23.InostrozaPonta M. An integrated and scalable approach based on combinatorial optimization techniques for the analysis of microarray data. Newcastle: School of Electrical Engineering and Computer Science. University of Newcastle; 2008.Google Scholar
 24.Arefin AS, Riveros C, Berretta R, Moscato P. The MSTkNN with paracliques. In: Artificial Life and Computational Intelligence: First Australasian Conference, ACALCI 2015, Newcastle, NSW, Australia, February 57, 2015. Proceedings. Cham: Springer: 2015. p. 373–86.Google Scholar
 25.Gillis J, Pavlidis P. “Guilt by association” is the exception rather than the rule in gene networks. PLoS Comput Biol. 2012; 8(3):1–13.CrossRefGoogle Scholar
 26.Nepomuceno JA, Troncoso A, NepomucenoChamorro IA, AguilarRuiz JS. Integrating biological knowledge based on functional annotations for biclustering of gene expression data. Comput Methods Prog Biomed. 2015; 119(3):163–80.CrossRefGoogle Scholar
 27.Tari L, Baral C, Kim S. Fuzzy cmeans clustering with prior biological knowledge. J Biomed Inform. 2009; 42(1):74–81.PubMedCrossRefGoogle Scholar
 28.The Gene Ontology Consortium. Gene ontology consortium: going forward. Nucleic Acids Res. 2015; 43(D1):1049–56.CrossRefGoogle Scholar
 29.Kanehisa M, Sato Y, Kawashima M, Furumichi M, Tanabe M. Kegg as a reference resource for gene and protein annotation. Nucleic Acids Res. 2016; 44(D1):457–62.CrossRefGoogle Scholar
 30.Handl J, Kell DB, Knowles J. Multiobjective optimization in bioinformatics and computational biology. Comput Biol Bioinforma IEEE/ACM Trans. 2007; 4(2):279–92.CrossRefGoogle Scholar
 31.Ferligoj A, Batagelj V. Direct multicriteria clustering algorithms. J Classif. 1992; 9(1):43–61.CrossRefGoogle Scholar
 32.Brockhoff D, Zitzler E. Objective reduction in evolutionary multiobjective optimization: Theory and applications. Evol Comput. 2009; 17(2):135–66.PubMedCrossRefGoogle Scholar
 33.Bandyopadhyay S, Mukhopadhyay A, Maulik U. An improved algorithm for clustering gene expression data. Bioinformatics. 2007; 23(21):2859–65.PubMedCrossRefGoogle Scholar
 34.Maulik U, Mukhopadhyay A, Bandyopadhyay S. Combining paretooptimal clusters using supervised learning for identifying coexpressed genes. BMC bioinforma. 2009; 10(1):1197–208.CrossRefGoogle Scholar
 35.Mukhopadhyay A, Maulik U, Bandyopadhyay S. An interactive approach to multiobjective clustering of gene expression patterns. IEEE Trans Biomed Eng. 2013; 60(1):35–41.PubMedCrossRefGoogle Scholar
 36.Azuaje FJ, Wang H, Zheng H, Léonard F, RollandTurner M, Zhang L, Devaux Y, Wagner DR. Predictive integration of gene functional similarity and coexpression defines treatment response of endothelial progenitor cells. BMC Syst Biol. 2011; 5(1):1–13.CrossRefGoogle Scholar
 37.LuqueBaena RM, Urda D, Gonzalo Claros M, Franco L, Jerez JM. Robust gene signatures from microarray data using genetic algorithms enriched with biological pathway keywords. J Biomed Inform. 2014; 49(C):32–44.PubMedCrossRefGoogle Scholar
 38.Deb K, Pratap A, Agarwal S, Meyarivan T. A fast and elitist multiobjective genetic algorithm: Nsgaii. IEEE Trans Evol Comput. 2002; 6(2):182–97.CrossRefGoogle Scholar
 39.Frinhani RMD, Silva RMA, Mateus GR, Festa P, Resende MGC. Grasp with pathrelinking for data clustering: A case study for biological data. In: Experimental Algorithms: 10th International Symposium, SEA 2011, Kolimpari, Chania, Crete, Greece, May 57, 2011. Proceedings. Berlin: Springer: 2011. p. 410–20.Google Scholar
 40.Paquete L, Chiarandini M, Stützle T. Pareto local optimum sets in the biobjective traveling salesman problem: An experimental study. In: Metaheuristics for Multiobjective Optimisation. Berlin: Springer: 2004. p. 177–99.Google Scholar
 41.Wang JZ, Du Z, Payattakool R, Yu PS, Chen CF. A new method to measure the semantic similarity of go terms. Bioinformatics. 2007; 23(10):1274–81.PubMedCrossRefGoogle Scholar
 42.Glover F. Tabu search and adaptive memory programming — advances, applications and challenges. In: Interfaces in Computer Science and Operations Research: Advances in Metaheuristics, Optimization, and Stochastic Modeling Technologies. Boston: Springer: 1997. p. 1–75.Google Scholar
 43.DuboisLacoste J, LópezIbáñez M, Stützle T. Anytime pareto local search. Eur J Oper Res. 2015; 243(2):369–85.CrossRefGoogle Scholar
 44.Mukhopadhyay A, Maulik U, Bandyopadhyay S. A survey of multiobjective evolutionary clustering. ACM Comput Surv. 2015; 47(4):61.CrossRefGoogle Scholar
 45.Pakhira MK, Bandyopadhyay S, Maulik U. Validity index for crisp and fuzzy clusters. Pattern Recog. 2004; 37(3):487–501.CrossRefGoogle Scholar
 46.Mukhopadhyay A, Maulik U. Multiobjective approach to categorical data clustering. In: 2007 IEEE Congress on Evolutionary Computation. New York: Institute of Electrical and Electronics Engineers (IEEE): 2007. p. 1296–303.Google Scholar
 47.Xie XL, Beni G. A validity measure for fuzzy clustering. IEEE Trans Pattern Anal Mach Intell. 1991; 13(8):841–7.CrossRefGoogle Scholar
 48.Handl J, Knowles J. An evolutionary approach to multiobjective clustering. Evol Comput IEEE Trans. 2007; 11(1):56–76.CrossRefGoogle Scholar
 49.Ripon KSN, Tsang CH, Kwong S, Ip MK. Multiobjective evolutionary clustering using variablelength real jumping genes genetic algorithm. In: Proceedings of the 18th International Conference on Pattern Recognition  Volume 01. ICPR ’06. Washington, DC: IEEE Computer Society: 2006. p. 1200–3.Google Scholar
 50.Resende MGC, Ribeiro CC. Grasp with pathrelinking: Recent advances and applications. Oper Res Comput Sci Interfaces Ser. 2005; 32:29–63.Google Scholar
 51.Chaves AA, Lorena LAN. Clustering search algorithm for the capacitated centered clustering problem. Comput Oper Res. 2010; 37(3):552–8.CrossRefGoogle Scholar
 52.Martí R, Campos V, Resende MGC, Duarte A. Multiobjective grasp with path relinking. Eur J Oper Res. 2015; 240(1):54–71.CrossRefGoogle Scholar
 53.Drugan MM, Thierens D. Stochastic pareto local search: Pareto neighbourhood exploration and perturbation strategies. J Heuristics. 2012; 18(5):727–66.CrossRefGoogle Scholar
 54.Rousseeuw PJ. Silhouettes: A graphical aid to the interpretation and validation of cluster analysis. J Comput Appl Math. 1987; 20:53–65.CrossRefGoogle Scholar
 55.R Core Team. R: A Language and Environment for Statistical Computing. Vienna: R Foundation for Statistical Computing; 2017.Google Scholar
 56.Lucas A. amap: Another Multidimensional Analysis Package. Vienna: R Foundation for Statistical Computing; 2014.Google Scholar
 57.Yu G, Li F, Qin Y, Bo X, Wu Y, Wang S. Gosemsim: an r package for measuring semantic similarity among go terms and gene products. Bioinformatics. 2010; 26(7):976–8.PubMedCrossRefGoogle Scholar
 58.Reymond P, Weber H, Damond M, Farmer EE. Differential gene expression in response to mechanical wounding and insect feeding in arabidopsis. Plant Cell. 2000; 12(5):707–19.PubMedPubMedCentralCrossRefGoogle Scholar
 59.Cho RJ, Campbell MJ, Winzeler EA, Steinmetz L, Conway A, Wodicka L, Wolfsberg TG, Gabrielian AE, Landsman D, Lockhart DJ, et al.A genomewide transcriptional analysis of the mitotic cell cycle. Mol Cell. 1998; 2(1):65–73.PubMedCrossRefGoogle Scholar
 60.Chu S, DeRisi J, Eisen M, Mulholland J, Botstein D, Brown PO, Herskowitz I. The transcriptional program of sporulation in budding yeast. Science. 1998; 282(5389):699–705.PubMedCrossRefGoogle Scholar
 61.Iyer VR, Eisen MB, Ross DT, Schuler G, Moore T, Lee JCF, Trent JM, Staudt LM, Hudson J, Boguski MS, Lashkari D, Shalon D, Botstein D, Brown PO. The transcriptional program in the response of human fibroblasts to serum. Science. 1999; 283(5398):83–7.PubMedCrossRefGoogle Scholar
 62.Maulik U, Mukhopadhyay A, Bandyopadhyay S. Website: Combining ParetoOptimal Clusters Using Supervised Learning for Identifying Coexpressed Genes. http://anirbanmukhopadhyay.50webs.com/data.html. urldate:20160413.
 63.Saha S, Ekbal A, Gupta K, Bandyopadhyay S. Gene expression data clustering using a multiobjective symmetry based clustering technique. Comput Biol Med. 2013; 43(11):1965–77.PubMedCrossRefGoogle Scholar
 64.Eisen MB, Spellman PT, Brown PO, Botstein D. Cluster analysis and display of genomewide expression patterns. Proc Natl Acad Sci. 1998; 95(25):14863–8.PubMedCrossRefGoogle Scholar
 65.Deepika T, Porkodi R. A survey on microarray gene expression data sets in clustering and visualization plots. Int J Emerg Res Manag Technol. 2015; 4(3):56–66.Google Scholar
 66.AlShahrour F, DíazUriarte R, Dopazo J. Fatigo: A web tool for finding significant associations of gene ontology terms with groups of genes. Bioinformatics. 2004; 20(4):578–80.PubMedCrossRefGoogle Scholar
 67.Zitzler E, Thiele L. Multiobjective evolutionary algorithms: A comparative case study and the strength pareto approach. IEEE Trans Evol Comput. 1999; 3(4):257–71.CrossRefGoogle Scholar
 68.K G, M G. Annotation enrichment analysis: An alternative method for evaluating the functional properties of gene sets. Sci Rep. 2014; 4(4191):1–9.Google Scholar
 69.AlShahrour F, Minguez P, Tárraga J, Medina I, Alloza E, Montaner D, Dopazo J. FatiGO +: A functional profiling tool for genomic data. integration of functional annotation, regulatory motifs and interaction data with microarray experiments. Nucleic Acids Res. 2007; 35(SUPPL.2):91–6.CrossRefGoogle Scholar
 70.Saha S, Alok AK, Ekbal A. Use of semisupervised clustering and featureselection techniques for identification of coexpressed genes. IEEE J Biomed Health Inform. 2016; 20(4):1171–7.PubMedCrossRefGoogle Scholar
 71.Tamayo P, Slonim D, Mesirov J, Zhu Q, Kitareewan S, Dmitrovsky E, Lander ES, Golub TR. Interpreting patterns of gene expression with selforganizing maps: Methods and application to hematopoietic differentiation. Proc Natl Acad Sci U S A. 1999; 96(6):2907–12.PubMedPubMedCentralCrossRefGoogle Scholar
 72.Jain AK, Murty MN, Flynn PJ. Data clustering: A review. ACM Comput Surv. 1999; 31(3):264–323.CrossRefGoogle Scholar
 73.Friedman M. The use of ranks to avoid the assumption of normality implicit in the analysis of variance. J Am Stat Assoc. 1937; 32(200):675–701.CrossRefGoogle Scholar
 74.Bezdek JC. Pattern recognition with fuzzy objective function algorithms. Norwell: Kluwer Academic Publishers; 1981.CrossRefGoogle Scholar
Copyright information
Open Access This article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated.