1 Introduction

This work presents a novel approach to learning the quality and performance of an algorithm through the use of evolution. When an algorithm is developed to solve a given problem, the designer is presented with questions about the performance of their proposed method and its relative performance against existing methods. This is an inherently difficult task. However, under the current paradigm, the standard response to this situation is to use a known fixed set of datasets - or simulate new datasets themselves - and a common metric amongst the proposed method and its competitors. The collated algorithms are then assessed based on this metric with often minimal consideration for the appropriateness or reliability of the datasets being used, and the robustness of the method(s) in question [1, 13, 19].

This process is not so readily observed when travelling in the opposite direction but methods to do so exist. Suppose that the object of interest was not an algorithm but rather a dataset. In this case, the objective is to determine a preferable algorithm to complete some task on the data. There exist a number of methods employed across disciplines to complete this task that take into account the characteristics of the data and the context of the research problem. These methods are often equivalent to asking questions of the data, and include the use of diagnostic tests. For instance, in the case of clustering, if the data displayed an indeterminate number of non-convex blobs, then one could recommend that an appropriate clustering algorithm would be DBSCAN [11]. Otherwise, for scalability, k-means may be chosen [37, 38].

The approach presented in this work aims to flip the paradigm described here by allowing the data itself to be unfixed. This fluidity in the data is achieved by generating data for which the algorithm performs well (or better than some other) through the use of an evolutionary algorithm. The purpose of doing so is not to simply create a bank of useful datasets but rather to allow for the subsequent studying of these datasets. In doing so, the attributes and characteristics which lead to the success (or failure) of the algorithm may be described, giving a broader understanding of the algorithm on the whole. Our framework is described in Fig. 1.

Fig. 1. On the right: the current path for selecting some algorithm(s) based on their validity and performance for a given dataset. On the left: the proposed flip to better understand the space in which ‘good’ datasets exist for an algorithm.

This proposed flip has a number of motivations, and below is a non-exhaustive list of some of the problems that are presented by the established evaluation paradigm:

  1. How are these benchmark examples selected? There is no true measure of their reliability other than their frequent use. In some domains and disciplines there are well-established benchmarks so those found through literature may well be reliable, but in others less so [5, 8, 36].

  2. Sometimes, when there is a lack of benchmark examples, a ‘new’ dataset is simulated to assess the algorithm [26]. This raises the question of how and why that simulation was created. Moreover, existing benchmarks often owe their origins to convenience rather than merit.

  3. In disciplines where there are established benchmarks, there may still be underlying problems around the true performance of an algorithm:

    (i) As an example, work by Torralba and Efros [34] showed that image classifiers trained and evaluated on a particular dataset, or datasets, did not perform reliably when evaluated using other benchmark datasets that were determined to be similar, leading to models which lack robustness.

    (ii) What can be learned about the characteristics of data which lead to good (or bad) performance of an algorithm is constrained to the finite set of attributes present in the benchmark data chosen in the first place.

This work presents just one method from this new paradigm, and that method is built around the concept of evolution. Evolutionary algorithms (EAs) have been applied successfully to solve a wide array of problems - particularly where the complexity of the problem or its domain is significant. These methods are highly adaptive and their population-based construction (displayed in Fig. 2) allows for the efficient solving of problems that are otherwise beyond the scope of traditional search and optimisation methods. EAs have been chosen here as they are simple in design yet their capabilities encompass the difficulties of the flipped paradigm set out above.

Fig. 2. A general schematic for an evolutionary algorithm.

The use of EAs to generate artificial data is not a new concept, however. Its applications in data generation have included developing methods for the automated testing of software [16, 23, 29] and the synthesis of existing or confidential data [6]. Such methods also have a long history in the parameter optimisation of algorithms, and recently in the automated design of convolutional neural network (CNN) architecture [31, 32].

Other methods for the generation or synthesis of artificial data are numerous and range from simple concepts such as simulated annealing [21] to swarm-based learning techniques [2] or generative adversarial networks (GANs) [12]. The unconstrained learning style of methods such as CNNs and GANs aligns with that proposed in this work. By allowing the EA to explore and learn about the search space in an organic way, less-prejudiced insight can be established that is not necessarily reliant on any particular framework or agenda.

Note that the proposed methodology is not simply to use an EA to optimise an algorithm over a search space with fixed dimension or data type such as those set out in [6]. The shape of a dataset is considered a part of the sample space itself that can be traversed through the evolutionary algorithm.

The remainder of the paper is structured as follows:

  • Section 2 describes the structure of the proposed method including its parameters and operators.

  • Section 3 contains a case study where the success and failure of k-means clustering is examined using the proposed method. Included also is a comparison between k-means and another clustering algorithm DBSCAN.

  • Section 4 concludes this paper.

2 The evolutionary algorithm

In this section, the details of an algorithm that generates data for which a given function, or (equivalently) algorithm, is well-suited are described. This algorithm is referred to as “Evolutionary Dataset Optimisation” (EDO).

The EDO method is built as an evolutionary algorithm which follows a traditional (generic) schema with some additional features that keep the objective of artificial data generation in mind. With that, there are a number of parameters that are passed to EDO; the typical parameters of an evolutionary algorithm are a fitness function, f, which maps from an individual to a real number, as well as a population size, N, a maximum number of iterations, M, a selection parameter, b, and a mutation probability, pm. In addition to these, EDO takes the following parameters:

  • A set of probability distribution families, 𝒫. Each family in this set has some parameter limits which form a part of the overall search space. For instance, the family of normal distributions, denoted by N(μ, σ²), would have limits on values for the mean, μ, and the standard deviation, σ.

  • A maximum number of “subtypes” for each family in 𝒫. A subtype is an independent copy of the family that progresses separately from the others. These are the actual distribution objects which are traversed in the optimisation.

  • A probability vector to sample distributions from 𝒫, w = (w1, …, w|𝒫|).

  • Limits on the number of rows an individual dataset can have,

    $$ R \in \left\{ (r_{\min}, r_{\max}) \in \mathbb{N}^{2}~|~r_{\min} \leq r_{\max} \right\} $$
  • Limits on the number of columns a dataset can have,

    $$ C := \left( C_{1}, \ldots, C_{|\mathcal{P}|} \right) \text{ where } C_{j} \in \left\{ (c_{\min}, c_{\max}) \in { \left( \mathbb{N}\cup\{\infty\}\right) }^{2}~|~c_{\min} \leq c_{\max}\right\} $$

    for each j = 1, …, |𝒫|. That is, C defines the minimum and maximum number of columns a dataset may have from each distribution in 𝒫.

  • A second selection parameter, l ∈ [0,1], to allow for a small proportion of ‘lucky’ individuals to be carried forward.

  • A shrink factor, s ∈ [0,1], defining the relative size of a component of the search space to be retained after adjustment.

The concepts discussed in this section form the mechanisms of the evolutionary dataset optimisation algorithm. To use the algorithm practically, these components have been implemented in Python as a library built on the scientific Python stack [22, 25]. The library is fully tested and documented (at https://edo.readthedocs.io) and is freely available online under the MIT license [33]. The EDO implementation was developed to be consistent with the current best practices of open source software development [15].

figure d

The statement of the EDO algorithm is presented here to lay out its general structure from a high level perspective. Lower level discussion is provided below where additional algorithms for the individual creation, evolutionary operator and shrinkage processes are given along with diagrams (where appropriate). Note that there are no defined processes for how to stop the algorithm or adjust the mutation probability, pm. This is down to their relevance to a particular use case. Some examples include:

  • Regular decreasing in mutation probability across the available attributes [17].

  • Stopping when no improvement in the best fitness is found within some K consecutive iterations [18].

  • Utilising global behaviours in fitness to indicate a stopping point [20].
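As an illustration of the second example above, a stagnation-based stopping rule could be sketched as follows. This is a minimal sketch rather than part of the EDO library: the function name `has_stagnated` and the tolerance argument `tol` are hypothetical, and fitness is assumed to be minimised.

```python
def has_stagnated(best_fitnesses, k, tol=0.0):
    """Return True if the best fitness has not improved in the last k iterations.

    best_fitnesses holds the best (lowest) fitness found at each iteration so far.
    """
    if len(best_fitnesses) <= k:
        return False  # not enough history to compare against

    earlier_best = min(best_fitnesses[:-k])
    recent_best = min(best_fitnesses[-k:])

    # No improvement means the recent best is no lower than the earlier best.
    return recent_best >= earlier_best - tol
```

Such a check would be evaluated once per iteration, with K chosen according to how long the use case can tolerate a plateau.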

2.1 Individuals

Evolutionary algorithms operate in an iterative process on populations of individuals that each represent a solution to the problem in question. In a genetic algorithm, an individual is a solution encoded as a bit string of, typically, fixed length and treated as a chromosome-like object to be manipulated. In EDO, as the objective is to generate datasets and explore the space in which datasets exist, there is no encoding. As such the distinction is made that EDO is an evolutionary algorithm.

As is seen in Fig. 3, an individual’s creation is defined by the generation of its columns. A set of instructions on how to sample new values (in mutation, for instance, Section 2.4) for that column is recorded in the form of a probability distribution. These distributions are sampled and created from the families passed in 𝒫. In EDO, the produced datasets and their metadata are manipulated directly so that the biological operators can be designed and be interpreted in a more meaningful way as will be seen later in this section.

Fig. 3. An example of how an individual is first created.

However, one should not assume that the columns are a reliable representative of the distribution associated with them, or vice versa. This is particularly true of ‘shorter’ datasets with a small number of rows, whereas confidence in the pair could be given more liberally for ‘longer’ datasets with a larger number of rows. In any case, appropriate methods for analysis should be employed before formal conclusions are made.
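The creation process described above can be sketched in Python using only the standard library. This is an illustrative simplification and not the EDO implementation: the `Uniform` class, the function `create_individual` and the use of a single overall column limit (rather than one limit per family, as C specifies) are all assumptions made for brevity.

```python
import random


class Uniform:
    """Hypothetical family of uniform distributions U(a, b)."""

    limits = (0.0, 1.0)  # limits on both bounds, part of the search space

    def __init__(self, a, b):
        self.a, self.b = a, b

    @classmethod
    def random(cls, rng):
        """Draw a distribution whose bounds lie within the family limits."""
        a, b = sorted(rng.uniform(*cls.limits) for _ in range(2))
        return cls(a, b)

    def sample(self, n, rng):
        return [rng.uniform(self.a, self.b) for _ in range(n)]


def create_individual(row_limits, col_limits, families, weights, rng):
    """Return (columns, metadata): one list of values and one distribution per column."""
    n_rows = rng.randint(*row_limits)  # inclusive limits
    n_cols = rng.randint(*col_limits)

    # Choose a family for each column according to w, then draw its parameters.
    metadata = [
        rng.choices(families, weights=weights)[0].random(rng)
        for _ in range(n_cols)
    ]
    # Fill each column by sampling from its own distribution.
    columns = [dist.sample(n_rows, rng) for dist in metadata]
    return columns, metadata
```

Keeping the distribution alongside each column is what later allows new values to be drawn for that column during crossover and mutation.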

figure e

2.2 Selection

The selection operator describes the process by which individuals are chosen from the current population to generate the next. Almost always, the likelihood of an individual being selected is determined by their fitness. This is because the purpose of selection is to preserve favourable qualities and encourage some homogeneity within future generations [4].

figure f

In EDO, a modified truncation selection method is used [14], as can be seen in Fig. 4. Truncation selection takes a fixed number, nb = ⌈bN⌉, of the fittest individuals in a population and makes them the ‘parents’ of the next. It has been observed that, despite its efficiency as a selection operator, truncation selection can lead to premature convergence at local optima [14, 24]. The modification for EDO is an optional stage after the best individuals have been chosen: with some small l, a number, nl = ⌈lN⌉, of the remaining individuals can be selected at random to be carried forward. Hence, allowing for a small number of randomly selected individuals may encourage diversity and further exploration throughout the run of the algorithm. It should be noted that regardless of this step, an individual could potentially be present throughout the entirety of the algorithm.

Fig. 4. The selection process with the inclusion of some lucky individuals.
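The modified truncation selection can be sketched as below. The function name `select_parents` is hypothetical and fitness is assumed to be minimised; the sketch only illustrates the two stages (fittest first, then lucky) and is not taken from the EDO source.

```python
import math
import random


def select_parents(population, fitnesses, b, l, rng):
    """Truncation selection followed by an optional random 'lucky' stage."""
    n = len(population)
    n_best = math.ceil(b * n)   # number of fittest individuals to keep
    n_lucky = math.ceil(l * n)  # number of lucky individuals to keep

    # Rank indices from best (lowest fitness) to worst.
    ranked = sorted(range(n), key=lambda i: fitnesses[i])
    best, rest = ranked[:n_best], ranked[n_best:]

    # Lucky individuals are drawn at random from those not already chosen.
    lucky = rng.sample(rest, min(n_lucky, len(rest)))
    return [population[i] for i in best + lucky]
```

Setting l = 0 recovers ordinary truncation selection, which is the configuration used in the case study of Section 3.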

After the parents have been selected, there are two adjustments made to the current search space. The first is that the subtypes for each family in 𝒫 are updated to only those present in the parents. The second adjustment is a process which acts on the distribution parameter limits for each subtype in 𝒫. This adjustment gives the ability to ‘shrink’ the search space about the region observed in a given population. This method is based on a power law described in [3] that relies on a shrink factor, s. At each iteration, t, every distribution subtype which is present in the parents has its parameter’s limits, (lₜ, uₜ), adjusted. This adjustment is such that the new limits, (lₜ₊₁, uₜ₊₁), are centred about the mean observed value, μ, for that parameter:

$$ \begin{array}{@{}rcl@{}} l_{t+1}&=& \max \left\{l_{t}, \ \mu - \frac{1}{2} (u_{t} - l_{t}) s^{t}\right\} \end{array} $$
(1)
$$ \begin{array}{@{}rcl@{}} u_{t+1}&= &\min \left\{u_{t}, \ \mu + \frac{1}{2} (u_{t} - l_{t}) s^{t}\right\} \end{array} $$
(2)

The shrinking process is given explicitly in Algorithm 5. Note that the behaviour of this process can produce reductive results for some use cases and is optional.
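For a single parameter, the update in (1) and (2) amounts to the following. This is a minimal sketch with a hypothetical function name, where `observed` is assumed to hold the values of that parameter seen among the parents.

```python
def shrink_limits(lower, upper, observed, s, t):
    """Apply (1) and (2) to one parameter's limits at iteration t."""
    mu = sum(observed) / len(observed)  # mean observed value
    half_width = 0.5 * (upper - lower) * s ** t

    # The new limits are centred on mu but never extend past the old ones.
    new_lower = max(lower, mu - half_width)
    new_upper = min(upper, mu + half_width)
    return new_lower, new_upper
```

For example, with limits (0, 1), observed values 0.4 and 0.6, s = 0.5 and t = 1, the half-width is 0.25 and the new limits are (0.25, 0.75). Because of the max and min, the limits can only ever narrow.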

figure g

2.3 Crossover

Crossover is the operation of combining two individuals in order to create at least one offspring. In genetic algorithms, the term ‘crossover’ can be taken literally: two bit strings are crossed at a point to create two new bit strings. Another popular method is uniform crossover, which has been favoured for its efficiency and efficacy in combining individuals in both bit string and matrix representations [7, 28]. For EDO, this method is adapted to support dataset manipulation: a new individual is created by uniformly sampling each of its components (dimensions and then columns) from a set of two ‘parent’ individuals, as shown in Fig. 5.

Fig. 5. The crossover process between two individuals with different dimensions.

Observe that there is no requirement on the dimensions of the parents to be of similar or equal shapes. This is because the driving aim of the proposed method is to explore the space of all possible datasets. In the case where there is incongruence in the lengths of the two parents, missing values may appear in a shorter column that is sampled. To resolve this, values are sampled from the probability distribution associated with that column to fill in these gaps.
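A simplified sketch of this crossover, including the filling of missing values, is given below. It assumes an individual is a pair (columns, metadata) and that columns are matched by position between the parents; both are assumptions of the sketch, as are the names `Uniform` and `crossover`, and the EDO library may pair columns differently.

```python
import random


class Uniform:
    """Hypothetical uniform distribution U(a, b) attached to a column."""

    def __init__(self, a, b):
        self.a, self.b = a, b

    def sample(self, n, rng):
        return [rng.uniform(self.a, self.b) for _ in range(n)]


def crossover(parent1, parent2, rng):
    """Uniform crossover of two (columns, metadata) individuals."""
    cols1, meta1 = parent1
    cols2, meta2 = parent2

    # Each dimension is inherited from one parent or the other.
    n_rows = rng.choice([len(cols1[0]), len(cols2[0])])
    n_cols = rng.choice([len(cols1), len(cols2)])

    columns, metadata = [], []
    for j in range(n_cols):
        # Column j is taken from whichever parents actually have one.
        candidates = [
            (cols[j], meta[j])
            for cols, meta in ((cols1, meta1), (cols2, meta2))
            if j < len(cols)
        ]
        values, dist = rng.choice(candidates)

        values = list(values[:n_rows])  # trim a column that is too long
        # Fill a column that is too short from its own distribution.
        values += dist.sample(n_rows - len(values), rng)

        columns.append(values)
        metadata.append(dist)
    return columns, metadata
```

Since the number of columns is inherited from one of the parents, at least one parent always has a column at each position, so the candidate list is never empty.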

figure h

2.4 Mutation

Mutation is used in evolutionary algorithms to encourage a broader exploration of the search space at each generation. Under this framework, the mutation process manipulates the phenotype of an individual where numerous things need to be modified including an individual’s dimensions, column metadata and the entries themselves. This process is described in Fig. 6.

Fig. 6. The stages of the mutation process.

As shown in Fig. 6, each of the potential mutations occurs with the same probability, pm. However, the way in which columns are maintained ensures that (assuming appropriate choices for f and 𝒫) many mutations in the metadata and the dataset itself will only result in some incremental change in the individual’s fitness relative to, say, a completely new individual.
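Part of the mutation process can be sketched as follows. The sketch covers only the row dimension and the entries; the addition and removal of columns and the mutation of distribution parameters are omitted for brevity. The names `Uniform` and `mutate`, and the representation of an individual as (columns, metadata), are assumptions of the sketch.

```python
import random


class Uniform:
    """Hypothetical uniform distribution U(a, b) attached to a column."""

    def __init__(self, a, b):
        self.a, self.b = a, b

    def sample(self, n, rng):
        return [rng.uniform(self.a, self.b) for _ in range(n)]


def mutate(individual, p_m, row_limits, rng):
    """Mutate a (columns, metadata) individual; each change has probability p_m."""
    columns, metadata = individual
    columns = [list(col) for col in columns]  # leave the input untouched
    n_rows = len(columns[0])
    r_min, r_max = row_limits

    # Dimension stage: possibly remove a random row, then possibly add one.
    if rng.random() < p_m and n_rows > r_min:
        i = rng.randrange(n_rows)
        for col in columns:
            del col[i]
        n_rows -= 1
    if rng.random() < p_m and n_rows < r_max:
        for col, dist in zip(columns, metadata):
            col.extend(dist.sample(1, rng))
        n_rows += 1

    # Entry stage: resample individual values from their column's distribution.
    for col, dist in zip(columns, metadata):
        for i in range(n_rows):
            if rng.random() < p_m:
                col[i] = dist.sample(1, rng)[0]
    return columns, metadata
```

Because replacement values come from the distribution already associated with a column, a mutated individual tends to stay statistically close to its parent, which is the incremental behaviour noted above.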

figure i
figure j

3 A case study in clustering

3.1 k-means clustering

The following examples act as a form of validation for EDO, and also highlight some of the nuances in its use. The objective of these examples is to use the proposed method to reproduce some known results about the clustering of data in the absence of any external forces, and to examine how clustering algorithms are typically evaluated. In particular, the focus will be on the well-known k-means (Lloyd’s) algorithm. Clustering was chosen as it is a well-understood problem that is easily accessible - especially when restricted to two dimensions. The k-means algorithm is an iterative, centroid-based method that aims to minimise the ‘inertia’ of the current partition, Z = {Z1,…, Zk}, of some dataset X:

$$ I(Z, X) := \frac{1}{|X|} \sum\limits_{j=1}^{k} \sum\limits_{x \in Z_{j}} {d(x, z_{j})}^{2} $$
(3)

A full statement of the algorithm to minimise (3) is given in Appendix A.

This inertia function is taken as the objective of the k-means algorithm, and is used for evaluating the final clustering. This is particularly true when the algorithm is not being considered an unsupervised classifier where accuracy may be used [13]. With that, the first example will use this inertia as the fitness function in EDO. That is, to find datasets which minimise I.
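For reference, the inertia in (3) can be computed directly from a clustering. The sketch below assumes that d is the Euclidean distance and that each point carries the index of its assigned centroid; the function name is hypothetical.

```python
import math


def inertia(points, labels, centroids):
    """Mean squared Euclidean distance from each point to its centroid, as in (3)."""
    total = sum(
        math.dist(x, centroids[j]) ** 2
        for x, j in zip(points, labels)
    )
    return total / len(points)
```

As a small worked example, the points (0, 0) and (0, 2) assigned to a centroid at (0, 1), together with the point (10, 0) assigned to a centroid at (10, 0), give an inertia of (1 + 1 + 0) / 3 = 2/3.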

For the purposes of visualisation, EDO is restricted to the space containing only two-dimensional datasets, i.e. C = ((2,2)). In addition to this, all columns are formed from uniform distributions where the bounds are sampled from the unit interval. Thus, the only family in 𝒫 is:

$$ \mathcal{U} := \left\{U(a, b)~|~a, b \in [0, 1]\right\} $$
(4)

The remaining parameters are as follows: N = 100, R = (3,100), M = 1000, b = 0.2, l = 0, pm = 0.01, and shrinkage is excluded. Figure 7 shows an example of the fitness (above) and dimension (below) progression of the evolutionary algorithm under these conditions up until the 50th epoch.

Fig. 7. (a) Progressions for final inertia and dimension across the first 50 epochs with R = (3,100). (b) Progressions for final inertia and dimension across the first 50 epochs with R = (50,100).

There is a steep learning curve here; within the first 50 generations an individual is found with a fitness of roughly 10⁻¹⁰ which could not be improved on for a further 900 epochs. The same quick convergence is seen in the number of rows. This behaviour is quickly recognised as preferable and was dominant across all the trials conducted in this work. This preference for datasets with fewer rows is expected given that I is the sum of the mean error from each cluster centre. With that, when k is fixed a priori, reducing the number of points in each cluster (i.e. the terms of the second summation) quickly reduces the mean error of that cluster and thus the value of I.

However, something that may be seen as unwanted is a compaction of the cluster centres. Referring to Fig. 8a, the best and median individuals show two clusters that are essentially the same point whereas the worst is a random cloud across the whole of the unit square, which was found in the initial population. The kind of behaviour exhibited by the best performing individuals here occurs in part because it is allowed. There are two immediate ways in which this is allowed: first, that a near-trivial case is included in R and, secondly, that the fitness function does nothing to penalise the proximity of the inter-cluster means, as well as aiming to reduce the intra-cluster means. This kind of unwanted behaviour highlights a subtlety in how EDO should be used: experimentation and rigour are required to properly understand an algorithm’s quality.

Fig. 8. Representative individuals based on inertia with: (a) R = (3,100); (b) R = (50,100). Centroids displayed as crosses.

Hence, consider Fig. 8b where the individuals have been generated with the same parameters as previously except with adjusted row limits, R = (50,100), so as to exclude this trivial case. In these trials, the results are equivalent: the worst performing individuals are without structure whilst the best-performing individuals display clusters that are dense about a single point despite the minimum number of rows being increased. Supposing this was not already a known result, we can see mounting evidence in favour of this compaction being ‘optimal’ behaviour in a dataset for k-means clustering.

However, the fitness function may be addressed still, and more extensive studying may be done. Indeed, the final inertia could be considered a flawed or fragile fitness function if it is supposed to evaluate the efficacy of the k-means algorithm. Incorporating the inter-cluster spread to the fitness of an individual dataset would reduce this observed compaction. For instance, the silhouette coefficient is a metric used to evaluate the appropriateness of a clustering to a dataset and does precisely that. The silhouette coefficient of a clustering of a dataset is given by the mean of the silhouette value, S(x), of each point x ∈ Zj in each cluster:

$$ \begin{array}{lllll} A(x) := \frac{1}{|Z_{j}| - 1} \sum\limits_{y \in Z_{j} \setminus \{x\}} d(x, y), \\ B(x) := \min\limits_{k \neq j} \frac{1}{|Z_{k}|} \sum\limits_{w \in Z_{k}} d(x, w), \\ S(x) := \left\{\begin{array}{lllll} \frac{B(x) - A(x)}{\max\left\{A(x), B(x)\right\}} &\quad \text{if } |Z_{j}| > 1\\ 0 &\quad \text{otherwise} \end{array}\right. \end{array} $$
(5)

The optimisation of the silhouette coefficient is analogous to finding a dataset which increases both the intra-cluster cohesion (the inverse of A) and inter-cluster separation (B). Hence, the objective of minimising inertia is addressed by maximising cohesion. Meanwhile, the additional desire to spread out the clusters is considered by maximising separation.
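The definition in (5) translates into a short pure-Python sketch. It assumes a Euclidean d and a clustering with at least two clusters, and it returns 0 for a point when A(x) and B(x) are both zero, which is a choice made here to avoid division by zero rather than something specified in (5). The function name is hypothetical.

```python
import math


def silhouette_coefficient(points, labels):
    """Mean silhouette value S(x) over all points, following (5)."""
    clusters = {}
    for x, j in zip(points, labels):
        clusters.setdefault(j, []).append(x)

    def value(x, j):
        own = clusters[j]
        if len(own) <= 1:
            return 0.0
        # A(x): mean distance to the other points of the same cluster
        # (the distance from x to itself contributes zero to the sum).
        a = sum(math.dist(x, y) for y in own) / (len(own) - 1)
        # B(x): smallest mean distance to the points of any other cluster.
        b = min(
            sum(math.dist(x, w) for w in other) / len(other)
            for k, other in clusters.items()
            if k != j
        )
        denom = max(a, b)
        return (b - a) / denom if denom > 0 else 0.0

    return sum(value(x, j) for x, j in zip(points, labels)) / len(points)
```

In practice an established implementation, such as the one provided by Scikit-learn, would normally be preferred; the sketch is only meant to make the roles of A and B concrete.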

Repeating the trials with the same parameters as with inertia, the silhouette fitness function yields the results summarised in Fig. 9a and b. Irrespective of row limits, the datasets produced show increased separation from one another whilst maintaining low values in the final inertia of the clustering as shown in Fig. 10. Again, the form of the individual clusters is much the same. The low values of inertia correspond to tight clusters, and the tightest clusters are those with a minimal number of points, i.e. a single point. As with the previous example, albeit at a much slower rate, the preferable individuals are those leading toward this case. That this gradual reduction in the dimension of the individuals occurs despite adjusting the fitness function and considering the space which excludes the trivial case bolsters the claim that the base case is also optimal.

Fig. 9. (a) Progressions for silhouette and dimension across 1000 epochs at 100 epoch intervals with R = (3,100). (b) Progressions for silhouette and dimension across 1000 epochs at 100 epoch intervals with R = (50,100).

Fig. 10. Representative individuals based on silhouette with: (a) R = (3,100); (b) R = (50,100). Centroids displayed as crosses.

At this point, it should be noted that, due to the nature of the implementation, any individual from any generation may be retrieved and studied should the final results be too concentrated on any given case. The summary provided here is one particular way of studying the body of datasets generated with this method and this transparency in the history and progression of the proposed method is something that sets it apart from other methods such as GANs which have a reputation of providing so-called ‘black box’ solutions.

3.2 Comparison with DBSCAN

The extent of the capabilities EDO holds as a tool to better understand an algorithm is especially apparent when comparing an algorithm against another (or set of others) simultaneously. This is done by utilising the freedom of choice in a fitness function for EDO. Consider two algorithms, A and B, and some common metric between them, g. Then their similarities and contrasts can be explored by considering the differences in this metric on the two algorithms. In terms of EDO, this means using f = gA − gB, f = gB − gA or f = |gB − gA| as the fitness function. By doing so, pitfalls, edge cases or fundamental conditions for the method may be highlighted. Overall, this process allows the researcher to more deeply learn about the method of interest beyond the traditional method of literature comparison on a particular example.

Consider the following use case with another clustering algorithm of a different form, Density Based Spatial Clustering of Applications with Noise (DBSCAN). In this particular case, the objective is to find datasets for which the method of interest, k-means, outperforms its alternative, DBSCAN. Here there is no concept of inertia as DBSCAN is density-based and is able to identify outliers [11]. As such, a valid metric must be chosen. One such metric is the silhouette score as defined in (5).

In this case, however, an adjustment to the fitness function must be made so as to accommodate for the condition of the silhouette coefficient that there must be more than one cluster present. Let Sk(X) and SD(X) denote the silhouette coefficients of the clustering found by k-means and DBSCAN respectively. Then the fitness function is defined to be:

$$ f(X) = \left\{\begin{array}{lllll} S_{D} (X) - S_{k} (X), &\quad \begin{array}{ll} \text{if DBSCAN identifies two or} \\ \text{more clusters (inc.\ noise)} \end{array} \\ \infty &\quad \ \ \text{otherwise.} \end{array}\right. $$
(6)

There are several remarks to be made here. First, note the order of the subtraction here as EDO minimises fitness functions by default. Also, f takes values in the range [−2, 2] where −2 is the best, i.e. SD(X) = −1 and Sk(X) = 1. Likewise, 2 is the worst score. Finally, the silhouette coefficient requires at least two clusters to be present and so if DBSCAN identifies a single cluster then that individual will be penalised heavily under this fitness function when, in fact, that clustering may be of high quality. As such, this fitness function may require adjustment.
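The piecewise definition in (6) can be written as a small function. The sketch assumes that the two silhouette coefficients and the number of clusters found by DBSCAN (counting noise as a cluster) have already been computed; the function and argument names are hypothetical.

```python
import math


def comparison_fitness(s_kmeans, s_dbscan, n_dbscan_clusters):
    """Fitness (6): lower values favour k-means over DBSCAN."""
    if n_dbscan_clusters < 2:
        # The silhouette coefficient is undefined for a single cluster,
        # so such datasets receive an infinite penalty.
        return math.inf
    return s_dbscan - s_kmeans
```

The best attainable value is −2, reached when k-means scores 1 and DBSCAN scores −1.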

It must also be acknowledged that k-means and DBSCAN share no common parameters and so direct comparison is more difficult. For the purposes of this example, only one set of parameters is used but a thorough investigation should include a parameter sweep in similar, real-world use cases. The parameters being used are k = 3 for k-means, and 𝜖 = 0.1, MinPoints = 5 for DBSCAN. This set was chosen following informal experimentation using the Python library Scikit-learn [27] to find comparable parameters in the given search space defined by the EDO parameters used previously with R = (50,100).

Figure 11 shows a summary of the progression of EDO for this use case. As with the previous examples where R = (50,100), the variation in the population fitness is unstable but there is a clear trend of improvement in the best individual over the course of the run. There is also a convergence seen in the number of rows a dataset has. The resting dimension varied across the trials conducted in this work but none exhibited a dramatic shift toward the lower limit of 50 rows as with previous examples. This is suggestive of a more competitive environment for individuals where slight changes to an individual can drastically alter their fitness.

Fig. 11. Progressions for difference in silhouette (k-means-preferable) and dimension across 1000 epochs at 100 epoch intervals.

The effect of such changes can be seen in Fig. 12 where representative individuals are shown for this example. Here, the best performing individual, when clustered by k-means, shows three clear and nicely separated clusters. Note that they are not so tightly packed; again, this suggests that the route to an optimal individual is less clearly defined. In contrast, when the same dataset is clustered by DBSCAN a single cluster is found with a single noise point held within the convex hull of the cluster, i.e. there are overlapping clusters (since noise points form a single cluster). Hence, along with the fact that the larger cluster is widely spread, it follows that the clustering has a relatively small, negative silhouette coefficient.

Fig. 12. Representative individuals from a k-means-preferable run with clustering by: (a) k-means; (b) DBSCAN. Concave and convex hulls illustrated by shading and outline respectively.

Another point of interest here is the convexity of the clusters. A known condition for the success of k-means is that the presented clusters are of roughly equal size and are convex. This is due to the overall objective being to approximate the centroidal Voronoi tessellation [9]. Without this condition, up to the correct choice of k, the algorithm will fail to produce adequate results for either inertia or silhouette. DBSCAN, on the other hand, does not have this condition and is able to detect non-convex clusters so long as they are dense enough. Figure 12 shows the clustering found by each method and the respective convex and concave hulls of the clusters found. The ‘concave hull’ of a cluster is taken to be the α-shape of the cluster’s data points [10] where α is determined to be the smallest value such that all the points in the cluster are contained in a single polygon. The convexity of cluster Zj, denoted 𝒞j, is then determined to be the ratio of the area of its concave hull, Hc, to the area of its convex hull, Hv [30]:

$$ \mathcal{C}_{j} := \frac{area(H_{c})}{area(H_{v})} $$
(7)

With this definition, it should be clear that a perfectly convex cluster, such as a single point or line, would have 𝒞j = 1.
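Given the two hulls as ordered lists of polygon vertices, the ratio in (7) can be sketched with the shoelace formula. Computing the α-shape itself is outside the scope of this sketch, so both hulls are assumed to be supplied; the function names are hypothetical, and a degenerate hull with zero area is treated as perfectly convex in line with the remark above.

```python
def polygon_area(vertices):
    """Area of a simple polygon from its ordered vertices (shoelace formula)."""
    n = len(vertices)
    twice_area = sum(
        vertices[i][0] * vertices[(i + 1) % n][1]
        - vertices[(i + 1) % n][0] * vertices[i][1]
        for i in range(n)
    )
    return abs(twice_area) / 2


def convexity(concave_hull, convex_hull):
    """Ratio (7) of the concave-hull area to the convex-hull area."""
    convex_area = polygon_area(convex_hull)
    if convex_area == 0:
        return 1.0  # a single point or a line is taken to be perfectly convex
    return polygon_area(concave_hull) / convex_area
```

For an L-shaped cluster with concave hull (0,0), (2,0), (2,1), (1,1), (1,2), (0,2) and convex hull (0,0), (2,0), (2,1), (1,2), (0,2), the areas are 3 and 3.5, giving a convexity of 6/7.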

It can be seen that the convexity of the clustering found by k-means appears to be higher than that by DBSCAN. This was apparent across all trials conducted in this work and indicates that the condition for convex clusters is being sought out during the optimisation process. Meanwhile, however, it is not clear whether the performance of DBSCAN falls owing to its parameters or the method itself. This is a point where parameter sweeping would prove most useful so as to determine a crossing point for these two driving forces.

Now, to add to the discussion above, the inverse optimisation should be considered. That is, using the same parameters, the datasets for which DBSCAN outperforms k-means with respect to the silhouette coefficient are to be investigated. This is equivalent to using −f as the fitness function except with the same penalty of ∞ for the case set out in (6).

Figures 13 and 14 show the same summary as above with the revised fitness function. Inspecting the former, it is seen that the best fitness found is worse than with the previous example. This, in part, is due to the fact that k-means cannot find a clustering with negative values as no clusters may overlap. It can, however, produce results with small silhouette scores where the clusters are tightly packed. Hence, the best fitness score is now −1 whereas the worst is still 2.

Fig. 13. Progressions for difference in silhouette (DBSCAN-preferable) and dimension across 1000 epochs at 100 epoch intervals.

Fig. 14. Representative individuals from a DBSCAN-preferable run with clustering by: (a) k-means; (b) DBSCAN. Concave and convex hulls illustrated by shading and outline respectively.

Note in the first two frames of Fig. 14a how k-means is forced to split what is evidently a single cluster in two whereas DBSCAN is able to identify the single cluster and the outlying noise (Fig. 14b). The proximity of these clusters has then dragged the silhouette score down for k-means. Referring to Fig. 14b, this kind of behaviour is certainly preferable for DBSCAN under these parameters: the beginning individuals are likely random clouds (as seen in the rightmost two frames of the figure) and the simplest step toward a fit dataset is one that maintains that vaguely dense body with minimal noise points far from it.

As has already been stated, the software implementation of the EDO method has been produced in line with the best practices of open source software development and reproducible research. In aid of this, all of the source code used in these examples (including that used to create the figures) has been archived under the DOI https://doi.org/10.5281/zenodo.3492236. Likewise, all of the data produced to support this case study have been archived under the DOI https://doi.org/10.5281/zenodo.3492228.

4 Conclusion

In this paper we have introduced a novel approach to understanding the quality of an algorithm by exploring the space in which its well-performing datasets exist. Following a detailed explanation of its internal mechanisms, a case study in k-means clustering was offered as validation for the proposed method. The method was able to reveal some known results without prior knowledge when investigating k-means in several scenarios, and again when comparing k-means and another leading clustering method, DBSCAN.

The method itself utilises biological operators to traverse a potentially broad region of the space of all possible datasets. This is done in an organic way with a minimal external framework attached. The generative nature of the proposed method also provides transparency and richness to the solution when compared to other contemporary techniques for artificial data generation as the entire history of individuals is preserved. While other search and optimisation methods exist, the decision to use an EA here was down to this transparency and the ease with which to implement biological operators that are both meaningful and easily understood.

The Evolutionary Dataset Optimisation method is dependent on a number of parameters set out in this work, one of which is the choice of distribution families; these families go on to define the general statistical shape of the columns of the datasets that are produced and also control the data types present. The relationship between columns and their associated distribution is not causal, and appropriate methods should be employed to understand the structure and characteristics of the data produced before formal conclusions are made, as set out in the case study provided.

It is known that EAs might terminate at a local optimum and may not be able to traverse the entire sample space [35], or even a sufficient part of it. This would be even more problematic in the case presented in this work, where the sample space is not even of a fixed size or data type. In all experiments carried out for this work, this theoretical limitation has not arisen. Figure 15 shows an exploration of the sample space and it is evident that the EDO method was able to explore a large proportion of it. In the early stages, it is also clear how the EA got stuck in small parts of the search space before later moving toward a subregion of the unit square.

Fig. 15. A scatter of all the individuals found at 50 epoch intervals in the first example of Section 3.2, i.e. those summarised in Fig. 12

Although this does provide evidence to say that the EA’s current design can sufficiently explore its given search space, it does not provide any guarantee that this will happen, even in expectation. Proving this theoretically is an area for further investigation.

Something that does stand against EAs is their tendency to find the ‘easy’ way out, that is, to reduce down to the simplest solution which solves the given problem. In most cases, that is not a problem and is often, in fact, favourable. Throughout the case study provided, this is seen to happen. Figure 15 shows this behaviour again by the strong diagonal region in later generations. In that particular example, the easiest solution for the EA (i.e. for k-means to outperform DBSCAN) was to collapse one dimension of the search space to make the problem one-dimensional. This kind of behaviour is not necessarily a bad thing as trivial, basic and simple cases are of great importance when understanding an algorithm’s quality.

However, should that be a problem, then the objective function could be adjusted accordingly. In the case study, several iterations of fitness functions were examined but each was adjusted by hand according to what was apparent at the time. Due to the architecture of the implementation of this method, this was practical to do. For instance, a similar strategy could be employed automatically by a more sophisticated fitness function that retains some information about the datasets generated from previous runs of EDO on a particular (or at least similar) parameter set. In this way, the currently completely unsupervised learning conducted by the EA could be ushered away from less helpful solutions (via some penalty, say) and toward previously unexplored behaviours. This automatic, iterative application of the proposed method would likely reveal more sophisticated insights into a particular algorithm.
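One way such a history-aware fitness function might look is sketched below in Python. This is purely illustrative and not part of the EDO implementation: the names `summarise` and `make_penalised_fitness`, the use of column means as a dataset summary, and the `archive`, `radius` and `penalty` arguments are all assumptions made for the example. The base fitness is assumed to be minimised, so the penalty is added to it.

```python
from math import dist


def summarise(dataset):
    """Reduce a dataset (a list of equal-length numeric rows) to its column means.

    Column means are a deliberately crude, hypothetical summary; any
    fixed-length description of a dataset could be used in their place.
    """
    n_rows = len(dataset)
    return tuple(sum(column) / n_rows for column in zip(*dataset))


def make_penalised_fitness(base_fitness, archive, radius, penalty):
    """Wrap `base_fitness` so that datasets resembling archived ones score worse.

    `archive` holds summaries of datasets kept from previous runs. A new
    dataset whose summary lies within `radius` of any archived summary
    with the same number of columns has `penalty` added to its fitness.
    """
    def fitness(dataset):
        summary = summarise(dataset)
        too_close = any(
            # summaries of differing length are never compared, since
            # the number of columns is not fixed between individuals
            len(old) == len(summary) and dist(old, summary) < radius
            for old in archive
        )
        return base_fitness(dataset) + (penalty if too_close else 0.0)

    return fitness
```

After each run, the summaries of its best individuals could be appended to `archive` before the next run begins, so that later runs are steered away from regions already visited.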

In essence, the proposed method is merely a tool that demonstrates the benefit of the flipped paradigm set out in this work. The concept of where ‘good’ datasets exist is not something that is well-documented in the literature and the hope of this work is that Evolutionary Dataset Optimisation acts as a starting point for further works to come.