In this section, we present the features implemented in MDCGen, emphasizing the improvements and aspects that differentiate our tool from previous proposals. For a better understanding, the explanations follow the pseudocode, paying special attention to the characteristic parts of MDCGen. Listing 1 gives an overall view of the MDCGen sequential procedure. The inputs and outputs of the MDCGen algorithm are discussed at length in Section 4.
As for input arguments, MDCGen works with a set of parameters; optionally, users can import histograms from their own research, experiments, and applications and use them as empirical distributions for generating cluster point values.
The first steps taken by the MDCGen algorithm are the following:
- lin.1
CHECK CONSISTENCY OF input_parameters, where the consistency of the provided parameterization is checked. Wrong parameter combinations and assignments generate errors and abort the program execution.
- lin.2
INITIALIZE global_variables, where all global variables and structures are initialized. Undefined parameters and parameters marked for randomization take definite values during this phase.
Object Distributions
In a synthetic dataset, cluster objects are points located in an N-dimensional space. For the generation of each cluster, an independent subspace is created with a cloud of points whose placement is determined by one or more underlying distributions. The presented tool enables the creation of N-dimensional clouds of points (r1) generated by using any of the following distribution functions: (a) Uniform, (b) Normal (a.k.a. Gaussian), (c) Logistic, (d) Gamma, (e) Triangular, and (f) Gap or Ring-shaped (depending on how it is finally applied). Fig. 1, left plot, shows probability density function (pdf) curves of the available distributions. In subsequent steps, cluster subspaces are rotated, transformed, translated, and finally fused together in the output space. In addition to the distributions mentioned above, MDCGen allows importing arbitrary distributions by providing histograms as input arguments (r2).
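The listed distribution families can be sketched with NumPy's random generators. This is an illustrative Python sketch, not MDCGen's own code; the loc/scale values are ours, not MDCGen defaults:

```python
import numpy as np

rng = np.random.default_rng(0)
m = 1000  # samples per dimension

# One sampler per supported family (the Gap/Ring shape arises later, when a
# distribution is applied to intra-distances rather than to dimensions).
samplers = {
    "uniform":    lambda: rng.uniform(-0.1, 0.1, m),
    "normal":     lambda: rng.normal(0.0, 0.1, m),
    "logistic":   lambda: rng.logistic(0.0, 0.1, m),
    "gamma":      lambda: rng.gamma(shape=2.0, scale=0.05, size=m),
    "triangular": lambda: rng.triangular(-0.1, 0.0, 0.1, m),
}
samples = {name: draw() for name, draw in samplers.items()}
```

Each entry yields one cluster dimension's worth of values; empirical histograms imported by the user would simply add another sampler to this table.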
Some previous dataset generators use Gaussian distributions to randomly establish values in every dimension; hence, cluster points are located following multivariate normal distributions. When no correlation between dimensions is set, even though the overall variance is not affected by dimensionality, the Euclidean distances between points tend to become equal; their average value increases with the number of dimensions. This phenomenon—thoroughly explained by Thirey and Hickman (2015)—is related to the curse of dimensionality and affects the capability of classifiers to reach proper partitions (Beyer et al. 1999; François et al. 2007). Figure 1, right plot, reproduces a graph shown by Thirey and Hickman (2015) containing the theoretical pdf curves of multivariate normal distributions. We have verified the theory and superimposed the histograms of corresponding clusters generated with MDCGen.
A similar effect happens whenever point values are established variable by variable and based on distributions whose probability masses more or less coincide, provided there is no correlation or dependency between variables and the distributions are not heavy-tailed. Our tool allows setting distributions for every separate dimension or variable, as usual, but also setting a global distribution for the cluster intra-distances (r2, r3), i.e., objects take random values for all dimensions, but it is their distance to the centroid that follows the selected distribution. As far as we know, this function has not been implemented in any of the publicly available generators so far. Figure 2 shows two clusters formed by a logistic distribution with identical parameters and random seeds. The difference resides in the fact that, in plot (a), the distribution was assigned to every variable, whereas in plot (b), the distribution defined cluster intra-distances. Already with only three dimensions, it is possible to observe how Euclidean distances between points become more alike for the multivariate case (a).
In Listing 1, steps directly related to the generation of sample values are as follows:
- lin.3
SET distributions FOR cluster_intra_distances OR cluster_dimensions, where distributions are linked either independently to every cluster dimension (if defined as multivariate) or to every cluster as a whole (radial-based case), in such a case by defining the distribution of point-to-center linear distances. A cluster being multivariate or having radial-based intra-distances is also a randomizable parameter.
- lin.8
GENERATE clusters IN isolated_subspaces, where object values are generated for every cluster. Listing 2 explores this part of the algorithm. In multivariate cases, values are simply generated for every dimension according to the selected distribution. In radial-based cases, an auxiliary object set is first randomly generated with a uniform distribution. Every object vector is then divided by its magnitude to transform it into a unit vector (i.e., normalized, so that all vectors are separated from the origin by a distance equal to 1). Later, a set of distances is randomly generated based on the selected cluster distribution. These distances are multiplied by the unit vectors to finally achieve the desired radial-based distribution for the cluster intra-distances (i.e., linear distances of cluster objects to the cluster center) in the N-dimensional space. The following example illustrates the difference between “radial-based” and “multivariate.” Imagine a three-dimensional cluster A to be created with m samples. If “multivariate” is selected and Gaussian is the desired distribution function for all dimensions, the cluster generation process follows these steps:
- 1.
Values are independently assigned to every dimension,
$$ X=\{x_{1},x_{2},...,x_{m}\}, \qquad X \in G$$
$$ Y=\{y_{1},y_{2},...,y_{m}\}, \qquad Y \in G$$
$$ Z=\{z_{1},z_{2},...,z_{m}\}, \qquad Z \in G$$
where G is the set that contains all sets generated by Gaussian distributions.
- 2.
Cluster A is formed, where the i-th object of cluster A is:
$$ \textbf{a}_{\textbf{i}}=(x_{i},y_{i},z_{i})$$
If, instead, “radial-based” is selected, with Gaussian again as the desired distribution function, the cluster generation process is as follows:
- 1.
Values are independently assigned to every dimension,
$$ X=\{x_{1},x_{2},...,x_{m}\}, \qquad X \in U$$
$$ Y=\{y_{1},y_{2},...,y_{m}\}, \qquad Y \in U$$
$$ Z=\{z_{1},z_{2},...,z_{m}\}, \qquad Z \in U$$
where U is the set that contains all sets generated by Uniform distributions.
- 2.
The auxiliary cluster B is formed, where the i-th object of B is
$$\textbf{b}_{\textbf{i}}=(x_{i},y_{i},z_{i})$$
- 3.
Later, the objects of B are normalized so that their magnitude (i.e., distance to the cluster origin) becomes 1. For the i-th object of B:
$$\hat{\textbf{b}_{\textbf{i}}}=\frac{\textbf{b}_{\textbf{i}}}{|\textbf{b}_{\textbf{i}}|}$$
- 4.
A new set of values D that represent object-to-center distances is created:
$$ D=\{d_{1},d_{2},...,d_{m}\}, \qquad D \in G$$
where G is again the set that contains all sets generated by Gaussian distributions.
- 5.
Cluster A is formed by multiplying every normalized object by its corresponding distance in D. Therefore, the i-th object of cluster A is (note that \(d_{i}\) is a scalar):
$$\textbf{a}_{\textbf{i}}=d_{i} \times \hat{\textbf{b}_{\textbf{i}}} $$
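The two generation modes can be sketched as follows. This is an illustrative NumPy reimplementation under our own function names, not MDCGen's actual code:

```python
import numpy as np

def multivariate_cluster(rng, m, n_dims, sigma=0.1):
    """Each dimension drawn independently from a Gaussian."""
    return rng.normal(0.0, sigma, size=(m, n_dims))

def radial_cluster(rng, m, n_dims, sigma=0.1):
    """The point-to-center distances follow the Gaussian instead."""
    # Auxiliary uniform objects, normalized into unit vectors (directions).
    b = rng.uniform(-1.0, 1.0, size=(m, n_dims))
    b_hat = b / np.linalg.norm(b, axis=1, keepdims=True)
    # Distances drawn from the selected distribution (Gaussian here).
    d = rng.normal(0.0, sigma, size=(m, 1))
    return d * b_hat

rng = np.random.default_rng(1)
A_multi = multivariate_cluster(rng, 500, 3)
A_radial = radial_cluster(rng, 500, 3)
r = np.linalg.norm(A_radial, axis=1)   # equals |d|: intra-distances by design
```

In the radial case, negative d simply flips the direction of the unit vector, so the resulting intra-distances follow the absolute value of the chosen distribution.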
Cluster Placement
Before cluster subspaces and their corresponding clouds of points are individually generated, it must be determined how and where to place such subspaces in the output space. This part of the dataset generation is tricky and has a considerable impact on classifier performance. It must be possible to create clusters with variable inter-distances within the same dataset (some of them close to each other, some of them far from one another). To address this issue, our tool initially limits each dimension to a closed [0, 1] value domain and later draws an imaginary grid to hang cluster subspaces at grid intersections ([0, 1] boundaries might be crossed in certain special cases, e.g., when a cluster with high size or sparsity in at least one dimension is placed close to output space borders; in any case, the origins of cluster subspaces are always located within [0, 1] ranges—example in Fig. 5). Every dimension is divided by αi equidistant hyperplanes, where i marks the specific dimension. By default, the grid granularity depends on the given number of clusters k.
Equation 1 provides the default definition of αi:
$$ \alpha_{i}= 2+ C_{i}\left\lfloor1+ \frac{k}{\ln k}\right\rfloor $$
(1)
where Ci (alpha_constant) is a configurable parameter (set to 1 by default). If desired, αi can be independently adjusted for each dimension (the tool ensures that the selected α values create a grid whose total number of intersections is larger than k). For instance, in a two-dimensional space (x, y), given k = 7, by default \(\alpha _{x}=\alpha _{y}= 2+\lfloor 1+ \frac {7}{\ln 7}\rfloor = 6\). The addend “+ 2” corresponds to the hyperplanes that take the 0 and 1 values in the i-dimension—cluster subspaces are not allowed to be centered there. In our example, this means that, out of the 6 ⋅ 6 = 36 hyperplane intersections, 20 non-usable intersections lie at the grid borders, leaving 16 valid intersections available to locate the 7 cluster subspaces. The explained procedure is illustrated in the examples of Fig. 3. The related step in Listing 1 is:
- lin.4
GENERATE underlying_grid, which establishes the number of hyperplanes per dimension and the valid hyperplane intersections. Details are provided in Listing 3. The WHILE loop ensures that the grid contains enough intersections for all clusters.
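Equation 1 and the intersection counts from the worked example can be checked with a short, illustrative Python snippet (function name is ours):

```python
import math

def alpha(k, c=1):
    """Hyperplanes per dimension from the number of clusters (Eq. 1)."""
    return 2 + c * math.floor(1 + k / math.log(k))

# Worked example from the text: k = 7 in a two-dimensional space.
a = alpha(7)             # 6 hyperplanes per dimension
total = a * a            # 36 intersections overall
valid = (a - 2) ** 2     # 16 interior (usable) intersections
border = total - valid   # 20 non-usable border intersections
```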
Space intersections are numbered and shuffled according to a uniform random permutation. Later on, the first k intersections are selected and their indices decomposed according to the α values. For example, in a three-dimensional space, intersection Ij transforms into (xj, yj, zj), where \(I_{j} = x_{j} + y_{j}\alpha_{x} + z_{j}\alpha_{x}\alpha_{y}\). Such indexing is only performed for a low number of dimensions to guarantee no subspace overlap; the remaining dimension coordinates are randomly generated (otherwise the one-dimensional indexing would soon become unfeasible for high-dimensional grids). To break the cluster alignment caused by the grid arrangement, clusters are finally translated by a random distance that depends on the grid cell size (i.e., hyperplane separations). Steps in Listing 1 that cover this part are as follows:
- lin.5
CALCULATE base_intersections BASED ON underlying_grid, which outputs an array with indexes that correspond to base intersections. Base indicates that the intersections belong to a dimensionally reduced subspace with enough intersections to allocate all desired clusters. Unless users desire output spaces with very few intersections and design them accordingly, a reference for the minimum number of base intersections is fixed by the ad hoc, experimentally derived (2):
$$ \beta = 2k + \frac{\text{outliers}}{k} $$
(2)
where β stands for base intersections, k is the number of clusters and outliers is the number of outliers. Listing 4 explores this step.
- lin.6
CALCULATE centroid_coordinates_set BASED ON base_intersections, where every cluster centroid is assigned a unique location in the final solution space based on the intersection index. Listing 5 delves into this step.
- lin.11
PLACE clusters IN output_space BASED ON centroid_coordinates_set simply takes the vectors of every cluster, adds the corresponding centroid vector, and joins all clusters in a single matrix (i.e., the dataset or output_space). Before this step, clusters hang in isolated subspaces with the preliminary centroid located at the coordinate origin.
Retrieving the example in Section 3.1 in which a three-dimensional cluster A was generated, in this step cluster A—after applying additional transformations and operations configured by the user—is joined to the other clusters in the same space and hung at its corresponding location by adding the cluster centroid coordinates to every object vector. If A′ is the expression of the cluster in the final space and cA stands for its corresponding centroid location, the i-th object of A′ becomes
$$\textbf{a}_{\textbf{i}}^{\prime}=\textbf{a}_{\textbf{i}}+\textbf{c}_{\textbf{A}}$$
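Assuming a mixed-radix reading of the index decomposition (our interpretation), the intersection selection and the final translation step can be sketched together in NumPy (illustrative names and values):

```python
import numpy as np

def decompose(index, alphas):
    """Mixed-radix decomposition: one grid coordinate per dimension."""
    coords = []
    for a in alphas:
        coords.append(index % a)
        index //= a
    return np.array(coords)

rng = np.random.default_rng(1)
alphas = (6, 6, 6)   # hyperplanes per dimension
cell = 1.0 / 6       # hyperplane separation

# Jumble all intersections and keep the first k as centroid locations.
perm = rng.permutation(6 * 6 * 6)
k = 7
centroids = [decompose(i, alphas) * cell for i in perm[:k]]

# Place a cluster: add its centroid vector to every object vector.
A = rng.normal(0.0, 0.02, size=(100, 3))   # cluster in its own subspace
A_prime = A + centroids[0]
```

Because the permutation indices are unique and the decomposition is a bijection, no two clusters can land on the same intersection.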
The presented way of fusing cluster subspaces solves some issues related to cluster placement and makes it easy to implement some desired functionalities:
(r4) Cluster overlap is easily controlled by scaling distribution parameters in accordance with the size of the hyperrectangles (or N-dimensional cells) described by the grid. Examples are shown in Figs. 4 and 5.
(r12) There is no need to implement iterative algorithms for cluster placement, as cluster subspaces are paired with unique grid intersections through a one-dimensional index. Also, when placing outliers, there is no need to check whether outliers fall inside cluster influence areas, because outliers are directly scattered around the unused grid intersections.
(r6) Grid hyperplanes or divisions are configurable. The design of the grid will partially define if clusters are close to one another, far away from each other or a combination of both, therefore generating variable cluster inter-distances (example in Fig. 4).
(r10) Grid hyperplanes are configurable per dimension. Provided the configuration suffices for the required number of intersections (above k), clusters can be distant in the overall space but overlapping when subspaces are independently evaluated (example in Fig. 4).
Overlap Control
Overlap control is achieved through the design of the input parameters, mainly the type of distributions, the compactness coefficients, and the grid granularity together with the scale option. The type of distribution has an obvious effect on the potential overlap, as distributions show different sparsity by definition (see Fig. 1, left plot). MDCGen uses distributions to define either feature values independently or object intra-distances directly in the N-dimensional space, which has a direct impact on the space required by every cluster. Compactness coefficients (cp) directly define variance parameters in the available distributions (e.g., σ in the Gaussian and Logistic cases, lower and upper thresholds in the triangular one), whereas mean parameters are set to “0” prior to any translation. In imported distributions, cp acts as an additional scaling factor not linked to the grid scaling. On the other hand, the scale parameter controls the scaling of cluster values both by a constant factor and based on the grid size (i.e., hyperplane separations). The example in Fig. 5 helps to understand how the overlap control works. In the two-dimensional example, two multivariate Gaussian clusters have been created with the same type of distribution and compactness coefficient (i.e., cp = σ = 0.1) but, whereas Cluster A is not scaled, Cluster B is scaled in line with the grid size. As for the steps in Listing 1, this involves
- lin.7
MODIFY cluster_compactness_set BASED ON cluster_scaling_factors, which defines a coefficient for every cluster before the generation of object values. Listing 6 offers further explanations for this step.
Given that the space for cluster placement is always enclosed within the [0,1]-hypercube, it is not difficult to control the overlap during the dataset parameterization. In any case, MDCGen evaluates overlap by means of Silhouettes (r16), which give each object a score between −1 and 1 to assess intra-cluster compactness and inter-cluster separation (Rousseeuw 1987). Related steps in Listing 1 are as follows:
- lin.12
CALCULATE cluster_inter_distances AND cluster_intra_distances, where cluster inter- and intra-distances as well as dataset geometrical properties are calculated. Such calculations and estimations allow using other cluster compactness vs. distance coefficients and measures in addition to Silhouette.
- lin.16
CALCULATE silhouette_performance, which calls Silhouette algorithms.
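The Silhouette evaluation can be sketched in a few lines. This is a simplified NumPy version of Rousseeuw's coefficient, not MDCGen's implementation, and it assumes every cluster has at least two members:

```python
import numpy as np

def silhouette(X, labels):
    """Mean silhouette coefficient (Rousseeuw 1987); values lie in [-1, 1]."""
    # Full pairwise Euclidean distance matrix (fine for small sketches).
    D = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
    n = len(X)
    scores = []
    for i in range(n):
        same = labels == labels[i]
        a = D[i, same & (np.arange(n) != i)].mean()    # intra-cluster cohesion
        b = min(D[i, labels == c].mean()               # nearest-cluster separation
                for c in np.unique(labels) if c != labels[i])
        scores.append((b - a) / max(a, b))
    return float(np.mean(scores))

rng = np.random.default_rng(4)
X = np.vstack([rng.normal(0.0, 0.05, (50, 2)),
               rng.normal(1.0, 0.05, (50, 2))])
y = np.repeat([0, 1], 50)
score = silhouette(X, y)    # well-separated clusters score close to 1
```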
Subspaces, Outliers, and Noise
To avoid resorting to trial-and-error processes for ensuring that outliers do not fall within clustered areas, outliers are directly spread around grid intersections where no clusters are defined. In addition, it is possible to add an arbitrary number of irrelevant noisy features (r8). Noise is generated by uniform distributions within the [0, 1] value range and can be defined for specific clusters and dimensions, allowing the creation of subspace clusters (r11). The arrangement of clusters and outliers over the underlying grid structure enables the natural generation of subspace outliers. Figure 6 shows an example of a three-dimensional dataset with normal clusters, subspace clusters, global outliers, and subspace outliers. Detailed in Listing 7, the steps of Listing 1 that handle outliers and noise are as follows:
- lin.13
PLACE outliers IN output_space, which uses the remaining free base intersections to locate outliers, as done before with cluster centroids. However, in this case, base intersections are reused following a circular shift. Note that this assignment is only performed for a reduced number of dimensions; the rest are again established at random. As with cluster centroids, a final deviation based on the hyperplane separation is applied to avoid alignment with grid intersections.
- lin.14
ADD noise IN output_space, which simply adds noise to global dimensions or cluster dimensions by replacing the generated values with uniform noise.
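Both steps can be sketched as follows. The free intersections, counts, and drift range below are invented for the example:

```python
import numpy as np

rng = np.random.default_rng(5)
cell = 1.0 / 6                                           # hyperplane separation
free = np.array([(1, 1), (1, 4), (4, 2)], dtype=float)   # intersections with no cluster

# Outliers reuse the free intersections through a circular shift,
# plus a small drift so they do not align with the grid.
n_outliers = 8
idx = np.arange(n_outliers) % len(free)
outliers = free[idx] * cell + rng.uniform(-0.25, 0.25, (n_outliers, 2)) * cell

# Irrelevant noisy feature: replace one dimension with uniform [0, 1] noise.
X = rng.normal(0.5, 0.05, size=(100, 3))
X[:, 2] = rng.uniform(0.0, 1.0, size=100)
```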
Additional Features: Correlations, Rotation, and Labeling
Additional implemented features are:
Feature correlations (r7): MDCGen allows the definition of correlated features by introducing coefficients (either per dataset or per cluster) that state the maximum allowed correlation (positive or negative) between two features. To do so, a correlation matrix C is created for each cluster, and correlation coefficients are randomly generated without exceeding the given threshold. To transform C into a valid covariance matrix, we use the method of Higham (1988), which calculates the nearest symmetric positive semidefinite matrix S. Later, Cholesky decomposition is applied on S to find a matrix L that satisfies S = L⋅L∗ (L∗ is the conjugate transpose of L). Thus, it is possible to compute Y = LX, where X is a set of vectors in which the object values of every cluster dimension are represented as random variables. Y contains the vectors of the final correlated variable values.
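This pipeline can be approximated in NumPy as below; note that nearest_psd is a simplified eigenvalue-clipping stand-in for Higham's (1988) algorithm, not the method itself, and all names are ours:

```python
import numpy as np

def nearest_psd(C, eps=1e-8):
    """Simplified stand-in for the nearest symmetric PSD matrix:
    symmetrize, then clip negative eigenvalues (not Higham's algorithm)."""
    B = (C + C.T) / 2
    w, V = np.linalg.eigh(B)
    return V @ np.diag(np.clip(w, eps, None)) @ V.T

rng = np.random.default_rng(6)
n, m, max_corr = 3, 5000, 0.8

# Random correlation matrix bounded by the configured threshold.
C = np.eye(n)
for i in range(n):
    for j in range(i + 1, n):
        C[i, j] = C[j, i] = rng.uniform(-max_corr, max_corr)

S = nearest_psd(C)             # valid covariance matrix
L = np.linalg.cholesky(S)      # S = L @ L.T
X = rng.normal(size=(n, m))    # uncorrelated cluster values
Y = L @ X                      # correlated cluster values
```

The empirical correlations of the rows of Y then approximate the (normalized) entries of S.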
Cluster rotation (r9): Given the difficulty of explicitly defining a rotation in spaces with more than three dimensions (Daniele 2001), the MDCGen tool limits itself to implementing cluster isometries by generating a random orthonormal matrix Q, which, by means of Y = QX, performs a unitary transformation on X.
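A common way to obtain such a random orthonormal matrix is the QR decomposition of a Gaussian matrix; the sketch below uses that technique as an illustration (the source does not specify how Q is generated):

```python
import numpy as np

def random_orthonormal(rng, n):
    """Random orthonormal matrix via QR decomposition of a Gaussian matrix."""
    Q, R = np.linalg.qr(rng.normal(size=(n, n)))
    return Q * np.sign(np.diag(R))   # sign fix for a uniformly random rotation

rng = np.random.default_rng(7)
Q = random_orthonormal(rng, 4)
X = rng.normal(size=(4, 100))
Y = Q @ X                            # isometry: vector norms are preserved
```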
Labeled dataset (r15): In addition to the N-dimensional dataset, MDCGen generates an array with numerical labels that links objects to the created clusters. Outliers are labeled with the “0” value.
These additional features correspond to the following steps in Listing 1:
- lin.9
MODIFY clusters BASED ON cluster_feature_correlations, which starts constructing a correlation matrix based on the parameterization and later applies the method in Higham (1988) and Cholesky decomposition, see Listing 8. nearestSPD_matrix refers to Higham’s method (Higham 1988).
- lin.10
MODIFY clusters BASED ON cluster_rotation, which operates by creating a random orthogonal square matrix. Also detailed in Listing 8.
- lin.15
GENERATE dataset_labels, where dataset labels are simply generated with positive numbers for clustered objects and 0 for outliers.
Cluster Generation Summary
Finally, we repeat here the key steps of the MDCGen data generation process in an intuitive and summarized way (Listing 1):
- 1.
An N-dimensional grid is generated in the N-dimensional space. Grid granularity is adjusted based on the desired number of clusters, the desired number of dimensions, and configuration parameters related to cluster overlap.
- 2.
Points in the space to locate cluster centroids are linked to unique grid intersections (plus some optional drift).
- 3.
Cluster compactness factors are modified based on the size of grid cells and configuration parameters. Cluster compactness factors define how big clusters are in the final space.
- 4.
Clusters are independently generated in isolated spaces according to the selected distributions, the modified compactness factors, and other configuration parameters.
- 5.
Clusters are independently modified based on additional configuration parameters and options: rotation, correlations, etc.
- 6.
Clusters are joined and placed together in the final space according to the locations reserved for their corresponding centroids (Point 2 of this list).
- 7.
Outliers are generated according to configuration parameters and spread around free grid intersections.
- 8.
Noise is generated according to configuration parameters and added into the final space.
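The eight summarized steps can be condensed into a toy two-dimensional sketch. This is illustrative Python under our own names, with step 5 (rotation, correlations) and feature noise omitted for brevity:

```python
import math
import numpy as np

def tiny_mdcgen(k=4, m=100, n_outliers=10, sigma_factor=0.15, seed=0):
    """Toy 2-D walk-through of the summarized steps (illustrative only)."""
    rng = np.random.default_rng(seed)
    # Step 1: grid granularity from the number of clusters (Eq. 1, C_i = 1).
    a = 2 + math.floor(1 + k / math.log(k))
    cell = 1.0 / a
    # Step 2: reserve unique interior intersections for cluster centroids.
    interior = [(x, y) for x in range(1, a) for y in range(1, a)]
    order = rng.permutation(len(interior))
    centroids = np.array([interior[i] for i in order[:k]], dtype=float) * cell
    # Steps 3-4: compactness scaled to cell size; clusters built in isolation.
    sigma = sigma_factor * cell
    clusters = [rng.normal(0.0, sigma, size=(m, 2)) for _ in range(k)]
    # Step 6: translate each cluster to its centroid and join them.
    X = np.vstack([c + mu for c, mu in zip(clusters, centroids)])
    labels = np.repeat(np.arange(1, k + 1), m)
    # Step 7: outliers reuse the free intersections with a circular shift.
    free = np.array([interior[i] for i in order[k:]], dtype=float) * cell
    idx = np.arange(n_outliers) % len(free)
    out = free[idx] + rng.uniform(-0.25, 0.25, size=(n_outliers, 2)) * cell
    # Step 8 (labels): clusters get positive numbers, outliers get 0.
    return np.vstack([X, out]), np.concatenate(
        [labels, np.zeros(n_outliers, dtype=int)])

X, y = tiny_mdcgen()
```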