MDCGen: Multidimensional Dataset Generator for Clustering
Abstract
We present a tool for generating multidimensional synthetic datasets for testing, evaluating, and benchmarking unsupervised classification algorithms. Our proposal fills a gap observed in previous approaches with regard to underlying distributions for the creation of multidimensional clusters. As a novelty, normal and non-normal distributions can be combined for either independently defining values feature by feature (i.e., multivariate distributions) or establishing overall intra-cluster distances. Being highly flexible, parameterizable, and randomizable, MDCGen also implements classic pursued features: (a) customization of cluster separation, (b) overlap control, (c) addition of outliers and noise, (d) definition of correlated variables and rotations, (e) flexibility for allowing or avoiding isolation constraints per dimension, (f) creation of subspace clusters and subspace outliers, (g) importing arbitrary distributions for the value generation, and (h) dataset quality evaluations, among others. As a result, the proposed tool offers an improved range of potential datasets to perform a more comprehensive testing of clustering algorithms.
Keywords
Clustering, Dataset generator, Synthetic data
1 Introduction
Synthetic datasets are necessary since real data does not allow a controlled and flexible testing of data mining algorithms and cannot be used to obtain generalization. While real-world datasets and scenarios are the ultimate reality check for competitive algorithms, it can be counterproductive to rely on real data during design and development of new algorithms (Färber et al. 2010). In this respect, synthetic datasets are to algorithms like simulations to control strategies; i.e., they are intended to provide testbeds for exhaustive testing. Dataset generators must be flexible and highly parameterizable, covering a broad scope of options and shapes, so that algorithms can be exhaustively checked and stress-tested in a wide variety of situations.
Accordingly, a dataset generator for clustering should satisfy the following requirements:
 r1. Generate datasets in a broad range of dimensions (from 2 to a high N).
 r2. Allow a variety of distributions for generating cluster object values, including the possibility to import users’ own distributions.
 r3. Generate both globular and non-globular clusters regardless of the number of given dimensions.
 r4. Control cluster overlap (avoid, allow, measure).
 r5. Independently control and allow different cluster properties in the same dataset (e.g., size, number of objects, shape, orientation).
 r6. Control cluster inter-distances, allowing clusters close to each other, clusters far away from each other, or an arbitrary mix of both.
 r7. Define dependencies among features in clusters, i.e., manipulate covariances and correlations.
 r8. Incorporate outliers and noisy variables into the dataset if desired. Outliers should cover global, local, and subspace outliers.
 r9. Independently rotate clusters.
 r10. Generate clusters separated in the overall space, but not necessarily when considering subspaces; i.e., cluster structures should not always be detected when scatter plots of paired dimensions are evaluated.
 r11. Generate subspace clusters if desired, i.e., groups of objects that show a clear cluster structure in lower dimensional subspaces but become sparse or noisy when additional dimensions are considered.
 r12. Avoid iterative algorithms (i.e., trial and error) that could slow down or even freeze the generation process in demanding parameterizations.
 r13. Allow high flexibility in the definition and randomization of parameters, so that dataset variability is maximized and specific application needs can be satisfied.
 r14. Reproduce datasets based on random seeds.
 r15. Generate labeled datasets for subsequent evaluations.
 r16. Output evaluations of dataset quality, e.g., overlap evaluation.
Our cluster generator MDCGen (Multidimensional Dataset Generator for Clustering) has been designed to fulfill those requirements. MDCGen is intended for research purposes; therefore it is free, open source, and publicly available to download from our website [omitted for double-blind reviewing]. We provide MDCGen in MATLAB and in Python.
2 Related Work
A classic algorithm for generating datasets with clusters is presented by Milligan and Cooper (1986). Their method creates between one and five clusters located in a space of up to eight dimensions and assigns points to clusters based on three models that can generate clusters of equal and unequal sizes. The generator GenRandomClust (Qiu and Joe 2006) is an improved version of the method of Milligan and Cooper. GenRandomClust adds interesting functionalities, such as the possibility to set the separation between clusters by means of a separation index. In addition, clusters are distant in one dimension, but there is no constraint for the isolation in the remaining dimensions; as a result, clusters cannot be easily detected in scatter plots of paired dimensions. Covariance matrices can be manipulated to offer variable shapes, diameters, and orientations. GenRandomClust also allows the inclusion of noisy variables and outliers. Steinley and Henson present OCLUS as “an analytic method for generating clusters with known overlap” (Steinley and Henson 2005). OCLUS requires the establishment (as input parameters) of overlaps in each pair of adjacent clusters for every dimension. With regard to overlap control specifically, GenRandomClust and OCLUS were compared by Korzeniewski (2013), who concluded that GenRandomClust is less robust as a consequence of only controlling the overlap of the two closest clusters. In any case, cluster overlap is not necessarily something to avoid at all costs, but to control, since “many real world datasets have inherently overlapping clusters” (Banerjee et al. 2005). Indeed, studying how algorithms respond to datasets with overlaps is an interesting and necessary line of research.
Julia Handl worked on clustering techniques and algorithms, e.g., Handl and Knowles (2005), and has published two generators of datasets for clustering that are, together with specific documentation, publicly available (Handl 2017; accessed July 2017). One generator creates clusters based on multivariate normal distributions, allowing the addition of dependencies among features by constructing symmetric, positive-definite random covariance matrices. Clusters are generated in an iterative way, rejecting overlapping clusters and regenerating them afterwards. Since multivariate normal clusters become globular when dimensions increase, a second generator for high-dimensional scenarios is proposed (50 to 100 dimensions). The second generator creates ellipsoidal clusters by defining a main axis in a random orientation and placing points at a “Gaussian-distributed distance from a uniformly random point on the major axis.” Cluster origins are translated based on a genetic algorithm that minimizes a score based on the overall deviation of the data and the overlap.
On the basis of uniform and normal-based distributions, the method of Pei and Zaïane (2006) offers a generator for two-dimensional datasets where final clusters take a high variety of shapes. This approach also includes the incorporation of outliers as random noise or following defined patterns.
Finally, the cluster generator comprised in the ELKI data mining framework (Schubert et al. 2015) creates multidimensional datasets where distributions (uniform, normal, or gamma) are established dimension by dimension. A deep characterization is possible by means of configuration files where parameters are provided with XML tags. “Random seed” is established as an input parameter to allow reproducibility. The ELKI generator implements the manipulation and control of cluster overlaps, cluster sizes, rotations, correlations, scaling, and translations.
3 Implementation
As input arguments, MDCGen takes a set of parameters; optionally, users can import histograms from their own research, experiments, and applications and use them as empirical distributions for the generation of cluster point values.
 lin.1 CHECK CONSISTENCY OF input_parameters, where consistency of the provided parameterization is checked. Wrong parameter combinations and assignments generate errors and exit the program execution.
 lin.2 INITIALIZE global_variables, where the initialization of all global variables and structures is conducted. Not-defined parameters or parameters-to-randomize take definite values during this phase.
3.1 Object Distributions
Some previous dataset generators use Gaussian distributions to randomly establish values in every dimension; hence, cluster points are located following multivariate normal distributions. When no correlation between dimensions is set, the overall variance is not affected by dimensionality, but the Euclidean distances between points tend to become equal, with an average value that increases with the number of dimensions. This phenomenon, thoroughly explained by Thirey and Hickman (2015), is related to the curse of dimensionality and affects the capability of classifiers to reach proper partitions (Beyer et al. 1999; François et al. 2007). Figure 1, right plot, reproduces a graph shown by Thirey and Hickman (2015) containing the theoretical pdf curves of multivariate normal distributions. We have verified the theory and superimposed the histograms of corresponding clusters generated with MDCGen.
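The distance concentration described above can be observed with a few lines of Python. The following is only an illustrative sketch and not part of MDCGen: the function name `pairwise_distance_stats`, the sample size, and the seed are assumptions made for this example. It draws uncorrelated standard Gaussian points and reports the mean pairwise Euclidean distance together with its relative spread.

```python
import math
import random
import statistics


def pairwise_distance_stats(n_dims, n_points=200, seed=0):
    """Return (mean, std/mean) of the Euclidean distances between points
    drawn from an uncorrelated standard Gaussian in n_dims dimensions."""
    rng = random.Random(seed)
    points = [[rng.gauss(0.0, 1.0) for _ in range(n_dims)]
              for _ in range(n_points)]
    distances = [math.dist(points[i], points[j])
                 for i in range(n_points)
                 for j in range(i + 1, n_points)]
    mean = statistics.fmean(distances)
    # Relative spread: a small value means distances are nearly equal.
    spread = statistics.pstdev(distances) / mean
    return mean, spread


if __name__ == "__main__":
    for n in (2, 10, 100):
        mean, spread = pairwise_distance_stats(n)
        print(f"N={n:4d}  mean distance={mean:6.2f}  std/mean={spread:.3f}")
```

As the number of dimensions grows, the mean distance increases while the ratio of standard deviation to mean shrinks, which is the behavior the paragraph attributes to multivariate normal clusters.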
 lin.3 SET distributions FOR cluster_intra_distances OR cluster_dimensions, where distributions are linked either independently to every cluster dimension (if defined as multivariate) or to every cluster as a whole (radial-based case), in such a case by defining the distribution of point-to-center linear distances. A cluster being multivariate or having radial-based intra-distances is also a randomizable parameter.
 lin.8 GENERATE clusters IN isolated_subspaces, where object values are generated for every cluster. Listing 2 explores this part of the algorithm. In multivariate cases, values are simply generated for every dimension according to the selected distribution. In radial-based cases, first an auxiliary object set is randomly generated with a uniform distribution. Every object vector is then divided by its magnitude to transform it into a unit vector (i.e., normalized, so that all vectors lie at a distance of 1 from the origin). Later, a set of distances is randomly generated based on the selected cluster distribution. These distances are multiplied by the unit vectors to finally achieve the desired radial-based distribution for the cluster intra-distances (i.e., linear distances of cluster objects to the cluster center) in the N-dimensional space.
The following example illustrates the difference between “radial-based” and “multivariate.” Imagine a three-dimensional cluster A to be created with m samples. If “multivariate” is selected and Gaussian is the desired distribution function for all dimensions, the cluster generation process follows these steps:
 1. Values are independently assigned to every dimension,
$$ X=\{x_{1},x_{2},...,x_{m}\}, \qquad X \in G$$
$$ Y=\{y_{1},y_{2},...,y_{m}\}, \qquad Y \in G$$
$$ Z=\{z_{1},z_{2},...,z_{m}\}, \qquad Z \in G$$
where G is the set that contains all sets generated by Gaussian distributions.
 2. Cluster A is formed, where the i-th object of cluster A is:
$$ \textbf{a}_{\textbf{i}}=(x_{i},y_{i},z_{i})$$
If, instead, “radial-based” is selected, with Gaussian again being the desired distribution function, the cluster generation process is as follows:
 1. Values are independently assigned to every dimension,
$$ X=\{x_{1},x_{2},...,x_{m}\}, \qquad X \in U$$
$$ Y=\{y_{1},y_{2},...,y_{m}\}, \qquad Y \in U$$
$$ Z=\{z_{1},z_{2},...,z_{m}\}, \qquad Z \in U$$
where U is the set that contains all sets generated by Uniform distributions.
 2. The auxiliary cluster B is formed, where the i-th object of B is
$$\textbf{b}_{\textbf{i}}=(x_{i},y_{i},z_{i})$$
 3. Later, B objects are normalized; therefore, their magnitude (i.e., distance to the cluster origin) becomes 1. For the i-th object of B:
$$\hat{\textbf{b}}_{\textbf{i}}=\frac{\textbf{b}_{\textbf{i}}}{\Vert\textbf{b}_{\textbf{i}}\Vert}$$
 4. A new set of values D that represent object-to-center distances is created:
$$ D=\{d_{1},d_{2},...,d_{m}\}, \qquad D \in G$$
where G is again the set that contains all sets generated by Gaussian distributions.
 5. Cluster A is formed by multiplying every normalized object and its corresponding distance in D. Therefore, the i-th object of cluster A is (note that d_{i} is a scalar):
$$\textbf{a}_{\textbf{i}}=d_{i} \times \hat{\textbf{b}}_{\textbf{i}} $$
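The two procedures of the example can be contrasted in a short Python sketch. This is not the MDCGen implementation: the function names are hypothetical, and the range [-1, 1] for the auxiliary uniform coordinates is an assumption of this example (the text only states that a uniform distribution is used).

```python
import math
import random


def multivariate_cluster(m, n_dims, sigma, rng):
    """Multivariate case: each coordinate is drawn independently."""
    return [[rng.gauss(0.0, sigma) for _ in range(n_dims)] for _ in range(m)]


def radial_cluster(m, n_dims, sigma, rng):
    """Radial-based case: a unit direction times a Gaussian distance."""
    cluster = []
    while len(cluster) < m:
        # Auxiliary object b_i with uniform coordinates (assumed range [-1, 1]).
        b = [rng.uniform(-1.0, 1.0) for _ in range(n_dims)]
        norm = math.sqrt(sum(x * x for x in b))
        if norm == 0.0:
            continue  # a zero vector has no direction; draw again
        d = rng.gauss(0.0, sigma)  # object-to-center distance d_i (a scalar)
        cluster.append([d * x / norm for x in b])  # a_i = d_i * b_i / ||b_i||
    return cluster
```

In the radial-based case the distance of every object to the origin is |d_i|, so the intra-distance distribution does not depend on the number of dimensions, whereas in the multivariate case the typical distance grows with N. Note that normalizing points drawn uniformly in a cube does not yield exactly uniform directions.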
3.2 Cluster Placement
Before cluster subspaces and their corresponding clouds of points are individually generated, it needs to be determined how and where to place such subspaces in the output space. This part of the dataset generation is tricky and has a considerable impact on classifier performances. It must be possible to create clusters with variable cluster inter-distances for the same dataset (some of them close to each other, some of them far from one another). To address this issue, our tool initially limits each dimension to a closed [0, 1] value domain and later draws an imaginary grid to hang cluster subspaces in grid intersections ([0, 1] boundaries might be crossed in certain special cases, e.g., when a cluster with high size or sparsity in at least one dimension is placed close to output space borders; in any case, the origins of cluster subspaces are always located within [0, 1] ranges; see the example in Fig. 5). Every dimension is divided by α_{i} equidistant hyperplanes, where i marks the specific dimension. By default, we define that the grid granularity depends on the given number of clusters k.
 lin.4 GENERATE underlying_grid, which establishes the number of hyperplanes per dimension and the valid hyperplane intersections. Details are provided in Listing 3. The WHILE loop ensures that the grid contains enough intersections for all clusters.
 lin.5 CALCULATE base_intersections BASED ON underlying_grid, which outputs an array with indexes that correspond to base intersections. Base indicates that intersections belong to a dimensional-reduced subspace with enough intersections to allocate all desired clusters. Unless users desire output spaces with very few intersections and design them accordingly, a reference for the minimum value of base intersections is fixed by the ad hoc, experimental (2):
$$ \beta = 2k + \frac{\text{outliers}}{k} \tag{2}$$
where β stands for base intersections, k is the number of clusters and outliers is the number of outliers. Listing 4 explores this step.
 lin.6 CALCULATE centroid_coordinates_set BASED ON base_intersections, where every cluster centroid^{2} is assigned a unique location in the final solution space based on the intersection index. Listing 5 delves into this step.
 lin.11 PLACE clusters IN output_space BASED ON centroid_coordinates_set simply takes vectors of every cluster, adds the corresponding centroid vector, and joins all clusters in a single matrix (i.e., the dataset or output_space). Before this step, clusters hang in isolated subspaces with the preliminary centroid located in the coordinates origin.
If we retrieve the example in Section 3.1 in which a three-dimensional cluster A was generated, in this step, cluster A (after applying additional transformations and operations configured by the user) is joined to other clusters in the same space and hung in its corresponding location by adding the cluster centroid coordinates to every object vector. If A′ is the expression of the cluster in the final space and c_{A} stands for its corresponding centroid location, the i-th object of A′ becomes
$$\textbf{a}^{\prime}_{\textbf{i}}=\textbf{a}_{\textbf{i}}+\textbf{c}_{\textbf{A}}$$
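The placement step amounts to a translation followed by a concatenation. A minimal Python sketch is given below; `place_clusters` is a hypothetical name, and clusters are assumed to be stored as lists of points centred at the origin.

```python
def place_clusters(clusters, centroids):
    """Translate each origin-centred cluster by its centroid and stack them.

    clusters  -- list of clusters, each a list of N-dimensional points
    centroids -- list of N-dimensional centroid vectors, one per cluster
    Returns (dataset, labels); labels start at 1 (0 is left for outliers).
    """
    dataset, labels = [], []
    for label, (cluster, centroid) in enumerate(zip(clusters, centroids), start=1):
        for point in cluster:
            # a'_i = a_i + c_A, applied coordinate by coordinate
            dataset.append([a + c for a, c in zip(point, centroid)])
            labels.append(label)
    return dataset, labels
```

For instance, calling `place_clusters([[[0.0, 0.0]], [[0.0, 0.2]]], [[0.25, 0.75], [0.5, 0.5]])` moves the single point of the first cluster to its centroid and shifts the point of the second cluster by (0.5, 0.5).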

(r4) Cluster overlap is easily controlled by scaling distribution parameters in accordance with the size of the hyperrectangles (or N-dimensional cells) described by the grid. Examples are shown in Figs. 4 and 5.

(r12) There is no need to implement iterative algorithms for the cluster placement as cluster subspaces are paired with unique grid intersections through a one-dimensional index. Also, when placing outliers, there is no need to check if outliers are falling inside cluster influence areas because outliers are directly scattered around the unused grid intersections.

(r6) Grid hyperplanes or divisions are configurable. The design of the grid will partially define if clusters are close to one another, far away from each other or a combination of both, therefore generating variable cluster inter-distances (example in Fig. 4).

(r10) Grid hyperplanes are configurable per dimension. Provided the configuration suffices for the required number of intersections (above k), clusters can be distant in the overall space but overlapping when subspaces are independently evaluated (example in Fig. 4).
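The idea of pairing clusters with unique grid intersections through a one-dimensional index, without trial and error, can be sketched as follows. This is a simplified illustration and not the code of the listings: the function names are hypothetical, the base-intersection reduction is omitted, and the position of hyperplane j at (j + 1) / (α + 1) is an assumption of this example.

```python
import math
import random


def min_base_intersections(k, outliers):
    """Reference minimum of base intersections, Eq. (2): 2k + outliers / k."""
    return 2 * k + outliers / k


def unravel(index, alphas):
    """Convert a one-dimensional index into one hyperplane number per dimension."""
    coords = []
    for alpha in reversed(alphas):
        index, j = divmod(index, alpha)
        coords.append(j)
    return coords[::-1]


def pick_centroids(k, alphas, rng):
    """Pair k clusters with k distinct grid intersections, without retries."""
    total = math.prod(alphas)  # number of intersections in the grid
    if total < k:
        raise ValueError("grid has fewer intersections than clusters")
    indices = rng.sample(range(total), k)  # unique by construction
    # Assumed layout: hyperplane j of a dimension sits at (j + 1) / (alpha + 1).
    return [[(j + 1) / (alpha + 1) for j, alpha in zip(unravel(i, alphas), alphas)]
            for i in indices]
```

Because the indices are sampled without replacement, no two clusters can share an intersection, and the intersections that are not drawn remain free for outliers.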
3.3 Overlap Control
 lin.7 MODIFY cluster_compactness_set BASED ON cluster_scaling_factors, which defines a coefficient for every cluster before the generation of object values. Listing 6 offers further explanations for this step.
 lin.12 CALCULATE cluster_inter_distances AND cluster_intra_distances, where cluster inter- and intra-distances as well as dataset geometrical properties are calculated. Such calculations and estimations allow using other cluster compactness vs. distance coefficients and measures in addition to Silhouette.
 lin.16 CALCULATE silhouette_performance, which calls Silhouette algorithms.
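The two ingredients of the overlap control, scaling compactness to the grid and measuring the result with Silhouette (Rousseeuw 1987), can be sketched in Python as below. Both functions are illustrative assumptions rather than MDCGen code: the hyperplane separation 1 / (α + 1) and the multiplicative use of `scale` are choices made for this example, and `silhouette` expects at least two clusters.

```python
import math


def scaled_compactness(cp, scale, alphas):
    """Scale a compactness coefficient by the grid separation.

    Assumptions: hyperplanes in a dimension with alpha divisions are
    1 / (alpha + 1) apart; a positive scale uses the smallest separation,
    a negative scale the largest.
    """
    separations = [1.0 / (alpha + 1) for alpha in alphas]
    reference = min(separations) if scale > 0 else max(separations)
    return cp * abs(scale) * reference


def silhouette(points, labels):
    """Mean Silhouette coefficient with Euclidean distances."""
    scores = []
    for i, p in enumerate(points):
        sums, counts = {}, {}
        for j, q in enumerate(points):
            if i != j:
                sums[labels[j]] = sums.get(labels[j], 0.0) + math.dist(p, q)
                counts[labels[j]] = counts.get(labels[j], 0) + 1
        own = labels[i]
        if own not in counts:  # singleton cluster: score defined as 0
            scores.append(0.0)
            continue
        a = sums[own] / counts[own]                              # mean intra-distance
        b = min(sums[c] / counts[c] for c in sums if c != own)   # nearest other cluster
        scores.append((b - a) / max(a, b) if max(a, b) > 0 else 0.0)
    return sum(scores) / len(scores)
```

Values close to 1 indicate compact, well-separated clusters; values near 0 or below indicate overlap, which is how a generated dataset can be checked against the intended degree of separation.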
3.4 Subspaces, Outliers, and Noise
 lin.13 PLACE outliers IN output_space, which uses the remaining free base intersections to locate outliers as we did before with cluster centroids. However, in this case, base intersections are reused following a circular shift. Note that this assignment is only performed for a reduced number of dimensions; the rest are again established at random. Similarly to the case of cluster centroids, a final deviation based on the hyperplane separation is applied to avoid alignment with grid intersections.
 lin.14 ADD noise IN output_space, which simply adds noise to global dimensions or cluster dimensions by replacing the generated values by uniform noise.
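Replacing dimensions by uniform noise, either for the whole dataset or for the objects of a single cluster, can be sketched as follows. The function name, the in-place update, and the noise range [0, 1] (chosen to match the feature domain of the output space) are assumptions of this example.

```python
import random


def replace_with_noise(dataset, noisy_dims, rng, rows=None):
    """Overwrite the given dimensions with uniform noise in [0, 1].

    rows=None affects every object (global noisy dimension); a list of row
    indices restricts the noise to one cluster (cluster-specific noise).
    The dataset is modified in place and returned.
    """
    targets = range(len(dataset)) if rows is None else rows
    for r in targets:
        for dim in noisy_dims:
            dataset[r][dim] = rng.uniform(0.0, 1.0)
    return dataset
```

Restricting the noise to the rows of one cluster leaves that cluster recognizable only in the remaining dimensions, which is the basic mechanism behind subspace clusters.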
3.5 Additional Features: Correlations, Rotation, and Labeling

Feature correlations (r7): MDCGen allows the definition of correlated features by introducing coefficients (either per dataset or per cluster) that state the maximum allowed correlation (positive or negative) between two features. To do that, a correlation matrix C is created for each cluster and correlation coefficients are randomly generated without exceeding the given threshold. To transform C into a valid covariance matrix, we use the method of Higham (1988), which calculates the nearest symmetric positive semidefinite matrix S. Later, Cholesky decomposition is applied to S to find a matrix L that satisfies S = L⋅L^{∗} (L^{∗} is the conjugate transpose of L). Thus, it is possible to compute Y = LX, where X is a set of vectors in which object values of every cluster dimension are represented as random variables. Y contains the vectors of the final correlated variable values.

Cluster rotation (r9): Given the difficulty of specifically defining a rotation in spaces with more than three dimensions (Daniele 2001), the MDCGen tool is limited to implementing cluster isometries by generating a random orthonormal matrix Q, which, by means of Y = QX, performs a unitary transformation on X.

Labeled dataset (r15): In addition to the Ndimensional dataset, MDCGen generates an array with numerical labels that links objects to the created clusters. Outliers are labeled with the “0” value.
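The Cholesky step of the correlation procedure, Y = LX with S = L⋅L^{∗}, can be illustrated with a small pure-Python sketch. The function names are hypothetical, and the sketch assumes that S is already symmetric positive definite, so the nearest-matrix correction of Higham (1988) is deliberately left out.

```python
import math
import random


def cholesky(S):
    """Lower-triangular L with S = L * L^T (S symmetric positive definite)."""
    n = len(S)
    L = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1):
            acc = S[i][j] - sum(L[i][t] * L[j][t] for t in range(j))
            L[i][j] = math.sqrt(acc) if i == j else acc / L[j][j]
    return L


def correlate(X, S):
    """Apply y = L x to every object x in X (rows of X are objects)."""
    L = cholesky(S)
    return [[sum(L[i][t] * x[t] for t in range(i + 1)) for i in range(len(x))]
            for x in X]


def pearson(u, v):
    """Sample Pearson correlation of two equally long sequences."""
    mu, mv = sum(u) / len(u), sum(v) / len(v)
    cov = sum((a - mu) * (b - mv) for a, b in zip(u, v))
    return cov / math.sqrt(sum((a - mu) ** 2 for a in u)
                           * sum((b - mv) ** 2 for b in v))
```

Applying `correlate` to uncorrelated, unit-variance features with S = [[1.0, 0.6], [0.6, 1.0]] yields two features whose sample correlation, as measured by `pearson`, is close to 0.6.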
 lin.9 MODIFY clusters BASED ON cluster_feature_correlations, which starts constructing a correlation matrix based on the parameterization and later applies the method in Higham (1988) and Cholesky decomposition, see Listing 8. nearestSPD_matrix refers to Higham’s method (Higham 1988).
 lin.10 MODIFY clusters BASED ON cluster_rotation, which operates by creating a random orthogonal square matrix. Also detailed in Listing 8.
 lin.15 GENERATE dataset_labels, where dataset labels are simply generated with positive numbers for clustered objects and 0 for outliers.
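One possible way to obtain the random orthonormal matrix Q used for cluster rotation is Gram-Schmidt orthogonalization of Gaussian vectors. The sketch below shows this option; it is an assumption of this example and not necessarily how MDCGen builds Q, and the function names are hypothetical.

```python
import math
import random


def random_orthonormal(n, rng):
    """Random n x n orthonormal matrix via Gram-Schmidt on Gaussian vectors."""
    rows = []
    while len(rows) < n:
        v = [rng.gauss(0.0, 1.0) for _ in range(n)]
        for q in rows:  # remove the components along the previous rows
            dot = sum(a * b for a, b in zip(v, q))
            v = [a - dot * b for a, b in zip(v, q)]
        norm = math.sqrt(sum(a * a for a in v))
        if norm > 1e-12:  # skip the (improbable) degenerate draw
            rows.append([a / norm for a in v])
    return rows


def rotate(X, Q):
    """Apply y = Q x to every object x in X (rows of X are objects)."""
    return [[sum(q_i * x_i for q_i, x_i in zip(q, x)) for q in Q] for x in X]
```

Since Q is orthonormal, the transformation preserves all distances to the cluster origin; it may include a reflection, which is consistent with describing the operation as an isometry rather than a pure rotation.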
3.6 Cluster Generation Summary
 1. An N-dimensional grid is generated in the N-dimensional space. Grid granularity is adjusted based on the desired number of clusters, the desired number of dimensions, and configuration parameters related to cluster overlap.
 2. Points in the space to locate cluster centroids are linked to unique grid intersections (plus some optional drift).
 3. Cluster compactness factors are modified based on the size of grid cells and configuration parameters. Cluster compactness factors define how big clusters are in the final space.
 4. Clusters are independently generated in isolated spaces according to the selected distributions, the modified compactness factors, and other configuration parameters.
 5. Clusters are independently modified based on additional configuration parameters and options: rotation, correlations, etc.
 6. Clusters are joined and placed together in the final space according to the locations reserved for their corresponding centroids (Point 2 of this list).
 7. Outliers are generated according to configuration parameters and spread around free grid intersections.
 8. Noise is generated according to configuration parameters and added into the final space.
4 Parameters and Configuration
This section shows the configurable parameters of MDCGen and the possibilities for either randomizing or specifying such parameters, therefore controlling the structure and generation of the final dataset (r13). If a parameter is not defined among the inputs, the tool randomizes its value or applies a value by default. One of the main challenges in the MDCGen design was allowing such randomization and, at the same time, a deep parameter specification. Configurations and decisions are possible at different levels: dataset, cluster, dimension, and cluster-dimension. Moreover, the MDCGen tool is devised to be integrated in testbeds, frameworks or chain processes to provide a stream of different datasets within a given set of desired characteristics. Hence, to enable covering a broad range of dataset possibilities, the parameters are multiple and some training for tuning the tool is required. The performance evaluation generated by MDCGen as well as scatter plots and histograms are suitable ways to control and check the new dataset.
In this section, we provide examples based on the MATLAB version of MDCGen. In the Python and HTML versions, parameters are equivalently defined by means of JSON input files. The generator can be called even with no parameters at all:

sd: random seed [sc], to allow dataset reproducibility (r14).

M: total number of clustered objects (points) in the dataset [sc].

N: number of dimensions [sc].

k: number of clusters and cluster masses [sc, ar].
If k is a scalar, k establishes the number of clusters and, therefore, M objects are randomly distributed among k clusters. By default, the minimum number of points allowed per cluster is stated as a function of k and M, or it equals the input parameter km, if defined. If k is entered as an array, the number of clusters is the array length, whereas array values are taken as the number of objects embraced by each individual cluster.

km: minimum absolute number of objects per cluster [sc].

d: cluster distribution [sc, ar, mx].
If d is a scalar, the value defines all dataset distributions. If d is an array, d has length k and its values define distributions per cluster. If d is a matrix, it is a k × N matrix whose values define distributions per cluster and dimension.
Allowed d values and their meanings are as follows: (0) Random distribution; (1) Uniform; (2) Gaussian; (3) Logistic; (4) Triangular; (5) Gamma; (6) Gap or Ring-shaped. In addition, alternative distributions can be imported as described below in this section (for configuration purposes, they take values for d and dflag indices starting from 7).

dflag: enable distribution [sc, ar].
As an array, dflag states which of the implemented distributions are available when d = 0, i.e., distributions are selected randomly. As a scalar, “1” enables all distributions and “0” disables all distributions except for Gaussian.

mv: multivariate distributions [sc, ar].
mv value (∈ {1, − 1, 0}) defines if distributions are applied to clusters dimension by dimension, if they are applied to cluster intradistances, or if such decision is established at random (see Section 3.1).

cp: compactness coefficient [sc, ar].
cp determines the variance component of the applied distribution: for instance, σ in the Gaussian case, the upper and lower thresholds in the triangular and uniform cases, or the b parameter of the Gamma distribution. Given that feature domains in the whole dataset are enclosed within [0,1], cp should be chosen consistently with that range. Again, cp can be defined for all clusters (scalar) or cluster by cluster (array).

scale: scale to grid [sc, ar].
By means of scale, cluster cp can be automatically scaled according to grid size, therefore controlling cluster overlap. If positive (scaling_mode IS ‘min_grid_separation'), scale uses the minimum grid intersection distance for scaling; if negative (scaling_mode IS ‘max_grid_separation'), it uses the maximum. scale can also be defined either for all clusters (scalar) or independently cluster by cluster (array).

α: grid factor [sc, ar].
α determines grid granularity as explained in Section 3.2. A positive value (alpha_mode IS ‘k_based') multiplies (1), whereas a negative value (alpha_mode IS ‘fixed') directly replaces (1) with the given input (after removing the negative sign). α can be defined either for all dimensions together (scalar) or independently dimension by dimension (array).

corr: feature correlation [sc, ar].
Feature correlations (see Section 3.5) are set by defining a maximum correlation coefficient, which is applied to all clusters and dimensions alike (scalar), or to cluster dimensions taken independently (array).

rot: cluster rotation [sc, ar].
As explained in Section 3.5. Also definable per cluster (array) or for all clusters (scalar).

out: total number of outliers [sc].

Nnoise: noisy dimensions [sc, ar, mx].
A scalar value states the number of noisy dimensions to be added to the whole dataset. If Nnoise is defined as an array, values mark which dimensions of the dataset must be replaced by noise. If it is defined as a matrix, every column corresponds to a cluster and values state which dimensions (specific for every cluster) must be replaced by noise. Noisy dimensions are created after any other transformation (correlation, rotation, etc.). This flexible noise generation also makes it possible to generate benchmarks for subspace clustering algorithms (Kriegel et al. 2009).
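Two of the conventions above, the scalar/array/matrix forms of d and the random split of M objects among k clusters with a minimum of km objects each, can be made concrete with a short Python sketch. The helper names and the use of nested lists are assumptions of this example and do not reflect the actual MDCGen interface.

```python
import random


def expand_distribution(d, k, n_dims):
    """Expand d (scalar, length-k list, or k x N nested list) to k x N."""
    if not isinstance(d, list):
        return [[d] * n_dims for _ in range(k)]          # same for all clusters
    if all(not isinstance(row, list) for row in d):
        if len(d) != k:
            raise ValueError("array d must have one entry per cluster")
        return [[value] * n_dims for value in d]          # one value per cluster
    if len(d) != k or any(len(row) != n_dims for row in d):
        raise ValueError("matrix d must have shape k x N")
    return [list(row) for row in d]                       # per cluster and dimension


def cluster_sizes(M, k, km, rng):
    """Split M objects among k clusters, each receiving at least km objects."""
    if k * km > M:
        raise ValueError("M is too small for k clusters of at least km objects")
    sizes = [km] * k
    for _ in range(M - k * km):  # hand out the remaining objects at random
        sizes[rng.randrange(k)] += 1
    return sizes
```

For example, `expand_distribution(2, 2, 3)` makes every dimension of both clusters Gaussian, whereas `expand_distribution([1, 5], 2, 3)` assigns a uniform distribution to the first cluster and a Gamma distribution to the second.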

data: the final dataset, a matrix of M′ rows and N′ dimensions^{3}.

label: array with M′ labels.

perf: data structure containing performance indices for the overall datasets as well as for every independent cluster.

n: the number of imported distributions.

d(1 : n).values: arrays with histogram bin values of imported distributions. n arrays are required. The number of array elements is independent for each array.

d(1 : n).edges: arrays with histogram bin boundaries (or edges). n arrays are required. Array lengths must be equal to the corresponding values array plus one (i.e., each value must fall between two edges).
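Generating values directly from an imported histogram, given its bin values and edges, can be sketched as follows. The function name is hypothetical, and placing each point uniformly inside the chosen bin is an assumption of this example; no curve fitting is involved.

```python
import bisect
import itertools
import random


def sample_histogram(values, edges, m, rng):
    """Draw m numbers from an empirical distribution given as a histogram.

    values -- non-negative bin heights (len(values) == len(edges) - 1)
    edges  -- increasing bin boundaries
    A bin is chosen with probability proportional to its height, then a
    point is placed uniformly inside it (an assumption of this sketch).
    """
    if len(edges) != len(values) + 1:
        raise ValueError("edges must have exactly one more element than values")
    cumulative = list(itertools.accumulate(values))
    total = cumulative[-1]
    samples = []
    for _ in range(m):
        b = bisect.bisect_right(cumulative, rng.random() * total)
        b = min(b, len(values) - 1)  # guard against floating-point edge cases
        samples.append(rng.uniform(edges[b], edges[b + 1]))
    return samples
```

With `values = [1, 0, 3]` and `edges = [0.0, 1.0, 2.0, 4.0]`, about three quarters of the samples fall in the last bin and none fall inside the empty middle bin, so the generated tail ends where the histogram binning ends.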
5 Conclusions
This paper presents MDCGen, a tool for generating datasets of objects arranged in clusters. MDCGen is devised for research purposes, specifically to test clustering algorithms and clustering validation techniques. It has been designed to fulfill the principal features implemented in previous approaches as well as the requirements observed by expert data analysts. In addition to allowing high flexibility in randomization and parameterization, the main novelties of MDCGen are related to the overlap control and cluster placement, both driven by the creation of hypergrids where cluster subspaces hang, and to the option of not only generating multivariate clusters but also directly defining object distances to cluster centroids with a single distribution.
MDCGen opens a broad spectrum of possibilities to easily test data mining and machine learning algorithms, bringing them to demanding but controlled conditions. MDCGen is open source, free, and publicly available.
Footnotes
 1. It is important to remark that the capability of MDCGen to generate multivariate clusters, or clusters whose intra-distances follow radial-based distributions, does not cover all possible multivariate or radial-based shapes. Additionally, note that the distribution functions available in the current version of MDCGen show tails equal to or lighter than those of Gaussian distributions (i.e., no heavy tails). The uniform distribution case is limited by max-min parameters, and imported histograms can resemble heavy-tailed distributions, but there is no embedded curve fitting and points are generated directly from the histogram (meaning that the distribution tail ends according to the histogram binning). Therefore, if users work with a closed space, heavy-tailed distributions can be simulated.
 2. Note that centroids are used here in a broad sense to designate reference points whose goal is to place clusters in the final output space. Therefore, such centroids are not necessarily required to exactly correspond to real cluster centroids, especially if the underlying distributions are skewed.
 3. The prime ′ marks that M and N can be finally increased by the addition of outliers and noise.
 4. Alternatively, if desired, the MDCGen software comes with a generator of random empirical distributions where parameters like multimodes or skewness can be configured.
Notes
Acknowledgments
This research has been partially funded by the Vienna Science and Technology Fund (WWTF) through project ICT15129, “BigDAMA”.
Funding Information
Open access funding provided by TU Wien (TUW).
References
 Banerjee, A., Krumpelman, C., Ghosh, J., Basu, S., Mooney, R. J. (2005). Model-based Overlapping Clustering. In Proceedings of the 11th ACM SIGKDD international conference on knowledge discovery in data mining (pp. 532–537).
 Beyer, K., Goldstein, J., Ramakrishnan, R., Shaft, U. (1999). When is “Nearest Neighbor” meaningful? In Proceedings of the international conference on database theory (ICDT) (pp. 217–235).
 Daniele, M. (2001). On the rigid rotation concept in n-dimensional spaces. Journal of the Astronautical Sciences, 49(3), 401–420.
 Färber, I., Günnemann, S., Kriegel, H. P., Kröger, P., Müller, E., Schubert, E., Seidl, T., Zimek, A. (2010). On using class-labels in evaluation of clusterings. In Proceedings of the 1st international workshop on discovering, summarizing and using multiple clusterings (MultiClust 2010) in conjunction with 16th ACM SIGKDD conference on knowledge discovery and data mining, KDD: Washington.
 François, D., Wertz, V., Verleysen, M. (2007). The concentration of fractional distances. IEEE Transactions on Knowledge and Data Engineering, 19(7), 873–886.
 Handl, J. (2017). Cluster generators. http://personalpages.manchester.ac.uk/mbs/julia.handl/generators.html. Accessed July 2017.
 Handl, J., & Knowles, J. (2005). Multiobjective Clustering around medoids. In 2005 IEEE Congress on evolutionary computation (Vol. 1, pp. 632–639).
 Higham, N. J. (1988). Computing a nearest symmetric positive semidefinite matrix. Linear Algebra and its Applications, 103, 103–118.
 Korzeniewski, J. (2013). Empirical evaluation of OCLUS and GenRandomClust algorithms of generating cluster structures. Statistics in Transition New Series, 14(3), 487–494.
 Kriegel, H. P., Kröger, P., Zimek, A. (2009). Clustering high dimensional data: a survey on subspace clustering, pattern-based clustering, and correlation clustering. ACM TKDD, 3(1), 1–58.
 Milligan, G. W., & Cooper, M. C. (1986). A study of the comparability of external criteria for hierarchical cluster analysis. Multivariate Behavioral Research, 21(4), 441–458.
 Pei, Y., & Zaïane, O. (2006). A synthetic data generator for clustering and outlier analysis. Technical report, Department of Computing Science, University of Alberta, Edmonton, AB, Canada.
 Qiu, W., & Joe, H. (2006). Generation of random clusters with specified degree of separation. Journal of Classification, 23(2), 315–334.
 Rousseeuw, P. J. (1987). Silhouettes: a graphical aid to the interpretation and validation of cluster analysis. Journal of Computational and Applied Mathematics, 20, 53–65.
 Schubert, E., Koos, A., Emrich, T., Züfle, A., Schmid, K.A., Zimek, A. (2015). A framework for clustering uncertain data. PVLDB, 8(12), 1976–1979.
 Steinley, D., & Henson, R. (2005). OCLUS: an analytic method for generating clusters with known overlap (Vol. 22).
 Thirey, B., & Hickman, R. (2015). Distribution of Euclidean Distances Between Randomly Distributed Gaussian Points in n-Space. arXiv e-prints (pp. 1–13). arXiv:1508.02238.
Copyright information
Open Access This article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made.