for the Generation of Controlled Circular Data

A number of artificial intelligence and machine learning problems need to be formulated within a directional space, where classical Euclidean geometry does not apply or needs to be readjusted to the circle. This is typical, for example, in computational linguistics and natural language processing, where language models based on Bag-of-Words, Vector Space, or Word Embedding representations are widely used for tasks such as document classification, information retrieval and recommendation systems, among others. In these contexts, for assessing document clustering and outlier detection applications, it is often necessary to generate data with directional properties whose units follow some model assumptions and possibly form compact groups. In the following we propose a Reduced Variable Neighbourhood Search heuristic which is used to generate high-dimensional data controlled by the desired properties, aimed at representing several real-world contexts. The whole problem is formulated as a non-linear continuous optimization problem, and it is shown that the proposed Reduced Variable Neighbourhood Search is able to generate high-dimensional solutions to the problem in short computational time. A comparison with the state-of-the-art local search routine used to address this problem shows the greater efficiency of the approach presented here.


Introduction and Background
The analysis and interpretation of directional data require specific data representations, descriptions and distributions. Directional data occur in many application areas such as earth sciences, astronomy, meteorology and medicine. Note that directional data are "interval type" data: points are typically considered on the unit radius circle, or sphere, and the focus is on the direction of the data vectors. The position of the "zero degrees" is then arbitrary, and the angle between two unit length vectors is used as a natural measure of their distance. Although angular distances can sometimes also be used with general data vectors defined in the Euclidean space, usual statistics, such as the arithmetic mean and the standard deviation, are not appropriate because they do not have this rotational invariance property [34], and one must rely on proper directional (or circular) statistics [26]. In this context classical Euclidean geometry laws do not apply or need to be properly reformulated on the circle.
Nowadays a number of popular artificial intelligence and machine learning problems are also modelled using directional statistics. This is typical, for example, in computational linguistics and natural language processing, where language models based on Bag-of-Words [42], Vector Space [35], or Word Embedding [25] representations are widely used for tasks like document classification, information retrieval and recommendation systems, among others. In these models, documents are represented by feature vectors. Specifically, in Bag-of-Words, a text (such as a sentence or a document) is represented as the bag (multiset) of its words, disregarding grammar and even word order but keeping multiplicity. A Vector Space model is a special Bag-of-Words model representing a document by a feature vector, where each feature is a word (term) and the feature's value is a term weight. The term weight might be either a binary value (with 1 indicating that the term occurred in the document, and 0 indicating that it did not); a term frequency value (indicating how many times the term occurred in the document); or a TF-IDF (term frequency-inverse document frequency) value [1], a numerical statistic intended to reflect how important a word is to a document in a collection or corpus. The entire document is thus a feature vector, and each feature vector corresponds to a point in a vector space. The model for this vector space is such that there is an axis for every term in the vocabulary, and so the vector space is v-dimensional, where v is the size of the vocabulary. For long documents, where the vocabulary can be excessively large to handle, Word Embedding models can be a preferred option [25]. In Word Embedding models, such as the popular Word2Vec [27], GloVe [31], or the more recent BERT [18], words or phrases from the vocabulary are mapped to lower-dimensional vectors of real numbers.
Conceptually, it involves a mathematical embedding from a space with many dimensions per word to a continuous vector space with a much lower dimension [25]. Although these representation models have been largely used in computational linguistics and natural language processing, they have also found applications for generic visual categorization in computer vision [6,16,40], among others.
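To make the Vector Space term weighting described above concrete, the following minimal Python sketch builds TF-IDF weighted feature vectors for a toy corpus. The function name `tfidf_vectors` and the toy documents are our own illustrative choices, and the raw `tf · log(N/df)` weighting is only one of several common TF-IDF variants:

```python
import math
from collections import Counter

def tfidf_vectors(docs):
    """Build a TF-IDF weighted feature vector for each document
    (given as a list of tokens) over the shared vocabulary."""
    vocab = sorted({t for d in docs for t in d})
    n_docs = len(docs)
    # document frequency: in how many documents each term occurs
    df = {t: sum(1 for d in docs if t in d) for t in vocab}
    vectors = []
    for d in docs:
        tf = Counter(d)  # raw term frequencies for this document
        vectors.append([tf[t] * math.log(n_docs / df[t]) for t in vocab])
    return vocab, vectors

docs = [["apple", "banana", "apple"],
        ["banana", "cherry"],
        ["apple", "cherry", "cherry"]]
vocab, vecs = tfidf_vectors(docs)
```

Each resulting vector has one component per vocabulary term, so its dimension equals the vocabulary size v, which is exactly why these representations become high-dimensional and sparse for real corpora.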
Typically, reasons for preferring a directional approach in these contexts are driven by the problem. For example, the angular distance is known to favor vector components with higher variance (see e.g. [33]), which in linguistics is desirable for giving more weight to the more informative terms. Other reasons are of a computational nature. When the space dimension v and the sample size n are both very large, the computation of the ordinary distances may become time consuming. In addition, the high-dimensional vectors are often very sparse, and an angular distance, namely the cosine similarity, neglects the 0-components and saves time compared to an ordinary Euclidean distance. Besides, v is often considerably larger than n, leading to singular covariance matrices and complicating the use of the Mahalanobis distance (see e.g. [20], [15] and [19]), which otherwise is in general a good choice for taking data correlations into account.
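As an illustration of why the cosine similarity is cheap on sparse data, the following Python sketch (the dictionary-based sparse representation and the function name are our own, not from the paper) computes the cosine similarity while visiting only non-zero components:

```python
import math

def cosine_similarity_sparse(a, b):
    """Cosine similarity for sparse vectors stored as {index: value} dicts.
    Only non-zero components are visited, which is what makes the angular
    distance cheap for high-dimensional sparse data."""
    # iterate over the smaller support; zero components never contribute
    if len(a) > len(b):
        a, b = b, a
    dot = sum(v * b.get(i, 0.0) for i, v in a.items())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b)

# two v-dimensional vectors with only a handful of non-zero entries
x = {0: 1.0, 5: 2.0, 99_999: 3.0}
y = {5: 4.0, 7: 1.0, 99_999: 1.0}
sim = cosine_similarity_sparse(x, y)
```

Only the intersection of the two supports contributes to the dot product, so the cost depends on the number of non-zero entries rather than on the nominal dimension v.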
In cluster analysis, as in many statistical and data science problems, the distance measure plays a crucial role [8,10]. Points are assigned to groups according to their distance from the closest group centroid, and points that cannot be firmly assigned to any group are considered outliers [13]. A clustering method, e.g. K-means [36], may produce equivalent results with different distance measures (and centroid definitions) but behaves very differently with others.
To study the relationship between distance measures and objectively compare the performance of clustering, outlier detection, or other data analysis frameworks, it is often necessary to conduct systematic tests on a large number of different data sets artificially generated with known properties. In fact, it is well known that there is always some bias in the use of standard collections of data from real-life applications, which should rather be used to confirm the properties of methods previously studied under controlled settings [12].
As an example, [38] extended the Forward Search [2] method to the automatic detection of outliers in human-labelled document sets. These outliers are documents that are wrongly assigned to a category or weakly correlated with other documents, and that need to be removed before training a learning system on these data. The synthetic datasets produced with our method make it possible to calibrate the outlier detection strategy and to assess the quality of the information retrieval process [38].
While it is rather easy to generate synthetic multivariate data according to some probability distribution or covariance structure in the Euclidean space, efficient methods that at the same time impose constraints on the distribution of the pairwise angular distances of the generated data vectors have been less explored in the literature. In [14] we proposed a simple local search approach for addressing the problem of high-dimensional directional data generation controlled by specific properties. In particular, motivated by problems in computational linguistics, it is hypothesized that data are power-law distributed [19,34], and that their pairwise cosine similarities are distributed as a von Mises distribution [41], i.e. the analogue of the normal distribution in circular statistics [26], around a desired cosine value. This represents a common realistic setting in several computational linguistics applications [22,32].
In this contribution we propose an efficient optimization approach based on a Reduced Variable Neighbourhood Search (RVNS) heuristic [9,21,30], characterized by very high speed and some interesting implementation improvements with respect to the local search approach in the literature [14]. The algorithm is simple and easy to implement, and it takes inspiration from the recently proposed "less is more approach" [10,29], which supports the adoption of non-sophisticated and effective metaheuristics instead of hard-to-reproduce and complex solution approaches. We will show how the proposed RVNS heuristic is able to efficiently generate high-dimensional directional data controlled by the desired properties. The generation algorithm is computationally efficient and flexible in terms of adaptability to other distributions and problems.
The rest of the paper is organized as follows. Section 2 formulates the whole problem as a non-linear continuous optimization problem. Section 3 describes the implementation details of the considered optimization algorithms, namely the state-of-the-art local search routine in the literature and our proposed RVNS approach. Our computational analysis is reported in Sect. 4, where it is shown that the proposed RVNS is able to generate high-dimensional solutions to the problem in shorter computational time than the local search routine, demonstrating the efficiency of the approach presented here. Finally, the paper ends with conclusions and suggestions for possible future research in Sect. 5.

Problem Formulation
The problem was first formulated by [14], distinguishing between two general settings. The first is the ideal case of a single homogeneous group of n documents represented in a vector space of size v (vocabulary size). Here, high recall and precision are achieved by minimizing the sum of the pairwise similarities between the documents,

F = Σ_{1 ≤ i < j ≤ n} cs(Zi, Zj),   (1)

where cs is the cosine similarity function that for two vectors Zi and Zj is

cs(Zi, Zj) = (Zi · Zj) / (||Zi|| ||Zj||).   (2)

In the second setting we assume K > 1 homogeneous groups and a number of isolated outliers. Now, the quantity F computed within a given homogeneous group should be maximized, while it should be minimized when computed on the set of the K estimated centroids. The "density" of the documents within a homogeneous group is determined by the average value of the quantity F for that group. Therefore, to generate artificial data suitable for benchmarks in the two general settings we can use the following constrained non-linear continuous optimization problem formulation [24]. Given a cosine similarity value cs ∈ (0, 1) and a tolerance ξ ∈ (0, 1), ξ << cs, find a set Z of n v-dimensional vectors with v >> n non-negative variables, such that the set of the n(n − 1)/2 pairwise cosine similarity values

C = {cs(Zi, Zj) : 1 ≤ i < j ≤ n}

is a random sample drawn from a von Mises distribution with mean μ(C) ≈ cs and concentration parameter κ(C, ξ), whose density is f(x | μ, κ) = exp(κ cos(x − μ)) / (2π I_0(κ)), with I_0(·) the modified Bessel function of order 0.
We use the Watson goodness-of-fit test [39] to assess whether C is consistent with the hypothesized von Mises null distribution, with known mean cs and concentration parameter estimated from C itself.
In the von Mises distribution, the concentration parameter κ is the analogue of the reciprocal of the variance for the normal distribution, and it links the sample standard deviation σ(C) with the chosen tolerance ξ. This is done by using the well-known empirical three-sigma rule [34] and monitoring the constraint σ(C) ≈ ξ/(10/3) during the data generation process, which corresponds to a concentration parameter κ(C, ξ) ≈ 9, ensuring that the data are representative of more than 99.9% of the generating von Mises distribution.
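The two conditions above can be monitored directly on the set C of pairwise similarities. The following Python sketch (function names are ours; the Watson goodness-of-fit test of the full method is not included, only the interval and spread conditions) checks whether a candidate set of vectors satisfies them:

```python
import itertools
import math
import statistics

def pairwise_cosines(Z):
    """All n(n-1)/2 pairwise cosine similarities of the row vectors in Z."""
    def cos(a, b):
        dot = sum(x * y for x, y in zip(a, b))
        na = math.sqrt(sum(x * x for x in a))
        nb = math.sqrt(sum(y * y for y in b))
        return dot / (na * nb)
    return [cos(Z[i], Z[j]) for i, j in itertools.combinations(range(len(Z)), 2)]

def constraints_ok(Z, cs, xi):
    """Check the two conditions monitored during generation: every pairwise
    similarity lies in [cs - xi, cs + xi], and the sample standard deviation
    of the similarities has reached the target xi / (10/3)."""
    C = pairwise_cosines(Z)
    in_band = all(cs - xi <= c <= cs + xi for c in C)
    spread_ok = statistics.pstdev(C) >= xi / (10.0 / 3.0)  # three-sigma rule target
    return in_band, spread_ok

# three 3-dimensional vectors whose pairwise cosine similarity is exactly 0.5
Z = [[4.0, 1.0, 1.0], [1.0, 4.0, 1.0], [1.0, 1.0, 4.0]]
in_band, spread_ok = constraints_ok(Z, cs=0.5, xi=0.1)
```

For this degenerate example all similarities coincide, so the interval condition holds but the spread condition does not: this is exactly the situation of the Dirac-like initial solutions used by the algorithms in the next section.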

Optimization Methods
This section contains the implementation details of the considered optimization algorithms, namely the state-of-the-art local search routine implemented in [14] and the RVNS approach that we propose here. For a survey on the basic concepts of approximate optimization methods, including stochastic local search and metaheuristics, the reader is referred to [3,4,23]. The algorithms have been implemented using the commercial software package Matlab R2019b 64-bit, version 9.7.0.

Local Search Algorithm
The local search algorithm starts from an initial solution, Z0, of n data vectors with v variables, constructed as described in [14] such that all the vectors have exactly the same pairwise cosine similarity value cs [11]. Consider the set Z of n data vectors with v >> n variables, where all the vector tails are equal from variable index n + 1 onwards, i.e. z(i,k) = z(j,k) ∀k ≥ n + 1 and ∀i, j ≤ n, with i ≠ j. For simplicity, all equal tail variables will be denoted without the vector index, e.g. z(k) instead of z(i,k).
The remaining variables of the vectors of Z are all set equal to a same constant value, say z_C, i.e. z(i,k) = z(j,k) = z_C ∀k ≤ n and ∀i, j ≤ n, with i ≠ j. As proved in [14], all row vectors of such an initial solution, referred to as Z0, have the same pairwise distance by geometric construction. This allows our local search algorithm to start from a solution which already satisfies the first constraint of the problem. Summarizing, the initial solution Z0 is defined by Eqs. 3-4 of [14]: consider an arbitrary cosine similarity value cs ∈ (0, 1), and let z_C, z(n+1), . . . , z(v) be an arbitrary set of non-negative reals (i.e. ∈ R+). Then, directly by the definition of cosine similarity, and by replacing the component values with those specified in Eqs. 3-4, the resulting expression is invariant ∀i, j ≤ n, i ≠ j, and it equals exactly cs. For more details on these calculations the reader is referred to [14].
It is worth noting that the quantity Δ appearing in Eq. 4 is always non-negative. Imposing Δ ≥ 0 directly yields an inequality which is always true: since cs > 0 and n > 1, the second member of the inequality is a negative number, and therefore the first member, z_C², is always bigger than the second member [14]. The initial solution Z0 produced in this way represents a good starting point for the whole optimization algorithm because it already satisfies the first constraint of the problem (i.e. cs(Zi, Zj) = cs ∀i, j ≤ n, i ≠ j). However, Z0 violates the second problem constraint, i.e. that the distribution of the data in C is a random sample drawn from a von Mises distribution with mean μ(C) ≈ cs and concentration parameter κ(C, ξ) ≈ 9, i.e. σ(C) ≈ ξ/(10/3). Indeed, by construction of the initial solution Z0, the distribution of the data in C is a Dirac delta centered in cs.
The local search routine now proceeds by perturbing the initial solution Z0 in order to move gradually the distribution of the data in C from the Dirac delta to a von Mises distribution with the desired properties. This recurring disturbance is performed by an iterative process which, at each iteration, injects a normally distributed noise N, with μ = 0, into the incumbent solution. The perturbation is greedily accepted if and only if the modified solution violates neither the first nor the second constraint of the problem; otherwise the perturbation is rejected. The satisfaction of the second problem constraint is checked by performing a Watson goodness-of-fit test [39] at the 1% significance level to ensure that the modified intermediate solution is consistent with the von Mises distribution with the desired properties. The first problem constraint is validated instead by making sure that all pairwise cosine similarities among the perturbed vectors fall within the desired interval [cs − ξ, cs + ξ]. For this reason, the injected noise needs to be small enough to be very likely to be accepted, but not excessively small, otherwise the local search routine would become too slow to converge within a reasonable number of steps. [14] found experimentally that a random noise N whose components follow a normal distribution with μ(N) = 0 and σ(N) = 0.01 is a good choice, yielding a satisfactory tradeoff between convergence speed and the constraint violations caused by the random perturbation.
The overall local search routine stops when a perturbed solution is reached that satisfies all problem constraints and has a standard deviation of the pairwise cosine similarities equal to or larger than the desired threshold value of ξ/(10/3) given by the second problem constraint.
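The accept/reject scheme just described can be sketched as follows. This is an illustrative Python version, not the authors' Matlab implementation: the function and helper names are ours, the Watson goodness-of-fit check is replaced by the interval constraint alone, and a small starting instance with equal pairwise similarities stands in for the initial solution of [14]:

```python
import itertools
import math
import random
import statistics

def pairwise_cosines(Z):
    """All pairwise cosine similarities of the row vectors in Z."""
    def cos(a, b):
        dot = sum(x * y for x, y in zip(a, b))
        return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))
    return [cos(Z[i], Z[j]) for i, j in itertools.combinations(range(len(Z)), 2)]

def local_search(Z, cs, xi, sigma_noise=0.01, max_iters=20_000, seed=0):
    """Greedy fixed-step local search: add Gaussian noise with a static sigma,
    keep the perturbed solution only if every pairwise similarity stays within
    [cs - xi, cs + xi], and stop once the spread of the similarities reaches
    the target xi / (10/3).  The Watson goodness-of-fit check of the full
    method is omitted in this sketch."""
    rng = random.Random(seed)
    target_sigma = xi / (10.0 / 3.0)
    for _ in range(max_iters):
        # clip at zero to keep all data variables non-negative
        trial = [[max(0.0, z + rng.gauss(0.0, sigma_noise)) for z in row] for row in Z]
        C = pairwise_cosines(trial)
        if all(cs - xi <= c <= cs + xi for c in C):
            Z = trial  # greedy acceptance
            if statistics.pstdev(C) >= target_sigma:
                break  # spread of the similarities reached the target
    return Z

# start from vectors whose pairwise cosine similarity is exactly cs = 0.5
Z0 = [[4.0, 1.0, 1.0], [1.0, 4.0, 1.0], [1.0, 1.0, 4.0]]
Z = local_search(Z0, cs=0.5, xi=0.1, seed=1)
```

The static `sigma_noise` is exactly the weakness discussed in the next section: it does not adapt to the problem size, which motivates the RVNS variant.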

Reduced Variable Neighbourhood Search Heuristic
An important drawback of the local search routine described in the previous section is that the normal perturbation applied to the incumbent solution is static and does not adapt automatically to the size of the problem at hand. Although in [14] a random normal noise N with μ(N) = 0 and σ(N) = 0.01 was experimentally fixed as a good perturbation setting on average, it is very likely that this σ(N) value will be either too small for large problem instances, yielding an excessively slow convergence time for the algorithm, or too large for small problem instances, producing a large number of constraint violations which will in turn increase exponentially the computational running time of the entire procedure.
For this reason we propose here an intelligent optimization approach based on Reduced Variable Neighbourhood Search (RVNS) [9,21,30], aimed at achieving high-quality performance for the problem by automating the resulting optimization strategy and adapting online to the size of the problem to tackle. The aim is to lead the local search to achieve a proper balance of diversification (exploration) and intensification (exploitation) during the search process, a fundamental objective for any effective heuristic solution approach. The diversification capability of a metaheuristic refers to its ability to explore thoroughly different zones of the search space in order to identify promising areas. When a promising area is detected, the metaheuristic needs to exploit it intensively to find the corresponding local optimum, but at the same time without wasting excessive computational resources. This is referred to as the intensification capability of the metaheuristic.
Finding a good balance between diversification and intensification is indeed an essential task for the proper effectiveness of a metaheuristic [3,4,23].
RVNS is a variant of the classic Variable Neighbourhood Search (VNS) algorithm [5,10,21], which is a popular metaheuristic based on dynamic changes of neighbourhood structures of an incumbent solution during the search process [9,21]. The VNS methodology is based on the core concept of searching for new solutions in increasingly distant neighbourhoods of the current solution, jumping only if a better solution is found, without being limited only to a fixed trajectory [30]. RVNS is a variant that has been shown to be successful for many combinatorial problems where local optima with respect to one or several neighbourhoods are relatively close to each other [28].
The variable metric method has been suggested by [17]. The idea is to change the metric in each iteration such that the search direction (steepest descent with respect to the current metric) adapts better to the local shape of the function. The RVNS method is obtained if random points are selected from the current neighbourhood under exploration and no descent is followed. Rather, the values of these new points are compared with that of the incumbent and updating takes place in case of improvement. RVNS is a typical example of a pure stochastic heuristic, akin to a classic Monte-Carlo method, but more systematic [28]. It is useful especially for very large problem instances for which the local search within the classic VNS approach might become costly, as it is the case with our problem.
The details of our heuristic based on Reduced Variable Neighbourhood Search for the given problem are specified in Algorithm 1. The algorithm starts with an initial solution Z equal to Z0, that is, the same starting solution of Eq. 3 described previously in Sect. 3.1. By construction, its distribution of pairwise cosine similarities is a Dirac delta centered in cs, already satisfying the first constraint of the problem but not the second.
Then, the shaking phase, which represents the core idea of RVNS, is applied to Z. The shaking phase aims to change the neighbourhood structure, N_k(·), of the incumbent solution to achieve a larger algorithm diversification. The new candidate solution, Z′, is generated at random in order to avoid cycling, which might occur if a deterministic rule were used.
The simplest and most common choice for the neighbourhood structure consists of setting neighbourhoods with increasing cardinality: |N_1(·)| < |N_2(·)| < ... < |N_kmax(·)|, where k_max represents the maximum size of the shaking phase. Let k and k_step be, respectively, the current size and the step size of the shaking phase.
The algorithm starts by selecting the first neighbourhood (k ← 1) and, at each iteration, it increases the parameter k if the new candidate solution violates one of the problem constraints (k ← k + 1), until the largest neighbourhood is reached (k = k_max). The process of changing neighbourhoods when no improvement occurs diversifies the search trajectory. In particular, the choice of neighbourhoods of increasing cardinality yields a progressive diversification of the search process. Note that the k_step parameter has been introduced in order to adapt the classic RVNS schema from a combinatorial to a continuous optimization setting.
For the given problem, a shaking phase of size k consists of perturbing the incumbent solution Z with a random noise N, normally distributed with μ = 0 and σ(N) = k · k_step, producing a perturbed solution Z′ in the neighbourhood N_k(·) of the current solution. Note that each perturbation corresponds to just a limited local modification of the incumbent solution Z which hopefully will not violate either of the two problem constraints, as already described for the basic local search procedure in Sect. 3.1. Otherwise, if one of the two constraints is violated by Z′, the new solution is discarded, the previous solution Z is kept, and it is perturbed again in a larger neighbourhood (k ← k + 1).

Input: The number of vectors to generate, n, with v >> n non-negative data variables; the reference cosine similarity value cs ∈ (0, 1); the tolerance ξ << cs to bound the cosine similarities in [cs − ξ, cs + ξ].
Output: A set Z of n vectors with v non-negative data variables.
Initialization:
- Let C = {cs(Zi, Zj) : 1 ≤ i < j ≤ n} be the distribution of the pairwise cosine similarities of the vectors in Z;
- Let k, k_step, and k_max be, respectively, the current size, the step size, and the maximum size of the shaking phase;
begin
  · Generate the initial solution Z0 of n vectors with v variables; Z ← Z0; k ← 1;
  repeat
    · Select at random the noise matrix N(n,v), whose components follow a normal distribution with μ(N) = 0 and σ(N) = k · k_step;
    · Shake: Z′ ← Z + N;
    if Z′ satisfies both problem constraints then
      · Accept the move: Z ← Z′; k_step ← k_step · 2;
      · Restart with the first neighbourhood structure: k ← 1;
    else
      · Discard the move and enlarge the neighbourhood: k ← k + 1;
      if k > k_max then k_step ← k_step/2; k ← 1;
    end
  until σ(C) ≥ ξ/(10/3);
  ⇒ Return(Z).
end
Algorithm 1: Reduced Variable Neighbourhood Search heuristic
The process of progressively increasing the parameter k in case of no improvement continues until the maximum size of the shaking phase, k_max, is reached. When this happens, meaning that the value of k_step may be too large, having already produced k_max consecutive unsuccessful perturbations with the same k_step, the step size of the shaking phase is halved (k_step ← k_step/2) in order to produce next perturbations with a proportionally smaller standard deviation (σ(N) = k · k_step). In this way, when the algorithm restarts from the first neighbourhood (k ← 1) of Z, a finer-grained noise will be iteratively produced so as to increase the acceptance chances of the next perturbations.
Conversely, if the perturbed solution Z′ violates neither problem constraint, it is accepted as the new incumbent solution (Z ← Z′) and the search is restarted from its first neighbourhood (k ← 1). In this case, the value of k_step is doubled (k_step ← k_step · 2) in order to produce next perturbations with a proportionally larger standard deviation. This simple reactive schema is aimed at achieving an optimal setting of k_step in order to speed up the convergence of the algorithm. Given this reactive schema, the value of k_max is not relevant to the overall performance of the algorithm, as it depends directly on k_step, which is optimally tuned on-line. Therefore the value of k_max is set equal to 5 at the beginning of the algorithm and is not required to change.
The algorithm continues iteratively with the same procedure and stops when the standard deviation of the pairwise cosine similarities of the obtained solution Z is equal to or larger than the expected standard deviation, ξ/(10/3), that is, the prefixed problem goal, giving Z as the final output.
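The overall procedure can be sketched in Python as follows. Again, this is an illustration of the scheme of Algorithm 1 rather than the authors' Matlab code: all names are our own, the Watson goodness-of-fit check is omitted, and a small instance with equal pairwise similarities stands in for the initial solution Z0:

```python
import itertools
import math
import random
import statistics

def pairwise_cosines(Z):
    """All pairwise cosine similarities of the row vectors in Z."""
    def cos(a, b):
        dot = sum(x * y for x, y in zip(a, b))
        return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))
    return [cos(Z[i], Z[j]) for i, j in itertools.combinations(range(len(Z)), 2)]

def rvns_generate(Z, cs, xi, k_max=5, k_step=0.01, max_iters=20_000, seed=0):
    """RVNS sketch with the reactive step-size schema described in the text:
    the shake adds Gaussian noise with sigma = k * k_step; an accepted move
    resets k to 1 and doubles k_step, a rejected move increases k, and after
    k_max consecutive rejections k_step is halved and k reset to 1.  Stops
    when the spread of the pairwise similarities reaches xi / (10/3).
    The Watson goodness-of-fit check of the full method is omitted."""
    rng = random.Random(seed)
    target_sigma = xi / (10.0 / 3.0)
    k = 1
    for _ in range(max_iters):
        sigma = k * k_step  # neighbourhoods of increasing cardinality
        trial = [[max(0.0, z + rng.gauss(0.0, sigma)) for z in row] for row in Z]
        C = pairwise_cosines(trial)
        if all(cs - xi <= c <= cs + xi for c in C):
            Z, k, k_step = trial, 1, k_step * 2.0  # accept: widen the next shake
            if statistics.pstdev(C) >= target_sigma:
                break
        else:
            k += 1  # reject: move to a larger neighbourhood
            if k > k_max:
                k, k_step = 1, k_step / 2.0  # refine the step size
    return Z

Z0 = [[4.0, 1.0, 1.0], [1.0, 4.0, 1.0], [1.0, 1.0, 4.0]]  # pairwise cos = 0.5
Z = rvns_generate(Z0, cs=0.5, xi=0.1, seed=1)
```

The doubling/halving of `k_step` is what makes the perturbation magnitude self-tuning, in contrast with the static noise of the plain local search.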

Computational Analysis
This section reports our computational experiments on the comparison of the performance of the proposed RVNS approach with respect to that of the local search routine in the literature. The heuristics are identified with the following abbreviations: LS, for the local search routine described in Sect. 3.1; and RVNS, for the Reduced Variable Neighbourhood Search implementation described in Sect. 3.2.
In the experiments we generate data vectors drawn from a power-law distribution [7], which is a realistic setting in computational linguistics [35], considering a number of components, v, ranging from 1·10³ to 100·10³.
As shown in [14], the easiest setting for our problem is to fix the desired cosine similarity value cs to 0.5. This setting is sufficient for comparing the performance of the two optimization algorithms by varying the number of components v only. Indeed, the computational times to generate the vectors increase proportionally with the number of components v of the vectors; in particular, for high values of v, the computational times may become quite large. In addition, the desired tolerance value ξ is set to 0.1, yielding a bounded interval for the pairwise cosine similarities: [cs − 0.1, cs + 0.1] = [0.4, 0.6]. The number of vectors to be generated is also an arbitrary user choice; in our study we have chosen to generate 100 vectors for each controlled circular data group, as in [14].
Our computational results are reported in Table 1 and Table 2, which compare LS and RVNS on small and large instances of the problem, respectively. In particular, Table 1 contains the results obtained by the two algorithms by setting cs = 0.5 and letting v range from 1·10³ to 10·10³ components, with a step of 1·10³. Table 2 contains instead the results obtained by the two algorithms for larger instances with cs = 0.5 and v ranging from 10·10³ to 100·10³, with a step of 10·10³ components. In both tables, for each components dimension v we have generated 10 different problem instances, therefore the results reported in the tables are the average values over the 10 generated problem instances. For each algorithm, the first column in the tables is the average computational running time in seconds (time); the second column is the average number of unsuccessful iterations (unsucc), that is, iterations in which the perturbed solution violated one of the problem constraints; the third column is the average total number of iterations required by the algorithm to stop (tot iter). Looking at Table 1, containing the smaller problem instances, RVNS performed better than LS in all cases. It was able to generate the required controlled datasets in almost half of the time required by LS.
The superiority of RVNS with respect to LS is also highlighted by the smaller number of unsuccessful perturbations, unsucc, and the smaller total number of iterations, tot iter, obtained by RVNS. In practice, the heuristic based on Reduced Variable Neighbourhood Search proves to be a smarter optimization routine than the classic local search in the literature, being able to produce recurrent perturbations with a higher acceptance likelihood.
As shown in Table 2, these results are also confirmed for larger instances of the problem. RVNS generated the desired controlled circular data in shorter computational time than LS, requiring a smaller total number of iterations and producing a smaller number of unsuccessful perturbations as well. As we can see from this table, for higher values of v the computational times become quite large (see e.g. the case with v = 100·10³), confirming the high complexity of the problem. Nevertheless, RVNS was able to generate the controlled data in considerably shorter computational time with respect to LS, confirming the superiority of the Reduced Variable Neighbourhood Search approach. As shown in our computational analysis, RVNS is able to quickly produce circular data with the desired properties for vectors with very large components dimension v, which represents an important achievement. Figure 1 shows an example of a circular dataset generated by the RVNS procedure: the histogram of the pairwise cosine similarities has the typical shape of a normal distribution, which on the circle is in fact very similar to a von Mises with concentration parameter 9.

Fig. 1. An example of a circular dataset generated by the RVNS procedure with cs = 0.5. As expected, the histogram of the pairwise cosine similarities for samples of the generated power-law data vectors has the shape of a von Mises distribution, i.e. the analogue of the normal distribution on the circle.

Conclusions
In many artificial intelligence and machine learning environments where stored entities are compared with each other or with incoming patterns, it is often necessary to artificially generate similar data to be used for experiments, simulations, or generally for test purposes. Hypothesizing a computational linguistics setting, we have modelled this problem as a constrained non-linear continuous optimization problem where the goal is to generate controlled datasets of similar circular vectors satisfying the conjecture in Eq. 1 for n documents in a vector space of size v (vocabulary size), leaving to the user the choice of the degree of closeness between the data (i.e. the setting of the pairwise similarity value among the generated vectors). Assuming a cosine similarity metric among the data, the problem has been reduced to the artificial generation of similar data in the circular space, further increasing its complexity. In addition, as these controlled datasets are aimed at representing real-world settings in computational linguistics, we have added the further constraint that the circular data need to be power-law distributed, while their pairwise cosine similarity values should follow a von Mises distribution within a desired bounded similarity interval.
In order to address the problem and to produce solutions within a reasonable computational running time, we have elaborated an optimization procedure based on Reduced Variable Neighbourhood Search. In our computational experiments we have shown the superiority of the proposed approach with respect to a local search routine in the literature, in particular in terms of convergence speed. The proposed Reduced Variable Neighbourhood Search is efficient and scales well by adapting automatically to the dimension of the problem to tackle. The presented optimization strategy allows the generation of high-dimensional controlled datasets in the circular space having the desired properties within a small computational running time.