A local approach to parameter space reduction for regression and classification tasks

Parameter space reduction has been proved to be a crucial tool to speed-up the execution of many numerical tasks such as optimization, inverse problems, sensitivity analysis, and surrogate models' design, especially when in presence of high-dimensional parametrized systems. In this work we propose a new method called local active subspaces (LAS), which explores the synergies of active subspaces with supervised clustering techniques in order to carry out a more efficient dimension reduction in the parameter space. The clustering is performed without losing the input-output relations by introducing a distance metric induced by the global active subspace. We present two possible clustering algorithms: K-medoids and a hierarchical top-down approach, which is able to impose a variety of subdivision criteria specifically tailored for parameter space reduction tasks. This method is particularly useful for the community working on surrogate modelling. Frequently, the parameter space presents subdomains where the objective function of interest varies less on average along different directions. So, it could be approximated more accurately if restricted to those subdomains and studied separately. We tested the new method over several numerical experiments of increasing complexity, we show how to deal with vectorial outputs, and how to classify the different regions with respect to the local active subspace dimension. Employing this classification technique as a preprocessing step in the parameter space, or output space in case of vectorial outputs, brings remarkable results for the purpose of surrogate modelling.


Introduction
Parameter space reduction [13,46] is a rapidly growing field of interest which plays a key role in fighting the curse of dimensionality.The need of reducing the number of design inputs is particularly important in engineering for advanced CFD simulations to model complex phenomena, especially in the broader context of model order reduction [11,12,4,38] and industrial numerical pipelines [8,7,39].
Active subspaces [13] is one of the most used techniques for linear reduction in input spaces.It has been proved useful in many numerical tasks such as regression, using a multi-fidelity data fusion approach with a surrogate model built on top of the AS as low-fidelity model [36], shape optimization [29,47,5] and a coupling with the genetic algorithm to enhance its performance [18,16], inverse problems [53], and uncertainty quantification [14].It has also been used to enhance classical model order reduction techniques such as POD-Galerkin [43], and POD with interpolation [17,45].Other attempts towards nonlinear parameter space reduction have been proposed recently: kernelbased active subspaces [35], nonlinear level-set learning [54], and active manifold [6] are the most promising.In [10], instead, they project the input parameters onto a low-dimensional subspace spanned by the eigenvectors of the Hessian corresponding to its dominating eigenvalues.
In this work we propose a new local approach for parameter space dimensionality reduction for both regression and classification tasks, called Local Active Subspaces (LAS).In our work we do not simply apply a clustering technique to preprocess the input data, we propose a supervised metric induced by the presence of a global active subspace.The directions individuated by local active subspaces are locally linear, and they better capture the latent manifold of the target function.
From a wider point of view, there is an analogy between local parameter space reduction and local model order reduction.With the latter, we mean both a spatial domain decomposition approach for model order reduction of parametric PDEs in a spatial domain Ω ⊂ R d and a local reduction approach in the parameter space.As representatives methods for the first paradigm we report the reduced basis element method [28], which combines the reduced basis method in each subdomains with a mortar type method at the interfaces, and more in general domain decomposition methods applied to model order reduction.For the second approach we cite the interpolation method in the Grassmannian manifold of the reduced subspaces [2]; in particular in [15] the K-medoids clustering algorithm with Grassmann metric is applied to the discrete Grassmann manifold of the training snapshots as a step to perform local model order reduction.With this work we fill the gap in the literature regarding localization methods in the context of parameter space reduction.
Other methods have been developed in the last years exploiting the localization idea.We mention localized slice inverse regression (LSIR) [50] which uses local information of the slices for supervised regression and semi-supervised classification.LSIR improves local discriminant information [23] and local Fischer discriminant analysis [41] with more efficient computations for classification problems.The main difference between slice inverse regression (SIR) [27] and AS is in the construction of the projection matrix.While SIR needs the elliptic assumption, AS exploits the gradients of the function of interest with respect to the input parameters.Recently, a new work on the subject was disclosed [51].Here we emphasize the differences and the original contributions of our work: 1) we implemented hierarchical top-down clustering applying K-medoids with a new metric that includes the gradient information through the active subspace.In [51] they employed hierarchical bottom-up clustering with unweighted average linkage and a distance obtained as a weighted sum of the Euclidean distance of the inputs and the cosine of the angle between the corresponding gradients; 2) we included for vector-valued objective functions and answered questions about the employment of the new method to decrease the ridge approximation error with respect to a global approach; 3) we also focused on classification algorithms and we devised a method to classify the inputs based on the local active subspace dimension with different techniques, including the use of the Grassmannian metric; 4) our benchmarks include vector-valued objective functions from computational fluid dynamics.We also show that clustering the outputs with our classification algorithms as pre-processing step leads to more efficient surrogate models.
This work is organized as follows: in section 2 we briefly review the active subspaces method, in section 3 we introduce the clustering algorithms used and the supervised distance metric based on the presence of a global active subspace, focusing on the construction of response surfaces and providing theoretical considerations.In section 4 we present the algorithms to exploit LAS for classification.We provide extensive numerical results in section 5 from simple illustrative bidimensional dataset to high-dimensional scalar and vector-valued functions.Finally, in section 6 we draw conclusions and future perspectives.

Active subspaces for parameter space reduction
Active subspaces (AS) [13] are usually employed as dimension reduction method to unveil a lower dimensional structure of a function of interest f , or provide a global sensitivity measure not necessarily aligned with the coordinate axes [42].Through spectral considerations about the second moment matrix of f , the AS identify a set of linear combinations of the input parameters along which f varies the most on average.
We make some general assumptions about the inputs and function of interest [13,52,42].Let us introduce the inputs as an absolutely continuous random vector X with values in R n and probability distribution µ.We represent with X ⊂ R n the support of µ and as such our parameter space.We want to compute the active subspace of a real-valued function f : (X , B(R n ), µ) → R, where B(R n ) is the Borel σ-algebra of R n .We denote with x ∈ X an element in the space of parameters and with {x i } i a set of realizations of X.
An extension to vector-valued functions has been presented in [52] and extended for kernel-based AS in [35].Even if in this section we focus only on scalar functions, the following considerations can be carried over to the multivariate case without too much effort.
Let Σ be the second moment matrix of ∇f defined as where E µ denotes the expected value with respect to µ, and ∇ x f = ∇f (x) = ∂f ∂x1 , . . ., ∂f ∂xn T is the column vector of partial derivatives of f .Its real eigenvalue decomposition reads Σ = WΛW T .We can retain the most energetic eigenpairs by looking at the spectral decay of the matrix Σ.The number r of eigenpairs we select is the active subspace dimension, and the span of the corresponding eigenvectors defines the active subspace.The partition is the following where Λ 1 = diag(λ 1 , . . ., λ r ), and W 1 contains the first r eigenvectors arranged by columns.With this matrix we can project the input parameters onto the active subspace, and its orthogonal complement, that is the inactive subspace, as follows: with P r : R n → R n the linear projection operator P r := W 1 W T 1 .The selection of the active subspace dimension r can be set a priori, or by looking at the presence of a spectral gap [13], or by imposing a cumulative energy threshold for the eigenvalues.
We will consider the problem of ridge approximation [34] in our applications.The AS are, in fact, the minimizers of an upper bound of the ridge approximation error.
Definition 1 (Ridge approximation).Given r ∈ N, r ≪ n and a tolerance ϵ ≥ 0, find the profile function h : (R n , B(R n ), µ) → R and the r-rank projection P r : R n → R n such that the following upper bound on the ridge approximation error is satisfied where ∥•∥ 2 is the L 2 -norm of R.
For a fixed projection P r the optimal profile h is given by the conditioned random variable E µ [f |P r ].Under the additional assumptions on the probability distribution µ, reported in appendix A.1 of the Appendix, the AS can indeed be defined as a minimizer of an upper bound of the ridge approximation error [13,52,32,35].The proof is a direct consequence of the Poincaré inequality and standard properties of eigenspaces, and for this specific version of the theorem it can be found in [32].
Theorem 1 (Definition of AS through ridge approximation).The solution P r of the ridge approximation problem in definition 1, with optimal profile h = E µ [f |P r ], is the orthogonal projector to the eigenspace of the first r-eigenvalues of Σ ordered by magnitude with C(C P , τ ) a constant depending on τ > 0 related to the choice of µ and on the Poincaré constant C P , and h = E µ [f |σ(P r )] is the conditional expectation of f given the σ-algebra generated by the random variable P r • X.
To ease the notation, in the following we will consider only the first three classes of probability distribution in the assumptions of appendix A.1 in the Appendix, such that τ = 0.

Localized parameter space reduction
Sometimes we do not have a priori knowledge about the target function's behaviour in a particular parameter space region.This could lead to a poor selection of the parameters range, hugely affecting optimization tasks.In these cases, a preprocessing of the data using a clustering technique could be highly beneficial.With a clustering of the input parameters, we can treat each subregion separately, and thus capture more accurately the target function's variability.This is always true for any function of interest, but for functions with global lower intrinsic dimensionality we can exploit such structure to enhance the clustering.To this end, we propose a new distance metric for K-medoids and hierarchical top-down clustering methods which exploits the global active subspace of the target function.By applying AS on each cluster we find the optimal rotation of the corresponding subregion of the input domain, which aligns the data along the active subspace of a given dimension.
In this section, we make some theoretical considerations regarding ridge approximation applied to partitions of the parameter space and review three clustering methods [22]: K-means, K-medoids, and hierarchical top-down clustering [26,31].We are going to use K-means as the baseline since the input parameter space is assumed to be a hyperrectangle.This assumption covers the majority of the practical test cases in the reduced order modeling community.

Ridge approximation with clustering and active subspaces
Regardless of the choice of clustering algorithm, given a partition of the parameter space we want to perform ridge approximation with AS in each subdomain.We will introduce some definitions and make some remarks to clarify the setting.The function of interest f represents scalar outputs, but the following statements can be extended to vector-valued outputs as well.
Definition 2 (Local ridge approximation with active subspaces).Given a partition of the domain P := {S i } i∈{1,...,d} and a map r : P → {1, . . ., n r }, n r ≪ n representing the local reduced dimension, the local ridge approximation with active subspaces of (f , µ) is the function R AS (r, f, µ) : X ⊂ R n → R that is defined locally for every S i ∈ P as where µ i := (1/µ(S i )) • µ| Si ∈ R n , and P r(Si),i : S i ⊂ R n → R n is the orthogonal projector with rank r that satisfies the minimization problem: P r(Si),i = argmin With this definition we can state the problem of local ridge approximation with active subspaces.
Problem 1 (Minimizers (P, r) of the ridge approximation error).Find the partition P of the domain X ⊂ R n and the local reduced dimension map r : P → {1, . . ., n r }, n r ≪ n, such that the L 2 -error between the objective function f and its local ridge approximation with active subspaces is minimized.
Assuming that the subspace Poincaré inequality [32] is valid also for (f, µ) restricted to the elements of the partition P, a straightforward bound is obtained by applying the Poincaré inequality for every element of the partition To obtain the previous upper bound, we made an assumption about the Poincaré subspace inequality that in general is not satisfied by any probability measure µ chosen: the assumptions on the probability distributions {µ i } d i=1 in appendix A.1 of the Appendix have to be satisfied at each subdomain {S i } d i=1 .For the moment we will consider the local reduced dimension map r constant and, in general, the codomain of r is a subset of {1, . . ., n r }, n r ≪ n.
The previous bound suggests that a good indicator for refinement could be represented by the sum of the residual eigenvalues {λ Si,j } m j=r S i of the local correlation matrices, for every S i ∈ P: λ Si,j .
We also have the following immediate result that hints to indefinitely many successive refinements to lower the L 2 -error ridge approximation error.
Remark 1 (Relationships between the upper bounds of consecutive refinements).Considering the sum over the number of refined clusters cl ∈ {1, . . ., d} we have that since the projectors {P r,cl } cl∈{1,...,d} are the minimizers of P r,cl = argmin P 2 =P,P =P T , rank(P )=r The RHS of equation ( 8) can be used as indicator for refinement.We remark that since the refinements increase the decay of the eigenvalues in the RHS of equation ( 8), the choice of the dimension of the active subspace may be shifted towards lower values to achieve further dimension reduction for the same accuracy, as we are going to show in the numerical experiments, in section 5.
Unfortunately, the minimizers of the ridge approximation error and of the upper bound are not generally the same: There is a counterexample for the non localized case in [52].We start from this counterexample to show that in general the L 2 -error of the local ridge approximation does not decrease between consequent refinements, even if the indicator from the RHS of equation ( 8) does, as stated in the previous remark.
Let µ be the uniform probability distribution on X .The objective function we want to approximate is with local reduced dimension map r(A) = r(B) = r(C) = 1.There exist ϵ > 0, ω > 0, such that where P 1,X is the optimal projector on the whole domain X with one-dimensional active subspace.
Proof.The proof is reported in appendix A.3 of the Appendix.
The heuristics behind the previous proof rests on the fact that ridge approximation with active subspaces performs poorly when the objective function has a high variation.The counterexample is valid whenever the global projector P 1,X is the minimizer of a local L 2 ridge approximation error for which the minimizer of the gradient-based indicator in equation ( 8) does not coincide.This leaves us with an indicator in equation ( 8) that does not guarantee a non increasing L 2 -error decay for subsequent refinements, but is nonetheless useful in practice.
We conclude the section with some remarks about the response surface design through the ridge approximation with active subspaces.
Remark 2 (Approximation of the optimal profile).In practice we do not consider the optimal profile h(y) = E µ [f |σ(P r )] (y) but we employ the approximation h(y) = f (y) = f (P r x).The reason lies on the fact that to approximate the optimal profile at the values {y i } i , additional samples from the conditional distribution p(z|y i = P r x) must be obtained; even if the accuracy of the ridge approximation could benefit from it, this is not always possible in practice because of the difficulty to sample from the conditional distribution or because of computational budget constraints.
If the data is split in training, validation, and test set, the local R 2 score on the validation set can be used as indicator for refinement.
Remark 3 (Estimator based on local R 2 scores).The R 2 score of a single cluster can be written with respect to the R 2 scores {R 2 l } l∈{1,...,d} relative to the clusters of the subsequent refinement.Let the sum be over the refinement clusters l ∈ {1, . . ., d}, we have which, substituting with the empirical variance, becomes where R 2 emp;l is the empirical local R 2 score relative to cluster number l.The definition can be extended for component-wise vector-valued objective functions f .The numerical results shown in section 5 consider the mean R 2 score along the components when the output is vectorial.
In practice every expectation is approximated with simple Monte Carlo, and without the number of training samples increasing, the confidence on the approximation is lower and lower, the more the domain is refined.This is taken into consideration while clustering, fixing a minimum number of samples per cluster for example.
The appendix A.2 in the Appendix clarifies the link between the number of Monte Carlo samples, the numerical method chosen for the discretization of the integral E µ [∇f ⊗ ∇f ], and the approximation of the active subspace.For example for deterministic models, one could employ the more efficient Sobol' sequence or a Latin hypercube sampling; if f is more regular and the parameter space dimension is not too high one could employ tensor product Gauss quadrature rule.See for example [42].
Before introducing the clustering algorithms we will employ, we specify that the partition P = {S i } i∈{1,...,d} is defined by the decision boundaries of the clustering algorithm chosen.

K-means clustering
We recall the K-means clustering algorithm.Let {x i } N i=1 be a set of N samples in R N F , where N F denotes the number of features.The K-means algorithm divides this set into The partitioning quality is assessed by a function which aims for high intracluster similarity and low intercluster similarity.For K-means this is done by minimizing the total within-cluster sum-of-squares criterion W T , which reads as where c j is the centroid describing the cluster S j .A centroid of a cluster is defined as the mean of all the points included in that cluster.This means that the centroids are, in general, different from the samples x i .K-means is sensitive to outliers, since they can distort the mean value of a cluster and thus affecting the assignment of the rest of the data.

K-medoids clustering with active subspaces-based metric
In order to overcome some limitations of the K-means algorithm, such as sensitivity to outliers, we can use K-medoids clustering technique [26,33,40,30].It uses an actual sample as cluster representative (i.e.medoid) instead of the mean of the samples within the cluster.
Following the notation introduced in the previous section, let m j be the medoid describing the cluster S j .The partitioning method is performed by minimizing the sum of the dissimilarities between the samples within a cluster and the corresponding medoid.To this end, an absolute-error criterion E is used, which reads as By looking at the formula above it is clear that the use of a data point to represent each cluster's center allows the use of any distance metric for clustering.We remark that the choice of the Euclidean distance does not produce the same results as K-means because of the different references representing the clusters.We propose a new supervised distance metric inspired by the global active subspace of the function f we want to approximate.We define a scaled L 2 norm using the eigenpairs of the second moment matrix of ∇f , which is the matrix from which we calculate the global active subspace: where Λ stands for the diagonal matrix with entries the eigenvalues of equation ( 1), and W is the eigenvectors matrix from the decomposition of the covariance matrix.As we are going to show in section 5 this new metric allows a better partitioning both for regression and classification tasks by exploiting both global and local informations.For insights about the heuristics behind it, we refer to remark 5.
To actually find the medoids the partitioning around medoids (PAM) algorithm [26] is used.PAM uses a greedy approach after the initial selection of the medoids, also called representative objects.The medoids are changed with a non-representative object, i.e. one of the remaining samples, if it would improve the clustering quality.This iterative process of replacing the medoids by other objects continues until the quality of the resulting clustering cannot be improved by any replacement.Algorithm 1 presents this approach with pseudo-code.
Algorithm 1 K-medoids algorithm with AS metric.
Input: set of samples assign each sample to its closest medoid using the distance metric d swap the medoids with the new selected objects by minimizing equation ( 14) 6: until clustering quality converges

Hierarchical top-down clustering
In this section, we present hierarchical top-down clustering [26,31], and exploit the additional information from the active subspace, as done for K-medoids.In the following sections, we refer to this technique with the acronym HAS.
In top-down hierarchical clustering, at each iteration the considered clusters, starting from the whole dataset, are split further and further based on some refinement criterion, until convergence.A nice feature of hierarchical clustering algorithms, with respect to K-means and K-medoids, is that the number of clusters can be omitted.Moreover, by stopping at the first refinement and forcing the total number of clusters to be the maximum number of clusters specified, HAS can be seen as a generalization of the previous methods: for this reason, we wanted to make the implementation consistent with K-means and K-medoids with AS induced metric as close as possible, as shown in the numerical results in section 5.
Pushing further the potential of clustering algorithms applied to local dimension reduction in the parameter space, HAS is a versatile clustering method that takes into account the variability of the AS dimension along the parameter space.The price paid for this is the overhead represented by the tuning of some hyper-parameters introduced later.
A schematic representation of the algorithm of top-down clustering is reported in algorithm 2. The design is straightforward and it employs a tree data structure that assigns at each node a possible clustering of the whole dataset: consequent refinements are represented by children nodes down until the leaves of the tree, that represent the final clusters.
Remark 4 (Normalization of the clusters at each refinement iteration).Each cluster, at every refinement step, is normalized uniformly along dimensions onto the hyper-cube domain [−1, 1] n , even if the subdomain identified by the cluster is not a hyperrectangle.Another possible choice for normalization is standardization, centering the samples with their mean and dividing them by their standard deviation.

Algorithm 2 Hierarchical top-down algorithm.
Input: set of samples S = {x i } N i=1 ∈ R N F maximum number of clusters K range of number of children {n child min , n child max } minimum number of elements in a cluster n el indicator for refinement I distance metric d minimum and maximum AS dimensions r min , r max score tolerance ϵ Output: refinement tree T 1: add the initial cluster S to FIFO queue q = {S} 2: while q ̸ = ∅ do 3: take S j , the first element from queue q 4: apply the refinement function in algorithm 3 to S j to get {S i } i 5: if the score tolerance ϵ is reached or other constraints are violated then end if 9: end while The procedure depends on many parameters that have to be tuned for the specific case or depend a priori on the application considered: the maximum number of clusters (K), the minimum and maximum number of children nodes (n child min , n child max ), the tolerance for the score on the whole domain (ϵ), the minimun and maximum dimension of the active subspace (r min , r max ), and the minimum number of elements (n el ) of each cluster (usually n el > r, where r is the local AS dimension).
More importantly the method is versatile for the choice of clustering criterion, indicator for refinement (I), distance metric (d, from equation ( 15)) and regression method.In the following sections we consider K-means and K-medoids with the active subspaces distance as clustering criterion (see section 3.3), but other clustering algorithms could in principle be applied at each refinement.
Remark 5 (Heuristics behind the choice of the active subspaces metric for K-medoids).Having in mind that the optimal profile h(y) = E µi [f |P r(Si),i ](y) from definition 2 is approximated as h(y) = f (y) = f (P r x) as reported in remark 2, we can argue that clustering with the AS metric from equation ( 15) is effective since, for this choice of the metric, the clusters tend to form transversally with respect to the active subspace directions.This is because the metric weights more the components with higher eigenvalues.So clustering with this metric reduces heuristically also the approximation error induced by the choice of the non-optimal profile.Other clustering criterions employed must satisfy the subspace Poincaré inequality for each cluster.Regarding the regression method we employ Gaussian process regression with RBF-ARD kernel [49].The procedure for response surface design with Gaussian processes and ridge approximation with active subspaces can be found in [13,35].As for the indicator for refinement (I), the local R 2 score in remark 3 is employed to measure the accuracy of the ridge approximation against a validation dataset and the estimator from the RHS of equation ( 8) is used to determine the dimension of the active subspace of each cluster.
Here, we make some considerations about the complexity of the algorithm.For each refinement, considering an intermediate cluster of K elements, the most expensive tasks are: the active subspace evaluation O((K/m)np 2 + (K/m)n 2 p + n 3 ) (the first two costs refer to matrix multiplications, while the third for eigendecomposition), the clustering algorithm, for example K-medoids with AS distance O(K(K − m) 2 ), and the Gaussian process regression O((K/m) 3 p 3 ), where p is the dimension of the outputs and m = n child min and M = n child max are the minimum and maximum number of children clusters, for a more compact notation.In the worst case the height of the refinement tree is l = log m N/n el where n el is the minimum number of elements per cluster.In appendix A.4 we report the detailed computational costs associated to each refinement level.
Input: cluster S = {x i } N i=1 ∈ R N F number of clusters per tree refinement level K range of number of children {n child min , n child max } minmum number of elements in a cluster n el indicator for refinement I distance metric d minimum and maximum AS dimensions r min , r max Output: {S j } n child j=1 , the children of cluster S 1: set best score to b = 0 2: for each n child from n child min to n child max do 3: apply the chosen clustering algorithm (e.g.K-medoids) with n child clusters and metric d to obtain the clusters {S j } n child j 4: evaluate the estimator of the error I for the refinement {S j } j , considering also the minimum and maximum reduced dimensions r min , r max 5: if I > b and the minimum number of elements n el is not reached and the maximum number of clusters K is not reached then 6: save the best refinement {S j } j and update the best score b 7: end if 8: end for

Classification with local active subspace dimension
A poor design of the parameter space could add an avoidable complexity to the surrogate modeling algorithms.Often, in practical applications, each parameter range is chosen independently with respect to the others.Then, it is the responsibility of the surrogate modeling procedure to disentangle the correlations among the parameters.However, in this way, looking at the response surface from parameters to outputs, regions that present different degrees of correlation are treated indistinctly.In this matter, a good practice is to study as a preprocessing step some sensitivity measures, like the total Sobol' indices [42] among groups of parameters, and split the parameter space accordingly in order to avoid the use of more expensive surrogate modeling techniques later.Sobol' indices or the global active subspace sensitivity scores give summary statistics on the whole domain.So in general, one could study the parameter space more in detail, classifying nonlinearly regions with respect to the complexity of the response surface, if there are enough samples to perform such studies.
We introduce an effective approach to tackle the problem of classification of the parameter space with respect to a local active subspace information.With the latter we mean two possible alternatives.
Definition 3 (Local active subspace dimension).Given a threshold ϵ > 0, the pairs of inputs and gradients {(X i , dY i )} i associated to an objective function of interest f : X ⊂ R n → R, the size of the neighbourhood of sample points to consider N ≥ n, and a subsampling parameter p ∈ N, p ≤ N , the local active subspace dimension r i associated to a sample point X i ∈ X is the positive integer where C(N, p) is the set of combinations without repetition of the N elements of the Euclidean neighbourhood of X i in p classes and P r is the projection onto the first r eingenvectors of the symmetric positive define matrix Definition 4 (Local active subspace).Given the pairs of inputs and gradients {(X i , dY i )} i associated to an objective function of interest f : X ⊂ R n → R, the size of the neighbourhood of sample points to consider N ≥ n, and a fixed dimension p ∈ N, 1 ≤ p ≤ N , the local active subspace W i associated to a sample point X i ∈ X is the matrix of the first p eigenvectors of the spectral decomposition of where U is the neighbourhood of sample points of X i with respect to the Euclidean distance.In practice, we choose p close to the global active subspace dimension.The pairs {(X i , W i )} i can be thought as a discrete vector bundle of rank p and {W i } i can be thought as a subset of points of the Grassmannian Gr(N, p), that is the set of p-dimensional subspaces in an N -dimensional vector space.
Starting from the pairs of inputs-gradients {(X i , dY i )} i , the procedure follows these steps: 1.Each parameter sample is enriched with the additional feature corresponding to the local active subspace dimension from definition 3 or the local active subspace from definition 4, represented by the variable Z.

2.
Each sample X i is labelled with an integer l i that will be used as classification label in the next step.To label the pairs {(X i , Z i )} i we selected K-medoids with the Grassmannian metric d(( where ∥•∥ F is the Frobenius distance, in case Z i represents the local active subspace or spectral clustering [31] in case Z i is the local active subspace dimension.In the last case, the labels correspond to the connected components of the graph built on the nodes {(X i , Z i )} i with adjacency list corresponding to the nearest nodes with respect to the distance where ∥•∥ is the Euclidean metric in R n .The connected components are obtained from the eigenvectors associated to the eigenvalue 0 of the discrete Laplacian of the graph [31].Summarizing, we employ two labelling methods: K-medoids in case Z i represents the local active subspace (Definition 4) W i or spectral clustering in case Z i represents the local active subspace dimension (Definition 3).
3. A classification method is applied to the inputs-labels pairs {(X i , l i )} i .Generally, for our relatively simple applications we apply a multilayer perceptron with 1000 hidden nodes and 2 layers.
Remark 6 (Grassmann distance).In general regarding the definition 4, the dimension p could be varying among samples X i and one could use a more general distance with respect to the one from equation ( 17) that can have as arguments two vectorial subspaces of possibly different and arbitrary large dimensions.
Remark 7 (Gradient-free active subspace).In general both the response surface design and the classification procedure above can be carried out from the pairs {(X i , Y i )} i of inputs, outputs instead of the sets {(X i , dY i )} i of inputs, gradients.In fact, the gradients {dY i } can be approximated in many different ways [13] from {(X i , Y i )} i .In the numerical results in section 5 when the gradients are not available they are approximated with the gradients of the local one-dimensional polynomial regression built on top of the neighbouring samples.

Numerical results
In this section we apply the proposed localized AS method to some datasets of increasing complexity.We emphasize that the complexity is not only defined by the number of parameters but also by the intrinsic dimensionality of the problem.We compare the clustering techniques presented in section 3, and we show how the active subspaces-based distance metric outperforms the Euclidean one for those functions which present a global lower intrinsic dimensionality.We remark that for hierarchical top-down clustering we can use both metrics, and we always show the best case for the specific dataset.evaluate feature Z i from (X i , dY i ) and the neighbouring points of X i 3: end for 4: initialize the |I| × |I| distance matrix M associated to {(X i , Z i )} i∈I 5: for each i ∈ I do 6: for each i ≤ j ∈ I do 7: end for 9: end for 10: use the labelling method with input M , to assign a label l i for each (X i , Z i ) 11: train the classification method with the inputs-labels pairs {(X i , l i )} i∈I We start from a bidimensional example for which we can plot the clusters and the regressions, and compare the different techniques.Even if it is not a case for which one should use parameter space dimensionality reduction we think it could be very useful for the reader to understand also visually all the proposed techniques.For the higher dimensional examples we compare the accuracy of the methods in terms of R 2 score and classification performance.All the computations regarding AS are done with the open source Python package 1 called ATHENA [37], for the classification algorithms we use the scikit-learn package [9], and for the Gaussian process regression GPy [21].
We suppose the domain X to be a n-dimensional hyperrectangle.we are going to rescale the input parameters X to [−1, 1] n .

Some illustrative bidimensional examples
We start by presenting two bidimensional test cases to show every aspect of the methodology together with illustrative plots.First we analyse a case where a global active subspace, even if present, does not provide a regression accurate enough along the active direction, in section 5.1.1.Then we consider a radial symmetric function for which, by construction, an AS does not exist, in section 5.1.2,and the use of K-means is instead preferable since we cannot exploit a privileged direction in the input domain.

Let us consider the following bidimensional quartic function
In figure 1 we can see the contour plot of the function, the active subspace directiontranslated for illustrative reasons -and the corresponding sufficient summary plot of the global active subspace, computed using 400 uniformly distributed samples.With sufficient summary plot we intend f (x) plotted against the input parameters projected onto the active subspace, that is W T 1 x.It is clear how, in this case, a univariate regression does not produce any useful prediction capability.
Let us apply the clustering techniques introduced in the previous sections fixing the number of clusters to 4. In figure 2 we can clearly see how the supervised distance metric in equation ( 15) acts in dividing the input parameters.On the left panel we apply K-means which clusters the data into 4 uniform quadrants, while in the middle and right panels we have K-medoids and hierarchical top-down, respectively, with a subdivision aligned with the global AS.We notice that for this simple case the new metric induces an identical clustering of the data.In figure 3 we plotted the sufficient summary plots for each of the clusters individuated by K-medoids or hierarchical top-down in figure 2. By using a single univariate regression for each cluster the R 2 score improves  a lot with respect to a global approach (see right panel of figure 1).We can also compare the R 2 scores for all the methods, using a test datasets of 600 samples.In figure 4 we report the scores for K-means, K-medoids and for hierarchical top-down with ASbased distance metric.The score for the global AS, which is 0.78, is not reported in figure 4 for illustrative reasons.The results are very similar due to the relatively simple test case, but we can see that even with 2 clusters the gain in accuracy is around 23% using the metric in equation (15).
The hierarchical top-down clustering method was ran with the following hyper-parameters: the total number of clusters is increasing from 2 to 10, the minimum number of children equal to the maximum number of children equal to 3, uniform normalization of the clusters, the minimum size of each cluster is 10 elements, the clustering method is K-medoids with AS distance, the maximum active subspace dimension is 1.
Then we want to increase the accuracy of the regression for a fixed number of clusters equal to 3, loosing in some regions the reduction in the parameter space.Starting from the clustering with hierarchical top-down and 3 clusters of dimension 1, the AS dimension of each of the 3 clusters is increased if the threshold of 0.95 on the local R 2 score is not met.In general, the local R 2 score is    The 3 clusters are reported in figure 5 on the left.The R 2 score on the test set is 1, instead of around 0.97 from figure 4. To obtain this result, the central cluster AS dimension is increased from 1 to 2. We compare the clustering with respect to the classification of the local AS dimension with algorithm 4 using as features the local AS dimension as defined in definition 3, on the right of figure 5. Actually, algorithm 4 is stopped after the plotted labels are obtained as the connected components of the underlying graph to which spectral clustering is applied: no classification method is employed, yet.It can be seen that hierarchical top-down clustering with heterogeneous AS dimension is more efficient with respect to the classes of algorithm 4, regarding the number of samples associated to a response surface of dimension 2.

Radial symmetric cosine
This example addresses the case for which an active subspace is not present.This is due to the fact that there are no preferred directions in the input domain since the function f has a radial symmetry.For this case the exploitation of the supervised distance metric does not provide any significant gain and K-means clustering works better on average, since it does not use the global AS structure.The model function we consider is f (x) = cos(∥x∥ 2 ), with x ∈ [−3, 3] 2 .
In figure 6 we compare the R 2 scores for K-means, K-medoids with AS-based metric, and hierarchical top-down with Euclidean metric.We used 500 training samples and 500 test samples.We see K-medoids has not a clear behaviour with respect to the number of clusters, while the other methods present a monotonic trend and better results on average, especially K-means.On the other hand local models improve the accuracy considerably, even for a small number of clusters,   the minimum number of children is equal to the maximum, the minimum number of elements per cluster is 10, the clustering method chosen is K-means, the normalization employed it the uniform one, and the total number of clusters is increasing from 2 to 11.

Higher-dimensional datasets
In this section we consider more interesting benchmarks, for which dimension reduction in the parameter space is useful since the starting dimension of the parameter space is higher.We test the classification procedure in algorithm 4 with an objective function with 6 parameters and defined piecewise as a paraboloid with different AS dimensions.We also test the procedure of response surface design with local AS, with a classical 8-dimensional epidemic benchmark model.

Multi-dimensional hyper-paraboloid
The objective function f : [−4, 4] 6 → R we consider is defined piecewise as follows In the 4 domains in which f is defined differently, we expect an AS dimension ranging from 1 to 4, respectively.We employed algorithm 4 using the local AS dimensions as additional features, from definition 3: the values of the hyper-parameters are the following: ϵ = 0.999, N = 6, p = 4.In figure 7 we plot the accuracy of the classification of the labels, associated to the connected components of the graph built as described in algorithm 4, and also the accuracy of the classification of the local active subspace dimension, that takes the values from 1 to 4. The test dataset for both the classification errors has size 1000.The score chosen to asses the quality of the classification is the mean accuracy, that is the number of correctly predicted labels over the total number of labels.For both the classification tasks 100 train samples are enough to achieve a mean accuracy above 80%.We remark that every step is applied to a dataset of samples in a parameter space of dimension 6, even if, to get a qualitative idea of the performances of the method, in figure 8 we show only the first two components of the decision boundaries of the 4 classes for both the previously described classification problems.

Ebola epidemic model
In this section we examine the performance of the proposed methods over the dataset created with the SEIR model for the spread of Ebola2 .The output of interest in this case is the basic reproduction number R 0 of the SEIR model, described in [19], which is computed using 8 parameters as follows As shown in previous works, this function has a lower intrinsic dimensionality, and thus a meaningful active subspace, in particular of dimension 1.To evaluate the performance of the local AS we compute the R 2 score, as in equation (11), varying the number of clusters from 2 to 10 for all the methods presented.The test and training datasets are composed by 500 and 300, respectively, uniformly distributed and independent samples.The results are reported in figure 9, where as baseline we reported the R 2 for the GPR over the global AS.We can see how the use of the AS-based distance metric contributes the most with respect to the actual clustering method (compare K-medoids and hierarchical top-down in the plot).K-means, instead, does not guarantee an improved accuracy (for 4 and 9 clusters), and in general the gain is limited with respect to the other methods, especially for a small number of clusters which is the most common case in practice, since usually we work in a data scarcity regime.The results for K-medoids and top-down are remarkable even for a small amount of clusters with an R 2 above 0.9 and an improvement over 10% with respect to the global AS, which means that no clustering has been used.
Classification of the labels Mean accuracy on the test set: 0.90 Classification of the local AS dimension Mean accuracy on the test set: 0.95 The hyper-parameters for the hierarchical top-down algorithm are the following: the maximum local active subspace dimension is 1, the maximum number of children is equal to the number of total clusters, the minimum number of children is 2 at each refinement level, the minimum number of elements per cluster is 10, and the clustering method for each refinement is K-medoids with AS distance.

Datasets with vectorial outputs
In this section we want to show how hierarchical top-down clustering and the classification procedure of algorithm 4 can be combined to improve the overall reduction in the parameter space, for a fixed lower threshold in the R 2 score.For the response surface design with active subspaces for vectorial outputs we refer to [52,35].

Poisson equation with random diffusivity
Let us consider the stochastic Poisson problem on the square x = (x, y) ∈ Ω := [0, 1] 2 , defined as: with homogeneous Neumann boundary condition on ∂Ω right , and Dirichlet boundary conditions on the remaining part of ∂Ω.The diffusion coefficient κ : (Ω, A, P ) × Ω → R, with A denoting a σ-algebra, is such that log(κ) is a Gaussian random field, with covariance function G(x, y) defined by where the correlation length is β = 0.03.We approximate this random field with the truncated Karhunen-Loève decomposition as where (X i ) i∈1,...,m are independent standard normal distributed random variables, and the eigenpairs of the Karhunen-Loève decomposition of the zero-mean random field κ are denoted with (γ i , ψ i ) i∈1,...,m .The parameters (X i ) i∈1,...,m=10 sampled from a standard normal distribution are the coefficients of the Karhunen-Loève expansion, truncated at the first 10 modes, so the parameter space has dimension m = 10.The domain Ω is discretized with a triangular unstructured mesh T with 3194 triangles.The simulations are carried out with the finite element method with polynomial order 1.The solution u is evaluated at 1668 degrees of freedom, thus the output is vectorial with dimension d = 1668.As done in [52,35], the output is enriched with the metric induced by the Sobolev space H 1 (Ω) on to the finite element space of polynomial order 1: the metric is thus represented by a d × d matrix M obtained as the sum of the mass and stiffness matrices of the numerical scheme and it is involved in the AS procedure when computing the correlation matrix E Df M Df T , where Df is the m × d Jacobian matrix of the objective function f : R 10 → R d , that maps the first m = 10 coefficients of the Karhunen-Loéve expansion (X i ) i∈1,...,m to the solution u.The Jacobian matrix is evaluated for each set of parametric instances with the adjoint method, as in [35].
Since the output is high-dimensional we classified with algorithm 4 the output space in 6 clusters, using the Grassmann distance from equation (17), as shown in figure 10.
Afterwards we applied hierarchical top-down clustering to every one of the 6 triplets of inputsoutputs-gradients, obtained restricting the outputs and the gradients to each one of the 6 clusters.The specifics of hierarchical top-down clustering we employed are the following: the minimum and maximum number of children for each refinement are equal to the total number of clusters, which is 4, the minimum number of elements in each cluster is 10, and the clustering algorithm chosen is K-medoids with the AS distance.The size of the training and test datasets is respectively of 500 and 150.The gradients are evaluated with the adjoint method.Since the output is vectorial we employed the mean R 2 score, where the average is made among the components of the vectorial output considered.
Then for every lower threshold on the R 2 score we increase one by one the dimension of the 6 × 4 local clusters, until all the R 2 scores of each of the 6 triplets are above the fixed threshold.The same procedure is applied to the whole dataset of inputs-outputs-gradients but executing hierarchical top-down clustering just once, for all the output's components altogether.
The results are reported in figure 11.In the case of the clustered outputs, the local dimension of each one of the 6 clustered outputs times 4 local clusters in the parameter space, for a total of 24 local clusters, are weighted with the number of elements of each cluster.In the same way the 4 clusters of the case with unclustered outputs is weighted with the number of the elements of each one of the 4 clusters.It can be seen that for every fixed threshold, there is an evident gain, with respect to the dimension reduction in the parameter space, in clustering the outputs and then performing hierarchical top-down clustering in the parameter space.10) for a total of 24 terms in the weighted average.

Shape design of an airfoil
For this vectorial test case we consider the temporal evolution of the lift coefficient of a parametrized NACA airfoil.Here we briefly present the problem we solve to create the dataset, we refer to [44] for a deeper description.
Let us consider the unsteady incompressible Navier-Stokes equations described in an Eulerian framework on a parametrized space-time domain S(µ) = Ω(µ) × [0, T ] ⊂ R 2 × R + .The vectorial velocity field u : S(µ) → R 2 , and the scalar pressure field p : S(µ) → R solve the following parametric PDE: Here, Γ = Γ in ∪ Γ out ∪ Γ 0 denotes the boundary of Ω(µ) composed by inlet boundary, outlet boundary, and physical walls, respectively.With f (x) we indicate the stationary non-homogeneous boundary condition, and with k(x) the initial condition for the velocity at t = 0.The geometrical deformation are applied to the boundary Γ 0 (µ).The undeformed configuration corresponds to the NACA 4412 wing profile [1,25].To alter such geometry, we adopt the shape parametrization and morphing technique proposed in [24], where 5 shape functions are added to the airfoil profiles.They are commonly called Hicks-Henne bump functions.Let y u and y l be the upper and lower ordinates of the profile, respectively.The deformation of such coordinates is described as follows where the bar denotes the reference undeformed state.The parameters µ ∈ D ⊂ R 10 are the weights coefficients, c i and d i , associated with the shape functions r i .In particular we set D := [0, 0.03] 10 .The explicit formulation of the shape functions can be found in [24].For this datasets, the Reynolds number is Re = 50000.The time step is dt = 10 −3 s.For other specifics regarding the solver employed and the numerical method adopted we refer to [44].
As outputs we considered the values of the lift coefficient, every 15 time steps from 100 ms to 30000 ms, for a total of 1994 components.Even in this case the output is classified with algorithm 4 with distance defined in definition 3. The values of the lift coefficient physically interesting are collected at last, after an initialization phase.Nonetheless for the purpose of having a vectorial output we considered its value from the time instant 100 ms.The procedure finds two classes and splits the ordered output components in two parts: from the component 0 to 996, the local AS dimension is 1, for the remaining time steps it is higher.So we can expect an improvement on the efficiency of the reduction in the parameter space when considering separately these two sets of outputs components as figure 12 shows.The weighted local AS dimension is in fact lower when using clustering, for every minimum R 2 threshold.

Conclusions and perspectives
In this work we present a new local approach for parameter space reduction which exploits supervised clustering techniques, such as K-means, K-medoids, and hierarchical top-down, with a distance metric based on active subspaces.We called this method local active subspaces (LAS).The proposed metric tend to form the clusters transversally with respect to the active subspace directions thus reducing the approximation error induced by the choice of the non-optimal profile.
The theoretical formulation provides error estimates for the construction of response surfaces over the local active subspaces.We also present a classification approach to capture the optimal From Corollary 8.1.11 of [20], a bound on the approximation error of the active subspace W 1 can be obtained making explicit ∥W T 2 EW 1 ∥ F with respect to the chosen numerical method for the discretization Ĉ of the integral C = E µi [∇f ⊗ ∇f ]: in [13] this has been done for the Monte Carlo method.In practice we could use quasi Monte Carlo sampling methods with Halton or Sobol' sequences [42], since Other numerical integration rules can be chosen so that different regularity conditions on the objective function may appear on the upper bound of the error, as the Lipschitz constant on equation ( 3) or the Hardy-Krause variation on equation ( 4).If the regularity of f is C s , we can also apply tensor product quadrature formulae or Smolyak's sparse quadrature rule [42].For high-dimensional datasets and f less regular, the estimate in equation ( 3) is the sharpest.

A.4 Computational complexity of hierarchical top-down
In table 1 we report the computational complexity of the hierarchical top-down clustering algorithm.We report the costs divided by level of refinement.

2 AFigure 1 :
Figure 1: On the left panel the contour plot of the quartic function and in orange the global active subspace direction.On the right panel the sufficient summary plot resulting projecting the data onto the global AS.

Figure 2 :
Figure 2: Comparison between the different clusters obtained by K-means (on the left), K-medoids (middle panel), and hierarchical top-down (on the right) with AS induced distance metric defined in equation (15) for the quartic test function.In orange the global active subspace direction.Every cluster is depicted in a different color.

Figure 3 :
Figure 3: Local sufficient summary plots for the 4 clusters individuated by K-medoids or hierarchical top-down in figure 2 (colors correspond).

Figure 4 :
Figure 4: R 2 scores comparison between local versions varying the number of clusters for the quartic function.Global AS has a score equal to 0.78.

Figure 5 :
Figure 5: On the left panel the hierarchical top-down clustering with heterogeneous AS dimension and R 2 score equal to 1. On the right panel the labels of the local AS dimension from definition 3.

Figure 6 :
Figure 6: R 2 scores comparison between global AS and local versions varying the number of clusters for the isotropic model function.Global AS corresponds to no clustering.

Figure 7 :
Figure 7: Mean accuracy study for a training dataset increasing in size from 50 to 500 samples.The test set is made of 1000 independent samples.The classification accuracy for the procedures of connected component classification (in blue) and local AS dimension classification (in orange) are both shown.

Figure 8 :
Figure 8: On the left panel, the decision boundaries of the 4 classes associated to the connected components of the graph built as described in algorithm 4. On the right panel, the decision boundaries of the 4 classes associated to the local AS dimension from 1 to 4. The datasets has dimension 6, only the first two components of the decision boundaries and of the test samples are plotted.

Figure 9 :
Figure 9: R 2 scores comparison between global AS and local versions varying the number of clusters for the Ebola spread model.Global AS corresponds to no clustering.

Figure 10 :
Figure 10: Subdivision of the spatial domain Ω in 6 clusters based on the Grassmann distance from definition 4, i.e. the clusters correspond to the connected components of the graph built on top of the degrees of freedom with adjacency list determined using as distance definition 4.

Figure 11 :
Figure 11:  In orange the local AS dimensions weighted on the number of elements of each of the 4 clusters in the parameter space, obtained with hierarchical top-down clustering.In blue the local AS dimensions weighted on the number of elements of each of the 4 clusters in the parameter space, obtained with hierarchical top-down clustering, times 6 clustered outputs (see figure10) for a total of 24 terms in the weighted average.

Figure 12 :
Figure 12: In orange the local AS dimensions weighted on the number of elements of each of the 2 clusters in the parameter space, obtained with hierarchical top-down clustering.In blue the local AS dimensions weighted on the number of elements of each of the 2 clusters in the parameter space, obtained with hierarchical top-down clustering, times 2 clustered outputs for a total of 4 terms in the weighted average.

P 1 ,
X = e 1 ⊗ e 1 , P 1,A = P 1,C = e 1 ⊗ e 1 , P 1,B = e 2 ⊗ e 2 , p 3 ) GPR First refinement: O(N (N − k) 2 ) K-medoids k from m to M O((N/k)np 2 + (N/k)n 2 p + n 3 ) AS O((N/k) 3 p 3 ) GPR Intermediate refinements --Last refinement: O((N/k l−1 )((N/k l−1 ) − k) 2 ) K-medoids k from m to M O((N/k l )np 2 + (N/k l )n 2 p + n 3) AS for each one of the m l−1 clusters O((N/k l ) 3 p 3 ) GPR (18)rithm 4Classification with local features from the AS information.Input: inputs-gradients pairs {(X i , dY i )} i∈I as training dataset local features based on AS information {Z i } i∈I labelling method based on the distance d from equation(17)or equation(18)classification method taking the inputs-labels pairs {(X i , l i )} i∈I Output: predictor for new test inputs and classes of the training dataset.
1: for each i ∈ I do 2:

Table 1 :
Computational complexity of hierarchical top-down clustering.