Theoretical Aspects in Penalty Hyperparameters Optimization

Learning processes play an important role in enhancing understanding and analyzing real phenomena. Most of these methodologies revolve around solving penalized optimization problems. A significant challenge arises in the choice of the penalty hyperparameter, which is typically user-specified or determined through grid-search approaches. There is a lack of automated tuning procedures for the estimation of these hyperparameters, particularly in unsupervised learning scenarios. In this paper, we focus on the unsupervised context and propose a bi-level strategy to address the issue of tuning the penalty hyperparameter. We establish suitable conditions for the existence of a minimizer in an infinite-dimensional Hilbert space, along with some theoretical considerations. These results can be applied in situations where obtaining an exact minimizer is unfeasible. Since we estimate the hyperparameter with gradient-based methods, we also introduce a modified version of Ekeland's variational principle as a stopping criterion for these methods. Our approach distinguishes itself from conventional techniques by reducing the reliance on random or black-box strategies, resulting in stronger mathematical generalization.


Introduction
Training a Machine Learning (ML) algorithm is quite important to produce data-driven models, which can be successfully applied in real-life applications. These processes often require the user to specify several variables, namely hyperparameters, which must be set before the learning procedure starts. Hyperparameters govern the whole learning process and play a crucial role in guaranteeing good model performance. They are often manually specified, and the lack of an automatic tuning procedure makes the field of Hyperparameter Optimization (HPO) an ever-evolving topic. The literature offers various solutions for hyperparameter tuning, from gradient-based to black-box or Bayesian approaches, besides some naive but daily used methods such as grid and random search. A brief overview of existing methods can be found in [1]. Hyperparameters can be of different types (discrete, continuous, categorical), and in most cases the number of their configurations to explore is infinite. This paves the way for a mathematical formalization of HPO in the ML context with abstract spaces, such as Hilbert spaces.
A supervised learning algorithm may be represented as a mapping that takes a configuration of hyperparameters and a dataset D and returns a hypothesis [2]:

A : D × Λ → H,    (1)

where D is the space of finite-dimensional datasets representing a task, X and Y are the input and output spaces, Λ is a hyperparameter space, and H is a hypothesis space. A quite standard requirement for the hypothesis set is to be a linear function space, endowed with a suitable norm (more binding if arising from an inner product): two requirements satisfied when H is a Hilbert space of functions over X. Assuming a Hilbert space structure on the hypothesis space has some advantages: (i) practical computations reduce to ordinary linear algebra operations and (ii) self-duality, that is, for any x ∈ X a representative of x can be found, i.e., k_x ∈ H exists such that

f(x) = (f, k_x)  for every f ∈ H,    (2)

where k_x is induced by a suitable positive definite "kernel" k. This construction connects the abstract structure of H with what its elements actually are, flipping the construction of the hypothesis set so that it starts from the kernel. Given a suitable positive definite function k on X, H can be defined as the minimal complete space of functions containing all the {k_x}_{x∈X}, equipped with the scalar product in (2). Thus, H is outlined in a unique way, and it is named the Reproducing Kernel Hilbert Space associated with the kernel k.
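As a minimal numerical sketch of this construction (the Gaussian kernel, the sample points, and all variable names below are illustrative choices, not the paper's): hypotheses are finite combinations f = Σ_i c_i k_{x_i}, the reproducing property gives f(x) = Σ_i c_i k(x_i, x), and inner products reduce to ordinary linear algebra through the Gram matrix.

```python
import numpy as np

def gaussian_kernel(x, y, sigma=1.0):
    """A positive definite kernel k(x, y); Gaussian, as an example."""
    return np.exp(-np.linalg.norm(x - y) ** 2 / (2 * sigma ** 2))

X = np.array([[0.0], [1.0], [2.0]])    # sample points x_i in the input space
c = np.array([1.0, -0.5, 0.25])        # coefficients of f = sum_i c_i k_{x_i}

def f(x):
    """Evaluate f via the reproducing property: f(x) = sum_i c_i k(x_i, x)."""
    return sum(ci * gaussian_kernel(xi, x) for ci, xi in zip(c, X))

# Gram matrix K[i, j] = k(x_i, x_j): inner products in the RKHS become
# ordinary matrix-vector computations, e.g. (f, f) = c^T K c.
K = np.array([[gaussian_kernel(a, b) for b in X] for a in X])
norm_f_sq = c @ K @ c                  # squared RKHS norm of f
```

Evaluating f at a sample point is the same as a row of the Gram matrix applied to c, which is exactly the reproducing property in matrix form.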
Starting from this abstract scenario, one can investigate HPO in supervised ML more deeply. Formally, HPO can be formulated as the problem of minimizing the discrepancy between A, trained on a given training dataset D_tr, and a validation dataset D_val [3], to find the optimal λ* such that

λ* ∈ argmin_{λ∈Λ} E(A(D_tr; λ), D_val).    (3)

In this study, we will address problem (3) through gradient-based (GB) methods, by using a bi-level approach. Bi-level programming solves an outer optimization problem subject to the optimality of an inner optimization problem, and it can be adopted to formalize HPO for any learning algorithm [4][5][6][7]. We will work on Hilbert spaces for solving HPO in unsupervised problems, considering as hyperparameter the penalty coefficient. We have already treated this aspect in the particular and more specific case of the Nonnegative Matrix Factorization task, incurring some generalization problems and restrictions in the theorems' assumptions [8].
To overcome the difficulties in ensuring the theoretical assumptions when real data domains are considered, this work extends existence and uniqueness theorems for the solution of the hyperparameter bi-level problem to the more general framework of infinite-dimensional Hilbert spaces. The latter also allows the application of Ekeland's variational principle to state that, whenever a functional is not guaranteed to have a minimum, under suitable assumptions a "good" substitute can be found, namely the best one can get as an approximate minimum. One of the purposes of this paper is to use this theoretical tool as a stopping criterion for the update of the hyperparameters, as we will see later.
The outline of the paper is as follows. Section 2 introduces the classical bi-level formalization of HPO and some preliminary notions in a supervised context. Section 3 illustrates our proposal, an extension to the unsupervised context. A general framework addressing HPO in Hilbert spaces is also set up, and some general abstract tools are stated in Section 4. Section 5 presents a critical discussion and some practical considerations. Finally, Section 6 summarizes the obtained results and draws some conclusions.

Previous works and preliminaries
As briefly mentioned in the introduction, in a supervised learning scenario HPO can be addressed through a bi-level formulation. This approach looks for the hyperparameters λ such that the minimization of the regularized training leads to the best performance of the trained data-driven model on a validation set. According to the ideas introduced in [9,10], the best hyperparameters for a data learning task can be selected as the solution of the following problem:

min_{λ∈Λ} J(λ) = E(w_λ, λ)    (4)
s.t.  w_λ ∈ argmin_{w∈R^r} L_λ(w),    (5)

where w ∈ R^r are the r parameters, J : Λ → R is the so-called Response Function of the outer problem E : R^r × Λ → R, and, for every λ ∈ Λ ⊂ R^p, L_λ : R^r → R is the inner problem.
A reformulation of HPO as a bi-level optimization problem can also be solved via GB algorithms. In particular, in GB methods HPO is addressed with a classical procedure for continuous optimization, in which the hyperparameter update is given by

λ_{t+1} = λ_t − α h_t,    (6)

where h_t is an approximation of the gradient of the function J and α is a step size. It is known that the main challenge in this context is the computation of h_t, called the hypergradient. In several cases, a numerical approximation of the hypergradient can be calculated for real-valued hyperparameters, although few learning algorithms are differentiable in the classical sense. There are two main strategies for computing the hypergradient: iterative differentiation [9,11,12] and implicit differentiation [13,14]. The former requires calculating the exact gradient of an approximate objective, defined through the recursive application of an optimization dynamics that aims to replace and approximate the learning algorithm A; the latter involves the numerical application of the implicit function theorem to the solution mapping A(D_tr; •), when it is expressible through an appropriate equation [2].
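The update (6) can be sketched in a few lines of code. The example below is a hedged illustration, not the paper's algorithm: the inner problem is a ridge-style penalized least squares solved by an explicit optimization dynamics (gradient descent), the response function J(λ) is the validation error of the resulting w_λ, and the hypergradient h_t is approximated by central finite differences; all data and function names are synthetic placeholders.

```python
import numpy as np

rng = np.random.default_rng(0)
Xtr, ytr = rng.normal(size=(30, 5)), rng.normal(size=30)    # training split
Xval, yval = rng.normal(size=(20, 5)), rng.normal(size=20)  # validation split

def inner_solve(lam, steps=200, lr=0.01):
    """Approximate w_lambda = argmin_w L_lambda(w) via gradient descent."""
    w = np.zeros(5)
    for _ in range(steps):
        grad = Xtr.T @ (Xtr @ w - ytr) / len(ytr) + lam * w
        w -= lr * grad
    return w

def response(lam):
    """Response function J(lambda): validation error of the trained model."""
    w = inner_solve(lam)
    return np.mean((Xval @ w - yval) ** 2)

# Gradient-based update lambda_{t+1} = lambda_t - alpha * h_t, where h_t is
# a finite-difference approximation of the hypergradient dJ/dlambda.
lam, alpha, fd_eps = 0.5, 0.1, 1e-4
for _ in range(20):
    h = (response(lam + fd_eps) - response(lam - fd_eps)) / (2 * fd_eps)
    lam = max(lam - alpha * h, 0.0)    # keep the penalty coefficient nonnegative
```

Iterative differentiation would replace the finite difference by differentiating through the unrolled `inner_solve` dynamics; the outer loop structure stays the same.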
In this study, we follow the iterative strategy, so that the problem in (4)-(5) can be addressed through a dynamical-system-type approach.
If the following hypotheses (Hypothesis 1) hold:

1. the hyperparameter space Λ is compact;
2. the Error Function E : R^r × Λ → R is jointly continuous;
3. the map (w, λ) → L_λ(w) is jointly continuous, and the problem argmin_w L_λ(w) is a singleton for every λ ∈ Λ;
4. the solution w_λ = argmin_w L_λ(w) remains bounded as λ varies in Λ;

then the problem in (4)-(5) becomes:

min_{λ∈Λ} E(w_λ, λ)  s.t.  w_λ = argmin_{w∈R^r} L_λ(w).    (7)

It can be proved that an optimal solution (w_{λ*}, λ*) of problem (7) exists [11].
The goal of HPO is to minimize the validation error of the model g_w : X → Y, parameterized by a vector w ∈ R^r, with respect to the hyperparameters λ.
Considering penalty optimization problems in which the hyperparameter is the penalty coefficient λ ∈ R_+, the Inner Problem is the penalized empirical error represented by L, defined as:

L_λ(w) = Σ_{i=1}^n ℓ(g_w(x_i), y_i) + λ r(w),    (8)

where ℓ is the loss function, D_tr = {(x_i, y_i)}_{i=1}^n is the training set, and r : R^r → R is a penalty function. The Outer Problem is the generalization error of g_w, represented by E:

E(w) = Σ_{(x,y)∈D_val} ℓ(g_w(x), y),    (9)

where D_val is the validation set. Note that E does not explicitly depend on λ.
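The inner/outer pair (8)-(9) translates directly into code. The sketch below is illustrative only: it assumes a linear model g_w, a squared loss ℓ, and an L2 penalty r(w) = ∥w∥², with synthetic placeholder data.

```python
import numpy as np

rng = np.random.default_rng(1)
D_tr = [(rng.normal(size=3), rng.normal()) for _ in range(10)]   # training set
D_val = [(rng.normal(size=3), rng.normal()) for _ in range(5)]   # validation set

def g(w, x):
    """Model g_w(x); linear, as an example."""
    return w @ x

def loss(yhat, y):
    """Loss function ell; squared error, as an example."""
    return (yhat - y) ** 2

def L(w, lam):
    """Inner problem (8): penalized empirical error on D_tr."""
    return sum(loss(g(w, x), y) for x, y in D_tr) + lam * (w @ w)

def E(w):
    """Outer problem (9): validation error; note it does not depend on lam."""
    return sum(loss(g(w, x), y) for x, y in D_val)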
This work allows overcoming some assumptions of Hypothesis 1 (such as compactness) that are difficult to satisfy in real data learning contexts, and also using theoretical results such as Ekeland's variational principle, stated in the following, to improve iterative algorithms.
Theorem 1 (Ekeland's variational principle) [15] Let (V, d) be a complete metric space and J : V → R be a lower semi-continuous function which is bounded from below. Suppose that ε > 0 and ṽ ∈ V exist such that

J(ṽ) ≤ inf_V J + ε.

Then, given any ρ > 0, v_ρ ∈ V exists such that

J(v_ρ) ≤ J(ṽ),   d(v_ρ, ṽ) ≤ ρ,

and, for every v ∈ V with v ≠ v_ρ,

J(v) > J(v_ρ) − (ε/ρ) d(v, v_ρ).

Our Proposal
The bi-level HPO framework can be modified to include unsupervised learning paradigms, generally designed to detect some useful latent structure embedded in the data. Tuning hyperparameters for unsupervised learning models is more complex than in the supervised case due to the lack of an output space, which would define the ground truth collected in the validation set.
This section describes a general framework to address HPO in Hilbert spaces for the unsupervised case and a corollary of Ekeland's variational principle used to derive a useful stopping criterion for iterative algorithms solving this HPO. Let X ∈ R^{n×m} be a data matrix. With reference to problem (4)-(5), where now J : Λ → R is a suitable functional and Λ is a Hilbert space equipped with the scalar product (•, •), the outer problem is:

min_{λ∈Λ} J(λ),    (10)

and for every λ ∈ Λ the inner problem is:

w_λ ∈ argmin_{w∈R^r} L_λ(w) = ℓ(w; X) + R(λ, w),    (11)

where ℓ(•; X) measures the fit to the data matrix X and R : Λ × R^r → R is a penalty function. We want to emphasize the new formulation with respect to (8) regarding the function L_λ, in which each component of the parameter w is penalized independently, and all optimization is performed on the data matrix X.
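A hedged sketch of such an unsupervised inner objective follows. The concrete choices are illustrative only: the fit term is the reconstruction error of a rank-one model of X with a fixed factor, and the penalty R(λ, w_i) is applied to each component of w independently, as in the formulation above.

```python
import numpy as np

rng = np.random.default_rng(2)
X = rng.normal(size=(8, 4))            # data matrix, n x m (synthetic)

def R(lam, w_i):
    """Componentwise penalty R(lambda, w_i); a weighted square, as an example."""
    return lam * w_i ** 2

def L_inner(w, lam):
    """Unsupervised inner objective: reconstruction error on X plus the
    penalties applied to each component of w independently."""
    u = np.ones(X.shape[0])            # fixed factor, for illustration only
    residual = X - np.outer(u, w)
    return np.sum(residual ** 2) + sum(R(lam, wi) for wi in w)
```

No validation set appears: the discrepancy is measured against the data matrix X itself, which is exactly what distinguishes this setting from (8)-(9).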
The bi-level problem associated with (10)-(11) can be solved with a dynamical system approach in which the hypergradient is computed. Once the hypergradient is obtained, a gradient-based approach can be used to find the optimum λ*. Ekeland's variational principle can be used to construct an appropriate stopping criterion for iterative algorithms, with the aim of justifying and setting the hyperparameters related to the stopping criterion more appropriately. Roughly speaking, this variational principle asserts that, under assumptions of lower semi-continuity and boundedness from below, if a point λ is an "almost minimum point" for a function J, then a small perturbation of J exists which attains its minimum at a point "near" λ. A fruitful selection of ρ occurs when ρ = √ε, since such a choice allows us to reduce the number of hyperparameters to the precision error only; thus we will use Theorem 1 in the following form.
Corollary 1 Let (Λ, d) be a complete metric space and J : Λ → R be a lower semi-continuous function which is bounded from below. Suppose that ε > 0 and λ̄ ∈ Λ exist such that

J(λ̄) ≤ inf_Λ J + ε.

Then, z ∈ Λ exists such that

J(z) ≤ J(λ̄),   d(z, λ̄) ≤ √ε,

and, for every v ∈ Λ with v ≠ z,

J(v) > J(z) − √ε d(v, z).
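The corollary can be checked numerically in one dimension. Below, a minimal illustration (all choices are ours, for exposition): with J(λ) = λ², which is lower semi-continuous and bounded below with infimum 0, an "almost minimum point" λ̄ satisfying J(λ̄) ≤ inf J + ε is perturbed to a point z within distance √ε that satisfies all three conclusions.

```python
import numpy as np

J = lambda lam: lam ** 2      # bounded below, inf J = 0
eps = 1e-2
lam_bar = 0.05                # J(lam_bar) = 2.5e-3 <= inf J + eps

# Corollary 1 guarantees some z with J(z) <= J(lam_bar),
# |z - lam_bar| <= sqrt(eps), and J(v) > J(z) - sqrt(eps)*|v - z| for v != z.
# Here the true minimizer z = 0 satisfies all three conditions:
z = 0.0
assert J(z) <= J(lam_bar)
assert abs(z - lam_bar) <= np.sqrt(eps)
vs = np.linspace(-1.0, 1.0, 201)
assert all(J(v) >= J(z) - np.sqrt(eps) * abs(v - z) for v in vs)
```

The last inequality says z is the strict minimizer of the perturbed functional J(•) + √ε d(•, z), which is the property exploited by the stopping criterion later on.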

Main Abstract results
In this section, we weaken the assumptions discussed earlier and provide results related to the use of Ekeland's principle as a stopping criterion. We mention an abstract result on the existence of a minimizer in Hilbert spaces, which has great importance and a wide range of applications in several fields. As just one example, Riesz's Representation Theorem, even if implicitly, makes use of the existence of a minimizer [16]. This is a widely relevant feature of Hilbert spaces, which makes them nicer than Banach spaces or other topological vector spaces. One can think, for example, that the whole Dirac bra-ket formalism of quantum mechanics relies on this identification.

Abstract Existence Theorem
It is well known that every bounded sequence in a normed space Λ has a norm-convergent subsequence if and only if Λ is finite dimensional. Thus, given a normed space Λ, as the strong topology (i.e., the one induced by the norm) is too strong to provide any widely applicable subsequential extraction procedure, one can consider other, weaker topologies compatible with the linear structure of the space and look for subsequential extraction processes therein.
In Banach spaces, as well as in Hilbert spaces, the two most relevant weaker-than-norm topologies are the weak-star topology and the weak topology. While the former is defined on dual spaces, the latter is available in every normed space. The notions of these topologies are not self-contained but play a leading role in many features of Banach space theory. In this regard, we state here some results we will use shortly.
Theorem 2 If Λ is a finite-dimensional space, the strong and weak topologies coincide. In particular, it follows that the weak topology is normable, and then clearly metrizable, too. If Λ is an infinite-dimensional space, the weak topology is strictly coarser than the strong topology, namely open sets for the strong topology exist which are not open for the weak topology. Furthermore, the weak topology turns out not to be metrizable in this case.
Definition 1 A functional J : Λ → R, with Λ a topological space, is said to be lower semi-continuous on Λ if, for each a ∈ R, the sublevel set

{λ ∈ Λ : J(λ) ≤ a}

is closed in Λ.

In the following, we introduce a "generalized Weierstrass Theorem" which gives a criterion for the existence of a minimum of a functional defined on a Hilbert space. For this reason, the incoming results will be stated in the abstract framework of a Hilbert space although, in some cases, they apply in the more general context of Banach spaces. Thus, throughout the remaining part of this section we denote by Λ any real infinite-dimensional Hilbert space. In an infinite-dimensional setting, the following definitions are strictly related to the different notions of weak and strong topology.
Definition 2 A functional J : Λ → R is said to be strongly (respectively, weakly) lower semi-continuous if J is lower semi-continuous when Λ is equipped with the strong (respectively, weak) topology.
We proceed by providing some useful results.

Proposition 3
The following statements are equivalent:

i) J : Λ → R is a sequentially weakly lower semi-continuous functional;
ii) the epigraph of J is weakly sequentially closed, where, by definition,

epi(J) = {(λ, t) ∈ Λ × R : J(λ) ≤ t}.

Remark 1 As a further consequence of the preliminary Theorem 2, sequential weak lower semi-continuity and weak lower semi-continuity do not coincide if Λ is infinite dimensional, since the weak topology is not metrizable. However, the weaker concept of sequential weak lower semi-continuity meets our needs.
Proposition 4 Let C ⊆ Λ be a closed and convex subset. Then, C is weakly sequentially closed, too.
Since a sequentially weakly closed set is also strongly closed, it follows that a sequentially weakly lower semi-continuous functional is also (strongly) lower semi-continuous. The converse, instead, holds under an additional assumption. In particular, Proposition 4 allows us to infer the following result.
Proposition 5 If J : Λ → R is a strongly lower semi-continuous convex functional, then J is weakly sequentially lower semi-continuous, too.

Proof Since J is lower semi-continuous, epi(J) is closed. On the other hand, since J is convex, so is epi(J), whence Proposition 4 ensures that epi(J) is weakly sequentially closed, i.e., J is weakly sequentially lower semi-continuous.
An immediate consequence of Theorem 7 is the following subsequence convergence result, in which ζ denotes a proper growth function: if (λ*, J(λ*)) is a limit point of the sequence (λ_k, J(λ_k)), then λ* is stationary for J.
Two interesting consequences for convergence analysis flow from there. Suppose that the model functions are chosen in such a way that the step sizes ∥λ_{k+1} − λ_k∥ tend to zero. This assumption is often enforced by ensuring that J(λ_{k+1}) < J(λ_k) by at least a multiple of ∥λ_{k+1} − λ_k∥² (sufficient decrease condition). By model function we mean the Taylor expansion of J at λ, truncated to the first order. Then, assuming for simplicity that J is continuous on its domain, any limit point λ* of the iterate sequence (λ_k) will be stationary for the problem (Corollary 3). Thus, by choosing an error ε, we can stop update (6) for GB algorithms in the context of bi-level HPO for the penalty hyperparameter, according to the pseudocode in Algorithm 1.
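A minimal sketch of such a stopped GB loop follows. It is a hedged illustration, not the paper's Algorithm 1: the response function is a toy quadratic, the hypergradient h_t is a finite-difference approximation, and the Ekeland-inspired criterion halts the update (6) once the decrease of J falls below ε and the step falls below √ε, the perturbation scale given by Corollary 1.

```python
import numpy as np

def J(lam):
    """Placeholder response function; quadratic, for illustration only."""
    return (lam - 1.0) ** 2

def hypergradient(lam, h=1e-5):
    """Finite-difference approximation h_t of dJ/dlambda."""
    return (J(lam + h) - J(lam - h)) / (2 * h)

lam, alpha, eps = 3.0, 0.1, 1e-6
for k in range(10_000):
    step = alpha * hypergradient(lam)       # update (6): lam <- lam - alpha*h_t
    lam_next = lam - step
    decrease = abs(J(lam) - J(lam_next))
    lam = lam_next
    # Ekeland-inspired stopping rule: the decrease has stalled below eps and
    # the perturbation needed to improve further is below sqrt(eps).
    if decrease <= eps and abs(step) <= np.sqrt(eps):
        break
```

The only tolerance the user must supply is the precision ε; the distance threshold √ε is dictated by the choice ρ = √ε in Corollary 1, which is precisely the reduction in stopping hyperparameters motivated above.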

Discussion and practical considerations
We want to emphasize that moving to infinite-dimensional Hilbert spaces is not a mere abstract pretension, but is also important in some application contexts. For example, when Support Vector Machines (SVMs) are taken into consideration, the well-known "kernel trick" permits interpreting a Gaussian kernel as an inner product in a feature space. This space is potentially infinite-dimensional, allowing us to read the SVM classifier function as a linear function in the feature space [18]. Another example is provided by the problem of the possible states of a quantum system, in which the state of a free particle can be described as a vector residing in a complex separable Hilbert space [19]. Indeed, the strength of this article lies in its theory. Both the existence theorem and the stopping criterion allow us to build an approach based on solid mathematical foundations, useful for future extensions and generalizations to other problems, too. For example, infinite-dimensional Covariance Descriptors (CovDs) for classification are a fertile application arena for the extensions developed here. This finds motivation in the fact that CovDs can be mapped to a Reproducing Kernel Hilbert Space (RKHS) via SPD-specific kernels [20].

Conclusions
In this paper, we studied the task of penalty HPO and provided a mathematical formulation, based on Hilbert spaces, to address this issue in an unsupervised context. Focusing on the bi-level formulation, we showed some relaxed theoretical results that weaken the hypotheses necessary for the existence of the solution.
Our approach differs from more standard techniques in reducing the reliance on random or black-box strategies, giving stronger mathematical generalization, suitable also when obtaining an exact minimizer is not possible. We also propose to use Ekeland's principle as a stopping criterion, which fits well in the context of GB methods.
The author C. S. was partially supported by the PRIN project "Qualitative and quantitative aspects of nonlinear PDEs" (2017JPCAPN 005) funded by the Ministero dell'Istruzione, dell'Università e della Ricerca.

• Conflict of interest: The authors have no relevant financial or non-financial interests to disclose.
• Data availability: Data sharing not applicable to this article as no datasets were generated or analysed during the current study.