1 Introduction

Learning models are widely and successfully applied in real-life applications. Their training often requires users to specify several variables, namely hyperparameters, which must be set before the learning procedure starts. Hyperparameters govern the whole learning process and play a crucial role in guaranteeing good model performance. They are often specified manually, and the lack of an automatic tuning procedure makes the field of hyperparameter optimization (HPO) an ever-evolving topic. The literature offers various solutions for hyperparameter tuning, from gradient-based to black-box or Bayesian approaches, besides some naive but routinely used methods such as Grid and Random search. A brief overview of existing methods can be found in Ref. [6]. Hyperparameters can be of different types (discrete, continuous, categorical), and in most cases, the number of configurations to explore is infinite. This paves the way for a mathematical formalization of HPO in learning contexts with abstract spaces, such as Hilbert spaces.

A learning algorithm may be represented as a map \({\mathcal {A}}\) that takes a configuration of hyperparameters, \(\lambda \in \Lambda \), and a dataset D, and returns a hypothesis \(h\in {\mathcal {H}}\):

$$\begin{aligned} {\mathcal {A}}: \Lambda \times D \rightarrow {\mathcal {H}}; \qquad {\mathcal {A}}(\lambda , D) = h, \end{aligned}$$
(1.1)

where \(\Lambda \) is a hyperparameter space and \({\mathcal {H}}\) is a hypothesis space [11]. A quite standard requirement on the hypothesis set is that it be a linear function space endowed with a suitable norm (a more binding one arising from an inner product): two requirements satisfied when \({\mathcal {H}}\) is a Hilbert space of functions over the input space \({\mathcal {X}}\). Assuming a Hilbert space structure on the hypothesis space has some advantages: (i) practical computations reduce to ordinary linear algebra operations and (ii) self-duality, that is, for any \(x\in {\mathcal {X}}\) a representative of x can be found, i.e., \({\mathcal {k}}_x\in {\mathcal {H}}\) exists such that

$$\begin{aligned} h(x) = \langle {\mathcal {k}}_x, h\rangle \quad \text{ for } \text{ all } h\in {\mathcal {H}}, \end{aligned}$$
(1.2)

where \({\mathcal {k}}_x\) is a suitable positive definite “kernel”. This construction provides a bridge between the abstract structure of \({\mathcal {H}}\) and what its elements actually are, allowing the hypothesis set to be built starting from the kernel. Given a suitable positive definite function k on \({\mathcal {X}}\), \({\mathcal {H}}\) can be defined as the minimal complete space of functions containing all \(\{k_x\}_{x\in {\mathcal {X}}}\), equipped with the scalar product in (1.2). Thus, \({\mathcal {H}}\) is determined in a unique way, and it is called the Reproducing Kernel Hilbert Space associated with the kernel k.

Starting from this abstract scenario, HPO can be formulated as the problem of optimizing a measure of the quality of the solution returned by the algorithm \({\mathcal {A}}\), which, implicitly or explicitly, depends on the hyperparameter \(\lambda \). In particular, in supervised learning contexts, the optimal hyperparameter \(\lambda ^*\) is commonly found in the literature as the solution of the following optimization task:

$$\begin{aligned} \lambda ^*=\mathop {\textrm{argmin}}_{\lambda \in \Lambda } {\mathscr {V}}({{\mathcal {A}}}({\lambda }, D_{tr}), D_{val}), \end{aligned}$$
(1.3)

where \({\mathscr {V}}: {\mathcal {H}} \times {\mathcal {X}} \rightarrow \mathbb {R}\) evaluates the goodness of \({\mathcal {A}}\) by measuring the discrepancy between the hypothesis learned by \({\mathcal {A}}\) on a given training dataset \(D_{tr}\) and a validation dataset \(D_{val}\) [10].
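As a concrete illustration of the selection rule (1.3), the following minimal sketch picks \(\lambda ^*\) over a finite grid of candidates; the ridge-regression learner, the squared-error validation metric, and all names are illustrative assumptions rather than prescriptions of this paper.

```python
# Minimal sketch of (1.3): lambda* = argmin_lambda V(A(lambda, D_tr), D_val).
# The ridge learner A and the squared-error metric V are illustrative assumptions.
import jax.numpy as jnp

def train(lam, X_tr, y_tr):
    # A(lambda, D_tr): ridge regression in closed form
    d = X_tr.shape[1]
    return jnp.linalg.solve(X_tr.T @ X_tr + lam * jnp.eye(d), X_tr.T @ y_tr)

def val_error(w, X_val, y_val):
    # V(h, D_val): mean squared discrepancy on the validation set
    return jnp.mean((X_val @ w - y_val) ** 2)

def select_lambda(candidates, X_tr, y_tr, X_val, y_val):
    errors = jnp.array([val_error(train(lam, X_tr, y_tr), X_val, y_val)
                        for lam in candidates])
    return candidates[int(jnp.argmin(errors))]
```

A grid or random search of this kind treats \({\mathcal {A}}\) as a black box; the gradient-based strategies recalled in Sect. 2 instead exploit the structure of the problem.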

In this study, we focus on HPO in an unsupervised context, using a bi-level programming formalization. Bi-level approaches solve an outer optimization problem subject to the optimality of an inner optimization problem [1, 3, 5, 11]. In particular, we consider as a hyperparameter the penalty coefficient in penalized optimization problems. It is important to note that penalization functions are essential tools in optimization and learning problems. They are used to introduce a bias towards simpler or more general solutions. In particular, they can help to prevent overfitting, to enforce feature selection while controlling sparsity, to stabilize the solution and prevent noise amplification in regularized inverse problems, to deal with multicollinearity in regression models, or to improve visualization tasks through orthogonality constraints. We have already treated this aspect and solved the problem in the specific case of the Nonnegative Matrix Factorization task [7]. However, some generalizations are needed to overcome theoretical restrictions and to make the strategy broadly applicable across other learning approaches. In particular, this work extends the existence and uniqueness theorems for the solution of the hyperparameter bi-level problem to the more general framework of infinite-dimensional Hilbert spaces. This framework also allows the application of Ekeland’s variational principle, which states that whenever a functional is not guaranteed to attain a minimum, under suitable assumptions a “good” substitute can be found, namely the best one can get as an approximate minimum. One of the purposes of this paper is to use this theoretical tool as a stopping criterion for the update of the hyperparameters, as we will see later.

The outline of the paper is as follows. Section 2 introduces the classical bi-level formalization of HPO and some preliminary notions in a supervised context. Section 3 illustrates our proposal, an extension to the unsupervised context. A general framework addressing HPO in Hilbert spaces is then set up, and some general abstract tools are stated in Sect. 4. Finally, Sect. 5 summarizes the obtained results and draws some conclusions and directions for future work.

2 Previous Works and Preliminaries

As briefly mentioned in the introduction, in a supervised learning scenario, HPO can be addressed through a bi-level formulation. This approach looks for the hyperparameters \(\lambda \) such that the minimization of the regularized training error leads to the best performance of the trained data-driven model on a validation set. According to the ideas introduced in [12, 20], the best hyperparameters for a data learning task can be selected as the solution to the following bi-level problem:

$$\begin{aligned} \min _{\lambda \in \Lambda }{J(\lambda )}\qquad J(\lambda ) = \inf \{{\mathcal {E}}(w_{\lambda }, \lambda ): w_{\lambda } \in \mathop {\textrm{argmin}}\limits _{u \in \mathbb {R}^{r}}{\mathcal {L}}_{\lambda }(u)\}, \end{aligned}$$
(2.1)

where \(w \in \mathbb {R} ^r\) denotes the vector of r parameters, \(J: \Lambda \rightarrow \mathbb {R} \) is the so-called Response Function of the outer problem with objective function \({\mathcal {E}}:\mathbb {R} ^r \times \Lambda \rightarrow \mathbb {R}\), and, for every \(\lambda \in \Lambda \subset \mathbb {R}^p\), \({\mathcal {L}}_\lambda :\mathbb {R} ^r \rightarrow \mathbb {R} \) is the inner objective function.

One way to solve the reformulation of HPO as a bi-level optimization problem is the adoption of gradient-based (GB) methods. In GB methods, HPO is addressed with a classical procedure for continuous optimization, in which a sequence of hyperparameter iterates is generated by the following rule:

$$\begin{aligned} \lambda _{t+1} = \lambda _t - \alpha {\textbf{h}}_t(\lambda _t), \end{aligned}$$
(2.2)

where \({\textbf{h}}_t\) is an approximation of the gradient of the function J and \(\alpha \) is a step size, chosen so that the sequence converges to the optimal hyperparameter. In this context, it is known that the main challenge is the computation of \({\textbf{h}}_t\), called the hypergradient. In several cases, this numerical approximation can be calculated for real-valued hyperparameters with iterative algorithms. There are two main strategies for computing the hypergradient: iterative differentiation [12, 13, 17] and implicit differentiation [16, 18]. The former computes the exact gradient of an approximate objective, defined through the recursive application of an optimization dynamics that aims to replace and approximate the learning algorithm \({\mathcal {A}}\). The latter applies the implicit function theorem numerically to the solution mapping \({\mathcal {A}}\) when it can be expressed through an appropriate equation [11].

In this study, we follow the iterative strategy, so that the problem in (2.1) can be addressed through a dynamical-system-type approach.
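To make the iterative strategy concrete, the following minimal sketch obtains the hypergradient by differentiating through an unrolled inner optimization dynamics on a toy one-dimensional problem whose exact solution is known; all functions, constants, and step sizes are illustrative assumptions.

```python
# Minimal sketch of iterative differentiation: the hypergradient h_t is the derivative
# of an approximate response function obtained by unrolling T inner gradient steps.
# The toy inner/outer objectives below are illustrative assumptions.
import jax
import jax.numpy as jnp

def inner_loss(w, lam):
    # L_lambda(w) = (w - 1)^2 + lambda * w^2, minimized at w_lambda = 1 / (1 + lambda)
    return (w - 1.0) ** 2 + lam * w ** 2

def outer_loss(w):
    # E(w): a simple validation-type error, independent of lambda
    return (w - 0.5) ** 2

def approx_response(lam, T=200, lr=0.1):
    # Approximate J(lambda) by unrolling T gradient steps of the inner problem
    w = 0.0
    grad_w = jax.grad(inner_loss)            # gradient with respect to w
    for _ in range(T):
        w = w - lr * grad_w(w, lam)
    return outer_loss(w)

hypergrad = jax.grad(approx_response)        # h_t: derivative of the unrolled objective

# Gradient-based update (2.2) on the hyperparameter.
lam, alpha = 3.0, 5.0
for t in range(50):
    lam = lam - alpha * hypergrad(lam)
# Here J(lambda) = (1/(1+lambda) - 0.5)^2 is minimized at lambda = 1, so the iterates
# should approach that value.
```

Implicit differentiation would instead apply the implicit function theorem to the first-order optimality condition of the inner problem; the unrolled approach above only requires the inner dynamics to be differentiable.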

If the following hypotheses hold:

Hypothesis 1

  1. the set \(\Lambda \) is a compact subset of \(\mathbb {R}\);

  2. the Error Function \({\mathcal {E}}: \mathbb {R}^{r} \times \Lambda \rightarrow \mathbb {R}\) is jointly continuous;

  3. the map \((w, \lambda ) \rightarrow {\mathcal {L}}_{\lambda }(w)\) is jointly continuous, and the problem \(\mathop {\textrm{argmin}}{\mathcal {L}}_{\lambda }\) is a singleton for every \(\lambda \in \Lambda \);

  4. \(w_{\lambda } = \mathop {\textrm{argmin}}\limits _{u\in \mathbb {R}^r} {\mathcal {L}}_{\lambda }(u)\) remains bounded as \(\lambda \) varies in \(\Lambda \);

the problem in (2.1) becomes:

$$\begin{aligned} \min \limits _{\lambda \in \Lambda } J(\lambda ) = {\mathcal {E}}(w_{\lambda ^*}, \lambda ^*), \quad w_{\lambda } = \mathop {\textrm{argmin}}\limits _{u\in \mathbb {R}^r} {\mathcal {L}}_{\lambda }(u). \end{aligned}$$
(2.3)

It can be proved that the optimal solution \((w_{\lambda ^*},\lambda ^*)\) of problem (2.3) exists [13].

Considering the optimization problem in which the hyperparameter is the penalty coefficient \(\lambda \in \mathbb {R}_+\), the Inner Problem is associated with the penalized empirical error, represented by \({\mathcal {L}}\) and defined as

$$\begin{aligned} {\mathcal {L}}_{\lambda } (w) = \sum \limits _{(x,y) \in D_{tr}} \ell (g_w (x),y) + \lambda {\mathcal {r}}(w), \end{aligned}$$
(2.4)

where \(\ell \) is a loss function, \(g_{w}: {\mathcal {X}} \rightarrow {\mathcal {Y}}\) is a parameterized model from the input to the output space, \(D_{tr}\subset {{\mathcal {X}}\times {\mathcal {Y}}}\) is the training set, and \({\mathcal {r}}: \mathbb {R}^r \rightarrow \mathbb {R}\) is a penalty function. The Outer Problem, instead, is related to the generalization error of \(g_w\), represented by \({\mathcal {E}}\):

$$\begin{aligned} {\mathcal {E}}(w,\lambda ) = \sum \limits _{(x,y) \in D_{val}} \ell (g_w (x),y), \end{aligned}$$
(2.5)

where \(D_{val}\subset {{\mathcal {X}}\times {\mathcal {Y}}}\) is the validation set. Note that \({\mathcal {E}}\) does not explicitly depend on \(\lambda \).
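For concreteness, the following minimal sketch writes down the inner objective (2.4) and the outer objective (2.5); the linear model, squared loss, and squared-norm penalty are illustrative assumptions, not the paper's prescriptions.

```python
# Sketch of the supervised bi-level objectives (2.4)-(2.5) for a linear model
# g_w(x) = <w, x>, squared loss and penalty r(w) = ||w||^2 (illustrative assumptions).
import jax.numpy as jnp

def inner_objective(w, lam, D_tr):
    # L_lambda(w) = sum_{(x,y) in D_tr} l(g_w(x), y) + lambda * r(w)
    X, y = D_tr                      # rows of X are the inputs x
    return jnp.sum((X @ w - y) ** 2) + lam * jnp.sum(w ** 2)

def outer_objective(w, D_val):
    # E(w, lambda) = sum_{(x,y) in D_val} l(g_w(x), y); no explicit dependence on lambda
    X, y = D_val
    return jnp.sum((X @ w - y) ** 2)
```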

This work allows us to extend these results to the unsupervised context, overcoming some assumptions of Hypothesis 1 (such as compactness) that are difficult to satisfy in real data learning contexts, and to exploit theoretical results such as Ekeland’s variational principle, stated in the following section.

3 Our Proposal

The bi-level HPO framework can be modified to include unsupervised learning paradigms, which are generally designed to detect some useful latent structure embedded in the data. Tuning hyperparameters for unsupervised learning models is more complex than in the supervised case due to the lack of an output space, which would define the ground truth collected in the validation set.

This section describes a general framework to address HPO in Hilbert spaces for the unsupervised case and a corollary of Ekeland’s variational principle used to derive a useful stopping criterion for iterative algorithms solving the HPO problem.

Let \(X \in \mathbb {R} ^{n \times m}\) be a data matrix and consider again problem (2.1), where now \(J: \Lambda \rightarrow \mathbb {R} \) is a suitable functional and \(\Lambda \) is a Hilbert space equipped with the scalar product \((\cdot , \cdot )\). Under these assumptions, the outer problem is defined by the following function:

$$\begin{aligned} {\mathcal {E}}:\mathbb {R} ^r \times \Lambda \rightarrow \mathbb {R} \qquad {\mathcal {E}}(w,\lambda ) = \sum \limits _{x \in X} \ell (g_w (x)), \end{aligned}$$
(3.1)

and, for every \(\lambda \in \Lambda \), the related inner problem is

$$\begin{aligned} {\mathcal {L}}_{\lambda }:\mathbb {R} ^r \rightarrow \mathbb {R} \qquad {\mathcal {L}}_{\lambda } (w) = \sum \limits _{x \in X} \ell (g_w (x)) + {\mathcal {R}}(\lambda , w), \end{aligned}$$
(3.2)

where \({\mathcal {R}}: \Lambda \times \mathbb {R}^r \rightarrow \mathbb {R}\) is a penalty function. We want to emphasize that, in this new formulation, all optimization is performed on the data matrix X, and the penalty hyperparameter is included as a variable of the penalty function \({\mathcal {R}}\). In this way, the process of optimizing the hyperparameters is integrated directly into the broader optimization problem. This integration may streamline the optimization process and improve the overall efficiency of finding the best hyperparameters for the given problem. Furthermore, by partitioning the reference matrix X, it becomes possible to penalize each partition of the matrix with a different magnitude, potentially leading to better model performance or more refined results.
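To ground the unsupervised formulation, the following minimal sketch instantiates (3.1)–(3.2) with a linear reconstruction model and a Frobenius-norm penalty; both choices, and all names, are illustrative assumptions rather than the paper's prescription.

```python
# Minimal sketch of the unsupervised objectives (3.1)-(3.2), assuming a linear
# reconstruction model g_W(x) = W^T W x (rows of X are the samples) and
# R(lambda, W) = lambda * ||W||_F^2; these choices are illustrative assumptions.
import jax.numpy as jnp

def recon_error(W, X):
    # sum_{x in X} l(g_W(x)), with l taken as the squared reconstruction error
    return jnp.sum((X - X @ W.T @ W) ** 2)

def penalty(lam, W):
    # R(lambda, W): the hyperparameter enters the penalty as an ordinary variable
    return lam * jnp.sum(W ** 2)

def inner_objective(W, lam, X):      # L_lambda(w) in (3.2)
    return recon_error(W, X) + penalty(lam, W)

def outer_objective(W, X):           # E(w, lambda) in (3.1)
    return recon_error(W, X)
```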

The bi-level problem associated with (3.1)–(3.2) can be solved with a dynamical system approach in which a numerical approximation of the hypergradient is computed. Once the hypergradient is available, a GB approach can be used to find the optimum \(\lambda ^*\).

Ekeland’s variational principle can be used to construct an appropriate stopping criterion for iterative algorithms, with the aim of justifying and setting the hyperparameters related to the stopping criterion more appropriately.

Theorem 3.1

(Ekeland’s variational principle) [9] Let \((\Lambda , d)\) be a complete metric space and \(J:\Lambda \rightarrow \bar{\mathbb {R} }\) be a lower semi-continuous function which is bounded from below. Suppose that \(\varepsilon >0\) and \({\tilde{\lambda }}\in \Lambda \) exist such that

$$\begin{aligned} J({\tilde{\lambda }})\le \inf _\Lambda J +\varepsilon . \end{aligned}$$

Then, given any \(\rho >0\), \(\lambda _{\rho }\in \Lambda \) exists such that

$$\begin{aligned} J(\lambda _{\rho })\le J({\tilde{\lambda }}),\qquad d(\lambda _{\rho }, {\tilde{\lambda }})\le \frac{\varepsilon }{\rho }, \end{aligned}$$

and

$$\begin{aligned} J(\lambda _{\rho })<J(\lambda )+\rho \, d(\lambda _{\rho }, \lambda ) \qquad \forall \; \lambda \ne \lambda _{\rho }. \end{aligned}$$

Roughly speaking, this variational principle asserts that, under assumptions of lower semi-continuity and boundedness from below, if a point \({\tilde{\lambda }}\) is an “almost minimum point” for a function J, then a small perturbation of J exists which attains its minimum at a point “near” \({\tilde{\lambda }}\). It is important to note that a variation of Theorem 3.1 can be used to reduce the number of user-dependent factors in the stopping criterion. In particular, a fruitful selection of \(\rho \) (namely \(\rho =\sqrt{\varepsilon }\)) restricts the number of hyperparameters to the precision error only, allowing us to use the following corollary.

Corollary 3.2

Let \((\Lambda , d)\) be a complete metric space and \(J: \Lambda \rightarrow \bar{\mathbb {R} }\) be a lower semi-continuous function which is bounded from below. Suppose that \(\varepsilon >0\) and \({\tilde{\lambda }}\in \Lambda \) exist such that

$$\begin{aligned} J({\tilde{\lambda }})\le \inf _{\Lambda } J +\varepsilon . \end{aligned}$$

Then, \({\tilde{z}}\in \Lambda \) exists such that

$$\begin{aligned} J({\tilde{z}})\le J({\tilde{\lambda }}),\qquad d({\tilde{z}}, {\tilde{\lambda }})\le \sqrt{\varepsilon } \end{aligned}$$

and

$$\begin{aligned} J({\tilde{z}})<J(\lambda )+\sqrt{\varepsilon }\, d({\tilde{z}}, \lambda ) \quad \forall \; \lambda \ne {\tilde{z}}. \end{aligned}$$

4 Main Abstract Results

In this section, we weaken the assumptions discussed earlier and provide results related to the use of Ekeland’s principle as a stopping criterion. We first mention an abstract result on the existence of a minimizer in Hilbert spaces, which has great importance and a wide range of applications in several fields. Just one example is represented by Riesz’s Representation Theorem, which, even if implicitly, makes use of the existence of a minimizer [4]. This is a widely relevant feature of Hilbert spaces, which makes them nicer than Banach spaces or other topological vector spaces.

4.1 Abstract Existence Theorem

It is well known that every bounded sequence in a normed space \(\Lambda \) has a norm convergent subsequence if and only if \(\Lambda \) is finite-dimensional.

Thus, given a normed space \(\Lambda \), since the strong topology (i.e., the one induced by the norm) is too strong to provide any widely applicable subsequential extraction procedure, one can consider other, weaker topologies compatible with the linear structure of the space and look for subsequential extraction processes therein.

In Banach spaces, as well as in Hilbert spaces, the two most relevant weaker-than-norm topologies are the weak-star topology and the weak topology. While the former is defined on dual spaces, the latter can be set up in every normed space. The notions related to these topologies are not self-contained but play a leading role in many aspects of Banach space theory. In this regard, we state here some results we will use shortly. The next one is straightforward (see, e.g., [4, Chapter 3]).

Proposition 4.1

If \(\Lambda \) is a finite-dimensional space, the strong and weak topologies coincide. In particular, it follows that the weak topology is normable, and then clearly metrizable, too.

If \(\Lambda \) is an infinite-dimensional space, the weak topology is strictly coarser than the strong topology, namely open sets for the strong topology exist which are not open for the weak topology. Furthermore, the weak topology turns out to be non-metrizable in this case.

Definition 4.2

A functional \(J:\Lambda \rightarrow \bar{\mathbb {R} }\), with \(\Lambda \) a topological space, is said to be lower semi-continuous on \(\Lambda \) if, for each \(a\in \mathbb {R} \), the sublevel sets

$$\begin{aligned} J^{-1}(\,]-\infty , a]\,) =\{\lambda \in \Lambda : J(\lambda )\le a\} \end{aligned}$$

are closed subsets of \(\Lambda \).

In the following, we introduce a “generalized Weierstrass Theorem”, which gives a criterion for the existence of a minimum of a functional defined on a Hilbert space. For this reason, the upcoming results will be provided in the abstract framework of a Hilbert space although, in some cases, they apply in the more general context of Banach spaces. Thus, throughout the remaining part of this section, we denote by \(\Lambda \) any real infinite-dimensional Hilbert space.

In an infinite-dimensional setting, the following definitions are strictly related to the different notions of weak and strong topology.

Definition 4.3

A functional \(J:\Lambda \rightarrow \bar{\mathbb {R} }\) is said to be strongly (weakly, respectively) lower semi-continuous if J is lower semi-continuous when \(\Lambda \) is equipped with the strong (weak, respectively) topology.

Definition 4.4

A functional \(J:\Lambda \rightarrow \bar{\mathbb {R} }\) is said to be strongly (weakly, respectively) sequentially lower semi-continuous if

$$\begin{aligned} \liminf _{n\rightarrow +\infty } J(\lambda _n)\ge J(\lambda ) \end{aligned}$$

for any sequence \((\lambda _n)_n\subset \Lambda \) such that \(\lambda _n\rightarrow \lambda \) (\(\lambda _n\rightharpoonup \lambda \), respectively).

We proceed by providing some useful results.

Proposition 4.5

The following statements are equivalent:

  (i) \(J:\Lambda \rightarrow \mathbb {R} \) is a sequentially weakly lower semi-continuous functional;

  (ii) the epigraph of J is weakly sequentially closed, where, by definition,

    $$\begin{aligned} \textrm{epi}(J) = \{(\lambda , t)\in \textrm{dom}(J)\times \mathbb {R}: J(\lambda )\le t\}. \end{aligned}$$

Remark 4.6

As a further consequence of the preliminary Proposition 4.1, we have that sequential weak lower semi-continuity and weak lower semi-continuity do not coincide if \(\Lambda \) is infinite-dimensional, since the weak topology is not metrizable. However, the weaker concept of sequential weak lower semi-continuity meets our needs. For the proof of the next result, we refer the interested reader to [2, Theorem 3.32].

Proposition 4.7

Let \({\mathcal {C}}\subseteq \Lambda \) be a closed and convex subset. Then, \({\mathcal {C}}\) is weakly sequentially closed, too.

Since a sequentially weakly closed set is also strongly closed, it follows that a sequentially weakly lower semi-continuous functional is also (strongly) lower semi-continuous. The converse, instead, holds under an additional assumption. In particular, Proposition 4.7 allows us to infer the following result.

Proposition 4.8

If \(J:\Lambda \rightarrow \mathbb {R} \) is a strongly lower semi-continuous convex functional, then J is weakly sequentially lower semi-continuous, too.

Proof

Since J is lower semi-continuous, \(\textrm{epi}(J)\) is closed. On the other hand, since J is convex, so is \(\textrm{epi}(J)\); hence Proposition 4.7 ensures that \(\textrm{epi}(J)\) is weakly sequentially closed, i.e., J is weakly sequentially lower semi-continuous. \(\square \)

We are now able to state the main result of this section.

Theorem 4.9

Let \({\mathcal {C}}\subset \Lambda \) be a non-empty, closed, bounded, and convex subset. Let \(J:\Lambda \rightarrow \mathbb {R} \) be a lower semi-continuous and convex functional. Then J achieves its minimum in \({\mathcal {C}}\), i.e., \({\bar{\lambda }}\in {\mathcal {C}}\) exists such that \(J({\bar{\lambda }}) =\displaystyle \inf _{\lambda \in {\mathcal {C}}} J(\lambda )\).

Proof

Let \(m:=\displaystyle \inf _{\lambda \in {\mathcal {C}}} J(\lambda )\); hence, a sequence \((\lambda _n)_n\subset {\mathcal {C}}\) exists such that

$$\begin{aligned} J(\lambda _n)\rightarrow m \quad \text{ as } n\rightarrow +\infty . \end{aligned}$$
(4.1)

Now, the boundedness assumption on \({\mathcal {C}}\) implies that, up to subsequences, \({\bar{\lambda }}\in \Lambda \) exists such that \(\lambda _n\rightharpoonup {\bar{\lambda }}\) as \(n\rightarrow +\infty \). Moreover, since \({\mathcal {C}}\) is a closed and convex subset of \(\Lambda \), Proposition 4.7 applies and guarantees that \({\bar{\lambda }}\in {\mathcal {C}}\).

Finally, from (4.1), Proposition 4.8 and Definition 4.4 we infer that \(J({\bar{\lambda }})\le m\), which gives the desired result. \(\square \)

Remark 4.10

We observe that Theorem 4.9 still holds if the subset \({\mathcal {C}}\) is not bounded, as long as we require an additional assumption on the functional J. In fact, requiring J to be coercive, i.e., \(J(\lambda )\rightarrow +\infty \) as \(\Vert \lambda \Vert \rightarrow +\infty \) (and assuming that at least one \({\bar{\lambda }}\in {\mathcal {C}}\) exists such that \(J({\bar{\lambda }})<+\infty \)), any minimizer of J on \({\mathcal {C}}\) necessarily lies in some closed ball of radius \(r>0\). Indeed, since \(J({\bar{\lambda }})<+\infty \), any minimizer \(\lambda \) of J must satisfy \(J(\lambda )\le J({\bar{\lambda }})\); furthermore, since J is coercive, a sufficiently large radius \(r>0\) exists such that \(J(\lambda )>J({\bar{\lambda }})\) for all \(\lambda \in {\mathcal {C}}\) with \(\Vert \lambda \Vert > r\). Thus, any minimizer, if it exists, lies in the ball \(\{\lambda \in {\mathcal {C}}: \Vert \lambda \Vert \le r \}\).

In particular, Theorem 4.9 applies to the intersection between \({\mathcal {C}}\) and a closed ball of suitable radius, since this intersection turns out to be closed, bounded, and convex whenever \({\mathcal {C}}\) is closed and convex.

Namely, the following result holds.

Corollary 4.11

Let \({\mathcal {C}}\subset \Lambda \) be a non-empty, closed, and convex subset. Let \(J:\Lambda \rightarrow \mathbb {R} \) be a lower semi-continuous, convex, and coercive functional. Then J achieves its minimum in \({\mathcal {C}}\), i.e., \({\bar{\lambda }}\in {\mathcal {C}}\) exists such that \(J({\bar{\lambda }}) =\displaystyle \inf _{\lambda \in {\mathcal {C}}} J(\lambda )\).
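As a simple illustration of Corollary 4.11 (a standard example, not taken from the paper), let \(\varphi \) be a bounded linear functional on \(\Lambda \) and consider

$$\begin{aligned} J(\lambda ) = \tfrac{1}{2}\Vert \lambda \Vert ^2 - \varphi (\lambda ), \qquad \lambda \in \Lambda . \end{aligned}$$

J is continuous (hence lower semi-continuous) and convex, and it is coercive since \(J(\lambda )\ge \tfrac{1}{2}\Vert \lambda \Vert ^2 - \Vert \varphi \Vert \, \Vert \lambda \Vert \rightarrow +\infty \) as \(\Vert \lambda \Vert \rightarrow +\infty \). Hence, Corollary 4.11 guarantees a minimizer on every non-empty, closed, and convex \({\mathcal {C}}\subset \Lambda \). For \({\mathcal {C}}=\Lambda \), the optimality condition reads \(({\bar{\lambda }}, \mu ) = \varphi (\mu )\) for all \(\mu \in \Lambda \), i.e., \({\bar{\lambda }}\) is exactly the representative whose existence is asserted by Riesz’s Representation Theorem mentioned at the beginning of this section.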

Now we introduce a couple of results that are a direct consequence of Ekeland’s variational principle. For the sake of completeness, here we provide them with all the details (see [8] for the original statements).

Let \(\Lambda \) be a complete metric space and \(J: \Lambda \rightarrow \mathbb {R} \) be a lower semi-continuous response function on \(\Lambda \). Suppose that a point \(\lambda \in \Lambda \) exists such that \(J(\lambda )<+\infty \). Then, the following results hold.

Theorem 4.12

(Perturbation Result) Let \(J_{\lambda }: \Lambda \rightarrow \bar{\mathbb {R} }\) be a lower semi-continuous differentiable function such that the inequality

$$\begin{aligned} \left| J_{\lambda } (\gamma ) - J(\gamma )\right| \le \zeta (d(\gamma ,\lambda )) \quad \text {holds} \quad \forall \gamma \in \Lambda , \end{aligned}$$
(4.2)

where \(J_{\lambda }(\cdot )\) denotes a model function, \(\zeta \) is a growth function, and \(\lambda ^+\) is a minimizer of \(J_{\lambda }\). If \(\lambda ^+\) coincides with \(\lambda \), then \(\left| \nabla J(\lambda )\right| =0\). On the other hand, if \(\lambda \) and \(\lambda ^+\) are distinct, then a point \({\hat{\lambda }} \in \Lambda \) exists which satisfies

  1. \(d(\lambda ^+,{\hat{\lambda }}) \le 2 \cdot \frac{\zeta (d(\lambda ^+, \lambda ))}{\zeta '(d(\lambda ^+, \lambda ))} \quad \) (point proximity);

  2. \(J({\hat{\lambda }}) \le J(\lambda ^+) + \zeta (d(\lambda ^+, \lambda )) \quad \) (value proximity).

Proof

By Taylor’s theorem, it is simple to verify that \(\left| \nabla J_{\lambda }\right| (\lambda ) = \left| \nabla J\right| (\lambda )\). Now, if \(\lambda ^+ = \lambda \), then \(\lambda \) is a minimizer of \(J_{\lambda }\), and hence \(\left| \nabla J(\lambda )\right| =0\). On the other hand, if \(\lambda ^+ \ne \lambda \), from inequality (4.2) and the definition of \(\lambda ^+\), it follows that

$$\begin{aligned} J(\gamma ) \ge J_{\lambda }(\lambda ^+) - \zeta (d(\gamma , \lambda )). \end{aligned}$$

Let us define the new function

$$\begin{aligned} G(\gamma ):= J(\gamma ) + \zeta (d(\gamma , \lambda )). \end{aligned}$$

Thus, from assumption (4.2) and inequality \(\inf G \ge J_{\lambda }(\lambda ^+)\), we infer that

$$\begin{aligned} G(\lambda ^+) - \inf G \le J(\lambda ^+)- J_{\lambda }(\lambda ^+) + \zeta (d(\lambda ^+, \lambda )) \le 2 \zeta (d(\lambda ^+, \lambda )). \end{aligned}$$

Hence, Theorem 3.1 applies and, setting \(\varepsilon := 2 \zeta (d(\lambda ^+, \lambda ))\), for all \(\rho >0\) a point \(\lambda _{\rho }\) exists such that

$$\begin{aligned} G(\lambda _{\rho }) \le G(\lambda ^+) \quad \text{ and } \quad d(\lambda ^+, \lambda _{\rho }) \le \frac{\varepsilon }{\rho }. \end{aligned}$$

The desired result follows simply by taking \(\rho = \zeta '(d(\lambda ^+, \lambda ))\) and \({\hat{\lambda }}=\lambda _{\rho }\).

\(\square \)

An immediate consequence of Theorem 4.12 is the following subsequence convergence result.

Corollary 4.13

(Subsequence convergence to stationary points) Consider a sequence of points \(\lambda _k\) and closed functions \(J_{\lambda _k}: \Lambda \rightarrow \bar{\mathbb {R} }\) satisfying \(\lambda _{k+1} = \mathop {\textrm{argmin}}\limits _\gamma J_{\lambda _k} (\gamma )\) and \(d(\lambda _{k+1}, \lambda _k) \rightarrow 0\). Moreover, suppose that the inequality

$$\begin{aligned} \left| J_{\lambda _k}(\gamma ) - J(\gamma )\right| \le \zeta (d(\lambda _k,\gamma )) \quad \text {holds} \quad \forall k\in \mathbb {N}\quad \text {and} \quad \gamma \in \Lambda , \end{aligned}$$
(4.3)

where \(\zeta \) is a proper growth function. If \((\lambda ^*, J(\lambda ^*))\) is a limit point of the sequence \((\lambda _k, J(\lambda _k))\), then \(\lambda ^*\) is stationary for J.

Two interesting consequences for convergence analysis follow from this result. Suppose that the models are chosen in such a way that the step-sizes \(\Vert \lambda _{k+1} - \lambda _k\Vert \) tend to zero. This assumption is often enforced by ensuring that \(J(\lambda _{k+1})< J(\lambda _k)\) with a decrease of at least a multiple of \(\Vert \lambda _{k+1} - \lambda _k\Vert ^2\) (sufficient decrease condition). Then, assuming for simplicity that J is continuous on its domain, any limit point \(\lambda ^*\) of the iterate sequence \(\lambda _k\) will be stationary for the problem (Corollary 4.13).

Thus, by choosing an error tolerance \(\varepsilon \), we can stop the update (2.2) of GB algorithms in the context of bi-level HPO for the penalty hyperparameter, according to the pseudo-code described in Algorithm 1.

Algorithm 1: Pseudo-code
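The pseudo-code figure for Algorithm 1 is not reproduced here. As a purely illustrative substitute (not the paper's Algorithm 1), the following sketch iterates the update (2.2) and stops once the hyperparameter step and the corresponding decrease of J fall within the \(\sqrt{\varepsilon }\)-bounds suggested by Corollary 3.2; the specific tests, default values, and names are assumptions.

```python
# Hypothetical sketch of a GB loop for the penalty hyperparameter with an
# Ekeland-inspired stopping rule; it is NOT the paper's Algorithm 1.
import jax.numpy as jnp

def tune(lam0, hypergrad, J, alpha=1e-2, eps=1e-6, max_iter=500):
    lam = lam0
    for _ in range(max_iter):
        lam_new = lam - alpha * hypergrad(lam)      # update (2.2)
        step = jnp.abs(lam_new - lam)               # d(lambda_{t+1}, lambda_t)
        decrease = J(lam) - J(lam_new)
        # Stop when the iterate behaves like the "almost minimum point" of
        # Corollary 3.2: the step is within sqrt(eps) and the residual improvement
        # is bounded by sqrt(eps) times the distance moved.
        if step <= jnp.sqrt(eps) and decrease <= jnp.sqrt(eps) * step:
            return lam_new
        lam = lam_new
    return lam
```

Here, hypergrad and J stand for any numerical approximations of the hypergradient and of the response function, such as the unrolled ones sketched in Sect. 2.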

5 Conclusions

In this paper, we studied the task of penalty HPO and provided a mathematical formulation, based on Hilbert spaces, to address this issue in an unsupervised context. We want to emphasize that moving to infinite-dimensional Hilbert spaces is not a mere abstract exercise; such spaces are also widely used in supervised contexts. For example, when the Support Vector Machine (SVM) is taken into consideration, the well-known “kernel trick” permits the interpretation of a Gaussian kernel as an inner product in a feature space. This space is potentially infinite-dimensional, allowing us to read the SVM classifier as a linear function in the feature space [19]. Another example is provided by the problem of describing the possible states of a quantum system, in which the state of a free particle can be described as a vector in a complex separable Hilbert space [21].

In this work, we considered as hyperparameter the penalty coefficient of the constrained objective function and set up a bi-level strategy for its automatic tuning. The strength of this article lies in its theory. We presented relaxed theoretical results to weaken the hypotheses necessary for the existence of the solution, and we proposed a variant of Ekeland’s principle as a stopping criterion for GB methods. Our approach differs from the more standard techniques in reducing reliance on random or black-box strategies, providing a stronger mathematical generalization that is suitable also when it is not possible to obtain an exact minimizer. Both the existence theorem and the stopping criterion allow us to build an approach based on solid mathematical foundations, useful for future extensions and generalizations to other problems, too. For example, infinite-dimensional Covariance Descriptors (CovDs) for classification are a fertile application arena for the extensions developed here. This finds motivation in the fact that CovDs can be mapped to a Reproducing Kernel Hilbert Space (RKHS) via the use of SPD-specific kernels [14]. Also, the generalization of this approach to a particular constrained matrix factorization problem, defined via Bregman divergence on Hilbert spaces, is the subject of future work, with experiments evaluating the goodness of the novel stopping criterion [15].