Trilevel and Multilevel Optimization using Monotone Operator Theory

We consider a rather general class of multilevel optimization problems, in which a convex objective function is minimized subject to constraints of optimality of nested convex optimization problems. As a special case, we consider a trilevel optimization problem in which the objectives of the two lower layers consist of a sum of a smooth and a non-smooth term. Based on fixed-point theory and related arguments, we present a natural first-order algorithm and analyze its convergence and rates of convergence in several regimes of parameters.


Introduction
Hierarchical optimization problems, also known as multilevel optimization problems (MOPs), were first introduced by [4] and [27] as a class of constrained optimization problems wherein the feasible set is determined, implicitly, as the optima of multiple optimization problems nested in a predetermined sequence. In theory, MOPs have applications in game theory, robust optimization, chance-constrained programming, and adversarial machine learning. In practice, MOP models are widely used in security applications, where they model so-called interdiction problems; see [16-18, 20, 28] for several examples. The middle layers, with objectives φ_i(x) = f_i(x) + g_i(x) for i ∈ {1, ..., N}, exhibit the so-called composite structure: ω is a strongly convex differentiable function, the terms f_i are smooth, and the terms g_i are non-smooth. In machine-learning applications, the smooth functions f_i are chosen to be loss functions and the non-smooth functions g_i are regularizers. We begin by considering a hierarchical optimization problem with three layers, wherein the middle and lower layers exhibit the composite structure (2). By leveraging fixed-point theory and related reasoning, we propose a straightforward first-order algorithm and analyze its convergence and convergence rates across various parameter regimes. The algorithm exhibits the following non-asymptotic behaviour: the first layer exhibits a convergence rate of O(1/k), and corresponding rates for the middle and lower layers are derived below.

Related Work
Our work is inspired by a long history of work on bilevel optimization problems (see, e.g., [1, 12, 13, 31]). It extends proximal-gradient optimization algorithms [24] for a related bilevel optimization problem and is informed by [20].
Notably, [23] gave an explicit descent method for bilevel optimization of the form min ω(x) subject to x ∈ S := arg min_x {f(x) + i_F(x)}, in which i_F is the indicator function of F, f is a convex and smooth function, and ω is strongly convex. Subsequently, [24] proposed the so-called BiG-SAM method for solving the more general problem min ω(x) subject to x ∈ Y* := arg min_x {f(x) + g(x)}, in which f is smooth and g is convex, lower semi-continuous, and possibly non-smooth. We consider a similar structure in a multilevel problem.
There are only a few solution approaches in the literature for trilevel problems, addressing very restricted classes of problems, and mostly without guarantees of global optimality. For example, error-bound conditions were used by Senter and Dotson [26] to ensure strong convergence of Mann iterates. Typically, error bounds are essential for assessing the accuracy and reliability of numerical approximations or algorithms, providing a measure of how close the approximate solution is to the true solution under certain assumptions or conditions. These conditions may include properties of the problem, the algorithm used, the precision of numerical calculations, and any assumptions made during the approximation process. Very recently, Sato et al. [25] presented a gradient-based algorithm for multilevel optimization, where the lower-level problems are replaced by steepest-descent update equations. They present conditions under which this reformulation asymptotically converges to the original multilevel problem. To the best of our knowledge, no other solution approach can tackle the class of problems considered in this work.

Examples
We present several concrete examples of trilevel optimization problems.
1. Pursuit-evasion-intercept, alternatively described as pursuit-evade-defend (see, e.g., [15]), is a (sequential) game wherein one player seeks to follow and capture another in a dynamic setting, while a third is tasked with intercepting the pursuer.
2. Bilevel optimization with robust uncertainty. Robust optimization amounts to choosing the optimal outcome under the worst-case realization of a parameter. This can be expressed as a nested optimization problem, wherein the inner problem is a maximum over the parameter set [6]. Any classic bilevel optimization problem, for instance a Stackelberg game, becomes trilevel when the leader makes a decision under robust uncertainty.
3. Mixture models with training and validation: consider some convex loss function on data with a regularization (e.g., LASSO), wherein the validation data set (for instance, a coreset) is considered more significant and thus forms the inner problem, the training set forms the middle problem, and the tuning of mixture weights of different models is the uppermost layer.

Preliminaries
Let Ω ⊆ R n be closed and convex and let T be a mapping from R n into itself.
Recall that the variational inequality (VI) problem, denoted by VI(T, Ω), is to find a vector x* ∈ Ω such that ⟨T(x*), x − x*⟩ ≥ 0 for all x ∈ Ω. (5) Note that (5) is equivalent to finding a fixed point of the map x ↦ P_Ω(x − T(x)), where P_Ω is the metric projection of R^n onto Ω, i.e., it maps x ∈ R^n to the unique nearest point in Ω, P_Ω(x) := arg min_{y ∈ Ω} ‖x − y‖, where throughout the paper we use the Euclidean norm. We also use the notation Fix(T) = {x ∈ R^n : T(x) = x} for the set of fixed points of T. The set Fix(T) is closed and convex for non-expansive mappings T.
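As a quick numerical illustration of these definitions, the sketch below (with a hypothetical box constraint set Ω and an affine monotone operator T, both chosen purely for illustration) runs the fixed-point iteration x ← P_Ω(x − tT(x)) and checks the variational inequality at the limit:

```python
import numpy as np

def project_box(x, lo, hi):
    # Metric projection onto the box [lo, hi]^n (closed and convex).
    return np.clip(x, lo, hi)

# Illustrative monotone operator: gradient of a convex quadratic.
T = lambda x: 2.0 * (x - np.array([2.0, -3.0]))

# Fixed-point iteration x <- P_Omega(x - t T(x)) for VI(T, Omega).
x = np.zeros(2)
for _ in range(200):
    x = project_box(x - 0.1 * T(x), -1.0, 1.0)

# x* should satisfy <T(x*), y - x*> >= 0 for all y in Omega; check on a grid.
ys = [np.array([a, b]) for a in (-1.0, 0.0, 1.0) for b in (-1.0, 0.0, 1.0)]
assert all(T(x) @ (y - x) >= -1e-8 for y in ys)
print(x)  # converges to [1, -1] for this instance
```

Here the VI solution coincides with the box point closest to the unconstrained minimizer, as expected for a gradient operator.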
We now review some well-known facts about non-expansive mappings that we shall henceforth use in the paper without reference.
• Let T : R^n → R^n be a non-expansive mapping. Then I − T is monotone; that is, ⟨x − y, (I − T)x − (I − T)y⟩ ≥ 0.
• Let S : R^n → R^n be a contraction mapping with coefficient r ∈ (0, 1). Then I − S is (1 − r)-strongly monotone; that is, ⟨x − y, (I − S)x − (I − S)y⟩ ≥ (1 − r)‖x − y‖².
• Let x ∈ R^n and z ∈ Ω be given. Then z = P_Ω(x) if and only if ⟨x − z, z − y⟩ ≥ 0 for all y ∈ Ω, and also if and only if ‖x − y‖² ≥ ‖x − z‖² + ‖y − z‖² for all y ∈ Ω.
• For all x, y ∈ R^n one has ‖P_Ω(x) − P_Ω(y)‖² ≤ ⟨P_Ω(x) − P_Ω(y), x − y⟩.
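These facts are easy to sanity-check numerically; the snippet below (with a projection map and an affine contraction chosen purely for illustration) verifies the first two bullets on random points:

```python
import numpy as np

rng = np.random.default_rng(0)

# A non-expansive map: projection onto the Euclidean unit ball.
T = lambda x: x / max(1.0, np.linalg.norm(x))
# A contraction with coefficient r = 0.5 (an assumed toy choice).
r = 0.5
S = lambda x: r * x + 1.0

for _ in range(1000):
    x, y = rng.normal(size=3), rng.normal(size=3)
    # I - T is monotone.
    assert (x - y) @ ((x - T(x)) - (y - T(y))) >= -1e-12
    # I - S is (1 - r)-strongly monotone.
    gap = (x - y) @ ((x - S(x)) - (y - S(y)))
    assert gap >= (1 - r) * np.linalg.norm(x - y) ** 2 - 1e-12
```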

Convergence Analysis for Trilevel Optimization Problems
Let us consider the trilevel problem (2). After presenting our assumptions and preliminaries, we analyze the convergence of a proximal-gradient algorithm, first under a variety of conditions on the step sizes (Section 2.3, using lemmas from Section 2.2), and then under a certain error-bound condition (Section 2.4).

Assumptions and the Algorithm
We shall make the following standing assumptions: (i) f_i : R^n → R is convex and continuously differentiable with Lipschitz continuous gradient with constant L_{f_i}; (ii) g_i : R^n → (−∞, +∞] is proper, lower semi-continuous and convex; (iii) the optimal solution sets of the inner layers are non-empty, i.e., X* ≠ ∅ and Y* ≠ ∅.
(iv) ω : R n → R is strongly convex with strong convexity parameter µ.
(v) ω is continuously differentiable and ∇ω(·) is Lipschitz continuous with constant L_ω.

Figure 1 The three layers of a trilevel problem and the corresponding operators.
Consider three operators corresponding to the three layers of objective functions: S(x) := x − u∇ω(x), together with the proximal-gradient maps T and W of the middle and inner layers, respectively. Note that each of these corresponds to a fixed-point map of its respective problem. It can easily be seen that X* ∩ Fix(T) = Fix(T) ∩ Fix(W); however, in general, we are only interested in a (specific subset) of Fix(W) = Y*, and Fix(T) ∩ Fix(W) may be empty. It is well known that the mappings T and W are non-expansive and that S is an r-contraction for any u ∈ (0, 2/(L_ω + µ)] (for more details, see [22, Theorem 2.1.12, p. 66]).
For any proper, lower semi-continuous and convex function g : R^n → (−∞, +∞], the Moreau proximal mapping is defined by prox_g(x) = arg min_u { g(u) + (1/2)‖u − x‖² }. In general, a proximal-gradient algorithm [3] is based on the iterated mapping T_t(x) = prox_{tg}(x − t∇f(x)), which has the following properties: (i) T_t is non-expansive for sufficiently small t; (ii) its fixed points coincide with the minimizers of the corresponding composite minimization problem, i.e., Fix(T_t) = arg min_x {f(x) + g(x)}. We shall denote the proximal-gradient mapping by T in the Algorithm, instead of T_t, because of property (ii).
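For a concrete instance of the proximal-gradient map, consider an l1-regularized least-squares (LASSO-type) objective; the sketch below (with randomly generated problem data, chosen only for illustration) iterates T_t and checks that its fixed point satisfies the first-order optimality condition 0 ∈ ∇f(x) + ∂g(x):

```python
import numpy as np

def soft_threshold(x, tau):
    # prox of tau * ||.||_1, the Moreau proximal mapping of the l1 norm.
    return np.sign(x) * np.maximum(np.abs(x) - tau, 0.0)

def T_t(x, A, b, lam, t):
    # Proximal-gradient map for f(x) = 0.5*||Ax - b||^2, g(x) = lam*||x||_1.
    grad = A.T @ (A @ x - b)
    return soft_threshold(x - t * grad, t * lam)

rng = np.random.default_rng(1)
A = rng.normal(size=(20, 5)); b = rng.normal(size=20); lam = 0.5
t = 1.0 / np.linalg.norm(A, 2) ** 2     # step below 1/L_f, L_f = ||A||_2^2

x = np.zeros(5)
for _ in range(2000):
    x = T_t(x, A, b, lam, t)

# At a fixed point, the gradient of f lies in -lam * d||x||_1, so each
# coordinate of the gradient is bounded by lam in absolute value.
g = A.T @ (A @ x - b)
assert np.all(np.abs(g) <= lam + 1e-6)
```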
A solution x* of (2) satisfies the following inequalities. Note that x* ∈ X* is equivalent to ⟨x* − T(x*), x − x*⟩ ≥ 0 for all x ∈ Fix(W) = Y*, so the third condition can be modified to: ⟨x* − S(x*), x − x*⟩ ≥ 0 for all x satisfying this relation.
Throughout this section, we are concerned with Algorithm 1, which is based on the proximal-gradient maps and the following combination: x_{k+1} = α_k S(x_k) + (1 − α_k)[β_k T(x_k) + (1 − β_k) W(x_k)]. (12) Define the following quantity regarding the relative limiting behavior of the two parameters: δ := lim sup_{k→∞} β_k/α_k. This quantity plays a central role in analyzing the convergence of the proposed algorithms. For more details, see Example 4.
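To make the iteration concrete, the toy sketch below implements our reading of the combination (12), x_{k+1} = α_k S(x_k) + (1 − α_k)[β_k T(x_k) + (1 − β_k) W(x_k)], on a hypothetical two-dimensional trilevel instance; the maps, step sizes, and the instance itself are illustrative assumptions, not taken from the paper:

```python
import numpy as np

# Toy trilevel instance (hypothetical, for illustration only):
#   inner  layer: f2(x) = 0.5*(x1 - 1)^2      -> Y* = {x : x1 = 1}
#   middle layer: f1(x) = 0.5*x2^2 over Y*    -> X* = {(1, 0)}
#   top    layer: omega(x) = 0.5*||x - (5,5)||^2, minimized over X*.
S = lambda x: 0.5 * x + np.array([2.5, 2.5])   # contraction (r = 0.5)
T = lambda x: np.array([x[0], 0.5 * x[1]])     # prox-grad map, middle layer
W = lambda x: np.array([1.0, x[1]])            # prox-grad map, inner layer

x = np.zeros(2)
for k in range(1, 100001):
    a, b = 1.0 / k, 1.0 / np.sqrt(k)           # delta = lim sup b/a = infinity
    # Assumed combination of the three maps (our reading of iteration (12)):
    x = a * S(x) + (1 - a) * (b * T(x) + (1 - b) * W(x))
print(x)  # drifts toward the trilevel solution (1, 0)
```

With these diminishing step sizes the influence of S vanishes and the iterates are driven into Fix(W) while the middle-layer map T selects the point of Y* minimizing the middle objective.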
The following key assumptions will be needed throughout the paper. Assumption 3 There exists K > 0 such that lim sup_{k→∞}

Properties of the limit points
We now derive a set of results regarding the properties of limit points of the sequence generated by iteration (12). First, we present the following powerful lemma that we shall use in the analysis below:

Lemma 5 [30, Lemma 2.1] Let {a_k} be a sequence of non-negative real numbers such that a_{k+1} ≤ (1 − γ_k) a_k + γ_k δ_k, where {γ_k} is a sequence in (0, 1) and {δ_k} is a sequence in R such that (i) Σ_{k=1}^∞ γ_k = ∞, and (ii) lim sup_{k→∞} δ_k ≤ 0 or Σ_{k=1}^∞ |γ_k δ_k| < ∞. Then lim_{k→∞} a_k = 0.

Algorithm 1
Initialization: Select an arbitrary starting point x_0 ∈ R^n.
For k = 1, 2, ... do
    x_{k+1} = α_k S(x_k) + (1 − α_k)[β_k T(x_k) + (1 − β_k) W(x_k)]
End For
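A sequence lemma of this Xu type can be probed numerically: iterating the recursion a_{k+1} = (1 − γ_k) a_k + γ_k δ_k with equality, γ_k = 1/(k+1) (so that Σ γ_k = ∞) and δ_k → 0, the sequence is driven to zero. The particular γ_k, δ_k below are arbitrary illustrative choices:

```python
# Numerical illustration of the sequence lemma: if
#   a_{k+1} <= (1 - g_k) a_k + g_k d_k
# with g_k in (0, 1), sum g_k = infinity and lim sup d_k <= 0,
# then a_k -> 0.  We iterate the recursion with equality.
a = 10.0
for k in range(1, 200000):
    g = 1.0 / (k + 1)           # sum of g_k diverges
    d = 1.0 / (k + 1) ** 0.5    # d_k -> 0
    a = (1 - g) * a + g * d
print(a)  # decays toward 0
```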
From now on and throughout the paper, we denote by {x_k} the sequence generated by the algorithm (12). The convergence of the algorithm crucially depends on the starting point x_0 ∈ R^n and the parameters (step-sizes) α_k and β_k, which are chosen in advance. Three different cases can be distinguished: δ = 0, δ > 0 and δ = ∞, each associated with different assumptions. Initially, we seek conditions ensuring boundedness of the sequence of iterates {x_k}. Our proof techniques are similar to those that [24] used to prove their Lemma 2.
Throughout this paper, to simplify the notation, we will use w({x_k}) to denote the set of cluster points of the sequence {x_k}, i.e., w({x_k}) = {x ∈ R^n : x_{k_i} → x for some sub-sequence {x_{k_i}} of {x_k}}, and for every k ≥ 1 we define Q_k := β_k T + (1 − β_k) W. It is straightforward to see that Q_k, being a convex combination of non-expansive mappings, is non-expansive.
Lemma 6 Assume δ < ∞. Then {x_k} is bounded, i.e., for every x ∈ Fix(W) there exists a constant C_x such that ‖x_k − x‖ ≤ C_x, and constants C_S and C_T such that Moreover, for all x ∈ Fix(W) one has lim sup_k Proof Taking into account δ ∈ [0, +∞), one sees that there exist δ_0 > δ and k_0 ∈ N such that for every k ≥ k_0 one has β_k < δ_0 α_k. On the other hand, the sequence {x_{k+1}} can easily be rewritten as x_{k+1} = α_k S(x_k) + (1 − α_k) Q_k(x_k). Now, for a given x ∈ Fix(W) we obtain Here, r is the coefficient of the contraction map S. Therefore, {x_k} is bounded. Also, for a given x ∈ Fix(W), from (14) one can observe that which implies that lim sup by (15) we then have which shows the claimed bound.
The following simple example shows that the boundedness of {x k } does not necessarily hold when δ = ∞.
Clearly, S is a contraction, and T and W are non-expansive.
It is easy to check that δ = lim sup_k β_k/α_k = ∞ in this example. The next lemma will be useful in the sequel; its proof is similar to that of [19, Theorem 4.1]. Lemma 8 Suppose that {x_k} is bounded.
Proof Since {x_k} is bounded, there exists a constant M such that So we have Now, one can write Noticing Assumption 4 and setting one can apply Lemma 5, and the proof of part (a) is complete.
To prove part (b): dividing both sides of inequality (16) by β_k, we obtain Using Assumptions 2, 3, 4 and by reasoning similar to part (a), the assertion follows from Lemma 5.
To prove part (c): by the boundedness of {x_k} and α_k → 0, β_k → 0, it is clear that and we have x_{k+1} − W(x_k) → 0, which together with part (a) gives the conclusion of part (c).
By virtue of the prior lemma, we are able, in many situations, to obtain a unique solution of the multilevel variational inequality without additional conditions on the mappings S, T, W.

Convergence analysis under assumptions on step-sizes α k and β k
Next, we shall explore the convergence guarantees associated with the different cases of δ. We summarize these results, which depend on problem assumptions and parameter regimes, in Table 2. Let us consider the existence of a solution for the convex trilevel optimization problem (2). We will analyze this in multiple stages. It is worth noting that the convergence behavior towards X* is made complex by the interconnection among the three layers. On the whole, the non-expansive operators T and W do not increase the distance between any two points in the iteration for k large enough, and S contracts the distance between points in the sequence. As the step size α_k goes to zero, the influence of the contraction mapping S diminishes, and the sequence {x_k} becomes dominated by the non-expansive mappings T and W. The exact convergence behavior to a specific fixed point in X* will depend on additional properties of the set Fix(W) = Y* or its corresponding level function φ_1, such as the quadratic growth condition and bounded linear regularity. First, let us consider the consistent case, i.e., Fix(T) ∩ Fix(W) ≠ ∅, and subsequently further cases depending on the error-bound condition.
Let us now present the key technical lemma concerning the case δ = ∞, which relates to the convergence of the iteration. It establishes a connection between the set of cluster points of the sequence {x_k} and the solution set of the variational inequality VI(T, Fix(W)): Lemma 9 Assume δ = ∞, together with Assumptions 2, 3, and 4. Furthermore, suppose that {x_k} is bounded. Then every cluster point of {x_k} belongs to VI(T, Fix(W)). Proof Let y ∈ Fix(W) and x ∈ w({x_k}) be given. It was shown that (27) holds for every y ∈ Fix(W). Therefore, upon letting k → ∞ in the previous inequality and utilizing the assumption δ = ∞, the claim follows. Fact 10 If the interior of X* is non-empty, then X* = Fix(T) ∩ Fix(W), and {x_k} is bounded as well.
Proof First, we show X* = Fix(T) ∩ Fix(W). Let x_0 ∈ int X* and let x ∈ R^n be given. Hence, for sufficiently small t ∈ (0, 1), we have that x_0 + t(x − x_0) ∈ X* ⊂ Fix(W), which further implies and so φ_1(x_0) ≤ φ_1(x). This means that x_0 ∈ Fix(T). Therefore, int X* ⊆ Fix(T). On the other hand, since X* is closed and convex, we therefore have Consequently, as we already have X* ∩ Fix(T) = Fix(T) ∩ Fix(W), one can deduce that X* = Fix(T) ∩ Fix(W), which verifies the desired equality. Notably, as mentioned in Remark 7, it is evident that the sequence {x_k} is bounded.
Assumption 11 (Quadratic growth condition) Suppose now that φ_1 grows quadratically (globally) away from the part X* of its minimizing set over Y* = Fix(W), meaning there is a real number µ > 0 such that φ_1(x) − φ_1* ≥ µ dist(x, X*)² for all x ∈ Ω_1, where Ω_1 = B(0, C_{x_0}) for a given x_0 ∈ Fix(W) and φ_1* represents the optimal value of φ_1.
The quadratic growth condition can be interpreted as a sharpness assumption on the function φ_1, describing functions that exhibit at least quadratic growth in dist(x, X*). Originally introduced to establish the convergence of trajectories of the gradient flow of analytic functions, it was extended to non-smooth functions by Bolte et al. in [10]. As a simple example, let us assume that φ_1(x, y) = 0 for (x, y) in the set in question; through a straightforward investigation, one observes that (17) is verified.
Theorem 12 Let Assumption 11 hold, and δ = 0. Then {x_k} converges to some x* ∈ X* such that Proof Strong convexity of ω and contractivity of the operator S together imply that there is a unique x* ∈ X* such that x* = P_{X*} S(x*) and x* ∈ VI(S, X*), i.e., Since the sequence {x_k} is bounded, one sees that x_k ∈ Ω_1. Furthermore, utilizing Assumption 11, one readily verifies that w({x_k}) ⊂ X*. Moreover, one can extract a sub-sequence {x_{k_i}} of {x_{k+1}}, or of any sub-sequence thereof, converging to x′ ∈ X*, which holds by Lemma 8, part (c), and (69), so that Next, we show x_k → x*. Let the sequences c_k and d_k be defined as From the above it is immediate that c_k + d_k = x_{k+1} − x*, and. By a simple calculation, one has and finally, by plugging c_k and d_k into the previous inequality, it follows that Now, setting one has that Also, using the boundedness of {x_k} together with δ = 0, we can conclude that lim sup The desired assertion now follows from Lemma 5.
Theorem 13 Let Assumption 11 hold, and δ = ∞. Moreover, assume that {x_k} is bounded. Then {x_k} converges to some x* ∈ X* such that Proof As before, there is a unique fixed point x* ∈ X* of the contraction map P_{X*} S, i.e., x* = P_{X*} S(x*). Therefore x* ∈ VI(S, X*) and, by the same method, due to the boundedness of {x_k} we may extract a sub-sequence and also there is a sub-sequence We show the last inequality. Using Lemma 9, one derives that x′′ ∈ VI(I − T, Fix(W)), from which (22) follows. The rest of the proof follows from (20) and (21), and Lemma 5.
As another application of Theorem 12, one may point to Theorem 6.1 of [30] for solving the following quadratic minimization problem: where K is a non-empty closed convex set, µ ≥ 0 is a real number, u, b ∈ R^n, and A is a bounded linear operator which is positive (⟨Ax, x⟩ ≥ 0 for all x ∈ R^n). Set S(x) := x − r∇ω(x) and T(x) := prox_{δ_K}(x) = P_K(x). Then the sequence {x_k} generated by the iteration converges to the unique solution x* of problem (23) under the mild assumptions α_k → 0 and We drop the assumption that lim_k α_{k+1}/α_k = 1. Notice that when we take K = R^n, problem (23) reduces to a classical convex quadratic optimization problem, in which case Remark 14 Knowing relation (20) and Assumption 2, we see that the following two conditions together imply the convergence of the sequence {x_k}: and lim sup Thanks to the assumptions of Theorem 12, x* solves VI(S, X*). This is due to the fact that δ = 0, which means that β_k → 0 faster than α_k → 0; consequently, the term α_k S(x_k) dominates, while the term β_k T(x_k) becomes negligible. When δ = ∞, it is difficult to verify condition (25) without assuming bounded linear regularity to control the growth of ‖x − T(x)‖.
Up to now, we have shown that the sequence {x_k} is bounded and convergent provided that δ = 0. A natural question is whether the sequence {x_k} is convergent when δ is non-zero. The following proposition guarantees, under the assumption δ ∈ [0, +∞), that a particular variational inequality is satisfied by any limit point of the sequence generated by the Algorithm.
Proposition 15 Assume δ < ∞, together with Assumptions 2, 3 and 4. Then the sequence {x_k} converges to the unique solution of the variational inequality Proof From part (b) of Lemma 8 we have y_k → 0 as k → ∞. By the definition of iteration (12) and the monotonicity of I − W, for all y ∈ Fix(W) one sees easily that which implies that Now, for given x_1, x_2 ∈ w({x_k}) there exist sub-sequences {x_{k_i}} and {x_{k_j}} of {x_k} such that x_{k_i} → x_1 and x_{k_j} → x_2. Taking the lim sup of (28) and using the fact that y_k → 0 and lim sup Rearranging (29) by substituting y = x_1 and y = x_2 shows that On the other hand, since I − S is (1 − r)-strongly monotone and I − T is monotone, by adding up inequalities (30) and (31) one obtains that So x_1 = x_2. This shows that {x_k} converges. (Here we have used the fact that the sequence {x_k} converges if every sub-sequence of {x_k} contains a further convergent sub-sequence with the same limit.) Setting x := lim_{k→∞} x_k, we then see from (29) that This completes the proof.

Convergence analysis under an error-bound condition
Here, we introduce an error-bound condition that facilitates additional convergence guarantees. Let us denote by B(0, ρ) the closed ball of radius ρ centred at 0.
Definition 17 (Error-bound condition, [7]) Let W : X → X be such that Fix(W) ≠ ∅. We say that W is boundedly linearly regular if, for every ρ > 0, there exists θ > 0 such that d(x, Fix(W)) ≤ θ‖x − W(x)‖ for all x ∈ B(0, ρ). Note that, in general, θ depends on ρ, which we sometimes indicate by writing θ = θ(ρ).
The notion of bounded linear regularity is a valuable property in optimization and variational analysis. It ensures that a mapping behaves well near its fixed points, and it has been used in [11] to analyze linear convergence of algorithms involving non-expansive mappings. An exemplary and practically significant illustration of an objective that is not quasi-strongly convex yet satisfies the quadratic growth condition is the LASSO problem min_x (1/2)‖Qx − b‖² + λ‖x‖_1 when the operator Q has a nontrivial kernel. Further classes of functions that possess a regular error-bound property include the following (Example 2). Proposition 18 For all t ∈ (0, 1/L_{f_2}] and x ∈ dom(∂φ_2) one has ‖x − T_t(x)‖ ≤ t d(0, ∂φ_2(x)).
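For projections the error bound holds with θ = 1: when W = P_C for a closed convex set C, Fix(W) = C and d(x, Fix(W)) = ‖x − W(x)‖. The sketch below (with C the Euclidean unit ball, an illustrative choice) verifies this on random points:

```python
import numpy as np

# For W = P_C (projection onto a closed convex set C), Fix(W) = C and
# dist(x, Fix(W)) = ||x - W(x)||, so W is boundedly linearly regular
# with theta = 1.  Toy check for C = unit ball:
W = lambda x: x / max(1.0, np.linalg.norm(x))

rng = np.random.default_rng(2)
for _ in range(1000):
    x = 3.0 * rng.normal(size=4)
    dist = max(np.linalg.norm(x) - 1.0, 0.0)   # dist(x, C) for the ball
    assert dist <= 1.0 * np.linalg.norm(x - W(x)) + 1e-12
```

For general non-expansive maps the constant θ can be larger than 1 and depend on the radius ρ, which is exactly what Definition 17 allows.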
We shall now study cases wherein δ is not finite.From now on, we use Ω := V I(T, Fix(W )) and assume it is non-empty.
Theorem 19 Assume δ = ∞, together with Assumptions 2, 3, and 4. Assume also that {x_k} is bounded. Moreover, if the following assumptions hold (A_2) W is boundedly linearly regular, then the sequence {x_k} converges to x*, the unique solution of Furthermore, this implies that x* minimizes ω over Ω, i.e., min Proof Since Ω is closed and convex and S is a contraction, there exists x* ∈ Ω which is the unique fixed point of the map P_Ω S, i.e., P_Ω S(x*) = x*, To deduce x_k → x*, we first note that x* ∈ Ω := VI(T, Fix(W)), which implies that and, since P_{Fix(W)}(x_k) ∈ Fix(W), one gets On the other hand, since {x_k} is bounded, there exist M > 0 and k_0 ≥ 0 such that for all k ≥ k_0 one has x_k ∈ B(0, M). So, by applying Assumption (A_1), one can easily observe that there exists θ > 0 such that Employing (35) and (36), one has Now, since {x_k} is bounded, one can find a constant C > 0 so that and we will then have Therefore, by combining the previous inequality and (37), we get Now, multiplying (38) by β_k/α_k yields Using (39), Assumption (A_4), and part (b) of Lemma 8, we observe that lim sup Moreover, by Lemma 9, we have w({x_k}) ⊆ Ω. Now, since {x_{k+1}} is bounded, there exists a sub-sequence {x_{k_i}} of {x_{k+1}} converging to x′ ∈ Ω. From (34), it can be seen that lim sup Recall that we still have inequality (20). By a similar argument as in Remark 14, from (40) and (41) and in view of Lemma 5, we see that x_k → x*, and the proof is complete.
Notice that by Lemma 9, we know that when δ = ∞, then w({x k }) ⊆ Ω.The following example shows that this is not a necessary condition.
The following result asserts the existence of a limit point satisfying a variational inequality under a mild assumption related to the preceding theorem without any condition on δ.
Proposition 20 Assume δ < ∞, together with Assumptions 2, 3, 4, and (A_2). Moreover, assume that Assumption 11 holds with Ω in place of X*. Then the sequence {x_k} converges to x*, which is the unique solution of Proof This is immediate from Theorem 19.
The following fact provides the limits of the distances between {x_k} and Fix(W) and X*, respectively. Proof We just prove the first assertion. (The second is straightforward from the boundedness of {x_k}.) The proof relies on the study of the sequence {h_k}. Since P_{X*} is the projection operator onto the convex set X*, we have Now, P_{X*}(x_k) ∈ X* and consequently Since h_{k+1} ≥ 0, the last inequality implies that Finally, the proof is completed by part (a) of Lemma 8.

Remark 22
We would also like to point out that if, in Theorem 19 and Proposition 20, we had Ω := VI(T, Fix(W)) = X*, then the trilevel optimization problem (2) would have a solution.
For instance, the limit lim_k β_k/α_k need not exist. Instead, we consider lim sup_k β_k/α_k. See Example 4 below, where the limit of β_k/α_k does not exist, but our results still apply.
Now, we are ready to give an example of step-sizes α_k and β_k that guarantees the convergence of {x_k} in all of our results. Note that in all cases lim_k β_k/α_k may not exist. Furthermore: • Assumption 2 holds when 0 < λ ≤ 1.
Note that for the case δ = ∞ it is sufficient to consider Now we are ready to present a taxonomy of assumptions with respect to α_k and β_k, as referenced in Table 2: • (a): λ < γ.

Convergence Rate Analysis for Trilevel Optimization Problems
In this section, we present the main result of this paper, which addresses the rate of convergence of the sequence {x_k} generated by Algorithm 1 with a particular choice of step-sizes.

Technical lemmas
The technical lemma that we state next, for which we refer to [24], will play a crucial role in the convergence analysis.
Lemma 24 [24, Lemma 3] Let M > 0. Suppose that {a_k} is a sequence of non-negative real numbers satisfying a_1 ≤ M and where γ ∈ (0, 1], {b_k} is the sequence defined by b_k := min{2/(γk), 1}, and {c_k} is a sequence of real numbers such that c_k ≤ M < ∞.
Then the sequence {a_k} satisfies The next result will be useful for the rate of convergence.
Lemma 25 One has the following where φ_i(x) := f_i(x) + g_i(x) and ψ_i(x) := (1/t)(x − prox_{tg_i}(x − t∇f_i(x))), i = 1, 2. Proof Let i = 1. Using the Lipschitz continuity of ∇f_1 with parameter L_{f_1}, it is well known that the convexity of f_1 is equivalent to Assume that x ∈ R^n and t ∈ (0, 1/L_{f_1}] are given. Plugging y = x − tψ_1(x) into the previous inequality, one obtains (46). Now, by simplifying and taking into account (46), one has and therefore it follows that We are now in a position to derive the following result, which appeared in a similar form in [2, 24]; however, for our context the proof had to be modified.
Proposition 26 [24, Proposition 1] Let x ∈ R^n and denote x⁺ = T_t(x). Then and also, if z = W_s(x), then Proof We will just prove the first part; the second part can be proved by the same method. Since x⁺ = T_t(x), we have From (47) of Lemma 25 we obtain and then, by plugging into the previous inequality, one gets and the desired result follows.
Now, we set up the sequences α_k and β_k and define the constant J as where r ∈ (0, 1). Clearly, Assumptions 2, 3, 4 are satisfied under (48).
To begin, we present the following lemma, which plays a key role in the sequel.
Lemma 27 Let {x_k}, {y_k}, {z_k} and {v_k} be the sequences generated by Algorithm 1 and let x ∈ Fix(W) be given; define y = T(x) and v = S(x). Then, for every k ≥ 1, the following relations hold true.
and there exist positive constants C_S, C_T and C_x so that Proof All parts are a direct consequence of the non-expansiveness of T and W, the contraction property of S, and Lemma 6.
Lemma 28 Let {x_k}, {y_k}, {z_k} and {v_k} be the sequences generated by Algorithm 1, where {α_k} and {β_k} are defined by (48). Then for every x ∈ Fix(W) one has where C_S, C_T, C_x are defined in Lemma 6 and J = 2/(1 − r).
Proof One can write Now, one gets that and one can easily deduce that Moreover, therefore all hypotheses of Lemma 24 hold. Hence, the rate of convergence of {‖x_{k+1} − x_k‖} is immediately implied by setting a_k = ‖x_k − x_{k−1}‖, b_k = α_k, γ = 1 − r, and c_k as in (52). By the following arguments, the rate for {‖z_k − x_{k+1}‖} can be derived, where we have used the fact that (1 − α_k)β_k ≤ α_k, and we also have

The Main Result
Now we are in a position to conclude our main result concerning the rate of convergence for convex trilevel optimization. Having proved that ‖z_k − x_{k−1}‖ → 0 as k → ∞, and considering the lower semi-continuity of φ_1, one obtains that {φ_1(z_k)}_{k∈N} converges to the optimal value. Furthermore, this implies the convergence of the sequence {φ_1(x_k)}_{k∈N} to the same value. We note that the same argument holds for the sequence {φ_2(z_k)}_{k∈N}. The following theorem presents the convergence rates of the function values to their optima: Theorem 29 Let {x_k}, {v_k}, {z_k} and {y_k} be the sequences generated by Algorithm 1, where α_k is given by (48). Then where C_S, C_T, C_{x*} are the same constants as in Lemma 6 and J is defined in (48). Furthermore, one has Proof Since x* is a solution of the trilevel problem (2), from Theorem 12 the following holds: Let us take and let α_k, β_k be as in (48). From (58), it follows that where we used the facts 2(1 − α_k)β_k ≤ α_k − α_{k+1} and α_{k+1} ≤ α_k. By Lemma 6, we then have c_k ≤ (C_T + C_{x*})C_{x*}. By utilizing Lemma 24, we obtain Let us consider the first assertion (54). According to Proposition 26 and z_{k+1} = W(x_k), for every step-size s ≤ 1 the following inequality holds Combining with (51) and (59) for x* ∈ X* ⊆ Fix(W) = Y*, one obtains Thus, assertion (54) follows from (60) and (62). Now we obtain the rate of convergence of φ_1(y_k) − φ_1(x*). In Algorithm 1 we have y_{k+1} = T(x_k), and so, using Lemma 25, one gets that Plugging inequality (59) into (63) gives the desired assertion (55).
To establish assertion (56): since φ_i (i = 1, 2) is convex and bounded above on the compact set B(x*, C_{x*}), by invoking [9, Theorem 2.1.10] one concludes that φ_i is Lipschitz on this set. Note that, thanks to x_k ∈ B(x*, C_{x*}), one has which, combined with (59), gives the desired assertion.
Finally, to bound the rate of convergence of ω(x_k) − ω(x*), we take into consideration (49), (59), and the strong convexity of ω, for all k ∈ N \ {1}. We obtain: Remark 30 It is worth pointing out that the step-size {α_k} depends on the parameter r, which needs to be chosen so that the map S is a contraction. Notice, however, that knowing L_ω and µ, one can consider r such that r ∈ (0, 2/(L_ω + µ)]. In this case, the map S is guaranteed to be a contraction.

An Extension to Multilevel Optimization Problems
In this section, we extend our results to multilevel convex optimization problems with an arbitrary number of nested minimization problems: • for every x ∈ dom(∂φ_i) one has where dom(∂φ_i) = {x ∈ R^n : ∂φ_i(x) ≠ ∅}.
In view of Fact 10, the following general result holds under the qualification condition that X*_N has non-empty interior. We omit the proof.
Lemma 32 If X * N has non-empty interior then X * N = ∩ N i=1 Fix(T i ).
Lemma 33 Let Assumption (P 2 ) hold.Then the sequence {x k } generated by algorithm (65) is bounded.
Proof One can rewrite the sequence {x_{k+1}} as and, since R_k is a convex combination of non-expansive operators, it is non-expansive. Now, as in Lemma 6, for every x ∈ Fix(T_1) one has Lemma 34 Let T_1, ..., T_N be non-expansive mappings from R^n to itself such that ∩_{i=1}^N Fix(T_i) is non-empty, and let λ_1, λ_2, ..., λ_N be real numbers such that Lemma 35 Let Assumption (P_2) hold. Then for every x ∈ Fix(T_1) one has Proof Suppose that x ∈ Fix(T_1) is given. First, consider Therefore, we have To see the second part, for all k ∈ N we have Now, by plugging (68) into the following inequality, the assertion follows immediately.
Now, we are in a position to present our main result regarding multi-level scenarios.
Theorem 36 Assume that X* has non-empty interior. Moreover, the following holds Proof First, by Lemma 32 we observe that Let us now consider the auxiliary sequence By employing [29, Theorem 3.2] along with Lemma 34, one establishes the convergence of the sequence {y_k} to a specific point denoted by x*_N. Consequently, we deduce the following: using Lemma 5, it follows that x_k − y_k goes to zero as k → ∞. Proof Let x*_N ∈ X*_N be the unique fixed point of the contraction P_{X*_N} S, namely the unique solution of VI(S, X*_N), i.e., Invoking Assumption (P_2), it follows that {x_k} is bounded, and so x_k ∈ Ω*. Furthermore, utilizing Assumption 31, it is straightforward to show that w({x_k}) ⊂ X*_N. Moreover, one can extract a convergent sub-sequence {x_{k_i}} of {x_{k+1}}, or of any sub-sequence thereof, converging to x We will now show the rate of convergence for the general case. To study this, let us take the sequences y where C_S, C_{T_1}, C_{x*_N} are the same constants as in Lemma 6 and Lemma 33, and J is defined as before. Furthermore, one has Proof As in (59), using Lemma 35 and Lemma 24, one can conclude that and the rest of the proof is similar to that of the trilevel Theorem 29.
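As an illustration of the multilevel iteration, the sketch below extends the trilevel combination to N = 3 lower-level maps averaged with fixed weights λ_i = 1/3. This aggregated form, the maps, and the weights are our illustrative assumptions, not the paper's exact scheme (65); here the common fixed-point set is a singleton, so the example exercises the iteration rather than the full hierarchical selection:

```python
import numpy as np

# Hypothetical lower-level prox-grad maps with common fixed point (1, 0, 0).
T1 = lambda x: np.array([1.0, x[1], x[2]])          # enforces x1 = 1
T2 = lambda x: np.array([x[0], 0.5 * x[1], x[2]])   # drives x2 -> 0
T3 = lambda x: np.array([x[0], x[1], 0.5 * x[2]])   # drives x3 -> 0
S = lambda x: 0.5 * x + 2.5                         # top-layer contraction

x = np.zeros(3)
for k in range(1, 2001):
    a = 1.0 / k
    R = (T1(x) + T2(x) + T3(x)) / 3.0   # convex combination R_k, non-expansive
    x = a * S(x) + (1 - a) * R
print(x)  # drifts toward the common fixed point (1, 0, 0)
```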

Conclusion
We have shown how to approach a broad class of hierarchical convex optimization problems wherein the inner problems optimize so-called composite functions, i.e., sums of a convex smooth function and a convex non-smooth one, and all but the inner-most problem consider a constraint set composed of minimizers of another problem. We have used proximal-gradient operators in an iterative proximal-gradient algorithm related to BiG-SAM of [24]. For the first time, we consider diminishing sequences α_k and β_k such that the limit of β_k/α_k need not exist. The convergence is studied in a number of cases, depending on the relative speed of convergence of α_k and β_k and, in some cases, regularity properties of the problem layers. We showed standard O(1/k) and O(1/√k) rates of convergence for the appropriate corresponding quantities. Future work can include introducing stochasticity to the problems.
Finally, the third layer exhibits an O(1/√k) global rate of convergence with respect to the inner objective function values. Assessing the main iteration in terms of the inner objective function values, we observe the convergence rate O(1/k).

Fact 21
Let Assumption 4 hold and h_k = d(x_k, X*). Also suppose that {x_k} is bounded. Then the following assertion holds: a) If h_k → 0, then lim_{k→∞} d(x_k, Fix(W)) = 0.

Table 2
An overview of our results in Sections 2.3 and 2.4, with the corresponding assumptions on α_k and β_k and examples of the step-size sequences.