Linear inverse problems with Hessian–Schatten total variation

In this paper, we characterize the class of extremal points of the unit ball of the Hessian–Schatten total variation (HTV) functional. The underlying motivation for our work stems from a general representer theorem that characterizes the solution set of regularized linear inverse problems in terms of the extremal points of the regularization ball. Our analysis is mainly based on studying the class of continuous and piecewise linear (CPWL) functions. In particular, we show that in dimension d = 2, CPWL functions are dense in the unit ball of the HTV functional. Moreover, we prove that a CPWL function is extremal if and only if its Hessian is minimally supported. Finally, we prove that the density result (which we have only proven for dimension d = 2) implies that the closure of the CPWL extreme points contains all extremal points.

Broadly speaking, the goal of an inverse problem is to reconstruct an unknown signal of interest from a collection of (possibly noisy) observations. Linear inverse problems, in particular, are prevalent in various areas of signal processing, such as denoising, inpainting, and image reconstruction. They are defined via the specification of three principal components: (i) a hypothesis space F from which we aim to reconstruct the unknown signal f* ∈ F; (ii) a linear forward operator ν : F → R^M that models the data acquisition process; and (iii) the observed data, stored in an array y ∈ R^M with the implicit assumption that y ≈ ν(f*). The task is then to (approximately) reconstruct the unknown signal f* from the observed data y. From a variational perspective, the problem can be formulated as a minimization of the form
min_{f ∈ F} E(ν(f), y) + λ R(f),   (1)
where E : R^M × R^M → R is a convex loss function that measures the data discrepancy, R : F → R is the regularization functional that enforces prior knowledge on the reconstructed signal, and λ > 0 is a tunable parameter that balances the two terms. The use of regularization for solving inverse problems dates back to the 1960s, when Tikhonov proposed a quadratic (ℓ2-type) functional for solving finite-dimensional problems [Tik63]. More recently, Tikhonov regularization has been outperformed by ℓ1-type functionals in various settings [Tib96, DE03]. This is largely due to the sparsity-promoting effect of the latter, in the sense that the solution of an ℓ1-regularized inverse problem can typically be written as the linear combination of a few predefined elements, known as atoms [Don06b, BDE09]. Sparsity is a pivotal concept in modern signal processing and constitutes the core of many celebrated methods. The most notable example is the framework of compressed sensing [CRT06, Don06a, EK12], which has attracted a lot of attention in the past decades.
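As a purely illustrative aside, the finite-dimensional ℓ1 setting mentioned above (the LASSO of [Tib96]) can be sketched in a few lines of code: the snippet below minimizes ½‖Ax − y‖² + λ‖x‖₁ with a basic proximal-gradient (ISTA) loop. The forward matrix A, the data y and all parameters are hypothetical toy choices; this is only meant to make the sparsity-promoting effect of ℓ1 regularization tangible, and it is not a method from this paper.

import numpy as np

def soft_threshold(x, t):
    # Proximal operator of t*||.||_1 (component-wise soft-thresholding).
    return np.sign(x) * np.maximum(np.abs(x) - t, 0.0)

def ista(A, y, lam, n_iter=500):
    # Proximal-gradient iterations for  min_x 0.5*||A x - y||^2 + lam*||x||_1.
    L = np.linalg.norm(A, 2) ** 2          # Lipschitz constant of the gradient
    x = np.zeros(A.shape[1])
    for _ in range(n_iter):
        grad = A.T @ (A @ x - y)           # gradient of the quadratic data term
        x = soft_threshold(x - grad / L, lam / L)
    return x

# Hypothetical toy problem: sparse ground truth, random Gaussian measurements.
rng = np.random.default_rng(0)
A = rng.standard_normal((40, 100))
x_true = np.zeros(100)
x_true[[3, 27, 81]] = [1.5, -2.0, 0.7]
y = A @ x_true + 0.01 * rng.standard_normal(40)
x_hat = ista(A, y, lam=0.1)
print("non-zero entries in the reconstruction:", int(np.sum(np.abs(x_hat) > 1e-3)))

The reconstruction is typically supported on a few atoms (here, canonical basis vectors), which is the discrete counterpart of the sparsity discussion above.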
In general, regularization enhances the stability of the problem and alleviates its inherent ill-posedness, especially when the hypothesis space is much larger than M. While this can happen in the discrete setting (e.g. when F = R^d with d ≫ M), it is inevitable in the continuum, where F is an infinite-dimensional space of functions. Since naturally occurring signals and images are usually indexed over the whole continuum, studying continuous-domain problems is, therefore, undeniably important. It thus comes as no surprise to see the rich literature on this class of optimization problems. Among the classical examples are the smoothing splines for interpolation [Sch88, Rei67] and the celebrated framework of learning over reproducing kernel Hilbert spaces [Wah90, SHS01]. Remarkably, the latter laid the foundation of numerous kernel-based machine learning schemes such as support-vector machines [EPP00]. The key theoretical result of these frameworks is a "representer theorem" that provides a parametric form for their optimal solutions. While these examples formulate optimization problems over Hilbert spaces, the representer theorem has recently been extended to cover generic convex optimization problems over Banach spaces [BCDC+19, BC20, Uns21, UA22]. In simple terms, these abstract results characterize the solution set of (1) in terms of the extreme points of the unit ball of the regularization functional B_R = {f ∈ F : R(f) ≤ 1}. Hence, the original problem translates into finding the extreme points of the unit ball B_R.
In parallel, the Rudin-Osher-Fatemi total variation has been systematically explored in the context of image restoration and denoising [ROF92, Cha04, Get12]. The total variation of a differentiable function f : Ω → R can be computed as
TV(f) = ∫_Ω |∇f(x)| dx.   (2)
The notion can be extended to cover non-differentiable functions using the theory of functions with bounded variation [AFP00, CDDD03]. In this case, the representer theorem states that the solution can be written as the linear combination of some indicator functions [BC20]. This adequately explains the so-called "staircase effect" of TV regularization. Subsequently, higher-order generalizations of TV regularization have been proposed by Bredies et al. [BKP10, BH14, BH20]. In particular, the second-order TV has been used in various applications [HS06, BP10, KBPS11]. By analogy with (2), the second-order TV is defined over the space of functions with bounded Hessian [Dem84]. In particular, it can be computed for twice-differentiable functions f : Ω → R as
∫_Ω |∇²f(x)|_p dx,
where |·|_p denotes the Schatten p-norm of a matrix [LU13]. While this had only been defined for twice-differentiable functions, it has recently been extended to the space of functions with bounded Hessian [ACU21]. The extended seminorm, the Hessian-Schatten total variation (HTV), has also been used for learning continuous and piecewise linear (CPWL) mappings [CAU21, PGU22]. The motivation and importance of the latter stems from the following observations: (1) The CPWL family plays a significant role in deep learning. Indeed, it is known that the input-output mapping of any deep neural network (DNN) with rectified linear unit (ReLU) activation functions is a CPWL function [MPCB14]. Conversely, any CPWL mapping can be exactly represented by a DNN with ReLU activation functions [ABMM16]. These results provide a one-to-one correspondence between the CPWL family and the input-output mappings of commonly used DNNs.
(2) For one-dimensional problems (i.e., when Ω ⊆ R), the HTV seminorm coincides with the second-order TV. Remarkably, the representer theorem in this case states that the optimal solution can be achieved by a linear spline; that is, a univariate CPWL function. The latter suggests the use of TV^(2) regularization for learning univariate functions [SESS19, Uns19, AGCU20, BCG+20, DDUF21, ADU22]. (3) It is known from the literature on low-rank matrix recovery that the Schatten-1 norm (also known as the nuclear norm) promotes low-rank matrices [DR16]. Hence, by using the HTV seminorm with p = 1, one expects to obtain a mapping whose Hessian has low rank at most points, the extreme case being the CPWL family, whose Hessian is zero almost everywhere.
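To make the role of the Schatten-1 (nuclear) norm of the Hessian more tangible, the following sketch evaluates a naive finite-difference discretization of the HTV with p = 1 on a regular grid: second differences give a 2×2 Hessian per interior pixel, whose singular values are summed and integrated over the domain. The discretization and the toy example are illustrative assumptions and not the schemes used in the works cited above.

import numpy as np

def htv_schatten_1(f, h=1.0):
    # Naive discrete Hessian-Schatten total variation with p = 1 (nuclear norm)
    # for a 2-D array f sampled on a grid of spacing h.
    fxx = (f[2:, 1:-1] - 2 * f[1:-1, 1:-1] + f[:-2, 1:-1]) / h**2
    fyy = (f[1:-1, 2:] - 2 * f[1:-1, 1:-1] + f[1:-1, :-2]) / h**2
    fxy = (f[2:, 2:] - f[2:, :-2] - f[:-2, 2:] + f[:-2, :-2]) / (4 * h**2)
    # Stack the 2x2 Hessian at every interior pixel and sum its singular values.
    H = np.stack([np.stack([fxx, fxy], -1), np.stack([fxy, fyy], -1)], -2)
    sv = np.linalg.svd(H, compute_uv=False)        # singular values, shape (..., 2)
    return float((sv.sum(axis=-1) * h**2).sum())   # integrate over the domain

# Hypothetical example: a CPWL "ridge" max(x - 1/2, 0) on the unit square; its
# Hessian is concentrated on the kink line, and the discrete HTV is close to 1.
xx, yy = np.meshgrid(np.linspace(0, 1, 64), np.linspace(0, 1, 64), indexing="ij")
print(htv_schatten_1(np.maximum(xx - 0.5, 0.0), h=1 / 63))

A twice-differentiable function with a genuinely rank-2 Hessian would instead pay for both singular values at every point, which is the low-rank-promoting mechanism alluded to in item (3).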
The aim of this paper is to identify the solution set of linear inverse problems with HTV regularization. Motivated by recent general representer theorems (see [BCDC+19, UA22]), we focus on the characterization of the extreme points of the unit ball of the HTV functional. After recalling some preliminary concepts (Section 2), we study the HTV seminorm and its associated native space from a mathematical perspective (Section 3). Next, we prove our main theoretical result on the density of CPWL functions in the unit ball of the HTV seminorm (Theorem 21) in Section 4. Finally, we invoke a variant of the Krein-Milman theorem to characterize the extreme points of the unit ball of the HTV seminorm (Section 5).

Preliminaries
Throughout the paper, we shall use fairly standard notation for various objects, such as function spaces and sets. For example, L^n and H^k denote the Lebesgue and k-dimensional Hausdorff measures on R^n, respectively. Below, we recall some of the concepts that are foundational for this paper. We recall that the scalar product between M, N ∈ R^{n×n} is defined by ⟨M, N⟩ := Σ_{i,j} M_{i,j} N_{i,j} and induces the Hilbert-Schmidt norm |M| := ⟨M, M⟩^{1/2}. Next, we enumerate several properties of the Schatten norms that shall be used throughout the paper. We refer to standard books on matrix analysis (such as [Bha97]) for the proofs of these results.
Proposition 2. The family of Schatten norms satisfies the following properties.
Definition 3 (L^r-Schatten p-norm). Let p, r ∈ [1, +∞] and let M ∈ (L^r(R^n))^{n×n}. We define the L^r-Schatten p-norm of M as ||M||_{p,r} := || |M(·)|_p ||_{L^r(R^n)}, i.e. the L^r norm of the scalar function x ↦ |M(x)|_p. An analogous definition can be given when the reference measure for the L^r space is not the Lebesgue measure.
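For illustration, the inner layer of Definition 3 (the Schatten p-norm of a single matrix, computed from its singular values) can be sketched as follows; the example matrix is an arbitrary choice.

import numpy as np

def schatten_norm(M, p):
    # Schatten p-norm of a matrix: the l^p norm of its vector of singular values.
    s = np.linalg.svd(M, compute_uv=False)
    if np.isinf(p):
        return float(s.max())               # operator (spectral) norm
    return float((s ** p).sum() ** (1.0 / p))

M = np.array([[3.0, 1.0], [0.0, 2.0]])
print(schatten_norm(M, 1))                  # nuclear norm: sum of singular values
print(schatten_norm(M, 2))                  # Hilbert-Schmidt (Frobenius) norm
print(schatten_norm(M, np.inf))             # largest singular value

The L^r-Schatten p-norm of a matrix-valued field then simply takes the L^r norm of the scalar map x ↦ |M(x)|_p.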
2.2. Poincaré inequalities. We recall that, for a Borel set A ⊆ R^n with L^n(A) > 0 and f ∈ L^1(A), we write f_A := L^n(A)^{-1} ∫_A f dL^n for the mean of f on A.

Definition 4. Let A ⊆ R^n be an open domain. We say that A supports Poincaré inequalities if for every q ∈ [1, n) there exists a constant C = C(A, q) depending on A and q such that ||f − f_A||_{L^{q*}(A)} ≤ C ||∇f||_{L^q(A)} for every f ∈ W^{1,q}(A), where 1/q* = 1/q − 1/n.
We recall that any ball in R n supports Poincaré inequalities [EG15, Theorem 4.9].
Remark 5. Let A be a bounded open domain supporting Poincaré inequalities. We recall the following fact: if {f_m}_m ⊆ W^{1,q}(A) is bounded in L^1(A), sup_m ||∇f_m||_{L^q(A)} < ∞ and f_m → f in L^1_loc(A), then f ∈ L^{q*}(A), where 1/q* = 1/q − 1/n. To show this, apply a Poincaré inequality to f_m and deduce that, with c_m := (f_m)_A, the functions f_m − c_m are uniformly bounded in L^{q*}(A). We also have that {c_m}_m is bounded, since {f_m}_m is bounded in L^1(A). Thus, we infer that ||f_m||_{L^{q*}(A)} is bounded in m, whence f ∈ L^{q*}(A).

2.3. Distributions. We denote, as usual, by D(Ω) = C^∞_c(Ω) the space of test functions and by D'(Ω) its dual, i.e. the space of distributions [Sch57]. If T ∈ D'(Ω), we denote by ∇²T the distributional Hessian of T, i.e. the matrix of distributions {∂²_{i,j}T}_{i,j ∈ {1,...,n}}, where ∂²_{i,j}T(ϕ) := T(∂²_{i,j}ϕ) for every ϕ ∈ D(Ω). In a natural way, if F ∈ D(Ω)^{n×n}, we denote ∇²T(F) := Σ_{i,j} ∂²_{i,j}T(F_{i,j}).

Remark 6. Let T be a distribution on Ω such that for every i = 1, . . ., n, ∂_i T is a Radon measure. Then T is induced by a BV_loc(Ω) function.
The proof of this fact is classical.Here, we sketch it for the reader's convenience.
We let {ρ_k}_k be a sequence of Friedrichs mollifiers. Let B ⊆ Ω be a ball such that B̄ ⊆ Ω, so that, if k is big enough (which we will implicitly assume in what follows), we have a well defined distribution ρ_k * T on B, which is induced by a C^∞(B̄) function, say t_k. It is immediate to show that, for every i = 1, . . ., n, the integrals ∫_B |∂_i t_k| dL^n are uniformly bounded in k, as T has derivatives that are Radon measures. Therefore, using a Poincaré inequality on B, we have that, for some q* > 1, ||t_k − c_k||_{L^{q*}(B)} is uniformly bounded in k, where c_k := (t_k)_B. Hence, up to non-relabelled subsequences, t_k − c_k converges to an L^{q*}(B) function f in the weak topology of L^{q*}(B) and then in the topology of D'(B). Also, t_k converges in the topology of D'(B) to T. This forces {c_k}_k ⊆ R to be bounded, so that t_k is also bounded in L^{q*}(B) and hence T is induced by an L^{q*}(B) function on B. A partition of unity argument shows that T is induced by an L^1_loc(Ω) function, whence the conclusion.

Hessian-Schatten Total Variation
In this section, we fix Ω ⊆ R^n to be an open set and p ∈ [1, +∞]. We let p* denote the conjugate exponent of p. First, we recall the definition of the HTV seminorm, presented in [ACU21], in the spirit of the classical theory of functions of bounded variation. Next, we review some known results for the space of functions with bounded Hessian (see [Dem84]), proposing at the same time a few refinements and/or extensions.

Definitions and Basic Properties.
Definition 7 (Hessian-Schatten total variation). Let f ∈ L^1_loc(Ω). For every open A ⊆ Ω we define the Hessian-Schatten total variation of f on A as
|D^2_p f|(A) := sup { Σ_{i,j} ∫_A f ∂²_{i,j}F_{i,j} dL^n : F ∈ C^∞_c(A)^{n×n}, ||F||_{p*,∞} ≤ 1 },   (4)
where the supremum runs among all F ∈ C^∞_c(A)^{n×n} with ||F||_{p*,∞} ≤ 1. We say that f has bounded p-Hessian-Schatten variation in Ω if |D^2_p f|(Ω) < ∞.

Remark 8. If f has bounded p-Hessian-Schatten variation in Ω, then the set function defined in (4) is the restriction to open sets of a finite Borel measure, which we still call |D^2_p f|. This can be proved with a classical argument, building upon [DGL77] (see also [AFP00, Theorem 1.53]).
By its very definition, the p-Hessian-Schatten variation is lower semicontinuous with respect to L^1_loc convergence. For any couple p, q ∈ [1, +∞], f has bounded p-Hessian-Schatten variation if and only if f has bounded q-Hessian-Schatten variation; moreover, c(p, q)^{-1} |D^2_q f| ≤ |D^2_p f| ≤ c(p, q) |D^2_q f| as measures, for a constant c(p, q) depending only on p and q. Hence, the induced topology is independent of the choice of p. For this reason, in what follows, we will often implicitly take p = 1 (thus omitting p from the notation), and we will stress p when this choice plays a role.
We now prove that having bounded Hessian-Schatten variation is equivalent to membership in W^{1,1}_loc with a gradient of bounded total variation. We also compare the Hessian-Schatten variation measure with the total variation measure of the gradient. This will be a key observation, as it will allow us to use the classical theory of functions of bounded variation, see e.g. [AFP00].
Proposition 9. Let f ∈ L^1_loc(Ω). Then the following are equivalent: (1) f has bounded Hessian-Schatten variation in Ω; (2) f ∈ W^{1,1}_loc(Ω) and ∇f ∈ BV_loc(Ω; R^n) with |D∇f|(Ω) < ∞. If this is the case, then, as measures,
|D^2_p f| = |dD∇f/d|D∇f||_p |D∇f|.   (5)
In particular, there exists a constant C = C(n, p) depending only on n and p such that C^{-1} |D∇f| ≤ |D^2_p f| ≤ C |D∇f| as measures on Ω.
Proof. We divide the proof in two steps.
Step 1. By the fact that f has bounded Hessian-Schatten variation in Ω, we can apply the Riesz Theorem and deduce that, for every j = 1, . . ., n, ∂_j S_i is induced by a finite measure on Ω, where S_i denotes the distributional derivative ∂_i f. Indeed, if ϕ ∈ C^∞_c(Ω), it holds that |∂_j S_i(ϕ)| = |∫_Ω f ∂²_{i,j}ϕ dL^n| ≤ C sup_Ω |ϕ|, where C is independent of ϕ. Then, by Remark 6, S_i is induced by an L^1_loc(Ω) function, which proves the claim.
so that f has bounded p-Hessian-Schatten variation and |D 2 p f | ≤ µ as measures on Ω.
We show now that µ(Ω) ≤ |D^2_p f|(Ω). Fix now ε > 0. By Lusin's Theorem, we can find a compact set K ⊆ Ω such that µ(Ω \ K) < ε and the restriction of M to K is continuous. By the continuity of M on K, we can find a Borel function N with finitely many values approximating M on K. We let k → ∞, taking into account that x ↦ N(x) is continuous on K, and we recall that ψ was arbitrary, to infer the desired inequality. As ε > 0 was arbitrary, the proof is concluded, as we have shown that |D^2_p f| = µ.

Remark 10. One may wonder what happens if, instead of defining the Hessian-Schatten total variation only on L^1_loc functions, we define it on the bigger space of distributions, extending (4) to distributions in a natural way, i.e. interpreting the right-hand side as the supremum of ∇²T(F) over the same class of fields F. It turns out that the difference is immaterial: distributions with bounded Hessian-Schatten total variation are induced by L^1_loc functions, and, of course, the two definitions of p-Hessian-Schatten total variation coincide. This is proved exactly as in Step 1 of the proof of Proposition 9, using Remark 6 once more.
The following proposition is basically taken from [Dem84] and is a density (in energy) result akin to the Meyers-Serrin Theorem.
Proposition 11. Let f ∈ L^1_loc(Ω) with bounded Hessian-Schatten variation in Ω. Then
|D^2_p f|(Ω) = inf { lim inf_k ∫_Ω |∇²f_k|_p dL^n },
where the infimum is taken among all sequences {f_k}_k ⊆ C^∞(Ω) converging to f in L^1_loc(Ω).
Proof. The (≤) inequality is trivial by lower semicontinuity. The proof of the opposite inequality is due to a Meyers-Serrin argument, and can be obtained by adapting [Dem84, Proposition 1.4] (we know that f ∈ W^{1,1}_loc(Ω) thanks to Proposition 9). Notice that the proof in [Dem84] uses Hilbert-Schmidt norms instead of Schatten norms; it can be adapted with no effort to any norm. Alternatively, one may notice that the result with Hilbert-Schmidt norms implies the result for any other matrix norm, thanks to the Reshetnyak continuity Theorem (see e.g. [AFP00, Theorem 2.39]), taking into account that D∇f_k → D∇f in the weak* topology and (5).
Now we show that Hessian-Schatten total variations decrease under the effect of convolutions, which is a well-known property in the BV context.
Lemma 12. Let f ∈ L^1_loc(Ω) with bounded Hessian-Schatten variation in Ω and let ρ be a mollifier. Then, for every open set A such that A − supp ρ ⊆ Ω, it holds that |D^2_p(ρ * f)|(A) ≤ |D^2_p f|(A − supp ρ), where A − supp ρ := {x − y : x ∈ A, y ∈ supp ρ}.
Proof. Let F ∈ C^∞_c(A)^{n×n} with ||F||_{p*,∞} ≤ 1 and set ρ̌(x) := ρ(−x). Notice that, defining the action of the mollification component-wise, ρ̌ * F ∈ C^∞_c(Ω)^{n×n} (by the assumption on the support of ρ) with ||ρ̌ * F||_{p*,∞} ≤ 1, since (by duality) |N|_p = sup ⟨N, M⟩, where the supremum is taken among all M ∈ R^{n×n} with |M|_{p*} ≤ 1. Here we used that Σ_{i,j} ∫_A (ρ * f) ∂²_{i,j}F_{i,j} dL^n = Σ_{i,j} ∫ f ∂²_{i,j}(ρ̌ * F)_{i,j} dL^n, and the proof is concluded as F was arbitrary.
In the following proposition we obtain an analogue of the classical Sobolev embedding Theorems, tailored to our situation. Recall Definition 4.
Proposition 13 (Sobolev embedding). Let f ∈ L^1_loc(Ω) with bounded Hessian-Schatten variation in Ω. Then f enjoys higher local summability and, if n = 2, f has a continuous representative. More explicitly, for every bounded domain A ⊆ Ω that supports Poincaré inequalities and every r ∈ [1, +∞), there is an affine map g = g(A, f) such that, setting f̃ := f − g, f̃ satisfies (7) or (8) below, depending on n.
Proof. The case n = 1 is readily proved by direct computation (as a domain of R supporting Poincaré inequalities has to be an interval), so that in the following we assume n ≥ 2. Also, recall that Proposition 9 states that f ∈ W^{1,1}_loc(Ω) with ∇f ∈ BV_loc(Ω). Therefore we can apply [Dem84, Proposition 3.1] to obtain continuity of f in the case n = 2, which also implies L^∞_loc(Ω) membership. As balls satisfy Poincaré inequalities, it is enough to establish the estimates of the second part of the claim to conclude. Fix then A and r as in the second part of the statement.
Let now {f_k}_k be given by Proposition 11 for f on A. Iterating Poincaré inequalities and taking into account Remark 5, we obtain affine maps g_k so that, setting f̃_k := f_k − g_k, f̃_k satisfies (7) or (8), depending on n. Arguing as for Remark 5, we see that g_k is bounded in L^1(B) for any ball B ⊆ A. This implies that g_k and ∇g_k are bounded in L^∞(A). Therefore, up to extracting a further non-relabelled subsequence, f̃_k converges in L^1_loc(A) to f − g, for an affine function g. Lower semicontinuity of the norms on the left-hand sides of (7) or (8) allows us to conclude the proof.
Remark 14 (Linear extension domains). Let n = 2; we keep the same notation as in Proposition 13. Assume also that A has the following property: there exists an open set V ⊆ R² with Ā ⊆ V and a bounded linear map E : W^{1,2}(A) → W^{1,2}(V) satisfying, for every u with bounded Hessian-Schatten variation (hence u ∈ W^{1,2}(A) by Proposition 13), Eu = u on A together with a bound on the Hessian-Schatten variation of Eu in V in terms of that of u in A, where we possibly modified the constant C. Under this assumption, every function with bounded Hessian-Schatten variation in A has a representative that is continuous up to the boundary of A.
First, using (8), we take a smooth cut-off function ψ with support contained in V and such that ψ = 1 on A. Then we use the continuous representative of ψEf as in [Dem84, Proposition 3.1] and, from its very definition, the claim follows.
It is easy to see that (0, 1) 2 is suitable for the above argument, see Lemma 17 below and its proof.
The strict convexity of the Schatten p-norm for p ∈ (1, +∞) has, as a consequence, the following rigidity result.

Lemma 15 (Rigidity). Let f, g ∈ L^1_loc(Ω) with bounded Hessian-Schatten variation and assume that |D^2_p(f + g)|(Ω) = |D^2_p f|(Ω) + |D^2_p g|(Ω) for some p ∈ (1, +∞). Then |D^2_p(f + g)| = |D^2_p f| + |D^2_p g| as measures on Ω, and there exist Borel functions ρ_f and ρ_g with D∇f = ρ_f D∇(f + g), D∇g = ρ_g D∇(f + g) and satisfying ρ_f + ρ_g = 1 |D∇(f + g)|-a.e. In particular, for every q ∈ [1, +∞], |D^2_q(f + g)|(Ω) = |D^2_q f|(Ω) + |D^2_q g|(Ω).
Proof. The first claim follows from the triangle inequality and the equality in the assumption. Now assume p ∈ (1, +∞). Take then ρ_f and ρ_g, the Radon-Nikodym derivatives of D∇f and D∇g with respect to D∇(f + g). We can apply Proposition 9 and write the polar decompositions; by linearity we obtain that the polar vectors of D∇f, D∇g and D∇(f + g) coincide |D∇(f + g)|-a.e., which implies the claim by strict convexity. The last assertion is due to Proposition 9.
3.2. Boundary Extension. [Dem84, Theorem 2.2] provides us with an extension operator for bounded domains with C² boundary. However, we need the result for parallelepipeds. This can be obtained following [Dem84, Remark 2.1]. We sketch the argument anyway, as we are also going to need a slightly more refined result than the one stated in [Dem84]. This extension result (namely, its corollary Proposition 18) will play a key role in the proof of Theorem 21 below.
Lemma 16. Let Ω = (a_0, a_1) × Ω' be a parallelepiped in R^n and let f ∈ L^1_loc(Ω) with bounded Hessian-Schatten variation in Ω. Then f can be extended across the face {a_1} × Ω' to a function with bounded Hessian-Schatten variation on a strictly larger parallelepiped, whose Hessian-Schatten variation is bounded by C |D^2 f|(Ω), where C is a constant that does not depend on f.
Lemma 17. Let Ω = (0, 1)^n and let f ∈ L^1_loc(Ω) with bounded Hessian-Schatten variation in Ω. Then there exist a neighbourhood Ω̃ of Ω̄ and f̃ ∈ L^1_loc(Ω̃) with bounded Hessian-Schatten variation in Ω̃ such that f̃ = f on Ω and |D^2 f̃|(Ω̃) ≤ C |D^2 f|(Ω), where C is a constant that does not depend on f.
Proof. Apply several times a suitable variant of Lemma 16, extending Ω along each side. Notice that at each step we are extending a parallelepiped which contains Ω.
Proposition 18. Let Ω = (0, 1)^n and let f ∈ L^1_loc(Ω) with bounded Hessian-Schatten variation in Ω. Then there exists a sequence {f_k}_k of functions, smooth in a neighbourhood of Ω̄, such that f_k → f in L^1(Ω) and |D^2_p f_k|(Ω) → |D^2_p f|(Ω) for any p ∈ [1, +∞].
Proof. Take f̃ as in Lemma 17 and, if {ρ_k}_k is a sequence of Friedrichs mollifiers, set f_k := f̃ * ρ_k. The claim follows from lower semicontinuity and Lemma 12.

A Density Result for CPWL Functions

In this section, we study the density of CPWL functions in the unit ball of the HTV functional. As usual, we let Ω ⊆ R^n be open and p ∈ [1, +∞].

4.1. Definitions and The Main Result.

Definition 19. We say that f ∈ C(Ω) belongs to CPWL(Ω) if there exists a decomposition {P_k}_k of R^n into n-dimensional convex polytopes (by convex polytope we mean the closed convex hull of finitely many points), intersecting only at their boundaries, such that for every k, f|_{P_k ∩ Ω} is affine, and such that for every ball B, only finitely many P_k intersect B.
Notice that CPWL functions defined on bounded sets have automatically finite Hessian-Schatten variation, by Proposition 9.
In the particular case n = 2, we can and will assume that the convex polytopes {P k } k as in the definition of CPWL function are triangles.
Remark 20. Thanks to Proposition 9, we can deal with |D^2_p f| and |D∇f| exploiting the theory of vector-valued functions of bounded variation [AFP00]. In particular, for f ∈ CPWL(Ω), |D∇f| will charge only the 1-codimensional faces of the polytopes P_k. Take then a non-degenerate face σ = P_k ∩ P_{k'} for k ≠ k' (i.e. σ is the common face of P_k and P_{k'}), and let a_k denote the constant value of ∇f on P_k. Then the Gauss-Green Theorem gives D∇f ⌞ σ = (a_k − a_{k'}) ⊗ ν H^{n−1} ⌞ σ, where ν is the unit normal to σ pointing from P_{k'} towards P_k. Summing over the non-degenerate common faces, we obtain
|D∇f|(Ω) = Σ_σ |a_k − a_{k'}| H^{n−1}(σ ∩ Ω),   (15)
where, as usual, |a_k − a_{k'}| denotes the Euclidean norm. Let us remark that (15) has also been shown in [ACU21], directly relying on Definition 7, which paved the way for developing numerical schemes for learning CPWL functions [CAU21, PGU22]. Since dD∇f/d|D∇f| has rank one |D∇f|-a.e., we also obtain |dD∇f/d|D∇f||_p = |dD∇f/d|D∇f|| = 1 |D∇f|-a.e. (we recall that the matrix norm |·| without any subscript denotes the Hilbert-Schmidt norm). It follows from (5) that |D^2_p f| = |D∇f| for every p ∈ [1, +∞]; in particular, |D^2_p f| is independent of p. Notice also that the rank-one structure of D∇f is a particular case of the celebrated Alberti theorem [Alb93] for vector-valued BV functions, according to which the rank-one structure holds for the singular part of the distributional derivative.
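As an illustration of (15), the sketch below computes the Hessian-Schatten total variation of a CPWL function described by a triangulation (vertex coordinates, triangles as index triples, and vertex values): on each triangle the gradient of the affine interpolant is obtained from a 2×2 linear system, and each interior edge contributes the Euclidean norm of the gradient jump times the edge length. The mesh data structure and the toy example are hypothetical.

import numpy as np

def cpwl_htv(vertices, triangles, values):
    # Hessian-Schatten total variation of a CPWL function on a triangulation:
    # sum over interior edges of |gradient jump| * edge length, cf. (15).
    vertices, values = np.asarray(vertices, float), np.asarray(values, float)
    grads, edge_owner = {}, {}
    for t, (i, j, k) in enumerate(triangles):
        # Gradient of the affine interpolant on triangle (i, j, k):
        # solve [[B-A], [C-A]] g = [f(B)-f(A), f(C)-f(A)].
        A, B, C = vertices[i], vertices[j], vertices[k]
        g = np.linalg.solve(np.array([B - A, C - A]),
                            np.array([values[j] - values[i], values[k] - values[i]]))
        grads[t] = g
        for e in [(i, j), (j, k), (k, i)]:
            edge_owner.setdefault(frozenset(e), []).append(t)
    htv = 0.0
    for e, owners in edge_owner.items():
        if len(owners) == 2:                      # interior edge shared by two triangles
            p, q = (vertices[v] for v in e)
            htv += np.linalg.norm(grads[owners[0]] - grads[owners[1]]) * np.linalg.norm(p - q)
    return htv

# Hypothetical example: the ridge f(x, y) = max(x, 0) on [-1, 1]^2.
V = [(-1, -1), (0, -1), (1, -1), (-1, 1), (0, 1), (1, 1)]
T = [(0, 1, 4), (0, 4, 3), (1, 2, 5), (1, 5, 4)]
F = [max(x, 0.0) for x, _ in V]
print(cpwl_htv(V, T, F))

In the example, a single ridge of unit gradient jump crosses the square along an edge of length 2, so the expected value is 2.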
The following theorem on the density of CPWL functions is the main theoretical result of this paper. Its proof is deferred to Section 4.2. In view of it, notice that, by Lemma 17 together with Proposition 13, if f ∈ L^1_loc((0, 1)^2) has bounded Hessian-Schatten variation in (0, 1)^2, then f ∈ L^∞((0, 1)^2). Also, notice that the statement of the theorem is for p = 1 only. This will be discussed in the forthcoming Remark 22.

Theorem 21 (Density of CPWL functions). Let Ω = (0, 1)^2 and let f ∈ L^1_loc(Ω) with bounded Hessian-Schatten variation in Ω. Then there exists a sequence {f_k}_k ⊆ CPWL(Ω) such that f_k → f uniformly on Ω and lim_k |D^2_1 f_k|(Ω) = |D^2_1 f|(Ω).
Remark 22. We now justify this claim. By Remark 20, it is easy to realize that the two seminorms above coincide for CPWL(Ω) functions, but are, in general, different for arbitrary functions; an explicit example f(x, y) can be exhibited. Now assume by contradiction that there exists a sequence of CPWL functions approximating such an f in the sense of Theorem 21 with respect to both seminorms; this leads to a contradiction, which is absurd. This also gives the same conclusion for |D^2_p ·| in the case p ∈ (1, +∞].
We conjecture that the result of Theorem 21 can be extended to arbitrary dimensions.
Conjecture 1. The density result of Theorem 21 remains valid when the input domain is chosen to be any n-dimensional hypercube, Ω = (0, 1)^n.

4.2. Proof of Theorem 21. This whole section is devoted to the proof of Theorem 21. Remarkably, our proof is constructive and provides an effective algorithm to build such an approximating sequence.
Take f ∈ L^1_loc(Ω) with finite Hessian-Schatten variation. We remark again that indeed f ∈ L^∞(Ω). We notice that we can assume, with no loss of generality, that f is the restriction to Ω of a C^∞_c(R^2) function. This is due to Proposition 18 (and its proof), a cut-off argument, and a diagonal argument. Still, we have to bound the Hessian-Schatten variations only on Ω.
We want to find a sequence {g_K}_K ⊆ CPWL(Ω) such that g_K → f uniformly on Ω and lim sup_K |D^2_1 g_K|(Ω) ≤ |D^2_1 f|(Ω). This will suffice, by lower semicontinuity.
Step 1. Fix now ε > 0 arbitrarily. The proof will be concluded if we find g ∈ CPWL(Ω) such that ||g − f||_{L^∞(Ω)} ≤ C_f ε and |D^2_1 g|(Ω) ≤ |D^2_1 f|(Ω) + C_f ε, where C_f is a constant that depends only on f (via its derivatives, even of second and third order) and that still has to be determined. In what follows we will allow C_f to vary from line to line.
Step 2. We add a bit of notation. Let v, w ∈ S^1 with v ⊥ w, s ∈ R^2 and h ∈ (0, ∞). We call G(v, w, s, h) the grid of R^2 given by the two families of lines {s + t v + j h w : t ∈ R}, j ∈ Z, and {s + j h v + t w : t ∈ R}, j ∈ Z. The grid consists of the boundaries of squares (open or closed) that are called squares of the grid. Vertices of squares of the grid are called vertices of the grid, and the same for edges. Notice that G(v, w, s, h) contains a square with vertex s, and its squares have sides of length h that are parallel either to v or to w.
Step 4. For every N we find two collections of matrices {D^N_k}_k and {U^N_k}_k satisfying, for every k, three properties, referred to below as items (1)-(3). To build such collections, first build orthogonal matrices Ũ^N_k and diagonal matrices D^N_k such that
∇²f(x^N_k) = Ũ^N_k D^N_k (Ũ^N_k)^T,   (16)
where x^N_k is the centre of the square Q^N_k. We can do this thanks to the symmetry of the Hessians of smooth functions.
We denote by R_θ the rotation matrix of angle θ. We set Û^N_k := Ũ^N_k A_k, where A_k is a matrix of the type [0, ±1; ±1, 0] or [±1, 0; 0, ±1], defined in such a way that Û^N_k = R_{θ̃_k} for some θ̃_k ∈ [0, π/2). Notice that (16) still holds with Û^N_k in place of Ũ^N_k. Now notice that points with rational coordinates are dense in S^1 ⊆ R^2, as a consequence of the well-known fact that the inverse of the stereographic projection maps Q into Q². Therefore we can find θ_k ∈ (0, π/2), θ_k ≠ π/4, so close to θ̃_k that items (1) and (2) hold by the construction above, whereas item (3) can be proved taking into account also the smoothness of f. We write U^N_k := R_{θ_k}.

Step 5. By item (3) of Step 4, we take N big enough so that the corresponding supremum is smaller than ε. We suppress the dependence on N in what follows, as from now on N will be fixed. Also, we can, and will, assume 2^{-N} ≤ ε.
Step 6. We consider grids on Q_k, for every k, depending on a free parameter K ∈ N. We recall that Q_k has been defined in Step 3. These grids will be called G^K_k := G(v_k, w_k, s_k, h^K_k), where v_k and w_k denote the two orthogonal directions singled out by U_k in Step 4, h^K_k will be determined in this step, and s_k is any of the vertices of Q_k (the choice of the vertex will not affect the grid).
For every k, we write tan(θ_k) = p_k/q_k, where gcd(p_k, q_k) = 1. We define the spacing h^K_k accordingly and, as U_k is an orthogonal matrix, we obtain a compatibility relation between the grid and Q_k. This ensures that the vertices of Q_k are also vertices of G^K_k. Now notice that lines in G^K_k parallel to v_k intersect the horizontal edges of Q_k in points spaced h^K_k/sin(θ_k), and also lines in G^K_k parallel to w_k intersect the vertical edges of Q_k in points spaced h^K_k/sin(θ_k). We now compute this spacing and notice that it depends only on K (and on N) but not on k.
Step 7. Now we want to build a triangulation of the square Q_k; this triangulation will depend on the free parameter K and will be called Γ^K_k. We will then glue all the triangulations {Γ^K_k}_k to obtain Γ^K, a triangulation of Ω. We call edges and vertices of a triangulation the edges and vertices of its triangles. We refer to Figure 1 for an illustration of the proposed triangulation.
Fix for the moment k. We take the grid G^0_k. By symmetry, we can reduce ourselves to the case θ_k ∈ (π/4, π/2). Indeed, if θ_k ∈ (0, π/4), consider the reflection S against the axis passing through the top left and bottom right vertices of Q_k and replace v_k with −S v_k; this is admissible by what was proved in Step 6. Now we consider the intersections of the lines of G^0_k parallel to v_k with the hypotenuse of T^0_u (these are not, in general, vertices of G^0_k) and the vertices of G^0_k lying on the short sides of T^0_u. Then we triangulate T^0_u in such a way that the vertices of the triangulation on the sides of T^0_u are exactly the points just considered. Any finite triangulation is possible, but it has to be fixed. Now we rotate a copy of T^0_u (together with its triangulation) clockwise by π/2 and we translate it so that the point corresponding to A moves to B. We thus obtain a triangulated triangle T^0_r = BCF. By construction, the triangulation of T^0_r has the following property: its vertices on the hypotenuse correspond to the intersection points of the lines of G^0_k parallel to w_k with the hypotenuse, and its vertices on the short sides are exactly the vertices of G^0_k on the short sides. Then we continue in this fashion to obtain four triangulated triangles, T^0_u, T^0_r, T^0_d, T^0_l, as in the left side of Figure 1. Notice that T^0_u ∪ T^0_r ∪ T^0_d ∪ T^0_l, together with its triangulation, is invariant by rotations of π/2 with centre the centre of Q_k. The remaining part of Q_k is a square which is itself a union of squares, each with sides parallel to v_k or w_k and of length h^0_k. We triangulate it in the standard way, where by standard way we mean the triangulation obtained considering the grid G^K_k (now K = 0) and, for every square of the grid, the diagonal with direction (v_k − w_k)/√2.
This is step 0, and this triangulation will be called Γ^0_k. We show now how to build the triangulation at step K + 1, Γ^{K+1}_k, starting from the one at step K, Γ^K_k; see the right side of Figure 1. At step K + 1, T^{K+1}_u will be the union of two copies of T^K_u, scaled by a factor 1/2, neither rotated nor reflected, but translated so that the vertices corresponding to A correspond to A and M respectively. Also the triangulation of T^K_u is scaled and maintained. We do the same for the other three triangles. The resulting configuration, together with its triangulation, is invariant by rotations of π/2 with centre the centre of Q_k, and whenever an edge σ is not contained in the boundary of Q_k, the vertices of the triangulations on σ coincide exactly with the vertices of G^{K+1}_k on σ, so that we have a well defined triangulation of Q_k that we call Γ^{K+1}_k. Now we define Γ^K as the triangulation of Ω obtained by considering all the triangulations in {Γ^K_k}_k. Notice that, by Step 6, the triangulations in {Γ^K_k}_k can be joined, as their vertices on the boundaries of {Q_k}_k match. Notice that, for every K, the part of Ω that is not triangulated in the standard way is contained in a neighbourhood of the lines of G^N, and this neighbourhood (in Ω) has vanishing area as K → ∞. Therefore, squares of the grid that are triangulated by Γ^K in the standard way and such that also their eight neighbours are triangulated in the standard way by Γ^K eventually cover monotonically Ω, up to the axes of the grid G^N. Notice also that triangles in Γ^K have edges of length smaller than 2^{-N} 2^{-K}. We add here a crucial remark on which we will heavily rely in the sequel and which is also the occasion to introduce the angle θ̄. There exists an angle θ̄ > 0 such that every angle in the triangles of Γ^K is bounded from below by θ̄, uniformly in K (θ̄ depends on the choice of the various triangulations of T^K_u, which, in turn, depend on N, so that θ̄ depends only on N and f). This property is ensured by the self-similarity of the construction, which provides at each step K two families of triangles: those arising from self-similarity and those arising from the bisection of a (tilted) square with sides parallel to those of Q_k, as in Figure 1.
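As a toy illustration of the "standard" triangulation used in Steps 7 and 10, the following sketch triangulates the unit square on an axis-aligned n × n grid by cutting every cell along the same diagonal; it is only meant to convey the idea and deliberately ignores the tilted grids G^K_k and the self-similar corner triangulations of Figure 1.

import numpy as np

def standard_triangulation(n):
    # Triangulate the unit square with an n x n grid of cells,
    # cutting every cell along the same diagonal (cf. the "standard way" above).
    xs = np.linspace(0.0, 1.0, n + 1)
    vid = lambda i, j: i * (n + 1) + j               # vertex index of grid node (i, j)
    vertices = [(xs[i], xs[j]) for i in range(n + 1) for j in range(n + 1)]
    triangles = []
    for i in range(n):
        for j in range(n):
            a, b, c, d = vid(i, j), vid(i + 1, j), vid(i + 1, j + 1), vid(i, j + 1)
            triangles += [(a, b, c), (a, c, d)]      # two triangles per cell
    return vertices, triangles

V, T = standard_triangulation(4)
print(len(V), "vertices,", len(T), "triangles")      # 25 vertices, 32 triangles

Such a mesh, together with sampled vertex values, can for instance be fed to the CPWL sketch given after Remark 20.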
Step 8. For every K, we set g_K as the CPWL interpolant of f according to Γ^K. Recall that CPWL functions on Ω have finite Hessian-Schatten total variation. We can compute |D^2_1 g_K| = |D∇g_K| explicitly: it is concentrated on the jump set of ∇g_K, i.e. on the edges of the triangulation Γ^K (Remark 20).
The computations of Step 9 below ensure that the {g_K}_K are equi-Lipschitz functions, so that it is clear that g_K → f uniformly on Ω as K → ∞. Recall the definition of θ̄ given at the end of Step 7. Some of our estimates depend on θ̄ (see, in particular, the first item below and Step 9), whose value essentially depends on N. Since N has been fixed, depending on ε and the modulus of continuity of ∇²f, we may absorb the θ̄ dependence into the f dependence.
The claim, hence the conclusion, will be a consequence of the following two facts, stated for a closed triangle T of Γ^K, say T ⊆ Q̄_k: (1) the estimate proved in Step 9 holds for the edges of T; (2) the sharper estimate proved in Step 10 holds whenever T does not intersect U_{2·2^{-N}2^{-K}}. Notice that in the first item we have a constant C_f which depends on f, and hence we have to take K big enough so that the contributions of these terms are small enough.
Recall that in our estimates we allow C_f to vary from line to line. We defer the proof of items 1 and 2 to Step 9 and Step 10 respectively; let us now show how to conclude the proof using these facts. Fix for the moment K and k. Consider {T_i}_i, the (finite) collection (depending on K and k, but we will not make this dependence explicit) of all the closed triangles in the triangulation Γ^K that are contained in Q̄_k. Notice that: i) the interiors of the {T_i}_i are pairwise disjoint; ii) if σ is an edge of Γ^K that lies on the boundary of Q_k, then there exists exactly one element of {T_i}_i having σ as an edge (this is due to the fact that we are taking triangles contained in Q̄_k); iii) if σ is an edge of Γ^K that does not lie on the boundary of Q_k, then there exist exactly two elements of {T_i}_i having σ as an edge.
We order the collection {T_i}_i in such a way that T_1, . . ., T_ī are contained in U_{4·2^{-N}2^{-K}} and T_{ī+1}, . . . do not intersect U_{2·2^{-N}2^{-K}}. We compute, using items 1 and 2 and recalling iii) above for what concerns the factor 1/2 in the first line, the desired bound on |D∇g_K| in Q̄_k. Therefore, repeating the procedure for every k and taking into account (17), we can continue our previous computation and conclude this step.

Step 9. We prove item 1 of Step 8. For definiteness, assume that K is fixed. We will heavily use Remark 20 with no explicit reference. Say T = ABC, under the assumption that AB does not lie in the boundary of Ω, so that there exists another triangle T' = ABC' of Γ^K (possibly inside a cube adjacent to Q_k; recall also that the mesh size parameter K is independent of k), so that T and T' have disjoint interiors.
Call a = ∇g_K on T and a' = ∇g_K on T'. Then the mean value theorem gives the corresponding expressions evaluated at suitable points ξ_1, ξ_2 ∈ T. Now notice that, as the angles of ABC are bounded below by θ̄, the matrix with rows B − A and C − A is invertible and its inverse has norm bounded above by c_θ̄/|AB|, for a suitable constant c_θ̄. Also, possibly choosing a larger constant c_θ̄, the bound from below on the angles yields that |BC| ≤ c_θ̄|AB| and |AC| ≤ c_θ̄|AB|. Similar considerations hold for the triangle T'. As c_θ̄ depends only on θ̄, we will absorb this dependence into the f dependence, as announced above.
We rewrite then (18) accordingly; similarly for T', and the right-hand side is bounded above by C_f L²(T), as the angles of T are bounded below by θ̄.
Step 10. We prove item 2 of Step 8. For definiteness, assume that K and k are fixed and that T ⊆ Q_k. We will again heavily use Remark 20 with no explicit reference. Notice that T lies in a closed square of G^K_k and that this square, together with the other squares of G^K_k intersecting it (at the boundary), is triangulated in the standard way, by the assumption that T does not intersect U_{2·2^{-N}2^{-K}}. Notice that the square mentioned before is divided by Γ^K into two triangles. For definiteness, assume that T is the one whose barycentre has smaller y coordinate, the other case being similar. Also, for definiteness, assume that θ_k ∈ (π/4, π/2), the case θ_k ∈ (0, π/4) being similar.
Call T = ACD, where the vertices are named clockwise and the angle at D is π/2. Call B the other vertex of the square of the grid in which T lies. Call E the vertex of Γ^K such that C = (B + E)/2. Call a = ∇g_K on T, a' = ∇g_K on ACB and a'' = ∇g_K on CDE. Finally, call F := (B + D)/2 and ℓ := |AD|. We refer to Figure 2 for an illustration of the introduced notation.
We first estimate the gradient jumps |a − a'| and |a − a''| and then compute the contributions of the corresponding edges to |D∇g_K|; in both computations we recall that 2^{-N} ≤ ε. Summing all three contributions concludes the proof.

Extremal Points of The Unit Ball
Let Ω := (0, 1)^n ⊆ R^n. In this section, we investigate the extremal points of the set {f ∈ L^1_loc(Ω) : |D^2 f|(Ω) ≤ 1}. Notice that elements of the set above are indeed in L^1(Ω), by Proposition 13, as cubes support Poincaré inequalities. In order to carry out our investigation, we will consider a suitable quotient space. We describe now our working setting.
We consider the Banach space L^1(Ω), endowed with the standard L^1 norm. We let A ⊆ L^1(Ω) denote the (closed) subspace of affine functions. Therefore, L^1(Ω)/A, endowed with the quotient norm, is still a Banach space. We call π : L^1(Ω) → L^1(Ω)/A the canonical projection. We define B := π({f ∈ L^1(Ω) : |D^2 f|(Ω) ≤ 1}), where we notice that the |D^2 ·|(Ω) seminorm factorizes to the quotient. Also, by Proposition 13 and standard functional analytic arguments (in particular, the Rellich-Kondrachov Theorem), we can prove that the convex set B is compact. We will then be able to apply the following variant of the Krein-Milman Theorem, for M ⊆ B:
B = cl co(M) if and only if ext(B) ⊆ cl M,   (KM)
where cl denotes the closure in L^1(Ω)/A and co the convex hull. We also set S := π({f ∈ L^1(Ω) : |D^2 f|(Ω) = 1}) and denote by E := ext(B) ∩ π(CPWL(Ω)) the set of CPWL extreme points of B. Thus, B corresponds to the unit ball with respect to the |D^2 ·|(Ω) norm, whereas S corresponds to the unit sphere with respect to the same norm.
Even though we do not have an explicit characterization of extremal points of B, it is easy to establish whether a function g ∈ π(CPWL(Ω)) is extremal or not.
Proof. The "only if" implication follows easily from Lemma 15.
We now prove the converse implication via a perturbation argument; recall Remark 20, whose notation we will use.
Let g ∈ E and let h ∈ B with supp(|D^2 h|) ⊆ supp(|D^2 g|). We have to prove that h ∈ span(g). Assume by contradiction that h ∉ span(g). We call now {P^g_k}_k (resp. {P^h_k}_k) the triangles associated to g (resp. h) and {a^g_k}_k (resp. {a^h_k}_k) the values associated to ∇g (resp. ∇h). As we are assuming supp(|D^2 h|) ⊆ supp(|D^2 g|), we can and will assume that {P^g_k}_k and {P^h_k}_k have the same cardinality and P^g_k = P^h_k for every k, so that we will drop the superscripts g and h on these triangles. Also, we assume that for every k, P_k ⊆ Ω. If we show that c_1 + c_2 = 1, then we have concluded the proof, as this will show that g was not extremal (recall that we are assuming h ∉ span(g)), and hence a contradiction. We prove now the claim. We compute |D^2(g + εh)| and, similarly, |D^2(g − εh)|. Notice now that for every k, k' such that H^1(∂P_k ∩ ∂P_{k'}) > 0, we have a^h_k − a^h_{k'} = λ_{k,k'}(a^g_k − a^g_{k'}) for some λ_{k,k'} with ε|λ_{k,k'}| ≤ 1 (notice that a^g_k = a^g_{k'} implies a^h_k = a^h_{k'}). Therefore, we obtain c_1 + c_2 = 1, which concludes the proof.

We now show that every g ∈ π(CPWL(Ω)) ∩ S belongs to co(E); together with Theorem 21, this will allow us to apply (KM). Fix such a g and let T denote the set of elements of E whose Hessian support is contained in supp(|D^2 g|). Notice that, by Proposition 23 and the fact that g ∈ CPWL(Ω), T is finite (we will show that T ≠ ∅ in Step 1). Also notice that h ∈ T if and only if −h ∈ T, so that we write T = {±t_1, . . ., ±t_ℓ}. We aim at showing that g ∈ co(T); this will conclude the proof.
If g − λ_1 h_1 − λ_2 h_2 ∈ RE, then the proof of this step is concluded. Otherwise we continue in this way; but, by the uniform decay imposed on the Hessian-Schatten total variations, this process must stop, and this eventually forces g − λ_1 h_1 − λ_2 h_2 − . . . − λ_s h_s ∈ RE.
Step 2. We claim that g ∈ span(T). The proof of this fact is identical to the one of Step 1, but we take h_i ∈ T instead of h_i ∈ B. The possibility of doing so is ensured by Step 1 (applied to g, g − λ_1 h_1, . . .), and the process stops when g − λ_1 h_1 − λ_2 h_2 − . . . − λ_s h_s = 0.
Step 3. We consider the finite-dimensional vector subspace V := span(T) ⊆ L^1(Ω)/A, endowed with the subspace topology. Consider also B ∩ V, which is compact and convex, and notice that g ∈ B ∩ V by Step 2. We claim that ext(B ∩ V) ⊆ T. This will conclude the proof by the Krein-Milman Theorem, as in (KM), with T in place of M and B ∩ V in place of B; we are using that T is closed and that cl co(T) = co(T), as T is finite. Take h ∈ ext(B ∩ V) and write h = λ_1 t_1 + . . . + λ_ℓ t_ℓ. Then there exists j ∈ {1, . . ., ℓ} such that supp(|D^2 t_j|) ⊆ supp(|D^2 h|), as supp(|D^2 h|) ⊆ supp(|D^2 g|) and by Step 1 applied to h instead of g. The same perturbation argument as in Proposition 23 shows that, in order for h to be extremal in B ∩ V, we must have h = ±t_j, which concludes the proof.
Theorem 25 (Density of CPWL extreme points). If n = 2, then ext(B) ⊆ cl E. In particular, the extreme points of {f ∈ L^1_loc(Ω) : |D^2 f|(Ω) ≤ 1} are contained in π^{-1}(cl E) (recall that the closure is taken with respect to the quotient topology of L^1(Ω)/A). If Conjecture 1 holds, this is true for any number n of space dimensions.
Proof. By Theorem 21 and the previous discussion, B = cl co(E). Then the claim follows from the Krein-Milman Theorem as recalled in (KM).


Figure 1. An illustration of the proposed triangulation in the square Q_k.

Figure 2. An illustration of the notation introduced in Step 10 of the proof.