When is there a representer theorem?

We consider a general regularised interpolation problem for learning a parameter vector from data. The well-known representer theorem says that under certain conditions on the regulariser there exists a solution in the linear span of the data points. This is at the core of kernel methods in machine learning as it makes the problem computationally tractable. Most literature deals only with sufficient conditions for representer theorems in Hilbert spaces and shows that the regulariser being norm-based is sufficient for the existence of a representer theorem. We prove necessary and sufficient conditions for the existence of representer theorems in reflexive Banach spaces and show that any regulariser has to be essentially norm-based for a representer theorem to exist. Moreover, we illustrate why in a sense reflexivity is the minimal requirement on the function space. We further show that if the learning relies on the linear representer theorem, then the solution is independent of the regulariser and in fact determined by the function space alone. This in particular shows the value of generalising Hilbert space learning theory to Banach spaces.


Introduction
It is a common approach in learning theory to formulate a problem of estimating functions from input and output data as an optimisation problem. Most commonly used is regularisation, in particular Tikhonov regularisation, where we consider an optimisation problem of the form

min_{f ∈ H} E((⟨f, x_i⟩_H)_{i ∈ N_m}, y) + λΩ(f),

where H is a Hilbert space with inner product ⟨·, ·⟩_H, {(x_i, y_i) : i ∈ N_m} ⊂ H × Y is a set of given input/output data with Y ⊆ R, E : R^m × Y^m → R is an error function, Ω : H → R a regulariser and λ > 0 is a regularisation parameter. The use of a regulariser Ω is often described as adding additional information or using previous knowledge about the solution to solve an ill-posed problem or to prevent an algorithm from overfitting to the given data. This makes it an important method for learning a function from empirical data from a very large class of functions. Problems of this kind appear widely, in particular in supervised and semisupervised learning, but also in various other disciplines wherever empirical data is produced and has to be explained by a function. This has motivated the study of regularisation problems in mathematics, statistics and computer science, in particular machine learning [7,12,17].
It is commonly stated that the regulariser favours certain desirable properties of the solution and can thus intuitively be thought of as picking the function that may explain the data and which is the simplest in some suitable sense. This is in analogy with how a human would pick a function when seeing a plot of the data. One contribution of this work is to clarify this view as we show that if the learning relies on the linear representer theorem the solution is in fact independent of the regulariser and it is the function space we chose to work in which determines the solution.
Regularisation has been studied in particular in Hilbert spaces as stated above. There are various reasons for this. First of all, the existence of an inner product allows for the design of algorithms with very clear geometric intuitions, often based on orthogonal projections or on the fact that the inner product can be seen as a kind of similarity measure.
But in fact crucial for the success of regularisation methods in Hilbert spaces is the well known representer theorem which states that for certain regularisers there is always a solution in the linear span of the data points [6,9,16,18]. This means that the problem reduces to finding a function in a finite dimensional subspace of the original function space which is often infinite dimensional. It is this dimension reduction that makes the problem computationally tractable.
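To make this dimension reduction concrete, here is a minimal sketch (not from the text; the Gaussian kernel, the data and all names are illustrative choices): kernel ridge regression in the RKHS of a kernel k, where the representer theorem turns the infinite dimensional problem into an m × m linear system.

```python
import numpy as np

def gaussian_kernel(a, b, sigma=0.5):
    # k(x, y) = exp(-|x - y|^2 / (2 sigma^2)) on scalar inputs
    return np.exp(-np.subtract.outer(a, b) ** 2 / (2 * sigma ** 2))

def kernel_ridge_fit(x, y, lam, kernel=gaussian_kernel):
    # By the representer theorem the minimiser of
    #   sum_i (f(x_i) - y_i)^2 + lam * ||f||_H^2
    # has the form f = sum_i c_i k(x_i, .), so the problem collapses
    # to the m x m linear system (K + lam * I) c = y.
    K = kernel(x, x)
    c = np.linalg.solve(K + lam * np.eye(len(x)), y)
    return lambda t: kernel(t, x) @ c

x = np.array([0.0, 1.0, 2.0, 3.0])
y = np.array([0.0, 0.8, 0.9, 0.1])
f = kernel_ridge_fit(x, y, lam=1e-3)
print(f(x))  # approximately reproduces y for small lam
```

The point is that only the m × m kernel matrix enters the computation, regardless of the dimension of the underlying function space.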
Another reason for Hilbert space regularisation finding a variety of applications is the kernel trick which allows for any algorithm which is formulated in terms of inner products to be modified to yield a new algorithm based on a different symmetric, positive semidefinite kernel leading to learning in reproducing kernel Hilbert spaces (RKHS) [15,17]. This way nonlinearities can be introduced in the otherwise linear setup. Furthermore kernels can be defined on input sets which a priori do not have a mathematical structure by embedding the set into a Hilbert space.
The importance and success of kernel methods and the representer theorem have led to extensions of the theory to reproducing kernel Banach spaces (RKBS) by [22] and representer theorems for learning in RKBS by [21]. Recently [20] published an extensive study on the theory of learning in reproducing kernel Banach spaces, also covering representer theorems. As in the classical Hilbert space literature they prove only sufficient conditions for the existence of representer theorems. As in our previous work [14] they cover the case of uniform Banach spaces.
This motivates the study of the more general regularisation problem

min_{f ∈ B} E((L_i(f))_{i ∈ N_m}, y) + λΩ(f),    (1)

where B is a reflexive Banach space and the L_i are continuous linear functionals on B. We are considering reflexive Banach spaces for two reasons. Firstly they are the fundamental building block of reproducing kernel Banach spaces, as can be seen in Section 5 where we state the most relevant definitions and results from the work of [22]. Secondly we show in Section 4 that for the setting considered reflexivity is the minimal assumption on the space B for which our results can hold. Classical statements of the representer theorem give sufficient conditions on the regulariser for the existence of a solution in the linear span of the representers of the data. These results usually prove that the regulariser being norm-based (Ω(f) = h(‖f‖_B)) is sufficient for the existence of a representer theorem. [1] gave the, to our knowledge, first attempt at proving necessary conditions to classify all regularisers which admit a linear representer theorem. They prove a necessary and sufficient condition for differentiable regularisers on Hilbert spaces. In the author's earlier work [14] this result was extended part way to not necessarily differentiable regularisers on uniformly convex and uniformly smooth Banach spaces.
In this paper we study the question of existence of representer theorems in a more general context where the regulariser does not have to be norm-based but is allowed to be any function of the target function f . By showing necessary and sufficient conditions for representer theorems in this more general setting we show that in fact for a representer theorem to exist the regulariser has to be close to being norm-based.
Moreover, we answer the question of existence of representer theorems in a sense completely, showing that reflexivity is necessary for a result of this kind to hold true. An important consequence of our characterisation of regularisers which admit a linear representer theorem is that one can now prove that in fact the solution does not depend on the regulariser but only on the space the optimisation problem is stated in. This is interesting for two reasons. Firstly it means that we can always pick the regulariser best suited for the application at hand, whether this is computational efficiency or ease of formal calculations. Secondly it further illustrates the importance of being able to learn in a larger variety of spaces, i.e. of extending the learning theory to a variety of Banach spaces.
In Section 2 we will introduce the relevant notation and mathematical background needed for our main results. In particular we will present the relevant results of [1] which justify focusing on the easier to study regularised interpolation problem rather than the general regularisation problem.
Subsequently in Section 3 we will present one of the main results of our work that regularisers which admit a linear representer theorem are almost radially symmetric in a way that will be made precise in the statement. We state and prove two lemmas which capture most of the important structure required and then give the proof of the theorem.
In Section 4 we discuss the consequences of the theorem from Section 3. We prove the other main result of the paper, which states that, if we rely on the linear representer theorem for learning, in most cases the solution is independent of the regulariser and depends only on the function space. We also illustrate why it is clear that we cannot hope to weaken the assumption on the space B any further than reflexivity.
Finally in Section 5 we give some examples of spaces to which our results apply. This section is based on the work of [21,22] on reproducing kernel Banach spaces so we will first be presenting the relevant definitions and results on the construction of RKBS from [22] in this section. We then give a few examples which have been presented in these papers.

Preliminaries
In this section we present the notation and theory used to state and prove our main results. We summarise results which allow us to reduce the problem to study regularised interpolation problems and present the theory of duality mappings required for the proofs of our main results.
Throughout the paper we use N m to denote the set {1, . . . , m} ⊂ N and R + to denote the non-negative real line [0, ∞).
We will assume we have m data points (L_i, y_i) ∈ B* × Y, i ∈ N_m, where B will always denote a reflexive real Banach space and Y ⊆ R. Typical examples of Y are finite sets of integers for classification problems, e.g. {−1, 1} for binary classification, or the whole of R for regression.

Regularised interpolation
As discussed in the introduction we are interested in problems of the form (1), namely

min_{f ∈ B} E((L_i(f))_{i ∈ N_m}, y) + λΩ(f),

where B is a reflexive Banach space. The L_i are continuous linear functionals on B with the y_i ∈ Y ⊆ R the corresponding output data. The functional E : R^m × Y^m → R is an error functional, Ω : B → R a regulariser and λ > 0 is a regularisation parameter. The corresponding regularised interpolation problem is

min{Ω(f) : f ∈ B, L_i(f) = y_i for all i ∈ N_m}.    (2)

[1] show in the Hilbert space case B = H that under very mild conditions this regularisation problem admits a linear representer theorem if and only if the regularised interpolation problem admits a linear representer theorem. This is not surprising as the regularisation problem is more general and one obtains a regularised interpolation problem in the limit as the regularisation parameter goes to zero. More precisely they proved the following theorem for the Hilbert space setting. The proof of this theorem for the general setting of B of this paper is almost identical to the version given in [1] but requires a few adjustments. The full proof is presented in the Appendix.

Theorem 1 Let E be a lower semicontinuous error functional which is bounded from below. Assume further that for some ν ∈ R^m \ {0}, y ∈ Y^m there exists a unique minimiser a_0 ≠ 0 in R of min{E(aν, y) : a ∈ R}.
Assume the regulariser Ω is lower semicontinuous and has bounded sublevel sets.
Then Ω is admissible for the regularised interpolation problem (2) if the pair (E, Ω) is admissible for the regularisation problem (1).
Note that the assumptions on the error function and regulariser presented here are as in the paper by [1]. It is remarked in that paper that other conditions can also be sufficient. The proof of this result follows the earlier mentioned concept that one obtains a regularised interpolation problem as the limit of regularisation problems.
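The limiting argument just mentioned can be sketched as follows; this is a heuristic summary under the assumptions of Theorem 1, not a substitute for the full proof in the Appendix. For λ_n ↓ 0 consider minimisers

```latex
f_n \in \operatorname*{arg\,min}_{f \in B} \; E\big((L_i(f))_{i \in \mathbb{N}_m},\, y\big) \;+\; \lambda_n\, \Omega(f).
```

The bounded sublevel sets of Ω and reflexivity of B give a weakly convergent subsequence f_{n_k} ⇀ f_0, and lower semicontinuity of E and Ω together with the assumption on E force f_0 to satisfy the interpolation constraints and to minimise Ω among all functions satisfying them, i.e. to solve the regularised interpolation problem.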
It is worth noting that the reverse direction of the above theorem does not require any assumptions on the error function or regulariser. In fact we have the following result.

Proposition 1 Let E, Ω be an arbitrary error functional and regulariser satisfying the general assumption that minimisers always exist. Then the pair (E, Ω) is admissible for the regularisation problem (1) if Ω is admissible for the regularised interpolation problem (2).

This allows us to focus on the regularised interpolation problem, which is a lot easier to study, and yet obtain information about the regularisation problem, which is more relevant for application. In particular every representer theorem proved below for regularised interpolation is valid for regularisation problems without any restrictions.

Duality mappings
Through the work by [11,21,22] and our earlier work [14] it has become apparent that the representer theorem is essentially a result about the dual space, a fact we will discuss in more detail in the introduction to the main section of this paper, Section 3. The proofs of our results rely heavily on tangents to balls, which are exactly described by the duality mapping. Hence to generalise our results [14] further we need some definitions and results about duality mappings, which are given in this section.
Remark 1 Recall that a set-valued map is called univocal if its values are singletons everywhere.
In this work we will be considering the case where μ is the identity and the duality mapping an isometry.
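For the reader's convenience, the standard definitions behind this remark (cf. [8]): a gauge function μ : [0, ∞) → [0, ∞) is continuous and strictly increasing with μ(0) = 0 and μ(t) → ∞ as t → ∞, and the duality mapping of V with gauge μ is the set-valued map

```latex
J_\mu(x) \;=\; \left\{\, L \in V^{*} \;:\; L(x) = \mu(\lVert x \rVert)\,\lVert x \rVert, \;\; \lVert L \rVert = \mu(\lVert x \rVert) \,\right\}.
```

For μ the identity this reads J(x) = {L ∈ V* : L(x) = ‖x‖², ‖L‖ = ‖x‖}; in a Hilbert space J is then the canonical isometry x ↦ ⟨x, ·⟩.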
The following properties of the duality mapping are well known and can be found, e.g., in [8] and the references therein.

Proposition 2
For every x ∈ V the set J(x) is non-empty, weakly* closed and convex. Furthermore we have the following equivalences.
(i) J is surjective if and only if V is reflexive.
(ii) J is injective if and only if V is strictly convex.
(iii) J is univocal if and only if V is smooth.
(iv) J is norm-to-weak* continuous exactly at points of smoothness of V.
The following generalised version of the Beurling-Livingston theorem is essential for the proof of our main result. A proof of this theorem can be found in the work by [5], which is very general, deducing the result from a result on multi-valued monotone nonlinear mappings. A more direct proof, giving the interested reader a better idea of the objects occurring in the result, can be found in the work by [3]. Unfortunately there is an issue in the proof in the paper by Blazek; we present a corrected version of it in the Appendix of this paper. The overall intuition of Blazek's proof is correct nonetheless and a moral summary of it can also be found in a paper by [2].
Theorem 2 (Beurling-Livingston) Let V be a real normed linear space with duality mapping J with gauge function μ and W a reflexive subspace of V. Then for any fixed x_0 ∈ V and L_0 ∈ V* the intersection J(x_0 + W) ∩ (L_0 + W^⊥) is non-empty, where W^⊥ denotes the annihilator of W in V*.

Existence of representer theorems
We are now in the position to present the first of the main results of this paper. Throughout this section B will denote a reflexive Banach space with dual space B* and duality mapping

J(f) = {L ∈ B* : L(f) = ‖f‖²_B, ‖L‖_{B*} = ‖f‖_B}.

It is well known that this mapping is surjective for reflexive Banach spaces but need not be injective or univocal. It is injective if and only if the space is strictly convex and univocal if and only if the space is smooth. As argued in Section 2.1, rather than studying the general regularisation problem (1) we can consider the regularised interpolation problem (2), namely

min{Ω(f) : f ∈ B, L_i(f) = y_i for all i ∈ N_m}.

If B is a space of functions the L_i may be point evaluations, in which case the problem reduces to usual function evaluations at data points x_i, but the framework also allows for other linear functionals such as local averages of the form f ↦ ∫ f dP, where P is a probability measure on B.
Our goal is to classify all regularisers for which there exists a linear representer theorem. As stated in the author's earlier work [14], various previous works by [11,21,22] as well as our own work indicate that the representer theorem at its core is actually a result about the dual space. [20] connect their representation to the Gâteaux derivative of the solution, which in turn is represented by the semi-inner product inducing the norm and thus also rooted in the dual space. This is not in contradiction to the classical Hilbert space setting, but it does not become apparent there, as in a Hilbert space the dual element is the element itself and so the statement appears to be rooted in the space itself. [21,22] already formulated the representer theorem for reproducing kernel Banach spaces using the semi-inner product. Both the inner product in a Hilbert space and the semi-inner product in a reflexive Banach space represent the duality mapping of the space. Their work already indicated that the representer theorem should be thought of as rooted in the dual space, and our work further supports this viewpoint. Since in a general reflexive Banach space there need not be a unique semi-inner product and the duality mapping is only represented by the collection of all semi-inner products, we found working with semi-inner products in this case less convenient and work with the duality mapping directly. We formulate the representer theorem in terms of dual elements of the data, as we did previously [14]. In contrast to our previous work, in this paper the space might not be smooth, so we also need to account for the duality mapping potentially not being univocal. We call regularisers which always allow a solution which has a dual element in the linear span of the linear functionals defining the problem admissible, as is made precise in the following definition.

Definition 2 A regulariser Ω : B → R is called admissible if for any m ∈ N and any data (L_i, y_i) ∈ B* × Y, i ∈ N_m, for which the interpolation constraints can be satisfied, the regularised interpolation problem (2) admits a solution f_0 such that J(f_0) ∩ span{L_i : i ∈ N_m} ≠ ∅.
Remark 2 Note that classically the representer theorem refers to the fact that the regularisation problem has a solution of the form (3). Thus Definition 2 says that a regulariser is admissible if and only if it admits a representer theorem. In Theorems 3 and 4 below we prove necessary and sufficient conditions for admissibility of a regulariser and thus for the regulariser to admit a representer theorem. As such the results describe completely the class of regularisers for which a representer theorem exists.
It is well known that being a non-decreasing function of the norm on a Hilbert space is a sufficient condition for the regulariser to be admissible. By a Hahn-Banach argument similar as, e.g., in [21] this generalises to this notion of admissibility for reflexive Banach spaces.
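A sketch of that Hahn-Banach argument, in the notation of Section 2 (our summary of the standard reasoning, not a verbatim quote of [21]): let Z = ⋂_{i ∈ N_m} ker(L_i) and let f_0 be a minimal norm interpolant, which exists by reflexivity. Then ‖f_0‖_B = dist(0, f_0 + Z), so Hahn-Banach yields L ∈ B* with ‖L‖ = 1, L|_Z = 0 and L(f_0) = ‖f_0‖_B, and therefore

```latex
\lVert f_0 \rVert_B \, L \;\in\; J(f_0) \,\cap\, Z^{\perp} \;=\; J(f_0) \,\cap\, \operatorname{span}\{ L_i : i \in \mathbb{N}_m \}.
```

Since h is nondecreasing, f_0 also minimises Ω = h(‖·‖_B) subject to the interpolation constraints, so norm-based regularisers are admissible.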
We want to show that an admissible regulariser is in a sense almost radially symmetric (norm-based), similar to our previous work [14]. Here by "almost" we mean that the function can only fail to be radially symmetric in a constrained way, in at most countably many jump discontinuities. A precise statement can be found in our previous work [14]. The proof strategy is similar to before [14] but it will turn out that in particular a lack of strict convexity makes the situation a lot more delicate to deal with. We will begin by showing that admissible regularisers are still nondecreasing along tangents in Banach spaces which are not strictly convex, but in a weaker form than for uniform Banach spaces. Subsequently we will explore to what extent this weaker tangential bound still implies radial symmetry.
Lemma 1 A function Ω : B → R is admissible if and only if for every exposed face of the ball Ω attains its minimum in at least one point, and for every f in the face where the minimum is attained, every L ∈ J(f) exposing the face and every f_T ∈ ker(L) we have Ω(f + f_T) ≥ Ω(f).

We are going to refer to the points that Lemma 1 applies to as admissible points.
Note that this in particular means that every exposed point is admissible and the bound applies to every functional exposing it. Further, if the point is rotund then the lemma applies to every functional attaining its norm at the point.
Proof of Lemma 1 Part 1: (Ω admissible ⇒ nondecreasing along tangential directions) Fix any f ∈ B and consider, for L ∈ J(f) arbitrary but fixed, the regularised interpolation problem

min{Ω(g) : g ∈ B, L(g) = ‖f‖²}.

Since Ω is admissible there exists a solution f_0 such that c·L ∈ J(f_0) for some c ∈ R. Now if there does not exist g ∈ B with g ≠ f such that L ∈ J(g), then this f_0 can only be f itself, as in the case of uniform Banach spaces [14]. Thus for any f_T ∈ ker(L) also L(f + f_T) = L(f) = ‖f‖², so f + f_T also satisfies the constraint and hence Ω(f + f_T) ≥ Ω(f). But if there exists g ≠ f in B such that L ∈ J(g) we have no way of making a statement about how Ω(f) and Ω(g) compare. All we can say is that in this face containing f and g there is at least one point where the minimum of Ω is attained. It is clear that for any of those minimal points the above discussion is true for L exposing the face, so that we obtain the claimed tangential bound.
Part 2: (Nondecreasing along tangential directions ⇒ Ω admissible) Conversely fix any data (L_i, y_i) ∈ B* × Y for i ∈ N_m such that the constraints can be satisfied. Let f_0 be a solution to the regularised interpolation problem. If span{L_i : i ∈ N_m} ∩ J(f_0) ≠ ∅ we are done, so assume not. We let

Z = ⋂_{i ∈ N_m} ker(L_i)

and claim that there exist f_T ∈ Z and L̂ ∈ span{L_i : i ∈ N_m} such that L̂ ∈ J(f_0 + f_T). To see that this is true choose V = B and W = Z in the Beurling-Livingston theorem (Theorem 2). Since Z is a closed subspace of a reflexive space it is itself reflexive.

Further choose x_0 = f_0 and L_0 = 0. Then the theorem says that there exist f_T ∈ Z and L̂ ∈ Z^⊥ = span{L_i : i ∈ N_m} such that L̂ ∈ J(f_0 + f_T). Thus there exists f̂ = f_0 + f_T which satisfies the interpolation constraints and such that J(f̂) ∩ span{L_i : i ∈ N_m} ≠ ∅. If f_0 + f_T is exposed by L̂ then the tangential bound applies and so f̂ is a solution of the regularised interpolation problem. If on the other hand f_0 + f_T is not exposed by L̂, then it is contained in a face exposed by L̂. But then for any f_T' ∈ B such that f̂ + f_T' is still contained in this face we have that L̂ ∈ J(f_0 + f_T + f_T') and f_T' ∈ ker(L̂), so that f_0 + f_T + f_T' satisfies the interpolation constraints. We can thus choose f_T' such that f_0 + f_T + f_T' is a minimum of Ω in the face; the tangential bound hence applies to it. Thus similarly to before Ω(f_0 + f_T + f_T') ≤ Ω(f_0) and f_0 + f_T + f_T' is a solution of the regularised interpolation problem of the desired form.
This illustrates why strict convexity is the crucial property determining the type of result we can obtain. If the space is strictly convex then every point is rotund and thus exposed. This means every point is admissible and we are in a situation similar to before. We are thus first going to discuss this case, before looking at what can be said when the space is not strictly convex.

Strictly convex spaces
Since in a strictly convex space every point is exposed, every point is admissible and the tangential bound from Lemma 1 applies everywhere. We thus are able to obtain results in exactly the spirit of our previous work [14].

Lemma 2 If Ω : B → R is admissible then for any fixed f̂ ∈ B we have Ω(f) ≥ Ω(f̂) for all f ∈ B such that ‖f‖_B > ‖f̂‖_B.
Proof Since the space is assumed to be strictly convex every point is exposed. The space may not be smooth in which case the duality mapping J is not univocal but for a non-smooth, rotund point f every L ∈ J (f ) exposes it. Thus lemma 1 applies to all points f ∈ B and all functionals L ∈ J (f ). We thus do not need to worry about whether or not a point is an exposed point and whether it is exposed by a given functional attaining its norm at the point. This means we can follow the same general idea of argumentation as we did in our previous work [14].

Part 1: (Bound Ω on the half spaces given by the tangent planes through f̂)

We start by showing that Ω is radially nondecreasing by moving out along a tangent and back along another tangent to hit any point along the ray λ·f̂ for λ > 1. Via the tangents at those points this again immediately gives the bound for all half spaces spanned by a tangent plane through f̂ given by some L ∈ J(f̂), which might be more than one as J is possibly not univocal. This is illustrated in Fig. 1.
We fix some f̂ ∈ B and λ > 1 and set f = λ·f̂. To show that Ω(f) ≥ Ω(f̂) fix any L_1 ∈ J(f̂) and f_T ∈ ker(L_1) and set

f_t = f̂ + t·f_T for t ∈ [0, ∞),

and we now need to show that there exists t_0 such that there exists L_{t_0} ∈ J(f_{t_0}) with f − f_{t_0} ∈ ker(L_{t_0}), i.e. such that the tangent at f_{t_0} given by L_{t_0} passes through f. To show that such t_0 indeed exists we will consider choices L_t ∈ J(f_t) for every t. Note first that by definition of f_t

L_t(f − f_t) = 0 ⟺ λ·L_t(f̂) = ‖f_t‖²,    (4)

which gives us an equivalent condition to find a suitable t_0.

We now define the set-valued function F : [0, ∞) → P(R) by

F(t) = {λ·L(f̂) : L ∈ J(f_t)}.

Fig. 1 We can extend the tangential bound to the ray λ·f̂ by finding the point f_t along the tangent from where the tangent to f_t hits the desired point f = λ·f̂ on the ray. Via the tangents to points along the ray the bound then extends to the shaded half space

By Proposition 2, J(f) is non-empty, weakly* closed and convex for every f ∈ B, so the value of F(t) is either a single value or an interval in R.

It is known that if B is smooth then J is univocal and norm-to-weak* continuous so that F is clearly continuous. We show that if B is not smooth the function F is still almost continuous in the sense that in any jump the function is interval valued and the interval connects both ends of the jump. To show this fix an arbitrary t ∈ [0, ∞), let s → t and let L be a weak* cluster point of functionals L_s ∈ J(f_s). We want to show that this L is indeed contained in J(f_t). By standard results (c.f. [4], Proposition 3.13 (iv)) we know that

L(f_t) = lim_{s→t} L_s(f_s) = lim_{s→t} ‖f_s‖² = ‖f_t‖².    (5)

Further ‖L‖ ≤ lim inf_{s→t} ‖L_s‖ = ‖f_t‖ (c.f. [4], Proposition 3.13 (iii)) and thus

‖L‖ ≥ L(f_t)/‖f_t‖ = ‖f_t‖.    (6)

Putting (5) and (6) together gives ‖L‖ = ‖f_t‖, which together with (5) shows that indeed L ∈ J(f_t).

But this means that for s → t and any choice of F(s) where F is not single valued there exists x ∈ F(t) such that F(s) → x. This proves the claim that F is "effectively continuous", in the sense that whenever the function would have a jump it is set valued and its interval value closes the gap between either side of the jump. This means that an intermediate value theorem holds for the function F.

Going back to (4) we see that it is satisfied if and only if ‖f_{t_0}‖² ∈ F(t_0). For t = 0, i.e. f_0 = f̂, we have F(0) = {λ‖f̂‖²} and λ‖f̂‖² > ‖f̂‖² = ‖f_0‖². On the other hand, as t → ∞ we have λ‖f̂‖ < ‖f_t‖ for t large enough and thus λ·L_t(f̂) ≤ λ‖f̂‖·‖f_t‖ < ‖f_t‖² for large t. Since ‖f_t‖² is continuous in t and the intermediate value theorem holds for F, there exists a t_0 such that ‖f_{t_0}‖² ∈ F(t_0), which means that there exists L_{t_0} ∈ J(f_{t_0}) such that (4) is satisfied. For this t_0 indeed

Ω(f) = Ω(f_{t_0} + (f − f_{t_0})) ≥ Ω(f_{t_0}) = Ω(f̂ + t_0·f_T) ≥ Ω(f̂).

Part 2: (Extend the bound around the circle)
The fact that we can extend the bound around the circle is clear by the same argument as in our previous work [14]. The idea is that we can repeatedly move along tangents around the circle without moving too far away from it, as illustrated in Fig. 2.

For points of smoothness of the norm we already showed [14] that if we take small enough steps along tangents we can get all the way around the circle without getting too far away from it. In points of non-smoothness we have more than one tangent to the ball. But as the tangential bound on Ω holds for every tangent it is obviously always possible to choose a tangent which stays arbitrarily close to the circle.
Seeing that this result is effectively the same as what we proved for uniform Banach spaces [14] it is not surprising that the main result describing admissible regularisers for strictly convex Banach spaces is the same as for uniform Banach spaces. We can obtain the same closed form characterisation as before, saying that admissible regularisers are almost radially symmetric.

Theorem 3 A function Ω : B → R is admissible if and only if it is of the form

Ω(f) = h(‖f‖_B)

for some nondecreasing h : [0, ∞) → R whenever ‖f‖_B ≠ r for all r ∈ R. Here R is an at most countable set of radii where h has a jump discontinuity. For any f with ‖f‖_B = r ∈ R the value Ω(f) is only constrained by the monotonicity property, i.e. it has to lie in between lim_{s↗r} h(s) and lim_{s↘r} h(s).

The proof given in our previous work [14] for the analogue of this theorem (Theorem 3.2 [14]) is in fact still entirely valid. We thus only comment briefly on a few important points. Note in particular that for any f_T ∈ ker(L) with L ∈ J(f) we have L(f + f_T) = L(f) = ‖f‖² and so ‖f‖ ≤ ‖f + f_T‖. By strict convexity the inequality is in fact strict so that the bound for the mollification in part 2 of Theorem 3.2 from [14] remains valid. It is also clear that part 1 of the proof of Lemma 1 holds for f = 0 so 0 is an admissible point. Thus Ω is without loss of generality minimised at 0 with Ω(0) = 0. All other parts of the proof of Theorem 3.2 are also clearly still valid.

Fig. 2 By repeatedly taking small steps along tangents we can move all the way around the circle

Non-strictly convex spaces and l 1
Obtaining a general, closed form geometric interpretation of the tangential bound as we presented above is very difficult for spaces which are not strictly convex. This is due to the large geometric variety of Banach spaces, making it very hard to make any statements about the shape of the unit ball, even locally. We can, e.g., construct a Banach space with a rotund point such that no point in its neighbourhood is rotund. Similarly a convex function on R may not be differentiable on a countable dense subset (c.f., e.g., [10,13]), so also smoothness does not allow statements about surrounding points. Even worse, there might not even be any exposed point, e.g. the space c_0 does not contain exposed points. We are thus going to restrict our attention to the space that is most commonly used in applications, l¹_n. The space l¹ is only reflexive if it is finite dimensional, but in applications we often do computations in a finite truncation of l¹, so that this is an interesting case to consider.

Fixing the space to be a concrete example does remove the issue of geometric variety and it turns out that this allows us to run an argument similar to the one presented in Section 3.1.

Lemma 3 If for every exposed face of the norm ball in l¹_n Ω attains its minimum in at least one point, and for every f in the face where the minimum is attained, every L ∈ J(f) exposing the face and every f_T ∈ ker(L) we have Ω(f + f_T) ≥ Ω(f), then for any fixed admissible f̂ ∈ l¹_n we have that Ω(f) ≤ Ω(f̂) for all f ∈ l¹_n such that ‖f‖ < ‖f̂‖.

Proof Part 1: (Bound Ω on the half spaces given by the tangent planes through f̂)

As in Section 3.1 we first show that Ω is radially nondecreasing. Notice that in l¹_n every vertex e_i = (0, . . . , 0, 1, 0, . . . , 0) of the unit ball is an exposed point and hence admissible. If we fix an admissible f̂ ∈ l¹_n then it is either one of the vertices e_i, or a minimum within a face which is the convex hull of several e_i.

If f̂ is one of the vertices, e_k say, then it is clear that there exists a linear functional L̂ exposing it so that along the tangent given by L̂ we can reach a different vertex λ_0·e_j for some λ_0 > 1. Now by the same argument we can find a tangent in the reverse direction, connecting λ_0·e_j to λ_1·e_k, λ_1 > λ_0. It is clear that we can control the size of λ_0 and λ_1 to hit the desired point λf̂ for 1 < λ.

If on the other hand f̂ is the minimum within a face F exposed by the linear functional L̂ then f̂ satisfies the tangential bound for L̂. Thus from f̂ we can reach any vertex, e_k say, on the boundary of the face F along a tangent given by L̂. It is clear that e_k, being an exposed point, has a tangent plane which is close to the face F, so that we can reach the minimum in the face λF for 1 < λ. Note that this minimum might not be λf̂. This is illustrated in Fig. 3.

Combining both arguments we see that the minimum of Ω within a given face is a nondecreasing function of the norm. Clearly with the tangent planes of the minima we get the same bound for any half space spanned by a tangent plane at f̂ as in Section 3.1.

Part 2: (Extend the bound around the circle)
The fact that we can extend the bound around the circle in the same way as previously is clear from the arguments in part 1. We already noticed that if f̂ is within a face we can reach any vertex on the boundary of the face. We further know that from a vertex we can get across any face containing it to another vertex while staying arbitrarily close to the face, as illustrated in Fig. 4. Hence it is clear that we can reach any admissible f with ‖f‖ > ‖f̂‖.
Putting both observations together we get the claim.
This proof illustrates that in the case of l¹_n it may be convenient to view Ω as a function of the faces of the norm ball. In other words we are thinking of the faces as being collapsed to one point where Ω is minimised. Viewed as a function of the faces, Ω is indeed almost radially symmetric again.

Theorem 4 A function Ω : l¹_n → R is admissible if and only if, viewed as a function of the faces of the norm ball, it is almost radially symmetric in the sense of Theorem 3, i.e. the minimum of Ω in each face is given by h(‖·‖) for some nondecreasing h : [0, ∞) → R, up to at most countably many jump discontinuities. Moreover in points of continuity of h the function Ω attains its minimum in a face F in every exposed point within the face.
Proof The proofs for uniform Banach spaces and strictly convex Banach spaces remain largely valid, only few extra considerations are required. We are going to briefly discuss sections which remain valid and present in full any extra arguments which are required.
Firstly, the fact that continuity in radial direction implies radial symmetry of Ω is clear since we only need to consider admissible points and for two admissible points f and g the previous argument obviously holds.
For this observation to be useful we need to verify again that the radially mollified regulariser is still admissible. More precisely we check that the mollified regulariser Ω̃ is still non-decreasing along tangential directions, i.e. we need to show that for an admissible f, for any L ∈ J(f) exposing the face containing f and every f_T ∈ ker(L), we still have Ω̃(f + f_T) ≥ Ω̃(f). The previous proof was based on the fact that ‖f + f_T‖ > ‖f‖. Whenever this is true the proof holds, so we only need to check the case when ‖f + f_T‖ = ‖f‖. But in this case the tangential bound holds for each translate appearing in the mollification integral. Since ρ is positive that means the integrand of Ω̃(f + f_T) is greater than or equal to the integrand of Ω̃(f), so that the property of being nondecreasing along all tangents is indeed preserved. Putting these two observations together we obtain the result. We know that as a function of the faces Ω is a monotone function of the norm, so a monotone function on the real line. After mollification Ω̃ is in fact radially symmetric. The same considerations as before say that Ω must have been of the claimed form.
The converse is clear since the value of Ω is defined to be the minimum across each face, so minima exist and clearly satisfy the tangential bound.
For the moreover part let F be a face of the unit ball and g a minimum of Ω in F. Assume further that h is continuous at g. Fix a vertex e_j in F. Then there clearly exists a tangent from λe_j to g for 1 − ε < λ < 1 and thus Ω(λe_j) ≤ Ω(g). By continuity of h we have Ω(λe_j) → Ω(e_j) as λ → 1, and so Ω(e_j) ≤ Ω(g).
Since g is the minimum of Ω in F this means Ω(e_j) = Ω(g). This shows that an intuition very similar to the results for strictly convex spaces is indeed true for l^1_n. Moreover it shows that any admissible regulariser on l^1_n attains its minimum at the vertices, which is exactly the reason for the use of such spaces in applications.

Consequences and optimality
Just as presented in the authors' earlier work [14], an important consequence of the above results is that, if one relies on the representer theorem for learning, the solution of the regularised interpolation problem in most cases does not depend on the regulariser but is determined by the function space alone. This has two important consequences. Firstly, it means we are free to work with whatever regulariser is most convenient for our purpose, whether that is computational application or proving theoretical results. Secondly, it illustrates the importance of extending well-established learning methods for Hilbert spaces to Banach spaces, to allow for a greater variety of spaces to learn in.
In this section we discuss this fact and also illustrate why one cannot hope to weaken the assumption on the function space any further than reflexivity.

The solution is determined by the space
Throughout this section we say that a function f_0 is a representer theorem solution of (2) if it is a solution of (2) in the sense of definition 2, i.e. such that there exist ĉ_1, …, ĉ_m ∈ R with Σ_{i=1}^m ĉ_i L_i ∈ J(f_0). To prove the above claim that the solution is often independent of the regulariser, we show that in most cases a function f_0 is a representer theorem solution of (2) if and only if it is a solution of the minimal norm interpolation problem

min { ‖f‖ : f ∈ B, L_i(f) = y_i for all i ∈ N_m }. (8)

This follows by combining the above results with a result by [11]. They consider the minimal norm interpolation problem for a general Banach space X. Under the assumption that X is reflexive they prove a necessary and sufficient condition for a function to be a solution of this problem.
This corresponds to h(t) = t in (2). We now get the following theorem.
Theorem 5 Let B be a reflexive Banach space and Ω admissible. Then any representer theorem solution of (2) is a solution of (8).
Moreover for any solution of (8) there exists a representer theorem solution of (2) in the same face of the norm ball. Thus in particular if B is strictly convex then f 0 is a representer theorem solution of (2) if and only if it is a solution of (8).

Proof Part 1: (A solution of (2) is a solution of (8))
Assume that f_0 is a representer theorem solution of (2). Then f_0 satisfies the interpolation constraints, and since the admissible regulariser is non-decreasing in the norm, f_0 is also a solution of (8). Part 2: (A solution of (8) yields a representer theorem solution of (2)) If f_0 is a solution of (8) which is an admissible point in the sense of definition 2, then the tangential bound of lemma 1 applies and so f_0 is a representer theorem solution of (2). If f_0 is not admissible in the sense of definition 2, then there exists an admissible point in the same face for which the above inequality holds, so that this point is a representer theorem solution of (2).
If B is strictly convex then every point is admissible and f 0 is a representer theorem solution of (2) if and only if it is a solution of (8).
This result shows that for any admissible regulariser on a reflexive, strictly convex Banach space, the set of solutions with a dual element in the linear span of the defining linear functionals is identical. This in particular means that it is the choice of the function space, and only the choice of the space, which determines the solution of the problem. We are thus free to work with whichever regulariser is most convenient in applications. Computationally, in many cases this is likely to be ½‖·‖². For theoretical results other regularisers may be more suitable, such as in the aforementioned paper by [11], which relies heavily on a duality between the norm of the space and its continuous linear functionals.
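The independence of the solution from the regulariser can be illustrated numerically in the simplest setting, a finite-dimensional Euclidean (Hilbert) space. The sketch below is a toy example with assumed names and sizes: the data functionals are the rows of a random matrix A, the regularisers are h(‖w‖) for two different non-decreasing h, and in both cases the constrained minimiser is the minimal norm interpolant.

```python
import numpy as np

rng = np.random.default_rng(0)
m, n = 3, 8                      # m interpolation constraints in an n-dim space
A = rng.standard_normal((m, n))  # rows play the role of the data functionals L_i
y = rng.standard_normal(m)

# Minimal norm interpolant: the unique solution of the analogue of (8)
w_star = A.T @ np.linalg.solve(A @ A.T, y)

# All interpolants are w_star + N z, with the columns of N spanning ker(A)
_, _, Vt = np.linalg.svd(A)
N = Vt[m:].T

# Two different non-decreasing regularisers h(||w||): both are minimised,
# over the interpolation constraints, by the same w_star
for h in (lambda t: t ** 2, lambda t: np.cosh(t)):
    for _ in range(100):
        w = w_star + N @ rng.standard_normal(n - m)
        assert np.allclose(A @ w, y)  # w still interpolates the data
        assert h(np.linalg.norm(w)) >= h(np.linalg.norm(w_star)) - 1e-12
```

The key geometric fact is that w_star is orthogonal to ker(A), so any competing interpolant has strictly larger norm, and hence a larger value of every non-decreasing h of the norm.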
For a reflexive Banach space which is not strictly convex, the solution is also mostly determined by the space; the regulariser only determines which point(s) within a certain face of the norm ball are optimal. The face containing the solution is again independent of Ω.

Reflexivity is necessary
The fact that proposition 3 is an if and only if suggests that one cannot do better than reflexivity in the assumptions on the space without weakening other assumptions. And indeed this is the case. The duality mapping J is surjective if and only if the space X is reflexive. Thus in a non-reflexive Banach space we can find L_i which are not the image of any element of X under the duality mapping. In this case there is no hope of finding a solution in the sense of definition 2.
As an example consider X = l^1 with X* = l^∞. Let L_1 = (x_i)_{i∈N} where x_i = i/(i+1) for i odd and x_i = 0 for i even, and L_2 = (y_i)_{i∈N} where y_i = i/(i+1) for i even and y_i = 0 for i odd, i.e.

L_1 = (1/2, 0, 3/4, 0, 5/6, 0, …), L_2 = (0, 2/3, 0, 4/5, 0, 6/7, …).
Then ‖L_1‖ = ‖L_2‖ = 1, but there cannot be an l^1-sequence of norm 1, x say, such that L_1(x) = 1 or L_2(x) = 1, since all the coefficients i/(i+1) are strictly smaller than 1. So L_1, L_2 ∉ J(X). It is also clear by construction that the same is true for nonzero linear combinations of L_1 and L_2, so span{L_1, L_2} ∩ J(X) = {0}. This means there is no hope of finding a solution in the sense of definition 2 with a dual element in the linear span of the defining linear functionals.
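The norm-attainment failure in this example can also be checked numerically: every coefficient of L_1 is bounded away from 1 on any truncation, so |L_1(x)| < ‖x‖_1 for every norm-one x. This is only an illustration on a finite truncation (the truncation length and sample count below are arbitrary choices), not a proof.

```python
import numpy as np

# Truncation of L_1 = (1/2, 0, 3/4, 0, 5/6, 0, ...): entries i/(i+1) for odd i
coeff = np.array([i / (i + 1) if i % 2 == 1 else 0.0 for i in range(1, 2001)])
assert coeff.max() < 1.0  # sup of the coefficients is 1 but is never attained

# |L_1(x)| <= (max_i coeff_i) * ||x||_1 < 1 for every x with ||x||_1 = 1,
# so L_1 cannot attain its norm and hence does not lie in J(X)
rng = np.random.default_rng(1)
for _ in range(1000):
    x = rng.standard_normal(2000)
    x /= np.abs(x).sum()          # normalise to ||x||_1 = 1
    assert abs(coeff @ x) < 1.0
```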

Examples
In this section we give several examples of Banach spaces to which the results in this paper apply. These examples are taken from the work of [21,22], in which the theory of reproducing kernel Banach spaces (RKBS) is developed. This generalises the well-known theory of reproducing kernel Hilbert spaces, providing several advantages which we discuss throughout this section. We begin by stating some of the key definitions and results for RKBS.
We call a reflexive Banach space B of functions on X a reproducing kernel Banach space if its dual B* is isometrically isomorphic to a Banach space of functions on X and the point evaluations are continuous linear functionals on both B and B*. It is convenient to view an element f* ∈ B* as a function on X by identifying it via the isometry with a function on X and simply writing f*(x).
With this definition one obtains a reproducing property reminiscent of reproducing kernel Hilbert spaces. It now turns out that there is a convenient way of constructing reproducing kernel Banach spaces. Theorem 7 Let W be a reflexive Banach space with dual space W* and let Φ : X → W and Φ* : X → W* be maps such that span Φ(X) = W and span Φ*(X) = W*. Then there exists an RKBS B which is isometrically isomorphic to W, given by the functions x ↦ (w, Φ*(x))_W for w ∈ W, with dual space B* isometrically isomorphic to W* and given by the functions x ↦ (Φ(x), v)_W for v ∈ W*. The reproducing kernel is given by K(x, y) = (Φ(x), Φ*(y))_W.
As an example of these constructions consider the following example given by [22]. Let X = R, I = [−1/2, 1/2] and W = L^p(I) with 1 < p < ∞ and dual W* = L^q(I), 1/p + 1/q = 1, with feature maps Φ(x)(t) = e^{i2πxt} and Φ*(x)(t) = e^{−i2πxt} and kernel K(x, y) = ∫_I e^{i2π(x−y)t} dt. The duality pairing is given by (u, v)_W = ∫_I u(t)v(t) dt. For p = q = 2 this construction corresponds to the usual space of bandlimited functions. For other values of p we maintain the property of a Fourier transform with bounded support but consider a different L^p norm, making B isometrically isomorphic to L^p(I).
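For p = q = 2 the kernel of this band-limited construction can be evaluated in closed form; assuming the standard Fourier feature maps on I = [−1/2, 1/2], it is the sinc kernel K(x, y) = sin(π(x − y))/(π(x − y)). A quick numerical check by quadrature:

```python
import numpy as np

def kernel_by_quadrature(x, y, n=20001):
    """K(x, y) = integral over I = [-1/2, 1/2] of e^{i 2 pi (x - y) t} dt,
    evaluated with the trapezoidal rule on n points."""
    t = np.linspace(-0.5, 0.5, n)
    f = np.exp(2j * np.pi * (x - y) * t)
    return ((f.sum() - 0.5 * (f[0] + f[-1])) * (t[1] - t[0])).real

# The integral agrees with the normalised sinc sin(pi z)/(pi z), z = x - y
for x, y in [(0.0, 0.0), (1.3, 0.4), (2.0, -1.5)]:
    assert abs(kernel_by_quadrature(x, y) - np.sinc(x - y)) < 1e-6
```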
Since, unlike Hilbert spaces of the same dimension, the L^p(I) spaces are not isomorphic to each other, they exhibit a richer geometric variety which is potentially useful for the development of new learning algorithms.
Note that the above example is one-dimensional for notational simplicity; similar constructions yield RKBS isomorphic to L^p_μ(R^d), where μ is a finite positive Borel measure on R^d, as shown in [21]. The corresponding RKBS B consists of functions of the form f_u(x) = ∫ u(t) e^{i x·t} dμ(t) for u ∈ L^p_μ(R^d), and the reproducing kernel is given by K(x, y) = ∫ e^{i(x−y)·t} dμ(t). For d = 1 and μ the Lebesgue measure on [−1/2, 1/2] this reduces to the above example. The dual map in L^p spaces is given by f* = f|f|^{p−2}/‖f‖_p^{p−2}, which in the given example means that for an element f_u ∈ B the corresponding dual element is obtained by applying this map to u. Further, the dual map in a reflexive Banach space is self-inverse, so (f*_u)* = f_u. These constructions are of interest for various reasons. Firstly, they allow us to learn in a larger variety of function spaces, which may be of use if we expect the solution in a certain class due to prior knowledge, or if we fail to find a good enough solution in a Hilbert space, or if the data has some intrinsic structure that makes it impossible to embed into a Hilbert space.
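The stated properties of the L^p dual map, namely that f* = f|f|^{p−2}/‖f‖_p^{p−2} has conjugate norm ‖f*‖_q = ‖f‖_p, pairs to f*(f) = ‖f‖_p², and is self-inverse, can be verified numerically in a finite-dimensional l^p space. A small sketch (the vector size and exponent are arbitrary choices):

```python
import numpy as np

def dual(f, p):
    """Dual map f -> f |f|^{p-2} / ||f||_p^{p-2}, as stated in the text."""
    norm_p = np.sum(np.abs(f) ** p) ** (1 / p)
    return f * np.abs(f) ** (p - 2) / norm_p ** (p - 2)

p = 3.0
q = p / (p - 1)  # conjugate exponent, 1/p + 1/q = 1

rng = np.random.default_rng(2)
f = rng.standard_normal(10)
fs = dual(f, p)

norm_p = np.sum(np.abs(f) ** p) ** (1 / p)
norm_q = np.sum(np.abs(fs) ** q) ** (1 / q)
assert np.isclose(norm_q, norm_p)        # ||f*||_q = ||f||_p
assert np.isclose(fs @ f, norm_p ** 2)   # f*(f) = ||f||_p^2, so f* lies in J(f)
assert np.allclose(dual(fs, q), f)       # the dual map is self-inverse
```

The self-inverse property follows from the exponent identity (p − 1)(q − 2) = 2 − p, which cancels the powers of |f| when the map is applied twice.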
Furthermore, since in contrast to Hilbert spaces two Banach spaces of the same dimension need not be isometrically isomorphic, Banach spaces exhibit a much richer geometric variety which is potentially useful for developing new learning algorithms.
Secondly, it is often desirable to use norms which are not induced by an inner product because they possess useful properties in applications. It is often stated in the literature that a regulariser is used to enforce a certain property such as sparsity or smoothness. But as we showed in Section 4, it is in fact not the regulariser as such but the norm of the function space alone which provides any desired property.
As an example consider l^1 regularisation, which is often used to induce sparsity of the solution. Sparsity occurs because in l^1 all extreme points of the unit ball lie on the coordinate axes. The finite-dimensional spaces l^1_d are reflexive and thus fall into the framework of this paper. The infinite-dimensional spaces l^1 and L^1 are not reflexive, but one can instead work in an L^p space for p close to 1; see, e.g., [19].
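The connection between the extreme points of the l^1 unit ball and sparsity can be seen in a minimal example: for a single interpolation constraint a·x = 1 in a finite-dimensional l^1 space, a minimal norm interpolant puts all its mass on the largest coefficient of a, a 1-sparse "vertex" solution, since |a·x| ≤ (max_i |a_i|) ‖x‖_1. A sketch with arbitrary dimensions:

```python
import numpy as np

rng = np.random.default_rng(3)
a = rng.standard_normal(6)       # a single data functional on l^1_6
j = np.argmax(np.abs(a))

# 1-sparse candidate: all mass on the largest coefficient of a
x_star = np.zeros(6)
x_star[j] = 1.0 / a[j]
assert np.isclose(a @ x_star, 1.0)

# |a.x| <= max_i |a_i| * ||x||_1, so no interpolant has smaller l^1 norm;
# a random search over the constraint set never beats the vertex solution
for _ in range(2000):
    z = rng.standard_normal(6)
    x = z / (a @ z)              # rescale so that a.x = 1
    assert np.abs(x).sum() >= np.abs(x_star).sum() - 1e-12
```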

A.1 Regularised Interpolation
The proof of theorem 1 is largely identical to the one presented in [1] but requires a few minor adjustments to hold for the generality of reflexive Banach spaces. We present the full proof here.
Proof of theorem 1 To prove that Ω is admissible for the regularised interpolation problem (2) we are going to show that Ω is tangentially nondecreasing in the sense of lemma 1 depending on the properties of the space B.
Fix 0 ≠ f ∈ B and L ∈ J(f), and let a_0 be the unique nonzero minimiser of min{E(aν, y) : a ∈ R}. For every λ > 0 consider the regularisation problem (2) with regularisation parameter λ. By assumption there exist solutions f_λ ∈ B such that J(f_λ) ∩ span{L} ≠ ∅, i.e. there exist c_λ ∈ R such that c_λL ∈ J(f_λ). Now fix any g ∈ B such that L ∈ J(g), which exists as B is reflexive, so J is surjective. We then obtain inequality (10), where the first inequality follows from a_0 minimising E(aν, y) and the second inequality from L(g) = ‖L‖². This shows that Ω(f_λ) ≤ Ω(g) for all λ, and so by hypothesis the set {f_λ : λ > 0} is bounded. Hence there exists a weakly convergent subsequence (f_{λ_l})_{l∈N} such that λ_l → 0 and f_{λ_l} ⇀ f̄ as l → ∞. Taking the limit inferior as l → ∞ on the right-hand side of inequality (10) we obtain the corresponding bound for the weak limit f̄; since a_0 is by assumption the unique, nonzero minimiser this means that f̄ satisfies the interpolation constraint. Moreover, since J(f_λ) ∩ span{L} ≠ ∅, we have ‖L‖·‖f_λ‖ = L(f_λ) → ‖L‖² and thus ‖f_λ‖ → ‖L‖. Since ‖f̄‖ ≤ lim inf ‖f_{λ_l}‖ = ‖L‖ (e.g. [4], Proposition 3.5 (iii)) we have ‖f̄‖ = ‖L‖ and thus L ∈ J(f̄).
Since the f_λ are minimisers of the regularisation problem, their objective value is at most that of any g ∈ B such that L(g) = ‖L‖². Since a_0 is the minimiser this implies in particular that Ω(f_λ) ≤ Ω(g) for all g ∈ B such that L(g) = ‖L‖², and taking the limit inferior again we obtain that the weak limit is in fact a solution of the interpolation problem min{Ω(g) : g ∈ B, L(g) = ‖L‖²}. Finally, note that the claim is trivially true for L = 0, as in that case E is independent of f and for every λ the minimiser f_λ has to be zero to satisfy J(f_λ) ∩ {0} ≠ ∅. This means Ω is minimised at 0.

A.2 Duality mappings
The proof of theorem 2 crucially relies on the following connection between the duality mapping and subgradients (cf. [2,3]). We now give a proof of theorem 2 which follows the ideas of the one presented by [3] but corrects the issue in that paper.
Proof of theorem 2 Using the functional M from proposition 4, define a functional F : V → R by F(x) = M(x − x_0) − L_0(x). Since M is continuous and convex with strictly increasing derivative and L_0 is linear, F is clearly continuous, convex and coercive. This means that F attains its minimum on the reflexive subspace W in at least one point, z say, so that F(z) ≤ F(y) for all y ∈ W. By proposition 4 this means that L_0|_W ∈ ∂M|_W(z − x_0) = J_μ|_W(z − x_0). For simplicity we write L_0|_W = L_W. Note that if x_0 ∈ W and L_W = 0 we have F(x) = M(x − x_0) on W, so z = x_0 and we trivially have J_μ(x_0 − x_0) = {0} = {−L_0 + L_0} ⊂ W⊥ + L_0. So we can without loss of generality assume that not both x_0 ∈ W and L_W = 0.
In the case x_0 ∈ W it is clear that M(· − x_0) is minimised at x_0. If L_W ≠ 0 then L_W attains its norm on W at a point z̄, say, and thus it is clear that there exists a minimiser for F of the form z = z̄ + x_0. More precisely, F is minimised where an element of ∂M(· − x_0) and the derivative of L_0 coincide. Since ∂M(x − x_0) = μ(‖x − x_0‖) L_x/μ(‖x − x_0‖) for L_x ∈ J_μ(x − x_0), we get that the minimiser z = z̄ + x_0 is such that ‖L_W‖_{W*} = μ(‖z − x_0‖).
If on the other hand x_0 ∉ W, then we note that z being the minimum of F on W implies that L_z(y) ≥ 0 for all L_z ∈ ∂F(z) and all y ∈ W. But this means that μ(‖z − x_0‖) L_z(y)/μ(‖z − x_0‖) − L_0(y) ≥ 0 for every L_z ∈ J(z − x_0). But since L_z/μ(‖z − x_0‖) is of norm 1, this means that L_0(y) ≤ μ(‖z − x_0‖)‖y‖ for all y ∈ W. Thus ‖L_W‖_{W*} = ‖L_0|_W‖_{W*} ≤ μ(‖z − x_0‖). Now denote by W̃ the space generated by W and x_0 and note that this space is still reflexive. Extend L_W to L̃_W on W̃. Then L̃_W(y) = L_W(y) ≤ μ(‖z − x_0‖)·‖y‖ for all y ∈ W, so ‖L̃_W‖ > μ(‖z − x_0‖) can only happen if the norm is attained at some point λy + νx_0 with y ∈ W and ν ≠ 0, or equivalently, dividing through by ν, at a point y + x_0 for some y ∈ W. But for those points the extension satisfies L̃_W(y + x_0) ≤ μ(‖z − x_0‖)‖y + x_0‖, and thus ‖L̃_W‖ = μ(‖z − x_0‖) and L̃_W(z − x_0) = ‖L̃_W‖·‖z − x_0‖.
Since for x_0 ∈ W we have W̃ = W, in either case we have obtained a functional L̃_W such that L̃_W|_W = L_0|_W, ‖L̃_W‖ = μ(‖z − x_0‖) and L̃_W(z − x_0) = ‖L̃_W‖·‖z − x_0‖. Now extend L̃_W by Hahn–Banach to L_V on V such that ‖L_V‖ = ‖L̃_W‖ and L_V|_W̃ = L̃_W. Hence (L_V − L_0)|_W = 0, so L_V ∈ W⊥ + L_0. It remains to show that L_V ∈ J_μ(z − x_0) by showing that (12) holds for L_V and every y ∈ V. Notice first that