PDE-based Group Equivariant Convolutional Neural Networks

We present a PDE-based framework that generalizes Group equivariant Convolutional Neural Networks (G-CNNs). In this framework, a network layer is seen as a set of PDE-solvers where geometrically meaningful PDE-coefficients become the layer's trainable weights. Formulating our PDEs on homogeneous spaces allows these networks to be designed with built-in symmetries such as rotation in addition to the standard translation equivariance of CNNs. Having all the desired symmetries included in the design obviates the need to include them by means of costly techniques such as data augmentation. We discuss our PDE-based G-CNNs (PDE-G-CNNs) in a general homogeneous space setting while also going into the specifics of our primary case of interest: roto-translation equivariance. We solve the PDE of interest by a combination of linear group convolutions and non-linear morphological group convolutions, with analytic kernel approximations that we underpin with formal theorems. Our kernel approximations allow for fast GPU implementations of the PDE-solvers; we release our implementation with this article in the form of the LieTorch extension to PyTorch, available at https://gitlab.com/bsmetsjr/lietorch. Just like a linear convolution, a morphological convolution is specified by a kernel, which we train in our PDE-G-CNNs. In PDE-G-CNNs we do not use non-linearities such as max/min-pooling and ReLUs, as they are already subsumed by morphological convolutions. We present a set of experiments to demonstrate the strength of the proposed PDE-G-CNNs in increasing the performance of deep-learning-based imaging applications, with far fewer parameters than traditional CNNs.


Introduction
In this work we introduce PDE-based Group CNNs. The key idea is to replace the typical trifecta of convolution, pooling and ReLUs found in CNNs with a Hamilton-Jacobi type evolution PDE, or more accurately a solver for a Hamilton-Jacobi type PDE. This substitution is illustrated in Fig. 1 where we retain (channel-wise) affine combinations as the means of composing feature maps.
The PDE we propose to use in this setting comes from the geometric image analysis world [1,2,3,4,5,6,7,8,9,10,11,12]. It was chosen based on the fact that it exhibits similar behaviour on images as traditional CNNs do through convolution, pooling and ReLUs. Additionally it can be formulated on Lie groups to yield equivariant processing, which makes our PDE approach compatible with Group CNNs [13,14,15,16,17,18,19,20,21,22,23,24,25]. Finally an approximate solver for our PDE can be efficiently implemented on modern highly parallel hardware, making the choice a practical one as well.
Our solver uses the operator splitting method to solve the PDE in a sequence of steps, each step corresponding to a term of the PDE. The sequence of steps for our PDE is illustrated in Fig. 2. The morphological convolutions that are used to solve for the non-linear terms of the PDE are a key aspect of our design. Normally, morphological convolutions are considered on ℝⁿ [26,27], but when extended to Lie groups such as SE(2) they have many benefits in applications (e.g. crossing-preserving flow [28] or tracking [29,30]). Using morphological convolutions allows our network to have trainable non-linearities instead of the fixed non-linearities in (G-)CNNs.
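To make the morphological side of the solver concrete, below is a minimal sketch (ours, not the LieTorch implementation) of a 1-D morphological convolution: a min-plus (erosion-type) convolution of a sampled signal with a quadratic kernel, the morphological counterpart of the linear convolution used for diffusion. The function name and the quadratic kernel are illustrative assumptions.

```python
def morphological_erosion(f, t, h=1.0):
    """Min-plus convolution of the sampled signal f with the quadratic
    kernel k_t(x) = x^2 / (4 t): out[i] = min_j ( f[j] + k_t((i-j)*h) ).
    This is a discrete Hopf-Lax-type formula for an erosion step
    (illustrative kernel choice; t is the evolution time, h the grid step)."""
    n = len(f)
    out = []
    for i in range(n):
        out.append(min(f[j] + ((i - j) * h) ** 2 / (4.0 * t) for j in range(n)))
    return out

signal = [0.0, 3.0, 1.0, 4.0, 0.5]
print(morphological_erosion(signal, t=1.0))  # pointwise <= signal
```

Replacing min by max (and negating the kernel) gives the dilation variant; on a Lie group the displacement i − j is replaced by the group operation, which is what turns this into a morphological group convolution.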
The theoretical contribution of this paper consists of providing good analytical approximations to the kernels that go into the linear and morphological convolutions that solve our PDE. On ℝⁿ the formulation of these kernels is reasonably straightforward, but in order to achieve group equivariance we need to generalize them to homogeneous spaces.
Instead of training kernel weights, our goal is training the coefficients of the PDE. The coefficients of our PDE have the benefit of being geometrically meaningful parameters from an image analysis point of view. Additionally, we need far fewer PDE parameters than kernel weights to achieve a given level of performance in image segmentation and classification tasks; this is arguably the greatest benefit of our approach.

Figure 1
In a PDE-based CNN we replace the traditional convolution, pooling and ReLU operations by a PDE solver. The inputs of a given layer serve as initial conditions for a set of evolution PDEs, the outputs consist of affine combinations of the solutions of those PDEs at a fixed point in time. The parameters of the PDE become the trainable weights (alongside the affine parameters) over which we optimize.

Figure 2
Our Hamilton-Jacobi type PDE of choice contains a convection, diffusion, dilation and erosion term (CDDE for short). Through operator splitting we solve for these terms separately by using resampling (for convection), linear convolution (for diffusion) and morphological convolution (for dilation and erosion). This paper is a substantially extended journal version of [31] presented at the SSVM 2021 conference.

Structure of the Article
The structure of the article is as follows. We first place our work in its mathematical and deep learning context in Section 2. Then we introduce the needed theoretical preliminaries from Lie group theory in Section 3 where we also define the space of positions and orientations that will allow us to construct roto-translation equivariant networks.
In Section 4 we give the overall architecture of a PDE-G-CNN and the ancillary tools that are needed to support a PDE-G-CNN. We propose an equivariant PDE that models commonly used operations in CNNs.
In Section 5 we detail how our PDE of interest can be solved using a process called operator splitting. Additionally, we give tangible approximations to the fundamental solutions (kernels) of the PDEs that are both easy to compute and sufficiently accurate for practical applications. We use them extensively in the GPU implementations of our PDE-G-CNNs in PyTorch, which can be downloaded from the GIT-repository: https://gitlab.com/bsmetsjr/lietorch. Section 6 is dedicated to showing how common CNN operations such as convolution, max-pooling, ReLUs and skip connections can be interpreted in terms of PDEs.
We end our paper with some experiments showing the strength of PDE-G-CNNs in Section 7, and concluding remarks in Section 8.
The framework we propose covers transformations and CNNs on homogeneous spaces in general and as such we develop the theory in an abstract fashion. To maintain a bridge with practical applications we give details throughout the article on what form the abstractions take explicitly in the case of roto-translation equivariant networks acting on the space of positions and orientations 𝕄_d, specifically in 2D (i.e. d = 2).

Context
As this article touches on disparate fields of study we use this section to discuss context and highlight some closely related work.

Drawing Inspiration from PDE-based Image Analysis
Since the Partial Differential Equations that we use are well-known in the context of geometric image analysis [1,2,3,4,5,6,7,8,9,10,11], the layers also get an interpretation in terms of classical image-processing operators. This allows intuition and techniques from geometric PDE-based image analysis to be carried over to neural networks.
In geometric PDE-based image processing it can be beneficial to include mean curvature or other geometric flows [32,33,34,35] as regularization and our framework provides a natural way for such flows to be included into neural networks. In the PDE-layer from Fig. 2 we only mention diffusion as a means of regularization, but mean curvature flow could easily be integrated by replacing the diffusion sub-layer with a mean curvature flow sub-layer. This would require replacing the linear convolution for diffusion by a median filtering approximation of mean curvature flow [1].

The Need for Lifting Images
In geometric image analysis it is often useful to lift images from a 2D picture to a 3D orientation score as in Fig. 3 and do further processing on the orientation scores [36]. A typical image processing task in which such a lift is beneficial is that of the segmentation of blood vessels in a medical image. Algorithms based on processing the 2D picture directly, usually fail around points where two blood vessels cross, but algorithms that lift the image to an orientation score manage to decouple the blood vessels with different orientations as is illustrated in the bottom row of Fig. 3.
To be able to endow image-processing neural networks with the added capabilities (such as decoupling orientations and guaranteeing equivariance) that result from lifting data to an extended domain, we develop our theory for the more general CNNs defined on homogeneous spaces, rather than just the prevalent CNNs defined on Euclidean space. One can then choose which homogeneous space to use based on the needs of one's application (such as needing to decouple orientations). A homogeneous space is, given a subgroup H of a group G, the manifold of left cosets, denoted by G∕H. In the above image-analysis example, the group would be the special Euclidean group G = SE(d), the subgroup H would be the stabilizer subgroup of a fixed reference axis, and the corresponding homogeneous space G∕H would be the space of positions and orientations 𝕄_d ≡ ℝᵈ × Sᵈ⁻¹, which is the lowest dimensional homogeneous space able to decouple orientations. By considering convolutional neural networks on homogeneous spaces such as 𝕄_d, these networks have access to the same benefits of decoupling structures with different orientations as was highly beneficial in geometric image processing [37,38,39,40,41,42,43,44,45,46,47,48,49,50,51].

Remark 2.1 (Generality of the architecture). Although not considered here, the same structure applies to other Lie group applications (e.g. frequency scores [52,53], velocity scores, scale-orientation scores [54]); therefore we keep our theory in the general setting of homogeneous spaces G∕H. This generality was also important in non-PDE based learning [22], and for PDE-based learning it is again beneficial.
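For concreteness, the lift referred to here can be written down explicitly; a common form of the orientation score transform of [36] is the following (notation ours, with ψ a suitable anisotropic wavelet and R_θ the rotation over angle θ):

```latex
% Orientation score of an image f : \mathbb{R}^2 \to \mathbb{R};
% one 2-D slice of the score per orientation \theta.
(W_\psi f)(\mathbf{x}, \theta)
  \;=\; \int_{\mathbb{R}^2}
        \overline{\psi\!\left(R_\theta^{-1}(\mathbf{y}-\mathbf{x})\right)}\,
        f(\mathbf{y})\, \mathrm{d}\mathbf{y}.
```

Crossing line structures with different orientations then end up in different θ-slices of the score, which is exactly the decoupling used in the vessel-segmentation example.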

Figure 3
Illustrating the process of lifting and projecting, in this case the advantage of lifting an image from ℝ² to the 2D space of positions and orientations 𝕄₂ derives from the disentanglement of the lines at the crossings.

The Need for Equivariance
We require the layers of our network to be equivariant: a transformation of the input should lead to a corresponding transformation of the output. In other words: first transforming the input and then applying the network, or first applying the network and then transforming the output, should yield the same result. A particular example, in which the output transformation is trivial (i.e. the identity transformation), is that of invariance: in many classification tasks, such as the recognition of objects in pictures, an apple should still be recognized as an apple even if it is shifted or otherwise transformed in the picture, as illustrated in Fig. 4. By guaranteeing equivariance of the network, both the amount of data necessary and the need for data augmentation are reduced, as the required symmetries are intrinsic to the network and need not be trained.
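Equivariance is easy to verify numerically in the simplest setting. The sketch below (ours) checks that a periodic 1-D correlation commutes with cyclic shifts, the translation analogue of the symmetry discussed here: filtering a shifted signal equals shifting the filtered signal.

```python
def conv(f, k):
    """Periodic cross-correlation of signal f with kernel k."""
    n = len(f)
    return [sum(f[(i + j) % n] * k[j] for j in range(len(k))) for i in range(n)]

def shift(f, s):
    """Cyclic shift of f by s samples (the group action of Z_n on signals)."""
    n = len(f)
    return [f[(i - s) % n] for i in range(n)]

f = [1.0, 2.0, 0.0, -1.0, 3.0]
k = [0.5, 0.25, 0.25]
a = conv(shift(f, 2), k)   # transform, then apply the layer
b = shift(conv(f, k), 2)   # apply the layer, then transform
print(a == b)              # equivariance: both orders agree
```

The same identity is what Definition 3.2 below demands of every PDE layer, with the cyclic shift replaced by a general group action.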
Both regular and steerable G-CNNs naturally arise from linear mappings between functions on homogeneous spaces that are placed under equivariance constraints [22,24,57,55]. Regular G-CNNs explicitly extend the domain and lift feature maps to a larger homogeneous space of a group, whereas steerable CNNs extend the co-domain by generating fiber bundles in which a steerable feature vector is assigned to each position in the base domain. Although steerable operators have clear benefits in terms of computational efficiency and accuracy [59,60], working with steerable representations puts constraints on non-linear activations within the networks, which limits the representation power of G-CNNs [57]. Like regular G-CNNs, the proposed PDE-G-CNNs do not suffer from this. In our proposed PDE-G-CNN framework it is essential that we adopt the domain-extension viewpoint, as this allows us to naturally and transparently construct scale space PDEs via left-invariant vector fields [12]. In general this viewpoint entails that the domain of images is extended from the space of positions only to a higher dimensional homogeneous space, and it originates from coherent state theory [61], orientation score theory [36], cortical perception models [39], G-CNNs [13,20], and rigid-body motion scattering [62].

Figure 4
Spatial CNNs, as used for image classification for example, are translation equivariant but not necessarily equivariant with respect to rotation, scaling and other transformations, as the illustrative tags of the differently transformed apple images suggest. Building a G-CNN with an appropriately chosen group confers on the network all the equivariances appropriate for the chosen application. Our PDE-based approach is compatible with the group CNN approach [22] and so can confer the same symmetries.

Figure 5
Illustrating the overall architecture of a PDE-G-CNN (example: retinal vessel segmentation). An input image is lifted to a homogeneous space, from which point on it can be fed through subsequent PDE layers (each PDE layer follows the structure of Fig. 2) that replace the convolution layers in conventional CNNs. Finally the result is projected back to the desired output space.
The proposed PDE-G-CNNs form a new, unique class of equivariant neural networks, and we show in Section 6 how regular continuous G-CNNs arise as a special case of our PDE-G-CNNs.

Probabilistic-CNNs
Our geometric PDEs relate to α-stable Lévy processes [50] and cost-processes akin to [26], but then on 𝕄_d rather than ℝᵈ. This relates to probabilistic equivariant numerical neural networks [63] that use anisotropic convection-diffusions on ℝᵈ.
In contrast to these networks, the PDE-G-CNNs that we propose allow for simultaneous spatial and angular diffusion on 𝕄_d. Furthermore we include non-linear Bellman processes [26] for max-pooling over Riemannian balls.
KerCNNs
An approach to introducing horizontal connectivity in CNNs that does not require a Lie group structure was proposed by Montobbio et al. [64,65] in the form of KerCNNs. In this biologically inspired metric model a diffusion process is used to achieve intra-layer connectivity.
While our approach does require a Lie group structure it is not restricted to diffusion and also includes dilation/erosion.

Neural Networks and Differential Equations
The connection between neural networks and differential equations became widely known in 2017, when Weinan E [66] explicitly explained the connection between neural networks and dynamical systems especially in the context of the ultradeep ResNet [67]. This point of view was further expanded by Lu et al. [68], showing how many ultradeep neural networks can be viewed as discretizations of ordinary differential equations. The somewhat opposite point of view was taken by Chen et al. [69], who introduced a new type of neural network which no longer has discrete layers, them being replaced by a field parameterized by a continuous time variable. Weinan E also indicated a relationship between CNNs and PDEs, or rather with evolution equations involving a nonlocal operator. Implicitly, the connection between neural networks and differential equations was also explored by the early works of Chen et al. [70] who learn parameters in a reaction-diffusion equation. This connection between neural networks and PDEs was then made explicit and more extensive by Long et al. who made it possible to learn a much wider class of PDEs [71] with their PDE-Net. More recent work in PDE inspired neural networks includes [72,73].
Basing neural network computations on PDEs formulated on manifolds also makes the processing independent with respect to the choice of coordinates on the manifold in the fashion of Weiler et al. [74].
More recent work in this direction includes integrating equivariant partial differential operators in steerable CNNs [75], drawing a strong analogy between deep learning and physics.
A useful aspect of the connection between neural networks and differential equations is the observation that the stability of the differential equation can give insight into the stability and generalization ability of the neural network [76]. Moreover, there are intriguing analogies between numerical PDE-approximations and specific network architectures (e.g. ResNets), as can be seen in the comprehensive overview article by Alt et al. [77].
The main contribution of our work in the field of PDE-related neural networks is that we implement and analyze geometric PDEs on homogeneous spaces to obtain general group equivariant PDE-based CNNs whose implementations just require linear and morphological convolutions with new analytic approximations of scale space kernels.

Equivariance: Groups & Homogeneous Spaces
We want to design the PDE-G-CNN, and its layers, in such a way that they are equivariant. Equivariance essentially means that one can either transform the input and then feed it through the network, or first feed it through the network and then transform the output, and both give the same result. We will give a precise definition after introducing general notation.

The General Case
A layer in a neural network (or indeed the whole network) can be viewed as an operator from a space of real-valued functions defined on a space X to a space of real-valued functions defined on a space Y. It may be helpful to think of these function spaces as spaces of images.
We assume that the possible transformations form a connected Lie group G. Think for instance of a group of translations which shift the domain into different directions. The Lie group being connected excludes transformations such as reflections, which we want to avoid for the sake of simplicity. We further assume that the Lie group G acts smoothly on both spaces X and Y, which means that there are smooth maps ρ : G × X → X and ρ′ : G × Y → Y such that for all g, h ∈ G: ρ(gh, x) = ρ(g, ρ(h, x)) and ρ(e, x) = x (and similarly for ρ′). Additionally we will assume that the group acts transitively on the spaces, meaning that for any two elements of these spaces there exists a transformation in G that maps one to the other. This has as a consequence that X and Y can be seen as homogeneous spaces [78]. In particular, this means that after selecting a reference element x₀ ∈ X we can make the isomorphism X ≡ G∕Stab_G(x₀) using the mapping g Stab_G(x₀) ↦ ρ(g, x₀), which is a bijection due to transitivity and the fact that Stab_G(x₀) := { g ∈ G | ρ(g, x₀) = x₀ } is a closed subgroup of G. Because of this we will represent a homogeneous space as the quotient G∕H for some choice of closed subgroup H = Stab_G(x₀), since all homogeneous spaces are isomorphic to such a quotient by the above construction.
In this article we will restrict ourselves to those homogeneous spaces that correspond to quotients G∕H where the subgroup H is compact and connected. Restricting ourselves to compact and connected subgroups simplifies many constructions and still covers several interesting cases, such as the rigid body motion groups SE(d).
The elements of the quotient G∕H consist of subsets of G which we will denote by the letter p. These subsets are known as left cosets of H, since every one of them is of the form p = gH for some g ∈ G; the left cosets are a partition of G under the equivalence relation g₁ ∼ g₂ ⇔ g₁⁻¹g₂ ∈ H. Under this notation the group G consists of the disjoint union of its left cosets. The left coset that is associated with the reference element x₀ ∈ X is H itself, and for that reason we also alias it by p₀ := H when we want to think of it as an atomic entity rather than a set in its own right. We will denote the quotient map from G to G∕H by π : π(g) := gH. Remark 3.1 (Principal homogeneous space). Observe that by choosing H = {e} we get G∕H ≡ G, i.e. the Lie group is a homogeneous space of itself. This is called the principal homogeneous space. In that case the group action is equivalent to the group composition. The numerical experiments we perform in this paper are on the principal homogeneous space ℝ² × S¹ of SE(2).
We will denote the group action/left-multiplication by an element g ∈ G by the operator L_g : G∕H → G∕H given by L_g p := gp for all p ∈ G∕H.
In addition, we denote the left-regular representation of G on functions f defined on G∕H by 𝓛_g, defined by (𝓛_g f)(p) := f(g⁻¹p). A neural network layer is itself an operator (from functions on G∕H₁ to functions on G∕H₂), and we require the operator to be equivariant with respect to the actions on these function spaces. Definition 3.2 (Equivariance). Let G be a Lie group with homogeneous spaces G∕H₁ and G∕H₂. Let Φ be an operator from functions (of some function class) on G∕H₁ to functions on G∕H₂; then we say that Φ is equivariant with respect to G if for all g ∈ G and all functions f (of that class) we have that: Φ(𝓛_g f) = 𝓛_g(Φ f), or in words: the neural network commutes with transformations.
Most of the time we will have H₁ = H₂ in our proposed neural networks; only the initial lifting layer and the final projection layer will be between different homogeneous spaces, as we will see later on.

Vector and Metric Tensor Fields
The particular objects that we will base our framework on are vector and metric tensor fields; if these basic building blocks are equivariant then our processing will be equivariant. We explain next what left invariance means for these objects.
For g ∈ G and p ∈ G∕H, let T_p(G∕H) be the tangent space at point p; then the pushforward (L_g)_* : T_p(G∕H) → T_{gp}(G∕H) of the group action is defined by the condition that for all smooth functions f on G∕H and all v ∈ T_p(G∕H) we have that ((L_g)_* v) f := v (f ∘ L_g).
Remark 3.3 (Tangent vectors as differential operators). Other than the usual geometric interpretation of tangent vectors as being the velocity vectors γ̇(t) tangent to some differentiable curve γ : ℝ → G∕H, we will simultaneously use them as differential operators acting on functions, as we did in (8). This algebraic viewpoint defines the action of the tangent vector γ̇(t) on a differentiable function f as γ̇(t) f := (f ∘ γ)′(t). In the flat setting of G = (ℝⁿ, +), where the tangent spaces are isomorphic to the base manifold ℝⁿ, the application of a tangent vector c ∈ ℝⁿ to a function f is the familiar directional derivative c f = ∑ᵢ cᵢ ∂f∕∂xᵢ. See [79, §2.1.1] for details on this double interpretation.
Vector fields A that have the special property that the pushforward (L_g)_* maps them to themselves, in the sense that for all differentiable functions f and all g ∈ G: ((L_g)_* A) f = A f, where A : p ↦ A(p) ∈ T_p(G∕H) is a vector field, are referred to as G-invariant.

Definition 3.4 (G-invariant vector field on a homogeneous space). A vector field A on G∕H is called G-invariant if for all g ∈ G: (L_g)_* A = A.
It is straightforward to check that (9) and (10) are equivalent and that these imply the following.

Corollary 3.5 (Properties of G-invariant vector fields). On a homogeneous space G∕H, a G-invariant vector field A has the following properties:
In the same fashion we have an induced pseudometric d_G̃ on G from the pseudometric tensor field G̃ on G.
Definition 3.13 (Pseudometric on G). Let g₁, g₂ ∈ G. Then we define: d_G̃(g₁, g₂) := inf { Len_G(γ) | γ ∈ Lip([0, 1], G), γ(0) = g₁, γ(1) = g₂ }. This pseudometric has the property that d_G̃(h₁, h₂) = 0 for all h₁, h₂ ∈ H; in fact, for all p ∈ G∕H we have that d_G̃(g₁, g₂) = 0 for all g₁, g₂ ∈ p. By requiring G and H to be connected we get the following strong correspondence between the metric structure on the homogeneous space and the pseudometric structure on the group. Lemma 3.14. Let g₁, g₂ ∈ G so that π(g₂) is away from the cut locus of π(g₁); then: d_G̃(g₁, g₂) = d_{G∕H}(π(g₁), π(g₂)). Moreover, if δ is a minimizing geodesic in the group connecting g₁ with g₂, then π ∘ δ is the unique minimizing geodesic in the homogeneous space G∕H that connects π(g₁) with π(g₂).
Denote the length functionals with Len_G and Len_{G∕H}. Observe that by construction of the pseudometric tensor field G̃ on G we have: Len_G(γ) = Len_{G∕H}(π ∘ γ).
Now we assume π ∘ δ ≠ γ, where γ is the unique minimizing geodesic in G∕H from π(g₁) to π(g₂). Then since γ is the unique geodesic we have Len_{G∕H}(γ) < Len_{G∕H}(π ∘ δ). But then we can find some γ_lift ∈ Lip([0, 1], G) that is a preimage of γ, i.e. π ∘ γ_lift = γ. The potential problem is that while γ_lift(0) lies in the coset π(g₁) and γ_lift(1) in the coset π(g₂), γ_lift does not necessarily connect g₁ to g₂. But since the coset π(g₁) is connected we can find a curve wholly contained in it that connects g₁ with γ_lift(0); call this curve γ_head ∈ Lip([0, 1], π(g₁)). Similarly we can find a γ_tail ∈ Lip([0, 1], π(g₂)) that connects γ_lift(1) to g₂. Both these curves have zero length since π maps them to a single point on G∕H, i.e. Len_G(γ_head) = Len_G(γ_tail) = 0. Now we can compose these three curves into γ_full := γ_head · γ_lift · γ_tail. This new curve is again in Lip([0, 1], G) and connects g₁ with g₂, but also: Len_G(γ_full) = Len_G(γ_lift) = Len_{G∕H}(γ) < Len_{G∕H}(π ∘ δ) = Len_G(δ), which is a contradiction since δ is a minimizing geodesic between g₁ and g₂. We conclude π ∘ δ = γ and thereby: d_G̃(g₁, g₂) = d_{G∕H}(π(g₁), π(g₂)). This result allows us to more easily translate results from Lie groups to homogeneous spaces.
We end our theoretical preliminaries by introducing the space of positions and orientations 𝕄_d.

SE(d) and the Homogeneous Space 𝕄_d
Our main example and Lie group of interest is the Special Euclidean group SE(d) of the rotations and translations of ℝᵈ, in particular for d ∈ {2, 3}. When we take H ≡ {0} × SO(d − 1) we obtain the space of positions and orientations 𝕄_d. This homogeneous space and group will enable the construction of roto-translation equivariant networks. For the experiments in this paper we restrict ourselves to d = 2, but we include the case d = 3 in some of our theoretical preliminaries.
As a set we identify 𝕄_d with ℝᵈ × Sᵈ⁻¹ and choose some reference direction a ∈ Sᵈ⁻¹ ⊂ ℝᵈ as the forward direction, so that we can set the reference point of the space as p₀ = (0, a). We can then see that the elements of H are those rotations that map a to itself, i.e. rotations with the forward direction as their axis.
If we denote elements of SE(d) as translation/rotation pairs (x, R) ∈ ℝᵈ × SO(d), then group multiplication is given by (x₁, R₁)(x₂, R₂) = (x₁ + R₁x₂, R₁R₂), and the group action on elements p = (y, n) ∈ ℝᵈ × Sᵈ⁻¹ ≡ 𝕄_d is given as (x, R) · (y, n) = (x + Ry, Rn). What the G-invariant vector fields and metric tensor fields look like on 𝕄_d is different for d = 2 than for d > 2. We first look at d > 2.
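These two formulas can be checked numerically. Below is a small sketch (ours) of the SE(2) product and its action on 𝕄₂, storing rotations as angles so that composition of rotations is addition modulo 2π, together with a check of the compatibility (g₁g₂) · p = g₁ · (g₂ · p).

```python
import math

def se2_mult(g1, g2):
    """Group product (x1, R1)(x2, R2) = (x1 + R1 x2, R1 R2) on SE(2);
    elements stored as (x, y, angle)."""
    (x1, y1, t1), (x2, y2, t2) = g1, g2
    c, s = math.cos(t1), math.sin(t1)
    return (x1 + c * x2 - s * y2, y1 + s * x2 + c * y2, (t1 + t2) % (2 * math.pi))

def se2_act(g, p):
    """Action (x, R) . (y, n) = (x + R y, R n) on M_2, with the
    orientation n stored as an angle as well."""
    (x, y, t), (py, pz, pt) = g, p
    c, s = math.cos(t), math.sin(t)
    return (x + c * py - s * pz, y + s * py + c * pz, (t + pt) % (2 * math.pi))

g = (1.0, 0.0, math.pi / 2)
h = (0.0, 2.0, math.pi)
p = (1.0, 1.0, 0.0)
lhs = se2_act(se2_mult(g, h), p)   # (g h) . p
rhs = se2_act(g, se2_act(h, p))   # g . (h . p)
print(all(abs(a - b) < 1e-12 for a, b in zip(lhs, rhs)))
```

For d = 2 the action on angles is simply addition, which is why 𝕄₂ can be identified with SE(2) itself (Remark 3.1).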
Proof. For d > 2 we can see that (17) are the only left-invariant vector fields, since for all h ∈ H we have hp₀ = p₀ and so in order to be well-defined we must require h_* A(p₀) = A(p₀) on p₀; this is true for the forward direction (and its scalar multiples) but not for any other tangent vectors at p₀. Here w_M, w_L, w_A > 0 weigh the main, lateral and angular motion respectively, and the inner product, outer product and norm are the standard Euclidean constructs.
Proof. It follows that, to satisfy the second condition of Cor. 3.7 at the tangent space T_{(x,n)} of a particular (x, n), the metric tensor needs to be symmetric with respect to rotations about n both spatially and angularly (i.e. we require isotropy in all angular and lateral directions), which leads to the three degrees of freedom contained in (18), irrespective of d.
For d = 2 we represent the elements of 𝕄₂ with (x, y, θ) ∈ ℝ³, where x, y are the usual Cartesian coordinates and θ the angle with respect to the x-axis, so that n = (cos θ, sin θ)ᵀ. The reference element is then simply denoted by p₀ = (0, 0, 0).
It may be counter-intuitive, but decreasing the number of dimensions to d = 2 gives more freedom to the G-invariant vector and metric tensor fields compared to d > 2. This is a consequence of the subgroup H being trivial, so that the symmetry conditions from Cor. 3.5 and 3.7 also become trivial. The SE(2)-invariant vector fields are given as follows.

Proposition 3.17. On 𝕄₂ the SE(2)-invariant vector fields are spanned by the following basis:
A₁ = cos θ ∂ₓ + sin θ ∂_y,  A₂ = −sin θ ∂ₓ + cos θ ∂_y,  A₃ = ∂_θ.   (19)
Proof. For d = 2 we have 𝕄₂ ≡ SE(2), and the group-invariant vector fields on 𝕄₂ are exactly the left-invariant vector fields on SE(2), given by (19).
In a similar manner, SE(2)-invariant metric tensors are then given as follows.
Proposition 3.18. On 𝕄₂ the SE(2)-invariant metric tensor fields are given by:
G = ∑_{i,j=1}^{3} g_{ij} ωⁱ ⊗ ωʲ,   (20)
with constant coefficients g_{ij} = g_{ji} ∈ ℝ forming a positive definite matrix, where (ω¹, ω², ω³) is the dual frame of the basis from Prop. 3.17.
Proof. Since SE(2) ≡ 𝕄₂, the G-invariant metric tensor fields are again exactly the left-invariant metric tensor fields.

This gives the SE(2)-invariant metric tensor fields 6 degrees of freedom and hence 6 trainable parameters on 𝕄₂. Remarkably, the case d = 2 allows for more degrees of freedom than the case d = 3, where Proposition 3.16 applies. In our experiments so far we have restricted ourselves to those metric tensors that are diagonal with respect to the frame from Prop. 3.17. A diagonal metric tensor has just 3 degrees of freedom and the same general form as (18), specifically:
G = w_M² ω¹ ⊗ ω¹ + w_L² ω² ⊗ ω² + w_A² ω³ ⊗ ω³,
with w_M, w_L, w_A > 0 weighing the main, lateral and angular motion. We will expand into non-diagonal metric tensors in future work.

Lifting & Projecting
The key ingredient of what we call a PDE-G-CNN is the PDE layer that we detail in the next section; however, to make a complete network we need more. Specifically, we need a layer that transforms the network's input into a format that is suitable for the PDE layers, and a layer that takes the output of the PDE layers and transforms it to the desired output format. We call these input and output transformations lifting and projection, respectively; this yields the overall architecture of a PDE-G-CNN as illustrated in Fig. 5.
As our theoretical preliminaries suggest, we aim to do processing on homogeneous spaces, but the input and output of the network do not necessarily live on that homogeneous space. Indeed, in the case of images the data lives on ℝ² and not on 𝕄₂ where we propose to do processing.
This necessitates the addition of lifting and projection layers to first transform the input to the desired homogeneous space and end with transforming it back to the required output space. Of course, for the entire network to be equivariant we require these transformation layers to be equivariant as well. In this paper we focus on the design of the PDE layers; details on appropriate equivariant lifting and projection layers in the case of SE(2) can be found in [20,80].
Remark 4.1 (General equivariant linear transformations between homogeneous spaces). A general way to lift and project from one homogeneous space to another in a trainable fashion is the following. Consider two homogeneous spaces G∕H₁ and G∕H₂ of a Lie group G, and let f : G∕H₁ → ℝ and k : G → ℝ with the following property: k(h₁ g h₂) = k(g) for all g ∈ G, h₁ ∈ H₁, h₂ ∈ H₂, where H₁ is compact. Then the operator T_k defined by (T_k f)(g H₂) := ∫_G k(q⁻¹g) f(q H₁) dμ(q) transforms f from a function on G∕H₁ to a function on G∕H₂ in an equivariant manner (assuming k and f are such that the integral exists). Here the kernel k is the trainable part and μ is the left-invariant Haar measure on the group. Moreover, it can be shown via the Dunford-Pettis theorem [81] that (under mild restrictions) all linear transforms between homogeneous spaces are of this form.
Remark 4.2 (Lifting and projecting on 𝕄₂). Lifting an image (function) on ℝ² to 𝕄₂ can either be performed by a non-trainable Invertible Orientation Score Transform [36] or a trainable lift [20] in the style of Remark 4.1. Projecting from 𝕄₂ back down to ℝ² can be performed by a simple maximum projection: let U : 𝕄₂ → ℝ, then (P U)(x) := max_{n ∈ S¹} U(x, n) is a roto-translation equivariant projection as used in [20]. A variation on the above projection is detailed in
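The maximum projection of Remark 4.2 is straightforward in code; below is a minimal sketch (ours) on a discretized orientation score, with `U` indexed as orientation × row × column:

```python
def project_max(U):
    """Roto-translation equivariant projection from (discretized) M_2
    back to R^2: at every spatial position take the maximum over all
    orientation channels."""
    n_theta, rows, cols = len(U), len(U[0]), len(U[0][0])
    return [[max(U[k][i][j] for k in range(n_theta)) for j in range(cols)]
            for i in range(rows)]

U = [[[0.1, 0.9], [0.0, 0.2]],   # orientation channel 0
     [[0.4, 0.3], [0.8, 0.1]]]   # orientation channel 1
print(project_max(U))  # [[0.4, 0.9], [0.8, 0.2]]
```

Taking the maximum over the full orientation axis is what makes the projection insensitive to how a rotation of the input permutes the orientation channels.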

PDE Layer
A PDE layer operates by taking its inputs as the initial conditions for a set of evolution equations; hence there will be a PDE associated with each input feature. The idea is that we let each of these evolution equations work on the inputs up to a fixed time T > 0. Afterwards, we take these solutions at time T and take affine combinations (really batch normalized linear combinations in practice) of them to produce the outputs of the layer, and as such the initial conditions for the next set of PDEs.
If we index network layers (i.e. the depth of the network) with ℓ and denote the width (i.e. the number of features or channels) at layer ℓ with M_ℓ, then we have M_ℓ PDEs and take M_{ℓ+1} linear combinations of their solutions. We divide a PDE layer into the PDE solvers, which each apply the PDE evolution to their respective input channel, and the affine combination unit. This design is illustrated in Fig. 1, but let us formalize it.
Let (U_i^ℓ)_{i=1}^{M_ℓ} be the inputs of the ℓ-th layer (i.e. some functions on G∕H), and let a_{ij}^ℓ, b_j^ℓ ∈ ℝ be the coefficients of the affine transforms for j = 1 … M_{ℓ+1} and i = 1 … M_ℓ. Let each PDE be parametrized by a set of parameters θ_i^ℓ. Then the action of a PDE layer is described as:
U_j^{ℓ+1} = b_j^ℓ + ∑_{i=1}^{M_ℓ} a_{ij}^ℓ Φ_{T, θ_i^ℓ}(U_i^ℓ),
where Φ_{t, θ} is the evolution operator of the PDE at time t ≥ 0 and parameter set θ. We define the operator Φ_{t, θ} so that (p, t) ↦ (Φ_{t, θ}(U))(p) satisfies the Hamilton-Jacobi type PDE that we introduce in just a moment. In this layer formula the parameters a_{ij}^ℓ, b_j^ℓ and θ_i^ℓ are the trainable weights, but the evolution time T we keep fixed.
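Schematically, the layer action above can be sketched as follows (our illustrative code: `pde_solver` stands in for the evolution operator Φ, here an identity placeholder, and plain affine combinations stand in for the batch-normalized ones used in practice):

```python
def pde_layer(features, pde_solver, weights, biases):
    """One PDE layer: evolve each of the M_l input channels with its PDE
    solver, then form M_{l+1} affine combinations of the evolved channels.
    features: list of M_l channels (each a list of samples);
    weights: M_{l+1} rows of M_l coefficients; biases: M_{l+1} scalars."""
    evolved = [pde_solver(f) for f in features]
    out = []
    for a_row, b in zip(weights, biases):
        chan = [b + sum(a * u[i] for a, u in zip(a_row, evolved))
                for i in range(len(evolved[0]))]
        out.append(chan)
    return out

# toy 1-D channels and an identity "solver" as a stand-in for the CDDE evolution
feats = [[1.0, 2.0], [3.0, 4.0]]
print(pde_layer(feats, lambda f: f, weights=[[1.0, -1.0]], biases=[0.5]))
```

Note that equivariance of the layer only depends on `pde_solver` being equivariant: affine combinations taken channel-wise commute with any common transformation of the channels.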
It is essential that we require the network layers, and thereby all the PDE units, to be equivariant. This has consequences for the class of PDEs that is allowed.
The PDE solver that we will consider in this article, illustrated in Fig. 2, computes the approximate solution to the PDE

∂W/∂t (p, t) = −𝐜W(p, t) (convection) − (−Δ_{G₁})^α W(p, t) (fractional diffusion) + ‖∇W(p, t)‖^{2α}_{G₂⁺} (dilation) − ‖∇W(p, t)‖^{2α}_{G₂⁻} (erosion), with W(p, 0) = U(p). (24)

Here, 𝐜 is a G-invariant vector field on G/H (recall (17) and our use of tangent vectors as differential operators per Remark 3.3), α ∈ [1/2, 1], G₁ and G₂± are G-invariant metric tensor fields on G/H, U is the initial condition, and Δ_G and ‖ ⋅ ‖_G denote the Laplace–Beltrami operator and the norm induced by a metric tensor field G. As the labels indicate, the four terms have distinct effects:
• convection: moving data around,
• (fractional) diffusion: regularizing data (which relates to subsampling by destroying data),
• dilation: pooling of data,
• erosion: sharpening of data.
This is also why we refer to a layer using this PDE as a CDDE layer. Summarizing, the parameters of this PDE are given by θ = (𝐜, G₁, G₂⁺, G₂⁻). The geometric interpretation of each of the terms in (24) is illustrated in Fig. 6 for G/H = ℝ² and in Fig. 7 for G/H = 𝕄₂.
Since the convection vector field 𝐜 and the metric tensor fields G₁ and G₂± are G-invariant, the PDE unit, and thereby the network layer, is automatically equivariant.

Training
Training the PDE layer comes down to adapting the parameters θ of the PDEs in order to minimize a given loss function (the choice of which depends on the application and which we will not consider in this article). In this sense, the vector field 𝐜 and the metric tensors are analogous to the weights of this layer.
Since we required the convection vector field and the metric tensor fields to be G-invariant, the parameter space is finite-dimensional as a consequence of Cor. 3.5 and 3.7, provided we restrict ourselves to Riemannian metric tensor fields.
For our main application on 𝕄₂ each PDE unit would have the following 12 trainable parameters:
• 3 parameters to specify the convection vector field 𝐜 as a linear combination of (19),
• 3 parameters to specify the fractional diffusion metric tensor field G₁,
• and 3 parameters each to specify the dilation and erosion metric tensor fields G₂±,
where the metric tensor fields are of the form (20), i.e. diagonal with respect to the frame from Prop. 3.17.
Surprisingly, for higher dimensions 𝕄_d has fewer trainable parameters than for d = 2. This is caused by the SE(d)-invariant vector fields on 𝕄_d for d ≥ 3 being spanned by a single basis element (per Proposition 3.15) instead of the three basis elements (19) available for d = 2. Since the left-invariant metric tensor fields are determined by only 3 parameters irrespective of the dimension, we count a total of 7 parameters for each PDE unit for applications on 𝕄_d for d ≥ 3.
In our own experiments we always use some form of stochastic gradient descent (usually ADAM) with a small amount of ℓ₂ regularization applied uniformly over all the parameters. Similarly, we stick to a single learning rate for all the parameters. Given that in our setting different parameters have distinct effects, treating all of them the same is likely far from optimal; however, we leave that topic for future investigation.

PDE Solver
Our PDE solver consists of an iteration of time-step units, each of which is a composition of convection, diffusion, dilation and erosion substeps. Each unit takes its input as the initial condition of a PDE and produces as output the solution of that PDE at time t = Δt.
The convection, diffusion and dilation/erosion steps are implemented with, respectively, a shifted resample, a linear convolution, and two morphological convolutions, as illustrated in Fig. 8. The composition of the substeps does not solve (24) exactly, but for small Δt it approximates the solution by a principle called operator splitting.
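The splitting scheme can be sketched as follows (our own schematic; the four substeps are assumed callables, each implementing its respective solver over one time step Δt):

```python
def cdde_step(U, convect, diffuse, dilate, erode):
    """One operator-splitting time step: apply each substep in sequence.
    The composition approximates the full CDDE evolution over dt."""
    return erode(dilate(diffuse(convect(U))))

def cdde_solve(U, n_steps, convect, diffuse, dilate, erode):
    """Iterate the time-step unit: approximates the solution at t = n_steps * dt."""
    for _ in range(n_steps):
        U = cdde_step(U, convect, diffuse, dilate, erode)
    return U
```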
We will now discuss each of these substeps separately.

Convection
The convection step has as input a function U₁ : G/H → ℝ and takes it as the initial condition of the PDE ∂W/∂t = −𝐜W with W(⋅, 0) = U₁. The output of the layer is the solution of the PDE evaluated at time t = T, i.e. the output is the function p ↦ U₁(γ(T)⁻¹ ⋅ p), where p = g ⋅ p₀ ∈ G/H and γ : ℝ → G is the exponential curve that satisfies γ(0) = e and induces 𝐜; i.e. γ is the exponential curve in the group that induces the integral curves of the G-invariant vector field 𝐜 on G/H when acting on elements of the homogeneous space.
Note that the existence of this exponential curve is a consequence of the vector field being G-invariant; such exponential curves do not exist for general convection vector fields. In our experiments equation (27) is numerically implemented as a resampling operation with trilinear interpolation to account for the off-grid coordinates.
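In one dimension the resampling amounts to the following (a sketch with periodic boundary conditions and our own naming; the actual implementation interpolates trilinearly on the 3-D grid of positions and orientations):

```python
import numpy as np

def shift_resample(u, shift):
    """Convection substep on a periodic 1-D grid: sample u at the
    off-grid coordinates x - shift using linear interpolation."""
    n = len(u)
    x = (np.arange(n) - shift) % n       # off-grid source coordinates
    i0 = np.floor(x).astype(int) % n     # left grid neighbour
    i1 = (i0 + 1) % n                    # right grid neighbour
    w = x - np.floor(x)                  # interpolation weight
    return (1.0 - w) * u[i0] + w * u[i1]
```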

Fractional Diffusion
The (fractional) diffusion step solves the PDE ∂W/∂t = −(−Δ_{G₁})^α W with W(⋅, 0) = U₂. As with (fractional) diffusion on ℝⁿ, there exists a smooth function K_t^α, called the fundamental solution of the α-diffusion equation, such that for every initial condition U₂ the solution to the PDE (29) is given by the convolution of the function U₂ with the fundamental solution: W(⋅, t) = K_t^α *_{G/H} U₂. The convolution *_{G/H} on a homogeneous space G/H is specified by the following definition.
Definition 5.2 (Linear group convolution). Let H = Stab(p₀) be compact, and let f ∈ L²(G/H) and k ∈ L¹(G/H) be such that ∀h ∈ H, p ∈ G/H : k(h ⋅ p) = k(p) (kernel compatibility); then we define (k *_{G/H} f)(p) := ∫_G k(g⁻¹ ⋅ p) f(g ⋅ p₀) dμ_G(g), where μ_G and μ_H are the left-invariant Haar measures (determined up to scalar multiplication) on respectively G and H. We decompose the Haar measure μ_G on the group as a product of a measure on the quotient G/H times the measure μ_H on the subgroup H. As Haar measures are determined up to a constant we take the following convention: we normalize so that μ_G = μ_{G/H} × μ_H, where μ_{G/H} is the Riemannian measure induced by G and μ_H is a choice of Haar measure on H. Thereby (32) boils down to Weil's integration formula: ∫_G f(g) dμ_G(g) = ∫_{G/H} ∫_H f(g h) dμ_H(h) dμ_{G/H}(g ⋅ p₀) whenever f is measurable. Since H is compact we can indeed normalize the Haar measure so that μ_H(H) = 1.
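As a minimal discrete instance of Def. 5.2, take G to be the cyclic rotation group C_n, so that H is trivial and the Haar integral becomes a normalized sum over group elements (setup and names are our own):

```python
import numpy as np

def cyclic_group_conv(k, f):
    """Group convolution (k * f)(g) = (1/n) sum_h k(h^{-1} g) f(h) on the
    cyclic group C_n, whose group law is addition modulo n."""
    n = len(f)
    out = np.zeros(n)
    for g in range(n):
        for h in range(n):
            out[g] += k[(g - h) % n] * f[h]   # h^{-1} g = (g - h) mod n
    return out / n
```

The defining property of such a convolution is equivariance: convolving a shifted signal equals shifting the convolved signal.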
In general an exact analytic expression for the fundamental solution requires complicated steerable filter operators [50, Thm. 1 & 2], and for that reason we content ourselves with more easily computable approximations. For now let us construct our approximations and address their quality and the involved asymptotics later.
Remark 5.5. In the approximations we will make use of the logarithmic map log_G as the inverse of the Lie group exponential map exp_G. Locally, such inversion can always be done by the inverse function theorem. Specifically, there is always a neighborhood v ⊂ 𝔤 of the origin so that exp_G|_v is a diffeomorphism between v and V = exp_G(v) ⊂ G, where V is a neighborhood of e. Then we define the logarithmic map log_G : V → v by exp_G ∘ log_G = id_V and log_G ∘ exp_G|_v = id_v. For the moment, for simplicity, we assume V = G in the general setting (for our primary case of interest G = SE(2) this is satisfied). The idea is that instead of basing our kernels on the metric d_G (which is hard to calculate [83]) we approximate it using the seminorm from Def. 3.12 (which is easy to calculate). We can use this seminorm on elements of the homogeneous space by using the group's logarithmic map log_G: we take the group logarithm of all the group elements that constitute a particular equivalence class in G/H and then pick the group element with the lowest seminorm. Henceforth, we write this estimate as d_G(p₀, p) ≈ ρ_G(p), relying on the following definition.
Definition 5.6 (Logarithmic metric estimate). Let G be a G-invariant metric tensor field on the homogeneous space G/H; then we define ρ_G(p) := min { ‖π_* log_G g‖ : g ∈ G, g ⋅ p₀ = p }, where π_* is the push-forward of the projection map π given by (4).
We can interpret this metric estimate as finding all exponential curves in G whose action on the homogeneous space connects p₀ (at t = 0) to p (at t = 1), and then choosing from that set the exponential curve that has the lowest constant velocity according to the seminorm in Def. 3.12; that velocity is our distance estimate.
Summarizing, Def. 5.6 and Eq. (34), can be intuitively reformulated as: 'instead of the length of the geodesic connecting two points of ∕ we take the length of the shortest exponential curve connecting those two points'.
The following lemma quantifies how well our estimate approximates the true metric.
Lemma 5.7 (Bounding the logarithmic metric estimate). For all p ∈ G/H sufficiently close to p₀ we have ρ_G(p) = d_G(p₀, p)(1 + O(d_G(p₀, p)²)), which has as a weaker corollary that for every compact neighborhood of p₀ there exists a C_metr > 1 so that C_metr⁻¹ d_G(p₀, p) ≤ ρ_G(p) ≤ C_metr d_G(p₀, p) for all p in that neighborhood. Note that the constant C_metr depends on both the choice of compact neighborhood and the metric tensor field.
The proof of this lemma can be found in Appendix A.1.
Remark 5.8 (Logarithmic metric estimate in principal homogeneous spaces). When we take a principal homogeneous space such as 𝕄₂ ≡ SE(2) with a left-invariant metric tensor field, the metric estimate simplifies to ρ_G(g ⋅ p₀) = ‖log_G g‖_{G|p₀}; hence we see that this construction generalizes the logarithmic estimate, as used in [84,85], to homogeneous spaces other than the principal ones.

Remark 5.9 (Logarithmic metric estimate for 𝕄₂). Using the (x, y, θ) coordinates for 𝕄₂ and a left-invariant metric tensor field of the form (20) with diagonal weights (w₁, w₂, w₃), we formulate the metric estimate in terms of the following auxiliary functions, called the exponential coordinates of the first kind: c¹ = (θ/2)(x cot(θ/2) + y), c² = (θ/2)(y cot(θ/2) − x), c³ = θ. The logarithmic metric estimate for SE(2) is then given by ρ_G(x, y, θ) = √(w₁ (c¹)² + w₂ (c²)² + w₃ (c³)²); this estimate is illustrated in Fig. 9, where it is contrasted against the exact metric.

We can see that the metric estimate ρ_G (and consequently any function of ρ_G) has the necessary compatibility property to be a kernel used in convolutions per Def. 5.2.
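A sketch of Remark 5.9 in code (our own derivation of the closed-form SE(2) logarithm; the weights `w1, w2, w3` play the role of the diagonal metric parameters, and conventions may differ from the paper's by a sign or ordering):

```python
import numpy as np

def se2_log(x, y, theta):
    """Exponential coordinates of the first kind (c1, c2, c3), i.e. the
    Lie algebra element whose group exponential is the SE(2) element
    with translation (x, y) and rotation angle theta."""
    if abs(theta) < 1e-12:
        return x, y, theta            # theta -> 0 limit: log acts as the identity
    cot = np.cos(theta / 2.0) / np.sin(theta / 2.0)
    c1 = (theta / 2.0) * (x * cot + y)
    c2 = (theta / 2.0) * (y * cot - x)
    return c1, c2, theta

def rho_estimate(x, y, theta, w1, w2, w3):
    """Logarithmic metric estimate rho_G for a diagonal metric diag(w1, w2, w3)."""
    c1, c2, c3 = se2_log(x, y, theta)
    return np.sqrt(w1 * c1**2 + w2 * c2**2 + w3 * c3**2)
```

For θ = 0 the estimate reduces to a weighted Euclidean norm of the translation part, as one would expect.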

Proposition 5.10 (Kernel compatibility of ρ_G). Let G be a G-invariant metric tensor field on G/H; then we have ∀h ∈ H, p ∈ G/H : ρ_G(h ⋅ p) = ρ_G(p).
Note that, since we use left cosets, g h ⋅ p₀ = g ⋅ p₀ but h g ⋅ p₀ ≠ g ⋅ p₀ in general, so this requirement is not trivial. The proof of this proposition is included in Appendix A.2.
Now that we have developed and analyzed the logarithmic metric estimate, we can use it to construct an approximation to the diffusion kernel for α = 1 on Lie groups of polynomial growth (see the definition below). On such Lie groups this approximate kernel can be bounded from above and below by the exact kernels.

Definition 5.12 (Polynomial growth). A Lie group G with left-invariant Haar measure μ_G is of polynomial growth when the volume of a ball of radius r around g ∈ G, B(g, r) := {g' ∈ G : d(g, g') < r}, can be polynomially bounded as follows: there exist constants C > 0 and d_grow > 0 so that (1/C) r^{d_grow} ≤ μ_G(B(g, r)) ≤ C r^{d_grow}; take note that the exponent d_grow is the same on both the lower and the upper bound. Since μ_G is left-invariant, the choice of g does not matter.

Lemma 5.13. Let G be of polynomial growth and let K_t¹ be the fundamental solution to the α = 1 diffusion equation on G/H; then there exist constants C ≥ 1, C₁ ∈ (0, 1) and C₂ > 1 so that for all t > 0 the following holds: (1/C) K¹_{C₁ t}(p) ≤ K_t^{1,appr}(p) ≤ C K¹_{C₂ t}(p) for all p ∈ G/H.
Proof. If G is of polynomial growth we can apply [86, Thm. 2.12] to find that there exist constants c₁, c₂ > 0 and, for every ε > 0, a constant C_ε so that the exact heat kernel is bounded above and below by Gaussian-type expressions in the distance, normalized by the volume μ_G(B(p₀, √t)).

Remark 5.14 (Left vs. right cosets). Note that Maheux [86] uses right cosets while we use left cosets. We can translate the results easily by inversion in view of (g h)⁻¹ = h⁻¹ g⁻¹. We then apply the result of Maheux to the correct (invertible) G-invariant metric tensor field on G/H. Also note the different (but equivalent) way Maheux relates distance on the group to distance on the homogeneous space: while we use a pseudometric on G induced by a metric on G/H, Maheux uses a metric on G/H induced by a metric on G by minimizing over representatives within the cosets, for any choice of representative. We avoid having to minimize inside the cosets as in (39) thanks to the inherent symmetries of our pseudometric.

Now using the inequalities from Lemma 5.7 we may replace the exact metric by the logarithmic metric estimate in these bounds, at the cost of the constant C_metr. Applying these inequalities, reparametrizing time in both inequalities, and finally fixing ε > 0 and relabeling constants yields the claimed bounds; observe that since ε > 0 and C_metr ≥ 1 we have 0 < C₁ < 1.
Depending on the actually achievable constants, Lem. 5.13 provides either a very strong or a very weak bound on how much our approximation deviates from the fundamental solution. Fortunately, in the SE(2) case our approximation is very close to the exact kernel in the vicinity of the origin, as can be seen in Fig. 10. In our experiments we sample the kernel on a grid around the origin; hence this approximation is good for reasonable values of the metric parameters, as we may expect from Lemma 5.7 providing a second-order relative error.

Figure 10
Comparing the numerically computed heat kernel K¹_{t=1} (left) with our approximation K^{1,appr}_{t=1} based on the logarithmic norm estimate (right) for G/H = SE(2). Shown here at t = 1 with the same metric as in Fig. 9. Especially in deep learning applications, where discretization is very coarse, our approximation is sufficiently accurate as long as the spatial anisotropies do not become too high. In this case, with anisotropy 2, we have a relative ℓ² error of 0.23 in the plotted volume.

Now let us develop an approximation for values of α other than 1. From semigroup theory [88] it follows that semigroups generated by taking fractional powers of the generator (in our case Δ_{G₁} → −(−Δ_{G₁})^α) satisfy the following key relation between the α-kernel and the diffusion kernel: K_t^α = ∫₀^∞ q_{t,α}(τ) K_τ¹ dτ (40) for α ∈ (0, 1) and t > 0, where q_{t,α} is defined as follows; for explicit formulas of this kernel see [88, Ch. IX:11 eq. 17]. Since q_{t,α} is positive for all τ, it follows that K_t^α is also positive everywhere. Now instead of integrating K_τ¹ to obtain the exact fundamental solution, we can replace it with our approximation K_τ^{1,appr} to obtain an approximate α-kernel.
The bounds on K¹ that we obtained in Lem. 5.13 transfer directly to our approximation for other α.
Theorem 5.17. Let G be of polynomial growth and let K_t^α be the fundamental solution to the α ∈ (0, 1] diffusion equation on G/H; then there exist constants C ≥ 1, C₁ ∈ (0, 1) and C₂ > 1 so that for all t > 0 and p ∈ G/H the following holds: (1/C) K^α_{C₁ t}(p) ≤ K_t^{α,appr}(p) ≤ C K^α_{C₂ t}(p). Proof. This is a consequence of Lem. 5.13 and the fact that q_{t,α} is positive: applying the integral from (40) to the inequalities of Lem. 5.13 yields one inequality; the other inequality works the same way.
Although the approximation (41) is helpful in the proof above, it contains an integration and is not an explicit expression. Our initial experiments with diffusion for α = 1 showed that (for the applications under consideration, at least) adding diffusion did not improve performance. For that reason we chose not to focus further on diffusion in this work. We leave the development of a more explicit and computable approximation of the diffusion kernels for 0 < α < 1 for future work.

Figure 11
Shapes of the level sets of the kernels on 𝕄₂ for solving fractional diffusion and dilation/erosion for various values of the trainable metric tensor field parameters w₁, w₂ and w₃. This shape is essentially what is being optimized during the training of a metric tensor field on 𝕄₂.

Dilation and Erosion
The dilation/erosion step solves the PDE ∂W/∂t = ±‖∇W‖^{2α}_{G₂} with W(⋅, 0) = U₃. By a generalization of the Hopf–Lax formula [89, Ch. 10.3], the solution is given by a morphological convolution: −(k_t^α □ (−U₃)) for the (+) (dilation) variant and k_t^α □ U₃ for the (−) (erosion) variant, where the kernel k_t^α (also called the structuring element in the context of morphology) is given as follows.
Definition 5.18 (Dilation/erosion kernels). The morphological convolution kernel for small times t and α ∈ (1/2, 1] is given by k_t^α(p) := t^{−1/(2α−1)} ν_α d_{G₂}(p₀, p)^{2α/(2α−1)}, with ν_α := (2α − 1)(2α)^{−2α/(2α−1)}, and for α = 1/2 by k_t^{1/2}(p) := 0 if d_{G₂}(p₀, p) ≤ t and +∞ otherwise. In the above definition, and for the rest of the section, we write G₂ for either G₂⁺ or G₂⁻ depending on whether we are dealing with the dilation or the erosion variant. The morphological convolution □ (alternatively: the infimal convolution) is specified as follows.

Definition 5.19 (Morphological group convolution). We define (k □ f)(p) := inf_{g ∈ G} [ k(g⁻¹ ⋅ p) + f(g ⋅ p₀) ].

Remark 5.20 (Grayscale morphology). Morphological convolution is related to the grayscale morphology operations f₁ ⊕ f₂ (dilation) and f₁ ⊖ f₂ (erosion) on ℝⁿ, where f₁ and f₂ are proper functions on ℝⁿ; hence our use of the terms dilation and erosion. Mathematically, however, we will only use □ as the actual operation to be performed and avoid ⊕ and ⊖.
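On the line (G = G/H = ℝ) the kernel of Def. 5.18 and the morphological convolution can be sketched as follows (our own discretization and naming; the dilation variant is obtained from erosion by conjugation with negation):

```python
import numpy as np

def morph_kernel(d, t, alpha):
    """Structuring element k_t^alpha as a function of the distance d >= 0."""
    if alpha == 0.5:                      # flat kernel: 0 on the ball of radius t
        return np.where(d <= t, 0.0, np.inf)
    nu = (2 * alpha - 1) * (2 * alpha) ** (-2 * alpha / (2 * alpha - 1))
    return t ** (-1 / (2 * alpha - 1)) * nu * d ** (2 * alpha / (2 * alpha - 1))

def erode(f, xs, t, alpha):
    """Morphological (infimal) convolution (k box f)(x) = inf_y k(|x - y|) + f(y)."""
    return np.array([np.min(morph_kernel(np.abs(x - xs), t, alpha) + f) for x in xs])

def dilate(f, xs, t, alpha):
    """Dilation variant: -(k box (-f))."""
    return -erode(-f, xs, t, alpha)
```

For α = 1/2 this reduces to a sliding minimum/maximum over the ball of radius t, i.e. to min/max pooling.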
The semigroup property follows directly from [90, Thm 2.1(ii)]. This theorem builds on the work by Balogh et al. [90], who provide a solution concept that is (potentially) different from the strong, weak or viscosity solution. The point of departure is to replace the norm of the gradient (i.e. the dual norm of the differential) with a metric subgradient, which gives a solution concept in terms of this slightly different notion of a gradient. In view of the classical results on Hamilton–Jacobi equations (cf. [89]), one expects that the solutions (44), (45) of (43) are indeed the viscosity solutions (for resp. the + and − case) for α = 1/2.
In many matrix Lie group quotients, like the Heisenberg group H(2d + 1) studied in [94], or in our case of interest (the homogeneous space of positions and orientations), this is indeed the case. One can describe the G-invariant vector fields via explicit coordinates and transfer HJB systems on G/H directly towards HJB systems on ℝⁿ or ℝ^d × S^q, with n = d + q = dim(G/H). Then one can directly apply results by Dragoni [91, Thm. 4] and deduce that our solutions, the dilations in (44) resp. the erosions in (45), are indeed the unique viscosity solutions of the HJB-PDE system (43) for the + and − case, for all α ∈ [1/2, 1]. Details are left for future research.
To get an idea of how the kernel in (46) operates in conjunction with morphological convolution, we take G = G/H = ℝ and see how the operation evolves simple data; the kernels and the results at t = 1 are shown in Fig. 12. Observe that with α close to 1/2 (kernel and result in red) we obtain what amounts to an equivariant version of max/min pooling.

Figure 12
In the center we have kernels of the type (46) in ℝ (or the signed distance on a manifold of choice) for some α ∈ (1/2, 1] and t = 1, which solve dilation/erosion. For α → 1/2 this kernel converges to the type in (47), i.e. the solution is obtained by max/min pooling. On the left we morphologically convolve a spike (in gray) with a few of these kernels; we see that for α → 1/2 we get max pooling, and conversely we can call the case α > 1/2 soft max pooling. On the right we similarly erode a plateau, which for α → 1/2 yields min pooling. The effects of these operations in the image processing context can also be seen in the last two columns of Fig. 6.
The level sets of the kernels for α > 1/2 have the same shape as those of the approximate diffusion kernels, see Fig. 11; for α = 1/2 these are the stencils over which we would perform min/max pooling.
Remark 5.24. The level sets in Fig. 11 are balls in G/H = 𝕄₂ that do not depend on α; it is only the growth of the kernel values when passing through these level sets that depends on α. As such, the example G/H = ℝ of Fig. 12 is very representative of the general G/H case: Fig. 12 still applies when one replaces the horizontal ℝ-axis with a signed distance along a minimizing geodesic in G/H passing through the origin. In that sense α ∈ [1/2, 1] regulates soft-max pooling over Riemannian balls in G/H.
We can now define a more tangible approximate kernel by again replacing the exact metric d_{G₂} with the logarithmic approximation ρ_{G₂}.

Definition 5.25 (Approximate dilation/erosion kernel). The approximate morphological convolution kernel k_t^{α,appr} for small times t and α ∈ (1/2, 1] is given by k_t^{α,appr}(p) := t^{−1/(2α−1)} ν_α ρ_{G₂}(p)^{2α/(2α−1)}, with ν_α := (2α − 1)(2α)^{−2α/(2α−1)}, and for α = 1/2 by k_t^{1/2,appr}(p) := 0 if ρ_{G₂}(p) ≤ t and +∞ otherwise.

We used this approximation in our parallel GPU algorithms (for our PDE-G-CNN experiments in Section 7). It is highly preferable over the 'exact' solution based on the true distance, as the latter would require Eikonal PDE solvers [29,95], which would not be practical for parallel GPU implementations of PDE-G-CNNs. Again, the approximations are reasonable as long as the spatial anisotropy does not get too high; see Fig. 9 for an example.
Next we formalize the theoretical underpinning of these approximations in the upcoming corollaries.
An immediate consequence of Def. 5.25 and Lem. 5.7 (keeping in mind that the kernel expressions in Def. 5.25 are monotonic w.r.t. ρ := ρ_{G₂}(p)) is that we can enclose our approximate morphological kernel between the exact morphological kernels, in the same way as we did for the (fractional) diffusion kernel in Theorem 5.17; this proves the corresponding corollaries. For the case α = 1/2 the approximation is even exact in an inner and an outer region. Alternatively, instead of bounding by value we can bound in time, in which case we do not need to distinguish different cases of α. With these two bounds on our approximate morphological kernels we end our theoretical results.

Generalization of (Group-)CNNs
In this section we point out the similarities between common (G-)CNN operations and our PDE-based approach. Our goal here is not so much to claim that our PDE approach serves as a useful model for analyzing (G-)CNNs, but rather to note that modern CNNs already bear some resemblance to a network of PDE solvers. Noticing that similarity, our approach just takes the next logical step by structuring a network to explicitly solve a set of PDEs.

Discrete Convolution as Convection & Diffusion
Now that we have seen how PDE-G-CNNs are designed, we show how they generalize conventional G-CNNs. Starting with an initial condition U, we show how group convolution with a general kernel k can be interpreted as a superposition of solutions (27) of convection PDEs: (k *_{G/H} U)(p) = ∫_G k(g⁻¹ ⋅ p) U(g ⋅ p₀) dμ_G(g); for any p = g_p ⋅ p₀ we now change variables to q = g_p⁻¹ g and recall that μ_G is left-invariant: (k *_{G/H} U)(p) = ∫_G k(q⁻¹ ⋅ p₀) U(g_p q ⋅ p₀) dμ_G(q). In this last expression we recognize (27): for fixed q, the map p = g_p ⋅ p₀ ↦ U(g_p q ⋅ p₀) is the solution of the convection PDE (25) at time t = 1 for the left-invariant convection vector field whose flow is given by right multiplication with exp_G(t log_G q). As a result, the output k *_{G/H} U can be seen as a weighted sum of solutions over all possible left-invariant convection vector fields.
Using this result we can consider what happens in the discrete case, where we take the kernel k to be a linear combination of displaced diffusion kernels (for some choice of α): for each i we fix a weight a_i ∈ ℝ, a diffusion time t_i ≥ 0 and a displacement g_i ∈ G. Convolving with this kernel and changing variables to h = g g_i, we again recognize the convection solutions, now displaced by g_i. Subsequently we take these solutions and convolve them with a (fractional) diffusion kernel with scale t_i, i.e. after convection we apply the fractional diffusion PDE with evolution time t_i, and finally we make a linear combination of the results.
We can conclude that G-CNNs fit in our PDE-based model by looking at a single discretized group convolution as a set of single-step PDE units working on an input, without the morphological convolution and with specific choices made for the convection vector fields and diffusion times.

Max Pooling as Morphological Convolution
The ordinary max pooling operation commonly found in convolutional neural networks can also be seen as a morphological convolution with a kernel for α = 1/2. Proposition 6.1 (Max pooling). Let f ∈ L^∞(G/H), let S ⊂ G/H be non-empty, and define the kernel k_S : G/H → ℝ ∪ {∞} to be 0 on S and +∞ outside of S. Then the morphological convolution k_S □ f computes, at every point, an extremum of f over the stencil S transformed to that point; we recognize this as a generalized form of max pooling of the function f with stencil S.
Proof. Filling (51) into Def. 5.19 directly yields the result. In particular cases we recover a more familiar form of max pooling, as the following corollary shows.
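Prop. 6.1 can be checked numerically in the translation-only case: morphological convolution with the 0/∞ indicator kernel of a stencil reduces to a sliding minimum, and max pooling is recovered by negating the signal (a self-contained sketch with our own naming):

```python
import numpy as np

def morph_conv_indicator(f, xs, r):
    """Morphological convolution with k_S = 0 on S = [-r, r], +inf outside:
    a sliding minimum of f over the stencil S."""
    k = lambda d: np.where(np.abs(d) <= r, 0.0, np.inf)
    return np.array([np.min(k(x - xs) + f) for x in xs])

def max_pool(f, xs, r):
    """Max pooling over the stencil, obtained as -(k_S box (-f))."""
    return -morph_conv_indicator(-f, xs, r)
```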
The observation that max pooling is a particular limiting case of morphological convolution allows us to think of the case α > 1/2 as a soft variant of max pooling, one that is better behaved under small perturbations in a discretized context.

ReLUs as Morphological Convolution
Max pooling is not the only common CNN operation that can be generalized by morphological convolution, as the following proposition shows. Proof. Filling the kernel into the definition of morphological convolution and using that, due to the continuity and compact support of f, its supremum exists, we moreover have sup_{p ∈ G/H : k(p) ≠ 0} f(p) = sup_{p ∈ G/H} f(p), and thereby we obtain the required result max(f(p), 0).
We conclude that morphological convolution allows us to:
• do pooling in an equivariant manner with transformations other than translation,
• do soft pooling that is continuous under domain transformations (illustrated in Fig. 12),
• learn the pooling region by considering the kernel as trainable,
• effectively fold the action of a ReLU into trainable non-linearities.

Residual Networks
So-called residual networks [67] were introduced mainly as a means of dealing with the vanishing gradient problem in very deep networks, aiding trainability. These networks use residual blocks, illustrated in Fig. 13, that feature a skip connection: a few layers are grouped together to produce a delta map that gets added to the input.

Figure 13
A residual block, as in [67]; note the resemblance to a forward Euler discretization scheme.
This identity-plus-delta structure is very reminiscent of a forward Euler discretization scheme. If we had an evolution equation of the type ∂u/∂t (x, t) = F(u(⋅, t), x) for x ∈ M, t ≥ 0, with some operator F : C^∞(M) × M → ℝ, we could solve it approximately by stepping forward with u(x, t + Δt) = u(x, t) + Δt F(u(⋅, t), x) for some time step Δt > 0. We see that this is very similar to what is implemented in the residual block in Fig. 13 once we discretize it.
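The correspondence can be made concrete with a toy evolution (a sketch; `F` is a generic right-hand side of our own choosing):

```python
import numpy as np

def residual_step(u, F, dt):
    """One forward Euler step u <- u + dt * F(u): structurally the same
    'identity plus delta' computation as a residual block."""
    return u + dt * F(u)

# Example: evolving du/dt = -u (exponential decay) by stacking residual steps.
u = np.array([1.0])
for _ in range(100):
    u = residual_step(u, lambda v: -v, 0.01)
# after 100 steps of size 0.01, u approximates exp(-1)
```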
The correspondence is far from exact given that multiple channels are being combined in residual blocks, so we cannot easily describe a residual block with a PDE. Still, our takeaway is that residual networks and skip connections have moved CNNs from networks that change data to networks that evolve data.
For this reason we speculate that deep PDE-G-CNNs will not need (or benefit as much from) skip connections; we leave this subject for future investigation. More discussion on the relation between residual networks and PDEs can be found in [77].

Experiments
To demonstrate the viability of PDE-based CNNs, we perform two experiments in which we compare the performance of PDE-G-CNNs against G-CNNs and classic CNNs. We consider a vessel segmentation and a digit classification problem: two straightforward applications of CNNs. Examples of these two applications are illustrated in Fig. 14.
The goal of the experiments is to compare the basic building blocks of the different types of networks in clearly defined feed-forward network architectures. We therefore test networks of modest size only and do not aim for the performance that would be possible with large-scale networks.
(a) Example from the DRIVE [96] dataset, showing a retinal image and its vessel segmentation.

Figure 14
We perform a segmentation experiment on retinal vessel images and a classification experiment on rotation augmented digits.

Implementation
We implemented our PDE-based operators in an extension to the PyTorch deep learning framework [98]. Our package is called LieTorch and is open source. It is available at https://gitlab.com/bsmetsjr/lietorch.
The operations we have proposed in the paper have been implemented in C++ for CPUs and CUDA for Nvidia GPUs but can be used from Python through PyTorch. Our package was also designed with modularity in mind: we provide a host of PyTorch modules that can be used together to implement the PDE-G-CNNs we proposed but that can also be used separately to experiment with other architectures.
All the modules we provide are differentiable and so our PDE-G-CNNs are trainable through stochastic gradient descent (or its many variants) in the usual manner. In our experiments we have had good results with using the ADAM [99] optimizer.
All the network models and training scripts used in the experiments are also available in the repository.

Design Choices
Several design choices are common to both experiments; we go over these now.
First, we choose G/H = 𝕄₂ for our G-CNNs and PDE-G-CNNs and so go for roto-translation equivariant networks. In all instances we lift to 8 orientations.
Second, we use the convection, dilation and erosion version of (24); hence we refer to these networks as PDE-CNNs of the CDE type. Each PDE layer is implemented as in Fig. 1 with the single-pass PDE solver from Fig. 2, but without the convolution. So no explicit diffusion is used, and the layer consists of just a resampling and two morphological convolutions. Since we do the resampling using trilinear interpolation, this does introduce a small amount of implicit diffusion.
Remark 7.1 (Role of diffusion). In these experiments we found no benefit in adding diffusion to the networks. Diffusion would likely be of benefit when the input data is noisy, but neither of the datasets we used is noisy and we have not yet performed experiments with added noise. We leave this investigation for future work.
Third, we fix α = 0.65. We came to this value empirically; the networks performed best with α-values in the range 0.6–0.7. Looking at Fig. 12, we conjecture that α = 0.65 is the "sweet spot" between sharpness and smoothness. When the kernel is too sharp (α close to 1/2), minor perturbations in the input can have large effects on the output; when the kernel is too smooth (α close to 1), the output is smoothed out too much as well.
Fourth, all our networks are simple feed-forward networks.
Finally, we use the ADAM optimizer [99] together with ℓ₂ regularization applied uniformly over all parameters with a factor of 0.005.

DRIVE Retinal Vessel Segmentation
The first experiment uses the DRIVE retinal vessel segmentation dataset [96]. The objective is to generate a binary mask indicating the location of blood vessels in a color image of a retina, as illustrated in Fig. 14(a).
We test 6- and 12-layer variants of a CNN, a G-CNN and a CDE-PDE-CNN. The layout of the 6-layer networks is shown in Fig. 15; the 12-layer networks simply add more convolution, group convolution or CDE layers. All networks were trained on the same training data and tested on the same testing data.
The output of the network is passed through a sigmoid function to produce a 2D map of values y in the range [0, 1], which we compare against the known segmentation map g with values in {0, 1}. We use the continuous DICE coefficient as the loss function: loss(y, g) = 1 − (2 Σ y g + ε) / (Σ y + Σ g + ε), where the sums Σ run over all values of the 2D maps. A relatively small ε = 1 is used to avoid divide-by-zero issues and the y ≡ g ≡ 0 edge case.
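The loss can be sketched as follows (our own reconstruction of the standard soft-DICE form, with ε = 1 as in the text):

```python
import numpy as np

def dice_loss(y_pred, y_true, eps=1.0):
    """Continuous DICE loss: 1 - (2|X ∩ Y| + eps) / (|X| + |Y| + eps),
    with intersection and sizes replaced by sums over the soft maps.
    eps avoids division by zero and handles the all-zero edge case."""
    inter = np.sum(y_pred * y_true)
    return 1.0 - (2.0 * inter + eps) / (np.sum(y_pred) + np.sum(y_true) + eps)
```

A perfect binary prediction gives loss 0, and so does the all-zero edge case thanks to ε.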
The 6-layer networks were trained over 60 epochs, starting with a learning rate of 0.01 that we decay exponentially with a gamma of 0.95. The 12-layer networks were trained over 80 epochs, starting from the same learning rate but with a learning rate gamma of 0.96.
We measure the performance of the network by the DICE coefficient obtained on the 20 images of the testing dataset. We trained each model 10 times, the results of which are summarized in Tbl. 1 and Fig. 16(a).
We achieve similar or better performance than CNNs and G-CNNs, but with a vast reduction in parameters. Scaling from 6 to 12 layers even allows us to reduce the total number of parameters of the PDE-G-CNN while still increasing performance; this is achieved by reducing the number of channels (i.e. the width) of the network, see also Tbl. 2.

Model
Parameters DICE score ± std.dev.

RotNIST Digit Classification
The second experiment is the classic digit classification experiment. Instead of using the plain MNIST dataset, we perform the experiment on the RotNIST dataset [97], which contains the same images as MNIST but rotated to various degrees. Even though classifying rotated digits is a fairly artificial problem, we include this experiment to show that PDE-G-CNNs also work in a context very different from the first segmentation experiment. While our choice of PDEs derives from more traditional image processing methods, this experiment shows their utility in a basic image classification context.

Figure 15
Schematic of the 6-layer models used in our segmentation experiments. Kernel sizes and the number of feature channels in each layer are indicated; depth indicates that the data lives on 𝕄₂. Omitted are activation functions, batch normalization, padding and dropout modules. The 12-layer models are essentially the same but with double the number of layers and, for the CDE-PDE-CNN, a reduced number of channels per layer (i.e. reduced width), hence the reduction in parameters going from 6 to 12 layers.
We tested three networks: the classic LeNet5 CNN [100] as a baseline, a 4-layer G-CNN and a 4-layer CDE-PDE-CNN. The architectures of these three networks are illustrated in Fig. 17.
All three networks were trained on the same training data and tested on the same testing data. We train with a learning rate of 0.05 and a learning rate gamma of 0.96. We trained the LeNet5 model for 120 epochs and the G-CNN and CDE-PDE-CNN models for 60 epochs.
We measure the performance of the network by its accuracy on the testing dataset. We trained each model 10 times, the results of which are summarized in Tbl. 3 and Fig. 16(b).
We manage to get better performance than classic or group CNNs with far fewer parameters.

Computational Performance
Care was taken in optimizing the implementation to show that PDE-based networks can still achieve decent running times despite their higher computational complexity. In Tbl. 4 we summarize the inference performance of each model we experimented with.
Our approach simultaneously gives us equivariance, a decrease in parameters and higher performance, but at the cost of an increase in FLOPs and memory footprint. While our implementation is reasonably optimized, it has had far less development time dedicated to it than the traditional CNN implementation provided by PyTorch/cuDNN, so we are confident more performance gains can be found.
In comparison with G-CNNs our PDE-based networks are generally a little faster. Our G-CNN implementation is, however, less optimized than our PDE-G-CNN implementation. Were our G-CNN implementation equally optimized, we expect G-CNNs would be slightly faster than the PDE-G-CNNs in our experiments.

Table 2
Allocation of parameters for the 6- and 12-layer CDE-PDE-CNNs used in the vessel segmentation experiment. The added depth of the networks allows us to shrink the width. With the network having fewer channels overall we can also shrink the number of channels in the lifting layer, which drastically reduces the total number of parameters.
(a) Performance on retinal vessel segmentation. We test 6- and 12-layer variants of conventional CNNs, G-CNNs and our PDE-CNNs; each network is trained 10 times and the chart shows the distribution of DICE performances on the test dataset.
(b) Performance of digit classification on the RotNIST dataset. We compare the classic 5-layer LeNet against a 4-layer G-CNN and PDE-CNN. LeNet was trained for 120 epochs, the other two for 60 epochs.

Figure 16
Comparison of PDE-based networks against conventional CNNs and group CNNs on segmentation and classification tasks.

Conclusion
In this article we presented the general mathematical framework of geometric PDEs on homogeneous spaces that underlies our PDE-G-CNNs. PDE-G-CNNs allow for a geometric and probabilistic interpretation of CNNs, opening up new avenues for the study and development of these types of networks. Additionally, we showed that PDE-G-CNNs achieve increased performance with a reduction in parameters.
PDE-G-CNNs ensure equivariance by design. The trainable parameters are geometrically relevant: they are left-invariant vector and tensor fields.
PDE-G-CNNs have three types of layers: convection, diffusion and erosion/dilation layers. We have shown that these layers implicitly include standard nonlinear operations in CNNs such as max pooling and ReLU activation.
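To illustrate how morphological operations subsume max pooling, consider the max-plus dilation of a 1D signal: with a flat (zero) kernel it reduces exactly to a sliding-window maximum, i.e. max pooling with stride 1. This is a minimal Euclidean NumPy sketch of the max-plus structure only, not the morphological group convolutions of the LieTorch implementation.

```python
import numpy as np

def dilation_1d(f, k):
    """Morphological (max-plus) convolution:
    (f ⊕ k)(x) = max_y [ f(y) + k(x - y) ], with k of odd length."""
    n, m = len(f), len(k)
    r = m // 2
    out = np.full(n, -np.inf)
    for x in range(n):
        # restrict y to indices where both f[y] and k[x - y + r] exist
        for y in range(max(0, x - r), min(n, x + r + 1)):
            out[x] = max(out[x], f[y] + k[x - y + r])
    return out
```

Replacing max with min gives erosion; training a non-flat kernel (as the PDE-G-CNN layers do) generalizes these pooling-like operations beyond the flat-kernel special case.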
To efficiently evaluate PDE evolution in the layers, we provided tangible analytical approximations to the relevant kernel operators on homogeneous spaces. In this article we have underpinned the quality of the approximations in Theorem 5.17 and Theorem 5.21.
With two experiments we have verified that PDE-G-CNNs can improve performance over G-CNNs in the context of automatic vessel segmentation and digit classification. Most importantly, the performance increase is achieved with a vast reduction in the number of trainable parameters.

Figure 17
Schematic of the three models tested with the RotNIST data. Kernel sizes and number of feature channels in each layer are indicated. Omitted are activation functions, batch normalization and dropout modules.

Model | Parameters | Error rate ± std. dev.
Call = exp , then the previous inequality gives Clearly ∈ . Since G ( ) is the infimum of ‖ log ‖ for all ∈ it follows that G ( ) must also satisfy: for all ∈ 0 , i.e. all sufficiently close to 0 .
As a corollary we get that for any compact neighborhood ⊂ ∕ of 0 for all ∈ . We can see this by choosing 2 1 = 1 + sup ∈ ) G ( 0 , ) 2 , then for all by choosing ∈ 0 we have Let ⊂ ∕ be compact so that it contains 0 . Then on ⧵ 0 we have that both G and G ( 0 , ⋅) are strictly positive, continuous and so bounded functions. Consequently for all ∈ ⧵ 0 . Now choose metr = max{ 1 , 2 } then we obtain the corollary. Remark that metr depends on both the parameters of the metric tensor field and the choice of , and so may become very large indeed.

A.2 Proof of Proposition 5.10
As a preliminary we prove the following lemma.
Lemma For all g ∈ G let L_g : G → G be the left group multiplication given by L_g(h) = gh and let R_g : G → G be the right group multiplication given by R_g(h) = hg. Let H be a closed subgroup of G with the projection map π : G → G∕H given by π(g) = gH.
From this we conclude that h(⋅)h⁻¹ ∈ Γ_{0,h⋅}, and so there is a bijection between Γ_{0,⋅} and Γ_{0,h⋅} given by this conjugation. Moreover, the bijection preserves the seminorm due to the G-invariance of 𝒢: