1 Introduction

In this work we generalize \(\mathbb {R}^2\) convolutional neural networks (CNNs) to SE(2) group CNNs (G-CNNs) in which the data lives on position orientation space, and in which the convolution layers are defined in terms of representations of the special Euclidean motion group SE(2). In essence this means that we replace the convolutions (with translations of a kernel) by SE(2) group convolutions (with roto-translations of a kernel). The advantage of the proposed approach compared to standard \(\mathbb {R}^2\) CNNs is that rotation covariance is encoded in the network design and does not have to be learned by the convolution kernels. For example, a feature that may appear in the data under several orientations does not have to be learned for each orientation, but only once. As a result, there is no need for data augmentation by rotation, and the kernel weights (which no longer need to learn rotation covariance) become available to increase the CNN's expressive capacity. Moreover, the proposed group convolution layers are compatible with standard CNN modules, allowing for easy integration in popular CNN designs.

A main objective of medical image analysis is to develop models that are invariant to the shape and appearance variability of the structures of interest, including their arbitrary orientations. Rotation invariance is therefore a desired property, and our G-CNN framework provides it generically. We show state-of-the-art results with improvement over standard 2D CNNs on three different medical imaging tasks: mitosis detection in histopathology images, vessel segmentation in retinal images and cell boundary segmentation in electron microscopy (EM).

1.1 Related Work

In relation to other approaches that incorporate rotation invariance/covariance in the network design, such as harmonic networks [1], local transformation invariance learning [2], deep symmetry nets [3], scattering CNNs [4, 5], and warped convolutions [6], the group convolution approaches [4, 5, 7, 8, 9, 10] most naturally extend the standard CNNs by simply replacing the convolution operators.

In the work by Cohen and Welling [7] a comprehensive theoretical framework for G-CNNs is developed for discrete groups whose transformations stay on the pixel grid. In particular their focus was on the wall-paper groups p4 (group of translations + \(90^\circ \) rotations), for which a G-CNN approach was also developed by Dieleman et al. [8], and p4m (p4 + reflections). In their work it was convincingly demonstrated that including such symmetries, by replacing standard convolutions by group convolutions, substantially increases the network’s performance without increasing the number of network variables. Although their theoretical G-CNN framework [7] holds for more general groups, their actual application scope was limited to discrete groups that stay on the pixel grid. In this paper, we are not restricted to such groups, but include efficient bi-linear interpolation that allows us to employ the full structure of the continuous roto-translation group SE(2), which we can discretize to the sub-group SE(2, N), with N rotations. Special cases of our framework are standard 2D CNNs when \(N=1\) and the p4 G-CNNs as proposed in [7, 8] when \(N=4\).

In very recent work, Weiler et al. [9] describe a different approach to SE(2) G-CNNs. Instead of relying on interpolation they used 2D complex-valued steerable kernels, which has the advantage that kernel rotations are exact. A disadvantage is, however, that these kernels are constrained to a specific combination of complex-valued basis functions. With our interpolation approach, kernel rotation simply appears in the CNN architecture as a (sparse) matrix-vector multiplication that maps a set of base weights to a full set of rotated kernels.

In work by Mallat, Oyallon, and Sifre [4, 5] roto-translation invariant deep networks are formulated in the context of scattering theory. Their design involves a concatenation of separable group convolutions with hand-crafted (but well underpinned) filters, followed by the modulus as activation function. Learning takes place via support vector machines on the generated SE(2) invariant descriptors. In our approach, the filters are learned without restrictions, the convolutions do not have to be separable, and we here use the common ReLU activation function.

In work by Bekkers et al. [10], an effective template matching method was proposed using group correlations in orientation scores, which are SE(2) images obtained from a 2D image via lifting convolutions with a specific choice of kernel [11]. The SE(2) templates were put in a B-spline basis (allowing for exact kernel rotations) and optimized via logistic regression. Their architecture fits within our framework as a single channel G-CNN of depth 2 with a fixed lifting kernel.

2 SE(2) Convolutional Neural Networks

2.1 Group Theoretical Preliminaries

The Lie Group SE(2): The group \(SE(2) = \mathbb {R}^2 \rtimes SO(2)\) is the semi-direct product of the group of planar translations \(\mathbb {R}^2\) and rotations SO(2), and its group product is given by

$$\begin{aligned} g \cdot g' = ( \mathbf {x},\mathbf {R}_\theta ) \cdot ( \mathbf {x}', \mathbf {R}_{\theta '} ) = ( \mathbf {R}_\theta \mathbf {x}' + \mathbf {x}, \mathbf {R}_{\theta + \theta '}), \end{aligned}$$
(1)

with group elements \(g = (\mathbf {x},\theta ),g' = (\mathbf {x}',\theta ') \in SE(2)\), with translations \(\mathbf {x},\mathbf {x}'\) and planar rotations by \(\theta ,\theta '\). The group acts on the space of positions and orientations \(\mathbb {R}^2 \times S^1\) via \( g \cdot (\mathbf {x}',\theta ') = ( \mathbf {R}_\theta \mathbf {x}' + \mathbf {x}, \theta +\theta '). \) Since \((\mathbf {x}, \mathbf {R}_\theta ) \cdot ( \mathbf {0}, 0 ) = ( \mathbf {x} , \theta )\), we can identify the group SE(2) with the space of positions and orientations \(\mathbb {R}^2 \times S^1\). As such we will often write \(g=(\mathbf {x},\theta )\), instead of \((\mathbf {x},\mathbf {R}_\theta )\). Note that \(g^{-1} = ( -\mathbf {R}_{\theta }^{-1} \mathbf {x},-\theta )\) since \(g \cdot g^{-1} = g^{-1} \cdot g = (\mathbf {0},0)\).
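As a numerical sanity check of Eq. (1) and of the stated inverse, the following minimal sketch implements the group product and inverse with NumPy. The function names `rot`, `se2_product` and `se2_inverse` are illustrative choices, not part of any published implementation, and angles are reduced modulo \(2\pi\) as an assumption about how \(\theta\) is represented.

```python
import numpy as np

def rot(theta):
    """2x2 planar rotation matrix R_theta."""
    c, s = np.cos(theta), np.sin(theta)
    return np.array([[c, -s], [s, c]])

def se2_product(g, h):
    """Group product of Eq. (1): (x, theta) . (x', theta') = (R_theta x' + x, theta + theta')."""
    (x, theta), (xp, thetap) = g, h
    return rot(theta) @ xp + x, (theta + thetap) % (2 * np.pi)

def se2_inverse(g):
    """Inverse element g^{-1} = (-R_theta^{-1} x, -theta)."""
    x, theta = g
    return -rot(-theta) @ x, (-theta) % (2 * np.pi)

# Hypothetical group element used only for the check.
g = (np.array([1.0, 2.0]), np.pi / 3)

x_right, theta_right = se2_product(g, se2_inverse(g))   # g . g^{-1}
x_left, theta_left = se2_product(se2_inverse(g), g)     # g^{-1} . g
```

Both products return the identity \((\mathbf{0}, 0)\) up to floating point error, with the angle compared modulo \(2\pi\).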

Group Representations: The structure of the group can be mapped to other mathematical objects (such as 2D images) via representations. The left-regular SE(2) representation on 2D images \(f\in \mathbb {L}_2(\mathbb {R}^2)\) is given by

$$\begin{aligned} (\mathcal {U}_g f) (\mathbf {x}') = f(\mathbf {R}_\theta ^{-1} (\mathbf {x}' - \mathbf {x})), \end{aligned}$$
(2)

with \(g = (\mathbf {x},\theta ) \in SE(2), \; \mathbf {x}' \in \mathbb {R}^2\). It corresponds to a roto-translation of the image. The left-regular representation on functions \(F\in \mathbb {L}_2(SE(2))\) on SE(2), which we refer to as SE(2)-images, is given by

$$\begin{aligned} (\mathcal {L}_g F) (g')= F(g^{-1} \cdot g') = F(\mathbf {R}_\theta ^{-1} (\mathbf {x}' - \mathbf {x}), \theta ' - \theta ), \end{aligned}$$
(3)

with \(g=(\mathbf {x},\theta ),g'=(\mathbf {x}',\theta ') \in SE(2)\). It is a shift-twist (rotation + \(\theta \)-shift) of F, see e.g. Figure 1. Next we define the G-CNN layers in terms of these representations.
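The shift-twist of Eq. (3) can be illustrated exactly for rotations by multiples of \(90^\circ \), where no interpolation is needed. The sketch below assumes a sampled SE(2)-image stored as an array of shape `(4, H, W)` with the orientation axis first; the helper name `shift_twist` is hypothetical, and whether `np.rot90` corresponds to a positive or negative angle depends on the chosen image axis convention.

```python
import numpy as np

def shift_twist(F, i):
    """Apply L_g for a pure rotation g = (0, i * 90 degrees) to F of shape (4, H, W).

    Each orientation plane is rotated spatially by i quarter turns, and the
    orientation axis is shifted cyclically by i samples (theta' - theta in Eq. (3)).
    """
    return np.roll(np.rot90(F, i, axes=(1, 2)), i, axis=0)

rng = np.random.default_rng(0)
F = rng.standard_normal((4, 5, 5))     # hypothetical SE(2,4) image

# Representation property for pure rotations: L_g L_h = L_{g.h}.
composed = shift_twist(shift_twist(F, 1), 2)
direct = shift_twist(F, 3)
full_turn = shift_twist(F, 4)          # four quarter turns give back F
```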

2.2 The SE(2) Group Convolution Layers

In CNNs one can take a convolution or a cross-correlation viewpoint and since these operators simply relate via a kernel reflection, the terminology is often used interchangeably. We take the second viewpoint: our G-CNNs are implemented using cross-correlations. On \(\mathbb {R}^2\) we define cross-correlation via inner products of translated kernels:

$$\begin{aligned} (k \star _{\mathbb {R}^2} f)(\mathbf {x}) := (\mathcal {T}_\mathbf {x} k, f)_{\mathbb {L}_2(\mathbb {R}^2)} := \int _{\mathbb {R}^2} k(\mathbf {x}' - \mathbf {x}) f(\mathbf {x}') \mathrm{d}\mathbf {x}', \end{aligned}$$
(4)

with \(\mathcal {T}_\mathbf {x}\) the translation operator, the left-regular representation of the translation group \((\mathbb {R}^2,+)\). In the SE(2) lifting layer we now simply replace translations of k by roto-translations via the SE(2) representation \(\mathcal {U}_g\) defined in Eq. (2).
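The relation between the two viewpoints can be checked on a discrete 1D analogue of Eq. (4), \(\sum _n k[n] f[x+n]\). The sketch below uses NumPy's built-in `np.correlate` and `np.convolve`; the signal and kernel are random placeholders.

```python
import numpy as np

rng = np.random.default_rng(0)
f = rng.standard_normal(16)    # hypothetical 1D signal
k = rng.standard_normal(5)     # hypothetical kernel

# Cross-correlation: out[x] = sum_n k[n] * f[x + n].
corr = np.correlate(f, k, mode="valid")

# Convolution with the reflected kernel yields the same values.
conv_reflected = np.convolve(f, k[::-1], mode="valid")
```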

The \({{\varvec{SE(2)}}}\) Lifting Layer: Let \(\underline{f},\underline{k}:\mathbb {R}^2\rightarrow \mathbb {R}^{N_c}\) be a vector valued 2D image and kernel (with \(N_c\) channels), with \(\underline{f} = (f_{1},\dots ,f_{N_c})\) and \(\underline{k} = (k_{1},\dots ,k_{N_c})\), then the group lifting correlations for vector valued images are defined by

$$\begin{aligned} (\underline{k}\,\tilde{\star } \underline{f})(g) := \sum \limits _{c=1}^{N_c} ( \mathcal {U}_g k_c, f_c )_{\mathbb {L}_2(\mathbb {R}^2)} = \sum \limits _{c=1}^{N_c} \int _{\mathbb {R}^2} k_c(\mathbf {R}_\theta ^{-1} (\mathbf {y} - \mathbf {x})) f_c(\mathbf {y}) \mathrm{d}\mathbf {y}. \end{aligned}$$
(5)

These correlations lift 2D image data to data that lives on the 3D position orientation space \(\mathbb {R}^2\times S^1 \equiv SE(2)\). The lifting layer that maps from a vector image \(\underline{f}^{(l-1)}:\mathbb {R}^2 \rightarrow \mathbb {R}^{N_{l-1}}\), with \(N_{l-1}\) channels at layer \(l-1\), to an SE(2) vector image \(\underline{F}^{(l)}:SE(2)\rightarrow \mathbb {R}^{N_l}\) using a set of \(N_l\) kernels \(\mathbf {k}^{(l)} := (\underline{k}_1^{(l)},\dots ,\underline{k}_{N_l}^{(l)})\), each with \(N_{l-1}\) channels, is then defined by

$$\begin{aligned} \underline{F}^{(l)} = \mathbf {k}^{(l)} \tilde{\star } \underline{f}^{(l-1)} := \left( \;\; \underline{k}_{1}^{(l)} \tilde{\star } \underline{f}^{(l-1)} \;\; , \;\;\dots \; , \;\; \underline{k}_{N_{l}}^{(l)} \tilde{\star } \underline{f}^{(l-1)} \;\;\right) . \end{aligned}$$
(6)
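For \(N=4\) the lifting correlation of Eq. (5) can be sketched without interpolation, since \(90^\circ \) kernel rotations stay on the pixel grid. The code below is an illustration under stated assumptions, not the authors' implementation: the names `correlate2d_valid` and `lift_se2_4` are hypothetical, borders are handled with a "valid" correlation, and a single output kernel is used (one component of Eq. (6)).

```python
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view

def correlate2d_valid(f, k):
    """'Valid' 2D cross-correlation: out[i, j] = sum_{a,b} k[a, b] * f[i + a, j + b]."""
    windows = sliding_window_view(f, k.shape)      # shape (H - n + 1, W - n + 1, n, n)
    return np.einsum("ijab,ab->ij", windows, k)

def lift_se2_4(f, k):
    """Lifting correlation for N = 4.

    f: image of shape (C, H, W); k: kernel of shape (C, n, n).
    Returns an SE(2,4) image of shape (4, H - n + 1, W - n + 1).
    """
    planes = []
    for i in range(4):
        k_rot = np.rot90(k, i, axes=(1, 2))        # kernel rotated by i quarter turns
        # Sum the per-channel correlations, as in the sum over c in Eq. (5).
        planes.append(sum(correlate2d_valid(fc, kc) for fc, kc in zip(f, k_rot)))
    return np.stack(planes)

rng = np.random.default_rng(0)
f = rng.standard_normal((3, 9, 9))                 # hypothetical 3-channel input
k = rng.standard_normal((3, 3, 3))                 # hypothetical 3x3 kernel

F = lift_se2_4(f, k)
F_rot = lift_se2_4(np.rot90(f, 1, axes=(1, 2)), k)            # lift the rotated input
F_twisted = np.roll(np.rot90(F, 1, axes=(1, 2)), 1, axis=0)   # shift-twist of F
```

Lifting the rotated input gives the shift-twisted version of the original output, which is the rotation covariance that the layer is designed to provide.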

The \({\varvec{SE(2)}}\) Group Convolution Layer: Let \(\underline{F},\underline{K}:SE(2)\rightarrow \mathbb {R}^{N_c}\) be a vector valued SE(2) image and kernel, with \(\underline{F} = (F_{1},\dots ,F_{N_c})\) and \(\underline{K} = (K_{1},\dots ,K_{N_c})\), then the group correlations are defined as

$$\begin{aligned} (\underline{K} \star \underline{F})(g) := \sum \limits _{c=1}^{N_c} ( \mathcal {L}_g K_c, F_c )_{\mathbb {L}_2(SE(2))} = \sum \limits _{c=1}^{N_c} \int _{SE(2)} K_c(g^{-1} \cdot h) F_c(h) \mathrm{d}h, \end{aligned}$$
(7)

with \((K, F)_{\mathbb {L}_2(SE(2))} := \int _{SE(2)} K(h) F(h) \mathrm{d}h\), the inner product on \(\mathbb {L}_2(SE(2))\). A set of SE(2) kernels \(\mathbf {K}^{(l)} := (\underline{K}_1^{(l)},\dots ,\underline{K}_{N_l}^{(l)})\) defines a group convolution layer, mapping from \(\underline{F}^{(l-1)}\) with \(N_{l-1}\) channels to \(\underline{F}^{(l)}\) with \(N_{l}\) channels, via

$$\begin{aligned} \underline{F}^{(l)} \!= \! \mathbf {K}^{(l)} {\star } \underline{F}^{(l-1)} \!:= \! \left( \;\; \underline{K}_{1}^{(l)} {\star } \underline{F}^{(l-1)} \;\;,\;\; \dots \;\;,\;\; \underline{K}_{N_{l}}^{(l)} {\star } \underline{F}^{(l-1)} \;\;\right) . \end{aligned}$$
(8)
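A single-channel version of the group correlation in Eq. (7) can be sketched for \(N=4\) in the same interpolation-free setting. The helper names below are hypothetical, the orientation axis is assumed to come first, and borders are handled with a "valid" correlation; the multi-channel case adds a sum over \(c\).

```python
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view

def correlate2d_valid(f, k):
    """'Valid' 2D cross-correlation: out[i, j] = sum_{a,b} k[a, b] * f[i + a, j + b]."""
    windows = sliding_window_view(f, k.shape)
    return np.einsum("ijab,ab->ij", windows, k)

def shift_twist(A, i):
    """L_g for g = (0, i * 90 degrees): rotate each plane, shift the orientation axis."""
    return np.roll(np.rot90(A, i, axes=(1, 2)), i, axis=0)

def group_correlate_se2_4(F, K):
    """Single-channel group correlation for N = 4.

    F: SE(2,4) image of shape (4, H, W); K: SE(2,4) kernel of shape (4, n, n).
    Returns an array of shape (4, H - n + 1, W - n + 1).
    """
    out = []
    for i in range(4):
        K_i = shift_twist(K, i)                    # kernel transformed by L_g
        # Integrate over SE(2,4): sum over orientations j and spatial positions.
        out.append(sum(correlate2d_valid(F[j], K_i[j]) for j in range(4)))
    return np.stack(out)

rng = np.random.default_rng(0)
F = rng.standard_normal((4, 8, 8))                 # hypothetical input
K = rng.standard_normal((4, 3, 3))                 # hypothetical kernel

G = group_correlate_se2_4(F, K)
G_from_twisted = group_correlate_se2_4(shift_twist(F, 1), K)
```

Applying the layer to a shift-twisted input yields the shift-twisted output, `shift_twist(G, 1)`, so the covariance carries over from one layer to the next.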

The Projection Layer: Projects a multi-channel SE(2) image back to \(\mathbb {R}^2\) via

$$\begin{aligned} \underline{f}^{(l)}(\mathbf {x}) \!= \underset{\theta \in [0,2\pi ]}{{\text {max}}} \; \underline{F}^{(l)}(\mathbf {x},\theta ). \end{aligned}$$
(9)
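On a discretized SE(2, N) image the projection of Eq. (9) reduces to a maximum over the orientation axis. The sketch below assumes an array layout `(N, H, W)` and uses the hypothetical name `project_max`; it also checks that a \(90^\circ \) shift-twist of the input only rotates the projected image, so that a subsequent global spatial maximum is rotation invariant.

```python
import numpy as np

def project_max(F):
    """Maximum over the orientation axis of an SE(2,N) image of shape (N, H, W)."""
    return F.max(axis=0)

rng = np.random.default_rng(0)
F = rng.standard_normal((4, 6, 6))                             # hypothetical SE(2,4) image
F_twisted = np.roll(np.rot90(F, 1, axes=(1, 2)), 1, axis=0)    # shift-twist by 90 degrees

f = project_max(F)
f_twisted = project_max(F_twisted)
```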

Fig. 1. Rotation co- and invariance. Top row: the activations after the lifting convolutions with a single kernel \(\underline{k}_1^{(2)}\); stacked together they yield an SE(2) image \(F_1^{(2)}\) (cf. Eq. (6)). The projection layer at the end of the pipeline gives a rotation invariant feature vector. Bottom row: the same figures with a rotated input.

2.3 Discretization and Network Architecture

Discretization, Kernel Sizes and Rotation: Discretized 2D images are supported on a bounded subset of \(\mathbb {Z}^2 \subset \mathbb {R}^2\) and the kernels live on a spatially rectangular grid of size \(n\times n\) in \(\mathbb {Z}^2\), with n the kernel size. We discretize the Lie group SE(2) to \(SE(2,N):=\mathbb {Z}^2 \rtimes SO(2,N)\), in which the 2D rotations in SO(2) are sampled at N rotation angles \(\theta _i=\frac{2\pi }{N}i\), with \(i=0,\dots ,N-1\). The discrete lifting kernels \(\mathbf {k}^{(l)}\) at layer l, mapping from a 2D image with \(N_{l-1}\) input channels to an SE(2, N) image with \(N_{l}\) channels, thus have a shape of \(n\times n \times N_{l-1} \times {N_{l}}\). The SE(2, N) kernels \(\mathbf {K}^{(l)}\) have a shape of \(n \times n \times N \times N_{l-1} \times N_{l}\). A complete set of rotated versions of the kernels \(\mathbf {k}^{(l)}\) or \(\mathbf {K}^{(l)}\) can be constructed with a single matrix multiplication from a vector that contains the shared kernel weights. This matrix is sparse and encodes both the kernel rotation and the bi-linear interpolation.
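The construction of all rotated kernels by one matrix multiplication can be sketched as follows for a single \(n\times n\) weight plane. This is an illustration under stated assumptions rather than the released implementation: the names `rotation_matrix_bilinear` and `rotated_kernel_stack` are hypothetical, samples that fall outside the kernel grid are treated as zero, and the matrix is stored densely for readability although each row has at most four non-zero entries.

```python
import numpy as np

def rotation_matrix_bilinear(n, theta):
    """Matrix M of shape (n*n, n*n) with vec(k_rot) = M @ vec(k),
    where k_rot(p) = k(R_theta^{-1} p) is sampled by bi-linear interpolation."""
    m = (n - 1) / 2.0                              # kernel centre in index coordinates
    c, s = np.cos(theta), np.sin(theta)
    M = np.zeros((n * n, n * n))
    for r in range(n):
        for col in range(n):
            x, y = col - m, r - m                  # centred target coordinates
            xs = c * x + s * y + m                 # source column, from R_theta^{-1} p
            ys = -s * x + c * y + m                # source row
            c0, r0 = int(np.floor(xs)), int(np.floor(ys))
            dc, dr = xs - c0, ys - r0
            # Distribute over the four neighbouring grid points.
            for rr, wr in ((r0, 1 - dr), (r0 + 1, dr)):
                for cc, wc in ((c0, 1 - dc), (c0 + 1, dc)):
                    if 0 <= rr < n and 0 <= cc < n:    # zero outside the grid (assumption)
                        M[r * n + col, rr * n + cc] += wr * wc
    return M

def rotated_kernel_stack(weights, N):
    """Map (n, n) base weights to all N rotated kernels with one matrix product."""
    n = weights.shape[0]
    angles = 2 * np.pi * np.arange(N) / N          # theta_i = 2*pi*i/N
    M = np.concatenate([rotation_matrix_bilinear(n, t) for t in angles])
    return (M @ weights.ravel()).reshape(N, n, n), M

k = np.arange(25, dtype=float).reshape(5, 5)       # hypothetical base weights
kernels, M = rotated_kernel_stack(k, N=8)
```

Because the matrix depends only on \(n\) and \(N\), it can be precomputed once, and gradients with respect to the rotated kernels propagate back to the shared base weights through the same linear map.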

Table 1. SE(2, N) chain settings for different orientation samplings N.

3 Experiments and Results

We consider three different tasks in three different modalities. In each we consider the SE(2, N) samplings with \(N\in \{1,2,4,8,16\}\) to study the effect of the choice of N in the SE(2, N) discretization. See Table 1 for the network settings. In each experiment the data is augmented at train and test time with transposed versions of the 2D input. For reference we also include transpose plus \(90^\circ \) rotation augmentation for the \(N=1\) experiment (as in [12, 13]) to show that these augmentations are not necessary in our SE(2, N) networks for \(N\ge 4\). Each experiment is repeated 3 times with random initialization and sampling to get a rough estimate of the mean and variance of the performance. For a fair comparison between different N, the overall number of weights is matched. For a fair comparison with the \(\mathbb {R}^2\) approach, the number of “2D” activations (\(N_l N\)) in the last three layers is also matched. Each network optimizes a logistic loss using stochastic gradient descent with momentum using the same settings as in [12]. Our G-CNN implementations are available at https://github.com/tueimage/se2cnn. The results are given in Fig. 2; the tasks and metrics are summarized as follows.

Fig. 2. Top row: Crop outs of images of the three tasks with the class probabilities generated by our method. Bottom row: Mean results (\(\pm 1\) std. dev.).

Histopathology - Mitosis Detection: The task aims at detecting mitotic figures in hematoxylin-eosin stained slides. We used the public dataset AMIDA13 [14] that consists of high power field images from 23 breast cancer cases. Eight cases (458 mitoses) were used to train the networks with random batches of \(68 \times 68\) image patches, balanced between mitotic and hard negative figures. This receptive field was obtained by means of max-pooling operations in the first three layers. Sets of candidate detections were generated as in [13] after selection of an operating point on four validation cases (92 mitoses). We assessed an F\(_{1}\)-score for each model based on the 11 test cases (533 mitoses) in the conditions of [14].

Retina - Vessel Segmentation: In this task the blood vessels in the retina are segmented. For validation we use the public DRIVE database [15], which consists of 40 retinal images with manual segmentations. The set is split into a training set of 20 images (of which we use 16 for training and 4 for validation) and a test set, also of 20 images. The G-CNNs produce a probability for the vessel and background class. Training is done with 10000 patches (\(17 \times 17\)) per class per image. The output probabilities can be thresholded to create a binary segmentation, which can be used to quantify performance in terms of sensitivity and specificity. The area under the receiver operating characteristic (ROC) curve, in short AUC, summarizes these performances into a single value.
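The metrics mentioned above can be made concrete with a small sketch. It is not the evaluation code used for the reported results: the function names are hypothetical, the four probabilities and labels are toy values, and the AUC is computed through its rank interpretation (the probability that a randomly chosen vessel pixel scores higher than a randomly chosen background pixel), which uses memory proportional to the number of positive-negative pairs.

```python
import numpy as np

def sensitivity_specificity(prob, label, threshold):
    """Sensitivity and specificity of the binary map obtained by thresholding prob."""
    pred = prob >= threshold
    pos, neg = label == 1, label == 0
    tp, fn = np.sum(pred & pos), np.sum(~pred & pos)
    tn, fp = np.sum(~pred & neg), np.sum(pred & neg)
    return tp / (tp + fn), tn / (tn + fp)

def roc_auc(prob, label):
    """AUC as the fraction of (positive, negative) pairs ranked correctly; ties count 1/2."""
    pos, neg = prob[label == 1], prob[label == 0]
    diff = pos[:, None] - neg[None, :]
    return (np.sum(diff > 0) + 0.5 * np.sum(diff == 0)) / (pos.size * neg.size)

# Toy pixel probabilities and ground-truth labels (1 = vessel, 0 = background).
prob = np.array([0.1, 0.4, 0.35, 0.8])
label = np.array([0, 0, 1, 1])

sens, spec = sensitivity_specificity(prob, label, threshold=0.5)   # 0.5 and 1.0
auc = roc_auc(prob, label)                                         # 0.75
```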

Electron Microscopy - Cell Boundary Segmentation: This task consists of segmenting the boundaries of cells that are imaged with EM. We use the data and evaluation system of the ISBI EM segmentation challenge [16]. The data consists of 2 volumes (1 train, 1 test), each containing 30 consecutive images from a serial section transmission EM. Both the segmentation and the evaluation are done by treating the volumes as sequences of 2D slices. To increase receptive field size we include max pooling in the first 2 layers. Training is done with 10000 patches (\(48 \times 48\)) per class per image. The main evaluation criterion for the challenge is the Rand score, which measures the similarity between clusterings/connected components [17]. The reported Rand score is the maximum score (for several thresholds) computed for the connected components obtained after thinning of the binary cell boundary segmentation, see [16] for more details.

Results: In each experiment we see that the performance of the baseline with extra rotation augmentations is reached by the non-augmented G-CNNs for \(N\ge 4\), and is even surpassed for \(N \ge 8\). In the first two experiments we also observe that the variance on the output is reduced with increasing N. Our results on the public datasets match or improve upon the state of the art with the following scores: F\(_{1}\)-score = \(0.628 \pm 0.006\) for mitosis detection, AUC = \(0.9784 \pm 0.0001\) for vessel segmentation, Rand = \(0.962 \pm 0.008\) for cell boundary segmentation.

4 Discussion and Conclusions

We showed a consistent improvement in performance across three medical image analysis tasks when using G-CNNs compared to their corresponding CNN baselines. The reported results are in line with the benchmark of each dataset and the best performances were obtained for an orientation capacity \(N \ge 4\), indicating the advantage of learning such rotation-invariant representations. We observed improved stability over the repeated experiments in mitosis detection and vessel segmentation for \(N=8\) and \(N=16\), suggesting a regularization effect due to the increased weight sharing with increasing N.

We conclude that it is beneficial to include SE(2) group convolution layers in CNN design, as this avoids the need for rotation augmentation and improves overall performance. In all three medical imaging problems we achieved state-of-the-art results with the same (basic) network design for each task. Based on these results we expect that our SE(2) layers may lead to a further performance increase when embedded in more complex network designs, such as the popular UNets and ResNets.