1 Introduction

1.1 Overview, Motivation

Optical coherence tomography (OCT) is a non-invasive imaging technique which measures the intensity response of back-scattered light from millimeter penetration depth. Here we consider its use in ophthalmology as a means of acquiring high-resolution volume scans of the human retina in vivo, in order to study the functioning of the eye. Figure 1 gives an overview of the relevant anatomy. OCT devices record multiple two-dimensional B-scans in rapid succession and combine them into a single volume in a subsequent alignment step. Taking an OCT scan requires only several seconds to a few minutes and can help detect symptoms of pathological conditions such as glaucoma, diabetes, multiple sclerosis or age-related macular degeneration. The relative ease of data acquisition also makes it possible to use multiple OCT volume scans of a single patient over time to track the progression of a pathology or quantify the success of therapeutic treatment. As a consequence of the technological progress in OCT imaging over the past few decades since its invention by Huang et al. (1991), ever more expertise is required for manual annotation, which in the presence of large volumetric data sets is difficult to access.

To better leverage the availability of retinal OCT data in both clinical settings and empirical studies, much work focuses on the analysis of appropriate automatic feature extraction techniques. Access to such methods is especially crucial for improving the effectiveness of existing approaches to the quantitative segmentation of multiple retinal cell layers and for increasing their clinical potential in real-life applications, such as the detection of fluid regions and the reconstruction of vascular structures. The difficulty of these tasks lies in the challenging signal-to-noise ratio, which is influenced by multiple factors including physical eye movement during acquisition and the presence of speckle noise.

In this paper, we extend the assignment flow approach proposed in Åström et al. (2017) for labeling data on graphs to automatic cell layer segmentation in OCT data. After a feature extraction step, each voxel is labeled by smoothing local layer decisions and jointly leveraging a global geometric invariant—the natural order of cell layers along the vertical axis of each B-scan, as shown in the second row of Fig. 2. We are able to produce high-quality segmentations of OCT volumes by using local features as input for a purpose-built assignment flow variant which serves to incorporate global context in a controlled way. This is in contrast to common machine learning approaches which use essentially full B-scans as input.

The empirical success of deep learning methods is driven by the striking ability of deep networks to discover informative features which capture even very subtle patterns in data. However, despite their apparent expressiveness, such features are notoriously hard for humans to interpret. While neural networks often generalize surprisingly well to unseen data, their lack of interpretability makes it hard to anticipate or otherwise reason about specific failure cases. This is particularly relevant in medical applications because deep networks may produce predictions which appear plausible even in cases where they fail to generalize. Additionally, the acquisition of high-quality labeled data for training is laborious and may require the expertise of skilled medical professionals, such that data availability is limited compared to other problem domains. We propose to localize the influence of feature extraction on the segmentation process by limiting the field of view. Consequently, the features we use are semantically weaker than the ones computed by competing deep learning methods. Nevertheless, we achieve state-of-the-art performance by leveraging domain knowledge: in our pipeline, ambiguities in local features are resolved by regularization which promotes local regularity as well as the physiological ordering of cell layers.

Our segmentation approach is a smooth image labeling algorithm based on geometric numerical integration on an elementary statistical manifold. It can work with input data from any metric space, making it agnostic to the choice of feature extraction and suitable as plug-in replacement in diverse pipelines. In addition to respecting the natural order of cell layers, the proposed segmentation process has a high amount of built-in parallelism such that modern graphics acceleration hardware can easily be leveraged. We compare the effectiveness of our novel approach between a selection of input features ranging from traditional covariance descriptors to convolutional neural networks. Figure 2 shows a typical volume segmentation computed by the proposed method. It illustrates how local ambiguity is caused by similar signal intensity and visual appearance of some layers and exacerbated by speckle noise. This ambiguity in local features is systematically resolved by leveraging the domain knowledge of local smoothness and global physiological layer order.

Fig. 1

Schematic illustration of human eye functionality, designed by Kjpargeter (n.d.): light enters the cornea, passes through the vitreous humour towards the retina and choroid, which are located around the fovea

1.2 Related Work

Effective segmentation of OCT volumes is a very active area of research. Here, we briefly review the current state-of-the-art approaches originating from the broad research fields of graphical models, variational methods and machine learning.

Fig. 2

a Left: Normalized view of a 3D OCT volume scan of dimension \(512 \times 512 \times 256\) of healthy human retina with ambiguous locations of layer boundaries. Middle: The resulting segmentation into 11 layers, displaying the order-preserving labeling of the proposed approach. Right: Boundary surfaces between different segmented cell layers. b Typical result of the proposed segmentation approach for a single B-scan of healthy retina. Left: raw OCT input data. Middle: segmentation obtained by locally selecting the label with maximum score for each voxel after feature extraction. Right: segmentation by the proposed assignment flow approach using the same extracted features

1.2.1 Graphical Models

The first mathematical access to the problem is provided by the theory of graphical models, which transforms the segmentation task into an optimization problem with hard pairwise interaction constraints between voxels. Starting with Kang et al. (2006) and Haeker et al. (2007), simultaneous retina layer detection was attempted by finding a minimum s-t graph cut. Garvin et al. (2009) further extended this approach with a shape prior modeling layer boundaries. These methods benefit from low computational complexity, but lack robustness in the presence of speckle noise and therefore require additional preprocessing steps. Along this line of reasoning, Antony et al. (2010) used a two-stage segmentation process by applying anisotropic diffusion in a preprocessing step and subsequently segmenting outer retina layers using graphical models. Similarly, Kafieh et al. (2013) proposed to use specific distances based on diffusion maps which are computed by coarse-graining the original graph. However, the increased performance for noisy OCT data gained by regularizing in this way comes at the cost of introducing bias in the preprocessing step, which in turn impairs robustness in settings with medical pathologies.

Motivated by Song et al. (2013), Dufour et al. (2013) introduced a circular shape prior for the segmentation of 6 retinal layers by incorporating soft constraints which are more suitable for the robust detection of pathological retina structures. Chiu et al. (2015) relies on a graphical model approach as a postprocessing step after applying supervised kernel regression classification with features extracted according to Quellec et al. (2010). Rathke et al. (2014) reduced the overall complexity by a parallelizable segmentation approach based on probabilistic graphical models with a global low-rank shape prior representing interacting retina tissue surfaces. While the global shape prior works well for non-pathological OCT data, it cannot be adapted to the broad range of variations caused by local pathological structure, resulting in an inherent limitation of this approach. Here we refer to Rathke et al. (2017) for a possible adaptation of the probabilistic approach (Rathke et al. 2014) to pathological retina detection.

1.2.2 Variational Methods

Another category of layer detection methods focuses on minimizing an energy functional, expressing the quantity of interest as the solution to an optimization problem. Among this class of methods, level set approaches have proven particularly suitable for retina detection by encoding each retina layer as the zero level set of a certain functional. Yazdanpanah et al. (2011) introduced a level set method for minimizing an active contour functional supported by the multiphase model presented in Chan and Vese (2001) as a circular shape prior, in order to avoid the limitations of hard constraints as opposed to the graphical model proposed by Garvin et al. (2009). Duan et al. (2015) suggested to model layer boundaries with a mixture of Mumford–Shah and Vese–Osher functionals after first preprocessing the data in the Fourier domain. A capable level set approach for joint segmentation of pathological retina tissues was reported in the work of Novosel et al. (2017). However, due to the involved hierarchical optimization, their method is computationally expensive. One common downside of the above algorithms is their inherent limitation to only local notions of layer ordering, making their extension to cases with pathologically caused retina degeneracy a difficult task.

1.2.3 Machine Learning

Much recent work has focused on the use of deep learning to address the task of cell layer segmentation in a purely data driven way. The U-net architecture (Ronneberger et al. 2015) has proven influential in this domain because of its good predictive performance in settings with limited availability of training data. Multiple modifications of U-net have been proposed to specifically increase its performance in OCT applications (Roy et al. 2017; Liu et al. 2019). The common methods largely rely on convolutional neural networks to predict layer segmentations for individual B-scans which are subsequently combined to full volumes. These methods have also been used as part of a two-stage pipeline where additional prior knowledge such as local regularity and global order of cell layers along a spatial axis is incorporated through graph-based methods (Fang et al. 2017) or a second machine learning component (He et al. 2019).

1.3 Contribution, Organization

We propose a geometric assignment approach to retinal layer segmentation. By leveraging a continuous characterization of layer ordering, our method is able to simultaneously perform local regularization and incorporate the global topological ordering constraint in a single smooth labeling process. The segmentation is computed from a distance matrix containing pairwise distances between data for each voxel and prototypical data for each layer in some feature space. This highlights the ability to extract features from raw OCT data in a variety of different ways and to use the proposed segmentation as a plug-in replacement for other graph-based methods.

As a result of the proposed method, it becomes possible to compute high-quality cell layer segmentations of OCT volumes by using only local features for each voxel. This is in contrast to competing deep learning approaches which explicitly aim to incorporate as much global context into the feature extraction process as possible. The exclusive use of local features combats bias introduced through limited data availability in training and makes incorporation of three-dimensional information easily possible without limiting runtime scalability. To demonstrate this, we implement two feature extraction approaches. The first is based on identifying each voxel with a covariance descriptor and finding prototypical descriptors as cluster centers. For each voxel, Riemannian distances to the prototypical descriptors are used as input for subsequent segmentation. The second is based on training a relatively shallow convolutional neural network to classify small voxel patches of raw OCT data. Predicted class scores for each voxel are subsequently used as input for the proposed segmentation method.

The final pipeline thus comprises a preliminary feature extraction step (summarized in Sect. 5.2) which yields local data to subsequently be labeled in a regularized fashion by the proposed ordered assignment flow (Definition 2).

It enables robust cell layer segmentation for raw OCT volumes at scale, labeling an entire OCT volume within 30 s to several minutes on a single GPU, and in general leads to increased performance in the case of more informative features. This is achieved without using any prior knowledge other than local regularity and the order of cell layers. In particular, no global shape prior is used, which makes our proposed approach suited for retina detection in OCT volumes with observable pathological patterns.

Our paper considerably elaborates the conference version (Sitenko et al. 2020) in the following ways. We extended the discussion of related work and added descriptions of two reference methods to make the paper more self-contained. The mechanism we use to promote topological layer ordering through regularization is based on a generalized notion of order preservation restated in Definition 1. In the present work, we motivate this notion by examining a related discrete graphical model in Proposition 1. Furthermore, regarding the choice of covariance descriptors (Tuzel et al. 2006) for retinal tissue representation, we extensively discuss the impact of retrieving prototypical descriptors by approximating the Riemannian distance via divergence functions. Accordingly, we provide a detailed performance evaluation in terms of labeling accuracy and computational efficiency by comparing to the alternative Riemannian mean retrieval approach (Bini and Iannazzo 2013). We also substantially extended the evaluation of numerical labeling experiments by adding multiple illustrations as well as quantitative results. This includes a comparison to the additional reference method proposed in Rathke et al. (2014). Finally, we added a discussion of feature locality and of the variance in the reference segmentations used for training.

The remainder of this paper is organized as follows. The assignment flow approach is summarized in Sect. 2 and extended in Sect. 4 in order to take into account the order of layers as a global constraint. In Sect. 3, we consider the Riemannian manifold \(\mathscr {P}_{d}\) of positive definite matrices as a suitable feature space for local OCT data descriptors. Various Riemannian metrics are discussed with regard to the computational efficiency of clustering. The resulting features are subsequently compared to local features extracted by a convolutional network in Sect. 5. Performance measures for OCT segmentation are reported for our novel approach and for two other state-of-the-art methods with available standalone software, which were evaluated in detail as summarized in Sect. 5. In Sect. 6, we briefly discuss the access to appropriate ground truth data and the impact of feature locality underlying our approach.

2 Assignment Flow

We summarize the assignment flow approach introduced by Åström et al. (2017) and refer to the recent survey (Schnörr 2020) for more background and a review of recent related work.

2.1 Overview

The assignment manifold \(\mathscr {W}\) (16) is a product space of probability simplices. Hence each point \(W\in \mathscr {W}\) is a collection of discrete probability vectors, one for each pixel, called assignment vectors. These vectors W(t) evolve on \(\mathscr {W}\) according to the assignment flow ODE (25). Due to the imposed Fisher–Rao geometry (12), W(t) converges to an integral solution (Zern et al. 2020a): for \(t \rightarrow \infty \), each \(W_{i}(t)\) approaches a unit vector that encodes the class label j assigned to the data point \(f_{i}\) given at pixel \(i\in I\).

Thus, assignment flows perform labelings as do discrete graphical models (Kappes et al. 2015). Yet, unlike the latter models, the assignment flow approach is smooth which enables efficient numerical inference (Zeilmann et al. 2020), parameter learning (Hühnerbein et al. 2021) and extensions to unsupervised and self-supervised scenarios (Zern et al. 2020b; Zisler et al. 2020).

Section 4 extends the assignment flow approach such that the natural ordering of labels due to retinal tissue layers is taken into account.

2.2 Assignment Manifold

Let \((\mathscr {F},d_{\mathscr {F}})\) be a metric space and

$$\begin{aligned} \mathscr {F}_{n}&= \{f_{i} \in \mathscr {F} :i \in {I}\},\quad |{I}|=n \end{aligned}$$
(1a)

given data. Assume that a predefined set of prototypes

$$\begin{aligned} \mathscr {F}_{*}&= \{f^{*}_{j} \in \mathscr {F} :j \in {J}\},\quad |{J}|=c \end{aligned}$$
(1b)

is given. Data labeling denotes the assignments

$$\begin{aligned} j \rightarrow i,\quad f_{j}^{*} \rightarrow f_{i} \end{aligned}$$
(2)

of a single prototype \(f_{j}^{*} \in \mathscr {F}_{*}\) to each data point \(f_{i} \in \mathscr {F}_{n}\). The set I is assumed to form the vertex set of an undirected graph \(\mathscr {G}=({I},\mathscr {E})\) which defines a relation \(\mathscr {E} \subset {I} \times {I}\) and neighborhoods

$$\begin{aligned} \mathscr {N}_{i} = \{k \in {I} :ik \in \mathscr {E}\} \cup \{i\}, \end{aligned}$$
(3)

where ik is a shorthand for the unordered pair (edge) \((i,k)=(k,i)\). We require these neighborhoods to satisfy the symmetry relation

$$\begin{aligned} k \in \mathscr {N}_{i} \quad \Leftrightarrow \quad i \in \mathscr {N}_{k},\qquad \forall i,k \in {I}. \end{aligned}$$
(4)

The assignments (labeling) (2) are represented by matrices in the set

$$\begin{aligned} \mathscr {W}_{*} = \{W \in \{0,1\}^{n \times c} :W\mathbb {1}_{c}=\mathbb {1}_{n}\} \end{aligned}$$
(5)

with unit vectors \(W_{i},\,i \in {I}\), called assignment vectors, as row vectors. These assignment vectors are computed by numerically integrating the assignment flow below (25) in the following geometric setting. The integrality constraint of (5) is relaxed and vectors

$$\begin{aligned} W_{i} = (W_{i1},\dotsc ,W_{ic})^{\top } \in \mathscr {S},\quad i \in {I}, \end{aligned}$$
(6)

that we still call assignment vectors, are considered on the elementary Riemannian manifold

$$\begin{aligned} (\mathscr {S},g),\quad \mathscr {S} = \{p \in \varDelta _{c} :p > 0\} \end{aligned}$$
(7)

with the probability simplex

$$\begin{aligned} \varDelta _{c}=\left\{ p \in \mathbb {R}_{+}^{c}:\sum _{j=1}^{c} p^{j} = \langle \mathbb {1},p\rangle =1\right\} , \end{aligned}$$
(8)

the barycenter

$$\begin{aligned} \mathbb {1}_{\mathscr {S}} = \frac{1}{c}\mathbb {1}_{c} \in \mathscr {S}, \qquad (\text {barycenter}) \end{aligned}$$
(9)

tangent space

$$\begin{aligned} T_{0} = \{v \in \mathbb {R}^{c} :\langle \mathbb {1}_{c},v \rangle =0\} \end{aligned}$$
(10)

and tangent bundle \(T\mathscr {S} = \mathscr {S} \times T_{0}\), the orthogonal projection

$$\begin{aligned} \varPi _{0} :\mathbb {R}^{c} \rightarrow T_{0},\quad \varPi _{0} = I - \mathbb {1}_{\mathscr {S}}\mathbb {1}^{\top } \end{aligned}$$
(11)

and the Fisher–Rao metric

$$\begin{aligned} g_{p}(u,v) = \sum _{j \in {J}} \frac{u^{j} {v}^{j}}{p^{j}},\quad p \in \mathscr {S},\quad u,v \in T_{0}. \end{aligned}$$
(12)

Based on the linear map

$$\begin{aligned} R_{p} :\mathbb {R}^{c} \rightarrow T_{0},\quad R_{p} = {{\,\mathrm{Diag}\,}}(p)-p p^{\top },\quad p \in \mathscr {S} \end{aligned}$$
(13)

that satisfies

$$\begin{aligned} R_{p} = R_{p} \varPi _{0} = \varPi _{0} R_{p}, \end{aligned}$$
(14)

exponential maps and their inverses are defined by

$$\begin{aligned} {{\,\mathrm{Exp}\,}}&:\mathscr {S} \times T_{0} \rightarrow \mathscr {S}, &(p,v)&\mapsto {{\,\mathrm{Exp}\,}}_{p}(v) = \frac{p e^{\frac{v}{p}}}{\langle p,e^{\frac{v}{p}}\rangle }, \end{aligned}$$
(15a)
$$\begin{aligned} {{\,\mathrm{Exp}\,}}_{p}^{-1}&:\mathscr {S} \rightarrow T_{0},&q&\mapsto {{\,\mathrm{Exp}\,}}_{p}^{-1}(q) = R_{p} \log \frac{q}{p}, \end{aligned}$$
(15b)
$$\begin{aligned} \exp _{p}&:T_{0} \rightarrow \mathscr {S},&\exp _{p}&= {{\,\mathrm{Exp}\,}}_{p} \circ R_{p}, \end{aligned}$$
(15c)
$$\begin{aligned} \exp _{p}^{-1}&:\mathscr {S} \rightarrow T_{0},&\exp _{p}^{-1}(q)&= \varPi _{0}\log \frac{q}{p} \end{aligned}$$
(15d)

where multiplication, exponentiation and logarithms apply componentwise. Applying the map \(\exp _{p}\) to a vector in \(\mathbb {R}^{c} = T_{0} \oplus \mathbb {R}\mathbb {1}\) does not depend on the constant component of the argument, due to (14).
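For concreteness, the maps (11), (13) and (15) can be sketched numerically as follows (NumPy; function names are illustrative and not part of our implementation).

```python
import numpy as np

def pi0(v):
    """Orthogonal projection (11) of v onto the tangent space T_0."""
    return v - v.mean()

def replicator(p, v):
    """Linear map R_p of (13): R_p v = Diag(p) v - p <p, v>."""
    return p * v - p * np.dot(p, v)

def Exp(p, v):
    """Map (15a); multiplication, exponentiation and division act componentwise."""
    q = p * np.exp(v / p)
    return q / q.sum()

def Exp_inv(p, q):
    """Inverse map (15b)."""
    return replicator(p, np.log(q / p))

def exp_lift(p, z):
    """Lifting map (15c): exp_p = Exp_p composed with R_p."""
    return Exp(p, replicator(p, z))

def exp_lift_inv(p, q):
    """Inverse lifting map (15d)."""
    return pi0(np.log(q / p))

# consistency checks at the barycenter (9)
c = 5
p = np.full(c, 1.0 / c)
v = pi0(np.random.randn(c))
assert np.allclose(Exp_inv(p, Exp(p, v)), v)
# adding a constant to the argument does not change exp_p, cf. (14)
assert np.allclose(exp_lift(p, v + 3.0), exp_lift(p, v))
```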

Remark 1

The map \({{\,\mathrm{Exp}\,}}\) corresponds to the e-connection of information geometry (Amari and Nagaoka 2000), rather than to the exponential map of the Riemannian connection. Accordingly, the affine geodesics (15a) are not length-minimizing. But they provide a close approximation (Åström et al. 2017, Prop. 3) and are more convenient for numerical computations.

The assignment manifold is defined as

$$\begin{aligned} (\mathscr {W},g),\quad \mathscr {W} = \mathscr {S} \times \cdots \times \mathscr {S}.\quad (n = |{I}|\;\text {factors}) \end{aligned}$$
(16)

We identify \(\mathscr {W}\) with the embedding into \(\mathbb {R}^{n\times c}\)

$$\begin{aligned} \begin{aligned} \mathscr {W} = \Big \{ W \in \mathbb {R}^{n\times c} :W\mathbb {1}_c = \mathbb {1}_n&\text { and } W_{ij} > 0 \\&\text { for all } i\in [n], j\in [c]\Big \}. \end{aligned} \end{aligned}$$
(17)

Thus, points \(W \in \mathscr {W}\) are row-stochastic matrices \(W \in \mathbb {R}^{n \times c}\) with row vectors \(W_{i} \in \mathscr {S},\; i \in {I}\) that represent the assignments (2) for every \(i \in {I}\). We set

$$\begin{aligned} \mathscr {T}_{0} := T_{0} \times \cdots \times T_{0} \quad (n = |{I}|\;\text {factors}). \end{aligned}$$
(18)

Due to (17), the tangent space \(\mathscr {T}_0\) can be identified with

$$\begin{aligned} \mathscr {T}_0 = \{ V \in \mathbb {R}^{n\times c} :V\mathbb {1}_c = 0\}. \end{aligned}$$
(19)

Thus, \(V_i \in T_{0}\) for all row vectors of \(V \in \mathbb {R}^{n \times c}\) and \(i \in {I}\). All mappings defined above factorize in a natural way and apply row-wise, e.g. \({{\,\mathrm{Exp}\,}}_{W} = ({{\,\mathrm{Exp}\,}}_{W_{1}},\dotsc ,{{\,\mathrm{Exp}\,}}_{W_{n}})\) etc.

2.3 Assignment Flow

Based on (1a) and (1b), the distance vector field

$$\begin{aligned} D_{\mathscr {F};i} = \big (d_{\mathscr {F}}(f_{i},f_{1}^{*}),\dotsc , d_{\mathscr {F}}(f_{i},f_{c}^{*})\big )^{\top },\quad i \in {I} \end{aligned}$$
(20)

is well-defined. These vectors are collected as row vectors of the distance matrix

$$\begin{aligned} D_{\mathscr {F}} \in \mathbb {R}_{+}^{n \times c}, \end{aligned}$$
(21)

where \(\mathbb {R}_{+}^{n \times c}\) denotes the set of entrywise nonnegative \(n \times c\) matrices.

Remark 2

In this paper, we build upon two different types of features to determine the vectors (20), which serve as input before the assembled matrix (21) is mapped onto the assignment manifold as explained below. The first class of features enters our model through distances to prototypes (1) with respect to the metric introduced in Sect. 3.2, whereas the second class directly takes the form of (21), as discussed in Sect. 5.2.3.

The likelihood map and the likelihood vectors, respectively, are defined for \(i \in {I}\) as

$$\begin{aligned} \begin{aligned} L_{i} :\mathscr {S}&\rightarrow \mathscr {S},\\ L_{i}(W_{i})&= \exp _{W_{i}}\Big (-\frac{1}{\rho }D_{\mathscr {F};i}\Big ) = \frac{W_{i} e^{-\frac{1}{\rho } D_{\mathscr {F};i}}}{\langle W_{i},e^{-\frac{1}{\rho } D_{\mathscr {F};i}} \rangle }, \end{aligned} \end{aligned}$$
(22)

where the scaling parameter \(\rho > 0\) is used for normalizing the a priori unknown scale of the components of \(D_{\mathscr {F};i}\) that depends on the specific application at hand.

A key component of the assignment flow is the interaction of the likelihood vectors through geometric averaging within the local neighborhoods (3). Specifically, using weights

$$\begin{aligned} w_{ik} > 0\quad \text {for all}\; k \in \mathscr {N}_{i},\;i \in {I}\quad \text {with}\quad \sum _{k \in \mathscr {N}_{i}} w_{ik}=1, \end{aligned}$$
(23)

the similarity map and the similarity vectors, respectively, are defined for \(i \in {I}\) as

$$\begin{aligned} \begin{aligned} S_{i} :\mathscr {W}&\rightarrow \mathscr {S},\\ S_{i}(W)&= {{\,\mathrm{Exp}\,}}_{W_{i}}\left( \sum _{k \in \mathscr {N}_{i}} w_{ik} {{\,\mathrm{Exp}\,}}_{W_{i}}^{-1}\big (L_{k}(W_{k})\big )\right) . \end{aligned} \end{aligned}$$
(24)

If \({{\,\mathrm{Exp}\,}}_{W_{i}}\) were the exponential map of the Riemannian (Levi-Civita) connection, then the argument inside the brackets of the right-hand side would just be the negative Riemannian gradient with respect to \(W_{i}\) of the center of mass objective function comprising the points \(L_{k},\,k \in \mathscr {N}_{i}\), i.e. the weighted sum of the squared Riemannian distances between \(W_{i}\) and \(L_{k}\) (Jost 2017, Lemma 6.9.4). In view of Remark 1, this interpretation is only approximately true mathematically, but still correct informally: \(S_{i}(W)\) moves \(W_{i}\) towards the geometric mean of the likelihood vectors \(L_{k},\,k \in \mathscr {N}_{i}\). Since \({{\,\mathrm{Exp}\,}}_{W_{i}}(0)=W_{i}\), this mean is precisely \(W_{i}\) if the aforementioned gradient vanishes.
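A direct, unoptimized sketch of the maps (22) and (24) reads as follows, assuming neighbors[i] lists the indices in \(\mathscr {N}_{i}\) and weights[i] the corresponding weights \(w_{ik}\) from (23); these data structures and the function names are illustrative conventions, not part of our implementation.

```python
import numpy as np

def likelihood(W, D, rho):
    """Likelihood map (22), applied row-wise to the assignment matrix W."""
    U = W * np.exp(-D / rho)
    return U / U.sum(axis=1, keepdims=True)

def similarity(W, L, neighbors, weights):
    """Similarity map (24): geometric averaging of the likelihood vectors
    L_k over each neighborhood N_i with weights w_ik."""
    S = np.empty_like(W)
    for i in range(W.shape[0]):
        v = np.zeros(W.shape[1])
        for k, w_ik in zip(neighbors[i], weights[i]):
            t = np.log(L[k] / W[i])
            v += w_ik * (W[i] * t - W[i] * np.dot(W[i], t))  # Exp_{W_i}^{-1}(L_k), cf. (15b)
        q = W[i] * np.exp(v / W[i])                          # Exp_{W_i}(v), cf. (15a)
        S[i] = q / q.sum()
    return S
```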

The assignment flow is induced on the assignment manifold \(\mathscr {W}\) by the locally coupled system of nonlinear ODEs

$$\begin{aligned} \dot{W}&= R_{W}S(W),\quad W(0)=\mathbb {1}_{\mathscr {W}}, \end{aligned}$$
(25a)
$$\begin{aligned} \dot{W}_{i}&= R_{W_{i}} S_{i}(W),\quad W_{i}(0)=\mathbb {1}_{\mathscr {S}},\quad i \in {I}, \end{aligned}$$
(25b)

where \(\mathbb {1}_{\mathscr {W}} \in \mathscr {W}\) denotes the barycenter of the assignment manifold (16). The solution \(W(t)\in \mathscr {W}\) is numerically computed by geometric integration (Zeilmann et al. 2020) and determines a labeling \(W(T) \in \mathscr {W}_{*}\) for sufficiently large T after a trivial rounding operation. Convergence and stability of the assignment flow have been studied by Zern et al. (2020a).
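A minimal sketch of such a geometric integration scheme, reusing likelihood() and similarity() from the sketch above, may look as follows; the step size h and the stopping threshold are illustrative choices rather than the tuned values used in our experiments.

```python
import numpy as np

def integrate_assignment_flow(D, neighbors, weights, rho=1.0, h=0.5, steps=500):
    """First-order geometric integration of (25): each step lifts h*S_i(W)
    back to the manifold with exp_{W_i} (15c)."""
    n, c = D.shape
    W = np.full((n, c), 1.0 / c)                 # barycenter initialization (25b)
    for _ in range(steps):
        S = similarity(W, likelihood(W, D, rho), neighbors, weights)
        U = W * np.exp(h * S)                    # exp_{W_i}(h S_i(W)), componentwise
        W = U / U.sum(axis=1, keepdims=True)
        if W.max(axis=1).min() > 0.999:          # nearly integral assignment reached
            break
    return W.argmax(axis=1)                      # trivial rounding to labels, cf. (2)
```

Note that the row-wise updates are mutually independent, which is what makes parallel evaluation on graphics hardware straightforward.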

3 OCT Data Representation by Covariance Descriptors

In this section, we work out the basic geometric notation for the representation of OCT data by means of covariance descriptors (Tuzel et al. 2006). Specifically, the metric data space \((\mathscr {F},d_{\mathscr {F}})\) underlying (1) will be identified with the Riemannian manifold \((\mathscr {P}_{d},d_{g})\) of positive definite matrices of dimension \(d \times d\), with Riemannian metric g and Riemannian distance \(d_{g}\) as specified in Sect. 5. In particular regarding the computation of corresponding prototypes (1b), an important aspect concerns the trade-off between respecting the Riemannian distance \(d_{g}\) of the matrix manifold \(\mathscr {P}_{d}\) and approximating it by surrogate distance functions, which enable a more efficient computation of Riemannian means of covariance descriptors while respecting their natural geometry. We review and discuss various choices in Sect. 3.2, after reviewing a few required concepts of Riemannian geometry in Sect. 3.1.

3.1 The Manifold \(\mathscr {P}_{d}\)

We collect a few concepts related to data \(p \in \mathscr {M}\) taking values on a general Riemannian manifold \((\mathscr {M},g)\) with Riemannian metric g; see, e.g., Lee (2013), Jost (2017) for background reading. Then we apply these concepts to the specific manifold \((\mathscr {P}_{d},g)\) and the corresponding distance \(d_{g}\), keeping the symbol g for the metric for simplicity. We refer to, e.g., Bhatia (2007, 2013), Pennec et al. (2006) and Moakher and Batchelor (2006) for further reading and to the references in Sect. 3.2.

Let \(\gamma :[0,1]\rightarrow \mathscr {M}\) be a smooth curve connecting two points \(p = \gamma (0)\) and \(q = \gamma (1)\). The Riemannian distance between p and q is given by

$$\begin{aligned} d_g(p,q)&= \min _{\gamma :\gamma (0) = p,\gamma (1) = q} L(\gamma ) \end{aligned}$$
(26a)

with

$$\begin{aligned} L(\gamma )&= \int _{0}^{1}\Vert {\dot{\gamma }}(t)\Vert _{\gamma (t)}\mathrm{d}{t} = \int _{0}^1 \sqrt{g_{\gamma (t)}\big (\dot{\gamma }(t),\dot{\gamma }(t)\big )} \mathrm{d}{t}. \end{aligned}$$
(26b)

Assume the minimum of the right-hand side of (26a) is attained at \(\overline{\gamma }\). Then the exponential map at p is defined on some neighborhood \(V_{p} \subseteq T_{p}\mathscr {M}\) of 0 in the tangent space to \(\mathscr {M}\) at p by

$$\begin{aligned} \begin{aligned} \exp _{p} :V_{p} \supseteq T_{p}\mathscr {M}&\rightarrow U_{p} \subseteq \mathscr {M},\\ v&\mapsto \exp _{p}(v) := \overline{\gamma }(1). \end{aligned} \end{aligned}$$
(27)

This mapping is a diffeomorphism of \(V_{p}\) and its inverse map \(\exp _{p}^{-1} :U_{p} \rightarrow V_{p}\) exists on a corresponding open neighborhood \(U_{p}\). Let \(\mathscr {X}(\mathscr {M})\) denote the set of all smooth vector fields on \(\mathscr {M}\), i.e. \(X \in \mathscr {X}(\mathscr {M})\) evaluates to a tangent vector \(X_{p} \in T_{p}\mathscr {M}\) smoothly depending on p. The set of all smooth covector fields (one-forms) is denoted by \(\mathscr {X}^{*}(\mathscr {M})\), and df(X) denotes the action of the differential \(df \in \mathscr {X}^{*}(\mathscr {M})\) of a smooth function \(f :\mathscr {M}\rightarrow \mathbb {R}\) on a vector field X. The Riemannian gradient of f is the vector field \({{\,\mathrm{grad}\,}}f \in \mathscr {X}(\mathscr {M})\) defined by

$$\begin{aligned} g({{\,\mathrm{grad}\,}}f,X) = df(X) = X f,\quad \forall X \in \mathscr {X}(\mathscr {M}). \end{aligned}$$
(28)

We now focus on the following problem: Given a set of points \(\{p_{i}\}_{i \in [N]}\subset \mathscr {M}\), compute the weighted Riemannian mean as minimizer of the objective function

$$\begin{aligned} \begin{aligned}&\overline{p} = \arg \min _{q \in \mathscr {M}}J(q),\quad J(q) = \sum _{i \in [N]} \omega _{i}d^{2}_{g}(q,p_{i}),\\&\sum _{i\in [N]}\omega _{i}=1, \qquad \qquad \omega _{i}>0,\;\text { for all } i . \end{aligned} \end{aligned}$$
(29)

The Riemannian gradient of this objective function is given by Jost (2017, Lemma 6.9.4)

$$\begin{aligned} {{\,\mathrm{grad}\,}}J(p) = -\sum _{i\in [N]}\omega _{i} \exp _{p}^{-1}(p_{i}). \end{aligned}$$
(30)

Hence the Riemannian mean \(\overline{p}\) is determined by the optimality condition

$$\begin{aligned} \sum _{i\in [N]} \omega _{i}\exp ^{-1}_{\overline{p}}(p_{i}) = 0. \end{aligned}$$
(31)

A basic numerical method for computing \(\overline{p}\) is the fixed point iteration

$$\begin{aligned} q_{(t+1)} = \exp _{q_{(t)}}\left( \sum _{i\in [N]} \omega _{i}\exp ^{-1}_{q_{(t)}}(p_{i})\right) ,\quad t=1,2,\dotsc \end{aligned}$$
(32)

that, for a suitable initialization \(q_{(0)}\), may converge to \(\overline{p}\).

We now focus on the specific manifold \((\mathscr {P}_{d},g)\)

$$\begin{aligned} \mathscr {P}_{d} = \{S\in \mathbb {R}^{d\times d}:S=S^{\top },\, S\;\text {is positive definite}\} \end{aligned}$$
(33)

with the tangent space

$$\begin{aligned} T_{S}\mathscr {P}_{d} = \{U \in \mathbb {R}^{d\times d}:U^{\top }=U\}, \end{aligned}$$
(34)

equipped with the Riemannian metric

$$\begin{aligned} g_{S}(U,V) = \mathrm{tr}(S^{-1} U S^{-1} V),\quad U,V\in T_{S}\mathscr {P}_{d}. \end{aligned}$$
(35)

The Riemannian distance (26a) is given by

$$\begin{aligned} d_{\mathscr {P}_{d}}(S,T) = \left( \sum _{i \in [d]}\big (\log \lambda _{i}(S,T)\big )^{2}\right) ^{1/2}, \end{aligned}$$
(36)

whereas the exponential map (27) reads

$$\begin{aligned} \exp _{S}(U) = S^{\frac{1}{2}}{{\,\mathrm{expm}\,}}\left( S^{-\frac{1}{2}} US^{-\frac{1}{2}}\right) S^{\frac{1}{2}}, \end{aligned}$$
(37)

where \({{\,\mathrm{expm}\,}}(\cdot )\) denotes the matrix exponential. Finally, given a smooth objective function \(J :\mathscr {P}_{d} \rightarrow \mathbb {R}\), the Riemannian gradient is given by

$$\begin{aligned} {{\,\mathrm{grad}\,}}J(S) = S\big (\partial J(S)\big )S \in T_{S}\mathscr {P}_{d}, \end{aligned}$$
(38)

where the symmetric matrix \(\partial J(S)\) denotes the Euclidean gradient of J at S. Since \(\mathscr {P}_{d}\) is a simply connected, complete and nonpositively curved Riemannian manifold (Bridson and Haefliger 1999, Section 10), the exponential map (37) is globally defined and bijective, and the Riemannian mean always exists and is uniquely defined as minimizer of the objective function (29), after substituting the Riemannian distance (36).
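For illustration, the quantities (36)-(38) can be evaluated with a few lines of NumPy/SciPy; the helper _eig_fun and the remaining function names are ours and only sketch one possible realization.

```python
import numpy as np
from scipy.linalg import eigvalsh

def _eig_fun(S, fun):
    """Apply a scalar function to a symmetric matrix via its eigendecomposition."""
    w, V = np.linalg.eigh(S)
    return (V * fun(w)) @ V.T

def dist_pd(S, T):
    """Riemannian distance (36); eigvalsh(S, T) returns the generalized
    eigenvalues lambda_i(S, T)."""
    return np.sqrt(np.sum(np.log(eigvalsh(S, T)) ** 2))

def exp_pd(S, U):
    """Exponential map (37) at S for a symmetric tangent vector U."""
    Sh = _eig_fun(S, np.sqrt)                       # S^{1/2}
    Shi = _eig_fun(S, lambda w: 1.0 / np.sqrt(w))   # S^{-1/2}
    return Sh @ _eig_fun(Shi @ U @ Shi, np.exp) @ Sh

def riemannian_gradient(S, euclidean_grad):
    """Riemannian gradient (38): grad J(S) = S (dJ(S)) S."""
    return S @ euclidean_grad @ S
```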

3.2 Computing Prototypical Covariance Descriptors

In this section, we focus on the computational differential geometric framework required for the extraction of prototypes (1b) as Riemannian means from a set of covariance descriptors assembled from OCT data. Application details are reported in Sect. 5. Particularly with regard to handling the present volumetric data more efficiently and to reducing the computational costs, surrogate metrics and distances are reviewed in Sects. 3.2.2 and 3.2.3. Their qualitative comparison is reported in Sect. 5.

3.2.1 Computing Riemannian Means

Given a set of covariance descriptors

$$\begin{aligned} \mathscr {S}_{N} = \{(S_1,\omega _{1}),\dots , (S_{N},\omega _{N})\} \subset \mathscr {P}_d \end{aligned}$$
(39)

together with positive weights \(\omega _{i}\), we next focus on the solution of the problem (29) for specific geometry (33),

$$\begin{aligned} \overline{S} = \arg \min _{S \in \mathscr {P}_{d}} J(S;\mathscr {S}_{N}),\quad J(S;\mathscr {S}_{N}) = \sum _{i \in [N]} \omega _{i}d_{\mathscr {P}_{d}}^{2}(S,S_{i}), \end{aligned}$$
(40)

with the distance \(d_{\mathscr {P}_{d}}\) given by (36). From (37), we deduce

$$\begin{aligned} U = \exp _{S}^{-1} \circ \exp _{S}(U) = S^{\frac{1}{2}} {{\,\mathrm{logm}\,}}\left( S^{-\frac{1}{2}}\exp _{S}(U)S^{-\frac{1}{2}}\right) S^{\frac{1}{2}} \end{aligned}$$
(41)

with the matrix logarithm \({{\,\mathrm{logm}\,}}={{\,\mathrm{expm}\,}}^{-1}\) (Higham 2008, Section 11). As a result, optimality condition (31) reads

$$\begin{aligned} \sum _{i \in [N]} \omega _{i} \overline{S}^{\frac{1}{2}} {{\,\mathrm{logm}\,}}\left( \overline{S}^{-\frac{1}{2}}S_{i} \overline{S}^{-\frac{1}{2}}\right) \overline{S}^{\frac{1}{2}} = 0. \end{aligned}$$
(42)

Applying the corresponding basic fixed point iteration (32) has two drawbacks, however (Congedo et al. 2015): convergence is not theoretically guaranteed and, if the iteration converges, then only at a linear rate. Since each iterative step requires nontrivial numerical matrix decompositions that have to be applied multiple times to every voxel (vertex) of a 3D grid graph, this results in an overall quite expensive approach, in particular when larger data sets are involved, as is the case for highly resolved 3D OCT volumetric scans.

The following variant proposed by Bini and Iannazzo (2013) is guaranteed to converge at a quadratic rate, provided the matrices \(\{S_1,\dots ,S_N \}\) pairwise commute. Using the parametrization

$$\begin{aligned} S = L L^{\top } \end{aligned}$$
(43)

corresponding to the Cholesky decomposition, and replacing the map of the fixed point iteration (32) by its linearization, leads to the fixed point map

$$\begin{aligned} F_{\tau }(L; \mathscr {S}_{N}) = L L^{\top }- \tau \sum _{i\in [N]} \omega _{i} L^{\top } {{\,\mathrm{logm}\,}}(L^{-\top }S_i^{-1}L^{-1})L, \end{aligned}$$
(44)

with damping parameter \(\tau > 0\). Comparing to (42) shows that the basic idea is to compute the Riemannian mean \(\overline{S}\) as fixed point of the iteration

$$\begin{aligned} \overline{S} = \lim _{t \rightarrow \infty } S_{(t)},\quad S_{(t+1)} = F(S_{(t)};\mathscr {S}_{N}). \end{aligned}$$
(45)

Algorithm 1 provides a refined variant of this iteration including adaptive stepsize selection. See Congedo et al. (2015) for alternative algorithms that determine the Riemannian mean.

Algorithm 1
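For orientation, the following sketch implements the plain damped iteration (32), specialized to \(\mathscr {P}_{d}\) via (37) and (42); it omits the adaptive step size selection of Algorithm 1 and, as noted above, convergence of this basic scheme is not guaranteed in general. Function names are illustrative.

```python
import numpy as np

def _eig_fun(S, fun):
    """Apply a scalar function to a symmetric matrix via its eigendecomposition."""
    w, V = np.linalg.eigh(S)
    return (V * fun(w)) @ V.T

def riemannian_mean_pd(samples, weights, tau=1.0, iters=50, tol=1e-10):
    """Damped fixed point iteration (32) for the Riemannian mean on P_d."""
    S = sum(w * Si for w, Si in zip(weights, samples))    # arithmetic mean as init
    for _ in range(iters):
        Sh = _eig_fun(S, np.sqrt)                         # S^{1/2}
        Shi = np.linalg.inv(Sh)                           # S^{-1/2}
        # tangent direction sum_i w_i logm(S^{-1/2} S_i S^{-1/2}), cf. (42)
        T = sum(w * _eig_fun(Shi @ Si @ Shi, np.log) for w, Si in zip(weights, samples))
        if np.linalg.norm(T) < tol:
            break
        S = Sh @ _eig_fun(tau * T, np.exp) @ Sh           # exp_S(tau * ...), cf. (37)
    return S
```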

3.2.2 Log-Euclidean Distance and Means

A computationally cheap approach was proposed by Arsigny et al. (2007) (among several other ones). Based on the operations

$$\begin{aligned} S_{1} \odot S_{2}&= {{\,\mathrm{expm}\,}}\big ({{\,\mathrm{logm}\,}}(S_{1})+{{\,\mathrm{logm}\,}}(S_{2})\big ), \end{aligned}$$
(46a)
$$\begin{aligned} \lambda \cdot S&= {{\,\mathrm{expm}\,}}\big (\lambda {{\,\mathrm{logm}\,}}(S)\big ), \end{aligned}$$
(46b)

the set \((\mathscr {P}_{d},\odot ,\cdot )\) becomes isomorphic to a vector space in which \(\odot \) plays the role of addition. Consequently, the mean of the data \(\mathscr {S}_{N}\) given by (39) is defined analogous to the arithmetic mean by

$$\begin{aligned} \overline{S} = {{\,\mathrm{expm}\,}}\left( \sum _{i\in [N]}\omega _{i}{{\,\mathrm{logm}\,}}(S_{i})\right) . \end{aligned}$$
(47)

While computing this mean is considerably cheaper than approximating the Riemannian mean (40) with Algorithm 1, the critical drawback of relying on (47) is that it does not take into account the curved structure of the manifold \(\mathscr {P}_{d}\). Therefore, in the next section, we additionally consider another approximation of the Riemannian mean that better respects the underlying geometry but can still be evaluated more efficiently than the Riemannian mean of Sect. 3.2.1.
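The mean (47) amounts to a few eigendecompositions per descriptor; a minimal sketch (function names are illustrative):

```python
import numpy as np

def _eig_fun(S, fun):
    """Apply a scalar function to a symmetric matrix via its eigendecomposition."""
    w, V = np.linalg.eigh(S)
    return (V * fun(w)) @ V.T

def log_euclidean_mean(samples, weights):
    """Log-Euclidean mean (47): weighted arithmetic mean in the logm-domain."""
    A = sum(w * _eig_fun(S, np.log) for w, S in zip(weights, samples))
    return _eig_fun(A, np.exp)
```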

3.2.3 S-Divergence and Means

A general approach to the approximation of the objective function (29) is to replace the squared Riemannian distance \(d_{g}^{2}(p,q)\) by a divergence function

$$\begin{aligned} D(p,q) \approx \frac{1}{2} d_{g}^{2}(p,q) \end{aligned}$$
(48)

that satisfies

$$\begin{aligned} D(p,q)&\ge 0 \quad \text {and}\quad D(p,q)=0 \;\Leftrightarrow \;p=q, \end{aligned}$$
(49a)
$$\begin{aligned} \partial _{1}^{2} D(p,q)&\succ 0,\quad \forall p \in {{\,\mathrm{dom}\,}}D(\cdot ,q). \end{aligned}$$
(49b)

We refer to, e.g., Bauschke and Borwein (1997) and Censor and Zenios (1997) for a complete definition. Property (49b) says that, for any feasible p, the Hessian with respect to the first argument is positive definite. In fact, suitable divergence functions D locally recover in this way the metric tensor of the underlying manifold \(\mathscr {M}\), in order to qualify as a surrogate for the squared Riemannian distance (48).

For the present case \(\mathscr {M}=\mathscr {P}_{d}\) of interest, Sra (2016) proposed a divergence function, called the Stein divergence, which is given for \(S_{1},S_{2} \in \mathscr {P}_{d}\) by

$$\begin{aligned} D_{s}(S_{1},S_{2}) = \log \det \left( \frac{S_{1}+S_{2}}{2}\right) - \frac{1}{2}\log \det (S_{1} S_{2}). \end{aligned}$$
(50)

In order to avoid solving the numerically involved generalized eigenvalue problem required for evaluating the Riemannian distance (36) in problem (40), which underlies the subsequent extraction of prototypes (1b) in Sect. 5, we replace (40) by

$$\begin{aligned} \overline{S} = \arg \min _{S\in \mathscr {P}_{d}} J_{s}(S;\mathscr {S}_{N}),\quad J_{s}(S;\mathscr {S}_{N}) = \sum _{i\in [N]}\omega _{i} D_{s}(S,S_{i}). \end{aligned}$$
(51)

The resulting Riemannian gradient flow reads

$$\begin{aligned} \dot{S} = -{{\,\mathrm{grad}\,}}J_{s}(S;\mathscr {S}_{N})&\overset{(38)}{=} -S \partial J(S;\mathscr {S}_{N}) S \end{aligned}$$
(52a)
$$\begin{aligned}&= -\frac{1}{2}\big (S R(S;\mathscr {S}_{N}) S - S\big ), \end{aligned}$$
(52b)

with

$$\begin{aligned} R(S;\mathscr {S}_{N}) = \sum _{i\in [N]}\omega _{i}\left( \frac{S+S_{i}}{2}\right) ^{-1}. \end{aligned}$$
(53)

Discretizing the flow using the geometric explicit Euler scheme with step size h,

$$\begin{aligned} S_{(t+1)}&= \exp _{S_{(t)}}\big (-h {{\,\mathrm{grad}\,}}J_{s}(S_{(t)};\mathscr {S}_{N})\big ) \end{aligned}$$
(54a)
$$\begin{aligned}&\overset{(37)}{=} S_{(t)}^{\frac{1}{2}} {{\,\mathrm{expm}\,}}\left( \frac{h}{2}\left( I-S_{(t)}^{\frac{1}{2}} R(S_{(t)};\mathscr {S}_{N})S_{(t)}^{\frac{1}{2}}\right) \right) S_{(t)}^{\frac{1}{2}} \end{aligned}$$
(54b)

and using the log-Euclidean mean (47) as initial point \(S_{(0)}\), defines Algorithm 2 as listed below.

Algorithm 2
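The following sketch illustrates the iteration (54) with the initialization (47); the step size handling and stopping criterion of Algorithm 2 are simplified here, and the function names are ours.

```python
import numpy as np

def _eig_fun(S, fun):
    """Apply a scalar function to a symmetric matrix via its eigendecomposition."""
    w, V = np.linalg.eigh(S)
    return (V * fun(w)) @ V.T

def stein_mean(samples, weights, h=1.0, iters=100, tol=1e-8):
    """Geometric Euler scheme (54) for the S-divergence mean (51)."""
    # initial point S_(0): log-Euclidean mean, cf. (47)
    S = _eig_fun(sum(w * _eig_fun(Si, np.log) for w, Si in zip(weights, samples)),
                 np.exp)
    I = np.eye(S.shape[0])
    for _ in range(iters):
        # R(S; S_N) from (53)
        R = sum(w * np.linalg.inv(0.5 * (S + Si)) for w, Si in zip(weights, samples))
        Sh = _eig_fun(S, np.sqrt)                      # S^{1/2}
        A = 0.5 * h * (I - Sh @ R @ Sh)                # expm argument in (54b)
        S_new = Sh @ _eig_fun(0.5 * (A + A.T), np.exp) @ Sh  # symmetrize for stability
        if np.linalg.norm(S_new - S) <= tol * np.linalg.norm(S):
            return S_new
        S = S_new
    return S
```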

4 Ordered Layer Segmentation

In this section, we work out an extension of the assignment flow (Sect. 2) which is able to respect the order of cell layers as a global constraint while remaining in the same smooth geometric setting. In particular, existing schemes for numerical integration still apply to the novel variant.

4.1 Ordering Constraint

With regard to segmenting OCT data volumes, the order of cell layers is crucial prior knowledge. In this paper we focus on the segmentation of the following 11 retina layers: retinal nerve fiber layer (RNFL), ganglion cell layer (GCL), inner nuclear layer (INL), outer plexiform layer (OPL), outer nuclear layer (ONL), two photoreceptor layers (PR1, PR2) separated by the external limiting membrane (ELM), the choriocapillaris (CC) and the retinal pigment epithelium (RPE) together with the choroid section (CS). Figure 3 also contains positions for the internal limiting membrane (ILM) and Bruch’s membrane (BM).

Fig. 3

OCT volume acquisition: \({\textcircled {{1}}}\) is the A-scan axis (single A-scan is marked yellow). Multiple A-scans taken in rapid succession along axis \({\textcircled {{2}}}\) form a two-dimensional B-scan (single B-scan is marked blue). The complete OCT volume is formed by repeating this procedure along axis \({\textcircled {{3}}}\). A list of retina layers that we expect to find in every A-scan is shown on the left (Color figure online)

To incorporate this knowledge into the geometric setting of Sect. 2, we require a smooth notion of ordering which allows comparing two probability distributions. In the following, we assume the prototypes \(f^{*}_{j} \in \mathscr {F}\), \(j \in [c]\), in some feature space \(\mathscr {F}\) to be indexed such that ascending label indices reflect the physiological order of cell layers.

Definition 1

(Ordered Assignment Vectors) A pair of voxel assignments \((w_i, w_j)\in \mathscr {S}^2\), \(i < j\) within a single A-scan is called ordered, if \(w_j - w_i \in K = \{ By:y\in \mathbb {R}^c_+ \}\) with the matrix

$$\begin{aligned} B = \left( \begin{array}{ccccc} -1 &{} &{} &{} &{} \\ 1 &{} -1 &{} &{} &{} \\ &{} 1 &{} \ddots &{} &{} \\ &{} &{} \ddots &{} -1 &{} \\ &{} &{} &{} 1 &{} -1 \end{array} \right) \in \mathbb {R}^{c\times c}. \end{aligned}$$
(55)

This new continuous ordering of probability distributions is consistent with discrete ordering of layer indices in the following way.

Lemma 1

Let \(w_i = e_{l_1}\), \(w_j = e_{l_2}\), \(l_1, l_2\in [c]\) denote two integral voxel assignments. Then \(w_j-w_i\in K\) if and only if \(l_1 \le l_2\).

Proof

B is regular with inverse

$$\begin{aligned} B^{-1} = -Q,\quad Q_{i,j} = {\left\{ \begin{array}{ll}1 &{} \text {if } i\ge j\\ 0 &{} \text {else}\end{array}\right. } \end{aligned}$$
(56)

and \(w_j-w_i\in K \Leftrightarrow B^{-1}(w_j-w_i)\in \mathbb {R}^c_+\). It holds

$$\begin{aligned} B^{-1}(w_j-w_i) = Qe_{l_1}-Qe_{l_2} = \sum _{k=l_1}^c e_k - \sum _{k=l_2}^c e_k \end{aligned}$$
(57)

such that \(B^{-1}(w_j-w_i)\) has nonnegative entries exactly if \(l_1 \le l_2\). \(\square \)
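Definition 1 and Lemma 1 translate into a simple numerical test; the sketch below uses 0-based indexing and an illustrative tolerance, and is not part of the proposed algorithm itself.

```python
import numpy as np

def lower_triangular_Q(c):
    """Matrix Q from (56): Q_ij = 1 if i >= j, else 0, so that B^{-1} = -Q."""
    return np.tril(np.ones((c, c)))

def is_ordered(w_i, w_j, tol=1e-12):
    """Definition 1: (w_i, w_j) with i < j is ordered iff
    B^{-1}(w_j - w_i) = Q(w_i - w_j) has nonnegative entries."""
    Q = lower_triangular_Q(len(w_i))
    return np.all(Q @ (w_i - w_j) >= -tol)

# integral assignments reproduce the discrete ordering of Lemma 1
c = 4
e = np.eye(c)
assert is_ordered(e[1], e[2])       # labels 2 then 3 (0-based rows 1, 2): ordered
assert not is_ordered(e[2], e[1])   # reversed label order: not ordered
```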

The continuous notion of order preservation put forward in Definition 1 can be interpreted in terms of a related discrete graphical model. Consider a graph consisting of two nodes connected by a single edge. The order constrained image labeling problem on this graph can be written as the integer linear program

$$\begin{aligned} \min _{W\in \{0,1\}^{2\times c}, M\in \varPi (w_i,w_j)} \langle W, D\rangle + \theta \langle Q-\mathbb {I}, M\rangle \end{aligned}$$
(58)

where \(\varPi (w_i,w_j)\) denotes the set of coupling measures for marginals \(w_i\), \(w_j\) and \(\theta \gg 0\) is a penalty associated with violation of the ordering constraint. By taking the limit \(\theta \rightarrow \infty \) we find the more tightly constrained problem

$$\begin{aligned} \min _{W\in \{0,1\}^{2\times c}, M\in \varPi (w_i,w_j)} \langle W, D\rangle \quad \text {s.t. }\langle Q-\mathbb {I}, M\rangle = 0. \end{aligned}$$
(59)

Its feasible set has an informative relation to Definition 1 examined in Proposition 1.

Lemma 2

Let \(M \in \mathbb {R}^{c\times c}\) be an upper triangular matrix with non-negative entries above the diagonal and non-negative marginals

$$\begin{aligned} M\mathbb {1}_c \ge 0,\quad M^\top \mathbb {1}_c \ge 0. \end{aligned}$$
(60)

Then there exists a modified matrix \(M^1\) with the same properties such that \(M^1 \ge 0\).

Proof

Equation (60) directly implies \(M_{11}\ge 0\) and \(M_{cc}\ge 0\) because M is upper triangular. For row indices \(l\ne m\) and column indices \(q\ne r\), define the matrix \(O^{lm,qr}\) with

$$\begin{aligned} O^{lm,qr}_{ij} = {\left\{ \begin{array}{ll} -1&{} \text { if } (i,j)=(l,q) \vee (i,j)=(m,r)\\ 1 &{} \text { if } (i,j)=(l,r) \vee (i,j)=(m,q)\\ 0 &{} \text { else} \end{array}\right. }. \end{aligned}$$
(61)

Then \(O^{lm,qr}\mathbb {1}= (O^{lm,qr})^\top \mathbb {1}= 0\). Adding a matrix \(O^{lm,qr}\) to M does therefore not change its marginals, but it redistributes mass from the positions (lq) and (mr) to the positions (lr) and (mq). Due to (60), it is possible to choose scalars \(\alpha ^k_{lr} \ge 0\) such that

$$\begin{aligned} M + \sum _{2\le k\le c-1}\;\sum _{\begin{array}{c} l < k\\ r > k \end{array}} \alpha ^k_{lr} O^{lk,kr} \ge 0. \end{aligned}$$
(62)

\(\square \)

Proposition 1

A pair of voxel assignments \((w_i, w_j)\in \mathscr {S}^2\) within a single A-scan is ordered if and only if the set

$$\begin{aligned} \varPi (w_i,w_j) \cap \{ M\in \mathbb {R}^{c\times c}:\langle Q-\mathbb {I}, M\rangle = 0 \} \end{aligned}$$
(63)

is not empty.

Proof

See Appendix A. \(\square \)

Proposition 1 shows that transportation plans between ordered voxel assignments \(w_i\) and \(w_j\) exist which do not move mass from \(w_{i,l_1}\) to \(w_{j,l_2}\) if \(l_1 > l_2\). This characterizes order preservation for non-integral assignments as put forward in Definition 1.

Fig. 4

Left En-face view on the volumetric OCT data superimposed by parallel blue lines which represent the location of 61 B-scans within the volume. The red line indicates the position of a B-scan shown in the center image. Center The enlarged view on a B-scan depicts typical artifacts such as shadow regions and speckle noise. Right The gray value intensity of a single vertical A-scan located near the Fovea region. This A-scan is highlighted by a yellow line in the enlarged B-scan (center image). Noisy intensity variations along the A-scan indicate the difficulty of automatically extracting retinal tissue boundary positions (Color figure online)

4.2 Ordered Assignment Flow

Likelihoods as defined in (22) emerge by lifting \(-\frac{1}{\rho }D_{\mathscr {F}}\) regarded as Euclidean gradient of \(-\frac{1}{\rho }\langle D_{\mathscr {F}}, W \rangle \) to the assignment manifold. It is our goal to encode order preservation into a generalized likelihood matrix \(L_\text {ord}(W)\). To this end, consider the assignment matrix \(W\in \mathscr {S}^N\) for a single A-scan consisting of N voxels. We define the related matrix \(Y(W)\in \mathbb {R}^{N(N-1)\times c}\) with rows indexed by pairs \((i,j)\in [N]^2\), \(i\ne j\) in fixed but arbitrary order. Using the matrix Q defined by (56), let the rows of Y be given by

$$\begin{aligned} Y_{(i,j)}(W) = {\left\{ \begin{array}{ll}Q(w_j-w_i) &{} \text {if }i > j\\ Q(w_i-w_j) &{} \text {if }i < j\end{array}\right. }. \end{aligned}$$
(64)

By construction, an A-scan assignment W is ordered exactly if all entries of the corresponding Y(W) are nonnegative. This enables to express the ordering constraint on a single A-scan in terms of the energy objective

$$\begin{aligned} E_\text {ord}(W) = \sum _{(i,j)\in [N]^2,\;i\ne j} \phi (Y_{(i,j)}(W)). \end{aligned}$$
(65)

where \(\phi :\mathbb {R}^c \rightarrow \mathbb {R}\) denotes a smooth approximation of \(\delta _{\mathbb {R}^c_+}\), the indicator function of the nonnegative orthant. In our numerical experiments, we choose

$$\begin{aligned} \phi (y) = \left\langle \gamma \exp \left( -\frac{1}{\gamma } y \right) , \mathbb {1}\right\rangle \end{aligned}$$
(66)

with a constant \(\gamma > 0\). Suppose a full OCT volume assignment matrix \(W\in \mathscr {W}\) is given and denote the set of submatrices for each A-scan by C(W). Then order preserving assignments consistent with given distance data \(D_{\mathscr {F}}\) in the feature space \(\mathscr {F}\) are found by minimizing the energy objective

$$\begin{aligned} E(W) = \langle D_{\mathscr {F}}, W\rangle + \sum _{W_{A} \in C(W)} E_\text {ord}(W_{A}). \end{aligned}$$
(67)

We consequently define the generalized likelihood map

$$\begin{aligned} L_\text {ord}(W) = \exp _{W}\left( -\frac{1}{\rho }\nabla E(W)\right) , \end{aligned}$$
(68)

with the Euclidean gradient \(\nabla E(W)\) of (67),

and specify a corresponding assignment flow variant.

Definition 2

(Ordered Assignment Flow) The dynamical system

$$\begin{aligned} \dot{W} = R_WS(L_\text {ord}(W)),\quad W(0) = \mathbb {1}_\mathscr {W} \end{aligned}$$
(69)

evolving on \(\mathscr {W}\) is called the ordered assignment flow.

By applying known numerical schemes (Zeilmann et al. 2020) for approximately integrating the flow (69), we find a class of discrete-time image labeling algorithms which respect the physiological cell layer ordering in OCT data. In Sect. 5, we benchmark the simplest instance of this class, emerging from the choice of geometric Euler integration.
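For illustration, the ordering energy (65) with the choice (66) can be evaluated per A-scan as sketched below; the naive double loop and the function names serve exposition only and are not representative of an efficient implementation.

```python
import numpy as np

def phi(y, gamma):
    """Smooth approximation (66) of the nonnegativity indicator."""
    return np.sum(gamma * np.exp(-y / gamma))

def ordering_energy(W, gamma=0.1):
    """Ordering energy (65) for a single A-scan assignment matrix W of shape (N, c).

    The rows Y_(i,j)(W) of (64) are evaluated for all index pairs; by (64),
    the pairs (i, j) and (j, i) contribute identical terms."""
    N, c = W.shape
    Q = np.tril(np.ones((c, c)))                  # matrix Q from (56)
    E = 0.0
    for i in range(N):
        for j in range(N):
            if i == j:
                continue
            # for i > j: Y = Q(w_j - w_i); for i < j: Y = Q(w_i - w_j)
            y = Q @ (W[j] - W[i]) if i > j else Q @ (W[i] - W[j])
            E += phi(y, gamma)
    return E
```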

5 Experimental Results

5.1 Data, Competing Approaches, Performance Measures

5.1.1 OCT-Data

In the following sections, after introducing key terminology for volumetric OCT data, we describe experiments performed on a set of OCT volumes depicting the intensity of light reflection in chorioretinal tissues centered around the fovea. The scans were obtained with a spectral domain OCT device (Heidelberg Engineering, Germany) for multiple patients at a variety of resolutions, by averaging several registered B-scans sharing the same location in order to reduce speckle noise. This reflects the fact that different resolutions may be desirable in clinical settings, at the preference of medical practitioners. In the following, we always assume an OCT volume in question to consist of \(N_B\) B-scans, each comprising \(N_A\) A-scans with N voxels, and use the term surface to refer to the set of voxels located at the interface of two retina layers. See Fig. 3 for a schematic illustration of the data acquisition process.

In the present work, we use a private dataset of 3D OCT volume scans provided by Heidelberg Engineering GmbH which we split into 82 volumes for training and 8 volumes for testing. In particular, the test set contains scans from multiple different patients without any observable pathological retina changes. See Appendix C for a detailed list of volume sizes and resolutions along each axis.

Figure 4 demonstrates the typical organization of a 3D-OCT volume acquired by scanning healthy human retina using an OCT device. B-Scans are indicated as blue lines placed in the Fundus image on the left. The particular B-Scan marked in red is depicted in the middle of Fig. 4. This illustrates the typical artifacts and corrupted layer intensities of the OCT volume. The right plot depicts the noisy signal along an A-scan indicated by a yellow vertical line which underpins the difficulty of segmenting the underlying data sets.

5.1.2 Reference Methods

To assess the segmentation performance of our proposed approach, we compare against the state-of-the-art retina segmentation methods presented in Rathke et al. (2014) and Kang et al. (2006), which are applicable to both healthy and pathological patient data. In particular, we prefer these reference methods over Dufour et al. (2013), Song et al. (2013) and Garvin et al. (2009) because available implementations of the latter are limited to the segmentation of up to 9 retina layers. For both reference methods, we use the software implementation of their authors without any additional tuning or retraining.

IOWA Reference Algorithm A well-known graph-based approach to segmentation of macular volume data was developed by the Retinal Image Analysis Laboratory at the Iowa Institute for Biomedical Imaging (Kang et al. 2006; Abràmoff et al. 2010; Garvin et al. 2009). The problem of localizing cell layer boundaries in 3D OCT volumes is ultimately transformed into a minimum s-t cut problem on a non-trivially constructed graph G. To this end, a distance tensor \(D_k\in \mathbb {R}^{N_B\times N_A\times N}\) is formed in a feature extraction step for each boundary \(k\in [c-1]\). This encodes \(c-1\) separate binary segmentation problems on a geometric graph \(G_k\) spanning the volume. In each instance, voxels are to be classified as either belonging or not belonging to boundary k. By utilizing a (directed) neighborhood structure on each \(G_k\), smoothness constraints are introduced and regulated via user-specified stiffness parameters. To model interactions between different boundaries, the graphs \(G_k\) are combined into a global graph G by introducing additional edges between them. The latter edges set up constraints on the distance between consecutive boundaries within each A-scan which can be used to enforce physiological ordering of cell layers. On G, the problem of optimal boundary localization takes the form of a minimal closed set construction which is in turn transformed into a minimum s-t cut problem for which standard methods exist. Their standalone software is freely available for research purposes.

Probabilistic Model Rathke et al. (2014) proposed a graph-based probabilistic approach for segmenting OCT volumes for given data y by leveraging the Bayesian ansatz

$$\begin{aligned} p(y,s,b) = p(y|s)p(s|b)p(b). \end{aligned}$$
(70)

Here, the tensor \(b \in \mathbb {R}^{N_B\times N_A \times (c-1)}\) contains real-valued boundary positions between retina layers and s denotes discrete (voxel-wise) segmentation. The appearance terms p(y|s), p(s|b) and p(b) represent data likelihood, Markov random field regularizer and global shape prior respectively. In order to approximate the desired posterior

$$\begin{aligned} p(b,s|y) = \frac{p(y|s)p(s|b)p(b)}{p(y)}, \end{aligned}$$
(71)

a variational inference strategy is employed. This aims to find a tractable distribution q decoupled into

$$\begin{aligned} q(b,s) = q_b(b)q_s(s) \end{aligned}$$
(72)

which is close to p(bs|y) in terms of the relative entropy \(\text {KL}(q\,|\,p)\). The shape prior p(b) is learned offline by maximum likelihood estimation in the space of normal distributions using a low-rank approximation of the involved covariance matrix. Ordering constraints

$$\begin{aligned} 1 \le s_{1,ij} \le s_{2,ij} \le \cdots \le s_{c-1,ij}, \quad ij \in [N_B] \times [N_A] \end{aligned}$$
(73)

are enforced for the discrete segmentation s, but not for the continuous boundaries b. This is in contrast to the proposed model, which integrates the ordering of retina layers by adding the cost function (67) that penalizes, during numerical integration of (69), the deviation of soft assignments from the set of probability distributions satisfying Definition 1. The method comes along with a standalone software which is freely available.

5.1.3 Performance Measures

We evaluate the computed segmentations by direct comparison with manual annotations regarded as gold standard, which were produced by a medical expert. The respective metrics are suitable for segmentation tasks that involve multiple tissue types (Crum et al. 2006). Specifically, we report the mean DICE similarity coefficient (Dice 1945) for each segmented cell layer.

Definition 3

(DICE) Given two sets AB the DICE similarity coefficient is defined as

$$\begin{aligned} {{\,\mathrm{DSC}\,}}(A,B) := \frac{2|A \cap B|}{|A|+|B|} = \frac{2{\textit{TP}}}{2{\textit{TP}}+{\textit{FP}}+{\textit{FN}}} \in [0,1], \end{aligned}$$
(74)

where \(\{{\textit{TP}},{\textit{FN}},{\textit{FP}}\}\) denotes the number of true positives, false negatives and false positives respectively.

The DICE similarity coefficient quantifies the region agreement between computed segmentation results and manually labeled OCT volumes which serve as ground truth. High similarity index \({{\,\mathrm{DSC}\,}}(A,B) \approx 1\) indicates large relative overlap between the sets A and B. This metric is well suited for average performance evaluation and appears frequently in the literature (e.g. Chiu et al. 2015; Yazdanpanah et al. 2011; Novosel et al. 2017). It is closely related to the positively correlated Jaccard similarity measure (Jaccard 1908) which in contrast to (74) is more strongly influenced by worst case performance.

In addition, we report the mean absolute error (MAE) of computed layer boundaries used in Rathke et al. (2014) and Garvin et al. (2009) to make our results more directly comparable to these references.

Definition 4

(Mean Absolute Error) For a single A-scan indexed by \(ij\in [N_B]\times [N_A]\), let \(e_{ij} :=| g_{ij}-p_{ij} |\) denote the absolute difference between a layer boundary position \(g_{ij}\) in the gold standard segmentation and a predicted layer boundary \(p_{ij}\). The mean absolute error (MAE) is defined as the mean value

$$\begin{aligned} {{\,\mathrm{MAE}\,}}(g,p) = \frac{1}{N_BN_A}\sum _{ij \in [N_B]\times [N_A]} e_{ij}. \end{aligned}$$
(75)
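For illustration, both measures can be computed directly from voxel-wise label maps and per-A-scan boundary positions; the following minimal sketch assumes integer label volumes of identical shape and boundary arrays of shape \(N_B\times N_A\), with function names chosen for exposition.

```python
import numpy as np

def dice_coefficient(pred_labels, gold_labels, layer):
    """DICE similarity coefficient (74) for one cell layer, given integer label volumes."""
    a = (pred_labels == layer)
    b = (gold_labels == layer)
    denom = a.sum() + b.sum()
    return 2.0 * np.logical_and(a, b).sum() / denom if denom > 0 else 1.0

def mean_absolute_error(gold_boundary, pred_boundary):
    """MAE (75): mean absolute difference of boundary positions (in pixels)
    over all A-scans; both arrays have shape (N_B, N_A)."""
    return np.mean(np.abs(gold_boundary - pred_boundary))
```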

5.2 Feature Extraction

5.2.1 Region Covariance Descriptors

To apply the geometric framework proposed in Sect. 3 we next introduce the region covariance descriptors (Tuzel et al. 2006) which have been widely applied in computer vision and medical imaging, see e.g. Cherian and Sra (2016), Turaga and Srivastava (2016), Depeursinge et al. (2014) and Sirinukunwattana et al. (2015). We model the raw intensity data for a given OCT volume by a mapping \(I:\mathscr {D} \rightarrow \mathbb {R}_+\) where \(\mathscr {D} \subset \mathbb {R}^3\) is the underlying spatial domain. To each voxel \(v \in \mathscr {D}\), we associate the local feature vector \(f :\mathscr {D} \rightarrow \mathbb {R}^{10}\),

$$\begin{aligned}&f : \mathscr {D} \rightarrow \mathbb {R}^{10} \end{aligned}$$
(76)
$$\begin{aligned}&v \mapsto ( I(v), \nabla _x I(v), \nabla _y I(v), \nabla _z I(v),\nonumber \\&\sqrt{2}\nabla _{xy} I(v), \dots , \nabla _{zz} I(v) )^\top . \end{aligned}$$
(77)

This vector assembles the intensity I(v) as well as first- and second-order responses of derivative filters which capture information from larger scales, following Hashimoto and Sklansky (1987). To improve segmentation accuracy, we combine the derivative filter responses from various scales in a computationally efficient way. To this end, we first normalize the derivatives of the input volume I(v) at every scale \(\sigma _s\) by convolving each dimension with a 1D window:

$$\begin{aligned} \nabla _x \tilde{I}_{\sigma _s}(v) = \sigma _s^{2}\frac{\partial }{\partial x}\tilde{G}(v,\sigma _s) \end{aligned}$$
(78)

where \(\tilde{G}(v,\sigma _s)\) is an approximation to the Gaussian-windowed volume \(\big (G(\cdot ,\sigma _s)*I\big )(v)\) at scale \(\sigma _s\), as described in detail in Hashimoto and Sklansky (1987). Subsequently, we follow the idea presented by Lindeberg (2004) and take local maxima over scales

$$\begin{aligned} \nabla _x \tilde{I}(v) = \max _{\sigma _s} \nabla _x \tilde{I}_{\sigma _s}(v), \end{aligned}$$
(79)

which serve as the entries of the mapping (76).
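A minimal sketch of this multi-scale derivative computation is given below; isotropic Gaussian derivative filters stand in for the separable 1D window approximation of Hashimoto and Sklansky (1987), the \(\sigma _s^2\) normalization follows (78), and the maximum over scales follows (79). Scale values and function names are illustrative; second-order responses are obtained analogously with derivative order 2.

```python
import numpy as np
from scipy.ndimage import gaussian_filter

def scale_normalized_gradients(volume, scales=(1.0, 2.0, 4.0)):
    # Approximate (78)-(79): sigma^2-normalized first derivatives, maximized over
    # scales. Isotropic Gaussian derivative filters replace the separable 1D window
    # approximation of Hashimoto and Sklansky (1987); scale values are illustrative.
    best = None
    for sigma in scales:
        grads = []
        for axis in range(3):              # derivatives along x, y, z
            order = [0, 0, 0]
            order[axis] = 1
            grads.append(sigma**2 * gaussian_filter(volume, sigma, order=tuple(order)))
        g = np.stack(grads, axis=-1)
        best = g if best is None else np.maximum(best, g)   # maxima over scales, cf. (79)
    return best                            # shape (*volume.shape, 3)
```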

By introducing a suitable geometric graph spanning \(\mathscr {D}\), we can associate a neighborhood \(\mathscr {N}_i\) of fixed size with each voxel \(i\in [n]\) as in (24). For each neighborhood, we define the regularized region covariance descriptor

$$\begin{aligned} S_i := \sum _{j \in \mathscr {N}_i}\theta _{ij} (f_j-\overline{f_i})(f_j-\overline{f_i})^T + \epsilon I, \quad \overline{f_i} = \sum _{k \in \mathscr {N}_i}\theta _{ik}f_k, \end{aligned}$$
(80)

as a weighted empirical covariance matrix with respect to the feature vectors \(f_{j}\). The small value \(0 < \epsilon \ll 1\) acts as a regularization parameter enforcing positive definiteness of \(S_i\). The diagonal entries of each covariance matrix \(S_i\) are empirical variances of the feature channels in (76), while the off-diagonal entries represent empirical correlations between feature channels within the region \(\mathscr {N}_i\).
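The descriptor (80) for a single neighborhood can be sketched as follows; uniform weights \(\theta _{ij}=1/|\mathscr {N}_i|\) are assumed here for simplicity, and the function name is illustrative.

```python
import numpy as np

def region_covariance(features, eps=1e-6):
    # Regularized region covariance descriptor (80) for a single neighborhood.
    # `features` has shape (m, d): the feature vectors f_j of all m voxels in N_i;
    # uniform weights theta_ij = 1/m are assumed for simplicity.
    m, d = features.shape
    mean = features.mean(axis=0)           # weighted mean \bar{f}_i
    centered = features - mean
    cov = centered.T @ centered / m        # weighted empirical covariance
    return cov + eps * np.eye(d)           # epsilon * I enforces positive definiteness
```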

5.2.2 Prototypes on \(\mathscr {P}^d\)

In view of the assignment flow framework introduced in Sect. 2, we interpret region covariance descriptors (80) as data points in the metric space \(\mathscr {P}^d\) of symmetric positive definite matrices and model each retina tissue indexed by \(l \in [c]\) with a random variable \(S_l\) taking values in \(\mathscr {P}^d\). Suppose we draw \(N_l\) samples \(\{S_l^k \}_{k = 1}^{N_l}\) from the distribution of \(S_l\). The most basic way to apply assignment flows to data in \(\mathscr {P}^d\) is based on computing a prototypical element of \(\mathscr {P}^d\) for each tissue layer, e.g. the Riemannian center of mass of \(\{S_l^k \}_{k = 1}^{N_l}\). This corresponds to directly choosing \(\mathscr {P}^d\) as feature space \(\mathscr {F}\) in (1a). We find that superior empirical results are achieved by considering a dictionary of \(K_l > 1\) prototypical elements for each layer \(l\in [c]\). This entails partitioning the samples \(\{S_l^k \}_{k = 1}^{N_l}\) into \(K_l\) disjoint subsets \(\hat{S}^j_l \subseteq \{S_l^k \}_{k = 1}^{N_l}\), \(j\in [K_l]\) with representatives \(\tilde{S}^j_l\) determined offline.

Fig. 5

Top Metric classification evaluated on thin layers (IPL, INL, OPL, PR2). Bottom Analogous metric evaluation for (GCL, ONL, PR1, RPE). From left to right The number of true outcomes after direct comparison with ground truth, for the exact Riemannian geometry of \(\mathscr {P}^d\), the Stein divergence and the Log-Euclidean distance used for geometric mean computation. The results in the first two columns indicate higher detection performance when the Riemannian geometry of the curved manifold is respected. Enlarging the set of prototypical covariance descriptors leads to increased matching accuracy, in contrast to the observed flattening of the matching curves when using the Log-Euclidean distance

To find a set of representatives which captures the structure of the data, we minimize the expected loss measured by the Stein divergence (50), leading to the K-means-like functional

$$\begin{aligned} \begin{aligned} \mathbb {E}_{p_l}(\tilde{S}_{l})&= \sum ^{K_l}_{j = 1} p(j) \sum _{S_l^i \in \hat{S}_l^j}\frac{p(i,j)}{p(j)} D_S(S_l^i,\tilde{S}^{j}_l),\\ p(i,j)&= \frac{1}{N_l},\quad p(j) = \frac{N_j}{N_l}. \end{aligned} \end{aligned}$$
(81)

A hard partitioning is achieved by applying Lloyd's algorithm in conjunction with Algorithm 2 for mean retrieval. In addition, we employ the more common soft K-means-like approach for determining prototypes, by fitting to the given data a mixture model of exponential family type based on the Stein divergence,

$$\begin{aligned} p(S_l^i, \Gamma _l) = \sum _{j = 1}^{K} \pi _{l}^j\, p(S_l^i,\tilde{S}_l^j), \end{aligned}$$
(82)

where the parameters

$$\begin{aligned} \Gamma _l = \big (\{\pi _l^{j}\}_{j = 1}^{K},\{\tilde{S}_l^j\}_{j = 1}^{K}\big ), \quad (\pi _l^1,\ldots , \pi _l^{K}) \in S \end{aligned}$$
(83)

have to be adjusted to the given data. The prototypes are recovered as the mean parameters \(\tilde{S}_l^{(j,T)}\) through an iterative process commonly referred to as expectation maximization (EM), defined by alternating the following iterations

$$\begin{aligned} \begin{aligned} p_l(j|S_l^i,\Gamma ^t_l)&= \frac{\pi _l^{(j,t)}e^{-D_S(S_l^i,\tilde{S}_l^{(j,t)})}}{\sum _{k = 1}^{K}\pi _l^{(k,t)}e^{-D_S(S_l^i,\tilde{S}_l^{(k,t)})}},\\&\quad ({{\textit{\textbf{Expectation step}}}}) \end{aligned} \end{aligned}$$
(84)

followed by updating the marginals and mean parameters at each step t up to the final step T

$$\begin{aligned} \pi _l^{(j,t+1)}&= \frac{1}{N_l}\sum _{i = 1}^{N_l} p_l(j|S_l^i, \Gamma ^t_l), \end{aligned}$$
(85)
$$\begin{aligned} \tilde{S}^{(j,t+1)}_l&= {{\,\mathrm{arg min}\,}}_{S \in \mathscr {P}^d} \left( \sum _{i =1}^{N_l} p_l(j|S_l^i,\Gamma ^t_l)\, D_{S}(S^i_l,S)\right) . \nonumber \\&\quad ({{\varvec{Maximization step}}}) \end{aligned}$$
(86)
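The alternation (84)–(86) can be sketched compactly as follows; the fixed-point iteration used for the weighted Stein mean is an assumption standing in for Algorithm 2, and all function names are chosen for illustration only.

```python
import numpy as np
from scipy.linalg import cholesky, inv

def logdet(A):
    # log-determinant of an SPD matrix via its Cholesky factor
    return 2.0 * np.sum(np.log(np.diag(cholesky(A))))

def stein_divergence(X, Y):
    # Stein (S-)divergence between SPD matrices, cf. (50)
    return logdet(0.5 * (X + Y)) - 0.5 * (logdet(X) + logdet(Y))

def stein_mean(samples, weights, iters=20):
    # Weighted mean w.r.t. the Stein divergence via the fixed-point iteration
    # X <- [sum_i w_i ((X + S_i)/2)^{-1}]^{-1}; used here in place of Algorithm 2.
    w = np.asarray(weights, dtype=float)
    w = w / w.sum()
    X = sum(wi * S for wi, S in zip(w, samples))      # arithmetic mean as initialization
    for _ in range(iters):
        X = inv(sum(wi * inv(0.5 * (X + S)) for wi, S in zip(w, samples)))
    return X

def em_step(samples, pis, prototypes):
    # One alternation of (84)-(86) for the Stein mixture model of a single layer l.
    N, K = len(samples), len(prototypes)
    resp = np.zeros((N, K))
    for i, S in enumerate(samples):                   # E-step (84)
        logits = np.array([np.log(pis[j]) - stein_divergence(S, prototypes[j])
                           for j in range(K)])
        logits -= logits.max()                        # numerical stabilization
        resp[i] = np.exp(logits) / np.exp(logits).sum()
    pis_new = resp.mean(axis=0)                       # marginal update (85)
    protos_new = [stein_mean(samples, resp[:, j]) for j in range(K)]   # M-step (86)
    return pis_new, protos_new
```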

The decision to approximate the Riemannian metric on \(\mathscr {P}^d\) by the Stein divergence (50) can be backed up empirically. To this end, we randomly select descriptors (80) representing the nerve fibre layer in real-world OCT data and compute their Riemannian mean as well as their means w.r.t. the Log-Euclidean metric (46) and the Stein divergence (50). Figure 6 illustrates that the Stein divergence approximates the full Riemannian metric more precisely than the Log-Euclidean metric while still achieving a significant reduction in computational effort. Furthermore, to evaluate the classification performance, we extracted a dictionary of 200 prototypes per retina tissue for each choice of metric and subsequently evaluated the resulting segmentation accuracy on a cropped OCT volume of size \(138\times 100\times 40\) taken from the test set, by assigning each voxel to the class containing the prototype with smallest distance.

Figure 5 visualizes the correct classification matches for retina layers, colored according to Fig. 3. In particular, we observe a notable gain of correct matches when the Riemannian geometry is respected (first column) as opposed to the Log-Euclidean setting (third column). Regarding the approximation of (36) by (50), we observe more effective detection of the outer photoreceptor layer (PR1), the inner nuclear layer (INL) and the retinal pigment epithelium (RPE). Furthermore, taking a closer look at OPL and ONL, we note a typical tradeoff between the number of prototypes and detection performance, indicating superior layer-to-voxel allocation when applying (46), whereas the surrogate Stein divergence (50) tends to improve accuracy as the number of evaluated prototypes increases, in contrast to the flattening curves obtained when relying on (47).

Fig. 6

Left Deviation of the geometric means computed using the Log-Euclidean metric and the Stein divergence, respectively, from the true Riemannian mean. Right Runtime for geometric mean computation using the different metrics. All evaluations were performed on a randomly chosen subset of covariance descriptors representing the retinal nerve fibre layer in a real-world OCT scan. Both graphics clearly highlight the advantages of using the Stein divergence in terms of approximation accuracy and efficient numerical computation

This illustrates a tradeoff between computational effort and labeling performance, cf. Fig. 6. Note that prototypes are computed offline, making runtime performance less relevant to medical practitioners. However, building a distance matrix involves computing \(n\sum _{l\in [c]} K_l\) Riemannian distances resp. Stein divergences to prototypes. This still leads to a large difference in (online) runtime, since evaluation of the Riemannian distance (36) involves a generalized eigendecomposition, while the less costly Cholesky decomposition suffices to evaluate the Stein divergence (50).
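To make this difference concrete, the sketch below evaluates the affine-invariant Riemannian distance via the generalized eigenvalues of the pencil (X, Y), whereas the Stein divergence in the previous sketch requires only Cholesky factorizations; the function name is illustrative.

```python
import numpy as np
from scipy.linalg import eigh

def riemannian_distance(X, Y):
    # Affine-invariant Riemannian distance, cf. (36): it requires the generalized
    # eigenvalues of the pencil (X, Y), which is considerably more expensive than
    # the few Cholesky factorizations needed for the Stein divergence (50).
    lam = eigh(X, Y, eigvals_only=True)
    return np.sqrt(np.sum(np.log(lam) ** 2))
```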

Summarizing the discussed results concerning the application of Algorithms 1 and 2, we point out that respecting the Riemannian geometry leads to superior labeling results by providing more descriptive prototypes (Figs. 5, 6).

5.2.3 CNN Features

In addition to the covariance features in Sect. 5.2.1, we compare a second approach to local feature extraction based on a convolutional neural network architecture. For each node \(i\in [n]\), we trained the network to directly predict the correct class in [c] using raw intensity values in \(\mathscr {N}_i\) as input. As output, we find a score for each layer which can directly be transformed into a distance vector suitable as input to the ordered assignment flow (69) via (68). The specific network used in our experiments has a ResNet architecture comprising four residually connected blocks of 3D convolutions and ReLU activation. Model size was hand-tuned for different sizes of input neighborhoods, adjusting the number of convolutions per block as well as corresponding channel dimensions. Details of the employed architecture are listed in Appendix B. In particular, the input is a patch of voxels with size \(17\times 17\times 5\) which upper-bounds the network field of view. We thus limit the network to extracting localized features as compared to commonly used machine learning approaches which aim to incorporate as much global context into the feature extraction process as possible. For example, the U-Net architecture employed in Liu et al. (2019) works with large (\(496\times 64\)) slices of B-scans and comprises three \(2\times 2\) pooling operations. On the coarsest scale (bottom of the U), a single convolution with filter size \(7\times 3\) thus translates into a field of view of at least \(56\times 24\) after unpooling.
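A minimal PyTorch sketch of such a residually connected 3D block and a patch classifier is shown below for orientation; the exact channel widths, number of convolutions per block and further details are listed in Appendix B, and the values below are placeholders rather than the trained architecture.

```python
import torch
import torch.nn as nn

class ResBlock3D(nn.Module):
    # One residually connected block of 3D convolutions with ReLU activations.
    def __init__(self, channels):
        super().__init__()
        self.conv1 = nn.Conv3d(channels, channels, kernel_size=3, padding=1)
        self.conv2 = nn.Conv3d(channels, channels, kernel_size=3, padding=1)
        self.relu = nn.ReLU()

    def forward(self, x):
        out = self.relu(self.conv1(x))
        out = self.conv2(out)
        return self.relu(out + x)          # residual connection

class LocalPatchNet(nn.Module):
    # Illustrative classifier mapping a 17 x 17 x 5 intensity patch to c layer scores;
    # channel width and block count are placeholders, not the values of Appendix B.
    def __init__(self, c, channels=32, blocks=4):
        super().__init__()
        self.stem = nn.Conv3d(1, channels, kernel_size=3, padding=1)
        self.blocks = nn.Sequential(*[ResBlock3D(channels) for _ in range(blocks)])
        self.head = nn.Linear(channels, c)

    def forward(self, patch):              # patch: (batch, 1, 17, 17, 5)
        h = self.blocks(self.stem(patch))
        h = h.mean(dim=(2, 3, 4))          # global average pooling over the patch
        return self.head(h)                # one score per cell layer
```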

Fig. 7

From top to bottom: Row a One B-scan from an OCT volume showing the shadow effect, with the ground truth plotted on the right. Row b Local nearest neighbor assignments based on prototypes determined by minimizing (81) with the Stein divergence, with the result of the segmentation returned by the basic assignment flow (Sect. 2) on the right. Row c Proposed layer-ordered volume segmentation based on covariance descriptors. From left to right: ordered volume segmentation for different values \(\gamma = 0.5, \gamma = 0.1\) [cf. Eq. (66)]. Row d Local rounding result extracted from the ResNet on the left and the result of the ordered assignment flow on the right

Table 1 Dice indices (± standard deviation) per cell layer for each of the compared segmentation approaches
Table 2 Mean absolute errors (± standard deviation) per cell layer interface for each of the compared segmentation approaches in pixels (1 pixel \(= {3.87}\,\upmu \mathrm{m}\))
Fig. 8

Performance measures per layer in terms of the mean absolute error, based on the segmentation of 10 healthy OCT volumes. Top row Error bars for retina layers separated by the external limiting membrane (ELM), corresponding to OAF (A) and OAF (B). Middle row Comparison of the mean errors of OAF (B) and the probabilistic method (Rathke et al. 2014). Bottom row Comparison of the mean absolute errors of OAF (B) and the IOWA reference algorithm

Fig. 9

Box plots of DICE similarity coefficients between computed segmentation results and manually labeled ground truth. Left OAF (A). Right OAF (B). The OAF based on CNN features yields improved segmentations for all retina layers

5.3 Segmentation via Ordered Assignment

By numerically integrating the ordered assignment flow (2), parametrized by the distance matrix D, an assignment state W is evolved on \(\mathscr {W}\) until the mean entropy of the pixel assignments is low. We specifically use geometric Euler integration steps on \(T\mathscr {W}\) with a constant step-length of \(h=0.1\) (see Zeilmann et al. 2020 for details of this process). Geometric averaging with uniform weights leads to local regularization of assignments which smooths regions in which the features do not conclusively point to any label. More global knowledge about the ordering of cell layers is incorporated through \(E_{\text {ord}}\), which addresses more severe inconsistencies between local features and the global ordering. In all experiments, the neighborhood of each voxel \(i\in [n]\) is chosen as the voxel patch of size \(5\times 5\times 3\) centered at i.
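For orientation, the following schematic sketch shows one multiplicative (geometric) Euler-type update for a plain, unordered assignment flow with uniform geometric averaging; the additional ordering term derived from \(E_{\text {ord}}\) in (69) is omitted, and all names as well as the exact form of the update are simplifications rather than the scheme of Zeilmann et al. (2020).

```python
import numpy as np

def lift(W, V):
    # Lifting map: componentwise W * exp(V), renormalized to the probability simplex.
    U = W * np.exp(V)
    return U / U.sum(axis=-1, keepdims=True)

def geometric_average(L, neighbors):
    # Uniform geometric (log-domain) averaging of likelihoods over each neighborhood.
    logL = np.log(L)
    S = np.exp(np.stack([logL[nb].mean(axis=0) for nb in neighbors]))
    return S / S.sum(axis=-1, keepdims=True)

def euler_step(W, D, neighbors, rho=1.0, h=0.1):
    # One simplified multiplicative update W <- W * S(W)^h (normalized), a common
    # way of discretizing assignment flows geometrically; the ordering term of (69)
    # is not included here.
    L = lift(W, -D / rho)                # likelihood matrix, cf. (22)
    S = geometric_average(L, neighbors)  # similarity matrix via uniform weights
    return lift(W, h * np.log(S))
```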

Fig. 10

From top to bottom Three sample B-scans extracted at different locations from a healthy OCT volume with 61 scans, with the fovea-centered OCT scan visualized in the middle column. The associated augmented labeling. OAF (A) segmentation using a dictionary of covariance descriptors determined by (82). OAF (B) segmentation using features determined by the CNN. In contrast to the results achieved by OAF (A), the above visualization indicates more accurate detection of retina boundaries using OAF (B), in particular near the fovea region (middle column)

5.4 Evaluation

To benchmark our novel segmentation approach, we first extract local features for each voxel from a raw OCT volume. As described above, either region covariance descriptors (Sect. 5.2.1) or class scores predicted by a CNN (Sect. 5.2.3) are computed as input for segmenting the retina layers with the ordered assignment flow, which we abbreviate in the following as OAF (A) and OAF (B), respectively. To facilitate the comparison between the proposed approach and the reference methods introduced in Sect. 5.1.2, we evaluate the obtained results with the metrics from Sect. 5.1.3 and provide side-by-side visualizations of segmented OCT volumes in each subsection. Specifically, we calculate the DICE similarity coefficient (Dice 1945) and the mean absolute error of segmented cell layers in pixels of size \({3.87}\,\upmu \mathrm{m}\), relative to a human grader, by segmenting 8 OCT volumes consisting of 61 B-scans each. Throughout the performed experiments, we fixed the grid connectivity \(\mathscr {N}_{i}\) for each voxel \(i\in I\) to \(3 \times 5 \times 5\).

5.4.1 Covariance Descriptor vs. CNN

In order to compare OAF (A) and OAF (B), we first evaluate the segmentation performance based on local features given by the covariance descriptors (Sect. 5.2.1) as well as on features extracted by a CNN (Sect. 5.2.3). For OAF (A), a dictionary of \(k=400\) prototypical cluster centers on the positive definite cone (Sect. 3) has been determined offline for each retina layer using the iterative clustering with (82). These are compared to descriptors extracted from the unseen volume by computing pairwise Stein divergences (Sect. 3.2.3). For each pair of voxel \(i\in [n]\) and cell layer \(j\in [c]\), the smallest divergence is stored as entry \(d_{ij}\) of the distance matrix \(D_\text {cov}\), i.e. for every voxel i the divergence to its closest representative of layer j is given by

$$\begin{aligned} (D_{\text {cov}})_{ij} := \min _{k \in [400]} D_S(S_i,\tilde{S}_{j}^k). \end{aligned}$$
(87)

For OAF (B), class scores \(C\in \mathbb {R}^{n\times c}\) predicted by the neural network (Sect. 5.2.3) are transformed into a distance matrix \(D_{\text {cnn}} = -C\) simply by switching their sign, followed by adjusting the parameter \(\rho \) which controls the data scale in the likelihood matrix (22).

A naive way to segment the volume in accordance with the observed data is by choosing \(\arg \min _{j\in [c]} D_{ij}\) for each voxel i. However, due to the challenging signal-to-noise ratio in real-world OCT data, classes will not usually be well-separated in the feature space at hand. The resulting uncertainty pertaining to the assignment of classes using exclusively local features is encoded into each distance matrix.
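The assembly of both distance matrices and the naive local rounding just described are summarized in the following sketch; prototype dictionaries, descriptors, CNN scores and the divergence callable (e.g. the Stein divergence from the earlier sketch) are assumed to be given, and all names are illustrative.

```python
import numpy as np

def distance_matrix_cov(descriptors, prototypes_per_layer, divergence):
    # Assemble D_cov via (87): for every voxel, the smallest divergence to the
    # prototypes of each layer. `prototypes_per_layer[j]` is the dictionary of layer j.
    n, c = len(descriptors), len(prototypes_per_layer)
    D = np.empty((n, c))
    for i, S in enumerate(descriptors):
        for j, protos in enumerate(prototypes_per_layer):
            D[i, j] = min(divergence(S, P) for P in protos)
    return D

def distance_matrix_cnn(scores):
    # Turn CNN class scores C (shape n x c) into distances by switching the sign.
    return -scores

def naive_rounding(D):
    # Purely local labeling without any regularization: argmin over layers per voxel.
    return D.argmin(axis=1)
```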

The experimental results discussed next illustrate the relative influence of the covariance descriptors (80) and the regularization properties of the ordered assignment flow, respectively. To limit the high computational cost of extracting the features (80) and assembling the distance matrix (87), in the experiments carried out for OAF (A) and OAF (B) we segmented OCT volumes consisting of the 41 B-scans remaining after cropping 10 B-scans from each volume boundary. Additionally, we reduced the size of each B-scan by 148 voxels from each side along the \(N_A\) axis to avoid artifacts caused by the highly varying shape and strong thinning of the retinal layers near the volume boundary. Figure 7 illustrates real-world labeling performance based on extracting a dictionary of 400 prototypes per layer by minimizing (81) and employing Algorithm 2 for mean retrieval. The second row in Fig. 7 illustrates a typical result of volume segmentation by nearest neighbor assignment without ordering constraint. As expected, the high texture similarity between the choroid and the GCL layer yields wrong predictions, resulting in violations of the physiological cell layer ordering throughout the whole volume. However, using the pairwise correlations captured by covariance matrices leads to accurate detection of the internal limiting membrane (ILM) with its characteristic highly reflective boundary. Similarly, the light-reflecting fiber layers RNFL, PR1 and RPE can also be detected by this approach. For the particularly challenging inner layers such as GCL, INL and ONL, which mainly comprise weakly reflective neuronal cell bodies, regularization by imposing (65) is required. In the third row of Fig. 7, we plot the ordered volume segmentation for two different values of the parameter \(\gamma \) defined in (66), which controls the ordering regularization by means of the novel generalized likelihood matrix (68). Direct comparison with the ground truth shows how ordered labelings evolve on the assignment manifold while simultaneously giving accurate data-driven detection of the RNFL, OPL, INL and ONL layers. For the remaining critical inner layers, the local prototypes extracted by (81) fail to represent the retina layers properly and lead to artifacts due to the presence of vertical shadow regions caused by blood vessels, which contribute to a loss of the interference signal during the OCT scanning process, as shown in Fig. 7.

Fig. 11

Box plots of DICE similarity coefficients between computed segmentation results and manually labeled ground truth. Left IOWA reference algorithm (Garvin et al. n.d). Right OAF based on CNN features. See Table 1 for mean and standard deviations. Exploiting OAF (B) for retina tissue classification results in improved overall layer detection performance, especially for the PR1-RPE region

Fig. 12

Illustration of retina layer segmentation results listed in Tables 1 and 2. From top to bottom Ground truth labeling. Labeled retina tissues using the proposed approach based on covariance descriptors and CNN features, respectively. The resulting segmentation obtained using the IOWA reference algorithm

After segmentation of the test data set, the mean and standard deviation were calculated according to the performance measures (75) and (74) for a better assessment of the retina layer detection accuracy of the proposed segmentation method. The evaluation results for each retina tissue, as depicted in Fig. 3, are detailed in Tables 1 and 2. The first row of Fig. 8 clearly shows the superior detection accuracy of the ordered assignment flow for the first outer retina layers (RNFL, GCL, IPL, INL) and the (PR2-RPE) region in connection with local features extracted by a CNN (Sect. 5.2.3). Nonetheless, the covariance descriptor achieves comparable results for characterization of the outer plexiform layer (OPL) and exhibits increased detection accuracy for the photoreceptor region (PR1, PR2) and the outer nuclear layer (ONL). Table 1 includes the evaluation based on the Dice similarity, which is less sensitive to outliers and serves as an appropriate metric for calculating performance measures across large 3D volumes. To allow for a consistent and clear comparison between the involved features on which we rely to tackle the specific problem of retina layer segmentation, the corresponding results are visualized in Fig. 9. The graphic illustrates higher Dice similarity and relatively small standard deviation when incorporating the CNN features (Sect. 5.2.3) as input to our model, which indicates their superior informative content. According to the left plot, the covariance descriptor performs well for detecting the prototypical textures of the internal limiting membrane (ILM), the ONL and PR1 layers as well as the RPE boundary to the choroid section. In particular, this highlights the ability of gradient-based features to accurately detect retina tissues with sharp contrast between neighboring layers, as is the case for ONL and PR1.

In general, the more robust retina detection based on features extracted by a CNN can be attributed to the underlying manifold geometry of symmetric positive definite matrices, on which the data partition is performed linearly by hyperplanes. This further indicates the nonlinear structure of the acquired volumetric OCT data. Figure 10 presents typical labelings of B-scans at different locations in a segmented healthy OCT volume obtained with the proposed approach. Direct comparison with the ground truth, as depicted in row (b), demonstrates higher accuracy and smoother boundary transitions when using CNN features instead of covariance descriptors. In particular, for the challenging segmentation of the ganglion cell layer (GCL), with its typical thinning near the macular region (middle scan), we report a Dice index of \(0.8373 \pm 0.0263\) as opposed to \(0.6657 \pm 0.1909\). The remaining numerical experiments focus on the validation of the OAF against the retina segmentation methods summarized in Sect. 5.1.2, which serve as reference.

5.4.2 IOWA Reference Algorithm

To assess the segmentation performance of our proposed approach, we first compare with the state-of-the-art graph-based method for segmenting 10 intra-retinal layers developed by the Retinal Image Analysis Laboratory at the Iowa Institute for Biomedical Imaging (Kang et al. 2006; Abràmoff et al. 2010; Garvin et al. 2009), also referred to as the IOWA Reference Algorithm. We quantify the region agreement with manual segmentation regarded as gold standard. Since both the augmented volumes and the compared reference methods determine the boundary locations of retina layer intersections, we first transfer the retina surfaces to a layer mask by rounding to voxel precision and assigning to the voxels within each A-scan the associated layer label, starting from the observed boundary up to the location of the next detected intersection surface of two neighboring layers.

To enable a direct quantitative comparison with the IOWA reference algorithm, the tested OCT volumes were imported into OCTExplorer 3.8.0 and segmented using the predefined Macular-OCT IOWA software after properly adjusting the resolution parameters. Additionally, we preprocessed each volume by removing 2 B-scans from each side to remove boundary artifacts and performed segmentation on the resulting volumes of size \(498\times 768\times 59\) voxels. Quantitative results are summarized in Tables 1 and 2. Figure 11 provides a statistical illustration of the Dice index, which reveals the high accuracy of both methods, in accordance with the mean absolute error shown in the last row of Fig. 8. In particular, we observe a notable increase in performance when using the OAF for detection of the ganglion cell layer, with an overall Dice index of \(0.8546 \pm 0.0281\); see Fig. 12 for visualized segmentations of 3 B-scans.

Fig. 13

From top to bottom Ground truth for the augmented retina layer corresponding to Table 2. Segmentation results of the OAF based on manifold valued features and on CNN features, respectively. Segmentation results achieved by the probabilistic graphical model approach (Rathke et al. 2014)

Fig. 14

Row a: From left to right: 3D retinal surfaces determined using OAF (A) (left column) and OAF (B) (middle column). The last column depicts ground truth. Row b: From left to right: Segmentation of retinal tissues with the IOWA reference algorithm (left column) and with the proposed approach (middle column). Row c: Visual comparison of the probabilistic method (Rathke et al. 2014) (left column) and the OAF (B) (middle column). Our approach OAF (B) leads to accurate retina layer segmentation with smooth layer boundaries, as observed in the middle column

Fig. 15

Box plots of DICE similarity coefficients between computed segmentation results and manually labeled ground truth. Left Probabilistic approach (70) proposed in Rathke et al. (2014). Right OAF based on CNN features. See Table 1 for mean and standard deviations. Direct comparison shows a notably higher detection performance for segmenting the intraretinal layers using OAF (B)

5.4.3 Probabilistic Model

Next, we provide a visual and statistical comparison of the proposed approach with the probabilistic state-of-the-art retina segmentation approach (Rathke et al. 2014) underlying Eq. (70). As before, to achieve a direct comparison with the proposed approach, we first adapted the OCT volumes by cropping 134 voxels from the volume boundary along the \(N_A\) axis to match the shape and parameters of the trained model provided in Rathke et al. (2014), which supports the detection of retinal layer boundaries on data sets of dimension \(496 \times 500 \times 61\). Subsequently, we removed the boundaries between the regions GCL and IPL, ONL and PR1, PR2 and RPE to obtain three characteristic layers which have to be detected. Figure 13 displays the labeling accuracy. Both methods perform well by accurately segmenting flat-shaped retina tissues, as shown in the first and last columns. However, closer inspection of the second column reveals a more accurate detection of layer thickness for the (PR2+RPE) and (INL) regions below the concave curved fovea region when using OAF (B). This is mainly due to the connectivity constraints imposed on boundary detection in Rathke et al. (2014). However, the method in Rathke et al. (2014) is more accurate in dealing with rapidly decreasing layer thickness near the fovea region, as observed for the GCL and IPL layers in the middle column of Fig. 13 after visual comparison against the manual delineations (first row). In contrast to the Gaussian shape prior used in Rathke et al. (2014), the proposed method does not model connectivity constraints. This allows for the observed oversmoothing artifact, but also makes the OAF approach more amenable to extension to pathological volumes with vanishing retina boundaries. For example, in the case of vitreomacular traction or diabetic macular edema, imposing connectivity constraints aggravates the problem of dealing with irregular retina boundaries.

Figure 14 additionally provides a 3D view of the detected retina surfaces for each evaluated reference method used in this publication. The corresponding performance measures (Table 1) underpin the notably higher Dice similarity for the (PR2+RPE) and (INL) layers. The statistical plots for the mean absolute error and the Dice similarity index are given in Figs. 8 and 15, clearly showing the overall superiority of OAF (B) with respect to both Dice index and mean absolute error. In particular, following Table 2, small errors are observed for all segmented layers, except for the (ILM) boundary, which is detected by all methods with high accuracy. We point out that, in general, our method is not limited to a particular number of segmented layers, provided ground truth is available.

6 Discussion

We discuss additional aspects pertaining to the data used for training feature extractors as well as the locality of extracted features and limitations of the proposed approach.

6.1 Ground Truth Generation

The training and evaluation of supervised models for feature extraction requires a sizeable amount of high-quality labeled ground truth data. This presents a commonly encountered challenge in 3D OCT segmentation (Dufour et al. 2013; Kang et al. 2006), because the process of manually labeling every voxel of a 3D volume is extremely laborious. The desire to account for inter-observer variability in manual segmentations further compounds this problem. OCT volumes used for testing purposes in the present paper were initially segmented by an automatic procedure based on hand-crafted features. In a subsequent step, each B-scan segmentation was manually corrected by a medical practitioner. The automatic method used for initial segmentation only explicitly regularizes on each individual B-scan, leading to irregularity between consecutive B-scans (see Fig. 16).

Fig. 16

Left Initial automatic segmentation of an individual B-scan based on hand-crafted features. Right Section of the same automatically segmented volume orthogonal to the B-scans

Manual correction of the initial automatic segmentations leads to a noticeable reduction of irregularity but does not completely remove it. We therefore cannot rule out that a small bias towards the initial automatic segmentation based on hand-crafted features may still be present in the ground truth segmentations that we used to quantify the segmentation performance of the novel as well as the baseline methods in this paper. During feature extraction, deep learning models may be capable of discovering the specific hand-crafted features used for the initial automated segmentation, which may in turn lead to exploitation of any bias towards them. In contrast, because the reference methods are not trained on the same data, they cannot exploit any such bias, putting them at a possible disadvantage.

Figure 16 also highlights the fact that manual annotations as a gold standard still have nontrivial variance and are partly inconsistent between B-scans. In Rathke et al. (2014), the variance in manual annotations is further analyzed by comparing two different human observers. They found that for a similar dataset, the discrepancy between both human observers varies between \({1.37 \pm 0.51}\,\upmu \mathrm{m}\) for the most consistent layer boundary and \({7.57 \pm 1.06}\,\upmu \mathrm{m}\) for the least consistent one. Comparison to the results in Table 2 (1 pixel \(= 3.87\,\upmu \mathrm{m}\)) illustrates that the proposed model is close to the quality of manual annotation in terms of mean absolute error. It is to be noted that similar or even higher scores have been reported for deep learning methods such as Liu et al. (2019) which work on individual B-scans. In view of the inconsistency between manual B-scan segmentations displayed in Fig. 16, it is to be questioned to what extent further improvement of these scores truly reflects improved detection of retina layers if manual annotation is the most precise method available for reference. Part of the contribution of the present work is notably the introduction of a 3D segmentation framework (Definition 2) which serves to regularize by leveraging domain knowledge on top of arbitrary features. In particular, any deep network can be used as a drop-in replacement for the feature extraction methods discussed in Sect. 5.2.

6.2 Feature Locality

The ordered assignment flow segmentation approach can work with data from any metric space and is hence completely agnostic to the choice of preliminary feature extraction method. In this paper, we chose to limit the field of view of deep networks such that features with local discriminative information are extracted. This makes the empirical results directly comparable between features based on covariance descriptors and features extracted by these networks. In addition, we conjecture that local features may generalize better to unseen pathologies. Specifically, if a pathological change in retinal appearance pertains to the global shape of cell layers, local features are largely unaffected. In this way, we expect segmentation performance to be relatively consistent on real-world data. Conversely, widening the field of view in feature extraction should be accompanied by a well-considered training procedure, e.g. employing extensive data augmentation, in order to achieve similar generalization behavior. While raw OCT volume data has become relatively plentiful in clinical settings, large volume datasets with high-quality gold-standard segmentations are not widely available at the time of writing. Therefore, by representing a given OCT scan locally as opposed to incorporating global context at every stage, we hypothesize that superior generalization can be achieved in the face of limited data availability. Similarly, although based on local features, the method proposed by Rathke et al. (2014) combines local knowledge in accordance with a global shape prior. This makes clear why some layer scores achieved by this method are very competitive, but it also limits the method's ability to generalize to unseen data if large deviations from the expected global shape seen in training are present.

6.3 Limitations, Future Work

While the OAF typically achieves a strong improvement over trivial rounding or baseline regularization, it does not come with a guarantee that the physiological layer order will be attained. This is because we use the smooth function (66) instead of the indicator function \(\delta _{\mathbb {R}^c_+}\) to define \(E_{\text {ord}}\) in (65). The parameter \(\gamma \) consequently presents a tradeoff between adherence to the physiological layer order and the difficulty of numerical integration in the smooth assignment framework (Sect. 2.3). In Fig. 7 [row (c)], this tradeoff becomes apparent when segmenting based on the relatively weak covariance descriptor features. Choosing \(\gamma \) smaller leads to improved adherence to the physiological layer order in the computed segmentations. However, this also makes numerical integration of the flow (69) more difficult, such that the choice of the constant step-length \(h=0.1\) may lead to artifacts [row (c), right image]. In such cases, choosing an adaptive step-length for integration or using a higher-order numerical integration scheme should still yield stable algorithms at the cost of longer runtime.

We also note that at the fovea, uniformly weighted \(5\times 5\times 3\) averaging neighborhoods may lead to oversmoothing (see Fig. 13c middle image) which manifests in excessive thinning of e.g. GCL. To combat such artifacts, the choice of averaging weights (23) could be made adaptive to each local neighborhood. However, for most regions of the volume the constant choice of averaging weights made in our experiments does not lead to oversmoothing. Thus, weight adaptivity is to be targeted primarily around the fovea which has a distinctive shape. With regard to computational efficiency, another possible future direction is to encode the notion of layer ordering put forward in Definition 1 within the context of a linear dynamical system for data labeling (Zeilmann et al. 2020).

On the application side, modeling considerations similar to the ones underlying the flow (69) most likely also apply in other areas involving ordering constraints such as seismic horizon tracking for landscape analysis. We thus expect that much of the present work is also relevant outside of optical coherence tomography.

7 Conclusion

In this paper we presented a novel, fully automated and purely data-driven approach for retina segmentation in OCT volumes. Compared to the methods of Kang et al. (2006), Dufour et al. (2013) and Rathke et al. (2014), which have proven to be particularly effective for tissue classification with a priori known retina shape orientation, our ansatz merely relies on local features and yields ordered labelings which are directly enforced through the underlying geometry of the statistical manifold (16). To address the task of leveraging 3D texture information, we proposed two different feature extraction processes, by means of region covariance descriptors (80) and of the output obtained by training a CNN as described in Sect. 5.2.3, both of which are based on the interaction of local feature responses.

As opposed to other machine learning methods developed for segmenting the human retina from volumetric OCT data, the proposed method only takes the pairwise distances between voxel features and prototypes (1b) as input. As a direct consequence, our approach can be applied in connection with a broader range of features living in any metric space and additionally allows the incorporation of outputs of trained convolutional neural networks interpreted as image features; a particular instance of this type was demonstrated in Sect. 5.2.3. Even in view of the moderate results achieved by OAF (A) in connection with covariance descriptors, we observe the benefit of our automatic algorithm through its high level of regularization. Compared to the approach presented in Chiu et al. (2015), which employs a higher number of input features but still requires postprocessing steps to yield an order-preserving labeling, our approach performs these tasks simultaneously.

Using locally adapted features for handling volumetric OCT data sets from patients with observable pathological retina changes is particularly valuable to suppress wrong layer boundary predictions caused by prior assumptions on retinal layer thicknesses, as typically made by graphical model approaches like Dufour et al. (2013) and Song et al. (2013). Our method overcomes this limitation by avoiding any bias towards priors on the global retina shape and instead relying only on the natural biological layer ordering, which is accomplished by restricting the assignment manifold to probabilities that satisfy the ordering constraint presented in Sect. 4. The experimental results reported in Sect. 5, and the direct comparison to the state-of-the-art segmentation techniques (Garvin et al. n.d) and (Rathke et al. 2014) using common validation metrics, underpin the notable performance and robustness of the geometric segmentation approach introduced in Sect. 2, which we extended to order-preserving labeling in Sect. 4. Furthermore, the results indicate that the ordered assignment flow successfully tackles retinal tissue classification on 3D OCT data which is typically corrupted by speckle noise, with a performance comparable to manual graders, which makes it a method of choice for medical imaging applications and extensions thereof. We point out that our approach consequently differs from common deep learning methods which explicitly aim to incorporate global context into the feature extraction process. In particular, throughout the experiments we observed stronger regularization resulting in smoother transitions of layer boundaries along the B-scan acquisition axis, similar to the effect in Rathke et al. (2014), where, however, the smooth global Gaussian prior leads to limitations for pathological applications.

To reduce the reliance on manually segmented ground truth for extracting dictionaries of prototypes, our method can easily be extended to unsupervised scenarios in the context of Zisler et al. (2020). To deal with highly variable layer boundaries, another possible extension of our method is to predict the weights for geometric averaging (23) in an optimal-control-theoretic way based on the linearized dynamics of the assignment flow (Zeilmann et al. 2020), as elaborated in detail in Hühnerbein et al. (2021). Consequently, by building on the feasible concept of spatially regularized assignments (Schnörr 2020), the ordered flow (2) possesses the potential to be extended towards the detection of pathological retina changes and vascular structures. We expect that the joint interaction of retina tissues and blood vessels during segmentation with the assignment flow will lead to more effective layer detection, which is the objective of our current research.