1 Introduction

Image segmentation is a fundamental task of image processing, image analysis, image understanding, and pattern recognition. It has a long history, dating back about 50 years. A seminal paper is [1], where the authors pointed out that an important component of the Stanford Research Institute automation project was a set of programs providing the automaton with a means of interpreting visual data.

While it is possible to accurately represent the information in a real scene by an image, this representation alone does not enable us to highlight specific properties of the scene. Conversely, a description in terms of “natural” elements of the image, such as regions and boundaries of the visualized objects, represented in a uniform manner, provides easy access to useful global information, thus allowing recognition and extraction of specific image features. Thus, to generate a description of specific elements of the image, it is customary to segment the image into multiple parts (or segments). Figure 1 shows an example of the two main types of segmentation: instance segmentation, which identifies the object instance of each pixel for every known object within an image, and semantic segmentation, which identifies the object category of each pixel.

Image segmentation is used in many application fields, such as medical imaging [2], microscopy imaging [3], remote sensing [4], and document image analysis [5]. The choice between semantic and instance segmentation generally depends on the goal of the classification or object detection step that follows the segmentation phase. For example, in the segmentation of terrain in satellite imagery, we may use semantic segmentation to distinguish different land areas, such as vegetation, ground, water, and buildings, or we may use instance segmentation to distinguish different common weeds in agricultural fields (i.e., separate instances of objects belonging to the same class).

Since different applications may require different partitions to extract significant features, there is no single standard method for image segmentation. Thus, the segmentation problem does not have a unique solution, as shown in Fig. 2, where different segmentations of the same image are shown, resulting from different segmentation criteria. On the other hand, different methods are not equally effective in segmenting a specific type of image (real scenes, synthetic images, medical images, etc.), and the criteria defining a successful segmentation depend on the desired goal of the segmentation itself. Therefore, segmentation remains a challenging problem in image processing and computer vision, in spite of several decades of research.

Fig. 1 Illustration of instance and semantic segmentation of the Berkeley database image #323016. The results displayed in (a) and (b) were produced by using Adobe Photoshop

Fig. 2 Segmentations of the Berkeley database image #323016 by different users (see https://www2.eecs.berkeley.edu/Research/Projects/CS/vision/grouping/resources.html)

We present image segmentation as a highly ill-posed problem, and discuss basic models that take into account a-priori information about the solution, attempting to put these models into a coherent mathematical framework. We look at the inclusion of a-priori information as a sort of regularization approach and show that it is ubiquitous in image segmentation models, from older “classical” ones to machine learning approaches, revealing links and similarities between them. Note that we focus on basic models in order to keep the discussion simple and avoid technical details. We also sketch some numerical methods used in the application of the various models. Although this is only a simplified and partial view of image segmentation, we believe that it may contribute to a better understanding of this huge field.

The rest of this paper is organized as follows. In Sect. 2 we present a mathematical formulation of image segmentation, and in Sect. 3 we discuss basic segmentation models, focusing on edge-based, region-based, statistical and machine learning ones. In Sect. 4 we give a quick overview of numerical techniques that may be used to solve the aforementioned models. Finally, we give some conclusions in Sect. 5.

2 Mathematical formulation of image segmentation

Let \({\mathcal {I}}\) be the set of the images defined in a domain \(\Omega \subset {\mathbb {R}}^d\) (\(d \ge 2\)), \(I_0 \in {\mathcal {I}}\) the observed image, and \({\mathcal {P}}_1, \ldots , {\mathcal {P}}_n\) logical predicates used to check n statements, expressed using features of the image, e.g., edges, smoothness, texture or color, so that \({\mathcal {P}}_k(A)=true\) if all the points of \(A\subseteq \Omega \) satisfy the k-th statement. For example, in order to compute a two-region segmentation of a normalized gray-level image \(I_0\), we can define \(n=2\) simple statements, involving the gray level of the image intensity, to separate the background from the foreground:

$$\begin{aligned} {\mathcal {P}}_1(A)= \left\{ I^*(x) < \alpha , \forall x \in A \right\} , \quad {\mathcal {P}}_2(A)= \left\{ I^*(x) \ge \alpha , \forall x \in A \right\} \end{aligned}$$

where \(I^*\) is a suitable approximation of \(I_0\) and \(\alpha \in (0,1)\) is a suitable value.
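As a minimal illustration, the following Python sketch implements these two predicates for a gray-level image stored as a NumPy array; the 3×3 local mean playing the role of \(I^*\) and the value of \(\alpha \) are arbitrary illustrative choices, not prescribed by the definition above.

```python
import numpy as np

def two_region_segmentation(I0, alpha=0.5):
    """Label each pixel 1 where P_1 holds (I* < alpha) and 2 where P_2 holds."""
    # crude approximation I* of I0: 3x3 local mean with replicated borders
    Ip = np.pad(I0, 1, mode="edge")
    Istar = sum(Ip[1 + di:Ip.shape[0] - 1 + di, 1 + dj:Ip.shape[1] - 1 + dj]
                for di in (-1, 0, 1) for dj in (-1, 0, 1)) / 9.0
    return np.where(Istar < alpha, 1, 2)

I0 = np.random.rand(64, 64)                 # a normalized gray-level image
labels = two_region_segmentation(I0, 0.5)   # labels[i, j] in {1, 2}
```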

Generalizing the definition in [6], the instance segmentation S of \(I_0\) according to the predicates \({\mathcal {P}}_k\), \(k=1, \ldots , n\), consists of finding a decomposition of \(\Omega \) into m components \(\Omega _i\), with \(i = 1, \ldots , m\) and \(m \ge n\), such that

  1. \(\Omega _i \not = \emptyset , \; \forall \, i \in \{ 1, \ldots , m \}\);

  2. \(\overset{\circ }{\Omega _i} \bigcap \overset{\circ }{\Omega _j} = \emptyset , \; \forall \, i,j \in \{ 1, \ldots , m \}\) with \(i \not = j\), where \(\overset{\circ }{\Omega _k}\) denotes the interior of \(\Omega _k\);

  3. \(\bigcup \limits _{i=1}^{m} \Omega _i = \Omega \);

  4. \(\forall i \in \{ 1, \ldots , m \} \; \exists ! \; k \in \{ 1,\ldots ,n \} \;\) such that

     i. \({\mathcal {P}}_k(\Omega _i) =\) true;

     ii. \({\mathcal {P}}_k (\Omega _i \bigcup \Omega _j) =\) false, \(\forall \, j \in \{ 1, \ldots , m \}\) with \(j \not = i\).

By adding to item 4 the further condition

     iii. \({\mathcal {P}}_k(\Omega _j) =\) false, \(\forall \, j \in \{ 1, \ldots , m \}\) with \(j \not = i\),

we also obtain the semantic segmentation.

The segmentation S of \(I_0\) can also be defined as follows. Let \(\Sigma \) be the set of possible segmentations of the images in \({\mathcal {I}}\) according to some criteria defined by the predicates \({\mathcal {P}}_k\). Then S can be expressed as

$$\begin{aligned} S = (u^*,I^*), \end{aligned}$$

where \(u^*\) is a curve that matches the boundaries of the decomposition of \(\Omega \), i.e., \(u^*= \cup _i \partial \Omega _i\), and \(I^*\) is a piecewise-smooth function defined on \(\Omega \) that approximates \(I_0\). In particular, we may assume that the restriction of \(I^*\) to any set \(\overset{\circ }{\Omega _i}\) is piecewise differentiable. The segmentation S may also be identified directly by using a labeling operator \(\Phi \), i.e.,

$$\begin{aligned} S=\Phi (I^*), \end{aligned}$$
(1)

where

$$\begin{aligned} \Phi (I(x)) = l_i \text{ if } x \in \Omega _i, \end{aligned}$$

I(x) is the value of I associated with x, and \(l_i \in \, {\mathcal {N}} = \{ l_1, l_2, \ldots , l_m \}\) is a label.

3 Basic segmentation models

We look at image segmentation as an ill-posed problem, whose solution is highly undetermined. Classical approaches for computing a solution of an ill-posed problem require additional information that enforces uniqueness and stability. To this end, suitably defined penalty terms can be applied. Then, the solution is obtained by minimizing an energy functional E containing a fidelity term \({\mathcal {F}}\) that measures the consistency of the candidate segmentation with the observed image, and a penalty term \({\mathcal {P}}\) that promotes solutions with suitable properties:

$$\begin{aligned} (I^*,u^*) := \arg \, \underset{(I,u)}{\min } \, E(I, u; I_0) = \arg \, \underset{(I,u)}{\min } \left( {\mathcal {F}}(I,u;I_0) + \lambda {\mathcal {P}} (I,u) \right) . \end{aligned}$$
(2)

Here \(\lambda > 0\) is a parameter that generally needs careful tuning to suitably balance \({\mathcal {F}}\) and \({\mathcal {P}}\) (see, e.g., [7] and the references therein).

The minimization problem (2) can be solved by writing the Euler-Lagrange equations, which can be derived by integrating by parts the energy functional and using the Gauss theorem along with the fundamental lemma of the calculus of variations. Then a numerical solution can be computed by applying a gradient descent approach, where the descent direction is parameterized through an artificial time, and by a finite-difference discretization. A widely used and effective alternative consists in discretizing problem (2) and then solving it by a numerical optimization method. We will come back to these two approaches in Sect. 4.

Recently, machine learning techniques have been successfully applied to segmentation problems. The key idea is to tune a generic model to a specific solution through learning against sample data (training data). The learning phase extracts prior information to be embedded into the penalty term from a large dataset containing pairs of type (image, ground-truth label) [8]. Machine learning approaches using unlabeled image data as training datasets are also available. Although these techniques successfully solve image segmentation problems, sometimes outperforming state-of-the-art variational models, they have often been designed on demand for specific tasks, are used as “black-box” models, and require a large amount of data to produce results.

In the next subsections we provide some examples of image segmentation models. Note that we focus on basic models, with the aim of providing a general idea of these approaches while avoiding technical details that are outside the scope of this work. It is also worth observing that these models are the basis of modern ones, developed either to improve the effectiveness of the original models in some applications [9] or to complement and refine machine learning techniques for segmentation [10].

3.1 Edge-based models

Edge-based models aim at finding \(u^*= \cup _i \partial \Omega _i\) by solving the minimization problem (2) with respect to the curve u (note that I and \(I^*\) are not explicitly considered in this case). These models include the so-called Active Contours [11] or Snakes. Here the fidelity and regularization terms act as an internal force and an external force, respectively, which move the curve within the image to find the boundaries of the sets \(\Omega _i\). More precisely, the energy functional takes the form

$$\begin{aligned} E_{AC}(u) = \underbrace{\int _0^1 g(\vert \nabla I_0(u(s))\vert )^2 ds}_{{\mathcal {F}}} + \lambda \underbrace{\int _0^1 \vert u'(s) \vert ^2 ds}_{\mathcal {P}} , \end{aligned}$$
(3)

where \(I_0\) is the observed image, g is an edge-detector function and the curve u is parametrized by \(s \in [0,1]\). The first term attracts the curve toward the boundaries, whereas the second one controls its smoothness, and as a result the curve u changes its shape like a snake.

The evolving curve is driven by surface properties, such as curvature and normal direction, and by image features, such as gray levels, hue or saturation in color images, and intensity gradient in 2D images or change in slope in 3D ones. For example, the mean curvature can be used; in this case the edge-detector function is also responsible for stopping the curve at the edges. The function g may be defined as

$$\begin{aligned} g(\vert \nabla I_0 \vert ) = \frac{1}{1+\vert \nabla (G_\sigma * I_0)\vert ^2}, \end{aligned}$$

where g is a positive and decreasing function, \(G_\sigma \) is the Gaussian kernel with standard deviation \(\sigma \), and \(*\) denotes the convolution operator.
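As an illustration, this edge-detector can be transcribed directly into Python; the sketch below assumes SciPy's gaussian_filter for the convolution with \(G_\sigma \) and central differences for the gradient.

```python
import numpy as np
from scipy.ndimage import gaussian_filter

def edge_detector(I0, sigma=1.0):
    """g(|grad(G_sigma * I0)|) = 1 / (1 + |grad(G_sigma * I0)|^2)."""
    Is = gaussian_filter(I0, sigma)       # G_sigma * I0
    gy, gx = np.gradient(Is)              # central-difference gradient
    return 1.0 / (1.0 + gx**2 + gy**2)    # near 0 on edges, near 1 in flat regions
```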

In a Lagrangian approach, an initial curve is evolved by

$$\begin{aligned} \frac{\partial u}{\partial t}+ {\mathcal {L}}(u)=0, \end{aligned}$$
(4)

where \({\mathcal {L}}\) is a differential operator. The simplest evolution is given by \({\mathcal {L}}(u)= F N\), where N is the normal to the curve and F is a constant that determines the speed of evolution. More generally, the evolution is driven by an external force. For example, in the mean-curvature evolution, \({\mathcal {L}}(u) = \kappa N\), where \(\kappa \) is the Euclidean curvature of u [12].

When u has an explicit representation, it is not easy to deal with topological changes such as merging and splitting, and a re-parametrization of the curve may be required. Therefore, the evolution of the curve u is commonly described by level-set methods [13], thanks to their ability to handle topological changes, cusps and corners. In a level-set approach, the curve u is implicitly represented by the zero-level set of a function \(\phi (t,x)\), i.e., \(u = \{ x\in \Omega : \phi (t,x) = 0 \}\). The level-set formulations of the simplest evolution and the mean-curvature one read, respectively:

$$\begin{aligned} \frac{\partial \phi }{\partial t} = F \vert \nabla \phi \vert , \; F \in {\mathbb {R}} \;\;\; \text{ and } \;\;\; \frac{\partial \phi }{\partial t} = \text{ div } \left( \frac{\nabla \phi }{\vert \nabla \phi \vert } \right) \vert \nabla \phi \vert . \end{aligned}$$

3.2 Region-based models

Region-based models directly provide the segmentation by means of the image partition \(\{\Omega _i, \; i=1, \ldots , m \}\). Region-growing models are among the simplest models in this class; in order to obtain accurate segmentations, they have been combined with variational approaches in which the evolution is driven by the minimization of an energy functional including region-based terms [14].

A very popular region-based model was proposed by Mumford and Shah [15]. In this case, the functional E in (2) takes the form

$$\begin{aligned} E_{MS}(I,u) = \underbrace{\int _\Omega (I-I_0)^2 d x}_{\mathcal {F}} + \underbrace{\lambda \int _{\Omega - u} \vert \nabla I \vert ^2 d x + \mu \, len(u)}_{\mathcal {P}}, \end{aligned}$$
(5)

where len(u) denotes the length of u, and \(\lambda \) and \(\mu \) are positive parameters. The term \({\mathcal {F}}\) attempts to achieve the minimum distance between \(I_0\) and its piecewise-smooth approximation I, and \({\mathcal {P}}\) attempts to reduce the variation of I within each set \(\Omega _i\) while keeping the curve u as short as possible. Minimizing (5) in a suitable space provides an optimal pair \((I^*, u^*)\) representing a simplified description of \(I_0\) by means of a function with bounded variation and a set of edges [15]. Finally, in [16] the Mumford and Shah model is formulated as a deterministic refinement of a probabilistic model for image restoration.

A simplified version of the Mumford-Shah model is its restriction to piecewise-constant functions. The Chan-Vese model [17] is a particular case of the simplified version, aimed at obtaining a two-phase segmentation where the piecewise-constant function assumes only two values. Its functional E takes the following form:

$$\begin{aligned} \displaystyle E_{CV} (I, c_{in},c_{out})= & {} \displaystyle \!\! \underbrace{\left( \int _\Omega H(I) \left( c_{in} - I_0 \right) ^2dx + \int _\Omega (1-H(I)) \left( c_{out} - I_0 \right) ^2dx \right) }_{\mathcal {F}}\nonumber \\&+ \lambda \, \underbrace{\int _\Omega \vert \nabla H(I)\vert \, dx}_{{\mathcal {P}}} , \end{aligned}$$
(6)

where H is the Heaviside function and \(c_{in}\) and \(c_{out}\) are the average values of the intensity in the foreground and background of the image, respectively. The solution \(I^*\) is the best approximation to \(I_0\) among all the functions that take only two values.

Minimizing (6) is a nonconvex problem, thus solution methods may get stuck in local minima and produce unsatisfactory segmentations. To overcome this drawback, some strategies have been proposed, including the convexification of the functional by taking advantage of its geometric properties. An example is given by the two-phase partitioning model introduced by Chan, Esedoḡlu and Nikolova [18]:

$$\begin{aligned} E_{CEN}(I, c_{in}, c_{out})= & {} \underbrace{\int _{\Omega } \left( (c_{in} - I_0 )^2 I + (c_{out} - I_0 )^2 (1-I) \right) dx}_{{\mathcal {F}}} \nonumber \\&+ \lambda \, \underbrace{\int _{\Omega } \vert \nabla I \vert \, dx}_{{\mathcal {P}}} \end{aligned}$$
(7)

with \(0 \le I \le 1\) and \( c_{in},\, c_{out} > 0\).

3.3 Statistical models

Statistical models usually provide a conditional probability, \(P(S \vert I_0)\), of a segmentation \(S \in \Sigma \) given the observed image \(I_0\), and then select the segmentation with the highest probability. In the Maximum A Posteriori (MAP) approach the segmentation is given by

$$\begin{aligned} S^* = \arg \, \underset{S \in \Sigma }{\max } \, P(S \vert I_0). \end{aligned}$$

According to the Bayes rule,

$$\begin{aligned} P(S \vert I_0) = \frac{P(I_0 \vert S) P(S)}{P(I_0)}, \end{aligned}$$

where P(S) is the prior probability measuring how well S satisfies certain properties of the given image, and \(P(I_0 \vert S)\) is the conditional probability measuring the likelihood of \(I_0\) given S (see, e.g., [19]). Since the probability \(P(I_0)\) is constant, the segmentation can be obtained by maximizing \(P(I_0 \vert S) P(S)\).

Markov Random Field (MRF) models offer a framework to define the prior and the likelihood by capturing properties of the image such as texture, color, etc. [20]. The segmentation is formulated within an image labeling framework, i.e., \(S = \Phi (I(x))\), so the problem is reduced to finding the labeling that maximizes the posterior probability. Label dependencies are modeled by an MRF.

Then, using the Hammersley-Clifford theorem, we get the Gibbs distribution

$$\begin{aligned} P(S) = \frac{1}{Z}\exp (-U(S)), \end{aligned}$$

where the energy function U takes the form

$$\begin{aligned} U(S) = \sum _{c \in C} V_c(S_c), \end{aligned}$$

C is the set of cliques of S, \(V_c(S_c)\) is the potential of the clique \(c \in C\) having the label configuration \(S_c\), and Z is a normalizing constant.

When the nature of the observed image is unknown, the Gaussian distribution is often used to model the conditional probability \( P (I_0 \vert S)\). Setting

$$\begin{aligned} U_1(S)= -\ln P(I_0 \vert S) \; \text{ and } \; U_2(S) = -\ln P(S), \end{aligned}$$

we get \(U(S)=U_1(S)+U_2(S)\), and then the original MAP estimation is equivalent to the following problem:

$$\begin{aligned} S^* = \arg \, \underset{S}{\max } \, P(S \vert I_0) = \arg \, \underset{S}{\min } \, U(S). \end{aligned}$$
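As an illustration, the following sketch minimizes U(S) for a Gaussian likelihood with class means mu and a Potts prior (clique potentials penalizing different labels on neighboring pixels), using a simple synchronous variant of Iterated Conditional Modes; the names and parameter values (mu, sigma, beta) are placeholders, not prescribed by the models above.

```python
import numpy as np

def map_potts_segmentation(I0, mu, sigma=0.1, beta=1.0, n_iter=10):
    """Greedy minimization of U(S) = U_1(S) + U_2(S) over labels 0..K-1."""
    K = len(mu)
    # initialization from the likelihood term only
    S = np.argmin([(I0 - m) ** 2 for m in mu], axis=0)
    for _ in range(n_iter):
        Sp = np.pad(S, 1, mode="edge")
        nbrs = [Sp[:-2, 1:-1], Sp[2:, 1:-1], Sp[1:-1, :-2], Sp[1:-1, 2:]]
        energies = []
        for l in range(K):
            data = (I0 - mu[l]) ** 2 / (2.0 * sigma**2)   # U_1: -log P(I0|S)
            prior = beta * sum(nb != l for nb in nbrs)    # U_2: Potts clique potentials
            energies.append(data + prior)
        S = np.argmin(energies, axis=0)                   # pixelwise best label
    return S
```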

3.4 Machine learning models

Machine learning approaches, and in particular deep learning ones, are increasingly used to solve image segmentation problems, sometimes outperforming the previous approaches. Roughly speaking, machine learning approaches do not benefit from prior information on the solution as described above, but “learn” the segmentation from large training datasets. The aim of a machine learning approach is to define a segmentation model \(f_\theta :{\mathcal {I}} \longrightarrow \Sigma \) such that the segmentation of \(I_0\) can be obtained as \(I^*= f_\theta (I_0)\). The function \(f_\theta \) is usually nonlinear and \(\theta \) is a large vector of parameters. The learning phase selects \(\theta \) in order to minimize a loss functional \({\mathcal {L}}\) that measures the accuracy of the predicted segmentation \(f_\theta (I_0)\).

In supervised machine learning, training data are available from databases of annotated segmentations, which provide a large number of pairs \((I_0,I^*) \in X \times Y \subset {\mathcal {I}} \times \Sigma \) (\(X \times Y\) is called the training set). The vector of parameters \(\theta \) is obtained by minimizing a loss function plus a penalty term. For the sake of simplicity, we first consider a mean-square-error loss:

$$\begin{aligned} \theta ^* = \arg \, \underset{\theta }{\min } \,{\mathcal {L}}(X,Y,\theta )= \arg \, \underset{\theta }{\min } \, \left( \sum _X \parallel f_\theta (I_0)- I^* \parallel ^2 + {\mathcal {P}}_\theta (f_\theta (I_0)) \right) . \end{aligned}$$
(8)

Another widely used loss function is the Binary Cross Entropy (BCE) loss, which measures the difference in information content between the actual and the predicted image segmentation:

$$\begin{aligned} {\mathcal {L}}_{BCE}(f_\theta (I_0),I^*) = -I^*\log (f_\theta (I_0)) - (1-I^*)\log (1-f_\theta (I_0)) . \end{aligned}$$

It is based on the Bernoulli distribution and works well with equal data distributions among classes. Some variants of BCE, such as the Weighted BCE and the Balanced CE, are also used for tuning false negatives and false positives, respectively. The Shape-aware (Sa) loss calculates the average point-to-curve Euclidean distance between points on the curve of the predicted segmentation, \(u^*\), and the ground truth, \({\bar{u}}\), and uses it as a coefficient of the cross-entropy (CE) loss function:

$$\begin{aligned} {\mathcal {L}}_{Sa}(f_\theta (I_0),I^*) = - \sum _{i \in \Sigma } CE(f_\theta (I_0),I^*)- \sum _{i \in \Sigma } i\, E_i \, CE(f_\theta (I_0),I^*), \end{aligned}$$

where \(\Sigma \) contains the set of points where the prediction curve does not match the ground-truth curve, and \(E_i=d(u_i^*, {\bar{u}}_i)\). The Dice loss, based on the well-known Dice coefficient metric, is also widely used to measure the similarity between two segmentations, and is defined as

$$\begin{aligned} {\mathcal {L}}_{Dice}(f_\theta (I_0), I^*) = 1 - \frac{2f_\theta (I_0)I^*+1}{f_\theta (I_0)+I^*+1}. \end{aligned}$$
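For binary masks, the Dice loss may be sketched as follows, interpreting the products and sums pixelwise over the whole mask, as is customary; the +1 terms are the smoothing constants of the formula above, and the names pred and target are illustrative.

```python
import numpy as np

def dice_loss(pred, target):
    """1 - (2 <pred, target> + 1) / (sum(pred) + sum(target) + 1)."""
    intersection = (pred * target).sum()
    return 1.0 - (2.0 * intersection + 1.0) / (pred.sum() + target.sum() + 1.0)
```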

In unsupervised machine learning, the training set is not equipped with annotated segmentations and the goal is to train \(f_\theta \) to recognize specific patterns or image features in the data. This approach is sometimes referred to as self-supervised learning [21], because the information is extracted from the data themselves rather than from a set of “predictions” (i.e., given segmentations). Then the fidelity term in (8) takes the form

$$\begin{aligned} \sum _{\mathcal {I}} \parallel f_\theta (I_0)- \Phi (f_\theta (I_0))\parallel ^2, \end{aligned}$$

where \(\Phi \) is the labeling operator defined in (1).

In order to progressively extract higher-level features from the data, machine learning models use a multi-layer structure called neural network, consisting of successive function compositions. The number of layers is the depth of the model, hence the terminology deep learning. A neural network with L layers is a function

$$\begin{aligned} f_\theta : {\mathcal {I}} \times (H_1 \times \cdots \times H_L) \longrightarrow \Sigma , \quad f_\theta (I)=(f_L \circ f_{L-1} \circ \ldots \circ f_1)(I), \end{aligned}$$

where \(f_i : {\mathbb {R}}^{d_{i-1}} \times H_i\longrightarrow {\mathbb {R}}^{d_{i}}\) are the activation functions (each depending on a component \(\theta _i\) of \(\theta \)), \(d_0 = d\) and \(d_L = n\), with n equal to the number of features. The adjective “neural” comes from the fact that those networks are loosely inspired by neuroscience.
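As a minimal sketch of this composition, the following code evaluates an L-layer network with affine layers followed by ReLU activations; the dimensions and random weights are placeholders chosen for illustration only.

```python
import numpy as np

rng = np.random.default_rng(0)
dims = [784, 128, 64, 10]                       # d_0, d_1, ..., d_L
layers = [(0.01 * rng.standard_normal((m, n)), np.zeros(m))
          for n, m in zip(dims[:-1], dims[1:])]

def f_theta(x):
    """(f_L o ... o f_1)(x), with f_i(x) = ReLU(W_i x + b_i)."""
    for W, b in layers:
        x = np.maximum(W @ x + b, 0.0)
    return x

y = f_theta(rng.standard_normal(dims[0]))       # a vector of d_L features
```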

Fig. 3 Some neural network architectures used in image segmentation

Neural network structures successfully used in image segmentation are the Multilayer Perceptron (MLP), the Deep Auto-Encoder (DAE) and the Convolutional Neural Network (CNN) [22,23,24]. Their basic schemes are shown in Fig. 3. The MLP is a neural network connecting multiple layers in a directed graph, which means that the signal path through the nodes only goes one way. Each node, apart from the input nodes, has a nonlinear activation function. An MLP uses backpropagation as a supervised learning technique. The DAE network structure typically consists of 2L layer functions, where the first L layers act as an encoding function with the input to each layer being of lower dimension than the input to the previous layer, and the remaining L layers increase the size of their inputs until the final layer has the same dimension as the image input. The first L layers are an MLP. CNNs divide the image into small areas and scan it one area at a time, to identify and extract features that are used to classify the image. A CNN mainly consists of three layers:

  • convolutional layer: the image is analyzed a few pixels at a time to extract low-level features (edges, color, gradient orientation, etc.);

  • nonlinear layer: an element-wise activation function creates a feature map with probabilities that each feature belongs to the required class;

  • pooling or downsampling layer: the amount of features and computations in the network is reduced, hence controlling overfitting (a minimal sketch of max pooling is given after this list).
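As an illustration, 2×2 max pooling, a common downsampling choice, can be sketched as follows (assuming, for simplicity, even image dimensions):

```python
import numpy as np

def max_pool_2x2(I):
    """Downsample I by taking the maximum over non-overlapping 2x2 windows."""
    n, m = I.shape            # assumed even here, for simplicity
    return I.reshape(n // 2, 2, m // 2, 2).max(axis=(1, 3))
```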

Among well-known deep neural network architectures successfully used in image segmentation, we mention SegNet [25], U-Net [26], and FCN [27].

4 Numerical techniques for segmentation models

The minimization in (2) is usually nontrivial and requires appropriate methods, taking into account the specific application. In this section we provide a brief summary of numerical methods that can be applied to segmentation models. We consider two approaches: first discretize then optimize and first optimize then discretize. In the former, all the quantities in (2) are discretized a priori and then optimization methods are applied to the resulting minimization problem in \({\mathbb {R}}^n\). In the latter, we first write optimality conditions for (2), which are generally partial differential equations (PDEs), and then solve those equations by suitable numerical methods, which discretize the equations. Finally, we also sketch some filtering techniques used in image segmentation, although they are not directly applied to the minimization problem (2). This is motivated by their use in some segmentation approaches, such as those based on deep learning.

For the sake of simplicity, here we consider \(S = I\) (i.e., we neglect u in the segmentation \(S = (I, u)\)). For 2D images (\(d=2\)) we denote by \(\Omega _{n_x,n_y}\) the discretization of \(\Omega \) consisting of a grid of \( n_x \times n_y\) pixels,

$$\begin{aligned} \Omega _{n_x,n_y} = \{ (i,j) : \, i=1,\ldots ,n_x, \; j=1,\ldots ,n_y\}. \end{aligned}$$

We also identify each pixel with its center and denote by \(S_{i,j}\) the value of S at \((i,j)\). Finally, we consider the forward and backward difference operators defined as follows:

$$\begin{aligned} \displaystyle D^{+x} I_{i,j}= & {} I_{i+1,j}-I_{i,j} , \quad \displaystyle D^{-x} I_{i,j} = I_{i,j}-I_{i-1,j} , \\ \displaystyle D^{+y} I_{i,j}= & {} I_{i,j+1}-I_{i,j} , \quad \displaystyle D^{-y} I_{i,j} = I_{i,j}-I_{i,j-1} , \end{aligned}$$

where we assume

$$\begin{aligned} \displaystyle I_{i-1,j}= & {} I_{i,j} \;\; \text{ for } i=1, \quad \displaystyle I_{i,j-1} = I_{i,j} \;\; \text{ for } j=1, \\ \displaystyle I_{i+1,j}= & {} I_{i,j} \;\; \text{ for } i=n_x, \quad \displaystyle I_{i,j+1} = I_{i,j} \;\; \text{ for } j=n_y, \end{aligned}$$

i.e., we define by replication the values of I with indices outside \(\Omega _{n_x,n_y}\). Likewise, for 3D images the discretization of the image domain consists of a grid of \(n_x \times n_y \times n_z\) voxels,

$$\begin{aligned} \Omega _{n_x,n_y,n_z} = \{ (i,j,k) : \, i=1,\ldots ,n_x, \; j=1,\ldots ,n_y, \; k=1,\ldots ,n_z\}, \end{aligned}$$

and the forward and backward difference operators are defined as follows:

$$\begin{aligned} \displaystyle D^{+x} I_{i,j,k}= & {} I_{i+1,j,k}-I_{i,j,k} , \quad \displaystyle D^{-x} I_{i,j,k} = I_{i,j,k}-I_{i-1,j,k},\\ \displaystyle D^{+y} I_{i,j,k}= & {} I_{i,j+1,k}-I_{i,j,k} , \quad \displaystyle D^{-y} I_{i,j,k} = I_{i,j,k}-I_{i,j-1,k},\\ \displaystyle D^{+z} I_{i,j,k}= & {} I_{i,j,k+1}-I_{i,j,k}, \quad \displaystyle D^{-z} I_{i,j,k} = I_{i,j,k}-I_{i,j,k-1}. \end{aligned}$$

For simplicity, henceforth we consider \(d=2\).
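In Python, the 2D difference operators with the replication convention above may be sketched as follows, with the first array axis playing the role of the x index i and the second that of the y index j:

```python
import numpy as np

def diff_ops_2d(I):
    """Forward/backward differences with replicated values outside the grid."""
    Ip = np.pad(I, 1, mode="edge")           # replication of border values
    Dpx = Ip[2:, 1:-1] - Ip[1:-1, 1:-1]      # D^{+x} I_{i,j} = I_{i+1,j} - I_{i,j}
    Dmx = Ip[1:-1, 1:-1] - Ip[:-2, 1:-1]     # D^{-x} I_{i,j} = I_{i,j} - I_{i-1,j}
    Dpy = Ip[1:-1, 2:] - Ip[1:-1, 1:-1]      # D^{+y} I_{i,j} = I_{i,j+1} - I_{i,j}
    Dmy = Ip[1:-1, 1:-1] - Ip[1:-1, :-2]     # D^{-y} I_{i,j} = I_{i,j} - I_{i,j-1}
    return Dpx, Dmx, Dpy, Dmy
```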

4.1 First discretize then optimize

Numerical optimization offers a large variety of methods to compute the segmentation by solving the minimization problem coming from a discretization of (2), possibly subject to constraints that can drive the segmentation towards particular features. The choice of the optimization method depends on the properties of the objective function and/or the constraints.

Roughly speaking, at iteration k, optimization methods for nonlinear problems generate a function \({\widetilde{E}}(I;I_k)\) that approximates the discretized objective function E around \(I_k\), and minimize it to obtain the next iterate (see, e.g., [28]). For example, given \(I_k\), the \((k+1)\)-st iteration may be written as

$$\begin{aligned} \begin{array}{l} \text{ Define } {\widetilde{E}}(I;I_k) \text{ that } \text{ approximates } E(I; I_0) \\ {\widetilde{I}}_{k+1} = \arg \underset{I}{\min } \; {\widetilde{E}}(I;I_k) \\ I_{k+1} = I_k + \alpha _k ({\widetilde{I}}_{k+1}-I_k) \end{array} \end{aligned}$$

where the step length \(\alpha _k\) satisfies some criterion.
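As a concrete instance of this scheme, if \({\widetilde{E}}(I;I_k) = E(I_k) + \langle \nabla E(I_k), I - I_k \rangle + \frac{1}{2t}\Vert I - I_k\Vert ^2\), its minimizer \({\widetilde{I}}_{k+1}\) is a gradient step, and \(\alpha _k\) may be chosen by Armijo backtracking. A sketch follows, where E and gradE are user-supplied callables (assumptions introduced here, not defined in the text).

```python
import numpy as np

def gradient_scheme(E, gradE, I0, t=1.0, c=1e-4, max_iter=200, tol=1e-8):
    """Gradient method with Armijo backtracking on the step length alpha_k."""
    I = I0.copy()
    for _ in range(max_iter):
        g = gradE(I)
        I_tilde = I - t * g                   # minimizer of the quadratic model
        d = I_tilde - I
        alpha = 1.0
        while E(I + alpha * d) > E(I) + c * alpha * np.vdot(g, d):
            alpha *= 0.5                      # sufficient-decrease (Armijo) test
        I = I + alpha * d                     # I_{k+1} = I_k + alpha_k (I_tilde - I_k)
        if np.linalg.norm(alpha * d) < tol:
            break
    return I
```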

“Classical” optimization techniques, such as gradient or Newton-type methods, require regularity assumptions on the objective function (and the constraints, if any). However, many segmentation models lead to non-smooth optimization problems. There are two main approaches to deal with non-differentiability: smoothing and non-smoothing [29]. The former reformulates the problem as a suitable smooth one and applies the aforementioned classical optimization methods. The latter does not modify the mathematical model, and thus uses methods not requiring smoothness. For the purpose of illustration, here we focus on (7), where non-smoothness comes from a discretization of the TV (total variation) term.

A regularized discrete TV may be obtained as follows:

$$\begin{aligned} \int _\Omega \vert \nabla I \vert \, dx \approx \sum _{i,j} \sqrt{(D^{+x} I_{i,j})^2+(D^{+y} I_{i,j})^2+\epsilon }, \end{aligned}$$

where \(\epsilon > 0\) is “suitably small”, but other regularized versions may be considered, e.g., based on Huber-like functions [30]. In this case, gradient and higher-order methods [31,32,33,34,35] can be used efficiently. Another way of introducing smoothness consists in splitting the variables into their positive and negative parts (thus doubling the number of unknowns) and introducing new constraints, and then applying first- or higher-order methods for smooth problems, such as in [36,37,38].
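A direct implementation of this smoothed TV reads as follows; the forward differences on the last row and column vanish because of the replication convention of Sect. 4.

```python
import numpy as np

def smoothed_tv(I, eps=1e-6):
    """sum_{i,j} sqrt((D^{+x} I)^2 + (D^{+y} I)^2 + eps)."""
    Dpx = np.diff(I, axis=0, append=I[-1:, :])   # forward difference, zero on last row
    Dpy = np.diff(I, axis=1, append=I[:, -1:])   # forward difference, zero on last column
    return np.sqrt(Dpx**2 + Dpy**2 + eps).sum()
```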

Non-smoothing approaches avoid regularization of the non-smooth terms in the optimization problem. This is the case, for example, of methods based on forward-backward splitting techniques, such as proximal-gradient methods [39, 40], and the forward-backward Expectation Maximization (EM) method in [41]. ADMM and split Bregman methods do not use smooth approximations either [7, 42,43,44,45,46]. The success of these approaches also relies on the availability, in closed (and cheap) form, of the proximal operator of the \(\ell _1\) norm, given by the well-known soft-thresholding operator, defined as

$$\begin{aligned}{}[{{\mathcal {S}}}(x,\gamma )]_{i,j}= \mathrm {sign}(x_{i,j})\cdot \max \left( \vert x_{i,j}\vert -\gamma , 0\right) , \end{aligned}$$

with \(\gamma > 0\). The difficulties associated with the non-differentiability of the TV functional may be also overcome by reformulating the minimization problem as a saddle-point problem and solving it by a primal-dual algorithm such as the Chambolle-Pock one [47, 48].
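In code, the soft-thresholding operator is essentially a one-liner:

```python
import numpy as np

def soft_threshold(x, gamma):
    """Entrywise proximal operator of gamma * ||.||_1."""
    return np.sign(x) * np.maximum(np.abs(x) - gamma, 0.0)
```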

EM algorithms [49] are also widely used to solve statistical models. They are based on the idea of splitting the (negative) log-likelihood into two terms and alternating between the computation of the expectation and its minimization.

Finally, stochastic versions of the previous methods are used in segmentation with deep learning, to limit the computational cost. The idea is to use only random samples of the data at each iteration to estimate first-order, and possibly second-order, information about the loss function, with the aim of significantly reducing the computation and hence the time [50, 51].

4.2 First optimize then discretize

The reduction of imaging problems to PDEs dates back many years, motivated by the availability of a large number of methods and software packages for solving PDEs. PDE-based methods have been introduced in different ways, such as the Perona-Malik filtering [52], directly based on properties of the PDE [53], and the axiomatic scale-space theory [54, 55].

In a variational approach, one derives the first-order optimality conditions, applying smoothing regularization if needed. Let us consider, for example, the level-set formulation of the Chan-Vese model (6), where I is represented by a function \(\phi \) such that \(\phi (x) = 0\) provides the curve separating two regions of I (when \(I=I^*\) the two regions identify the segmentation). Keeping \(c_{in}\) and \(c_{out}\) fixed and writing the Euler-Lagrange equations in a gradient-flow approach, we get

$$\begin{aligned} \begin{array}{llll} \displaystyle \frac{\partial \phi }{\partial t} (t,x) &{} = &{} \displaystyle \delta _\varepsilon (\phi ) \left( \lambda \, \text{ div } \left( \frac{\nabla \phi }{\vert \nabla \phi \vert } \right) - (c_{in} - I_0)^2 + (c_{out} - I_0)^2 \right) &{} \text{ in } (0, +\infty ) \times \Omega , \\ \phi (0,x) &{} = &{} \phi _0(x) &{} \text{ in } \Omega , \\ \displaystyle \frac{\delta _\varepsilon (\phi )}{\vert \nabla \phi \vert }\frac{\partial \phi }{\partial N} &{} = &{} 0 &{} \text{ on } \partial \Omega , \end{array} \end{aligned}$$
(9)

where \(\delta _\varepsilon \) is a regularized version of the Dirac measure, \(\phi _0\) is the initial-level function, and N is the exterior normal to the boundary \(\partial \Omega \) [17].

Finite-difference schemes are popular methods for the numerical solution of (9). Of course, the discretization used in image segmentation must take into account the nature and the properties of the operators involved in the model. For example, edge preserving is similar to shock capturing in computational fluid dynamics, and hence finite-difference schemes based on hyperbolic conservation laws can be used [56]. For instance, the level-set equation

$$\begin{aligned} \frac{\partial \phi }{\partial t} = F \vert \nabla \phi \vert \end{aligned}$$

in Sect. 3.1 can be solved by using an upwind numerical scheme:

$$\begin{aligned} \phi ^{n+1}_{i,j} = \phi ^n_{i,j} - \Delta t \left( \max (F,0) \, \nabla ^+ \phi ^n_{i,j} + \min (F,0) \, \nabla ^- \phi ^n_{i,j} \right) , \end{aligned}$$

where

$$\begin{aligned} \displaystyle \nabla ^+ \phi ^n_{i,j}= & {} \displaystyle \left( \max \left( \max (D^{-x} \phi ^n_{i,j}, \, 0)^2, \, \min (D^{+x} \phi ^n_{i,j}, \, 0)^2 \right) \right. \\&+ \displaystyle \left. \max \left( \max (D^{-y} \phi ^n_{i,j}, \, 0)^2, \, \min (D^{+y} \phi ^n_{i,j}, \, 0)^2 \right) \right) ^{1/2} ,\\ \displaystyle \nabla ^- \phi ^n_{i,j}= & {} \displaystyle \left( \max \left( \max (D^{+x} \phi ^n_{i,j}, \, 0)^2, \, \min (D^{-x} \phi ^n_{i,j}, \, 0)^2 \right) \right. \\&+ \displaystyle \left. \max \left( \max (D^{+y} \phi ^n_{i,j}, \, 0)^2, \, \min (D^{-y} \phi ^n_{i,j}, \, 0)^2 \right) \right) ^{1/2} . \end{aligned}$$
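A sketch of one time step of this scheme for constant speed F follows, with the border values replicated as in the difference operators of Sect. 4; for stability, \(\Delta t\) must satisfy a CFL-type condition, whose precise form is not discussed here.

```python
import numpy as np

def upwind_step(phi, F, dt):
    """One explicit step of phi_t = F |grad(phi)| with the upwind scheme above."""
    p = np.pad(phi, 1, mode="edge")
    Dpx = p[2:, 1:-1] - p[1:-1, 1:-1]        # D^{+x}
    Dmx = p[1:-1, 1:-1] - p[:-2, 1:-1]       # D^{-x}
    Dpy = p[1:-1, 2:] - p[1:-1, 1:-1]        # D^{+y}
    Dmy = p[1:-1, 1:-1] - p[1:-1, :-2]       # D^{-y}
    grad_p = np.sqrt(np.maximum(np.maximum(Dmx, 0)**2, np.minimum(Dpx, 0)**2) +
                     np.maximum(np.maximum(Dmy, 0)**2, np.minimum(Dpy, 0)**2))
    grad_m = np.sqrt(np.maximum(np.maximum(Dpx, 0)**2, np.minimum(Dmx, 0)**2) +
                     np.maximum(np.maximum(Dpy, 0)**2, np.minimum(Dmy, 0)**2))
    return phi - dt * (max(F, 0.0) * grad_p + min(F, 0.0) * grad_m)
```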

4.3 Filters

Discrete filters are often used in image segmentation, e.g., in machine learning approaches. A digital filter can be represented as an operator

$$\begin{aligned} L: I \in {\mathcal {I}} \longrightarrow {\widetilde{I}} \in {\mathcal {I}}, \quad {\widetilde{I}}_{i,j} = L [I; W_{i,j}], \end{aligned}$$

where \(W_{i,j} \subset \Omega _{n_x,n_y} \). A popular discrete filter in image segmentation is the convolution filter, defined by

$$\begin{aligned} {\widetilde{I}}_{i,j} = L_{a,b}[I; W_{i,j}] = \sum _{s=-a}^a \sum _{t=-b}^b h_{s,t} I_{i-s,j-t} , \end{aligned}$$
(10)

with a and b positive integers such that \(a \le \frac{n_x-1}{2}\) and \(b \le \frac{n_y-1}{2}\), \(W_{i,j} = \{ (s,t) : s = -a, \ldots , a, \; t = -b, \ldots , b \} \), and \(h_{s,t} \in {\mathbb {R}}\). The matrix \(H = (H_{i,j}) = (h_{-a+i,-b+j}) \in {\mathbb {R}}^{(2a+1)\times (2b+1)}\) is called the kernel matrix and depends on the features we want to extract from the image. Common choices of a and b are \(a=b=1\) and \(a=b=2\), corresponding to \(3\times 3\) and \(5\times 5\) kernels.
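A direct implementation of (10) may be sketched as follows; border pixels are replicated here, which is one of the padding choices discussed later in this section.

```python
import numpy as np

def conv_filter(I, H):
    """Apply the convolution filter (10) with a (2a+1) x (2b+1) kernel H."""
    a, b = H.shape[0] // 2, H.shape[1] // 2
    Ip = np.pad(I, ((a, a), (b, b)), mode="edge")
    out = np.zeros(I.shape, dtype=float)
    for s in range(-a, a + 1):
        for t in range(-b, b + 1):
            # accumulate h_{s,t} * I_{i-s, j-t} (H stores h with shifted indices)
            out += H[s + a, t + b] * Ip[a - s:a - s + I.shape[0],
                                        b - t:b - t + I.shape[1]]
    return out
```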

Edge-detection kernels are frequently used in image segmentation, especially in CNNs. For example, the first layer of a CNN is often responsible for capturing low-level features such as edges, color, and gradient orientation. In general, the choice of H determines the type of features to be extracted. The kernel matrix

$$\begin{aligned} H= \left( \begin{array}{lll} 1 &{} \;\, 0 &{}-1 \\ 1 &{} \;\, 0 &{}-1\\ 1 &{} \;\, 0 &{}-1 \end{array} \right) \end{aligned}$$

is a vertical edge-detection kernel [57]. Another example is the Sobel operator, used to create an image emphasizing the edges [58]. It allows us to obtain either the gradient amplitude or the gradient direction of the image intensity at each point, by convolving the image with the kernel matrices

$$\begin{aligned} H^x_S= \left( \begin{array}{rrr} 1 &{} \;\, 0 &{} -1 \\ 2 &{} \;\, 0 &{} -2 \\ 1 &{} \;\, 0 &{} -1 \end{array} \right) , \quad H^y_S= \left( \begin{array}{rrr} 1 &{} 2 &{} 1 \\ 0 &{} 0 &{} 0 \\ -1 &{}-2 &{}-1 \end{array} \right) . \end{aligned}$$

The gradient magnitude, G, and the angle of orientation of the edges, \(\theta \), are given by

$$\begin{aligned} |G_{i,j} |= \sqrt{(H^x_S*I)_{i,j}^2+(H^y_S*I)_{i,j}^2}, \quad \theta _{i,j}=\arctan ((H^y_S*I)_{i,j}/(H^x_S*I)_{i,j}). \end{aligned}$$
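Using the convolution sketch above, or scipy.ndimage.convolve as below, the Sobel magnitude and orientation can be computed as follows; np.arctan2 is used instead of arctan to handle a vanishing horizontal response.

```python
import numpy as np
from scipy.ndimage import convolve

HxS = np.array([[1., 0., -1.], [2., 0., -2.], [1., 0., -1.]])   # H_S^x
HyS = np.array([[1., 2., 1.], [0., 0., 0.], [-1., -2., -1.]])   # H_S^y

def sobel_edges(I):
    Gx, Gy = convolve(I, HxS), convolve(I, HyS)
    G = np.sqrt(Gx**2 + Gy**2)        # gradient magnitude |G|
    theta = np.arctan2(Gy, Gx)        # angle of orientation of the edges
    return G, theta
```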

A padding process is commonly used to preserve the dimension of the image after the convolution. It usually consists in the replication or reflection of the pixel values at the image border, or in adding an average gray or even zeros symmetrically around the border of the image. A pooling layer is usually inserted between two successive convolution layers, which is obtained by applying basic functions, such as max and mean, in a small window.

5 Conclusion

We presented a view of image segmentation, focusing on simple computational models and attempting to put them into a coherent framework where the inclusion of a-priori information about the solution is obtained by using penalty terms. We first introduced image segmentation and then outlined basic edge-based, region-based, statistical and machine learning models. We also sketched some numerical methods that can be employed to compute solutions to the models. We believe that our view of models and methods for image segmentation, although far from being exhaustive, can help readers understand more modern and sophisticated segmentation techniques, as well as select computational tools for their problems.