1 Introduction

Image segmentation is one of the most fundamental problems in computer vision. Deformable models [34] are a successful class of segmentation algorithms based on the idea of a deforming shape that adapts to the desired image region. The fundamental form of the deformable model-based segmentation method was proposed by Kass et al. [25] as an active contour model (ACM), also known as a “snake”. The snake model is a parametric curve with an evolution process controlled by a set of external and internal energies. External energies attract the shape to the desired image area and move it towards the boundaries of the segmented region, while the internal forces control the contour smoothness. This formulation copes well with problems like image noise and boundary irregularities, and it is easy to extend with new types of image features and constraints. These advantages made the classical ACM very influential and widely extended, e.g. with the introduction of expansion forces [10], edge-based vector field energies [58], and region-based image energies [44].

The original ACM also had some drawbacks, like difficulties in topological adaptability (requiring additional mechanisms [30]) and sensitivity to initialisation. A major improvement came with the incorporation of the level set framework, proposed by Osher and Sethian [37, 47]. Instead of the explicitly defined parametric contour, the level set approach proposed an evolving surface, where the contour is implicitly represented as a set of surface points that have their height equal to zero (the zero level set). During the surface evolution, the zero level set can naturally split or merge and can disappear or appear in locations not constrained by the contour's initial form. Notable level set-based models include Geometric [6] and Geodesic Active Contours (GAC) [7], as well as the region-based Active Contours Without Edges (ACWE) [9].

Traditionally, the image energies of popular ACMs relied either on edge information [6, 7, 25, 58] or on intensity statistics of the segmented regions [9, 44], making the models best suited for regions with distinctive borders and fairly uniform intensity. These approaches, however, are often not sufficient for segmentation of non-uniform textured regions with high contrast and/or directional patterns. Although the discrimination of texture patterns is easy for the human eye, in computer vision it is still hard to find a universal texture descriptor. Due to the lack of a unified texture model, the segmentation methods rely on various feature extraction methods [40], like Grey Level Co-occurrence Matrices (GLCM) [20, 49], Gabor filters [24], structure tensors [4, 5] or histogram analysis [28, 36].

Texture segmentation using active contour models has been a popular topic of research in recent years [1, 14, 17, 51, 52]. In particular, the variational ACWE-like models and their modified multi-channel versions for vector-valued images [8] have often been employed [1, 19, 32, 57]. Incorporating texture features into an ACM-based method usually encounters significant difficulties. First of all, the method must settle on one of many possible feature extraction approaches, and typically only a single approach (for example, Gabor filters) is adopted. This strongly restricts the robustness and generality of such methods. Furthermore, even with this simplification (i.e. the selection of a single texture feature extraction approach), the algorithm should be able to deal with multiple feature channels. As level set-based methods can be computationally intensive, this can have a significant impact on performance. As a result, the multiple channels are often combined or aggregated to avoid overcomplicating the model. New solutions are therefore required that can integrate various texture characteristics and process many features efficiently.

In this paper, we present a novel level set-based method for image segmentation that addresses the aforementioned problems of integrating texture analysis into the active contour model. The method is not tied to any particular texture descriptors and is inherently able to consider multiple feature types; arguably, the more approaches are used, the better the results can be. Here, the proposed solution is demonstrated with three texture feature extraction approaches, namely GLCM, Gabor and structure tensor features. The segmentation algorithm is divided into three main stages. It starts with the calculation of a large set of features and the selection of the most adequate descriptors for the considered (manually initialised) region of interest. This is followed by a fast level set stage that produces a rough segmentation. Finally, an ACWE-based stage fine-tunes the contour and provides the final result. The implementation of the first two stages makes heavy use of Graphics Processing Unit (GPU) acceleration and exhibits reasonable performance even on mid-range graphics cards. Even if the building blocks of the proposed algorithm are known, their joint use is undoubtedly original and open to further extensions.

The main objective of this work is a new fast segmentation method that is able to use a relatively large bank of different features and perform the costly level set evolution while maintaining computation time suitable for interactive usage. The method combines both classical and region-based level set schemes and is open to extension with additional descriptors.

The concept of a multi-stage approach was initially outlined in [42] in the context of a classical discrete parametric contour (snake). Apart from the similarities in the feature selection stage, the snake-based model employed a vastly different evolution scheme that considered only the closest neighbourhood of the contour points. In contrast, the proposed approach uses both local and global level set-based ACMs that are more robust in handling multiple features and impose few topological restrictions.

The rest of the paper is organised as follows. Section 2 gives a brief review of the up-to-date related work on texture segmentation. It also recalls the utilised feature extraction and level set methods. Section 3 describes the proposed method and discusses implementation concerns. Section 4 presents an experimental evaluation of the method on both synthetic and natural images, as well as a comparison with other segmentation methods. Conclusions and possible future works are given in the last section.

2 Background

2.1 Related work on texture segmentation

Active contour-based methods have been widely adapted to the task of texture segmentation. While parametric contours are still used [33, 52], the level set-based models are gaining more and more attention. Generally, texture-based methods can be divided into supervised and unsupervised. The supervised methods often take advantage of information from an initial texture analysis stage. Notably, Paragios and Deriche [38] proposed an extension of the GAC model that performs an off-line feature analysis. The method requires a set of texture patterns present in the image. The analysis results are then integrated in the region and boundary image forces. Pujol and Radeva [39] employed the GLCM features in their ACM. Linear Discriminant Analysis was used to reduce the feature space, which was then applied to create a likelihood map that guided the contour deformation.

In the case of the unsupervised approaches, no prior analysis or classification is performed. Shen et al. [48] proposed a method for creating a likelihood deformation map from the intensity statistics and edge information of the initial region. This solution was integrated with 2D and 3D models. Huang et al. [23] presented a similar approach, where the intensity characteristics were used for textures with small textons, and a bank of Gabor filters was utilised in the case of textures with larger patterns. Awate et al. [3] proposed a fast multi-phase level set method based on threshold dynamics [15] that introduces a general nonparametric statistical image neighbourhood model instead of specific texture features. Rousson et al. [45] presented a method based on a variational framework, where the texture features are extracted with a structure tensor and a nonlinear diffusion scheme. Sagiv et al. [46] created a texture edge detector from the Gabor feature space based on the Beltrami framework [50] and integrated it with the GAC and ACWE models. Houhou et al. [22] also used the Beltrami framework to propose a descriptor that captures both edge and textural properties. In this case, the ACM was based on the Kullback–Leibler distance.

More recently, Wu et al. [57] presented a method based on ACWE, where GLCM and Gabor features are fused together to create the final feature space. Min et al. [32] integrated the image intensity and a novel adaptive texture descriptor into the ACWE-based model. Mewada et al. [31] combined the ACWE approach with a modified version of the linear structure tensor. The ACWE model was also applied for texture segmentation by Wang et al. [55], where the contour was driven by a local histogram. Tatu and Bansal [53] proposed a GAC-based model that uses intensity covariance matrices for the texture features. Gao et al. introduced a factorisation-based ACM that utilises the local spectral histogram as the texture feature [16], and, more recently, a model that performs a fusion of intensity and Gabor-based features along with a factorisation scheme [17]. Dong et al. [14] also employed a factorisation-based ACM, combined with neutrosophic sets, in the task of colour texture segmentation. Dahl and Dahl [11] created a method based on ACWE and probabilistic image patch dictionaries. Ahmad et al. [1] proposed a fuzzy version of a variational level set model that uses the coefficient of variation as a region statistic. The model was proven to be effective on textured and inhomogeneous images [59]. Wang et al. [56] also tackled this problem by introducing an inhomogeneity entropy descriptor that drives the energy of a level set-based ACM.

2.2 Texture feature extraction

Grey Level Co-occurrence Matrix

The GLCM is a square matrix of order equal to the number of pixel grey levels. For a spatial window w on an image I, the GLCM contains the probabilities of pair-wise occurrence of pixels with two grey levels. Each (i,j) matrix element contains the number of occurrences in which a pixel with intensity i is adjacent to a pixel with intensity j. The adjacency of the pixels is defined by the distance between them in a given orientation. This matrix, after normalisation, is then used to calculate a set of texture feature descriptors, like Entropy, Correlation, Homogeneity, Contrast or Energy [20]. The feature generation process usually considers a selected set of features and generates a feature space with combinations of parameters: window size, pixel distance and orientation. To capture the properties of various patterns in different scales and orientations, the high-dimensional space can contain a large number of features, therefore a feature selection [39] or fusion [57] is often performed.
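To make the construction concrete, the following minimal NumPy sketch computes a normalised GLCM for a single window and derives four of the listed descriptors. The quantisation level count and the displacement are illustrative defaults, not the exact settings of the methods discussed here.

```python
import numpy as np

def glcm_features(window, levels=16, d=(0, 1)):
    """Normalised GLCM for one window plus a few Haralick-style
    descriptors. `levels` quantises the grey values (which also keeps
    the matrix small) and `d` is the (row, col) displacement."""
    q = (window.astype(np.float64) / 256.0 * levels).astype(int)
    q = q.clip(0, levels - 1)
    glcm = np.zeros((levels, levels))
    dr, dc = d
    rows, cols = q.shape
    # Count co-occurrences of grey-level pairs at the given displacement.
    for r in range(max(0, -dr), min(rows, rows - dr)):
        for c in range(max(0, -dc), min(cols, cols - dc)):
            glcm[q[r, c], q[r + dr, c + dc]] += 1
    glcm /= max(glcm.sum(), 1)  # normalise counts to probabilities
    i, j = np.indices((levels, levels))
    nz = glcm[glcm > 0]
    return {
        "contrast":    np.sum(glcm * (i - j) ** 2),
        "homogeneity": np.sum(glcm / (1.0 + (i - j) ** 2)),
        "energy":      np.sum(glcm ** 2),
        "entropy":     -np.sum(nz * np.log(nz)),
    }
```

Sliding this computation over the window of every pixel yields one feature map per descriptor and parameter combination, which is exactly what makes the resulting feature space large.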

Gabor features

A two-dimensional Gabor filter can be defined as a Gaussian kernel function modulated by a sinusoidal plane wave. In this form, the response of the filter can be obtained by its convolution with the input image. Gabor filters were designed to model the behaviour of mammalian visual cortex cells [12] and have the ability to localise the information in both the spatial and frequency domains. These properties make them useful in the image feature extraction process, particularly in the texture analysis field [24, 40]. In practice, the feature extraction is often performed by convolving the image with a set (bank) of Gabor filters with different orientations, scales and wavelengths.
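As an illustration, a small filter bank can be built with scikit-image's gabor function, as sketched below; the four orientations match those used later in Section 3.1, while the wavelength values are assumed for the example.

```python
import numpy as np
from skimage.filters import gabor

def gabor_feature_maps(image, thetas=(0, 45, 90, 135), wavelengths=(4, 8, 16)):
    """Convolve `image` with a bank of Gabor filters and return the
    response magnitude map of each filter."""
    maps = []
    for theta in thetas:
        for lam in wavelengths:
            real, imag = gabor(image, frequency=1.0 / lam,
                               theta=np.deg2rad(theta))
            maps.append(np.hypot(real, imag))  # complex response magnitude
    return maps
```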

Structure tensor

For a scalar two-dimensional image I with a pixel coordinate p = (x,y), the structure tensor [4, 5] is defined as a matrix derived from the gradient of the image:

$$ S(p) = \begin{pmatrix} g(p) * I_{x}^{2}(p) & g(p) * I_{x}(p) I_{y}(p) \\ g(p) * I_{x}(p) I_{y}(p) & g(p) * I_{y}^{2}(p) \end{pmatrix} $$
(1)

where g(p) is a Gaussian kernel with standard deviation σ and Ix and Iy are the partial derivatives of I in a window w centred at p. The size of the window w is crucial for capturing image features of the desired scale. As shown in (1), the tensor gives three features per pixel. Moreover, the resulting matrix S(p) contains information about the direction and strength of the edges in the neighbourhood of the point p. The two eigenvectors of the matrix are aligned orthogonal and parallel to the local edge (intensity change) in the window w, while the corresponding eigenvalues indicate the strength of the edge in these two directions.
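A compact NumPy/SciPy sketch of (1) is given below; it returns per-pixel maps of both eigenvalues, with Gaussian smoothing standing in for g(p) and an illustrative default scale.

```python
import numpy as np
from scipy.ndimage import gaussian_filter

def structure_tensor_eigenvalues(image, sigma=2.0):
    """Eigenvalue maps of the structure tensor from (1). `sigma`
    controls the Gaussian window g; the value is an example only."""
    Iy, Ix = np.gradient(image.astype(np.float64))  # derivatives along y, x
    Jxx = gaussian_filter(Ix * Ix, sigma)  # g * Ix^2
    Jxy = gaussian_filter(Ix * Iy, sigma)  # g * Ix Iy
    Jyy = gaussian_filter(Iy * Iy, sigma)  # g * Iy^2
    # Closed-form eigenvalues of a symmetric 2x2 matrix, per pixel.
    tr = Jxx + Jyy
    disc = np.sqrt((Jxx - Jyy) ** 2 + 4.0 * Jxy ** 2)
    return 0.5 * (tr + disc), 0.5 * (tr - disc)  # largest, smallest
```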

2.3 Level sets and the ACWE model

An active contour, according to the level set idea [37], can be defined as a curve C given by the set of zero-height points p = (x,y) of a surface ϕ(p,t). This formulation gives C = {p : ϕ(p,t) = 0}, where ϕ(p,t) : ℝ² → ℝ and t is the evolution time step. The curve evolves in its normal direction according to the differential equation:

$$ \frac{\partial \phi}{\partial t} = |\nabla \phi| F, \quad \phi(p,0)=\phi_{0}(p), $$
(2)

where ϕ0 is the initial contour and F is a speed function F(p,t), which allows the contour to expand or contract in order to encompass the segmented region. While the originally proposed level set-based methods mostly relied on edge information to drive the contour [6, 7], Chan and Vese proposed a global, region-based ACWE model [9] inspired by the Mumford–Shah formulation of the segmentation problem [35]. Given an image I, defined in the domain Ω, a piecewise constant segmentation of I into foreground and background can be obtained with the curve C, evolved by minimising the ACWE energy functional defined as:

$$ \begin{aligned} F(c^{+},c^{-},C) ={} & \mu \cdot Length(C) \\ & + \lambda_{1} {\int}_{inside(C)} \! |I(p)-c^{+}|^{2}dxdy \\ & + \lambda_{2} {\int}_{outside(C)} \!|I(p)-c^{-}|^{2}dxdy, \end{aligned} $$
(3)

where c+ and c− represent the average pixel intensities of I inside and outside of the curve C, respectively, I(p) is the image intensity at the point p, Length(C) is the total length of the curve, and μ ≥ 0 and λ1, λ2 > 0 are constant weights. This functional aims to minimise the intensity variance inside and outside the contour.

This minimisation problem can then be redefined using the level set approach. The curve is represented as the zero level set C = {p ∈ Ω | ϕ(p) = 0}, where ϕ(p) > 0 inside the curve and ϕ(p) < 0 outside. With this curve formulation, the energy functional takes the form:

$$ \begin{aligned} F(c^{+},c^{-},\phi) ={} & \mu \cdot {\int}_{\Omega} \delta (\phi(p))|\nabla \phi(p)|dxdy \\ & + \lambda_{1} {\int}_{\Omega} \! |I(p)-c^{+}|^{2} H(\phi(p)) dxdy \\ & + \lambda_{2} {\int}_{\Omega} \!|I(p)-c^{-}|^{2} (1-H(\phi(p))) dxdy, \end{aligned} $$
(4)

where H(z) and δ(z) are the 1D Heaviside and Dirac delta functions (regularised versions of which are used in practice), defined as:

$$ H(z)=\begin{cases} 1 & \text{if $z \geq 0$},\\ 0 & \text{if $z < 0$}, \end{cases} \quad \quad \delta(z) = \frac{d}{dz} H(z). $$
(5)

To finally solve the problem, the Euler–Lagrange equation can be obtained by minimising the energy with respect to ϕ and parameterising the descent direction by the evolution time t while keeping c+ and c− constant:

$$ \frac{\partial \phi}{\partial t} = \delta(\phi) \left[\mu \ \text{div} \left( \frac{\nabla \phi}{|\nabla \phi|} \right) - \lambda_{1}(I-c^{+})^{2} + \lambda_{2}(I-c^{-})^{2} \right], $$
(6)

where div denotes the divergence operator. After each iteration, c+ and c− are updated according to:

$$ \begin{aligned} c^{+} &= \frac{{\int}_{\Omega} I(p)H(\phi(p)) dxdy}{{\int}_{\Omega} H(\phi(p)) dxdy},\\ c^{-} &= \frac{{\int}_{\Omega} I(p)(1-H(\phi(p))) dxdy}{{\int}_{\Omega} (1-H(\phi(p))) dxdy}. \end{aligned} $$
(7)
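For reference, one explicit iteration of (6) and (7) on a scalar image can be sketched in NumPy as follows, using the common arctan-based regularisation of H and δ; all parameter defaults are illustrative, not the values used later in the paper.

```python
import numpy as np

def acwe_step(phi, img, dt=0.5, mu=0.2, lam1=1.0, lam2=1.0, eps=1.0):
    """One gradient-descent step of the ACWE evolution (6)-(7)."""
    # Regularised Heaviside and its derivative (Dirac delta).
    H = 0.5 * (1.0 + (2.0 / np.pi) * np.arctan(phi / eps))
    delta = (eps / np.pi) / (eps ** 2 + phi ** 2)
    c_in = (img * H).sum() / max(H.sum(), 1e-9)                # c+ from (7)
    c_out = (img * (1 - H)).sum() / max((1 - H).sum(), 1e-9)   # c- from (7)
    # Curvature div(grad phi / |grad phi|) via central differences.
    gy, gx = np.gradient(phi)
    norm = np.sqrt(gx ** 2 + gy ** 2) + 1e-9
    ky, _ = np.gradient(gy / norm)
    _, kx = np.gradient(gx / norm)
    curv = kx + ky
    force = mu * curv - lam1 * (img - c_in) ** 2 + lam2 * (img - c_out) ** 2
    return phi + dt * delta * force                            # update (6)
```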

The initial ACWE model formulation was extended to handle vector-valued images [8]. Instead of two average intensity values (inside and outside the curve), the method uses vectors of the average intensities for each image channel. In this form, the image defined on Ω consists of multiple channels Ii, where i = 1,…,N. The energy functional \(F_v(\overline{c^{+}},\overline{c^{-}},C)\) is then defined as:

$$ \begin{aligned} F_v(\overline{c^{+}},\overline{c^{-}},C) ={} & \mu \cdot Length(C) \\ & + {\int}_{inside(C)} \! \frac{1}{N} \sum\limits_{i=1}^{N} \lambda_{i}^{+} |I_{i}(p)-c_{i}^{+}|^{2}dxdy \\ & + {\int}_{outside(C)} \! \frac{1}{N} \sum\limits_{i=1}^{N} \lambda_{i}^{-} |I_{i}(p)-c_{i}^{-}|^{2}dxdy, \end{aligned} $$
(8)

where \(\overline{c^{+}}=(c_{1}^{+},\ldots,c_{N}^{+})\) and \(\overline{c^{-}}=(c_{1}^{-},\ldots,c_{N}^{-})\) are two vectors with the average values of the image channels inside \((\overline{c^{+}})\) and outside \((\overline{c^{-}})\) of the curve, and \(\lambda_{i}^{+}\) and \(\lambda_{i}^{-}\) are now separate weights for each channel.

The advantage of this model lies in its ability to separate image regions even without a sharp edge between them. This is visible for some texture features (see Fig. 1), where separation is possible but the edges are blurred. The vector-valued version can also detect region dissimilarities that are present only in some of the channels.

Fig. 1 Example texture feature maps generated by different methods: a source image, b GLCM Homogeneity, c Gabor filter magnitude, d first structure tensor eigenvalue

3 Method overview

The proposed segmentation method is based on a two-dimensional level set representation of an active contour. After a manual initialisation of the contour, the algorithm performs three main stages: initial texture feature extraction and selection, followed by two curve evolution stages (see Fig. 2). The first stage generates a set of texture feature maps that are used to guide the further evolution of the contour. The next stage is based on a traditional level set propagating-front formulation; its propagation speed function uses a strict textural stopping term that tracks sudden changes in the selected feature maps. Finally, the last stage switches to a variational level set framework based on the vector-valued ACWE method. The goal of the second stage is to give a general outline of the segmented region, while the third stage is designed to provide a refined final form of the contour. The method also makes heavy use of GPU acceleration in most of the performance-critical computations.

Fig. 2 Overview of the proposed method

3.1 Texture feature extraction and selection

The method starts with a texture feature extraction and selection stage, similar to the method described in [42]. The stage results in a set of texture feature maps Mbest, where a single feature map m is a scalar image that contains the feature values for each pixel of the original image. Any feature extraction method that is able to generate such a map can potentially be applied.

The extraction starts with a list of all possible feature maps that can be generated with the available algorithms. Instead of calculating whole maps for the entire image, the process generates the Minit set of maps only in the bounding box of the initial contour. For best results, this initialisation should capture a representative sample of the pattern. The features are then pre-selected to the Mbest set defined as:

$$ M_{best} = \{ m \in M_{init} : \%RSD(m) < r \}, $$
(9)

where r is a user-provided threshold (equal to 65% by default) and %RSD(m) is the relative standard deviation of feature map m inside the initial contour. In this way, this step picks the maps with the most uniform feature values inside the contour, according to the specified threshold. The criterion assumes that a feature's uniformity identifies it as a suitable descriptor of the initial pattern; furthermore, a significant change in such a feature during the contour evolution can signal a change in the texture. Next, the feature maps in Mbest are re-calculated for the entire image and are ready for the following stages.
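A minimal sketch of the rule in (9) follows; `maps_init` is assumed to be a list of 2D feature maps and `mask` a boolean image marking the initial contour interior (both names are illustrative).

```python
import numpy as np

def preselect_features(maps_init, mask, r=65.0):
    """Keep the maps whose relative standard deviation inside the
    initial contour is below the threshold r (in percent), as in (9)."""
    best = []
    for m in maps_init:
        inside = m[mask]
        rsd = 100.0 * inside.std() / max(abs(inside.mean()), 1e-9)  # %RSD
        if rsd < r:
            best.append(m)
    return best
```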

The current implementation uses GLCM, Gabor and structure tensor features. As for the GLCM, five features are calculated: Entropy, Correlation, Homogeneity, Contrast and Energy. A feature space is generated using combinations of specific parameters: window size, displacement and orientation. The extraction process starts with a fixed set of GLCM properties and, if necessary, performs additional computations. The window size can be increased automatically (from 3×3 through 5×5 and 7×7 upward) in the case of larger patterns, and redundant channels for different angles can be omitted or merged in the case of isotropic texture patterns. Please refer to [42] for more details on these operations. Moreover, a bank of 24 Gabor filters with four orientations (0°, 45°, 90° and 135°), three wavelengths and two Gaussian envelope scales is considered. The generation process also employs an angle detection scheme, similar to that used for the GLCM features. Finally, the method constructs a set of structure tensor features. This set consists of maps of both eigenvalues of tensors generated from combinations of σ and w, for a total of 18 feature maps.

3.2 Fast level set evolution

In the second stage, the initialised contour is deformed using the level set method. This stage is based on the traditional level set formulation presented in (2), where the speed function F at the point p combines image and curvature terms [27] as:

$$ F(p,t) = \alpha D(p) + (1-\alpha)C(p), $$
(10)

where D(p) is the image data term that drives the deformation, C(p) = div(∇ϕ(p)/|∇ϕ(p)|) is the curvature and α ∈ [0,1] is a user-defined balancing parameter. Instead of the original intensity-based image term, we apply a texture-based term Dtex(p) [41] that takes into consideration the previously selected texture feature set. First, for a given image point p, a set of features Sp is defined as:

$$ S_{p} = \left\{ m \in M_{best} : |m(p) - \bar{x}(m)| > \theta \cdot \sigma(m) \right\}, $$
(11)

where m is a texture feature map in the selected set Mbest, \(\bar{x}(m)\) and σ(m) are the mean and standard deviation of the feature map inside the initial contour, m(p) is the value of the map m at the point p, and 𝜃 is a user-defined parameter that controls the sensitivity of the term. In this way, the set Sp contains the features whose values at the point p are sufficiently dissimilar to their mean values inside the initial contour. Next, a function over(p) is defined to return, for each point of the image I, the number of features that satisfy the dissimilarity condition in (11):

$$ over(p) = |S_{p}|, \ p \in I. $$
(12)

Finally, the texture data term is defined as:

$$ D_{tex}(p)=\begin{cases} v & over(p) = 0 \\ -v & \text{otherwise,} \end{cases} $$
(13)

where v is a predefined constant. In this way, the curve is encouraged to expand into points where the texture features have values similar to the interior of the initial contour, but even one feature dissimilar enough will result in a strong penalty (see Fig. 3).
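The term can be assembled as in the following sketch, which implements (11)-(13) for a list of selected feature maps; the defaults correspond to the values reported below (𝜃 = 4, v = 20).

```python
import numpy as np

def texture_data_term(maps_best, mask, theta=4.0, v=20.0):
    """over(p) from (11)-(12) and the data term D_tex from (13).
    `mask` marks the interior of the initial contour."""
    over = np.zeros(maps_best[0].shape, dtype=int)
    for m in maps_best:
        mean, std = m[mask].mean(), m[mask].std()
        over += np.abs(m - mean) > theta * std   # does m belong to S_p?
    d_tex = np.where(over == 0, v, -v)           # expansion vs. penalty
    return over, d_tex
```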

Fig. 3 The second stage of the method illustrated. The over(p) function value at a given point, visualised in greyscale, is equal to the number of features that do not satisfy the similarity criterion

The second stage of the algorithm is expected to give a rough outline of the segmented region. An exact segmentation can be hard to achieve due to the characteristics of some feature extraction methods [5, 20], which can produce distinctive but blurry region boundaries. Figure 4c shows a result of this stage, where the contour covers most of the target region, but still contains some internal irregularities. Its edges are close to, but not exactly on, the desired boundaries of the region. This behaviour is a deliberate choice: we assume that it is better to stop before passing the boundary than to cause oversegmentation (see Fig. 4e). To encourage this behaviour, the default parameter values were experimentally set to v = 20, α = 0.05 and 𝜃 = 4, where 𝜃 is the main user-adjustable sensitivity of the method.

The level set equation is solved with a GPU-accelerated implementation of the traditional numerical scheme [37]. The simulation is performed until the convergence of ϕ or a specified maximum number of iterations is reached.
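A CPU reference for that numerical scheme can be sketched as a first-order upwind update of (2); this is only an illustration of the update rule (with periodic boundaries via np.roll), not the paper's OpenCL kernel.

```python
import numpy as np

def level_set_step(phi, F, dt=0.4):
    """One upwind update of phi_t = |grad phi| F on a unit grid."""
    dxm = phi - np.roll(phi, 1, axis=1)    # backward difference in x
    dxp = np.roll(phi, -1, axis=1) - phi   # forward difference in x
    dym = phi - np.roll(phi, 1, axis=0)    # backward difference in y
    dyp = np.roll(phi, -1, axis=0) - phi   # forward difference in y
    # Upwind gradient magnitudes for expansion (F > 0) and contraction.
    grad_plus = np.sqrt(np.maximum(dxm, 0) ** 2 + np.minimum(dxp, 0) ** 2 +
                        np.maximum(dym, 0) ** 2 + np.minimum(dyp, 0) ** 2)
    grad_minus = np.sqrt(np.minimum(dxm, 0) ** 2 + np.maximum(dxp, 0) ** 2 +
                         np.minimum(dym, 0) ** 2 + np.maximum(dyp, 0) ** 2)
    return phi + dt * (np.maximum(F, 0) * grad_plus +
                       np.minimum(F, 0) * grad_minus)
```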

Fig. 4 Multiple stages of the segmentation process: a initialisation, b image with pixel intensity corresponding to the over(p) function value, c second stage result, d final result, e oversegmentation example

3.3 Final contour refinement

The third stage is designed to provide a final form of the contour. At this point, the method switches from a classical level set algorithm to the region-based vector-valued ACWE model, which works on the previously selected texture feature maps. Furthermore, the information about the texture dissimilarities from (11) influences the length term of the energy functional (8). Originally, the μ parameter, along with the length term μ·Length(C), is used for scaling/regularising the contour shape. Smaller values of μ permit the contour to detect smaller objects in the image, at the cost of the contour smoothness.

Conventionally, the μ length weight is kept constant for the entire evolution process [9]. In the proposed approach, μ is scaled according to the previously obtained information. After the second stage, the over(p) function provides information about the features that caused the contour to stop at each point. The contour length term is therefore redefined to include a normalised sum of the texture features that did not meet the similarity condition (11) at a given point. The new term μstop(p) is defined as:

$$ \begin{aligned} \mu_{stop}(p) = \frac{over(p)}{ \max_{q \in I} over(q) } \cdot 100 \end{aligned} $$
(14)

This term gives a different response at each image point. Its main idea is to discourage the contour from expanding into areas with a large number of sufficiently different features. The penalty is scaled according to the strength of the feature response function over(p): larger values of over(p) result in a greater penalty, while lower values let the contour grow slowly, but still noticeably. An example of the benefit of this length control term is shown in Fig. 5. The μstop(p) term is visualised as a “stop map” (Fig. 5a) for each image point and shows a strong discrimination between the central texture patch and the target region. Without the influence of this term the method clearly fails (Fig. 5b). The provided ϕ heat maps and the plot along the marked line show that the improved length term helps to give a clear boundary between the regions.
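A direct transcription of (14) is straightforward; in the refinement stage this map takes the place of the constant μ in the curvature/length term, as described above.

```python
import numpy as np

def mu_stop(over):
    """Spatially varying length weight from (14), built from over(p)."""
    return 100.0 * over / max(over.max(), 1)
```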

Fig. 5 Influence of the length term control: a visualisation of the μstop(p) function, b result with no length term control, c ϕ heat map for the oversegmented result, d acceptable segmentation result with the corresponding ϕ heat map e, and f plot of the ϕ values of both cases along the marked arrow

3.4 Implementation concerns

The most computationally intensive parts of the method, like GLCM or Gabor feature generation and the second stage level set algorithm, were implemented using GPU acceleration. In the optimal case, a GPU-running function (kernel) should be designed to work independently on its own part of the data and have relatively small local memory requirements. In the case of the GLCM features, a local co-occurrence matrix has to be calculated in a window around each of the image pixels. To mitigate a potential memory problem, quantisation is applied to reduce the matrix size. As for the Gabor features, the implementation uses a simple convolution of the original image with a given filter, which scales easily on a GPU. The second stage level set method is also well suited for the GPU, as each position of the level set function can be recalculated independently. The third, ACWE-based stage is still CPU-based, but it usually requires only 5 to 15 iterations before the stopping criterion is reached.

The whole algorithm is implemented using the MESA system [43] – a platform for the design and evaluation of deformable model-based segmentation methods. While the platform uses the Java language, the GPU-accelerated algorithms were written in C using OpenCL.

4 Experimental evaluation

The proposed method was tested on synthetic images created using the Brodatz texture dataset, as well as on a set of natural images. The initial contours were manually placed inside the desired regions and scaled to the preferred size. Unless specified otherwise, only the sensitivity parameter 𝜃 was manually adjusted (between 4 and 12), while the other parameters were left at their default values. The first stage of the algorithm usually selected 20 to 30 feature maps out of the 60 to 90 available.

The experiments were performed on a workstation with an Intel Xeon E5-1620v2 CPU, 16 GB RAM, and an Nvidia Titan Xp GPU. The total segmentation time was between 3 and 6 seconds for the analysed synthetic and natural image examples. As the implementation is heavily GPU-bound, other graphics cards were also tested.

The segmentation quality was assessed with five error measures: overlap error (OE), area difference (AD), average symmetric surface distance (ASSD), Hausdorff distance (HD), and Dice Coefficient (DC) [21]. Given two sets of image pixels S and G, where S is the tested result and G is the ground truth segmentation, the measures are defined as:

$$ \begin{aligned} OE&= 100 \ (1-|G \cap S| / |G \cup S| ),\\ AD&= 100 \ ((|S|-|G|)/|G|), \\ DC&= 2 \ |G \cap S| / (|G|+|S|). \end{aligned} $$
(15)

DC and OE are popular overlap error measures. In contrast to these two, the AD measure does not take the actual overlap of the sets into account but quantifies only their area difference. Together with OE and DC, however, it can indicate over- or undersegmentation of the results.

The ASSD takes into consideration the distances between the borders of the sets (i.e. the pixels that have at least one background pixel in their neighbourhood). Let S(G) and S(R) denote the border pixel sets of the ground truth G and the tested result R (the set denoted S above). For each point sG in S(G), d(sG, S(R)) denotes the Euclidean distance from sG to the closest pixel in S(R); the distances are also calculated symmetrically from the border pixels of R to those of G. All distance values are then averaged, which defines the ASSD as:

$$ \begin{aligned} ASSD(G,R)=\frac{1}{|S(G)|+|S(R)|} \left( \sum\limits_{s_{G} \in S(G)} d(s_{G}, S(R)) + \sum\limits_{s_{R} \in S(R)} d(s_{R}, S(G)) \right) \end{aligned} $$
(16)

In a similar fashion, the HD is the longest distance between the borders of the segmentation and ground truth regions (i.e. the greatest of all the distances from a point on one border to the closest point on the other). With the border sets S(G) and S(R), and d(g,r) as the distance between pixels g and r, the HD can be defined as:

$$ HD(G,R)= \max \left\{ \sup_{g \in S(G)} \inf_{r \in S(R)} d(g,r), \ \sup_{r \in S(R)} \inf_{g \in S(G)} d(g,r) \right\}. $$
(17)
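All five measures can be computed for boolean masks as sketched below, using exact Euclidean distance transforms for the border distances; this is an illustrative reference implementation that assumes non-empty masks.

```python
import numpy as np
from scipy.ndimage import binary_erosion, distance_transform_edt

def border(mask):
    """Border pixels: foreground pixels with a background neighbour."""
    return mask & ~binary_erosion(mask)

def segmentation_metrics(result, truth):
    """OE, AD, DC from (15), ASSD from (16) and HD from (17)."""
    inter = (result & truth).sum()
    union = (result | truth).sum()
    oe = 100.0 * (1.0 - inter / union)
    ad = 100.0 * (result.sum() - truth.sum()) / truth.sum()
    dc = 2.0 * inter / (result.sum() + truth.sum())
    # Distance from any pixel to the nearest border pixel of each set.
    d_to_truth = distance_transform_edt(~border(truth))
    d_to_result = distance_transform_edt(~border(result))
    d_rt = d_to_truth[border(result)]   # result border -> truth border
    d_tr = d_to_result[border(truth)]   # truth border -> result border
    assd = (d_rt.sum() + d_tr.sum()) / (len(d_rt) + len(d_tr))
    hd = max(d_rt.max(), d_tr.max())
    return oe, ad, dc, assd, hd
```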

4.1 Synthetic images

The first example (see Fig. 6) presents the segmentation results for synthetically composed images. The method was able to successfully segment textures of different scales and orientations, even when multiple patterns were present. It should be noted that a more careful placement and scaling of the initial contour was necessary only in the last example. The experiments were also performed with the feature pre-selection criterion from (9) disabled. Normally, about 50 to 70% of the initial feature maps were pre-selected. The omission of this operation did not have a noticeable influence on the segmentation results but caused an increase of about 30% in the segmentation time.

Fig. 6 Segmentation of synthetic texture mosaics

The second example in Fig. 7 illustrates the influence of the 𝜃 sensitivity parameter. The first case (Fig. 7a) shows the second stage result with 𝜃 set to the default value of 4, where the contour exhibits some significant gaps, but the third stage provides an acceptable final result (Fig. 7b). The result of the second stage with 𝜃 = 7 (Fig. 7c) is closer to the desired form, but in both cases the final results are comparable. Generally, the 𝜃 parameter permits a wide range of values that give an acceptable segmentation.

Fig. 7 Influence of the 𝜃 sensitivity parameter on the intermediate and final segmentation result: a result of second phase with 𝜃 = 4, b result of third phase with 𝜃 = 4, c second phase result for 𝜃 = 7 and d third phase result for 𝜃 = 7

The third example contains a particularly difficult case (Fig. 8), where different types of features were necessary to segment the target region. On their own, the GLCM, Gabor and tensor features were not able to provide a strong boundary between all four adjacent regions, but together they performed a successful segmentation.

Fig. 8 Segmentation results with a-c only the specific features enabled and d with all features used (result regions in the top row and over(p) function value map in the bottom)

The next example compares the proposed method with a deep learning-based approach that uses fully convolutional networks for the task of image segmentation (FCNT), proposed by Andrearczyk and Whelan [2]. The experiments were performed on mosaics created from the Kylberg texture dataset [26], and the segmentation was compared with the results presented in [2]. Figure 9 contains selected outcomes of both algorithms, while a complete quantitative quality comparison is presented in Table 1. The tested mosaics contain quite challenging patterns, but both methods are mostly able to provide reasonable segmentations, indicated by low OE and AD values, as well as a DC greater than 0.95 in many cases. The border-based metrics (ASSD and HD) show the biggest differences in favour of the proposed method. Some of the FCNT segmentations contain many disjointed patches outside of the target region, which are not present in the results of the proposed method. This is indicated by the lower ASSD and significantly reduced HD (especially in Fig. 9d and Fig. 9e).

Fig. 9 Segmentation results on mosaics from the Kylberg dataset: results of the FCNT method (top row) and the proposed algorithm (bottom row)

Table 1 Accuracy of the Kylberg dataset mosaics segmentation performed with the FCNT and the proposed method

The last example (see Fig. 10) presents the influence of image noise on the performance of the proposed method. The ability to deal with noisy images is a desirable feature of robust segmentation methods, especially in medical applications [13], where input noise is common. The test image in the example was modified by the addition of Gaussian noise with increasing standard deviation (σ from 45 to 85). Although the original patterns deteriorated significantly in the noisiest images, the proposed method managed to perform adequately. Each increase in the σ value required only a slight decrease (by 0.5 to 1.0) in the 𝜃 sensitivity parameter (starting from 𝜃 = 6.5 for the central pattern and 𝜃 = 4 for the top left region).

Fig. 10 Influence of image noise on the segmentation results: a segmentation of two regions performed on unmodified images and b-e results on images with added Gaussian noise (σ values from 25 to 85)

4.2 Natural images

Figure 11 contains the segmentation results on various natural images. The results of the proposed algorithm were compared with two state-of-the-art texture segmentation methods: Dictionary-based Active Contour (DAC) [11] and Factorisation-based Active Contour (FAC) [16]. Both methods employ a region-based level set framework but take different approaches to texture feature extraction. DAC constructs a dictionary of texture patches present in the segmented image. Each image patch is then assigned to an element of the dictionary. Using the initial form of the contour, the pixels in the dictionary patches are labelled as being inside or outside of the contour. This labelling is then used to calculate, for each image pixel, the probability of belonging to the target region. The level set evolution process expands the curve into regions where the probability is high and retracts it from low-probability areas. The FAC method uses a local spectral histogram [28] as the texture descriptor. After the feature extraction, the image is modelled as a product of a matrix of feature values and a matrix of feature weights for the segmented regions. FAC assumes two regions: foreground and background. During the contour evolution, the level set framework uses a factorisation algorithm to minimise the differences between the model (the feature and weight matrices) and the original image (the real values of the extracted features).

Fig. 11 Natural image segmentation with different methods: a FAC, b DAC and c the proposed method

The DAC and FAC results were obtained with MATLAB implementations shared by the authors. For FAC, the parameters were set to 12 histogram bins and an integration scale of 20; the default circular initialisation was also used. For the DAC method, the initialisation corresponded to the initial contours used by the presented method, with the suggested patch sizes from 9 to 15. A quantitative analysis of the segmentation quality is presented in Table 2.

Table 2 Segmentation accuracy of the tested methods for natural images

The first three natural images (Leopard, Zebra1, and Zebra2) are more bi-modal in nature. For the Leopard image, the proposed method's performance is on par with the other two algorithms: all methods achieved a DC of around 0.93, and the OE values were between 11 and 12.5, with the lowest AD of -1.58 for the proposed method. The algorithms had the most problems in the bottom parts of the regions, where the leopard's limbs blend into the grass. The zebra images contain high-contrast patterns of variable scale. Here, the results of the proposed method are also comparable to DAC and FAC, although a slight oversegmentation can be noticed (AD = 13.65 for Zebra1 and AD = 14.75 for Zebra2).

In the next image (Eel), the scene is composed of multiple objects with different patterns. All methods were able to encompass most of the target region, with FAC also including a part of a different object. The other two methods achieved a high DC (0.97 and 0.98) with low values of OE, AD, and ASSD.

The last two images (Cheetah1 and Cheetah2) contain multiple texture patterns. Apart from the target object, there are at least two other background regions. The proposed method performed significantly better than FAC and DAC, especially in the case of the second image (Cheetah2). Since DAC and FAC target bi-modal images, the proposed method gave better results thanks to its employment of multiple features. As the previous examples show, the method can differentiate between distinct regions even without an explicit multi-phase formulation. It should be noted that there is also a multi-phase version of the DAC algorithm that could perform better in these conditions. However, due to the nature of the proposed method, a comparison with other two-phase methods seemed more appropriate.

4.3 Time performance

Finally, the performance results of the method on GPUs of different classes are presented in Fig. 12. Four Nvidia cards were tested: the low-end GT 1030, mid-range GTX 1050 Ti, high-end Titan Xp, and server-grade Tesla P100 (please see Table 3 for specifications). The execution time is shown as a stacked chart of the three stages of the algorithm. The tests were performed on three versions of the first synthetic image from Fig. 6. The image, originally of resolution 256 × 256, was also scaled to 128 × 128 and 512 × 512 pixels. The first stage of the algorithm was fixed to use GLCM features, since their extraction is the most computationally intensive. In the case of the high-end cards, the provided example did not pose a significant challenge (less than 1, 2, and 6 s for the three test images); the CPU-based third stage took the most time for the larger image sizes. The two lower-end cards are more realistic targets for a 2D segmentation task, and in their case the achieved times were still reasonable. Even on the low-end GT 1030, the execution took less than 10 s for the largest image and was below 1 s and 3 s for the 128 × 128 and 256 × 256 images, respectively.

Fig. 12 Performance of the proposed method on different GPUs

Table 3 Specifications of the graphics cards used for performance tests

The proposed method benefits significantly from the GPU acceleration. For example, the level set evolution in the second stage executes in less than 0.5 s in most of the performed tests. In contrast, a CPU-based version can take from 4 to 8 s to finish on a 256 × 256 image and up to 25 s in the 512 × 512 case. The GPU-based GLCM algorithm also exhibits similar improvements in execution time.

4.4 Discussion

The proposed method was compared against other active contour-based methods, as well as against a deep learning-based approach. Despite the fact that the presented algorithm essentially works in a two-phase mode, it can clearly perform well on multi-modal images with multiple textures. The relevant texture features can complement each other in those situations, while the unused descriptors do not get in the way of the final result.

The GPU-based implementation provides a clear improvement in runtime performance. Historically, algorithms like level sets and GLCM were notoriously computationally intensive, which limited their practical applicability. The experiments show that modern GPUs can compute relatively large banks of texture features and perform the “fast” level set evolution in sub-second time, which removes some of those earlier restrictions. The GPU-based level set method in the second stage is driven by a speed function with a relatively small influence of the curvature, and the parameters of the evolution are tuned for fast convergence. Furthermore, the computations are performed on the entire image domain. Typical optimisations aim to reduce the range of the level set function updates with sparse schemes [27], where only a part of the image domain is considered (e.g. only a narrow band around the level set zero-crossing is modified in each iteration). These modifications can fundamentally change the final results, while our approach is efficient enough without them, at least in the 2D case.

The comparison with the FCNT, which uses fully convolutional networks, indicates that the proposed method can be competitive against current state-of-the-art approaches. Deep learning algorithms provide some advantages over active contours and level sets, such as the lack of need for manual initialisation and interactive parameter tuning. On the other hand, the (often supervised) training stage is unavoidable and depends on the availability of prepared, annotated datasets.

The presented method employs a multi-stage approach that was first outlined in [42]. The original work used a snake-based ACM with a vastly different evolution scheme that considered only the closest neighbourhood of the discrete points of the curve. The proposed approach uses both local and global level set-based ACMs that are more robust in handling multiple features. The topological restrictions of the parametric snake are also eliminated (see Fig. 7). Furthermore, the final refinement stage not only differs from the heavily heuristic approach of the snake version, but also uses a region-based level set model distinct from that of the previous stage. The cooperation of these two models is one of the main features of the proposed method. The computational advantage of the discrete snake was also nullified by the GPU utilisation.

5 Conclusions and future works

In this paper, a multi-stage texture-based active contour method has been presented. The proposed method combines classical and region-based level set contour formulations and integrates multiple, varied texture features. The algorithm was carefully validated on synthetic and natural images and compared with the most similar of the available state-of-the-art approaches. The method was able to successfully segment various patterns in both bi- and multi-modal images. It should be noted that the presented approach is not tied to any particular texture descriptor and can easily be extended to integrate additional features. Moreover, thanks to GPU acceleration, it efficiently employs a large number of different kinds of descriptors.

We are aware that the current form of the method can be improved in many ways. Although the algorithm does not require any prior information about the texture classes in the image, it relies on manual contour initialisation. A more advanced feature selection scheme could also be employed in the first stage. As for the performance, the third ACWE-based stage would benefit greatly from GPU acceleration. Moreover, the general level set form of the algorithm could be extended into a multi-phase version that could segment multiple regions [3].

We are currently working on adapting the proposed segmentation method to 3D [18, 27]. We are also considering an assessment of the method on biomedical images and more experimental comparisons with the very popular fully convolutional approaches [29, 54]. Finally, we plan to tackle the more complex task of semantic segmentation.