1 Introduction

Exploratory processes play a key role in human creativity, especially in the process of creating artistic paintings. Based on this observation, Hertzmann [1] argues that the artistic process is led by a high-level goal such as “making a good painting,” but that the final image emerges from an exploratory process of successively trying out different techniques, styles, and compositions. Translated into computational procedures, algorithmic painting can be modeled as an under-specified optimization problem that can be approached using a variety of high-level to low-level choices of abstraction, media, and distortion. It is therefore vital to solve this problem as an iterative, human-in-the-loop procedure with an explicit decomposition of stylistic tasks [1]. We postulate that creating methods adhering to this paradigm is useful for digital artistic stylization tools in general, as it empowers non-professional users to engage in casual creativity [2] by enabling them to create a broad range of stylistic edits and by guiding them to aesthetic results without requiring a high level of technical skill. In this paper, we focus on controlling geometric shape abstraction and texture in artistic images.

Recent advancements in image synthesis, including diffusion models [3] and example-based stylization methods such as neural style transfer (NST) [4, 5], yield impressive results. However, their black-box representation of style, which intertwines shape, texture, and color aspects, and the resulting lack of separable artistic control variables [1] pose significant challenges for the precise adjustment of stylistic elements. This is particularly apparent for geometric and textural attributes. For example, geometric elements, such as brush patterns, shapes, and sizes, are often not applied in a consistent manner, and textural attributes, such as stroke textures, contours, tonal variations, and other complex image details, are not adjustable at a fine-grained level. Our method seeks to introduce such controls by disentangling various stylistic elements in existing images and can be used in a post-synthesis editing step.

We introduce an iterative, coarse-to-fine approach for manipulating geometric abstraction and texture with an explicit parametric decomposition of stylistic attributes. Our approach accepts any image as input and allows parametric and example-based control of geometric and textural elements. The core of our method lies in the decomposition of the input image into two components: a set of primitive shapes, representing the coarse structure of the image, and a parametric representation of the high-frequency texture details. For coarse structure extraction, we employ segmentation techniques such as superpixel segmentation [6] or neural stroke-based rendering [7, 8], depending on the desired shape primitives. To decompose the image texture into meaningful artistic control variables, we introduce a novel pipeline of lightweight, differentiable stylization filters. These filters are based on traditional image-based artistic rendering (IB-AR) [9] and are implemented in an auto-grad-enabled framework [10]. Each filter produces a specific component of the overall texture details and is parameterized by stylistic or painterly attributes, such as the amount of contours, local contrast, or surface relief (e.g., oil paint texture), among others, making our approach particularly well-suited to decompose artistic images. Besides performing shape and filter decomposition in sequential stages, we also introduce an approach for joint stroke and filter parameter optimization, which closely integrates both stages and enables adaptation of the strokes to new example-based styles.

Fig. 1: Our technique facilitates editing of texture and geometric abstraction while preserving the overall color and structure composition. In panels A to C, we modify textures by enhancing oiliness and stroke thickness (A), and applying text prompts “starry night” (B) and “impressionistic painting” (C). In panels D to F, we vary geometric abstraction primitives, using rectangular strokes (D), blocks (E), and ellipsoids (F). These modifications can be blended and interactively fine-tuned. Our style decomposition into strokes and artistic parameters enables applications such as combined stroke and texture style transfer

The resulting decomposed parameter and shape spaces enable a wide variety of human-in-the-loop as well as example-based texture and geometric shape edits (Fig. 1), making our method particularly well-suited for interactive exploratory processes [1] in the workflow of casual creators [2]. The method of geometric abstraction is interchangeable and can be interactively controlled to adjust stroke shapes and abstraction levels. Filter parameters can also be interactively adjusted by manually editing parameter masks or interpolated based on image attributes such as depth or saliency, allowing for separate control over color and texture.

The proposed method shows excellent performance in example-based texture editing, adapting textures to align with the underlying coarse structures and ensuring a consistent appearance. This includes the use of example-based losses to tune the texture in parametric space, aligning it with desired styles by optimizing NST image-based style losses [4, 11] or text-based losses [12]. Furthermore, for these example-based tasks, we can accelerate texture decomposition by training parameter prediction networks (PPNs) for real-time decomposition of NST results. This capability could be used to enable interactive refinement of NST results in professional artistic workflows, which we demonstrate by designing a PPN-based tool for NST image editing and artifact correction.

To summarize, we make the following contributions:

1. We present a holistic approach for geometric abstraction and texture editing that decomposes the image into coarse shapes and high-frequency detail textures.

2. We present a novel differentiable filter pipeline for texture editing. Compared to previous methods, it is lightweight and improves parameter editability.

3. We present an approach for joint optimization of geometric shapes and parameters using neural stroke painting.

4. We introduce PPNs for real-time single- and arbitrary-style texture decomposition.

5. We demonstrate that a texture can be adapted to new styles using example images or text prompts, and can be interactively adjusted on a global or local level.

This work is an extension of [13]. In the following, we discuss related work in Sect. 2, detail our decomposition approach in Sect. 3, showcase stylization and editing applications in Sect. 4, and evaluate our method and conclude in Sects. 5 and 6.

2 Related work

2.1 Example-based stylization control

Example-based methods for stylization, particularly NST as introduced by Gatys et al. [4], employ deep neural networks to extract and apply stylistic features from a reference image to a target image in a black-box fashion, and exert control over the output using a style/content trade-off. Beyond this trade-off, various methods have been developed to manipulate specific attributes in style transfer such as color [14] or stroke textures [15, 16]. CLIPstyler [12] enables style transfer guided by textual prompts, utilizing CLIP-based losses [17]. However, limitations exist in these techniques: optimization-based methods lack interactive control, and network-based methods usually only allow control over a single attribute and are not composable with others.

2.2 Generative stylization control

Recently, tremendous progress has been made in image generation following the introduction of generative adversarial networks (GANs) [18, 19] and diffusion models [3, 20, 21]. For image editing, StyleGAN [19, 22] has been shown to be highly adjustable due to its structured latent space. It has been applied in the stylization domain [23, 24], and used for stroke-based [25] and text-based editing of images [26, 27]. However, editing real images with GANs involves inverting them into a latent code [28], a process that may be imperfect and limited to specific domains like faces. More recently, denoising diffusion models [29] have demonstrated superior performance in representing complex, multimodal imagery compared to other methods [3, 20, 21]. They have been applied to the editing of natural images, e.g., by fine-tuning the diffusion model [30, 31], where they can also generate or edit stylized images. However, due to the time required for fine-tuning, these edits are not interactive, and they also strongly alter the semantic appearance when only stylistic or textural edits are desired. Other approaches perform style transfer using diffusion models [32], which improve on semantic consistency but still lack fine-grained style control. Our method may be seen as a complementary technique to generative model control (e.g., applied as a post-processing step) that enables interactive global and local editing of stylistic abstraction while keeping the semantic content intact.

2.3 Image- and stroke-based rendering

In contrast to deep-learning-based approaches, traditional IB-AR, as reviewed by Kyprianidis et al. [9], encapsulates styles using a series of image filters, providing granular control over various stylistic attributes. These filters, however, are typically engineered for specific artistic styles such as cartoon [33], oil paint [34], or watercolor [35]. Lötzsch et al. [10] propose combining IB-AR with example-based stylization by implementing filters in a differentiable manner and optimizing parameters within these filters to approximate a stylized reference image, effectively decomposing a style into an interactively controllable “white box” parameter representation.

In our work, we utilize a similar filter-based, interpretable style representation. However, the approach by Lötzsch et al. [10] encapsulates the entire input image within these parameters, entwining shape, color, and details (e.g., local textures) in a complex and often redundant set of parameters and filters. This entanglement results in a cumbersome and non-intuitive editing experience, where adjustments in parameters influencing painterly attributes may inadvertently affect colors or shapes. In contrast, our method abstracts color and shape distribution at an earlier geometric abstraction stage, focusing the filter pipeline on primarily representing high-frequency components, i.e., local textures. In contrast to the style-specific filter chains employed by Lötzsch et al. [10], our approach introduces a streamlined filter pipeline, comprising just four filters, that is capable of adapting to a wide range of styles while enhancing parameter editability and reducing redundancy.

Further, Lötzsch et al. [10] demonstrated the viability of interactive parameter prediction by training PPNs for specific img2img translation tasks in combination with an additional post-processing CNN. In our work, we expand the versatility of PPNs, showcasing their adaptability for both general single- and arbitrary-style transfer decomposition tasks, while obviating the need for post-processing networks.

Our approach to geometric abstraction draws from existing artistic approaches that utilize either segmentation-based shapes [36, 37] or stroke-based rendering [38, 39] to abstract images into primitive shapes. The geometric image abstraction, e.g., obtained using SLIC [6] segmentation, is then provided as input to the filter pipeline. For stroke-based rendering, we integrate recent developments in neural painting, employing both optimization-based (stylized neural painting (SNP) [8]) and prediction-based (PaintTransformer [7]) strategies. Zou et al. [8] further propose to optimize strokes using NST losses to adapt the color and positioning of strokes according to an artistic style. We integrate their approach into our framework to jointly optimize strokes together with filter parameters, thereby enabling a fully differentiable geometric abstraction that can be adapted using losses such as CLIPStyler [12].

Fig. 2: Texture decomposition involving a segmentation stage S and a pipeline of differentiable image filters O (see Sect. 3.1). *Gradients only flow back to S if using a differentiable stroke renderer

3 Method

3.1 Framework overview

Our approach, shown in Fig. 2, involves a two-stage process for decomposing images into a format suitable for subsequent texture editing tasks. The first stage, a segmentation stage denoted as \(S(\cdot )\), controls texture granularity and geometric abstraction. This stage transforms the input, a reference image \(I_r\), into an abstracted version \(I_a\) by rendering it using shape primitives. The second stage involves a pipeline of differentiable image filters, \(O(\cdot )\), which represents the remaining image details, i.e., \(I_r - I_a\). These details are represented within the parameters of the filter pipeline O. Each parameter in this pipeline corresponds to a specific artistic control variable, such as contours or local contrast, as detailed in Sect. 3.2.2. These parameters can be adjusted at the pixel level using parameter masks \(P_M\). The decomposition process, in general, first runs stage S and subsequently optimizes the decomposition loss \(\mathcal {L}\) on the output image \(I_o\) thereby refining \(P_M\). As a special case, S and \(P_M\) can be jointly optimized when using a differentiable first stage. In the following, we elaborate on filter decomposition in Sect. 3.2, the segmentation stage in Sect. 3.3, and on training networks for parameter prediction in Sect. 3.4.

3.2 Filter parameter decomposition

3.2.1 Decomposition loss

Formally, O is parametrized by a set of M parameter masks, i.e., \(P_M = \{ P_i \in \mathbb {R}^{h \times w} | i \le M \}\) and expects the output image of the segmentation stage \(I_a=S(I_r)\) as input. A stylized output image \(I_o\) is thus obtained as:

$$\begin{aligned} I_o = O\left( P_M, I_a\right) \end{aligned}$$
(1)

For acquiring a decomposed texture representation, the parameters \(P_M\) are fine-tuned through a composite loss function \(\mathcal {L}\). This function integrates a target loss \(\mathcal {L}_{\textrm{target}}\) with a total variation loss \(\mathcal {L}_{\textrm{TV}}\), modulated by a weighting factor \(\lambda _{\textrm{TV}}\):

$$\begin{aligned} \mathcal {L} = \mathcal {L}_{\textrm{target}} + \lambda _{\textrm{TV}}\mathcal {L}_{\textrm{TV}} \end{aligned}$$
(2)

Here, \(\mathcal {L}_{\textrm{target}}\) is designed to align with the desired target style, while \(\mathcal {L}_{\textrm{TV}}\) works to reduce noise in the masks and promote local coherence, which is beneficial for subsequent editing.

The choice of loss function for \(\mathcal {L}_{\textrm{target}}\) varies based on the intended outcome. For texture representation suitable for interactive editing, the \(\ell _1\) loss is preferred as it focuses on detailed reconstruction of the reference image:

$$\begin{aligned} \mathcal {L}_{\textrm{target}} = \left\| I_o - I_r \right\| _1 \end{aligned}$$
(3)

To adapt to a new texture style, \(\mathcal {L}_{\textrm{target}}\) can also directly utilize neural style transfer losses. For this, a perceptual content loss \(\mathcal {L}_c(I_o, I_r)\) [4] is combined with a style loss that extracts style from either an image, such as Gram loss [4] or optimal transport loss [11], or a text prompt, such as CLIPStyler [12]. We demonstrate these variants in Sect. 4.1. Additionally, these losses are effective for training a PPN to reconstruct style-transferred images in a single inference step. For this, we introduce single-style and arbitrary-style decomposition networks in Sect. 3.4.
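As a minimal sketch of this objective (assuming a PyTorch-style auto-grad framework; the anisotropic total-variation formulation and all identifiers are illustrative rather than the paper's exact implementation), the composite loss of Eqs. (2) and (3) could look as follows:

```python
import torch

def total_variation(mask):
    """Anisotropic total variation over the spatial dimensions of a mask tensor."""
    dh = (mask[..., 1:, :] - mask[..., :-1, :]).abs().mean()
    dw = (mask[..., :, 1:] - mask[..., :, :-1]).abs().mean()
    return dh + dw

def decomposition_loss(I_o, I_r, P_M, lambda_tv=0.2):
    """L = L_target + lambda_TV * L_TV with the l1 reconstruction target of Eq. (3)."""
    l_target = (I_o - I_r).abs().mean()   # faithful reconstruction of the reference
    l_tv = total_variation(P_M)           # noise reduction / local coherence of the masks
    return l_target + lambda_tv * l_tv
```

For example-based adaptation, the reconstruction term would simply be swapped for a Gram, optimal transport, or CLIP-based style loss, as described above.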

3.2.2 Differentiable filter pipeline

Traditional heuristic-based stylization pipelines (e.g., cartoon [33], watercolor [35], or oil paint [34]) can be made differentiable and optimized in an example-based framework, as Lötzsch et al. [10] demonstrated. However, several limitations decrease the filter pipelines’ suitability for example-based stylization and subsequent editing. For one, some filter operations such as color quantization [33] are not directly trainable and require differentiable proxies [10]. Further, filter pipelines often contain many repetitive elements, such as multiple smoothing steps, which can complicate the optimization and parameter mask editing.

Our approach addresses these challenges by proposing a streamlined pipeline, \(O(\cdot )\), with intuitive and non-redundant parameters while being capable of matching any texture. It comprises the following differentiable filters:

(1) Smoothing: Incorporates Gaussian smoothing (\(\sigma =1\)) and a subsequent bilateral filter with learnable parameters \(\sigma _d\) (distance kernel size) and \(\sigma _r\) (range kernel size).

(2) Edge enhancement: Implements an eXtended difference-of-Gaussians (XDoG) [40] filter with learnable parameters for contour amount and opacity.

(3) Painterly attributes: Controls a specific painterly aspect of a style. We utilize bump mapping for surface relief control throughout the paper, implemented using Phong shading [41] with learnable parameters bump-scale, Phong-specularity, and bump-opacity. However, in principle, any differentiable painterly filter can be used here, and we also implement wet-in-wet [42] and wobbling [35] filters for watercolor effect control.

(4) Contrast: Controls the amount of local contrast enhancement.

Gradients are calculated for both the parameter masks \(P_M\) and the image input. To facilitate this, the image filters are implemented within an auto-grad-enabled framework, in line with Lötzsch et al. [10]. To preserve the input color distribution (i.e., the output from the segmentation stage), learnable parameters cannot alter color hues during the optimization process. An ablation study (Sect. 5.2) is conducted to validate our filter choices and demonstrates that each stage of our pipeline is essential for accurately reconstructing arbitrary styles. Filters in the pipeline always operate on a single-color image input, making filters independent of each other. Thereby, the pipeline is easily extendable with additional IB-AR techniques [9] to control further painterly attributes.
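The following sketch illustrates how such a pipeline can be composed in an auto-grad framework; the `ContrastFilter` is a simplified stand-in (unsharp-mask blending) rather than the paper's actual filter implementation, and all module and mask names are illustrative:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ContrastFilter(nn.Module):
    """Illustrative local-contrast stage: blends the input with a detail-boosted
    version, weighted per pixel by its learnable parameter mask."""
    def forward(self, image, param_masks):
        blur = F.avg_pool2d(image, kernel_size=5, stride=1, padding=2)
        enhanced = image + (image - blur)               # simple unsharp-mask boost
        w = torch.sigmoid(param_masks["contrast"])      # keep the blend weight in [0, 1]
        return (1 - w) * image + w * enhanced

class FilterPipeline(nn.Module):
    """Applies the filters in sequence; each filter reads its own per-pixel masks
    and operates on the full color image, keeping the filters independent."""
    def __init__(self, filters):
        super().__init__()
        self.filters = nn.ModuleList(filters)   # e.g., smoothing, XDoG, bump mapping, contrast

    def forward(self, image, param_masks):
        out = image
        for f in self.filters:
            out = f(out, param_masks)            # differentiable w.r.t. image and masks
        return out

def init_param_masks(names, h, w):
    """One (1, 1, H, W) mask per artistic control variable, optimized directly."""
    return {n: torch.zeros(1, 1, h, w, requires_grad=True) for n in names}
```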

3.2.3 Parameter mask optimization

We optimize the parameter masks \(P_M\) using gradient descent to minimize the composite loss \(\mathcal {L}\). Optimizing solely on \(\mathcal {L}_{\textrm{target}}\), as in [10], results in highly fragmented masks with large local value variations (see Fig. 3, first row), complicating their editability. However, including \(\mathcal {L}_\textrm{TV}\) in the optimization process yields masks with less noise, increased sparsity, and greater smoothness.

For detail optimization, we follow [10], using 100 iterations of Adam [43] with a learning rate of 0.01, reduced by a factor of 0.98 every five steps after the 50th iteration. Differing from [10], our method does not require smoothing of the masks, as our filters do not produce artifacts. Unless specified otherwise, we consistently use \(\lambda _\textrm{TV}=0.2\) during optimization in our experiments.
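A sketch of this schedule, reusing the illustrative `decomposition_loss` and pipeline interface from the sketches above (the exact optimizer setup in the reference implementation may differ):

```python
import torch

def optimize_masks(O, I_a, I_r, P_M, steps=100, lr=0.01, lambda_tv=0.2):
    """Adam for 100 iterations; learning rate multiplied by 0.98 every five
    steps, with the decay starting only after the 50th iteration."""
    params = list(P_M.values())
    opt = torch.optim.Adam(params, lr=lr)
    sched = torch.optim.lr_scheduler.StepLR(opt, step_size=5, gamma=0.98)
    for step in range(steps):
        opt.zero_grad()
        I_o = O(I_a, P_M)
        loss = decomposition_loss(I_o, I_r, torch.cat(params, dim=1), lambda_tv)
        loss.backward()
        opt.step()
        if step >= 50:
            sched.step()   # scheduler is only advanced after iteration 50
    return P_M
```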

Fig. 3: Parameter masks \(P_M\) optimized without (top row) and with (bottom row) \(\mathcal {L}_\textrm{TV}\) to fit the input \(I_r\) (top row) to the given segmented image \(I_a\) (obtained using SLIC)

3.3 Shape decomposition

The first stage S in our pipeline is responsible for geometric abstraction and decomposes the reference image \(I_r\) into shapes by re-rendering it using distinct shape primitives, such as brushstrokes or segments of uniform color to generate abstracted image \(I_a\). This stage can employ various image-based techniques such as superpixel segmentation (e.g., SLIC [6]) or stroke-based rendering (e.g., PaintTransformer [7]), as it operates on pixels in both input and output domains. We categorize these as feed-forward techniques, that is, methods that decompose without receiving feedback from the second stage, and showcase results in Sect. 4.2.

A limitation of such feed-forward methods is the potential misalignment of stroke placement and color when adapting the filter pipeline to a new style, such as through style transfer. To address this, we propose an alternative approach that concurrently optimizes strokes alongside parameters, ensuring adaptation to a cohesive style. We adapt the SNP technique by Zou et al. [8] to our parameter optimization framework, details of which are elaborated in the following.

3.3.1 Differentiable strokes

SNP [8] incorporates a neural renderer, which is tasked with generating strokes from a set of stroke parameters, and a stroke blender that combines these strokes in a differentiable manner, optimized to reconstruct the reference image. The process starts with an empty canvas \( h_0 \). Using a neural renderer \( G \), a sequence of strokes is generated and superimposed on the canvas iteratively. For each drawing step \( t \), the renderer \( G \) takes stroke parameters \( x_t \in \mathbb {R}^{d}\) (characterizing aspects like shape, color, transparency, and texture) to produce a stroke foreground \( s_t \) and an alpha matte \( \alpha _t \). These components are blended using a soft blending equation:

$$\begin{aligned} h_{t+1} = \alpha _t s_t + (1 - \alpha _t)h_t, \end{aligned}$$

where \( (s_t, \alpha _t) = G(x_t) \). This process is repeated for \( T \) steps, and the stroke parameters are optimized to match a target image \( I_r \), i.e., the reference image is also used for parameter optimization. The final rendered output \( h_T \) is formulated as:

$$\begin{aligned} h_{T} = f_{t=1\sim T}(\tilde{x}) \approx I_r, \end{aligned}$$

where \( f_{t=1\sim T}(\cdot ) \) represents the mapping from stroke parameters to the rendered canvas, and \( \tilde{x} = [x_1, \ldots , x_T] \) is the collection of parameters at each drawing step.

The optimization involves minimizing the similarity loss between the final canvas \( h_T \) and the reference \( I_r \) using a loss function \( \mathcal {L}_{\text {stroke}} \), which is typically the \(\ell _1\) loss; an additional optimal transport loss improves convergence in some cases [8]. The stroke parameters are updated using gradient descent, thereby iteratively refining the strokes on the canvas to achieve a rendering that closely resembles the input image. The neural renderer \( G \) consists of a rasterization network \(G_r\) that, given stroke alpha \(x_\alpha \) and shape \(x_\text {shape}\), predicts the alpha matte \(\alpha _t\), and a shading network \(G_s\) that, given stroke shape and color \(x_\text {color}\), predicts the stroke texture \(s_t\). The shape representation (\(x_\text {shape}\)) is parametrized by either a textured rectangle rotated around an anchor point, akin to the stroke representation in PaintTransformer (PTf) [7], or a Bezier curve, depending on the stroke type. Our results demonstrate the use of the former “oil-painting” and “color-tape” rectangular strokes. We refer to [8] for architecture and training details.
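A minimal sketch of this soft blending loop, treating the pretrained neural renderer \(G\) as a black box that returns a stroke foreground and alpha matte per parameter vector (function names are illustrative):

```python
import torch

def render_strokes(G, stroke_params, canvas):
    """Implements h_{t+1} = alpha_t * s_t + (1 - alpha_t) * h_t for a sequence
    of stroke parameter vectors x_t (shape, color, transparency, texture)."""
    h = canvas
    for x_t in stroke_params:
        s_t, alpha_t = G(x_t)                  # stroke foreground and alpha matte
        h = alpha_t * s_t + (1.0 - alpha_t) * h
    return h
```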

Algorithm 1: Stroke-Based Rendering and Optimization Algorithm

Fig. 4: Joint optimization of strokes and parameters. Interim steps of the stroke canvas (top row) and filter output (bottom row) are shown, where m is the number of block divisions. Resulting \(P_M\) for contrast and XDoG contours are shown in (d)

3.3.2 Joint stroke and parameter optimization

Our joint optimization algorithm, shown in Algorithm 1, adopts the progressive, grid-layer-based approach of Zou et al. [8]. It starts optimizing stroke and filter parameters on a \(128\times 128\) pixel canvas, which is subsequently partitioned into \(m \times m\) blocks (\(m=2, 3, 4,...\)), optimizing for a number of iterations in each grid layer. For each layer of m divisions, the number of strokes in \({\tilde{x}}\) is gradually grown. We sample stroke centers distributed according to the error between the current canvas and the reference, i.e., \(|h_{i-1} - I_r|^4\), while the other stroke parameters are sampled randomly. Subsequently, we predict and blend strokes for canvas \(h_{i}\), which is then stitched to create \(I_a\) and passed through pipeline O to create the stylized intermediate outputs. In Fig. 4, we use \(\ell _1\) for \(\mathcal {L}_{\text {target}}\) to match the painting “Wheat field with cypresses” by Van Gogh and show the interim outputs; full optimizations are showcased in the supplemental video (Online Resource 2). Gradients are computed and updated for both \({\tilde{x}}\) and \(P_M\) with learning rates \(\eta = 0.002\) and \(\lambda = 0.05\), respectively. After completing a grid level, strokes are rendered to a final canvas \(h^m_T\), which is upscaled and used as initialization in the following grid level. We optimize using Adam [43] and add T strokes per block in every grid layer.
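The sketch below outlines one joint update and the error-proportional sampling of new stroke centers; it reuses the illustrative `render_strokes` helper above and omits the block partitioning and upscaling between grid levels for brevity (all names and the sampling code are assumptions, not the reference implementation):

```python
import torch

def sample_stroke_centers(canvas, I_r, n):
    """Draws n stroke centers with probability proportional to |h - I_r|^4."""
    err = (canvas - I_r).abs().mean(dim=1).pow(4).flatten()
    idx = torch.multinomial(err / err.sum(), n, replacement=True)
    w = canvas.shape[-1]
    rows = torch.div(idx, w, rounding_mode="floor")
    return torch.stack((rows, idx % w), dim=1).float()   # (row, col) per new stroke

def joint_step(G, O, strokes, P_M, I_r, target_loss, opt):
    """One joint update: render strokes to I_a, apply the filter pipeline, and
    back-propagate into both stroke parameters and filter parameter masks."""
    opt.zero_grad()
    I_a = render_strokes(G, strokes, torch.zeros_like(I_r))
    I_o = O(I_a, P_M)
    loss = target_loss(I_o, I_r)
    loss.backward()
    opt.step()
    return loss.item()

# Separate learning rates for strokes (eta) and parameter masks (lambda):
# opt = torch.optim.Adam([{"params": strokes, "lr": 0.002},
#                         {"params": list(P_M.values()), "lr": 0.05}])
```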

While the optimization of strokes gradually increases the area of coverage by adding more and finer strokes, parameter optimization directly optimizes fine details across the entire image, which can lead to parameters emulating structure details that would be better represented by strokes. To remedy this, we add an optional stroke regularization loss \(\mathcal {L}_{\text {reg}}\) that aims to constrain parameters to only be optimized in the locations of the currently active stroke set and further constrain parameter variations inside a single stroke:

$$\begin{aligned} \mathcal {L}_{\text {reg}} = \sum _t (\alpha _t P_M - {\bar{P}}_M(t))^2 \end{aligned}$$
(4)

where the parameter mask should, for every stroke, be guided toward its average value \({\bar{P}}_M(t) = \frac{\sum (\alpha _t P_M)}{\sum \alpha _t} \) under the area of the stroke (i.e., where \(\alpha _t\) is not zero). The resulting parameter masks (Fig. 4d) display a stroke-like structure, which aids in local parameter editing, as values can be uniformly adjusted by selecting a stroke segment (see Sect. 4.3).
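A direct transcription of Eq. (4) (assuming the alpha mattes of the currently active strokes are kept as a list and \(P_M\) is stacked into a single tensor; both assumptions are illustrative):

```python
import torch

def stroke_regularization(alphas, P_M, eps=1e-8):
    """Eq. (4): pulls each masked parameter map alpha_t * P_M toward its mean
    value under the stroke area, yielding stroke-shaped, piecewise-uniform masks."""
    loss = 0.0
    for alpha_t in alphas:                                   # alpha_t: (1, 1, H, W)
        masked = alpha_t * P_M                               # P_M: (1, M, H, W)
        mean = masked.sum(dim=(-2, -1), keepdim=True) / (alpha_t.sum() + eps)
        loss = loss + ((masked - mean) ** 2).sum()
    return loss
```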

3.4 Parameter prediction for NST

Our optimization-based decomposition approach, while versatile, is computationally demanding, taking about 3 min for a 1 MPix image on an NVIDIA RTX 3090. For tasks with a specific style target like NST, we accelerate this step to real time by training a PPN to predict the parameter masks \(P_M\) in a single inference step, akin to pixel-predicting NST networks [44, 45].

We propose PPNs for both single- and arbitrary-style transfer texture decomposition, training them as follows:

3.4.1 Single-style decomposition PPN

The single-style transfer network (SSTN) introduced by Johnson et al. [44] is trained on a single-style image using NST losses. We adapt this approach to train a single-style PPN (\(\text {PPN}_\textrm{sst}\)) to decompose textures of a style image \(I_s\). We generate a stylized ground-truth image \(I_r\) using the SSTN trained on \(I_s\) and preprocess \(I_r\) using segmentation stage S to create training inputs \(I_a\) for our filter pipeline O. The training loss for \(\text {PPN}_\textrm{sst}\) follows Johnson et al. [44] and is a combination of Gram matrix style loss \(\mathcal {L}_{gram}\) [4] and content loss \(\mathcal {L}_c\) over VGG [46] features:

$$\begin{aligned} I_o&= O\big (\text {PPN}_{sst}(I_a), I_a\big ) \end{aligned}$$
(5)
$$\begin{aligned} \mathcal {L}_{\text {PPN}_{sst}}&= \mathcal {L}_{gram}(I_o, I_s) + \lambda \mathcal {L}_c(I_o, I_r) \end{aligned}$$
(6)

The \(\text {PPN}_\textrm{sst}\) architecture modifies the final layer’s output channels to match the number of parameter masks (\(\#P_M\)).
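A sketch of the training objective in Eqs. (5) and (6); the VGG layer indices and loss weighting follow common NST practice rather than the paper's exact configuration, and the PPN is assumed to output the stacked parameter masks expected by the pipeline \(O\):

```python
import torch
import torch.nn.functional as F
from torchvision.models import vgg16

vgg = vgg16(weights="IMAGENET1K_V1").features.eval()
for p in vgg.parameters():
    p.requires_grad_(False)
STYLE_LAYERS, CONTENT_LAYER = [3, 8, 15, 22], 15   # relu1_2 ... relu4_3 (assumed choice)

def vgg_features(x, layers):
    feats, out = {}, x
    for i, layer in enumerate(vgg):
        out = layer(out)
        if i in layers:
            feats[i] = out
    return feats

def gram(f):
    b, c, h, w = f.shape
    f = f.view(b, c, h * w)
    return f @ f.transpose(1, 2) / (c * h * w)

def ppn_sst_loss(ppn, O, I_a, I_s, I_r, lam=1.0):
    """Predict masks from the abstracted image, render through the filter
    pipeline, and combine Gram-matrix style loss with VGG content loss."""
    P_M = ppn(I_a)          # (B, #P_M, H, W), split into named masks as needed by O
    I_o = O(I_a, P_M)
    out = vgg_features(I_o, STYLE_LAYERS + [CONTENT_LAYER])
    sty = vgg_features(I_s, STYLE_LAYERS)
    ref = vgg_features(I_r, [CONTENT_LAYER])
    l_style = sum(F.mse_loss(gram(out[i]), gram(sty[i])) for i in STYLE_LAYERS)
    l_content = F.mse_loss(out[CONTENT_LAYER], ref[CONTENT_LAYER])
    return l_style + lam * l_content
```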

3.4.2 Arbitrary-style decomposition PPN

In arbitrary-style transfer, networks are trained to stylize images given both a content and a style image at inference time. We train an arbitrary-style decomposition PPN (\(\text {PPN}_\textrm{arb}\)) on a large set of style images \(I_s\), adopting the training regime from existing arbitrary-style transfer works [5, 45]. \(\text {PPN}_\textrm{arb}\) adapts the architecture of SANet [45], differing only in the last decoder layer, which in our case has \(\#P_M\) channels. Training involves preprocessing similar to the single-style case but using SANet for generating \(I_r\). The training loss \(\mathcal {L}_{\text {PPN}_\textrm{arb}}\) combines the AdaIN [5] style loss and content loss \(\mathcal {L}_c\):

$$\begin{aligned} I_o&= O\big (\text {PPN}_\textrm{arb}(I_a, I_s), I_a\big ) \end{aligned}$$
(7)
$$\begin{aligned} \mathcal {L}_{\text {PPN}_\textrm{arb}}&= \lambda _s\mathcal {L}_{\text {AdaIN}}(I_o, I_s) + \lambda _c\mathcal {L}_c(I_o, I_r) \end{aligned}$$
(8)

Differing from SANet training [45], we omit the structure-retaining identity loss due to our pipeline’s constrained content alteration capacity.

Training Details We trained \(\text {PPN}_\textrm{sst}\) and \(\text {PPN}_\textrm{arb}\) for 24 epochs, using MS-COCO [47] (content images) and WikiArt [48] (style images for \(\text {PPN}_\textrm{arb}\)), each comprising about 80,000 images. We use the training settings and hyperparameters listed in Table 1:

Table 1 Training hyperparameters

To ensure performance on diverse settings, the segmentation stage S settings were randomized by uniformly varying the global number of segments, smoothing sigma, and kernel sizes.

4 Controlling aspects of style

The decomposed style representation can be manipulated in numerous ways after optimization or prediction of style parameters. We explore re-optimization with alternative style transfer losses, parameter interpolation, and geometric abstraction control. Further, we discuss manual segment and parameter editing and showcase its usefulness for NST artifact correction.

Fig. 5: Parameter retargeting using style transfer. After decomposition, the resulting parameters are reoptimized using the STROTSS [11] loss to a new texture style (shown as inset)

4.1 Texture style transfer

Different loss functions can be employed as \(\mathcal {L}_{\text {target}}\) to adapt parameters \(P_M\) to new artistic targets. Style transfer has proven particularly useful in this regard and can be used to tune results from a previous parameter decomposition to a new style, or to optimize a new texture style directly from zero-initialized \(P_M\).

Image-based parameter style transfer. In Fig. 5, we show parameter retargeting to new artistic texture styles. In this example, we create a stylized image using NST [4], decompose the parameters using \(\ell _1\) matching, and then re-optimize these using the relaxed optimal transport style loss (STROTSS [11]). Notice how the overall structure and colors are preserved, while subtle details and textures are effectively altered.

Text-based parameter style transfer. Utilizing text prompts for style representation enhances control flexibility, as it eliminates the need for reference images. For text-based style transfers, we adapt the losses from CLIPstyler [12], differing from their approach by not training a convolutional neural network (CNN) but directly optimizing effect parameters. Specifically, we adapt their patch-based and directional losses, while typically content losses are not required due to the inherent content and color preservation of the filter pipeline. Figure 6 demonstrates how the style details of a real-world painting are transformed using text prompts, with the primary colors remaining true to the original. Further examples in the supplemental material show how details conform to the prompts and adjust to segment sizes (Fig. 7).

Fig. 6: Re-optimization experiments using text-based style transfer [12]. The input painting “The Great Wave off Kanagawa” was segmented with PaintTransformer [7] and circle primitives

Fig. 7: Using depth to interpolate parameters of the original NST with the prompt “drops”

4.2 Geometric abstraction control

Table 2 Properties of segmentation methods in stage S. These can be classified as superpixel (SP) and neural painting methods (NPM)

In Table 2, we list the segmentation and stroke-based rendering methods that were integrated and evaluated, and their main primitive for geometric control, e.g., primitive shape, the number of strokes, segments, or layers.

Superpixel segmentation. Techniques such as SLIC [6] typically partition the image into segments based on a clustering of color similarity and spatial proximity. In our pipeline, such methods are used to preserve low-frequency information in the form of uniform color patches while omitting high-frequency details. This segmentation aligns well with the shape boundaries in the image, making these methods well-suited for intermediate representations in interactive editing; we showcase separate editing of color and texture in Sect. 4.3.
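As a small example of how such an abstracted input \(I_a\) can be produced with scikit-image (segment count and compactness are placeholders; our stage S additionally randomizes such settings during PPN training):

```python
import numpy as np
from skimage import img_as_float, io
from skimage.segmentation import slic

def slic_abstraction(path, n_segments=1000, compactness=10.0):
    """Replaces every SLIC superpixel with its mean color, keeping the
    low-frequency color layout while discarding fine texture."""
    img = img_as_float(io.imread(path))
    labels = slic(img, n_segments=n_segments, compactness=compactness, start_label=0)
    I_a = np.zeros_like(img)
    for label in np.unique(labels):
        region = labels == label
        I_a[region] = img[region].mean(axis=0)
    return I_a
```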

Fig. 8: Impact of various segmentation methods on the level of geometric abstraction in segmentation image \(I_a\)

Fig. 9: Strokes (\(I_a\)) and pipeline output (\(I_o\)) after joint stroke and parameter optimization

Stroke-based rendering methods. These, on the other hand, are more adept at achieving stronger geometric abstractions. Our approach incorporates two neural painting techniques, as listed in Table 2. These methods are designed to recreate images using specific shape primitives, such as brushes, rectangles, or circles. Stylized neural painting (SNP) [8] excels in generating abstract representations, such as approximating an image using rectangles (Fig. 8c), and can optimize a variety of shapes using a neural stroke renderer. PaintTransformer (PTf) [7], on the other hand, utilizes fixed primitive shapes and generates the result in one feed-forward pass, which allows for rapid visual feedback and facilitates interactive adjustments. PTf uses a fixed shape primitive during training, whereas SNP uses a trainable generator network to synthesize brush shapes. For nuanced control over stroke details at different spatial locations, we combine the stroke confidence output from PaintTransformer with an optional level-of-detail parameter mask, \(P_S\). This mask can be either manually defined or algorithmically predicted using techniques such as saliency or depth extraction. We use depth in Fig. 8b to define foreground and background stroke granularities. Joint stroke and texture parameter optimization (Sect. 3.3.2) employs the same shape primitives as when executing SNP [8] as a feed-forward stage only. An added benefit of combined optimization is that strokes, particularly their colors and locations, are adjusted according to the optimized loss terms as well. In parameter style transfer, this approach can thus introduce stroke colors that are not present in the content image, such as yellow strokes in the sky for stars in a “starry night” style (Fig. 9).

Re-optimization. Filter parameter re-optimization fine-tunes texture and minor shapes to match the shape and granularity of geometric components in \(I_a\), yielding a cohesive visual integration. This process is illustrated in Figs. 6, 12, and 13, with additional examples showcasing variations in geometric shapes and granularities provided in Sect. IV of Online Resource 1.

Fig. 10: Global parameter editing. Values of the bump-map scale \(P_M\) are increased (b) and decreased (c) uniformly by 50%

4.3 Filter parameter editing

One key benefit of our filter-parameter-based approach is that it allows for interactive style editing through both global and local adjustments of parameters.

Editing parameters globally: After optimization or prediction, the parameter masks outlined in Sect. 3.2.2 can be globally altered by uniformly modifying their values. For instance, Fig. 10 demonstrates increasing and decreasing the bump mapping parameter.

Editing parameters locally: Local modifications are feasible by adjusting parameter masks using brush metaphors. A practical use case is enhancing contours in specific facial areas, such as the eyes and mouth in a portrait, to emphasize these features, as demonstrated in the supplemental video (Online Resource 2).

Interpolating parameters: Interpolation of parameters can be performed locally through binary or continuous-valued blending masks. In Fig. 7, we extract a depth map using DPT [49] which is used as a blending mask to blend between two parameter sets.
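A minimal sketch of such a blend, assuming both parameter sets are stored as dictionaries of equally sized masks and the depth map has been normalized to [0, 1]:

```python
import torch

def interpolate_masks(P_A, P_B, blend):
    """Per-pixel blend of two parameter sets, e.g., with a DPT depth map so
    that foreground and background receive different texture parameters."""
    blend = blend.clamp(0.0, 1.0)          # (1, 1, H, W) blending mask
    return {k: blend * P_A[k] + (1.0 - blend) * P_B[k] for k in P_A}
```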

4.4 Refining style transfer results

In stylizations produced by methods such as NST, artifacts or unwanted style elements can appear. Superpixels serve as an efficient tool for selecting and refining specific regions, particularly useful in rectifying localized flaws or editing local style elements. After selection, the color of a segment can be adjusted, while the fine-structural texture, represented in the filter parameters, is kept intact.

Fig. 11: Correction of NST artifacts using a PPN. Red dots on the face are removed by histogram matching selected segments to the skin color. The \(\text {PPN}_{sst}\) adds back in texture details and blends the edits seamlessly into the overall image

For instance, in Fig. 11, artifacts are removed by matching colors in selected superpixel regions to another area using histogram matching, thereby eliminating undesired stylization patterns or colors. A single-style \(\text {PPN}_\text {sst}\), trained on the candy style, repredicts the texture during real-time editing of segment colors, aligning textural patterns with new colors and structures. This process enables a cohesive and interactive editing workflow, further detailed in Sect. III of Online Resource 1.
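A sketch of the segment recoloring step using scikit-image's histogram matching (assuming a recent scikit-image with the `channel_axis` argument); the selected segment IDs and the reference patch are user input, and the helper name is illustrative:

```python
import numpy as np
from skimage.exposure import match_histograms

def recolor_segments(image, labels, selected_ids, reference_patch):
    """Matches the color histogram of the selected superpixels to a reference
    region (e.g., clean skin); the PPN afterwards re-predicts texture details."""
    edited = image.copy()
    region = np.isin(labels, selected_ids)
    pixels = image[region]                                    # (N, 3) flawed pixels
    target = reference_patch.reshape(-1, 3)                   # (K, 3) reference colors
    edited[region] = match_histograms(pixels, target, channel_axis=-1)
    return edited
```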

5 Results and discussion

5.1 Qualitative comparisons

We compare our texture style transfer method to other example-based techniques, focusing on their capacity to precisely control geometric and texture patterns within a given image. Ideally, a technique in such an editing scenario should preserve the original composition and colors of the input image while adapting only the desired stylistic elements. Figure 12 shows results of CLIPstyler [12] and img2img stable diffusion [3], compared to our text-based filter optimization with fine (SLIC segments) and coarse (PTf-circles) geometric shape abstractions. To ensure a fair comparison with CLIPstyler, which also modifies colors, we incorporate a histogram-matching loss [50] (CLIPstyler-Hist). It can be observed that CLIPstyler tends to introduce new content structures, and its stylization intensity varies significantly with the text prompt used. Additionally, even with histogram matching, there is a noticeable deviation in colors from the original image. In contrast, stable diffusion often results in substantial variations from the original image content. Our method retains a regular, brush-stroke-like appearance and is controllable in its granularity level, for example by using a larger brush shape for the background (Fig. 12f, top row). In Fig. 13, we further compare our image-based filter optimization (using PTf-circles and squares, as well as joint stroke and parameter optimization) with STROTSS [11] incorporating a histogram-matching loss [50], Gatys NST [4] with a color-preserving loss [14], and SNP with style transfer losses [8]. Similar to our CLIP-optimized parameters, our image-based parametric style transfer demonstrates a greater ability to retain original colors and exhibits more spatial homogeneity compared to these methods. The joint optimization of strokes and parameters (S+P) further enhances alignment between coarse geometrical structures and filter parameters, ensuring that elements such as rectangles and outlines are well-integrated with the image content, as exemplified in Fig. 13f, bottom row.

Fig. 12: Comparisons to related methods for text-based stylization. We use prompts “round brushstrokes in the style of monet” (top) and “starry night” (bottom). Please zoom in to compare details

Fig. 13: Comparisons to related methods for image-based style transfer. We compare methods that try to retain the content colors, such as color-preserving NST (CPNST) [14], STROTSS [11] with added histogram preservation, and SNP [8]. Please zoom in to compare details

5.2 Quantitative comparisons

We conduct several experiments to quantitatively assess our approach against others, and ablate different settings. We first describe our experimental setup and then discuss results concerning style-matching capability, mask-editability, the necessity of individual filters, and runtime performance.

5.2.1 Experimental setup

In our experiments, we use the following metrics to assess decomposition quality: VGG [46]-based content (\(\mathcal {L}_c\)) and style loss (\(\mathcal {L}_s\)) [4] as indicators of stylization quality, and, additionally, in the case of optimization, the \(\ell _1\) difference to the reference image generated by STROTSS [11]. Metrics are computed on a dataset consisting of 20 content images from MS-COCO [47], each stylized by 10 widely used NST styles. We used the following hyperparameters throughout the study: a content image size of 512 pixels, a style image size of 256 pixels, content and style weights of 0.01 and 5000, respectively, \(\lambda _\textrm{TV}=0.2\), and Adam with a learning rate of 0.01 for 500 iterations. The segmentation stage S employs SLIC [6] superpixel segmentation with \(s=1000\) and \(s=5000\) segments. Joint stroke and parameter (S+P) optimization uses 5 grid levels and 1000 strokes.

Table 3 Comparison to baselines. \(\text {PPN}_\textrm{sst}\) and \(\text {PPN}_\textrm{arb}\) are compared to their pixel-predicting baselines, and optimization of our pipeline to the stylization pipelines of Lötzsch et al. [10]

5.2.2 Style-matching capabilities

Table 3 evaluates the style-matching capabilities of our parameter prediction and optimization against their respective baselines. While the single-style NST network, as proposed by Johnson et al. [44], demonstrates superior content preservation compared to our single-style \(\text {PPN}_\textrm{sst}\), this is anticipated given that the PPN takes an abstracted and stylized image as input, whereas [44] operates on the content image. The style preservation of \(\text {PPN}_\textrm{sst}\), on the other hand, is comparably effective. In the case of arbitrary-style PPN, content preservation is nearly equivalent to its baseline, with an enhancement in style loss performance. For our optimization approach, we assess the \(\ell _1\) distance relative to the reference image, achieving a close match. Compared to the stylization filter pipelines of Lötzsch et al. [10], our results show similar metrics, despite their approach not applying loss constraints on parameter masks. See Sect. II of Online Resource 1 for visual comparisons.

5.2.3 Mask editability

Table 4 compares mask noisiness, where typically, less noisy masks enhance editability. Our method, when optimized with \(\mathcal {L}_\textrm{TV}\), significantly reduces mask noise, by more than two orders of magnitude, as indicated by the noise standard deviation (\(\sigma \)). This reduction contrasts sharply with the masks from Lötzsch et al.’s oil paint pipeline [10], which exhibit higher noise levels. In scenarios where stroke and texture parameters are optimized jointly, using \(\mathcal {L}_{\text {reg}}\) tends to produce noisier masks compared to \(\mathcal {L}_\textrm{TV}\). However, this is often due to sharp boundaries caused by stroke regularization, which, in practice, facilitate editability by allowing parameter mask areas to be selected based on their underlying stroke region.

Table 4 Parameter mask noisiness. Our pipeline is optimized with and without \(\mathcal {L}_\textrm{TV}\), and with joint stroke and parameter (S+P) optimization, and is compared against the oil paint pipeline [10]
Fig. 14: Filter ablation study. All ablated pipelines use \(\mathcal {L}_\textrm{TV}\). Full bars denote the 5K SLIC segments, the error bars the 1K segment setting

Fig. 15: Example result of the filter ablation study; removing any filter significantly degrades the pipeline’s capability to match \(I_r\)

5.2.4 Ablation study—filter pipeline

Figure 14 presents a filter ablation study aimed at evaluating the necessity of each filter in our pipeline by matching ablated pipeline configurations to a reference image. The full experimental setup is described in Sect. 5.2.1. We remove one filter at a time and measure the \(\ell _1\) norm to \(I_r\), as well as content and style losses. Omitting \(\mathcal {L}_\textrm{TV}\) increases the matching performance, at the cost of editability. The ablation results indicate that the omission of any filter significantly affects the pipeline’s capacity to accurately match a wide range of styles. While the content loss \(\mathcal {L}_c\) is less affected, removing filters generally leads to higher \(\ell _1\) errors, indicating that the abstracted output \(I_a\) already contains the semantic information but all filters are needed to accurately represent the fine details. Removing the contrast filter does not impact detail representation; however, it degrades the style loss, as the filter is needed for color adjustment. In Fig. 15, we show example results of the study. Our proposed full pipeline and its optimization target, as generated by STROTSS [11], demonstrate a close match. All ablated configurations exhibit difficulties in reconstructing the clock face. The absence of XDoG removes the capability to accurately represent black areas. The absence of bump mapping leads to a failure in reconstructing areas of high luminance, such as on the tree on the left. The removal of any of the other filters results in difficulties in painting over segment boundaries.

Table 5 Runtime comparison. For PPNs, the value in parentheses denotes runtime without target generation and SLIC segmentation. Experiments are performed using an NVIDIA RTX 3090. Results are presented in seconds

5.2.5 Runtime performance

In Table 5, we present a comparison of the runtime performance of our proposed methods. Both PPNs are about equally fast as they only require a forward pass of the network. Furthermore, the SLIC segmentation requires approximately 75% of the execution time and does not have to be continuously re-executed in interactive editing scenarios. On the other hand, the optimization method requires several seconds even for small images, and is roughly comparable in runtime to other optimization-based NSTs, with the joint stroke and parameter optimization taking significantly longer due to repeated evaluation of stroke-prediction networks.

Fig. 16: Failure case: alignment. Strokes \(I_a\) and filter output \(I_o\) do not align well due to divergence in stroke and parameter losses

5.3 Limitations

In our framework, the second stage is not guaranteed to consistently align with the geometric primitives introduced by the first stage. This issue is particularly apparent with style transfer losses, which prioritize statistical style targets over geometric consistency. Although the joint optimization of parameters and strokes aims to enhance alignment, mismatches between the geometric representations in the stroke and target losses can still lead to discrepancies. For example, in Fig. 16, the first stage introduces tilted rectangles; however, optimizing the parameters with the CLIPstyler [12] loss for the prompt “Tetris” overdraws block boundaries to create straight rectangles from the buildings.

Additionally, filter optimization matches target images effectively, but the optimized losses steer parameters toward plausible outputs and editable masks without considering the stylistic intent of individual parameters. High-quality decomposition requires each filter’s parameters to uniquely affect the output, preventing local imitation of other filters’ effects. While our filters typically yield distinct effects, with losses like \(\mathcal {L}_{\text {reg}}\) and \(\mathcal {L}_\textrm{TV}\) reducing localized imitation, some parameters can be prone to overuse. For instance, increasing contour opacity in a decomposed image can inadvertently darken areas beyond the intended contours, as seen in Fig. 17b. Conversely, reducing opacity does not completely remove all contours, notably in areas like the left eye and eyebrow (Fig. 17c). Future research could benefit from semantics-focused losses for superior decomposition. Further, painterly attribute control is limited to the existing parameters of the pipeline; adding other differentiable filters necessitates manual reconfiguration. Moreover, while our approach facilitates global texture adaptation using geometric abstraction and example-based techniques, it does not guarantee statistical textural properties, such as spatial uniformity.

Fig. 17: Failure case: degraded decomposition quality. Increasing parameter values in the contour opacity mask increases overall blackness in the image instead of affecting just the contours

6 Conclusions

We introduced a lightweight, differentiable filter pipeline for texture editing, using both manual and example-based style control combined with geometric abstraction techniques. Our approach highlights the benefits of using a decomposed representation of strokes and textures for interactive and exploratory editing. This technique, applied to artistic images, facilitates the convergence of traditional stylistic image filtering and contemporary generative image synthesis, holding the potential to significantly enhance established image editing tools. Future research directions include exploring semantics-guided loss functions and deeper integration with generative image synthesis techniques.