1 Introduction

A picture may be worth a thousand words, but it certainly contains a great deal of diverse information. This comprises not only what is portrayed, e.g., the composition of a scene and the individual objects, but also how it is depicted, referring to the artistic style of a painting or the filters applied to a photo. Especially when considering artistic images, it becomes evident that not only content but also style is a crucial part of the message an image communicates (just imagine van Gogh’s Starry Night in the style of Pop Art). Here, we follow the common wording of our community and refer to ‘content’ as a synonym for ‘subject matter’ or ‘sujet’, the terms preferred in art history. A vision system thus faces the challenge of decomposing and separately representing the content and the style of an image, so that each can be analyzed on its own. The ultimate test for this ability is style transfer [12] – exchanging the style of an image while retaining its content (Fig. 1).

Fig. 1. Evaluating the fine details preserved by our approach. Can you guess which of the cut-outs are from Monet’s artworks and which are generated? The solution is given at the end of the paper.

Fig. 2. Style transfer using different approaches with a single reference style image and with a collection of reference style images. (a) [12] using van Gogh’s “Road with Cypress and Star” as the reference style image; (b) [12] using van Gogh’s “Starry Night”; (c) [12] using the average Gram matrix computed across the collection of Vincent van Gogh’s artworks; (d) [22] trained on the collection of van Gogh’s artworks, alternating the target style image every SGD mini-batch; (e) our approach trained on the same collection of van Gogh’s artworks. Stylizations (a) and (b) depend significantly on the particular style image, whereas using a collection of the style images (c), (d) does not produce visually plausible results, due to oversmoothing over the numerous Gram matrices. In contrast, our approach (e) has learned how van Gogh alters particular content in a specific manner (edges around objects are also stylized, cf. the bell tower).

In contrast to the seminal work of Gatys et al. [12], who relied on powerful but slow iterative optimization, there has recently been a focus on feed-forward generator networks [6, 20, 22, 27, 40, 41, 44]. The crucial representation in all these approaches has been based on a VGG16 or VGG19 network [39], pre-trained on ImageNet [34]. However, a recent trend in deep learning has been to avoid supervised pre-training on a million images with tediously labeled object bounding boxes [43]. In the setting of style transfer this has the particular benefit of avoiding from the outset any bias introduced by ImageNet, which has been assembled without artistic consideration. Rather than utilizing a separate pre-trained VGG network to measure and optimize the quality of the stylistic output [6, 12, 22, 27, 40, 41, 44], we employ an encoder-decoder architecture with an adversarial discriminator, Fig. 3, to stylize the input content image, and we also use the encoder to measure the reconstruction loss. In essence, the stylized output image is run through the encoder again and compared with the encoded input content image. Thus, we learn a style-specific content loss from scratch, which adapts to the specific way in which a particular style retains content and which is therefore more suitable than a comparison in the domain of RGB images [48].

Most importantly, however, previous work has only been based on a single style image. This stands in stark contrast to art history, which understands “style as an expression of a collective spirit” resulting in a “distinctive manner which permits the grouping of works into related categories” [9]. As a result, art history developed a scheme that allows identifying groups of artworks based on shared qualities. Artistic style consists of a diverse range of elements, such as form, color, brushstroke, or use of light. Therefore, it is insufficient to use only a single artwork, because it might not represent the full scope of an artistic style. Today, freely available art datasets such as Wikiart [23] easily contain more than 100K images, thus providing numerous examples for various styles. Previous work [6, 12, 22, 27, 40, 41, 44] has represented style based on the Gram matrix, which captures highly image-specific style statistics, cf. Fig. 2. To combine several style images in [6, 12, 22, 27, 40, 41, 44], one needs to aggregate their Gram matrices. We have evaluated several aggregation strategies, and averaging worked best, Fig. 2(c). But neither art history nor statistics suggests aggregating Gram matrices. Additionally, we investigated alternating the target style images in every mini-batch while training [22], Fig. 2(d). However, none of these methods can make proper use of several style images, because combining the Gram matrices of several images forfeits the details of style, cf. the analysis in Fig. 2. In contrast, our proposed approach allows combining an arbitrary number of instances of a style during training.

We conduct extensive evaluations of the proposed style transfer approach; we quantitatively and qualitatively compare it against numerous baselines. Being able to generate high-quality artistic works in high resolution, our approach produces visually more detailed stylizations than the current state-of-the-art style transfer approaches while still running at real-time inference speed. The results are quantitatively validated by experts from art history and by a deception rate metric, introduced in this paper, which is based on a deep neural network for artist classification.

1.1 Related Work

In recent years, a lot of research effort has been devoted to texture synthesis and style transfer problems. Earlier methods [17] are usually non-parametric and are built upon low-level image features. Inspired by Image Analogies [17], the approaches [10, 28, 37, 38] are based on finding dense correspondences between content and style image and often require image pairs to depict similar content. Therefore, these methods do not scale to the setting of arbitrary content images.

Fig. 3. Encoder-decoder network for style transfer based on the style-aware content loss.

In contrast, Gatys et al. [11, 12] proposed a more flexible iterative optimization approach based on a pre-trained VGG19 network [39]. This method produces high quality results and works on arbitrary inputs, but is costly, since each optimization step requires a forward and a backward pass through the VGG19 network. Subsequent methods [22, 25, 40] aimed to accelerate the optimization procedure [12] by approximating it with feed-forward convolutional neural networks. This way, only one forward pass through the network is required to generate a stylized image. Beyond that, a number of methods have been proposed to address different aspects of style transfer, including quality [4, 13, 21, 44, 46], diversity [26, 41], photorealism [30], combining several styles in a single model [3, 6, 42] and generalizing to previously unseen styles [14, 20, 27, 36]. However, all these methods rely on a fixed style representation captured by the features of a VGG [39] network pre-trained on ImageNet. They therefore require supervised pre-training on millions of labeled object bounding boxes and inherit a bias introduced by ImageNet, which has been assembled without artistic consideration. Moreover, the image quality achieved by the costly optimization in [12] still remains an upper bound for the performance of recent methods. Other works [1, 5, 8, 32, 45] learn how to discriminate different techniques, styles and contents in the latent space. Zhu et al. [48] learn a bidirectional mapping between a domain of content images and paintings using generative adversarial networks. Employing a cycle consistency loss, they directly measure the distance between a backprojection of the stylized output and the content image in the RGB pixel space. Measuring distances in the RGB image domain is generally prone to be coarse and, especially for abstract styles, a pixel-wise comparison of backward-mapped stylized images is not suitable: either content is preserved and the stylized image is not sufficiently abstract, e.g., object boundaries are not altered, or the stylized image has a suitable degree of abstractness and a pixel-based comparison with the content image must fail. Moreover, the more abstract the style, the more potential backprojections into the content domain exist, because this mapping is underdetermined (think of the many possible content images for a single Cubist painting). In contrast, we spare the ill-posed backward mapping of styles and compare stylized and content images in a latent space which is trained jointly with the style transfer network. Since both content and stylized images are run through our encoder, the latent space is trained to only pay attention to their commonalities, i.e., the content present in both. A further consequence of the cycle consistency loss is that content and style images used for training must represent similar scenes [48]; thus, training data preparation for [48] involves tedious manual filtering of samples, while our approach can be trained on arbitrary unpaired content and style images.

2 Approach

To enable a fast style transfer that instantly transfers a content image or even frames of a video according to a particular style, we need a feed-forward architecture [22] rather than the slow optimization-based approach of [12]. To this end, we adopt an encoder-decoder architecture that utilizes an encoder network E to map an input content image x onto a latent representation \(z=E(x)\). A generative decoder G then plays the role of a painter and generates the stylized output image \(y=G(z)\) from the sketchy content representation z. Stylization then only requires a single forward pass, thus working in real-time.
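For concreteness, the data flow \(z=E(x)\), \(y=G(z)\) can be sketched in a few lines of PyTorch. The layer stacks below are simplified placeholders and not the architecture used in our experiments (see Sect. 3 and the supplementary material); only the encoder-decoder structure itself is taken from the text.

```python
import torch
import torch.nn as nn

# Minimal sketch of the feed-forward stylization pass z = E(x), y = G(z).
# The concrete layers are illustrative placeholders, not the exact networks of the paper.

class Encoder(nn.Module):
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(3, 32, 3, stride=1, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(32, 64, 3, stride=2, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(64, 128, 3, stride=2, padding=1), nn.ReLU(inplace=True),
        )

    def forward(self, x):
        return self.net(x)  # latent, style-dependent content representation z

class Decoder(nn.Module):
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(
            nn.Upsample(scale_factor=2, mode='nearest'),
            nn.Conv2d(128, 64, 3, stride=1, padding=1), nn.ReLU(inplace=True),
            nn.Upsample(scale_factor=2, mode='nearest'),
            nn.Conv2d(64, 32, 3, stride=1, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(32, 3, 3, stride=1, padding=1), nn.Tanh(),
        )

    def forward(self, z):
        return self.net(z)  # stylized output image y = G(z)

E, G = Encoder(), Decoder()
x = torch.rand(1, 3, 768, 768)   # content image patch
with torch.no_grad():
    y = G(E(x))                  # stylization in a single forward pass
```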

2.1 Training with a Style-Aware Content Loss

Previous approaches have been limited in that training worked only with a single style image [6, 12, 20, 22, 27, 40, 44] or that style images used for training had to be similar in content to the content images [48]. In contrast, given a single style image \(y_0\) we include a set Y of related style images \(y_j \in Y\), which are automatically selected (see Sect. 2.2) from a large art dataset (Wikiart). We do not require the \(y_j\) to depict similar content as the set X of arbitrary content images \(x_i \in X\), which we simply take from Places365 [47]. Compared to [48], we thus can utilize standard datasets for content and style and need no tedious manual selection of the \(x_i\) and \(y_j\) as described in Sects. 5.1 and 7.1 of [48].

To train E and G we employ a standard adversarial discriminator D [15] to distinguish the stylized output \(G(E(x_i))\) from real examples \(y_j\in Y\),

$$\begin{aligned} \mathcal {L}_D(E, G, D) = \mathop {\mathbb {E}}_{y \sim p_{Y}(y)} \left[ \log {D(y)} \right] + \mathop {\mathbb {E}}_{x \sim p_{X}(x)} \left[ \log {\left( 1 - D(G(E(x)))\right) } \right] \end{aligned}$$
(1)
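A minimal sketch of how Eq. (1) could be evaluated in code; it assumes a discriminator D that outputs probabilities in (0, 1) and adds a small constant inside the logarithms for numerical stability. This illustrates the formula, not the exact training code.

```python
import torch

def adversarial_loss(D, E, G, x, y_real, eps=1e-8):
    """Eq. (1): log D(y) + log(1 - D(G(E(x)))), averaged over the batch.

    D maximizes this quantity, while E and G minimize it (see Eq. (6)).
    Assumes D(...) returns probabilities in (0, 1).
    """
    real_term = torch.log(D(y_real) + eps).mean()            # log D(y)
    fake_term = torch.log(1.0 - D(G(E(x))) + eps).mean()     # log(1 - D(G(E(x))))
    return real_term + fake_term
```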

However, the crucial challenge is to decide which details to retain from the content image, something which is not captured by Eq. 1. Contrary to previous work, we want to directly enforce E to strip the latent space of all image details that the target style disregards. Which details need to be retained or ignored in z thus depends on the style. For instance, Cubism would disregard texture, whereas Pointillism would retain low-frequency textures. Therefore, a pre-trained network or a fixed similarity measure [12] for measuring the similarity in content between \(x_i\) and \(y_i\) violates the art-historical premise that the manner in which content is preserved depends on the style. Similar issues arise when measuring the distance after projecting the stylized image \(G(E(x_i))\) back into the domain X of original images with a second pair of encoder and decoder, \(G_2(E_2(G(E(x_i))))\). The resulting loss proposed in [48],

$$\begin{aligned} \mathcal {L}_{cyc}(E, G, E_2, G_2) = \mathop {\mathbb {E}}_{x \sim p_{X}(x)} \left[ \left\Vert G_2(E_2(G(E(x)))) - x \right\Vert _1 \right] \end{aligned}$$
(2)

fails where styles become abstract, since the backward projection of abstract art to the original image is highly underdetermined.

Fig. 4. 1st row: results of style transfer for different styles. 2nd row: sketchy content visualization reconstructed from the latent space E(x) using the method of [31]. (a) The encoder for Pollock does not preserve much content due to the abstract style; (b) only the rough structure of the content is preserved (coarse patches) because of the distinct style of El Greco; (c) the latent space highlights surfaces of the same color while fine object details are ignored, since Gauguin was less interested in details, often painted plain surfaces and used vivid colors; (d) encodes the thick, wide brushstrokes Cézanne used, but preserves a larger palette of colors. (Color figure online)

Therefore, we propose a style-aware content loss that is optimized while the network learns to stylize images. Since encoder training is coupled with the training of the decoder, which produces artistic images of the specific style, the latent vector z produced for the input image x can be viewed as its style-dependent, sketchy content representation. This latent representation changes during training and hence adapts to the style. Thus, when measuring the similarity in content between the input image \(x_i\) and the stylized image \(y_i = G(E(x_i))\) in the latent space, we focus only on those details which are relevant for the style. Let the latent space have d dimensions; we then define the style-aware content loss as the normalized squared Euclidean distance between \(E(x_i)\) and \(E(y_i)\):

$$\begin{aligned} \mathcal {L}_{c}(E, G) = \mathop {\mathbb {E}}_{x \sim p_{X}(x)} \left[ \frac{1}{d} \left\Vert E(x) - E(G(E(x))) \right\Vert _2^2 \right] \end{aligned}$$
(3)
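Since the latent distance is normalized by the number of dimensions d, Eq. (3) reduces to a mean squared error in the jointly trained latent space. A minimal sketch, assuming E and G are standard PyTorch modules:

```python
import torch

def style_aware_content_loss(E, G, x):
    """Eq. (3): normalized squared Euclidean distance between E(x) and E(G(E(x)))."""
    z = E(x)            # style-dependent content representation of the input
    y = G(z)            # stylized output
    z_stylized = E(y)   # re-encode the stylization with the same encoder
    # torch.mean over all latent entries equals (1/d) * ||E(x) - E(G(E(x)))||^2,
    # averaged over the batch.
    return torch.mean((z - z_stylized) ** 2)
```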

To provide additional intuition for the style-aware content loss, we used the method of [31] to reconstruct the content image from latent representations trained on different styles; the results are illustrated in Fig. 4. It can be seen that the latent space encodes a sketchy, style-specific visual content, which is implicitly used by the loss function. For example, Pollock is famous for his abstract paintings, and reconstruction (a) shows that the latent space ignores most of the object structure; Gauguin was less interested in details, painted a lot of plain surfaces and used vivid colors, which is reflected in reconstruction (c), where the latent space highlights surfaces of the same color and fine object details are ignored.

Since we train our model for altering the artistic style without supervision and from scratch, we now introduce an extra signal to initialize training and boost the learning of the primary latent space. The simplest choice is an autoencoder loss which computes the difference between \(x_i\) and \(y_i\) in RGB space. However, this loss would impose a high penalty on any change in image structure between input \(x_i\) and output \(y_i\), because it relies only on low-level pixel information. But we aim to learn image stylization and want the encoder to discard certain details in the content depending on the style. Hence the autoencoder loss would contradict the purpose of the style-aware loss, where the style determines which details to retain and which to disregard. Therefore, we propose to measure the difference after applying a weak image transformation to \(x_i\) and \(y_i\), which is learned while learning E and G. We inject into our model a transformer block \(\mathbf {T}\), which is essentially a one-layer fully convolutional neural network taking an image as input and producing a transformed image of the same size. We apply T to the images \(x_i\) and \(y_i=G(E(x_i))\) before measuring the difference. We refer to this as the transformed image loss and define it as

$$\begin{aligned} \mathcal {L}_{t}(E, G) = \mathop {\mathbb {E}}_{x \sim p_{X}(x)} \left[ \frac{1}{CHW} \left\Vert T(x) - T(G(E(x))) \right\Vert _2^2 \right] \end{aligned}$$
(4)

where \(C \times H \times W\) is the size of image x; for training, T is initialized with uniform weights.
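The following sketch illustrates one possible realization of the transformer block T and of Eq. (4). The kernel size and the exact value of the uniform initialization are assumptions made here for illustration; the text only specifies a one-layer fully convolutional network with uniform initial weights.

```python
import torch
import torch.nn as nn

class TransformerBlock(nn.Module):
    """One-layer fully convolutional transformation T, learned jointly with E and G."""
    def __init__(self, channels=3, kernel_size=9):  # kernel size is an assumption
        super().__init__()
        self.conv = nn.Conv2d(channels, channels, kernel_size,
                              padding=kernel_size // 2, bias=False)
        # Uniform initialization: T starts out as a simple blur-like averaging.
        nn.init.constant_(self.conv.weight, 1.0 / (channels * kernel_size ** 2))

    def forward(self, img):
        return self.conv(img)  # transformed image of the same size

def transformed_image_loss(T, x, y):
    """Eq. (4): (1 / CHW) * ||T(x) - T(y)||^2, averaged over the batch, with y = G(E(x))."""
    return torch.mean((T(x) - T(y)) ** 2)
```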

Figure 3 illustrates the full pipeline of our approach. To summarize, the full objective of our model is:

$$\begin{aligned} \mathcal {L}(E, G, D) = \mathcal {L}_c(E, G) + \mathcal {L}_{t}(E, G) + \lambda \mathcal {L}_D(E, G, D) , \end{aligned}$$
(5)

where \(\lambda \) controls the relative importance of adversarial loss. We solve the following optimization problem:

$$\begin{aligned} E, G = \arg \;\min \limits _{E,G}\;\max \limits _{D} \mathcal {L}(E, G, D). \end{aligned}$$
(6)
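One alternating optimization step for Eq. (6) might then look as follows. This is a schematic sketch: the modules E, G, D, T, the optimizers, and the assumption that T's parameters are updated together with E and G are placeholders; only the loss composition of Eq. (5), with λ = 0.001 as in Sect. 3, follows the text.

```python
import torch

def training_step(E, G, D, T, x, y_style, opt_EG, opt_D, lam=0.001, eps=1e-8):
    """One alternating update of Eq. (6): D maximizes L_D, (E, G) minimize the full loss.

    T's parameters are assumed to be included in opt_EG, since T is learned jointly.
    """
    # --- Discriminator update (maximize L_D, i.e. minimize its negative) ---
    opt_D.zero_grad()
    y_fake = G(E(x)).detach()
    loss_D = -(torch.log(D(y_style) + eps).mean()
               + torch.log(1.0 - D(y_fake) + eps).mean())
    loss_D.backward()
    opt_D.step()

    # --- Encoder/decoder update (minimize L_c + L_t + lambda * L_D) ---
    opt_EG.zero_grad()
    z = E(x)
    y = G(z)
    loss_c = torch.mean((z - E(y)) ** 2)           # style-aware content loss, Eq. (3)
    loss_t = torch.mean((T(x) - T(y)) ** 2)        # transformed image loss, Eq. (4)
    loss_adv = torch.log(1.0 - D(y) + eps).mean()  # generator part of Eq. (1)
    loss = loss_c + loss_t + lam * loss_adv
    loss.backward()
    opt_EG.step()
    return loss_D.item(), loss.item()
```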

2.2 Style Image Grouping

In this section we explain an automatic approach for gathering a set of related style images. Given a single style image \(y_0\), we strive to find a set Y of related style images \(y_j \in Y\). Contrary to [48], we avoid tedious manual selection of style images and follow a fully automatic approach. To this end, we train a VGG16 [39] network C from scratch on the Wikiart [23] dataset to predict the artist of a given artwork. The network is trained on the 624 largest (by number of works) artists from the Wikiart dataset. Note that our ultimate goal is stylization, and numerous artists can share the same style, e.g., Impressionism, just as a single artist can exhibit different styles, such as the different stylistic periods of Picasso. However, we do not use any style labels. Artist classification is in this case the surrogate task for learning meaningful features in the artworks’ domain, which makes it possible to retrieve artworks similar to the image \(y_0\).

Let \(\phi (y)\) be the activations of the fc6 layer of the VGG16 network C for an input image y. To obtain a set of style images related to \(y_0\) from the Wikiart dataset \(\mathcal {Y}\), we retrieve all nearest neighbors of \(y_0\) based on the cosine distance \(\delta \) of the activations \(\phi (\cdot )\), i.e.

$$\begin{aligned} Y = \{ y \; |\; y \in \mathcal {Y}, \delta (\phi (y), \phi (y_0)) < t \}, \end{aligned}$$
(7)

where \(\delta (a, b) = 1 - \frac{a \cdot b}{\Vert a\Vert _2 \Vert b\Vert _2}\) denotes the cosine distance and t is the \(10\%\) quantile of all pairwise distances in the dataset \(\mathcal {Y}\).
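In code, the grouping of Eq. (7) amounts to a simple nearest-neighbor search. Below is a sketch under the assumption that the fc6 activations \(\phi(y)\) have already been extracted for the whole dataset; the exact feature extraction and quantile computation are not specified in the text.

```python
import torch
import torch.nn.functional as F

def group_style_images(features, query_idx, quantile=0.10):
    """Eq. (7): indices of images whose cosine distance to the query image
    is below the 10% quantile t of all pairwise distances in the dataset.

    features: (N, d) tensor of fc6 activations phi(y) for all dataset images.
    """
    feats = F.normalize(features, dim=1)
    sims = feats @ feats.t()                 # pairwise cosine similarities
    dists = 1.0 - sims                       # cosine distances delta
    # Threshold t: 10% quantile of all pairwise distances (self-distances excluded).
    off_diag = dists[~torch.eye(len(feats), dtype=torch.bool)]
    t = torch.quantile(off_diag, quantile)
    return torch.nonzero(dists[query_idx] < t).flatten().tolist()
```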

3 Experiments

To compare our style transfer approach with the state of the art, we first perform an extensive qualitative analysis; then we provide quantitative results based on the deception score and on evaluations by experts from art history. Afterwards, in Sect. 3.3, we ablate single components of our model and show their importance.

Implementation Details: The basis for our style transfer model is an encoder-decoder architecture, cf. [22]. The encoder network contains 5 conv layers: \(1 {\times }\) conv-stride-1 and \(4{\times }\) conv-stride-2. The decoder network has 9 residual blocks [16], 4 upsampling blocks and \(1 {\times }\) conv-stride-1. For the upsampling blocks we use a sequence of nearest-neighbor upscaling and conv-stride-1 instead of fractionally strided convolutions [29], which tend to produce heavier artifacts [33]. The discriminator is a fully convolutional network with \(7{\times }\) conv-stride-2 layers. For a detailed description of the network architecture we refer to the supplementary material. We set \(\lambda =0.001\) in Eq. 5. During training we sample \(768 \times 768\) content image patches from the training set of Places365 [47] and \(768 \times 768\) style image patches from the Wikiart [23] dataset. We train for 300000 iterations with batch size 1, learning rate 0.0002 and the Adam [24] optimizer. The learning rate is reduced by a factor of 10 after 200000 iterations.
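As a point of reference, an upsampling block of the kind described above could be written as follows. The channel numbers and the use of instance normalization are illustrative assumptions; only the nearest-neighbor upscaling followed by a stride-1 convolution, chosen instead of fractionally strided convolutions, is taken from the text.

```python
import torch.nn as nn

def upsample_block(in_ch, out_ch):
    """Nearest-neighbor upscaling followed by a stride-1 convolution,
    used instead of fractionally strided convolutions to reduce checkerboard artifacts."""
    return nn.Sequential(
        nn.Upsample(scale_factor=2, mode='nearest'),
        nn.Conv2d(in_ch, out_ch, kernel_size=3, stride=1, padding=1),
        nn.InstanceNorm2d(out_ch),   # normalization choice is an assumption
        nn.ReLU(inplace=True),
    )
```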

Baselines: Since we aim to generate high-resolution stylizations, we run style transfer with our method and all baselines on input images of size \(768\times 768\), unless otherwise specified. We did not exceed this resolution in the comparison, because some of the other methods already reached the GPU memory limit. We optimize Gatys et al. [12] for 500 iterations using L-BFGS. For Johnson et al. [22] we used the implementation of [7] and trained a separate network for every reference style image on the same content images from Places365 [47] as our method. For Huang et al. [20], Chen et al. [4] and Li et al. [27], the implementations and pre-trained models provided by the authors were used. Zhu et al. [48] was trained on exactly the same content and style images as our approach using the source code provided by the authors. Methods [4, 12, 20, 22, 27] utilized only one example per style, as they cannot benefit from more (cf. the analysis in Fig. 2).

Fig. 5. Results from different style transfer methods. We compare methods on different styles and content images.

3.1 Qualitative Results

Full Image Stylization: In Fig. 5 we demonstrate the effectiveness of our approach for stylizing different contents with various styles. Chen et al. [4] work on overlapping patches extracted from the content image, swapping the features of the original patch with the features of the most similar patch in the style image, and then average the features in the overlapping regions, thus producing an over-smoothed image without fine details (Fig. 5(d)). [20] produces a lot of repetitive artifacts, especially visible on flat surfaces, cf. Fig. 5(e, rows 1, 4–6). Method [27] fails to understand the content of the image and applies different colors in the wrong locations (Fig. 5(f)). Methods [22, 48] often fail to alter the content image, and their effect may be characterized as a shift of the color histogram, e.g., Fig. 5(g, rows 3, 7; c, rows 1, 3–4). One reason for such failure cases of [48] is the loss in the RGB pixel space, based on the difference between a backward mapping of the stylized output and the content image. Another reason is that we utilized the standard Places365 [47] dataset and did not hand-pick training content images, as is advised for [48]. Thus, artworks and content images used for training differed significantly in their content, which is the ultimate test for a stylization that truly alters the input and goes beyond a direct mapping between regions of content and style images. The optimization-based method [12] often works better than the other baselines, but produces a lot of prominent artifacts that make details of the stylizations look unnatural, cf. Fig. 5(b, rows 4, 5, 6). This is due to an explicit minimization of the loss directly on the pixel level. In contrast, our model can handle not only styles which have salient, easy to spot characteristics, but also styles, such as El Greco’s Mannerism, with less graspable stylistic characteristics, where other methods fail (Fig. 5, b–g, 5th row).

Fine-Grained Style Details: In Fig. 7 we show zoomed-in cut-outs from the stylized images. Interestingly, the stylizations of methods [4, 12, 19, 20, 27] do not change much across styles (compare Fig. 7(d, f–i, rows 1–3)). Zhu et al. [48] produce more diverse images for different styles, but obviously cannot alter the edges of the content (blades of grass are clearly visible in all the cut-outs in Fig. 7(e)). Figure 7(c) shows the stylized cut-outs of our approach, which exhibit significant changes from one style to another. Another interesting example is the style of Pollock, Fig. 7 (row 8), where the style-aware loss allows our model to properly alter content to the point of discarding it – as would be expected from a Pollock action painting. Our approach is able to generate high-resolution stylizations with many style-specific details and retains those content details which are necessary for the style.

Fig. 6. Artwork examples of the early artistic period of van Gogh (a) and his late period (c). Style transfer of the content image (1st column) onto the early period is presented in (b) and onto the late period in (d).

Fig. 7. Details from stylized images produced for different styles for a fixed content image (a). (b) is our entire stylized image, (c) the zoomed-in cut-out and (d)–(i) the same region for competitors. Note the variation across different styles along the column for our method compared to other approaches. This highlights the ability to adapt content (not just colors or textures) where demanded by a style. Fine-grained artistic details with sharp boundaries are produced, while the original content edges are altered.

Style Transfer for Different Periods of van Gogh: We now investigate our ability to properly model fine differences in style despite using a group of style images. To this end, we take two reference images, Fig. 6(a) and (c), from van Gogh’s early and late period, respectively, and acquire related style images for both from Wikiart. It can be clearly seen that the stylizations produced for the two periods, Fig. 6(b, d), are markedly different and indeed depict the content in correspondence with the style of the early (b) and late (d) periods of van Gogh. This highlights that collections of style images are properly used and do not lead to an averaging effect.

High-Resolution Image Generation: Our approach allows us to produce high-quality stylized images in high resolution. Figure 8 illustrates an example of a generated piece of art in the style of Berthe Morisot with a resolution of \(1280\times 1280\). The result exhibits a lot of fine details, such as color transitions of the oil paint and brushstrokes of different sizes. More HD images are in the supplementary material.

Fig. 8. High-resolution image (1280 \(\times \) 1280 pix) generated by our approach. A lot of fine details and brushstrokes are visible. A style example is in the bottom left corner.

Real-Time HD Video Stylization: We also apply our method to several videos. Our approach can stylize HD videos (\(1280 \times 720\)) at 9 FPS. Figure 9 shows stylized frames from a video. We did not use temporal regularization, to show that our method produces equally good results for consecutive frames with varying appearance without extra constraints. Stylized videos are in the supplementary material.

Fig. 9. Results of our approach applied to the HD video of Eadweard Muybridge’s “The Horse in Motion” (1878). Every frame was processed independently (no smoothing or post-processing) by our model in the style of Picasso. Video resolution \(1920 \times 1280\) pix; the original aspect ratio was changed here to save space.

3.2 Quantitative Evaluation

Style Transfer Deception Rate: While several metrics [2, 18, 35] have been proposed to evaluate the quality of image generation, no evaluation metric has so far been proposed for the automatic evaluation of style transfer results. To measure the quality of the stylized images, we introduce the style transfer deception rate. We use a VGG16 network trained from scratch to classify the 624 artists on Wikiart. The style transfer deception rate is calculated as the fraction of generated images which the network classifies as artworks of the artist for whom the stylization was produced. For a fair comparison with other approaches, which used only one style image \(y_0\) (hence only one artist), we restricted Y to contain only samples from the same artist as the query example \(y_0\). We selected 18 different artists (i.e. styles). For every method we generated 5400 stylizations (18 styles, 300 per style). In Table 1 we report the mean deception rate over the 18 styles. Our method achieves 0.393, significantly outperforming the baselines. For comparison, the mean accuracy of the network on held-out real images of the aforementioned 18 artists from Wikiart is 0.616.
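The deception rate itself is a simple fraction. Below is a sketch of its computation, assuming a hypothetical artist classifier (here called C, analogous to the network of Sect. 2.2) that returns logits over the 624 Wikiart artists; batching and data handling are placeholders.

```python
import torch

def deception_rate(C, stylized_images, target_artist_id, batch_size=16):
    """Fraction of stylized images that the artist classifier C assigns
    to the artist whose style was transferred.

    stylized_images: tensor of shape (N, 3, H, W).
    """
    hits, total = 0, 0
    with torch.no_grad():
        for i in range(0, len(stylized_images), batch_size):
            batch = stylized_images[i:i + batch_size]
            preds = C(batch).argmax(dim=1)                     # predicted artist ids
            hits += (preds == target_artist_id).sum().item()
            total += len(batch)
    return hits / total
```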

Human Art History Experts Perceptual Studies: Three experts (each with a PhD in art history and a focus on modern and pre-modern paintings) compared the results of our method against recent work. Each expert was shown 1000 groups of images. Each group consists of stylizations generated by the different methods from the same content and style images. Experts were asked to choose the one image which best and most realistically reflects the target style. The score is computed as the fraction of times a specific method was chosen as the best in a group. We calculate the mean expert score for each method over the 18 different styles and report it in Table 1. The experts selected our method in around \(50\%\) of the cases.

Speed and Memory: Table 2 shows the time and memory required to stylize a single image of size \(768\times 768\) px for the different methods. Our approach and those of [22, 48] have comparable speed and only very modest demands on GPU memory compared to modern graphics cards.

Table 1. Mean deception rate and mean expert score for different methods. The higher the better.
Table 2. Average inference time and GPU memory consumption, measured on a Titan X Pascal, for different methods with batch size 1 and input image of \(768\times 768\) pix.

3.3 Ablation Studies

Effect of Different Losses: We study the effect of the different components of our model in Fig. 10. Removing the style-aware content loss significantly degrades the results, (c). We observe that without the style-aware loss training becomes unstable and often stalls. If we remove the transformed image loss, which we introduced for a proper initialization of our model trained from scratch, we notice mode collapse after 5000 iterations. Training directly with a pixel-wise L2 distance causes a lot of artifacts (grey blobs and a flaky structure), (d). Training only with a discriminator exhibits neither the variability in the painting nor in the content, (e). We therefore conclude that both the style-aware content loss and the transformed image loss are critical for our approach.

Fig. 10. Different variations of our method for Gauguin stylization. See Sect. 3.3 for details. (a) Content image; (b) full model (\(\mathcal {L}_c\), \(\mathcal {L}_{rgb}\) and \(\mathcal {L}_D\)); (c) \(\mathcal {L}_{rgb}\) and \(\mathcal {L}_D\); (d) without transformer block; (e) only \(\mathcal {L}_D\); (f) trained with all of Gauguin’s artworks as style images. Please zoom in to compare.

Fig. 11. Encoder ablation studies: (a) stylization using our model; (b) stylization using the pre-trained VGG16 encoder instead of E.

Single vs Collection of Style Images: Here, we investigate the importance of the style image grouping. First, we trained a model with only one style image of Gauguin, which led to mode collapse. Second, we trained with all of Gauguin’s artworks as style images (without utilizing the style grouping procedure). This produced unsatisfactory results, cf. Fig. 10(f), because the style images comprised several distinct styles. We therefore conclude that, to learn a good style transfer model, it is important to group style images according to their stylistic similarity.

Encoder Ablation: To investigate the effect of our encoder E, we substitute it with a VGG16 [39] encoder (up to conv5_3) pre-trained on ImageNet. The VGG encoder retains features that separate object classes (since it was trained discriminatively), as opposed to our encoder, which is trained to retain style-specific content details. Hence, our encoder is not biased towards class-discriminative features, but is style-specific and trained from scratch. Figure 11(a, b) shows that our approach produces better results than the variant with a pre-trained VGG16 encoder.

4 Conclusion

This paper has addressed major conceptual issues in state-of-the-art approaches to style transfer. We overcome the limitation of relying on only a single style image and the need for style and content training images to show similar content. Moreover, we go beyond a mere pixel-wise comparison of stylized images and beyond models that are pre-trained on millions of ImageNet bounding boxes. The proposed style-aware content loss enables real-time, high-resolution, encoder-decoder based stylization of images and videos and significantly improves stylization by capturing how style affects content.

Solution to Fig. 1: patches 3 and 5 were generated by our approach, others by artists.