Photographic style transfer

Image style transfer has attracted much attention in recent years. However, results produced by existing works still contain many distortions. This paper investigates CNN-based artistic style transfer specifically and finds that the distortions have two main causes: the loss of spatial structures of the content image during the content-preserving process, and unexpected geometric matching introduced by the style transformation process. To tackle this problem, this paper proposes a novel approach consisting of a dual-stream deep convolution network as the loss network and edge-preserving filters as the style fusion model. Our key contribution is the introduction of an additional similarity loss function that constrains both the detail reconstruction and style transfer procedures. The qualitative evaluation shows that our approach successfully suppresses the distortions and obtains faithful stylized results compared to state-of-the-art methods.


Introduction
Image style transfer has shown a promising future for new forms of image manipulation. The neural artistic style transformation method proposed by Gatys et al. [7] has achieved great success with convolutional neural networks and has been followed by many recent works [2,3,8,11,15,29,30,31,34]. They produce convincing visual results by transferring artistic features from a reference painting onto the content photograph. However, these artistic style transfer methods suffer from visual distortion, even when both the content and reference style images are photographic. The stylized results always contain visually intricate distortions that give them a painting-like look. Luan et al. [17] point out that the distortions appear only in the style transformation process, and they thus propose a photorealism regularization term based on locally affine colour transformations to reconstruct fine content details. To avoid the unexpected geometric matching, Luan et al. [17] integrate semantic segmentation masks into Gatys et al.'s method [7]. Although the content spatial structures are preserved in many situations, details and exact shapes of structures are erased when the semantic segmentation is inaccurate or contains overlapping areas. In addition, computing the matting Laplacian matrix and the semantic segmentation consumes much extra time for high-quality output.
After investigating the style transformation procedure, we discover that the distortions occur at two stages: the spatial structures of the content image may be lost during the content-preserving process, and unexpected geometric matching can be introduced during the style transformation process. Figures 1 and 2 illustrate that the distortions occur in both the content-preserving and style transformation processes. For example, as shown in zoom-ins (c-ii), the buildings of the content image are obviously distorted by the content-preserving process. Moreover, as shown in zoom-ins (c-iii), the buildings are also distorted after the style transformation process. However, the buildings in (c-iii) have different shapes and edges from those in (c-ii) produced by the content-preserving process, which means the zoomed-in buildings are distorted twice.
To improve the photorealism, this paper introduces an additional similarity layer with a corresponding loss function to constrain both the content preservation and style transformation processes. This similarity layer is added at several places in the convolutional neural networks to prevent distortions by minimizing a similarity loss function together with the other loss functions proposed in the fast neural style algorithm [13].
The entire proposed method consists of two stages: detail reconstruction process and style transfer process. Our system has two key components: a dual-stream deep convolution network as Loss Network and edge-preserving filters as style fusion model (SFM). The edge-preserving filter is used to extract details and colour information of the outputs generated from the loss network, which means our scheme combines the details without colour from content and the colour without details from reference style. During the optimization process, the content and style features are captured first by the additional layers in loss network, and then a random white noise image X is passed through both detail reconstruction and style transfer networks. The final output of SFM is the stylized result.
The main contributions of this paper are as follows: we investigate the problem of Gatys et al.'s method and find that the lost photorealism of the stylized result is caused by distortions occurring at both the content preservation and style transformation stages; we propose a photographic style transfer method that is capable of improving the photorealism of stylized results. A similarity loss function using the L1-norm is applied to reconstruct finer content details and prevent the geometric mismatching problem, and a style fusion model using an edge-preserving filter is utilized to reduce artefacts.

Related work
Global colour transfer methods. Global colour transfer methods tend to utilize spatially invariant objective functions to transfer images. Input images with simple styles can be processed well by these algorithms [9,12,22,23]. For example, a colour shift technique proposed by Reinhard et al. [23] can extract global features in a decorrelated colour space from the reference style image and transfer them onto the content input. Pitié et al. [22] propose an approach that also achieves global style transfer by matching full 3D colour histograms between images with a series of 1D histogram transformations. Although these methods can handle several simple situations such as tone curves (e.g., low or high contrast), they are limited in their ability to match complex areas with corresponding colour styles.
Local colour transfer methods. Local colour transfer research uses spatial colour mapping techniques such as semantic segmentation [10,16,17,21,27,28,32] to handle various applications such as semantic colour gradient transfer (dark and bright) [10,21,27,32], transfer of artistic edits [1,24,26,33], and painting stylistic features [3,4,7,13,15,31,34]. Many of them [7,10,13,15,17,28,31,34] use convolutional neural networks to achieve this goal.
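As a rough illustration of the global statistics matching idea attributed to Reinhard et al. [23] above, the following minimal sketch shifts and scales each channel of a content image to the per-channel mean and standard deviation of a style image. It is a simplification under stated assumptions: the function name `match_channel_statistics` is hypothetical, the arrays are random stand-ins for real images, and the conversion to a decorrelated colour space used in [23] is omitted, so the statistics are matched in whatever colour space the input arrays are given.

```python
import numpy as np

def match_channel_statistics(content, style):
    """Shift and scale each channel of `content` to the mean/std of `style`.

    Both inputs are H x W x C float arrays; their spatial sizes may differ.
    Assumption: no colour-space conversion is applied here.
    """
    content = content.astype(np.float64)
    style = style.astype(np.float64)
    c_mean, c_std = content.mean(axis=(0, 1)), content.std(axis=(0, 1))
    s_mean, s_std = style.mean(axis=(0, 1)), style.std(axis=(0, 1))
    eps = 1e-8  # guards against division by zero on flat channels
    return (content - c_mean) / (c_std + eps) * s_std + s_mean

# Random stand-ins for a content image and a reference style image.
rng = np.random.default_rng(0)
content = rng.uniform(0.0, 1.0, size=(32, 32, 3))
style = rng.uniform(0.2, 0.6, size=(24, 40, 3))
result = match_channel_statistics(content, style)
```

Because the mapping is the same for every pixel, it cannot assign different colour styles to different regions, which is the limitation of global methods noted above.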
Gatys et al. [7] achieve groundbreaking performance in painterly style transfer [15,34] by using the responses of activation layers to represent features from input images. Our work focuses mainly on photographic style transfer, especially the preservation of the photorealistic attribute, which distinguishes it from their painting-like style transformation [3,7,13,15]. The artistic stylized results are compelling; however, because of the distortion problem, the photorealism is lost when these artistic style methods are naively applied to photographic style transfer. To improve the photorealism, Luan et al. [17] recently propose a photographic style transfer method which uses semantic segmentation and a post-processing step to solve the distortion problem. Mechrez et al. [19] propose to use the Screened Poisson Equation to replace Luan et al.'s post-processing step and preserve more precise content details than Luan et al.'s results. Liao et al. [16] propose a photorealistic style transfer method for sophisticated images, which is based on finding the nearest neighbour field on deep features extracted from a CNN. Our work follows from the neural style algorithm [7] and presents better results than the aforementioned methods.

Method
This section presents the architecture of our approach and the key loss functions to constrain both detail reconstruction and style transfer processes.

Architecture
Gatys et al. [7] propose an image transformation network with convolutional neural networks to transform an input image into an output image. Their network architecture includes a pre-trained VGG-19 network [25] and two loss layers. The layers learn feature representations of the input images and compute the representation differences between a generated image and the inputs. Their algorithm adds two additional layers, a content layer and a style layer, which capture and store feature representations of the inputs. Then, a random white noise image initialized with the same size as the content input is fed into the network. The loss functions compute the distance between the feature representations of the generated image and those of the content and reference style inputs separately. The derivatives of the loss terms are propagated back through the loss network for the next iteration until the maximum iteration number is reached. Similar to this optimization-based approach, our basic network uses the pre-trained VGG-16 network [25] as our loss network. The content loss function and perceptual loss functions in [13] are used in our network. In addition, we add another layer with a pixel-level loss function into our network. Moreover, we also add a style fusion model as our post-processing step to reduce artefacts. Our network is an optimization-based approach designed for arbitrary style and content image pairs. Thus, it does not need a training process.

Fig. 1 Given a reference style image and a content image as inputs, photographic style transfer seeks to generate output with photorealistic attribute, which should preserve both the context of content and style of reference. Gatys et al. [7] succeed in transferring style colour but introduce distortions to the context of output. In comparison, our method transfers faithful style colour while also preserving the photorealistic attribute. (c-ii) shows that a introduces distortions into reconstructed content details, and (c-iii) shows that b distorts details of a

Fig. 3 Framework overview. We use the loss network to preserve content and transfer style from inputs to outputs. The loss functions are added into the pre-trained VGG-16 network [25]; they are computed at certain layers and backpropagated through the loss network during the optimization process. For example, $L_{style}^{\mathrm{relu1\_2}}$ computes the feature representation differences between the random white noise image $X$ and the style image $I_s$, where relu1_2 denotes the placement of the style layer in the VGG-16 network. Then, the derivative of $L_{style}^{\mathrm{relu1\_2}}$ is propagated back to the ST network
As shown in Fig. 3, our framework consists of two components: a dual-stream convolution network as the loss network and a style fusion model. The loss network is composed of two parallel deep convolution networks and several additional layers. A scalar value $\ell_i(y, y_t)$ of the loss function at layer $i$ is computed to measure the Euclidean distance between the output image $y$ and the target image $y_t$ ($y_t$ can be the content image or the reference style image). For the dual-stream loss network, we refer to the upper deep convolution network as the detail reconstruction network (DR network), which is designed for preserving the content details. The lower convolution network is referred to as the style transfer network (ST network), which aims to transfer style information, mainly colour, from the reference style image to the content input. As shown on the right side of Fig. 3, the style fusion model (SFM) also has two components: a detail filter and a style filter, which take the outputs of the two parallel deep networks as their respective inputs.
Inputs and outputs. For the DR network, the inputs are one photograph as the content image $I_c$ and one random white noise image $X_{DR}$ with the same size as $I_c$, and the output is one image $O_c$. For the ST network, the inputs are one photograph as the content image $I_c$, one random white noise image $X_{ST}$ with the same size as $I_c$ and one photograph as the style image $I_s$. The output is one image $O_s$. $X_{DR}$ and $X_{ST}$ are initialized by the random white noise image $X$. For the detail filter, the input is the output $O_c$ of the DR network, and the input of the style filter is the output $O_s$ of the ST network. The output of the entire SFM is one image $O_{fusion}$.
Additional layers. There are three different types of additional layers in total: content layers, style layers and similarity layers. The content and similarity layers carry loss functions for preserving content features from $I_c$ in $O_c$, and the style layers hold the loss functions that transfer stylistic features from $I_s$ to $O_s$.

Loss functions
In general, we define three different loss terms for two purposes: (1) to preserve the content feature information F as structure details and reconstruct them on $X_{DR}$; (2) to learn the reference style features and correctly match them to $X_{ST}$.
Layers in a convolutional neural network define nonlinear filter banks that encode the input image. Hence, the feature representations in a neural network are the filter responses to the input image [18]. We assume that a layer has D different filters, and each filter has a size M, where M is height times width. For the reconstruction of features, let $F_i$ be the feature representations captured at the ith activation layer of the DR network when $I_c$ is being processed. Then, $F_i$ is a feature map of size $D_i \times M_i$. The feature reconstruction loss $L_{feat}$ is the squared and normalized Euclidean distance between the feature representations of $X$ and the target $I_c$, computed over the set L of activation layers containing the feature loss. This term helps to minimize the visual distinguishability between the random image $X$ and the target image $I_c$. However, as this reconstruction is from high layers [18], the rough spatial structure of the content image can be preserved, but details, especially the exact shapes of the structure, are lost. For the same convolutional neural network architecture, Zhao et al. [35] demonstrate that using an L1-norm loss in the spatial constraint preserves the spatial structures better than using the L2-norm. Hence, we introduce another similarity preserved loss $L_{simi}$ based on mean absolute error (L1-norm) into the loss network. We found that an L1-norm loss employed outside of the network makes the style transformation output lose the colour information from the style image. Hence, we add the L1-norm loss inside the network. Let MAE be the mean absolute error between the feature representations of $X$ and $I_c$ at the jth activation layer of the loss network; the similarity preserved loss $L_{simi}$ is then defined from MAE over the set L of activation layers added as similarity layers. This loss term measures how much information of the target $I_c$ is lost by $X$; minimizing it reconstructs as many exact pixels of $I_c$ in $X$ as possible.
As mentioned above, reconstructing content features with only $L_{feat}$ is not enough to preserve precise details, especially the exact edges inside structures. Figures 4 and 5 demonstrate the effect of $L_{simi}$.
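The two content-side loss terms described above can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function names are hypothetical, the feature maps are random arrays standing in for VGG-16 activations, and two details are assumptions because the text does not spell them out, namely that the per-layer terms are summed over the layer set and that the feature loss is normalized by the feature-map size $D_i M_i$.

```python
import numpy as np

def feature_loss(feats_x, feats_c):
    """Squared Euclidean distance per layer, normalized by D_i * M_i (assumed),
    summed over the layers that carry the feature loss (assumed)."""
    total = 0.0
    for fx, fc in zip(feats_x, feats_c):
        total += np.sum((fx - fc) ** 2) / fx.size
    return total

def similarity_loss(feats_x, feats_c):
    """Mean absolute error (L1-norm) per layer, summed over the similarity layers."""
    total = 0.0
    for fx, fc in zip(feats_x, feats_c):
        total += np.mean(np.abs(fx - fc))
    return total

# Stand-in feature maps of shape (D, M) for two layers.
rng = np.random.default_rng(0)
feats_c = [rng.normal(size=(4, 16)), rng.normal(size=(8, 4))]
feats_x = [f + 0.5 for f in feats_c]  # every response is off by 0.5

l_feat = feature_loss(feats_x, feats_c)
l_simi = similarity_loss(feats_x, feats_c)
```

With a uniform offset of 0.5, each layer contributes 0.25 to the squared term but 0.5 to the absolute term, which illustrates why an L1-based term penalizes small residuals relatively more strongly than an L2-based one.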
[Figure caption fragment: b shows two insets of a and c (in that order), respectively. We may notice that c preserves more precise context of input than a]

For the transformation of style, we need an effective representation of style in the reference image. Following [6], we use correlations in feature space as the representation of style. These feature correlations are given by the Gramian Matrix. Let $G_k$ be the Gramian Matrix of the vectorized feature map $F_k$ at the kth activation layer of the ST network when the input $x$ is being processed, where the vectorized feature map $F_k$ is reshaped to $D_k \times H_k W_k$. We define the Gramian Matrix $G_k(x)$ in terms of $F_k(x)$ and N, where N is the total number of pixels of $F_k(x)$. The Gramian Matrix is the inner product between feature maps at the kth activation layer, which gives the feature correlations. The style loss $L_{style}$ is then the squared Frobenius norm of the difference between the Gramian Matrices of the random image $X_{ST}$ and the target $I_s$, computed over the set L of activation layers holding the style loss. The style loss is well defined even for different sizes of $X_{ST}$ and $I_s$, since $G_k(x)$ always has the same $D_k \times D_k$ size. As demonstrated in [6], the generated output preserves only the stylistic features of the target image, which means the spatial structure of the target image cannot be preserved by minimizing the style loss.
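The Gramian Matrix and style loss described above can be sketched as below. This is an illustration under assumptions rather than the authors' code: the function names are hypothetical, the feature maps are random stand-ins for ST network activations, N is taken to be $H_k W_k$ (one possible reading of "the total number of pixels"), and the per-layer terms are simply summed.

```python
import numpy as np

def gram_matrix(feature_map):
    """Map a (D, H, W) feature map to its (D, D) Gramian Matrix.

    Assumption: the inner products are divided by N = H * W.
    """
    d, h, w = feature_map.shape
    f = feature_map.reshape(d, h * w)  # vectorize each of the D feature maps
    return f @ f.T / (h * w)

def style_loss(feats_x, feats_s):
    """Squared Frobenius norm of the Gramian difference, summed over layers."""
    total = 0.0
    for fx, fs in zip(feats_x, feats_s):
        diff = gram_matrix(fx) - gram_matrix(fs)
        total += np.sum(diff ** 2)
    return total

rng = np.random.default_rng(0)
feat_x = rng.normal(size=(8, 16, 16))
feat_s = rng.normal(size=(8, 10, 24))  # different spatial size, same channel count
loss = style_loss([feat_x], [feat_s])
```

The example uses feature maps of different spatial sizes on purpose: both Gramian Matrices are 8 x 8, so the loss is well defined, matching the remark about differently sized $X_{ST}$ and $I_s$.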
In this paper, $L_{feat}$ and $L_{simi}$ are used to constrain the detail reconstruction procedure, which preserves the spatial structures and exact details, such as shapes and edges inside the content image, in the output $O_c$ [shown as (c) in Fig. 4]. These two loss terms form $L_{DR}$, the joint loss of the DR network. $L_{style}$, $L_{feat}$ and $L_{simi}$ constrain the style transformation procedure, which generates the output $O_s$ with stylistic features (mainly colour information) from the reference image and detailed features from the content image. The combination of the three loss terms forms $L_{ST}$, the joint loss of the ST network. In these two joint loss terms, $\alpha_f$ and $\alpha_d$ denote the weights of the content layers and similarity layers in the DR network, and $\beta_f$, $\beta_d$ and $\beta_s$ denote the weights of the three corresponding layers in the ST network. The implementation details of these parameters are introduced in Sect. 4. In previous research [12,22], the output of the prior process contains stylistic features from the reference style, and these features are distributed according to the semantic structures of the content input. Hence, the style transformation procedure in our ST network learns stylistic features and also distributes them into the semantic structures, which requires both the style loss term and the detail reconstruction loss terms. One example result from the ST network is shown as (c) in Fig. 6.
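Assuming that each joint loss is a plain weighted sum of its component terms (the natural reading of the weights described above, though the exact form is an assumption here), the two joint losses can be written as:

```latex
\begin{align}
L_{DR} &= \alpha_f \, L_{feat} + \alpha_d \, L_{simi}, \\
L_{ST} &= \beta_f \, L_{feat} + \beta_d \, L_{simi} + \beta_s \, L_{style}.
\end{align}
```

In $L_{DR}$ the terms are evaluated on $X_{DR}$ against $I_c$; in $L_{ST}$ the feature and similarity terms are evaluated on $X_{ST}$ against $I_c$, and the style term on $X_{ST}$ against $I_s$.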

Style fusion model
In Sect. 1, we mention that the distortions are introduced by both the detail preservation and style transformation procedures. We use $L_{simi}$ to prevent geometric mismatching; however, the output of the ST network may still exhibit distortion and noise artefacts due to the content-style trade-off (shown in Fig. 8). To reduce the artefacts, we apply a refinement technique, the style fusion model (SFM), in our approach. The edge-preserving filter (recursive filter) proposed by Gastal et al. [5] is capable of effectively smoothing away noise or textures while retaining sharp edges, which makes it a suitable technique for reducing artefacts. We thus use the edge-preserving filter (recursive filter) [5] to smooth both output images $O_c$ and $O_s$ with guidance $O_c$. In this paper, we refer to the smoothing processes of $O_c$ and $O_s$ as the detail filter and the style filter, respectively. The final result $O_{fusion}$ is obtained from the outputs of these two filters, where $\sigma_s$ denotes the spatial standard deviation and $\sigma_r$ denotes the range standard deviation of the edge-preserving filter [5]. As shown in (e) in Fig. 6, the clear stylized result $O_{fusion}$ obtained by our SFM is free of the artefacts.
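The fusion step can be illustrated with the following minimal sketch, which rests on several assumptions. The fusion rule used here (detail layer of $O_c$ plus the smoothed $O_s$) is inferred from the description of combining details without colour from the content with colour without details from the style output; it is not a confirmed form of the fusion equation. A brute-force joint bilateral filter is swapped in for the recursive filter of [5], the images are small grayscale arrays, and the function names and parameter values are hypothetical and are not the paper's $\sigma_s = 60$, $\sigma_r = 1$.

```python
import numpy as np

def joint_bilateral(image, guide, sigma_s=2.0, sigma_r=0.1, radius=4):
    """Smooth `image` using spatial weights and range weights taken from `guide`.

    Stand-in for an edge-preserving filter; both inputs are 2D float arrays.
    """
    h, w = image.shape
    pad_img = np.pad(image, radius, mode="edge")
    pad_gd = np.pad(guide, radius, mode="edge")
    num = np.zeros((h, w), dtype=np.float64)
    den = np.zeros((h, w), dtype=np.float64)
    for dy in range(-radius, radius + 1):
        for dx in range(-radius, radius + 1):
            ys, xs = radius + dy, radius + dx
            sh_img = pad_img[ys:ys + h, xs:xs + w]   # neighbour values
            sh_gd = pad_gd[ys:ys + h, xs:xs + w]     # neighbour guide values
            w_s = np.exp(-(dx * dx + dy * dy) / (2.0 * sigma_s ** 2))
            w_r = np.exp(-((sh_gd - guide) ** 2) / (2.0 * sigma_r ** 2))
            num += w_s * w_r * sh_img
            den += w_s * w_r
    return num / den  # den >= 1 because the centre pixel has weight 1

def style_fusion(o_c, o_s, sigma_s=2.0, sigma_r=0.1):
    """Assumed fusion rule: details of O_c added to the smoothed O_s."""
    base_c = joint_bilateral(o_c, o_c, sigma_s, sigma_r)  # detail filter
    base_s = joint_bilateral(o_s, o_c, sigma_s, sigma_r)  # style filter, guided by O_c
    return (o_c - base_c) + base_s

rng = np.random.default_rng(0)
o_c = np.zeros((16, 16))
o_c[:, 8:] = 1.0                                    # a vertical step edge
o_c += 0.02 * rng.standard_normal(o_c.shape)        # fine content detail
o_s = 0.5 * o_c + 0.2 + 0.05 * rng.standard_normal(o_c.shape)  # recoloured, noisy
o_fusion = style_fusion(o_c, o_s)
```

Under this rule, feeding the same image as both $O_c$ and $O_s$ returns that image unchanged, since the detail layer and the base layer add back up to the original.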

Implementation details
This section describes the implementation details of our approach. We choose the pre-trained VGG-16 network [25] as the basic architecture of our DR network and ST network. The content layer with $L_{feat}$ is added at the relu3_3 activation layer, and the style layers with $L_{style}$ are added at the relu1_2, relu2_2, relu3_3 and relu4_3 activation layers. The similarity layers are added at the relu1_2, relu2_2 and relu3_3 activation layers. For the DR network, we add content and similarity layers into the pre-trained VGG-16 network and choose parameters $\alpha_f = 5$ and $\alpha_d = 10^3$ for the detail reconstruction. For the ST network, we add content, similarity and style layers into the pre-trained VGG-16 network and choose $\beta_f = 5$, $\beta_d = 10$ and $\beta_s = 100$ for the style transformation. We use $\sigma_s = 60$ (default in the public source code) and $\sigma_r = 1$ for the edge-preserving filter [5] in SFM. The effects of the parameters $\alpha_d$, $\beta_d$ and $\sigma_r$ are illustrated in Figs. 7, 8 and 9, respectively.

Fig. 5 The similarity function for preventing geometric mismatching problem. a is the stylized result without similarity loss, and b is the stylized result with similarity loss. Note that the zoom-in regions show that the similarity loss effectively prevents the unexpected geometric matching

Fig. 6 The style fusion model for reducing noise artefacts and avoiding distortions. a is the reconstructed content output of our DR network, and b is the extracted details (white points) of content without colour from a. c is the stylized output of our ST network, and d is the extracted colour without details from c. e is the fusion stylized result from SFM. We may notice that c still exhibits noise (red rectangles) and distortion (green rectangles) artefacts due to content-style trade-off (please refer to Fig. 8). However, the final stylized result (e) is free of noise and distortion artefacts. We recommend readers to view the electronic version
Fig. 7 The effect of parameter $\alpha_d$ for our DR network. Note that the reconstructed content result achieves the highest PSNR at $\alpha_d = 10^3$. Lower and larger values decrease the accuracy of the reconstructed result. Hence, we find the best parameter $\alpha_d = 10^3$ for our DR network and use it to produce all the other results in this paper

Fig. 8 The effect of parameter $\beta_d$ for content-style trade-off. A lower $\beta_d$ value cannot prevent unexpected geometric matching. For example, the regions of tower tops (green rectangles) in a and b. A larger $\beta_d$ value loses the style of reference image. For example, the buildings (red rectangles) in d and e have undesired dark colour style, which should be in the golden light style. Note that the stylized result at $\beta_d = 1 \times 10^1$ still exhibits some distortion and noise artefacts but they will be eliminated by SFM. We thus choose $\beta_d = 1 \times 10^1$ to produce our style transformation result of the ST network and all the other results in this paper. We recommend readers to view the electronic version

Fig. 9 The effect of parameter $\sigma_r$ for SFM. Note that a lower $\sigma_r$ value cannot prevent noise artefacts, for example, red rectangles in a and b, and a larger $\sigma_r$ value suppresses the transferred style, for instance, green rectangles in d and e. We found the best parameter $\sigma_r = 1$ to produce our result and all the other results in our paper

Fig. 10 Placements for similarity layers in DR network. a-d show the reconstructed content results with similarity layers at different places in our DR network. Note that the reconstructed result achieves the highest PSNR score at relu1_2, relu2_2, relu3_3. Hence, we place similarity layers at relu1_2, relu2_2, relu3_3 in our DR network for all the experiments in this paper

We use a random white noise image $X$ ($X_{DR}$ and $X_{ST}$ represent $X$ for the DR network and the ST network, respectively) with the same size as the content image as our initialized input, and choose the Adam [14] optimization algorithm with learning rate 1 and 1000 iterations in the optimization process for all our experiments in this paper. All the inputs, including $I_c$, $I_s$ and $X$, are scaled to width = 512 if their widths are over 512; otherwise, they remain at their original resolution. The dual-stream convolution networks run the optimization process at the same time, and the optimization time is around 2.5 min on our GPU card (NVIDIA GeForce GTX 1060, 6G GDDR5). The whole optimization process only needs one content image and one reference style image without any limitation on resolution.
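The input scaling rule and the optimization settings described in this section (Adam, learning rate 1, 1000 iterations, random white noise initialization) can be sketched as follows. This is a toy under stated assumptions: the helper names are hypothetical, the aspect ratio is assumed to be preserved when scaling, and a pixel-space mean squared error to a fixed target stands in for the VGG-16 based joint losses, which are not reproduced here.

```python
import numpy as np

def target_size(width, height, max_width=512):
    """Scale to max_width only if the image is wider (aspect ratio kept, assumed)."""
    if width <= max_width:
        return width, height
    scale = max_width / width
    return max_width, int(round(height * scale))

def adam_optimize(x0, grad_fn, lr=1.0, iterations=1000,
                  beta1=0.9, beta2=0.999, eps=1e-8):
    """Standard Adam updates applied directly to the image pixels."""
    x = x0.astype(np.float64).copy()
    m = np.zeros_like(x)  # first-moment estimate
    v = np.zeros_like(x)  # second-moment estimate
    for t in range(1, iterations + 1):
        g = grad_fn(x)
        m = beta1 * m + (1.0 - beta1) * g
        v = beta2 * v + (1.0 - beta2) * g * g
        m_hat = m / (1.0 - beta1 ** t)  # bias correction
        v_hat = v / (1.0 - beta2 ** t)
        x -= lr * m_hat / (np.sqrt(v_hat) + eps)
    return x

rng = np.random.default_rng(0)
target = rng.uniform(0.0, 255.0, size=(16, 16))   # stand-in for the loss target
x_init = rng.uniform(0.0, 255.0, size=(16, 16))   # random white noise start

def mse(x):
    return float(np.mean((x - target) ** 2))

def mse_grad(x):
    return 2.0 * (x - target) / x.size

initial_loss = mse(x_init)
x_final = adam_optimize(x_init, mse_grad, lr=1.0, iterations=1000)
final_loss = mse(x_final)
```

In the actual method the gradient would come from backpropagating the joint loss through the loss network rather than from a closed-form pixel loss.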

Results
This section discusses the selection of hyperparameters, the placement of similarity layers, and comparisons between our method and state-of-the-art methods in terms of global and local colour transfer.

Fig. 11 Placements for similarity layers in ST network. a-c show the stylized results with similarity layers at different places in our ST network. Note that a presents a worse stylized result than b and c, as the centre area of the blanket and the upper walls are not in the golden style colour. It is difficult to tell whether b or c achieves the better style transformation, as they produce very similar style transfer results (we conducted a series of other experiments; please refer to our supplemental materials for more details). We thus choose to place similarity layers at relu1_2, relu2_2, relu3_3 in our ST network, which keeps the same placements as the DR network

The effect of hyperparameters
Figures 7 and 8 demonstrate the effect of parameters $\alpha_d$ and $\beta_d$, respectively. As shown in Fig. 7, the reconstructed content result achieves the highest PSNR (peak signal-to-noise ratio) value when $\alpha_d = 10^3$. We thus choose $\alpha_d = 10^3$ to reconstruct content details in our DR network. In Fig. 8, a lower $\beta_d$ value still produces a stylized result with the geometric mismatching problem. Conversely, too large a $\beta_d$ value produces a less stylized result. Hence, we use $\beta_d = 10$ to produce our stylized result and all the other results in this paper. Figures 10 and 11 illustrate the choices of similarity layers in our DR network and ST network, respectively. For the DR network, we choose to place similarity layers at relu1_2, relu2_2, relu3_3 as this achieves the highest PSNR score. For the ST network, the stylized results (b) and (c) have a very similar style transformation appearance, and we thus choose to place similarity layers at relu1_2, relu2_2, relu3_3 in our ST network, which keeps the same placements as the DR network. The implementation details of our networks are described in Tables 1, 2, and 3.
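For reference, the PSNR used above to compare reconstructed content results can be computed as in the following minimal sketch. The function name is hypothetical, and the peak value of 255 and the averaging of the squared error over all pixels and channels are assumptions, since the text does not state how PSNR was evaluated.

```python
import numpy as np

def psnr(reference, estimate, max_value=255.0):
    """Peak signal-to-noise ratio in dB: 10 * log10(max_value^2 / MSE)."""
    reference = np.asarray(reference, dtype=np.float64)
    estimate = np.asarray(estimate, dtype=np.float64)
    mse = np.mean((reference - estimate) ** 2)
    if mse == 0.0:
        return float("inf")  # identical images
    return 10.0 * np.log10(max_value ** 2 / mse)

reference = np.zeros((4, 4))
estimate = reference + 1.0       # every pixel is off by one grey level
score = psnr(reference, estimate)
```

A higher PSNR means the reconstructed output $O_c$ is closer to the content image $I_c$, which is the criterion used to select $\alpha_d$ and the similarity layer placement.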

Comparisons
This section presents several comparisons between state-of-the-art methods and ours.
Comparison between representative artistic style transfer methods and ours. We compare Gatys et al. [7] and Ghiasi et al. [8] with ours across widely differing content images in Fig. 12. Our results preserve content structures with more precise details than the prior artistic methods. For example, our results contain all the details of the ceiling lamp, frescoes, carpets and railings, which are not reconstructed well by Gatys et al. [7] and Ghiasi et al. [8]. To illustrate the ability to preserve precise details, the third row compares our method with the prior artistic style transfer methods on content and reference style images with rich details. Our method reconstructs almost every detail in the content image and transfers the colour style faithfully, while Gatys et al. [7] and Ghiasi et al. [8] lose many details. The detail representations in the other examples also show our strong ability to reduce distortions and preserve content spatial structures.
Comparison between representative global colour transfer methods and ours. In Fig. 13, we compare our method with representative global colour transfer algorithms such as Reinhard et al. [23] and Pitié et al. [22]. Both apply a global colour mapping technique to match the colour statistics of the content input and the reference style image. However, they cannot obtain faithful colour transformation results when the inputs have spatially varying contents, which limits their applications. For example, in the second row of Fig. 13 [...]

Comparison between representative local photographic style transfer methods and ours. In Fig. 14, we compare our method ([7]+ours) with the state-of-the-art methods of Luan et al. [17] and Liao et al. [16]. The approaches proposed by Luan et al. [17] and Liao et al. [16] are the latest methods that effectively avoid the distortion problem. Our method preserves more precise content details than Luan et al. Examples include the plants in the first row, the characters of the postcard in the third row and the windows in the bottom row. Our method may not obtain more faithful transformation results, but it achieves the highest score on photorealism. Please refer to the user study in Sect. 5.3 for more details and to our supplemental materials for more scores. All the stylized results (including the user study) of Luan et al. [17] use manual semantic segmentation masks provided by the authors and parameter $\lambda = 10^4$ (default value in Luan et al.'s paper). We further compare our method with Luan et al. using different $\lambda$ values on the images in Fig. 12; please refer to our supplemental materials for more details.

[Figure caption fragment: [...] [17] and ours ([17]+[5]). Our method effectively handles the posterization effect of Luan et al. [17]. All examples from Luan et al. [17] dataset. We recommend readers to view the electronic version]
Luan et al. [17] propose a two-stage photographic style transfer method which extends Gatys et al.'s artistic style transfer method. Their first stage integrates semantic segmentation into the neural style method [7] for object-to-object colour transfer, and their second stage applies a post-processing step to improve the photorealism of the stylized result obtained from the first stage. In terms of local object-to-object colour transfer, our similarity loss function may not transfer colour between objects as faithfully as manual semantic segmentation. However, our edge-preserving filter [5] used in SFM may help Luan et al.'s results avoid the posterization artefacts. In Fig. 15, we show the stylized results obtained by applying the edge-preserving filter [5] to the results of Luan et al.'s first stage. For example, our method effectively prevents the posterization artefacts on the buildings in the first row, the water in the second row and the forehead in the third row.
In Fig. 17 [...] In Fig. 18, we demonstrate that our method is robust in preserving content spatial details and achieving faithful style transformation results. Note that (c) still preserves the details of the first content input well even after two style transformation processes with different reference style images, and the photorealism of (c) is also well preserved.
Limitation. Our method is unable to transfer colour faithfully between images that are semantically similar for human observers but have highly complex spatial variation. In Fig. 19, we show some failure cases. For example, the blanket and floor in the first row fail to be transferred into the brown and white colour style.

[Figure caption fragment: [...] [16] and ours ([17]+[5]). Our method preserves finer content details than Luan et al. [17] and transfers style more faithfully than Liao et al. [16]. All examples from Luan et al. [17] dataset. We recommend readers to view the electronic version]

[Figure caption fragment: [...] [19] and ours ([17]+[5]) (in that order). We recommend readers to view the electronic version]
For photorealism, the score ranges from "1: definitely not photorealistic" to "4: definitely photorealistic". For style faithfulness, the score ranges from "1: definitely not style faithful to reference style" to "4: definitely style faithful to reference style". Each participant is asked to score the stylized results of 6 methods in a random order. In total, 44 different scenes (excluding unrealistic and repeated scenes) are selected from Luan et al. [17] dataset.
In Fig. 20, we show the average score and standard deviation of each method. For photorealism, our method ([7]+ours) and Liao et al. [16] rank 1st and 2nd, respectively. Luan et al. [17] and Pitié et al. [22] have the worst performance regarding photorealism as their results exhibit some artefacts. For style faithfulness, Luan et al. [17] and our method ([17]+[5]) rank 1st and 2nd, respectively. Our edge-preserving filter [5] used in SFM slightly lowers the style faithfulness score of Luan et al. [17], but the result still achieves a higher score than Liao et al. [16]. Moreover, it significantly improves the photorealism score of Luan et al.'s results. Reinhard et al. [23] and Pitié et al. [22] perform the worst in style faithfulness because of their limitations on sophisticated images.

Conclusions
We investigate why the photorealism of stylized results is lost even when photographic images are input to Gatys et al.'s method [7], and we find that both the content preservation and style transformation stages distort images and thus lose the photorealistic attribute. Hence, we propose a photographic style transfer method that constrains the detail reconstruction and style transformation processes by introducing a similarity loss function. This similarity loss function not only preserves the exact details and structures of the content image but also prevents the mismatch of texture patches between the reference style and content images. The qualitative evaluation on Luan et al.'s [17] dataset shows that our proposed approaches are capable of preventing the distortions effectively and obtaining faithful stylized results as well.