Image Resizing by Reconstruction from Deep Features

Traditional image resizing methods usually work in pixel space and use various saliency measures. The challenge is to adjust the image shape while preserving important content. In this paper we perform image resizing in feature space, where the deep layers of a neural network contain rich semantic information. We directly adjust the image feature maps, extracted from a pre-trained classification network, and reconstruct the resized image using a neural-network based optimization. This novel approach leverages the hierarchical encoding of the network and, in particular, the high-level discriminative power of its deeper layers, which recognize semantic objects and regions and allow their aspect ratio to be maintained. Our use of reconstruction from deep features diminishes the artifacts introduced by image-space resizing operators. We evaluate our method on benchmarks, compare it to alternative approaches, and demonstrate its strength on challenging images.


Introduction
The media resizing problem has been widely studied in the last decade, and many content-aware methods have been developed [1,33,26,32,23,10,16,27,34,22,4,29]. The main objective of these methods is to change the size of the input while maintaining the appearance of important regions, such as salient objects, and reducing visual artifacts. These two objectives can be seen as two quality measures that sometimes conflict. The first measures how semantically close the resulting image is to the original, in terms of preserving its important parts; the second measures how closely the result resembles a natural image, in terms of the absence of artifacts (see [3]).
Most techniques first employ a saliency measure to decide which regions of the image are more important. Then, an image resizing operator is used to create the resized image while preserving these regions and, ideally, introducing fewer artifacts. Both of these steps are still challenging. First, common saliency measures account for low-level features only, while disregarding important high-level semantics. Second, current resizing operators do not directly account for the second quality measure of maintaining the natural look of the resulting image.
Figure 1: Given an input image (a), our deep network resizing method first adjusts the size of the feature maps of a deep neural network (b), while protecting important semantic regions, and then reconstructs a retargeted image using iterative optimization (c-f). Note how, starting from a linearly scaled image (c), the iterations manage to reconstruct the shape of the bicycle (d-f), which is the main semantic object in the image, while minimizing artifacts.
In this work we present Deep Network Resizing (DNR), a method that addresses the two aforementioned challenges using neural networks. First, we exploit the ability of pre-trained networks to analyze and encode both low-level and high-level features in order to identify important parts of the image. In addition, we employ a back-propagation aided optimization method to directly preserve both the structure of important regions and the natural image appearance of the result. This reduces many of the artifacts that exist in traditional approaches, and integrates analysis and synthesis based on neural networks in an image resizing technique.
The key idea of DNR is that instead of applying image resizing operators to the pixels of the image, it applies them in feature space, on the feature maps of deep layers of a pre-trained Convolutional Neural Network (Figure 2). This allows content removal to concentrate on regions of the image that are semantically irrelevant. We show that DNR discards insignificant data, which in turn preserves the semantic encoding of the input image. The operator we use to demonstrate our approach is the seam carving algorithm [1].
After the image is reconstructed using back-propagation based optimization we perform a refinement step. In this step, a grid-sampler layer is used, allowing only a change in the mapping of pixels while optimizing for the same objective. This step increases the natural look of the resulting images, by further reduction of artifacts.
Our main contributions are:
• Utilizing the semantic guidance of deep layers of a CNN for image importance in resizing.
• Applying seam-carving in feature-space instead of image-space.
• Reducing artifacts of reconstructed images by optimization using grid-sampling.
• Presenting Deep Network Resizing, a method for image resizing using neural networks.

Related Work
Image Processing Techniques: A considerable body of work on content-aware media retargeting has been carried out in the field of image processing, and it is commonly classified into discrete methods [1,26,23,27] and continuous methods [33,32,16,10,34,22] (refer to [14,5] for comprehensive coverage of content-aware retargeting).
Discrete methods: Avidan et al. [1] introduced seam carving, which performs retargeting by repeatedly inserting or removing connected paths of pixels (called seams) that pass through low-importance regions. Later, Rubinstein et al. [26] improved seam carving using a look-forward energy map, which measures the amount of energy introduced by seam removal or insertion. Pritch et al. [23] introduced shift-maps for pixel rearrangement, and formulated a graph-labeling problem for various image editing applications, including retargeting. Rubinstein et al. [27] combined different retargeting operators by finding paths in a multi-dimensional space, where each path dictates a sequence of retargeting operations on the input media.
Continuous methods: Wolf et al. [33] introduced a map, determined by three importance measures, that is used to devise a system of linear equations defining a mapping of source pixels to their corresponding locations in the target image. Wang et al. [32] compute a deformed mesh grid by assigning a scale factor to each quad in the grid. They proposed two penalties that encourage the solution to linearly scale quads of high importance and allow higher deformation of low-importance quads. Krähenbühl et al. [16] use an energy map, consisting of many automatic constraints and user-defined constraints on key frames, in order to compute a non-uniform, pixel-accurate warp of video streams. Guo et al. [10] define a saliency-based triangle mesh representation, and solve a constrained mesh parametrization problem to compute the retargeting solution. Wu et al. [34] detect symmetric parts in the image and then apply a summarization operation to the symmetric regions and warping to the non-symmetric parts. Panozzo et al. [22] use an axis-aligned representation, which avoids the optimization complexity of a 2D parametric representation of mesh deformation. The authors [22] then find the deformation parameters by solving a simple quadratic problem with linear constraints.
One possible approach of using deep learning to solve the retargeting problem would be to gather a training set of original and retargeted images and use supervised learning. However, it is very difficult to gather such a set as each image must support numerous retargeting sizes and there is no ground-truth method to apply this. Moreover, using manual retargeting may produce different results for different artists.
Cho et al. [4] proposed a weakly- and self-supervised learning approach for image retargeting. The authors use the semantic encoding of pre-trained networks and devise a decoder that produces an attention map. The attention map is then combined with a shift-layer in order to obtain the target image. Unlike DNR, the authors [4] train their network on a given dataset, where the objective is to minimize structural damage while maintaining the detection score of the image, as given by the pre-trained CNN. DNR, however, performs per-input analysis, and presents a solution that uses the strengths of deep learning in understanding the semantics of the image and in correcting images. DNR utilizes different retargeting operators to produce a feature representation of the target image (see comparison in Figure 15). In a more recent paper, Shocher et al. [29] proposed a Generative Adversarial Network (GAN) method for synthesizing images that can be considered a type of retargeting. The authors learn the patch distribution of the input image, and use it to generate images whose patch statistics are similar to those of the input image. However, the resulting image can have a different structure than the original image and still contain some artifacts.
Figure 2: Conventional resizing approaches act in image space (blue arrow), while our deep-resizing approach (red) applies resizing in the semantic feature space. We map image I to feature maps F(I) in feature space using a CNN. Then, we resize in feature space to create F(O). Lastly, we use back-propagation optimization to reconstruct O. In other words, instead of reconstructing the original image I from F(I) (green arrow), we reconstruct the resized image O from F(O), which is the hypothetical mapping of O to feature space.
An independent work was recently proposed in [19], in which the authors also perform retargeting in feature space. However, there are two fundamental differences between their work and ours. First, they perform retargeting by sampling columns of deep feature maps at a constant rate, whereas we combine several deep retargeting operators. Second, the authors adapt the methods of [18] and warp the input image using PatchMatch [2], whereas our image is reconstructed via a pure synthesis procedure (see comparison in Figure 16).

Method
Conventional image resizing methods apply pixel manipulations to the image. In this work, we propose a new approach, where resizing is applied in feature space, and the results are mapped back into image space by reconstruction (see Figure 2). Our key idea is to leverage the deep features of a pre-trained CNN, which encode valuable latent semantics. By applying the resizing operators in feature space we create target feature maps in which semantic information is kept unharmed. To reconstruct the output image we use an optimization that iteratively minimizes the difference between the target feature maps and the actual feature maps of the optimized image.
Let $I$ be an input image of size $(h, w)$. Assume we use a pre-trained deep network with $L$ layers. We define the activation values of all neurons in level $i$, applied to input $I$, as the $i$-th feature map $F_i(I)$. $F(I)$ is the set of all feature maps for $1 \le i \le L$:
$$F(I) = \{F_i(I)\}_{i=1}^{L}.$$
Each feature map $F_i(I)$ has a certain number of channels, and a spatial dimension that depends on the size of the input $I$. We denote by $(h_i^I, w_i^I, c_i)$ the height, width, and number of channels of the $i$-th feature map.
Given the target size $(h', w')$, the task of resizing in image space is to obtain an image $O$ of size $(h', w')$, while maintaining important regions in $I$ and reducing artifacts as much as possible. The resizing task in feature space is defined as obtaining a set of target feature maps:
$$F' = \{F'_i\}_{i=1}^{L}.$$
To obtain the actual resized image $O$ we assume that $F' = F(O)$, the hypothetical mapping of $O$ to feature space, and reconstruct $O$ by minimizing the difference between the output feature maps and the target feature maps using back-propagation. Since important regions in various levels are maintained in the target feature maps $F'$, the reconstructed image $O$ preserves them as well. Lastly, to maintain the natural appearance of the target image and reduce artifacts, we apply a grid-sampler [13] that further optimizes the constructed image.
An overview of DNR is illustrated in Figure 3. The input image (top left) is fed into a pre-trained CNN (top right) and its feature maps are extracted. Applying deep resizing operators to selected layers yields the target feature maps, illustrated in yellow in the figure. The target image (bottom left) is constructed by an optimization carried out using back-propagation: the resulting image is iteratively fed into the CNN, and an $L_2$ loss is computed by comparing the feature maps of the optimized image with the target feature maps. This loss is back-propagated through the network to alter the target image over several iterations (depicted by a series of snapshots at the bottom of Figure 3).
In the following, we only discuss narrowing the width of the image by applying feature resizing. Similar arguments can be extended for any other target-size resizing.

Feature Maps resizing
In our resizing method we adapt seam carving [1] and apply it to the feature maps F(I). Guided by the feature maps F(I), we utilize seam-carving conservatively, avoiding semantic regions. Doing so may lead to a partially retargeted image; therefore, we also perform a final resizing step on the reconstructed image using grid-warping [33]. This combination allows us to harness the capabilities of the two operators: seam-carving enables the removal of homogeneous unimportant regions, and grid-warping deforms the remaining content while distorting regions of high importance less.
Deep Seam Carving. Seam-carving in image space finds vertical seams as minimal one-pixel-wide connected paths on some importance map of the input image. Removing one vertical seam reduces the image's width by one pixel. Therefore, multiple vertical seams are removed to reach the desired width of the output image.
We extend the seam-carving algorithm by defining seam-carving on a feature map instead of an image. First, instead of removing pixels from an image we remove neurons from the CNN layer of the feature map. Second, because a feature map contains multiple channels, we define a seam removal as removing all neurons of the chosen seam at the same spatial location in all channels of the feature map. Third, to find minimal seams on feature maps we use a hierarchical method to define the importance map of each layer. Starting from the deepest, lowest-resolution level, which contains high-level semantic information, we move to shallower, higher-resolution layers that contain low-level features, and refine the seams from previous layers consistently.
The basic importance map of layer $l$ at position $(i, j)$ is defined as the L2-norm of the activation of the neurons along the channel axis:
$$M_l(i, j) = \lVert F_l(I)(i, j, *) \rVert_2, \tag{1}$$
where $*$ denotes all values along the channel axis (see Figure 4). We start by applying seam-carving on the deepest layer $L$ in the hierarchy using the importance map defined in Equation 1. This map is useful since deep-layer neurons have higher activation in semantic regions. As we move up the hierarchy from level $l$ to level $l - 1$, we keep track of all seams that were removed from $F_l(I)$. Denote by $SC_l = \{s_1, \ldots, s_n\}$ the set of all chosen seams at level $l$ (an example of one chosen seam is indicated in yellow in Figure 5a).
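As an illustration of the definitions above, the following is a minimal NumPy sketch rather than the authors' implementation. It assumes a feature map stored as an (H, W, C) array; the function names `importance_map`, `find_vertical_seam` and `remove_seam` are hypothetical, and the seam search is the standard dynamic-programming formulation of seam carving [1] applied to the channel-wise L2 norm.

```python
import numpy as np

def importance_map(feat):
    """Channel-wise L2 norm of an (H, W, C) feature map, as in Equation 1."""
    return np.linalg.norm(feat, axis=-1)

def find_vertical_seam(imp):
    """Column index per row of the minimal 8-connected vertical seam."""
    h, w = imp.shape
    cost = imp.astype(float).copy()
    for r in range(1, h):
        prev = cost[r - 1]
        left = np.concatenate(([np.inf], prev[:-1]))   # neighbour at column c-1
        right = np.concatenate((prev[1:], [np.inf]))   # neighbour at column c+1
        cost[r] += np.minimum(prev, np.minimum(left, right))
    seam = np.empty(h, dtype=int)
    seam[-1] = int(np.argmin(cost[-1]))
    for r in range(h - 2, -1, -1):                     # backtrack upwards
        c = seam[r + 1]
        lo, hi = max(c - 1, 0), min(c + 2, w)
        seam[r] = lo + int(np.argmin(cost[r, lo:hi]))
    return seam

def remove_seam(feat, seam):
    """Remove the seam's neurons at the same spatial location in every channel."""
    h, w, c = feat.shape
    keep = np.ones((h, w), dtype=bool)
    keep[np.arange(h), seam] = False
    return feat[keep].reshape(h, w - 1, c)

# Toy feature map: every activation is at least 1, except a zeroed column 2.
rng = np.random.default_rng(0)
feat = rng.random((4, 6, 3)) + 1.0
feat[:, 2, :] = 0.0
seam = find_vertical_seam(importance_map(feat))
carved = remove_seam(feat, seam)
```

In this toy input the zero-activation column is the only zero-cost path, so the seam follows it and the carved map is one column narrower in every channel.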
To find the minimal seams on $F_{l-1}(I)$, we consider a modified importance map $M^S_{l-1}$ at level $l - 1$ that reduces the importance of regions that are part of the receptive field of the deep seams in level $l$. This attracts the seams at level $l - 1$ to pass through the same regions and be consistent with the seams of level $l$ (see Figure 5). The new map scales the basic importance by a constant factor inside those regions and leaves it unchanged elsewhere:
$$M^S_{l-1}(i, j) = \begin{cases} \alpha \, M_{l-1}(i, j) & \text{if } (i, j) \text{ lies in the receptive field of some } s \in SC_l, \\ M_{l-1}(i, j) & \text{otherwise,} \end{cases} \tag{2}$$
where $\alpha \in [0, 1)$ is the scaling factor. Thus, the importance maps in the finer layers inherit the information from the deeper layers, which implicitly constrains the selection of seams in the finer levels (Figure 6).
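The hierarchical refinement described above can be sketched as follows. This is a hedged illustration under a simplifying assumption that is not taken from the paper: the receptive field of a deep-layer position is approximated by the stride-by-stride block it covers in the finer layer (nearest-neighbour upsampling of the seam mask). The names `seam_mask` and `modified_importance` are hypothetical, and seams are assumed to be indexed in the coordinates of the uncarved deep map.

```python
import numpy as np

def seam_mask(shape, seams):
    """Boolean (H, W) mask of all positions covered by the chosen seams SC_l."""
    mask = np.zeros(shape, dtype=bool)
    rows = np.arange(shape[0])
    for seam in seams:
        mask[rows, seam] = True
    return mask

def modified_importance(imp_fine, deep_mask, alpha=0.5, stride=2):
    """Scale the finer-level importance by alpha wherever a deeper seam passes.

    ASSUMPTION: the receptive field is approximated by upsampling the deep
    mask by `stride`; the finer map must be at most `stride` times larger.
    """
    up = np.repeat(np.repeat(deep_mask, stride, axis=0), stride, axis=1)
    up = up[: imp_fine.shape[0], : imp_fine.shape[1]]
    return np.where(up, alpha * imp_fine, imp_fine)

# One seam through column 1 of a 2x3 deep map, projected onto a 4x6 finer map.
deep = seam_mask((2, 3), [np.array([1, 1])])
fine_importance = np.ones((4, 6))
ms = modified_importance(fine_importance, deep, alpha=0.5)
```

With a uniform finer map, columns 2 and 3 drop to 0.5 while the rest stay at 1, so the next seam search is drawn towards the region already chosen at the deeper level.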
Grid Warping. Grid warping in image space is applied by first dividing the image into a grid of cells and then scaling each cell linearly using a different scaling factor. The scaling factors must adhere to the following two requirements. First, the total width of the scaled cells must match the target width $w'$. Second, the scaling factor of each cell should be proportional to the cell's importance. The first requirement guarantees that the resulting image size will match the target size, while the second requirement ensures less distortion in parts with high importance. For a change of image width, the initial width of each cell is given by $w_G$, and each cell is assigned a scaling factor $\sigma_{i,j} \in [0, 1]$, which specifies by how much the cell's width will be decreased. The actual resizing is applied using a linear scale: the width of cell $(i, j)$ is reduced by multiplying it by the scaling factor $\sigma_{i,j}$. In practice, it is useful to perform grid warping for a width change by splitting the image into column-cells, defining only one cell in each column and one scaling factor $\sigma_i$. Otherwise, different cells in the same column may be distorted differently, which may lead to jittery results.
To define the importance value $\mu_i$ of each column-cell $i$, we aggregate the importance maps calculated by Equation 1 over all layers, from the deepest to the first, by upsampling the deeper-layer maps to fit the size of the image. The values are then normalized to define the scaling factors $\sigma_i$, subject to the two requirements above.
Deep Multi-Operator. The combination of deep seam-carving and grid-warping is done by preventing deep seam-carving from removing seams with semantic content. To achieve this, we terminate the seam removal once the next seam's total importance is above a given threshold. However, we keep the same ratio of removed seams to the original width of the feature map in all layers, meaning that a different number of seams is removed in each layer. Once deep seam-carving is terminated and the image is reconstructed, we apply grid-warping to the intermediate resulting image to reach the final output in the desired size.
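To make the two requirements on the column scaling factors concrete, here is a small sketch of one possible normalisation. It is an assumption for illustration only and is not the formula used in the paper: the most important column keeps its width, and every other column gives up width in proportion to how far its importance falls below the maximum. The function name `column_scaling_factors` and its parameters are hypothetical.

```python
import numpy as np

def column_scaling_factors(mu, cell_width, target_width):
    """Hypothetical scaling factors for column-cells (illustrative only).

    Guarantees sum(sigma) * cell_width == target_width and gives larger
    factors (less shrinkage) to columns with higher importance mu.
    """
    mu = np.asarray(mu, dtype=float)
    slack = 1.0 - mu / mu.max()                  # 0 for the most important column
    cells_to_remove = mu.size - target_width / cell_width
    if slack.sum() == 0.0:                       # all columns equally important
        return np.full(mu.size, target_width / (cell_width * mu.size))
    sigma = 1.0 - (cells_to_remove / slack.sum()) * slack
    if sigma.min() < 0.0:
        raise ValueError("target width too small for this importance profile")
    return sigma

# Four column-cells of width 16 (total 64) shrunk to a target width of 48.
sigma = column_scaling_factors([1.0, 4.0, 4.0, 1.0], cell_width=16, target_width=48)
new_widths = sigma * 16
```

Here the two important middle columns keep their full width and the two outer columns are halved, so the scaled widths sum to the target width of 48.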

Image Reconstruction
Previous works show how to use a pre-trained CNN to synthesize images using back-propagation, for example to create images with different styles [7]. We adopt this approach, and use optimization to map the target feature maps back into image space, which allows us to obtain the resized output image. We use the target feature maps to reconstruct our desired output by iteratively applying back-propagation to change the values of an image. Note that what we call the output image is in fact the input image to the network.
Our initial output image $O$ is set to be a uniform 1D linearly scaled version of the input image $I$ (we consider other initialization methods, including random noise and a seam-carved image, and we compare the performance and results of each method in the supplementary materials). This allows the optimization to fix the distortions created by linear scaling, and to reconstruct the desired output by iteratively reducing distortions, especially in important regions of the image (see Figure 1). Thus, we seek to update $O$ by minimizing the total loss that is introduced by simple linear scale:
$$\mathcal{L}(O) = \sum_{i=1}^{L} \lambda_i \, \lVert F_i(O) - F'_i \rVert_2^2, \tag{4}$$
where $F'_i$ are the $i$-th layer target feature maps, and $F_i(O)$ are the $i$-th layer feature maps when the output image is fed into the pre-trained CNN. Moreover, $\lambda_1, \ldots, \lambda_L$ are non-negative hyper-parameters that weight the contribution of each term to the total loss. As suggested in [7], minimizing the loss in Equation 4 using gradient descent can produce visually pleasing images. In DNR, we use the Adam optimizer [15] to solve Equation 4.
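The structure of this optimization can be illustrated with a toy example. The sketch below is not the DNR pipeline: as a loudly stated assumption, two random linear maps stand in for the CNN layers, so that the gradient of the weighted feature loss has a closed form, and plain gradient descent is used in place of Adam. All names (`reconstruction_loss`, `layers`, `features`) are hypothetical.

```python
import numpy as np

def reconstruction_loss(feats, targets, lambdas):
    """Weighted sum of squared L2 distances between per-layer feature maps."""
    return sum(lam * np.sum((f - t) ** 2)
               for f, t, lam in zip(feats, targets, lambdas))

rng = np.random.default_rng(0)
# ASSUMPTION: random linear maps stand in for the pre-trained layers F_1, F_2.
layers = [rng.standard_normal((8, 6)), rng.standard_normal((4, 6))]
lambdas = [1.0, 0.5]

def features(o):
    return [a @ o for a in layers]

# Target feature maps, here generated from a hidden "ideal" output.
targets = features(rng.standard_normal(6))

# Step size 1 / (largest curvature) keeps plain gradient descent stable.
hessian = sum(2.0 * lam * a.T @ a for a, lam in zip(layers, lambdas))
step = 1.0 / np.linalg.eigvalsh(hessian).max()

o = np.zeros(6)                                   # initial output "image"
initial_loss = reconstruction_loss(features(o), targets, lambdas)
for _ in range(500):
    grad = sum(2.0 * lam * a.T @ (f - t)
               for a, f, t, lam in zip(layers, features(o), targets, lambdas))
    o = o - step * grad                           # update the output, not the layers
final_loss = reconstruction_loss(features(o), targets, lambdas)
```

As in the text, only the output is updated while the feature extractor stays fixed; with a real CNN the gradient would come from back-propagation rather than from the closed form used here.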

Image Refinement
The reconstruction optimization using back-propagation changes the pixel values of the output image $O$ to minimize the loss function of Equation 4. This means that regions defined by the target feature maps will most likely be preserved and reconstructed properly. However, some artifacts, such as checkerboard patterns and noisy pixels, still appear in the resulting reconstructed image. These artifacts appear because content removed from the original image causes discontinuities between better-preserved important regions, and such locations accumulate more gradient than others (similar to the artifacts created by de-convolution [21]).
We developed a novel method that utilizes a grid-sampler layer $G$ from [13] to overcome these artifacts. A grid-sampling layer learns a mapping from positions of neurons in its input to positions in its output. In our case, we place such a layer as the first layer of the network, modifying the input to the network to be $G(O)$ instead of $O$ (see Figure 7).
We use $G$ only after the initial reconstruction of $O$ is finalized (Section 3.2). We add the grid-sampler layer and continue to optimize using the same loss function of Equation 4.
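To illustrate what a grid-sampling layer computes, the following NumPy sketch resamples an image at real-valued pixel coordinates with bilinear interpolation. It is a minimal forward pass only, under the assumption that the sampler is bilinear; the function name `grid_sample` and the coordinate convention (pixel units, clamped at the border) are choices made here, not taken from the paper. In the refinement step, the sampling coordinates would be the quantities being optimized while the pixel values stay fixed.

```python
import numpy as np

def grid_sample(img, gx, gy):
    """Bilinearly sample a 2D image at float pixel coordinates (gx, gy)."""
    h, w = img.shape
    gx = np.clip(gx, 0, w - 1)                 # clamp coordinates to the image
    gy = np.clip(gy, 0, h - 1)
    x0 = np.floor(gx).astype(int)
    y0 = np.floor(gy).astype(int)
    x1 = np.minimum(x0 + 1, w - 1)
    y1 = np.minimum(y0 + 1, h - 1)
    fx, fy = gx - x0, gy - y0                  # fractional offsets
    top = (1 - fx) * img[y0, x0] + fx * img[y0, x1]
    bottom = (1 - fx) * img[y1, x0] + fx * img[y1, x1]
    return (1 - fy) * top + fy * bottom

img = np.arange(12, dtype=float).reshape(3, 4)
ys, xs = np.mgrid[0:3, 0:4].astype(float)      # identity sampling grid

identity = grid_sample(img, xs, ys)            # reproduces the image
shifted = grid_sample(img, xs + 0.5, ys)       # half-pixel shift to the right
```

Because the output is always an interpolation of existing pixels, changing only the grid moves content around without introducing new pixel values, which is consistent with the refinement step allowing only a change in the mapping of pixels.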

Results
In our experiments, we use VGG19 [30], which was trained on the ImageNet [28] dataset. Throughout this section, we use selected ReLU and max-pooling activations in VGG19's layers as our feature maps $F_i(I)$. Denote by block i conv j the ReLU activation of the j-th convolution layer in block i, and by block i pool the pooling activation of block i. The default configuration of our experimental results, unless otherwise stated, uses block 1 conv 2 , block 2 conv 2 , block 3 conv 4 , block 4 conv 5 and block 5 pool as feature maps.
Figure 9: Visualization of the importance map as seen by Deep Seam Carving. We show the input importance map (b) that is used in Seam Carving [1], and the effective importance map used by Deep Seam Carving (c). The effective importance map is more focused on semantic areas, which suggests less distortion to important regions.
We always remove at least one seam in the deepest feature map, and remove more seams only if their importance is within the 20th percentile of the importance map. The values of the parameters used in the reconstruction loss (Equation 4) are $\lambda_1 = 1$ and, for $i > 1$, $\lambda_i = 0$. The scaling factor in Equation 2 is set to $\alpha = 0.5$. Finally, the grid size we use for warping is 16.
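The percentile-based stopping rule can be sketched as below. The exact comparison is not spelled out in the text, so this is one possible reading, stated as an assumption: a seam is removable when its mean importance does not exceed the 20th-percentile value of the importance map. The function name `seam_is_removable` is hypothetical.

```python
import numpy as np

def seam_is_removable(imp, seam, percentile=20.0):
    """ASSUMED rule: mean seam importance <= the given percentile of the map."""
    threshold = np.percentile(imp, percentile)
    seam_importance = imp[np.arange(imp.shape[0]), seam].mean()
    return bool(seam_importance <= threshold)

# A 4x5 importance map with one low-importance column (column 0).
imp = np.full((4, 5), 10.0)
imp[:, 0] = 1.0

low_seam = np.zeros(4, dtype=int)       # runs through the unimportant column
high_seam = np.full(4, 3, dtype=int)    # runs through an important column

remove_low = seam_is_removable(imp, low_seam)
remove_high = seam_is_removable(imp, high_seam)
```

Under this reading, seam removal continues through the low-importance column and stops before a seam through the important region is removed, at which point grid warping takes over.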

Importance Map Effectiveness
The importance map used in the original Seam Carving algorithm [1] is based on the gradient magnitude of the image. This map is often used as the base importance for many other retargeting algorithms as well. In Figure 9, we compare this importance map with the effective importance map we use for Deep Seam Carving. The effective importance map is derived by summing the importance maps used by Deep Seam Carving. To visualize the importance map, we up-sample the low-resolution maps to match the image shape. As can be seen, the original importance map tends to concentrate on edges and lacks the ability to capture semantics, while our map clearly gives higher importance to semantic objects in the image.

Feature Space vs Image Space
A possible alternative approach that still uses deep feature maps would be to apply seam carving in image space while using the feature maps as importance maps. That is, instead of removing seams from the feature maps, one can remove the same seams from the input image in order to produce the output image. Figure 10 compares this approach to DNR, which is based on reconstruction. As can be seen, image-space retargeting leads to artifacts due to removing many seams from the same region. In contrast, in our DNR method, reconstructing the image leads to more continuous results, for two reasons. First, neighboring activations in VGG19 have overlapping receptive fields, so each of them affects several output pixels in the reconstruction. Second, using a CNN that was trained on natural images tends to generate photo-realistic images as well.

Reconstruction via Deep Layers Activation
In the following experiments we compare between different initialization methods and consider different values for the hyper parameters {λ i }. The experiments were conducted without the refinement procedure, so we can better compare the quality of the reconstructed images. Moreover, we extract block 1 conv 1 , block 3 conv 1 , and block 5 conv 1 activation and use them as our feature maps.
In Figure 11, we compare different initialization methods for the initial solution of the reconstruction phase. Specifically, we consider three methods: random initialization, linear scale and seam carving. In random initialization the solution's pixel values are uniformly sampled, while in linear scale initialization the input image is scaled to the desired shape using a 1D uniform scale. Finally, in seam carving initialization, we remove the same seams that were removed from the low-level feature maps (i.e., block 1 conv 1 ).
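Two of these initializations are simple enough to sketch directly. The code below is an illustrative assumption of how they could be produced, with the hypothetical helper `linear_scale_width` performing a 1D uniform scale along the width by linear interpolation between neighbouring columns, and random initialization drawing pixel values uniformly from [0, 1).

```python
import numpy as np

def linear_scale_width(img, new_w):
    """Uniform 1D scale of an (H, W) or (H, W, C) image along its width."""
    w = img.shape[1]
    src = np.linspace(0, w - 1, new_w)           # source column for each output column
    x0 = np.floor(src).astype(int)
    x1 = np.minimum(x0 + 1, w - 1)
    frac = (src - x0).reshape((1, -1) + (1,) * (img.ndim - 2))
    return (1 - frac) * img[:, x0] + frac * img[:, x1]

# Linear scale initialization: a horizontal ramp of width 5 narrowed to width 3.
ramp = np.tile(np.arange(5.0), (2, 1))
linear_init = linear_scale_width(ramp, 3)

# Random initialization: pixel values sampled uniformly at the target size.
rng = np.random.default_rng(0)
random_init = rng.uniform(0.0, 1.0, size=linear_init.shape)
```

The linearly scaled start already contains the full image content at the target size, only squeezed, which is why the optimization then mainly has to undo distortion rather than synthesize content from scratch.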
As can be seen, with random initialization the output contains many artifacts. Nonetheless, we are able to produce visually pleasing images for both seam carving and linear scale initialization. For linear scale it is best to use low learning rates while increasing the maximum number of iterations. For seam-carving initialization, we need far fewer optimization steps in order to achieve high-quality images.
Next, we show the contribution of deep layers to the quality of the output image, and consider various values of $\{\lambda_1, \lambda_2, \lambda_3\}$. In Figure 12, with linear scale initialization the output is highly distorted and the optimization does not converge to a visually pleasing solution. However, if we use seam carving initialization we get better results.
These findings are expected for two reasons. First, synthesizing photo-realistic images from deep layers activation is hard (as suggested in the paper of Gatys et al. [7]). Second, in seam carving initialization, our solution is already photo-realistic except at regions where seams were removed. Thus, few pixels need to be altered in order to reduce the loss function. On the other hand, in linear scale initialization the initial loss is relatively-high, and many pixels are changed between optimization steps.
Finally, in Figure 13 and Figure 14 we show the contribution of deep layers to the quality of the output image. In Figure 13, we see that deep layers contribute to the quality of the output image, while with linear scale initialization (Figure 14), deep layers contribute negatively to the quality of the output image.

Visual Comparison with Previous Methods
We use the RetargetMe benchmark [25] containing a variety of images and the results of previous retargeting operators on these images. We show sample results of DNR compared to Linear Scale, Seam Carving [26], Warping [33] and MultiOp [27] in Figure 17 and in Figure 18. As can be seen from both figures, DNR better retains the aspect ratios of semantic regions compared to the other methods. In addition, we compare DNR with the work in [4] and [19], and show the results in Figure 15 and Figure 16, respectively. We also demonstrate our method's ability in extreme size retargeting, and show more results in Figure 19.

User Study
To evaluate our DNR method against alternative methods we turned to the RetargetMe benchmark [25], which compares various methods. We conducted two forced-choice tests comparing our results side by side with an alternative. We showed the original image before retargeting and asked the user to choose the image that best preserves the content of the original image. The order of presentation was randomly shuffled and the survey forms were randomly distributed among 112 participants.
First, we chose to compare against the best performing method, which is the SV method [16]. DNR received 55.5% of the votes when compared with SV (out of 889 votes in total). Second, we compared against the best result obtained per image. In other words, we chose the highest-ranking results in the original study, where the result for each image could be obtained using a different retargeting method. Even in this case, our results received 52.8% of the votes (out of 956 votes in total). Counting the number of images for which users preferred our results, we found that DNR was favored in 42 images (against 25 for SV), and in 37 images (against 29 for Best).
Table 1: Average semantic score for different retargeting operators, including Warping [33]. We also include the average semantic score computed on the best retargeted images that were chosen in the RetargetMe user study (Best).

Semantic Preservation Evaluation
We want to compare the preservation of semantic details as a result of the retargeting operator. For this purpose, we define a Semantic Score (SS) given by:
$$SS = \frac{\lVert F_i(O) \rVert_2}{\lVert F_i(I) \rVert_2}. \tag{5}$$
This score compares the magnitudes of some deep VGG-19 layer activation before and after the retargeting. In particular, we expect that if the retargeting operator damages semantic regions, then the score will be lower, since in this case high activation on the original image will increase the denominator $\lVert F_i(I) \rVert_2$, while low activation on the retargeted image will diminish the numerator $\lVert F_i(O) \rVert_2$. We used block 5 conv 1 as the feature map $F_i(\cdot)$ in Equation 5. In Table 1, we computed the average semantic score per image in the RetargetMe benchmark for different retargeting operators. As can be seen, our DNR method receives the highest score.
Figure 15: 50% width scale. The input images (left) are from The Pascal VOC2017 dataset [6]. We compare results of WSSDCNN [4] (middle) and DNR (right). The results are obtained by setting α = 0.2 and employing Deep Seam Carving to perform 50% of the retargeting task. As can be seen, DNR better preserves the images' subjects (see guidelines).
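The ratio described above is straightforward to compute once the deep feature maps are available. The sketch below assumes the activations of the chosen layer are given as NumPy arrays for the input and the retargeted image; the function name `semantic_score` and the toy arrays are hypothetical and do not reproduce any value from Table 1.

```python
import numpy as np

def semantic_score(feat_input, feat_output):
    """Ratio of activation magnitudes: retargeted image over original image."""
    return float(np.linalg.norm(feat_output.ravel())
                 / np.linalg.norm(feat_input.ravel()))

# Toy activations: the input map is 2x4x3; retargeting halves the width.
feat_input = np.ones((2, 4, 3))
feat_preserved = np.ones((2, 2, 3))      # activations stay high after retargeting
feat_damaged = feat_preserved.copy()
feat_damaged[:, 0, :] = 0.0              # a semantic region lost its activation

score_preserved = semantic_score(feat_input, feat_preserved)
score_damaged = semantic_score(feat_input, feat_damaged)
```

The damaged output yields a smaller numerator and therefore a lower score, which matches the expectation stated in the text.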

Limitations
In our experiments, we encountered two challenges that resulted in unsatisfactory results.
First, VGG19 [30] was trained for the purpose of image classification, and DNR relies on its ability to detect semantic regions and objects. However, the network does not always succeed in providing semantic information on important regions. In addition, the network detects specific features of an object and can still have low activation on other regions of an important object. All of these could lead to object distortion in the final results (see Figure 20).
Another challenge is choosing the correct threshold to switch from seam-carving to warping in our multi-operator scheme. In particular, we have seen that there are cases in which further seams could have been removed while in other cases, we removed too many seams (see Figure 21).
Lastly, the time needed to produce results using DNR is still long. On average, it takes up to two minutes to retarget an image of size 640 × 480. Further optimization is required to achieve online image retargeting, which could in turn allow DNR to be extended to videos.

Conclusion
We have presented an image retargeting technique that operates in the deep layers of a pre-trained neural network. The technique utilizes the semantic information latent in the hierarchy of deep layers to aggregate, on the fly, an effective importance map. We have shown the strength of this high-level image analysis compared with common low-level feature analysis. In addition, our technique is based on an optimization procedure that reconstructs the image from its deep features; hence, it tends to produce far fewer visual artifacts.
In this work, we use a specific available pre-trained network. However, in the future we would like to consider pre-training a network with a special-purpose target in mind, so that its deep features will be more relevant to a specific task. Another avenue for future work is to leverage the optimization of the target image to synthesize new content. This could be effective when upscaling an image to a very different aspect ratio.
... [26], (d) Warping [33], (e) MultiOp [27], (f) SV [16] and (g) DNR (ours).

Figure 19: Stress test of extreme retargeting on images from the COCO dataset [20]. The width of the input images (first row) is scaled by 50%, 40% and 30%. We compare linear scale (second row), seam carving [26] (third row) and our results (last row). Notice how the important subject in each image preserves its shape as much as possible.