1 Introduction

Photographic mosaics (or simply photomosaics) are images composed of smaller, equally-sized image tiles (or “templates”) such that, when viewed from a distance, the tiles collectively resemble a perceptually plausible image. Although the term has existed since the 1990s (specifically for photography), the art form of stitching together a series of adjacent pictures to produce a scene has existed since the 1970s. Photomosaics are inspired by traditional mosaics, an ancient art form dating back at least as far as 1500 BCE, in which scenes and patterns were depicted using coloured pieces of glass, stone, or other materials. Here we focus on computer-generated photomosaics. Computer-generated photomosaicking relies on algorithms to select suitable combinations of templates from a given collection to compose a photomosaic that is perceptually similar to a target image.

Fig. 1.

Given an input image, (a), and a collection of template images (pictured are 8\(\,\times \,\)8 Apple emoji templates), our convolutional network generates a photomosaic, (c), that is perceptually similar to the input. For training our model, we exploit a continuous relaxation of the non-differentiable discrete template selection process to encourage the “soft” outputs, (b), to be as one-hot as possible for proper evaluation by our multi-scale perceptual metric. Zoom in for details.

In early work, Harmon and Knowlton experimented with creating large prints from collections of small symbols or images. In their famous artwork, “Studies in Perception I” [6], they created an image of a choreographer by scanning a photograph with a camera and converting the grayscale values into typographic symbols. This piece was exhibited at one of the earliest computer art exhibitions, “The Machine as Seen at the End of the Mechanical Age”, held at the Museum of Modern Art in New York City in 1968. Soon after, Harmon [7] investigated how much information is required for recognizing and discriminating faces and what information is the most important for perception. To demonstrate that very little detail was required for humans to recognize a face, he included a mosaic rendering of Abraham Lincoln consisting of varying shades of gray. Based on Harmon’s findings, Salvador Dalí, in 1976, created the popular photomosaic, “Gala Contemplating the Mediterranean Sea” [4]. This was among the first examples of photomosaicking, and one of the first created by a recognized artist.

Generally, there are two approaches to photomosaicking: patch-wise (e.g., [14]) and pixel-wise (e.g., [18]). Patch-wise photomosaicking matches each tiled region with the template whose average colour is closest to that of the region. In pixel-wise photomosaicking, the matching is performed per pixel between the target image and the templates. This is computationally more expensive but generally produces more visually pleasing results, since per-pixel matching allows a rudimentary matching of structure.

Computer-generated photomosaicking has mostly been explored in the context of matching colour/grayscale intensities and, in an extremely limited sense, structures. Pixel-wise methods are limited to matching the colour of individual pixels, while patch-wise methods typically use simple similarity metrics that may miss important structural information, e.g., edges, curves, etc. Both are limited to analysis at a single scale and generally ignore overall image semantics when producing a photomosaic. In contrast, our proposed approach involves a holistic analysis of colour, structure, and semantics across multiple scales.

Jetchev et al. [8] experimented with using convolutional networks (ConvNets) to form a perceptually-based mosaicking model; however, their approach was limited to a texture transfer process and consequently was not true photomosaicking, i.e., their outputs did not consist of tiled images. Furthermore, their approach did not account for matching colours between the input and output, only structure, and only at a single scale.

In this paper, we propose a perceptually-based approach to generating photomosaics from images using a ConvNet. We rely on a perceptual loss [9] for guiding the discrete selection process of templates to generate a photomosaic. Inspired by previous work [17], we extend the perceptual loss over multiple scales. Our approach is summarized in Fig. 1.

We make the following contributions. Given a discrete set of template images, we propose a feed-forward ConvNet for generating photomosaics. To the authors’ knowledge, we are the first to demonstrate a ConvNet for photomosaicking that utilizes a perceptual metric. We demonstrate the effectiveness of our multi-scale perceptual loss by experimenting with producing extremely high resolution photomosaics and through the inclusion of ablation experiments that compare with a single-scale variant of the perceptual loss. We show that, overall, our approach produces visually pleasing results with a wide variety of templates, providing a substantial improvement over common baselines.

2 Technical Approach

Given an RGB input image, \(\mathbf X \in \mathbb {R}^{H \times W \times 3}\), our goal is to generate a photomosaic, \(\mathbf Y \in \mathbb {R}^{H \times W \times 3}\), where H and W denote the image height and width. For every non-overlapping tiled region in the image, we learn a distribution of weightings (or coefficients) for selecting templates. This is represented using a map of one-hot encodings, denoted by \(\mathbf C \in [0, 1]^{(H / H_T) \times (W / W_T) \times N_T}\), where \(H_T\), \(W_T\), and \(N_T\) denote the template height, template width, and the number of templates, respectively. Each spatial position on this map contains a one-hot encoding denoted by \(\mathbf {c}_{r,c}\), where r and c correspond to its row and column position on the map. RGB templates, \(\mathbf T \in \mathbb {R}^{H_T \times W_T \times 3N_T}\), are given and fixed between training and testing. In Sect. 2.1, we outline our encoder-decoder ConvNet architecture. Section 2.2 describes how we exploit a continuous relaxation of the argmax function to make training differentiable. Finally, Sect. 2.3 describes our multi-scale perceptual loss, which is used to train the decoder portion of the network.
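
To make the notation concrete, the following minimal Python sketch spells out the shapes involved, using the 512 \(\times \) 512 inputs and 8 \(\times \) 8 templates from our experiments; the template count shown is a hypothetical placeholder.

```python
# Shape bookkeeping for the quantities defined above. The template count
# N_T is a hypothetical placeholder; other values match the experiments.
H, W = 512, 512              # input/output image height and width
H_T, W_T, N_T = 8, 8, 100    # template height, width, and (hypothetical) count
coeff_map_shape = (H // H_T, W // W_T, N_T)  # map C of per-tile coefficients
templates_shape = (H_T, W_T, 3 * N_T)        # stacked RGB templates T
print(coeff_map_shape, templates_shape)      # (64, 64, 100) (8, 8, 300)
```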

2.1 Encoder-Decoder Architecture

Our ConvNet is designed as an encoder-decoder network that takes \(\mathbf X \) as input and produces \(\mathbf Y \) as the photomosaic output. We adopt the VGG-16 [16] ConvNet pre-trained on the ImageNet dataset [15] as the encoder portion of our network, which is kept fixed. For the purpose of photomosaicking, we find using the layers up to pool3 of VGG-16 to be sufficient. Our decoder is as follows: a 1\(\,\times \,\)1\(\,\times \,\)256 (corresponding to \(height \times width \times num\_filters\)) convolution, a ReLU activation, a \(3 \times 3 \times N_T\) convolution (3\(\,\times \,\)3 to encourage template consistency among neighbours), and a channel-wise softmax to produce the template coefficients. To keep the range of activations stable, we use layer normalization [2] after each convolution in the decoder. In all convolutional layers we use a stride of 1.
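
A minimal sketch of this architecture in TensorFlow (the framework noted in Sect. 2.3) is given below; the Keras layer name for pool3, the omission of VGG input preprocessing, and any hyper-parameters not stated above are assumptions.

```python
import tensorflow as tf

def build_mosaic_network(num_templates):
    """Sketch of the encoder-decoder of Sect. 2.1 (details beyond the text
    are assumptions)."""
    # Frozen VGG-16 encoder, truncated at pool3 (output stride of 8, which
    # matches the 8 x 8 templates used in the experiments).
    vgg = tf.keras.applications.VGG16(include_top=False, weights="imagenet")
    encoder = tf.keras.Model(vgg.input, vgg.get_layer("block3_pool").output)
    encoder.trainable = False

    image = tf.keras.Input(shape=(None, None, 3))  # VGG preprocessing omitted here
    feats = encoder(image)

    # Decoder: 1x1x256 conv -> LayerNorm -> ReLU -> 3x3xN_T conv -> LayerNorm.
    x = tf.keras.layers.Conv2D(256, kernel_size=1, strides=1, padding="same")(feats)
    x = tf.keras.layers.LayerNormalization()(x)
    x = tf.keras.layers.ReLU()(x)
    x = tf.keras.layers.Conv2D(num_templates, kernel_size=3, strides=1, padding="same")(x)
    logits = tf.keras.layers.LayerNormalization()(x)

    # The channel-wise (temperature-scaled) softmax of Sect. 2.2 is applied
    # to these logits to obtain the template coefficients.
    return tf.keras.Model(image, logits)
```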

For each tiled region, \(\mathbf y _{r,c}\), of the final output, \(\mathbf Y \), let \(\mathbf c _{r,c}(i)\) be the i-th coefficient of the one-hot encoding corresponding to that region and \(\mathbf T (i) \in \mathbb {R}^{H_T \times W_T \times 3}\) the i-th template of RGB templates \(\mathbf T \). The output \(\mathbf y _{r,c}\) is generated by linearly combining the templates for that region by their respective template coefficients,

$$\begin{aligned} \mathbf y _{r,c} = \sum _{i=1}^{N_T} \mathbf c _{r,c}(i) \mathbf T (i). \end{aligned}$$
(1)

The final output, \(\mathbf Y \), is a composition of each tiled output \(\mathbf y _{r,c}\).
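
A sketch of this composition step is shown below; the template tensor layout (\([H_T, W_T, 3, N_T]\)) is an assumption chosen for clarity.

```python
import tensorflow as tf

def compose_mosaic(coeffs, templates):
    """Eq. (1): each output tile is a coefficient-weighted sum of templates.
    coeffs:    [rows, cols, N_T]       per-tile template coefficients c_{r,c}
    templates: [H_T, W_T, 3, N_T]      RGB templates T (layout is an assumption)
    returns:   [rows*H_T, cols*W_T, 3] composed photomosaic Y
    """
    # Weighted sum over the template axis gives one H_T x W_T x 3 patch per tile.
    tiles = tf.einsum("rcn,hwdn->rchwd", coeffs, templates)
    r, c, h, w, d = tiles.shape
    # Stitch the per-tile patches back into a single image.
    return tf.reshape(tf.transpose(tiles, [0, 2, 1, 3, 4]), [r * h, c * w, d])
```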

2.2 Learning a Discrete Selection of Templates

Key to our approach is the discrete selection of templates at each tiled region. This is necessary to produce a photomosaic. During training, however, using an argmax to select the template with the maximal coefficient is not possible because the argmax function is non-differentiable. Instead, we exploit a continuous relaxation of the argmax by annealing the softmax that produces the coefficients. In particular, we gradually upscale the softmax inputs during training by \(1/\tau \), where \(\tau \) is the “temperature” parameter that is gradually “cooled” (i.e., reduced) as training progresses. In the limit as \(\tau \rightarrow 0\), the softmax function approaches the argmax function and Eq. 1 becomes nearly equivalent to a discrete sampler, as desired. Specifically, the softmax distribution of coefficients approaches a one-hot distribution. This encourages the network to select a single template for each tiled region. During inference, instead of linearly combining templates by their respective coefficients, each tiled region output, \(\mathbf y _{r,c}\), is generated by selecting the template corresponding to the argmax of the distribution of coefficients, \(\mathbf c _{r,c}\).
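
The following sketch summarizes the train- versus inference-time behaviour described above; the function and argument names are illustrative only.

```python
import tensorflow as tf

def template_coefficients(logits, tau, training=True):
    """Continuous relaxation sketch: temperature-scaled softmax during
    training, hard argmax (one-hot selection) at inference."""
    if training:
        # As tau -> 0, softmax(logits / tau) approaches a one-hot argmax.
        return tf.nn.softmax(logits / tau, axis=-1)
    num_templates = tf.shape(logits)[-1]
    # Pick a single template per tiled region.
    return tf.one_hot(tf.argmax(logits, axis=-1), depth=num_templates)
```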

2.3 Multi-scale Perceptual Loss

So-called “perceptual losses” have previously been used as a representation of salient image content for image stylization tasks, e.g., image style transfer [5, 9]. Instead of generating images based on differences between raw colour pixel values, perceptual losses enable high quality image generation based on differences between feature representations extracted from the convolutional layers of a pre-trained ConvNet. To that end, we use a perceptual loss [9] to guide the network to produce photomosaics that are perceptually similar to the input. Specifically, the perceptual loss measures differences between the input image and the output photomosaic in terms of features ranging from low level (e.g., visual content such as edges, colours, and curves) to high level (e.g., semantic content such as faces and objects). As with our encoder, we use the VGG-16 [16] ConvNet pre-trained on the ImageNet dataset [15]; however, here it serves as a perceptual metric, with layers conv1_1, conv2_1, conv3_1, conv4_1, and conv5_1 used for computing the perceptual loss. Formally, let \(\phi _l(\mathbf {X})\) be the activations of the l-th layer of VGG-16 when processing input \(\mathbf {X}\). The perceptual loss is computed as the average Mean Squared Error (MSE) between feature representations of \(\mathbf X \) and \(\mathbf Y \),

$$\begin{aligned} L(\mathbf X ,\mathbf Y ) = \frac{1}{L}\sum _l{||\phi _l(\mathbf X ) - \phi _l(\mathbf Y )||_2^2}, \end{aligned}$$
(2)

where L is the number of layers used for computing the perceptual loss.
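
A sketch of this loss, using the Keras names for the listed VGG-16 layers, is given below; any feature normalization beyond the stated average MSE is an assumption.

```python
import tensorflow as tf

# Keras names for the VGG-16 layers conv1_1 through conv5_1 used in the loss.
LOSS_LAYERS = ["block1_conv1", "block2_conv1", "block3_conv1",
               "block4_conv1", "block5_conv1"]

def build_loss_network():
    """Frozen VGG-16 feature extractor phi used only as a perceptual metric."""
    vgg = tf.keras.applications.VGG16(include_top=False, weights="imagenet")
    outputs = [vgg.get_layer(name).output for name in LOSS_LAYERS]
    model = tf.keras.Model(vgg.input, outputs)
    model.trainable = False
    return model

def perceptual_loss(phi, x, y):
    """Eq. (2): average MSE between VGG-16 feature maps of x and y."""
    losses = [tf.reduce_mean(tf.square(fx - fy))
              for fx, fy in zip(phi(x), phi(y))]
    return tf.add_n(losses) / len(losses)
```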

To produce visually accurate photomosaics, we require the objective to consider the content within each tiled region as well as the content spanning multiple tiled regions. This necessitates analysis across multiple scales. Motivated by prior work [10, 17], we compute the perceptual loss (Eq. 2) on a Gaussian pyramid [3] of the input and output. This guides the decoder to select templates that closely match the content within each tiled region, as well as collectively match the overall content of the input. To mitigate the influence of seams between tiled regions, we blur the photomosaic output before feeding it into the loss. Our final objective is as follows:

$$\begin{aligned} L(\mathbf X ,B(\mathbf Y )) = \frac{1}{SL}\sum _s\sum _l{||\phi _l(\mathbf X ^s) - \phi _l(B(\mathbf Y ^s))||_2^2}\ , \end{aligned}$$
(3)

where input \(\mathbf {X}^s\) is taken from the s-th level of a Gaussian pyramid, \(B(\mathbf Y ^s)\) is the blurred photomosaic output taken from the same level, and S is the number of scales used for the pyramid.
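
The sketch below extends the single-scale `perceptual_loss` above to Eq. 3; the blur kernel parameters and the number of pyramid levels shown are assumptions.

```python
import tensorflow as tf

def gaussian_blur(img, sigma=1.0, size=5):
    """Separable Gaussian blur used both for B(.) and between pyramid levels
    (kernel size and sigma are assumptions). img: [batch, H, W, C]."""
    ax = tf.range(-(size // 2), size // 2 + 1, dtype=tf.float32)
    k1d = tf.exp(-0.5 * tf.square(ax / sigma))
    k1d /= tf.reduce_sum(k1d)
    kernel = tf.tensordot(k1d, k1d, axes=0)               # [size, size]
    kernel = tf.tile(kernel[:, :, None, None], [1, 1, img.shape[-1], 1])
    return tf.nn.depthwise_conv2d(img, kernel, strides=[1, 1, 1, 1], padding="SAME")

def multiscale_perceptual_loss(phi, x, y, num_scales=3):
    """Eq. (3): Eq. (2) evaluated on a Gaussian pyramid of the input x and the
    blurred mosaic y (the number of scales S is an assumption). x, y: [batch, H, W, 3]."""
    total = 0.0
    for _ in range(num_scales):
        total += perceptual_loss(phi, x, gaussian_blur(y))
        # Blur and downsample by 2 to form the next (coarser) pyramid level.
        x = tf.nn.avg_pool2d(gaussian_blur(x), ksize=2, strides=2, padding="SAME")
        y = tf.nn.avg_pool2d(gaussian_blur(y), ksize=2, strides=2, padding="SAME")
    return total / num_scales
```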

Training. For training the weights of our decoder, we use the images from the Microsoft COCO dataset [13]. We train on a merger of the train, test, and validation splits of COCO. We resize each image to 512 \(\times \) 512 and train with a batch size of 12 for 2,000 iterations. We use the Adam optimizer [11] with a learning rate of \(6\mathrm {e}{-3}\) that is exponentially decayed every 100 iterations at a rate of 0.96. We follow a temperature cooling schedule starting from \(\tau = 1\) and gradually decreasing \(\tau \) every 10 iterations until \(\tau = 0.067\). Our network is implemented using TensorFlow [1]. Training takes roughly 20 min on an NVIDIA Titan V GPU. Figure 2 shows results using various 8 \(\times \) 8 templates on a 512 \(\times \) 512 input.
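
A sketch of the optimizer and schedules implied by these settings follows; the staircase decay style and the linear interpolation of \(\tau \) are assumptions.

```python
import tensorflow as tf

# Optimizer and learning-rate schedule following the training details above
# (the staircase decay style is an assumption).
learning_rate = tf.keras.optimizers.schedules.ExponentialDecay(
    initial_learning_rate=6e-3, decay_steps=100, decay_rate=0.96, staircase=True)
optimizer = tf.keras.optimizers.Adam(learning_rate=learning_rate)

def temperature(step, total_steps=2000, tau_start=1.0, tau_end=0.067):
    """Cool tau from 1.0 to 0.067, updating every 10 iterations
    (linear interpolation is an assumption)."""
    frac = min((step // 10) * 10 / total_steps, 1.0)
    return tau_start + frac * (tau_end - tau_start)
```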

Fig. 2.

Photomosaic results using 8 \(\times \) 8 “glyphs” as templates. (left-to-right) Input, Apple emoji icons, sprites from “Super Mario Bros.”, ASCII characters, text characters from “The Matrix”. Zoom in for details.

3 Experiments

To evaluate our approach, we perform two experiments: a qualitative comparison against nearest neighbour baselines using both a simple L2 metric and the Structural SIMilarity (SSIM) [19] metric, a perception-based metric that addresses shortcomings of L2 by taking local image structure into account; and a qualitative comparison between using a single scale and multiple scales for the perceptual loss. In addition, we experiment with producing extremely high resolution photomosaics. For our full photomosaic results, the collection of templates used, and source code, please refer to the supplemental material on the project website: ryersonvisionlab.github.io/perceptual-photomosaic-projpage.

Fig. 3.

Baseline comparisons. Given an input image, (a), photomosaics are generated using nearest neighbour (NN) with an L2 metric, (b), NN with a SSIM metric, (c), and our convolutional approach, (d). From (b) to (d), the top row of photomosaics consist of Apple emoji templates and the bottom row of photomosaics consist of oriented edge templates. Zoom in for details.

Fig. 4.

Photomosaic outputs when using a single vs multi-scale perceptual loss. (left-to-right) Input, single-scale at a fine scale, single-scale at a coarse scale, multi-scale at both fine and coarse scales. Zoom in for details.

Baselines. To demonstrate that our approach improves upon common baselines in capturing colour, structure, and semantics across multiple scales, we compare against nearest neighbour with L2 and SSIM for template selection on two sets of templates: the complete set of emojis from Apple, and a specially-designed set of templates of oriented edges at varying thicknesses and rotations. Photomosaics are generated as follows: for each tiled region, the template with the lowest L2 distance or highest SSIM when compared with the underlying image content (in raw colour pixel values) is selected. Figure 3 shows our results. Nearest neighbour with L2 (Fig. 3b) completely fails to retain both the colour and structure of the input. With SSIM (Fig. 3c), some structure of the input is preserved, albeit only at small scales, while colour accuracy is generally lacking. Moreover, neither method preserves the semantics of the input, such as the subject’s hair, nose, and eyes. In contrast, our approach (Fig. 3d) reliably captures the colour, structure, and semantics of the image.
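
For reference, a minimal sketch of this nearest neighbour baseline is given below; the SSIM window size used for small tiles is an assumption.

```python
import tensorflow as tf

def nearest_neighbour_mosaic(image, templates, metric="l2"):
    """Baseline sketch: per tile, pick the template with the lowest L2
    distance or the highest SSIM against the raw pixels.
    image:     [H, W, 3] in [0, 1];  templates: [N_T, H_T, W_T, 3]
    (the SSIM filter size is an assumption chosen to fit small tiles)."""
    n_t, h_t, w_t, _ = templates.shape
    rows, cols = image.shape[0] // h_t, image.shape[1] // w_t
    out_rows = []
    for r in range(rows):
        row_tiles = []
        for c in range(cols):
            patch = image[r * h_t:(r + 1) * h_t, c * w_t:(c + 1) * w_t]
            stacked = tf.repeat(patch[tf.newaxis], n_t, axis=0)
            if metric == "l2":
                scores = -tf.reduce_sum(tf.square(templates - stacked), axis=[1, 2, 3])
            else:  # SSIM: higher is better
                scores = tf.image.ssim(templates, stacked, max_val=1.0, filter_size=5)
            row_tiles.append(tf.gather(templates, tf.argmax(scores)))
        out_rows.append(tf.concat(row_tiles, axis=1))
    return tf.concat(out_rows, axis=0)
```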

Single vs. Multi-scale. We perform an ablation study on our multi-scale perceptual loss to isolate the individual contribution of each scale (i.e., fine and coarse) and to motivate the benefit of incorporating information across multiple scales. When the perceptual loss operates on a single scale, it is restricted to scrutinizing the photomosaic output at that scale. As shown in Fig. 4, when only a fine scale is used, the output fails to preserve larger structures, such as the outline around the subject’s jawline and ears. At a coarse scale, the reduction in resolution prevents finer details from being captured, such as the orientation of edges in the input image, resulting in a noisier output. When using the multi-scale perceptual loss operating on both fine and coarse scales, however, the output reliably preserves both the finer details and the coarse structure of the image.

Fig. 5.

High resolution photomosaics. (left) A 5,280 \(\times \) 3,960 input and (right) a 10,560 \(\times \) 7,936 photomosaic using 32 \(\times \) 32 templates from a collection of 17,500 rotated and colour-shifted images taken from the top-100 images from the Hubble Space Telescope [12]. Shown are downsampled versions of the images to save space; please see the supplemental for the full resolution images.

High Resolution. To demonstrate the effectiveness of using a multi-scale perceptual loss, we experiment with generating extremely high resolution photomosaics, as shown in Fig. 5. The input is a 5,280\(\,\times \,\)3,960 image of Vincent van Gogh’s painting, “Starry Night”, and the output is a visually compelling 10,560\(\,\times \,\)7,936 photomosaic. The multi-scale perceptual loss enables the model to capture both the coarse scale and fine scale features of the input. For example, input image content spanning multiple tiled regions (e.g., the large black tower and the stars) is reliably captured in the photomosaic through the appropriate composition of templates, while input image content within tiled regions is reliably captured through the appropriate selection of templates that match the underlying image structure, such as the orientation and colour of the brush strokes.

4 Conclusion

In this paper, we presented a ConvNet for generating photomosaics of images given a collection of template images. We rely on a multi-scale perceptual loss to guide the discrete selection process of templates to generate photomosaics that best preserve colour, structure, and semantics of the input across multiple scales. We show that our approach produces visually pleasing results with a wide variety of templates, providing a substantial improvement over common baselines. We demonstrate the benefits of a multi-scale perceptual loss through the inclusion of ablation experiments and by experimenting with generating extremely high resolution photomosaics.