Deep spatial and tonal data optimisation for homogeneous diffusion inpainting

Diffusion-based inpainting can reconstruct missing image areas with high quality from sparse data, provided that their locations and values are well optimised. This is particularly useful for applications such as image compression, where the original image is known. Selecting the known data constitutes a challenging optimisation problem that has so far only been investigated with model-based approaches. These methods force a choice between high quality and high speed, since qualitatively convincing algorithms rely on many time-consuming inpaintings. We propose the first neural network architecture that allows fast optimisation of pixel positions and pixel values for homogeneous diffusion inpainting. During training, we combine two optimisation networks with a neural network-based surrogate solver for diffusion inpainting. This novel concept allows us to perform backpropagation based on inpainting results that approximate the solution of the inpainting equation. Without the need for a single inpainting during test time, our deep optimisation accelerates data selection by more than four orders of magnitude compared to common model-based approaches. This provides real-time performance with high quality results.


Introduction
The classical inpainting problem [1][2][3][4] deals with input images that have been partially corrupted and aims at reconstructing these missing areas. However, inpainting can also be useful when the whole image is known. For inpainting-based image compression [5][6][7][8][9][10][11][12][13][14][15][16][17][18], the encoder stores only a small percentage of known data from which the decoder restores the discarded remainder of the image with inpainting. Some approaches [5][6][7][10][11][12], such as the pioneering work of Carlsson [5], limit the choice of known data to edge locations. Following the diffusion-based codec of Galić et al. [8,19], many later approaches [14][15][16][20] rely on careful optimisation of the placement of known data in the image domain without the restriction to semantic image features. Inpainting with partial differential equations (PDEs) [5] has been able to outperform state-of-the-art codecs: Already simple homogeneous diffusion [21] can compress depth maps or flow fields better than HEVC [22] with suitably selected known data [20,23,24]. The problem of choosing the right positions of mask pixels, the so-called inpainting mask, is also vital for other applications such as denoising [25] or adaptive sampling [26]. In addition to this spatial optimisation, compression also benefits from tonal optimisation: The values of the known pixels can be adjusted to optimise the reconstruction quality as well.
However, even for a simple inpainting operator, spatial and tonal optimisation constitute challenging problems. This has sparked a plethora of non-neural approaches [14-16, 20, 24, 27-41], which we review systematically in Section 1.2. Most of these methods require many inpaintings per iteration, which tend to be computationally expensive or rely on sophisticated implementations for acceleration. For instance, probabilistic methods for spatial optimisation [34,37] yield high quality masks, but come with a high computational cost. Theoretical optimality results are rare, but have been derived from shape optimisation [27] for homogeneous diffusion inpainting. This allows near instantaneous spatial optimisation without the need for a single inpainting. However, so far, existing discrete implementations of this concept do not realise the full potential of the theoretical results from the continuous setting.
With our deep data optimisation for homogeneous diffusion inpainting, we aim for the best of both worlds. We train neural networks that can optimise both mask positions and values without the need for a single inpainting. This allows real-time performance while maintaining competitive quality for our data selection. During training, we leverage new hybrid concepts that combine model-based inpainting with deep learning.

Our Contribution
We propose a deep learning framework for inpainting with homogeneous diffusion. It is the first neural network approach that allows tonal optimisation in addition to the selection of spatial positions. In order to merge model-based inpainting relying on PDEs with learning ideas, we propose the concept of a surrogate solver.

Spatial Optimisation
Finding good positions for sparse known pixels constitutes a challenging optimisation problem that has sparked significant research activity. In the following, we mostly focus on approaches for diffusion-based inpainting, but there are also more broadly related works, for instance the free knot problem for spline interpolation: Schütze and Schwetlick [44] have proposed a data selection algorithm for the 2-D setting which can also be applied to images. Model-based methods for diffusion inpainting can be organised in four categories.
1. Analytic Approaches. From the theory of shape optimisation, Belhachmi et al. [27] derived optimality statements in the continuous setting. In practice, these can be approximated by dithering the Laplacian magnitude of the input image. This approach does not require inpainting to find the mask pixels and is therefore very fast. However, the dithering yields only an imperfect approximation with limited quality [34,37].

2. Nonsmooth Optimisation Strategies. Combining concepts from optimal control with sophisticated strategies such as primal-dual solvers, multiple works [28,29,33,39,40] leverage nonsmooth optimisation for the selection of mask positions. These produce results of high quality, but are difficult to adapt to different inpainting operators. Moreover, they do not allow to specify the target amount of mask points a priori. For applications in compression, the non-binary masks need to be binarised, which reduces quality and requires tonal optimisation [35] for good results.

3. Sparsification Methods. Mainberger et al. [37] have proposed probabilistic sparsification (PS) to tackle the combinatorial complexity of spatial optimisation. They start with a full mask and iteratively remove candidate pixels. Among those candidates, the algorithm permanently discards a fraction of pixels with the smallest inpainting error, while returning the remainder to the mask. This process is repeated until the desired percentage of mask points, the target density, is achieved. Besides good quality, this approach can easily be adapted to many different inpainting operators, including inpainting with PDEs [34,37] or interpolation on triangulations [32]. This flexibility and quality comes at the cost of many inpainting operations. Nonetheless, sparsification is popular and widely used due to its advantages and its simplicity.

4. Densification Approaches. For applications such as compression or denoising, low densities are required. In such cases it can make sense to start with an empty mask and fill it successively instead of using sparsification. Such strategies [30,31,36] share the benefits of simplicity, good quality, and broad applicability with sparsification. They have been successfully used for diffusion-based [30,31] and exemplar-based [36] inpainting operators. For compression, densification has also been applied to data structures such as subdivision trees instead of individual pixels [8,14,16,45]. However, all of these strategies still require a significant amount of inpaintings to obtain masks of sufficient quality.
The approaches of Categories 3 and 4 are greedy strategies that can become stuck in local minima. To address this problem, a relocation strategy, the so-called nonlocal pixel exchange (NLPE) [37], has been proposed as a postprocessing step. It is a probabilistic method that iteratively swaps point locations randomly, with heuristics for candidate selection based on the inpainting error. While it can yield significant additional improvements, it also requires more inpaintings and tends to converge slowly. Similar strategies have also been used for interpolation on triangulations [38].
Note that the approach from Category 1 is the only one to require no inpaintings or complex solvers. Unfortunately, this near instantaneous spatial optimisation yields clearly worse results in terms of quality than the methods from Categories 2-4. With our deep learning framework, we aim at achieving the best of both worlds: fast spatial optimisation without the need for any inpaintings while producing results of a quality comparable to Categories 2-4.

Tonal Optimisation
So far, we have discussed methods that focus on finding optimal positions at which the original image data is kept. However, in a data optimisation scenario, we are not confined to selecting the locations, but can also alter the values of mask pixels. This tonal optimisation deliberately introduces errors at mask pixels if those lead to a more accurate reconstruction in the larger missing areas. Also for tonal optimisation, one can distinguish several categories:

1. Least Squares Approaches. For spatially fixed mask pixels, tonal optimisation leads to a least squares problem. The resulting linear system of equations is given by the normal equations. It has as many unknowns as mask pixels. The system matrix is a square, dense matrix that is symmetric and positive definite [37].
To solve it numerically, various algorithms can be applied. Direct methods include Cholesky, LU, and QR factorisations, while conjugate gradients and the LSQR algorithm constitute suitable iterative approaches [46]. Other iterative methods that have been used for tonal optimisation are the L-BFGS algorithm [29] and a gradient descent with cyclically varying step sizes [34]. All of these approaches suffer from the fact that they require storing the full matrix, which can become prohibitive for masks with too many pixels.
A potential remedy for this memory restriction consists of successively computing a so-called inpainting echo in each mask pixel [37]. It describes the influence of the mask pixel on the final inpainting result and can be used to adjust the grey or colour value accordingly. Doing this in random order for all mask pixels can be interpreted as a randomised Gauss-Seidel or SOR iteration step. If one does not store all inpainting echoes but computes them again in each iteration step, one achieves low memory requirements at the expense of a long runtime. Discrete Green's functions offer another way to decompose the inpainting problem into pixel-wise contributions [47]. From this dictionary, the inpainting result can be assembled with simple linear superposition. Hoffmann et al. [48] have used this property to derive an alternative least squares formulation for tonal optimisation which can be solved efficiently with a Cholesky solver. While its solution is equivalent to the direct least squares approach, it benefits from speed-ups for low amounts of mask pixels, which are represented by only a few entries from the Green's function dictionary.
A recent alternative goes back to Chizhov and Weickert [30]. It uses nested conjugate gradient approaches within a finite element framework and is efficient w.r.t. both memory and runtime.

2. Nonsmooth Optimisation Methods. Hoeltgen and Weickert [35] have shown that thresholded non-binary spatial mask optimisation [28,29,34,39,40] is equivalent to a combined selection of binary masks and a tonal optimisation. Thus, the previously discussed nonsmooth strategies also indirectly perform tonal optimisation. However, this is inherently coupled to a spatial optimisation with the advantages and drawbacks described in the previous section.

3. Localisation Approaches. Since the influence of a single mask pixel mainly affects its local neighbourhood, tonal optimisation can be sped up by localisation. Strategies exist for localised operators such as Shepard interpolation with truncated Gaussians [41], 1-D linear interpolation [49], or smoothed particle hydrodynamics [31]. Other approaches limit the influence artificially by subdivision trees [15] or segmentation [20,24].

4. Quantisation-based Strategies. All compression codecs rely on quantisation, the coarse discretisation of the colour domain. It can be beneficial to directly take quantisation into account during tonal optimisation instead of applying it in postprocessing. Thereby, one replaces the continuous optimisation problem by a discrete one. To this end, Schmaltz et al. [16] proposed a simple strategy that visits pixels in random order and changes their values if increasing or decreasing the quantisation level yields a better result. Peter et al. [14] instead augment the Gauss-Seidel strategy with echoes [37] with a projection to the quantised grey levels. For interpolation on triangulations, Marwood et al. [38] use a stochastic approach that randomly assigns different quantisation levels in combination with spatial optimisation.
In addition to tonal optimisation itself, there are also related strategies. Galić et al. [8] proposed an early predecessor that modified tonal values to avoid singularities in PDE-based inpainting. To avoid visually unpleasant singularities at mask pixels, Schmaltz et al. [16] use interpolation swapping: After the initial inpainting, they remove disks around the known data and use the more reliable reconstruction for a second inpainting.
The methods of tonal Category 1 are restricted to linear diffusion operators, including homogeneous diffusion. Category 2 marks the indirect tonal optimisation performed by nonsmooth spatial methods, and Categories 3 and 4 are mainly relevant for practical applications in compression. We aim at providing a neural network alternative to Category 1 methods for homogeneous diffusion inpainting. As for spatial optimisation, our goal is to propose a deep optimisation approach that offers high speed at good quality.

Relations to Deep Learning Approaches
To the best of our knowledge, deep learning approaches for sparse data optimisation are still very rare, and so far only spatial optimisation has been covered at all. Dai et al. [26] have proposed a deep learning method for adaptive sampling that trains an inpainting and an optimisation network separately. Joint training for spatial optimisation and inpainting with Wasserstein GANs was introduced by Peter [43]. Both approaches differ significantly from the current one, since they aim at learning both a spatial optimisation CNN and the inpainting operator. In contrast, we optimise known data for model-based diffusion inpainting with a surrogate solver for homogeneous diffusion inpainting. Moreover, our deep data selection is the first to consider both spatial and tonal optimisation.
In addition, a plethora of deep inpainting methods exist (e.g. [50][51][52][53][54][55][56]). A full review is beyond the scope of this paper, because these approaches do not consider any form of data optimisation. Since the selection of known data is decisive for the quality of inpainting-based compression, the current lack of research in this direction is the primary reason why deep inpainting has not played a role in this area yet.

Organisation of the Paper
After a brief review of diffusion-based inpainting and model-based optimisation in Section 2, we propose our deep mask learning approach in Section 3. Section 4 provides an ablation study and a comparison to model-based data optimisation. We conclude our paper with a summary and an outlook on future work in Section 5.

Diffusion-based Inpainting and Data Optimisation
Consider a grey value image f : Ω → R that is only known on the inpainting mask, a subset K ⊂ Ω of the rectangular image domain Ω ⊂ R². Diffusion-based inpainting [5,58] reconstructs the missing areas Ω \ K by propagating the information of the fixed known pixels from K over the diffusion time t.
The inpainted image is the steady state t → ∞ of this evolution. Fig. 1 illustrates such a propagation over time. For our inpainting purposes, we are only interested in the steady state and not the intermediate steps of the evolution. There are sophisticated anisotropic diffusion approaches [8,15,16,58,59] that adapt the amount of propagation in different directions to the image structure and can achieve results of very good quality even if the dataset K is not highly optimised. However, in the following, we consider simple homogeneous diffusion [21] for inpainting. It is parameter-free and can achieve surprisingly high quality for a well-optimised dataset. In this case, the inpainted image u fulfils the inpainting equation

c(x) (u(x) − f(x)) − (1 − c(x)) ∆u(x) = 0,    (1)

which arises as the steady state if one inpaints with the homogeneous diffusion equation ∂_t u = ∆u. Here, ∆u = ∂_xx u + ∂_yy u denotes the Laplacian and c is a binary confidence function with c(x) = 1 for known data in K and c(x) = 0 otherwise. At the image boundaries ∂Ω we impose reflecting boundary conditions. Note that it is also possible to use non-binary confidence values [35], which we will do in Section 3.2.1. Since homogeneous diffusion is a linear operator, colour inpainting is implemented by channel-wise processing.
In practice, we implement this method on a discrete input image f ∈ R^{n_x n_y} with resolution n_x × n_y. Discretising Eq. (1) with finite differences leads to a linear system of equations. Then, reconstructing the image u ∈ R^{n_x n_y} is achieved with a suitable numerical solver.
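To make this discretisation concrete, the following sketch assembles the linear system (C − (I − C)A) u = C f with a 5-point Laplacian under reflecting boundary conditions and solves it with a sparse direct solver. This is an illustrative sketch, not the paper's implementation; all function names are our own.

```python
import numpy as np
import scipy.sparse as sp
from scipy.sparse.linalg import spsolve

def laplacian(nx, ny):
    """5-point finite difference Laplacian with reflecting boundaries."""
    def lap1d(n):
        L = sp.diags([1.0, -2.0, 1.0], [-1, 0, 1], shape=(n, n)).tolil()
        L[0, 0] = L[-1, -1] = -1.0  # Neumann (reflecting) boundary conditions
        return L
    return (sp.kron(sp.identity(ny), lap1d(nx))
            + sp.kron(lap1d(ny), sp.identity(nx)))

def inpaint(f, c):
    """Homogeneous diffusion inpainting: solve (C - (I - C) A) u = C f."""
    ny, nx = f.shape
    A = laplacian(nx, ny)
    C = sp.diags(c.ravel())
    M = (C - (sp.identity(nx * ny) - C) @ A).tocsc()
    return spsolve(M, C @ f.ravel()).reshape(ny, nx)
```

For a direct solver this is feasible for moderate resolutions; at mask pixels the reconstruction reproduces the known values exactly, and elsewhere the discrete Laplacian vanishes.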
The discrete problem of mask optimisation for homogeneous diffusion inpainting consists in finding the binary mask c ∈ {0, 1}^{n_x n_y} with a user-specified target density d such that ‖c‖₁ / (n_x n_y) = d, where ‖·‖₁ denotes the 1-norm. This density can be seen as a budget that specifies the percentage of image pixels that should be contained in the final mask.
For comparisons, we consider the analytic approach of Belhachmi et al. [27]. It is based on results from the theory of shape optimisation which demonstrate that mask pixels should be placed at locations of large Laplacian magnitude. In the discrete setting, they use a Floyd-Steinberg dithering [60] of the Laplacian magnitude. This leads to an imperfect, but very fast approximation of the theoretical optimum. This algorithm is a representative of simple approaches that do not require any inpaintings to determine the optimised mask.
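The analytic approach can be sketched in a few lines: rescale the Laplacian magnitude so that its mean matches the target density, then apply Floyd-Steinberg error diffusion. The simple rescaling heuristic here is our own assumption for illustration; the original method derives the exact rescaling from shape optimisation theory.

```python
import numpy as np
from scipy.ndimage import laplace

def analytic_mask(f, density):
    """Dither the rescaled Laplacian magnitude into a binary mask
    with Floyd-Steinberg error diffusion."""
    s = np.abs(laplace(f))
    s *= density * s.size / max(s.sum(), 1e-12)  # mean equals target density
    s = np.clip(s, 0.0, 1.0)
    mask = np.zeros_like(s)
    err = s.copy()
    ny, nx = s.shape
    for y in range(ny):
        for x in range(nx):
            new = 1.0 if err[y, x] >= 0.5 else 0.0
            e = err[y, x] - new
            mask[y, x] = new
            # distribute the quantisation error to unvisited neighbours
            if x + 1 < nx:
                err[y, x + 1] += e * 7 / 16
            if y + 1 < ny:
                if x > 0:
                    err[y + 1, x - 1] += e * 3 / 16
                err[y + 1, x] += e * 5 / 16
                if x + 1 < nx:
                    err[y + 1, x + 1] += e * 1 / 16
    return mask
```

Because the error diffusion approximately preserves the mean, the resulting binary mask reaches roughly the requested density without a single inpainting.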
As a prototype for better performing mask optimisation algorithms, we consider the widely used probabilistic sparsification of Mainberger et al. [37]. It yields better results than the analytic approach by taking the discrete nature of the problem into account directly and greedily removing pixels that are not important for the reconstruction. It starts with a full inpainting mask. In each iteration, it removes a random fraction p of candidate pixels from the mask. After an inpainting with the new mask, it analyses the local inpainting error: Candidate pixels which have a high local inpainting error are hard to reconstruct and should thus not be removed. Therefore, the algorithm adds back the fraction q of candidates with the largest errors. The iterations are repeated until the target density d is reached. Further improvements can be achieved with the nonlocal pixel exchange [37]. It is designed to escape from potential local minima by moving a set of p candidate locations from the inpainting mask to random locations in the unknown image areas. If this positional exchange improves the overall inpainting, it is maintained, otherwise it is reverted. While this guarantees that the mask quality cannot deteriorate, each step requires an inpainting and therefore convergence tends to be slow.
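The sparsification loop above can be sketched as follows, assuming the candidate fractions p and q as free parameters and reusing a direct sparse solver for the inpainting; the paper's implementation details may differ.

```python
import numpy as np
import scipy.sparse as sp
from scipy.sparse.linalg import spsolve

def inpaint(f, c):
    """Homogeneous diffusion inpainting with a 5-point Laplacian."""
    ny, nx = f.shape
    def lap1d(n):
        L = sp.diags([1.0, -2.0, 1.0], [-1, 0, 1], shape=(n, n)).tolil()
        L[0, 0] = L[-1, -1] = -1.0
        return L
    A = sp.kron(sp.identity(ny), lap1d(nx)) + sp.kron(lap1d(ny), sp.identity(nx))
    C = sp.diags(c.ravel())
    M = (C - (sp.identity(nx * ny) - C) @ A).tocsc()
    return spsolve(M, C @ f.ravel()).reshape(ny, nx)

def sparsify(f, d, p=0.1, q=0.5, seed=0):
    """Probabilistic sparsification: start from a full mask and greedily
    discard pixels whose removal hurts the reconstruction the least."""
    rng = np.random.default_rng(seed)
    c = np.ones_like(f)
    target = int(d * f.size)
    while c.sum() > target:
        idx = np.flatnonzero(c.ravel())
        # remove a random fraction p of the current mask pixels
        cand = rng.choice(idx, size=max(1, int(p * idx.size)), replace=False)
        c.ravel()[cand] = 0.0
        # add back the fraction q of candidates with the largest local error
        err = np.abs(inpaint(f, c) - f).ravel()[cand]
        keep = cand[np.argsort(err)[::-1][: int(q * cand.size)]]
        c.ravel()[keep] = 1.0
    return c
```

Note that every iteration performs a full inpainting, which is exactly the cost that our deep optimisation avoids.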
In Fig. 2, a comparison of the three aforementioned spatial optimisation techniques with a uniformly random mask highlights their significant impact. Carefully optimised known data are integral for good inpainting results.
Since we consider homogeneous diffusion and do not require quantisation, we use a least squares approach for tonal optimisation. Due to the similar quality of the tonal methods from Section 1.2, we choose the Green's function formulation by Hoffmann et al. [48], equipped with a Cholesky solver. It offers good quality at fairly low computational cost, in particular for very sparse masks.
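Since inpainting is linear in the known values g, tonal optimisation amounts to the least squares problem min_g ‖M⁻¹ C g − f‖₂ with M = C − (I − C)A. The sketch below solves it matrix-free with LSQR instead of the Green's function/Cholesky approach used in the paper; the helper names are our own.

```python
import numpy as np
import scipy.sparse as sp
from scipy.sparse.linalg import splu, lsqr, LinearOperator

def tonal_optimise(f, c):
    """Least squares tonal optimisation: find values g minimising
    || B g - f ||_2, where B g = (C - (I - C) A)^{-1} C g is the
    inpainting of the values g on the fixed mask c."""
    ny, nx = f.shape
    n = nx * ny
    def lap1d(m):
        L = sp.diags([1.0, -2.0, 1.0], [-1, 0, 1], shape=(m, m)).tolil()
        L[0, 0] = L[-1, -1] = -1.0
        return L
    A = sp.kron(sp.identity(ny), lap1d(nx)) + sp.kron(lap1d(ny), sp.identity(nx))
    C = sp.diags(c.ravel())
    lu = splu((C - (sp.identity(n) - C) @ A).tocsc())
    B = LinearOperator((n, n), dtype=float,
                       matvec=lambda g: lu.solve(c.ravel() * g),
                       rmatvec=lambda v: c.ravel() * lu.solve(v, trans='T'))
    g = lsqr(B, f.ravel(), atol=1e-10, btol=1e-10)[0]
    u = lu.solve(c.ravel() * g)
    return g.reshape(ny, nx), u.reshape(ny, nx)
```

The matrix-free formulation avoids assembling the dense normal equations, trading the Cholesky factorisation for repeated sparse triangular solves.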
In the following sections we introduce a deep learning approach that does not require inpaintings during spatial or tonal optimisation and approximates the quality of probabilistic methods and model-based tonal optimisation.

Spatial and Tonal Optimisation with Surrogate Inpainting
In this section, we describe the three types of networks that act as the building blocks for our neural data optimisation framework. The centrepiece required for our different pipelines is the surrogate inpainting network. It approximates inpainting with homogeneous diffusion by minimising the residual of the inpainting equation. We only use it during training. Its sole purpose is to act as a fast approximate solver for the inpainting problem that is still differentiable and allows backpropagation.
For the data optimisation, we consider a mask network for spatial optimisation and a tonal network for optimisation of the pixel values. Each of them is trained together with a separate surrogate inpainting network. Both data optimisation networks minimise the inpainting error w.r.t. the reconstruction by the respective surrogate solver.
In addition, the mask network requires a separate loss to approximate the intended mask density d. The macro architecture of our spatial approach can be found in Fig. 3(a).
For the tonal setting in Fig. 3(b), we have a similar overall setup. However, here the binary masks are already part of the training dataset. In practice, we use our mask network to generate these inputs, but other sources such as model-based spatial optimisation approaches or even randomly generated masks could be used instead. Note that here, the optimised pixel values are fed into the surrogate solver instead of the original ones.
All three types of networks use a similar U-net structure [61] that we discuss in more detail in Section 3.4. In the following sections on the individual networks, we only discuss deviations from this standard U-net architecture.
Deploying our networks for practical applications comes down to first applying the mask network to the input image. The resulting mask is then optionally fed into the tonal network together with the original. This yields the complete known data for homogeneous diffusion inpainting. The surrogate solver is never used in an evaluation scenario. Instead, we use model-based inpainting.

The Surrogate Inpainting Network
To train our mask and tonal networks, we require backpropagation from inpainting results. For instance, this could be achieved by translating a classical discrete implementation of a diffusion process into a neural network [62], which results in a sequence of ResNet [63] blocks. However, this might require very deep networks to reach the steady state of the diffusion process, since the number of ResNet blocks is tied to the diffusion time in such a scenario. Instead, we propose an alternative that approximates inpainting results more efficiently by also having access to the ground truth.
The surrogate inpainting network I takes known data, specified in terms of the locations in a binary or non-binary mask c and pixel values g, as an input. Note that these known values do not necessarily need to coincide with the corresponding data in the original f. In addition, the network has access to the full original image f. It will only be used during training; for evaluation, a model-based solver is responsible for the inpainting. Therefore, having access to the unknown pixels in Ω \ K eases the network's task and does not compromise the validity of the data optimisation in any way.
The reconstruction u = I(f, g, c) should solve the discrete inpainting equation

C (u − g) − (I − C) A u = 0,    (2)

which is a discretised version of Eq. (1). The finite difference discretisation of the Laplacian is represented by the matrix A ∈ R^{n_x n_y × n_x n_y}, and C ∈ [0, 1]^{n_x n_y × n_x n_y} is a diagonal matrix containing the mask entries. Since the network aims at simulating a numerical solver for Eq. (2), we follow the ideas of Alt et al. [62] and define a corresponding residual loss

L_R(u, g, c) = ‖C (u − g) − (I − C) A u‖₂².    (3)

Here, ‖·‖₂ denotes the Euclidean norm. Note that the inpainting network is explicitly not trained to minimise any reconstruction loss w.r.t. the original f.
The residual loss only ensures that the network produces a good approximation of homogeneous diffusion inpainting given the mask c and the pixel values g. It follows principles similar to deep energy approaches [64]. This guarantees that the surrogate solver's access to the full original image does not skew the data optimisation.
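A numpy sketch of the residual of the discrete inpainting equation follows; the mean-squared normalisation is our own assumption, and replicate padding realises the reflecting boundary conditions.

```python
import numpy as np

def laplace_reflect(u):
    """5-point Laplacian with reflecting (replicate) boundaries."""
    p = np.pad(u, 1, mode='edge')
    return (p[:-2, 1:-1] + p[2:, 1:-1]
            + p[1:-1, :-2] + p[1:-1, 2:] - 4.0 * u)

def residual_loss(u, g, c):
    """Mean squared residual of C(u - g) - (I - C) A u = 0."""
    r = c * (u - g) - (1.0 - c) * laplace_reflect(u)
    return np.mean(r ** 2)
```

The loss vanishes exactly when u solves the inpainting equation for the given known data, regardless of how far u is from the original image.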

The Mask Network
Given the original image f, our mask network M outputs positional data in terms of the mask c = M(f) with a density d.

Non-binary Mask Networks
In the approach from our conference paper [42], our network outputs non-binary masks with values in [0, 1]. Our goal is that the choice of c should yield the best possible inpainting. Therefore, our network is equipped with an inpainting loss that measures the deviation of the reconstruction u from the original f in terms of

L_I(u, f) = ‖u − f‖₂².    (4)

While this loss establishes a connection between mask positions and reconstruction quality, it does not address the density. To this end, we apply a sigmoid activation at the last layer of our mask U-net, which limits the non-binary mask outputs to [0, 1]. If the preliminary mask ĉ exceeds the target density d, we rescale it according to

c = d ĉ / (‖ĉ‖₁ / (n_x n_y) + ε).    (5)

With ε = 10⁻⁵ we avoid rounding issues for very low estimated mask densities and potential division by zero.
During training, our network passes on the non-binary confidence values. Values close to 1 indicate that the mask network sees this position as highly important, and a value close to 0 marks unimportant positions. For practical applications, however, we still require binary masks. These can be extracted with a simple postprocessing step: Interpreting the confidence values as probabilities, we perform a weighted coin flip for each confidence value.
Our experiments show that this non-binary mask optimisation creates a challenging energy landscape. During the training process, the mask network can get stuck in local minima that assign equal confidence to every mask pixel. Combined with the coin flip, this can lead to a uniform random mask. As a remedy, we propose an additional mask loss L_M that acts as a regulariser by penalising the inverse variance of the mask:

L_M(c) = 1 / (V(c) + ε),    (6)

where V(c) denotes the variance of the mask values. As in Eq. (5), ε avoids division by zero. The regularisation parameter α balances the influence of the mask loss with the inpainting loss. Not only does this discourage flat masks with equal confidence in every pixel, but it also encourages confidence values close to 0 and 1. This yields the additional benefit of a closer approximation of binary masks during training.
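The non-binary mask post-processing pieces translate directly to code. This is a minimal sketch under the assumption that rescaling and regularisation take the forms described above; all function names are our own.

```python
import numpy as np

EPS = 1e-5  # avoids division by zero

def rescale(c_hat, d):
    """Rescale a non-binary mask if its mean density exceeds the target d."""
    dens = c_hat.mean()
    return c_hat if dens <= d else d * c_hat / (dens + EPS)

def mask_regulariser(c):
    """Inverse-variance penalty: discourages flat masks where every
    pixel receives the same confidence."""
    return 1.0 / (c.var() + EPS)

def coin_flip(c, seed=0):
    """Binarise by sampling each pixel with probability equal to its
    confidence value."""
    rng = np.random.default_rng(seed)
    return (rng.random(c.shape) < c).astype(float)
```

A perfectly flat confidence mask yields a regulariser value of 1/EPS, so the penalty pushes the network towards masks with clearly separated important and unimportant positions.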

Binary Mask Networks
Recently, strategies for deep data optimisation of neural network-based inpainting have been proposed that also allow direct output of binary masks [43]. This constitutes a challenge since the binarisation of real input values is a non-differentiable operation. However, end-to-end approaches that also learn the inpainting benefit from this binarisation, since the training of the inpainting network tends to be biased by a non-binary mask input. This leads to worse results during deployment of the inpainting network.
For our own strategy, we investigate two different alternatives for direct binarisation and evaluate their performance in Section 4.2.
Strategy 1: Quantisation. First, we directly adopt the strategy of Peter [43]: We interpret binarisation of x ∈ R by hard rounding x → ⌊x + 0.5⌋ as a very coarse quantisation. Theis et al. [65] have shown that simply approximating the derivative of this rounding operation by 1 yields very good results compared to more sophisticated alternatives.
For this strategy, the variance-based regularisation from Eq. (6) is not necessary. However, the enforcement of the target density via rescaling from the non-binary approach also does not work in this case. Therefore, we define the mask loss directly as the deviation from the target density d according to

L_M(c) = | ‖c‖₁ / (n_x n_y) − d |.    (7)

Since the mask contains only binary values, the 1-norm ‖·‖₁ yields the number of mask points, and thus the mask loss measures the deviation from the target density d. While the non-binary strategy does not require a density loss, we found in our experiments that it can have a stabilising effect on training if added to the regulariser loss from Eq. (6).
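A framework-agnostic sketch of Strategy 1: hard rounding in the forward pass with the straight-through derivative approximation, plus the density deviation loss. The helper names are our own.

```python
import numpy as np

def binarise(x):
    """Hard rounding of confidences; during backpropagation the
    derivative of the rounding is approximated by 1."""
    return np.floor(x + 0.5)

def binarise_st(x):
    """Straight-through formulation: in a DL framework, the bracketed
    difference would be wrapped in a stop-gradient, so gradients flow
    through the identity term x."""
    return x + (np.floor(x + 0.5) - x)

def density_loss(c, d):
    """Deviation of the mask density from the target d; for binary c,
    the 1-norm counts the mask points."""
    return abs(np.abs(c).sum() / c.size - d)
```

Numerically both binarisation variants agree; the straight-through form only changes which gradient an automatic differentiation framework would propagate.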
Strategy 2: Coin Flip. Instead of quantisation, we can also modify our non-binary approach to output binary masks. We keep the regularisation mask loss and rescaling from Section 3.2.1, yielding a non-binary confidence mask. However, during training, we directly add the coin flip binarisation. This can be seen as an alternative quantisation approach instead of the rounding operation in Strategy 1. We apply the same synthetic gradient as in the first binary mask approach.
In Section 4.2 we evaluate the binary and non-binary alternatives for mask generation in an ablation study.

The Tonal Network
Finally, our tonal network takes both the original image f and a mask c as an input.The mask can either originate from the mask network or an external source.
Fortunately, we do not require binarisation layers, since the input masks are already binary. Furthermore, the mask density is already fixed. Therefore, the tonal network uses the U-net described in Section 3.4 without further need for modifications. It feeds the optimised pixel values g = T(f, c) into the inpainting loss from Eq. (4).
The surrogate solver is trained with the residual loss w.r.t. the optimised known data, L_R(u, g, c), as well. While this works well, we have found in our experiments that the training of the surrogate solver can be stabilised by also minimising the residual L_R(u, f, c) w.r.t. the original known data. This provides a fixed reference point for the surrogate solver, since in contrast to g, the known data from f is not influenced by the training progress of the tonal network. This prevents the training of the surrogate solver from getting trapped in local minima.

Network Architecture
For all three networks, we use a U-net [61] architecture, since U-nets implement the core principles of multigrid solvers for PDE-based inpainting [62]. This makes them a perfect fit for the surrogate solver. U-nets and multigrid methods have in common that they operate on multiple scales, first restricting the image in multiple stages down to the coarsest scale and then prolongating it again to the finest scale. We follow this general structure in Fig. 4(a).
However, in contrast to our conference paper [42], we also rely on modifications to the standard U-net approach that were first used for inpainting by Vašata et al. [66]. They replace traditional convolutional layers by multiple parallel dilated convolutions with dilation factors 0, 2, and 5, followed by ELU activations. As shown in Fig. 4(b), the results are concatenated to a joint output. This so-called multiscale context aggregation was originally designed by Yu and Koltun [67] to increase the receptive field for segmentation. We discuss its benefits for our application in Section 4.2.1 with an ablation study.
For restriction, we also use context aggregation [67] with 5 × 5 dilated convolutions followed by 2 × 2 max pooling. The corresponding prolongation uses the same structure, but with 5 × 5 transposed convolutions and 2 × 2 upsampling. Two context aggregation blocks without any upsampling or max pooling perform postprocessing on the coarsest scale. The final hard sigmoid activation limits the results to the original image range [0, 1]. Only in the case of our binary mask networks, this is followed by a quantisation or coin flip binarisation layer. As is commonly the case in multiscale architectures, the number of channels increases for coarser scales. It ranges from 64 to 256 (see Fig. 4(a) for details), which is half of the channel bandwidth used by Vašata et al. [66]. In Section 4.2 we verify that such smaller networks suffice for our task.
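A toy numpy version of multiscale context aggregation: parallel à trous (dilated) convolutions whose ELU responses are stacked as channels. We read a dilation factor of 0 as an undilated kernel; this sketch uses fixed kernels instead of learned weights and is purely illustrative.

```python
import numpy as np
from scipy.ndimage import convolve

def dilate_kernel(k, rate):
    """À trous dilation: insert rate-1 zeros between the kernel taps."""
    if rate <= 1:
        return k
    kh, kw = k.shape
    out = np.zeros(((kh - 1) * rate + 1, (kw - 1) * rate + 1))
    out[::rate, ::rate] = k
    return out

def context_aggregation(x, kernels, rates=(1, 2, 5)):
    """Parallel dilated convolutions; their ELU responses are stacked
    as channels of a joint output."""
    elu = lambda z: np.where(z > 0.0, z, np.expm1(z))
    feats = [elu(convolve(x, dilate_kernel(k, r), mode='nearest'))
             for k, r in zip(kernels, rates)]
    return np.stack(feats, axis=0)
```

Dilating the kernels enlarges the receptive field without adding parameters, which is exactly why this construction helps the networks mimic the global influence of diffusion inpainting.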

Experimental Evaluation
After an overview of the technical details of our evaluation in Section 4.1, we justify our design decisions for the networks with an ablation study in Section 4.2. We compare with model-based approaches for spatial optimisation in Section 4.3 and with tonal optimisation methods in Section 4.4. In both cases, we assess reconstruction quality and speed.

Experimental Setup
Unless stated otherwise, all networks rely on the modified U-net architecture from Section 3.4 with ≈ 2.9 million parameters per network.
All of our networks have been trained on an Intel Xeon E5-2689 v4 CPU (2 cores), together with an Nvidia Pascal P100 16GB GPU. For training, we use a subset of 100,000 images randomly sampled from ImageNet [68] by Dai et al. [26] and the corresponding validation dataset containing 1,000 images. We use centre crops to reduce the size of the images, thus speeding up the training process. For model selection we crop to 64 × 64, while the remainder of the experiments are performed on size 128 × 128. All networks were trained with the Adam optimiser [69] and a learning rate of 5 · 10⁻⁵. We used 50 epochs for the spatial experiments and 100 for the tonal experiments. For evaluation, we used an AMD Ryzen 7 5800X CPU equipped with an Nvidia RTX 3090 24GB GPU. We performed model selection based on the lowest achieved inpainting error on the validation set. Our test set is based on all 500 images of the BSDS500 database [57]. These were centre cropped to size 128 × 128 in order to fit the size of the training data. The cropping also speeds up the model-based competitors and thus allows us to compare with them on a larger variety of images. We measure qualitative results with the peak signal-to-noise ratio (PSNR).
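For reference, the quality metric used throughout the evaluation is the standard PSNR; a minimal version for images in the range [0, peak] (here assuming peak = 1, matching the image range stated in Section 3.4) looks as follows.

```python
import numpy as np

def psnr(u, f, peak=1.0):
    """Peak signal-to-noise ratio in dB between reconstruction u
    and reference f, both with values in [0, peak]."""
    mse = np.mean((u.astype(float) - f.astype(float)) ** 2)
    return np.inf if mse == 0 else 10.0 * np.log10(peak ** 2 / mse)
```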
We compare with three spatial optimisation methods. The analytic approach of Belhachmi et al. [27] (AA) acts as a representative of very fast spatial optimisation. It is implemented with Floyd-Steinberg dithering [60] of the Laplace magnitude. Probabilistic sparsification (PS) in combination with a non-local pixel exchange (NLPE) provides qualitative benchmarks. These methods have been implemented with a conjugate gradient solver, ensuring convergence up to a relative residual of 10⁻⁶ for the diffusion inpainting. NLPE is run for 5 so-called cycles, each consisting of c1 iterations.
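A rough sketch of the AA pipeline just described: compute the Laplace magnitude, rescale it, and dither the result with Floyd-Steinberg error diffusion to obtain a binary mask. The rescaling to the target density is our own simplification (Belhachmi et al. derive the density function more carefully), so treat this as an illustration of the mechanism, not their exact method.

```python
import numpy as np

def laplace_magnitude(f):
    """Absolute value of the 5-point Laplacian with reflecting boundaries."""
    p = np.pad(f, 1, mode="edge")
    lap = p[:-2, 1:-1] + p[2:, 1:-1] + p[1:-1, :-2] + p[1:-1, 2:] - 4.0 * f
    return np.abs(lap)

def dither_mask(f, density):
    """Binary inpainting mask of roughly the given density via
    Floyd-Steinberg error diffusion of the rescaled Laplace magnitude."""
    m = laplace_magnitude(f)
    m = m * (density * m.size / max(m.sum(), 1e-12))  # mean value ~ density
    m = np.clip(m, 0.0, 1.0)
    h, w = m.shape
    mask = np.zeros_like(m)
    for y in range(h):
        for x in range(w):
            old = m[y, x]
            new = 1.0 if old >= 0.5 else 0.0
            mask[y, x] = new
            err = old - new                 # diffuse the quantisation error
            if x + 1 < w:               m[y, x + 1]     += err * 7 / 16
            if y + 1 < h and x > 0:     m[y + 1, x - 1] += err * 3 / 16
            if y + 1 < h:               m[y + 1, x]     += err * 5 / 16
            if y + 1 < h and x + 1 < w: m[y + 1, x + 1] += err * 1 / 16
    return mask
```

The dithering step is what makes AA essentially instantaneous: a single pass over the image, with no inpainting required.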

Ablation Study
In the following, we first evaluate different architectures and design principles to select the best among them for the comparison with model-based approaches.

Network Architecture
Compared to the standard U-net architecture in our conference publication [42], the modified U-net from Section 3.4 benefits from the context aggregation and more sophisticated postprocessing layers after upsampling to the finest scale. In [42], we used sequential 3 × 3 convolutions on each scale. Therefore, propagation of information over larger distances works mainly via downsampling to coarse scales and subsequent upsampling. On each individual scale, the receptive field of the simple convolutions is relatively small. In contrast, the context aggregation allows our network to perceive larger regions of the image on each individual scale. Our evaluation in Table 1(a) contrasts these modifications with the standard U-net using a similar total amount of weights. The modifications yield up to 2.3 dB improvement w.r.t. PSNR, especially on challenging very sparse masks.
We also evaluated other modifications to the U-net structure such as gated convolutions, but the context aggregation yielded the best combination of good qualitative performance and stability during training.
In Table 1(b), we compare the full-size U-net proposed by Vašata et al. [66] with our leaner version from Section 3.4 on 128 × 128 colour images. The large U-net uses twice the amount of channels in relation to Fig. 4(a) in all but the last two postprocessing layers. This results in ≈ 11.5 million parameters, compared to our significantly lower ≈ 2.9 million. The larger network does not yield a qualitative advantage over a wide range of densities: its PSNR results deviate only marginally from those of the small mask network in Table 1(b). However, it increases training times from 43 min to 93 min per epoch. Therefore, we use our lean nets instead.

Non-binary vs. Binary Masks
In Section 3.2, we have proposed three possible output options for our mask networks: non-binary masks, binary masks based on quantisation, and binary masks produced by a coin flip. For fully deep learning-based approaches [43], the binarisation during training is a key component of their architecture. Surprisingly, our ablation study in Table 1(a) paints a different picture: The non-binary mask network clearly outperforms both binary options. This results from a key difference between our method and fully deep learning-based approaches: Using a non-binary mask while simultaneously training an inpainting network introduces a bias that deteriorates inpainting quality during testing [26]. However, our surrogate solver is deployed only during training and is not coupled directly to an inpainting loss. It merely approximates diffusion-based inpainting. During testing, we use a model-based implementation of homogeneous diffusion inpainting.
Therefore, we benefit from a non-binary mask network that does not rely on synthetic gradients for binarisation layers. Consequently, we use the non-binary variant for our comparisons with model-based data optimisation.
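The three output options compare as follows in a minimal sketch. We assume a Keras-style hard sigmoid (slope 0.2 around 0.5) and a 0.5 quantisation threshold; the paper's exact parametrisation may differ, so these constants are assumptions for illustration.

```python
import numpy as np

def hard_sigmoid(x):
    """Hard sigmoid final activation: piecewise-linear clip to [0, 1]."""
    return np.clip(0.2 * x + 0.5, 0.0, 1.0)

def mask_output(logits, mode="non-binary", rng=None):
    """The three mask network output options: the raw confidence map,
    its 0/1 quantisation, or a Bernoulli 'coin flip' that treats the
    confidences as sampling probabilities."""
    c = hard_sigmoid(logits)
    if mode == "non-binary":
        return c
    if mode == "quantised":
        return (c >= 0.5).astype(float)
    if mode == "coin-flip":
        rng = rng if rng is not None else np.random.default_rng(0)
        return (rng.random(c.shape) < c).astype(float)
    raise ValueError(mode)
```

Only the non-binary branch is differentiable as-is; the two binary branches are exactly the cases that would require synthetic gradients during training.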

Spatial Optimisation
In our conference paper [42], we have shown that our approach yields similar results as probabilistic sparsification [37] on a small dataset of five greyscale images. Here we extend the evaluation of our improved networks to the significantly larger greyscale BSDS500 database in Fig. 7(a) and double the range of evaluated mask densities to 20%. Our mask network not only consistently outperforms both the analytic approach [27] (AA) and probabilistic sparsification [37] (PS), but also very closely approximates the quality of PS+NLPE.
The same ranking also applies in the case of the full colour version of BSDS500 in Fig. 7(b). Thus, our mask network rivals the best model-based approach in the comparison. Visually, it yields similar results as the probabilistic methods in Fig. 5 and Fig. 6. Especially for low densities, there is a large quality gap between the analytic approach and all other competitors.
Even though our mask net offers a similar quality as PS+NLPE, it requires significantly less computational time, since it does not rely on any inpaintings.
Thus, our mask network achieves our goal of providing an easy-to-use, parameter-free spatial optimisation that approximates the quality of stochastic methods at a computational cost close to the instantaneous analytic approach.

Tonal Optimisation
In Fig. 8(a), we compare our tonal network with the Green's function approach of Hoffmann [48] on the masks obtained from our mask network. Especially for sparse known data, our deep tonal optimisation reaches a similar quality as the model-based approach. Only above 15% do the improvements over the unoptimised data from the mask net decline.
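Tonal optimisation exploits the fact that homogeneous diffusion inpainting is linear in the known grey values g: the reconstruction can be written as u = Bg, so the optimal values solve a linear least-squares problem. The brute-force sketch below demonstrates this principle on tiny images by building B column by column with a simple fixed-point inpainting solver; it is not Hoffmann's Green's function algorithm, which avoids exactly this expensive construction.

```python
import numpy as np

def inpaint(f_known, mask, iters=2000):
    """Homogeneous diffusion inpainting by fixed-point iterations: known
    pixels are held fixed, unknown pixels converge to the average of their
    4 neighbours (reflecting boundaries)."""
    u = f_known * mask
    for _ in range(iters):
        p = np.pad(u, 1, mode="edge")
        avg = (p[:-2, 1:-1] + p[2:, 1:-1] + p[1:-1, :-2] + p[1:-1, 2:]) / 4.0
        u = mask * f_known + (1 - mask) * avg
    return u

def tonal_optimise(f, mask, iters=2000):
    """Least-squares optimal tonal values: build the linear map B column by
    column (one inpainting per mask pixel, feasible only for tiny images)
    and solve min_g ||B g - f||^2."""
    idx = np.flatnonzero(mask)
    B = np.zeros((f.size, idx.size))
    for j, k in enumerate(idx):
        e = np.zeros(f.size); e[k] = 1.0     # indicator data for mask pixel k
        B[:, j] = inpaint(e.reshape(f.shape), mask, iters).ravel()
    g, *_ = np.linalg.lstsq(B, f.ravel(), rcond=None)
    g_img = np.zeros_like(f)
    g_img.ravel()[idx] = g
    return g_img
```

By construction, the optimised values can never give a worse reconstruction error than the original grey values, since those are one feasible candidate in the least-squares problem.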
Our results in Fig. 5 show that our network approach also remains competitive with PS+NLPE when tonal optimisation is added. Here we apply the tonal network for our own deep learning method and the Green's function optimiser for all model-based competitors.
As for spatial optimisation, our tonal network offers a viable alternative for time-critical applications. Fig. 9(b) shows that the computational cost of the Green's function approach grows significantly with the number of mask values that need to be optimised. In contrast, the computational time of the tonal network is independent of the mask density. For densities larger than 5%, speed-ups by multiple orders of magnitude can be achieved with our tonal net. Thus, a combination of our spatial and tonal networks is a viable option for real-time applications that does not require sacrificing quality for speed.

Conclusions
Our data optimisation approach merges classical inpainting with partial differential equations and deep learning by means of a surrogate solver. This allows us to select both positions and values of known data for homogeneous diffusion inpainting such that they minimise the reconstruction error.
With this new strategy for sparse data optimisation, we obtain real-time results in hitherto unprecedented quality. They rival the reconstructions of probabilistic sparsification with postprocessing by non-local pixel exchange and tonal optimisation. Simultaneously, they reach the near-instantaneous speed of the qualitatively inferior analytic approach. This improvement of computational time by multiple orders of magnitude at comparable quality demonstrates the high potential of a fusion between model- and learning-based principles. We see this as a milestone on our way to bring the best of both worlds together in the area of inpainting and data optimisation.
In the future, we plan to incorporate our framework into image compression codecs. Time-consuming spatial and tonal optimisation still presents a bottleneck in this area. This holds true especially for practical applications with a high demand for computational efficiency, such as video coding. While real-time decoding is already possible with diffusion [70][71][72], the data selection during encoding will benefit from our deep optimisation.

Fig. 1
Fig. 1 Image Evolution of Homogeneous Diffusion Inpainting. This figure shows a reconstruction of image 130014 from the BSDS500 database [57] cropped to size 256 × 256. White mask pixels indicate a total of 10% known data. At time t = 0, we assign the original pixel values to known areas and initialise the unknown regions with zero (black). For t → ∞, diffusion propagates the known values and yields the inpainted image as the steady state.

Fig. 2
Fig. 2 Spatial Optimisation Techniques. For reference, we consider a uniformly random mask with 10% known data and the corresponding reconstruction of image 130014 from Fig. 1 with homogeneous diffusion inpainting. The analytic approach [27] (AA) already yields a significant improvement over the random mask. Probabilistic sparsification (PS) and non-local pixel exchange (NLPE) [37] refine the background as well as the fur patterns of the giraffe. A gain of more than 9 dB PSNR illustrates the vital importance of spatial optimisation for homogeneous diffusion inpainting.
(a) Training of the mask network. (b) Training of the tonal network.

Fig. 3
Fig. 3 Structure of Our Deep Tonal Optimisation Framework. The surrogate inpainting network and its associated residual loss are marked in yellow, the mask network and loss in blue, and the tonal network in red. Forward passes between the networks are indicated by solid arrows, while dashed arrows represent backpropagation.

Fig. 4
Fig. 4 Modified U-Net Architecture. (a) The original inputs are subsampled by max pooling three times, pass through the bottleneck, and are then prolongated by upsampling again to the finest scale. Exchange of information between scales is implemented by skip connections. Two postprocessing blocks with transposed convolutions conclude the pipeline. The numbers below each context aggregation block indicate the number of channels. (b) In a context aggregation block, three parallel dilated convolutions increase the receptive field of the filter. All results are concatenated together.

Fig. 6
Fig. 6 Visual Comparison for 10% Mask Density. Also at high density, our mask network yields results that are comparable to the probabilistic approaches PS and PS+NLPE [37]. The visual gap towards the analytic approach (AA) [27] is smaller, but still noticeable.

Fig. 7
Fig. 7 Spatial Optimisation. (a) On grey level images, our network consistently outperforms the analytic spatial optimisation (AA) [27] and probabilistic sparsification (PS) [37]. Especially for lower densities, Masknet results rival the quality of PS with non-local pixel exchange (NLPE) [37] as postprocessing. (b) For colour images, our mask network also closely approximates the quality of PS+NLPE for the whole range from 1% to 20%.

Fig. 8
Fig. 8 Tonal Optimisation. (a) We compare our tonal network with the Green's function approach [48] on masks from our mask network. Our tonal optimisation (TO) reaches a comparable quality up to 15% known data. (b) In a comparison that combines the spatially optimised mask with tonal optimisation, our full network approach yields results competitive with PS+NLPE combined with tonal optimisation. All model-based approaches use the Green's function approach [48] for tonal optimisation.

Fig. 9
Fig. 9 Runtime Comparison with Logarithmic Time Axis. (a) The analytic approach (AA) [27] is the fastest method, followed closely by our Masknet on the GPU and CPU. These three methods all have a constant speed, independent of the mask density. The speed of PS and PS+NLPE [37] is density dependent. Overall, our methods are consistently faster by several orders of magnitude compared to the probabilistic approaches. (b) The situation for tonal optimisation is comparable. The Green's function-based solver [48] becomes increasingly slower with rising mask density. Only for very low densities is its speed comparable to our tonal network on the CPU. The networks have constant runtime independent of density and are faster by 1 to 5 orders of magnitude.