Abstract
In recent years, classical knowledge-driven approaches for inverse problems have been complemented by data-driven methods exploiting the power of machine learning and, especially, deep learning. Purely data-driven methods, however, come with the drawback of disregarding prior knowledge of the problem, even though incorporating this knowledge into the problem-solving process has been shown to be beneficial.
We thus introduce an unpaired learning approach for learning posterior distributions of underdetermined inverse problems. It combines the advantages of deep generative modeling with established ideas of knowledge-driven approaches by incorporating prior information about the inverse problem. We develop a new neural network architecture ’UnDimFlow’ (short for Unequal Dimensionality Flow) consisting of two normalizing flows, one from the data space to the latent space and one from the latent space to the solution space. Additionally, we incorporate the forward operator to develop an unpaired learning method for the UnDimFlow architecture and propose a tailored point estimator to recover an optimal solution during inference. We evaluate our method on the two underdetermined inverse problems of image inpainting and super-resolution.
Notes
- 1. Note that although the third loss term might be sufficient to learn the correct mapping, in experiments the fourth loss term has been shown to accelerate and stabilize training.
- 2. Note that, for clarity, we denote all learned probability distributions by p and all modeled distributions by q.
References
Ardizzone, L., Kruse, J., Rother, C., Köthe, U.: Analyzing inverse problems with invertible neural networks. In: International Conference on Learning Representations (2018)
Ardizzone, L., Lüth, C., Kruse, J., Rother, C., Köthe, U.: Guided image generation with conditional invertible neural networks. arXiv preprint arXiv:1907.02392 (2019)
Arridge, S., Maass, P., Öktem, O., Schönlieb, C.B.: Solving inverse problems using data-driven models. Acta Numerica 28, 1–174 (2019)
Asim, M., Daniels, M., Leong, O., Ahmed, A., Hand, P.: Invertible generative models for inverse problems: mitigating representation error and dataset bias. In: International Conference on Machine Learning, pp. 399–409. PMLR (2020)
Benning, M., Burger, M.: Modern regularization methods for inverse problems. Acta Numerica 27, 1–111 (2018)
Chaudhuri, S.: Super-Resolution Imaging, vol. 632. Springer Science, Cham (2001)
Chen, Y., Ranftl, R., Pock, T.: Insights into analysis operator learning: from patch-based sparse models to higher order MRFs. IEEE Trans. Image Process. 23(3), 1060–1072 (2014)
Daras, G., Dean, J., Jalal, A., Dimakis, A.: Intermediate layer optimization for inverse problems using deep generative models. In: International Conference on Machine Learning, pp. 2421–2432. PMLR (2021)
Dashti, M., Stuart, A.M.: The Bayesian approach to inverse problems. In: Ghanem, R., Higdon, D., Owhadi, H. (eds.) Handbook of Uncertainty Quantification, pp. 311–428. Springer, Cham (2017). https://doi.org/10.1007/978-3-319-12385-1_7
Deco, G., Brauer, W.: Nonlinear higher-order statistical decorrelation by volume-conserving neural architectures. Neural Netw. 8(4), 525–535 (1995)
Dinh, L., Sohl-Dickstein, J., Bengio, S.: Density estimation using Real NVP. In: International Conference on Learning Representations (2017)
Engl, H.W., Hanke, M., Neubauer, A.: Regularization of Inverse Problems, vol. 375. Springer, Cham (1996)
Ho, J., Chen, X., Srinivas, A., Duan, Y., Abbeel, P.: Flow++: improving flow-based generative models with variational dequantization and architecture design. In: International Conference on Machine Learning, pp. 2722–2730. PMLR (2019)
Kingma, D.P., Dhariwal, P.: Glow: generative flow with invertible \(1\times 1\) convolutions. In: Proceedings of the 32nd International Conference on Neural Information Processing Systems, pp. 10236–10245 (2018)
Kobler, E., Effland, A., Kunisch, K., Pock, T.: Total deep variation for linear inverse problems. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 7549–7558 (2020)
Mairal, J., Ponce, J., Sapiro, G., Zisserman, A., Bach, F.: Supervised dictionary learning. In: Advances in Neural Information Processing Systems, vol. 21 (2008)
Meinhardt, T., Moller, M., Hazirbas, C., Cremers, D.: Learning proximal operators: using denoising networks for regularizing inverse imaging problems. In: Proceedings of the IEEE International Conference on Computer Vision, pp. 1781–1790 (2017)
Moeller, M., Mollenhoff, T., Cremers, D.: Controlling neural networks via energy dissipation. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 3256–3265 (2019)
Newman, M., Barkema, G.: Monte Carlo Methods in Statistical Physics, vol. 24. Oxford University Press, New York, USA (1999)
Padmanabha, G.A., Zabaras, N.: Solving inverse problems using conditional invertible neural networks. J. Comput. Phys. 433, 110194 (2021)
Rezende, D., Mohamed, S.: Variational inference with normalizing flows. In: International Conference on Machine Learning, pp. 1530–1538. PMLR (2015)
Romano, Y., Elad, M., Milanfar, P.: The little engine that could: regularization by denoising (red). SIAM J. Imaging Sci. 10(4), 1804–1844 (2017)
Scarlett, J., Heckel, R., Rodrigues, M.R., Hand, P., Eldar, Y.C.: Theoretical perspectives on deep learning methods in inverse problems. arXiv preprint arXiv:2206.14373 (2022)
Siahkoohi, A., Rizzuti, G., Louboutin, M., Witte, P., Herrmann, F.: Preconditioned training of normalizing flows for variational inference in inverse problems. In: Third Symposium on Advances in Approximate Bayesian Inference (2020)
Siahkoohi, A., Rizzuti, G., Witte, P.A., Herrmann, F.J.: Faster uncertainty quantification for inverse problems with conditional normalizing flows. arXiv preprint arXiv:2007.07985 (2020)
Sim, B., Oh, G., Kim, J., Jung, C., Ye, J.C.: Optimal transport driven CycleGAN for unsupervised learning in inverse problems. SIAM J. Imaging Sci. 13(4), 2281–2306 (2020)
Song, Y., Sohl-Dickstein, J., Kingma, D.P., Kumar, A., Ermon, S., Poole, B.: Score-based generative modeling through stochastic differential equations. In: International Conference on Learning Representations (2021)
Ulyanov, D., Vedaldi, A., Lempitsky, V.: Deep image prior. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 9446–9454 (2018)
Whang, J., Lindgren, E., Dimakis, A.: Approximate probabilistic inference with composed flows. In: NeurIPS 2020 Workshop on Deep Learning and Inverse Problems (2020)
Xiao, H., Rasul, K., Vollgraf, R.: Fashion-mnist: a novel image dataset for benchmarking machine learning algorithms. arXiv preprint arXiv:1708.07747 (2017)
Xiao, Z., Yan, Q., Amit, Y.: A method to model conditional distributions with normalizing flows. arXiv preprint arXiv:1911.02052 (2019)
Acknowledgements
CR acknowledges support from the Cantab Capital Institute for the Mathematics of Information (CCIMI) and the EPSRC grant EP/W524141/1. MM acknowledges the support of the German Research Foundation Grant MO 2962/7-1. CBS acknowledges support from the Philip Leverhulme Prize, the Royal Society Wolfson Fellowship, the EPSRC advanced career fellowship EP/V029428/1, EPSRC grants EP/S026045/1 and EP/T003553/1, EP/N014588/1, EP/T017961/1, the Wellcome Innovator Awards 215733/Z/19/Z and 221633/Z/20/Z, the European Union Horizon 2020 research and innovation programme under the Marie Skodowska-Curie grant agreement No. 777826 NoMADS, the CCIMI and the Alan Turing Institute. CE acknowledges support from the Wellcome Innovator Award RG98755.
A Appendix
In the following, we provide additional theoretical results as well as experimental details for all experiments described previously.
A.1 Additional Theoretical Results
This subsection details some of the theoretical results from the main part of the paper.
Further Analysis of the Third and Fourth Loss Terms. To show that \((g_{\theta_2}^{-1} \circ f_{\theta_1})\) learns a correct mapping between the data measurement space Y and the model parameter space X when minimizing the energy function E in Eq. (2), we assume E to have fully converged, i.e., \(\hat{\theta}_1, \hat{\theta}_2 := \mathop{\textrm{argmin}}\limits_{\theta_1, \theta_2} E(\theta_1, \theta_2)\). Explicitly, for the third loss term this means that \(\mathbb{E}_{\mathbf{y}, \mathbf{z}_2}\big[\Vert \textrm{A}\big(g_{\hat{\theta}_2}^{-1}(f_{\hat{\theta}_1}(y), z_2)\big) - y\Vert_2^2\big]\) is minimal. For this expectation to be minimal, \(\Vert \textrm{A}\big(g_{\hat{\theta}_2}^{-1}(f_{\hat{\theta}_1}(y), z_2)\big) - y\Vert_2^2\) has to be minimal as well. This means that the forward operator cancels out all values of x that map to \(Z_2\) through flow \(g_{\theta_2}\). The remaining values of x, after applying the forward operator, need to match those in y, up to a degree determined by the noise level. Two cases can occur: In the first case, \(\Vert f_{\hat{\theta}_1}^{-1}(g_{\hat{\theta}_2}^{\langle z_1 \rangle}(x)) - y\Vert_2^2\) is minimal as well, i.e., the part of x that maps first to \(Z_1\) and then to Y through flow \(f_{\theta_1}^{-1}\), denoted \(f_{\hat{\theta}_1}^{-1}\big(g_{\hat{\theta}_2}^{\langle z_1 \rangle}(x)\big)\), equals y. In the second case, since the expectation in the loss term still has to be minimal, parts of the sampled \(z_2\) would have to compensate for the differences in the inverse mapping. Because \(z_2\) is sampled randomly during training, however, the second case can never make the expectation in the loss term minimal; thus, the third loss term enforces the correct mapping.
For a fully converged energy function E, the fourth loss term \(\mathbb{E}_{\mathbf{x}, \mathbf{z}_2}\big[\Vert g_{\theta_2}^{-1}\big(f_{\theta_1}(\textrm{A}(x)), z_2\big) - x\Vert_2^2\big]\) is likewise minimal. In contrast to the third loss term, the fourth loss term involves an additional source of randomness, as the expectation is computed over \(\mathbf{x}\) living in the higher-dimensional model parameter space X. When computing the squared \(L^2\) distance in X after applying the forward operator A, with its inherent information loss, and compensating for the missing information by randomly sampling \(z_2\), the expectation \(\mathbb{E}_{\mathbf{x}, \mathbf{z}_2}\) can never reach zero because of this randomness. Still, similarly to the third loss term, \(\Vert g_{\theta_2}^{-1}\big(f_{\theta_1}(\textrm{A}(x)), z_2\big) - x\Vert_2^2\) is minimized if either \(f_{\hat{\theta}_1}^{-1}\big(g_{\hat{\theta}_2}^{\langle z_1 \rangle}(x)\big)\) and y are equal or parts of the sampled \(z_2\) compensate for the differences, which, due to the randomness involved, never minimizes the expectation. Additionally, since the \(L^2\) distance is computed in the higher-dimensional space, another possible minimum of the squared \(L^2\) distance is a suboptimal mapping between Y and X whenever parts of \(z_2\) by chance map to the correct values of the ‘\(Z_2\)’-part of x. Due to the random sampling of \(z_2\), however, this never minimizes the expectation either.
In other words, whenever information is irrevocably discarded, as when applying the forward operator to the higher-dimensional x, and random sampling is used to compensate for this missing information, the \(L^2\) distance between the input x and the reconstructed solution \(x'\) is minimal only if all values retained after applying the forward operator are mapped back to the corresponding values in X. Thus, the fourth loss term also enforces the correct mapping between Y and X.
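To make the roles of the two loss terms concrete, the following toy sketch instantiates them numerically. The subsampling forward operator and the identity-like placeholder maps standing in for the trained flows are illustrative assumptions, not the paper's networks; the point is only that the third loss can vanish regardless of the sampled \(z_2\), while the fourth loss cannot.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy setting: x lives in R^8, y = A(x) in R^4.
def A(x):
    """Forward operator: keep every other entry (information is lost)."""
    return x[..., ::2]

# Placeholder invertible maps standing in for the trained flows
# f_theta1 : Y -> Z1 and g_theta2^{-1} : (Z1, Z2) -> X.
def f(y):
    return y

def g_inv(z1, z2):
    """Interleave (z1, z2) back into x."""
    x = np.empty(z1.shape[:-1] + (z1.shape[-1] + z2.shape[-1],))
    x[..., ::2] = z1
    x[..., 1::2] = z2
    return x

def third_loss(y, z2):
    """Single-sample estimate of E_{y,z2} ||A(g^{-1}(f(y), z2)) - y||^2."""
    return np.sum((A(g_inv(f(y), z2)) - y) ** 2)

def fourth_loss(x, z2):
    """Single-sample estimate of E_{x,z2} ||g^{-1}(f(A(x)), z2) - x||^2."""
    return np.sum((g_inv(f(A(x)), z2) - x) ** 2)

x = rng.normal(size=8)
y = A(x)
z2 = rng.normal(size=4)

# The third loss vanishes for this idealized pair of maps: the retained
# entries of x are reproduced exactly, independent of the sampled z2.
print(third_loss(y, z2))   # -> 0.0
# The fourth loss is generically positive: the randomly sampled z2
# cannot recover the entries discarded by A.
print(fourth_loss(x, z2))
```

This mirrors the argument above: only the part of x that survives the forward operator constrains the third loss, whereas the fourth loss also compares against the discarded part, which random sampling of \(z_2\) can never match in expectation.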
Detailed Derivation for Computing \(p_{\mathbf{x} \mid \mathbf{y}, \mathbf{z}_2}(x \mid y, z_2)\)
In step [1], we apply the change-of-variables formula; we then use the independence of \(p_{\mathbf{z}_1}\) and \(p_{\mathbf{z}_2}\) to substitute \(p_{\mathbf{z}}\) by \(p_{\mathbf{z}_1} \cdot p_{\mathbf{z}_2}\), together with the fact that, by definition, \(x = g_{\theta_2}^{-1}(z_1, z_2)\) (see step [2]). Making use of the inverse function theorem, we replace the Jacobian of \(g_{\theta_2}\) by the inverse of the Jacobian of \(g_{\theta_2}^{-1}\) (step [3]), and in step [4] pull out the inverse, since the determinant of the inverse of a matrix equals the inverse of its determinant, i.e., \(\det(A^{-1}) = \det(A)^{-1}\). As the determinant is a scalar and the inverse of a scalar s is \(\frac{1}{s}\), the absolute value of this fraction simplifies as \(|s^{-1}| = |\frac{1}{s}| = \frac{1}{|s|} = |s|^{-1}\) (step [5]). In the last step, we replace \(z_1\) with \(f_{\theta_1}(y)\), as by definition \(z_1 = f_{\theta_1}(y)\). The resulting term is easy to compute: y is given, \(z_2\) is sampled from \(p_{\mathbf{z}_2}\), and both \(p_{\mathbf{z}_1}\big(f_{\theta_1}(y)\big)\) and \(p_{\mathbf{z}_2}(z_2)\) are simple probability distributions from which we can sample and whose densities we can evaluate easily. The last factor involves the absolute value of the determinant of the Jacobian of the inverse of flow \(g_{\theta_2}\); since the invertible layers of normalizing flows allow for efficient computation of their Jacobian determinants, this factor is easily computable as well.
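Written out, the chain of steps reads roughly as follows. This is a reconstruction from the step descriptions above (the Jacobian notation \(J\) is ours), not the paper's original display:

```latex
\begin{aligned}
p_{\mathbf{x} \mid \mathbf{y}, \mathbf{z}_2}(x \mid y, z_2)
 &\overset{[1]}{=} p_{\mathbf{z}}\bigl(g_{\theta_2}(x)\bigr)\,
    \bigl|\det J_{g_{\theta_2}}(x)\bigr| \\
 &\overset{[2]}{=} p_{\mathbf{z}_1}(z_1)\, p_{\mathbf{z}_2}(z_2)\,
    \bigl|\det J_{g_{\theta_2}}\bigl(g_{\theta_2}^{-1}(z_1, z_2)\bigr)\bigr| \\
 &\overset{[3]}{=} p_{\mathbf{z}_1}(z_1)\, p_{\mathbf{z}_2}(z_2)\,
    \bigl|\det J_{g_{\theta_2}^{-1}}(z_1, z_2)^{-1}\bigr| \\
 &\overset{[4],[5]}{=} p_{\mathbf{z}_1}(z_1)\, p_{\mathbf{z}_2}(z_2)\,
    \bigl|\det J_{g_{\theta_2}^{-1}}(z_1, z_2)\bigr|^{-1} \\
 &\overset{[6]}{=} p_{\mathbf{z}_1}\bigl(f_{\theta_1}(y)\bigr)\,
    p_{\mathbf{z}_2}(z_2)\,
    \bigl|\det J_{g_{\theta_2}^{-1}}\bigl(f_{\theta_1}(y), z_2\bigr)\bigr|^{-1}
\end{aligned}
```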
Detailed Derivation of the MAP Estimator. For optimization purposes, we formulate the optimization problem in Eq. (7) as a minimization problem, i.e.,
While in step [2] we simply reformulate the maximization problem as a minimization problem, in step [3] we use the fact that, by monotonicity of the natural logarithm, minimizing a positive function d(x) is equivalent to minimizing \(\log d(x)\). Substituting the result of Eq. (7) for \(p_{\mathbf{x}}\big(g_{\theta_2}^{-1}(z_1, z_2)\big)\) (step [4]) and applying logarithmic identities (step [5]), we simplify the original equation so that \(p_{\mathbf{z}_1}\big(f_{\theta_1}(y)\big)\) cancels in step [6], since \(\mathbf{z}_1\) and \(\mathbf{z}_2\) are independent random variables and we optimize solely over \(z_2 \in Z_2\). Applying logarithmic identities again (steps [7] and [8]), we end up with two terms that are easy to evaluate: the first term computes the log-probability of our simple base distribution, and, by construction, the log-determinant of a normalizing flow is simple to calculate as well.
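The derivation sketched above can be summarized as follows. This is our reconstruction from the step descriptions (the starting maximization over \(z_2\) and the Jacobian notation \(J\) are assumptions), not the paper's original display:

```latex
\begin{aligned}
\hat{z}_2
 &= \operatorname*{argmax}_{z_2 \in Z_2}\;
    p_{\mathbf{x}}\bigl(g_{\theta_2}^{-1}(z_1, z_2)\bigr) \\
 &\overset{[2],[3]}{=} \operatorname*{argmin}_{z_2 \in Z_2}\;
    -\log p_{\mathbf{x}}\bigl(g_{\theta_2}^{-1}(z_1, z_2)\bigr) \\
 &\overset{[4],[5]}{=} \operatorname*{argmin}_{z_2 \in Z_2}\;
    -\log p_{\mathbf{z}_1}\bigl(f_{\theta_1}(y)\bigr)
    -\log p_{\mathbf{z}_2}(z_2)
    +\log \bigl|\det J_{g_{\theta_2}^{-1}}\bigl(f_{\theta_1}(y), z_2\bigr)\bigr| \\
 &\overset{[6]\text{--}[8]}{=} \operatorname*{argmin}_{z_2 \in Z_2}\;
    -\log p_{\mathbf{z}_2}(z_2)
    +\log \bigl|\det J_{g_{\theta_2}^{-1}}\bigl(f_{\theta_1}(y), z_2\bigr)\bigr|
\end{aligned}
```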
Detailed Derivation of EDDO
We first insert the definition of \(q_{\mathbf{y} \mid \mathbf{x}}\) (step [2]) and use the fact that minimizing a function is equivalent to minimizing the natural logarithm of this function (step [3]); in step [4], we drop the factor \(\frac{1}{2\sigma^2}\), as it is a positive constant. Reformulating the maximization as a minimization problem (step [5]) and substituting \(z_1\) by \(f_{\theta_1}(y)\) in step [6], we are left with inserting the result of Eq. (13) for \(p_{\mathbf{x}}\big(g_{\theta_2}^{-1}\big(f_{\theta_1}(y), z_2\big)\big)\).
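To make the resulting inference objective concrete, the following toy example optimizes \(z_2\) for a fixed measurement by gradient descent. The subsampling operator A, the linear invertible map standing in for \(g_{\theta_2}^{-1}\), the Gaussian prior term on \(z_2\), and all dimensions are illustrative assumptions, not the paper's setup; for a linear map the log-determinant term is constant and is therefore dropped.

```python
import numpy as np

rng = np.random.default_rng(1)
n, m = 8, 4                        # dim(X) = 8, dim(Y) = dim(Z_1) = 4

# Toy invertible linear map standing in for the flow inverse g_{theta2}^{-1}.
W = np.eye(n) + 0.1 * rng.normal(size=(n, n))

def A(x):
    """Forward operator: keep every other entry of x."""
    return x[::2]

def g_inv(z1, z2):
    return W @ np.concatenate([z1, z2])

def objective(z1, z2, y, lam):
    """Data discrepancy plus a Gaussian prior on z2."""
    return np.sum((A(g_inv(z1, z2)) - y) ** 2) + lam * 0.5 * np.sum(z2 ** 2)

# Ground truth, its measurement, and z1 playing the role of f_{theta1}(y).
x_true = rng.normal(size=n)
y = A(x_true)
z1 = rng.normal(size=m)
lam = 0.1

# Plain gradient descent on z2, the only free variable at inference time.
AW = W[::2, :]                     # rows of W seen by the forward operator
z2_init = rng.normal(size=n - m)
z2 = z2_init.copy()
for _ in range(1000):
    r = AW @ np.concatenate([z1, z2]) - y
    z2 -= 0.05 * (2 * AW[:, m:].T @ r + lam * z2)

print(objective(z1, z2, y, lam) < objective(z1, z2_init, y, lam))  # -> True
```

In the actual method, the gradient of the flow-based log-density would replace the hand-derived quadratic gradient used here.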
Supervised Training with UnDimFlow. In addition to the proposed unpaired learning method, which makes use of the forward operator A, the proposed approach is also capable of learning underdetermined inverse problems with an unknown forward operator A, under the assumption that paired training data (y, x) is available. In the following, we briefly describe supervised training with UnDimFlow by presenting and explaining the energy function \(E(\theta_1, \theta_2)\).
Supervised training of the composed normalizing flows is realized by minimizing an energy function \(E(\theta _1, \theta _2)\) consisting of four loss terms:
The first two terms of the energy function train the combination of both flows in a supervised manner. In detail, the first loss term computes the squared \(L^2\) distance between data measurement samples \(y \in Y\) and the output of the inverse function composition \(h_{\theta_1, \theta_2}^{-1}\) of both flows, where \(h_{\theta_1, \theta_2}^{-1} := f_{\theta_1}^{-1}\big({g_{\theta_2}}_{\langle z_1 \rangle}(x)\big)\).
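As an illustration, the first loss term described above can be written as follows; this is a sketch reconstructed from the description (the expectation over paired samples is our assumption), not the paper's displayed equation:

```latex
\mathcal{L}_1(\theta_1, \theta_2)
  = \mathbb{E}_{(\mathbf{x}, \mathbf{y})}
    \Bigl[\bigl\| y - h_{\theta_1, \theta_2}^{-1}(x) \bigr\|_2^2\Bigr],
\qquad
h_{\theta_1, \theta_2}^{-1}(x)
  := f_{\theta_1}^{-1}\bigl({g_{\theta_2}}_{\langle z_1 \rangle}(x)\bigr).
```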
A.2 Further Results for Super-Resolution Under High Uncertainty
To show the variety of suitable solutions that our approach is able to learn, we additionally provide nine examples for seven different test images of the Fashion-MNIST dataset. Fixing the input image y, we randomly sample from the posterior. Figure 6 shows the results and the input image (rightmost column). All samples visually look like valid solutions to the inverse problem while showing minor differences in uncertain regions of the images.
To further analyze the proposed EDDO estimator, we additionally conduct experiments for five initializations of \(z_2\) and test different weightings of the regularization term \(p_{\mathbf{x} \mid \mathbf{y}}(x \mid y)\) by scaling the hyperparameter \(\lambda\) accordingly. The results are depicted in Fig. 7. Each row of Fig. 7 shows the results for the five initializations of \(z_2\) for the specific value of \(\lambda\) stated to the left of the row. It can be seen that for increasing weight of the regularization term, the generated images become smoother and tend towards more extreme pixel values, i.e., pixels become either black or white, reducing the range of grayscale values.
A.3 Experimental Details
Dataset. Throughout the experiments in this paper, we make use of the Fashion-MNIST dataset [30]. Introduced by Zalando Research in 2017, Fashion-MNIST contains images of fashion objects. The images have a resolution of \(28\times 28\) pixels and are grouped into ten classes: ’T-Shirt/Top’, ’Trousers’, ’Pullover’, ’Dress’, ’Coat’, ’Sandals’, ’Shirt’, ’Sneaker’, ’Bag’ and ’Ankle boots’ (Fig. 8).
Network Architecture. To show that the proposed method is capable of learning multiple inverse problems, we use the same UnDimFlow network architecture, which we train and test on the different inverse problems. The overall network architecture is summarized in Fig. 9. It consists of two normalizing flows \(f_{\theta_1}\) and \(g_{\theta_2}\), which are connected through the latent space \(Z_1\). Both flows use Glow-like building blocks (cf. [14]) as their core components. In contrast to standard Glow building blocks, we apply a random permutation, as this has been shown to improve stability during training in our experiments (Fig. 10). Each block therefore computes
for an activation function \(\sigma \), a permutation matrix P, a scaling parameter \(s_{\text {an}}\), a bias parameter \(b_{\text {an}}\) (both parts of the ActNorm layer [14]) and the affine coupling function \(f_{\text {ac}}\) [11].
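The displayed equation defining the block is not reproduced here, so the following sketch assembles a block from the named components in the standard Glow order (ActNorm, permutation, affine coupling); the composition order, the toy subnetwork, and all parameter values are illustrative assumptions, and the activation \(\sigma\) of the block definition is folded into the coupling's subnetwork:

```python
import numpy as np

rng = np.random.default_rng(2)
d = 8                                  # toy feature dimension (even)

# Fixed random permutation, replacing Glow's learned 1x1 convolution.
perm = rng.permutation(d)
inv_perm = np.argsort(perm)

# ActNorm parameters s_an, b_an (normally data-initialized; random here).
s_an = rng.uniform(0.5, 1.5, size=d)
b_an = rng.normal(size=d)

def subnet(h):
    """Toy subnetwork producing log-scale and translation for the coupling
    layer; stands in for the convolutional / fully-connected blocks."""
    out = np.tanh(h)
    return 0.5 * out, out              # (log-scale s, translation t)

def block_forward(x):
    h = s_an * x + b_an                # ActNorm
    h = h[perm]                        # random permutation
    h1, h2 = h[: d // 2], h[d // 2:]   # affine coupling f_ac
    s, t = subnet(h1)
    return np.concatenate([h1, h2 * np.exp(s) + t])

def block_inverse(z):
    z1, z2 = z[: d // 2], z[d // 2:]
    s, t = subnet(z1)                  # s, t depend only on the unchanged half
    h = np.concatenate([z1, (z2 - t) * np.exp(-s)])
    h = h[inv_perm]                    # undo the permutation
    return (h - b_an) / s_an           # undo ActNorm

x = rng.normal(size=d)
x_rec = block_inverse(block_forward(x))
print(np.allclose(x, x_rec))           # -> True
```

Because the scale and translation depend only on the half of the features left unchanged by the coupling, each step is exactly invertible, which is what makes the composed flow invertible as a whole.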
We choose the subnetworks s and t of the affine coupling layers to be either shallow convolutional neural networks (denoted as ’Convolutional block’ in Fig. 9) or fully-connected neural networks (denoted as ’Fully-connected block’ in Fig. 9). The convolutional subnetworks consist of three convolutional layers with kernel sizes of \(3\times 3\), \(1\times 1\) and again \(3 \times 3\), respectively. For the fully-connected subnetworks, we apply three fully-connected layers. Both types of subnetworks additionally use ReLU activations between the linear layers.
Additionally, as an increased number of channels has proven beneficial when using affine coupling layers, we incorporate downsampling operations via Haar transformations to invertibly increase the number of channels while decreasing the spatial dimensionality.
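The invertible Haar downsampling can be sketched as follows on a single-channel image; the normalization factor and the band ordering are our assumptions, chosen so that the transform on each \(2\times 2\) block is orthogonal:

```python
import numpy as np

def haar_squeeze(x):
    """2D Haar transform on non-overlapping 2x2 blocks: halves each spatial
    dimension and quadruples the channel count, invertibly."""
    a = x[0::2, 0::2]
    b = x[0::2, 1::2]
    c = x[1::2, 0::2]
    d = x[1::2, 1::2]
    ll = (a + b + c + d) / 2           # low-pass (average) band
    lh = (a - b + c - d) / 2           # horizontal detail
    hl = (a + b - c - d) / 2           # vertical detail
    hh = (a - b - c + d) / 2           # diagonal detail
    return np.stack([ll, lh, hl, hh], axis=0)

def haar_unsqueeze(y):
    """Exact inverse: recombine the four bands into the 2x2 pixel blocks."""
    ll, lh, hl, hh = y
    a = (ll + lh + hl + hh) / 2
    b = (ll - lh + hl - hh) / 2
    c = (ll + lh - hl - hh) / 2
    d = (ll - lh - hl + hh) / 2
    x = np.empty((2 * ll.shape[-2], 2 * ll.shape[-1]))
    x[0::2, 0::2], x[0::2, 1::2] = a, b
    x[1::2, 0::2], x[1::2, 1::2] = c, d
    return x

rng = np.random.default_rng(0)
img = rng.normal(size=(28, 28))        # Fashion-MNIST resolution
bands = haar_squeeze(img)
print(bands.shape)                     # -> (4, 14, 14)
```

A \(28\times 28\) image thus becomes a \(4\times 14\times 14\) tensor, giving the coupling layers enough channels to split while losing no information.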
For variational dequantization [13], we use a conditional normalizing flow and condition the aforementioned Glow-like building blocks on the image, i.e., on y for the variational dequantization network of flow \(f_{\theta_1}\) and on x for that of flow \(g_{\theta_2}\). We use the most common technique for incorporating conditions into Glow-like building blocks and simply pass the condition as an additional input to the affine coupling layer. As our experiments use grayscale images, we need to increase the number of channels to be able to use affine coupling layers. We thus apply a checkerboard mask, i.e., a pixel shuffling technique that increases the number of channels from one to two by moving every other pixel to the second channel. We apply this transformation to the condition image and to the input of the variational dequantization network after applying the logit function. Afterwards, we apply two Glow-like building blocks before transforming back via an inverse checkerboard mask and a sigmoid transformation.
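The checkerboard mask described above can be sketched as follows; the exact packing order of the pixels within each channel is our assumption, the only requirement being that the operation is exactly invertible:

```python
import numpy as np

def checkerboard_split(img):
    """Split an H x W single-channel image into two channels of shape
    H x (W // 2): even-parity pixels (i + j even) and odd-parity pixels.
    Assumes W is even so each row contributes W // 2 pixels per channel."""
    h, w = img.shape
    even = np.empty((h, w // 2), dtype=img.dtype)
    odd = np.empty((h, w // 2), dtype=img.dtype)
    even[0::2], odd[0::2] = img[0::2, 0::2], img[0::2, 1::2]
    even[1::2], odd[1::2] = img[1::2, 1::2], img[1::2, 0::2]
    return np.stack([even, odd])

def checkerboard_merge(ch):
    """Exact inverse: scatter the two channels back onto the checkerboard."""
    even, odd = ch
    h, w2 = even.shape
    img = np.empty((h, 2 * w2), dtype=even.dtype)
    img[0::2, 0::2], img[0::2, 1::2] = even[0::2], odd[0::2]
    img[1::2, 1::2], img[1::2, 0::2] = even[1::2], odd[1::2]
    return img

rng = np.random.default_rng(0)
img = rng.normal(size=(28, 28))        # grayscale Fashion-MNIST resolution
channels = checkerboard_split(img)
print(channels.shape)                  # -> (2, 28, 14)
```

With two channels available, a standard affine coupling layer can transform one channel conditioned on the other.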
Overview of Subnetworks for Affine Coupling Layers. For the affine coupling layers, we use shallow convolutional and fully-connected subnetworks. An overview of these subnetworks is given in Tables 1 and 2.
General Training Settings. An overview of the general training settings can be found in Table 3.
We use the Fashion-MNIST dataset with the default train/test split and no data pre-processing during training and testing for all experiments except the partitioning experiments, for which we use random horizontal flipping during training.
All experiments were conducted on an NVIDIA Tesla V100 with 5120 CUDA cores and 16 GB of HBM2 memory at 3200 MHz on the OMNI computing cluster of the University of Siegen.
Image Inpainting Experiments. For image inpainting, the general training settings have been extended by the following weightings of the individual loss terms of the energy function (Tables 4 and 5):
Image Super-Resolution Experiments. For image super-resolution, the general training settings have been extended by the following weightings of the individual loss terms of the energy function:
Copyright information
© 2023 The Author(s), under exclusive license to Springer Nature Switzerland AG
Cite this paper
Runkel, C., Moeller, M., Schönlieb, CB., Etmann, C. (2023). Learning Posterior Distributions in Underdetermined Inverse Problems. In: Calatroni, L., Donatelli, M., Morigi, S., Prato, M., Santacesaria, M. (eds) Scale Space and Variational Methods in Computer Vision. SSVM 2023. Lecture Notes in Computer Science, vol 14009. Springer, Cham. https://doi.org/10.1007/978-3-031-31975-4_15
Print ISBN: 978-3-031-31974-7
Online ISBN: 978-3-031-31975-4