
Learning Posterior Distributions in Underdetermined Inverse Problems

  • Conference paper
  • In: Scale Space and Variational Methods in Computer Vision (SSVM 2023)

Abstract

In recent years, classical knowledge-driven approaches for inverse problems have been complemented by data-driven methods exploiting the power of machine and especially deep learning. Purely data-driven methods, however, come with the drawback of disregarding prior knowledge of the problem even though it has shown to be beneficial to incorporate this knowledge into the problem-solving process.

We thus introduce an unpaired learning approach for learning posterior distributions of underdetermined inverse problems. It combines advantages of deep generative modeling with established ideas of knowledge-driven approaches by incorporating prior information about the inverse problem. We develop a new neural network architecture ’UnDimFlow’ (short for Unequal Dimensionality Flow) consisting of two normalizing flows, one from the data to the latent, and one from the latent to the solution space. Additionally, we incorporate the forward operator to develop an unpaired learning method for the UnDimFlow architecture and propose a tailored point estimator to recover an optimal solution during inference. We evaluate our method on the two underdetermined inverse problems of image inpainting and super-resolution.


Notes

  1. Note that although the third loss term might be sufficient to learn the correct mapping, in our experiments the fourth loss term has been shown to accelerate and stabilize training.

  2. Note that, for clarity, we denote all learned probability distributions as p while denoting the modeled distributions as q.

References

  1. Ardizzone, L., Kruse, J., Rother, C., Köthe, U.: Analyzing inverse problems with invertible neural networks. In: International Conference on Learning Representations (2018)


  2. Ardizzone, L., Lüth, C., Kruse, J., Rother, C., Köthe, U.: Guided image generation with conditional invertible neural networks. arXiv preprint arXiv:1907.02392 (2019)

  3. Arridge, S., Maass, P., Öktem, O., Schönlieb, C.B.: Solving inverse problems using data-driven models. Acta Numerica 28, 1–174 (2019)


  4. Asim, M., Daniels, M., Leong, O., Ahmed, A., Hand, P.: Invertible generative models for inverse problems: mitigating representation error and dataset bias. In: International Conference on Machine Learning, pp. 399–409. PMLR (2020)


  5. Benning, M., Burger, M.: Modern regularization methods for inverse problems. Acta Numerica 27, 1–111 (2018)


  6. Chaudhuri, S.: Super-Resolution Imaging, vol. 632. Springer Science, Cham (2001)


  7. Chen, Y., Ranftl, R., Pock, T.: Insights into analysis operator learning: from patch-based sparse models to higher order MRFs. IEEE Trans. Image Process. 23(3), 1060–1072 (2014)


  8. Daras, G., Dean, J., Jalal, A., Dimakis, A.: Intermediate layer optimization for inverse problems using deep generative models. In: International Conference on Machine Learning, pp. 2421–2432. PMLR (2021)


  9. Dashti, M., Stuart, A.M.: The Bayesian approach to inverse problems. In: Ghanem, R., Higdon, D., Owhadi, H. (eds.) Handbook of Uncertainty Quantification, pp. 311–428. Springer, Cham (2017). https://doi.org/10.1007/978-3-319-12385-1_7


  10. Deco, G., Brauer, W.: Nonlinear higher-order statistical decorrelation by volume-conserving neural architectures. Neural Netw. 8(4), 525–535 (1995)


  11. Dinh, L., Sohl-Dickstein, J., Bengio, S.: Density estimation using Real NVP. In: International Conference on Learning Representations (2017)


  12. Engl, H.W., Hanke, M., Neubauer, A.: Regularization of Inverse Problems, vol. 375. Springer, Cham (1996)


  13. Ho, J., Chen, X., Srinivas, A., Duan, Y., Abbeel, P.: Flow++: improving flow-based generative models with variational dequantization and architecture design. In: International Conference on Machine Learning, pp. 2722–2730. PMLR (2019)


  14. Kingma, D.P., Dhariwal, P.: Glow: generative flow with invertible 1×1 convolutions. In: Proceedings of the 32nd International Conference on Neural Information Processing Systems, pp. 10236–10245 (2018)


  15. Kobler, E., Effland, A., Kunisch, K., Pock, T.: Total deep variation for linear inverse problems. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 7549–7558 (2020)


  16. Mairal, J., Ponce, J., Sapiro, G., Zisserman, A., Bach, F.: Supervised dictionary learning. In: Advances in Neural Information Processing Systems, vol. 21 (2008)


  17. Meinhardt, T., Moller, M., Hazirbas, C., Cremers, D.: Learning proximal operators: using denoising networks for regularizing inverse imaging problems. In: Proceedings of the IEEE International Conference on Computer Vision, pp. 1781–1790 (2017)


  18. Moeller, M., Mollenhoff, T., Cremers, D.: Controlling neural networks via energy dissipation. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 3256–3265 (2019)


  19. Newman, M., Barkema, G.: Monte Carlo Methods in Statistical Physics, vol. 24. Oxford University Press, New York, USA (1999)


  20. Padmanabha, G.A., Zabaras, N.: Solving inverse problems using conditional invertible neural networks. J. Comput. Phys. 433, 110194 (2021)


  21. Rezende, D., Mohamed, S.: Variational inference with normalizing flows. In: International Conference on Machine Learning, pp. 1530–1538. PMLR (2015)


  22. Romano, Y., Elad, M., Milanfar, P.: The little engine that could: regularization by denoising (RED). SIAM J. Imaging Sci. 10(4), 1804–1844 (2017)


  23. Scarlett, J., Heckel, R., Rodrigues, M.R., Hand, P., Eldar, Y.C.: Theoretical perspectives on deep learning methods in inverse problems. arXiv preprint arXiv:2206.14373 (2022)

  24. Siahkoohi, A., Rizzuti, G., Louboutin, M., Witte, P., Herrmann, F.: Preconditioned training of normalizing flows for variational inference in inverse problems. In: Third Symposium on Advances in Approximate Bayesian Inference (2020)


  25. Siahkoohi, A., Rizzuti, G., Witte, P.A., Herrmann, F.J.: Faster uncertainty quantification for inverse problems with conditional normalizing flows. arXiv preprint arXiv:2007.07985 (2020)

  26. Sim, B., Oh, G., Kim, J., Jung, C., Ye, J.C.: Optimal transport driven CycleGAN for unsupervised learning in inverse problems. SIAM J. Imaging Sci. 13(4), 2281–2306 (2020)


  27. Song, Y., Sohl-Dickstein, J., Kingma, D.P., Kumar, A., Ermon, S., Poole, B.: Score-based generative modeling through stochastic differential equations. In: International Conference on Learning Representations (2021)


  28. Ulyanov, D., Vedaldi, A., Lempitsky, V.: Deep image prior. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 9446–9454 (2018)


  29. Whang, J., Lindgren, E., Dimakis, A.: Approximate probabilistic inference with composed flows. In: NeurIPS 2020 Workshop on Deep Learning and Inverse Problems (2020)


  30. Xiao, H., Rasul, K., Vollgraf, R.: Fashion-MNIST: a novel image dataset for benchmarking machine learning algorithms. arXiv preprint arXiv:1708.07747 (2017)

  31. Xiao, Z., Yan, Q., Amit, Y.: A method to model conditional distributions with normalizing flows. arXiv preprint arXiv:1911.02052 (2019)


Acknowledgements

CR acknowledges support from the Cantab Capital Institute for the Mathematics of Information (CCIMI) and the EPSRC grant EP/W524141/1. MM acknowledges the support of the German Research Foundation Grant MO 2962/7-1. CBS acknowledges support from the Philip Leverhulme Prize, the Royal Society Wolfson Fellowship, the EPSRC advanced career fellowship EP/V029428/1, EPSRC grants EP/S026045/1 and EP/T003553/1, EP/N014588/1, EP/T017961/1, the Wellcome Innovator Awards 215733/Z/19/Z and 221633/Z/20/Z, the European Union Horizon 2020 research and innovation programme under the Marie Skłodowska-Curie grant agreement No. 777826 NoMADS, the CCIMI and the Alan Turing Institute. CE acknowledges support from the Wellcome Innovator Award RG98755.

Author information

Corresponding author

Correspondence to Christina Runkel.

A Appendix

In the following, we provide additional theoretical results as well as experimental details for all experiments described previously.

1.1 A.1 Additional Theoretical Results

This subsection details some of the theoretical results from the main part of the paper.

Further Analysis of the Third and Fourth Loss Terms. To show that \((g_{\theta_2}^{-1} \circ f_{\theta_1})\) learns a correct mapping between the data measurement space Y and the model parameter space X when minimizing the energy function E in Eq. (2), we assume E to have fully converged, so that \(\hat{\theta}_1, \hat{\theta}_2 := \mathop{\mathrm{argmin}}\limits_{\theta_1, \theta_2} E(\theta_1, \theta_2)\). Explicitly, for the third loss term this means that \(\mathbb{E}_{\mathbf{y}, \mathbf{z}_2}\big[\Vert A\big(g_{\hat{\theta}_2}^{-1}(f_{\hat{\theta}_1}(y), z_2)\big) - y\Vert_2^2\big]\) is minimal. For this expectation to be minimal, \(\Vert A\big(g_{\hat{\theta}_2}^{-1}(f_{\hat{\theta}_1}(y), z_2)\big) - y\Vert_2^2\) has to be minimal as well. The forward operator cancels out all components of x that map to \(Z_2\) through the flow \(g_{\theta_2}\); the remaining components of x, after applying the forward operator, need to equal those in y, up to a degree determined by the noise level. There are two cases. In the first case, \(\Vert f_{\hat{\theta}_1}^{-1}\big(g_{\hat{\theta}_2}^{\langle z_1 \rangle}(x)\big) - y\Vert_2^2\) is minimal as well, i.e., the part of x that maps first to \(Z_1\) and then to Y through the flow \(f_{\theta_1}^{-1}\), denoted \(f_{\hat{\theta}_1}^{-1}\big(g_{\hat{\theta}_2}^{\langle z_1 \rangle}(x)\big)\), equals y. In the second case, since the expectation in our loss term still has to be minimal, parts of the sampled \(z_2\) would have to make up for the differences in the inverse mapping. Because \(z_2\) is sampled randomly during training, the second case can never make the expectation in the loss term minimal, and thus the third loss term enforces the correct mapping.

For a fully converged energy function E, the fourth loss term \(\mathbb{E}_{\mathbf{x}, \mathbf{z}_2}\big[\Vert g_{\theta_2}^{-1}\big(f_{\theta_1}(A(x)), z_2\big) - x\Vert_2^2\big]\) is likewise minimal. In contrast to the third loss term, the fourth loss term involves an additional source of randomness, as we compute the expectation for \(\mathbf{x}\) living in the higher-dimensional model parameter space X. When computing the squared \(L^2\) distance in X while applying the forward operator A, with its inherent information loss, and afterwards making up for the missing information by randomly sampling \(z_2\) as in the fourth loss term, the expectation \(\mathbb{E}_{\mathbf{x}, \mathbf{z}_2}\) may never reach zero because of the randomness involved. Still, similarly to the third loss term, \(\Vert g_{\theta_2}^{-1}\big(f_{\theta_1}(A(x)), z_2\big) - x\Vert_2^2\) can only become minimal if either \(f_{\hat{\theta}_1}^{-1}\big(g_{\hat{\theta}_2}^{\langle z_1 \rangle}(x)\big)\) and y are equal, or parts of the sampled \(z_2\) make up for the differences, which, due to the randomness involved, never minimizes \(\mathbb{E}_{\mathbf{x}, \mathbf{z}_2}\). Additionally, since we compute the \(L^2\) distance in the higher-dimensional space, another possible minimum of the squared \(L^2\) distance above might be a suboptimal mapping between Y and X whenever parts of \(z_2\) by chance map to the correct values of the ‘\(Z_2\)’-part of x. Due to the random sampling of \(z_2\), this never minimizes the expectation either. In other words, whenever we irrevocably discard information, as when applying the forward operator to the higher-dimensional x, and randomly sample to replace this information, the \(L^2\) distance between the input x and the reconstructed solution \(x'\) is minimal only if all values retained after applying the forward operator are mapped back to the corresponding values in X. Thus, the fourth loss term also enforces learning the correct mapping between Y and X.
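
To make the structure of these two consistency terms concrete, the following PyTorch sketch evaluates Monte-Carlo estimates of the third and fourth loss terms on a mini-batch. The flow objects f_flow and g_flow, their forward/inverse methods and the callable forward operator A are hypothetical interfaces assumed only for illustration; this is a minimal sketch, not the implementation used in the paper.

```python
import torch

def unpaired_consistency_losses(f_flow, g_flow, A, y_batch, x_batch, z2_dim):
    """Monte-Carlo estimates of the third and fourth loss terms on a mini-batch.

    Hypothetical interface assumed for this sketch:
      f_flow.forward(y)      -> z1            (flow Y -> Z1)
      g_flow.inverse(z1, z2) -> (x, log_det)  (inverse of flow X -> Z1 x Z2)
      A(x)                   -> simulated measurement (known forward operator)
    """
    # Third loss term: data consistency of the chain y -> z1 -> x -> A(x).
    z1 = f_flow.forward(y_batch)
    z2 = torch.randn(y_batch.shape[0], z2_dim)          # z2 ~ p_z2 = N(0, I)
    x_hat, _ = g_flow.inverse(z1, z2)
    loss3 = ((A(x_hat) - y_batch) ** 2).flatten(1).sum(dim=1).mean()

    # Fourth loss term: cycle consistency of the chain x -> A(x) -> z1 -> x'.
    z1_x = f_flow.forward(A(x_batch))
    z2_x = torch.randn(x_batch.shape[0], z2_dim)
    x_cycle, _ = g_flow.inverse(z1_x, z2_x)
    loss4 = ((x_cycle - x_batch) ** 2).flatten(1).sum(dim=1).mean()

    return loss3, loss4
```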

Detailed Derivation of Computing \(p_{\mathbf{x} \mid \mathbf{y}, \mathbf{z}_2}(x \mid y, z_2)\)

$$\begin{aligned}
p_{\mathbf{x} \mid \mathbf{y}, \mathbf{z}_2}(x \mid y, z_2)
&= p_{\mathbf{z}}\big(g_{\theta_2}(x)\big) \cdot \Big|\det\Big(J_{g_{\theta_2}}(x)\Big)\Big| &\text{[1]}\\
&= p_{\mathbf{z}_1}(z_1) \cdot p_{\mathbf{z}_2}(z_2) \cdot \Big|\det\Big(J_{g_{\theta_2}}\big(g_{\theta_2}^{-1}(z_1, z_2)\big)\Big)\Big| &\text{[2]}\\
&= p_{\mathbf{z}_1}(z_1) \cdot p_{\mathbf{z}_2}(z_2) \cdot \Big|\det\Big(J^{-1}_{g_{\theta_2}^{-1}}(z_1, z_2)\Big)\Big| &\text{[3]}\\
&= p_{\mathbf{z}_1}(z_1) \cdot p_{\mathbf{z}_2}(z_2) \cdot \Big|\det\Big(J_{g_{\theta_2}^{-1}}(z_1, z_2)\Big)^{-1}\Big| &\text{[4]}\\
&= p_{\mathbf{z}_1}(z_1) \cdot p_{\mathbf{z}_2}(z_2) \cdot \Big|\det\Big(J_{g_{\theta_2}^{-1}}(z_1, z_2)\Big)\Big|^{-1} &\text{[5]}\\
&= p_{\mathbf{z}_1}\big(f_{\theta_1}(y)\big) \cdot p_{\mathbf{z}_2}(z_2) \cdot \Big|\det\Big(J_{g_{\theta_2}^{-1}}\big(f_{\theta_1}(y), z_2\big)\Big)\Big|^{-1} &\text{[6]}
\end{aligned}$$
(12)

In step [1], we apply the change-of-variables formula. We then use the independence of \(p_{\mathbf{z}_1}\) and \(p_{\mathbf{z}_2}\) to substitute \(p_{\mathbf{z}}\) by \(p_{\mathbf{z}_1} \cdot p_{\mathbf{z}_2}\), together with the fact that, by definition, \(x = g_{\theta_2}^{-1}(z_1, z_2)\) (see step [2]). Making use of the inverse function theorem, we replace the Jacobian of \(g_{\theta_2}\) by the inverse of the Jacobian of \(g_{\theta_2}^{-1}\) in step [3], and in step [4] pull out the inverse, as the determinant of the inverse of a matrix equals the inverse of its determinant, i.e., \(\det(A^{-1}) = \det(A)^{-1}\). Since the determinant is a scalar and the inverse of a scalar s is \(\frac{1}{s}\), the absolute value can be simplified further, so that \(|s^{-1}| = |\frac{1}{s}| = \frac{1}{|s|} = |s|^{-1}\) (see step [5]). In the last step, we replace \(z_1\) with \(f_{\theta_1}(y)\), as by definition \(z_1 = f_{\theta_1}(y)\). The resulting term is easy to compute: y is given, \(z_2\) is sampled from \(p_{\mathbf{z}_2}\), and both \(p_{\mathbf{z}_1}\big(f_{\theta_1}(y)\big)\) and \(p_{\mathbf{z}_2}(z_2)\) are simple base distributions from which we can sample and whose densities we can evaluate easily. The last factor requires the absolute value of the determinant of the Jacobian of the inverse of flow \(g_{\theta_2}\); since the invertible layers of normalizing flows allow for efficient computation of their Jacobian determinants, this factor is easily computable as well.
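
As an illustration, a minimal PyTorch sketch of this computation could look as follows; the flow interface (f_flow.forward, and g_flow.inverse returning the log-determinant) and the base distributions prior_z1 and prior_z2 are hypothetical assumptions of this sketch, not the authors' code.

```python
import torch

def log_posterior(f_flow, g_flow, prior_z1, prior_z2, y, z2):
    """Evaluate log p(x | y, z2) following Eq. (12).

    Hypothetical interface assumed for this sketch:
      f_flow.forward(y)      -> z1
      g_flow.inverse(z1, z2) -> (x, log_det), log_det = log|det J_{g^{-1}}(z1, z2)|
      prior_z1, prior_z2     -> base distributions with per-sample log_prob()
    """
    z1 = f_flow.forward(y)
    x, log_det_inv = g_flow.inverse(z1, z2)
    # Eq. (12): p(x | y, z2) = p_z1(f(y)) * p_z2(z2) * |det J_{g^{-1}}(f(y), z2)|^{-1}
    log_p = prior_z1.log_prob(z1) + prior_z2.log_prob(z2) - log_det_inv
    return x, log_p
```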

Detailed Derivation of the MAP Estimator. For optimization purposes, we formulate the optimization problem in Eq. (7) as a minimization problem, i.e.,

$$\begin{aligned} \bar{z}_{2_{\text{MAP}}} &= \mathop{\mathrm{argmax}}\limits_{z_2 \in Z_2} \Big(p_{\mathbf{x}}\big(g_{\theta_2}^{-1}(z_1, z_2)\big)\Big) &\text{[1]}\end{aligned}$$
(13)
$$\begin{aligned} &= \mathop{\mathrm{argmin}}\limits_{z_2 \in Z_2} \Big(- p_{\mathbf{x}}\big(g_{\theta_2}^{-1}(z_1, z_2)\big)\Big) &\text{[2]}\end{aligned}$$
(14)
$$\begin{aligned} &= \mathop{\mathrm{argmin}}\limits_{z_2 \in Z_2} \Big(- \log p_{\mathbf{x}}\big(g_{\theta_2}^{-1}(z_1, z_2)\big)\Big) &\text{[3]}\end{aligned}$$
(15)
$$\begin{aligned} &= \mathop{\mathrm{argmin}}\limits_{z_2 \in Z_2} \bigg(- \log \Big(p_{\mathbf{z}_1}\big(f_{\theta_1}(y)\big) \cdot p_{\mathbf{z}_2}(z_2) \cdot \Big|\det\Big(J_{g_{\theta_2}^{-1}}\big(f_{\theta_1}(y), z_2\big)\Big)\Big|^{-1}\Big)\bigg) &\text{[4]}\end{aligned}$$
(16)
$$\begin{aligned} &= \mathop{\mathrm{argmin}}\limits_{z_2 \in Z_2} \bigg(- \log p_{\mathbf{z}_1}\big(f_{\theta_1}(y)\big) - \log p_{\mathbf{z}_2}(z_2) - \log \Big|\det\Big(J_{g_{\theta_2}^{-1}}\big(f_{\theta_1}(y), z_2\big)\Big)\Big|^{-1}\bigg) &\text{[5]}\end{aligned}$$
(17)
$$\begin{aligned} &= \mathop{\mathrm{argmin}}\limits_{z_2 \in Z_2} \bigg(- \log p_{\mathbf{z}_2}(z_2) - \log \Big|\det\Big(J_{g_{\theta_2}^{-1}}\big(f_{\theta_1}(y), z_2\big)\Big)\Big|^{-1}\bigg) &\text{[6]}\end{aligned}$$
(18)
$$\begin{aligned} &= \mathop{\mathrm{argmin}}\limits_{z_2 \in Z_2} \bigg(- \log p_{\mathbf{z}_2}(z_2) - \log \frac{1}{\big|\det\big(J_{g_{\theta_2}^{-1}}\big(f_{\theta_1}(y), z_2\big)\big)\big|}\bigg) &\text{[7]}\end{aligned}$$
(19)
$$\begin{aligned} &= \mathop{\mathrm{argmin}}\limits_{z_2 \in Z_2} \bigg(- \log p_{\mathbf{z}_2}(z_2) + \log \Big|\det\Big(J_{g_{\theta_2}^{-1}}\big(f_{\theta_1}(y), z_2\big)\Big)\Big|\bigg) &\text{[8]}\end{aligned}$$
(20)

While in step [2] we simply reformulate the maximization problem as a minimization problem, in step [3] we use the fact that, due to the monotonicity of the natural logarithm, minimizing a function \(d(x)\) is equivalent to minimizing \(\log d(x)\). Substituting the results of Eq. (7) for \(p_{\mathbf{x}}\big(g_{\theta_2}^{-1}(z_1, z_2)\big)\) (see step [4]) and applying logarithmic laws (see step [5]), we simplify the expression so that we can drop \(\log p_{\mathbf{z}_1}\big(f_{\theta_1}(y)\big)\) in step [6]: since \(\mathbf{z}_1\) and \(\mathbf{z}_2\) are independent random variables and we solely optimize over \(z_2 \in Z_2\), this term is a constant. Applying logarithmic laws again (see steps [7] and [8]), we end up with two terms that are easy to evaluate: the first term is the log-probability of our simple base distribution, and, by construction, the log-determinant of a normalizing flow is simple to calculate as well.
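
A gradient-based minimization of Eq. (20) over \(z_2\) could, for instance, be sketched as follows, again assuming the hypothetical flow interface from the previous sketch; the optimizer, step count and learning rate are illustrative choices, not those of the paper.

```python
import torch

def map_estimate_z2(f_flow, g_flow, prior_z2, y, z2_dim, steps=500, lr=1e-2):
    """Gradient-based minimization of Eq. (20) over z2 (illustrative settings)."""
    z1 = f_flow.forward(y).detach()                    # z1 = f(y) is kept fixed
    z2 = torch.randn(1, z2_dim, requires_grad=True)    # random initialization
    optimizer = torch.optim.Adam([z2], lr=lr)
    for _ in range(steps):
        optimizer.zero_grad()
        x, log_det_inv = g_flow.inverse(z1, z2)        # log|det J_{g^{-1}}(f(y), z2)|
        # Eq. (20): minimize -log p_z2(z2) + log|det J_{g^{-1}}(f(y), z2)|
        loss = -prior_z2.log_prob(z2).sum() + log_det_inv.sum()
        loss.backward()
        optimizer.step()
    return z2.detach()
```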

Detailed Derivation of EDDO

$$\begin{aligned}
\bar{z}_{2_{\text{disc}}}
&= \mathop{\mathrm{argmax}}\limits_{z_2 \in Z_2} \Big(q_{\mathbf{y} \mid \mathbf{x}}\big(y \mid g_{\theta_2}^{-1}(z_1, z_2)\big) \cdot \lambda^*\, p_{\mathbf{x}}\big(g_{\theta_2}^{-1}(z_1, z_2)\big)\Big) &\text{[1]}\\
&= \mathop{\mathrm{argmax}}\limits_{z_2 \in Z_2} \bigg(\exp\Big(-\frac{\Vert A g_{\theta_2}^{-1}(z_1, z_2) - y\Vert_2^2}{2\sigma^2}\Big) \cdot \lambda^*\, p_{\mathbf{x}}\big(g_{\theta_2}^{-1}(z_1, z_2)\big)\bigg) &\text{[2]}\\
&= \mathop{\mathrm{argmax}}\limits_{z_2 \in Z_2} \bigg(-\frac{1}{2\sigma^2}\Vert A g_{\theta_2}^{-1}(z_1, z_2) - y\Vert_2^2 + \lambda \log p_{\mathbf{x}}\big(g_{\theta_2}^{-1}(z_1, z_2)\big)\bigg) &\text{[3]}\\
&= \mathop{\mathrm{argmax}}\limits_{z_2 \in Z_2} \bigg(-\Vert A g_{\theta_2}^{-1}(z_1, z_2) - y\Vert_2^2 + \lambda \log p_{\mathbf{x}}\big(g_{\theta_2}^{-1}(z_1, z_2)\big)\bigg) &\text{[4]}\\
&= \mathop{\mathrm{argmin}}\limits_{z_2 \in Z_2} \bigg(\Vert A g_{\theta_2}^{-1}(z_1, z_2) - y\Vert_2^2 - \lambda \log p_{\mathbf{x}}\big(g_{\theta_2}^{-1}(z_1, z_2)\big)\bigg) &\text{[5]}\\
&= \mathop{\mathrm{argmin}}\limits_{z_2 \in Z_2} \bigg(\Vert A g_{\theta_2}^{-1}\big(f_{\theta_1}(y), z_2\big) - y\Vert_2^2 - \lambda \log p_{\mathbf{x}}\big(g_{\theta_2}^{-1}\big(f_{\theta_1}(y), z_2\big)\big)\bigg) &\text{[6]}\\
&= \mathop{\mathrm{argmin}}\limits_{z_2 \in Z_2} \bigg(\Vert A g_{\theta_2}^{-1}\big(f_{\theta_1}(y), z_2\big) - y\Vert_2^2 &\text{[7]}\\
&\qquad\qquad\quad\; - \lambda \Big(\log p_{\mathbf{z}_2}(z_2) - \log \Big|\det\Big(J_{g_{\theta_2}^{-1}}\big(f_{\theta_1}(y), z_2\big)\Big)\Big|\Big)\bigg). &
\end{aligned}$$
(21)

We first insert the definition of \(q_{\mathbf{y} \mid \mathbf{x}}\) (see step [2]) and use the fact that maximizing a function is equivalent to maximizing its natural logarithm (see step [3]). In step [4], we drop the factor \(\frac{1}{2\sigma^2}\), a positive constant that can be absorbed into the weighting parameter \(\lambda\). Reformulating the maximization as a minimization problem (see step [5]) and substituting \(z_1\) by \(f_{\theta_1}(y)\) in step [6], we are left with inserting the results of Eq. (13) for \(p_{\mathbf{x}}\big(g_{\theta_2}^{-1}\big(f_{\theta_1}(y), z_2\big)\big)\) and, as in the MAP derivation, dropping the constant term \(\log p_{\mathbf{z}_1}\big(f_{\theta_1}(y)\big)\) (see step [7]).
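
Analogously, the EDDO objective in Eq. (21) can be minimized over \(z_2\) by gradient descent. The following sketch assumes the same hypothetical flow interface as before and a known (callable) forward operator A, with lam playing the role of \(\lambda\); it illustrates the objective rather than the authors' implementation.

```python
import torch

def eddo_estimate_z2(f_flow, g_flow, prior_z2, A, y, z2_dim, lam=0.1,
                     steps=500, lr=1e-2):
    """Gradient-based minimization of the EDDO objective in Eq. (21)."""
    z1 = f_flow.forward(y).detach()
    z2 = torch.randn(1, z2_dim, requires_grad=True)
    optimizer = torch.optim.Adam([z2], lr=lr)
    for _ in range(steps):
        optimizer.zero_grad()
        x, log_det_inv = g_flow.inverse(z1, z2)
        data_fidelity = ((A(x) - y) ** 2).sum()
        # Regularizer of step [7]: log p_z2(z2) - log|det J_{g^{-1}}(f(y), z2)|.
        log_prior = prior_z2.log_prob(z2).sum() - log_det_inv.sum()
        loss = data_fidelity - lam * log_prior
        loss.backward()
        optimizer.step()
    return z2.detach()
```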

Supervised Training with UnDimFlow. In addition to the proposed unpaired learning method, which makes use of the forward operator A, the proposed approach is also capable of learning underdetermined inverse problems with an unknown forward operator A, under the assumption that paired training data (y, x) is available. In the following, we briefly describe supervised training with UnDimFlow by stating and explaining the corresponding energy function \(E(\theta_1, \theta_2)\).

Supervised training of the composed normalizing flows is realized by minimizing an energy function \(E(\theta _1, \theta _2)\) consisting of four loss terms:

$$\begin{aligned}
E(\theta_1, \theta_2) :=\; & \mathbb{E}_{\mathbf{y}, \mathbf{x}}\Big[\Vert f_{\theta_1}^{-1}\big(g_{\theta_2}^{\langle z_1 \rangle}(x)\big) - y\Vert_2^2\Big] \\
&+ \lambda_1\, \mathbb{E}_{\mathbf{y}, \mathbf{x}, \mathbf{z}_2}\Big[\Vert g_{\theta_2}^{-1}\big(f_{\theta_1}(y), z_2\big) - x\Vert_2^2\Big] \\
&- \lambda_2\, \mathbb{E}_{\mathbf{y}}\big[\log p_{\mathbf{y}}(y)\big] \\
&- \lambda_3\, \mathbb{E}_{\mathbf{x}}\big[\log p_{\mathbf{x}}(x)\big].
\end{aligned}$$
(22)

The first two terms of the energy function train the combination of both flows in a supervised manner. In detail, in the first loss term we compute the squared \(L^2\) distance between data measurement samples \(y \in Y\) and the output of the inverse function composition of both flows, \(h_{\theta_1, \theta_2}^{-1}(x) := f_{\theta_1}^{-1}\big(g_{\theta_2}^{\langle z_1 \rangle}(x)\big)\).
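
For illustration, a mini-batch estimate of the supervised energy in Eq. (22) could be computed as in the following sketch, which assumes hypothetical forward/inverse/log_prob methods on the two flows and is not the authors' implementation.

```python
import torch

def supervised_energy(f_flow, g_flow, x, y, lam1=1.0, lam2=1.0, lam3=1.0):
    """Mini-batch estimate of the supervised energy in Eq. (22).

    Hypothetical interface assumed for this sketch:
      g_flow.forward(x)      -> (z1, z2)   (flow X -> Z1 x Z2)
      g_flow.inverse(z1, z2) -> (x, log_det)
      f_flow.forward(y)      -> z1         (flow Y -> Z1)
      f_flow.inverse(z1)     -> y
      f_flow.log_prob(y), g_flow.log_prob(x): change-of-variables log-likelihoods
    """
    # Term 1: ||f^{-1}(g^{<z1>}(x)) - y||^2, using only the Z1-part of g(x).
    z1_from_x, z2_from_x = g_flow.forward(x)
    y_hat = f_flow.inverse(z1_from_x)
    term1 = ((y_hat - y) ** 2).flatten(1).sum(dim=1).mean()

    # Term 2: ||g^{-1}(f(y), z2) - x||^2 with randomly sampled z2 ~ p_z2.
    z1_from_y = f_flow.forward(y)
    z2 = torch.randn_like(z2_from_x)
    x_hat, _ = g_flow.inverse(z1_from_y, z2)
    term2 = ((x_hat - x) ** 2).flatten(1).sum(dim=1).mean()

    # Terms 3 and 4: maximum-likelihood terms for the two individual flows.
    log_py = f_flow.log_prob(y).mean()   # E_y[log p_y(y)]
    log_px = g_flow.log_prob(x).mean()   # E_x[log p_x(x)]

    return term1 + lam1 * term2 - lam2 * log_py - lam3 * log_px
```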

Fig. 6. Randomly sampled \(\bar{x}_{\text{rnd}}\) for \(8 \times 8\) pixel low-resolution images with a weighting factor \(\lambda_2 = 200\).

Fig. 7. Overview of five EDDO point estimators for different initializations of \(z_2\) per weighting factor \(\lambda\) for super-resolution, shown for an example image of the Fashion-MNIST dataset. Each row shows the five point estimators for \(\lambda \in \{0, 10^{-6}, \ldots, 0.5, 1\}\). For each column, we initialize \(z_2\) identically to ease comparability of the results.

1.2 A.2 Further Results for Super-Resolution Under High Uncertainty

To show the variety of suitable solutions that our approach is able to learn, we additionally provide nine examples for seven different test images of the Fashion-MNIST dataset. While fixing the input image y, we randomly sample from the posterior. Figure 6 shows the results and the input image (rightmost column). All samples visually look like valid solutions to the inverse problem while showing some minor differences in uncertain regions of the images.

Detailing the analysis of the proposed EDDO estimator, we additionally conduct experiments with five initializations of \(z_2\) and test different weightings of the regularization term \(p_{\mathbf{x} \mid \mathbf{y}}(x \mid y)\) by scaling the hyperparameter \(\lambda\) accordingly. The results are depicted in Fig. 7. Each row of Fig. 7 shows the results for the five initializations of \(z_2\) for a specific value of \(\lambda\), as indicated to the left of each row. It can be seen that with increasing weight of the regularization term, the generated images become smoother and tend towards more extreme pixel values, i.e., pixels being either black or white, which reduces the number of intermediate gray values.

1.3 A.3 Experimental Details

Dataset. Throughout the experiments in this paper, we make use of the Fashion-MNIST dataset [30]. The Fashion-MNIST dataset, introduced by Zalando Research in 2017, contains images of fashion objects. Images within the dataset have a resolution of \(28 \times 28\) pixels and are grouped into ten classes: ‘T-Shirt/Top’, ‘Trousers’, ‘Pullover’, ‘Dress’, ‘Coat’, ‘Sandals’, ‘Shirt’, ‘Sneaker’, ‘Bag’ and ‘Ankle boots’ (Fig. 8).

Fig. 8. Example images of the Fashion-MNIST dataset.
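
For reference, Fashion-MNIST can be loaded, for example, via torchvision; the snippet below is a generic loading sketch and does not reproduce the exact data pipeline used in our experiments.

```python
from torch.utils.data import DataLoader
from torchvision import datasets, transforms

transform = transforms.ToTensor()   # 28x28 grayscale images scaled to [0, 1]
train_set = datasets.FashionMNIST(root="data", train=True, download=True,
                                  transform=transform)
test_set = datasets.FashionMNIST(root="data", train=False, download=True,
                                 transform=transform)
train_loader = DataLoader(train_set, batch_size=64, shuffle=True)
```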

Fig. 9. UnDimFlow network architecture overview. The overall network consists of two normalizing flows \(f_{\theta_1}: Y \rightarrow Z_1\) and \(g_{\theta_2}: X \rightarrow Z_1 \times Z_2\). Both flows are connected via the same (part of the) latent space \(Z_1\) and make use of Glow-like building blocks as core components. Additionally, a multi-scale architecture is realized by including invertible Haar downsampling layers.

Network Architecture. To show that the proposed method is capable of learning multiple inverse problems, we use the same UnDimFlow network architecture to train and test the method on different inverse problems. The overall network architecture is summarized in Fig. 9. It consists of two normalizing flows \(f_{\theta_1}\) and \(g_{\theta_2}\) which are connected through the latent space \(Z_1\). Both flows use Glow-like building blocks (cf. [14]) as their core components. In contrast to standard Glow building blocks, we apply a random permutation, as this has been shown to improve training stability in our experiments (Fig. 10). Each block therefore computes

$$\begin{aligned} f_{\text {block}}(y) = P \cdot \sigma (s_{\text {an}}) \odot f_{\text {ac}}(y) + b_{\text {an}} \end{aligned}$$
(23)

for an activation function \(\sigma \), a permutation matrix P, a scaling parameter \(s_{\text {an}}\), a bias parameter \(b_{\text {an}}\) (both parts of the ActNorm layer [14]) and the affine coupling function \(f_{\text {ac}}\) [11].
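
The following sketch illustrates such a block in PyTorch for flattened inputs: an affine coupling layer followed by ActNorm-style scale and bias and a fixed random permutation, as in Fig. 10. The exponential used as the activation \(\sigma\), the hidden width, and the omission of the log-determinant bookkeeping required by an actual normalizing flow are simplifications of this sketch, not details of our architecture.

```python
import torch
import torch.nn as nn

class GlowLikeBlock(nn.Module):
    """Sketch of the building block in Eq. (23): affine coupling, ActNorm-style
    scale/bias, and a fixed random permutation (flattened inputs assumed)."""

    def __init__(self, dim, hidden=128):
        super().__init__()
        assert dim % 2 == 0
        # Subnetworks s and t of the affine coupling layer (fully-connected variant).
        self.s = nn.Sequential(nn.Linear(dim // 2, hidden), nn.ReLU(),
                               nn.Linear(hidden, dim // 2))
        self.t = nn.Sequential(nn.Linear(dim // 2, hidden), nn.ReLU(),
                               nn.Linear(hidden, dim // 2))
        # ActNorm-style parameters s_an, b_an and a fixed random permutation P.
        self.s_an = nn.Parameter(torch.zeros(dim))
        self.b_an = nn.Parameter(torch.zeros(dim))
        self.register_buffer("perm", torch.randperm(dim))

    def forward(self, y):
        # Affine coupling f_ac: transform the second half conditioned on the first.
        y1, y2 = y.chunk(2, dim=1)
        y2 = y2 * torch.exp(self.s(y1)) + self.t(y1)
        f_ac = torch.cat([y1, y2], dim=1)
        # ActNorm-style scale (positive via exp) and bias, then the permutation.
        out = torch.exp(self.s_an) * f_ac + self.b_an
        return out[:, self.perm]
```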

Fig. 10. Overview of the core component of the normalizing flows \(f_{\theta_1}\) and \(g_{\theta_2}\). An affine coupling layer is applied before an ActNorm layer and a permutation. In contrast to a standard Glow building block, we use a random permutation. Depending on the type of the block, i.e., convolutional or fully-connected, the subnetworks of the affine coupling layers are convolutional or fully-connected neural networks.

We choose the subnetworks s and t of the affine coupling layers to be either shallow convolutional neural networks (denoted as ’Convolutional block’ in Fig. 9) or fully-connected neural networks (denoted as ’Fully-connected block’ in Fig. 9). The convolutional subnetworks consist of three convolutional layers with kernel sizes of \(3 \times 3\), \(1 \times 1\) and again \(3 \times 3\), respectively. For the fully-connected subnetworks, we apply three fully-connected layers. Both types of subnetworks additionally use ReLU activations between the linear layers.
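
These subnetworks could be instantiated, for example, as follows; the channel and hidden sizes are placeholders for this sketch, not the values listed in Tables 1 and 2.

```python
import torch.nn as nn

def conv_subnet(c_in, c_hidden, c_out):
    """Convolutional coupling subnetwork: 3x3 -> 1x1 -> 3x3 convolutions
    with ReLU activations in between (channel sizes are illustrative)."""
    return nn.Sequential(
        nn.Conv2d(c_in, c_hidden, kernel_size=3, padding=1), nn.ReLU(),
        nn.Conv2d(c_hidden, c_hidden, kernel_size=1), nn.ReLU(),
        nn.Conv2d(c_hidden, c_out, kernel_size=3, padding=1),
    )

def fc_subnet(d_in, d_hidden, d_out):
    """Fully-connected coupling subnetwork: three linear layers with ReLU in between."""
    return nn.Sequential(
        nn.Linear(d_in, d_hidden), nn.ReLU(),
        nn.Linear(d_hidden, d_hidden), nn.ReLU(),
        nn.Linear(d_hidden, d_out),
    )
```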

Additionally, as an increased number of channels has proven to be beneficial when using affine coupling layers, we incorporate downsampling operations via Haar transformations to invertibly increase the number of channels while decreasing the spatial dimensionality.
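
A minimal sketch of such an invertible Haar downsampling step, mapping \((C, H, W)\) to \((4C, H/2, W/2)\) with an orthonormal \(2 \times 2\) Haar transform (so the Jacobian log-determinant is zero), is given below; it illustrates the idea rather than our exact layer.

```python
import torch
import torch.nn as nn

class HaarDownsampling(nn.Module):
    """Invertible Haar downsampling sketch: each 2x2 patch is mapped to one
    average and three difference channels, so (C, H, W) -> (4C, H/2, W/2)."""

    def forward(self, x):
        a = x[:, :, 0::2, 0::2]
        b = x[:, :, 0::2, 1::2]
        c = x[:, :, 1::2, 0::2]
        d = x[:, :, 1::2, 1::2]
        ll = (a + b + c + d) / 2.0    # low-pass (average) channel
        lh = (a - b + c - d) / 2.0    # horizontal detail
        hl = (a + b - c - d) / 2.0    # vertical detail
        hh = (a - b - c + d) / 2.0    # diagonal detail
        # The normalized 2x2 Haar transform is orthonormal, hence invertible
        # with |det| = 1.
        return torch.cat([ll, lh, hl, hh], dim=1)
```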

For variational dequantization [13], we use a conditional normalizing flow and condition the aforementioned Glow-like building blocks on the image, i.e., either on y for the variational dequantization network of flow \(f_{\theta_1}\) or on x for the variational dequantization network of flow \(g_{\theta_2}\). We use the most common technique to incorporate conditions into Glow-like building blocks and simply pass the condition as an additional input to the affine coupling layer. Since we work with grayscale images in our experiments, we need to increase the number of channels to be able to use affine coupling layers. We thus apply a checkerboard mask, i.e., a pixel shuffling technique that increases the number of channels from one to two by moving every other pixel to the second channel. We apply this transformation to the condition image and to the input of the variational dequantization network after applying the logit function. Afterwards, we apply two Glow-like building blocks before transforming back via an inverse checkerboard mask and a sigmoid transformation.
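
The checkerboard pixel shuffling can be sketched as follows for single-channel images; the exact spatial layout of the two resulting channels in our implementation may differ from this illustration.

```python
import torch

def checkerboard_split(x):
    """Checkerboard pixel shuffling sketch: splits a single-channel image into
    two channels containing the 'even' and 'odd' checkerboard pixels.
    Input (B, 1, H, W) with even W -> output (B, 2, H, W // 2)."""
    B, C, H, W = x.shape
    mask = (torch.arange(H).view(H, 1) + torch.arange(W).view(1, W)) % 2 == 0
    even = x[:, :, mask].view(B, C, H, W // 2)   # pixels where (i + j) is even
    odd = x[:, :, ~mask].view(B, C, H, W // 2)   # pixels where (i + j) is odd
    return torch.cat([even, odd], dim=1)
```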

Overview of Subnetworks for Affine Coupling Layers. In the context of affine coupling layers, we use shallow convolutional and fully-connected subnetworks. An overview of these subnetworks is given in Tables 1 and 2.

Table 1. Overview of structure of convolutional subnetworks of affine coupling layers.
Table 2. Overview of structure of fully-connected subnetworks of affine coupling layers.
Table 3. Overview of general training setting.

General Training Settings. An overview of the general training settings can be found in Table 3.

We use the Fashion-MNIST dataset with the default train/test split and no data pre-processing steps during training and testing for all experiments except the partitioning experiments. For the partitioning experiments, we use random horizontal flipping during training.

All experiments were conducted on an NVIDIA Tesla V100 with 5120 CUDA cores and 16 GB of HBM2 memory at 3200 MHz on the OMNI computing cluster of the University of Siegen.

Image Inpainting Experiments. For image inpainting, the general training settings have been extended by the following weightings of the individual loss terms of the energy function (Tables 4 and 5):

Table 4. Overview of weighting parameters for image inpainting experiments in Sect. 4.2 per individual loss term of the energy function.
Table 5. Overview of weighting parameters for image super-resolution experiments in Sect. 4.1 per individual loss term of the energy function.

Image Super-Resolution Experiments. For image super-resolution, the general training settings have been extended by the following weightings of the individual loss terms of the energy function:


Copyright information

© 2023 The Author(s), under exclusive license to Springer Nature Switzerland AG

About this paper


Cite this paper

Runkel, C., Moeller, M., Schönlieb, CB., Etmann, C. (2023). Learning Posterior Distributions in Underdetermined Inverse Problems. In: Calatroni, L., Donatelli, M., Morigi, S., Prato, M., Santacesaria, M. (eds) Scale Space and Variational Methods in Computer Vision. SSVM 2023. Lecture Notes in Computer Science, vol 14009. Springer, Cham. https://doi.org/10.1007/978-3-031-31975-4_15


  • DOI: https://doi.org/10.1007/978-3-031-31975-4_15

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-031-31974-7

  • Online ISBN: 978-3-031-31975-4

