Here, we show that the value iteration scheme described through Algorithm 1 converges to a unique fixed point satisfying Eq. (8). To this end, we first prove the existence of a unique fixed point (Theorem 1) following [3, 25], and subsequently prove the convergence of the value iteration scheme presupposing that a unique fixed point exists (Theorem 2) following [27].
Theorem 1
Assuming a bounded reward function \(R_{s,a}^{s'}\), the optimal free-energy vector \(F^*(s)\) is a unique fixed point of Bellman’s equation \(F^*=BF^*\), where the mapping \(B:\mathbb {R}^{|\mathcal {S}|} \rightarrow \mathbb {R}^{|\mathcal {S}|}\) is defined as in Eq. (13).
Proof
Theorem 1 is proven through Propositions 1 and 2 in the following.
Proposition 1
The mapping \(T_{\pi ,\psi }: \mathbb {R}^{|\mathcal {S}|} \rightarrow \mathbb {R}^{|\mathcal {S}|}\) defined in Eq. (14) converges under repeated application to a unique solution for every policy-belief-pair \((\pi ,\psi )\), independent of the initial free-energy vector \(F(s)\).
Proof
By introducing the matrix \(P_{\pi ,\psi }(s,s')\) and the vector \(g_{\pi ,\psi }(s)\) as
$$\begin{aligned} P_{\pi ,\psi }(s,s') := \mathbb {E}_{{\pi (a|s)}} \bigg [ \mathbb {E}_{\psi (\theta |a,s)} \left[ T_\theta (s'|a,s)\right] \bigg ] , \end{aligned}$$
$$\begin{aligned} g_{\pi ,\psi }(s) := \mathbb {E}_{{\pi (a|s)}} \Bigg [ \mathbb {E}_{\psi (\theta |a,s)} \bigg [ \mathbb {E}_{T_\theta (s'|a,s)} \left[ R_{s,a}^{s'}\right] - \frac{1}{\beta } \log \frac{\psi (\theta |a,s)}{\mu (\theta |a,s)} \bigg ] - \frac{1}{\alpha } \log \frac{{\pi (a|s)}}{{\rho (a|s)}} \Bigg ], \end{aligned}$$
Eq. (14) may be expressed in compact form as \(T_{\pi ,\psi }F = g_{\pi ,\psi } + \gamma P_{\pi ,\psi } F\). Applying the mapping \(T_{\pi ,\psi }\) infinitely many times to an initial free-energy vector \(F\) yields the free-energy vector \(F_{\pi ,\psi }\) of the policy-belief-pair \((\pi ,\psi )\):
$$\begin{aligned} F_{\pi ,\psi } := \lim _{i\rightarrow \infty } T_{\pi ,\psi }^i F = \lim _{i\rightarrow \infty } \sum _{t=0}^{i-1} \gamma ^t P_{\pi ,\psi }^t g_{\pi ,\psi } + \underbrace{\lim _{i\rightarrow \infty } \gamma ^i P_{\pi ,\psi }^i F}_{\rightarrow 0} , \end{aligned}$$
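which no longer depends on the initial \(F\): the last term vanishes because \(\gamma \in [0;1)\) and \(P_{\pi ,\psi }\) is a stochastic matrix, so the entries of \(P_{\pi ,\psi }^i F\) stay bounded. It is straightforward to show that the quantity \(F_{\pi ,\psi }\) is a fixed point of the operator \(T_{\pi ,\psi }\):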
$$\begin{aligned} T_{\pi ,\psi }F_{\pi ,\psi }&= g_{\pi ,\psi } + \gamma P_{\pi ,\psi } \lim _{i\rightarrow \infty } \sum _{t=0}^{i-1} \gamma ^t P_{\pi ,\psi }^t g_{\pi ,\psi } \nonumber \\&= \gamma ^0 P_{\pi ,\psi }^0 g_{\pi ,\psi } + \lim _{i\rightarrow \infty } \sum _{t=1}^{i} \gamma ^t P_{\pi ,\psi }^t g_{\pi ,\psi } \nonumber \\&= \lim _{i\rightarrow \infty } \sum _{t=0}^{i-1} \gamma ^t P_{\pi ,\psi }^t g_{\pi ,\psi } + \underbrace{\lim _{i\rightarrow \infty } \gamma ^i P_{\pi ,\psi }^i g_{\pi ,\psi }}_{\rightarrow 0} = F_{\pi ,\psi }. \end{aligned}$$
Furthermore, \(F_{\pi ,\psi }\) is unique. To see this, take an arbitrary fixed point \(F'\) with \(T_{\pi ,\psi }F' = F'\). Repeated application of \(T_{\pi ,\psi }\) leaves \(F'\) unchanged, so \(F' = \lim _{i\rightarrow \infty }T_{\pi ,\psi }^iF'=F_{\pi ,\psi }\).
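The argument of Proposition 1 can be checked numerically. The following is a minimal sketch, not part of the original derivation: the matrix `P` and the vector `g` are randomly generated stand-ins for \(P_{\pi ,\psi }\) and \(g_{\pi ,\psi }\), and the number of states, the discount factor and the iteration count are arbitrary assumptions. It compares the iterated mapping with the closed form \((I-\gamma P_{\pi ,\psi })^{-1} g_{\pi ,\psi }\) of the geometric series.

```python
import numpy as np

rng = np.random.default_rng(0)
n_states, gamma = 5, 0.9  # illustrative values (assumptions)

# Hypothetical stand-ins: a row-stochastic matrix P and an arbitrary vector g.
P = rng.random((n_states, n_states))
P /= P.sum(axis=1, keepdims=True)
g = rng.normal(size=n_states)


def apply_T(F):
    """One application of the compact mapping T F = g + gamma * P F."""
    return g + gamma * P @ F


def iterate_T(F0, n_iter=500):
    """Apply T repeatedly, starting from the initial vector F0."""
    F = F0.copy()
    for _ in range(n_iter):
        F = apply_T(F)
    return F


# Closed form of sum_t gamma^t P^t g, i.e. the solution of (I - gamma P) F = g.
F_closed = np.linalg.solve(np.eye(n_states) - gamma * P, g)

# Two very different initial vectors end up at the same limit.
F_from_zero = iterate_T(np.zeros(n_states))
F_from_random = iterate_T(rng.normal(scale=100.0, size=n_states))

print(np.max(np.abs(F_from_zero - F_closed)))
print(np.max(np.abs(F_from_random - F_closed)))
```

Both printed deviations should be at the level of floating-point round-off, reflecting that the contribution \(\gamma ^i P_{\pi ,\psi }^i F\) of the initial vector decays geometrically.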
Proposition 2
The optimal free-energy vector \(F^*=\max _{\pi }\mathop {\mathrm {ext}}\nolimits _{\psi }F_{\pi ,\psi }\) is a unique fixed point of Bellman’s equation \(F^*=BF^*\).
Proof
The proof consists of two parts: we assume \(\mathop {\mathrm {ext}}\limits = \max \) in the first part and \(\mathop {\mathrm {ext}}\limits = \min \) in the second. Let \(\mathop {\mathrm {ext}}\limits = \max \) and \(F^*=F_{\pi ^*,\psi ^*}\), where \((\pi ^*,\psi ^*)\) denotes the optimal policy-belief-pair. Then
$$\begin{aligned} F^* = T_{\pi ^*,\psi ^*} F^* \le \underbrace{\max _\pi \max _\psi T_{\pi ,\psi } F^*}_{=BF^*} =: T_{\pi ',\psi '} F^* \mathop {\le }\limits ^{\text {Induction}} F_{\pi ',\psi '}, \end{aligned}$$
where the last inequality can be straightforwardly proven by induction (see footnote 1), exploiting the fact that \(P_{\pi ,\psi }(s,s') \in [0;1]\). But by definition \(F^* = \max _{\pi }\max _{\psi }F_{\pi ,\psi } \ge F_{\pi ',\psi '}\), hence \(F^* = F_{\pi ',\psi '}\) and therefore \(F^*=BF^*\). Furthermore, \(F^*\) is unique. Assume for this purpose an arbitrary fixed point \(F'=F_{\pi ',\psi '}\) such that \(F'=BF'\) with the corresponding policy-belief-pair \((\pi ',\psi ')\). Then
$$\begin{aligned} F^* = T_{\pi ^*,\psi ^*} F^* \ge T_{\pi ',\psi '} F^* \mathop {\ge }\limits ^{\text {Induction}} F_{\pi ',\psi '} = F', \end{aligned}$$
and similarly \(F' \ge F^*\), hence \(F' = F^*\).
Let \(\mathop {\mathrm {ext}}\limits = \min \) and \(F^*=F_{\pi ^*,\psi ^*}\). By taking a closer look at Eq. (13), it can be seen that the optimization over \(\psi \) does not depend on \(\pi \). Then
$$\begin{aligned} F^* = T_{\pi ^*,\psi ^*} F^* \ge \min _{\psi } T_{\pi ^*,\psi } F^* =: T_{\pi ^*,\psi '} F^* \mathop {\ge }\limits ^{\text {Induction}} F_{\pi ^*,\psi '}. \end{aligned}$$
But by definition \(F^*=\min _{\psi }F_{\pi ^*,\psi } \le F_{\pi ^*,\psi '}\), hence \(F^*=F_{\pi ^*,\psi '}\). Therefore it holds that \(BF^* = \max _\pi \min _\psi T_{\pi ,\psi } F^* = \max _\pi T_{\pi ,\psi ^*} F^*\) and similar to the first part of the proof we obtain
$$\begin{aligned} F^* = T_{\pi ^*,\psi ^*} F^* \le \underbrace{\max _\pi T_{\pi ,\psi ^*} F^*}_{= BF^*} =: T_{\pi ',\psi ^*} F^* \mathop {\le }\limits ^{\text {Induction}} F_{\pi ',\psi ^*}. \end{aligned}$$
But by definition \(F^* = \max _{\pi }F_{\pi ,\psi ^*} \ge F_{\pi ',\psi ^*}\), hence \(F^*=F_{\pi ',\psi ^*}\) and therefore \(F^*=BF^*\). Furthermore, \(F_{\pi ^*, \psi ^*}\) is unique. Assume for this purpose an arbitrary fixed point \(F'=F_{\pi ',\psi '}\) such that \(F'=BF'\). Then
$$\begin{aligned} F' = T_{\pi ',\psi '} F' \le T_{\pi ',\psi ^*} F' \mathop {\le }\limits ^{\text {Induction}} F_{\pi ', \psi ^*} \mathop {\le }\limits ^{\text {Induction}} T_{\pi ',\psi ^*} F^* \le T_{\pi ^*,\psi ^*} F^* = F^*, \end{aligned}$$
and similarly \(F^* \le F'\), hence \(F^* = F'\).
Theorem 2
Let \(\epsilon \) be a positive number satisfying \(\epsilon <\frac{\eta }{1-\gamma }\) where \(\gamma \in [0;1)\) is the discount factor and where u and l are the bounds of the reward function \(R_{s,a}^{s'}\) such that \(l \le R_{s,a}^{s'}\le u\) and \(\eta =\max \{|u|,|l|\}\). Suppose that the value iteration scheme from Algorithm 1 is run for \(i=\lceil \log _\gamma \frac{\epsilon (1-\gamma )}{\eta } \rceil \) iterations with an initial free-energy vector \(F(s)=0\) for all s. Then, it holds that \(\max _s |F^*(s) - B^iF(s)| \le \epsilon \), where \(F^*\) refers to the unique fixed point from Theorem 1.
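As a quick numerical illustration of the iteration count in Theorem 2, the following sketch evaluates \(i=\lceil \log _\gamma \frac{\epsilon (1-\gamma )}{\eta } \rceil \) for one choice of parameters. The helper name `iterations_needed` and the values of \(\gamma \), \(\eta \) and \(\epsilon \) are assumptions made for illustration only. The helper additionally requires \(\gamma >0\), since \(\log _\gamma \) is undefined for \(\gamma =0\).

```python
import math


def iterations_needed(epsilon, gamma, eta):
    """Return ceil(log_gamma(epsilon * (1 - gamma) / eta)).

    Hypothetical helper; it requires 0 < gamma < 1 and
    0 < epsilon < eta / (1 - gamma).
    """
    if not 0.0 < gamma < 1.0:
        raise ValueError("gamma must lie strictly between 0 and 1")
    if not 0.0 < epsilon < eta / (1.0 - gamma):
        raise ValueError("epsilon must satisfy 0 < epsilon < eta / (1 - gamma)")
    # Change of base: log_gamma(x) = ln(x) / ln(gamma).
    return math.ceil(math.log(epsilon * (1.0 - gamma) / eta) / math.log(gamma))


# Illustrative parameter values (assumptions, not taken from the text).
gamma, eta, epsilon = 0.9, 1.0, 0.01
i = iterations_needed(epsilon, gamma, eta)

# Worst-case error bound gamma^i * eta / (1 - gamma) after i iterations.
error_bound = gamma**i * eta / (1.0 - gamma)

print(i)  # 66 for these values
print(error_bound <= epsilon)
```

For these values, one iteration fewer would leave the worst-case bound \(\gamma ^{i-1}\frac{\eta }{1-\gamma }\) above \(\epsilon \), so the ceiling in the theorem is the smallest count that this bound can certify.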
Proof
We start the proof by showing that the \(L_\infty \)-norm of the difference between the optimal free-energy vector \(F^*\) and \(B^iF\) decreases exponentially with the number of iterations \(i\):
$$\begin{aligned} \max _s&\left| F^*(s) - B^iF(s)\right| =: \left| F^*(s^*) - B^iF(s^*)\right| \nonumber \\&\mathop {=}\limits ^{\text {Eq. } (9)} \left| \max _\pi \mathbb {E}_{\pi (a|s^*)} \bigg [ \frac{1}{\beta } \log Z_\beta (a,s^*) - \frac{1}{\alpha } \log \frac{\pi (a|s^*)}{\rho (a|s^*)}\bigg ] \right. \nonumber \\&\left. - \max _\pi \mathbb {E}_{\pi (a|s^*)} \bigg [ \frac{1}{\beta } \log Z^i_\beta (a,s^*) - \frac{1}{\alpha } \log \frac{\pi (a|s^*)}{\rho (a|s^*)}\bigg ] \right| \nonumber \\&\le \max _\pi \left| \mathbb {E}_{\pi (a|s^*)} \bigg [ \frac{1}{\beta } \log Z_\beta (a,s^*) - \frac{1}{\beta } \log Z^i_\beta (a,s^*) \bigg ] \right| \nonumber \\&\le \max _a \left| \frac{1}{\beta } \log Z_\beta (a,s^*) - \frac{1}{\beta } \log Z^i_\beta (a,s^*) \right| \nonumber \\&=: \left| \frac{1}{\beta } \log Z_\beta (a^*,s^*) - \frac{1}{\beta } \log Z^i_\beta (a^*,s^*) \right| \nonumber \\&\mathop {=}\limits ^{\text {Eq. } (11)} \left| \mathop {\mathrm {ext}}\limits _\psi \mathbb {E}_{\psi (\theta |a^*,s^*)} \bigg [ \mathbb {E}_{T_\theta (s'|a^*,s^*)} \big [ R_{s,a}^{s'}+ \gamma F^*(s') \big ] -\frac{1}{\beta } \log \frac{\psi (\theta |a^*,s^*)}{\mu (\theta |a^*,s^*)} \bigg ] \right. \nonumber \\&-\, \left. \mathop {\mathrm {ext}}\limits _\psi \mathbb {E}_{\psi (\theta |a^*,s^*)} \bigg [ \mathbb {E}_{T_\theta (s'|a^*,s^*)} \big [ R_{s,a}^{s'}+ \gamma B^{i-1}F(s') \big ] -\frac{1}{\beta } \log \frac{\psi (\theta |a^*,s^*)}{\mu (\theta |a^*,s^*)} \bigg ] \right| \nonumber \\&\le \max _\psi \left| \mathbb {E}_{\psi (\theta |a^*,s^*)} \bigg [ \mathbb {E}_{T_\theta (s'|a^*,s^*)} \big [ \gamma F^*(s') - \gamma B^{i-1}F(s')\big ] \bigg ] \right| \nonumber \\&\le \gamma \max _s \left| F^*(s) - B^{i-1}F(s) \right| \mathop {\le }\limits ^{\text {Recur.}} \gamma ^i \max _s \left| F^*(s) - F(s) \right| \le \gamma ^i \frac{\eta }{1-\gamma } \nonumber , \end{aligned}$$
where we exploit the fact that \(\left| \mathop {\mathrm {ext}}\nolimits _xf(x) - \mathop {\mathrm {ext}}\nolimits _xg(x) \right| \le \max _x \left| f(x) - g(x) \right| \) and that the free energy is bounded through the reward bounds \(l\) and \(u\) with \(\eta =\max \{|u|,|l|\}\), so that \(\max _s |F^*(s) - F(s)| \le \frac{\eta }{1-\gamma }\) for the initial vector \(F(s)=0\). Requiring \(\gamma ^i \frac{\eta }{1-\gamma } \le \epsilon \) for a convergence criterion \(\epsilon >0\) then yields \(i \ge \log _\gamma \frac{\epsilon (1-\gamma )}{\eta }\); the inequality flips because \(\log _\gamma \) is decreasing for \(\gamma \in (0;1)\). The condition \(\epsilon < \frac{\eta }{1-\gamma }\) ensures that the right-hand side is positive.
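The bound can also be checked empirically on a small random problem. The sketch below is not Algorithm 1 itself but a simplified stand-in under stated assumptions: a finite set of candidate models \(\theta \), uniform priors \(\rho \) and \(\mu \), the sign convention that \(\beta <0\) corresponds to \(\mathop {\mathrm {ext}}\limits = \min \), and the standard closed-form log-sum-exp solutions of the two inner optimizations. All array names and parameter values are hypothetical.

```python
import math

import numpy as np

rng = np.random.default_rng(1)
n_s, n_a, n_theta = 4, 3, 2
gamma, alpha, beta = 0.8, 2.0, -1.5  # beta < 0: ext = min (assumed convention)

# Hypothetical model: T[theta, s, a, s'] are transition probabilities,
# R[s, a, s'] are bounded rewards, rho and mu are uniform priors.
T = rng.random((n_theta, n_s, n_a, n_s))
T /= T.sum(axis=-1, keepdims=True)
R = rng.uniform(-1.0, 1.0, size=(n_s, n_a, n_s))
rho = np.full((n_s, n_a), 1.0 / n_a)
mu = np.full((n_s, n_a, n_theta), 1.0 / n_theta)
eta = np.abs(R).max()  # eta = max{|u|, |l|} for the tightest bounds l, u


def bellman_B(F):
    """One sweep of the assumed closed-form Bellman operator B."""
    # Q[theta, s, a] = E_{T_theta(s'|a,s)}[R + gamma * F(s')]
    Q = np.einsum("tsan,san->tsa", T, R + gamma * F[None, None, :])
    Q = np.moveaxis(Q, 0, -1)  # reorder to [s, a, theta]
    # (1/beta) log Z_beta(a, s): extremum over beliefs psi in closed form.
    inner = np.log(np.sum(mu * np.exp(beta * Q), axis=-1)) / beta
    # Maximum over policies pi in closed form.
    return np.log(np.sum(rho * np.exp(alpha * inner), axis=-1)) / alpha


epsilon = 1e-3
n_iter = math.ceil(math.log(epsilon * (1.0 - gamma) / eta) / math.log(gamma))

# Run the prescribed number of iterations from F = 0.
F = np.zeros(n_s)
for _ in range(n_iter):
    F = bellman_B(F)

# Reference fixed point, approximated by iterating much longer.
F_star = F.copy()
for _ in range(2000):
    F_star = bellman_B(F_star)

sup_error = np.max(np.abs(F_star - F))
print(n_iter, sup_error, sup_error <= epsilon)
```

In this sketch the observed error is typically well below \(\epsilon \), which is consistent with the theorem: \(\gamma ^i\frac{\eta }{1-\gamma }\) is a worst-case bound, not an estimate of the actual error.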