In this section, we turn to a guided search of the label space. Since this space is vast, the search is guided by a parametric model that is itself optimized on the training data.
Problem setting for structured prediction with parametric decoder
At application time, prediction \({\hat{\mathbf{{y}}}}\) is determined by solving the decoding problem of Eq. 7; decision function \(f_{\varvec{\phi }}(\mathbf{{x}},\mathbf{{y}})\) depends on a feature vector \({\varvec{\Phi }}(\mathbf{{x}},\mathbf{{y}})\). The decoder is allowed T (plus a constant number of) evaluations of the decision function, where \(T\sim p(T|\tau )\) is drawn from some distribution and is not known in advance. The decoder has parameters \({\varvec{\psi }}\) that control the choice of candidate labelings.
In the available T time steps, the decoder has to create a set of candidate labelings \(Y_T(\mathbf{{x}})\) for which the decision function is evaluated. The decoding process starts in a state \(Y_0(\mathbf{{x}})\) that contains a constant number of labelings. In each time step \(t+1\), the decoder can choose an action \(a_{t+1}\) from the action space \(A_{Y_t}\); this space should be designed to be much smaller than the label space \({\mathcal Y}(\mathbf{{x}})\). Action \(a_{t+1}\) creates another labeling \(\mathbf{{y}}_{t+1}\); this additional labeling creates successor state \(Y_{t+1}(\mathbf{{x}})=a_{t+1}(Y_t(\mathbf{{x}}))=Y_t(\mathbf{{x}})\cup \{\mathbf{{y}}_{t+1}\}\).
In a basic definition, \(A_{Y_t}\) could consist of actions \(\alpha _{\mathbf{{y}}j}\) (for all \(\mathbf{{y}}\in Y_t\) and \(1\le j\le n_{\mathbf{{x}}}\), where \(n_{\mathbf{{x}}}\) is the number of clients in \(\mathbf{{x}}\)) that take output \(\mathbf{{y}}\in Y_t(\mathbf{{x}})\) and generate labeling \({\bar{\mathbf{{y}}}}\) by flipping the label of the j-th client; output \(Y_{t+1}(\mathbf{{x}})=Y_t(\mathbf{{x}})\cup \{\bar{\mathbf{{y}}}\}\) is \(Y_t(\mathbf{{x}})\) plus this modified output. This definition would allow the entire space \(\mathcal{Y}(\mathbf{{x}})\) to be reached from any starting point. In our experiments, we will construct an action space that contains application-specific state transitions such as flip the labels of the k addresses that have the most open connections—see Sect. 7.3.
The choice of action \(a_{t+1}\) is based on parameters \({\varvec{\psi }}\) of the decoder, and on a feature vector \({\varvec{\Psi }}(\mathbf{{x}},Y_t(\mathbf{{x}}),a_{t+1})\); for instance, actions may be chosen by following a stochastic policy \(a_{t+1}\sim \pi _{\varvec{\psi }}(\mathbf{{x}},Y_t(\mathbf{{x}}))\). We will define feature vector \({\varvec{\Psi }}(\mathbf{{x}},Y_t(\mathbf{{x}}),a_{t+1})\) in Sect. 7.4.4; for instance, it may contain the difference between the geographical distribution of clients whose label is changed by action \(a_{t+1}\) and the geographical distribution of all clients with that same label. Choosing an action \(a_{t+1}\) requires an evaluation of \({\varvec{\Psi }}(\mathbf{{x}},Y_t(\mathbf{{x}}),a_{t+1})\) for each possible action in \(A_{Y_t(\mathbf{{x}})}\). Our problem setting is most useful for applications in which evaluation of \({\varvec{\Psi }}(\mathbf{{x}},Y_t(\mathbf{{x}}),a_{t+1})\) takes less time than evaluation of \({\varvec{\Phi }}(\mathbf{{x}},\mathbf{{y}}_{t+1})\)—otherwise, it might be better to evaluate the decision function for a larger set of randomly drawn outputs than to spend time on selecting outputs for which the decision function should be evaluated. Feature vector \({\varvec{\Psi }}(\mathbf{{x}},Y_t(\mathbf{{x}}),a_{t+1})\) may contain a computationally inexpensive subset of \({\varvec{\Phi }}(\mathbf{{x}},\mathbf{{y}}_{t+1})\).
After T steps, the decoding process is terminated. At this point, the decision-function values \(f_{\varvec{\phi }}\) of a set of candidate outputs \(Y_T(\mathbf{{x}})\) have been evaluated. Prediction \({\hat{\mathbf{{y}}}}\) is the argmax of the decision function over this set:
$$\begin{aligned} {\hat{\mathbf{{y}}}}= {\mathop {{{\mathrm{ argmax\,}}}} \limits _{\mathbf{{y}}\in Y_T(\mathbf{{x}})}} f_{\varvec{\phi }}(\mathbf{{x}},\mathbf{{y}}). \end{aligned}$$
(10)
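To make the decoding process concrete, the following sketch outlines the generic loop; the helpers `actions`, `choose_action`, and `f_phi` are illustrative placeholders for the application-specific definitions of Sect. 7, not part of our method.

```python
import numpy as np

def decode(x, Y0, actions, choose_action, f_phi, T):
    """Generic anytime decoding loop: grow the candidate set Y_t for T steps,
    then return the argmax of the decision function over Y_T (Eq. 10).
    `actions`, `choose_action`, and `f_phi` are illustrative placeholders."""
    Y = list(Y0)                           # Y_0(x): initial candidate labelings
    for t in range(T):
        A = actions(x, Y)                  # action space A_{Y_t}
        a = choose_action(x, Y, A)         # uses decoder features Psi(x, Y_t, a)
        Y.append(a(x, Y))                  # Y_{t+1}(x) = Y_t(x) union {y_{t+1}}
    scores = [f_phi(x, y) for y in Y]      # decision-function evaluations
    return Y[int(np.argmax(scores))]       # prediction of Eq. 10
```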
At training time, a labeled sample \(L=\{(\mathbf{{x}}_1,\mathbf{{y}}_1),\dots , (\mathbf{{x}}_n,\mathbf{{y}}_n)\}\) is available.
HC search
HC search (Doppa et al. 2013) is an approach to structured prediction that learns parameters \({\varvec{\psi }}\) of a search heuristic, and then uses a decoder with this search heuristic to learn parameters \({\varvec{\phi }}\) of a structured-prediction model (the decision function \(f_{\varvec{\phi }}\) is called the cost-function in HC-search terminology). We apply this principle to our problem setting.
At application time, the decoder produces labeling \({\hat{\mathbf{{y}}}}\) that approximately maximizes \(f_{\varvec{\phi }}(\mathbf{{x}},\mathbf{{y}})\) as follows. The starting point \(Y_0(\mathbf{{x}})\) of each decoding problem contains the labeling produced by the logistic regression classifier (see Sect. 4.2). Action \(a_{t+1}\in A_{Y_t}\) is chosen deterministically as the maximizer of the search heuristic \(f_{\varvec{\psi }}({\varvec{\Psi }}(\mathbf{{x}},Y_t(\mathbf{{x}}),a_{t+1}))={\varvec{\psi }}^\top {\varvec{\Psi }}(\mathbf{{x}},Y_t(\mathbf{{x}}),a_{t+1})\). After T steps, the argmax \({\hat{\mathbf{{y}}}}\) of \(f_{\varvec{\phi }}(\mathbf{{x}},\mathbf{{y}})\) over all outputs in \(Y_T(\mathbf{{x}})=a_T(\ldots a_1(Y_0(\mathbf{{x}}))\ldots )\) (Eq. 7) is returned as the prediction.
At training time, HC search first learns a search heuristic with parameters \({\varvec{\psi }}\) as follows. Let \(L_{\varvec{\psi }}\) be an initially empty set of training constraints for the heuristic. For each training instance \((\mathbf{{x}}_i,\mathbf{{y}}_i)\), starting state \(Y_0(\mathbf{{x}}_i)\) contains the labeling produced by the logistic regression classifier (see Sect. 4.2). Time t then iterates from 1 to an upper bound \({\bar{T}}\) on the number of time steps that will be available for decoding at application time. In each step, all elements \(a_{t+1}\) of the finite action space \(A_{Y_t}(\mathbf{{x}}_i)\) and their corresponding outputs \(\mathbf{{y}}'_{t+1}\) are enumerated, and the action \(a_{t+1}^*\) that leads to the lowest-cost output \(\mathbf{{y}}^*_{t+1}\) is determined. Since the training data are labeled, the actual costs of labeling \(\mathbf{{x}}_i\) as \(\mathbf{{y}}'_{t+1}\) when the correct labeling would be \(\mathbf{{y}}_i\) can be determined by evaluating the cost function. Search heuristic \(f_\psi \) has to assign a higher value to \(a_{t+1}^*\) than to any other \(a_{t+1}\), and the costs \(c(\mathbf{{x}}_i,\mathbf{{y}}_i,\mathbf{{y}}'_{t+1})\) of choosing a poor action should be included in the optimization problem. Hence, for each action \(a_{t+1}\in A_{Y_t}(\mathbf{{x}}_i)\), constraint
$$\begin{aligned}&f_\psi ({\varvec{\Psi }}(\mathbf{{x}}_i,Y_t(\mathbf{{x}}_i),a_{t+1}^*))-f_\psi ({\varvec{\Psi }}(\mathbf{{x}}_i,Y_t(\mathbf{{x}}_i),a_{t+1}))\nonumber \\&\quad >\sqrt{c(\mathbf{{x}}_i,\mathbf{{y}}_i,\mathbf{{y}}'_{t+1})-c(\mathbf{{x}}_i,\mathbf{{y}}_i,\mathbf{{y}}^*_{t+1})} \end{aligned}$$
(11)
is added to \(L_{\varvec{\psi }}\). Model \({\varvec{\psi }}\) should satisfy the constraints in \(L_{\varvec{\psi }}\). We use a soft-margin version of the constraints in \(L_{\varvec{\psi }}\) and squared slack-terms which results in a cost-sensitive multi-class SVM (actions a are the classes) with margin scaling (Tsochantaridis et al. 2005).
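For illustration, the constraint-generation step can be sketched as follows; `actions`, `apply_action`, and `cost` are hypothetical helpers, and the returned triples would be handed to the cost-sensitive multi-class SVM solver.

```python
def heuristic_constraints(x_i, y_i, Y0, actions, apply_action, Psi, cost, T_bar):
    """Collect the margin constraints of Eq. 11 along the greedy search path.
    `actions`, `apply_action`, and `cost` are illustrative placeholders;
    `cost(x, y, y_hat)` evaluates c(x, y, y_hat)."""
    constraints, Y = [], list(Y0)
    for t in range(T_bar):
        candidates = {a: apply_action(a, x_i, Y) for a in actions(x_i, Y)}
        costs = {a: cost(x_i, y_i, y) for a, y in candidates.items()}
        a_star = min(costs, key=costs.get)          # action with lowest-cost output
        for a in candidates:
            if a is a_star:
                continue
            margin = (costs[a] - costs[a_star]) ** 0.5   # right-hand side of Eq. 11
            constraints.append((Psi(x_i, Y, a_star), Psi(x_i, Y, a), margin))
        Y.append(candidates[a_star])                # follow the best action
    return constraints
```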
After parameters \({\varvec{\psi }}\) have been fixed, parameters \({\varvec{\phi }}\) of structured-prediction model \(f_{\varvec{\phi }}(\mathbf{{x}},\mathbf{{y}})={\varvec{\phi }}^\top {\varvec{\Phi }}(\mathbf{{x}},\mathbf{{y}})\) are trained on the training data set of input-output pairs \((\mathbf{{x}}_i,\mathbf{{y}}_i)\) using SVM-struct with margin rescaling and using the search heuristic with parameters \({\varvec{\psi }}\) as decoder. Negative pseudo-labels are generated as follows. For each \((\mathbf{{x}}_i,\mathbf{{y}}_i)\in L\), heuristic \({\varvec{\psi }}\) is applied \({\bar{T}}\) times to produce a sequence of output sets \(Y_0(\mathbf{{x}}_i),\ldots ,Y_{\bar{T}}(\mathbf{{x}}_i)\). If \({\bar{\mathbf{{y}}}}={{\mathrm{ argmax\,}}}_{\mathbf{{y}}\in Y_{\bar{T}}(\mathbf{{x}}_i)}{\varvec{\phi }}^\top {\varvec{\Phi }}(\mathbf{{x}}_i,\mathbf{{y}})\ne \mathbf{{y}}_i\) violates the cost-rescaled margin, a new training constraint is added, and parameters \({\varvec{\phi }}\) are optimized to satisfy these constraints.
Online policy-gradient decoder
The decoder of HC search has been trained to locate the labeling \({\hat{\mathbf{{y}}}}\) that minimizes the costs \(c(\mathbf{{x}}_i,\mathbf{{y}}_i,{\hat{\mathbf{{y}}}})\) for given true labels. At application time, however, it is used to find candidate labelings for which \(f_{\varvec{\phi }}(\mathbf{{x}},\mathbf{{y}})\) is evaluated, with the goal of maximizing \(f_{\varvec{\phi }}\). Since the decision function \(f_{\varvec{\phi }}\) may be an imperfect approximation of the input-output relationship that is reflected in the training data, labelings that minimize the costs \(c(\mathbf{{x}}_i,\mathbf{{y}}_i,{\hat{\mathbf{{y}}}})\) might differ from outputs that maximize the decision function. We will now derive a closed optimization problem in which decoder and structured-prediction model are jointly optimized, and study its convergence properties theoretically.
We now demand that during the decoding process, the decoder chooses action \(a_{t+1}\in A_{Y_t}\) which generates successor state \(Y_{t+1}(\mathbf{{x}})=a_{t+1}(Y_t(\mathbf{{x}}))\) according to a stochastic policy, \(a_{t+1}\sim \pi _{\varvec{\psi }}(\mathbf{{x}},Y_t(\mathbf{{x}}))\), with parameter \({\varvec{\psi }}\in \mathbb {R}^{m_2}\) (where \(m_2\) is the dimensionality of the decoder feature space) and features \({\varvec{\Psi }}(\mathbf{{x}},Y_t(\mathbf{{x}}),a_{t+1})\). At time T, the prediction is the highest-scoring output from \({Y}_T(\mathbf{{x}})\) according to Eq. 7.
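For concreteness, one possible instantiation of the stochastic policy, given here only for illustration (Sect. 7.4.4 defines the features we actually use), is a Gibbs distribution over the linear scores \({\varvec{\psi }}^\top {\varvec{\Psi }}(\mathbf{{x}},Y_t(\mathbf{{x}}),a)\):

```python
import numpy as np

def gibbs_policy(psi, Psi_matrix, rng):
    """Sample an action index from a softmax over linear scores psi^T Psi(x, Y_t, a).
    `Psi_matrix` holds one feature vector Psi(x, Y_t, a) per row, one row per
    action in A_{Y_t}. This is one possible instantiation, not the only choice."""
    scores = Psi_matrix @ psi
    probs = np.exp(scores - scores.max())   # subtract max for numerical stability
    probs /= probs.sum()
    return rng.choice(len(probs), p=probs)
```

For example, `gibbs_policy(psi, Psi_matrix, np.random.default_rng(0))` draws one action index.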
The learning problem is to find parameters \({\varvec{\phi }}\) and \({\varvec{\psi }}\) that minimize the expected costs over all inputs, outputs, and numbers of available decoding steps:
$$\begin{aligned}&\mathop {{{\mathrm{argmin\,}}}}\limits _{{{\varvec{\phi }},{\varvec{\psi }}}}\; {\mathbb {E}}_{(\mathbf{{x}},\mathbf{{y}}), T, Y_T(\mathbf{{x}})} \left[ c\left( \mathbf{{x}}, \mathbf{{y}}, {\mathop {{{\mathrm{ argmax\,}}}}\limits _{\hat{\mathbf{{y}}}\in Y_T(\mathbf{{x}})}} f_{\varvec{\phi }}(\mathbf{{x}}, {\hat{\mathbf{{y}}}})\right) \right] \end{aligned}$$
(12)
$$\begin{aligned}&\hbox {with } (\mathbf{{x}},\mathbf{{y}})\sim p(\mathbf{{x}},\mathbf{{y}}),\quad T\sim p(T|\tau ) \end{aligned}$$
(13)
$$\begin{aligned}&Y_T(\mathbf{{x}})\sim p(Y_T(\mathbf{{x}}) | \pi _{\varvec{\psi }},\mathbf{{x}}, T). \end{aligned}$$
(14)
The costs \(c(\mathbf{{x}},\mathbf{{y}},{\hat{\mathbf{{y}}}})\) of the highest-scoring element \({\hat{\mathbf{{y}}}}={{\mathrm{ argmax\,}}}_{{\mathbf{{y}}'}\in Y_T(\mathbf{{x}})} f_{\varvec{\phi }}(\mathbf{{x}}, {\mathbf{{y}}'})\) may not be differentiable in \({\varvec{\phi }}\). Let therefore loss \(\ell (\mathbf{{x}},\mathbf{{y}},Y_T(\mathbf{{x}});{\varvec{\phi }})\) be a differentiable approximation of the cost that \({\varvec{\phi }}\) induces on the set \(Y_T(\mathbf{{x}})\). Section 7.2 instantiates the loss for the motivating problem. Distribution \(p(\mathbf{{x}},\mathbf{{y}})\) is unknown. Given the training data \(L=\{(\mathbf{{x}}_1,\mathbf{{y}}_1),\dots ,(\mathbf{{x}}_n,\mathbf{{y}}_n)\}\), we approximate the expected costs (Eq. 12) by the regularized expected empirical loss with convex regularizers \({\varOmega }_{\varvec{\phi }}\) and \({\varOmega }_{\varvec{\psi }}\):
$$\begin{aligned} {\varvec{\phi }}^*,{\varvec{\psi }}^*= & {} \mathop {{{\mathrm{argmin\,}}}}\limits _{{{\varvec{\phi }},{\varvec{\psi }}}} \sum _{(\mathbf{{x}},\mathbf{{y}})\in L} V_{{\varvec{\phi }},{\varvec{\psi }},\tau }(\mathbf{{x}},\mathbf{{y}}) +{\varOmega }_{\varvec{\phi }}+{\varOmega }_{\varvec{\psi }}\end{aligned}$$
(15)
$$\begin{aligned} \hbox {with }V_{{\varvec{\phi }},{\varvec{\psi }},\tau }(\mathbf{{x}},\mathbf{{y}})= & {} \sum _{T=1}^\infty \Bigl (p(T|\tau ) \sum _{Y_T(\mathbf{{x}})} p(Y_T(\mathbf{{x}}) | \pi _{\varvec{\psi }},\mathbf{{x}},T) \ell (\mathbf{{x}},\mathbf{{y}},Y_T(\mathbf{{x}});{\varvec{\phi }})\Bigr ). \end{aligned}$$
(16)
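For illustration, one possible differentiable surrogate (not necessarily the instantiation of Sect. 7.2) replaces the hard argmax of Eq. 12 by a softmax weighting of the candidates' costs:
$$\begin{aligned} \ell (\mathbf{{x}},\mathbf{{y}},Y_T(\mathbf{{x}});{\varvec{\phi }}) = \sum _{{\hat{\mathbf{{y}}}}\in Y_T(\mathbf{{x}})} \frac{\exp (f_{\varvec{\phi }}(\mathbf{{x}},{\hat{\mathbf{{y}}}}))}{\sum _{\mathbf{{y}}'\in Y_T(\mathbf{{x}})}\exp (f_{\varvec{\phi }}(\mathbf{{x}},\mathbf{{y}}'))}\, c(\mathbf{{x}},\mathbf{{y}},{\hat{\mathbf{{y}}}}), \end{aligned}$$
which is differentiable in \({\varvec{\phi }}\) and approaches the cost of the highest-scoring candidate as the score differences grow.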
Equation 15 still cannot be solved immediately because it contains a sum over all values of T and all sets \(Y_T(\mathbf{{x}})\). To solve Eq. 15, we will liberally borrow ideas from the field of reinforcement learning. First, we will derive a formulation of the gradient \(\nabla _{{\varvec{\psi }},{\varvec{\phi }}} V_{{\varvec{\psi }},{\varvec{\phi }},\tau }(\mathbf{{x}},\mathbf{{y}})\). The gradient still involves an intractable sum over all sequences of actions, but its formulation suggests that it can be approximated by sampling action sequences according to the stochastic policy. By using a baseline function—a common tool in reinforcement learning (Greensmith et al. 2004)—we can reduce the variance of this sampling process.
Let \(a_{1\dots T}=a_1,\dots ,a_{T}\) with \(a_{t+1}\in A_{Y_t}\) be a sequence of actions that executes a transition from \(Y_0(\mathbf{{x}})\) to \(Y_T(\mathbf{{x}})=a_T(\dots (a_1(Y_0(\mathbf{{x}})))\dots )\). The available computation time is finite and hence \(p(T|\tau )=0\) for all \(T>{\bar{T}}\) for some \({\bar{T}}\). We can rewrite Eq. 16:
$$\begin{aligned} V_{{\varvec{\phi }},{\varvec{\psi }},\tau }(\mathbf{{x}},\mathbf{{y}})&=\sum _{a_{1\dots {\bar{T}}}} \Bigl (p(a_{1\dots {{\bar{T}}}}|{\varvec{\psi }},Y_0(\mathbf{{x}}))\sum _{T=1}^{{\bar{T}}} p(T|\tau ) \ell (\mathbf{{x}},\mathbf{{y}},a_T(\dots (a_1(Y_0(\mathbf{{x}})))\dots );{\varvec{\phi }})\Bigr ),\nonumber \\ \hbox {with}\quad&p(a_{1\dots {{\bar{T}}}}|{\varvec{\psi }},Y_0(\mathbf{{x}}))=\prod _{t=1}^{\bar{T}}\pi _{\varvec{\psi }}(a_t|\mathbf{{x}},a_{t-1}(\dots (a_1(Y_0(\mathbf{{x}})))\dots )). \end{aligned}$$
(17)
Equation 18 defines \(D_{\ell ,\tau }\) as the partial gradient \(\nabla _{\varvec{\phi }}\) of the expected empirical loss for an action sequence \(a_1,\dots ,a_{\bar{T}}\) that has been sampled according to \(p(a_{1\dots {{\bar{T}}}}|{\varvec{\psi }},Y_0(\mathbf{{x}}))\).
$$\begin{aligned} D_{\ell ,\tau }(a_{1\dots {{\bar{T}}}},Y_0(\mathbf{{x}});{\varvec{\phi }}) = \sum \nolimits _{T=1}^{{\bar{T}}} p(T|\tau ) \nabla _{\varvec{\phi }}\ell (\mathbf{{x}},\mathbf{{y}},a_T(\dots (a_1(Y_0(\mathbf{{x}})))\dots );{\varvec{\phi }}) \end{aligned}$$
(18)
The policy gradient \(\nabla _{\varvec{\psi }}\) of a summand of Eq. 17 is
$$\begin{aligned}&{\nabla _{\varvec{\psi }}p(a_{1\dots {{\bar{T}}}}|{\varvec{\psi }},Y_0(\mathbf{{x}})) \sum _{T=1}^{{\bar{T}}} p(T|\tau ) \ell (\mathbf{{x}},\mathbf{{y}},a_T(\dots (a_1(Y_0(\mathbf{{x}})))\dots );{\varvec{\phi }})} \nonumber \\&= \Bigl (p(a_{1\dots {{\bar{T}}}}|{\varvec{\psi }},Y_0(\mathbf{{x}})) \sum _{T=1}^{{\bar{T}}} \nabla _{\varvec{\psi }}\log \pi _{\varvec{\psi }}(a_T|\mathbf{{x}},a_{T-1}(\dots (a_1(Y_0(\mathbf{{x}})))\dots ))\Bigr ) \nonumber \\&\qquad \times \sum _{T=1}^{{\bar{T}}} p(T|\tau )\ell (\mathbf{{x}},\mathbf{{y}},a_T(\dots (a_1(Y_0(\mathbf{{x}})))\dots );{\varvec{\phi }}). \end{aligned}$$
(19)
Equation 19 uses the “log trick” \(\nabla _{\varvec{\psi }}p=p\nabla _{\varvec{\psi }}\log p\); it sums the gradients of all actions and scales them with the accumulated loss of all initial subsequences. Baseline functions (Greensmith et al. 2004) reflect the intuition that \(a_T\) is not responsible for losses incurred prior to T; moreover, relating the loss to the expected loss of all sequences that contain \(a_T\) reflects the merit of \(a_T\) better. Equation 20 defines the policy gradient for an action sequence sampled according to \(p(a_{1\dots {{\bar{T}}}}|{\varvec{\psi }},Y_0(\mathbf{{x}}))\), modified by baseline function B.
$$\begin{aligned}&{E_{\ell ,B,\tau }(a_{1\dots {\bar{T}}},Y_0(\mathbf{{x}});{\varvec{\psi }},{\varvec{\phi }})}\nonumber \\&=\sum _{T=1}^{\bar{T}}\nabla _{\varvec{\psi }}\log \pi _{\varvec{\psi }}(a_{T}|\mathbf{{x}},a_{T-1}(\dots (a_1(Y_0(\mathbf{{x}})))\dots )) \nonumber \\&\quad \bigg (\sum _{t=T}^{\bar{T}}p(t|\tau ) \ell (\mathbf{{x}},\mathbf{{y}},a_t(\dots (a_1(Y_0(\mathbf{{x}})))\dots );{\varvec{\phi }}) - B(a_{1\dots T-1},Y_0(\mathbf{{x}});{\varvec{\psi }},{\varvec{\phi }},\mathbf{{x}})\bigg ) \end{aligned}$$
(20)
Lemma 1
(General gradient) Let \(V_{{\varvec{\phi }},{\varvec{\psi }},\tau }(\mathbf{{x}},\mathbf{{y}})\) be defined as in Eq. 17 for a differentiable loss function \(\ell \). Let \(D_{\ell ,\tau }\) and \(E_{\ell ,B,\tau }\) be defined in Eqs. 18 and 20 for any scalar baseline function \(B(a_{1\dots T},Y_0(\mathbf{{x}});{\varvec{\psi }},{\varvec{\phi }},\mathbf{{x}})\). Then the gradient of \(V_{{\varvec{\phi }},{\varvec{\psi }},\tau }(\mathbf{{x}},\mathbf{{y}})\) is
$$\begin{aligned} \nabla _{{\varvec{\phi }},{\varvec{\psi }}}V_{{\varvec{\phi }},{\varvec{\psi }},\tau }(\mathbf{{x}},\mathbf{{y}})= & {} \sum _{a_{1\dots {\bar{T}}}} p(a_{1\dots {\bar{T}}}|{\varvec{\psi }},Y_0(\mathbf{{x}})) \nonumber \\&\times \, \left[ E_{\ell ,B,\tau }(a_{1\dots {\bar{T}}},Y_0(\mathbf{{x}});{\varvec{\psi }},{\varvec{\phi }})^\top , D_{\ell ,\tau }(a_{1\dots {\bar{T}}},Y_0(\mathbf{{x}});{\varvec{\phi }})^\top \right] ^\top \end{aligned}$$
(21)
Proof
The gradient stacks the partial gradients of \({\varvec{\psi }}\) and \({\varvec{\phi }}\) above each other. The partial gradient \(\nabla _{{\varvec{\phi }}}V_{{\varvec{\phi }},{\varvec{\psi }},\tau }(\mathbf{{x}},\mathbf{{y}}) = \sum _{a_{1\dots {\bar{T}}}} p(a_{1\dots {\bar{T}}}|{\varvec{\psi }},Y_0(\mathbf{{x}})) D_{\ell ,\tau }(a_{1\dots {\bar{T}}},Y_0(\mathbf{{x}});{\varvec{\phi }})\) follows from the definition of \(D_{\ell ,\tau }\) in Eq. 18. The partial gradient \(\nabla _{{\varvec{\psi }}}V_{{\varvec{\phi }},{\varvec{\psi }},\tau }(\mathbf{{x}},\mathbf{{y}}) = \sum _{a_{1\dots {\bar{T}}}} p(a_{1\dots {\bar{T}}}|{\varvec{\psi }},Y_0(\mathbf{{x}})) E_{\ell ,B,\tau } (a_{1\dots {\bar{T}}},Y_0(\mathbf{{x}});{\varvec{\psi }},{\varvec{\phi }})\) is a direct application of the Policy Gradient Theorem (Sutton et al. 2000; Peters and Schaal 2008) for episodic processes.
The choice of a baseline function B influences the variance of the sampling process, but not the gradient; a lower variance means faster convergence. Let \(E_{\ell ,B,\tau ,T}\) be the summand of Eq. 20 for a given value of T. The variance \(\mathbb {E}[(E_{\ell ,B,\tau ,T}(a_{1\dots {\bar{T}}},Y_0(\mathbf{{x}});{\varvec{\psi }},{\varvec{\phi }})-\mathbb {E}[E_{\ell ,B,\tau ,T}(a_{1\dots {\bar{T}}},Y_0(\mathbf{{x}});{\varvec{\psi }},{\varvec{\phi }})|a_{1..T}])^2|a_{1..T}]\) is minimized by the baseline that weights the loss of all sequences starting in \(a_T(\dots (a_1(Y_0(\mathbf{{x}})))\dots )\) by the squared gradient (Greensmith et al. 2004):
$$\begin{aligned} B_{\mathrm {G}}(a_{1\dots T},Y_0(\mathbf{{x}});{\varvec{\psi }},{\varvec{\phi }},\mathbf{{x}}) = {\frac{\sum _{a_{T+1}} G(a_{1..T+1},Y_0)^2 Q(a_{1..T+1},Y_0) }{\sum _{a_{T+1}} G(a_{1..T+1},Y_0)^2}}\end{aligned}$$
(22)
$$\begin{aligned} {\hbox {with } Q(a_{1..T+1},Y_0)=\mathop {\mathbb {E}}\limits _{a_{T+2\ldots {\bar{T}}}}} {\left[ \sum _{t=T+1}^{{\bar{T}}}p(t|\tau )\ell (\mathbf{{x}},\mathbf{{y}},a_t(\dots (Y_0(\mathbf{{x}}))\dots );{\varvec{\phi }})\bigg |a_{1..T+1}\right] }\end{aligned}$$
(23)
$$\begin{aligned} {\hbox {and }G(a_{1..T+1},Y_0) =} { \nabla \log \pi _\psi (a_{T+1}|\mathbf{{x}},a_{T}(\dots (a_1(Y_0(\mathbf{{x}})))\dots )).} \end{aligned}$$
(24)
This baseline function is intractable because it averages the loss of all action sequences that start in state \(Y_T(\mathbf{{x}})=a_T(\dots a_1(Y_0(\mathbf{{x}}))\dots )\), weighted by the squared length of the gradient of their first action \(a_{T+1}\). Instead, the assumption that the expected loss of all sequences starting at T is half the loss of state \(Y_T(\mathbf{{x}})\) gives the approximation:
$$\begin{aligned} B_{\mathrm {HL}}(a_{1\dots T},Y_0(\mathbf{{x}});{\varvec{\psi }},{\varvec{\phi }},\mathbf{{x}})=\frac{1}{2}\sum \nolimits _{t=T+1}^{\bar{T}}p(t|\tau ) \ell (\mathbf{{x}},\mathbf{{y}},a_T(\dots (a_1(Y_0(\mathbf{{x}})))\dots );{\varvec{\phi }}). \end{aligned}$$
(25)
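For a sampled action sequence, \(B_{\mathrm {HL}}\) is cheap to compute once the loss of state \(Y_T(\mathbf{{x}})\) is known; a minimal sketch, with illustrative index conventions:

```python
def baseline_hl(loss_YT, p_tau, T, T_bar):
    """B_HL of Eq. 25: half the loss of state Y_T(x), weighted by the
    probabilities of the remaining horizons. `loss_YT` is l(x, y, Y_T(x); phi);
    `p_tau[t]` holds p(t | tau) for t = 1..T_bar (index 0 unused)."""
    return 0.5 * loss_YT * sum(p_tau[t] for t in range(T + 1, T_bar + 1))
```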
We will refer to the policy-gradient method with baseline function \(B_{\mathrm {HL}}\) as online policy gradient with baseline. Note that inserting baseline function
$$\begin{aligned} B_{\mathrm {R}}(a_{1\dots T},Y_0(\mathbf{{x}});{\varvec{\psi }},{\varvec{\phi }},\mathbf{{x}}) = -\sum _{t=1}^{T} p(t|\tau ) \ell (\mathbf{{x}},\mathbf{{y}},a_t(\dots (a_1(Y_0(\mathbf{{x}})))\dots );{\varvec{\phi }}) \end{aligned}$$
(26)
into Eq. 20 resolves each summand of Eq. 21 to Eq. 19, the unmodified policy gradient for \(a_{1\dots {\bar{T}}}\). We will refer to the online policy-gradient method with baseline function \(B_{\mathrm {R}}\) as online policy gradient without baseline. Algorithm 1 shows the online policy-gradient learning algorithm; a schematic sketch of one iteration follows Eq. 27. It optimizes parameters \({\varvec{\psi }}\) and \({\varvec{\phi }}\) using a stochastic gradient, sampling action sequences from the stochastic policy instead of evaluating the intractable sum over all action sequences in Eq. 21. Theorem 1 proves its convergence under a number of conditions. The step size parameters \(\alpha (i)\) have to satisfy
$$\begin{aligned} \sum \limits _{i=0}^\infty \alpha (i) = \infty \;,\; \sum \limits _{i=0}^\infty \alpha (i)^2 < \infty . \end{aligned}$$
(27)
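For instance, \(\alpha (i)=\alpha _0/(i+1)\) with some \(\alpha _0>0\) satisfies both conditions. Since Algorithm 1 is not reproduced here, the following is a minimal sketch of what one of its stochastic-gradient iterations might look like; all callables and index conventions are illustrative placeholders, not the exact procedure of Algorithm 1.

```python
import numpy as np

def policy_gradient_step(phi, psi, x, y, Y0, T_bar, alpha, p_tau,
                         sample_action, grad_log_pi, loss, grad_loss_phi,
                         gamma1, gamma2):
    """One stochastic-gradient iteration in the spirit of Algorithm 1: sample an
    action sequence a_1..a_T_bar, accumulate D (Eq. 18) and E (Eq. 20) with a
    B_HL-style baseline (Eq. 25), and take a regularized descent step.
    All callables are illustrative placeholders; `p_tau[t]` holds p(t | tau)
    for t = 1..T_bar (index 0 unused)."""
    Y, states, glogs = list(Y0), [], []
    for _ in range(T_bar):                         # roll out a_1..a_T_bar ~ pi_psi
        a, y_new = sample_action(psi, x, Y)        # stochastic policy step
        glogs.append(grad_log_pi(psi, x, Y, a))    # grad_psi log pi(a_t | x, Y_{t-1})
        Y = Y + [y_new]                            # successor state Y_t(x)
        states.append(Y)
    losses = [loss(x, y, Y_t, phi) for Y_t in states]
    # D: partial gradient with respect to phi (Eq. 18)
    D = sum(p_tau[t + 1] * grad_loss_phi(x, y, states[t], phi) for t in range(T_bar))
    # E: policy gradient with respect to psi (Eq. 20) with baseline B_HL (Eq. 25)
    E = np.zeros_like(psi)
    for T in range(T_bar):
        tail = sum(p_tau[t + 1] * losses[t] for t in range(T, T_bar))
        b_hl = 0.5 * losses[T] * sum(p_tau[t + 1] for t in range(T + 1, T_bar))
        E += glogs[T] * (tail - b_hl)
    # descend the stacked gradient of Eq. 21 plus the regularizer gradients
    psi = psi - alpha * (E + 2.0 * gamma2 * psi)
    phi = phi - alpha * (D + 2.0 * gamma1 * phi)
    return phi, psi
```

Replacing the `b_hl` term by \(B_{\mathrm {R}}\) of Eq. 26 would yield the variant without baseline.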
Loss function \(\ell \) is required to be bounded. This can be achieved by constructing the loss function such that for large values it smoothly approaches some arbitrarily high ceiling C. In our case study, however, we did not observe any case in which the algorithm failed to converge for unbounded loss functions. Baseline function B is required to be differentiable and bounded for the next theorem; however, its gradient never has to be computed in the algorithm. All baseline functions that are considered in Sect. 7 meet this demand.
Theorem 1
(Convergence of Algorithm 1) Let the stochastic policy \(\pi _{\varvec{\psi }}\) be twice differentiable, let both \(\pi _{\varvec{\psi }}\) and \(\nabla _{\varvec{\psi }}\pi _{\varvec{\psi }}\) be Lipschitz continuous, and let \(\nabla _{\varvec{\psi }}\log \pi _{\varvec{\psi }}\) be bounded. Let step size parameters \(\alpha (i)\) satisfy Eq. 27. Let loss function \(\ell \) be differentiable in \({\varvec{\phi }}\) and both \(\ell \) and \(\nabla _{\varvec{\phi }}\ell \) be Lipschitz continuous. Let \(\ell \) be bounded. Let B be differentiable and both B and \(\nabla _{{\varvec{\phi }},{\varvec{\psi }}}B\) be bounded. Let \({\varOmega }_{\varvec{\phi }}=\gamma _1\Vert {\varvec{\phi }}\Vert ^2\), \({\varOmega }_{\varvec{\psi }}=\gamma _2\Vert {\varvec{\psi }}\Vert ^2\). Then, Algorithm 1 converges with probability 1.
Proof
Due to space limitations and to improve readability, we omit dependencies on \(\mathbf{{x}}\) and \(Y_0(\mathbf{{x}})\) from the notation throughout the proof when they are clear from context. For example, we use \(p(a_{1..{\bar{T}}}|{\varvec{\psi }})\) instead of \(p(a_{1..{\bar{T}}}|{\varvec{\psi }},Y_0(\mathbf{{x}}))\). We use Theorem 2 from Chap. 2 and Theorem 7 from Chap. 3 of Borkar (2008) to prove convergence. We first show that the full negative gradient \(-\sum _{(\mathbf{{x}},\mathbf{{y}})}\nabla _{{\varvec{\psi }},{\varvec{\phi }}}V_{{\varvec{\psi }}_i,{\varvec{\phi }}_i,\tau }(\mathbf{{x}},\mathbf{{y}}) - [\gamma _2 {\varvec{\psi }}_i^\top , \gamma _1 {\varvec{\phi }}_i^\top ]^\top \) is Lipschitz continuous.
Let \(L(a_{T..{\bar{T}}},{\varvec{\phi }})=\sum _{t=T}^{\bar{T}}p(t) \ell (\mathbf{{x}},\mathbf{{y}},a_t(..Y_0(\mathbf{{x}})..);{\varvec{\phi }})\). We proceed by showing that \(p(a_{1..{\bar{T}}}|{\varvec{\psi }})E_{\ell ,B,\tau }(a_{1..{\bar{T}}};{\varvec{\psi }},{\varvec{\phi }})= \sum _{T=1}^{\bar{T}}\bigl (p(a_{1..{\bar{T}}}|{\varvec{\psi }}) \nabla _{{\varvec{\psi }}}\log \pi _{\varvec{\psi }}(a_T|{\varvec{\psi }}) (L(a_{T..{\bar{T}}},{\varvec{\phi }}) -B(a_{1..T-1},{\varvec{\psi }},{\varvec{\phi }}))\bigr )\) is Lipschitz in \([{\varvec{\psi }}^\top ,{\varvec{\phi }}^\top ]^\top \). It is differentiable in \([{\varvec{\psi }}^\top ,{\varvec{\phi }}^\top ]^\top \) by definition, and it suffices to show that the derivative is bounded. By the product rule,
$$\begin{aligned}&{\nabla _{{\varvec{\psi }},{\varvec{\phi }}}(p(a_{1..{\bar{T}}}|{\varvec{\psi }})\nabla _{{\varvec{\psi }}}\log \pi _{\varvec{\psi }}(a_T|{\varvec{\psi }})(L(a_{T..{\bar{T}}},{\varvec{\phi }})-B(a_{1..T-1},{\varvec{\psi }},{\varvec{\phi }})))}\nonumber \\&\quad =\nabla _{{\varvec{\psi }},{\varvec{\phi }}}(p(a_{1..{\bar{T}}}|{\varvec{\psi }})\nabla _{{\varvec{\psi }}}\log \pi _{\varvec{\psi }}(a_T|{\varvec{\psi }})) (L(a_{T..{\bar{T}}},{\varvec{\phi }})-B(a_{1..T-1},{\varvec{\psi }},{\varvec{\phi }})) \nonumber \\&\quad + p(a_{1..{\bar{T}}}|{\varvec{\psi }})\nabla _{{\varvec{\psi }}}\log \pi _{\varvec{\psi }}(a_T|{\varvec{\psi }})\nabla _{{\varvec{\psi }},{\varvec{\phi }}}(L(a_{T..{\bar{T}}},{\varvec{\phi }})-B(a_{1..T-1},{\varvec{\psi }},{\varvec{\phi }})). \end{aligned}$$
(28)
We can see that the second summand (the last line of Eq. 28) is bounded because \(p,\nabla _{{\varvec{\psi }}}\log \pi _\psi ,\nabla _{\varvec{\phi }}L\) and \(\nabla _{{\varvec{\phi }},{\varvec{\psi }}}B\) are bounded by definition and products of bounded functions are bounded. Regarding the first summand, L and B are bounded by definition; it remains to show that \(\nabla _{{\varvec{\psi }},{\varvec{\phi }}}(p(a_{1..{\bar{T}}}|{\varvec{\psi }})\nabla _{{\varvec{\psi }}}\log \pi _{\varvec{\psi }}(a_T|{\varvec{\psi }}))\) is bounded. Without loss of generality, let \(T=1\).
$$\begin{aligned}&{\nabla _{{\varvec{\psi }}}(p(a_{1..{\bar{T}}}|{\varvec{\psi }})\nabla _{{\varvec{\psi }}}\log \pi _{\varvec{\psi }}(a_1|{\varvec{\psi }}))} \nonumber \\&= \nabla _{{\varvec{\psi }}} (\nabla _{{\varvec{\psi }}}\pi _{\varvec{\psi }}(a_1|{\varvec{\psi }}) p(a_{2..{\bar{T}}}|a_1,{\varvec{\psi }})) \end{aligned}$$
(29)
$$\begin{aligned}&= p(a_{2..{\bar{T}}}|a_1,{\varvec{\psi }}) \nabla _{{\varvec{\psi }}}\nabla _{{\varvec{\psi }}}\pi _{\varvec{\psi }}(a_1|{\varvec{\psi }}) + \nabla _{{\varvec{\psi }}}\pi _{\varvec{\psi }}(a_1|{\varvec{\psi }}) \nabla _{{\varvec{\psi }}} p(a_{2..{\bar{T}}}|a_1,{\varvec{\psi }}) \end{aligned}$$
(30)
Equation 29 follows from \(p\,\nabla _{\varvec{\psi }}\log p = \nabla _{\varvec{\psi }}p\). The left summand of Eq. 30 is bounded because both p and \(\nabla _{\varvec{\psi }}\nabla _{\varvec{\psi }}\pi _{\varvec{\psi }}\) are bounded by definition. Furthermore, \(\nabla _{{\varvec{\psi }}}p(a_{2..{\bar{T}}}|{\varvec{\psi }})=\nabla _{\varvec{\psi }}\pi _{\varvec{\psi }}(a_2) p(a_{3..{\bar{T}}}|{\varvec{\psi }}) + \pi _{\varvec{\psi }}(a_2) \nabla _{\varvec{\psi }}p(a_{3..{\bar{T}}}|{\varvec{\psi }})\) is bounded because \(\nabla _{\varvec{\psi }}\pi _{\varvec{\psi }}(a_t)\) and \(p(a_{t..{\bar{T}}}|{\varvec{\psi }})\) are bounded for all t and we can expand \(\nabla _{\varvec{\psi }}p(a_{3..{\bar{T}}}|{\varvec{\psi }})\) recursively. From this it follows that the right summand of Eq. 30 is bounded as well. Thus we have shown the above claim.
\(p(a_{1..{\bar{T}}}|{\varvec{\psi }}) D_{\ell ,\tau }(a_{1..{\bar{T}}};{\varvec{\phi }})\) is Lipschitz because \(p(a_{1..{\bar{T}}}|{\varvec{\psi }})\) is Lipschitz and bounded and \(D_{\ell ,\tau }\) is a sum of bounded Lipschitz functions; the product of two bounded Lipschitz functions is again bounded and Lipschitz. \([\gamma _2{\varvec{\psi }}^\top ,\gamma _1{\varvec{\phi }}^\top ]^\top \) is obviously Lipschitz as well, which concludes the considerations regarding the full negative gradient.
Let \(M_{i+1}=[E_{\ell ,B,\tau }(a_{1..{\bar{T}}};{\varvec{\psi }}_i,{\varvec{\phi }}_i)^\top , D_{\ell ,\tau }(a_{1..{\bar{T}}};{\varvec{\phi }}_i)^\top ]^\top - \sum _{(\mathbf{{x}},\mathbf{{y}})}\nabla _{{\varvec{\psi }},{\varvec{\phi }}}V_{{\varvec{\psi }}_i,{\varvec{\phi }}_i,\tau }(\mathbf{{x}},\mathbf{{y}})\), where \(E_{\ell ,B,\tau }(a_{1..{\bar{T}}};{\varvec{\psi }}_i,{\varvec{\phi }}_i)\) and \(D_{\ell ,\tau }(a_{1..{\bar{T}}};{\varvec{\phi }}_i)\) are samples as computed by Algorithm 1. We show that \(\{M_i\}\) is a martingale difference sequence with respect to the increasing family of \(\sigma \)-fields \(\mathcal {F}_i=\sigma ([{\varvec{\phi }}_{0}^\top ,{\varvec{\psi }}_{0}^\top ]^\top ,M_1,\dots ,M_i),i\ge 0\). That is, \(\forall i\in \mathbb {N}\), \(\mathbb {E}[M_{i+1}|\mathcal {F}_i]=0\)
almost surely, and \(\{M_i\}\) are square-integrable with \(\mathbb {E}[\Vert M_{i+1}\Vert ^2|\mathcal {F}_i]\le K(1+\Vert [{\varvec{\phi }}_i^\top ,{\varvec{\psi }}_i^\top ]^\top \Vert ^2)\)
almost surely, for some \(K>0\).
\(\mathbb {E}[M_{i+1}|\mathcal {F}_i]=0\) is given by the definition of \(M_{i+1}\) above. We have to show \(\mathbb {E}[\Vert M_{i+1}\Vert ^2|\mathcal {F}_i]\le K(1+\Vert [{\varvec{\phi }}_i^\top ,{\varvec{\psi }}_i^\top ]^\top \Vert ^2)\) for some K. We proceed by showing that for each \((\mathbf{{x}},\mathbf{{y}},a_{1..{\bar{T}}})\) it holds that \(\Vert [E_{\ell ,B,\tau }(a_{1..{\bar{T}}};{\varvec{\psi }}_i,{\varvec{\phi }}_i)^\top ,D_{\ell ,\tau }(a_{1..{\bar{T}}};{\varvec{\phi }}_i)^\top ]^\top \Vert ^2 \le K(1+\Vert [{\varvec{\phi }}_i^\top ,{\varvec{\psi }}_i^\top ]^\top \Vert ^2)\). From that it follows that
$$\begin{aligned}&\Vert \sum _{\mathbf{{x}},\mathbf{{y}}}\sum _{a_{1..{\bar{T}}}}p(a_{1..{\bar{T}}}|{\varvec{\psi }}) [E_{\ell ,B,\tau }(a_{1..{\bar{T}}};{\varvec{\psi }},{\varvec{\phi }})^\top , D_{\ell ,\tau }(a_{1..{\bar{T}}};{\varvec{\phi }})^\top ]^\top \Vert ^2\\&\quad \le K(1+\Vert [{\varvec{\phi }}_i^\top ,{\varvec{\psi }}_i^\top ]^\top \Vert ^2) \end{aligned}$$
and \(\Vert M_{i+1}\Vert ^2\le 4K(1+\Vert [{\varvec{\phi }}_i^\top ,{\varvec{\psi }}_i^\top ]^\top \Vert ^2)\) which proves the claim.
Regarding \(E_{\ell ,B,\tau }\), \(\Vert \nabla _{\varvec{\psi }}\log \pi _{\varvec{\psi }}\Vert ^2\) is bounded by some \(K''\) per assumption, and it follows that \(\Vert \sum _{T=1}^{\bar{T}}\nabla _{\varvec{\psi }}\log \pi _{\varvec{\psi }}(a_T|\mathbf{{x}},a_{T-1}(..Y_0(\mathbf{{x}})..))\Vert ^2\) is bounded by \({\bar{T}}^2 K''\). Furthermore, \(\Vert \ell (\mathbf{{x}},\mathbf{{y}},a_{T}(..Y_0(\mathbf{{x}})..); {\varvec{\phi }})\Vert ^2\le K'(1+\Vert {\varvec{\phi }}\Vert ^2)\) and B is bounded per assumption; thus \(\sum _{t=T}^{\bar{T}}p(t) \ell (\mathbf{{x}},\mathbf{{y}},a_{t}(..Y_0(\mathbf{{x}})..);{\varvec{\phi }}) - B(a_{1..T-1};{\varvec{\phi }},{\varvec{\psi }}) \le {\bar{K}}'(1+\Vert {\varvec{\phi }}\Vert ^2)\) for some \({\bar{K}}'\). It follows that \(\Vert E_{\ell ,B,\tau }(a_{1..{\bar{T}}}; {\varvec{\psi }}_i,{\varvec{\phi }}_i)\Vert ^2\le {\bar{T}}^2K''{\bar{K}}'(1+\Vert {\varvec{\phi }}\Vert ^2)\). As \(\nabla _{\varvec{\phi }}\ell (\mathbf{{x}},\mathbf{{y}},a_{T}(..Y_0(\mathbf{{x}})..);{\varvec{\phi }})\) is bounded per assumption, \(\Vert D_{\ell ,\tau }\Vert ^2\le K'''\) for some \(K'''>0\). The claim follows: \(\Vert [E_{\ell ,B,\tau }(a_{1..{\bar{T}}};{\varvec{\psi }}_i,{\varvec{\phi }}_i)^\top , D_{\ell ,\tau }(a_{1..{\bar{T}}}; {\varvec{\phi }}_i)^\top ]^\top \Vert ^2 = \Vert E_{\ell ,B,\tau }(a_{1..{\bar{T}}};{\varvec{\psi }}_i,{\varvec{\phi }}_i)\Vert ^2 + \Vert D_{\ell ,\tau }(a_{1..{\bar{T}}};{\varvec{\phi }}_i)\Vert ^2 \le K''' + {\bar{T}}^2K''\bar{K}'(1+\Vert {\varvec{\phi }}\Vert ^2) \le K''' + {\bar{T}}^2K''{\bar{K}}'(1+\Vert [{\varvec{\psi }}^\top ,{\varvec{\phi }}^\top ]^\top \Vert ^2)\).
We can now use Theorem 2 from Chap. 2 of Borkar (2008) to prove convergence by identifying the function \(h([{\varvec{\phi }}_i^\top ,{\varvec{\psi }}_i^\top ]^\top )\) in the assumptions of that theorem with the full negative gradient \(-\sum _{(\mathbf{{x}},\mathbf{{y}})}\nabla _{{\varvec{\psi }},{\varvec{\phi }}}V_{{\varvec{\psi }}_i,{\varvec{\phi }}_i,\tau }(\mathbf{{x}},\mathbf{{y}}) - \left[ \gamma _2 {\varvec{\psi }}_i^\top , \gamma _1 {\varvec{\phi }}_i^\top \right] ^\top \). The theorem states that the algorithm converges with probability 1 if the iterates \([{\varvec{\phi }}_{i+1}^\top ,{\varvec{\psi }}_{i+1}^\top ]^\top \) stay bounded.
Now, let \(h_r(\xi )=h(r\xi )/r\). We show that \(\lim _{r\rightarrow \infty }h_r(\xi )=h_\infty (\xi )\) exists and that the origin in \(\mathbb {R}^{m_1+m_2}\) is an asymptotically stable equilibrium for the o.d.e. \(\dot{\xi }(t)=h_\infty (\xi (t))\); this is assumption (A4) of Borkar (2008). With this, Theorem 7 from Chap. 3 of Borkar (2008)—originally from Borkar and Meyn (2000)—states that the iterates stay bounded and, consequently, Algorithm 1 converges. We now show that h meets (A4):
$$\begin{aligned}&{h_r({\varvec{\phi }}, {\varvec{\psi }})} \\&= -\frac{1}{r}\sum _{\mathbf{{x}},\mathbf{{y}}}\sum _{a_{1..{\bar{T}}}}p(a_{1..{\bar{T}}}|r{\varvec{\psi }}) \left[ E_{\ell ,B,\tau }(a_{1..{\bar{T}}};r{\varvec{\psi }},r{\varvec{\phi }})^\top , D_{\ell ,\tau }(a_{1..{\bar{T}}};r{\varvec{\phi }})^\top \right] ^\top \\&\quad - \frac{1}{r} \big [\gamma _2 r {\varvec{\psi }}^\top , \gamma _1 r {\varvec{\phi }}^\top \big ]^\top \nonumber \\&=-\sum _{\mathbf{{x}},\mathbf{{y}}}\sum _{a_{1..{\bar{T}}}}p(a_{1..{\bar{T}}}|r{\varvec{\psi }}) \Big [ \sum _{T=1}^{\bar{T}}\nabla _{\varvec{\psi }}\log \pi _{\varvec{\psi }}(a_{T}|r{\varvec{\psi }})^\top (L(a_{T..{\bar{T}}},r{\varvec{\phi }})-B(a_{1..T-1}))/r ,\nonumber \\&\qquad D_{\ell ,\tau }(a_{1..{\bar{T}}};r{\varvec{\phi }})^\top /r \Big ]^\top - \big [\gamma _2 {\varvec{\psi }}^\top , \gamma _1 {\varvec{\phi }}^\top \big ]^\top , \end{aligned}$$
\(\nabla _{\varvec{\psi }}\pi _{\varvec{\psi }}\), L and B are all bounded; since \(p\,\nabla _{\varvec{\psi }}\log \pi _{\varvec{\psi }}\) equals \(\nabla _{\varvec{\psi }}\pi _{\varvec{\psi }}\) times a product of bounded factors, it follows that \(p(a_{1..{\bar{T}}}|r{\varvec{\psi }}) \sum _{T=1}^{\bar{T}}\nabla _{\varvec{\psi }}\log \pi _{\varvec{\psi }}(a_{T}|r{\varvec{\psi }})^\top (L(a_{T..{\bar{T}}},r{\varvec{\phi }})-B(a_{1..T-1}))/r\rightarrow 0\) as \(r\rightarrow \infty \). The same holds for the second component because \(p(a_{1..{\bar{T}}}|r{\varvec{\psi }})\) and \(D_{\ell ,\tau }(a_{1..{\bar{T}}};r{\varvec{\phi }})\) are bounded. It follows that \(h_\infty ([{\varvec{\psi }}^\top ,{\varvec{\phi }}^\top ]^\top ) = -\big [\gamma _2 {\varvec{\psi }}^\top , \gamma _1 {\varvec{\phi }}^\top \big ]^\top \). Therefore, the ordinary differential equation \(\dot{\xi }(t)=h_\infty (\xi (t))\) has an asymptotically stable equilibrium at the origin, which shows that (A4) is valid.