Basic ideas
Suppose we consider a model with an n-dimensional parameter vector \(\varvec{\theta }:=\left( \theta _{0},\dots ,\theta _{n-1}\right) ^{\top }\) and a twice continuously differentiable log-likelihood function \(\ell \). Assume without loss of generality that we seek to construct a level-\(\alpha \) confidence interval for the parameter \(\theta _{0}\), and let \(\widetilde{\varvec{\theta }}:=\left( \theta _{1},\dots ,\theta _{n-1}\right) ^{\top }\) be the vector of all remaining parameters, called nuisance parameters. For convenience, we may write \(\ell =\ell \mathopen {\left( \varvec{\theta }\right) }\mathclose {}\mathclose {}\) as a function of the complete parameter vector or \(\ell =\ell \left( \theta _{0},\widetilde{\varvec{\theta }}\right) \) as a function of the parameter of interest and the nuisance parameters.
The algorithm RVM introduced in this paper searches for the right end point \(\theta _{0}^{\max }\) (equation (4)) of the confidence interval \(\mathinner {I}\). The left end point can be identified with the same approach by considering a modified model in which \(\ell \) is mirrored in \(\theta _{0}\). As RVM builds on the method by Venzon and Moolgavkar (1988), we start by recapitulating their algorithm VM below.
Let \(\varvec{\theta }^{*}\in \varTheta \) be the parameter vector at which the parameter of interest is maximal, \(\theta _{0}^{*}=\theta _{0}^{\max }\), and \(\ell \mathopen {\left( \varvec{\theta }^{*}\right) }\mathclose {}\mathclose {}\ge \ell ^{*}\). Venzon and Moolgavkar (1988) note that \(\varvec{\theta }^{*}\) satisfies the following necessary conditions:
1. \(\ell \mathopen {\left( \varvec{\theta }^{*}\right) }\mathclose {}\mathclose {}=\ell ^{*}\) and
2. \(\ell \) attains a local maximum with respect to the nuisance parameters, which implies \(\frac{\partial \ell }{\partial \widetilde{{\varvec{\theta }}}} ({\varvec{\theta }}^{*})=0\).
The algorithm VM searches for \(\varvec{\theta }^{*}\) by minimizing both the distance of the log-likelihood to the threshold, \(\left| \ell (\varvec{\theta })-\ell ^{*}\right| \), and the magnitude of the gradient with respect to the nuisance parameters, \(|\frac{\partial \ell }{\partial \widetilde{{\varvec{\theta }}}}|\). To this end, the algorithm repeatedly approximates the log-likelihood surface \(\ell \) with second-order Taylor expansions \(\hat{\ell }\). If \(\varvec{\theta }^{(i)}\) is the parameter vector in the \(i{\text {th}}\) iteration of the algorithm, expanding \(\ell \) around \(\varvec{\theta }^{(i)}\) yields
$$\begin{aligned} \hat{\ell }(\varvec{\theta })&:= \ell \mathopen {\left( \varvec{\theta }^{(i)}\right) }\mathclose {}\mathclose {}+\varvec{g}^{\top }\left( \varvec{\theta }-\varvec{\theta }^{(i)}\right) \nonumber \\&\quad +\,\frac{1}{2}\left( \varvec{\theta }-\varvec{\theta }^{(i)}\right) ^{\top }\underline{\varvec{\mathrm {H}}}\left( \varvec{\theta }-\varvec{\theta }^{(i)}\right) \nonumber \\&=\bar{\ell }+\widetilde{\varvec{g}}^{\top }\widetilde{\varvec{\delta }}+g_{0}\delta _{0}+\frac{1}{2}\widetilde{\varvec{\delta }}^{\top }\widetilde{\underline{\varvec{\mathrm {H}}}}\widetilde{\varvec{\delta }}+\delta _{0}\widetilde{\varvec{\mathrm {H}}}_{0}^{\top }\widetilde{\varvec{\delta }}+\frac{1}{2}\delta _{0}\mathrm {H}_{00}\delta _{0}\nonumber \\&=: \hat{\ell }^{\delta }\left( \delta _{0},\widetilde{\varvec{\delta }}\right) . \end{aligned}$$
(5)
Here, \(\varvec{\delta }:=\varvec{\theta }-\varvec{\theta }^{(i)}\), \(\bar{\ell }:=\ell \mathopen {\left( \varvec{\theta }^{(i)}\right) }\mathclose {}\mathclose {}\); \(\varvec{g}:=\frac{\partial \ell }{\partial \varvec{\theta }}\mathopen {\left( \varvec{\theta }^{(i)}\right) }\mathclose {}\mathclose {}\) is the gradient and \(\underline{\varvec{\mathrm {H}}}:=\frac{\partial ^{2}\ell }{\partial {\varvec{\theta }}^{2}}\mathopen {\left( \varvec{\theta }^{(i)}\right) }\mathclose {}\mathclose {}\) the Hessian matrix of \(\ell \) at \(\varvec{\theta }^{(i)}\). Analogously to notation used above, we split \(\varvec{\delta }\) into its first entry \(\delta _{0}\) and the remainder \(\widetilde{\varvec{\delta }}\), \(\varvec{g}\) into \(g_{0}\) and \(\widetilde{\varvec{g}}\), and write \(\varvec{\mathrm {\mathrm {H}}}_{0}\) for the first column of \(\underline{\varvec{\mathrm {H}}}\), \(\widetilde{\underline{\varvec{\mathrm {H}}}}\) for \(\underline{\varvec{\mathrm {H}}}\) without its first column and row, and split \(\varvec{\mathrm {\mathrm {H}}}_{0}\) into \(\mathrm {H}_{00}\) and \(\widetilde{\varvec{\mathrm {\mathrm {H}}}}_{0}\).
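To fix this notation in code, the block splitting of the gradient and Hessian and the resulting expansion (5) can be sketched as follows (a hypothetical helper, not part of any reference implementation; `H` is assumed symmetric):

```python
import numpy as np

def split_blocks(g, H):
    """Split gradient g and Hessian H into the blocks of Eq. (5):
    (g0, g~) and (H00, H~0, H~), index 0 being the parameter of interest."""
    return g[0], g[1:], H[0, 0], H[1:, 0], H[1:, 1:]

def ell_hat(delta0, delta_t, ell_bar, g, H):
    """Second-order Taylor expansion (5), evaluated at the step (delta0, delta~)."""
    g0, g_t, H00, H0_t, H_t = split_blocks(g, H)
    return (ell_bar + g_t @ delta_t + g0 * delta0
            + 0.5 * delta_t @ H_t @ delta_t
            + delta0 * (H0_t @ delta_t)
            + 0.5 * H00 * delta0 ** 2)
```

Evaluating at \(\varvec{\delta }=\varvec{0}\) returns \(\bar{\ell }\), and the block form agrees with the full quadratic form \(\bar{\ell }+\varvec{g}^{\top }\varvec{\delta }+\frac{1}{2}\varvec{\delta }^{\top }\underline{\varvec{\mathrm {H}}}\varvec{\delta }\).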
In each iteration, VM seeks \(\delta _{0}^{*}\) and \(\widetilde{\varvec{\delta }^{*}}\) that satisfy conditions 1 and 2. Applying condition 2 to the approximation \(\hat{\ell }^{\delta }\) [Eq. (5)] yields
$$\begin{aligned} \widetilde{\varvec{\delta }^{*}}=-\widetilde{\underline{\varvec{\mathrm {H}}}}^{-1}\left( \widetilde{\varvec{\mathrm {\mathrm {H}}}}_{0}\delta _{0}+\widetilde{\varvec{g}}\right) . \end{aligned}$$
(6)
Inserting (5) and (6) into condition 1 gives us
$$\begin{aligned} \ell ^{*}&= \frac{1}{2}\left( \mathrm {H}_{00}-\widetilde{\varvec{\mathrm {\mathrm {H}}}}_{0}^{\top } \widetilde{\underline{\varvec{\mathrm {H}}}}^{-1}\widetilde{\varvec{\mathrm {\mathrm {H}}}}_{0}\right) \delta _{0}^{*2}\nonumber \\&\quad +\,\left( g_{0}-\widetilde{\varvec{g}}^{\top }\widetilde{\underline{\varvec{\mathrm {H}}}}^{-1} \widetilde{\varvec{\mathrm {\mathrm {H}}}}_{0}\right) \delta _{0}^{*}+\bar{\ell }-\frac{1}{2} \widetilde{\varvec{g}}^{\top }\widetilde{\underline{\varvec{\mathrm {H}}}}^{-1}\widetilde{\varvec{g}}, \end{aligned}$$
(7)
which can be solved for \(\delta _{0}^{*}\) if \(\underline{\varvec{\mathrm {H}}}\) is negative definite. If Eq. (7) has multiple solutions, Venzon and Moolgavkar (1988) choose the one that minimizes \(\varvec{\delta }\) according to some norm. Our algorithm RVM applies a different procedure and chooses the root that minimizes the distance to \(\theta _{0}^{\max }\) without stepping into a region in which the approximation (5) is inaccurate. In Sect. 2.5, we provide further details and discuss the case in which Eq. (7) has no real solutions.
After each iteration, \(\varvec{\theta }\) is updated according to the above results:
$$\begin{aligned} \varvec{\theta }^{(i+1)}=\varvec{\theta }^{(i)}+\varvec{\delta }^{*}. \end{aligned}$$
(8)
If \(\ell \mathopen {\left( \varvec{\theta }^{(i+1)}\right) }\mathclose {}\mathclose {}\approx \ell ^{*}\) and \(\frac{\partial \ell }{\partial \widetilde{{\varvec{\theta }}}} \mathopen {\left( \varvec{\theta }^{(i+1)}\right) }\mathclose {}\mathclose {}\approx 0\) up to the desired precision, the search is terminated and \(\varvec{\theta }^{(i+1)}\) is returned.
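To make the iteration concrete, here is a minimal sketch of one VM step, assuming the nuisance block of the Hessian is invertible and Eq. (7) has real roots; the root choice (smaller magnitude) is a simplification of the norm-based rule, and the function name is ours:

```python
import numpy as np

def vm_step(ell_star, ell_bar, g, H):
    """One Venzon-Moolgavkar iteration: solve Eq. (7) for delta_0 and
    Eq. (6) for the nuisance step, and return the full step vector."""
    g0, g_t = g[0], g[1:]
    H00, H0_t, H_t = H[0, 0], H[1:, 0], H[1:, 1:]
    Hinv_H0 = np.linalg.solve(H_t, H0_t)
    Hinv_g = np.linalg.solve(H_t, g_t)
    a = 0.5 * (H00 - H0_t @ Hinv_H0)              # quadratic coefficient in Eq. (7)
    p = g0 - g_t @ Hinv_H0                        # linear coefficient
    q = ell_bar - 0.5 * g_t @ Hinv_g - ell_star   # constant term
    d0 = min(np.roots([a, p, q]).real, key=abs)   # simplified root choice
    d_t = -np.linalg.solve(H_t, H0_t * d0 + g_t)  # Eq. (6)
    return np.concatenate(([d0], d_t))
```

For an exactly quadratic log-likelihood the approximation (5) is exact, so a single step lands on a point satisfying conditions 1 and 2.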
The need to extend the original algorithm VM outlined above comes from the following issues: (1) The quadratic approximation \(\hat{\ell }\) may be imprecise far from the approximation point. In extreme cases, updating \(\varvec{\theta }\) as suggested could take us farther away from the target \(\varvec{\theta }^{*}\) rather than closer to it. (2) The approximation \(\hat{\ell }\) may be constant in some directions or be unbounded above. In these cases, we may not be able to identify unique solutions for \(\delta _{0}\) and \(\widetilde{\varvec{\delta }}\), and the gradient criterion in condition 2 may not characterize a maximum but a saddle point or a minimum. (3) The limited precision of numerical operations can result in discontinuities corrupting the results of VM and hinder the algorithm from terminating.
To circumvent these problems, we introduce a number of extensions to VM. First, we address the limited precision of the Taylor approximation \(\hat{\ell }\) with a trust region approach (Conn et al. 2000). That is, we constrain our search for \(\varvec{\delta }^{*}\) to a region in which the approximation \(\hat{\ell }\) is sufficiently accurate. Second, we choose some parameters freely if \(\hat{\ell }\) is constant in some directions and solve constrained maximization problems if \(\hat{\ell }\) is not bounded above. In particular, we detect cases in which \(\ell _{\mathrm {PL}}\) approaches an asymptote above \(\ell ^{*}\), which means that \(\theta _{0}\) is not estimable. Lastly, we introduce a method to identify and jump over discontinuities as appropriate. An overview of the algorithm is depicted as flow chart in Fig. 1. Below, we describe each of our extensions in detail.
The trust region
In practice, the quadratic approximation (5) may not be good enough to reach a point close to \(\varvec{\theta }^{*}\) within one step. In fact, since \(\ell \) may be very “non-quadratic”, we might obtain a parameter vector for which \(\ell \) and \(\frac{\partial \ell }{\partial \widetilde{{\varvec{\theta }}}}\) are farther from \(\ell ^{*}\) and \(\varvec{0}\) than in the previous iteration. Therefore, we accept changes in \(\varvec{\theta }\) only if the approximation is sufficiently accurate at the new point.
In each iteration i, we compute the new parameter vector, compare the values of \(\hat{\ell }\) and \(\ell \) at the obtained point \(\varvec{\theta }^{(i)}+\varvec{\delta }^{*}\), and accept the step if, and only if, \(\hat{\ell }\) and \(\ell \) are close together with respect to a given distance measure. If \(\bar{\ell }\) is near the target \(\ell ^{*}\), we may also check the precision of the gradient approximation \(\frac{\partial \hat{\ell }}{\partial \tilde{{\varvec{\theta }}}}\) to enforce timely convergence of the algorithm.
If we reject a step, we decrease the magnitude \(\left| \delta _{0}^{*}\right| \) of the step obtained before, reduce the maximal admissible length r of the nuisance parameter vector, and solve the constrained maximization problem
$$\begin{aligned} \widetilde{\varvec{\delta }^{*}}=\underset{\widetilde{\varvec{\delta }}{:}\,\left| \widetilde{\varvec{\delta }}\right| \le r}{\mathrm {argmax}\,}\hat{\ell }^{\delta }\left( \delta _{0},\widetilde{\varvec{\delta }}\right) . \end{aligned}$$
(9)
As the quadratic subproblem (9) appears in classical trust-region algorithms, efficient solvers are available (Conn et al. 2000) and implemented in optimization software, such as in the Python package Scipy (Jones et al. 2001).
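As an illustration of subproblem (9), the following is a simple eigendecomposition-based sketch (function name ours); production code would use the solvers cited above, and the “hard case” of trust-region theory is not handled here:

```python
import numpy as np

def max_quadratic_in_ball(g, H, r, iters=200):
    """Maximize g@d + 0.5*d@H@d over ||d|| <= r, i.e. subproblem (9),
    via eigendecomposition and bisection on the Lagrange multiplier."""
    w, V = np.linalg.eigh(H)
    gv = V.T @ g
    if w.max() < 0:                        # H negative definite: try interior point
        d = V @ (gv / (-w))                # unconstrained maximizer -H^{-1} g
        if np.linalg.norm(d) <= r:
            return d
    # boundary solution d = (mu*I - H)^{-1} g with mu above H's largest eigenvalue
    norm = lambda mu: np.linalg.norm(gv / (mu - w))
    lo, hi = w.max() + 1e-12, w.max() + max(1.0, np.linalg.norm(g) / r)
    while norm(hi) > r:                    # grow the upper bracket until ||d|| <= r
        hi = w.max() + 2.0 * (hi - w.max())
    for _ in range(iters):                 # bisect: ||d(mu)|| decreases in mu
        mid = 0.5 * (lo + hi)
        lo, hi = (mid, hi) if norm(mid) > r else (lo, mid)
    return V @ (gv / (0.5 * (lo + hi) - w))
```

When the unconstrained maximizer lies outside the ball, the solution is pushed onto the boundary \(\left| \widetilde{\varvec{\delta }}\right| =r\), which is the typical situation after a rejected step.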
We check the accuracy of the approximation at the resulting point \(\varvec{\theta }^{(i)}+\varvec{\delta }^{*}\), decrease the search radius if necessary, and continue with this procedure until the approximation is sufficiently precise. The metric and the tolerance applied to measure the approximation’s precision may depend on how far the current log-likelihood \(\bar{\ell }\) is from the target \(\ell ^{*}\). We suggest suitable precision measures in Sect. 2.8.
Since it is typically computationally expensive to compute the Hessian \(\underline{\varvec{\mathrm {H}}}\), we want to take steps \(\delta _{0}\) that are as large as possible. However, it is also inefficient to adjust the search radius very often to find the maximal admissible \(\delta _{0}^{*}\). Therefore, RVM first attempts the unconstrained step given by Eqs. (6) and (7). If this step is rejected, RVM determines the search radius with a log-scale binary search between the radius of the unconstrained step and the search radius accepted in the previous iteration. If even the latter radius does not lead to a sufficiently precise result, we update \(\delta _{0}^{*}\) and r by factors \(\beta _{0},\beta _{1}\in \left( 0,1\right) \) so that \(\delta _{0}^{*}\leftarrow \beta _{0}\delta _{0}^{*}\) and \(r\leftarrow \beta _{1}r\).
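The log-scale binary search for the radius can be sketched as follows (hypothetical helper; `acceptable` stands in for the precision check of the approximation):

```python
import numpy as np

def find_radius(r_accepted, r_rejected, acceptable, iterations=10):
    """Log-scale binary search for the trust radius between the last accepted
    radius and the (rejected) radius of the unconstrained step."""
    r_lo, r_hi = r_accepted, r_rejected
    for _ in range(iterations):
        r_mid = np.sqrt(r_lo * r_hi)   # geometric mean = midpoint on log scale
        if acceptable(r_mid):
            r_lo = r_mid               # precise enough: try larger radii
        else:
            r_hi = r_mid               # too imprecise: shrink
    return r_lo
```

Searching on a log scale makes the procedure insensitive to the (possibly very different) orders of magnitude of the two bracketing radii.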
Linearly dependent parameters
The right hand side of Eq. (6) is defined only if the nuisance Hessian \(\widetilde{\underline{\varvec{\mathrm {H}}}}\) is invertible. If \(\widetilde{\underline{\varvec{\mathrm {H}}}}\) is singular, the maximum with respect to the nuisance parameters is not uniquely defined or does not exist at all. We will consider the second case in the next section and focus on the first case here.
If \(\hat{\ell }\) has infinitely many maxima in the nuisance parameters, we can choose some nuisance parameters freely and consider a reduced system including only the remaining independent parameters. To that end, we check \(\widetilde{\underline{\varvec{\mathrm {H}}}}\) for linear dependencies at the beginning of each iteration. We are interested in a minimal set S containing indices of rows and columns whose removal from \(\widetilde{\underline{\varvec{\mathrm {H}}}}\) would make the matrix invertible. To compute S, we iteratively determine the ranks of sub-matrices of \(\widetilde{\underline{\varvec{\mathrm {H}}}}\) using singular value decompositions (SVDs). SVDs are a well-known tool to identify the rank of a matrix and have also been applied to determine the number of identifiable parameters in a model (Eubank and Webster 1985; Viallefont et al. 1998).
We proceed as follows: first, we consider one row of \(\widetilde{\underline{\varvec{\mathrm {H}}}}\) and determine its rank. Then, we add a second row, determine the rank of the new matrix, and repeat the procedure until all rows, i.e. the full matrix \(\widetilde{\underline{\varvec{\mathrm {H}}}}\), are considered. Whenever the matrix rank increases after the addition of a row, this row is linearly independent of the previous rows. Conversely, the rows that do not increase the matrix rank are linearly dependent on other rows of \(\widetilde{\underline{\varvec{\mathrm {H}}}}\). The indices of these rows form the set S. In general, the set of linearly dependent rows is not unique. Therefore, we consider the rows of \(\widetilde{\underline{\varvec{\mathrm {H}}}}\) in descending order of the magnitudes of the corresponding gradient entries, which can help the algorithm converge faster.
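The incremental rank test can be sketched as follows (hypothetical helper; `np.linalg.matrix_rank` computes the rank from an SVD internally, and the tolerance is an assumption to be tuned):

```python
import numpy as np

def dependent_rows(H_t, g_t, tol=1e-10):
    """Find a set S of indices of linearly dependent rows of H_t, visiting
    rows in descending order of the corresponding gradient magnitudes."""
    order = np.argsort(-np.abs(g_t))     # largest gradient entries first
    kept, S, rank = [], [], 0
    for idx in order:
        rows = kept + [int(idx)]
        new_rank = np.linalg.matrix_rank(H_t[rows, :], tol=tol)
        if new_rank > rank:              # row adds new information: keep it
            kept.append(int(idx))
            rank = new_rank
        else:                            # row depends linearly on kept rows
            S.append(int(idx))
    return sorted(S)
```

Removing the rows and columns indexed by S from a symmetric singular matrix leaves an invertible block, as required for Eq. (13).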
After S is determined, we need to check whether there is a parameter vector \(\varvec{\theta }^{*}\) satisfying requirements 1 and 2 from Sect. 2.1 for the approximation \(\hat{\ell }\). Let \(\widetilde{\underline{\varvec{\mathrm {H}}}}_{\mathrm {dd}}\) (“d” for “dependent”) be the submatrix of \(\widetilde{\underline{\varvec{\mathrm {H}}}}\) that remains if all rows and columns corresponding to indices in S are removed from \(\widetilde{\underline{\varvec{\mathrm {H}}}}\). Similarly, let \(\widetilde{\underline{\varvec{\mathrm {H}}}}_{\mathrm {ff}}\) (“f” for “free”) be the submatrix of \(\widetilde{\underline{\varvec{\mathrm {H}}}}\) containing only the rows and columns corresponding to indices in S, and let \(\widetilde{\underline{\varvec{\mathrm {H}}}}_{\mathrm {df}}=\widetilde{\underline{\varvec{\mathrm {H}}}}_{\mathrm {fd}}^{\top }\) be the matrix containing the rows whose indices are not in S and the columns whose indices are in S. Let us define \(\widetilde{\varvec{g}}_{\mathrm {d}}\), \(\widetilde{\varvec{g}}_{\mathrm {f}}\), \(\widetilde{\varvec{\delta }}_{\mathrm {d}}\), and \(\widetilde{\varvec{\delta }}_{\mathrm {f}}\) accordingly. If \(\widetilde{\underline{\varvec{\mathrm {H}}}}_{\mathrm {dd}}\) is not negative definite, \(\hat{\ell }\) is unbounded, and requirement 2 cannot be satisfied. Otherwise, we may attempt to solve
$$\begin{aligned} 0= & {} \frac{\partial }{\partial \widetilde{{\varvec{\delta }}}}\hat{\ell }^{\varvec{\delta }} \end{aligned}$$
(10)
$$\begin{aligned} \Longleftrightarrow \quad 0= & {} \widetilde{\underline{\varvec{\mathrm {H}}}}_{\mathrm {dd}}\widetilde{\varvec{\delta }_{\mathrm {d}}^{*}}+\widetilde{\underline{\varvec{\mathrm {H}}}}_{\mathrm {df}}\widetilde{\varvec{\delta }_{\mathrm {f}}^{*}}+\widetilde{\varvec{\mathrm {\mathrm {H}}}}_{0\mathrm {d}}\delta _{0}^{*}+\widetilde{\varvec{g}}_{\mathrm {d}} \end{aligned}$$
(11)
$$\begin{aligned} 0= & {} \widetilde{\underline{\varvec{\mathrm {H}}}}_{\mathrm {df}}^{\top }\widetilde{\varvec{\delta }_{\mathrm {d}}^{*}}+\widetilde{\underline{\varvec{\mathrm {H}}}}_{\mathrm {ff}}\widetilde{\varvec{\delta }_{\mathrm {f}}^{*}}+\widetilde{\varvec{\mathrm {\mathrm {H}}}}_{0\mathrm {f}}\delta _{0}^{*}+\widetilde{\varvec{g}}_{\mathrm {f}}. \end{aligned}$$
(12)
If equation system (11)–(12) has a solution, we can choose \(\widetilde{\varvec{\delta }_{\mathrm {f}}^{*}}\) freely. Setting \(\widetilde{\varvec{\delta }_{\mathrm {f}}^{*}}\leftarrow \varvec{0}\) makes Eq. (11) equivalent to
$$\begin{aligned} \widetilde{\varvec{\delta }_{\mathrm {d}}^{*}}=-\widetilde{\underline{\varvec{\mathrm {H}}}}_{\mathrm {dd}}^{-1}\left( \widetilde{\varvec{\mathrm {\mathrm {H}}}}_{0\mathrm {d}}\delta _{0}+\widetilde{\varvec{g}}_{\mathrm {d}}\right) . \end{aligned}$$
(13)
That is, we may set \(\widetilde{\underline{\varvec{\mathrm {H}}}}\leftarrow \widetilde{\underline{\varvec{\mathrm {H}}}}_{\mathrm {dd}}\), \(\widetilde{\varvec{g}}\leftarrow \widetilde{\varvec{g}}_{\mathrm {d}}\), \(\widetilde{\varvec{\delta }^{*}}\leftarrow \widetilde{\varvec{\delta }_{\mathrm {d}}^{*}}\) for the remainder of the current iteration and proceed as usual, leaving the free nuisance parameters unchanged: \(\widetilde{\varvec{\delta }_{\mathrm {f}}^{*}}=\varvec{0}\). With the resulting \(\delta _{0}^{*}\), we check whether (12) holds approximately. If not, the log-likelihood is unbounded above. We consider this case in the next section.
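Given the index set S, the reduced step (13) can be sketched as follows (hypothetical helper; the free components are simply held at zero):

```python
import numpy as np

def reduced_nuisance_step(H_t, H0_t, g_t, S, d0):
    """Nuisance step of Eq. (13): indices in S are held fixed (delta_f = 0)
    and only the remaining 'dependent' block is solved."""
    keep = [i for i in range(len(g_t)) if i not in S]
    H_dd = H_t[np.ix_(keep, keep)]                        # invertible block
    d_d = -np.linalg.solve(H_dd, H0_t[keep] * d0 + g_t[keep])
    d_full = np.zeros(len(g_t))
    d_full[keep] = d_d                                    # free entries stay 0
    return d_full
```

If S is empty, this reduces to Eq. (6) applied to the full system.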
An alternative way to proceed when \(\widetilde{\underline{\varvec{\mathrm {H}}}}\) is singular is to apply a generalized matrix inverse in Eq. (6). For example, we could apply the Moore–Penrose inverse (Penrose 1955), which is well defined even for singular matrices. In tests, however, this approach appeared to be sensitive to a threshold parameter, and we obtained better results with the procedure described above. We present the alternative approach based on the Moore-Penrose inverse along with test results in Supplementary Appendix A.
Solving unbounded subproblems
In each iteration, we seek the nuisance parameters \(\widetilde{\varvec{\theta }}\) that maximize \(\ell \) for the computed value of \(\theta _{0}\). The log-likelihood \(\ell \) is bounded above if the MLE exists, which we presume. Nonetheless, the approximate log-likelihood \(\hat{\ell }\) could be unbounded at times, which would imply that the approximation is imprecise for large steps. Since we cannot identify a global maximum of \(\hat{\ell }\) if it is unbounded, we instead seek the point maximizing \(\hat{\ell }\) in the range where \(\hat{\ell }\) is sufficiently accurate.
If testing Eq. (12) in the previous section has not shown that \(\hat{\ell }\) is unbounded above, we test the boundedness of \(\hat{\ell }\) via a Cholesky decomposition on \(-\widetilde{\underline{\varvec{\mathrm {H}}}}\). The decomposition succeeds if, and only if, \(\widetilde{\underline{\varvec{\mathrm {H}}}}\) is negative definite, implying that \(\hat{\ell }\) is bounded in the nuisance parameters. Otherwise, \(\hat{\ell }\) is unbounded, since free parameters have been fixed and removed in the previous section, guaranteeing that \(\widetilde{\underline{\varvec{\mathrm {H}}}}\) is non-singular.
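The Cholesky-based boundedness test is a one-liner in practice (sketch; `numpy` raises `LinAlgError` when the factorization fails):

```python
import numpy as np

def is_bounded_above(H_t):
    """Cholesky of -H_t succeeds iff H_t is negative definite, i.e. iff the
    quadratic approximation is bounded above in the nuisance directions."""
    try:
        np.linalg.cholesky(-H_t)
        return True
    except np.linalg.LinAlgError:
        return False
```

The Cholesky attempt is cheaper than a full eigendecomposition, which is why it is preferred for a pure definiteness check.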
If \(\hat{\ell }\) is unbounded, we set \(\delta _{0}^{*}\leftarrow r_{0}\), \(r\leftarrow r_{1}\) for some parameters \(r_{0},r_{1}>0\) and solve the maximization problem (9). The parameters \(r_{0}\) and \(r_{1}\) can be adjusted along with the trust region and saved for future iterations to efficiently identify the maximal admissible step. That is, we increase (or reduce) \(\delta _{0}^{*}\) and r as long as (or until) \(\hat{\ell }\) is sufficiently precise. In particular, we adjust the ratio of \(\delta _{0}^{*}\) and r so that the likelihood increases: \(\hat{\ell }^{\delta }\mathopen {\left( \delta _{0}^{*},\widetilde{\varvec{\delta }^{*}}\right) }\mathclose {}\mathclose {}>\bar{\ell }\). At the end of the iteration, we set \(r_{0}\leftarrow \delta _{0}^{*}\), \(r_{1}\leftarrow r\) to start later iterations with appropriate step sizes.
Step choice for the parameter of interest
Whenever \(\hat{\ell }\) has a unique maximum in the nuisance parameters, we compute \(\delta _{0}^{*}\) by solving Eq. (7). This equation can have one, two, or no roots. To discuss how \(\delta _{0}^{*}\) should be chosen in each of these cases, we introduce some helpful notation. First, we write \(\hat{\ell }_{\mathrm {PL}}\mathopen {\left( \theta _{0}\right) }\mathclose {}\mathclose {}:=\underset{\widetilde{\varvec{\theta }}}{\max \,}\hat{\ell }\mathopen {\left( \theta _{0},\widetilde{\varvec{\theta }}\right) }\mathclose {}\mathclose {}\) for the profile log-likelihood function of the quadratic approximation. Furthermore, in accordance with previous notation, we write
$$\begin{aligned} \hat{\ell }_{\mathrm {PL}}^{\delta }\mathopen {\left( \delta _{0}\right) }\mathclose {}\mathclose {} :=\hat{\ell }_{\mathrm {PL}}\mathopen {\left( \theta _{0}^{(i)}+\delta _{0}\right) }\mathclose {}\mathclose {}=a\delta _{0}^{2}+p\delta _{0}+q+\ell ^{*} \end{aligned}$$
(14)
with \(a:=\frac{1}{2}\left( \mathrm {H}_{00}-\widetilde{\varvec{\mathrm {\mathrm {H}}}}_{0}^{\top }\widetilde{\underline{\varvec{\mathrm {H}}}}^{-1}\widetilde{\varvec{\mathrm {\mathrm {H}}}}_{0}\right) \), \(p:=g_{0}-\widetilde{\varvec{g}}^{\top }\widetilde{\underline{\varvec{\mathrm {H}}}}^{-1}\widetilde{\varvec{\mathrm {\mathrm {H}}}}_{0}\), and \(q:=\bar{\ell }-\frac{1}{2}\widetilde{\varvec{g}}^{\top }\widetilde{\underline{\varvec{\mathrm {H}}}}^{-1}\widetilde{\varvec{g}}-\ell ^{*}\) [see Eq. (7)].
Our choices of \(\delta _{0}^{*}\) attempt to increase \(\theta _{0}\) as much as possible while staying in a region in which the approximation \(\hat{\ell }\) is reasonably accurate. The specific step choice depends on the slope of the profile likelihood \(\hat{\ell }_{\mathrm {PL}}^{\delta }\) and on whether we have already exceeded \(\theta _{0}^{\max }\) according to our approximation, i.e. \(\hat{\ell }_{\mathrm {PL}}^{\delta }\mathopen {\left( 0\right) }\mathclose {}\mathclose {}<\ell ^{*}\). In Sects. 2.5.1–2.5.3 below, we assume that \(\hat{\ell }_{\mathrm {PL}}^{\delta }\mathopen {\left( 0\right) }\mathclose {}\mathclose {}>\ell ^{*}\). We discuss the opposite case in Sect. 2.5.4.
Case 1: decreasing profile likelihood
If the profile likelihood decreases at the approximation point, i.e. \(p<0\), we select the smallest positive root:
$$\begin{aligned} \delta _{0}^{*}={\left\{ \begin{array}{ll} -\frac{q}{p} &{}\quad \text {if}\,a=0\\ -\frac{1}{2a}\left( p+\sqrt{p^{2}-4aq}\right) &{}\quad \text {else.} \end{array}\right. } \end{aligned}$$
(15)
Choosing \(\delta _{0}^{*}>0\) ensures that the distance to the end point \(\theta _{0}^{\max }\) decreases in this iteration. Choosing the smaller positive root increases our trust in the accuracy of the approximation and prevents potential convergence issues (see Fig. 2a).
If \(\hat{\ell }_{\mathrm {PL}}^{\delta }\) has a local minimum above the threshold \(\ell ^{*}\), the equation \(\hat{\ell }_{\mathrm {PL}}^{\delta }\mathopen {\left( \delta _{0}\right) }\mathclose {}\mathclose {}=\ell ^{*}\) has no real solution, and we may attempt to decrease the distance between \(\hat{\ell }_{\mathrm {PL}}^{\delta }\) and \(\ell ^{*}\) instead. This procedure, however, may let RVM converge to a local minimum of \(\hat{\ell }_{\mathrm {PL}}^{\delta }\) rather than to a point with \(\hat{\ell }_{\mathrm {PL}}^{\delta }=\ell ^{*}\). Therefore, we “jump” over the extreme point by doubling the step to the vertex, \(-\frac{p}{2a}\). That is, we choose
$$\begin{aligned} \delta _{0}^{*}=-\frac{p}{a} \end{aligned}$$
(16)
if \(p^{2}<4aq\) (see Fig. 2b). This choice of \(\delta _{0}^{*}\) ensures that we quickly return to a range where the profile likelihood function decreases while at the same time accounting for the scale of the problem by considering the curvature of the profile likelihood.
Case 2: increasing profile likelihood
If the profile likelihood increases at the approximation point, i.e. \(p>0\), Eq. (14) has a positive root if, and only if, \(\hat{\ell }_{\mathrm {PL}}\) is concave down, i.e. \(a<0\). We choose this root whenever it exists:
$$\begin{aligned} \delta _{0}^{*}=-\frac{1}{2a}\left( p+\sqrt{p^{2}-4aq}\right) . \end{aligned}$$
(17)
However, if \(\hat{\ell }_{\mathrm {PL}}\) grows unboundedly, Eq. (14) does not have a positive root. In this case, we change the threshold value \(\ell ^{*}\) temporarily to a value \(\ell ^{*\prime }\) chosen so that Eq. (14) has a solution with the updated threshold (see Fig. 2c). For example, we may set
$$\begin{aligned} \ell ^{*\prime }:=\max \left\{ \hat{\ell }_{\mathrm {PL}}^{\delta }\mathopen {\left( 0\right) }\mathclose {}\mathclose {}+1,\frac{\bar{\ell }+\ell \mathopen {\left( \hat{\varvec{\theta }}\right) }\mathclose {}\mathclose {}}{2}\right\} . \end{aligned}$$
(18)
The first term in the maximum expression in (18) ensures that a solution exists; setting the threshold 1 unit higher than the current approximate value of the profile log-likelihood seems reasonable given that we typically consider the log-likelihood surface in a range where it is \(\mathcal {O}\mathopen {\left( 1\right) }\mathclose {}\) units below its maximum. The second term in the maximum expression permits us to take larger steps if we are far below the likelihood maximum. That way, we may reach local likelihood maxima faster. After resetting the threshold, we proceed as usual.
To record that we have changed the threshold value \(\ell ^{*}\), we set a flag \(\mathtt {maximizing}\leftarrow \mathtt {True}\). In later iterations \(j>i\), we set the threshold \(\ell ^{*}\) back to its initial value and \(\mathtt {maximizing}\leftarrow \mathtt {False}\) as soon as \(\ell \mathopen {\left( \varvec{\theta }^{(j)}\right) }\mathclose {}\mathclose {}\) falls below the initial threshold or \(\hat{\ell }_{\mathrm {PL}}\) is concave down at the approximation point \(\varvec{\theta }^{(j)}\).
Case 3: constant profile likelihood
If the profile likelihood has a local extremum at the approximation point, i.e. \(p=0\), \(a\ne 0\), we proceed as in cases 1 and 2: if \(a>0\), we proceed as if \(\hat{\ell }_{\mathrm {PL}}\) were increasing, and if \(a<0\), we proceed as if \(\hat{\ell }_{\mathrm {PL}}\) were decreasing. However, the approximate profile likelihood could also be constant, \(a=p=0\). In this case, we attempt to make a very large step to check whether we can push \(\theta _{0}\) arbitrarily far. In Sect. 2.6, we discuss this procedure in greater detail.
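The case distinction of Eqs. (15)–(17) for a profile value above the threshold (i.e. \(q>0\)) can be summarized in a small helper (function name and the `None` convention are ours; `None` stands for the threshold-adjustment fallback of Eq. (18), to which the flat case \(a=p=0\) also falls through in this sketch):

```python
import math

def step_above_threshold(a, p, q):
    """delta_0 choice of Eqs. (15)-(17), assuming q > 0."""
    if p < 0:                                 # Case 1: decreasing profile
        if a == 0:
            return -q / p                     # Eq. (15), linear case
        if p * p < 4 * a * q:                 # local minimum above threshold
            return -p / a                     # Eq. (16): jump over the vertex
        return -(p + math.sqrt(p * p - 4 * a * q)) / (2 * a)  # smallest pos. root
    if a < 0:                                 # Cases 2/3: concave, root exists
        return -(p + math.sqrt(p * p - 4 * a * q)) / (2 * a)  # Eq. (17)
    return None                               # unbounded growth (or flat profile)
```

Note that for \(p<0\) and \(a<0\) the discriminant is always positive (since \(q>0\)), so the jump rule (16) can only be triggered when the profile parabola opens upward.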
Profile likelihood below the threshold
If the profile likelihood at the approximation point is below the threshold, \(\hat{\ell }_{\mathrm {PL}}^{\delta }\mathopen {\left( 0\right) }\mathclose {}\mathclose {}<\ell ^{*}\), we always choose the smallest possible step:
$$\begin{aligned} \delta _{0}^{*}={\left\{ \begin{array}{ll} -\frac{1}{2a}\left( p+\sqrt{p^{2}-4aq}\right) &{}\quad \text {if}\,a\ne 0,\,p<0\\ -\frac{q}{p} &{}\quad \text {if}\,a=0,\,p\ne 0\\ -\frac{1}{2a}\left( p-\sqrt{p^{2}-4aq}\right) &{}\quad \text {if}\,a\ne 0,\,p>0. \end{array}\right. } \end{aligned}$$
(19)
This brings us back into the admissible parameter region as quickly as possible.
As RVM rarely steps far beyond the admissible region in practice, Eq. (19) usually suffices to define \(\delta _{0}^{*}\). Nonetheless, if we find that \(\hat{\ell }_{\mathrm {PL}}^{\delta }\) has a local maximum below the threshold, i.e. \(p^{2}<4aq\), we may instead step to the maximum of \(\hat{\ell }_{\mathrm {PL}}^{\delta }\):
$$\begin{aligned} \delta _{0}^{*}=-\frac{p}{2a}. \end{aligned}$$
(20)
If we have already reached a local maximum (\(p\approx 0\)), we cannot make a sensible choice for \(\delta _{0}\). In this case, we may recall the iteration \(k:=\underset{j{:}\,\ell (\varvec{\theta }^{(j)})\ge \ell ^{*}}{\mathrm {argmax}\,}\theta _{0}^{(j)}\) in which the largest admissible \(\theta _{0}\) value with \(\ell \mathopen {\left( \varvec{\theta }^{(k)}\right) }\mathclose {}\mathclose {}\ge \ell ^{*}\) has been found so far, and conduct a binary search between \(\varvec{\theta }^{(i)}\) and \(\varvec{\theta }^{(k)}\) until we find a point \(\varvec{\theta }^{(i+1)}\) with \(\ell \mathopen {\left( \varvec{\theta }^{(i+1)}\right) }\mathclose {}\mathclose {}\ge \ell ^{*}\).
Identifying inestimable parameters
If the considered parameter is not estimable and the profile log-likelihood \(\ell _{\mathrm {PL}}\) never falls below the threshold \(\ell ^{*}\), RVM may not converge. However, it is often possible to identify inestimable parameters by introducing a step size limit \(\delta _{0}^{\max }\). If the computed step exceeds the maximal step size, i.e. \(\delta _{0}^{*}>\delta _{0}^{\max }\), and the current function value exceeds the threshold value, i.e. \(\bar{\ell }\ge \ell ^{*}\), we set \(\delta _{0}^{*}:=\delta _{0}^{\max }\) and compute the corresponding nuisance parameters. If the resulting log-likelihood \(\ell \mathopen {\left( \varvec{\theta }^{(i)}+\varvec{\delta }^{*}\right) }\mathclose {}\mathclose {}\) is not below the threshold \(\ell ^{*}\), we let the algorithm terminate, raising a warning that the parameter \(\theta _{0}\) is not estimable. If \(\ell \mathopen {\left( \varvec{\theta }^{(i)}+\varvec{\delta }^{*}\right) }\mathclose {}\mathclose {}<\ell ^{*}\), however, we cannot draw this conclusion and decrease the step size until the approximation is sufficiently close to the original function.
The criterion suggested above may not always suffice to identify inestimable parameters. For example, if the profile likelihood is constant but the nuisance parameters maximizing the likelihood change non-linearly, RVM may not halt. For this reason, and also to prevent unexpected convergence issues, it is advisable to introduce an iteration limit to the algorithm. If the iteration limit is exceeded, potential estimability issues may be investigated further.
Discontinuities
RVM is based on quadratic approximations and therefore requires that \(\ell \) is twice differentiable. Nonetheless, discontinuities can occur due to numerical imprecision even if the likelihood function is continuous in theory. Though we may still be able to compute the gradient \(\varvec{g}\) and the Hessian \(\underline{\varvec{\mathrm {H}}}\) in these cases, the resulting quadratic approximation will be inaccurate even if we take very small steps. Therefore, these discontinuities could hinder the algorithm from terminating.
To identify discontinuities, we define a minimal step size \(\epsilon _{\mathrm {step}}\), which may depend on the gradient \(\varvec{g}\). If we reject a step with small length \(\left| \varvec{\delta }^{*}\right| \le \epsilon _{\mathrm {step}}\), we may conclude that \(\ell \) is discontinuous at the current approximation point \(\varvec{\theta }^{(i)}\). To determine the set D of parameters responsible for the issue, we decompose \(\varvec{\delta }^{*}\) into its components. We initialize \(D\leftarrow \emptyset \) and consider, with the jth unit vector \(\varvec{e}_{j}\), the partial steps \(\varvec{\delta }^{*\prime }:=\sum _{j\le k,\,j\notin D}\varvec{e}_{j}\delta _{j}^{*}\) for increasing \(k<n\) until \(\hat{\ell }^{\delta }\mathopen {\left( \varvec{\delta }^{*\prime }\right) }\mathclose {}\mathclose {}\not \approx \ell ^{\delta }\mathopen {\left( \varvec{\delta }^{*\prime }\right) }\mathclose {}\mathclose {}\). When we identify such a component, we add it to the set D and continue the procedure.
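This greedy decomposition can be sketched as follows (hypothetical helper; `ell_delta` and `ell_hat_delta` are the true and approximate log-likelihoods as functions of the step, and the tolerance is an assumption):

```python
import numpy as np

def discontinuous_components(delta, ell_delta, ell_hat_delta, tol=1e-4):
    """Add components of the rejected step one by one; flag a component
    for the set D when approximation and true value diverge."""
    D = []
    partial = np.zeros(delta.size)
    for j in range(delta.size):
        trial = partial.copy()
        trial[j] = delta[j]
        if abs(ell_delta(trial) - ell_hat_delta(trial)) > tol:
            D.append(j)          # component j triggers the discontinuity
        else:
            partial = trial      # keep the component in the accepted partial step
    return D
```

A component whose inclusion suddenly breaks the agreement between \(\ell \) and \(\hat{\ell }\) is thus attributed to the discontinuity, while all other components remain in the step.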
If we find that \(\ell \) is discontinuous in \(\theta _{0}\), we check whether the current nuisance parameters maximize the likelihood, i.e. \(\ell \) is bounded above and \(\widetilde{\varvec{g}}\) is approximately \(\varvec{0}\). If the nuisance parameters are not optimal, we hold \(\theta _{0}\) constant and maximize \(\ell \) with respect to the nuisance parameters. Otherwise, we conclude that the profile likelihood function has a jump discontinuity. In this case, our action depends on the current log-likelihood value \(\bar{\ell }\), the value of \(\ell \) at the other end of the discontinuity, and the threshold \(\ell ^{*}\).
-
If \(\ell \mathopen {\left( \varvec{\theta }^{(i)}+\varvec{e}_{0}\delta _{0}^{*}\right) }\mathclose {}\mathclose {}\ge \ell ^{*}\) or \(\ell \mathopen {\left( \varvec{\theta }^{(i)}\right) }\mathclose {}\mathclose {}<\ell \mathopen {\left( \varvec{\theta }^{(i)}+\varvec{e}_{0}\delta _{0}^{*}\right) }\mathclose {}\mathclose {}\), we accept the step regardless of the undesirably large error.
-
If \(\ell \mathopen {\left( \varvec{\theta }^{(i)}+\varvec{e}_{0}\delta _{0}^{*}\right) }\mathclose {}\mathclose {}<\ell ^{*}\) and \(\ell \mathopen {\left( \varvec{\theta }^{(i)}\right) }\mathclose {}\mathclose {}\ge \ell ^{*}\), we terminate and return \(\theta _{0}^{(i)}\) as the bound of the confidence interval.
-
Otherwise, we cannot make a sensible step and try to get back into the admissible region by conducting the binary search procedure we have described in Sect. 2.5.4.
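The case distinction for a jump discontinuity in \(\theta _{0}\) might be sketched as below. The function name, the string return labels, and the callable `l` are hypothetical and only serve to make the branching explicit.

```python
import numpy as np

def handle_theta0_discontinuity(l, theta, e0_delta0, l_star):
    """Sketch of the three cases above for a jump discontinuity in
    theta_0; `l` is a log-likelihood callable, `e0_delta0` the step
    e_0 * delta_0, and `l_star` the threshold log-likelihood."""
    l_here = l(theta)
    l_there = l(theta + e0_delta0)   # value at the other end of the jump
    if l_there >= l_star or l_here < l_there:
        return "accept_step"         # step stays admissible or increases l
    if l_there < l_star and l_here >= l_star:
        return "terminate"           # theta_0 is the confidence bound
    return "binary_search"           # recovery procedure of Sect. 2.5.4
```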
If \(\ell \) is discontinuous in variables other than \(\theta _{0}\), we hold constant those variables whose change decreases the likelihood and repeat the iteration with a reduced system. After a given number of iterations, we release these parameters again, as \(\varvec{\theta }\) may have left the point of discontinuity.
Since we may require that not only \(\ell \) but also its gradient be well approximated, a robust implementation of RVM should also handle potential gradient discontinuities. The nuisance parameters causing the issues can be identified analogously to the procedure outlined above. All components in which the gradient changes its sign from positive to negative should be held constant, as the likelihood appears to be in a local maximum in these components. The step in the remaining components may be accepted despite the large error.
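Selecting the components to hold constant amounts to a sign check on the gradient before and after the step; a minimal sketch, with assumed argument names:

```python
import numpy as np

def components_to_hold(g_before, g_after):
    """Sketch: indices whose gradient component flips from positive to
    negative across the step, suggesting a local maximum in that
    component; these are held constant in the reduced system."""
    g_before = np.asarray(g_before)
    g_after = np.asarray(g_after)
    return np.where((g_before > 0) & (g_after < 0))[0]
```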
Suitable parameters and distance measures
The efficiency of RVM depends on the distance measures and parameters applied when assessing the accuracy of the approximation and updating the search radius of the constrained optimization problems (9). If the precision measures are overly conservative, many steps will be needed to find \(\varvec{\theta }^{*}\). If they are too liberal, in turn, RVM may take detrimental steps and might not converge at all.
We suggest the following procedure: (1) we always accept forward steps with \(\delta _{0}^{*}\ge 0\) if the true likelihood is larger than the approximate likelihood: \(\ell ^{\delta }\left( \varvec{\delta }^{*}\right) \ge \hat{\ell }^{\delta }\left( \varvec{\delta }^{*}\right) \). (2) If the approximate likelihood function is unbounded, we require that the likelihood increases: \(\ell ^{\delta }\left( \varvec{\delta }^{*}\right) \ge \bar{\ell }\). This requirement helps RVM to return quickly to a region in which the approximation is bounded. However, if the step size falls below the threshold used to detect discontinuities, we may accept satisfactorily precise steps even when the likelihood does not increase. This prevents the algorithm from searching for potential discontinuities even though the approximation is precise. (3) If we are outside the admissible region, i.e. \(\bar{\ell }<\ell ^{*}\), we enforce that we get closer to the target likelihood: \(\left| \ell ^{\delta }\left( \varvec{\delta }^{*}\right) -\ell ^{*}\right| <\left| \bar{\ell }-\ell ^{*}\right| \). This reduces potential convergence issues. (4) We require that
$$\begin{aligned} \frac{\left| \hat{\ell }^{\delta }\mathopen {\left( \varvec{\delta }^{*}\right) }\mathclose {}\mathclose {}-\ell ^{\delta }\mathopen {\left( \varvec{\delta }^{*}\right) }\mathclose {}\mathclose {}\right| }{\left| \bar{\ell }-\ell ^{*}\right| }\le \gamma \end{aligned}$$
(21)
for a constant \(\gamma \). That is, the required precision depends on how close we are to the target. This facilitates fast convergence of the algorithm. The constant \(\gamma \in \left( 0,1\right) \) controls how strict the precision requirement is. In tests, \(\gamma =\frac{1}{2}\) appeared to be a good choice. (5) If we are close to the target, \(\ell ^{\delta }\mathopen {\left( \varvec{\delta }^{*}\right) }\mathclose {}\mathclose {}\approx \ell ^{*}\), we also require that the gradient estimate is precise:
$$\begin{aligned} \frac{\left| \frac{\partial \hat{\ell }^{\delta }}{\partial \widetilde{\varvec{\theta }}}\left( \varvec{\delta }^{*}\right) -\frac{\partial \ell ^{\delta }}{\partial \widetilde{\varvec{\theta }}}\left( \varvec{\delta }^{*}\right) \right| }{\left| \varvec{g}\right| }\le \gamma . \end{aligned}$$
(22)
This constraint helps us to get closer to a maximum in the nuisance parameters. Here, we use the \(\mathcal {L}_{2}\) norm.
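Rules (1)-(5) can be combined into a single acceptance check. The sketch below assumes the true and approximate log-likelihood changes have already been evaluated at the proposed step; all argument names are illustrative, not the authors' notation.

```python
import numpy as np

def accept_step(delta_star, l_d, l_hat_d, grad_err, l_bar, l_star, g_norm,
                approx_bounded, eps_step=1e-5, gamma=0.5, near_target=False):
    """Sketch of acceptance rules (1)-(5). `l_d` / `l_hat_d` are the true
    and approximate log-likelihood values at the step `delta_star`,
    `grad_err` the L2 error of the gradient approximation, `l_bar` the
    current log-likelihood, and `g_norm` the gradient norm |g|."""
    # (1) forward steps that beat the approximation are always accepted
    if delta_star[0] >= 0 and l_d >= l_hat_d:
        return True
    # (2) unbounded approximation: demand an increase in l, unless the
    #     step is already below the discontinuity-detection threshold
    if (not approx_bounded and l_d < l_bar
            and np.linalg.norm(delta_star) > eps_step):
        return False
    # (3) outside the admissible region: must move towards l_star
    if l_bar < l_star and abs(l_d - l_star) >= abs(l_bar - l_star):
        return False
    # (4) relative precision requirement, eq. (21)
    if abs(l_hat_d - l_d) > gamma * abs(l_bar - l_star):
        return False
    # (5) near the target, the gradient must also be precise, eq. (22)
    if near_target and grad_err > gamma * g_norm:
        return False
    return True
```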
When we reject a step because the approximation is not sufficiently accurate, we adjust \(\delta _{0}^{*}\) and solve the constrained maximization problem (9) requiring \(\left| \widetilde{\varvec{\delta }}\right| \le r\). To ensure that the resulting step does not push the log-likelihood below the target \(\ell ^{*}\), the radius r should not be decreased more strongly than \(\delta _{0}^{*}\). In tests, adjusting r by a factor \(\beta _{1}:=\frac{2}{3}\) whenever \(\delta _{0}^{*}\) is adjusted by a factor \(\beta _{0}:=\frac{1}{2}\) yielded good results. Aside from accepting or rejecting proposed steps, the algorithm evaluates the accuracy of various equations with some tolerance. We suggest introducing a single tuning parameter \(\epsilon _{\mathrm {tol}}\) to control the accuracy requirement. In tests, we found that a generous bound of \(\epsilon _{\mathrm {tol}}=0.001\) makes the algorithm robust against errors arising, for example, when almost singular matrices are inverted.
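The shrinking rule is a two-line update; the function name is hypothetical, and the defaults are the suggested factors \(\beta _{0}=\frac{1}{2}\) and \(\beta _{1}=\frac{2}{3}\):

```python
def shrink_step(delta0, r, beta0=0.5, beta1=2.0 / 3.0):
    """Shrink the proposed step after a rejection. Since beta1 > beta0,
    the search radius r decreases more slowly than delta_0, keeping the
    log-likelihood from dropping below the target."""
    return beta0 * delta0, beta1 * r
```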
The optimal value for the minimal step size \(\epsilon _{\mathrm {step}}\), used to detect discontinuities due to numerical errors, depends on the expected accuracy of the gradient and Hessian of the log-likelihood function. We suggest a default value of \(10^{-5}\). The maximal step size \(\delta _{0}^{\max }\), used to classify parameters as inestimable, should be chosen as large as possible without leading to numerical issues. That way, the criterion becomes less dependent on the scale of the parameter. We suggest a value of \(\delta _{0}^{\max }=10^{10}\), as it is far beyond the typical range of parameters in practical problems. Nonetheless, the parameter needs to be adjusted if very large parameter values are expected.
The distance measures and parameters given above are meant to be applicable in a wide range of problems without further fine tuning. We list the tuning parameters along with the suggested values in Table 1.
Table 1 Tuning parameters along with suggested values that typically yield good results in practice
Confidence intervals for functions of parameters
Often, modelers are interested in confidence intervals for functions \(f\mathopen {\left( \varvec{\theta }\right) }\mathclose {}\mathclose {}\) of the parameters. A limitation of VM and RVM is that such confidence intervals cannot be computed directly with these algorithms. However, this problem can be solved approximately by considering a slightly changed likelihood function. We aim to find
$$\begin{aligned} \phi ^{\max }=\underset{\varvec{\theta }\in \varTheta {:}\,\ell \mathopen {\left( \varvec{\theta }\right) }\mathclose {}\mathclose {}\ge \ell ^{*}}{\max \,}f\mathopen {\left( \varvec{\theta }\right) }\mathclose {}\mathclose {} \end{aligned}$$
(23)
or the respective minimum. Define
$$\begin{aligned} \check{\ell }\mathopen {\left( \phi ,\varvec{\theta }\right) }\mathclose {}\mathclose {}:=\ell \mathopen {\left( \varvec{\theta }\right) }\mathclose {}\mathclose {}-\frac{1}{2}\left( \frac{f\mathopen {\left( \varvec{\theta }\right) }\mathclose {}\mathclose {}-\phi }{\varepsilon }\right) ^{2}\chi _{1,1-\alpha }^{2}, \end{aligned}$$
(24)
with a small constant \(\varepsilon \). Consider the altered maximization problem
$$\begin{aligned} \check{\phi }^{\max }=\underset{\varvec{\theta }\in \varTheta {:}\,\check{\ell }\mathopen {\left( \phi ,\varvec{\theta }\right) }\mathclose {}\mathclose {}\ge \ell ^{*}}{\max \,}\phi , \end{aligned}$$
(25)
which can be solved with VM or RVM.
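The altered likelihood (24) is straightforward to construct from the original likelihood. A minimal sketch, assuming `l` and `f` are callables for \(\ell \) and \(f\) and `chi2_quantile` is \(\chi _{1,1-\alpha }^{2}\) (e.g. 3.84 for \(\alpha =0.05\)); none of these names come from the paper:

```python
def altered_loglik(l, f, phi, eps, chi2_quantile):
    """Build the altered log-likelihood (24): the original log-likelihood
    minus a quadratic penalty on the deviation of f(theta) from phi."""
    def l_check(theta):
        return l(theta) - 0.5 * ((f(theta) - phi) / eps) ** 2 * chi2_quantile
    return l_check
```

The penalty vanishes where \(f\left( \varvec{\theta }\right) =\phi \) and grows quadratically otherwise, so maximizing \(\phi \) subject to \(\check{\ell }\ge \ell ^{*}\) forces \(f\left( \varvec{\theta }\right) \) to track \(\phi \) up to the error \(\varepsilon \).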
We argue that a solution to (25) is an approximate solution to (23), with an error bounded by \(\varepsilon \). Let \(\left( \phi ^{\max },\varvec{\theta }^{*}\right) \) be a solution to problem (23) and \(\left( \check{\phi }^{\max },\check{\varvec{\theta }}^{*}\right) \) a solution to problem (25). Since \(\phi ^{\max }=f\left( \varvec{\theta }^{*}\right) \), we have \(\check{\ell }\left( \phi ^{\max },\varvec{\theta }^{*}\right) =\ell \left( \varvec{\theta }^{*}\right) \ge \ell ^{*}\). Therefore, \(\left( \phi ^{\max },\varvec{\theta }^{*}\right) \) is also a feasible solution to (25), and it follows that \(\check{\phi }^{\max }\ge \phi ^{\max }\). At the same time, \(\check{\ell }\left( \phi ,\varvec{\theta }\right) \le \ell \left( \varvec{\theta }\right) \), so the feasibility domain of (25) is contained in that of (23); since \(\varvec{\theta }^{*}\) maximizes \(f\) over the larger domain, \(f\left( \check{\varvec{\theta }}^{*}\right) \le f\left( \varvec{\theta }^{*}\right) \). In conclusion, \(f\left( \check{\varvec{\theta }}^{*}\right) \le f\left( \varvec{\theta }^{*}\right) =\phi ^{\max }\le \check{\phi }^{\max }\). Lastly,
$$\begin{aligned} \ell ^{*}&= \ell \mathopen {\left( \hat{\varvec{\theta }}\right) }\mathclose {}\mathclose {}-\frac{1}{2}\chi _{1,1-\alpha }^{2}~\le ~\check{\ell }\mathopen {\left( \check{\phi }^{\max },\check{\varvec{\theta }}^{*}\right) }\mathclose {}\mathclose {}\nonumber \\&= \ell \mathopen {\left( \check{\varvec{\theta }}^{*}\right) }\mathclose {}\mathclose {}-\frac{1}{2}\left( \frac{f\mathopen {\left( \check{\varvec{\theta }}^{*}\right) }\mathclose {}\mathclose {}-\check{\phi }^{\max }}{\varepsilon }\right) ^{2}\chi _{1,1-\alpha }^{2}. \end{aligned}$$
(26)
Simplifying (26) yields \(\left| f\mathopen {\left( \check{\varvec{\theta }}^{*}\right) }\mathclose {}\mathclose {}-\check{\phi }^{\max }\right| \le \varepsilon \). Thus, \(\left| \phi ^{\max }-\check{\phi }^{\max }\right| \le \varepsilon \).
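In detail: since \(\check{\varvec{\theta }}^{*}\in \varTheta \) and \(\hat{\varvec{\theta }}\) is the maximum likelihood estimate, \(\ell \left( \check{\varvec{\theta }}^{*}\right) \le \ell \left( \hat{\varvec{\theta }}\right) \), so rearranging (26) gives
$$\begin{aligned} \frac{1}{2}\left( \frac{f\left( \check{\varvec{\theta }}^{*}\right) -\check{\phi }^{\max }}{\varepsilon }\right) ^{2}\chi _{1,1-\alpha }^{2}\le \ell \left( \check{\varvec{\theta }}^{*}\right) -\ell ^{*}\le \ell \left( \hat{\varvec{\theta }}\right) -\ell ^{*}=\frac{1}{2}\chi _{1,1-\alpha }^{2}, \end{aligned}$$
hence \(\left( f\left( \check{\varvec{\theta }}^{*}\right) -\check{\phi }^{\max }\right) ^{2}\le \varepsilon ^{2}\).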
Though the error can in theory be bounded by an arbitrarily small constant \(\varepsilon \), care must be taken if the function \(f\left( \varvec{\theta }\right) \) is not well-behaved, i.e. strongly nonlinear. In these cases, overly small values of \(\varepsilon \) may slow down convergence. Clearly, considering an objective function with an added parameter increases the computational effort to compute the Hessian matrix required for RVM. Note, however, that the Hessian of the altered likelihood \(\check{\ell }\) can be computed easily if the Hessian of the original likelihood \(\ell \) and the derivatives of the function f are known. Therefore, determining confidence intervals for functions of parameters via the altered problem (25) may not be computationally harder than determining confidence intervals for parameters.
The suggested procedure may seem to resemble the approach of Neale and Miller (1997), who also account for constraints by adding a squared error to the target function. However, unlike Neale and Miller (1997), the approach suggested above bounds the error in the confidence interval bound, not the error of the constraint. Furthermore, we do not square the log-likelihood function, which would worsen nonlinearities and could thus make optimization difficult. Therefore, our approach is less error-prone than the method of Neale and Miller (1997).