Parameter estimation
The posterior distributions of Sect. 2.4 arise from the combination of \(n_0\) prior observations, perhaps fictitious, and the n actual observations. In the FS we combine the \(n_0\) prior observations with a carefully selected m of the n observations. The search starts at \(m = 0\), when the fictitious observations alone provide the parameter values from which the residuals of all n observations are calculated. It then continues with the fictitious observations always included amongst those used for parameter estimation; their residuals are ignored in the selection of successive subsets.
As mentioned in Sect. 1, there is one complication in this procedure. The \(n_0\) fictitious observations are treated as a sample with population variance \(\sigma ^2\). However, the m observations from the actual data are, as in Sect. 2.1, from a truncated distribution of m out of n observations and so asymptotically have a variance \(c(m,n)\sigma ^2\). An adjustment must be made before the two samples are combined. This becomes a problem in weighted least squares (for example, Rao 1973, p. 230). Let \(y^+\) be the \((n_0 + m) \times 1\) vector of responses from the fictitious observations and the subset, with \(X^+\) the corresponding matrix of explanatory variables. The covariance matrix of the independent observations is \(\sigma ^2G\), with G a diagonal matrix; the first \(n_0\) elements of the diagonal of G equal one and the last m elements have the value c(m, n). The information matrix for the \(n_0 + m\) observations is
$$\begin{aligned} (X^{+\mathrm {T}} W X^+)/\sigma ^2 = \{X_0^{\mathrm{T}}X_0 + X(m)^{\mathrm{T}}X(m) /c(m,n)\}/\sigma ^2, \end{aligned}$$
(6)
where \(W = G^{-1}\). In the least squares calculations we need only multiply the sample values y(m) and X(m) by \(c(m,n)^{-1/2}\). However, care is needed to obtain the correct expressions for the leverages and for the variances of the parameter estimates.
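To make the weighting concrete, the following Python sketch builds the information matrix (6). The function names are illustrative, and the closed form used for c(m, n), the variance of a standard normal truncated to its central m/n probability region, is a standard asymptotic result assumed here rather than quoted from the text.

```python
import numpy as np
from scipy.stats import norm

def c_factor(m, n):
    """Assumed asymptotic variance c(m, n) of the central m out of n
    observations: the variance of a standard normal truncated to the
    interval of central probability m/n (valid for 0 < m < n)."""
    q = norm.ppf(0.5 * (1 + m / n))              # truncation point
    return 1 - 2 * (n / m) * q * norm.pdf(q)

def information_matrix(X0, Xm, sigma2, c_mn):
    """Information matrix (6): the n0 prior rows enter with weight 1,
    the m subset rows with weight 1/c(m, n), i.e. X^{+T} W X^{+}/sigma^2."""
    Xm_w = Xm / np.sqrt(c_mn)                    # rescale subset rows
    return (X0.T @ X0 + Xm_w.T @ Xm_w) / sigma2
```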
Since, during the forward search, n in (3) is replaced by the subset size m, X and y in (4) become \(y(m)/\sqrt{c(m,n)}\) and \(X(m)/\sqrt{c(m,n)}\), giving rise to posterior values \(a_1(m)\), \(b_1(m)\), \(\tau _1(m)\) and \({\hat{\sigma }}^2_1(m)\).
The estimate of \(\beta \) from \(n_0\) prior observations and m sample observations can, from (6), be written
$$\begin{aligned} \hat{\beta }_1(m) = (X^{+\mathrm {T}} W X^+)^{-1}X^{+\mathrm {T}} W y^+. \end{aligned}$$
(7)
In Sect. 2.3, \(\hat{\beta }_0 = R^{-1}X_0^{\mathrm {T}}y_0\), so that \(X_0^{\mathrm {T}}y_0 = X_0^{\mathrm {T}}X_0\, \hat{\beta }_0\). Then the estimate in (7) can be written in full as
$$\begin{aligned} \hat{\beta }_1(m)&= \{X_0^{\mathrm{T}} X_0 + X(m)^{\mathrm{T}} X(m)/c(m,n)\}^{-1}\{X_0^{\mathrm{T}} y_0 + X(m)^{\mathrm{T}}y(m)/c(m,n)\} \\&= \{X_0^{\mathrm{T}} X_0 + X(m)^{\mathrm{T}} X(m)/c(m,n)\}^{-1}\{X_0^{\mathrm{T}}X_0\, \hat{\beta }_0 + X(m)^{\mathrm{T}}y(m)/c(m,n)\}. \end{aligned}$$
(8)
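In code, (8) reduces to a single weighted least squares solve. A minimal sketch, assuming \(X_0\), the prior estimate \(\hat{\beta }_0\), the subset data and the factor c(m, n) are available; all names are illustrative.

```python
import numpy as np

def beta1_hat(X0, beta0_hat, Xm, ym, c_mn):
    """Combined estimate (8): the prior contributes X0^T X0 beta0_hat,
    the subset contributes its cross-products weighted by 1/c(m, n)."""
    A = X0.T @ X0 + Xm.T @ Xm / c_mn             # X^{+T} W X^{+}
    rhs = X0.T @ X0 @ beta0_hat + Xm.T @ ym / c_mn
    return np.linalg.solve(A, rhs)               # avoids an explicit inverse
```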
Forward highest posterior density intervals
Inference about the parameters of the regression model comes from regions of highest posterior density. These are calculated from the prior information and the subset at size m. Let
$$\begin{aligned} V(m) = (X^{+\mathrm {T}}X^+)^{-1} = \{X_0^{\mathrm{T}}X_0 + X(m)^{\mathrm{T}}X(m) \}^{-1}, \end{aligned}$$
(9)
with (j, j)th element \(V_{jj}(m)\). Likewise, the jth element of \(\hat{\beta }_1(m)\), \(j=1, 2, \ldots , p\), is denoted \(\hat{\beta }_{1j}(m)\). Then
$$\begin{aligned} \hbox {var}\,{ \hat{\beta }}_{1j}(m) = \hat{\sigma }^2(m) V_{jj}(m). \end{aligned}$$
The \(100(1 - \alpha )\%\) highest posterior density (HPD) interval for \(\beta _{1j}\) is
$$\begin{aligned} \hat{\beta }_{1j}(m) \pm t_{\nu ,1 - \alpha /2} \sqrt{\hat{\sigma }^2(m) V_{jj}(m)}, \end{aligned}$$
with \(t_{\nu ,\gamma }\) the \(100\gamma \%\) point of the t distribution on \(\nu \) degrees of freedom. Here \(\nu = n_0 + m - p\).
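The interval is routine to compute. A sketch with illustrative names, where V_m is the matrix of (9) and sigma2_m the current estimate \(\hat{\sigma }^2(m)\):

```python
import numpy as np
from scipy.stats import t

def hpd_beta_j(beta1_m, sigma2_m, V_m, j, n0, m, p, alpha=0.05):
    """100(1 - alpha)% HPD interval for beta_{1j} at subset size m."""
    nu = n0 + m - p                               # degrees of freedom
    half = t.ppf(1 - alpha / 2, nu) * np.sqrt(sigma2_m * V_m[j, j])
    return beta1_m[j] - half, beta1_m[j] + half
```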
The highest posterior density intervals for \(\tau \) and \(\sigma ^2\) are, respectively, given by
$$\begin{aligned}{}[g_{a_1(m),b_1(m),\alpha /2}, g_{a_1(m),b_1(m),1 - \alpha /2}] \quad \hbox {and} \quad [ig_{a_1(m),b_1(m),\alpha /2}, ig_{a_1(m),b_1(m),1 - \alpha /2}], \end{aligned}$$
where \(g_{a,b,\gamma }\) and \(ig_{a,b,\gamma }\) are the \(100\gamma \%\) points of the G(a, b) and IG(a, b) distributions.
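These quantiles are available from standard libraries. A sketch, assuming the G(a, b) and IG(a, b) of the text follow the shape/rate convention, so that in scipy's parameterisation the rate b enters as scale = 1/b for the gamma and as scale = b for the inverse gamma:

```python
from scipy.stats import gamma, invgamma

def tau_sigma2_intervals(a1, b1, alpha=0.05):
    """Equal-tail credible intervals of the display above for
    tau ~ G(a1, b1) and sigma^2 ~ IG(a1, b1)."""
    probs = (alpha / 2, 1 - alpha / 2)
    tau_int = tuple(gamma.ppf(p, a1, scale=1 / b1) for p in probs)
    sigma2_int = tuple(invgamma.ppf(p, a1, scale=b1) for p in probs)
    return tau_int, sigma2_int
```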
Outlier detection
We detect outliers using a form of deletion residual that includes the prior information. Let \(S^*(m)\) be the subset of size m found by the FS, for which the matrix of regressors is X(m). Weighted least squares on this subset of observations, as in (8), yields the parameter estimate \(\hat{\beta }_1(m)\) and \(\hat{\sigma }^2(m)\), an estimate of \(\sigma ^2\) on \(n_0+m-p\) degrees of freedom. The residuals for all n observations, including those not in \(S^*(m)\), are
$$\begin{aligned} e_{i}(m) = y_i -x_i^\mathrm {T}\hat{\beta }_1(m) \qquad (i=1, \ldots , n). \end{aligned}$$
(10)
The search moves forward with the augmented subset \(S^*(m+1)\) consisting of the observations with the \(m+1\) smallest absolute values of \(e_{i}(m)\). To start we take \(m_0 = 0\), since the prior information specifies the values of \(\beta \) and \(\sigma ^2\).
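The progression rule is a one-line ranking of absolute residuals. A sketch with illustrative names, where beta1_m is the estimate (8) computed from the current subset:

```python
import numpy as np

def next_subset(X, y, beta1_m, m):
    """Advance the search: compute the residuals (10) for all n
    observations and return the indices of the m + 1 smallest in
    absolute value, which form S*(m + 1)."""
    e = y - X @ beta1_m                           # e_i(m), i = 1, ..., n
    return np.argsort(np.abs(e))[: m + 1]
```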
To test for outliers, the deletion residuals are calculated for the \(n-m\) observations not in \(S^*(m)\). These residuals are
$$\begin{aligned} r_{i}(m) = \frac{y_{i} - x_{i}^\mathrm {T}\hat{\beta }_1(m)}{ \sqrt{\hat{\sigma }^2(m)\{1 + h_{i}(m)\}}} = \frac{e_{i}(m)}{ \sqrt{\hat{\sigma }^2(m)\{1 + h_{i}(m)\}}}, \end{aligned}$$
(11)
where, from (8), the leverage \(h_{i}(m) = x_i^\mathrm {T}\{X_0^\mathrm {T}X_0 + X(m)^\mathrm {T}X(m)/c(m,n)\}^{-1}x_i\). Let the observation nearest to those forming \(S^*(m)\) be \(i_{\mathrm{min}}\) where
$$\begin{aligned} i_{\mathrm{min}} = \arg \min _{i \notin S^*(m)} | r_{i}(m)|. \end{aligned}$$
To test whether observation \(i_{\mathrm{min}}\) is an outlier, we use the absolute value of the minimum deletion residual
$$\begin{aligned} r_{\mathrm{imin}}(m) = \frac{e_{\mathrm{imin}}(m)}{ \sqrt{\hat{\sigma }^2(m)\{1 + h_{\mathrm{imin}}(m)\}}}, \end{aligned}$$
(12)
as a test statistic. If the absolute value of (12) is too large, observation \(i_{\mathrm{min}}\) is declared an outlier, as are all other observations not in \(S^*(m)\).
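Putting (10)-(12) together, one step of the outlier test can be sketched as follows; the names, and the explicit matrix inverse used for the leverages, are illustrative only.

```python
import numpy as np

def min_deletion_residual(X, y, subset, beta1_m, sigma2_m, X0, c_mn):
    """Return i_min and the test statistic (12) for the current subset."""
    M = np.linalg.inv(X0.T @ X0 + X[subset].T @ X[subset] / c_mn)
    out = np.setdiff1d(np.arange(len(y)), subset)      # i not in S*(m)
    e = y[out] - X[out] @ beta1_m                      # e_i(m), eq. (10)
    h = np.einsum('ij,jk,ik->i', X[out], M, X[out])    # leverages h_i(m)
    r = e / np.sqrt(sigma2_m * (1 + h))                # eq. (11)
    k = np.argmin(np.abs(r))
    return out[k], r[k]                                # i_min, r_imin(m)
```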
Envelopes and multiple testing
A Bayesian FS through the data provides a set of n absolute minimum deletion residuals. We require the null pointwise distribution of this set of values and find, for each value of m, a numerical estimate of, for example, the 99% quantile of the distribution of \(|r_{\mathrm{imin}}(m)|\).
When used as the boundary of critical regions for outlier testing, these envelopes have a pointwise size of 1%. Performing n tests of outlyingness at this size leads to a procedure for the whole sample whose size is much greater than the pointwise size. To obtain a procedure with a 1% samplewise size, we require a rule which allows both for the simple behaviour in which a few outliers enter at the end of the search and for the more complicated behaviour when there are many outliers, which may be apparent away from the end of the search; at the end of the search such outliers may be masked and not evident. Our chosen rule achieves this by using exceedances of several envelopes to give a “signal” that outliers may be present.
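Pointwise envelopes of this kind can be estimated by direct Monte Carlo simulation. In the sketch below, run_search is a placeholder assumed to run one Bayesian FS on data generated under the null model and to return the n values of \(|r_{\mathrm{imin}}(m)|\):

```python
import numpy as np

def pointwise_envelope(run_search, n, n_sim=1000, level=0.99, seed=0):
    """Estimate, for each m = 0, ..., n - 1, the `level` quantile of
    |r_imin(m)| under the null model from n_sim simulated searches."""
    rng = np.random.default_rng(seed)
    stats = np.empty((n_sim, n))
    for s in range(n_sim):
        stats[s] = run_search(rng)                # one null-model search
    return np.quantile(stats, level, axis=0)
```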
In cases of appreciable contamination, the signal may occur too early, indicating an excessive number of outliers. This happens because of the way in which the envelopes increase towards the end of the search. Accordingly, we first test for outliers at the sample size indicated by the signal and then increase that size, checking the 99% envelope for outliers as the value of n increases, a process known as resuperimposition. The notation \(r_{\mathrm{min}}(m,n)\) indicates the dependence of this process on a series of values of n.
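In outline, and only as a sketch (the signal rule and the envelope construction follow Riani et al. 2009 and are not reproduced here), resuperimposition can be organised as a loop over tentative sample sizes:

```python
import numpy as np

def resuperimpose(run_search, envelope99, m_signal, n):
    """Sketch: after a signal at subset size m_signal, grow the
    tentative sample size n* one observation at a time and accept the
    first n* whose trajectory |r_min(m, n*)| stays inside the 99%
    envelope computed for a sample of size n*; observations beyond the
    accepted n* are declared outliers."""
    for n_star in range(m_signal, n + 1):
        r = np.abs(run_search(n_star))            # |r_min(m, n*)| trajectory
        if np.all(r <= envelope99(n_star)):       # inside the envelope?
            return n_star
    return n                                      # no outliers confirmed
```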
In the next section, where interest is in envelopes over the whole search, we find selected percentage points of the null distribution of \(|r_{\mathrm{imin}}(m)|\) by simulation. However, in the data analyses of Sect. 5 the focus is on the detection of outliers in the second half of the search. Here we use a procedure derived from the distribution of order statistics to calculate the envelopes for the many values of \(r_{{\mathrm{min}}}(m,n)\) required in the resuperimposition of envelopes. Further details of the algorithm and its application to the frequentist analysis of multivariate data are in Riani et al. (2009).