1 Introduction

The concept of minimal detectable error (Baarda 1968), later termed minimal detectable bias (MDB), was a pioneering tool for analyzing the behaviour of a network in the presence of an outlier. Understood as a measure of network internal reliability, it was meant to link the a priori analysis of network sensitivity to an outlier with the chances of detecting it. The original formula for the MDB, covering the case of correlated observations, was later analyzed by Wang and Chen (1994), Schaffrin (1997) and Teunissen (1990, 1998, 2000), and was further extended to the case of multiple outliers (Teunissen 2000; Knight et al. 2010). It was noticed in numerical tests that gross errors of MDB magnitude are often not identified, whereas identification can be successful at greater magnitudes (e.g. Hekimoglu and Erenoglu 2005). The concept of the III type error was introduced (Hawkins 1980; Förstner 1983) to cover the situations where an error-free observation may be mistakenly identified as the one contaminated by a gross error.

The MDB concept itself does not cover the issue of outlier identifiability. It only determines the minimal magnitude of a gross error in a particular observation whose presence in a system can be disclosed through an excessive non-centrality effect in the global test. Hence, extending the MDB concept to the issue of outlier identifiability is a desirable research task.

Likewise, the “response-based” measures of network internal reliability (Prószyński 2010), which provide reliability criteria clearly interpretable in terms of network responses to outliers, are not associated with the chances of outlier identification.

Taking into account the above description of the problem, the objective of the research was formulated as follows:

  i. to work out a method of evaluating the chances of identifying a gross error of MDB magnitude (assumed to be a single gross error in a system) and, together with some other related characteristics, to create a supplementation of the MDB concept with regard to outlier identifiability,

  ii. to propose a method for the a priori evaluation of the increase of the MDB necessary to ensure that the resulting gross error can be reliably identified in practice,

  iii. to provide probabilistic support for the response-based reliability criteria with regard to outlier identifiability.

Since the identifiability issue has much in common with the concept of outlier separability, the common elements of the proposed approach and some chosen existing methods of outlier separability analysis (Wang and Knight 2012; Yang et al. 2013) are also discussed.

2 Preliminaries

The main part of the paper is preceded by some preliminary statements and auxiliary concepts describing the approach and presenting the notation applied in the analyses.

2.1 Specifying the terms “detectable gross error” and “identifiable gross error”

Building on the distinction between “outlier detection” and “outlier identification” as clearly defined in Teunissen (2000), we give some details specifying the approach to the a priori analysis of outlier identifiability proposed in the present paper.

We confine the explanations to the case where a network is contaminated with a single gross error (i.e. the one-outlier case).

Detectable gross error—an observation error of a magnitude such that its presence in a network is signalled by the global model test statistic exceeding its critical value.

Identifiable gross error—a detectable gross error whose exact location in a network, i.e. in a particular observation, can be identified among the suspected observations in the first adjustment run, i.e. without subsequent diagnostic operations such as removal or re-weighting of observations. This is the case when, among all the outlier test statistics that exceed the critical value, the one of maximum absolute value corresponds to the contaminated observation.

In the above definition “outlier identification” is clearly separated from “outlier detection”, since it is meant as a subsequent process of forming the set of suspected observations and finding the contaminated observation among them.

Unidentifiable gross error—a detectable gross error located in a specific region of a network (consisting of at least two observations) in which all the observations obtain equal values of the outlier test statistics. The error is unidentifiable within the region (Cen et al. 2003; Prószyński 2008).

The conditions concerning the existence of the Regions of Unidentifiable Errors (RUE) for correlated observations are derived in Appendix A.

2.2 GM model and the disturbance/response relationship

Let us consider a GM model, written in its original form

$$\begin{aligned} {\mathbf {Ax}}+{\mathbf {e}}={\mathbf {y}}; \quad {\mathbf {e}}\sim ( {\mathbf {0}},{\mathbf {C}}) \end{aligned}$$
(1)

and in the equivalent modified form that exposes the correlation matrix (Prószyński 2010)

$$\begin{aligned} {\mathbf {A}}_\mathrm{S} {\mathbf {x}}+{\mathbf {e}}_\mathrm{S} ={\mathbf {y}}_\mathrm{S} ; \quad {\mathbf {e}}_\mathbf{s} \sim ({\mathbf {0}},{\mathbf {C}}_\mathbf{s} ) \end{aligned}$$
(2)

where \({\mathbf {y}}\) is the \(n\times 1\) vector of observations; \({\mathbf {A}}\) the \(n\times u\) design matrix, rank \({\mathbf {A}}=u-d\) (\(d\): system defect, \(d\ge 0\)); \({\mathbf {x}}\) the \(u\times 1\) vector of unknown parameters; \({\mathbf {e}}\) the \(n\times 1\) vector of random errors (we shall also use \({\mathbf {v}}=-{\mathbf {e}}\)); \({\mathbf {C}}\) the \(n\times n\) covariance matrix of \({\mathbf {e}}\) (positive definite), \({\mathbf {C}}=\sigma _{o}^{2}{\mathbf {P}}^{-1}=\sigma _{o}^{2}{\mathbf {Q}}\); \({\varvec{\sigma }}=(\mathrm{diag}\,{\mathbf {C}})^{1/2}\); \({\mathbf {A}}_\mathrm{s}={\varvec{\sigma }}^{-1}{\mathbf {A}}\), \({\mathbf {e}}_\mathrm{s}={\varvec{\sigma }}^{-1}{\mathbf {e}}\), \({\mathbf {y}}_\mathrm{s}={\varvec{\sigma }}^{-1}{\mathbf {y}}\), \({\mathbf {C}}_\mathrm{s}={\varvec{\sigma }}^{-1}{\mathbf {C}}{\varvec{\sigma }}^{-1}\), where \({\mathbf {C}}_\mathrm{s}\) is a correlation matrix; for uncorrelated observations \({\mathbf {C}}_\mathrm{s}={\mathbf {I}}\).

The LS estimator of the vector \(\mathbf {v}_\mathbf{s}\), where \(\mathbf {v}_{\mathbf s}=-{\mathbf e}_{\mathbf s}\), is given by

$$\begin{aligned} \hat{\mathbf{v}}_\mathbf{s} =-\mathbf{Hy}_\mathbf{s} \end{aligned}$$
(3)

where

\({\mathbf {H}}={\mathbf {I}}-{\mathbf {A}}_\mathrm{s} ({\mathbf {A}}_\mathrm{s}^\mathrm{T} {\mathbf {C}}_\mathrm{s}^{-1} {\mathbf {A}}_\mathrm{s})^{+}{\mathbf {A}}_\mathrm{s}^\mathrm{T} {\mathbf {C}}_\mathrm{s}^{-1}\) is the modified reliability matrix (Prószyński 2010), i.e. the reliability matrix for the modified GM model (2), and \((\cdot )^{+}\) denotes the pseudo-inverse.
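A minimal numerical sketch of this passage from the model components \({\mathbf {A}}\), \({\mathbf {C}}\) to \({\mathbf {A}}_\mathrm{s}\), \({\mathbf {C}}_\mathrm{s}\) and \({\mathbf {H}}\) may be helpful. It is written in Python/NumPy purely for illustration (the function names are not taken from the paper) and assumes \({\mathbf {C}}\) is positive definite, as stated above.

```python
import numpy as np

def modified_model(A, C):
    """Pass from the original GM model (1) to the modified (standardized) model (2)."""
    sigma = np.sqrt(np.diag(C))      # observation standard deviations
    S_inv = np.diag(1.0 / sigma)     # the matrix sigma^{-1}
    A_s = S_inv @ A                  # standardized design matrix
    C_s = S_inv @ C @ S_inv          # correlation matrix of the observations
    return A_s, C_s, sigma

def reliability_matrix(A_s, C_s):
    """Modified reliability matrix H = I - A_s (A_s^T C_s^{-1} A_s)^+ A_s^T C_s^{-1}."""
    n = A_s.shape[0]
    Cs_inv = np.linalg.inv(C_s)
    N_plus = np.linalg.pinv(A_s.T @ Cs_inv @ A_s)   # pseudo-inverse covers a datum defect d > 0
    return np.eye(n) - A_s @ N_plus @ A_s.T @ Cs_inv
```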

Decomposing the vector \({\mathbf {y}}_\mathrm{s}\) so that \({\mathbf {y}}_\mathrm{s} ={\mathbf {y}}_\mathrm{s}^\mathrm{true} -{\mathbf {v}}_\mathrm{s} +\Delta {\mathbf {y}}_\mathrm{s}\), where \(\Delta {\mathbf {y}}_\mathrm{s}\) is the vector of standardized observation gross errors (i.e. standardized disturbances), and realizing that \({\mathbf {H}}\cdot {\mathbf {y}}_\mathrm{s}^\mathrm{true} ={\mathbf {0}}\), we obtain (3) in the form

$$\begin{aligned} \hat{\mathbf{v}}_\mathbf{s} =\mathbf{Hv}_\mathbf{s} -\mathbf{H}\cdot \Delta {\mathbf {y}}_\mathbf{s} \end{aligned}$$
(4)

Denoting the second term in (4) by \(\Delta {\hat{\mathbf{v}}}_\mathbf{s}\), i.e. the vector of standardized increments in the LS residuals (standardized responses), we obtain from (4) the well-known disturbance/response relationship for the system (2), i.e.

$$\begin{aligned} {\Delta }{\hat{\mathbf {v}}}_{\mathrm{S}} =-\mathrm{{\mathbf H}}\cdot \varvec{\Delta }{{\mathbf {y}}}_{\mathrm{S}} \end{aligned}$$
(5)

where \({\Delta }{\hat{\mathbf {v}}}_{\mathrm{S}} =-\Delta {\hat{\mathbf {e}}}_{\mathrm{S}} \).

2.3 A short note on minimal detectable error (MDB) and response-based reliability measures

Below, we present the formula for MDB as given in Wang and Chen (1994), Teunissen (1990, 1996), using the notation as in Sect. 2.2

$$\begin{aligned} \text{MDB}_i =\sigma _i \cdot \sqrt{\frac{\lambda }{r_i }}; \quad r_i =\left\{ {\mathbf {H}}^\mathrm{T}{\mathbf {C}}_\mathrm{S}^{-1}{\mathbf {H}} \right\} _{ii} ; \quad r_i \in \left[ 0, \infty \right) \end{aligned}$$
(6)

where \(\text{MDB}_{i}\) is the minimal detectable bias in the i-th observation (its standardized form \({\hbox {MDB}}_{{\mathrm{S},i}}=\text{MDB}_{i}/\sigma _{i}\) is termed the controllability of the i-th observation); \(\sigma _{i}\) the standard deviation of the i-th observation; \(\lambda \) the non-centrality parameter (as in the global model test); \(r_{i}\) a generalized reliability number for the i-th observation; \({\mathbf {C}}_{\mathrm{S}}\), \({\mathbf {H}}\) the matrices as in (2) and (3), respectively; note that \({\mathbf {H}}^\mathrm{T}{\mathbf {C}}_{\mathrm{S}}^{-1}{\mathbf {H}}={\mathbf {C}}_{\mathrm{S}}^{-1}{\mathbf {H}}\).

The generalized reliability number \(r_{i}\) alone can also be considered an internal reliability measure (Caspary 1988).
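Given \({\mathbf {H}}\) and \({\mathbf {C}}_\mathrm{S}\), the reliability numbers \(r_i\) and the MDBs of formula (6) follow from a few matrix operations. The sketch below (illustrative Python, continuing the conventions of the previous snippet) assumes the non-centrality parameter \(\lambda \) has already been fixed, e.g. by the \(\beta \)-method.

```python
import numpy as np

def mdb(H, C_s, sigma, lam):
    """Reliability numbers r_i = {H^T C_s^{-1} H}_ii and MDB_i = sigma_i * sqrt(lam / r_i), cf. (6)."""
    M = H.T @ np.linalg.inv(C_s) @ H      # equals C_s^{-1} H, as noted after (6)
    r = np.diag(M)                        # generalized reliability numbers r_i
    mdb_i = sigma * np.sqrt(lam / r)      # minimal detectable biases (r_i close to 0 yields a huge MDB_i)
    return r, mdb_i, mdb_i / sigma        # the last output is the standardized MDB (controllability)
```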

The behaviour of a system in the presence of a single gross error can also be characterized by the so-called response-based internal reliability measures (Prószyński 2010), derived on the basis of the disturbance/response relationship (5), i.e. disregarding the random-error environment. For correlated observations the measures are the following pairs of indices

$$\begin{aligned} (h_{ii},\, w_{ii}), \quad \text{or equivalently} \quad (h_{ii},\, k_{i}) \end{aligned}$$

where \(h_{ii}\) is the i-th diagonal element of the matrix \({\mathbf {H}}\); \(w_{ii}\) the asymmetry index for the i-th row and the i-th column of \({\mathbf {H}}\); \(k_{i}\) the ratio of the squared quasi-global response \(Q_{(i)}\) to the squared local response \(h_{ii}\) to an outlier in the i-th observation [see formula (28) in Appendix A].

The reliability criteria are the following

$$\begin{aligned} 0.5<h_{ii} \le 1\quad \wedge \quad h_{ii} -2h_{ii}^{2}<w_{ii} <h_{ii} -h_{ii}^{2}, \quad i=1,\ldots ,n \end{aligned}$$
(7)

or, equivalently

$$\begin{aligned} 0.5<h_{ii} \le 1 \quad \wedge \quad 0<k_i <1 \quad i=1,\ldots ,n \end{aligned}$$

They are derived from the postulate that the maximum system response should be located in the observation in which the gross error resides, and that the responses in the other observations should be as small as possible (Prószyński 2010). There are then chances for effective identification of a single gross error residing in any of the observations. We can thus state that the criteria determine the area of outlier-exposing responses. The set of values (\(h_{ii}\), \(w_{ii}\)) which form this area (see Figs. 3, 4) will be denoted by S\(_\mathrm{O}\).
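Since the criteria (7) involve only the pair (\(h_{ii}\), \(w_{ii}\)), membership of an observation in the area S\(_\mathrm{O}\) can be checked mechanically. The sketch below is a simple Python illustration; it takes the asymmetry indices \(w_{ii}\) as given, since their computation follows Prószyński (2010) and is not reproduced here.

```python
import numpy as np

def in_outlier_exposing_area(h, w):
    """Element-wise check of the response-based reliability criteria (7).

    h : diagonal elements h_ii of H;  w : asymmetry indices w_ii (taken as given).
    Returns True where the pair (h_ii, w_ii) lies in the area S_O.
    """
    h, w = np.asarray(h), np.asarray(w)
    return (h > 0.5) & (h <= 1.0) & (w > h - 2.0 * h**2) & (w < h - h**2)
```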

It is not possible to interrelate the above two types of measures, i.e. \(r_{i}\) and (\(h_{ii}\), \(w_{ii}\)) or (\(h_{ii}\), \(k_{i}\)), on the grounds of rigorous matrix operations, due to their different generation principles. So, instead of direct interrelations, we can establish an indirect correspondence between these measures by finding their values on the basis of the components of model (1) or model (2), as shown in the scheme below

$$\begin{aligned} {\mathbf {A}},\;{\mathbf {C}}\rightarrow {\mathbf {A}}_\mathrm{s},\;{\mathbf {C}}_\mathrm{s} \rightarrow {\mathbf {H}},\,{\mathbf {C}}_\mathrm{s} \rightarrow \left\{ {\begin{array}{l} r_i \\ h_{ii};\; w_{ii} \\ \end{array}} \right. \quad i=1,\ldots ,n \end{aligned}$$
(8)

3 A study on outlier identifiability evaluation in terms of probability

The a priori analysis of outlier identifiability presented in this paper refers in principle to the outlier identification procedure of Knight et al. (2010). In that procedure the global model test is followed by local outlier tests resulting in a set of suspected outliers. The final outcome of the procedure is the observation with the maximum absolute value of the test statistic. The outlier test statistics (i.e. \(w^{2}\)) are obtained from the mean-shift model. The global model test and the local outlier tests are coordinated by equalizing the non-centrality parameters and selecting the probabilities according to the \(\beta \)-method (Baarda 1968). In the present paper, for finding the suspected outliers and identifying the contaminated observation, the \(\vert w\vert \) values are used instead of \(w^{2}\), assuming \(\left| w \right| _\mathrm{crit} =\sqrt{w_{\mathrm{{crit}}}^{{2}} } \).

The w-variables, being standardized random variables, are defined by

$$\begin{aligned} w_{i(i)} =\frac{\hat{z}_{i(i)} }{\sigma _{\hat{z}_{i(i)} } }; \quad w_{j(i)} =\frac{\hat{z}_{j(i)} }{\sigma _{\hat{z}_{j(i)} } }\quad i,j=1,\ldots , n \quad j\ne i \end{aligned}$$
(9)

where “i” denotes the observation contaminated with a gross error and “j” denotes any other observation; \(\hat{z}\) is the LS estimator of a gross error, obtained on the basis of the mean-shift model (Knight et al. 2010).

With the one-outlier case assumed in the present research, the above testing procedure is similar to Baarda’s w-test (Baarda 1968).
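For orientation, with the commonly used choices \(\alpha _0 = 0.001\) and \(\beta _0 = 0.20\) (an assumption of this illustration, not stated above), the one-dimensional coordination of the tests gives the critical value \(c\) and \(\sqrt{\lambda }\) as in the short Python sketch below; the resulting \(\sqrt{\lambda }\approx 4.13\) coincides with the value used later in Fig. 1.

```python
from scipy.stats import norm

def w_test_coordination(alpha0=0.001, beta0=0.20):
    """Critical value c = |w|_crit and sqrt(lambda) for the one-dimensional w-test.

    The defaults alpha0, beta0 are the classical beta-method choices, used here
    only as an example; sqrt(lambda_0) = u_{1-alpha0/2} + u_{1-beta0}.
    """
    c = norm.ppf(1.0 - alpha0 / 2.0)        # two-sided critical value for |w|
    sqrt_lam = c + norm.ppf(1.0 - beta0)    # square root of the non-centrality parameter
    return c, sqrt_lam

c, sqrt_lam = w_test_coordination()         # c ~ 3.29, sqrt(lambda) ~ 4.13
```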

3.1 Parameters of outlier test statistics for the needs of identifiability analysis

In the notation of the present paper, the w-variables (9) in a network contaminated with a single gross error \(\Delta y_{\mathrm{S},i}\) have the following detailed form

$$\begin{aligned} {w}_{i(i)}&=\dfrac{\Big \{ {{{\mathbf {H}}}^T{{\mathbf {C}}}_{\mathrm{S}}^{-1} {{\mathbf {H}}}} \Big \}_{i*} }{\sqrt{\Big \{ {{\mathbf {H}}^T{\mathbf {C}}_{\mathrm{S}}^{-1} {\mathbf {H}}} \Big \}_{ii} } }\cdot ( {{\mathbf {e}}_\mathrm{S} +\Delta {\mathbf {y}}_{{\mathrm{S}(i)}} })\nonumber \\ \;&=\;\dfrac{\Big \{ {{\mathbf {H}}^T{\mathbf {C}}_{\mathrm{S}}^{-1}{\mathbf {H}}}\Big \}_{i*}}{\sqrt{\Big \{{{\mathbf {H}}^T{\mathbf {C}}_{\mathrm{S}}^{-1} {\mathbf {H}}}\Big \}_{ii}}}\cdot {\mathbf {e}}_{\mathrm{S}} +\sqrt{\Big \{ {{\mathbf {H}}^T{\mathbf {C}}_{\mathrm{S}}^{-1} {\mathbf {H}}} \Big \}_{ii} } \Delta y_{{\mathrm{S}},i} \nonumber \\ {w}_{j(i)}&=\dfrac{\Big \{ {{\mathbf {H}}^T{\mathbf {C}}_{\mathrm{S}}^{-1} {\mathbf {H}}} \Big \}_{j*} }{\sqrt{\Big \{ {{\mathbf {H}}^T{\mathbf {C}}_{\mathrm{S}}^{-1} {\mathbf {H}}} \Big \}_{jj} } }( {{\mathbf {e}}_\mathrm{S} +\Delta {\mathbf {y}}_{{\mathrm{S}(i)}} })\nonumber \\ \;&=\;\dfrac{\Big \{ {{\mathbf {H}}^T{\mathbf {C}}_{\mathrm{S}}^{-1} {\mathbf {H}}} \Big \}_{j*} }{\sqrt{\Big \{ {{\mathbf {H}}^T{\mathbf {C}}_{\mathrm{S}}^{-1} {\mathbf {H}}} \Big \}_{jj} } }{\mathbf {e}}_{\mathrm{S}} +\dfrac{\Big \{ {{\mathbf {H}}^T{\mathbf {C}}_{\mathrm{S}}^{-1} {\mathbf {H}}} \Big \}_{ji} }{\sqrt{\Big \{ {{\mathbf {H}}^T{\mathbf {C}}_{\mathrm{S}}^{-1} {\mathbf {H}}} \Big \}_{jj} } }\Delta y_{{\mathrm{S},i}} \nonumber \\&i,j= \text{1 },\ldots ,n\quad j\ne i \end{aligned}$$
(10)

where \(\{\cdot \}_{i*}\) and \(\{\cdot \}_{j*}\) denote the i-th and the j-th row of \({\mathbf {H}}^\mathrm{T}{\mathbf {C}}_{\mathrm{S}}^{-1}{\mathbf {H}}\), and \(\Delta {\mathbf {y}}_{\mathrm{S}(i)}\) is the disturbance vector with \(\Delta y_{\mathrm{S},i}\) as its only non-zero element.

With \({\mathbf {e}}\sim N({\mathbf {0}},{\mathbf {C}})\), and consequently \({\mathbf {e}}_\mathrm{s} \sim N({\mathbf {0}},{\mathbf {C}}_\mathrm{s})\), we get after simple operations

$$\begin{aligned} {w}_{i( i)} \sim {N}\,( {\mu _i ,\text{1 }}) \quad \mu _i =\sqrt{\left\{ {{\mathbf {H}}}^T{{\mathbf {C}}}_{\mathrm{S}}^{-1} {\mathbf {H}} \right\} _{ii} } \cdot \Delta \mathrm{y}_{{\mathrm{S},i}} \end{aligned}$$
(11)
$$\begin{aligned} {w}_{j( i)} \sim {N}\,( {\mu _j ,\text{1 }}) \quad \mu _j =\frac{\left\{ {{\mathbf {H}}^T{{\mathbf {C}}}_{\mathrm{S}}^{-1} {{\mathbf {H}}}} \right\} _{ji} }{\sqrt{\left\{ {{{\mathbf {H}}}^T{{\mathbf {C}}}_{\mathrm{S}}^{-1} {{\mathbf {H}}}} \right\} _{jj} } }\cdot \Delta \mathrm{y}_{{\mathrm{S},i}} \end{aligned}$$
(12)
$$\begin{aligned} \rho _{ij} \,=\,\mathrm{cor}( {{w}_{i(i)} ,{w}_{j(i)} })\;=\;\frac{\left\{ {{{\mathbf {H}}}^T{{\mathbf {C}}}_{\mathrm{S}}^{-1} {{\mathbf {H}}}} \right\} _{ij} }{\sqrt{\left\{ {{{\mathbf {H}}}^T{{\mathbf {C}}}_{\mathrm{S}}^{-1} {{\mathbf {H}}}} \right\} _{ii} } \sqrt{\left\{ {{{\mathbf {H}}}^T{{\mathbf {C}}}_{\mathrm{S}}^{-1} {{\mathbf {H}}}} \right\} _{jj} } } \end{aligned}$$
(13)

as given in Förstner (1983).

To analyze outlier identifiability, it is most reasonable to consider detectable gross errors, i.e. \(\Delta y_{\mathrm{S},i} \ge \text{MDB}_{\mathrm{S},i}\), where \(\text{MDB}_{\mathrm{S},i}\) is as in (6). Substituting \(\Delta y_{\mathrm{S},i} = \text{MDB}_{\mathrm{S},i}\) into (11) and (12), we obtain

$$\begin{aligned} \mu _i =\sqrt{\lambda };\quad \mu _j =\rho _{ij} \cdot \sqrt{\lambda };\quad \rho _{ij}\ \text{as in (13)} \end{aligned}$$
(14)

Formula (14) reflects the well-known property (Förstner 1983) that, with the I type and II type error probabilities defined, the correlation between the outlier test statistics is decisive for identification of the contaminated i-th observation.
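For a gross error of MDB magnitude, the quantities (13) and (14) are read directly off the matrix \({\mathbf {H}}^\mathrm{T}{\mathbf {C}}_{\mathrm{S}}^{-1}{\mathbf {H}}\), as in the short Python sketch below (illustrative names; observations with \(r_i = 0\) would need separate treatment).

```python
import numpy as np

def w_moments(H, C_s, lam):
    """Correlations rho_ij (13) and non-centralities (14) for Delta y_{S,i} = MDB_{S,i}."""
    M = H.T @ np.linalg.inv(C_s) @ H
    d = np.sqrt(np.diag(M))            # square roots of the reliability numbers r_i
    rho = M / np.outer(d, d)           # rho_ij = {M}_ij / sqrt({M}_ii {M}_jj), cf. (13)
    mu_i = np.sqrt(lam)                # non-centrality of w_{i(i)}, cf. (14)
    mu_j = rho * np.sqrt(lam)          # row i holds the non-centralities of w_{j(i)}, cf. (14)
    return rho, mu_i, mu_j
```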

3.2 Identifiability indices and their properties

To identify, within a set of suspected observations, the i-th observation in which a gross error of MDB magnitude resides, we need \(\vert w_{i(i)}\vert \) to dominate over each of the corresponding absolute values for the remaining observations within this set. For each of the suspected observations we have \(\vert w\vert >\vert w\vert _\mathrm{crit}\), or for short \(\vert w\vert >c\) (we assume \(\left| w \right| _\mathrm{crit} =\sqrt{w_{\mathrm{{crit}}}^{{2}}} \)).

Using \(\frac{\left| {w_{j(i)} } \right| }{\left| {w_{i(i)} } \right| }<1\) as a condition equivalent to \(\vert w_{i(i)}\vert >\vert w_{j(i)}\vert \), we may form for the i-th observation an identifiability index, denoted ID\(_{i}\), defined in terms of conditional probability

$$\begin{aligned} \text{ ID }_i ={\hbox {P}}( {\hbox {Q}_i \left| {{\bar{\hbox {R}}}_{i}} \right. }) \end{aligned}$$
(15)

where

$$\begin{aligned} \begin{array}{l} {\hbox {Q}}_{i} =\dfrac{\left| {{w}_{1(i)} } \right| }{\left| {{w}_{i(i)} } \right| }<1 \cap \cdots \cap \dfrac{\left| {{w}_{j(i)} } \right| }{\left| {{w}_{i(i)} } \right| }<1 \cap \cdots \cap \dfrac{\left| {{w}_{n-1(i)} } \right| }{\left| {w_{i(i)}} \right| }<1; \quad j\ne i \\ \hbox {R}_i =\left| {w_{1(i)} } \right| <c \cap \cdots \cap \left| {w_{i(i)} } \right| <c \cap \cdots \cap \left| {w_{n(i)} } \right| <c \\ \end{array} \end{aligned}$$

where the non-centralities \(\mu \) of the w-variables are determined for \(\text {MDB}_{\mathrm{S},i}\) as in (14); \({\bar{\hbox {R}}}_i \), being the event opposite to \({\hbox {R}}_{i}\), contains all possible sets of suspected observations, each corresponding to a particular realization of the random errors in a single measurement of a network.

In formulating \({\hbox {Q}}_{i}\), we take into account the fact that domination of \(\vert w_{i(i)}\vert \) within the set of all the observations implies its domination within any set of suspected observations containing \(w_{i(i)}\).

Denoting each component of \({\hbox {Q}}_{i}\) by the symbol Z, used for a ratio of two folded normal variables (see Appendix B), we may write (15) in the form

$$\begin{aligned} \text{ID}_i ={\hbox {P}}\big ( \mathrm{Z}_{1(i)} <1 \cap \cdots \cap \mathrm{Z}_{j(i)} <1 \cap \cdots \cap \mathrm{Z}_{n-1(i)} <1 \,\big |\, {\bar{\hbox {R}}}_i \big )\quad j\ne i \end{aligned}$$
(16)

Due to the high complexity of the definition (15), which increases with the number n of observations in a network, an empirical method based on numerical simulation of random observation errors was applied in the research. The method consists in:

  • simulating numerically a certain number (e.g. 1000) of n-dimensional vectors of correlated random errors (according to a given C);

  • computing w-variables for each vector of random errors using the formulas (10), the systematic components being as in (14);

  • after eliminating the sets of w-variables in which no critical value is exceeded, computing the sample frequency of the sets in which \(\left| w \right| \) for the contaminated observation (such that \(\vert w\vert >c\)) is dominating, this sample frequency being the empirical approximation of ID; a minimal sketch of the simulation is given after this list. As a check on the correctness of the simulation procedure, the sample frequency of the eliminated sets of w-variables (i.e. those with all \(\vert w\vert <c\)) was used as an empirical approximation of the II type error probability \(\beta \).
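The sketch below (Python/NumPy; all names are illustrative and not taken from the working program mentioned further on) implements the three steps above. It also records which observation dominates in each retained set, so that the empirical mis-identifiability indices introduced in Sect. 3.4 come out in the same pass.

```python
import numpy as np

def simulate_identifiability(H, C_s, lam, i, c, g=1.0, n_sim=1000, seed=None):
    """Empirical ID_i, beta and MID_ij for a gross error of magnitude g * MDB_{S,i} in observation i.

    Error vectors e_s ~ N(0, C_s) are drawn, the w-variables (10) are formed with the
    systematic parts as in (17); sets in which no |w| exceeds c are eliminated (their
    share approximates beta), and ID_i is the share of the retained sets in which
    |w_{i(i)}| dominates.
    """
    rng = np.random.default_rng(seed)
    n = C_s.shape[0]
    M = H.T @ np.linalg.inv(C_s) @ H
    d = np.sqrt(np.diag(M))
    mu = g * np.sqrt(lam) * M[:, i] / (d * np.sqrt(M[i, i]))   # mu_i = g*sqrt(lam), mu_j = g*rho_ij*sqrt(lam)
    E = rng.multivariate_normal(np.zeros(n), C_s, size=n_sim)  # simulated standardized random errors e_s
    W = E @ (M / d[:, None]).T + mu                            # w-variables, one row per simulated set, cf. (10)
    absW = np.abs(W)
    detected = (absW > c).any(axis=1)                          # sets in which the error is detected at all
    beta_emp = 1.0 - detected.mean()                           # empirical II type error probability
    winner = absW[detected].argmax(axis=1)                     # dominating observation in each retained set
    id_emp = np.mean(winner == i)                              # empirical ID_i (assumes at least one retained set)
    mid_emp = np.bincount(winner, minlength=n) / winner.size   # entry j is the empirical MID_ij (entry i is ID_i)
    return id_emp, beta_emp, mid_emp
```

With \(g=1\) this approximates ID\(_{i}\) for an error of MDB magnitude; a factor \(g>1\) reproduces the extension (17) described next.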

To extend the scope of the identifiability analysis, the computer program written for the method contains the formulas (10) in a modified form that introduces a multiplying factor, so that the systematic components become

$$\begin{aligned} \mu _i =g_i \cdot \sqrt{\lambda };\quad \mu _j =\varvec{\rho }_{ij} \cdot g_i \cdot \sqrt{\lambda };\quad g_i >0 \end{aligned}$$
(17)

which corresponds to the use of \(\Delta y_{\mathrm{S},i} = g_{i}\cdot \text{MDB}_{\mathrm{S},i}\).

This modification can be used in the case of unsatisfactory values of ID\(_{i}\) obtained with \(\Delta y_{\mathrm{S},i} = \text{MDB}_{\mathrm{S},i}\).

We do not have an exact theoretical reference for evaluating the accuracy of the simulation method. Therefore, we may only analyze the degree of dispersion of the ID values for different sets of simulated data and different observations in a network. The estimates obtained in this way for the networks in Examples 1 and 2 (see Sect. 6), with 1000 simulations used, are within \(\pm \)1 to \(\pm \)2 % (standard deviations).

For the purpose of this study we also consider identification of the contaminated observation without setting restrictions on the values of the w-variables. Such a procedure, which in a sense covers outlier detection as well, is a departure from the assumed definition of outlier identification (see Sect. 2.1) and will be termed pseudo-identification. Consequently, we shall operate with a pseudo-identifiability index, denoted by ID\(_{i}^{*}\) and having the form

$$\begin{aligned} \mathrm{ID}_i^*={\hbox {P}}({\hbox {Q}}_i ) \end{aligned}$$
(18)

where \({\hbox {Q}}_{i}\) as in (15).

Although to a smaller degree than in the case of \({\hbox {P}}( {{\hbox {Q}}_i \left| {{\bar{\hbox {R}}}_i } \right. })\) in (15), finding P(Q\(_{i}\)) is still a complex computational task. However, we may obtain an empirical approximation of this index (\({\overline{\text{ ID }}}_i^*\)) by means of a slightly modified simulation method.

On the grounds of probability theory some relations can be established between ID\(_{i}^{*}\) and ID\(_{i}\)

$$\begin{aligned} {\hbox {P}}\left\{ {{\hbox {Q}}_i \left| {\bar{\hbox {R}}_i } \right. } \right\} =\frac{{\hbox {P}}\left\{ {{\hbox {Q}}_i \cap {\bar{\hbox {R}}}_i } \right\} }{{\hbox {P}}\left\{ {{\bar{\hbox {R}}}_i } \right\} }=\frac{{\hbox {P}}\left\{ {{\hbox {Q}}_i } \right\} -{\hbox {P}}\left\{ {{\hbox {Q}}_i \cap \text{ R }_i } \right\} }{{\hbox {P}}\left\{ {{\bar{\hbox {R}}}_i } \right\} } \end{aligned}$$

where \({\hbox {P}}( {\bar{\hbox {R}}_i })=1-{\beta }\),

and hence

$$\begin{aligned} \mathrm{ID}_i^*=(1-\beta )\cdot \mathrm{ID}_i +{\hbox {P}}\left\{ {{\hbox {Q}}_i \cap {\hbox {R}}_i } \right\} \end{aligned}$$
(19)

Assuming that \({\hbox {P}}( {{\hbox {Q}}_i \cap {\hbox {R}}_i })>0\), we get

$$\begin{aligned} \mathrm{ID}_i^*>(1-\beta ) \cdot \mathrm{ID}_i \end{aligned}$$
(20)

Hypothetically, the case ID\(_{i}^{*}\) = ID\(_{i}\) might occur if \({\hbox {P}}( {{\hbox {Q}}_i \cap {\bar{\hbox {R}}}_i })={\hbox {P}}( {{\hbox {Q}}_i })\cdot {\hbox {P}}( {{\bar{\hbox {R}}}_i })\), i.e. if Q\(_{i}\) and \({\bar{\hbox {R}}}_i \) were independent events. Then, with \(\text {ID}_{i}=1\), we would also have \(\text {ID}_{i}^{*}=1\), which would imply domination of \(\vert w_{i(i)}\vert \) in each possible set in R. Since the above independence is only a detached theoretical assumption, we can only state that ID\(_{i}\) is an unattainable upper limit for ID\(_{i}^{*}\).

The above relations have been confirmed by the results obtained from the simulation method.

3.3 Partial identifiability indices and their properties

As an auxiliary tool for network analysis, partial identifiability indices for pairs of observations were introduced, i.e. for the i-th observation contaminated by a gross error and the j-th, error-free, observation. Similarly to the two options of identifiability indices (Sect. 3.2), we distinguish

  • partial identifiability index ID\(_{i/j}\)

    $$\begin{aligned} \text{ ID }_{i/j} ={\hbox {P}}\left( {\frac{\left| {w_{j(i)} } \right| }{\left| {w_{i(i)} } \right| }<1\big | {\bar{\hbox {R}}}_{ij} }\right) \end{aligned}$$
    (21)

    where \({\hbox {R}}_{ij} =\left| {w_{i(i)} } \right| <c \,\cup \, \left| {w_{j(i)} } \right| <c\), or in the notation of (16)

    $$\begin{aligned} \text{ ID }_{i/j} ={\hbox {P}}( {\mathrm{Z}_{j(i)} <1\left| {\bar{\hbox {R}}} \right. _{ij} }) \end{aligned}$$
  • partial pseudo-identifiability index ID\(_{i/j}^{*}\).

    $$\begin{aligned} \mathrm{ID}_{i/j}^*={\hbox {P}}\left( {\frac{\left| {{w}_{j(i)} } \right| }{\left| {{w}_{i(i)} } \right| }<1}\right) \!, \end{aligned}$$
    (22)

    or in notation of (16),

    $$\begin{aligned} \text {ID}_{i/j}^*={\hbox {P}}\{\mathrm{Z}_{j(i)}<1\} \end{aligned}$$

    The indices are the values of the distribution function of the ratio of two folded normal variables. In the case of ID\(_{i/j}\) the space of the values of the w-variables is reduced in terms of absolute values, assuming that both the i-th and the j-th observation are elements of a set of suspected observations. For finding the values of ID\(_{i/j}^{*}\), MATLAB-based software has been developed (Appendix B) for computing the values of the distribution function of Z. We can also find an empirical approximation of each type of index (i.e. ID\(_{i/j}\) and ID\(_{i/j}^{*}\)) by means of the simulation method presented in Sect. 3.2, by computing sample frequencies for chosen pairs of observations. The following properties of the ID\(_{i/j}^{*}\) indices can be formulated:

  • from formula (14), where \(\mu _{i}>\vert \mu _{j}\vert \), and property (32) in Appendix B, it follows that for any pair of observations in a network we shall have \(\text {ID}_{i/j}^{*}> 0.5\). Figure 1 shows the dependence of ID\(_{i/j}^{*}\) on the magnitude of the correlation \(\vert \rho _{ij}\vert \) (\(\vert \rho _{ij}\vert <1\)), obtained with \(\mu _i =\sqrt{\lambda }= 4.13\) and \({\mu }_j ={\rho }_{ij} \sqrt{\lambda }=4.13\cdot {\rho }_{ij} \) [as in formula (14)]. We can see that the smaller \(\vert \rho _{ij}\vert \), the greater is ID\(_{i/j}^{*}\).

  • due to \(\rho (w_{i}, w_{j})=\rho (w_{j},w_{i})\), the ID\(_{i/j}^{*}\) indices are symmetrical within pairs of observations, i.e. ID\(_{i/j}^{*} = \text {ID}_{j/i}^{*}\).

  • for all the observations forming a RUE region in a network, we shall have \(\text {ID}_{i}=\text {ID}_{j}=\text {ID}_{k}=\cdots = \text{ ID }_\mathrm{Rue} \), where \(\text{ ID }_\mathrm{Rue} \) could be termed the identifiability index for a RUE region containing an outlier. For all pairs of observations within a RUE we shall have \(\text {ID}_{i/j}^{*}= 0\). The index \(\text{ ID }_\mathrm{Rue} \) does not apply to networks that form a RUE as a whole. In such networks \(\text {ID}_{i}=\text {ID}_{j}=\text {ID}_{k}=\cdots = 0\) and ID\(_{i/j}^{*}= 0\) for all the observations.

Fig. 1

Variability of the index ID\(_{i/j}^{*}\) as a function of correlation \(\rho (w_{i}\)\(w_{j})\)
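The curve of Fig. 1 can also be obtained without simulation. Under the joint normality of \((w_{i(i)}, w_{j(i)})\) with the parameters (11), (12) and (13), the difference and the sum of the two variables are uncorrelated (both w's have unit variance) and hence independent, which yields the closed form used in the Python sketch below. This is an elementary derivation offered only for illustration; it should agree with formula (31) of Appendix B.

```python
import numpy as np
from scipy.stats import norm

def partial_pseudo_id(mu_i, mu_j, rho):
    """ID*_{i/j} = P(|w_{j(i)}| < |w_{i(i)}|) for unit-variance jointly normal w's with correlation rho.

    |w_j| < |w_i| is equivalent to (w_i - w_j)(w_i + w_j) > 0; the two factors are independent,
    distributed N(mu_i - mu_j, 2 - 2*rho) and N(mu_i + mu_j, 2 + 2*rho)  (|rho| < 1 assumed).
    """
    a = (mu_i - mu_j) / np.sqrt(2.0 * (1.0 - rho))
    b = (mu_i + mu_j) / np.sqrt(2.0 * (1.0 + rho))
    return norm.cdf(a) * norm.cdf(b) + norm.cdf(-a) * norm.cdf(-b)

# The setting of Fig. 1: mu_i = sqrt(lambda) = 4.13 and mu_j = rho * 4.13, cf. (14)
rho = np.linspace(-0.99, 0.99, 199)
id_curve = partial_pseudo_id(4.13, 4.13 * rho, rho)
```

For \(\rho \rightarrow 0\) the value approaches 1, and for \(\vert \rho \vert \rightarrow 1\) it drops towards 0.5, in agreement with the properties listed above.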

3.4 Mis-identifiability indices and probabilities of III type errors

To cover in the a priori analysis the possibility of identifying the j-th, error-free, observation instead of the contaminated i-th observation, defined as a III type error (Hawkins 1980; Förstner 1983), we introduce mis-identifiability indices as shown below

$$\begin{aligned}&\text{MID}_{ij} ={\hbox {P}}( {{\hbox {Q}}_j \left| {{\bar{\hbox {R}}}_i} \right. }) \nonumber \\&{\hbox {Q}}_j =\frac{\left| {w_{1(i)} } \right| }{\left| {w_{j(i)} } \right| }< 1 \cap \cdots \cap \frac{\left| {w_{i(i)} } \right| }{\left| {w_{j(i)} } \right| }< 1 \cap \cdots \cap \frac{\left| {w_{n-1(i)} } \right| }{\left| {w_{j(i)} } \right| }<1, \quad i\ne j \end{aligned}$$
(23)

where \({\bar{\hbox {R}}}_i \) is as in (15).

The indices MID\(_{ij}\) correspond to the probabilities of committing III type errors, denoted by \(\gamma _{ij}\) (Förstner 1983). The indices are determined for gross errors of MDB magnitude.

Using the simulation method, we can obtain an empirical approximation of MID\(_{ij}\) by computing the sample frequency of the sets in which \(\left| {w_{j(i)} } \right| \) (such that \(\left| {w_{j(i)} } \right| >c\)) is dominating.

Taking into account the MID\(_{ij}\) indices for all observations \(j\ne i\), we may find the observation with the maximum value of \(\text{ MID }_{ ij}\), i.e. \(\text{ MID }_{{\textit{ij}},\mathrm{max}} \).

Realizing that all the MID\(_{ij}\) indices together with the index ID\(_{i}\) refer to disjoint events that form a complete set of events, we may formulate, on the basis of (15) and (23), the following relationship

$$\begin{aligned} \overline{\text{ ID }} _i =\text{1 }-\sum \limits _{j=1,j\ne i}^{n-\text{1 }} {\text{ MID }_{ij} } \end{aligned}$$
(24)

where n as in (15) is the number of all the observations in a network.

According to (24), with ID\(_{i}\) values greater than 0.5 there can be no observation with MID\(_{ij}>0.5\).

This confirms the well-known property that the greater the probability of finding the contaminated observation (e.g. Wang and Knight 2012), the smaller the probability of committing a III type error.

4 Proposed supplementation of the MDB concept for a priori analysis of network reliability

By definition, the MDB concept is not associated with outlier identifiability. Based on the study of outlier identifiability evaluation (Sect. 3), we propose supplementing the MDB concept as in formula (6) with the identifiability index ID\(_{i}\) as in formula (15). The pair (MDB\(_{i}\), ID\(_{i}\)) would characterize the minimal detectable error in a particular observation together with the chances for its identification in a network.

In the case of an unsatisfactory value of ID\(_{i}\), we may find the multiplying factor \(g_{i}\) as in (17) that shows the degree of magnification of MDB\(_{i}\) necessary to obtain a required level of outlier identifiability. We may also find the particular j-th (j \(\ne \) i) observation corresponding to the maximum probability of a III type error, i.e. \(\gamma _{ij}\) (see the mis-identifiability indices \(\text{ MID }_{ij,\mathrm{max}}\) in Sect. 3.4).

For a more detailed analysis of outlier identifiability, we may compute the index ID\(_{i}^{*}\) as well as the indices ID\(_{i/j}\) and ID\(_{i/j}^{*}\) for some chosen pairs of observations.

The SUPPLEMENTATION of the MDB for the i-th observation can thus be formed at the following two levels:

BASIC—ID\(_{i}\), \(g_{i}\), \(\text{ MID }_{ij,\mathrm{max}} \); AUXILIARY—ID\(_{i}^{*}\), ID\(_{i/j}\), ID\(_{i/k}\),..., ID\(_{i/j}^{*}\), ID\(_{i/k}^{*}\), ...

Additionally, by finding the response-based reliability measures (\(h_{ii}\),\(w_{ii})\) or (\(h_{ii}\),\(k_{i})\) for the analyzed i-th observation, we obtain in an indirect way a link between the network response and the indices ID\(_{i }\) and \(\text{ MID }_{ij,\mathrm{max}} \).

5 Common elements of the proposed approach with some chosen solutions in outlier separability testing

Although in the present paper the term “separability” is not used explicitly, the proposed identifiability indices can be considered, to some extent, as outlier separability measures. A direct link between the proposed approach and outlier separability analysis is provided by the mis-identifiability indices, being the maximum probabilities of III type errors. An analogy can be found between the proposed approach and that in Wang and Knight (2012). In the latter approach the concept of minimal separable bias (MSB) is presented, being the magnitude of the MDB increased by a multiplying factor so as to ensure identification of an outlier at a satisfactory confidence level (denoted there by \(1-\alpha _{s}\)). This corresponds in the present paper to the use of the partial pseudo-identifiability index ID\(_{i/j}^{*}\). Since the MDB can be increased by an iteratively determined multiplying factor so as to reach a corresponding level of partial pseudo-identifiability, we may obtain a bias equivalent to the MSB. One can also notice that the two options of the standardized separability test statistic (Wang and Knight 2012) contain exactly the arguments \({\hbox {h}}_{1}^{\prime }\) and \({\hbox {h}}_{2}^{\prime }\) of the distribution function \({\hbox {P}}(\mathrm{Z}< 1)\) as in formula (31) in the present paper. The ratio itself can be proposed as a test statistic for the above-mentioned separability test.

Analogies can also be expected between the proposed approach and multiple alternative hypotheses testing (Yang et al. 2013), with respect to the definitions of the probabilities of committing different types of errors as well as the relations between these probabilities.

6 Numerical examples

To illustrate the proposed approach, we use the levelling network analyzed in Knight et al. (2010) (Fig. 2a) and a GPS network (Fig. 2b). Referring to the former publication gives an opportunity to expand the conclusions reached there.

Fig. 2

Networks used in numerical examples

Example 1

For a network in Fig. 2a, we have

$$\begin{aligned} {{\mathbf {A}}}= & {} \left[ {{\begin{array}{rrr} 1 &{}0 &{}0 \\ {-1} &{}1 &{}0 \\ 0 &{}{-1} &{}0 \\ 0 &{}0 &{}1 \\ 0 &{}0 &{}{-1} \\ {-1} &{}0 &{}1 \\ \end{array} }} \right] ;\\ {{\mathbf {C}}}_\mathrm{S}= & {} \left[ {{\begin{array}{rrrrrr} {1.00} &{}{0.80} &{}{0.14} &{}{-0.59} &{}{-0.48} &{}{0.04} \\ {0.80} &{}{1.00} &{}{0.00} &{}{-0.17} &{}{-0.68} &{}{-0.30} \\ {0.14} &{}{0.00} &{}{1.00} &{}{-0.67} &{}{0.25} &{}{0.76} \\ {-0.59} &{}{-0.17} &{}{-0.67} &{}{1.00} &{}{-0.29} &{}{-0.76} \\ {-0.48} &{}{-0.68} &{}{0.25} &{}{-0.29} &{}{1.00} &{}{0.57} \\ {0.04} &{}{-0.30} &{}{0.76} &{}{-0.76} &{}{0.57} &{}{1.00} \\ \end{array} }} \right] \end{aligned}$$

To save space we show only the matrix \({\mathbf {H}}^\mathrm{T}{\mathbf {C}}_\mathrm{S}^{-1}{\mathbf {H}}\) and the correlation submatrix for the variables \(w_{2}\) and \(w_{3}\)

$$\begin{aligned}&{{\mathbf {H}}}^\mathrm{T} \mathrm{C}_\mathrm{S}^{-1} {{\mathbf {H}}}\!=\! \left[ \! {{\begin{array}{rrrrrr} {10.58} &{}\ {-{\mathbf{1.06}}} &{}\ {-{\mathbf{0.48}}} &{}\ {11.54} &{}\ {4.48} &{}\ {5.97} \\ {-{\mathbf{1.06}}} &{}\ {{\mathbf{0.62}}} &{}\ {{\mathbf{0.28}}} &{}\ {-{\mathbf{1.06}}} &{}\ {-{\mathbf{0.55}}} &{}\ {-{\mathbf{0.91}}} \\ {-{\mathbf{0.48}}} &{}\ {{\mathbf{0.28}}} &{}\ {{\mathbf{0.13}}} &{}\ {-{ \mathbf{0.48}}} &{}\ {-{\mathbf{0.25}}} &{}\ {-{\mathbf{0.41}}} \\ {11.54} &{}\ {-{\mathbf{1.06}}} &{}\ {-{\mathbf{0.48}}} &{}\ {13.68} &{}\ {5.07} &{}\ {6.46} \\ {4.48} &{}\ {-{\mathbf{0.55}}} &{}\ {-{\mathbf{0.25}}} &{}\ {5.07} &{}\ {1.95} &{}\ {2.59} \\ {5.97} &{}\ {-{\mathbf{0.91}}} &{}\ {-{\mathbf{0.41}}} &{}\ {6.46} &{}\ {2.59} &{}\ {3.56} \\ \end{array} }} \!\right] \\&\varvec{\rho } \left[ {{\begin{array}{r} {w_2 } \\ {w_3 } \\ \end{array} }} \right] =\left[ {{\begin{array}{rr} 1 &{}\quad 1 \\ 1 &{}\quad 1 \\ \end{array} }} \right] \end{aligned}$$

The mutually parallel column vectors (and row vectors) in \({\mathbf {H}}^\mathrm{T}{\mathbf {C}}_\mathrm{S}^{-1}{\mathbf {H}}\) are marked in bold. Further results of the analysis are given in Table 1 and Fig. 3.
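The RUE symptom visible above, i.e. parallel rows of \({\mathbf {H}}^\mathrm{T}{\mathbf {C}}_\mathrm{S}^{-1}{\mathbf {H}}\) and hence \(\vert \mathrm{cor}(w_2, w_3)\vert = 1\), can be flagged automatically; a small illustrative check in Python is sketched below (indices there are zero-based).

```python
import numpy as np

def rue_candidate_pairs(H, C_s, tol=1e-8):
    """Pairs of observations with |cor(w_i, w_j)| = 1, the symptom of a RUE (cf. Appendix A)."""
    M = H.T @ np.linalg.inv(C_s) @ H
    d = np.sqrt(np.diag(M))
    rho = M / np.outer(d, d)
    n = M.shape[0]
    return [(i, j) for i in range(n) for j in range(i + 1, n)
            if abs(rho[i, j]) > 1.0 - tol]
```

Applied to the exact \({\mathbf {H}}\) and \({\mathbf {C}}_\mathrm{S}\) of this network (rather than to the rounded values printed above), it should flag only the pair corresponding to observations 2 and 3.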

Fig. 3

Identifiability indices (ID\(_{i})\) shown in a (\(h_{ii}\),\(w_{ii})\) system for Network 1

Table 1 Results of internal reliability and identifiability analysis for the network 1

None of the observations satisfies the reliability criteria required for outlier-exposing responses. The network contains a RUE formed by observations 2 and 3. Hence, based on the results of the simulation method, we can write \({\overline{\mathrm{ID}}}_2 ={\overline{\mathrm{ID}}}_3 = {\overline{\mathrm{ID}}}_\mathrm{RUE} = 0.49\). The equal MDB values for these observations represent minimal detectable gross errors that are identifiable as located in the RUE region of the network, but are unidentifiable within this region, i.e. \(\text{ ID }_{2/3}^*= \text{ ID }_{3/2}^*= 0\).

It is difficult to find a relationship between the indices ID\(_{i}\) and the internal reliability measures \(r_{i}\) or (\(h_{ii}\), \(w_{ii}\)). Except for observation 5, the indices ID\(_{i}\) are not particularly differentiated and they represent a low level, slightly exceeding 0.5 for observation 4.

This level does not ensure sufficiently reliable identification of gross errors, which is reflected in the values of the \({\overline{\mathrm{MID}}}_{ij,\mathrm{max}} \) indices.

Below, we show the effect on ID\(_{i}\) of increasing the magnitude of a gross error by applying the multiplying factor \(g_{i}>1\) [see formula (17)] for observations 1, 5 and 6, i.e.

  • obs. 1;   \(g_1 = 2\),   \({\overline{\mathrm{ID}}}_1 = 0.63\);   \(g_1 = 3\),   \({\overline{\mathrm{ID}}}_1 = 0.79\);

  • obs. 5;   \(g_5 = 2\),   \({\overline{\mathrm{ID}}}_5 = 0.36\);   \(g_5 = 3\),   \({\overline{\mathrm{ID}}}_5 = 0.61\);

  • obs. 6;   \(g_6 = 2\),   \({\overline{\mathrm{ID}}}_6 = 0.67\);   \(g_6 = 3\),   \({\overline{\mathrm{ID}}}_6 = 0.82\).

The correlations \(\rho _{ij}\) between the w-variables for the observations not forming a RUE are, in absolute value, within the interval [0.36, 0.98]; hence, the values of the ID\(_{i/j}^{*}\) indices are within [0.64, 0.99] (see Fig. 1).

Example 2

We omit showing the design matrix A (\(12\times 4\)) (with elements 1 and \(-\)1), and confine the presentation of the covariance matrix C to the range of values of the standard deviations and correlation coefficients, i.e. \(\sigma \in [2.5, 3.2]\), \(\rho \in [-0.24, 0.21]\).

Fig. 4

Identifiability indices (ID\(_{i})\) shown in a (\(h_{ii}\),\(w_{ii})\) system for Network 2

As can be seen in Fig. 4, all the observations satisfy the reliability criteria required for outlier-exposing responses. The indices \({\overline{\mathrm{ID}}}_i \) attain high values, as they all lie in the interval [0.969, 0.992]. The values of the \({\overline{\mathrm{MID}}}_{ij,\mathrm{max}} \) indices are within the interval [0.002, 0.023]. This means that identification of a gross error of MDB magnitude in each observation is highly reliable.

The correlations \(\rho _{ij}\) between the w-variables, being in absolute value within the interval [0.002, 0.563], are much smaller than in Network 1. Consequently, the values of the ID\(_{i/j}^{*}\) indices are much greater (see Fig. 1), i.e. [0.973, 0.996].

The values of the \({\overline{\mathrm{ID}}}_i^*\) indices are only slightly smaller than those of \({\overline{\mathrm{ID}}}_i \), i.e. [0.943, 0.972], which confirms the relation discussed in Sect. 3.2.

7 Concluding remarks

For networks satisfying the response-based reliability criteria we have a high level of outlier identifiability, which is due to small correlations between w-variables.

By supplementing the MDB concept with the identifiability index, one may evaluate in an a priori analysis the probability of identifying a gross error of MDB magnitude in the first adjustment run. For the observations with small identifiability indices, one may find the magnitude of a gross error, greater than the MDB, necessary to ensure a satisfactory level of identifiability. By setting a certain requirement on the probability level, e.g. ID \(\ge \) 0.95, as well as on the level of mis-identification, e.g. \({\overline{\mathrm{MID}}}_{ij,\mathrm{max}} <0.02\), the proposed approach can be used in optimizing networks with respect to internal reliability. The identifiability index can also be useful in explaining discrepancies between the MDB values and the actual results of outlier identification. The significant discrepancies reported in some papers do not indicate weaknesses of the MDB concept but result from incorrectly treating the magnitudes of actually identified gross errors as quantities equivalent to the corresponding MDBs.

The proposal requires further clarification of its theoretical basis and testing on a wider range of observation systems. That would allow one to determine the optimal sample size for the simulation method and the actual accuracy of the empirical estimates. The working program used for this purpose also needs optimization to reduce its operation time. The relationship between the indices ID\(_{i}^{*}\) and ID\(_{i}\) deserves a more in-depth analysis.

The similar issue of reliable identification of outliers, termed outlier separability (Wang et al. 2012; Yang et al. 2013), only briefly touched upon in this paper, can serve as a future reference for a more detailed comparative analysis of the approach proposed herein.

The analogous approach for the case of multiple outliers is a much more complicated problem and is planned as the topic of future research. In seeking a solution, the interesting concept of the maximum MDB (Knight et al. 2010) will be taken into account. Also, the approach to correlation between multiple outlier detection statistics (i.e. the use of maximum correlation and global correlation coefficients), as in Wang et al. (2012), will play an important role in shaping the strategy of further research.