Abstract
The concept of minimal detectable bias (MDB), as initiated by Baarda (Publ Geod New Ser 2(5), 1968) and later developed by Wang and Chen (Acta Geodaet et Cartograph Sin Engl Edn 42–51, 1994), Schaffrin (J Eng Surv 123:126–137, 1997), Teunissen (IEEE Aerosp Electron Syst Mag 5(7):35–41, 1990, J Geod 72:236–244 1998, Testing theory: an introduction. Delft University Press, Delft, 2000) and others, refers to the issue of outlier detectability. A supplementation of the concept is proposed for the case of correlated observations contaminated with a single gross error. The supplementation consists mainly of an outlier identifiability index assigned to each individual observation in a network and a mis-identifiability index, being the maximum probability of identifying a wrong observation. These indices can be accompanied by the MDB multiplying factor, which shows by how much the MDB must be increased to raise the identifiability index to a satisfactory level. As auxiliary measures there are indices of partial identifiability concerning pairs of observations. The indices were derived assuming the generalized outlier identification procedure as in Knight et al. (J Geod. doi:10.1007/s00190-010-0392-4, 2010), which, when a single outlier is assumed, is similar to Baarda’s w-test (Baarda in Publ Geod New Ser 2(5), 1968). The following two options of identifiability indices and partial identifiability indices are distinguished: I. the indices related to identification of a contaminated observation within a set of observations suspected of containing a gross error (identifiability), II. the indices related to identification of a contaminated observation within the whole set of observations (pseudo-identifiability).
To place the proposed approach in the context of existing solutions to the related problem of separability testing, the properties of both types of identifiability indices are discussed with reference to the concept of Minimal Separable Bias (Wang and Knight in J Glob Position Syst 11(1):46–57, 2012) and the general approach in Yang et al. (J Geod 87(6):591–604, 2013). Numerical examples are provided to verify the proposed approach.
1 Introduction
The concept of minimal detectable error (Baarda 1968), later termed minimal detectable bias (MDB), was a pioneering tool for analysing the behaviour of a network in the presence of an outlier. Adopted as a measure of network internal reliability, it was meant to link the a priori analysis of network sensitivity to an outlier with the chances of detecting it. The original formula for MDB, covering the case of correlated observations, was later analyzed by Wang and Chen (1994), Schaffrin (1997), Teunissen (1990, 1998, 2000) and was further extended to the case of multiple outliers (Teunissen 2000; Knight et al. 2010). It was noticed in numerical tests that gross errors of MDB magnitude are often not identified, but identification can be successful at greater magnitudes (e.g. Hekimoglu and Erenoglu 2005). The concept of III-type error was introduced (Hawkins 1980; Förstner 1983) to cover situations in which an error-free observation is mistakenly identified as the one contaminated by a gross error.
The MDB concept itself does not cover the issue of outlier identifiability. It only determines the minimal magnitude of a gross error in a particular observation whose presence in a system can be disclosed through an excessive non-centrality effect in a global test. Hence, extending the MDB concept to the issue of outlier identifiability would be a desirable research task.
Likewise, the “response-based” measures of network internal reliability (Prószyński 2010), which provide reliability criteria clearly interpretable in terms of network responses to outliers, are not associated with the chances of outlier identification.
Taking into account the above description of the problem, the objective of the research was assumed to be the following:
i. to work out a method of evaluating the chances to identify a gross error of the MDB magnitude (assumed to be a single gross error in a system), and together with some other related characteristics to create supplementation of the MDB concept with regard to outlier identifiability,
ii. to propose a method for a priori evaluation of increase of MDB necessary to ensure that the thus obtained gross error can be reliably identified in practice,
iii. to provide probabilistic support for response-based reliability criteria with regard to outlier identifiability.
Since the identifiability issue has much in common with the concept of outlier separability, some elements common to the proposed approach and to selected existing methods of outlier separability analysis (Wang and Knight 2012; Yang et al. 2013) are discussed.
2 Preliminaries
The main part of the paper will be preceded with some preliminary statements and auxiliary concepts describing the approach and presenting the notation applied in the analyses.
2.1 Specifying the terms “detectable gross error” and “identifiable gross error”
Since the distinction between “outlier detection” and “outlier identification” is clearly defined (Teunissen 2000), we give some details that specify the approach to a priori analysis of outlier identifiability proposed in the present paper.
We confine the explanations to the case when a network is contaminated with a single gross error (i.e. one outlier case).
Detectable gross error: an observation error of such a magnitude that its presence in a network is signalled by the global model test statistic exceeding its critical value.
Identifiable gross error: a detectable gross error whose exact location in a network, i.e. the particular observation in which it resides, can be identified among the suspected observations in the first adjustment run, i.e. without subsequent diagnostic operations such as removal or re-weighting of observations. This is the case when, among all the outlier test statistics that exceed the critical value, the one of maximum absolute value corresponds to the contaminated observation.
In the above definition “outlier identification” is clearly separated from “outlier detection”, since it is meant as a subsequent process of forming the set of suspected observations and finding among them the contaminated observation.
Unidentifiable gross error: a detectable gross error located in a specific region of a network (consisting of at least two observations) in which all the observations obtain equal values of outlier test statistics. The error is unidentifiable within the region (Cen et al. 2003; Prószyński 2008).
The conditions concerning the existence of the Regions of Unidentifiable Errors (RUE) for correlated observations are derived in Appendix A.
2.2 GM model and the disturbance/response relationship
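Let us consider a GM model, written in an original form
$$\begin{aligned} \mathbf{y}=\mathbf{A}\mathbf{x}+\mathbf{e},\quad \mathbf{e}\sim (\mathbf{0},\mathbf{C}) \end{aligned}$$(1)
and in the equivalent modified form that exposes the correlation matrix (Prószyński 2010)
$$\begin{aligned} \mathbf{y}_\mathrm{s}=\mathbf{A}_\mathrm{s}\mathbf{x}+\mathbf{e}_\mathrm{s},\quad \mathbf{e}_\mathrm{s}\sim (\mathbf{0},\mathbf{C}_\mathrm{s}) \end{aligned}$$(2)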
where \(\mathbf{y}\) is the \(n\times 1\) vector of observations; \(\mathbf{A}\) the \(n\times u\) design matrix, rank \(\mathbf{A}=u-d\) (\(d\) being the system defect, \(d\ge 0\)); \(\mathbf{x}\) the \(u\times 1\) vector of unknown parameters; \(\mathbf{e}\) the \(n\times 1\) vector of random errors (we shall also use \({\mathbf v} = -{\mathbf e}\)); \(\mathbf{C}\) the \(n\times n\) covariance matrix of \(\mathbf{e}\) (positive definite), \({\mathbf C}=\sigma _{o}^{2}{\mathbf P}^{-1}=\sigma _{o}^{2}{\mathbf Q}\); \({\varvec{\sigma }} = (\mathrm{diag}\,\mathbf{C})^{1/2}\), \({\mathbf A}_\mathrm{s}={{\varvec{\sigma }}}^{-1}{\mathbf A},\,{\mathbf e}_\mathrm{s}={{\varvec{\sigma }}}^{-1}{\mathbf e},\, {\mathbf y}_\mathrm{s}={{\varvec{\sigma }}}^{-1}{\mathbf y},\, {\mathbf C}_\mathrm{s}={{\varvec{\sigma }}}^{-1}{\mathbf C}{\varvec{\sigma }}^{-1}\); \({\mathbf {C}}_\mathrm{s}\) is a correlation matrix; for uncorrelated observations \({\mathbf C}_\mathrm{s}=\mathbf{I}\).
The LS estimator of the vector \(\mathbf {v}_\mathbf{s}\), where \(\mathbf {v}_{\mathbf s}=-{\mathbf e}_{\mathbf s}\), is given by
$$\begin{aligned} \hat{\mathbf{v}}_\mathrm{s}=-\mathbf{H}\mathbf{y}_\mathrm{s} \end{aligned}$$(3)
where \(\mathbf{H}=\mathbf{I}-\mathbf{A}_\mathrm{s}(\mathbf{A}_\mathrm{s}^\mathrm{T}\mathbf{C}_\mathrm{s}^{-1}\mathbf{A}_\mathrm{s})^{+}\mathbf{A}_\mathrm{s}^\mathrm{T}\mathbf{C}_\mathrm{s}^{-1}\) is the modified reliability matrix (Prószyński 2010), i.e. the reliability matrix for the modified GM model as in (2), and \((\cdot )^{+}\) denotes the pseudo-inverse.
Decomposing the vector \({\mathbf y}_{\mathbf s}\), so that \({\mathbf y}_{\mathbf s}={\mathbf y}_{\mathbf s}^\mathrm{true}-{\mathbf v}_{\mathbf s}+\Delta \mathbf {y}_{\mathbf s}\), where \(\Delta \mathbf{y}_\mathrm{s}\) is the vector of standardized observation gross errors (i.e. standardized disturbances), and realizing that \(\mathbf{H}\mathbf{y}_{\mathrm{s}}^\mathrm{true}=\mathbf{0}\), we obtain (3) in the form
$$\begin{aligned} \hat{\mathbf{v}}_\mathrm{s}=\mathbf{H}\mathbf{v}_\mathrm{s}-\mathbf{H}\Delta \mathbf{y}_\mathrm{s} \end{aligned}$$(4)
Denoting the second term in (4) by \(\Delta {{\hat{\mathbf{v}}}}_\mathbf{s} \), being the vector of standardized increments in LS residuals (i.e. standardized responses), we get on its basis the well known disturbance/response relationship for the system (2), i.e.
$$\begin{aligned} \Delta \hat{\mathbf{v}}_\mathrm{s}=-\mathbf{H}\Delta \mathbf{y}_\mathrm{s} \end{aligned}$$(5)
where \({\Delta }{\hat{\mathbf {v}}}_{\mathrm{S}} =-\Delta {\hat{\mathbf {e}}}_{\mathrm{S}} \).
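The construction of the modified model and of the matrix H can be illustrated numerically. The following is a minimal sketch, not the author's program: the function name `modified_reliability_matrix`, the small levelling-type design matrix, the standard deviations and the equal correlations of 0.2 are all hypothetical choices made for illustration. It standardizes the model, forms H with a pseudo-inverse, and computes the response to a single standardized disturbance, assuming the response is obtained by applying \(-\mathbf{H}\) to the disturbance vector.

```python
import numpy as np

def modified_reliability_matrix(A, C):
    """Return (H, C_s, A_s) for the standardized (modified) GM model."""
    sigma = np.sqrt(np.diag(C))                    # standard deviations of observations
    A_s = A / sigma[:, None]                       # sigma^{-1} A
    C_s = C / np.outer(sigma, sigma)               # correlation matrix
    C_inv = np.linalg.inv(C_s)
    N_pinv = np.linalg.pinv(A_s.T @ C_inv @ A_s)   # pseudo-inverse also covers a defect d > 0
    H = np.eye(A.shape[0]) - A_s @ N_pinv @ A_s.T @ C_inv
    return H, C_s, A_s

# Hypothetical network: 6 height differences, 3 unknown heights (illustration only)
A = np.array([[1, 0, 0], [-1, 1, 0], [0, -1, 1],
              [0, 0, -1], [0, 1, 0], [-1, 0, 1]], dtype=float)
sigma = np.array([1.0, 1.2, 0.8, 1.0, 1.5, 1.1])
R = 0.2 * np.ones((6, 6)) + 0.8 * np.eye(6)        # assumed equal correlations of 0.2
C = R * np.outer(sigma, sigma)

H, C_s, A_s = modified_reliability_matrix(A, C)

dy = np.zeros(6)
dy[2] = 5.0                                        # standardized gross error in observation 3
dv = -H @ dy                                       # standardized response (assumed sign convention)
print(np.round(dv, 3))
```

In this sketch H annihilates the columns of the standardized design matrix, which is why the error-free part of the observations produces no response, and the trace of H equals the redundancy of the network.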
2.3 A short note on minimal detectable bias (MDB) and response-based reliability measures
Below, we present the formula for MDB as given in Wang and Chen (1994), Teunissen (1990, 1996), using the notation as in Sect. 2.2
where MDB\(_{i}\) is the minimal detectable bias in the i-th observation; its standardized form, \({\hbox {MDB}}_{{\mathrm{S},i}}=\mathrm{MDB}_{i}/\sigma _{i}\), is termed the controllability of the i-th observation; \(\sigma _{i}\) is the standard deviation of the i-th observation; \(\lambda \) the non-centrality parameter (as in a global model test); \(r_{i}\) a generalized reliability number for the i-th observation; \(\mathbf{C}_{\mathrm{S}}\), \(\mathbf{H}\) the matrices as in (2) and (3), respectively; \(\mathbf{H}^{T}\mathbf{C}_{\mathrm{S}}^{-1}\mathbf{H}=\mathbf{C}_{\mathrm{S}}^{-1}\mathbf{H}\).
The generalized reliability number \(r_{i}\) alone can also be considered as internal reliability measure (Caspary 1988).
The behaviour of a system in the presence of a single gross error can also be characterized by the so called response-based internal reliability measures (Prószyński 2010), derived on the basis of disturbance/response relationship (5), i.e. disregarding the random-error environment. For correlated observations the measures are the following pairs of indices
where \(h_{ii}\) the i-th diagonal element of the matrix H, \(w_{ii}\) the asymmetry index for the i-th row and the i-th column of the matrix H, \(k_{i}\) the ratio of the squared quasi-global response Q\(_{(i)}\) and the squared local response \(h_{ii}\) to an outlier in the i-th observation [see formula (28) in Appendix A].
The reliability criteria are the following
or, equivalently
They are derived from the postulate that the maximum system response should be located in the observation in which the gross error resides, and that the responses in other observations should be as small as possible (Prószyński 2010). When the criteria are met, there is a good chance of effectively identifying a single gross error residing in any of the observations. We can thus state that the criteria determine the area of outlier-exposing responses. The set of values (\(h_{ii}\), \(w_{ii})\) which form this area (see Figs. 2, 3) will be denoted by S\(_\mathrm{O}\).
It is not possible to interrelate the above two types of measures, i.e. \(r_{i}\) and (\(h_{ii}\),\(w_{ii})\) or (\(h_{ii}\),\(k_{i})\) on the grounds of rigorous matrix operations due to different generation principles. So, instead of direct interrelations we can establish indirect correspondence between these measures by finding their values on basis of the model (1) or model (2) components, as shown on a scheme below
3 A study on outlier identifiability evaluation in terms of probability
The a priori analysis of outlier identifiability presented in this paper refers in principle to the outlier identification procedure as in Knight et al. (2010). In that procedure the global model test is followed by the local outlier tests, resulting in a set of suspected outliers. The final outcome of the procedure is the observation with the maximum absolute value of the test statistic. The outlier test statistics (i.e. \(w^{2})\) are obtained from the mean-shift model. The global model test and the local outlier tests are coordinated by equalizing the non-centrality parameters and selecting the probabilities according to the \(\beta \)-Method (Baarda 1968). In the present paper, for finding the suspected outliers and identifying the contaminated observation, the \(\vert w\vert \) values are used instead of \(w^{2}\), assuming \(\left| w \right| _\mathrm{crit} =\sqrt{w_{\mathrm{{crit}}}^{{2}} } \).
The w-variables, being the standardized random variables, are defined by
where “i” denotes the observation contaminated with a gross error, “j” denotes any other observation; \(\hat{z}\) is the LS estimator of a gross error, obtained on basis of “mean-shift” model (Knight et al. 2010)
With one outlier case being assumed as in the present research, the above testing procedure is similar to Baarda’s w-test (Baarda 1968).
3.1 Parameters of outlier test statistics for the needs of identifiability analysis
In the notation of the present paper the w-variables as in (9) in a network contaminated with a single gross error \(\Delta y_{\mathrm{S},i}\), have the following detailed form
where \(\{\cdot \}_{i*}\) and \(\{\cdot \}_{j*}\) denote the i-th and the j-th row of \(\mathbf{H}^{T}\mathbf{C}_{\mathrm{S}}^{-1}\mathbf{H}\).
With e \(\sim \) N(0,C), and consequently \(\mathrm{\mathbf{e}}_\mathrm{s} \sim \) N(0, C \(_\mathbf{s})\), we get after simple operations
as in (Förstner 1983).
To analyze outlier identifiability, it is most reasonable to consider detectable gross errors, i.e. \(\Delta \)y\(_{{\mathrm{S},i}}\) \(\ge \) MDB\(_{{\mathrm{S},i}}\), where MDB\(_{{\mathrm{S},i}}\) as in (6). Substituting \(\Delta \)y\(_{{\mathrm{S},i}}\) = MDB\(_{{\mathrm{S},i}}\) into (11) and (12), we obtain
The formula (14) reflects a well known property (Förstner 1983) that with the defined I type and II type errors the correlation between the outlier test statistics is decisive for identification of the contaminated i-th observation.
3.2 Identifiability indices and their properties
To identify within a set of suspected observations the i-th observation in which a gross error of MDB magnitude resides, we need that \(\vert w_{i(i)}\vert \) is dominating over each of the corresponding absolute values for the remaining observations within this set. For each of the suspected observations we have \(\vert w\vert \) \(>\) \(\vert w\vert _\mathrm{crit}\), or for short \(\vert w\vert \) \(>\) c (we assume \(\left| w \right| _\mathrm{crit} =\sqrt{w_{\mathrm{{crit}}}^{{2}}} )\).
Using \(\frac{\left| {w_{j(i)} } \right| }{\left| {w_{i(i)} } \right| }<1\) as an equivalent condition to \(\vert w_{i(i)}\vert >\vert w_{j(i)}\vert \), we may form for the i-th observation an identifiability index denoted as ID\(_{i}\), defined in terms of conditional probability
where
where non-centralities \(\mu \) of the w-variables are determined for \(\text {MDB}_{s,i}\) as in (14); \({\bar{\hbox {R}}}_i \) being an event opposite to \({\hbox {R}}_{i}\), contains all possible sets of suspected observations, each corresponding to a particular distribution of random errors in a single measurement of a network.
In formulating \({\hbox {Q}}_{i}\), we take into account the fact that domination of \(\vert w_{i(i)}\vert \) within the set of all the observations implies its domination within any set of suspected observations containing \(w_{i(i)}\).
Using for each component in \({\hbox {Q}}_{i}\) a symbol Z as for a ratio of two folded normal variables (see Appendix B), we may write (15) in the form
Due to a high complexity of the definition (15), increasing with the number (n) of observations in a network, an empirical method based on numerical simulation of random observation errors was applied in the research. The method consists in:
- simulating numerically a certain number (e.g. 1000) of n-dimensional vectors of correlated random errors (according to a given C);
- computing w-variables for each vector of random errors using the formulas (10), the systematic components being as in (14);
- after elimination of the sets of w-variables where the critical values are not exceeded, computing sample frequency for the sets where \(\left| w \right| \) for a contaminated observation (such that \(\vert w\vert >c\)) is dominating, the sample frequency being an empirical approximation of ID. As a check on the correctness of the simulation procedure, the sample frequency for the eliminated sets of w-variables (i.e. with \(\vert w\vert <c\)) was used as an empirical approximation of the II type error probability \(\beta \).
To extend the scope of identifiability analysis, the computer program written for the method contains the formulas (10) in a modified form introducing a multiplying factor, such that the systematic components are as follows
which corresponds to the use of \(\Delta \) y \(_{{\mathrm{S},i}}\) = \(g_{i}\) \(\cdot \) MDB\(_{{\mathrm{S},i}}\).
This modification can be used in case of unsatisfactory values of ID\(_{i}\) obtained with \(\Delta \) y \(_{{\mathrm{S},i}}\) = MDB\(_{{\mathrm{S},i}}\).
We do not have exact theoretical reference for evaluating the accuracy of the simulation method. Therefore, we may only analyze the degree of dispersion of the ID values for different sets of simulated data and different observations in a network. The estimates obtained in that way for networks in Examples 1 and 2 (see Sect. 6) with 1000 simulations used are within \(\pm \) 1 or \(\pm \) 2 % (standard deviations).
For the purpose of this study we also consider identification of the contaminated observation without setting restrictions on the values of w-variables. Such a procedure, which also covers outlier detection, is a departure from the assumed definition of outlier identification (see Sect. 2.1) and will be termed pseudo-identification. Consequently, we shall operate with a pseudo-identifiability index, denoted by ID\(_{i}^{*}\), and having the form
$$\begin{aligned} \mathrm{ID}_{i}^{*}={\hbox {P}}( {\hbox {Q}}_{i}) \end{aligned}$$(18)
where \({\hbox {Q}}_{i}\) as in (15).
Although to a smaller degree than in the case of \({\hbox {P}}( {{\hbox {Q}}_i \left| {{\bar{\hbox {R}}}_i } \right. })\) (15), finding P(Q\(_{i})\) is still a complex computation task. However, we may get empirical approximation of this index (\({\overline{\text{ ID }}}_i^*)\) by means of slightly modified simulation method.
On the grounds of probability theory some relations can be established between ID\(_{i}^{*}\) and ID\(_{i}\)
where \({\hbox {P}}( {\bar{\hbox {R}}_i })=1-{\beta }\),
and hence
Assuming that \({\hbox {P}} ({\{}{\hbox {Q}}_{i}\) \(\cap \) \({\hbox {R}}_{i}\)}) \(>\) 0, we get
Hypothetically, the case ID\(_{i}^{*}\) = ID\(_{i}\) might occur when \({\hbox {P}}( {{\hbox {Q}}_i \cap {\bar{\hbox {R}}}_i })={\hbox {P}}( {{\hbox {Q}}_i })\cdot {\hbox {P}}( {{\bar{\hbox {R}}}_i })\), i.e. when Q\(_{i}\) and \({\bar{\hbox {R}}}_i \) were independent events. Then with \(\text {ID}_{i}=1\), we would also have \(\text {ID}_{i}^{*}=1\), which would imply domination of \(\vert w_{i(i)}\vert \) in each possible set in R. Since the above independence is only a detached theoretical assumption, we can only state that ID\(_{i}\) is an unattainable upper limit for ID\(_{i}^{*}\).
The above relations have been confirmed by the results obtained from the simulation method.
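The simulation method of this section can be sketched as follows. This is an illustrative sketch under stated assumptions, not the author's program: the w-statistic is assumed to take the standard form \(w_k=\{\mathbf{C}_\mathrm{s}^{-1}\hat{\mathbf{v}}_\mathrm{s}\}_k/\sqrt{\{\mathbf{C}_\mathrm{s}^{-1}\mathbf{H}\}_{kk}}\), the standardized MDB is assumed to be \(\sqrt{\lambda /\{\mathbf{C}_\mathrm{s}^{-1}\mathbf{H}\}_{ii}}\) (which yields the non-centrality \(\sqrt{\lambda }\) for the contaminated observation), and the conditioning event is taken to be \(\vert w_{i(i)}\vert >c\). The critical value c = 3.29, the function name `simulate_identifiability` and the example network are hypothetical; \(\sqrt{\lambda }=4.13\) is the value quoted in Sect. 3.3.

```python
import numpy as np

def simulate_identifiability(A, C, i, lam=4.13**2, c=3.29, g=1.0,
                             n_sim=1000, seed=0):
    """Monte Carlo estimates of ID_i, ID_i^* and beta for a gross error g * MDB in obs. i."""
    rng = np.random.default_rng(seed)
    n = A.shape[0]
    sigma = np.sqrt(np.diag(C))
    A_s = A / sigma[:, None]                       # standardized design matrix
    C_s = C / np.outer(sigma, sigma)               # correlation matrix
    C_inv = np.linalg.inv(C_s)
    H = np.eye(n) - A_s @ np.linalg.pinv(A_s.T @ C_inv @ A_s) @ A_s.T @ C_inv
    m = np.diag(C_inv @ H)                         # {C_s^{-1} H}_kk, assumed positive
    mdb_s = np.sqrt(lam / m[i])                    # assumed standardized MDB of obs. i

    # Correlated standardized random errors, one row per simulated measurement
    E = rng.standard_normal((n_sim, n)) @ np.linalg.cholesky(C_s).T
    E[:, i] += g * mdb_s                           # add the gross error to obs. i
    V = -E @ H.T                                   # residuals (the error-free part gives no response)
    W = np.abs(V @ C_inv) / np.sqrt(m)             # |w| for every observation

    wins = W.argmax(axis=1) == i                   # |w_i| dominates all other |w_j|
    suspected = W[:, i] > c                        # contaminated observation exceeds c
    return {"ID": wins[suspected].mean(),          # conditional frequency
            "ID_star": wins.mean(),                # unconditional frequency
            "beta": 1.0 - suspected.mean()}        # check value, expected near 0.20

# Hypothetical network (illustration only)
A = np.array([[1, 0, 0], [-1, 1, 0], [0, -1, 1],
              [0, 0, -1], [0, 1, 0], [-1, 0, 1]], dtype=float)
sd = np.array([1.0, 1.2, 0.8, 1.0, 1.5, 1.1])
C = (0.2 * np.ones((6, 6)) + 0.8 * np.eye(6)) * np.outer(sd, sd)

res = simulate_identifiability(A, C, i=0, n_sim=20000)
res_g3 = simulate_identifiability(A, C, i=0, g=3.0, n_sim=20000)
print(res, res_g3)
```

Setting g = 1 corresponds to a gross error of MDB magnitude; larger g mimics the multiplying factor. The estimate of \(\beta \) serves only as a consistency check on the simulation, in the spirit of the check described above.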
3.3 Partial identifiability indices and their properties
As an auxiliary tool for network analysis, the partial identifiability indices for pairs of observations were introduced, i.e. for the i-th observation contaminated by a gross error and the j-th observation being error-free. Similarly to two options of identifiability indices (Sect. 3.2) we distinguish
- partial identifiability index ID\(_{i/j}\)
$$\begin{aligned} \text{ ID }_{i/j} ={\hbox {P}}\left( {\frac{\left| {w_{j(i)} } \right| }{\left| {w_{i(i)} } \right| }<1\big | {\bar{\hbox {R}}}_{ij} }\right) \end{aligned}$$(21)
where \({\hbox {R}}_{ij}=\{\vert w_{i(i)}\vert <c\} \cup \{\vert w_{j(i)}\vert <c\}\), or in notation of (16)
$$\begin{aligned} \text{ ID }_{i/j} ={\hbox {P}}\left( {\mathrm{Z}_{j(i)} <1\big | {\bar{\hbox {R}}}_{ij} }\right) \end{aligned}$$
- partial pseudo-identifiability index ID\(_{i/j}^{*}\)
$$\begin{aligned} \mathrm{ID}_{i/j}^*={\hbox {P}}\left( {\frac{\left| {{w}_{j(i)} } \right| }{\left| {{w}_{i(i)} } \right| }<1}\right) \!, \end{aligned}$$(22)
or in notation of (16),
$$\begin{aligned} \text {ID}_{i/j}^{*}={\hbox {P}}\{\mathrm{Z}_{j(i)}<1\} \end{aligned}$$
The indices are the values of the distribution function of a ratio of two folded normal variables. In the case of ID\(_{i/j}\) the space of the values of w-variables is reduced in terms of absolute values, assuming that both the i-th and the j-th observation are elements of a set of suspected observations. For finding the values of ID\(_{i/j}^{*}\), MATLAB-based software has been developed (Appendix B) for computing the values of the distribution function of Z. We can also find an empirical approximation of each type of index (i.e. ID\(_{i/j}\) and ID\(_{i/j}^{*})\) by means of the simulation method presented in Sect. 3.2, by computing sample frequencies for chosen pairs of observations.
The following properties of ID\(_{i/j}^{*}\) indices can be formulated:
- from the formula (14), where \(\mu _{i}>\vert \mu _{j}\vert \), and the property (32) in Appendix B, it follows that for any pair of observations in a network we shall have \(\text {ID}_{i/j}^{*}> 0.5\). Figure 1 shows the dependence of ID\(_{i/j}^{*}\) on the magnitude of correlation \(\vert \rho _{ij}\vert \) (\(\vert \rho _{ij}\vert <1\)), obtained with \(\mu _i =\sqrt{\lambda }= 4.13\) and \({\mu }_j ={\rho }_{ij} \sqrt{\lambda }=4.13\cdot {\rho }_{ij} \) [as in formula (14)]. We can see that the smaller \(\vert \rho _{ij}\vert \), the greater is ID\(_{i/j}^{*}\).
- due to \(\rho (w_{i}, w_{j})=\rho (w_{j},w_{i})\), the ID\(_{i/j}^{*}\) indices are symmetrical within pairs of observations, i.e. ID\(_{i/j}^{*} = \text {ID}_{j/i}^{*}\).
- for all the observations forming a RUE region in a network, we shall have \(\text {ID}_{i}=\text {ID}_{j}=\text {ID}_{k}\ldots = \text{ ID }_\mathrm{Rue} \), where \(\text{ ID }_\mathrm{Rue} \) could be termed the identifiability index for a RUE region containing an outlier. For all pairs of observations within a RUE we shall have \(\text {ID}_{i/j}^{*}= 0\). The index \(\text{ ID }_\mathrm{Rue} \) does not apply to networks being a RUE as a whole. In such networks \(\text {ID}_{i}=\text {ID}_{j}=\text {ID}_{k}\ldots = 0\) and ID\(_{i/j}^{*}= 0\) for all the observations.
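The dependence of ID\(_{i/j}^{*}\) on \(\vert \rho _{ij}\vert \) can also be checked with a short calculation that avoids the ratio distribution altogether. This is a swapped-in technique, not the Appendix B routine: it assumes that \(w_{i}\) and \(w_{j}\) are jointly normal with unit variances, correlation \(\rho \) (\(\vert \rho \vert <1\)) and means \(\sqrt{\lambda }\) and \(\rho \sqrt{\lambda }\). Under that assumption \(a=w_{i}-w_{j}\) and \(b=w_{i}+w_{j}\) are uncorrelated, hence independent, and \(\vert w_{j}\vert <\vert w_{i}\vert \) holds exactly when a and b have the same sign. The function names are hypothetical.

```python
from math import erf, sqrt

def phi(x):
    """Standard normal distribution function."""
    return 0.5 * (1.0 + erf(x / sqrt(2.0)))

def partial_pseudo_id(rho, sqrt_lam=4.13, g=1.0):
    """P(|w_j| < |w_i|) for unit-variance normals with correlation rho, |rho| < 1."""
    mu_i = g * sqrt_lam              # non-centrality of the contaminated observation
    mu_j = rho * mu_i                # non-centrality of the error-free observation
    # a = w_i - w_j ~ N(mu_i - mu_j, 2(1 - rho)), b = w_i + w_j ~ N(mu_i + mu_j, 2(1 + rho))
    p_a = phi((mu_i - mu_j) / sqrt(2.0 * (1.0 - rho)))   # P(a > 0)
    p_b = phi((mu_i + mu_j) / sqrt(2.0 * (1.0 + rho)))   # P(b > 0)
    return p_a * p_b + (1.0 - p_a) * (1.0 - p_b)         # a and b share a sign

for rho in (0.002, 0.36, 0.563, 0.98):
    print(rho, round(partial_pseudo_id(rho), 3))
```

Under these assumptions the value depends only on \(\vert \rho \vert \), stays above 0.5 for \(\vert \rho \vert <1\) and decreases as \(\vert \rho \vert \) grows, in line with the first two properties listed above. The parameter g is an optional multiplying factor applied to the gross error.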
3.4 Mis-identifiability indices and probabilities of III type errors
To cover in a priori analysis the possibility of identifying the j-th error-free observation instead of the contaminated i-th observation, defined as III type error (Hawkins 1980; Förstner 1983), we introduce mis-identifiability indices as shown below
where \({\bar{\hbox {R}}}_i \) as in (19).
The indices MID\(_{ij}\) correspond to probabilities of committing III type errors, denoted by \(\upgamma _{ij}\) (Förstner 1983). The indices are determined for gross errors of the MDB magnitudes.
Using the simulation method we can get empirical approximation of MID\(_{ij}\) by computing sample frequency for the sets where \(\left| {w_{j(i)} } \right| \) (such that \(\left| {w_{j(i)} } \right| >\) c) is dominating
Taking into account the MID\(_{ij}\) indices for all the j-th observations, we may find the observation with maximum value of \(\text{ MID }_{ ij}\), i.e. \(\text{ MID }_{{\textit{ij}},\mathrm{max}} \).
Realizing that all MID\(_{ij}\) indices together with ID\(_{i}\) indices refer to disjoint events that form a complete event, we may formulate on basis of (15) and (23) the following relationship
where n as in (15) is the number of all the observations in a network.
According to (24), with the ID\(_{i}\) values being greater than 0.5 there can be no observation with MID\(_{ij}>\) 0.5.
This confirms the well known property, that the greater the probability of finding the contaminated observation (e.g. Wang and Knight 2012), the smaller is the probability of committing the III-type error.
4 Proposed supplementation of the MDB concept for a priori analysis of network reliability
By definition the MDB concept is not associated with outlier identifiability. Based on the study of outlier identifiability evaluation (Sect. 3), we propose supplementation of the MDB concept as in formula (6) with identifiability index ID\(_{i}\), as in formula (15). The pair (MDB\(_{i}\), ID\(_{i }\)) would characterize the minimal detectable error in a particular observation together with the chances for its identification in a network.
In case of an unsatisfactory value of ID\(_{i}\), we may find the multiplying factor \(g_{i}\) as in (17) that shows the degree of magnification of MDB\(_{i}\) necessary to obtain a required level of outlier identifiability. We may also find the particular j-th (\(j\ne i\)) observation corresponding to the maximum probability of III type error, i.e. \(\gamma _{ij}\) (see mis-identifiability indices \(\text{ MID }_{ij,\mathrm{max}}\) in Sect. 3.4).
For more detailed analysis of outlier identifiability, we may compute the index ID\(_{i}^{*}\) and the indices ID\(_{i/j }\) and ID\(_{i/j}^{*}\) for some chosen pairs of observations.
SUPPLEMENTATION of MDB for the i-th observation can thus be formed in the following two levels:
BASIC: ID\(_{i}\), \(g_{i}\), \(\text{ MID }_{ij,\mathrm{max}} \); AUXILIARY: ID\(_{i}^{*}\), ID\(_{i/j}\), ID\(_{i/k}\), ..., ID\(_{i/j}^{*}\), ID\(_{i/k}^{*}\), ...
Additionally, by finding the response-based reliability measures (\(h_{ii}\),\(w_{ii})\) or (\(h_{ii}\),\(k_{i})\) for the analyzed i-th observation, we obtain in an indirect way a link between the network response and the indices ID\(_{i }\) and \(\text{ MID }_{ij,\mathrm{max}} \).
5 Common elements of the proposed approach with some chosen solutions in outlier separability testing
Although in the present paper the term “separability” is not used explicitly, the proposed identifiability indices can be considered to some extent as outlier separability measures. A direct link between the proposed approach and outlier separability analysis is provided by the mis-identifiability indices, being the maximum probabilities of III type errors. An analogy can be found between the proposed approach and that in (Wang and Knight 2012). In the latter approach the concept of minimal separable bias (MSB) is presented, being the magnitude of MDB increased by a multiplying factor so as to ensure identification of an outlier at a satisfactory confidence level (denoted there as \(1-\alpha _{s})\). This corresponds in the present paper to the use of the partial pseudo-identifiability index ID\(_{i/j}^{*}\). Since MDB can be increased by an iteratively determined multiplying factor to reach a corresponding level of partial pseudo-identifiability, we may obtain a bias equivalent to MSB. One can also notice that the two options of the standardized separability test statistic (Wang and Knight 2012) contain exactly the arguments \({\hbox {h}}_{1}^{\prime }\) and \({\hbox {h}}_{2}^{\prime }\) of a distribution function \({\hbox {P}}(\mathrm{Z}< 1)\) as in formula (31) in the present paper. The ratio itself can be proposed as a test statistic for the above-mentioned separability test.
Analogies can also be expected between the proposed approach and multiple alternative hypotheses testing (Yang et al. 2013) with respect to definitions of probabilities of committing different types of errors as well as in the relations between these probabilities.
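The iterative determination of a multiplying factor that brings the partial pseudo-identifiability to a required level can be sketched with a simple bisection. This is only an illustration of the search, not the MSB formula of Wang and Knight (2012) and not the author's program. It assumes jointly normal w-variables with unit variances, correlation \(\rho \) (\(\vert \rho \vert <1\)) and non-centralities \(g\sqrt{\lambda }\) and \(g\rho \sqrt{\lambda }\), and uses a sum/difference transformation instead of the ratio distribution. The function names, the target level 0.95 and the search bound `g_max` are hypothetical.

```python
from math import erf, sqrt

def phi(x):
    """Standard normal distribution function."""
    return 0.5 * (1.0 + erf(x / sqrt(2.0)))

def partial_pseudo_id(rho, g, sqrt_lam=4.13):
    """P(|w_j| < |w_i|) for a gross error of g times the MDB (assumed model)."""
    p_a = phi(g * sqrt_lam * sqrt((1.0 - rho) / 2.0))   # P(w_i - w_j > 0)
    p_b = phi(g * sqrt_lam * sqrt((1.0 + rho) / 2.0))   # P(w_i + w_j > 0)
    return p_a * p_b + (1.0 - p_a) * (1.0 - p_b)

def multiplying_factor(rho, target=0.95, g_max=50.0, tol=1e-8):
    """Smallest g >= 1 with partial_pseudo_id(rho, g) >= target, found by bisection."""
    if partial_pseudo_id(rho, 1.0) >= target:
        return 1.0                                      # MDB itself is already sufficient
    if partial_pseudo_id(rho, g_max) < target:
        raise ValueError("target not reachable within g_max")
    lo, hi = 1.0, g_max
    while hi - lo > tol:                                # index increases with g for |rho| < 1
        mid = 0.5 * (lo + hi)
        if partial_pseudo_id(rho, mid) < target:
            lo = mid
        else:
            hi = mid
    return hi

g = multiplying_factor(0.98)
print(round(g, 3), round(partial_pseudo_id(0.98, g), 4))
```

Bisection is adequate here because, under the stated assumptions, the index grows monotonically with g. For weakly correlated pairs the function returns 1, meaning that no magnification of MDB is needed at the chosen level.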
6 Numerical examples
To illustrate the proposed approach we use a levelling network analyzed in Knight et al. (2010) (Fig. 2a) and a GPS network (Fig. 2b). Referring to the first publication gives opportunity to expand the conclusions reached there.
Example 1
For a network in Fig. 2a, we have
To save space we show only the matrix H \(^\mathrm{T}\) C \(_\mathrm{S}^{-1}\) H and a correlation submatrix for the variables \(w_{2}\) and \(w_{3}\)
The mutually parallel column-vectors (and row-vectors) in H \(^\mathrm{T}\) C \(_\mathrm{S}^{-1}\) H are marked in bold. Further results of analysis are given in Table 1 and Fig. 3.
None of the observations satisfy the reliability criteria required for outlier-exposing responses. The network contains RUE formed by the observations 2 and 3. Hence, based on the results of the simulation method we can write \({\overline{\mathrm{ID}}}_2 ={\overline{\mathrm{ID}}}_3 = {\overline{\mathrm{ID}}}_\mathrm{RUE} = 0.49\). The equal MDB values for these observations, represent minimal detectable gross errors that are identifiable as located in the RUE region of a network, but are unidentifiable within this region, i.e. \(\text{ ID }_{2/3}^*= \text{ ID }_{3/2}^*= 0\).
It is difficult to establish a relationship between the indices ID\(_{i}\) and the internal reliability measures \(r_{i}\) or (\(h_{ii}\),\(w_{ii})\). Except for observation 5, the indices ID\(_{i}\) differ little from one another and remain at a low level, slightly exceeding 0.5 for observation 4.
This level does not ensure a sufficiently reliable identification of gross errors. This is reflected in the values of \({\overline{\mathrm{MID}}}_{ij,\mathrm{max}} \) indices.
Below, we show the effect upon ID\(_{i}\) of increasing the magnitude of a gross error by applying the multiplying factor \(g_{i}>1\) [see formula (17)] for observations 1, 5 and 6, i.e.
- obs. 1; \(g_1 = 2\), \({\overline{\mathrm{ID}}}_1 = 0.63\); \(g_1 = 3\), \({\overline{\mathrm{ID}}}_1 = 0.79\);
- obs. 5; \(g_5 = 2\), \({\overline{\mathrm{ID}}}_5 = 0.36\); \(g_5 = 3\), \({\overline{\mathrm{ID}}}_5 = 0.61\);
- obs. 6; \(g_6 = 2\), \({\overline{\mathrm{ID}}}_6 = 0.67\); \(g_6 = 3\), \({\overline{\mathrm{ID}}}_6 = 0.82\).
The correlations \(\rho _{ij}\) between the w-variables for the observations not forming a RUE are in absolute values within the interval [0.36, 0.98], and hence, the values of ID\(_{i/j}^{*}\) indices are within [0.64, 0.99] (see Fig. 1).
Example 2
We omit showing the design matrix A (\(12\times 4\)) (with elements 1 and \(-\)1), and confine presentation of the covariance matrix C to the range of values of standard deviations and correlation coefficients, i.e. \(\sigma \in [2.5, 3.2]\), \(\rho \in [-0.24, 0.21]\).
As we can see in Fig. 4, all the observations satisfy the reliability criteria required for outlier-exposing responses. The indices \({\overline{\mathrm{ID}}}_i \) take high values, as they all lie in the interval [0.969, 0.992]. The values of \({\overline{\mathrm{MID}}}_{ij,\mathrm{max}} \) indices are within the interval [0.002, 0.023]. This means that identification of a gross error of MDB magnitude in each observation is highly reliable.
The correlations \(\rho _{ij}\) between the w-variables, lying in absolute value within the interval [0.002, 0.563], are much smaller than in Example 1. Consequently, the values of ID\(_{i/j}^{*}\) indices are much greater (see Fig. 1), i.e. [0.973, 0.996].
The values of \({\overline{\mathrm{ID}}}_i^*\) indices are only slightly smaller than those of \({\overline{\mathrm{ID}}}_i \), i.e. [0.943, 0.972], which confirms the case discussed in Sect. 3.2.
7 Concluding remarks
For networks satisfying the response-based reliability criteria we have a high level of outlier identifiability, which is due to small correlations between w-variables.
By supplementing the MDB concept with the identifiability index one may evaluate, in an a priori analysis, the probability of identifying a gross error of MDB magnitude in the first adjustment run. For the observations with small identifiability indices one may find the magnitude of gross error, greater than MDB, necessary to ensure a satisfactory level of identifiability. By setting a certain requirement for the probability level, e.g. ID \(\ge \) 0.95, as well as for the level of mis-identification, e.g. \({\overline{\mathrm{MID}}}_{ij,\mathrm{max}} <0.02\), the proposed approach can be used in optimizing networks with respect to internal reliability. The identifiability index can also be useful in explaining discrepancies between the MDB values and the actual results of outlier identification. The significant discrepancies reported in some papers do not indicate weaknesses of the MDB concept but result from incorrectly treating the magnitudes of actually identified gross errors as quantities equivalent to the corresponding MDBs.
The proposal requires further clarification in terms of the theoretical basis and testing on a wider range of observation systems. That would allow one to determine the optimal sample size for the simulation method and the actual accuracy of empirical estimates. Also the working program used for this purpose needs optimization to reduce the operation time. The relationship between the indices ID\(_{i}^{*}\) and ID\(_{i}\) deserves a more in-depth analysis.
The similar issue of reliable identification of outliers, termed outlier separability (Wang et al. 2012; Yang et al. 2013) and only briefly touched upon in this paper, can serve as a reference for a more detailed comparative analysis of the approach proposed herein.
An analogous approach for the case of multiple outliers is a much more complicated problem and is planned as a topic of future research. In seeking a solution, the interesting concept of maximum MDB (Knight et al. 2010) will be taken into account. The approach to correlation between multiple outlier detection statistics (i.e. the use of maximum correlation and global correlation coefficients) as in (Wang et al. 2012) will also play an important role in shaping the strategy of further research.
References
Baarda W (1968) A testing procedure for use in geodetic networks. Publ Geod New Ser 2(5). Netherlands Geodetic Commission, Delft
Caspary WF (1988) Concepts of network and deformation analysis. Monograph 11, School of Surveying, The University of New South Wales, Kensington
Cen M, Li Z, Ding X, Zhuo J (2003) Gross error diagnostics before least squares adjustment of observations. J Geod 77:503–513
Förstner W (1983) Reliability and discernability of extended Gauss–Markov models. Deutsche Geodätische Kommission, Reihe A, No. 98, München
Hawkins DM (1980) Identification of outliers. Chapman and Hall, New York
Hekimoglu S, Erenoglu RC (2005) A test for Baarda’s internal reliability theory. In: Proceedings of international symposium on “Modern technologies, education and professional practice in geodesy and related fields”, Sofia, Bulgaria
Kim H-J (2006) On the ratio of two folded normal distributions. Commun Stat Theory Methods 35:965–977
Kim H-J (2014) Some distributional properties of ratio of two folded normals. Technical report, Statistics Department, Dongguk University, Seoul, Korea
Knight NL, Wang J, Rizos C (2010) Generalised measures of reliability for multiple outliers. J Geod. doi:10.1007/s00190-010-0392-4
Prószyński W (2008) The vector space of imperceptible observation errors: a supplement to the theory of network reliability. Geod Cartogr 57(1):3–19
Prószyński W (2010) Another approach to reliability measures for systems with correlated observations. J Geod 84:547–556
Schaffrin B (1997) Reliability measures for correlated observations. J Eng Surv 123:126–137
Teunissen PJG (1990) Quality control in integrated navigation systems. IEEE Aerosp Electron Syst Mag 5(7):35–41
Teunissen PJG (1996) Testing theory, an introduction. Delft University Press, Delft
Teunissen PJG (1998) Minimal detectable biases of GPS data. J Geod 72:236–244
Teunissen PJG (2000) Testing theory: an introduction. Delft University Press, Delft
Wang J, Chen Y (1994) On the reliability measure of observations. Acta Geodaet et Cartograph Sin Engl Edn 42–51
Wang J, Knight N (2012) New outlier separability test and its application in GNSS positioning. J Glob Position Syst 11(1):46–57
Wang J, Almagbile Y, Wu T, Tsujii T (2012) Correlation analysis for fault detection statistics in integrated GNSS/INS systems. J Glob Position Syst 11(2):89–99
Yang L, Wang J, Knight N, Shen Y (2013) Outlier separability analysis with a multiple alternative hypotheses test. J Geod 87(6):591–604
Acknowledgments
The author wishes to thank Prof. Kim Hao of Dongguk University for some valuable guidelines concerning the dependence case in the theory of the ratio of two folded normal variables. The guidelines were helpful in constructing a probabilistic measure for partial pseudo-identifiability. The author feels greatly indebted to Prof. Mieczysław Kwaśniak and Dr. Artur Wilkowski of Warsaw University of Technology for their assistance in performing the computations and creating the graphs.
Appendices
Appendix A
1. Condition for existence of RUE
Let us consider the following w-variables as in (Knight et al. 2010),
- with a gross error \(\Delta y_{s,i}\) in the i-th observation:
$$\begin{aligned} w_{i(i)} =\frac{\hat{z}_{i(i)} }{{\sigma }_{\hat{z}_{i(i)} } };\; w_{j(i)} =\frac{\hat{z}_{j(i)} }{{\sigma }_{\hat{z}_{j(i)} } } \end{aligned}$$
- with a gross error \(\Delta y_{s,j}\) in the j-th observation:
$$\begin{aligned} w_{i(j)} =\frac{\hat{z}_{i(j)}}{{\sigma }_{\hat{z}_{i(j)}}};\; w_{j(j)} =\frac{\hat{z}_{j(j)}}{{\sigma }_{\hat{z}_{j(j)}}} \end{aligned}$$
Applying the formula (10), we get the corresponding pairs of relationships
where \(\{\cdot \}_{i*}\), \(\{\cdot \}_{j*}\) are the i-th and the j-th row of \({\mathbf H}^\mathrm{T}{\mathbf C}_\mathrm{S}^{-1}\mathbf{H}\).
Requiring that \(\vert w_{i(i)}\vert = \vert w_{j(i)}\vert \) and \(\vert w_{i(j)}\vert = \vert w_{j(j)}\vert \), and taking into account that the w-variables can be of the same or opposite signs, we get after simple operations the condition
i.e. the i-th and the j-th row in \({\mathbf H}^\mathrm{T}{\mathbf C}_\mathrm{S}^{-1}\mathbf{H}\) are linearly dependent vectors with a positive or negative coefficient.
Applying (25) to pairs of the corresponding elements in the i-th and the j-th row, we get the following condition for the i-th and the j-th observation to be a RUE region
or, equivalently
The above reasoning can be extended to several observations in a network.
It follows immediately from (25) that networks with rank \(({\mathbf H}^\mathrm{T}{\mathbf C}_\mathrm{S}^{-1}\mathbf{H})=1\), i.e. where all the rows and columns are linearly dependent, are as a whole RUE regions, irrespective of the correlation matrix used.
Correlation of w-variables within RUE is represented by a submatrix (or a matrix) with non-diagonal elements being \(\vert \rho _{ij}\vert =1\).
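The rank-one case can be checked numerically. The following is a minimal sketch, not the author's program, under two assumptions that are not spelled out in this appendix: that \(\mathbf{H} = \mathbf{I} - \mathbf{A}(\mathbf{A}^\mathrm{T}\mathbf{C}^{-1}\mathbf{A})^{-1}\mathbf{A}^\mathrm{T}\mathbf{C}^{-1}\), and that the correlation of two w-variables is the normalized element of \(\mathbf{H}^\mathrm{T}\mathbf{C}^{-1}\mathbf{H}\). The design matrix `A`, the correlation matrix `C` and the function name `w_correlations` are hypothetical and chosen only for illustration.

```python
import numpy as np

def w_correlations(A, C):
    """Return (rho, M) for design matrix A and observation correlation matrix C.

    Assumed definitions (not taken from the paper's formula (10)):
      H   = I - A (A^T C^-1 A)^-1 A^T C^-1
      M   = H^T C^-1 H
      rho = M_ij / sqrt(M_ii * M_jj)   (correlation of w_i and w_j)
    """
    n = A.shape[0]
    C_inv = np.linalg.inv(C)
    N = A.T @ C_inv @ A
    H = np.eye(n) - A @ np.linalg.solve(N, A.T @ C_inv)
    M = H.T @ C_inv @ H
    d = np.sqrt(np.diag(M))
    return M / np.outer(d, d), M

# Hypothetical network: 3 observations, 2 unknowns, i.e. one redundant observation.
A = np.array([[1.0, 0.0],
              [0.0, 1.0],
              [1.0, 1.0]])
# Hypothetical (positive definite) correlation matrix of the observations.
C = np.array([[1.0, 0.3, 0.1],
              [0.3, 1.0, 0.2],
              [0.1, 0.2, 1.0]])

rho, M = w_correlations(A, C)
print("rank of H^T C^-1 H:", np.linalg.matrix_rank(M))
print("|rho|:")
print(np.round(np.abs(rho), 6))
```

With a single redundant observation the matrix has rank 1, so all off-diagonal correlations come out with absolute value 1 whatever positive definite `C` is supplied, which is the whole-network RUE situation described above.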
2. Condition excluding the existence of RUE
For \((h_{ii}, w_{ii}) \in \mathrm{S}_{O}\), \(i=1,\ldots ,n\), we have \(0.5< h_{ii} \le 1 \,\wedge \, 0< k_{i} < 1\).
Since
we obtain \(\vert h_{qi}\vert \le 0.5\) for \(q \ne i\), \(i = 1,\ldots ,n\). Hence, for any pair of observations, e.g. \(y_{i}\), \(y_{j}\), we get
Since \(\mathbf{H} \cdot \mathbf{A}_{\mathrm{S}} = \mathbf{0}\), and hence \({\mathbf H}^\mathrm{T}{\mathbf C}_{\mathrm{S}}^{-1}\mathbf{H} \cdot \mathbf{A}_{\mathrm{S}} = \mathbf{0}\), the inequality (29) implies that \({\mathbf H}^\mathrm{T}{\mathbf C}_{\mathrm{S}}^{-1}\mathbf{H}\), being of the same rank as \(\mathbf{H}\), has all determinants of the 1st and 2nd order positive, so
which contradicts the condition (27) for the existence of RUE in a network.
Appendix B
Probabilistic tool for a priori analysis of outlier partial pseudo-identifiability
Let us consider two independent normal variables \(X_{1}\sim \mathrm{N}(\mu _{1}, \sigma _{1}^{2})\), \(X_{2}\sim \mathrm{N}(\mu _{2}, \sigma _{2}^{2})\) and the corresponding folded normal variables \(\vert X_{1}\vert \sim \mathrm{FN}(\mu _{1}, \sigma _{1}^{2})\), \(\vert X_{2}\vert \sim \mathrm{FN}(\mu _{2}, \sigma _{2}^{2})\). The ratio \(\mathrm{Z}=\frac{\left| X_{1} \right| }{\left| X_{2} \right| }\) has distribution \(\mathrm{Z}\sim \mathrm{RFN}(\mu _{1}, \mu _{2}, \sigma _{1}^{2}, \sigma _{2}^{2})\) with distribution function \(F(z)\), \(z > 0\) (Kim 2006). The generalization of the approach to the dependent case is presented in (Kim 2014), where the distribution function of \(\mathrm{Z}\sim \mathrm{RFN}(\mu _{1}, \mu _{2}, \sigma _{1}^{2}, \sigma _{2}^{2}, \rho )\) is determined, valid for \(\vert \rho \vert < 1\).
Assuming \(\sigma _{1} = 1\), \(\sigma _{2}=1\) and \(z =1\), as needed for the analysis in the present paper, we obtain on the basis of the above-mentioned generalized distribution function a formula for \({\hbox {P}}(\mathrm{Z} < 1)\) as a function of \(\mu _{1}\), \(\mu _{2}\), \(\rho \), i.e.
where
\(\Phi (a) = {\hbox {P}}(X < a)\); \(X\sim \mathrm{N}(0, 1)\); \(L(a,~b,~\rho )\) can be equivalently replaced by \({\Phi }(-a, -b, \rho )\).
Several properties of the function \({\hbox {P}}(\mathrm{Z}<1)=f({\mu }_{1},~{\mu }_{2},\rho )\) as in (31) concerning the signs of its arguments, can be readily proved, e.g.
The properties can be helpful in simplifying tables or diagrams constructed for the function.
We present a property corresponding to the basic formulas used in identifiability analysis (14), Sect. 3.1, where \(\mu _{1} > 0\) and \(\mu _{2}\) is of the same sign as \(\rho \)
We can prove this property by substituting, in the right-hand side function of the above equality, \(\mu _{2}^{*} = -\mu _{2}\) and \(\rho ^{*} = -\rho \) for \(\mu _{2}\) and \(\rho \) respectively, and finally finding that
which are exactly the components of the formula (31) determining the value of the left-hand side function.
For computing the values of the function \({\hbox {P}}(\mathrm{Z}<1)=f({\mu }_{1}, {\mu }_{2}, \rho )\), software based on the MATLAB package was developed.
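Values of this function can also be cross-checked independently of formula (31). The sketch below is not the author's MATLAB software and does not reproduce (31); it relies on an elementary observation that holds for \(\sigma _{1}=\sigma _{2}=1\) and \(\vert \rho \vert <1\): the event \(\vert X_{1}\vert <\vert X_{2}\vert \) is the event \((X_{2}-X_{1})(X_{2}+X_{1})>0\), and the difference and the sum are uncorrelated, hence independent, normal variables with variances \(2(1-\rho )\) and \(2(1+\rho )\). The function names `p_z_less_1` and `p_z_less_1_mc` are hypothetical, and the Monte Carlo routine is only a sanity check of the closed form.

```python
import math
import random

def phi(x):
    """Distribution function of the standard normal variable."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

def p_z_less_1(mu1, mu2, rho):
    """P(|X1| < |X2|) for unit variances and correlation rho, |rho| < 1.

    Uses independence of U = X2 - X1 and V = X2 + X1 (equal variances),
    so that P = P(U > 0) P(V > 0) + P(U < 0) P(V < 0).
    """
    if abs(rho) >= 1.0:
        raise ValueError("the formula requires |rho| < 1")
    a = (mu2 - mu1) / math.sqrt(2.0 * (1.0 - rho))
    b = (mu2 + mu1) / math.sqrt(2.0 * (1.0 + rho))
    return phi(a) * phi(b) + phi(-a) * phi(-b)

def p_z_less_1_mc(mu1, mu2, rho, n=200_000, seed=1):
    """Monte Carlo estimate of the same probability (illustrative check)."""
    rng = random.Random(seed)
    s = math.sqrt(1.0 - rho * rho)
    hits = 0
    for _ in range(n):
        g1 = rng.gauss(0.0, 1.0)
        g2 = rng.gauss(0.0, 1.0)
        x1 = mu1 + g1
        x2 = mu2 + rho * g1 + s * g2  # unit variance, correlation rho with x1
        if abs(x1) < abs(x2):
            hits += 1
    return hits / n

exact = p_z_less_1(2.0, 1.0, 0.7)
simulated = p_z_less_1_mc(2.0, 1.0, 0.7)
print(f"closed form: {exact:.4f}, simulation: {simulated:.4f}")
```

The closed form also makes the sign symmetry easy to see: changing the signs of both \(\mu _{2}\) and \(\rho \) corresponds to replacing \(X_{2}\) by \(-X_{2}\), which leaves \(\vert X_{2}\vert \) and therefore the probability unchanged.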
On the basis of computations we may list some important properties useful for interpretation of the results of identifiability analysis. To visualize the properties we show a graph of the function \({\hbox {P}}(\mathrm{Z}<1)=f({\mu }_{1},~{\mu }_{2},~\rho )\) for \(0 \le \mu _{1} \le 6\), \(0 \le \mu _{2} \le 6\), \(\rho =0.7\) (Fig. 5):
A specific case, when \(\left| {{\mu }_1 } \right| =\left| {\mu _2 } \right| \), \(\vert \rho \vert =1\), cannot be analyzed with the use of formula (31), since the distribution function of \(\mathrm{Z} \sim \mathrm{RFN}({\mu }_{1},{\mu }_{2},{\sigma }_{1}^{2},{\sigma }_{2}^{2},\rho )\) is not valid for \(\vert \rho \vert =1\). In this case we have \(\vert X_{1}\vert =\vert X_{2}\vert \) with probability \({\hbox {P}} = 1\), and hence \(\mathrm{Z} = \frac{\left| X_{1} \right| }{\left| X_{2} \right| } = \frac{\left| X_{1} \right| }{\left| X_{1} \right| } =1\) (Z becomes a constant), so \({\hbox {P}}(\mathrm{Z}<1)=0\).
Rights and permissions
Open Access This article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made.
About this article
Cite this article
Prószyński, W. Revisiting Baarda’s concept of minimal detectable bias with regard to outlier identifiability. J Geod 89, 993–1003 (2015). https://doi.org/10.1007/s00190-015-0828-y
DOI: https://doi.org/10.1007/s00190-015-0828-y