1 Introduction

Meaningful statistical inference is only possible if the target of inference (parameter) is identifiable, meaning that if parameter values differ, the parameterised distributions should also differ. There are several versions of identifiability definitions, and many identifiability and non-identifiability results, see, e.g., Yakowitz and Spragins (1968), Rothenberg (1971), Prakasa Rao (1992), Ho and Rosen (2017).

Here situations are treated in which parameters are identifiable according to this classical definition, yet the parameters cannot be identified from observed data. Rothenberg (1971), Hsiao (1983), Prakasa Rao (1992) define identifiability with explicit reference to observable data but do not cover the issues that are treated here. Regarding the results in Rothenberg (1971), there is no difference between classical identifiability and identifiability from observations. Simple examples for classical identifiability issues are the non-identifiability of linear regression parameters in case of collinear explanatory variables, and identifiability of the parameters of mixture distributions, which can be guaranteed under certain assumptions (Yakowitz and Spragins 1968), particularly ruling out label switching, but counterexamples exist (Prakasa Rao 1992, Chapter 8). Hsiao (1983), Prakasa Rao (1992) also study situations in which issues occur because certain modelled random variables are unobservable, such as the true value of a variable in errors-in-variables models. Some examples in Sect. 4 are also of this kind, but there are further reasons why the observed data may not allow for identification of classically identifiable parameters, which are explored here.

Some such situations have already appeared in the literature, see, e.g., Neyman and Scott (1948), Bahadur and Savage (1956), Donoho (1988), Spirtes et al. (1993), Robins et al. (2003), Molenberghs et al. (2008), Almeida and Mouchart (2014). Section 4 gives more details on these works, and how they fit into the unified terminology introduced here.

It turns out that there are different possible levels of information about identifiable parameters in the data, and therefore various definitions are introduced. Consistent estimators may or may not exist (“empirical identifiability”). Sets that can distinguish two parameter values for a finite sample size may or may not exist (“empirical distinguishability” with weaker and stronger versions).

The concept of empirical identifiability is closely connected to the concept of estimability, which is also stronger than classical identifiability. Once more there are several versions around. The concept mostly focuses on what can be estimated with a given finite sample, and its connection with classical identifiability is investigated, see, e.g., Bunke and Bunke (1974), Jacquez and Greif (1985), Maclaren and Nicholson (2020).

This work was motivated by the discovery that data hold no information about distinguishing i.i.d. Gaussian observations from Gaussian data with a constant correlation between any two observations. This will be used as a guiding example. Section 2 derives a key result regarding this situation. Section 3 presents the main definitions, some of their implications and some more examples. Section 4 reviews results from the literature that fit into the framework of Sect. 3. Section 5 uses this framework to discuss the general problem of telling apart dependence and independence in situations in which potential dependence is not governed by the observation order or observable external information. This is relevant in many situations that require independence assumptions. Section 6 presents another example in some detail, namely the empirical identification of parameters indicating the cluster memberships of every single point in k-means clustering. Section 7 concludes the paper. All proofs are in the Appendix.

2 Constant correlation between Gaussian observations

This work was motivated by the following example, which in itself should be of strong interest.

Example 1

A model assumption for much standard statistical inference is to assume independently identically distributed (i.i.d.) Gaussian \(X_1,\ldots ,X_n,\ X_1\sim {{\mathcal {N}}}(\mu ,\sigma ^2)\) (model M0). Now consider Gaussian \(X_1,\ldots ,X_n\) with correlation \({\text {Cor}}(X_i,X_j)=\rho >0\) constant for any \(i\ne j\) (model M1; M01 denotes the model with \(\rho \ge 0\) assumed).

This would be a problem for inference about \(\mu \), because in the latter situation, for the arithmetic mean \({\bar{X}}_n\):

$$\begin{aligned} {{\mathcal {L}}}({\bar{X}}_n)=\mathcal{N}\left( \mu ,\frac{(1-\rho )\sigma ^2}{n}+\rho \sigma ^2\right) \rightarrow _{n\rightarrow \infty } {{\mathcal {N}}}\left( \mu ,\rho \sigma ^2\right) . \end{aligned}$$

This means that the mean is inconsistent for \(\mu \) as long as \(\rho \sigma ^2>0\); confidence intervals and tests computed based on the i.i.d. assumption will be biased, possibly dramatically so.
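The limiting behaviour of \({\text {Var}}({\bar{X}}_n)\) can be checked by a small simulation sketch (with hypothetical values \(\mu =0,\ \sigma ^2=1,\ \rho =0.5\)); the generator uses the shared-term construction that reappears as the random effects form (1):

```python
import random
import statistics

def xbar(n, mu=0.0, sigma2=1.0, rho=0.5, rng=random):
    # Equicorrelated Gaussians via a shared term Z plus individual noise;
    # Cor(X_i, X_j) = rho for any i != j by construction.
    z = rng.gauss(0.0, (rho * sigma2) ** 0.5)
    s = ((1.0 - rho) * sigma2) ** 0.5
    return statistics.fmean(mu + z + rng.gauss(0.0, s) for _ in range(n))

random.seed(1)
for n in (10, 100, 1000):
    # Monte Carlo variance of the mean over 2000 independent data sets
    v = statistics.pvariance([xbar(n) for _ in range(2000)])
    print(n, round(v, 2))  # stays near rho * sigma2 = 0.5 rather than shrinking to 0
```

The printed variances match \((1-\rho )\sigma ^2/n+\rho \sigma ^2\) up to Monte Carlo error, illustrating the inconsistency of \({\bar{X}}_n\).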

Although \(\rho \) is identifiable in the classical sense, it turns out to not be empirically identifiable from observed data. It is not even possible to empirically distinguish any two \(\rho _1\ne \rho _2\) (in fact, plots of data generated from model M01 with different values for \(\rho \) including the i.i.d. case \(\rho =0\) do not reveal any features by which these distributions could be distinguished). \(\mu \) in model M01 is not empirically identifiable either, but two \(\mu _1\ne \mu _2\) are empirically distinguishable, see Sect. 3.

The following lemma shows that in model M01, the conditional distribution given the mean \({\bar{X}}_n\) is the same as for i.i.d., therefore uncorrelated, Gaussian random variables. But the mean does not hold information about correlations (or rather, any information about correlations is confounded with the information about the true means), meaning that model M1 cannot be distinguished from model M0 based on the data alone.

Lemma 1

Let \(\textbf{X}_n=\begin{pmatrix} X_1\\ \vdots \\ X_n\end{pmatrix},\ \textbf{Y}_n=\begin{pmatrix} Y_1\\ \vdots \\ Y_n\end{pmatrix}\). Assume

$$\begin{aligned} {{\mathcal {L}}}(\textbf{X}_n) = {{\mathcal {N}}}_n(\varvec{\mu ,\Sigma }),\ {\varvec{\mu }}= \begin{pmatrix}\mu \\ \vdots \\ \mu \end{pmatrix},\ {\varvec{\Sigma }}= \begin{bmatrix} \sigma ^2 &{} \rho \sigma ^2 &{} \dots &{} \rho \sigma ^2\\ \rho \sigma ^2 &{} \sigma ^2 &{} \dots &{} \rho \sigma ^2\\ \vdots &{} \vdots &{} \ddots &{} \vdots \\ \rho \sigma ^2 &{} \rho \sigma ^2 &{}\dots &{} \sigma ^2 \end{bmatrix}. \end{aligned}$$

Then, for

$$\begin{aligned} {{\mathcal {L}}}(\textbf{Y}_n)={{\mathcal {N}}}_n({\varvec{\mu }},(1-\rho )\sigma ^2{{\textbf{I}}}_n):\ \mathcal{L}(\textbf{X}_n\mid {\bar{X}}_n)={{\mathcal {L}}}(\textbf{Y}_n\mid {\bar{Y}}_n), \end{aligned}$$

which does not depend on \(\mu \), where \({\textbf {I}}_n\) is the n-dimensional unit matrix.

Thus, conditionally on the mean, \(\textbf{X}_n\) will look like i.i.d. Gaussians with variance \((1-\rho )\sigma ^2\); for \(\rho >0\) the variation of the \(X_i\) given their mean is smaller than their unconditional variance. On the other hand, \({\bar{X}}_n\) has a larger variance than under independence (\(\rho =0\)).
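The deflated conditional variation can also be seen numerically: under equicorrelation, the expected sample variance (the variation of the \(X_i\) about their own mean) equals \((1-\rho )\sigma ^2\). A minimal sketch with hypothetical values \(\sigma ^2=1,\ \rho =0.5\):

```python
import random
import statistics

def equicorr(n, mu=0.0, sigma2=1.0, rho=0.5, rng=random):
    # shared term Z induces correlation rho between any two observations
    z = rng.gauss(0.0, (rho * sigma2) ** 0.5)
    s = ((1.0 - rho) * sigma2) ** 0.5
    return [mu + z + rng.gauss(0.0, s) for _ in range(n)]

random.seed(2)
# average sample variance over many independent draws of X_1, ..., X_50
sv = statistics.fmean(statistics.variance(equicorr(50)) for _ in range(4000))
print(round(sv, 2))  # close to (1 - rho) * sigma2 = 0.5, not sigma2 = 1
```

The sample variance only measures variation about the observed mean, so it estimates \((1-\rho )\sigma ^2\) rather than \(\sigma ^2\).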

In fact, the model can equivalently be written as a model with a single realisation of a random effect \(Z,\ i=1,\ldots ,n\):

$$\begin{aligned} X_i=\mu +Z+E_i,\ Z\sim {{\mathcal {N}}}(0,\tau _1^2),\ E_i\sim \mathcal{N}(0,\tau _2^2), \nonumber \\ \ \sigma ^2=\tau _1^2+\tau _2^2,\ \rho =\frac{\tau _1^2}{\tau _1^2+\tau _2^2}, \end{aligned}$$
(1)

which suggests that observed data look like i.i.d. \(\mathcal{N}(\mu ^*,\tau _2^2)\) with \(\mu ^*=\mu +Z\), and Z is unknown and unobservable. The definitions and results in Sect. 3 aim at making precise a general sense in which observations give no information about \(\rho \) and limited information about \(\mu \).

3 Empirical identifiability and distinguishability

Let \(X_1,X_2,\ldots ,X_n,\ldots \) be random variables on a space \({{\mathcal {X}}}\), for \(n\in {\mathbb {N}}:\ {{\mathcal {L}}}(X_1,\ldots ,X_n)=P_{n;\theta }\) with parameter \(\theta \in \Theta \). \(P_{\infty ;\theta }\) denotes the distribution of the whole sequence. The spaces \({{\mathcal {X}}}\) and \(\Theta \) can be very general, but assume that \(\Theta \) is a metric space with metric \(d_\Theta \). The focus may be on the parameter \(\theta \) in full, or it may be on \(g(\theta )\), where \(g:\ \Theta \mapsto \Lambda \), \(\Lambda \) being a metric space with metric \(d_\Lambda \). No further conditions on \(\Theta \) and \(\Lambda \) are required for the general definitions. It is generally assumed that the underlying \(\sigma \)-algebras are rich enough so that the sets required in the arguments are measurable. In the specific cases discussed here this is always fulfilled using standard (Borel) \(\sigma \)-algebras and parameter spaces. Sometimes but not always \(g(\theta )=\theta \) and \(\Lambda =\Theta \) are considered. Other examples for g are a projection on a lower dimensional space, or an indicator function for a parameter subset (hypothesis) of interest.

Definition 1

\(g(\theta )\) is called empirically identifiable if it is possible to find a consistent sequence of estimators \((T_n)_{n\in {\mathbb {N}}}\), i.e., with \(T_n:\ {{\mathcal {X}}}^n\mapsto \Lambda \), \(\forall \theta \in \Theta :\ T_n(X_1,\ldots ,X_n)\rightarrow g(\theta )\) in probability.

Consistency is always meant with respect to \(P_{\infty ;\theta }\). Traditionally, statistical identifiability of a parametric model \(\left( P_\theta \right) _{\theta \in \Theta }\) means that \(\theta _1\ne \theta _2\Rightarrow P_{\theta _1}\ne P_{\theta _2}\); for parameter parts, \(g(\theta _1)\ne g(\theta _2)\Rightarrow P_{\theta _1}\ne P_{\theta _2}\) is often referred to as partial identifiability (Prakasa Rao 1992; Ho and Rosen 2017). If parameters are not (partially) identifiable, they can obviously not be empirically identifiable, because no consistent estimator can tell equal distributions apart:

Corollary 1

Parameters and parameter parts that are empirically identifiable are also identifiable.

Here, data generating mechanisms are treated that do not allow empirical identification of parameters that are in fact identifiable in the traditional sense. Model M01 is an example. Obviously, distributions with different correlation parameters \(\rho _1\ne \rho _2\) are different from each other, and \(\rho \) can be estimated consistently if the whole sequence of n observations is repeated independently. In this case, assuming equal correlation between any two components, the sequence of length n of observations becomes an n-variate Gaussian, and the sample correlation between any two of the n components will estimate \(\rho \) consistently, although a better estimator will of course use information from all components. The data generating mechanism modelled in Sect. 2 does not allow for independent repetition; all available observations are dependent on all other observations, and this makes consistent estimation of \(\rho \) impossible:

Theorem 1

Using the notation of Lemma 1, if, for \(n\in {\mathbb {N}}:\ {{\mathcal {L}}}(\textbf{X}_n)= {{\mathcal {N}}}_n(\varvec{\mu ,\Sigma })\), then, with \(\theta =(\mu ,\sigma ^2,\rho )\), \(g(\theta )=\rho \) is not empirically identifiable in model M01.

Not only is the correlation \(\rho \) not empirically identifiable, the same holds for \(\mu \), meaning that in practice using an estimator different from \({\bar{X}}_n\) does not help in dealing with the potential existence of \(\rho > 0\).

Theorem 2

Using the notation of Lemma 1 and Theorem 1, if for \(n\in {\mathbb {N}}:\ {{\mathcal {L}}}(\textbf{X}_n)= \mathcal{N}_n(\varvec{\mu ,\Sigma })\), then \(g(\theta )=\mu \) is not empirically identifiable.

The proof of Theorem 2 relies on the random effects formulation (1) with only a single realisation of the random effect. A similar case can be made for a standard random effects model assuming that the number of realised values of the random effect is bounded even if the number of observations goes to infinity, as expressed in the following model M2:

$$\begin{aligned} X_{ij}=\mu +Z_i+E_j,\ Z_i\sim {{\mathcal {N}}}(0,\tau _1^2),\ E_j\sim \mathcal{N}(0,\tau _2^2), \end{aligned}$$

\(i=1,\ldots ,m\) (group), \(j=1,\ldots ,n_i\) (within group observation), \(n=\sum _{i=1}^m n_i,\ \theta =(\mu ,\tau _1,\tau _2)\). Let m be fixed, whereas n is allowed to grow. Let \(\textbf{X}_n\) be the vector collecting all \(X_{ij}\).

Such a model could make sense for a random effects meta analysis with a low number m of studies, each of which is potentially large, but it allows neither the random effects’ variance nor the overall mean to be empirically identified unless \(m\rightarrow \infty \). Consequently, common advice in the meta analysis literature is to not use a random effects model if the number of studies is low [see, e.g., Kulinskaya et al. (2008)].

Lemma 2

In model M2, \(g_1(\theta )=\mu \) and \(g_2(\theta )=\tau _1^2\) are not empirically identifiable, whereas \(g_3(\theta )=\tau _2^2\) is empirically identifiable.
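Lemma 2 can be illustrated by a simulation sketch of model M2 (hypothetical values \(\mu =0,\ \tau _1=1,\ \tau _2=0.5\), and a fixed number \(m=3\) of groups): the pooled within-group sample variance settles at \(\tau _2^2\) as the groups grow, whereas the grand mean keeps fluctuating around \(\mu \) at a scale of order \(\tau _1^2/m\) that does not shrink.

```python
import random
import statistics

def m2_data(m, ni, mu=0.0, tau1=1.0, tau2=0.5, rng=random):
    groups = []
    for _ in range(m):
        z = rng.gauss(0.0, tau1)                      # group random effect Z_i
        groups.append([mu + z + rng.gauss(0.0, tau2) for _ in range(ni)])
    return groups

random.seed(4)
m = 3                                                 # number of groups stays fixed
for ni in (10, 100, 2000):
    g = m2_data(m, ni)
    within = statistics.fmean(statistics.variance(grp) for grp in g)
    grand = statistics.fmean(x for grp in g for x in grp)
    print(ni, round(within, 2), round(grand, 2))
# 'within' settles near tau2^2 = 0.25; 'grand' keeps fluctuating around mu = 0
# with a spread governed by tau1^2 / m that does not shrink as ni grows
```

Only the within-group noise is averaged away by growing \(n_i\); the m realised random effects never are.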

In model M01, there is a difference between trying to estimate \(\rho \) on one hand and \(\mu \) on the other hand. While \(\mu \) cannot be estimated consistently, if \(\rho \) is small the data can give fairly precise information about the location of \(\mu \), whereas there is no information in the data about \(\rho \) at all. The following definition aims at formalising this difference.

Definition 2

For \(n\in {\mathbb {N}},\ \alpha \le \beta \in (0,1]\), an observable set A, i.e., any measurable set expressing an observable event, is an \((\alpha ,\beta ,n)\)-distinguishing set for \(\theta _1\ne \theta _2\in \Theta \) if

$$\begin{aligned} P_{n;\theta _1}(A)\le \alpha ,\ P_{n;\theta _2}(A)>\beta . \end{aligned}$$
(2)

Definition 3

Two values \(\lambda _1\ne \lambda _2\in \Lambda \) are called empirically distinguishable if \(\exists n, \alpha \in (0,1]\) such that \(\forall \theta _1,\theta _2\in \Theta \text{ with } g(\theta _1)=\lambda _1,\ g(\theta _2)=\lambda _2\) there is an \((\alpha ,\alpha ,n)\)-distinguishing set A.

Obviously, this definition is symmetric in \(\lambda _1, \lambda _2\). Before returning to the problem of constant correlation between Gaussian observations, empirical distinguishability is discussed in some more generality.

For \(\epsilon >0\) and \(\eta _0\) in some metric space H with metric \(d_H\), define \(B_\epsilon (\eta _0)=\{\eta :\ d_H(\eta ,\eta _0)\le \epsilon \}\). If \(g(\theta )=\theta \), empirical distinguishability follows from empirical identifiability, because there is a consistent estimator \(T_n\) of \(\theta \), and \(A=\{T_n\in B_\epsilon (\theta _2)\}\) will distinguish \(\lambda _1=\theta _1\ne \lambda _2=\theta _2\) for large enough n if \(\epsilon \) is chosen small enough that \(\theta _1\not \in B_\epsilon (\theta _2)\).
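This construction can be made concrete by a Monte Carlo sketch for i.i.d. \({{\mathcal {N}}}(\theta ,1)\) data with the sample mean as \(T_n\); the values \(\theta _1=0,\ \theta _2=1,\ \epsilon =0.25,\ n=100\) are illustrative choices:

```python
import random
import statistics

def p_in_ball(theta, n=100, theta2=1.0, eps=0.25, reps=4000, rng=random):
    # Monte Carlo estimate of P_{n;theta}(A) for A = {T_n in B_eps(theta2)},
    # where T_n is the sample mean of n i.i.d. N(theta, 1) observations
    hits = 0
    for _ in range(reps):
        t = statistics.fmean(rng.gauss(theta, 1.0) for _ in range(n))
        hits += abs(t - theta2) <= eps
    return hits / reps

random.seed(7)
a1, a2 = p_in_ball(0.0), p_in_ball(1.0)
print(a1, a2)  # small under theta_1 = 0, large under theta_2 = 1
```

The set \(A=\{T_n\in B_\epsilon (\theta _2)\}\) thus acts as an \((\alpha ,\alpha ,n)\)-distinguishing set for suitable \(\alpha \) once n is large enough.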

In general, empirical identifiability does not imply empirical distinguishability. If \(g(\theta )\) specifies only a part of the information in \(\theta \), it may happen that no set A can distinguish \(g(\theta _1)=\lambda _1\) from \(g(\theta _2)=\lambda _2\) uniformly over the information in \(\theta \) that is not in \(g(\theta )\), even if \(g(\theta )\) is empirically identifiable.

Example 2

Let \(X_i,\ i\in {\mathbb {N}}\) be independently distributed according to \(P_\theta ,\ \theta =(p,m),\ p\in [0,1],\ m\in {\mathbb {N}},\) defined as follows: For \(i\ge m\), \({{\mathcal {L}}}(X_i)=\)Bernoulli(p). For \(i<m\), \(\mathcal{L}(X_i)=\)Bernoulli(q), where q is randomly drawn from \(\mathcal{U}(0,1)\). \(g(\theta )=p\) is empirically identifiable, because \(\bar{X}_n\) is consistent for it. But any \(p_1\ne p_2\) are not empirically distinguishable, because for any \(n<m\), \(X_1,\ldots ,X_n\) do not contain any information about p. In this situation, the data may carry information about whether \(n>m\) (namely, when it can be observed that at some point the success probability appears to have changed from q to p), in which case it also carries information about p; but for \(m=1\) this can never happen, and for very small m it can hardly ever be diagnosed with any reliability.
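A minimal simulation of this mechanism (with hypothetical values \(p=0.2,\ m=50\)) shows why \(\bar{X}_n\) is nevertheless consistent: the initial q-draws wash out as n grows, even though for any fixed \(n<m\) the sample carries no information about p.

```python
import random
import statistics

def example2_seq(n, p, m, rng=random):
    q = rng.random()                       # nuisance success rate before time m
    return [int(rng.random() < (q if i < m else p)) for i in range(n)]

random.seed(5)
xs = example2_seq(100000, p=0.2, m=50)
print(round(statistics.fmean(xs), 2))  # close to p = 0.2; the q-phase washes out
```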

If \(\lambda =g(\theta )\), in order to obtain empirical distinguishability of \(\lambda _1\) and \(\lambda _2\) from a consistent estimator \((T_n)_{n\in {\mathbb {N}}}\) of \(\theta \), in general g needs to be uniformly continuous, and \((T_n)_{n\in {\mathbb {N}}}\) needs to be uniformly consistent on \(C=g^{-1}(\lambda _1)\cup g^{-1}(\lambda _2)\), i.e.,

$$\begin{aligned} \forall \epsilon>0,\alpha>0\ \exists n_0 \forall n\ge n_0, \theta \in C:\ P_{n;\theta }\{T_n\in B_\epsilon (\theta )\}>1-\alpha . \end{aligned}$$

If \(\theta \) is empirically identifiable, the latter holds automatically whenever |C| is finite, in particular if g is bijective.

Lemma 3

If \(\theta \in \Theta \) is empirically identifiable, g is uniformly continuous on an open superset of C, requiring that such a set exists, and there is an estimator \(T_n\) of \(\theta \) that is uniformly consistent on C, then \(\lambda _1\ne \lambda _2\in \Lambda \) are empirically distinguishable.

If only \(\lambda =g(\theta )\) but not \(\theta \) is assumed to be empirically identifiable, consistency of \((T_n)_{n\in {\mathbb {N}}}\) as an estimator of \(g(\theta )\) needs to be uniform on C in order to make \(\lambda _1\ne \lambda _2\) empirically distinguishable.

Consistent estimators are uniformly consistent in many situations. For example, the behaviour of affine equivariant multiple linear regression estimators is uniform over the whole parameter space. Therefore, if they are consistent for the full parameter vector, they are uniformly consistent, and will according to Lemma 3 empirically distinguish subvectors and single coefficients of the regression parameter.

Remark 1

There are possible variants of Definition 3 that may be taken as different “grades” of empirical distinguishability.

In the situation in Example 2, events may be observed for large enough n that can distinguish \(p_1\) and \(p_2\), even though this is not guaranteed to happen. A concept that allows \(p_1\) and \(p_2\) to be seen as distinguishable in some sense (at least if values of m that are too small are excluded; m needs to be large enough that a “change point” after m observations can be diagnosed with nonzero probability regardless of p) is “potential distinguishability”, see Definition 4.

Furthermore, it would make a difference to not allow the probabilities \(P_{n;\theta _1}(A),\ P_{n;\theta _2}(A)\) to be arbitrarily close. Consider a simple i.i.d. \({{\mathcal {N}}}(\theta ,1)\) model. With the given definition, and, for given fixed \(\theta _0\), \(g(\theta )=1(\theta =\theta _0)\), 0 and 1 can be empirically distinguished (the rejection region of the standard two-sided test will do), i.e., it can be distinguished whether \(\theta =\theta _0\) or \(\theta \ne \theta _0\). Using a definition that requires a “gap” of some \(\beta >0\) between \(P_{n;\theta _1}(A)\) and \(P_{n;\theta _2}(A)\), 0 cannot be distinguished from 1, because for \(\theta \rightarrow \theta _0\), \(P_\theta (A)\) for any A gets arbitrarily close to \(P_{\theta _0}(A)\). Both of these definitions can be seen as appropriate, from different points of view. It could be argued that \({{\bar{X}}}\) contains some, if not necessarily conclusive, information about whether \(\theta =\theta _0\), and it should therefore count as distinguishable from \(\theta \ne \theta _0\), as achieved by Definition 3.

But even with arbitrarily large n, \({{\bar{X}}}\) will not be exactly zero, and will therefore be at least as compatible with some \(\theta \ne \theta _0\) as with \(\theta _0\), which could be used to argue that the two should not be defined as distinguishable. Choosing g as the identity, any fixed \(\theta \ne \theta _0\) could still be distinguished from \(\theta _0\); in this case there is a positive distance between \(\theta \) and \(\theta _0\), which makes \(\bar{X}_n\) a better fit for the closer parameter. This can be achieved by the concept of “empirical gap distinguishability”, see Definition 4.
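The absence of a gap can be made concrete with the exact rejection probability of the two-sided z-test (a sketch; \(\alpha =0.05\), \(n=100\), and \(\theta _0=0\) are illustrative choices):

```python
import math

def Phi(x):
    # standard normal CDF via the error function
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

def reject_prob(theta, n=100, theta0=0.0):
    # P_theta(A) for A = rejection region of the two-sided z-test of theta0
    # at level 0.05, based on the mean of n i.i.d. N(theta, 1) observations
    z = 1.959963984540054                # 0.975 standard normal quantile
    d = math.sqrt(n) * (theta - theta0)
    return Phi(-z - d) + 1.0 - Phi(z - d)

for theta in (1.0, 0.1, 0.01, 0.001):
    print(theta, round(reject_prob(theta), 4))
# the rejection probability sinks towards alpha = 0.05 as theta approaches
# theta0, so no gap beta > alpha can hold uniformly over theta != theta0
```

For any fixed \(\theta \ne \theta _0\) the probability exceeds \(\alpha \), but the infimum over \(\theta \ne \theta _0\) is \(\alpha \) itself.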

Definition 4

  1. (a)

    \(\lambda _1\in \Lambda \) is called potentially empirically distinguishable from \(\lambda _2\in \Lambda \) if \(\exists \alpha \in (0,1),\ n\in {\mathbb {N}}\), a set \(D\subseteq g^{-1}(\lambda _1)\times g^{-1}(\lambda _2)\) so that

    $$\begin{aligned} \forall \theta _1\in g^{-1}(\lambda _1)\ \exists \theta _2\in g^{-1}(\lambda _2):\ (\theta _1,\theta _2)\in D, \end{aligned}$$

    and a set A that \((\alpha ,\alpha ,n)\)-distinguishes \(\theta _1\) from \(\theta _2\) for all \((\theta _1,\theta _2)\in D\).

  2. (b)

Two values \(\lambda _1\ne \lambda _2\in \Lambda \) are called empirically gap distinguishable if \(\exists n, \alpha <\beta \in (0,1],\) and a set A that is an \((\alpha ,\beta ,n)\)-distinguishing set \(\forall \theta _1,\theta _2\in \Theta \text{ with } g(\theta _1)=\lambda _1,\ g(\theta _2)=\lambda _2\).

Potential distinguishability in particular implies \(\forall \theta _1\in g^{-1}(\lambda _1):\ P_{n;\theta _1}(A)\le \alpha \), so that \(\lambda _1\) can be “rejected” by the indicator of A (keeping in mind that \(\alpha \) cannot necessarily be chosen small), even though this test may be biased against some \(\theta _2\in g^{-1}(\lambda _2)\). Potential distinguishability is not symmetric in \(\lambda _1\) and \(\lambda _2\), but in Example 2, in fact \(p_1\) is potentially distinguishable from \(p_2\) (A can be chosen as intersection of a set that rejects the null hypothesis of no change point in the binary sequence, see Worsley (1983), and \(|{{\bar{X}}}_n^*-p_1|\) being large, where \({{\bar{X}}}_n^*\) is the mean after the estimated change point), and \(p_2\) is potentially distinguishable from \(p_1\) in the same way. The reason why there is symmetric potential distinguishability here but not standard distinguishability is that different distinguishing sets are required for the two directions, and that no set works uniformly over all \(\theta \in g^{-1}(p_1)\cup g^{-1}(p_2)\). See Example 5 for a genuinely asymmetric instance of potential distinguishability.

Empirical gap distinguishability is stronger than empirical distinguishability, whereas potential distinguishability is weaker. Still, lack of empirical gap distinguishability means that in terms of the parameterised probabilities of observable sets, \(\lambda _1\ne \lambda _2\) appear infinitesimally close, regardless of \(d_\Lambda (\lambda _1,\lambda _2)\).

Theorem 3 treats empirical distinguishability in model M01.

Theorem 3

In model M01, any two \(\rho _1\ne \rho _2\ge 0\), \(\rho _1,\rho _2<1\), are indistinguishable and not even potentially distinguishable in any direction, whereas any two \(\mu _1\ne \mu _2\) are empirically distinguishable.

Remark 2

Apart from partial identifiability, there are further weaker versions of the classical identifiability concept. A parameter value is locally identifiable if in an open neighbourhood in the parameter space there is no other parameter that parameterises the same distribution (Rothenberg 1971). Set identifiability (Ho and Rosen 2017) means that sets of equivalent parameter values can be identified and potentially be estimated as opposed to a precise parameter value. In other words, parameters from non-equivalent sets could be (empirically) distinguished, whereas equivalent parameters could not be distinguished. Empirical versions of these definitions are possible, but all the situations treated here that are not empirically identifiable would not be empirically locally or set identifiable either. All proofs of empirical non-identifiability in the Appendix rule out the existence of consistent estimators that can tell any two parameter values apart, so neither parameter sets nor parameter values in any neighbourhood of each other can be consistently told apart.

4 Examples from the literature

The concepts of empirical identifiability and distinguishability provide a framework that fits various existing results on the limitations of empirically identifying parameters or hypotheses that are identifiable according to the classical definition.

Example 3

The so-called “incidental parameter problem” was introduced by Neyman and Scott (1948), see Lancaster (2000) for a review. It refers to a situation in which the number of observed units is allowed to go to infinity, while for each unit there is only a bounded finite number of observations, say \(x_{ij},\ i=1,\ldots ,n,\ j=1,\ldots ,t\), where \(n\rightarrow \infty \) but t is fixed. The model for the distribution of the corresponding random variables \(X_{ij}\) involves some parameters \(\alpha _i,\ i=1,\ldots ,n\). For the estimation of each of these only t observations are available, and the \(\alpha _i\) cannot be estimated consistently as \(n\rightarrow \infty \), so they are not empirically identifiable, although they may well be empirically distinguishable, depending on the specific model. The problem can be avoided in many situations by modelling the \(\alpha _i\) as random effects, so that they are governed by only one or a few parameters, but in some situations, e.g., panel data in economics (Lancaster 2000), researchers may be interested in inference about specific \(\alpha _i\), and standard distributional assumptions for the random effects distribution may not seem realistic.

Example 4

In a famous paper, Bahadur and Savage (1956) show the non-existence of valid statistical inference for the problem of finding out about the true mean in a sufficiently large class of distributions \({{\mathcal {F}}}\) with existing mean (essentially requiring that for every P and Q also any mixture of them is in \({{\mathcal {F}}}\), and that \(\forall \mu \in {\mathbb {R}}\ \exists P\in {{\mathcal {F}}}:\ E_P(X)=\mu \)).

Applying the terminology of the present paper, their Theorem 1 shows that any two means \(\mu _1\) and \(\mu _2\) are not empirically gap distinguishable. The reason is that with \(E_P(X)=\mu _1\), \(E_Q(X)\) can be chosen so that for arbitrarily small \(\epsilon >0\) and \(R=(1-\epsilon )P+\epsilon Q\), \(E_R(X)\) can take any value \(\mu _2\ne \mu _1\). For arbitrarily large n and small enough \(\epsilon \), \(P^n(A)\) and \(R^n(A)\) are arbitrarily close.
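A back-of-the-envelope version of this contamination argument, with hypothetical numbers: mixing in a tiny point mass far out shifts the mean arbitrarily, while for every observable set A the sampling distribution barely changes, since \(|P^n(A)-R^n(A)|\) is bounded by the probability that the sample contains any Q-draw at all.

```python
eps = 1e-4                            # contamination weight
mu1, mu2 = 0.0, 5.0                   # mean of P, target mean for R
M = (mu2 - (1 - eps) * mu1) / eps     # location of Q's point mass
mean_R = (1 - eps) * mu1 + eps * M    # mean of R = (1 - eps) P + eps Q
n = 100
p_any_Q = 1 - (1 - eps) ** n          # chance a sample of size n sees Q at all
print(mean_R, round(p_any_Q, 4))
# mean_R equals mu2 = 5, yet |P^n(A) - R^n(A)| <= p_any_Q (about 1%) for all A
```

The coupling bound holds because on the event that no Q-draw occurs, samples from \(P^n\) and \(R^n\) have the same distribution.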

Example 5

Expanding the work of Bahadur and Savage (1956), Donoho (1988) considered functionals J of distributions, including the number of modes of the density, the Fisher information, any \(L_p\)-norm of any derivative of the density, the number of mixture components, and the negentropy. He showed that for a sufficiently rich nonparametric family \({{\mathcal {P}}}\) of distributions, the graph (i.e., the set of pairs \((P,J(P))\) with distribution P and functional value J(P)) is dense in the epigraph (the set of pairs \((P,j)\) where \(j\ge J(P)\)) using a “testing topology” induced by \(d(P,Q)=\sup _{0\le \psi \le 1} |\int \psi dP-\int \psi dQ|\). As \(\psi \) can be the indicator of a distinguishing set, two classes of distributions \({{\mathcal {P}}}\) and \({{\mathcal {Q}}}\) are not empirically gap distinguishable if \(\inf _{P\in {{\mathcal {P}}}, Q\in {{\mathcal {Q}}}}d(P,Q)=0\). On the positive side, Donoho proves the existence of confidence sets for lower bounds of these functionals (except the negentropy); enough data can make it possible to identify that \(J(P)\ge k\) fixed and given, provided that this is indeed the case for P. \(J(P)=k_1\) can be potentially distinguished from \(J(P)=k_2<k_1\), but not from \(J(P)=k_3>k_1\).

Whereas distinguishability results are mostly negative, Donoho (1988) constructs a consistent estimator of the number of modes (Corollary of Theorem 3.4) and shows therefore that the number of modes is identifiable from data, but consistency is not uniform, and for fixed n only a lower bound for J(P) can be given.

The key ingredient for Donoho’s results [as well as the result of Bahadur and Savage (1956)] is the richness of \({{\mathcal {P}}}\). If \({{\mathcal {P}}}\) is suitably constrained by assumptions, the functionals can be empirically identified; however it cannot be empirically identified whether such assumptions hold.

Example 6

Spirtes et al. (1993), Robins et al. (2003) deal with the possibility to infer the presence or absence of causal arrows in a graphical model. Chapter 4 of Spirtes et al. (1993) is about “statistical indistinguishability”. They define several indistinguishability concepts of different strengths for the problem of identifying the causal graph. Like classical identifiability, these concepts regard the model, not making reference to observable data, but empirical identifiability is also of key interest regarding the issue how much can be inferred about the causal relationship between observed variables in the presence of an unobserved confounder.

Robins et al. (2003) show that in several situations “uniformly consistent tests” do not exist, whereas “pointwise consistent tests” exist; the latter, however, do not allow presence or absence of causal arrows to be distinguished for any fixed n uniformly over the possible parameters. “Consistent tests” are procedures that can have outcomes 0 (“accept absence of arrow”), 1 (“reject absence of arrow”), or 2 (“inconclusive”). Their Example 2 is very simple and most instructive. Assume observed binary random variables X and Y and a categorical unobserved confounder Z. The existence of a causal arrow between X and Y is operationalised by a parameter \(\theta ^*\) encoding the strength of the causal effect, where \(\theta ^*=0\) means that there is no causal arrow. There are eight different possible causal graphs encoding the possible conditional independence structures [Fig. 3 in Robins et al. (2003)]. The authors assume that the distribution is faithful to the graph, meaning that there are not more independence relationships in the distribution than encoded in the graph.

Adapting the terminology of the present work, the problem is to distinguish two classes of graphs. The first class \(C_1\) consists of those graphs that encode marginal independence between X and Y, which implies \(\theta ^*=0\) under faithfulness. The second class \(C_2\) are the graphs that imply marginal dependence between X and Y, in which case a consistent test should give an inconclusive result, because there is a possible graph in which both X and Y are influenced by Z, which causes marginal dependence, despite X and Y being independent given Z, therefore \(\theta ^*=0\). The results in Robins et al. (2003) imply that within the second class, the existence of a causal arrow between X and Y is not empirically identifiable, as only dependence or independence between X and Y can be observed. There is a pointwise consistent test (i.e., a consistent estimator of the indicator variable, therefore empirical identifiability) that can tell apart the first and the second class. The two classes are not empirically gap distinguishable, but this is not very surprising as \(\theta ^*\) can be arbitrarily close to 0 if a causal arrow exists. What is more remarkable is that the proof of their Theorem 1 (which states that no uniformly consistent test exists) implies that \(C_1\) cannot even be empirically gap distinguished from \(C_2\cap \{\theta ^*=\theta ^*_0\}\), where \(\theta ^*_0\ne 0\) is a fixed parameter value. This is because a graph in the second class that has causal arrows between each pair of X, Y, and Z encodes a model that allows for dependence between Z and X, and also between Z and Y in such a way that X and Y can be arbitrarily close to marginally independent despite the existing causal effect \(\theta ^*_0\).

Example 7

Regarding models for missing values, a key distinction is between missing at random (MAR) and missing not at random (MNAR) mechanisms. MAR holds if the distribution of the missingness indicator only depends on the complete data (including missing values) through the non-missing observations. This is a very convenient assumption for dealing with missing values, because it means that the non-missing data provide enough information to allow for unbiased inference. As acknowledged by the missing values literature, it is doubtful that this assumption is realistic, though, and it is also doubtful whether the data allow this assumption to be checked, as key information for this is hidden in the missing values.

Molenberghs et al. (2008) made this concern precise by showing that for every MAR model there exists an MNAR model that reproduces the same observed likelihood function, meaning in particular that the densities of the observed data and therefore the probabilities for every observable set are equal between these models. This translates into empirical indistinguishability (not even potential distinguishability is possible) of an indicator of MAR vs. MNAR, once more assuming a sufficiently rich class of models. The authors state that MAR and MNAR may be distinguishable under certain parametric assumptions that restrict the flexibility of the MNAR models to emulate the likelihoods for certain MAR models; but then it is not empirically distinguishable whether such a model holds or not.

Example 8

Ordinal data of dimension p are often modelled as generated by discretisation of latent continuous Euclidean variables. There is much work on dimension reduction, but assume here that there are p continuous latent variables, each corresponding to one observed ordinal variable: if \(Y_i\) is the ith latent variable and the observed ordinal variable \(X_i\) takes the ordered categories \(j=1,\ldots ,k_i\) with probabilities \(\pi _1,\ldots ,\pi _{k_i}\), then, setting \(\pi _0=0\), \(X_i=j\) if \(Y_i\) lies between the \(\sum _{l=0}^{j-1}\pi _l\)- and \(\sum _{l=0}^{j}\pi _l\)-quantiles of \({{\mathcal {L}}}(Y_i)\).

The most popular approach is to assume that the latent variables are multivariate Gaussian, see Muthén (1984). A mis-specification of the distribution of the latent variables can have consequences in practice, see Foldnes and Grønneberg (2020), who discuss tests of this assumption. In fact, all such tests only test the dependence structure of \((Y_1,\ldots ,Y_p)\) rather than the shape of the marginal distributions \({{\mathcal {L}}}(Y_i)\), which makes sense as the assignment of categories of \(X_i\) according to quantiles of \(\mathcal{L}(Y_i)\) obviously works for any continuous \({{\mathcal {L}}}(Y_i)\). Following Sklar’s famous theorem (Sklar 1959), the joint distribution of \((Y_1,\ldots ,Y_p)\) has a cumulative distribution function (CDF) H that can be written as \(H(y_1,\ldots ,y_p)=C(F_1(y_1),\ldots ,F_p(y_p))\), where \(F_1,\ldots ,F_p\) are the marginal CDFs and C is a copula. This means that any dependence structure observed in ordinal data is compatible with any choice of the marginals. Indeed, for the latent variable model for such data, Proposition 2 of Almeida and Mouchart (2014) implies that \(F_1,\ldots ,F_p\) are not empirically identifiable, and that any two vectors of marginal CDFs are not even empirically distinguishable (be it potentially). The authors formulate this as a classical identifiability statement, but since the \((Y_1,\ldots ,Y_p)\) involved in the model are not observable, this means that the model is identified classically but not empirically, i.e., not from what is observable.
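The invariance of the ordinal distribution under the choice of marginals can be checked by a small simulation: applying a monotone transform to the latent variable (changing the marginal from Gaussian to exponential while keeping the copula) produces exactly the same ordinal data, because the quantile thresholds transform along. The following Python sketch (purely illustrative, not taken from the cited works) demonstrates this for a single latent variable with three categories:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
pi = np.array([0.2, 0.5, 0.3])                 # category probabilities pi_1..pi_3
cum = np.concatenate([[0.0], np.cumsum(pi)])   # cumulative thresholds, pi_0 = 0

def discretise(y, dist):
    # X = j if Y lies between the cum[j-1]- and cum[j]-quantiles of L(Y)
    thresholds = dist.ppf(cum[1:-1])
    return np.searchsorted(thresholds, y, side="right") + 1

n = 10_000
y_norm = rng.standard_normal(n)                    # latent variable, N(0,1) marginal
y_exp = stats.expon.ppf(stats.norm.cdf(y_norm))    # monotone transform: same copula,
                                                   # exponential marginal
x_norm = discretise(y_norm, stats.norm)
x_exp = discretise(y_exp, stats.expon)
# the observed ordinal data are identical: the marginal shape leaves no trace
assert np.array_equal(x_norm, x_exp)
```

The assertion holds because the transform is strictly increasing, so each latent value falls between the same quantile thresholds of its own marginal before and after the transform.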

5 Distinguishing independence and dependence

The problem of identifying constant correlation between Gaussian observations is an instance of the more general problem of detecting dependence between the observations in a sample, particularly if they are meant to be analysed by methods that assume independent data. Here the focus is on i.i.d. data. This section does not present sophisticated results; the focus is on general ideas.

Existing tests such as the runs test (Wald and Wolfowitz 1940) require additional information about the kind of dependence to be detected. Most of them test for dependence governed by the observation order, which is sensible if it can be suspected that closeness in observation order gives rise to dependence. This is often the case if the observations form a time series, but other meaningful orderings are conceivable, as is dependence governed by “closeness” with respect to external variables such as spatial location. Alternatively, there may be a known grouping of observations and possible within-group dependence, as modelled for example by random effects models.

In practice, the observation order is not always meaningful, an originally existing meaningful observation order may be unavailable to the data analyst, or dependence structures can be suspected that cannot be detected by examining relations between observations that are in some sense “close”. Constant correlation between any two Gaussian observations as in model M1 is one example of such a structure.

The question of interest here is whether independence and dependence can be distinguished in case that the observation order carries no relevant information, and neither is there secondary information from additionally observed variables. This amounts to observing the empirical distribution of the data only.

For real-valued data \(X_1,\ldots ,X_n\), \({\hat{F}}_n\) denotes the empirical distribution function. Assume that only \({\hat{F}}_n\) is observed. The concept of empirical identifiability can be applied to binary “parameters”, and particularly to a parameter that indicates, within an underlying model, whether \(X_1,\ldots ,X_n\) are i.i.d. or not. Empirical identifiability involves asymptotics. For \(n\rightarrow \infty \) here it needs to be assumed that a new sequence \(X_1,\ldots ,X_n\) is generated for each n, because observing a sequence \({\hat{F}}_1,\ldots ,{\hat{F}}_n,{\hat{F}}_{n+1},\ldots \) based on the same sequence of observations will re-introduce observation order information.

At first sight the task of identifying dependence from \({\hat{F}}_n\) may seem hopeless, given that \({\hat{F}}_n\) is perfectly compatible with i.i.d. data generated from distributions with a “close” true CDF or even \({\hat{F}}_n\) itself. As in other identifiability problems, information can come from restrictive assumptions.

Example 9

Consider model M2 and assume that the number of groups satisfies \(m\ge 2\), but only \({\hat{F}}_n\) is observed, meaning that it cannot be observed to which group i an observation \(X_{ij}\) belongs. Independence amounts to \(\tau _1^2=0\), and the interest here is in \(g(\theta )=1(\tau _1^2=0)\), the indicator function of \(\{\tau _1^2=0\}\). In case that \(\tau _1^2=0\), the \(X_{ij}\) are i.i.d. Gaussian. In case that \(\tau _1^2>0\), the underlying distribution partitions the data into different Gaussians for different i. Assuming that \(\frac{n_i}{n}\rightarrow \pi _i>0\), \({\hat{F}}_n\) will for large enough n look like a Gaussian mixture, which can be told apart from a single Gaussian. The order of a Gaussian mixture can be consistently estimated (James et al. 2001), and therefore \(1(\tau _1^2=0)\), which is equal to the indicator of a single mixture component, is empirically identifiable. Note that the cited result is for i.i.d. data from a Gaussian mixture, which in model M2 would require the group memberships to be modelled i.i.d. multinomial\((1,\pi _1,\ldots ,\pi _m)\), and then conditioning on the unknown values of \(Z_1,\ldots , Z_m\).
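The principle can be illustrated by simulation. The sketch below is illustrative only: instead of consistent mixture order estimation it uses a simple Kolmogorov–Smirnov statistic against a fitted single Gaussian (conservative, since the parameters are estimated), which suffices to show that the pooled empirical distribution is visibly non-Gaussian when \(\tau _1^2>0\) with well-separated group means, but not when \(\tau _1^2=0\):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)

def sample_M2(n_per_group, group_means, sigma=1.0):
    # X_ij = Z_i + e_ij with the group means fixed; only the pooled empirical
    # distribution is kept: group labels and observation order are discarded
    x = np.concatenate([m + sigma * rng.standard_normal(n_per_group)
                        for m in group_means])
    rng.shuffle(x)
    return x

def looks_gaussian(x, alpha=0.01):
    # KS test against the best-fitting single Gaussian (parameters estimated,
    # so the test is conservative; good enough for this illustration)
    return stats.kstest(x, "norm", args=(x.mean(), x.std())).pvalue > alpha

x_indep = sample_M2(500, [0.0, 0.0])     # tau_1^2 = 0: a single Gaussian
x_dep = sample_M2(500, [-3.0, 3.0])      # tau_1^2 > 0: a clear Gaussian mixture
print(looks_gaussian(x_indep), looks_gaussian(x_dep))
```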

The example illustrates that certain empirical distributions can indeed indicate dependence, if corresponding distributional shapes (here a Gaussian mixture with \(m\ge 2\) components) are assumed as impossible under independence but can occur under dependence.

Even if general marginal distributions are allowed, there are specific dependence structures that can be identified from the empirical distribution alone. In order to simplify matters, from now on consider binary data \(X_1,\ldots ,X_n\), for which observing the empirical distribution is equivalent to observing the number of ones, or \({{\bar{X}}}_n=1-{\hat{F}}_n(0)\), and all marginal distributions are Bernoulli(p). Call the i.i.d. model M3. A problem of interest is whether there are dependence models in which all marginal distributions are identical that can be told apart from M3 based on \({\hat{F}}_n\).

Example 10

Here is an example of a dependence structure that can be distinguished from M3 based on \({\hat{F}}_n\). It relies on \({{\bar{X}}}_n\) concentrating on a pre-specified value (here \(\frac{1}{2}\)) with larger probability than under M3, even if p equals this value. Consider a model M4 where \({{\mathcal {L}}}(X_1)=\,\)Bernoulli(p). With \(q=p1\left( p\le \frac{1}{2}\right) +(1-p)1\left( p> \frac{1}{2}\right) \), \(r=2\frac{q}{q+1},\) let Y be an unobserved Bernoulli(r)-random variable. If \(Y=0\), let \(X_2,\ldots ,X_n,\ldots \) be i.i.d. Bernoulli(p/2). If \(Y=1\), let \(X_3,X_5, X_7,\ldots \) be i.i.d. Bernoulli(1/2), and for all even \(n:\ X_n=1-X_{n-1}\), so that \({{\bar{X}}}_n=\frac{1}{2}\). The value r is chosen so that all marginal distributions are Bernoulli(p). With \(A_n=\{\bar{X}_n=\frac{1}{2}\}\), \(P(A_n)\rightarrow 0\) under M3, whereas for even n, \(P(A_n)=r\) under M4. Therefore \(A_n\) gap distinguishes M3 from M4.
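A small Monte Carlo check of this construction (a sketch following the definition of M4 above, with \(p=\frac{1}{2}\), so \(q=p\) and \(r=\frac{2}{3}\)) confirms that \(\bar{X}_n=\frac{1}{2}\) occurs with probability close to r for even n under M4, while the exact binomial probability under M3 is much smaller:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(2)
p = 0.5                        # marginal success probability (p <= 1/2, so q = p)
r = 2 * p / (p + 1)            # = 2/3; chosen so that all marginals are Bernoulli(p)
n = 100                        # an even sample size

def sample_M4():
    x = np.empty(n)
    x[0] = rng.random() < p                       # X_1 ~ Bernoulli(p)
    if rng.random() < r:                          # Y = 1: paired structure
        x[2::2] = rng.random(n // 2 - 1) < 0.5    # X_3, X_5, ... ~ Bernoulli(1/2)
        x[1::2] = 1 - x[0::2]                     # X_m = 1 - X_{m-1} for even m
    else:                                         # Y = 0: X_2, ... i.i.d. Bernoulli(p/2)
        x[1:] = rng.random(n - 1) < p / 2
    return x

# relative frequency of A_n = {mean = 1/2} under M4 vs. exact probability under M3
freq_M4 = np.mean([np.mean(sample_M4()) == 0.5 for _ in range(2000)])
prob_M3 = stats.binom.pmf(n // 2, n, p)
print(freq_M4, round(prob_M3, 3))
```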

Such examples rely on the definition of a subset \(A_n\) of possible values of \({\hat{F}}_n\), which under the dependence model for large enough n has a probability either higher than the maximum over p under M3, or smaller than the corresponding minimum. For checking dependence in practice, \(A_n\) needs to be specified in advance, meaning that the user needs to know a priori which values of \({\hat{F}}_n\) can be suspected to indicate dependence even given the possibility under M3 that \(p={{\bar{X}}}_n\). Such information is rarely available in practice.

Summarising, dependence can be diagnosed from the data only if

  • It is governed by the known order of observations or known external variables,

  • Or it favours (or avoids) specific events regarding the observed empirical distribution compared to the independence model of interest,

both in ways that the user has to specify in advance. It can be suspected that many existing dependence structures are not of this kind, meaning that only very limited aspects of independence between observations, regarding a sequence of such observations, can be checked. It is therefore very important to think through all background information about the data generating process to become aware of further potential issues with independence.

6 Cluster membership in k-means clustering

k-means clustering is probably the most popular cluster analysis method (Jain 2010). It can be connected to a “fixed classification model” (Bock 1996): Let \(\textbf{X}_1,\ldots ,\textbf{X}_n,\ \textbf{X}_i\in {\mathbb {R}}^p,\ i=1,\ldots ,n,\) be independently distributed with

$$\begin{aligned} {{\mathcal {L}}}(\textbf{X}_i)={{\mathcal {N}}}_p\left( {\varvec{\mu }}_{\gamma _i},\sigma ^2{{\textbf{I}}}_p\right) ,\ \gamma _i\in \{1,\ldots ,k\},\ k>1,\ \sigma ^2\ge 0. \end{aligned}$$
(3)

This model can be interpreted as generating k different Gaussian distributed clusters characterised by cluster means \({\varvec{\mu }}_1,\ldots ,{\varvec{\mu }}_k\in {\mathbb {R}}^p\), all with the same spherical covariance matrix, and \(\gamma _i\) indicates the true cluster membership of \(\textbf{X}_i\). The \(\gamma _i\) take discrete values, and their number converges to \(\infty \) with n, so these are nonstandard parameters, but in many applications they are of practical interest.

The maximum likelihood (ML)-estimator for \(\theta =({\varvec{\mu }}_1,\ldots ,{\varvec{\mu }}_k,\gamma _1,\ldots ,\gamma _n)\) in this model is given by k-means clustering (the ML-estimator for \(\sigma ^2\) is easily derived, but this is not relevant here) of data \({\tilde{\textbf{X}}}_n=(\textbf{X}_1,\ldots ,\textbf{X}_n)\):

$$\begin{aligned} T_n({\tilde{\textbf{X}}}_n)= & {} (\textbf{m}_{1n},\ldots ,\textbf{m}_{kn},g_{1n},\ldots ,g_{nn})\\= & {} \mathop {\mathrm{arg\,min}}\limits _{\textbf{m}_1,\ldots ,\textbf{m}_k,g_1,\ldots ,g_n} W(\textbf{m}_{1},\ldots ,\textbf{m}_{k},g_{1},\ldots ,g_{n}),\\ W(\textbf{m}_{1},\ldots ,\textbf{m}_{k},g_{1},\ldots ,g_{n})= & {} \sum _{i=1}^n \Vert \textbf{X}_{i}-\textbf{m}_{g_{i}}\Vert ^2 \end{aligned}$$

with ties in the \(\mathop {\mathrm{arg\,min}}\limits \) broken in an arbitrary way. For given \(\textbf{m}_{1},\ldots ,\textbf{m}_{k}\), the \(g_1,\ldots ,g_{n}\) minimising W are given by

$$\begin{aligned} g_i=\mathop {\mathrm{arg\,min}}\limits _{j\in \{1,\ldots ,k\}}\Vert \textbf{X}_i-\textbf{m}_j\Vert ,\ i=1,\ldots ,n, \end{aligned}$$

and with these write \(W(\textbf{m}_{1n},\ldots ,\textbf{m}_{kn})=W(\textbf{m}_{1n},\ldots ,\textbf{m}_{kn},g_{1n},\ldots ,g_{nn})\). For given \(g_1,\ldots ,g_{n}\), the cluster-wise mean vectors minimise W. As there are only finitely many values of \(g_1,\ldots ,g_{n}\), the ML-estimator always exists, if not necessarily uniquely. Two issues with identifiability here are (i) that the numbering of the clusters is arbitrary and (ii) that \({\varvec{\mu }}_q={\varvec{\mu }}_r\) for \(q\ne r\) means that it is not possible to distinguish between \(\gamma _i=q\) and \(\gamma _i=r\) for \(i\in {\mathbb {N}}\). Therefore assume that \({\varvec{\mu }}_1,\ldots ,{\varvec{\mu }}_k\) are pairwise different and lexicographically ordered (i.e., with obvious notation, \(\mu _{11}\le \ldots \le \mu _{k1}\), with ties broken by the second variable or, if there is still equality, by the third and so on; the same for \(\textbf{m}_{1n},\ldots ,\textbf{m}_{kn}\)). This makes the model identifiable according to the traditional definition.
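The alternating structure of this minimisation, assigning each observation to its nearest mean and then recomputing cluster-wise means, is Lloyd's k-means algorithm. A minimal Python sketch (illustrative; like any k-means implementation it finds a local minimum of W, not necessarily the global ML solution):

```python
import numpy as np

def kmeans(X, k, n_iter=100, seed=0):
    # Lloyd's algorithm: alternate the two partial minimisations of W;
    # converges to a local minimum of W
    rng = np.random.default_rng(seed)
    m = X[rng.choice(len(X), size=k, replace=False)]     # initial means
    for _ in range(n_iter):
        # given the means, the optimal g_i is the index of the nearest mean
        d = ((X[:, None, :] - m[None, :, :]) ** 2).sum(axis=2)
        g = d.argmin(axis=1)
        # given the labels, the cluster-wise means minimise W
        new_m = np.array([X[g == j].mean(axis=0) if np.any(g == j) else m[j]
                          for j in range(k)])
        if np.allclose(new_m, m):
            break
        m = new_m
    return m, g

# toy check: two well-separated clusters in one dimension
rng = np.random.default_rng(3)
X = np.concatenate([np.full(50, -5.0), np.full(50, 5.0)]).reshape(-1, 1)
X = X + rng.normal(0.0, 0.5, X.shape)
m, g = kmeans(X, 2)
```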

However, due to the nonstandard nature of the model parameters, the ML-estimator is known to be inconsistent, even if only the estimation of the k mean vectors alone is of interest (Bryant 1991).

The cluster membership parameters are another example for parameters that are identifiable according to the classical definition (because \(\gamma _i\) uniquely defines the distribution of \(\textbf{X}_i\)), but cannot be empirically identified.

Theorem 4

The parameters \(\gamma _i,\ i\in {\mathbb {N}}\) in the fixed classification model defined in (3) are not empirically identifiable.

It may be suspected that this is a consequence of the fact that for \(i=1,\ldots ,n\), only \(\textbf{X}_i\) holds information about the parameter \(\gamma _i\), and the number of these parameters goes to \(\infty \) with \(n\rightarrow \infty \). But this is not quite true; more observations add information about the clusters that can in turn be used to classify individual observations better. The problem here is rather the Gaussian distribution assumption, which implies that the marginal density of \(\textbf{X}_i\) is everywhere nonzero. The single observation made of \(\textbf{X}_i\) is therefore not enough to determine with probability 1 to which cluster the observation belongs, even if there is an infinite amount of information about the clusters (the setup in which classical identifiability could be used to estimate this parameter would require a potentially infinite number of replicates of \(\textbf{X}_i\)). In fact, there is a different model setup in which the \(\gamma _i\) are empirically identifiable. It requires that, where densities exist, the marginal density \(f_{\theta ^*,n}(\textbf{X}_i=\textbf{x})\) is zero wherever \(f_{\theta ,n}(\textbf{X}_i=\textbf{x})>0\), where \(\theta ^*\) equals \(\theta \) in all components except \(\gamma _i\).

Defining

$$\begin{aligned} W(P)=({\varvec{\mu }}_1^*,\ldots ,{\varvec{\mu }}_k^*)=\mathop {\mathrm{arg\,min}}\limits _{(\textbf{m}_1,\ldots ,\textbf{m}_k)\in ({\mathbb {R}}^p)^k} \int \min _{\textbf{m}\in \{\textbf{m}_1,\ldots ,\textbf{m}_k\}}\Vert \textbf{x}-\textbf{m}\Vert ^2 dP(\textbf{x}), \end{aligned}$$

Pollard (1981) showed that for a distribution P satisfying

$$\begin{aligned} E_P\Vert \textbf{X}\Vert ^2 < \infty ,\ W(P) \text{ is } \text{ unique } \text{ up } \text{ to } \text{ the } \text{ numbering } \text{ of } \text{ the } \text{ means, } \end{aligned}$$
(4)

the k-means estimator \(\left( T_n^m\right) _{n\in {\mathbb {N}}}\), where \(T_n^m({\tilde{\textbf{X}}}_n)=(\textbf{m}_{1n},\ldots ,\textbf{m}_{kn})\), is strongly consistent for W(P). Assume further that

$$\begin{aligned} \forall j\ne l\in \{1,\ldots ,k\}:\ P\{\Vert \textbf{X}-{\varvec{\mu }}^*_j\Vert ^2=\Vert \textbf{X}-{\varvec{\mu }}^*_l\Vert ^2\}=0. \end{aligned}$$
(5)

For \({{\mathcal {L}}}(\textbf{X})=P,\ j=1,\ldots ,k\), define

$$\begin{aligned} A_j=\left\{ \textbf{X}:\ j=\mathop {\mathrm{arg\,min}}\limits _l\Vert \textbf{X}-{\varvec{\mu }}_l^*\Vert ^2\right\} ,\ P_j=\mathcal{L}(\textbf{X}\mid \textbf{X}\in A_j),\ \pi _j=P(A_j). \end{aligned}$$

\(P_j\) is P constrained to the set \(A_j\) of points that are closest to the mean \({\varvec{\mu }}_j\) (\(A_1,\ldots ,A_k\) form a Voronoi tessellation of \({\mathbb {R}}^p\)), and

$$\begin{aligned} P=\sum _{j=1}^k\pi _jP_j \end{aligned}$$
(6)

(every distribution can be written as a mixture in this form; as a side remark, P might be a Gaussian mixture, but in this case the \(P_j\) are not its Gaussian components). Mixture models of this form can be derived from a model for outcomes \((G, \textbf{X})\) with \(G\in \{1,\ldots ,k\}\) distributed according to a categorical distribution with probabilities \((\pi _1,\ldots ,\pi _k)\) and \(\mathcal{L}(\textbf{X}\mid G=j)=P_j\). Then \({{\mathcal {L}}}(\textbf{X})=P\) (McLachlan and Peel 2000). For an i.i.d. sequence \(\textbf{Y}=((G_1,\textbf{X}_1),(G_2,\textbf{X}_2),\ldots )\) let \(\mathcal{L}(\textbf{Y})={\tilde{P}},\ \textbf{G}=(G_1,G_2,\ldots )\).
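The decomposition (6) can be checked numerically. The following sketch is illustrative only: for a symmetric one-dimensional Gaussian mixture the component means are used as stand-ins for \(W(P)\), which suffices here because by symmetry the Voronoi boundary between the two population k-means centres is at 0 in any case. The Monte Carlo estimates of the \(\pi _j\) are close to \(\frac{1}{2}\) each; note that the \(P_j\) are truncations of P to the Voronoi cells \(A_j\), not the Gaussian components of the mixture, in line with the side remark above:

```python
import numpy as np

rng = np.random.default_rng(4)
n = 100_000
# symmetric Gaussian mixture P = 0.5 N(-2,1) + 0.5 N(2,1)
x = np.where(rng.random(n) < 0.5,
             rng.normal(-2.0, 1.0, n), rng.normal(2.0, 1.0, n))
mu_star = np.array([-2.0, 2.0])                  # assumed centres (illustrative
                                                 # stand-in for W(P))
g = np.abs(x[:, None] - mu_star).argmin(axis=1)  # Voronoi assignment to A_j
pi_hat = np.bincount(g) / n                      # estimated mixture weights pi_j
m1_hat = x[g == 0].mean()                        # mean of P_1: a truncated mixture,
                                                 # not the component N(-2,1)
print(pi_hat, m1_hat)
```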

Now consider \({{\mathcal {L}}}({\tilde{\textbf{X}}}_n)=P^*\) so that \(\textbf{X}_1,\ldots ,\textbf{X}_n\) are independently distributed with

$$\begin{aligned} {{\mathcal {L}}}(\textbf{X}_i)=P_{\gamma _i}\ \gamma _i\in \{1,\ldots ,k\},\ k>1,\ i=1,\ldots ,n. \end{aligned}$$
(7)

This defines a fixed classification model associated with the mixture P. Let \(Q_P\) be an infinite i.i.d. product of categorical distributions on \(\{1,\ldots ,k\}\) with probabilities \((\pi _1,\ldots ,\pi _k)\). Assume for given P that \(\varvec{\gamma }=(\gamma _1,\gamma _2,\ldots )\) fulfills

$$\begin{aligned} {\tilde{P}}\left\{ \lim _{n\rightarrow \infty }T_n^m({\tilde{\textbf{X}}}_n)=W(P)\mid \textbf{G}=\varvec{\gamma }\right\} =1. \end{aligned}$$
(8)

Because of the strong consistency of \(T_n^m\), (8) holds with probability 1 under \(Q_P\), but note that under (7), \(\varvec{\gamma }\) is a fixed parameter and not a random variable, and the fact that (8) holds for \(Q_P\)-almost all \(\varvec{\gamma }\) just means that (8) is not more restrictive than assuming a mixture with fixed proportions, although it will not allow for fully general \(\varvec{\gamma }\).

Theorem 5

Assuming (4), (5), and (8), the parameters \(\gamma _i,\ i\in {\mathbb {N}},\) in the fixed classification model defined by (7) are empirically identifiable.

Already from Pollard (1981) it is clear that k-means does not actually estimate the centres of the spherical Gaussians in (3), but rather the Voronoi tesselation resulting from P, and the resulting clusters are not necessarily spherical. Added here is the observation that one can define meaningful cluster indicators in this setup, and that these can be consistently estimated, even though there is one such indicator for every observation. This is not possible in (3). Furthermore (6) interprets P as a mixture, and thus shows that there is a mixture that k-means estimates consistently.

The reader may wonder about empirical distinguishability of the parameters \(\gamma _i,\ i\in {\mathbb {N}}\), i.e., about whether given values \(j_1\) and \(j_2\) of \(\gamma _i\) for given i could be distinguished. In the situation of Theorem 5, this follows from Lemma 3 (projections of discrete parameters are by definition uniformly continuous). The situation of Theorem 4 is less obvious. It is however clear that the data contain some information about \(\gamma _i\) through \(\textbf{X}_i\). If it were possible to empirically identify \({\varvec{\mu }}_1,\ldots ,{\varvec{\mu }}_k\), then \(A=\{\Vert \textbf{X}_i-{\varvec{\mu }}_{j_1}\Vert >\Vert \textbf{X}_i-{\varvec{\mu }}_{j_2}\Vert \}\) could distinguish \(j_1\) and \(j_2\). A conjecture is that finding consistent estimators for \({\varvec{\mu }}_1,\ldots ,{\varvec{\mu }}_k\) requires additional conditions on the sequence \((\gamma _i)_{i\in {\mathbb {N}}}\) that allow one to use a consistent estimator from an i.i.d. mixture, see Redner and Walker (1984), Bryant (1991).

7 Conclusion

There are various potential issues with the identification of parameters, and the four definitions given here (empirical identifiability, empirical distinguishability, empirical gap distinguishability, potential empirical distinguishability) may not cover all of them. Even using the definition of distinguishing sets, further definitions are possible, for example empirical set identifiability, but what is already present allows one to deal with many examples.

Apart from the precise definitions, there are also different sources for identifiability and distinguishability problems. In some situations (Examples 7, 8) the problem is that modelled information is not observed, either because it regards missing values, or latent variables. In some situations (Examples 4, 5), the issue is that the class of distributions to consider, even for a single parameter value of interest, is too rich, allowing for so much flexibility that the probability of any observable set cannot be sufficiently constrained. Identifiability and empirical distinguishability can be the result of model assumptions constraining this flexibility, see Example 9; in Example 1, constraining the variance \(\sigma ^2\) would in turn allow the data to be informative about \(\rho \). These model assumptions cannot be justified from the data alone though. In some further situations, the data carry information about the parameter of interest, but this information, or more precisely the growth of the information over n, is limited [Lemma 2, Example 3, Theorem 4; Example 1 is also of this kind, using (1)]. Example 2 is constructed so that the distinguishing information may not occur at any finite n, and the parameterisation in Example 6 is arbitrarily close to a situation of classical non-identifiability, which is only avoided by the faithfulness assumption.

Some examples such as model M01 and the problem in Sect. 6 are characterised by not allowing for i.i.d. repetition; the corresponding parameters can be identified if the whole sequence of observations is repeated i.i.d., and the lack of empirical identifiability is due to the assumed impossibility of doing this. It may be wondered whether such models have a valid frequentist interpretation, which seems to rely on i.i.d. replicability at least in principle. Frequentism needs to be interpreted in a rather “idealist” way to accommodate such situations, appealing to replication of an infinite sequence as a thought experiment, although this point can arguably be made regarding time series and other models as well; for more on this, see Hennig (2020).

The most relevant and unsettling implication of this work for practice concerns the impossibility of checking certain model assumptions, particularly independence; flexible enough models allowing for non-identical marginals can be impossible to detect as well, although this is not shown here.

The considerations in Sect. 5 regarding the requirements for detecting dependence do not only hold for data for which only the empirical distribution is observed; they hold in the same way for situations in which the observation order, even if known and potentially meaningful, is not informative about the dependence structure, and no external variables exist that hold such information either. This is probably a very common situation. The only way to justify independence then is knowledge about the subject matter and the data generation. Bayesians may think that the lack of information in the data about parameters such as \(\rho \) in M01 could be compensated by a prior distribution, but the lack of empirical identifiability and distinguishability of \(\rho \) raises the question where quantitative information to set up the prior should come from. A prior could only be obtained from existing qualitative information about the data generating process, and as there is no information in the data, the prior will determine the impact of \(\rho \) without the possibility of being “corrected” by the data.