In this section, we examine Neyman’s contributions to the methodology of sampling (in connection with estimation) in order to show that his framework aims at the explicit incorporation of the diverse types of prior information available in different research designs.
Historically, the challenge of drawing inferences from a sample rather than from a whole population was tantamount to ascertaining that the former is a representation of the latterFootnote 7 (cf. Kuusela 2011, 91–93). In his groundbreaking paper, Neyman (1934) compared two “very broad” (559) groups of sampling techniques that presuppose taking representative samples from finite populations: random sampling, in its special form of so-called stratified sampling, and purposive selection (purposive sampling). For Neyman, what was distinctive of random sampling was the presence of some randomness in the selection process, as opposed to purposive selection, where there is none. It follows from his paper that the method of random sampling may be of “several types” (Neyman 1934, 567–568), including simple random sampling with or without replacement, and stratified and cluster sampling (discussed below in this article), among others. The meaning of random sampling can be rephrased in more recent terms as probability sampling. In probability sampling, each unit in a finite population of interest has a definite non-zero chance of selection (Steel 2011, 896–898); this chance need not be equal for every unit. Neyman’s rationale for random selection is that it enables the application of the probability calculus to interval estimation and to the calculation of error probability, which, in Neyman’s view, is not feasible in the case of purposive selection (1934, 559, 572, 586).Footnote 8 Purposive selection means that the selection of sampling units is determined by a researcher’s arbitrary, non-random choice, so that it is either impossible to ascribe probabilities to the selection of a particular possible set, or these probabilities are known ex ante to be either \(0\) or \(1\).
Stratified and Cluster Sampling
Stratified sampling is a kind of probability sampling in which, before a random sample is drawn, the population is divided into several mutually exclusive and exhaustive groups called strata, from which the approach derives its name. The sample then consists of partial samples, each drawn at random from one of the strata. Stratified sampling is often a more convenient, economical way of sampling, e.g., in a survey about support for a new presidential candidate conducted separately in each province of a country, where, roughly speaking, a province corresponds to a stratum. In such a case, citizens are not randomly selected from the population of the country’s inhabitants as a whole but from each stratum separately. If the ratio of each stratum’s sample size to the size of that stratum’s population is the same for every province, then every inhabitant of the country has the same chance of being included in the survey.
This form of stratified sampling prevailed at the time of the publication of Neyman’s classic paper in 1934. A simple example helps to convey the core idea. Imagine a country with three provinces of \(25\), \(10\), and \(5\) inhabitants, respectively. If the stratified sample includes \(8\) inhabitants, then the sizes of the corresponding subsamples must be \(5\), \(2\), and \(1\), respectively. This ensures that no stratum is under- or overrepresented and that the whole sample reflects the relative proportions of the population. Stratified sampling is particularly useful when the variability of the investigated characteristic is known to depend in some way on an auxiliary factor. Strata should then be determined so as to represent the ranges of values of such a factor; we discuss this later in this section.
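To make the arithmetic of proportional allocation explicit, the following minimal Python sketch (our illustration; the function name is hypothetical) computes the subsample sizes for the example above.

```python
# Minimal sketch of proportional allocation: each stratum receives a share
# of the total sample equal to its share of the population.

def proportional_allocation(stratum_sizes, sample_size):
    """Allocate sample_size across strata in proportion to stratum sizes."""
    total = sum(stratum_sizes)
    # Floor division suffices here because the shares divide exactly;
    # a real design needs an explicit rounding rule.
    return [sample_size * size // total for size in stratum_sizes]

print(proportional_allocation([25, 10, 5], 8))  # -> [5, 2, 1]
```

With a sampling fraction of \(8/40 = 1/5\) in every stratum, each inhabitant has the same inclusion probability, which is what makes the sample reflect the provinces’ relative sizes.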
Sometimes the characteristics of a population or its environment make it difficult to sample individual units of the population. The cost or inconvenience of sampling units is simply too high compared to the benefits, all things considered, as in the case of investigating per capita food expenditure. It is much easier to determine the monthly food expenditure of a household with a known number of members than to draw a particular citizen at random and determine how much she spends per month. This is because food for all members of a household is usually bought jointly and shared, without recording how much of a product was bought or used by an individual member. The investigated state of affairs regarding individuals exists, and the relevant data could theoretically be obtained—individuals might be randomly selected and asked to record their expenditures or consumption—but this would be very inconvenient for the individuals and would require high compensation for their agreement to participate in the survey. One approach that preserves convenience and thriftiness is to randomly draw and investigate clusters (new sampling units of a higher order), such as households, rather than the units themselves. In other cases, cultural conventions or moral considerations may be worth taking into account, as in the case of determining the value of weekly church donations per person in a particular city. Imagine that no public data is available and you want to estimate this value on the basis of a random sample. In some countries, the amount donated is not formally predetermined, and some parishioners may believe that the amount of an individual’s donation should remain undisclosed.Footnote 9 In this case, a way of collecting data that respects these moral concerns would be to treat parishes with a known number of parishioners and a known total sum of donations as sampling units—clusters.Footnote 10
For Neyman, this type of sampling thus consists of treating groups of individuals as the units of sampling. Clusters, as groups, are collectives of units that are always taken into consideration together: first, some of the clusters are selected at random, and then all members of the selected clusters are included in the sample. Strata, in contrast, are conceived as subsets of a population, and from every stratum some units are drawn at random. For example, if a country’s districts were treated as clusters rather than strata, then the random drawing would apply to the districts themselves: some districts would be randomly selected, and then all the citizens of the selected districts would be given the questionnaire. Sometimes the attributes of a cluster’s elements are measured separately for each element and then aggregated, while in other cases only an aggregate measure is directly available. The second case corresponds to the parish and household examples just mentioned, where measurements of individual elements’ attributes are not available. A clear advantage of cluster sampling is that it seems to capture naturally the structure of many studied populations, and so it may be the only reasonable sampling scheme in the socio-economic realm, for “human populations are rarely spread in single individuals. Mostly they are grouped” (Neyman 1934, 568). This type of sampling was later classified as one-stage cluster sampling, as distinguished from the multi-stage type, in which clusters are randomly selected in the first stage and random selection then continues in the follow-up stage(s) within the selected clusters (see Levy, Lemeshow 2008, 225).
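The one-stage scheme can be sketched in a few lines of Python (our illustration; the names are hypothetical): clusters are drawn at random, and every element of each selected cluster enters the sample.

```python
import random

def one_stage_cluster_sample(clusters, n_clusters, seed=0):
    """Draw n_clusters whole clusters at random and pool their members."""
    rng = random.Random(seed)
    selected = rng.sample(clusters, n_clusters)  # randomness at the cluster level only
    return [unit for cluster in selected for unit in cluster]

# Four 'households' of unequal size; two are drawn, and all their members enter.
households = [["a1", "a2"], ["b1", "b2", "b3"], ["c1"], ["d1", "d2"]]
print(one_stage_cluster_sample(households, 2))
```

In a multi-stage variant, a further random subsample would be taken within each selected cluster instead of including all of its members.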
Sampling of clusters can be combined with stratified sampling. If prior information prompts one to sample clusters instead of the original units of the population, then the original population can be reconceptualised as a population of clusters, and stratification can be performed on this reconceptualised population. Neyman used this approach in his 1934 paper. Still, the assumptions, roles, and consequences of clustering and of stratification may be examined separately, as Neyman himself exemplified.
We turn now to the epistemic consequences of the use of prior information by means of stratification and clustering. Neyman mathematically demonstrated that information on how a population is organised, including socio-economic factors like those mentioned above, can be objectively applied in the process of scientific investigation at the stage of designing the sampling scheme, by means of stratification and clustering. He showed how these factors influence the process of statistical inference, and thus how social values of convenience, thriftiness, or abidance by cultural norms can influence statistical inference while still enabling statistically reliable conclusions and control over the nominal level of false conclusions, as a means to reach the epistemic goal in aspect (\(I\)).
Even arbitrary stratification and/or clustering does not rule out the feasibility of estimation (aspect (\(I\)) of the epistemic goal) using the best linear unbiased estimator (B.L.U.E.),Footnote 11 a conception introduced in Neyman’s 1934 paper and meaning the linear unbiased estimator of minimal variance (Neyman 1934, 563–567). In Neyman’s terminology, the variance of an estimator is inversely related to its accuracy (Neyman 1938a). An increase in the accuracy of estimation means shorter confidence intervals (see Neyman 1937, 371).Footnote 12 That a method of sampling is representative means that it enables consistent estimation of a research variable and of the accuracy of an estimate (see Neyman 1934, 587–588). Consistency of the method of estimation means, in Neyman’s theory, that interval estimation with a predefined confidence level can be ascribed to every sample irrespective of the unknown properties of a population (Neyman 1934, 586). Consistent estimation can be achieved regardless of the variation of the research variable within particular strata, the way a population is divided into strata, and the way the primary entities are organised into clusters (Neyman 1934, 579).
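In modern notation (our reconstruction, not Neyman’s 1934 symbols), the stratified estimator of a population mean and its variance, which governs the length of the resulting confidence intervals, can be written as
\[
\bar{y}_{\mathrm{st}} = \sum_{h=1}^{L} W_h \bar{y}_h, \qquad \operatorname{Var}\left(\bar{y}_{\mathrm{st}}\right) \approx \sum_{h=1}^{L} W_h^{2} \frac{S_h^{2}}{n_h},
\]
where \(W_h = N_h/N\) are the stratum weights, \(\bar{y}_h\) the stratum sample means, \(S_h^{2}\) the within-stratum variances, and \(n_h\) the stratum sample sizes (finite-population corrections are ignored here). Smaller variance means shorter confidence intervals at a fixed confidence level, which is the sense in which the design choices discussed below bear on accuracy.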
Neyman’s analysis of stratified and clustered sampling designsFootnote 13 indicates how to properly implement information, available prior to the onset of the research process, concerning how a population is organised and concerning the relevant socio-economic factors. He mathematically showed that information representing the influence of these factors on sampling and estimation can be implemented in an explicit, objective way without obstructing consistent estimation.
Purposive Selection and Optimum Allocation Sampling
In contrast to the method of stratified sampling (or, more generally, the method of random sampling), purposive selection aims not at random selection but at the maximal representativeness of a sample through the intentional (purposive) selection of certain groups of entities. This selection is based on an investigator’s expert knowledge of general facts about the population in question or on her own experience of the results of previous investigations. This kind of approach may sometimes appear natural to a researcher. For example, consider an ecologist who wants to assess the difference in the blooming periods of certain herb species in two large forest complexes exposed to different climatic conditions. If the investigator knows that a certain factor of secondary interest abnormally disturbs the blooming of the selected species, she might tend to exclude from sampling those forest sites (and thus those individuals of the herb) that are to a large extent subject to local extreme (abnormal) disturbances of this factor. This can be explained as an attempt to minimise the risk of randomly drawing an ‘extreme’ sample whose observational mean would be very distant from the population mean of the blooming period. It seems reasonable in such a case to purposively select specimens growing in sites that represent normal conditions with regard to this factor. By avoiding the risk of selecting an extreme sample, a more representative sample will be selected, which, ideally, should lead to better accuracy in the assessment of the relevant characteristic of the population.
According to Neyman, the basic assumption underlying purposive selection was that the values of an investigated quantity (ascribed to the particular units of the population from which a sample is to be taken) are correlated with an auxiliary variable, and that the regression of these values on the values of this auxiliary variable is linear (Neyman 1934, 571). Neyman stated that if one assumes this hypothesis to be true, then successful purposive selection must pick out units of the population for which the mean value of the auxiliary variable equals, or is at least as close as possible to, its value for the whole population (see Neyman 1934, 571). This can be motivated by a simple example: supposing that average weekly income from donations is positively correlated with the mean age of a parish’s members, then, if most of the parishes in the investigated population were “senior” (in terms of the average age of their members), the sample should include a correspondingly larger number of “senior” parishes than “younger” ones, so that the mean “age” of a parish in the sample is close to the mean age of a parish in the whole population of parishes.
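The rationale for this balancing condition can be stated in modern regression notation (our reconstruction, not Neyman’s own formulas). If the research variable \(y\) depends linearly on the auxiliary variable \(x\),
\[
y_i = \alpha + \beta x_i + \varepsilon_i, \qquad \mathbb{E}[\varepsilon_i] = 0,
\]
and the selection of the sample \(s\) depends on the units only through \(x\), then \(\mathbb{E}[\bar{y}_s] = \alpha + \beta \bar{x}_s\). Choosing the sample so that its auxiliary mean \(\bar{x}_s\) matches the population mean \(\bar{x}\) therefore makes the expected sample mean of \(y\) coincide with the population mean of \(y\).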
As mentioned earlier, purposive selection originally concerned non-probabilistic sampling. Neyman later modified the concept of purposive selection so that it became a special case of random sampling. What was assumed, before Neyman’s paper, to differentiate random sampling from purposive selection was, first, that “the unit is an aggregate, such as a whole district, and the sample is an aggregate of these aggregates” (1934, 570). Neyman showed that the fact that “elements of sampling are […] groups of […] individuals, does not necessarily involve a negation of the randomness of the sampling”. We discussed this in Subsection 2.2 under the label of cluster sampling, as it is called nowadays. Thus, “the nature of the elements of sampling”, whether the unit of sampling is an individual or a cluster (a group of individuals), should not be considered as “constituting any essential difference between random sampling and purposive selection” (1934, 571).
Second, it was assumed, up to the time of Neyman’s analysis, that “the fact that the selection is purposive very generally involves intentional dependence on correlation, the correlation between the quantity sought and one or more known quantities” (1934, 570–571). Neyman showed that this dependence can be reformulated as a special case of stratified sampling, which was by then regarded as a type of random sampling. The effect of joining these two facts was as follows: “the method of stratified sampling by groups (clusters) includes as a special case the method of purposive selection” (1934, 570). Neyman stressed that this reconceptualised purposive sampling can be applied without difficulties only in exceptional cases. As an improved alternative to the method of purposive selection, but also to the method of simple random sampling and to the method of stratified sampling with stratum sample sizes proportional to the sizes of the strata from which they are drawn, Neyman (1934) offered a method that is today called optimum allocation sampling.
Neyman showed, in his analysis of how to minimise the variance of an estimator (and thus the length of the resulting confidence interval) under a stratified sampling design, that the size of a stratum is not the only factor that should be taken into account when determining the required size of the sample from that stratum. It is better for an estimate’s accuracy to also take into account estimates of the standard deviation of the research variable within the strata (Neyman 1933, 92).Footnote 14 The variance of an estimator of a quantity is proportional to the variability of the research variable within strata. Therefore, to minimise the variance of the estimator by optimal sample allocation, the sample size for a stratum should be proportional to the product of the size of the stratum and the variability of the research variable within it (Neyman 1933, 64; 1934, 577–580). If the variability of an auxiliary characteristic is known to be correlated with the variability of the research variable, one can use this information to divide the population into strata that are more homogeneous with regard to the auxiliary variable, which will result in smaller (estimated) variances of the research variable within strata and, subsequently, more accurate estimation (Neyman 1933, 41, 89; 1934, 579–580). Neyman stated that “There is no essential difference between cases where the number of controls is one or more” (Neyman 1934, 571), and if there is more than one known correlation, then one can implement all the relevant knowledge about the manifold existing correlations using the “weighted regression” of the variable of interest upon the multiple controls (see Neyman 1934, 574–575). In the absence of any ready data, estimating the variability of the investigated quantity within strata requires preliminary research; the result of such an initial trial may subsequently be reused as part of the main trial (Neyman 1933, 43–44). When one cannot make any specific assumption about the shape of the regression line of the research variable on the auxiliary variable, “The best we can do is to sample proportionately to the sizes of strata” (Neyman 1934, 581–583). It is important to note that Neyman’s idea of optimum allocation sampling implies unequal inclusion probabilities (Kuusela 2011, 164): sampling units that belong to strata with greater variability of the research variable have a higher inclusion probability.

The methodological ideas proposed here are clear cases of the direct, objective methodological inclusion of prior information about relationships between the sought-after characteristics of the investigated population and other auxiliary characteristics. These ideas demonstrate how sampling design and, eventually, the accuracy of an outcome can depend on the correlation of an investigated quantity with another quantity. If such information is known prior to sampling, it can increase the accuracy of estimation. The same holds for implementing prior information about the estimated variability of an investigated property.Footnote 15
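In modern notation (again our reconstruction rather than Neyman’s original symbols), the optimum allocation rule described above assigns to stratum \(h\), of size \(N_h\) and with within-stratum standard deviation \(\sigma_h\), the sample size
\[
n_h = n \cdot \frac{N_h \sigma_h}{\sum_{k=1}^{L} N_k \sigma_k},
\]
where \(n\) is the fixed total sample size. More variable strata are thus sampled more heavily, and proportional allocation is recovered as the special case in which all the \(\sigma_h\) are equal.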
If clusters are the elements of sampling, minimising their size also increases the accuracy of an estimator (Neyman 1934, 582). Making clusters comprise the same number of entities likewise increases the accuracy (Neyman 1933, 90). What was not addressed by Neyman is that more internally heterogeneous clusters also increase the accuracy of an estimation. So, pre-study information concerning social factors that shape how a human population is structured in terms of the research variable can serve to devise smaller, or more internally varied, clusters so as to increase accuracy.
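A standard modern way to see why these three factors matter (in terms Neyman did not use) is the design effect for one-stage sampling of equal-sized clusters,
\[
\mathrm{deff} = \frac{\operatorname{Var}_{\text{cluster}}}{\operatorname{Var}_{\text{SRS}}} \approx 1 + (m-1)\rho,
\]
the ratio of the estimator’s variance under cluster sampling to its variance under simple random sampling of the same number of individuals, where \(m\) is the cluster size and \(\rho\) the intraclass correlation of the research variable within clusters. Smaller clusters reduce \(m\), and internally heterogeneous clusters reduce \(\rho\); both drive the design effect, and hence the loss of accuracy relative to simple random sampling, down.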
These facts about stratification and clustering indicate that, via Neyman’s theory of sampling and estimation, prior information about the variability of an investigated property, about the dependence of the research variable on auxiliary factors, and about contextual social factors can be implemented through statistical procedures in an objective way to increase the accuracy of estimation. This yields the epistemic benefit of aspect (\(II\)) of the epistemic goal.
Double Sampling
Now we turn to aspects of Neyman’s sampling design that concern a factor that inevitably and essentially influences the processes of collecting evidence and of formulating conclusions, namely prior information regarding the costs of research.
It is taken for granted in statistics that Neyman “invented” (Singh 2003, 529) or “developed” (Breslow 2005, 1) a method called double sampling (Neyman 1938a) or two-phase sampling (Legg, Fuller 2009). Neyman, in his analysis of stratified sampling (1934), proved that if a certain auxiliary characteristic is well known for the population, one can use it to divide the whole population into strata and undertake optimum allocation sampling to improve the accuracy of the original estimate. The problem of double sampling refers, in turn, to the situation in which there is no means of obtaining a sample large enough to give a result of sufficient accuracy, because sampling the variable of interest is very expensive and because knowledge of an auxiliary variable, which could improve the estimate’s accuracy, is not yet available. The first step of the sampling procedure, in this case, is to secure data on the auxiliary variable alone from a relatively large random sample of the population, in order to obtain an accurate estimate of the distribution of this auxiliary character. The second step is to divide the population, as in stratified sampling, into strata according to the value of the auxiliary variable and to draw at random from each stratum a small sample in which data on the research variable are secured (Neyman 1938a, 101–102). Neyman intended this second stage to follow the optimum allocation principle (Neyman 1938b, 153).Footnote 16
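The two-phase procedure can be sketched as follows in Python (our illustrative reconstruction; all names are hypothetical, and the second phase is shown with a fixed per-stratum size rather than Neyman’s optimum allocation).

```python
import random

def double_sample(population, n1, n2_per_stratum, measure_aux, stratum_of, seed=0):
    """Phase 1: measure a cheap auxiliary variable on a large random sample.
    Phase 2: stratify that sample by the auxiliary value and subsample each
    stratum for the expensive research variable."""
    rng = random.Random(seed)
    phase1 = rng.sample(population, n1)  # large, cheap first-phase sample
    strata = {}
    for unit in phase1:
        strata.setdefault(stratum_of(measure_aux(unit)), []).append(unit)
    phase2 = []
    for units in strata.values():  # small, expensive second-phase subsamples
        phase2.extend(rng.sample(units, min(n2_per_stratum, len(units))))
    return phase1, phase2

# Hypothetical usage: income (cheap) as auxiliary for food expenditure (expensive).
population = list(range(1000))
phase1, phase2 = double_sample(
    population, n1=200, n2_per_stratum=5,
    measure_aux=lambda u: u % 100,  # stand-in for the cheap income measurement
    stratum_of=lambda x: x // 25,   # four income strata
)
```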
The main problem in double sampling is how to rationally allocate the total expenditure between the two samplings, so that the size of the first, large sample and that of the second, small sample, as well as the sizes of the samples drawn from the particular strata, are optimal from the perspective of the accuracy of estimation (Neyman 1938b, 155). For example, suppose that the average value of food expenditure per family in a certain district is to be determined. Because the cost of ascertaining the value of this research variable for one sampling unit is very high, limited research funds allow one to take only quite a small sample. However, the attribute in question is correlated with another attribute, for example, a family’s income, whose per-unit sampling cost is relatively low. An estimate of the original attribute can then be obtained for a given expenditure either by a direct random sample of that attribute or by arranging the sampling of the population in the two steps described above.
Neyman provided formulas for the allocation of funds in double sampling that yield greater accuracy of estimation than estimation calculated from data obtained by one-step sampling with the same budget. Nevertheless, in certain circumstances, double sampling will lead to less accurate results. Neyman indicated that certain preliminary information must be available in order to verify whether the sampling pattern will lead to better or worse accuracy and to know how to allocate the funds (Neyman 1938a, 112–115). Thus, double sampling requires prior estimates of the following characteristics: the proportions of individuals belonging to the first-stage strata, the standard deviation of the research variable within strata, the mean values of the research variable in the strata, and, obviously, the per-unit costs of gathering data on the auxiliary variable and on the research variable (see Neyman 1938a, 115).Footnote 17 For double sampling to increase the efficiency of estimation, the two types of costs must differ enough, and the between-stratum variance of the research variable must be sufficiently large compared to the within-stratum variance (Neyman 1938a, 112–115). Hence, to evaluate which of the two methods might be more efficient, prior information concerning the above-indicated properties of the sampled population is required. This information is also needed to determine, approximately, the optimal sizes of the samples (Neyman 1938a, 115).
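The budget constraint behind this trade-off can be written out explicitly (in our notation, not Neyman’s). With a total budget \(C\), a per-unit cost \(c_1\) for the auxiliary variable and \(c_2\) for the research variable, a two-phase design with first- and second-phase sample sizes \(n_1\) and \(n_2\) must satisfy
\[
C = c_1 n_1 + c_2 n_2 ,
\]
and it competes with a one-step design that spends everything on the research variable, sampling \(n = C/c_2\) units directly. Neyman’s formulas choose \(n_1\) and \(n_2\) so as to minimise the estimator’s variance subject to this constraint; the two-phase design can win only if \(c_1\) is sufficiently small relative to \(c_2\) and the stratification by the auxiliary variable removes enough of the variance.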
What we have shown is that the method of double sampling articulates rules for using prior information concerning the structure of a population (with regard to an auxiliary variable interrelated with a research variable), information about the estimated values of a research variable and its variability, as well as typical economic factors: the costs of the different types of data collection and the available research funds. These rigid rules determine the estimation procedure and its effects in an objective manner. More importantly, this method guides a researcher towards the realisation of the second (\(II\)) aspect of the epistemic goal: the correct use of these types of information can increase the accuracy of estimation.