On the Maximum Likelihood Estimation of Population and Domain Means

Estimating of population and domain means based on model-design approaches is considered in this paper. Population elements randomly belong to domains. A joint distribution of the variable under study and an auxiliary variable is assumed. Data are observed in a sample selected from a fixed population. The partition of the sample elements into domains of the population is also known. Outside of the sample, values of the auxiliary variable are known but their partition among the domains is not known. The domain means are estimated based on the likelihood function of the data observed in the sample and outside of it. The maximum likelihood estimation method provides regression-type estimators of domain means of the variable under study. They are dependent on posterior probabilities that observations of the auxiliary variable belong to particular domains. Moreover, the weighted means of the domain averages estimators are used to estimation of the population mean. The accuracy of the evaluated estimators and the ordinary estimator is compared using a simulation analysis. The results of this paper could be useful in economic, demographic and sociological surveys.

Zadło [14] assumed that population elements randomly belong to domains. He presented several examples. For instance, he considered the estimation of the income of different enterprises when they randomly belonged to different investment intervals. Elections are another example of domains. In this case a domain consists of people who vote for a specific party. Often, the selection of a particular party is random because a lot of voters are not committed to voting for any particular party. The model for generating an accounting error (see [13]) can also exemplify domains. An observed value of an accounting document is treated as the outcome of the random variable, which is a mixture of two distribution functions. One of these distribution functions generates the true accounting amount, and the second generates an accounting amount contaminated with an error. Documents without errors belong to the first domain and documents polluted with accounting errors belong to the second domain. Hence, documents randomly belong to the domains. This idea, which is based on distribution mixtures, is developed in this paper.
Many auxiliary variables are usually observed during national censuses. Moreover, variables under study (observed during a census) can be used as auxiliary variables in survey sampling on a subsequent occasion. Therefore, we can expect these variables to be highly correlated. Let us note that apart from the above examples, there are many populations where all values of the auxiliary variable are observed. These can be found in economic, demographic, agricultural and other official registers.
In this paper, the model or model-randomization approaches are taken into account (see, e.g., [8] or [9]. Estimating domain means is usually better, when the estimation is supported by data on auxiliary variables observed outside the sample but usually under the assumption that their distribution among domains is known (see several monographs on Small Area Sampling and, e.g., [10]). The models formulated in this paper are close to those considered by Chambers and Skinner [2]. Estimators of domain averages are derived by means of the maximum pseudo-likelihood method. More precisely, a variant of the likelihood method of estimation based on incomplete data of the variable under study is adopted to estimate distribution mixture parameters. Our analysis is mainly supported by monographs [4,6,7].
The most important results of the paper are as follows: -The pseudo-likelihood function is formulated for estimation of the mixture distribution parameters in the case when data are observed in the sample selected according various inclusion probabilities (Sect. . The random vectors [Y k X k Z k * ], k ∈ U , are independent, and each of them has the same probability distribution. Let P(Z k * = z (h) ) = p h , h = 1, . . . , H , H h=1 p h = 1. Random variable Z k * has multinomial distribution with parameters (1, p 1 , . . . , p H ) (see, e.g., [7]). Event {Y k < y k , X k < x k } with specific feature are mutually exclusive for all h = t and h = 1, . . . , H , t = 1, . . . , H . This and the total probability theorem lets us write the following: is the conditional distribution function. In the case where variables [Y k X k ] are continuous, we have: . . , G, k ∈ U , are density functions. This leads to the conclusion that our model defines the following distribution function: According to the assumptions of this model, the random variable N k=1 Z k * has multinomial probability distribution with parameters ). Let us note that the introduced definitions lead to the conclusion that the sizes and consistencies of the domains are random. Hence, the multinomial probability model leads to partitions of the population into disjoint subsets called domains. Therefore, each outcome of partitioning the population into domains could be different.
Our main aim is to estimate the expected (domain mean) value μ h = E(Y k |Z k * = z (h) ) and the probabilities p h , h = 1, . . . , H . Additionally, estimators of the expected value (population mean) μ = H h=1 p h μ h are proposed. In order to do this, sample s of size n ≤ N is selected from population U according to a sampling design denoted by P(s) ≥ 0, s ∈ S , where S is sampling space and s∈S P(s) = 1. Inclusion probabilities of the sampling design are defined by π k = {s:k∈s,s∈S } P(s), k = 1, . . . , N . Let s = U − s be the complement of s in U .
then s is the empty set.

Maximum Likelihood Estimation
Identifying a domain is possible after observation of variable Z k * in sample s. The density function of the conditional distribution of Therefore, the observed values of the variables in the whole population are defined by the following distribution mixture: We assume that only values x 1 , . . . , x k , . . . , x N are observed in the whole population before selecting a sample. The marginal distribution of X k is as follows: The sample contains the following data on variable values: Hence, the sample contains complete data on the distribution mixture, while outside of the sample, the data are incomplete.
When the sample is selected according to preassigned inclusion probabilities, the pseudo-likelihood approach (see, [3,8,12]) leads to the following function: where the complete and incomplete log-likelihood functions are as follows, respectively: where n h is the size of This means that the sample log-likelihood functions l 1 (d s ) and l 2 (x s ) are designunbiased estimators of the population log-likelihood functions l 1 (d U ) and l 2 (x U ), respectively.
Usually, looking for the maximum of the log-likelihood function l(d s , x s ) is very complex and not exact. An approximation method has to be applied to solve the problem. Therefore, we use the more simple iteration method known as the EMalgorithm (see [4,6,7]). According to this method, function l(d s , x s ) is replaced with the following: where h is the estimator of the expected size of the h-domain in the set s. The Appendix provides outline of how to get optimal values of parameterŝ (t+1) x and the following estimators of probabilities p h : whereN StatisticsN andτ (t) are estimators of N . In general, estimators ofˆ (t+1) could be obtained as roots of the first subsystem of the equation system (21). Moreover, N h is the estimator of the expected values of the domain size N p h . The initial values ofˆ (t) andp (t) h are equal to the roots of system . . , H . When π k , k ∈ U depend on variables from X, the likelihood function under the condition that X = x needs to be consider. Several aspects of this problem were discussed by Pfeffermann [8] on the basis of large literature. Therefore, in order to simplify our considerations, we assume that the inclusion probabilities π k , k ∈ U as well as p h , h = 1, . . . , H could depend on the non-random auxiliary variable that is different from observations of variables from X.
The simple random sample drawn without replacement does not depend on the auxiliary variables. In this case π k = n N for k ∈ U , and the estimator expressed by (7) simplifies to the following form:
In the Appendix, we evaluate estimators of domain means μ y,h and the fraction of the population elements in domains p h , h = 1, . . . , H according to the EM estimation algorithm and expressions (4)- (8). From Expressions (6) and (7), let us write for t = 0, 1, 2, . . . the following: wherep h and p (0) h =p are explained by expressions (7) and (8).
x (t) The following regression-type estimators of μ y,h are derived in the Appendix: where t = 0, 1, 2, . . ., σ 2(t) x,s,h = 1 When the constant of the linear regression y on x is approximately equal to zero, we can use the following ratio-type estimator: Particularly, in the case of a simple random sample drawn without replacement, when π k = n N for all k ∈ U , we have: Generalization of the proposed regression-type estimators into the case of a multidimensional auxiliary variable is as follows. Let Let I a be the unit matrix of degree a and J a be the a-element column vector which all elements are equal to one. The rows of the matrix X could be rewritten in such a way that These let us generalize the estimators defined by expressions (12) and (13) as follows: where t = 0, 1, 2, . . .
Usually, the estimation process is stopped when the number of iterations t reaches the preassigned level T . Some other stopping rules are discussed, e.g., in [6,7]. These works also considered several procedures which assess accuracy of estimators such as bootstrap methods.

Simulation Study
Let simple random samples {s j , j = 1, . . . , M} be independently drawn without replacement from the whole population of size N . We assume that each of them is partitioned between H -domains in such a way that s j = s . . , H . The relative bias of estimators is defined as follows: We assume that M = 10000.
Example 1 Let us consider the following simple set of data on a two-dimensional random variable generated according to two-dimensional normal distribution. The set consists of three domains of the same size equal to 500. Hence, a population of size 1500 is divided into three domains. The data in the h-domain are generated according to normal distribution N (μ x,h , μ y,h , v x,h , v y,h , ρ h ). We will consider the following population partitioned into domains. The domain parameters of the population are: N (8, 4, 1, 1, 0.5), N (14, 11.2, 1, 1, 0.8) and N (20, 19, 1, 1, 0.95). The spread of artificially generated data is shown in Fig. 1.
The simple random sample (s = s 1 ∪ s 2 ∪ s 3 ) is drawn without replacement from the whole population of size N . We assume that the size of each s h, j is 2 ≤ n h ≤ n − 2(H − 1), h = 1, 2, 3, j = 1, . . . , M. In the second column of Table 1 , the domains are identified by integers 1, 2 and 3. Columns 3-6 give the relative efficiency coefficient values e(.) for the domains.
Estimatorsȳ h for n = 75, 150 and in the third domain for n = 150. All considered estimators are practically unbiased because their relative biases (evaluated as the ratio of the bias module by the square root of the mean square error of the estimator) are not larger than 0.1%. Therefore, the biases are not shown in Table 1. Accuracy of the estimators increases when the sample size also increases. When the correlation coefficient seems to be the most universal of the considered estimators and therefore should be preferred. The relative biases ofp (t) h , h = 1, . . . , H , are not larger than 0.5%. Accuracies of these also increase when the sample size increases. They are better than ordinary sample frequenciesp h for n ≥ 45. Hence, the considered procedure could also be used to estimate of the probabilities p h , h = 1, . . . , H of distribution mixtures. The several last rows of Table 1 let us say that all three estimators of the population average are significantly better than the simple sample mean. Moreover, the ratio-type estimator is the most accurate.

Example 2
The second population consists of data published in [11] about Swedich municipalities. We consider data about three variables REV84 (real estate values from 1984), RMT85 (revenues from municipal taxation in 1985) and ME84 (municipal employees in 1984). We take into account these data without the largest outliers. The size of the considered population is 281. The population was partitioned into three domains according to quantiles 30% and 70% of variable REV84. This provided the following sizes of domains: N 1 = 86, N 2 = 109 and N 3 = 86. Real estate valuation depends on market fluctuations. Therefore, the same property today may be in the first domain, but tomorrow it may be in a different domain. Therefore, belonging to a domain can be treated as random.  The population of 281 Swedish municipal units The distributions of variables RMT85 and ME84 have too much asymmetry on the right side, and they differ too significantly from normal distribution. Therefore, we considered their logarithmic transformation, and the spread of this is shown Fig. 2. The domain mean values of logRMT85 were μ 1 = 6.704, μ 2 = 7.5.20 and μ 3 = 8.528. The simulation of estimation accuracy was based on the simple random samples drawn without replacement. The sizes of the samples were: 8 (2.85% the population size), 14 (4.98%) and 28(2.96%). Table 2 shows only the accuracies of the estimation of population mean because the estimators of the domain means were less accurate than the simple random sample mean. Analysis of Table 2 lets us say that all of the estimators of the population mean that are taken into account, are more accurate than the simple random sample mean. The accuracy of the second regression estimator is the best among the considered ones, and similarly its relative bias is the smallest.

Example 3
Let us consider data about current and starting salaries of employees that are available in the SPSS statistical packages as the example dataset. The set consists of 474 observations. The two data domains are identified. The first domain of 390 observations is the set of clerks and the second one consists of 84 managers. In general,  Population of employees an employee randomly belongs to one of these domains, because one day he could be a manager and the next day he could be a clerk, and vice versa. The starting and current salaries in the first domain (clerks) are $14164 and $28054, respectively. The starting and current salaries in the second domain (managers) are $28091 and $63978, respectively. The spread of the data partitioned into domains is shown in Fig. 3. The following sizes of samples were taken into account: 15 (3.2%), 24 (5.1%) and 48 (10.1%). The results of the simulation are shown in Table 3. Similar to example 2, this table shows only the accuracy of the estimation of the population mean because the estimators of domain means were less accurate than the simple random sample mean. Table 3 leads to the conclusions that the all estimators are more accurate than the simple sample mean for sample size n > 24 except the ratio estimator because its deff coefficients are less than 100% for all sample sizes. The accuracy of the ratio estimator decreases when the sample size increases. The relative biases of the estimators are quite large.

Analysis of
Analysis of all the tables and figures lets us say that estimation of domain means is possible only when data observed in domains are well separated. The more optimistic conclusion is that the proposed estimators of population means are always more accurate than the simple random sample in all considered cases, when the sample size is at least 5% of the population size. Their biases are also rather acceptable. Of those estimators of the population mean, the second regression estimator and the ratio estimator are the best and could be used in practical research.
The presented simulation analyzes will be continued in a wider scope in the next article. In particular, different mixtures of at least three-dimensional probability distributions will be considered. In addition, various modifications to the estimators used herein will be proposed, leading to a more accurate estimation of the domain averages.

Conclusions
Three estimators of domain means use additional auxiliary data in order to improve estimation accuracy. Properties of the maximum likelihood method let us to derive new estimators of domain and population means. The simulation analysis shows which is the best estimator for occasions when the domains are sufficiently well separated. This separation may not to be very obvious when estimating the population mean. In this case, all of the estimators of population means were better than the simple random sample mean. The considered estimation method lets us also estimate the probabilities of the distribution mixtures. Generalization of the regression estimators for a multidimensional auxiliary variable was also shown.
Some other generalization or modifications of the estimation procedure are possible. Auxiliary variables observed in censuses or in official registers can be used to improve the efficiency of estimating means. We can consider distributions other than normal as elements of the mixture. For instance, expenditures or incomes in domains could be modeled by means of asymmetric distributions like lognormal or gamma distributions.