Predictive Estimation of Finite Population Mean in Case of Missing Data Under Two-phase Sampling

The present paper deals with the problem of estimation of finite population mean of study variable using two auxiliary variables in two-phase sampling scheme using predictive approach in case of missing values of the study variable and unknown population mean of first auxiliary variable. Four classes of such estimators have been proposed using this predictive approach. The expressions of bias and mean square errors are derived up to first order of approximation. The optimal values of the constants involved in the proposed classes of estimators have been obtained and thus minimum mean square errors of the proposed classes are obtained in this study. The empirical and graphical comparisons with regression type estimators (under single phase and double phase sampling scheme) and also among themselves have been made for evaluating the performance of the proposed classes for different choices of non-responding units. Five real data sets and three simulated data sets following normal distribution have been used to evaluate the performance of the proposed classes. Numerical findings confirm the theoretical results obtained regarding superiority of proposed classes of estimators over the conventional regression type estimators in terms of percent relative efficiencies.


Introduction
Data missingness has grown to be a significant issue for practitioners, and understanding its stochastic character is crucial for choosing the strategies used to address incomplete data. In medical and social sciences, collecting, analyzing and drawing inferences from data is crucial for research. Missing data can seriously affect inferences from randomized clinical trials if they are not handled properly. Hansen and Hurwitz [1] first introduced how to deal with the problem of incomplete samples in mail surveys. Further, Rubin [2] gave a detailed description of the concepts Missing at Random (MAR), Observed at Random (OAR), Missing Completely at Random (MCAR) and Parameter Distinctness (PD). Researchers have developed a variety of imputation strategies over the years so that samples are structurally complete and analyses may be performed successfully. These techniques fill in missing values with an appropriate function of the given data. Sande [3] and Kalton et al. [4] suggested imputation methods that make an incomplete data set structurally complete. Mean imputation, ratio imputation, regression imputation, hot deck imputation, cold deck imputation and nearest neighbor imputation are some popular imputation techniques. Researchers have made continuous efforts to devise improved estimators of the population mean of the study variable by developing efficient imputation techniques; see Refs. [5-13] and many more. In the survey sampling literature, numerous authors have taken non-response into account when estimating the population variance of the study variable, such as Singh and Joarder [14], Sharma [15], Singh et al. [16], Singh and Khalid [17], Sharma and Singh [18], Singh et al. [19], Basit and Bhatti [20], and Singh and Khalid [21].
Auxiliary information is frequently used in sample surveys to increase the accuracy of an estimator, whether at the planning stage, the design stage, the estimation stage, or a combination of these. When complete information on the auxiliary variable is available, ratio, regression and conventional imputation techniques can be used to predict the population mean of the study variable. If the population mean of the auxiliary variable is not available, however, a two-phase sampling scheme is employed. Chand [22] suggested a method of chaining the information on first and second auxiliary variables correlated with the study variable, which was further extended by many authors such as Kiregyera [23,24], Srivastava [25], Singh et al. [26], Gupta and Shabbir [27], Chaudhary and Singh [28], Kumar and Sharma [29], Mehta and Tailor [30] and many more.
Basu [31] gave a predictive approach and used it to predict non-sampled values; this approach was further advocated by many researchers, who used existing estimators as predictors in different types of statistical models. In this direction, major contributions are Sahoo and Panda [32], Sahoo and Sahoo [33], Sahoo et al. [34], Saini [35], Singh et al. [36], Singh and Singh [37], Yadav and Mishra [38], Bandyopadhyay and Singh [39], Singh et al. [40], Kumar and Saini [41] and many more, who gave more efficient predictive estimators of the finite population mean. To date, no predictive approach has been applied when complete data on the study variable are not available. Motivated by the recent work of the above authors using the predictive approach with complete data, we propose different classes of estimators of the population mean under the predictive approach (in a double sampling scheme using two auxiliary variables) when observations on the variable of interest are missing and the population mean of the first auxiliary variable is unknown.

Methodology and Notations
Let $U = (U_1, U_2, \ldots, U_N)$ be a finite population consisting of $N$ identifiable units. Let $Y$, $X$ and $Z$ be the character under study (whose mean $\bar{Y}$ is to be estimated) and the first and second auxiliary variables, respectively. Here, $Y$ is assumed to be positively correlated with both $X$ and $Z$, with its correlation with $X$ stronger than its correlation with $Z$. In addition, we suppose that the population mean of $X$ is unknown, but the population mean of $Z$ is available in advance. Let $(Y_i, X_i, Z_i)$ be the values of the variables $(Y, X, Z)$ for the $i$th unit of the population.
Let a first phase sample $S'$ of size $n'$ be drawn from the population using simple random sampling without replacement (SRSWOR). Let $S$ be the second phase sample of size $n < n'$ selected from the first phase sample. On these $n$ units, the variables $Y$, $X$ and $Z$ are measured. Suppose information on $Y$ is obtained for $r$ units but is missing for the remaining $(n - r)$ units. It is also assumed that information on $X$ and $Z$ is available for all $n$ units. Let the set of responding units be denoted by $R$ and the set of non-responding units by $R^c$, so that $S = R \cup R^c$. For each $i \in R$, the observation $y_i$ is known. For the units $i \in R^c$, however, the $y_i$ values are missing and imputed values have to be derived.
The following notations are used throughout the paper:
$\bar{Y}, \bar{X}, \bar{Z}$: population means of the study variable, first, and second auxiliary variable, respectively.
$\bar{y}_r, \bar{x}_r, \bar{x}_n, \bar{z}_n, \bar{x}_{n'}, \bar{z}_{n'}$: sample means of the respective variables for the sample sizes shown in the suffixes.
$\rho_{yx}, \rho_{yz}, \rho_{xz}$: correlation coefficients between the variables shown in the subscripts for the whole population.
$\beta_{yx}, \beta_{yz}, \beta_{xz}$: population regression coefficients of the variables shown in the subscripts.
$b_{yx}(r)$: sample regression coefficient between $y$ and $x$ based on the sample of size $r$.
$S_y^2, S_x^2, S_z^2$: population variances (with divisor $N - 1$) of the variables $Y$, $X$ and $Z$, respectively.
$S_{yx}, S_{yz}, S_{xz}$: population covariances (with divisor $N - 1$) between the variables shown in the subscripts.
$C_y, C_x, C_z$: coefficients of variation of the variables shown in the subscripts.
$s_{yx}(r)$: sample covariance (with divisor $r - 1$) between $y$ and $x$ based on the sample of size $r$.
$s_x^2(r)$: sample variance (with divisor $r - 1$) of the variable $x$ based on the sample of size $r$.
Also, for the sake of simplicity, define:
In the following section, we mention some existing imputation techniques for the case when the population mean $\bar{X}$ is unknown and the second auxiliary variable $Z$ is absent. The corresponding point estimators of $\bar{Y}$ and their variances or minimum mean square errors are given; these are used later for comparison purposes.

Some Traditional Imputation Methods
In this section, some traditional methods of imputation are given under single and double sampling strategies, assuming that $\bar{X}$ is unknown and replacing $\bar{X}$ with $\bar{x}_{n'}$.
(i) Simple mean method of imputation (under single phase sampling scheme).
The imputation scheme under the simple mean method of imputation is:
The corresponding point estimator for $\bar{Y}$ is:
The variance of this estimator is:
(ii) Ratio method of imputation (under single phase sampling scheme).
The imputation scheme under the ratio method of imputation is:
where $\hat{a}$ =
The corresponding point estimator for $\bar{Y}$ is:
The mean square error (MSE) of this estimator is:
(iii) Compromised method of imputation (Singh and Horn [42]) (under single phase sampling scheme).
The imputation scheme under the compromised method of imputation is:
where $\alpha$ is a suitably chosen constant.
The corresponding point estimator for $\bar{Y}$ is:
The MSE of the estimator is derived as:
(iv) Regression method of imputation (under single phase sampling scheme).
The imputation scheme under the regression method of imputation is:
where $b_{yx}(r) = s_{yx}(r)/s_x^2(r)$ and $\hat{c} = \bar{y}_r - b_{yx}(r)\,\bar{x}_r$.
The corresponding point estimator for $\bar{Y}$ is:
and the MSE is as follows:
(v) Regression method of imputation (under double sampling scheme).
The corresponding point estimator for $\bar{Y}$ is:
The MSE is obtained as:
A number of authors have developed estimators of $\bar{Y}$ in case of missing data under a two-phase sampling scheme, replacing $\bar{x}_{n'}$ by an improved estimator of $\bar{X}$ when another auxiliary variable $Z$ exists. In the following section, we propose some new efficient classes of estimators of $\bar{Y}$ using the predictive approach advocated by Basu [31], when two auxiliary variables $X$ and $Z$ are present.
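As a numerical illustration of the single-phase schemes (i), (ii) and (iv), the following Python sketch imputes the missing $y$ values and checks that the mean of the completed sample reduces to a closed form. It is a sketch under stated assumptions rather than a reproduction of the displayed formulas: it assumes the standard textbook versions of these schemes (non-respondents receive $\bar{y}_r$, $\hat{b}x_i$ with $\hat{b} = \sum_{i \in R} y_i / \sum_{i \in R} x_i$, or $\hat{c} + b_{yx}(r)x_i$, respectively), and the function names and toy data are hypothetical.

```python
import math


def mean(values):
    """Arithmetic mean of a non-empty sequence."""
    return sum(values) / len(values)


def slope(x, y):
    """Sample regression coefficient s_yx(r) / s_x^2(r) from the responding units."""
    xb, yb = mean(x), mean(y)
    s_yx = sum((a - xb) * (b - yb) for a, b in zip(x, y))
    s_xx = sum((a - xb) ** 2 for a in x)
    return s_yx / s_xx


def completed_mean(y_resp, x_resp, x_nonresp, method):
    """Mean of the sample after imputing y for the non-responding units."""
    y_r, x_r = mean(y_resp), mean(x_resp)
    if method == "mean":
        imputed = [y_r for _ in x_nonresp]
    elif method == "ratio":
        b_hat = sum(y_resp) / sum(x_resp)          # assumed ratio coefficient
        imputed = [b_hat * x for x in x_nonresp]
    elif method == "regression":
        b = slope(x_resp, y_resp)
        c_hat = y_r - b * x_r                      # intercept as defined in (iv)
        imputed = [c_hat + b * x for x in x_nonresp]
    else:
        raise ValueError(f"unknown method: {method}")
    return mean(list(y_resp) + imputed)


# Hypothetical toy sample: r = 5 respondents, n - r = 3 non-respondents.
y_resp = [12.0, 15.0, 11.0, 18.0, 14.0]
x_resp = [6.0, 8.0, 5.0, 9.0, 7.0]
x_nonresp = [7.5, 6.5, 10.0]

y_r, x_r = mean(y_resp), mean(x_resp)
x_n = mean(x_resp + x_nonresp)
b = slope(x_resp, y_resp)

# Closed forms that the completed-sample means should reproduce.
closed_forms = {
    "mean": y_r,
    "ratio": y_r * x_n / x_r,
    "regression": y_r + b * (x_n - x_r),
}
for method, expected in closed_forms.items():
    assert math.isclose(completed_mean(y_resp, x_resp, x_nonresp, method), expected)
```

The check makes explicit why imputation and point estimation are two views of the same procedure: each imputation rule determines the estimator obtained by averaging the completed sample.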

Proposed Classes of Estimators Under Predictive Approach
To build estimators of $\bar{Y}$, we represent $\bar{Y}$ as
by decomposing the population $U$ into four mutually exclusive domains:
Since the first term on the right-hand side of (1) is known, the problem is to predict the quantities $\bar{y}_1$, $\bar{y}_2$ and $\bar{y}_3$ from the sampled data. If $T_1$, $T_2$ and $T_3$ are their respective predictors, then under the predictive approach an estimator of $\bar{Y}$ is given as:
If no additional auxiliary data on the population are available, the simplest choice for $T_1$, $T_2$, $T_3$ is $\bar{y}_r$, which gives $\hat{Y} = \bar{y}_r$. The goal of this study, however, is to propose efficient estimators (or classes of estimators) of $\bar{Y}$ using the predictive approach when two auxiliary variables $X$ and $Z$ are present.
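The predictive form described above can be sketched in Python. This is a minimal sketch, not the paper's formula: it assumes that the three unobserved domains are the non-respondents in $S$ (size $n - r$), the first phase units outside $S$ (size $n' - n$) and the population units outside $S'$ (size $N - n'$), and that the estimator is the size-weighted combination of the respondent mean and the three predictors. The function names and the numbers are hypothetical.

```python
import math


def predictive_estimate(y_r, r, n, n1, N, T1, T2, T3):
    """Combine the observed respondent total with predicted domain totals.

    y_r        : mean of y over the r responding units
    T1, T2, T3 : predictors of the mean of y in the three unobserved domains
    n1         : first phase sample size n'
    Domain sizes (n - r, n1 - n, N - n1) are an assumption of this sketch.
    """
    if not (0 < r <= n <= n1 <= N):
        raise ValueError("need 0 < r <= n <= n1 <= N")
    total = r * y_r + (n - r) * T1 + (n1 - n) * T2 + (N - n1) * T3
    return total / N


def remainder_mean(mean_big, size_big, mean_small, size_small):
    """Mean over the units of a larger set that lie outside a nested smaller set."""
    return (size_big * mean_big - size_small * mean_small) / (size_big - size_small)


# With every predictor equal to the respondent mean, the estimator
# collapses to the respondent mean itself.
y_r = 251.3
same = predictive_estimate(y_r, r=600, n=1100, n1=1500, N=2000, T1=y_r, T2=y_r, T3=y_r)
assert math.isclose(same, y_r)

# A hypothetical ratio-type predictor for the non-respondents in S:
# the mean of x over R^c follows from the means over S and over R.
x_r, x_n = 129.8, 130.1
x_bar_1 = remainder_mean(x_n, 1100, x_r, 600)
T1_ratio = y_r * x_bar_1 / x_r
estimate = predictive_estimate(y_r, 600, 1100, 1500, 2000, T1_ratio, y_r, y_r)
```

Separating the predictors from the weighting step keeps the choice of $T_1$, $T_2$ and $T_3$ open, which is what allows different ratio and regression type predictors to be plugged in.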
Since data on $X$ are available at the sample level, we predict the $Y$-values in the domains $R_1$ and $R_2$ using $X$-values only, by taking $T_1 = \phi_1(\bar{y}_r, \bar{X}_1, \bar{x}_r)$ and $T_2 = \phi_2(\bar{y}_r, \bar{X}_2, \bar{x}_n)$, where $\phi_1$ and $\phi_2$ are functions of $(\bar{y}_r, \bar{X}_1, \bar{x}_r)$ and $(\bar{y}_r, \bar{X}_2, \bar{x}_n)$, respectively, such that
They also satisfy certain regularity conditions:
1. Whatever be the sample chosen, the sample point $(\bar{y}_r, \bar{X}_1, \bar{x}_r)$ assumes values in a bounded, closed convex subset $P$ of $\mathbb{R}^3$ containing the point $(\bar{Y}, \bar{X}, \bar{X})$.
2. The function $\phi_1(\bar{y}_r, \bar{X}_1, \bar{x}_r)$ is continuous and bounded in $P$.
3. The third order partial derivatives of $\phi_1(\bar{y}_r, \bar{X}_1, \bar{x}_r)$ exist and are continuous and bounded in $P$.
Journal of Statistical Theory and Applications (2023) 22:283-308
Similar regularity conditions are also assumed for the sample point $(\bar{y}_r, \bar{X}_2, \bar{x}_n)$ and the function $\phi_2(\bar{y}_r, \bar{X}_2, \bar{x}_n)$.
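For a function of three arguments meeting regularity conditions of this kind, the generic second-order Taylor expansion about $(\bar{Y}, \bar{X}, \bar{X})$ can be written out as below. This is a sketch of the general mathematical form only: the symbol $\phi_1$ for the predictor function is an assumption, and $h_i$, $h_{ij}$ denote its first and second order partial derivatives evaluated at the expansion point.

```latex
\begin{aligned}
\phi_1(\bar{y}_r, \bar{X}_1, \bar{x}_r) \approx{}& \phi_1(\bar{Y}, \bar{X}, \bar{X})
  + (\bar{y}_r - \bar{Y})\,h_1 + (\bar{X}_1 - \bar{X})\,h_2 + (\bar{x}_r - \bar{X})\,h_3 \\
&+ \tfrac{1}{2}\Big[ (\bar{y}_r - \bar{Y})^2 h_{11} + (\bar{X}_1 - \bar{X})^2 h_{22}
  + (\bar{x}_r - \bar{X})^2 h_{33} \\
&\qquad + 2(\bar{y}_r - \bar{Y})(\bar{X}_1 - \bar{X})\,h_{12}
  + 2(\bar{y}_r - \bar{Y})(\bar{x}_r - \bar{X})\,h_{13}
  + 2(\bar{X}_1 - \bar{X})(\bar{x}_r - \bar{X})\,h_{23} \Big]
\end{aligned}
```

The same form applies to the second predictor function with $h'_i$, $h'_{ij}$ in place of $h_i$, $h_{ij}$ and $(\bar{X}_2, \bar{x}_n)$ in place of $(\bar{X}_1, \bar{x}_r)$.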
The expansions of $\phi_1(\bar{y}_r, \bar{X}_1, \bar{x}_r)$ and $\phi_2(\bar{y}_r, \bar{X}_2, \bar{x}_n)$ about the point $(\bar{Y}, \bar{X}, \bar{X})$ using Taylor's series up to second order (neglecting remainder terms) give expressions in which $h_i$, $h'_i$ are the first order partial derivatives and $h_{ij}$, $h'_{ij}$ are the second order partial derivatives of $\phi_1(\bar{y}_r, \bar{X}_1, \bar{x}_r)$ and $\phi_2(\bar{y}_r, \bar{X}_2, \bar{x}_n)$ $(i = 1, 2, 3;\ j = 1, 2, 3)$, respectively, at the point $(\bar{Y}, \bar{X}, \bar{X})$.
Since information on $Z$ is available at the population level, we make four different selections for $T_3$ and thus obtain four different estimators for $\bar{Y}$ using result (2). These selections involve a function $\phi_3(\bar{y}_r, \bar{Z}_3, \bar{z}_{n'})$ satisfying regularity conditions similar to those for $\phi_1(\bar{y}_r, \bar{X}_1, \bar{x}_r)$ and $\phi_2(\bar{y}_r, \bar{X}_2, \bar{x}_n)$. The expansion of $\phi_3(\bar{y}_r, \bar{Z}_3, \bar{z}_{n'})$ about the point $(\bar{Y}, \bar{Z}, \bar{Z})$ in a second order Taylor's series, neglecting remainder terms, gives an expression in which $h''_i$ are the first order partial derivatives and $h''_{ij}$ are the second order partial derivatives of $\phi_3$. For each selection, $\hat{Y}$ then follows from (2).
Remark 4.1 It may be noted that the selection of the predictors $\phi_1(\bar{y}_r, \bar{X}_1, \bar{x}_r)$, $\phi_2(\bar{y}_r, \bar{X}_2, \bar{x}_n)$ and $\phi_3(\bar{y}_r, \bar{Z}_3, \bar{z}_{n'})$ is not unique. Many estimators for $\bar{Y}$ can be suggested for different choices of ratio and regression type predictors, as displayed in Table 1.
In the following section, we shall discuss properties of the proposed classes of estimators.Up to first order of approximation, the expressions of bias and mean square errors are obtained.

Biases and Mean Square Errors of Proposed Classes of Estimators $\hat{Y}_i$ $(i = 1, 2, 3, 4)$
For the derivation of the bias and mean square errors of the proposed classes of estimators $\hat{Y}_i$ $(i = 1, 2, 3, 4)$, we take the following:
$\bar{x}_r = \bar{X}(1 + e_9)$
Assuming a large sample situation, the expectations used are as under:
All four proposed estimators $\hat{Y}_i$ $(i = 1, 2, 3, 4)$ can be represented in terms of the $e_i$'s. So, on retaining terms up to the second degree of the $e_i$'s only, we have:

(i) For class of estimators Ŷ1
On taking expectations on both sides of result (3), the bias of $\hat{Y}_1$, up to terms of order $o(n^{-1})$, is:
The expression of $\mathrm{MSE}(\hat{Y}_1)$, up to first order of approximation, will be:
The optimum choices of $h_2$ and $h'_2$ are obtained by minimizing the mean square error given in Eq. (7) with respect to $h_2$ and $h'_2$. So, these are obtained as:
Hence, for the class of estimators $\hat{Y}_1$, the minimum mean square error is:

(ii) For class of estimators Ŷ2
The bias of $\hat{Y}_2$, obtained by taking expectations on both sides of result (4), up to terms of order $o(n^{-1})$, is as follows:
The expression of $\mathrm{MSE}(\hat{Y}_2)$, up to first order of approximation, will be:
The optimum choices of $h_2$, $h'_2$ and $h''_2$ are obtained by minimizing the mean square error given in Eq. (9) with respect to $h_2$, $h'_2$ and $h''_2$, and these are given as:
For the class of estimators $\hat{Y}_2$, the minimum mean square error is obtained as:

(iii) For class of estimators $\hat{Y}_3$
On taking expectations on both sides of result (5), the bias of the estimator $\hat{Y}_3$, up to terms of order $o(n^{-1})$, is:
The expression of $\mathrm{MSE}(\hat{Y}_3)$, up to first order of approximation, will be:
The optimum choices of $h_2$, $h'_2$, $k_1$ and $k_2$ are obtained by minimizing the mean square error given in Eq. (11) with respect to $h_2$, $h'_2$, $k_1$ and $k_2$, and these are given as:
The minimum mean square error of the class of estimators $\hat{Y}_3$ is obtained as:

(iv) For class of estimators $\hat{Y}_4$
On taking expectations on both sides of result (6), the bias of the estimator $\hat{Y}_4$, up to terms of order $o(n^{-1})$, is:
The expression of $\mathrm{MSE}(\hat{Y}_4)$, up to first order of approximation, will be:
The optimum choices of $h_2$, $h'_2$, $k_1$ and $k_2$ are obtained by minimizing the mean square error given in Eq. (13) with respect to $h_2$, $h'_2$, $k_1$ and $k_2$, and are given as:
The minimum mean square error of the class of estimators $\hat{Y}_4$ is obtained as:

Relative Performances of Proposed Classes of Estimators
Here, we compare the MSEs of the proposed classes of estimators with those of the regression type estimators in one phase and two phase sampling. The following results are derived:
In addition to the above results, we also noticed the following result:
From these results, it is clear that all four classes are superior to the regression type estimators, both in single phase and double phase sampling, with the qualification that class $\hat{Y}_1$ is superior under the optimal condition.

Numerical Illustrations
To support the theoretical results obtained in the previous section, we have considered five empirical data sets and three hypothetical data sets.

Empirical Study
The effectiveness of our suggested classes of estimators is demonstrated using the following five empirical data sets. The performance of the suggested imputation approaches is compared in terms of their percent relative efficiencies. We have calculated the percent relative efficiencies (PREs) of the estimators with respect to the estimator $\bar{y}_{reg1}$ for different response rates by using the following expression:
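For orientation, the usual definition of percent relative efficiency with respect to a reference estimator is written out below. It is stated here as the standard textbook form, on the assumption that the comparison uses the ratio of mean square errors with $\bar{y}_{reg1}$ as the reference.

```latex
\mathrm{PRE}\big(\hat{Y}_i,\ \bar{y}_{reg1}\big)
  = \frac{\mathrm{MSE}\big(\bar{y}_{reg1}\big)}{\mathrm{MSE}\big(\hat{Y}_i\big)} \times 100,
  \qquad i = 1, 2, 3, 4 .
```

Under this definition a value above 100 means that the estimator has a smaller mean square error than the reference estimator.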

Simulation Study
In this section, a simulation study is conducted using R software to examine the percent relative efficiency of the suggested estimators in the presence of non-response. The simulation study is carried out in the following stages:
Stage I: Obtain a first phase sample $S'$ of size $n'$ from the population of size $N$ using the SRSWOR scheme.
Stage II: From the first phase sample, draw a second phase sample $S$ of size $n$, again using the SRSWOR scheme.
Stage III: Remove $(n - r)$ sample units from the sample $S$ at random.
Stage IV: Impute the dropped units using the proposed imputation techniques considered for the sample.
Stage V: Obtain the value of the estimator $\hat{Y}$ of $\bar{Y}$.

The following is a description of the artificial data sets: An artificial population of size $N = 2000$ is generated, involving $Y$, $X$ and $Z$ as the study variable, first and second auxiliary variables, respectively. The study variable $Y$ is highly correlated with $X$, while it is correlated with $Z$ due to its correlation with $X$. These variables are generated from a multivariate normal distribution with theoretical mean vector $[250, 130, 400]$ and theoretical covariance matrix $\Sigma = \begin{bmatrix} 1.5 & 1.2 & 1.0 \\ 1.2 & 1.2 & 0.8 \\ 1.0 & 0.8 & 2 \end{bmatrix}$. We have taken $n' = 1500$ and $n = 1100$.
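The simulation stages and the artificial population described above can be sketched as follows. This is a Python sketch using only the standard library, not the R code used in the study, and it does not implement the proposed classes. The stand-in estimators (the respondent mean and two regression type estimators using $\bar{x}_n$ and $\bar{x}_{n'}$), the response count r = 600, the number of replications, the seed, the row-major reading of the covariance matrix and all function names are assumptions made for illustration.

```python
import math
import random

MEAN = [250.0, 130.0, 400.0]            # theoretical means of (Y, X, Z)
COV = [[1.5, 1.2, 1.0],
       [1.2, 1.2, 0.8],
       [1.0, 0.8, 2.0]]                 # covariance matrix, assumed row-major


def cholesky(a):
    """Lower-triangular L with L L^T = a (a must be positive definite)."""
    size = len(a)
    low = [[0.0] * size for _ in range(size)]
    for i in range(size):
        for j in range(i + 1):
            s = sum(low[i][k] * low[j][k] for k in range(j))
            if i == j:
                low[i][j] = math.sqrt(a[i][i] - s)
            else:
                low[i][j] = (a[i][j] - s) / low[j][j]
    return low


def make_population(N, rng):
    """Draw N units (y, x, z) from the multivariate normal distribution."""
    low = cholesky(COV)
    pop = []
    for _ in range(N):
        g = [rng.gauss(0.0, 1.0) for _ in MEAN]
        pop.append([m + sum(low[i][k] * g[k] for k in range(i + 1))
                    for i, m in enumerate(MEAN)])
    return pop


def col_mean(pop, idx, j):
    """Mean of variable j over the units whose indices are in idx."""
    return sum(pop[i][j] for i in idx) / len(idx)


def one_replication(pop, n1, n, r, rng):
    first = rng.sample(range(len(pop)), n1)    # Stage I: first phase sample
    second = rng.sample(first, n)              # Stage II: second phase sample
    resp = rng.sample(second, r)               # Stage III: keep r responders
    y_r, x_r = col_mean(pop, resp, 0), col_mean(pop, resp, 1)
    s_yx = sum((pop[i][1] - x_r) * (pop[i][0] - y_r) for i in resp)
    s_xx = sum((pop[i][1] - x_r) ** 2 for i in resp)
    b = s_yx / s_xx
    x_n, x_n1 = col_mean(pop, second, 1), col_mean(pop, first, 1)
    # Stages IV-V: stand-in point estimators in closed form
    # (no explicit imputation step is carried out here).
    return {"mean": y_r,
            "reg1": y_r + b * (x_n - x_r),
            "reg2": y_r + b * (x_n1 - x_r)}


def simulate(K=500, N=2000, n1=1500, n=1100, r=600, seed=1):
    """Empirical MSEs over K replications and PREs relative to 'reg1'."""
    rng = random.Random(seed)
    pop = make_population(N, rng)
    y_bar = col_mean(pop, range(N), 0)         # finite population mean of Y
    sq_err = {"mean": 0.0, "reg1": 0.0, "reg2": 0.0}
    for _ in range(K):
        for name, est in one_replication(pop, n1, n, r, rng).items():
            sq_err[name] += (est - y_bar) ** 2
    mse = {name: total / K for name, total in sq_err.items()}
    pre = {name: 100.0 * mse["reg1"] / value for name, value in mse.items()}
    return mse, pre
```

Calling `simulate()` returns the empirical mean square errors and the PREs relative to the `reg1` stand-in; varying `r` shows how the comparison changes with the number of responding units.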
To give a precise idea of the performance of the proposed classes of estimators $\hat{Y}_i$ $(i = 1, 2, 3, 4)$, the section below shows a graphical depiction of the estimators' PREs for various response values.
To improve the readability of the results and allow a better comparison across values of $r$, we present the PREs of the regression type estimator in two-phase sampling and of the suggested classes of estimators for various values of $r$ graphically. To give a rough idea, the graphical representation is shown only for Population 1, as a similar trend is followed by all the proposed estimators in the remaining Populations 2-8. These results are shown in Fig. 1. The behavior of each estimator is shown in a specific color, and the numbers 1, 2, 3, 4, 5 correspond to the estimators $\bar{y}_{reg2}$, $\hat{Y}_1$, $\hat{Y}_2$, $\hat{Y}_3$, $\hat{Y}_4$, respectively.
(i) It is observed that, in terms of percent relative efficiency, all the proposed classes of estimators $\hat{Y}_i$ $(i = 1, 2, 3, 4)$ perform better than the conventional regression-type estimators defined in both single phase and two-phase sampling schemes. Also, there is a considerable gain in efficiency over the conventional regression-type estimators in all the proposed classes of estimators.
(ii) For high response rates, i.e. as the response rate in the sample increases, all the proposed classes of estimators give sufficiently large gains in efficiency over the regression type estimator in single phase sampling.
(iii) It is observed that as the response rate rises, the percent relative efficiencies of all the proposed classes also increase, which indicates that the proposed classes perform more efficiently if the number of responding units is sufficiently large.
(iv) It has also been observed that the PRE of the proposed class of estimators $\hat{Y}_4$ is the maximum, which reveals that this class is superior among all the proposed classes of estimators of $\bar{Y}$.

Conclusions
In the present study, four efficient classes of estimators have been proposed for estimation of the population mean under two-phase sampling using two auxiliary variables. It has been seen that the PREs of the proposed classes increase with the response rate, which indicates that the proposed classes of estimators perform better when a large number of responding units is available. A large number of estimators can be formed using different predictors belonging to the proposed classes, giving a large variety of strategies. The proposed classes of estimators can be used in real-life scenarios for efficient estimation of the population mean.

Fig. 1 Behavior of the PREs of the proposed estimators for different values of r