A model robust sub-sampling approach for Generalised Linear Models in Big data settings

In the modern era of Big data, computationally efficient and scalable methods are needed to support timely insights and informed decision making. One such method is sub-sampling, where a subset of the Big data is analysed and used as the basis for inference rather than the whole data set. A key question when applying sub-sampling approaches is how to select an informative subset based on the questions being asked of the data. A recent approach for this has been proposed based on determining sub-sampling probabilities for each data point, but a limitation of this approach is that appropriate sub-sampling probabilities rely on an assumed model for the Big data. In this article, to overcome this limitation, we propose a model robust approach where a set of models is considered, and the sub-sampling probabilities are evaluated based on the weighted average of probabilities that would be obtained if each model was considered singularly. Theoretical support for such an approach is provided. Our model robust sub-sampling approach is applied in a simulation study and in two real world applications, where performance is compared to current sub-sampling practices. The results show that our model robust approach outperforms alternative approaches.


Introduction
Big data analysis has gained interest in recent years through providing new insights and unlocking hidden knowledge in different fields of study (Karmakar and Mukhopadhyay 2020), including medicine (Rehman, Naz, and Razzak 2021), fraud detection (Vaughan 2020), the oil and gas industry (Nguyen, Gosine, and Warrian 2020) and astronomy (Zhang and Zhao 2015). However, the analysis of Big data can be challenging for traditional statistical methods and standard computing environments (Wang et al. 2016). Martinez-Mosquera, Navarrete, and Lujan-Mora (2020) discuss storage and modelling solutions when handling such a large amount of data. In general, modelling solutions can be grouped into three broad categories: 1) sub-sampling methods, where the analysis is performed on an informative sub-sample obtained from the Big data (Kleiner et al. 2014; Ma, Mahoney, and Yu 2015; Ma and Sun 2015; Drovandi et al. 2017; Wang, Zhu, and Ma 2018; Wang, Yang, and Stufken 2019; Yao and Wang 2019; Ai et al. 2021a; Cheng, Wang, and Yang 2020; Lee, Schifano, and Wang 2021; Ai et al. 2021b; Yao and Wang 2021); 2) divide and recombine methods, where the Big data set is divided into smaller blocks, and then the intended statistical analysis is performed on each block and subsequently recombined for inference (Lin and Xi 2011; Guha et al. 2012; Cleveland and Hafen 2014; Chang, Lin, and Wang 2017; Li, Hung, and Xie 2020); 3) online updating of streamed data, where statistical inference is updated as new data arrive sequentially (Schifano et al. 2016; Xue et al.
2020). In recent years, sub-sampling has been applied to a wider variety of regression problems than divide and recombine methods, while online updating is typically only used for streaming data. In addition, in cases where a large data set is not needed to answer a specific question with sufficient confidence, sub-sampling seems preferable, as the analysis of the data can often be undertaken with standard methods. Moreover, the computational efficiency of sub-sampling over analysing large data sets has been observed for parameter estimation in linear (Wang, Yang, and Stufken 2019) and logistic (Wang, Zhu, and Ma 2018) regression models. For these reasons, we focus on sub-sampling methods in this article.
The key challenge for sub-sampling methods is how to obtain an informative sub-sample that can be used to efficiently answer specific analysis questions and provide results that align with the analysis of the whole Big data set. Two approaches for this exist in the literature: 1) randomly sample from the Big data with sub-sampling probabilities that are found based on a specific statistical model and objective (e.g., prediction, parameter estimation) (Wang, Zhu, and Ma 2018; Yao and Wang 2019; Wang, Yang, and Stufken 2019; Ai et al. 2021a; Cheng, Wang, and Yang 2020; Ai et al. 2021b; Lee, Schifano, and Wang 2021; Yao and Wang 2021); 2) select sub-samples based on an experimental design (Drovandi et al. 2017; Deldossi and Tommasi 2022). Randomly sampling with certain probabilities (based upon the definitions of A- or L-optimality criteria, see Atkinson, Donev, and Tobias (2007)) is the focus of this article, and has been applied for parameter estimation in a wide range of regression problems including softmax (Yao and Wang 2019) and quantile regression (Ai et al. 2021a), and generalised linear models (Wang, Zhu, and Ma 2018; Wang, Yang, and Stufken 2019; Cheng, Wang, and Yang 2020; Ai et al. 2021b; Yao and Wang 2021). In contrast, the approach based on an experimental design has only been applied for: 1) parameter estimation in logistic fixed and mixed effects regression models (Drovandi et al. 2017); 2) parameter estimation and prediction accuracy in linear and logistic regression models (Deldossi and Tommasi 2022).
A key feature of both of the current sub-sampling approaches is that they rely on a statistical model that is assumed to appropriately describe the Big data. Given this is a potentially limiting assumption, Yu and Wang (2022) proposed to select the best candidate model from a pool of models based on the Bayesian Information Criterion (BIC) (Schwarz 1978). This was applied to linear models, and resulted in sub-sampling probabilities that were more appropriate than those based on considering a single model. Similarly, for linear regression, Shi and Tang (2021) and Meng et al. (2021) explored using space filling and orthogonal Latin hypercube experimental design techniques to allow for potential model misspecification. In this paper, we propose that, instead of selecting a single best candidate model for the Big data, a set of models is considered and a model averaging approach is used for determining the sub-sampling probabilities. Through adopting such an approach, it is thought that the analysis goal (e.g., efficient parameter estimation) should be achieved regardless of the preferred model for the data. To implement this model robust approach, we consider sub-sampling based on A- and L-optimality criteria within the Generalised Linear Modelling framework, and provide theoretical support for using a model averaged approach based on each of these criteria. Given we consider Generalised Linear Models (GLMs), our approach should be generally applicable across many areas of science and technology where a variety of data types are observed. This is demonstrated through applying our proposed methods within a simulation study, and for the analysis of two real world Big data problems.
The remainder of the article is structured as follows. Section 2 introduces GLMs and the existing probability-based sub-sampling approach of Ai et al. (2021b) and Yao and Wang (2021). Our proposed model robust sub-sampling approach, which is embedded within the GLM framework, is introduced in Section 3. A simulation study is then used to assess the performance of our model robust approach in Section 4, and two real world applications are presented. Section 5 concludes the article with a discussion of the results and some suggestions for future research.

Background
There are a variety of ways in which Big data can be sub-sampled. In this section, we focus on the approach where sub-sampling probabilities are determined for each data point, and the Big data is sub-sampled (at random) based on these probabilities. Such an approach was first proposed by Wang, Zhu, and Ma (2018) for logistic regression problems in Big data settings, and has been extended to a wide range of regression problems (e.g., Yao and Wang (2019), Wang, Yang, and Stufken (2019), Ai et al. (2021a), Cheng, Wang, and Yang (2020), Ai et al. (2021b), and Yao and Wang (2021)). In this section, we describe such a sub-sampling approach as applied to GLMs based on the work of Ai et al. (2021b).

Generalised Linear Models
Let a Big data set be denoted as F_N = (X_0, y), where X_0 = (x_{01}, . . . , x_{0N})^T ∈ R^{N×p} represents a data matrix based on the Big data set with p covariates, y = (y_1, . . . , y_N)^T represents the response vector and N is the total number of data points. To fit a GLM, consider the model matrix X = h(X_0) ∈ R^{N×(p+q)}, where h(.) is some function of X_0 which creates an additional q columns representing, for example, an intercept and/or higher-order terms. A GLM can then be defined via three components: 1) the distribution of the response y, which is from the exponential family (e.g., Normal, Binomial or Poisson); 2) the linear predictor η = Xθ, where θ = (θ_1, . . . , θ_{p+q})^T is the parameter vector; and 3) the link function g(.), which links the mean of the response to the linear predictor (Nelder and Wedderburn 1972). Throughout this article, the inverse link function g^{-1}(.) is denoted by u(.).
A common exponential form for the probability density or mass function of y can be written as

$$f(y \mid \omega, \gamma) = \exp\left\{ \frac{y\omega - \psi(\omega)}{a(\gamma)} + b(y, \gamma) \right\}, \qquad (1)$$

where ψ(.), a(.) and b(.) are some functions, ω is known as the natural parameter and γ the dispersion parameter. Based on Equation (1), the link function g(.) can then be defined as g(µ) = η, where µ = E[y | X, θ] = dψ(ω)/dω. A general linear model, or linear regression model, is a special case of a GLM where y ∼ N(µ, Σ) and g(.) is the identity link function g(µ) = µ, such that µ = Xθ. For logistic regression, y ∼ Bin(n, π) and g(.) is the logit link function g(π) = log(π/(1 − π)), such that g(π) = Xθ. Similarly, for Poisson regression, y ∼ Poisson(λ) and g(.) is the log link function g(λ) = log(λ), such that g(λ) = Xθ. Note that the dispersion parameter γ = 1 for the logistic and Poisson regression models.
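To make the three components concrete, the sketch below encodes the inverse link u(.) = g^{-1}(.) for the three GLMs discussed above. This is an illustrative snippet only (in Python, whereas the code accompanying this article is in R); the toy model matrix and parameter values are hypothetical.

```python
import numpy as np

# Illustrative sketch (not from the paper): each GLM discussed is determined
# by an inverse link u(.) = g^{-1}(.) mapping the linear predictor
# eta = X @ theta to the mean of the response.
inverse_links = {
    "linear": lambda eta: eta,                       # identity link: mu = eta
    "logistic": lambda eta: 1 / (1 + np.exp(-eta)),  # logit link: pi = e^eta / (1 + e^eta)
    "poisson": lambda eta: np.exp(eta),              # log link: lambda = e^eta
}

X = np.array([[1.0, 0.5], [1.0, -1.2]])  # toy model matrix (intercept + one covariate)
theta = np.array([0.2, 0.8])
eta = X @ theta
for name, u in inverse_links.items():
    print(name, u(eta))
```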

A general sub-sampling algorithm for GLMs
As described by Ai et al. (2021b), consider a general sub-sampling approach to estimate the parameters θ through a weighted log-likelihood function for GLMs (the weights are the inverse of the sub-sampling probabilities). A weighted likelihood function is considered since an unweighted likelihood leads to biased estimates of model parameters for logistic regression, see Wang (2019). Define φ_i as the probability that row i of F_N is randomly selected, for i = 1, . . . , N, where Σ_{i=1}^N φ_i = 1 and φ_i ∈ (0, 1). A sub-sample S of size r is then drawn with replacement from F_N based on φ = (φ_1, . . . , φ_N). The selected responses, covariates and sub-sampling probabilities are then used to estimate the model parameters. Pseudo-code for this general sub-sampling approach is provided in Algorithm 1.
From Algorithm 1, the first step is to assign sub-sampling probabilities φ_i to the rows of F_N. The simplest approach is to assign each data point an equal probability of being selected. These probabilities could also depend on the composition of y, e.g., for binary data, one could sample proportional to the inverse of the number of successes and failures. Based on these probabilities, a sub-sample S of size r is then drawn completely at random (with replacement). Based on this sub-sample, the model parameters θ are estimated via maximising a weighted log-likelihood function. The estimates θ̃ can then be considered as estimates of what would be obtained if the whole Big data set were analysed.

Algorithm 1: General sub-sampling algorithm (Ai et al. 2021b)
1 Sampling: Assign φ_i, i = 1, . . . , N, for F_N. Based on φ, draw a sub-sample of size r with replacement from F_N to yield S = {x*_l, y*_l, φ*_l}_{l=1}^r = (X*, y*, φ*).
2 Estimation: Based on S, find the weighted maximum likelihood estimate θ̃ = arg max_θ Σ_{l=1}^r (1/φ*_l) log f(y*_l | x*_l, θ).
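As an illustrative sketch of Algorithm 1 (written in Python rather than the article's R, with a simulated toy data set, so the names and data below are assumptions rather than the authors' implementation), the sampling and weighted estimation steps for logistic regression might look as follows, with the weighted maximum likelihood estimate computed by Newton's method.

```python
import numpy as np

rng = np.random.default_rng(1)

def weighted_logistic_fit(X, y, phi, iters=25):
    # Estimation step of Algorithm 1 (sketch): maximise the weighted
    # log-likelihood, with weights equal to the inverse sub-sampling
    # probabilities, via Newton's method.
    theta = np.zeros(X.shape[1])
    w = 1.0 / phi
    for _ in range(iters):
        p = 1 / (1 + np.exp(-X @ theta))
        grad = X.T @ (w * (y - p))                    # weighted score
        H = (X * (w * p * (1 - p))[:, None]).T @ X    # weighted information
        theta += np.linalg.solve(H, grad)
    return theta

def general_subsample(X, y, phi, r):
    # Sampling step of Algorithm 1: draw r rows with replacement
    # with probabilities phi, then estimate on the sub-sample.
    idx = rng.choice(len(y), size=r, replace=True, p=phi)
    return weighted_logistic_fit(X[idx], y[idx], phi[idx])

# Hypothetical Big data set: N = 5000 with one covariate plus an intercept.
N = 5000
X = np.column_stack([np.ones(N), rng.normal(size=N)])
y = rng.binomial(1, 1 / (1 + np.exp(-X @ np.array([-1.0, 0.5]))))
theta_tilde = general_subsample(X, y, np.full(N, 1.0 / N), r=1000)
print(theta_tilde)  # roughly recovers (-1, 0.5)
```

Here uniform probabilities φ_i = 1/N are used, which corresponds to the simplest assignment described above.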
The asymptotic properties of θ̃ based on the general sub-sampling approach given in Algorithm 1 were derived by Ai et al. (2021b). These properties are outlined below as they form the basis for our extensions to model robust sub-sampling detailed in Section 3.

Asymptotic properties of θ̃ from the general sub-sampling algorithm
To explore the asymptotic properties of the estimates obtained from the general sub-sampling algorithm, Ai et al. (2021b) made a number of assumptions, which are outlined below.
To follow these, note that ψ̇(u(θ^T x_i)) and u̇(θ^T x_i) denote the first-order derivatives of ψ(u(θ^T x_i)) and u(θ^T x_i) (with respect to θ), respectively, with two dots similarly denoting the second-order derivatives. In addition, the Euclidean norm of a vector a will be denoted as ||a|| = (a^T a)^{1/2}.
H1. Assume that Xθ lies in the interior of a compact set K ∈ Ω almost surely.
H2. The regression coefficient θ is an inner point of a compact domain.
H4. As N → ∞, the observed information matrix converges in probability to a positive definite matrix.
H6. Assume N^{-2} Σ_{i=1}^N ||x_i||^s / φ_i = O_P(1) for s = 2, 4.
H7. For some δ > 0, assume N^{-(2+δ)} Σ_{i=1}^N ||x_i||^{2+δ} / φ_i^{1+δ} = O_P(1).
Assumptions H1 and H2 ensure that E[y_i | x_i] < ∞ for all i, and were used by Clémençon, Bertail, and Chautru (2014) when investigating the impact of survey sampling with unequal inclusion probabilities on (stochastic) gradient descent-based estimation methods in Big data problems. H2 defines an admissible set which is required for ensuring consistency of estimates of model parameters for GLMs, see Fahrmeir and Kaufmann (1985). Assumption H4 ensures (asymptotically) that J_X, the observed Fisher information, is defined for the given model, and that it is positive definite and therefore non-singular. To obtain a Bahadur representation of the sub-sampled estimator, which is often useful in determining the asymptotic properties of statistical estimators, Assumptions H3 and H5 are needed. H6 and H7 are moment conditions on the covariates and sub-sampling probabilities. Assumption H7 is required by the Lindeberg-Feller central limit theorem, see Van der Vaart (2000).
Based on the above assumptions, the following theorems were proved by Ai et al. (2021b). The first theorem establishes that the estimator from the sub-sampling algorithm is consistent, in that ||θ̃ − θ̂_MLE|| can be made small with a large sub-sample size r, where θ̂_MLE is the maximum likelihood estimator of θ based on F_N. The second theorem establishes that the approximation error, θ̃ − θ̂_MLE, given F_N, is asymptotically Normally distributed with mean zero and variance-covariance matrix V.
Theorem 1. If H1 to H7 hold then, as N → ∞ and r → ∞, θ̃ is consistent to θ̂_MLE in conditional probability given F_N. Moreover, the rate of convergence is r^{-1/2}. That is, with probability approaching one, for any ε > 0, there exist finite Δ_ε and r_ε such that

$$P\left(\|\tilde{\theta} - \hat{\theta}_{MLE}\| \geq r^{-1/2}\Delta_\varepsilon \mid F_N\right) < \varepsilon \quad \text{for all } r > r_\varepsilon.$$

Theorem 2. If H1 to H7 hold, then as N → ∞ and r → ∞, conditional on F_N in probability,

$$V^{-1/2}(\tilde{\theta} - \hat{\theta}_{MLE}) \to N(0, I) \quad \text{in distribution, where } V = J_X^{-1} V_c J_X^{-1}.$$

Theorems 1 and 2 hold for any choice of sub-sampling probabilities φ satisfying the above assumptions, so φ can be selected to target a specific analysis goal (e.g., parameter estimation, prediction, etc.). To address this, Ai et al. (2021b) proposed determining φ based on an optimality criterion from experimental design (specifically A-optimality and L-optimality), and this led to the proposal of Theorems 3 and 4 (below). Such theorems provide the optimal choice of the sub-sampling probabilities to minimise the asymptotic mean squared error of θ̃ (or tr(V)) and J_X θ̃ (or tr(V_c)), denoted as φ^{mMSE} and φ^{mVc}, respectively. We note that such sub-sampling probabilities are conditional on the assumption of a model being appropriate to describe the Big data.
Theorem 3. The sub-sampling strategy is A-optimal if the sub-sampling probability is chosen such that

$$\phi_i^{mMSE} = \frac{|y_i - \dot{\psi}(u(\hat{\theta}_{MLE}^T x_i))| \, \|J_X^{-1} \dot{u}(\hat{\theta}_{MLE}^T x_i) x_i\|}{\sum_{j=1}^{N} |y_j - \dot{\psi}(u(\hat{\theta}_{MLE}^T x_j))| \, \|J_X^{-1} \dot{u}(\hat{\theta}_{MLE}^T x_j) x_j\|}, \qquad i = 1, \ldots, N.$$

Optimal sub-sampling probabilities from Theorem 3 can be computationally expensive, as they require the calculation of J_X^{-1}. Hence, the linear- or L-optimality criterion was proposed by Ai et al. (2021b), which minimises the asymptotic mean squared error of J_X θ̃, or tr(J_X V J_X), and corresponds to minimising the variance of a linear combination of the parameters (Atkinson, Donev, and Tobias 2007). This led to the following theorem.
Theorem 4. The sub-sampling strategy is L-optimal if the sub-sampling probability is chosen such that

$$\phi_i^{mVc} = \frac{|y_i - \dot{\psi}(u(\hat{\theta}_{MLE}^T x_i))| \, \|\dot{u}(\hat{\theta}_{MLE}^T x_i) x_i\|}{\sum_{j=1}^{N} |y_j - \dot{\psi}(u(\hat{\theta}_{MLE}^T x_j))| \, \|\dot{u}(\hat{\theta}_{MLE}^T x_j) x_j\|}, \qquad i = 1, \ldots, N.$$

Wang, Zhu, and Ma (2018) applied Theorems 3 and 4 to obtain the following optimal sub-sampling probabilities for logistic regression:

$$\phi_i^{mMSE} = \frac{|y_i - p_i| \, \|J_X^{-1} x_i\|}{\sum_{j=1}^{N} |y_j - p_j| \, \|J_X^{-1} x_j\|}, \qquad \phi_i^{mVc} = \frac{|y_i - p_i| \, \|x_i\|}{\sum_{j=1}^{N} |y_j - p_j| \, \|x_j\|},$$

with p_i = exp(θ̂_MLE^T x_i) / (1 + exp(θ̂_MLE^T x_i)) and J_X = N^{-1} Σ_{h=1}^N p_h(1 − p_h) x_h x_h^T. Ai et al. (2021b) applied Theorems 3 and 4 to obtain the following optimal sub-sampling probabilities for Poisson regression:

$$\phi_i^{mMSE} = \frac{|y_i - \lambda_i| \, \|J_X^{-1} x_i\|}{\sum_{j=1}^{N} |y_j - \lambda_j| \, \|J_X^{-1} x_j\|}, \qquad \phi_i^{mVc} = \frac{|y_i - \lambda_i| \, \|x_i\|}{\sum_{j=1}^{N} |y_j - \lambda_j| \, \|x_j\|},$$

with λ_i = exp(θ̂_MLE^T x_i) and J_X = N^{-1} Σ_{h=1}^N λ_h x_h x_h^T. Unfortunately, in practice, the optimal sub-sampling probabilities φ^{mMSE} and φ^{mVc} cannot be determined, as they depend on θ̂_MLE. To address this, based on Theorems 1 and 2, Ai et al. (2021b) proposed a two stage sub-sampling strategy where an initial random sample of the Big data is used to estimate θ̂_MLE; this estimate is then used with the results from Theorems 3 and 4 to provide estimates of the optimal sub-sampling probabilities. Such an approach is thus termed a two stage sub-sampling approach, and is outlined in the next section.
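For logistic regression, the two families of probabilities above can be computed directly. The sketch below (Python, illustrative only; the data and the stand-in estimate for θ̂_MLE are hypothetical) evaluates φ proportional to |y_i − p_i| ||J_X^{-1} x_i|| (mMSE) or |y_i − p_i| ||x_i|| (mVc).

```python
import numpy as np

def logistic_optimal_probs(X, y, theta_hat, criterion="mMSE"):
    # Sketch of the logistic regression probabilities above:
    #   phi_i proportional to |y_i - p_i| * ||J_X^{-1} x_i||  (mMSE, A-optimality)
    #   phi_i proportional to |y_i - p_i| * ||x_i||           (mVc, L-optimality)
    # with theta_hat standing in for the unavailable full-data MLE.
    p = 1 / (1 + np.exp(-X @ theta_hat))
    resid = np.abs(y - p)
    if criterion == "mMSE":
        J = (X * (p * (1 - p))[:, None]).T @ X / len(y)  # observed information J_X
        norms = np.linalg.norm(np.linalg.solve(J, X.T), axis=0)
    else:  # "mVc": avoids inverting J_X
        norms = np.linalg.norm(X, axis=1)
    phi = resid * norms
    return phi / phi.sum()

rng = np.random.default_rng(2)
N = 2000
X = np.column_stack([np.ones(N), rng.normal(size=(N, 2))])
theta = np.array([-0.5, 0.6, -0.4])
y = rng.binomial(1, 1 / (1 + np.exp(-X @ theta)))
phi_mmse = logistic_optimal_probs(X, y, theta, "mMSE")
phi_mvc = logistic_optimal_probs(X, y, theta, "mVc")
print(phi_mmse.sum(), phi_mvc.sum())  # each normalises to 1
```

Note the mVc branch illustrates the computational advantage of L-optimality: no matrix inversion is required.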

Optimal sub-sampling algorithm for GLMs
Theorems 1 to 4 provide a theoretical basis for the optimal sub-sampling algorithm of Ai et al. (2021b). In this section, we describe this approach for GLMs, which is outlined in Algorithm 2. This is a two stage algorithm where the general sub-sampling algorithm is initially applied. For this, the sub-sampling probabilities could be 1/N for all observations, or may be based on a stratified sampling technique. Based on this initial sub-sample, estimates θ̂_{r_0} are obtained. Then, optimal sub-sampling probabilities are estimated using the results from Theorems 3 and 4, where θ̂_MLE is approximated by θ̂_{r_0}.
Algorithm 2: Two stage optimal sub-sampling algorithm for GLMs.
Stage 1
1 Random sampling: Assign φ = (φ_1, . . . , φ_N) for F_N. For example, in logistic regression φ_i = φ^{prop} represents proportional sub-sampling probabilities that are based on the response composition, while φ_i = N^{-1} gives uniform sub-sampling probabilities for Poisson regression. According to φ, draw a random sub-sample of size r_0, S_{r_0} = {x_l^{r_0}, y_l^{r_0}, φ_l^{r_0}}_{l=1}^{r_0} = (X^{r_0}, y^{r_0}, φ^{r_0}).
2 Estimation: Based on S_{r_0}, find θ̂_{r_0} by maximising the weighted log-likelihood function.

Stage 2
3 Optimal sub-sampling probability: Estimate φ^{mMSE} or φ^{mVc} from Theorems 3 and 4 using θ̂_{r_0} in place of θ̂_MLE.
4 Optimal sub-sampling and estimation: Based on the estimated sub-sampling probabilities, draw a sub-sample S_r completely at random (with replacement) of size r from F_N, such that S_r = {x_l^r, y_l^r, φ_l^r}_{l=1}^r = (X^r, y^r, φ^r). Form S_{r_0+r} by combining S_{r_0} and S_r, and obtain the estimates θ̂ by maximising the weighted log-likelihood function based on S_{r_0+r}.

The first stage of the algorithm is as shown in Algorithm 1. In the second stage, r ≥ r_0 data points are sampled with replacement from the Big data through the optimal sub-sampling probabilities (estimated based on the model parameters obtained from stage one). These two sub-sampled data sets are then combined to yield a single data set, and this data set is used to estimate the model parameters. Once these estimates have been obtained, Ṽ, the variance-covariance matrix of θ̂, can be estimated via Ṽ = J̃_X^{-1} Ṽ_c J̃_X^{-1}, where Ṽ_c is the corresponding estimate of V_c evaluated on the combined sub-sample. Specific versions of the above algorithm under the logistic and Poisson regression models are given in Appendix B in Algorithms 4 and 5, respectively.
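The two stages can be sketched end-to-end for logistic regression with mVc probabilities. This is an illustrative Python sketch under assumed toy data, not the authors' R implementation; it also applies the small-ε truncation discussed in the next section so that no probability is exactly zero.

```python
import numpy as np

rng = np.random.default_rng(3)

def wfit(X, y, phi, iters=25):
    # Weighted logistic MLE via Newton's method (weights are 1/phi); sketch only.
    theta = np.zeros(X.shape[1])
    w = 1.0 / phi
    for _ in range(iters):
        p = 1 / (1 + np.exp(-X @ theta))
        H = (X * (w * p * (1 - p))[:, None]).T @ X
        theta += np.linalg.solve(H, X.T @ (w * (y - p)))
    return theta

def two_stage(X, y, r0, r):
    # Sketch of Algorithm 2 for logistic regression with mVc probabilities.
    N = len(y)
    # Stage 1: uniform sub-sample and pilot estimate.
    phi0 = np.full(N, 1.0 / N)
    idx0 = rng.choice(N, size=r0, replace=True)
    theta0 = wfit(X[idx0], y[idx0], phi0[idx0])
    # Stage 2: estimated L-optimal probabilities, then an optimal sub-sample.
    p = 1 / (1 + np.exp(-X @ theta0))
    phi = np.abs(y - p) * np.linalg.norm(X, axis=1)
    phi = np.maximum(phi, 1e-6)   # keep every probability strictly positive
    phi /= phi.sum()
    idx1 = rng.choice(N, size=r, replace=True, p=phi)
    # Combine both sub-samples (with their respective weights) and re-estimate.
    Xc = np.vstack([X[idx0], X[idx1]])
    yc = np.concatenate([y[idx0], y[idx1]])
    return wfit(Xc, yc, np.concatenate([phi0[idx0], phi[idx1]]))

N = 10000
X = np.column_stack([np.ones(N), rng.normal(size=N)])
y = rng.binomial(1, 1 / (1 + np.exp(-X @ np.array([-1.0, 0.5]))))
theta_hat = two_stage(X, y, r0=200, r=1000)
print(theta_hat)  # roughly recovers (-1, 0.5)
```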

Limitations of the optimal sub-sampling algorithm
The above optimal sub-sampling approach has a number of limitations. One such limitation is the computational expense involved in obtaining the optimal sub-sampling probabilities, as these need to be found for each data point in the Big data set. To address this, Lee, Schifano, and Wang (2021) introduced a faster two stage sub-sampling procedure for GLMs using the Johnson-Lindenstrauss Transform (JLT) and the Sub-sampled Randomised Hadamard Transform (SRHT), which are techniques for reducing matrix volume. Another limitation is that, as the approximated optimal sub-sampling probabilities are proportional to |y_i − ψ̇(u(θ̂^T x_i))|, an observation with y_i ≈ ψ̇(u(θ̂^T x_i)) has a near zero probability of being selected, and data points with y_i = ψ̇(u(θ̂^T x_i)) will never be sub-sampled. Ai et al. (2021b) proposed to resolve this by introducing ε (a small positive value, e.g., 10^{-6}) to constrain the optimal sub-sampling probabilities, replacing |y_i − ψ̇(u(θ̂^T x_i))| with max(|y_i − ψ̇(u(θ̂^T x_i))|, ε), which ensures such data points have a non-zero probability of being selected. Lastly, one of the major limitations of the approach, which has not been addressed previously, is the inherent assumption that the Big data can be appropriately described by a given model. That is, the sub-sampling probabilities are evaluated based on an assumed model, and they are only optimal for this model. We suggest that this is a substantial limitation, as specifying such a model in practice can be difficult. This motivates the development of methods that yield sub-sampling probabilities that are robust to the choice of model, and our proposed approach for this is outlined next.
Model robust optimal sub-sampling method

In order to apply the above two stage sub-sampling approach, optimal sub-sampling probabilities need to be evaluated, and these are based on a model that is assumed to appropriately describe the data. In practice, determining such a model may be difficult, and there could be a variety of models that appropriately describe the data. Hence, a sub-sampling approach that provides robustness to the choice of model is desirable. For this, we propose to consider a set of Q models which can be constructed to encapsulate a variety of scenarios that may be observed within the data. For each model in this set, define model probabilities α_q, for q = 1, . . . , Q, such that Σ_{q=1}^Q α_q = 1, which represent our a priori belief about the appropriateness of each model. Denote the model matrix for the q-th model as X_q = h_q(X_0), i.e., some function of the data matrix X_0. To apply a sub-sampling approach for this model set, sub-sampling probabilities are needed, and they should be constructed such that the resulting data are expected to address the analysis aim, regardless of which model is actually preferred for the Big data. For this purpose, we propose to form these sub-sampling probabilities by taking a weighted average (based on α_q) of the sub-sampling probabilities that would be obtained for each model (singularly). This is the basic approach of our model robust optimal sub-sampling algorithm, for which further details are provided below, including a theoretical basis for constructing the sub-sampling probabilities in this way.

Properties of model robust optimal sub-sampling algorithm
Updating the notation of the model matrix from X to X_q for the q-th model subsequently leads to analogous definitions of x_{qi}, θ_q, θ̂_{qMLE}, θ̃_q, J_{X_q}, V_q and V_q^c. Assuming H1 to H7 hold for each of the Q models, Theorems 1 and 2 apply straightforwardly to each of the models. Extending the ideas from Section 2.3, optimal sub-sampling probabilities can be selected based on certain optimality criteria to, for example, ensure efficient estimates of parameters across the Q models. This leads to the following theorems.
Theorem 5. For a set of Q models with model probability α_q for the q-th model, q = 1, . . . , Q, if the sub-sampling probabilities are selected as

$$\phi_i^{mMSE} = \sum_{q=1}^{Q} \alpha_q \frac{|y_i - \dot{\psi}(u(\hat{\theta}_{qMLE}^T x_{qi}))| \, \|J_{X_q}^{-1} \dot{u}(\hat{\theta}_{qMLE}^T x_{qi}) x_{qi}\|}{\sum_{j=1}^{N} |y_j - \dot{\psi}(u(\hat{\theta}_{qMLE}^T x_{qj}))| \, \|J_{X_q}^{-1} \dot{u}(\hat{\theta}_{qMLE}^T x_{qj}) x_{qj}\|},$$

for i = 1, . . . , N, with Σ_{q=1}^Q α_q = 1, then Σ_{q=1}^Q α_q tr(V_q) attains its minimum.
The proof of Theorem 5 is available in Appendix A, which is an extension of the proof of Theorem 3 in Ai et al. (2021b).
Theorem 6. For a set of Q models with model probability α_q for the q-th model, q = 1, . . . , Q, if the sub-sampling probabilities are selected as

$$\phi_i^{mVc} = \sum_{q=1}^{Q} \alpha_q \frac{|y_i - \dot{\psi}(u(\hat{\theta}_{qMLE}^T x_{qi}))| \, \|\dot{u}(\hat{\theta}_{qMLE}^T x_{qi}) x_{qi}\|}{\sum_{j=1}^{N} |y_j - \dot{\psi}(u(\hat{\theta}_{qMLE}^T x_{qj}))| \, \|\dot{u}(\hat{\theta}_{qMLE}^T x_{qj}) x_{qj}\|},$$

for i = 1, . . . , N, with Σ_{q=1}^Q α_q = 1, then Σ_{q=1}^Q α_q tr(V_q^c) attains its minimum.
The proof of Theorem 6 follows straightforwardly from the proof of Theorem 5.
The above theorems can be applied to obtain model robust optimal sub-sampling probabilities under the logistic regression model as follows:

$$\phi_i^{mMSE} = \sum_{q=1}^{Q} \alpha_q \frac{|y_i - p_i^q| \, \|J_{X_q}^{-1} x_{qi}\|}{\sum_{j=1}^{N} |y_j - p_j^q| \, \|J_{X_q}^{-1} x_{qj}\|}, \qquad \phi_i^{mVc} = \sum_{q=1}^{Q} \alpha_q \frac{|y_i - p_i^q| \, \|x_{qi}\|}{\sum_{j=1}^{N} |y_j - p_j^q| \, \|x_{qj}\|},$$

with p_i^q = exp(θ̂_{qMLE}^T x_{qi}) / (1 + exp(θ̂_{qMLE}^T x_{qi})) and J_{X_q} = N^{-1} Σ_{h=1}^N p_h^q (1 − p_h^q) x_{qh} x_{qh}^T. Similarly, model robust optimal sub-sampling probabilities under the Poisson regression model can be obtained as follows:

$$\phi_i^{mMSE} = \sum_{q=1}^{Q} \alpha_q \frac{|y_i - \lambda_i^q| \, \|J_{X_q}^{-1} x_{qi}\|}{\sum_{j=1}^{N} |y_j - \lambda_j^q| \, \|J_{X_q}^{-1} x_{qj}\|}, \qquad \phi_i^{mVc} = \sum_{q=1}^{Q} \alpha_q \frac{|y_i - \lambda_i^q| \, \|x_{qi}\|}{\sum_{j=1}^{N} |y_j - \lambda_j^q| \, \|x_{qj}\|},$$

with y_i ∈ N_0 (the non-negative integers), λ_i^q = exp(θ̂_{qMLE}^T x_{qi}), J_{X_q} = N^{-1} Σ_{h=1}^N λ_h^q x_{qh} x_{qh}^T and i = 1, . . . , N.
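The combination step itself is simple: each model contributes its own (already normalised) optimal probability vector, and these are averaged with the model probabilities α_q. A minimal Python sketch, using randomly generated stand-ins for the per-model probability vectors (hypothetical, for illustration only):

```python
import numpy as np

def model_robust_probs(per_model_probs, alphas):
    # Sketch of the model robust probabilities above: the alpha-weighted
    # average of each model's (already normalised) optimal probabilities.
    phi = sum(a * phi_q for a, phi_q in zip(alphas, per_model_probs))
    return phi / phi.sum()   # re-normalise to guard against rounding

rng = np.random.default_rng(4)
N, Q = 1000, 3
# Hypothetical per-model optimal probabilities (each vector sums to one).
per_model = [rng.dirichlet(np.ones(N)) for _ in range(Q)]
alphas = np.full(Q, 1.0 / Q)   # equal a priori model probabilities
phi = model_robust_probs(per_model, alphas)
print(phi.sum())  # 1.0 (up to floating point)
```

Because each per-model vector sums to one and the α_q sum to one, the weighted average is itself a valid probability vector.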
As noted in Section 2.3 and by Ai et al. (2021b), these model robust optimal sub-sampling probabilities are based on the maximum likelihood estimators found by considering the whole Big data set. Hence, a two stage procedure similar to Algorithm 2 is proposed for model robust sub-sampling, and this is outlined next.

Model robust optimal sub-sampling algorithm for GLMs
The two stage model robust optimal sub-sampling algorithm for GLMs is presented in Algorithm 3, where estimates of the sub-sampling probabilities based on the results of Theorems 5 and 6 are used. The specific algorithms for logistic and Poisson regression are available in Appendix B as Algorithms 6 and 7, respectively.
Algorithm 3: Two stage model robust optimal sub-sampling algorithm for GLMs.

Stage 1
1 Random sampling: Assign φ = (φ_1, . . . , φ_N) for F_N, and draw a random sub-sample of size r_0, S_{r_0}.
2 Estimation: Based on S_{r_0}, find θ̂_{r_0 q} for each model q = 1, . . . , Q by maximising the weighted log-likelihood function.

Stage 2
3 Optimal sub-sampling probability: Estimate the optimal sub-sampling probabilities φ^{mMSE} or φ^{mVc} using the results from Theorem 5 or 6, with θ̂_{qMLE} replaced by θ̂_{r_0 q}.
4 Optimal sub-sampling and estimation: Based on φ^{mMSE} or φ^{mVc}, draw a sub-sample of size r from F_N, such that S_r = {h_q(x_{0l}^r), y_l^r, φ_l^r}_{l=1}^r = (h_q(X_0^r), y^r, φ^r) for q = 1, . . . , Q. Combine S_{r_0} and S_r to form S_{r_0+r}, and obtain θ̃_q for each model by maximising the weighted log-likelihood function.
5 Output: θ̃_q and S_{r_0+r} for q = 1, . . . , Q.
Similar to Algorithm 2, the first stage entails randomly sub-sampling F_N (with replacement) and estimating the model parameters for each of the Q models. Based on these estimated parameters, model specific sub-sampling probabilities are obtained, and these are combined based on α_q, forming the model robust optimal sub-sampling probabilities. Subsequently, r ≥ r_0 data points are sampled from F_N. The two sub-samples are then combined, and each of the Q models is fitted (separately) based on the weighted log-likelihood function, which should yield efficient estimates of parameters across each of the models. Similar to estimating the variance-covariance matrix of a single model (Equation (6)), Ṽ_q, the variance-covariance matrix of θ̃_q, can be estimated for q = 1, . . . , Q.
In the following section, the proposed model robust optimal sub-sampling method and the current optimal sub-sampling method are assessed through a simulation study and real world scenarios.

Applications of optimal sub-sampling algorithms
In this section, a simulation study and two real world applications are used to assess the performance of the proposed model robust optimal sub-sampling algorithm (Algorithm 3) compared to the optimal sub-sampling algorithm (Algorithm 2) and random sampling. The main results are presented in this section, with some additional results presented in the Appendix. The simulation study and real world applications were coded in the R statistical programming language using the RStudio IDE, and our scripts are available through GitHub. Appendix C provides the specific GitHub hyperlinks to the code repositories for the simulation study and real world applications.

Simulation study design
To explore the performance of our model robust sub-sampling approach, a simulation study was constructed based on the logistic and Poisson regression models. For each case, a set of Q = 4 models was assumed based on Shi and Tang (2021), and this set is summarised in Table 1. For each model, F_N was constructed by assuming a distribution for the covariates and the corresponding response. The performance of the three sampling methods was then compared for each F_N through evaluating six scenarios: 1) random sampling to estimate the parameters of the data generating model; 2) optimal sub-sampling under the data generating model, which simulates the case where an appropriate model was assumed for describing the Big data; 3)-5) optimal sub-sampling under alternative models, i.e., optimal sub-sampling to estimate the parameters of the data generating model but where samples were obtained based on an alternative model, which simulates undertaking optimal sub-sampling based on a 'wrong' model; 6) model robust optimal sub-sampling, with α_q = 1/4 for q = 1, . . . , 4, used to estimate the parameters of the data generating model. To quantitatively compare each approach, the simulated Mean Squared Error (SMSE) was used. This was evaluated as follows:

$$\mathrm{SMSE}(\tilde{\theta}) = \frac{1}{M} \sum_{m=1}^{M} \sum_{n=1}^{p+q} (\tilde{\theta}_{nm} - \theta_n)^2, \qquad (11)$$

where M is the number of simulations, p + q is the number of parameters, θ_n is the n-th underlying parameter of the data generating model and θ̃_{nm} is the estimate of this parameter from the m-th simulation.
In addition, the determinant of the observed Fisher information matrix (i.e., the inverse of the variance-covariance matrix V) was also compared across sub-sampling approaches, with the largest of such values being preferred. For each simulation, N = 10000, r_0 = 100, r = 100, 200, . . . , 1400 and M = 1000.
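The SMSE above is straightforward to compute from a matrix of simulated estimates. A small Python sketch with hypothetical numbers (illustrative only, not values from the study):

```python
import numpy as np

def smse(estimates, theta_true):
    # Simulated MSE (sketch of the SMSE formula): average over the M
    # simulations of the squared error summed across the p + q parameters.
    estimates = np.asarray(estimates)          # shape (M, p + q)
    return np.mean(np.sum((estimates - theta_true) ** 2, axis=1))

theta_true = np.array([-1.0, 0.5, 0.5])
estimates = np.array([[-0.9, 0.6, 0.4],        # M = 2 hypothetical estimates
                      [-1.1, 0.4, 0.6]])
print(smse(estimates, theta_true))             # approximately 0.03
```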
Additionally, we only consider the A-optimal (mMSE) sub-sampling strategy in these simulations, as Wang, Zhu, and Ma (2018) and Ai et al. (2021b) indicated that this approach generally outperformed the L-optimal (mV_c) strategy. We explore both optimality criteria in the real world applications.

Logistic regression
Following Wang, Zhu, and Ma (2018), covariate data for the logistic regression model were simulated based on two distributions: Exponential(λ) and Multivariate Normal(µ, Σ). The values of (λ, µ, Σ) and θ are given in Table 2 for each data generating model. For all models, the first element of θ is the value of the parameter for the intercept. While λ, µ and Σ were selected arbitrarily, θ was selected such that, for N = 10000, the data generating model was preferred over the alternative models based on the Akaike Information Criterion (Akaike 1974).

[Table 2: covariate distributions and parameter values θ for each data generating model.]

Figure 1 provides summaries of the SMSE and average model information over all models when the covariate data are generated through a Multivariate Normal distribution. The SMSE and average model information indicate that, under optimal sub-sampling for mMSE, the data generating model is typically preferred within the model set. This is expected, as it is the case where the appropriate data generating model was correctly assumed to describe the Big data. Of note, the proposed model robust approach performs similarly to the optimal sub-sampling approach. Notable increases in the SMSE and decreases in the model information are observed when an incorrect model is considered for optimal sub-sampling compared to the data generating model. Overall, random sampling tends to have the worst performance. Similar results were obtained when the covariate data were generated through an Exponential distribution (Appendix D).

Poisson regression
The simulation study based on Poisson regression was constructed similarly to the logistic regression case. In terms of generating covariate values, Uniform and Multivariate Normal(µ, Σ) distributions were used, see Table 3 and Ai et al. (2021b). Values for θ were selected as described above, and are given in this table.

[Table 3: covariate distributions and parameter values θ for each Poisson data generating model.]

The SMSE and average model information over all models, when the covariate data were generated from a Multivariate Normal distribution and the response generated from a Poisson regression model, are shown in Figure 2. Generally, random sampling performs worst, while the proposed model robust approach and the optimal sub-sampling method based on the data generating model perform the best, with similar SMSE and average model information values. Again, the use of the optimal sub-sampling algorithm can lead to notable increases in SMSE when the assumed model is incorrect.
Similar results were obtained when the covariate data were generated from a Uniform distribution (see Appendix D).

Real world applications
The three sub-sampling methods are applied to analyse the "Skin segmentation" and "New York City taxi fare" data under logistic and Poisson regression, respectively. In the simulation study, the parameters were specified for the data generating model. However, in real world applications these are unknown, so the sub-sampling methods cannot be compared as in Section 4.1. Instead, for every model, the simulated mean squared error (SMSE) under each sub-sampling method is evaluated for various sub-sample sizes r and M simulations. This can be evaluated as follows:

$$\mathrm{SSMSE} = \sum_{q=1}^{Q} \mathrm{SMSE}_q(\hat{\theta}_{MLE}),$$

where SMSE_q(θ̂_MLE) is the simulated mean squared error in Equation (11) with θ replaced by θ̂_MLE.
In the following real world examples, the set of Q models includes the main effects model with intercept, and all possible combinations of quadratic terms for the continuous covariates. Again, these were constructed based on the work of Shi and Tang (2021).

Identifying skin from colours in images
Rajen and Abhinav (2012) considered the problem of identifying skin-like regions in images as part of the complex task of performing facial recognition. For this purpose, Rajen and Abhinav (2012) collected the "Skin segmentation" data set. We consider the same classification problem but use a logistic regression model. Skin presence is denoted as one and skin absence as zero. Each colour vector is scaled to have a mean of zero and a variance of one (the initial range was 0-255). To compare sub-sampling methods, we set r0 = 200 and r = 200, 300, . . . , 1800 for the sub-samples, and construct a set of Q models by considering the intercept plus main effects model with all covariates (scaled colours red, green and blue) as the base model, and forming all alternative models by including different combinations of quadratic terms of the covariates. This leads to Q = 8. Each of these models is considered equally likely a priori.
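The construction of the model set can be sketched as below; the function name and string representation of terms are illustrative assumptions, but the enumeration (base model plus every combination of quadratic terms) follows the description above.

```python
from itertools import combinations

def build_model_set(continuous_covariates):
    """Candidate model set: the intercept-plus-main-effects model is the
    base, and every combination of quadratic terms of the continuous
    covariates defines an alternative model."""
    base = ["intercept"] + list(continuous_covariates)
    models = []
    for k in range(len(continuous_covariates) + 1):
        for quad in combinations(continuous_covariates, k):
            models.append(base + [f"{c}^2" for c in quad])
    return models

model_set = build_model_set(["red", "green", "blue"])
```

For the three scaled colour covariates this enumerates 2^3 = 8 models, matching Q = 8 above.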
Figure 3 shows the SSMSE values over the Q models for various sub-sample sizes obtained by applying logistic regression to the "Skin segmentation" data. The proposed model robust approach performs similarly to the optimal sub-sampling method for r = 200 and r = 300. However, as the sample size increases the optimal sub-sampling method performs poorly, and after r = 1300 random sampling actually has lower SSMSE values than optimal sub-sampling. The same is not true for our proposed model robust approach, which has the lowest SSMSE values across the selected sample sizes under both the mMSE and mVc optimal sub-sampling criteria. It is interesting that the optimal sub-sampling approach based on mMSE performed worse relative to random sampling than in the simulation study. Upon investigation, this could potentially be explained by one of the models being particularly poor at describing the data (with subsequently higher SMSE as r increases), which inflated the SSMSE values. Despite this, we note that the model robust approach appears to perform well in general.

New York City taxi cab usage
New York City (NYC) taxi trip and fare information from 2009 onward, consisting of over 170 million records each year, is publicly available courtesy of the New York City Taxi and Limousine Commission. Analyses of interest of these NYC taxi data include: a taxi driver's decision process to pick up a fare or cruise for customers, which was modelled via logistic regression (Yazici, Kamga, and Singhal 2013); taxi demand and how it is impacted by location, time, demographic information, socioeconomic status and employment status, which was modelled via multiple linear regression (Yang and Gonzales 2014); and the dependence of taxi supply and demand on location and time, considered via Poisson regression (Yang and Gonzales 2017).
In our application, we are interested in how taxi usage varies with day of the week (weekday/weekend), season (winter/spring/summer/autumn), fare amount, and cash payment. The data used in our application are the "New York City taxi fare" data for the year 2013, hosted by the University of Illinois Urbana-Champaign (Donovan and Work 2016).
Each data point includes the number of rides recorded against the medallion number (a licence number provided for a motor vehicle to operate as a taxi) (y), weekday or not (x_1), winter or not (x_2), spring or not (x_3), summer or not (x_4), summed fare amount in dollars (x_5), and the ratio of cash payment trips to all trips (x_6). The continuous covariate x_5 was scaled to lie between zero and one. Poisson regression was used to model the relationship between the number of rides per medallion and these covariates. For our study, the three sub-sampling methods are compared for the analysis of the taxi fare data, setting r_0 = 100 and r = 200, . . . , 1900 for the sub-samples. The model set consists of the main effects model (x_1, x_2, x_3, x_4, x_5, x_6) and all possible combinations of the quadratic terms of the continuous covariates (x_5^2, x_6^2), which leads to Q = 4. Each of these models was considered equally likely a priori.
The SSMSE over the four models is shown in Figure 4. Our proposed model robust approach outperforms the optimal sub-sampling method for almost all sample sizes under both the mMSE and mVc strategies. Under mVc, the model robust approach initially performs worse than the optimal sub-sampling approach, but this is quickly reversed as r is increased. Random sampling performs the worst, suggesting that there is benefit in using targeted sampling approaches (as proposed here) over random selection.

Discussion
In this article, we proposed a model robust optimal sub-sampling approach for GLMs. This new approach extends the current optimal sub-sampling approach by considering a set of models (rather than a single model) when determining the sub-sampling probabilities. The exact formulation of these probabilities is derived using Theorems 5 and 6. The robustness properties of the proposed approach were demonstrated in a simulation study and in two real world analysis problems, where it was shown to outperform optimal sub-sampling and random sampling. Accordingly, we suggest that such an approach be considered in future Big data analysis problems.
The main limitation of the proposed approach that could be addressed in future research is extending the specification of the model set to a flexible class of models. This could be achieved, for example, through a generalised additive model (Hastie and Tibshirani 1986) or the inclusion of a discrepancy term in the linear predictor (Krishna et al. 2021). Another avenue of interest is reducing the model set after stage one of the two-stage algorithm, where models that clearly do not appear to be appropriate for the data could be dropped. Both of these extensions are planned for future research.

Here we define 0/0 = 0; this is equivalent to removing data points with |y_i − ψ(u(θ̂_{q,MLE}^T x_{qi}))| = 0 in the expression of V_c^q. For the q-th model, the sub-sampling probabilities are φ_{mMSE,i}^q, i = 1, . . . , N, with Σ_{i=1}^N φ_{mMSE,i}^q = 1.
If so, using the a priori probabilities α_q with Σ_{q=1}^Q α_q = 1, the model average optimal sub-sampling probabilities are chosen, for i = 1, . . . , N, such that Σ_{q=1}^Q α_q tr(V_q) attains its minimum.
Similarly, optimal sub-sampling probabilities can be obtained under L-optimality.
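The model averaging of sub-sampling probabilities described above can be sketched as follows; the interface (per-model probability vectors and a priori weights as inputs) is our own assumption.

```python
import numpy as np

def model_averaged_probabilities(phi_by_model, alpha):
    """Model robust sub-sampling probabilities: the weighted average
    phi_i = sum_q alpha_q * phi_i^q of the per-model optimal
    probabilities (each an N-vector summing to 1), using a priori
    model probabilities alpha_q that sum to 1.  The result is again a
    valid probability vector over the N data points."""
    phi_by_model = np.asarray(phi_by_model)   # shape (Q, N)
    alpha = np.asarray(alpha)                 # shape (Q,)
    return alpha @ phi_by_model               # shape (N,)
```

Because each φ^q sums to one and the α_q sum to one, the averaged vector automatically sums to one, so no renormalisation is needed before sampling.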
Figure 1: (a) Model information (larger is better) and (b) SMSE (smaller is better) for the sub-sampling methods for the logistic regression model under mMSE. Covariate data were generated from the Multivariate Normal distribution.

Figure 3: Summed SMSE over the available models for logistic regression applied to the "Skin segmentation" data.

Figure 4: Summed SMSE over the available models for Poisson regression applied to the 2013 NYC taxi usage data.

Figure 5: (a) Model information (larger is better) and (b) SMSE (smaller is better) for the sub-sampling methods for the logistic regression model under mMSE. Covariate data were generated from the Exponential distribution.

Table 1: Model set assumed for the simulation study.

Table 2: λ, µ, Σ and θ values used to generate x_1, x_2 through Exponential and Multivariate Normal distributions and to form F_N for each data generating logistic regression model.

Table 3: µ, Σ and θ values used to generate x_1, x_2 through Uniform and Multivariate Normal distributions and to form F_N for each data generating Poisson regression model.
Figure 2: (a) Model information (larger is better) and (b) SMSE (smaller is better) for the sub-sampling methods for the Poisson regression model under mMSE. Covariate data were generated from the standardised Multivariate Normal distribution.
The "Skin segmentation" data set consists of RGB (R-red, G-green, B-blue) values of randomly sampled pixels from N = 245,057 face images (of which 50,859 are skin samples and 194,198 are non-skin samples) spanning various age groups, race groups and genders. Bhatt et al. (2009) and Binias et al. (2018) applied multiple supervised machine learning algorithms to classify whether pixels are skin or not based on the RGB colour data. In addition, Abbas and Farooq (2019) conducted the same classification task for two different colour spaces, HSV (H-hue, S-saturation, V-value) and YCbCr (Y-luma component, Cb-blue difference chroma component, Cr-red difference chroma component), obtained by transforming the RGB colour space.