1 Introduction

Analysis of big data has gained interest in recent years, as it provides new insights and unlocks hidden knowledge in different fields of study (Karmakar and Mukhopadhyay 2020) including medicine (Rehman et al. 2021), fraud detection (Vaughan 2020), the oil and gas industry (Nguyen et al. 2020) and astronomy (Zhang and Zhao 2015). However, the analysis of big data can be challenging for traditional statistical methods and standard computing environments (Wang et al. 2016). Martinez-Mosquera et al. (2020) discuss storage and modelling solutions for handling such large amounts of data. In general, modelling solutions can be grouped into three broad categories: (1) subsampling methods, where the analysis is performed on an informative subsample obtained from the big data (Kleiner et al. 2014; Ma et al. 2015; Ma and Sun 2015; Drovandi et al. 2017; Wang et al. 2018, 2019; Yao and Wang 2019; Ai et al. 2021a, b; Cheng et al. 2020; Lee et al. 2021; Yao and Wang 2021); (2) divide and recombine methods, where the big data set is divided into smaller blocks, the intended statistical analysis is performed on each block, and the results are subsequently recombined for inference (Lin and Xi 2011; Guha et al. 2012; Cleveland and Hafen 2014; Chang et al. 2017; Li et al. 2020); (3) online updating of streamed data, where statistical inference is updated as new data arrive sequentially (Schifano et al. 2016; Xue et al. 2020). In recent years, subsampling has been applied to a wider variety of regression problems than divide and recombine methods, while online updating is typically only used for streaming data. In addition, in cases where the full data set is not needed to answer a specific question with sufficient confidence, subsampling seems preferable, as the analysis of the subsample can often be undertaken with standard methods. Moreover, the computational efficiency of subsampling over analysing the full data set has been observed for parameter estimation in linear (Wang et al. 2019) and logistic (Wang et al. 2018) regression models. For these reasons, we focus on subsampling methods in this article.

The key challenge for subsampling methods is how to obtain an informative subsample that can be used to efficiently answer specific analysis questions and provide results that align with the analysis of the whole big data set. Two approaches for this exist in the literature: (1) randomly sample from the big data with subsampling probabilities that are determined based on a specific statistical model and objective (e.g., prediction and/or parameter estimation) (Wang et al. 2018; Yao and Wang 2019; Ai et al. 2021a, b; Lee et al. 2021; Yao and Wang 2021); (2) select subsamples based on an experimental design (Drovandi et al. 2017; Deldossi and Tommasi 2022). Randomly sampling with certain probabilities (based upon the definitions of A- or L-optimality criteria, see Atkinson et al. 2007) is the focus of this article, and has been applied for parameter estimation in a wide range of regression problems including softmax (Yao and Wang 2019) and quantile regression (Ai et al. 2021a), and Generalised Linear Models (GLMs; Wang et al. 2018; Ai et al. 2021b; Yao and Wang 2021). In contrast, the approach based on an experimental design has only been applied for: (1) parameter estimation in logistic fixed and mixed effects regression models (Drovandi et al. 2017); and (2) parameter estimation and prediction accuracy in linear and logistic regression models (Deldossi and Tommasi 2022).

A key feature of both of the current subsampling approaches is that they rely on a statistical model that is assumed to appropriately describe the big data. Given this is a potentially limiting assumption, Yu and Wang (2022) proposed to select the best candidate model from a pool of models based on the Bayesian Information Criterion (BIC; Schwarz 1978). This was applied to linear models, and resulted in subsampling probabilities that were more appropriate than those based on a single model. Similarly, for GLMs, Shi and Tang (2021), Meng et al. (2021) and Yi and Zhou (2023) explored using space-filling or orthogonal Latin hypercube designs so that a wide range of models could be considered; however, such approaches have notable limitations, particularly as the design dimension increases. In this paper, we propose that, instead of selecting a single best candidate model for the big data, a set of models is considered and a model averaging approach is used to determine the subsampling probabilities. By adopting such an approach, the analysis goal (e.g., efficient parameter estimation) should be achieved regardless of which model is ultimately preferred for the data. To implement this model robust approach, we consider subsampling based on A- and L-optimality within the Generalised Linear Modelling framework, and adopt a model averaged approach based on each of these criteria. Given we consider GLMs, our approach should be generally applicable across many areas of science and technology where a variety of data types are observed. This is demonstrated by applying our proposed methods within a simulation study, and for the analysis of two real-world big data problems.

The remainder of the article is structured as follows. Section 2 introduces GLMs and the existing probability-based subsampling approach of Ai et al. (2021b) and Yao and Wang (2021). Our proposed model robust subsampling approach, embedded within the GLM framework, is introduced in Sect. 3. The performance of this approach is then assessed via a simulation study and two real-world applications in Sect. 4. Section 5 concludes the article with a discussion of the results and some suggestions for future research.

2 Background

There are a variety of ways big data can be subsampled. In this section, we focus on the approach where subsampling probabilities are determined for each data point, and the big data is subsampled (at random) based on these probabilities. Such an approach was first proposed by Wang et al. (2018) for logistic regression problems in big data settings, and has been extended to a wide range of regression problems (e.g., Yao and Wang 2019, 2021; Ai et al. 2021a, b). In this section, we describe such a subsampling approach as applied to GLMs based on the work of Ai et al. (2021b).

2.1 Generalised Linear Models

Let a big data set be denoted as \(F_N=(\varvec{X}_0,\varvec{y})\), where \(\varvec{X}_0=(\varvec{x}_{0_1},\ldots ,\varvec{x}_{0_N})^T \in R^{N \times p}\) represents a data matrix based on the big data set with p covariates, \(\varvec{y}=(y_1,\ldots ,y_N)^T\) represents the response vector and N is the total number of data points. To fit a GLM, consider the model matrix \(\varvec{X}=h(\varvec{X}_0) \in R^{N \times (p+t)}\) where h(.) is some function of \(\varvec{X}_0\) which creates an additional t columns representing, for example, an intercept and/or higher-order terms. A GLM can then be defined via three components: (1) distribution of response \(\varvec{y}\), which is from the exponential family (e.g., Normal, Binomial or Poisson); (2) linear predictor \(\varvec{\eta }=\varvec{X}\varvec{\theta }\), where \(\varvec{\theta }=(\theta _1,\ldots ,\theta _{p+t})^T\) is the parameter vector; and (3) link function g(.), which links the mean of the response to the linear predictor (Nelder and Wedderburn 1972). Throughout this article, the inverse link function \(g^{-1}(.)\) is denoted by u(.).

A common exponential form for the probability density or mass function of y can be written as:

$$\begin{aligned} f(y;\omega ,\gamma )=\exp {\Big (\frac{y\omega - \psi (\omega )}{a(\gamma )} + b(y,\gamma ) \Big )}, \end{aligned}$$
(1)

where \(\psi (.), a(.)\) and b(.) are some functions, \(\omega \) is known as the natural parameter and \(\gamma \) the dispersion parameter. Based on Eq. (1) the link function g(.) can then be defined as \(g(\varvec{\mu }) = \varvec{\eta }\), where \(\mu = E[y\vert \varvec{X},\varvec{\theta }] = \text{ d } \psi (\omega )/\text{d }\omega \). A general linear model, or linear regression model, is a special case of a GLM where \(\varvec{y} \sim N(\varvec{\mu },\varvec{\Sigma })\) and g(.) is the identity link function \(g(\varvec{\mu })=\varvec{\mu }\), such that \(\varvec{\mu }=\varvec{X}\varvec{\theta }\). For logistic regression, \(\varvec{y} \sim \text{ Bin }(n,\varvec{\pi })\) and g(.) is the logit link function \(g(\varvec{\pi })=\log (\varvec{\pi }/(1-\varvec{\pi }))\) such that \(g(\varvec{\pi })=\varvec{X}\varvec{\theta }\). Similarly, for Poisson regression, \(\varvec{y} \sim \text{ Poisson }(\varvec{\lambda })\) and g(.) is the log link function \(g(\varvec{\lambda })=\log (\varvec{\lambda })\) such that \(g(\varvec{\lambda })=\varvec{X}\varvec{\theta }\). Note that the dispersion parameter \(\gamma =1\) for the logistic and Poisson regression models.
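
To make these special cases concrete, the following minimal R sketch (for illustration only; the variable names and parameter values are arbitrary) simulates a small data set from each model and fits it with the built-in glm() function.

set.seed(1)
n   <- 500
x1  <- rnorm(n); x2 <- rnorm(n)
eta <- 0.5 + 1.0 * x1 - 0.5 * x2                # linear predictor X theta

y_norm <- eta + rnorm(n)                        # Normal response, identity link
y_bin  <- rbinom(n, 1, plogis(eta))             # binary response, logit link (u = plogis)
y_pois <- rpois(n, exp(eta))                    # count response, log link (u = exp)

fit_norm <- glm(y_norm ~ x1 + x2, family = gaussian(link = "identity"))
fit_bin  <- glm(y_bin  ~ x1 + x2, family = binomial(link = "logit"))
fit_pois <- glm(y_pois ~ x1 + x2, family = poisson(link = "log"))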

2.2 A general subsampling algorithm for GLMs

As described by Ai et al. (2021b), consider a general subsampling approach to estimate parameters \(\varvec{\theta }\) through a weighted log-likelihood function for GLMs (where the weights are the inverse of the subsampling probabilities). A weighted likelihood function is considered, since an unweighted likelihood leads to biased estimates of model parameters, see Wang (2019). Define \(\phi _i\) as the probability that, for a single draw, row i of \(F_N\) is randomly selected, for \(i=1,\ldots ,N\), where \(\sum _{i=1}^{N} \phi _i=1\) and \(\phi _i \in (0,1)\). A subsample S of size r is then drawn with replacement from \(F_N\) based on \(\varvec{\phi }=(\phi _1,\ldots ,\phi _N)\). The selected responses, covariates and subsampling probabilities are then used to estimate the model parameters. Pseudo-code for this general subsampling approach is provided in Algorithm 1.

Algorithm 1: General subsampling algorithm for GLMs

From Algorithm 1, the first step is to assign subsampling probabilities \(\phi _i\) to the rows of \(F_N\). The simplest approach is to assign each data point an equal probability of being selected. These probabilities could also depend on the composition of \(\varvec{y}\), e.g., for binary data, one could sample proportional to the inverse of the number of successes and failures. Based on these probabilities, a subsample S of size r is then drawn at random (with replacement) from \(F_N\) to yield \(\{\varvec{x}^*_l,y^*_l,\phi ^*_l\}_{l=1}^r\). Based on this subsample, model parameters \(\tilde{\varvec{\theta }}\) are estimated via maximising a weighted log-likelihood function. The estimates \(\tilde{\varvec{\theta }}\) can then be considered as estimates of what would be obtained if the whole big data set were analysed.
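
As an illustration of these steps, a minimal R sketch of Algorithm 1 is given below (illustrative code only; the function and argument names are not taken from our released implementation). The weighted log-likelihood is maximised by passing inverse-probability weights to glm().

# phi is a length-N vector of subsampling probabilities (summing to one) and the
# response and covariates referenced by 'formula' are columns of 'data'.
general_subsample <- function(formula, data, r, phi, family) {
  idx    <- sample(nrow(data), size = r, replace = TRUE, prob = phi)  # draw S
  sub    <- data[idx, , drop = FALSE]
  sub$.w <- 1 / phi[idx]    # inverse-probability weights (a constant rescaling does not change the estimate)
  glm(formula, data = sub, family = family, weights = .w)             # weighted MLE
}

# Example with equal probabilities (a data frame 'big_df' with response y is assumed):
# fit <- general_subsample(y ~ x1 + x2, big_df, r = 500,
#                          phi = rep(1 / nrow(big_df), nrow(big_df)), family = poisson())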

The asymptotic properties of \(\tilde{\varvec{\theta }}\) based on the general subsampling approach given in Algorithm 1 have been derived by Ai et al. (2021b). Based on these properties, two theorems were proposed which provide insight into the behaviour of the estimator as the subsample size increases. Specifically, as \(N\rightarrow \infty \) and \(r\rightarrow \infty \), they showed that \(\varvec{\tilde{\theta }}\) is a consistent estimator of \(\hat{\varvec{\theta }}_{MLE}\), the maximum likelihood estimator of \(\varvec{\theta }\) based on \(F_N\). In addition, they showed that the approximation error, \(\tilde{\varvec{\theta }} - \hat{\varvec{\theta }}_{MLE}\), given \(F_N\), is asymptotically Normally distributed with mean zero and variance \(\varvec{V}\).

When applying the general subsampling algorithm, it may not be clear how to appropriately choose \(\varvec{\phi }\) depending upon the goal of the analysis (e.g., parameter estimation, response prediction, etc.). To address this, Ai et al. (2021b) proposed determining \(\varvec{\phi }\) based on optimality criteria (e.g. A- and L-optimality), which leads to optimal subsampling probabilities that minimise the asymptotic mean squared error of \(\tilde{\varvec{\theta }}\) [or \(\text{ tr }(\varvec{V})\)] and \(\varvec{J_X} \tilde{\varvec{\theta }}\) [or \(\text{ tr }(\varvec{V}_c)\)], respectively, where \(\varvec{J_X}\) is the observed information matrix and \(\varvec{V}_c\) is the variance of \(\varvec{J_X} \tilde{\varvec{\theta }}\). Based on each of these optimality criteria, subsampling probabilities can be determined for each data point, and these will be denoted as \(\varvec{\phi }^{mMSE}\) and \(\varvec{\phi }^{mV_c}\), respectively.

Unfortunately, the optimal subsampling probabilities \(\varvec{\phi }^{mMSE}\) and \(\varvec{\phi }^{mV_c}\) cannot be determined directly as they depend on \(\hat{\varvec{\theta }}_{MLE}\). To address this, Ai et al. (2021b) proposed a two-stage subsampling strategy where an initial random sample of the big data is used to estimate \(\hat{\varvec{\theta }}_{MLE}\). This estimate is then used to approximate the optimal subsampling probabilities. Such an approach is thus termed a two-stage subsampling algorithm. A full description of this algorithm is given in Ai et al. (2021b).

2.3 Limitations of the optimal subsampling algorithm

The above optimal subsampling approach has a number of limitations. One such limitation is the computational expense involved in obtaining the optimal subsampling probabilities, as these need to be determined for each data point in the big data set. To address this, Lee et al. (2021) introduced a faster two-stage subsampling procedure for GLMs using the Johnson–Lindenstrauss Transform (JLT) and subsampled Randomised Hadamard Transform (SRHT), which are matrix sketching techniques that reduce the effective size of the data matrix. Another limitation is that the approximation for the optimal subsampling probabilities can lead to some data points having zero probability of being selected. Ai et al. (2021b) proposed to resolve this by setting these subsampling probabilities to a small value to ensure such data points have some (non-zero) probability of being selected. Lastly, one of the major limitations of the approach that has not been addressed previously is the inherent assumption that the big data can be appropriately described by a given model. That is, the subsampling probabilities are evaluated based on an assumed model, and they are generally only optimal for this model. We suggest that this is a substantial limitation as specifying such a model in practice can be difficult. This motivates the development of methods that yield subsampling probabilities that are robust to the choice of model, and our proposed approach for this is outlined next.

3 Model robust subsampling method

In order to apply the two-stage subsampling approach of Ai et al. (2021b), optimal subsampling probabilities need to be evaluated, and these are based on a model that is assumed to appropriately describe the data. In practice, determining such a model may be difficult, and there could be a variety of models that appropriately describe the data. Hence, a subsampling approach that provides robustness to the choice of model is desirable. For this, we propose to consider a set of Q models which can be constructed to encapsulate a variety of scenarios that may be observed within the data. For each model in this set, define model probabilities \(\alpha _q\) for \(q=1,\ldots ,Q\) such that \(\sum _{q=1}^Q\alpha _q=1\), which represent our a priori belief about the appropriateness of each model. Denote the model matrix for the qth model as \(\varvec{X}_q= h_q(\varvec{X}_0)\), i.e., some function of the data matrix \(\varvec{X}_0\). To apply a subsampling approach for this model set, subsampling probabilities are needed, and they should be constructed such that the resulting subsample is expected to address the analysis aim, regardless of which model is actually preferred for the big data. For this purpose, we propose to form these subsampling probabilities via a weighted average (based on \(\alpha _q\)) of the subsampling probabilities that would be obtained for each model (singularly). This is the basic approach of our model robust subsampling algorithm, for which further details are provided below.

3.1 Properties of model robust subsampling algorithm

Updating the notation of the model matrix from \(\varvec{X}\) to \(\varvec{X}_q\) for the qth model subsequently leads to analogous definitions for \(\varvec{x}_{qi}\), \(\varvec{\theta }_{q}\), \(\hat{\varvec{\theta }}_{qMLE}\), \(\tilde{\varvec{\theta }}_{q}\), \({\varvec{J_X}}_q\), \(\varvec{V}_q\) and \({\varvec{V}_q}_c\). In addition, let \(\dot{\psi }(u(\varvec{\theta }^T\varvec{x}_i))\) and \(\dot{u}(\varvec{\theta }^T\varvec{x}_i)\) denote the first-order derivatives of \(\psi (.)\) and u(.) evaluated at \(u(\varvec{\theta }^T\varvec{x}_i)\) and \(\varvec{\theta }^T\varvec{x}_i\), respectively, and denote the Euclidean norm of a vector \(\varvec{a}\) as \(\Vert \varvec{a}\Vert =(\varvec{a}^T\varvec{a})^{1/2}\). Extending the ideas from Sect. 2.2, optimal subsampling probabilities can be selected based on certain optimality criteria to, for example, ensure efficient estimates of parameters across the Q models. To inform the choice of subsampling probabilities, consider the following theorems.

Theorem 1

For a set of Q models with model probability \(\alpha _q\) for the qth model, \(q=1,\ldots ,Q,\) if the subsampling probabilities are selected as follows:

$$\begin{aligned} {{\phi _q}^{mMSE}_i =} \frac{\vert y_i - \dot{\psi }(u(\hat{\varvec{\theta }_q}_{MLE}^T\varvec{x}_{qi}))\vert \, \Vert \varvec{J}^{-1}_{\varvec{X}_q} \dot{u}(\hat{\varvec{\theta }_q}_{MLE}^T\varvec{x}_{qi})\varvec{x}_{qi} \Vert }{\sum _{j=1}^{N} \vert y_j - \dot{\psi }(u(\hat{\varvec{\theta }_q}_{MLE}^T\varvec{x}_{qj}))\vert \, \Vert \varvec{J}^{-1}_{\varvec{X}_q} \dot{u}(\hat{\varvec{\theta }_q}_{MLE}^T\varvec{x}_{qj})\varvec{x}_{qj} \Vert }, \end{aligned}$$

\(i=1,\ldots ,N,\) and \(\sum _{q=1}^{Q} \alpha _q = 1,\) then \(\sum _{q=1}^Q\alpha _q\text{ tr }(\varvec{V}_q)\) attains its minimum.

The proof of Theorem 1 is available in “Appendix A”, which is an extension of the proof of Theorem 3 from Ai et al. (2021b).

Theorem 2

For a set of Q models with model probability \(\alpha _q\) for the qth model, \(q=1,\ldots ,Q,\) if the subsampling probabilities are selected as follows:

$$\begin{aligned} {{\phi _q}_i^{mV_c} = }\frac{\vert y_i - \dot{\psi }(u(\hat{\varvec{\theta }_q}_{MLE}^T\varvec{x}_{qi}))\vert \, \Vert \dot{u}(\hat{\varvec{\theta }_q}_{MLE}^T\varvec{x}_{qi}) \varvec{x}_{qi} \Vert }{\sum _{j=1}^{N} \vert y_j - \dot{\psi }(u(\hat{\varvec{\theta }_q}_{MLE}^T\varvec{x}_{qj}))\vert \, \Vert \dot{u}(\hat{\varvec{\theta }_q}_{MLE}^T\varvec{x}_{qj}) \varvec{x}_{qj} \Vert }, \end{aligned}$$

\(i=1,\ldots ,N,\) and \(\sum _{q=1}^{Q} \alpha _q = 1,\) then \(\sum _{q=1}^Q\alpha _q\text{ tr }({\varvec{V}_q}_c)\) attains its minimum.

The proof of Theorem 2 is similar to the proof of Theorem 1. Based on the above results, we propose that model robust subsampling probabilities can be obtained as follows: \({\phi _i} = \sum _{q=1}^{Q} \alpha _q {\phi _q}_i,\) for \(i=1,\ldots ,N\), and \(\sum _{q=1}^{Q} \alpha _q = 1\), i.e. a model averaged approach based on the optimal subsampling probabilities from each model q. Similar to the work of Ai et al. (2021b), these model robust subsampling probabilities depend on the maximum likelihood estimator found by considering the whole big data set. As this is not available in big data settings, a two-stage approach is proposed for model robust subsampling, and this is outlined next.
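
In code, this model averaging step is a single matrix-vector product; a minimal sketch is given below (illustrative only), where phi_mat is an N x Q matrix whose qth column holds the optimal subsampling probabilities for model q.

# Model robust probabilities: phi_i = sum_q alpha_q * phi_{q,i}
model_robust_phi <- function(phi_mat, alpha) {
  stopifnot(ncol(phi_mat) == length(alpha), abs(sum(alpha) - 1) < 1e-8)
  drop(phi_mat %*% alpha)
}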

3.2 Model robust subsampling algorithm for GLMs

The two-stage model robust subsampling algorithm for GLMs is presented in Algorithm 2, where subsamples are drawn based on model averaged subsampling probabilities.

Algorithm 2: Model robust subsampling algorithm for GLMs

The first phase of Algorithm 2 entails randomly subsampling \(F_{N}\) (with replacement), and estimating model parameters for each of the Q models. Based on these estimated parameters, model specific subsampling probabilities are obtained, and these are combined based on \(\alpha _q\) to form model robust subsampling probabilities. Subsequently, \(r\ge r_0\) data points are sampled from \(F_{N}\). The two subsamples are then combined, and each of the Q models are fitted (separately) based on the weighted log-likelihood function, which should yield, for example, efficient estimates of parameters across each of the models. In the following section, our proposed model robust subsampling approach is assessed via simulation and in two real-world scenarios.
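
A minimal R sketch of this two-stage structure is given below (illustrative only, and not our released implementation; implementation details such as the exact weighting follow Algorithm 2). The per-model optimal probabilities are delegated to a user-supplied function phi_fun(Xq, y, theta); concrete versions for logistic and Poisson regression are sketched in Sect. 4. The response column of the data frame is assumed to be named y.

# 'formulas' is a list of the Q model formulas, 'alpha' the model probabilities, and
# phi_fun() returns the probabilities of Theorem 1 or 2 for a single model.
model_robust_subsample <- function(formulas, alpha, data, r0, r, family, phi_fun) {
  N <- nrow(data)

  # Stage 1: uniform random subsample and pilot estimates for each model
  idx0  <- sample(N, r0, replace = TRUE)
  pilot <- lapply(formulas, function(f) glm(f, data = data[idx0, ], family = family))

  # Model specific probabilities over the full data, then the alpha-weighted average
  phi_mat <- sapply(seq_along(formulas), function(q) {
    Xq <- model.matrix(formulas[[q]], data)
    phi_fun(Xq, data$y, coef(pilot[[q]]))
  })
  phi <- drop(phi_mat %*% alpha)

  # Stage 2: subsample with the model robust probabilities, combine both subsamples,
  # and refit each model by weighted maximum likelihood (inverse-probability weights)
  idx1   <- sample(N, r, replace = TRUE, prob = phi)
  sub    <- data[c(idx0, idx1), , drop = FALSE]
  sub$.w <- 1 / c(rep(1 / N, r0), phi[idx1])
  lapply(formulas, function(f) glm(f, data = sub, family = family, weights = .w))
}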

4 Applications of optimal subsampling algorithms

In this section, a simulation study and two real-world applications are used to assess the performance of our proposed model robust subsampling algorithm (Algorithm 2) compared to: (1) the approach of Ai et al. (2021b) based on a single model, and to; (2) sampling completely at random. The main results are presented in this section with some results presented in the Supplementary Material. The simulation study and real-world applications were coded in the R statistical programming language (R Core Team 2021) with the help of RStudio IDE (2020), and code to reproduce our results is available through GitHub. Supplementary Material S2 provides specific GitHub hyperlinks to the code repositories.

4.1 Simulation study design

To explore the performance of our model robust subsampling approach, a simulation study was constructed based on the logistic and Poisson regression models. For each case, a set of \(Q=4\) models was assumed based on Shi and Tang (2021), and this set is summarised in Table 1. For each model, \(F_{N}\) was constructed by assuming a distribution for the covariates and the corresponding response. The performance of the three sampling methods was then compared for each \(F_{N}\) through evaluating six scenarios: (1) random sampling to estimate the parameters of the data generating model; (2) optimal subsampling under the data generating model—this simulates the case where an appropriate model was assumed for describing the big data; (3)–(5) optimal subsampling under alternative models (all models in Table 1 except the data generating model)—this simulates optimal subsampling under a ‘wrong’ model; (6) model robust subsampling to estimate the parameters of the data generating model under the assumption that each of the Q models is equally likely a priori (i.e., assuming the model set as given in Table 1 with \(\alpha _q=1/Q\)). For each simulation, \(N=10,000\), \(r_0=100\), \(r=100,200,\ldots ,1400\) and \(M=1000\).

To compare the results from each of the six scenarios, the simulated mean squared error (SMSE) of the estimated model parameters was evaluated as follows:

$$\begin{aligned} SMSE(\varvec{\tilde{\theta }},\varvec{\iota }) = \frac{1}{M}{\sum _{m=1}^{M} \sum _{n=1}^{p+t} ({\tilde{\theta }}_{nm} - \iota _n)^2 }, \end{aligned}$$
(2)

where M is the number of simulations, \(p+t\) is the number of parameters in the data generating model and \(\varvec{\tilde{\theta }}\) is an \(M\times (p+t)\) matrix of the estimated model parameters from each simulation. Thus, to evaluate the SMSE to compare \(\varvec{\tilde{\theta }}\) with \(\varvec{\theta }\) and \(\varvec{\hat{\theta }}_{MLE}\), we set \(\varvec{\iota }=\varvec{\theta }\) and \(\varvec{\hat{\theta }}_{MLE}\), respectively. In addition, the average model information (i.e., the mean of M determinants of the Fisher information matrix based on \(\varvec{\theta }\)) was also evaluated for comparison across the different subsampling scenarios.
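
For reference, Eq. (2) translates directly into R (an illustrative helper; theta_tilde is the M x (p+t) matrix of estimates and iota the comparison vector).

# Simulated MSE of Eq. (2): average over simulations of the squared distance to iota
smse <- function(theta_tilde, iota) {
  mean(rowSums(sweep(theta_tilde, 2, iota)^2))
}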

Table 1 Model set assumed for the simulation study.

Wang et al. (2018) and Ai et al. (2021b) showed that SMSE values based on comparisons with \(\varvec{\hat{\theta }}_{MLE}\) were generally smaller when using \(\phi ^{mMSE}\) compared to \(\phi ^{mV_c}\). Therefore, the results from the simulation study when using \(\phi ^{mV_c}\) are included in Supplementary Material S3.1 and S3.2. However, results from both optimality criteria are discussed in the main text in the two real-world applications. In addition, in Supplementary Material S4, we have extended our simulation study to explore the effect of considering different values for \(r_0\). Further, in Supplementary Material S5, we have extended our simulation study to explore the effect of considering non-uniform model probabilities and a model set that includes models with different covariates.

4.1.1 Logistic regression

Following Wang et al. (2018), covariate data (\(\varvec{x}_1,\varvec{x}_2\)) for the logistic regression model were simulated from two distributions: Exponential (\(\lambda \)) and Multivariate Normal (\(\varvec{\mu },\varvec{\Sigma }\)). The values of (\(\lambda ,\varvec{\mu },\varvec{\Sigma }\)) and \(\varvec{\theta }\) are given in Table 2 for each data generating model. For all models, the first element of \(\varvec{\theta }\) is the value of the intercept parameter, with the remaining values denoting slope parameters. While \(\lambda \), \(\varvec{\mu }\) and \(\varvec{\Sigma }\) were selected arbitrarily, \(\varvec{\theta }\) was determined so that the data generating model would be preferred (based on the Akaike Information Criterion; Akaike 1974) over each rival model if the whole big data set were analysed. This was considered to avoid situations where an alternative model might be preferred over the data generating model as r approached N, which could potentially hinder the interpretation of results.

For logistic regression, through applying Theorems 1 and 2, model robust subsampling probabilities are obtained as follows:

$$\begin{aligned}&{\phi }^{mMSE}_i = \sum _{q=1}^{Q} \frac{\alpha _q\vert y_i - {\pi _q}_i\vert \Vert \varvec{J}^{-1}_{\varvec{X}_q} \varvec{x}_{qi} \Vert }{\sum _{j=1}^{N} \vert y_j - {\pi _q}_j\vert \Vert \varvec{J}^{-1}_{\varvec{X}_q} \varvec{x}_{qj} \Vert }, \quad {\phi }^{mV_c}_i = \sum _{q=1}^{Q} \frac{\alpha _q\vert y_i - {\pi _q}_i\vert \Vert \varvec{x}_{qi}\Vert }{\sum _{j=1}^{N} \vert y_j - {\pi _q}_j\vert \Vert \varvec{x}_{qj} \Vert } \end{aligned}$$

with \(y_i \in \{0,1\}\), \({\pi _q}_i = \exp {(\hat{\varvec{\theta }}_{qMLE}^T\varvec{x}_{qi})}/(1+\exp {(\hat{\varvec{\theta }}_{qMLE}^T\varvec{x}_{qi} )}),\) \(\varvec{J}_{\varvec{X}_q}=N^{-1}\sum _{h=1}^{N} {\pi _q}_h(1-{\pi _q}_h) \varvec{x}_{qh} (\varvec{x}_{qh})^T\) and \(i=1,\ldots ,N\).
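
These per-model probabilities can be computed directly from pilot parameter estimates; an illustrative sketch is given below, written so that it can be supplied as phi_fun to the Algorithm 2 sketch in Sect. 3.2 (Xq is the full-data model matrix for model q, y the binary responses and theta the pilot estimates).

phi_mMSE_logistic <- function(Xq, y, theta) {
  piq <- plogis(drop(Xq %*% theta))                        # pi_{q,i}
  J   <- crossprod(Xq * (piq * (1 - piq)), Xq) / nrow(Xq)  # J_{X_q}
  a   <- abs(y - piq) * sqrt(rowSums((Xq %*% solve(J))^2)) # |y - pi| * ||J^{-1} x||
  a / sum(a)
}
phi_mVc_logistic <- function(Xq, y, theta) {
  piq <- plogis(drop(Xq %*% theta))
  a   <- abs(y - piq) * sqrt(rowSums(Xq^2))                # |y - pi| * ||x||
  a / sum(a)
}
# The model robust probabilities are then the alpha-weighted average of these
# per-model vectors, as in the expressions above.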

The general model robust approach applied in this example is given in Algorithm 2, and the version of this algorithm specific to logistic regression is given in Supplementary Material S1 as Algorithm S3.

Table 2 Values of \(\lambda ,\varvec{\mu },\varvec{\Sigma }\) and \(\varvec{\theta }\) used to generate \(x_1,x_2\) from Exponential and Multivariate Normal distributions and to form \(F_{N}\) for each data generating logistic regression model.
Fig. 1 Logarithm of (a) average model information and (b) SMSE for the subsampling methods for the logistic regression model under \(\phi ^{mMSE}\). Covariate data were generated from the Exponential distribution.

Figure 1 provides summaries of the logarithm-scaled SMSE and average model information when the covariate data are generated from an Exponential distribution. The SMSE and average model information indicate that, under optimal subsampling for \(\phi ^{mMSE}\), the data generating model is typically preferred within the model set. This is expected, as it is the case where the appropriate data generating model was correctly assumed to describe the big data. Of note, the proposed model robust approach performs similarly to the optimal subsampling approach. Notable increases in the SMSE and decreases in the average model information are observed when an incorrect model is assumed within optimal subsampling. For average model information, there are occasions where assuming a model that is more complex than the data generating model performs well when compared to alternative approaches (e.g. when both quadratic terms are assumed to be included in the model but only one is present). However, our model robust approach is still preferred when all scenarios are considered. In addition, random sampling tends to have the worst performance. Similar results were obtained when the covariate data were generated from a Multivariate Normal distribution and when L-optimality (\(\phi ^{mV_c}\)) was used; see Supplementary Material S3.1. Similar results were also observed when the SMSE was evaluated based on \(\hat{\varvec{\theta }}_{MLE}\); see Supplementary Material S3.2.

4.1.2 Poisson regression

A simulation study based on Poisson regression was constructed similar to the logistic regression case. In terms of generating covariate values, uniform and Multivariate Normal distributions were considered. Values for \(\varvec{\theta }\) were selected as described above, and are given in Table 3. Model robust subsampling probabilities for Poisson regression can be obtained by applying Theorems 1 and 2 which yields the following:

$$\begin{aligned}&{\phi }^{mMSE}_i = \sum _{q=1}^{Q} \frac{\alpha _q \vert y_i - {\lambda _q}_i\vert \Vert \varvec{J}^{-1}_{\varvec{X}_q} \varvec{x}_{qi} \Vert }{\sum _{j=1}^{N} \vert y_j - {\lambda _q}_j\vert \Vert \varvec{J}^{-1}_{\varvec{X}_q} \varvec{x}_{qj}\Vert }, \quad \phi ^{mV_c}_i = \sum _{q=1}^{Q} \frac{\alpha _q \vert y_i - {\lambda _q}_i\vert \Vert \varvec{x}_{qi} \Vert }{\sum _{j=1}^{N} \vert y_j - {\lambda _q}_j\vert \Vert \varvec{x}_{qj} \Vert } \end{aligned}$$

with \(y_i\) a non-negative integer, \({\lambda _q}_i = \exp {(\hat{\varvec{\theta }}_{qMLE}^T\varvec{x}_{qi})},\) \(\varvec{J}_{\varvec{X}_q}= N^{-1}\sum _{h=1}^{N} {\lambda _q}_h \varvec{x}_{qh} (\varvec{x}_{qh})^T\) and \(i=1,\ldots ,N\). The specific version of Algorithm 2 for Poisson regression is given in Supplementary Material S1 as Algorithm S4.
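
The Poisson counterparts of the logistic probability functions sketched in Sect. 4.1.1 follow the same pattern (again illustrative, matching the expressions above):

phi_mMSE_poisson <- function(Xq, y, theta) {
  lam <- exp(drop(Xq %*% theta))                           # lambda_{q,i}
  J   <- crossprod(Xq * lam, Xq) / nrow(Xq)                # J_{X_q}
  a   <- abs(y - lam) * sqrt(rowSums((Xq %*% solve(J))^2)) # |y - lambda| * ||J^{-1} x||
  a / sum(a)
}
phi_mVc_poisson <- function(Xq, y, theta) {
  lam <- exp(drop(Xq %*% theta))
  a   <- abs(y - lam) * sqrt(rowSums(Xq^2))                # |y - lambda| * ||x||
  a / sum(a)
}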

Table 3 Values of \(\varvec{\mu },\varvec{\Sigma }\) and \(\varvec{\theta }\) used to generate \(x_1,x_2\) from uniform and Multivariate Normal distributions and to form \(F_{N}\) for each data generating Poisson regression model.

SMSE and average model information when the covariate data were generated from a uniform distribution and the response formed through a Poisson regression model are shown in Fig. 2. On average, random sampling performs worst, while the proposed model robust approach and the optimal subsampling method based on the data generating model perform best, with similar SMSE and average model information values. As for logistic regression, our model robust approach is preferable overall even though the more complex model performs well in some instances. Again, the use of the optimal subsampling algorithm can lead to notable increases in SMSE when the assumed model is incorrect. Similar results were obtained when the covariate data were generated from a uniform distribution and when L-optimality was used; see Supplementary Material S3.1. In addition, similar results were observed when the SMSE was evaluated based on \(\hat{\varvec{\theta }}_{MLE}\); see Supplementary Material S3.2.

Fig. 2 Logarithm of (a) average model information and (b) SMSE for the subsampling methods for the Poisson regression model under \(\phi ^{mMSE}\). Covariate data were generated from the uniform distribution.

4.2 Real world applications

The three subsampling methods are applied to analyse the “Skin segmentation” and “New York City taxi fare” data under logistic and Poisson regression, respectively. In the simulation study, the parameters of the data generating model were specified; in real-world applications, however, these are unknown. In such cases, the subsampling methods cannot be compared as in Sect. 4.1. Instead, for the set of Q models, the Summed SMSE (SSMSE) under each subsampling method can be evaluated as follows:

$$\begin{aligned} SSMSE(\hat{\varvec{\theta }}_{MLE}) = \sum _{q=1}^{Q} {\alpha _q SMSE(\varvec{\tilde{\theta }}_q,\varvec{\hat{\theta }}_{q,MLE})}, \end{aligned}$$

where \(\varvec{\tilde{\theta }}_q\) is a matrix with the estimated parameters for the qth model over M simulations and \(\varvec{\hat{\theta }}_{q,MLE}\) denotes \(\hat{\varvec{\theta }}_{MLE}\) for the qth model.
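
Using the smse() helper sketched in Sect. 4.1, the SSMSE is a weighted sum over the Q models (illustrative code; theta_tilde_list and theta_mle_list hold the per-model estimate matrices and full-data MLEs).

ssmse <- function(theta_tilde_list, theta_mle_list, alpha) {
  sum(mapply(function(th, mle, a) a * smse(th, mle),
             theta_tilde_list, theta_mle_list, alpha))
}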

In the following real-world examples, the set of Q models consists of the main effects model (with intercept) together with all possible combinations of quadratic terms for the continuous covariates, with \(\alpha _q\) set to 1/Q. Again, these model sets were constructed based on the work of Shi and Tang (2021).

4.2.1 Identifying skin from colours in images

Rajen and Abhinav (2012) considered the problem of identifying skin-like regions in images as part of the complex task of performing facial recognition. For this purpose, Rajen and Abhinav (2012) collated the “Skin segmentation” data set, which consists of RGB (R-red, G-green, B-blue) values of \(N=245,057\) pixels randomly sampled from face images (of which 50,859 are skin samples and 194,198 are non-skin samples) across various age groups, race groups and genders. Bhatt et al. (2009) and Binias et al. (2018) applied multiple supervised machine learning algorithms to classify whether samples are skin or not based on the RGB colour data. In addition, Abbas and Farooq (2019) conducted the same classification task for two different colour spaces, HSV (H-hue, S-saturation, V-value) and YCbCr (Y-luma component, Cb-blue difference chroma component, Cr-red difference chroma component), obtained by transforming the RGB colour space.

We consider the same classification problem but use a logistic regression model. Skin presence is denoted as one and skin absence as zero. Each colour vector is scaled to have a mean of zero and a variance of one (the initial range was between 0 and 255). To compare the subsampling methods, we set \(r_0=200\) and \(r=200,300,\ldots ,1800\) for the subsamples, and construct a set of Q models by taking the model with an intercept and main effects for all covariates (the scaled red, green and blue colours) as the base model, and forming all alternative models by including different combinations of quadratic terms of the covariates, as sketched below. This leads to \(Q=8\). Each of these models is considered equally likely a priori.
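
The model set itself can be generated programmatically; a sketch is given below, where the scaled colour column names red, green and blue and the response name y are hypothetical labels used only for illustration.

quads <- c("I(red^2)", "I(green^2)", "I(blue^2)")
forms <- lapply(0:7, function(k) {
  extra <- quads[as.logical(intToBits(k)[1:3])]                  # which quadratic terms to add
  reformulate(c("red", "green", "blue", extra), response = "y")  # intercept included by default
})
length(forms)  # Q = 8
# These formulas (with alpha_q = 1/8) can be passed to the Algorithm 2 sketch of Sect. 3.2.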

Figure 3 shows the SSMSE values over the Q models for various subsample sizes obtained by applying logistic regression to the “Skin segmentation” data. The proposed model robust approach performs similarly to the optimal subsampling method for \(r=200\) and 300. However, as the sample size increases the optimal subsampling method performs poorly, and beyond \(r=1300\) random sampling actually has lower SSMSE values than optimal subsampling. The same is not true for our proposed model robust approach, which has the lowest SSMSE values throughout the selected sample sizes under both the \(\phi ^{mMSE}\) and \(\phi ^{mV_c}\) subsampling criteria. Of note, the optimal subsampling approach based on \(\phi ^{mMSE}\) performed worse than random sampling in some cases. That is, minimal decrease in SSMSE was observed for optimal subsampling beyond \(r=900\) compared to random sampling. Upon investigating this, it was found that this was likely due to one of the models being particularly poor at describing the data (leading to higher SMSE with increasing r), which inflated the SSMSE values. Despite this, we note that the model robust approach appears to perform well in general.

Fig. 3 Logarithm of SSMSE over the available models for logistic regression applied on the “Skin segmentation” data

4.2.2 New York City taxi cab usage

New York City (NYC) taxi trip and fare information from 2009 onward, consisting of over 170 million records each year, is publicly available courtesy of the New York City Taxi and Limousine Commission. Some analyses of interest of these NYC taxi data include: a taxi driver’s decision process to pick up a fare or cruise for customers, which was modelled via a logistic regression model (Yazici et al. 2013); taxi demand and how it is impacted by location, time, demographic information, socioeconomic status and employment status, which was modelled via a multiple linear regression model (Yang and Gonzales 2014); and the dependence of taxi supply and demand on location and time, considered via a Poisson regression model (Yang and Gonzales 2017).

In our application, we are interested in how taxi usage varies with day of the week (weekday/weekend), season (winter/spring/summer/autumn), fare amount, and cash payment. The data used in our application is the “New York City taxi fare” data for the year 2013, hosted by the University of Illinois Urbana Champaign (Donovan and Work 2016).

Each data point includes the number of rides recorded against the medallion number (a license number provided for a motor vehicle to operate as a taxi) (y), weekday or not (\(\varvec{x}_1\)), winter or not (\(\varvec{x}_2\)), spring or not (\(\varvec{x}_3\)), summer or not (\(\varvec{x}_4\)), the summed fare amount in dollars (\(\varvec{x}_5\)), and the proportion of cash-payment trips among all trips (\(\varvec{x}_6\)). The continuous covariate \(\varvec{x}_5\) was scaled to the range zero to one. Poisson regression was used to model the relationship between the number of rides per medallion and these covariates.

Fig. 4 Logarithm of SSMSE over the available models for Poisson regression applied on the 2013 NYC taxi usage data.

For our study, the three subsampling methods are compared for the analysis of the taxi fare data, with \(r_0=100\) and \(r=100,200,\ldots ,1900\) for the subsamples. The model set consists of the main effects model (\(\varvec{x}_1,\varvec{x}_2,\varvec{x}_3,\varvec{x}_4,\varvec{x}_5,\varvec{x}_6\)) and all possible combinations of the quadratic terms of the continuous covariates (\(\varvec{x^2_5},\varvec{x^2_6}\)), which leads to \(Q=4\). Each of these models was considered equally likely a priori.

The SSMSE over the four models is shown in Fig. 4. Our proposed model robust approach outperforms the optimal subsampling method for almost all sample sizes for both the \(\phi ^{mMSE}\) and \(\phi ^{mV_c}\) subsampling probabilities. Under \(\phi ^{mV_c}\), the model robust approach initially performs worse than the optimal subsampling approach, but this is quickly reversed as r increases. Random sampling performs the worst, suggesting that there is benefit in using targeted sampling approaches (as proposed here) over random selection.

5 Discussion

In this article, we proposed a model robust subsampling approach for GLMs. This new approach extends the current optimal subsampling approach by considering a set of models (rather than a single model) when determining the subsampling probabilities. These probabilities were formulated via model averaging, based on the results of Theorems 1 and 2. The robustness properties of the proposed approach were demonstrated in a simulation study and in two real-world analysis problems, where it outperformed optimal subsampling and random sampling. Accordingly, we suggest that our model robust approach could be considered in future big data analysis problems.

The main limitation of our proposed approach is that the specified model set could become quite large as the number of covariates increases. We note, however, that this is also an issue for big data analysis (as each model would still need to be fitted and compared). Such an issue could potentially be addressed in future work by extending the specification of the model set to a flexible class of models. This could be achieved, for example, through a generalised additive model (Hastie and Tibshirani 1986; De Silva et al. 2022) or the inclusion of a discrepancy term in the linear predictor (Krishna et al. 2021). Such an approach would seem efficient as it would reduce the number of models that need to be considered in the model set. In a similar vein, one could consider determining subsampling probabilities based on the work of Adewale and Wiens (2009) and Adewale and Xu (2010), which aimed at obtaining robust designs under a misspecified model. Another avenue of interest is determining \(\alpha _q\) and/or reducing the model set based on the initial subsample of \(F_N\), where, for example, models that are clearly not supported by the data could be dropped. Such extensions are planned as future research.

6 Supplementary information

The proof of Theorem 1 is included in the Appendix. Specific subsampling algorithms for the logistic and Poisson regression models, code for the simulation study and real-world applications, and some additional figures are available in the online Supplementary Material.