1 Introduction

1.1 Integration Analysis Under DataSHIELD Constraint

Advancements in high-throughput technologies enable bioscientists to access data from numerous angles within one biological sample, and hundreds, even thousands, of variables associated with the genome, transcriptome, proteome, metabolome, epigenome, etc., are of interest [1]. Extracting and statistically analyzing informative features are a primary step in understanding the biological mechanisms of diseases, such as HIV [2,3,4]. However, because of the limited observational sample size, detecting the signal of informative features is rather challenging in high dimensions [5]. In practice, a natural solution to address the curse of dimensionality is to increase the sample size by merging datasets from different sources, often known as integrative analysis in the literature [5, 6]. Such integration is common in evidence-based medicine beyond group-centered studies [7], as valuable medical topics are often repeatedly examined by more than one research unit and share similarities across different locations [8]. Still, discrepancies exist across patient populations, and heterogeneity between studies becomes a significant challenge in integrative analysis [5, 9]. Another obstacle in integrative analysis is data-sharing constraints; analysts might not be able to share individual-level data because of legal and privacy concerns. For instance, patient-level medical information linked with electronic health records (EHR) usually cannot go past the firewall of its associated hospital [5].

Facing such data-sharing constraints, [10] proposed a widely adopted individual privacy-preserving framework, DataSHIELD, for integrative analysis. The mechanism of DataSHIELD is to pass only summary statistics from decentralized local nodes to the central node in integrative analysis. However, current DataSHIELD-supported approaches under the high-dimensional setting often fail to consider, or cannot be easily extended to accommodate, cross-site heterogeneity without sacrificing statistical efficiency; see [5, 6]. Examples include, but are not limited to, the aggregated debiased lasso estimator adopted by [11,12,13], where local debiasing might incur additional estimation errors. Other works, such as [14, 15], can avoid efficiency loss but only work for cross-site homogeneous scenarios and demand successive communications between local machines and the central node, which could waste time and resources.

Recently, several improvements have been made to accommodate cross-site heterogeneity in integrative analysis within the DataSHIELD framework [5, 6], but these require the calculation of the estimator at local sites and an approximation of the loss function. Such an approximation can lead to solutions inequivalent to those obtained by merging all datasets into one, with potentially associated accuracy and efficiency loss. In addition, the DSILT algorithm proposed by [5] requires the summary statistics to be updated and transmitted frequently, which may pose further privacy concerns besides increased computational complexity. Relevant work has also been considered under the names of distributed learning [16] and federated learning [17,18,19,20]. However, similar to the existing literature on integrative analysis, these methods often fail to address one or more of the issues concerning high dimensionality, site heterogeneity, data privacy, computational cost, estimation accuracy, and equivalence, and we aim to bridge the gap.

In this paper, we focus on the DataSHIELD constraint and propose an \(L_{2,0}\) method to conduct integrative analysis. The proposed method naturally accommodates high dimensionality and site heterogeneity at low computational cost, and improves estimation accuracy by utilizing the common supports across different sites and implementing \(L_{2,0}\) penalization solely on nuisance parameters. The most distinctive feature of our algorithm is that it achieves equivalence (in contrast with approximate solutions) under the DataSHIELD constraint, meaning the results produced by our algorithm are exactly the same as those obtained if all datasets across different sites had been merged into one. The equivalence property eliminates concerns about the potential estimation accuracy and efficiency loss induced by approximation algorithms for integrative analysis under the data-sharing constraint. We introduce the model setting and outline of the proposed method, particularly the common support assumption and the \(L_{2,0}\) penalization, in Sects. 1.2 and 1.3, respectively.

Notations: \(\left| \cdot \right|\) denotes the size of a set. [i] represents the set \(\{1, \ldots , i\}\). Suppose a and b are constants; \(a[i]:=\) \(\{a, \ldots , ai\}\) and \([i]-b:= \{1-b, \ldots , i-b\}\); that is, we operate on each element in the set. \(S_1 {\setminus } S_2\) symbolizes the set difference operation. We define \(S_1 \times S_2\) as the Cartesian product of two sets \(S_1\) and \(S_2\). \({X}_{[S_1 \times S_2]}\) denotes the submatrix of X whose entries lie in the kth row and \(\ell\)th column of X, where \(k \in S_1\) and \(\ell \in S_2\). Inspired by [21], we modify their notation as follows. Define the selected set \(\mathcal {A}=\{j \in [p]:||{\beta }_{G_{j}}||_{2} \ne 0\}\). The unselected set is \(\mathcal {I}=[p] \backslash \mathcal {A}=\mathcal {A}^{c}\). We let \({\beta }_{\mathcal {A}}=({\beta }_{G_{j}}, j \in \mathcal {A}) \in \mathcal {R}^{M \cdot |\mathcal {A}|}\); that is, the dimension of \({\beta }_{\mathcal {A}}\) is the product of the total number of sites M and the set size \(|\mathcal {A}|\). We define \({X}_{\mathcal {A}}=({X}_{G_{j}}, j \in \mathcal {A}) \in \mathbb {R}^{n \times (M \cdot |\mathcal {A}|)}\) and denote \({\beta }^{*}\) as the true regression coefficients. The true set of groups is \(\mathcal {A}^{*}=\{j \in [p]:\Vert {\beta }_{G_{j}}^{*}\Vert _{2} \ne 0\}\), and \(\mathcal {I}^{*}=\left( \mathcal {A}^{*}\right) ^{c}\).

1.2 Model Statement

We consider an integrative analysis problem with multiple datasets of linear models, a setting broadly considered in practice; see [22, 23]. Suppose there are M independent studies. The m-th study contains \(n_m\) random observations on the outcomes \(y^{(m)}=(y_{1}^{(m)}, \dots ,y_{n_m}^{(m)})^{\top }\), vector \(D^{(m)}=(D_{1}^{(m)},\dots ,D_{n_m}^{(m)})^{\top } \in \mathbb {R}^{n_m}\) and covariate matrix \({X}^{(m)}=(X_{1}^{(m)\top },\dots ,X_{n_m}^{(m)\top })^{\top }\in \mathbb {R}^{n_m \times p}\). Within each study, we assume the same regression model:

$$\begin{aligned} y_{i}^{(m)}= D_{i}^{(m)} \cdot \alpha ^{(m)} +X_{i}^{(m)}\beta ^{(m)}+\epsilon _{i}^{(m)},\quad i=1,\dots ,n_m \end{aligned}$$
(1)

where \(\alpha ^{(m)} \in \mathcal {R}\) denotes the site-specific parameter of interest. Here, we focus on a one-dimensional site-specific parameter for ease of presentation, but the model can be naturally extended to multi-dimensional site-specific parameters. The coefficient vector \(\beta ^{(m)} = (\beta ^{(m)}_1, \ldots , \beta ^{(m)}_p)^\intercal\) represents nuisance parameters, which are not of direct interest but are used for model adjustment in practice. The \(\epsilon _{i}^{(m)}\) are i.i.d. error terms satisfying \(\mathbb {E} (\epsilon _{i}^{(m)}|D_{i}^{(m)},X_{i}^{(m)})=0\).

One motivating example of Model (1) is the estimation of treatment effects in multi-site observational studies where the potential confounder bias needs to be accounted for by appropriate statistical methods, such as matching methods [24, 25] and AIPWE [26]. Here, as suggested by [27, 28], we consider using a regression model to adjust the potential confounder bias and let \(D^{(m)}\) denote the treatment indicator and \(X^{(m)}\) denote the potential confounder of the m-th site. Then, the site-specific parameter \(\alpha ^{(m)}\) is the treatment effect of the m-th site, which is of our interest, and \(\beta ^{(m)}\) is the confounder effect, a nuisance parameter. Other motivating examples include but are not limited to the repeated measurement design, where a subject might be measured multiple times across different sites [29], and the multiple-measurement-vector (MMV) problem, where signals are collected from different sources [30].

Based on Model (1), to borrow similarities across different sites to improve estimation accuracy in integrative analysis, we consider a common support assumption: the sparsity sets are the same across different sites. Sparsity is a widely adopted assumption to ensure the identifiability of the model in high dimensions [31]; i.e., \(p>n_{m}\). Although the parameters might not be the same across different sites due to site heterogeneity, the sparsity pattern of the nuisance parameter might be the same in many practical applications. Take the multi-site observational study as an example; though the specific effects of the confounders might differ, the true confounders (features) are often the same due to the similarity of patients’ preferences in choosing the drugs regardless of the sites. Other examples naturally bearing the common support assumption include the repeated measurement design and the MMV problem, where for each subject or signal, measurements across different sites are expected to have the same sparsity set [32,33,34]. Specifically, let

$$\begin{aligned} supp(\beta ^{(m)})=\{j: \beta _{j}^{(m)}\ne 0\}. \end{aligned}$$
(2)

denote the active set. The common support assumption means that

$$\begin{aligned} supp(\beta ^{(1)})=\cdots =supp(\beta ^{(m)})=\cdots =supp(\beta ^{(M)}); \end{aligned}$$
(3)
$$\begin{aligned} s=|supp(\beta ^{(m)})|<n. \end{aligned}$$
(4)
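As a toy numerical illustration of (2)–(4) (the coefficient values below are made up for illustration), the supports can coincide across sites even though the coefficient values differ:

```python
import numpy as np

# hypothetical coefficients for M = 2 sites; values differ, supports agree
betas = [np.array([0.0, 1.3, 0.0, -0.7]),   # site 1
         np.array([0.0, 0.4, 0.0,  2.1])]   # site 2

supports = [set(np.flatnonzero(b)) for b in betas]    # Eq. (2), per site
assert all(s == supports[0] for s in supports)        # Eq. (3): common support
s = len(supports[0])                                  # Eq. (4): sparsity level
print(sorted(map(int, supports[0])), s)               # [1, 3] 2
```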

Note that here, we impose the common support assumption solely on the nuisance parameter. This is because the site-specific parameter is what we wish to study and to improve the estimation accuracy of the parameter of interest, site-specific parameters should be retained in the reduced model rather than being screened out during the feature selection [35].

The main difference between the common support assumption considered in this paper and the similar parameter assumption considered in the fusion method [11, 36] is where the sparsity arises. Specifically, the common support assumption is w.r.t. the sparsity of the parameters, while the similar parameter assumption is w.r.t the sparsity of the distance between the parameters, which is similar to the difference between lasso and fused lasso [37]. Both the sparsity of the parameters and the sparsity of the distance between the parameters have practical implications, and we might adopt one or both in real-world scenarios; e.g., the sparsity in genomics [38] and the sparsity in the distance between parameters in time-varying/spatial data [39]. In this paper, we focus on the common support assumption, which induces the same sparsity of the parameter across different sites.

1.3 Outline of the Proposed Method

Besides passing only summary statistics from the local sites, our proposed method consists of two key elements. The first key element of the proposed method is that we apply the \(L_{2,0}\) penalization to induce the same support set over the nuisance parameter. The second key element of the proposed method is that we place the \(L_{2,0}\) penalization solely on the nuisance parameter but not on the site-specific parameter.

For the first key element, we consider \(L_{2,0}\) instead of \(L_{2,1}\), another widely adopted strategy to induce the same support, for the following two reasons. First, it is well known that \(L_{2,1}\) suffers from the selection bias and over-shrinkage of significant coefficients, while \(L_{2,0}\) is more favorable as it allows an explicit presentation of support size [21, 40]. Second, based on \(L_{2,0}\), with a carefully designed algorithm to assemble the summary statistics, we can address inequivalence issues in those approximation integrative analysis methods based on \(L_{2,1}\) under the DataSHIELD constraint. The second key element, penalization solely on the nuisance parameter, is due to the following two reasons. The first reason is such partial penalization can help improve model interpretation. Take the above multi-site observational study as an example; it is hard for researchers to conclude or explain the treatment efficacy if the treatment indicator is excluded from their selected model [41]. The second reason is that keeping the parameter of interest in the model can help improve estimation accuracy. As argued by [35], prior information should be taken into account in feature selection. In particular, the parameters deemed important, such as the parameter of interest in this paper, should be retained in the selected model to improve estimation accuracy.

The rest of the paper is organized as follows. We first review the single-site algorithm proposed by [21] and then present ours, enabling multi-site functioning with penalization solely on the nuisance parameter, in Sect. 2. The theoretical properties are in Sect. 3, followed by the simulation and the real-data application in Sects. 4 and 5, respectively. At the end, we summarize in Sect. 6.

2 Methodology

2.1 Innovative Use of Splicing Approach

As discussed in Sect. 1.3, we aim to apply \(L_{2,0}\) to induce the same support across different sites. It is well known that \(L_{2,0}\) leads to an intractable nonconvex problem, and among the few existing computational methods for \(L_{2,0}\), we consider the single-site best subset of groups selection (BSGS) algorithm, Abess, which adopts a splicing approach proposed by [21]. In contrast to traditional feature selection methods, which focus on each variable individually, BSGS incorporates the grouping information of variables and selects features at the group level. By employing the splicing technique, the algorithm iteratively includes the significant groups and discards the nonessential ones, enhancing the interpretability of the fitted model. Under mild assumptions, the algorithm has been proven to possess polynomial complexity and, with high probability, to determine the optimal subsets of groups, even in high-dimensional feature spaces.

Specifically, [21] assumed a linear model composed of J non-overlapping groups, referred to as a group linear model. The model is formulated as follows:

$$\begin{aligned} {y}=\sum _{j=1}^{J} {X}_{G_{j}} {\beta }_{G_{j}}+\varepsilon . \end{aligned}$$
(5)

They denote \({y} \in \mathbb {R}^{n}\) as the outcome variable and represent the jth group’s regressor matrix as \({X}_{G_{j}} \in \mathbb {R}^{n \times p_{j}}\), where \(p_{j}\) is the size of the jth group. They define \({\beta }_{G_{j}} \in \mathbb {R}^{p_{j}}\) as the jth group’s regression coefficients and \(\varepsilon \in \mathbb {R}^{n}\) as the random error term. \(G_{j}\) refers to the collection of indices associated with predictors that belong to the jth group. Additionally, \(\cup _{j=1}^{J} G_{j}=[p]\), where \(G_{i}\) and \(G_{j}\) are disjoint for all \(i \ne j\). In the simplest case, the model reduces to the ordinary linear model when the group size \(p_j = 1\) for all \(j \in [J]\).

To use splicing to address the problem of our interest, we need to categorize the same variable across different sites as a group in an appropriate way. Suppose, for now, that we are indifferent between the site-specific and nuisance parameters. We can set \(y = ({y^{(1)}}^\intercal , \cdots , {y^{(M)}}^\intercal ) ^\intercal \in \mathcal {R}^{n}\), \(\beta _{G_j} = ( \beta ^{(1)}_j, \cdots , \beta ^{(M)}_j )^\intercal \in \mathbb {R}^{M}\), and \(\alpha = ( \alpha ^{(1)}, \cdots , \alpha ^{(M)})^\intercal \in \mathbb {R}^{M}\). However, because of the unmatched dimensionality, one cannot construct a proper setting by assigning \(X_{G_j}\) a general matrix as previously permitted by [21]. Such a challenge motivates us to consider a special regressor matrix with a block diagonal structure. Formally, we design \(X_{G_j} = \text {diag} (X^{(1)}_j, \cdots , X^{(M)}_j ) \in \mathcal {R}^{n \times M}\) for \(j \in [p]\) and \(D = \text {diag} \left( D^{(1)}, \cdots , D^{(M)}\right) \in \mathcal {R}^{n \times M}\), where \(D^{(m)} \in \mathcal {R}^{n_m}\), \(\alpha ^{(m)} \in \mathcal {R}\), and \(\beta ^{(m)}_j \in \mathcal {R}\) for \(j \in [p]\).
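To make the block-diagonal construction concrete, here is a minimal NumPy sketch (the site counts, dimensions, and the helper name `group_regressor` are all illustrative assumptions, 0-indexed for code) that stacks the jth covariate of every site into the group regressor \(X_{G_j} \in \mathcal {R}^{n \times M}\):

```python
import numpy as np

rng = np.random.default_rng(0)
M, p = 3, 5                      # hypothetical: 3 sites, 5 covariates
n_m = [4, 6, 5]                  # per-site sample sizes (illustrative)

# per-site covariate matrices X^(m) of shape (n_m, p)
X_sites = [rng.standard_normal((n, p)) for n in n_m]

def group_regressor(X_sites, j):
    """Stack the j-th covariate column of every site into the
    block-diagonal group regressor X_{G_j} of shape (n, M)."""
    n = sum(X.shape[0] for X in X_sites)
    M = len(X_sites)
    X_Gj = np.zeros((n, M))
    row = 0
    for m, X in enumerate(X_sites):
        X_Gj[row:row + X.shape[0], m] = X[:, j]   # site m fills column m only
        row += X.shape[0]
    return X_Gj

X_G0 = group_regressor(X_sites, 0)
print(X_G0.shape)   # (15, 3): one column per site, zeros outside each block
```

Each site's rows are nonzero only in that site's own column, which is exactly what makes the pairwise products in later sections decompose by site.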

Equivalently, we can write the group linear model in its general matrix form. The idea is that we append the treatment indicator to each design matrix of biomarkers on each site, intending to show the indifference between the nuisance and site-specific parameters under our temporary assumption.

$$\begin{aligned} y = \breve{X} \breve{\beta } + \epsilon , \end{aligned}$$
(6)

where \(\breve{\beta } = (\breve{\beta }^{(1)}{}^{\intercal }, \cdots , \breve{\beta }^{(M)}{}^{\intercal })^{\intercal } \in \mathbb {R}^{M(p+1)}\) and \(\breve{\beta }^{(m)} = (\beta ^{(m)}{}^{\intercal }, \alpha ^{(m)}{}^{\intercal })^{\intercal }\) for \(m \in [M]\). \(\breve{X} = \text {diag}(\breve{X}^{(1)}, \cdots \breve{X}^{(M)}) \in \mathcal {R}^ {n \times (M(p+1))}\) and \(\breve{X}^{(m)} = [{X}^{(m)}, D^{(m)}] \in \mathcal {R}^{n_m \times (p+1)}\).

Suppose the entire dataset resides at a single site, and we allow for penalizing site-specific parameters. The feature selection and parameter estimation can be performed using the algorithm introduced by [21] after the problem reformulation. With initially selected predictors, the algorithm repetitively solves an ordinary least square problem with the objective function

$$\begin{aligned} \min _{\breve{\beta } \in \mathbb {R}^{M(p+1)}} \frac{1}{2 n}\Vert {y}-\breve{{X}} \breve{{\beta }}\Vert _{2}^{2} \quad \text{ s.t. } \Vert \breve{{\beta }}\Vert _{2,0} \leqslant T, \end{aligned}$$
(7)

followed by a splicing procedure. That is, the algorithm exchanges the predictors from selected and unselected sets to check for further improvement in the loss.

Yet, there still exist challenges to address the problem of our interest even after we modify [21]’s algorithm by Eq. (7). First, the algorithm in [21] does not distinguish between the site-specific and nuisance parameters, and \(\alpha\) is penalized while it is of interest and deemed important. This might lead to less interpretable and accurate results. Second, the algorithm in [21] does not consider the situation of interest where data are stored at multiple sites and processed under the DataSHIELD framework. Extending the single-site algorithm in [21] to multiple sites under the DataSHIELD framework is desired but challenging. In specific, issues arise when the algorithm attempts to compute the optimal solutions of (7). For instance, we need to figure out how one of the oracle solutions,

$$\begin{aligned} {\breve{\beta }}^{k}_{\mathcal {A}^{k}} = (\breve{X}_{\mathcal {A}^{k}}^\intercal \breve{X}_{\mathcal {A}^{k}})^{\scriptscriptstyle {-\!1}} \breve{X}_{\mathcal {A}^{k}}^\intercal y, \end{aligned}$$

in the k-th iteration will be computed if only the transmission of summary statistics is permitted.

2.2 Proposed Method

Built on the innovative use of splicing in Sect. 2.1, we propose an integrative analysis algorithm to estimate the site-specific parameters. Specifically, we first modify the objective function and derive an updated solution when the \(L_{2,0}\) penalization is solely on the nuisance parameter. Then, we develop Algorithm 1 to assemble the summary statistics and outputs requested in the solution of the modified \(L_{2,0}\) objective function. In the end, based on Algorithm 1, we formally introduce Algorithm 2, which satisfies the DataSHIELD framework in a multi-site setting without penalizing the site-specific parameter.

To start with, we state the updated objective function

$$\begin{aligned} \min _{\beta \in \mathbb {R}^{Mp}, \alpha \in \mathbb {R}^{M} } \frac{1}{2n} \Vert y-X\beta - D\alpha \Vert ^2_2 \text { s.t. } \Vert \beta \Vert _{2,0} \le T, \end{aligned}$$
(8)

where \(\beta = \left[ {\beta ^{(1)}}^\intercal , \cdots , {\beta ^{(M)}}^\intercal \right] ^\intercal \in \mathcal {R}^{Mp}\); \(\alpha = \left[ \alpha ^{(1)}, \cdots , \alpha ^{(M)} \right] ^\intercal \in \mathcal {R}^M\), \(n = \sum _{m=1}^{M} n_m\), \(X = diag\left( X^{(1)}, \cdots X^{(M)}\right) \in \mathcal {R}^ {n \times (Mp)}\), \(D = diag\left( D^{(1)}, \cdots D^{(M)}\right) \in \mathcal {R}^{n \times M}\).

Namely, we consider a multi-site linear regression with block diagonal design matrices X and D. The site-specific parameter \(\alpha\) is not subject to penalization, as we no longer treat \(\beta\) and \(\alpha\) as indistinguishable. While the inclusion of any \(\beta _{G_j}\), \(j \in [p]\), is indeterminate, we always guarantee the inclusion of \(\alpha\) in the model, which we refer to as conditional feature screening. Since the penalization term depends on \(\beta\) purely, given \(\beta\), the optimization of \(\alpha\) pertains exclusively to the first term, which is differentiable. Solving it gives us the solution for \(\alpha\). That is,

$$\begin{aligned} \alpha = \left( D^\intercal D\right) ^{\scriptscriptstyle {-\!1}} D^\intercal \left( y - X\beta \right) . \end{aligned}$$
(9)
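Because D is block diagonal with one treatment column per site, \(D^\intercal D\) is diagonal and Eq. (9) decouples into per-site scalar least squares. The NumPy sketch below (all dimensions and random inputs are illustrative) checks this decoupling against the pooled formula:

```python
import numpy as np

rng = np.random.default_rng(1)
M, p = 2, 4
n_m = [6, 8]
D_sites = [rng.standard_normal(n) for n in n_m]          # D^(m) as vectors
X_sites = [rng.standard_normal((n, p)) for n in n_m]
y_sites = [rng.standard_normal(n) for n in n_m]
b_sites = [rng.standard_normal(p) for _ in range(M)]     # a given beta^(m)

# Eq. (9) decouples: alpha^(m) = D^(m)T (y^(m) - X^(m) beta^(m)) / ||D^(m)||^2
alpha = np.array([d @ (y - X @ b) / (d @ d)
                  for d, X, y, b in zip(D_sites, X_sites, y_sites, b_sites)])

# check against the pooled block-diagonal formulation of Eq. (9)
n = sum(n_m)
D = np.zeros((n, M)); X = np.zeros((n, M * p)); r = 0
for m in range(M):
    D[r:r + n_m[m], m] = D_sites[m]
    X[r:r + n_m[m], m * p:(m + 1) * p] = X_sites[m]
    r += n_m[m]
y = np.concatenate(y_sites); beta = np.concatenate(b_sites)
alpha_pooled = np.linalg.solve(D.T @ D, D.T @ (y - X @ beta))
assert np.allclose(alpha, alpha_pooled)
```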

In turn, once we know the optimal solution of \(\alpha\), we can solve for the one for \(\beta\) by splicing. We substitute the given \(\alpha\) into (8). Some algebra turns the problem to

$$\begin{aligned} \min _{\beta \in \mathbb {R}^{Mp}} \frac{1}{2n} \Vert \hat{y}- \hat{X}\beta \Vert ^2_2 \quad \text{ s.t. } \Vert \beta \Vert _{2,0} \le T, \end{aligned}$$
(10)
$$\begin{aligned} \text {where } H = \left( I-D \left( D^\intercal D \right) ^{\scriptscriptstyle {-\!1}}D^{\intercal }\right) , \end{aligned}$$
(11)
$$\begin{aligned} \hat{y} = H y; \hspace{1mm} \hat{X} = HX. \end{aligned}$$
(12)
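The algebra behind (10)–(12) can be checked numerically. In the sketch below (dense matrices with made-up dimensions, standing in for a fixed support), H is idempotent and symmetric, and profiling out \(\alpha\) via H reproduces the joint least-squares solution over \((\beta , \alpha )\):

```python
import numpy as np

rng = np.random.default_rng(2)
n, q, M = 20, 4, 2   # illustrative sizes; q columns of X on a fixed support
X = rng.standard_normal((n, q))
D = rng.standard_normal((n, M))
y = rng.standard_normal(n)

# H from Eq. (11): residual-maker of D; idempotent and symmetric
H = np.eye(n) - D @ np.linalg.solve(D.T @ D, D.T)
assert np.allclose(H, H @ H) and np.allclose(H, H.T)

# profiled solution: beta from (X^T H X)^{-1} X^T H y, then alpha via Eq. (9)
beta = np.linalg.solve(X.T @ H @ X, X.T @ H @ y)
alpha = np.linalg.solve(D.T @ D, D.T @ (y - X @ beta))

# joint unpenalized least squares over [X, D] gives the same answer
theta, *_ = np.linalg.lstsq(np.hstack([X, D]), y, rcond=None)
assert np.allclose(theta[:q], beta) and np.allclose(theta[q:], alpha)
```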

Remark

Problem (10) is exactly of the form proposed in (7). We can directly apply [21]’s results after the variable transformation. Since H is idempotent and symmetric, we will see cancellations in our final solutions.

However, like [21], we still face the obstacle of solving for \(\beta ^{\diamond }_{\mathcal {A}^{\diamond }}\), the optimal value of \(\beta\) under the optimal selected set \(\mathcal {A}^{\diamond }\), without knowing \(X_{\mathcal {A}^{\diamond }}\) (the sub-matrix of X under the optimal selected set \(\mathcal {A}^{\diamond }\)) explicitly. Hence, we are motivated to design local summary statistics containing sufficient information to restore, at the center, the variables of interest \(\beta ^{k}\), \(\alpha ^k\), and \(\mathcal {A}^k\) in the k-th iteration of splicing leading to the optimal solution. Since for any index set \(\mathcal {A}^{k}\),

$$\begin{aligned} {\beta }^{k}_{\mathcal {A}^{k}}= & {} (X_{\mathcal {A}^{k}}^\intercal H X_{\mathcal {A}^{k}})^{\scriptscriptstyle {-\!1}} X_{\mathcal {A}^{k}}^\intercal H y \nonumber \\= & {} (X_{\mathcal {A}^{k}}^\intercal X_{\mathcal {A}^{k}} - X_{\mathcal {A}^{k}}^\intercal D (D^\intercal D)^{\scriptscriptstyle {-\!1}} D^\intercal X_{\mathcal {A}^{k}})^{\scriptscriptstyle {-\!1}} (X_{\mathcal {A}^{k}}^\intercal y - X_{\mathcal {A}^{k}}^\intercal D (D^\intercal D)^{\scriptscriptstyle {-\!1}} D^\intercal y), \end{aligned}$$
(13)

the involved pairwise products,

$$\begin{aligned} X_{\mathcal {A}^{k}}^\intercal X_{\mathcal {A}^{k}}, X_{\mathcal {A}^{k}}^\intercal y, X_{\mathcal {A}^{k}}^\intercal D, D^\intercal D, D^\intercal y, \end{aligned}$$

can be assembled from their local counterparts, respectively. That is, the summary statistics

$$\begin{aligned} {X^{(m)}_{\mathcal {A}^k}}^\intercal X^{(m)}_{\mathcal {A}^k}, {X^{(m)}_{\mathcal {A}^{k}}}^\intercal y^{(m)}, {X^{(m)}_{\mathcal {A}^{k}}}^\intercal D^{(m)}, {D^{(m)}}^\intercal D^{(m)}, {D^{(m)}}^\intercal y^{(m)}, \end{aligned}$$

sent from local sites \(m = 1, \cdots , M\). Such a procedure, computing distributively and then assembling at the center, is achievable due to the diagonal design of the matrix X in our problem. As an illustrative example, we give guidelines for constructing \(X_{\mathcal {A}^{k}}^\intercal X_{\mathcal {A}^{k}}\) from \({X^{(m)}_{\mathcal {A}^k}}^\intercal X^{(m)}_{\mathcal {A}^k}\) in Algorithm 1. For each site m, we generate the site-specific row and column index sets. The Cartesian product of these sets indicates the positions in the resultant matrix, \(X_{\mathcal {A}^{k}}^\intercal X_{\mathcal {A}^{k}}\), where the entries of \({X^{(m)}_{\mathcal {A}^k}}^\intercal X^{(m)}_{\mathcal {A}^k}\) should be placed. We emphasize the structure of the matrix X, as it enables the restoration of the variables of interest at the center using local summary statistics. Under this diagonal structure, \(X_{\mathcal {A}^{k}}^\intercal X_{\mathcal {A}^{k}}\) is sparse in the sense that the entries not filled by the summary statistics are all zeros. These zero entries eliminate the need for inter-site communication when assembling the resultant matrix, which would be infeasible if a general regressor matrix were used instead.
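The assembly step can be sketched as follows (index conventions are 0-indexed for code, and the group/site column ordering is our illustrative assumption): each site sends \({X^{(m)}_{\mathcal {A}}}^\intercal X^{(m)}_{\mathcal {A}}\), and the center scatters it into its site-specific rows and columns of the pooled Gram matrix, leaving all other entries zero.

```python
import numpy as np

rng = np.random.default_rng(3)
M, p = 2, 6
n_m = [7, 9]
X_sites = [rng.standard_normal((n, p)) for n in n_m]
A = [1, 3, 4]                       # current selected groups (0-indexed)

# local summary statistics X^(m)_A^T X^(m)_A, computed at each site
local = [X[:, A].T @ X[:, A] for X in X_sites]

# center: site m's block goes to rows/cols {t*M + m : t = 0..|A|-1}
k = len(A)
G = np.zeros((M * k, M * k))
for m, S in enumerate(local):
    idx = np.arange(k) * M + m
    G[np.ix_(idx, idx)] = S         # remaining entries stay zero

# sanity check against the pooled block-diagonal construction
def group_cols(j):
    n = sum(x.shape[0] for x in X_sites)
    out = np.zeros((n, M)); r = 0
    for m, x in enumerate(X_sites):
        out[r:r + x.shape[0], m] = x[:, j]; r += x.shape[0]
    return out

X_A = np.hstack([group_cols(j) for j in A])
assert np.allclose(G, X_A.T @ X_A)
```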

Algorithm 1

Assemble \(X_{\mathcal {A}^{k}}^\intercal X_{\mathcal {A}^{k}}\) from \({X^{(m)}_{\mathcal {A}^k}}^\intercal X^{(m)}_{\mathcal {A}^k}\)

Remark

If we seek to build \(X_{\mathcal {A}^{k}}^\intercal X_{\mathcal {I}^{k}}\), then we replace the column index set C with \(M[|\mathcal {I}^k|-1] + m\).

It is noteworthy that, although the pairwise products required at the center, such as \(X_{\mathcal {A}^{k}}^\intercal X_{\mathcal {A}^{k}}\) and \(D^\intercal D\), are direct build-ups of their local counterparts, their inverses are not. \((X_{\mathcal {A}^{k}}^\intercal H X_{\mathcal {A}^{k}})^{\scriptscriptstyle {-\!1}}\) and \((D^\intercal D)^{\scriptscriptstyle {-\!1}}\) are unattainable from the direct assembly of local summary statistics, as every entry of a matrix inverse requires information across different sites. We, hence, delegate the inverse calculation to the center. Similarly, we can decompose \(\alpha ^{k}\), \(d^{k}_{\mathcal {I}^{k}}\), and the loss \(\mathcal {L}\) into pairwise products, which can be directly constructed from the local summary statistics sent; if ever involved, the inverse will be calculated at the center. Most importantly, all the transmissions mentioned above occur only once. For example, we only need to transmit \(X^\intercal X\) to the center once. Then, for any \(\mathcal {A}^{k}\) and \(\mathcal {I}^{k}\), the product of sub-matrices \(X_{\mathcal {A}^{k}}^\intercal X_{\mathcal {I}^{k}}\) can be directly obtained from the \(X^\intercal X\) stored at the center by using linear algebra such that \(X_{\mathcal {A}^{k}}^\intercal X_{\mathcal {I}^{k}} = (X^\intercal X)_{[\mathcal {A}^{k} \times \mathcal {I}^{k}]}\).
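The once-transmitted Gram matrix yields any needed sub-product as a plain submatrix lookup, which a short NumPy sketch (dimensions and index sets are made up) makes explicit:

```python
import numpy as np

rng = np.random.default_rng(4)
n, P = 12, 10                      # P = total columns of the pooled X
X = rng.standard_normal((n, P))
XtX = X.T @ X                      # transmitted to the center once

A_idx = np.array([0, 3, 7])        # columns in the current selected set
I_idx = np.array([1, 4, 5, 9])     # columns in the unselected set

# (X^T X)_[A x I] recovers X_A^T X_I without any new transmission
sub = XtX[np.ix_(A_idx, I_idx)]
assert np.allclose(sub, X[:, A_idx].T @ X[:, I_idx])
```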

After overcoming the data privacy constraint, we are ready to introduce Algorithm 2. Following a similar approach to [21], we select potentially important predictors by examining their correlation with the outcome variable y. We initialize the set \(\mathcal {A}^0\) by sorting out the indices corresponding to the T largest values of \(\{||{X^\intercal _{G_j}y}||^2_2, j \in [p] \}\) at the center, where

$$\begin{aligned} {\Vert X^\intercal _{G_j}y\Vert }^2_2 = \sum _{m=1}^{M} \Vert {{X^{(m)}_{G_j}}^\intercal }y^{(m)}\Vert ^2_2. \end{aligned}$$

That requires the local nodes \(1, \cdots , M\) to send \(\{||{{X^{(m)}_{G_j}}^\intercal }y^{(m)}||^2_2, j \in [p] \}\) to the center. It is a one-off transmission.
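The initialization can be sketched as below (sizes and the screening threshold T are illustrative): each site computes its p squared inner products once, and the center sums them over sites and keeps the T largest:

```python
import numpy as np

rng = np.random.default_rng(5)
M, p, T = 2, 8, 3
n_m = [10, 12]
X_sites = [rng.standard_normal((n, p)) for n in n_m]
y_sites = [rng.standard_normal(n) for n in n_m]

# each site computes {||X^(m)_{G_j}^T y^(m)||_2^2 : j in [p]} and sends it once
local_scores = np.array([(X.T @ y) ** 2 for X, y in zip(X_sites, y_sites)])

# center assembles ||X_{G_j}^T y||_2^2 by summing over sites, then ranks
scores = local_scores.sum(axis=0)
A0 = set(np.argsort(scores)[-T:])   # indices of the T largest -> initial set
print(len(A0))   # 3
```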

Algorithm 2

MSplicing (Multi-Site Algorithm)

Remark

Note that the input of Algorithm 2 is summary statistics rather than the raw data for data privacy protection concerns.

In line with the idea of [21], our algorithm solves a constrained optimization problem. There are two major differences between Algorithm 2 and that of [21]. The first difference is that we have rewritten the solutions, loss, and set selection criteria with \(\alpha\) unpenalized. The second difference is that, due to the distributed setting, we need to expand and break down the expression of the pertinent variables into pairwise products, as demonstrated in Eq. (13), before calculating the variables of interest. We achieve this by calling Algorithm 1. The convenience is evident, as these pairwise products are already obtainable at the center, requiring no additional communication between sites. Being a one-time occurrence, the data transmission safeguards data privacy and enables the adoption of more advanced security techniques, such as differential privacy, in future work.

One may wonder why we employ the Euclidean distance for the candidate pre-screening in the while loop but use a newly defined norm during splicing. That is because \(S^k_1\) is a subset of \(\mathcal {A}^k\); the metric we use to select \(S^k_1\) is hence associated with the optimal value of \(\mathcal {A}^k\), which minimizes the loss. However, the selection criterion can differ when we instead attempt to move in the direction that most reduces the change in loss, as shown in Lemma 2.1.

Lemma 2.1

For any \(j \in \mathcal {A}^k\), the loss increase relative to \(\mathcal {L}(\beta ^k)\) due to discarding the jth group is

$$\begin{aligned} \mathcal {L}(\beta ^{\mathcal {A}^k \backslash j}) - \mathcal {L}(\beta ^k) = \beta ^\intercal _{G_j} \left( X_{G_j}^\intercal \left( \frac{I+D \left( D^\intercal D \right) ^{\scriptscriptstyle {-\!1}}D^\intercal }{2n} \right) X_{G_j}\right) \beta _{G_j}, \end{aligned}$$

where \(\beta ^{\mathcal {A}^k \backslash j}\) represents the estimator obtained by setting the entries of the jth group to zero.

We, hence, define the new selection criterion that appears in Algorithm 2:

$$\begin{aligned} \Vert \beta _{G_j}\Vert ^2_{D_{G_j}} = \beta ^\intercal _{G_j} \left( X_{G_j}^\intercal \left( \frac{I+ D \left( D^\intercal D \right) ^{\scriptscriptstyle {-\!1}}D^\intercal }{2n} \right) X_{G_j}\right) \beta _{G_j}, \end{aligned}$$
(14)

where \(\left( D^\intercal D\right) ^{\scriptscriptstyle {-\!1}}\) has already been computed at the center and requires no updates. \(\beta _{G_j}\) and \(X^\intercal _{G_j}D\) are directly attainable from \(\beta ^{k}\) and from \(X^\intercal _{\mathcal {A}^k}D\) and \(X^\intercal _{\mathcal {I}^k}D\), respectively. For \(j \in \mathcal {A}^k\), \(X^\intercal _{G_j} X_{G_j}\) is extractable from \(X^\intercal _{\mathcal {A}^k} X_{\mathcal {A}^k}\). The logic of defining \(\Vert d_{G_j}\Vert ^2_{D_{G_j}}\) and the way of computing it with summary statistics are similar.
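Since Eq. (14) expands to \(X_{G_j}^\intercal X_{G_j} + (X_{G_j}^\intercal D)(D^\intercal D)^{-1}(D^\intercal X_{G_j})\) inside the quadratic form, the criterion needs only the stored pairwise products. A NumPy sketch (dense stand-ins with made-up sizes, not the full block-diagonal model) checks this expansion:

```python
import numpy as np

rng = np.random.default_rng(6)
n, M = 18, 3
X_Gj = rng.standard_normal((n, M))
D = rng.standard_normal((n, M))
beta_Gj = rng.standard_normal(M)

# summary statistics only: X_Gj^T X_Gj, X_Gj^T D, and (D^T D)^{-1}
G_xx = X_Gj.T @ X_Gj
G_xd = X_Gj.T @ D
K = np.linalg.inv(D.T @ D)          # inverse computed at the center

# Eq. (14) evaluated from the pairwise products alone
mid = (G_xx + G_xd @ K @ G_xd.T) / (2 * n)
norm_sq = beta_Gj @ mid @ beta_Gj

# direct evaluation of Eq. (14) with the full matrices, for comparison
P = np.eye(n) + D @ K @ D.T
direct = beta_Gj @ (X_Gj.T @ (P / (2 * n)) @ X_Gj) @ beta_Gj
assert np.allclose(norm_sq, direct)
```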

We end this section by accentuating that Algorithm 2 can automatically address high dimensionality and site heterogeneity. The simulations in Sect. 4 also demonstrate that it can improve estimation accuracy. Furthermore, Algorithm 2 excels in computation cost, privacy protection, and equivalence, and we give remarks on these three merits below.

  • Computation cost: We extend the algorithm to multiple sites at a computation cost similar to that of the original single-site algorithm, except for the time spent on assembling the summary statistics in Algorithm 1. That is because the proposed method can be viewed as splicing implemented on a single site, i.e., the center, once the summary statistics are assembled; thus, the proposed method inherits the computational merits of [21] and possesses polynomial complexity.

  • Privacy protection: During the data transmission procedure, no individual-level data are exposed or utilized in the algorithm. The summary statistics we constructed replace their role and suffice for the algorithm to function properly in the distributed setting. No communication ever occurs between local nodes, thus eliminating potential collusion. The data are transmitted only once, and the receiving end is exclusive to the central node. The favorable one-off data transfer in our algorithm hence easily allows for incorporating other privacy-preserving techniques, such as differential privacy.

  • Equivalence: The specially designed block-diagonal structure and Algorithm 1 ensure the exact recovery of the quantities needed to compute the optimal solutions, making the output of Algorithm 2 equivalent to the pooled analysis. That is, the solution output by Algorithm 2 under the DataSHIELD constraint in a multi-site setting is exactly the same as the one obtained when the datasets from different sites are pooled together at a single site.
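The equivalence merit can be illustrated numerically: under a block-diagonal design, a least-squares fit computed at the center from per-site summary statistics coincides with the fit on the pooled, stacked data. The following toy check is our own illustration with plain least squares, not the paper's splicing algorithm; sizes and coefficients are arbitrary.

```python
import numpy as np

def block_diag(*mats):
    """Stack matrices into a block-diagonal matrix (NumPy-only helper)."""
    r_tot = sum(m.shape[0] for m in mats)
    c_tot = sum(m.shape[1] for m in mats)
    out = np.zeros((r_tot, c_tot))
    r = c = 0
    for m in mats:
        out[r:r + m.shape[0], c:c + m.shape[1]] = m
        r += m.shape[0]
        c += m.shape[1]
    return out

rng = np.random.default_rng(1)
Xs = [rng.normal(size=(40, 3)) for _ in range(2)]
ys = [X @ np.array([1.0, -2.0, 0.5]) + rng.normal(size=40) for X in Xs]

# Pooled fit on the stacked block-diagonal design (site-specific coefficients).
beta_pooled = np.linalg.lstsq(block_diag(*Xs), np.concatenate(ys), rcond=None)[0]

# Center-side fit from summary statistics only: block diagonality decouples the
# normal equations by site, so each site's (X'X, X'y) suffices.
beta_summary = np.concatenate(
    [np.linalg.solve(X.T @ X, X.T @ y) for X, y in zip(Xs, ys)])

assert np.allclose(beta_pooled, beta_summary)
```

The decoupling of the normal equations is exactly why the one-shot summary transfer loses nothing relative to pooling the raw data.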

3 Theoretical Properties

In this section, we state the assumptions and develop theorems that justify the validity of our proposed method.

Assumption 1

The random errors \(\varepsilon _{i}\) for \(i \in [n]\) follow an i.i.d. sub-Gaussian distribution with mean zero. Mathematically, there exists \(\sigma > 0\) such that \(P\left( \left| \varepsilon _{i}\right| >x\right) \leqslant 2 \exp \left( -x^{2} / \sigma ^{2}\right)\), for all \(x \geqslant 0\).

Assumption 2

There exist constants \(0< \hat{c}_{*}(T) \leqslant \hat{c}^{*}(T)<\infty\) such that

$$\begin{aligned} \hat{c}_{*}(T)\Vert {u}\Vert _{2}^{2} \leqslant \frac{||\hat{{X}}_{\mathcal {A}} {u}||_{2}^{2}}{n} \leqslant \hat{c}^{*}(T)\Vert {u}\Vert _{2}^{2}, \forall {u} \ne 0, {u} \in \mathbb {R}^{M \cdot |\mathcal {A}|} \text{ with } |\mathcal {A}| \leqslant T. \end{aligned}$$

Assumption 3

There exists a constant \(\hat{\omega }_{T} > 0\) such that

$$\begin{aligned} \hat{\omega }_{T}\Vert {u}\Vert _{2} \geqslant \frac{||\hat{{X}}_{\mathcal {A}}^{\top } \hat{{X}}_{\mathcal {B}} {u}||_{2}}{n}, \forall {u} \ne 0, {u} \in \mathbb {R}^{ M \cdot |\mathcal {B}|} \text{ with } |\mathcal {A}| \leqslant T,|\mathcal {B}| \leqslant T, \text{ and } \mathcal {A} \cap \mathcal {B}=\emptyset . \end{aligned}$$

Assumption 4

$$\begin{aligned} \hat{\mu }_{T}=\frac{8 \hat{c}^{*}(T) \left( \frac{(1+\hat{\eta }) \hat{\omega }_{T}}{\hat{c}_{*}(T) } \right) ^{2} }{(1-\hat{\eta })\left( \hat{c}_{*}(T)-\frac{\hat{\omega }_{T}^{2}}{\hat{c}_{*}(T)}\right) } \end{aligned}$$

For some constant \(0<\hat{\eta }<1\), the constant \(\hat{\mu }_{T}\), which depends on T, satisfies \(0 \leqslant \hat{\mu }_{T}<1\).

Assumption 5

The minimum group signal is denoted as \(\vartheta =\min _{j \in \mathcal {A}^{*}}\left\| {\beta }_{G_{j}}\right\| _{2}^{2}\) and it satisfies \(\frac{T M \log (M(p+1)) \log (\log n)}{n \vartheta }=o(1)\), where T is the support size and M, p, and n are the total numbers of sites, nuisance parameters considered, and observations across sites, respectively.

Assumption 6

\(\frac{M|\mathcal {A}^*| \log (p+1) \log (\log n)}{n}=o(1)\) and \(\frac{M T_{\max } \log (M(p+1))}{n}=o(1)\), where \(T_{\max }\) is the maximum support size.

Assumption 7

The total number of sites satisfies \(M = o\left( (p+1)^{\log (\log n)}\right)\).

The assumptions listed above are either well presented and argued in [21] or are modest extensions thereof; we consider the transformed matrix \(\hat{{X}}\), given in Eq. (10), instead of the original design matrix X, to embed the effects of the site-specific parameters. Additionally, we fix the group size to be M, a scenario included in the more general setting of [21].

Assumption 1 specifies the distribution of the error but not of the nuisance covariates X. A fixed design for the nuisance covariates avoids the distribution-heterogeneity issue in X that arises in federated learning. Assumptions 2 and 3 control the angle between the column spaces of the subpartitions of X. Specifically, under Assumption 3, when \(\hat{\omega }_{T} = 0\), the column space of \(\hat{{X}}_{\mathcal {A}}\) is orthogonal to that of \(\hat{{X}}_{\mathcal {B}}\), so the two partitions are uncorrelated. Assumption 4 regulates the correlation between the groups. We provide a stricter and thus better upper bound, \(\delta _{T}\), for \(\omega _T\) than the one given by [21]. Specifically, we set \(\delta _T = (c^{*}(2T)-c_{*}(2T))/2\). One can then verify that \(\omega _T \leqslant \delta _T\) by the theorem of [42]. Zhang et al. [21] provide one sufficient condition that governs Assumption 4, namely \(\delta _{T} \leqslant 0.188\), which holds, for example, when \(c_{*}(2 T) \geqslant 0.812\) and \(c^{*}(2 T) \leqslant 1.188\). However, note that this condition is sufficient but not necessary. Therefore, it may be possible to relax this correlation condition further.
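For intuition, the bound \(\omega _T \leqslant \delta _T\) can be checked by brute force on a tiny design. The sketch below is our own illustration with group size 1 and arbitrary dimensions; it computes the restricted eigenvalue bounds over all supports of size 2T and the restricted cross-correlation over all disjoint size-T pairs.

```python
import numpy as np
from itertools import combinations

rng = np.random.default_rng(2)
n, p, T = 30, 8, 2
X = rng.normal(size=(n, p))

def eig_bounds(size):
    """Smallest/largest eigenvalue of X_A' X_A / n over all supports |A| = size."""
    lo, hi = np.inf, -np.inf
    for A in combinations(range(p), size):
        w = np.linalg.eigvalsh(X[:, A].T @ X[:, A] / n)
        lo, hi = min(lo, w[0]), max(hi, w[-1])
    return lo, hi

c_lo, c_hi = eig_bounds(2 * T)          # c_*(2T) and c^*(2T)
delta_T = (c_hi - c_lo) / 2

# Restricted cross-correlation: largest spectral norm of X_A' X_B / n
# over disjoint supports |A| = |B| = T.
omega_T = max(
    np.linalg.norm(X[:, A].T @ X[:, B] / n, 2)
    for A in combinations(range(p), T)
    for B in combinations(range(p), T)
    if not set(A) & set(B)
)
assert omega_T <= delta_T + 1e-10
```

The inequality follows from the parallelogram identity applied to unit vectors with disjoint supports, which is why the size-2T eigenvalue bounds control the size-T cross term.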

Under these assumptions, Theorem 3.1 shows that, with high probability, Algorithm 2 will not incorrectly filter out truly relevant variables. Theorem 3.2 verifies that, combined with the GIC, the algorithm can identify the subset of truly relevant groups even when the model size is unknown; its estimator coincides with the oracle least-squares estimator [21]. Theorem 3.2 also implies that, with high probability, the GIC-tuned Algorithm 2 has polynomial complexity. The proofs of the theoretical results can be found in the appendix.

Theorem 3.1

(Support Recovery) Let \((\hat{\mathcal {A}}, \hat{\mathcal {I}}, \hat{{\beta }}, \hat{{\alpha }}, \hat{{d}})\) represent the solutions of Algorithm 2. Under Assumptions (1) to (5), if \(T \geqslant s^{*}\), we obtain

$$\begin{aligned} P\left( \hat{\mathcal {A}} \supseteq \mathcal {A}^{*}\right) \geqslant 1-\delta _{1}-\delta _{2}. \end{aligned}$$

Particularly, when \(T=s^{*}\),

$$\begin{aligned} P\left( \hat{\mathcal {A}}=\mathcal {A}^{*}\right) \geqslant 1-\delta _{1}-\delta _{2}. \end{aligned}$$

Theorem 3.2

(Selection Consistency) Let \((\hat{\mathcal {A}}, \hat{\mathcal {I}}, \hat{{\beta }}, \hat{{\alpha }}, \hat{{d}})\) represent the solutions of Algorithm 2. Suppose Assumptions (2) to (5) hold with \(T_{\max }\), and Assumptions (1), (6), and (7) also hold. When n is sufficiently large, we obtain

$$\begin{aligned} P\left( \min _{\hat{\mathcal {A}} \ne \mathcal {A}^{*}, \hat{\mathcal {A}} \subseteq \mathcal {S}} {\text {GIC}}(\hat{\mathcal {A}}) > {\text {GIC}}\left( \mathcal {A}^{*}\right) \right) \geqslant 1 - O\left( p^{-\gamma }\right) , \end{aligned}$$

for some constant \(0< \gamma < 1\), where

$$\begin{aligned} {\text {GIC}}(\hat{\mathcal {A}}) = n \log L(\hat{{\beta }}) + M(T+1) \log (p+1) \log (\log n), \end{aligned}$$

and \(L(\hat{{\beta }})\) is the loss output for a fixed support size T from Algorithm 2.
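As an illustration of how this criterion could tune the support size, the sketch below evaluates the GIC over a few hypothetical losses and picks the minimizer; every number here is made up for illustration.

```python
import math

def gic(loss, n, M, T, p):
    """GIC from Theorem 3.2: n log L(beta_hat) + M(T+1) log(p+1) log(log n)."""
    return n * math.log(loss) + M * (T + 1) * math.log(p + 1) * math.log(math.log(n))

# Hypothetical losses L(beta_hat) returned by the fit at support sizes T = 1..5;
# the loss keeps dropping, but the penalty eventually dominates.
losses = {1: 9.1, 2: 4.2, 3: 2.0, 4: 1.9, 5: 1.85}
best_T = min(losses, key=lambda T: gic(losses[T], n=400, M=4, T=T, p=100))
```

With these illustrative numbers the penalty outweighs the small loss reductions beyond T = 3, which is the kind of behavior the selection consistency result formalizes.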

Theorem 3.3

(Convergence) In the kth iteration, Algorithm 2 outputs solutions denoted as \(\left( \mathcal {A}^{k}, \mathcal {I}^{k}, {\beta }^{k}, {\alpha }^{k}, {d}^{k}\right)\). Under Assumptions (1) to (4), when \(T \geqslant s^{*}\), we obtain:

(i)

$$\begin{aligned} \left| 2 n L\left( {\beta }^{k}\right) - 2 n L\left( {\beta }^{*}\right) \right| \leqslant \hat{\mu }_{T}^{k} \Vert {y}\Vert _{2}^{2}. \end{aligned}$$

(ii)

$$\begin{aligned} \mathcal {A}^{k} \supseteq \mathcal {A}^{*} \text { if } k > \log _{\frac{1}{\hat{\mu }_{T}}}\left( \frac{\Vert {y}\Vert _{2}^{2}}{\left( 1-\frac{\hat{\eta }}{2}\right) n\left( \hat{c}_{*}(T)-\frac{\hat{\omega }_{T}^{2}}{\hat{c}_{*}(T)}\right) \vartheta }\right) , \end{aligned}$$

with probability at least \(1-\delta _{1}-\delta _{2}\).
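To get a feel for the iteration bound in Theorem 3.3(ii), one can plug constants into the threshold; all values below are hypothetical and chosen only to make the arithmetic concrete.

```python
import math

def min_splicing_iterations(mu_T, y_norm_sq, n, eta, c_star_low, omega_T, theta):
    """Smallest integer k exceeding the bound in Theorem 3.3(ii).

    All arguments are the (illustrative) constants of the theorem:
    mu_T in (0, 1), c_star_low = c_*(T), theta = minimum group signal.
    """
    denom = (1 - eta / 2) * n * (c_star_low - omega_T**2 / c_star_low) * theta
    return math.ceil(math.log(y_norm_sq / denom, 1 / mu_T))

k = min_splicing_iterations(mu_T=0.5, y_norm_sq=1e4, n=200,
                            eta=0.5, c_star_low=1.0, omega_T=0.2, theta=0.5)
```

The bound grows only logarithmically in \(\Vert y\Vert _2^2\), which is consistent with the polynomial complexity claimed for the overall procedure.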

4 Simulation

Next, we conduct simulations to verify the merits of Algorithm 2. The results demonstrate that Algorithm 2 achieves a nearly zero false positive rate (FPR), improved estimation accuracy, and polynomial computational time.

We construct each site matrix \({X}^{(m)}\) \(\in \mathcal {R}^{n_m \times p}\), for \(m = 1\) to M, by drawing its rows independently from a multivariate Gaussian distribution \(\mathcal {MVN}({{0}}_{p \times 1}, {\Sigma }_{p \times p})\). The p columns of \({X}^{(m)}\) are exponentially correlated; that is, we set the covariance matrix \({\Sigma }_{i j}=\rho ^{|i-j|}\), which reflects the interactions between different groups. To depict the relationship between \({X}^{(m)}\) and \({D}^{(m)}\) on the same site, we begin by generating \({\grave{D}}^{(m)}\) according to the rule that \({\grave{D}}^{(m)}_{i} = 1\) if \(\sum _{j \in \mathcal {A}^{*}} {X}^{(m)}_{i, j} > 1\) and \({\grave{D}}^{(m)}_{i} = 0\) otherwise. When we set \({D}^{(m)} = {\grave{D}}^{(m)}\), there is a strong correlation between \({D}^{(m)}\) and \({X}^{(m)}\). To allow more flexibility in adjusting this correlation, we introduce the probability \(\grave{p}\). Formally, we generate \({D}^{(m)}\) as follows,

$$\begin{aligned} {D}^{(m)}_{i} = {\left\{ \begin{array}{ll} {\grave{D}}^{(m)}_{i}\,(\text {strongly correlated with }{X}^{(m)}_{i, j},\,j \in \mathcal {A}^{*}) &{}\text {with probability }\grave{p}\\ 0\text { or }1\,(\text {random assignment}) &{} \text {each with probability }\frac{1-\grave{p}}{2} \end{array}\right. } \end{aligned}$$

When \(\grave{p} = 0\), \({D}^{(m)}\) and \({X}^{(m)}\) are independent and have zero correlation.

We then generate the underlying regression coefficients \({\beta }^{*}\). Since \({\beta }^{*}\) can be obtained by re-allocating entries from the \({\beta }_{G_{j}}^{*}\), we define \({\beta }_{G_{j}}^{*}\) instead. Unlike \(\rho\), which portrays the interaction across groups, we construct \({\beta }_{G_{j}}^{*}\) for \(j \in \mathcal {A}^{*}\) and \(\alpha ^{*}\) to reflect the correlation inside each group; that is, we assume the same predictor has correlated coefficients across different sites. For each important group's coefficient entries, \(i = 1\) to M, we let

$$\begin{aligned} \left( {\beta }_{G_{j}}^{*}\right) _{i}=b_{i}^{j}-\frac{1}{M} \sum _{i'=1}^{M} b_{i'}^{j}, \end{aligned}$$

where \(b_{1}^{j}, \ldots , b_{M}^{j}\) is obtained by independent draws from \(\mathcal {N}(0.5, 0.2)\). Meanwhile, we set \({\beta }_{G_{j}}^{*} = {0}_{M \times 1}\) for \(j \in \mathcal {I}^{*}\) trivially from the definition of \(\mathcal {I}^{*}\). Eventually, we obtain y from

$$\begin{aligned} y = diag\left( X^{(1)}, \cdots , X^{(M)}\right) {\beta ^{*}} + diag\left( D^{(1)}, \cdots , D^{(M)}\right) {\alpha ^{*}} + \varepsilon , \end{aligned}$$

where \(\varepsilon _{i} \sim \mathcal {N}\left( 0, \sigma ^{2}\right) , i=1\) to n.
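The generation scheme above can be sketched as follows. The active set, the random seed, and the reading of \(\mathcal {N}(0.5, 0.2)\) as mean and standard deviation are our own illustrative choices, not the paper's exact configuration.

```python
import numpy as np

rng = np.random.default_rng(0)
M, n_m, p, rho, sigma, p_grave = 4, 50, 100, 0.9, 4.0, 0.8
A_star = [3, 17, 42]                               # hypothetical active groups

# Exponentially correlated columns: Sigma_ij = rho^|i-j|.
Sigma = rho ** np.abs(np.subtract.outer(np.arange(p), np.arange(p)))

Xs, Ds = [], []
for _ in range(M):
    X = rng.multivariate_normal(np.zeros(p), Sigma, size=n_m)
    # D strongly correlated with the active columns of X,
    # relaxed to a coin flip with probability 1 - p_grave.
    D_strong = (X[:, A_star].sum(axis=1) > 1).astype(float)
    keep = rng.random(n_m) < p_grave
    D = np.where(keep, D_strong, rng.integers(0, 2, size=n_m).astype(float))
    Xs.append(X)
    Ds.append(D)

# Within-group centered coefficients: (beta*_{G_j})_i = b_i^j - mean_{i'} b_{i'}^j.
beta = np.zeros((M, p))
for j in A_star:
    b = rng.normal(0.5, 0.2, size=M)
    beta[:, j] = b - b.mean()
alpha = rng.normal(0.5, 0.2, size=M)               # site-specific coefficients

# Block-diagonal response: each site contributes its own slice of y.
y = np.concatenate([Xs[m] @ beta[m] + Ds[m] * alpha[m]
                    + rng.normal(0, sigma, size=n_m) for m in range(M)])
```

Note that the centering makes each active group's coefficients sum to zero across sites, so heterogeneity is built in by construction.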

The simulation parameters are fixed at \(p = 100\), \(n_m = 50\), \(M = 4\), \(\rho = 0.9\), and \(\sigma = 4\). We adopt the same metrics used in [21], except for one w.r.t. the site-specific parameter \(\alpha\), to evaluate group selection and parameter estimation. They are:

  • True Positive Rate (TPR): \(\textrm{TP} /(\textrm{TP}+\textrm{FN})\), where \(\textrm{TP}:= |\hat{\mathcal {A}} \cap \mathcal {A}^{*}|\) and \(\textrm{FN}:=| \hat{\mathcal {I}} \cap \mathcal {A}^{*}|\).

  • False Positive Rate (FPR): \(\textrm{FP} /(\textrm{FP}+\textrm{TN})\), where \(\textrm{TN}:= |\hat{\mathcal {I}} \cap \mathcal {I}^{*}|\) and \(\textrm{FP}:= |\hat{\mathcal {A}} \cap \mathcal {I}^{*}|\).

  • Relative Estimation Error of the site-specific parameter \(\alpha\) (SReEE): SReEE \(= ||\hat{{\alpha }}-{\alpha }^{*}||_{2} /\left\| {\alpha }^{*}\right\| _{2}\). If the testing setting has \(\left\| {\alpha }^{*}\right\| _{2} = 0\), we instead use:

  • Estimation Error of the site-specific parameter \(\alpha\) (SEE): SEE \(= ||\hat{{\alpha }}-{\alpha }^{*}||_{2}\)
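These metrics are straightforward to compute from the estimated and true index sets; the helper below is our own sketch, with hypothetical sets and coefficient values.

```python
import numpy as np

def selection_metrics(A_hat, A_star, p):
    """TPR and FPR for group selection; groups are indexed 0..p-1."""
    A_hat, A_star = set(A_hat), set(A_star)
    I_hat, I_star = set(range(p)) - A_hat, set(range(p)) - A_star
    tp, fn = len(A_hat & A_star), len(I_hat & A_star)
    fp, tn = len(A_hat & I_star), len(I_hat & I_star)
    return tp / (tp + fn), fp / (fp + tn)

def sreee(alpha_hat, alpha_star):
    """Relative l2 estimation error of the site-specific parameter."""
    return np.linalg.norm(alpha_hat - alpha_star) / np.linalg.norm(alpha_star)

tpr, fpr = selection_metrics({1, 4, 7}, {1, 4, 9}, p=20)  # TPR = 2/3, FPR = 1/17
err = sreee(np.array([0.9, 1.1]), np.array([1.0, 1.0]))   # = 0.1
```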

The closer the TPR is to 1 and the FPR to 0, the better the method performs in feature selection. The closer the SReEE or SEE is to 0, the more accurate the site-specific parameter estimation. Based on the discussion in [21], the methods SGSplicing (Abess), GLasso-BIC (group Lasso using the Bayesian information criterion), GLasso-CV (group Lasso using 5-fold cross-validation), GMCP-BIC (group MCP using the Bayesian information criterion), GMCP-CV (group MCP using 5-fold cross-validation), and GOMP (group orthogonal matching pursuit) are used as baseline methods. Additionally, we include methods where \(L_{1}\) or \(L_{0}\) penalization on both the site-specific parameter and the nuisance parameter is applied at each site, and the parameters are estimated separately based on the local data. Finally, both the local version of Algorithm 2, i.e., the method applied to the local data (LSplicing), and the proposed method, i.e., the original multi-site version of Algorithm 2 (MSplicing), are included.

The methods are plotted in the figures in the order presented here. We set the number of repetitions to 500. If a “proper” box plot is not visible in the figures, it has shrunk into a bold line, indicating a small range of value fluctuations. We have four settings in total, each testing the methods' performance under different conditions: varying densities of \(\beta\), different correlations between \({D}^{(m)}\) and \({X}^{(m)}\), diverse properties of \(\alpha\) (heterogeneity, homogeneity, and nullity), and various supports for \(\alpha\) (exact and similar). Table 1 summarizes the simulation settings.

Table 1 Summary of simulation settings

Figures 1, 2, 3, and 4 show that, in all settings, our method remains superior in FPR and SReEE. A nearly zero FPR indicates that our approach almost never selects the wrong features, a property inherited from Abess; Abess also stays at almost zero FPR. In contrast, other methods, such as GOMP, tend to over-select predictors and climb to a high FPR, which is undesirable in feature selection as it leads to model misspecification. Another attribute our method shares with Abess is conservativeness: both algorithms tend to select fewer predictors to guarantee that they will not include wrong ones, especially when the noise-to-signal ratio is high. That explains why both methods are relatively inactive in TPR. Nevertheless, our method consistently outperforms Abess in TPR thanks to the correct use of prior information, except when we test \(\alpha ^*\) with zero coefficients (i.e., the given foreknowledge is wrong); in that scenario, our method is on the same TPR level as Abess. Apart from the case where the prior knowledge is incorrect, our method keeps the lowest \(\alpha\) estimation error among all the algorithms. In comparison, other approaches either exhibit a high \(\alpha\) error or exclude the parameter of interest, \(\alpha\), from the model altogether; SReEE (not SEE) \(= 1\) indicates that an algorithm has screened out the predictor we wish to retain, which impairs the model's interpretability. We also notice that, given correct prior information, among the three methods run separately on each local site, our local version, LSplicing, outperforms both \(L_{0}\) and \(L_{1}\) in every criterion (i.e., TPR, FPR, and SReEE) owing to its inclusion of the features deemed important. This superior performance justifies the need for conditional feature selection. Meanwhile, our original multi-site method surpasses its local counterpart, suggesting that employing the common support assumption can further enhance performance.

Because Abess significantly outperformed the other algorithms in [21], we focus on comparing the computational times of Abess and our method. The results in Table 2 suggest that the computational cost of the proposed method is slightly higher than, but comparable to, that of Abess. This observation aligns with our expectations: our algorithm is developed from Abess and inherits its polynomial computational complexity, while assembling the summary statistics takes some additional time, an inevitable trade-off for privacy protection. Note that we report times measured in R with parallel computing disabled, so that any difference in order of magnitude between the two methods is more visible than it would be in C.

Table 2 Comparison of computational times for MSplicing and abess
Fig. 1
figure 1

Test various densities of the nuisance parameters with methods Abess, GLasso-BIC, GLasso-CV, GMCP-BIC, GMCP-CV, GOMP, L0, L1, LSplicing, and MSplicing plotted from left to right in each setting

Fig. 2
figure 2

Test various correlations between X and D with methods Abess, GLasso-BIC, GLasso-CV, GMCP-BIC, GMCP-CV, GOMP, L0, L1, LSplicing, and MSplicing plotted from left to right in each setting

Fig. 3
figure 3

Test heterogeneity, homogeneity, and nullity of the site-specific parameter with methods Abess, GLasso-BIC, GLasso-CV, GMCP-BIC, GMCP-CV, GOMP, L0, L1, LSplicing, and MSplicing plotted from left to right in each setting

Fig. 4
figure 4

Test the exact and similar support of the site-specific parameter with methods Abess, GLasso-BIC, GLasso-CV, GMCP-BIC, GMCP-CV, GOMP, L0, L1, LSplicing, and MSplicing plotted from left to right in each setting

5 Real Data Applications

The motivation for the proposed method lies in multi-center studies. Here, we consider a single-center real data application in order to better demonstrate our method: having the whole dataset in hand allows us to compare the proposed method with methods free of DataSHIELD constraints, such as the group lasso. This is a commonly adopted illustration strategy in the integrative analysis literature [5]. In the appendix, we include an additional real data analysis on a multi-center energy dataset.

We test the aforementioned approaches on the AIDS Clinical Trials Group Study 175 (ACTG175) data. We designate the CD8 T cell count at 20±5 weeks (cd820) as the dependent variable and accordingly eliminate the other outcome variables, including the CD4 T cell count at baseline (cd40), at 20±5 weeks (cd420), and at 96±5 weeks (cd496); the indicator of a missing CD4 T cell count at 96±5 weeks (r); the CD8 T cell count at baseline (cd80); and the number of days until the first occurrence of (i) a decline in CD4 T cell count of at least 50, (ii) an event indicating progression to AIDS, or (iii) death (days).

We investigated the treatment effect of zidovudine-incorporated therapy. That is, instead of utilizing the treatment arm variable (arms), with values 0 = zidovudine, 1 = zidovudine and didanosine, 2 = zidovudine and zalcitabine, and 3 = didanosine, we employ the treatment indicator (treat), which assigns 0 to zidovudine only and 1 to the other therapies.

When selecting appropriate covariates, we excluded the patient ID number (pidnum) due to its lack of predictive power. Additionally, we removed the dummy variable for zidovudine use before treatment initiation (zprior), since it takes the constant value 1 across all observations. As our analysis does not involve survival analysis, the indicator variable for observing the event in days (cens) was also deleted. We retained the remaining 14 predictors, except for the variable race (race), because we intend to separate the data into two sites by ethnicity, with 0 representing white and 1 denoting non-white. In this way, we created a multi-site dataset that mimics data collected from hospitals in different geographical areas. The predictors selected by the compared methods are as follows.
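The preprocessing just described can be sketched as follows. The DataFrame here is a tiny synthetic stand-in (the real ACTG175 data ships with R's speff2trial package), mimicking only a few of the actual column names; values are made up.

```python
import pandas as pd

# Toy stand-in for ACTG175 with a handful of its column names.
df = pd.DataFrame({
    "pidnum": [1, 2, 3, 4], "cd820": [350, 410, 500, 280],
    "cd40": [0] * 4, "zprior": [1] * 4, "cens": [0] * 4,
    "race": [0, 1, 0, 1], "wtkg": [68.0, 80.5, 75.2, 90.1],
    "karnof": [90, 100, 80, 90], "treat": [0, 1, 1, 0],
})

outcome = df.pop("cd820")                 # dependent variable: CD8 count at 20±5 weeks
# Drop the ID, the constant dummy, the censoring indicator, and other outcomes.
df = df.drop(columns=["pidnum", "zprior", "cens", "cd40"])

# Split into two pseudo-sites by race (0 = white, 1 = non-white), then drop race.
sites = {r: g.drop(columns="race") for r, g in df.groupby("race")}
```

Each entry of `sites` then plays the role of one hospital's local dataset in the multi-site analysis.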

  • Our method: wtkg, karnof, treat

  • GLasso-BIC: wtkg, homo, gender

  • GLasso-CV: wtkg, hemo, homo, drugs, oprior, preanti, gender, strat, symptom

  • GMCP-BIC: No variables are selected

  • GMCP-CV: wtkg, hemo, drugs, oprior, preanti, gender, symptom

  • GOMP: age, wtkg, hemo, homo, drugs, karnof, oprior, preanti, gender, str2, strat, symptom, offtrt, treat

  • SGSplicing: wtkg, gender

Notably, except for GOMP, which selected almost all the predictors, our method stands out by specifically identifying “treat” as a significant predictor when we intend to perform conditional screening. Furthermore, our method identifies the baseline weight in kilograms (wtkg) as a key predictor; wtkg is selected by all the listed methods except GMCP-BIC, which did not select any variables at all. Another significant predictor chosen by our method is the Karnofsky score on a scale of 0-100 (karnof). It is a performance index that categorizes patients according to their functional impairment and disability: the lower the score, the smaller the chance the patient survives severe disease [43]. Our method estimates the coefficient of treatment as 18.5 for the white group but 40.4 for the non-white group. This discrepancy may indicate a heterogeneous treatment effect across ethnic groups. Meanwhile, the positive signs and large magnitudes may imply that zidovudine-incorporated therapy can universally and significantly improve the CD8 T cell count at 20±5 weeks, regardless of patients' ethnicity. The computational times for our method and Abess are 1.43 s and 0.83 s, respectively.

6 Summary

In this paper, we consider the problem of selecting a common set of active features (support) given data from multiple sites. In learning tasks across multiple data centers, there is a tendency to use the same features for data analysis for the convenience of merging analysis results or conducting meta-analyses, which motivates our study. To address this issue, we reformulate the common support selection problem as an \(L_{2,0}\) penalization problem. To overcome the well-known computational challenge of zero-norm penalization, we adopt a splicing-based algorithm with polynomial time complexity. Two improvements are made over the existing subset selection method: (i) our selection procedure is conditional on the site-specific parameters, which takes prior information into account; and (ii) our algorithm satisfies the data-sharing constraint, which avoids privacy leakage when transferring data across different sites. The simulation results support the superiority of the proposed method in terms of the error rate of variable selection and the estimation accuracy of the site-specific parameters. We also apply the proposed method to real data, including ACTG 175 and the electricity consumption of multi-site server rooms (see appendix), to show its practicality.

We focus on the common support assumption, and a natural question is whether it can be combined with a similar-parameter assumption in integrative analysis. To do so, we would need to add an appropriate fusion penalty to our objective function and investigate the theoretical properties of the newly defined objective. In addition, we mainly investigated a single-site study, split into pseudo-sites, to better illustrate the proposed method. It would certainly be of interest to apply the proposed method to establish robust statistical evidence in a real multi-center bioscience study, which we leave to future work.