Privacy-preserving and lossless distributed estimation of high-dimensional generalized additive mixed models

Various privacy-preserving frameworks that respect the individual's privacy in the analysis of data have been developed in recent years. However, available model classes such as simple statistics or generalized linear models lack the flexibility required for a good approximation of the underlying data-generating process in practice. In this paper, we propose an algorithm for a distributed, privacy-preserving, and lossless estimation of generalized additive mixed models (GAMM) using component-wise gradient boosting (CWB). Making use of CWB allows us to reframe the GAMM estimation as a distributed fitting of base learners using the L_2-loss. In order to account for the heterogeneity of different data location sites, we propose a distributed version of a row-wise tensor product that allows the computation of site-specific (smooth) effects. Our adaptation of CWB preserves all the important properties of the original algorithm, such as an unbiased feature selection and the feasibility to fit models in high-dimensional feature spaces, and yields model estimates equivalent to those of CWB on pooled data. Next to a derivation of the equivalence of both algorithms, we also showcase the efficacy of our algorithm on a distributed heart disease data set and compare it with state-of-the-art methods.


Introduction
More than ever, data is collected to record the ubiquitous information in our everyday life. However, on many occasions, the physical location of data points is not confined to one place (one global site) but distributed over different locations (sites). This is the case for, e.g., patient records that are gathered at different hospitals but usually not shared between hospitals or other facilities due to the sensitive information they contain. This makes data analysis challenging, particularly if methods require or notably benefit from incorporating all available (but distributed) information. For example, personal patient information is typically distributed over several hospitals, while sharing or merging different data sets in a central location is prohibited. To overcome this limitation, different approaches have been developed that operate directly at the different sites and unite information without sharing sensitive parts of the data, thereby allowing a privacy-preserving data analysis.

Distributed Data
Distributed data can be partitioned vertically or horizontally across different sites. Horizontally partitioned data means that observations are spread across different sites, each with access to all existing features of its available data points, while for vertically partitioned data, different sites have access to all observations but to different features (covariates) of each of these observations. In this work, we focus on horizontally partitioned data. Existing approaches for horizontally partitioned data vary from fitting regression models such as generalized linear models (GLMs; Wu et al, 2012; Lu et al, 2015; Jones et al, 2013; Chen et al, 2018), to conducting distributed evaluations (Boyd et al, 2015; Ünal et al, 2021; Schalk et al, 2022b), to fitting artificial neural networks (McMahan et al, 2017). Furthermore, various software frameworks are available to run a comprehensive analysis of distributed data. One example is the collection of R (R Core Team, 2021) packages DataSHIELD (Gaye et al, 2014), which enables data management and descriptive data analysis as well as the secure fitting of simple statistical models in a distributed setup without leaking information from one site to the others.

Interpretability and Data Heterogeneity
In many research areas that involve critical decision-making, especially in medicine, methods should not only excel in predictive performance but also be interpretable. Models should provide information about the decision-making process, the feature effects, and the feature importance, as well as intrinsically select important features. Generalized additive models (GAMs; see, e.g., Wood, 2017) are one of the most flexible approaches in this respect, providing interpretable yet complex models that also allow for non-linearity in the data.
As longitudinal studies are often the most practical way to gather information in many research fields, methods should also be able to model subject-specific effects and account for the correlation of repeated measurements. Furthermore, when analyzing data originating from different sites, the assumption of having identically distributed observations across all sites often does not hold. In this case, a reasonable assumption for the data-generating process is a site-specific deviation from the general population mean. Adjusting models to this situation is called interoperability (Litwin et al, 1990), while ignoring it may lead to biased or wrong predictions.

Related Literature
Various approaches for distributed and privacy-preserving analysis have been proposed in recent years. In the context of statistical models, Karr et al (2005) describe how to calculate a linear model (LM) in a distributed and privacy-preserving fashion by sharing data summaries. Jones et al (2013) propose a similar approach for GLMs by communicating the Fisher information and score vector to conduct a distributed Fisher scoring algorithm. The site information is then globally aggregated to estimate the model parameters. Other privacy-preserving techniques include ridge regression (Chen et al, 2018), logistic regression, and neural networks (Mohassel and Zhang, 2017).
In machine learning, methods such as the naive Bayes classifier, trees, support vector machines, and random forests (Li et al, 2020a) exist with specific encryption techniques (e.g., the Paillier cryptosystem; Paillier, 1999) to conduct model updates. In these setups, a trusted third party is usually required. However, this is often unrealistic and difficult to implement, especially in a medical or clinical setup. Furthermore, as encryption is an expensive operation, its application is infeasible for complex algorithms that require many encryption calls (Naehrig et al, 2011). Existing privacy-preserving boosting techniques often focus on the AdaBoost algorithm by using aggregation techniques of the base classifier (Lazarevic and Obradovic, 2001; Gambs et al, 2007). A different approach to boosting decision trees in a federated learning setup was introduced by Li et al (2020b), using locality-sensitive hashing to obtain similarities between data sets without sharing private information. These algorithms focus on aggregating tree-based base components, making them difficult to interpret, and come with no inferential guarantees.
In order to account for repeated measurements, Luo et al (2022) propose a privacy-preserving and lossless way to fit linear mixed models (LMMs) to correct for heterogeneous site-specific random effects. Their concept of only sharing aggregated values is similar to our approach but is limited in the complexity of the model and only allows normally distributed outcomes. Other methods to estimate LMMs in a secure and distributed fashion are given by Zhu et al (2020), Anjum et al (2022), and Yan et al (2022).
Besides privacy-preserving and distributed approaches, integrative analysis is another technique, based on pooling the data sets into one and analyzing this pooled data set while considering challenges such as heterogeneity or the curse of dimensionality (Curran and Hussong, 2009; Bazeley, 2012; Mirza et al, 2019). While technically advanced by, e.g., outsourcing computationally demanding tasks such as the analysis of multi-omics data to cloud services (Augustyn et al, 2021), the existing statistical cloud-based methods only deal with basic statistics. The challenges of integrative analysis are similar to the ones tackled in this work; our approach, however, does not merge the data sets, in order to preserve privacy.

Our Contribution
This work presents a method to fit generalized additive mixed models (GAMMs) in a privacy-preserving and lossless manner to horizontally distributed data. This not only allows the incorporation of site-specific random effects and accounts for repeated measurements as in LMMs, but also facilitates the estimation of mixed models with responses following any distribution from the exponential family and provides the possibility to estimate complex non-linear relationships between covariates and the response. To the best of our knowledge, we are the first to provide an algorithm to fit the class of GAMMs in a privacy-preserving and lossless fashion on distributed data.
Our approach is based on component-wise gradient boosting (CWB; Bühlmann and Yu, 2003). CWB can be used to estimate additive models, account for repeated measurements, compute feature importance, and conduct feature selection. Furthermore, CWB is suited for high-dimensional data situations (n ≪ p). CWB is therefore often used in practice for, e.g., predicting the development of oral cancer (Saintigny et al, 2011), classifying individuals with and without patellofemoral pain syndrome (Liew et al, 2020), or detecting synchronization in bioelectrical signals (Rügamer et al, 2018). However, there have so far not been any attempts to allow for a distributed, privacy-preserving, and lossless computation of the CWB algorithm. In this paper, we propose a distributed version of CWB that yields the identical model produced by the original algorithm on pooled data and that accounts for site heterogeneity by including interactions between features and a site variable. This is achieved by adjusting the fitting process using 1) a distributed estimation procedure, 2) a distributed version of row-wise tensor product base learners, and 3) an adaption of the algorithm to conduct feature selection in the distributed setup. We implement our method in R using the DataSHIELD framework and demonstrate its application in an exemplary medical data analysis. Our distributed version of the original CWB algorithm does not have any additional hyperparameters (HPs) and uses optimization strategies from previous research results to define meaningful values for all HPs, effectively yielding a tuning-free method.
The remainder of this paper is structured as follows: First, we introduce the basic notation, terminology, and setup of GAMMs in Section 2. We then describe the original CWB algorithm in Section 2.3 and its link to GAMMs. In Section 3, we present the distributed setup and our novel extension of the CWB algorithm. Finally, Section 4 demonstrates both how our distributed CWB algorithm can be used in practice and how to interpret the obtained results.

Implementation
We implement our approach as an R package using the DataSHIELD framework and make it available on GitHub. The code for the analysis can also be found in the repository.

Notation and Terminology
Our proposed approach uses the CWB algorithm as its fitting engine. Since this method was initially developed in machine learning, we introduce here both the statistical notation used for GAMMs and the respective machine learning terminology, and explain how to relate the two concepts.
We assume a p-dimensional covariate or feature space X = (X_1 × . . . × X_p) ⊆ R^p and response or outcome values from a target space Y. The goal of boosting is to find the unknown relationship f between X and Y. In turn, GAMMs (as presented in Section 2.2) model the conditional distribution of an outcome variable Y with realizations y ∈ Y, given features x = (x_1, . . ., x_p) ∈ X. Given a data set D = {(x^(1), y^(1)), . . ., (x^(n), y^(n))} with n observations drawn (conditionally) independently from an unknown probability distribution P_xy on the joint space X × Y, we aim to estimate this functional relationship in CWB with f̂. The goodness-of-fit of a given model f̂ is assessed by calculating the empirical risk R_emp(f̂) = n^(-1) Σ_((x,y)∈D) L(y, f̂(x)) based on a loss function L : Y × R → R and the data set D. Minimizing R_emp using this loss function is equivalent to estimating f using maximum likelihood by defining L(y, f(x)) = −ℓ(y, h(f(x))) with log-likelihood ℓ and response function h, and minimizing the sum of the log-likelihood contributions.
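As a small numeric illustration of this equivalence (a Python sketch, not the paper's R implementation): for the L_2-loss, the risk-optimal constant model coincides with the Gaussian maximum likelihood estimate, i.e., the sample mean.

```python
import numpy as np

rng = np.random.default_rng(0)
y = rng.normal(loc=1.0, size=50)  # toy responses

def emp_risk(c, y):
    """Empirical risk R_emp(c) = n^-1 * sum over D of L(y, c) with the L2-loss."""
    return np.mean(0.5 * (y - c) ** 2)

# Minimize R_emp over constant models c on a fine grid ...
grid = np.linspace(y.min(), y.max(), 2001)
c_hat = grid[np.argmin([emp_risk(c, y) for c in grid])]

# ... the minimizer agrees (up to grid resolution) with the Gaussian MLE, the sample mean.
```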
In the following, we also require the vector x_j = (x_j^(1), . . ., x_j^(n))^T ∈ X_j, which refers to the j-th feature. Furthermore, let x = (x_1, . . ., x_p) and y denote arbitrary members of X and Y, respectively. A special role is further given to a subset u = (u_1, . . ., u_q), q ≤ p, of the features x, which will be used to model the heterogeneity in the data.

Generalized Additive Mixed Models
GAMMs (see, e.g., Wood, 2017) are a very flexible class of regression models for describing the relationship between covariates and the response.
In GAMMs, the response Y^(i) for observation i = 1, . . ., n_s of measurement unit (or site) s is assumed to follow some exponential family distribution such as the Poisson, binomial, or normal distributions (see, e.g., McCullagh and Nelder, 2019), conditional on the features x^(i) and the realization of some random effects. The expectation µ := E(Y^(i) | x^(i), u^(i)) of the response Y^(i) for observations i = 1, . . ., n_s of measurement unit (or site) s in GAMMs is given by

µ = h(f(x^(i))) = h( Σ_(j∈J_1) x_j^(i) β_j + Σ_(j∈J_2) u_j^(i) γ_(j,s) + Σ_(j∈J_3) φ_j(x_j^(i)) ).   (1)

In (1), h is a smooth monotonic response function, f corresponds to the additive predictor, γ_(j,s) ∼ N(0, ψ) are random effects accounting for heterogeneity in the data, and φ_j are non-linear effects of pre-specified covariates. The different index sets J_1, J_2, J_3 ⊆ {1, . . ., p} ∪ ∅ indicate which features are modeled as fixed effects, random effects, or non-linear (smooth) effects, respectively. The modeler usually defines these sets. However, as we will also explain later, the use of CWB as a fitting engine allows for automatic feature selection and therefore does not require explicitly defining these sets. In GAMMs, smooth effects are usually represented by (spline) basis functions, i.e., φ_j(x_j) ≈ (B_(j,1)(x_j), . . ., B_(j,d_j)(x_j)) θ_j, where θ_j ∈ R^(d_j) are the basis coefficients corresponding to the basis functions B_(j,1), . . ., B_(j,d_j). The coefficients are typically constrained in their flexibility by adding a quadratic (difference) penalty for (neighboring) coefficients to the objective function to enforce smoothness. GAMMs, as in (1), are not limited to univariate smooth effects φ_j, but allow for higher-dimensional non-linear effects φ(x_(j_1), x_(j_2), . . ., x_(j_k)).
The most common higher-dimensional smooth interaction effects are bivariate effects (k = 2), which can be represented using a bivariate or a tensor product spline basis (see Section 2.3.1 for more details). Although higher-order splines with k > 2 are possible, models are often restricted to bivariate interactions for the sake of interpretability and computational feasibility. In Section 3, we will further introduce varying coefficient terms φ_(j,s)(x_j) in the model (1), i.e., smooth effects varying with a second variable s. Analogous to random slopes, s can also be the index set defining the observation units of random effects, J_2. Using an appropriate distribution assumption for the basis coefficients θ_j, these varying coefficients can then be considered as random smooth effects.

Component-Wise Boosting
Component-wise (gradient) boosting (CWB; Bühlmann and Yu, 2003; Bühlmann et al, 2007) is an iterative algorithm that performs block-coordinate descent steps with blocks (or base learners) corresponding to the additive terms in (1). With a suitable choice of base learners and objective function, CWB allows efficient optimization of GAMMs, even in high-dimensional settings with p ≫ n. We will first introduce the concept of base learners that embed additive terms of the GAMM into boosting and subsequently describe the actual fitting routine of CWB. Lastly, we will describe the properties of the algorithm and explain its connection to model (1).

Base Learners
In CWB, the l-th base learner b_l : X → R is used to model the contribution of one or multiple features to the model. In this work, we investigate parametrized base learners b_l(x, θ_l) with parameters θ_l ∈ R^(d_l). For simplicity, we will use θ as a wildcard for the coefficients of either fixed effects, random effects, or spline bases in the following. We assume that each base learner can be represented by a generic basis representation g_l : X → R^(d_l), x ↦ g_l(x) = (g_(l,1)(x), . . ., g_(l,d_l)(x))^T, and is linear in the parameters, i.e., b_l(x, θ_l) = g_l(x)^T θ_l. For n observations, we define the design matrix of a base learner b_l as Z_l := (g_l(x^(1)), . . ., g_l(x^(n)))^T ∈ R^(n×d_l). Note that base learners are typically not defined on the whole feature space but on a subset X_l ⊆ X. For example, a common choice for CWB is to define one base learner for every feature x_l ∈ X_l to model the univariate contributions of that feature.
A base learner b_l(x, θ_l) can depend on HPs α_l that are set prior to the fitting process. For example, choosing a base learner using a P-spline (Eilers and Marx, 1996) representation requires setting the degree of the basis functions, the order of the difference penalty term, and a parameter λ_l determining the smoothness of the spline. In order to represent GAMMs in CWB, the following four base learner types are used.

(Regularized) linear base learners
A linear base learner is used to include linear effects of features x_(j_1), . . ., x_(j_(d_l)) in the model. The basis transformation is given by g_l(x) = (g_(l,1)(x), . . ., g_(l,d_l+1)(x))^T = (1, x_(j_1), . . ., x_(j_(d_l)))^T. Linear base learners can be regularized by incorporating a ridge penalization (Hoerl and Kennard, 1970) with a tunable penalty parameter λ_l as an HP α_l. Fitting a ridge-penalized linear base learner to a response vector y ∈ R^n results in the penalized least squares estimator θ̂_l = (Z_l^T Z_l + λ_l I_(d_l+1))^(-1) Z_l^T y, where I_d denotes the d-dimensional identity matrix. Often, an unregularized linear base learner is also included to model the contribution of a single feature x_j without penalization. The basis transformation is then given by g_l(x) = (1, x_j)^T and λ_l = 0.
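A minimal numeric sketch of this estimator (Python with NumPy; the paper's implementation is in R) for the basis g_l(x) = (1, x_j)^T, comparing the unregularized and the ridge-penalized fit:

```python
import numpy as np

rng = np.random.default_rng(1)
n = 200
x = rng.normal(size=n)
y = 1.5 + 2.0 * x + rng.normal(scale=0.1, size=n)

# Design matrix Z_l for the basis transformation g_l(x) = (1, x_j)^T.
Z = np.column_stack([np.ones(n), x])

def fit_linear_bl(Z, y, lam=0.0):
    """Penalized least squares estimator (Z^T Z + lam * I)^{-1} Z^T y."""
    return np.linalg.solve(Z.T @ Z + lam * np.eye(Z.shape[1]), Z.T @ y)

theta_unpen = fit_linear_bl(Z, y)            # unregularized linear base learner
theta_ridge = fit_linear_bl(Z, y, lam=50.0)  # ridge penalization shrinks the estimate
```

The ridge slope is visibly shrunk towards zero relative to the unpenalized slope, which is the behavior the degrees-of-freedom calibration in Section 2.3.2 exploits.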

Spline base learners
These base learners model smooth effects using univariate splines. A common choice is penalized B-splines (P-splines; Eilers and Marx, 1996), where the feature x_j is transformed using a B-spline basis transformation g_l(x) = (B_(l,1)(x_j), . . ., B_(l,d_l)(x_j))^T with d_l basis functions g_(l,m) = B_(l,m), m = 1, . . ., d_l. In this case, the order of the B-spline basis, the number of basis functions d_l, the penalization term λ_l, and the order v of the difference penalty (represented by a matrix D_l ∈ R^((d_l−v)×d_l)) are considered HPs α_l of the base learner. The base learner's parameter estimator in general is given by the penalized least squares solution θ̂_l = (Z_l^T Z_l + K_l)^(-1) Z_l^T y, with penalization matrix K_l = λ_l D_l^T D_l in the case of P-splines.
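The structure of the difference penalty can be sketched as follows (Python; an illustrative construction, not the paper's code). The order-v difference matrix D_l leaves polynomial coefficient sequences up to degree v−1 unpenalized, which is what makes K_l = λ_l D_l^T D_l a smoothness penalty:

```python
import numpy as np

def diff_matrix(d, v=2):
    """Order-v difference matrix D_l of shape (d - v, d)."""
    D = np.eye(d)
    for _ in range(v):
        D = np.diff(D, axis=0)  # each pass takes first differences of the rows
    return D

d, lam = 8, 0.5
D = diff_matrix(d, v=2)
K = lam * D.T @ D  # P-spline penalization matrix K_l = lambda_l * D_l^T D_l

# A linear coefficient sequence incurs zero penalty under second-order differences,
# so the penalty shrinks fits toward smooth (here: linear) coefficient patterns.
theta_linear = np.arange(d, dtype=float)
penalty = theta_linear @ K @ theta_linear
```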

Categorical and random effect base learners
Categorical features x_j ∈ {1, . . ., G} with G classes can be encoded using indicator functions. A possible encoding is the dummy encoding g_l(x) = (1, 1_{1}(x_j), . . ., 1_{G−1}(x_j))^T with reference group G. Similar to linear and spline base learners, it is possible to incorporate a ridge penalization with HP α_l = λ_l. This results in the base learner's penalized least squares estimator θ̂_l = (Z_l^T Z_l + K_l)^(-1) Z_l^T y with penalization matrix K_l = λ_l I_G. Due to the mathematical equivalence of ridge-penalized linear effects and random effects with a normal prior (see, e.g., Brumback et al, 1999), this base learner can further be used to estimate random effect predictions γ̂_j when using categorical features u_j, and thereby account for heterogeneity in the data.
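A small Python sketch (illustrative, using a one-column-per-class indicator encoding as in the random effect base learner with K_l = λ_l I_G) makes the ridge/random-effect connection concrete: the penalized estimates are exactly the per-class means shrunk toward zero, mimicking random effect predictions.

```python
import numpy as np

rng = np.random.default_rng(2)
G, n = 4, 200
groups = rng.integers(0, G, size=n)            # categorical feature with G classes
true_gamma = np.array([0.5, -0.3, 0.0, 0.8])
y = true_gamma[groups] + rng.normal(scale=0.5, size=n)

Z = np.eye(G)[groups]                          # indicator design matrix Z_l
lam = 3.0                                      # ridge penalty, K_l = lam * I_G
gamma_hat = np.linalg.solve(Z.T @ Z + lam * np.eye(G), Z.T @ y)

# Closed form: each class mean is shrunk by n_g / (n_g + lam),
# exactly the shrinkage pattern of random effect predictions.
counts = np.bincount(groups, minlength=G)
group_means = np.array([y[groups == g].mean() for g in range(G)])
```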

Row-wise tensor product base learners
This type of base learner is used to model a pairwise interaction between two features x_j and x_k. Given two base learners b_j and b_k with basis representations g_j(x) = (g_(j,1)(x_j), . . ., g_(j,d_j)(x_j))^T and g_k(x) = (g_(k,1)(x_k), . . ., g_(k,d_k)(x_k))^T, the row-wise tensor product base learner b_l = b_j × b_k is defined via the basis g_l(x) = g_j(x) ⊗ g_k(x) ∈ R^(d_l) with d_l = d_j d_k. The HPs α_l = {α_j, α_k} of a row-wise tensor product base learner are induced by the HPs α_j and α_k of the respective individual base learners. Analogously to the other base learners, the penalized least squares estimator in this case is θ̂_l = (Z_l^T Z_l + K_l)^(-1) Z_l^T y with the Kronecker sum penalty K_l = τ_j K_j ⊗ I_(d_k) + I_(d_j) ⊗ τ_k K_k. This Kronecker sum penalty, in particular, allows for anisotropic smoothing with penalties τ_j and τ_k when using two spline bases for g_j and g_k, and for varying coefficients or random splines when combining a (penalized) categorical base learner and a spline base learner.
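The row-wise tensor product and its Kronecker sum penalty can be sketched as follows (Python; identity matrices stand in for the marginal penalty matrices K_j and K_k):

```python
import numpy as np

def row_tensor(Zj, Zk):
    """Row-wise tensor product: the i-th row is kron(Zj[i, :], Zk[i, :])."""
    n = Zj.shape[0]
    return np.einsum("ij,ik->ijk", Zj, Zk).reshape(n, -1)

rng = np.random.default_rng(3)
n, dj, dk = 6, 3, 2
Zj = rng.normal(size=(n, dj))
Zk = rng.normal(size=(n, dk))
Z = row_tensor(Zj, Zk)  # design matrix of the interaction, shape (n, dj * dk)

# Kronecker sum penalty for anisotropic smoothing with strengths tau_j and tau_k
# (np.eye stands in for the marginal penalties K_j and K_k).
Kj, Kk = np.eye(dj), np.eye(dk)
tau_j, tau_k = 1.0, 2.0
K = tau_j * np.kron(Kj, np.eye(dk)) + tau_k * np.kron(np.eye(dj), Kk)
```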

Fitting Algorithm
CWB first initializes an estimate f̂ of the additive predictor with a loss-optimal constant value f̂^[0] = arg min_(c∈R) R_emp(c). It then proceeds and estimates Eq. (1) using an iterative steepest descent minimization in function space by fitting the previously defined base learners to the model's functional gradient ∇_f L(y, f) evaluated at the current model estimate f̂. Let f̂^[m] denote the model estimate after m ∈ N iterations. In each step of CWB, the pseudo residuals r^[m](i) = −∇_f L(y^(i), f(x^(i)))|_(f = f̂^[m−1]), i ∈ {1, . . ., n}, are first computed. CWB then selects the best-fitting base learner from a pre-defined pool of base learners denoted by B = {b_l}_(l∈{1,...,|B|}) and adds the base learner's contribution to the previous model f̂^[m−1]. The selected base learner is chosen based on its sum of squared errors (SSE) when regressing the pseudo residuals r^[m] = (r^[m](1), . . ., r^[m](n))^T onto the base learner's features using the L_2-loss. Further details of CWB are given in Algorithm 1 (see, e.g., Schalk et al, 2022a).

Controlling HPs of CWB
Good estimation performance can be achieved by selecting a sufficiently small learning rate, e.g., 0.01, as suggested in Bühlmann et al (2007), and adaptively selecting the number of boosting iterations via early stopping on a validation set. To enforce a fair selection of model terms and thus unbiased effect estimation, the regularization parameters are set such that all base learners have the same degrees-of-freedom (Hofner et al, 2011). As noted by Bühlmann et al (2007), choosing smaller degrees-of-freedom induces more penalization (and thus, e.g., smoother estimated functions for spline base learners), which yields a model with lower variance at the cost of a larger bias. This bias induces a shrinkage of the estimated coefficients towards zero but can be reduced by running the optimization process for additional iterations.

Algorithm 1 Component-Wise Boosting (CWB). Output: the fitted model f̂ = f̂^[M] after M iterations.
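The fitting routine can be sketched numerically. The following Python snippet (a simplified stand-in for Algorithm 1, using the L_2-loss and two unpenalized linear base learners; the paper's implementation is in R) illustrates the loop structure: initialization with the loss-optimal constant, pseudo residual computation, SSE-based base learner selection, and the damped additive update.

```python
import numpy as np

rng = np.random.default_rng(4)
n = 200
x1, x2 = rng.normal(size=n), rng.normal(size=n)
y = 2.0 * x1 + rng.normal(scale=0.1, size=n)  # only x1 is informative

# One unpenalized linear base learner per feature: Z_l = (1, x_l).
Zs = [np.column_stack([np.ones(n), x1]), np.column_stack([np.ones(n), x2])]

nu, M = 0.1, 100                      # learning rate and number of iterations
f_hat = np.full(n, y.mean())          # f^[0]: loss-optimal constant under the L2-loss
selected = []
for m in range(M):
    r = y - f_hat                     # pseudo residuals of the L2-loss
    fits = [Z @ np.linalg.solve(Z.T @ Z, Z.T @ r) for Z in Zs]
    sses = [float(np.sum((r - fit) ** 2)) for fit in fits]
    best = int(np.argmin(sses))       # base learner with the smallest SSE
    selected.append(best)
    f_hat = f_hat + nu * fits[best]   # damped additive model update
```

Because selection is SSE-driven, the informative feature dominates the selection path, which is the mechanism behind CWB's intrinsic feature selection.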

Properties and Link to Generalized Additive Mixed Models
The estimated coefficients θ̂ resulting from running the CWB algorithm are known to converge to the maximum likelihood solution (see, e.g., Schmid and Hothorn, 2008) for M → ∞ under certain conditions. This is due to the fact that CWB performs a coordinate gradient descent update of a model defined by its additive base learners, which exactly represent the structure of an additive mixed model (when defining the base learners according to Section 2.3.1), and by an objective function that corresponds to the negative (penalized) log-likelihood. Two important properties of this algorithm are 1) its coordinate-wise update routine and 2) the nature of model updates using the L_2-loss. Due to the first property, CWB can be used in settings with p ≫ n, as only a single additive term is fitted onto the pseudo residuals in every iteration. This not only reduces the computational complexity of the algorithm for an increasing number of additive predictors (linear instead of quadratic) but also allows variable selection when stopping the routine early (e.g., based on a validation data set), as not all additive components might have been selected into the model. In particular, this allows users to specify the full GAMM model without manually specifying the type of each feature effect (fixed or random, linear or non-linear) and then automatically sparsify this model by an objective and data-driven feature selection. The second property allows fitting models of the class of generalized linear/additive (mixed) models using only the L_2-loss instead of having to work with an iterative weighted least squares routine. In particular, this enables the lossless distributed computations proposed in this paper, as we will discuss in Section 3.

Distributed Computing Setup and Privacy Protection
Before presenting our main results, we now introduce the distributed data setup we will work with throughout the remainder of this paper. The data set D is horizontally partitioned into S data sets D_s = {(x_s^(1), y_s^(1)), . . ., (x_s^(n_s), y_s^(n_s))}, s = 1, . . ., S, with n_s observations located at site s and n = Σ_(s=1)^S n_s. In this distributed setup, multiple ways exist to communicate information without revealing individual information. More complex methods such as differential privacy (Dwork, 2006), homomorphic encryption (e.g., the Paillier cryptosystem; Paillier, 1999), or k-anonymity (Samarati and Sweeney, 1998) allow sharing information without violating an individual's privacy. An alternative option is to only communicate aggregated statistics. This is one of the most common approaches and is also used by DataSHIELD (Gaye et al, 2014) for GLMs or by Luo et al (2022) for LMMs. DataSHIELD, for example, uses a privacy level that indicates how many individual values must be aggregated to allow the communication of aggregated values. For example, setting the privacy level to a value of 5 enables the sharing of summary statistics such as sums, means, variances, etc., only if these are computed on at least 5 elements (observations).
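The privacy-level rule can be sketched as a simple guard (Python; illustrative only, not the DataSHIELD API):

```python
import numpy as np

PRIVACY_LEVEL = 5  # minimum number of observations an aggregate must be based on

def share_aggregate(values, privacy_level=PRIVACY_LEVEL):
    """Release summary statistics only if they aggregate enough observations."""
    values = np.asarray(values, dtype=float)
    if values.size < privacy_level:
        raise ValueError("aggregate based on fewer observations than the privacy level")
    return {"n": int(values.size), "sum": float(values.sum()), "mean": float(values.mean())}

ok = share_aggregate([1, 2, 3, 4, 5, 6])  # allowed: 6 >= 5 observations
# share_aggregate([1, 2]) would raise, since 2 < 5 observations
```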

Host and Site Setup
Throughout this article, we assume that each of the s = 1, . . ., S sites or servers has access to its respective data set D_s. Each server is allowed to communicate with a host server that is also the analyst's machine. In this setting, the analyst can potentially see intermediate data used when running the algorithms, and hence each message communicated from the servers to the host must not allow any reconstruction of the original data. The host server is responsible for aggregating intermediate results and communicating these results back to the servers.

Distributed Component-Wise Boosting
We now present our distributed version of the CWB algorithm to fit privacy-preserving and lossless GAMMs. In the following, we first describe further specifications of our setup in Section 3.1, elaborate on the changes made to the set of base learners in Section 3.2, and then show how to adapt CWB's fitting routine in Section 3.3.

Setup
In the following, we distinguish between site-specific and shared effects. As effects estimated across sites typically correspond to fixed effects, and effects modeled for each site separately are usually represented using random effects, we use these terms as synonyms in the following, i.e., shared effects and fixed effects are treated interchangeably, and the same holds for site-specific effects and random effects. We note that this is only for ease of presentation, and our approach also allows for site-specific fixed effects and shared random effects. As the data is not only located at different sites but also potentially follows a different data distribution P_(xy,s) at each site s, we extend Eq. (1) to include not only random effects per site but also site-specific smooth (random) effects φ_(j,s)(x_j), s = 1, . . ., S, for all features x_j with j ∈ J_3. For each of these smooth effects φ_(j,s), we assume an existing shared effect f_(j,shared) that is equal for all sites. These assumptions, particularly the choice of site-specific effects, are made for demonstration purposes. In a real-world application, the model structure can be defined individually to match the given data situation. However, note again that CWB intrinsically performs variable selection, and there is thus no need to manually define the model structure in practice.
In order to incorporate the site information into the model, we add a variable x_0^(i) ∈ {1, . . ., S} indicating the site to the data by setting x^(i) = (x_0^(i), x_1^(i), . . ., x_p^(i)). The site variable is a categorical feature with S classes.

Base Learners
For shared effects, we keep the original structure of CWB with base learners chosen from a set of possible learners B. Section 3.3.1 explains how these shared effects are estimated in the distributed setup. We further define a regularized categorical base learner b_0 with basis transformation g_0(x_0) = (1_{1}(x_0), . . ., 1_{S}(x_0))^T and design matrix Z_0 ∈ R^(n×S). We use b_0 to extend B with a second set of base learners B_× = {b_0 × b | b ∈ B} to model site-specific random effects. All base learners in B_× are row-wise tensor product base learners b_(l×) = b_0 × b_l of the regularized categorical base learner b_0, dummy-encoding every site, and all other existing base learners b_l ∈ B. This allows for the potential inclusion of a random effect for every fixed effect in the model. More specifically, the l-th site-specific effect given by the row-wise tensor product base learner b_(l×) uses the basis transformation g_(l×) = g_0 ⊗ g_l, where the basis transformation g_l is equal for all S sites. After distributed computation (see Eq. (4) in the next section), the estimated coefficients are θ̂_(l×) = (θ̂_(l×,1)^T, . . ., θ̂_(l×,S)^T)^T with θ̂_(l×,s) ∈ R^(d_l). The regularization of the row-wise Kronecker base learners not only controls their flexibility but also ensures identifiability when additionally including a shared (fixed) effect for the same covariate. The penalty matrix K_(l×) is given as the Kronecker sum of the penalty matrices K_0 and K_l with respective regularization strengths λ_0 and λ_(l×), i.e., K_(l×) = K_0 ⊗ I_(d_l) + I_S ⊗ λ_(l×) K_l. As K_0 = λ_0 I_S is a diagonal matrix, K_(l×) is a block matrix with entries λ_0 I_(d_l) + λ_(l×) K_l on the diagonal blocks. Moreover, as g_0 is a binary vector, we can also express the design matrix Z_(l×) ∈ R^(n×S d_l) as a block matrix, yielding

Z_(l×) = diag(Z_(l,1), . . ., Z_(l,S)),   (3)

where Z_(l,s) are the distributed design matrices of b_l on sites s = 1, . . ., S.

Fitting Algorithm
We now describe the adaptions required to allow for distributed computation of the CWB fitting routine. In Sections 3.3.1 and 3.3.2, we show the equality between our distributed fitting approach and CWB fitted on pooled data. Section 3.3.3 describes the remaining details such as distributed SSE calculations, distributed model updates, and pseudo residual updates in the distributed setup.
Section 3.4 summarizes the distributed CWB algorithm and Section 3.5 elaborates on the communication costs of our algorithm.

Distributed Shared Effects Computation
Fitting CWB in a distributed fashion requires adapting the fitting process of the base learners b_l in Algorithm 1 to distributed data. To allow for shared effects computations across different sites without jeopardizing privacy, we take advantage of CWB's update scheme, which boils down to a (penalized) least squares estimation per iteration for every base learner. This allows us to build upon existing work such as Karr et al (2005) to fit linear models in a distributed fashion by communicating only aggregated statistics between sites and the host.
In a first step, the aggregated matrices F_(l,s) = Z_(l,s)^T Z_(l,s) and vectors u_(l,s) = Z_(l,s)^T y_s are computed on each site. In our privacy setup (Section 2.4), communicating F_(l,s) and u_(l,s) is allowed as long as the privacy-aggregation level per site is met. In a second step, the site information is aggregated to the global information F_l = Σ_(s=1)^S F_(l,s) + K_l and u_l = Σ_(s=1)^S u_(l,s), which is then used to estimate the model parameters θ̂_l = F_l^(-1) u_l. This approach, referred to as distFit, is explained in detail in Algorithm 2 and is used for the shared effect computations of the model by substituting the per-base-learner least squares fit in Algorithm 1 with its distributed counterpart.

Note that the pseudo residuals r_s^[m] are also securely located at each site and are updated after each iteration. Details about the distributed pseudo residual updates are explained in Section 3.3.3. We also note that the computational complexity of fitting CWB can be drastically reduced by pre-calculating and storing (Z_l^T Z_l + K_l)^(-1) in an initialization step, as this matrix is independent of the iteration m, and by reusing these pre-calculated matrices in all subsequent iterations (cf. Schalk et al, 2022a). Using pre-calculated matrices also reduces the amount of required communication between sites and host.
Algorithm 2 Distributed Effect Estimation.

The line prefixes [S] and [H] indicate whether the operation is conducted at the sites ([S]) or at the host ([H]).
Input: site design matrices $Z_{l,1}, \ldots, Z_{l,S}$, response vectors $y_1, \ldots, y_S$, and an optional penalty matrix $K_l$. Output: estimated parameter vector $\hat{\theta}_l$.

Distributed Site-specific Effects Computation
If we pretend that the fitting of the base learner $b_{l\times}$ is performed on the pooled data, we obtain a penalized least squares system whose structure (4) is block-diagonal with one block per site, as described in (3) of Section 3.2. This shows that the fitting of the site-specific effects $\hat{\theta}_{l\times}$ can be split up into independent fits of the individual site parameters $\hat{\theta}_{l\times,s}$. It is thus possible to compute site-specific effects at the respective site without the need to share any information with the host. The host, in turn, only requires the SSE of the respective base learner (see Section 3.3.3) to perform the next iteration of CWB. Hence, during the fitting process, the parameter estimates remain at their sites and are only updated if the site-specific base learner is selected. This again minimizes the amount of data communication between sites and host and speeds up the fitting process. After the fitting phase, the aggregated site-specific parameters are communicated once, in a last communication step, to obtain the final model.
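As a toy illustration of this decoupling, assume a single parameter per site and an $L_2$ fit to the pseudo residuals: because the joint Gram matrix is block-diagonal, each site can solve its own penalized least squares problem locally. The helper name `site_specific_fits` is ours, not the authors'.

```python
def site_specific_fits(site_x, site_r, penalty):
    # [S] each site solves its own (scalar) penalized least squares problem
    # theta_s = (x_s^T x_s + K)^(-1) x_s^T r_s  -- nothing leaves the site
    thetas = []
    for xs, rs in zip(site_x, site_r):
        gram = sum(x * x for x in xs) + penalty
        rhs = sum(x * r for x, r in zip(xs, rs))
        thetas.append(rhs / gram)
    return thetas
```

Solving the joint block-diagonal system would reproduce exactly these per-site solutions, since the off-diagonal blocks coupling different sites are zero.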

Pseudo Residual Updates, SSE Calculation, and Base Learner Selection
The remaining challenges to run the distributed CWB algorithm are 1) the pseudo residual calculation (Algorithm 1 line 4), 2) the SSE calculation (Algorithm 1 line 7), and 3) base learner selection (Algorithm 1 line 9).

Distributed pseudo residual updates
The site-specific response vector $y_s$ containing the values $y^{(i)}$, $i \in \{1, \ldots, n_s\}$, is the basis of the pseudo residual calculation. We assume that every site $s$ has access to all shared effects as well as to the site-specific information of all site-specific base learners $b_{l\times}$, containing only the respective parameters $\hat{\theta}_{l\times,s}$. Based on these base learners, it is thus possible to compute a site model $\hat{f}_s^{[m]}$ as a representative of $\hat{f}^{[m]}$ on every site $s$. The pseudo residual updates $\tilde{r}_s^{[m]}$ per site are then based on $\hat{f}_s^{[m]}$, computed for all $i \in \{1, \ldots, n_s\}$ using $D_s$. Most importantly, all remaining steps of the distributed CWB fitting procedure do not share the pseudo residuals $\tilde{r}_s^{[m]}$ in order to avoid information leakage about $y_s$.

Distributed SSE calculation and base learner selection: After fitting all base learners $b_l \in B$ and $b_{l\times} \in B_\times$ to $\tilde{r}_s^{[m]}$, we obtain $\hat{\theta}_l^{[m]}$, $l = 1, \ldots, |B|$, and $\hat{\theta}_{l\times}^{[m]}$, $l_\times = 1_\times, \ldots, |B|_\times$. Calculating the SSE distributively for the $l$th and $l_\times$th base learner requires calculating $2S$ site-specific SSE values. The site-specific SSE values are then sent to the host and aggregated to $\mathrm{SSE}_l = \sum_{s=1}^S \mathrm{SSE}_{l,s}$. If privacy constraints have been met in all previous calculations, sharing the individual SSE values is not critical and does not violate any privacy constraints, as each value is an aggregation over all $n_s$ observations of site $s$.
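For the binary response used in the application later, the per-site pseudo residual update above is the negative gradient of the binomial log loss evaluated at the current site model score. A minimal sketch, assuming the logit link; the function name is ours, not from the paper:

```python
import math

def pseudo_residuals(y_site, f_site):
    # negative gradient of the binomial log loss at the current model score:
    # r(i) = y(i) - logit^{-1}(f(x(i))), computed entirely on-site
    return [y - 1.0 / (1.0 + math.exp(-f)) for y, f in zip(y_site, f_site)]
```

The residuals never leave the site; only aggregates derived from them (e.g., $Z_{l,s}^\top \tilde{r}_s^{[m]}$ and SSE values) are communicated.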
Having gathered all SSE values at the host, selecting the best base learner in the current iteration is done in exactly the same manner as in the non-distributed CWB algorithm, by selecting $l^{[m]} = \operatorname{arg\,min}_{l \in \{1, \ldots, |B|, 1_\times, \ldots, |B|_\times\}} \mathrm{SSE}_l$. After the selection, the index $l^{[m]}$ is shared with all sites to enable the update of the site models $\hat{f}_s^{[m]}$; if a shared effect was selected, its parameter vector is shared with all sites as well. Caution must be taken when the number of parameters of one base learner is equal to the number of observations, as this would allow reverse-engineering private data. In the case of a site-specific effect selection, no parameters need to be communicated, as the respective estimates are already located at each site.
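The host-side aggregation and selection step can be sketched as follows (hypothetical helper name; the dictionary keys stand for shared and site-specific base learner labels):

```python
def select_base_learner(site_sse):
    # site_sse: dict mapping base learner label -> list of site-specific SSEs
    # [H] aggregate SSE_l = sum_s SSE_{l,s}, then pick the minimizer
    aggregated = {label: sum(vals) for label, vals in site_sse.items()}
    return min(aggregated, key=aggregated.get)
```

Only the winning label (and, for shared effects, the associated parameter vector) is broadcast back to the sites.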

Distributed CWB Algorithm with Site-Specific Effects
Assembling all pieces, our distributed CWB algorithm is summarized in Algorithm 3.

Communication Costs
While the CWB iterations themselves can be performed in parallel on every site and do not slow down the process compared to a pooled calculation, it is worth discussing the communication costs of distrCWB. During the initialization, data is shared just once, while the fitting phase requires the communication of data in each iteration. Let $d = \max_l d_l$ be the maximum number of basis functions (or, alternatively, assume $d$ basis functions for all base learners). The two main drivers of the communication costs are the number of boosting iterations $M$ and the number of base learners $|B|$. Because of the iterative nature of CWB, with a single loop over the boosting iterations, the communication costs (both for the host and each site) scale linearly with the number of boosting iterations $M$, i.e., $O(M)$. For the analysis of communication costs in terms of the number of base learners, we distinguish between the initialization phase and the fitting phase.

Initialization
As the sites only share $F_{l,s} \in \mathbb{R}^{d \times d}$, $\forall l \in \{1, \ldots, |B|\}$, the transmitted amount of values is $d^2 |B|$ for each site and therefore scales linearly with $|B|$, i.e., $O(|B|)$. The host does not communicate any values during the initialization.

Fitting
In each iteration, every site shares its vector $Z_{l,s}^\top \tilde{r}_s^{[m]} \in \mathbb{R}^d$, $\forall l \in \{1, \ldots, |B|\}$. Over the course of $M$ boosting iterations, each site therefore shares $dM|B|$ values. Every site also communicates the SSE values, i.e., 2 values (index and SSE value) for every base learner and thus $2M|B|$ values over all iterations and base learners. In total, each site communicates $M|B|(d+2)$ values, so the communication costs for all sites are $O(|B|)$. The host, in turn, communicates the estimated parameters $\hat{\theta}^{[m]} \in \mathbb{R}^d$ of the $|B|$ shared effects. Hence, $dM|B|$ values as well as the index of the best base learner in each iteration are transmitted. In total, the host communicates $dM|B| + M$ values to the sites, and the costs are therefore also $O(|B|)$.
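The per-site and host totals derived in this section can be tallied with a small helper (the function name is illustrative):

```python
def communication_costs(M, d, n_baselearners):
    # counts per the formulas in this section:
    # initialization: each site sends d^2 * |B| values (the Gram matrices F_ls)
    # fitting, per site: d*M*|B| aggregate values plus 2*M*|B| SSE/index values
    # fitting, host:     d*M*|B| parameter values plus M selected indices
    site_init = d * d * n_baselearners
    site_fit = M * n_baselearners * (d + 2)
    host_fit = d * M * n_baselearners + M
    return site_init, site_fit, host_fit
```

For fixed $d$, all three totals grow linearly in both $M$ and $|B|$, matching the $O(M)$ and $O(|B|)$ statements above.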

Application
We now showcase our algorithm on a heart disease data set that consists of patient data gathered all over the world. The data were collected at four different sites by the 1) Hungarian Institute of Cardiology, Budapest (Andras Janosi, M.D.), 2) University Hospital, Zurich, Switzerland (William Steinbrunn, M.D.), 3) University Hospital, Basel, Switzerland (Matthias Pfisterer, M.D.), and 4) V.A. Medical Center, Long Beach, and Cleveland Clinic Foundation (Robert Detrano, M.D., Ph.D.), and are thus suited for a multi-site distributed analysis. The individual data sets are freely available at https://archive.ics.uci.edu/ml/datasets/heart+disease (Dua and Graff, 2017). For our analysis, we set the privacy level (cf. Section 2.4) to 5, which is a common default.

Data Description
The raw data set contains 14 covariates, such as the chest pain type (cp), resting blood pressure (trestbps), maximum heart rate (thalach), sex, exercise-induced angina (exang), or ST depression (i.e., abnormal difference of the ST segment from the baseline on an electrocardiogram) induced by exercise relative to rest (oldpeak). A full list of covariates and their abbreviations is given on the data set's website. After removing noninformative covariates and columns with too many missing values at each site, we obtain $n_{\text{cleveland}} = 303$, $n_{\text{hungarian}} = 292$, $n_{\text{switzerland}} = 116$, and $n_{\text{va}} = 140$ observations and 8 covariates. A table describing the abbreviations of these covariates is given in Table 1 in the Supplementary Material B.1. For our application, we assume that missing values are missing completely at random and that each data set is exclusively located at its respective site. The task is to determine important risk factors for heart disease. The target variable is therefore a binary outcome indicating the presence of heart disease.

Analysis and Results
We follow the practices for setting up CWB mentioned in Section 2.3.2 and run the distributed CWB algorithm with a learning rate of 0.1 and a maximum of 100000 iterations. To determine an optimal stopping iteration, we use 20% of the data as validation data and set the patience to 5 iterations; in other words, the algorithm stops if no risk improvement on the validation data is observed in 5 consecutive iterations. For the numerical covariates, we use P-splines with 10 cubic basis functions and second-order difference penalties. All base learners are penalized according to a global degree of freedom that we set to 2.2 (to obtain an unbiased feature selection), while the random intercept is penalized according to 3 degrees of freedom (see the Supplementary Material B.2 for more details). Since we are modelling a binary response variable, $h^{-1}$ is the inverse logit function $\operatorname{logit}^{-1}(f) = (1 + \exp(-f))^{-1}$. The model for an observation of site $s$, conditional on its random effects $\gamma$, is given in the Supplementary Material B.3.
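The patience rule used for early stopping can be sketched as follows; `stop_iteration` is a hypothetical helper illustrating the rule, not part of the authors' software:

```python
def stop_iteration(val_risks, patience=5):
    # returns the 1-based iteration after which training stops: the first
    # point at which the validation risk has not improved for `patience`
    # consecutive iterations (or the last iteration if that never happens)
    best, since_best = float("inf"), 0
    for m, risk in enumerate(val_risks, start=1):
        if risk < best:
            best, since_best = risk, 0
        else:
            since_best += 1
            if since_best >= patience:
                return m
    return len(val_risks)
```

Applied to the validation risk trace of the fitted model, this rule yields the stopping iteration reported in the results below.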

Results
The algorithm stops after $m_{\text{stop}} = 5578$ iterations, as the risk on the validation data set starts to increase (cf. Figure 1 in the Supplementary Material B.4). Out of these 5578 iterations, the distributed CWB algorithm selects a shared effect in 782 iterations and site-specific effects in 4796 iterations. This indicates that the data is rather heterogeneous and requires site-specific (random) effects. The estimated effect of the most important feature, oldpeak (cf. Figure 1, Right), is further visualized in Figure 2. Looking at the shared effect, we find a negative influence on the risk of heart disease with increasing ST depression (oldpeak). When accounting for site-specific deviations, the effect becomes more diverse, particularly for Hungary. In the Supplementary Material B.5 and B.6, we provide the partial effects for all features and showcase the conditional predictions of the fitted GAMM for a given site.

Comparison of Estimation Approaches
The previous example shows partial feature effects that exhibit shrinkage due to the early stopping of CWB's fitting routine. While this prevents overfitting and induces a sparse model, we can also run CWB for a very large number of iterations without early stopping to approximate the unregularized and hence unbiased maximum likelihood solution. We illustrate this in the following by training CWB and our distributed version for 100000 iterations and comparing their partial effects to those of a classical mixed-model-based estimation routine implemented in the R package mgcv (Wood, 2017).
The estimated partial effects of our distributed CWB algorithm and of the original CWB on pooled data show a perfect overlap (cf. Figure 3). This again underpins the lossless property of the proposed algorithm. The site-specific effects on the pooled data are fitted by defining a row-wise Kronecker base learner for all features and the site as a categorical variable. The same approach is used to estimate a GAMM with mgcv on the pooled data, using tensor products between each main feature and the categorical site variable. A comparison of all partial feature effects is given in the Supplementary Material B.7, showing good alignment between the different methods. For the oldpeak effect shown in Figure 3, the partial effects of the two CWB methods are also very close to the mixed-model-based estimate, with only minor differences caused by a slightly different penalization strength of the two approaches. The empirical risk is 0.4245 for our distributed CWB algorithm, 0.4245 for CWB on the pooled data, and 0.4441 for the GAMM on the pooled data.

Discussion
We proposed a novel algorithm for distributed, lossless, and privacy-preserving GAMM estimation to analyze horizontally partitioned data. To account for the data heterogeneity of different sites, we introduced site-specific (smooth) random effects. Using CWB as the fitting engine allows estimation in high-dimensional settings and fosters variable as well as effect selection. This also includes a data-driven selection of shared and site-specific features, providing additional insights into the data. Owing to the flexibility of boosting and its base learners, our algorithm is easy to extend and can also account for interactions, functional regression settings (Brockhaus et al, 2020), or survival modeling tasks (Bender et al, 2020).
An open challenge for the practical use of our approach is its high communication costs. For large numbers of iterations (in the tens or hundreds of thousands), computing a distributed model can take several hours. One option to reduce the total runtime is to incorporate the accelerated optimization recently proposed by Schalk et al (2022a). Another driver of the runtime is the latency of the technical setup. Future improvements could reduce the number of communications, e.g., via multiple fitting rounds at the different sites before communicating intermediate results.
A possible future extension of our approach is to account for both horizontally and vertically distributed data. Since the algorithm performs component-wise (coordinate-wise) updates, the extension to vertically distributed data naturally falls within the scope of its fitting procedure. This would, however, require a more advanced technical setup and the need to ensure consistency across sites.

Algorithm 1 Vanilla CWB algorithm.
Input: train data $D$, learning rate $\nu$, number of boosting iterations $M$, loss function $L$, set of base learners $B$. Output: model $\hat{f}^{[M]}$ defined by the fitted parameters $\hat{\theta}^{[1]}, \ldots, \hat{\theta}^{[M]}$. The procedure initializes $\hat{f}^{[0]}(x) = \operatorname{arg\,min}_{c \in \mathbb{R}} R_{\text{emp}}(c)$ and then iterates the base learner fits.

The data sets $D_s$, $s = 1, \ldots, S$, with $n_s$ observations each, are located at different sites and potentially follow different data distributions $P_{xy,s}$. The union of all data sets yields the whole data set $D = \cup_{s=1}^S D_s$ with mutually exclusive data sets $D_s \cap D_l = \emptyset$ $\forall l, s \in \{1, \ldots, S\}$, $l \neq s$. The vector of realizations per site is denoted by $y_s \in \mathcal{Y}^{n_s}$.



Fig. 1: Left: Model trace showing how and when the four most selected additive terms entered the model. Right: Variable importance (cf. Au et al, 2019) of selected features in decreasing order.

Fig. 2: Decomposition of the effect of oldpeak into the shared (left) and the site-specific effects (middle). The plot on the right-hand side shows the sum of shared and site-specific effects.

Fig. 3: Comparison of the site-specific effects for oldpeak between the distributed (dsCWB) and pooled CWB approach (compboost) as well as estimates from mgcv.
Algorithm 3 Distributed CWB algorithm (excerpt). The line prefixes [S] and [H] indicate whether an operation is conducted at the sites ([S]) or at the host ([H]). Input: site data $D_s$, learning rate $\nu$, number of boosting iterations $M$, loss function $L$, set of shared effects $B$, and respective site-specific effects $B_\times$. Shared effects are fitted via $\operatorname{distFit}(Z_{l,1}, \ldots, Z_{l,S}, \tilde{r}_1, \ldots, \tilde{r}_S, K_l)$; SSE values are aggregated as $\mathrm{SSE}_l = \sum_{s=1}^S \mathrm{SSE}_{l,s}$ and $\mathrm{SSE}_{l\times} = \sum_{s=1}^S \mathrm{SSE}_{l\times,s}$; the best base learner is selected as $l^{[m]} = \operatorname{arg\,min}_{l \in \{1, \ldots, |B|, 1_\times, \ldots, |B|_\times\}} \mathrm{SSE}_l$. After the final iteration, the site-specific effects $\hat{\theta}_{1\times}, \ldots, \hat{\theta}_{|B|\times}$ are communicated to the host and added to the shared-effects model $\hat{f}^{[M]}_{\text{shared}}$ to obtain the full model $\hat{f}^{[M]}$.