Mixture of Experts Distributional Regression: Implementation Using Robust Estimation with Adaptive First-order Methods

In this work, we propose an efficient implementation of mixtures of experts distributional regression models which exploits robust estimation by using stochastic first-order optimization techniques with adaptive learning rate schedulers. We take advantage of the flexibility and scalability of neural network software and implement the proposed framework in mixdistreg, an R software package that allows for the definition of mixtures of many different families, estimation in high-dimensional and large sample size settings and robust optimization based on TensorFlow. Numerical experiments with simulated and real-world data applications show that optimization is as reliable as estimation via classical approaches in many different settings and that results may be obtained for complicated scenarios where classical approaches consistently fail.


Introduction
Mixture models are a common choice to model the joint distribution of several sub-populations or sub-classes on the basis of data from the pooled population where the sub-class memberships are not observed (for an introduction see McLachlan and Peel, 2004).Each sub-class is assumed to follow a parametric probability distribution such that the pooled observations are from a mixture of these distributions.Many applications of mixture models aim at estimating the distributions of the latent sub-classes and identifying sub-class memberships of the observations (McLachlan et al, 2019).Mixture models have also been used in machine learning, e.g., for clustering (Viroli and McLachlan, 2019), to build generative models for images (Van den Oord and Schrauwen, 2014) or as a hybrid approach for unsupervised outlier detection (Zong et al, 2018).
Different kinds of mixture models are used depending on the available data structure and the parametric model which is assumed for each sub-population or sub-class.In the following, we assume a supervised learning task and that regression models are used to model the sub-populations.This leads to mixture regression models which define a mixture of (conditional) models for the outcome of interest, where the mixture components are still unknown and thus considered latent variables.This model class is also referred to as mixtures of experts (Gormley and Frühwirth-Schnatter, 2019).Mixtures of experts allow for the inclusion of covariates when modeling the mixture components as well as for the mixture weights.In the following, we consider a mixture of experts model where the regression models do not only include the mean parameter but also other distributional parameters that may depend on the covariates or features.In addition, we assume that the sub-population sizes vary with covariates (or features).This leads to the class of mixture of experts distributional regression models.

Mixture Models and Their Estimation
As for classical statistical regression models, the goal of mixtures of regressions or mixture regression models is to describe the conditional distribution of a response (or outcome), conditional on a set of covariates (or features).Mixtures of regressions have been first introduced by Quandt (1958) under the term switching regimes where only two-component mixtures of linear regression models were considered, i.e., the number of sub-classes was fixed to two.This was extended to general mixtures of linear regression models by DeSarbo and Cron (1988) who referred to this approach as a model-based version of clusterwise regression (Späth, 1979).The extension to mixtures of generalized linear regression models was proposed in Wedel and DeSarbo (1995).Aiming for a setting beyond mean regression (Kneib, 2013), a distributional regression setting can be used such as generalized additive models for location, scale and shape (GAMLSS; Rigby and Stasinopoulos 2005) which also allow for non-linear smooth relationships between covariates and the distributional parameter of interest (Stasinopoulos et al, 2018).
Mixture models can be estimated using various techniques, with the expectation-maximization (EM) algorithm based on the maximum-likelihood principle being the most prominent one.Other approaches include Bayesian methods such as MCMC algorithms with data augmentation (Diebolt and Robert, 1994).An alternative way of model specification and estimation was proposed by Bishop (1994) who introduced mixture density networks (MDNs).MDNs use a mixture of experts distributional regression model specification.But the relationship between the covariates (or features) and the mixture weights or the distributional regression models is learned using a neural network.The training of MDNs is done using highly optimized stochastic gradient descent (SGD) routines with adaptive learning rates and momentum.These procedures have proven to be very effective in the optimization of large neural networks with millions of parameters and complex model structures.While often advantageous in terms of prediction performance, MDNs lack interpretability as the relationship between inputs (features) and distribution parameters is modeled by a deep neural network.

Novel Modeling Approach
In this work, we combine the ideas of interpretable mixtures of regression models and MDNs to allow for a mixture of experts distributional regression models in a very general setup.In particular, this approach enables modeling mixtures of sub-populations where the distribution of every sub-population is modeled using a distributional regression.Predictors of every distribution parameter in every sub-population can be defined by linear effects or (tensor product) splines and thereby allow not only for complex relationships between the features and the distributions' mean but also other distribution characteristics such as scale or skewness.Furthermore, the sub-population sizes may vary with features in a flexible way using again a combination of linear effects as well as (tensor product) splines.

Robust Estimation
Estimating mixture regression models within a maximum-likelihood framework on the basis of the EM algorithm works well for smaller problems.However, higher-dimensional settings where the number of parameters is similar to or exceeds the number of observations are often infeasible to estimate.In fact, the convergence and stability of classical algorithms are more and more adversely affected by an increase in model complexity.Maximum-likelihood estimation of distributional regression models itself induces a non-convex optimization problem in all but special cases.Thus extending mixtures of mean regression models to mixtures of distributional regression models further complicates the optimization.
In this work, inspired by MDNs, we suggest and analyze the usage of stochastic first-order optimization techniques using adaptive learning rate schedulers for (regularized) maximum-likelihood estimation.Our results show that this approach is as reliable as estimation via classical approaches in many different settings and even provides estimation results in complex cases where classical approaches consistently fail.

Flexible and Scalable Implementation
Common implementations of EM optimization routines are limited in their flexibility to specify a mixture of (many) potentially different distributions but in general focus on the case where all components have a parametric distribution from the same distributional family.Another (computational) limitation of existing approaches is faced for large amounts of data (observations) as methods usually scale at least quadratic in the number of observations.On the other hand, SGD optimizers from the field of deep learning are trained on minibatches of data allowing large dataset applications and the use in a generic fashion for all model classes.We take advantage of this flexibility and scalability and implement the fitting of the proposed model class using mixdistreg, an R (R Core Team, 2022) software package based on the R package deepregression (Rügamer et al, 2022) that allows for the definition of mixtures with components from many different distributional families, estimation in highdimensional and large sample size settings and robust optimization based on TensorFlow (Abadi et al, 2015).

Summary of our Approach and Overview on the Paper Structure
Our framework unites neural density networks (Magdon-Ismail and Atiya, 1998) with (distributional) regression approaches, extends MDNs by incorporating penalized smooth effects, and comprises various frameworks proposed in the statistical community such as Leisch (2004); Grün and Leisch (2007); Stasinopoulos and Rigby (2007) to estimate mixtures of linear, generalized linear, generalized additive, or distributional regression models.In contrast to many existing approaches, our framework further allows to estimate the mixture weights themselves on the basis of an additive structured predictor.We refer to this combination of the model class of mixtures of experts distributional regression with structured additive predictors and the estimation using neural network software as the neural mixture of experts distributional regression (NMDR) approach.The remainder of this paper is structured as follows.In Section 2, we present our model definition.In Section 3, we introduce the architecture for SGD-based optimization in neural networks and discuss penalized estimation approaches including mixture sparsification.We then demonstrate the framework's properties using extensive numerical experiments in Section 4 and its application to real-world data in Section 5. We conclude with a discussion in Section 6.

Methodology
Our goal is to model the conditional distribution of Y | x where Y is the univariate outcome (or response) of interest.We assume that Y | x ∼ F, where F is a mixture of parametric distributions F m , m ∈ {1, . . ., M } =: M.These distributions, in turn, depend on unknown parameters θ m (x) ∈ R km that are influenced by a set of features (or covariates) x ∈ R p .Furthermore, the nonnegative mixture weights are also allowed to depend on features (or covariates) x ∈ R p such that π m (x) for all m = 1, . . ., M with M m=1 π m (x) = 1.

Model Definition
For an observation y in (a suitable subset of) R constituting the support from the conditional distribution of Y | x, we define the conditional density by i.e., by a mixture of density functions f m of the distributions F m .Each component density f m has its own k m distribution parameters θ m = (θ m,1 , . . ., θ m,km ) .π m ∈ [0, 1] are the mixture weights with Each of the parameters ϑ j in ϑ is assumed to depend on the features x through an additive predictor η j , j = 1, . . ., K+M .For the distribution parameters θ a monotonic and differentiable function h j , j = 1, . . ., K is assumed to provide a map between the additive structured predictor and the distribution parameter, i.e., ϑ j = h j (η j (x)).The parameter-free transformation function h j ensures the correct domain of each ϑ j .E.g., if ϑ j represents a scale parameter the function h j ensures that ϑ j is positive while the additive predictor η j may take arbitrary values in R.
For the mixture weights π a single monotonic and differentiable function h K+1 is assumed to map the M additive predictors to the (M −1)-dimensional simplex, i.e., h K+1 maps R M → [0, 1] M under the condition that the sum of the weights is 1.This links the last M additive predictors η π := (η K+1 , . . ., η K+M ) to the set of mixture weights π.The most common choice in this respect for h K+1 is the softmax function .
This implies , for m ∈ M. (2) Effects in the additive predictors η K+l , l = 1, . . ., M , in (2) are not identifiable without further constraints.We do not enforce any constraints during model training.Model interpretation is still possible in relative terms (e.g., using a log-odds interpretation).However, some constraints would need to be imposed if identifiable regression coefficients for the additive predictors are to be obtained.For all parameters in ϑ, the additive structured predictors η j (x) ensure interpretability of the relationship between the parameter ϑ j and the covariates.For example, if η K+M is a linear model, i.e., η K+M (x) = x β, the regression coefficients β can be interpreted as linear contributions of each of the features to the logits of the mixture weight for the last mixture component M .

Additive Predictor Structure
The model ( 1) relates all model parameters ϑ to features x through additive predictors η j , j = 1, . . ., K + M .As different densities f m and also different parameters θ m have potentially different influences on the conditional distribution of Y | x, every parameter in θ is defined by its own additive structured predictor η j .Here we assume the additive predictors to have the following structure: where β 0,j corresponds to the model intercept, β j are the linear effects for predefined covariates x L(j) with L(j) ⊆ {1, . . ., p}∪∅ being a subset of all possible predictors, and f l,j are non-linear smooth functions for one or more covariates in x with S(j) being a (potentially empty) set of indices indicating the covariates with non-linear predictor effects.We assume that every f l,j (x) can be represented by (tensor product) basis functions B l,j,o , o = 1, . . ., O, taking one or several columns of x as input and mapping these onto the space spanned by the basis functions.Denoting z l,j := B l,j (x) = (B l,j,1 (x), . . ., B l,j,O (x)) ∈ R O , the non-linear smooth terms can be written as f l,j (x) = z l,j γ l,j where γ l,j ∈ R O are the corresponding basis coefficients for z l,j .

Model Log-Likelihood
Based on model ( 1) and the structures imposed on the predictors in (3), the negative log-likelihood of the model parameters ψ for n independent observations y = (y 1 , . . ., y n ) ∈ R n and their corresponding n observed feature vectors X := (x 1 , . . ., x n ) ∈ R n×p is given by the sum of the negative log-likelihood contributions i (ψ) of each observation i = 1, . . ., n: (4) The negative log-likelihood can be rewritten as the negative sum of the exponentiated log-likelihoods of both the mixture weights π m and the component densities f m using the log-sum-exp (LSE) function: In practice, formulation ( 5) is often preferred over (4) as it is less affected by under-or overflow problems.

Identifiability
Identifiability is of concern for the mixture of experts distributional regression model because of the identifiability problems which could occur due to the mixture specification, due to the additive structured predictors, and due to the softmax mapping to the mixture weights.

Mixture Models
For any mixture model, trivial identifiability problems may arise due to label switching and overfitting with either empty or duplicated components.
In addition, also generic identifiability problems may occur for mixtures of distributions but also for mixtures of regressions.A detailed overview of identifiability problems in the finite mixture case is given in Frühwirth-Schnatter (2006).

Additive Structured Predictors
The components of the additive model with predictor structures of the form (3) are in general only identifiable up to a constant and also require further restrictions if both linear and non-linear smooth effects are defined for one and the same covariate.For example, η j = β 0,j +f 1,j (x 1 ) can be equally represented by β0,j + f1,j (x 1 ) by defining β0,j = β 0,j + c and f1,j (x 1 ) = f 1,j (x 1 ) − c with c ∈ R, i.e., by adding and subtracting a constant c in both terms.We ensure the identifiability of non-linear smooth terms f l,j by using sum-to-zero constraints for all non-linear additive functions such as splines or tensor product splines.
The identifiability of linear effects xβ in the presence of a univariate non-linear effect f (x) must also be ensured.In this case, several different options exist (see, e.g., Rügamer et al, 2020).The most straightforward way is to use a nonlinear effect f (x) with a basis representation that includes the linear effect as null space (and hence, for enough penalization as discussed in Section 3.3, results in a linear effect).In this case, L(j) only consists of variables that are only modeled using a linear component (e.g., categorical effects) and L(j) ∩ S(j) = ∅.

Model Representation and Robust Estimation
The observed data log-likelihood is, in general, not concave and thus difficult to optimize.We suggest to use optimizers from the field of deep learning by framing our model as a neural network.This enables the fitting of the full model class of mixture of experts distribution regression models with additive structure predictors in a straightforward way.Numerical experiments confirm that this choice -when properly trained -is not only more flexible and robust than EM-based optimization but also makes large dataset applications feasible due to mini-batch training.

Neural Network Representation
Models described in ( 1) and ( 3) can be represented as neural networks in the following way.An exemplary architecture is depicted in Figure 1.The network architecture implementing model (1) defines at most K subnetworks, where each subnetwork models one or more additive structured predictors η j of a distribution parameter θ j .Using an appropriate parameter-free transformation function, these additive structured predictors are mapped to the distribution parameters and passed to a distribution layer (Dillon et al, 2017).Each distribution layer corresponds to a mixture component f m that is further passed to a multinomial or categorical distribution layer, modeling the mixture of all defined distributions.The mixture weights can either be directly estimated or also learned on the basis of input features using additive structured predictors in another subnetwork.A classical linear mixture regression combining M linear regressions would, e.g., be given by M subnetworks, each learning the expectation of a normal distribution and a mixture subnetwork that only takes a constant input (a bias) and learns the M mixture weights.

Optimization Routines
Representing the model ( 1) and (3) as a neural network makes a plethora of optimization routines from deep learning readily available for model estimation.Various first-order stochastic optimization routines exist for neural networks.Most of these optimizers work with mini-batches J 1 , . . ., J B ⊂ {1, . . ., n} of data, hence perform stochastic gradient descent (SGD).The Fig. 1 Example of an architecture.Smaller subnetworks (subnet) learn one or more parameters of a distribution which is defined in the respective distribution layer.For the first distribution in this example, each distribution parameter in θ 1 is learned through a separate network while the second distribution is learned by a network that outputs all parameters θ 2 together (e.g., used when θ 2 are constrained parameters such as probabilities).Each distribution layer thus corresponds to a distributional regression model.The mixture model is then defined by an additional subnetwork that learns the mixture weights π as well as by the M learned distributions f 1 , . . ., f K .
update of parameters ψ [t] in iteration t ∈ N and mini-batch J b is where ψ [0] is some starting value, ∇ ψ i (ψ [t−1] ) are the individual gradients of the negative log-likelihood contributions of the observations in the batch evaluated at the parameters of the previous iteration, and ν [t] is a learning rate updated using a learning rate scheduler.We determine the starting value ψ [0] with the Xavier uniform initialization scheme (Glorot and Bengio, 2010).
For the update of ν [t] , we found the following learning rate schedulers useful for the optimization of mixture of experts distributional regression models: RMSprop (Hinton et al, 2012), Adadelta (Zeiler, 2012), Adam (Kingma and Ba, 2014), and Ranger (Wright, 2019).These optimizers in turn often come with hyperparameters such as their initial learning rate which need to be adjusted.We will investigate their influence in our numerical experiments section.

Penalized Estimation
Estimating the model class of mixtures of experts distributional regressions with additive structured predictors can benefit from a penalized log-likelihood specification.Such a penalization allows to control the smoothness or wiggliness of the non-linear smooth functions in the additive structured predictors and to induce sparsity in the estimation of the additive structured predictors as well as the mixture weights.

Additive Structured Predictors
In order to estimate suitable non-linear smooth additive structured predictors based on splines within the neural network, the respective coefficients in each layer may require regularization using appropriate penalties or penalty matrices.The smoothness or wiggliness of the non-linear additive structured predictor can be calibrated by either selecting a suitable dimension of the basis representation or by imposing a penalization on the respective coefficients in combination with a generous basis representation.Using the later approach, the penalized version of ( 4) is given by with sets of index sets Λ defining the weights that are penalized using a quadratic penalty with individual smoothing parameters λ l and individual quadratic penalty matrices P l (see, e.g., Wood, 2017).While estimating the tuning parameters is possible by including a more elaborated optimization routine in the neural network such as the Fellner-Schall method (Wood and Fasiolo, 2017), we use the approach suggested by Rügamer et al (2020) to tune the different smooth effects by relating λ l to their respective degrees-of-freedom df l .This has the advantage of training the network in a simple backpropagation procedure with only little to no tuning, by, e.g., setting the df l equal to a pre-specified value for all l ∈ Λ.

Entropy-based Penalization of Mixtures
In order to penalize an excessive amount of non-zero mixture weights, we further introduce an entropy-based penalty for the mixture weights that can be simply added to the objective function in (7): The second part of ( 8) is controlled by a tuning parameter ξ ∈ R + 0 and corresponds to the entropy induced by the (estimated) marginal mixture weights π, i.e., in case the mixture weights depend on covariates x the mixture weights obtained when averaging over the observed x.A large value of ξ enforces the weight distribution to be sparse in the amount of non-zero elements in π, while smaller values will result in an (almost) uniform distribution of π.As the entropy is permutation invariant w.r.t. the ordering of the components in π, this penalty is particularly suitable for mixtures that are only identified up to a permutation of the component labels.We investigate the effects of this tuning parameter ξ in Section 4.4.While it is technically possible to allow feature-dependency of the conditional mixture weights which depend on x in (8), further research is required to investigate the effect of the entropy-based penalty in this case.

Numerical Experiments
We now investigate our framework in terms of predictive and estimation performance.To this end, we first compare our neural mixture of experts distributional regression (NMDR) approach with EM-based optimization routines to demonstrate competitiveness with state-of-the-art procedures.For this comparison the model class considered is restricted to only contain constant mixture weights and linear predictors for the distributional parameters in order to facilitate the use of available implementations using EM-based optimization routines.Based on this simulation setup, we also perform an empirical investigation of different optimization routines in the neural network context to provide a practical guideline for users of our framework.We then evaluate our approach considering a mixture of generalized additive regression models with additional noise variables to demonstrate the framework's efficacy when including additive predictor structures.Finally, we investigate an overfitting mixture setting, in which we simulate situations where EM-based optimization fails and NMDR provides sparsity off-the-shelf.

Evaluation Metrics
We measure the estimation performance of both the regression coefficients and the mixture weights π using the root mean squared error (RMSE), the prediction performance using the predictive log-scores (PLS; Gelfand and Dey, 1994) on an independent test dataset of the same size as the training dataset, and the differences between the true and estimated class memberships using the adjusted Rand index (ARI; Hubert and Arabie, 1985) as well as the accuracy (ACC) by deriving the estimated class memberships based on the maximum a-posteriori probabilities for each of the mixture components.In order to deal with the problem of label switching for the non-label-invariant performance measures, we first determine the labeling which induces the accuracy-optimal assignment and then calculate these performance metrics using this labeling.The accuracy-optimal labeling is computed using the true class memberships (known in this case as data is simulated) and the estimated class memberships as induced by the mixture posterior probabilities estimated by each model.

Comparison with EM-based Optimization
We first compare the NMDR framework to an EM-based algorithm implemented in the R package gamlss.mx(Stasinopoulos and Rigby, 2016) allowing for mixtures of various distributional regressions.We use n ∈ {300, 2500} observations, M ∈ {2, 3, 5, 10} identically distributed mixture components, either following a Gaussian, a Laplace, or a Logistic distribution, each defined by location and scale parameter, and mixture weights are randomly drawn  such that the smallest weight is not less than 3%.We use p m ∈ {2, 10} features for each distribution and distribution parameter in the mixture, and uniformly sampled regression coefficients from a U(−2, 2)-distribution.While we test gamlss.mxwith a fixed budget of 20 restarts, we compare these results to NMDR using 1 and 3 random initializations to assess the effect of multiple restarts.Each experimental configuration is replicated 10 times.All fitted models in this simulation are correctly specified, i.e., they correspond to the data generating process.

Results
Results are visualized in Figure 2 and yield four important findings.First, the EM-based approach only provides results in case p m = 2.The EM-based approach is not able to converge to any meaningful solution in all settings with p m = 10, whereas NMDR's performance is affected by the increased number of predictors, but still yields reasonable results for n = 2500, also without restarts.Second, in the case p m = 2, the EM-based approach provides in general a better classification performance than NMDR (as indicated by ACC Prob.and ARI Prob.).Third, the difference in estimation performance in case p m = 2 between the EM-based and the NMDR approach is often negligible in terms of the RMSE between the estimated parameters.Fourth, while the induced regularization of the SGD-based routine induces a bias in the estimation and hence typically larger estimation errors, the predictive performance of NMDR is always better compared to the EM-based approach.

Optimization Routines
In order to provide insights into the various optimizers' performance, we conduct a small benchmark study to assess the influence of the choice of an optimizer and to find a good default.Figure 3 visualizes the comparison based on the ranks for each optimizer across all 48 settings from Section 4.1.

Results
Figure 3 indicates that convergence problems are primarily encountered for SGD with a substantial amount of runs diverging during optimization.An overall performance assessment based on ranks indicates that the RMSprop optimizer achieves the lowest overall rank.However, the figure highlights that in fact, no clear best optimizer emerges across all scenarios.The ranks obtained for the different optimizers also vary considerably with the performance criterion.
Overall one might conclude, that both the RMSprop optimizer and the Adam optimizer perform in general well, also when used with their default settings.We note, however, that tuning the optimizer and its learning rate would in general also be beneficial in terms of both predictive performance and estimation quality.A further speed-up for some of these routines can be achieved by additionally incorporating momentum, which also proved to be effective in the optimization of additive models using boosting (Schalk et al, 2022).

Mixture of Additive Regression Models
Next, we investigate mixtures of mean regression models with non-linear smooth effects in the additive predictors of the mean distribution parameter.We generate n = 2500 data points from M = 3 mixtures of Poisson or normal distributions with the additive structured predictor of the means defined by η ).All covariates are independently drawn from a uniform distribution U(0, 1).We model those effects using thin-plate regression splines from Wood (2017) h(•) = exp(•) and the identity for the Gaussian case.We vary π to be either (1/3, 1/3, 1/3) or (1/10, 3/10, 6/10) and add 3 or 10 noise variables that are also included in the model as non-linear smooth predictors for the expectation parameter.We use two different scale values 2 or 4, which either define the Gaussian variance in each mixture component or a multiplicative offset effect in the Poisson case.Our method is compared with a state-of-the-art implementation of mixtures of additive models using the R package flexmix (Leisch, 2004) and, as an oracle, a generalized additive model with varying coefficients for all smooth effects, where the class label (unknown to the other two approaches) is used as the varying parameter.For NMDR, the smoothing parameters are determined as described in Section 3.3 via the respective ACC ARI PLS LS normal (2) normal ( 4) poisson ( 2) poisson ( 4) normal ( 2) normal ( 4) poisson ( 2) poisson ( 4) normal ( 2) normal ( 4) poisson ( 2) poisson ( 4) normal ( 2) normal ( 4) poisson ( 2 degrees-of-freedom which are set equally for all smooths to 10 for normal and 6 for Poisson distribution.

Results
The comparison for all settings is depicted in Figure 4 showing the average log-score (LS; calculated on the training dataset using the estimated model parameters) and PLS, as well as the ARI and ACC.The boxplots summarize all 10 simulation replications for the two different mixture weights and the two number of noise variables settings.Results suggest that our approach leads to better predictions measured by the PLS, but is inferior in terms of LS.The smaller LS values are possibly due to fewer data points to fit the model because of the need for a validation set, and due to the shrinkage induced by early stopping the procedure.The median performance of the clustering induced based on the estimated posterior probabilities is in general on par with the EM-based approach in terms of ARI and ACC with flexmix, while showing even slightly better performance in the Gaussian case.

Misspecified Mixtures and Sparsity
In this simulation, we use a normal mixture with p m = 10 fixed linear predictors for each distribution and distribution parameter (mean and variance), where all features are again drawn from a standard normal distribution and regression coefficients from a uniform U(−2, 2)-distribution.The data is then generated using M = 2 actual mixture components with π 1 drawn (independently of features) from a uniform distribution on the interval (0.06, 0.094) and π 2 = 1 − π 1 to ensure that the minimum value of both probabilities is at least 6%.We then evaluate the estimation of mixture probabilities by NMDR for n ∈ {300, 2500} when increasing the number of specified distributions M † ∈ {3, 5, 10}.To allow for sparsity in π, we use the objective function ent introduced in Section 3.3.For each scenario, 10 replications are performed.

Results
Results for various settings of the entropy penalty parameter ξ are depicted in Figure 5.While setting ξ to small values larger than zero can improve the predictive performance and even outperform the correctly specified model without additional distribution components, the bias induced by the penalty generally decreases the estimation performance.We additionally investigate the coefficient path obtained from varying values of the penalty parameter ξ for one simulated example.Results for the n = 2500, M † = 5 setting are depicted in Figure 6.ξ is varied between 0 and 1 on a logarithmic scale.The true model has two non-zero probabilities 0.6077 and 0.3923 while the other 3 entries in π are 0. q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q True Prob.

Cell Cycle-regulated Genes of Yeast
In order to demonstrate the flexibility of our approach, we investigate its application to the yeast cell cycle dataset from Spellman et al (1998).In this study, genome-wide mRNA levels were measured for 6178 yeast open reading frames (ORFs) for 119 min at 7 min intervals.We here analyze the subset of data where all 18 time points for the alpha factor arrest are available.The resulting longitudinal dataset consists of 80,802 observations of the standardized expression levels.A subset of this dataset was also analyzed using mixture models in Grün et al (2011).

Distributional Mixture Regression
As both the mean and standard deviation of the standardized expression levels of genes change over time, we apply a mixture of distributional regressions model where the mean µ and the standard deviation σ of the normally distributed mixture components depend on time, i.e., Y t ∼ M m=1 π m N (µ m,t , σ 2 m,t ) for t ∈ [0, 119].The additive structured predictors for these distribution parameters are defined as where the non-linear smooth functions f are modeled by thin-plate regression splines (Wood, 2003).Previous approaches for modeling this dataset investigated the use of a mixture of mixed models, i.e., the inclusion of gene-specific random effects (Luan and Li, 2003;Grün et al, 2011).We investigate here an alternative option for modeling this additional heterogeneity by allowing for time-varying standard deviations.We use M = 6 which corresponds to the number of mixture components identified by Spellman et al (1998).

Results
In order to plot the estimated smooth effects together with the true observations, we first derive the component membership of every gene.As done in the  E-step of mixture model approaches (see, e.g., Grün et al, 2011), we calculate the a posteriori probability for every gene to belong to component m and then take the maximum of all components 1, . . ., M = 6.For this application, observations were only assigned to 5 of the 6 assumed components.Note that due to the nature of our optimization routine, not all components necessarily contain at least one observation.
Comparing the number of genes assigned to each cluster one sees that the most common component in our results is cluster 6 with 39,078 observations.Cluster 4 contains 25,722 observations, cluster 1 9,306, and cluster 5 6,552 observations.The least number of observations are assigned to cluster 2 which contains only 144 observations.Figure 7 visualizes the results obtained in a panel-plot where in each panel the trajectories of the ORFs for all genes assigned to this cluster are shown together with the component-specific estimates of the time-varying means and standard deviations.The identified clusters clearly vary in showing either an initial decrease or increase in their means.In addition, one also sees that the standard deviations of the clusters vary over time with some clusters exhibiting a particularly large amount of heterogeneity at later time points.

Outlook
We have introduced the class of mixtures of experts distributional regression with additive structural predictors and investigated its embedding into neural networks for robust model estimation.Overall this leads to the neural mixture of experts distributional regression (NMDR) approach.We show that popular first-order adaptive update routines are well-suited for learning these mixture of experts (distributional) regression models and also highlight that the embedding into a neural network estimation framework allows for straightforward extensions of the general mixture model class and (regularized) maximum-likelihood estimation using optimization routines suitable also for big data applications due to mini-batch training.Using the proposed architecture for mixture of experts distributional regression, a possible extension of our approach is therefore the combination with other (deep) neural networks.This allows learning both the distribution components and the mixture weights either by (a) a structured model, such as a linear or additive model, (b) a custom (deep) neural network, or (c) a combination thereof.A similar approach has been investigated by Fritz et al (2022) using a zero-inflated Poisson model (i.e., a mixture including a point mass distribution) which includes both additive effects and a graph neural network in the additive predictor.

Fig. 2
Fig. 2 Comparison of EM-based optimization (EM), NMDR with one restart (NMDR) or with three restarts (NMDR 3) for different distributions (columns), measures (rows), and combinations of n and p = pm.Boxplots contain all 10 runs and the four different settings for M (i.e., in total 40 data points per boxplot).Missing boxplots for EM are due to missing solutions caused by missing values when comparing current results to a convergence threshold.RMSEs for coefficients > 3 and PLS values < −50 are omitted to improve readability.

Fig. 4
Fig.4Comparison of average PLS and LS as well as ACC and ARI of a state-of-the-art implementation (EM), an oracle varying coefficient model with known class memberships, and our approach (NMDR) in different colors for the two distributions and two scales (xaxis).The boxplots contain the results for 10 replications over two settings for the mixture weights and two settings for the number of noise variables.

Fig. 5
Fig.5Model quality for misspecified models with specified mixtures M † ∈ {3, 5, 10} (columns, counting the additional components) instead of actual M = 2 mixtures and different goodness of fit measures (rows) for 10 replications (boxes).Colors correspond to different settings of ξ or represent estimation results of the correct model (black).

Fig. 6
Fig.6Coefficient path (estimated mixture probabilities) for different entropy penalties.A value of around 1e-02 yields the best trade-off between sparsity and estimation performance.

Fig. 7
Fig.7Trajectories of ORFs (y-axis) for all 4489 genes (black lines) over the course of the 18 time points (x-axis) with the estimated mean trend (red solid line) per component (facets) and uncertainty visualized by two times the estimated (time-varying) standard deviation (shaded red area).
. For Poisson data, we use