1 Introduction

Mixture-of-experts (ME), introduced by Jacobs et al. (1991), is a successful and flexible supervised learning architecture that allows one to efficiently represent complex non-linear relationships in observed pairs of heterogeneous data \((\varvec{X},Y)\). The ME model relies on the divide-and-conquer principle: the response Y is obtained as a soft association of several expert responses, each targeted to a homogeneous sub-population of the heterogeneous population, given the input covariates (predictors or features) \(\varvec{X}\). From the statistical modeling point of view, an ME model is an extension of the finite mixture model (McLachlan and Peel 2000), which explores the unconditional (mixture) distribution of a given set of features \(\varvec{X}\) and is thus more tailored to unsupervised learning than to supervised learning; the ME model is instead the fully conditional mixture model of the form

$$\begin{aligned} \text {ME}(y|\varvec{x}) = \sum _{k=1}^K \text {G}_k(\varvec{x}) \text {E}_k(y|\varvec{x})\cdot \end{aligned}$$
(1)

In model (1) the ME distribution of the response y given the predictors \(\varvec{x}\) is a conditional mixture distribution with predictor-dependent mixing weights, referred to as gating functions, \(\text {G}_k(\varvec{x})\), and conditional mixture components, referred to as experts \(\text {E}_k(y|\varvec{x})\), K being the number of experts.

Mixture of experts (ME) models thus allow one to capture complex relationships between y and \(\varvec{x}\) in heterogeneous situations, whether for non-linear regression (\(y\in \mathbb {R}\)), classification (\(y\in \{1,\ldots ,G\}\)), or clustering of the data by associating each expert component with a cluster. The richness of the class of ME models in terms of conditional density approximation capabilities has recently been demonstrated through denseness results (Nguyen et al. 2021a, 2019).

They have been investigated in their simple form, as well as in their hierarchical form (Jordan and Jacobs 1994), for non-linear regression and model-based cluster and discriminant analyses, and in different application domains. Inference in this case can be performed by maximum likelihood estimation (MLE) via the expectation–maximization (EM) algorithm (Jordan and Jacobs 1994; McLachlan and Krishnan 2008; Dempster et al. 1977) or, when p is possibly larger than the sample size n, by regularized MLE via dedicated EM-Lasso algorithms as in Khalili (2010), Montuelle et al. (2014), Chamroukhi et al. (2019), Chamroukhi and Huynh (2019), Huynh and Chamroukhi (2019) and Nguyen et al. (2020), which include Lasso-like penalties (Tibshirani 1996). For a more complete review of ME models, the reader is referred to Yuksel et al. (2012) and Nguyen and Chamroukhi (2018). More recent theoretical results on ME estimation and model selection for different families of ME models can be found in Nguyen et al. (2020).

To the best of our knowledge, ME models have been exclusively studied in the multivariate analysis setting in which the inputs are vectors, i.e. \(\varvec{X}\in \mathcal {X}= \mathbb {R}^p\). However, in many problems the predictors and/or the responses arise as smooth functions. Indeed, unlike in predictive and cluster analyses of multivariate, potentially high-dimensional heterogeneous data, which have been studied with the ME modeling in (1), the observed data may in many situations arise from continuously observed processes, e.g. time series. A multivariate (vectorial) analysis then fails to capture the inherent functional structure of the data; classical multivariate models are not adapted as they ignore the underlying intrinsic nature and structure of the data. Functional Data Analysis (FDA) (Ramsay and Silverman 2005; Ferraty and Vieu 2006), in which the individual data units are assumed to be functions rather than vectors, offers an adapted framework to deal with continuously observed data, including in regression, classification and clustering. FDA considers the observed data as (discretized) values of smooth functions, rather than multivariate observations represented in the form of “simple” vectors.

The study of functional data has been considered in most statistical modeling and inference problems, including regression, classification, clustering, and functional graphical models (Qiao et al. 2019), among others. In regression, functional linear models have been introduced, including penalized functional regression (Brunel et al. 2016; Goldsmith et al. 2011), in particular the FLiRTI approach, a functional linear regression constructed upon interpretable regularization (James et al. 2009), and, more generally, generalized linear models with functional predictors (Müller et al. 2005; James 2002), which cover functional logistic regression for classification. In classification, we can also cite functional linear discriminant analysis (James and Hastie 2001) and, as a penalized model, Lasso-regularized functional logistic regression (Mousavi and Sørensen 2017). To deal with heterogeneous functional data, mixture models with functional data analytic aspects have been introduced for model-based clustering (Liu and Yang 2009; Jacques and Preda 2014a), including Lasso-regularized mixtures for functional data (Devijver 2017; James and Sugar 2003; Jacques and Preda 2014b; Chamroukhi and Nguyen 2019). The resulting functional mixture models are better able to handle functional data structures than standard multivariate mixtures.

The problem of clustering and prediction in the presence of functional observations from heterogeneous populations, leading to complex distributions, is however still less investigated. In this paper, we consider the framework of mixtures-of-experts (ME) models, as models of choice for capturing heterogeneity in data for prediction and clustering with vectorial observations, and extend it to the functional data framework. The main novelty of our paper is to obtain interpretable results for functional ME (FME). First, the ME framework is extended to a functional data setting to learn from functional predictors, and the statistical inference in the resulting setting (i.e. of the FME model) aims at estimating sparse and interpretable FME model parameters. The key technical challenges brought by this setting are addressed by imposing sparsity on the derivatives of the underlying functional parameters of the FME model, via Lasso-like regularizations, solved by a dedicated EM-Lasso algorithm. To the best of our knowledge, this is the first time the ME model is constructed upon functional predictors and provides interpretable sparse estimates.

Firstly, we introduce in Sect. 2 a new family of ME models, referred to as FME, to relate a functional predictor to a scalar response, and develop a dedicated EM algorithm for maximum-likelihood parameter estimation. Secondly, to deal with a potentially high-dimensional setting of the introduced FME model, we develop in Sect. 3 a Lasso-regularized approach, which consists of penalized MLE via a hybrid EM-Lasso algorithm integrating an optimized coordinate ascent procedure to efficiently implement the M-step. Thirdly, we present in Sect. 3.2 an extended FME model, constructed upon a sparse and highly interpretable regularization of the functional expert and gating parameters. The resulting model, abbreviated as iFME, is fitted by regularized MLE via a dedicated EM algorithm. The developed algorithms for the two introduced ME models are applied and evaluated in Sect. 4 on several simulated scenarios and on real data sets, in both clustering and non-linear regression.

2 Functional mixtures-of-experts (FME)

We wish to derive and fit new mixture-of-experts (ME) models in the presence of functional predictors and, potentially, functional responses. In this paper, we first consider ME models with a functional predictor \(X(\cdot )\) and a real response Y, where the pair arises from a heterogeneous population composed of K unknown homogeneous sub-populations. To the best of our knowledge, this is the first time ME models are considered for functional data.

2.1 ME with functional predictor and scalar response

Let \(\{X_i(t),Y_i\}_{i=1}^n\) be a sample of n independently and identically distributed (i.i.d.) data pairs where \(Y_i\) is a real-valued response and \(X_{i}(t)\) is a functional predictor with \(t \in \mathcal {T}\subset \mathbb {R}\), for example time in the case of time series. First, to model the conditional relationship between the continuous response Y and the functional predictor \(X(\cdot )\), given an expert z, we formulate each expert component \(\text {E}_z(y|\varvec{x})\) in (1) as a functional regression model (cf. Müller et al. (2005), James et al. (2009)). The resulting functional expert regression model for the ith observation takes the following stochastic representation

$$\begin{aligned} Y_i = \beta _{z_i,0} + \int _{\mathcal {T}} X_i(t) \beta _{z_i}(t) dt + \varepsilon _i, \quad i\in [n], \end{aligned}$$
(2)

where \(\beta _{z_i,0}\) is an unknown constant intercept, \(\beta _{z_i}(t),\, t\in \mathcal {T}\) is the function of unknown coefficients of functional expert \(z_i\), and \( \varepsilon _i \sim \mathcal {N}(0,\sigma ^2_{z_i})\) are independent Gaussian errors, \(z_i\in [K]\) being the unknown label of the expert generating the ith observation. The notation \([K]\), used throughout this paper, stands for the set \(\{1,\ldots ,K\}\). In this context, the response Y is related to the entire trajectory of \(X(\cdot )\). Let \(\varvec{\beta }= \{\beta _{z,0},\beta _{z}(t), t\in \mathcal {T}\}_{z=1}^K\) denote the set of unknown functional parameters of the experts network.

Now consider the modeling of the gating network in the proposed FME model. As in the context of ME for vectorial data, different choices are possible to model the gating network function, typically softmax-gated or Gaussian-gated ME (e.g. see Nguyen and Chamroukhi (2018), Xu et al. (1994), Chamroukhi et al. (2019)). A standard choice as in Jacobs et al. (1991) to model the gating network \(\text {G}_z(\varvec{x})\) in (1) is to use the multinomial logistic (softmax) function as a distribution of the latent variable Z. In this functional data modeling context with \(K\ge 2\) experts, we use a multinomial logistic function as an extension of the functional logistic regression presented in Mousavi and Sørensen (2018) for linear classification. The resulting functional softmax gating network then takes the following form

$$\begin{aligned} \pi _z\left( X(t),t\in \mathcal {T}; \varvec{\alpha }\right)&=\mathbb {P}(Z=z|X(t),t\in \mathcal {T};\varvec{\alpha }) \nonumber \\&= \frac{\exp \{\alpha _{z,0} + \int _{\mathcal {T}} X(t) \alpha _{z}(t) dt\}}{1+\sum _{z^\prime =1}^{K-1}\exp \{\alpha _{z^\prime ,0} + \int _{\mathcal {T}} X(t) \alpha _{z^\prime }(t) dt\}}, \end{aligned}$$
(3)

where \(\varvec{\alpha }= \{\alpha _{z,0},\alpha _{z}(t),\, t\in \mathcal {T}\}_{z=1}^K\) is the set of unknown constant intercept coefficients \(\alpha _{z,0}\) and functional parameters \(\alpha _{z}(t), t\in \mathcal {T}\) for each expert \(z\in [K]\). Note that model (3) is equivalent to assuming that each expert z is related to the entire trajectory \(X(\cdot )\) via the following functional linear predictor for the gating network

$$\begin{aligned} h_z(X(t),t\in \mathcal {T};\varvec{\alpha })&= \log \left\{ \frac{\pi _z\left( X(t),t\in \mathcal {T};\varvec{\alpha }\right) }{\pi _K\left( X(t),t\in \mathcal {T};\varvec{\alpha }\right) } \right\} \nonumber \\&= \alpha _{z,0} + \int _{\mathcal {T}} X(t) \alpha _{z}(t) dt. \end{aligned}$$
(4)

The objective is to estimate the functional parameters \(\varvec{\alpha }\) and \(\varvec{\beta }\) of the FME model defined by (2)–(3), from an observed sample. In this setting with functional predictors, this requires estimating a possibly infinite number of coefficients (as many as the number of temporal observations for the predictor). In order to reduce the complexity of the problem, the observed functional predictor can be projected onto a fixed number of basis functions, so as to sufficiently capture the functional structure of the data while sufficiently reducing the number of coefficients to estimate.

2.2 Smoothing representation of the functional experts

Here we consider the case of fixed design, that is, the covariates \(X_i(t)\) are non-random functions. We suppose that the \(X_i(\cdot )\)’s are measured with error at any given time t. Hence, instead of directly observing \(X_i(t)\), one observes a noisy version of it, \(U_i(t)\), defined as

$$\begin{aligned} U_i(t) = X_i(t) + \delta _i(t), \quad i \in [n], \end{aligned}$$

where \(\delta _i(\cdot ) \sim \mathcal {N}(0, \sigma ^2_\delta )\) are measurement errors assumed to be independent of the \(X_i(\cdot )\)’s and the \(Y_i\)’s. Since the functional predictors \(X_i(t)\) are not directly observed, we first construct an approximation of \(X_i(t)\) from the noisy predictors \(U_i(t)\) by projecting the latter onto a set of continuous basis functions. Let \(\varvec{b}_r(t) = \left[ b_1(t), \ldots , b_r(t)\right] ^\top \) be an r-dimensional (e.g. B-spline, Fourier, or wavelet) basis; then, with r sufficiently large, \(X_i(t)\) can be represented as

$$\begin{aligned} X_i(t) = \sum _{j=1}^r x_{ij} b_j(t) = \varvec{x}_i^\top \varvec{b}_r(t), \end{aligned}$$
(5)

where \(x_{ij} = \int _{\mathcal {T}} X_i(t)b_j(t)dt\) for \(j \in [r]\) and \(\varvec{x}_i = ( x_{i1}, \ldots , x_{ir})^\top \). Since \(X_i(t)\) is not observed, the representation coefficients \(x_{ij}\)’s are unknown. Hence we propose an unbiased estimator of \(x_{ij}\) defined as

$$\begin{aligned} {\widehat{x}}_{ij}:= \int _\mathcal {T}U_i(t)b_j(t)dt. \end{aligned}$$

Thus, an estimate \(\widehat{X}_i(t)\) of \(X_i(t)\) is given by

$$\begin{aligned} \widehat{X}_i(t) = {\widehat{\varvec{x}}}_i^\top \varvec{b}_r(t),\quad i\in [n], \end{aligned}$$
(6)

with \({\widehat{\varvec{x}}}_i = ({\widehat{x}}_{i1}, \ldots , \widehat{x}_{ir})^\top \).
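As an illustration of this smoothing step, the projection coefficients \({\widehat{x}}_{ij}\) can be approximated numerically from a discretized noisy curve. The following Python sketch uses a clamped cubic B-spline basis and a simple Riemann-sum quadrature; the basis choice, grid, and function names are illustrative assumptions rather than part of the model definition.

```python
import numpy as np
from scipy.interpolate import BSpline

def bspline_basis(r, t_grid, degree=3):
    """Evaluate an r-dimensional clamped B-spline basis b_r(t) on t_grid.
    Returns an array of shape (len(t_grid), r); column j holds b_{j+1}(t)."""
    n_interior = r - degree - 1
    interior = np.linspace(0.0, 1.0, n_interior + 2)[1:-1]
    knots = np.concatenate(([0.0] * (degree + 1), interior, [1.0] * (degree + 1)))
    return np.column_stack(
        [BSpline(knots, np.eye(r)[j], degree)(t_grid) for j in range(r)]
    )

def estimate_coefficients(U_i, t_grid, r=15):
    """Riemann-sum approximation of x_hat_{ij} = int_T U_i(t) b_j(t) dt."""
    B = bspline_basis(r, t_grid)                 # (m_i, r)
    dt = t_grid[1] - t_grid[0]                   # regular grid assumed
    return B.T @ U_i * dt                        # (r,) vector of x_hat_{ij}

# Toy usage on one noisy curve observed at m_i = 200 points of [0, 1]
t_grid = np.linspace(0.0, 1.0, 200)
U_i = np.sin(2 * np.pi * t_grid) + 0.1 * np.random.randn(t_grid.size)
x_hat_i = estimate_coefficients(U_i, t_grid)     # coefficients of (6)
X_hat_i = bspline_basis(15, t_grid) @ x_hat_i    # smoothed reconstruction of X_i(t)
```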

Similarly, to represent the regression coefficient functions \(\beta _z(\cdot )\), consider a p-dimensional basis \(\varvec{b}_p(t) = \big [b_1(t),b_2(t),\ldots ,b_p(t)\big ]^\top \). Then the function \(\beta _z(t)\) can be represented as

$$\begin{aligned} \beta _z(t) = \varvec{\eta }^{\top }_z \varvec{b}_p(t) \end{aligned}$$
(7)

where \(\varvec{\eta }_z = (\eta _{z,1},\eta _{z,2},\ldots ,\eta _{z,p})^\top \) is the vector of unknown coefficients and the choice of p should ensure the tradeoff between smoothness of the functional predictor and complexity of the estimation problem. We select \(r \ge p\) to satisfy the identifiability constraint (see for instance Goldsmith et al. (2011), Ramsay and Silverman (2002)). Furthermore, rather than assuming a perfect fit of \(\beta _z(t)\) by \(\varvec{b}_p(t)\) as in (7), we use for each Gaussian expert regressor z, the following error model as proposed by James et al. (2009) for functional linear regression

$$\begin{aligned} \beta _z(t) = \varvec{\eta }^{\top }_z \varvec{b}_p(t) + e(t) \end{aligned}$$

where e(t) represents the approximation error of \(\beta _z(t)\) by the linear projection (7). As we choose \(p\gg n\), |e(t)| can be assumed to be small.

2.3 Smoothing representation of the functional gating network

Since we are examining functional predictors, an appropriate representation also has to be given for the gating network (3) with functional parameters \(\{\alpha _{z}(t),\, t\in \mathcal {T}\}_{z=1}^K\). Due to the infinite number of these parameters, we also represent the gating network by a finite set of basis functions, similarly to the experts network. For the representation of the functional predictors \(X_i(t)\), \(i\in [n]\), we use \(\widehat{X}_i(t)\) established in (6). The coefficient function \(\alpha _{z}(t)\) is represented similarly to the \(\beta \) coefficient functions of the experts network, by using a q-dimensional basis \(\varvec{b}_q(t) = \left[ b_1(t),b_2(t),\ldots ,b_q(t)\right] ^\top \), \(q \le r\), via the projection

$$\begin{aligned} \alpha _z(t) = \varvec{\zeta }^\top _z \varvec{b}_q(t), \end{aligned}$$
(8)

where \(\varvec{\zeta }_z = (\zeta _{z1},\zeta _{z2},\ldots ,\zeta _{zq})^\top \) is the vector of coefficients of the softmax gating function. Note that here we use the same type of basis functions for both representations of \(\beta _z\) and \(\alpha _z\), but one can use different types of bases if needed. Then, by using the representations (6) and (8) of X(t) and \(\alpha _z(t)\), respectively, in the linear predictor \(h_z(\cdot )\) defined in (4) for \(i\in [n]\), the latter is approximated as

$$\begin{aligned} h_z\left( U_i(t),t\in \mathcal {T};\varvec{\alpha }\right)= & {} \alpha _{z_i,0} + \varvec{\zeta }_{z_i}^\top \textbf{r}_i, \end{aligned}$$
(9)

where \(\textbf{r}_i = \left[ \int _{\mathcal {T}} \varvec{b}_r(t) \varvec{b}_q(t)^\top dt\right] ^\top \widehat{\varvec{x}}_i\). Thus, following its definition in (3), the functional softmax gating network is approximated as

$$\begin{aligned} \pi _{k}(\textbf{r}_i;\varvec{\xi }) = \frac{\exp {\{\alpha _{k,0} + \varvec{\zeta }^\top _k\textbf{r}_i\}}}{1+\sum _{k^\prime =1}^{K-1}\exp {\{\alpha _{k^\prime ,0} + \varvec{\zeta }^\top _{k^\prime }\textbf{r}_i \}}}, \end{aligned}$$
(10)

where \(\varvec{\xi }= \left( (\alpha _{1,0},\varvec{\zeta }^\top _{1}),\ldots , (\alpha _{K-1,0},\varvec{\zeta }^\top _{K-1})\right) ^\top \in \mathbb {R}^{(q+1)(K-1)}\) is the unknown parameter vector of the functional gating network to be estimated.

2.4 The FME model conditional density

We now have appropriate representations for the functional predictors, as well as for both the functional gating network and the functional experts network, involved in the construction of the functional ME (FME) model (2)-(3). Gathering (6) and (7), the stochastic representation (2) of the FME model can thus be defined as follows,

$$\begin{aligned} Y_i|u_i(\cdot )= & {} \beta _{z_i,0} + \varvec{\eta }_{z_i}^\top \textbf{x}_i + \varepsilon ^\star _i, \quad i\in [n], \end{aligned}$$
(11)

where \(\textbf{x}_i = \left[ \int _{\mathcal {T}} \varvec{b}_r(t) \varvec{b}_p(t)^\top dt\right] ^\top \widehat{\varvec{x}}_i\) and \(\varepsilon ^\star _i =\varepsilon _i + {\widehat{\varvec{x}}}_i^\top \int _{\mathcal {T}} \varvec{b}_r(t)e(t) dt\). We can see that the problem is now reduced to the standard (vectorial) version of the ME model. From this stochastic representation, under the Gaussian assumption for the error variable \(\varepsilon _i\), the conditional density of each approximated functional expert \(z_i =k\) is thus given by

$$\begin{aligned} f\big (y_i|u_i(\cdot ),z_i=k;\varvec{\theta }_k\big ) = \phi (y_{i};\beta _{k,0} + \varvec{\eta }_{k}^{\top } \textbf{x}_{i},\sigma ^2_{k}), \end{aligned}$$
(12)

where \(\phi (\,\cdot \,;\mu , v)\) is the Gaussian probability density function with mean \(\mu \) and variance v, \(\beta _{k,0} + \varvec{\eta }_{k}^{\top } \textbf{x}_{i}\) is the mean of the approximated functional regression expert, \(\sigma ^2_{k}\) its variance, and \(\varvec{\theta }_k = (\beta _{k,0},\varvec{\eta }^\top _{k}, \sigma ^2_k)^\top \in \mathbb {R}^{p+2}\) the unknown parameter vector of expert density k, \(k\in [K]\), to be estimated. Finally, combining (12) and (10) in the ME model (1) leads to the following conditional density defining the FME model,

$$\begin{aligned} f(y_i|u_i(\cdot );\varvec{\varPsi }) = \sum _{k=1}^{K} \pi _{k}(\textbf{r}_i;\varvec{\xi }) \phi (y_{i};\beta _{k,0} + \varvec{\eta }_{k}^{\top } \textbf{x}_{i},\sigma ^2_k), \end{aligned}$$
(13)

where \(\varvec{\varPsi }= (\varvec{\xi }^{\top },\varvec{\theta }_1^{\top },\ldots ,\varvec{\theta }_K^{\top })^{\top }\) is the parameter vector of the model to be estimated.
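For concreteness, the conditional density (13) can be evaluated pointwise as follows; this minimal Python sketch takes the smoothed design vectors \(\textbf{x}_i\) and \(\textbf{r}_i\) and a candidate parameter vector as given (all variable names are illustrative).

```python
import numpy as np
from scipy.stats import norm

def softmax_gate(r_i, alpha0, zeta):
    """Gating probabilities pi_k(r_i; xi) of Eq. (10).
    alpha0: (K-1,), zeta: (K-1, q); the K-th gate is the reference one (logit 0)."""
    logits = np.append(alpha0 + zeta @ r_i, 0.0)   # K logits
    logits -= logits.max()                         # numerical stabilisation
    w = np.exp(logits)
    return w / w.sum()

def fme_density(y_i, x_i, r_i, alpha0, zeta, beta0, eta, sigma2):
    """FME conditional density f(y_i | u_i(.); Psi) of Eq. (13).
    beta0: (K,), eta: (K, p), sigma2: (K,)."""
    pi = softmax_gate(r_i, alpha0, zeta)           # gating probabilities, (K,)
    means = beta0 + eta @ x_i                      # expert means, (K,)
    return float(np.sum(pi * norm.pdf(y_i, loc=means, scale=np.sqrt(sigma2))))
```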

2.5 Maximum likelihood estimation via the EM algorithm

The FME model (13) is now defined upon an adapted finite representation of the functional predictors, and its parameter estimation can then be performed given an observed data sample. We first consider the maximum likelihood estimation framework via the EM algorithm (Dempster et al. 1977; Jacobs et al. 1991), which has many desirable properties including stability and convergence guarantees (e.g. see McLachlan and Krishnan (2008) for more details). Note that here we use the term maximum likelihood estimation so as not to unduly clutter the text; as will be specified later, we actually refer to the conditional maximum likelihood estimator.

In practice, the data are available in the form of discretized values of functions. The noisy functional predictors \(U_i(t)\) are usually observed at discrete sampling points \(t_{i1}<\ldots <t_{im_i}\) with \(t_{ij}\in \mathcal {T}\) for \(j\in [m_i]\). We suppose that the time domain is scaled such that \(0\le t \le 1\) and divide the interval [0, 1] into a fine grid of \(m_i\) points \(t_{i1}, \ldots , t_{im_i}\). Thus, in (11) we have \(\textbf{x}_{i} = \left[ \sum _{j=1}^{m_i} \varvec{b}_r(t_{ij}) \varvec{b}_p(t_{ij})^\top \right] ^\top \hspace{-4pt}{\widehat{\varvec{x}}}_i\), \(\textbf{r}_{i} = \left[ \sum _{j=1}^{m_i} \varvec{b}_r(t_{ij}) \varvec{b}_q(t_{ij})^\top \right] ^\top \hspace{-4pt}{\widehat{\varvec{x}}}_i\), where \({\widehat{x}}_{ij} = \sum _{l=1}^{m_i} U_i(t_{il})b_j(t_{il})\). Note that if we choose \(p = q = r\), then \(\textbf{x}_{i} = \textbf{r}_{i} = {\widehat{\varvec{x}}}_i\). Let \(\mathcal {D}= \{(\varvec{u}_1,y_1),\ldots ,(\varvec{u}_n, y_n)\}\) be an i.i.d. sample of n observed data pairs where \(\varvec{u}_i = (u_{i,1},\ldots ,u_{i,m_i})\) is the observed functional predictor for the ith response \(y_i\).
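In code, the design vectors \(\textbf{x}_i\) and \(\textbf{r}_i\) above reduce to matrix products with the basis cross-product matrices. A sketch under the same illustrative assumptions as before (the grid spacing is included here as a quadrature weight, a detail left implicit in the sums above):

```python
import numpy as np

def design_vectors(x_hat_i, t_grid, B_r, B_p, B_q):
    """Form x_i and r_i from the estimated coefficients x_hat_i.
    B_r, B_p, B_q: basis matrices evaluated on t_grid, of shapes
    (m_i, r), (m_i, p) and (m_i, q) respectively."""
    dt = t_grid[1] - t_grid[0]            # regular grid assumed
    C_rp = (B_r.T @ B_p) * dt             # approximates \int b_r(t) b_p(t)^T dt, (r, p)
    C_rq = (B_r.T @ B_q) * dt             # approximates \int b_r(t) b_q(t)^T dt, (r, q)
    x_i = C_rp.T @ x_hat_i                # expert design vector, (p,)
    r_i = C_rq.T @ x_hat_i                # gating design vector, (q,)
    return x_i, r_i
```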

We use \(\mathcal {D}\) to estimate the parameter vector \(\varvec{\varPsi }\) by iteratively maximizing the observed data log-likelihood,

$$\begin{aligned} \log L(\varvec{\varPsi }) = \sum _{i=1}^{n}\log \sum _{k=1}^{K} \pi _{k}(\textbf{r}_i;\varvec{\xi }) \phi (y_{i};\beta _{k,0} + \varvec{\eta }_{k}^{\top } \textbf{x}_{i},\sigma ^2_k), \nonumber \\ \end{aligned}$$
(14)

via the EM algorithm. As detailed in “Appendix”, the EM algorithm for the FME model is implemented as follows. After starting with an initial solution \(\varvec{\varPsi }^{(0)}\), it alternates, at each iteration s, between the two following steps, until convergence (when there is no longer a significant change in the values of the log-likelihood (14)).

E-step Calculate the following conditional probability memberships \(\tau ^{(s)}_{ik}\) (for all \(i\in [n]\)), that the observed pair \((u_i, y_i)\) originates from the kth expert, given the observed data and the current parameter estimate \(\varvec{\varPsi }^{(s)}\),

$$\begin{aligned}{} & {} \tau _{ik}^{(s)}= \mathbb {P}(Z_i=k|y_i,u_i(\cdot );\varvec{\varPsi }^{(s)}) \nonumber \\{} & {} \quad = \frac{ \pi _{k}(\textbf{r}_i;\varvec{\xi }^{(s)}) \phi (y_{i};\beta ^{(s)}_{k,0} + \textbf{x}^{\top }_{i}\varvec{\eta }^{(s)}_{k},{\sigma ^2_k}^{(s)}) }{f(y_{i}|u_i(\cdot );\varvec{\varPsi }^{(s)})}\cdot \end{aligned}$$
(15)
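A vectorized sketch of this E-step, computing all responsibilities at once (self-contained, with illustrative array shapes; the reference gate K has logit zero as in (10)):

```python
import numpy as np
from scipy.special import softmax
from scipy.stats import norm

def e_step(y, X, R, alpha0, zeta, beta0, eta, sigma2):
    """Responsibilities tau_{ik} of Eq. (15) for all observations and experts.
    y: (n,), X: (n, p) rows x_i, R: (n, q) rows r_i,
    alpha0: (K-1,), zeta: (K-1, q), beta0: (K,), eta: (K, p), sigma2: (K,)."""
    n = y.shape[0]
    logits = np.hstack([alpha0 + R @ zeta.T, np.zeros((n, 1))])  # gate logits, (n, K)
    pi = softmax(logits, axis=1)                                 # pi_k(r_i; xi), Eq. (10)
    means = beta0 + X @ eta.T                                    # expert means, (n, K)
    dens = pi * norm.pdf(y[:, None], loc=means, scale=np.sqrt(sigma2))
    return dens / dens.sum(axis=1, keepdims=True)                # Eq. (15)
```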

M-step Update the value of the parameter vector \(\varvec{\varPsi }\) by maximizing the Q-function (43) with respect to \(\varvec{\varPsi }\). The maximization is performed by separate maximizations with respect to (w.r.t.) the gating network parameters \(\varvec{\xi }\) and, for each of the K experts, w.r.t. the expert network parameters \(\varvec{\theta }_k\).

Updating the gating network’s parameters \(\varvec{\xi }\) consists of maximizing w.r.t. \(\varvec{\xi }\) the part of (43) that is a function of \(\varvec{\xi }\). Since we use a softmax-gated network in (10), this maximization is a weighted multinomial logistic problem for which there is no closed-form solution. We then use a Newton–Raphson (NR) procedure, which iteratively maximizes (44) after starting from an initial parameter vector \(\varvec{\xi }^{(0)}\), by updating, at each NR iteration t, the values of the parameter vector \(\varvec{\xi }\) according to the following updating formula:

$$\begin{aligned} \varvec{\xi }^{(t+1)} = \varvec{\xi }^{(t)} - \Big [H(\varvec{\xi };\varvec{\varPsi }^{(s)})\Big ]_{\varvec{\xi }=\varvec{\xi }^{(t)}}^{-1} g(\varvec{\xi }; \varvec{\varPsi }^{(s)})\Big | _{\varvec{\xi }=\varvec{\xi }^{(t)}} \end{aligned}$$
(16)

where \(H(\varvec{\xi };\varvec{\varPsi }^{(s)})\) and \(g(\varvec{\xi }; \varvec{\varPsi }^{(s)})\) are, respectively, the Hessian matrix and the gradient vector of \(Q(\varvec{\xi }; \varvec{\varPsi }^{(s)})\), and are provided in “Appendix A”. At each NR iteration, the Hessian matrix and gradient vector are evaluated at the current value of \(\varvec{\xi }\). We keep updating the gating network parameter \(\varvec{\xi }\) according to (16) until there is no significant change in \(Q(\varvec{\xi };\varvec{\varPsi })\). The maximization then provides \(\varvec{\xi }^{(s+1)}\) for the next EM iteration. Alternatively, instead of using the NR procedure, one can employ the Minorize-Maximization (MM) algorithm to update the parameter \(\varvec{\xi }\) of the gating network. In this approach, the Hessian matrix in (16) is approximated by a square matrix that is independent of \(\varvec{\xi }\). This can offer an improved computational efficiency, although it comes at the cost of requiring a greater number of iterations. For further details, refer to Gormley and Murphy (2008).
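As an illustration of this sub-problem, the sketch below updates the gating parameters by numerically maximizing the weighted multinomial log-likelihood \(Q(\varvec{\xi };\varvec{\varPsi }^{(s)})\); for brevity it uses a generic quasi-Newton optimizer (BFGS) as a stand-in for the Newton–Raphson recursion (16), so it illustrates the objective rather than the exact update rule.

```python
import numpy as np
from scipy.optimize import minimize

def gating_Q(xi_flat, R, tau):
    """Weighted multinomial log-likelihood Q(xi; Psi^(s)) for the softmax gate (10).
    R: (n, q) gating design matrix, tau: (n, K) E-step responsibilities."""
    n, q = R.shape
    K = tau.shape[1]
    xi = xi_flat.reshape(K - 1, q + 1)        # row k holds (alpha_{k,0}, zeta_k)
    logits = np.hstack([xi[:, 0] + R @ xi[:, 1:].T, np.zeros((n, 1))])
    log_pi = logits - np.logaddexp.reduce(logits, axis=1, keepdims=True)
    return np.sum(tau * log_pi)

def update_gating(R, tau, xi_init=None):
    """M-step update of the gating parameters (stand-in for the NR updates (16))."""
    n, q = R.shape
    K = tau.shape[1]
    x0 = np.zeros((K - 1) * (q + 1)) if xi_init is None else np.ravel(xi_init)
    res = minimize(lambda v: -gating_Q(v, R, tau), x0, method="BFGS")
    return res.x.reshape(K - 1, q + 1)        # updated (alpha_{k,0}, zeta_k), k < K
```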

Updating the experts network parameters \(\varvec{\theta }_k\) consists of solving K independent weighted regression problems where the weights are the conditional expert memberships \(\tau ^{(s)}_{ik}\) given by (15). The updating formulas for the regression parameters \((\beta _{k,0},\varvec{\eta }_k)\) and the noise variances \(\sigma _{k}^{2}\) for each expert k are straightforward and correspond to weighted versions of those of standard Gaussian linear regression, i.e., weighted ordinary least squares. The updating rules for the experts network parameters are given by the following formulas:

$$\begin{aligned} \beta ^{(s+1)}_{k,0}&= \frac{1}{n^{(s)}_k}\sum _{i=1}^{n} \tau _{ik}^{(s)}(y_i - \textbf{x}_{i}^{\top }\varvec{\eta }^{(s)}_{k}), \nonumber \\ \varvec{\eta }^{(s+1)}_{k}&= \Big (\sum _{i=1}^{n}\tau ^{(s)}_{ik} \textbf{x}_i\textbf{x}_i^\top \Big )^{-1} \sum _{i=1}^{n}\tau ^{(s)}_{ik} (y_{i} - \beta _{k,0}^{(s+1)}) \textbf{x}_i, \\ {\sigma _{k}^2}^{(s+1)}&= \frac{1}{n^{(s)}_k} \sum _{i=1}^{n} \tau _{ik}^{(s)}\left[ y_i - (\beta ^{(s+1)}_{k,0} + \textbf{x}_{i}^{\top }\varvec{\eta }^{(s+1)}_{k})\right] ^2,\nonumber \end{aligned}$$
(17)

where \(n^{(s)}_k = \sum _{i=1}^n\tau _{ik}^{(s)}\) represents the expected cardinality of component k.
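A direct transcription of the updates (17) into weighted least-squares computations for a single expert (a sketch; variable names are illustrative):

```python
import numpy as np

def update_expert(X, y, tau_k, eta_prev):
    """Closed-form M-step updates (17) for one expert k.
    X: (n, p) design matrix with rows x_i, y: (n,), tau_k: (n,) responsibilities,
    eta_prev: (p,) current value eta_k^{(s)} used in the intercept update."""
    n_k = tau_k.sum()                                    # expected cluster size n_k^{(s)}
    beta0 = np.sum(tau_k * (y - X @ eta_prev)) / n_k     # intercept update
    A = (X * tau_k[:, None]).T @ X                       # sum_i tau_ik x_i x_i^T
    b = X.T @ (tau_k * (y - beta0))                      # sum_i tau_ik (y_i - beta0) x_i
    eta = np.linalg.solve(A, b)                          # weighted least-squares solution
    resid = y - (beta0 + X @ eta)
    sigma2 = np.sum(tau_k * resid ** 2) / n_k            # variance update
    return beta0, eta, sigma2
```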

This EM algorithm, designed here for the FME model constructed upon smoothing of the functional data, can be seen as a direct extension of the vectorial version of the ME model. While it can hence be expected to provide accurate estimates as in the vector predictors setting, the number of parameters to estimate in the FME case can still be high, for example when a large number of basis functions is used to obtain a more accurate approximation of the functional predictors. In that case, it is preferable to regularize the maximum likelihood estimator in order to establish a compromise between quality of fit and complexity.

3 Regularized maximum likelihood estimation

We rely on the LASSO (Tibshirani 1996) as a successful procedure to encourage sparse models in high-dimensional linear regression based on an \(\ell _1\)-penalty, and include it in this mixture of experts modeling framework for functional data. The \(\ell _1\)-regularized ME models have demonstrated their performance from a computational point of view (Chamroukhi and Huynh 2019; Huynh and Chamroukhi 2019) and enjoy good theoretical properties of universal approximation and consistent model selection in the high-dimensional setting (Nguyen et al. 2020, 2021b).

3.1 \(\ell _1\)-regularization and the EM-Lasso algorithm

We propose to maximize an \(\ell _1\)-regularized observed-data log-likelihood, given in (18) below, along with coordinate ascent algorithms to deal with the high-dimensional setting when updating the parameters within the resulting EM-Lasso algorithm. The objective function in this case is given by the following \(\ell _1\)-regularized observed-data log-likelihood,

$$\begin{aligned} \mathcal {L}(\varvec{\varPsi }) = \log L(\varvec{\varPsi }) - \text {Pen}_{\lambda ,\chi }(\varvec{\varPsi }), \end{aligned}$$
(18)

where \(\log L(\varvec{\varPsi })\) is the observed-data log-likelihood of \(\varvec{\varPsi }\) defined by (14), and \(\text {Pen}_{\lambda ,\chi }(\varvec{\varPsi })\) is a LASSO regularization term encouraging sparsity for the expert and the gating network parameters, defined by

$$\begin{aligned} \text {Pen}_{\lambda ,\chi }(\varvec{\varPsi }) = \lambda \sum _{k=1}^{K} \sum _{j = 1}^{p} |\eta _{k,j}| + \chi \sum _{k=1}^{K-1}\sum _{j = 1}^{q} |\zeta _{k,j}|, \end{aligned}$$
(19)

where \(\lambda \) and \(\chi \) are positive real values representing tuning parameters. The maximization of (18) cannot be performed in a closed form but again the EM algorithm can be adapted to iteratively solve the maximization problem. The resulting algorithm for the FME model, called EM-Lasso, takes the following form, as detailed in “Appendix B”. After starting with an initial solution \(\varvec{\varPsi }^{(0)}\), it alternates between the two following steps, until convergence, i.e., when there is no longer a significant change in the values of the \(\ell _1\)-penalized log-likelihood (18).

E-step The E-Step in this EM-Lasso algorithm is unchanged compared to the previously presented EM algorithm, and only requires the computation of the conditional expert memberships \(\tau ^{(s)}_{ik}\) according to (15).

M-step In this regularized MLE context, the parameter vector \(\varvec{\varPsi }\) is now updated by maximizing the regularized Q-function (46), i.e., \(\varvec{\varPsi }^{(s+1)} = \arg \max _{\varvec{\varPsi }}\left\{ Q(\varvec{\varPsi };\varvec{\varPsi }^{(s)}) - \text {Pen}_{\lambda ,\chi }(\varvec{\varPsi })\right\} .\) This is performed by separate maximizations w.r.t. the gating network parameters \(\varvec{\xi }\) and, for each expert k, w.r.t. the expert network parameters \(\varvec{\theta }_k\), \(k\in [K]\).

Updating the gating network parameters at iteration s of the EM-Lasso algorithm consists of maximizing the following regularized Q-function w.r.t. \(\varvec{\xi }\),

$$\begin{aligned} \mathcal {Q}_{\chi }(\varvec{\xi };\varvec{\varPsi }^{(s)}) = Q(\varvec{\xi };\varvec{\varPsi }^{(s)}) - \chi \sum _{k=1}^{K-1} \Vert \varvec{\zeta }_{k}\Vert _1, \end{aligned}$$
(20)

where \( Q(\varvec{\xi };\varvec{\varPsi }^{(s)}) = \sum _{i=1}^n\left[ \sum _{k=1}^{K-1} \tau ^{(s)}_{ik} \left( \alpha _{k,0} + \varvec{\zeta }^\top _{k}\textbf{r}_i\right) \right. \left. - \log \left( 1 + \sum _{k^\prime =1}^{K-1} \exp \{\alpha _{k^\prime ,0} + \varvec{\zeta }^\top _{k^\prime }\textbf{r}_i\} \right) \right] \). One can see that this is equivalent to solving a weighted regularized multinomial logistic regression problem, whose penalized log-likelihood is \(\mathcal {Q}_{\chi }(\varvec{\xi };\varvec{\varPsi }^{(s)})\) and whose weights are the conditional probabilities \(\tau ^{(s)}_{ik}\). There is no closed-form solution for this kind of problem. We then use an iterative optimization algorithm to seek a maximizer of \(\mathcal {Q}_{\chi }(\varvec{\xi };\varvec{\varPsi }^{(s)})\), i.e., an update for the parameters of the gating network. To be effective when the number of parameters to estimate is high, we propose a coordinate ascent algorithm to update the softmax gating network coefficients in this regularized context.

Coordinate ascent for updating the gating network. The idea of the coordinate ascent algorithm (e.g. see Hastie et al. (2015), Huynh and Chamroukhi (2019)), implemented in our context at the sth EM-Lasso iteration to maximize \(\mathcal {Q}_{\chi }(\varvec{\xi };\varvec{\varPsi }^{(s)})\) at the M-step, is as follows. The gating function parameter vectors \(\varvec{\xi }_k=(\alpha _{k,0},\varvec{\zeta }^\top _{k})^\top \), as components of the whole gating network parameter vector \(\varvec{\xi }=(\varvec{\xi }^\top _1,\ldots ,\varvec{\xi }^\top _{K-1})^\top \), are updated one at a time, while fixing the other gates’ parameters to their previous estimates. Furthermore, to update a single gating parameter vector \(\varvec{\xi }_{k}\), we only update its coefficients \(\xi _{kj}\) one at a time, while fixing the other coefficients to their previous values. More precisely, for each single gating function k, we partially approximate the smooth part of \(\mathcal {Q}_{\chi }(\varvec{\xi };\varvec{\varPsi }^{(s)})\) with respect to \(\varvec{\xi }_k\) at the current EM-Lasso estimate, say \(\varvec{\xi }^{(t)}\), then optimize the resulting objective function (with respect to \(\varvec{\xi }_k\)). This corresponds to solving penalized weighted least squares problems using coordinate ascent. Thus, this results in an inner loop, indexed by m, that cycles over the components of \(\varvec{\xi }_k\) and updates them one by one, while the others are kept fixed to their previous values, i.e., \(\xi _{kh}^{(m+1)} = \xi _{kh}^{(m)}\) for all \(h\ne j\), until the objective function (48) is no longer significantly increased.

The obtained closed form updates for each coefficient \(\zeta _{kj}\), \(j\in [q]\), and for the intercept \(\alpha _{k,0}\), are as follows

$$\begin{aligned}{} & {} \zeta _{kj}^{(m+1)} = \frac{{\mathcal {S}}_{\chi } \Big ( \sum _{i=1}^n w_{ik}\textrm{r}_{ij} (c_{ik} - \tilde{c}_{ikj}^{(m)}) \Big )}{ \sum _{i=1}^n w_{ik} \textrm{r}_{ij}^2 }~\text { for } j\in [q],\\{} & {} \alpha _{k,0}^{(m+1)} = \frac{ \sum _{i=1}^n w_{ik} (c_{ik} - \textbf{r}_i^{\top } \varvec{\zeta }_k^{(m+1)}) }{\sum _{i=1}^n w_{ik}}, \end{aligned}$$

where \(\tilde{c}_{ikj}^{(m)} = \alpha _{k0}^{(m)} + \textbf{r}_i^{\top }\varvec{\zeta }_k^{(m)} - \zeta _{kj}^{(m)}\textrm{r}_{ij}\) is the fitted value excluding the contribution of the jth component \(\textrm{r}_{ij}\) of the ith row of the gating network design matrix, and \({\mathcal {S}}_{\chi }(\cdot )\) is a soft-thresholding operator defined by \({\mathcal {S}}_{\chi }(u) = \textrm{sign}(u)(\vert u \vert - \chi )_+\), where \((v)_+\) is shorthand for \(\max \{v,0\}\). The values \((\alpha ^{(m+1)}_{k,0},\varvec{\zeta }^{(m+1)}_{k})\) obtained at convergence of the coordinate ascent inner loop for the kth gating function are taken as the fixed values of that gating function in the procedure of updating the next parameter vector \(\varvec{\xi }_{k+1}\). Finally, when all the gating functions have had their values updated, i.e., after \(K-1\) inner coordinate ascent loops, to avoid numerical instability we perform a backtracking line search before actually updating the gating network’s parameters for the next EM-Lasso iteration. More precisely, the update is \(\varvec{\xi }^{(t+1)} = (1-\nu )\varvec{\xi }^{(t)} + \nu \bar{\varvec{\xi }}^{(t)}\), where \(\bar{\varvec{\xi }}^{(t)}\) is the output after the \(K-1\) inner loops and \(\nu \) is determined by backtracking so as to ensure \(\mathcal {Q}_{\chi }(\varvec{\xi }^{(t+1)};\varvec{\varPsi }^{(s)}) \ge \mathcal {Q}_{\chi }(\varvec{\xi }^{(t)};\varvec{\varPsi }^{(s)})\).

We keep cycling these updated iterates for the parameter vectors \(\varvec{\xi }_k\), until convergence of the whole coordinate ascent (CA) procedure inside the M-Step, i.e., when the relative increase in the Lasso-regularized objective \(\mathcal {Q}_{\chi }(\varvec{\xi };\varvec{\varPsi }^{(s)})\) related to the softmax gating network is not significant, e.g., less than a small tolerance. Then, the next EM-Lasso iteration is calculated with the updated gating network’s parameters \(\varvec{\xi }^{(s+1)} = ({\widetilde{\alpha }}_{1,0},{\widetilde{\varvec{\zeta }}}^{\top }_{1},\ldots , {\widetilde{\alpha }}_{K-1,0},{\widetilde{\varvec{\zeta }}}^{\top }_{K-1})^\top \) where the values \({\widetilde{\alpha }}_{k,0}\) and \({\widetilde{\zeta }}_{kj}\) for all \(k\in [K-1], j \in [q]\) are those obtained for the \(\alpha _{k,0}\)’s and the \(\zeta _{kj}\)’s at convergence of the CA algorithm.
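For illustration, one coordinate-ascent cycle over \((\alpha _{k,0}, \varvec{\zeta }_k)\) can be written as below; the sketch assumes the working responses \(c_{ik}\) and weights \(w_{ik}\) of the partial quadratic approximation have already been computed from the current estimate (variable names are illustrative).

```python
import numpy as np

def soft_threshold(u, thr):
    """S_chi(u) = sign(u) (|u| - chi)_+"""
    return np.sign(u) * np.maximum(np.abs(u) - thr, 0.0)

def ca_cycle_gate(R, c_k, w_k, alpha0, zeta, chi):
    """One coordinate-ascent cycle for gate k (one pass of the inner loop on m).
    R: (n, q) gating design matrix, c_k: (n,) working responses, w_k: (n,) weights."""
    zeta = zeta.copy()
    for j in range(R.shape[1]):
        # fitted values excluding the contribution of coordinate j
        c_tilde = alpha0 + R @ zeta - zeta[j] * R[:, j]
        num = soft_threshold(np.sum(w_k * R[:, j] * (c_k - c_tilde)), chi)
        zeta[j] = num / np.sum(w_k * R[:, j] ** 2)
    alpha0 = np.sum(w_k * (c_k - R @ zeta)) / np.sum(w_k)
    return alpha0, zeta
```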

Updating the experts network parameters The maximization step for updating the expert parameters \(\varvec{\theta }_k\) consists of maximizing the function \(\mathcal {Q}_{\lambda }(\varvec{\theta }_k;\varvec{\varPsi }^{(s)})\) given by

$$\begin{aligned} \mathcal {Q}_{\lambda }(\varvec{\theta }_k;\varvec{\varPsi }^{(s)})= & {} Q(\varvec{\theta }_k;\varvec{\varPsi }^{(s)}) - \lambda \Vert \varvec{\eta }_k\Vert _1, \end{aligned}$$
(21)

where \(Q(\varvec{\theta }_k;\varvec{\varPsi }^{(s)}) = - \frac{1}{2\sigma _k^2}\sum _{i=1}^{n} \tau _{ik}^{(s)} \left( y_{i} - (\beta _{k,0} + \varvec{\eta }_{k}^{\top } \textbf{x}_{i})\right) ^2 - \frac{1}{2}\sum _{i=1}^{n} \tau _{ik}^{(s)}\log (2\pi \sigma _k^2)\cdot \) This corresponds to solving a weighted LASSO problem where the weights are the conditional expert memberships \(\tau _{ik}^{(s)}\). We then solve it by an iterative optimization algorithm, similarly to the previous case of updating the gating network parameters. As can be seen in “Appendix B.2”, updating \((\beta _{k,0}, \varvec{\eta }_k)\) according to (21) is done by coordinate ascent as follows. For each \(j\in [p]\), the closed-form update for \(\eta _{kj}\) at the mth iteration of the coordinate ascent algorithm within the M-step of EM-Lasso is given by

$$\begin{aligned}{} & {} \eta _{kj}^{(m+1)} = \frac{{\mathcal {S}}_{\lambda {\sigma _k^2}^{(s)}} \Big ( \sum _{i=1}^n \tau _{ik}^{(s)}\textrm{x}_{ij} (y_{i} - \tilde{y}_{ij}^{(m)}) \Big )}{ \sum _{i=1}^n \tau _{ik}^{(s)} \textrm{x}_{ij}^2 },\\{} & {} \beta _{k,0}^{(m+1)} = \frac{\sum _{i=1}^n\tau _{ik}^{(s)}(y_i - \textbf{x}_i^{\top }\varvec{\eta }_{k}^{(m+1)})}{\sum _{i=1}^n\tau _{ik}^{(s)}}, \end{aligned}$$

in which \(\tilde{y}_{ij}^{(m)} = \beta _{k,0}^{(m)} + \textbf{x}_i^{\top }\varvec{\eta }_k^{(m)} - \eta _{kj}^{(m)}\textrm{x}_{ij}\) is the fitted value excluding the contribution from \(\textrm{x}_{ij}\) and \(\mathcal S_{\lambda {\sigma _k^2}^{(s)}}(\cdot )\) is the soft-thresholding operator. We keep updating the components of \((\beta _{k,0},\varvec{\eta }_k)\) cyclically until there is no longer a significant increase in the objective function (21). Then, once \((\beta _{k,0}, \varvec{\eta }_{k})\) are updated while keeping the variance \(\sigma ^2_k\) fixed, the latter is updated straightforwardly, as in standard weighted Gaussian regression, by

$$\begin{aligned} {\sigma ^2_k}^{(s+1)} = \frac{\sum _{i=1}^n \tau _{ik}^{(s)} \left( y_i - \beta _{k,0}^{(s+1)} - \textbf{x}_i^{\top }\varvec{\eta }_k^{(s+1)}\right) ^2}{\sum _{i=1}^n \tau _{ik}^{(s)}}, \end{aligned}$$

where \((\beta _{k,0}^{(s+1)}, \varvec{\eta }_{k}^{(s+1)}) = ({\widetilde{\beta }}_{k,0}, {\widetilde{\varvec{\eta }}}_{k})\) is the solution obtained at convergence of the CA algorithm, which is taken as the update in the next EM-Lasso iteration.
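The expert update has the same coordinate-wise structure, with the responsibilities as weights and the threshold scaled by the current variance; a compact sketch reusing `soft_threshold` from the previous snippet (in the algorithm the variance update is applied only once the \((\beta _{k,0},\varvec{\eta }_k)\) cycles have converged; it is shown at the end here for compactness).

```python
import numpy as np

def ca_cycle_expert(X, y, tau_k, beta0, eta, lam, sigma2_k):
    """One coordinate-ascent cycle for expert k, followed by the variance update.
    X: (n, p) design matrix with rows x_i, y: (n,), tau_k: (n,) responsibilities."""
    eta = eta.copy()
    for j in range(X.shape[1]):
        y_tilde = beta0 + X @ eta - eta[j] * X[:, j]      # fit without coordinate j
        num = soft_threshold(np.sum(tau_k * X[:, j] * (y - y_tilde)), lam * sigma2_k)
        eta[j] = num / np.sum(tau_k * X[:, j] ** 2)
    beta0 = np.sum(tau_k * (y - X @ eta)) / np.sum(tau_k)
    sigma2_k = np.sum(tau_k * (y - beta0 - X @ eta) ** 2) / np.sum(tau_k)
    return beta0, eta, sigma2_k
```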

This completes the parameter vector update \(\varvec{\varPsi }^{(s+1)} = \big (\varvec{\xi }^{(s+1)}, \varvec{\theta }_1^{(s+1)}, \ldots , \varvec{\theta }_K^{(s+1)}\big )\) of the regularized FME model, where \(\varvec{\xi }^{(s+1)}\) and \(\varvec{\theta }_k^{(s+1)}, k\in [K]\), are, respectively, the updates of the gating network parameters and the experts network parameters, calculated by the coordinate ascent algorithms.

The EM-Lasso algorithm provides an estimate of the FME parameters with sparsity constraints on the values of the parameter vectors \(\varvec{\xi }\) and \(\varvec{\theta }_k\), \(k\in [K]\). However, since these parameter vectors do not directly relate the original functional inputs to the output, assuming that some of their components are zero is not necessarily justified: there is indeed no reason for these values to be zero, nor is such sparsity easily interpretable, in contrast to sparsity in vectorial generalized linear models, mixtures of regressions, and ME models.

From now on, we refer to the FME model fitted by the EM algorithm of Sect. 2.5 as FME, and to the regularized FME model fitted by the EM-Lasso algorithm of Sect. 3.1 as FME-Lasso. In the following section, we introduce a regularization that is interpretable and encourages sparsity in the FME model.

3.2 Interpretable Functional Mixture of Experts (iFME)

In this section, we provide a sparse and, especially, highly interpretable fit for the coefficient functions \(\{\beta _k(t), t\in \mathcal {T}\}\) and \(\{\alpha _k(t), t\in \mathcal {T}\}\) representing each of the K functional experts and the gating network. We call our approach Interpretable Functional Mixture of Experts (iFME). The presented iFME allows us to control the expert and gating parameter functions while still providing performance comparable to that of the standard FME model presented previously.

3.2.1 Interpretable sparse regularization

We rely on the methodology of Functional Linear Regression That’s Interpretable (FLiRTI) presented in James et al. (2009). The idea of the FLiRTI methodology is as follows. We use variable selection with a sparsity assumption on appropriately chosen derivatives of the coefficient function, say \(\beta _{z_i}(t)\) here in the case of the functional expert network, to produce a highly interpretable estimate of the coefficient functions \(\beta _{z_i}(t)\). For instance, \(\beta _{z_i}^{(0)}(t) = 0\) implies that the predictor \(X_i(t)\) has no effect on the response \(Y_i\) at t, \(\beta _{z_i}^{(1)}(t) = 0\) means that \(\beta _{z_i}(t)\) is constant in t, \(\beta _{z_i}^{(2)}(t) = 0\) means that \(\beta _{z_i}(t)\) is linear in t, etc. Assuming sparsity in higher-order derivatives of \(\beta _{z_i}(t)\), for instance when \(d=3\) or \(d=4\), will however give a less easily interpretable fit. Hence, for example, if one believes that the expert parameter function \(\beta _{z_i}(t)\) is exactly zero over a certain region and exactly linear over another region of t, then it is necessary to estimate \(\beta _{z_i}(t)\) such that \(\beta _{z_i}^{(0)}(t)=0\) and \(\beta _{z_i}^{(2)}(t)=0\) over those regions, respectively. In this situation, we need to model \(\beta _{z_i}(t)\) assuming that its zeroth and second derivatives are sparse. However, with the EM-Lasso method derived above via the Lasso regularization, there is actually no reason that those desired properties would be obtained, which may result in an estimate of \(\beta _{z_i}(t)\) that is rarely exactly zero (and/or linear), making the sparsity and the coefficient curves hard to interpret. The same situation may occur with the gating parameter functions. To handle this, we describe in what follows the construction of our iFME model, which produces flexible-shape and highly interpretable estimates of the expert and gating coefficient functions by simultaneously constraining any two of their derivatives to be sparse.

We start by selecting a p-dimensional basis \(\varvec{b}_p(t)\) and a q-dimensional basis \(\varvec{b}_q(t)\) for approximating the experts and gating networks, respectively. For the expert network, if we divide the time domain into a grid of p evenly spaced points, and let \(D^d\) be the dth finite difference operator defined recursively as

$$\begin{aligned} D \varvec{b}_p(t_j)= & {} p\left[ \varvec{b}_p(t_j) - \varvec{b}_p(t_{j-1})\right] , \\ D^2 \varvec{b}_p(t_j)= & {} D\left[ D \varvec{b}_p(t_j)\right] \\= & {} p^2\left[ \varvec{b}_p(t_j) - 2\varvec{b}_p(t_{j-1}) + \varvec{b}_p(t_{j-2})\right] , \\{} & {} \vdots \\ D^d \varvec{b}_p(t_j)= & {} D\left[ D^{d-1} \varvec{b}_p(t_j)\right] , \end{aligned}$$

then \(D^d \varvec{b}_p(t_j)\) provides an approximation for \(\varvec{b}_p^{(d)}(t_j) = [b_1^{(d)}(t_j), \ldots , b_p^{(d)}(t_j)]^\top \), \(j\in [p]\). Let

$$\begin{aligned}{} & {} \textbf{A}_p = \Big [ \underbrace{D^{d_1} \varvec{b}_p(t_1), D^{d_1} \varvec{b}_p(t_2), \ldots , D^{d_1} \varvec{b}_p(t_p)}_{{\textbf{A}_p^{[d_1]}}},\\{} & {} \quad \underbrace{D^{d_2}\varvec{b}_p(t_1), D^{d_2} \varvec{b}_p(t_2), \ldots , D^{d_2} \varvec{b}_p(t_p)}_{{\textbf{A}_p^{[d_2]}}} \Big ]^{\top } \end{aligned}$$

be the matrix of approximate \(d_1\)th and \(d_2\)th derivatives of the basis \(\varvec{b}_p(t)\). We denote by \(\textbf{A}_p^{[d_1]}\) the first p rows of \(\textbf{A}_p\) and by \(\textbf{A}_p^{[d_2]}\) the remainder, i.e., \(\textbf{A}_p = \big [\textbf{A}_p^{[d_1]}, \textbf{A}_p^{[d_2]}\big ]^{\top }\). One can see that such a matrix \(\textbf{A}_p\) lies in \(\mathbb {R}^{2p\times p}\) and that \(\textbf{A}_p^{[d_1]}\) is a square invertible matrix.
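To make the construction of \(\textbf{A}_p\) concrete, the following sketch builds the finite-difference matrices from a basis matrix evaluated on the grid \(t_1,\ldots ,t_p\); the boundary convention for the first rows (where \(t_{j-1}\) falls outside the grid) and the toy polynomial basis are illustrative assumptions, not the construction used in the paper's appendix.

```python
import numpy as np

def finite_diff_rows(B, d):
    """Rows D^d b_p(t_j) of Sect. 3.2.1, approximating the d-th derivatives.
    B: (p, p) matrix whose row j is b_p(t_j)^T on an evenly spaced grid."""
    p = B.shape[0]
    D = B.copy()
    for _ in range(d):
        D_new = np.zeros_like(D)
        D_new[1:] = p * (D[1:] - D[:-1])   # D b(t_j) = p [b(t_j) - b(t_{j-1})]
        D_new[0] = D_new[1]                # crude boundary convention (assumption)
        D = D_new
    return D

def build_A(B, d1, d2):
    """Stack A^{[d1]} over A^{[d2]}, giving the (2p x p) matrix A_p."""
    A_d1, A_d2 = finite_diff_rows(B, d1), finite_diff_rows(B, d2)
    return A_d1, A_d2, np.vstack([A_d1, A_d2])

# Toy usage with d1 = 0, d2 = 2 (the example discussed in the text),
# using a small polynomial basis purely for illustration.
p = 10
t = np.linspace(0.0, 1.0, p)
B = np.vander(t, p, increasing=True)       # row j: (1, t_j, t_j^2, ...)
A_d1, A_d2, A_p = build_A(B, d1=0, d2=2)
M = A_d2 @ np.linalg.inv(A_d1)             # maps gamma^{[d1]} to gamma^{[d2]}, cf. (24)
```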

Now, if we consider the following representation for the expert network coefficient function

$$\begin{aligned} \varvec{\gamma }_{z_i} = \textbf{A}_p \varvec{\eta }_{z_i} \end{aligned}$$
(22)

with \(\varvec{\gamma }_{z_i} = \big ({\varvec{\gamma }^{[d_1]}_{z_i}}^\top \hspace{-2pt},{\varvec{\gamma }^{[d_2]}_{z_i}}^\top \big )^\top \), where \(\varvec{\gamma }^{[d_1]}_{z_i} = \big (\gamma ^{[d_1]}_{1,{z_i}}, \ldots , \gamma ^{[d_1]}_{p,{z_i}}\big )^\top \), \(\varvec{\gamma }^{[d_2]}_{z_i} = \big (\gamma ^{[d_2]}_{1,{z_i}}, \ldots , \gamma ^{[d_2]}_{p,{z_i}}\big )^\top \), and \(\varvec{\eta }_{z_i}\) defined in relation to \(\beta _{z_i}(t)\) as in (7), then one can see that \(\varvec{\gamma }^{[d_1]}_{z_i}\) provides an approximation to \(\beta _{z_i}^{(d_1)}(t)\), i.e. the \(d_1\)th derivative of \(\beta _{z_i}(t)\). Respectively, \(\varvec{\gamma }^{[d_2]}_{z_i}\) provides an approximation to \(\beta _{z_i}^{(d_2)}(t)\), the \(d_2\)th derivative of \(\beta _{z_i}(t)\). Therefore, enforcing sparsity in \(\varvec{\gamma }_{z_i}\) will constrain \(\beta _{z_i}^{(d_1)}(t)\) and \(\beta _{z_i}^{(d_2)}(t)\) to be zero at most time points.

Similarly, let \(\textbf{A}_q = \big [\textbf{A}_q^{[d_1]}, \textbf{A}_q^{[d_2]}\big ]^{\top } \in \mathbb {R}^{2q\times q}\) be the corresponding matrix defined for the gating network. If we consider the following representation for the gating network coefficient function

$$\begin{aligned} \varvec{\omega }_{z_i} = \textbf{A}_q \varvec{\zeta }_{z_i} \end{aligned}$$
(23)

with \(\varvec{\omega }_{z_i} = \big ({\varvec{\omega }^{[d_1]}_{z_i}}^\top \hspace{-2pt}, {\varvec{\omega }^{[d_2]}_{z_i}}^\top \big )^\top \), where \(\varvec{\omega }^{[d_1]}_{z_i} = \big (\omega ^{[d_1]}_{1,{z_i}}, \ldots , \omega ^{[d_1]}_{q,{z_i}}\big )^\top \), \(\varvec{\omega }^{[d_2]}_{z_i} = \big (\omega ^{[d_2]}_{1,{z_i}},\ldots ,\omega ^{[d_2]}_{q,{z_i}}\big )^\top \), and \(\varvec{\zeta }_{z_i}\) defined in relation to \(\alpha _{z_i}(t)\) as in (8), then we can derive the same interpretation as above for the coefficient vector \(\varvec{\omega }_{z_i}\) and the gating parameter function \(\alpha _{z_i}(t)\).

Using simple calculations, the representations (22) and (23) imply the following relations, which are subsequently used in the iFME model:

$$\begin{aligned} \varvec{\eta }_{z_i} = {\big (\textbf{A}_p^{[d_1]}\big )}^{-1} \varvec{\gamma }_{z_i}^{[d_1]}, \qquad \varvec{\gamma }_{z_i}^{[d_2]} = \textbf{A}_p^{[d_2]}{\big (\textbf{A}_p^{[d_1]}\big )}^{-1} \varvec{\gamma }_{z_i}^{[d_1]}, \end{aligned}$$
(24)

and

$$\begin{aligned} \varvec{\zeta }_{z_i} = {\big (\textbf{A}_q^{[d_1]}\big )}^{-1} \varvec{\omega }_{z_i}^{[d_1]}, \qquad \varvec{\omega }_{z_i}^{[d_2]} = \textbf{A}_q^{[d_2]}{\big (\textbf{A}_q^{[d_1]}\big )}^{-1} \varvec{\omega }_{z_i}^{[d_1]}, \end{aligned}$$
(25)

respectively.

In fact, one can construct \(\textbf{A}_p\) and \(\textbf{A}_q\) with only one derivative. The constraints involving the \(d_2\)th derivative are then eliminated, making the estimation easier but also limiting the flexibility of the shapes of the functions. That is why, in this construction and in our experimental studies, \(\textbf{A}_p\) and \(\textbf{A}_q\) are constructed with multiple derivatives, in order to produce curves of \(\beta _{z_i}(\cdot )\) and \(\alpha _{z_i}(\cdot )\) with the desired properties.

3.2.2 The iFME model

Combining the stochastic representation of the FME model in (11) for the experts model, the linear predictor definition in (9), and the relations (24)-(25), we obtain the following iFME model construction

$$\begin{aligned} Y_i|u_i(\cdot )= & {} \beta _{z_i,0} + {\varvec{\gamma }_{z_i}^{[d_1]}}^\top \textbf{v}_i + \varepsilon _i^\star , \end{aligned}$$
(26)
$$\begin{aligned} h_{z_i}(u_i(\cdot );\varvec{\alpha })= & {} \alpha _{z_i,0} + {\varvec{\omega }_{z_i}^{[d_1]}}^\top \textbf{s}_i, \end{aligned}$$
(27)

subject to

$$\begin{aligned} \varvec{\gamma }_{z_i}^{[d_2]} = \textbf{A}_p^{[d_2]}{\textbf{A}_p^{[d_1]}}^{-1} \varvec{\gamma }_{z_i}^{[d_1]} \quad \text {and}\quad \varvec{\omega }_{z_i}^{[d_2]} = \textbf{A}_q^{[d_2]}{\textbf{A}_q^{[d_1]}}^{-1} \varvec{\omega }_{z_i}^{[d_1]}, \nonumber \\ \end{aligned}$$
(28)

where \(\textbf{v}_i = \big ({\textbf{A}^{[d_1]}_p}^{-1}\big )^{\hspace{-2pt}\top }\textbf{x}_i\) is the new design vector for the experts and \(\textbf{s}_i = \big ({\textbf{A}^{[d_1]}_q}^{-1}\big )^{\hspace{-2pt}\top }\textbf{r}_i\) the new one for the gating network. Hence, from (26) and (27), the conditional density of each expert and the gating network are now written as

$$\begin{aligned} f(y_i|u_i(\cdot );\varvec{\psi }_k) = \phi (y_{i};\beta _{k,0} + {\varvec{\gamma }_{k}^{[d_1]}}^{\top } \textbf{v}_{i},\sigma ^2_{k}) \end{aligned}$$
(29)

and

$$\begin{aligned} \pi _{k}(\textbf{s}_i;\textbf{w}) = \frac{\exp {\{\alpha _{k,0}+ {\varvec{\omega }_k^{[d_1]}}^\top \textbf{s}_i \}}}{1+\sum _{k^\prime =1}^{K-1}\exp {\{\alpha _{k^\prime ,0}+{\varvec{\omega }^{[d_1]}_{k^\prime }}^\top \textbf{s}_i \}}}, \end{aligned}$$
(30)

where \(\varvec{\psi }_k=\big (\beta _{k,0}, {\varvec{\gamma }^{[d_1]}_k}^\top , \sigma ^2_k\big )^\top \) is the unknown parameter vector of expert component density k and \(\textbf{w}=\left( \alpha _{1,0},{\varvec{\omega }^{[d_1]}_1}^\top ,\ldots ,\alpha _{K-1,0}, {\varvec{\omega }^{[d_1]}_{K-1}}^{\hspace{-4pt}\top }\right) ^{\hspace{-2pt}\top }\), with \(\big (\alpha _{K,0}, {\varvec{\omega }^{[d_1]}_{K}}^{\top }\big )^\top \) a null vector, is the unknown parameter vector of the gating network. Finally, gathering (29) and (30) as for (13), the iFME model density is now given by

$$\begin{aligned} f(y_i|u_i(\cdot );\varvec{\varPsi }) = \sum _{k=1}^{K} \pi _{k}(\textbf{s}_i;\textbf{w}) \phi (y_{i};\beta _{k,0} + {\varvec{\gamma }^{[d_1]}_{k}}^{\top }\hspace{-2pt}\textbf{v}_i,\sigma ^2_k),\nonumber \\ \end{aligned}$$
(31)

where \(\varvec{\varPsi }= (\textbf{w}^\top , \varvec{\psi }_1^\top , \ldots , \varvec{\psi }_K^\top )^\top \) is the parameter vector of the model to be estimated. The iFME model is thus constructed upon the parameter vectors \(\varvec{\gamma }_k\) and \(\varvec{\omega }_k\), for which sparsity is assumed in order to obtain interpretable estimates.

3.3 Regularized MLE via the EM-iFME algorithm

In order to fit the iFME model and to maintain the sparsity in \(\varvec{\gamma }_k\) and \(\varvec{\omega }_k\), the following EM-iFME algorithm is then developed to maximize the penalized log-likelihood function

$$\begin{aligned} \mathcal {L}(\varvec{\varPsi }) = \sum _{i=1}^{n}\log f(y_i|u_i(\cdot );\varvec{\varPsi }) - \text {Pen}_{\lambda ,\chi }(\varvec{\varPsi }) \end{aligned}$$
(32)

where the conditional iFME density \(f(y_i|u_i(\cdot );\varvec{\varPsi })\) is defined in (31) and the new sparse and interpretable regularization term is given by

$$\begin{aligned}{} & {} \text {Pen}_{\lambda ,\chi }(\varvec{\varPsi }) = \lambda \sum _{k=1}^{K} \left( \Vert \varvec{\gamma }_k^{[d_1]} \Vert _1 + \rho \Vert \varvec{\gamma }_k^{[d_2]} \Vert _1\right) \nonumber \\{} & {} \quad + \chi \sum _{k=1}^{K-1} \left( \Vert \varvec{\omega }_k^{[d_1]} \Vert _1 + \varrho \Vert \varvec{\omega }_k^{[d_2]} \Vert _1\right) , \end{aligned}$$
(33)

where \(\rho \) and \(\varrho \) are, respectively, the weights associated with the \(d_2\)th derivatives of the expert and gating parameter functions. The introduction of the weighting parameters \(\rho \) and \(\varrho \), besides the usual regularization parameters \(\lambda \) and \(\chi \), is motivated by the fact that one may wish to place a greater emphasis on sparsity in the \(d_2\)th derivative than in the \(d_1\)th derivative of the parameter functions, or vice versa (their usage will be illustrated in the subsequent experimental study). In practice, the selection of \(\rho \) and \(\varrho \) is more a matter of whether they are greater or less than one (i.e., of the emphasis placed on the \(d_2\)th derivative) than of selecting an exact value.

Note that, firstly, unlike the previous FME-Lasso, in the iFME model the regularization operates on the derivative coefficients \(\varvec{\gamma }_k\) rather than on the functional coefficients \(\varvec{\eta }_k\) for the experts, and on the derivative coefficients \(\varvec{\omega }_k\) rather than on the functional coefficients \(\varvec{\zeta }_k\) for the gating network. Secondly, maximizing the penalized log-likelihood function (32) with the penalization (33) in the iFME model must be coupled with the constraints (28). The two steps of the proposed EM-iFME algorithm are as follows.

E-Step The E-Step for the new iFME model calculates for each observation the conditional probability memberships of each expert k

$$\begin{aligned}{} & {} \tau _{ik}^{(s)}= \mathbb {P}(Z_i=k|y_i,u_i(\cdot );\varvec{\varPsi }^{(s)}) \nonumber \\{} & {} \quad = \frac{ \pi _{k}(\textbf{s}_i;\textbf{w}^{(s)})\, \phi (y_{i};\beta ^{(s)}_{k,0} + \textbf{v}^{\top }_{i}{\varvec{\gamma }^{[d_1]}_k}^{(s)},{\sigma ^2_k}^{(s)}) }{f(y_{i}|u_i(\cdot );\varvec{\varPsi }^{(s)})}, \end{aligned}$$
(34)

where \(f(y_{i}|u_i(\cdot );\varvec{\varPsi }^{(s)})\) is now calculated according to the iFME density given by (31).

M-Step The maximization is performed by separate maximizations w.r.t. the gating network parameters \(\textbf{w}\) and the experts network parameters \(\varvec{\psi }_k\)’s.

Updating the gating network parameters. The maximization step for updating the gating network parameters \(\textbf{w}=\left( \alpha _{1,0},{\varvec{\omega }^{[d_1]}_1}^\top ,\ldots ,\alpha _{K-1,0},{\varvec{\omega }^{[d_1]}_{K-1}} ^{\hspace{-4pt}\top }\right) ^{\hspace{-2pt}\top }\) consists of maximizing the function \(\mathcal {Q}_{\chi }(\textbf{w};\varvec{\varPsi }^{(s)})\) given by

$$\begin{aligned}{} & {} \mathcal {Q}_{\chi }(\textbf{w};\varvec{\varPsi }^{(s)}) = Q(\textbf{w};\varvec{\varPsi }^{(s)}) \nonumber \\{} & {} \quad - \chi \sum _{k=1}^{K-1} \left( \Vert \varvec{\omega }_k^{[d_1]} \Vert _1 + \varrho \Vert \varvec{\omega }_k^{[d_2]} \Vert _1\right) , \end{aligned}$$
(35)

subject to

$$\begin{aligned} \varvec{\omega }_{k}^{[d_2]} = \textbf{A}_q^{[d_2]}{\textbf{A}_q^{[d_1]}}^{-1} \varvec{\omega }_{k}^{[d_1]}, \quad \forall k\in [K-1], \end{aligned}$$
(36)

where \(Q(\textbf{w};\varvec{\varPsi }^{(s)}) = \sum _{i=1}^n\left[ \sum _{k=1}^{K-1} \tau ^{(s)}_{ik} \left( \alpha _{k,0} + {\varvec{\omega }_{k}^{[d_1]}}^\top \textbf{s}_i\right) \right. \left. - \log \left( 1 + \sum _{k^\prime =1}^{K-1} \exp \{\alpha _{k^\prime ,0} + {\varvec{\omega }_{k^\prime }^{[d_1]}}^\top \textbf{s}_i\} \right) \right] \cdot \) This is a constrained version of the weighted regularized multinomial logistic regression problem, where the weights are the conditional probabilities \(\tau ^{(s)}_{ik}\).

To solve it, in the same spirit as when updating the gating network in the previous EM-Lasso algorithm, we use an outer loop that cycles over the gating function parameters to update them one by one. However, to update a single gating function parameter \(\textbf{w}_k=(\alpha _{k,0}, {\varvec{\omega }_{k}^{[d_1]}}^\top )^\top \), \(k\in [K-1]\), since the maximization problem (35) is now coupled with an additional constraint (36), rather than using a coordinate ascent algorithm as in EM-Lasso, we use the following alternative approach. For each single gating network k, using a Taylor series expansion, we partially approximate the smooth part of \(\mathcal {Q}_{\chi }(\textbf{w};\varvec{\varPsi }^{(s)})\) defined in (35) w.r.t. \(\textbf{w}_k\) at the current estimate \(\textbf{w}^{(t)}\), then maximize the resulting objective function (w.r.t. \(\textbf{w}_k\)), subject to the corresponding constraint (w.r.t. k) in (36). It corresponds to solving the following penalized weighted least squares problem with constraints,

$$\begin{aligned} \begin{aligned}&\underset{(\alpha _{k,0},\ \varvec{\omega }_k^{[d_1]}, \varvec{\omega }_{k}^{[d_2]})}{\max } -\frac{1}{2} \sum _{i=1}^n w_{ik} \left( c_{ik} - \alpha _{k,0} - \textbf{s}_i^{\top }\varvec{\omega }^{[d_1]}_k\right) ^2 \\&\quad - \chi \left( \Vert \varvec{\omega }_k^{[d_1]} \Vert _1 + \varrho \Vert \varvec{\omega }_k^{[d_2]} \Vert _1\right) \\&\quad \text {subject to }\quad \varvec{\omega }_{k}^{[d_2]} = \textbf{A}_q^{[d_2]}{\textbf{A}_q^{[d_1]}}^{-1} \varvec{\omega }_{k}^{[d_1]}, \end{aligned} \end{aligned}$$
(37)

where \(w_{ik} = \pi _k(\textbf{w}^{(t)};\textbf{s}_i)\left( 1-\pi _k(\textbf{w}^{(t)};\textbf{s}_i)\right) \) are the weights and \(c_{ik} = \alpha _{k,0}^{(t)} + \textbf{s}_i^{\top }{\varvec{\omega }^{[d_1]}_k}^{(t)} + \frac{\tau _{ik}^{(s)} - \pi _k(\textbf{w}^{(t)};\textbf{s}_i)}{w_{ik}}\) are the working responses computed given the current estimate \(\textbf{w}^{(t)}\). This problem can be solved by quadratic programming (see, e.g., Gaines et al. (2018)) or by using the Dantzig selector (Candes et al. 2007), which we opt to use in our experimental studies. The details of using Dantzig selector to solve problem (37) are given in “Appendix C.1”.
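To fix ideas, the quantities entering the quadratic surrogate (37) can be computed as in the short Python sketch below (variable names and shapes are ours, not the authors' implementation); the resulting constrained \(\ell _1\)-penalized weighted least squares problem is then handed to a quadratic-programming or Dantzig-selector routine.

```python
import numpy as np

def gating_working_quantities(S, tau_k, alpha_k0, omega_k, pi_k, eps=1e-10):
    """IRLS weights w_ik and working responses c_ik of problem (37) for gate k,
    evaluated at the current estimate w^{(t)}.

    S        : (n, q) design matrix of projected functional predictors s_i
    tau_k    : (n,) posterior probabilities tau_ik^{(s)} from the E-step
    alpha_k0 : current intercept of gate k
    omega_k  : (q,) current coefficient vector omega_k^{[d1]}
    pi_k     : (n,) current gating probabilities pi_k(s_i; w^{(t)})
    """
    w_ik = np.clip(pi_k * (1.0 - pi_k), eps, None)   # weights, guarded away from zero
    eta_k = alpha_k0 + S @ omega_k                   # current linear predictor
    c_ik = eta_k + (tau_k - pi_k) / w_ik             # working responses
    return w_ik, c_ik
```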

Therefore, if \(({\widetilde{\alpha }}_{k,0}, {\widetilde{\varvec{\omega }}}_k^{[d_1]}, {\widetilde{\varvec{\omega }}}_{k}^{[d_2]})\) is an optimal solution to problem (37), then \({\widetilde{\textbf{w}}}_k = ({\widetilde{\alpha }}_{k,0}, {{\widetilde{\varvec{\omega }}}_k^{[d_1]}}{}^\top )^\top \) is taken as an update for the gating parameter vector \(\textbf{w}_k\). We keep cycling over \(k\in [K-1]\) until there is no significant increase in the regularized \(Q\)-function (35).

Updating the experts network parameters. The maximization step for updating the expert parameter vector \(\varvec{\psi }_k=(\beta _{k,0}, {\varvec{\gamma }_k^{[d_1]}}^\top , \sigma _k^2)^\top \) consists of solving the following problem:

$$\begin{aligned} \begin{aligned}&\underset{(\beta _{k,0},\ \varvec{\gamma }_k^{[d_1]}, \varvec{\gamma }_k^{[d_2]}, \sigma _k^2)}{\max }\ \ \sum _{i=1}^{n} \tau _{ik}^{(s)} \log \phi \left( y_{i};\beta _{k,0} + \textbf{v}_i^\top \varvec{\gamma }_k^{[d_1]},\sigma ^2_k\right) \\&\quad - \lambda \left( \Vert \varvec{\gamma }_k^{[d_1]} \Vert _1 + \rho \Vert \varvec{\gamma }_k^{[d_2]} \Vert _1\right) \\&\quad \text {subject to}\quad \ \ \varvec{\gamma }_k^{[d_2]} = \textbf{A}_p^{[d_2]}{\textbf{A}_p^{[d_1]}}^{-1} \varvec{\gamma }_k^{[d_1]}. \end{aligned} \end{aligned}$$
(38)

As in the previous EM-Lasso algorithm, we first fix \(\sigma ^2_k\) to its previous estimate and perform the update for \((\beta _{k,0}, \varvec{\gamma }_{k}^{[d_1]})\), which corresponds to solving a penalized weighted least squares problem with constraints. This can be performed by using the Dantzig selector, in the same manner as for solving problem (37). The corresponding technical details can be found in “Appendix C.2”.

Once \((\beta _{k,0}, \varvec{\gamma }_{k}^{[d_1]})\) has been updated, the straightforward update for the variance \(\sigma ^2_k\) is given by the standard estimate of a weighted Gaussian regression. More specifically, let \(({\widetilde{\beta }}_{k,0}, {\widetilde{\varvec{\gamma }}}_k^{[d_1]}, {\widetilde{\varvec{\gamma }}}_k^{[d_2]})\) be the solution to problem (38) (with \(\sigma ^2_k\) fixed to \({\sigma ^2_k}^{(s)}\)); the updates for the expert parameter vector \(\varvec{\psi }_k\) are then given by

$$\begin{aligned}{} & {} (\beta _{k,0}^{(s+1)}, {\varvec{\gamma }_k^{[d_1]}}^{(s+1)}) = ({\widetilde{\beta }}_{k,0}, {\widetilde{\varvec{\gamma }}}_k^{[d_1]}),\\{} & {} {\sigma ^2_k}^{(s+1)} = \frac{\sum _{i=1}^n \tau _{ik}^{(s)} \left( y_i - \beta _{k,0}^{(s+1)} - \textbf{v}_i^{\top }{\varvec{\gamma }_k^{[d_1]}}^{(s+1)}\right) ^2}{\sum _{i=1}^n \tau _{ik}^{(s)}}\cdot \end{aligned}$$
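For completeness, the closed-form variance update can be written as in the following minimal Python sketch (illustrative names and shapes):

```python
import numpy as np

def update_sigma2_k(y, V, tau_k, beta_k0, gamma_k):
    """Weighted-Gaussian-regression variance update for expert k, computed with
    the freshly updated (beta_k0, gamma_k).

    y      : (n,) responses
    V      : (n, p) design matrix of projected predictors v_i
    tau_k  : (n,) posterior probabilities tau_ik^{(s)}
    """
    resid = y - beta_k0 - V @ gamma_k
    return np.sum(tau_k * resid ** 2) / np.sum(tau_k)
```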

Thus, at the end of the M-Step, we obtain a parameter vector update \(\varvec{\varPsi }^{(s+1)} = (\textbf{w}^{(s+1)},\varvec{\psi }_1^{(s+1)}, \ldots , \varvec{\psi }_K^{(s+1)})\) for the next EM iteration, where \(\textbf{w}^{(s+1)}\) and \(\varvec{\psi }_k^{(s+1)}\), \(k\in [K]\), are, respectively, the updates of the gating network parameters and the experts network parameters, calculated by the two procedures described above. Alternating the E-Step and M-Step until convergence, i.e., when there is no longer a significant change in the values of the penalized log-likelihood (32), leads to a penalized maximum likelihood estimate \(\widehat{\varvec{\varPsi }}\) for \(\varvec{\varPsi }\).

Estimating the coefficient functions Finally, once the parameter vector of the iFME model has been estimated, the coefficient functions of the gating network \(\alpha _k(t)\), \(k\in [K-1]\), and those of the experts network \(\beta _k(t)\), \(k\in [K]\), can be reconstructed using the following formulas

$$\begin{aligned} \begin{aligned} \widehat{\alpha }_{k}(t)&= \varvec{b}_q(t)^\top \textbf{A}_q^{-1}\widehat{\varvec{\omega }}^{[d_1]}_{k},\\ \widehat{\beta }_{k}(t)&= \varvec{b}_p(t)^\top \textbf{A}_p^{-1}\widehat{\varvec{\gamma }}^{[d_1]}_{k}, \end{aligned} \end{aligned}$$
(39)

where \(\widehat{\varvec{\omega }}^{[d_1]}_{k}\) and \(\widehat{\varvec{\gamma }}^{[d_1]}_{k}\) are respectively the regularized maximum likelihood estimates for \(\varvec{\omega }_{k}^{[d_1]}\) and \(\varvec{\gamma }_{k}^{[d_1]}\).
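A minimal sketch of this reconstruction step is given below; the basis evaluation and the matrix \(\textbf{A}\) are assumed to be available from the projection stage, and the names are ours.

```python
import numpy as np

def reconstruct_coefficient_function(t_grid, basis_eval, A, coef_hat):
    """Reconstruction (39) of a gating or expert coefficient function on a grid.

    t_grid     : (m,) evaluation time points
    basis_eval : callable returning the (m, q) matrix whose rows are b_q(t_j)^T
    A          : (q, q) matrix A_q (or A_p) from the projection step
    coef_hat   : (q,) estimated coefficients omega_hat^{[d1]} (or gamma_hat^{[d1]})
    """
    B = basis_eval(t_grid)
    return B @ np.linalg.solve(A, coef_hat)   # b_q(t)^T A^{-1} coef_hat
```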

3.4 Non-linear regression and clustering with FME models

Once the model parameters have been estimated, one can then construct an estimate of the unknown functional non-linear regression function, as well as a clustering of the data into groups of similar pairs of observations.

For functional non-linear regression, the unknown non-linear regression function with functional predictors is the conditional expectation \({\widehat{y}}|u(\cdot ) = \mathbb {E}[Y|U(\cdot );{\widehat{\varvec{\varPsi }}}]\), which is given by \(\widehat{y}_i|u_i(\cdot ) =\sum _{k=1}^{K} \pi _{k}(\textbf{r}_i;\widehat{\varvec{\xi }})({\widehat{\beta }}_{k,0} + {\widehat{\varvec{\eta }}}_{k}^{\top } \textbf{x}_{i})\) for the FME model (13), and by \(\widehat{y}_i|u_i(\cdot ) =\sum _{k=1}^{K} \pi _{k}(\textbf{s}_i;{\widehat{\textbf{w}}}) (\widehat{\beta }_{k,0} + {\widehat{\varvec{\gamma }}_{k}^{[d_1]}}{}^{\top }\textbf{v}_i)\) for the iFME model (31).
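For instance, for the iFME model this conditional-mean prediction combines softmax gates with linear experts, as in the following sketch (array names are illustrative):

```python
import numpy as np

def predict_ifme(S, V, W, beta0, Gamma):
    """Conditional-mean prediction y_hat | u(.) for the iFME model.

    S     : (n, q) gating design (projected predictors s_i)
    V     : (n, p) expert design (projected predictors v_i)
    W     : (K-1, 1+q) gating parameters, row k = (alpha_{k,0}, omega_k^{[d1]})
    beta0 : (K,) expert intercepts
    Gamma : (K, p) expert coefficient vectors gamma_k^{[d1]}
    """
    logits = W[:, 0] + S @ W[:, 1:].T                      # (n, K-1)
    logits = np.column_stack([logits, np.zeros(len(S))])   # K-th gate as reference
    pi = np.exp(logits - logits.max(axis=1, keepdims=True))
    pi /= pi.sum(axis=1, keepdims=True)                    # gating probabilities
    experts = beta0 + V @ Gamma.T                          # (n, K) expert means
    return np.sum(pi * experts, axis=1)                    # E[Y | u(.)]
```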

For clustering, a soft partition of the data into K clusters, represented by the estimated posterior probabilities \(\widehat{\tau }_{ik} = \mathbb {P}(Z_i=k|u_i,y_i;{\widehat{\varvec{\varPsi }}})\), is obtained. A hard partition can also be computed according to the Bayes optimal allocation rule, that is, by assigning each curve to the component with the highest estimated posterior probability \(\widehat{\tau }_{ik}\), defined by (15) for the FME model or by (34) for the iFME model, evaluated at the (penalized) maximum likelihood estimate \(\widehat{\varvec{\varPsi }}\) of \(\varvec{\varPsi }\):

$$\begin{aligned} \widehat{z}_i = \arg \max _{1\le k \le K} \tau _{ik}(\widehat{\varvec{\varPsi }}), \quad i\in [n], \end{aligned}$$

where \(\widehat{z}_i\) denotes the estimated cluster label for the ith curve.
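In practice, once the estimated posterior probabilities are collected into an \(n \times K\) array, this allocation rule reduces to a single argmax, as in the illustrative snippet below (the numerical values are purely illustrative):

```python
import numpy as np

# tau_hat: (n, K) array of estimated posterior probabilities tau_ik(Psi_hat).
tau_hat = np.array([[0.70, 0.20, 0.10],
                    [0.05, 0.35, 0.60]])
z_hat = tau_hat.argmax(axis=1) + 1   # Bayes allocation rule, labels in {1, ..., K}
```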

3.5 Tuning parameters and model selection

In practice, appropriate values of the tuning parameters should be chosen. For FME, this covers the selection of K, the number of experts, and of r, p, and q, the dimensions of the B-spline bases used to approximate, respectively, the predictors, the experts, and the gating network functions, although these can be chosen to be equal. For FME-Lasso, the \(\ell _1\) penalty constants \(\chi \) and \(\lambda \) in (19) should additionally be chosen. For the iFME model, the tuning parameters also include \(d_1\) and \(d_2\), the two derivative orders involved in the sparsity constraints, and \(\rho \) and \(\varrho \), the weights associated with the \(d_2\)th derivative of the expert and gating functions (see (33)).

The selection of the tuning parameters can be performed by a cross–validation procedure with a grid search scheme to select the best combination. An alternative is to use the well-known BIC criterion (Schwarz 1978) or, in our paper, its extension, called modified BIC (Städler et al. 2010) defined as

$$\begin{aligned} \text {mBIC} = L(\widehat{\varvec{\varPsi }}) - \text {df}(\widehat{\varvec{\varPsi }})\frac{\log n}{2}, \end{aligned}$$
(40)

where \(\widehat{\varvec{\varPsi }}\) is the maximum likelihood estimate (for the FME model) or the penalized maximum likelihood estimate (for the FME-Lasso and iFME models), \(L(\widehat{\varvec{\varPsi }})\) is the corresponding log-likelihood value, and the number of degrees of freedom \(\text {df}(\widehat{\varvec{\varPsi }})\) is the effective number of parameters of the model, given by

$$\begin{aligned} \text {df}(\widehat{\varvec{\varPsi }}) = {\left\{ \begin{array}{ll} \text {df}(\varvec{\zeta }) + (K-1) + \text {df}(\varvec{\eta }) + K + K &{} \quad \text {for the FME and FME-Lasso models,}\\ \text {df}(\varvec{\omega }) + (K-1) + \text {df}(\varvec{\gamma }) + K + K &{} \quad \text {for the iFME model}, \end{array}\right. } \end{aligned}$$

in which the quantities \(\text {df}(\varvec{\zeta })\), \(\text {df}(\varvec{\eta })\), \(\text {df}(\varvec{\omega })\) and \(\text {df}(\varvec{\gamma })\) are, respectively, the numbers of non-zero free coefficients in the vectors \(\varvec{\zeta }\), \(\varvec{\eta }\), \(\varvec{\omega }\), and \(\varvec{\gamma }\). Note that, because of the constraints (28) for the iFME model, the free coefficients in \(\varvec{\omega }\) and \(\varvec{\gamma }\) consist only of the part related to the \(d_1\)th derivative. That is, \(\text {df}(\varvec{\zeta }) = \sum _{k=1}^{K-1}\sum _{j=1}^{q}\textbf{1}_{\{\zeta _{kj}\ne 0\}}\), \(\text {df}(\varvec{\eta }) = \sum _{k=1}^{K}\sum _{j=1}^{p}\textbf{1}_{\{\eta _{kj}\ne 0\}}\), \(\text {df}(\varvec{\omega }) = \sum _{k=1}^{K-1}\sum _{j=1}^{q}\textbf{1}_{\{\omega ^{[d_1]}_{kj}\ne 0\}}\) and \(\text {df}(\varvec{\gamma }) = \sum _{k=1}^{K}\sum _{j=1}^{p}\textbf{1}_{\{\gamma ^{[d_1]}_{kj}\ne 0\}}\). We apply both the BIC and the modified BIC in our experimental study.
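The criterion can thus be evaluated directly from a fitted model, as in the following sketch, where the counting of non-zero coefficients is assumed to have been done beforehand:

```python
import numpy as np

def modified_bic(loglik, n, K, df_gating_coefs, df_expert_coefs):
    """Modified BIC (40) with the effective number of parameters described above:
    non-zero gating coefficients + (K-1) gating intercepts + non-zero expert
    coefficients + K expert intercepts + K expert variances."""
    df = df_gating_coefs + (K - 1) + df_expert_coefs + K + K
    return loglik - df * np.log(n) / 2.0
```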

4 Experimental study

We study the performances of the FME, FME-Lasso, and iFME models in regression and clustering problems on simulated scenarios and on real data with functional predictors and scalar responses. This study focuses on prediction performance, on the estimation of the functional parameters (i.e., the expert and gating functions of the ME model), and on the clustering partition of the considered heterogeneous data.

4.1 Evaluation criteria

We use the following criteria, where applicable, to assess and compare the performances of the models and the related algorithms. For regression evaluation, we use the relative prediction error (RPE) and the correlation (Corr) index as standard criteria to quantify the agreement between the true and the predicted values of the scalar outputs. The RPE is defined by \(\text {RPE}=\sum _{i=1}^n (y_i-{\widehat{y}}_i)^2/\sum _{i=1}^ny_i^2\), where \(y_i\) and \({\widehat{y}}_i\) are, respectively, the true and the predicted response of the ith observation in the testing set. For clustering evaluation, we use the Rand index (RI), the adjusted Rand index (ARI), and the clustering error (ClusErr) as standard criteria to quantify the agreement between the true and the predicted partitions of the testing observations. To evaluate the parameter estimation performance, we compute the mean squared error (MSE) between the true and the estimated functional parameters. The MSE between a true function \(g(\cdot )\) and its estimate \({\widehat{g}}(\cdot )\) is defined by

$$\begin{aligned} \text {MSE}\big (\widehat{g}(\cdot )\big ) = \frac{1}{m} \sum _{j=1}^m \big (g(t_j) - \widehat{g}(t_j)\big )^2, \end{aligned}$$
(41)

with m being the number of time points taken into account. The function \(g(\cdot )\) here corresponds to an expert function \(\beta (\cdot )\) or to a gating function \(\alpha (\cdot )\). The parameter functions are reconstructed from the model parameters using (7) and (8) for the FME and FME-Lasso models, and using (39) for the iFME model.
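A direct implementation of criterion (41) is given below for reference (illustrative names; the true and estimated functions are passed as callables):

```python
import numpy as np

def mse_function(g_true, g_hat, t_grid):
    """Pointwise MSE (41) between a true coefficient function g and its estimate
    g_hat, evaluated on a grid of m time points."""
    return np.mean((g_true(t_grid) - g_hat(t_grid)) ** 2)
```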

The values of these criteria are averaged over N Monte Carlo runs (\(N=100\) for the simulations; for the real data, see the corresponding sections). Note that the average over N sample replicates of \(\text {MSE}\,\big (\widehat{g}(\cdot )\big )\) is equivalent to the usual mean integrated squared error (MISE) defined by \(\text {MISE}\,\big (\widehat{g}(\cdot )\big ) = \mathbb {E}\left[ \int _{\mathcal {T}} \big ({\widehat{g}}(t) - g(t)\big )^2dt \right] \), where the integral is computed numerically by a Riemann sum over the grid \(t_1,\ldots ,t_m\).

4.2 Simulation studies

4.2.1 Simulation parameters and experimental protocol

We generate data from a K-component functional mixture of Gaussian experts model that relates a scalar response \(y\in \mathbb {R}\) to a univariate functional predictor \(X(t), t\in \mathcal {T}\), defined on a domain \(\mathcal {T}\subset \mathbb {R}\). The data generation protocol is detailed in “Appendix D.1”. The parameters used in this data generating process (54) are described and provided in the simulation parameters and experimental protocol section in “Appendix D.2”. We study different simulation scenarios (sample sizes n, numbers of observations per time-series input m, noise levels \(\sigma ^2_\delta \), etc.), summarised in Sect. D.2 and Table 9.

Figure 1 displays, for each scenario, 10 randomly taken predictors colored according to their corresponding clusters.

Fig. 1 10 randomly taken predictors in scenarios S1 (large m and small \(\sigma _{\delta }\)), S2 (small m and small \(\sigma _{\delta }\)), S3 (large m and large \(\sigma _{\delta }\)), and S4 (small m and large \(\sigma _{\delta }\)). Here, the noisy predictors \(U_i(\cdot )\) are colored (blue, red, yellow) according to their true cluster labels \(Z_i\in \{1,2,3\}\)

For each run, the dataset concerned is randomly split into a training set and a testing set of equal size, and the model parameters are estimated on the training set, with the tuning parameters selected by maximizing the modified BIC (40). The evaluation criteria are computed on the testing set and reported for each model accordingly.

4.2.2 Some implementation details

For all scenarios and all datasets, we implemented the three proposed models with 10 EM runs and with a tolerance of \(10^{-6}\). For the iFME model, in principle, the two penalized derivative orders \(d_1\) and \(d_2\) of each parameter function, as well as the weights for the latter (i.e., \(\rho \) and \(\varrho \)), can be treated as tuning parameters. However, such an implementation would be computationally expensive in this simulation study, with 400 datasets in total and 10 EM runs per dataset. Therefore, we fixed \(d_1\) and \(d_2\) for all implementations (\(d_1=0\) and \(d_2=3\) for both the expert and gating networks) and let \(\rho \) and \(\varrho \) be selected from sets of targeted values. The targeted values were chosen by the following straightforward arguments. Since \(\beta _1(\cdot )\) and \(\beta _2(\cdot )\) have zero-valued regions, the weight for their zeroth derivative in the penalization term should be large or, equivalently, the weight for the third derivative should be small, so \(\rho \) is selected from a set of small values: \(\rho \in \{10^{-2}, 10^{-3}, 10^{-4}\}\). On the other hand, for \(\alpha _1(t)\) and \(\alpha _2(t)\), we select \(\varrho \in \{10, 10^2, 10^3\}\), as sparsity should be emphasized in their third derivative.
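The selection of \(\rho \) and \(\varrho \) from these grids can be organized as a simple search maximizing the modified BIC, as in the hypothetical sketch below, where `fit_fn` stands for a routine (assumed, not provided here) that fits the iFME model with \(d_1=0\), \(d_2=3\) fixed and returns the resulting modified BIC value.

```python
import itertools

def select_penalty_weights(fit_fn,
                           rho_grid=(1e-2, 1e-3, 1e-4),
                           varrho_grid=(1e1, 1e2, 1e3)):
    """Grid search over (rho, varrho) by modified BIC.

    fit_fn(rho, varrho) is a hypothetical interface: it fits the iFME model with
    the given penalty weights and returns the corresponding modified BIC value.
    """
    scores = {(r, v): fit_fn(r, v)
              for r, v in itertools.product(rho_grid, varrho_grid)}
    return max(scores, key=scores.get)   # the pair with the highest modified BIC
```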

4.2.3 Simulation results

Clustering and prediction performances We report in Table 1 the results of regression and clustering tasks on simulated datasets in the four considered scenarios.

Table 1 Evaluation criteria of FME, FME-Lasso and iFME models for test data in scenarios \(S1,\ldots , S4\)

The mean and standard error of the relative predictions error (RPE) and correlation (Corr) summarize the regression performance, while the mean and standard error of the Rand index (RI), adjusted Rand index (ARI) and clustering error (ClusErr) summarize the clustering performance.

As we can see from Table 1, all the models perform very well on both the regression and clustering tasks, with high correlation, RI, and ARI, and small RPE and clustering error. iFME appears to perform slightly better than the others in all scenarios. The low standard errors confirm the stability of the algorithms.

Figure 2 shows the clustering results obtained by the models with the highest values of the modified BIC criterion, on a dataset selected in scenario S1. Here, we plotted the responses against the predictors at two specific time points: \(t_1=0\) and \(t_{50}=0.5\). The highly accurate predictions (in both regression and clustering) can be seen visually in Fig. 2. This figure also shows that it is difficult to cluster these data using only a few time observations, for example in \(\mathbb {R}^2\), according to \(\{(U_i(t_1),y_i)\}_{i=1}^n\) or \(\{(U_i(t_{50}),y_i)\}_{i=1}^n\), which suggests using functional approaches.

Fig. 2 Scatter plots of \({\widehat{y}}_i\) against \(U_i(t_1)\) (top panels) and \(U_i(t_{50})\) (bottom panels) on a randomly selected dataset. Here, the clustering errors are \(5.5 \%\), \(4.75 \%\) and \(5.0 \%\) for FME, FME-Lasso and iFME models, respectively

Comparison with functional mixture regression (FMR): Finally, we compare our proposed models with the functional mixture regression (FMR) model proposed in Yao et al. (2010). In their approach, the functional predictors are first projected onto their eigenspace, and the resulting coordinates are then fed to a standard mixture regression model to estimate the weights and the coefficients of \(\widehat{\beta }(\cdot )\) in that eigenspace. They performed functional principal component analysis (FPCA) to obtain estimates of the eigenfunctions and of the principal component scores (which serve as predictors). The number of relevant FPCA components is chosen automatically for each dataset by selecting the minimum number of components that explain \(90\%\) of the total variation of the predictors. Note that, in their simulation studies, the authors computed the scalar responses by conditional prediction, i.e., the true \(y_i\) were used to determine which cluster each observation belongs to, and the predicted \({\widehat{y}}_i\) was then calculated as the conditional mean of the density of the corresponding cluster. For comparison with that approach, we also used this strategy to make predictions with our models. We further employed the FMR model with a B-spline functional representation instead of the functional PCA, the number of B-spline functions being set to the same number of basis functions as used in our models. Table 2 shows the evaluation criteria of the considered models evaluated on 100 datasets in scenario S3. Here, FMR-PC is the original model of Yao et al. (2010) and FMR-B is the modified one with B-spline bases. As expected, FME, FME-Lasso, and iFME, which are more flexible than the FMR model, allow one to capture more complexity in the data, thanks to the predictor-dependent mixing weights, and provide clearly better results than the FMR alternatives.

Table 2 Performance comparison of the models for datasets in scenario S3

Parameter estimation performance To evaluate the parameter estimation performance, we consider the estimation error of the functional parameters as defined by (41). This error between the true function and the estimated one provides an evaluation of how well the proposed models reconstruct the hidden functional gating and expert networks. In this evaluation, we considered scenario S1, i.e., \(m=300\), \(\sigma ^2_{\delta }=1\). Moreover, in order to assess the impact of the training size on parameter estimation, we ran the models with different training sizes (under the same scenario S1) and report the MSE for each function and each model in Table 3. It shows that, even with a small training size, the iFME model yields significant improvements in estimating the gating network compared to the FME and FME-Lasso models.

Table 3 Average of 100 Monte Carlo runs of MSE between the estimated functions resulted by FME, FME-Lasso and iFME models in S1 scenario

Now, in order to evaluate the designed sparsity of the zeroth and third derivatives of the reconstructed functions, we compute the MSEs with respect to their true values, i.e., zero, on each designed interval. In particular, we divide the domain \(\mathcal {T}=[0,1]\) into three parts: \(\mathcal {T}_1=[0,0.3)\), \(\mathcal {T}_2=[0.3,0.7)\), and \(\mathcal {T}_3=[0.7,1]\); for each model, the MSEs on the corresponding intervals are then reported in Table 4.

Table 4 MSE of the derivatives of the reconstructed functions on the corresponding intervals of interest, in which \(\mathcal {T}_1=[0,0.3)\), \(\mathcal {T}_2=[0.3,0.7)\), \(\mathcal {T}_3=[0.7,1]\) and \(\mathcal {T}=[0,1]\)

The reported values show that, as expected, the iFME model is better than both FME and FME-Lasso at providing sparse solutions with respect to the derivatives of the parameter functions.

Table 5 shows the means and standard errors of the estimated intercepts and variances. Note that these coefficients are not included in the penalization. For the intercepts \(\beta _{k,0}\), all the models estimate them very well, while for the intercepts \(\alpha _{k,0}\), iFME is slightly better than the others. For the variances, FME-Lasso gives the estimated values closest to the true values on average.

Table 5 Intercepts and variances obtained by FME, FME-Lasso and iFME models in scenario S1
Table 6 Comparison of different initialization strategies. The reported values are the mean and standard error (in parentheses) over 100 Monte Carlo runs

The initialization is crucial for the EM algorithm. In all of our experimental studies, the model parameter vector \(\varvec{\Psi }\) of the FME model was initialized as follows (and similarly for the FME-Lasso and iFME models). First, we run a K-means algorithm on the predictors \(\{\textbf{x}_i\}_{i=1}^n\); then, for each estimated group k, \((\beta _{k,0}^{(0)}, \varvec{\eta }_k^{(0)})\) is initialized as the solution to the linear regression problem \(y_i=\beta _{k,0}+\textbf{x}_i^\top \varvec{\eta }_k\) for the i belonging to the kth group. The parameter \(\sigma _k^{(0)}\) is then initialized as the estimated residual variance within the kth group. For the gating parameters, we simply draw each component of \(\varvec{\xi }^{(0)}\) randomly from the uniform distribution \({\mathcal {U}}(0,1)\).
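This initialization strategy can be sketched as follows, using K-means and per-cluster least squares; the names and shapes are illustrative and do not reproduce the authors' implementation.

```python
import numpy as np
from sklearn.cluster import KMeans

def initialize_fme(X, y, K, seed=0):
    """Initialization sketch: K-means on the projected predictors, per-cluster
    least-squares fits for the experts, within-cluster residual variances, and
    uniform random gating coefficients."""
    rng = np.random.default_rng(seed)
    labels = KMeans(n_clusters=K, n_init=10, random_state=seed).fit_predict(X)
    beta0 = np.zeros(K)
    eta = np.zeros((K, X.shape[1]))
    sigma2 = np.zeros(K)
    for k in range(K):
        Xk, yk = X[labels == k], y[labels == k]
        design = np.column_stack([np.ones(len(Xk)), Xk])
        coef, *_ = np.linalg.lstsq(design, yk, rcond=None)
        beta0[k], eta[k] = coef[0], coef[1:]
        sigma2[k] = np.mean((yk - design @ coef) ** 2)
    xi = rng.uniform(0.0, 1.0, size=(K - 1, X.shape[1] + 1))  # random gating init
    return beta0, eta, sigma2, xi
```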

However, to see the impact of different initialization strategies, we performed a side experiment with data taken from scenario S1. In particular, the initialization of the expert parameters is fixed as described above, while that of the gating parameters varies among “rand.”, “zeros”, and “LR”. Here, “rand.” is the random strategy described above; “zeros” is the strategy where all coefficients of \(\varvec{\xi }\) are initialized to zero, i.e., equal weights are put on the experts; and “LR” is the strategy where we perform a logistic regression with predictors \(\textbf{r}_i\) (for FME, FME-Lasso) or \(\textbf{s}_i\) (for iFME) and responses given by the labels obtained from K-means on them. Table 6 shows the time to convergence, the log-likelihood, and the RPE on the testing sets for the three strategies. We can see that, in terms of log-likelihood and RPE, simple strategies such as “rand.” and “zeros” perform quite well, while in terms of time to convergence, LR is a good choice for the iFME model.

Finally, to illustrate the selection of the number of expert components using BIC and/or the modified BIC in this simulation study, we provide, in Fig. 3, plots of these criteria against the number of experts for each model. Here, we implemented the models with all tuning parameters fixed, except K, which varies in the set \(\{1,\ldots ,6\}\). We can observe that BIC selects the correct number \(K=3\) for both FME-Lasso and iFME, while it selects \(K=4\) for FME. The modified BIC, however, selects \(K=4\) for both FME and FME-Lasso, and the true number of components \(K=3\) for iFME.

Fig. 3 Values of BIC (top) and modified BIC (bottom) for a FME, b FME-Lasso and c iFME model versus the number of experts K, fitted on a randomly taken dataset in scenario S1. Here, the square points correspond to the highest values

4.3 Application to real data

In this section, we apply the FME, FME-Lasso and iFME models to two well-known real datasets, the Canadian weather data and the Diffusion Tensor Imaging (DTI) data. For each dataset, we perform clustering and investigate the prediction performance, estimate the functional mixture of experts models with different numbers of experts K, perform the selection of K using the modified BIC, and discuss the obtained results.

4.3.1 Canadian weather data

The Canadian weather data have been introduced in Ramsay and Silverman (2005) and are also available in the R package fda. This dataset consists of \(m=365\) daily temperature measurements (averaged over the years 1961 to 1994) at \(n=35\) weather stations in Canada, together with their corresponding average annual precipitation (in \(\log \) scale). The weather stations are located in 4 climate zones: Atlantic, Pacific, Continental and Arctic (Fig. 4a). In this dataset, presented in Fig. 4b, the noisy functional predictors \(U_i(\cdot )\) are the curves of 365 averaged daily temperature measurements, and the scalar responses \(Y_i\) are the corresponding total annual precipitation at each station i, for the 35 stations. The climate zone of each station is taken as its cluster label (\(Z_i\in \{1,\ldots ,4\}\)).

Fig. 4 a 35 daily mean temperature measurement curves; b geographical visualization of the stations, in which the sizes of the bubbles correspond to their \(\log \) precipitation values and the colors correspond to their climate regions

The aim here is to use the daily temperature curves (functional predictors) to predict the precipitations (scalar responses) at each station. Moreover, in addition to predicting the precipitation values, we are interested in clustering the temperature curves (therefore the stations), as well as identifying the periods of time of the year that have effect on prediction for each group of curves.

First, in order to assess the prediction performance of the FME, FME-Lasso and iFME models on this dataset, we fit the models while selecting the tuning parameters, including the number of expert components K in the set \(\{1,2,3,4,5,6\}\), by maximizing the modified BIC criterion, given its performance in the simulation study. We report in Table 7 the results in terms of correlation, sum of squared errors (SSE) and relative prediction error.

Table 7 7-fold cross-validated correlation (Corr), sum of squared errors (SSE) and relative prediction error (RPE) of predictions on Canadian weather data

According to the obtained results, iFME provides the best results w.r.t. all the criteria. The cross-validated RPE provided by iFME is only \(1.2\%\), followed by FME-Lasso with \(4.5\%\), while the FME model has the worst RPE value. Note that in James et al. (2009), the authors applied their proposed model to the Canadian weather data and obtained a 10-fold cross-validated SSE of 4.77. Clearly, with their smaller cross-validated SSEs, the FME-Lasso and iFME models significantly improve the prediction.

Figure 5 shows the results obtained with the FME, FME-Lasso and iFME models, with \(K=4\), and with the derivatives \(d_1=0\), \(d_2=3\) for the iFME model. The estimated expert functions and gating functions are presented in the two top panels of the figure, while the clustering of the temperature curves and of the stations is shown in the two bottom panels.

Fig. 5 Results obtained by FME (left panels), FME-Lasso (middle panels) and iFME (right panels) on Canadian weather data with \(K=4\). For each column, the panels are respectively the estimated functional experts network, estimated functional gating network, estimated clusters of the temperature curves and estimated clusters of the stations

As we can see, all models provide a reasonable clustering of the curves, which may correspond to different complex underlying meteorological mechanisms. In particular, although using no spatial information but only temperature information, the obtained clustering of the stations is also comparable with their original labels. For example, the FME and FME-Lasso models identify exactly the Arctic stations, while iFME identifies exactly the Pacific stations, and all of the models provide reasonable spatially organized clusters. However, what is most interesting here is the shape of the expert and gating functions \({\widehat{\alpha }}(\cdot )\)’s and \({\widehat{\beta }}(\cdot )\)’s obtained by the models. While FME and FME-Lasso give less interpretable estimates, iFME appears to give, as can be seen in the two top-right panels, piecewise zero-valued and possibly quadratic estimated functions, which are flat over wide ranges, from January to February and from June to September.

Motivated by the above results, and with the aim of identifying the periods of the year that truly have an effect on the prediction, we fit the iFME model with \(K=2\), with the derivative orders \(d_1\) and \(d_2\) set to the zeroth and third derivatives. The reason for these choices of \(d_1\) and \(d_2\) is that the penalization on the zeroth derivative accounts for zero ranges in the expert and gating functions, while the penalization on the third derivative accounts for the smoothness of the changes between the periods of the year in these functions. The obtained results are shown in Fig. 6.

Fig. 6 Results obtained by iFME model with \(K=2\), \(d_1=0\), \(d_2=3\): a estimated functional expert network, b estimated functional gating network, c estimated clusters of the temperature curves, and d estimated clusters of the stations

As we can see, there are differences in the prediction mechanisms of the models between the northern and the southern stations. At the southern stations, the obtained \(\widehat{\beta }_2(t)\) shows that there is a negative relationship in the spring and a positive relationship in the late fall, but no relationship over the rest of the year. This phenomenon is concordant with the result obtained in James et al. (2009), where the authors found the same relationships over the same periods. However, our iFME model additionally suggests that, at the northern stations, the relationship between temperature and precipitation may differ from that at the southern stations. This may be explained by the differences in mean temperatures and climatic characteristics between the two regions.

Finally, Fig. 7 displays the values of the modified BIC for varying numbers of expert components for the proposed models on the Canadian weather data. According to these values, FME-Lasso and iFME select \(K=2\), while FME selects \(K=3\).

Fig. 7 Values of modified BIC for a FME, b FME-Lasso and c iFME model, versus the number of experts K, fitted on Canadian weather data. The square points correspond to highest values. Here, iFME is implemented with \(d_1=0\), \(d_2=3\) and \(\rho =\varrho =100\)

4.3.2 Diffusion tensor imaging data for multiple sclerosis subjects

We now apply our proposed models to the diffusion tensor imaging (DTI) data for subjects with multiple sclerosis (MS), discussed in Goldsmith et al. (2012). The data come from a longitudinal study investigating the cerebral white matter tracts of subjects with multiple sclerosis, recruited from an outpatient neurology clinic, and of healthy controls. We are interested in the underlying relationship between the fractional anisotropy profile (FAP) along the corpus callosum and the paced auditory serial addition test (PASAT) score, a commonly used examination of the cognitive function affected by MS. The FAP curves are derived from DTI data obtained with a magnetic resonance imaging (MRI) scanner. Each curve is recorded at 93 locations along the corpus callosum. The PASAT score is the number of correct answers out of 60 questions, and thus ranges from 0 to 60. In our context, the FAP curves serve as the noisy functional predictors \(U_i(\cdot )\) and the PASAT scores serve as the scalar responses \(Y_i\). This dataset thus consists of \(n=99\) pairs \((U_i(\cdot ),Y_i)\), \(i\in [n]\), with each \(U_i(\cdot )\) containing \(m=93\) fractional anisotropy values. Figure 9a shows all the predictors (FAP curves), and Fig. 8 (right-most panels) shows the responses (PASAT scores).

In Ciarleglio and Ogden (2016), the authors applied their wavelet-based functional mixture regression (WBFMR) model with two components to this dataset and observed that there is one group of subjects for which there is no association between the FAP and the PASAT score. We accordingly fix \(K=2\) in our models. Figure 8 displays the results obtained for each of the three models, and Fig. 9 shows the functional predictors (FAP curves) and their clustering with the iFME model. In this implementation, we tried the iFME model with two different combinations of \(d_1\) and \(d_2\): \((d_1,d_2)=(0,2)\) and \((d_1,d_2)=(0,3)\). As expected, when \(d_2\) is the second derivative, the reconstructed parameter functions are piecewise zero and linear, while when \(d_2\) is the third derivative, the reconstructed functions change smoothly along the tract location.

Fig. 8 The estimated expert and gating coefficient functions, and the estimated groups of the PASAT scores, obtained by the FME, FME-Lasso and iFME models with \(K=2\) for the DTI dataset. For the iFME model, the upper row is obtained with penalization of the zeroth and third derivatives, while the lower row is obtained with penalization of the zeroth and second derivatives

Fig. 9 DTI data with a FAP curves of all subjects, b Cluster 1 and c Cluster 2, obtained by iFME model with \(d_1=0\), \(d_2=2\), and d the point-wise average of the curves in each of the two clusters

From Fig. 8, we make the following three observations. First, as can be seen in the right-most panels of Fig. 8, all models give a threshold of 50 that separates the PASAT scores into clusters, the same threshold as observed in Ciarleglio and Ogden (2016). Second, the absolute values of the coefficient functions \(\widehat{\beta }_2(t)\) are significantly smaller than those of \(\widehat{\beta }_1(t)\); this again agrees with the result obtained by the WBFMR model. Third, when \(\widehat{\beta }_2(t)\) is estimated as zero in the FME-Lasso and iFME models, the shape of \(\widehat{\beta }_1(t)\) is almost the same as the shape obtained in Ciarleglio and Ogden (2016), in particular the peak around the tract location of 0.42. These observations confirm the underlying relationship between fractional anisotropy and cognitive function: higher fractional anisotropy values between the locations of about 0.2 to 0.7 result in higher PASAT scores for the subjects in Group 1. The clustering of the FAP curves obtained by the iFME model with \(d_1=0\), \(d_2=2\) is shown in Fig. 9b–d.

Next, to compare with Ciarleglio and Ogden (2016), we investigate the prediction performance of the proposed models with respect to the leave-one-out cross-validated relative prediction error defined by \(\text {CVRPE}=\sum _{i=1}^n (y_i-{\widehat{y}}_i^{(-i)})^2/\sum _{i=1}^n y_i^2,\) where \(y_i\) is the true score for subject i and \(\widehat{y}_i^{(-i)}\) is the score predicted by the model fitted on the data without subject i. In this implementation, we keep \(K=2\) fixed and select the other tuning parameters by maximizing the modified BIC. The CVRPEs of the models are provided in Table 8. Note that, for comparison, in Ciarleglio and Ogden (2016), the CVRPE of their WBFMR model is 0.0315 and that of the wavelet-based functional linear model (FLM) is 0.0723.
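For reference, the leave-one-out scheme can be written generically as in the sketch below, where `fit_and_predict` is a placeholder (a hypothetical interface) for fitting a chosen model without subject i and predicting its score.

```python
import numpy as np

def cv_rpe(y, fit_and_predict):
    """Leave-one-out cross-validated relative prediction error (CVRPE).

    y               : (n,) true scores
    fit_and_predict : callable (train_idx, test_idx) -> prediction for the
                      single held-out subject (hypothetical interface)
    """
    n = len(y)
    preds = np.empty(n)
    for i in range(n):
        train_idx = np.delete(np.arange(n), i)
        preds[i] = fit_and_predict(train_idx, i)
    return np.sum((y - preds) ** 2) / np.sum(y ** 2)
```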

Table 8 CVRPEs of the models on the DTI data

Finally, we present in Fig. 10 the selection of the number of experts K with modified BIC. In this case, FME and FME-Lasso select \(K=2\), and iFME selects \(K=4\).

Fig. 10 Values of modified BIC for a FME, b FME-Lasso and c iFME, versus the number of experts K, fitted on DTI data. The square points correspond to highest values

5 Conclusion and discussion

This paper presented the first algorithm for mixtures of experts constructed upon functional predictors. Besides the classic maximum likelihood parameter estimation, we proposed two regularized versions that yield sparse and notably interpretable solutions, by regularizing in particular the derivatives of the underlying functional parameters of the model, after projecting them onto a set of continuous basis functions. The performances of the proposed approaches were evaluated for prediction and clustering through experiments on simulated data and two real-world functional datasets.

The presented FME models can be extended in different ways. First, direct extensions of the modeling framework presented here can be considered with a categorical response, to perform supervised classification with functional predictors, or with a vector response, to perform multivariate functional regression. It may then be interesting to consider the extension of the FME model to settings involving vector (or scalar) predictors and functional responses (Chiou et al. 2004).

Another extension, which we also intend to investigate in the future, concerns the case where we observe pairs of functional data, i.e., a sample of n functional data pairs \(\{X_i(t),Y_i(u)\}_{i=1}^n\), \(t \in \mathcal {T}\subset \mathbb {R}\), \(u \in {\mathcal {U}} \subset \mathbb {R}\), where \(Y_i(\cdot )\) is a functional response explained by a functional predictor \(X_{i}(\cdot )\). The modeling with such an FME extension then takes the form \(Y_i(u) = \beta _{z_i,0}(u) + \int _t X_i(t) \beta _{z_i}(t,u) dt + \varepsilon _i(u)\), to explain the functional response Y by the functional predictor X via the unknown discrete variable z. The particularity of this model is that, for the clustering as well as for the prediction, we model the relation between Y at any time u and the entire curve of X, or the entire curve of each variable \(X_{ij}\) in the case of a multivariate functional predictor \(\varvec{X}_i\).

Since we look for sparse estimates in the space of the derivatives of the functional parameters, and since the resulting estimates have highly structured shapes, we believe that using the fused LASSO (Tibshirani et al. 2005) rather than the LASSO could better accommodate the targeted sparsity. The choice of the regularization constants for the derivatives is more about the targeted penalization magnitude (i.e., very large or very small) than about their exact values. Typically, if one has no idea about the shape of the functional experts but still wants them to be interpretable, it is practical to consider the zeroth and third derivatives (since they produce smooth changes), with the grid for \(\rho \) containing both very small and very large values; the same holds for the functional gating network. Furthermore, if we believe that the dependence of the response on the predictor is very sparse, except over some small regions where they are linearly related, then we penalize the zeroth and second derivatives of the functional experts, putting a very small regularization weight on the second derivative. This could produce curves such as those in the bottom panels of Fig. 8. For a more general tuning, we recommend a complete cross-validation study to select the best values of the regularization constants.