Springer Nature is making SARS-CoV-2 and COVID-19 research free. View research | View latest news | Sign up for updates

Additive mixed models with approximate Dirichlet process mixtures: the EM approach

  • 343 Accesses

  • 3 Citations


We consider additive mixed models for longitudinal data with a nonlinear time trend. As random effects distribution an approximate Dirichlet process mixture is proposed that is based on the truncated version of the stick breaking presentation of the Dirichlet process and provides a Gaussian mixture with a data driven choice of the number of mixture components. The main advantage of the specification is its ability to identify clusters of subjects with a similar random effects structure. For the estimation of the trend curve the mixed model representation of penalized splines is used. An Expectation-Maximization algorithm is given that solves the estimation problem and that exhibits advantages over Markov chain Monte Carlo approaches, which are typically used when modeling with Dirichlet processes. The method is evaluated in a simulation study and applied to theophylline data and to body mass index profiles of children.

This is a preview of subscription content, log in to check access.

Fig. 1
Fig. 2
Fig. 3
Fig. 4
Fig. 5
Fig. 6
Fig. 7
Fig. 8
Fig. 9
Fig. 10
Fig. 11
Fig. 12
Fig. 13
Fig. 14
Fig. 15
Fig. 16
Fig. 17
Fig. 18


  1. Arenz, S., Rckerl, R., Koletzko, B., von Kries, R.: Breast-feeding and childhood obesity—a systematic review. Int. J. Obesity Relat. Metab. Disord. 28, 1247–1256 (2004)

  2. Beyerlein, A., Fahrmeir, L., Mansmann, U., Toschke, A.: Alternative regression models to assess increase in childhood BMI. BMC Med. Res. Methodol. 8(1), 59 (2008a)

  3. Beyerlein, A., Toschke, A., von Kries, R.: Breastfeeding and childhood obesity: Shift of the entire BMI distribution or only the upper parts? Obesity 16, 2730–2733 (2008b)

  4. Boeckmann, A.J., Sheiner, L.B., Beal, S.L.: NONNEM Users Guide: Part V. University of California, San Francisco (1994)

  5. Brezger, A., Kneib, T., Lang, S.: BayesX: analysing Bayesian structured additive regression models. J. Stat. Softw. 14(11), 1–22 (2005)

  6. Burkhardt, J.: ASA047: Nelder–Mead Minimization Algorithm. C++ Library (2008)

  7. Chen, C.-M., Rzehak, P., Zutavern, A., Fahlbusch, B., Bischof, W., Herbarth, O., Borte, M., Lehmann, I., Behrendt, H., Krämer, U., Wichmann, H.-E., Heinrich, J.: Longitudinal study on cat allergen exposure and the development of allergy in young children. J. Allergy Clin. Immunol. 119, 1148–1155 (2007)

  8. Davidian, M., Giltinan, D.M.: Nonlinear Models for Repeated Measurement Data. Chapman & Hall, London (1995)

  9. De Boor, C.: A Practical Guide to Splines. Springer, New York (1978)

  10. Diggle, P.J., Heagerly, P., Liang, K.Y., Zeger, S.L.: Analysis of Longitudinal Data, 2nd edn. Oxford University Press, Oxford (2002)

  11. Efron, B.: Bootstrap methods: another look at the jackknife. Ann. Stat. 7, 1–26 (1979)

  12. Eilers, P.H.C., Marx, B.D.: Flexible smoothing with B-splines and penalties. Stat. Sci. 11, 89–121 (1996)

  13. Eilers, P.H.C., Marx, B.D.: Splines, knots and penalties. Wiley Interdiscip. Rev. 2, 637–653 (2010)

  14. Fahrmeir, L., Kneib, T., Lang, S.: Regression—Modelle, Methoden und Anwendungen. Springer, Berlin (2007)

  15. Fan, J., Li, R.: New estimation and model selection procedures for semiparametric modeling in longitudinal data analysis. J. Am. Stat. Assoc. 99, 710–723 (2004)

  16. Fenske, N., Fahrmeir, L., Rzehak, P., Höhle, M.: Detection of risk factors for obesity in early childhood with quantile regression methods for longitudinal data. Technical Report 38, Ludwig-Maximilians-University Munich (2008)

  17. Ferguson, T.S.: A Bayesian analysis of some nonparametric problems. Ann. Stat. 1, 209–230 (1973)

  18. Fitzmaurice, G.M., Laird, N.M., Ware, J.H.: Applied Longitudinal Analysis, 2nd edn. Wiley Series in Probability and Statistics. Wiley, New Jersey (2004)

  19. Fritsch, A., Ickstadt, K.: Improved criteria for clustering based on the posterior similarity matrix. Bayesian Anal. 4, 367–392 (2009)

  20. Green, P.J.: Penalized likelihood for general semi-parametric regression models. Int. Stat. Rev. 55, 245–259 (1987)

  21. Harder, T., Bergmann, R., Kallischnigg, G., Plagemann, A.: Duration of breastfeeding and risk of overweight: a meta-analysis. Am. J. Epidemiol. 162, 397–403 (2005)

  22. Heinzl, F.: clustmixed: Clustering in linear and additive mixed models. R Package Version 0.1 (2012)

  23. Heinzl, F., Tutz, G.: Clustering in linear mixed models with approximate Dirichlet process mixtures using EM algorithm. Stat. Model. 13, 41–67 (2013)

  24. Heinzl, F., Tutz, G.: Clustering in linear mixed models with a group fused lasso penalty. Biom. J. 56, 44–68 (2014)

  25. Heinzl, F., Fahrmeir, L., Kneib, T.: Additive mixed models with Dirichlet process mixture and P-spline priors. Adv. Stat. Anal. 96, 47–68 (2012)

  26. Kimura, T., Tokuda, T., Nakada, Y., Nokajima, T., Matsumoto, T., Doucet, A.: Expectation-maximization algorithms for inference in Dirichlet processes mixture. Pattern Anal. Appl. 16, 55–67 (2013)

  27. Koenker, R.: Quantile Regression. Economic Society Monographs. Cambridge University Press, Cambridge (2005)

  28. Komárek, A., Lesaffre, E.: Generalized linear mixed model with a penalized Gaussian mixture as a random effects distribution. Comput. Stat. Data Anal. 52, 3441–3458 (2008)

  29. Laird, N.M., Ware, J.H.: Random-effects models for longitudinal data. Biometrics 38, 963–974 (1982)

  30. Li, Y., Lin, X., Müller, P.: Bayesian inference in semiparametric mixed models for longitudinal data. Biometrics 66, 70–78 (2010)

  31. Lindstrom, M.J., Bates, D.M.: Newton-Raphson and EM algorithms for linear mixed effects models for repeated measures data. J. Am. Stat. Assoc. 83, 1014–1022 (1988)

  32. Mayr, A., Hothorn, T., Fenske, N.: Prediction intervals for future BMI values of individual children—a non-parametric approach by quantile boosting. BMC Medical Research Methodology 12, 6 (2012)

  33. McCullagh, P., Nelder, J.A.: Generalized Linear Models, 2nd edn. Chapman & Hall, New York (1989)

  34. McLachlan, G.J., Krishnan, T.: The EM Algorithm and Extensions. Wiley, New York (1997)

  35. McLachlan, G.J., Peel, D.: Finite Mixture Models. Wiley, New York (2000)

  36. Nelder, J.A., Mead, R.: A simplex method for function minimization. Comput. J. 7, 308–313 (1965)

  37. Ohlssen, D.I., Sharples, L.D., Spiegelhalter, D.J.: Flexible random-effects models using Bayesian semi-parametric models: applications to institutional comparisons. Stat. Med. 26, 2088–2112 (2007)

  38. O’Neill, R.: Algorithms AS 47: function minimization using a simplex procedure. J. R. Stat. Soc. C 20, 338–345 (1971)

  39. O’Sullivan, F.: A statistical perspective on ill-posed inverse problems (with discussion). Stat. Sci. 1, 505–527 (1986)

  40. Pinheiro, J.C., Bates, D.M.: Mixed-Effects Models in S and S-Plus. Springer, New York (2000)

  41. Reinsch, C.: Smoothing by spline functions. Numerische Mathematik 10, 177–183 (1967)

  42. Rigby, R., Stasinopoulos, D.: Generalized additive models for location, scale and shape. Appl. Stat. 54, 507–554 (2005)

  43. Ruppert, D., Carroll, R.J.: Spatially-adaptive penalties for spline fitting. Aust. N. Z. J. Stat. 42, 205–223 (2000)

  44. Ruppert, D., Wand, M.P., Carroll, R.J.: Semiparametric Regression. Cambridge University Press, Cambridge (2003)

  45. Rzehak, P., Sausenthaler, S., Koletzko, S., Bauer, C.P., Schaaf, B., von Berg, A., Berdel, D., Borte, M., Herbarth, O., Krämer, U., Fenske, N., Wichmann, H.-E., Heinrich, J.: Period-specific growth, overweight and modification by breastfeeding in the GINI and LISA birth cohorts up to age 6 years. Eur. J. Epidemiol. 24, 449–467 (2009)

  46. Sethuraman, J.: A constructive definition of Dirichlet priors. Stat. Sinica 4, 639–650 (1994)

  47. Verbeke, G., Lesaffre, E.: A linear mixed-effects model with heterogeneity in the random-effects population. J. Am. Stat. Assoc. 91, 217–221 (1996)

  48. Verbeke, G., Molenberghs, G.: Linear Mixed Models for Longitudinal Data. Springer Series in Statistics. Springer, New York (2000)

  49. Verbyla, A.P., Cullis, B.R., Kenward, M.G., Welham, S.J.: The analysis of designed experiments and longitudinal data by using smoothing splines. J. R. Stat. Soc. C 48, 269–300 (1999)

  50. Wang, N., Carroll, R.J., Lin, X.: Efficient semiparametric marginal estimation for longitudinal/clustered data. J. Am. Stat. Assoc. 100, 147–157 (2005)

  51. Wood, S.N.: Generalized Additive Models: An Introduction with R. Chapman & Hall, London (2006)

  52. Zeger, S.L., Diggle, P.J.: Semi-parametric models for longitudinal data with application to CD4 cell numbers in HIV seroconverters. Biometrics 50, 689–699 (1994)

  53. Zhang, D., Lin, X., Raz, J., Sowers, M.F.: Semi-parametric stochastic mixed models for longitudinal data. J. Am. Stat. Assoc. 93, 710–719 (1998)

  54. Zutavern, A., Rzehak, P., Brockow, I., Schaaf, B., Bollrath, C., von Berg, A., Link, E., Krämer, U., Borte, M., Herbarth, O., Wichmann, H.-E., Heinrich, J.: Day care in relation to respiratory-tract and gastrointestinal infections in a German birth cohort study. Acta Paediatrica 96, 1494–1499 (2007)

Download references


We thank Elisabeth Thiering and Dr. Joachim Heinrich from the Helmholtz Zentrum Munich for providing the data of the LISA study.

Author information

Correspondence to Felix Heinzl.

Appendix: EM algorithm

Appendix: EM algorithm


In the E-step, we take the expectation of the penalized likelihood \(l_{P}({\varvec{\xi }})\) based on the complete model over all unobserved \(w_{ih}\). Collecting all observed data in \({\varvec{y}}=({\varvec{y}}_1^T,\ldots ,{\varvec{y}}_n^T)^T\), we get for the E-step of iteration \(t+1\)

$$\begin{aligned} Q({\varvec{\xi }})&=E\left( l_{P}({\varvec{\xi }})|{\varvec{y}},{\varvec{\xi }}^{(t)}\right) \\&=\sum _{i=1}^n\sum _{h=1}^N\pi _{ih}({\varvec{\xi }}^{(t)})[\log \pi _h+\log f_{ih}({\varvec{y}}_i;{\varvec{\psi }})]\\&\quad -\frac{1}{2} \left( (d-k)\log (\tau ^2) \!+\! \frac{1}{\tau ^2}{\varvec{\gamma }}_p^T{\varvec{\gamma }}_p\right) \!+\! (N-1)\log \alpha \\&\quad +(\alpha -1)\sum _{h=1}^{N-1}\log (1 - v_h) , \end{aligned}$$

where \(\pi _{ih}({\varvec{\xi }}^{(t)})\) is the probability at iteration \(t\) that subject \(i\) belongs to cluster \(h\) and is given by

$$\begin{aligned} \pi _{ih}({\varvec{\xi }}^{(t)}) = \frac{f_{ih}({\varvec{y}}_i;{\varvec{\psi }}^{(t)})\pi _h^{(t)}}{\sum _{l=1}^Nf_{il}({\varvec{y}}_i;{\varvec{\psi }}^{(t)})\pi _l^{(t)}}. \end{aligned}$$

For clarity, in the following we write \(\pi _{ih}:=\pi _{ih}({\varvec{\xi }}^{(t)})\).


In the M-step, \(Q({\varvec{\xi }})\) is maximized with respect to all unknown parameters. Due to \(Q({\varvec{\xi }})=Q(\alpha ,{\varvec{v}})+Q({\varvec{\psi }})\) the M-step can be separated into two parts: The maximization of

$$\begin{aligned} Q(\alpha ,{\varvec{v}})&= \sum _{i=1}^n\sum _{h=1}^N\pi _{ih}\log \pi _h + (N-1)\log \alpha \\&+ (\alpha -1)\sum _{h=1}^{N-1}\log (1 - v_h), \end{aligned}$$

with respect to \(\alpha \) and \({\varvec{v}}\) and the maximization of

$$\begin{aligned} Q({\varvec{\psi }})&= \sum _{i=1}^n\sum _{h=1}^N\pi _{ih}\,\log f_{ih}({\varvec{y}}_i;{\varvec{\psi }})\\&-\frac{1}{2} \left( (d-k)\log (\tau ^2) + \frac{1}{\tau ^2}{\varvec{\gamma }}_p^T{\varvec{\gamma }}_p\right) , \end{aligned}$$

with respect to \({\varvec{\psi }}\). The first optimization problem is solved by alternating updates of the first order conditions

$$\begin{aligned} v_h = \frac{\sum _{i=1}^n \pi _{ih}}{\sum _{i=1}^n\sum _{l=h}^N \pi _{il}+\alpha -1}, \quad h=1,\ldots ,N-1, \end{aligned}$$


$$\begin{aligned} \alpha = \frac{1-N}{\sum _{h=1}^{N-1}\log (1 - v_h)}. \end{aligned}$$

Note that the update (8) is the same as in Kimura et al. (2013) but in a more compact version. Without further restrictions it could happen that \(v_h \notin [0,1]\) if \(\alpha \in (0,1)\). While Kimura et al. (2013) always consider \(\alpha \ge 1\) for preventing this, we use the following correction approach: Update \(v_h\) by (8) for increasing \(h\). If \(v_{h^*} > 1\) set \(v_h\) to \(1\) for \(h=h^*,\ldots ,N-1\). This constraint for \({\varvec{v}}\) is equivalent to the following restriction on \({\varvec{\pi }}\) by using the stick breaking procedure:

$$\begin{aligned} \pi _h=\left\{ \begin{array}{l@{\quad }l} \frac{1}{n+\alpha -1}\sum _{i=1}^n\pi _{ih}, &{} \text {for}\quad h<h^*,\\ 1 - \sum _{l=1}^{h-1}\pi _l &{} \text {for}\quad h=h^*,\\ 0 &{} \text {for}\quad h>h^*,\\ \end{array}\right. \end{aligned}$$

where \(h^*\) is the lowest index \(h\) for which the cumulative sum of the original weights \(\pi _l^{\circ }\) exceeds one: \(\sum _{l=1}^{h}\pi _l^{{\circ }} > 1\). Finally, it can be seen that for \(\alpha \in (0,1)\), all weights \(\pi _h\) for \(h < h^*\) are stretched by the factor \(\frac{n}{n+\alpha -1}\) compared to the unpenalized estimators for \(\pi _h\) as in Verbeke and Molenberghs (2000), which we get for \(\alpha =1\). The amount of stretching is controlled by the parameter \(\alpha \). If \(\alpha \approx 0\), a very strong clustering is achieved while for larger values of \(\alpha \) only few clusters drop out. It should be noted that during the computations \(v_h = 1-10^{-300}\) instead of \(v_h =1\) is used to avoid \(\log (0)\). Then one gets \(\pi _h \approx 0\) for \(h>h^*\). For \(\alpha > 1\) no correction is needed, but especially in this case it is important that \(N\) is large enough (Kimura et al. 2013). As proposed by Ohlssen et al. (2007) \(N\) should be chosen such that

$$\begin{aligned} N>1+\frac{\log (\varepsilon )}{\log \left( \frac{\alpha }{\alpha +1}\right) }, \end{aligned}$$

with \(\varepsilon >0\). Thus, for a given range on \(\alpha \) a lower bound for \(N\) can be determined. This means that for the range \(\alpha \in (0,1)\) and \(\varepsilon =0.001\) even \(N=11\) is sufficiently large for an adequate approximation of the distribution \(G\). Since in practice a very strong clustering with a number of clusters as low as possible is generally desirable, we propose to set a very low starting value like \(\alpha =0\) to induce a small value for \(\alpha \).

In the second part of the M-step, we get the current state for \({\varvec{\psi }}\) by alternating separate maximization of \(Q({\varvec{\psi }})\) to \({\varvec{\beta }}\), \({\varvec{\gamma }}_0\), \({\varvec{\gamma }}_p\), \({\varvec{\mu }}_1,\ldots ,{\varvec{\mu }}_N\) and to the variance parameters \(\tau ^2\), \(\sigma ^2\) and \({\varvec{D}}\). Conditional on the actual state of the other parameters, the maximization of \({\varvec{\beta }}\) results in

$$\begin{aligned} {\varvec{\beta }}\!&= \! \left( \sum _{i=1}^n {\varvec{X}}_i^T{\varvec{V}}_i^{-1}{\varvec{X}}_i\right) ^{-1}\\&\left( \sum _{i=1}^n {\varvec{X}}_i^T{\varvec{V}}_i^{-1} \left( {\varvec{y}}_i \!-\! {\varvec{B}}_i{\varvec{T}}{\varvec{\gamma }}_0 \!-\! {\varvec{B}}_i{\varvec{W}}{\varvec{\gamma }}_p \!-\! \sum _{h=1}^N\pi _{ih}{\varvec{Z}}_i{\varvec{\mu }}_h\right) \right) \!. \end{aligned}$$

The first order condition for \({\varvec{\gamma }}_0\), given all the other parameters, yields

$$\begin{aligned} {\varvec{\gamma }}_0 \!&= \! \left( \sum _{i=1}^n {\varvec{T}}^T{\varvec{B}}_i^T{\varvec{V}}_i^{-1}{\varvec{B}}_i{\varvec{T}}\right) ^{-1}\\&\left( \sum _{i=1}^n {\varvec{T}}^T{\varvec{B}}_i^T{\varvec{V}}_i^{-1}\left( {\varvec{y}}_i \!-\! {\varvec{X}}_i{\varvec{\beta }}- {\varvec{B}}_i{\varvec{W}}{\varvec{\gamma }}_p \!-\!\sum _{h=1}^N\pi _{ih}{\varvec{Z}}_i{\varvec{\mu }}_h\right) \right) \!, \end{aligned}$$

whereas the penalized basis coefficients are updated by

$$\begin{aligned} {\varvec{\gamma }}_p&= \left( \sum _{i=1}^n {\varvec{W}}^T{\varvec{B}}_i^T{\varvec{V}}_i^{-1}{\varvec{B}}_i{\varvec{W}}+ \frac{1}{\tau ^2}{\varvec{I}}_{d-k}\right) ^{-1}\\&\left( \sum _{i=1}^n {\varvec{W}}^T{\varvec{B}}_i^T{\varvec{V}}_i^{-1}\left( {\varvec{y}}_i - {\varvec{X}}_i{\varvec{\beta }}\!-\! {\varvec{B}}_i{\varvec{T}}{\varvec{\gamma }}_0 \!-\!\sum _{h=1}^N\pi _{ih}{\varvec{Z}}_i{\varvec{\mu }}_h\right) \right) . \end{aligned}$$

Given the other parameters, setting the derivative of \(Q({\varvec{\psi }})\) with respect to \({\varvec{\mu }}_h\), \(h=1,\ldots ,N\), to zero yields

$$\begin{aligned} {\varvec{\mu }}_h&= \left( \sum _{i=1}^n \pi _{ih}{\varvec{Z}}_i^T{\varvec{V}}_i^{-1}{\varvec{Z}}_i\right) ^{-1}\\&\left( \sum _{i=1}^n\pi _{ih}{\varvec{Z}}_i^T{\varvec{V}}_i^{-1}({\varvec{y}}_i-{\varvec{X}}_i{\varvec{\beta }}- {\varvec{B}}_i{\varvec{T}}{\varvec{\gamma }}_0 \!-\! {\varvec{B}}_i{\varvec{W}}{\varvec{\gamma }}_p)\right) \!. \end{aligned}$$

For the inverse smoothing parameter \(\tau ^2\) one gets the update

$$\begin{aligned} \tau ^2 = \frac{1}{d-k}{\varvec{\gamma }}_p^T{\varvec{\gamma }}_p. \end{aligned}$$

To obtain the constraint \(\sum _{h=1}^N\pi _h{\varvec{\mu }}_h = \mathbf {0}\), in each M-step deviations from this restriction are subtracted from \({\varvec{\mu }}_h\), \(h=1,\ldots ,N\). But it should be noted that these deviations could only be added to the unpenalized spline coefficients \({\varvec{\gamma }}_0\) in the case of the decomposition (6) with equidistant knots and if \(q \le k\), i.e. if the dimension of the random effects is equal to or smaller than the order \(k\) of the penalty matrix. For other cases we propose the following simple but effective strategy: We center the cluster centers followed by an immediate update of the basis coefficients so that the P-spline parameters can absorb the general time trend. For a correct update of the variance parameters the uncentered cluster centers should be used in the working response.

For the simultaneous maximization of the variance parameters \(\sigma ^2\) and \({\varvec{D}}\), given \({\varvec{\beta }}\), \({\varvec{\gamma }}_0\), \({\varvec{\gamma }}_p\), \({\varvec{\mu }}_1,\ldots ,{\varvec{\mu }}_N\) and \(\tau ^2\) the algorithm AS 47 of O’Neill (1971) in the C++ version (Burkhardt 2008) is used, which is an implementation of the Nelder–Mead algorithm (Nelder and Mead 1965). In this optimization procedure we choose for the reflection, extension and contraction coefficients the common settings 1.0, 2.0 and 0.5. Note that the covariance matrix \({\varvec{D}}\) is parameterized by \({\varvec{D}}= {\varvec{L}}{\varvec{L}}^T\) because then the matrix is automatically nonnegative-definite and even positive-definite (and so invertible, too) if \({\varvec{L}}\) is a matrix with exclusively nonzero diagonal entries (Lindstrom and Bates 1988).

Rights and permissions

Reprints and Permissions

About this article

Verify currency and authenticity via CrossMark

Cite this article

Heinzl, F., Tutz, G. Additive mixed models with approximate Dirichlet process mixtures: the EM approach. Stat Comput 26, 73–92 (2016).

Download citation


  • Additive mixed models
  • Dirichlet process mixture
  • EM algorithm
  • Penalized splines
  • Stick breaking