Estimating Multilevel Models on Data Streams

Social scientists are often faced with data that have a nested structure: pupils are nested within schools, employees are nested within companies, or repeated measurements are nested within individuals. Nested data are typically analyzed using multilevel models. However, when data sets are extremely large or when new data continuously augment the data set, estimating multilevel models can be challenging: the current algorithms used to fit multilevel models repeatedly revisit all data points and end up consuming much time and computer memory. This is especially troublesome when predictions are needed in real time and observations keep streaming in. We address this problem by introducing the Streaming Expectation Maximization Approximation (SEMA) algorithm for fitting multilevel models online (or “row-by-row”). In an extensive simulation study, we demonstrate the performance of SEMA compared to traditional methods of fitting multilevel models. Next, SEMA is used to analyze an empirical data stream. The accuracy of SEMA is competitive to current state-of-the-art methods while being orders of magnitude faster. Electronic supplementary material The online version of this article (10.1007/s11336-018-09656-z) contains supplementary material, which is available to authorized users.


Introduction
In this supplementary matrial additional, results of the first simulation study and the applications, and two additional simulation studies are presented. First are the results presented of the simulation study which is presented in the paper. We first show the estimated coefficients of the fixed effects, followed by the variances of the random effects and the correlations between these random effects. Then, Section 3 contains the remaining parameters of the application: fluctuations in weight by Koorenman and Scherpenzeel (2014). Here, we present the estimated coefficients of the fixed effects of this model (except Monday which is in the original paper), the same holds for the variance of the random effects. We also present the correlations between the random effects and, lastly, the residual variance. This document finishes with 2 smaller studies: 1) we show that even if starting values are less favourable SEMA preforms adequately, and 2) when there is a dependency between the value of the random effect b j and the point of entering in the data stream, parameter estimates are only slightly affected.
2 Results simulation study 1

Fixed effects
Here, we present all the results of the simulation study, which due to the large number of results would make the paper too lengthy. The simulation study consists of 4 conditions, which differ as follows: • Condition A: Random intercept (i.e., 1 random effect) • Condition B: 5 Random Effects -no correlation between the random effects • Condition C: 5 Random Effects -correlation between the random effects is equal to .15 • Condition D: 5 Random Effects -correlation between the random effects is equal to .5 All conditions of the simulation study contain 15 fixed effect coefficients: Intercept, 3 continuous level 2 variables, 1 dummy variable (like 'gender') and 1 categorical variable with 3 categories, i.e., 2 dummy variables, 5 continuous level 1 variables and 1 categorical variable with 4 categories, i.e., 3 dummy variables. The following 15 pages contain the estimated parameter estimates of all fixed effects. You find the data generating values in each of the figures captions. All figures are structured similarly: Condition A is top left, Condition B is top right, Condition C is bottom left, and Condition D is bottom right. All figures contain a black line which indicates the true value. Especially, for the estimates of the fixed effects, all four methods produced very similar results which makes it nearly impossible to identify the separate methods. To differentiate the four methods, each line contains symbols: • the green line with triangle is SEMA Update, • the red line with '×' is EM, • and the purple line with solid circle is Sliding Window EM.
Furthermore, each figure contains 2 sets of 4 bars. The first set is the 95% empirical interval of the simulation study at n = 25, 000, the second set of bars indicate the 95% empirical interval at the end of the data streams. The color/symbol combinations identifying the different methods are also used for the bars, see also the example Figure 1. Note that from one page to the other, the Y -axes differ due to different data generating values and amount of true variance (with which the data were generated) or sampling variance.

Random effects
The following figures are the results of the random effects. The random effects consist of 5 variances of the random effects and 10 correlations between the random effects. We start with the presentation of the variances (Fig. 16 -20), followed by the figures containing the results of the correlations (Fig. 21 -50). Contrary to the figures of the fixed effects and variances, we present the correlations between the random effects per condition. Hence In these figures, you can clearly see the updating scheme of the different methods: while the SWEM and EM methods clearly show steps from update to update, the SEMA update only shows smooth change in between the updates (where it shows more profound changes in parameter estimates), and lastly, SEMA has a smooth curve retrieving the data generating parameters slightly later than the previous methods.

.1 Design
This simulation study is designed to illustrate SEMA can fit a multilevel model without using a training set and still obtain good parameter estimates. We compare the parameter estimates from SEMA with those provided by a popular R package to fit multilevel models: lmer from the lme4 package (lme4, Bates, Mächler, Bolker, and Walker, 2015). The estimation method used in the lme4 package is coined "bobyqa" (acronym for Bound Optimization BY Quadratic Approximation; Powell, 2009). This algorithm finds the best fitting parameter values by iteratively approximating the likelihood function using quadratic approximation, i.e., this algorithm does not use first or second order derivatives. We explicitly study the effect of three factors: the number of observations per individual (n j ), the number of random effects (r), and the number of level 1 covariates (lvl 1 ).
The number of observations n j is an important contributor to the reliability of the estimates of b j : more observations per individual results in less uncertainty. Additionally, when an individual is observed more often, outdated contributions to the CDSS are more often revised than in a scenario where the number of observations is limited. Therefore, we expect SEMA will obtain accurate parameter estimates in conditions with a higher number of observations using less data points than in conditions where individuals are observed less often. The second factor, the number of random effects (r) also influences the uncertainty: the information in the data is spread out over the number of variables. We expect that more random effects will, therefore, influence SEMA such that it has to evaluate more data points to obtain good parameter estimates than in the condition where the number of random effects is small. Lastly, for the number of covariates on the first level (lvl 1 ), we have similar expectations for the number of random effects: more fixed effects will lead to a slower retrieval (i.e., more data points will have to enter) of the data-generating parameters because the information is spread out over more parameters. The three factors are all crossed, however, we will not let the number random effects exceed the number of covariates.
The data streams are generated with variance terms of the random effects equal to 4 and 9 in the case of r = 2, and in the case of r = 7: φ 2 = 4, 9, 16, 2.25, 6.25, 12.25, and 20.25 respectively. The fixed effects are generated with the following parameter values: β = 3.5, −4.5, and −5.5 for scenarios with 3 level one fixed effects and 3.5, 4.5, −5.5, −2.5, −3.5, −4.5, 5.5, and 6.5 for 8 level 1 fixed effects. Additionally, the residual variance, the number of level 2 variables, and the length of the data stream were fixed across the conditions (with σ 2 = 25, β = 1.5 and 2.5; and n = 50, 000 data points). In order to illustrate that SEMA performs well, even with start values which are not ideal, the start values of all parameters were set equal to 1. Due to the computational complexity of the offline fitting procedure, the Lmer function is fitted to the data streams only every n = 1, 000 data points instead of after each data point. Each condition was replicated 100 times.

Results
In Figure 51, the average estimated variance terms of both σ 2 and the variance of the random intercept and slope over the 100 replications are presented. On the x-axes the length of the data stream are presented, and on the y-axes the parameter estimates are presented. In the top right corner of each figure, the simulation condition is written. The gray lines represent the parameter estimates obtained using Lmer, the black lines the parameters estimates of SEMA. The Lmer function was not always able to converge in the beginning of the data stream, in these cases the gray lines are omitted from the figure. A comparison between the n j = 50 and the n j = 10 conditions shows that SEMA rapidly approaches Lmer's parameter estimates, especially when the number of observations for each individual is large. Furthermore, SEMA provides estimates even when Lmer is unable to converge. There is hardly any difference between including 3 level 1 predictors or 8 level 1 predictors: the top two figures and the two figures in the middle row are, given the number of observations per individual, very similar.
The bottom two figures deviate from the figures above since these conditions are, even for the Lmer algorithm, very difficult to fit. In the bottom left figure, Lmer is only able to fit the model when at least 34,000 data points are available. Even when Lmer is able to fit the model, Lmer cycled through the data thousands of times to obtain convergence. While Lmer is able to fit the model using less data in the lower right panel than in the lower left panel, still Lmer revisited the same data thousands of times to fit the model and hence took (very) long times to compute. Comparing the results of SEMA in the panel in the lower left panel with the results of SEMA in the other panels, it is clear that this lower left panel (condition: n j = 10, r = 7, lvl 1 = 8) is, as expected, a difficult condition. Especially in the extremely challenging condition where only 10 observations per individual are available to estimate a large number of fixed and random effects, it is clear that the parameter estimates of SEMA have not yet converged, even at the end of the data stream. However, when there is more information available per individual (n j = 50), SEMA performs much better (lower right panel) than in the condition with n j = 10.
Next, we present three tables with the parameter estimates averaged over the replications, the standard deviation over the replications and the 95% interval based on the empirical distribution of results of the simulation study (percentiles). In Table 1, we present the estimates of two of the fixed effects estimated by SEMA and Lmer at two points in the data stream. Since the (qualitative) behavior is similar across all fixed effects, we choose to present only these two. Across conditions, we can conclude from Table 1 that the fixed effects are estimated well by both Lmer and SEMA, with the mere difference that the SD's of SEMA are slightly larger (however, SEMA is magnitudes faster).
The estimates of the variance of the random intercept and one of the random slopes (φ 2 = 4, and 9) are presented in Table 2. Clearly, these variance terms are more difficult to estimate for SEMA than the fixed effects. While Lmer retrieves the true values as soon as it is able to fit the model, SEMA needs more data to obtain good estimates of the variance terms. Finally, in Table 3, the estimates of the residual variance (σ 2 = 25) are presented. The same conditions which showed a large SD forφ, have a large SD for the estimates ofσ 2 . Overall, SEMA and Lmer produce very similar estimates of σ 2 , although the condition of n j = 10, r = 7, lvl 1 = 8 with start values equal to 1, remains difficult for SEMA even when 50,000 data points have entered.

Ordering of data points
Here, we illustrate that the ordering of the data points does not influence the end results of the parameter estimates. To do so, we generated a data stream consisting of n = 1, 000, 000 and J = 20, 000 and examine the effects of reordering the data during the stream. The data were generated using a random intercept model. The values of the fixed effects are: 100.0, 0.1, 0.5, 0.9, 1.3, 1.7, 2.1, 2.5, 2.9, 3.3, 3.7, 4.1, 4.5, 4.9, and 5.3. The variance of the random intercept is 50, and the residual variance was equal to 5. The data set was reordered 11 times, using the correlation between the b j 's and the moment of entering in the data stream. The correlation is varied from nearly perfectly positive to nearly perfectly negative. Even though there are differences between the different orderings, overall the parameter estimates are very comparable across orderings.   Table (2) The variability of the estimates of φ, across replications of Lmer and SEMA (the "-" indicate that Lmer failed to converge in this condition).