Machine Learning, Volume 86, Issue 2, pp 169–207

RE-EM trees: a data mining approach for longitudinal and clustered data

Abstract

Longitudinal data arise when repeated observations are available for each sampled object. Clustered data, where observations are nested in a hierarchical structure within objects (without time necessarily being involved), represent a similar situation. Methodologies that take this structure into account allow for systematic differences between objects that are not related to attributes, and for autocorrelation within objects across time periods. A standard methodology in the statistics literature for this type of data is the mixed effects model, where these differences between objects are represented by so-called “random effects” that are estimated from the data (population-level relationships are termed “fixed effects,” together resulting in a mixed effects model). This paper presents a methodology that combines the structure of mixed effects models for longitudinal and clustered data with the flexibility of tree-based estimation methods. We apply the resulting estimation method, called the RE-EM tree, to pricing in online transactions, showing that the RE-EM tree is less sensitive to parametric assumptions and provides improved predictive power compared to linear models with random effects and regression trees without random effects. We also apply it to a smaller data set examining accident fatalities, and show that the RE-EM tree strongly outperforms a tree without random effects while performing comparably to a linear model with random effects. Finally, extensive simulation experiments show that the estimator improves predictive performance relative to regression trees without random effects and is comparable or superior to linear models with random effects in more general situations.
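
The combination described above can be illustrated with a deliberately simplified sketch. It is not the authors' implementation, and the abstract does not spell out the estimation steps; the alternating scheme below is an assumption made for illustration. A one-split "stump" stands in for a full regression tree, each object receives only a random intercept, and the variance ratio `shrink` (error variance divided by random effect variance) is assumed known rather than estimated. The names `fit_stump` and `fit_random_intercept_tree` are hypothetical.

```python
from statistics import mean


def fit_stump(xs, ys):
    """Fit a one-split regression 'tree' by least squares (stand-in for a full tree)."""
    best = None
    for threshold in sorted(set(xs))[1:]:
        left = [y for x, y in zip(xs, ys) if x < threshold]
        right = [y for x, y in zip(xs, ys) if x >= threshold]
        left_mean, right_mean = mean(left), mean(right)
        sse = sum((y - left_mean) ** 2 for y in left)
        sse += sum((y - right_mean) ** 2 for y in right)
        if best is None or sse < best[0]:
            best = (sse, threshold, left_mean, right_mean)
    if best is None:  # only one distinct x value: predict the overall mean
        overall = mean(ys)
        return lambda x: overall
    _, threshold, left_mean, right_mean = best
    return lambda x: left_mean if x < threshold else right_mean


def fit_random_intercept_tree(groups, xs, ys, shrink=1.0, n_iter=20):
    """Alternate between the tree (fixed part) and per-object random intercepts.

    shrink is the assumed-known ratio of error variance to random effect
    variance; a real mixed effects fit would estimate it from the data.
    """
    intercepts = {g: 0.0 for g in set(groups)}
    tree = None
    for _ in range(n_iter):
        # Step 1: remove the current random intercepts, then fit the tree.
        adjusted = [y - intercepts[g] for g, y in zip(groups, ys)]
        tree = fit_stump(xs, adjusted)
        # Step 2: re-estimate each intercept as a shrunken mean residual.
        for g in intercepts:
            resid = [y - tree(x) for gg, x, y in zip(groups, xs, ys) if gg == g]
            intercepts[g] = sum(resid) / (len(resid) + shrink)
    return tree, intercepts


# Toy data: two objects share a step at x = 0.5 but differ by a constant offset.
groups = ["A"] * 4 + ["B"] * 4
xs = [0.1, 0.2, 0.8, 0.9] * 2
ys = [2.0, 2.0, 12.0, 12.0, -2.0, -2.0, 8.0, 8.0]
tree, intercepts = fit_random_intercept_tree(groups, xs, ys)
```

In this toy setting the tree recovers the shared step while the intercepts absorb the object-level offsets, shrunk toward zero by the assumed variance ratio. A tree fitted without the intercept step would have to treat those offsets as noise.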

Keywords

Clustered data, Longitudinal data, Panel data, Mixed effects model, Random effects, Regression tree, CART

Copyright information

© The Author(s) 2011

Authors and Affiliations

  1. J.P. Morgan Chase & Co., Columbus, USA
  2. Statistics Group, Information, Operations, and Management Sciences Department, Leonard N. Stern School of Business, New York University, New York, USA
