Abstract
The random forest model is a popular framework used in classification and regression. In cases where dense dependences exist among the variables, it may be beneficial to capture these dependences through latent variables, which are then used to build the random forest. In this paper, we present Sylva, a generalization of the T-Trees model (Botta et al., 2008), the only attempt so far to integrate latent variables in the random forest learning scheme. Sylva is a hybrid approach in which an adapted random forest framework benefits from the modeling of dependences via FLTM, a forest of latent tree models (Mourad et al., 2011). The FLTM model drives the on-the-fly generation of the latent variables used to learn the random forest. In the large-scale study reported here, Sylva, instantiated with different clustering methods, is compared to T-Trees on high-dimensional real-world datasets in the context of genetic association studies. We show that the already high predictive power of T-Trees is not significantly increased by Sylva. In contrast, in Sylva, the importance measure distribution corresponding to top-ranked variables is significantly skewed towards higher values than in T-Trees, which meets the feature selection objective.
This work was supported by the French National Research Agency (ANR SAMOGWAS project). The software development and the experiments were performed in part at the CCIPL (Centre de Calcul Intensif des Pays de la Loire, Nantes, France). C. Sinoquet thanks V. Botta for his expert advice on the T-Trees model, and C. Kemps for her help in the preparation of the data. C. Kemps was supported by a grant from the GRIOTE project, funded by the Pays de la Loire Region.
References
Ben-Dor, A., Shamir, R., Yakhini, Z.: Clustering gene expression patterns. In Proceedings of the 3rd Annual International Conference on Computational Molecular Biology (RECOMB), pp. 33–42 (1999)
Bessière, P., Mazer, E., Ahuactzin, J.-M., Mekhnacha, K.: Bayesian Programming. Chapman and Hall/CRC, Boca Raton (2013)
Blondel, V.D., Guillaume, J.-L., Lambiotte, R., Lefebvre, E.: Fast unfolding of communities in large networks. J. Stat. Mech.: Theory Exp. 10, P10008 (2008)
Botta, V.: A walk into random forests. Adaptation and application to Genome-Wide Association Studies. Ph.D. Thesis, University of Liège, Belgium (2013)
Botta, V., Louppe, G., Geurts, P., Wehenkel, L.: Exploiting SNP correlations within random forest for genome-wide association studies. PLOS ONE 9(4), e93379 (2014)
Breiman, L.: Random forests. Mach. Learn. 45(1), 5–32 (2001)
Chambers, J.M., Cleveland, W.S., Kleiner, B., Tukey, P.A.: Graphical Methods for Data Analysis. CRC Press, Boca Raton (1983)
Ester, M., Kriegel, H.-P., Sander, J., Xu, X.: A density-based algorithm for discovering clusters in large spatial databases with noise. In Proceedings of the 2nd International Conference on Knowledge Discovery and Data Mining (KDD), pp. 226–231 (1996)
Geurts, P., Ernst, D., Wehenkel, L.: Extremely randomized trees. Mach. Learn. 63(1), 3–42 (2006)
Gregorutti, B., Michel, B., Saint-Pierre, P.: Correlation and variable importance in random forests. Stat. Comput. 27(3), 659–678 (2017)
Louppe, G.: Understanding random forests: from theory to practice. Ph.D. Thesis, University of Liège, Belgium (2014)
Louppe, G., Wehenkel, L., Sutera, A., Geurts, P.: Understanding variable importances in forests of randomized trees. In: Burges, C.J.C., Bottou, L., Welling, M., Ghahramani, Z., Weinberger, K.Q. (eds.), Proceedings of Advances in Neural Information Processing Systems 26 (NIPS), pp. 431–439 (2013)
Mourad, R., Sinoquet, C., Leray, P.: A hierarchical Bayesian network approach for linkage disequilibrium modeling and data-dimensionality reduction prior to genome-wide association studies. BMC Bioinform. 12(1), 16 (2011)
Mourad, R., Sinoquet, C., Zhang, N.L., Liu, T., Leray, P.: A survey on latent tree models and applications. J. Artif. Intell. Res. 47, 157–203 (2013)
Phan, D.-T., Leray, P., Sinoquet, C.: Modeling genetical data with forests of latent trees for applications in association genetics at a large scale. Which clustering should be chosen? In: Proceedings of the 6th International Conference on Bioinformatics Models, Methods and Algorithms (Bioinformatics), pp. 5–16. Lisbon, Portugal (2015)
ProBT Website. http://www.probayes.com/fr/recherche/probt/
Robin, X., et al.: pROC: an open-source package for R and S+ to analyze and compare ROC curves. BMC Bioinform. 12, 77 (2011)
Schwarz, G.E.: Estimating the dimension of a model. Ann. Stat. 6(2), 461–464 (1978)
Sinoquet, C.: A method combining a random forest-based technique with the modeling of linkage disequilibrium through latent variables, to run multilocus genome-wide association studies. BMC Bioinform. 19, 106 (2018)
Strobl, C., Boulesteix, A.-L., Kneib, T., Augustin, T., Zeileis, A.: Conditional variable importance for random forests. BMC Bioinform. 9, 307 (2008)
sylvestra++ Website. https://www.ls2n.fr/listelogicielsequipe/DUKe/134/SYLVESTRA++
WTCCC Website. http://www.wtccc.org.uk/
Copyright information
© 2018 Springer Nature Switzerland AG
About this paper
Cite this paper
Sinoquet, C., Mekhnacha, K. (2018). Random Forests with Latent Variables to Foster Feature Selection in the Context of Highly Correlated Variables. Illustration with a Bioinformatics Application. In: Duivesteijn, W., Siebes, A., Ukkonen, A. (eds) Advances in Intelligent Data Analysis XVII. IDA 2018. Lecture Notes in Computer Science, vol 11191. Springer, Cham. https://doi.org/10.1007/978-3-030-01768-2_24
DOI: https://doi.org/10.1007/978-3-030-01768-2_24
Publisher Name: Springer, Cham
Print ISBN: 978-3-030-01767-5
Online ISBN: 978-3-030-01768-2
eBook Packages: Computer Science, Computer Science (R0)