Random Forests with Latent Variables to Foster Feature Selection in the Context of Highly Correlated Variables. Illustration with a Bioinformatics Application.

  • Conference paper

Part of the book series: Lecture Notes in Computer Science (LNISA, volume 11191)

Abstract

The random forest model is a popular framework used in classification and regression. When the variables are densely dependent on one another, it may be beneficial to capture these dependences through latent variables, which are then used to build the random forest. In this paper, we present Sylva, a generalization of the T-Trees model (Botta et al., 2008), the only attempt so far to integrate latent variables into the random forest learning scheme. Sylva is an innovative hybrid approach in which an adapted random forest framework benefits from the modeling of dependences via FLTM, a forest of latent tree models (Mourad et al., 2011). The FLTM model drives the on-the-fly generation of the latent variables used to learn the random forest. In the unprecedented large-scale study reported here, Sylva, instantiated with different clustering methods, is compared to T-Trees on high-dimensional real-world datasets in the context of genetic association studies. We show that the already high predictive power of T-Trees is not significantly increased by Sylva. In contrast, in Sylva, the distribution of the importance measure for top-ranked variables is significantly skewed towards higher values than in T-Trees, which meets the feature selection objective.
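The general idea of summarizing groups of highly correlated variables before learning a forest can be illustrated with a minimal sketch. It is not the Sylva or FLTM implementation: the correlation-threshold grouping stands in for the clustering methods mentioned above, the cluster mean stands in for a learned latent variable, and the function names (`cluster_correlated`, `latent_summaries`), the threshold value and the synthetic data are all assumptions made for illustration. The random forest step itself is omitted.

```python
import numpy as np

def cluster_correlated(X, threshold=0.8):
    """Greedy grouping (illustrative stand-in for a real clustering method):
    each unassigned column seeds a cluster and absorbs every remaining
    column whose Pearson correlation with the seed is >= threshold."""
    corr = np.corrcoef(X, rowvar=False)
    unassigned = list(range(X.shape[1]))
    clusters = []
    while unassigned:
        seed = unassigned.pop(0)
        members = [seed] + [j for j in unassigned if corr[seed, j] >= threshold]
        unassigned = [j for j in unassigned if j not in members]
        clusters.append(members)
    return clusters

def latent_summaries(X, clusters):
    """One column per cluster: the mean of the standardized member columns.
    This is a crude proxy for a latent variable, not a latent tree model.
    Assumes no column of X is constant (non-zero standard deviation)."""
    Z = (X - X.mean(axis=0)) / X.std(axis=0)
    return np.column_stack([Z[:, c].mean(axis=1) for c in clusters])

# Synthetic data (assumption): two independent signals, each observed
# three times with small noise, giving two blocks of correlated columns.
rng = np.random.default_rng(0)
base = rng.normal(size=(200, 2))
noise = 0.1 * rng.normal(size=(200, 6))
X = np.column_stack([base[:, 0]] * 3 + [base[:, 1]] * 3) + noise

clusters = cluster_correlated(X)
L = latent_summaries(X, clusters)
print(clusters, L.shape)
```

In a pipeline of this kind, the columns of `L` would be passed to a forest learner in place of, or alongside, the original variables, so that an importance score is attributed to a whole block of correlated variables instead of being diluted across its members.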

This work was supported by the French National Research Agency (ANR SAMOGWAS project). The software development and the experiments were carried out in part at the CCIPL (Centre de Calcul Intensif des Pays de la Loire, Nantes, France). C. Sinoquet thanks V. Botta for his expert advice on the T-Trees model, and C. Kemps for her help in the preparation of the data. C. Kemps was supported by a grant from the GRIOTE project, funded by the Pays de la Loire Region.

References

  1. Ben-Dor, A., Shamir, R., Yakhini, Z.: Clustering gene expression patterns. In: Proceedings of the 3rd Annual International Conference on Computational Molecular Biology (RECOMB), pp. 33–42 (1999)

  2. Bessière, P., Mazer, E., Ahuactzin, J.-M., Mekhnacha, K.: Bayesian Programming. Chapman and Hall/CRC, Boca Raton (2013)

  3. Blondel, V.D., Guillaume, J.-L., Lambiotte, R., Lefebvre, E.: Fast unfolding of communities in large networks. J. Stat. Mech.: Theory Exp. 10, P10008 (2008)

  4. Botta, V.: A walk into random forests. Adaptation and application to Genome-Wide Association Studies. Ph.D. Thesis, University of Liège, Belgium (2013)

  5. Botta, V., Louppe, G., Geurts, P., Wehenkel, L.: Exploiting SNP correlations within random forest for genome-wide association studies. PLOS ONE 9(4), e93379 (2014)

  6. Breiman, L.: Random forests. Mach. Learn. 45(1), 5–32 (2001)

  7. Chambers, J.M., Cleveland, W.S., Kleiner, B., Tukey, P.A.: Graphical Methods for Data Analysis. CRC Press, Boca Raton (1983)

  8. Ester, M., Kriegel, H.-P., Sander, J., Xu, X.: A density-based algorithm for discovering clusters in large spatial databases with noise. In: Proceedings of the 2nd International Conference on Knowledge Discovery and Data Mining (KDD), pp. 226–231 (1996)

  9. Geurts, P., Ernst, D., Wehenkel, L.: Extremely randomized trees. Mach. Learn. 63(1), 3–42 (2006)

  10. Gregorutti, B., Michel, B., Saint-Pierre, P.: Correlation and variable importance in random forests. Stat. Comput. 27(3), 659–678 (2017)

  11. Louppe, G.: Understanding random forests: from theory to practice. Ph.D. Thesis, University of Liège, Belgium (2014)

  12. Louppe, G., Wehenkel, L., Sutera, A., Geurts, P.: Understanding variable importances in forests of randomized trees. In: Burges, C.J.C., Bottou, L., Welling, M., Ghahramani, Z., Weinberger, K.Q. (eds.), Proceedings of Advances in Neural Information Processing Systems 26 (NIPS), pp. 431–439 (2013)

  13. Mourad, R., Sinoquet, C., Leray, P.: A hierarchical Bayesian network approach for linkage disequilibrium modeling and data-dimensionality reduction prior to genome-wide association studies. BMC Bioinform. 12(1), 16 (2011)

  14. Mourad, R., Sinoquet, C., Zhang, N.L., Liu, T., Leray, P.: A survey on latent tree models and applications. J. Artif. Intell. Res. 47, 157–203 (2013)

  15. Phan, D.-T., Leray, P., Sinoquet, C.: Modeling genetical data with forests of latent trees for applications in association genetics at a large scale. Which clustering should be chosen? In: Proceedings of the 6th International Conference on Bioinformatics Models, Methods and Algorithms (Bioinformatics), pp. 5–16. Lisbon, Portugal (2015)

  16. ProBT Website. http://www.probayes.com/fr/recherche/probt/

  17. Robin, X., et al.: pROC: an open-source package for R and S+ to analyze and compare ROC curves. BMC Bioinform. 12, 77 (2011)

  18. Schwarz, G.E.: Estimating the dimension of a model. Ann. Stat. 6(2), 461–464 (1978)

  19. Sinoquet, C.: A method combining a random forest-based technique with the modeling of linkage disequilibrium through latent variables, to run multilocus genome-wide association studies. BMC Bioinform. 19, 106 (2018)

  20. Strobl, C., Boulesteix, A.-L., Kneib, T., Augustin, T., Zeileis, A.: Conditional variable importance for random forests. BMC Bioinform. 9, 307 (2008)

  21. sylvestra++ Website. https://www.ls2n.fr/listelogicielsequipe/DUKe/134/SYLVESTRA++

  22. WTCCC Website. http://www.wtccc.org.uk/

Author information

Corresponding author

Correspondence to Christine Sinoquet.

Copyright information

© 2018 Springer Nature Switzerland AG

About this paper

Cite this paper

Sinoquet, C., Mekhnacha, K. (2018). Random Forests with Latent Variables to Foster Feature Selection in the Context of Highly Correlated Variables. Illustration with a Bioinformatics Application. In: Duivesteijn, W., Siebes, A., Ukkonen, A. (eds) Advances in Intelligent Data Analysis XVII. IDA 2018. Lecture Notes in Computer Science, vol 11191. Springer, Cham. https://doi.org/10.1007/978-3-030-01768-2_24

  • DOI: https://doi.org/10.1007/978-3-030-01768-2_24

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-030-01767-5

  • Online ISBN: 978-3-030-01768-2

  • eBook Packages: Computer Science, Computer Science (R0)
