Ensemble methods are among the state-of-the-art predictive modeling approaches. Applied to modern big data, these methods often require a large number of sub-learners, where the complexity of each learner typically grows with the size of the dataset. This phenomenon results in an increasing demand for storage space, which may be very costly. This problem mostly manifests in a subscriber-based environment, where a user-specific ensemble needs to be stored on a personal device with strict storage limitations (such as a cellular device). In this work we introduce a novel method for lossless compression of tree-based ensemble methods, focusing on random forests. Our suggested method is based on probabilistic modeling of the ensemble’s trees, followed by model clustering via Bregman divergence. This allows us to find a minimal set of models that provides an accurate description of the trees, and at the same time is small enough to store and maintain. Our compression scheme demonstrates high compression rates on a variety of modern datasets. Importantly, our scheme enables predictions from the compressed format and a perfect reconstruction of the original ensemble. In addition, we introduce a theoretically sound lossy compression scheme, which allows us to control the trade-off between the distortion and the coding rate.
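The clustering step described in the abstract can be illustrated with a minimal sketch. It is not the authors' algorithm: it assumes that each tree node is summarised by an empirical distribution over a small symbol alphabet (for example, which feature it splits on), that the Bregman divergence used is the Kullback-Leibler divergence, and that initialisation is farthest-first. All function names (`kl_bits`, `weighted_mean`, `bregman_cluster`, `code_length_bits`) and the toy data are hypothetical.

```python
import math

def kl_bits(p, q):
    """KL divergence D(p || q) in bits; infinite if q misses mass that p has."""
    total = 0.0
    for pi, qi in zip(p, q):
        if pi > 0:
            if qi == 0:
                return math.inf
            total += pi * math.log2(pi / qi)
    return total

def weighted_mean(dists, weights):
    """Weighted arithmetic mean: the minimiser of sum_i w_i * D(p_i || q) over q."""
    total = sum(weights)
    return [sum(w * d[j] for d, w in zip(dists, weights)) / total
            for j in range(len(dists[0]))]

def bregman_cluster(dists, weights, n_clusters, n_iter=20):
    """Lloyd-style clustering with KL divergence as the Bregman divergence."""
    # Farthest-first initialisation (an assumption made here for determinism).
    centers = [list(dists[0])]
    while len(centers) < n_clusters:
        far = max(dists, key=lambda p: min(kl_bits(p, c) for c in centers))
        centers.append(list(far))
    labels = []
    for _ in range(n_iter):
        # Assignment step: each distribution goes to its closest model.
        labels = [min(range(n_clusters), key=lambda c: kl_bits(p, centers[c]))
                  for p in dists]
        # Update step: each model becomes the weighted mean of its members.
        for c in range(n_clusters):
            idx = [i for i, lab in enumerate(labels) if lab == c]
            if idx:
                centers[c] = weighted_mean([dists[i] for i in idx],
                                           [weights[i] for i in idx])
    return labels, centers

def code_length_bits(dists, weights, labels, centers):
    """Ideal entropy-coding cost: sum_i w_i * cross-entropy(p_i, q_label(i))."""
    total = 0.0
    for p, w, lab in zip(dists, weights, labels):
        q = centers[lab]
        total += w * -sum(pi * math.log2(qi) for pi, qi in zip(p, q) if pi > 0)
    return total

# Toy data: four node distributions over a 3-symbol alphabet, 10 symbols each.
dists = [[0.7, 0.2, 0.1], [0.6, 0.3, 0.1], [0.1, 0.2, 0.7], [0.1, 0.3, 0.6]]
weights = [10, 10, 10, 10]

labels, centers = bregman_cluster(dists, weights, n_clusters=2)
clustered_bits = code_length_bits(dists, weights, labels, centers)
single_bits = code_length_bits(dists, weights, [0] * 4,
                               [weighted_mean(dists, weights)])
print(labels, round(clustered_bits, 1), round(single_bits, 1))
```

The sketch counts only the bits needed to encode the symbols. A complete scheme would also charge for storing each model, which is what makes a small set of models preferable to one model per node.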
Cite this article
Painsky, A., Rosset, S. Lossless Compression of Random Forests. J. Comput. Sci. Technol. 34, 494–506 (2019). https://doi.org/10.1007/s11390-019-1921-0
- entropy coding
- lossless compression
- lossy compression
- random forest