A random forest guided tour

Invited Paper

Abstract

The random forest algorithm, proposed by L. Breiman in 2001, has been extremely successful as a general-purpose classification and regression method. The approach, which combines several randomized decision trees and aggregates their predictions by averaging, has shown excellent performance in settings where the number of variables is much larger than the number of observations. Moreover, it is versatile enough to be applied to large-scale problems, is easily adapted to various ad hoc learning tasks, and returns measures of variable importance. The present article reviews the most recent theoretical and methodological developments for random forests. Emphasis is placed on the mathematical forces driving the algorithm, with special attention given to the selection of parameters, the resampling mechanism, and variable importance measures. This review is intended to provide non-experts with easy access to the main ideas.
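To make the mechanism sketched in the abstract concrete, here is a minimal Python illustration, not the authors' implementation: each tree is a CART-style regressor grown on a bootstrap sample, each split considers only a random subset of features, and the forest prediction is the average of the individual tree predictions. scikit-learn's DecisionTreeRegressor serves as the base learner; the names fit_forest, predict_forest, and mtry are illustrative.

```python
# Illustrative sketch of Breiman-style random forest regression,
# not the code of Biau and Scornet; assumes scikit-learn is installed.
import numpy as np
from sklearn.tree import DecisionTreeRegressor

def fit_forest(X, y, n_trees=100, mtry=None, seed=0):
    """Grow n_trees CART trees, each on a bootstrap sample of (X, y),
    trying mtry randomly chosen features at every split."""
    rng = np.random.default_rng(seed)
    n, p = X.shape
    if mtry is None:
        mtry = max(1, p // 3)  # a common default for regression forests
    forest = []
    for _ in range(n_trees):
        idx = rng.integers(0, n, size=n)  # bootstrap: n draws with replacement
        tree = DecisionTreeRegressor(
            max_features=mtry,  # random feature subset at each split
            random_state=int(rng.integers(2**31 - 1)),
        )
        tree.fit(X[idx], y[idx])
        forest.append(tree)
    return forest

def predict_forest(forest, X):
    """Aggregate the randomized trees by averaging their predictions."""
    return np.mean([tree.predict(X) for tree in forest], axis=0)
```

Setting mtry equal to the total number of features reduces the procedure to bagging (Breiman 1996), while smaller values decorrelate the trees; the choice of this parameter is among the tuning questions examined in the review.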

References

  • Amaratunga D, Cabrera J, Lee Y-S (2008) Enriched random forests. Bioinformatics 24:2010–2014

  • Amit Y, Geman D (1997) Shape quantization and recognition with randomized trees. Neural Comput 9:1545–1588

  • Archer KJ, Kimes RV (2008) Empirical characterization of random forest variable importance measures. Comput Stat Data Anal 52:2249–2260

  • Arlot S, Genuer R (2014) Analysis of purely random forests bias. arXiv:1407.3939

  • Auret L, Aldrich C (2011) Empirical comparison of tree ensemble variable importance measures. Chemom Intell Lab Syst 105:157–170

  • Bai Z-H, Devroye L, Hwang H-K, Tsai T-H (2005) Maxima in hypercubes. Random Struct Algorithms 27:290–309

  • Banerjee M, McKeague IW (2007) Confidence sets for split points in decision trees. Ann Stat 35:543–574

  • Barndorff-Nielsen O, Sobel M (1966) On the distribution of the number of admissible points in a vector random sample. Theory Probab Appl 11:249–269

  • Bengio Y (2009) Learning deep architectures for AI. Found Trends Mach Learn 2:1–127

  • Bernard S, Heutte L, Adam S (2008) Forest-RK: a new random forest induction method. In: Huang D-S, Wunsch DC II, Levine DS, Jo K-H (eds) Advanced intelligent computing theories and applications. With aspects of artificial intelligence. Springer, Berlin, pp 430–437

  • Bernard S, Adam S, Heutte L (2012) Dynamic random forests. Pattern Recognit Lett 33:1580–1586

  • Biau G (2012) Analysis of a random forests model. J Mach Learn Res 13:1063–1095

  • Biau G, Devroye L (2010) On the layered nearest neighbour estimate, the bagged nearest neighbour estimate and the random forest method in regression and classification. J Multivar Anal 101:2499–2518

  • Biau G, Devroye L (2013) Cellular tree classifiers. Electron J Stat 7:1875–1912

  • Biau G, Devroye L, Lugosi G (2008) Consistency of random forests and other averaging classifiers. J Mach Learn Res 9:2015–2033

  • Biau G, Cérou F, Guyader A (2010) On the rate of convergence of the bagged nearest neighbor estimate. J Mach Learn Res 11:687–712

  • Boulesteix A-L, Janitza S, Kruppa J, König IR (2012) Overview of random forest methodology and practical guidance with emphasis on computational biology and bioinformatics. Wiley Interdiscip Rev Data Mining Knowl Discov 2:493–507

  • Breiman L (1996) Bagging predictors. Mach Learn 24:123–140

  • Breiman L (2000a) Some infinity theory for predictor ensembles. Technical Report 577, University of California, Berkeley

  • Breiman L (2000b) Randomizing outputs to increase prediction accuracy. Mach Learn 40:229–242

  • Breiman L (2001) Random forests. Mach Learn 45:5–32

  • Breiman L (2003a) Setting up, using, and understanding random forests V3.1. https://www.stat.berkeley.edu/~breiman/Using_random_forests_V3.1.pdf

  • Breiman L (2003b) Setting up, using, and understanding random forests V4.0. https://www.stat.berkeley.edu/~breiman/Using_random_forests_v4.0.pdf

  • Breiman L (2004) Consistency for a simple model of random forests. Technical Report 670, University of California, Berkeley

  • Breiman L, Friedman JH, Olshen RA, Stone CJ (1984) Classification and regression trees. Chapman & Hall/CRC, Boca Raton

  • Bühlmann P, Yu B (2002) Analyzing bagging. Ann Stat 30:927–961

  • Chen C, Liaw A, Breiman L (2004) Using random forest to learn imbalanced data. Technical Report 666, University of California, Berkeley

  • Clémençon S, Depecker M, Vayatis N (2013) Ranking forests. J Mach Learn Res 14:39–73

  • Clémençon S, Vayatis N (2009) Tree-based ranking methods. IEEE Trans Inform Theory 55:4316–4336

  • Criminisi A, Shotton J, Konukoglu E (2011) Decision forests: a unified framework for classification, regression, density estimation, manifold learning and semi-supervised learning. Found Trends Comput Graph Vis 7:81–227

  • Crookston NL, Finley AO (2008) yaImpute: an R package for kNN imputation. J Stat Softw 23:1–16

  • Cutler A, Zhao G (2001) PERT—perfect random tree ensembles. Comput Sci Stat 33:490–497

  • Cutler DR, Edwards TC Jr, Beard KH, Cutler A, Hess KT, Gibson J, Lawler JJ (2007) Random forests for classification in ecology. Ecology 88:2783–2792

  • Davies A, Ghahramani Z (2014) The Random Forest Kernel and creating other kernels for big data from random partitions. arXiv:1402.4293

  • Dean J, Ghemawat S (2008) MapReduce: simplified data processing on large clusters. Commun ACM 51:107–113

  • Deng H, Runger G (2012) Feature selection via regularized trees. In: The 2012 international joint conference on neural networks, pp 1–8

  • Deng H, Runger G (2013) Gene selection with guided regularized random forest. Pattern Recognit 46:3483–3489

  • Denil M, Matheson D, de Freitas N (2013) Consistency of online random forests. In: International conference on machine learning (ICML)

  • Denil M, Matheson D, de Freitas N (2014) Narrowing the gap: random forests in theory and in practice. In: International conference on machine learning (ICML)

  • Désir C, Bernard S, Petitjean C, Heutte L (2013) One class random forests. Pattern Recognit 46:3490–3506

  • Devroye L, Györfi L, Lugosi G (1996) A probabilistic theory of pattern recognition. Springer, New York

  • Díaz-Uriarte R, Alvarez de Andrés S (2006) Gene selection and classification of microarray data using random forest. BMC Bioinform 7:1–13

  • Dietterich TG (2000) Ensemble methods in machine learning. In: Kittler J, Roli F (eds) Multiple classifier systems. Springer, Berlin, pp 1–15

  • Efron B (1979) Bootstrap methods: another look at the jackknife. Ann Stat 7:1–26

  • Efron B (1982) The jackknife, the bootstrap and other resampling plans, vol 38. CBMS-NSF Regional Conference Series in Applied Mathematics, Philadelphia

  • Fink D, Hochachka WM, Zuckerberg B, Winkler DW, Shaby B, Munson MA, Hooker G, Riedewald M, Sheldon D, Kelling S (2010) Spatiotemporal exploratory models for broad-scale survey data. Ecol Appl 20:2131–2147

  • Genuer R (2012) Variance reduction in purely random forests. J Nonparametr Stat 24:543–562

  • Genuer R, Poggi J-M, Tuleau-Malot C (2010) Variable selection using random forests. Pattern Recognit Lett 31:2225–2236

  • Geremia E, Menze BH, Ayache N (2013) Spatially adaptive random forests. In: IEEE international symposium on biomedical imaging: from nano to macro, pp 1332–1335

  • Geurts P, Ernst D, Wehenkel L (2006) Extremely randomized trees. Mach Learn 63:3–42

  • Gregorutti B, Michel B, Saint Pierre P (2016) Correlation and variable importance in random forests. Stat Comput. doi:10.1007/s11222-016-9646-1

  • Guyon I, Weston J, Barnhill S, Vapnik V (2002) Gene selection for cancer classification using support vector machines. Mach Learn 46:389–422

  • Györfi L, Kohler M, Krzyżak A, Walk H (2002) A distribution-free theory of nonparametric regression. Springer, New York

  • Hastie T, Tibshirani R, Friedman J (2009) The elements of statistical learning, 2nd edn. Springer, New York

  • Ho T (1998) The random subspace method for constructing decision forests. IEEE Trans Pattern Anal Mach Intell 20:832–844

  • Hothorn T, Hornik K, Zeileis A (2006) Unbiased recursive partitioning: a conditional inference framework. J Comput Graph Stat 15:651–674

  • Howard J, Bowles M (2012) The two most important algorithms in predictive modeling today. In: Strata Conference: Santa Clara. http://strataconf.com/strata2012/public/schedule/detail/22658

  • Ishioka T (2013) Imputation of missing values for unsupervised data using the proximity in random forests. In: eLmL 2013, the fifth international conference on mobile, hybrid, and on-line learning, pp 30–36. International Academy, Research, and Industry Association

  • Ishwaran H (2007) Variable importance in binary regression trees and forests. Electron J Stat 1:519–537

  • Ishwaran H (2013) The effect of splitting on random forests. Mach Learn 99:75–118

  • Ishwaran H, Kogalur UB (2010) Consistency of random survival forests. Stat Probab Lett 80:1056–1064

  • Ishwaran H, Kogalur UB, Blackstone EH, Lauer MS (2008) Random survival forests. Ann Appl Stat 2:841–860

  • Ishwaran H, Kogalur UB, Chen X, Minn AJ (2011) Random survival forests for high-dimensional data. Stat Anal Data Mining ASA Data Sci J 4:115–132

  • Joly A, Geurts P, Wehenkel L (2014) Random forests with random projections of the output space for high dimensional multi-label classification. In: Calders T, Esposito F, Hüllermeier E, Meo R (eds) Machine learning and knowledge discovery in databases. Springer, Berlin, pp 607–622

  • Kim H, Loh W-Y (2001) Classification trees with unbiased multiway splits. J Am Stat Assoc 96:589–604

  • Kleiner A, Talwalkar A, Sarkar P, Jordan MI (2014) A scalable bootstrap for massive data. J Royal Stat Soc Ser B (Stat Methodol) 76:795–816

  • Konukoglu E, Ganz M (2014) Approximate false positive rate control in selection frequency for random forest. arXiv:1410.2838

  • Kruppa J, Schwarz A, Arminger G, Ziegler A (2013) Consumer credit risk: individual probability estimates using machine learning. Expert Syst Appl 40:5125–5131

  • Kruppa J, Liu Y, Biau G, Kohler M, König IR, Malley JD, Ziegler A (2014a) Probability estimation with machine learning methods for dichotomous and multicategory outcome: theory. Biometr J 56:534–563

  • Kruppa J, Liu Y, Diener H-C, Holste T, Weimar C, König IR, Ziegler A (2014b) Probability estimation with machine learning methods for dichotomous and multicategory outcome: applications. Biometr J 56:564–583

  • Kuhn M, Johnson K (2013) Applied predictive modeling. Springer, New York

  • Kyrillidis A, Zouzias A (2014) Non-uniform feature sampling for decision tree ensembles. In: IEEE international conference on acoustics, speech and signal processing, pp 4548–4552

  • Lakshminarayanan B, Roy DM, Teh YW (2014) Mondrian forests: efficient online random forests. In: Ghahramani Z, Welling M, Cortes C, Lawrence ND, Weinberger KQ (eds) Advances in neural information processing systems, pp 3140–3148

  • Latinne P, Debeir O, Decaestecker C (2001) Limiting the number of trees in random forests. In: Kittler J, Roli F (eds) Multiple classifier systems. Springer, Berlin, pp 178–187

  • Liaw A, Wiener M (2002) Classification and regression by randomForest. R News 2:18–22

  • Lin Y, Jeon Y (2006) Random forests and adaptive nearest neighbors. J Am Stat Assoc 101:578–590

  • Louppe G, Wehenkel L, Sutera A, Geurts P (2013) Understanding variable importances in forests of randomized trees. In: Burges CJC, Bottou L, Welling M, Ghahramani Z, Weinberger KQ (eds) Advances in neural information processing systems, pp 431–439

  • Malley JD, Kruppa J, Dasgupta A, Malley KG, Ziegler A (2012) Probability machines: consistent probability estimation using nonparametric learning machines. Methods Inform Med 51:74–81

  • Meinshausen N (2006) Quantile regression forests. J Mach Learn Res 7:983–999

  • Meinshausen N (2009) Forest Garrote. Electron J Stat 3:1288–1304

  • Mentch L, Hooker G (2014) A novel test for additivity in supervised ensemble learners. arXiv:1406.1845

  • Mentch L, Hooker G (2015) Quantifying uncertainty in random forests via confidence intervals and hypothesis tests. J Mach Learn Res (in press)

  • Menze BH, Kelm BM, Splitthoff DN, Koethe U, Hamprecht FA (2011) On oblique random forests. In: Gunopulos D, Hofmann T, Malerba D, Vazirgiannis M (eds) Machine learning and knowledge discovery in databases. Springer, Berlin, pp 453–469

  • Nadaraya EA (1964) On estimating regression. Theory Probab Appl 9:141–142

  • Nicodemus KK, Malley JD (2009) Predictor correlation impacts machine learning algorithms: implications for genomic studies. Bioinformatics 25:1884–1890

  • Politis DN, Romano JP, Wolf M (1999) Subsampling. Springer, New York

  • Prasad AM, Iverson LR, Liaw A (2006) Newer classification and regression tree techniques: bagging and random forests for ecological prediction. Ecosystems 9:181–199

  • Qian SS, King RS, Richardson CJ (2003) Two statistical methods for the detection of environmental thresholds. Ecol Model 166:87–97

  • Rieger A, Hothorn T, Strobl C (2010) Random forests with missing values in the covariates. Technical Report 79, University of Munich, Munich

  • Saffari A, Leistner C, Santner J, Godec M, Bischof H (2009) On-line random forests. In: IEEE 12th international conference on computer vision workshops, pp 1393–1400

  • Schwarz DF, König IR, Ziegler A (2010) On safari to random jungle: a fast implementation of random forests for high-dimensional data. Bioinformatics 26:1752–1758

  • Scornet E (2015a) On the asymptotics of random forests. J Multivar Anal 146:72–83

  • Scornet E (2015b) Random forests and kernel methods. IEEE Trans Inform Theory 62:1485–1500

  • Scornet E, Biau G, Vert J-P (2015) Consistency of random forests. Ann Stat 43:1716–1741

  • Segal MR (1988) Regression trees for censored data. Biometrics 44:35–47

  • Shotton J, Fitzgibbon A, Cook M, Sharp T, Finocchio M, Moore R, Kipman A, Blake A (2011) Real-time human pose recognition in parts from single depth images. In: IEEE conference on computer vision and pattern recognition, pp 1297–1304

  • Stone CJ (1977) Consistent nonparametric regression. Ann Stat 5:595–645

  • Stone CJ (1980) Optimal rates of convergence for nonparametric estimators. Ann Stat 8:1348–1360

  • Stone CJ (1982) Optimal global rates of convergence for nonparametric regression. Ann Stat 10:1040–1053

  • Strobl C, Boulesteix A-L, Kneib T, Augustin T, Zeileis A (2008) Conditional variable importance for random forests. BMC Bioinform 9:307

  • Svetnik V, Liaw A, Tong C, Culberson JC, Sheridan RP, Feuston BP (2003) Random forest: a classification and regression tool for compound classification and QSAR modeling. J Chem Inform Comput Sci 43:1947–1958

  • Toloşi L, Lengauer T (2011) Classification with correlated features: unreliability of feature ranking and solutions. Bioinformatics 27:1986–1994

  • Truong AKY (2009) Fast growing and interpretable oblique trees via logistic regression models. PhD thesis, University of Oxford, Oxford

  • Varian H (2014) Big data: new tricks for econometrics. J Econ Perspect 28:3–28

  • Wager S (2014) Asymptotic theory for random forests. arXiv:1405.0352

  • Wager S, Hastie T, Efron B (2014) Confidence intervals for random forests: the jackknife and the infinitesimal jackknife. J Mach Learn Res 15:1625–1651

  • Watson GS (1964) Smooth regression analysis. Sankhyā Ser A 26:359–372

  • Welbl J (2014) Casting random forests as artificial neural networks and profiting from it. In: Jiang X, Hornegger J, Koch R (eds) Pattern recognition. Springer, Berlin, pp 765–771

  • Winham SJ, Freimuth RR, Biernacka JM (2013) A weighted random forests approach to improve predictive performance. Stat Anal Data Mining ASA Data Sci J 6:496–505

  • Yan D, Chen A, Jordan MI (2013) Cluster forests. Comput Stat Data Anal 66:178–192

  • Yang F, Wang J, Fan G (2010) Kernel induced random survival forests. arXiv:1008.3952

  • Yi Z, Soatto S, Dewan M, Zhan Y (2012) Information forests. In: 2012 information theory and applications workshop, pp 143–146

  • Zhu R, Zeng D, Kosorok MR (2015) Reinforcement learning trees. J Am Stat Assoc 110:1770–1784

  • Ziegler A, König IR (2014) Mining data with random forests: current options for real-world applications. Wiley Interdiscip Rev Data Mining Knowl Discov 4:55–63

Acknowledgments

We thank the Editors and three anonymous referees for valuable comments and insightful suggestions.

Author information

Correspondence to Gérard Biau.

Additional information

This invited paper is discussed in comments available at: doi:10.1007/s11749-016-0482-6; doi:10.1007/s11749-016-0483-5; doi:10.1007/s11749-016-0484-4; doi:10.1007/s11749-016-0485-3; doi:10.1007/s11749-016-0487-1.

Cite this article

Biau, G., Scornet, E. A random forest guided tour. TEST 25, 197–227 (2016). https://doi.org/10.1007/s11749-016-0481-7
