forestWgt.Rmd
Recall that each tree in a random forest is constructed from a bootstrap sample of the data. Thus, the topology of each tree, and in particular its terminal nodes, is determined from in-bag (IB) data. This explains the tree induction step. But what about the terminal node estimator and the predicted value? What data are used to calculate these values? And what about out-of-bag (OOB) data, how is that used?
The first answer is that IB data is used to populate the terminal nodes and to construct the tree predictor. That is, the same IB data used for the tree induction is also used to define the predicted value. We call this the IB tree predictor. As for the second question about OOB data, we don’t actually use OOB data for prediction. Instead we keep track of OOB membership and use the existing tree to get a cross-validated prediction value for cases that are OOB for that tree.
This leads to two different ensembles: the IB and the OOB ensemble. Let \(\varphi_b^{\text{IB}}\) be the \(b\)th IB tree. The IB forest predictor equals \[ \overline{\varphi}^{\text{IB}}({\bf x}) =\frac{1}{B}\sum_{b=1}^B \varphi_b^{\text{IB}}({\bf x}) =\sum_{i=1}^n W_i^{\text{IB}}({\bf x}) \psi(Y_i) \] where \(0\le W_1^{\text{IB}}({\bf x}), \ldots,W_n^{\text{IB}}({\bf x}) \le 1\) are weights (called the IB forest weights) that sum to one. Here \(\psi(Y)\) is some function, defined by the problem at hand, that maps the response \(Y\) to the target we are interested in. In other words, the IB ensemble is a forest weighted average involving the response. For regression, \(\psi(Y)=Y\) and the forest ensemble is the forest weighted average of responses \[ \overline{\varphi}^{\text{IB}}({\bf x}) = \sum_{i=1}^n W_i^{\text{IB}}({\bf x}) Y_i. \] For classification, \(\psi(Y)=I\{Y=1\}\) and \[ \overline{\varphi}^{\text{IB}}({\bf x}) = \sum_{i=1}^n W_i^{\text{IB}}({\bf x}) I\{Y_i=1\} \] which can be interpreted as a forest weighted average vote.
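To make the weighted-average representation concrete, here is a minimal regression sketch (illustrative only; names such as `o.reg` are our own) that extracts the forest weights via the `forest.wt` option of `predict` and recovers the predicted values as a weighted average of the training responses. We assume `forest.wt = "inbag"` also applies in prediction mode with `newdata`, returning a weight matrix with one row per predicted case and one column per training case.
## sketch: IB forest prediction as a forest weighted average (regression)
library(randomForestSRC)
o.reg <- rfsrc(mpg ~ ., mtcars)
p.reg <- predict(o.reg, newdata = mtcars[1:5, ], forest.wt = "inbag")
## weighted average of the training responses using the IB forest weights
yhat.fwt <- c(p.reg$forest.wt %*% o.reg$yvar)
## should agree, up to numerical error, with the package's predicted values
print(data.frame(predicted = p.reg$predicted, weighted = yhat.fwt))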
Keep in mind that the primary role of the IB ensemble is prediction on new data. This distinguishes it from the OOB ensemble, discussed next, which is used for inference on the training data.
The IB ensemble is used for prediction on new data. It is almost never used for inference on the training data.
When it comes to performance values like out-of-sample error rates, and related quantities such as variable importance, this is where OOB trees and the OOB ensemble come into play. For a case \(i\), the OOB ensemble is obtained by averaging the trees for which \(i\) is OOB. We can write the OOB ensemble for \(i\) as \[ \overline{\varphi}_i^{\text{OOB}} =\frac{1}{|\text{OOB}_i|}\sum_{b\in\text{OOB}_i} \varphi_b^{\text{IB}}({\bf X}_i) =\sum_{j=1}^n W_{i,j}^{\text{OOB}} \psi(Y_j) \] where \(\text{OOB}_i\) denotes the set of trees for which \(i\) is OOB and \(0\le W_{i,1}^{\text{OOB}},\ldots,W_{i,n}^{\text{OOB}}\le 1\) are the OOB forest weights for \(i\).
Notice that the OOB ensemble is not a function of \({\bf x}\). This is because it is an estimator specific to case \(i\), evaluated at its own covariate \({\bf X}_i\). This should make it clear why it is not used for prediction. On the other hand, because the estimator is out-of-sample, it is used to calculate error rates and performance values.
The OOB ensemble is used for inference on the training data and for obtaining OOB performance values such as the prediction error and variable importance.
Now we formalize the previous discussion in more mathematical terms. This will help explain where the forest weights come from. Let \({\mathscr F}=\{T_1,\ldots,T_B\}\) denote the forest where \(T_b=T_b(\mathscr{L})\) is the \(b\)th tree. The terminal nodes of a tree are rectangular regions which form a partition of the feature space \({\mathscr X}\). Each \({\bf x}\in{\mathscr X}\) is a member of a unique terminal node, which we denote by the rectangle \(R_b({\bf x})\) for tree \(T_b\).
For the moment let’s pretend that a tree is constructed using the entire learning data set, \(\mathscr{L}\). For concreteness let’s assume we are interested in multiclass classification. Let \(Y\in\{1,\ldots,C\}\) denote the classification outcome, which takes a value from \(C\ge 2\) possible classes. The target function we wish to estimate is \[ \psi^o_c({\bf x}) = \mathbb{P}\{Y=c|{\bf X}={\bf x}\},\qquad 1\le c\le C. \] The tree predictor \(\varphi_{b,c}({\bf x})\) for \(\psi^o_c({\bf x})\) equals the relative frequency of the class \(c\) labels in \(R_b({\bf x})\). Therefore, setting \(\psi_{c}(Y)=I\{Y=c\}\), \[ \varphi_{b,c}({\bf x}) = \sum_{i=1}^n W_{i,b}({\bf x}) \psi_{c}(Y_i),\qquad W_{i,b}({\bf x}) = \frac{1_{\{{\bf x}\in R_{b}({\bf X}_i)\}}}{\sum_{j=1}^n1_{\{{\bf x}\in R_{b}({\bf X}_j)\}}} \] where the \(W_{i,b}({\bf x})\) are convex weights: \(0\le W_{i,b}({\bf x}) \le 1\) and \(\sum_{i=1}^n W_{i,b}({\bf x}) =1\). Observe that the denominator of the weight equals the number of cases in \(R_b({\bf x})\), the terminal node for \({\bf x}\).
The ensemble predictor is the average over trees, therefore \[ \overline{\varphi}_c({\bf x}) =\frac{1}{B}\sum_{b=1}^B \varphi_{b,c}({\bf x}) = \frac{1}{B}\sum_{b=1}^B \sum_{i=1}^n W_{i,b}({\bf x}) \psi_{c}(Y_i) = \sum_{i=1}^n W_i({\bf x}) \psi_{c}(Y_i), \] where \[ W_i({\bf x}) = \frac{1}{B}\sum_{b=1}^B \frac{1_{\{{\bf x}\in R_b({\bf X}_i)\}}}{\sum_{j=1}^n1_{\{{\bf x}\in R_{b}({\bf X}_j)\}}}. \] These are the forest weights discussed earlier and their values always sum to one for each \({\bf x}\). Also notice that substituting \(I\{Y_i=c\}\) for \(\psi_{c}(Y_i)\) in the above gives \[ \overline{\varphi}_c({\bf x}) = \sum_{i=1}^n W_i({\bf x}) I\{Y_i=c\}, \] which shows more clearly that the ensemble is a forest weighted average vote, as described earlier.
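As a small worked example with purely illustrative numbers, suppose the terminal node \(R_b({\bf x})\) of tree \(T_b\) contains three training cases with class labels \(1, 1, 2\). Each of these cases receives tree weight \(W_{i,b}({\bf x})=1/3\) and every other case receives weight zero, so the tree estimate for class 1 is \[ \varphi_{b,1}({\bf x}) = \tfrac{1}{3}(1)+\tfrac{1}{3}(1)+\tfrac{1}{3}(0) = \tfrac{2}{3}, \] the relative frequency of class 1 in the node; averaging these tree weights over all \(B\) trees yields the forest weights \(W_i({\bf x})\).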
Now let’s consider what happens when the tree is constructed from a bootstrap sample.
Define integer values \(n_{i,b}\ge 0\) recording the bootstrap frequency of case \(i\) for tree \(T_b\). For later use we also need to track OOB data which corresponds to \(n_{i,b}=0\). Define values \(I_{i,b}\in\{0,1\}\) indicating OOB membership for case \(i\): \[ I_{i,b}=\left\{ \begin{array}{ll} 1 &\mbox{if $n_{i,b}=0$}\\ 0 &\mbox{otherwise}. \end{array}\right. \] To define the IB ensemble we modify the definition of the forest weights to permit only IB data. Define the IB forest weights \[ W_i^{\text{IB}}({\bf x}) = \frac{1}{B}\sum_{b=1}^B \frac{n_{i,b}1_{\{{\bf x}\in R_b({\bf X}_i)\}}}{\sum_{j=1}^n n_{j,b}1_{\{{\bf x}\in R_b({\bf X}_j)\}}}. \] The denominator equals the bootstrap size for \({\bf x}\)’s terminal node, \(R_b({\bf x})\). In the numerator, notice the weight from a specific tree is zero unless \(i\) is in-bag; i.e. \(n_{i,b}>0\). The IB ensemble is \[ \overline{\varphi}_c^{\text{IB}}({\bf x}) =\frac{1}{B}\sum_{b=1}^B \varphi^{\text{IB}}_{b,c}({\bf x}) = \sum_{i=1}^n W_i^{\text{IB}}({\bf x}) I\{Y_i=c\}. \]
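To see the IB weight formula in action, here is a hedged sketch that recomputes the IB forest weights by hand from terminal node membership and bootstrap counts. It assumes that setting `membership = TRUE` in `rfsrc` returns the terminal node membership matrix `$membership` and the in-bag count matrix `$inbag`; the result should agree, up to numerical error, with the weights returned by `predict` with `forest.wt = "inbag"`, used later in this vignette.
## sketch: recompute the IB forest weights from membership and inbag counts
library(randomForestSRC)
o.m <- rfsrc(Species ~ ., iris, membership = TRUE)
n <- nrow(iris)
B <- o.m$ntree
W.ib <- matrix(0, n, n)
for (b in 1:B) {
  node <- o.m$membership[, b]  ## terminal node of each case in tree b
  nb <- o.m$inbag[, b]         ## bootstrap count of each case in tree b
  for (i in 1:n) {
    same <- node == node[i]    ## cases falling in i's terminal node
    W.ib[i, ] <- W.ib[i, ] + same * nb / sum(nb[same])
  }
}
W.ib <- W.ib / B
## each row is a set of convex weights: entries sum to one
print(range(rowSums(W.ib)))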
Now we describe the OOB ensemble.
Only those trees for which \(i\) is OOB are used, therefore \[ \overline{\varphi}_{i,c}^{\text{OOB}} =\frac{1}{|\text{OOB}_i|}\sum_{b\in\text{OOB}_i} \varphi_{b,c}^{\text{IB}}({\bf X}_i) = \sum_{j=1}^n W_{i,j}^{\text{OOB}} I\{Y_j=c\} \] where \(W_{i,j}^{\text{OOB}}\) is the forest weight for case \(j\) in which \(i\) is excluded, defined by \[ W_{i,j}^{\text{OOB}} = \frac{1}{\sum_{b=1}^B I_{i,b}}\sum_{b=1}^B\left[ \frac{I_{i,b}n_{j,b}1_{\{{\bf X}_i\in R_b({\bf X}_j)\}}}{\sum_{k=1}^n n_{k,b}1_{\{{\bf X}_i\in R_b({\bf X}_k)\}}}\right]. \] The value in the square brackets is the tree weight for case \(j\) when \(i\) is OOB for tree \(b\). The normalizing constant \(\sum_{b=1}^B I_{i,b}\) equals the number of trees for which \(i\) is OOB.
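Continuing the sketch above, the OOB weights can be recomputed the same way by restricting each case to the trees for which it is OOB (i.e. \(n_{i,b}=0\)). This is illustration only and assumes every case is OOB for at least one tree, which holds with overwhelming probability for a default-sized forest.
## sketch (continues the previous chunk): recompute the OOB forest weights
W.oob <- matrix(0, n, n)
oob.trees <- rep(0, n)        ## number of trees for which each case is OOB
for (b in 1:B) {
  node <- o.m$membership[, b]
  nb <- o.m$inbag[, b]
  for (i in which(nb == 0)) { ## cases i that are OOB for tree b
    same <- node == node[i]
    W.oob[i, ] <- W.oob[i, ] + same * nb / sum(nb[same])
    oob.trees[i] <- oob.trees[i] + 1
  }
}
W.oob <- sweep(W.oob, 1, oob.trees, "/")
## rows sum to one and place zero weight on the case's own response
print(range(rowSums(W.oob)))
print(max(abs(diag(W.oob))))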
The following shows how to directly use forest weights to obtain IB and OOB predicted values for classification. This is meant for illustration only, since these values are automatically provided by the package, as shown below.
We use the iris data for illustration.
library(randomForestSRC)
## run classification forest
o <- rfsrc(Species~.,iris)
The ensemble estimates for \(\psi^o_c({\bf x}) = \mathbb{P}\{Y=c|{\bf X}={\bf x}\}\) are stored in `$predicted` and `$predicted.oob`.
## extract inbag and oob predictors
phat.inb <- o$predicted
phat.oob <- o$predicted.oob
We also note that predicted class labels (using Bayes rule) are stored in `$class` and `$class.oob`, although we won’t use these here.
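As a quick sanity check (a sketch, assuming the first column of `$err.rate` records the cumulative overall OOB misclassification rate), applying Bayes rule to the OOB ensemble should essentially reproduce the forest's reported OOB error:
## sketch: OOB error recovered from the OOB ensemble via Bayes rule
yhat.oob <- levels(o$yvar)[apply(phat.oob, 1, which.max)]
print(mean(yhat.oob != o$yvar))
## should closely match the reported overall OOB error
print(tail(na.omit(o$err.rate[, 1]), 1))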
Now we extract the IB and OOB forest weights and use these to directly obtain the ensemble predicted values. We show these are the same as the above.
## extract inbag and oob forest weights
fwt.inb <- predict(o, forest.wt="inbag")$forest.wt
fwt.oob <- predict(o, forest.wt="oob")$forest.wt
## calculate inbag and oob predictors
phat.fwt.inb <- do.call(cbind, lapply(levels(o$yvar), function(lbl) {
  apply(fwt.inb, 1, function(wt) {
    sum(wt * (o$yvar == lbl))
  })
}))
phat.fwt.oob <- do.call(cbind, lapply(levels(o$yvar), function(lbl) {
  apply(fwt.oob, 1, function(wt) {
    sum(wt * (o$yvar == lbl))
  })
}))
## show these are the same as before
print(head(data.frame(
  IB = phat.inb,
  OOB = phat.oob,
  IB.fwt = phat.fwt.inb,
  OOB.fwt = phat.fwt.oob), 20))
> IB.setosa IB.versicolor IB.virginica OOB.setosa OOB.versicolor OOB.virginica
> 1 1.000 0.000 0 1.0000000 0.000000000 0
> 2 1.000 0.000 0 1.0000000 0.000000000 0
> 3 1.000 0.000 0 1.0000000 0.000000000 0
> 4 1.000 0.000 0 1.0000000 0.000000000 0
> 5 1.000 0.000 0 1.0000000 0.000000000 0
> 6 1.000 0.000 0 1.0000000 0.000000000 0
> 7 1.000 0.000 0 1.0000000 0.000000000 0
> 8 1.000 0.000 0 1.0000000 0.000000000 0
> 9 0.998 0.002 0 0.9942197 0.005780347 0
> 10 1.000 0.000 0 1.0000000 0.000000000 0
> 11 1.000 0.000 0 1.0000000 0.000000000 0
> 12 1.000 0.000 0 1.0000000 0.000000000 0
> 13 1.000 0.000 0 1.0000000 0.000000000 0
> 14 1.000 0.000 0 1.0000000 0.000000000 0
> 15 0.982 0.018 0 0.9523810 0.047619048 0
> 16 0.978 0.022 0 0.9360465 0.063953488 0
> 17 1.000 0.000 0 1.0000000 0.000000000 0
> 18 1.000 0.000 0 1.0000000 0.000000000 0
> 19 0.976 0.024 0 0.9337017 0.066298343 0
> 20 1.000 0.000 0 1.0000000 0.000000000 0
> IB.fwt.1 IB.fwt.2 IB.fwt.3 OOB.fwt.1 OOB.fwt.2 OOB.fwt.3
> 1 1.000 0.000 0 1.0000000 0.000000000 0
> 2 1.000 0.000 0 1.0000000 0.000000000 0
> 3 1.000 0.000 0 1.0000000 0.000000000 0
> 4 1.000 0.000 0 1.0000000 0.000000000 0
> 5 1.000 0.000 0 1.0000000 0.000000000 0
> 6 1.000 0.000 0 1.0000000 0.000000000 0
> 7 1.000 0.000 0 1.0000000 0.000000000 0
> 8 1.000 0.000 0 1.0000000 0.000000000 0
> 9 0.998 0.002 0 0.9942197 0.005780347 0
> 10 1.000 0.000 0 1.0000000 0.000000000 0
> 11 1.000 0.000 0 1.0000000 0.000000000 0
> 12 1.000 0.000 0 1.0000000 0.000000000 0
> 13 1.000 0.000 0 1.0000000 0.000000000 0
> 14 1.000 0.000 0 1.0000000 0.000000000 0
> 15 0.982 0.018 0 0.9523810 0.047619048 0
> 16 0.978 0.022 0 0.9360465 0.063953488 0
> 17 1.000 0.000 0 1.0000000 0.000000000 0
> 18 1.000 0.000 0 1.0000000 0.000000000 0
> 19 0.976 0.024 0 0.9337017 0.066298343 0
> 20 1.000 0.000 0 1.0000000 0.000000000 0
Notice that the forest weights sum to 1.
## notice that forest weights are convex (sum to 1)
print(rowSums(fwt.inb, na.rm = TRUE))
> [1] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
> [38] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
> [75] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
> [112] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
> [149] 1 1
print(rowSums(fwt.oob, na.rm = TRUE))
> [1] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
> [38] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
> [75] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
> [112] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
> [149] 1 1
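A further check worth sketching: by construction, the OOB weights place zero weight on a case's own response, since the numerator contains \(n_{i,b}\), which is zero on the trees where \(i\) is OOB, whereas the IB weights generally do not.
## OOB weights assign zero weight to a case's own response; IB weights do not
print(max(abs(diag(fwt.oob))))
print(summary(diag(fwt.inb)))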
Cite this vignette as
H. Ishwaran, M. Lu, and U. B. Kogalur. 2021. “randomForestSRC: forest weights, in-bag (IB) and out-of-bag (OOB) ensembles vignette.” http://randomforestsrc.org/articles/forestWgt.html.
@misc{HemantWeight,
  author = {Hemant Ishwaran and Min Lu and Udaya B. Kogalur},
  title = {{randomForestSRC}: forest weights, in-bag (IB) and out-of-bag (OOB) ensembles vignette},
  year = {2021},
  url = {http://randomforestsrc.org/articles/forestWgt.html},
  howpublished = {\url{http://randomforestsrc.org/articles/forestWgt.html}},
  note = {[accessed date]}
}