Subsampling can be used to reduce the size of the in-sample data used to grow a tree and therefore can greatly reduce computational load. Subsampling is implemented using the options sampsize and samptype in rfsrc(). See rfsrc.fast() in the next subsection for a fast forest implementation using subsampling.
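As a minimal sketch (using the built-in mtcars data purely for illustration; the 50% subsample fraction and other settings are arbitrary, not recommendations), subsampling without replacement can be requested directly in rfsrc():
library("randomForestSRC")
## grow a regression forest where each tree uses a 50% subsample
## drawn without replacement
o.sub <- rfsrc(mpg ~ ., mtcars,
               samptype = "swor",
               sampsize = function(n) {n * 0.50},
               ntree = 500)
print(o.sub)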
Our function rfsrc.fast() implements fast approximate random forests using subsampling and applies to all families.
library("randomForestSRC")
## load the Iowa housing data
data(housing, package = "randomForestSRC")
## do quick and *dirty* imputation
housing <- impute(SalePrice ~ ., housing,
                  ntree = 50, nimpute = 1, splitrule = "random")
## grow a fast forest
obj <- rfsrc.fast(SalePrice ~ ., housing)
print(obj)
> Sample size: 2930
> Number of trees: 500
> Forest terminal node size: 5
> Average no. of terminal nodes: 83.776
> No. of variables tried at each split: 27
> Total no. of variables: 80
> Resampling used to grow trees: swor
> Resample size used to grow trees: 398
> Analysis: RF-R
> Family: regr
> Splitting rule: mse *random*
> Number of random split points: 10
> (OOB) R squared: 0.8761255
> (OOB) Error rate: 790552487
We can also adjust function parameters to improve computational speed.
## ------------------------------------------------------------
## big data set, reduce number of variables using simple method
## ------------------------------------------------------------
library("randomForestSRC")
## use Iowa housing data set
data(housing, package = "randomForestSRC")
## original data contains lots of missing data, use fast imputation
## however see impute for other methods
housing2 <- impute(data = housing, fast = TRUE)
## run shallow trees to find variables that split any tree
xvar.used <- rfsrc(SalePrice ~ ., housing2, ntree = 250, nodedepth = 4,
                   var.used = "all.trees", mtry = Inf, nsplit = 100)$var.used
## now fit forest using filtered variables
xvar.keep <- names(xvar.used)[xvar.used >= 1]
o <- rfsrc(SalePrice ~ ., housing2[, c("SalePrice", xvar.keep)])
print(o)
> Sample size: 2930
> Number of trees: 500
> Forest terminal node size: 5
> Average no. of terminal nodes: 400.552
> No. of variables tried at each split: 16
> Total no. of variables: 46
> Resampling used to grow trees: swor
> Resample size used to grow trees: 1852
> Analysis: RF-R
> Family: regr
> Splitting rule: mse *random*
> Number of random split points: 10
> (OOB) R squared: 0.9038358
> (OOB) Error rate: 613708802
Computational speed can be significantly improved with randomized splitting, invoked by the option nsplit. When nsplit is set to a positive integer, a maximum of nsplit split points are chosen randomly for each candidate splitting variable when splitting a tree node, which significantly reduces the cost of considering all possible split values. To revert to traditional deterministic splitting (all split values considered), use nsplit = 0.
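As a quick sketch (assuming the imputed housing2 data frame created above; timings will vary by machine), the two modes can be compared directly:
## randomized splitting: at most 10 candidate split points per variable
system.time(o.rand <- rfsrc(SalePrice ~ ., housing2, nsplit = 10))
## deterministic splitting: all possible split values are considered
system.time(o.det <- rfsrc(SalePrice ~ ., housing2, nsplit = 0))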
Randomized splitting for trees has a long history; see [1]. [2] recently introduced extremely randomized trees using what they call the extra-trees algorithm. This algorithm essentially corresponds to randomized splitting with nsplit = 1. In our experience, however, this is not always the optimal value for nsplit and may often be too low.
Finally, for completely randomized (pure random) splitting, use splitrule = "random". In pure random splitting, nodes are split by randomly selecting a variable and randomly selecting its split point [3].
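A minimal sketch (again assuming housing2 from above):
## pure random splitting: both the splitting variable and its split
## point are chosen at random at each node
o.pure <- rfsrc(SalePrice ~ ., housing2, splitrule = "random")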
Setting ntime to a reasonably small value such as 100 or 250 (the default) constrains survival ensemble calculations to a restricted grid of time points and significantly improves computational times.
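A sketch using the veteran survival data shipped with the package (the grid size of 100 is illustrative):
## survival forest with ensemble estimates restricted to a grid of
## at most 100 time points
data(veteran, package = "randomForestSRC")
o.vet <- rfsrc(Surv(time, status) ~ ., veteran, ntime = 100)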
The default setting importance = "none" turns off variable importance (VIMP) calculations and considerably reduces computational times. Variable importance can always be recovered later using the functions vimp() or predict(). Also consider using max.subtree(), which calculates minimal depth, a measure of the depth at which a variable splits, and yields fast variable selection [4].
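A sketch (using the regression forest o grown above; the $order element of the max.subtree() return object is assumed to hold the minimal depth matrix, with first-order depth in its first column):
## VIMP was not requested at training time; recover it afterwards
v <- vimp(o)
## equivalently, request importance through predict on the grow object
v2 <- predict(o, importance = "permute")
## minimal depth for fast variable selection
md <- max.subtree(o)
md$order[, 1]  ## first-order minimal depth for each variable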
nodesize
Another way to speed things up considerably is to increase nodesize. With very big data sets, larger nodesize values can actually give better performance. The function tune.nodesize() can be used to find a reasonable value for nodesize.
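A sketch (assuming housing2 from above, and assuming the nsize.opt element of the returned object holds the selected value):
## search over a grid of nodesize values
ns <- tune.nodesize(SalePrice ~ ., housing2)
## grow the final forest with the selected nodesize
o.tuned <- rfsrc(SalePrice ~ ., housing2, nodesize = ns$nsize.opt)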
The C-index is computationally expensive, requiring approximately \(O(n^2)\) calculations, where \(n\) is the sample size. This makes it infeasible for large survival data sets. The solution is to turn off the C-index calculation by turning off all performance requests when training the forest. This is accomplished using the option perf.type = "none". Thus a generic training call would look something like:
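## sketch of a generic training call for big survival data;
## "trn" is a placeholder for the user's training data and the
## settings shown are illustrative, not prescriptive
o <- rfsrc(Surv(time, status) ~ ., data = trn,
           perf.type = "none", nodesize = 100, ntree = 500)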
It is also recommended to set nodesize to a reasonably large value. If the user is interested in measuring the performance of their model, consider using the Brier score, which is very fast to compute. This can be obtained with the function get.brier.survival():
bs <- get.brier.survival(o)
See the vignette for Random Survival Forests for more details about the Brier score.
Use the option save.memory = TRUE for big data competing risk and survival models. By default the package stores terminal node quantities used for prediction on test data, but this can be memory intensive for big data.
Make sure block.size = NULL (or set it to the number of trees) so that the cumulative error is calculated only once.
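A sketch pulling these settings together (again with trn as a placeholder for a large survival data set):
## big data survival sketch: skip terminal node storage and compute
## the cumulative error only once over the full forest
o.big <- rfsrc(Surv(time, status) ~ ., data = trn,
               save.memory = TRUE, block.size = NULL,
               nodesize = 100)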
Use the hidden jitt option by setting it to FALSE:
predict(myObject, myTestData, jitt=FALSE)
Cite this vignette as
H. Ishwaran, M. Lu, and U. B. Kogalur. 2021. “randomForestSRC: speedup random forest analyses vignette.” http://randomforestsrc.org/articles/speedup.html.
@misc{HemantSpeed,
  author = "Hemant Ishwaran and Min Lu and Udaya B. Kogalur",
  title = {{randomForestSRC}: speedup random forest analyses vignette},
  year = {2021},
  url = {http://randomforestsrc.org/articles/speedup.html},
  howpublished = "\url{http://randomforestsrc.org/articles/speedup.html}",
  note = "[accessed date]"
}