Tune Random Forest for optimal mtry and nodesize: tune.rfsrc (randomForestSRC)

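Finds the optimal mtry and nodesize for a random forest by minimizing an error estimate: out-of-bag (OOB) error, or holdout error when the tuning subsample is small enough to leave a disjoint holdout set (see Details). Two search strategies are supported: a grid-based search and a golden-section search with noise control. Works for all response families supported by rfsrc.fast.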

# S3 method for class 'rfsrc'
tune(formula, data,
  mtry.start = ncol(data) / 2,
  nodesize.try = c(1:9, seq(10, 100, by = 5)), ntree.try = 100,
  sampsize = function(x) { min(x * .632, max(150, x^(3/4))) },
  nsplit = 1, step.factor = 1.25, improve = 1e-3, strikeout = 3, max.iter = 25,
  method = c("grid", "golden"),
  final.window = 5, reps.initial = 2, reps.final = 3,
  trace = FALSE, do.best = TRUE, seed = NULL, ...)

# S3 method for class 'rfsrc'
tune.nodesize(formula, data,
  nodesize.try = c(1:9, seq(10, 150, by = 5)), ntree.try = 100,
  sampsize = function(x) { min(x * .632, max(150, x^(4/5))) },
  nsplit = 1, method = c("grid", "golden"),
  final.window = 5, reps.initial = 2, reps.final = 3, max.iter = 50,
  trace = TRUE, seed = NULL, ...)

Arguments

formula: A model formula.
data: A data frame with response and predictors.
mtry.start: Initial mtry for tune.
nodesize.try: Candidate nodesize values. Only values <= floor(sampsize(n)/2) are used.
ntree.try: Number of trees grown at each tuning evaluation.
sampsize: Function or numeric giving the per-tree subsample size. During tuning a single numeric size ssize is computed and passed to rfsrc.fast. If a vector is supplied (e.g., class specific), its total is used for ssize.
nsplit: Number of random split points to consider at each node.
step.factor: Multiplicative step-out factor over mtry for grid search in tune.
improve: Minimum relative improvement required to continue a search step in tune.
strikeout: Maximum number of consecutive non-improving steps allowed in tune.
max.iter: Maximum number of iterations for the step-out search in tune or the coordinate loop when method = "golden".
method: Search strategy: "grid" (default) or "golden".
final.window: For golden search, the terminal bracket width for the one-dimensional line search.
reps.initial: Replicates averaged at interior evaluations during golden iterations.
reps.final: Replicates averaged for each candidate during the final local sweep in golden search.
trace: If TRUE, prints progress.
do.best: If TRUE, tune fits and returns a forest at the optimal pair.
seed: Optional integer for reproducible tuning. The holdout split (when used) and all tuning fits become deterministic for a given seed.
...: Additional arguments passed to rfsrc.fast. Arguments that control tuning itself (perf.type, forest, save.memory, ntree, mtry, nodesize, sampsize, nsplit) are managed internally.
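
The way sampsize and nodesize.try interact can be illustrated in base R. This is a minimal sketch, not the package's internal code: the helper name resolve_ssize and the use of floor() for rounding are assumptions made for illustration, and n = 500 is an arbitrary sample size.

```r
## Default sampsize rule copied from the tune() signature above
sampsize.default <- function(x) { min(x * .632, max(150, x^(3/4))) }

## Hypothetical helper (not part of randomForestSRC): collapse a function,
## a single number, or a class-specific vector into one subsample size.
## Rounding down with floor() is an assumption of this sketch.
resolve_ssize <- function(sampsize, n) {
  s <- if (is.function(sampsize)) sampsize(n) else sum(sampsize)
  floor(s)
}

n <- 500                                   # illustrative number of rows
ssize <- resolve_ssize(sampsize.default, n)

## Keep only nodesize candidates no larger than floor(ssize / 2)
nodesize.try <- c(1:9, seq(10, 100, by = 5))
nodesize.used <- nodesize.try[nodesize.try <= floor(ssize / 2)]
```

With n = 500 the default rule yields a subsample of 150, so candidates above 75 are dropped. A class-specific vector such as c(60, 40) would be reduced to its total, 100, in the same way.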

Details

Error estimate. If 2 * ssize < n, a disjoint holdout of size ssize is used for evaluation; otherwise OOB error is used.
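
The rule above can be sketched in base R as follows. The function name pick_error_scheme is hypothetical, and drawing the holdout rows uniformly at random is an assumption; the help text only states that the holdout is disjoint and of size ssize.

```r
## Hypothetical helper: decide between holdout and OOB evaluation.
## Returns the row indices reserved for evaluation and those left for fitting.
pick_error_scheme <- function(n, ssize, seed = NULL) {
  if (!is.null(seed)) set.seed(seed)       # makes the split reproducible
  if (2 * ssize < n) {
    ## Enough data: reserve ssize rows that are never used for fitting
    hold <- sample.int(n, ssize)
    list(type = "holdout",
         holdout = hold,
         train = setdiff(seq_len(n), hold))
  } else {
    ## Too little data for a disjoint holdout: fall back to OOB error
    list(type = "oob",
         holdout = integer(0),
         train = seq_len(n))
  }
}

s1 <- pick_error_scheme(n = 500, ssize = 150, seed = 1)  # 300 < 500: holdout
s2 <- pick_error_scheme(n = 200, ssize = 150)            # 300 >= 200: OOB
```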

Subsample used during tuning. Both functions derive a single integer ssize from sampsize and pass it to rfsrc.fast for all tuning fits. This improves stability and comparability across candidates. When do.best = TRUE in tune, the final forest is fit with the user-supplied sampsize exactly as provided.

Grid search. tune performs a step-out search over mtry for each nodesize in nodesize.try, using step.factor, improve, strikeout, and max.iter. tune.nodesize evaluates the supplied nodesize.try grid directly.
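
To make the roles of step.factor, improve, strikeout, and max.iter concrete, here is a minimal base R sketch of a step-out search over mtry for one fixed nodesize. It is not the package's implementation: the function step_out, the left-then-right ordering, the integer rounding, and the toy error curve toy_err are all assumptions for illustration. In tune, a search of this kind would be repeated for each candidate nodesize.

```r
## Hypothetical step-out search: move mtry by a multiplicative factor in each
## direction, accept a step only if the relative improvement exceeds `improve`,
## and give up on a direction after `strikeout` consecutive failures.
step_out <- function(err_fun, mtry.start, p, step.factor = 1.25,
                     improve = 1e-3, strikeout = 3, max.iter = 25) {
  best.m <- max(1, min(p, floor(mtry.start)))
  best.e <- err_fun(best.m)
  visited <- data.frame(mtry = best.m, err = best.e)
  for (direction in c("left", "right")) {
    m <- best.m
    strikes <- 0
    for (iter in seq_len(max.iter)) {
      m.new <- if (direction == "left") {
        max(1, floor(m / step.factor))
      } else {
        min(p, ceiling(m * step.factor))
      }
      if (m.new == m) break                # hit the boundary 1 or p
      m <- m.new
      e <- err_fun(m)
      visited <- rbind(visited, data.frame(mtry = m, err = e))
      if ((best.e - e) / best.e > improve) {
        best.m <- m; best.e <- e; strikes <- 0
      } else {
        strikes <- strikes + 1
        if (strikes >= strikeout) break
      }
    }
  }
  list(mtry = best.m, err = best.e, visited = visited)
}

## Toy error curve (an assumption, not real forest error) with minimum at 8
toy_err <- function(m) 0.2 + (m - 8)^2 / 100
res <- step_out(toy_err, mtry.start = 3, p = 12)
res$mtry
res$visited
```

Because the steps are multiplicative and rounded to integers, the search visits only a subset of the possible mtry values and can step over the exact minimizer of the toy curve. A smaller step.factor trades more evaluations for finer resolution.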

Golden search. Uses a guarded golden-section line search with noise control. For each one-dimensional search (over nodesize or mtry), the routine probes a small left-anchor grid 1:9, iterates golden shrinkage until the bracket width is at most final.window, then runs a short local sweep with reps.final replicates. In tune the searches over nodesize and mtry alternate in a simple coordinate loop, with improve and strikeout as stopping controls.
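
The one-dimensional part of this procedure can be sketched in base R. The following is an illustration under stated assumptions, not the package's code: the function golden_int and the toy error functions are hypothetical, the left-anchor probe of 1:9 is omitted for brevity, and replicates are simply averaged to damp noise.

```r
## Hypothetical integer golden-section search with replicate averaging.
## Shrinks [lo, hi] until its width is at most final.window, then sweeps
## every integer in the final bracket using reps.final replicates.
golden_int <- function(err_fun, lo, hi, final.window = 5,
                       reps.initial = 2, reps.final = 3) {
  phi <- (sqrt(5) - 1) / 2                 # golden ratio conjugate, about 0.618
  avg_err <- function(x, reps) mean(replicate(reps, err_fun(x)))
  while (hi - lo > final.window) {
    x1 <- round(hi - phi * (hi - lo))      # left interior point
    x2 <- round(lo + phi * (hi - lo))      # right interior point
    if (x1 >= x2) break                    # guard: interior points collapsed
    if (avg_err(x1, reps.initial) <= avg_err(x2, reps.initial)) {
      hi <- x2                             # minimum lies in [lo, x2]
    } else {
      lo <- x1                             # minimum lies in [x1, hi]
    }
  }
  cand <- lo:hi                            # final local sweep
  errs <- vapply(cand, avg_err, numeric(1), reps = reps.final)
  cand[which.min(errs)]
}

## Deterministic toy curve (an assumption) with its minimum at 37
clean_err <- function(x) (x - 37)^2
best.clean <- golden_int(clean_err, lo = 1, hi = 100)

## Noisy variant: replicate averaging reduces the effect of the noise
set.seed(1)
noisy_err <- function(x) 0.2 + ((x - 37) / 100)^2 + rnorm(1, sd = 0.005)
best.noisy <- golden_int(noisy_err, lo = 1, hi = 100)
```

A wider final.window ends the shrinkage earlier and leaves more candidates for the final sweep, which costs more fits but is less sensitive to a noisy comparison during the golden iterations.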

Value

For tune:

results: matrix with columns nodesize, mtry, err.
optimal: named numeric vector c(nodesize = ..., mtry = ...).
rf: fitted forest at the optimum if do.best = TRUE.

For tune.nodesize:

nsize.opt: optimal nodesize.
err: data frame with columns nodesize and err.

Author

Hemant Ishwaran and Udaya B. Kogalur

Examples

## ------------------------------------------------------------
## White wine classification example
## ------------------------------------------------------------
data(wine, package = "randomForestSRC")
wine$quality <- factor(wine$quality)

## Fixed seed makes tuning reproducible
set.seed(1)

## Full tuner over nodesize and mtry (grid)
o1 <- tune(quality ~ ., wine, sampsize = 100, method = "grid")
print(o1$optimal)

## Golden search alternative
o2 <- tune(quality ~ ., wine, sampsize = 100, method = "golden",
           reps.initial = 2, reps.final = 3, seed = 1)
print(o2$optimal)

## Visualize the nodesize/mtry error surface
if (library("interp", logical.return = TRUE)) {

  plot.tune <- function(o, linear = TRUE) {
    x <- o$results[, 1]
    y <- o$results[, 2]
    z <- o$results[, 3]
    so <- interp(x = x, y = y, z = z, linear = linear)
    idx <- which.min(z)
    x0 <- x[idx]; y0 <- y[idx]
    filled.contour(x = so$x, y = so$y, z = so$z,
                   xlim = range(so$x, finite = TRUE) + c(-2, 2),
                   ylim = range(so$y, finite = TRUE) + c(-2, 2),
                   color.palette = colorRampPalette(c("yellow", "red")),
                   xlab = "nodesize", ylab = "mtry",
                   main = "error rate for nodesize and mtry",
                   key.title = title(main = "OOB error", cex.main = 1),
                   plot.axes = {
                     axis(1); axis(2)
                     points(x0, y0, pch = "x", cex = 1, font = 2)
                     points(x, y, pch = 16, cex = .25)
                   })
  }

  plot.tune(o1)
  plot.tune(o2)
}

## ------------------------------------------------------------
## nodesize only: grid vs golden
## ------------------------------------------------------------
o3 <- tune.nodesize(quality ~ ., wine, sampsize = 100, method = "grid",
                    trace = TRUE, seed = 1)
o4 <- tune.nodesize(quality ~ ., wine, sampsize = 100, method = "golden",
                    reps.initial = 2, reps.final = 3, trace = TRUE, seed = 1)
plot(o3$err, type = "s", xlab = "nodesize", ylab = "error")

## ------------------------------------------------------------
## Tuning for class imbalance (rfq with geometric mean performance)
## ------------------------------------------------------------
data(breast, package = "randomForestSRC")
breast <- na.omit(breast)
o5 <- tune(status ~ ., data = breast, rfq = TRUE, perf.type = "gmean",
           method = "golden", seed = 1)
print(o5$optimal)

## ------------------------------------------------------------
## Competing risks example (nodesize only)
## ------------------------------------------------------------
data(wihs, package = "randomForestSRC")
plot(tune.nodesize(Surv(time, status) ~ ., wihs, trace = TRUE)$err, type = "s")
