In this lab, we will cover some state-of-the-art techniques in the framework of tree models. We use the same datasets as in the previous lab: the Boston Housing data and the (Taiwan) Credit Card default data (a subsample of n = 12,000 rows).
# load Boston data
library(MASS)
data(Boston)
# 90/10 random split into training and testing samples
index <- sample(nrow(Boston),nrow(Boston)*0.90)
boston_train <- Boston[index,]
boston_test <- Boston[-index,]
Random forest is an extension of bagging, but it makes a significant improvement in terms of prediction. The idea of random forests is to randomly select \(m\) out of \(p\) predictors as candidate variables for each split in each tree. Commonly, \(m=\sqrt{p}\). The reason for doing this is that it decorrelates the trees, which reduces variance when we aggregate them. You may refer to Wikipedia and the tutorial on the author’s website. Note: the current randomForest package does not handle asymmetric loss.
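As a quick sketch (not part of the lab's original code), the common choices of \(m\) for the Boston data, which has \(p = 13\) predictors, can be computed directly; they match the defaults documented for the randomForest package.
# Sketch: common choices of m for p = 13 predictors (Boston data)
p <- 13
floor(sqrt(p))  # about sqrt(p), the usual choice for classification: 3
floor(p/3)      # the randomForest default for regression: 4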
We start with Boston Housing data.
library(randomForest)
boston_rf<- randomForest(medv~., data = boston_train, importance=TRUE)
boston_rf
##
## Call:
## randomForest(formula = medv ~ ., data = boston_train, importance = TRUE)
## Type of random forest: regression
## Number of trees: 500
## No. of variables tried at each split: 4
##
## Mean of squared residuals: 10.61935
## % Var explained: 87.91
# A second fit with the same arguments; results differ slightly from
# boston_rf because of the randomness in the algorithm
mod_rf <- randomForest(medv~., data = boston_train,
                       importance = TRUE, ntree = 500)
By default, \(m=p/3\) for regression and \(m=\sqrt{p}\) for classification. You can change it by specifying mtry=. You can also specify the number of trees with ntree=; the default is 500. The argument importance=TRUE allows us to see the variable importance.
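For example, a call that sets both arguments explicitly might look like the following sketch (the values mtry = 6 and ntree = 300 are chosen only for illustration, not as recommendations):
# Sketch: specify mtry and ntree explicitly (illustrative values only)
boston_rf_custom <- randomForest(medv~., data = boston_train,
                                 mtry = 6, ntree = 300, importance = TRUE)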
boston_rf$importance
## %IncMSE IncNodePurity
## crim 9.5325118 2472.0865
## zn 0.7271348 280.0408
## indus 7.5241319 2522.4000
## chas 0.9414219 240.9856
## nox 11.1921414 3084.9106
## rm 35.6567040 11065.4391
## age 4.1682140 942.8690
## dis 7.3667489 2356.4018
## rad 1.2737420 334.8613
## tax 4.1793545 1165.0005
## ptratio 6.0707040 2182.0569
## black 1.2574037 717.6691
## lstat 63.8071091 11629.9653
mod_rf$importance
## %IncMSE IncNodePurity
## crim 7.7350185 2247.7709
## zn 0.8751567 304.6511
## indus 7.1334557 2470.0364
## chas 0.8143924 188.9219
## nox 10.1678683 2815.6307
## rm 33.5807209 11384.5377
## age 3.6536810 981.6732
## dis 7.6362637 2397.4340
## rad 1.7554319 403.9413
## tax 4.6484295 1177.2934
## ptratio 6.7814665 2280.7953
## black 1.3385881 735.2751
## lstat 61.2619646 11909.0258
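The same importance measures can also be visualized with varImpPlot() from the randomForest package, which draws a dotchart for each measure.
# Dotcharts of %IncMSE and IncNodePurity for each predictor
varImpPlot(boston_rf)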
The mean of squared residuals (MSR) shown above is the MSE of the out-of-bag prediction (recall OOB in bagging). The fitted randomForest object actually saves the OOB error for every ntree value from 1 to 500, so we can plot how the OOB error changes as more trees are added.
plot(boston_rf$mse, type='l', col=2, lwd=2, xlab = "ntree", ylab = "OOB Error")
plot(mod_rf$mse, type='l', col=2, lwd=2, xlab = "ntree", ylab = "OOB Error")
Prediction on the testing sample.
boston_rf_pred<- predict(boston_rf, boston_test)
mean((boston_test$medv-boston_rf_pred)^2)
## [1] 5.348627
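For comparison, the final OOB MSE based on all 500 trees is the last element of the mse component, so you can place it next to the test MSE above (your numbers will differ because of the random split and the randomness of the forest):
# Final OOB MSE (all 500 trees), to compare with the test MSE above
boston_rf$mse[500]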
As we mentioned before, the number of candidate predictors at each split is \(m=p/3\approx 4\) by default. We can also specify \(m\) with the argument mtry. Now let’s see how the OOB error and the testing error change with mtry.
oob_err <- rep(0, 13)
test.err <- rep(0, 13)
for(i in 1:13){
  fit <- randomForest(medv~., data = boston_train, mtry = i)
  oob_err[i] <- fit$mse[500]                                           # OOB MSE with all 500 trees
  test.err[i] <- mean((boston_test$medv-predict(fit, boston_test))^2)  # test MSE
  cat(i, " ")
}
## 1 2 3 4 5 6 7 8 9 10 11 12 13
matplot(cbind(test.err, oob_err), pch=15, col = c("red", "blue"), type = "b", ylab = "MSE", xlab = "mtry")
legend("topright", legend = c("test Error", "OOB Error"), pch = 15, col = c("red", "blue"))
Exercise:
Create a plot displaying the test error across ntree = 1, …, 500 and mtry = 1, …, 13. (You can draw 13 lines in different colors, one for each value of \(m\).)
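One possible approach is sketched below (a sketch only, not necessarily the intended solution): supplying the test set through the xtest/ytest arguments makes randomForest() record the test-set MSE after each tree is added, in fit$test$mse.
# Sketch of one possible solution: test MSE for every ntree and every mtry.
# medv is the 14th (last) column of Boston, so column 14 is dropped from x.
test_err_mat <- matrix(NA, nrow = 500, ncol = 13)
for (m in 1:13) {
  fit <- randomForest(x = boston_train[, -14], y = boston_train$medv,
                      xtest = boston_test[, -14], ytest = boston_test$medv,
                      mtry = m, ntree = 500)
  test_err_mat[, m] <- fit$test$mse   # test MSE after 1, 2, ..., 500 trees
}
matplot(test_err_mat, type = "l", lty = 1, col = 1:13,
        xlab = "ntree", ylab = "Test MSE")
legend("topright", legend = paste("mtry =", 1:13), col = 1:13, lty = 1, cex = 0.7)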
We apply the random forest model to the credit card dataset:
# load credit card data
credit_data <- read.csv(file = "https://yanyudm.github.io/Data-Mining-R/lecture/data/credit_default.csv", header=T)
# convert categorical variables
credit_data$SEX<- as.factor(credit_data$SEX)
credit_data$EDUCATION<- as.factor(credit_data$EDUCATION)
credit_data$MARRIAGE<- as.factor(credit_data$MARRIAGE)
# random splitting
index <- sample(nrow(credit_data),nrow(credit_data)*0.9)
credit_train = credit_data[index,]
credit_test = credit_data[-index,]
credit_rf <- randomForest(as.factor(default.payment.next.month)~.,
data = credit_train,
importance=TRUE, ntree=500)
credit_rf
##
## Call:
## randomForest(formula = as.factor(default.payment.next.month) ~ ., data = credit_train, importance = TRUE, ntree = 500)
## Type of random forest: classification
## Number of trees: 500
## No. of variables tried at each split: 4
##
## OOB estimate of error rate: 18.41%
## Confusion matrix:
## 0 1 class.error
## 0 7962 478 0.05663507
## 1 1510 850 0.63983051
We can again easily plot the error rate vs. ntree. However, as far as I know, randomForest does not support asymmetric loss either, so it always uses the overall misclassification rate as the error.
plot(credit_rf, lwd=rep(2, 3))
legend("right", legend = c("OOB Error", "FPR", "FNR"), lwd=rep(2, 3), lty = c(1,2,3), col = c("black", "red", "green"))
ROC curve and AUC can be obtained from the predicted probabilities. Note that calling predict() on a randomForest object without newdata returns the out-of-bag predictions for the training sample.
credit_rf_pred <- predict(credit_rf, type = "prob")[,2]
library(ROCR)
pred <- prediction(credit_rf_pred, credit_train$default.payment.next.month)
perf <- performance(pred, "tpr", "fpr")
plot(perf, colorize=TRUE)
#Get the AUC
unlist(slot(performance(pred, "auc"), "y.values"))
## [1] 0.7658264
Create the confusion matrix based on the cutoff probability from asymmetric cost (pcut=1/6).
## out-of-sample
pcut <- 1/6
credit_rf_pred_test <- predict(credit_rf, newdata=credit_test, type = "prob")[,2]
credit_rf_class_test <- (credit_rf_pred_test>pcut)*1
table(credit_test$default.payment.next.month, credit_rf_class_test, dnn = c("True", "Pred"))
## Pred
## True 0 1
## 0 552 376
## 1 63 209
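As a follow-up sketch, the corresponding asymmetric misclassification cost on the test set can be computed directly; this assumes the 5:1 cost ratio implied by pcut = 1/6 (a false negative is treated as five times as costly as a false positive).
# Sketch: average asymmetric cost on the test set, assuming weight 5 for
# false negatives and weight 1 for false positives (implied by pcut = 1/6)
actual <- credit_test$default.payment.next.month
mean(5*(actual==1 & credit_rf_class_test==0) + 1*(actual==0 & credit_rf_class_test==1))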