In this lab, we cover some state-of-the-art techniques in the framework of tree models. We use the same datasets as in the previous lab: the Boston Housing data and the (Taiwan) Credit Card default data (a subsample of n = 12,000 rows).

# load Boston data
library(MASS)
data(Boston)
index <- sample(nrow(Boston),nrow(Boston)*0.90)
boston_train <- Boston[index,]
boston_test <- Boston[-index,]

1 Random Forests

Random forest is an extension of bagging, but it makes a significant improvement in terms of prediction. The idea of random forests is to randomly select \(m\) out of \(p\) predictors as candidate variables for each split in each tree. Commonly, \(m=\sqrt{p}\). The reason for doing this is that it decorrelates the trees, which reduces variance when we aggregate them. You may refer to Wikipedia and the tutorial on the author’s website. Note: the current randomForest package does not handle asymmetric loss.
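To see the effect of decorrelation, note that bagging is the special case mtry = p, where every predictor is a candidate at every split. A quick comparison on the Boston data (a sketch, assuming the boston_train split created above; exact OOB values vary from run to run):

```r
library(randomForest)
# Bagging: all p = 13 predictors are split candidates (mtry = p)
fit_bag <- randomForest(medv ~ ., data = boston_train, mtry = 13)
# Random forest: default mtry = p/3 decorrelates the trees
fit_rf  <- randomForest(medv ~ ., data = boston_train)
# Compare the OOB MSE after all 500 trees; the random forest
# is often (though not always) lower on this data
c(bagging = fit_bag$mse[500], rf = fit_rf$mse[500])
```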

1.1 Random Forest for Regression

We start with Boston Housing data.

library(randomForest)
boston_rf<- randomForest(medv~., data = boston_train, importance=TRUE)
boston_rf
## 
## Call:
##  randomForest(formula = medv ~ ., data = boston_train, importance = TRUE) 
##                Type of random forest: regression
##                      Number of trees: 500
## No. of variables tried at each split: 4
## 
##           Mean of squared residuals: 10.61935
##                     % Var explained: 87.91

By default, \(m=p/3\) for regression and \(m=\sqrt{p}\) for classification. You can change it by specifying mtry=, and you can specify the number of trees with ntree= (the default is 500). The argument importance=TRUE allows us to see the variable importance.
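The defaults above are easy to verify by hand. For the Boston data with p = 13 predictors (a small sketch in base R; randomForest computes these internally):

```r
# Default number of candidate predictors per split, for p = 13
p <- 13
m_regression     <- max(floor(p/3), 1)  # regression default
m_classification <- floor(sqrt(p))      # classification default
c(m_regression, m_classification)       # 4 and 3
```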

boston_rf$importance
##            %IncMSE IncNodePurity
## crim     9.5325118     2472.0865
## zn       0.7271348      280.0408
## indus    7.5241319     2522.4000
## chas     0.9414219      240.9856
## nox     11.1921414     3084.9106
## rm      35.6567040    11065.4391
## age      4.1682140      942.8690
## dis      7.3667489     2356.4018
## rad      1.2737420      334.8613
## tax      4.1793545     1165.0005
## ptratio  6.0707040     2182.0569
## black    1.2574037      717.6691
## lstat   63.8071091    11629.9653
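The two importance measures can also be visualized directly; varImpPlot() sorts the predictors by each measure (a quick sketch, assuming the fitted boston_rf from above):

```r
library(randomForest)
# Dot charts of %IncMSE and IncNodePurity, predictors sorted
# by importance (assumes boston_rf fitted earlier in this lab)
varImpPlot(boston_rf, main = "Boston Housing: variable importance")
```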

The mean of squared residuals reported above is the MSE of the out-of-bag prediction (recall OOB in bagging). The fitted randomForest object actually saves the OOB error for each ntree value from 1 to 500. We can make a plot to see how the OOB error changes as ntree grows.

plot(boston_rf$mse, type='l', col=2, lwd=2, xlab = "ntree", ylab = "OOB Error")


Prediction on the testing sample.

boston_rf_pred<- predict(boston_rf, boston_test)
mean((boston_test$medv-boston_rf_pred)^2)
## [1] 5.348627
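Since the OOB error is a built-in estimate of generalization error, it is worth checking how closely it tracks the test-set MSE. A quick comparison (assuming the fitted boston_rf and the split above; exact numbers vary with the random split):

```r
# OOB MSE after all trees vs. MSE on the held-out test set
# (assumes boston_rf and boston_test from above)
oob_mse  <- boston_rf$mse[boston_rf$ntree]
test_mse <- mean((boston_test$medv - predict(boston_rf, boston_test))^2)
c(OOB = oob_mse, Test = test_mse)
```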

As mentioned before, the number of candidate predictors at each split is \(m=p/3\approx 4\) by default. We can also specify \(m\) with the argument mtry=. Now let’s see how the OOB error and the testing error change with mtry.

oob_err <- rep(0, 13)
test_err <- rep(0, 13)
for(i in 1:13){
  fit <- randomForest(medv~., data = boston_train, mtry = i)
  oob_err[i] <- fit$mse[500]
  test_err[i] <- mean((boston_test$medv - predict(fit, boston_test))^2)
  cat(i, " ")
}
## 1  2  3  4  5  6  7  8  9  10  11  12  13
matplot(cbind(test_err, oob_err), pch = 15, col = c("red", "blue"), type = "b", ylab = "MSE", xlab = "mtry")
legend("topright", legend = c("Test Error", "OOB Error"), pch = 15, col = c("red", "blue"))

Exercise:

Create a plot displaying the test error across ntree = 1, …, 500 and mtry = 1, …, 13. (You can draw 13 lines in different colors, one for each value of \(m\).)

1.2 Random Forest for Classification

We apply the random forest model to the credit card dataset:

# load credit card data
credit_data <- read.csv(file = "https://yanyudm.github.io/Data-Mining-R/lecture/data/credit_default.csv", header=T)
# convert categorical variables
credit_data$SEX<- as.factor(credit_data$SEX)
credit_data$EDUCATION<- as.factor(credit_data$EDUCATION)
credit_data$MARRIAGE<- as.factor(credit_data$MARRIAGE)
# random splitting
index <- sample(nrow(credit_data),nrow(credit_data)*0.9)
credit_train <- credit_data[index,]
credit_test <- credit_data[-index,]

credit_rf <- randomForest(as.factor(default.payment.next.month)~., 
                          data = credit_train,
                          importance=TRUE, ntree=500)
credit_rf
## 
## Call:
##  randomForest(formula = as.factor(default.payment.next.month) ~      ., data = credit_train, importance = TRUE, ntree = 500) 
##                Type of random forest: classification
##                      Number of trees: 500
## No. of variables tried at each split: 4
## 
##         OOB estimate of  error rate: 18.41%
## Confusion matrix:
##      0   1 class.error
## 0 7962 478  0.05663507
## 1 1510 850  0.63983051

We can again easily plot the error rate vs. ntree. However, as far as I know, randomForest does not support asymmetric loss either, so it always uses the overall misclassification rate as the error.

plot(credit_rf, lwd=rep(2, 3))
legend("right", legend = c("OOB Error", "FPR", "FNR"), lwd=rep(2, 3), lty = c(1,2,3), col = c("black", "red", "green"))

The ROC curve and AUC can be obtained from the predicted probabilities. Note that calling predict() on a fitted randomForest without newdata returns the out-of-bag predictions on the training set.

credit_rf_pred <- predict(credit_rf, type = "prob")[,2]
library(ROCR)
pred <- prediction(credit_rf_pred, credit_train$default.payment.next.month)
perf <- performance(pred, "tpr", "fpr")
plot(perf, colorize=TRUE)

#Get the AUC
unlist(slot(performance(pred, "auc"), "y.values"))
## [1] 0.7658264
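The AUC above is based on the OOB probabilities for the training sample. The analogous out-of-sample ROC curve and AUC can be computed on credit_test (a sketch, assuming the objects fitted above):

```r
library(ROCR)
# Predicted default probabilities on the held-out test set
prob_test <- predict(credit_rf, newdata = credit_test, type = "prob")[, 2]
pred_test <- prediction(prob_test, credit_test$default.payment.next.month)
# Out-of-sample ROC curve
plot(performance(pred_test, "tpr", "fpr"), colorize = TRUE)
# Out-of-sample AUC
unlist(slot(performance(pred_test, "auc"), "y.values"))
```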

Create the confusion matrix based on the cutoff probability derived from the asymmetric cost (pcut = 1/6).

## out-of-sample
pcut <- 1/6
credit_rf_pred_test <- predict(credit_rf, newdata=credit_test, type = "prob")[,2]
credit_rf_class_test <- (credit_rf_pred_test>pcut)*1
table(credit_test$default.payment.next.month, credit_rf_class_test, dnn = c("True", "Pred"))
##     Pred
## True   0   1
##    0 552 376
##    1  63 209
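The confusion matrix can be summarized into a single weighted misclassification cost. The weights below are an assumption consistent with pcut = 1/6, since pcut = w_FP / (w_FP + w_FN) = 1/(1+5): a false negative (missed default) is taken to cost 5 times a false positive.

```r
# Weighted misclassification cost implied by pcut = 1/6
# (the 5:1 weights are an assumption consistent with the cutoff,
# not something reported by randomForest)
cost <- function(obs, pred.class, w_FN = 5, w_FP = 1){
  mean(w_FN * (obs == 1 & pred.class == 0) +
       w_FP * (obs == 0 & pred.class == 1))
}
# e.g. cost(credit_test$default.payment.next.month, credit_rf_class_test)
```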