In this lab we will go through the model building, validation, and interpretation of tree models. The focus will be on rpart package.
CART stands for classification and regression trees:
For the regression tree example, we will use the Boston Housing data. Recall the response variable is the housing price. For the classification tree example, we will use the credit scoring data. The response variable is whether the loan went to default.
Note that unlike logistic regression, the response variable does not have to be binary in case of classification tree. We can use classification tree on classification problems with more than 2 outcomes.
The classification trees is slightly more complicated to specify. What makes it more complicated is that we often have asymmetric cost function. In the credit scoring case it means that false negatives (predicting 0 when truth is 1, or giving out loans that end up in default) will cost more than false positives (predicting 1 when truth is 0, rejecting loans that you should not reject).
Here we make the assumption that false negative cost 5 times of false positive. In real life the cost structure should be carefully researched.
library(rpart)
library(rpart.plot)
credit_data <- read.csv(file = "https://yanyudm.github.io/Data-Mining-R/lecture/data/credit_default.csv", header=T)
# rename
library(dplyr)
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
credit_data<- rename(credit_data, default=default.payment.next.month)
# convert categorical data to factor
credit_data$SEX<- as.factor(credit_data$SEX)
credit_data$EDUCATION<- as.factor(credit_data$EDUCATION)
credit_data$MARRIAGE<- as.factor(credit_data$MARRIAGE)
index <- sample(nrow(credit_data),nrow(credit_data)*0.80)
credit_train = credit_data[index,]
credit_test = credit_data[-index,]
credit_rpart0 <- rpart(formula = default ~ ., data = credit_train, method = "class")
credit_rpart <- rpart(formula = default ~ . , data = credit_train, method = "class", parms = list(loss=matrix(c(0,5,1,0), nrow = 2)))
Note the following important differences from the regression trees:
The method = “class” is required if the response is not declared as factors.
The parms argument, which is a list. The most import element is
the loss matrix. The diagonal elements are 0, and off-diagonal elements
tells you the loss(cost) of classifying something wrong. For binary
classification, the numbers in c() specify the cost in this sequence:
c(0, False Negative, False Positive, 0)
. If you have
symmetric cost, you can ignore the parms argument.
However, this tree with default cost minimizes the symmetric cost, which is misclassification rate. We can take a look at the confusion matrix.
pred0 <- predict(credit_rpart0, type="class")
table(credit_train$default, pred0, dnn = c("True", "Pred"))
## Pred
## True 0 1
## 0 7167 324
## 1 1417 692
Note that in the predict()
function, we need
type="class"
in order to get binary prediction.
Look at the confusion matrix, is it what we expected? Think about why the confusion matrix is like this?
Therefore, for most applications (very unbalanced data), we often have asymmetric cost. Recall the example in logistic regression. In the credit scoring case it means that false negatives (predicting 0 when truth is 1, or giving out loans that end up in default) will cost more than false positives (predicting 1 when truth is 0, rejecting loans that you should not reject).
Here we make the assumption that false negative cost 5 times of false positive. In real life the cost structure should be carefully researched.
credit_rpart <- rpart(formula = default ~ . ,
data = credit_train,
method = "class",
parms = list(loss=matrix(c(0,5,1,0), nrow = 2)))
For more advanced controls, you should carefully read the help document for the rpart function.
credit_rpart
## n= 9600
##
## node), split, n, loss, yval, (yprob)
## * denotes terminal node
##
## 1) root 9600 7491 1 (0.78031250 0.21968750)
## 2) PAY_0< 0.5 7459 5050 0 (0.86459311 0.13540689)
## 4) PAY_AMT3>=1366.5 4558 2350 0 (0.89688460 0.10311540)
## 8) PAY_5< 1 4369 2030 0 (0.90707256 0.09292744) *
## 9) PAY_5>=1 189 125 1 (0.66137566 0.33862434) *
## 5) PAY_AMT3< 1366.5 2901 2361 1 (0.81385729 0.18614271)
## 10) PAY_4< 1 2604 2150 1 (0.82565284 0.17434716)
## 20) BILL_AMT1>=1262.5 1823 1375 0 (0.84914975 0.15085025)
## 40) LIMIT_BAL>=65000 990 565 0 (0.88585859 0.11414141) *
## 41) LIMIT_BAL< 65000 833 671 1 (0.80552221 0.19447779) *
## 21) BILL_AMT1< 1262.5 781 602 1 (0.77080666 0.22919334) *
## 11) PAY_4>=1 297 211 1 (0.71043771 0.28956229) *
## 3) PAY_0>=0.5 2141 1042 1 (0.48668846 0.51331154) *
prp(credit_rpart, extra = 1)
For a binary classification problem, as you learned in logistic regression there are 2 types of predictions. One is the predicted class of response (0 or 1), and the second type is the probability of response being 1. We use an additional argument type=“class” or type=“prob” to get these:
In-sample prediction
credit_train.pred.tree1<- predict(credit_rpart, credit_train, type="class")
table(credit_train$default, credit_train.pred.tree1, dnn=c("Truth","Predicted"))
## Predicted
## Truth 0 1
## 0 4840 2651
## 1 519 1590
Exercise: Out-of-sample prediction
#Predicted Class
credit_test.pred.tree1<-
table()
Usually if you want a hassle-free model, using type=“class” is enough given that you specified the loss matrix correctly in rpart.
We can get the expected loss for this tree model by defining a cost function that has the correct weights:
cost <- function(r, phat){
weight1 <- 5
weight0 <- 1
pcut <- weight0/(weight1+weight0)
c1 <- (r==1)&(phat<pcut) #logical vector - true if actual 1 but predict 0
c0 <-(r==0)&(phat>pcut) #logical vector - true if actual 0 but predict 1
return(mean(weight1*c1+weight0*c0))
}
The output from predict() function with ‘type = “prob”’ is a matrix. The first column is the probability of response = 0, and the second column is the probability of response = 1.
head(predict(credit_rpart, credit_test, type="prob"))
## 0 1
## 3 0.9070726 0.09292744
## 6 0.9070726 0.09292744
## 9 0.9070726 0.09292744
## 11 0.9070726 0.09292744
## 13 0.8858586 0.11414141
## 19 0.9070726 0.09292744
Calculate the in-sample cost.
cost(credit_train$default, predict(credit_rpart, credit_train, type="prob")[,2])
## [1] 0.5464583
Exercise: Out-of-sample prediction
#Predicted Class
credit_test.pred.tree1<-
table()
Exercise: Try type="prob"
in
prediction, what can you say about these predicted probabilities?
Calculate the cost for testing sample using above cost function
cost(credit_test$default, predict(credit_rpart, credit_test, type="prob")[,2])
## [1] 0.5608333
We can compare this model’s out-of-sample performance with the logistic regression model with all variables in it.
#Fit logistic regression model
credit_glm <- glm(default~.,
data = credit_train,
family=binomial)
#Get binary prediction
credit_test_pred_glm <- predict(credit_glm, credit_test, type="response")
#Calculate cost using test set
cost(credit_test$default, credit_test_pred_glm)
## [1] 0.6916667
#Confusion matrix
table(credit_test$default, as.numeric(credit_test_pred_glm>1/6), dnn=c("Truth","Predicted"))
## Predicted
## Truth 0 1
## 0 852 1025
## 1 127 396
Exercise: Comparison for in-sample performance. Which model do you think is better?
Recall that ROC Curve gives you the trade-off between hit rate (1 - false positive) and false negative, and area under the curve (AUC) can be used as a measure of how good the binary classification model performs when you do not know the cost function.
To get ROC curve, we get the predicted probability of Y being 1 from the fitted tree. The additional cp parameter controls the complexity of tree. Here we change it from its default 0.01 to a smaller value to grow a more complex tree than just the root node (if you use the default the tree you get will tell you to classify everything as 0). More discussion on this in the next section.
credit_rpart <- rpart(formula = default ~ .,
data = credit_train,
method = "class",
parms = list(loss=matrix(c(0,5,1,0), nrow = 2)))
#Probability of getting 1
credit_test_prob_rpart = predict(credit_rpart, credit_test, type="prob")
credit_test_prob_rpart has 2 columns, the first one is prob(Y) = 0 and the second prob(Y) = 1. We only need the second column because they add to 1 for binary classification.
To get ROC curve we use
install.packages('ROCR')
library(ROCR)
pred = prediction(credit_test_prob_rpart[,2], credit_test$default)
perf = performance(pred, "tpr", "fpr")
plot(perf, colorize=TRUE)
Area under the curve is given by (do not worry about the syntax here):
slot(performance(pred, "auc"), "y.values")[[1]]
## [1] 0.7464324
For a given cut-off probability, the 0/1 prediction result can be calculated similar to what you do in logistic regression
credit_test_pred_rpart = as.numeric(credit_test_prob_rpart[,2] > 1/(5+1))
table(credit_test$default, credit_test_pred_rpart, dnn=c("Truth","Predicted"))
## Predicted
## Truth 0 1
## 0 1231 646
## 1 140 383
You can refer to the last section on specifying a loss matrix, rpart will automatically generate decision rules with your cost structure taken into consideration.
Exercise: Draw the ROC curve for training sample.
Use rpart()
to fit regression and classification
trees.
Know how to interpret a tree.
Use predict()
for prediction, and how to assess the
performance.
Know how to use Cp plot/table to prune a large tree.