credit.data <- read.csv("https://yanyudm.github.io/Data-Mining-R/lecture/data/credit0.csv", header=T)
We remove X9 and id from the data since we will not be using them for prediction.
credit.data$X9 = NULL
credit.data$id = NULL
credit.data$Y = as.factor(credit.data$Y) # code Y as a factor so that svm() performs classification
Now split the data 90/10 as training/testing datasets:
id_train <- sample(nrow(credit.data),nrow(credit.data)*0.90)
credit.train = credit.data[id_train,]
credit.test = credit.data[-id_train,]
The training dataset has 61 variables and 4,500 observations.
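Because sample() draws a random subset, the exact split (and hence the numbers reported below) will vary from run to run. If you want a reproducible split, you can fix the random seed before sampling; the seed value below is arbitrary:
set.seed(2023) # any fixed value works; chosen only for reproducibility
id_train <- sample(nrow(credit.data), nrow(credit.data)*0.90)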
SVM is probably one of the best off-the-shelf classifiers for many problems. It handles nonlinearity, is well regularized (avoids overfitting), has few tuning parameters, and is fast for large numbers of observations. It can also be adapted to handle regression problems. You can read more about SVM in Chapter 12 of the textbook.
The R package e1071 offers an interface to libsvm, the most popular SVM implementation. You can read more about the usage of the package in this short tutorial (http://cran.r-project.org/web/packages/e1071/vignettes/svmdoc.pdf).
install.packages('e1071') # only needed once, if the package is not yet installed
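As an aside, the same svm() function handles regression: when the response is numeric it defaults to eps-regression instead of C-classification. A minimal sketch on the built-in mtcars data (the dataset and formula are illustrative, not part of the credit example):
library(e1071)
fit.reg <- svm(mpg ~ ., data = mtcars) # numeric response, so eps-regression is used
head(predict(fit.reg)) # fitted values on the training data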
library(e1071)
# probability = TRUE is needed to extract class probabilities later;
# gamma = 1/length(credit.train) mirrors svm()'s default of one over the data dimension
credit.svm = svm(Y ~ ., data = credit.train, cost = 1, gamma = 1/length(credit.train), probability = TRUE)
prob.svm = predict(credit.svm, credit.test, probability = TRUE)
prob.svm = attr(prob.svm, 'probabilities')[, "1"] # predict() attaches a matrix of class probabilities; select the column for class "1" by name, since column order is not guaranteed
pred.svm = as.numeric((prob.svm >= 0.08)) # low cut-off, close to the weight0/(weight0+weight1) = 1/11 ≈ 0.09 implied by the 10:1 costs defined below
table(credit.test$Y,pred.svm,dnn=c("Obs","Pred"))
## Pred
## Obs 0 1
## 0 397 75
## 1 18 10
mean(ifelse(credit.test$Y != pred.svm, 1, 0)) # overall misclassification rate on the test set
## [1] 0.186
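The cost and gamma above were fixed by hand. e1071 also provides tune.svm() for choosing them by cross-validated grid search; the grid below is a small illustrative sketch (a real search would typically cover a wider range):
credit.tune <- tune.svm(Y ~ ., data = credit.train,
                        gamma = 10^(-3:-1), cost = 10^(0:2))
summary(credit.tune) # cross-validation error for each (gamma, cost) pair
credit.tune$best.parameters # the combination with the lowest CV error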
You are already familiar with the credit scoring data set. Let's define a cost function for benchmarking testing set performance. Note that this is slightly different from the one we used when searching for the optimal cut-off probability in logistic regression: here the second argument is the predicted class instead of the predicted probability (since many methods do not produce predicted probabilities).
creditcost <- function(observed, predicted){
  weight1 <- 10 # cost of a false negative (actual 1, predicted 0)
  weight0 <- 1  # cost of a false positive (actual 0, predicted 1)
  c1 <- (observed == 1) & (predicted == 0) # logical vector - TRUE if actual 1 but predicted 0
  c0 <- (observed == 0) & (predicted == 1) # logical vector - TRUE if actual 0 but predicted 1
  return(mean(weight1*c1 + weight0*c0))
}
creditcost(credit.test$Y, pred.svm)
## [1] 0.51
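The 0.08 cut-off above was supplied by hand. As a sketch, you could instead search a grid of cut-offs for the one minimizing creditcost, analogous to the cut-off search used in logistic regression; here the search runs on the training set so that the test data is not used for tuning (the grid range is illustrative):
prob.svm.train <- attr(predict(credit.svm, credit.train, probability = TRUE),
                       "probabilities")[, "1"]
cutoffs <- seq(0.01, 0.50, by = 0.01)
costs <- sapply(cutoffs, function(p)
  creditcost(credit.train$Y, as.numeric(prob.svm.train >= p)))
cutoffs[which.min(costs)] # cut-off with the lowest training-set cost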