The objective of this case is to get you started with regression model building, variable selection, and model evaluation in R.
We use Boston Housing Data as an illustrative example in this lab. We learn basic linear regression and analysis with R. Code in this file is not the only correct way to do things, however it is important for you to understand what each statement does. You will have to modify the code accordingly for your homework.
Boston housing data is a built-in dataset in MASS
package, so you do not need to download externally. Package MASS
comes with R when you installed R, so no need to use install.packages(MASS)
to download and install, but you do need to load this package.
library(MASS)
data(Boston); #this data is in MASS package
colnames(Boston)
## [1] "crim" "zn" "indus" "chas" "nox" "rm" "age"
## [8] "dis" "rad" "tax" "ptratio" "black" "lstat" "medv"
You can find details of the dataset from help document.
?Boston
The original data are 506 observations on 14 variables, medv being the response variable \(y\):
We skip the Exploratory Data Analysis (EDA) in this notes, but you should not omit it in your HW and Cases. EDA is very important and always the first analysis to do before any modeling.
Next we sample 90% of the original data and use it as the training set. The remaining 10% is used as test set. The regression model will be built on the training set and future performance of your model will be evaluated with the test set.
sample_index <- sample(nrow(Boston),nrow(Boston)*0.90)
Boston_train <- Boston[sample_index,]
Boston_test <- Boston[-sample_index,]
If we want our results to be invariant to the units and the parameter estimates \(\beta_i\) to be comparible, we can standardize the variables. Essentially we are replacing the original values with their z-score.
1st Way: create new variables manually.
Boston$sd.crim <- (Boston$crim-mean(Boston$crim))/sd(Boston$crim);
This does the same thing.
Boston$sd.crim <- scale(Boston$crim);
2nd way: If you have a lot of variables to standardize then the above is not very pleasing to do. You can use a loop like this. It standardizes every varables other than the last one which is \(y\).
for (i in 1:(ncol(Boston_train)-1)){
Boston_train[,i] <- scale(Boston_train[,i])
}
The technique is not as important in linear regression because it will only affect the interpretation but not the model estimation and inference.
model_1 <- lm(medv~., data = Boston_train)
model_2 <- lm(medv~crim+zn, data = Boston_train)
summary(model_1)
summary(model_2)
AIC(model_1); BIC(model_1)
AIC(model_2); BIC(model_2)
Exercise: Compare MSE, \(R^2\), and MSPE of these three models.
The ‘leaps’ package provides procedures for best subset regression.
install.packages('leaps')
library(leaps)
Which subset of variables should you include in order to minimize BIC?
#regsubsets only takes data frame as input
subset_result <- regsubsets(medv~.,data=Boston_train, nbest=2, nvmax = 14)
summary(subset_result)
## Subset selection object
## Call: regsubsets.formula(medv ~ ., data = Boston_train, nbest = 2,
## nvmax = 14)
## 13 Variables (and intercept)
## Forced in Forced out
## crim FALSE FALSE
## zn FALSE FALSE
## indus FALSE FALSE
## chas FALSE FALSE
## nox FALSE FALSE
## rm FALSE FALSE
## age FALSE FALSE
## dis FALSE FALSE
## rad FALSE FALSE
## tax FALSE FALSE
## ptratio FALSE FALSE
## black FALSE FALSE
## lstat FALSE FALSE
## 2 subsets of each size up to 13
## Selection Algorithm: exhaustive
## crim zn indus chas nox rm age dis rad tax ptratio black lstat
## 1 ( 1 ) " " " " " " " " " " " " " " " " " " " " " " " " "*"
## 1 ( 2 ) " " " " " " " " " " "*" " " " " " " " " " " " " " "
## 2 ( 1 ) " " " " " " " " " " "*" " " " " " " " " " " " " "*"
## 2 ( 2 ) " " " " " " " " " " " " " " " " " " " " "*" " " "*"
## 3 ( 1 ) " " " " " " " " " " "*" " " " " " " " " "*" " " "*"
## 3 ( 2 ) " " " " " " " " " " "*" " " "*" " " " " " " " " "*"
## 4 ( 1 ) " " " " " " " " " " "*" " " "*" " " " " "*" " " "*"
## 4 ( 2 ) " " " " " " "*" " " "*" " " " " " " " " "*" " " "*"
## 5 ( 1 ) " " " " " " " " "*" "*" " " "*" " " " " "*" " " "*"
## 5 ( 2 ) " " " " " " " " " " "*" " " "*" " " " " "*" "*" "*"
## 6 ( 1 ) " " " " " " "*" "*" "*" " " "*" " " " " "*" " " "*"
## 6 ( 2 ) " " " " " " " " "*" "*" " " "*" " " " " "*" "*" "*"
## 7 ( 1 ) " " " " " " "*" "*" "*" " " "*" " " " " "*" "*" "*"
## 7 ( 2 ) " " "*" " " "*" "*" "*" " " "*" " " " " "*" " " "*"
## 8 ( 1 ) "*" " " " " "*" "*" "*" " " "*" "*" " " "*" " " "*"
## 8 ( 2 ) " " " " " " "*" "*" "*" " " "*" "*" " " "*" "*" "*"
## 9 ( 1 ) "*" " " " " "*" "*" "*" " " "*" "*" "*" "*" " " "*"
## 9 ( 2 ) "*" "*" " " " " "*" "*" " " "*" "*" "*" "*" " " "*"
## 10 ( 1 ) "*" "*" " " "*" "*" "*" " " "*" "*" "*" "*" " " "*"
## 10 ( 2 ) "*" "*" " " " " "*" "*" " " "*" "*" "*" "*" "*" "*"
## 11 ( 1 ) "*" "*" " " "*" "*" "*" " " "*" "*" "*" "*" "*" "*"
## 11 ( 2 ) "*" "*" "*" "*" "*" "*" " " "*" "*" "*" "*" " " "*"
## 12 ( 1 ) "*" "*" "*" "*" "*" "*" " " "*" "*" "*" "*" "*" "*"
## 12 ( 2 ) "*" "*" " " "*" "*" "*" "*" "*" "*" "*" "*" "*" "*"
## 13 ( 1 ) "*" "*" "*" "*" "*" "*" "*" "*" "*" "*" "*" "*" "*"
plot(subset_result, scale="bic")
Each row represents a model. Black indicates that a variable is included in the model, while white indicates that it is not. The argument scale = ""
can be “Cp”, “adjr2”, “r2” or “bic”.
What is the problem with best subset regression? If there are n independent variables, the number of possible nonempty subsets is 2^n - 1. If you try a best subset regression with more than 50 variables, you might need to wait for your entire life to get the result.
To perform the Forward/Backward/Stepwise Regression in R, we need to define the starting points:
nullmodel=lm(medv~1, data=Boston_train)
fullmodel=lm(medv~., data=Boston_train)
nullmodel is the model with no varaible in it, while fullmodel is the model with every variable in it.
model_step_b <- step(fullmodel,direction='backward')
## Start: AIC=1416.27
## medv ~ crim + zn + indus + chas + nox + rm + age + dis + rad +
## tax + ptratio + black + lstat
##
## Df Sum of Sq RSS AIC
## - age 1 1.34 9619.9 1414.3
## - indus 1 8.89 9627.5 1414.7
## <none> 9618.6 1416.3
## - black 1 146.82 9765.4 1421.2
## - chas 1 194.15 9812.7 1423.4
## - zn 1 219.12 9837.7 1424.5
## - tax 1 229.66 9848.2 1425.0
## - crim 1 239.94 9858.5 1425.5
## - nox 1 495.39 10114.0 1437.1
## - rad 1 504.19 10122.8 1437.5
## - ptratio 1 891.37 10509.9 1454.6
## - dis 1 1124.76 10743.3 1464.6
## - rm 1 1608.48 11227.0 1484.6
## - lstat 1 2647.79 12266.4 1524.9
##
## Step: AIC=1414.34
## medv ~ crim + zn + indus + chas + nox + rm + dis + rad + tax +
## ptratio + black + lstat
##
## Df Sum of Sq RSS AIC
## - indus 1 8.85 9628.8 1412.8
## <none> 9619.9 1414.3
## - black 1 148.73 9768.6 1419.3
## - chas 1 196.80 9816.7 1421.5
## - zn 1 218.38 9838.3 1422.5
## - tax 1 228.47 9848.4 1423.0
## - crim 1 240.12 9860.0 1423.6
## - rad 1 503.12 10123.0 1435.5
## - nox 1 517.18 10137.1 1436.2
## - ptratio 1 890.36 10510.3 1452.6
## - dis 1 1274.23 10894.1 1468.9
## - rm 1 1676.56 11296.5 1485.4
## - lstat 1 2900.98 12520.9 1532.3
##
## Step: AIC=1412.76
## medv ~ crim + zn + chas + nox + rm + dis + rad + tax + ptratio +
## black + lstat
##
## Df Sum of Sq RSS AIC
## <none> 9628.8 1412.8
## - black 1 146.93 9775.7 1417.7
## - chas 1 206.14 9834.9 1420.4
## - zn 1 210.71 9839.5 1420.6
## - tax 1 239.50 9868.3 1421.9
## - crim 1 243.34 9872.1 1422.1
## - rad 1 510.75 10139.5 1434.3
## - nox 1 521.17 10149.9 1434.7
## - ptratio 1 882.32 10511.1 1450.7
## - dis 1 1379.82 11008.6 1471.7
## - rm 1 1668.26 11297.0 1483.5
## - lstat 1 2895.45 12524.2 1530.4
model_step_f <- step(nullmodel, scope=list(lower=nullmodel, upper=fullmodel), direction='forward')
## Start: AIC=2007.18
## medv ~ 1
##
## Df Sum of Sq RSS AIC
## + lstat 1 20825.7 16495 1637.7
## + rm 1 18010.7 19310 1709.4
## + indus 1 8747.3 28573 1887.7
## + ptratio 1 8289.6 29031 1894.9
## + tax 1 8024.0 29296 1899.0
## + nox 1 7003.1 30317 1914.6
## + crim 1 5569.1 31751 1935.7
## + age 1 5351.0 31969 1938.8
## + rad 1 5077.6 32243 1942.6
## + zn 1 4231.9 33088 1954.4
## + black 1 3885.7 33435 1959.2
## + dis 1 2217.1 35103 1981.3
## + chas 1 743.7 36577 2000.0
## <none> 37320 2007.2
##
## Step: AIC=1637.67
## medv ~ lstat
##
## Df Sum of Sq RSS AIC
## + rm 1 3280.2 13214 1538.8
## + ptratio 1 1815.2 14679 1586.6
## + dis 1 870.7 15624 1615.0
## + chas 1 556.7 15938 1624.0
## + age 1 327.9 16167 1630.5
## + tax 1 158.4 16336 1635.3
## + crim 1 105.7 16389 1636.8
## + black 1 95.1 16400 1637.0
## <none> 16495 1637.7
## + indus 1 43.0 16452 1638.5
## + zn 1 40.5 16454 1638.5
## + nox 1 10.1 16484 1639.4
## + rad 1 0.4 16494 1639.7
##
## Step: AIC=1538.79
## medv ~ lstat + rm
##
## Df Sum of Sq RSS AIC
## + ptratio 1 1128.40 12086 1500.2
## + dis 1 523.87 12691 1522.4
## + chas 1 512.89 12702 1522.8
## + black 1 300.66 12914 1530.3
## + tax 1 208.44 13006 1533.5
## + crim 1 208.08 13006 1533.6
## + age 1 77.55 13137 1538.1
## <none> 13214 1538.8
## + rad 1 46.09 13168 1539.2
## + indus 1 7.85 13207 1540.5
## + zn 1 3.04 13211 1540.7
## + nox 1 0.15 13214 1540.8
##
## Step: AIC=1500.18
## medv ~ lstat + rm + ptratio
##
## Df Sum of Sq RSS AIC
## + dis 1 596.05 11490 1479.2
## + chas 1 374.86 11711 1487.8
## + black 1 239.47 11847 1493.1
## + age 1 111.57 11974 1498.0
## + crim 1 82.40 12004 1499.1
## <none> 12086 1500.2
## + rad 1 41.93 12044 1500.6
## + zn 1 38.60 12047 1500.7
## + indus 1 11.62 12074 1501.7
## + tax 1 9.52 12076 1501.8
## + nox 1 6.80 12079 1501.9
##
## Step: AIC=1479.16
## medv ~ lstat + rm + ptratio + dis
##
## Df Sum of Sq RSS AIC
## + nox 1 663.27 10827 1454.1
## + black 1 333.06 11157 1467.8
## + chas 1 259.44 11230 1470.8
## + indus 1 191.77 11298 1473.5
## + crim 1 180.83 11309 1474.0
## + tax 1 160.37 11330 1474.8
## + zn 1 112.67 11377 1476.7
## + age 1 51.92 11438 1479.1
## <none> 11490 1479.2
## + rad 1 2.05 11488 1481.1
##
## Step: AIC=1454.11
## medv ~ lstat + rm + ptratio + dis + nox
##
## Df Sum of Sq RSS AIC
## + chas 1 298.869 10528 1443.4
## + black 1 170.017 10657 1448.9
## + zn 1 127.102 10700 1450.7
## + crim 1 108.031 10719 1451.5
## + rad 1 105.034 10722 1451.7
## <none> 10827 1454.1
## + indus 1 9.074 10818 1455.7
## + tax 1 0.682 10826 1456.1
## + age 1 0.003 10827 1456.1
##
## Step: AIC=1443.37
## medv ~ lstat + rm + ptratio + dis + nox + chas
##
## Df Sum of Sq RSS AIC
## + black 1 143.805 10384 1439.1
## + zn 1 142.119 10386 1439.2
## + rad 1 110.980 10417 1440.5
## + crim 1 86.449 10441 1441.6
## <none> 10528 1443.4
## + indus 1 12.850 10515 1444.8
## + age 1 1.306 10526 1445.3
## + tax 1 0.291 10528 1445.4
##
## Step: AIC=1439.12
## medv ~ lstat + rm + ptratio + dis + nox + chas + black
##
## Df Sum of Sq RSS AIC
## + rad 1 180.342 10204 1433.1
## + zn 1 157.810 10226 1434.2
## + crim 1 48.706 10335 1439.0
## <none> 10384 1439.1
## + tax 1 8.591 10375 1440.7
## + indus 1 8.542 10376 1440.7
## + age 1 4.098 10380 1440.9
##
## Step: AIC=1433.14
## medv ~ lstat + rm + ptratio + dis + nox + chas + black + rad
##
## Df Sum of Sq RSS AIC
## + crim 1 202.224 10002 1426.0
## + tax 1 157.186 10046 1428.1
## + zn 1 106.690 10097 1430.4
## <none> 10204 1433.1
## + indus 1 16.366 10187 1434.4
## + age 1 0.766 10203 1435.1
##
## Step: AIC=1426.04
## medv ~ lstat + rm + ptratio + dis + nox + chas + black + rad +
## crim
##
## Df Sum of Sq RSS AIC
## + tax 1 162.005 9839.5 1420.6
## + zn 1 133.216 9868.3 1421.9
## <none> 10001.5 1426.0
## + indus 1 21.948 9979.5 1427.0
## + age 1 1.225 10000.2 1428.0
##
## Step: AIC=1420.61
## medv ~ lstat + rm + ptratio + dis + nox + chas + black + rad +
## crim + tax
##
## Df Sum of Sq RSS AIC
## + zn 1 210.711 9628.8 1412.8
## <none> 9839.5 1420.6
## + indus 1 1.183 9838.3 1422.5
## + age 1 0.587 9838.9 1422.6
##
## Step: AIC=1412.76
## medv ~ lstat + rm + ptratio + dis + nox + chas + black + rad +
## crim + tax + zn
##
## Df Sum of Sq RSS AIC
## <none> 9628.8 1412.8
## + indus 1 8.8472 9619.9 1414.3
## + age 1 1.2965 9627.5 1414.7
model_step_s <- step(nullmodel, scope=list(lower=nullmodel, upper=fullmodel), direction='both')
One caution when comparing fit statistics using AIC, the definition varies by program/procedure.
model_1 <- lm(medv~crim+zn+chas+nox+rm+dis+rad+tax+ptratio+black+lstat, data=Boston_train)
extractAIC(model_1)
## [1] 12.000 1412.756
AIC(model_1)
## [1] 2705.99
Exercise
- Compare in-sample and out-of-sample performance between these reduced models.
- Conduct 10-fold cross validation on the full sample and compare the CV scores.