In supervised learning, the outcome \(Y\) is available (also called the dependent variable, or response in statistics), together with a set of predictors (also called regressors, covariates, features, or independent variables).
There are two main types of problems: regression problems and classification problems.
To demonstrate this simple machine learning algorithm, I use the Iris dataset, a famous dataset featured in almost all machine learning courses, and apply KNN to it to train a classifier for Iris species.
Before you start, you should always explore the data first.
Let’s first load the Iris dataset. This is a very famous dataset used in almost all data mining and machine learning courses, and it is a built-in dataset in R. The dataset consists of 50 samples from each of three species of Iris flowers (Iris setosa, Iris virginica, and Iris versicolor). Four features (variables) were measured on each sample: the length and the width of the sepal and petal, in centimeters. It was introduced by Sir Ronald Fisher in 1936.
The iris flower data set is included in R. It is a data frame with 150 cases (rows) and 5 variables (columns) named Sepal.Length, Sepal.Width, Petal.Length, Petal.Width, and Species.
First, load the iris data into the current workspace:
data(iris)
iris
What is in the dataset? You can use head() or tail() to print the first or last few rows of a dataset:
head(iris)
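Similarly, tail() prints the end of the data frame; as a quick illustration (the second argument, giving the number of rows, is optional):
tail(iris, 3)  # show the last 3 rows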
Check the dimensionality: the dataset has 150 rows (observations) and 5 columns (variables).
dim(iris)
## [1] 150 5
Variable names or column names
names(iris); # or colnames(iris)
## [1] "Sepal.Length" "Sepal.Width" "Petal.Length" "Petal.Width" "Species"
Structure of the data frame; note the difference between num and Factor.
str(iris)
## 'data.frame': 150 obs. of 5 variables:
## $ Sepal.Length: num 5.1 4.9 4.7 4.6 5 5.4 4.6 5 4.4 4.9 ...
## $ Sepal.Width : num 3.5 3 3.2 3.1 3.6 3.9 3.4 3.4 2.9 3.1 ...
## $ Petal.Length: num 1.4 1.4 1.3 1.5 1.4 1.7 1.4 1.5 1.4 1.5 ...
## $ Petal.Width : num 0.2 0.2 0.2 0.2 0.2 0.4 0.3 0.2 0.2 0.1 ...
## $ Species : Factor w/ 3 levels "setosa","versicolor",..: 1 1 1 1 1 1 1 1 1 1 ...
By default, R treats strings as factors (categorical variables). In many situations (for example, building a regression model) this is what you want, because R can automatically create “dummy variables” from the factors. However, when merging data from different sources this can cause errors. In that case you can use the stringsAsFactors = FALSE option in read.table().
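As a minimal sketch (the file name mydata.csv is hypothetical); note that since R 4.0.0 the default is already stringsAsFactors = FALSE, so setting it explicitly mainly documents your intent:
# read a hypothetical CSV file, keeping character columns as plain strings
mydata <- read.table("mydata.csv", header = TRUE, sep = ",", stringsAsFactors = FALSE)
str(mydata)  # character columns appear as chr rather than Factor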
class(iris[,1])
## [1] "numeric"
class(iris[,5])
## [1] "factor"
Simple summary statistics: try the summary() function.
summary(iris)
## Sepal.Length Sepal.Width Petal.Length Petal.Width Species
## Min. :4.300 Min. :2.000 Min. :1.000 Min. :0.100 setosa :50
## 1st Qu.:5.100 1st Qu.:2.800 1st Qu.:1.600 1st Qu.:0.300 versicolor:50
## Median :5.800 Median :3.000 Median :4.350 Median :1.300 virginica :50
## Mean :5.843 Mean :3.057 Mean :3.758 Mean :1.199
## 3rd Qu.:6.400 3rd Qu.:3.300 3rd Qu.:5.100 3rd Qu.:1.800
## Max. :7.900 Max. :4.400 Max. :6.900 Max. :2.500
data("iris")
str(iris)
## 'data.frame': 150 obs. of 5 variables:
## $ Sepal.Length: num 5.1 4.9 4.7 4.6 5 5.4 4.6 5 4.4 4.9 ...
## $ Sepal.Width : num 3.5 3 3.2 3.1 3.6 3.9 3.4 3.4 2.9 3.1 ...
## $ Petal.Length: num 1.4 1.4 1.3 1.5 1.4 1.7 1.4 1.5 1.4 1.5 ...
## $ Petal.Width : num 0.2 0.2 0.2 0.2 0.2 0.4 0.3 0.2 0.2 0.1 ...
## $ Species : Factor w/ 3 levels "setosa","versicolor",..: 1 1 1 1 1 1 1 1 1 1 ...
Suppose we use the first 30 observations of each species as the training sample and the rest as the testing sample.
setosa <- iris[iris$Species=="setosa",]
versicolor <- iris[iris$Species=="versicolor",]
virginica <- iris[iris$Species=="virginica",]
ind <- 1:30
iris_train <- rbind(setosa[ind,], versicolor[ind,], virginica[ind,])
iris_test <- rbind(setosa[-ind,], versicolor[-ind,], virginica[-ind,])
Exercise (HW 1): Randomly sample a training data set that contains 80% of the original data points.
In R, the knn() function is designed to perform K-nearest neighbor classification. It is in the package "class".
install.packages("class")
library(class)
knn_iris <- knn(train = iris_train[, -5], test = iris_test[, -5], cl=iris_train[,5], k=5)
Here, the function knn() requires at least 3 inputs (train, test, and cl); the remaining inputs have default values. train is the training dataset without the label (Y), test is the testing sample without the label, and cl specifies the labels of the training dataset. By default \(k=1\), which results in 1-nearest neighbor.
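As a quick illustrative sketch (the variable name knn_iris_1nn is just for illustration), omitting the k argument falls back to the default \(k=1\), i.e., a 1-nearest neighbor classifier:
# same call as above, but using the default k = 1 (1-nearest neighbor)
knn_iris_1nn <- knn(train = iris_train[, -5], test = iris_test[, -5], cl = iris_train[, 5])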
Here I use the test set to create a contingency table and show the performance of the classifier.
table(iris_test[,5], knn_iris, dnn = c("True", "Predicted"))
## Predicted
## True setosa versicolor virginica
## setosa 20 0 0
## versicolor 0 19 1
## virginica 0 1 19
sum(iris_test[,5] != knn_iris)
## [1] 2
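Equivalently, you can report the misclassification rate rather than the raw count; a small illustration using base R (mean() of a logical vector gives the proportion of TRUE values):
# proportion of test observations that were misclassified
mean(iris_test[, 5] != knn_iris)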
If you only have a set of features (predictors) measured on a set of samples (observations), and you do not have an outcome variable, you are dealing with an unsupervised learning problem.
In this case, your objective could be very different: for example, you may want to group similar observations into clusters. Clustering can be a useful pre-processing step to obtain labels for supervised learning.
K-means clustering with 5 clusters: the ‘fpc’ package provides the ‘plotcluster’ function for visualizing the clusters. You need to run install.packages('fpc') to install it first.
install.packages("fpc")
library(fpc)
fit <- kmeans(iris[,1:4], 5)
plotcluster(iris[,1:4], fit$cluster)
The first argument of the kmeans function is the dataset that you wish to cluster, here columns 1-4 of the iris dataset; the last column is the true category of each observation, so we do not include it in the analysis. The second argument, 5, indicates that you want a 5-cluster solution. The result of the cluster analysis is assigned to the variable fit, and the plotcluster function is used to visualize the result.
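Note that kmeans() uses random starting centers, so the result can vary between runs. A minimal sketch for making the run reproducible and more stable (set.seed() and the nstart argument are part of base R / the stats package; the seed value is arbitrary):
set.seed(123)                              # fix the random seed for reproducibility
fit <- kmeans(iris[, 1:4], 5, nstart = 25) # try 25 random starts and keep the best solution
plotcluster(iris[, 1:4], fit$cluster)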
Do you think it is a good solution? Try it with 3 clusters.
kmeans_result <- kmeans(iris[,1:4], 3)
plotcluster(iris[,1:4], kmeans_result$cluster)
# Hierarchical clustering using the default distance (Euclidean) and linkage (complete)
hc_result <- hclust(dist(iris[,1:4]))
plot(hc_result)
# Cut the dendrogram into 3 clusters
rect.hclust(hc_result, k=3)
Three things happen in the first line. First, dist(iris[, 1:4]) calculates the distance matrix between observations (how similar the observations are to each other, judging from the 4 numerical variables). Then hclust takes the distance matrix as input and gives a hierarchical clustering solution. Finally, the solution is assigned to the variable hc_result. In hierarchical clustering you do not need to specify the number of clusters in advance; it depends on where you cut the dendrogram.
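If you want explicit cluster labels from the dendrogram, a short sketch using cutree() from the stats package cuts the tree at a chosen number of clusters (the variable name hc_clusters is just for illustration) and lets you compare the result against the true species:
# assign each observation to one of 3 clusters by cutting the tree
hc_clusters <- cutree(hc_result, k = 3)
table(iris$Species, hc_clusters)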