Without a doubt, R is gaining ground on the other programming languages popular among data scientists and statisticians. Today, most statisticians use R to analyze data, fit models, and do research. Many practitioners also favor R because it is free, open source, and backed by an active community full of great researchers. Researchers and software engineers alike help the R community grow organically. All this cooperation is helping R become one of the greatest statistical programming languages in the world. RStudio, REvolution, ggplot2, Shiny App, etc. are a few of the great tools in this community that play a big role in making the language great.
For people who want to study frontier statistical methodologies, R lets them focus on implementing statistical methods. One can easily implement new estimation methods or algorithms for simulation or comparison purposes. One can also easily develop R packages for newly proposed methods or models, which practitioners can then use, perhaps right after the method is published. This was not possible in the past, when statisticians knew little about software development and engineers cared little about statistical theory. Now, with a little knowledge of R package development, researchers can package their new methods so that people in industry can quickly apply research results to accelerate their business or improve their decision making, as simply as clicking a button in SPSS. On the other side, practitioners can develop packages for their own projects, which makes their intellectual property reproducible for further use (they don’t necessarily need to publish their packages). Although open source software may face issues such as infrequent maintenance or security risks, it makes the flow of knowledge seamless.
Shiny is as easy to learn as any other R package. Here is the tutorial: https://shiny.rstudio.com/tutorial/
RStudio has a very active community that keeps enriching what R can do. There are several great features I want to emphasize and recommend.
Markdown is an elegant syntax for writers, whether they are novelists or researchers in the social or natural sciences who need more sophisticated notation and graphical or numerical results. Its simple syntax lets writers focus on the essential part of knowledge sharing: the content and the ideas themselves. Although LaTeX is a great tool for generating beautiful documents, researchers, unlike publishers, don’t really need to know much about tedious details such as formatting, typesetting, referencing, and perfectly positioning statistical tables or figures. Because a smooth writing process leads to fluent communication with readers, the advantage of Markdown is obvious.
Here is Markdown Basics: http://rmarkdown.rstudio.com/authoring_basics.html. Many other tutorials are available on the same website.
Newer versions of RStudio provide the R Notebook feature, which is similar to R Markdown. Both provide an easy environment to test and iterate when writing an article with code. You can run programs within the article and display whatever results you want to insert.
Shiny Web App provides a very important pipeline for statisticians and data scientists to interact with their audience or customers. With this simple and user-friendly tool, those who work with data can directly explain what they have found to those who are interested. This process used to be complicated and required professionals in software or web development. The Shiny App, however, simplifies the process and provides an intuitive way to build interactive applications on top of statistical results. It is therefore a tool worth learning, so that you can smoothly turn your ideas into real products.
R
R Studio
Shiny App
R Markdown
The download and installation should be straightforward. In case you encounter problems, you can check the following video tutorials.
Install R: http://www.youtube.com/watch?v=SJ9sVyqWJn8&hd=1
Install R Studio: http://www.youtube.com/watch?v=6aTRbo7kdGk&hd=1
RStudio runs on top of R. It is an IDE (Integrated Development Environment) with many advanced features. These lab notes were created with R Markdown, a very nice and useful tool from RStudio.
After you open RStudio, it should look like this:
There are three panels showing. However, you need a fourth one, which is the editor window. Click the green-plus icon in the top-left corner and select R Script. You write all your code in this editor window, and remember to save it!
Other Panels
R is open source software. That means everyone can contribute to it by writing R packages and sharing them with the community. A package usually consists of several R functions and datasets that are designed for specific tasks. There are over 10,000 packages on CRAN.
Before using functions from a particular package, you need to download the package and then load it into your working environment. We will see this later.
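As a minimal sketch of these two steps, the snippet below loads `tools`, a package that ships with R, so nothing has to be downloaded. The `install.packages()` call for `readxl` is left commented out because it needs internet access. Picking `tools` and its `file_ext()` function is only an assumption made for illustration.

```r
# Step 1 (once per computer): download a package from CRAN.
# install.packages("readxl")   # commented out: requires internet access

# Step 2 (in every new R session): load an installed package.
library(tools)                 # 'tools' ships with R, so no download is needed

# Functions from the loaded package can now be called by name.
ext <- file_ext("storks.csv")  # extract the extension of a file name
ext
```

Downloading is needed only once, while `library()` has to be called again each time you start a new R session.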
You may call yourself a software developer if you can write R packages. If you are interested in writing packages, here is a good book to read: http://r-pkgs.had.co.nz/.
Always set the working directory before you start coding. The working directory is where you read external data, write data, and save your code.
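One way is to do it from code with `getwd()` and `setwd()`. The sketch below uses the temporary folder returned by `tempdir()` only as a stand-in so that it runs on any computer; in practice you would pass the path of your own project folder.

```r
# Remember which folder R is currently working in
old_wd <- getwd()

# Switch to another folder (a temporary folder here; in practice use your
# own path, e.g. setwd("path/to/your/folder"))
setwd(tempdir())
getwd()

# Switch back to the original folder
setwd(old_wd)
```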
Or
Click Session -> Set Working Directory -> Choose Directory, then choose the folder in which you wish to save your work. This will be the default working directory. Create an “R Script”, name it, and save it. Then you can start writing code in the editor window.
Your objects (loaded datasets, variables, functions, etc.) are contained in your “current workspace”, which can be saved at any time. In RStudio: Session -> Load Workspace/Save Workspace As….
Remember: Keep it tidy! Keep separate projects (code, data files) in separate workspaces/directories.
You can assign numbers and lists of numbers (vectors) to a variable. Assignment is specified with the “<-” or “=” symbol. Both assign the value on the right-hand side (RHS) to the object on the left-hand side (LHS). There are some subtle differences between them, but most of the time they are equivalent. I highly recommend using “<-” for assignment and “=” for the arguments of a function (may explain later).
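One of the subtle differences can be sketched as follows, assuming a fresh workspace with no variable called `times`. Inside a function call, “=” only names an argument, while “<-” also creates a variable. The function `rep()` and the helper names `r1`, `r2`, `created_by_equals`, and `created_by_arrow` are chosen here purely for illustration.

```r
# "=" inside a call only says which argument gets the value
r1 <- rep(1, times = 3)
created_by_equals <- "times" %in% ls()   # FALSE: no variable "times" was created

# "<-" inside a call first creates a variable "times", then passes its value
r2 <- rep(1, times <- 3)
created_by_arrow <- "times" %in% ls()    # TRUE: "times" now exists in the workspace

r1
r2
```

Both calls return the same vector, but only the second one leaves an extra object behind, which is one reason to keep “=” for function arguments.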
Here we define two variables \(x = 10\) and \(y = 5\), then calculate \(x+y\). Type the following code in the editor and run it line by line. To run a line of code, move the cursor to that line and press Ctrl+Enter (Command+Enter on Mac). If you want to run multiple lines of code, simply highlight those lines and use the same command. (Note that you can put a # in front of a line to write a comment in the code.)
x <- 10
y = 5
x+y
## [1] 15
After you run the code, what did you find in the Global Environment (Workspace) window?
In RStudio, you can view every variable you defined in the Global Environment (Workspace) window, along with other objects such as imported datasets in the Workspace panel. You can use R as an over-qualified calculator. Try the following commands. You already have \(x, y\) defined. Then you can calculate \(log(x)=\)
log(x)
## [1] 2.302585
\(exp(y)=\)
exp(y)
## [1] 148.4132
\(cos(x)=\)
cos(x)
## [1] -0.8390715
The log, exp, and cos operators are functions in R. They take inputs (also called arguments) in parentheses and give outputs.
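A function can take more than one argument. As a small sketch, `log()` also accepts a `base` argument. Arguments can be given by position or by name, and an argument you leave out falls back to its default value. The variable names below are illustrative only.

```r
# Arguments matched by position: log of 100 with base 10
by_position <- log(100, 10)

# The same call with arguments matched by name, using "=" inside the parentheses
by_name <- log(x = 100, base = 10)

# Leaving out "base" uses the default, which is the natural logarithm
natural <- log(100)

by_position
by_name
natural
```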
You can also run logical operations, such as \(x == y, x > y\):
x == y
## [1] FALSE
x > y
## [1] TRUE
Exercise:
Economic Order Quantity Model: \(Q= \sqrt{2DK/h}\)
- D=5000: annual demand quantity
- K=$4: fixed cost per order
- h=$0.5: holding cost per unit
- Q=?
This section covers four types of data structure in R: vector, matrix, data frame, and list.
To assign a list of numbers (a vector) to a variable, use the c() function with the numbers separated by commas. As an example, we can create a new variable called “z”, which will contain the numbers 3, 5, 7, and 9.
# Define numerical vector z
z<- c(3,5,7,9)
# Define character vector zz, where numerical operations cannot be directly applied.
zz<- c("cup", "plate", "pen", "paper")
#Average
mean(z)
## [1] 6
#Standard deviation
sd(z)
## [1] 2.581989
#Median
median(z)
## [1] 6
#Max
max(z)
## [1] 9
#Min
min(z)
## [1] 3
#Summary Stats
summary(z)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##     3.0     4.5     6.0     6.0     7.5     9.0
Elementwise operations for single vector or vectors:
z
## [1] 3 5 7 9
z+2
## [1] 5 7 9 11
z/10
## [1] 0.3 0.5 0.7 0.9
# define vector z1
z1 <- c(2,4,6,8)
# Elementwise operations (must be the same length)
z+z1
## [1] 5 9 13 17
z*z1
## [1] 6 20 42 72
Combining multiple vectors with c() still gives a vector.
# define vector z2
z2 <- c(z, z1)
z2
## [1] 3 5 7 9 2 4 6 8
How to extract the second entry of vector z2=(3,5,7,9,2,4,6,8)?
z2[2]
## [1] 5
How to extract all elements greater than 3 from vector z2?
z2[z2>3]
## [1] 5 7 9 4 6 8
How to extract all elements greater than 3 and smaller than 6 from vector z2?
z2[z2>3 & z2<6]
## [1] 5 4
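To see why this kind of extraction works, it can help to look at the condition on its own. In the sketch below, the comparison produces a logical vector with one TRUE/FALSE per element, and the square brackets keep only the positions that are TRUE. The name `keep` is introduced here just for illustration.

```r
z2 <- c(3, 5, 7, 9, 2, 4, 6, 8)

# The condition alone is a logical vector, one TRUE/FALSE per element of z2
keep <- z2 > 3 & z2 < 6
keep

# which() reports the positions of the TRUE entries
which(keep)

# Indexing with the logical vector keeps only those positions
z2[keep]
```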
How to order the vector z2 from smallest to largest?
z2[order(z2)]
## [1] 2 3 4 5 6 7 8 9
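Note that order() returns positions rather than values, which is why it is used inside the square brackets. A short sketch of the difference, with `idx` as an illustrative name:

```r
z2 <- c(3, 5, 7, 9, 2, 4, 6, 8)

# order() gives the positions of the elements from smallest to largest
idx <- order(z2)
idx

# Using those positions as an index rearranges the vector
z2[idx]

# sort() returns the sorted values directly
sort(z2)
```

The positions from order() are especially handy when you want to rearrange one vector according to the values of another.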
Exercise:
- What is the dot product (inner product) of z and z1?
- Find the elements of z2 that are smaller than 3 or greater than 7.
A matrix is a table of numbers (or strings). Here \(A\) is a matrix with 2 rows and 3 columns.
z = c(3,5,7,9)
A = matrix(data = c(1,2,3,4,5,6), nrow = 2)
matrix() is a function that creates a matrix from a given vector. Some of the arguments of a function can be optional. For example, you can also add the ncol argument, which is unnecessary in this situation.
A <- matrix(data = c(1,2,3,4,5,6), nrow = 2, ncol = 3)
Another way to write the function is to ignore the argument names and just put arguments in the right order, but this may cause confusion for readers.
A <- matrix(c(1,2,3,4,5,6), 2, 3)
Question: What would happen if you specified ncol=2 or ncol=4?
By default, the numbers of a vector are placed into the matrix by column (from top to bottom), but you can fill the matrix by row using the additional argument byrow=TRUE.
# Stored under a new name (B) so that A keeps its 2 x 3 shape for the examples that follow
B <- matrix(data = z2, nrow = 4, ncol = 2, byrow = TRUE)
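The effect of byrow is easiest to see side by side. The following sketch fills the same six numbers into a 2 x 3 matrix both ways; the names `by_col` and `by_row` are illustrative only.

```r
# Default: fill column by column (top to bottom, then the next column)
by_col <- matrix(1:6, nrow = 2)
by_col

# byrow = TRUE: fill row by row (left to right, then the next row)
by_row <- matrix(1:6, nrow = 2, byrow = TRUE)
by_row
```

Here the first row of `by_col` is 1, 3, 5, while the first row of `by_row` is 1, 2, 3.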
Elementwise operations for matrices:
A
##      [,1] [,2] [,3]
## [1,]    1    3    5
## [2,]    2    4    6
A+2
##      [,1] [,2] [,3]
## [1,]    3    5    7
## [2,]    4    6    8
Dimension
# Dimensions of A
dim(A)
## [1] 2 3
Transpose and Multiplication
# Transpose
t(A)
##      [,1] [,2]
## [1,]    1    2
## [2,]    3    4
## [3,]    5    6
# Matrix multiplication is possible if and only if the number of columns of the first matrix equals the number of rows of the second
t(A) %*% A
##      [,1] [,2] [,3]
## [1,]    5   11   17
## [2,]   11   25   39
## [3,]   17   39   61
# New matrix with the same dimension as A (2*3): every entry of A doubled
A2 <- A * 2
# Matrix calculation should satisfy the rules of matrix algebra
A + A2
##      [,1] [,2] [,3]
## [1,]    3    9   15
## [2,]    6   12   18
Question: What would happen if you run A %*% A2?
How to extract the second entry of second row from matrix A?
A[2,2]
## [1] 4
How to extract the first row from matrix A?
A[1, ]
## [1] 1 3 5
How to extract the first two columns from matrix A?
A[,1:2]
##      [,1] [,2]
## [1,]    1    3
## [2,]    2    4
Exercise:
- What are the diagonal elements of t(A) %*% A?
Data frames in R are the “datasets”, that is, tables of data with each row as an observation and each column as a variable. Data frames have column names (variable names) and row names.
You can use data.frame() to transform a matrix into a dataframe. Most of the time you will import a text file as a data frame or use one of the example datasets that come with R.
mydf <- data.frame(A)
class(mydf)
## [1] "data.frame"
Use the read.table or read.csv function to import comma/space/tab delimited text files. You can also use the Import Dataset Wizard in RStudio. The package “readxl” allows you to read xls/xlsx files. First, download the storks datasets (storks.csv and storks.txt files) and save them into your working directory.
mydata_csv <- read.csv("storks.csv", header=TRUE)
mydata_txt <- read.table("storks.txt", header=TRUE, sep = "\t")
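If you do not have the storks files at hand, the same import step can be tried with a file you create yourself. The sketch below is only a stand-in for the storks data: it writes the built-in cars dataset to a temporary CSV file and reads it back with read.csv. The names `tmp_file` and `cars_from_file` are made up for this example.

```r
# Write the built-in cars data to a temporary CSV file (stand-in for storks.csv)
tmp_file <- tempfile(fileext = ".csv")
write.csv(cars, file = tmp_file, row.names = FALSE)

# Read it back; header = TRUE uses the first line of the file as variable names
cars_from_file <- read.csv(tmp_file, header = TRUE)
dim(cars_from_file)
names(cars_from_file)

# Clean up the temporary file
unlink(tmp_file)
```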
#Load cars dataset that comes with R (50 obs, 2 variables)
data(cars)
#Dimension
dim(cars)
#Preview the first few rows
head(cars)
#Variable names
names(cars)
#Summary
summary(cars)
#Structure
str(cars)
Subsetting elements from data frames is similar to subsetting from matrices. In addition, since data frames have variable names (a label for each column), you can also refer to the variables of a data frame by name.
In RStudio, hitting tab after df$ allows you to select/autocomplete variable names in df.
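As a minimal sketch, two common ways to refer to a variable by name are the `$` operator and square brackets with the variable name in quotes. That these are the two ways intended here is an assumption; the example uses the built-in cars dataset.

```r
# Way 1: the $ operator followed by the variable name
speed_dollar <- cars$speed

# Way 2: square brackets with the variable name as the column index
speed_bracket <- cars[, "speed"]

# Both return the same vector
identical(speed_dollar, speed_bracket)
```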
Add new variable to the data
#First 2 obs of the variable dist in cars
cars$dist[1:2]
## [1] 2 10
cars1<- cars
cars1$time<- cars$dist/cars$speed
Drop variable time
# since "time" is the third column, we can do
cars2<- cars1[,-3]
# we can also drop "time" by keeping the other two variables
cars3<- cars1[c("speed", "dist")]
A list is a container. You can put different types of objects into a list to collect everything you have in hand in one place.
mylist<- list(myvector=z, mymatrix=A, mydata=cars)
The output of most R functions is a list that contains several objects.
# Load the cars dataset that comes with R
data(cars)
#fit a simple linear regression between braking distance and speed
lm(dist~speed, data=cars)
##
## Call:
## lm(formula = dist ~ speed, data = cars)
##
## Coefficients:
## (Intercept)        speed  
##     -17.579        3.932
There are three ways to get an element from a list: by position, by name inside double square brackets, or by name after the $ sign.
Note that you use double square brackets for indexing a list.
reg <- lm(dist~speed, data = cars)
reg[[1]]
reg[["coefficients"]]
reg$coefficients
If you have done object-oriented programming before, the list “reg” is actually an object that belongs to class “lm”. The element names such as “coefficients” are fields of the “lm” class.
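You can check this yourself. The sketch below refits the same regression and inspects the result with class(), is.list(), and names(). It also uses coef(), an extractor function from base R that returns the same values as the “coefficients” element.

```r
reg <- lm(dist ~ speed, data = cars)

# The fitted model is a list with class "lm"
class(reg)
is.list(reg)

# All element names stored in the object
names(reg)

# coef() extracts the same values as reg$coefficients
coef(reg)
```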
Define a vector with values (5, 2, 11, 19, 3, -9, 8, 20, 1). Calculate the sum, mean, and standard deviation.
Re-order the vector from largest to smallest, and make it a new vector.
Convert the vector to a 3*3 matrix ordered by column. What is the sum of first column? What is the number in column 2 row 3? What is the column sum?
Use the following code to load the CustomerData into R.
customer <- read.csv(file = "https://yanyudm.github.io/Data-Mining-R/lecture/data/CustomerData.csv")
- How many rows and columns are there?
- Extract all variable names.
- What is the average “Debt to Income Ratio”?
- What is the proportion of “Married” customers?
A Simple Scatter Plot
plot(cars)
Types of distributions: norm, binom, beta, cauchy, chisq, exp, f, gamma, geom, hyper, lnorm, logis, nbinom, t, unif, weibull, wilcox
Four prefixes:
‘d’ for density (PDF)
‘p’ for distribution (CDF)
‘q’ for quantile (percentiles)
‘r’ for random generation (simulation)
dbinom(x=4,size=10,prob=0.5)
## [1] 0.2050781
pnorm(1.86)
## [1] 0.9685572
qnorm(0.975)
## [1] 1.959964
rnorm(10)
## [1] -0.1391552 -0.3980465 2.1242855 -1.5065391 -1.1990397 0.5159876
## [7] -1.3993646 -1.0226974 -0.2958490 0.3470557
rnorm(n=10,mean=100,sd=20)
## [1] 90.34530 93.47035 108.45112 114.25664 103.72400 108.84396 91.33399
## [8] 104.24333 102.97761 87.60985
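The four prefixes are closely related, which the following sketch tries to show: ‘q’ undoes ‘p’, ‘d’ for a discrete distribution is the probability of exactly one value, and ‘r’ gives different numbers on every run unless you fix the random seed with set.seed(). The seed value 123 and the variable names are arbitrary choices made for this example.

```r
# 'q' and 'p' are inverses: the 97.5% quantile has 97.5% of the area below it
q975 <- qnorm(0.975)
p_back <- pnorm(q975)
p_back

# 'd' for a binomial is P(X = 4), which matches the textbook formula
d_value <- dbinom(x = 4, size = 10, prob = 0.5)
by_formula <- choose(10, 4) * 0.5^10
c(d_value, by_formula)

# 'r' draws are random; setting the same seed reproduces the same draws
set.seed(123)
first_draw <- rnorm(3)
set.seed(123)
second_draw <- rnorm(3)
identical(first_draw, second_draw)
```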
That is great! Go R!
You may have trouble in the rest of the semester…, so please try to get used to it!