Taste of R: An Introduction
Introduction
R is a statistical programming language and environment. It is open source and available on most platforms. R is not a replacement for Java, C, Perl, Python or other general-purpose languages; it is a specialized tool for data calculation, manipulation and graphing.
If you have a programming background, R can be a great replacement for Excel: a powerful, programmable tool for those willing to put the time in. See the R Project site for downloads and more information.
This article walks through examples of how you can use R in place of Excel. It may not be for everyone, but if you’re like me and love the command line and vim, you’ll love R.
Quick Examples
# variable assignment
list <- c(1,3,6)
mean(list)
[1] 3.33 - result
max(list)
[1] 6 - result
sum(list)
[1] 10 - result
# add each element of the lists together
list2 <- c(2,4,8)
list + list2
[1] 3 7 14
# create sequence
s <- seq(0,10, by=0.5)
plot(s)
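The same vectors support indexing and element-wise arithmetic, which covers a lot of everyday spreadsheet work. The following is a minimal sketch using base R only; the variable name v is my own choice, picked so it does not shadow R's built-in list function.

```r
# Vectors are indexed from 1, and arithmetic is applied element by element
v <- c(1, 3, 6)
v[2]                 # second element
v * 2                # every element doubled
big <- v[v > 2]      # logical indexing: keep only elements greater than 2
big

# seq() includes both endpoints when the step divides the range evenly
s <- seq(0, 10, by = 0.5)
length(s)            # 21 values: 0, 0.5, ..., 10
summary(s)           # min, quartiles, mean and max in one call
```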
Reading Data
There are numerous ways to get data into R: reading from text files, querying databases, or even fetching directly from the internet. Data sets used: pageviews.data, weight.data, movies2010.csv
Read from Text Files
# read in table, single column of data
data <- read.table("pageviews.data")
data[1:5,] # verify
# plot as timeseries
plot(ts(data))
# read table, specify delimiter
weight <- read.table("weight.data", sep="|", header=TRUE)
plot(weight$Date, weight$Weight)
# read in csv file
data <- read.csv(file="movies2010.csv")
library(ggplot2)
p <- ggplot(data, aes(Box.Office, Rating))
p + geom_point() +
scale_x_continuous(
breaks=c(100000000,200000000,300000000,400000000),
labels=c("100M", "200M", "300M","400M"))
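If you don't have the data files handy, you can still try read.table on a file you create yourself. This is a sketch under stated assumptions: the column names mirror the weight.data call above, but the rows are made-up illustration values and the file is a temporary one.

```r
# Write a tiny pipe-delimited file so the example is self-contained.
# The dates and weights are invented for illustration.
tmp <- tempfile(fileext = ".data")
writeLines(c("Date|Weight",
             "2011-01-01|180.5",
             "2011-01-08|179.0",
             "2011-01-15|178.2"), tmp)

# stringsAsFactors=FALSE keeps Date as plain text on older R versions
weight <- read.table(tmp, sep = "|", header = TRUE, stringsAsFactors = FALSE)

str(weight)                           # check column types before plotting
weight$Date <- as.Date(weight$Date)   # convert text to real dates
head(weight, 5)                       # same idea as weight[1:5, ]
```

Converting the Date column before plotting lets R treat the x axis as a time scale rather than as arbitrary labels.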
Load from Database
library(RMySQL)
con <- dbConnect(dbDriver('MySQL'),
user='demo',
password='demo',
host='localhost',
dbname='baseball')
resultSet <- dbSendQuery(con,
"SELECT W,Attendance,yearID,name
FROM teams
WHERE (yearID between 1990 and 2010)
AND franchID = 'SFG' ")
stats <- fetch(resultSet, n=-1)
# plot Wins vs. Attendance
library(ggplot2)
p <- ggplot(stats, aes(x=W, y=Attendance, label=yearID))
p + geom_point() + geom_text(hjust=0.2, vjust=-0.5, size=2.6)
Read from Internet
# grab stock quote data from Yahoo
etfc <- read.csv(paste("http://ichart.finance.yahoo.com/table.csv?",
"s=ETFC", "&g=m", sep=""))
# save to file (so we don't need to fetch again)
save(etfc, file="etfc.RData")
# read in from file; load() restores etfc by name, so no assignment is needed
load("etfc.RData")
# verify, show first 5 rows
etfc[1:5,]
# simple plot
plot(etfc$Date, etfc$Close)
# better plotting
library("ggplot2")
qplot(as.Date(Date, "%Y-%m-%d"), Close,
data=etfc, geom="line",
xlab="", ylab="",
colour=I("steelblue4"), fill=I("steelblue4"))
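Two details of save() and load() are easy to trip over: save() needs its file argument passed by name, and load() puts the objects back into the workspace under their original names, returning only those names. Here is a small round trip to illustrate; the quotes data frame and its values are invented stand-ins, not real stock data.

```r
# Stand-in data frame; the values are invented for illustration only
quotes <- data.frame(Date  = c("2011-01-03", "2011-02-01"),
                     Close = c(16.5, 17.2))

path <- tempfile(fileext = ".RData")
save(quotes, file = path)   # the file= argument must be named

rm(quotes)                  # pretend this is a fresh session
restored <- load(path)      # load() recreates `quotes` in the workspace
restored                    # a character vector of the names that were restored
quotes[1:2, ]               # the data itself is back under its old name
```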
Graphing: ggplot
ggplot is a powerful graphing library whose graphs are a bit better looking than R defaults. I tend to use ggplot whenever I can, but will show both methods for this tutorial. Additionally, ggplot has a powerful theme system you can use for consistent colors and styles.
x <- seq(-3, 3, by=0.1)
y <- sin(x)
# normal plot
plot(x,y)
# save a standard plot
png("standard.png")
plot(x,y)
dev.off()
# better plotting
library("ggplot2")
qplot(x,y)
qplot(x,y, geom="line", colour=I("steelblue4"))
# maps
library(maps)
states <- data.frame(map("state", plot=FALSE)[c("x","y")])
(usamap <- qplot(x, y, data=states, geom="path"))
# save with ggplot
ggsave(file="sin.png")
More ggplot2 examples
Bar Chart Side-by-Side
A basic bar chart. The data shown are the percentages of time spent on each media type compared with the share of advertising dollars spent.
m <- matrix(c(8,27,16,11,43,43,25,19,8,0.5), nrow=2)
colnames(m) <- c("Print","Radio","TV","Internet","Mobile")
barplot(m, beside=T)
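The chart is easier to read with a legend. The sketch below assumes the first row of the matrix is time spent and the second is advertising spend, which is the order the description lists them in; the row labels are my own wording.

```r
m <- matrix(c(8,27,16,11,43,43,25,19,8,0.5), nrow = 2)
colnames(m) <- c("Print","Radio","TV","Internet","Mobile")
# Assumption: row 1 is share of time spent, row 2 is share of ad dollars
rownames(m) <- c("Time spent", "Ad spending")

barplot(m, beside = TRUE,
        legend.text = TRUE,                  # legend labels come from rownames(m)
        ylab = "Percent",
        args.legend = list(x = "topright"))  # where to place the legend
```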
If you want two different charts next to each other, use grid.arrange from gridExtra (td here is a data frame with Year, Tablets and PC columns):
library('gridExtra')
plot1 <- ggplot(td, aes(Year, Tablets)) + geom_bar(stat="identity")
plot2 <- ggplot(td, aes(Year, PC)) + geom_bar(stat="identity")
grid.arrange(plot1, plot2, ncol=2)
Plotting Two Lines on Same Graph
An example of plotting two lines on the same graph:
# setup data
x <- seq(0, 50, 1)
supply <- x * -2 + 100
demand <- x * 2
df <- data.frame(x=x, supply=supply, demand=demand)
library(ggplot2)
ggplot(df, aes(x)) +
geom_line(aes(y=supply, colour="supply")) +
geom_line(aes(y=demand, colour="demand")) +
labs(title='')
Cycle Graph
A cycle graph is an interesting way to look at cyclic data such as the weekly pageviews of a web site. Typically, a web site's traffic dips sharply on weekends, which can make it difficult to see what other patterns might be occurring.
Here’s an example: the graph on the left is a standard linear graph; on the right is the same data plotted as a cycle graph, the cycle being the days of the week. You immediately notice that Wednesday dips while the other days are mostly up.
Here’s how the above graphs were created, using pageviews.data:
# read in table
data <- read.table("pageviews.data")
# plot as normal timeseries, difficult to see
plot(ts(data))
# cycle plot
monthplot(ts(data, start=1, frequency=7))
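If you don't have pageviews.data, you can get a feel for the cycle plot with simulated traffic. This is only a sketch: the numbers are generated, not real pageviews, and I assume the series starts on a Monday so that the day labels line up.

```r
set.seed(1)                                 # reproducible noise
days    <- 7 * 8                            # eight full weeks
weekday <- rep(1:7, times = 8)              # 1 = Monday ... 7 = Sunday
# Made-up traffic: about 1000 views on weekdays, about 600 on weekends
views   <- ifelse(weekday >= 6, 600, 1000) + rnorm(days, sd = 40)

pv <- ts(views, start = 1, frequency = 7)   # frequency=7 marks the weekly cycle

# One small panel per day of the week, each showing that day across all weeks
monthplot(pv, labels = c("Mon","Tue","Wed","Thu","Fri","Sat","Sun"),
          ylab = "Pageviews")
```

The horizontal line in each panel is that day's average, so a weekday that drifts down over time stands out even though the weekend dip dominates the ordinary time series plot.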
Graph: Axis, Labels and Legends
To add axis labels to the chart, use xlab and ylab. Here is the stock quote example from above with labels:
qplot(as.Date(Date, "%Y-%m-%d"), Close,
data=etfc, geom="line",
xlab="Year", ylab="Stock Price",
colour=I("steelblue4"), fill=I("steelblue4")
)
Using ggplot
ggplot(df, aes(x)) +
geom_line(aes(y=supply, colour="supply"), size=2) +
geom_line(aes(y=demand, colour="demand"), size=2) +
scale_x_continuous('') +
scale_y_continuous('')
Manipulating Data
# data entry
x <- c(1,2,3)
# more data entry, using stdin (keyboard)
x <- scan()
1:
sort(x)
# diff command
# cumulative total of mail sent
mailings <- c(12345, 23432, 36765, 49567, 60234)
diff(mailings)
# tabulate data
survey <- c("a", "b", "b", "b", "c", "a", "a", "c", "c", "b")
table(survey)
# random numbers
runif(5, 1, 10) # pick 5 random numbers between 1 and 10
sample(1:10, 5, replace=F)
sample(1:10, 5, replace=T)
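To make the diff and table commands more concrete, here is a short sketch that works through the same mailings and survey vectors. It uses base R only; per_period and counts are names I picked for the intermediate results.

```r
mailings <- c(12345, 23432, 36765, 49567, 60234)   # running totals
per_period <- diff(mailings)                        # amount added at each step
per_period
# [1] 11087 13333 12802 10667

# cumsum() is the inverse: rebuild the running totals from the differences
cumsum(c(mailings[1], per_period))

survey <- c("a", "b", "b", "b", "c", "a", "a", "c", "c", "b")
counts <- table(survey)
counts[["b"]]                     # count for a single answer
prop.table(counts)                # shares instead of counts
sort(counts, decreasing = TRUE)   # most common answer first
```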
Linear Regression
# setup data
x <- seq(1, 10, by=0.5)
# create random data for Y
y <- seq(10, 46, by=2) + runif(19, 1, 10)
lm(y ~ x) # linear model equation
plot(x, y) # plot data
abline(lm(y ~ x)) # add regression line to graph
# using ggplot
library(ggplot2)
p <- qplot(x, y)
p + stat_smooth(method="lm",size=1)
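Beyond drawing the line, you will usually want the fitted numbers themselves. The sketch below reuses the same x and y construction, adds set.seed so the random noise is repeatable (the seed value is arbitrary), and pulls out the coefficients and a couple of predictions.

```r
set.seed(42)                                 # make the random noise repeatable
x <- seq(1, 10, by = 0.5)
y <- seq(10, 46, by = 2) + runif(19, 1, 10)

fit <- lm(y ~ x)
coef(fit)                                    # intercept and slope
summary(fit)$r.squared                       # share of variance explained

# Predict y for new x values; newdata must use the same column name as the model
pred <- predict(fit, newdata = data.frame(x = c(11, 12)))
pred
```

Since y rises by 2 for every 0.5 step in x before the noise is added, the fitted slope should land near 4.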
Programming
# packages (cran)
install.packages("ggplot2")
library("ggplot2")
# conditionals
c <- 42
x <- if (c == 42) 3.14 else 2.71
# loops
teams <- c("BAL","BOS","CHW","CLE","DET")
for (team in teams) {
print(paste("Hello", team, sep=" ")) # string concat
}
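Loops work fine, but many R functions already operate on whole vectors, so the loop is often optional. A minimal sketch of the vectorized equivalents follows; the scores vector is a made-up example.

```r
teams <- c("BAL","BOS","CHW","CLE","DET")

# paste() works on whole vectors, so no loop is needed
greetings <- paste("Hello", teams)   # sep=" " is the default
greetings

# ifelse() is the vector version of if/else
scores <- c(42, 7, 42)
picked <- ifelse(scores == 42, 3.14, 2.71)
picked
```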
Function Example
oddcount <- function(x) {
c <- 0
for(n in x) {
if (n %% 2 == 1)
c <- c + 1
}
return(c)
}
set <- c(3,4,5,6,9)
oddcount(set)
[1] 3
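The same count can be written without an explicit loop. This is an alternative sketch, not a replacement for the version above; oddcount_vec is a name I made up to keep the two apart.

```r
oddcount_vec <- function(x) {
  # x %% 2 == 1 yields TRUE/FALSE per element; sum() counts the TRUEs
  sum(x %% 2 == 1)
}

set <- c(3,4,5,6,9)
oddcount_vec(set)
# [1] 3
```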
Baseball Example
Batting Average Hack from Baseball Hacks by Joseph Adler
library(RMySQL)
library(lattice)
con <- dbConnect(dbDriver('MySQL'),
user='demo',
password='demo',
host='localhost',
dbname='baseball')
res 250");
batting <- fetch(res, n=-1);
attach(batting);
#Compute batting averages
AVG <- H/AB;
#Plot the charts
histogram(~ AVG | teamID)
densityplot(~ AVG | teamID)
Basketball Example
NBA example, plotting field goal percentage and rebounds versus wins.
# load data
data <- read.csv("nba.csv")
# lets look at just 2009
data2009 <- subset(data, year == 2009) # use conditional
library(ggplot2)
# theory: does field goal percentage predict wins?
p <- ggplot(data2009, aes(x=o_fgm/o_fga, y=won, label=X.team))
p + geom_point() + geom_text(hjust=0.2, vjust=-0.5, size=2.6)
# theory: do rebounds predict wins?
p <- ggplot(data2009, aes(x=o_reb, y=won, label=X.team))
p + geom_point() + geom_text(hjust=0.2, vjust=-0.5, size=2.6)
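If you don't have nba.csv, the subsetting step can be tried on a toy data frame. Everything below is a stand-in: the team codes and numbers are invented, and only the column names (year, o_fgm, o_fga, won) follow the example above.

```r
# Toy stand-in for nba.csv; teams and numbers are invented
nba <- data.frame(team  = c("AAA", "BBB", "CCC", "AAA"),
                  year  = c(2009, 2009, 2008, 2008),
                  o_fgm = c(3000, 3100, 2900, 2950),
                  o_fga = c(6500, 6600, 6400, 6450),
                  won   = c(50, 55, 30, 41))

s2009 <- subset(nba, year == 2009)   # keep rows matching the condition
same  <- nba[nba$year == 2009, ]     # equivalent logical indexing

# A derived column saves repeating o_fgm/o_fga inside every plot call
s2009$fg_pct <- s2009$o_fgm / s2009$o_fga
s2009

# A quick numeric check of the "field goal percentage vs. wins" idea
cor(nba$o_fgm / nba$o_fga, nba$won)
```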
New User Report Example
Another web site example, graphing new users signing up over the last 30 days. You’ll need your own data source.
library(RMySQL)
library(ggplot2)
source("~/.dbconns/prod_slave.R")
# grab data
res <- dbSendQuery(con,
"SELECT
DATE_FORMAT(dt_created, '%m/%d/%Y') as dt,
count(*) as c
FROM users
WHERE dt_created >= DATE_SUB(NOW(), INTERVAL 30 DAY)
GROUP BY dt
");
results <- fetch(res, n=-1);
p <- ggplot(results, aes(x=dt, y=c))
p + geom_bar(stat="identity")
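One thing to watch with this report: the query returns dt as text in month/day/year form, and text sorts alphabetically, so days can come out in the wrong order around a year boundary. Here is a sketch of parsing the column into real dates first. The rows are hypothetical, shaped like the query result, and I use base barplot to keep the example free of a database and extra packages.

```r
# Hypothetical rows shaped like the query result (dt as text, c as a count)
results <- data.frame(dt = c("12/30/2011", "01/02/2012", "12/31/2011"),
                      c  = c(14, 9, 11),
                      stringsAsFactors = FALSE)

# As text, "01/02/2012" sorts before "12/30/2011"; parsing to Date fixes that
results$dt <- as.Date(results$dt, format = "%m/%d/%Y")
results <- results[order(results$dt), ]
results

barplot(results$c, names.arg = format(results$dt, "%b %d"), ylab = "New users")
```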
Example Examples
Almost every package has built-in examples that show how to use it. Here are just a few; if you ever get stuck, check out the examples:
example(plot)
example(abline)
example(pie)
example(spline)
library(ggplot2)
example(qplot)
library(lattice)
example(histogram)
Data Sources
- Basketball data from Basketball Database
- Baseball data from Baseball Databank
Further Reading
- Introduction to R
- R Tutorial by Clarkson University
- R Fundamentals by Thomas Lumley
- R in a Nutshell by Joseph Adler [Book]
- Baseball Hacks by Joseph Adler [Book]