Bryan Clair
November 16, 2017
Downloading R is unpleasantly difficult. Go here:
Pick your OS at the top, then:
Downloading RStudio is simpler. Go here:
https://www.rstudio.com/products/rstudio/download/
and scroll down to get the installer for your platform.
Built-in data: the iris dataset. Fisher/Anderson 1935-6.
iris
?iris
str(iris)
head(iris)
summary(iris)
Grammar of Graphics, L. Wilkinson 2005
Implemented by Hadley Wickham
install.packages("ggplot2")
library(ggplot2)
ggplot(iris, aes(x = Petal.Width, y = Petal.Length)) +
  geom_point()
ggplot(iris, aes(x = Petal.Width, y = Petal.Length, color = Species)) +
  geom_point()
Try:
geom_jitter()
geom_density2d()
geom_bin2d()
Make a histogram of Petal.Length
geom_histogram()
geom_freqpoly()
geom_density()
color aesthetic
fill aesthetic
Plot Species on the x aesthetic with Petal.Length as y.
geom_point()
geom_boxplot()
geom_violin()
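One possible take on the two exercises above, offered as a sketch rather than the intended answer: a histogram of Petal.Length filled by Species, and a boxplot with Species on x and Petal.Length on y. The binwidth of 0.25 and the names p_hist and p_box are arbitrary choices for illustration, not from the slides.

```r
library(ggplot2)

# Histogram of Petal.Length; the fill aesthetic colors bars by Species.
# binwidth = 0.25 is an arbitrary illustrative choice.
p_hist <- ggplot(iris, aes(x = Petal.Length, fill = Species)) +
  geom_histogram(binwidth = 0.25)

# Species on the x aesthetic, Petal.Length as y.
p_box <- ggplot(iris, aes(x = Species, y = Petal.Length)) +
  geom_boxplot()

print(p_hist)
print(p_box)
```

Swapping geom_boxplot() for geom_violin() or geom_point() needs no other change, since the geom is the only layer.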
Built-in data sets:
chickwts
mtcars
quakes
diamonds (actually part of ggplot2)
From HistData library:
Prostitutes (try geom_line())
Galton (try geom_smooth())
Minard
library(HistData)  # provides Minard.cities and Minard.troops
ggplot() +
  geom_text(data = Minard.cities, aes(x = long, y = lat, label = city)) +
  geom_point(data = Minard.troops, mapping = aes(x = long, y = lat, size = survivors, color = direction))
McDonald's locations in Missouri:
mcd <- read.csv("http://math.slu.edu/~clair/stat3850/data/mcdonalds-mo.csv")
ggplot(mcd, aes(x=long,y=lat)) + geom_point()
install.packages("ggmap")
library(ggmap)
momap <- get_map("Missouri", zoom=7)
ggmap(momap)
momap <- get_map("Jefferson City, MO", zoom=7)
ggmap(momap) + geom_point(data = mcd, aes(x = long, y = lat),
                          shape = 21, fill = "red", color = "gold")
Try:
quakes data, near Fiji
maptype = "satellite" when getting the map
install.packages("dplyr")
library(dplyr)
Simple verbs connected by pipes: %>%.
Data moves left to right through the pipeline.
mtcars %>% filter(cyl==4) %>% arrange(qsec)
mpg cyl disp hp drat wt qsec vs am gear carb
1 26.0 4 120.3 91 4.43 2.140 16.70 0 1 5 2
2 30.4 4 95.1 113 3.77 1.513 16.90 1 1 5 2
3 30.4 4 75.7 52 4.93 1.615 18.52 1 1 4 2
4 21.4 4 121.0 109 4.11 2.780 18.60 1 1 4 2
5 22.8 4 108.0 93 3.85 2.320 18.61 1 1 4 1
6 27.3 4 79.0 66 4.08 1.935 18.90 1 1 4 1
7 32.4 4 78.7 66 4.08 2.200 19.47 1 1 4 1
8 33.9 4 71.1 65 4.22 1.835 19.90 1 1 4 1
9 24.4 4 146.7 62 3.69 3.190 20.00 1 0 4 2
10 21.5 4 120.1 97 3.70 2.465 20.01 1 0 3 1
11 22.8 4 140.8 95 3.92 3.150 22.90 1 0 4 2
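To make "left to right" concrete, here is a small sketch comparing the pipeline with the equivalent nested calls. The names piped and nested are illustrative only.

```r
library(dplyr)

# Pipeline form: read left to right.
piped <- mtcars %>% filter(cyl == 4) %>% arrange(qsec)

# Equivalent nested form: read from the inside out.
nested <- arrange(filter(mtcars, cyl == 4), qsec)

identical(piped, nested)  # TRUE
nrow(piped)               # 11 four-cylinder cars
```

The pipe passes its left-hand side as the first argument of the function on its right, which is why both forms give the same result.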
filter(): filter certain rows of the data
select(): select certain columns of the data
arrange(): arrange rows in order
group_by(): organize data into groups of rows
summarize(): compute summary statistics of groups
mtcars %>% group_by(cyl) %>% summarize(mpg.avg = mean(mpg))
# A tibble: 3 x 2
cyl mpg.avg
<dbl> <dbl>
1 4 26.66364
2 6 19.74286
3 8 15.10000
hp
mpg of automatics (am=0) for each number of cylinders.
cyl and carb
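The exercise text here is incomplete, so the following is only a sketch under assumptions: it reads the second item as "mean mpg of automatics for each number of cylinders" and the third as grouping by both cyl and carb. The names auto_mpg, by_cyl_carb and n.cars are illustrative.

```r
library(dplyr)

# Assumed reading: mean mpg of automatics (am == 0) per cylinder count.
auto_mpg <- mtcars %>%
  filter(am == 0) %>%      # keep automatics only
  group_by(cyl) %>%        # one group per cylinder count
  summarize(mpg.avg = mean(mpg), n.cars = n())

auto_mpg

# Assumed reading: group_by() accepts several columns at once,
# giving one group per (cyl, carb) combination present in the data.
by_cyl_carb <- mtcars %>%
  group_by(cyl, carb) %>%
  summarize(n.cars = n())

by_cyl_carb
```

Reporting n() next to each mean shows how many cars stand behind each average.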
The data set movies consists of 100,000 individual movie ratings selected from a much larger data set, MovieLens, freely available from GroupLens Research.
movies <- read.csv("http://stat.slu.edu/~speegle/_book_data/movieLensData")
What is the movie with the highest rating that has been rated at least 30 times?
movies %>%
  group_by(Title) %>%
  summarize(meanRating = mean(Rating), numRating = n()) %>%
  filter(numRating >= 30) %>%
  arrange(desc(meanRating))
# A tibble: 874 x 3
Title meanRating numRating
<fctr> <dbl> <int>
1 Shawshank Redemption, The (1994) 4.438953 344
2 Godfather, The (1972) 4.425439 228
3 Wallace & Gromit: A Close Shave (1995) 4.415663 83
4 Schindlers List (1993) 4.396491 285
5 Usual Suspects, The (1995) 4.393281 253
6 Godfather: Part II, The (1974) 4.385135 148
7 Lawrence of Arabia (1962) 4.371622 74
8 Big Night (1996) 4.363636 33
9 City of God (Cidade de Deus) (2002) 4.340909 44
10 Sting, The (1973) 4.337079 89
# ... with 864 more rows
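Why filter on the number of ratings? A tiny made-up example (the toy data frame below is invented for illustration and is not part of MovieLens) shows the same pipeline shape with the threshold lowered from 30 to 2. Title "C" has a perfect rating but only one vote, so it is dropped before ranking.

```r
library(dplyr)

# Made-up ratings, only to illustrate the shape of the pipeline.
toy <- data.frame(
  Title  = c("A", "A", "A", "B", "B", "C"),
  Rating = c(5, 4, 3, 5, 5, 5)
)

toy_top <- toy %>%
  group_by(Title) %>%
  summarize(meanRating = mean(Rating), numRating = n()) %>%
  filter(numRating >= 2) %>%    # threshold lowered from 30 for the toy data
  arrange(desc(meanRating))

toy_top  # B (mean 5, 2 ratings), then A (mean 4, 3 ratings)
```

Without the filter step, movies rated once or twice would crowd the top of the list.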
Think of a question you could ask about the movie data set and answer it! Here are some problems we give our class:
install.packages("Lahman")
library(Lahman)
?Batting
?Pitching
?Master
install.packages("WHO")
library(WHO)
codes <- get_codes()
head(codes)
infant <- get_data("MDG_0000000001")
infant %>%
  filter(country == "Peru") %>%
  ggplot(aes(x = year, y = value, color = worldbankincomegroup)) +
  geom_line()