Bryan Clair and Darrin Speegle
September 28, 2017
Focus on Teaching and Technology Conference
Downloading R is unpleasantly difficult. Go here:
https://cran.r-project.org/
Pick your OS at the top, then follow the link to the latest installer.
Downloading RStudio is simpler. Go here:
https://www.rstudio.com/products/rstudio/download/
and scroll down to get the installer for your platform.
[Figure: the popularity of data science software. Source: The Popularity of Data Science Software, by Robert A. Muenchen]
Built-in help: ?command or help(command)
StackOverflow: https://stackoverflow.com/questions/tagged/r
Datacamp: https://www.datacamp.com/home
SimpleR: https://cran.r-project.org/doc/contrib/Verzani-SimpleR.pdf
Foundations of Stats with R: http://mathstat.slu.edu/~speegle/_book/
Workshop goal: play with these features
Install and load the HistData package.
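If HistData isn't installed yet, install it once from CRAN first (a quick sketch; only needed the first time):

install.packages("HistData") # one-time download from CRAN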
library(HistData)
head(Galton)
parent child
1 70.5 61.7
2 68.5 61.7
3 65.5 61.7
4 64.5 61.7
5 64.0 61.7
6 67.5 62.2
mean(Galton$parent)
[1] 68.30819
mean(Galton$child)
[1] 68.08847
A simple linear regression:
plot(Galton) # scatterplot of parent vs. child heights
model <- lm(child ~ parent, data=Galton) # fit the least-squares line
abline(model) # add the fitted line to the scatterplot
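To see the fitted intercept and slope numerically, a natural follow-up (our addition, standard base R):

coef(model) # intercept and slope of the fitted line
summary(model) # full summary: coefficients, standard errors, R-squared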
Try the same steps with the built-in faithful data set.
sample(1:6, 1) # roll one die
[1] 1
sample(1:6, 20, replace=TRUE) # roll 20 dice
[1] 6 1 6 4 2 5 5 3 4 1 4 6 3 1 6 6 6 1 6 3
sum(sample(1:6, 2, replace=TRUE)) # the total of two dice
[1] 4
dierolls <- replicate(10000, sum(sample(1:6, 2, replace=TRUE))) # 10,000 sums of two dice
table(dierolls)
dierolls
2 3 4 5 6 7 8 9 10 11 12
235 577 859 1125 1381 1662 1381 1098 849 547 286
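A quick picture of the simulated distribution (our addition, base R graphics):

barplot(table(dierolls)) # the familiar triangular shape for the sum of two dice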
The anyDuplicated() function, which reports whether a vector contains any repeated values, is handy in simulations like these.
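For example (a minimal illustration of the return value):

anyDuplicated(c(1, 2, 3)) # 0: no repeated values
anyDuplicated(c(1, 2, 2, 3)) # 3: the index of the first repeat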
R has support for all of the standard probability distributions.
runif(5) # 5 uniformly random numbers between 0 and 1
[1] 0.005813683 0.017561025 0.148594451 0.858384741 0.261783760
rnorm(5) # 5 standard normal random numbers
[1] -1.3272202 0.6986228 -0.4616152 1.4881083 -0.2382487
pnorm(1) - pnorm(-1) # area under normal curve within 1 sd of mean.
[1] 0.6826895
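Each distribution comes with four functions, prefixed r (random), d (density), p (cumulative probability), and q (quantile). Two more illustrations (ours):

qnorm(0.975) # about 1.96, the 97.5th percentile of the standard normal
dbinom(3, size = 10, prob = 0.5) # P(exactly 3 heads in 10 fair coin flips)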
What is the probability that the max of two uniformly random numbers on [0,1] is larger than 0.5?
Choose a point in the plane whose coordinates (x, y) are independent standard normal random variables. What is the expected distance from the origin?
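Both exercises yield to simulation. One possible sketch (the variable names are ours):

maxes <- replicate(10000, max(runif(2)))
mean(maxes > 0.5) # estimates P(max > 0.5); the exact answer is 3/4
dists <- sqrt(rnorm(10000)^2 + rnorm(10000)^2)
mean(dists) # estimates the expected distance; the exact answer is sqrt(pi/2)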
In this part of the workshop, we analyze a data set of movies using the dplyr package. The movies data set consists of 100,000 observations of the variables MovieID, Title, Genres, UserID, Rating and Timestamp.

library(dplyr)
movies <- read.csv("http://stat.slu.edu/~speegle/_book_data/movieLensData")
select() forms a new data frame with *select*ed columns.
arrange() forms a new data frame with row(s) *arrange*d in a specified order.
filter() forms a new data frame consisting of rows that satisfy certain *filter*ing conditions.
mutate() and transmute() allow you to create new columns out of a data frame. mutate() adds columns to the data frame, and transmute() creates a new data frame.
summarize() summarizes a data frame into a single row.
distinct() collapses identical observations into a single one.
group_by() groups the data to perform tasks by groups.

Some examples of the verbs in action:

select(movies, Rating, MovieID)
arrange(movies, Timestamp)
filter(movies, Rating == 5)
summarize(movies, mean(Rating))
distinct(movies, UserID)

group_by() really only works well when combined with other commands.

The utility of dplyr becomes more apparent when combining the various verbs using the pipe operator, %>%. The pipe operator allows you to feed the result of one expression as input to a function. For example, x %>% f(y) becomes f(x, y), and x %>% f(y) %>% g(z) becomes g(f(x, y), z).

Find the mean rating of Toy Story, which has MovieID 1.
filter(movies, MovieID == 1) %>%
summarize(mean(Rating))
mean(Rating)
1 3.875862
The filter()
command creates a data frame that consists solely of the observations of Toy Story, and summarize()
computes the mean rating.
List the movies that have an average rating of 5 stars. That is, each rating of the movie was 5 stars.
In order to do this, we will first use the group_by()
function to find the mean rating of each movie.
movies %>%
group_by(Title) %>%
summarize(meanRating = mean(Rating))
# A tibble: 6,200 x 2
Title meanRating
<fctr> <dbl>
1 ...All the Marbles (a.k.a. The California Dolls) (1981) 3.500000
2 ...And God Spoke (1993) 4.000000
3 ...And Justice for All (1979) 3.600000
4 *batteries not included (1987) 3.125000
5 10 (1979) 3.500000
6 10 Items or Less (2006) 3.000000
7 10 Rillington Place (1971) 4.250000
8 10 Things I Hate About You (1999) 3.450000
9 10 to Midnight (1983) 2.500000
10 101 Dalmatians (1996) 3.186047
# ... with 6,190 more rows
Next, we filter to keep those that have a mean rating of five stars.
movies %>%
group_by(Title) %>%
summarize(meanRating = mean(Rating)) %>%
filter(meanRating == 5)
# A tibble: 113 x 2
Title meanRating
<fctr> <dbl>
1 49 Up (2005) 5
2 5,000 Fingers of Dr. T, The (1953) 5
3 Anastasia (1956) 5
4 Ay, Carmela! (¡Ay, Carmela!) (1990) 5
5 Ballad of Narayama, The (Narayama bushiko) (1983) 5
6 Band of the Hand (1986) 5
7 Beerfest (2006) 5
8 Big Clock, The (1948) 5
9 Boss of It All, The (Direktøren for det hele) (2006) 5
10 Boys (1996) 5
# ... with 103 more rows
What is the movie with the highest rating that has been rated at least 30 times?
movies %>%
group_by(Title) %>%
summarize(meanRating = mean(Rating), numRating = n()) %>%
filter(numRating >= 30) %>%
arrange(desc(meanRating))
# A tibble: 874 x 3
Title meanRating numRating
<fctr> <dbl> <int>
1 Shawshank Redemption, The (1994) 4.438953 344
2 Godfather, The (1972) 4.425439 228
3 Wallace & Gromit: A Close Shave (1995) 4.415663 83
4 Schindler's List (1993) 4.396491 285
5 Usual Suspects, The (1995) 4.393281 253
6 Godfather: Part II, The (1974) 4.385135 148
7 Lawrence of Arabia (1962) 4.371622 74
8 Big Night (1996) 4.363636 33
9 City of God (Cidade de Deus) (2002) 4.340909 44
10 Sting, The (1973) 4.337079 89
# ... with 864 more rows
Think of a question you could ask about the movie data set and answer it! (We give our class several problems along these lines.)
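For instance, one question you might ask (our example, not one of the course problems): which user has rated the most movies?

movies %>%
group_by(UserID) %>%
summarize(numRatings = n()) %>%
arrange(desc(numRatings))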
The ggplot2
package works somewhat differently from other graphing utilities. The function ggplot
is used to set up a plot, and has two required parameters: the data that you wish to plot and the mapping of variables to aesthetics you wish to use.
Let's look again at the Galton
data set. Type the following command:
ggplot(data = Galton, mapping = aes(x = parent, y = child))
ggplot
knows that you want to visualize the Galton
data set, and that you want to consider parent
as the x-coordinate and child
as the y-coordinate. However, you haven't yet told ggplot
how you want to visualize the data set.
To do that, ggplot2 provides various geoms. Try typing geom_ in the console, and see how many different geoms are available! In this case, we want to visualize a scatterplot, which we can do using geom_point().
ggplot(data = Galton, mapping = aes(x = parent, y = child)) +
geom_point()
ggplot(data = Galton, mapping = aes(x = parent, y = child)) +
geom_jitter() # like geom_point(), but with a little random noise so overlapping points separate
ggplot(data = Galton, mapping = aes(x = parent, y = child)) +
geom_density_2d() # contour lines of the estimated two-dimensional density
The point is that there are several different visualizations of the Galton data set that might make sense, even after saying what the x and y coordinates are. For better or worse, ggplot
doesn't have a default visualization. It forces you to say exactly what you want to see.
To finish with Galton, we can add a regression line.
By default, it comes with an error region.
ggplot(data = Galton, mapping = aes(x = parent, y = child)) +
geom_jitter() +
geom_smooth(method = lm)
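If you would rather not show the error region, geom_smooth() takes a standard se argument that turns it off:

ggplot(data = Galton, mapping = aes(x = parent, y = child)) +
geom_jitter() +
geom_smooth(method = lm, se = FALSE)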
Next, we want to visualize the quakes
data set.
head(quakes)
lat long depth mag stations
1 -20.42 181.62 562 4.8 41
2 -20.62 181.03 650 4.2 15
3 -26.00 184.10 42 5.4 43
4 -17.97 181.66 626 4.1 19
5 -20.42 181.96 649 4.0 11
6 -19.68 184.31 195 4.0 12
ggplot(quakes, aes(x = long, y = lat)) +
geom_point()
Note that we have additional variables available to use in our plot of the earthquake data. If we wish to visualize the depth of the quake, we can assign that variable to an aesthetic, like this:
ggplot(quakes, aes(x = long, y = lat, color = depth)) +
geom_point()
In order to visualize the magnitude as well, we try:
ggplot(quakes, aes(x = long, y = lat, color = depth, size = mag)) +
geom_point()
Now, things are piling up on top of each other a bit much, so we can adjust the transparency (the alpha aesthetic):
ggplot(quakes, aes(x = long, y = lat, color = depth, size = mag)) +
geom_point(alpha = 0.5)
Finally, since this represents a map, we should adjust the coordinates accordingly:
ggplot(quakes, aes(x = long, y = lat, color = depth, size = mag)) +
geom_point(alpha = 0.5) +
coord_map()
By changing the geom, we can create different plots from the same data and aesthetics. For example, we use geom_histogram to create a histogram of depths of earthquakes.
ggplot(quakes, aes(x = depth)) +
geom_histogram()
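By default geom_histogram() uses 30 bins and prints a message suggesting you pick a binwidth. A variant with an explicit binwidth (our arbitrary choice of 50):

ggplot(quakes, aes(x = depth)) +
geom_histogram(binwidth = 50)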
Finally, suppose we wish to visualize the relationship between magnitude and depth of earthquakes. We might wish to split the earthquake data into one group with depth less than 400 and one group with depth greater than 400, and create a box and whisker plot of magnitudes for each.
mutate(quakes, group = ifelse(depth < 400, "shallow", "deep")) %>%
ggplot(aes(x = group, y = mag)) +
geom_boxplot()
Create a plot or a group of plots that allow you to visualize the relationship(s) between depth, magnitude and number of stations reporting.
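One possible starting point, just a sketch; many designs would work:

ggplot(quakes, aes(x = mag, y = stations, color = depth)) +
geom_jitter(alpha = 0.5)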
By the end of a one-semester first course in statistics, students were able to create relatively nice-looking maps for their class projects. We now explain how.
We start by seeing how to use ggplot
to plot maps.
This requires the ggplot2
library and the maps
library.
The basic tool for creating maps in ggplot is geom_polygon
, which creates the polygonal shapes that will appear on the map. Let's examine geom_polygon
a little bit.
squareCoords <- data.frame(x = c(0,0,1,1), y = c(0,1,1,0)) #The four corners of a square
squareCoords
x y
1 0 0
2 0 1
3 1 1
4 1 0
To plot it, we use
ggplot(squareCoords, aes(x = x, y = y)) +
geom_polygon()
Now, sometimes, for example when merging two data frames, the order of the rows gets messed up. So it is a good idea to have a variable called order that records the position in which each vertex is supposed to be listed. If something scrambles the rows, you can sort by order to put everything right again.
unSquareCoords <- data.frame(x = c(0,0,1,1),
y = c(0,1,0,1))
ggplot(unSquareCoords, aes(x = x, y = y)) +
geom_polygon() #Oh, dear.
unSquareCoords <- data.frame(unSquareCoords, order = c(1,2,4,3))
unSquareCoords <- arrange(unSquareCoords, order)
ggplot(unSquareCoords, aes(x = x, y = y)) +
geom_polygon()
OK. Now, what if we want two squares? Well, we need to give all of the coordinates of the vertices.
twoSquares <- data.frame(x = c(0,0,1,1,3,3,4,4), y = c(0,1,1,0,3,4,4,3), order = c(1,2,3,4,5,6,7,8))
twoSquares
x y order
1 0 0 1
2 0 1 2
3 1 1 3
4 1 0 4
5 3 3 5
6 3 4 6
7 4 4 7
8 4 3 8
ggplot(twoSquares, aes(x = x, y = y)) + geom_polygon() #Oh, dear.
Yikes. What happened? Well, R saw that we have 8 vertices in our polygon, and bless its heart, tried its best to create a polygon with those 8 vertices in that order. Good try, R. We appreciate your enthusiasm. What we need to do is tell R that we actually want two polygons by adding a grouping variable.
twoSquares <- data.frame(twoSquares, group = c(1,1,1,1,2,2,2,2))
twoSquares
x y order group
1 0 0 1 1
2 0 1 2 1
3 1 1 3 1
4 1 0 4 1
5 3 3 5 2
6 3 4 6 2
7 4 4 7 2
8 4 3 8 2
ggplot(twoSquares, aes(x = x, y = y)) +
geom_polygon(aes(group = group))
In ggplot2, longitude and latitude outlines for maps are stored as ordinary data. In order to get to the data, we will be using the map_data() command, as follows. (For this, we will be working with county-level data in Alabama.)
library(maps)
library(dplyr)
countyData <- map_data(map = "county")
countyData <- filter(countyData, region == "alabama")
head(countyData)
long lat group order region subregion
1 -86.50517 32.34920 1 1 alabama autauga
2 -86.53382 32.35493 1 2 alabama autauga
3 -86.54527 32.36639 1 3 alabama autauga
4 -86.55673 32.37785 1 4 alabama autauga
5 -86.57966 32.38357 1 5 alabama autauga
6 -86.59111 32.37785 1 6 alabama autauga
ggplot(data = countyData, mapping = aes(x = long, y = lat)) +
geom_polygon(aes(group = group))
ggplot(data = countyData, mapping = aes(x = long, y = lat)) +
geom_polygon(aes(group = group)) +
coord_map() # a map projection keeps the county shapes from being distorted
Finally, we can color the counties based on some characteristic that they have. For example, we could color each county based on the percentage of people in that county who voted for Trump in the most recent presidential election. Naturally, we would want to make counties more red if a higher percentage voted for Trump, and more blue the higher the percentage that voted for Clinton. Before loading the election data, we demonstrate fill with a variable we already have, the longitude:
ggplot(data = countyData, mapping = aes(x = long, y = lat)) +
geom_polygon(aes(group = group, fill = long)) +
coord_map()
ggplot(data = countyData, mapping = aes(x = long, y = lat)) +
geom_polygon(aes(group = group, fill = long)) +
scale_fill_continuous(low = "lightpink", high = "darkred") + # choose our own color scale
coord_map()
Next, we read in the election data, which has already been merged with the county boundary data:
alabamaData <- read.csv("alabamaElectionData.csv", header = TRUE, stringsAsFactors = FALSE)
head(alabamaData)
long lat group order region subregion fips cand
1 -86.50517 32.34920 1 1 alabama autauga 1001 Donald Trump
2 -86.53382 32.35493 1 2 alabama autauga 1001 Donald Trump
3 -86.54527 32.36639 1 3 alabama autauga 1001 Donald Trump
4 -86.55673 32.37785 1 4 alabama autauga 1001 Donald Trump
5 -86.57966 32.38357 1 5 alabama autauga 1001 Donald Trump
6 -86.59111 32.37785 1 6 alabama autauga 1001 Donald Trump
pct percent
1 0.7343579 0.7540178
2 0.7343579 0.7540178
3 0.7343579 0.7540178
4 0.7343579 0.7540178
5 0.7343579 0.7540178
6 0.7343579 0.7540178
ggplot(data = alabamaData, mapping = aes(x = long, y = lat)) +
geom_polygon(aes(group = group, fill = percent)) +
coord_map() +
scale_fill_gradient2(low = "blue", high = "red", mid = "white", midpoint = 0.5)
Create an electoral map for either a different state, a region of the country, or the entire USA. Data has been saved in a convenient form as allCountyData.csv
on your flash drive.
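A skeleton for this exercise, assuming allCountyData.csv has the same columns as the Alabama file (long, lat, group, region, percent; that layout is our assumption):

allData <- read.csv("allCountyData.csv", header = TRUE, stringsAsFactors = FALSE)
filter(allData, region == "missouri") %>% # hypothetical choice of state
ggplot(aes(x = long, y = lat)) +
geom_polygon(aes(group = group, fill = percent)) +
coord_map() +
scale_fill_gradient2(low = "blue", high = "red", mid = "white", midpoint = 0.5)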