Learning to Teach Introductory Statistics with R

Bryan Clair and Darrin Speegle
September 28, 2017

Focus on Teaching and Technology Conference

https://bitbucket.org/speegled/fttc-2017

R Installation

Downloading R is unpleasantly difficult. Go here:

http://cran.wustl.edu

Pick your OS at the top, then:

  • Mac OS: you want R-latest.pkg
  • Windows: you want base/R-release.exe
  • Linux: you want to do the usual Linux thing

Downloading RStudio is simpler. Go here:

https://www.rstudio.com/products/rstudio/download/

and scroll down to get the installer for your platform.

Stats at SLU

  • 1300: Elementary Statistics with Computers. Service course, college algebra prerequisite.
  • 3850: Foundations of Statistics. Majors course, calculus 2 prerequisite.
  • Advanced courses in probability, statistics, time series, statistical models, applied regression

Why Choose R?

  • Free, open source
  • Community
  • Reproducible work
  • Powerful libraries

Source: The Popularity of Data Science Software by Robert A. Muenchen

Why Not Choose R?

  • Learning curve
  • Discoverability
  • Syntax

Support Systems

What does R add to a course?

  • Probability via simulation
  • High quality visualization
  • Data wrangling
  • Modern data analysis, e.g. nonparametric tests, GIS, text analysis

Workshop goal: play with these features

R Basics

  • Command line calculation
  • Variable assignment with '<-'
  • Environment tab
  • History tab
  • Vectors with : and c()
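These basics fit in a few console lines (the values after `#` are what R prints back):

```r
# Command-line calculation
2 + 3 * 4          # 14

# Variable assignment with <-
x <- 10
x / 4              # 2.5

# Vectors with : and c()
v <- 1:5           # 1 2 3 4 5
w <- c(2, 4, 6)
sum(v)             # 15
mean(w)            # 4
```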

Data in R

Install the HistData library.

library(HistData)
head(Galton)
  parent child
1   70.5  61.7
2   68.5  61.7
3   65.5  61.7
4   64.5  61.7
5   64.0  61.7
6   67.5  62.2
mean(Galton$parent)
[1] 68.30819
mean(Galton$child)
[1] 68.08847

Data in R

A simple linear regression:

plot(Galton)
model <- lm(child ~ parent, data=Galton)
abline(model)

[Plot: scatterplot of Galton data with fitted regression line]

  • Explore the built-in data set faithful

Probability via Simulation

sample(1:6, 1)
[1] 1
sample(1:6, 20, replace=TRUE)
 [1] 6 1 6 4 2 5 5 3 4 1 4 6 3 1 6 6 6 1 6 3
sum(sample(1:6, 2, replace=TRUE))
[1] 4
dierolls <- replicate(10000, sum(sample(1:6, 2, replace=TRUE)))
table(dierolls)
dierolls
   2    3    4    5    6    7    8    9   10   11   12 
 235  577  859 1125 1381 1662 1381 1098  849  547  286 
  • Compute the probability that the sum of five dice is 15.
  • Compute the probability that at least two people in a room of 30 share a birthday. (Hint: use anyDuplicated)
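One way to attack the birthday exercise, following the anyDuplicated hint (the exact probability is about 0.706):

```r
# Simulate 10,000 rooms of 30 people with random birthdays (1-365).
# anyDuplicated() returns 0 when all birthdays are distinct.
sims <- replicate(10000, {
  birthdays <- sample(1:365, 30, replace = TRUE)
  anyDuplicated(birthdays) > 0
})
mean(sims)  # roughly 0.71
```

The same replicate-then-mean pattern solves the dice exercise as well.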

Probability distributions

R has built-in support for all the standard probability distributions.

runif(5)  # 5 uniformly random numbers between 0 and 1
[1] 0.005813683 0.017561025 0.148594451 0.858384741 0.261783760
rnorm(5)  # 5 standard normal random numbers
[1] -1.3272202  0.6986228 -0.4616152  1.4881083 -0.2382487
pnorm(1) - pnorm(-1)  # area under normal curve within 1 sd of mean.
[1] 0.6826895

What is the probability that the max of two uniformly random numbers on [0,1] is larger than 0.5?

Choose a point (x, y) in the plane whose coordinates are independent standard normal random variables. What is its expected distance from the origin?
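Both questions fit the same simulation pattern. A sketch of each (for the first, the exact answer is 1 - 0.5^2 = 0.75; for the second, it is sqrt(pi/2), about 1.25):

```r
# P(max of two Unif(0,1) draws exceeds 0.5): simulate many pairs
maxes <- replicate(10000, max(runif(2)))
mean(maxes > 0.5)  # close to 0.75

# Expected distance from the origin for standard normal (x, y)
dists <- replicate(10000, sqrt(sum(rnorm(2)^2)))
mean(dists)        # close to sqrt(pi/2), about 1.25
```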

Data Manipulation

In this part of the workshop, we analyze a data set of movies using the R dplyr tool.

  • The data set movies consists of 100,000 observations of the variables MovieID, Title, Genres, UserID, Rating and Timestamp.
  • This data is from a much larger data set, MovieLens, freely available from GroupLens Research.
  • These observations come from 746 consecutive users of MovieLens, chosen with a random starting point.

Input Data

movies <- read.csv("http://stat.slu.edu/~speegle/_book_data/movieLensData")

dplyr verbs

  1. select() forms a new data frame with *select*ed columns.
  2. arrange() forms a new data frame with row(s) *arrange*d in a specified order.
  3. filter() forms a new data frame consisting of rows that satisfy certain *filter*ing conditions.
  4. mutate() and transmute() allow you to create new columns from existing ones. mutate() adds the new columns to the data frame, while transmute() keeps only the new columns.
  5. summarize() summarizes a data frame into a single row.
  6. distinct() collapses identical observations into a single row.
  7. group_by() groups the data to perform tasks by groups.

Examples

  1. Create a new data frame with the columns MovieID and Rating. select(movies, MovieID, Rating)
  2. Order the ratings by the time at which they were reviewed. arrange(movies, Timestamp)
  3. Find all 5 star ratings. filter(movies, Rating == 5)
  4. Find the mean of all Ratings in the data frame. summarize(movies, mean(Rating))
  5. Form a data frame consisting of the unique UserIDs. distinct(movies, UserID)
  6. group_by() really only works well when combined with other commands.

Combining verbs

The utility of dplyr becomes more apparent when combining the various verbs using the pipe operator, %>%. The pipe operator allows you to feed the result of one expression as input to a function. For example,

  • x %>% f(y) becomes f(x,y) and
  • x %>% f(y) %>% g(z) becomes g(f(x,y),z).
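A tiny illustration of these rewriting rules (this assumes the dplyr library, which supplies %>%, is loaded):

```r
library(dplyr)  # provides the %>% pipe operator

# x %>% f(y) is f(x, y):
c(1, 4, 9) %>% sqrt()            # same as sqrt(c(1, 4, 9)): 1 2 3

# x %>% f(y) %>% g(z) is g(f(x, y), z):
c(1, 4, 9) %>% sqrt() %>% sum()  # same as sum(sqrt(c(1, 4, 9))): 6
```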

Example One

Find the mean rating of Toy Story, which has MovieID 1.

filter(movies, MovieID == 1) %>%
  summarize(mean(Rating))
  mean(Rating)
1     3.875862

The filter() command creates a data frame that consists solely of the observations of Toy Story, and summarize() computes the mean rating.

Example Two

List the movies that have an average rating of 5 stars. That is, each rating of the movie was 5 stars.

In order to do this, we will first use the group_by() function to find the mean rating of each movie.

movies %>% 
  group_by(Title) %>%
  summarize(meanRating = mean(Rating))
# A tibble: 6,200 x 2
                                                     Title meanRating
                                                    <fctr>      <dbl>
 1 ...All the Marbles (a.k.a. The California Dolls) (1981)   3.500000
 2                                 ...And God Spoke (1993)   4.000000
 3                           ...And Justice for All (1979)   3.600000
 4                          *batteries not included (1987)   3.125000
 5                                               10 (1979)   3.500000
 6                                 10 Items or Less (2006)   3.000000
 7                              10 Rillington Place (1971)   4.250000
 8                       10 Things I Hate About You (1999)   3.450000
 9                                   10 to Midnight (1983)   2.500000
10                                   101 Dalmatians (1996)   3.186047
# ... with 6,190 more rows

Example Two (continued)

Next, we filter out those that have a mean rating of five stars.

movies %>% 
  group_by(Title) %>%
  summarize(meanRating = mean(Rating)) %>%
  filter(meanRating == 5)
# A tibble: 113 x 2
                                                  Title meanRating
                                                 <fctr>      <dbl>
 1                                         49 Up (2005)          5
 2                   5,000 Fingers of Dr. T, The (1953)          5
 3                                     Anastasia (1956)          5
 4                  Ay, Carmela! (¡Ay, Carmela!) (1990)          5
 5    Ballad of Narayama, The (Narayama bushiko) (1983)          5
 6                              Band of the Hand (1986)          5
 7                                      Beerfest (2006)          5
 8                                Big Clock, The (1948)          5
 9 Boss of It All, The (Direktøren for det hele) (2006)          5
10                                          Boys (1996)          5
# ... with 103 more rows

One more example

What is the movie with the highest rating that has been rated at least 30 times?

movies %>%
  group_by(Title) %>%
  summarize(meanRating = mean(Rating), numRating = n()) %>%
  filter(numRating >= 30) %>%
  arrange(desc(meanRating))
# A tibble: 874 x 3
                                    Title meanRating numRating
                                   <fctr>      <dbl>     <int>
 1       Shawshank Redemption, The (1994)   4.438953       344
 2                  Godfather, The (1972)   4.425439       228
 3 Wallace & Gromit: A Close Shave (1995)   4.415663        83
 4                 Schindlers List (1993)   4.396491       285
 5             Usual Suspects, The (1995)   4.393281       253
 6         Godfather: Part II, The (1974)   4.385135       148
 7              Lawrence of Arabia (1962)   4.371622        74
 8                       Big Night (1996)   4.363636        33
 9    City of God (Cidade de Deus) (2002)   4.340909        44
10                      Sting, The (1973)   4.337079        89
# ... with 864 more rows

You try it

Think of a question you could ask about the movie data set and answer it! Here are some problems we give our class:

  1. Which genre has been rated the most?
  2. Which movie in the genre Comedy|Romance that has been rated at least 75 times has the lowest/highest mean rating?
  3. Which movie that has a mean rating of 4 or higher has been rated the most times?
  4. Which movies have the most ratings, all of which are one star or less?
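As a hedged sketch of the shape of problem 1, here is the same group-summarize-arrange pattern run on a tiny invented stand-in for the movies data (the column names match the real data; the rows are made up):

```r
library(dplyr)

# Toy stand-in for the movies data frame (invented ratings)
toy <- data.frame(
  Genres = c("Comedy", "Drama", "Comedy", "Comedy", "Drama"),
  Rating = c(4, 5, 3, 5, 2)
)

# Which genre has been rated the most? Count rows per genre,
# then sort so the most-rated genre comes first.
toy %>%
  group_by(Genres) %>%
  summarize(numRating = n()) %>%
  arrange(desc(numRating))
```

Run the same pipeline with movies in place of toy to answer the real question.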

ggplot

The ggplot2 package works somewhat differently from other graphing utilities. The function ggplot is used to set up a plot, and has two required parameters: the data that you wish to plot and the mapping of variables to aesthetics you wish to use.

Plotting Galton

Let's look again at the Galton data set. Type the following command

ggplot(data = Galton, mapping = aes(x = parent, y = child))

What happened?

ggplot knows that you want to visualize the Galton data set, and that you want to consider parent as the x-coordinate and child as the y-coordinate. However, you haven't yet told ggplot how you want to visualize the data set.

To do that, ggplot2 provides a variety of geoms. Try typing geom_ in the console and see how many different geoms are available! In this case, we want a scatterplot, which we get with geom_point.

Scatterplot

ggplot(data = Galton, mapping = aes(x = parent, y = child)) +
  geom_point()

[Plot: scatterplot of Galton data]

Scatterplot with jitter

ggplot(data = Galton, mapping = aes(x = parent, y = child)) +
  geom_jitter()

[Plot: jittered scatterplot of Galton data]

Density Contour

ggplot(data = Galton, mapping = aes(x = parent, y = child)) +
  geom_density_2d()

[Plot: 2-D density contours of Galton data]

ggplot philosophy

The point is that there are several different visualizations of the Galton data set that might make sense, even after saying what the x and y coordinates are. For better or worse, ggplot doesn't have a default visualization. It forces you to say exactly what you want to see.

Regression Line

To finish with Galton, we can add a regression line.

By default, it comes with an error region.

ggplot(data = Galton, mapping = aes(x = parent, y = child)) +
  geom_jitter() +
  geom_smooth(method = lm)

[Plot: jittered Galton scatterplot with regression line and error region]

Plotting the quakes data set

Next, we want to visualize the quakes data set.

head(quakes)
     lat   long depth mag stations
1 -20.42 181.62   562 4.8       41
2 -20.62 181.03   650 4.2       15
3 -26.00 184.10    42 5.4       43
4 -17.97 181.66   626 4.1       19
5 -20.42 181.96   649 4.0       11
6 -19.68 184.31   195 4.0       12

Scatterplot

ggplot(quakes, aes(x = long, y = lat)) + 
  geom_point()

[Plot: scatterplot of quake locations]

Adding dimensions

Note that we have additional variables available to use in our plot of the earthquake data. If we wish to visualize the depth of the quake, we can assign that variable to an aesthetic, like this:

ggplot(quakes, aes(x = long, y = lat, color = depth)) + 
  geom_point()

[Plot: quake locations colored by depth]

More dimensions

In order to visualize the magnitude, we try

ggplot(quakes, aes(x = long, y = lat, color = depth, size = mag)) + 
  geom_point()

[Plot: quake locations colored by depth, sized by magnitude]

More dimensions

Now the points are piling up on top of one another, so we can adjust the transparency (alpha):

ggplot(quakes, aes(x = long, y = lat, color = depth, size = mag)) + 
  geom_point(alpha = 0.5)

[Plot: same scatterplot with alpha = 0.5]

Map projection

Finally, since this represents a map, we should adjust the coordinates accordingly

ggplot(quakes, aes(x = long, y = lat, color = depth, size = mag)) + 
  geom_point(alpha = 0.5) + 
  coord_map()

[Plot: quake locations drawn with a map projection]

Histogram

By changing the geom, we can create different plots from the same data and aesthetics. For example, we use geom_histogram to create a histogram of the depths of earthquakes.

Histogram

ggplot(quakes, aes(x = depth)) + 
  geom_histogram()  

[Plot: histogram of quake depths]

Piping from dplyr

Finally, suppose we wish to visualize the relationship between magnitude and depth of earthquakes. We might wish to split the earthquake data into one group with depth less than 400 and one group with depth greater than 400, and create a box and whisker plot of magnitudes for each.

Piping from dplyr

mutate(quakes, group = ifelse(depth < 400, "shallow", "deep")) %>%
  ggplot(aes(x = group, y = mag)) + 
  geom_boxplot()

[Plot: boxplots of magnitude for shallow and deep quakes]

Challenge

Create a plot or a group of plots that allow you to visualize the relationship(s) between depth, magnitude and number of stations reporting.
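One possible starting point, and certainly not the only reasonable plot: put depth and magnitude on the axes and map the number of reporting stations to point size.

```r
library(ggplot2)

# Depth vs. magnitude, with point size showing how many stations reported
ggplot(quakes, aes(x = depth, y = mag, size = stations)) +
  geom_point(alpha = 0.3)
```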

Map Project

By the end of a one-semester first course in statistics, students were able to create relatively nice-looking maps for their class projects, which we now explain.

We start by seeing how to use ggplot to plot maps.

geom_polygon

This requires the ggplot2 library and the maps library.

The basic tool for creating maps in ggplot is geom_polygon, which creates the polygonal shapes that will appear on the map. Let's examine geom_polygon a little bit.

Plot a square

squareCoords <- data.frame(x = c(0,0,1,1), y = c(0,1,1,0)) #The four corners of a square
squareCoords
  x y
1 0 0
2 0 1
3 1 1
4 1 0

To plot it, we use

ggplot(squareCoords, aes(x = x, y = y)) + 
  geom_polygon()

[Plot: filled unit square]

Order matters

Sometimes (for example, when merging two data frames) the rows get rearranged. So it is a good idea to keep a variable called order that records the sequence in which the vertices should be listed. If the rows do get shuffled, you can sort by order to put everything right again.

Order matters

unSquareCoords <- data.frame(x = c(0,0,1,1), 
                             y = c(0,1,0,1))
ggplot(unSquareCoords, aes(x = x, y = y)) + 
  geom_polygon() #Oh, dear.

[Plot: misdrawn square from out-of-order vertices]

unSquareCoords <- data.frame(unSquareCoords, order = c(1,2,4,3))
unSquareCoords <- arrange(unSquareCoords, order)
ggplot(unSquareCoords, aes(x = x, y = y)) + 
  geom_polygon()

[Plot: square drawn correctly after sorting by order]

Multiple squares

OK. Now, what if we want two squares? Well, we need to give all of the coordinates of the vertices.

twoSquares <- data.frame(x = c(0,0,1,1,3,3,4,4), y = c(0,1,1,0,3,4,4,3), order = c(1,2,3,4,5,6,7,8))
twoSquares
  x y order
1 0 0     1
2 0 1     2
3 1 1     3
4 1 0     4
5 3 3     5
6 3 4     6
7 4 4     7
8 4 3     8

Multiple squares

ggplot(twoSquares, aes(x = x, y = y)) + geom_polygon() #Oh, dear.

[Plot: one distorted polygon drawn through all 8 vertices]

Yikes. What happened? Well, R saw that we have 8 vertices in our polygon and, bless its heart, tried its best to create a polygon with those 8 vertices in that order. Good try, R. We appreciate your enthusiasm. What we need to do is tell R that we actually want two polygons, by adding a grouping variable.

Multiple Squares

twoSquares <- data.frame(twoSquares, group = c(1,1,1,1,2,2,2,2))
twoSquares
  x y order group
1 0 0     1     1
2 0 1     2     1
3 1 1     3     1
4 1 0     4     1
5 3 3     5     2
6 3 4     6     2
7 4 4     7     2
8 4 3     8     2
ggplot(twoSquares, aes(x = x, y = y)) + 
  geom_polygon(aes(group = group))

[Plot: two separate filled squares]

Plotting maps

Longitude and latitude information for map outlines is available as data. To get at it, we will use the map_data command, as follows. (For this, we will be working with county-level data in Alabama.)

library(maps)
library(dplyr)
countyData <- map_data(map = "county")
countyData <- filter(countyData, region == "alabama")
head(countyData)
       long      lat group order  region subregion
1 -86.50517 32.34920     1     1 alabama   autauga
2 -86.53382 32.35493     1     2 alabama   autauga
3 -86.54527 32.36639     1     3 alabama   autauga
4 -86.55673 32.37785     1     4 alabama   autauga
5 -86.57966 32.38357     1     5 alabama   autauga
6 -86.59111 32.37785     1     6 alabama   autauga

Fixing projection

ggplot(data = countyData, mapping = aes(x = long, y = lat)) + 
  geom_polygon(aes(group = group))

[Plot: Alabama counties, unprojected]

ggplot(data = countyData, mapping = aes(x = long, y = lat)) + 
  geom_polygon(aes(group = group)) +
  coord_map()

[Plot: Alabama counties with a map projection]

Filling counties

Finally, we can color the counties based on some characteristic that they have. For example, we could color each county based on the percentage of people in that county who voted for Trump in the most recent presidential election. Naturally, we would want to make counties more red if a higher percentage voted for Trump, and more blue the higher the percentage that voted for Clinton.

Coloring counties based on longitude

ggplot(data = countyData, mapping = aes(x = long, y = lat)) + 
  geom_polygon(aes(group = group, fill = long)) +
  coord_map()

[Plot: Alabama counties filled by longitude]

Changing the default color scheme

ggplot(data = countyData, mapping = aes(x = long, y = lat)) + 
  geom_polygon(aes(group = group, fill = long)) +
  scale_fill_continuous(low = "lightpink", high = "darkred") +
  coord_map()

[Plot: Alabama counties with light-pink-to-dark-red fill scale]

Election Results

Read data

alabamaData <- read.csv("alabamaElectionData.csv", header = TRUE, stringsAsFactors = FALSE)
head(alabamaData)
       long      lat group order  region subregion fips         cand
1 -86.50517 32.34920     1     1 alabama   autauga 1001 Donald Trump
2 -86.53382 32.35493     1     2 alabama   autauga 1001 Donald Trump
3 -86.54527 32.36639     1     3 alabama   autauga 1001 Donald Trump
4 -86.55673 32.37785     1     4 alabama   autauga 1001 Donald Trump
5 -86.57966 32.38357     1     5 alabama   autauga 1001 Donald Trump
6 -86.59111 32.37785     1     6 alabama   autauga 1001 Donald Trump
        pct   percent
1 0.7343579 0.7540178
2 0.7343579 0.7540178
3 0.7343579 0.7540178
4 0.7343579 0.7540178
5 0.7343579 0.7540178
6 0.7343579 0.7540178

Election Results

ggplot(data = alabamaData, mapping = aes(x = long, y = lat)) + 
  geom_polygon(aes(group = group, fill = percent)) +
  coord_map() + 
  scale_fill_gradient2(low = "blue", high = "red", mid = "white", midpoint = 0.5)

[Plot: Alabama election map, blue-to-red fill by vote percentage]

Challenge

Create an electoral map for either a different state, a region of the country, or the entire USA. Data has been saved in a convenient form as allCountyData.csv on your flash drive.

Fiji Map (I)

[Plot: map of the Fiji region]