Using R: ggplot2 and dplyr

Bryan Clair
November 16, 2017

SLU 1818 Statistics Day

https://turtlegraphics@bitbucket.org/turtlegraphics/slu1818-nov-17.git

R Installation

Downloading R is unpleasantly difficult. Go here:

http://cran.wustl.edu

Pick your OS at the top, then:

  • Mac OS: you want R-latest.pkg
  • Windows: you want base/R-release.exe
  • Linux: you want to do the usual Linux thing

Downloading RStudio is simpler. Go here:

https://www.rstudio.com/products/rstudio/download/

and scroll down to get the installer for your platform.

R Basics

  • Command line calculation
  • Variable assignment with '<-'
  • Environment tab
  • History tab
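
A minimal sketch of the first two bullets, using only base R. The names x and y are arbitrary choices for illustration:

```r
# Command line calculation: R evaluates the expression and prints the result
2 + 3 * 4

# Variable assignment with '<-'; x and y then show up in the Environment tab
x <- 10
y <- x / 4

# Typing a name prints its value
y
```

Each line you run is also recorded in the History tab.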

Data in R

Built-in data: The iris dataset. Fisher/Anderson 1935-6.

iris
?iris
str(iris)
head(iris)
summary(iris)

ggplot2

Grammar of Graphics, L. Wilkinson 2005

Implemented by Hadley Wickham

  • data
  • aesthetics map variables to visible appearance
  • geometries are drawn objects

install.packages("ggplot2")
library(ggplot2)

Scatterplot

ggplot(iris, aes(x = Petal.Width, y = Petal.Length)) +
  geom_point()


Scatterplot

ggplot(iris, aes(x = Petal.Width, y = Petal.Length, color=Species)) +
  geom_point()


Try:

  • geom_jitter()
  • geom_density2d()
  • geom_bin2d()

Single Variable

Make a histogram of Petal.Length

  • geom_histogram()
  • geom_freqpoly()
  • geom_density()
  • color aesthetic
  • fill aesthetic
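
A sketch of one way to begin, assuming a bin width of 0.25 (an arbitrary choice) and mapping fill to Species:

```r
library(ggplot2)

# Only an x aesthetic is needed; geom_histogram() counts rows per bin.
# fill inside aes() colors bars by species; color outside aes() sets a
# fixed outline color for every bar.
p_hist <- ggplot(iris, aes(x = Petal.Length, fill = Species)) +
  geom_histogram(binwidth = 0.25, color = "black")

p_hist
```

Swapping in geom_freqpoly() or geom_density() works with the same x aesthetic, though those two draw lines, so color is usually the more useful aesthetic there.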

Categorical Variable

Plot Species on the x aesthetic with Petal.Length as y.

  • geom_point()
  • geom_boxplot()
  • geom_violin()
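
One way this might look with geom_boxplot(). The fill mapping is an optional extra, and p_box is just an illustrative name:

```r
library(ggplot2)

# A categorical variable on x gives one box per species
p_box <- ggplot(iris, aes(x = Species, y = Petal.Length, fill = Species)) +
  geom_boxplot()

p_box
```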

Plotting Practice

Built-in data sets:

  • chickwts
  • mtcars
  • quakes
  • diamonds (actually part of ggplot2)

From HistData library:

  • Prostitutes (try geom_line())
  • Galton (try geom_smooth())
  • Minard

library(HistData)
ggplot() + geom_text(data=Minard.cities, aes(x=long,y=lat,label=city)) +
  geom_point(data=Minard.troops,mapping=aes(x=long,y=lat,size=survivors,color=direction))

Maps with ggmap

McDonald's locations in Missouri:

mcd <- read.csv("http://math.slu.edu/~clair/stat3850/data/mcdonalds-mo.csv")
ggplot(mcd, aes(x=long,y=lat)) + geom_point()


Maps with ggmap

install.packages("ggmap")
library(ggmap)
# Newer ggmap releases need a Google API key first; see ?register_google
momap <- get_map("Missouri", zoom=7)
ggmap(momap)


Maps with ggmap

momap <- get_map("Jefferson City, MO", zoom=7)
ggmap(momap) + geom_point(data = mcd, aes(x=long,y=lat),
                          shape=21,fill="red",color="gold")


Try:

  • quakes data, near Fiji
  • maptype = "satellite" when getting the map
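
Before fetching a map tile, it may help to look at the raw quake locations first, the same way the McDonald's points were plotted without a map. This sketch assumes only the built-in quakes data; coloring by depth and the alpha value are illustrative choices:

```r
library(ggplot2)

# quakes has columns lat, long, depth, mag, and stations
p_quakes <- ggplot(quakes, aes(x = long, y = lat, color = depth)) +
  geom_point(alpha = 0.5)

p_quakes
```

To put these points on a map, the same geom_point() layer (with data = quakes) can be added to a ggmap() call, as in the Missouri example.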

Data Manipulation with dplyr

install.packages("dplyr")
library(dplyr)

Simple verbs connected by pipes: %>%.

Data moves left to right through the pipeline.

mtcars %>% filter(cyl==4) %>% arrange(qsec)
    mpg cyl  disp  hp drat    wt  qsec vs am gear carb
1  26.0   4 120.3  91 4.43 2.140 16.70  0  1    5    2
2  30.4   4  95.1 113 3.77 1.513 16.90  1  1    5    2
3  30.4   4  75.7  52 4.93 1.615 18.52  1  1    4    2
4  21.4   4 121.0 109 4.11 2.780 18.60  1  1    4    2
5  22.8   4 108.0  93 3.85 2.320 18.61  1  1    4    1
6  27.3   4  79.0  66 4.08 1.935 18.90  1  1    4    1
7  32.4   4  78.7  66 4.08 2.200 19.47  1  1    4    1
8  33.9   4  71.1  65 4.22 1.835 19.90  1  1    4    1
9  24.4   4 146.7  62 3.69 3.190 20.00  1  0    4    2
10 21.5   4 120.1  97 3.70 2.465 20.01  1  0    3    1
11 22.8   4 140.8  95 3.92 3.150 22.90  1  0    4    2
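
To see what the pipe is doing, compare the pipeline with the equivalent nested calls. This is a small sketch; piped and nested are just illustrative names:

```r
library(dplyr)

# x %>% f(y) means f(x, y): the left side becomes the first argument
piped  <- mtcars %>% filter(cyl == 4) %>% arrange(qsec)

# The same computation written inside-out, without pipes
nested <- arrange(filter(mtcars, cyl == 4), qsec)

identical(piped, nested)
```

The piped version reads in the order the steps happen, which is the main reason to prefer it.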

dplyr verbs

  1. filter() : filter certain rows of the data
  2. select() : select certain columns of the data
  3. arrange() : arrange rows in order
  4. group_by() : organize data into groups of rows
  5. summarize() : compute summary statistics of groups

mtcars %>% group_by(cyl) %>% summarize(mpg.avg = mean(mpg))
# A tibble: 3 x 2
    cyl  mpg.avg
  <dbl>    <dbl>
1     4 26.66364
2     6 19.74286
3     8 15.10000
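
select() has not appeared in an example yet. Here is a short sketch that chains it with filter() and arrange(); the choice of columns and the name six_cyl are arbitrary:

```r
library(dplyr)

# Keep only 6-cylinder cars, keep three columns, sort by quarter-mile time
six_cyl <- mtcars %>%
  filter(cyl == 6) %>%
  select(mpg, hp, qsec) %>%
  arrange(qsec)

six_cyl
```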

dplyr verbs

Try:

  • Find the car with the highest hp.
  • Find the mean mpg of automatics (am=0) for each number of cylinders.
  • Find the mean weight of cars with each combination of cyl and carb.

MovieLens Data

The data set movies consists of 100,000 individual movie ratings selected from a much larger data set, MovieLens, freely available from GroupLens Research.

movies <- read.csv("http://stat.slu.edu/~speegle/_book_data/movieLensData")

Example

What is the movie with the highest rating that has been rated at least 30 times?

movies %>%
  group_by(Title) %>%
  summarize(meanRating = mean(Rating), numRating = n()) %>%
  filter(numRating >= 30) %>%
  arrange(desc(meanRating))
# A tibble: 874 x 3
                                    Title meanRating numRating
                                   <fctr>      <dbl>     <int>
 1       Shawshank Redemption, The (1994)   4.438953       344
 2                  Godfather, The (1972)   4.425439       228
 3 Wallace & Gromit: A Close Shave (1995)   4.415663        83
 4                 Schindlers List (1993)   4.396491       285
 5             Usual Suspects, The (1995)   4.393281       253
 6         Godfather: Part II, The (1974)   4.385135       148
 7              Lawrence of Arabia (1962)   4.371622        74
 8                       Big Night (1996)   4.363636        33
 9    City of God (Cidade de Deus) (2002)   4.340909        44
10                      Sting, The (1973)   4.337079        89
# ... with 864 more rows

MovieLens questions

Think of a question you could ask about the movie data set and answer it! Here are some problems we give our class:

  1. What is the lowest rated movie among those with at least 50 reviews?
  2. Which user gave the worst ratings? Which film did they like the best?
  3. How many movies have only one rating?
  4. Which genre has been rated the most?
  5. Which movie that has a mean rating of 4 or higher has been rated the most times?
  6. Among movies whose ratings are all one star or less, which have the most ratings?
  7. Which movie has the worst average rating of all movies that got a five-star rating from someone?
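
A sketch of how question 3 might be approached, shown on a tiny made-up data frame so it runs without the download. The toy data is hypothetical; it only mimics the Title and Rating columns used in the example above:

```r
library(dplyr)

# Hypothetical stand-in for movies: three titles, six ratings
toy <- data.frame(
  Title  = c("A", "A", "B", "C", "C", "C"),
  Rating = c(5, 4, 3, 2, 1, 2)
)

# Count ratings per title, then keep titles rated exactly once
single <- toy %>%
  group_by(Title) %>%
  summarize(numRating = n()) %>%
  filter(numRating == 1)

nrow(single)
```

Replacing toy with movies should give the count for the real data, assuming the column names match.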

Lahman

install.packages("Lahman")
library(Lahman)
?Batting
?Pitching
?Master   # renamed People in newer Lahman releases

  • Plot the total number of at bats (AB) per year.
  • Find the team that has hit the most home runs, all time.
  • Which player since 1950 has the most career triples (X3B)?
World Health Organization

install.packages("WHO")
library(WHO)
codes <- get_codes()
head(codes)
infant <- get_data("MDG_0000000001")
infant %>%
  filter(country == "Peru") %>%
  ggplot(aes(x=year,y=value,color=worldbankincomegroup)) +
  geom_line()