Summary Statistics

Grayson White

Math 141
Week 2 | Fall 2025

Announcements

The teaching team would love to see you in office hours!
- Instructor office hours: For individual or small group help on problems or concepts
- Course assistant office hours: To work on assignments with your peers and get help from the course assistants when you are stuck

Goals for Today

Consider measures for summarizing quantitative data
- Center
- Spread/variability
Consider measures for summarizing categorical data

Load Necessary Packages

dplyr is part of the tidyverse collection of data science packages, so loading tidyverse also loads dplyr (along with readr and ggplot2, which are used later).

# Load necessary packages
library(tidyverse)

Import the Data

biketown <- read_csv("data/biketown.csv")

# Inspect the data
glimpse(biketown)

Rows: 9,999
Columns: 19
$ RouteID          <dbl> 4074085, 3719219, 3789757, 3576798, 3459987, 3947695,…
$ PaymentPlan      <chr> "Subscriber", "Casual", "Casual", "Subscriber", "Casu…
$ StartHub         <chr> "SE Elliott at Division", "SW Yamhill at Director Par…
$ StartLatitude    <dbl> 45.50513, 45.51898, 45.52990, 45.52389, 45.53028, 45.…
$ StartLongitude   <dbl> -122.6534, -122.6813, -122.6628, -122.6722, -122.6547…
$ StartDate        <chr> "8/17/2017", "7/22/2017", "7/27/2017", "7/12/2017", "…
$ StartTime        <time> 10:44:00, 14:49:00, 14:13:00, 13:23:00, 19:30:00, 10…
$ EndHub           <chr> "Blues Fest - SW Waterfront at Clay - Disabled", "SW …
$ EndLatitude      <dbl> 45.51287, 45.52142, 45.55902, 45.53409, 45.52990, 45.…
$ EndLongitude     <dbl> -122.6749, -122.6726, -122.6355, -122.6949, -122.6628…
$ EndDate          <chr> "8/17/2017", "7/22/2017", "7/27/2017", "7/12/2017", "…
$ EndTime          <time> 10:56:00, 15:00:00, 14:42:00, 13:38:00, 20:30:00, 10…
$ TripType         <lgl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N…
$ BikeID           <dbl> 6163, 6843, 6409, 7375, 6354, 6088, 6089, 5988, 6857,…
$ BikeName         <chr> "0488 BIKETOWN", "0759 BIKETOWN", "0614 BIKETOWN", "0…
$ Distance_Miles   <dbl> 1.91, 0.72, 3.42, 1.81, 4.51, 5.54, 1.59, 1.03, 0.70,…
$ Duration         <dbl> 11.500, 11.383, 28.317, 14.917, 60.517, 53.783, 23.86…
$ RentalAccessPath <chr> "keypad", "keypad", "keypad", "keypad", "keypad", "ke…
$ MultipleRental   <lgl> FALSE, FALSE, FALSE, FALSE, TRUE, FALSE, FALSE, FALSE…

Summarizing Data

RouteID	PaymentPlan	StartHub	Distance_Miles
3596434	Subscriber	NW 18th at Flanders	0.59
3607170	Subscriber	NW Raleigh at 21st	0.71
3631639	Casual	SW 2nd at Pine	3.15
3912181	Casual	SE Water at Taylor	5.89
4031739	Casual	SE Clay at Water	1.34
3859969	Casual	SW Naito at Morrison	1.56
4315016	Casual	NW Everett at 22nd	3.50
4252609	Casual	NW Flanders at 14th	1.81
3809564	Casual	NA	2.74

Summarizing is hard to do by eyeballing a spreadsheet with many rows!

Summarizing Data Visually

For a quantitative variable, we want to answer:

What is an average value?
What is the trend/shape of the variable?
How much variation is there from case to case?

We need key summary statistics: numerical values computed from the observed cases.

Measures of Center

Mean: Average of all the observations

\(n\) = Number of cases (sample size)
\(x_i\) = value of the i-th observation
Denote by \(\bar{x}\)

\[ \bar{x} = \frac{1}{n} \sum_{i = 1}^n x_i \]

# Test out on first 6 values
head(biketown$Distance_Miles)

[1] 1.91 0.72 3.42 1.81 4.51 5.54
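
As a quick check of the formula, the mean of these six values can be worked out by hand and compared with R's built-in mean(). This is a minimal sketch; the names x and xbar_by_hand are ours, not from the slides.

```r
# First six trip distances (miles), copied from the output above
x <- c(1.91, 0.72, 3.42, 1.81, 4.51, 5.54)

# Apply the formula directly: add up the values, then divide by n
n <- length(x)
xbar_by_hand <- sum(x) / n

# Compare with the built-in function
xbar_by_hand
mean(x)
```

Both lines should print 2.985, since the six values sum to 17.91.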

Compute with a dplyr function:

summarize(biketown, mean_miles = mean(Distance_Miles))

# A tibble: 1 × 1
  mean_miles
       <dbl>
1       2.04

Measures of Center

Median: Middle value

Half of the data falls below the median
Denote by \(m\)
If \(n\) is even, then it is the average of the middle two values

# Test out on first 6 values
head(biketown$Distance_Miles)

[1] 1.91 0.72 3.42 1.81 4.51 5.54
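
With six values, n is even, so the median is the average of the two middle values after sorting. A small sketch of that rule (the names x_sorted and m_by_hand are ours):

```r
# First six trip distances (miles), copied from the output above
x <- c(1.91, 0.72, 3.42, 1.81, 4.51, 5.54)

# Sort, then average the two middle values (positions n/2 and n/2 + 1)
x_sorted <- sort(x)
n <- length(x_sorted)
m_by_hand <- (x_sorted[n / 2] + x_sorted[n / 2 + 1]) / 2

# Compare with the built-in function
m_by_hand
median(x)
```

The two middle values are 1.91 and 3.42, so both lines should print 2.665.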

Compute with a dplyr function:

summarize(biketown, median_miles = median(Distance_Miles))

# A tibble: 1 × 1
  median_miles
         <dbl>
1         1.48

Measures of Center

Why is the mean larger than the median?

summarize(biketown, 
          mean_miles = mean(Distance_Miles),
          median_miles = median(Distance_Miles))

# A tibble: 1 × 2
  mean_miles median_miles
       <dbl>        <dbl>
1       2.04         1.48
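
One pattern that can push the mean above the median is a small number of unusually large values. The sketch below uses made-up distances (not taken from biketown) to show how a single long ride moves the mean much more than the median.

```r
# Hypothetical trip distances (miles): mostly short rides
short_rides <- c(1, 1.2, 1.5, 1.8, 2)

# The same rides plus one unusually long ride
with_long_ride <- c(short_rides, 15)

# Mean and median agree for the short rides
mean(short_rides)
median(short_rides)

# The long ride pulls the mean up far more than the median
mean(with_long_ride)
median(with_long_ride)
```

Here the mean moves from 1.5 to 3.75 while the median only moves from 1.5 to 1.65.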

Computing Measures of Center by Groups

Question: Who travels farther, on average: casual BIKETOWN users or payment plan subscribers?

Computing Measures of Center by Groups

Handy dplyr function: group_by()

biketown_grouped <- group_by(biketown, PaymentPlan)
biketown_grouped

# A tibble: 9,999 × 19
# Groups:   PaymentPlan [2]
   RouteID PaymentPlan StartHub StartLatitude StartLongitude StartDate StartTime
     <dbl> <chr>       <chr>            <dbl>          <dbl> <chr>     <time>   
 1 4074085 Subscriber  SE Elli…          45.5          -123. 8/17/2017 10:44    
 2 3719219 Casual      SW Yamh…          45.5          -123. 7/22/2017 14:49    
 3 3789757 Casual      NE Holl…          45.5          -123. 7/27/2017 14:13    
 4 3576798 Subscriber  NW Couc…          45.5          -123. 7/12/2017 13:23    
 5 3459987 Casual      NE 11th…          45.5          -123. 7/3/2017  19:30    
 6 3947695 Casual      SW Mood…          45.5          -123. 8/8/2017  10:01    
 7 3549550 Casual      NW 2nd …          45.5          -123. 7/10/2017 14:13    
 8 4411957 Casual      NW Nait…          45.5          -123. 9/10/2017 07:41    
 9 4098004 Casual      NW Nait…          45.5          -123. 8/18/2017 23:35    
10 4096862 Subscriber  SW Mood…          45.5          -123. 8/18/2017 20:10    
# ℹ 9,989 more rows
# ℹ 12 more variables: EndHub <chr>, EndLatitude <dbl>, EndLongitude <dbl>,
#   EndDate <chr>, EndTime <time>, TripType <lgl>, BikeID <dbl>,
#   BikeName <chr>, Distance_Miles <dbl>, Duration <dbl>,
#   RentalAccessPath <chr>, MultipleRental <lgl>

Computing Measures of Center by Groups

Compute summary statistics on the grouped data frame:

biketown_grouped <- group_by(biketown, PaymentPlan)
summarize(biketown_grouped,
          mean_miles = mean(Distance_Miles),
          median_miles = median(Distance_Miles))

# A tibble: 2 × 3
  PaymentPlan mean_miles median_miles
  <chr>            <dbl>        <dbl>
1 Casual            2.56         2.03
2 Subscriber        1.45         1.02

And now it is time to learn the pipe: `%>%`

Chaining `dplyr` Operations

Instead of:

biketown_grouped <- group_by(biketown, PaymentPlan)
summarize(biketown_grouped,
          mean_miles = mean(Distance_Miles),
          median_miles = median(Distance_Miles))

# A tibble: 2 × 3
  PaymentPlan mean_miles median_miles
  <chr>            <dbl>        <dbl>
1 Casual            2.56         2.03
2 Subscriber        1.45         1.02

Use the pipe:

biketown %>%
  group_by(PaymentPlan) %>%
  summarize(mean_miles = mean(Distance_Miles),
            median_miles = median(Distance_Miles))

# A tibble: 2 × 3
  PaymentPlan mean_miles median_miles
  <chr>            <dbl>        <dbl>
1 Casual            2.56         2.03
2 Subscriber        1.45         1.02

Why pipe?

You can also use |>, the native pipe built into R 4.1 and later, which is often referred to as the “base R pipe.”
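
To see that the pipe only changes how the code reads, not what it computes, here is a small sketch on a made-up data frame. The tibble toy and its columns plan and miles are hypothetical stand-ins for biketown.

```r
library(dplyr)

# Hypothetical toy data standing in for biketown
toy <- tibble(
  plan  = c("Casual", "Casual", "Subscriber", "Subscriber"),
  miles = c(2, 4, 1, 3)
)

# 1. Nested calls: must be read from the inside out
nested <- summarize(group_by(toy, plan), mean_miles = mean(miles))

# 2. The %>% pipe: reads top to bottom, one step per line
piped <- toy %>%
  group_by(plan) %>%
  summarize(mean_miles = mean(miles))

# 3. The native |> pipe (requires R 4.1 or later)
native <- toy |>
  group_by(plan) |>
  summarize(mean_miles = mean(miles))

# All three approaches give the same group means
identical(nested$mean_miles, piped$mean_miles)
identical(piped$mean_miles, native$mean_miles)
```

The piped versions pass the result of each line as the first argument of the next function, which is why the data frame no longer appears inside group_by() or summarize().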

Measures of Variability

Want a statistic that captures how much observations deviate from the mean

Find how much each observation deviates from the mean.
Compute the average of the deviations.

\[ \frac{1}{n} \sum_{i = 1}^n (x_i - \bar{x}) \]

# Test out on first 6 values
head(biketown$Distance_Miles)

[1] 1.91 0.72 3.42 1.81 4.51 5.54

Problem?
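
Trying the proposal on the six values makes the issue concrete: positive and negative deviations cancel each other. A minimal sketch (the name deviations is ours):

```r
# First six trip distances (miles), copied from the output above
x <- c(1.91, 0.72, 3.42, 1.81, 4.51, 5.54)

# Deviation of each observation from the mean
deviations <- x - mean(x)
round(deviations, 3)

# The average deviation is zero, up to floating-point rounding
mean(deviations)
```

This is not special to these six values. Because the mean is the balance point of the data, the deviations from it always sum to zero.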

Measures of Variability

Want a statistic that captures how much observations deviate from the mean

Here is my NEW proposal:

Find how much each observation deviates from the mean.
Compute the average of the squared deviations.

# Test out on first 6 values
head(biketown$Distance_Miles)

[1] 1.91 0.72 3.42 1.81 4.51 5.54
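
Squaring the deviations before averaging keeps them from cancelling. A sketch of the new proposal on the same six values (the name squared_dev is ours):

```r
# First six trip distances (miles), copied from the output above
x <- c(1.91, 0.72, 3.42, 1.81, 4.51, 5.54)

# Squared deviation of each observation from the mean
squared_dev <- (x - mean(x))^2

# Average of the squared deviations (divides by n = 6)
mean(squared_dev)
```

The result is about 2.785, and it can only be zero when every observation equals the mean.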

Measures of Variability

Want a statistic that captures how much observations deviate from the mean

Here is my ACTUAL formula:

Find how much each observation deviates from the mean.
Compute (nearly) the average of the squared deviations: divide by \(n - 1\) instead of \(n\).
Called sample variance \(s^2\).

\[ s^2 = \frac{1}{n - 1} \sum_{i = 1}^n (x_i - \bar{x})^2 \]

Compute with a dplyr function:

summarize(biketown, var_miles = var(Distance_Miles))

# A tibble: 1 × 1
  var_miles
      <dbl>
1      3.81
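
The formula for \(s^2\) can be checked against var() on the first six distances. This is a sketch; the name s2_by_hand is ours.

```r
# First six trip distances (miles)
x <- c(1.91, 0.72, 3.42, 1.81, 4.51, 5.54)
n <- length(x)

# Sum of squared deviations divided by n - 1 (not n)
s2_by_hand <- sum((x - mean(x))^2) / (n - 1)

# var() uses the same n - 1 denominator
s2_by_hand
var(x)
```

Both lines should print about 3.342, slightly larger than the value of about 2.785 obtained by dividing by n.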

Measures of Variability

Want a statistic that captures how much observations deviate from the mean

Find how much each observation deviates from the mean.
Compute (nearly) the average of the squared deviations: divide by \(n - 1\) instead of \(n\).
Called sample variance \(s^2\).
The square root of the sample variance is called the sample standard deviation \(s\).

\[ s = \sqrt{\frac{1}{n - 1} \sum_{i = 1}^n (x_i - \bar{x})^2} \]

Compute with a dplyr function:

summarize(biketown, var_miles = var(Distance_Miles),
          sd_miles = sd(Distance_Miles))

# A tibble: 1 × 2
  var_miles sd_miles
      <dbl>    <dbl>
1      3.81     1.95
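
A quick sketch confirming that sd() is the square root of var(), again on the first six distances (the name s_by_hand is ours):

```r
# First six trip distances (miles)
x <- c(1.91, 0.72, 3.42, 1.81, 4.51, 5.54)

# Standard deviation as the square root of the sample variance
s_by_hand <- sqrt(var(x))

# Compare with the built-in function
s_by_hand
sd(x)
```

Taking the square root returns the statistic to the original units: the variance here is in squared miles, while the standard deviation is in miles.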

Measures of Variability

In addition to the sample standard deviation and the sample variance, there is the sample interquartile range (IQR):

\[ \mbox{IQR} = \mbox{Q}_3 - \mbox{Q}_1 \]

Compute with a dplyr function:

summarize(biketown, iqr_miles = IQR(Distance_Miles))

# A tibble: 1 × 1
  iqr_miles
      <dbl>
1      1.89
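
The IQR can be rebuilt from the first and third quartiles using quantile(). A sketch on the first six distances follows; the names q and iqr_by_hand are ours. Note that R interpolates between observations by default, so quartiles computed by hand or in other software may differ slightly.

```r
# First six trip distances (miles)
x <- c(1.91, 0.72, 3.42, 1.81, 4.51, 5.54)

# First and third quartiles (R's default interpolation method)
q <- quantile(x, probs = c(0.25, 0.75))
q

# IQR = Q3 - Q1; unname() drops the "75%" label from the result
iqr_by_hand <- unname(q[2] - q[1])

# Compare with the built-in function
iqr_by_hand
IQR(x)
```

With the default method, the quartiles are 1.835 and 4.2375, giving an IQR of 2.4025.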

Comparing Measures of Variability

Which is more robust to outliers, the IQR or \(s\)?
Which is more commonly used, the IQR or \(s\)?

biketown %>%
  group_by(PaymentPlan) %>%
  summarize(sd_miles = sd(Distance_Miles),
            iqr_miles = IQR(Distance_Miles))

# A tibble: 2 × 3
  PaymentPlan sd_miles iqr_miles
  <chr>          <dbl>     <dbl>
1 Casual          2.15      2.25
2 Subscriber      1.48      1.21
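
One way to explore the robustness question is to change a single value into an extreme one and see which statistic moves. The numbers below are made up for illustration and are not from biketown.

```r
# Hypothetical data: eight evenly spaced values
typical <- c(1, 2, 3, 4, 5, 6, 7, 8)

# Same data, but the largest value is replaced by an outlier
with_outlier <- c(1, 2, 3, 4, 5, 6, 7, 80)

# The standard deviation changes dramatically
sd(typical)
sd(with_outlier)

# The IQR depends only on the middle of the data and does not change
IQR(typical)
IQR(with_outlier)
```

In this toy example the IQR is 3.5 in both cases, while the standard deviation grows by more than a factor of ten.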

Summarizing Categorical Variables

Return to the Cambridge Dogs

Focus on the dogs with the 5 most common names

dogs <- read_csv("https://data.cambridgema.gov/api/views/sckh-3xyx/rows.csv")

# Useful wrangling that we will come back to
dogs_top5 <- dogs %>%
  mutate(Breed = case_when(
    Dog_Breed == "Mixed Breed" ~ "Mixed",
    Dog_Breed != "Mixed Breed" ~ "Single"
  )) %>%
  filter(Dog_Name %in% c("Luna", "Charlie", "Lucy", "Cooper", "Rosie"))

head(dogs_top5)

# A tibble: 6 × 7
  Dog_Name Dog_Breed            Location_masked Latitude_masked Longitude_masked
  <chr>    <chr>                <lgl>                     <dbl>            <dbl>
1 Lucy     Poodle               NA                         42.4            -71.1
2 Luna     LABRADOODLE          NA                         42.4            -71.1
3 Charlie  Border Terrier Mix   NA                         42.4            -71.1
4 Cooper   German Shorthaired … NA                         42.4            -71.1
5 Charlie  Golden Retriever     NA                         42.4            -71.1
6 Luna     Mixed Breed          NA                         42.4            -71.1
# ℹ 2 more variables: Neighborhood <chr>, Breed <chr>

Frequency Table

count(dogs_top5, Dog_Name)

# A tibble: 5 × 2
  Dog_Name     n
  <chr>    <int>
1 Charlie     14
2 Cooper       8
3 Lucy        11
4 Luna        14
5 Rosie       10

ggplot(data = dogs_top5,
       mapping = aes(x = Dog_Name)) +
  geom_bar()

Frequency Table

count(dogs_top5, Dog_Name, sort = TRUE)

# A tibble: 5 × 2
  Dog_Name     n
  <chr>    <int>
1 Charlie     14
2 Luna        14
3 Lucy        11
4 Rosie       10
5 Cooper       8
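
count() can be thought of as a shortcut for grouping and then counting rows with n(). A small sketch on made-up names (toy_dogs is hypothetical, not the Cambridge data):

```r
library(dplyr)

# Hypothetical toy data: one row per dog
toy_dogs <- tibble(
  Dog_Name = c("Luna", "Lucy", "Luna", "Rosie", "Luna", "Lucy")
)

# Frequency table with count()
counted <- count(toy_dogs, Dog_Name)

# The same table built from group_by() and summarize()
by_hand <- toy_dogs %>%
  group_by(Dog_Name) %>%
  summarize(n = n())

counted
by_hand
```

Both tables have one row per name, with counts of 2 for Lucy, 3 for Luna, and 1 for Rosie.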

Another `ggplot2` `geom`: `geom_col()`

If you have already aggregated the data, you will use geom_col() instead of geom_bar().

dog_counts <- count(dogs_top5, Dog_Name)
dog_counts

# A tibble: 5 × 2
  Dog_Name     n
  <chr>    <int>
1 Charlie     14
2 Cooper       8
3 Lucy        11
4 Luna        14
5 Rosie       10

ggplot(data = dog_counts,
       mapping = aes(x = Dog_Name,
                     y = n)) +
  geom_col()

Another `ggplot2` `geom`: `geom_col()`

And use fct_reorder() instead of fct_infreq() to reorder bars.

ggplot(data = dog_counts,
       mapping = aes(x = fct_reorder(Dog_Name, n),
                     y = n)) +
  geom_col()
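
To see what fct_reorder() does to the bar order, it helps to look at the factor levels directly. The sketch below retypes the names and counts from the table above as plain vectors (dog_names and n are our names for them).

```r
library(forcats)

# Names and counts copied from the frequency table
dog_names <- c("Charlie", "Cooper", "Lucy", "Luna", "Rosie")
n <- c(14, 8, 11, 14, 10)

# Default factor levels are alphabetical
levels(factor(dog_names))

# fct_reorder() orders the levels by n, smallest first
levels(fct_reorder(dog_names, n))

# .desc = TRUE puts the largest counts first instead
levels(fct_reorder(dog_names, n, .desc = TRUE))
```

ggplot2 draws bars in level order, so reordering the levels is what reorders the bars. Charlie and Luna are tied at 14, so their relative position is decided by how ties are broken.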

Contingency Table

count(dogs_top5, Dog_Name, Breed)

# A tibble: 10 × 3
   Dog_Name Breed      n
   <chr>    <chr>  <int>
 1 Charlie  Mixed      4
 2 Charlie  Single    10
 3 Cooper   Mixed      1
 4 Cooper   Single     7
 5 Lucy     Mixed      2
 6 Lucy     Single     9
 7 Luna     Mixed      7
 8 Luna     Single     7
 9 Rosie    Mixed      1
10 Rosie    Single     9

ggplot(data = dogs_top5,
       mapping = aes(x = Dog_Name, fill = Breed)) +
  geom_bar(position = "dodge")
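
The output of count() is in "long" form, with one row per name and breed combination. If a more traditional two-way layout is wanted, one option is tidyr's pivot_wider(). In this sketch the counts are retyped from the table above into a tibble we call breed_counts.

```r
library(dplyr)
library(tidyr)

# Counts copied from the count(dogs_top5, Dog_Name, Breed) output
breed_counts <- tibble(
  Dog_Name = rep(c("Charlie", "Cooper", "Lucy", "Luna", "Rosie"), each = 2),
  Breed    = rep(c("Mixed", "Single"), times = 5),
  n        = c(4, 10, 1, 7, 2, 9, 7, 7, 1, 9)
)

# One row per name, one column per breed category
wide <- pivot_wider(breed_counts, names_from = Breed, values_from = n)
wide
```

The result has five rows (one per name) and columns Dog_Name, Mixed, and Single.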

Conditional Proportions

Beyond raw counts, we often summarize categorical data with conditional proportions.
- Especially when looking for relationships!

ggplot(data = dogs_top5,
       mapping = aes(x = Dog_Name, fill = Breed)) +
  geom_bar(position = "fill")

Conditional Proportions

count(dogs_top5, Dog_Name, Breed) %>%
  group_by(Dog_Name) %>%
  mutate(prop = n/sum(n))

# A tibble: 10 × 4
# Groups:   Dog_Name [5]
   Dog_Name Breed      n  prop
   <chr>    <chr>  <int> <dbl>
 1 Charlie  Mixed      4 0.286
 2 Charlie  Single    10 0.714
 3 Cooper   Mixed      1 0.125
 4 Cooper   Single     7 0.875
 5 Lucy     Mixed      2 0.182
 6 Lucy     Single     9 0.818
 7 Luna     Mixed      7 0.5  
 8 Luna     Single     7 0.5  
 9 Rosie    Mixed      1 0.1  
10 Rosie    Single     9 0.9

The dplyr function mutate() adds new column(s) to your data frame.
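
A useful sanity check on conditional proportions is that they sum to 1 within each group of the conditioning variable. The sketch below retypes the counts from the table above (breed_counts, props, and totals are our names) and verifies this for proportions conditioned on Dog_Name.

```r
library(dplyr)

# Counts copied from the count(dogs_top5, Dog_Name, Breed) output
breed_counts <- tibble(
  Dog_Name = rep(c("Charlie", "Cooper", "Lucy", "Luna", "Rosie"), each = 2),
  Breed    = rep(c("Mixed", "Single"), times = 5),
  n        = c(4, 10, 1, 7, 2, 9, 7, 7, 1, 9)
)

# mutate() keeps all 10 rows and adds the prop column;
# sum(n) is computed separately within each Dog_Name group
props <- breed_counts %>%
  group_by(Dog_Name) %>%
  mutate(prop = n / sum(n)) %>%
  ungroup()

# Within each name, the proportions should add up to 1
totals <- props %>%
  group_by(Dog_Name) %>%
  summarize(total = sum(prop))
totals
```

Unlike summarize(), which collapses each group to a single row, mutate() returns a data frame with the same number of rows it started with.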

Conditional Proportions

count(dogs_top5, Dog_Name, Breed) %>%
  group_by(Breed) %>%
  mutate(prop = n/sum(n))

# A tibble: 10 × 4
# Groups:   Breed [2]
   Dog_Name Breed      n   prop
   <chr>    <chr>  <int>  <dbl>
 1 Charlie  Mixed      4 0.267 
 2 Charlie  Single    10 0.238 
 3 Cooper   Mixed      1 0.0667
 4 Cooper   Single     7 0.167 
 5 Lucy     Mixed      2 0.133 
 6 Lucy     Single     9 0.214 
 7 Luna     Mixed      7 0.467 
 8 Luna     Single     7 0.167 
 9 Rosie    Mixed      1 0.0667
10 Rosie    Single     9 0.214

How does the interpretation change based on which variable you condition on?
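
As a worked example, take the row for mixed-breed dogs named Luna. The count in the numerator is the same either way, but the denominator changes with the conditioning variable. The notation \(\hat{p}\) for a sample proportion is our choice and is not used elsewhere in these slides.

```latex
\[ \hat{p}(\mbox{Mixed} \mid \mbox{Luna}) = \frac{7}{7 + 7} = \frac{7}{14} = 0.5 \]

\[ \hat{p}(\mbox{Luna} \mid \mbox{Mixed}) = \frac{7}{4 + 1 + 2 + 7 + 1} = \frac{7}{15} \approx 0.467 \]
```

The first says that half of the dogs named Luna are mixed breed; the second says that about 47% of the mixed-breed dogs (among these five names) are named Luna.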

Reminders

The teaching team would love to see you in office hours!
Next time:
- We’ll define data wrangling
- We’ll learn to use more functions in the dplyr package to summarize and wrangle data
