Quantifying Uncertainty II

Grayson White

Math 141
Week 7 | Fall 2025

Announcements

Midterm exam available today!
- Finish take-home by Weds 10am (2.5hr exam)
- Make sure you’ve signed up for an oral exam time!
Discussion of midterm logistics (cloning repo, Gradescope process, rendering, office hours)

Goals for Today

Continue our discussion of quantifying uncertainty
- Key ideas:
  - sampling variability, sampling distributions, populations, samples, parameters, statistics.

Construct sampling distributions in R

Statistical Inference

The ❤️ of statistical inference is quantifying uncertainty

library(tidyverse)
ce <- read_csv("data/fmli.csv")
summarize(ce, meanFINCBTAX = mean(FINCBTAX))

# A tibble: 1 × 1
  meanFINCBTAX
         <dbl>
1       62480.

The ❤️ of statistical inference is quantifying uncertainty

library(tidyverse)
ce <- read_csv("data/fmli.csv")
summarize(ce, meanFINCBTAX = mean(FINCBTAX))

# A tibble: 1 × 1
  meanFINCBTAX
         <dbl>
1       62480.

Like with regression, need to distinguish between the population and the sample

Parameters:
- Based on the population
- Unknown then if don’t have data on the whole population
- EX: \(\beta_o\) and \(\beta_1\)
- EX: \(\mu\) = population mean

Statistics:
- Based on the sample data
- Known
- Usually estimate a population parameter
- EX: \(\hat{\beta}_o\) and \(\hat{\beta}_1\)
- EX: \(\bar{x}\) = sample mean

Statistical Inference

Goal: Draw conclusions about the population based on the sample.

Plato’s allegory of the cave

Statistical Inference

Goal: Draw conclusions about the population based on the sample.

Main Flavors

Estimating numerical quantities (parameters).
Testing conjectures.

Estimation

Goal: Estimate a (population) parameter.

Best guess?

The corresponding (sample) statistic

Key Question: How accurate is the statistic as an estimate of the parameter?

Helpful Sub-Question: If we take many samples, how much would the statistic vary from sample to sample?

Need two new concepts:

The sampling variability of a statistic
The sampling distribution of a statistic

Sampling Distribution of a Statistic

Steps to Construct an (Approximate) Sampling Distribution:

Decide on a sample size, \(n\).
Randomly select a sample of size \(n\) from the population.
Compute the sample statistic.
Put the sample back in.
Repeat Steps 2 - 4 many (1000+) times.

Sampling Distribution of a Statistic

Center? Shape?
Spread?
- Standard error = standard deviation of the statistic

What happens to the center/spread/shape as we increase the sample size?
What happens to the center/spread/shape if the true parameter changes?

Let’s Construct Some Sampling Distributions using `R`!

Important Notes

To construct a sampling distribution for a statistic, we need access to the entire population so that we can take repeated samples from the population.
- Population = Mt Tabor trees
But if we have access to the entire population, then we know the value of the population parameter.
- Can compute the exact proportion of maple trees in our population.
The sampling distribution is needed in the exact scenario where we can’t compute it: the scenario where we only have a single sample.
We will learn how to estimate the sampling distribution soon.
Today, we have the entire population and are constructing sampling distributions anyway to study their properties!

New `R` Package: `infer`

library(infer)

New `R` Package: `infer`

library(infer)

Will use infer to conduct statistical inference.

Our Population Parameter

Create data frame of Mt Tabor trees:

library(tidyverse)
library(pdxTrees)
tabor <- get_pdxTrees_parks(park = "Mt Tabor Park")

Add variable of interest:

tabor <- tabor %>%
  mutate(tree_of_interest = case_when(
    Genus == "Pseudotsuga" ~ "yes",
    Genus != "Pseudotsuga" ~ "no"
  ))
count(tabor, tree_of_interest)

# A tibble: 2 × 2
  tree_of_interest     n
  <chr>            <int>
1 no                 435
2 yes                696

Population Parameter

# Population distribution
ggplot(data = tabor, 
       mapping = aes(x = tree_of_interest)) +
  geom_bar(aes(y = after_stat(prop), group = 1))

# True population parameter
tabor %>%
  summarize(parameter = mean(tree_of_interest == "yes"))

# A tibble: 1 × 1
  parameter
      <dbl>
1     0.615

Random Samples

Let’s look at 4 random samples.

# Draw random samples
samples <- tabor %>% 
  rep_sample_n(size = 20, reps = 4)
  
# Graph the samples
ggplot(data = samples, 
       mapping = aes(x = tree_of_interest)) +
  geom_bar(aes(y = after_stat(prop), group = 1)) +
  facet_wrap(~replicate)

Constructing the Sampling Distribution

Now, let’s take 1000 random samples.

# Construct the sampling distribution
samp_dist <- tabor %>% 
  rep_sample_n(size = 20, reps = 1000) %>%
  group_by(replicate) %>%
  summarize(
    statistic = mean(tree_of_interest == "yes")
    )

# Graph the sampling distribution
ggplot(data = samp_dist, 
       mapping = aes(x = statistic)) +
  geom_histogram(bins = 13)

Shape?

Center?

Spread?

Properties of the Sampling Distribution

summarize(samp_dist, estimate = mean(statistic))

# A tibble: 1 × 1
  estimate
     <dbl>
1    0.618

summarize(samp_dist, standard_error = sd(statistic))

# A tibble: 1 × 1
  standard_error
           <dbl>
1          0.108

The standard deviation of a sample statistic is called the standard error.

Properties of the Sampling Distribution

summarize(samp_dist, estimate = mean(statistic))

# A tibble: 1 × 1
  estimate
     <dbl>
1    0.618

summarize(samp_dist, standard_error = sd(statistic))

# A tibble: 1 × 1
  standard_error
           <dbl>
1          0.108

For approximately bell-shaped distributions, about 95% of observations fall within 1.96 standard deviations of the population’s mean \(p\).

Huge Implication:

The sampling distribution for the mean, \(\hat p\) is approximately bell-shaped
… and is centered at the population mean, \(p\).
So 95% of all sample statistics, \(\hat p\) fall within 1.96 standard errors of \(p\)!

What happens to the sampling distribution if we change the sample size from 20 to 100?

# Construct the sampling distribution
samp_dist_100 <- tabor %>% 
  rep_sample_n(size = 100, reps = 1000) %>%
  group_by(replicate) %>%
  summarize(
    statistic = mean(tree_of_interest == "yes")
    )

# Graph the sampling distribution
ggplot(data = samp_dist_100, 
       mapping = aes(x = statistic)) +
  geom_histogram(bins = 13)

summarize(samp_dist_100, mean(statistic), 
          sd(statistic))

# A tibble: 1 × 2
  `mean(statistic)` `sd(statistic)`
              <dbl>           <dbl>
1             0.613          0.0469

Changing sample size

As the size of our sample increases, the variability of the sampling distribution decreases.
This makes sense: more data –> more precise estimates

Changing sample size

Both sampling distributions are still centered at \(p\), but
We can see that 95% of samples lie within different ranges for different \(n\)
As \(n\) increases, sampling variability (i.e. the standard error of the sampling distribution) decreases

What if we change the true parameter value?

# Construct the sampling distribution
samp_dist <- tabor %>%
  rep_sample_n(size = 20, reps = 1000) %>%
  group_by(replicate) %>%
  summarize(
    statistic = mean(Functional_Type == "BD")
    )

# Graph the sampling distribution
ggplot(data = samp_dist,
       mapping = aes(x = statistic)) +
    geom_histogram(bins = 13)

summarize(samp_dist, mean(statistic, na.rm = TRUE),
          sd(statistic, na.rm = TRUE))

# A tibble: 1 × 2
  `mean(statistic, na.rm = TRUE)` `sd(statistic, na.rm = TRUE)`
                            <dbl>                         <dbl>
1                           0.284                         0.101

A Word of Warning: The Shape of the Sampling Distribution

Sometimes, you need a large enough sample for your sampling distribution to look bell-shaped.

Example: forest biomass in Oregon

oregon_biomass %>%
  ggplot(aes(x = biomass)) + 
  geom_histogram(bins = 15)

A Word of Warning: The Shape of the Sampling Distribution

Sometimes, you need a large enough sample for your sampling distribution to look bell-shaped.

Example: forest biomass in Oregon

oregon_5 <- oregon_biomass %>%
  rep_sample_n(size = 5, reps = 1000)

oregon_5 %>%
  group_by(replicate) %>%
  summarize(statistic = mean(biomass)) %>%
  ggplot(aes(x = statistic)) +
  geom_histogram(bins = 10) +
  labs(title = "n = 5")

A Word of Warning: The Shape of the Sampling Distribution

Sometimes, you need a large enough sample for your sampling distribution to look bell-shaped.

Example: forest biomass in Oregon

oregon_20 <- oregon_biomass %>%
  rep_sample_n(size = 20, reps = 1000)

oregon_20 %>%
  group_by(replicate) %>%
  summarize(statistic = mean(biomass)) %>%
  ggplot(aes(x = statistic)) +
  geom_histogram(bins = 10) + 
  labs(title = "n = 20")

A Word of Warning: The Shape of the Sampling Distribution

Sometimes, you need a large enough sample for your sampling distribution to look bell-shaped.

Example: forest biomass in Oregon

oregon_100 <- oregon_biomass %>%
  rep_sample_n(size = 100, reps = 1000)

oregon_100 %>%
  group_by(replicate) %>%
  summarize(statistic = mean(biomass)) %>%
  ggplot(aes(x = statistic)) +
  geom_histogram(bins = 10) + 
  labs(title = "n = 100")

On Lab 6:

Will investigate what happens when we change the parameter of interest to a mean or a correlation coefficient!

Key Features of a Sampling Distribution

What did we learn about sampling distributions?

Centered around the true population parameter.
As the sample size increases, the standard error (SE) of the statistic decreases.
As the sample size increases, the shape of the sampling distribution becomes more bell-shaped and symmetric.

Question:

How do sampling distributions help us quantify uncertainty?

Question:

If I am estimating a parameter in a real example, why won’t I be able to construct the sampling distribution?

Cliffhanger:

How can I quantify uncertainty in a sample statistic when I only have access to one sample??

Announcements

Goals for Today

Statistical Inference

The ❤️ of statistical inference is quantifying uncertainty

The ❤️ of statistical inference is quantifying uncertainty

Statistical Inference

Plato’s allegory of the cave

Statistical Inference

Estimation

Sampling Distribution of a Statistic

Sampling Distribution of a Statistic

Let’s Construct Some Sampling Distributions using R!

New R Package: infer

New R Package: infer

Our Population Parameter

Population Parameter

Random Samples

Constructing the Sampling Distribution

Properties of the Sampling Distribution

Properties of the Sampling Distribution

Changing sample size

Changing sample size

A Word of Warning: The Shape of the Sampling Distribution

A Word of Warning: The Shape of the Sampling Distribution

A Word of Warning: The Shape of the Sampling Distribution

A Word of Warning: The Shape of the Sampling Distribution

On Lab 6:

Key Features of a Sampling Distribution

Question:

Question:

Cliffhanger:

Let’s Construct Some Sampling Distributions using `R`!

New `R` Package: `infer`

New `R` Package: `infer`