

Quantifying Uncertainty II
Grayson White
Math 141
Week 7 | Fall 2025
# A tibble: 1 × 1
  meanFINCBTAX
         <dbl>
1       62480.
As with regression, we need to distinguish between the population and the sample.
Goal: Draw conclusions about the population based on the sample.

Main Flavors
Estimating numerical quantities (parameters).
Testing conjectures.
Goal: Estimate a (population) parameter.
Best guess?
Key Question: How accurate is the statistic as an estimate of the parameter?
Helpful Sub-Question: If we take many samples, how much would the statistic vary from sample to sample?
Need two new concepts:
The sampling variability of a statistic
The sampling distribution of a statistic
Steps to Construct an (Approximate) Sampling Distribution:
1. Decide on a sample size, \(n\).
2. Randomly select a sample of size \(n\) from the population.
3. Compute the sample statistic.
4. Put the sample back in.
5. Repeat Steps 2 - 4 many (1000+) times.
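The steps above can be sketched in base R. This is a toy illustration, not the class example: the population here is a made-up vector of 10,000 normal values, and the sample size is chosen arbitrarily.

```r
set.seed(141)

# A toy "population" that we happen to fully observe
population <- rnorm(10000, mean = 50, sd = 10)

# Steps 1-4, repeated 1000 times: each repetition draws a fresh
# sample of size n from the population and records its mean
n <- 20
stats <- replicate(1000, mean(sample(population, size = n)))

# The 1000 sample means approximate the sampling distribution
hist(stats, main = "Approximate sampling distribution of the sample mean")
mean(stats)  # close to the population mean of about 50
sd(stats)    # the (approximate) standard error
```

The collection `stats` is what the rest of these slides call an approximate sampling distribution.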

Center? Shape? Spread?
What happens to the center/spread/shape as we increase the sample size?
What happens to the center/spread/shape if the true parameter changes?
Important Notes
To construct a sampling distribution for a statistic, we need access to the entire population so that we can take repeated samples from the population.
But if we have access to the entire population, then we know the value of the population parameter.
The sampling distribution is needed in the exact scenario where we can’t compute it: the scenario where we only have a single sample.
We will learn how to estimate the sampling distribution soon.
Today, we have the entire population and are constructing sampling distributions anyway to study their properties!
R Package: infer
We use the infer package to conduct statistical inference.
Create a data frame of Mt Tabor trees:
Add variable of interest:
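We don't reproduce the actual Mt Tabor tree data here; the following is a hypothetical stand-in (the species names, proportions, and population size are made up) showing how the `tree_of_interest` variable could be added:

```r
set.seed(141)

# Hypothetical stand-in for the Mt Tabor tree population
tabor <- data.frame(
  species = sample(c("Douglas-fir", "Bigleaf maple", "Western redcedar"),
                   size = 500, replace = TRUE,
                   prob = c(0.4, 0.35, 0.25))
)

# Add the variable of interest: is this tree the species we care about?
tabor$tree_of_interest <- ifelse(tabor$species == "Douglas-fir", "yes", "no")

# The population parameter p is the proportion of "yes" trees
p <- mean(tabor$tree_of_interest == "yes")
p
```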
Let’s look at 4 random samples.
Now, let’s take 1000 random samples.


The standard deviation of a sample statistic is called the standard error.
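In code, the standard error is just the standard deviation of the replicate statistics. A minimal sketch, simulating the replicate sample proportions directly with `rbinom()` rather than resampling a real population (the values of `p`, `n`, and the number of replicates are made up):

```r
set.seed(141)
p <- 0.4   # hypothetical population proportion
n <- 20    # sample size

# 1000 replicate sample proportions
statistics <- rbinom(1000, size = n, prob = p) / n

# The standard error is the standard deviation of the statistic
se <- sd(statistics)
se

# Compare with the theoretical value sqrt(p * (1 - p) / n)
sqrt(p * (1 - p) / n)
```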

For approximately bell-shaped distributions, about 95% of observations fall within 1.96 standard deviations of the mean.
Huge Implication:
The sampling distribution of the sample proportion, \(\hat p\), is approximately bell-shaped
… and is centered at the population proportion, \(p\).
So 95% of all sample statistics, \(\hat p\), fall within 1.96 standard errors of \(p\)!
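We can check this claim by simulation. A sketch, again generating the sample proportions directly with `rbinom()` (the values of `p` and `n` are made up for illustration):

```r
set.seed(141)
p <- 0.4
n <- 100

# 1000 replicate sample proportions and their standard error
p_hat <- rbinom(1000, size = n, prob = p) / n
se <- sd(p_hat)

# Fraction of sample proportions within 1.96 SEs of p: near 0.95
mean(abs(p_hat - p) <= 1.96 * se)
```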
What happens to the sampling distribution if we change the sample size from 20 to 100?
library(infer)
library(tidyverse)

# Construct the sampling distribution
samp_dist_100 <- tabor %>%
  rep_sample_n(size = 100, reps = 1000) %>%
  group_by(replicate) %>%
  summarize(
    statistic = mean(tree_of_interest == "yes")
  )

# Graph the sampling distribution
ggplot(data = samp_dist_100,
       mapping = aes(x = statistic)) +
  geom_histogram(bins = 13)
As the size of our sample increases, the variability of the sampling distribution decreases.
This makes sense: more data -> more precise estimates.

Both sampling distributions are still centered at \(p\), but
We can see that 95% of samples lie within different ranges for different \(n\)
As \(n\) increases, sampling variability (i.e. the standard error of the sampling distribution) decreases
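This decrease can be seen numerically. A sketch comparing the simulated standard error at \(n = 20\) versus \(n = 100\), again using `rbinom()` with a made-up population proportion:

```r
set.seed(141)
p <- 0.4  # hypothetical population proportion

# Simulated standard error of the sample proportion for a given n
se_for_n <- function(n, reps = 1000) {
  sd(rbinom(reps, size = n, prob = p) / n)
}

se_20  <- se_for_n(20)
se_100 <- se_for_n(100)
c(se_20 = se_20, se_100 = se_100)  # se_100 is smaller
```

Quintupling the sample size shrinks the standard error by roughly a factor of \(\sqrt{5}\).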
What if we change the true parameter value?
Sometimes, you need a large enough sample for your sampling distribution to look bell-shaped.
Example: forest biomass in Oregon
We will investigate what happens when we change the parameter of interest to a mean or a correlation coefficient!
What did we learn about sampling distributions?
Centered around the true population parameter.
As the sample size increases, the standard error (SE) of the statistic decreases.
As the sample size increases, the shape of the sampling distribution becomes more bell-shaped and symmetric.
How do sampling distributions help us quantify uncertainty?
If I am estimating a parameter in a real example, why won’t I be able to construct the sampling distribution?
How can I quantify uncertainty in a sample statistic when I only have access to one sample?