CLT-based inference

Grayson White

Math 141
Week 11 | Fall 2025

Logistics

Final exam format discussion.

Goals for Today

Learn theory-based statistical inference methods.

Theory-based inference for:
- a single mean
- a difference in means
- a difference in proportions

Statistical Inference Zoom Out – Estimation

Statistical Inference Zoom Out – Testing

Recap:

Z-score test statistics:

\[ \mbox{Z-score} = \frac{\mbox{statistic} - \mu}{\sigma} \]

Usually follows a standard normal or a t distribution.
Use the approximate distribution to find the p-value.

Recap:

Formula-Based P*100% Confidence Intervals

\[ \mbox{statistic} \pm z^* SE \]

where \(P(-z^* \leq Z \leq z^*) = P\)

Or we will see that sometimes we use a t critical value:

\[ \mbox{statistic} \pm t^* SE \]

where \(P(-t^* \leq t \leq t^*) = P\)

Recap: Probability Calculations in R

To help you remember:

Want a Probability?

→ use pnorm(), pt(), …

Want a Quantile (i.e. percentile)?

→ use qnorm(), qt(), …

Recap: Probability Calculations in R

Question: When might I want to do probability calculations in R?

Computed a test statistic whose null distribution is approximated by a named random variable. Want to compute the p-value with p---().
Compute a confidence interval. Want to find the critical value with q---().
To do a Sample Size Calculation.
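
As a quick illustration of how the p---() and q---() functions pair up, here is a minimal sketch in base R. The 95% confidence level and df = 52 are illustrative choices (df = 52 corresponds to a sample of 53), and the object names are made up for this sketch.

```r
# Quantile functions: confidence level in, critical value out
conf_level <- 0.95                      # illustrative choice
upper_tail <- 1 - (1 - conf_level) / 2  # 0.975 for a 95% interval

z_star <- qnorm(upper_tail)             # standard normal critical value
t_star <- qt(upper_tail, df = 52)       # t critical value, df = n - 1 with n = 53

# Probability functions: value in, area to the left out
area_z <- pnorm(z_star) - pnorm(-z_star)              # recovers conf_level
area_t <- pt(t_star, df = 52) - pt(-t_star, df = 52)  # recovers conf_level

c(z_star = z_star, t_star = t_star, area_z = area_z, area_t = area_t)
```

The t critical value comes out slightly larger than the normal one, which is why t-based intervals are a little wider.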

More CLT-based inference

Inference for a Single Mean

Example: Are lakes in Florida more acidic or alkaline? The pH of a liquid measures its acidity or alkalinity: pure water has a pH of 7, a pH greater than 7 is alkaline, and a pH less than 7 is acidic. The following dataset contains observations on a sample of 53 lakes in Florida.

library(tidyverse)
FloridaLakes <- read_csv("https://www.lock5stat.com/datasets1e/FloridaLakes.csv")

Cases:

Variable of interest:

Parameter of interest:

Hypotheses:

Inference for a Single Mean

Let’s consider conducting a hypothesis test for a single mean: \(\mu\)

Need:

Hypotheses
- Same as with the simulation-based methods
Test statistic and its null distribution
- Use a z-score test statistic and a t distribution
P-value
- Compute from the t distribution directly

Inference for a Single Mean

Let’s consider conducting a hypothesis test for a single mean: \(\mu\)

\(H_o: \mu = \mu_o\) where \(\mu_o\) = null value

\(H_a: \mu > \mu_o\) or \(H_a: \mu < \mu_o\) or \(H_a: \mu \neq \mu_o\)

By the CLT, under \(H_o\):

\[ \bar{x} \sim N \left(\mu_o, \frac{\sigma}{\sqrt{n}} \right) \]

Z-score test statistic:

\[ Z = \frac{\bar{x} - \mu_o}{\frac{\sigma}{\sqrt{n}}} \]

Problem: Don’t know \(\sigma\): the population standard deviation of our response variable!

Inference for a Single Mean

t test statistic (the z-score with \(s\) in place of \(\sigma\)):

\[ t = \frac{\bar{x} - \mu_o}{\frac{s}{\sqrt{n}}} \]

Problem: Don’t know \(\sigma\): the population standard deviation of our response variable!
- For our example, \(\sigma\) would be the standard deviation of the pH level for all lakes in Florida.
Solution: Plug in \(s\): the sample standard deviation of our response variable!
- For our example, \(s\) would be the standard deviation of the pH level for the sampled lakes in Florida.
Use \(t(\mbox{df} = n - 1)\) to find the p-value
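
To make the plug-in step concrete, here is a minimal sketch that computes the t statistic and its p-value by hand and checks them against base R's t.test(). The data are simulated (hypothetical values, not the FloridaLakes data), and the object names are made up for this sketch.

```r
set.seed(141)
# Hypothetical sample: 53 simulated pH-like values (NOT the FloridaLakes data)
ph_sim <- rnorm(53, mean = 6.6, sd = 1.3)

mu_0  <- 7               # null value
n     <- length(ph_sim)
x_bar <- mean(ph_sim)
s     <- sd(ph_sim)      # sample sd stands in for the unknown sigma

# Test statistic and two-sided p-value from t(df = n - 1)
t_stat  <- (x_bar - mu_0) / (s / sqrt(n))
p_value <- 2 * pt(-abs(t_stat), df = n - 1)

# Cross-check against base R's t.test()
check <- t.test(ph_sim, mu = mu_0)
c(by_hand = t_stat, t.test = unname(check$statistic))
c(by_hand = p_value, t.test = check$p.value)
```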

Inference for a Single Mean

library(infer)

# Compute observed test statistic
t_obs <- FloridaLakes %>%
  specify(response = pH) %>%
  hypothesize(null = "point", mu = 7) %>%  
  calculate(stat = "t")
t_obs

Response: pH (numeric)
Null Hypothesis: point
# A tibble: 1 × 1
   stat
  <dbl>
1 -2.31

# Generate null distribution
null_dist <- FloridaLakes %>%
 specify(response = pH) %>%
 hypothesize(null = "point", mu = 7) %>%
 generate(reps = 10000, type = "bootstrap") %>%
 calculate(stat = "t")

Inference for a Single Mean

What probability function is a good approximation to the null distribution?

null_dist %>%
  visualize(bins = 30) +
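  # Mark the observed statistic and its mirror image (two-sided test)
  # linewidth replaces the deprecated size argument (ggplot2 >= 3.4.0)
  geom_vline(xintercept = t_obs$stat,
             color = "deeppink",
             linewidth = 2) +
  geom_vline(xintercept = abs(t_obs$stat),
             color = "deeppink", 
             linewidth = 2)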

Inference for a Single Mean

What probability function is a good approximation to the null distribution?

null_dist %>%
  visualize(bins = 30, method = "both",
            dens_color = "orange") +
  # linewidth replaces the deprecated size argument (ggplot2 >= 3.4.0)
  geom_vline(xintercept = t_obs$stat,
             color = "deeppink",
             linewidth = 2) +
  geom_vline(xintercept = abs(t_obs$stat),
             color = "deeppink", 
             linewidth = 2)

P-value options

P-value using the generated null distribution:

pvalue <- null_dist %>%
  get_p_value(obs_stat = t_obs,
              direction = "both")
pvalue

# A tibble: 1 × 1
  p_value
    <dbl>
1  0.0218

P-value using an approximate probability function:

# Using t distribution (df = n - 1 = 52); two-sided, so double the tail area
pt(q = -abs(t_obs$stat), df = 52)*2

         t 
0.02468707

Do-it-all function:

t_test(FloridaLakes, response = pH, mu = 7,
       alternative = "two-sided")

# A tibble: 1 × 7
  statistic  t_df p_value alternative estimate lower_ci upper_ci
      <dbl> <dbl>   <dbl> <chr>          <dbl>    <dbl>    <dbl>
1     -2.31    52  0.0247 two.sided       6.59     6.24     6.95
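
The pieces of this output tie back to the formulas from the recap. Here is a minimal sketch that backs out the standard error from the printed values and rebuilds the p-value and the 95% confidence interval. The inputs are the rounded numbers printed above, so the results agree with the t_test() output only up to rounding.

```r
# Values copied from the t_test() output above (rounded as printed)
estimate  <- 6.59    # sample mean pH
statistic <- -2.31   # t statistic
df        <- 52      # n - 1
mu_0      <- 7       # null value

# Back out the standard error: statistic = (estimate - mu_0) / SE
se <- (estimate - mu_0) / statistic

# Two-sided p-value from the t distribution
p_value <- 2 * pt(-abs(statistic), df = df)

# 95% confidence interval: estimate +/- t* x SE
t_star <- qt(0.975, df = df)
ci <- estimate + c(-1, 1) * t_star * se

round(c(se = se, p_value = p_value, lower = ci[1], upper = ci[2]), 3)
```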

Recall the CLT:

Central Limit Theorem (CLT): For random samples and a large sample size \((n)\), the sampling distribution of many sample statistics is approximately normal.

Sample Proportion Version:

When \(n\) is large (at least 10 successes and 10 failures):

\[ \hat{p} \sim N \left(p,~ \sqrt{\frac{p(1-p)}{n}} \right) \]

Sample Mean Version:

When \(n\) is large (at least 30):

\[ \bar{x} \sim N \left(\mu,~ \frac{\sigma}{\sqrt{n}} \right) \]
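
A small simulation can make the sample mean version tangible. This is a sketch under assumed settings: the population is exponential with rate 1 (so \(\mu = 1\) and \(\sigma = 1\), strongly right-skewed), with \(n = 30\) and 10,000 simulated samples. None of these choices come from the lecture data.

```r
set.seed(141)
n    <- 30       # sample size at the "large n" threshold for means
reps <- 10000    # number of simulated samples

# Population: exponential with rate 1, so mu = 1 and sigma = 1
sample_means <- replicate(reps, mean(rexp(n, rate = 1)))

clt_se <- 1 / sqrt(n)   # sigma / sqrt(n) from the CLT

# Share of sample means within 1.96 standard errors of mu;
# roughly 0.95 if the normal approximation is reasonable
coverage <- mean(abs(sample_means - 1) < 1.96 * clt_se)

c(mean_of_means = mean(sample_means),   # should be close to mu = 1
  sd_of_means   = sd(sample_means),     # should be close to clt_se
  clt_se        = clt_se,
  coverage      = coverage)
```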

There Are Several Versions of the CLT!

Response	Explanatory	Numerical Quantity	Parameter	Statistic
quantitative	-	mean	\(\mu\)	\(\bar{x}\)
categorical	-	proportion	\(p\)	\(\hat{p}\)
quantitative	categorical	difference in means	\(\mu_1 - \mu_2\)	\(\bar{x}_1 - \bar{x}_2\)
categorical	categorical	difference in proportions	\(p_1 - p_2\)	\(\hat{p}_1 - \hat{p}_2\)
quantitative	quantitative	correlation	\(\rho\)	\(r\)

Refer to these tables for:
- CLT’s “large sample” assumption
- Equation for the test statistic
- Equation for the confidence interval

Let’s cover examples of theory-based inference for two variables.

Data Example

We have data on a random sub-sample of the 2010 American Community Survey. The American Community Survey is given every year to a random sample of US residents.

# Libraries
library(tidyverse)
library(Lock5Data)

# Data
data(ACS)
# Focus on adults
ACS_adults <- filter(ACS, Age >= 18)

glimpse(ACS_adults)

Rows: 1,936
Columns: 9
$ Sex             <int> 0, 1, 0, 0, 1, 1, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 1, 1, …
$ Age             <int> 38, 18, 21, 55, 51, 28, 46, 80, 62, 41, 37, 42, 69, 48…
$ Married         <int> 1, 0, 0, 1, 0, 0, 0, 0, 1, 1, 0, 0, 0, 1, 1, 1, 0, 0, …
$ Income          <dbl> 64.0, 0.0, 4.0, 34.0, 30.0, 13.7, 114.0, 0.0, 0.0, 0.0…
$ HoursWk         <int> 40, 0, 20, 40, 40, 40, 60, 0, 0, 0, 40, 42, 0, 60, 0, …
$ Race            <fct> white, black, white, other, black, white, white, white…
$ USCitizen       <int> 1, 1, 1, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, …
$ HealthInsurance <int> 1, 1, 1, 0, 1, 0, 1, 1, 1, 1, 0, 1, 1, 1, 1, 1, 1, 0, …
$ Language        <int> 1, 1, 1, 0, 1, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, …

Difference in Proportions

Let’s try to determine if there’s a relationship between US citizenship and marriage status.

Response variable:

Explanatory variable:

Parameter of interest:

Sample size requirement for theory-based inference:

Difference in Proportions

Let’s try to determine if there’s a relationship between US citizenship and marriage status.

# Exploratory data analysis
ggplot(data = ACS_adults, 
       mapping = aes(x = factor(USCitizen),
                     fill  = factor(Married))) +
  geom_bar(position = "fill")

# Sample size
ACS_adults %>%
  count(Married, USCitizen)

  Married USCitizen   n
1       0         0  64
2       0         1 832
3       1         0  79
4       1         1 961

Difference in Proportions

Let’s try to determine if there’s a relationship between US citizenship and marriage status.

Why is prop_test() failing?

library(infer)
ACS_adults %>%
  prop_test(Married ~ USCitizen, 
            order = c("1", "0"), z = TRUE,
            success = "1")

Error in `prop_test()`:
! The response variable of `Married` is not appropriate since the
  response variable is expected to be categorical.

Difference in Proportions

Let’s try to determine if there’s a relationship between US citizenship and marriage status.

ACS_adults %>%
  mutate(MarriedCat = case_when(Married == 0 ~ "No",
                                Married == 1 ~ "Yes"),
         USCitizenCat = case_when(USCitizen == 0 ~ "Not citizen",
                                  USCitizen == 1 ~ "Citizen")) %>%
  prop_test(MarriedCat ~ USCitizenCat, 
            order = c("Citizen", "Not citizen"), z = TRUE,
            success = "Yes")

# A tibble: 1 × 5
  statistic p_value alternative lower_ci upper_ci
      <dbl>   <dbl> <chr>          <dbl>    <dbl>
1    -0.380   0.704 two.sided     -0.101   0.0682
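
To see where these numbers come from, here is a minimal sketch that redoes the calculation by hand from the counts in the count(Married, USCitizen) table shown earlier. It assumes the test statistic uses a pooled standard error (because \(H_o\) says \(p_1 = p_2\)) and the confidence interval uses an unpooled one. The results match the output above to rounding, which is consistent with that assumption but does not confirm what prop_test() does internally.

```r
# Counts from the count(Married, USCitizen) table shown earlier
married_citizen     <- 961
n_citizen           <- 961 + 832
married_non_citizen <- 79
n_non_citizen       <- 79 + 64

# Sample size check: at least 10 successes and 10 failures in each group
large_enough <- all(c(961, 832, 79, 64) >= 10)

p_hat_1  <- married_citizen / n_citizen          # citizens
p_hat_2  <- married_non_citizen / n_non_citizen  # non-citizens
diff_hat <- p_hat_1 - p_hat_2                    # citizen minus non-citizen

# Test statistic: pooled proportion in the SE (assumption, see lead-in)
p_pool  <- (married_citizen + married_non_citizen) / (n_citizen + n_non_citizen)
se_pool <- sqrt(p_pool * (1 - p_pool) * (1 / n_citizen + 1 / n_non_citizen))
z       <- diff_hat / se_pool
p_value <- 2 * pnorm(-abs(z))

# 95% confidence interval: unpooled SE (assumption, see lead-in)
se_unpooled <- sqrt(p_hat_1 * (1 - p_hat_1) / n_citizen +
                    p_hat_2 * (1 - p_hat_2) / n_non_citizen)
ci <- diff_hat + c(-1, 1) * qnorm(0.975) * se_unpooled

round(c(z = z, p_value = p_value, lower = ci[1], upper = ci[2]), 3)
```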

Difference in Means

Let’s estimate the difference in average hours worked per week between married and unmarried US residents.

Response variable:

Explanatory variable:

Parameter of interest:

Sample size requirement for theory-based inference:

Difference in Means

Let’s estimate the difference in average hours worked per week between married and unmarried US residents.

# Exploratory data analysis
ggplot(data = ACS_adults, mapping = aes(x = HoursWk)) +
  geom_histogram() +
  facet_wrap(~Married, ncol = 1)

# Sample size
ACS_adults %>%
  drop_na(HoursWk) %>%
  count(Married)

  Married    n
1       0  896
2       1 1040

Difference in Means

Let’s estimate the difference in average hours worked per week between married and unmarried US residents.

Which arguments for t_test() reflect my research question?

library(infer)
ACS_adults %>%
  t_test(HoursWk ~ Married, order = c("1", "0"))

# A tibble: 1 × 7
  statistic  t_df    p_value alternative estimate lower_ci upper_ci
      <dbl> <dbl>      <dbl> <chr>          <dbl>    <dbl>    <dbl>
1      4.81 1902. 0.00000160 two.sided       4.55     2.69     6.40

library(infer)
ACS_adults %>%
  t_test(HoursWk ~ Married, order = c("1", "0"),
         alternative = "greater")

# A tibble: 1 × 7
  statistic  t_df     p_value alternative estimate lower_ci upper_ci
      <dbl> <dbl>       <dbl> <chr>          <dbl>    <dbl>    <dbl>
1      4.81 1902. 0.000000800 greater         4.55     2.99      Inf
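
The two outputs differ only in the p-value and the interval. Here is a minimal sketch of how they relate, rebuilt from the rounded values printed above, so the results agree with the t_test() output only up to rounding. The object names are made up for this sketch.

```r
# Values copied from the two t_test() outputs above (rounded as printed)
estimate  <- 4.55    # difference in mean hours, married minus unmarried
statistic <- 4.81    # t statistic
df        <- 1902    # degrees of freedom as printed

# Null value is 0, so statistic = estimate / SE
se <- estimate / statistic

# Two-sided vs. one-sided ("greater") p-values
p_two_sided <- 2 * pt(-abs(statistic), df = df)
p_greater   <- pt(statistic, df = df, lower.tail = FALSE)

# Two-sided 95% interval vs. one-sided 95% lower bound
ci_two_sided  <- estimate + c(-1, 1) * qt(0.975, df = df) * se
lower_greater <- estimate - qt(0.95, df = df) * se

signif(c(p_two_sided = p_two_sided, p_greater = p_greater), 3)
round(c(lower = ci_two_sided[1], upper = ci_two_sided[2],
        lower_greater = lower_greater), 2)
```

Because the observed statistic is positive, the "greater" p-value is half the two-sided one, and the one-sided interval trades its upper bound for a slightly higher lower bound.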

Correlation

We want to determine if age and hours worked per week have a positive linear relationship.

Response variable:

Explanatory variable:

Parameter of interest:

Sample size requirement for theory-based inference:

Correlation

We want to determine if age and hours worked per week have a positive linear relationship.

# Exploratory data analysis
ggplot(data = ACS_adults, 
       mapping = aes(x = Age,
                     y  = HoursWk)) +
  geom_jitter(alpha = 0.5) +
  geom_smooth()

Correlation

We want to determine if age and hours worked per week have a positive linear relationship.

cor.test(~ HoursWk + Age, data = ACS_adults, alternative = "greater")


    Pearson's product-moment correlation

data:  HoursWk and Age
t = -17.007, df = 1934, p-value = 1
alternative hypothesis: true correlation is greater than 0
95 percent confidence interval:
 -0.3927809  1.0000000
sample estimates:
      cor 
-0.360684
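
A p-value of 1 can look surprising, so here is a minimal sketch that rebuilds the test statistic from the printed correlation using the standard formula \(t = r\sqrt{n-2}/\sqrt{1-r^2}\) and compares the two one-sided p-values. The inputs are copied from the cor.test() output above; the object names are made up for this sketch.

```r
# Values copied from the cor.test() output above
r  <- -0.360684   # sample correlation
df <- 1934        # n - 2, so n = 1936

# t statistic for testing rho = 0
t_stat <- r * sqrt(df) / sqrt(1 - r^2)

# One-sided p-values from t(df = n - 2)
p_greater <- pt(t_stat, df = df, lower.tail = FALSE)  # H_a: rho > 0
p_less    <- pt(t_stat, df = df)                      # H_a: rho < 0

c(t = t_stat, p_greater = p_greater, p_less = p_less)
```

The sample correlation is negative, so the data give no evidence for a positive linear relationship: nearly all of the null distribution lies above the observed statistic, and the "greater" p-value is essentially 1.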

Correlation

We want to determine if age and hours worked per week have a positive linear relationship.

# Exploratory data analysis
ggplot(data = ACS_adults, 
       mapping = aes(x = Age,
                     y  = HoursWk)) +
  geom_jitter(alpha = 0.5) +
  geom_smooth()