Let’s add the regression line to our plot from before:
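As a rough sketch (assuming, as in the later slides, a data frame oregon with columns canopy_cover and biomass), the line can be added with ggplot2's geom_smooth:

library(ggplot2)

# scatterplot from before, with the least-squares regression line overlaid
ggplot(oregon, aes(x = canopy_cover, y = biomass)) +
  geom_point() +
  geom_smooth(method = "lm", se = FALSE)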
Recall: Linear Regression Assumptions
We can always find the line of best fit to explore data, but…
To make accurate predictions or inferences, certain conditions should be met.
To responsibly use linear regression tools for prediction or inference, we require:
Linearity: The relationship between explanatory and response variables must be approximately linear
Check using scatterplot of data, or residual plot
Independence: The observations should be independent of one another.
Check by considering the data context, and by looking at residual scatterplots too
Normality: The distribution of residuals should be approximately bell-shaped, unimodal, symmetric, and centered at 0 at every “slice” of the explanatory variable
Simple check: look at histogram of residuals
Better to use a “Q-Q plot”
Equal Variability: Variance of residuals should be roughly constant across the data set. Also called “homoscedasticity”; models that violate this assumption are sometimes called “heteroscedastic”
Check using residual plot (a sketch of these diagnostic checks follows this list).
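A minimal sketch of these diagnostic checks in R, assuming the model biomass ~ canopy_cover fit to the oregon data used later in these slides:

model <- lm(biomass ~ canopy_cover, data = oregon)
diagnostics <- data.frame(fitted = fitted(model), residual = resid(model))

library(ggplot2)

# residual plot: check linearity and equal variability
ggplot(diagnostics, aes(x = fitted, y = residual)) +
  geom_point() +
  geom_hline(yintercept = 0)

# histogram of residuals: simple check of normality
ggplot(diagnostics, aes(x = residual)) +
  geom_histogram()

# Q-Q plot: better check of normality
ggplot(diagnostics, aes(sample = residual)) +
  geom_qq() +
  geom_qq_line()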
1) Linearity
Linearity: The relationship between explanatory and response variables must be approximately linear.
But these results depend on all LINE conditions being met!
…and they weren’t
Simulation-based inference for regression
Conducting simulation-based inference for regression
We can use bootstrap (and permutation) methods to conduct simulation-based inference for regression.
This approach is very flexible:
Only requires Linearity and Independence.
Residuals do not need to be normally distributed or have equal variability!
Needs some extra code (we get to use infer!)
Simulation-Based Inference for Regression
To conduct valid inference on regression parameters using simulation-based methods, we need:
Linearity
Independence
A “bigger” sample size
Same types of inference, just no “theory-based sampling distribution”
Bootstrapping in Regression
To approximate variability in our regression coefficients, \(\hat\beta_0\) and \(\hat\beta_1\) for this simple linear regression example, we bootstrap our sample!
i.e., we sample, with replacement, rows from our original data.
For each bootstrap sample, we calculate a new linear model
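To make the resampling concrete, here is a rough base-R sketch of a single bootstrap iteration (the infer workflow below automates this):

set.seed(123)

# sample row indices with replacement, same size as the original data
boot_rows <- sample(nrow(oregon), size = nrow(oregon), replace = TRUE)
boot_data <- oregon[boot_rows, ]

# refit the linear model on the bootstrap sample
boot_model <- lm(biomass ~ canopy_cover, data = boot_data)
coef(boot_model)  # one bootstrapped (intercept, slope) pair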
oregon %>%
  specify(biomass ~ canopy_cover) %>%
  generate(reps = 2, type = "bootstrap")
Red line: Original regression line
Blue line: Bootstrap regression line
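A sketch of how a plot like this could be drawn with ggplot2, assuming the bootstrap samples generated above (the replicate column is created by generate()):

library(dplyr)
library(infer)
library(ggplot2)

boot_samples <- oregon %>%
  specify(biomass ~ canopy_cover) %>%
  generate(reps = 2, type = "bootstrap")

ggplot(oregon, aes(x = canopy_cover, y = biomass)) +
  geom_point() +
  # blue: one regression line per bootstrap sample
  geom_smooth(data = boot_samples, aes(group = replicate),
              method = "lm", se = FALSE, color = "blue") +
  # red: regression line fit to the original data
  geom_smooth(method = "lm", se = FALSE, color = "red")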
Bootstrapping in Regression
Full infer workflow: bootstrapping sampling distributions for coefficients
boot_coefs <- oregon %>%                 # data
  specify(biomass ~ canopy_cover) %>%    # linear regression model
  generate(reps = 1000,                  # number of bootstraps
           type = "bootstrap") %>%
  fit()                                  # new function! fits the regression models on each bootstrap sample

head(boot_coefs)
nrow(boot_coefs) # number of rows = reps * coefficients
[1] 2000
Confidence intervals
We use the bootstrap distributions to produce confidence intervals.
Confidence intervals
For the intercept:
ci_intercept <- boot_coefs %>%
  ungroup() %>%                      # by default, infer groups by rep so we need to ungroup
  filter(term == "intercept") %>%    # keep only the intercept
  summarize(lower = quantile(estimate, probs = 0.025),
            upper = quantile(estimate, probs = 0.975))

ci_intercept
For the slope (canopy_cover):
ci_coef <- boot_coefs %>%
  ungroup() %>%                         # by default, infer groups by rep so we need to ungroup
  filter(term == "canopy_cover") %>%    # keep only the coefficient you are interested in
  summarize(lower = quantile(estimate, probs = 0.025),
            upper = quantile(estimate, probs = 0.975))

ci_coef
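As an optional follow-up, one quick way to look at the bootstrap distributions behind these intervals (a sketch with ggplot2, faceting on the term column produced by fit()):

library(ggplot2)

# one histogram per coefficient (intercept and canopy_cover)
ggplot(boot_coefs, aes(x = estimate)) +
  geom_histogram() +
  facet_wrap(~ term, scales = "free_x")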