SLR II: Regression Assumptions
Grayson White
Math 141
Week 4 | Fall 2025
\[ y = f(x) + \epsilon \]
Goal:
Determine a reasonable form for \(f()\). (Ex: Line, curve, …)
Estimate \(f()\) with \(\hat{f}()\) using the data.
Generate predicted values: \(\hat y = \hat{f}(x)\).
\[ y = \beta_0 + \beta_1 x + \epsilon \]
Consider this model when:
Response variable \((y)\): quantitative
Explanatory variable \((x)\): quantitative
AND, \(f()\) can be approximated by a line.
Need to determine the best estimates of \(\beta_0\) and \(\beta_1\).
\[ y = \beta_0 + \beta_1 x + \epsilon \]
\[ \hat{y} = \hat{\beta}_0 + \hat{\beta}_1 x \]
Recall our modeling goal: predict win percentage by using the price percentage variable.
library(tidyverse)

# Read the candy data and put pricepercent on a 0-100 scale
candy <- read_csv("https://raw.githubusercontent.com/fivethirtyeight/data/master/candy-power-ranking/candy-data.csv") %>%
  mutate(pricepercent = pricepercent * 100)
ggplot(data = candy,
mapping = aes(x = pricepercent,
y = winpercent)) +
geom_point(alpha = 0.6, size = 4,
color = "chocolate4") +
geom_smooth(method = "lm", se = FALSE,
color = "deeppink2")
Want the residuals \(e_i = y_i - \hat{y}_i\) to be small.
Minimize a function of the residuals.
Minimize:
\[ \sum_{i = 1}^n e^2_i \]
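As a quick illustration (a sketch, assuming the candy data loaded above), we can compare the sum of squared residuals for different candidate lines:

# Sum of squared residuals for a candidate line with intercept b0 and slope b1
sse <- function(b0, b1) {
  sum((candy$winpercent - (b0 + b1 * candy$pricepercent))^2)
}

# Two candidate lines give different sums of squared residuals;
# the least squares line is the one that makes this quantity as small as possible
sse(40, 0.2)
sse(30, 0.5)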
After minimizing the sum of squared residuals, we get the following equations:
\[ \begin{align} \hat{\beta}_1 &= \frac{ \sum_{i = 1}^n (x_i - \bar{x}) (y_i - \bar{y})}{ \sum_{i = 1}^n (x_i - \bar{x})^2} \\ \hat{\beta}_0 &= \bar{y} - \hat{\beta}_1 \bar{x} \end{align} \] where
\[ \begin{align} \bar{y} = \frac{1}{n} \sum_{i = 1}^n y_i \quad \mbox{and} \quad \bar{x} = \frac{1}{n} \sum_{i = 1}^n x_i \end{align} \]
Then we can estimate the whole function with:
\[ \hat{y} = \hat{\beta}_0 + \hat{\beta}_1 x \]
Called the least squares line or the line of best fit.
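As a sketch (assuming the candy data frame loaded earlier), these formulas can be computed directly in R:

# Computing the least squares estimates "by hand" from the formulas above
x <- candy$pricepercent
y <- candy$winpercent

beta1_hat <- sum((x - mean(x)) * (y - mean(y))) / sum((x - mean(x))^2)
beta0_hat <- mean(y) - beta1_hat * mean(x)

c(intercept = beta0_hat, slope = beta1_hat)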
We can use the lm() function to construct the simple linear regression model in R.
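One way to fit the model and produce a coefficient table in the format shown below is with the moderndive package (a sketch; the model object name candy_mod and the use of get_regression_table() are assumptions):

library(moderndive)

# Fit the simple linear regression of win percentage on price percentage
candy_mod <- lm(winpercent ~ pricepercent, data = candy)

# Tidy summary table of the estimated coefficients
get_regression_table(candy_mod)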
# A tibble: 2 × 7
term estimate std_error statistic p_value lower_ci upper_ci
<chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 intercept 41.979 2.908 14.435 0 36.195 47.763
2 pricepercent 0.178 0.053 3.352 0.001 0.072 0.283
What is the fitted model form?
\[\begin{align*} \hat{y} &= \hat{\beta}_0 + \hat{\beta}_1 \times x_{pricepercent} \\ &= 41.979 + 0.178 \times x_{pricepercent} \end{align*}\]
How do we interpret the coefficients?
\[\begin{align*} \hat{y} &= \hat{\beta}_0 + \hat{\beta}_1 \times x_{pricepercent} \\ &= 41.979 + 0.178 \times x_{pricepercent} \end{align*}\]
We need to be precise and careful when interpreting estimated coefficients!
Intercept: We expect/predict \(y\) to be \(\hat{\beta}_0\) on average when \(x = 0\).
Slope: For a one-unit increase in \(x\), we expect/predict \(y\) to change by \(\hat{\beta}_1\) units on average.
These generic interpretations are not specific to the context of our model; when interpreting coefficients, we always need to interpret them in context.
\[\begin{align*} \hat{y} &= \hat{\beta}_0 + \hat{\beta}_1 \times x_{pricepercent} \\ &= 41.979 + 0.178 \times x_{pricepercent} \end{align*}\]
Intercept: We expect/predict a candy’s win percentage to be 41.979 on average when its price percentage is 0.
Slope: For a one-unit increase in price percentage, we expect/predict the win percentage of a candy to change by 0.178 units on average.
Predicted win percentages for three new price percentage values:
1 2 3
46.42443 57.09409 68.65289
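Predictions like these can be generated with predict(); a minimal sketch (the new price percentages below are hypothetical, and candy_mod is the model object from the earlier sketch):

# Predictions for new price percentages (hypothetical values)
new_candies <- data.frame(pricepercent = c(25, 60, 90))
predict(candy_mod, newdata = new_candies)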
Be careful to only make predictions within the range of \(x\) values in the sample (avoid extrapolation).
Make sure to investigate outliers: observations that fall far from the cloud of points.
# A tibble: 2 × 7
term estimate std_error statistic p_value lower_ci upper_ci
<chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 intercept 41.979 2.908 14.435 0 36.195 47.763
2 pricepercent 0.178 0.053 3.352 0.001 0.072 0.283
\[\begin{align*} \hat{y} &= \hat{\beta}_0 + \hat{\beta}_1 \times x_{pricepercent} \\ &= 41.979 + 0.178 \times x_{pricepercent} \end{align*}\]
What assumptions have we made?
We can always find the line of best fit to explore data, but…
To make accurate predictions or inferences, certain conditions should be met.
To responsibly use linear regression tools for prediction or inference, we require:
Linearity: The relationship between the explanatory and response variables must be approximately linear.
Independence: The observations should be independent of one another.
Normality: The distribution of the residuals should be approximately bell-shaped, unimodal, symmetric, and centered at 0 at every “slice” of the explanatory variable.
Equal Variability: The variance of the residuals should be roughly constant across the data set (also called “homoscedasticity”); models that violate this assumption are sometimes called “heteroscedastic”.
Linearity
Independence
Normality
Equal Variability
To assess whether we’ve met these conditions, we will use four common diagnostic plots.
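For instance, one common diagnostic plots the residuals against the fitted values; a minimal sketch, using the candy_mod object from the earlier sketch:

# Residuals vs. fitted values: look for no pattern (linearity)
# and roughly constant spread (equal variability)
diag_df <- data.frame(fitted   = fitted(candy_mod),
                      residual = residuals(candy_mod))

ggplot(diag_df, aes(x = fitted, y = residual)) +
  geom_point(alpha = 0.6) +
  geom_hline(yintercept = 0, linetype = "dashed") +
  labs(x = "Fitted values", y = "Residuals")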
Let’s check if this model meets the LINE assumptions.
Linearity: ✅
Independence: ✅
Equal Variability: ✅
Predicting miles per gallon from engine displacement
Linearity: ❌
Independence: ❌
Equal Variability: ❓
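A sketch of what such a model might look like, assuming the built-in mtcars data (the data set used in the slides may differ):

mpg_mod <- lm(mpg ~ disp, data = mtcars)
plot(mpg_mod, which = 1)  # residuals vs. fitted: curvature indicates non-linearity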
Remember the example from last time:
What do diagnostics look like when we fit the teal model?
gglm::gglm()
You can see many of the diagnostic plots at once with the gglm() function from the gglm package.
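A minimal sketch, assuming gglm() accepts a fitted model object such as the candy_mod object from the earlier sketch:

library(gglm)

# Draw the grid of diagnostic plots for the fitted model
gglm(candy_mod)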
Let’s use that function to diagnose the pink model.