Multiple Linear Regression



Grayson White

Math 141
Week 5 | Fall 2025

Announcements

Goals for Today

  • Handling categorical, explanatory variables and quantitative variables at the same time.
  • Equal slopes model
  • Different slopes model

Multiple Linear Regression

Form of the Model:

\[ \begin{align} y &= \beta_o + \beta_1 x_1 + \beta_2 x_2 + \cdots + \beta_p x_p + \epsilon \end{align} \]

How does extending to more predictors change our process?

  • What doesn’t change:
    • Still use Method of Least Squares to estimate coefficients
    • Still use lm() to fit the model and predict() for prediction
  • What does change:
    • Meaning of the coefficients are more complicated and depend on other variables in the model
    • Need to decide which variables to include and how (linear term, squared term…)

Multiple Linear Regression

  • We are going to see a few examples of multiple linear regression this week.

  • We will need to return to modeling later in the course once we have learned about statistical inference (i.e., confidence intervals and p-values).

Example

Meadowfoam is a plant that grows in the Pacific Northwest and is harvested for its seed oil. In a randomized experiment, researchers at Oregon State University looked at how two light-related factors influenced the number of flowers per meadowfoam plant, the primary measure of productivity for this plant. The two light measures were light intensity (in mmol/ \(m^2\) /sec) and the timing of onset of the light (early or late in terms of photo periodic floral induction).

Response variable?

Explanatory variables?

Model Form?

Data Loading and Wrangling

library(tidyverse)
library(Sleuth3)
data(case0901)

# Recode the timing variable
count(case0901, Time)
  Time  n
1    1 12
2    2 12
case0901 <- case0901 %>%
  mutate(TimeCat = case_when(
    Time == 1 ~ "Late",
    Time == 2 ~ "Early"
    )) 
count(case0901, TimeCat)
  TimeCat  n
1   Early 12
2    Late 12

Visualizing the Data

ggplot(case0901,
       aes(x = Intensity,
           y = Flowers,
           color = TimeCat)) + 
  geom_point(size = 4)

Why don’t I have to include data = and mapping = in my ggplot() layer?

Building the Linear Regression Model

Full model form:

modFlowers <- lm(Flowers ~ Intensity + TimeCat, data = case0901)

library(moderndive)
get_regression_table(modFlowers)
# A tibble: 3 × 7
  term          estimate std_error statistic p_value lower_ci upper_ci
  <chr>            <dbl>     <dbl>     <dbl>   <dbl>    <dbl>    <dbl>
1 intercept        83.5      3.27      25.5        0   76.7      90.3 
2 Intensity        -0.04     0.005     -7.89       0   -0.051    -0.03
3 TimeCat: Late   -12.2      2.63      -4.62       0  -17.6      -6.69
  • Estimated regression line for \(x_2 = 1\):



  • Estimated regression line for \(x_2 = 0\):

Appropriateness of Model Form

ggplot(case0901, 
       aes(x = Intensity,
           y = Flowers,
           color = TimeCat)) + 
  geom_point(size = 4) + 
  geom_smooth(method = "lm", se = FALSE)

Is the assumption of equal slopes reasonable here?

Prediction

flowersNew <- data.frame(Intensity = c(700, 700), TimeCat = c("Early", "Late"))
flowersNew
  Intensity TimeCat
1       700   Early
2       700    Late
predict(modFlowers, newdata = flowersNew)
       1        2 
55.13417 42.97583 

Returning to the Palmer Penguins

library(palmerpenguins)

Last time:

We predicted a penguin’s bill length based on their species.

ggplot(penguins, aes(x = species, y = bill_length_mm, fill = species)) +
  geom_boxplot() + 
  scale_fill_manual(values = c("steelblue",
                               "goldenrod", 
                               "plum3")) +
  stat_summary(fun = mean, 
               geom = "point", 
               size = 3,
               color = "deeppink") + 
  guides(fill = "none") +
  theme_bw()

  • Pink dots represent the mean value in each group.
  • For the single categorical variable model, those pink dots are the predicted values for each group.

This time:

We’ll incorporate bill depth for prediction!

ggplot(penguins, aes(x = bill_depth_mm, y = bill_length_mm)) +
  geom_point(size = 2) + 
  geom_smooth(method = "lm", se = FALSE, color = "red") + 
  theme_bw()

  • A moderate negative relationship between bill length and bill depth!
  • Does this make sense?

What if we include both explanatory variables?

ggplot(penguins, aes(x = bill_depth_mm, y = bill_length_mm, color = species)) +
  geom_point(size = 2) + 
  geom_smooth(inherit.aes = FALSE, 
              mapping = aes(x = bill_depth_mm,
                            y = bill_length_mm),
              method = "lm",
              se = FALSE,
              color = "red") + 
  geom_parallel_slopes(se = FALSE) + 
  scale_color_manual(values = c("steelblue",
                               "goldenrod", 
                               "plum3")) +
  theme_bw() +
  theme(legend.position = "bottom")

  • Negative relationships between bill depth and bill length overall.
  • Positive relationships between bill depth and bill length when accounting for species!
  • What is going on here??
  • This is a case of Simpson’s Paradox.

Three candidate models

species_mod <- lm(bill_length_mm ~ species, penguins)
get_regression_table(species_mod) %>%
  select(term, estimate)
# A tibble: 3 × 2
  term               estimate
  <chr>                 <dbl>
1 intercept             38.8 
2 species: Chinstrap    10.0 
3 species: Gentoo        8.71


depth_mod <- lm(bill_length_mm ~ bill_depth_mm, penguins)
get_regression_table(depth_mod) %>%
  select(term, estimate)
# A tibble: 2 × 2
  term          estimate
  <chr>            <dbl>
1 intercept        55.1 
2 bill_depth_mm    -0.65


both_mod <- lm(bill_length_mm ~ bill_depth_mm + species, penguins)
get_regression_table(both_mod) %>%
  select(term, estimate)
# A tibble: 4 × 2
  term               estimate
  <chr>                 <dbl>
1 intercept             13.2 
2 bill_depth_mm          1.39
3 species: Chinstrap     9.94
4 species: Gentoo       13.4 
  • Coefficient interpretations?

Interpreting the Multiple Regression Equation

get_regression_table(both_mod)
# A tibble: 4 × 7
  term               estimate std_error statistic p_value lower_ci upper_ci
  <chr>                 <dbl>     <dbl>     <dbl>   <dbl>    <dbl>    <dbl>
1 intercept             13.2      2.25       5.88       0     8.80    17.6 
2 bill_depth_mm          1.39     0.122     11.4        0     1.15     1.63
3 species: Chinstrap     9.94     0.368     27.0        0     9.22    10.7 
4 species: Gentoo       13.4      0.512     26.2        0    12.4     14.4 

\[ \hat{y}= 13.2 + 1.39 \cdot x_{\textrm{Bill Depth}} + 9.94 \cdot x_{\textrm{Species:Chinstrap}} + 13.4 \cdot x_{\textrm{Species:Gentoo}} \]

  • Slope Coefficient on Quantitative Variable: The expected/predicted change in response variable, on average, per unit increase in the explanatory variable, holding all other variables constant.
    • Example: We expect a 1.394mm increase in bill length for every 1mm increase in bill depth, on average, after holding species constant.

Interpreting the Multiple Regression Equation

get_regression_table(both_mod)
# A tibble: 4 × 7
  term               estimate std_error statistic p_value lower_ci upper_ci
  <chr>                 <dbl>     <dbl>     <dbl>   <dbl>    <dbl>    <dbl>
1 intercept             13.2      2.25       5.88       0     8.80    17.6 
2 bill_depth_mm          1.39     0.122     11.4        0     1.15     1.63
3 species: Chinstrap     9.94     0.368     27.0        0     9.22    10.7 
4 species: Gentoo       13.4      0.512     26.2        0    12.4     14.4 

\[ \hat{y}= 13.2 + 1.39 \cdot x_{\textrm{Bill Depth}} + 9.94 \cdot x_{\textrm{Species:Chinstrap}} + 13.4 \cdot x_{\textrm{Species:Gentoo}} \]

  • Intercept Coefficient: The expected/predicted value of the response, on average, when all explanatory variables are set to 0.
    • What does setting species to 0 mean in this context?

Returning to our MLR model

ggplot(penguins, aes(x = bill_depth_mm, y = bill_length_mm, color = species)) +
  geom_point(size = 2) + 
  geom_parallel_slopes(se = FALSE) + 
  scale_color_manual(values = c("steelblue",
                               "goldenrod", 
                               "plum3")) +
  theme_bw() 

Are equal slopes a reasonable assumption here?

Different Slopes Model

ggplot(penguins, aes(x = bill_depth_mm, y = bill_length_mm, color = species)) +
  geom_point(size = 2) + 
  geom_smooth(method = "lm", se = FALSE) + 
  scale_color_manual(values = c("steelblue",
                               "goldenrod", 
                               "plum3")) +
  theme_bw() 

  • Equal slopes models force the relationship between quantitative predictors and the response variable to be the same for each group in the model.

  • In contrast, different slopes models allow for different relationships between quantitative predictors and the response variable for each group in the model.

  • How can we allow our model to have different slopes?

Contrasting model forms

Recall the equal slopes model:

\[ y = \beta_o + \beta_1 \cdot x_{\textrm{Bill Depth}} + \beta_2 \cdot x_{\textrm{Species:Chinstrap}} + \beta_3 \cdot x_{\textrm{Species:Gentoo}} + \epsilon \]


How can we allow the slopes to vary?

\[ y = \beta_o + \beta_1 \cdot x_{\textrm{Bill Depth}} + \beta_2 \cdot x_{\textrm{Species:Chinstrap}} + \beta_3 \cdot x_{\textrm{Species:Gentoo}} + \\ \beta_4 \cdot x_{\textrm{Bill Depth}} \cdot x_{\textrm{Species:Chinstrap}} + \beta_5 \cdot x_{\textrm{Bill Depth}} \cdot x_{\textrm{Species:Gentoo}} + \epsilon \]

  • Coefficient interpretation?

Different Slopes Model in R

same_slope <- lm(bill_length_mm ~ bill_depth_mm + species, penguins)

diff_slope <- lm(bill_length_mm ~ bill_depth_mm * species, penguins)
get_regression_table(same_slope)
# A tibble: 4 × 7
  term               estimate std_error statistic p_value lower_ci upper_ci
  <chr>                 <dbl>     <dbl>     <dbl>   <dbl>    <dbl>    <dbl>
1 intercept             13.2      2.25       5.88       0     8.80    17.6 
2 bill_depth_mm          1.39     0.122     11.4        0     1.15     1.63
3 species: Chinstrap     9.94     0.368     27.0        0     9.22    10.7 
4 species: Gentoo       13.4      0.512     26.2        0    12.4     14.4 


get_regression_table(diff_slope)
# A tibble: 6 × 7
  term                    estimate std_error statistic p_value lower_ci upper_ci
  <chr>                      <dbl>     <dbl>     <dbl>   <dbl>    <dbl>    <dbl>
1 intercept                 23.1       3.02       7.65   0       17.1      29.0 
2 bill_depth_mm              0.857     0.164      5.22   0        0.534     1.18
3 species: Chinstrap        -9.64      5.72      -1.69   0.093  -20.9       1.60
4 species: Gentoo           -5.84      4.54      -1.29   0.199  -14.8       3.08
5 bill_depth_mm:speciesC…    1.06      0.31       3.44   0.001    0.455     1.68
6 bill_depth_mm:speciesG…    1.16      0.279      4.17   0        0.615     1.71

Next time

  • Linear regression with multiple quantitative variables
  • Linear regression with curved relationships
  • Discuss an article from The Quest