Math 141 – Data Wrangling with dplyr

Data Wrangling with dplyr

Grayson White

Math 141
Week 3 | Fall 2025

Goals For Today

Learn to wrangle data with dplyr!

Load Necessary Packages

dplyr is part of this collection of data science packages.

# Load necessary packages
library(tidyverse)

Data for today

dogs <- read_csv("https://data.cambridgema.gov/api/views/sckh-3xyx/rows.csv")

# Useful wrangling that we will come back to
dogs_top5 <- dogs %>% 
  mutate(Breed = case_when(
                       Dog_Breed == "Mixed Breed" ~ "Mixed",
                       Dog_Breed != "Mixed Breed" ~ "Single")) %>%
  filter(Dog_Name %in% c("Luna", "Charlie", "Lucy", "Cooper", "Rosie" ))

Data for today

glimpse(dogs)

Rows: 1,581
Columns: 6
$ Dog_Name         <chr> "Thor", "Lucy", "Ruby", "Georgie", "Candy", "kylo-roo…
$ Dog_Breed        <chr> "Boxer", "Poodle", "Collie", "Lhasa Apso", "Beagle", …
$ Location_masked  <lgl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N…
$ Latitude_masked  <dbl> 42.3622, 42.3696, 42.3860, 42.3742, 42.3918, 42.3862,…
$ Longitude_masked <dbl> -71.0846, -71.1109, -71.1237, -71.1418, -71.1290, -71…
$ Neighborhood     <chr> "East Cambridge", "Mid-Cambridge", "Neighborhood Nine…

glimpse(dogs_top5)

Rows: 57
Columns: 7
$ Dog_Name         <chr> "Lucy", "Luna", "Charlie", "Cooper", "Charlie", "Luna…
$ Dog_Breed        <chr> "Poodle", "LABRADOODLE", "Border Terrier Mix", "Germa…
$ Location_masked  <lgl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N…
$ Latitude_masked  <dbl> 42.3696, 42.3860, 42.3596, 42.3573, 42.3895, 42.3913,…
$ Longitude_masked <dbl> -71.1109, -71.1362, -71.1143, -71.1015, -71.1311, -71…
$ Neighborhood     <chr> "Mid-Cambridge", "Neighborhood Nine", "Cambridgeport"…
$ Breed            <chr> "Single", "Single", "Single", "Single", "Single", "Mi…

Data Wrangling: Transformations done on the data

Why wrangle the data?

To summarize the data.

→ To compute the mean and standard deviation of miles ridden by bikeshare users.

To drop missing values. (Need to be careful here!)

→ On our Lab 2, we are seeing that ggplot2 will often drop observations before creating a graph.

To filter to a particular subset of the data.

→ To subset the pdxTrees data to just a few species types.

To collapse the categories of a categorical variable.

→ To go from 86 dog breeds to just mixed or single breed.

Data Wrangling: Transformations done on the data

Why wrangle the data?

To arrange the data to make it easier to display.

→ To sort from most common dog name to least common.

To fix how R stores a variable.

→ Converting quantitative variables to/from categorical variables

dplyr for Data Wrangling

Six common wrangling verbs:
- summarize()
- count()
- mutate()
- select()
- filter()
- arrange()
One action:
- group_by()

Return to `mutate()`

Add new variables

count(dogs_top5, Dog_Name, Breed) %>%
  group_by(Dog_Name) %>%
  mutate(prop = n/sum(n))

# A tibble: 10 × 4
# Groups:   Dog_Name [5]
   Dog_Name Breed      n  prop
   <chr>    <chr>  <int> <dbl>
 1 Charlie  Mixed      4 0.286
 2 Charlie  Single    10 0.714
 3 Cooper   Mixed      1 0.125
 4 Cooper   Single     7 0.875
 5 Lucy     Mixed      2 0.182
 6 Lucy     Single     9 0.818
 7 Luna     Mixed      7 0.5  
 8 Luna     Single     7 0.5  
 9 Rosie    Mixed      1 0.1  
10 Rosie    Single     9 0.9

Modify existing variables

head(dogs_top5$Dog_Name)

[1] "Lucy"    "Luna"    "Charlie" "Cooper"  "Charlie" "Luna"

dogs_top5 <- dogs_top5 %>%
  mutate(Dog_Name = toupper(Dog_Name))

head(dogs_top5$Dog_Name)

[1] "LUCY"    "LUNA"    "CHARLIE" "COOPER"  "CHARLIE" "LUNA"

`select()`: Extract variables

dogs %>%
  select(Dog_Name, Dog_Breed)

# A tibble: 1,581 × 2
   Dog_Name Dog_Breed           
   <chr>    <chr>               
 1 Thor     Boxer               
 2 Lucy     Poodle              
 3 Ruby     Collie              
 4 Georgie  Lhasa Apso          
 5 Candy    Beagle              
 6 kylo-roo chihuahua yorkie mix
 7 Sandy    Cocker Spaniel      
 8 Willy    Labraddoodle        
 9 yoshi    plotthound pit mix  
10 Colby    Cocker Spaniel      
# ℹ 1,571 more rows

Motivation for `filter()`

count(dogs, Dog_Name, sort = TRUE)

# A tibble: 1,125 × 2
   Dog_Name     n
   <chr>    <int>
 1 Charlie     14
 2 Luna        14
 3 Penny       12
 4 Lucy        11
 5 Rosie       10
 6 Milo         9
 7 Pepper       9
 8 Cooper       8
 9 Teddy        8
10 Bella        7
# ℹ 1,115 more rows

`filter()`: Extract cases

dogs_top5 <- dogs %>% 
  filter(Dog_Name %in% c("Luna", "Charlie", "Lucy", "Cooper", "Rosie" ))

count(dogs_top5, Dog_Name, sort = TRUE)

# A tibble: 5 × 2
  Dog_Name     n
  <chr>    <int>
1 Charlie     14
2 Luna        14
3 Lucy        11
4 Rosie       10
5 Cooper       8

`arrange()`: Sort the cases

count(dogs_top5, Dog_Name) %>%
  arrange(n)

# A tibble: 5 × 2
  Dog_Name     n
  <chr>    <int>
1 Cooper       8
2 Rosie       10
3 Lucy        11
4 Charlie     14
5 Luna        14

count(dogs_top5, Dog_Name) %>%
  arrange(desc(n))

# A tibble: 5 × 2
  Dog_Name     n
  <chr>    <int>
1 Charlie     14
2 Luna        14
3 Lucy        11
4 Rosie       10
5 Cooper       8

count(dogs_top5, Dog_Name) %>%
  arrange(Dog_Name)

# A tibble: 5 × 2
  Dog_Name     n
  <chr>    <int>
1 Charlie     14
2 Cooper       8
3 Lucy        11
4 Luna        14
5 Rosie       10

New Data Setting: Bureau of Labor Statistics (BLS) Consumer Expenditure Survey

BLS Mission: “Measures labor market activity, working conditions, price changes, and productivity in the U.S. economy to support public and private decision making.”

Data: Last quarter of the 2016 BLS Consumer Expenditure Survey.

library(tidyverse)

ce_raw <- read_csv("data/fmli.csv", 
                 na = c("NA", "."))
glimpse(ce_raw)

Rows: 6,301
Columns: 51
$ NEWID    <chr> "03324174", "03324204", "03324214", "03324244", "03324274", "…
$ PRINEARN <chr> "01", "01", "01", "01", "02", "01", "01", "01", "02", "01", "…
$ FINLWT21 <dbl> 25984.767, 6581.018, 20208.499, 18078.372, 20111.619, 19907.3…
$ FINCBTAX <dbl> 116920, 200, 117000, 0, 2000, 942, 0, 91000, 95000, 40037, 10…
$ BLS_URBN <dbl> 1, 1, 1, 1, 1, 1, 1, 1, 2, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1…
$ POPSIZE  <dbl> 2, 3, 4, 2, 2, 2, 1, 2, 5, 2, 3, 2, 2, 3, 4, 3, 3, 1, 4, 1, 1…
$ EDUC_REF <chr> "16", "15", "16", "15", "14", "11", "10", "13", "12", "12", "…
$ EDUCA2   <dbl> 15, 15, 13, NA, NA, NA, NA, 15, 15, 14, 12, 12, NA, NA, NA, 1…
$ AGE_REF  <dbl> 63, 50, 47, 37, 51, 63, 77, 37, 51, 64, 26, 59, 81, 51, 67, 4…
$ AGE2     <dbl> 50, 47, 46, NA, NA, NA, NA, 36, 53, 67, 44, 62, NA, NA, NA, 4…
$ SEX_REF  <dbl> 1, 1, 2, 1, 2, 1, 2, 1, 1, 2, 2, 2, 2, 2, 2, 2, 2, 2, 1, 2, 1…
$ SEX2     <dbl> 2, 2, 1, NA, NA, NA, NA, 2, 2, 1, 1, 1, NA, NA, NA, 1, NA, 1,…
$ REF_RACE <dbl> 1, 4, 2, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 1, 1, 1, 1, 1…
$ RACE2    <dbl> 1, 4, 1, NA, NA, NA, NA, 1, 1, 1, 1, 1, NA, NA, NA, 2, NA, 1,…
$ HISP_REF <dbl> 2, 2, 2, 2, 2, 1, 1, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 1, 1…
$ HISP2    <dbl> 2, 2, 1, NA, NA, NA, NA, 2, 2, 2, 2, 2, NA, NA, NA, 2, NA, 2,…
$ FAM_TYPE <dbl> 3, 4, 1, 8, 9, 9, 8, 3, 1, 1, 3, 1, 8, 9, 8, 5, 9, 4, 8, 3, 2…
$ MARITAL1 <dbl> 1, 1, 1, 5, 3, 3, 2, 1, 1, 1, 1, 1, 2, 3, 5, 1, 3, 1, 3, 1, 1…
$ REGION   <dbl> 4, 4, 3, 4, 4, 3, 4, 1, 3, 2, 1, 4, 1, 3, 3, 3, 2, 1, 2, 4, 3…
$ SMSASTAT <dbl> 1, 1, 1, 1, 1, 1, 1, 1, 2, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1…
$ HIGH_EDU <chr> "16", "15", "16", "15", "14", "11", "10", "15", "15", "14", "…
$ EHOUSNGC <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
$ TOTEXPCQ <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
$ FOODCQ   <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
$ TRANSCQ  <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
$ HEALTHCQ <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
$ ENTERTCQ <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
$ EDUCACQ  <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
$ TOBACCCQ <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
$ STUDFINX <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N…
$ IRAX     <dbl> 1000000, 10000, 0, NA, NA, 0, 0, 15000, NA, 477000, NA, NA, N…
$ CUTENURE <dbl> 1, 1, 1, 1, 1, 2, 4, 1, 1, 2, 1, 2, 2, 2, 2, 4, 1, 1, 1, 4, 4…
$ FAM_SIZE <dbl> 4, 6, 2, 1, 2, 2, 1, 5, 2, 2, 4, 2, 1, 2, 1, 4, 2, 4, 1, 3, 3…
$ VEHQ     <dbl> 3, 5, 0, 4, 2, 0, 0, 2, 4, 2, 3, 2, 1, 3, 1, 2, 4, 4, 0, 2, 3…
$ ROOMSQ   <dbl> 8, 5, 6, 4, 4, 4, 7, 5, 4, 9, 6, 10, 4, 7, 5, 6, 6, 8, 18, 4,…
$ INC_HRS1 <dbl> 40, 40, 40, 44, 40, NA, NA, 40, 40, NA, 40, NA, NA, NA, NA, 4…
$ INC_HRS2 <dbl> 30, 40, 52, NA, NA, NA, NA, 40, 40, NA, 65, NA, NA, NA, NA, 6…
$ EARNCOMP <dbl> 3, 2, 2, 1, 4, 7, 8, 2, 2, 8, 2, 8, 8, 7, 8, 2, 7, 3, 1, 2, 1…
$ NO_EARNR <dbl> 4, 2, 2, 1, 2, 1, 0, 2, 2, 0, 2, 0, 0, 1, 0, 2, 1, 3, 1, 2, 1…
$ OCCUCOD1 <chr> "03", "03", "05", "03", "04", "", "", "12", "04", "", "01", "…
$ OCCUCOD2 <chr> "04", "02", "01", "", "", "", "", "02", "03", "", "11", "", "…
$ STATE    <chr> "41", "15", "48", "06", "06", "48", "06", "42", "", "27", "25…
$ DIVISION <dbl> 9, 9, 7, 9, 9, 7, 9, 2, NA, 4, 1, 8, 2, 5, 6, 7, 3, 2, 3, 9, …
$ TOTXEST  <dbl> 15452, 11459, 15738, 25978, 588, 0, 0, 7261, 9406, -1414, 141…
$ CREDFINX <dbl> 0, NA, 0, NA, 5, NA, NA, NA, NA, 0, NA, 0, NA, NA, NA, 2, 35,…
$ CREDITB  <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N…
$ CREDITX  <dbl> 4000, 5000, 2000, NA, 7000, 1800, NA, 6000, NA, 719, NA, 1200…
$ BUILDING <chr> "01", "01", "01", "02", "08", "01", "01", "01", "01", "01", "…
$ ST_HOUS  <dbl> 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2…
$ INT_PHON <lgl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N…
$ INT_HOME <lgl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N…

Wrangling CE Data

Want to better understand a family’s income and expenditures

ce <- ce_raw %>%
  select(NEWID, PRINEARN, FINCBTAX,
         BLS_URBN, HIGH_EDU, TOTEXPCQ, IRAX)
dim(ce)

[1] 6301    7

Variables:

NEWID: ID for the household
PRINEARN: ID for which member of the household is the principal earner
FINCBTAX: Final income before taxes for the year

BLS_URBN: 1 = urban, 2 = rural
HIGH_EDU: Highest education in the household. 00 = Never attended, 10 = Grades 1-8, 11 = Grades 9-12, no degree, 12 = High school graduate, 13 = Some college, no degree, 14 = Associates degree, 15 = Bachelor’s degree, 16 = Masters, Professional/doctorate degree
TOTEXPCQ = Total household expenditures for the current quarter
IRAX = Total in retirement funds

Wrangling CE Data

ce <- ce %>%
  mutate(YEARLY_EXP = TOTEXPCQ*4)
ce

# A tibble: 6,301 × 8
   NEWID    PRINEARN FINCBTAX BLS_URBN HIGH_EDU TOTEXPCQ    IRAX YEARLY_EXP
   <chr>    <chr>       <dbl>    <dbl> <chr>       <dbl>   <dbl>      <dbl>
 1 03324174 01         116920        1 16              0 1000000          0
 2 03324204 01            200        1 15              0   10000          0
 3 03324214 01         117000        1 16              0       0          0
 4 03324244 01              0        1 15              0      NA          0
 5 03324274 02           2000        1 14              0      NA          0
 6 03324284 01            942        1 11              0       0          0
 7 03324294 01              0        1 10              0       0          0
 8 03324304 01          91000        1 15              0   15000          0
 9 03324324 02          95000        2 15              0      NA          0
10 03324334 01          40037        1 14              0  477000          0
# ℹ 6,291 more rows

Logical Operators

ce_sub <- ce %>%
  filter(YEARLY_EXP > 0, BLS_URBN == 1, HIGH_EDU != "00")
ce_sub

# A tibble: 3,950 × 8
   NEWID    PRINEARN FINCBTAX BLS_URBN HIGH_EDU TOTEXPCQ   IRAX YEARLY_EXP
   <chr>    <chr>       <dbl>    <dbl> <chr>       <dbl>  <dbl>      <dbl>
 1 03335204 01          37000        1 14          2492.      0      9968.
 2 03335214 01         103000        1 16          6128.     NA     24513.
 3 03335224 01          14686        1 13          1072.     NA      4287.
 4 03335244 02          33396        1 12          1630       0      6520 
 5 03335264 01              0        1 13          3213.     NA     12853.
 6 03335274 01              0        1 15          4674.      0     18694.
 7 03335294 01         745136        1 16          8693. 280000     34773.
 8 03335304 01          36000        1 16          3733.     NA     14933.
 9 03335314 02          45000        1 15          3627.   3000     14509 
10 03335334 01          20862        1 13           802.      0      3209.
# ℹ 3,940 more rows

Logical Operators

ce_sub <- ce %>%
  filter(YEARLY_EXP > 0, (BLS_URBN == 1 | HIGH_EDU != "00"))
ce_sub

# A tibble: 4,178 × 8
   NEWID    PRINEARN FINCBTAX BLS_URBN HIGH_EDU TOTEXPCQ   IRAX YEARLY_EXP
   <chr>    <chr>       <dbl>    <dbl> <chr>       <dbl>  <dbl>      <dbl>
 1 03335204 01          37000        1 14          2492.      0      9968.
 2 03335214 01         103000        1 16          6128.     NA     24513.
 3 03335224 01          14686        1 13          1072.     NA      4287.
 4 03335244 02          33396        1 12          1630       0      6520 
 5 03335264 01              0        1 13          3213.     NA     12853.
 6 03335274 01              0        1 15          4674.      0     18694.
 7 03335294 01         745136        1 16          8693. 280000     34773.
 8 03335304 01          36000        1 16          3733.     NA     14933.
 9 03335314 02          45000        1 15          3627.   3000     14509 
10 03335334 01          20862        1 13           802.      0      3209.
# ℹ 4,168 more rows

`case_when`: Recoding Variables

count(ce, BLS_URBN)

# A tibble: 2 × 2
  BLS_URBN     n
     <dbl> <int>
1        1  5952
2        2   349

ce <- ce %>%
  mutate(BLS_URBN = case_when(
    BLS_URBN == 1 ~ "Urban",
    BLS_URBN == 2 ~ "Rural"
  ))
count(ce, BLS_URBN)

# A tibble: 2 × 2
  BLS_URBN     n
  <chr>    <int>
1 Rural      349
2 Urban     5952

`case_when`: Creating Variables

count(ce, HIGH_EDU)

# A tibble: 8 × 2
  HIGH_EDU     n
  <chr>    <int>
1 00           8
2 10         110
3 11         302
4 12        1272
5 13        1297
6 14         714
7 15        1528
8 16        1070

ce <- ce %>%
  mutate(HIGH_EDU = as.numeric(HIGH_EDU))
count(ce, HIGH_EDU)

# A tibble: 8 × 2
  HIGH_EDU     n
     <dbl> <int>
1        0     8
2       10   110
3       11   302
4       12  1272
5       13  1297
6       14   714
7       15  1528
8       16  1070

ce <- ce %>%
  mutate(HIGH_EDU2 = case_when(
    is.na(HIGH_EDU) ~ NA,
    HIGH_EDU <= 11 ~ "Less than high school degree",
    between(HIGH_EDU, 12, 13) ~ "High school degree",
    HIGH_EDU >= 14 ~ "College degree"
  ))
count(ce, HIGH_EDU2)

# A tibble: 3 × 2
  HIGH_EDU2                        n
  <chr>                        <int>
1 College degree                3312
2 High school degree            2569
3 Less than high school degree   420

Variable Names

Sometimes datasets come with terrible variable names.

ce <- ce %>%
  rename(INCOME = FINCBTAX)
ce

# A tibble: 6,301 × 9
   NEWID PRINEARN INCOME BLS_URBN HIGH_EDU TOTEXPCQ    IRAX YEARLY_EXP HIGH_EDU2
   <chr> <chr>     <dbl> <chr>       <dbl>    <dbl>   <dbl>      <dbl> <chr>    
 1 0332… 01       116920 Urban          16        0 1000000          0 College …
 2 0332… 01          200 Urban          15        0   10000          0 College …
 3 0332… 01       117000 Urban          16        0       0          0 College …
 4 0332… 01            0 Urban          15        0      NA          0 College …
 5 0332… 02         2000 Urban          14        0      NA          0 College …
 6 0332… 01          942 Urban          11        0       0          0 Less tha…
 7 0332… 01            0 Urban          10        0       0          0 Less tha…
 8 0332… 01        91000 Urban          15        0   15000          0 College …
 9 0332… 02        95000 Rural          15        0      NA          0 College …
10 0332… 01        40037 Urban          14        0  477000          0 College …
# ℹ 6,291 more rows

Handling Missing Data

Want to compute mean income and mean retirement funds by location.

ce %>%
  group_by(BLS_URBN) %>%
  summarize(mean_INCOME = mean(INCOME),
            mean_IRAX = mean(IRAX),
            households = n())

# A tibble: 2 × 4
  BLS_URBN mean_INCOME mean_IRAX households
  <chr>          <dbl>     <dbl>      <int>
1 Rural         40440.        NA        349
2 Urban         63772.        NA       5952

ce_aggressive <- ce_raw %>%
  na.omit()
ce_aggressive

# A tibble: 0 × 51
# ℹ 51 variables: NEWID <chr>, PRINEARN <chr>, FINLWT21 <dbl>, FINCBTAX <dbl>,
#   BLS_URBN <dbl>, POPSIZE <dbl>, EDUC_REF <chr>, EDUCA2 <dbl>, AGE_REF <dbl>,
#   AGE2 <dbl>, SEX_REF <dbl>, SEX2 <dbl>, REF_RACE <dbl>, RACE2 <dbl>,
#   HISP_REF <dbl>, HISP2 <dbl>, FAM_TYPE <dbl>, MARITAL1 <dbl>, REGION <dbl>,
#   SMSASTAT <dbl>, HIGH_EDU <chr>, EHOUSNGC <dbl>, TOTEXPCQ <dbl>,
#   FOODCQ <dbl>, TRANSCQ <dbl>, HEALTHCQ <dbl>, ENTERTCQ <dbl>, EDUCACQ <dbl>,
#   TOBACCCQ <dbl>, STUDFINX <dbl>, IRAX <dbl>, CUTENURE <dbl>, …

Handling Missing Data

ce_moderate <- ce %>%
  drop_na(IRAX, INCOME, BLS_URBN) %>%
  group_by(BLS_URBN) %>%  
  summarize(mean_INCOME = mean(INCOME),
            mean_IRAX = mean(IRAX),
            households = n())

ce_moderate

# A tibble: 2 × 4
  BLS_URBN mean_INCOME mean_IRAX households
  <chr>          <dbl>     <dbl>      <int>
1 Rural         38651.    37008.         63
2 Urban         58987.    94512.        991

ce_light <- ce %>%
  group_by(BLS_URBN) %>%
  summarize(mean_INCOME = mean(INCOME, na.rm = TRUE),
            mean_IRAX = mean(IRAX, na.rm = TRUE), 
            households = n())

ce_light

# A tibble: 2 × 4
  BLS_URBN mean_INCOME mean_IRAX households
  <chr>          <dbl>     <dbl>      <int>
1 Rural         40440.    37008.        349
2 Urban         63772.    94512.       5952

Multiple Groupings

ce %>%
  group_by(BLS_URBN, HIGH_EDU2) %>%
  summarize(mean_INCOME = mean(INCOME, na.rm = TRUE),
            mean_IRAX = mean(IRAX, na.rm = TRUE), 
            households = n()) %>%
  arrange(mean_IRAX)

# A tibble: 6 × 5
# Groups:   BLS_URBN [2]
  BLS_URBN HIGH_EDU2                    mean_INCOME mean_IRAX households
  <chr>    <chr>                              <dbl>     <dbl>      <int>
1 Rural    Less than high school degree      14715.        0          39
2 Urban    Less than high school degree      23046.     8270.        381
3 Rural    High school degree                31087.    15543.        192
4 Urban    High school degree                39147.    30533.       2377
5 Rural    College degree                    64161.   105148.        118
6 Urban    College degree                    86957.   168767.       3194

Piping into `ggplot2`

ce %>%
  group_by(BLS_URBN, HIGH_EDU2) %>%
  summarize(mean_INCOME = mean(INCOME, na.rm = TRUE),
            mean_IRAX = mean(IRAX, na.rm = TRUE), 
            households = n()) %>%
  ggplot(mapping = aes(x = mean_INCOME,
                       y = mean_IRAX, 
                       shape = BLS_URBN,
                       color = HIGH_EDU2)) +
  geom_point(size = 5)

Naming Wrangled Data

When I make a new dataframe, what name should I give it? Importantly, should I write over my original dataframe or should I save a new dataframe?

My answer:
- Is your new dataframe structurally different? If so, give it a new name.
- Are you removing values you will need for a future analysis within the same document? If so, give it a new name.
- Are you just adding to or cleaning the data? If so, then write over the original.

Sage Advice from ModernDive

“Crucial: Unless you are very confident in what you are doing, it is worthwhile not starting to code right away. Rather, first sketch out on paper all the necessary data wrangling steps not using exact code, but rather high-level pseudocode that is informal yet detailed enough to articulate what you are doing. This way you won’t confuse what you are trying to do (the algorithm) with how you are going to do it (writing dplyr code).”

Goals For Today

Load Necessary Packages

Data for today

Data for today

Data Wrangling: Transformations done on the data

Data Wrangling: Transformations done on the data

dplyr for Data Wrangling

Return to mutate()

select(): Extract variables

Motivation for filter()

filter(): Extract cases

arrange(): Sort the cases

New Data Setting: Bureau of Labor Statistics (BLS) Consumer Expenditure Survey

Wrangling CE Data

Wrangling CE Data

Logical Operators

Logical Operators

case_when: Recoding Variables

case_when: Creating Variables

Variable Names

Handling Missing Data

Handling Missing Data

Multiple Groupings

Piping into ggplot2

Naming Wrangled Data

Sage Advice from ModernDive

Return to `mutate()`

`select()`: Extract variables

Motivation for `filter()`

`filter()`: Extract cases

`arrange()`: Sort the cases

`case_when`: Recoding Variables

`case_when`: Creating Variables

Piping into `ggplot2`