# Data Frames and Data Manipulation
In the previous chapter, we learned R's grammar --- variables, vectors, functions, and pipes. Now we move from *language basics* to *real data work*: the kind of workflow you'll use daily in clinical research and quality improvement projects.
By the end of this chapter, you'll be able to import a CSV file, inspect it, clean it, transform it, and summarize it --- all with readable, reproducible code.
::: {.callout-note}
## Our running example: Preparing a diabetes registry
Imagine a colleague sends you a file from a diabetes screening program: `diabetes.csv` with 768 patient records. Your job is to turn this raw file into a clean, analysis-ready table.
Along the way, you'll learn how to inspect, import, clean, transform, and summarize tabular data --- the essential skills for any data analysis project.
:::
## What Is a Data Frame?
In @sec-vectors, you learned that a **vector** holds a single variable --- one column of data. A **data frame** is a collection of vectors arranged side-by-side, forming a table with rows and columns. Each column is a variable, each row is an observation (e.g., one patient).
| Structure | What it holds | Example |
|---|---|---|
| Vector | One variable | `c(55, 62, 47)` |
| Data frame | Multiple variables (a table) | A table with `age`, `sex`, `glucose` columns |
### Creating a data frame by hand
You can create a data frame with `data.frame()`:
```{r}
patients <- data.frame(
  patient_id = c("P001", "P002", "P003"),
  age = c(55, 62, 47),
  sex = c("Female", "Male", "Female"),
  glucose_mg_dl = c(145, 132, 178)
)
patients
```
Each argument is a column. Note that the columns must all be vectors of the same length --- one value per patient.
### Tibble: the tidyverse data frame
The tidyverse provides `tibble()`, a modern version of `data.frame()`. It prints more neatly and avoids some unexpected behaviors:
```{r}
library(tidyverse)
```
```{r}
patients <- tibble(
  patient_id = c("P001", "P002", "P003"),
  age = c(55, 62, 47),
  sex = c("Female", "Male", "Female"),
  glucose_mg_dl = c(145, 132, 178)
)
patients
```
A tibble shows its dimensions (`3 × 4`) and column types (`<chr>`, `<dbl>`) at the top --- very handy for quick orientation.
::: {.callout-tip}
## The 60-second data orientation
When you encounter any new table, ask three questions:
1. **What kind of object is this?** → `class(data)`
2. **How big is it?** → `dim(data)` (rows × columns)
3. **What are the variable names?** → `names(data)`
```{r}
class(patients)
dim(patients)
names(patients)
```
This quick routine catches most early problems before they become hard-to-trace bugs downstream.
:::
### Exploring structure with `glimpse()`
For a richer overview, `glimpse()` shows every column's name, type, and a few sample values --- all in one compact display:
```{r}
glimpse(patients)
```
You'll use `glimpse()` constantly. It's one of the most useful functions in the tidyverse.
## Warm-Up: Exploring a Packaged Dataset {#sec-warmup-scurvy}
Before touching our diabetes file, let's practice with a small dataset that comes pre-loaded in an R package. The `scurvy` dataset from the `medicaldata` package records a historic clinical trial on scurvy treatments (12 patients, 8 variables).
```{r}
data("scurvy", package = "medicaldata")
```
Apply the 60-second orientation:
```{r}
class(scurvy)
dim(scurvy)
names(scurvy)
```
```{r}
glimpse(scurvy)
```
Notice the column types: `<chr>` for character, `<fct>` for factor. We'll explain factors next --- they're important for clinical data.
::: {.callout-caution collapse="true" title="Python Comparison"}
Python's equivalent of a data frame is the pandas `DataFrame`:
```python
import pandas as pd
# Create a DataFrame
patients = pd.DataFrame({
    "patient_id": ["P001", "P002", "P003"],
    "age": [55, 62, 47],
    "sex": ["Female", "Male", "Female"],
    "glucose_mg_dl": [145, 132, 178]
})
# Orientation
patients.shape # (3, 4) — like dim()
patients.columns # column names — like names()
patients.dtypes # column types — like glimpse()
patients.info() # combined overview
```
:::
## Factors in Practice {#sec-factors-practice}
In @sec-factors, you learned that a **factor** is R's type for categorical data --- a character variable with a fixed set of **levels**. You also learned to create factors with `factor()` and control level order with the `levels` argument.
Now let's learn a tidyverse tool that makes reordering levels more convenient: `fct_relevel()`.
### Reordering levels with `fct_relevel()`
The `forcats` package (loaded with `library(tidyverse)`) provides `fct_relevel()` to reorder factor levels **after** a factor has been created. This is especially useful when you receive data where the factor order isn't what you need:
```{r}
# Default: alphabetical order
outcome <- factor(c("pos", "neg", "pos", "neg"))
outcome
```
```{r}
# Reorder: put "neg" first (as the reference level)
outcome <- fct_relevel(outcome, "neg", "pos")
outcome
```
`fct_relevel()` takes the factor and the levels in your desired order. You'll use this frequently when preparing data for plots and statistical models, because the first level controls:
- **Plot axis/legend order** --- categories appear in level order
- **Table row order** --- summary tables follow level order
- **Statistical reference category** --- the first level is the baseline in regression
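In practice, you'll usually relevel a column in place with `mutate()`. A minimal sketch (not run here; it assumes the `diabetes_5y` column holds the values `neg` and `pos`):

```{r}
#| eval: false
# Make "neg" the reference level before plotting or modeling
diabetes |>
  mutate(diabetes_5y = fct_relevel(diabetes_5y, "neg"))
```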
::: {.callout-tip}
## `factor()` vs `fct_relevel()` --- when to use which?
- **`factor(x, levels = ...)`** --- when **creating** a factor from scratch (e.g., from a character vector). You learned this in @sec-factors.
- **`fct_relevel(x, ...)`** --- when **reordering** an existing factor (e.g., a column that's already a factor but in the wrong order).
Both achieve the same result; `fct_relevel()` is simply more convenient for existing factors.
:::
## Importing Data from Files {#sec-import}
In real projects, data arrives as files --- CSV, Excel, or other formats. Let's import our diabetes dataset.
### Reading CSV with `read_csv()`
The `readr` package (part of the tidyverse) provides `read_csv()` for reading CSV files:
```{r}
diabetes_raw <- read_csv("../../data/diabetes.csv", show_col_types = FALSE)
```
Let's break down the syntax:
- **`read_csv("path/to/file.csv")`** --- reads the file at the given path and returns a tibble
- **`show_col_types = FALSE`** --- suppresses the column-type message that `read_csv()` prints by default. Useful for keeping output clean.
::: {.callout-note}
## File paths in a Quarto book
The path `"../../data/diabetes.csv"` means "go up two directories from the chapter file, then into `data/`." In your own R scripts, adjust the path to match where your file lives. Using **RStudio Projects** (as covered in @sec-rstudio-projects) makes paths simpler and more portable.
:::
### Inspecting after import
Always check your data right after import:
```{r}
dim(diabetes_raw)
names(diabetes_raw)
```
```{r}
glimpse(diabetes_raw)
```
We have 768 rows and 9 columns. But notice the column names: `glucose_mg-dl`, `dbp_mm-hg` --- those hyphens will cause problems in R code.
::: {.callout-tip}
## The post-import checklist
Right after reading any file, run these four checks:
1. `dim(data)` --- expected number of rows and columns?
2. `names(data)` --- column names sensible?
3. `glimpse(data)` --- types correct? (numbers as `<dbl>`, text as `<chr>`)
4. `summary(data)` --- any suspicious min/max values or unexpected `NA`s?
:::
### Cleaning column names with `clean_names()`
The `janitor` package provides `clean_names()`, which converts column names to consistent `snake_case` --- replacing hyphens, spaces, and dots with underscores:
```{r}
library(janitor)
diabetes <- diabetes_raw |> clean_names()
names(diabetes)
```
Now `glucose_mg-dl` has become `glucose_mg_dl` --- much easier to type and less error-prone.
::: {.callout-note}
## Why clean names matter
R can handle column names with hyphens (e.g., `` diabetes$`glucose_mg-dl` ``), but you need backticks every time. Clean snake_case names let you write `diabetes$glucose_mg_dl` without any special syntax.
:::
### Reading Excel files (brief) {#sec-import-excel}
For Excel files, the `readxl` package provides `read_excel()`:
```{r}
library(readxl)
diabetes_xlsx <- read_excel("../../data/diabetes.xlsx")
glimpse(diabetes_xlsx)
```
You can also specify a sheet or cell range:
```{r}
#| eval: false
read_excel("data/file.xlsx", sheet = "Sheet2")
read_excel("data/file.xlsx", range = "A1:D20")
```
::: {.callout-warning}
## Common import pitfalls
| Problem | Symptom | Fix |
|---|---|---|
| Wrong file path | `Error: 'file.csv' does not exist` | Check `getwd()`, use RStudio file browser |
| Numbers read as text | Column shows `<chr>` instead of `<dbl>` | Check for text like `"N/A"` in numeric columns |
| Unexpected NAs | More `NA`s than expected | Source file uses blanks or custom codes (`-`, `999`) |
| Encoding issues | Garbled characters | Try `read_csv(..., locale = locale(encoding = "UTF-8"))` |
:::
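One of these fixes deserves a concrete sketch: `read_csv()` has an `na` argument that declares which strings should be read as missing. Not run here --- the extra codes (`"N/A"`, `"999"`) are illustrative, not values known to be in our file:

```{r}
#| eval: false
# Treat blanks, "N/A", and the sentinel code 999 as missing on import
read_csv(
  "../../data/diabetes.csv",
  na = c("", "NA", "N/A", "999"),
  show_col_types = FALSE
)
```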
::: {.callout-caution collapse="true" title="Python Comparison"}
Python's pandas uses similar functions:
```python
import pandas as pd
# CSV
df = pd.read_csv("data/diabetes.csv")
# Excel
df = pd.read_excel("data/diabetes.xlsx", sheet_name="Sheet1")
# Inspect
df.shape
df.columns
df.info()
df.describe()
```
:::
## The dplyr Toolkit: Overview {#sec-dplyr-overview}
Now that our data is imported and cleaned, it's time to **manipulate** it --- selecting columns, filtering rows, creating new variables, and computing summaries.
The `dplyr` package (part of the tidyverse) provides a set of **verbs** --- functions named after what they do. Before diving into each one, here's the big picture:
| Category | Verb | What it does |
|---|---|---|
| **Columns** | `select()` | Choose which columns to keep |
| | `rename()` | Change column names |
| **Rows** | `filter()` | Keep rows that match a condition |
| | `arrange()` | Sort rows by a column |
| **Create** | `mutate()` | Add or modify columns |
| **Aggregate** | `summarise()` | Collapse many rows into a summary |
| | `group_by()` | Split data into groups before summarizing |
| | `count()` | Quick frequency table |
: Core dplyr verbs organized by purpose {.striped}
All dplyr verbs follow the same pattern:
```r
data |> verb(arguments)
```
The first argument is always the data (passed via `|>`), and you refer to column names **without quotes**.
## Selecting and Renaming Columns
### `select()` --- choose columns
Use `select()` to keep only the columns you need:
```{r}
diabetes |>
  select(age, bmi, glucose_mg_dl, diabetes_5y) |>
  head()
```
You can also **exclude** columns with `-`:
```{r}
diabetes |>
  select(-pedigree, -pregnancy_num) |>
  head()
```
::: {.callout-tip}
## Column selection helpers
When you have many columns with similar names, these helpers save typing:
- `starts_with("bp")` --- columns starting with "bp"
- `ends_with("_ml")` --- columns ending with "_ml"
- `contains("glucose")` --- columns containing "glucose"
```{r}
#| eval: false
diabetes |> select(starts_with("insulin"))
```
:::
### `rename()` --- relabel columns
Use `rename()` to give columns clearer names. The syntax is `new_name = old_name`:
```{r}
diabetes |>
  rename(outcome_5y = diabetes_5y) |>
  head(3)
```
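`rename()` also accepts several `new = old` pairs in a single call. A sketch (not run), using two of our cleaned column names:

```{r}
#| eval: false
diabetes |>
  rename(
    outcome_5y = diabetes_5y,
    dbp = dbp_mm_hg
  )
```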
## Filtering and Arranging Rows
### `filter()` --- keep rows by condition
`filter()` keeps only the rows where your condition is `TRUE`. This uses the comparison operators you learned in @sec-logical-ops:
```{r}
# Patients with high glucose (>= 140 mg/dL)
diabetes |>
  filter(glucose_mg_dl >= 140) |>
  head()
```
### Combining conditions in `filter()`
Use `&` (AND) or `,` (comma, same as AND) to require multiple conditions. Use `|` (OR) for either-or:
```{r}
# High glucose AND age over 50
diabetes |>
  filter(glucose_mg_dl >= 140, age > 50) |>
  head()
```
```{r}
# Very young OR very old patients
diabetes |>
  filter(age < 25 | age > 65) |>
  head()
```
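When a condition checks membership in a set of values, the `%in%` operator is more compact than chaining `|`. A sketch (not run):

```{r}
#| eval: false
# Same as: pregnancy_num == 0 | pregnancy_num == 1
diabetes |>
  filter(pregnancy_num %in% c(0, 1))
```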
### `arrange()` --- sort rows
`arrange()` sorts rows by a column in ascending order:
```{r}
diabetes |>
  arrange(glucose_mg_dl) |>
  head()
```
### Sorting in descending order with `desc()`
Wrap a column in `desc()` to sort from highest to lowest:
```{r}
diabetes |>
  arrange(desc(glucose_mg_dl)) |>
  head()
```
`desc()` simply means "descending." You'll use it whenever you want the largest (or latest) values first.
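`arrange()` also accepts several columns --- ties in the first are broken by the second. A sketch (not run):

```{r}
#| eval: false
# Oldest patients first; within each age, lowest glucose first
diabetes |>
  arrange(desc(age), glucose_mg_dl)
```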
## Creating New Variables with `mutate()` {#sec-mutate}
`mutate()` adds new columns (or modifies existing ones) based on calculations or conditions.
### Simple derived columns
```{r}
diabetes |>
  mutate(age_months = age * 12) |>
  select(age, age_months) |>
  head()
```
The new column `age_months` is computed from the existing `age` column.
### Two-way classification with `if_else()`
Often you need to create a variable with two categories: "yes or no", "high or normal", etc. The `if_else()` function handles this:
```r
if_else(condition, true_value, false_value)
```
Let's break it down:
- **`condition`** --- a logical test applied to each row (e.g., `glucose_mg_dl >= 140`)
- **`true_value`** --- the value to assign when the condition is `TRUE`
- **`false_value`** --- the value to assign when the condition is `FALSE`
```{r}
diabetes |>
  mutate(glucose_flag = if_else(glucose_mg_dl >= 140, "High", "Normal")) |>
  select(glucose_mg_dl, glucose_flag) |>
  head()
```
::: {.callout-tip}
## `if_else()` vs `ifelse()`
In @sec-flow-control, you learned base R's `ifelse()`. The dplyr version `if_else()` (with an underscore) is stricter --- it checks that the true and false values are the same type, catching subtle bugs early. Use `if_else()` inside `mutate()`.
:::
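To see the difference concretely, here's a small comparison with deliberately mismatched types (the second call errors on purpose):

```{r}
#| error: true
# ifelse() silently coerces: the number 1 becomes the string "1"
ifelse(c(TRUE, FALSE), 1, "zero")

# if_else() refuses to combine a number with a character string
if_else(c(TRUE, FALSE), 1, "zero")
```

That silent coercion in `ifelse()` is exactly the kind of bug that surfaces much later, far from its cause.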
### Multi-way classification with `case_when()` {#sec-case-when}
When you have **more than two categories**, `case_when()` is the tool. Think of it as a vectorized version of `if` / `else if` / `else` from @sec-flow-control:
```r
case_when(
  condition1 ~ value1,
  condition2 ~ value2,
  condition3 ~ value3,
  .default = fallback_value
)
```
Let's break it down:
- Each line is a **condition `~` value** pair (read `~` as "then")
- Conditions are checked **in order** --- the first `TRUE` match wins
- **`.default`** --- the value when no condition matches (like the final `else`)
Here's a clinical example --- classifying BMI:
```{r}
diabetes |>
  mutate(
    bmi_class = case_when(
      bmi < 18.5 ~ "Underweight",
      bmi < 25 ~ "Normal",
      bmi < 30 ~ "Overweight",
      .default = "Obesity"
    )
  ) |>
  select(bmi, bmi_class) |>
  head(8)
```
A patient with BMI 24.0 doesn't match `< 18.5`, but does match `< 25`, so they get `"Normal"`. The order matters --- just like `if` / `else if` in @sec-flow-control.
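Here's a self-contained toy example of what goes wrong when the order is reversed --- the broad `< 30` test fires first, so the underweight range is never reached:

```{r}
bmi_vals <- c(17, 22, 27, 33)
case_when(
  bmi_vals < 30   ~ "Under 30",     # broad test first...
  bmi_vals < 18.5 ~ "Underweight",  # ...so this line never matches
  .default = "Obesity"
)
```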
::: {.callout-warning}
## The `.default` argument
In older R code and tutorials, you may see `TRUE ~ "Obesity"` instead of `.default = "Obesity"`. Both work, but `.default` is the modern, clearer syntax. We use `.default` throughout this book.
:::
::: {.callout-caution collapse="true" title="Python Comparison"}
Python's pandas offers `np.where()` for two-way (like `if_else()`) and `pd.cut()` or `np.select()` for multi-way (like `case_when()`):
```python
import numpy as np
# Two-way (like if_else)
df["glucose_flag"] = np.where(df["glucose_mg_dl"] >= 140, "High", "Normal")
# Multi-way (like case_when)
conditions = [
    df["bmi"] < 18.5,
    df["bmi"] < 25,
    df["bmi"] < 30,
]
choices = ["Underweight", "Normal", "Overweight"]
df["bmi_class"] = np.select(conditions, choices, default="Obesity")
```
:::
## Summarizing Data {#sec-summarise}
### `summarise()` --- collapse to one row
`summarise()` (or `summarize()` --- both spellings work) computes summary statistics, collapsing the entire table into a single row:
```{r}
diabetes |>
  summarise(
    n_patients = n(),
    mean_age = mean(age, na.rm = TRUE),
    mean_glucose = mean(glucose_mg_dl, na.rm = TRUE)
  )
```
Let's break this down:
- **`n()`** --- counts the number of rows (patients)
- **`mean(column, na.rm = TRUE)`** --- computes the mean, removing any `NA` values (recall `na.rm` from @sec-na)
- Each `name = expression` pair becomes a column in the output
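`summarise()` works with any function that collapses a vector to a single value --- `median()`, `sd()`, `min()`, `max()`, and so on. A sketch (not run):

```{r}
#| eval: false
diabetes |>
  summarise(
    median_bmi = median(bmi, na.rm = TRUE),
    sd_glucose = sd(glucose_mg_dl, na.rm = TRUE),
    max_age    = max(age, na.rm = TRUE)
  )
```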
### `group_by()` + `summarise()` --- summaries by group
Often you want summaries **per group** rather than for everyone. `group_by()` splits the data, and `summarise()` computes within each group:
```{r}
diabetes |>
  group_by(diabetes_5y) |>
  summarise(
    n = n(),
    mean_glucose = mean(glucose_mg_dl, na.rm = TRUE),
    mean_bmi = mean(bmi, na.rm = TRUE)
  )
```
The `pos` group has higher mean glucose --- a good sanity check that the data aligns with clinical expectations.
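`group_by()` also accepts several variables, producing one summary row per combination of their values. A sketch (not run):

```{r}
#| eval: false
diabetes |>
  mutate(age_group = if_else(age >= 50, "50+", "Under 50")) |>
  group_by(diabetes_5y, age_group) |>
  summarise(mean_bmi = mean(bmi, na.rm = TRUE), .groups = "drop")
```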
::: {.callout-note}
## What about `.groups = "drop"`?
By default, `summarise()` drops only the most recent grouping variable, so with multiple grouping variables the result may still be grouped. Adding `.groups = "drop"` inside `summarise()` removes all grouping:
```{r}
#| eval: false
diabetes |>
  group_by(diabetes_5y) |>
  summarise(n = n(), .groups = "drop")
```
This is good practice when you plan to pipe the result into further operations. If you see a message like `summarise()` has grouped output by ..., that's dplyr reminding you about this.
:::
### The `count()` shortcut
`count()` is a quick way to get frequency tables --- it combines `group_by()` + `summarise(n = n())` in one step:
```{r}
diabetes |> count(diabetes_5y)
```
You can count combinations of multiple variables:
```{r}
diabetes |>
  mutate(age_group = if_else(age >= 50, "50+", "Under 50")) |>
  count(diabetes_5y, age_group)
```
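`count()` also has a `sort` argument that puts the most frequent category first --- handy for a quick "what dominates this table?" check. A sketch (not run):

```{r}
#| eval: false
diabetes |> count(diabetes_5y, sort = TRUE)
```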
::: {.callout-caution collapse="true" title="Python Comparison"}
Python's pandas equivalents:
```python
# summarise → agg
df.agg(n_patients=("age", "size"), mean_age=("age", "mean"))
# group_by + summarise → groupby + agg
df.groupby("diabetes_5y").agg(
    n=("age", "size"),
    mean_glucose=("glucose_mg_dl", "mean")
).reset_index()
# count → value_counts
df["diabetes_5y"].value_counts()
```
:::
## Building Pipelines {#sec-building-pipelines}
You've been using `|>` to chain single operations. Now let's build **multi-step pipelines** --- the real power of the tidyverse workflow.
### Start small: two verbs
```{r}
diabetes |>
  filter(age >= 50) |>
  select(age, glucose_mg_dl, diabetes_5y)
```
Read it as: *"Take diabetes, **then** keep patients 50+, **then** select three columns."*
### Add a step: three verbs
```{r}
diabetes |>
  filter(age >= 50) |>
  mutate(glucose_flag = if_else(glucose_mg_dl >= 140, "High", "Normal")) |>
  count(glucose_flag)
```
Each line does one thing. You can read the pipeline top-to-bottom like a recipe.
### Reading a pipeline aloud
A good test: if you can read your pipeline as an English sentence, it's well-written:
> "Take the diabetes data, keep patients aged 50 and older, create a glucose flag (high if >= 140, normal otherwise), then count how many are in each group."
::: {.callout-tip}
## One verb per line
Write each verb on its own line. This makes pipelines easy to read, debug, and modify:
```{r}
#| eval: false
# Good — one verb per line
diabetes |>
  filter(age >= 50) |>
  mutate(glucose_flag = if_else(glucose_mg_dl >= 140, "High", "Normal")) |>
  count(glucose_flag)
# Hard to read — everything on one line
diabetes |> filter(age >= 50) |> mutate(glucose_flag = if_else(glucose_mg_dl >= 140, "High", "Normal")) |> count(glucose_flag)
```
:::
::: {.callout-warning}
## Forgetting `|>` between lines
If you forget a pipe, R treats the next line as a separate (incomplete) command:
```{r}
#| eval: false
# BUG: missing |> after filter()
diabetes |>
  filter(age >= 50)
select(age, glucose_mg_dl) # Error!
```
If you see an "unexpected" error, check that every line (except the last) ends with `|>`.
:::
## Tidy Data: Wide vs Long {#sec-tidy-data}
Before we move on, there's one more concept you'll encounter often: **tidy data**. A dataset is **tidy** when:
1. Each **variable** has its own column
2. Each **observation** has its own row
3. Each **value** has its own cell
Why does this matter? Because tidy data makes plotting and grouped analysis much easier. The dplyr verbs we've learned (`filter()`, `mutate()`, and friends) --- and the `ggplot2` plotting tools coming in the next chapter --- all work best with tidy format.
### Example: Blood pressure over multiple days
Consider three patients with blood pressure measured on three days:
```{r}
bp_wide <- tibble(
  patient_id = c("P001", "P002", "P003"),
  bp_day1 = c(142, 130, 150),
  bp_day2 = c(138, 128, 147),
  bp_day3 = c(135, 126, 145)
)
bp_wide
```
This is **wide format** --- each day is a separate column. It's readable for humans but awkward for analysis: there's no single "blood pressure" column to plot or summarize.
### `pivot_longer()` --- wide to long
`pivot_longer()` reshapes wide data into tidy (long) format:
```{r}
bp_long <- bp_wide |>
  pivot_longer(
    cols = starts_with("bp_day"),
    names_to = "day",
    values_to = "systolic_bp"
  )
bp_long
```
Let's break down the arguments:
- **`cols`** --- which columns to reshape (here, all columns starting with `"bp_day"`)
- **`names_to`** --- name for the new column that will hold the old column names
- **`values_to`** --- name for the new column that will hold the values
Now we have a proper `systolic_bp` variable --- one row per measurement. In @sec-ggplot2, this format will make plotting straightforward.
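A small refinement worth knowing: the `names_prefix` argument of `pivot_longer()` strips matching text from the start of each reshaped column name, so `day` would hold `"day1"`, `"day2"`, `"day3"` instead of `"bp_day1"`, and so on. A sketch (not run):

```{r}
#| eval: false
bp_wide |>
  pivot_longer(
    cols = starts_with("bp_day"),
    names_to = "day",
    names_prefix = "bp_",
    values_to = "systolic_bp"
  )
```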
### `pivot_wider()` --- long to wide (brief)
The reverse operation converts long format back to wide:
```{r}
bp_long |>
  pivot_wider(
    names_from = day,
    values_from = systolic_bp
  )
```
You'll use `pivot_wider()` less often, but it's good to know it exists.
::: {.callout-note}
## Why tidy format matters
Most R tools --- `ggplot2` for plotting, `dplyr` for summaries, statistical functions --- expect tidy data. When your data isn't in the right shape, `pivot_longer()` and `pivot_wider()` are the tools to fix it.
:::
## Common Errors and Troubleshooting
Here are errors you'll likely encounter when working with data frames and dplyr:
| Error Message | What It Means | Fix |
|---|---|---|
| `object 'colname' not found` | Column name has a typo | Check `names(data)` for the correct spelling |
| `could not find function "select"` | `dplyr` not loaded | Add `library(tidyverse)` at the top |
| `` Error in `filter()`: must be a logical vector `` | Used `=` instead of `==` in a condition | Use `filter(age == 50)`, not `filter(age = 50)` |
| `'file.csv' does not exist` | Wrong file path | Check `getwd()`, adjust the path |
| `Column 'x' doesn't exist` | Wrong column name after `clean_names()` | Run `names(data)` to see cleaned names |
| `unexpected ')'` | Missing or extra parenthesis | Count your opening and closing parens |
: Common data manipulation errors and their fixes {.striped}
::: {.callout-tip}
## The debugging habit (data edition)
When a dplyr pipeline fails:
1. **Run it line by line** --- add one verb at a time and check the output
2. **Check column names** --- a typo is the most common cause
3. **Check column types** --- `glimpse(data)` shows whether a column is numeric or character
:::
## Summary
Here's the journey we took with the diabetes registry:
| Step | What we did | R tool |
|---|---|---|
| Orient | Explored structure of a table | `class()`, `dim()`, `names()`, `glimpse()` |
| Categorize | Created and ordered factors | `factor()`, `fct_relevel()` |
| Import | Loaded the CSV file | `read_csv()` |
| Clean | Standardized column names | `clean_names()` |
| Select | Chose relevant columns | `select()`, `rename()` |
| Filter | Kept rows by condition | `filter()`, `arrange()`, `desc()` |
| Transform | Created BMI class and glucose flag | `mutate()`, `if_else()`, `case_when()` |
| Summarize | Compared groups | `group_by()`, `summarise()`, `count()` |
| Reshape | Converted wide to long | `pivot_longer()`, `pivot_wider()` |
: Chapter 4 journey --- from raw file to analysis-ready table {.striped}
These data manipulation skills are the foundation for everything that follows. In the next chapter, we'll use these same tools to prepare data for **visualization with ggplot2** --- turning numbers into insights you can see.
## Exercises {.unnumbered}
1. **Quick import check.** Import `diabetes.csv` using `read_csv()`, clean the column names, then report: the number of rows and columns, the column names, and one variable that contains missing values. *(Hint: `summary()` shows `NA` counts.)*
2. **Focused subset for clinic handout.** Starting from the cleaned diabetes table, select only `age`, `bmi`, `glucose_mg_dl`, and `diabetes_5y`. Keep patients aged 40 and older, and sort by descending glucose. Show the first 10 rows.
3. **Blood pressure classification.** Use `mutate()` with `case_when()` to classify `dbp_mm_hg` into:
    - `"Normal"` (< 80)
    - `"Elevated"` (80--89)
    - `"High"` (90 or above)
    - Use `.default` for any remaining values

    Then use `count()` to see how many patients fall in each category.
4. **Cohort comparison pipeline.** Build a pipeline that:
    a. Filters to patients aged 30 and above
    b. Creates a `glucose_flag` using `if_else()` (High if `glucose_mg_dl >= 140`, Normal otherwise)
    c. Groups by `diabetes_5y` and `glucose_flag`
    d. Summarizes the count and mean BMI per group

    Read your pipeline aloud --- does it make sense as an English sentence?