2  Data Frames and Data Manipulation

In the previous chapter, we learned R’s grammar — variables, vectors, functions, and pipes. Now we move from language basics to real data work: the kind of workflow you’ll use daily in clinical research and quality improvement projects.

By the end of this chapter, you’ll be able to import a CSV file, inspect it, clean it, transform it, and summarize it — all with readable, reproducible code.

Note: Our running example — preparing a diabetes registry

Imagine a colleague sends you a file from a diabetes screening program: diabetes.csv with 768 patient records. Your job is to turn this raw file into a clean, analysis-ready table.

Along the way, you’ll learn how to inspect, import, clean, transform, and summarize tabular data — the essential skills for any data analysis project.

2.1 What Is a Data Frame?

In Section 1.3.1, you learned that a vector holds a single variable — one column of data. A data frame is a collection of vectors arranged side-by-side, forming a table with rows and columns. Each column is a variable, each row is an observation (e.g., one patient).

| Structure  | What it holds                | Example                                  |
|------------|------------------------------|------------------------------------------|
| Vector     | One variable                 | c(55, 62, 47)                            |
| Data frame | Multiple variables (a table) | A table with age, sex, glucose columns   |

2.1.1 Creating a data frame by hand

You can create a data frame with data.frame():

patients <- data.frame(
  patient_id    = c("P001", "P002", "P003"),
  age           = c(55, 62, 47),
  sex           = c("Female", "Male", "Female"),
  glucose_mg_dl = c(145, 132, 178)
)
patients
  patient_id age    sex glucose_mg_dl
1       P001  55 Female           145
2       P002  62   Male           132
3       P003  47 Female           178

Each argument is a column. Notice that the columns are vectors of the same length — one value per patient.

2.1.2 Tibble: the tidyverse data frame

The tidyverse provides tibble(), a modern version of data.frame(). It prints more neatly and avoids some unexpected behaviors:

library(tidyverse)
patients <- tibble(
  patient_id    = c("P001", "P002", "P003"),
  age           = c(55, 62, 47),
  sex           = c("Female", "Male", "Female"),
  glucose_mg_dl = c(145, 132, 178)
)
patients
# A tibble: 3 × 4
  patient_id   age sex    glucose_mg_dl
  <chr>      <dbl> <chr>          <dbl>
1 P001          55 Female           145
2 P002          62 Male             132
3 P003          47 Female           178

A tibble shows its dimensions (3 × 4) and column types (<chr>, <dbl>) at the top — very handy for quick orientation.

Tip: The 60-second data orientation

When you encounter any new table, ask three questions:

  1. What kind of object is this? — class(data)
  2. How big is it? — dim(data) (rows × columns)
  3. What are the variable names? — names(data)
class(patients)
[1] "tbl_df"     "tbl"        "data.frame"
dim(patients)
[1] 3 4
names(patients)
[1] "patient_id"    "age"           "sex"           "glucose_mg_dl"

This quick routine catches most early problems before they become hard-to-trace bugs downstream.

2.1.3 Exploring structure with glimpse()

For a richer overview, glimpse() shows every column’s name, type, and a few sample values — all in one compact display:

glimpse(patients)
Rows: 3
Columns: 4
$ patient_id    <chr> "P001", "P002", "P003"
$ age           <dbl> 55, 62, 47
$ sex           <chr> "Female", "Male", "Female"
$ glucose_mg_dl <dbl> 145, 132, 178

You’ll use glimpse() constantly. It’s one of the most useful functions in the tidyverse.

2.2 Warm-Up: Exploring a Packaged Dataset

Before touching our diabetes file, let’s practice with a small dataset that comes pre-loaded in an R package. The scurvy dataset from the medicaldata package records a historic clinical trial on scurvy treatments (12 patients, 8 variables).

data("scurvy", package = "medicaldata")

Apply the 60-second orientation:

class(scurvy)
[1] "tbl_df"     "tbl"        "data.frame"
dim(scurvy)
[1] 12  8
names(scurvy)
[1] "study_id"                  "treatment"                
[3] "dosing_regimen_for_scurvy" "gum_rot_d6"               
[5] "skin_sores_d6"             "weakness_of_the_knees_d6" 
[7] "lassitude_d6"              "fit_for_duty_d6"          
glimpse(scurvy)
Rows: 12
Columns: 8
$ study_id                  <chr> "001", "002", "003", "004", "005", "006", "0…
$ treatment                 <fct> cider, cider, dilute_sulfuric_acid, dilute_s…
$ dosing_regimen_for_scurvy <chr> "1 quart per day", "1 quart per day", "25 dr…
$ gum_rot_d6                <fct> 2_moderate, 2_moderate, 1_mild, 2_moderate, …
$ skin_sores_d6             <fct> 2_moderate, 1_mild, 3_severe, 3_severe, 3_se…
$ weakness_of_the_knees_d6  <fct> 2_moderate, 2_moderate, 3_severe, 3_severe, …
$ lassitude_d6              <fct> 2_moderate, 3_severe, 3_severe, 3_severe, 3_…
$ fit_for_duty_d6           <fct> 0_no, 0_no, 0_no, 0_no, 0_no, 0_no, 0_no, 0_…

Notice the column types: <chr> for character, <fct> for factor. We’ll explain factors next — they’re important for clinical data.

Python’s equivalent of a data frame is the pandas DataFrame:

import pandas as pd

# Create a DataFrame
patients = pd.DataFrame({
    "patient_id": ["P001", "P002", "P003"],
    "age": [55, 62, 47],
    "sex": ["Female", "Male", "Female"],
    "glucose_mg_dl": [145, 132, 178]
})

# Orientation
patients.shape    # (3, 4) — like dim()
patients.columns  # column names — like names()
patients.dtypes   # column types — like glimpse()
patients.info()   # combined overview

2.3 Factors in Practice

In Section 1.3.3, you learned that a factor is R’s type for categorical data — a character variable with a fixed set of levels. You also learned to create factors with factor() and control level order with the levels argument.

Now let’s learn a tidyverse tool that makes reordering levels more convenient: fct_relevel().

2.3.1 Reordering levels with fct_relevel()

The forcats package (loaded with library(tidyverse)) provides fct_relevel() to reorder factor levels after a factor has been created. This is especially useful when you receive data where the factor order isn’t what you need:

# Default: alphabetical order
outcome <- factor(c("pos", "neg", "pos", "neg"))
outcome
[1] pos neg pos neg
Levels: neg pos
# Reorder: move "pos" to the front (the default alphabetical order
# already puts "neg" first, so let's make a visible change)
outcome <- fct_relevel(outcome, "pos")
outcome
[1] pos neg pos neg
Levels: pos neg

fct_relevel() takes the factor and the levels in your desired order. You’ll use this frequently when preparing data for plots and statistical models, because the first level controls:

  • Plot axis/legend order — categories appear in level order
  • Table row order — summary tables follow level order
  • Statistical reference category — the first level is the baseline in regression
Tip: factor() vs fct_relevel() — when to use which?
  • factor(x, levels = ...) — when creating a factor from scratch (e.g., from a character vector). You learned this in Section 1.3.3.
  • fct_relevel(x, ...) — when reordering an existing factor (e.g., a column that’s already a factor but in the wrong order).

Both achieve the same result; fct_relevel() is simply more convenient for existing factors.
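For Python users, pandas Categorical plays the role of R's factor. A small sketch of reordering categories (the data here is illustrative):

```python
import pandas as pd

# A categorical series; categories default to alphabetical order
outcome = pd.Series(["pos", "neg", "pos", "neg"], dtype="category")
print(list(outcome.cat.categories))  # ['neg', 'pos']

# Reorder so "pos" comes first — the analogue of fct_relevel(outcome, "pos")
outcome = outcome.cat.reorder_categories(["pos", "neg"])
print(list(outcome.cat.categories))  # ['pos', 'neg']
```

Unlike fct_relevel(), reorder_categories() requires you to list every category, not just the ones you want moved to the front.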

2.4 Importing Data from Files

In real projects, data arrives as files — CSV, Excel, or other formats. Let’s import our diabetes dataset.

2.4.1 Reading CSV with read_csv()

The readr package (part of the tidyverse) provides read_csv() for reading CSV files:

diabetes_raw <- read_csv("../../data/diabetes.csv", show_col_types = FALSE)

Let’s break down the syntax:

  • read_csv("path/to/file.csv") — reads the file at the given path and returns a tibble
  • show_col_types = FALSE — suppresses the column-type message that read_csv() prints by default. Useful for keeping output clean.
Note: File paths in a Quarto book

The path "../../data/diabetes.csv" means “go up two directories from the chapter file, then into data/.” In your own R scripts, adjust the path to match where your file lives. Using RStudio Projects (as covered in Section 4) makes paths simpler and more portable.

2.4.2 Inspecting after import

Always check your data right after import:

dim(diabetes_raw)
[1] 768   9
names(diabetes_raw)
[1] "pregnancy_num"      "glucose_mg-dl"      "dbp_mm-hg"         
[4] "triceps_mm"         "insulin_microiu-ml" "bmi"               
[7] "pedigree"           "age"                "diabetes_5y"       
glimpse(diabetes_raw)
Rows: 768
Columns: 9
$ pregnancy_num        <dbl> 6, 1, 8, 1, 0, 5, 3, 10, 2, 8, 4, 10, 10, 1, 5, 7…
$ `glucose_mg-dl`      <dbl> 148, 85, 183, 89, 137, 116, 78, 115, 197, 125, 11…
$ `dbp_mm-hg`          <dbl> 72, 66, 64, 66, 40, 74, 50, NA, 70, 96, 92, 74, 8…
$ triceps_mm           <dbl> 35, 29, NA, 23, 35, NA, 32, NA, 45, NA, NA, NA, N…
$ `insulin_microiu-ml` <dbl> NA, NA, NA, 94, 168, NA, 88, NA, 543, NA, NA, NA,…
$ bmi                  <dbl> 33.6, 26.6, 23.3, 28.1, 43.1, 25.6, 31.0, 35.3, 3…
$ pedigree             <dbl> 0.627, 0.351, 0.672, 0.167, 2.288, 0.201, 0.248, …
$ age                  <dbl> 50, 31, 32, 21, 33, 30, 26, 29, 53, 54, 30, 34, 5…
$ diabetes_5y          <chr> "pos", "neg", "pos", "neg", "pos", "neg", "pos", …

We have 768 rows and 9 columns. But notice the column names: glucose_mg-dl, dbp_mm-hg — those hyphens will cause problems in R code.

Tip: The post-import checklist

Right after reading any file, run these four checks:

  1. dim(data) — expected number of rows and columns?
  2. names(data) — column names sensible?
  3. glimpse(data) — types correct? (numbers as <dbl>, text as <chr>)
  4. summary(data) — any suspicious min/max values or unexpected NAs?

2.4.3 Cleaning column names with clean_names()

The janitor package provides clean_names(), which converts column names to consistent snake_case — replacing hyphens, spaces, and dots with underscores:

library(janitor)

diabetes <- diabetes_raw |> clean_names()
names(diabetes)
[1] "pregnancy_num"      "glucose_mg_dl"      "dbp_mm_hg"         
[4] "triceps_mm"         "insulin_microiu_ml" "bmi"               
[7] "pedigree"           "age"                "diabetes_5y"       

Now glucose_mg-dl has become glucose_mg_dl — much easier to type and less error-prone.

Note: Why clean names matter

R can handle column names with hyphens (e.g., diabetes$`glucose_mg-dl`), but you need backticks every time. Clean snake_case names let you write diabetes$glucose_mg_dl without any special syntax.
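For Python users, a rough pandas analogue of clean_names() is to normalize the column index by hand (the pyjanitor package also offers a clean_names() modeled on janitor, but the one-liner below needs no extra dependency; the column names are illustrative):

```python
import pandas as pd

# Hypothetical columns in the style of the raw diabetes file
df = pd.DataFrame({"glucose_mg-dl": [148, 85], "dbp_mm-hg": [72, 66]})

# Normalize to snake_case: lowercase, then replace hyphens/dots/spaces
df.columns = (
    df.columns.str.lower()
      .str.replace(r"[-.\s]+", "_", regex=True)
)
print(list(df.columns))  # ['glucose_mg_dl', 'dbp_mm_hg']
```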

2.4.4 Reading Excel files (brief)

For Excel files, the readxl package provides read_excel():

library(readxl)

diabetes_xlsx <- read_excel("../../data/diabetes.xlsx")
glimpse(diabetes_xlsx)
Rows: 768
Columns: 9
$ pregnancy_num        <dbl> 6, 1, 8, 1, 0, 5, 3, 10, 2, 8, 4, 10, 10, 1, 5, 7…
$ `glucose_mg-dl`      <dbl> 148, 85, 183, 89, 137, 116, 78, 115, 197, 125, 11…
$ `dbp_mm-hg`          <dbl> 72, 66, 64, 66, 40, 74, 50, NA, 70, 96, 92, 74, 8…
$ triceps_mm           <dbl> 35, 29, NA, 23, 35, NA, 32, NA, 45, NA, NA, NA, N…
$ `insulin_microiu-ml` <dbl> NA, NA, NA, 94, 168, NA, 88, NA, 543, NA, NA, NA,…
$ bmi                  <dbl> 33.6, 26.6, 23.3, 28.1, 43.1, 25.6, 31.0, 35.3, 3…
$ pedigree             <dbl> 0.627, 0.351, 0.672, 0.167, 2.288, 0.201, 0.248, …
$ age                  <dbl> 50, 31, 32, 21, 33, 30, 26, 29, 53, 54, 30, 34, 5…
$ diabetes_5y          <chr> "pos", "neg", "pos", "neg", "pos", "neg", "pos", …

You can also specify a sheet or cell range:

read_excel("data/file.xlsx", sheet = "Sheet2")
read_excel("data/file.xlsx", range = "A1:D20")
Warning: Common import pitfalls

| Problem              | Symptom                              | Fix                                                   |
|----------------------|--------------------------------------|-------------------------------------------------------|
| Wrong file path      | Error: 'file.csv' does not exist     | Check getwd(), use the RStudio file browser           |
| Numbers read as text | Column shows <chr> instead of <dbl>  | Check for text like "N/A" in numeric columns          |
| Unexpected NAs       | More NAs than expected               | Source file uses blanks or custom codes (-, 999)      |
| Encoding issues      | Garbled characters                   | Try read_csv(..., locale = locale(encoding = "UTF-8")) |

Python’s pandas uses similar functions:

import pandas as pd

# CSV
df = pd.read_csv("data/diabetes.csv")

# Excel
df = pd.read_excel("data/diabetes.xlsx", sheet_name="Sheet1")

# Inspect
df.shape
df.columns
df.info()
df.describe()

2.5 The dplyr Toolkit: Overview

Now that our data is imported and cleaned, it’s time to manipulate it — selecting columns, filtering rows, creating new variables, and computing summaries.

The dplyr package (part of the tidyverse) provides a set of verbs — functions named after what they do. Before diving into each one, here’s the big picture:

Core dplyr verbs organized by purpose

| Category  | Verb        | What it does                                |
|-----------|-------------|---------------------------------------------|
| Columns   | select()    | Choose which columns to keep                |
| Columns   | rename()    | Change column names                         |
| Rows      | filter()    | Keep rows that match a condition            |
| Rows      | arrange()   | Sort rows by a column                       |
| Create    | mutate()    | Add or modify columns                       |
| Aggregate | summarise() | Collapse many rows into a summary           |
| Aggregate | group_by()  | Split data into groups before summarizing   |
| Aggregate | count()     | Quick frequency table                       |

All dplyr verbs follow the same pattern:

data |> verb(arguments)

The first argument is always the data (passed via |>), and you refer to column names without quotes.

2.6 Selecting and Renaming Columns

2.6.1 select() — choose columns

Use select() to keep only the columns you need:

diabetes |>
  select(age, bmi, glucose_mg_dl, diabetes_5y) |>
  head()
# A tibble: 6 × 4
    age   bmi glucose_mg_dl diabetes_5y
  <dbl> <dbl>         <dbl> <chr>      
1    50  33.6           148 pos        
2    31  26.6            85 neg        
3    32  23.3           183 pos        
4    21  28.1            89 neg        
5    33  43.1           137 pos        
6    30  25.6           116 neg        

You can also exclude columns with -:

diabetes |>
  select(-pedigree, -pregnancy_num) |>
  head()
# A tibble: 6 × 7
  glucose_mg_dl dbp_mm_hg triceps_mm insulin_microiu_ml   bmi   age diabetes_5y
          <dbl>     <dbl>      <dbl>              <dbl> <dbl> <dbl> <chr>      
1           148        72         35                 NA  33.6    50 pos        
2            85        66         29                 NA  26.6    31 neg        
3           183        64         NA                 NA  23.3    32 pos        
4            89        66         23                 94  28.1    21 neg        
5           137        40         35                168  43.1    33 pos        
6           116        74         NA                 NA  25.6    30 neg        
Tip: Column selection helpers

When you have many columns with similar names, these helpers save typing:

  • starts_with("bp") — columns starting with “bp”
  • ends_with("_ml") — columns ending with “_ml”
  • contains("glucose") — columns containing “glucose”
diabetes |> select(starts_with("insulin"))

2.6.2 rename() — relabel columns

Use rename() to give columns clearer names. The syntax is new_name = old_name:

diabetes |>
  rename(outcome_5y = diabetes_5y) |>
  head(3)
# A tibble: 3 × 9
  pregnancy_num glucose_mg_dl dbp_mm_hg triceps_mm insulin_microiu_ml   bmi
          <dbl>         <dbl>     <dbl>      <dbl>              <dbl> <dbl>
1             6           148        72         35                 NA  33.6
2             1            85        66         29                 NA  26.6
3             8           183        64         NA                 NA  23.3
# ℹ 3 more variables: pedigree <dbl>, age <dbl>, outcome_5y <chr>
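For Python users, rough pandas analogues of select() and rename() (the small table here is illustrative):

```python
import pandas as pd

df = pd.DataFrame({
    "age": [50, 31], "bmi": [33.6, 26.6],
    "glucose_mg_dl": [148, 85], "diabetes_5y": ["pos", "neg"],
})

# select(age, bmi, ...) → index with a list of column names
subset = df[["age", "bmi", "glucose_mg_dl", "diabetes_5y"]]

# select(-bmi) → drop()
dropped = df.drop(columns=["bmi"])

# rename(new = old) → rename(columns={old: new}) — note the reversed order
renamed = df.rename(columns={"diabetes_5y": "outcome_5y"})
print(list(renamed.columns))  # ['age', 'bmi', 'glucose_mg_dl', 'outcome_5y']
```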

2.7 Filtering and Arranging Rows

2.7.1 filter() — keep rows by condition

filter() keeps only the rows where your condition is TRUE. This uses the comparison operators you learned in Section 1.4:

# Patients with high glucose (>= 140 mg/dL)
diabetes |>
  filter(glucose_mg_dl >= 140) |>
  head()
# A tibble: 6 × 9
  pregnancy_num glucose_mg_dl dbp_mm_hg triceps_mm insulin_microiu_ml   bmi
          <dbl>         <dbl>     <dbl>      <dbl>              <dbl> <dbl>
1             6           148        72         35                 NA  33.6
2             8           183        64         NA                 NA  23.3
3             2           197        70         45                543  30.5
4            10           168        74         NA                 NA  38  
5             1           189        60         23                846  30.1
6             5           166        72         19                175  25.8
# ℹ 3 more variables: pedigree <dbl>, age <dbl>, diabetes_5y <chr>

2.7.2 Combining conditions in filter()

Use & (AND) or , (comma, same as AND) to require multiple conditions. Use | (OR) for either-or:

# High glucose AND age over 50
diabetes |>
  filter(glucose_mg_dl >= 140, age > 50) |>
  head()
# A tibble: 6 × 9
  pregnancy_num glucose_mg_dl dbp_mm_hg triceps_mm insulin_microiu_ml   bmi
          <dbl>         <dbl>     <dbl>      <dbl>              <dbl> <dbl>
1             2           197        70         45                543  30.5
2             1           189        60         23                846  30.1
3             5           166        72         19                175  25.8
4            11           143        94         33                146  36.6
5            13           145        82         19                110  22.2
6             9           171       110         24                240  45.4
# ℹ 3 more variables: pedigree <dbl>, age <dbl>, diabetes_5y <chr>
# Very young OR very old patients
diabetes |>
  filter(age < 25 | age > 65) |>
  head()
# A tibble: 6 × 9
  pregnancy_num glucose_mg_dl dbp_mm_hg triceps_mm insulin_microiu_ml   bmi
          <dbl>         <dbl>     <dbl>      <dbl>              <dbl> <dbl>
1             1            89        66         23                 94  28.1
2             1            97        66         15                140  23.2
3             3            88        58         11                 54  24.8
4             2            71        70         27                 NA  28  
5             7           105        NA         NA                 NA  NA  
6             1           103        80         11                 82  19.4
# ℹ 3 more variables: pedigree <dbl>, age <dbl>, diabetes_5y <chr>

2.7.3 arrange() — sort rows

arrange() sorts rows by a column in ascending order:

diabetes |>
  arrange(glucose_mg_dl) |>
  head()
# A tibble: 6 × 9
  pregnancy_num glucose_mg_dl dbp_mm_hg triceps_mm insulin_microiu_ml   bmi
          <dbl>         <dbl>     <dbl>      <dbl>              <dbl> <dbl>
1             5            44        62         NA                 NA  25  
2             2            56        56         28                 45  24.2
3             9            57        80         37                 NA  32.8
4             0            57        60         NA                 NA  21.7
5             3            61        82         28                 NA  34.4
6             7            62        78         NA                 NA  32.6
# ℹ 3 more variables: pedigree <dbl>, age <dbl>, diabetes_5y <chr>

2.7.4 Sorting in descending order with desc()

Wrap a column in desc() to sort from highest to lowest:

diabetes |>
  arrange(desc(glucose_mg_dl)) |>
  head()
# A tibble: 6 × 9
  pregnancy_num glucose_mg_dl dbp_mm_hg triceps_mm insulin_microiu_ml   bmi
          <dbl>         <dbl>     <dbl>      <dbl>              <dbl> <dbl>
1             1           199        76         43                 NA  42.9
2             0           198        66         32                274  41.3
3             2           197        70         45                543  30.5
4             4           197        70         39                744  36.7
5             8           197        74         NA                 NA  25.9
6             2           197        70         99                 NA  34.7
# ℹ 3 more variables: pedigree <dbl>, age <dbl>, diabetes_5y <chr>

desc() simply means “descending.” You’ll use it whenever you want the largest (or latest) values first.
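For Python users, rough pandas analogues of filter() and arrange() (the data here is illustrative):

```python
import pandas as pd

df = pd.DataFrame({
    "age": [50, 31, 62, 21],
    "glucose_mg_dl": [148, 85, 166, 89],
})

# filter(glucose_mg_dl >= 140, age > 40) → boolean masks joined with &
high = df[(df["glucose_mg_dl"] >= 140) & (df["age"] > 40)]

# arrange(desc(glucose_mg_dl)) → sort_values(..., ascending=False)
sorted_df = df.sort_values("glucose_mg_dl", ascending=False)
print(sorted_df["glucose_mg_dl"].tolist())  # [166, 148, 89, 85]
```

Note the parentheses around each mask: & binds tighter than >= in Python, so they are required.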

2.8 Creating New Variables with mutate()

mutate() adds new columns (or modifies existing ones) based on calculations or conditions.

2.8.1 Simple derived columns

diabetes |>
  mutate(age_months = age * 12) |>
  select(age, age_months) |>
  head()
# A tibble: 6 × 2
    age age_months
  <dbl>      <dbl>
1    50        600
2    31        372
3    32        384
4    21        252
5    33        396
6    30        360

The new column age_months is computed from the existing age column.

2.8.2 Two-way classification with if_else()

Often you need to create a variable with two categories: “yes or no”, “high or normal”, etc. The if_else() function handles this:

if_else(condition, true_value, false_value)

Let’s break it down:

  • condition — a logical test applied to each row (e.g., glucose_mg_dl >= 140)
  • true_value — the value to assign when the condition is TRUE
  • false_value — the value to assign when the condition is FALSE
diabetes |>
  mutate(glucose_flag = if_else(glucose_mg_dl >= 140, "High", "Normal")) |>
  select(glucose_mg_dl, glucose_flag) |>
  head()
# A tibble: 6 × 2
  glucose_mg_dl glucose_flag
          <dbl> <chr>       
1           148 High        
2            85 Normal      
3           183 High        
4            89 Normal      
5           137 Normal      
6           116 Normal      
Tip: if_else() vs ifelse()

In Section 1.6, you learned base R’s ifelse(). The dplyr version if_else() (with an underscore) is stricter — it checks that the true and false values are the same type, catching subtle bugs early. Use if_else() inside mutate().

2.8.3 Multi-way classification with case_when()

When you have more than two categories, case_when() is the tool. Think of it as a vectorized version of if / else if / else from Section 1.6:

case_when(
  condition1 ~ value1,
  condition2 ~ value2,
  condition3 ~ value3,
  .default = fallback_value
)

Let’s break it down:

  • Each line is a condition ~ value pair (read ~ as “then”)
  • Conditions are checked in order — the first TRUE match wins
  • .default — the value when no condition matches (like the final else)

Here’s a clinical example — classifying BMI:

diabetes |>
  mutate(
    bmi_class = case_when(
      bmi < 18.5 ~ "Underweight",
      bmi < 25   ~ "Normal",
      bmi < 30   ~ "Overweight",
      .default   = "Obesity"
    )
  ) |>
  select(bmi, bmi_class) |>
  head(8)
# A tibble: 8 × 2
    bmi bmi_class 
  <dbl> <chr>     
1  33.6 Obesity   
2  26.6 Overweight
3  23.3 Normal    
4  28.1 Overweight
5  43.1 Obesity   
6  25.6 Overweight
7  31   Obesity   
8  35.3 Obesity   

A patient with BMI 24.0 doesn’t match < 18.5, but does match < 25, so they get "Normal". The order matters — just like if / else if in Section 1.6.

Warning: The .default argument

In older R code and tutorials, you may see TRUE ~ "Obesity" instead of .default = "Obesity". Both work, but .default is the modern, clearer syntax. We use .default throughout this book.

In Python, NumPy’s np.where() handles two-way classification (like if_else()), and np.select() or pd.cut() handle multi-way (like case_when()):

import numpy as np

# Two-way (like if_else)
df["glucose_flag"] = np.where(df["glucose_mg_dl"] >= 140, "High", "Normal")

# Multi-way (like case_when)
conditions = [
    df["bmi"] < 18.5,
    df["bmi"] < 25,
    df["bmi"] < 30,
]
choices = ["Underweight", "Normal", "Overweight"]
df["bmi_class"] = np.select(conditions, choices, default="Obesity")

2.9 Summarizing Data

2.9.1 summarise() — collapse to one row

summarise() (or summarize() — both spellings work) computes summary statistics, collapsing the entire table into a single row:

diabetes |>
  summarise(
    n_patients   = n(),
    mean_age     = mean(age, na.rm = TRUE),
    mean_glucose = mean(glucose_mg_dl, na.rm = TRUE)
  )
# A tibble: 1 × 3
  n_patients mean_age mean_glucose
       <int>    <dbl>        <dbl>
1        768     33.2         122.

Let’s break this down:

  • n() — counts the number of rows (patients)
  • mean(column, na.rm = TRUE) — computes the mean, removing any NA values (recall na.rm from Section 1.5)
  • Each name = expression pair becomes a column in the output

2.9.2 group_by() + summarise() — summaries by group

Often you want summaries per group rather than for everyone. group_by() splits the data, and summarise() computes within each group:

diabetes |>
  group_by(diabetes_5y) |>
  summarise(
    n            = n(),
    mean_glucose = mean(glucose_mg_dl, na.rm = TRUE),
    mean_bmi     = mean(bmi, na.rm = TRUE)
  )
# A tibble: 2 × 4
  diabetes_5y     n mean_glucose mean_bmi
  <chr>       <int>        <dbl>    <dbl>
1 neg           500         111.     30.9
2 pos           268         142.     35.4

The pos group has higher mean glucose — a good sanity check that the data aligns with clinical expectations.

Note: What about .groups = "drop"?

After group_by() |> summarise(), the result may retain grouping. Adding .groups = "drop" at the end of summarise() removes it:

diabetes |>
  group_by(diabetes_5y) |>
  summarise(n = n(), .groups = "drop")

This is good practice when you plan to pipe the result into further operations. If you see a message about “regrouping,” that’s R reminding you about this.

2.9.3 The count() shortcut

count() is a quick way to get frequency tables — it combines group_by() + summarise(n = n()) in one step:

diabetes |> count(diabetes_5y)
# A tibble: 2 × 2
  diabetes_5y     n
  <chr>       <int>
1 neg           500
2 pos           268

You can count combinations of multiple variables:

diabetes |>
  mutate(age_group = if_else(age >= 50, "50+", "Under 50")) |>
  count(diabetes_5y, age_group)
# A tibble: 4 × 3
  diabetes_5y age_group     n
  <chr>       <chr>     <int>
1 neg         50+          46
2 neg         Under 50    454
3 pos         50+          43
4 pos         Under 50    225

Python’s pandas equivalents:

# summarise → agg
df.agg(n_patients=("age", "size"), mean_age=("age", "mean"))

# group_by + summarise → groupby + agg
df.groupby("diabetes_5y").agg(
    n=("age", "size"),
    mean_glucose=("glucose_mg_dl", "mean")
).reset_index()

# count → value_counts
df["diabetes_5y"].value_counts()

2.10 Building Pipelines

You’ve been using |> to chain single operations. Now let’s build multi-step pipelines — the real power of the tidyverse workflow.

2.10.1 Start small: two verbs

diabetes |>
  filter(age >= 50) |>
  select(age, glucose_mg_dl, diabetes_5y)
# A tibble: 89 × 3
     age glucose_mg_dl diabetes_5y
   <dbl>         <dbl> <chr>      
 1    50           148 pos        
 2    53           197 pos        
 3    54           125 pos        
 4    57           139 neg        
 5    59           189 pos        
 6    51           166 pos        
 7    50            99 neg        
 8    51           143 pos        
 9    57           145 neg        
10    60           109 neg        
# ℹ 79 more rows

Read it as: “Take diabetes, then keep patients 50+, then select three columns.”

2.10.2 Add a step: three verbs

diabetes |>
  filter(age >= 50) |>
  mutate(glucose_flag = if_else(glucose_mg_dl >= 140, "High", "Normal")) |>
  count(glucose_flag)
# A tibble: 2 × 2
  glucose_flag     n
  <chr>        <int>
1 High            43
2 Normal          46

Each line does one thing. You can read the pipeline top-to-bottom like a recipe.

2.10.3 Reading a pipeline aloud

A good test: if you can read your pipeline as an English sentence, it’s well-written:

“Take the diabetes data, keep patients aged 50 and older, create a glucose flag (high if >= 140, normal otherwise), then count how many are in each group.”

Tip: One verb per line

Write each verb on its own line. This makes pipelines easy to read, debug, and modify:

# Good — one verb per line
diabetes |>
  filter(age >= 50) |>
  mutate(glucose_flag = if_else(glucose_mg_dl >= 140, "High", "Normal")) |>
  count(glucose_flag)

# Hard to read — everything on one line
diabetes |> filter(age >= 50) |> mutate(glucose_flag = if_else(glucose_mg_dl >= 140, "High", "Normal")) |> count(glucose_flag)
Warning: Forgetting |> between lines

If you forget a pipe, R treats the next line as a separate (incomplete) command:

# BUG: missing |> after filter()
diabetes |>
  filter(age >= 50)
  select(age, glucose_mg_dl)   # Error!

If you see an “unexpected” error, check that every line (except the last) ends with |>.
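For Python users, pandas supports a similar top-to-bottom pipeline style via method chaining inside parentheses. A rough analogue of the three-verb pipeline above (the data here is illustrative):

```python
import pandas as pd
import numpy as np

df = pd.DataFrame({
    "age": [55, 62, 47, 71],
    "glucose_mg_dl": [145, 132, 178, 120],
})

# filter → mutate → count, one step per line
result = (
    df.query("age >= 50")
      .assign(glucose_flag=lambda d: np.where(d["glucose_mg_dl"] >= 140,
                                              "High", "Normal"))
      ["glucose_flag"]
      .value_counts()
)
print(result.to_dict())  # {'Normal': 2, 'High': 1}
```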

2.11 Tidy Data: Wide vs Long

Before we move on, there’s one more concept you’ll encounter often: tidy data. A dataset is tidy when:

  1. Each variable has its own column
  2. Each observation has its own row
  3. Each value has its own cell

Why does this matter? Because tidy data makes plotting and grouped analysis much easier. The tools we’ve learned (filter(), mutate(), ggplot2) all expect tidy format.

2.11.1 Example: Blood pressure over multiple days

Consider three patients with blood pressure measured on three days:

bp_wide <- tibble(
  patient_id = c("P001", "P002", "P003"),
  bp_day1    = c(142, 130, 150),
  bp_day2    = c(138, 128, 147),
  bp_day3    = c(135, 126, 145)
)
bp_wide
# A tibble: 3 × 4
  patient_id bp_day1 bp_day2 bp_day3
  <chr>        <dbl>   <dbl>   <dbl>
1 P001           142     138     135
2 P002           130     128     126
3 P003           150     147     145

This is wide format — each day is a separate column. It’s readable for humans but awkward for analysis: there’s no single “blood pressure” column to plot or summarize.

2.11.2 pivot_longer() — wide to long

pivot_longer() reshapes wide data into tidy (long) format:

bp_long <- bp_wide |>
  pivot_longer(
    cols      = starts_with("bp_day"),
    names_to  = "day",
    values_to = "systolic_bp"
  )
bp_long
# A tibble: 9 × 3
  patient_id day     systolic_bp
  <chr>      <chr>         <dbl>
1 P001       bp_day1         142
2 P001       bp_day2         138
3 P001       bp_day3         135
4 P002       bp_day1         130
5 P002       bp_day2         128
6 P002       bp_day3         126
7 P003       bp_day1         150
8 P003       bp_day2         147
9 P003       bp_day3         145

Let’s break down the arguments:

  • cols — which columns to reshape (here, all columns starting with "bp_day")
  • names_to — name for the new column that will hold the old column names
  • values_to — name for the new column that will hold the values

Now we have a proper “systolic_bp” variable — one row per measurement. In Chapter 3, this format will make plotting straightforward.

2.11.3 pivot_wider() — long to wide (brief)

The reverse operation converts long format back to wide:

bp_long |>
  pivot_wider(
    names_from  = day,
    values_from = systolic_bp
  )
# A tibble: 3 × 4
  patient_id bp_day1 bp_day2 bp_day3
  <chr>        <dbl>   <dbl>   <dbl>
1 P001           142     138     135
2 P002           130     128     126
3 P003           150     147     145

You’ll use pivot_wider() less often, but it’s good to know it exists.

Note: Why tidy format matters

Most R tools — ggplot2 for plotting, dplyr for summaries, statistical functions — expect tidy data. When your data isn’t in the right shape, pivot_longer() and pivot_wider() are the tools to fix it.
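For Python users, rough pandas analogues of the two pivots (the data here is illustrative):

```python
import pandas as pd

bp_wide = pd.DataFrame({
    "patient_id": ["P001", "P002"],
    "bp_day1": [142, 130],
    "bp_day2": [138, 128],
})

# pivot_longer() → melt()
bp_long = bp_wide.melt(
    id_vars="patient_id", var_name="day", value_name="systolic_bp"
)
print(len(bp_long))  # 4 — one row per patient-day

# pivot_wider() → pivot(), then reset_index() to flatten back to columns
back = (
    bp_long.pivot(index="patient_id", columns="day", values="systolic_bp")
           .reset_index()
)
print(list(back.columns))  # ['patient_id', 'bp_day1', 'bp_day2']
```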

2.12 Common Errors and Troubleshooting

Here are errors you’ll likely encounter when working with data frames and dplyr:

Common data manipulation errors and their fixes

| Error Message                            | What It Means                          | Fix                                          |
|------------------------------------------|----------------------------------------|----------------------------------------------|
| object 'colname' not found               | Column name has a typo                 | Check names(data) for the correct spelling   |
| could not find function "select"         | dplyr not loaded                       | Add library(tidyverse) at the top            |
| Error in filter(): must be a logical vector | Used = instead of == in a condition | Use filter(age == 50), not filter(age = 50)  |
| 'file.csv' does not exist                | Wrong file path                        | Check getwd(), adjust the path               |
| Column 'x' doesn't exist                 | Wrong column name after clean_names()  | Run names(data) to see cleaned names         |
| unexpected ')'                           | Missing or extra parenthesis           | Count your opening and closing parentheses   |
Tip: The debugging habit (data edition)

When a dplyr pipeline fails:

  1. Run it line by line — add one verb at a time and check the output
  2. Check column names — a typo is the most common cause
  3. Check column types — glimpse(data) shows whether a column is numeric or character

2.13 Summary

Here’s the journey we took with the diabetes registry:

Chapter 2 journey — from raw file to analysis-ready table

| Step       | What we did                          | R tool                                  |
|------------|--------------------------------------|-----------------------------------------|
| Orient     | Explored structure of a table        | class(), dim(), names(), glimpse()      |
| Categorize | Created and ordered factors          | factor(), fct_relevel()                 |
| Import     | Loaded the CSV file                  | read_csv()                              |
| Clean      | Standardized column names            | clean_names()                           |
| Select     | Chose relevant columns               | select(), rename()                      |
| Filter     | Kept rows by condition               | filter(), arrange(), desc()             |
| Transform  | Created BMI class and glucose flag   | mutate(), if_else(), case_when()        |
| Summarize  | Compared groups                      | group_by(), summarise(), count()        |
| Reshape    | Converted wide to long               | pivot_longer(), pivot_wider()           |

These data manipulation skills are the foundation for everything that follows. In the next chapter, we’ll use these same tools to prepare data for visualization with ggplot2 — turning numbers into insights you can see.

Exercises

  1. Quick import check. Import diabetes.csv using read_csv(), clean the column names, then report: the number of rows and columns, the column names, and one variable that contains missing values. (Hint: summary() shows NA counts.)

  2. Focused subset for clinic handout. Starting from the cleaned diabetes table, select only age, bmi, glucose_mg_dl, and diabetes_5y. Keep patients aged 40 and older, and sort by descending glucose. Show the first 10 rows.

  3. Blood pressure classification. Use mutate() with case_when() to classify dbp_mm_hg into:

    • "Normal" (< 80)
    • "Elevated" (80–89)
    • "High" (90 or above)
    • Use .default for any remaining values

    Then use count() to see how many patients fall in each category.

  4. Cohort comparison pipeline. Build a pipeline that:

    1. Filters to patients aged 30 and above
    2. Creates a glucose_flag using if_else() (High if glucose_mg_dl >= 140, Normal otherwise)
    3. Groups by diabetes_5y and glucose_flag
    4. Summarizes the count and mean BMI per group

    Read your pipeline aloud — does it make sense as an English sentence?