4 Descriptive Statistics

Before running any statistical test, you need to describe your data. How old are the patients? What is the distribution of glucose? How much data is missing? Is the cohort balanced across groups?

In this chapter, you’ll learn to compute and present descriptive summaries — the kind you’d include in a methods section, a departmental presentation, or a publication’s “Table 1.”

library(tidyverse)
library(janitor)
library(gtsummary)

Our running example: Summarizing the diabetes cohort

Imagine you’re preparing a summary for a departmental meeting. You need to describe the 768 patients in the diabetes registry:

What are the typical values for age, BMI, and glucose?
How many patients developed diabetes within 5 years?
How much data is missing?
Can you produce a single, publication-ready table?

We’ll answer each of these questions step by step.

4.1 Data Preparation

We load and prepare the diabetes data using the same tools introduced in the earlier chapters:

diabetes_raw <- read_csv("../../data/diabetes.csv", show_col_types = FALSE)

diabetes <- diabetes_raw |>
  clean_names() |>
  mutate(
    diabetes_5y = fct_relevel(diabetes_5y, "neg", "pos"),
    bmi_class = case_when(
      bmi < 18.5 ~ "Underweight",
      bmi < 25   ~ "Normal",
      bmi < 30   ~ "Overweight",
      .default   = "Obesity"
    ) |> factor(levels = c("Underweight", "Normal", "Overweight", "Obesity"))
  )
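One caveat about this classification: `case_when()` sends every row that matches no condition to `.default`, and that includes rows where `bmi` is `NA`, because `NA < 18.5` evaluates to `NA` rather than `TRUE`. The sketch below uses a small made-up tibble (the values are assumptions, not registry data) to show the effect and one possible way to keep a missing BMI missing.

```r
library(dplyr)

# Hypothetical BMI values, including one missing measurement
toy <- tibble(bmi = c(17.0, 22.4, 27.9, 33.1, NA))

toy <- toy |>
  mutate(
    # Same rules as the chapter's code: NA falls through to .default
    class_default = case_when(
      bmi < 18.5 ~ "Underweight",
      bmi < 25   ~ "Normal",
      bmi < 30   ~ "Overweight",
      .default   = "Obesity"
    ),
    # NA-safe variant: handle missing values explicitly, before the cut-offs
    class_na_safe = case_when(
      is.na(bmi) ~ NA_character_,
      bmi < 18.5 ~ "Underweight",
      bmi < 25   ~ "Normal",
      bmi < 30   ~ "Overweight",
      .default   = "Obesity"
    )
  )

toy
```

In the outputs later in this chapter, the `bmi_class` counts add up to the full 500 and 268 patients with no missing row, even though `bmi` itself has 11 missing values. That pattern suggests those patients are counted under Obesity, so keep it in mind when reading the BMI class percentages.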

4.2 Choosing the Right Summary Statistics

Before computing anything, it helps to know which statistics suit which data. The choice depends on the variable type and its distribution:

Choosing the right summary statistics
Variable Type	Symmetric Distribution	Skewed Distribution
Continuous (e.g., age, BMI)	Mean and SD	Median and IQR
Categorical (e.g., sex, outcome)	n and %	n and %

For continuous variables, here’s what each statistic measures:

Common summary statistics and their R functions
Statistic	What it measures	R function
Mean	Average (center of gravity)	`mean()`
SD (standard deviation)	Spread around the mean	`sd()`
Median	Middle value (50th percentile)	`median()`
IQR (interquartile range)	Spread of the middle 50%	`IQR()`
Min / Max	Extremes	`min()`, `max()`

Mean + SD vs Median + IQR

Use Mean + SD when the data is roughly symmetric (bell-shaped)
Use Median + IQR when the data is skewed (long tail on one side) or has outliers

For clinical data, many variables (like insulin, length of stay, costs) are right-skewed, making median + IQR the safer default. When in doubt, check with a histogram (Chapter 3).
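To make the difference concrete, here is a minimal sketch on a made-up, right-skewed vector (hypothetical lengths of stay in days, not taken from the diabetes data). A single long stay pulls the mean upward while the median stays with the bulk of the values.

```r
# Hypothetical lengths of stay (days): seven short stays and one long one
los <- c(1, 2, 2, 3, 3, 4, 5, 30)

mean(los)    # 6.25, pulled upward by the 30-day stay
median(los)  # 3, close to what most patients experienced

sd(los)      # inflated by the same outlier
IQR(los)     # spread of the middle 50%, barely affected
```

When the mean and median disagree this much, median and IQR usually describe the "typical" patient better.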

4.3 Summarizing Continuous Variables

4.3.1 Single variable summary

Let’s start with glucose. You can call summary functions directly, just like in Section 1.3.1:

mean(diabetes$glucose_mg_dl, na.rm = TRUE)

[1] 121.6868

sd(diabetes$glucose_mg_dl, na.rm = TRUE)

[1] 30.53564

median(diabetes$glucose_mg_dl, na.rm = TRUE)

[1] 117

IQR(diabetes$glucose_mg_dl, na.rm = TRUE)

[1] 42

Recall from Section 1.5 that na.rm = TRUE tells R to skip missing values when computing the statistic.
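`IQR()` returns a single number (the distance between the third and first quartiles), but reports usually show the two quartiles themselves. Base R's `quantile()` gives them directly. The sketch below uses made-up glucose values, not the registry data.

```r
# Hypothetical glucose values (mg/dL) with one missing measurement
glucose <- c(89, 99, 110, 117, 125, 141, 168, NA)

# First and third quartiles (25th and 75th percentiles)
q <- quantile(glucose, probs = c(0.25, 0.75), na.rm = TRUE)
q

# IQR() is simply Q3 minus Q1
IQR(glucose, na.rm = TRUE)
unname(q[2] - q[1])
```

These two quartiles are what a "median (Q1, Q3)" entry in a table refers to.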

4.3.2 Summarizing multiple variables

To summarize several variables at once, use summarise() with one line per statistic:

diabetes |>
  summarise(
    mean_age      = mean(age, na.rm = TRUE),
    sd_age        = sd(age, na.rm = TRUE),
    mean_bmi      = mean(bmi, na.rm = TRUE),
    sd_bmi        = sd(bmi, na.rm = TRUE),
    mean_glucose  = mean(glucose_mg_dl, na.rm = TRUE),
    sd_glucose    = sd(glucose_mg_dl, na.rm = TRUE)
  )

# A tibble: 1 × 6
  mean_age sd_age mean_bmi sd_bmi mean_glucose sd_glucose
     <dbl>  <dbl>    <dbl>  <dbl>        <dbl>      <dbl>
1     33.2   11.8     32.5   6.92         122.       30.5

This is verbose but completely transparent — you can see exactly what’s being computed for each variable.

Shortcut: across() for many columns

When you have many columns to summarize, the across() function can save typing. This uses more advanced syntax that we won’t cover in detail — just know it exists:

diabetes |>
  summarise(
    across(
      c(age, bmi, glucose_mg_dl),
      list(mean = \(x) mean(x, na.rm = TRUE),
           sd   = \(x) sd(x, na.rm = TRUE))
    )
  )

# A tibble: 1 × 6
  age_mean age_sd bmi_mean bmi_sd glucose_mg_dl_mean glucose_mg_dl_sd
     <dbl>  <dbl>    <dbl>  <dbl>              <dbl>            <dbl>
1     33.2   11.8     32.5   6.92               122.             30.5

The column-by-column approach above is perfectly fine for most analyses. Use across() when you’re comfortable with it.
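If the one-row, many-column result becomes hard to read, one option is to reshape it so that each variable gets its own row. The sketch below is one possible approach on a toy tibble (the data and the double-underscore naming scheme are assumptions made for illustration). It uses the `.names` argument of `across()` and `pivot_longer()` from tidyr.

```r
library(dplyr)
library(tidyr)

# Hypothetical data with one missing glucose value
toy <- tibble(
  age           = c(25, 31, 47, 52),
  glucose_mg_dl = c(90, 110, NA, 150)
)

result <- toy |>
  summarise(
    across(
      everything(),
      list(mean = \(x) mean(x, na.rm = TRUE),
           sd   = \(x) sd(x, na.rm = TRUE)),
      # "__" separates variable from statistic, since the
      # variable names themselves contain single underscores
      .names = "{.col}__{.fn}"
    )
  ) |>
  pivot_longer(
    everything(),
    names_to  = c("variable", ".value"),
    names_sep = "__"
  )

result
```

The result has one row per variable and one column per statistic, which is closer to how a summary table is laid out in a paper.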

Python Comparison

Python’s pandas provides describe() for quick summaries:

diabetes[["age", "bmi", "glucose_mg_dl"]].describe()

This returns count, mean, std, min, percentiles, and max — similar to combining mean(), sd(), median(), min(), max() in R.

4.4 Summarizing Categorical Variables

4.4.1 Frequency table with `count()`

For categorical variables, the key statistics are n (count) and % (proportion). Use count() from Section 2.9:

diabetes |> count(diabetes_5y)

# A tibble: 2 × 2
  diabetes_5y     n
  <fct>       <int>
1 neg           500
2 pos           268

4.4.2 Adding percentages

Add a percentage column with mutate():

diabetes |>
  count(diabetes_5y) |>
  mutate(pct = round(100 * n / sum(n), 1))

# A tibble: 2 × 3
  diabetes_5y     n   pct
  <fct>       <int> <dbl>
1 neg           500  65.1
2 pos           268  34.9

About 35% of patients in this cohort developed diabetes within 5 years.

4.4.3 Stratified frequency table

To see how one categorical variable breaks down within another group, use count() with two variables and then compute within-group percentages:

diabetes |>
  count(diabetes_5y, bmi_class) |>
  group_by(diabetes_5y) |>
  mutate(pct = round(100 * n / sum(n), 1)) |>
  ungroup()

# A tibble: 7 × 4
  diabetes_5y bmi_class       n   pct
  <fct>       <fct>       <int> <dbl>
1 neg         Underweight     4   0.8
2 neg         Normal         95  19  
3 neg         Overweight    139  27.8
4 neg         Obesity       262  52.4
5 pos         Normal          7   2.6
6 pos         Overweight     40  14.9
7 pos         Obesity       221  82.5

Watch your denominator

When computing percentages, be clear about what the denominator is:

sum(n) after group_by(diabetes_5y) → percentages within each outcome group (the rows of each group sum to 100%)
sum(n) without grouping → percentages of the total cohort (all rows together sum to 100%)

Getting the denominator wrong is one of the most common mistakes in clinical reporting.
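The two denominators can be compared side by side. This is a minimal sketch on a six-patient toy cohort (hypothetical data): the same counts produce very different percentages depending on whether `sum(n)` is evaluated within groups or over the whole table.

```r
library(dplyr)

# Hypothetical cohort: outcome by BMI class
toy <- tibble(
  outcome   = c("neg", "neg", "neg", "neg", "pos", "pos"),
  bmi_class = c("Normal", "Normal", "Obesity", "Obesity", "Obesity", "Obesity")
)

counts <- toy |> count(outcome, bmi_class)

# Denominator = size of each outcome group
within_group <- counts |>
  group_by(outcome) |>
  mutate(pct = round(100 * n / sum(n), 1)) |>
  ungroup()

# Denominator = whole cohort
of_total <- counts |>
  mutate(pct = round(100 * n / sum(n), 1))

within_group  # neg rows sum to 100, pos rows sum to 100
of_total      # all rows together sum to (about) 100
```

Here "100% of pos patients have obesity" and "33.3% of the cohort are pos patients with obesity" describe the same two patients, which is why the denominator should always be stated in the table or caption.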

4.5 Grouped Summaries by Outcome

A key step in any clinical analysis: summarize continuous variables by group to see how the groups compare.

4.5.1 Continuous variables by group

diabetes |>
  group_by(diabetes_5y) |>
  summarise(
    n            = n(),
    mean_age     = round(mean(age, na.rm = TRUE), 1),
    sd_age       = round(sd(age, na.rm = TRUE), 1),
    mean_glucose = round(mean(glucose_mg_dl, na.rm = TRUE), 1),
    sd_glucose   = round(sd(glucose_mg_dl, na.rm = TRUE), 1),
    mean_bmi     = round(mean(bmi, na.rm = TRUE), 1),
    sd_bmi       = round(sd(bmi, na.rm = TRUE), 1),
    .groups = "drop"
  )

# A tibble: 2 × 8
  diabetes_5y     n mean_age sd_age mean_glucose sd_glucose mean_bmi sd_bmi
  <fct>       <int>    <dbl>  <dbl>        <dbl>      <dbl>    <dbl>  <dbl>
1 neg           500     31.2   11.7         111.       24.8     30.9    6.6
2 pos           268     37.1   11           142.       29.6     35.4    6.6

4.5.2 Interpreting group differences

The pos group is older on average (37 vs 31 years), has higher mean glucose (142 vs 111 mg/dL), and higher mean BMI (35 vs 31 kg/m²). These differences are descriptive: we haven’t tested whether they’re statistically significant. That comes in the next chapters.
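The grouped table reports mean and SD, but the same `group_by()` + `summarise()` pattern works for median and quartiles, which Section 4.2 recommends for skewed variables. The sketch below uses made-up insulin-like values (hypothetical, not the registry data) and also counts how many values each statistic is actually based on.

```r
library(dplyr)

# Hypothetical skewed measurements with one missing value
toy <- tibble(
  outcome = c("neg", "neg", "neg", "neg", "pos", "pos", "pos", "pos"),
  insulin = c(40, 55, 70, 300, 90, 120, NA, 500)
)

by_group <- toy |>
  group_by(outcome) |>
  summarise(
    n_obs = sum(!is.na(insulin)),   # non-missing values used below
    med   = median(insulin, na.rm = TRUE),
    q1    = quantile(insulin, 0.25, na.rm = TRUE, names = FALSE),
    q3    = quantile(insulin, 0.75, na.rm = TRUE, names = FALSE),
    .groups = "drop"
  )

by_group
```

Reporting `n_obs` next to the statistics makes it clear when a group's summary rests on fewer observations than the group size suggests.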

Python Comparison

Python’s groupby + agg achieves the same result:

(
    diabetes
    .groupby("diabetes_5y")
    .agg(
        n=("age", "size"),
        mean_age=("age", "mean"),
        mean_glucose=("glucose_mg_dl", "mean"),
        mean_bmi=("bmi", "mean")
    )
    .round(1)
)

4.6 Assessing Missing Data

4.6.1 Why report missingness?

Missing data is common in clinical datasets and can bias your results. Reporting how much data is missing — and in which variables — is essential for transparency. Journals and reviewers expect it.

4.6.2 Counting missing values per variable

The simplest approach: use summarise() with is.na() for each column you care about:

diabetes |>
  summarise(
    glucose_missing     = sum(is.na(glucose_mg_dl)),
    dbp_missing         = sum(is.na(dbp_mm_hg)),
    triceps_missing     = sum(is.na(triceps_mm)),
    insulin_missing     = sum(is.na(insulin_microiu_ml)),
    bmi_missing         = sum(is.na(bmi))
  )

# A tibble: 1 × 5
  glucose_missing dbp_missing triceps_missing insulin_missing bmi_missing
            <int>       <int>           <int>           <int>       <int>
1               5          35             227             374          11

sum(is.na(x)) counts the TRUEs — recall from Section 1.4 that TRUE counts as 1. So sum(is.na(x)) is literally “how many values are NA?”
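The same trick gives the proportion missing: since `TRUE` counts as 1 and `FALSE` as 0, the mean of `is.na(x)` is the fraction of values that are `NA`. A tiny sketch on a made-up vector:

```r
# Hypothetical vector with two missing values out of five
x <- c(12, NA, 7, NA, 3)

is.na(x)                        # FALSE TRUE FALSE TRUE FALSE
sum(is.na(x))                   # 2 missing values
mean(is.na(x))                  # 0.4, i.e. 40% of values are missing
round(100 * mean(is.na(x)), 1)  # 40
```

This is equivalent to dividing the count by the number of rows, as the summary table in the next section does.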

4.6.3 A missingness summary table

For a cleaner report, build a tibble with variable names, counts, and percentages:

miss_table <- tibble(
  variable    = c("glucose_mg_dl", "dbp_mm_hg", "triceps_mm",
                  "insulin_microiu_ml", "bmi"),
  n_missing   = c(
    sum(is.na(diabetes$glucose_mg_dl)),
    sum(is.na(diabetes$dbp_mm_hg)),
    sum(is.na(diabetes$triceps_mm)),
    sum(is.na(diabetes$insulin_microiu_ml)),
    sum(is.na(diabetes$bmi))
  ),
  pct_missing = round(100 * n_missing / nrow(diabetes), 1)
) |>
  arrange(desc(pct_missing))

miss_table

# A tibble: 5 × 3
  variable           n_missing pct_missing
  <chr>                  <int>       <dbl>
1 insulin_microiu_ml       374        48.7
2 triceps_mm               227        29.6
3 dbp_mm_hg                 35         4.6
4 bmi                       11         1.4
5 glucose_mg_dl              5         0.7

Shortcut: using across() for missingness

For many columns, you can compute missingness more compactly:

diabetes |>
  summarise(across(everything(), \(x) sum(is.na(x))))

# A tibble: 1 × 10
  pregnancy_num glucose_mg_dl dbp_mm_hg triceps_mm insulin_microiu_ml   bmi
          <int>         <int>     <int>      <int>              <int> <int>
1             0             5        35        227                374    11
# ℹ 4 more variables: pedigree <int>, age <int>, diabetes_5y <int>,
#   bmi_class <int>

This applies sum(is.na()) to every column at once. Handy when you have dozens of variables.
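The wide one-row result can be turned into a sorted report without typing each variable name. This is a sketch of one possible approach on toy data (the tibble is hypothetical). It combines the `across()` call with `pivot_longer()` from tidyr.

```r
library(dplyr)
library(tidyr)

# Hypothetical data with different amounts of missingness
toy <- tibble(
  glucose = c(100, NA, 120, 130),
  insulin = c(NA, NA, 85, NA),
  age     = c(30, 41, 52, 63)
)

miss_long <- toy |>
  summarise(across(everything(), \(x) sum(is.na(x)))) |>
  # one row per variable instead of one column per variable
  pivot_longer(everything(), names_to = "variable", values_to = "n_missing") |>
  mutate(pct_missing = round(100 * n_missing / nrow(toy), 1)) |>
  arrange(desc(pct_missing))

miss_long
```

The output has the same three columns as the hand-built `miss_table`, but it scales to any number of columns.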

4.6.4 Interpretation

Insulin has ~49% missing and triceps has ~30% missing — that’s substantial. When analyzing these variables, you’ll need to consider whether the missingness is random or systematic. At a minimum, report it clearly.
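A simple first check for systematic missingness is to compare the percentage missing across groups. The sketch below uses a made-up tibble (hypothetical values). A large difference between groups is only a hint that missingness may be related to the outcome, not proof of any mechanism.

```r
library(dplyr)

# Hypothetical data: insulin is missing more often in the pos group
toy <- tibble(
  outcome = c("neg", "neg", "neg", "neg", "pos", "pos", "pos", "pos"),
  insulin = c(80, NA, 95, 110, NA, NA, 200, NA)
)

miss_by_group <- toy |>
  group_by(outcome) |>
  summarise(
    n_total     = n(),
    n_missing   = sum(is.na(insulin)),
    pct_missing = round(100 * mean(is.na(insulin)), 1),
    .groups = "drop"
  )

miss_by_group
```

In this toy example 25% of the neg group and 75% of the pos group lack an insulin value, the kind of imbalance worth mentioning when you report the variable.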

4.7 Publication-Ready Tables with `gtsummary`

So far, we’ve built summaries manually with summarise(). That works well for understanding the data, but for a publication-quality “Table 1”, the gtsummary package automates the formatting.

One function, one Table 1

tbl_summary() takes a data frame and produces a formatted summary table — the kind you’d include in a journal article’s “Baseline Characteristics” table. It handles continuous and categorical variables automatically.

4.7.1 Step 1: The simplest table

Start with just the defaults — no customization at all:

diabetes |>
  select(age, bmi, glucose_mg_dl, diabetes_5y) |>
  tbl_summary()

Characteristic	N = 768¹
age	29 (24, 41)
bmi	32 (28, 37)
Unknown	11
glucose_mg_dl	117 (99, 141)
Unknown	5
diabetes_5y
neg	500 (65%)
pos	268 (35%)
¹ Median (Q1, Q3); n (%)

With one function call, gtsummary detected variable types, computed appropriate statistics (median with first and third quartiles for continuous variables, n and % for categorical ones), and formatted everything.

4.7.2 Step 2: Stratify by outcome

Add by = diabetes_5y to split the table by group:

diabetes |>
  select(diabetes_5y, age, bmi, glucose_mg_dl, bmi_class) |>
  tbl_summary(by = diabetes_5y)

Characteristic	neg N = 500¹	pos N = 268¹
age	27 (23, 37)	36 (28, 44)
bmi	30 (26, 35)	34 (31, 39)
Unknown	9	2
glucose_mg_dl	107 (93, 125)	140 (119, 167)
Unknown	3	2
bmi_class
Underweight	4 (0.8%)	0 (0%)
Normal	95 (19%)	7 (2.6%)
Overweight	139 (28%)	40 (15%)
Obesity	262 (52%)	221 (82%)
¹ Median (Q1, Q3); n (%)

Now each column is a group, and you can compare at a glance.

4.7.3 Step 3: Customize the statistics

By default, gtsummary reports median (Q1, Q3) for continuous variables. To switch to mean (SD), use the statistic argument:

diabetes |>
  select(diabetes_5y, age, bmi, glucose_mg_dl, bmi_class) |>
  tbl_summary(
    by = diabetes_5y,
    statistic = list(
      all_continuous()  ~ "{mean} ({sd})",
      all_categorical() ~ "{n} ({p}%)"
    ),
    digits = all_continuous() ~ 1
  )

Characteristic	neg N = 500¹	pos N = 268¹
age	31.2 (11.7)	37.1 (11.0)
bmi	30.9 (6.6)	35.4 (6.6)
Unknown	9	2
glucose_mg_dl	110.6 (24.8)	142.3 (29.6)
Unknown	3	2
bmi_class
Underweight	4 (0.8%)	0 (0%)
Normal	95 (19%)	7 (2.6%)
Overweight	139 (28%)	40 (15%)
Obesity	262 (52%)	221 (82%)
¹ Mean (SD); n (%)

Let’s break down the new syntax:

statistic = list(...) — controls what statistics are displayed. Each entry maps a set of variables to a template string.
Template strings use {placeholder} syntax:

`gtsummary` template placeholders
Placeholder	Meaning
`{mean}`	Mean
`{sd}`	Standard deviation
`{median}`	Median
`{p25}`, `{p75}`	25th and 75th percentiles
`{n}`	Count
`{p}`	Percentage

all_continuous() and all_categorical() are selector helpers — they select all continuous or all categorical variables, similar to how starts_with() works in select(). Think of them as shortcuts for “apply this format to all variables of this type.”
digits = all_continuous() ~ 1 — round continuous statistics to 1 decimal place.
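The selectors do not have to cover all variables of a type. As a sketch of how the pieces above combine, the example below gives one variable mean (SD) and another median (Q1, Q3) by naming them individually. The data frame is a small made-up example (hypothetical values). Because gtsummary treats numeric variables with only a few distinct values as categorical by default, the `type` argument is set explicitly here, which would not normally be needed on a full-size dataset.

```r
library(gtsummary)

# Hypothetical data: age roughly symmetric, insulin right-skewed
toy <- data.frame(
  age     = c(25, 31, 47, 52, 38, 29, 61, 44),
  insulin = c(40, 55, 70, 300, 90, 120, NA, 500),
  outcome = c("neg", "neg", "neg", "neg", "pos", "pos", "pos", "pos")
)

tbl <- toy |>
  tbl_summary(
    by = outcome,
    # needed only because this toy data has fewer than 10 distinct values
    type = list(age ~ "continuous", insulin ~ "continuous"),
    statistic = list(
      age     ~ "{mean} ({sd})",              # symmetric: mean (SD)
      insulin ~ "{median} ({p25}, {p75})"     # skewed: median (Q1, Q3)
    )
  )

tbl
```

Mixing statistics this way follows the guidance from Section 4.2: choose the summary per variable, based on its distribution.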

4.7.4 Step 4: Add labels and polish

Add readable labels, an overall column, and bold formatting:

diabetes |>
  select(diabetes_5y, age, bmi, glucose_mg_dl, dbp_mm_hg, bmi_class) |>
  tbl_summary(
    by = diabetes_5y,
    statistic = list(
      all_continuous()  ~ "{mean} ({sd})",
      all_categorical() ~ "{n} ({p}%)"
    ),
    digits = all_continuous() ~ 1,
    missing = "ifany",
    label = list(
      age           ~ "Age (years)",
      bmi           ~ "Body mass index (kg/m²)",
      glucose_mg_dl ~ "Glucose (mg/dL)",
      dbp_mm_hg     ~ "Diastolic BP (mmHg)",
      bmi_class     ~ "BMI class"
    )
  ) |>
  add_overall(last = TRUE) |>
  bold_labels()

Characteristic	neg N = 500¹	pos N = 268¹	Overall N = 768¹
Age (years)	31.2 (11.7)	37.1 (11.0)	33.2 (11.8)
Body mass index (kg/m²)	30.9 (6.6)	35.4 (6.6)	32.5 (6.9)
Unknown	9	2	11
Glucose (mg/dL)	110.6 (24.8)	142.3 (29.6)	121.7 (30.5)
Unknown	3	2	5
Diastolic BP (mmHg)	70.9 (12.2)	75.3 (12.3)	72.4 (12.4)
Unknown	19	16	35
BMI class
Underweight	4 (0.8%)	0 (0%)	4 (0.5%)
Normal	95 (19%)	7 (2.6%)	102 (13%)
Overweight	139 (28%)	40 (15%)	179 (23%)
Obesity	262 (52%)	221 (82%)	483 (63%)
¹ Mean (SD); n (%)

New pieces:

label = list(variable ~ "Label") — gives each variable a human-readable name
missing = "ifany": shows an “Unknown” row only when there are actually missing values. Other options: "no" (hide the row) or "always" (always show it)
add_overall(last = TRUE) — adds an “Overall” column at the end
bold_labels() — makes variable names bold for readability

This is a complete Table 1 — ready for a report, slide deck, or journal submission.

Exporting your table

To save the table as an HTML or Word file:

library(gt)

diabetes |>
  select(diabetes_5y, age, bmi, glucose_mg_dl) |>
  tbl_summary(by = diabetes_5y) |>
  as_gt() |>
  gtsave("table1.html")     # or "table1.docx" for Word

The as_gt() function converts the gtsummary table to a gt object, which can be saved in various formats.

What about p-values?

You may notice that Table 1 often includes a “p-value” column comparing groups. In gtsummary, you add this with add_p(). We’ll cover this in the inferential statistics chapters — for now, focus on describing the data clearly.

Python Comparison

Python’s tableone package provides similar functionality:

from tableone import TableOne

table1 = TableOne(
    data=diabetes,
    columns=["age", "bmi", "glucose_mg_dl", "bmi_class"],
    groupby="diabetes_5y",
    pval=False
)
print(table1.tabulate(tablefmt="github"))

R’s gtsummary is generally more polished for publication-ready output, with built-in formatting and export options.

4.8 Common Errors and Troubleshooting

Common descriptive statistics errors and their fixes
Error or Symptom	Cause	Fix
`NaN` from `mean()` (or `NA` from `sd()`) even with `na.rm = TRUE`	All values are `NA` in that subset, or the subset is empty	Check your `filter()`: are you accidentally removing all data?
Wrong percentages (don’t sum to 100%)	Wrong `group_by()` before `mutate(pct = ...)`	Check your grouping — denominator must match your intended total
`gtsummary` shows an “Unknown” row	`NA` values in that variable	Use `missing = "no"` to hide the row, or for a factor use `mutate(var = fct_na_value_to_level(var))` to make missing an explicit level
`NA` result from `mean()`, `sd()`, etc.	`na.rm = TRUE` forgotten while the column contains missing values	Add `na.rm = TRUE` to every summary function
Very wide output from `summarise()`	Too many columns in one `summarise()` call	Break into separate summaries or use `gtsummary` instead
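The first and fourth rows of the table are easy to reproduce. This minimal sketch uses a made-up vector in which every value is missing, to show what each situation returns in base R.

```r
# Hypothetical subset where every value is missing
x_all_missing <- c(NA_real_, NA_real_)

mean(x_all_missing)                # NA: na.rm was not set
mean(x_all_missing, na.rm = TRUE)  # NaN: nothing is left to average
sd(x_all_missing, na.rm = TRUE)    # NA: no values to compute a spread from

# A quick guard before summarising: how many usable values are there?
sum(!is.na(x_all_missing))         # 0
```

Checking the number of non-missing values first tells you whether a `NaN` comes from the data or from an over-eager `filter()`.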

4.9 Summary

Here’s the journey we took to describe the diabetes cohort:

This chapter’s journey: from raw numbers to a publication-ready Table 1
Step	What we did	R tool
Choose statistics	Matched measure to data type	Mean + SD vs Median + IQR
Single variable	Computed glucose statistics	`mean()`, `sd()`, `median()`, `IQR()`
Multiple variables	Summarized age, BMI, glucose	`summarise()` column-by-column
Categorical counts	Frequency of outcome and BMI class	`count()`, `mutate(pct = ...)`
Grouped summaries	Compared pos vs neg groups	`group_by()` + `summarise()`
Missingness	Identified insulin as 49% missing	`sum(is.na())` per variable
Table 1	Publication-ready baseline table	`gtsummary::tbl_summary()`

These descriptive statistics are the foundation for any clinical analysis. In the next chapters, we’ll move to inferential statistics — testing whether the differences we’ve observed between groups are statistically significant.

Exercises

Vital signs summary. Using the diabetes dataset, compute the mean, SD, median, and IQR of age, bmi, and dbp_mm_hg using explicit summarise() (one line per statistic). Round to 1 decimal.
Categorical profile. Create a frequency table of bmi_class with percentages. Then create a stratified version by diabetes_5y showing within-group percentages. Which BMI class is most common in the pos group?
Missingness report. Build a missingness summary table for all numeric columns in the diabetes dataset. Sort by percent missing (highest first). Which variable has the most missing data, and what percentage is missing?
Table 1 challenge. Build a gtsummary table stratified by diabetes_5y that includes age, bmi, glucose_mg_dl, dbp_mm_hg, and bmi_class. Use mean (SD) for continuous variables, custom labels with units, and add an overall column. Export it to HTML if you’d like to practice with as_gt().
