4  Descriptive Statistics

Before running any statistical test, you need to describe your data. How old are the patients? What is the distribution of glucose? How much data is missing? Is the cohort balanced across groups?

In this chapter, you’ll learn to compute and present descriptive summaries — the kind you’d include in a methods section, a departmental presentation, or a publication’s “Table 1.”

library(tidyverse)
library(janitor)
library(gtsummary)

Note: Our running example: Summarizing the diabetes cohort

Imagine you’re preparing a summary for a departmental meeting. You need to describe the 768 patients in the diabetes registry:

  • What are the typical values for age, BMI, and glucose?
  • How many patients developed diabetes within 5 years?
  • How much data is missing?
  • Can you produce a single, publication-ready table?

We’ll answer each of these questions step by step.

4.1 Data Preparation

We load and prepare the diabetes data using the same tools as in the earlier chapters:

diabetes_raw <- read_csv("../../data/diabetes.csv", show_col_types = FALSE)

diabetes <- diabetes_raw |>
  clean_names() |>
  mutate(
    diabetes_5y = fct_relevel(diabetes_5y, "neg", "pos"),
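    # Caution: rows with a missing bmi match none of the conditions below,
    # so .default labels them "Obesity"
    bmi_class = case_when(
      bmi < 18.5 ~ "Underweight",
      bmi < 25   ~ "Normal",
      bmi < 30   ~ "Overweight",
      .default   = "Obesity"
    ) |> factor(levels = c("Underweight", "Normal", "Overweight", "Obesity"))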
  )
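
The bmi_class recipe has one behaviour worth checking: case_when() applies .default whenever no condition is TRUE, and that includes rows where bmi is NA. The sketch below uses a made-up five-row tibble (toy and its values are hypothetical, not registry data) to compare the rules as written with a variant that tests is.na(bmi) first.

```r
library(dplyr)

# Hypothetical toy data, not the diabetes registry
toy <- tibble(bmi = c(17, 22, 27, 33, NA))

classified <- toy |>
  mutate(
    # Same rules as in the chapter: a missing bmi falls through to .default
    as_written = case_when(
      bmi < 18.5 ~ "Underweight",
      bmi < 25   ~ "Normal",
      bmi < 30   ~ "Overweight",
      .default   = "Obesity"
    ),
    # Variant with an explicit guard so a missing bmi stays missing
    guarded = case_when(
      is.na(bmi) ~ NA_character_,
      bmi < 18.5 ~ "Underweight",
      bmi < 25   ~ "Normal",
      bmi < 30   ~ "Overweight",
      .default   = "Obesity"
    )
  )

classified
```

Whether to add the guard depends on how you want missing BMI reported. With it, those patients stay NA instead of being counted in the Obesity class.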

4.2 Choosing the Right Summary Statistics

Before computing anything, it helps to know which statistics suit which data. The choice depends on the variable type and its distribution:

Choosing the right summary statistics

Variable type                    | Symmetric distribution | Skewed distribution
Continuous (e.g., age, BMI)      | Mean and SD            | Median and IQR
Categorical (e.g., sex, outcome) | n and %                | n and %

For continuous variables, here’s what each statistic measures:

Common summary statistics and their R functions

Statistic                 | What it measures               | R function
Mean                      | Average (center of gravity)    | mean()
SD (standard deviation)   | Spread around the mean         | sd()
Median                    | Middle value (50th percentile) | median()
IQR (interquartile range) | Spread of the middle 50%       | IQR()
Min / Max                 | Extremes                       | min(), max()

Tip: Mean + SD vs Median + IQR
  • Use Mean + SD when the data is roughly symmetric (bell-shaped)
  • Use Median + IQR when the data is skewed (long tail on one side) or has outliers

For clinical data, many variables (like insulin, length of stay, costs) are right-skewed, making median + IQR the safer default. When in doubt, check with a histogram (Chapter 3).
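
To see why a long tail matters, here is a small sketch with a made-up vector of lengths of stay (the name los and its values are hypothetical). A single extreme stay moves the mean and SD noticeably, while the median and IQR barely react.

```r
# Hypothetical length-of-stay values in days (made up for illustration)
los <- c(2, 3, 3, 4, 5, 6, 30)

mean(los)    # pulled upward by the single 30-day stay
median(los)  # 4: the middle value is unaffected by the outlier
sd(los)      # inflated by the outlier
IQR(los)     # describes the middle 50% only
```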

4.3 Summarizing Continuous Variables

4.3.1 Single variable summary

Let’s start with glucose. You can call summary functions directly, just like in Section 1.3.1:

mean(diabetes$glucose_mg_dl, na.rm = TRUE)
[1] 121.6868
sd(diabetes$glucose_mg_dl, na.rm = TRUE)
[1] 30.53564
median(diabetes$glucose_mg_dl, na.rm = TRUE)
[1] 117
IQR(diabetes$glucose_mg_dl, na.rm = TRUE)
[1] 42

Recall from Section 1.5 that na.rm = TRUE tells R to skip missing values when computing the statistic.
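
As a quick reminder of what happens without it, the sketch below uses a short made-up vector (glucose here is hypothetical, not the registry column). It can also help to check how many values the statistic is actually based on.

```r
glucose <- c(90, 110, NA, 130)  # hypothetical values

mean(glucose)                # NA: one missing value makes the result missing
mean(glucose, na.rm = TRUE)  # 110: mean of the three observed values
sum(!is.na(glucose))         # 3: number of values the mean is based on
```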

4.3.2 Summarizing multiple variables

To summarize several variables at once, use summarise() with one line per statistic:

diabetes |>
  summarise(
    mean_age      = mean(age, na.rm = TRUE),
    sd_age        = sd(age, na.rm = TRUE),
    mean_bmi      = mean(bmi, na.rm = TRUE),
    sd_bmi        = sd(bmi, na.rm = TRUE),
    mean_glucose  = mean(glucose_mg_dl, na.rm = TRUE),
    sd_glucose    = sd(glucose_mg_dl, na.rm = TRUE)
  )
# A tibble: 1 × 6
  mean_age sd_age mean_bmi sd_bmi mean_glucose sd_glucose
     <dbl>  <dbl>    <dbl>  <dbl>        <dbl>      <dbl>
1     33.2   11.8     32.5   6.92         122.       30.5

This is verbose but completely transparent — you can see exactly what’s being computed for each variable.

When you have many columns to summarize, the across() function can save typing. This uses more advanced syntax that we won’t cover in detail — just know it exists:

diabetes |>
  summarise(
    across(
      c(age, bmi, glucose_mg_dl),
      list(mean = \(x) mean(x, na.rm = TRUE),
           sd   = \(x) sd(x, na.rm = TRUE))
    )
  )
# A tibble: 1 × 6
  age_mean age_sd bmi_mean bmi_sd glucose_mg_dl_mean glucose_mg_dl_sd
     <dbl>  <dbl>    <dbl>  <dbl>              <dbl>            <dbl>
1     33.2   11.8     32.5   6.92               122.             30.5

The column-by-column approach above is perfectly fine for most analyses. Use across() when you’re comfortable with it.

Python’s pandas provides describe() for quick summaries:

diabetes[["age", "bmi", "glucose_mg_dl"]].describe()

This returns count, mean, std, min, percentiles, and max — similar to combining mean(), sd(), median(), min(), max() in R.

4.4 Summarizing Categorical Variables

4.4.1 Frequency table with count()

For categorical variables, the key statistics are n (count) and % (percentage). Use count() from Section 2.9:

diabetes |> count(diabetes_5y)
# A tibble: 2 × 2
  diabetes_5y     n
  <fct>       <int>
1 neg           500
2 pos           268

4.4.2 Adding percentages

Add a percentage column with mutate():

diabetes |>
  count(diabetes_5y) |>
  mutate(pct = round(100 * n / sum(n), 1))
# A tibble: 2 × 3
  diabetes_5y     n   pct
  <fct>       <int> <dbl>
1 neg           500  65.1
2 pos           268  34.9

About 35% of patients in this cohort developed diabetes within 5 years.

4.4.3 Stratified frequency table

To see how one categorical variable breaks down within another group, use count() with two variables and then compute within-group percentages:

diabetes |>
  count(diabetes_5y, bmi_class) |>
  group_by(diabetes_5y) |>
  mutate(pct = round(100 * n / sum(n), 1)) |>
  ungroup()
# A tibble: 7 × 4
  diabetes_5y bmi_class       n   pct
  <fct>       <fct>       <int> <dbl>
1 neg         Underweight     4   0.8
2 neg         Normal         95  19  
3 neg         Overweight    139  27.8
4 neg         Obesity       262  52.4
5 pos         Normal          7   2.6
6 pos         Overweight     40  14.9
7 pos         Obesity       221  82.5

Warning: Watch your denominator

When computing percentages, be clear about what the denominator is:

  • sum(n) after group_by(diabetes_5y) → percentages within each outcome group (the rows of each group sum to 100%)
  • sum(n) without grouping → percentages of the total cohort (all rows together sum to 100%)

Getting the denominator wrong is one of the most common mistakes in clinical reporting.
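
The two denominators can be compared side by side. This is a minimal sketch on made-up counts (the tibble toy and its numbers are hypothetical), chosen so the percentages are easy to verify by hand.

```r
library(dplyr)

# Hypothetical counts, made up for illustration
toy <- tibble(
  outcome = c("neg", "neg", "pos", "pos"),
  class   = c("Normal", "Obesity", "Normal", "Obesity"),
  n       = c(30, 20, 10, 40)
)

# Denominator = size of each outcome group
within_group <- toy |>
  group_by(outcome) |>
  mutate(pct = 100 * n / sum(n)) |>
  ungroup()

# Denominator = whole cohort
of_total <- toy |>
  mutate(pct = 100 * n / sum(n))

within_group  # neg: 60, 40; pos: 20, 80
of_total      # 30, 20, 10, 40
```

Both results are valid, but they answer different questions, so state the denominator in the table caption or column header.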

4.5 Grouped Summaries by Outcome

A key step in any clinical analysis: summarize continuous variables by group to see how the groups compare.

4.5.1 Continuous variables by group

diabetes |>
  group_by(diabetes_5y) |>
  summarise(
    n            = n(),
    mean_age     = round(mean(age, na.rm = TRUE), 1),
    sd_age       = round(sd(age, na.rm = TRUE), 1),
    mean_glucose = round(mean(glucose_mg_dl, na.rm = TRUE), 1),
    sd_glucose   = round(sd(glucose_mg_dl, na.rm = TRUE), 1),
    mean_bmi     = round(mean(bmi, na.rm = TRUE), 1),
    sd_bmi       = round(sd(bmi, na.rm = TRUE), 1),
    .groups = "drop"
  )
# A tibble: 2 × 8
  diabetes_5y     n mean_age sd_age mean_glucose sd_glucose mean_bmi sd_bmi
  <fct>       <int>    <dbl>  <dbl>        <dbl>      <dbl>    <dbl>  <dbl>
1 neg           500     31.2   11.7         111.       24.8     30.9    6.6
2 pos           268     37.1   11           142.       29.6     35.4    6.6

4.5.2 Interpreting group differences

The pos group is older on average (37 vs 31 years), has higher mean glucose (142 vs 111 mg/dL), and higher mean BMI (35 vs 31 kg/m²). These differences are descriptive: we haven’t tested whether they’re statistically significant. That comes in the next chapters.

Python’s groupby + agg produces the analogous group sizes and means (SDs are omitted here for brevity):

(
    diabetes
    .groupby("diabetes_5y")
    .agg(
        n=("age", "size"),
        mean_age=("age", "mean"),
        mean_glucose=("glucose_mg_dl", "mean"),
        mean_bmi=("bmi", "mean")
    )
    .round(1)
)

4.6 Assessing Missing Data

4.6.1 Why report missingness?

Missing data is common in clinical datasets and can bias your results. Reporting how much data is missing — and in which variables — is essential for transparency. Journals and reviewers expect it.

4.6.2 Counting missing values per variable

The simplest approach: use summarise() with is.na() for each column you care about:

diabetes |>
  summarise(
    glucose_missing     = sum(is.na(glucose_mg_dl)),
    dbp_missing         = sum(is.na(dbp_mm_hg)),
    triceps_missing     = sum(is.na(triceps_mm)),
    insulin_missing     = sum(is.na(insulin_microiu_ml)),
    bmi_missing         = sum(is.na(bmi))
  )
# A tibble: 1 × 5
  glucose_missing dbp_missing triceps_missing insulin_missing bmi_missing
            <int>       <int>           <int>           <int>       <int>
1               5          35             227             374          11

sum(is.na(x)) counts the TRUEs — recall from Section 1.4 that TRUE counts as 1. So sum(is.na(x)) is literally “how many values are NA?”
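
The same logic gives the proportion missing: because TRUE counts as 1, the mean of is.na(x) is the fraction of missing values. A small sketch on a made-up vector (insulin here is hypothetical, not the registry column):

```r
insulin <- c(94, NA, 168, NA, 88, NA, NA, 543)  # hypothetical values

is.na(insulin)                        # TRUE wherever a value is missing
sum(is.na(insulin))                   # 4 missing values
round(100 * mean(is.na(insulin)), 1)  # 50: percent missing
```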

4.6.3 A missingness summary table

For a cleaner report, build a tibble with variable names, counts, and percentages:

miss_table <- tibble(
  variable    = c("glucose_mg_dl", "dbp_mm_hg", "triceps_mm",
                  "insulin_microiu_ml", "bmi"),
  n_missing   = c(
    sum(is.na(diabetes$glucose_mg_dl)),
    sum(is.na(diabetes$dbp_mm_hg)),
    sum(is.na(diabetes$triceps_mm)),
    sum(is.na(diabetes$insulin_microiu_ml)),
    sum(is.na(diabetes$bmi))
  ),
  pct_missing = round(100 * n_missing / nrow(diabetes), 1)
) |>
  arrange(desc(pct_missing))

miss_table
# A tibble: 5 × 3
  variable           n_missing pct_missing
  <chr>                  <int>       <dbl>
1 insulin_microiu_ml       374        48.7
2 triceps_mm               227        29.6
3 dbp_mm_hg                 35         4.6
4 bmi                       11         1.4
5 glucose_mg_dl              5         0.7

For many columns, you can compute missingness more compactly:

diabetes |>
  summarise(across(everything(), \(x) sum(is.na(x))))
# A tibble: 1 × 10
  pregnancy_num glucose_mg_dl dbp_mm_hg triceps_mm insulin_microiu_ml   bmi
          <int>         <int>     <int>      <int>              <int> <int>
1             0             5        35        227                374    11
# ℹ 4 more variables: pedigree <int>, age <int>, diabetes_5y <int>,
#   bmi_class <int>

This applies sum(is.na()) to every column at once. Handy when you have dozens of variables.
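
The wide one-row result can be reshaped into the same long layout as miss_table without typing each column name. The sketch below is one possible way to do it, using pivot_longer() from tidyr on a made-up tibble (toy and its values are hypothetical).

```r
library(dplyr)
library(tidyr)

# Hypothetical toy data, not the diabetes registry
toy <- tibble(
  age     = c(25, 40, 33, 51),
  bmi     = c(22.1, NA, 30.4, 27.8),
  insulin = c(NA, NA, 120, NA)
)

miss_long <- toy |>
  # One column of missing counts per variable
  summarise(across(everything(), \(x) sum(is.na(x)))) |>
  # Reshape to one row per variable
  pivot_longer(everything(), names_to = "variable", values_to = "n_missing") |>
  mutate(pct_missing = round(100 * n_missing / nrow(toy), 1)) |>
  arrange(desc(pct_missing))

miss_long
```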

4.6.4 Interpretation

Insulin has ~49% missing and triceps has ~30% missing — that’s substantial. When analyzing these variables, you’ll need to consider whether the missingness is random or systematic. At a minimum, report it clearly.
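
A simple first check for systematic missingness is to compare the missing rate across groups. The sketch below uses made-up data (toy, its columns, and its values are hypothetical); a clear difference between groups would suggest the missingness is related to the outcome, though it does not prove why.

```r
library(dplyr)

# Hypothetical toy data, not the diabetes registry
toy <- tibble(
  outcome = c("neg", "neg", "neg", "neg", "pos", "pos", "pos", "pos"),
  insulin = c(NA, NA, 94, NA, 168, 543, NA, 88)
)

miss_by_group <- toy |>
  group_by(outcome) |>
  summarise(
    n           = n(),
    n_missing   = sum(is.na(insulin)),
    pct_missing = round(100 * mean(is.na(insulin)), 1)
  )

miss_by_group  # neg: 3 of 4 missing (75%); pos: 1 of 4 missing (25%)
```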

4.7 Publication-Ready Tables with gtsummary

So far, we’ve built summaries manually with summarise(). That works well for understanding the data, but for a publication-quality “Table 1”, the gtsummary package automates the formatting.

Note: One function, one Table 1

tbl_summary() takes a data frame and produces a formatted summary table — the kind you’d include in a journal article’s “Baseline Characteristics” table. It handles continuous and categorical variables automatically.

4.7.1 Step 1: The simplest table

Start with just the defaults — no customization at all:

diabetes |>
  select(age, bmi, glucose_mg_dl, diabetes_5y) |>
  tbl_summary()
Characteristic    N = 768¹
age               29 (24, 41)
bmi               32 (28, 37)
    Unknown       11
glucose_mg_dl     117 (99, 141)
    Unknown       5
diabetes_5y
    neg           500 (65%)
    pos           268 (35%)
¹ Median (Q1, Q3); n (%)

With one function call, gtsummary detected variable types, computed appropriate statistics (median + IQR for continuous, n + % for categorical), and formatted everything.

4.7.2 Step 2: Stratify by outcome

Add by = diabetes_5y to split the table by group:

diabetes |>
  select(diabetes_5y, age, bmi, glucose_mg_dl, bmi_class) |>
  tbl_summary(by = diabetes_5y)
Characteristic    neg, N = 500¹     pos, N = 268¹
age               27 (23, 37)       36 (28, 44)
bmi               30 (26, 35)       34 (31, 39)
    Unknown       9                 2
glucose_mg_dl     107 (93, 125)     140 (119, 167)
    Unknown       3                 2
bmi_class
    Underweight   4 (0.8%)          0 (0%)
    Normal        95 (19%)          7 (2.6%)
    Overweight    139 (28%)         40 (15%)
    Obesity       262 (52%)         221 (82%)
¹ Median (Q1, Q3); n (%)

Now each column is a group, and you can compare at a glance.

4.7.3 Step 3: Customize the statistics

By default, gtsummary reports median (IQR) for continuous variables. To switch to mean (SD), use the statistic argument:

diabetes |>
  select(diabetes_5y, age, bmi, glucose_mg_dl, bmi_class) |>
  tbl_summary(
    by = diabetes_5y,
    statistic = list(
      all_continuous()  ~ "{mean} ({sd})",
      all_categorical() ~ "{n} ({p}%)"
    ),
    digits = all_continuous() ~ 1
  )
Characteristic    neg, N = 500¹     pos, N = 268¹
age               31.2 (11.7)       37.1 (11.0)
bmi               30.9 (6.6)        35.4 (6.6)
    Unknown       9                 2
glucose_mg_dl     110.6 (24.8)      142.3 (29.6)
    Unknown       3                 2
bmi_class
    Underweight   4 (0.8%)          0 (0%)
    Normal        95 (19%)          7 (2.6%)
    Overweight    139 (28%)         40 (15%)
    Obesity       262 (52%)         221 (82%)
¹ Mean (SD); n (%)

Let’s break down the new syntax:

  • statistic = list(...) — controls what statistics are displayed. Each entry maps a set of variables to a template string.
  • Template strings use {placeholder} syntax:
    gtsummary template placeholders

    Placeholder  | Meaning
    {mean}       | Mean
    {sd}         | Standard deviation
    {median}     | Median
    {p25}, {p75} | 25th and 75th percentiles
    {n}          | Count
    {p}          | Percentage
  • all_continuous() and all_categorical() are selector helpers — they select all continuous or all categorical variables, similar to how starts_with() works in select(). Think of them as shortcuts for “apply this format to all variables of this type.”
  • digits = all_continuous() ~ 1 — round continuous statistics to 1 decimal place.
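
The selectors do not have to cover all variables of a type. As a sketch of how the template strings combine with the guidance from Section 4.2, the example below reports mean (SD) for one variable and median (Q1, Q3) for a skewed one. The data are simulated (toy, its columns, and the distribution parameters are hypothetical), and the type argument is set explicitly only so that a small sample is not summarized as categorical.

```r
library(dplyr)
library(gtsummary)

# Hypothetical simulated data, not the diabetes registry
set.seed(1)
toy <- tibble(
  age     = round(rnorm(40, mean = 35, sd = 10)),
  insulin = round(rexp(40, rate = 1 / 120)),
  group   = rep(c("neg", "pos"), each = 20)
)

tbl <- toy |>
  tbl_summary(
    by = group,
    # Declare both variables as continuous regardless of sample size
    type = c(age, insulin) ~ "continuous",
    statistic = list(
      age     ~ "{mean} ({sd})",            # roughly symmetric
      insulin ~ "{median} ({p25}, {p75})"   # right-skewed
    ),
    digits = age ~ 1
  )

tbl
```

The footnote that gtsummary generates lists the statistic used for each variable, so readers can see which rows are means and which are medians.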

4.7.4 Step 4: Add labels and polish

Add readable labels, an overall column, and bold formatting:

diabetes |>
  select(diabetes_5y, age, bmi, glucose_mg_dl, dbp_mm_hg, bmi_class) |>
  tbl_summary(
    by = diabetes_5y,
    statistic = list(
      all_continuous()  ~ "{mean} ({sd})",
      all_categorical() ~ "{n} ({p}%)"
    ),
    digits = all_continuous() ~ 1,
    missing = "ifany",
    label = list(
      age           ~ "Age (years)",
      bmi           ~ "Body mass index (kg/m²)",
      glucose_mg_dl ~ "Glucose (mg/dL)",
      dbp_mm_hg     ~ "Diastolic BP (mmHg)",
      bmi_class     ~ "BMI class"
    )
  ) |>
  add_overall(last = TRUE) |>
  bold_labels()
Characteristic              neg, N = 500¹   pos, N = 268¹   Overall, N = 768¹
Age (years)                 31.2 (11.7)     37.1 (11.0)     33.2 (11.8)
Body mass index (kg/m²)     30.9 (6.6)      35.4 (6.6)      32.5 (6.9)
    Unknown                 9               2               11
Glucose (mg/dL)             110.6 (24.8)    142.3 (29.6)    121.7 (30.5)
    Unknown                 3               2               5
Diastolic BP (mmHg)         70.9 (12.2)     75.3 (12.3)     72.4 (12.4)
    Unknown                 19              16              35
BMI class
    Underweight             4 (0.8%)        0 (0%)          4 (0.5%)
    Normal                  95 (19%)        7 (2.6%)        102 (13%)
    Overweight              139 (28%)       40 (15%)        179 (23%)
    Obesity                 262 (52%)       221 (82%)       483 (63%)
¹ Mean (SD); n (%)

New pieces:

  • label = list(variable ~ "Label"): gives each variable a human-readable name
  • missing = "ifany": shows an “Unknown” row only when there are actually missing values (this is the default). Other options: "no" (hide missing rows) or "always" (always show)
  • add_overall(last = TRUE): adds an “Overall” column at the end
  • bold_labels(): makes variable names bold for readability
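
If your journal prefers a different label for the missing-value row, tbl_summary() has a missing_text argument for it. The sketch below runs on simulated data (toy and its values are hypothetical) and combines it with missing = "always"; treat it as one possible configuration, not a required one.

```r
library(dplyr)
library(gtsummary)

# Hypothetical simulated data with a few values set to missing
set.seed(2)
toy <- tibble(
  bmi   = round(rnorm(40, mean = 32, sd = 6), 1),
  group = rep(c("neg", "pos"), each = 20)
)
toy$bmi[c(3, 25, 31)] <- NA

tbl <- toy |>
  tbl_summary(
    by = group,
    type = bmi ~ "continuous",
    missing = "always",        # show the row even for complete variables
    missing_text = "Missing",  # replaces the default "Unknown" label
    label = bmi ~ "Body mass index (kg/m²)"
  ) |>
  add_overall(last = TRUE) |>
  bold_labels()

tbl
```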

This is a complete Table 1 — ready for a report, slide deck, or journal submission.

Note: Exporting your table

To save the table as an HTML or Word file:

library(gt)

diabetes |>
  select(diabetes_5y, age, bmi, glucose_mg_dl) |>
  tbl_summary(by = diabetes_5y) |>
  as_gt() |>
  gtsave("table1.html")     # or "table1.docx" for Word

The as_gt() function converts the gtsummary table to a gt object, which can be saved in various formats.

Note: What about p-values?

You may notice that Table 1 often includes a “p-value” column comparing groups. In gtsummary, you add this with add_p(). We’ll cover this in the inferential statistics chapters — for now, focus on describing the data clearly.

Python’s tableone package provides similar functionality:

from tableone import TableOne

table1 = TableOne(
    data=diabetes,
    columns=["age", "bmi", "glucose_mg_dl", "bmi_class"],
    groupby="diabetes_5y",
    pval=False
)
print(table1.tabulate(tablefmt="github"))

R’s gtsummary is generally more polished for publication-ready output, with built-in formatting and export options.

4.8 Common Errors and Troubleshooting

Common descriptive statistics errors and their fixes

Error or symptom | Cause | Fix
NaN from mean() (or NA from sd()) despite na.rm = TRUE | All values are NA in that subset | Check your filter(): are you accidentally removing all data?
Wrong percentages (don’t sum to 100%) | Wrong group_by() before mutate(pct = ...) | Check your grouping; the denominator must match your intended total
gtsummary shows “Unknown” category | NA values in a categorical variable | Use mutate(var = fct_na_value_to_level(var)) or missing = "no"
NA result from mean(), sd(), etc. | na.rm forgotten | Add na.rm = TRUE to every summary function
Very wide output from summarise() | Too many columns in one summarise() call | Break into separate summaries or use gtsummary instead
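
The first row of the table can be reproduced in a few lines. This sketch uses a made-up vector in which every value is missing (x is hypothetical); counting the usable values first is a cheap way to spot the problem.

```r
x <- c(NA_real_, NA_real_)  # hypothetical subset in which every value is missing

mean(x, na.rm = TRUE)  # NaN: nothing is left to average after dropping the NAs
sd(x, na.rm = TRUE)    # NA: the SD of zero observations is undefined
sum(!is.na(x))         # 0: number of usable values
```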

4.9 Summary

Here’s the journey we took to describe the diabetes cohort:

This chapter’s journey: from raw numbers to a publication-ready Table 1

Step               | What we did                           | R tool
Choose statistics  | Matched measure to data type          | Mean + SD vs Median + IQR
Single variable    | Computed glucose statistics           | mean(), sd(), median(), IQR()
Multiple variables | Summarized age, BMI, glucose          | summarise() column-by-column
Categorical counts | Frequency of outcome and BMI class    | count(), mutate(pct = ...)
Grouped summaries  | Compared pos vs neg groups            | group_by() + summarise()
Missingness        | Identified insulin as 49% missing     | sum(is.na()) per variable
Table 1            | Publication-ready baseline table      | gtsummary::tbl_summary()

These descriptive statistics are the foundation for any clinical analysis. In the next chapters, we’ll move to inferential statistics — testing whether the differences we’ve observed between groups are statistically significant.

Exercises

  1. Vital signs summary. Using the diabetes dataset, compute the mean, SD, median, and IQR of age, bmi, and dbp_mm_hg using explicit summarise() (one line per statistic). Round to 1 decimal.

  2. Categorical profile. Create a frequency table of bmi_class with percentages. Then create a stratified version by diabetes_5y showing within-group percentages. Which BMI class is most common in the pos group?

  3. Missingness report. Build a missingness summary table for all numeric columns in the diabetes dataset. Sort by percent missing (highest first). Which variable has the most missing data, and what percentage is missing?

  4. Table 1 challenge. Build a gtsummary table stratified by diabetes_5y that includes age, bmi, glucose_mg_dl, dbp_mm_hg, and bmi_class. Use mean (SD) for continuous variables, custom labels with units, and add an overall column. Export it to HTML if you’d like to practice with as_gt().