library(tidyverse)
library(janitor)
library(gtsummary)

4 Descriptive Statistics
Before running any statistical test, you need to describe your data. How old are the patients? What is the distribution of glucose? How much data is missing? Is the cohort balanced across groups?
In this chapter, you’ll learn to compute and present descriptive summaries — the kind you’d include in a methods section, a departmental presentation, or a publication’s “Table 1.”
Imagine you’re preparing a summary for a departmental meeting. You need to describe the 768 patients in the diabetes registry:
- What are the typical values for age, BMI, and glucose?
- How many patients developed diabetes within 5 years?
- How much data is missing?
- Can you produce a single, publication-ready table?
We’ll answer each of these questions step by step.
4.1 Data Preparation
We load and prepare the diabetes data using the same tools introduced in earlier chapters:
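diabetes_raw <- read_csv("../../data/diabetes.csv", show_col_types = FALSE)

diabetes <- diabetes_raw |>
  clean_names() |>
  mutate(
    # Put "neg" first so it is the reference level
    diabetes_5y = fct_relevel(diabetes_5y, "neg", "pos"),
    # Note: rows with a missing bmi match no condition, so they fall through
    # to .default and are labelled "Obesity". Add `is.na(bmi) ~ NA_character_`
    # as the first condition if you want them to stay missing.
    bmi_class = case_when(
      bmi < 18.5 ~ "Underweight",
      bmi < 25 ~ "Normal",
      bmi < 30 ~ "Overweight",
      .default = "Obesity"
    ) |> factor(levels = c("Underweight", "Normal", "Overweight", "Obesity"))
  )

4.2 Choosing the Right Summary Statistics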
Before computing anything, it helps to know which statistics suit which data. The choice depends on the variable type and its distribution:
| Variable Type | Symmetric Distribution | Skewed Distribution |
|---|---|---|
| Continuous (e.g., age, BMI) | Mean and SD | Median and IQR |
| Categorical (e.g., sex, outcome) | n and % | n and % |
For continuous variables, here’s what each statistic measures:
| Statistic | What it measures | R function |
|---|---|---|
| Mean | Average (center of gravity) | mean() |
| SD (standard deviation) | Spread around the mean | sd() |
| Median | Middle value (50th percentile) | median() |
| IQR (interquartile range) | Spread of the middle 50% | IQR() |
| Min / Max | Extremes | min(), max() |
- Use Mean + SD when the data is roughly symmetric (bell-shaped)
- Use Median + IQR when the data is skewed (long tail on one side) or has outliers
For clinical data, many variables (like insulin, length of stay, costs) are right-skewed, making median + IQR the safer default. When in doubt, check with a histogram (Chapter 3).
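To see why skew matters, here is a small sketch using made-up length-of-stay values (hypothetical numbers, not taken from the diabetes registry). One long stay pulls the mean and SD upward, while the median and IQR barely react.

```r
# Hypothetical length-of-stay values in days (illustrative only)
los <- c(2, 3, 3, 4, 4, 5, 6, 45)

mean(los)    # 9: pulled upward by the single 45-day stay
sd(los)      # inflated by the same outlier
median(los)  # 4: the middle value ignores how extreme the outlier is
IQR(los)     # spread of the middle 50%

# Dropping the outlier changes the mean a lot but leaves the median alone
los_trimmed <- los[los < 45]
mean(los_trimmed)
median(los_trimmed)  # 4
```

With the outlier included, "mean 9 days" describes almost none of these patients, while "median 4 days" describes most of them.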
4.3 Summarizing Continuous Variables
4.3.1 Single variable summary
Let’s start with glucose. You can call summary functions directly, just like in Section 1.3.1:
mean(diabetes$glucose_mg_dl, na.rm = TRUE)
[1] 121.6868
sd(diabetes$glucose_mg_dl, na.rm = TRUE)
[1] 30.53564
median(diabetes$glucose_mg_dl, na.rm = TRUE)
[1] 117
IQR(diabetes$glucose_mg_dl, na.rm = TRUE)
[1] 42
Recall from Section 1.5 that na.rm = TRUE tells R to skip missing values when computing the statistic.
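As a quick illustration of what na.rm changes, the sketch below uses a short made-up vector (the values are hypothetical). It also counts how many observations actually contributed, which is worth reporting alongside the statistic.

```r
# Hypothetical glucose readings with one missing value
glucose <- c(95, 110, NA, 140)

mean(glucose)                # NA: one missing value makes the result unknown
mean(glucose, na.rm = TRUE)  # 115: the mean of the three observed values

# How many values actually went into the statistic
n_used <- sum(!is.na(glucose))
n_used                       # 3
```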
4.3.2 Summarizing multiple variables
To summarize several variables at once, use summarise() with one line per statistic:
diabetes |>
  summarise(
    mean_age = mean(age, na.rm = TRUE),
    sd_age = sd(age, na.rm = TRUE),
    mean_bmi = mean(bmi, na.rm = TRUE),
    sd_bmi = sd(bmi, na.rm = TRUE),
    mean_glucose = mean(glucose_mg_dl, na.rm = TRUE),
    sd_glucose = sd(glucose_mg_dl, na.rm = TRUE)
  )
# A tibble: 1 × 6
  mean_age sd_age mean_bmi sd_bmi mean_glucose sd_glucose
     <dbl>  <dbl>    <dbl>  <dbl>        <dbl>      <dbl>
1     33.2   11.8     32.5   6.92         122.       30.5
This is verbose but completely transparent — you can see exactly what’s being computed for each variable.
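The same one-line-per-statistic pattern extends to the median and quartiles that Section 4.2 recommends for skewed variables. A minimal sketch on a small made-up tibble (the `toy` data and its values are hypothetical; only the column names mirror the registry):

```r
library(dplyr)

# Hypothetical mini-dataset; values are illustrative only
toy <- tibble(
  age           = c(21, 25, 29, 33, 41, 62),
  glucose_mg_dl = c(89, 99, NA, 117, 141, 197)
)

toy_summary <- toy |>
  summarise(
    median_age     = median(age, na.rm = TRUE),
    q1_age         = quantile(age, 0.25, na.rm = TRUE),
    q3_age         = quantile(age, 0.75, na.rm = TRUE),
    median_glucose = median(glucose_mg_dl, na.rm = TRUE),
    iqr_glucose    = IQR(glucose_mg_dl, na.rm = TRUE)
  )

toy_summary
```

Reporting Q1 and Q3 separately, rather than the single IQR() number, matches the "Median (Q1, Q3)" format used in most published tables.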
across() for many columns
When you have many columns to summarize, the across() function can save typing. This uses more advanced syntax that we won’t cover in detail — just know it exists:
diabetes |>
  summarise(
    across(
      c(age, bmi, glucose_mg_dl),
      list(mean = \(x) mean(x, na.rm = TRUE),
           sd = \(x) sd(x, na.rm = TRUE))
    )
  )
# A tibble: 1 × 6
  age_mean age_sd bmi_mean bmi_sd glucose_mg_dl_mean glucose_mg_dl_sd
     <dbl>  <dbl>    <dbl>  <dbl>              <dbl>            <dbl>
1     33.2   11.8     32.5   6.92               122.             30.5
The column-by-column approach above is perfectly fine for most analyses. Use across() when you’re comfortable with it.
Python’s pandas provides describe() for quick summaries:
diabetes[["age", "bmi", "glucose_mg_dl"]].describe()

This returns count, mean, std, min, percentiles, and max, similar to combining mean(), sd(), median(), min(), and max() in R.
4.4 Summarizing Categorical Variables
4.4.1 Frequency table with count()
For categorical variables, the key statistics are n (count) and % (proportion). Use count() from Section 2.9:
diabetes |> count(diabetes_5y)
# A tibble: 2 × 2
  diabetes_5y     n
  <fct>       <int>
1 neg           500
2 pos           268
4.4.2 Adding percentages
Add a percentage column with mutate():
diabetes |>
  count(diabetes_5y) |>
  mutate(pct = round(100 * n / sum(n), 1))
# A tibble: 2 × 3
  diabetes_5y     n   pct
  <fct>       <int> <dbl>
1 neg           500  65.1
2 pos           268  34.9
About 35% of patients in this cohort developed diabetes within 5 years.
4.4.3 Stratified frequency table
To see how one categorical variable breaks down within another group, use count() with two variables and then compute within-group percentages:
diabetes |>
  count(diabetes_5y, bmi_class) |>
  group_by(diabetes_5y) |>
  mutate(pct = round(100 * n / sum(n), 1)) |>
  ungroup()
# A tibble: 7 × 4
  diabetes_5y bmi_class       n   pct
  <fct>       <fct>       <int> <dbl>
1 neg         Underweight     4   0.8
2 neg         Normal         95  19
3 neg         Overweight    139  27.8
4 neg         Obesity       262  52.4
5 pos         Normal          7   2.6
6 pos         Overweight     40  14.9
7 pos         Obesity       221  82.5
When computing percentages, be clear about what the denominator is:
- sum(n) after group_by(diabetes_5y) → percentages within each outcome group (the rows of each group sum to 100%)
- sum(n) without grouping → percentages of the total cohort (all rows together sum to 100%)
Getting the denominator wrong is one of the most common mistakes in clinical reporting.
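The two denominators can be put side by side to make the difference concrete. This is a sketch on a ten-row made-up dataset (the `toy` tibble, the `outcome` column, and all counts are hypothetical):

```r
library(dplyr)

# Hypothetical data: 6 "neg" and 4 "pos" patients, for illustration only
toy <- tibble(
  outcome   = c("neg", "neg", "neg", "neg", "neg", "neg",
                "pos", "pos", "pos", "pos"),
  bmi_class = c("Normal", "Normal", "Normal", "Obesity", "Obesity", "Obesity",
                "Normal", "Obesity", "Obesity", "Obesity")
)

pct_table <- toy |>
  count(outcome, bmi_class) |>
  # Denominator = whole cohort (10 patients)
  mutate(pct_of_total = 100 * n / sum(n)) |>
  # Denominator = patients with the same outcome (6 neg, 4 pos)
  group_by(outcome) |>
  mutate(pct_within_outcome = 100 * n / sum(n)) |>
  ungroup()

pct_table
```

In this toy data, the pos/Obesity cell is 30% of the cohort but 75% of the pos group: the count is the same, only the denominator differs.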
4.5 Grouped Summaries by Outcome
A key step in any clinical analysis: summarize continuous variables by group to see how the groups compare.
4.5.1 Continuous variables by group
diabetes |>
  group_by(diabetes_5y) |>
  summarise(
    n = n(),
    mean_age = round(mean(age, na.rm = TRUE), 1),
    sd_age = round(sd(age, na.rm = TRUE), 1),
    mean_glucose = round(mean(glucose_mg_dl, na.rm = TRUE), 1),
    sd_glucose = round(sd(glucose_mg_dl, na.rm = TRUE), 1),
    mean_bmi = round(mean(bmi, na.rm = TRUE), 1),
    sd_bmi = round(sd(bmi, na.rm = TRUE), 1),
    .groups = "drop"
  )
# A tibble: 2 × 8
  diabetes_5y     n mean_age sd_age mean_glucose sd_glucose mean_bmi sd_bmi
  <fct>       <int>    <dbl>  <dbl>        <dbl>      <dbl>    <dbl>  <dbl>
1 neg           500     31.2   11.7         111.       24.8     30.9    6.6
2 pos           268     37.1   11           142.       29.6     35.4    6.6
4.5.2 Interpreting group differences
The pos group is older on average (37 vs 31 years), has higher mean glucose (142 vs 111 mg/dL), and higher mean BMI (35 vs 31 kg/m²). These differences are descriptive: we haven’t tested whether they’re statistically significant. That comes in the next chapters.
Python’s groupby + agg achieves the same result:
(
    diabetes
    .groupby("diabetes_5y")
    .agg(
        n=("age", "size"),
        mean_age=("age", "mean"),
        mean_glucose=("glucose_mg_dl", "mean"),
        mean_bmi=("bmi", "mean")
    )
    .round(1)
)

4.6 Assessing Missing Data
4.6.1 Why report missingness?
Missing data is common in clinical datasets and can bias your results. Reporting how much data is missing — and in which variables — is essential for transparency. Journals and reviewers expect it.
4.6.2 Counting missing values per variable
The simplest approach: use summarise() with is.na() for each column you care about:
diabetes |>
  summarise(
    glucose_missing = sum(is.na(glucose_mg_dl)),
    dbp_missing = sum(is.na(dbp_mm_hg)),
    triceps_missing = sum(is.na(triceps_mm)),
    insulin_missing = sum(is.na(insulin_microiu_ml)),
    bmi_missing = sum(is.na(bmi))
  )
# A tibble: 1 × 5
  glucose_missing dbp_missing triceps_missing insulin_missing bmi_missing
            <int>       <int>           <int>           <int>       <int>
1               5          35             227             374          11
sum(is.na(x)) counts the TRUEs — recall from Section 1.4 that TRUE counts as 1. So sum(is.na(x)) is literally “how many values are NA?”
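The same trick gives a proportion if you swap sum() for mean(). A short sketch on a made-up vector (the values are hypothetical):

```r
# Hypothetical insulin values with missing entries
insulin <- c(94, NA, 168, NA, NA, 88, 543, NA)

is.na(insulin)        # logical vector: TRUE where the value is missing
sum(is.na(insulin))   # 4 missing values
mean(is.na(insulin))  # 0.5: the proportion missing (TRUE = 1, FALSE = 0)

# As a rounded percentage for reporting
round(100 * mean(is.na(insulin)), 1)  # 50
```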
4.6.3 A missingness summary table
For a cleaner report, build a tibble with variable names, counts, and percentages:
miss_table <- tibble(
  variable = c("glucose_mg_dl", "dbp_mm_hg", "triceps_mm",
               "insulin_microiu_ml", "bmi"),
  n_missing = c(
    sum(is.na(diabetes$glucose_mg_dl)),
    sum(is.na(diabetes$dbp_mm_hg)),
    sum(is.na(diabetes$triceps_mm)),
    sum(is.na(diabetes$insulin_microiu_ml)),
    sum(is.na(diabetes$bmi))
  ),
  pct_missing = round(100 * n_missing / nrow(diabetes), 1)
) |>
  arrange(desc(pct_missing))

miss_table
# A tibble: 5 × 3
  variable           n_missing pct_missing
  <chr>                  <int>       <dbl>
1 insulin_microiu_ml       374        48.7
2 triceps_mm               227        29.6
3 dbp_mm_hg                 35         4.6
4 bmi                       11         1.4
5 glucose_mg_dl              5         0.7
across() for missingness
For many columns, you can compute missingness more compactly:
diabetes |>
  summarise(across(everything(), \(x) sum(is.na(x))))
# A tibble: 1 × 10
  pregnancy_num glucose_mg_dl dbp_mm_hg triceps_mm insulin_microiu_ml   bmi
          <int>         <int>     <int>      <int>              <int> <int>
1             0             5        35        227                374    11
# ℹ 4 more variables: pedigree <int>, age <int>, diabetes_5y <int>,
#   bmi_class <int>
This applies sum(is.na()) to every column at once. Handy when you have dozens of variables.
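If you prefer base R, colSums(is.na(...)) produces the same per-column counts, and the result is easy to reshape into a sorted long table. A sketch on a small made-up data frame (`toy` and its values are hypothetical):

```r
# Hypothetical data frame; values are illustrative only
toy <- data.frame(
  glucose_mg_dl      = c(89, NA, 117, 141),
  insulin_microiu_ml = c(94, NA, NA, 88),
  age                = c(21, 25, 29, 33)
)

# Named vector with one missing-value count per column
n_missing <- colSums(is.na(toy))

miss_summary <- data.frame(
  variable    = names(n_missing),
  n_missing   = unname(n_missing),
  pct_missing = round(100 * unname(n_missing) / nrow(toy), 1)
)

# Sort so the most incomplete variables come first
miss_summary <- miss_summary[order(-miss_summary$pct_missing), ]
miss_summary
```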
4.6.4 Interpretation
Insulin has ~49% missing and triceps has ~30% missing — that’s substantial. When analyzing these variables, you’ll need to consider whether the missingness is random or systematic. At a minimum, report it clearly.
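One simple first check on whether missingness might be systematic is to compare the missing proportion across outcome groups. The sketch below uses made-up data (all values are hypothetical). A large gap between groups would hint that missingness is related to the outcome, though it does not establish why.

```r
library(dplyr)

# Hypothetical data: 4 patients per outcome group
toy <- tibble(
  diabetes_5y        = c("neg", "neg", "neg", "neg", "pos", "pos", "pos", "pos"),
  insulin_microiu_ml = c(94, NA, 88, 120, NA, NA, 543, NA)
)

miss_by_group <- toy |>
  group_by(diabetes_5y) |>
  summarise(
    n           = n(),
    n_missing   = sum(is.na(insulin_microiu_ml)),
    pct_missing = round(100 * mean(is.na(insulin_microiu_ml)), 1),
    .groups = "drop"
  )

miss_by_group
```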
4.7 Publication-Ready Tables with gtsummary
So far, we’ve built summaries manually with summarise(). That works well for understanding the data, but for a publication-quality “Table 1”, the gtsummary package automates the formatting.
tbl_summary() takes a data frame and produces a formatted summary table — the kind you’d include in a journal article’s “Baseline Characteristics” table. It handles continuous and categorical variables automatically.
4.7.1 Step 1: The simplest table
Start with just the defaults — no customization at all:
diabetes |>
  select(age, bmi, glucose_mg_dl, diabetes_5y) |>
  tbl_summary()

| Characteristic | N = 768¹ |
|---|---|
| age | 29 (24, 41) |
| bmi | 32 (28, 37) |
| Unknown | 11 |
| glucose_mg_dl | 117 (99, 141) |
| Unknown | 5 |
| diabetes_5y | |
| neg | 500 (65%) |
| pos | 268 (35%) |

¹ Median (Q1, Q3); n (%)
With one function call, gtsummary detected variable types, computed appropriate statistics (median + IQR for continuous, n + % for categorical), and formatted everything.
4.7.2 Step 2: Stratify by outcome
Add by = diabetes_5y to split the table by group:
diabetes |>
  select(diabetes_5y, age, bmi, glucose_mg_dl, bmi_class) |>
  tbl_summary(by = diabetes_5y)

| Characteristic | neg, N = 500¹ | pos, N = 268¹ |
|---|---|---|
| age | 27 (23, 37) | 36 (28, 44) |
| bmi | 30 (26, 35) | 34 (31, 39) |
| Unknown | 9 | 2 |
| glucose_mg_dl | 107 (93, 125) | 140 (119, 167) |
| Unknown | 3 | 2 |
| bmi_class | | |
| Underweight | 4 (0.8%) | 0 (0%) |
| Normal | 95 (19%) | 7 (2.6%) |
| Overweight | 139 (28%) | 40 (15%) |
| Obesity | 262 (52%) | 221 (82%) |

¹ Median (Q1, Q3); n (%)
Now each column is a group, and you can compare at a glance.
4.7.3 Step 3: Customize the statistics
By default, gtsummary reports median (IQR) for continuous variables. To switch to mean (SD), use the statistic argument:
diabetes |>
  select(diabetes_5y, age, bmi, glucose_mg_dl, bmi_class) |>
  tbl_summary(
    by = diabetes_5y,
    statistic = list(
      all_continuous() ~ "{mean} ({sd})",
      all_categorical() ~ "{n} ({p}%)"
    ),
    digits = all_continuous() ~ 1
  )

| Characteristic | neg, N = 500¹ | pos, N = 268¹ |
|---|---|---|
| age | 31.2 (11.7) | 37.1 (11.0) |
| bmi | 30.9 (6.6) | 35.4 (6.6) |
| Unknown | 9 | 2 |
| glucose_mg_dl | 110.6 (24.8) | 142.3 (29.6) |
| Unknown | 3 | 2 |
| bmi_class | | |
| Underweight | 4 (0.8%) | 0 (0%) |
| Normal | 95 (19%) | 7 (2.6%) |
| Overweight | 139 (28%) | 40 (15%) |
| Obesity | 262 (52%) | 221 (82%) |

¹ Mean (SD); n (%)
Let’s break down the new syntax:
- statistic = list(...): controls what statistics are displayed. Each entry maps a set of variables to a template string.
- Template strings use {placeholder} syntax:

| Placeholder | Meaning |
|---|---|
| {mean} | Mean |
| {sd} | Standard deviation |
| {median} | Median |
| {p25}, {p75} | 25th and 75th percentiles |
| {n} | Count |
| {p} | Percentage |

- all_continuous() and all_categorical() are selector helpers: they select all continuous or all categorical variables, similar to how starts_with() works in select(). Think of them as shortcuts for “apply this format to all variables of this type.”
- digits = all_continuous() ~ 1: round continuous statistics to 1 decimal place.
4.7.4 Step 4: Add labels and polish
Add readable labels, an overall column, and bold formatting:
diabetes |>
  select(diabetes_5y, age, bmi, glucose_mg_dl, dbp_mm_hg, bmi_class) |>
  tbl_summary(
    by = diabetes_5y,
    statistic = list(
      all_continuous() ~ "{mean} ({sd})",
      all_categorical() ~ "{n} ({p}%)"
    ),
    digits = all_continuous() ~ 1,
    missing = "ifany",
    label = list(
      age ~ "Age (years)",
      bmi ~ "Body mass index (kg/m²)",
      glucose_mg_dl ~ "Glucose (mg/dL)",
      dbp_mm_hg ~ "Diastolic BP (mmHg)",
      bmi_class ~ "BMI class"
    )
  ) |>
  add_overall(last = TRUE) |>
  bold_labels()

| Characteristic | neg, N = 500¹ | pos, N = 268¹ | Overall, N = 768¹ |
|---|---|---|---|
| Age (years) | 31.2 (11.7) | 37.1 (11.0) | 33.2 (11.8) |
| Body mass index (kg/m²) | 30.9 (6.6) | 35.4 (6.6) | 32.5 (6.9) |
| Unknown | 9 | 2 | 11 |
| Glucose (mg/dL) | 110.6 (24.8) | 142.3 (29.6) | 121.7 (30.5) |
| Unknown | 3 | 2 | 5 |
| Diastolic BP (mmHg) | 70.9 (12.2) | 75.3 (12.3) | 72.4 (12.4) |
| Unknown | 19 | 16 | 35 |
| BMI class | | | |
| Underweight | 4 (0.8%) | 0 (0%) | 4 (0.5%) |
| Normal | 95 (19%) | 7 (2.6%) | 102 (13%) |
| Overweight | 139 (28%) | 40 (15%) | 179 (23%) |
| Obesity | 262 (52%) | 221 (82%) | 483 (63%) |

¹ Mean (SD); n (%)
New pieces:
- label = list(variable ~ "Label"): gives each variable a human-readable name
- missing = "ifany": shows an “Unknown” row only when there are actually missing values. Other options: "no" (hide missing rows) or "always" (always show)
- add_overall(last = TRUE): adds an “Overall” column at the end
- bold_labels(): makes variable names bold for readability
This is a complete Table 1 — ready for a report, slide deck, or journal submission.
To save the table as an HTML or Word file:
library(gt)
diabetes |>
  select(diabetes_5y, age, bmi, glucose_mg_dl) |>
  tbl_summary(by = diabetes_5y) |>
  as_gt() |>
  gtsave("table1.html")  # or "table1.docx" for Word

The as_gt() function converts the gtsummary table to a gt object, which can be saved in various formats.
You may notice that Table 1 often includes a “p-value” column comparing groups. In gtsummary, you add this with add_p(). We’ll cover this in the inferential statistics chapters — for now, focus on describing the data clearly.
Python’s tableone package provides similar functionality:
from tableone import TableOne
table1 = TableOne(
    data=diabetes,
    columns=["age", "bmi", "glucose_mg_dl", "bmi_class"],
    groupby="diabetes_5y",
    pval=False
)
print(table1.tabulate(tablefmt="github"))

R’s gtsummary is generally more polished for publication-ready output, with built-in formatting and export options.
4.8 Common Errors and Troubleshooting
| Error or Symptom | Cause | Fix |
|---|---|---|
| NaN from mean() (or NA from sd()) even with na.rm = TRUE | All values are NA in that subset | Check your filter(): are you accidentally removing all data? |
| Wrong percentages (don’t sum to 100%) | Wrong group_by() before mutate(pct = ...) | Check your grouping; the denominator must match your intended total |
| gtsummary shows “Unknown” category | NA values in a categorical variable | Use mutate(var = fct_na_value_to_level(var)) or missing = "no" |
| NA result from mean(), sd(), etc. | na.rm forgotten while the variable contains missing values | Add na.rm = TRUE to every summary function |
| Very wide output from summarise() | Too many columns in one summarise() call | Break into separate summaries or use gtsummary instead |
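The first row of the table is easy to reproduce. In this sketch (the vector is hypothetical), every value in the subset is missing, so na.rm = TRUE leaves nothing to summarize. Counting observed values first is one way to catch the problem early.

```r
# Hypothetical subset in which every measurement is missing
all_missing <- c(NA_real_, NA_real_, NA_real_)

mean(all_missing, na.rm = TRUE)  # NaN: nothing is left to average
sd(all_missing, na.rm = TRUE)    # NA: SD is undefined with no observations

# Guard: count the observed values before trusting a summary
n_obs <- sum(!is.na(all_missing))
if (n_obs == 0) {
  message("No observed values in this subset; check filter() and grouping.")
}
```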
4.9 Summary
Here’s the journey we took to describe the diabetes cohort:
| Step | What we did | R tool |
|---|---|---|
| Choose statistics | Matched measure to data type | Mean + SD vs Median + IQR |
| Single variable | Computed glucose statistics | mean(), sd(), median(), IQR() |
| Multiple variables | Summarized age, BMI, glucose | summarise() column-by-column |
| Categorical counts | Frequency of outcome and BMI class | count(), mutate(pct = ...) |
| Grouped summaries | Compared pos vs neg groups | group_by() + summarise() |
| Missingness | Identified insulin as 49% missing | sum(is.na()) per variable |
| Table 1 | Publication-ready baseline table | gtsummary::tbl_summary() |
These descriptive statistics are the foundation for any clinical analysis. In the next chapters, we’ll move to inferential statistics — testing whether the differences we’ve observed between groups are statistically significant.
Exercises
1. Vital signs summary. Using the diabetes dataset, compute the mean, SD, median, and IQR of age, bmi, and dbp_mm_hg using explicit summarise() (one line per statistic). Round to 1 decimal.
2. Categorical profile. Create a frequency table of bmi_class with percentages. Then create a stratified version by diabetes_5y showing within-group percentages. Which BMI class is most common in the pos group?
3. Missingness report. Build a missingness summary table for all numeric columns in the diabetes dataset. Sort by percent missing (highest first). Which variable has the most missing data, and what percentage is missing?
4. Table 1 challenge. Build a gtsummary table stratified by diabetes_5y that includes age, bmi, glucose_mg_dl, dbp_mm_hg, and bmi_class. Use mean (SD) for continuous variables, custom labels with units, and add an overall column. Export it to HTML if you’d like to practice with as_gt().