In this chapter, you’ll learn the core building blocks of R programming — variables, data types, vectors, flow control, and functions. These are the tools you’ll use throughout the rest of this book.
To make these concepts concrete, we’ll use a running clinical example that builds from section to section.
Throughout this chapter, we’ll progressively build up a calculation using the Cockcroft-Gault equation — a widely used formula for estimating creatinine clearance (a measure of kidney function):
\[ eCrCl = \frac{(140 - age) \times weight_{kg}}{72 \times Cr_{mg/dL}} \times (0.85 \text{ if female}) \]
Starting from simple arithmetic, we’ll add variables, handle multiple patients, classify results into CKD stages, and finally wrap everything into a reusable function. Each section introduces a new R concept and applies it to this scenario.
At its core, R is a powerful calculator. Type a mathematical expression, and R gives you the answer:
2 + 3
[1] 5
100 / 3
[1] 33.33333
2^10
[1] 1024
Here are R’s arithmetic operators:
| Operator | Meaning | Example |
|---|---|---|
| + | Addition | 5 + 3 |
| - | Subtraction | 10 - 4 |
| * | Multiplication | 6 * 7 |
| / | Division | 100 / 3 |
| ^ | Exponent (power) | 2 ^ 3 |
Like in mathematics, parentheses control the order of operations:
(2 + 3) * 4
[1] 20
2 + 3 * 4   # without parentheses: multiplication first
[1] 14
print()
Unlike in many other programming languages, R automatically prints the result when you type an expression in the Console, in an R script (run line by line), or in a Quarto code chunk. You don’t need to wrap it in print(): just write the expression and R will display the output. This is called auto-printing, and it’s why 2 + 3 shows 5 without us writing print(2 + 3).
Let’s try something more relevant. The Cockcroft-Gault equation for a 55-year-old male patient weighing 70 kg with a serum creatinine of 1.2 mg/dL:
(140 - 55) * 70 / (72 * 1.2)
[1] 68.86574
An estimated creatinine clearance of about 69 mL/min. But typing raw numbers like this is hard to read and easy to get wrong. Let’s fix that with variables.
A variable stores a value so you can use it later by name. In R, you create variables with the assignment operator <-:
age <- 55

This stores the number 55 in a variable called age. You can now use age anywhere in place of 55:

age
[1] 55
140 - age
[1] 85
<- shortcut
In RStudio, press Alt + - (Windows/Linux) or Option + - (macOS) to type <-. You’ll use this thousands of times — it’s worth memorizing.
=?
You may see = used for assignment in some R code (e.g., age = 55). It works in most situations, but <- is the standard convention in the R community and is what you’ll see in virtually all R books, packages, and style guides. We’ll use <- throughout this book.
Variable names should be descriptive. R convention is snake_case — lowercase words separated by underscores:
| Good names | Bad names | Why bad? |
|---|---|---|
| age | a | Not descriptive |
| weight_kg | Weight.KG | Inconsistent casing |
| serum_creatinine | x1 | Meaningless name |
| patient_count | myVar | Vague, camelCase not standard |
Variable names must start with a letter and can contain letters, numbers, underscores (_), and dots (.). They are case-sensitive — Age and age are different variables. Avoid spaces and special characters.
Let’s store our patient’s data in properly named variables:
age <- 55
weight_kg <- 70
creatinine <- 1.2 # serum creatinine in mg/dL
sex <- "Male"

Now the Cockcroft-Gault calculation becomes readable:

egfr <- (140 - age) * weight_kg / (72 * creatinine)
egfr
[1] 68.86574
Much better. Each number has a name, and if you need to change a value (say, a different patient’s age), you only change it in one place.
In Python, = is the only assignment operator:
age = 55
weight_kg = 70
creatinine = 1.2
sex = "Male"

Both R and Python use snake_case as the preferred naming convention.
Every value in R is stored as a vector — an ordered collection of elements. Even a single number like 42 is a vector of length 1. Understanding what types of data these vectors can hold is essential for working with R effectively.
R has six basic data types you’ll encounter:
| Type | Stores | Example |
|---|---|---|
| double | Decimal numbers | 3.14, 42, -7.5 |
| integer | Whole numbers | 1L, 55L, 100L |
| character | Text (strings) | "hello", "Male" |
| logical | True / False | TRUE, FALSE |
| list | Mixed-type collection | list("John", 55, TRUE) |
| factor | Categories | factor(c("G1", "G2", "G3")) |
The first four are atomic types — they can only hold one kind of data. Lists and factors are built on top of them.
An atomic vector holds a sequence of values that are all the same type. When you create a variable in R, you’re creating an atomic vector — even if it contains only one value.
R has two numeric types: doubles (decimal numbers) and integers (whole numbers).
Most numbers in R are stored as doubles by default:
typeof(age)
[1] "double"
typeof(creatinine)
[1] "double"

Even 55, a whole number, is stored as a double internally. To explicitly create an integer, append L (for Long integer):

patient_count <- 55L
typeof(patient_count)
[1] "integer"

In practice, the distinction rarely matters because both are “numeric”:

is.numeric(age)             # double: numeric
[1] TRUE
is.numeric(patient_count)   # integer: also numeric
[1] TRUE
For most data analysis, you can treat doubles and integers interchangeably — R converts between them automatically. The distinction matters mainly in:
When in doubt, just use regular numbers (doubles). R will handle the rest.
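One place where the storage type can become visible is in exact comparisons. The following is a minimal sketch with illustrative values (not part of the running example); it only uses base R functions.

```r
# Doubles and integers compare equal by value, but they are stored differently
same_value   <- 55 == 55L            # TRUE: the values are equal
same_storage <- identical(55, 55L)   # FALSE: double vs integer storage

# Doubles use binary floating point, so some decimals are only approximate
exact_match <- 0.1 + 0.2 == 0.3                    # FALSE
close_match <- isTRUE(all.equal(0.1 + 0.2, 0.3))   # TRUE: equal within a small tolerance

c(same_value, same_storage, exact_match, close_match)
```

When comparing computed decimal values, all.equal() is usually a safer choice than ==.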
Text values are called character strings. They must be wrapped in quotes (" or '):
sex <- "Male"
diagnosis <- "Chronic Kidney Disease"
typeof(sex)
[1] "character"

Logical values are either TRUE or FALSE. They often come from comparisons:

is_impaired <- egfr < 60
is_impaired
[1] FALSE
typeof(is_impaired)
[1] "logical"
Our patient’s eGFR of ~69 is not below 60, so is_impaired is FALSE.
R provides two functions for checking types:
- typeof(): returns the internal storage type (e.g., "double", "character")
- class(): returns the higher-level type (e.g., "numeric", "factor")

class(age)
[1] "numeric"
class(sex)
[1] "character"
class(is_impaired)
[1] "logical"
For atomic vectors, class() is usually what you want — it tells you how R will treat the value.
So far, each variable has held a single value. To store multiple values, use c() (short for combine):
ages <- c(55, 70, 45, 62, 38)
ages
[1] 55 70 45 62 38
sexes <- c("Male", "Female", "Male", "Female", "Male")
sexes
[1] "Male" "Female" "Male" "Female" "Male"

The : operator creates a sequence of integers, which is useful for generating regular number sequences without typing each one:

1:5
[1] 1 2 3 4 5
10:15
[1] 10 11 12 13 14 15
All elements of an atomic vector must be the same type. If you mix types, R automatically coerces (converts) them to the most flexible type:
mixed <- c(1, "two", TRUE)
mixed
[1] "1" "two" "TRUE"
typeof(mixed)
[1] "character"
Everything became a character string! R follows a coercion hierarchy:
logical → integer → double → character
Each type can be converted to the types to its right, but not the other way without explicit conversion. This is a common source of surprises — if your “numbers” are behaving oddly, check whether a stray character value coerced the entire vector to character.
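To make the "stray character value" scenario concrete, here is a small sketch. The vector creatinine_raw is a hypothetical example of raw lab input, and the conversion produces NA (R's missing-value marker, covered later in this chapter) for text that cannot be read as a number.

```r
# A single text entry turns the whole vector into character
creatinine_raw <- c(1.2, 1.8, "pending", 2.1)   # hypothetical raw lab values
typeof(creatinine_raw)                           # "character"

# Explicit conversion back to numbers; unparseable text becomes NA.
# R also emits a warning here, which is silenced for this demonstration.
creatinine_num <- suppressWarnings(as.numeric(creatinine_raw))
creatinine_num          # 1.2 1.8  NA 2.1
is.na(creatinine_num)   # FALSE FALSE  TRUE FALSE
```

Checking typeof() on a column that "should" be numeric is a quick way to spot this kind of problem.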
Access elements by position using square brackets []. R uses 1-based indexing — the first element is [1]:
ages[1]     # first patient
[1] 55
ages[2:4]   # patients 2 through 4
[1] 70 45 62
ages[-1]    # all EXCEPT the first
[1] 70 45 62 38

Vectorization is one of R’s most powerful features: operations apply to every element at once, without needing a loop.

ages + 10   # add 10 to every age
[1] 65 80 55 72 48
Let’s use this to compute eGFR for five patients simultaneously:
# Five clinic patients
ages <- c(55, 70, 45, 62, 38)
weights <- c(70, 58, 82, 65, 90)
creatinines <- c(1.2, 1.8, 0.9, 2.1, 1.0)

# Calculate eGFR for ALL patients at once (without sex adjustment for now)
egfrs <- (140 - ages) * weights / (72 * creatinines)
egfrs
[1] 68.86574 31.32716 120.21605 33.53175 127.50000
One line of code — five eGFR results. No loops needed. We’ll add the sex adjustment factor later when we learn about functions.
In many languages, you’d write a for loop to process each patient one at a time. R’s vectorized operations are not only more concise — they’re also faster, because R optimizes the computation internally.
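For comparison, the sketch below computes the same five values with an explicit loop and with the vectorized expression, then checks that they agree. The for loop syntax is introduced briefly later in this chapter; the names egfrs_loop and egfrs_vec are introduced here only for illustration.

```r
ages        <- c(55, 70, 45, 62, 38)
weights     <- c(70, 58, 82, 65, 90)
creatinines <- c(1.2, 1.8, 0.9, 2.1, 1.0)

# Loop version: compute one patient at a time
egfrs_loop <- numeric(length(ages))   # pre-allocate the result vector
for (i in seq_along(ages)) {
  egfrs_loop[i] <- (140 - ages[i]) * weights[i] / (72 * creatinines[i])
}

# Vectorized version: one expression for all patients
egfrs_vec <- (140 - ages) * weights / (72 * creatinines)

all.equal(egfrs_loop, egfrs_vec)   # TRUE
```

Both versions give the same numbers; the vectorized form simply states the formula once.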
R provides built-in functions to summarize vectors:
mean(egfrs)       # average eGFR
[1] 76.28814
median(egfrs)
[1] 68.86574
min(egfrs)
[1] 31.32716
max(egfrs)
[1] 127.5
length(egfrs)     # how many patients
[1] 5
round(egfrs, 1)   # round to 1 decimal place
[1] 68.9 31.3 120.2 33.5 127.5
Atomic vectors require all elements to be the same type. A list removes that restriction — it can hold elements of different types, including other vectors and even other lists.
Think of a list like a patient record: it bundles a name (character), an age (numeric), and a set of lab results (numeric vector) into one object.
patient <- list(
  name = "Somchai",
  age = 55,
  labs = c(1.2, 98, 5.4)
)
patient
$name
[1] "Somchai"
$age
[1] 55
$labs
[1] 1.2 98.0 5.4
Use $ to access a named element, or [[]] to access by name or position:
patient$name
[1] "Somchai"
patient[["age"]]
[1] 55
patient[[3]]   # third element (labs)
[1] 1.2 98.0 5.4

[] vs [[]] for lists

- patient[1] (single brackets) returns a smaller list containing the first element
- patient[[1]] (double brackets) extracts the element itself

Think of single brackets as taking a slice of the container, and double brackets as reaching inside to pull out the contents.
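The difference matters as soon as you pass the result to another function. A minimal sketch, recreating the patient list from above:

```r
patient <- list(name = "Somchai", age = 55, labs = c(1.2, 98, 5.4))

class(patient["labs"])     # "list": still a list, holding one element
class(patient[["labs"]])   # "numeric": the vector itself

labs_mean <- mean(patient[["labs"]])   # works: the input is a numeric vector
# mean(patient["labs"]) would return NA with a warning, because a list is not numeric
```

If a function complains that its input is not numeric, an accidental single bracket is a common cause.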
str()

For complex lists, str() (short for structure) gives a compact overview:

str(patient)
List of 3
$ name: chr "Somchai"
$ age : num 55
$ labs: num [1:3] 1.2 98 5.4
This shows the type and first few values of each element — very useful for understanding unfamiliar objects.
Many R functions return lists. For example, t.test() returns a list containing the test statistic, p-value, confidence interval, and more. You’ll see this in action when we cover inferential statistics later in the book.
R objects can carry attributes — metadata attached to a vector. You’ve already seen one attribute implicitly: when you name elements in a list, those names are an attribute.
Two common attributes are:
- names: labels for each element
- class: tells R how to treat the object (e.g., as a factor, a date, or a data frame)

labs <- c(creatinine = 1.2, glucose = 98, potassium = 5.4)
attributes(labs)
$names
[1] "creatinine" "glucose" "potassium"
The most important use of attributes in everyday R is the factor — a vector with levels and class attributes that make it behave as categorical data.
A factor is R’s type for categorical data — variables with a fixed set of allowed values. Think of blood group (A, B, AB, O), treatment arm (placebo, drug), or CKD stage (G1, G2, G3, G4-G5).
Under the hood, a factor is an integer vector with two extra attributes: levels (the allowed categories) and class (set to "factor"). It looks like text, but R stores it as numbers with labels.
# A character vector — just text
ckd_chr <- c("G2", "G1", "G3", "G2", "G1")
class(ckd_chr)
[1] "character"

# A factor — text with defined levels
ckd_fct <- factor(ckd_chr)
ckd_fct
[1] G2 G1 G3 G2 G1
Levels: G1 G2 G3
class(ckd_fct)
[1] "factor"
Notice the Levels: line — R now knows the allowed categories. By default, factor() sorts levels alphabetically (G1, G2, G3).
factor()

The factor() function converts a character vector into a factor. You can explicitly set the order of levels with the levels argument:

factor(x, levels = c("level1", "level2", "level3"))

Let’s break this down:

- x: the character vector to convert
- levels: a character vector specifying the allowed values and their order. The first level is treated as the “reference” in statistical models.

Let’s use this with our CKD stages, where the clinical order matters:
# Without explicit levels — alphabetical order (happens to be correct here)
factor(c("G2", "G1", "G3"))
[1] G2 G1 G3
Levels: G1 G2 G3

# With explicit levels — we control the order
ckd_stages <- factor(
  c("G2", "G1", "G3"),
  levels = c("G1", "G2", "G3", "G4-G5")
)
ckd_stages
[1] G2 G1 G3
Levels: G1 G2 G3 G4-G5
Notice that G4-G5 appears in the levels even though no patient has that stage. That’s the power of factors — they define all possible categories, not just the ones present in your data.
We can verify the attributes that make a factor work:
attributes(ckd_stages)
$levels
[1] "G1" "G2" "G3" "G4-G5"
$class
[1] "factor"
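The "integer vector with labels" description can be checked directly. This short sketch recreates ckd_stages and looks at its storage, its integer codes, and a frequency table.

```r
ckd_stages <- factor(
  c("G2", "G1", "G3"),
  levels = c("G1", "G2", "G3", "G4-G5")
)

typeof(ckd_stages)       # "integer": the underlying storage
as.integer(ckd_stages)   # 2 1 3: each value's position within the levels
stage_counts <- table(ckd_stages)
stage_counts             # G4-G5 is listed with a count of 0
```

Because the unused level is still part of the factor, summaries such as table() report it explicitly instead of silently dropping it.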
A character "G2" and a factor "G2" print the same way, but R treats them differently:
When you see unexpected ordering in plots or tables, check whether your variable is a factor with class().
Level order controls:
Getting the order right early saves headaches later. In the next chapter, you’ll learn fct_relevel() — a tidyverse tool to reorder levels conveniently.
Python’s type system differs from R in several ways:
Atomic types:
| R | Python | Example |
|---|---|---|
| double | float | 3.14 |
| integer | int | 42 |
| character | str | "hello" |
| logical | bool | True, False |
Note: Python booleans are capitalized (True/False) but not all-caps like R’s TRUE/FALSE.
Vectors and lists:
# Python list — NOT vectorized by default
ages = [55, 70, 45, 62, 38]
ages + 10 # TypeError!
ages[0] # 55 — 0-based indexing!
# For vectorized operations, use NumPy
import numpy as np
ages = np.array([55, 70, 45, 62, 38])
ages + 10 # array([65, 80, 55, 72, 48])

Key difference: R is 1-based (ages[1] is the first), Python is 0-based (ages[0] is the first).
Factors:
# Factors equivalent in pandas:
import pandas as pd
ckd = pd.Categorical(
    ["G2", "G1", "G3"],
    categories=["G1", "G2", "G3", "G4-G5"],
    ordered=True
)

Comparisons produce logical values (TRUE / FALSE). These are essential for filtering and classifying data.
| Operator | Meaning | Example |
|---|---|---|
| == | Equal to | x == 5 |
| != | Not equal to | x != 5 |
| > | Greater than | x > 5 |
| < | Less than | x < 5 |
| >= | Greater than or equal to | x >= 5 |
| <= | Less than or equal to | x <= 5 |
== for comparison, <- for assignment
Use == to test equality (“is this equal to?”) and <- to assign a value (“store this”). Mixing them up is a common source of bugs.
When applied to vectors, comparisons are vectorized:
egfrs
[1] 68.86574 31.32716 120.21605 33.53175 127.50000

# Which patients have impaired kidney function (eGFR < 60)?
egfrs < 60
[1] FALSE TRUE FALSE TRUE FALSE

Two of our five patients have eGFR below 60.

You can use a logical vector to subset another vector, extracting only the elements where the condition is TRUE:

# Extract only the impaired eGFR values
egfrs[egfrs < 60]
[1] 31.32716 33.53175

# What ages correspond to impaired kidneys?
ages[egfrs < 60]
[1] 70 62
Use & (AND), | (OR), and ! (NOT) to combine conditions:
# Patients older than 50 with eGFR < 60
egfrs < 60 & ages > 50
[1] FALSE TRUE FALSE TRUE FALSE

%in% operator

%in% checks whether values exist in a given set:

blood_types <- c("A", "B", "O", "AB", "A")
blood_types %in% c("A", "O")
[1] TRUE FALSE TRUE FALSE TRUE

A useful trick: TRUE counts as 1 and FALSE as 0. This means sum() counts the TRUEs, and mean() gives the proportion:

sum(egfrs < 60)    # how many patients have eGFR < 60?
[1] 2
mean(egfrs < 60)   # what proportion?
[1] 0.4
40% of our patients (2 out of 5) have eGFR below 60.
Missing values (NA)

Real medical data is rarely complete. Lab results may be pending, forms may have blank fields, or data entry may have been skipped. R represents missing values as NA (Not Available).
NA is contagious

Almost any calculation involving NA returns NA, because if you don’t know a value, you can’t know the result:

NA + 1
[1] NA
NA > 5
[1] NA
NA == NA
[1] NA

That last one may surprise you. NA == NA returns NA because, if you don’t know what either value is, you can’t tell whether they’re equal.
Use is.na() to test for NA:
x <- c(10, NA, 30, NA, 50)
is.na(x)
[1] FALSE TRUE FALSE TRUE FALSE
!is.na(x)   # the opposite: which are NOT missing?
[1] TRUE FALSE TRUE FALSE TRUE
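Because comparisons with NA return NA, testing with == does not find missing values. A short sketch of the pitfall and of two common uses of is.na(), using the same x as above:

```r
x <- c(10, NA, 30, NA, 50)

wrong_test <- x == NA     # NA NA NA NA NA: never TRUE, so it finds nothing
n_missing  <- sum(is.na(x))   # 2: number of missing values
known      <- x[!is.na(x)]    # 10 30 50: keep only the known values
```

Counting missing values with sum(is.na(x)) is often a useful first check on a new dataset.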
na.rm argument

Many summary functions return NA when the input contains missing values:

# What if patient 3's creatinine result is pending?
creatinines_with_na <- c(1.2, 1.8, NA, 2.1, 1.0)
mean(creatinines_with_na)
[1] NA

The mean of “something unknown” is unknown. To skip the missing value, use na.rm = TRUE (remove NAs):

mean(creatinines_with_na, na.rm = TRUE)
[1] 1.525
Now R computes the mean of the four known values, ignoring the missing one.
na.rm is everywhere
Most summary functions in R — mean(), sum(), median(), sd(), min(), max() — accept na.rm = TRUE. You’ll use it frequently with real-world datasets.
Python uses None and NaN (Not a Number) for missing values:
import numpy as np
import pandas as pd
x = [1, 2, None, 4]
pd.isna(x) # like is.na()
np.nanmean([1, 2, np.nan, 4]) # like mean(..., na.rm = TRUE)

Pandas DataFrames also provide .dropna() and .fillna() for handling missing data.
Sometimes you need R to make decisions: “if the eGFR is above 90, classify as normal; otherwise, check the next threshold.” This is flow control.
if / else

The simplest form makes a two-way decision: do one thing if a condition is true, something else if it’s false.

if (condition) {
  # do this when TRUE
} else {
  # do this when FALSE
}

Let’s check whether a patient’s kidney function is impaired:

patient_egfr <- 68.9
if (patient_egfr < 60) {
  status <- "Impaired"
} else {
  status <- "Normal"
}
status
[1] "Normal"

Our patient’s eGFR of 68.9 is not below 60, so the result is "Normal".

if / else if / else

When you have more than two categories, add else if branches. Each condition is checked in order: R takes the first one that is TRUE and skips the rest.
Let’s classify kidney function into simplified CKD stages:
patient_egfr <- 68.9
if (patient_egfr >= 90) {
  stage <- "G1 (Normal)"
} else if (patient_egfr >= 60) {
  stage <- "G2 (Mild decrease)"
} else if (patient_egfr >= 30) {
  stage <- "G3 (Moderate decrease)"
} else {
  stage <- "G4-G5 (Severe decrease)"
}
stage
[1] "G2 (Mild decrease)"
The eGFR of 68.9 didn’t satisfy >= 90 (first branch), but it did satisfy >= 60 (second branch), so R assigned "G2 (Mild decrease)" and stopped checking.
We’re using a simplified 4-level classification for teaching purposes. The full KDIGO staging system divides kidney function into 6 stages (G1, G2, G3a, G3b, G4, G5).
ifelse()

The if / else statement works on a single value. For vectors, use ifelse():

# Classify all patients as "Impaired" or "Normal" at once
ifelse(egfrs < 60, "Impaired", "Normal")
[1] "Normal" "Impaired" "Normal" "Impaired" "Normal"

- if / else: for a single condition (one patient)
- ifelse(): for a vector of values (multiple patients at once)

In later chapters, we’ll learn an even more powerful tool for multi-level classification: dplyr::case_when().
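For more than two categories on a whole vector, one base R option is cut(), which this chapter does not otherwise cover. The sketch below is an illustration under that assumption: it reproduces the simplified four-stage thresholds from the if / else if chain, using short stage labels chosen for this example.

```r
egfrs <- c(68.86574, 31.32716, 120.21605, 33.53175, 127.5)

# right = FALSE makes each interval include its lower bound, e.g. [60, 90),
# which matches the >= comparisons used in the if / else if chain
stages <- cut(
  egfrs,
  breaks = c(-Inf, 30, 60, 90, Inf),
  labels = c("G4-G5", "G3", "G2", "G1"),
  right  = FALSE
)
as.character(stages)   # "G2" "G3" "G1" "G3" "G1"
```

The result of cut() is a factor, so the stage labels come with a defined set of levels.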
for loops (brief)

A for loop repeats an action for each element in a sequence:

for (i in 1:3) {
  cat("Patient", i, ": eGFR =", round(egfrs[i], 1), "\n")
}
Patient 1 : eGFR = 68.9
Patient 2 : eGFR = 31.3
Patient 3 : eGFR = 120.2
You’ll rarely write for loops in day-to-day R code. R’s vectorized operations and the tidyverse functions we’ll learn in later chapters handle most iteration more elegantly. We mention loops here so you recognize them, but don’t worry if they feel unfamiliar.
Python uses indentation instead of braces for code blocks:
# Python
if egfr >= 90:
    stage = "G1 (Normal)"
elif egfr >= 60:
    stage = "G2 (Mild decrease)"
elif egfr >= 30:
    stage = "G3 (Moderate decrease)"
else:
    stage = "G4-G5 (Severe decrease)"

Key differences: elif (not else if), colon : after each condition, and indentation defines the blocks (no {}).
You’ve already been using functions throughout this chapter — mean(), round(), c(), is.na(). A function takes inputs (called arguments), does something with them, and returns a result.
[Diagram: a function takes an input and returns an output. The inputs 1.2, 1.8 go into mean() and give 1.5; the input 68.865 goes into round(, 1) and gives 68.9.]
Functions in R generally come from three sources:
| Source | Examples | Documentation |
|---|---|---|
| Base R: built into R itself | mean(), round(), sum(), paste() | Always available, no library() needed |
| R packages: installed add-ons (see Section 1.8) | readr::read_csv(), dplyr::filter() | Available after library() or via :: |
| User-defined: written by you | classify_ckd(), estimate_egfr() | No built-in docs (unless you build a package) |
Everything you’ve used so far — mean(), c(), is.na(), round() — comes from base R. In later chapters, we’ll rely heavily on functions from R packages (especially the tidyverse). In this section, you’ll learn to write your own user-defined functions.
For base R and package functions, type ? followed by the function name in the console to view its documentation:
?mean
?round

The help page appears in RStudio’s Help pane, showing the function’s description, arguments, and examples.
Use ?? (double question mark) to search across all documentation. For example, ??correlation will find functions related to correlation, even if you don’t know the exact function name.
? doesn’t work for user-defined functions
The ? help system only works for functions that come with documentation — i.e., base R and installed packages. Functions you write yourself (like the ones we’ll create below) won’t have help pages unless you package them into an R package with documentation.
Now let’s learn to create user-defined functions. The syntax is:
function_name <- function(arg1, arg2, arg3) {
  # body: do something with the arguments
  result # the last expression is automatically returned
}

Let’s break this down piece by piece:

- function_name: the name you give your function. Follow the same snake_case naming rules as variables. Pick a name that describes what the function does (e.g., classify_ckd, estimate_egfr).
- function(...): the function keyword tells R you’re creating a function. This is followed by parentheses containing the arguments.
- arg1, arg2, arg3: the arguments (inputs). These are placeholder names for the values the caller will provide. You can have zero, one, or many arguments.
- { ... }: the body, wrapped in curly braces. This is the code that runs when the function is called. It can be one line or many lines.
- The last expression in the body is returned automatically, so you don’t need an explicit return() statement (though you can use one if you prefer).

Let’s start by turning our CKD staging code into a reusable function:
classify_ckd <- function(egfr) {
  if (egfr >= 90) {
    "G1 (Normal)"
  } else if (egfr >= 60) {
    "G2 (Mild decrease)"
  } else if (egfr >= 30) {
    "G3 (Moderate decrease)"
  } else {
    "G4-G5 (Severe decrease)"
  }
}

Now we can classify any eGFR value with a single call:

classify_ckd(68.9)
[1] "G2 (Mild decrease)"
classify_ckd(25.3)
[1] "G4-G5 (Severe decrease)"
classify_ckd(95.0)
[1] "G1 (Normal)"
When calling a function, you can pass arguments in two ways:
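- By position: values are matched to arguments in the order in which the arguments are defined.
- By name: each value is matched to an argument explicitly, using =.

round(68.865, 1)                # by position: x = 68.865, digits = 1
[1] 68.9
round(x = 68.865, digits = 1)   # by name: same result, more explicit
[1] 68.9
round(digits = 1, x = 68.865)   # named args can be in any order
[1] 68.9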
For functions with one or two arguments, positional is fine — round(68.865, 1) is clear enough. But when a function has many arguments, naming them makes your code much easier to read. Compare:
- estimate_egfr(55, 70, 1.2, "Female"): what is 70? What is 1.2?
- estimate_egfr(age = 55, weight_kg = 70, creatinine = 1.2, sex = "Female"): self-documenting

Sometimes a function argument has a sensible “usual” value. You can specify a default by using = in the function definition:

greet <- function(name, greeting = "Hello") {
  paste(greeting, name)
}

If the caller doesn’t provide greeting, R uses the default:

greet("Dr. Smith")
[1] "Hello Dr. Smith"
greet("Dr. Smith", greeting = "Good morning")
[1] "Good morning Dr. Smith"
This is useful when most calls use the same value, but you still want the flexibility to override it.
Now let’s build the full Cockcroft-Gault calculation, combining everything we’ve learned — arithmetic, variables, if-else, default arguments, and function definition:
estimate_egfr <- function(age, weight_kg, creatinine, sex = "Male") {
  egfr <- (140 - age) * weight_kg / (72 * creatinine)
  # Apply sex correction factor
  if (sex == "Female") {
    egfr <- egfr * 0.85
  }
  round(egfr, 1)
}

Here, sex = "Male" is a default argument: if the caller doesn’t specify sex, it assumes male (no correction applied).

# Male patient (default sex)
estimate_egfr(age = 55, weight_kg = 70, creatinine = 1.2)
[1] 68.9

# Female patient (with 0.85 correction)
estimate_egfr(age = 70, weight_kg = 58, creatinine = 1.8, sex = "Female")
[1] 26.6
The female correction factor (× 0.85) lowered the estimate from 31.3 to 26.6 — a clinically meaningful difference that shifts the CKD stage from G3 to G4–G5.
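The function above tests a single sex value with if, so it handles one patient per call. A possible vectorized variant is sketched below under the hypothetical name estimate_egfr_vec; it is not the chapter's function, just one way to combine the same formula with ifelse() so that all five clinic patients can be processed at once.

```r
# Hypothetical vectorized variant of estimate_egfr()
estimate_egfr_vec <- function(age, weight_kg, creatinine, sex = "Male") {
  egfr <- (140 - age) * weight_kg / (72 * creatinine)
  # 0.85 for female patients, 1 (no change) otherwise; one factor per patient
  sex_factor <- ifelse(sex == "Female", 0.85, 1)
  round(egfr * sex_factor, 1)
}

ages        <- c(55, 70, 45, 62, 38)
weights     <- c(70, 58, 82, 65, 90)
creatinines <- c(1.2, 1.8, 0.9, 2.1, 1.0)
sexes       <- c("Male", "Female", "Male", "Female", "Male")

adjusted <- estimate_egfr_vec(ages, weights, creatinines, sexes)
adjusted   # 68.9 26.6 120.2 28.5 127.5
```

Multiplying by a per-patient factor, instead of wrapping the whole result in ifelse(), keeps the output the same length as the inputs even when sex is left at its single default value.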
Use sapply() to apply a function to each element of a vector:
# Classify each patient's CKD stage
sapply(round(egfrs, 1), classify_ckd)
[1] "G2 (Mild decrease)" "G3 (Moderate decrease)" "G1 (Normal)"
[4] "G3 (Moderate decrease)" "G1 (Normal)"
sapply() works for simple cases, but in later chapters you’ll learn dplyr::mutate() with case_when() — a much more elegant way to classify and transform columns in a data frame.
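Within base R, vapply() is a stricter relative of sapply(): you state what each call should return, and R raises an error if a result does not match. A minimal sketch, repeating the chapter's classify_ckd() so the block runs on its own:

```r
classify_ckd <- function(egfr) {
  if (egfr >= 90) {
    "G1 (Normal)"
  } else if (egfr >= 60) {
    "G2 (Mild decrease)"
  } else if (egfr >= 30) {
    "G3 (Moderate decrease)"
  } else {
    "G4-G5 (Severe decrease)"
  }
}

egfrs <- c(68.86574, 31.32716, 120.21605, 33.53175, 127.5)

# FUN.VALUE = character(1): each call must return exactly one string
stages <- vapply(round(egfrs, 1), classify_ckd, FUN.VALUE = character(1))
stages
```

The output matches the sapply() call above; the difference is only that the expected return type is checked.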
Python defines functions with def:
def estimate_egfr(age, weight_kg, creatinine, sex="Male"):
    egfr = (140 - age) * weight_kg / (72 * creatinine)
    if sex == "Female":
        egfr *= 0.85
    return round(egfr, 1)

Key differences: def keyword (not function()), colon after declaration, explicit return statement, and indentation-based blocks.
R comes with useful built-in functions, but its real strength lies in packages — add-ons created by the R community. Think of them as apps for your phone: R is the operating system, and packages are the apps that add new capabilities.
There’s an important distinction:
| Action | Command | Frequency |
|---|---|---|
| Install a package | install.packages("name") | Once |
| Load a package | library(name) | Every R session |
# Install (one time — downloads the package)
install.packages("tidyverse")
# Load (every session — makes functions available)
library(tidyverse)

Think of it like apps: you install an app once from the store, but you open it each time you want to use it. Same with R packages.

:: notation

You can use a single function from a package without loading the entire package:

readr::read_csv("data/diabetes.csv")

The package::function() syntax is useful when you only need one function, or when two packages have functions with the same name.
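As a small runnable sketch of the package::function() form, the calls below use stats and utils, two packages that ship with R itself, so nothing needs to be installed or loaded first. The numbers are illustrative.

```r
# Name the package explicitly in front of the function
creatinine_median <- stats::median(c(1.2, 1.8, 2.1, 1.0))   # 1.5
first_two_ages    <- utils::head(c(55, 70, 45, 62, 38), 2)  # 55 70

creatinine_median
first_two_ages
```

The same pattern resolves name clashes: for example, both stats and dplyr define a function called filter(), and writing stats::filter() or dplyr::filter() makes the intended one explicit.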
The tidyverse is a collection of R packages designed for data science. It’s the core toolkit we’ll use for the rest of this book:
| Package | Purpose |
|---|---|
| readr | Reading data files (CSV, etc.) |
| dplyr | Data manipulation |
| tidyr | Reshaping data |
| ggplot2 | Data visualization |
| stringr | Working with text |
| purrr | Functional programming |
| tibble | Modern data frames |
We’ll explore these packages in depth starting from the next chapter.
Python uses pip for installation and import for loading:
# Install (in terminal)
pip install pandas
# Load (in script)
import pandas as pd
# Use
pd.read_csv("data/diabetes.csv")

R’s library(dplyr) is analogous to Python’s import pandas as pd.
As your code becomes more complex, you’ll often chain multiple operations together. Without a pipe, this means nesting function calls:
# Nested: read from inside out
sort(round(egfrs, 1))
[1] 31.3 33.5 68.9 120.2 127.5

To read this, you start from the innermost function (round) and work outward (sort). With two functions it’s manageable, but add a few more and it quickly becomes unreadable.

|> pipe

R’s pipe operator |> lets you write operations left to right, like reading a sentence:

# Piped: read left to right
egfrs |> round(1) |> sort()
[1] 31.3 33.5 68.9 120.2 127.5
Read this as: “Take egfrs, then round to 1 decimal, then sort.”
The pipe takes the result from the left side and passes it as the first argument to the function on the right.
| Step | Expression | Passes result to |
|---|---|---|
| 1 | egfrs | round() |
| 2 | round(egfrs, 1) | sort() |
| 3 | sort(...) | final result |
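The equivalence between the nested and piped forms can be checked directly. This sketch assumes R 4.1 or later (where |> is available) and recreates egfrs so it runs on its own.

```r
egfrs <- c(68.86574, 31.32716, 120.21605, 33.53175, 127.5)

nested <- sort(round(egfrs, 1))
piped  <- egfrs |> round(1) |> sort()
identical(nested, piped)   # TRUE: the pipe only changes how the code is written

# The left-hand side fills the first argument; further arguments can be named
descending <- egfrs |> round(digits = 1) |> sort(decreasing = TRUE)
descending   # 127.5 120.2 68.9 33.5 31.3
```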
Here’s a more practical example:
# Without pipe
round(mean(c(1, 2, NA, 4), na.rm = TRUE), 2)
[1] 2.33

# With pipe — reads naturally
c(1, 2, NA, 4) |> mean(na.rm = TRUE) |> round(2)
[1] 2.33

“Take these numbers, then compute the mean (removing NAs), then round to 2 decimal places.”

|> vs %>%

You may encounter %>% in older R code or online tutorials. This is the magrittr pipe, which was R’s original pipe before version 4.1 (2021) added |> as a built-in feature.
Both work similarly for most use cases. We’ll use |> throughout this book since it requires no extra packages and is the modern standard.
Python doesn’t have a pipe operator. Instead, it uses method chaining with the dot .:
(df
.query("age > 50")
.sort_values("glucose")
.head(10)
)

This is conceptually similar to R’s pipe: both enable left-to-right reading of sequential data transformations.
Error messages are R’s way of telling you something went wrong. They can feel cryptic at first, but they follow patterns. Learning to read them is the #1 beginner survival skill.
| Error Message | What It Means |
|---|---|
| object 'x' not found | Typo in variable name, or forgot to run the line that creates it |
| could not find function | Package not loaded: did you forget library()? |
| unexpected ')' in ... | Mismatched parentheses: count your opening and closing parens |
| non-numeric argument | Used text where a number was expected |
| there is no package called | Package not installed: run install.packages() first |
Let’s see a common one in action:
# Typo in variable name
egfr_value <- 68.9
egrf_value # oops: "egrf" instead of "egfr"
Error in eval(expr, envir, enclos): object 'egrf_value' not found
R tells you exactly what it can’t find. Read the message, spot the typo, and fix it.
Here’s another common one — calling a function from a package you haven’t loaded:
read_csv("data/diabetes.csv")
Error in read_csv("data/diabetes.csv") :
could not find function "read_csv"
The fix: add library(readr) or library(tidyverse) at the top of your script.
When you see an error:
In this chapter, you’ve learned the core building blocks of R programming. Here’s the journey we took with our eGFR example:
| Step | What we did | R concept |
|---|---|---|
| Start | (140 - 55) * 70 / (72 * 1.2) | Arithmetic operators |
| Store | age, weight_kg, creatinine | Variables |
| Verify | typeof(), class() | Data types |
| Categorize | CKD stages as factor() | Factors |
| Scale | 5 patients at once | Vectors |
| Filter | egfrs < 60 | Logical comparisons |
| Handle | Missing creatinine | NA and na.rm |
| Decide | CKD staging | if / else if / else |
| Wrap | estimate_egfr() | Functions |
| Chain | \|> | Pipes |
These fundamentals are the foundation for everything that follows. In the next chapter, we’ll learn about data frames — R’s structure for tabular data — and start working with real medical datasets.
BMI calculator. Create variables weight_kg and height_m for a patient, then calculate BMI using the formula: \(BMI = weight / height^2\). What is the BMI of a patient who weighs 85 kg and is 1.72 m tall?
Blood pressure classifier. Write an if / else if / else statement that classifies a systolic blood pressure value (sbp) into:
Test it with sbp <- 135.
Temperature converter. Write a function f_to_c() that converts Fahrenheit to Celsius using the formula: \(C = (F - 32) \times 5/9\). Test it with 98.6°F (normal body temperature) and 104°F (fever).
Lab value analysis. Given this vector of hemoglobin values (g/dL):
hgb <- c(12.5, NA, 15.2, 10.8, 14.1, NA, 11.3)

Answer these questions using R: