library(dplyr)
library(readr)
library(tidyr)
library(ggplot2)
# Download the Eunomia GiBleed dataset
eunomia_url <- "https://raw.githubusercontent.com/OHDSI/EunomiaDatasets/main/datasets/GiBleed/GiBleed_5.3.zip"
zip_path <- file.path(tempdir(), "GiBleed_5.3.zip")
download.file(eunomia_url, zip_path, mode = "wb")
# Unzip
data_dir <- file.path(tempdir(), "GiBleed_5.3")
unzip(zip_path, exdir = tempdir())
# What files do we have?
list.files(data_dir) |> sort()
Module 04: Cohort Definitions and Diagnostics
Exploring OHDSI Cohort Definitions and Building Intuition for Cohort Diagnostics
Agenda
- Exploring the OMOP CDM — Download Eunomia, read the raw tables, see how it all connects
- Cohort Definitions — Pull definitions from CohortDiagnostics, examine JSON and SQL
- Naive Cohort Diagnostics with dplyr — Build diagnostic visualizations by hand
- Transition to CohortDiagnostics — Why the package exists
1. Exploring the OMOP CDM
Eunomia is typically accessed through the Eunomia R package, which wraps a SQLite database. But under the hood, it’s just a set of CSV files in the OMOP CDM format. Let’s download the raw data and look at it directly.
Download and extract
Every file is one OMOP CDM table, stored as a CSV. This is the same data that Eunomia loads into SQLite — we’re just looking at it raw.
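Before reading anything, it can help to confirm that the tables used below were actually extracted. The following is a minimal sketch, assuming each table is stored as an upper-case `<TABLE>.csv` file as described above. `missing_cdm_files()` is a hypothetical helper, not part of Eunomia, and the demonstration runs against a scratch directory with placeholder files rather than the real download:

```r
# Hypothetical helper: report which expected CDM tables have no CSV in `dir`.
missing_cdm_files <- function(dir, tables) {
  expected <- paste0(toupper(tables), ".csv")
  tables[!file.exists(file.path(dir, expected))]
}

# Demonstration on a scratch directory holding two empty placeholder files.
demo_dir <- file.path(tempdir(), "cdm_demo")
dir.create(demo_dir, showWarnings = FALSE)
invisible(file.create(file.path(demo_dir, c("PERSON.csv", "CONCEPT.csv"))))

# "cohort" has no placeholder file, so it is the only table reported.
missing_tables <- missing_cdm_files(demo_dir, c("person", "concept", "cohort"))
missing_tables
```

With the real data, the same call could be pointed at `data_dir`; an empty result would mean every expected file is present.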
Read the core tables
person <- read_csv(file.path(data_dir, "PERSON.csv"), show_col_types = FALSE)
observation_period <- read_csv(file.path(data_dir, "OBSERVATION_PERIOD.csv"), show_col_types = FALSE)
condition_occurrence <- read_csv(file.path(data_dir, "CONDITION_OCCURRENCE.csv"), show_col_types = FALSE)
drug_exposure <- read_csv(file.path(data_dir, "DRUG_EXPOSURE.csv"), show_col_types = FALSE)
visit_occurrence <- read_csv(file.path(data_dir, "VISIT_OCCURRENCE.csv"), show_col_types = FALSE)
measurement <- read_csv(file.path(data_dir, "MEASUREMENT.csv"), show_col_types = FALSE)
procedure_occurrence <- read_csv(file.path(data_dir, "PROCEDURE_OCCURRENCE.csv"), show_col_types = FALSE)
concept <- read_csv(file.path(data_dir, "CONCEPT.csv"), show_col_types = FALSE)
concept_ancestor <- read_csv(file.path(data_dir, "CONCEPT_ANCESTOR.csv"), show_col_types = FALSE)
cohort <- read_csv(file.path(data_dir, "COHORT.csv"), show_col_types = FALSE)
How big is this dataset?
tibble(
  table = c("person", "observation_period", "condition_occurrence",
            "drug_exposure", "visit_occurrence", "measurement",
            "procedure_occurrence", "concept", "concept_ancestor", "cohort"),
  rows = c(nrow(person), nrow(observation_period), nrow(condition_occurrence),
           nrow(drug_exposure), nrow(visit_occurrence), nrow(measurement),
           nrow(procedure_occurrence), nrow(concept), nrow(concept_ancestor),
           nrow(cohort))
)
The person table
Demographics. Every patient has exactly one row here.
names(person)
person |> head()
Notice the *_concept_id columns: GENDER_CONCEPT_ID, RACE_CONCEPT_ID, ETHNICITY_CONCEPT_ID. These are codes, not human-readable names. To decode them, we need the concept table.
The concept table: the Rosetta Stone
names(concept)
concept |> head()
This table maps CONCEPT_ID values to human-readable names, vocabularies, and domains.
# What gender concepts are used?
person |>
  distinct(GENDER_CONCEPT_ID) |>
  left_join(concept, by = c("GENDER_CONCEPT_ID" = "CONCEPT_ID")) |>
  select(GENDER_CONCEPT_ID, CONCEPT_NAME, VOCABULARY_ID)
This dataset doesn’t label race concepts, but we can look them up in Athena: https://athena.ohdsi.org/search-terms/terms/8516
# What race concepts?
person |>
  distinct(RACE_CONCEPT_ID) |>
  left_join(concept, by = c("RACE_CONCEPT_ID" = "CONCEPT_ID")) |>
  select(RACE_CONCEPT_ID, CONCEPT_NAME, VOCABULARY_ID)
This is the pattern everywhere in OMOP: a *_concept_id column holds a code, and you join to concept to get the name. The vocabulary system standardizes across ICD, SNOMED, RxNorm, LOINC, and more.
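Because the same join recurs for every *_concept_id column, it can be wrapped in a small function. The following is a sketch only: `add_concept_name()` is a hypothetical helper, not an OHDSI function, and the two tibbles are made-up stand-ins that borrow the real column names (8507 and 8532 are the standard OMOP gender concepts):

```r
library(dplyr)

# Toy stand-ins for the real tables (made-up rows, real column names).
toy_person <- tibble(
  PERSON_ID = 1:3,
  GENDER_CONCEPT_ID = c(8507, 8532, 8532)
)
toy_concept <- tibble(
  CONCEPT_ID = c(8507, 8532),
  CONCEPT_NAME = c("MALE", "FEMALE")
)

# Hypothetical helper: attach the concept name for one *_CONCEPT_ID column.
add_concept_name <- function(data, id_col, concept_tbl, name_col = "concept_name") {
  # Keep only the ID and the name, and rename the name column for the caller.
  lookup <- concept_tbl[, c("CONCEPT_ID", "CONCEPT_NAME")]
  names(lookup)[2] <- name_col
  # setNames() builds the named `by` vector, e.g. c(GENDER_CONCEPT_ID = "CONCEPT_ID").
  left_join(data, lookup, by = setNames("CONCEPT_ID", id_col))
}

decoded <- add_concept_name(toy_person, "GENDER_CONCEPT_ID", toy_concept, "gender")
decoded
```

Using `left_join()` keeps every input row, so a code with no match in the concept table shows up as a missing name instead of silently dropping the row.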
Observation period
When do we have data on each person? This table defines the window of time during which a person’s records are considered complete.
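The length of that window follows directly from its start and end dates. The following is a minimal sketch on made-up rows, assuming the standard CDM column names OBSERVATION_PERIOD_START_DATE and OBSERVATION_PERIOD_END_DATE and that both columns are parsed as dates:

```r
library(dplyr)

# Made-up observation periods for two people.
toy_observation_period <- tibble(
  PERSON_ID = c(1, 2),
  OBSERVATION_PERIOD_START_DATE = as.Date(c("2000-01-01", "2010-06-15")),
  OBSERVATION_PERIOD_END_DATE = as.Date(c("2000-12-31", "2010-06-25"))
)

obs_length <- toy_observation_period |>
  mutate(
    # Date subtraction gives a difference in days; add 1 to count both end points.
    days_observed = as.integer(
      OBSERVATION_PERIOD_END_DATE - OBSERVATION_PERIOD_START_DATE
    ) + 1L
  ) |>
  select(PERSON_ID, days_observed)
obs_length
```

Whether to count the end day is a convention; the `+ 1L` here is one choice, not something the CDM prescribes.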
observation_period |> head()
# How many observation periods per person?
observation_period |>
  count(PERSON_ID, name = "n_periods") |>
  count(n_periods, name = "n_people")
Condition occurrence
Diagnoses. Each row is one recorded condition for one person.
names(condition_occurrence)
condition_occurrence |> head()
The CONDITION_CONCEPT_ID is the standardized code. Let’s see what the most common conditions are:
z <- condition_occurrence |>
  count(CONDITION_CONCEPT_ID, sort = TRUE) |>
  head(50) |>
  left_join(concept, by = c("CONDITION_CONCEPT_ID" = "CONCEPT_ID")) |>
  select(CONDITION_CONCEPT_ID, CONCEPT_NAME, n) |>
  print(n = 25)
This is the GiBleed dataset, so gastrointestinal hemorrhage (concept 192671) should be prominent.
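Counting rows measures how often a condition is recorded, not how many people have it, since one person can carry the same diagnosis many times. The following is a sketch of the person-level view on made-up data. Concept 111 is a placeholder ID, and `n_persons` stands in for `nrow(person)`:

```r
library(dplyr)

# Made-up condition records: person 1 has concept 192671 recorded twice.
toy_condition_occurrence <- tibble(
  PERSON_ID = c(1, 1, 1, 2, 3),
  CONDITION_CONCEPT_ID = c(192671, 192671, 111, 192671, 111)
)
n_persons <- 4  # assumed size of the person table; person 4 has no conditions

condition_prevalence <- toy_condition_occurrence |>
  group_by(CONDITION_CONCEPT_ID) |>
  summarise(
    n_records = n(),                         # rows, as in the count above
    n_persons_with = n_distinct(PERSON_ID),  # people with at least one record
    .groups = "drop"
  ) |>
  mutate(prop_persons = n_persons_with / n_persons) |>
  arrange(desc(n_records))
condition_prevalence
```

In this toy data both concepts reach half of the population even though their record counts differ, which is the distinction the two columns are meant to show.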
Drug exposure
Medications. Each row is one drug exposure for one person.
names(drug_exposure)
drug_exposure |> head()
# Most common drugs
drug_exposure |>
  count(DRUG_CONCEPT_ID, sort = TRUE) |>
  head(15) |>
  left_join(concept, by = c("DRUG_CONCEPT_ID" = "CONCEPT_ID")) |>
  select(DRUG_CONCEPT_ID, CONCEPT_NAME, VOCABULARY_ID, n)
You should see NSAIDs like celecoxib and diclofenac. The GiBleed dataset is designed around GI bleeds and NSAIDs.
How the tables connect
Everything connects through two keys:
- PERSON_ID: links clinical events to a patient
- CONCEPT_ID: standardizes meaning across vocabularies
Let’s follow one patient through the data:
# Pick a patient with interesting data
patient_summary <- condition_occurrence |>
  group_by(PERSON_ID) |>
  summarise(n_conditions = n_distinct(CONDITION_CONCEPT_ID)) |>
  arrange(desc(n_conditions)) |>
  head(1)
pid <- patient_summary$PERSON_ID[1]
cat("Patient", pid, "has", patient_summary$n_conditions[1], "distinct conditions")
# Demographics
person |>
  filter(PERSON_ID == pid) |>
  left_join(concept |> select(CONCEPT_ID, gender = CONCEPT_NAME),
            by = c("GENDER_CONCEPT_ID" = "CONCEPT_ID")) |>
  left_join(concept |> select(CONCEPT_ID, race = CONCEPT_NAME),
            by = c("RACE_CONCEPT_ID" = "CONCEPT_ID")) |>
  select(PERSON_ID, YEAR_OF_BIRTH, gender, race)
# Their conditions
condition_occurrence |>
  filter(PERSON_ID == pid) |>
  left_join(concept, by = c("CONDITION_CONCEPT_ID" = "CONCEPT_ID")) |>
  select(CONDITION_START_DATE, CONCEPT_NAME) |>
  arrange(CONDITION_START_DATE)
# Their drugs
drug_exposure |>
  filter(PERSON_ID == pid) |>
  left_join(concept, by = c("DRUG_CONCEPT_ID" = "CONCEPT_ID")) |>
  select(DRUG_EXPOSURE_START_DATE, CONCEPT_NAME, DAYS_SUPPLY) |>
  arrange(DRUG_EXPOSURE_START_DATE)
The cohort table
names(cohort)
nrow(cohort)
Four columns: COHORT_DEFINITION_ID, SUBJECT_ID, COHORT_START_DATE, COHORT_END_DATE. It’s empty because cohorts don’t exist in the raw data. They’re created by running a cohort definition against the CDM. That’s the topic of the next section.
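To make "running a cohort definition" concrete, here is a hand-rolled sketch that fills those four columns with dplyr. It is an illustration under simplifying assumptions, not how OHDSI tools generate cohorts. The rule (first recorded GI hemorrhage inside an observation period, followed until that period ends) and the toy rows are invented for this example, and concept 111 is a placeholder ID:

```r
library(dplyr)

# Made-up condition records and observation periods.
toy_condition_occurrence <- tibble(
  PERSON_ID = c(1, 1, 2, 3),
  CONDITION_CONCEPT_ID = c(192671, 192671, 192671, 111),
  CONDITION_START_DATE = as.Date(c("2005-03-01", "2004-07-10",
                                   "2011-01-20", "2009-09-09"))
)
toy_observation_period <- tibble(
  PERSON_ID = c(1, 2, 3),
  OBSERVATION_PERIOD_START_DATE = as.Date(c("2000-01-01", "2010-01-01", "2005-01-01")),
  OBSERVATION_PERIOD_END_DATE = as.Date(c("2012-12-31", "2015-12-31", "2010-12-31"))
)

naive_cohort <- toy_condition_occurrence |>
  filter(CONDITION_CONCEPT_ID == 192671) |>
  inner_join(toy_observation_period, by = "PERSON_ID") |>
  # Keep only events that fall inside the person's observation period.
  filter(CONDITION_START_DATE >= OBSERVATION_PERIOD_START_DATE,
         CONDITION_START_DATE <= OBSERVATION_PERIOD_END_DATE) |>
  # Cohort entry is the earliest qualifying event per person.
  group_by(PERSON_ID) |>
  slice_min(CONDITION_START_DATE, n = 1, with_ties = FALSE) |>
  ungroup() |>
  transmute(
    COHORT_DEFINITION_ID = 1L,
    SUBJECT_ID = PERSON_ID,
    COHORT_START_DATE = CONDITION_START_DATE,
    COHORT_END_DATE = OBSERVATION_PERIOD_END_DATE
  )
naive_cohort
```

Person 3 never has the condition and drops out, while person 1 enters on the earlier of two records. A real cohort definition adds concept sets, inclusion criteria, and exit rules on top of this basic shape.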
The concept ancestor table
This table encodes vocabulary hierarchies. It lets you find, for example, all specific drugs that fall under the broader concept of “NSAIDs.”
concept_ancestor |> head()
# Find celecoxib (1118084) and its ancestors
concept_ancestor |>
  filter(DESCENDANT_CONCEPT_ID == 1118084) |>
  left_join(concept |> select(CONCEPT_ID, ancestor_name = CONCEPT_NAME),
            by = c("ANCESTOR_CONCEPT_ID" = "CONCEPT_ID")) |>
  select(ancestor_name, MIN_LEVELS_OF_SEPARATION, MAX_LEVELS_OF_SEPARATION) |>
  arrange(MIN_LEVELS_OF_SEPARATION)
This hierarchical structure is how OHDSI cohort definitions can specify “any NSAID” rather than listing every specific formulation.
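Going the other direction, from a broad concept down to everything beneath it, is the lookup that "any NSAID" relies on. The following is a sketch on made-up data. The concept IDs are placeholders (900 for a drug class, 901 and 902 for drugs under it, 903 for an unrelated drug), and the self-referencing rows mirror the OMOP convention that each concept is its own ancestor:

```r
library(dplyr)

# Made-up hierarchy: 900 is the class, 901 and 902 sit under it, 903 is unrelated.
toy_concept_ancestor <- tibble(
  ANCESTOR_CONCEPT_ID   = c(900, 900, 900, 901, 902, 903),
  DESCENDANT_CONCEPT_ID = c(900, 901, 902, 901, 902, 903)
)
toy_drug_exposure <- tibble(
  PERSON_ID = c(1, 2, 3, 3),
  DRUG_CONCEPT_ID = c(901, 903, 902, 903)
)

# Step 1: resolve the class to every concept at or below it.
class_members <- toy_concept_ancestor |>
  filter(ANCESTOR_CONCEPT_ID == 900) |>
  distinct(DESCENDANT_CONCEPT_ID)

# Step 2: keep exposures whose drug is one of those concepts.
class_exposures <- toy_drug_exposure |>
  semi_join(class_members, by = c("DRUG_CONCEPT_ID" = "DESCENDANT_CONCEPT_ID"))
class_exposures
```

`semi_join()` filters without adding columns or duplicating rows, which suits a membership test like this one.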
Summary
The OMOP CDM is a set of connected tables. We just explored the same data that Eunomia and DatabaseConnector abstract behind a SQL interface, but it’s all just tables linked by person_id and concept_id. The cohort table is empty because cohorts are the output of running a cohort definition. That’s what we’ll look at next.
Resources
- CohortDiagnostics: https://ohdsi.github.io/CohortDiagnostics/
- CohortGenerator: https://ohdsi.github.io/CohortGenerator/
- Capr (R cohort builder): https://ohdsi.github.io/Capr/
- ATLAS demo: https://atlas-demo.ohdsi.org
- The Book of OHDSI, Chapter 10 (Defining Cohorts): https://ohdsi.github.io/TheBookOfOhdsi/