Session 4: The Building Blocks of OHDSI Studies
Then: Cohort Diagnostics with your colleague.
Every hospital stores data differently.
How do you run the same study across all of them?
Common Data Model = one standard structure for all sites.
Each site transforms its data into the OMOP format, so the same analysis code works everywhere.
Hospital A data ──→ OMOP CDM ──┐
Hospital B data ──→ OMOP CDM ──┤──→ Same analysis code
Hospital C data ──→ OMOP CDM ──┘
| Table | Contains |
|---|---|
| person | Demographics (birth year, gender, race) |
| observation_period | When we have data on each person |
| visit_occurrence | Healthcare encounters |
| condition_occurrence | Diagnoses |
| drug_exposure | Medications |
| procedure_occurrence | Procedures |
| measurement | Labs and vitals |
Everything links through person_id.
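To make the linking concrete, here is a minimal sketch using Python's built-in sqlite3 module. The tables are cut down to a few columns and the rows are toy data invented for this example; a real CDM has many more columns and rows.

```python
import sqlite3

# Toy, in-memory stand-in for a CDM. Rows and IDs are illustrative only.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE person (person_id INTEGER PRIMARY KEY, year_of_birth INTEGER);
CREATE TABLE condition_occurrence (
    condition_occurrence_id INTEGER PRIMARY KEY,
    person_id INTEGER,
    condition_concept_id INTEGER,
    condition_start_date TEXT
);
CREATE TABLE drug_exposure (
    drug_exposure_id INTEGER PRIMARY KEY,
    person_id INTEGER,
    drug_concept_id INTEGER,
    drug_exposure_start_date TEXT
);
INSERT INTO person VALUES (42, 1970), (108, 1985);
INSERT INTO condition_occurrence VALUES (1, 42, 201826, '2015-03-15');
INSERT INTO drug_exposure VALUES (1, 42, 1503297, '2015-04-01');
""")

# person_id is the join key from every clinical event table back to person.
rows = conn.execute("""
SELECT p.person_id,
       COUNT(DISTINCT co.condition_occurrence_id) AS n_conditions,
       COUNT(DISTINCT de.drug_exposure_id)        AS n_drugs
FROM person p
LEFT JOIN condition_occurrence co ON co.person_id = p.person_id
LEFT JOIN drug_exposure de        ON de.person_id = p.person_id
GROUP BY p.person_id
ORDER BY p.person_id
""").fetchall()

for person_id, n_conditions, n_drugs in rows:
    print(person_id, n_conditions, n_drugs)
```

Person 108 has no events in the toy data, so the LEFT JOINs keep the person and report zero for both counts.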
Raw codes differ across systems:
| System | Diabetes Code |
|---|---|
| ICD-10-CM | E11.9 |
| SNOMED | 44054006 |
| Read | C10F. |
OMOP maps all of these to standard concepts.
concept_id = 201826 → “Type 2 diabetes mellitus”
The concept table is the Rosetta Stone of the CDM.
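A rough sketch of how that lookup works, assuming the usual vocabulary layout: source codes live in the concept table as non-standard concepts, and a 'Maps to' row in concept_relationship points each one at its standard concept. The two source concept_ids below (9000001, 9000002) are made-up placeholders and the tables are reduced to the columns needed here.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE concept (
    concept_id INTEGER PRIMARY KEY,
    concept_name TEXT,
    vocabulary_id TEXT,
    concept_code TEXT,
    standard_concept TEXT          -- 'S' marks a standard concept
);
CREATE TABLE concept_relationship (
    concept_id_1 INTEGER,
    concept_id_2 INTEGER,
    relationship_id TEXT
);
-- 9000001 and 9000002 are placeholder IDs invented for this sketch.
INSERT INTO concept VALUES
  (201826,  'Type 2 diabetes mellitus', 'SNOMED', '44054006', 'S'),
  (9000001, 'Type 2 diabetes mellitus without complications', 'ICD10CM', 'E11.9', NULL),
  (9000002, 'Type 2 diabetes mellitus', 'Read', 'C10F.', NULL);
INSERT INTO concept_relationship VALUES
  (9000001, 201826, 'Maps to'),
  (9000002, 201826, 'Maps to'),
  (201826,  201826, 'Maps to');
""")

def to_standard(conn, vocabulary_id, concept_code):
    """Return (concept_id, concept_name) of the standard concept for a source code."""
    return conn.execute("""
        SELECT std.concept_id, std.concept_name
        FROM concept src
        JOIN concept_relationship cr
          ON cr.concept_id_1 = src.concept_id
         AND cr.relationship_id = 'Maps to'
        JOIN concept std
          ON std.concept_id = cr.concept_id_2
         AND std.standard_concept = 'S'
        WHERE src.vocabulary_id = ? AND src.concept_code = ?
    """, (vocabulary_id, concept_code)).fetchone()

for vocab, code in [("ICD10CM", "E11.9"), ("Read", "C10F."), ("SNOMED", "44054006")]:
    print(vocab, code, "->", to_standard(conn, vocab, code))
```

All three source codes resolve to the same standard concept, which is why analysis code can filter on one concept_id regardless of how a site originally coded the diagnosis.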
Two keys unlock the whole CDM:
- person_id connects clinical events to patients
- concept_id standardizes meaning across vocabularies

Last session, we set up Eunomia: a synthetic OMOP CDM in SQLite.
Today we’ll use it to explore cohort definitions.
A cohort is a set of persons who satisfy one or more criteria for a duration of time.
Each person in a cohort has:
“All persons with a first diagnosis of T2DM, aged ≥18, with ≥365 days of prior observation.”
Every cohort produces the same simple output:
| cohort_definition_id | subject_id | cohort_start_date | cohort_end_date |
|---|---|---|---|
| 1 | 42 | 2015-03-15 | 2020-12-31 |
| 1 | 108 | 2017-06-22 | 2019-08-14 |
| 2 | 42 | 2016-01-10 | 2020-12-31 |
It’s just four columns.
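As an illustration, the T2DM example definition can be hand-written against a toy SQLite database. This is only a sketch: the four persons are invented, age is approximated as index year minus year of birth, and real OHDSI tooling generates far more thorough SQL than this.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE person (person_id INTEGER PRIMARY KEY, year_of_birth INTEGER);
CREATE TABLE observation_period (
    person_id INTEGER,
    observation_period_start_date TEXT,
    observation_period_end_date TEXT
);
CREATE TABLE condition_occurrence (
    person_id INTEGER,
    condition_concept_id INTEGER,
    condition_start_date TEXT
);
-- Toy data: 42 and 108 qualify, 7 is too young, 9 lacks prior observation.
INSERT INTO person VALUES (42, 1970), (108, 1985), (7, 2005), (9, 1960);
INSERT INTO observation_period VALUES
  (42,  '2010-01-01', '2020-12-31'),
  (108, '2015-01-01', '2019-08-14'),
  (7,   '2010-01-01', '2020-12-31'),
  (9,   '2018-01-01', '2020-12-31');
INSERT INTO condition_occurrence VALUES
  (42,  201826, '2016-05-01'),
  (42,  201826, '2015-03-15'),
  (108, 201826, '2017-06-22'),
  (7,   201826, '2015-01-01'),
  (9,   201826, '2018-03-01');
""")

cohort = conn.execute("""
WITH first_dx AS (               -- entry event: first T2DM diagnosis
  SELECT person_id, MIN(condition_start_date) AS index_date
  FROM condition_occurrence
  WHERE condition_concept_id = 201826
  GROUP BY person_id
)
SELECT 1 AS cohort_definition_id,
       f.person_id AS subject_id,
       f.index_date AS cohort_start_date,
       op.observation_period_end_date AS cohort_end_date
FROM first_dx f
JOIN person p ON p.person_id = f.person_id
JOIN observation_period op
  ON op.person_id = f.person_id
 AND f.index_date BETWEEN op.observation_period_start_date
                      AND op.observation_period_end_date
WHERE CAST(strftime('%Y', f.index_date) AS INTEGER) - p.year_of_birth >= 18
  AND julianday(f.index_date) - julianday(op.observation_period_start_date) >= 365
ORDER BY f.person_id
""").fetchall()

for row in cohort:
    print(row)
```

However elaborate the criteria become, the result keeps the same four-column shape, which is what lets downstream analyses treat every cohort the same way.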
Every OHDSI analysis starts with a cohort definition:
No cohort definition → no OHDSI study.
Two main tools:
| Tool | Type | Best For |
|---|---|---|
| ATLAS | Web GUI | Interactive exploration, visual building |
| Capr | R package | Programmatic, version-controlled definitions |
Both produce the same output.
A cohort definition specifies three things:
This logic gets represented in two formats.
JSON is the portable specification.
SQL is an executable implementation.
If you’ve seen interfaces in software, this is the same idea:
Cohort Definition (abstract specification)
│
├── JSON (portable format)
│   Used by: Strategus, ATLAS, study packages
│
└── SQL (executable format)
    Used by: CohortGenerator → database
One specification, multiple implementations.
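The specification-versus-implementation split can be sketched with a toy example. The JSON below is a hypothetical, heavily simplified format invented for illustration; it is not the schema ATLAS exports, and `render_sql` is not an OHDSI function.

```python
import json
import sqlite3

# Hypothetical mini-specification. Field names are invented for this sketch.
SPEC_JSON = """
{
  "cohort_id": 1,
  "entry_table": "condition_occurrence",
  "entry_date_column": "condition_start_date",
  "concept_column": "condition_concept_id",
  "concept_ids": [201826]
}
"""

def render_sql(spec):
    """Turn the declarative spec into one possible SQL implementation."""
    ids = ", ".join(str(int(c)) for c in spec["concept_ids"])
    # Identifiers come from a trusted spec here; never do this with user input.
    return (
        f"SELECT {int(spec['cohort_id'])} AS cohort_definition_id, "
        f"person_id AS subject_id, "
        f"MIN({spec['entry_date_column']}) AS cohort_start_date "
        f"FROM {spec['entry_table']} "
        f"WHERE {spec['concept_column']} IN ({ids}) "
        f"GROUP BY person_id"
    )

spec = json.loads(SPEC_JSON)
sql = render_sql(spec)

# Run the generated SQL against toy data to show that it is executable.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE condition_occurrence (
    person_id INTEGER, condition_concept_id INTEGER, condition_start_date TEXT
);
INSERT INTO condition_occurrence VALUES
  (42, 201826, '2016-05-01'),
  (42, 201826, '2015-03-15'),
  (108, 999, '2017-06-22');
""")
rows = conn.execute(sql).fetchall()
print(sql)
print(rows)
```

The spec says what the cohort is; the renderer decides how to compute it. A different renderer could target another SQL dialect without touching the spec.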
Describes the cohort logic declaratively.
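-- Simplified illustration; @names are parameters filled in before execution.
INSERT INTO @target_cohort_table (
  cohort_definition_id, subject_id,
  cohort_start_date, cohort_end_date
)
SELECT @target_cohort_id AS cohort_definition_id,
       co.person_id AS subject_id,
       co.condition_start_date AS cohort_start_date,
       op.observation_period_end_date AS cohort_end_date
FROM condition_occurrence co
JOIN observation_period op
  ON co.person_id = op.person_id
  -- keep only the observation period that contains the diagnosis
  AND co.condition_start_date BETWEEN op.observation_period_start_date
                                  AND op.observation_period_end_date
WHERE co.condition_concept_id IN (
  SELECT concept_id FROM @concept_set_0
)
Actually runs against the database.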
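The @-prefixed names are placeholders that get filled in before the statement runs. In OHDSI this is normally handled by R tooling; the Python `render` helper below is a stand-in written for this sketch, and the table names `cohort` and `codeset_0` are assumptions. Every column is qualified with its table alias so that SQLite accepts the join.

```python
import re
import sqlite3

TEMPLATE = """
INSERT INTO @target_cohort_table (
  cohort_definition_id, subject_id, cohort_start_date, cohort_end_date
)
SELECT @target_cohort_id AS cohort_definition_id,
       co.person_id AS subject_id,
       co.condition_start_date AS cohort_start_date,
       op.observation_period_end_date AS cohort_end_date
FROM condition_occurrence co
JOIN observation_period op
  ON co.person_id = op.person_id
WHERE co.condition_concept_id IN (
  SELECT concept_id FROM @concept_set_0
)
"""

def render(sql, **params):
    """Replace each @name token with its value; a missing name raises KeyError."""
    return re.sub(r"@(\w+)", lambda m: str(params[m.group(1)]), sql)

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE condition_occurrence (
    person_id INTEGER, condition_concept_id INTEGER, condition_start_date TEXT
);
CREATE TABLE observation_period (
    person_id INTEGER,
    observation_period_start_date TEXT,
    observation_period_end_date TEXT
);
CREATE TABLE codeset_0 (concept_id INTEGER);
CREATE TABLE cohort (
    cohort_definition_id INTEGER, subject_id INTEGER,
    cohort_start_date TEXT, cohort_end_date TEXT
);
-- Toy rows: only person 42 has a concept that is in the concept set.
INSERT INTO condition_occurrence VALUES
  (42, 201826, '2015-03-15'), (108, 999, '2017-06-22');
INSERT INTO observation_period VALUES
  (42, '2010-01-01', '2020-12-31'), (108, '2015-01-01', '2019-08-14');
INSERT INTO codeset_0 VALUES (201826);
""")

sql = render(
    TEMPLATE,
    target_cohort_table="cohort",
    target_cohort_id=1,
    concept_set_0="codeset_0",
)
conn.execute(sql)
rows = conn.execute("SELECT * FROM cohort").fetchall()
print(rows)
```

Keeping table names and IDs as parameters is what lets the same SQL run unchanged at every site: each site supplies its own schema and cohort table at render time.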
JSON is better for:
SQL is better for:
OHDSI tools convert between them automatically.
Once you have a cohort definition, you need to verify it.
CohortDiagnostics automates these checks. Your colleague will walk you through it after our notebook work.
modules/04_cohorts/cohorts.qmd