The OMOP CDM and Cohort Definitions

Session 4: The Building Blocks of OHDSI Studies

Today’s Agenda

OMOP CDM Review: Quick refresher on the data model
Cohorts in OHDSI: The connective tissue of studies
Hands-on: Exploring cohort definitions and naive diagnostics

Then: Cohort Diagnostics with your colleague.

Part 1: OMOP CDM Review

The Problem

Every hospital stores data differently.

Hospital A: Custom EHR schema
Hospital B: Claims data format
Hospital C: Epic extraction
Hospital D: Vendor-specific warehouse

How do you run the same study across all of them?

The Solution: OMOP CDM

Common Data Model = one standard structure for all sites.

Each site transforms their data into OMOP format. Then the same code works everywhere.

Hospital A data ──→ OMOP CDM ──┐
Hospital B data ──→ OMOP CDM ──┤──→ Same analysis code
Hospital C data ──→ OMOP CDM ──┘

Core Clinical Tables

Table	Contains
`person`	Demographics (birth year, gender, race)
`observation_period`	When we have data on each person
`visit_occurrence`	Healthcare encounters
`condition_occurrence`	Diagnoses
`drug_exposure`	Medications
`procedure_occurrence`	Procedures
`measurement`	Labs and vitals

Everything links through person_id.
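As a minimal sketch of how person_id ties the tables together, the following uses Python's built-in sqlite3 module with a tiny in-memory database. The rows are invented for illustration and only a few CDM columns are included; it does not touch Eunomia.

```python
import sqlite3

# Toy in-memory database with a heavily trimmed subset of CDM columns.
# All rows are invented for illustration.
con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE person (person_id INTEGER PRIMARY KEY, year_of_birth INTEGER);
CREATE TABLE condition_occurrence (
  person_id INTEGER, condition_concept_id INTEGER, condition_start_date TEXT
);
INSERT INTO person VALUES (42, 1960), (108, 1975);
INSERT INTO condition_occurrence VALUES
  (42, 201826, '2015-03-15'),
  (108, 201826, '2017-06-22');
""")

# person_id is the join key between demographics and clinical events.
rows = con.execute("""
SELECT p.person_id, p.year_of_birth, co.condition_start_date
FROM person p
JOIN condition_occurrence co ON co.person_id = p.person_id
ORDER BY p.person_id
""").fetchall()
print(rows)
```

The same join pattern applies to the other clinical tables, since each carries a person_id column.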

The Vocabulary System

Raw codes differ across systems:

System	Diabetes Code
ICD-10-CM	E11.9
SNOMED	44054006
Read	C10F.

OMOP maps all of these to standard concepts.

concept_id = 201826 → “Type 2 diabetes mellitus”

The concept table is the Rosetta Stone of the CDM.
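The lookup can be sketched with a toy vocabulary in SQLite. This assumes the usual CDM pattern in which source concepts point to their standard concept through a "Maps to" row in concept_relationship. The concept_id 201826 comes from the slide and 44054006 is the SNOMED CT code for type 2 diabetes; the source concept IDs 9000001 and 9000002 and the helper to_standard are placeholders made up for this example.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE concept (
  concept_id INTEGER PRIMARY KEY, concept_name TEXT,
  vocabulary_id TEXT, concept_code TEXT, standard_concept TEXT
);
CREATE TABLE concept_relationship (
  concept_id_1 INTEGER, concept_id_2 INTEGER, relationship_id TEXT
);
-- 201826 is the standard concept; 9000001 and 9000002 are placeholder IDs.
INSERT INTO concept VALUES
  (201826,  'Type 2 diabetes mellitus', 'SNOMED', '44054006', 'S'),
  (9000001, 'Type 2 diabetes mellitus without complications', 'ICD10CM', 'E11.9', NULL),
  (9000002, 'Type 2 diabetes mellitus', 'Read', 'C10F.', NULL);
INSERT INTO concept_relationship VALUES
  (9000001, 201826, 'Maps to'),
  (9000002, 201826, 'Maps to'),
  (201826,  201826, 'Maps to');
""")

def to_standard(vocabulary_id, concept_code):
    """Return the standard concept_id a source code maps to, or None."""
    row = con.execute("""
        SELECT cr.concept_id_2
        FROM concept src
        JOIN concept_relationship cr
          ON cr.concept_id_1 = src.concept_id
         AND cr.relationship_id = 'Maps to'
        WHERE src.vocabulary_id = ? AND src.concept_code = ?
    """, (vocabulary_id, concept_code)).fetchone()
    return row[0] if row else None

# Three different source codes should collapse to one standard concept.
standard_ids = {to_standard(v, c) for v, c in
                [("ICD10CM", "E11.9"), ("SNOMED", "44054006"), ("Read", "C10F.")]}
print(standard_ids)
```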

How It All Connects

Two keys unlock the whole CDM:

person_id connects clinical events to patients
concept_id standardizes meaning across vocabularies

Eunomia: Our Practice CDM

Last session, we set up Eunomia: a synthetic OMOP CDM in SQLite.

~2,700 synthetic patients
Full CDM structure
Runs locally, no server needed

Today we’ll use it to explore cohort definitions.

Part 2: Cohorts in OHDSI

What is a Cohort?

A cohort is a set of persons who satisfy one or more criteria for a duration of time.

Each person in a cohort has:

A start date (when they entered)
An end date (when they left)

“All persons with a first diagnosis of T2DM, aged ≥18, with ≥365 days of prior observation.”

The Cohort Table

Every cohort produces the same simple output:

cohort_definition_id	subject_id	cohort_start_date	cohort_end_date
1	42	2015-03-15	2020-12-31
1	108	2017-06-22	2019-08-14
2	42	2016-01-10	2020-12-31

It’s just four columns.
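Because every cohort row is an interval, "who is in the cohort on a given date?" is a simple range check. The sketch below loads the three example rows into an in-memory SQLite table; the helper members_on is a hypothetical name for this example, not part of any OHDSI package.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE cohort (
  cohort_definition_id INTEGER,
  subject_id INTEGER,
  cohort_start_date TEXT,   -- ISO dates stored as text compare correctly
  cohort_end_date TEXT
);
INSERT INTO cohort VALUES
  (1, 42,  '2015-03-15', '2020-12-31'),
  (1, 108, '2017-06-22', '2019-08-14'),
  (2, 42,  '2016-01-10', '2020-12-31');
""")

def members_on(cohort_definition_id, on_date):
    """Subjects whose cohort interval covers on_date ('YYYY-MM-DD')."""
    rows = con.execute("""
        SELECT subject_id FROM cohort
        WHERE cohort_definition_id = ?
          AND ? BETWEEN cohort_start_date AND cohort_end_date
        ORDER BY subject_id
    """, (cohort_definition_id, on_date))
    return [r[0] for r in rows]

print(members_on(1, "2018-01-01"))  # both subjects are in cohort 1
print(members_on(1, "2020-01-01"))  # subject 108 has already left
```

Note that subject 42 appears in both cohort 1 and cohort 2: the same person can belong to any number of cohorts.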

Cohorts are the Connective Tissue

Every OHDSI analysis starts with a cohort definition:

Characterization: Who is in this cohort? What are their features?
Population-Level Estimation: Compare outcomes between two cohorts
Patient-Level Prediction: Predict which members of a target cohort will enter an outcome cohort

No cohort definition → no OHDSI study.

Creating Cohort Definitions

Two main tools:

Tool	Type	Best For
ATLAS	Web GUI	Interactive exploration, visual building
Capr	R package	Programmatic, version-controlled definitions

Both produce the same output: a cohort definition in JSON.

Anatomy of a Cohort Definition

A cohort definition specifies three things:

Entry criteria: What event qualifies someone to enter?
Inclusion criteria: What additional filters narrow the population?
Exit criteria: When does someone leave the cohort?
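To make the three parts concrete, here is a plain-Python sketch of the earlier example ("first diagnosis of T2DM, aged ≥18, with ≥365 days of prior observation"). The patient records are invented, build_cohort is a hypothetical helper, and exiting at the end of the observation period is an assumption chosen for simplicity; real definitions can use other exit rules.

```python
from datetime import date

# Toy records; all values are invented for illustration.
patients = {
    1: {"birth_year": 1960, "obs_start": date(2010, 1, 1),
        "obs_end": date(2020, 12, 31),
        "t2dm_dates": [date(2015, 3, 15), date(2016, 5, 1)]},
    2: {"birth_year": 2005, "obs_start": date(2012, 1, 1),
        "obs_end": date(2021, 6, 30),
        "t2dm_dates": [date(2018, 2, 1)]},    # too young at diagnosis
    3: {"birth_year": 1970, "obs_start": date(2017, 1, 1),
        "obs_end": date(2019, 8, 14),
        "t2dm_dates": [date(2017, 6, 22)]},   # too little prior observation
}

def build_cohort(patients, min_age=18, min_prior_days=365):
    """Return (person_id, start_date, end_date) rows for qualifying patients."""
    rows = []
    for person_id, p in sorted(patients.items()):
        if not p["t2dm_dates"]:
            continue
        # Entry criteria: the first T2DM diagnosis.
        entry = min(p["t2dm_dates"])
        # Inclusion criteria: age and prior observation at entry.
        age = entry.year - p["birth_year"]
        prior_days = (entry - p["obs_start"]).days
        if age < min_age or prior_days < min_prior_days:
            continue
        # Exit criteria (assumed here): end of the observation period.
        rows.append((person_id, entry, p["obs_end"]))
    return rows

cohort = build_cohort(patients)
print(cohort)
```

Only patient 1 survives: patient 2 fails the age filter and patient 3 has fewer than 365 days of observation before the diagnosis.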

This logic gets represented in two formats.

Two Representations, One Definition

JSON is the portable specification

Machine-readable
Arguably a little more human-readable than the generated SQL
Declarative: describes what, not how
Used by Strategus for study orchestration

SQL is an executable implementation

Actually queries the database
Generated from the JSON
Dialect-specific (SQL Server, PostgreSQL, etc.)

The Interface Pattern

If you’ve seen interfaces in software, this is the same idea:

Cohort Definition (abstract specification)
    │
    ├── JSON (portable format)
    │     Used by: Strategus, ATLAS, study packages
    │
    └── SQL (executable format)
          Used by: CohortGenerator → database

One specification, multiple implementations.

JSON: The Specification

{
  "ConceptSets": [{
    "id": 0,
    "name": "Type 2 DM",
    "expression": {
      "items": [{
        "concept": {
          "CONCEPT_ID": 201826,
          "CONCEPT_NAME": "Type 2 diabetes mellitus"
        }
      }]
    }
  }],
  "PrimaryCriteria": { "..." },
  "QualifiedLimit": { "..." }
}

Describes the cohort logic declaratively.
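Because the specification is plain JSON, any language can read it. The sketch below uses Python's json module to list the concept IDs in each concept set. It includes only the ConceptSets part shown above; the elided sections are left out because they cannot be parsed as written.

```python
import json

# Only the ConceptSets part of the definition; elided sections are omitted.
definition_json = """
{
  "ConceptSets": [{
    "id": 0,
    "name": "Type 2 DM",
    "expression": {
      "items": [{
        "concept": {
          "CONCEPT_ID": 201826,
          "CONCEPT_NAME": "Type 2 diabetes mellitus"
        }
      }]
    }
  }]
}
"""

definition = json.loads(definition_json)

# Map each concept set name to the concept IDs it lists.
concept_sets = {
    cs["name"]: [item["concept"]["CONCEPT_ID"]
                 for item in cs["expression"]["items"]]
    for cs in definition["ConceptSets"]
}
print(concept_sets)
```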

SQL: The Implementation

INSERT INTO @target_cohort_table (
  cohort_definition_id, subject_id,
  cohort_start_date, cohort_end_date
)
SELECT @target_cohort_id AS cohort_definition_id,
       co.person_id AS subject_id,
       co.condition_start_date AS cohort_start_date,
       op.observation_period_end_date AS cohort_end_date
FROM condition_occurrence co
JOIN observation_period op
  ON co.person_id = op.person_id
  -- keep only diagnoses that fall inside an observation period
 AND co.condition_start_date BETWEEN op.observation_period_start_date
                                 AND op.observation_period_end_date
WHERE co.condition_concept_id IN (
  SELECT concept_id FROM @concept_set_0
);

Actually runs against the database.
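The @-prefixed names are placeholders that must be filled in before the statement can run. The sketch below does this with naive string replacement and executes the result against a toy SQLite database. The render helper, the table names, and the rows are all made up for this example, and the string replacement is only a stand-in for the real OHDSI rendering and translation tooling. Columns are qualified with their table aliases so that SQLite does not reject person_id as ambiguous.

```python
import sqlite3

template = """
INSERT INTO @target_cohort_table (
  cohort_definition_id, subject_id, cohort_start_date, cohort_end_date
)
SELECT @target_cohort_id, co.person_id, co.condition_start_date,
       op.observation_period_end_date
FROM condition_occurrence co
JOIN observation_period op
  ON co.person_id = op.person_id
 AND co.condition_start_date BETWEEN op.observation_period_start_date
                                 AND op.observation_period_end_date
WHERE co.condition_concept_id IN (SELECT concept_id FROM @concept_set_0)
"""

def render(sql, **params):
    """Hypothetical helper: replace each @name placeholder with its value."""
    # Longest names first, so a name that prefixes another is not clobbered.
    for name in sorted(params, key=len, reverse=True):
        sql = sql.replace("@" + name, str(params[name]))
    return sql

con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE condition_occurrence (
  person_id INTEGER, condition_concept_id INTEGER, condition_start_date TEXT);
CREATE TABLE observation_period (
  person_id INTEGER, observation_period_start_date TEXT,
  observation_period_end_date TEXT);
CREATE TABLE t2dm_concepts (concept_id INTEGER);
CREATE TABLE cohort (
  cohort_definition_id INTEGER, subject_id INTEGER,
  cohort_start_date TEXT, cohort_end_date TEXT);
-- Person 42 has the target concept; person 7 has an unrelated placeholder one.
INSERT INTO condition_occurrence VALUES
  (42, 201826, '2015-03-15'), (7, 999, '2016-01-01');
INSERT INTO observation_period VALUES
  (42, '2010-01-01', '2020-12-31'), (7, '2010-01-01', '2020-12-31');
INSERT INTO t2dm_concepts VALUES (201826);
""")

sql = render(template, target_cohort_table="cohort", target_cohort_id=1,
             concept_set_0="t2dm_concepts")
con.execute(sql)
cohort_rows = con.execute("SELECT * FROM cohort").fetchall()
print(cohort_rows)
```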

Why Two Formats?

JSON is better for:

Sharing between tools and sites
Version control (readable diffs)
Automated study orchestration

SQL is better for:

Direct database execution
Understanding exactly what’s queried
Performance tuning

OHDSI tools generate the SQL from the JSON automatically.

Cohort Diagnostics: Preview

Once you have a cohort definition, you need to verify it.

Is the cohort the right size?
Do the demographics look right?
How does incidence change over time?
Does it look similar across databases?

CohortDiagnostics automates these checks. Your colleague will walk you through it after our notebook work.
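A few of these checks can be approximated by hand with simple queries over the cohort table. The sketch below runs three of them against the example rows from earlier, loaded into an in-memory SQLite table. It is a toy illustration of the idea, not a substitute for CohortDiagnostics.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE cohort (cohort_definition_id INTEGER, subject_id INTEGER,
                     cohort_start_date TEXT, cohort_end_date TEXT);
INSERT INTO cohort VALUES
  (1, 42,  '2015-03-15', '2020-12-31'),
  (1, 108, '2017-06-22', '2019-08-14'),
  (2, 42,  '2016-01-10', '2020-12-31');
""")

# Check 1: how many distinct subjects are in cohort 1?
(n_subjects,) = con.execute(
    "SELECT COUNT(DISTINCT subject_id) FROM cohort "
    "WHERE cohort_definition_id = 1"
).fetchone()

# Check 2: how many cohort entries start in each calendar year?
entries_by_year = dict(con.execute("""
    SELECT strftime('%Y', cohort_start_date) AS start_year, COUNT(*)
    FROM cohort
    WHERE cohort_definition_id = 1
    GROUP BY start_year
    ORDER BY start_year
"""))

# Check 3: does any row end before it starts?
(n_bad_dates,) = con.execute(
    "SELECT COUNT(*) FROM cohort WHERE cohort_end_date < cohort_start_date"
).fetchone()

print(n_subjects, entries_by_year, n_bad_dates)
```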

Hands-on

Next: Module Work

Pull actual cohort definitions and examine the JSON + SQL
Build naive cohort diagnostics with dplyr
See why CohortDiagnostics exists

modules/04_cohorts/cohorts.qmd