Module 03: Git, SQL, and the OMOP CDM

Hands-on Practice with Version Control and Healthcare Data

Agenda

Git Practice (Hands-on)
OMOP CDM Schema Tour (Exploration)
Eunomia Setup & SQL Basics (Hands-on)
SqlRender Across Databases (Demo)
Patient Odyssey (Guided Exploration)
ATLAS Cohort Definitions (Exploration)

1. Git Practice

We’ll create a Git repository using RStudio, practice the workflow with both the terminal and RStudio’s built-in tools.

Create a new RStudio Project

In RStudio: File -> New Project…
Select New Directory
Select New Project
Name it git-practice
Choose a location (e.g., your Desktop)
Click Create Project

RStudio opens your new project. Notice there’s no Git pane yet.

(Note: RStudio has a “Create a git repository” checkbox you can use to skip the next step, but we’ll do it manually to learn the command.)

Initialize Git via terminal

Open the Terminal in RStudio (Tools -> Terminal -> New Terminal).

git init

You should see: Initialized empty Git repository in .../git-practice/.git/

Now look at RStudio. The Git pane should appear (usually top-right). If not, restart RStudio or go to Tools -> Project Options -> Git/SVN and confirm Git is detected.

Copy this module file

Copy this module file (ohdsi-basics.qmd) into your new git-practice folder. You can:

Use your file browser to copy/paste, or
In the RStudio Files pane, click More -> Set As Working Directory, then copy the file

Explore RStudio’s Git pane

Look at the Git pane (usually top-right, next to Environment/History).

You should see ohdsi-basics.qmd listed with a yellow ? icon. This means it’s untracked.

Stage using RStudio

In the Git pane:

Check the box next to ohdsi-basics.qmd
The icon changes to a green A (Added/staged)

That’s it. You just staged a file without touching the terminal.

Commit using RStudio

Click Commit in the Git pane
A new window opens showing your staged changes
Write a commit message: “Add module file”
Click Commit

You’ve made your first commit.

Now let’s use the Terminal

RStudio’s GUI is convenient, but knowing the terminal commands matters. Open the Terminal (Tools -> Terminal -> New Terminal).

# Check status
git status

Git shows a clean working directory. Your commit worked.

# View history
git log --oneline

You’ll see your commit with its message.

Make changes and commit via terminal

Edit the module file: add a comment at the top, or change something small. Save it.

# See what changed
git status
git diff

git diff shows the exact lines that changed.

# Stage the file
git add ohdsi-basics.qmd

# Commit with a message
git commit -m "Make edits to module"

Compare: RStudio vs Terminal

See changes: Git pane shows icons, or git status
Stage: Check the box, or git add <file>
Commit: Click Commit button, or git commit -m "msg"
View diff: Click file in Commit window, or git diff
History: Click History, or git log

Both work. Use whichever feels natural.

Key terminal commands

git status: See current state
git add <file>: Stage changes
git commit -m "msg": Save snapshot
git diff: See unstaged changes
git log: View history

Connecting to GitHub

When you’re ready to backup to GitHub:

Go to GitHub and create a new repository (don’t initialize with README)
Copy the repository URL

In the terminal:

# Connect your local repo to GitHub
git remote add origin https://github.com/YOUR-USERNAME/git-practice.git

# Push your commits
git push -u origin main

Or in RStudio: Tools -> Project Options -> Git/SVN and add the remote URL.

After the first push, you can use:

Terminal: git push and git pull
RStudio: Push/Pull buttons in the Git pane

.gitignore

Not everything belongs in Git. A .gitignore file tells Git which files to skip.

Common things to ignore:

Large data files (*.csv, *.rds, data/)
Credentials and secrets (.env, .Renviron, *.pem)
Generated outputs (*_files/, *.html, _site/)
System junk (.DS_Store, Thumbs.db)

Create a .gitignore file in your project root:

# Data files
data/
*.csv
*.rds

# Secrets
.env
.Renviron

# Generated output
*_files/
*.html

# System
.DS_Store

RStudio projects create a default .gitignore for you. Open it and review what’s already there.

Why this matters: Without .gitignore, you might accidentally commit a 500MB data file or push database credentials to public GitHub.

Critical for biomedical work: Patient data and PHI should never be committed to a Git repository. Once something is committed, it lives in the history forever, even if you delete the file later. Even “private” repositories can be misconfigured or leaked. Always ignore your data/ folder and query data from databases rather than storing local copies.

Fetch vs Pull

git fetch   # Download changes from GitHub (don't merge yet)
git pull    # Download AND merge changes

fetch lets you see what’s new before merging. pull does both at once.

Clean up (optional)

When you’re done practicing, you can delete the git-practice folder from your Desktop.

2. OMOP CDM Schema Tour

The OMOP Common Data Model defines the structure for healthcare data. Let’s explore the official specification.

Browse the CDM Repository

Open in your browser: https://github.com/OHDSI/CommonDataModel

Exploration tasks:

Navigate to inst/ddl/5.4/ - these are the table definitions
Open postgresql/ and look at OMOPCDM_postgresql_5.4_ddl.sql
Find the CREATE TABLE person statement
What columns does the person table have?

CDM Table Relationships

The CDM tables connect through person_id and concept_id.

Core clinical tables (all link to person via person_id):

person: Demographics (birth year, gender, race)
observation_period: When we have data on each person
visit_occurrence: Healthcare encounters
condition_occurrence: Diagnoses
drug_exposure: Medications
procedure_occurrence: Procedures performed
measurement: Labs and vitals

The vocabulary table:

concept: Standardized meanings for all coded values

Clinical events link to visits via visit_occurrence_id, and everything uses concept_id to reference standardized terminology.

Key insight

Everything connects through:

person_id – links clinical events to a patient
concept_id – standardizes meaning across vocabularies

3. Eunomia Setup & SQL Basics

Eunomia is a synthetic OMOP CDM database we can query locally.

Install packages (if needed)

install.packages("DatabaseConnector")
install.packages("Eunomia")
install.packages("SqlRender")

Connect to Eunomia

library(DatabaseConnector)
library(Eunomia)

# Get connection details (downloads data on first run)
connectionDetails <- getEunomiaConnectionDetails()

# Connect
connection <- connect(connectionDetails)

See what’s in the database

# List all tables
getTableNames(connection, databaseSchema = "main")

Query the person table

querySql(connection, "
  SELECT *
  FROM main.person
  LIMIT 10
")

Notice the concept_id columns - these are codes, not human-readable names.

Join with concept to decode

querySql(connection, "
  SELECT
    p.person_id,
    p.year_of_birth,
    g.concept_name AS gender,
    r.concept_name AS race
  FROM main.person p
  LEFT JOIN main.concept g ON p.gender_concept_id = g.concept_id
  LEFT JOIN main.concept r ON p.race_concept_id = r.concept_id
  LIMIT 20
")

Count patients by gender

querySql(connection, "
  SELECT
    g.concept_name AS gender,
    COUNT(*) AS n_patients
  FROM main.person p
  LEFT JOIN main.concept g ON p.gender_concept_id = g.concept_id
  GROUP BY g.concept_name
")

Find the most common conditions

querySql(connection, "
  SELECT
    c.concept_name AS condition,
    COUNT(*) AS n_occurrences,
    COUNT(DISTINCT co.person_id) AS n_patients
  FROM main.condition_occurrence co
  JOIN main.concept c ON co.condition_concept_id = c.concept_id
  GROUP BY c.concept_name
  ORDER BY n_patients DESC
  LIMIT 15
")

Find the most common drugs

querySql(connection, "
  SELECT
    c.concept_name AS drug,
    COUNT(*) AS n_exposures,
    COUNT(DISTINCT de.person_id) AS n_patients
  FROM main.drug_exposure de
  JOIN main.concept c ON de.drug_concept_id = c.concept_id
  GROUP BY c.concept_name
  ORDER BY n_patients DESC
  LIMIT 15
")

4. SqlRender Across Databases

OHDSI needs SQL that works on PostgreSQL, SQL Server, Oracle, and more. SqlRender translates.

Load SqlRender

library(SqlRender)

The problem: SQL dialects differ

# SQL Server syntax for "top N rows"
sql_server_style <- "SELECT TOP 10 * FROM person;"

# PostgreSQL uses LIMIT
postgresql_style <- "SELECT * FROM person LIMIT 10;"

# Oracle uses ROWNUM
oracle_style <- "SELECT * FROM person WHERE ROWNUM <= 10;"

Three ways to say the same thing!

SqlRender translates for you

# Write in OHDSI SQL (SQL Server-ish syntax)
sql <- "SELECT TOP 10 * FROM person;"

# Translate to PostgreSQL
translate(sql, targetDialect = "postgresql")

# Translate to Oracle
translate(sql, targetDialect = "oracle")

# Translate to SQLite (what Eunomia uses)
translate(sql, targetDialect = "sqlite")

Parameterization with @variables

# Template SQL with parameters
sql <- "
SELECT *
FROM @cdm_schema.person
WHERE year_of_birth >= @min_year
  AND year_of_birth <= @max_year
"

# Render with specific values
render(sql,
  cdm_schema = "main",
  min_year = 1950,
  max_year = 1970
)

Combine render + translate

sql <- "
SELECT TOP 100 *
FROM @cdm_schema.condition_occurrence
WHERE condition_concept_id = @concept_id
"

# Render parameters
rendered <- render(sql,
  cdm_schema = "main",
  concept_id = 201826  # Type 2 diabetes
)

# Translate to target dialect
translated <- translate(rendered, targetDialect = "sqlite")

# Execute
querySql(connection, translated)

Why this matters

You write SQL once. SqlRender handles the differences between:

Your local SQLite (Eunomia)
Your institution’s SQL Server
A collaborator’s PostgreSQL
A cloud Redshift cluster

This is how OHDSI runs the same study at 100+ sites.

5. Patient Odyssey

Let’s follow a single patient through the CDM and understand their healthcare journey.

Find a patient with interesting data

# Find patients with multiple conditions and drugs
interesting_patients <- querySql(connection, "
  SELECT
    p.person_id,
    COUNT(DISTINCT co.condition_concept_id) AS n_conditions,
    COUNT(DISTINCT de.drug_concept_id) AS n_drugs,
    COUNT(DISTINCT vo.visit_occurrence_id) AS n_visits
  FROM main.person p
  LEFT JOIN main.condition_occurrence co ON p.person_id = co.person_id
  LEFT JOIN main.drug_exposure de ON p.person_id = de.person_id
  LEFT JOIN main.visit_occurrence vo ON p.person_id = vo.person_id
  GROUP BY p.person_id
  HAVING n_conditions > 5 AND n_drugs > 5
  ORDER BY n_visits DESC
  LIMIT 10
")

interesting_patients

Pick a person_id from this list for our odyssey.

Get patient demographics

# Replace with your chosen person_id
patient_id <- 1  # Use a real ID from above

querySql(connection, sprintf("
  SELECT
    p.person_id,
    p.year_of_birth,
    g.concept_name AS gender,
    r.concept_name AS race
  FROM main.person p
  LEFT JOIN main.concept g ON p.gender_concept_id = g.concept_id
  LEFT JOIN main.concept r ON p.race_concept_id = r.concept_id
  WHERE p.person_id = %d
", patient_id))

Get observation period

When do we have data on this patient?

querySql(connection, sprintf("
  SELECT
    observation_period_start_date,
    observation_period_end_date
  FROM main.observation_period
  WHERE person_id = %d
", patient_id))

Get their conditions

patient_conditions <- querySql(connection, sprintf("
  SELECT
    co.condition_start_date,
    c.concept_name AS condition,
    c.concept_id,
    v.concept_name AS visit_type
  FROM main.condition_occurrence co
  JOIN main.concept c ON co.condition_concept_id = c.concept_id
  LEFT JOIN main.visit_occurrence vo ON co.visit_occurrence_id = vo.visit_occurrence_id
  LEFT JOIN main.concept v ON vo.visit_concept_id = v.concept_id
  WHERE co.person_id = %d
  ORDER BY co.condition_start_date
", patient_id))

patient_conditions

Get their drug exposures

patient_drugs <- querySql(connection, sprintf("
  SELECT
    de.drug_exposure_start_date,
    c.concept_name AS drug,
    c.concept_id,
    de.quantity,
    de.days_supply
  FROM main.drug_exposure de
  JOIN main.concept c ON de.drug_concept_id = c.concept_id
  WHERE de.person_id = %d
  ORDER BY de.drug_exposure_start_date
", patient_id))

patient_drugs

Visualize the patient timeline

library(dplyr)
library(ggplot2)

# Combine conditions and drugs into a timeline
conditions_timeline <- patient_conditions |>
  select(date = CONDITION_START_DATE, event = CONDITION, type = "Condition")

drugs_timeline <- patient_drugs |>
  select(date = DRUG_EXPOSURE_START_DATE, event = DRUG, type = "Drug")

timeline <- bind_rows(conditions_timeline, drugs_timeline) |>
  mutate(date = as.Date(date)) |>
  arrange(date)

# Plot
ggplot(timeline, aes(x = date, y = event, color = type)) +
  geom_point(size = 3) +
  labs(
    title = sprintf("Patient %d Healthcare Timeline", patient_id),
    x = "Date",
    y = "",
    color = "Event Type"
  ) +
  theme_minimal() +
  theme(axis.text.y = element_text(size = 8))

How the data connects

For any patient, the data flows like this:

Start with person (demographics)
Check observation_period to see when we have data
Look at visit_occurrence to find their encounters
From visits, find clinical events: condition_occurrence, drug_exposure, procedure_occurrence, measurement
Join any *_concept_id column to concept to get human-readable names

Questions to explore

Does this patient have conditions that match their drug exposures? (e.g., diabetes + metformin)
What’s the time gap between diagnosis and first treatment?
Are there any emergency visits? What happened?

6. ATLAS Cohort Definitions

ATLAS is OHDSI’s web-based tool for defining cohorts and analyses. Let’s see how cohorts translate to SQL.

Access the ATLAS demo

Open in your browser: https://atlas-demo.ohdsi.org

(Note: This is a public demo instance - no real patient data)

Explore existing cohort definitions

Click Cohort Definitions in the left menu
Browse the existing cohorts
Click on one to see its definition

Anatomy of a cohort definition

A cohort has:

Entry criteria: How does someone enter the cohort?
Inclusion criteria: Additional requirements to stay in
Exit criteria: When does someone leave the cohort?

View the generated SQL

In any cohort definition:

Click the Export tab
Select SQL format
See the generated SQL code

This SQL can run on any OMOP CDM database!

Example: Type 2 Diabetes cohort

A simple T2DM cohort might have:

Entry criteria: - First diagnosis of Type 2 diabetes (concept 201826)

Inclusion criteria: - Age ≥ 18 at diagnosis - At least 365 days of prior observation

Exit criteria: - End of observation period

The generated SQL

ATLAS generates complex SQL that:

Finds all qualifying entry events
Applies inclusion criteria
Handles overlapping periods
Outputs the standard cohort format

-- Simplified example of what ATLAS generates
INSERT INTO cohort (
  cohort_definition_id,
  subject_id,
  cohort_start_date,
  cohort_end_date
)
SELECT
  1 AS cohort_definition_id,
  person_id AS subject_id,
  condition_start_date AS cohort_start_date,
  observation_period_end_date AS cohort_end_date
FROM condition_occurrence co
JOIN observation_period op ON co.person_id = op.person_id
WHERE condition_concept_id = 201826
  AND DATEDIFF(year, birth_datetime, condition_start_date) >= 18
  AND DATEDIFF(day, observation_period_start_date, condition_start_date) >= 365

Why this matters

Reproducibility: The SQL precisely defines who’s in the cohort
Portability: Run the same definition at any OHDSI site
Transparency: Review and audit the exact criteria

Exercise: Build your own

In ATLAS demo:

Click New Cohort Definition
Name it “Practice - [Your Name]”
Add an initial event (try a condition)
Add an inclusion criterion
Export and examine the SQL

Clean up

disconnect(connection)

Summary

Today we covered:

Git: Create repos, stage, commit, push/pull
CDM Schema: Table structure, relationships via person_id and concept_id
Eunomia: Local CDM practice, SQL queries
SqlRender: Dialect translation for multi-site compatibility
Patient Odyssey: Following one patient through connected tables
ATLAS: Visual cohort building to SQL generation

Key takeaways

Git tracks your code history - use it for every project
Remote databases solve scale, security, and collaboration problems
OHDSI’s SQL tools let you write once, run anywhere
The CDM connects through person_id and concept_id
ATLAS translates visual cohort definitions to executable SQL

Resources

Git tutorial: https://git-scm.com/book/en/v2
OMOP CDM docs: https://ohdsi.github.io/CommonDataModel/
SqlRender: https://ohdsi.github.io/SqlRender/
ATLAS demo: https://atlas-demo.ohdsi.org
The Book of OHDSI: https://ohdsi.github.io/TheBookOfOhdsi/

Next session

We’ll dive deeper into cohort definitions and start working with HADES packages for analysis.