install.packages("DatabaseConnector")
install.packages("Eunomia")
install.packages("SqlRender")Module 03: Git, SQL, and the OMOP CDM
Hands-on Practice with Version Control and Healthcare Data
Agenda
- Git Practice (Hands-on)
- OMOP CDM Schema Tour (Exploration)
- Eunomia Setup & SQL Basics (Hands-on)
- SqlRender Across Databases (Demo)
- Patient Odyssey (Guided Exploration)
- ATLAS Cohort Definitions (Exploration)
1. Git Practice
We’ll create a Git repository using RStudio, practice the workflow with both the terminal and RStudio’s built-in tools.
Create a new RStudio Project
- In RStudio: File -> New Project…
- Select New Directory
- Select New Project
- Name it
git-practice - Choose a location (e.g., your Desktop)
- Click Create Project
RStudio opens your new project. Notice there’s no Git pane yet.
(Note: RStudio has a “Create a git repository” checkbox you can use to skip the next step, but we’ll do it manually to learn the command.)
Initialize Git via terminal
Open the Terminal in RStudio (Tools -> Terminal -> New Terminal).
git initYou should see: Initialized empty Git repository in .../git-practice/.git/
Now look at RStudio. The Git pane should appear (usually top-right). If not, restart RStudio or go to Tools -> Project Options -> Git/SVN and confirm Git is detected.
Copy this module file
Copy this module file (ohdsi-basics.qmd) into your new git-practice folder. You can:
- Use your file browser to copy/paste, or
- In the RStudio Files pane, click More -> Set As Working Directory, then copy the file
Explore RStudio’s Git pane
Look at the Git pane (usually top-right, next to Environment/History).
You should see ohdsi-basics.qmd listed with a yellow ? icon. This means it’s untracked.
Stage using RStudio
In the Git pane:
- Check the box next to
ohdsi-basics.qmd - The icon changes to a green A (Added/staged)
That’s it. You just staged a file without touching the terminal.
Commit using RStudio
- Click Commit in the Git pane
- A new window opens showing your staged changes
- Write a commit message: “Add module file”
- Click Commit
You’ve made your first commit.
Now let’s use the Terminal
RStudio’s GUI is convenient, but knowing the terminal commands matters. Open the Terminal (Tools -> Terminal -> New Terminal).
# Check status
git statusGit shows a clean working directory. Your commit worked.
# View history
git log --onelineYou’ll see your commit with its message.
Make changes and commit via terminal
Edit the module file: add a comment at the top, or change something small. Save it.
# See what changed
git status
git diffgit diff shows the exact lines that changed.
# Stage the file
git add ohdsi-basics.qmd
# Commit with a message
git commit -m "Make edits to module"Compare: RStudio vs Terminal
- See changes: Git pane shows icons, or
git status - Stage: Check the box, or
git add <file> - Commit: Click Commit button, or
git commit -m "msg" - View diff: Click file in Commit window, or
git diff - History: Click History, or
git log
Both work. Use whichever feels natural.
Key terminal commands
git status: See current stategit add <file>: Stage changesgit commit -m "msg": Save snapshotgit diff: See unstaged changesgit log: View history
Connecting to GitHub
When you’re ready to backup to GitHub:
- Go to GitHub and create a new repository (don’t initialize with README)
- Copy the repository URL
In the terminal:
# Connect your local repo to GitHub
git remote add origin https://github.com/YOUR-USERNAME/git-practice.git
# Push your commits
git push -u origin mainOr in RStudio: Tools -> Project Options -> Git/SVN and add the remote URL.
After the first push, you can use:
- Terminal:
git pushandgit pull - RStudio: Push/Pull buttons in the Git pane
.gitignore
Not everything belongs in Git. A .gitignore file tells Git which files to skip.
Common things to ignore:
- Large data files (
*.csv,*.rds,data/) - Credentials and secrets (
.env,.Renviron,*.pem) - Generated outputs (
*_files/,*.html,_site/) - System junk (
.DS_Store,Thumbs.db)
Create a .gitignore file in your project root:
# Data files
data/
*.csv
*.rds
# Secrets
.env
.Renviron
# Generated output
*_files/
*.html
# System
.DS_Store
RStudio projects create a default .gitignore for you. Open it and review what’s already there.
Why this matters: Without .gitignore, you might accidentally commit a 500MB data file or push database credentials to public GitHub.
Critical for biomedical work: Patient data and PHI should never be committed to a Git repository. Once something is committed, it lives in the history forever, even if you delete the file later. Even “private” repositories can be misconfigured or leaked. Always ignore your data/ folder and query data from databases rather than storing local copies.
Fetch vs Pull
git fetch # Download changes from GitHub (don't merge yet)
git pull # Download AND merge changesfetch lets you see what’s new before merging. pull does both at once.
Clean up (optional)
When you’re done practicing, you can delete the git-practice folder from your Desktop.
2. OMOP CDM Schema Tour
The OMOP Common Data Model defines the structure for healthcare data. Let’s explore the official specification.
Browse the CDM Repository
Open in your browser: https://github.com/OHDSI/CommonDataModel
Exploration tasks:
- Navigate to
inst/ddl/5.4/- these are the table definitions - Open
postgresql/and look atOMOPCDM_postgresql_5.4_ddl.sql - Find the
CREATE TABLE personstatement - What columns does the
persontable have?
CDM Table Relationships
The CDM tables connect through person_id and concept_id.
Core clinical tables (all link to person via person_id):
person: Demographics (birth year, gender, race)observation_period: When we have data on each personvisit_occurrence: Healthcare encounterscondition_occurrence: Diagnosesdrug_exposure: Medicationsprocedure_occurrence: Procedures performedmeasurement: Labs and vitals
The vocabulary table:
concept: Standardized meanings for all coded values
Clinical events link to visits via visit_occurrence_id, and everything uses concept_id to reference standardized terminology.
Key insight
Everything connects through:
person_id– links clinical events to a patientconcept_id– standardizes meaning across vocabularies
3. Eunomia Setup & SQL Basics
Eunomia is a synthetic OMOP CDM database we can query locally.
Install packages (if needed)
Connect to Eunomia
library(DatabaseConnector)
library(Eunomia)
# Get connection details (downloads data on first run)
connectionDetails <- getEunomiaConnectionDetails()
# Connect
connection <- connect(connectionDetails)See what’s in the database
# List all tables
getTableNames(connection, databaseSchema = "main")Query the person table
querySql(connection, "
SELECT *
FROM main.person
LIMIT 10
")Notice the concept_id columns - these are codes, not human-readable names.
Join with concept to decode
querySql(connection, "
SELECT
p.person_id,
p.year_of_birth,
g.concept_name AS gender,
r.concept_name AS race
FROM main.person p
LEFT JOIN main.concept g ON p.gender_concept_id = g.concept_id
LEFT JOIN main.concept r ON p.race_concept_id = r.concept_id
LIMIT 20
")Count patients by gender
querySql(connection, "
SELECT
g.concept_name AS gender,
COUNT(*) AS n_patients
FROM main.person p
LEFT JOIN main.concept g ON p.gender_concept_id = g.concept_id
GROUP BY g.concept_name
")Find the most common conditions
querySql(connection, "
SELECT
c.concept_name AS condition,
COUNT(*) AS n_occurrences,
COUNT(DISTINCT co.person_id) AS n_patients
FROM main.condition_occurrence co
JOIN main.concept c ON co.condition_concept_id = c.concept_id
GROUP BY c.concept_name
ORDER BY n_patients DESC
LIMIT 15
")Find the most common drugs
querySql(connection, "
SELECT
c.concept_name AS drug,
COUNT(*) AS n_exposures,
COUNT(DISTINCT de.person_id) AS n_patients
FROM main.drug_exposure de
JOIN main.concept c ON de.drug_concept_id = c.concept_id
GROUP BY c.concept_name
ORDER BY n_patients DESC
LIMIT 15
")4. SqlRender Across Databases
OHDSI needs SQL that works on PostgreSQL, SQL Server, Oracle, and more. SqlRender translates.
Load SqlRender
library(SqlRender)The problem: SQL dialects differ
# SQL Server syntax for "top N rows"
sql_server_style <- "SELECT TOP 10 * FROM person;"
# PostgreSQL uses LIMIT
postgresql_style <- "SELECT * FROM person LIMIT 10;"
# Oracle uses ROWNUM
oracle_style <- "SELECT * FROM person WHERE ROWNUM <= 10;"Three ways to say the same thing!
SqlRender translates for you
# Write in OHDSI SQL (SQL Server-ish syntax)
sql <- "SELECT TOP 10 * FROM person;"
# Translate to PostgreSQL
translate(sql, targetDialect = "postgresql")
# Translate to Oracle
translate(sql, targetDialect = "oracle")
# Translate to SQLite (what Eunomia uses)
translate(sql, targetDialect = "sqlite")Parameterization with @variables
# Template SQL with parameters
sql <- "
SELECT *
FROM @cdm_schema.person
WHERE year_of_birth >= @min_year
AND year_of_birth <= @max_year
"
# Render with specific values
render(sql,
cdm_schema = "main",
min_year = 1950,
max_year = 1970
)Combine render + translate
sql <- "
SELECT TOP 100 *
FROM @cdm_schema.condition_occurrence
WHERE condition_concept_id = @concept_id
"
# Render parameters
rendered <- render(sql,
cdm_schema = "main",
concept_id = 201826 # Type 2 diabetes
)
# Translate to target dialect
translated <- translate(rendered, targetDialect = "sqlite")
# Execute
querySql(connection, translated)Why this matters
You write SQL once. SqlRender handles the differences between:
- Your local SQLite (Eunomia)
- Your institution’s SQL Server
- A collaborator’s PostgreSQL
- A cloud Redshift cluster
This is how OHDSI runs the same study at 100+ sites.
5. Patient Odyssey
Let’s follow a single patient through the CDM and understand their healthcare journey.
Find a patient with interesting data
# Find patients with multiple conditions and drugs
interesting_patients <- querySql(connection, "
SELECT
p.person_id,
COUNT(DISTINCT co.condition_concept_id) AS n_conditions,
COUNT(DISTINCT de.drug_concept_id) AS n_drugs,
COUNT(DISTINCT vo.visit_occurrence_id) AS n_visits
FROM main.person p
LEFT JOIN main.condition_occurrence co ON p.person_id = co.person_id
LEFT JOIN main.drug_exposure de ON p.person_id = de.person_id
LEFT JOIN main.visit_occurrence vo ON p.person_id = vo.person_id
GROUP BY p.person_id
HAVING n_conditions > 5 AND n_drugs > 5
ORDER BY n_visits DESC
LIMIT 10
")
interesting_patientsPick a person_id from this list for our odyssey.
Get patient demographics
# Replace with your chosen person_id
patient_id <- 1 # Use a real ID from above
querySql(connection, sprintf("
SELECT
p.person_id,
p.year_of_birth,
g.concept_name AS gender,
r.concept_name AS race
FROM main.person p
LEFT JOIN main.concept g ON p.gender_concept_id = g.concept_id
LEFT JOIN main.concept r ON p.race_concept_id = r.concept_id
WHERE p.person_id = %d
", patient_id))Get observation period
When do we have data on this patient?
querySql(connection, sprintf("
SELECT
observation_period_start_date,
observation_period_end_date
FROM main.observation_period
WHERE person_id = %d
", patient_id))Get their conditions
patient_conditions <- querySql(connection, sprintf("
SELECT
co.condition_start_date,
c.concept_name AS condition,
c.concept_id,
v.concept_name AS visit_type
FROM main.condition_occurrence co
JOIN main.concept c ON co.condition_concept_id = c.concept_id
LEFT JOIN main.visit_occurrence vo ON co.visit_occurrence_id = vo.visit_occurrence_id
LEFT JOIN main.concept v ON vo.visit_concept_id = v.concept_id
WHERE co.person_id = %d
ORDER BY co.condition_start_date
", patient_id))
patient_conditionsGet their drug exposures
patient_drugs <- querySql(connection, sprintf("
SELECT
de.drug_exposure_start_date,
c.concept_name AS drug,
c.concept_id,
de.quantity,
de.days_supply
FROM main.drug_exposure de
JOIN main.concept c ON de.drug_concept_id = c.concept_id
WHERE de.person_id = %d
ORDER BY de.drug_exposure_start_date
", patient_id))
patient_drugsVisualize the patient timeline
library(dplyr)
library(ggplot2)
# Combine conditions and drugs into a timeline
conditions_timeline <- patient_conditions |>
select(date = CONDITION_START_DATE, event = CONDITION, type = "Condition")
drugs_timeline <- patient_drugs |>
select(date = DRUG_EXPOSURE_START_DATE, event = DRUG, type = "Drug")
timeline <- bind_rows(conditions_timeline, drugs_timeline) |>
mutate(date = as.Date(date)) |>
arrange(date)
# Plot
ggplot(timeline, aes(x = date, y = event, color = type)) +
geom_point(size = 3) +
labs(
title = sprintf("Patient %d Healthcare Timeline", patient_id),
x = "Date",
y = "",
color = "Event Type"
) +
theme_minimal() +
theme(axis.text.y = element_text(size = 8))How the data connects
For any patient, the data flows like this:
- Start with
person(demographics) - Check
observation_periodto see when we have data - Look at
visit_occurrenceto find their encounters - From visits, find clinical events:
condition_occurrence,drug_exposure,procedure_occurrence,measurement - Join any
*_concept_idcolumn toconceptto get human-readable names
Questions to explore
- Does this patient have conditions that match their drug exposures? (e.g., diabetes + metformin)
- What’s the time gap between diagnosis and first treatment?
- Are there any emergency visits? What happened?
6. ATLAS Cohort Definitions
ATLAS is OHDSI’s web-based tool for defining cohorts and analyses. Let’s see how cohorts translate to SQL.
Access the ATLAS demo
Open in your browser: https://atlas-demo.ohdsi.org
(Note: This is a public demo instance - no real patient data)
Explore existing cohort definitions
- Click Cohort Definitions in the left menu
- Browse the existing cohorts
- Click on one to see its definition
Anatomy of a cohort definition
A cohort has:
- Entry criteria: How does someone enter the cohort?
- Inclusion criteria: Additional requirements to stay in
- Exit criteria: When does someone leave the cohort?
View the generated SQL
In any cohort definition:
- Click the Export tab
- Select SQL format
- See the generated SQL code
This SQL can run on any OMOP CDM database!
Example: Type 2 Diabetes cohort
A simple T2DM cohort might have:
Entry criteria: - First diagnosis of Type 2 diabetes (concept 201826)
Inclusion criteria: - Age ≥ 18 at diagnosis - At least 365 days of prior observation
Exit criteria: - End of observation period
The generated SQL
ATLAS generates complex SQL that:
- Finds all qualifying entry events
- Applies inclusion criteria
- Handles overlapping periods
- Outputs the standard cohort format
-- Simplified example of what ATLAS generates
INSERT INTO cohort (
cohort_definition_id,
subject_id,
cohort_start_date,
cohort_end_date
)
SELECT
1 AS cohort_definition_id,
person_id AS subject_id,
condition_start_date AS cohort_start_date,
observation_period_end_date AS cohort_end_date
FROM condition_occurrence co
JOIN observation_period op ON co.person_id = op.person_id
WHERE condition_concept_id = 201826
AND DATEDIFF(year, birth_datetime, condition_start_date) >= 18
AND DATEDIFF(day, observation_period_start_date, condition_start_date) >= 365Why this matters
- Reproducibility: The SQL precisely defines who’s in the cohort
- Portability: Run the same definition at any OHDSI site
- Transparency: Review and audit the exact criteria
Exercise: Build your own
In ATLAS demo:
- Click New Cohort Definition
- Name it “Practice - [Your Name]”
- Add an initial event (try a condition)
- Add an inclusion criterion
- Export and examine the SQL
Clean up
disconnect(connection)Summary
Today we covered:
- Git: Create repos, stage, commit, push/pull
- CDM Schema: Table structure, relationships via person_id and concept_id
- Eunomia: Local CDM practice, SQL queries
- SqlRender: Dialect translation for multi-site compatibility
- Patient Odyssey: Following one patient through connected tables
- ATLAS: Visual cohort building to SQL generation
Key takeaways
- Git tracks your code history - use it for every project
- Remote databases solve scale, security, and collaboration problems
- OHDSI’s SQL tools let you write once, run anywhere
- The CDM connects through
person_idandconcept_id - ATLAS translates visual cohort definitions to executable SQL
Resources
- Git tutorial: https://git-scm.com/book/en/v2
- OMOP CDM docs: https://ohdsi.github.io/CommonDataModel/
- SqlRender: https://ohdsi.github.io/SqlRender/
- ATLAS demo: https://atlas-demo.ohdsi.org
- The Book of OHDSI: https://ohdsi.github.io/TheBookOfOhdsi/
Next session
We’ll dive deeper into cohort definitions and start working with HADES packages for analysis.