Git, Databases, and OHDSI’s SQL Strategy

Session 3: Working with Version Control and Data Infrastructure

Today’s Agenda

  1. Git & GitHub: Version control fundamentals
  2. Remote Databases: Why they matter in informatics
  3. OHDSI’s SQL Approach: DatabaseConnector, SqlRender, Andromeda

Then: Hands-on practice in the module.

Part 1: Git & GitHub

What is Git?

Version control tracks changes to files over time.

  • Take “snapshots” of your project (commits)
  • Go back to any previous state
  • See who changed what and when
  • Work on features without breaking main code

Git is local software on your machine.

What is GitHub?

GitHub is basically cloud storage for Git repositories.

  • Backup your work remotely
  • Share code with collaborators
  • Host websites, documentation

GitHub is a website that hosts Git repositories; Git itself runs locally and works fine without it.

Why Version Control Matters

Without Git:

analysis_v1.R
analysis_v2.R
analysis_v2_final.R
analysis_v2_final_REALLY_FINAL.R
analysis_v2_final_REALLY_FINAL_fixed.R

With Git:

analysis.R  (with full history)

A Pragmatic View

For most researchers, Git is essentially:

Cloud backup with time travel.

  • Your code is safe on GitHub
  • You can recover any previous version
  • You have a record of what changed and when

Git allows for fancy branching and merging workflows, but you probably won’t need them for scientific work.

The Basic Git Workflow

Working Directory → Staging Area → Repository
     (edit)           (add)         (commit)
  1. Edit files in your working directory
  2. Stage changes you want to save (git add)
  3. Commit staged changes to history (git commit)

Key Commands: Local

Command              What it does
git init             Create a new repo
git status           See what’s changed
git add <file>       Stage a file
git commit -m "msg"  Save staged changes
git log              View commit history
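The cycle above can be walked through end to end in a throwaway repository (the directory and file names here are just examples):

```shell
# A disposable repository to practice the edit → add → commit cycle
cd "$(mktemp -d)"
git init demo && cd demo
git config user.name "Me"
git config user.email "me@example.com"

echo 'x <- 1' > analysis.R          # 1. edit a file in the working directory
git status                          #    shows analysis.R as untracked
git add analysis.R                  # 2. stage the change
git commit -m "Add analysis script" # 3. commit it to history
git log --oneline                   # one commit so far
```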

Key Commands: Remote

Command                      What it does
git remote add origin <url>  Link to GitHub
git push                     Upload commits to GitHub
git pull                     Download changes from GitHub
git fetch                    Check for remote changes
git clone <url>              Copy a repo from GitHub
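A remote doesn’t have to be GitHub; any URL or path works. This sketch uses a local bare repository to stand in for GitHub so it runs offline (with GitHub you would use an https URL instead of a path):

```shell
# A bare repository plays the role of GitHub here
cd "$(mktemp -d)"
git init --bare upstream.git
git clone upstream.git mycopy       # copy the repo
cd mycopy
git config user.name "Me"
git config user.email "me@example.com"

echo 'results' > notes.txt
git add notes.txt
git commit -m "Add notes"
git push origin HEAD                # upload commits to the "remote"
git fetch                           # check for remote changes (none yet)
```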

The Commit Message

Good commit messages:

  • “Add data cleaning function for labs”
  • “Fix bug in age calculation”
  • “Update analysis with new cohort criteria”

Bad commit messages:

  • “stuff”
  • “fixed it”
  • “asdfasdf”

Branching (Preview)

OHDSI packages typically use:

main:      A───────────F───G
            \         /
develop:    B───C───D───E
                 \
feature:          X───Y
  • main: stable releases (what you install)
  • develop: work in progress headed for the next release
  • feature: specific new functionality

Check develop to see what’s coming next.
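The branch-and-merge cycle sketched in the diagram looks like this in practice (a minimal sketch; branch and file names follow the convention above, and `git init -b` assumes Git ≥ 2.28):

```shell
# Create a main branch, do work on develop, merge back
cd "$(mktemp -d)"
git init -b main demo && cd demo
git config user.name "Me"
git config user.email "me@example.com"

git commit --allow-empty -m "Release"   # a starting point on main
git switch -c develop                   # branch off for new work
echo 'new_feature <- TRUE' > feature.R
git add feature.R
git commit -m "Add feature"
git switch main                         # back to the stable branch
git merge develop                       # fold finished work into main
```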

.gitignore

Not everything should be tracked. Create a .gitignore file to exclude:

  • Large data files (*.csv, *.rds)
  • Sensitive credentials (.env, *.pem)
  • Generated outputs (*_files/, *.html)
  • System files (.DS_Store, Thumbs.db)
# Example .gitignore
data/
*.csv
.env
.Renviron
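A quick way to see the effect: files matching `.gitignore` patterns never appear as untracked (file names here are illustrative):

```shell
# Ignored files disappear from git status entirely
cd "$(mktemp -d)"
git init demo && cd demo
printf 'data/\n*.csv\n.env\n' > .gitignore
mkdir data && echo 'id,dob' > data/patients.csv
echo 'secret' > .env
echo 'x <- 1' > analysis.R
git status --porcelain    # lists .gitignore and analysis.R only
```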

Why .gitignore Matters

Without it:

  • Accidentally commit 500MB data files
  • Push credentials to public GitHub
  • Clutter history with generated files

RStudio projects include a sensible .gitignore by default. Review it.

.gitignore and PHI

In biomedical work, this is not optional.

  • Patient data should never be in a Git repository
  • Once committed, data lives in history forever
  • Even “private” repos can leak

Always ignore your data/ folder. Query data from databases instead of storing local copies.
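The “history is forever” point is easy to demonstrate: deleting a committed file removes it from the working tree, but any old commit can still produce it (synthetic file contents here, obviously):

```shell
# Deleting a tracked file does NOT delete it from history
cd "$(mktemp -d)"
git init demo && cd demo
git config user.name "Me"
git config user.email "me@example.com"

echo 'id,dob' > patients.csv            # pretend this is PHI
git add patients.csv
git commit -m "oops"
git rm patients.csv                     # "remove" the data
git commit -m "remove data"
test -f patients.csv || echo "gone from working tree"
git show HEAD~1:patients.csv            # ...but fully recoverable
```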

Part 2: Remote Databases

Why Not Just Use Files?

Typical stats training/work uses CSVs and Excel files, so why do we need databases?

Scale Problem

A single CSV with 100 million rows:

  • Takes forever to load into R
  • Uses all your RAM
  • R crashes

A database with 100 million rows:

  • Loads only what you query
  • Processes on the server
  • Your laptop stays happy

Multi-User Problem

CSV files:

  • One person edits at a time
  • “Who has the latest version?”
  • Overwrite conflicts

Databases:

  • Many users simultaneously
  • Single source of truth
  • Transaction safety (ACID)

ACID Properties

Databases guarantee safe transactions:

Property     Meaning
Atomicity    All or nothing (no partial writes)
Consistency  Data stays valid after each transaction
Isolation    Concurrent users don’t interfere
Durability   Committed data survives crashes

CSVs have none of these.

Security Problem

CSV on a shared drive:

  • Anyone can copy the whole thing
  • No audit trail
  • PHI everywhere

Database:

  • Fine-grained access control
  • Query logging
  • Data stays on server

Real World: Healthcare Data

  • Claims data: billions of records
  • EHR data: tens of millions of patients
  • Genomics data: petabytes

This doesn’t fit on your laptop.

The Pattern

Your laptop          Server
┌──────────┐        ┌──────────┐
│ R Code   │ ─SQL─→ │ Database │
│          │ ←rows─ │  Server  │
└──────────┘        └──────────┘

Write query → Send to server → Get results back

Common Database Systems

Database            Common Use
PostgreSQL          General purpose, OHDSI standard
SQL Server          Common with “enterprise” systems
Oracle              Enterprise, legacy systems
SQLite              Local files, prototyping
Redshift/Snowflake  Cloud analytics
Spark               Distributed processing (Databricks)

They all speak SQL (with dialects).

Databricks and Spark

Databricks = managed cloud platform built on Apache Spark.

  • Processes massive datasets across many machines
  • Uses “Spark SQL” dialect
  • Increasingly common for health system data; JH is transitioning to it.

DatabaseConnector supports Spark, so the same OHDSI code works here too.

Part 3: OHDSI’s SQL Strategy

The OHDSI Problem

OHDSI runs the same analysis at:

  • Hospital A: PostgreSQL
  • Hospital B: SQL Server
  • Hospital C: Oracle
  • Hospital D: Redshift

How do you write code that works everywhere?

Solution 1: DatabaseConnector

Unified database interface for R.

library(DatabaseConnector)

# Works with any supported database
connectionDetails <- createConnectionDetails(
  dbms = "postgresql",  # or "sql server", "oracle", etc.
  server = "hostname",
  user = "username",
  password = "password"
)

connection <- connect(connectionDetails)

DatabaseConnector in Action

# Same R code, any database
# (the SQL string itself must still match the target
#  dialect; dialect translation is SqlRender's job)
result <- querySql(connection, "
  SELECT person_id, year_of_birth
  FROM person
  LIMIT 100
")

# Clean up
disconnect(connection)

One interface, many backends.

Solution 2: SqlRender

SQL dialect translation.

library(SqlRender)

# Write SQL once
# Write SQL once, in OHDSI's base dialect (SQL Server style)
sql <- "SELECT TOP 100 * FROM person;"

# Translate to different dialects
translate(sql, targetDialect = "postgresql")
# → "SELECT * FROM person LIMIT 100;"

translate(sql, targetDialect = "oracle")
# → "SELECT * FROM person WHERE ROWNUM <= 100;"

SqlRender: Parameterization

sql <- "
SELECT * FROM @cdm_schema.person
WHERE year_of_birth > @min_year
"

render(sql,
  cdm_schema = "cdm_v5",
  min_year = 1980)

# → SELECT * FROM cdm_v5.person
#   WHERE year_of_birth > 1980

Write flexible SQL, fill in values at runtime.

The Reality of SQL Translation

SqlRender is useful, but not magic.

  • Some dialect features don’t have equivalents
  • Edge cases and complex queries can fail
  • Date/time functions are particularly tricky
  • Performance characteristics differ across platforms

In practice: Test your queries on the actual target database. Don’t assume translation “just works.”

Solution 3: Andromeda

Local analytics on big data.

Problem: Can’t load 100M rows into R.

Solution: Andromeda wraps SQLite for out-of-memory processing.

library(Andromeda)
library(dplyr)  # Andromeda tables support dplyr verbs

# Create a local on-disk database
andr <- andromeda()

# Add big data (stays on disk, not in RAM)
andr$patients <- huge_dataframe

# Query it with dplyr
andr$patients |> filter(age > 65) |> collect()

close(andr)  # remove the temporary file when done

The OHDSI Stack

┌─────────────────────────────────┐
│     Your Analysis Code (R)      │
├─────────────────────────────────┤
│  SqlRender (dialect translation)│
├─────────────────────────────────┤
│ DatabaseConnector (connections) │
├─────────────────────────────────┤
│ PostgreSQL │ SQL Server │ Oracle│
└─────────────────────────────────┘

Why This Matters

Write once, run anywhere:

  • Your code works at your site
  • Same code works at collaborator’s site
  • Network studies become possible

This is how OHDSI studies run at multiple sites.

Eunomia: Test Data

Eunomia = sample OMOP CDM in SQLite

  • Synthetic patient data
  • Full CDM structure
  • Runs locally, no server needed

Useful for testing and learning before touching real data.

Summary

Tool               Purpose
Git/GitHub         Track code changes, collaborate
Remote databases   Handle scale, security, multi-user
DatabaseConnector  Unified R database interface
SqlRender          SQL dialect translation
Andromeda          Local big-data processing
Eunomia            Safe practice environment

Next

Hands-on:

  1. Create and use a Git repository
  2. Connect to Eunomia
  3. Query CDM tables with SQL
  4. Follow a patient through the data
  5. Explore ATLAS cohort definitions

modules/03_ohdsi-basics/ohdsi-basics.qmd