Git, Databases, and OHDSI’s SQL Strategy

Session 3: Working with Version Control and Data Infrastructure

Today’s Agenda

  1. Git & GitHub: Version control fundamentals
  2. Remote Databases: Why they matter in informatics
  3. OHDSI’s SQL Approach: DatabaseConnector, SqlRender, Andromeda

Then: Hands-on practice in the module.

Part 1: Git & GitHub

What is Git?

Version control tracks changes to files over time.

  • Take “snapshots” of your project (commits)
  • Go back to any previous state
  • See who changed what and when
  • Work on features without breaking main code

Git is local software on your machine.

What is GitHub?

GitHub is basically cloud storage for Git repositories.

  • Backup your work remotely
  • Share code with collaborators
  • Host websites, documentation

GitHub is a website that hosts Git repositories; Git itself runs locally and works fine without it.

Why Version Control Matters

Without Git:

analysis_v1.R
analysis_v2.R
analysis_v2_final.R
analysis_v2_final_REALLY_FINAL.R
analysis_v2_final_REALLY_FINAL_fixed.R

With Git:

analysis.R  (with full history)

A Pragmatic View

For most researchers, Git is essentially:

Cloud backup with time travel.

  • Your code is safe on GitHub
  • You can recover any previous version
  • You have a record of what changed and when

Git allows for fancy branching and merging workflows, but you probably won’t need them for scientific work.

The Basic Git Workflow

Working Directory → Staging Area → Repository
     (edit)           (add)         (commit)
  1. Edit files in your working directory
  2. Stage changes you want to save (git add)
  3. Commit staged changes to history (git commit)

Key Commands: Local

Command              What it does
git init             Create a new repo
git status           See what’s changed
git add <file>       Stage a file
git commit -m "msg"  Save staged changes
git log              View commit history
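The cycle above can be walked through end to end in a throwaway repository (the directory and file names here are just examples):

```shell
# A disposable repository to practice the edit → add → commit cycle
cd "$(mktemp -d)"
git init demo && cd demo
git config user.name "Me"
git config user.email "me@example.com"

echo 'x <- 1' > analysis.R          # 1. edit a file in the working directory
git status                          #    shows analysis.R as untracked
git add analysis.R                  # 2. stage the change
git commit -m "Add analysis script" # 3. commit it to history
git log --oneline                   # one commit so far
```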

Key Commands: Remote

Command                      What it does
git remote add origin <url>  Link to GitHub
git push                     Upload commits to GitHub
git pull                     Download changes from GitHub
git fetch                    Check for remote changes
git clone <url>              Copy a repo from GitHub
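A remote doesn’t have to be GitHub; any URL or path works. This sketch uses a local bare repository to stand in for GitHub so it runs offline (with GitHub you would use an https URL instead of a path):

```shell
# A bare repository plays the role of GitHub here
cd "$(mktemp -d)"
git init --bare upstream.git
git clone upstream.git mycopy       # copy the repo
cd mycopy
git config user.name "Me"
git config user.email "me@example.com"

echo 'results' > notes.txt
git add notes.txt
git commit -m "Add notes"
git push origin HEAD                # upload commits to the "remote"
git fetch                           # check for remote changes (none yet)
```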

The Commit Message

Good commit messages:

  • “Add data cleaning function for labs”
  • “Fix bug in age calculation”
  • “Update analysis with new cohort criteria”

Bad commit messages:

  • “stuff”
  • “fixed it”
  • “asdfasdf”

Branching (Preview)

OHDSI packages typically use:

main:      A───────────F───G
            \         /
develop:    B───C───D───E
                 \
feature:          X───Y
  • main: stable releases (what you install)
  • develop: work in progress headed for the next release
  • feature: specific new functionality

Check develop to see what’s coming next.
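The branch-and-merge cycle sketched in the diagram looks like this in practice (a minimal sketch; branch and file names follow the convention above, and `git init -b` assumes Git ≥ 2.28):

```shell
# Create a main branch, do work on develop, merge back
cd "$(mktemp -d)"
git init -b main demo && cd demo
git config user.name "Me"
git config user.email "me@example.com"

git commit --allow-empty -m "Release"   # a starting point on main
git switch -c develop                   # branch off for new work
echo 'new_feature <- TRUE' > feature.R
git add feature.R
git commit -m "Add feature"
git switch main                         # back to the stable branch
git merge develop                       # fold finished work into main
```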

.gitignore

Not everything should be tracked. Create a .gitignore file to exclude:

  • Large data files (*.csv, *.rds)
  • Sensitive credentials (.env, *.pem)
  • Generated outputs (*_files/, *.html)
  • System files (.DS_Store, Thumbs.db)
# Example .gitignore
data/
*.csv
.env
.Renviron
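A quick way to see the effect: files matching `.gitignore` patterns never appear as untracked (file names here are illustrative):

```shell
# Ignored files disappear from git status entirely
cd "$(mktemp -d)"
git init demo && cd demo
printf 'data/\n*.csv\n.env\n' > .gitignore
mkdir data && echo 'id,dob' > data/patients.csv
echo 'secret' > .env
echo 'x <- 1' > analysis.R
git status --porcelain    # lists .gitignore and analysis.R only
```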

Why .gitignore Matters

Without it:

  • Accidentally commit 500MB data files
  • Push credentials to public GitHub
  • Clutter history with generated files

RStudio projects include a sensible .gitignore by default. Review it.

.gitignore and PHI

In biomedical work, this is not optional.

  • Patient data should never be in a Git repository
  • Once committed, data lives in history forever
  • Even “private” repos can leak

Always ignore your data/ folder. Query data from databases instead of storing local copies.
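The “history is forever” point is easy to demonstrate: deleting a committed file removes it from the working tree, but any old commit can still produce it (synthetic file contents here, obviously):

```shell
# Deleting a tracked file does NOT delete it from history
cd "$(mktemp -d)"
git init demo && cd demo
git config user.name "Me"
git config user.email "me@example.com"

echo 'id,dob' > patients.csv            # pretend this is PHI
git add patients.csv
git commit -m "oops"
git rm patients.csv                     # "remove" the data
git commit -m "remove data"
test -f patients.csv || echo "gone from working tree"
git show HEAD~1:patients.csv            # ...but fully recoverable
```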

Part 2: Remote Databases

Why Not Just Use Files?

Typical stats training/work uses CSVs and Excel files, so why do we need databases?

Scale Problem

A single CSV with 100 million rows:

  • Takes forever to load into R
  • Uses all your RAM
  • R crashes

A database with 100 million rows:

  • Loads only what you query
  • Processes on the server
  • Your laptop stays happy

Multi-User Problem

CSV files:

  • One person edits at a time
  • “Who has the latest version?”
  • Overwrite conflicts

Databases:

  • Many users simultaneously
  • Single source of truth
  • Transaction safety (ACID)

ACID Properties

Databases guarantee safe transactions:

Property     Meaning
Atomicity    All or nothing (no partial writes)
Consistency  Data stays valid after each transaction
Isolation    Concurrent users don’t interfere
Durability   Committed data survives crashes

CSVs have none of these.

Security Problem

CSV on a shared drive:

  • Anyone can copy the whole thing
  • No audit trail
  • PHI everywhere

Database:

  • Fine-grained access control
  • Query logging
  • Data stays on server

Real World: Healthcare Data

  • Claims data: billions of records
  • EHR data: tens of millions of patients
  • Genomics data: petabytes

This doesn’t fit on your laptop.

The Pattern

Your laptop          Server
┌──────────┐        ┌──────────┐
│ R Code   │ ─SQL─→ │ Database │
│          │ ←rows─ │  Server  │
└──────────┘        └──────────┘

Write query → Send to server → Get results back

Common Database Systems

Database            Common Use
PostgreSQL          General purpose, OHDSI standard
SQL Server          Common with “enterprise” systems
Oracle              Enterprise, legacy systems
SQLite              Local files, prototyping
Redshift/Snowflake  Cloud analytics
Spark               Distributed processing (Databricks)

They all speak SQL (with dialects).

Databricks and Spark

Databricks = managed cloud platform built on Apache Spark.

  • Processes massive datasets across many machines
  • Uses “Spark SQL” dialect
  • Increasingly common for health system data; JH is transitioning to it.

DatabaseConnector supports Spark, so the same OHDSI code works here too.

Part 3: OHDSI’s SQL Strategy

The OHDSI Problem

OHDSI runs the same analysis at:

  • Hospital A: PostgreSQL
  • Hospital B: SQL Server
  • Hospital C: Oracle
  • Hospital D: Redshift

How do you write code that works everywhere?

Solution 1: DatabaseConnector

Unified database interface for R.

library(DatabaseConnector)

# Works with any supported database
connectionDetails <- createConnectionDetails(
  dbms = "postgresql",  # or "sql server", "oracle", etc.
  server = "hostname",
  user = "username",
  password = "password"
)

connection <- connect(connectionDetails)

DatabaseConnector in Action

# Same R code, any database
# (the SQL string itself must still match the target
#  dialect; dialect translation is SqlRender's job)
result <- querySql(connection, "
  SELECT person_id, year_of_birth
  FROM person
  LIMIT 100
")

# Clean up
disconnect(connection)

One interface, many backends.

Solution 2: SqlRender

SQL dialect translation.

library(SqlRender)

# Write SQL once
# Write SQL once, in OHDSI's base dialect (SQL Server style)
sql <- "SELECT TOP 100 * FROM person;"

# Translate to different dialects
translate(sql, targetDialect = "postgresql")
# → "SELECT * FROM person LIMIT 100;"

translate(sql, targetDialect = "oracle")
# → "SELECT * FROM person WHERE ROWNUM <= 100;"

SqlRender: Parameterization

sql <- "
SELECT * FROM @cdm_schema.person
WHERE year_of_birth > @min_year
"

render(sql,
  cdm_schema = "cdm_v5",
  min_year = 1980)

# → SELECT * FROM cdm_v5.person
#   WHERE year_of_birth > 1980

Write flexible SQL, fill in values at runtime.

The Reality of SQL Translation

SqlRender is useful, but not magic.

  • Some dialect features don’t have equivalents
  • Edge cases and complex queries can fail
  • Date/time functions are particularly tricky
  • Performance characteristics differ across platforms

In practice: Test your queries on the actual target database. Don’t assume translation “just works.”

Solution 3: Andromeda

Local analytics on big data.

Problem: Can’t load 100M rows into R.

Solution: Andromeda wraps SQLite for out-of-memory processing.

library(Andromeda)
library(dplyr)  # Andromeda tables support dplyr verbs

# Create a local on-disk database
andr <- andromeda()

# Add big data (stays on disk, not in RAM)
andr$patients <- huge_dataframe

# Query it with dplyr
andr$patients |> filter(age > 65) |> collect()

close(andr)  # remove the temporary file when done

The OHDSI Stack

┌─────────────────────────────────┐
│     Your Analysis Code (R)      │
├─────────────────────────────────┤
│  SqlRender (dialect translation)│
├─────────────────────────────────┤
│ DatabaseConnector (connections) │
├─────────────────────────────────┤
│ PostgreSQL │ SQL Server │ Oracle│
└─────────────────────────────────┘

Why This Matters

Write once, run anywhere:

  • Your code works at your site
  • Same code works at collaborator’s site
  • Network studies become possible

This is how OHDSI studies run at multiple sites.

Eunomia: Test Data

Eunomia = sample OMOP CDM in SQLite

  • Synthetic patient data
  • Full CDM structure
  • Runs locally, no server needed

Useful for testing and learning before touching real data.

Summary

Tool               Purpose
Git/GitHub         Track code changes, collaborate
Remote databases   Handle scale, security, multi-user
DatabaseConnector  Unified R database interface
SqlRender          SQL dialect translation
Andromeda          Local big-data processing
Eunomia            Safe practice environment

Next

Hands-on:

  1. Create and use a Git repository
  2. Connect to Eunomia
  3. Query CDM tables with SQL
  4. Follow a patient through the data
  5. Explore ATLAS cohort definitions

modules/03_ohdsi-basics/ohdsi-basics.qmd