Session 3: Working with Version Control and Data Infrastructure
Then: Hands-on practice in the module.
Version control tracks changes to files over time.
Git is local software on your machine.
GitHub is a website that stores Git repositories in the cloud.
Without Git:
analysis_v1.R
analysis_v2.R
analysis_v2_final.R
analysis_v2_final_REALLY_FINAL.R
analysis_v2_final_REALLY_FINAL_fixed.R
With Git:
analysis.R (with full history)
For most researchers, Git is essentially:
Cloud backup with time travel.
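A minimal sketch of the "time travel" part, run in a throwaway directory. The file contents, commit messages, and identity are placeholders chosen for illustration, not part of the course material.

```shell
# Throwaway repository so nothing real is touched.
repo="$(mktemp -d)"
cd "$repo"
git init -q
git config user.name  "Example Researcher"      # placeholder identity (assumption)
git config user.email "researcher@example.org"

# First version of the script.
echo 'model <- lm(y ~ x)' > analysis.R
git add analysis.R
git commit -q -m "Fit linear model"

# Second version overwrites the first on disk; Git keeps both.
echo 'model <- lm(y ~ x + age)' > analysis.R
git commit -q -am "Adjust for age"

git log --oneline -- analysis.R       # history of this one file
git diff HEAD~1 HEAD -- analysis.R    # what changed between the two versions
git show HEAD~1:analysis.R            # the file as it was one commit ago
```

There is still only one analysis.R on disk; the earlier version is read back from history rather than from a renamed copy.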
Git allows for fancy branching and merging workflows, but you probably won’t need them for scientific work.
Working Directory → Staging Area → Repository
(edit) (add) (commit)
Stage changes with `git add`, then save them with `git commit`.

| Command | What it does |
|---|---|
| `git init` | Create a new repo |
| `git status` | See what’s changed |
| `git add <file>` | Stage a file |
| `git commit -m "msg"` | Save staged changes |
| `git log` | View commit history |
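The commands in the table can be strung together into one edit, add, commit cycle. This is a sketch in a temporary directory; the file name, its contents, and the identity are assumptions made for the example.

```shell
# Throwaway repository so nothing real is touched.
repo="$(mktemp -d)"
cd "$repo"

git init -q                                   # create a new repo
git config user.name  "Example Researcher"    # placeholder identity (assumption)
git config user.email "researcher@example.org"

echo 'x <- 1' > analysis.R                    # edit: change the working directory
git status --short                            # prints "?? analysis.R" (untracked)

git add analysis.R                            # add: move the change to the staging area
git diff --cached --stat                      # what is staged for the next commit
git commit -q -m "Add analysis script"        # commit: save staged changes to the repository

git status --short                            # prints nothing: working directory is clean
git log --oneline                             # one commit in the history
```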
| Command | What it does |
|---|---|
| `git remote add origin <url>` | Link to GitHub |
| `git push` | Upload commits to GitHub |
| `git pull` | Download changes from GitHub |
| `git fetch` | Check for remote changes |
| `git clone <url>` | Copy a repo from GitHub |
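The remote commands can be tried without a GitHub account. In the sketch below a local bare repository stands in for GitHub, which is an assumption made so the example runs offline; with a real project the `<url>` would be the GitHub address. The collaborator names are placeholders.

```shell
work="$(mktemp -d)"
git init -q --bare "$work/remote.git"       # stand-in for a GitHub repository (assumption)

# First collaborator: create a repo, link it to the remote, and push.
git init -q "$work/alice"
cd "$work/alice"
git config user.name  "Alice Example"       # placeholder identity
git config user.email "alice@example.org"
echo 'x <- 1' > analysis.R
git add analysis.R
git commit -q -m "Add analysis script"
git remote add origin "$work/remote.git"    # link to the remote
git push -q -u origin HEAD                  # upload commits; -u remembers the upstream

# Second collaborator: copy the repo.
git clone -q "$work/remote.git" "$work/bob"

# First collaborator pushes another change.
echo 'y <- 2' >> analysis.R
git commit -q -am "Define y"
git push -q

# Second collaborator checks for and then downloads the change.
cd "$work/bob"
git fetch -q                                # get remote commits, leave local files alone
git status -sb                              # reports the branch as behind origin by 1
git pull -q --ff-only                       # bring the local branch up to date
git log --oneline
```

`--ff-only` is a cautious choice here: the pull succeeds only if no merge is needed, which is the common case when one person pushes and another only reads.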
Good commit messages:
Bad commit messages:
OHDSI packages typically use:
main:     A───────────F───G
           \         /
develop:    B───C───D───E
                 \
feature:          X───Y
main: stable releases (what you install)
develop: what’s in the pipeline/what people are working on
feature: specific new functionality
Check develop to see what’s coming next.
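A rough sketch of this branch layout in a scratch repository. The branch name `feature/new-plot`, the file names, and the commit messages are hypothetical; real OHDSI packages may organise their branches differently in detail.

```shell
repo="$(mktemp -d)"
cd "$repo"
git init -q
git config user.name  "Example Developer"   # placeholder identity (assumption)
git config user.email "dev@example.org"

echo 'v1' > package.txt
git add package.txt
git commit -q -m "A: first release"
git branch -M main                          # make sure the stable branch is named main

git checkout -q -b develop                  # develop starts from main
echo 'work in progress' >> package.txt
git commit -q -am "B: start next version"

git checkout -q -b feature/new-plot         # hypothetical feature branch off develop
echo 'plot' > plot.txt
git add plot.txt
git commit -q -m "X: add plotting helper"

git log --oneline main..develop             # what is coming next: on develop, not yet on main

git checkout -q main
git merge -q --no-ff -m "F: release from develop" develop   # develop is released into main

git log --oneline --graph --all             # draw the branch structure
```

`git log main..develop` is the command-line version of checking develop to see what is coming next: it lists commits that exist on develop but not on main.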
Not everything should be tracked. Create a .gitignore file to exclude:
Data files (`*.csv`, `*.rds`)
Credentials and keys (`.env`, `*.pem`)
Rendered output (`*_files/`, `*.html`)
OS clutter (`.DS_Store`, `Thumbs.db`)
# Example .gitignore
data/
*.csv
.env
.Renviron
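To check that these patterns behave as intended, one option is to try them in a scratch repository. The file names and contents below are invented for the demonstration.

```shell
repo="$(mktemp -d)"
cd "$repo"
git init -q

# The four example patterns, written to .gitignore.
printf '%s\n' 'data/' '*.csv' '.env' '.Renviron' > .gitignore

# Hypothetical project files: some sensitive, some not.
mkdir data
echo 'id,value' > data/patients.csv
echo 'id,value' > results.csv
echo 'DB_PASSWORD=secret' > .env
echo 'x <- 1' > analysis.R

# Only .gitignore and analysis.R show up as candidates for tracking.
git status --short --untracked-files=all

# Ask Git which pattern is responsible for ignoring each path.
git check-ignore -v results.csv .env data/patients.csv
```

`git check-ignore` is also handy in the other direction: if a file you expected to be ignored shows up in `git status`, it tells you that no pattern matches it.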
Without it:
RStudio projects include a sensible .gitignore by default. Review it.
In biomedical work, this is not optional.
Always ignore your data/ folder. Query data from databases instead of storing local copies.
Typical stats training/work uses CSVs and Excel files, so why do we need databases?
A single CSV with 100 million rows:
A database with 100 million rows:
CSV files:
Databases:
Databases guarantee safe transactions:
| Property | Meaning |
|---|---|
| Atomicity | All or nothing (no partial writes) |
| Consistency | Data stays valid after each transaction |
| Isolation | Concurrent users don’t interfere |
| Durability | Committed data survives crashes |
CSVs have none of these.
CSV on a shared drive:
Database:
This doesn’t fit on your laptop.
Your laptop              Server
┌──────────┐          ┌──────────┐
│  R Code  │  ─SQL─→  │ Database │
│          │  ←rows─  │  Server  │
└──────────┘          └──────────┘
Write query → Send to server → Get results back
| Database | Common Use |
|---|---|
| PostgreSQL | General purpose, OHDSI standard |
| SQL Server | Common with “enterprise” systems |
| Oracle | Enterprise, legacy systems |
| SQLite | Local files, prototyping |
| Redshift/Snowflake | Cloud analytics |
| Spark | Distributed processing (Databricks) |
They all speak SQL (with dialects).
Databricks = managed cloud platform built on Apache Spark.
DatabaseConnector supports Spark, so the same OHDSI code works here too.
OHDSI runs the same analysis at:
How do you write code that works everywhere?
DatabaseConnector: unified database interface for R.
One interface, many backends.
SqlRender: SQL dialect translation.
Write flexible SQL, fill in values at runtime.
SqlRender is useful, but not magic.
In practice: Test your queries on the actual target database. Don’t assume translation “just works.”
Andromeda: local analytics on big data.
Problem: Can’t load 100M rows into R.
Solution: Andromeda wraps SQLite for out-of-memory processing.
┌─────────────────────────────────┐
│ Your Analysis Code (R)          │
├─────────────────────────────────┤
│ SqlRender (dialect translation) │
├─────────────────────────────────┤
│ DatabaseConnector (connections) │
├─────────────────────────────────┤
│ PostgreSQL │ SQL Server │ Oracle│
└─────────────────────────────────┘
Write once, run anywhere:
This is how OHDSI studies run at multiple sites.
Eunomia = sample OMOP CDM in SQLite
Useful for testing and learning before touching real data.
| Tool | Purpose |
|---|---|
| Git/GitHub | Track code changes, collaborate |
| Remote databases | Handle scale, security, multi-user |
| DatabaseConnector | Unified R database interface |
| SqlRender | SQL dialect translation |
| Andromeda | Local big-data processing |
| Eunomia | Safe practice environment |
Hands-on:
modules/03_ohdsi-basics/ohdsi-basics.qmd