Version Control with Git

Chapter 1: Python & Programming Foundations Deep Dive — Tool & Internals

For a long time, I used Git the way a lot of people use Git: by memorizing a handful of spells. git add ., git commit -m "stuff", git push. When something went wrong—a merge conflict I didn't understand, a branch that had somehow diverged from reality—I'd stare at the terminal, try a few things from Stack Overflow, and if nothing worked, delete the folder and clone again. I'm not proud of it. But I suspect I'm not alone.

What finally broke me out of it was a nagging discomfort. I didn't know what was actually happening. I was operating a machine I couldn't see inside, and eventually that bothered me enough to go looking. What I found was genuinely surprising: Git's internals are simple. Not easy, but simple. A small number of elegant ideas compose into everything you've ever seen Git do.

Git was created by Linus Torvalds in 2005, over about two weeks, because the vendor of the tool the Linux kernel team had been using (BitKeeper) withdrew its free license and they needed a replacement fast. It's a distributed version control system, meaning every clone of a repository contains the full history, not just a pointer to a central server. Nearly every ML project you'll work on, every open-source library you'll depend on, every production system you'll help build: odds are it lives in Git.

We'll be looking at SHA hashes and internal data structures in this section, but no prior knowledge is required. We'll build every concept from scratch, starting from how Git actually stores things, and work our way up to the commands you run every day. This isn't a short journey, but I hope you'll be glad you came.

What We'll Cover

Snapshots, Not Diffs
The Object Model — blobs, trees, commits, and content-addressable storage
The Three Trees — working directory, staging area, and repository
Refs — where branches actually live
The DAG — how history actually works
Merge vs. Rebase — two ways to combine work
When Everything Goes Wrong — reflog, bisect, cherry-pick, stash, reset
Git for Machine Learning Teams — .gitignore, DVC, Git LFS
Packfiles and Garbage Collection

Snapshots, Not Diffs

Here's the misconception that trips up almost everyone. People assume Git tracks changes—that it stores a log of diffs, and to reconstruct any version of a file, it replays the patch history forward. That's how CVS and Subversion worked. Git does something different.

Git takes a snapshot of your entire project at every commit. Not a list of what changed—a photograph of every file at that exact moment.

Suppose we have a tiny ML project with three files: train.py, config.yaml, and data/dataset.csv. When we make our first commit, Git records the state of all three. When we edit only train.py and commit again, Git takes a new snapshot—but for config.yaml and data/dataset.csv, which didn't change, it doesn't store them again. It stores a reference back to the previous versions. Same content, same reference.

The mental model that unlocks everything: Git is a photo album, not a film strip. Each commit is a complete photograph of your project at that moment. A file that didn't change doesn't get a new photo—it gets a pointer to the old one. This is the shift that makes everything else click. Hold it in mind as we go deeper.

But storing complete snapshots raises an obvious question: how does Git do that without eating your disk alive? That's where the object model comes in.

The Object Model

Inside your project's .git/objects/ directory, Git stores everything as one of four types of objects: blobs, trees, commits, and tags. Understanding these four is understanding Git.

Blobs

When we run git add train.py, Git compresses the file contents and stores them as a blob object. A blob is pure content—it doesn't know its own filename, doesn't know what directory it lives in, doesn't know anything about the project around it. It's just bytes.

Git hashes that content (prefixed with a short header recording the object type and size) using SHA-1, a cryptographic hash function that takes any input and produces a consistent 40-character hexadecimal string. That hash is the blob's address in the object store. You can verify this yourself:

echo "print('hello')" | git hash-object --stdin

Run that command twice, on any machine, and you get the same hash. Change a single character in the input and the hash changes completely. This property—same content always maps to the same address—is called content-addressable storage. It's also why Git never stores the same content twice: if you have two files with identical content, they hash to the same blob. One copy, two names.
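To make content addressing concrete, here is a minimal Python sketch of the calculation behind git hash-object. It assumes the classic SHA-1 object format, in which Git hashes a short "blob <size>" header, a NUL byte, and then the raw bytes. The function name git_blob_hash is my own, not part of any Git API.

```python
import hashlib


def git_blob_hash(content: bytes) -> str:
    """Return the SHA-1 object ID Git would assign to a blob with this content."""
    # Git hashes a short header ("blob <size>" plus a NUL byte) followed by the raw bytes.
    header = b"blob " + str(len(content)).encode("ascii") + b"\0"
    return hashlib.sha1(header + content).hexdigest()


# The same bytes that `echo "print('hello')"` pipes in (note the trailing newline).
print(git_blob_hash(b"print('hello')\n"))

# Identical content always lands at the identical address.
print(git_blob_hash(b"lr: 0.001\n") == git_blob_hash(b"lr: 0.001\n"))
```

If Git is installed, comparing the first printed value against the output of the command above is a quick way to check the sketch.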

I'll be honest: I used Git for two years before I learned that a blob doesn't even know its own filename. That blew my mind. If the blob has no idea what it's called, then something else must be responsible for names. That something is the tree.

Trees

A tree object is the Git equivalent of a directory listing. It maps filenames (and permissions) to blob hashes—and for subdirectories, to other tree hashes. When we run git cat-file -p <tree-hash>, we see something like this:

git cat-file -p HEAD^{tree}

100644 blob a8c3f2e1...  config.yaml
040000 tree 9e1d4c7b...  data
100644 blob 4d9f7a12...  train.py

The tree gives blobs their names and their place in the directory hierarchy. The root tree of a project is the snapshot of your entire project at that commit.
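A tree object can be sketched the same way. The snippet below assumes the SHA-1 object format: each entry is stored as "<mode> <name>", a NUL byte, and the 20 raw bytes of the object ID, with entries sorted by name (directories compare as if their names ended in "/"). The helper names git_object_id and tree_body are hypothetical, chosen for illustration.

```python
import hashlib


def git_object_id(kind: str, body: bytes) -> str:
    """Hash an object body the way Git does: '<kind> <size>' + NUL + body."""
    header = f"{kind} {len(body)}".encode("ascii") + b"\0"
    return hashlib.sha1(header + body).hexdigest()


def tree_body(entries):
    """Serialize (mode, name, hex_object_id) tuples into a tree object body."""
    def sort_key(entry):
        mode, name, _ = entry
        # Directories (stored with mode 40000) sort as if the name ended in "/".
        return (name + "/" if mode == "40000" else name).encode("utf-8")

    body = b""
    for mode, name, hex_id in sorted(entries, key=sort_key):
        body += f"{mode} {name}".encode("utf-8") + b"\0" + bytes.fromhex(hex_id)
    return body


config_id = git_object_id("blob", b"lr: 0.001\n")
train_id = git_object_id("blob", b"print('training')\n")
root = tree_body([("100644", "train.py", train_id), ("100644", "config.yaml", config_id)])
print(git_object_id("tree", root))
```

Notice that the filenames live only in the tree body. The blob IDs are computed without them, which is why renaming a file changes the tree but not the blob.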

Commits

A commit object ties everything together. It points to a tree (the root snapshot), stores author name, email, timestamp, and commit message, and—crucially—points to its parent commit or commits. Let's look at one:

git cat-file -p HEAD

tree 7b2e1c8a4d3e9f1c2b5a8d7e6f0c1a3b4d5e6f7a
parent 3a1b2c4d5e6f7a8b9c0d1e2f3a4b5c6d7e8f9a0b
author Ada Lovelace <ada@example.com> 1710000000 +0000
committer Ada Lovelace <ada@example.com> 1710000000 +0000

Add balanced sampling to training loop

That's it. A commit is a small text object: a tree pointer, zero or more parent pointers, an author, a committer, and a message. The content determines the hash, and the hash becomes the commit's identity. Change the message by a single character and the hash changes; it's a completely different commit. This immutability is fundamental to Git's integrity. Because each commit embeds its parent's hash, history forms a cryptographic chain back to the very first commit: tampering with any commit in the middle would cascade hash changes all the way forward, making it immediately visible.
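The cascade is easy to see in a few lines of Python. This sketch assumes the plain SHA-1 commit layout shown above (no GPG signature or extra headers); commit_body and git_object_id are illustrative helpers, and the author line is the made-up one from the example.

```python
import hashlib


def git_object_id(kind: str, body: bytes) -> str:
    header = f"{kind} {len(body)}".encode("ascii") + b"\0"
    return hashlib.sha1(header + body).hexdigest()


def commit_body(tree: str, parents: list, who: str, message: str) -> bytes:
    """Build the text of a commit object: headers, blank line, message."""
    lines = [f"tree {tree}"]
    lines += [f"parent {p}" for p in parents]
    lines += [f"author {who}", f"committer {who}", "", message, ""]
    return "\n".join(lines).encode("utf-8")


EMPTY_TREE = git_object_id("tree", b"")
WHO = "Ada Lovelace <ada@example.com> 1710000000 +0000"

first = git_object_id("commit", commit_body(EMPTY_TREE, [], WHO, "Initial commit"))
second = git_object_id("commit", commit_body(EMPTY_TREE, [first], WHO, "Add training loop"))

# Tamper with the first commit's message by one character...
forged = git_object_id("commit", commit_body(EMPTY_TREE, [], WHO, "Initial commit!"))
# ...and the child must change too, because it embeds its parent's hash.
second_on_forged = git_object_id("commit", commit_body(EMPTY_TREE, [forged], WHO, "Add training loop"))

print(first != forged, second != second_on_forged)
```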

The Three Trees

Git manages three distinct "trees" simultaneously. The word "tree" is a loose metaphor here—not all of them are literal tree data structures—but the name has stuck. Understanding them is what makes git status, git add, and git commit legible rather than magical.

The first is the working directory: your actual files on disk, the ones you open in your editor, the ones you run directly. Changes here are not part of any commit until you explicitly stage them.

The second is the staging area, also called the index. This is a draft of your next commit: a prepared snapshot that sits between your working directory and the permanent record. Running git add train.py copies the current state of train.py from the working directory into the staging area.

The third is the repository: the committed history stored in .git/, the chain of commit objects we just looked at.

Why does the staging area exist? Because you've often changed five files while solving a problem, but only three of those changes actually belong together as a coherent unit. The staging area lets you curate exactly what goes into each commit. You're not forced to commit everything at once. It's a powerful feature that feels annoying until the day it saves you.

Walk through the commands with this model in mind. Edit train.py: now the working directory differs from the staging area. Run git add train.py: now the staging area differs from the last commit. Run git commit: now the repository matches the staging area.

git diff           # working directory vs. staging area
git diff --staged  # staging area vs. last commit
git status         # shows the state of all three trees at once

If the working directory is your workshop, the staging area is the box where you pack items before shipping. You pick and choose what goes in the box. The commit ships the box and seals it permanently. Once sealed, nothing changes—ever.
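The three-tree walk-through can be modeled with three dictionaries. This is a deliberately tiny toy, assuming each tree is just a mapping from path to content; the ToyRepo class and its method names are invented for illustration and bear no relation to Git's real index format.

```python
from dataclasses import dataclass, field


def changed(a: dict, b: dict) -> list:
    """Paths whose content differs between two snapshots."""
    return sorted(p for p in a.keys() | b.keys() if a.get(p) != b.get(p))


@dataclass
class ToyRepo:
    working: dict = field(default_factory=dict)  # files on disk
    index: dict = field(default_factory=dict)    # staging area
    head: dict = field(default_factory=dict)     # last commit's snapshot

    def add(self, path: str) -> None:
        self.index[path] = self.working[path]    # copy; the working file is untouched

    def commit(self) -> None:
        self.head = dict(self.index)             # the staged snapshot becomes the commit

    def diff(self) -> list:
        return changed(self.working, self.index)  # like `git diff`

    def diff_staged(self) -> list:
        return changed(self.index, self.head)     # like `git diff --staged`


repo = ToyRepo({"train.py": "v1"}, {"train.py": "v1"}, {"train.py": "v1"})
repo.working["train.py"] = "v2"          # edit the file
print(repo.diff(), repo.diff_staged())   # working differs from index
repo.add("train.py")
print(repo.diff(), repo.diff_staged())   # index differs from HEAD
repo.commit()
print(repo.diff(), repo.diff_staged())   # all three agree again
```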

We know about objects and the three trees. But Git still needs a way to track which commit is "current," and what branches actually are. That's refs.

Refs — Where Branches Actually Live

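A branch in Git is a text file. That's the whole secret. Open .git/refs/heads/main in a repository and you'll typically find a single line: a 40-character SHA-1 hash pointing to a commit. (Git sometimes consolidates refs into a single .git/packed-refs file, in which case the branch appears as a line in that file instead.)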

cat .git/refs/heads/main
# a1b2c3d4e5f6a7b8c9d0e1f2a3b4c5d6e7f8a9b0

When you make a new commit on main, Git writes the new commit hash into that file. That's all branching is—a file that gets updated. Branches are incredibly cheap to create because they're just 41 bytes on disk.

HEAD is another reference, stored at .git/HEAD. Usually it contains a symbolic reference to the current branch:

cat .git/HEAD
# ref: refs/heads/main

When you switch branches with git checkout feature or git switch feature, Git updates .git/HEAD to point to the new branch. Then, when you make a commit, Git follows HEAD to find the current branch, and updates that branch's ref file to point to the new commit. The whole mechanism is beautifully simple.

Detached HEAD happens when HEAD points directly to a commit hash instead of a branch name. You're looking at a specific snapshot but not "on" any named branch. Any commits you make will be orphaned—not reachable from any branch—unless you create a new branch before leaving. I still remember the first time I saw "detached HEAD state" in my terminal and genuinely panicked. Now I just run git switch -c new-branch-name and carry on.
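Following HEAD by hand is a good way to convince yourself there's no magic. The sketch below reads the files directly, assuming a standard .git directory (not a linked worktree, where .git is a file). The function name resolve_head is my own, and the packed-refs fallback covers the case where Git has consolidated loose ref files.

```python
from pathlib import Path


def resolve_head(git_dir):
    """Return (ref_name, commit_hash); ref_name is None when HEAD is detached."""
    git_dir = Path(git_dir)
    head = (git_dir / "HEAD").read_text().strip()
    if not head.startswith("ref: "):
        return None, head                      # detached: HEAD holds a commit hash
    ref = head[len("ref: "):]
    ref_file = git_dir / ref
    if ref_file.exists():
        return ref, ref_file.read_text().strip()
    packed = git_dir / "packed-refs"           # refs may have been packed into one file
    if packed.exists():
        for line in packed.read_text().splitlines():
            if line.startswith(("#", "^")):    # header and peeled-tag lines
                continue
            commit, _, name = line.partition(" ")
            if name.strip() == ref:
                return ref, commit
    return ref, None                           # branch exists in name only (no commits yet)
```

Calling resolve_head(".git") from the root of a repository should agree with git rev-parse HEAD.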

Tags work similarly. Lightweight tags are just ref files in .git/refs/tags/—like branches, but they don't move when you commit. Annotated tags are full objects (the fourth type) with their own hash, author, and message, pointed to by a ref. Both serve as stable, named bookmarks to specific commits—typically for release versions.

The DAG — How History Actually Works

Commits form a Directed Acyclic Graph, or DAG. Each commit points backward to its parent or parents. "Directed" means the edges go one way—from child to parent, never the reverse. "Acyclic" means there are no loops; you can walk the graph backward but you'll never return to where you started.

Most commits have exactly one parent. The very first commit in a repository has zero parents. Merge commits have two or more, one for each branch that was merged (two is by far the most common). That's the entire structure. A few rules, endlessly composed.

Here's a tiny ASCII diagram of a branch and merge:

A --- B --- C --------- F   (main)
             \        /
              D --- E       (feature)

Commits A through C are on main. At C, someone created a feature branch and made commits D and E. Then F is a merge commit with two parents: C (from main) and E (from feature). The graph captures the entire history of how work came together.

This is why git log --graph --oneline draws those branch-and-merge lines. It's rendering the DAG. The DAG is also what makes Git efficient at answering ancestry questions: to find out if commit A is an ancestor of commit B, Git walks the graph backward from B. No need to scan all commits—just traverse the edges.
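Here is the ancestry walk as a short Python sketch, using the six-commit graph from the diagram. The parents dictionary and the is_ancestor function are illustrative stand-ins, and the traversal is a plain depth-first search rather than whatever optimizations Git applies internally.

```python
# Each commit maps to the list of its parents (child -> parent edges only).
parents = {
    "A": [],
    "B": ["A"],
    "C": ["B"],
    "D": ["C"],
    "E": ["D"],
    "F": ["C", "E"],   # merge commit: two parents
}


def is_ancestor(candidate: str, commit: str, graph: dict) -> bool:
    """Walk backward from `commit`; True if we reach `candidate` (or start on it)."""
    stack, seen = [commit], set()
    while stack:
        current = stack.pop()
        if current == candidate:
            return True
        if current in seen:
            continue
        seen.add(current)
        stack.extend(graph[current])
    return False


print(is_ancestor("D", "F", parents))  # D is reachable from the merge commit
print(is_ancestor("D", "C", parents))  # but not from C, which predates the branch
```

Only commits reachable from the starting point are ever visited, which is the property the paragraph above relies on.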

We now have the full mental model: content-addressed objects (blobs, trees, commits) stored in a DAG, navigated by refs (branches, tags, HEAD), with three trees mediating our daily workflow. That's all of Git, in principle.

Congratulations on making it this far. You can stop here if you want.

Seriously. You now understand how Git actually works at the level that matters. Next time someone says "just rebase" or "you're in detached HEAD," you'll know exactly what's happening to the graph instead of guessing. That's genuinely valuable knowledge, and most working engineers don't have it.

But if the discomfort of not knowing what happens when things go wrong is nagging at you—and it probably should, because it will go wrong—read on.

Merge vs. Rebase — Two Ways to Combine Work

You've been on a feature branch for two days. Main has moved forward—your teammates merged things while you were working. Now you need to combine your work with theirs. Git gives you two fundamentally different ways to do this, and understanding the difference is worth the investment.

Fast-Forward Merge

If main hasn't diverged—that is, your branch is ahead of main but main hasn't received any new commits since you branched off—Git can do a fast-forward merge. There's nothing to reconcile. Git moves the main pointer forward to point at your latest commit. No new commit is created. History stays linear.

git switch main
git merge feature   # fast-forward if possible

Three-Way Merge

If both branches have new commits since they diverged, Git needs to do a three-way merge. It finds the common ancestor (the commit where the branches split), compares both branches against it, and synthesizes a new merge commit with two parents. If the same part of the same file was changed on both branches, Git can't resolve that automatically—that's a merge conflict, and it asks you to resolve it manually.

The result is historically accurate but visually noisy: git log --graph shows the branch-and-merge topology, which is honest but can be hard to read on busy projects.
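The decision rule behind a three-way merge fits in a few lines. This sketch simplifies heavily: it treats each file as a single opaque value, whereas Git applies the same reasoning to regions within a file. The function three_way_merge and the snapshot dictionaries are hypothetical.

```python
def three_way_merge(base: dict, ours: dict, theirs: dict):
    """Merge two snapshots against their common ancestor.

    Returns (merged_snapshot, conflicting_paths).
    """
    merged, conflicts = {}, []
    for path in sorted(base.keys() | ours.keys() | theirs.keys()):
        b, o, t = base.get(path), ours.get(path), theirs.get(path)
        if o == t:
            result = o          # both sides agree (or neither touched it)
        elif o == b:
            result = t          # only their side changed it
        elif t == b:
            result = o          # only our side changed it
        else:
            conflicts.append(path)   # both changed it, differently
            continue
        if result is not None:       # None means the file was deleted
            merged[path] = result
    return merged, conflicts


base = {"train.py": "v1", "config.yaml": "v1"}
ours = {"train.py": "v2", "config.yaml": "v1"}
theirs = {"train.py": "v1", "config.yaml": "v2"}
print(three_way_merge(base, ours, theirs))
```

The common ancestor is what makes this possible: without it, Git could see that the two sides differ but not which side made the change.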

Rebase

Rebase is the alternative. Instead of creating a merge commit, rebase takes your feature branch commits and replays them, one by one, on top of the latest commit on main. The result looks as if you started your branch from main's current tip—linear history, no merge commit, no graph branching.

git switch feature
git rebase main     # replay feature commits on top of latest main

But there's a catch. Replaying commits creates new commits—new parent, new hash, even if the diff is identical. The old commits still exist in the object store, but the branch now points to the new ones. If anyone else has pulled your feature branch and based work on those old commits, they now have commits that no longer exist on any branch. Their history has diverged from yours in a way that's genuinely painful to untangle.

The golden rule: never rebase commits that other people have already pulled. On a local branch you haven't pushed, rebase away. On a shared branch? Don't. I once rebased a shared branch that a colleague was actively working on. It took our team half a day to figure out what had happened and reconstruct everyone's work. We were not pleased with each other that afternoon.
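Why does replaying produce new commits even when the diff is identical? Because the parent is part of what gets hashed. The toy below makes that visible; commit_id is an invented stand-in for a real commit hash (it hashes only parent, patch, and message), and the base IDs are made up.

```python
import hashlib


def commit_id(parent: str, patch: str, message: str) -> str:
    # Toy stand-in for a commit hash: the parent ID is part of the hashed content.
    return hashlib.sha1(f"{parent}\n{patch}\n{message}".encode()).hexdigest()[:7]


def replay(changes, base: str) -> list:
    """Apply (patch, message) pairs on top of `base`; return the resulting commit IDs."""
    ids, parent = [], base
    for patch, message in changes:
        parent = commit_id(parent, patch, message)
        ids.append(parent)
    return ids


feature_changes = [
    ("+ add sampler", "Add balanced sampling"),
    ("+ add metric", "Log minority-class recall"),
]
old_ids = replay(feature_changes, base="c0ffee1")  # the branch as originally written
new_ids = replay(feature_changes, base="f00dbab")  # the same changes on a newer main
print(old_ids)
print(new_ids)
```

Same patches, same messages, entirely different IDs. Anyone who built on the old IDs is now holding commits your branch no longer contains.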

When Everything Goes Wrong

This is the section that earns its keep at 2 AM. Not a list of commands—a set of tools that make sense once you understand why they work. They work because of the object model. Commits don't get deleted; they get unreferenced. As long as something points to a commit—a branch, a tag, or the reflog—it's recoverable.

The Reflog — Git's Flight Recorder

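Every time HEAD moves (commit, checkout, reset, rebase), Git records it in the reflog. Run git reflog and you'll see HEAD's recent movements, each with a short hash and a description of what caused the move. By default, entries are kept for 90 days, or 30 days when they point at commits no longer reachable from the current branch tip.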

git reflog

a1b2c3d HEAD@{0}: commit: Add regularization
f4e5d6c HEAD@{1}: rebase: fast-forward
9c8b7a6 HEAD@{2}: checkout: moving from main to feature
...

Even after a hard reset that seemed to destroy your last three commits, those commits are still in the object store. The reflog shows you their hashes. You can recover them with git checkout <hash> and then create a new branch from that point. Nothing is really gone until its reflog entries expire and Git garbage-collects it, which by default takes at least 30 days.
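The reflog itself is just a text file, typically .git/logs/HEAD, with one line per movement. As a rough sketch, assuming the usual layout of old hash, new hash, identity, timestamp, and timezone, followed by a tab and the message, a line can be pulled apart like this. The function name parse_reflog_line and the sample line are invented for illustration.

```python
def parse_reflog_line(line: str) -> dict:
    """Split one reflog line into its fields."""
    meta, _, message = line.rstrip("\n").partition("\t")
    old, new, rest = meta.split(" ", 2)
    who, timestamp, tz = rest.rsplit(" ", 2)
    return {
        "old": old,            # where HEAD pointed before the move
        "new": new,            # where HEAD pointed after it
        "who": who,
        "time": int(timestamp),
        "tz": tz,
        "message": message,
    }


sample = (
    "0" * 40 + " " + "a" * 40
    + " Ada Lovelace <ada@example.com> 1710000000 +0000"
    + "\tcommit (initial): Add training loop\n"
)
entry = parse_reflog_line(sample)
print(entry["new"][:7], entry["message"])
```

The "old" field is the interesting one for rescue work: after a reset, it holds the hash of the commit you just moved away from.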

Bisect — Binary Search for Bugs

Bisect is one of Git's most underused superpowers. You know the code worked at commit A and is broken at commit Z, and there are 200 commits in between. Git bisect performs a binary search: it checks out the commit halfway between A and Z, you test it, tell Git whether it's good or bad, and Git narrows the range in half. Repeat until you've found the exact commit that introduced the bug. For 200 commits, that's at most 8 tests.

git bisect start
git bisect bad              # current commit is broken
git bisect good a1b2c3d     # this commit was known-good
# Git checks out the midpoint — test your code, then:
git bisect good             # or: git bisect bad
# Repeat until Git announces the first bad commit
git bisect reset            # return to where you started

For ML work, this is invaluable. When your model's validation accuracy mysteriously drops from 0.94 to 0.87 between last week and this week, bisect finds the exact commit that caused it—whether it was a subtle data processing change, a hyperparameter tweak, or a normalization bug.
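The search itself is ordinary binary search. This sketch assumes a linear list of 200 commits where the first is known good, the last is known bad, and every commit after the culprit is also bad. The names first_bad and is_bad are hypothetical; in real life is_bad is you running your test suite or evaluation script.

```python
def first_bad(commits: list, is_bad) -> tuple:
    """Return (first bad commit, number of tests run).

    Assumes commits[0] is good, commits[-1] is bad, and badness never reverts.
    """
    lo, hi = 0, len(commits) - 1     # lo: known good, hi: known bad
    tests = 0
    while hi - lo > 1:
        mid = (lo + hi) // 2
        tests += 1
        if is_bad(commits[mid]):
            hi = mid                 # the culprit is at mid or earlier
        else:
            lo = mid                 # the culprit is after mid
    return commits[hi], tests


commits = list(range(200))           # stand-ins for 200 commit hashes
culprit, tests = first_bad(commits, lambda c: c >= 137)
print(culprit, tests)
```

Each test halves the remaining range, which is where the "at most 8 tests" figure comes from.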

Cherry-Pick — Transplanting a Single Commit

Cherry-pick applies the changes from a specific commit onto your current branch, creating a new commit with the same diff but a new hash and a new parent. It's surgical: you're not merging entire branches, you're transplanting one specific change.

git cherry-pick a1b2c3d    # apply the changes from that commit here

The classic use case: a critical bug fix lands on the develop branch, but you need it on main right now without pulling in all the other half-finished work on develop. Cherry-pick the fix and ship it.

Interactive Rebase — Rewriting Recent History

Interactive rebase, git rebase -i HEAD~5, opens an editor showing your last five commits with one-letter command prefixes. You can reorder them, squash multiple commits into one, reword messages, or drop commits entirely. It's how you turn a branch full of "WIP", "fix", "fix fix", and "ok actually fix" commits into a clean, coherent history before pushing.

git rebase -i HEAD~5

This opens an editor. Change pick to squash (or s) on commits you want folded into the one above them. Change pick to reword (or r) to edit a message. Save and close, and Git does the work. The result is a clean history that looks intentional, even if the reality was messier.

Stash — A Clipboard for Work in Progress

Stash temporarily shelves uncommitted changes. You've been editing train.py when your teammate asks you to look at something on a different branch. You're not ready to commit. git stash saves your changes to tracked files (pass -u to include untracked files as well), resets the working directory to match HEAD, and lets you switch context. git stash pop brings everything back.

git stash           # save changes, clean working directory
git switch main     # do other work...
git switch feature
git stash pop       # restore your changes

You can accumulate multiple stashes (git stash list shows them all) and apply specific ones with git stash apply stash@{2}. I treat stash as a stack of temporary saves, not a long-term storage strategy—if you need something permanently, commit it to a branch.

Reset — Three Flavors of Undo

Reset moves the current branch pointer to a different commit. It comes in three modes, and knowing which one to use is genuinely important because two of them leave your files on disk untouched and one overwrites them.

git reset --soft HEAD~1    # undo last commit; changes stay staged
git reset --mixed HEAD~1   # undo last commit; changes unstaged (this is the default)
git reset --hard HEAD~1    # undo last commit; changes DESTROYED

Soft reset is the most useful. Made a commit with a bad message? git reset --soft HEAD~1, fix the message, recommit. Need to break a big commit into smaller ones? Mixed reset (the default) leaves everything unstaged so you can re-add it piece by piece. Hard reset is the one to be careful with: it overwrites the working directory too, so any uncommitted edits are gone for good. The commit you reset away from is a different story: the commit object still exists, the reflog still points to it, and you can get it back until that reflog entry expires (30 days by default for commits no longer reachable from the branch).

The reason these rescue tools exist—the reason the reflog can recover from a hard reset, the reason bisect can navigate hundreds of commits efficiently—is the object model. Commits are immutable and content-addressed. They don't get deleted just because a branch stops pointing to them. Understanding the internals isn't academic. It changes how you feel about making mistakes.
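In terms of the three trees, the reset modes differ only in how far the rewind reaches. Here is a toy model, assuming each tree is a plain dictionary of path to content; the reset function and the repo dictionary are illustrative and not how Git is implemented.

```python
def reset(repo: dict, target: dict, mode: str = "mixed") -> dict:
    """Move HEAD to `target`, then rewind the index and working tree as the mode demands."""
    if mode not in ("soft", "mixed", "hard"):
        raise ValueError(f"unknown mode: {mode}")
    repo["head"] = dict(target)               # every mode moves the branch pointer
    if mode in ("mixed", "hard"):
        repo["index"] = dict(target)          # mixed and hard also reset the staging area
    if mode == "hard":
        repo["working"] = dict(target)        # only hard overwrites files on disk
    return repo


def fresh_repo() -> dict:
    latest = {"train.py": "v2"}
    return {
        "head": dict(latest),
        "index": dict(latest),
        "working": {"train.py": "v2 plus uncommitted edits"},
    }


previous = {"train.py": "v1"}
for mode in ("soft", "mixed", "hard"):
    print(mode, reset(fresh_repo(), previous, mode))
```

In the hard case the "uncommitted edits" string is simply overwritten. It was never in the object store, so there is nothing for the reflog to point at.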

Git for Machine Learning Teams

Here's where Git for ML projects diverges from Git for regular software development. The core problem has two parts: your code is small, text-based, and diffable. Your data and model weights are huge, binary, and not meaningfully diffable. Get the separation wrong and your repository grows to gigabytes, cloning takes twenty minutes, and you've permanently embedded a model checkpoint in the history that you can never cleanly remove.

The .gitignore — Your First Commit

Before you write a single line of code in a new ML project, create a .gitignore. This file tells Git which files and directories to completely ignore. It only affects untracked files—if you accidentally committed something first and then added it to .gitignore, it'll keep being tracked. Fix that with git rm --cached filename to stop tracking the file without deleting it from disk.

# Data — never in git
data/
*.csv
*.parquet
*.json.gz

# Model artifacts
models/
*.pt
*.pth
*.pkl
*.safetensors
*.ckpt
*.onnx

# Python artifacts
__pycache__/
*.pyc
*.pyo
.venv/
venv/
*.egg-info/

# Environment and secrets
.env
.env.*
*.key

# Experiment tracking noise
wandb/
mlruns/
.ipynb_checkpoints/
*.log

Commit this file. Push it. Make it the first thing anyone who clones the repo sees.

DVC — Git for Data and Models

DVC (Data Version Control) is the solution to the large-file problem that was designed specifically for ML. The idea: Git tracks code, DVC tracks data and models. You use both simultaneously, and they integrate cleanly.

dvc add data/train.csv

This does two things. First, it adds data/train.csv to .gitignore (so Git ignores the actual file). Second, it creates data/train.csv.dvc—a small text file containing the file's hash and size, which does go in Git. When a collaborator clones the repo and runs dvc pull, DVC fetches the actual data from remote storage (S3, GCS, Azure, SSH—DVC supports all of them). Your Git history stays lean; your data stays versioned.
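To get a feel for what that small text file stands in for, here is a sketch that computes a content hash and size for a file. It assumes MD5, which is what DVC has used by default for file hashes; the fingerprint function is my own, and the exact layout and field names of a real .dvc file are DVC's business and are not reproduced here.

```python
import hashlib
from pathlib import Path


def fingerprint(path, chunk_size: int = 1 << 20) -> dict:
    """Return the content hash and size of a file, reading it in chunks."""
    digest = hashlib.md5()
    size = 0
    with Path(path).open("rb") as handle:
        # Read in 1 MiB chunks so multi-gigabyte datasets never sit in memory at once.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
            size += len(chunk)
    return {"md5": digest.hexdigest(), "size": size}
```

A few dozen bytes of hash and size go into Git; the gigabytes they describe go to remote storage. It's the same content-addressing idea as blobs, applied one level up.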

DVC also defines pipelines. A dvc.yaml file describes your ML workflow as a DAG of stages—preprocess, train, evaluate—each with defined inputs and outputs. Run dvc repro and DVC reruns only the stages whose inputs have changed. It's reproducibility infrastructure: given a commit and a data version, you can reproduce the exact model that produced a given result.

dvc repro          # reproduce the pipeline, skipping unchanged stages
dvc push           # upload tracked data/models to remote storage
dvc pull           # download tracked data/models from remote storage

I once committed a 2GB model checkpoint directly to Git. Not to a feature branch, but to main. The repository was ruined for everyone. Even after removing the file, it lived in Git's history, and every clone pulled down that 2GB forever. We had to rewrite history with git filter-branch (these days the Git documentation recommends git filter-repo for this job), force-push everything, and have every team member re-clone. Nobody was happy. Use DVC.

Git LFS — The Simpler Alternative

Git LFS (Large File Storage) is a simpler alternative that requires no new mental model if you already know Git. You install the LFS extension, tell it which patterns to track, and it handles the rest transparently.

git lfs install
git lfs track "*.pt"       # track all .pt files with LFS
git add .gitattributes     # commit the tracking rules
git add model.pt           # now stored in LFS, not in git
git commit -m "Add model checkpoint"

Large files are stored on a separate LFS server. Git objects contain small pointer files instead. Cloning is fast because LFS downloads only what you need. It's less powerful than DVC—no pipelines, no experiment tracking, no Python API—but if your only problem is large model files and you don't want to learn new tooling, LFS solves it.
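A pointer file is tiny. Under version 1 of the LFS pointer format it is three lines of text: a version URL, a SHA-256 object ID, and a size in bytes. The sketch below builds one in Python; lfs_pointer is a hypothetical helper for illustration, not something the LFS client exposes.

```python
import hashlib


def lfs_pointer(content: bytes) -> str:
    """Build the text of a Git LFS (spec v1) pointer for the given file content."""
    oid = hashlib.sha256(content).hexdigest()
    return (
        "version https://git-lfs.github.com/spec/v1\n"
        f"oid sha256:{oid}\n"
        f"size {len(content)}\n"
    )


fake_checkpoint = b"\x00" * 1024   # stand-in bytes for a model file
print(lfs_pointer(fake_checkpoint))
```

This is what Git actually stores as the blob for model.pt, which is why a history full of checkpoints stays small.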

Commit Messages That Matter

This sounds like a soft skill. It's not. "Updated model" is a commit message. "Retrain with balanced sampling — accuracy 0.92 but minority-class recall was 0.31; switched to weighted cross-entropy" is a commit message. When your model accuracy regresses three weeks from now and you're bisecting through 40 commits at 11 PM, the difference between those two messages is whether you can find the cause in five minutes or two hours.

The convention: one short summary line (under 72 characters), then a blank line, then as much context as you want. Include what you changed, why you changed it, and what the result was. If you ran an experiment, put the key metric in the message. Future you will be grateful. Your teammates will be grateful. Code review will be faster.
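The mechanical half of that convention is easy to check automatically. The sketch below tests only the shape described above (a summary line under 72 characters, then a blank line); check_message is a hypothetical helper, and it can't tell you whether the message actually says anything useful.

```python
def check_message(message: str, limit: int = 72) -> list:
    """Return a list of formatting problems; an empty list means the shape is fine."""
    problems = []
    lines = message.splitlines()
    if not lines or not lines[0].strip():
        problems.append("missing summary line")
        return problems
    if len(lines[0]) >= limit:
        problems.append(f"summary line is {len(lines[0])} characters; keep it under {limit}")
    if len(lines) > 1 and lines[1].strip():
        problems.append("second line should be blank")
    return problems


good = (
    "Switch to weighted cross-entropy\n"
    "\n"
    "Balanced sampling reached accuracy 0.92 but minority-class recall was 0.31.\n"
)
print(check_message(good))
print(check_message("Updated model\nalso changed some other things"))
```

A function like this could be called from a commit-msg hook, though wiring that up is left as an exercise.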

Branching for Experiments

Because branches are just text files, they're essentially free. The pattern that works well for ML experimentation: one short-lived branch per experiment. Branch from main, change a hyperparameter or a preprocessing step, train, evaluate, and record the results in the commit message or in a linked experiment tracker. Merge the winner. Delete the losers, or keep them around, since disk is cheap and history is useful.

This keeps main always in a known-good state. It lets you run multiple experiments in parallel without clobbering each other's changes. It makes it easy to compare: two branches, two sets of results, pick the better one. And because branches are so cheap in Git—just a 41-byte text file—there's no reason not to make a new one for every experiment. I've worked on teams where people committed directly to main. The first time a bad experiment broke everyone's baseline, the branching habit stuck for good.

Packfiles and Garbage Collection

Git starts by storing each object as a separate file in .git/objects/. These are called loose objects. As a repository accumulates history—hundreds of commits, thousands of files—the loose object count grows until performance starts to suffer. Git handles this automatically.

Running git gc (garbage collection) packs loose objects into packfiles: compressed bundles stored in .git/objects/pack/. Packfiles use delta compression: rather than storing each version of a file in full, they store similar objects as deltas against each other. A source file that changed slightly across 50 commits takes up far less space in a packfile than as 50 loose blobs. Git runs git gc --auto on its own after certain commands (commit, merge, and fetch among them), and it does real work only once enough loose objects have accumulated.
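You can get a feel for why this helps with nothing but zlib. The experiment below is an analogy, not Git's actual delta format: it compresses 50 near-identical versions of a made-up file one at a time (roughly what loose objects do) and then all together, so the compressor can refer back to earlier versions.

```python
import hashlib
import zlib

# A ~2 KB "source file" of high-entropy lines, so zlib can't shrink one copy much.
base = "\n".join(hashlib.sha1(str(i).encode()).hexdigest() for i in range(50))

# Fifty versions that each differ from the others by a single trailing line.
versions = [f"{base}\n# edit {k}\n".encode() for k in range(50)]

separate = sum(len(zlib.compress(v)) for v in versions)   # one compressed blob per version
together = len(zlib.compress(b"".join(versions)))         # versions can reference each other

print(f"compressed separately: {separate} bytes")
print(f"compressed together:   {together} bytes")
```

Stored separately, every version pays full price. Stored together, each additional version costs little more than the description of what changed.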

Unreachable objects (commits orphaned by a hard reset, blobs from a dropped rebase) stay in the object store for as long as a reflog entry still references them: 30 days by default for commits no longer reachable from a branch tip. This is your safety net. If you realize three weeks after a hard reset that you needed something, the reflog and the object store still have it. Once the reflog entry has expired and the object is more than two weeks old, git gc prunes it for real. That's the actual point of no return. Understanding this window, and that it exists at all, transforms how you feel about "dangerous" operations like reset and rebase. They're reversible. Almost everything in Git is reversible, as long as you know where to look.

Wrapping Up

If you're still with me, thank you. I hope it was worth the trip.

We started with the realization that Git isn't a diff tracker—it's a snapshot machine. We built up the object model: blobs that store pure content, trees that give blobs names and hierarchy, commits that tie trees to history and to each other. We saw how those commits form a Directed Acyclic Graph, how branches are just text files pointing to commits, how HEAD is just a pointer to a branch, and how the three trees—working directory, staging area, repository—mediate everything we do each day. We walked through merge strategies and saw why rebase rewrites history. We explored the rescue tools: reflog, bisect, cherry-pick, interactive rebase, stash, and reset. And we looked at what makes ML projects different: the large-file problem, DVC, Git LFS, and the importance of commit messages that tell a story.

My hope is that the next time you see "detached HEAD state" in your terminal, or find yourself in the middle of a merge conflict at midnight, instead of panicking and cloning fresh, you'll know exactly what's happening in the graph. You'll know which object contains which content, where HEAD is pointing, and what tool gets you back to safe ground. And you'll fix it with confidence.

Resources

Pro Git (git-scm.com/book) — The definitive reference. Free online. Chapters 1–3 cover the daily workflow; Chapter 10, "Git Internals," is wildly good and covers everything we discussed here in even more depth. Read the internals chapter. You won't regret it.

"Git from the Bottom Up" by John Wiegley — A short essay that builds Git from the object model upward, the same direction we went here. If this section resonated with you, Wiegley's piece will too. It's available free online and takes about an hour to read.

DVC documentation (dvc.org) — The best resource for ML-specific version control. Their tutorials are genuinely practical and well-written. Start with "Get Started" and follow the pipeline tutorial—it clicks faster than you'd expect.

"Think Like a Git" (think-like-a-git.net) — A short, opinionated guide that focuses entirely on the graph model. The sections on reachability are particularly good. Unforgettable once you've read it, because it reframes every Git command as a graph operation.

Oh Shit, Git!? (ohshitgit.com) — For when everything goes wrong and you don't have time to look at diagrams. Plain English, no judgment, just solutions. Bookmark it. You'll need it eventually.
