The moment you realize you've committed a 2GB dataset, a leaked API key, or six months' worth of generated build artifacts to your Git repository, your first instinct — `git rm` followed by a new commit — is the wrong one. The file is gone from `HEAD`, but it's still sitting in every clone, every fetch, and every backup, because Git's content-addressable storage keeps every blob it has ever seen.

Removing files from Git history means rewriting history. That's a destructive operation with real consequences: every collaborator's local clone diverges from the remote, every deployment pipeline tied to commit hashes loses its anchor, and any secret that was committed must be treated as compromised regardless of whether the rewrite succeeds. This tutorial walks through doing it correctly with the modern toolchain, what to do afterwards, and when to choose a different approach entirely.

---

## Why `git rm` Isn't Enough

Git stores every version of every tracked file as an immutable blob, addressable by SHA-1 hash. When you commit a file, Git writes a blob into `.git/objects/`. When you remove the file with `git rm` and commit the deletion, Git writes a *new* tree that no longer references the blob — but the blob itself stays.

```mermaid
flowchart LR
    A["commit 1<br>adds 500MB file"] --> B["commit 2<br>edits other files"]
    B --> C["commit 3<br>git rm 500MB file"]
    C --> D["HEAD"]
    A -.-> E["blob still in<br>.git/objects/"]
    B -.-> E
    style E fill:#ffe5e5
```

Anyone who clones the repo still pulls that blob. `git gc` won't delete it because it's still reachable through commits 1 and 2. The only way to reclaim that space, or to make a leaked secret unreachable, is to rewrite every commit that contains the file, along with all of their descendants, so the blob is no longer referenced anywhere in history.
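To see this for yourself, here is a minimal sketch that builds a throwaway repository in a temp directory, commits a file, deletes it with `git rm`, and then asks Git whether the blob still exists. The file name `big.bin` and the `demo` identity are placeholders chosen for illustration only.

```bash
# Throwaway demo: commit a file, delete it with `git rm`, then look for the blob.
export GIT_AUTHOR_NAME=demo GIT_AUTHOR_EMAIL=demo@example.com
export GIT_COMMITTER_NAME=demo GIT_COMMITTER_EMAIL=demo@example.com

demo=$(mktemp -d)
git init -q "$demo"
cd "$demo"

printf 'pretend this is 500MB\n' > big.bin
git add big.bin
git commit -q -m "add big.bin"
blob=$(git rev-parse HEAD:big.bin)   # object ID of the file's content

git rm -q big.bin
git commit -q -m "remove big.bin"

# The file is gone from HEAD's tree...
if git cat-file -e HEAD:big.bin 2>/dev/null; then in_head=yes; else in_head=no; fi
# ...but the blob object is still in the object database.
if git cat-file -e "$blob" 2>/dev/null; then blob_status="still present"; else blob_status="gone"; fi

echo "in HEAD: $in_head, blob: $blob_status"
```

The last line should report that the path is no longer in `HEAD` while the blob is still present, which is exactly the situation a history rewrite is meant to fix.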

---

## The Three Tools (And Which to Use)

Git ships with `git filter-branch` for this purpose, but the Git project itself now [recommends against using it](https://git-scm.com/docs/git-filter-branch#_warning) — it's slow, has subtle correctness bugs, and the maintainers point users elsewhere. The modern options are:

| Tool | Language | When to use | Notes |
|------|----------|-------------|-------|
| **`git filter-repo`** | Python | Default choice for almost every case | Written by Elijah Newren, a long-time Git contributor, and recommended in Git's own `filter-branch` documentation. Fast, accurate, supports complex rewrites. |
| **BFG Repo-Cleaner** | Scala (runs on the JVM) | Quick one-shot cleanups, no Python available | Simpler CLI for the 80% case. Limited compared to `filter-repo`. |
| `git filter-branch` | Built-in | Avoid | Discouraged by its own documentation. Listed only because old StackOverflow answers still recommend it. |

Use `git filter-repo` unless you have a specific reason not to. Everything in the sections that follow assumes it.

---

## Setup: Always Mirror-Clone First

History rewrites are not undoable. Before touching anything, work on a fresh mirror clone that you can throw away if something goes wrong:

```bash
git clone --mirror git@github.com:your-org/your-repo.git repo-rewrite.git
cd repo-rewrite.git
```

A mirror clone is a bare repository that contains every ref from the remote — branches, tags, notes, everything. You'll do the rewrite here and force-push from this clone, leaving your working clones untouched until the rewrite is verified.
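Before rewriting anything, it can help to record what the mirror clone contains so you have something to compare against afterwards. The sketch below uses a small local repository as a stand-in for the real remote (so it runs anywhere), and the snapshot file name `refs-before.txt` is an arbitrary choice, not something `git filter-repo` expects.

```bash
# Stand-in for the real remote: one commit, one extra branch, one tag.
export GIT_AUTHOR_NAME=demo GIT_AUTHOR_EMAIL=demo@example.com
export GIT_COMMITTER_NAME=demo GIT_COMMITTER_EMAIL=demo@example.com

src=$(mktemp -d)
git init -q "$src"
git -C "$src" commit -q --allow-empty -m "initial"
git -C "$src" branch feature
git -C "$src" tag v1.0

# Mirror-clone it, exactly as you would with the real URL.
work=$(mktemp -d)
git clone -q --mirror "$src" "$work/repo-rewrite.git"
cd "$work/repo-rewrite.git"

# Confirm the clone is bare and snapshot every ref with its object ID.
is_bare=$(git rev-parse --is-bare-repository)
git for-each-ref --format='%(objectname) %(refname)' > ../refs-before.txt
ref_count=$(wc -l < ../refs-before.txt | tr -d ' ')
commit_count=$(git rev-list --count --all)

echo "bare=$is_bare refs=$ref_count commits=$commit_count"
```

After the rewrite, running the same `git for-each-ref` and `git rev-list --count --all` commands again shows which refs moved and whether any commits were dropped.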

Install `git filter-repo` if you haven't already. On macOS via Homebrew:

```bash
brew install git-filter-repo
```

Or via pip:

```bash
pip install git-filter-repo
```

Verify:

```bash
git filter-repo --version
```

---

## Removing a Specific File or Directory

The most common case: a single file or directory got committed and you want it gone from every commit that ever contained it.

```bash
git filter-repo --path path/to/large-file.zip --invert-paths
```

`--invert-paths` flips the meaning of `--path` from "keep only these paths" to "remove these paths and keep everything else". You can pass `--path` multiple times to remove several files in one pass:

```bash
git filter-repo \
  --path secrets/api-keys.json \
  --path data/training-set.csv \
  --path build/2024-archive.tar.gz \
  --invert-paths
```

Paths are relative to the repository root. For directories, the path is treated as a prefix automatically, so everything beneath it is removed:

```bash
git filter-repo --path node_modules/ --invert-paths
```

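After running, `git filter-repo` rewrites every commit that touched those paths, drops commits that become empty as a result, and updates the refs in place. By default it then expires reflogs and repacks the repository, so the removed objects are gone from this clone as soon as the command finishes. Your recovery path is the untouched remote and your existing working clones, which is why the rewrite happens in a throwaway mirror.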
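A quick way to check the result is to ask Git whether any commit on any ref still touches a path. The sketch below defines a small helper for that; `path_in_history` is a hypothetical name, and the demo runs against a throwaway repository (without `git filter-repo`) simply to show what "present" and "absent" look like. In the rewritten mirror clone, every path you removed should come back as absent.

```bash
# Throwaway repo: a file is committed, then deleted in a later commit.
export GIT_AUTHOR_NAME=demo GIT_AUTHOR_EMAIL=demo@example.com
export GIT_COMMITTER_NAME=demo GIT_COMMITTER_EMAIL=demo@example.com

repo=$(mktemp -d)
git init -q "$repo"
cd "$repo"
mkdir data
printf 'id,label\n' > data/training-set.csv
git add data
git commit -q -m "add training set"
git rm -q -r data
git commit -q -m "remove training set"

# Prints "present" if any commit reachable from any ref touched the path.
path_in_history() {
  if [ -n "$(git log --all --format=%H -- "$1" | head -n 1)" ]; then
    echo present
  else
    echo absent
  fi
}

deleted_path=$(path_in_history data/training-set.csv)   # deleted, but in history
never_added=$(path_in_history path/to/large-file.zip)   # never committed here

echo "data/training-set.csv: $deleted_path"
echo "path/to/large-file.zip: $never_added"
```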

---

## Removing All Files Above a Size Threshold

When you don't know which files are bloating the repo, let `git filter-repo` find them. First, run the analyzer:

```bash
git filter-repo --analyze
```

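In the bare mirror clone this writes reports to `filter-repo/analysis/` (in a regular clone, `.git/filter-repo/analysis/`). The `path-all-sizes.txt` and `path-deleted-sizes.txt` files are the most useful: the first lists every path Git has ever stored, sorted by total size across history, and the second covers only paths that have since been deleted, which are usually the safest candidates for removal.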

Once you've identified the offenders, you can either pass them to `--path --invert-paths` as above, or strip everything above a size threshold:

```bash
git filter-repo --strip-blobs-bigger-than 10M
```

This removes any blob larger than 10MB from history, regardless of path. Useful for the "we accumulated four years of accidentally-committed binaries" scenario.
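If you want a second opinion on which blobs would cross a threshold, plain Git can list them too. The following is a sketch using only built-in plumbing commands; the helper name `largest_blobs` is made up for this example, and the sizes it prints are uncompressed object sizes in bytes. The demo repository and its two files are placeholders.

```bash
# Throwaway repo with one small and one larger file.
export GIT_AUTHOR_NAME=demo GIT_AUTHOR_EMAIL=demo@example.com
export GIT_COMMITTER_NAME=demo GIT_COMMITTER_EMAIL=demo@example.com

repo=$(mktemp -d)
git init -q "$repo"
cd "$repo"
printf 'tiny\n' > small.txt
dd if=/dev/zero of=big.bin bs=1024 count=2 2>/dev/null   # 2048 bytes
git add small.txt big.bin
git commit -q -m "add files"

# List blobs reachable from any ref as "<size> <path>", largest first.
# $1 is how many lines to print (defaults to 10).
largest_blobs() {
  git rev-list --objects --all |
    git cat-file --batch-check='%(objecttype) %(objectsize) %(rest)' |
    awk '$1 == "blob" { sub(/^blob /, ""); print }' |
    sort -rn |
    head -n "${1:-10}"
}

largest=$(largest_blobs 1)
echo "largest blob: $largest"
```

Run inside the mirror clone, `largest_blobs 20` gives a short list to compare against the analyzer reports before you pick a value for `--strip-blobs-bigger-than`.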

---

## Removing a Leaked Secret

If the file you committed contains a credential — API key, password, private key — **rotate the credential before doing anything else**. Treat it as compromised the moment it lands on a public host. Rewriting history hides the file from new clones, but you can't recall what's already on Google's cache, GitHub's archives, or someone's local fork.

For the rewrite itself, `git filter-repo` can replace strings inline rather than removing whole files, which is useful when only a few lines of a config file leaked:

```bash
echo 'AKIAIOSFODNN7EXAMPLE==>REDACTED' > replacements.txt
echo 'private_key_data_here==>REDACTED' >> replacements.txt
git filter-repo --replace-text replacements.txt
```

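Each line follows the format `original==>replacement`; a line with no `==>` falls back to the default replacement, `***REMOVED***`. The original string is replaced in every blob across history. Commit messages are not touched by `--replace-text`; `--replace-message` takes a file in the same format for those.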
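To confirm a string is really gone, you can search history with Git's pickaxe option, `git log -S`, which lists commits that change how many times a string occurs. The sketch below wraps it in a helper with the hypothetical name `commits_touching_string` and runs it on a throwaway repository where a key is added and later removed. In a successfully rewritten clone, the same search should return nothing.

```bash
# Throwaway repo: a key is hardcoded in one commit and removed in the next.
export GIT_AUTHOR_NAME=demo GIT_AUTHOR_EMAIL=demo@example.com
export GIT_COMMITTER_NAME=demo GIT_COMMITTER_EMAIL=demo@example.com

repo=$(mktemp -d)
git init -q "$repo"
cd "$repo"
printf '# demo\n' > README.md
git add README.md
git commit -q -m "initial"
printf 'API_KEY=AKIAIOSFODNN7EXAMPLE\n' > config.env
git add config.env
git commit -q -m "add config"
printf 'API_KEY=from-environment\n' > config.env
git commit -q -am "stop hardcoding the key"

# Print the ID of every commit, on any ref, that adds or removes the string.
commits_touching_string() {
  git log --all --format=%H -S"$1"
}

hits=$(commits_touching_string AKIAIOSFODNN7EXAMPLE | wc -l | tr -d ' ')
echo "commits that add or remove the string: $hits"
```

Here the count is 2 (the commit that introduced the key and the one that removed it), even though the key is absent from the current `HEAD`.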

---

## Force-Pushing the Rewritten History

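After `git filter-repo` finishes, the local refs in your mirror clone are correct but the remote is unchanged. By default `git filter-repo` also removes the `origin` remote once it has rewritten history, as a guard against pushing by accident, so re-add it and then force-push:

```bash
git remote add origin git@github.com:your-org/your-repo.git
git push origin --force --all
git push origin --force --tags
```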

`--force-with-lease` adds nothing here: a mirror clone has no remote-tracking refs for the lease to compare against. That is also why collaborators need to stop pushing during the rewrite window, because anything pushed after you mirror-cloned will be overwritten. If your hosting platform protects the default branch, you'll need to temporarily disable branch protection for the push, then re-enable it.
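Once the push completes, it is worth confirming that the remote's branches and tags point at the same objects as your local ones. The sketch below compares `git for-each-ref` against `git ls-remote --refs`; it uses a local bare repository as a stand-in for the hosted remote so it can run anywhere, and the variable names are illustrative.

```bash
# Local bare repo standing in for the hosted remote.
export GIT_AUTHOR_NAME=demo GIT_AUTHOR_EMAIL=demo@example.com
export GIT_COMMITTER_NAME=demo GIT_COMMITTER_EMAIL=demo@example.com

remote_dir=$(mktemp -d)/remote.git
git init -q --bare "$remote_dir"

work=$(mktemp -d)
git init -q "$work"
cd "$work"
git commit -q --allow-empty -m "rewritten history"
git tag v1.0
git remote add origin "$remote_dir"
git push -q --force --all origin
git push -q --force --tags origin

# Compare "<object id><TAB><ref name>" pairs from both sides.
local_refs=$(git for-each-ref --format='%(objectname)%09%(refname)' refs/heads refs/tags | sort)
remote_refs=$(git ls-remote --refs origin | sort)

if [ "$local_refs" = "$remote_refs" ]; then
  push_status="in sync"
else
  push_status="mismatch"
fi
echo "remote is $push_status with local refs"
```

A mismatch usually means a protected branch rejected the push or someone pushed during the rewrite window. Hosts may also expose extra read-only refs (pull request refs, for example) that would show up on the remote side of this comparison.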

---

## What Collaborators and Deployments Must Do Next

Anyone with an existing clone is now on a divergent history. They have two choices:

**Option 1 — fresh clone (simplest):**

```bash
cd ..
rm -rf old-clone
git clone git@github.com:your-org/your-repo.git
```

**Option 2 — rebase existing work onto the rewritten history:**

```bash
git fetch origin
git rebase --onto origin/main <old-base-commit> <local-branch>
```

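This is fiddly and error-prone. Tell people to fresh-clone unless they have unpushed local commits that are expensive to recreate. Either way, nobody should `git pull` or merge an old branch into the rewritten one: that reconnects the old commits, and the next push sends the removed files straight back to the remote.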

For deployment pipelines, the implications depend on how your deployment tool tracks revisions. If your pipeline pins to specific commit SHAs (in dashboards, audit logs, or rollback histories), those SHAs no longer exist in the rewritten history. [DeployHQ deployments](https://www.deployhq.com/features/automatic-deployments) keyed to webhook events from new pushes will continue to work, but a "rollback to commit `abc123`" action where `abc123` predates the rewrite will fail until you re-trigger fresh deployments and let new SHAs accumulate.

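Open pull requests are the worst-affected: their diffs and review comments are anchored to commits that no longer exist on the rewritten branches, and depending on the platform a PR may be closed, show unrelated changes, or keep pointing at the old objects. Plan to re-open or recreate them after the rewrite.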

---

## Garbage Collection and Reclaiming Disk Space

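In the mirror clone, `git filter-repo` has already taken care of this: by default it expires reflogs and runs garbage collection at the end of the rewrite. Any other clone that fetched the old history keeps the removed blobs until its branches have been moved onto the rewritten history and its reflogs have expired. To shrink such a clone on disk, or to repack by hand: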

```bash
git reflog expire --expire=now --all
git gc --prune=now --aggressive
```
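The reflog step is what makes the difference, and a small experiment shows why. The sketch below creates a commit on a scratch branch, deletes the branch so the commit is reachable only through the `HEAD` reflog, and then runs `git gc` before and after expiring the reflog. Branch and file names are placeholders.

```bash
# Throwaway repo: a blob that ends up reachable only through the reflog.
export GIT_AUTHOR_NAME=demo GIT_AUTHOR_EMAIL=demo@example.com
export GIT_COMMITTER_NAME=demo GIT_COMMITTER_EMAIL=demo@example.com

repo=$(mktemp -d)
git init -q "$repo"
cd "$repo"
git commit -q --allow-empty -m "initial"
main_branch=$(git symbolic-ref --short HEAD)

git checkout -q -b scratch
printf 'large payload\n' > big.bin
git add big.bin
git commit -q -m "add big.bin"
blob=$(git rev-parse HEAD:big.bin)

git checkout -q "$main_branch"
git branch -q -D scratch          # no ref points at the commit any more

# 1) gc alone: the HEAD reflog still references the commit, so the blob stays.
git gc -q --prune=now
if git cat-file -e "$blob" 2>/dev/null; then with_reflog=kept; else with_reflog=pruned; fi

# 2) expire the reflog first, then gc: now nothing references the blob.
git reflog expire --expire=now --all
git gc -q --prune=now
if git cat-file -e "$blob" 2>/dev/null; then after_expire=kept; else after_expire=pruned; fi

echo "gc with reflog intact: $with_reflog, after reflog expire: $after_expire"
git count-objects -vH             # on-disk size summary
```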

On the remote, GitHub and GitLab run garbage collection on their own schedule, so the repository size reported by the platform won't drop instantly. On GitHub, old commits can also stay reachable by SHA and through pull request refs until GitHub Support removes them, so opening a support ticket is the fastest way to get the remote cleaned up.

---

## When Not to Rewrite History

History rewrites are a heavy hammer. They're the right tool when:

- A large file or secret was committed recently and the repo is private or has few external collaborators
- You're shrinking a bloated repo before migrating it to a new host
- A regulatory requirement forces removal (GDPR, copyright takedown)

They're the wrong tool when:

- The repo is public and has been forked or cloned widely — your rewrite doesn't propagate to forks, and the leaked content is already in the wild
- The "large file" is actually a working asset that needs to stay tracked — use [Git LFS](https://www.deployhq.com/git/managing-git-lfs-and-deployhq) instead, which keeps the file accessible but stops it from bloating clones
- The cost of disrupting every collaborator outweighs the disk-space gain

For ongoing large-file management — design assets, ML weights, compiled binaries that genuinely belong in the repo — Git LFS is the right answer. History rewriting is an emergency tool, not a workflow.

---

## A Pre-Flight Checklist

Before running `git filter-repo` on a real repository:

1. Mirror-clone to a throwaway directory.
2. Run `git filter-repo --analyze` and review the size reports.
3. Record the current `HEAD` SHA and the commit count (`git rev-list --count --all`) so you can compare before and after.
4. Tell collaborators to push outstanding work and stop pushing during the rewrite window.
5. If a secret leaked, rotate it now — don't wait for the rewrite.
6. Run the rewrite and inspect the output for unexpected changes.
7. Force-push, then verify remote history with a fresh clone before declaring victory.
8. Run `git gc` locally and ask the host platform to GC the remote.

The whole process takes 10–30 minutes for a typical repository. Skipping the mirror clone is the most common way teams turn a routine cleanup into a recovery operation.
