Last updated on 7th May 2026

Removing Large Files from Git History

The moment you realize you've committed a 2GB dataset, a leaked API key, or six months' worth of generated build artifacts to your Git repository, your first instinct — git rm followed by a new commit — is the wrong one. The file is gone from HEAD, but it's still sitting in every clone, every fetch, and every backup, because Git's content-addressable storage keeps every blob it has ever seen.

Removing files from Git history means rewriting history. That's a destructive operation with real consequences: every collaborator's local clone diverges from the remote, every deployment pipeline tied to commit hashes loses its anchor, and any secret that was committed must be treated as compromised regardless of whether the rewrite succeeds. This tutorial walks through doing it correctly with the modern toolchain, what to do afterwards, and when to choose a different approach entirely.


Why git rm Isn't Enough

Git stores every version of every tracked file as an immutable blob, addressable by SHA-1 hash. When you commit a file, Git writes a blob into .git/objects/. When you remove the file with git rm and commit the deletion, Git writes a new tree that no longer references the blob — but the blob itself stays.

flowchart LR
    A["commit 1<br>adds 500MB file"] --> B["commit 2<br>edits other files"]
    B --> C["commit 3<br>git rm 500MB file"]
    C --> D["HEAD"]
    A -.-> E["blob still in<br>.git/objects/"]
    B -.-> E
    C -.-> E
    style E fill:#ffe5e5

Anyone who clones the repo still pulls that blob. git gc won't delete it because it's reachable from commit 1. The only way to reclaim that space — or to make a leaked secret unreachable — is to rewrite the commits that introduced the file so the blob is no longer referenced anywhere in history.
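The behaviour is easy to see in a throwaway repository. This sketch (file name, repo name, and identity are all made up for illustration) commits a stand-in file, deletes it with git rm, and shows the blob is still retrievable:

```shell
# Illustration only: demo repo, file name, and identity are made up.
set -e
cd "$(mktemp -d)"
git init -q demo && cd demo
git config user.email demo@example.com
git config user.name Demo

printf 'pretend this is a 500MB dataset\n' > big.bin
git add big.bin
git commit -qm 'add big file'
blob=$(git rev-parse HEAD:big.bin)    # hash of the stored blob

git rm -q big.bin
git commit -qm 'remove big file'

# The deletion commit dropped the tree entry, not the blob itself:
git cat-file -e "$blob" && echo 'blob still stored'
```

git cat-file -e exits successfully because the blob is still in the object database, reachable through the first commit.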


The Three Tools (And Which to Use)

Git ships with git filter-branch for this purpose, but the Git project itself now recommends against using it — it's slow, has subtle correctness bugs, and the maintainers point users elsewhere. The modern options are:

  • git filter-repo (Python): the default choice for almost every case. Written by a Git maintainer; fast, accurate, and supports complex rewrites.
  • BFG Repo-Cleaner (Java): quick one-shot cleanups, or when no Python is available. A simpler CLI for the 80% case, but limited compared to filter-repo.
  • git filter-branch (built-in): avoid. Deprecated, and listed only because old StackOverflow answers still recommend it.
Use git filter-repo unless you have a specific reason not to. Everything in the next section assumes it.


Setup: Always Mirror-Clone First

History rewrites are not undoable. Before touching anything, work on a fresh mirror clone that you can throw away if something goes wrong:

git clone --mirror git@github.com:your-org/your-repo.git repo-rewrite.git
cd repo-rewrite.git

A mirror clone is a bare repository that contains every ref from the remote — branches, tags, notes, everything. You'll do the rewrite here and force-push from this clone, leaving your working clones untouched until the rewrite is verified.
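You can see what a mirror clone carries with git for-each-ref. In this runnable sketch a local path stands in for the real remote URL:

```shell
# Illustration only: a local path stands in for the real remote URL.
set -e
cd "$(mktemp -d)"
git init -q upstream && cd upstream
git config user.email demo@example.com
git config user.name Demo
git commit -q --allow-empty -m 'initial commit'
git tag v1.0
cd ..

git clone -q --mirror upstream repo-rewrite.git
cd repo-rewrite.git
git for-each-ref --format='%(refname)'    # every branch, tag, note, etc.
```

The output includes refs/tags/v1.0 alongside the branch refs, and git rev-parse --is-bare-repository reports true: a mirror clone has no working tree.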

Install git filter-repo if you haven't already. On macOS via Homebrew:

brew install git-filter-repo

Or via pip:

pip install git-filter-repo

Verify:

git filter-repo --version


Removing a Specific File or Directory

The most common case: a single file or directory got committed and you want it gone from every commit that ever contained it.

git filter-repo --path path/to/large-file.zip --invert-paths

--invert-paths flips the meaning of --path from "keep only these paths" to "remove these paths and keep everything else". You can pass --path multiple times to remove several files in one pass:

git filter-repo \
  --path secrets/api-keys.json \
  --path data/training-set.csv \
  --path build/2024-archive.tar.gz \
  --invert-paths

For directories, the path is treated as a prefix automatically:

git filter-repo --path node_modules/ --invert-paths

After running, git filter-repo rewrites every commit that touched those paths, prints a summary of what changed, and updates the refs in place. The original objects are still in .git/objects/ until garbage collection runs — that's by design, in case you need to recover.
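A quick way to verify a rewrite is git log with a path filter. The sketch below builds a toy repo where the check finds the two commits that touched a file; after a successful filter-repo run, the same command prints nothing:

```shell
# Illustration only: toy repo where large-file.zip was added then removed.
set -e
cd "$(mktemp -d)"
git init -q demo && cd demo
git config user.email demo@example.com
git config user.name Demo
echo data > large-file.zip
git add large-file.zip
git commit -qm 'add large file'
git rm -q large-file.zip
git commit -qm 'remove large file'

# Lists every commit in any ref that touched the path. Here it finds
# both commits; after `git filter-repo --path large-file.zip
# --invert-paths`, the same command would print nothing.
git log --all --oneline -- large-file.zip
```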


Removing All Files Above a Size Threshold

When you don't know which files are bloating the repo, let git filter-repo find them. First, run the analyzer:

git filter-repo --analyze

This generates a .git/filter-repo/analysis/ directory with reports ranked by size. The path-all-sizes.txt and path-deleted-sizes.txt files are the most useful — they show every path Git has ever stored, sorted by total size across history.
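If filter-repo isn't installed yet, plain git can produce a similar ranking. The pipeline below lists the largest blobs across history; the surrounding demo repo (with a made-up stand-in file) is only there so the sketch runs as-is:

```shell
# Illustration only: the demo repo makes the pipeline runnable as-is.
set -e
cd "$(mktemp -d)"
git init -q demo && cd demo
git config user.email demo@example.com
git config user.name Demo
head -c 4096 /dev/zero > big.bin    # stand-in offender
echo small > small.txt
git add .
git commit -qm 'add files'

# Rank every blob Git has ever stored, largest first (size in bytes):
git rev-list --objects --all |
  git cat-file --batch-check='%(objecttype) %(objectname) %(objectsize) %(rest)' |
  awk '$1 == "blob" { print $3, $4 }' |
  sort -rn | head -n 10
```

rev-list --objects emits every object with its path, cat-file --batch-check annotates each with type and size, and awk keeps only the blobs.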

Once you've identified the offenders, you can either pass them to --path --invert-paths as above, or strip everything above a size threshold:

git filter-repo --strip-blobs-bigger-than 10M

This removes any blob larger than 10MB from history, regardless of path. Useful for the "we accumulated four years of accidentally-committed binaries" scenario.


Removing a Leaked Secret

If the file you committed contains a credential — API key, password, private key — rotate the credential before doing anything else. Treat it as compromised the moment it lands on a public host. Rewriting history hides the file from new clones, but you can't recall what's already on Google's cache, GitHub's archives, or someone's local fork.

For the rewrite itself, git filter-repo can replace strings inline rather than removing whole files, which is useful when only a few lines of a config file leaked:

echo 'AKIAIOSFODNN7EXAMPLE==>REDACTED' > replacements.txt
echo 'private_key_data_here==>REDACTED' >> replacements.txt
git filter-repo --replace-text replacements.txt

Each line follows the format original==>replacement. The original string is replaced everywhere in every blob across history.
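The rules don't have to be literal strings: git filter-repo also accepts regex: and glob: prefixes, and a line with no ==> replaces its matches with ***REMOVED***. For example, to redact anything shaped like an AWS access key ID (the AKIA pattern here is illustrative):

```shell
# Hypothetical rule: AKIA[0-9A-Z]{16} is the AWS access-key-ID shape.
cat > replacements.txt <<'EOF'
regex:AKIA[0-9A-Z]{16}==>REDACTED
EOF
```

Then run git filter-repo --replace-text replacements.txt as before.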


Force-Pushing the Rewritten History

After git filter-repo finishes, the local refs in your mirror clone are correct but the remote is unchanged. One catch: git filter-repo removes the origin remote as a safety measure against accidental pushes, so re-add it before force-pushing:

git remote add origin git@github.com:your-org/your-repo.git
git push --force --all
git push --force --tags

--force-with-lease doesn't apply here because you're pushing from a fresh mirror clone, not a working clone with a tracked upstream. If your hosting platform protects the default branch, you'll need to temporarily disable branch protection for the push, then re-enable it.


What Collaborators and Deployments Must Do Next

Anyone with an existing clone is now on a divergent history. They have two choices:

Option 1 — fresh clone (simplest):

cd ..
rm -rf old-clone
git clone git@github.com:your-org/your-repo.git

Option 2 — rebase existing work onto the rewritten history:

git fetch origin
git rebase --onto origin/main <old-base-commit> <local-branch>

This is fiddly and error-prone. Tell people to fresh-clone unless they have uncommitted local work that's expensive to recreate.

For deployment pipelines, the implications depend on how your deployment tool tracks revisions. If your pipeline pins to specific commit SHAs (in dashboards, audit logs, or rollback histories), those SHAs no longer exist in the rewritten history. DeployHQ deployments keyed to webhook events from new pushes will continue to work, but a "rollback to commit abc123" action where abc123 predates the rewrite will fail until you re-trigger fresh deployments and let new SHAs accumulate.

Open pull requests are the worst-affected: a rewrite leaves them pointing at commits that no longer exist, and platforms may close them or render their diffs unusable. Plan to re-open or recreate them after the rewrite.


Garbage Collection and Reclaiming Disk Space

The blobs you removed are still in .git/objects/ after the rewrite; git filter-repo only updates refs. To actually shrink the repository on disk:

git reflog expire --expire=now --all
git gc --prune=now --aggressive
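To confirm the prune worked, compare git count-objects output before and after; the throwaway demo repo below is only there so the command runs as-is:

```shell
# Illustration only: the demo repo makes the command runnable as-is.
set -e
cd "$(mktemp -d)"
git init -q demo && cd demo
git config user.email demo@example.com
git config user.name Demo
git commit -q --allow-empty -m 'initial commit'

# Human-readable object-store stats; watch the `size` and `size-pack`
# lines drop after the expire + gc pair:
git count-objects -vH
```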

On the remote, GitHub and GitLab run garbage collection on their own schedule, so the repository size shown on the platform won't drop instantly. For GitHub, opening a support ticket asking them to run GC is the fastest way to reclaim the space.


When Not to Rewrite History

History rewrites are a heavy hammer. They're the right tool when:

  • A large file or secret was committed recently and the repo is private or has few external collaborators
  • You're shrinking a bloated repo before migrating it to a new host
  • A regulatory requirement forces removal (GDPR, copyright takedown)

They're the wrong tool when:

  • The repo is public and has been forked or cloned widely — your rewrite doesn't propagate to forks, and the leaked content is already in the wild
  • The "large file" is actually a working asset that needs to stay tracked — use Git LFS instead, which keeps the file accessible but stops it from bloating clones
  • The cost of disrupting every collaborator outweighs the disk-space gain

For ongoing large-file management — design assets, ML weights, compiled binaries that genuinely belong in the repo — Git LFS is the right answer. History rewriting is an emergency tool, not a workflow.


A Pre-Flight Checklist

Before running git filter-repo on a real repository:

  1. Mirror-clone to a throwaway directory.
  2. Run git filter-repo --analyze and review the size reports.
  3. Note the current HEAD SHA so you can compare counts before and after.
  4. Tell collaborators to push outstanding work and stop pushing during the rewrite window.
  5. If a secret leaked, rotate it now — don't wait for the rewrite.
  6. Run the rewrite and inspect the output for unexpected changes.
  7. Force-push, then verify remote history with a fresh clone before declaring victory.
  8. Run git gc locally and ask the host platform to GC the remote.
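For step 3, a commit count is a cheap before/after sanity check. Note the count can legitimately drop, since filter-repo prunes commits that become empty once their only files are removed. A runnable sketch with a throwaway demo repo:

```shell
# Illustration only: two empty commits in a demo repo.
set -e
cd "$(mktemp -d)"
git init -q demo && cd demo
git config user.email demo@example.com
git config user.name Demo
git commit -q --allow-empty -m one
git commit -q --allow-empty -m two

# Total commits reachable from any ref:
git rev-list --count --all    # prints 2
```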

The whole process takes 10–30 minutes for a typical repository. Skipping the mirror clone is the most common way teams turn a routine cleanup into a recovery operation.