Subbu Lakshmanan

Slimming Down Repo-1

How can we reduce the repo size when it exceeds the size limit imposed by the repository hosting site?

Most repository hosting providers (GitHub, Bitbucket) limit the size of a repository and offer LFS options for storing large files. Most repositories never reach these limits. However, a repository can grow toward them over a long period unless measures are taken to keep its size down.

Disclaimer:

  1. There are a lot of articles on what to commit to a repository and what to ignore. The assumption here is that you have followed the guidelines and do not store any runtime/build/temporary files, yet you still reached the size limit.
  2. A few of these commands permanently delete data. The suggested approach is to create a new clone of the repo, run the clean-up procedure there, and verify the results. If you are happy with the outcome, push the changes to your remote.
  3. Since the process involves re-writing history, you will need to force push the changes to the remote. So make sure to run these commands with the approval of the team and only when it's required (i.e., the Git repo storage limit is reached).
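
Point 2 can be sketched as a small shell helper. This is only an illustration: the function name scratch_clone, the "-backup.git" suffix, and both arguments are placeholders of mine, not part of any Git tooling. It keeps an untouched mirror next to the working clone, so there is something to restore from if the rewrite goes wrong.

```shell
# Hypothetical helper: make a working clone for the clean-up plus an
# untouched bare mirror as a fallback.
# Usage: scratch_clone <repo-url-or-path> <work-dir>
scratch_clone() {
  local src="$1" work="$2"
  # Bare copy of every ref; never run the clean-up in this one.
  git clone --quiet --mirror "$src" "$work-backup.git"
  # Ordinary clone in which the history rewrite is performed.
  git clone --quiet "$src" "$work"
}
```

All the commands in the rest of this post would then be run inside the working clone.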

Once you identify the biggest files, you can delete them and commit. However, this does not shrink the repository, because the original commits still reference the files and their data stays in the pack files.

One way to remove the large files and the packed data associated with them is to use git filter-branch along with the git reflog and git gc commands.

Before beginning, here are a few commands that we use to identify and remove bigger files.

  • To identify the size of the repo

    du -sch
    
  • To show the number of objects and the disk space they use

    git count-objects -vH
    
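  • Identify the 10 biggest files in the working tree (zsh only; the glob qualifiers include dotfiles, sort by size in descending order, and keep the first ten matches)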

    ls -ld -- **/*(DOL[1,10])
    
  • Identify the largest 'n' objects in a pack file (sort ascending by size and list the last 'n' items; here n = 2)

    git verify-pack -v .git/objects/pack/<pack-name>.idx | sort -k 3 -n | tail -n 2
    
  • Find the file path that belongs to a blob ID

    git rev-list --objects --all | grep <blob-id>
    
  • Remove the file from every commit and re-write the history (add -r to git rm if the path is a directory)

    git filter-branch --index-filter 'git rm --cached --ignore-unmatch <File-to-be-removed>' --tag-name-filter cat -- --all
    
  • To expire all reflog entries

    git reflog expire --expire=now --all
    
  • To clean up unnecessary files and optimize the local repository

    git gc --prune=now
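
The verify-pack and rev-list look-ups above can also be combined into a single pipeline that prints blob sizes together with their paths. The following is a minimal sketch rather than the method used in this post; the function name largest_blobs is hypothetical, and it assumes a Git version whose cat-file supports the %(rest) placeholder in --batch-check.

```shell
# Hypothetical helper: print "<blob-id> <size-in-bytes> <path>" for the
# N largest blobs reachable from any ref, smallest first.
# Usage: largest_blobs [N]   (N defaults to 10)
largest_blobs() {
  local n="${1:-10}"
  # rev-list prints "<id> <path>"; cat-file looks up the id and echoes
  # the path back through %(rest).
  git rev-list --objects --all |
    git cat-file --batch-check='%(objecttype) %(objectname) %(objectsize) %(rest)' |
    awk '$1 == "blob"' |   # keep blobs, drop commits and trees
    cut -d' ' -f2- |       # drop the object type column
    sort -k2,2n |          # sort ascending by size
    tail -n "$n"
}
```

Unlike verify-pack, this does not need the name of a pack file, and it also covers loose objects that have not been packed yet.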
    

Here's the sequence of actions I performed in one of my repositories to reduce its size. (I hadn't reached the storage limit; I performed these steps to demonstrate the idea.)

Identify

Identify the biggest file in the repo

▶ du -sch
257M    .
257M    total

▶ git verify-pack -v .git/objects/pack/pack-3c30d356e18bda774eb13dc9e53929012ec06800.idx | sort -k 3 -n | tail -n 2
f5ed007fc5ee61733ee9bec25fdeac3f0119644f blob   12362185 12166055 61415659
de3e5b333ba453655951cabdae20588419ef7fe0 blob   18025235 18030628 35112687

▶ git rev-list --objects --all | grep f5ed007fc5ee61733ee9bec25fdeac3f0119644f
f5ed007fc5ee61733ee9bec25fdeac3f0119644f Side_Projects/MemoryTiles/google-play-screenshots-v1-todoriliev.com.sketch/Data
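
In the verify-pack output, the third column is the object size in bytes, which is hard to read at a glance. As a small sketch, the byte counts can be converted to MiB with awk; the function name blobs_in_mib is made up for this example.

```shell
# Hypothetical helper: read `git verify-pack -v` output on stdin and print
# "<size> MiB <object-id>" for every blob, largest first.
blobs_in_mib() {
  awk '$2 == "blob" { printf "%.1f MiB %s\n", $3 / 1048576, $1 }' | sort -rn
}

# Usage inside a repository (the pack name is left as a placeholder):
#   git verify-pack -v .git/objects/pack/<pack-name>.idx | blobs_in_mib | head
```

For the two blobs listed above, this gives 17.2 MiB and 11.8 MiB respectively.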

Removal

git filter-branch --index-filter 'git rm --cached --ignore-unmatch Side_Projects/MemoryTiles/google-play-screenshots-v1-todoriliev.com.sketch' --tag-name-filter cat -- --all
WARNING: git-filter-branch has a glut of gotchas generating mangled history
         rewrites.  Hit Ctrl-C before proceeding to abort, then use an
         alternative filtering tool such as 'git filter-repo'
         (https://github.com/newren/git-filter-repo/) instead.  See the
         filter-branch manual page for more details; to squelch this warning,
         set FILTER_BRANCH_SQUELCH_WARNING=1.
Proceeding with filter-branch...
...
...
Ref 'refs/heads/main' was rewritten
Ref 'refs/remotes/origin/ESI_Archive' was rewritten
Ref 'refs/remotes/origin/main' was rewritten
WARNING: Ref 'refs/remotes/origin/main' is unchanged
Ref 'refs/stash' was rewritten
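
Before moving on, it can help to check whether any ref still reaches a commit that touches the removed path. The snippet below is a minimal sketch of such a check; the function name path_in_history is hypothetical. Note that right after filter-branch it is expected to still report the path, because the backup refs under refs/original point at the old commits.

```shell
# Hypothetical helper: succeed (exit 0) if at least one commit reachable
# from any ref adds, changes, or deletes the given path.
# Usage: path_in_history <path>
path_in_history() {
  [ -n "$(git rev-list --all -n 1 -- "$1")" ]
}

# Example (run inside the repository):
#   if path_in_history path/to/file; then echo "still referenced"; fi
```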

Clean-up

The git filter-branch command creates backup refs under refs/original (stored in .git/refs/original). These refs must be deleted, and the reflog expired, before the removed objects become unreachable. A garbage collection then deletes the unreachable objects and repacks the repository.

▶ git for-each-ref --format="%(refname)" refs/original/ | while read -r ref; do git update-ref -d "$ref"; done

▶ git reflog expire --expire=now --all

▶ git gc --prune=now
Enumerating objects: 7802, done.
Counting objects: 100% (7802/7802), done.
Delta compression using up to 10 threads
Compressing objects: 100% (3950/3950), done.
Writing objects: 100% (7802/7802), done.
Total 7802 (delta 4367), reused 6593 (delta 3732), pack-reused 0

▶ git verify-pack -v .git/objects/pack/pack-358d01cc2715da3d0f49ccea3e5d3352e596e7c0.idx | sort -k 3 -n | tail -n 2
20ca85599c3decf2a972b0ede24ac0a8231b4cd9 blob   7218535 239853 111309837
b9785b3f3b5cddb633e7b2204d08c4bfd32ca501 blob   7608065 7502636 55950582

▶ git push
To github.com:subbramanil/my-dev-notes.git
 ! [rejected]        main -> main (fetch first)
error: failed to push some refs to 'github.com:subbramanil/my-dev-notes.git'
hint: Updates were rejected because the remote contains work that you do
hint: not have locally. This is usually caused by another repository pushing
hint: to the same ref. You may want to first integrate the remote changes
hint: (e.g., 'git pull ...') before pushing again.
hint: See the 'Note about fast-forwards' in 'git push --help' for details.

▶ git push -f
Enumerating objects: 7796, done.
Counting objects: 100% (7796/7796), done.
Delta compression using up to 10 threads
Compressing objects: 100% (3312/3312), done.
Writing objects: 100% (7796/7796), 133.17 MiB | 2.66 MiB/s, done.
Total 7796 (delta 4364), reused 7795 (delta 4364), pack-reused 0
remote: Resolving deltas: 100% (4364/4364), done.
To github.com:subbramanil/my-dev-notes.git
 + 734e92e...71bd9bb main -> main (forced update)

▶ du -sch
212M    .
212M    total

Reducing the repo size from 257 MB to 212 MB (a saving of 45 MB) may not look like much. However, the approach can be applied repeatedly to remove other big files and shrink the repo further.


I found that there are two alternatives that achieve similar results. I will write a follow-up post on these two.


This post is also available on DEV.

All rights reserved