anakin87 posted an update 3 days ago
🧰 Free up space on the Hub with super_squash_history 🧹

As you may know, the Hugging Face Hub has storage limits on private repos (100 GB for free users, 1 TB for PRO users).

This weekend I did some cleanup on my private repos.
I went from 1.58 TB down to 1 GB. 😅

Besides deleting old, unused models, the main tool I used was a lesser-known command:
super_squash_history.

When you train a model, you often push multiple checkpoints to the Hub.
Each checkpoint = a commit.
A 2.6B model in BF16 is ~5 GB.
So 10 checkpoints = 50 GB. That adds up fast.
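
Back-of-the-envelope math behind that estimate (assuming 2 bytes per parameter for BF16):

params = 2.6e9                    # 2.6B parameters
checkpoint_gb = params * 2 / 1e9  # 2 bytes per parameter in BF16
print(f"~{checkpoint_gb:.1f} GB per checkpoint, ~{checkpoint_gb * 10:.0f} GB for 10 checkpoints")
# ~5.2 GB per checkpoint, ~52 GB for 10 checkpoints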

While full commit history can be useful for rollbacks, it's often unnecessary for older experiments where only the final model matters.

In these cases, you can use super_squash_history: it reduces your entire repo history to a single commit.

https://huggingface.co/docs/huggingface_hub/main/en/package_reference/hf_api#huggingface_hub.HfApi.super_squash_history

⚠️ super_squash_history is a non-revertible operation. Once squashed, the commit history cannot be retrieved.
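
For reference, a minimal call looks roughly like this (the repo id is a placeholder):

from huggingface_hub import HfApi

api = HfApi()
# Collapses the full history of the repo's main branch into a single commit
api.super_squash_history(repo_id="your-username/old-experiment", repo_type="model")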

Hope this is useful to others.

They should add this as a huggingface-cli command, or even a button in the repository settings.
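
Until then, a few lines around HfApi work as a makeshift CLI; a rough sketch (the script name and flags are made up):

# squash_repo.py
import argparse
from huggingface_hub import HfApi

parser = argparse.ArgumentParser(description="Squash a Hub repo's history to a single commit.")
parser.add_argument("repo_id")
parser.add_argument("--repo-type", default="model", choices=["model", "dataset", "space"])
parser.add_argument("--branch", default="main")
args = parser.parse_args()

HfApi().super_squash_history(repo_id=args.repo_id, repo_type=args.repo_type, branch=args.branch)
print(f"Squashed {args.repo_type} {args.repo_id} ({args.branch})")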

I wish we could ncdu our repos / have them all listed by storage size. Thanks for posting this, I've got to de-clutter mine.

Edit: got this working:


import math
from huggingface_hub import HfApi
from tqdm.auto import tqdm

def format_size(size_bytes):
    """Converts bytes to human-readable format (KB, MB, GB)."""
    if size_bytes is None or size_bytes == 0:
        return "0 B"
    size_name = ("B", "KB", "MB", "GB", "TB")
    i = int(math.floor(math.log(size_bytes, 1024)))
    p = math.pow(1024, i)
    s = round(size_bytes / p, 2)
    return f"{s} {size_name[i]}"

# "ncdu" for datasets: list them by storage size, largest first
print("Fetching dataset list...")
api = HfApi()
user = api.whoami()["name"]

all_datasets_basic = list(api.list_datasets(author=user))

detailed_datasets = []
for dataset in tqdm(all_datasets_basic, desc="Fetching storage info for datasets"):
    info = api.dataset_info(dataset.id, expand=["usedStorage"])
    detailed_datasets.append(info)

sorted_datasets = sorted(
    [d for d in detailed_datasets if hasattr(d, 'usedStorage') and d.usedStorage is not None], 
    key=lambda x: x.usedStorage, 
    reverse=True
)

print("-" * 50)
print(f"Datasets for user: {user} (sorted by size)")
print("-" * 50)

if not sorted_datasets:
    print("No datasets with storage information found.")
else:
    for dataset in sorted_datasets:
        print(f"{format_size(dataset.usedStorage):>10} | {dataset.id}")

# "ncdu" for models: list them by storage size, largest first
print("\nFetching model list...")

all_models_basic = list(api.list_models(author=user))

detailed_models = []
for model in tqdm(all_models_basic, desc="Fetching storage info for models"):
    info = api.model_info(model.id, expand=["usedStorage"])
    detailed_models.append(info)

sorted_models = sorted(
    [m for m in detailed_models if hasattr(m, 'usedStorage') and m.usedStorage is not None], 
    key=lambda x: x.usedStorage, 
    reverse=True
)

print("-" * 50)
print(f"Models for user: {user} (sorted by size)")
print("-" * 50)

if not sorted_models:
    print("No models with storage information found.")
else:
    for model in sorted_models:
        print(f"{format_size(model.usedStorage):>10} | {model.id}")

I had some multi-branch repos, as well as old checkpoints, taking up space.
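
Since super_squash_history only squashes one branch at a time (its branch argument), multi-branch repos can be handled with a loop over the repo's refs; a rough sketch (the repo id is a placeholder):

from huggingface_hub import HfApi

api = HfApi()
repo_id = "your-username/multi-branch-repo"  # placeholder

# Squash every branch of the repo, one by one
refs = api.list_repo_refs(repo_id, repo_type="model")
for branch in refs.branches:
    api.super_squash_history(repo_id=repo_id, repo_type="model", branch=branch.name)
    print(f"Squashed branch {branch.name} of {repo_id}")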