Xet Storage Not Deduplicating for Even Simple Binary Files
I have migrated to Xet storage, and today I tried to test whether Xet is really working.
My test is simple: generate an all-ones (int) array using numpy and upload it to Hugging Face.
import numpy as np
a = np.ones(10000000,dtype=int)
np.save("./one.npy", a)
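As a side note (not part of the original test), the expected on-disk size can be sanity-checked before uploading; dtype=int is platform dependent, which is why the file came out at about 40 MB here rather than 80 MB:
import numpy as np

a = np.ones(10000000, dtype=int)
# dtype=int maps to a platform-dependent integer width, so this array
# is roughly 40 MB (int32) or 80 MB (int64) on disk.
print(a.dtype, a.nbytes)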
And upload it:
pip install -U "huggingface_hub[cli,hf_xet]"
huggingface-cli.exe upload lyk/XetTest . --repo-type=dataset
Start hashing 1 files.
Finished hashing 1 files.
Uploading files using Xet Storage..
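For completeness, the same upload can also be done from Python with huggingface_hub; as far as I understand, hf_xet is picked up automatically when it is installed and the repo is Xet-enabled. A minimal sketch, reusing the repo id from the CLI command above:
from huggingface_hub import HfApi

# Same upload as the CLI command above, via the Python API.
api = HfApi()
api.upload_file(
    path_or_fileobj="one.npy",
    path_in_repo="one.npy",
    repo_id="lyk/XetTest",
    repo_type="dataset",
)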
It shows that I am using Xet, but in the end I got 40 MB of LFS storage, just as large as the raw file itself, with no deduplication.
Well, maybe it only deduplicates against history commits. So I generated a file twice as large:
import numpy as np
a = np.ones(20000000, dtype=int)  # twice as many elements as before (~80 MB)
np.save("./one.npy", a)
Then I uploaded it and got 120 MB of LFS storage usage.
And during the whole process, the progress bar in the terminal showed that I uploaded the whole files (40 MB and 80 MB), even though Xet is enabled.
I don't know why Xet does not work. Is anything wrong?
https://github.com/huggingface/huggingface_hub/issues/3090
https://github.com/huggingface/xet-core/issues/343
https://discuss.huggingface.co/t/xet-storage-not-deduplicating-for-even-simple-binary-files/155771
Thanks for trying Xet! When you see the file (one.npy) in the Hub web interface, does it have a Xet logo next to it? If so, the file was uploaded and stored as a Xet file.
The file size reported is always the full file size when downloaded; that is expected behavior. Deduplication benefits show up as reduced time needed for uploading and downloading files. File storage will always be reported as the total bytes needed to download the entire file.
Yes, it shows that the file is stored using Xet.
But it is still confusing that the progress bar in the CLI always shows that I uploaded the whole file, not just a few blocks. And the upload speed also just matches the full speed of my Wi-Fi.
Also, I don't know why super squash can't remove all these LFS files which are no longer in the commit history.
Do I need to manually clone the repo and prune the LFS files?
And although Public Repositories Storage is unlimited, I am curious how Private Repositories Storage is calculated.
If it is the sum of the sizes of all LFS files, not the size of all blocks in Xet, does it mean that I still need to do super squash manually? I append some new rows to a large Parquet file daily, and that file is now several GBs.
And will users be able to see the real repo size in Xet in the future? (Just on the settings page or the dataset card page.)
If I only know the LFS size, I will always feel uncomfortable and anxious. It just "seems" that I am wasting Hugging Face's storage space with lots of duplicate files, even though I know Xet is used.
But it is still confusing that the progress bar in the CLI always shows that I uploaded the whole file, not just a few blocks. And the upload speed also just matches the full speed of my Wi-Fi.
The deduplication process occurs while uploading, so it would be confusing to see the progress bar show different file sizes as chunks are deduplicated, which is why the total file size is shown.
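A rough way to observe the deduplication, other than the progress bar, is to time two uploads where the second file shares most of its content with the first; with Xet the second upload should be noticeably faster than its full size suggests, even though the progress bar still reports the whole file. A minimal sketch, reusing the example repo id from this thread:
import time
import numpy as np
from huggingface_hub import upload_file

# Create two files where the first half of the second file is identical
# to the first file, so its chunks should already exist server-side.
np.save("one.npy", np.ones(10000000, dtype=np.int32))   # ~40 MB
np.save("two.npy", np.ones(20000000, dtype=np.int32))   # ~80 MB

for name in ("one.npy", "two.npy"):
    start = time.perf_counter()
    upload_file(
        path_or_fileobj=name,
        path_in_repo=name,
        repo_id="lyk/XetTest",   # example repo id from this thread
        repo_type="dataset",
    )
    print(name, f"{time.perf_counter() - start:.1f}s")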
Also, I don't know why super squash can't remove all these LFS files which are no longer in the commit history.
Help me understand this scenario, what is super squash? Is your goal to remove / delete the entire history of these binary files from the repo?
Do I need to manually clone the repo and prune the LFS files?
The same procedure you would do for an LFS repo should work, please let us know if you see that failing in some way.
And although Public Repositories Storage is unlimited, I am curious how Private Repositories Storage is calculated.
If it is the sum of the sizes of all LFS files, not the size of all blocks in Xet, does it mean that I still need to do super squash manually? I append some new rows to a large Parquet file daily, and that file is now several GBs.
Repository storage is calculated as the sum of all LFS file sizes, just like it was prior to Xet Storage. Yes, if super squash was used to lower the repository storage by removing the older LFS files, then this practice should be continued. However, with Xet storage you should see the appended Parquet files uploading very quickly.
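For the daily-append workflow, a minimal sketch of what this could look like; the file name and repo id below are placeholders, and only the changed chunks should actually need to be transferred on each upload:
import pandas as pd
from huggingface_hub import upload_file

# Hypothetical daily append: read the existing Parquet file, append
# today's rows, rewrite it, and upload. With Xet, mostly-unchanged chunks
# should deduplicate against the previous version while uploading.
df = pd.read_parquet("data.parquet")
new_rows = pd.DataFrame({"value": [1, 2, 3]})   # placeholder new data
pd.concat([df, new_rows], ignore_index=True).to_parquet("data.parquet")

upload_file(
    path_or_fileobj="data.parquet",
    path_in_repo="data.parquet",
    repo_id="user/daily-dataset",   # placeholder repo id
    repo_type="dataset",
)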
And will users be able to see the real repo size in Xet in the future? (Just on the settings page or the dataset card page.)
Currently there are no plans to display the deduplicated repo size. This is also difficult / confusing to display, as deduplication is done across all repos at Hugging Face. And so it is hard to show who 'owns' the blocks that are shared.
If I only know the LFS size, I will always feel uncomfortable and anxious. It just "seems" that I am wasting Hugging Face's storage space with lots of duplicate files, even though I know Xet is used.
Since storage used is calculated on total file size (not deduplicated size), using Xet storage should not change the mental model around files being stored. Think about it this way: you always need to download the entire file when on a fresh machine/environment. That is why the LFS file size is shown.
Help me understand this scenario, what is super squash? Is your goal to remove / delete the entire history of these binary files from the repo?
I mean super_squash_history.
It works well in my other repos, where I append to GB-level Parquet files daily. But it failed to delete the unused LFS files in this simple test case. Maybe that's because I uploaded the same file again after several commits? Now there are some LFS files that just seem like dangling pointers.
Since storage used is calculated on total file size (not deduplicated size), using Xet storage should not change the mental model around files being stored. Think about it this way: you always need to download the entire file when on a fresh machine/environment. That is why the LFS file size is shown.
And well, although I need to download all the current files, I don't want to download all the historical versions. Imagine that you append one row to a large Parquet file daily: that leads to O(n^2) LFS storage growth over time, even though the actual storage with Xet is linear. A rough calculation is sketched below.
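To make that concrete with rough placeholder numbers, assuming a file that starts at 1 GB and grows by 10 MB per day:
# Rough illustration: every historical version counts toward LFS storage,
# so the accounted total grows quadratically with the number of commits,
# while deduplicated (Xet-style) storage only adds the new bytes each day.
start_gb, daily_gb, days = 1.0, 0.01, 365

lfs_total = sum(start_gb + daily_gb * d for d in range(days))   # ~O(n^2)
dedup_total = start_gb + daily_gb * days                        # ~O(n)
print(f"LFS-accounted: {lfs_total:.0f} GB, deduplicated: {dedup_total:.2f} GB")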
So what I do now is call super_squash_history each week to reduce the LFS storage size. The cost is that I lose all the commit history.
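For reference, that weekly squash is a single call in huggingface_hub; the repo id below is a placeholder, and the call permanently discards the branch's commit history:
from huggingface_hub import HfApi

# Squash the whole history of the branch into a single commit, so that
# old file versions no longer count toward LFS storage.
api = HfApi()
api.super_squash_history(repo_id="user/daily-dataset", repo_type="dataset")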