app/src/content/article.mdx +9 -11
CHANGED
|
@@ -40,7 +40,7 @@ Even though open-weights Vision-Language Models (VLMs) are becoming ever more po
|
|
| 40 |
We manually collect over 180 image-text datasets from the recent literature and create new subsets in underrepresented domains.
|
| 41 |
|
| 42 |
<Accordion title="FineVision Subsets">
|
| 43 |
-
|Subset Name |Total Images|Total Samples|Total Turns|Total Question Tokens|Total Answer Tokens|
|
| 44 |
|--------------------------------------|------------|-------------|-----------|---------------------|-------------------|----------------------|
|
| 45 |
|coco_colors |118,287 |118,287 |118,287 |1,301,157 |6,376,672 |Captioning & Knowledge|
|
| 46 |
|densefusion_1m |1,058,751 |1,058,751 |1,058,751 |10,692,478 |263,718,217 |Captioning & Knowledge|
|
|
@@ -239,11 +239,9 @@ There are multiple ways to count the data in a multimodal dataset. The most comm
|
|
| 239 |
|
| 240 |
In total, FineVision has 17.3M images, 24.3M samples, 88.9M turns, and 9.5B answer tokens. Based on these 4 distributions, multiple different mixtures are possible. In conjunction with the provided ratings, we encourage the community to experiment with downsampling large categories, for example according to quality and diversity criteria, and with upsampling high-quality samples in small categories.
|
| 241 |
|
| 242 |
-
---
|
| 243 |
<FullWidth>
|
| 244 |
<HtmlEmbed src="d3-pie.html" desc="Distribution of Categories in FineVision" align="center" />
|
| 245 |
</FullWidth>
|
| 246 |
-
---
|
| 247 |
|
| 248 |
## Experimental Setup
|
| 249 |
To evaluate how our dataset compares to other open-source datasets, we conduct various experiments.
|
|
@@ -255,7 +253,7 @@ For most of the ablations and experiments, we train a 450M parameter VLM, since
|
|
| 255 |
We compare our dataset against 3 popular open-source alternatives: [The Cauldron](https://huggingface.co/datasets/HuggingFaceM4/the_cauldron), [LLaVA-OneVision](https://huggingface.co/datasets/lmms-lab/LLaVA-OneVision-Data) and [Cambrian-7M](https://huggingface.co/datasets/nyu-visionx/Cambrian-10M). We analyse all of them with the same pipeline for potential test-set contamination. Note that these rates are not actual contamination figures: while the pipeline discovers similarities between images in the test sets and the training sets, this does not mean the flagged samples really come from a test set, since test samples consist of both an image and its corresponding text. The rates rather serve as an upper bound on potential train/test overlap and as a relative comparison between the four datasets.
|
| 256 |
|
| 257 |
### Evaluations
|
| 258 |
-
To evaluate our ablations in a reproducible manner, we utilize lmms-eval during training. We evaluate on a diverse set of
|
| 259 |
|
| 260 |
## Experiments
|
| 261 |
Each of our ablations trains a 450M parameter model with a maximal image size of 1536x1536 pixels (without resizing smaller images) and a maximal input token length of 4096. In all single-stage configurations we train for 20k steps on 32 H100s for approximately 20h, while evaluating all benchmarks every 1k steps. If not specified otherwise, the “Baseline” in our intra-dataset ablations refers to a training run on the full, unfiltered and unchanged dataset.
|
|
@@ -263,7 +261,7 @@ Each of our ablations trains a 450M model with maximal image size of 1536x1536 p
|
|
| 263 |
### How does FineVision compare against the Baselines?
|
| 264 |
Compared against existing VLM training datasets, FineVision produces significantly better average benchmark ranks than the other options.
|
| 265 |
|
| 266 |
-
<HtmlEmbed src="
|
| 267 |
|
| 268 |
### How contaminated are the datasets?
|
| 269 |
To investigate data leakage from benchmarks into this dataset, we construct a deduplication pipeline based on the sample images. We embed the images of the test sets of 66 benchmarks from the lmms-eval framework using the SSCD descriptor, and compute the cosine similarity between our samples and the test-set embeddings. Whenever a sample has a similarity higher than a threshold of 0.95, it is assumed to be a duplicate. While our tests with various thresholds show that this flags some samples that are not actual duplicates (especially when images are visually similar but differ in detail, such as graphs or tables), we preferred to err on the side of caution. We open-source the deduplication pipeline here as well as the precomputed test-set embeddings here.
|
|
@@ -279,7 +277,7 @@ TODO: Insert the Images here
|
|
| 279 |
|
| 280 |
Additionally, we experimented with removing all flagged samples from all datasets to see whether the outcome differs from the results above, but we observe the same distribution.
|
| 281 |
|
| 282 |
-
<HtmlEmbed src="
|
| 283 |
|
| 284 |
TODO: After removing these duplicates, the performance of the models dropped by … % over all benchmarks.
|
| 285 |
|
|
@@ -297,12 +295,12 @@ Similarly to the comparison of the size, we also wanted to evaluate the datasets
|
|
| 297 |
Since the training of a VLM already builds upon pretrained vision and language backbones, datasets are usually not completely unstructured, but follow an image + question + answer structure. Recent works have shown that consolidating multiple questions for the same image into a multi-turn conversation, where the image is shown only once, improves model performance and also reduces the dataset's memory footprint. We therefore experiment with deduplicating every image in our dataset internally using the same SSCD descriptors, manually inspecting the resulting clusters, and merging fitting samples into multi-turn conversations.
|
| 298 |
Even when training for longer than in the other ablations, we did not observe a significant difference, and if anything, a slight disadvantage to merging multiple samples together.
|
| 299 |
|
| 300 |
-
<HtmlEmbed src="
|
| 301 |
|
| 302 |
### Should you train on multilingual data if your language backbone was not?
|
| 303 |
There are some multilingual datasets in our mixture, but since our language backbone was only trained on English data, we experimented with removing all the multilingual, mainly Chinese, subsets. This also does not seem to make a big difference, with a slight advantage to keeping the data, even though it was not part of the language backbone's initial training. In our training setup with this configuration, one epoch over the whole dataset equals ~12k steps, so the benefit of unseen languages only materializes after the first full epoch.
|
| 304 |
|
| 305 |
-
<HtmlEmbed src="
|
| 306 |
|
| 307 |
### How can you assess the quality of the dataset?
|
| 308 |
|
|
@@ -324,7 +322,7 @@ This is the distribution of scores across the different filters for FineVision.
|
|
| 324 |
|
| 325 |
To quantify the quality of the training data and its effect on the model’s performance, we run extensive ablations on our generated ratings.
|
| 326 |
|
| 327 |
-
<HtmlEmbed src="
|
| 328 |
|
| 329 |
Interestingly, we observe the same behaviour both when training only on turns whose 4 ratings are all above a certain threshold and when filtering on a single rating at a time: simply training on all samples of the dataset performs best on the benchmarks. This could mean multiple things.
|
| 330 |
We see almost the same ordering of ranks across all filters: from best to worst as the rating threshold increases. For example, the visual dependency and the image correspondence ratings both result in exactly the same distribution of rankings, corresponding to the natural order of the options, 1 through 5. This could indicate that, with a sufficiently large dataset trained on for long enough, removing samples hurts more than training on them, even if they were judged to be of low quality.
|
|
@@ -339,14 +337,14 @@ The standard training procedure of a VLM usually follows at least two stages. Fi
|
|
| 339 |
|
| 340 |
#### 1 Stage vs 2 Stages
|
| 341 |
|
| 342 |
-
<HtmlEmbed src="
|
| 343 |
|
| 344 |
We observe that at this model size, and with this amount of available data, training in a single stage actually outperforms a multi-stage approach.
|
| 345 |
|
| 346 |
#### 2 Stages vs 2.5 Stages
|
| 347 |
We also test whether splitting the second stage results in any performance improvements. We take the baseline and continue training for another 20k steps, both with the unfiltered dataset (rating >= 1) and with subsets of FineVision filtered according to our ratings.
|
| 348 |
|
| 349 |
-
<HtmlEmbed src="
|
| 350 |
|
| 351 |
As in the previous results, we observe that the best outcome is achieved simply by training on as much data as possible.
|
| 352 |
|
|
|
|
| 40 |
We manually collect over 180 image-text datasets from the recent literature and create new subsets in underrepresented domains.
|
| 41 |
|
| 42 |
<Accordion title="FineVision Subsets">
|
| 43 |
+
|Subset Name |Total Images|Total Samples|Total Turns|Total Question Tokens|Total Answer Tokens|Category |
|
| 44 |
|--------------------------------------|------------|-------------|-----------|---------------------|-------------------|----------------------|
|
| 45 |
|coco_colors |118,287 |118,287 |118,287 |1,301,157 |6,376,672 |Captioning & Knowledge|
|
| 46 |
|densefusion_1m |1,058,751 |1,058,751 |1,058,751 |10,692,478 |263,718,217 |Captioning & Knowledge|
|
|
|
|
| 239 |
|
| 240 |
In total, FineVision has 17.3M images, 24.3M samples, 88.9M turns, and 9.5B answer tokens. Based on these 4 distributions, multiple different mixtures are possible. In conjunction with the provided ratings, we encourage the community to experiment with downsampling large categories, for example according to quality and diversity criteria, and with upsampling high-quality samples in small categories.
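To make this concrete, here is a minimal sketch of how such a mixture could be built from per-sample metadata; the field names, categories and sampling factors are purely illustrative and not part of FineVision's released tooling.

```python
from collections import Counter
import random

# Hypothetical per-sample records; FineVision exposes similar metadata
# (category, quality ratings), but the field names here are assumptions.
samples = [
    {"category": "Captioning & Knowledge", "rating": 4},
    {"category": "Chart & Table", "rating": 5},
    {"category": "OCR QA", "rating": 2},
    # ...
]

def build_mixture(samples, max_per_category, min_rating=4, upsample_factor=2):
    """Downsample over-represented categories, upsample high-quality samples in small ones."""
    by_cat = {}
    for s in samples:
        by_cat.setdefault(s["category"], []).append(s)

    mixture = []
    for cat, items in by_cat.items():
        if len(items) > max_per_category:
            # Downsample: keep a random subset of a large category.
            mixture.extend(random.sample(items, max_per_category))
        else:
            # Upsample: repeat highly rated samples from small categories.
            for s in items:
                repeats = upsample_factor if s["rating"] >= min_rating else 1
                mixture.extend([s] * repeats)
    return mixture

mix = build_mixture(samples, max_per_category=2)
print(Counter(s["category"] for s in mix))
```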
|
| 241 |
|
|
|
|
| 242 |
<FullWidth>
|
| 243 |
<HtmlEmbed src="d3-pie.html" desc="Distribution of Categories in FineVision" align="center" />
|
| 244 |
</FullWidth>
|
|
|
|
| 245 |
|
| 246 |
## Experimental Setup
|
| 247 |
To evaluate how our dataset compares to other open-source datasets, we conduct various experiments.
|
|
|
|
| 253 |
We compare our dataset against 3 popular open-source alternatives: [The Cauldron](https://huggingface.co/datasets/HuggingFaceM4/the_cauldron), [LLaVA-OneVision](https://huggingface.co/datasets/lmms-lab/LLaVA-OneVision-Data) and [Cambrian-7M](https://huggingface.co/datasets/nyu-visionx/Cambrian-10M). We analyse all of them with the same pipeline for potential test-set contamination. Note that these rates are not actual contamination figures: while the pipeline discovers similarities between images in the test sets and the training sets, this does not mean the flagged samples really come from a test set, since test samples consist of both an image and its corresponding text. The rates rather serve as an upper bound on potential train/test overlap and as a relative comparison between the four datasets.
|
| 254 |
|
| 255 |
### Evaluations
|
| 256 |
+
To evaluate our ablations in a reproducible manner, we utilize lmms-eval during training. We evaluate on a diverse set of 10 benchmarks: AI2D, ChartQA, DocVQA, InfoVQA, MME, MMMU, MMStar, OCRBench, TextVQA and Seedbench. Since these benchmarks cover different topics and report results on different scales (e.g. AI2D reports exact-match accuracy from 0-100, while MME reports a continuous score from 0-2800), aggregating them is not trivial. In our ablations only the relative performance between configurations matters, so we determine the rank of every model at each evaluation step and average it over all benchmarks. This way we can judge how the different configurations rank among each other over the course of training, and how large the differences between them are.
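The rank-based aggregation can be expressed in a few lines; the scores below are placeholder numbers, only the ranking logic mirrors the procedure described above.

```python
import numpy as np

# Benchmark scores per configuration at one evaluation step (illustrative numbers,
# not actual results). The scales differ, which is why we compare ranks instead.
scores = {
    "config_a": {"ai2d": 52.1, "chartqa": 31.0, "mme": 1450.0},
    "config_b": {"ai2d": 49.3, "chartqa": 28.4, "mme": 1390.0},
    "config_c": {"ai2d": 50.2, "chartqa": 29.9, "mme": 1410.0},
}

def average_ranks(scores: dict) -> dict:
    """Rank every configuration per benchmark (1 = best) and average over benchmarks."""
    configs = list(scores)
    benchmarks = list(next(iter(scores.values())))
    ranks = {c: [] for c in configs}
    for b in benchmarks:
        # Sort configurations by score on this benchmark, best first.
        ordered = sorted(configs, key=lambda c: scores[c][b], reverse=True)
        for rank, c in enumerate(ordered, start=1):
            ranks[c].append(rank)
    return {c: float(np.mean(r)) for c, r in ranks.items()}

print(average_ranks(scores))  # e.g. {'config_a': 1.0, 'config_b': 3.0, 'config_c': 2.0}
```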
|
| 257 |
|
| 258 |
## Experiments
|
| 259 |
Each of our ablations trains a 450M parameter model with a maximal image size of 1536x1536 pixels (without resizing smaller images) and a maximal input token length of 4096. In all single-stage configurations we train for 20k steps on 32 H100s for approximately 20h, while evaluating all benchmarks every 1k steps. If not specified otherwise, the “Baseline” in our intra-dataset ablations refers to a training run on the full, unfiltered and unchanged dataset.
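For reference, the setup described above roughly corresponds to a configuration like the following; the key names are our own shorthand for this sketch, not the actual training configuration schema.

```python
# Illustrative ablation configuration mirroring the setup described above.
ablation_config = {
    "model_params": 450_000_000,      # ~450M parameter VLM
    "max_image_size": (1536, 1536),   # larger images are capped, smaller ones kept as-is
    "max_input_tokens": 4096,         # maximal input token length
    "train_steps": 20_000,            # single-stage runs
    "gpus": 32,                       # H100s, roughly 20h wall clock
    "eval_every_steps": 1_000,        # run the benchmark suite every 1k steps
}
```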
|
|
|
|
| 261 |
### How does FineVision compare against the Baselines?
|
| 262 |
Compared against existing VLM training datasets, FineVision produces significantly better average benchmark ranks than the other options.
|
| 263 |
|
| 264 |
+
<HtmlEmbed src="against-baselines.html" title="D3 Line" desc="TODO - Average Rank of Models trained on different open source datasets." />
|
| 265 |
|
| 266 |
### How contaminated are the datasets?
|
| 267 |
To investigate data leakage from benchmarks into this dataset, we construct a deduplication pipeline based on the sample images. We embed the images of the test sets of 66 benchmarks from the lmms-eval framework using the SSCD descriptor, and compute the cosine similarity between our samples and the test-set embeddings. Whenever a sample has a similarity higher than a threshold of 0.95, it is assumed to be a duplicate. While our tests with various thresholds show that this flags some samples that are not actual duplicates (especially when images are visually similar but differ in detail, such as graphs or tables), we preferred to err on the side of caution. We open-source the deduplication pipeline here as well as the precomputed test-set embeddings here.
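A minimal sketch of the thresholding step is shown below, assuming the SSCD descriptors have already been computed and L2-normalized; the embedding model itself and any chunking needed at FineVision's scale are omitted.

```python
import numpy as np

SIMILARITY_THRESHOLD = 0.95  # cosine similarity above which a sample is flagged

def flag_potential_duplicates(train_emb: np.ndarray, test_emb: np.ndarray) -> np.ndarray:
    """Boolean mask over training samples that are near-duplicates of any test image.

    Both arrays are assumed to hold L2-normalized SSCD descriptors of shape (n, d),
    so cosine similarity reduces to a dot product.
    """
    sims = train_emb @ test_emb.T          # (n_train, n_test) pairwise similarities
    return sims.max(axis=1) >= SIMILARITY_THRESHOLD

# Toy example with random vectors standing in for real SSCD embeddings.
rng = np.random.default_rng(0)
train = rng.normal(size=(1_000, 512)); train /= np.linalg.norm(train, axis=1, keepdims=True)
test = rng.normal(size=(200, 512));    test /= np.linalg.norm(test, axis=1, keepdims=True)
mask = flag_potential_duplicates(train, test)
print(f"flagged {mask.sum()} of {len(mask)} samples as potential test-set duplicates")
```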
|
|
|
|
| 277 |
|
| 278 |
Additionally, we experimented with removing all flagged samples from all datasets to see whether the outcome differs from the results above, but we observe the same distribution.
|
| 279 |
|
| 280 |
+
<HtmlEmbed src="against-baselines-deduplicated.html" title="D3 Line" desc="TODO - Average Rank of Models trained on different deduplicated open source datasets." />
|
| 281 |
|
| 282 |
TODO: After removing these duplicates, the performance of the models dropped by … % over all benchmarks.
|
| 283 |
|
|
|
|
| 295 |
Since the training of a VLM already builds upon pretrained vision and language backbones, datasets are usually not completely unstructured, but follow an image + question + answer structure. Recent works have shown that consolidating multiple questions for the same image into a multi-turn conversation, where the image is shown only once, improves model performance and also reduces the dataset's memory footprint. We therefore experiment with deduplicating every image in our dataset internally using the same SSCD descriptors, manually inspecting the resulting clusters, and merging fitting samples into multi-turn conversations.
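A simplified sketch of the merging step, assuming near-duplicate images have already been grouped into clusters via their SSCD descriptors; the field names and conversation format are illustrative.

```python
from collections import defaultdict

# Hypothetical flat samples: one image plus one question/answer turn each.
samples = [
    {"cluster_id": "img_001", "question": "What color is the bus?", "answer": "Red."},
    {"cluster_id": "img_001", "question": "How many people are visible?", "answer": "Three."},
    {"cluster_id": "img_002", "question": "What does the sign say?", "answer": "Stop."},
]

def merge_into_conversations(samples):
    """Merge all Q/A pairs that share an image cluster into one multi-turn sample."""
    grouped = defaultdict(list)
    for s in samples:
        grouped[s["cluster_id"]].append((s["question"], s["answer"]))

    merged = []
    for cluster_id, turns in grouped.items():
        conversation = []
        for q, a in turns:
            conversation.append({"role": "user", "content": q})
            conversation.append({"role": "assistant", "content": a})
        merged.append({"cluster_id": cluster_id, "turns": conversation})
    return merged

print(merge_into_conversations(samples)[0])  # the two img_001 turns become one conversation
```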
|
| 296 |
Even when training for longer than in the other ablations, we did not observe a significant difference, and if anything, a slight disadvantage to merging multiple samples together.
|
| 297 |
|
| 298 |
+
<HtmlEmbed src="internal-deduplication.html" title="D3 Line" desc="TODO - Average Ranking of Models trained with internally deduplicated / merged samples." />
|
| 299 |
|
| 300 |
### Should you train on multilingual data if your language backbone was not?
|
| 301 |
There are some multilingual datasets in our mixture, but since our language backbone was only trained on English data, we experimented with removing all the multilingual, mainly Chinese, subsets. This also does not seem to make a big difference, with a slight advantage to keeping the data, even though it was not part of the language backbone's initial training. In our training setup with this configuration, one epoch over the whole dataset equals ~12k steps, so the benefit of unseen languages only materializes after the first full epoch.
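The English-only ablation boils down to a subset filter like the sketch below; the subset names are placeholders, not FineVision's actual multilingual subset list.

```python
# Placeholder names for the multilingual (mainly Chinese) subsets removed in this ablation.
MULTILINGUAL_SUBSETS = {"example_chinese_ocr", "example_multilingual_vqa"}

def filter_english_only(rows):
    """Drop rows belonging to multilingual subsets for the English-only training run."""
    return [row for row in rows if row["subset"] not in MULTILINGUAL_SUBSETS]
```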
|
| 302 |
|
| 303 |
+
<HtmlEmbed src="remove-ch.html" title="D3 Line" desc="TODO - Average Rank of Models trained with and without multilingual samples" />
|
| 304 |
|
| 305 |
### How can you assess the quality of the dataset?
|
| 306 |
|
|
|
|
| 322 |
|
| 323 |
To quantify the quality of the training data and its effect on the model’s performance, we run extensive ablations on our generated ratings.
|
| 324 |
|
| 325 |
+
<HtmlEmbed src="all-ratings.html" title="D3 Line" desc="TODO - Average Rank of Models trained with samples that have all 4 ratings above a certain threshold." />
|
| 326 |
|
| 327 |
Interestingly, we observe the same behaviour both when training only on turns whose 4 ratings are all above a certain threshold and when filtering on a single rating at a time: simply training on all samples of the dataset performs best on the benchmarks. This could mean multiple things.
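The two filtering variants compared here can be sketched as follows; the rating field names are assumptions based on the filters discussed in this section.

```python
# Assumed names for the 4 per-turn ratings; only visual dependency and image
# correspondence are named explicitly in the text.
RATING_KEYS = ["formatting", "relevance", "visual_dependency", "image_correspondence"]

def keep_turn_all_ratings(turn: dict, threshold: int) -> bool:
    """Variant 1: keep a turn only if every rating is at or above the threshold."""
    return all(turn[k] >= threshold for k in RATING_KEYS)

def keep_turn_single_rating(turn: dict, key: str, threshold: int) -> bool:
    """Variant 2: filter on a single rating axis at a time."""
    return turn[key] >= threshold

turn = {"formatting": 4, "relevance": 5, "visual_dependency": 2, "image_correspondence": 3}
print(keep_turn_all_ratings(turn, threshold=3))                 # False: visual_dependency < 3
print(keep_turn_single_rating(turn, "relevance", threshold=3))  # True
```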
|
| 328 |
We see almost the same ordering of ranks across all filters: from best to worst as the rating threshold increases. For example, the visual dependency and the image correspondence ratings both result in exactly the same distribution of rankings, corresponding to the natural order of the options, 1 through 5. This could indicate that, with a sufficiently large dataset trained on for long enough, removing samples hurts more than training on them, even if they were judged to be of low quality.
|
|
|
|
| 337 |
|
| 338 |
#### 1 Stage vs 2 Stages
|
| 339 |
|
| 340 |
+
<HtmlEmbed src="ss_vs_s1.html" title="D3 Line" desc="TODO - Average Rank of a model trained for 20K steps in a single stage, and a model trained for the same 20k steps on top of pretraining the Modality Projection and Vision Encoder for 10k steps." />
|
| 341 |
|
| 342 |
We observe that at this model size, and with this amount of available data, training in a single stage actually outperforms a multi-stage approach.
|
| 343 |
|
| 344 |
#### 2 Stages vs 2.5 Stages
|
| 345 |
We also test whether splitting the second stage results in any performance improvements. We take the baseline and continue training for another 20k steps, both with the unfiltered dataset (rating >= 1) and with subsets of FineVision filtered according to our ratings.
|
| 346 |
|
| 347 |
+
<HtmlEmbed src="s25_ratings.html" title="D3 Line" desc="TODO - Average Rank if a model trained for an additional 20K steps on top of unfiltered training for 20K steps." />
|
| 348 |
|
| 349 |
As in the previous results, we observe that the best outcome is achieved simply by training on as much data as possible.
|
| 350 |
|