Added notes and changed to wide
app/src/content/article.mdx (+13 -7)
@@ -27,9 +27,15 @@ import visualPoster from "./assets/images/visual-vocabulary-poster.png";
import Accordion from '../components/Accordion.astro'

<Sidenote>
-TLDR; Today, we release FineVision, a new multimodal dataset with 17M images, 24 million samples, 90M question-answer turns and 10B answer tokens. We have extensively cleaned, analysed, and rated every single turn across 4 qualitative metrics with a score from 1-5 to enable the construction and study of individual training mixtures.
+TLDR; Today, we release FineVision, a new multimodal dataset with 17M images, 24 million samples, 90M question-answer turns and 10B answer tokens, comprising 5TB. We have extensively cleaned, analysed, and rated every single turn across 4 qualitative metrics with a score from 1-5 to enable the construction and study of individual training mixtures.

-Additionally, we ran extensive ablations and compared the performance of models trained on our dataset with common open source alternatives
+Additionally, we ran extensive ablations and compared the performance of models trained on our dataset with common open-source alternatives. Our dataset is both more diverse and achieves an average improvement of 35% across 10 common benchmarks over all baselines.
+
+To use the dataset, simply load it with:
+```python
+from datasets import load_dataset
+ds = load_dataset('HuggingFaceM4/FineVision', name='ai2d_merged', split='train', streaming=True)
+```
</Sidenote>

## Introduction
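The `load_dataset` call added in the hunk above streams samples as plain Python dicts, so a subset can be inspected without downloading all 5TB. A minimal sketch of peeking at one sample; the commented-out field names are an assumption about the schema, not something this diff confirms:

```python
from datasets import load_dataset

# Stream the subset from the snippet above; nothing is downloaded up front.
ds = load_dataset("HuggingFaceM4/FineVision", name="ai2d_merged",
                  split="train", streaming=True)

sample = next(iter(ds))  # first sample as a plain dict
print(sample.keys())     # check the real schema before relying on field names
# Assumed layout, for illustration only:
# for turn in sample["texts"]:
#     print(turn["user"], "->", turn["assistant"])
```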
@@ -39,7 +45,7 @@ Even though open-weights Vision-Language Models (VLMs) are becoming ever more po
### Data Collection
We manually collect over 180 image-text datasets from the recent literature and create new subsets in lacking domains.

-<
+<Wide>
<Accordion title="FineVision Subsets">
|Subset Name |Total Images|Total Samples|Total Turns|Total Question Tokens|Total Answer Tokens|Category |
|--------------------------------------|------------|-------------|-----------|---------------------|-------------------|----------------------|
@@ -229,7 +235,7 @@ We manually collect over 180 image-text datasets from the recent literature and
|text_wizardlm_evol |0 |69,999 |69,999 |7,753,963 |21,955,856 |Text-only |
|text_OpenMathInstruct-2 |0 |1,000,000 |1,000,000 |74,905,850 |413,132,418 |Text-only |
</Accordion>
-</
+</Wide>

### Cleaning
After gathering all the sub-datasets, every turn is cleaned. We remove all individual turns whose combined question and answer length exceeds 8192 tokens. We resize big images to have a longest side of 2048 pixels while keeping the aspect ratio, and discard images with corrupted metadata. This results in a clean final dataset with a maximum turn length of 8192 tokens and a maximum image dimension of 2048 pixels on the longest side.
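For concreteness, the cleaning rules in the hunk above map to a short per-turn filter and a per-image resize. A minimal sketch assuming PIL and an illustrative Hugging Face tokenizer; the diff does not say which tokenizer produced the 8192-token count:

```python
from io import BytesIO

from PIL import Image, UnidentifiedImageError
from transformers import AutoTokenizer

MAX_TURN_TOKENS = 8192   # combined question + answer budget per turn
MAX_SIDE = 2048          # longest image side after resizing

# Illustrative tokenizer; the one actually used is not named in this diff.
tokenizer = AutoTokenizer.from_pretrained("HuggingFaceTB/SmolLM2-360M")

def keep_turn(question: str, answer: str) -> bool:
    """Drop turns whose combined question + answer length exceeds 8192 tokens."""
    n_tokens = len(tokenizer(question + answer)["input_ids"])
    return n_tokens <= MAX_TURN_TOKENS

def clean_image(raw_bytes: bytes):
    """Resize so the longest side is 2048 px (aspect ratio kept); drop corrupted images."""
    try:
        img = Image.open(BytesIO(raw_bytes))
        img.load()  # force full decoding so corrupted metadata surfaces here
    except (UnidentifiedImageError, OSError):
        return None
    if max(img.size) > MAX_SIDE:
        scale = MAX_SIDE / max(img.size)
        img = img.resize((round(img.width * scale), round(img.height * scale)))
    return img
```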
@@ -241,9 +247,9 @@ There are multiple ways to count the data in a multimodal dataset. The most comm

In total, FineVision has 17.3M images, 24.3M samples, 88.9M turns, and 9.5B answer tokens. Based on these 4 distributions, multiple different mixtures are possible. In conjunction with the provided ratings, we encourage the community to experiment with downsampling large categories, for example according to quality and diversity criteria, and with upsampling high-quality samples in small categories.

-<
+<Wide>
<HtmlEmbed src="d3-pie.html" desc="Distribution of Categories in FineVision" align="center" />
-</
+</Wide>

## Experimental Setup
To evaluate how our dataset compares to other open-source datasets, we conduct various experiments.
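The up/downsampling suggested above can be expressed with standard `datasets` utilities. A minimal sketch using `interleave_datasets`; both the sampling probabilities and the assumption that every subset in the table doubles as a config name are illustrative:

```python
from datasets import interleave_datasets, load_dataset

# Two subsets standing in for a large and a small category.
big = load_dataset("HuggingFaceM4/FineVision", name="ai2d_merged",
                   split="train", streaming=True)
small = load_dataset("HuggingFaceM4/FineVision", name="text_wizardlm_evol",
                     split="train", streaming=True)

# Illustrative probabilities: the small subset is upsampled relative to its
# natural share; ratings-based filtering could be applied before this step.
mixture = interleave_datasets([big, small], probabilities=[0.7, 0.3], seed=42)
```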
@@ -261,7 +267,7 @@ To evaluate our ablations in a reproducible manner, we utilize lmms-eval during
Each of our ablations trains a 450M model with a maximum image size of 1536x1536 pixels (without resizing smaller images) and a maximum input token length of 4096. In all single-stage configurations we train for 20k steps on 32 H100s for approximately 20h while evaluating all 11 benchmarks every 1k steps. If not specified otherwise, the “Baseline” in our intra-dataset ablations refers to a training run on the full unfiltered and unchanged dataset.

### How does FineVision compare against the Baselines?
-Compared against existing VLM training datasets, FineVision produces significantly higher benchmark ranks than the other options.
+Compared against existing VLM training datasets, FineVision produces significantly higher benchmark ranks than the other options. Across the 10 different benchmarks, FineVision achieves a 45.68% improvement over the Cauldron, a 13.04% improvement over Cambrian, and a 46.83% improvement over LLaVA.

<HtmlEmbed src="against-baselines.html" desc="Average rank of models trained on different open-source datasets." />
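The average-rank metric behind the embedded figure is straightforward to recompute. A minimal sketch with placeholder scores; the numbers below are purely illustrative, the real per-benchmark results live in the figure's data, not in this diff:

```python
import numpy as np
from scipy.stats import rankdata

# Placeholder scores: rows = training datasets, columns = benchmarks.
scores = np.array([
    [61.2, 48.7, 55.0],   # FineVision
    [49.1, 40.2, 51.3],   # The Cauldron
    [57.8, 44.9, 50.2],   # Cambrian
    [48.0, 39.5, 49.8],   # LLaVA
])

# Rank models within each benchmark (1 = best), then average across benchmarks.
ranks = rankdata(-scores, axis=0)  # negate so a higher score gets a better rank
avg_rank = ranks.mean(axis=1)
print(avg_rank)  # lower is better
```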