lusxvr committed on
Commit
120e1bb
·
1 Parent(s): b010293

big update

app/src/content/article.mdx CHANGED
@@ -18,6 +18,9 @@ authors:
18
  - name: "Thibaud Frere"
19
  url: "https://huggingface.co/tfrere"
20
  affiliations: [1]
 
 
 
21
  affiliations:
22
  - name: "Hugging Face"
23
  url: "https://huggingface.co"
@@ -41,9 +44,9 @@ import visualPoster from "./assets/images/visual-vocabulary-poster.png";
41
  import Accordion from '../components/Accordion.astro'
42
 
43
  <Sidenote>
44
- TLDR; Today, we release **FineVision**, a new multimodal dataset with **17M images**, 24 million samples, 90M question-answer turns and 10B answer tokens comprising **5TB**. We have extensively cleaned, analysed, and rated every single turn across 4 qualitative metrics with a score from 1-5 to enable the construction and study of individual training mixtures.
45
 
46
- Additionally, we ran extensive ablations and compared the performance of models trained on our dataset with common open source alternatives. Our dataset is both more divers, and achieves an average improvement of **35%** in **10 common benchmarks** over all baselines.
47
 
48
  To use the dataset, simply load it with:
49
 
@@ -51,29 +54,33 @@ import Accordion from '../components/Accordion.astro'
51
  ```python
52
  from datasets import load_dataset
53
 
54
- ds = load_dataset('HuggingFaceM4/FineVision', name='ai2d_merged', split='train', streaming=True)
 
 
 
 
 
55
  ```
56
  </Sidenote>
57
 
58
- ## Introduction
59
- Even though open-weights Vision-Language Models (VLMs) are becoming ever more powerful, the accessibility of large-scale, state-of-the-art **training data** for these models is still lagging behind.
60
-
61
- The data to train these models is often proprietary and inaccessible for the broader community.
62
 
63
- Projects like The Cauldron, LLaVa and Cambrian aim to provide such datasets, but get quickly outpaced by the speed of the field and the emergence of novel applications for VLMs, like agentic tasks.
 
64
 
65
- ## Constructing FineVision
66
  ### Data Collection
67
- We manually collect **over 180** image-text datasets from the recent literature and create new subsets in lacking domains.
68
 
69
  <Wide>
70
  <Accordion size="big" title="FineVision Subsets - click to see more">
71
  |Subset Name |Total Images|Total Samples|Total Turns|Total Question Tokens|Total Answer Tokens|Category |Source |
72
  |--------------------------------------|------------|-------------|-----------|---------------------|-------------------|----------------------|------- |
73
  |coco_colors |118,287 |118,287 |118,287 |1,301,157 |6,376,672 |Captioning & Knowledge|[@noauthor_hazal-karakusmscoco-controlnet] |
74
- |densefusion_1m |1,058,751 |1,058,751 |1,058,751 |10,692,478 |263,718,217 |Captioning & Knowledge|[@li_densefusion-1m_2024] |
75
  |face_emotion |797 |797 |797 |8,767 |8,066 |Captioning & Knowledge|[@mollahosseini_affectnet_2017] |
76
- |google_landmarks |299,993 |299,993 |842,127 |6,194,978 |10,202,980 |Captioning & Knowledge|Ours |
77
  |image_textualization(filtered) |99,573 |99,573 |99,573 |917,577 |19,374,090 |Captioning & Knowledge|[@pi_image_2024] |
78
  |laion_gpt4v |9,301 |9,301 |9,301 |93,950 |1,875,283 |Captioning & Knowledge|[@noauthor_laiongpt4v-dataset_2023] |
79
  |localized_narratives |199,998 |199,998 |199,998 |2,167,179 |8,021,473 |Captioning & Knowledge|[@vedaldi_connecting_2020] |
@@ -104,7 +111,7 @@ We manually collect **over 180** image-text datasets from the recent literature
104
  |tabmwp(mathv360k) |22,452 |22,452 |22,452 |963,498 |158,042 |Chart & Table |[@shi_math-llava_2024] |
105
  |tat_dqa |2,448 |2,207 |13,251 |320,356 |1,177,852 |Chart & Table |[@zhu_towards_2022] |
106
  |tat_qa |2,199 |2,199 |13,215 |989,419 |254,790 |Chart & Table |[@zhu_tat-qa_2021] |
107
- |Unichart |611,925 |611,925 |6,898,324 |96,702,288 |211,989,247 |Chart & Table |[@masry_unichart_2023] |
108
  |vistext |9,969 |9,969 |9,969 |88,770 |1,191,127 |Chart & Table |[@tang_vistext_2023] |
109
  |vqaonbd |39,986 |39,986 |1,254,165 |36,066,807 |5,620,523 |Chart & Table |[@noauthor_jp1924vqaonbd_nodate] |
110
  |alfworldgpt |45,073 |45,073 |45,073 |17,864,033 |6,276,573 |General VQA |[@shridhar_alfworld_2021] |
@@ -143,9 +150,9 @@ We manually collect **over 180** image-text datasets from the recent literature
143
  |websight |10,000 |10,000 |10,000 |113,114 |5,237,381 |General VQA |[@laurencon_unlocking_2024] |
144
  |wildvision |333 |333 |405 |50,161 |72,820 |General VQA |[@lu_wildvision_2024] |
145
  |yesbut |4,318 |4,318 |4,318 |38,365 |157,229 |General VQA |[@nandy_yesbut_2024] |
146
- |aguvis-stage-1 |458,957 |458,957 |3,831,666 |36,151,272 |93,546,182 |Grounding & Counting |[@xu_aguvis_2025] |
147
  |groundui |13,531 |13,531 |18,016 |200,094 |883,274 |Grounding & Counting |[@zheng_agentstudio_2025] |
148
- |objects365_qa |1,742,287 |1,742,287 |12,329,259 |135,681,680 |2,146,619,635 |Grounding & Counting |[@shao_objects365_2019] |
149
  |oodvqa |8,488 |8,488 |8,488 |227,028 |8,488 |Grounding & Counting |[@tu_how_2023] |
150
  |tallyqa |98,680 |98,680 |183,986 |2,674,306 |370,282 |Grounding & Counting |[@acharya_tallyqa_2019] |
151
  |clevr |70,000 |70,000 |699,989 |19,277,813 |1,570,525 |Mathematics |[@lindstrom_clevr-math_2022-1] |
@@ -180,7 +187,7 @@ We manually collect **over 180** image-text datasets from the recent literature
180
  |latex_handwritten |39,583 |39,583 |39,583 |390,343 |1,874,733 |Naive OCR |[@noauthor_im2latex_nodate] |
181
  |latexformulas |552,340 |552,340 |552,340 |5,138,603 |43,094,747 |Naive OCR |[@noauthor_oleehyolatex-formulas_2024] |
182
  |maptext |200 |200 |799 |9,434 |70,813 |Naive OCR |[@barney_smith_icdar_2024] |
183
- |mathwriting-google |300,000 |300,000 |300,000 |2,461,270 |5,954,806 |Naive OCR |[@gervais_mathwriting_2025] |
184
  |memotion |6,991 |6,991 |6,991 |194,718 |177,429 |Naive OCR |[@sharma_semeval-2020_2020] |
185
  |orand_car_a |1,999 |1,999 |1,999 |43,978 |9,035 |Naive OCR |[@diem_icfhr_2014] |
186
  |rendered_text |10,000 |10,000 |10,000 |85,879 |244,183 |Naive OCR |[@noauthor_wendlercrenderedtext_2024] |
@@ -191,8 +198,8 @@ We manually collect **over 180** image-text datasets from the recent literature
191
  |SynthFormulaNet |499,997 |499,997 |499,997 |1,999,631 |51,215,097 |Naive OCR |[@nassar_smoldocling_2025] |
192
  |tal_ocr_eng |256,646 |256,646 |256,646 |3,385,012 |7,465,207 |Naive OCR |[@noauthor_httpsai100talcomdataset_nodate] |
193
  |wordart |19,066 |4,804 |4,804 |78,032 |54,263 |Naive OCR |[@avidan_toward_2022] |
194
- |olmOCR-mix-0225-documents |228,864 |228,864 |228,858 |2,197,147 |163,194,337 |Naive OCR |[@poznanski_olmocr_2025] |
195
- |olmOCR-mix-0225-books |15,194 |15,194 |15,194 |145,750 |7,962,779 |Naive OCR |[@poznanski_olmocr_2025] |
196
  |a_okvqa |54,602 |54,602 |54,602 |1,065,188 |360,990 |OCR QA |[@avidan_-okvqa_2022] |
197
  |aokvqa |16,539 |16,539 |17,056 |743,458 |218,917 |OCR QA |[@avidan_-okvqa_2022] |
198
  |arxivqa |100,000 |100,000 |100,000 |7,022,001 |6,422,269 |OCR QA |[@li_multimodal_2024] |
@@ -255,112 +262,86 @@ We manually collect **over 180** image-text datasets from the recent literature
255
  |text_theoremqa |0 |800 |800 |50,065 |3,468 |Text-only |[@chen_theoremqa_2023] |
256
  |text_wizardlm_evol |0 |69,999 |69,999 |7,753,963 |21,955,856 |Text-only |[@noauthor_wizardlmteamwizardlm_evol_instruct_70k_2024] |
257
  |text_OpenMathInstruct-2 |0 |1,000,000 |1,000,000 |74,905,850 |413,132,418 |Text-only |[@toshniwal_openmathinstruct-2_2024] |
 
258
 
259
  </Accordion>
260
  </Wide>
261
 
262
  ### Cleaning
263
- After gathering all the sub-datasets, every turn is cleaned.
264
-
265
- We remove all individual turns whose combined question and answer length exceeds **8192 tokens**.
266
-
267
- We resize big images to have a longest side of **2048 pixels** while keeping the aspect ratio, and discard images with corrupted metadata.
268
-
269
- This results in a clean final dataset with a maximum turn length of 8192 tokens and a maximum image dimension of 2048 pixels on the longest side.
270
-
271
-
272
- ### Result
273
- **FineVision** consists of **9 categories**: Captioning & Knowledge, Chart & Table, General VQA, Grounding & Counting, Mathematics, Naive OCR, OCR QA, Science, Text-only.
274
-
275
- There are multiple ways to count the data in a multimodal dataset.
276
-
277
- The most common are the number of samples and the number of images.
278
 
279
- Additionally, a single sample can consist of multiple question/answer pairs in the form of a multi-turn conversation.
280
-
281
- Similarly to text-only datasets, the number of answer tokens is also interesting, since these are the tokens the model is actually trained on.
 
 
 
 
282
 
283
- We keep track of all **4 different distributions**.
 
 
 
 
 
 
284
 
285
- In total, **FineVision** has **17.3M images**, **24.3M samples**, **88.9M turns**, and **9.5B answer tokens**.
 
286
 
287
- Based on these **4 distributions**, multiple different mixtures are possible.
 
288
 
289
- In conjunction with the provided ratings, we encourage the community to experiment with downsampling large categories, for example according to quality and diversity criteria, and with upsampling high quality samples in small categories.
290
  <br/>
291
  <Wide>
292
- <HtmlEmbed src="d3-pie.html" desc="Distribution of Categories in FineVision" align="center" />
293
  </Wide>
294
 
295
  ## Experimental Setup
296
- To evaluate how our dataset compares to other open-source datasets, we conduct various experiments.
297
-
298
- ### Model Architecture: nanoVLM
299
- For most of the ablations and experiments, we train a **450M** parameter VLM, since it provides a good trade off between training time and model performance.
300
-
301
- We utilize the lightweight nanoVLM training framework with **SmolLM2-360M-Instruct** as the text backbone, andSigLIP2-512 as the vision encoder.
302
-
303
- We experimented with a classic 2-stage training schedule where the first stage is used to train mainly the Modality Projection to align the Language and Image Embeddings, and the second stage is used to train the whole model.
304
-
305
- Interestingly, we did not observe any significant benefits from this additional first stage compared to training the whole model directly, so we settled on a **single stage** training for most ablations.
306
 
 
 
307
 
308
  ### Baseline Datasets
 
309
 
310
- We compare our dataset against 3 popular open source alternatives: **[The Cauldron](https://huggingface.co/datasets/HuggingFaceM4/the_cauldron)**, **[LLaVA-OneVision](https://huggingface.co/datasets/lmms-lab/LLaVA-OneVision-Data)** and **[Cambrian-7M](https://huggingface.co/datasets/nyu-visionx/Cambrian-10M)**.
311
-
312
- We analyse all of them with the same pipeline concerning potential test-set contamination.
313
-
314
- Note that these rates are not the actual contaminations.
315
-
316
- While the pipeline discovers some similarities between the images of the test sets and the train sets, this does not mean that they are actually samples from the test set, since these consist of both image and the corresponding text.
317
-
318
- This is rather used as an upper bound on potential train/test overlap and as a relative comparison between the four datasets.
319
 
320
  ### Evaluations
 
321
 
322
- To evaluate our ablations in a reproducible manner, we utilize **lmms-eval** during training.
323
-
324
- We evaluate on a diverse set of **10 benchmarks**: AI2D, ChartQA, DocVQA, InfoVQA, MME, MMMU, MMStar, OCRBench, TextVQA and Seedbench.
325
-
326
- Since these benchmarks cover different topics and produce results on different scales, e.g. AI2D returns the accuracy of the exact matches (0-100), but MME returns a continuous score (0-2800), aggregating them is not trivial.
327
-
328
- In our ablations the relative performance between the different configurations matters, so we determine the rank of every model in each training step and average it over all the benchmarks.
329
-
330
- This way we can judge where different configurations rank among each other over the course of training, and how big the difference between them is.
331
 
332
  ## Experiments
333
- Each of our ablations trains a 450M model with maximal image size of 1536x1536 pixel (without resizing smaller images) and a maximal input token length of 4096.
334
-
335
- In all single stage configurations we train for **20k Steps** on **32 H100s** for approximately 20h while evaluating all 11 benchmarks every 1k Steps.
336
 
337
- If not specified otherwise, the “Baseline” in our intra dataset ablations refers to a training run on the full unfiltered and unchanged dataset
 
 
338
 
339
- ### How does FineVision compare against the Baselines?
340
 
341
- Compared against existing VLM training datasets, **FineVision** produces significantly higher benchmark ranks than the other options.
 
342
 
343
- Over the 10 different metrics, **FineVision** achieves a **45.68%** improvement over the Cauldron, a **13.04%** improvement over Cambrian, and a **46.83%** improvement over LLaVa. <a href="#against-baselines">Fig1</a>
344
 
345
- ---
346
- <HtmlEmbed id="against-baselines" src="against-baselines.html" desc="Average Rank of Models trained on different open source datasets." />
347
-
348
- ### How contaminated are the datasets?
349
-
350
- To investigate data leakage from benchmarks into this dataset, we construct a deduplication pipeline based on the sample images.
351
-
352
- We embed the images of 66 image-test datasets from the lmms-eval framework using the SSCD descriptor, and compute the cosine similarity between our samples and the test-set embeddings.
353
-
354
- Whenever a sample has a similarity higher than a threshold of **0.95** it is assumed to be a duplicate.
355
 
356
- While our tests with various thresholds show that this is still flagging more false-positives than false-negatives, we preferred to err on the side of caution.
357
 
358
- Below is an example of a correctly identified Duplicate ("Photo"), a false-positive with a similarity score above 0.95 ("Chart") and a false-negative with a similarity score below 0.95 ("Drawing").
359
 
360
- We open-source the deduplication pipeline here as well as the precomputed test-set embedding’s here.
361
  <br/>
362
  <Wide>
363
- <HtmlEmbed src="comparison.html" align="center" desc="Examples of the Deduplication Pipeline."/>
364
  </Wide>
365
 
366
  | Name | Samples | Contamination Rate | Performance Drop |
@@ -370,145 +351,68 @@ We open-source the deduplication pipeline here as well as the precomputed test-s
370
  | Cambrian-7M | 7.0M | 2.29% | 2.78% |
371
  | FineVision | 24.3M | 1.02% | 1.45% |
372
 
373
- Additionally, we experimented with removing all found samples from all datasets to see if the outcome is different from the results above, but we observe the same distribution.
 
374
 
375
- ---
376
- <HtmlEmbed src="against-baselines-deduplicated.html" desc="Average Rank of Models trained on different deduplicated open source datasets." />
377
- ---
378
 
379
- After removing these duplicates, the average performance of the models over all benchmarks dropped by 2.78% for Cambrian, 2.39% for Cauldron, **1.45%** for FineVision and 2.72% for LLaVa, indicating that while FineVision is already performing best, test-set contamination has the smallest effect in this dataset.
380
 
381
  ### How diverse are the datasets?
 
382
 
383
- Similarly to the comparison of the size, we also wanted to evaluate the datasets for diversity.
384
-
385
- Evaluating the diversity of a dataset is a field of study for itself, which we will not dive into here, rather we borrow techniques from computer vision and use the already computed **SSCD embeddings** as a proxy of visual diversity.
386
-
387
- The SSCD embeddings should provide a good approximation since they are specifically optimized for distinguishing between visually similar content, through their differential entropy regularization that tries to ensure the full utilization of the embedding space.
388
-
389
- The resulting approximately uniform distribution of the embeddings promotes consistent separation between descriptor vectors, making distances from different embedding regions more comparable, which is crucial for meaningful diversity measurements.
390
-
391
- To not rely on a subsample of the dataset in estimating the diversity, we analyse the covariance metric of the full embeddings since this can be computed over the whole dataset in a numerically stable way (using Welford’s algorithm).
392
-
393
- From this covariance matrix, we can calculate the eigenvalues for analysis.
394
-
395
- We get the effective rank of the covariance matrix, which measures how uniformly the variance is distributed across dimensions, as well as the participation ratio, which measures how many dimensions actively contribute to the overall variance.
396
-
397
- The effective rank (entropy based) estimates the uniformity of the variance distribution, while the participation ratio (concentration-based) estimates the breadth of the variance participation .
398
-
399
- To obtain a single ‘**diversity score**’ for the datasets, we normalize the effective rank and participation ratio with the embedding dimension and compute their geometric mean.
400
-
401
- We observe that **FineVision** is not only the biggest, but also the most diverse dataset.
402
-
403
- | Name | Effective Rank | Participation Ratio | Diversity |
404
- |---------------|----------------|---------------------|-----------|
405
- | Cauldron | 324.05 | 129.22 | 0.400 |
406
- | Llava-Vision | 267.89 | 87.05 | 0.298 |
407
- | Cambrian-7M | 359.73 | 152.70 | 0.458 |
408
- | FineVision | 359.22 | 182.52 | 0.500 |
409
 
410
  ### Should you merge multiple questions for the same image into a single multi turn conversation?
 
411
 
412
- Since the training of a VLM already builds upon pretrained vision and language backbones, datasets are usually not completely unstructured, but follow an image+question and answer structure.
413
-
414
- Recent works have shown that consolidating multiple questions for the same image into a **multi-turn conversation** where the image is shown only once improves model performance, and additionally also reduces the datasets memory footprint.
415
-
416
- We therefore experiment with deduplicating every image in our dataset internally using the same SSCD descriptors, manually inspect the resulting clusters and merge fitting samples into a multi-turn conversation.
417
-
418
- Even when training for longer than the other ablations, we did not observe a significant difference, if at all rather one in favour against merging multiple samples together.
419
 
420
  ---
421
- <HtmlEmbed src="internal-deduplication.html" desc="Average Ranking of Models trained with internally deduplicated / merged samples." />
422
  ---
423
 
424
  ### Should you train on multilingual data if your language backbone was not?
425
-
426
- There are some multilingual datasets in our mixture, but since our Language Backbone is only trained on English data, we experimented with removing all the multilingual, mainly Chinese, subsets.
427
-
428
- This does also not seem to make a big difference, with slight advantages to leaving the data, even if it was not part of the Language Backbone's initial training.
429
-
430
- In our training setup with this configuration, one epoch over the whole dataset equals ~12k steps, so the benefit of unseen languages only materializes after the first full epoch.
431
 
432
  ---
433
- <HtmlEmbed src="remove-ch.html" desc="Average Rank of Models trained with and without multilingual samples" />
434
  ---
435
 
436
  ### How can you assess the quality of the dataset?
437
-
438
- The usual goal for every dataset, to collect samples with the highest quality possible, is quite an abstract endeavour in practice, especially for multimodal datasets. Additionally, different training stages usually have different qualitative and quantitative requirements.
439
-
440
- Finally, tuning the mixtures of different categories is also reliant on how much data with what quality is available. For image-text datasets, there are 3 different combinatorial ways to evaluate a sample: text-only, image-only, and image-text correspondence. The question persists, how do you actually measure the quality of a sample, especially if you have to do so in 3 different ways.
441
-
442
- With **FineVision**, we test a framework to rate every single turn in our dataset across 4 axes.
443
-
444
- For this, we used a LLM and VLM-as-a-judge pipeline (using Qwen3-32B and Qwen2.5VL-32B), to rate every turn on a scale from 1-5 in these 4 categories:
445
- - **Text Formatting Quality**: How is the quality of the answer both linguistically and structurally? (Question and Answer)
446
- - **Question-Answer Relevance**: Does the answer properly respond to the question? (Question and Answer)
447
- - **Visual Dependency**: How much does the question depend on visual information to be answered? (Question only)
448
- - **Image-Question Correspondence**: How well does the image support answering the question? (Image and Question)
449
-
450
- This is the distribution of scores across the different filters for **FineVision**.
451
- | Filter | 1 | 2 | 3 | 4 | 5 |
452
- |-----------------------|-------|-------|-------|-------|-------|
453
- | Formatting | 0.5 | 0.7 | 1.1 | 77.5 | 20.3 |
454
- | Relevance | 2.9 | 0.5 | 14.7 | 16.5 | 65.4 |
455
- | Visual Dependency | 11.0 | 20.4 | 2.6 | 24.2 | 41.8 |
456
- | Image Correspondence | 8.1 | 3.6 | 17.3 | 26.8 | 44.1 |
457
 
458
  To try to quantify the quality of the training data and the effect it has on the model’s performance, we run extensive ablations on our generated ratings.
459
 
460
  ---
461
- <HtmlEmbed src="all-ratings.html" desc="Average Rank of Models trained with samples that have all 4 ratings above a certain threshold." />
462
  ---
463
 
464
- Interestingly, both when only training on turns that have any of the 4 ratings under a certain threshold, as well as when training on turns where only a single rating at a time is used, we observe the same behaviour.
465
-
466
- Simply training on all samples of the dataset **outperforms in benchmarks**.
467
- This could mean multiple things.
468
-
469
- We can almost see the same distribution in the ranks across all filters: From best to worst with an increase in the rating threshold.
470
-
471
- For example the visual dependency and the image correspondence rating both result in exactly the same distribution of rankings, corresponding to the natural order of options, 1 through 5.
472
-
473
- This could indicate that with a sufficiently large dataset that you train on long enough, it hurts more to remove samples, even if they were judged to be of low quality, than to train on them.
474
-
475
- The notion of quality for VLM datasets is nuanced in general.
476
-
477
- If we compare training a VLM and an LLM, training the VLM is closer in nature to the SFT part than the ‘Pre-Training’ part of training a LLM.
478
 
479
- We do not train on crawls of internet data, instead we train on individual samples of Image-Question and Answer pairs, and these datapoints are usually ‘curated rather than collected’.
480
 
481
- We also do not train on trillions of tokens, but on billions.
482
-
483
- This means that the datasets for VLMs usually already have a certain baseline quality.
484
-
485
- Since **FineVision** is mainly a collection of common VLM datasets, combined with a few newly created ones in low resource domains, this baseline quality is the same here.
486
-
487
- We could therefore be trying to measure and quantify nuances in the quality of Image-Question-Answer Pairs, instead of using the binary indicator of using curated SFT datasets as the measure for quality, and training on as much data as possible.
488
-
489
- Alternatively, while we used state-of-the-art open source models to judge our datapoints, we still had to find a compromise between model quality and cost due to the raw required effort to rate every single turn of **FineVision**.
490
-
491
- The chosen models could simply not be powerful enough to recognize and judge the quality of samples.
492
-
493
- Even though our first proposal to judge the quality of multimodal data on a per-turn basis did not yield any improvement in model performance, we believe that this is still an exciting and important direction of research and hope the release of **FineVision** encourages the community to develop techniques for this at large scale.
494
 
495
  <Wide>
496
- <HtmlEmbed src="filters-quad.html" title="Quality Filters Overview" desc="Interactive comparison across thresholds for all four filters: Formatting, Relevance, Visual Dependency, and Image-Question Correspondence." align="center" />
497
  </Wide>
498
 
499
  ### Should you train in multiple stages?
500
-
501
- The standard training procedure of a VLM usually follows at least two stages. First, you train only the connecting module, potentially in addition the image encoder, and then you train the whole model in a second stage. Some work has even introduced an additional Stage 2.5, where you train the full model on a smaller subset of higher quality data.
502
-
503
- To investigate this on small models, we experiment both with single, dual and triple stage training.
504
 
505
  ---
506
  #### 1 Stage vs 2 Stages
 
507
 
508
- <HtmlEmbed src="ss-vs-s1.html" desc="Average Rank of a model trained for 20K steps in a single stage, and a model trained for the same 20k steps on top of pretraining the Modality Projection and Vision Encoder for 10k steps." />
509
-
510
-
511
- We observe that at this model size, with this amount of available data, training only a single stage actually outperforms a multi stage approach.
512
 
513
  ---
514
  #### 2 Stages vs 2.5 Stages
@@ -516,9 +420,9 @@ We also experiment if splitting the second stage results in any performance impr
516
 
517
  We take the baseline, and continue training for another 20k steps, both with the unfiltered (>= 1) as well as filtered subsets of **FineVision** according to our ratings.
518
 
519
- <HtmlEmbed src="s25-ratings.html" desc="Average Rank if a model trained for an additional 20K steps on top of unfiltered training for 20K steps." />
520
 
521
- Like in the previous results, we observe that the best outcome is simply achieved by training on as much data as possible.
522
 
523
  ## Conclusion
524
- We introduce **FineVision**, a new state of the art open dataset to train VLMs, that is both **bigger and more diverse** than previous open source datasets. In addition to extensive ablations, we present a new family of small, purely data-centric trained VLMs, and hope we can empower both further research and the community with this.
 
18
  - name: "Thibaud Frere"
19
  url: "https://huggingface.co/tfrere"
20
  affiliations: [1]
21
+ - name: "Leandro von Werra"
22
+ url: "https://huggingface.co/lvwerra"
23
+ affiliations: [1]
24
  affiliations:
25
  - name: "Hugging Face"
26
  url: "https://huggingface.co"
 
44
  import Accordion from '../components/Accordion.astro'
45
 
46
  <Sidenote>
47
+ Today, we release **FineVision**, a new multimodal dataset with **24 million samples**. We created FineVision by processing over 200 datasets containing 17M images, 90M question-answer turns, and 10B answer tokens, totaling **5TB of high-quality data**. We extensively processed all datasets to unify their format, cleaned them of duplicates and poor data, and rated all 90M turns using 32B judge models across 4 qualitative metrics on a 1-5 scale to enable the construction and study of individual training mixtures.
48
 
49
+ To enable everyone to construct SOTA open VLMs, we ran extensive ablations on FineVision and compared it to publicly available alternatives. Models trained on FineVision outperform every baseline across 11 common benchmarks, thanks to FineVision’s scale and diversity of data.
50
 
51
  To use the dataset, simply load it with:
52
 
 
54
  ```python
55
  from datasets import load_dataset
56
 
57
+ from datasets import get_dataset_config_names
+
+ # Get all subset names and load the first one
+ available_subsets = get_dataset_config_names('HuggingFaceM4/FineVision')
+ ds = load_dataset('HuggingFaceM4/FineVision', name=available_subsets[0], split='train', streaming=True)
+
+ # Inspect the first sample (streaming datasets are iterated, not indexed)
+ print(next(iter(ds)))
63
  ```
64
  </Sidenote>
65
 
66
+ ## Why this dataset?
67
+ Even though open-weights Vision-Language Models (VLMs) are becoming ever more powerful, the accessibility of the training data used for these models is lagging behind. This data is often proprietary and inaccessible for the broader community. Projects like The Cauldron, LLaVa and Cambrian aim to provide such datasets, but get quickly outpaced by the speed of the field and the emergence of novel applications for VLMs, like agentic tasks.
68
+ For FineVision we set out to combine and unify existing publicly available data sources into one large, high-quality dataset. The first step is to collect and standardize the datasets.
 
69
 
70
+ ## How did we build FineVision?
71
+ FineVision was a giant act of data curation. We started by collecting publicly available datasets and augmenting underrepresented categories. We then checked all datasets for internal duplicates and benchmark contamination. Finally, the data was cleaned and rated before being added to the final mixture.
72
 
 
73
  ### Data Collection
74
+ We manually collected over **200 image-text datasets** from various publicly available sources and processed them to unify their formatting. Some datasets are not presented in chat form, so we converted them into question-answer pairs; in some cases this went as far as synthetically creating questions for all samples. Finally, we created new subsets for underrepresented domains such as GUI-oriented data, compiled from existing GUI datasets after applying chat normalization and unifying their dataset-specific formats into a more general GUI action space.
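As a rough illustration of the kind of chat normalization involved, the sketch below wraps a captioning-style record into a single question-answer turn. The input field names (`image`, `caption`) and the target `images`/`texts` layout are illustrative assumptions, not the exact conversion code we used.

```python
# Illustrative sketch of chat normalization: wrap a raw captioning record
# into a single question-answer turn. Field names are assumptions.
def to_chat_sample(raw: dict, question: str = "Describe this image in detail.") -> dict:
    return {
        "images": [raw["image"]],  # keep the original image untouched
        "texts": [{"user": question, "assistant": raw["caption"].strip()}],  # one synthetic turn
    }

# Example with a dummy record:
print(to_chat_sample({"image": "<PIL.Image object>", "caption": "A red bus on a rainy street. "}))
```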
75
 
76
  <Wide>
77
  <Accordion size="big" title="FineVision Subsets - click to see more">
78
  |Subset Name |Total Images|Total Samples|Total Turns|Total Question Tokens|Total Answer Tokens|Category |Source |
79
  |--------------------------------------|------------|-------------|-----------|---------------------|-------------------|----------------------|------- |
80
  |coco_colors |118,287 |118,287 |118,287 |1,301,157 |6,376,672 |Captioning & Knowledge|[@noauthor_hazal-karakusmscoco-controlnet] |
81
+ |densefusion_1m |1,058,751 |1,058,751 |1,058,751 |10,692,478 |263,718,217 |Captioning & Knowledge|[@li_densefusion-1m_2024] (Converted) |
82
  |face_emotion |797 |797 |797 |8,767 |8,066 |Captioning & Knowledge|[@mollahosseini_affectnet_2017] |
83
+ |google_landmarks |299,993 |299,993 |842,127 |6,194,978 |10,202,980 |Captioning & Knowledge|[@weyand2020googlelandmarksdatasetv2] (Converted) |
84
  |image_textualization(filtered) |99,573 |99,573 |99,573 |917,577 |19,374,090 |Captioning & Knowledge|[@pi_image_2024] |
85
  |laion_gpt4v |9,301 |9,301 |9,301 |93,950 |1,875,283 |Captioning & Knowledge|[@noauthor_laiongpt4v-dataset_2023] |
86
  |localized_narratives |199,998 |199,998 |199,998 |2,167,179 |8,021,473 |Captioning & Knowledge|[@vedaldi_connecting_2020] |
 
111
  |tabmwp(mathv360k) |22,452 |22,452 |22,452 |963,498 |158,042 |Chart & Table |[@shi_math-llava_2024] |
112
  |tat_dqa |2,448 |2,207 |13,251 |320,356 |1,177,852 |Chart & Table |[@zhu_towards_2022] |
113
  |tat_qa |2,199 |2,199 |13,215 |989,419 |254,790 |Chart & Table |[@zhu_tat-qa_2021] |
114
+ |Unichart |611,925 |611,925 |6,898,324 |96,702,288 |211,989,247 |Chart & Table |[@masry_unichart_2023] (Converted) |
115
  |vistext |9,969 |9,969 |9,969 |88,770 |1,191,127 |Chart & Table |[@tang_vistext_2023] |
116
  |vqaonbd |39,986 |39,986 |1,254,165 |36,066,807 |5,620,523 |Chart & Table |[@noauthor_jp1924vqaonbd_nodate] |
117
  |alfworldgpt |45,073 |45,073 |45,073 |17,864,033 |6,276,573 |General VQA |[@shridhar_alfworld_2021] |
 
150
  |websight |10,000 |10,000 |10,000 |113,114 |5,237,381 |General VQA |[@laurencon_unlocking_2024] |
151
  |wildvision |333 |333 |405 |50,161 |72,820 |General VQA |[@lu_wildvision_2024] |
152
  |yesbut |4,318 |4,318 |4,318 |38,365 |157,229 |General VQA |[@nandy_yesbut_2024] |
153
+ |aguvis-stage-1 |458,957 |458,957 |3,831,666 |36,151,272 |93,546,182 |Grounding & Counting |[@xu_aguvis_2025] (Converted) |
154
  |groundui |13,531 |13,531 |18,016 |200,094 |883,274 |Grounding & Counting |[@zheng_agentstudio_2025] |
155
+ |objects365_qa |1,742,287 |1,742,287 |12,329,259 |135,681,680 |2,146,619,635 |Grounding & Counting |[@shao_objects365_2019] (Converted) |
156
  |oodvqa |8,488 |8,488 |8,488 |227,028 |8,488 |Grounding & Counting |[@tu_how_2023] |
157
  |tallyqa |98,680 |98,680 |183,986 |2,674,306 |370,282 |Grounding & Counting |[@acharya_tallyqa_2019] |
158
  |clevr |70,000 |70,000 |699,989 |19,277,813 |1,570,525 |Mathematics |[@lindstrom_clevr-math_2022-1] |
 
187
  |latex_handwritten |39,583 |39,583 |39,583 |390,343 |1,874,733 |Naive OCR |[@noauthor_im2latex_nodate] |
188
  |latexformulas |552,340 |552,340 |552,340 |5,138,603 |43,094,747 |Naive OCR |[@noauthor_oleehyolatex-formulas_2024] |
189
  |maptext |200 |200 |799 |9,434 |70,813 |Naive OCR |[@barney_smith_icdar_2024] |
190
+ |mathwriting-google |300,000 |300,000 |300,000 |2,461,270 |5,954,806 |Naive OCR |[@gervais_mathwriting_2025] (Converted) |
191
  |memotion |6,991 |6,991 |6,991 |194,718 |177,429 |Naive OCR |[@sharma_semeval-2020_2020] |
192
  |orand_car_a |1,999 |1,999 |1,999 |43,978 |9,035 |Naive OCR |[@diem_icfhr_2014] |
193
  |rendered_text |10,000 |10,000 |10,000 |85,879 |244,183 |Naive OCR |[@noauthor_wendlercrenderedtext_2024] |
 
198
  |SynthFormulaNet |499,997 |499,997 |499,997 |1,999,631 |51,215,097 |Naive OCR |[@nassar_smoldocling_2025] |
199
  |tal_ocr_eng |256,646 |256,646 |256,646 |3,385,012 |7,465,207 |Naive OCR |[@noauthor_httpsai100talcomdataset_nodate] |
200
  |wordart |19,066 |4,804 |4,804 |78,032 |54,263 |Naive OCR |[@avidan_toward_2022] |
201
+ |olmOCR-mix-0225-documents |228,864 |228,864 |228,858 |2,197,147 |163,194,337 |Naive OCR |[@poznanski_olmocr_2025] (Converted) |
202
+ |olmOCR-mix-0225-books |15,194 |15,194 |15,194 |145,750 |7,962,779 |Naive OCR |[@poznanski_olmocr_2025] (Converted) |
203
  |a_okvqa |54,602 |54,602 |54,602 |1,065,188 |360,990 |OCR QA |[@avidan_-okvqa_2022] |
204
  |aokvqa |16,539 |16,539 |17,056 |743,458 |218,917 |OCR QA |[@avidan_-okvqa_2022] |
205
  |arxivqa |100,000 |100,000 |100,000 |7,022,001 |6,422,269 |OCR QA |[@li_multimodal_2024] |
 
262
  |text_theoremqa |0 |800 |800 |50,065 |3,468 |Text-only |[@chen_theoremqa_2023] |
263
  |text_wizardlm_evol |0 |69,999 |69,999 |7,753,963 |21,955,856 |Text-only |[@noauthor_wizardlmteamwizardlm_evol_instruct_70k_2024] |
264
  |text_OpenMathInstruct-2 |0 |1,000,000 |1,000,000 |74,905,850 |413,132,418 |Text-only |[@toshniwal_openmathinstruct-2_2024] |
265
+ |**Totals** |17,372,293 |24,322,193 |88,928,343 |3,168,958,417 |9,459,677,828 | | |
266
 
267
  </Accordion>
268
  </Wide>
269
 
270
  ### Cleaning
271
+ After gathering all the sub-datasets, we clean every turn. We remove all individual turns whose combined question and answer length exceeds 8192 tokens, resize large images so that their longest side is at most 2048 pixels while keeping the aspect ratio, and discard samples with corrupted images.
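A minimal sketch of these two cleaning rules. The tokenizer choice (SmolLM2's) and the helper names are assumptions rather than the exact pipeline code:

```python
# Sketch of the cleaning rules: drop over-long turns, cap image resolution.
from PIL import Image
from transformers import AutoTokenizer

MAX_TURN_TOKENS = 8192
MAX_SIDE = 2048
tokenizer = AutoTokenizer.from_pretrained("HuggingFaceTB/SmolLM2-360M-Instruct")

def keep_turn(question: str, answer: str) -> bool:
    """Keep a turn only if question + answer stay within 8192 tokens."""
    return len(tokenizer(question + answer)["input_ids"]) <= MAX_TURN_TOKENS

def resize_image(img: Image.Image) -> Image.Image:
    """Resize so the longest side is at most 2048 px, preserving aspect ratio."""
    longest = max(img.size)
    if longest <= MAX_SIDE:
        return img
    scale = MAX_SIDE / longest
    return img.resize((round(img.width * scale), round(img.height * scale)), Image.LANCZOS)
```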
 
 
 
 
 
 
 
 
 
 
 
 
 
 
272
 
273
+ ### Rating
274
+ Finally, we rate every single turn in our dataset across 4 axes.
275
+ For this, we used an LLM- and VLM-as-a-judge pipeline (Qwen3-32B and Qwen2.5-VL-32B) to rate every turn on a scale from 1-5 in these 4 categories, with a rough sketch of such a judge call shown after the list:
276
+ - Text Formatting Quality: How is the quality of the answer both linguistically and structurally? (Question and Answer)
277
+ - Question-Answer Relevance: Does the answer properly respond to the question? (Question and Answer)
278
+ - Visual Dependency: How much does the question depend on visual information to be answered? (Question only)
279
+ - Image-Question Correspondence: How well does the image support answering the question? (Image and Question)
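For the image-dependent axes, a single rating call can look roughly like the sketch below, which queries a Qwen2.5-VL judge through an OpenAI-compatible endpoint (for example a local vLLM server). The server URL, prompt wording, and answer parsing are illustrative assumptions, not our exact judging pipeline.

```python
# Rough sketch of a VLM-as-a-judge call via an OpenAI-compatible endpoint.
# Server URL, prompt wording, and model name are illustrative assumptions.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

JUDGE_PROMPT = (
    "Rate the following question-answer turn from 1 (worst) to 5 (best) for {axis}. "
    "Reply with a single integer.\n\nQuestion: {q}\nAnswer: {a}"
)

def rate_turn(question: str, answer: str, image_url: str, axis: str) -> int:
    response = client.chat.completions.create(
        model="Qwen/Qwen2.5-VL-32B-Instruct",
        messages=[{
            "role": "user",
            "content": [
                {"type": "image_url", "image_url": {"url": image_url}},
                {"type": "text", "text": JUDGE_PROMPT.format(axis=axis, q=question, a=answer)},
            ],
        }],
        max_tokens=4,
        temperature=0.0,
    )
    return int(response.choices[0].message.content.strip())
```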
280
 
281
+ This is the distribution of scores across the different filters for FineVision.
282
+ | Filter | 1 | 2 | 3 | 4 | 5 |
283
+ |-----------------------|----- |----- |----- |----- |----- |
284
+ | Formatting | 0.5 | 0.7 | 1.1 | 77.5 | 20.3 |
285
+ | Relevance | 2.9 | 0.5 | 14.7 | 16.5 | 65.4 |
286
+ | Visual Dependency | 11.0 | 20.4 | 2.6 | 24.2 | 41.8 |
287
+ | Image Correspondence | 8.1 | 3.6 | 17.3 | 26.8 | 44.1 |
288
 
289
+ ### FineVision Base Dataset
290
+ We classify FineVision’s subsets into 9 categories: Captioning & Knowledge, Chart & Table, General VQA, Grounding & Counting, Mathematics, Naive OCR, OCR QA, Science, Text-only.
291
 
292
+ There are multiple ways to count the data in a multimodal dataset. The most common are the number of samples and the number of images. Additionally, a single sample can consist of multiple question/answer pairs in the form of a multi-turn conversation, so the number of turns is a third measure. Similarly to text-only datasets, the number of answer tokens is also interesting, since these are the tokens the model is actually trained on. We count all these characteristics for FineVision and arrive at 17.3M images, 24.3M samples, 88.9M turns, and 9.5B answer tokens. Based on these 4 distributions, multiple different mixtures are possible. In conjunction with the provided ratings, we encourage the community to create their own mixtures and experiment with the data. For example, large categories could be downsampled, while high-quality data could be upsampled with techniques such as rephrasing.
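As an illustration, a quality-filtered mixture can be built by filtering on the per-turn ratings while streaming. The rating field names used below are assumptions; consult the dataset card for the exact schema.

```python
# Illustrative sketch of building a custom mixture by filtering on per-turn
# ratings while streaming. Rating field names are assumptions.
from datasets import load_dataset

ds = load_dataset("HuggingFaceM4/FineVision", name="coco_colors",
                  split="train", streaming=True)

def high_quality(sample, min_score: int = 4) -> bool:
    # Keep a sample only if every turn clears the threshold on the
    # (assumed) relevance and formatting ratings.
    return all(
        turn.get("relevance_rating", 5) >= min_score
        and turn.get("formatting_rating", 5) >= min_score
        for turn in sample["texts"]
    )

filtered = ds.filter(high_quality)
print(next(iter(filtered)))
```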
293
+ After collecting and processing the data, we run multiple experiments and ablations to provide practical recommendations on how to train small, data-centric VLMs.
294
 
 
295
  <br/>
296
  <Wide>
297
+ <HtmlEmbed src="d3-pie.html" desc="Figure 1: Distribution of Categories in FineVision by Answer Tokens, Number of Samples, Turns, and Images. While the distributions differ a bit with the different metrics, FineVision provides a good baseline mixture especially when judging by the number of images in the individual categories. Samples from Chart & Table usually lend themselves well to multi turn conversations, since multiple similar questions can be asked for a single Chart. Samples from OCR QA often have a lot of answer tokens, since they aim at detailed document understanding, which are rarely answered with a short sentence." align="center" />
298
  </Wide>
299
 
300
  ## Experimental Setup
301
+ To ensure a fair comparison between different configurations, we use the same setup and evaluations for all of our ablations. This enables us to compare FineVision to other publicly available datasets as well as experiment with different intra-dataset configurations.
 
 
 
 
 
 
 
 
 
302
 
303
+ ### Model Architecture: [nanoVLM](https://github.com/huggingface/nanoVLM)
304
+ For all ablations and experiments, we train a 460M parameter VLM, since it provides a good trade-off between training time and model performance. We utilize the lightweight nanoVLM training framework with [SmolLM2-360M-Instruct](https://huggingface.co/HuggingFaceTB/SmolLM2-360M-Instruct) as the text backbone and [SigLIP2-Base-512](https://huggingface.co/google/siglip-base-patch16-512) as the vision encoder. We experimented with a classic 2-stage training schedule, where the first stage mainly trains the Modality Projection to align the Language and Image Embeddings, and the second stage trains the whole model. Interestingly, at our model size and training duration we did not observe any significant benefits from this additional first stage compared to training the whole model directly, so we settled on single-stage training for most ablations.
305
 
306
  ### Baseline Datasets
307
+ We compare our dataset against 3 popular open-source alternatives: **[The Cauldron](https://huggingface.co/datasets/HuggingFaceM4/the_cauldron)**, **[LLaVA-OneVision](https://huggingface.co/datasets/lmms-lab/LLaVA-OneVision-Data)** and **[Cambrian-7M](https://huggingface.co/datasets/nyu-visionx/Cambrian-10M)**.
308
 
309
+ | Name | Samples | Images | Answer Tokens | Turns |
310
+ |---------------|---------|---------|----------------|-------|
311
+ | Cauldron | 1.8M | 2.0M | 0.3B | 27.8M |
312
+ | Llava-Vision | 3.9M | 2.5M | 1.0B | 9.1M |
313
+ | Cambrian-7M | 7M | 5.4M | 0.8B | 12.2M |
314
+ | FineVision | 24.3M | 17.3M | 9.5B | 88.9M |
 
 
 
315
 
316
  ### Evaluations
317
+ We utilize [lmms-eval](https://github.com/EvolvingLMMs-Lab/lmms-eval) during training to evaluate our ablations in a reproducible manner. We evaluate on a diverse set of 11 benchmarks: AI2D, ChartQA, DocVQA, InfoVQA, MME, MMMU, MMStar, OCRBench, ScienceQA, TextVQA and Seedbench. Since these benchmarks cover different topics and produce results on different scales, e.g. AI2D returns the accuracy of the exact matches (0-100), but MME returns a continuous score (0-2800), aggregating them is not trivial. In our ablations the relative performance between the different configurations matters, so we determine the rank of each model compared to the others in every benchmark at every training step and average it over all the benchmarks. This way we can judge where different configurations rank among each other over the course of training, and how big the difference between them is.
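A small sketch of this rank averaging at a single evaluation step, with made-up scores (here, a higher raw score is better on every benchmark and the best configuration receives the highest rank, matching the figures below):

```python
# Sketch of the rank averaging used to compare configurations at one
# evaluation step. Scores are made up; on every benchmark the best
# configuration receives the highest rank.
def average_ranks(scores: dict[str, dict[str, float]]) -> dict[str, float]:
    configs = list(scores)
    benchmarks = list(next(iter(scores.values())))
    totals = {c: 0 for c in configs}
    for b in benchmarks:
        ordered = sorted(configs, key=lambda c: scores[c][b])  # worst ... best
        for rank, c in enumerate(ordered, start=1):
            totals[c] += rank
    return {c: totals[c] / len(benchmarks) for c in configs}

print(average_ranks({
    "FineVision": {"ai2d": 52.0, "docvqa": 43.0},
    "Cauldron":   {"ai2d": 48.0, "docvqa": 41.0},
}))  # {'FineVision': 2.0, 'Cauldron': 1.0}
```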
318
 
319
+ ### Training Configuration
320
+ Each of our ablations trains the same 460M model with a maximal image size of 1536x1536 pixels (without resizing smaller images) and a maximal input token length of 4096. This results in a maximum batch size of 2 on a single H100, which we compensate for with 8 steps of gradient accumulation for an effective batch size of 16. In all single-stage configurations we train for 20k steps on 32 H100s for approximately 20h, evaluating all 11 benchmarks every 1k steps. If not specified otherwise, the “Baseline” in our intra-dataset ablations refers to a training run on the full unfiltered and unchanged dataset. In this configuration, a full epoch of the unfiltered FineVision dataset takes 12k steps.
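To make the batch-size arithmetic concrete, here is a toy gradient-accumulation loop (generic PyTorch, not nanoVLM's actual training loop): a per-device batch of 2 accumulated over 8 steps yields 16 samples per optimizer update on each GPU.

```python
import torch
from torch import nn

# Toy stand-ins for the real model and dataloader: per-device batch size 2,
# 8 accumulation steps -> an effective batch of 16 samples per update.
model = nn.Linear(32, 1)
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-3)
batches = [(torch.randn(2, 32), torch.randn(2, 1)) for _ in range(16)]

accum_steps = 8
optimizer.zero_grad()
for i, (x, y) in enumerate(batches):
    loss = nn.functional.mse_loss(model(x), y) / accum_steps  # average over the window
    loss.backward()
    if (i + 1) % accum_steps == 0:
        optimizer.step()        # one update per 8 micro-batches
        optimizer.zero_grad()
```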
 
 
 
 
 
 
 
321
 
322
  ## Experiments
323
+ While there are a lot of interesting questions that could be investigated, we mainly focus on the aspects of the training that are influenced by the data. Before we dive into the internal details of FineVision, let’s have a look at our performance against the baselines.
 
 
324
 
325
+ ### How does FineVision compare to other open datasets?
326
+ Here we see the first interesting trend: VLMs still benefit from training on a larger, more diverse dataset than what was available until today. FineVision doesn't lead the race in the first few thousand training steps; after all, it includes new tasks such as pointing and agentic browsing, so it shouldn't be expected to lead at first. But after seeing enough varied data, FineVision clearly shows the best performance across a wide set of benchmarks, which can be seen in its average ranking <a href="#against-baselines">(Fig. 2)</a>. One epoch of FineVision in our setup takes 12k training steps, so we train for close to 2 epochs in these ablations. Looking at the average benchmark score, we can see that the models saturate at different points: around 18k steps for Cambrian, 12k for LLaVa and 7k for the Cauldron.
327
+ In particular, over 11 different benchmarks, FineVision achieves an average improvement of 40.7% over the Cauldron, 12.1% over Cambrian, and 46.3% over LLaVa, which increases to 51.3%, 18.6% and 58.0% when comparing the deduplicated versions of the datasets. Additionally, FineVision includes data for tasks such as agentic browsing, counting, and pointing, which are not part of the other baselines.
328
 
329
+ <HtmlEmbed id="against-baselines" src="against-baselines.html" desc="Figure 2: Average Rank of Models trained on different open source datasets. FineVision shows both the highest average rank as well as the highes average over benchmarks." />
330
 
331
+ ### How much test data is in publicly available datasets?
332
+ We investigate data leakage by finding test-set images that appear in the training data. For this, we constructed an image deduplication pipeline and used it to compare all images in FineVision against the images of 66 benchmarks available in the lmms-eval framework.
333
 
334
+ For the comparison, we embed the images using the SSCD descriptor and compute the cosine similarity between a given image in FineVision and all test-set embeddings. Whenever a sample has a similarity higher than a threshold of **0.95**, it is assumed to be a duplicate.
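In essence, the check is a thresholded cosine-similarity search over L2-normalized embeddings, as in the sketch below (random arrays stand in for the real 512-dimensional SSCD descriptors):

```python
# Sketch of the duplicate check: cosine similarity between L2-normalized
# SSCD embeddings, flagging anything above 0.95.
import numpy as np

THRESHOLD = 0.95

def flag_duplicates(train_emb: np.ndarray, test_emb: np.ndarray) -> np.ndarray:
    """Boolean mask over train_emb marking suspected test-set duplicates."""
    train = train_emb / np.linalg.norm(train_emb, axis=1, keepdims=True)
    test = test_emb / np.linalg.norm(test_emb, axis=1, keepdims=True)
    sims = train @ test.T               # cosine similarity, all pairs
    return sims.max(axis=1) >= THRESHOLD

mask = flag_duplicates(np.random.randn(1000, 512), np.random.randn(200, 512))
print(int(mask.sum()), "suspected duplicates")
```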
335
 
336
+ While our tests with various thresholds show that this is still flagging more false-positives than false-negatives, given the scale of data we have, we preferred to err on the side of caution.
 
 
 
 
 
 
 
 
 
337
 
338
+ Below is an example of a correctly identified duplicate (“Photo”), a false-positive with a similarity score above 0.95 (“Chart”), and a false-negative with a similarity score below 0.95 (“Drawing”) <a href="#comparison">(Fig. 3)</a>.
339
 
340
+ We open-source the deduplication pipeline [here](https://github.com/huggingface/large-scale-image-deduplication) as well as the precomputed test-set embeddings [here](https://huggingface.co/datasets/HuggingFaceM4/lmms-eval-embeddings).
341
 
 
342
  <br/>
343
  <Wide>
344
+ <HtmlEmbed id="comparison" src="comparison.html" align="center" desc="Figure 3: Examples of the Deduplication Results."/>
345
  </Wide>
346
 
347
  | Name | Samples | Contamination Rate | Performance Drop |
 
351
  | Cambrian-7M | 7.0M | 2.29% | 2.78% |
352
  | FineVision | 24.3M | 1.02% | 1.45% |
353
 
354
+ We repeated this deduplication procedure on all the baselines to analyse how contaminated they are. We found that all baselines contain between 2% and 3% images from test benchmarks, and removing them results in a performance drop of 2.4-2.8%. Interestingly, for some benchmarks the difference is negligible, while others suffer significantly: for example, after deduplication, ScienceQA falls by 14.49% on average while OCRBench only drops by 1.08%.
355
+ This deduplication also shows that FineVision contains the smallest relative amount of duplicated data, at roughly 1%, and suffers the smallest performance drop over all benchmarks after deduplication, at just 1.45%.
356
 
357
+ Additionally, we experimented with removing all found samples from all datasets to see if the outcome is different from <a href="#against-baselines">Fig. 2</a>, but we observe the same distribution <a href="#against-baselines-deduplicated">(Fig. 4)</a>.
 
 
358
 
359
+ <HtmlEmbed id="against-baselines-deduplicated" src="against-baselines-deduplicated.html" desc="Figure 4: Average Rank of Models trained on different deduplicated open source datasets. Even after deduplicating all dataset, FineVision shows the best performance." />
360
 
361
  ### How diverse are the datasets?
362
+ Similarly to the comparison of size, we also wanted to evaluate the datasets for diversity. Evaluating the diversity of a dataset is a field of study in itself, which we will not dive into here; rather, we borrow techniques from computer vision and use the already computed SSCD embeddings as a proxy for visual diversity. To avoid relying on a subsample of the dataset when estimating diversity, we analyse the covariance matrix of the full set of embeddings. From this covariance matrix, we can calculate the eigenvalues for analysis. We get the effective rank of the covariance matrix, which measures how uniformly the variance is distributed across dimensions, as well as the participation ratio, which measures how many dimensions actively contribute to the overall variance. To obtain a single **diversity score** for the datasets, we normalize the effective rank and participation ratio by the embedding dimension and compute their geometric mean (a small sketch of this computation follows the table below). We observe that FineVision is not only the biggest, but also the most diverse dataset. You can also clearly see that more images do not necessarily result in more diversity: LLaVa is substantially less diverse than the Cauldron, despite having more images.
363
 
364
+ | Name | Images | Effective Rank | Participation Ratio | Diversity |
365
+ |---------------|--------|------------ |---------- |---------- |
366
+ | Cauldron | 2.0M | 324.05 | 129.22 | 0.400 |
367
+ | LLaVa-Vision | 2.5M | 267.89 | 87.05 | 0.298 |
368
+ | Cambrian-7M | 5.4M | 359.73 | 152.70 | 0.458 |
369
+ | FineVision | 17.3M | 359.22 | 182.52 | 0.500 |
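A minimal sketch of this diversity score, computing the covariance directly in memory with random data in place of the real SSCD embeddings (over the full datasets, the covariance has to be accumulated in a streaming fashion):

```python
# Sketch of the diversity score: effective rank and participation ratio of
# the embedding covariance, normalized by the embedding dimension and
# combined via a geometric mean.
import numpy as np

def diversity_score(embeddings: np.ndarray) -> float:
    d = embeddings.shape[1]
    cov = np.cov(embeddings, rowvar=False)
    eig = np.clip(np.linalg.eigvalsh(cov), 1e-12, None)
    p = eig / eig.sum()
    effective_rank = np.exp(-(p * np.log(p)).sum())           # entropy-based uniformity
    participation_ratio = eig.sum() ** 2 / (eig ** 2).sum()   # concentration-based breadth
    return float(np.sqrt((effective_rank / d) * (participation_ratio / d)))

print(diversity_score(np.random.randn(10_000, 512)))  # close to 1.0 for isotropic noise
```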
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
370
 
371
  ### Should you merge multiple questions for the same image into a single multi turn conversation?
372
+ Since the training of a VLM already builds upon pretrained vision and language backbones, datasets are usually not completely unstructured, but follow an image+question and answer structure. Recent works have shown that consolidating multiple questions for the same image into a multi-turn conversation where the image is shown only once improves model performance, reduces training budget, and reduces the datasets’ memory footprint. We therefore experiment with deduplicating every image in our dataset internally using the same SSCD descriptors, manually inspect the resulting clusters, and merge fitting samples into a multi-turn conversation.
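A rough sketch of the merging step: samples assigned to the same SSCD cluster are collapsed into one conversation that keeps a single representative image. The `cluster_id`, `image`, and `texts` fields are illustrative, not the exact schema of our merging code.

```python
# Sketch of merging near-duplicate images into one multi-turn conversation.
# Field names ('cluster_id', 'image', 'texts') are illustrative.
from collections import defaultdict

def merge_clusters(samples):
    merged = defaultdict(lambda: {"image": None, "texts": []})
    for s in samples:
        entry = merged[s["cluster_id"]]
        entry["image"] = entry["image"] or s["image"]   # keep one representative image
        entry["texts"].extend(s["texts"])               # append all question-answer turns
    return list(merged.values())

# Two samples sharing a cluster collapse into one 2-turn conversation:
print(merge_clusters([
    {"cluster_id": 0, "image": "imgA", "texts": [{"user": "Q1", "assistant": "A1"}]},
    {"cluster_id": 0, "image": "imgA", "texts": [{"user": "Q2", "assistant": "A2"}]},
]))
```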
373
 
374
+ When training with the same training budget, we find that both models perform very similarly. Some benchmarks favor one image/several turns, while others favor one image/one turn. Given this, we decide to release the dataset without merging multiple questions for the same image, and open-source the pipeline in case users want to explore this further.
 
 
 
 
 
 
375
 
376
  ---
377
+ <HtmlEmbed src="internal-deduplication.html" desc="Figure 5: Average Ranking of Models trained with internally deduplicated / merged samples. No clear benefit in merging ca be seen with respect to model performance." />
378
  ---
379
 
380
  ### Should you train on multilingual data if your language backbone was not?
381
+ There are some multilingual datasets in our mixture, but since our Language Backbone is only trained on English data, we experimented with removing all the multilingual, mainly Chinese, subsets. Our results show that there is a slight advantage in keeping the multilingual data, even if it was not part of the Language Backbone's initial training. We believe this reinforces our hypothesis that more diversity in the dataset is generally preferable for VLM training. In our training setup with this configuration, one epoch over the whole non-deduplicated dataset equals ~12k steps, so the benefit of unseen languages only materializes after the first full epoch.
 
 
 
 
 
382
 
383
  ---
384
+ <HtmlEmbed src="remove-ch.html" desc="Figure 6: Average Rank of Models trained with and without multilingual samples. Keeping samples in unseen langauges improves performance after the first epoch." />
385
  ---
386
 
387
  ### How can you assess the quality of the dataset?
388
+ The usual goal for every dataset, to collect samples of the highest possible quality, is quite an abstract endeavour in practice, especially for multimodal datasets. Additionally, different training stages usually have different qualitative and quantitative requirements. Finally, tuning the mixture of different categories also depends on how much data of what quality is available. For image-text datasets, there are 3 different combinatorial ways to evaluate a sample: text-only, image-only, and image-text correspondence. The question persists: how do you actually measure the quality of a sample, especially if you have to do so in 3 different ways? We propose doing so by leveraging both an LLM and a VLM as judges.
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
389
 
390
  To try to quantify the quality of the training data and the effect it has on the model’s performance, we run extensive ablations on our generated ratings.
391
 
392
  ---
393
+ <HtmlEmbed src="all-ratings.html" desc="Figure 6: Average Rank of Models trained with samples that have all 4 ratings above a certain threshold. Keeping all samples results in the best performance." />
394
  ---
395
 
396
+ Interestingly, both when training only on turns whose 4 ratings all clear a certain threshold and when filtering on a single rating at a time, we observe the same behaviour. Simply training on the most diverse data, the one containing all samples, performs best on the benchmarks. This could mean multiple things.
397
+ Firstly, we see almost the same ordering of ranks across all filters: from best to worst as the rating threshold increases. For example, the visual dependency and the image correspondence ratings both result in exactly the same distribution of rankings, corresponding to the natural order of thresholds, 1 through 5. This could indicate that with a sufficiently large dataset that you train on for long enough, it hurts more to remove samples, even if they were judged to be of low quality, than to train on them.
 
 
 
 
 
 
 
 
 
 
 
 
398
 
399
+ Additionally, the notion of quality for VLM datasets is nuanced in general. If we compare training a VLM and an LLM, training the VLM is closer in nature to the SFT stage than the pre-training stage of an LLM. We do not train on crawls of internet data; instead we train on individual samples of image-question-answer pairs, and these datapoints are usually ‘curated rather than collected’. We also do not train on trillions of tokens, but on billions. This means that the datasets for VLMs usually already have a certain baseline quality. Since FineVision is mainly a collection of common VLM datasets, combined with a few newly created ones in low-resource domains, this baseline quality holds here as well. Our ratings may therefore be trying to measure and quantify subtle nuances in the quality of image-question-answer pairs, whereas the more meaningful quality signal may simply be the binary one of using curated SFT datasets in the first place and then training on as much data as possible.
400
 
401
+ Alternatively, while we used state-of-the-art open source models to judge our datapoints, we still had to find a compromise between model quality and cost due to the sheer effort required to rate every single turn of FineVision. The chosen models could simply not be powerful enough to recognize and judge the quality of samples.
402
+ Even though our first proposal to judge the quality of multimodal data on a per-turn basis did not yield any improvement in model performance, we believe that this is still an exciting and important direction of research and hope the release of FineVision encourages the community to develop techniques for this at large scale.
 
 
 
 
 
 
 
 
 
 
 
403
 
404
  <Wide>
405
+ <HtmlEmbed src="filters-quad.html" title="Model Performance After Applying Individual Filters" desc="Figure 7: Comparison across thresholds for all four filters individually: Formatting, Relevance, Visual Dependency, and Image-Question Correspondence." align="center" />
406
  </Wide>
407
 
408
  ### Should you train in multiple stages?
409
+ The standard training procedure of a VLM usually follows at least two stages. First, you train only the connecting module, potentially together with the image encoder, and then you train the whole model in a second stage. Some work has even introduced an additional Stage 2.5, where you train the full model on a smaller subset of higher-quality data. To investigate this on small models, we experiment with single-, dual- and triple-stage training.
 
 
 
410
 
411
  ---
412
  #### 1 Stage vs 2 Stages
413
+ To evaluate whether pre-training the Modality Projection and the Vision Encoder provides any benefit to the final model performance, we conduct this experiment at a higher image resolution of 2048px and train substantially longer. We can see that even when training longer, the overall difference in model performance is quite small. Individual benchmarks do show differences (ScienceQA drops by 5% but OCRBench improves by 5% in the two-stage setup), so the better setup depends on the desired model capabilities. This also shows that evaluating a VLM (and, through this, selecting a training procedure) is not a straightforward task, since available benchmarks are limited proxies for the underlying model performance.
414
 
415
+ <HtmlEmbed src="ss-vs-s1.html" desc="Figure 8: Average Rank of a model trained for 60K steps in a single stage, and a model trained for the same 60k steps on top of pretraining the Modality Projection and Vision Encoder for 15k steps. The pre-training procedure is not depicted in this graph." />
 
 
 
416
 
417
  ---
418
  #### 2 Stages vs 2.5 Stages
 
420
 
421
  We take the baseline, and continue training for another 20k steps, both with the unfiltered (>= 1) as well as filtered subsets of **FineVision** according to our ratings.
422
 
423
+ <HtmlEmbed src="s25-ratings.html" desc="Figure 9: Average Rank if a model trained for an additional 20K steps on top of unfiltered training for 20K steps. Subselecting data for the final training steps does not yield a performance improvement with our quality measure." />
424
 
425
+ As in the previous results, we observe that the best outcome is simply achieved by training on as much and as diverse data as possible. Like before, this could also be due to the way we filter the data, and a different quality measure might yield different results.
426
 
427
  ## Conclusion
428
+ We introduce **FineVision**, a new state of the art open dataset to train VLMs, that is both **bigger and more diverse** than previous open source datasets. We provide extensive analysis regarding size, diversity, contamination and data-centric model training, and hope we can empower both further research and the community with this.
app/src/content/assets/data/against_baselines.csv CHANGED
@@ -1,3 +1,3 @@
1
  version https://git-lfs.github.com/spec/v1
2
- oid sha256:a5e6173a1541b9798278da1729f1e357c0711d2e270f68aa4af8eae962f146dd
3
- size 53573
 
1
  version https://git-lfs.github.com/spec/v1
2
+ oid sha256:764b11e74bc5f2c9f2552e4dbcd9822595ec8298c7bf29b25930e078205d1927
3
+ size 60723
app/src/content/assets/data/against_baselines_deduplicated.csv CHANGED
@@ -1,3 +1,3 @@
1
  version https://git-lfs.github.com/spec/v1
2
- oid sha256:56d18f581eff719023eb87c695e0e11770738d7872c8b9dac9bc23d9b0ef560b
3
- size 32738
 
1
  version https://git-lfs.github.com/spec/v1
2
+ oid sha256:2ff1f6a97f3e9860bbdb9c04037f56def02bcf976808ac3484e2b5d94bf8f497
3
+ size 48308
app/src/content/assets/data/all_ratings_luis.csv CHANGED
@@ -1,3 +1,3 @@
1
  version https://git-lfs.github.com/spec/v1
2
- oid sha256:1a47d8de2edf309fd39eb7e2ef5790d7f9c3ec4d5cc0f0c8680c12112f0d63e3
3
- size 63287
 
1
  version https://git-lfs.github.com/spec/v1
2
+ oid sha256:8506d7db8c75f35c0ec19ceb682598fba94d8eaa12b8fc0000ddb876029f56c9
3
+ size 71878
app/src/content/assets/data/formatting_filters.csv CHANGED
@@ -1,3 +1,3 @@
1
  version https://git-lfs.github.com/spec/v1
2
- oid sha256:e5218781e5f018891311410d684785a3c661ca3cd25d2ac62bf45e6bb7d69e78
3
- size 63268
 
1
  version https://git-lfs.github.com/spec/v1
2
+ oid sha256:edb5461e4255b2d17e238131ecfb2a73b1299f21183ded3b7731a9c0ad9a8644
3
+ size 71920
app/src/content/assets/data/image_correspondence_filters.csv CHANGED
@@ -1,3 +1,3 @@
1
  version https://git-lfs.github.com/spec/v1
2
- oid sha256:64a8af61666421e33d02bf0e52d9df576a6a831677910b3631e8b02069e380a6
3
- size 60206
 
1
  version https://git-lfs.github.com/spec/v1
2
+ oid sha256:128806fcf3bcc092dfdb99ba02ebbcf03efa111d3e27c8a39d1811a68aa5d84f
3
+ size 68509
app/src/content/assets/data/internal_deduplication.csv CHANGED
@@ -1,3 +1,3 @@
1
  version https://git-lfs.github.com/spec/v1
2
- oid sha256:d6b6bf0d84fe1bc67436c70f9a8d5919627e9c2bc9c3f931f4af80c01be22649
3
- size 47060
 
1
  version https://git-lfs.github.com/spec/v1
2
+ oid sha256:7a11ab47e01d2eb722b328b3a839ee954e7465df4983596457cb23a59a9b3785
3
+ size 26553
app/src/content/assets/data/relevance_filters.csv CHANGED
@@ -1,3 +1,3 @@
1
  version https://git-lfs.github.com/spec/v1
2
- oid sha256:69acb8bc0b80b2c664d821b1c06d67af315e67d8a706cf9e5d351e4468392cc6
3
- size 63236
 
1
  version https://git-lfs.github.com/spec/v1
2
+ oid sha256:4586c33143745d82016d6e0f9966c20399389962da29a4d553ae56a86ee85f11
3
+ size 71948
app/src/content/assets/data/remove_ch.csv CHANGED
@@ -1,3 +1,3 @@
1
  version https://git-lfs.github.com/spec/v1
2
- oid sha256:869fc4724af7e9c868b6024f472f9ae0f6468b74ef61db101438f80610828abb
3
- size 28837
 
1
  version https://git-lfs.github.com/spec/v1
2
+ oid sha256:227604c35283c80133f50a3db40becf375b12e3abcb8a6dbd951e67a31a30f0e
3
+ size 31160
app/src/content/assets/data/s25_ratings.csv CHANGED
@@ -1,3 +1,3 @@
1
  version https://git-lfs.github.com/spec/v1
2
- oid sha256:ca22654a0302da0ca335420b0a89cd770cea560b11f2a9f9f25927877d7ed231
3
- size 61626
 
1
  version https://git-lfs.github.com/spec/v1
2
+ oid sha256:af54d566f07b618e25f58ef03393514683b43fdde6ac08b10c963ad13bb0dec0
3
+ size 67302
app/src/content/assets/data/{ss_vs_s1.csv → ss_vs_s1_fullres.csv} RENAMED
@@ -1,3 +1,3 @@
1
  version https://git-lfs.github.com/spec/v1
2
- oid sha256:3f076631fcad76129ed8cab03c72a61965b465e1f3e7fa8dc68b7c7a9275616b
3
- size 28041
 
1
  version https://git-lfs.github.com/spec/v1
2
+ oid sha256:5c99dce5f3129460643380c68210cb5f3f0c0f1c399997aefc69058f95a8711c
3
+ size 43618
app/src/content/assets/data/visual_dependency_filters.csv CHANGED
@@ -1,3 +1,3 @@
1
  version https://git-lfs.github.com/spec/v1
2
- oid sha256:a967b10ba4a1034f4d6da250d267a6af51722c3f6dbae0ef0221a62d53502d69
3
- size 60114
 
1
  version https://git-lfs.github.com/spec/v1
2
+ oid sha256:44615f2d25e51fa3b975e1919b5b478c93018f4c72f2a6a2eb67b80c839fc1f8
3
+ size 68344
app/src/content/embeds/ss-vs-s1.html CHANGED
@@ -114,16 +114,16 @@
114
 
115
  // CSV: prefer public path, fallback to relative
116
  const CSV_PATHS = [
117
- '/data/ss_vs_s1.csv',
118
- './assets/data/ss_vs_s1.csv',
119
- '../assets/data/ss_vs_s1.csv',
120
- '../../assets/data/ss_vs_s1.csv'
121
  ];
122
  const fetchFirstAvailable = async (paths) => {
123
  for (const p of paths) {
124
  try { const r = await fetch(p, { cache: 'no-cache' }); if (r.ok) return await r.text(); } catch(e) {}
125
  }
126
- throw new Error('CSV not found: ss_vs_s1.csv');
127
  };
128
 
129
  // Controls UI
 
114
 
115
  // CSV: prefer public path, fallback to relative
116
  const CSV_PATHS = [
117
+ '/data/ss_vs_s1_fullres.csv',
118
+ './assets/data/ss_vs_s1_fullres.csv',
119
+ '../assets/data/ss_vs_s1_fullres.csv',
120
+ '../../assets/data/ss_vs_s1_fullres.csv'
121
  ];
122
  const fetchFirstAvailable = async (paths) => {
123
  for (const p of paths) {
124
  try { const r = await fetch(p, { cache: 'no-cache' }); if (r.ok) return await r.text(); } catch(e) {}
125
  }
126
+ throw new Error('CSV not found: ss_vs_s1_fullres.csv');
127
  };
128
 
129
  // Controls UI