thibaud frere committed on
Commit dd02bcf · 2 Parent(s): 871106f 07f5cb0

Merge branch 'main' of hf.co:spaces/HuggingFaceM4/FineVision

app/src/components/Accordion.astro CHANGED
@@ -90,12 +90,13 @@ const wrapperClass = ["accordion", className].filter(Boolean).join(" ");
   list-style: none;
   display: flex;
   align-items: center;
-  justify-content: space-between;
+  justify-content: center;
   gap: 4px;
   padding: 4px;
   cursor: pointer;
   color: var(--text-color);
   user-select: none;
+  position: relative;
 }
 
 /* Remove conditional padding to avoid jump on close */
@@ -113,7 +114,8 @@ const wrapperClass = ["accordion", className].filter(Boolean).join(" ");
 }
 
 .accordion__chevron {
-  flex: 0 0 auto;
+  position: absolute;
+  right: 8px;
   transition: transform 220ms ease;
   opacity: .85;
 }
app/src/content/article.mdx CHANGED
@@ -27,9 +27,15 @@ import visualPoster from "./assets/images/visual-vocabulary-poster.png";
 import Accordion from '../components/Accordion.astro'
 
 <Sidenote>
-TLDR; Today, we release FineVision, a new multimodal dataset with 17M images, 24 million samples, 90M question-answer turns and 10B answer tokens. We have extensively cleaned, analysed, and rated every single turn across 4 qualitative metrics with a score from 1-5 to enable the construction and study of individual training mixtures.
+TLDR; Today, we release FineVision, a new multimodal dataset with 17M images, 24M samples, 90M question-answer turns, and 10B answer tokens, comprising 5TB of data. We have extensively cleaned, analysed, and rated every single turn across 4 qualitative metrics with a score from 1-5 to enable the construction and study of individual training mixtures.
 
-Additionally, we ran extensive ablations and compared the performance of models trained on our dataset with common open source alternatives, and achieved better model performance and higher quantity and diversity of data.
+Additionally, we ran extensive ablations and compared the performance of models trained on our dataset with common open-source alternatives. Our dataset is more diverse and achieves an average improvement of 35% across 10 common benchmarks over all baselines.
+
+To use the dataset, simply load it with:
+```python
+from datasets import load_dataset
+ds = load_dataset('HuggingFaceM4/FineVision', name='ai2d_merged', split='train', streaming=True)
+```
 </Sidenote>
 
 ## Introduction
@@ -39,7 +45,7 @@ Even though open-weights Vision-Language Models (VLMs) are becoming ever more po
 ### Data Collection
 We manually collect over 180 image-text datasets from the recent literature and create new subsets in lacking domains.
 
-<FullWidth>
+<Wide>
 <Accordion title="FineVision Subsets">
 |Subset Name |Total Images|Total Samples|Total Turns|Total Question Tokens|Total Answer Tokens|Category |
 |--------------------------------------|------------|-------------|-----------|---------------------|-------------------|----------------------|
@@ -229,7 +235,7 @@ We manually collect over 180 image-text datasets from the recent literature and
 |text_wizardlm_evol |0 |69,999 |69,999 |7,753,963 |21,955,856 |Text-only |
 |text_OpenMathInstruct-2 |0 |1,000,000 |1,000,000 |74,905,850 |413,132,418 |Text-only |
 </Accordion>
-</FullWidth>
+</Wide>
 
 ### Cleaning
 After gathering all the sub-datasets, every turn is cleaned. We remove all individual turns whose combined question and answer length exceeds 8192 tokens. We resize big images to have a longest side of 2048 pixels while keeping the aspect ratio, and discard images with corrupted metadata. This results in a clean final dataset with a maximum turn length of 8192 tokens and a maximum image dimension of 2048 pixels on the longest side.
@@ -241,9 +247,9 @@ There are multiple ways to count the data in a multimodal dataset. The most comm
 
 In total, FineVision has 17.3M images, 24.3M samples, 88.9M turns, and 9.5B answer tokens. Based on these 4 distributions, multiple different mixtures are possible. In conjunction with the provided ratings, we encourage the community to experiment with downsampling large categories, for example according to quality and diversity criteria, and with upsampling high quality samples in small categories.
 
-<FullWidth>
+<Wide>
 <HtmlEmbed src="d3-pie.html" desc="Distribution of Categories in FineVision" align="center" />
-</FullWidth>
+</Wide>
 
 ## Experimental Setup
 To evaluate how our dataset compares to other open-source datasets, we conduct various experiments.
@@ -261,15 +267,16 @@ To evaluate our ablations in a reproducible manner, we utilize lmms-eval during
 Each of our ablations trains a 450M model with a maximal image size of 1536x1536 pixels (without resizing smaller images) and a maximal input token length of 4096. In all single-stage configurations we train for 20k steps on 32 H100s for approximately 20h while evaluating all 11 benchmarks every 1k steps. If not specified otherwise, the “Baseline” in our intra-dataset ablations refers to a training run on the full unfiltered and unchanged dataset.
 
 ### How does FineVision compare against the Baselines?
-Compared against existing VLM training datasets, FineVision produces significantly higher benchmark ranks than the other options.
+Compared against existing VLM training datasets, FineVision produces significantly higher benchmark ranks than the other options. Over the 10 different metrics, FineVision achieves a 45.68% improvement over the Cauldron, a 13.04% improvement over Cambrian, and a 46.83% improvement over LLaVA.
 
 <HtmlEmbed src="against-baselines.html" desc="Average Rank of Models trained on different open source datasets." />
 
 ### How contaminated are the datasets?
-To investigate data leakage from benchmarks into this dataset, we construct a deduplication pipeline based on the sample images. We embed the images of 66 image-test datasets from the lmms-eval framework using the SSCD descriptor, and compute the cosine similarity between our samples and the test-set embeddings. Whenever a sample has a similarity higher than a threshold of 0.95 it is assumed to be a duplicate. While our tests with various thresholds show that this is flagging some samples that are not actual duplicates (especially if the image depicts similar but different images in detail, like graphs or tables), we preferred to err on the side of caution. We open-source the deduplication pipeline here as well as the precomputed test-set embedding’s here.
-
-<HtmlEmbed src="comparison.html" desc="desc" title="title"/>
+To investigate data leakage from benchmarks into this dataset, we construct a deduplication pipeline based on the sample images. We embed the images of 66 test sets from the lmms-eval framework using the SSCD descriptor and compute the cosine similarity between our samples and the test-set embeddings. Whenever a sample has a similarity higher than a threshold of 0.95, it is assumed to be a duplicate. While our tests with various thresholds show that this still flags more false positives than false negatives, we preferred to err on the side of caution. Below are examples of a correctly identified duplicate ("Photo"), a false positive with a similarity score above 0.95 ("Chart"), and a false negative with a similarity score below 0.95 ("Drawing"). We open-source the deduplication pipeline here, as well as the precomputed test-set embeddings here.
 
+<Wide>
+<HtmlEmbed src="comparison.html" desc="Examples of the Deduplication Pipeline."/>
+</Wide>
 
 | Name | Samples | Contamination Rate | Performance Drop |
 |---------------|---------|--------------------|------------------|
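The Cleaning step described in the article diff above amounts to two filters: a combined question-plus-answer token cap of 8192 and an image resize to a 2048-pixel longest side. A minimal sketch under those stated rules, assuming pre-tokenized turns and PIL images (the helper names are illustrative, not FineVision's released code):

```python
from PIL import Image

MAX_TURN_TOKENS = 8192   # combined question + answer budget per turn
MAX_SIDE = 2048          # longest image side after resizing

def keep_turn(n_question_tokens: int, n_answer_tokens: int) -> bool:
    """Drop turns whose combined question/answer length exceeds 8192 tokens."""
    return n_question_tokens + n_answer_tokens <= MAX_TURN_TOKENS

def clean_image(path: str):
    """Resize so the longest side is at most 2048 px, keeping the aspect
    ratio; return None for images that fail to decode (corrupted metadata)."""
    try:
        img = Image.open(path)
        img.load()  # force a full decode; raises on corrupted data
    except Exception:
        return None
    w, h = img.size
    if max(w, h) > MAX_SIDE:
        scale = MAX_SIDE / max(w, h)
        img = img.resize((round(w * scale), round(h * scale)), Image.LANCZOS)
    return img
```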
 
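The contamination check in the article diff above reduces to a max-cosine-similarity threshold over SSCD image descriptors. A minimal sketch of that comparison, assuming the descriptors are already computed (the tensor names and `flag_duplicates` helper are illustrative, not the released pipeline):

```python
import torch
import torch.nn.functional as F

def flag_duplicates(dataset_emb: torch.Tensor,
                    test_emb: torch.Tensor,
                    threshold: float = 0.95) -> torch.Tensor:
    """Mark dataset images whose best cosine similarity against any
    benchmark test image exceeds the threshold (assumed duplicates)."""
    a = F.normalize(dataset_emb, dim=-1)   # [N, D] FineVision descriptors
    b = F.normalize(test_emb, dim=-1)      # [M, D] benchmark test descriptors
    sim = a @ b.T                          # [N, M] pairwise cosine similarities
    return sim.max(dim=1).values > threshold

# Usage (illustrative): drop flagged samples before training.
# mask = flag_duplicates(dataset_emb, test_emb)
# keep_indices = (~mask).nonzero(as_tuple=True)[0]
```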
app/src/content/assets/data/banner_visualisation_data.csv CHANGED
@@ -1,3 +1,3 @@
 version https://git-lfs.github.com/spec/v1
-oid sha256:953ea855559ec3b033717b829136612710f306539077bd0fc41634f012df6065
-size 81486
+oid sha256:b19a66e4f5999c3dcf60140f5eeab57016ebeb3d277db08c16dd0b5bede35495
+size 81529
app/src/content/embeds/against-baselines-deduplicated.html CHANGED
@@ -414,7 +414,7 @@
   rankTickMax = Math.max(1, Math.round(maxVal));
   yScale.domain([rankTickMax, 1]);
 } else {
-  yScale.domain([0, Math.max(1, maxVal)]).nice();
+  yScale.domain([minVal, maxVal]).nice();
 }
 isRankStrictFlag = isRankStrict;
app/src/content/embeds/against-baselines.html CHANGED
@@ -411,7 +411,7 @@
   rankTickMax = Math.max(1, Math.round(maxVal));
   yScale.domain([rankTickMax, 1]);
 } else {
-  yScale.domain([0, Math.max(1, maxVal)]).nice();
+  yScale.domain([minVal, maxVal]).nice();
 }
 isRankStrictFlag = isRankStrict;
app/src/content/embeds/all-ratings.html CHANGED
@@ -411,7 +411,7 @@
   rankTickMax = Math.max(1, Math.round(maxVal));
   yScale.domain([rankTickMax, 1]);
 } else {
-  yScale.domain([0, Math.max(1, maxVal)]).nice();
+  yScale.domain([minVal, maxVal]).nice();
 }
 isRankStrictFlag = isRankStrict;
app/src/content/embeds/comparison.html CHANGED
@@ -111,13 +111,12 @@
 const meta = document.createElement('div'); meta.className = 'meta';
 
 if (isQuery) {
-  const label = document.createElement('span'); label.className = 'label'; label.textContent = 'Query';
+  const label = document.createElement('span'); label.className = 'value'; label.textContent = 'Query';
   meta.appendChild(label);
 } else {
-  const label = document.createElement('span'); label.className = 'label'; label.textContent = `Match ${idx}:`;
-  const similarity = document.createElement('span'); similarity.className = 'value'; similarity.textContent = `Similarity: ${formatSim(sim)}`;
-  meta.appendChild(label);
-  meta.appendChild(similarity);
+  const content = document.createElement('span');
+  content.innerHTML = `<span class="value">Match ${idx}</span><br><span class="label">Similarity: ${formatSim(sim)}</span>`;
+  meta.appendChild(content);
 }
 
 card.appendChild(media); card.appendChild(meta); grid.appendChild(card);
app/src/content/embeds/formatting-filters.html CHANGED
@@ -411,7 +411,7 @@
   rankTickMax = Math.max(1, Math.round(maxVal));
   yScale.domain([rankTickMax, 1]);
 } else {
-  yScale.domain([0, Math.max(1, maxVal)]).nice();
+  yScale.domain([minVal, maxVal]).nice();
 }
 isRankStrictFlag = isRankStrict;
app/src/content/embeds/image-correspondence-filters.html CHANGED
@@ -411,7 +411,7 @@
   rankTickMax = Math.max(1, Math.round(maxVal));
   yScale.domain([rankTickMax, 1]);
 } else {
-  yScale.domain([0, Math.max(1, maxVal)]).nice();
+  yScale.domain([minVal, maxVal]).nice();
 }
 isRankStrictFlag = isRankStrict;
app/src/content/embeds/internal-deduplication.html CHANGED
@@ -411,7 +411,7 @@
   rankTickMax = Math.max(1, Math.round(maxVal));
   yScale.domain([rankTickMax, 1]);
 } else {
-  yScale.domain([0, Math.max(1, maxVal)]).nice();
+  yScale.domain([minVal, maxVal]).nice();
 }
 isRankStrictFlag = isRankStrict;
app/src/content/embeds/relevance-filters.html CHANGED
@@ -411,7 +411,7 @@
   rankTickMax = Math.max(1, Math.round(maxVal));
   yScale.domain([rankTickMax, 1]);
 } else {
-  yScale.domain([0, Math.max(1, maxVal)]).nice();
+  yScale.domain([minVal, maxVal]).nice();
 }
 isRankStrictFlag = isRankStrict;
app/src/content/embeds/remove-ch.html CHANGED
@@ -411,7 +411,7 @@
   rankTickMax = Math.max(1, Math.round(maxVal));
   yScale.domain([rankTickMax, 1]);
 } else {
-  yScale.domain([0, Math.max(1, maxVal)]).nice();
+  yScale.domain([minVal, maxVal]).nice();
 }
 isRankStrictFlag = isRankStrict;
app/src/content/embeds/s25-ratings.html CHANGED
@@ -411,7 +411,7 @@
   rankTickMax = Math.max(1, Math.round(maxVal));
   yScale.domain([rankTickMax, 1]);
 } else {
-  yScale.domain([0, Math.max(1, maxVal)]).nice();
+  yScale.domain([minVal, maxVal]).nice();
 }
 isRankStrictFlag = isRankStrict;
app/src/content/embeds/ss-vs-s1.html CHANGED
@@ -411,7 +411,7 @@
   rankTickMax = Math.max(1, Math.round(maxVal));
   yScale.domain([rankTickMax, 1]);
 } else {
-  yScale.domain([0, Math.max(1, maxVal)]).nice();
+  yScale.domain([minVal, maxVal]).nice();
 }
 isRankStrictFlag = isRankStrict;
app/src/content/embeds/visual-dependency-filters.html CHANGED
@@ -411,7 +411,7 @@
   rankTickMax = Math.max(1, Math.round(maxVal));
   yScale.domain([rankTickMax, 1]);
 } else {
-  yScale.domain([0, Math.max(1, maxVal)]).nice();
+  yScale.domain([minVal, maxVal]).nice();
 }
 isRankStrictFlag = isRankStrict;
app/src/styles/_layout.css CHANGED
@@ -88,8 +88,8 @@ main > nav:first-of-type { display: none; }
 .full-width { box-sizing: border-box; position: relative; z-index: var(--z-elevated); }
 
 .wide {
-  /* Target up to ~1100px while staying within viewport minus page gutters */
-  width: min(1100px, 100vw - 32px);
+  /* Target up to ~1400px while staying within viewport minus page gutters */
+  width: min(1400px, 100vw - 32px);
   margin-left: 50%;
   transform: translateX(-50%);
 }