thibaud frere committed on
Commit cbeaba5 · 1 Parent(s): 09f2ef0

update link
Files changed (1)
  1. app/src/content/article.mdx +17 -13

app/src/content/article.mdx CHANGED
@@ -43,25 +43,29 @@ import visualPoster from "./assets/images/visual-vocabulary-poster.png";
 
  import Accordion from '../components/Accordion.astro'
 
- <Sidenote>
+ <p style="text-align: center;">
+ <a style="margin: 0 auto; font-weight: bold;" href="https://huggingface.co/datasets/HuggingFaceM4/FineVision">huggingface.co/datasets/HuggingFaceM4/FineVision</a>
+ </p>
+ <br/>
+
+ <div style="text-align:justify!important;">
  Today, we release **FineVision**, a new multimodal dataset with **24 million samples**. We created FineVision by collecting over 200 datasets containing 17M images, 89M question-answer turns, and 10B answer tokens, totaling **5TB of high-quality data**. We also extensively processed all datasets to unify their format, cleaned them of duplicates and poor data, and rated every turn with 32B VLMs on 4 qualitative metrics, each scored 1-5, to enable the construction and study of individual training mixtures.
 
  To enable everyone to construct state-of-the-art open Vision-Language Models (VLMs), we ran extensive ablations on FineVision and compared it to publicly available alternatives. Models trained on FineVision lead in performance across 11 common benchmarks against every baseline, thanks to FineVision’s scale and diversity of data.
+ </div>
 
- To use the dataset, simply load it with:
-
- <small class="muted">python</small>
- ```python
- from datasets import get_dataset_config_names, load_dataset
-
- # Get all subset names and load the first one
- available_subsets = get_dataset_config_names('HuggingFaceM4/FineVision')
- ds = load_dataset('HuggingFaceM4/FineVision', name=available_subsets[0], split='train', streaming=True)
-
- # Inspect the first sample (a streaming dataset is iterated, not indexed)
- next(iter(ds))
- ```
- </Sidenote>
+ To use the dataset, simply load it with:
+ <small class="muted">python</small>
+ ```python
+ from datasets import get_dataset_config_names, load_dataset
+
+ # Get all subset names and load the first one
+ available_subsets = get_dataset_config_names('HuggingFaceM4/FineVision')
+ ds = load_dataset('HuggingFaceM4/FineVision', name=available_subsets[0], split='train', streaming=True)
+
+ # Inspect the first sample (a streaming dataset is iterated, not indexed)
+ next(iter(ds))
+ ```
 
  ## Why this dataset?
  Even though open-weights Vision-Language Models are becoming ever more powerful, the accessibility of the training data used for these models lags behind. This data is often proprietary and inaccessible to the broader community. Projects like The Cauldron, LLaVA and Cambrian aim to provide such datasets, but are quickly outpaced by the speed of the field and the emergence of novel applications for VLMs, such as agentic tasks.
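
The article's snippet loads FineVision with `streaming=True`, which returns an iterable rather than an indexable dataset. A minimal sketch of safely peeking at the first few samples of such a stream — the `peek` helper is our own illustration, not part of the `datasets` API, and a local stand-in iterator is used so the sketch runs offline:

```python
from itertools import islice


def peek(stream, n=3):
    """Return the first n items of a (possibly huge) iterable without materializing it."""
    return list(islice(stream, n))


# Stand-in for iter(ds); with FineVision you would pass iter(ds) from the
# streaming load_dataset call instead of this fake generator.
fake_stream = ({"id": i} for i in range(1_000_000))
first = peek(fake_stream, 3)
print(first)  # [{'id': 0}, {'id': 1}, {'id': 2}]
```

`islice` consumes only the items it yields, so peeking at three samples downloads only those samples when backed by a real streaming dataset.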
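
The release notes that every turn was rated by 32B VLMs on 4 qualitative metrics, each scored 1-5, precisely to let users build their own training mixtures. A sketch of thresholding such scores to form a stricter mixture — the field name `relevance_rating` is hypothetical (check the dataset card for the actual column names), and the records below are toy stand-ins for FineVision turns:

```python
def filter_by_rating(samples, field="relevance_rating", min_score=4):
    """Keep samples whose quality score (hypothetical field name) meets the threshold."""
    return [s for s in samples if s.get(field, 0) >= min_score]


# Toy records standing in for rated question-answer turns:
toy = [
    {"text": "good turn", "relevance_rating": 5},
    {"text": "noisy turn", "relevance_rating": 2},
    {"text": "ok turn", "relevance_rating": 4},
]
kept = filter_by_rating(toy)
print([s["text"] for s in kept])  # ['good turn', 'ok turn']
```

With a real streaming dataset the same predicate would be passed to the dataset's `filter` method rather than a list comprehension, so filtering stays lazy.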