Article FineWeb2-C: Help Build Better Language Models in Your Language By davanstrien • 3 days ago • 10
Article Announcing Finance Commons and the Bad Data Toolbox: Pioneering Open Data and Advanced Document Processing By Pclanglais • Jul 19 • 18
Probably function calling datasets Collection Created using the https://huggingface.co/spaces/librarian-bots/dataset-column-search-api Space. • 39 items • Updated Jul 17 • 36
synthetic-data-generation-demos Collection A collection of demos for various approaches to synthetic data generation • 4 items • Updated Jun 25 • 13
sentence-transformers-from-synthetic-data Collection Example of using distilabel to generate synthetic triplets data for fine-tuning a Sentence Transformer model • 4 items • Updated Jun 21 • 21
Article 🦙⚗️ Using Llama3 and distilabel to build fine-tuning datasets By dvilasuero • Jun 4 • 73
StarCraftImage: A Dataset For Prototyping Spatial Reasoning Methods For Multi-Agent Environments Paper • 2401.04290 • Published Jan 9 • 3
Let's Go Shopping (LGS) -- Web-Scale Image-Text Dataset for Visual Concept Understanding Paper • 2401.04575 • Published Jan 9 • 14
AeroPath: An airway segmentation benchmark dataset with challenging pathology Paper • 2311.01138 • Published Nov 2, 2023 • 5
RadioGalaxyNET: Dataset and Novel Computer Vision Algorithms for the Detection of Extended Radio Galaxies and Infrared Hosts Paper • 2312.00306 • Published Dec 1, 2023 • 2
SynFundus: Generating a synthetic fundus images dataset with millions of samples and multi-disease annotations Paper • 2312.00377 • Published Dec 1, 2023 • 3
Enhancing Visually-Rich Document Understanding via Layout Structure Modeling Paper • 2308.07777 • Published Aug 15, 2023 • 2
smol models Collection Models where the model file (model.safetensors or pytorch_model.bin) is smaller than 50 MB • 58 items • Updated Jul 3 • 7