Collections including paper arxiv:2406.20094

- RLHF Workflow: From Reward Modeling to Online RLHF
  Paper • 2405.07863 • Published • 67
- Chameleon: Mixed-Modal Early-Fusion Foundation Models
  Paper • 2405.09818 • Published • 125
- Meteor: Mamba-based Traversal of Rationale for Large Language and Vision Models
  Paper • 2405.15574 • Published • 52
- An Introduction to Vision-Language Modeling
  Paper • 2405.17247 • Published • 84
- SPIQA: A Dataset for Multimodal Question Answering on Scientific Papers
  Paper • 2407.09413 • Published • 9
- Scaling Synthetic Data Creation with 1,000,000,000 Personas
  Paper • 2406.20094 • Published • 93
- Arboretum: A Large Multimodal Dataset Enabling AI for Biodiversity
  Paper • 2406.17720 • Published • 7
- LiveBench: A Challenging, Contamination-Free LLM Benchmark
  Paper • 2406.19314 • Published • 17
- Chain-of-Knowledge: Integrating Knowledge Reasoning into Large Language Models by Learning from Knowledge Graphs
  Paper • 2407.00653 • Published • 11
- Token Erasure as a Footprint of Implicit Vocabulary Items in LLMs
  Paper • 2406.20086 • Published • 3
- UnUnlearning: Unlearning is not sufficient for content regulation in advanced generative AI
  Paper • 2407.00106 • Published • 5
- MIRAI: Evaluating LLM Agents for Event Forecasting
  Paper • 2407.01231 • Published • 15
- The FineWeb Datasets: Decanting the Web for the Finest Text Data at Scale
  Paper • 2406.17557 • Published • 84
- Cambrian-1: A Fully Open, Vision-Centric Exploration of Multimodal LLMs
  Paper • 2406.16860 • Published • 55
- Arboretum: A Large Multimodal Dataset Enabling AI for Biodiversity
  Paper • 2406.17720 • Published • 7
- Scaling Synthetic Data Creation with 1,000,000,000 Personas
  Paper • 2406.20094 • Published • 93
- How Do Large Language Models Acquire Factual Knowledge During Pretraining?
  Paper • 2406.11813 • Published • 29
- From RAGs to rich parameters: Probing how language models utilize external knowledge over parametric information for factual queries
  Paper • 2406.12824 • Published • 20
- Tokenization Falling Short: The Curse of Tokenization
  Paper • 2406.11687 • Published • 14
- Iterative Length-Regularized Direct Preference Optimization: A Case Study on Improving 7B Language Models to GPT-4 Level
  Paper • 2406.11817 • Published • 13
- A Pretrainer's Guide to Training Data: Measuring the Effects of Data Age, Domain Coverage, Quality, & Toxicity
  Paper • 2305.13169 • Published • 3
- A Survey on Data Selection for Language Models
  Paper • 2402.16827 • Published • 3
- HuggingFaceFW/fineweb-edu
  Viewer • Updated • 3B • 138k • 473
- allenai/MADLAD-400
  Updated • 25 • 118
- MS MARCO Web Search: a Large-scale Information-rich Web Dataset with Millions of Real Click Labels
  Paper • 2405.07526 • Published • 16
- Automatic Data Curation for Self-Supervised Learning: A Clustering-Based Approach
  Paper • 2405.15613 • Published • 13
- A Touch, Vision, and Language Dataset for Multimodal Alignment
  Paper • 2402.13232 • Published • 13
- How Do Large Language Models Acquire Factual Knowledge During Pretraining?
  Paper • 2406.11813 • Published • 29
- Iterative Reasoning Preference Optimization
  Paper • 2404.19733 • Published • 46
- Better & Faster Large Language Models via Multi-token Prediction
  Paper • 2404.19737 • Published • 73
- ORPO: Monolithic Preference Optimization without Reference Model
  Paper • 2403.07691 • Published • 59
- KAN: Kolmogorov-Arnold Networks
  Paper • 2404.19756 • Published • 108