Sailor2: Sailing in South-East Asia with Inclusive Multilingual LLMs Paper • 2502.12982 • Published 10 days ago • 13
Running 103 103 TxT360: Trillion Extracted Text 📖 Create a large, deduplicated dataset for LLM pre-training
SmolLM2: When Smol Goes Big -- Data-Centric Training of a Small Language Model Paper • 2502.02737 • Published 24 days ago • 197
Euclid: Supercharging Multimodal LLMs with Synthetic High-Fidelity Visual Descriptions Paper • 2412.08737 • Published Dec 11, 2024 • 53
METAGENE-1: Metagenomic Foundation Model for Pandemic Monitoring Paper • 2501.02045 • Published Jan 3 • 21
Diving into Self-Evolving Training for Multimodal Reasoning Paper • 2412.17451 • Published Dec 23, 2024 • 43
Global MMLU: Understanding and Addressing Cultural and Linguistic Biases in Multilingual Evaluation Paper • 2412.03304 • Published Dec 4, 2024 • 18
Compression, Transduction, and Creation: A Unified Framework for Evaluating Natural Language Generation Paper • 2109.06379 • Published Sep 14, 2021
Web2Code: A Large-scale Webpage-to-Code Dataset and Evaluation Framework for Multimodal LLMs Paper • 2406.20098 • Published Jun 28, 2024
O1 Replication Journey: A Strategic Progress Report -- Part 1 Paper • 2410.18982 • Published Oct 8, 2024 • 3
Jais and Jais-chat: Arabic-Centric Foundation and Instruction-Tuned Open Generative Large Language Models Paper • 2308.16149 • Published Aug 30, 2023 • 28
Running 103 103 TxT360: Trillion Extracted Text 📖 Create a large, deduplicated dataset for LLM pre-training