mdw123's Collections · Papers
- Beyond Language Models: Byte Models are Digital World Simulators · arXiv:2402.19155 · 50 upvotes
- Griffin: Mixing Gated Linear Recurrences with Local Attention for Efficient Language Models · arXiv:2402.19427 · 53 upvotes
- VisionLLaMA: A Unified LLaMA Interface for Vision Tasks · arXiv:2403.00522 · 45 upvotes
- Resonance RoPE: Improving Context Length Generalization of Large Language Models · arXiv:2403.00071 · 23 upvotes
- Learning and Leveraging World Models in Visual Representation Learning · arXiv:2403.00504 · 32 upvotes
- AtP*: An efficient and scalable method for localizing LLM behaviour to components · arXiv:2403.00745 · 13 upvotes
- Learning to Decode Collaboratively with Multiple Language Models · arXiv:2403.03870 · 21 upvotes
- ShortGPT: Layers in Large Language Models are More Redundant Than You Expect · arXiv:2403.03853 · 63 upvotes
- GaLore: Memory-Efficient LLM Training by Gradient Low-Rank Projection · arXiv:2403.03507 · 185 upvotes
- ELLA: Equip Diffusion Models with LLM for Enhanced Semantic Alignment · arXiv:2403.05135 · 42 upvotes
- DeepSeek-VL: Towards Real-World Vision-Language Understanding · arXiv:2403.05525 · 43 upvotes
- Stealing Part of a Production Language Model · arXiv:2403.06634 · 91 upvotes
- MoAI: Mixture of All Intelligence for Large Language and Vision Models · arXiv:2403.07508 · 75 upvotes
- Branch-Train-MiX: Mixing Expert LLMs into a Mixture-of-Experts LLM · arXiv:2403.07816 · 40 upvotes
- Synth^2: Boosting Visual-Language Models with Synthetic Captions and Image Embeddings · arXiv:2403.07750 · 23 upvotes
- Chronos: Learning the Language of Time Series · arXiv:2403.07815 · 47 upvotes
- MM1: Methods, Analysis & Insights from Multimodal LLM Pre-training · arXiv:2403.09611 · 126 upvotes
- Veagle: Advancements in Multimodal Representation Learning · arXiv:2403.08773 · 9 upvotes
- (title missing) · arXiv:2309.16609 · 35 upvotes
- Qwen-VL: A Frontier Large Vision-Language Model with Versatile Abilities · arXiv:2308.12966 · 8 upvotes
- Uni-SMART: Universal Science Multimodal Analysis and Research Transformer · arXiv:2403.10301 · 52 upvotes
- LlamaFactory: Unified Efficient Fine-Tuning of 100+ Language Models · arXiv:2403.13372 · 67 upvotes
- The Unreasonable Ineffectiveness of the Deeper Layers · arXiv:2403.17887 · 79 upvotes
- InternLM2 Technical Report · arXiv:2403.17297 · 31 upvotes
- Jamba: A Hybrid Transformer-Mamba Language Model · arXiv:2403.19887 · 107 upvotes
- Transformer-Lite: High-efficiency Deployment of Large Language Models on Mobile Phone GPUs · arXiv:2403.20041 · 35 upvotes
- Localizing Paragraph Memorization in Language Models · arXiv:2403.19851 · 15 upvotes
- DiJiang: Efficient Large Language Models through Compact Kernelization · arXiv:2403.19928 · 12 upvotes
- Long-form factuality in large language models · arXiv:2403.18802 · 25 upvotes
- Mixture-of-Depths: Dynamically allocating compute in transformer-based language models · arXiv:2404.02258 · 104 upvotes
- Leave No Context Behind: Efficient Infinite Context Transformers with Infini-attention · arXiv:2404.07143 · 106 upvotes
- Pre-training Small Base LMs with Fewer Tokens · arXiv:2404.08634 · 35 upvotes
- Megalodon: Efficient LLM Pretraining and Inference with Unlimited Context Length · arXiv:2404.08801 · 66 upvotes
- SnapKV: LLM Knows What You are Looking for Before Generation · arXiv:2404.14469 · 24 upvotes
- FlowMind: Automatic Workflow Generation with LLMs · arXiv:2404.13050 · 34 upvotes
- (title missing) · arXiv:2412.15115 · 346 upvotes
- YuLan-Mini: An Open Data-efficient Language Model · arXiv:2412.17743 · 65 upvotes
- HuatuoGPT-o1, Towards Medical Complex Reasoning with LLMs · arXiv:2412.18925 · 97 upvotes
- Token-Budget-Aware LLM Reasoning · arXiv:2412.18547 · 46 upvotes
- DeepSeek-V3 Technical Report · arXiv:2412.19437 · 51 upvotes
- MiniMax-01: Scaling Foundation Models with Lightning Attention · arXiv:2501.08313 · 273 upvotes
- Evolving Deeper LLM Thinking · arXiv:2501.09891 · 106 upvotes
- SmolLM2: When Smol Goes Big -- Data-Centric Training of a Small Language Model · arXiv:2502.02737 · 187 upvotes
- Demystifying Long Chain-of-Thought Reasoning in LLMs · arXiv:2502.03373 · 51 upvotes
- LIMO: Less is More for Reasoning · arXiv:2502.03387 · 56 upvotes
- Can 1B LLM Surpass 405B LLM? Rethinking Compute-Optimal Test-Time Scaling · arXiv:2502.06703 · 132 upvotes
- The Differences Between Direct Alignment Algorithms are a Blur · arXiv:2502.01237 · 111 upvotes
- s1: Simple test-time scaling · arXiv:2501.19393 · 105 upvotes
- Qwen2.5-1M Technical Report · arXiv:2501.15383 · 61 upvotes
- DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning · arXiv:2501.12948 · 327 upvotes
- Kimi k1.5: Scaling Reinforcement Learning with LLMs · arXiv:2501.12599 · 96 upvotes