Collections including paper arxiv:2502.09245

- You Do Not Fully Utilize Transformer's Representation Capacity
  Paper • 2502.09245 • Published • 38
- LLM-Microscope: Uncovering the Hidden Role of Punctuation in Context Memory of Transformers
  Paper • 2502.15007 • Published • 175
- Transformers without Normalization
  Paper • 2503.10622 • Published • 167
- Forgetting Transformer: Softmax Attention with a Forget Gate
  Paper • 2503.02130 • Published • 32

- LLM Pruning and Distillation in Practice: The Minitron Approach
  Paper • 2408.11796 • Published • 59
- TableBench: A Comprehensive and Complex Benchmark for Table Question Answering
  Paper • 2408.09174 • Published • 53
- To Code, or Not To Code? Exploring Impact of Code in Pre-training
  Paper • 2408.10914 • Published • 43
- Open-FinLLMs: Open Multimodal Large Language Models for Financial Applications
  Paper • 2408.11878 • Published • 62

- CatLIP: CLIP-level Visual Recognition Accuracy with 2.7x Faster Pre-training on Web-scale Image-Text Data
  Paper • 2404.15653 • Published • 30
- MoDE: CLIP Data Experts via Clustering
  Paper • 2404.16030 • Published • 15
- MoRA: High-Rank Updating for Parameter-Efficient Fine-Tuning
  Paper • 2405.12130 • Published • 51
- Reducing Transformer Key-Value Cache Size with Cross-Layer Attention
  Paper • 2405.12981 • Published • 34

- LM2: Large Memory Models
  Paper • 2502.06049 • Published • 30
- Titans: Learning to Memorize at Test Time
  Paper • 2501.00663 • Published • 25
- SFT Memorizes, RL Generalizes: A Comparative Study of Foundation Model Post-training
  Paper • 2501.17161 • Published • 123
- You Do Not Fully Utilize Transformer's Representation Capacity
  Paper • 2502.09245 • Published • 38

- VILA^2: VILA Augmented VILA
  Paper • 2407.17453 • Published • 42
- Octopus v4: Graph of language models
  Paper • 2404.19296 • Published • 119
- Octo-planner: On-device Language Model for Planner-Action Agents
  Paper • 2406.18082 • Published • 49
- Dolphin: Long Context as a New Modality for Energy-Efficient On-Device Language Models
  Paper • 2408.15518 • Published • 43