Daily Papers
Simple linear attention language models balance the recall-throughput tradeoff
Paper • 2402.18668 • Published • 18
Note
- Focus: the recall-throughput tradeoff in attention-based LMs.
- Proposed: "Based", combining linear attention and sliding-window attention (both primitives sketched below).
- Results: matches Mamba in perplexity; improves recall tasks by 6.22 accuracy points.
- Efficiency: IO-aware algorithms achieve 24× higher throughput than FlashAttention-2 for 1.3B models generating 1024 tokens.
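A minimal sketch (not the authors' code) of the two attention primitives the note mentions, in naive per-token PyTorch form. How Based interleaves them across layers, its exact feature map, and its IO-aware kernels are not reproduced; the ELU-based feature map and function names here are illustrative assumptions.

```python
import torch


def causal_linear_attention(q, k, v, feature_map=lambda x: torch.nn.functional.elu(x) + 1):
    # q, k, v: (seq, dim). Linear attention replaces softmax(q k^T) with phi(q) phi(k)^T,
    # so the key-value statistics can be carried as a fixed-size running state.
    q, k = feature_map(q), feature_map(k)
    kv_state = torch.zeros(k.shape[-1], v.shape[-1])   # running sum of outer(k_t, v_t)
    z_state = torch.zeros(k.shape[-1])                  # running sum of k_t (normalizer)
    out = []
    for t in range(q.shape[0]):
        kv_state = kv_state + torch.outer(k[t], v[t])
        z_state = z_state + k[t]
        out.append(q[t] @ kv_state / (q[t] @ z_state + 1e-6))
    return torch.stack(out)


def sliding_window_attention(q, k, v, window=64):
    # Exact softmax attention where each query only attends to the last `window` keys.
    seq, d = q.shape
    out = []
    for t in range(seq):
        lo = max(0, t - window + 1)
        scores = (q[t] @ k[lo:t + 1].T) / d ** 0.5
        out.append(torch.softmax(scores, dim=-1) @ v[lo:t + 1])
    return torch.stack(out)
```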
Linear Transformers with Learnable Kernel Functions are Better In-Context Models
Paper • 2402.10644 • Published • 79
Note
- Introduces the ReBased model with a learnable kernel; outperforms Based on the MQAR task and in language modeling on the Pile dataset.
- Incorporates Layer Normalization into the kernel function.
- Significant improvement demonstrated across sequence lengths [128, 256, 512, 1024, 2048].
- Attention-matrix analysis shows a closer resemblance to vanilla attention than Based.
- Chosen kernel: (gamma * norm(x) + beta)^2 (sketched below).
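A minimal sketch of the kernel from the last bullet, assuming it is applied to queries and keys before the linear-attention product; per-dimension parameter shapes are an illustrative choice.

```python
import torch
import torch.nn as nn


class ReBasedFeatureMap(nn.Module):
    """phi(x) = (gamma * norm(x) + beta)^2 with learnable per-dimension gamma and beta."""

    def __init__(self, dim: int):
        super().__init__()
        self.norm = nn.LayerNorm(dim, elementwise_affine=False)
        self.gamma = nn.Parameter(torch.ones(dim))
        self.beta = nn.Parameter(torch.zeros(dim))

    def forward(self, x):
        # Applied to q and k before computing phi(q) @ phi(k)^T in place of softmax scores.
        return (self.gamma * self.norm(x) + self.beta) ** 2
```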
Repeat After Me: Transformers are Better than State Space Models at Copying
Paper • 2402.01032 • Published • 22
Note
- Transformers outperform generalized state space models (GSSMs) on copying tasks, which fundamentally depend on retrieving the input context.
- Empirical tests show Transformers' superiority on synthetic, shuffled, and natural-language strings, preserving efficiency across varying input lengths (a toy version of the copy task is sketched below).
- GSSMs struggle with memory-intensive tasks; the fixed-size state limits practicality despite the architecture's promise.
- Evaluations involve models with ~160M parameters, leveraging positional-encoding variations.
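A toy version of the copy task (my own construction, not the paper's exact setup): the model sees a random token string followed by a delimiter and must reproduce the string, which requires retrieving the full input.

```python
import random


def make_copy_example(vocab_size=26, length=32, copy_token=0):
    # Random string over tokens 1..vocab_size-1, then a "copy" delimiter;
    # the target is the string itself.
    src = [random.randrange(1, vocab_size) for _ in range(length)]
    return src + [copy_token], src
```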
Zoology: Measuring and Improving Recall in Efficient Language Models
Paper • 2312.04927 • Published • 2
Note
- Attention-free language models built from gating and convolutions are gaining popularity.
- Gated-convolution architectures underperform attention models by up to 2.1 perplexity points on the Pile.
- A 70M-parameter attention model outclasses a 1.4B-parameter gated-convolution model on associative recall.
- A new task, multi-query associative recall (MQAR), is formulated to close the gap (a toy instance is sketched below).
- Convolution-attention hybrids with input-dependent sparse attention patterns can close 97.4% of the gap.
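A toy MQAR instance (illustrative token layout, not the paper's exact format): the context lists key-value pairs, then several keys reappear as queries and the model must recall each associated value.

```python
import random


def make_mqar_example(num_pairs=8, num_queries=4):
    keys = random.sample(range(100, 200), num_pairs)          # disjoint key / value vocabularies
    vals = random.sample(range(200, 300), num_pairs)
    context = [tok for kv in zip(keys, vals) for tok in kv]   # k1 v1 k2 v2 ...
    query_keys = random.sample(keys, num_queries)
    answers = [vals[keys.index(k)] for k in query_keys]
    return context + query_keys, answers
```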
Can Mamba Learn How to Learn? A Comparative Study on In-Context Learning Tasks
Paper • 2402.04248 • Published • 30
Note
- SSMs like Mamba and Transformers are compared on in-context learning (ICL) capabilities.
- A Mamba + Transformer hybrid, MambaFormer, outperforms on tasks challenging for either model alone.
- Experiments span tasks like sparse parity and vector-valued MQAR; Mamba struggles on retrieval tasks.
- MambaFormer shows a best-of-both-worlds profile on ICL tasks, suggesting the potential of hybrid architectures.
FlashAttention: Fast and Memory-Efficient Exact Attention with IO-Awareness
Paper • 2205.14135 • Published • 11
Note
- FlashAttention introduces IO-aware exact attention, optimizing GPU HBM/SRAM access (the tiling idea is sketched below).
- Achieves a 15% speedup over the MLPerf 1.1 record on BERT-large, 3× on GPT-2 (1K seq. length), and 2.4× on Long Range Arena (1K-4K seq. length).
- Enables Transformers to solve the Path-X (16K seq. length, 61.4% accuracy) and Path-256 (64K seq. length, 63.1% accuracy) challenges.
- Employs tiling and recomputation for efficiency.
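A minimal sketch of the online-softmax/tiling idea in plain PyTorch for a single head without masking; the fused CUDA kernels, block-size tuning, and backward-pass recomputation that give FlashAttention its speed are not reproduced here.

```python
import torch


def tiled_attention(q, k, v, block=128):
    # q, k, v: (seq, dim). Keys/values are processed in tiles; running row-max and
    # row-sum statistics let us rescale earlier partial results so the final output
    # equals exact softmax attention without ever materializing the full score matrix.
    seq, d = q.shape
    out = torch.zeros_like(v)
    row_max = torch.full((seq,), float("-inf"))
    row_sum = torch.zeros(seq)
    for start in range(0, seq, block):
        kb, vb = k[start:start + block], v[start:start + block]
        scores = q @ kb.T / d ** 0.5                          # (seq, block) tile of logits
        new_max = torch.maximum(row_max, scores.max(dim=-1).values)
        correction = torch.exp(row_max - new_max)             # rescale previous partial results
        p = torch.exp(scores - new_max[:, None])
        row_sum = row_sum * correction + p.sum(dim=-1)
        out = out * correction[:, None] + p @ vb
        row_max = new_max
    return out / row_sum[:, None]
```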
Monarch Mixer: A Simple Sub-Quadratic GEMM-Based Architecture
Paper • 2310.12109 • Published • 1
Note
- Introduces the Monarch Mixer (M2) architecture, using Monarch matrices for sub-quadratic scaling in both sequence length and model dimension (a Monarch-style matmul is sketched below).
- Matches or surpasses baselines with fewer parameters: BERT-base (-27%), BERT-large (-24%), ViT-b (+1% accuracy, half the parameters).
- Develops a causality-enforcement strategy enabling causal sequence mixing, applicable to GPT-style models with a 0.2 PPL improvement on the Pile.
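A rough sketch of a Monarch-style structured matvec, assuming n = m^2 and m blocks of size m x m: reshape to a grid, apply one batched block multiplication, transpose, apply the second. This is a product of block-diagonal and permutation matrices in the Monarch spirit, not the paper's exact parameterization.

```python
import torch


def monarch_matvec(l_blocks, r_blocks, x):
    # l_blocks, r_blocks: (m, m, m) -- m blocks of size m x m; x: (m * m,)
    m = l_blocks.shape[0]
    grid = x.view(m, m)
    y = torch.einsum("bij,bj->bi", r_blocks, grid)       # block-diagonal multiply along rows
    z = torch.einsum("bij,bj->bi", l_blocks, y.t())      # permute (transpose), multiply again
    return z.t().reshape(-1)                             # total cost O(n^1.5) vs O(n^2) dense
```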
Lost in the Middle: How Language Models Use Long Contexts
Paper • 2307.03172 • Published • 37
Note
- Language models show a U-shaped performance curve on long-context tasks: highest when relevant information sits at the start or end of the context, dropping when it is in the middle.
- GPT-3.5-Turbo's multi-document QA performance drops over 20% when the relevant information is mid-context.
- Encoder-decoder models are robust within their training length, less so beyond it.
- Query-aware contextualization improves key-value retrieval but has minimal effect on multi-document QA.
Never Lost in the Middle: Improving Large Language Models via Attention Strengthening Question Answering
Paper • 2311.09198 • Published • 3
Note
- The "lost in the middle" issue is tackled with ASM QA, boosting LLMs on multi-doc QA.
- Ziya-Reader outperforms SOTA by up to 21.5% in passage retrieval and by 13.7% in shuffled settings.
- Employs Attention-Strengthening Multi-doc QA (ASM QA) for enhanced focus in long contexts.
- Benchmarks: multi-doc QA, synthesis tasks, and summarization on LongBench.
MoE-Mamba: Efficient Selective State Space Models with Mixture of Experts
Paper • 2401.04081 • Published • 70
Note
- MoE-Mamba outperforms Mamba and Transformer-MoE; matches Mamba's performance in 2.2× fewer training steps (Fig. 1).
- Demonstrates the efficiency gains of combining SSMs with MoE.
- Scales well with the number of experts; best result with 32 experts.
- Training setup detailed in Table 3; alternative designs explored but found suboptimal.
Switch Transformers: Scaling to Trillion Parameter Models with Simple and Efficient Sparsity
Paper • 2101.03961 • Published • 14
Note
- **Switch Transformer:** a simplified MoE model addressing complexity, communication cost, and training instability (top-1 routing sketched below).
- Achieves up to 7× pre-training speedups; the trillion-parameter model reaches a 4× speedup over T5-XXL on the Colossal Clean Crawled Corpus.
- Demonstrates superior scaling and fine-tuning benefits, with significant improvements in multilingual settings across 101 languages.
- Achieved through simplified (top-1) routing, reduced communication, and improved training techniques.
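A minimal sketch of top-1 ("switch") routing, assuming simple MLP experts and omitting the load-balancing auxiliary loss, capacity factors, and distributed dispatch that the paper focuses on.

```python
import torch
import torch.nn as nn


class SwitchLayer(nn.Module):
    def __init__(self, dim, num_experts, hidden):
        super().__init__()
        self.router = nn.Linear(dim, num_experts)
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(dim, hidden), nn.ReLU(), nn.Linear(hidden, dim))
            for _ in range(num_experts)
        )

    def forward(self, x):                              # x: (tokens, dim)
        probs = self.router(x).softmax(dim=-1)
        gate, idx = probs.max(dim=-1)                  # top-1 expert per token
        out = torch.zeros_like(x)
        for e, expert in enumerate(self.experts):
            mask = idx == e
            if mask.any():
                # the gate value keeps the routing decision differentiable
                out[mask] = gate[mask].unsqueeze(-1) * expert(x[mask])
        return out
```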
Accelerating LLM Inference with Staged Speculative Decoding
Paper • 2308.04623 • Published • 23
Hydragen: High-Throughput LLM Inference with Shared Prefixes
Paper • 2402.05099 • Published • 19
DeepSeekMoE: Towards Ultimate Expert Specialization in Mixture-of-Experts Language Models
Paper • 2401.06066 • Published • 43
Scaling Laws for Fine-Grained Mixture of Experts
Paper • 2402.07871 • Published • 11
Note
- Introduces "granularity" as a hyperparameter; adjusting it enhances MoE model efficiency.
- Proposes new scaling laws incorporating granularity, model size, and training tokens.
- Shows that an optimal granularity (G) makes compute-optimal MoE outperform dense Transformers.
- Empirical finding: a compute-optimal MoE at 10^20 FLOPs matches a dense Transformer trained with 20× the FLOPs.
LLMLingua: Compressing Prompts for Accelerated Inference of Large Language Models
Paper • 2310.05736 • Published • 4
Mixtral of Experts
Paper • 2401.04088 • Published • 158
Note
- Introduces Mixtral 8x7B, a sparse mixture-of-experts (SMoE) language model; outperforms Llama 2 70B and GPT-3.5 on benchmarks such as mathematics and code generation.
- Mixtral 8x7B-Instruct surpasses GPT-3.5 Turbo and Claude-2.1 on human benchmarks; achieves 70.6% on MMLU.
- Routing analysis shows no expert specialization across domains, but high temporal locality in expert assignment.
Mixture of Tokens: Efficient LLMs through Cross-Example Aggregation
Paper • 2310.15961 • Published • 1
Note
- Mixture of Tokens (MoT) addresses MoE challenges: training instability and load imbalance.
- MoT is fully differentiable, mixing tokens across examples for each expert and avoiding discrete routing operations (see the sketch below).
- Results show a 3× training-time reduction vs. a vanilla Transformer, promising for larger models.
- Experiments: GPT-style models on the C4 dataset for 250k steps, with significant reductions in training steps and time.
- Future focus: transitioning MoT toward MoE, and privacy considerations of cross-example mixing in autoregressive decoding, controlled via a temperature parameter.
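A rough sketch of the mixing idea as I read the note (simplified, not the paper's exact controller or grouping scheme): tokens from different sequences form a group, each expert processes a softmax-weighted mixture of the group's tokens instead of discretely routed tokens, and the expert output is redistributed with the same weights.

```python
import torch
import torch.nn as nn


class MixtureOfTokens(nn.Module):
    def __init__(self, dim, num_experts, hidden):
        super().__init__()
        self.controller = nn.Linear(dim, num_experts)
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(dim, hidden), nn.ReLU(), nn.Linear(hidden, dim))
            for _ in range(num_experts)
        )

    def forward(self, group):                                   # group: (group_size, dim)
        weights = self.controller(group).softmax(dim=0)         # per-expert mixing over tokens
        mixed = torch.einsum("ge,gd->ed", weights, group)       # one mixed token per expert
        processed = torch.stack([exp(mixed[e]) for e, exp in enumerate(self.experts)])
        return torch.einsum("ge,ed->gd", weights, processed)    # redistribute back to tokens
```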
Mixture-of-Depths: Dynamically allocating compute in transformer-based language models
Paper • 2404.02258 • Published • 104
Note
- Mixture-of-Depths (MoD) dynamically allocates transformer FLOPs per token, improving efficiency.
- Dynamic vs. static compute: top-k routing with "expert-choice" selection for load balance (sketched below).
- MoD matches or beats vanilla transformers in isoFLOP settings; up to 50% fewer FLOPs and 60% faster step times.
- Empirical analysis suggests routing every other block with 12.5% capacity works best.
- To target a budget of X FLOPs, train the model whose 12.5%-capacity configuration costs X FLOPs.
- MoDE: Mixture-of-Depths-and-Experts, combining MoD with MoE.
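A minimal sketch of one expert-choice routed block at 12.5% capacity, assuming `block` maps (tokens, dim) to (tokens, dim); the paper's handling of causal sampling at inference and its exact gating function are simplified here.

```python
import torch


def mixture_of_depths_step(block, router, x, capacity=0.125):
    # x: (seq, dim); router: e.g. a Linear(dim, 1); block: maps (tokens, dim) -> (tokens, dim)
    scores = router(x).squeeze(-1)                 # one routing score per token
    k = max(1, int(capacity * x.shape[0]))
    top_idx = scores.topk(k).indices               # the block "chooses" its k tokens (expert-choice)
    gate = torch.sigmoid(scores[top_idx]).unsqueeze(-1)
    out = x.clone()
    # selected tokens pass through the block (gated so routing stays differentiable);
    # all other tokens skip the block via the residual path
    out[top_idx] = x[top_idx] + gate * block(x[top_idx])
    return out
```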
BlackMamba: Mixture of Experts for State-Space Models
Paper • 2402.01771 • Published • 23
Note
- BlackMamba combines the Mamba SSM with MoE for linear-complexity generation and fast inference.
- Open-source: 340M/1.5B and 630M/2.8B models, trained on 300B tokens.
- Outperforms transformer baselines in inference and training FLOPs.
- Introduces a Sinkhorn-algorithm innovation for MoE routing, significantly reducing convergence iterations (the general idea is sketched below).
- Evaluation: competitive against pretrained LLMs; superior scaling evident on downstream tasks.
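A minimal sketch of generic Sinkhorn normalization of a router score matrix, the general idea the note refers to: alternately normalize over tokens and over experts so the soft assignment becomes approximately balanced. BlackMamba's specific fast-converging variant is not reproduced here.

```python
import torch


def sinkhorn(scores, n_iters=3, eps=1e-9):
    # scores: (tokens, experts) router logits
    a = torch.exp(scores)
    for _ in range(n_iters):
        a = a / (a.sum(dim=1, keepdim=True) + eps)   # each token's assignment mass sums to 1
        a = a / (a.sum(dim=0, keepdim=True) + eps)   # each expert's load is normalized
    return a
```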
DoRA: Weight-Decomposed Low-Rank Adaptation
Paper • 2402.09353 • Published • 26
HyperAttention: Long-context Attention in Near-Linear Time
Paper • 2310.05869 • Published • 2
Ring Attention with Blockwise Transformers for Near-Infinite Context
Paper • 2310.01889 • Published • 10
The Era of 1-bit LLMs: All Large Language Models are in 1.58 Bits
Paper • 2402.17764 • Published • 603
Yi: Open Foundation Models by 01.AI
Paper • 2403.04652 • Published • 62
Note
- The Yi model family by 01.AI extends 6B and 34B pretrained LMs to various applications, including chat and vision-language models.
- Data engineering yields strong human-preference rates on AlpacaEval and Chatbot Arena.
- Performance gains attributed to a high-quality 3.1-trillion-token pretraining corpus and iterated finetuning datasets.
- Highlights depth-upscaled models and 200K context extension, with notable benchmark performance.
sDPO: Don't Use Your Data All at Once
Paper • 2403.19270 • Published • 40
Note
- sDPO proposed for LLM alignment; outperforms other models in H4 score (74.31 vs. 72.67 for standard DPO on SOLAR 10.7B).
- Uses preference datasets stepwise, with previously aligned models serving as the reference model at each step (see the sketch below).
- Demonstrated on datasets like Ultrafeedback Cleaned and OpenOrca; benchmarks include ARC, HellaSwag, MMLU, and TruthfulQA.
- Open challenges: optimal data segmentation and expanding the model scope.
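A minimal sketch of the stepwise idea: the preference data is split into chunks, and each DPO round uses the previously aligned checkpoint as its frozen reference. `train_one_chunk` is a hypothetical training helper (optimizer, batching, and log-prob computation omitted); the loss below is the standard DPO formulation.

```python
import copy
import torch
import torch.nn.functional as F


def dpo_loss(policy_logp_w, policy_logp_l, ref_logp_w, ref_logp_l, beta=0.1):
    # logp_* are summed log-probabilities of the chosen (w) / rejected (l) responses.
    margin = (policy_logp_w - ref_logp_w) - (policy_logp_l - ref_logp_l)
    return -F.logsigmoid(beta * margin).mean()


def sdpo(model, dataset_chunks, train_one_chunk):
    reference = copy.deepcopy(model).eval()          # start from the SFT model as reference
    for chunk in dataset_chunks:
        train_one_chunk(model, reference, chunk, dpo_loss)
        reference = copy.deepcopy(model).eval()      # next step references the newly aligned model
    return model
```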
Long Range Arena: A Benchmark for Efficient Transformers
Paper • 2011.04006 • Published
Note
- Focus: the quadratic self-attention complexity of Transformers.
- Introduces the Long-Range Arena benchmark for evaluating efficient Transformers under long-context scenarios.
- Tasks include ListOps, document classification/retrieval, image classification, and Pathfinder, with sequences of 1K-16K tokens.
- Extensive comparison of ten models; BigBird shows consistent performance across tasks.
- No "one-size-fits-all" winner; trade-offs between model quality, speed, and memory.
Pythia: A Suite for Analyzing Large Language Models Across Training and Scaling
Paper • 2304.01373 • Published • 9
Griffin: Mixing Gated Linear Recurrences with Local Attention for Efficient Language Models
Paper • 2402.19427 • Published • 52
Effective Long-Context Scaling of Foundation Models
Paper • 2309.16039 • Published • 30
Note
tl;dr:
- Increase the RoPE base ("theta") from 10k to 1M+ (sketched below).
- Used by Yi for its 200K context window (they used 10M).
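A minimal sketch of the change: only the base in RoPE's inverse-frequency computation moves from the usual 10,000 to a much larger value, slowing the rotations so the same positional scheme stretches over longer contexts.

```python
import torch


def rope_frequencies(dim, base=1_000_000.0):
    # Standard RoPE inverse frequencies: 1 / base^(2i/dim).
    # base=10_000 is the common default; larger bases are used for long-context extension.
    return 1.0 / (base ** (torch.arange(0, dim, 2).float() / dim))
```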
Leave No Context Behind: Efficient Infinite Context Transformers with Infini-attention
Paper • 2404.07143 • Published • 104
Note
1. Infini-attention = linear long-term compressive memory plus local causal attention, for efficiently modeling both long- and short-range contextual dependencies (sketched below).
2. A minimal change to standard scaled dot-product attention; supports plug-and-play continual pre-training and long-context adaptation by design.
3. Infinitely long context with bounded memory, in a streaming fashion.
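A rough sketch of the two ingredients listed above, assuming an ELU+1 feature map, a simple additive memory update, and a scalar mixing gate; the paper's delta-rule update option and learned per-head gating are simplified away.

```python
import torch


def infini_attention_segment(q, k, v, memory, norm, gate):
    # q, k, v: (seg_len, dim); memory: (dim, dim); norm: (dim,); gate: scalar in [0, 1]
    sigma = lambda x: torch.nn.functional.elu(x) + 1
    # retrieve from the compressive memory built over previous segments
    mem_out = (sigma(q) @ memory) / (sigma(q) @ norm + 1e-6).unsqueeze(-1)
    # standard local causal attention within the current segment
    d = q.shape[-1]
    scores = (q @ k.T) / d ** 0.5
    scores = scores.masked_fill(torch.triu(torch.ones_like(scores), 1).bool(), float("-inf"))
    local_out = torch.softmax(scores, dim=-1) @ v
    # fold this segment into the memory, then mix the two streams
    memory = memory + sigma(k).T @ v
    norm = norm + sigma(k).sum(dim=0)
    return gate * mem_out + (1 - gate) * local_out, memory, norm
```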
GLU Variants Improve Transformer
Paper • 2002.05202 • Published • 1
Note
- Ablation study of various GLU variants in the Transformer feed-forward layer, with GeGLU coming out on top (sketched below).
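A minimal sketch of a GeGLU feed-forward block from the family of variants the paper ablates: the up-projection is gated by a GELU of a parallel projection before the down-projection.

```python
import torch
import torch.nn as nn


class GeGLUFFN(nn.Module):
    """FFN(x) = W2 (GELU(W x) * (V x)), the GeGLU variant of the Transformer FFN."""

    def __init__(self, dim, hidden):
        super().__init__()
        self.w = nn.Linear(dim, hidden, bias=False)
        self.v = nn.Linear(dim, hidden, bias=False)
        self.w2 = nn.Linear(hidden, dim, bias=False)

    def forward(self, x):
        return self.w2(torch.nn.functional.gelu(self.w(x)) * self.v(x))
```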
Thinking Like Transformers
Paper • 2106.06981 • Published
HGRN2: Gated Linear RNNs with State Expansion
Paper • 2404.07904 • Published • 17