Nuclear Norm Regularization for Deep Learning
Paper • 2405.14544 • Published
Understanding representations sheds light on optimization.
Note: The Cauchy–Schwarz inequality for matrices lets an element-wise Frobenius-norm penalty stand in for the nuclear norm, encouraging low-rank representations.
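A minimal numerical sketch (not code from the paper) of the bound behind this note: for any factorization M = AB, Cauchy–Schwarz on singular values gives ||AB||_* ≤ ||A||_F ||B||_F, and AM–GM gives ||A||_F ||B||_F ≤ ½(||A||_F² + ||B||_F²), so penalizing squared Frobenius norms of the factors upper-bounds the nuclear norm of the product:

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.standard_normal((8, 4))
B = rng.standard_normal((4, 8))
M = A @ B  # the low-rank product whose nuclear norm we want to control

# Nuclear norm: sum of singular values of M
nuclear = np.linalg.norm(M, ord="nuc")

# Frobenius surrogate: cheap, element-wise, differentiable everywhere
frob_surrogate = 0.5 * (np.linalg.norm(A, "fro") ** 2
                        + np.linalg.norm(B, "fro") ** 2)

# The surrogate upper-bounds the nuclear norm
assert nuclear <= frob_surrogate + 1e-9
```

Because the surrogate only needs element-wise sums of squares (no SVD), it is far cheaper to penalize during training than the nuclear norm itself.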
Note: Some tokens have more synonyms than others.
Note: Customized attention masks with optimized performance comparable to FlashAttention.
Note: Halve the KV cache by sharing value embeddings across attention blocks.
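One way to realize this note (a sketch of my own, not necessarily the scheme the note refers to): if the key projection W_k is square and invertible, V can be reconstructed from the cached K via V = K W_k⁻¹ W_v, so only K needs to be stored and the KV cache is halved. All names below (`W_k`, `W_v`, `X`) are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(0)
d = 16
X = rng.standard_normal((10, d))    # token embeddings (10 tokens, dim d)
W_k = rng.standard_normal((d, d))   # key projection, assumed square/invertible
W_v = rng.standard_normal((d, d))   # value projection

K = X @ W_k            # only K is kept in the cache
V_direct = X @ W_v     # what a standard KV cache would also store

# Since K = X W_k, we have X = K W_k^{-1}, hence V = K (W_k^{-1} W_v).
# W_kv can be precomputed once per layer.
W_kv = np.linalg.solve(W_k, W_v)
V_from_K = K @ W_kv    # values recovered on the fly from the shared cache

assert np.allclose(V_direct, V_from_K)
```

The trade-off is an extra d×d matmul per decoded token in exchange for storing half as much cache; it relies on W_k being invertible, which generic learned projections typically are.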