Cognition
Perception and abstraction. Each modality is tokenized and embedded into vectors for the model to comprehend.
Paper • 2407.17453 • Published • 38
Note: A general model is not great at specialized tasks. A narrow-domain fine-tuned checkpoint becomes better at specific tasks, and such local improvement can feed back into the full training dataset, achieving self-augmentation-based improvement. This is an interesting idea.
Octopus v4: Graph of language models
Paper • 2404.19296 • Published • 117
Note: Uses a small language model to search the graph and route queries to the domain expert.
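A minimal sketch of that routing step, under stated assumptions: the registry `EXPERTS` and the functions `embed` and `route` are hypothetical names, and the query embedding is a random stand-in for whatever the small router model actually produces. The point is only that routing reduces to a nearest-expert lookup in a shared embedding space.

```python
import numpy as np

# Hypothetical registry: each domain expert's description, embedded
# offline into a shared vector space by the small router model.
EXPERTS = {
    "math":    np.array([0.9, 0.1, 0.0]),
    "biology": np.array([0.1, 0.8, 0.2]),
    "law":     np.array([0.0, 0.2, 0.9]),
}

def embed(query: str) -> np.ndarray:
    """Stand-in for the small router model's query embedding."""
    rng = np.random.default_rng(abs(hash(query)) % 2**32)
    v = rng.random(3)
    return v / np.linalg.norm(v)

def route(query: str) -> str:
    """Send the query to the expert with the highest cosine similarity."""
    q = embed(query)
    return max(EXPERTS,
               key=lambda name: float(q @ EXPERTS[name]) / np.linalg.norm(EXPERTS[name]))

print(route("What is the derivative of x^2?"))  # e.g. 'math'
```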
Octo-planner: On-device Language Model for Planner-Action Agents
Paper • 2406.18082 • Published • 47
Note: Automatic flow engineering done by a fine-tuned 3B LLM, grounded in a selective set of API-based functions. The planning model performs task decomposition but does not make the specific calls, effectively doing flow (prompt) engineering. Topology in the plans is lacking, and the static plan-ahead approach is less robust (although it scores well on their curated 1k-example test dataset).
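To make the separation of concerns concrete, here is a minimal sketch (the prompt template and both model stubs are hypothetical, not taken from the paper): the planner only emits a numbered decomposition, and a separate action model grounds each step in an API call after the whole plan is already fixed, which is exactly the static plan-ahead limitation noted above.

```python
# Hypothetical planner/action split; both LLM stubs are stand-ins.
PLAN_PROMPT = "Decompose the task into numbered sub-steps:\n{task}"

def planner_llm(prompt: str) -> list[str]:
    """Stand-in for the fine-tuned 3B planning model: decompose only."""
    return ["1. Open the camera app", "2. Take a photo", "3. Share via email"]

def action_llm(step: str) -> str:
    """Stand-in for the action model that emits one function call per step."""
    return f"call_api(step={step!r})"

def run(task: str) -> list[str]:
    steps = planner_llm(PLAN_PROMPT.format(task=task))
    # Static plan-ahead: every step is fixed before any call executes.
    return [action_llm(s) for s in steps]

print(run("Email a photo to Alice"))
```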
Dolphin: Long Context as a New Modality for Energy-Efficient On-Device Language Models
Paper • 2408.15518 • Published • 42
Iterative Graph Alignment
Paper • 2408.16667 • Published • 2
Mini-Omni: Language Models Can Hear, Talk While Thinking in Streaming
Paper • 2408.16725 • Published • 52
MM1: Methods, Analysis & Insights from Multimodal LLM Pre-training
Paper • 2403.09611 • Published • 124
DeepSeek-VL: Towards Real-World Vision-Language Understanding
Paper • 2403.05525 • Published • 39
VideoAgent: Long-form Video Understanding with Large Language Model as Agent
Paper • 2403.10517 • Published • 31
LongLLaVA: Scaling Multi-modal LLMs to 1000 Images Efficiently via Hybrid Architecture
Paper • 2409.02889 • Published • 54
Law of Vision Representation in MLLMs
Paper • 2408.16357 • Published • 92
VITA: Towards Open-Source Interactive Omni Multimodal LLM
Paper • 2408.05211 • Published • 46
MiniCPM-V: A GPT-4V Level MLLM on Your Phone
Paper • 2408.01800 • Published • 76
NVLM: Open Frontier-Class Multimodal LLMs
Paper • 2409.11402 • Published • 71
WaveletGPT: Wavelets Meet Large Language Models
Paper • 2409.12924 • Published • 1
Note: Treats intermediate embedding sequences as a bundle of signals and applies 1D convolution along the temporal axis, similar in some sense to ConvMixer's manipulation; experiments are conducted on pre-training transformers. Interesting results are reported in the paper. Unfortunately no 'wave' is actually applied, and no 'periodic' information is captured.
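For intuition, a minimal sketch of the general operation the note describes: a depthwise 1D convolution run along the sequence axis of intermediate embeddings, so each embedding channel is filtered as its own temporal signal. The kernel size and placement are illustrative, not the paper's configuration.

```python
import torch
import torch.nn as nn

batch, seq_len, dim = 2, 16, 64
hidden = torch.randn(batch, seq_len, dim)   # intermediate embeddings

# Depthwise 1D conv: each embedding channel is treated as its own
# temporal signal, echoing the "bunch of signals" view above.
conv = nn.Conv1d(dim, dim, kernel_size=3, padding=1, groups=dim)

x = hidden.transpose(1, 2)                  # (batch, dim, seq_len) for Conv1d
smoothed = conv(x).transpose(1, 2)          # back to (batch, seq_len, dim)
print(smoothed.shape)                       # torch.Size([2, 16, 64])
```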
ClaimVer: Explainable Claim-Level Verification and Evidence Attribution of Text Through Knowledge Graphs
Paper • 2403.09724 • Published • 1
Learning Iterative Reasoning through Energy Diffusion
Paper • 2406.11179 • Published • 1
Note: Newton's introduction of gravity illustrates how understanding derivatives (knowing how things move rather than just where they are) enhances reasoning about the world. Large language models (LLMs), while excelling at compressing data distributions, struggle with reasoning. Reasoning involves grasping the 'abstract structure' of data. Therefore, by modeling derivatives of data distributions, could we improve LLMs' reasoning capabilities?
Learnable Fourier Features for Multi-Dimensional Spatial Positional Encoding
Paper • 2106.02795 • Published • 1
Molmo and PixMo: Open Weights and Open Data for State-of-the-Art Multimodal Models
Paper • 2409.17146 • Published • 99
Can LLMs Reason in the Wild with Programs?
Paper • 2406.13764 • Published • 1
MaskLLM: Learnable Semi-Structured Sparsity for Large Language Models
Paper • 2409.17481 • Published • 46
Programming Every Example: Lifting Pre-training Data Quality like Experts at Scale
Paper • 2409.17115 • Published • 59
Negating Negatives: Alignment without Human Positive Samples via Distributional Dispreference Optimization
Paper • 2403.03419 • Published • 1
Emu3: Next-Token Prediction is All You Need
Paper • 2409.18869 • Published • 89
Note: Tokenization unifies perception and generation; end-to-end training on discrete multi-modal signals enables both.
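A minimal sketch of the shared-vocabulary view (the vocabulary sizes and the BOI/EOI markers are hypothetical): shifting VQ codebook indices past the text ids yields one flat token stream that a vanilla decoder can model with plain next-token prediction, covering both perception (image tokens in the prefix) and generation (image tokens as targets).

```python
# Hypothetical vocabulary layout for a unified discrete token stream.
TEXT_VOCAB = 32_000          # ordinary BPE ids occupy [0, 32000)
IMAGE_VOCAB = 8_192          # VQ codebook ids are shifted past the text ids
BOI, EOI = 40_192, 40_193    # hypothetical begin/end-of-image markers

def image_to_tokens(vq_codes: list[int]) -> list[int]:
    """Shift VQ codebook indices into the shared vocabulary."""
    return [BOI] + [TEXT_VOCAB + c for c in vq_codes] + [EOI]

# One flat sequence: caption tokens followed by the image's tokens.
caption = [17, 943, 52]                          # pretend BPE ids
sequence = caption + image_to_tokens([5, 1042, 77, 8191])
print(sequence)   # a single stream a plain decoder can model next-token
```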
Can Models Learn Skill Composition from Examples?
Paper • 2409.19808 • Published • 8
Not All LLM Reasoners Are Created Equal
Paper • 2410.01748 • Published • 27
RATIONALYST: Pre-training Process-Supervision for Improving Reasoning
Paper • 2410.01044 • Published • 34
Intelligence at the Edge of Chaos
Paper • 2410.02536 • Published • 6
Note: Intelligence is very likely the ability to model higher-order derivatives given lower-order observations.
From Pixels to Tokens: Byte-Pair Encoding on Quantized Visual Modalities
Paper • 2410.02155 • Published • 1
Note: MLLMs usually project a continuous image embedding onto the hidden space of the LLM. Vector quantization (VQ) converts an image into discrete codes representing each of its patches; these tokens can be ported into the LLM in much the same fashion as text tokens, via new embedding vectors. A natural extension is therefore to re-use the BPE approach on these image tokens, which is precisely what happens in this work.
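A minimal sketch of BPE applied to VQ image codes (toy data; the paper's codebook size, merge schedule, and corpus differ): frequent adjacent patch-code pairs are merged into new composite tokens, exactly as byte pairs are merged for text.

```python
from collections import Counter

def most_frequent_pair(seq):
    """Find the most common adjacent pair of codes."""
    return Counter(zip(seq, seq[1:])).most_common(1)[0][0]

def merge(seq, pair, new_id):
    """Replace every occurrence of `pair` with the new composite token."""
    out, i = [], 0
    while i < len(seq):
        if i + 1 < len(seq) and (seq[i], seq[i + 1]) == pair:
            out.append(new_id)
            i += 2
        else:
            out.append(seq[i])
            i += 1
    return out

codes = [5, 7, 5, 7, 3, 5, 7, 3]   # VQ codes for one image, row-major
next_id = 1024                     # first id past the toy VQ codebook
for _ in range(2):                 # learn two merges
    pair = most_frequent_pair(codes)
    codes = merge(codes, pair, next_id)
    print(f"merged {pair} -> {next_id}: {codes}")
    next_id += 1
```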
Adaptive Inference-Time Compute: LLMs Can Predict if They Can Do Better, Even Mid-Generation
Paper • 2410.02725 • Published • 1
Selective Attention Improves Transformer
Paper • 2410.02703 • Published • 23
Note: "If two computer programs perform the same task, the shorter one is generally better." This principle, known as Occam's Razor, is a critical guideline for scientific discovery. Our best program today is the Transformer. Can we make it more efficient? Selective attention improves the Transformer by allowing each token to decide whether previous context is still relevant for future tokens.
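A minimal sketch of how such selective masking can work, following my reading of the mechanism (the relu scoring and the accumulation details are illustrative, not a faithful reproduction of the paper): each token emits non-negative scores against earlier tokens, the scores are accumulated down the sequence, and the running total is subtracted from the attention logits so deselected context fades out for later positions.

```python
import torch

n, d = 6, 8
q, k = torch.randn(n, d), torch.randn(n, d)
logits = q @ k.T / d**0.5                      # standard attention logits

s = torch.relu(logits)                         # selection scores, >= 0
s = torch.tril(s, diagonal=-1)                 # only earlier tokens selectable
f = torch.cumsum(s, dim=0)                     # accumulate selections over time
masked_logits = logits - f                     # penalize deselected tokens

causal = torch.tril(torch.ones(n, n)).bool()
masked_logits = masked_logits.masked_fill(~causal, float("-inf"))
attn = masked_logits.softmax(dim=-1)
print(attn[-1])   # final token's attention after selective masking
```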
FAN: Fourier Analysis Networks
Paper • 2410.02675 • Published • 24
EmbedLLM: Learning Compact Representations of Large Language Models
Paper • 2410.02223 • Published • 3
Model Comparisons: XNet Outperforms KAN
Paper • 2410.02033 • Published • 1
Don't flatten, tokenize! Unlocking the key to SoftMoE's efficacy in deep RL
Paper • 2410.01930 • Published • 1
Addition is All You Need for Energy-efficient Language Models
Paper • 2410.00907 • Published • 143
ε-VAE: Denoising as Visual Decoding
Paper • 2410.04081 • Published • 7
Note: I find it strange to view an encoder that produces embedding vectors as a type of tokenization: the transformer then effectively has two tokenization processes, a discrete one and then a continuous one?
Emergent properties with repeated examples
Paper • 2410.07041 • Published • 8
Note: Compression requires redundancy; otherwise it's just memorization.
Sparse Autoencoders Reveal Universal Feature Spaces Across Large Language Models
Paper • 2410.06981 • Published • 1
Executing Arithmetic: Fine-Tuning Large Language Models as Turing Machines
Paper • 2410.07896 • Published • 2
Derivative-Free Guidance in Continuous and Discrete Diffusion Models with Soft Value-Based Decoding
Paper • 2408.08252 • Published • 1
From Exploration to Mastery: Enabling LLMs to Master Tools via Self-Driven Interactions
Paper • 2410.08197 • Published • 1
Representation Alignment for Generation: Training Diffusion Transformers Is Easier Than You Think
Paper • 2410.06940 • Published • 4
LeanAgent: Lifelong Learning for Formal Theorem Proving
Paper • 2410.06209 • Published • 1
SimpleStrat: Diversifying Language Model Generation with Stratification
Paper • 2410.09038 • Published • 4
Retriever-and-Memory: Towards Adaptive Note-Enhanced Retrieval-Augmented Generation
Paper • 2410.08821 • Published • 1
Discrete Flow Matching
Paper • 2407.15595 • Published • 11
Simplifying, Stabilizing and Scaling Continuous-Time Consistency Models
Paper • 2410.11081 • Published • 16
EVOLvE: Evaluating and Optimizing LLMs For Exploration
Paper • 2410.06238 • Published • 1
Neural Metamorphosis
Paper • 2410.11878 • Published • 7
Planning Anything with Rigor: General-Purpose Zero-Shot Planning with LLM-based Formalized Programming
Paper • 2410.12112 • Published • 1
Steering Large Language Models between Code Execution and Textual Reasoning
Paper • 2410.03524 • Published • 1
A Scalable Communication Protocol for Networks of Large Language Models
Paper • 2410.11905 • Published • 1
Insights from the Inverse: Reconstructing LLM Training Goals Through Inverse RL
Paper • 2410.12491 • Published • 4
Revealing the Barriers of Language Agents in Planning
Paper • 2410.12409 • Published • 23
Learning to Compress: Local Rank and Information Compression in Deep Neural Networks
Paper • 2410.07687 • Published • 1
Grandmaster-Level Chess Without Search
Paper • 2402.04494 • Published • 67
Instruction-Driven Game Engine: A Poker Case Study
Paper • 2410.13441 • Published • 1
Transformer Guided Coevolution: Improved Team Formation in Multiagent Adversarial Games
Paper • 2410.13769 • Published • 1
Learning Graph Quantized Tokenizers for Transformers
Paper • 2410.13798 • Published • 1
Fine-Tuning Discrete Diffusion Models via Reward Optimization with Applications to DNA and Protein Design
Paper • 2410.13643 • Published
Learning to Route with Confidence Tokens
Paper • 2410.13284 • Published • 1
An Evolved Universal Transformer Memory
Paper • 2410.13166 • Published • 1
Artificial Kuramoto Oscillatory Neurons
Paper • 2410.13821 • Published • 1
TopoLM: brain-like spatio-functional organization in a topographic language model
Paper • 2410.11516 • Published • 1
Autoregressive Image Generation without Vector Quantization
Paper • 2406.11838 • Published • 2
LayerSkip: Enabling Early Exit Inference and Self-Speculative Decoding
Paper • 2404.16710 • Published • 73
DocETL: Agentic Query Rewriting and Evaluation for Complex Document Processing
Paper • 2410.12189 • Published • 1
SeerAttention: Learning Intrinsic Sparse Attention in Your LLMs
Paper • 2410.13276 • Published • 24
Do LLMs "know" internally when they follow instructions?
Paper • 2410.14516 • Published • 1
Duo-LLM: A Framework for Studying Adaptive Computation in Large Language Models
Paper • 2410.10846 • Published • 2
One-Step Diffusion Distillation through Score Implicit Matching
Paper • 2410.16794 • Published • 1
Superposed Decoding: Multiple Generations from a Single Autoregressive Inference Pass
Paper • 2405.18400 • Published • 1
Lightweight Neural App Control
Paper • 2410.17883 • Published • 8
Literature Meets Data: A Synergistic Approach to Hypothesis Generation
Paper • 2410.17309 • Published • 1
Leveraging Skills from Unlabeled Prior Data for Efficient Online Exploration
Paper • 2410.18076 • Published • 4
Note: Encodes interaction trajectories into "skill vectors" that act like abstract concepts: a skill decoder (low-level policy) translates them into specific actions based on the current state, similar to how our concepts become concrete actions in different situations. By relabeling experiences with these skills, they train a high-level policy to select optimal skills that maximize rewards. This hierarchical approach hints at the possibility of AI systems formulating and thinking in their own curated abstract concepts, as in the sketch below.
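A minimal sketch of that hierarchy (the encoder, decoder, and value estimate are toy stand-ins, not the paper's models): skills are distilled from unlabeled trajectories, a high-level policy picks among them, and the low-level decoder turns the chosen skill into a state-specific action.

```python
import numpy as np

SKILL_DIM, STATE_DIM, ACTION_DIM = 4, 8, 2

def encode_skill(trajectory: np.ndarray) -> np.ndarray:
    """Compress a (T, STATE_DIM) trajectory into an abstract skill vector."""
    return np.tanh(trajectory.mean(axis=0)[:SKILL_DIM])

def skill_decoder(state: np.ndarray, z: np.ndarray) -> np.ndarray:
    """Low-level policy: the same skill yields different actions per state."""
    w = np.outer(z, state).sum(axis=0)       # toy state conditioning
    return np.tanh(w[:ACTION_DIM])

def high_level_policy(state: np.ndarray, skills: list) -> np.ndarray:
    """Pick the skill with the highest (toy) value estimate for this state."""
    return max(skills, key=lambda z: float(state[:SKILL_DIM] @ z))

# Skills distilled from unlabeled prior trajectories, then selected online.
skills = [encode_skill(np.random.randn(10, STATE_DIM)) for _ in range(3)]
state = np.random.randn(STATE_DIM)
z = high_level_policy(state, skills)
print(skill_decoder(state, z))               # concrete action for this state
```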
ROCKET-1: Master Open-World Interaction with Visual-Temporal Context Prompting
Paper • 2410.17856 • Published • 48
Non-myopic Generation of Language Model for Reasoning and Planning
Paper • 2410.17195 • Published • 1
LongVU: Spatiotemporal Adaptive Compression for Long Video-Language Understanding
Paper • 2410.17434 • Published • 24
Unbounded: A Generative Infinite Game of Character Life Simulation
Paper • 2410.18975 • Published • 34
ToolGen: Unified Tool Retrieval and Calling via Generation
Paper • 2410.03439 • Published • 1
Accelerating Exploration with Unlabeled Prior Data
Paper • 2311.05067 • Published • 1
Note: Random network distillation as an extra reward to encourage exploration in RL.
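Random network distillation is simple enough to sketch directly (the network sizes here are arbitrary): a frozen random "target" network embeds each state, a trained "predictor" chases it, and the prediction error, large on novel states and shrinking with familiarity, is added to the reward as an exploration bonus.

```python
import torch
import torch.nn as nn

STATE_DIM, EMBED_DIM = 8, 16
target = nn.Sequential(nn.Linear(STATE_DIM, 64), nn.ReLU(), nn.Linear(64, EMBED_DIM))
predictor = nn.Sequential(nn.Linear(STATE_DIM, 64), nn.ReLU(), nn.Linear(64, EMBED_DIM))
for p in target.parameters():
    p.requires_grad_(False)          # target stays random and frozen

opt = torch.optim.Adam(predictor.parameters(), lr=1e-3)

def intrinsic_reward(state: torch.Tensor) -> float:
    """Novelty bonus: predictor error against the frozen random target."""
    err = (predictor(state) - target(state)).pow(2).mean()
    opt.zero_grad(); err.backward(); opt.step()   # train predictor online
    return float(err.detach())

s = torch.randn(STATE_DIM)
print(intrinsic_reward(s))   # large at first ...
for _ in range(200):
    intrinsic_reward(s)
print(intrinsic_reward(s))   # ... shrinks as the state becomes familiar
```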
Efficient Online Reinforcement Learning with Offline Data
Paper • 2302.02948 • Published • 2
Note: Reusing previous experience to increase RL learning efficiency.
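A minimal sketch of one common way to do this, mixing each training batch between a fixed offline dataset and the growing online replay buffer (the 50/50 split and the transition format are illustrative assumptions, not necessarily the paper's exact recipe):

```python
import random

offline_data = [("s_off", "a", 1.0, "s2")] * 1000   # fixed prior dataset
online_buffer = []                                  # filled during interaction

def sample_batch(batch_size: int = 8):
    """Draw half of each gradient batch from offline data, half online."""
    half = batch_size // 2
    batch = random.choices(offline_data, k=half)
    if online_buffer:
        batch += random.choices(online_buffer, k=batch_size - half)
    return batch

online_buffer.append(("s_on", "a", 0.0, "s3"))
print(sample_batch())   # mixed offline/online transitions for the update
```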
Scaling Diffusion Language Models via Adaptation from Autoregressive Models
Paper • 2410.17891 • Published • 15
Diffusion for World Modeling: Visual Details Matter in Atari
Paper • 2405.12399 • Published • 27
Active-Dormant Attention Heads: Mechanistically Demystifying Extreme-Token Phenomena in LLMs
Paper • 2410.13835 • Published • 1
PyramidDrop: Accelerating Your Large Vision-Language Models via Pyramid Visual Redundancy Reduction
Paper • 2410.17247 • Published • 43
HART: Efficient Visual Generation with Hybrid Autoregressive Transformer
Paper • 2410.10812 • Published • 14
MCSD: An Efficient Language Model with Diverse Fusion
Paper • 2406.12230 • Published • 1
The Scene Language: Representing Scenes with Programs, Words, and Embeddings
Paper • 2410.16770 • Published • 1
Pyramidal Flow Matching for Efficient Video Generative Modeling
Paper • 2410.05954 • Published • 37
Energy-Based Diffusion Language Models for Text Generation
Paper • 2410.21357 • Published • 1
iVideoGPT: Interactive VideoGPTs are Scalable World Models
Paper • 2405.15223 • Published • 12
nGPT: Normalized Transformer with Representation Learning on the Hypersphere
Paper • 2410.01131 • Published • 8
OS-ATLAS: A Foundation Action Model for Generalist GUI Agents
Paper • 2410.23218 • Published • 43
Inference Optimal VLMs Need Only One Visual Token but Larger Models
Paper • 2411.03312 • Published • 5
DroidSpeak: Enhancing Cross-LLM Communication
Paper • 2411.02820 • Published • 1
Wave Network: An Ultra-Small Language Model
Paper • 2411.02674 • Published • 3
Thinking Forward and Backward: Effective Backward Planning with Large Language Models
Paper • 2411.01790 • Published • 1
Adaptive Length Image Tokenization via Recurrent Allocation
Paper • 2411.02393 • Published • 11
Note: Uses a fixed set of tokens to encode the image, adding new tokens recursively until reaching a satisfactory compression level.
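A minimal sketch of the allocation loop (the encoder, error metric, and threshold are toy stand-ins, not the paper's models): keep growing the token budget and re-encoding until reconstruction is good enough, so easy images end up with few tokens and hard ones with more.

```python
import numpy as np

def encode(image: np.ndarray, n_tokens: int) -> np.ndarray:
    """Stand-in encoder: keep the n largest-magnitude coefficients."""
    flat = image.flatten()
    idx = np.argsort(-np.abs(flat))[:n_tokens]
    tokens = np.zeros_like(flat)
    tokens[idx] = flat[idx]
    return tokens

def reconstruction_error(image: np.ndarray, tokens: np.ndarray) -> float:
    return float(np.abs(image.flatten() - tokens).mean())

image = np.random.randn(8, 8)
n_tokens, threshold = 4, 0.05
while reconstruction_error(image, encode(image, n_tokens)) > threshold:
    n_tokens += 4                      # recurrent step: grow the budget
print(f"allocated {n_tokens} tokens")
```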
Improving Steering Vectors by Targeting Sparse Autoencoder Features
Paper • 2411.02193 • Published • 1
How Far is Video Generation from World Model: A Physical Law Perspective
Paper • 2411.02385 • Published • 27
Tool Learning with Foundation Models
Paper • 2304.08354 • Published • 2